Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Indent Support in to_json #28130

Merged
merged 48 commits into from
Sep 18, 2019
Merged

Add Indent Support in to_json #28130

merged 48 commits into from
Sep 18, 2019

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Aug 24, 2019

After some of the recent refactor this should be relatively easy to do after vendoring a few updates

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Aug 24, 2019
@WillAyd WillAyd added this to the 1.0 milestone Aug 24, 2019
pandas/core/generic.py Outdated Show resolved Hide resolved
@@ -2249,6 +2249,7 @@ def to_json(
lines=False,
compression="infer",
index=True,
indent=0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add type hints when modifying code?

Copy link
Member Author

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK type hints added where feasible, comments where I think not

double_precision: int = 10,
force_ascii: bool_t = True,
date_unit: str = "ms",
default_handler: Optional[Callable[[Any], Union[int, str, List, Dict]]] = None,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return for the annotation here should match the Serializable alias in the _json file. Not sure if better to put in a shared place like _typing or duplicate here; open to suggestions

@@ -31,20 +32,23 @@

TABLE_SCHEMA_VERSION = "0.20.0"

Serializable = Union[int, str, List, Dict]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to go crazy could subscript List and Dict recursively and with int or str, but I don't think mypy yet supports recursive types. Figure this is a good enough tradeoff

@@ -31,20 +32,23 @@

TABLE_SCHEMA_VERSION = "0.20.0"

Serializable = Union[int, str, List, Dict]


# interface to/from
def to_json(
path_or_buf,
obj,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think obj should be FrameOrSeries, but since it gets passed too pandas.core.generic I was getting errors like

pandas/core/generic.py:2401: error: Value of type variable "FrameOrSeries" of "to_json" cannot be "NDFrame"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think FrameOrSeries in _typing should be changed to FrameOrSeries = TypeVar("FrameOrSeries", bound="NDFrame") as used in #27646

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea - I think worth exploring as a separate PR for sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look at this after merging master and still doesn't work because the SeriesWriter _write method seems to re-use this variable for a dict. I'm not sure if that's needed so should be able to refactor but probably better as a follow up to not cram too much in here

):
self.obj = obj

if orient is None:
orient = self._default_orient
orient = self._default_orient # type: ignore
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orient is probably better as an abstract property but I figure best to leave that refactor to a follow up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment?

although adding an abstract property is straightforward, if you add if TYPE-CHECKING it won't affect runtime behaviour.

if TYPE_CHECKING:
   _default_orient = None  # type: <...>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's cool I'd prefer to do this as a quick follow up. I find a lot of TYPE_CHECKING blocks to be distracting so like to avoid where possible. It won't be difficult just want to minimize diff in extension change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem.

I don't think we should avoid these blocks as too beneficial.

in a (hopefully not too distant) future PR, i'll be adding...

    if TYPE_CHECKING:  # attributes index and columns are created dynamically
        index = None  # type: Index
        columns = None  # type: Index

to frame.py so that i can get types for self.fmt.tr_frame.index.format and self.frame.index.nlevels etc.

in this case all tests pass without the if TYPE_CHECKING but would rather not change runtime behaviour if avoidable.

@@ -31,20 +32,23 @@

TABLE_SCHEMA_VERSION = "0.20.0"

Serializable = Union[int, str, List, Dict]


# interface to/from
def to_json(
path_or_buf,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be FilePathOrBuffer but was failing with

pandas/io/json/_json.py:97: error: Item "Path" of "Union[Path, IO[Any]]" has no attribute "write"

I think this is actually a problem with the return of _stringify_path since it doesn't indicate that Path objects get converted to strings. Can open an issue as a follow up but @simonjayhawkins may already be working on that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, should already be fixed with overloads in #27732.

@simonjayhawkins
Copy link
Member

@WillAyd can you merge master.

@WillAyd
Copy link
Member Author

WillAyd commented Aug 27, 2019

Actually just realized that the simplejson implementation prefers indent to be a string rather than an int. Let me see if I can work that in here

https://github.com/simplejson/simplejson/blob/5fccfc1d88cc70455d013fbf13f9a49a19cd7929/simplejson/__init__.py#L328

@WillAyd
Copy link
Member Author

WillAyd commented Aug 27, 2019

Nevermind ignore previous comment - mixed up my JSON libraries this morning. ujson only supports an int

@WillAyd
Copy link
Member Author

WillAyd commented Aug 27, 2019

OK I think this is ready. If it helps for review here is the reference implementation that I've vendored into our ujson copy:

https://github.com/esnme/ultrajson/pull/163/files

@WillAyd
Copy link
Member Author

WillAyd commented Aug 30, 2019

For completeness here are benchmarks showing no change:

· Running 22 total benchmarks (2 commits * 1 environments * 11 benchmarks)
[  0.00%] · For pandas commit 51db82d9 <master> (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...............................................
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  6.82%] ··· Running (io.json.ToJSON.time_to_json--).
[  9.09%] ··· Running (io.json.ToJSON.time_to_json_wide--).
[ 11.36%] ··· Running (io.json.ToJSONLines.time_delta_int_tstamp_lines--).....
[ 25.00%] · For pandas commit 638d0552 <json-indent2> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 31.82%] ··· Running (io.json.ToJSON.time_to_json--).
[ 34.09%] ··· Running (io.json.ToJSON.time_to_json_wide--).
[ 36.36%] ··· Running (io.json.ToJSONLines.time_delta_int_tstamp_lines--).....
[ 50.00%] · For pandas commit 638d0552 <json-indent2> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 52.27%] ··· io.json.ToJSON.mem_to_json                                                                                    ok
[ 52.27%] ··· ========= ==== ============= ============== =============== ==================
              --                                       frame
              --------- --------------------------------------------------------------------
                orient   df   df_date_idx   df_td_int_ts   df_int_floats   df_int_float_str
              ========= ==== ============= ============== =============== ==================
                split    0         0             0               0                0
               columns   0         0             0               0                0
                index    0         0             0               0                0
                values   0         0             0               0                0
               records   0         0             0               0                0
              ========= ==== ============= ============== =============== ==================

[ 54.55%] ··· io.json.ToJSON.mem_to_json_wide                                                                               ok
[ 54.55%] ··· ========= ==== ============= ============== =============== ==================
              --                                       frame
              --------- --------------------------------------------------------------------
                orient   df   df_date_idx   df_td_int_ts   df_int_floats   df_int_float_str
              ========= ==== ============= ============== =============== ==================
                split    0         0             0               0                0
               columns   0         0             0               0                0
                index    0         0             0               0                0
                values   0         0             0               0                0
               records   0         0             0               0                0
              ========= ==== ============= ============== =============== ==================

[ 56.82%] ··· io.json.ToJSON.time_to_json                                                                                   ok
[ 56.82%] ··· ========= ========== ============= ============== =============== ==================
              --                                          frame
              --------- --------------------------------------------------------------------------
                orient      df      df_date_idx   df_td_int_ts   df_int_floats   df_int_float_str
              ========= ========== ============= ============== =============== ==================
                split    93.9±5ms     94.2±9ms      98.5±6ms        102±5ms          102±20ms
               columns   95.6±2ms     308±20ms      352±40ms        321±60ms         313±40ms
                index    121±3ms      319±10ms      318±20ms        318±5ms          316±3ms
                values   78.3±3ms     78.6±1ms      79.3±1ms        88.4±1ms         87.5±1ms
               records   89.1±6ms     92.0±2ms      109±2ms         117±2ms          116±3ms
              ========= ========== ============= ============== =============== ==================

[ 59.09%] ··· io.json.ToJSON.time_to_json_wide                                                                              ok
[ 59.09%] ··· ========= ========== ============= ============== =============== ==================
              --                                          frame
              --------- --------------------------------------------------------------------------
                orient      df      df_date_idx   df_td_int_ts   df_int_floats   df_int_float_str
              ========= ========== ============= ============== =============== ==================
                split    110±1ms      110±2ms       168±4ms         154±6ms          181±20ms
               columns   123±1ms      139±2ms       190±2ms         176±2ms          200±3ms
                index    125±3ms      125±2ms      188±0.9ms        168±4ms          207±7ms
                values   109±9ms      112±7ms       170±10ms        149±10ms         191±20ms
               records   139±10ms     137±10ms      199±10ms        171±20ms         195±6ms
              ========= ========== ============= ============== =============== ==================

[ 61.36%] ··· io.json.ToJSONLines.time_delta_int_tstamp_lines                                                          147±3ms
[ 63.64%] ··· io.json.ToJSONLines.time_float_int_lines                                                                 163±4ms
[ 65.91%] ··· io.json.ToJSONLines.time_float_int_str_lines                                                            168±10ms
[ 68.18%] ··· io.json.ToJSONLines.time_floats_with_dt_index_lines                                                      131±7ms
[ 70.45%] ··· io.json.ToJSONLines.time_floats_with_int_idex_lines                                                      126±3ms
[ 72.73%] ··· Setting up io/json.py:200                                                                                     ok
[ 72.73%] ··· io.json.ToJSONMem.peakmem_float                                                                            62.3M
[ 75.00%] ··· io.json.ToJSONMem.peakmem_int                                                                              62.5M.
[ 75.00%] · For pandas commit 51db82d9 <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 77.27%] ··· io.json.ToJSON.mem_to_json                                                                                    ok
[ 77.27%] ··· ========= ==== ============= ============== =============== ==================
              --                                       frame
              --------- --------------------------------------------------------------------
                orient   df   df_date_idx   df_td_int_ts   df_int_floats   df_int_float_str
              ========= ==== ============= ============== =============== ==================
                split    0         0             0               0                0
               columns   0         0             0               0                0
                index    0         0             0               0                0
                values   0         0             0               0                0
               records   0         0             0               0                0
              ========= ==== ============= ============== =============== ==================

[ 79.55%] ··· io.json.ToJSON.mem_to_json_wide                                                                               ok
[ 79.55%] ··· ========= ==== ============= ============== =============== ==================
              --                                       frame
              --------- --------------------------------------------------------------------
                orient   df   df_date_idx   df_td_int_ts   df_int_floats   df_int_float_str
              ========= ==== ============= ============== =============== ==================
                split    0         0             0               0                0
               columns   0         0             0               0                0
                index    0         0             0               0                0
                values   0         0             0               0                0
               records   0         0             0               0                0
              ========= ==== ============= ============== =============== ==================

[ 81.82%] ··· io.json.ToJSON.time_to_json                                                                                   ok
[ 81.82%] ··· ========= =========== ============= ============== =============== ==================
              --                                           frame
              --------- ---------------------------------------------------------------------------
                orient       df      df_date_idx   df_td_int_ts   df_int_floats   df_int_float_str
              ========= =========== ============= ============== =============== ==================
                split    90.9±10ms     88.8±6ms      92.2±4ms        98.5±7ms         107±2ms
               columns    107±10ms     313±20ms      298±20ms        320±7ms          324±20ms
                index    104±0.8ms     309±10ms      308±10ms        311±5ms          343±10ms
                values    75.1±1ms     76.5±2ms      80.3±3ms        122±20ms         85.3±6ms
               records   95.0±10ms    92.1±10ms      123±10ms        125±7ms          141±30ms
              ========= =========== ============= ============== =============== ==================

[ 84.09%] ··· io.json.ToJSON.time_to_json_wide                                                                              ok
[ 84.09%] ··· ========= =========== ============= ============== =============== ==================
              --                                           frame
              --------- ---------------------------------------------------------------------------
                orient       df      df_date_idx   df_td_int_ts   df_int_floats   df_int_float_str
              ========= =========== ============= ============== =============== ==================
                split     112±6ms      111±5ms       214±30ms        184±10ms         189±30ms
               columns    126±3ms      140±2ms       205±10ms        179±4ms          211±20ms
                index     125±2ms      123±2ms       186±7ms         169±5ms          215±10ms
                values    113±10ms     107±2ms       163±6ms         146±1ms          174±3ms
               records   122±0.7ms     135±10ms      239±50ms        167±3ms          207±20ms
              ========= =========== ============= ============== =============== ==================

[ 86.36%] ··· io.json.ToJSONLines.time_delta_int_tstamp_lines                                                          147±2ms
[ 88.64%] ··· io.json.ToJSONLines.time_float_int_lines                                                                 164±4ms
[ 90.91%] ··· io.json.ToJSONLines.time_float_int_str_lines                                                             171±5ms
[ 93.18%] ··· io.json.ToJSONLines.time_floats_with_dt_index_lines                                                      123±6ms
[ 95.45%] ··· io.json.ToJSONLines.time_floats_with_int_idex_lines                                                     149±20ms
[ 97.73%] ··· Setting up io/json.py:200                                                                                     ok
[ 97.73%] ··· io.json.ToJSONMem.peakmem_float                                                                            62.3M
[100.00%] ··· io.json.ToJSONMem.peakmem_int                                                                              62.3M
.
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Anyone have thoughts on this?

@WillAyd
Copy link
Member Author

WillAyd commented Sep 6, 2019

Validation added and changed the default to indent=None, with a documentation note on index=0 behavior and how it differs from stdlib

@@ -234,7 +236,11 @@ def test_build_series(self):
),
]
)
assert result == expected

if PY35:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this here

@WillAyd
Copy link
Member Author

WillAyd commented Sep 16, 2019

Ping on this one - I think good to go

pandas/core/generic.py Outdated Show resolved Hide resolved
@WillAyd
Copy link
Member Author

WillAyd commented Sep 16, 2019

Thanks for the feedback @gfyoung . If no other comments I plan to merge this on Wednesday

@WillAyd WillAyd merged commit c94eaee into pandas-dev:master Sep 18, 2019
@WillAyd WillAyd deleted the json-indent2 branch September 18, 2019 15:10
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
@botzill
Copy link

botzill commented Jan 28, 2020

Hi.

Is indent feature still in pending?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add indent support to to_json method
6 participants