
Rename duplicate column names in read_json(orient='split') #50370

Merged (16 commits) on Jan 9, 2023

Conversation

datapythonista (Member) commented Dec 21, 2022

All our read methods either drop duplicated columns (read_xml and read_json(orient='records')) or rename them with the pattern x, x.1, x.2... (read_csv, read_fwf, read_excel, read_html). The only exception where the built DataFrame contains duplicated columns is read_json(orient='split').

This PR makes read_json(orient='split') consistent with the other methods by renaming the duplicate columns in the same way.
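As an illustrative sketch (not the actual pandas `dedup_names` implementation), the x, x.1, x.2 renaming pattern can be written as:

```python
def dedup(names):
    """Rename duplicate names as x, x.1, x.2, ...
    Illustrative sketch only, not pandas' dedup_names."""
    counts = {}
    result = []
    for name in names:
        n = counts.get(name, 0)
        counts[name] = n + 1
        result.append(name if n == 0 else f"{name}.{n}")
    return result

print(dedup(["x", "x", "y", "x"]))  # ['x', 'x.1', 'y', 'x.2']
```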

Follow ups:

CC: @MarcoGorelli

@@ -1245,6 +1246,9 @@ def _parse(self) -> None:
str(k): v
for k, v in loads(json, precise_float=self.precise_float).items()
}
Contributor:

need tests for this!

Member Author:

The test I updated already tests this. There was a test specific to duplicate columns that is parametrized for orient='split' and checks the column names. Am I missing something?

Member Author:

@jreback do you mind having another look, and letting me know if the added test and my previous comment address your concern? This is ready to be merged pending your review. Thanks!

MarcoGorelli (Member) left a comment:

+1 for this change - I guess the

self._dedup_names(

still needs renaming?

Comment on lines 1249 to 1252
decoded["columns"] = dedup_names(
names=decoded["columns"], is_potential_multi_index=False
)
self.check_keys_split(decoded)
Member:

decoded might not have 'columns'; do you need to do the self.check_keys_split(decoded) call before the dedup_names one?
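A minimal sketch of the ordering concern (names are illustrative, not pandas internals): validating the payload first gives a clear error instead of a KeyError inside the dedup step.

```python
def parse_split(decoded):
    # Validate required keys first (the check_keys_split step), so a
    # malformed payload fails with a clear message instead of a KeyError
    # when decoded["columns"] is accessed below.
    missing = {"columns", "index", "data"} - decoded.keys()
    if missing:
        raise ValueError(f"JSON is missing required keys: {sorted(missing)}")
    # Only then dedup the column names (x, x.1, x.2, ...).
    counts = {}
    deduped = []
    for name in decoded["columns"]:
        n = counts.get(name, 0)
        counts[name] = n + 1
        deduped.append(name if n == 0 else f"{name}.{n}")
    decoded["columns"] = deduped
    return decoded
```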

@@ -117,6 +117,7 @@ def test_frame_non_unique_columns(self, orient, data):
expected.iloc[:, 0] = expected.iloc[:, 0].view(np.int64) // 1000000
elif orient == "split":
expected = df
expected.columns = ["x", "x.1"]
Member:

should there also be a test for when the columns are a multiindex, the reader is read_json, and orient='split'?

Member Author:

Good point. I'm not fully sure whether read_json(orient='split') can ever generate a multiindex. What would the JSON look like?

I agree dedup_names can probably be better tested, and we should make sure the multiindex cases are well tested and work as expected. In my opinion that is better addressed in #50371, which should consolidate the two different versions of dedup_names and make sure they work for all use cases if the implementations are not exactly the same. For this PR, multiindexing is irrelevant if I'm not missing something: the function is not modified here, only moved from a method to a function, so I don't think this PR can introduce any bug in it.

Member:

here's an example which passes on 1.5.2, but which fails here:

import pandas as pd
from pandas import DataFrame, read_json

data = [["a", "b"], ["c", "d"]]
df = DataFrame(data, index=[1, 2], columns=pd.MultiIndex.from_arrays([["x", "x"], ['a', 'x']]))
read_json(df.to_json(orient='split'), orient='split')

Member Author:

Thanks @MarcoGorelli, this is very helpful. It looks like the JSON IO with orient='split' does try to support multiindexes, but the implementation in main seems buggy:

>>> df = DataFrame(data, index=[1, 2], columns=pd.MultiIndex.from_arrays([["2022", "2022"], ['JAN', 'FEB']]))
>>> df
  2022    
   JAN FEB
1    a   b
2    c   d
>>> print(df.to_json(orient='split'))
{"columns":[["2022","JAN"],["2022","FEB"]],"index":[1,2],"data":[["a","b"],["c","d"]]}
>>> read_json(df.to_json(orient='split'), orient='split')
  2022 JAN
  2022 FEB
1    a   b
2    c   d

I opened #50456 to address the problem, fixed the multiindex use case, and added a test, but xfailed it for now, so the current bug can be fixed in a separate PR.

MarcoGorelli (Member) left a comment:

Generally looks good! Just got a comment on the orig_names = ... statement

Comment on lines +1251 to +1254
orig_names = [
(tuple(col) if isinstance(col, list) else col)
for col in decoded["columns"]
]
Member:

currently, this isn't necessary, because the test is xfailed, so it will pass even if there's an exception thrown (which I assume is what this line was meant to address?)

I'm not too keen on adding a line of code which doesn't have any effect on the test suite (someone else might accidentally remove it as part of a refactor, and the test suite would still pass), so how about either:

  • remove this, and just let the code raise: as your example shows, multiindex with orient='split' wasn't supported to begin with, so it's OK for it to throw an error;
  • OR keep this, but change the test to check that the result value has tuples as column names, with a comment saying that the return type should be changed (instead of xfailing)

Member Author:

Good point. It's a bit more complex: this line does have an effect, and removing it will make other tests fail. A multiindex is represented internally in some places as a list of tuples, but JSON doesn't have the concept of tuples, so in the JSON it's represented as a list of lists. Passing a list of lists to dedup_names and the DataFrame constructor is interpreted as a normal index with lists as values, which is a non-hashable type, and it raises a TypeError. Converting them to tuples turns the values into the multiindex, and makes the JSON be loaded with a multiindex.
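The hashability point can be seen with plain Python, no pandas needed:

```python
# JSON has no tuples, so multiindex labels arrive as a list of lists.
json_columns = [["2022", "JAN"], ["2022", "FEB"]]

# Lists are unhashable, so they can't serve as index labels:
try:
    {json_columns[0]: "value"}
except TypeError as exc:
    print(exc)  # unhashable type: 'list'

# Converting each inner list to a tuple (what the orig_names line does)
# makes the labels hashable, so they can become multiindex entries:
columns = [tuple(c) if isinstance(c, list) else c for c in json_columns]
print(columns)  # [('2022', 'JAN'), ('2022', 'FEB')]
```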

The part still missing in this PR is that the loaded multiindex is not the same as the original one if the JSON was saved with to_json. That's what the other issue will fix, and what the xfailed test is checking.

Based on your comment, what makes sense is to be more explicit about when the test should xfail. I changed the xfail to require an AssertionError (the compared values being different). Now, if this line is removed, the test will fail: trying to create an index from a list of lists raises a TypeError, while the test requires that the code doesn't raise and returns a value, just a different one from the expected.
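The reasoning can be sketched without pytest (run_xfail is a hypothetical helper, not the pytest API): restricting the expected exception type means a different error fails the test instead of being silently absorbed.

```python
def run_xfail(func, raises):
    """Hypothetical sketch of a strict xfail: the test only counts as an
    expected failure if func raises exactly the declared exception type."""
    try:
        func()
    except raises:
        return "xfail"                          # expected failure: OK
    except Exception as exc:                    # any other error is real
        return f"error: {type(exc).__name__}"
    return "unexpected pass"

def compares_unequal():
    # Current behaviour: a value is returned but differs from the expected.
    assert [("a",)] == [("a", "b")]

def raises_typeerror():
    # Behaviour if the list-to-tuple conversion were removed:
    # a list as a key/label is unhashable.
    {["a", "b"]: 1}

print(run_xfail(compares_unequal, AssertionError))  # xfail
print(run_xfail(raises_typeerror, AssertionError))  # error: TypeError
```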

Does this make sense to you? I wouldn't implement anything more complex than this here, since fixing the other bug shouldn't be difficult. We could also merge the other PR first, but it's probably not worth it.

Member:

looks good, thanks!

@@ -496,7 +496,7 @@ Other API changes
new DataFrame (shallow copy) instead of the original DataFrame, consistent with other
methods to get a full slice (for example ``df.loc[:]`` or ``df[:]``) (:issue:`49469`)
- Disallow computing ``cumprod`` for :class:`Timedelta` object; previously this returned incorrect values (:issue:`50246`)
-
- Loading a JSON file with duplicate columns using ``read_json(orient='split')`` renames columns to avoid duplicates, as :func:`read_csv` and the other readers do (:issue:`XXX`)
Member:

PR number here? other than that looks good

Member Author:

Added, thanks!

MarcoGorelli (Member) left a comment:

Looks good to me, thanks @datapythonista

mroeschke added this to the 2.0 milestone Jan 4, 2023
mroeschke merged commit 5115f09 into pandas-dev:main Jan 9, 2023
mroeschke (Member):
Thanks @datapythonista (failures unrelated). Follow-up PRs can be submitted if needed.

@@ -434,6 +452,7 @@ def _infer_columns(
if i not in this_unnamed_cols
] + this_unnamed_cols

# TODO: Use pandas.io.common.dedup_names instead (see #50371)


@datapythonista Is this the TODO you mentioned? Just checking to be sure and avoid rework. #50371

Member Author:

Yes, correct.

Labels: IO JSON (read_json, to_json, json_normalize)