
Rename duplicate column names in read_json(orient='split') #50370

Merged (16 commits) on Jan 9, 2023

Conversation

datapythonista (Member) commented Dec 21, 2022

All our read methods either drop duplicated columns (read_xml and read_json(orient='records')) or rename them with the pattern x, x.1, x.2... (read_csv, read_fwf, read_excel, read_html). The only exception where the built DataFrame contains duplicated columns is read_json(orient='split').

This PR makes read_json(orient='split') consistent with the other methods by renaming the duplicate columns in the same way.
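As an illustrative sketch (not the actual pandas `dedup_names` implementation), the x, x.1, x.2 renaming pattern can be written as:

```python
def dedup(names):
    """Rename duplicate names as x, x.1, x.2, ...
    Illustrative sketch only, not pandas' dedup_names."""
    counts = {}
    result = []
    for name in names:
        n = counts.get(name, 0)
        counts[name] = n + 1
        result.append(name if n == 0 else f"{name}.{n}")
    return result

print(dedup(["x", "x", "y", "x"]))  # ['x', 'x.1', 'y', 'x.2']
```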

Follow ups:

CC: @MarcoGorelli

@@ -1245,6 +1246,9 @@ def _parse(self) -> None:
str(k): v
for k, v in loads(json, precise_float=self.precise_float).items()
}
Contributor:

need tests for this!

Member Author:

The test I updated already tests this. There was a test specific to duplicate columns that is parametrized for orient='split' and checks the column names. Am I missing something?

Member Author:

@jreback do you mind having another look, and letting me know if the added test and my previous comment address your concern? This is ready to be merged pending your review. Thanks!

MarcoGorelli (Member) left a comment:

+1 for this change - I guess the

self._dedup_names(

still needs renaming?

Comment on lines 1249 to 1252
decoded["columns"] = dedup_names(
names=decoded["columns"], is_potential_multi_index=False
)
self.check_keys_split(decoded)
Member:

decoded might not have 'columns'; do you need to do the self.check_keys_split(decoded) call before the dedup_names one?
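A minimal sketch of the ordering concern (names are illustrative, not pandas internals): validating the payload first gives a clear error instead of a KeyError inside the dedup step.

```python
def parse_split(decoded):
    # Validate required keys first (the check_keys_split step), so a
    # malformed payload fails with a clear message instead of a KeyError
    # when decoded["columns"] is accessed below.
    missing = {"columns", "index", "data"} - decoded.keys()
    if missing:
        raise ValueError(f"JSON is missing required keys: {sorted(missing)}")
    # Only then dedup the column names (x, x.1, x.2, ...).
    counts = {}
    deduped = []
    for name in decoded["columns"]:
        n = counts.get(name, 0)
        counts[name] = n + 1
        deduped.append(name if n == 0 else f"{name}.{n}")
    decoded["columns"] = deduped
    return decoded
```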

@@ -117,6 +117,7 @@ def test_frame_non_unique_columns(self, orient, data):
expected.iloc[:, 0] = expected.iloc[:, 0].view(np.int64) // 1000000
elif orient == "split":
expected = df
expected.columns = ["x", "x.1"]
Member:

should there also be a test for when the columns are a multiindex, the reader is read_json, and orient='split'?

Member Author:

Good point. I'm not fully sure whether read_json(orient='split') can ever generate a multiindex. What would the JSON look like?

I agree dedup_names can probably be better tested, and we should make sure the multiindex cases are well tested and work as expected. In my opinion that is better addressed in #50371, which should consolidate the two different versions of dedup_names and make sure they work for all use cases if the implementations are not exactly the same. For this PR, multiindexing is irrelevant if I'm not missing something: the function is not modified here, only moved from a method to a function, so I don't think this PR can introduce any bug in it.

Member:

here's an example which passes on 1.5.2, but which fails here:

import pandas as pd
from pandas import DataFrame, read_json

data = [["a", "b"], ["c", "d"]]
df = DataFrame(data, index=[1, 2], columns=pd.MultiIndex.from_arrays([["x", "x"], ['a', 'x']]))
read_json(df.to_json(orient='split'), orient='split')

Member Author:

Thanks @MarcoGorelli, this is very helpful. It looks like the JSON IO with orient='split' does try to support multiindexes, but the implementation in main seems buggy:

>>> df = DataFrame(data, index=[1, 2], columns=pd.MultiIndex.from_arrays([["2022", "2022"], ['JAN', 'FEB']]))
>>> df
  2022    
   JAN FEB
1    a   b
2    c   d
>>> print(df.to_json(orient='split'))
{"columns":[["2022","JAN"],["2022","FEB"]],"index":[1,2],"data":[["a","b"],["c","d"]]}
>>> read_json(df.to_json(orient='split'), orient='split')
  2022 JAN
  2022 FEB
1    a   b
2    c   d

I opened #50456 to address the problem, fixed the multiindex use case, and added a test, but xfailed it for now, so the current bug can be fixed in a separate PR.

MarcoGorelli (Member) left a comment:

Generally looks good! Just got a comment on the orig_names = ... statement

Comment on lines +1251 to +1254
orig_names = [
(tuple(col) if isinstance(col, list) else col)
for col in decoded["columns"]
]
Member:

currently, this isn't necessary, because the test is xfailed, so it will pass even if there's an exception thrown (which I assume is what this line was meant to address?)

I'm not too keen on adding a line of code which doesn't have any effect on the test suite (someone else might accidentally remove it as part of a refactor, and the test suite would still pass), so how about either:

  • remove this, and just let the code raise: as your example shows, multiindex with orient='split' wasn't supported to begin with, so it's OK for it to throw an error;
  • OR keep this, but change the test to check that the result value has tuples as column names, with a comment saying that the return type should be changed (instead of xfailing)

Member Author:

Good point. It's a bit more complex: this line does have an effect, and removing it will make other tests fail. A multiindex is represented internally in some places as a list of tuples, but JSON doesn't have the concept of tuples, so in the JSON it's represented as a list of lists. Passing a list of lists to dedup_names and the DataFrame constructor is interpreted as a normal index with lists as values, which is a non-hashable type, and it raises a TypeError. Converting them to tuples turns the values into the multiindex, and makes the JSON be loaded with a multiindex.
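The hashability point can be seen with plain Python, no pandas needed:

```python
# JSON has no tuples, so multiindex labels arrive as a list of lists.
json_columns = [["2022", "JAN"], ["2022", "FEB"]]

# Lists are unhashable, so they can't serve as index labels:
try:
    {json_columns[0]: "value"}
except TypeError as exc:
    print(exc)  # unhashable type: 'list'

# Converting each inner list to a tuple (what the orig_names line does)
# makes the labels hashable, so they can become multiindex entries:
columns = [tuple(c) if isinstance(c, list) else c for c in json_columns]
print(columns)  # [('2022', 'JAN'), ('2022', 'FEB')]
```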

The part still missing in this PR is that the loaded multiindex is not the same as the original one if the JSON was saved with to_json. That's what the other issue will fix, and what the xfailed test is checking.

Based on your comment, what makes sense is to be more explicit about when the test should xfail. I changed the xfail to require an AssertionError (the compared values being different). Now, if this line is removed, the test will fail: trying to create an index from a list of lists raises a TypeError, while the test requires that the code doesn't raise and returns a value, just a different one from the expected.
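The reasoning can be sketched without pytest (run_xfail is a hypothetical helper, not the pytest API): restricting the expected exception type means a different error fails the test instead of being silently absorbed.

```python
def run_xfail(func, raises):
    """Hypothetical sketch of a strict xfail: the test only counts as an
    expected failure if func raises exactly the declared exception type."""
    try:
        func()
    except raises:
        return "xfail"                          # expected failure: OK
    except Exception as exc:                    # any other error is real
        return f"error: {type(exc).__name__}"
    return "unexpected pass"

def compares_unequal():
    # Current behaviour: a value is returned but differs from the expected.
    assert [("a",)] == [("a", "b")]

def raises_typeerror():
    # Behaviour if the list-to-tuple conversion were removed:
    # a list as a key/label is unhashable.
    {["a", "b"]: 1}

print(run_xfail(compares_unequal, AssertionError))  # xfail
print(run_xfail(raises_typeerror, AssertionError))  # error: TypeError
```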

Does this make sense to you? I wouldn't implement anything more complex than this here, since fixing the other bug shouldn't be difficult. We could also merge the other PR first, but it's probably not worth it.

Member:

looks good, thanks!

@@ -496,7 +496,7 @@ Other API changes
new DataFrame (shallow copy) instead of the original DataFrame, consistent with other
methods to get a full slice (for example ``df.loc[:]`` or ``df[:]``) (:issue:`49469`)
- Disallow computing ``cumprod`` for :class:`Timedelta` object; previously this returned incorrect values (:issue:`50246`)
-
- Loading a JSON file with duplicate columns using ``read_json(orient='split')`` renames columns to avoid duplicates, as :func:`read_csv` and the other readers do (:issue:`XXX`)
Member:

PR number here? other than that looks good

Member Author:

Added, thanks!

MarcoGorelli (Member) left a comment:

Looks good to me, thanks @datapythonista

mroeschke added this to the 2.0 milestone Jan 4, 2023
mroeschke merged commit 5115f09 into pandas-dev:main Jan 9, 2023
mroeschke (Member):
Thanks @datapythonista (failures unrelated). Follow-up PRs can be submitted if needed.

@@ -434,6 +452,7 @@ def _infer_columns(
if i not in this_unnamed_cols
] + this_unnamed_cols

# TODO: Use pandas.io.common.dedup_names instead (see #50371)


@datapythonista Is this the TODO you mentioned? Just checking to be sure and avoid rework. #50371

Member Author:

Yes, correct.

Labels: IO JSON (read_json, to_json, json_normalize)