Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: fix error caused by deprecation warning in pd.merge when joining two Series #47946

Merged
merged 7 commits into from
Aug 19, 2022

Conversation

nsarang
Copy link
Contributor

@nsarang nsarang commented Aug 3, 2022

There is a small bug/typo in the warning of message of pd.merge when merging two Series of different levels which results in an exception.

Thrown error message:
AttributeError: 'Series' object has no attribute 'columns'

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you post the current warning?

@nsarang
Copy link
Contributor Author

nsarang commented Aug 3, 2022

@phofl There's no warning. An exception occurs while processing the warning, and I've included the message in my original post.

The reason for the exception is that left and right might be a Series and they don't have the columns attribute. While _left and _right are the converted DataFrames.

@phofl
Copy link
Member

phofl commented Aug 3, 2022

Sorry, did not understand you then.

could you please add a test?

@phofl
Copy link
Member

phofl commented Aug 3, 2022

And a whatsnew in 1.4.4, this is a regression

@phofl phofl added Regression Functionality that used to work in a prior pandas version Warnings Warnings that appear or should be added to pandas labels Aug 3, 2022
@phofl phofl added this to the 1.4.4 milestone Aug 3, 2022
@nsarang
Copy link
Contributor Author

nsarang commented Aug 3, 2022

@phofl No worries. I'll add one in a few hours.

@jorisvandenbossche
Copy link
Member

@nsarang would you have time to add a test ?

@nsarang
Copy link
Contributor Author

nsarang commented Aug 12, 2022

@jorisvandenbossche Forgot about this one!
I've added the test now. The explanation for the test is that using to_frame on a Series with a multi-level name will result in a multi-level column DataFrame (to_frame is used in pd.merge._validate_operand). So it'll create the right condition where the levels of left and right in pd.merge are not equal.

@@ -2167,7 +2169,7 @@ def test_merge_series(on, left_on, right_on, left_index, right_index, nm):
name=nm,
)
expected = DataFrame(
{"A": [2, 4], "B": [1, 3]},
{"A": [2, 4], nm: [1, 3]},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a dedicated test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add one

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, looks good now. Could you add a whatsnew to v1.4.4,rst?

@nsarang
Copy link
Contributor Author

nsarang commented Aug 14, 2022

Sure

@nsarang
Copy link
Contributor Author

nsarang commented Aug 14, 2022

@phofl I can't reference the PR in the release notes since its path is not added to sphinx.ext.extlinks. Do you want me to add it?

@phofl
Copy link
Member

phofl commented Aug 14, 2022

It is sufficient if you add something like Fixed regression in :func:`merge` when...

to https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v1.4.4.rst

You don't have to do anything else. Maybe you did not merge main in a while?

@nsarang
Copy link
Contributor Author

nsarang commented Aug 14, 2022

Fine by me. Just wanted to make sure it's okay that the release note is not having a reference.

Maybe you did not merge main in a while?

I don't follow. Can you explain?

@jorisvandenbossche
Copy link
Member

I pushed a small patch to ensure we catch the warning in the test (ideally, our tests don't cause warnings from pandas, if a warning is expected, it should be either asserted or explicitly ignored).

However, doing that also made we wonder why we are actually raising a deprecation warning here. The case from the test:

In [3]: merge(a, b, on=["outer", "inner"])
<ipython-input-3-4caf266e26f3>:1: FutureWarning: merging between different levels is deprecated and will be removed in a future version. (1 levels on the left, 2 on the right)
  merge(a, b, on=["outer", "inner"])
Out[3]: 
             A  (B, C)
outer inner           
a     1      2       1
b     1      4       3

In [4]: a
Out[4]: 
             A
outer inner   
a     0      1
      1      2
b     0      3
      1      4

In [5]: b
Out[5]: 
outer  inner
a      1        1
       2        2
b      1        3
       2        4
Name: (B, C), dtype: int64

So the original a and b don't have different number of levels in the columns. It's only because the Series is converted to a DataFrame that pandas itself introduces that. It's of course quite a corner case ..

@jorisvandenbossche jorisvandenbossche changed the title Fix bug in pd.merge when joining two Series REGR: fix error caused by deprecation warning in pd.merge when joining two Series Aug 19, 2022
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of my previous comment, this PR certainly already avoids the error, so this looks good to merge. Thanks @nsarang!

@phofl
Copy link
Member

phofl commented Aug 19, 2022

@jorisvandenbossche

We wanted to get rid of value dependent behaviour. Your op looks "correct", but if "a" also has a midx, then the tuple name is converted into a MultiIndex itself:

a = DataFrame(
    {("A", "B"): 1},
    index=MultiIndex.from_product([["a"], [0, 1]], names=["o", "i"]),
)
b = Series(
    1,
    index=MultiIndex.from_product([["a"], [1, 2]], names=["o", "i"]),
    name=("B", "C"),
)
pd.merge(b, a, on=["o", "i"])

And the more complicated case:

a = DataFrame(
    {("A", "B"): 1},
    index=MultiIndex.from_product([["a"], [0, 1]], names=["o", "i"]),
)
b = Series(
    1,
    index=MultiIndex.from_product([["a"], [1, 2]], names=["o", "i"]),
    name=("B", "C", "D"),
)
pd.merge(a, b, on=["o", "i"])

raises: ValueError: Length of names must match number of levels in MultiIndex.

If the DataFrame has more levels than our tuple name has elements, we flatten the DataFrame MultiIndex columns to a tuple too:

a = DataFrame(
    {("A", "B", "C"): 1},
    index=MultiIndex.from_product([["a"], [0, 1]], names=["o", "i"]),
)
b = Series(
    1,
    index=MultiIndex.from_product([["a"], [1, 2]], names=["o", "i"]),
    name=("B", "C"),
)
pd.merge(a, b, on=["o", "i"])
     (A, B, C)  (B, C)
o i                   
a 1          1       1

@phofl phofl merged commit 236cedf into pandas-dev:main Aug 19, 2022
@phofl
Copy link
Member

phofl commented Aug 19, 2022

thx @nsarang

@lumberbot-app

This comment was marked as resolved.

@@ -19,6 +19,7 @@ Fixed regressions
- Fixed regression in :meth:`DataFrame.loc` not updating the cache correctly after values were set (:issue:`47867`)
- Fixed regression in :meth:`DataFrame.loc` not aligning index in some cases when setting a :class:`DataFrame` (:issue:`47578`)
- Fixed regression in setting ``None`` or non-string value into a ``string``-dtype Series using a mask (:issue:`47628`)
- Fixed regression in :func:`merge` throwing an error when passing a :class:`Series` with a multi-level name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is no issue then the PR number should be added as the issue instead

(imo that should never happen though as an issue should always be raised to document the issue. there are many reasons to do this, help users find it, avoid duplicate issues, allow proper triage and give others an opportunity for discussion, and keep these out of the PR discussion as appears to have happen here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also sometimes PRs go stale so the discussion in those cases is effectively lost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks for pointing out Simon!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the sentiment. The only reason that I didn't create an issue in the first place was that I thought the bug happened because of a typo and that this PR should be straightforward. Aside from this, I want to point out two things:

  1. Now that you've decided to reference the PR instead of the issue, you may want to add the base path of Github PRs to sphinx.ext.extlinks. From what I'm seeing, only the base path of issues is provided in the configuration file.
  2. FYI, I ran into this bug when I was applying a complex function to a Series after a multi-level GroupBy.

Copy link
Member

@jorisvandenbossche jorisvandenbossche Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Now that you've decided to reference the PR instead of the issue, you may want to add the base path of Github PRs to sphinx.ext.extlinks. From what I'm seeing, only the base path of issues is provided in the configuration file.

GitHub itself handles those links fine. For example the link for this PR (#47946), also works (redirects) if used with "issues": #47946

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh I see. Thanks @jorisvandenbossche !

phofl added a commit to phofl/pandas that referenced this pull request Aug 19, 2022
…g two Series (pandas-dev#47946)

* Fix bug in warning

* add test

* add note to whatsnew

* Add gh ref

* catch warning in the test

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@simonjayhawkins
Copy link
Member

And a whatsnew in 1.4.4, this is a regression

and to be clear, this is a regression from 1.0.5?

@phofl
Copy link
Member

phofl commented Aug 19, 2022

I've added the deprecation in the 1.3 series, which causes the error

@simonjayhawkins
Copy link
Member

the code sample (test) was raising an exception before that

~/miniconda3/envs/pandas-1.1.5/lib/python3.9/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    639                 "merging between different levels can give an unintended "
    640                 f"result ({left.columns.nlevels} levels on the left,"
--> 641                 f"{right.columns.nlevels} on the right)"
    642             )
    643             warnings.warn(msg, UserWarning)

~/miniconda3/envs/pandas-1.1.5/lib/python3.9/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5139             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5140                 return self[name]
-> 5141             return object.__getattribute__(self, name)
   5142 
   5143     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'columns'

@simonjayhawkins simonjayhawkins added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Aug 19, 2022
@phofl
Copy link
Member

phofl commented Aug 19, 2022

Ok this is weird. I remember testing this and I thought I installed 1.2.5 for testing purposes. But can not reproduce now. Should we avoid backporting then?

@simonjayhawkins
Copy link
Member

the fix seems low risk so (imo) ok with the backport.

however, if there is an outstanding issue about whether we should be raising a warning here #47946 (comment), then should open an issue about that to ensure it is not forgotten. Whether any further changes get backported would be part of that discussion.

@phofl
Copy link
Member

phofl commented Aug 19, 2022

Ok, got it.

I think the warning is oldish, the current behaviour is definitely buggy (#47946 (comment)), so in principal I am ok with he warning, but we could also avoid treating tuples as MultiIndexes. Disallowing this would avoid any potential value dependent bugs

simonjayhawkins pushed a commit that referenced this pull request Aug 19, 2022
…pd.merge when joining two Series (#48154)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Nima Sarang <nimasarang@gmail.com>
@jorisvandenbossche
Copy link
Member

Yes, indeed with the current behaviour in other cases, the warning seems correct.
It's maybe worth a more general question whether we want to continue interpreting a tuple label as MultiIndex (but if you select a Series from a DataFrame with MI columns, you also get such tuple label, so it makes some sense to do that)

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
…g two Series (pandas-dev#47946)

* Fix bug in warning

* add test

* add note to whatsnew

* Add gh ref

* catch warning in the test

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode Warnings Warnings that appear or should be added to pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants