Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.agg with multiple cum functions creates wrong result #35490

Closed
2 of 3 tasks
qinxuye opened this issue Jul 31, 2020 · 8 comments · Fixed by #35723
Closed
2 of 3 tasks

BUG: DataFrame.agg with multiple cum functions creates wrong result #35490

qinxuye opened this issue Jul 31, 2020 · 8 comments · Fixed by #35723
Labels
Apply Apply, Aggregate, Transform Blocker Blocking issue or pull request for an upcoming release Bug Groupby Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@qinxuye
Copy link

qinxuye commented Jul 31, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# Your code here
In [1]: import pandas as pd                                                     

In [2]: df1 = pd.DataFrame({ 
   ...:     'a': [3, 4, 5, 3, 5, 4, 1, 2, 3], 
   ...:     'b': [1, 3, 4, 5, 6, 5, 4, 4, 4], 
   ...:     'c': list('aabaaddce'), 
   ...:     'd': [3, 4, 5, 3, 5, 4, 1, 2, 3], 
   ...:     'e': [1, 3, 4, 5, 6, 5, 4, 4, 4], 
   ...:     'f': list('aabaaddce'), 
   ...: })                                                                      

In [3]: df1.groupby('b').agg(['cummax', 'cumsum'])                              
Out[3]: 
       a             d             e       
  cummax cumsum cummax cumsum cummax cumsum
b                                          
1      4      4      4      4      3      3
3      3      3      3      3      5      5
4      5      5      5      5      6      6
5      4      7      4      7      5     10
6      5      6      5      6      4      8

Cumulative functions should generate the DataFrame with the same length.

Problem description

[this should explain why the current behaviour is a problem and why the expected output is a better solution]

In pandas 1.0.5, the result is

In [1]: import pandas as pd                                                     

In [2]: df1 = pd.DataFrame({ 
   ...:     'a': [3, 4, 5, 3, 5, 4, 1, 2, 3], 
   ...:     'b': [1, 3, 4, 5, 6, 5, 4, 4, 4], 
   ...:     'c': list('aabaaddce'), 
   ...:     'd': [3, 4, 5, 3, 5, 4, 1, 2, 3], 
   ...:     'e': [1, 3, 4, 5, 6, 5, 4, 4, 4], 
   ...:     'f': list('aabaaddce'), 
   ...: })                                                                      

In [3]: df1.groupby('b').agg(['cummax', 'cumsum'])                              
Out[3]: 
       a             d             e       
  cummax cumsum cummax cumsum cummax cumsum
0      3      3      3      3      1      1
1      4      4      4      4      3      3
2      5      5      5      5      4      4
3      3      3      3      3      5      5
4      5      5      5      5      6      6
5      4      7      4      7      5     10
6      5      6      5      6      4      8
7      5      8      5      8      4     12
8      5     11      5     11      4     16

Expected Output

Output of pd.show_versions()

In [5]: pd.show_versions()
/Users/qinxuye/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/setuptools/distutils_patch.py:26: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
"Distutils was imported before Setuptools. This usage is discouraged "

ImportError Traceback (most recent call last)
in
----> 1 pd.show_versions()

~/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/pandas/util/_print_versions.py in show_versions(as_json)
104 """
105 sys_info = _get_sys_info()
--> 106 deps = _get_dependency_info()
107
108 if as_json:

~/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/pandas/util/_print_versions.py in _get_dependency_info()
82 for modname in deps:
83 mod = import_optional_dependency(
---> 84 modname, raise_on_missing=False, on_version="ignore"
85 )
86 result[modname] = _get_version(mod) if mod else None

~/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
97 minimum_version = VERSIONS.get(name)
98 if minimum_version:
---> 99 version = _get_version(module)
100 if distutils.version.LooseVersion(version) < minimum_version:
101 assert on_version in {"warn", "raise", "ignore"}

~/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/pandas/compat/_optional.py in _get_version(module)
42
43 if version is None:
---> 44 raise ImportError(f"Can't determine version for {module.name}")
45 return version
46

ImportError: Can't determine version for numba

@qinxuye qinxuye added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 31, 2020
@simonjayhawkins
Copy link
Member

@qinxuye Thanks for the report

Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

can you make the code sample more minimal and also please indicate what the expected output should be (or state same as 1.0.5)

@simonjayhawkins simonjayhawkins added Apply Apply, Aggregate, Transform Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 31, 2020
@simonjayhawkins
Copy link
Member

it appears this change in behaviour is a result of #30858. So this change is not intentional. marking as regression. cc @MarcoGorelli

b6222ec is the first bad commit
commit b6222ec
Author: Marco Gorelli m.e.gorelli@gmail.com
Date: Tue Jul 14 21:34:55 2020 +0100

BUG: aggregations were getting overwritten if they had the same name (#30858)

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Jul 31, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1.1 milestone Jul 31, 2020
@MarcoGorelli
Copy link
Member

Thanks for the CC, and I'm sorry for the breakage caused! I'll look into this next week

@simonjayhawkins
Copy link
Member

Thanks @MarcoGorelli . and no need to apologise!

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Aug 8, 2020

I think the problem's here

        result = self._wrap_series_output(
            output=output, index=self.grouper.result_index
        )

where we have

(Pdb) output
{OutputKey(label='cumsum', position=0): 0     3
1     4
2     5
3     3
4     5
5     7
6     6
7     8
8    11
Name: a, dtype: int64}
(Pdb) result
b
1    4
3    3
4    5
5    7
6    6
Name: cumsum, dtype: int64
(Pdb) self.grouper.result_index
Int64Index([1, 3, 4, 5, 6], dtype='int64', name='b')

BTW, df1.groupby('b').agg('cummax') works but df1.groupby('b').agg(['cummax']) doesn't

@MarcoGorelli
Copy link
Member

@simonjayhawkins should the expected output be the same as it was in 1.0.5? TBH I find it strange that groupby.agg returns something with an index different from self.grouper.result_index, does that happen with other aggregations?

My (unqualified) opinion is that:

  • what pandas is currently returning for df1.groupby('b').agg('cumsum') should only be what it returns with df1.groupby('b').transform('cumsum'), while
  • df1.groupby('b').agg('cumsum') should return what is currently returned by df1.groupby('b').agg(['cumsum'])

@simonjayhawkins
Copy link
Member

I've not delved much into the groupby code. @WillAyd wdyt?

@WillAyd
Copy link
Member

WillAyd commented Aug 12, 2020

I think should revert to the 1.0.5 behavior

TBH I find it strange that groupby.agg returns something with an index different from self.grouper.result_index, does that happen with other aggregations?

I agree in principal but probably out of scope for this issue - can you see if there's one already out there or if not open one? agg is arguably too permissive of a function as it doesn't require the function you are applying to the groups to actually reduce, which it probably should.

@jorisvandenbossche jorisvandenbossche added the Blocker Blocking issue or pull request for an upcoming release label Aug 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform Blocker Blocking issue or pull request for an upcoming release Bug Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants