BUG: DataFrame.agg with multiple cum functions creates wrong result #35490

qinxuye · 2020-07-31T09:33:45Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# Your code here
In [1]: import pandas as pd                                                     

In [2]: df1 = pd.DataFrame({ 
   ...:     'a': [3, 4, 5, 3, 5, 4, 1, 2, 3], 
   ...:     'b': [1, 3, 4, 5, 6, 5, 4, 4, 4], 
   ...:     'c': list('aabaaddce'), 
   ...:     'd': [3, 4, 5, 3, 5, 4, 1, 2, 3], 
   ...:     'e': [1, 3, 4, 5, 6, 5, 4, 4, 4], 
   ...:     'f': list('aabaaddce'), 
   ...: })                                                                      

In [3]: df1.groupby('b').agg(['cummax', 'cumsum'])                              
Out[3]: 
       a             d             e       
  cummax cumsum cummax cumsum cummax cumsum
b                                          
1      4      4      4      4      3      3
3      3      3      3      3      5      5
4      5      5      5      5      6      6
5      4      7      4      7      5     10
6      5      6      5      6      4      8

Cumulative functions should generate the DataFrame with the same length.

Problem description

[this should explain why the current behaviour is a problem and why the expected output is a better solution]

In pandas 1.0.5, the result is

In [1]: import pandas as pd                                                     

In [2]: df1 = pd.DataFrame({ 
   ...:     'a': [3, 4, 5, 3, 5, 4, 1, 2, 3], 
   ...:     'b': [1, 3, 4, 5, 6, 5, 4, 4, 4], 
   ...:     'c': list('aabaaddce'), 
   ...:     'd': [3, 4, 5, 3, 5, 4, 1, 2, 3], 
   ...:     'e': [1, 3, 4, 5, 6, 5, 4, 4, 4], 
   ...:     'f': list('aabaaddce'), 
   ...: })                                                                      

In [3]: df1.groupby('b').agg(['cummax', 'cumsum'])                              
Out[3]: 
       a             d             e       
  cummax cumsum cummax cumsum cummax cumsum
0      3      3      3      3      1      1
1      4      4      4      4      3      3
2      5      5      5      5      4      4
3      3      3      3      3      5      5
4      5      5      5      5      6      6
5      4      7      4      7      5     10
6      5      6      5      6      4      8
7      5      8      5      8      4     12
8      5     11      5     11      4     16

Expected Output

Output of `pd.show_versions()`

In [5]: pd.show_versions()
/Users/qinxuye/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/setuptools/distutils_patch.py:26: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
"Distutils was imported before Setuptools. This usage is discouraged "

ImportError Traceback (most recent call last)
in
----> 1 pd.show_versions()

~/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/pandas/util/_print_versions.py in show_versions(as_json)
104 """
105 sys_info = _get_sys_info()
--> 106 deps = _get_dependency_info()
107
108 if as_json:

~/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/pandas/util/_print_versions.py in _get_dependency_info()
82 for modname in deps:
83 mod = import_optional_dependency(
---> 84 modname, raise_on_missing=False, on_version="ignore"
85 )
86 result[modname] = _get_version(mod) if mod else None

~/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
97 minimum_version = VERSIONS.get(name)
98 if minimum_version:
---> 99 version = _get_version(module)
100 if distutils.version.LooseVersion(version) < minimum_version:
101 assert on_version in {"warn", "raise", "ignore"}

~/miniconda3/envs/test_pandas_1.1/lib/python3.7/site-packages/pandas/compat/_optional.py in _get_version(module)
42
43 if version is None:
---> 44 raise ImportError(f"Can't determine version for {module.name}")
45 return version
46

ImportError: Can't determine version for numba

The text was updated successfully, but these errors were encountered:

simonjayhawkins · 2020-07-31T10:30:45Z

@qinxuye Thanks for the report

Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

can you make the code sample more minimal and also please indicate what the expected output should be (or state same as 1.0.5)

simonjayhawkins · 2020-07-31T11:11:34Z

it appears this change in behaviour is a result of #30858. So this change is not intentional. marking as regression. cc @MarcoGorelli

b6222ec is the first bad commit
commit b6222ec
Author: Marco Gorelli m.e.gorelli@gmail.com
Date: Tue Jul 14 21:34:55 2020 +0100

BUG: aggregations were getting overwritten if they had the same name (#30858)

MarcoGorelli · 2020-07-31T12:20:02Z

Thanks for the CC, and I'm sorry for the breakage caused! I'll look into this next week

simonjayhawkins · 2020-07-31T12:26:18Z

Thanks @MarcoGorelli . and no need to apologise!

MarcoGorelli · 2020-08-08T16:20:16Z

I think the problem's here

        result = self._wrap_series_output(
            output=output, index=self.grouper.result_index
        )

where we have

(Pdb) output
{OutputKey(label='cumsum', position=0): 0     3
1     4
2     5
3     3
4     5
5     7
6     6
7     8
8    11
Name: a, dtype: int64}
(Pdb) result
b
1    4
3    3
4    5
5    7
6    6
Name: cumsum, dtype: int64
(Pdb) self.grouper.result_index
Int64Index([1, 3, 4, 5, 6], dtype='int64', name='b')

BTW, df1.groupby('b').agg('cummax') works but df1.groupby('b').agg(['cummax']) doesn't

MarcoGorelli · 2020-08-11T12:31:29Z

@simonjayhawkins should the expected output be the same as it was in 1.0.5? TBH I find it strange that groupby.agg returns something with an index different from self.grouper.result_index, does that happen with other aggregations?

My (unqualified) opinion is that:

what pandas is currently returning for df1.groupby('b').agg('cumsum') should only be what it returns with df1.groupby('b').transform('cumsum'), while
df1.groupby('b').agg('cumsum') should return what is currently returned by df1.groupby('b').agg(['cumsum'])

simonjayhawkins · 2020-08-11T12:40:41Z

I've not delved much into the groupby code. @WillAyd wdyt?

WillAyd · 2020-08-12T16:33:21Z

I think should revert to the 1.0.5 behavior

TBH I find it strange that groupby.agg returns something with an index different from self.grouper.result_index, does that happen with other aggregations?

I agree in principal but probably out of scope for this issue - can you see if there's one already out there or if not open one? agg is arguably too permissive of a function as it doesn't require the function you are applying to the groups to actually reduce, which it probably should.

qinxuye added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 31, 2020

qinxuye mentioned this issue Jul 31, 2020

Fix pandas 1.1 compatibility mars-project/mars#1437

Merged

simonjayhawkins added Apply Apply, Aggregate, Transform Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 31, 2020

simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Jul 31, 2020

simonjayhawkins added this to the 1.1.1 milestone Jul 31, 2020

jorisvandenbossche added the Blocker Blocking issue or pull request for an upcoming release label Aug 12, 2020

This was referenced Aug 14, 2020

agg with list of non-aggregating functions #35723

Merged

ENH: groupby(...).agg should only accept reducers #35725

Open

jreback closed this as completed in #35723 Aug 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: DataFrame.agg with multiple cum functions creates wrong result #35490

BUG: DataFrame.agg with multiple cum functions creates wrong result #35490

qinxuye commented Jul 31, 2020

simonjayhawkins commented Jul 31, 2020

simonjayhawkins commented Jul 31, 2020

MarcoGorelli commented Jul 31, 2020

simonjayhawkins commented Jul 31, 2020

MarcoGorelli commented Aug 8, 2020 •

edited

Loading

MarcoGorelli commented Aug 11, 2020

simonjayhawkins commented Aug 11, 2020

WillAyd commented Aug 12, 2020

BUG: DataFrame.agg with multiple cum functions creates wrong result #35490

BUG: DataFrame.agg with multiple cum functions creates wrong result #35490

Comments

qinxuye commented Jul 31, 2020

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

simonjayhawkins commented Jul 31, 2020

simonjayhawkins commented Jul 31, 2020

MarcoGorelli commented Jul 31, 2020

simonjayhawkins commented Jul 31, 2020

MarcoGorelli commented Aug 8, 2020 • edited Loading

MarcoGorelli commented Aug 11, 2020

simonjayhawkins commented Aug 11, 2020

WillAyd commented Aug 12, 2020

Output of `pd.show_versions()`

MarcoGorelli commented Aug 8, 2020 •

edited

Loading