BUG: df.groupy().rolling().cov() does not group properly and the cartesian product of all groups is returned instead #42915

Xnot · 2021-08-06T12:25:57Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

import pandas as pd

print(pd.__version__)
df_a = pd.DataFrame({"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}).set_index(["idx1", "idx2"])
df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")
test = df_a.groupby(level=0).rolling(2).cov(df_b)
print(test)

On pandas 1.3.1:

1.3.1
                value
idx1 idx1 idx2       
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
     2    1       NaN
          2       NaN
          3       NaN
          4       NaN
          5       NaN
2    1    1       NaN
          2       NaN
          3       NaN
          4       NaN
          5       NaN
     2    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5

On pandas 1.0.5:

1.0.5
                value
idx1 idx1 idx2       
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
2    2    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5

1.0.5 has the correct expected output. In 1.3.1, everything in index 1/2/x and 2/1/x is useless and will always be NaN.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : c7f7443
python : 3.9.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.2
setuptools : 54.1.2
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.22
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

simonjayhawkins · 2021-08-19T20:03:15Z

Thanks @Xnot for the report

1.0.5 has the correct expected output. In 1.3.1, everything in index 1/2/x and 2/1/x is useless and will always be NaN.

1.2.5 gives the same result as 1.0.5

first bad commit: [b2671cc] PERF: Rolling/Expanding.cov/corr (#39591)

cc @mroeschke

simonjayhawkins · 2021-10-16T18:28:21Z

changing milestone to 1.3.5

rhshadrach · 2021-12-05T17:05:26Z

From test_rolling_corr_cov, it looks like the current behavior tested. But it does appear undesirable to me, originating from this block:

pandas/pandas/core/window/rolling.py

Lines 750 to 753 in 1cbe011

    
           if other is not None: 
        
               # When we have other, we must reindex (expand) the result 
        
               # from flex_binary_moment to a "transform"-like result 
        
               # per groupby combination

As @Xnot points out, when reindexing here we are taking, for each group, the entirety of the result. But result contains the results for all groups, so I'm not sure why we'd want to do that. Maybe this entire block can be removed?

simonjayhawkins · 2021-12-07T13:48:25Z

From test_rolling_corr_cov, it looks like the current behavior tested. But it does appear undesirable to me,

df_a.groupby(level=0).apply(lambda x: x.rolling(2).cov(df_b))

on master gives the same result as

result = df_a.groupby(level=0).rolling(2).cov(df_b)

on 1.2.5

I assume these two should be equivalent and that the 1.2.5 behavior is therefore correct and test_rolling_corr_cov incorrect?

rhshadrach · 2021-12-07T22:59:26Z

I find this convincing, however I'm still confused by test_rolling_corr_cov. Modifying the example in this issue to replicate the test, it would be:

df_a = pd.DataFrame({"idx1": [1] * 5 + [2] * 5, "value": range(10, 20)})
result = df_a.groupby("idx1").rolling(2).cov(df_a)
expected = df_a.groupby("idx1").apply(lambda x: x.rolling(2).cov(df_a))

Now these two give the same result!

idx1               
1    0   NaN    NaN
     1   0.0    0.5
     2   0.0    0.5
     3   0.0    0.5
     4   0.0    0.5
     5   NaN    NaN
     6   NaN    NaN
     7   NaN    NaN
     8   NaN    NaN
     9   NaN    NaN
2    0   NaN    NaN
     1   NaN    NaN
     2   NaN    NaN
     3   NaN    NaN
     4   NaN    NaN
     5   NaN    NaN
     6   0.0    0.5
     7   0.0    0.5
     8   0.0    0.5
     9   0.0    0.5

In the above, the NaNs in group 1 from index 5-9 must always be NaN because this is where idx1 takes on the value 2. Similar remarks apply to group 2 from index 0-4. This again looks like incorrect output, but coming from apply as well as rolling.cov.

Edit: Removing the block I highlighted in #42915 (comment) gives what I think is the expected result:

        idx1  value
idx1               
1    0   NaN    NaN
     1   NaN    0.5
     2   NaN    0.5
     3   NaN    0.5
     4   NaN    0.5
2    5   NaN    NaN
     6   NaN    0.5
     7   NaN    0.5
     8   NaN    0.5
     9   NaN    0.5

rhshadrach · 2021-12-07T23:02:05Z

cc @mroeschke for any thoughts.

Xnot · 2021-12-08T01:12:44Z

[...] Modifying the example in this issue to replicate the test, it would be:
df_a = pd.DataFrame({"idx1": [1] * 5 + [2] * 5, "value": range(10, 20)})
result = df_a.groupby("idx1").rolling(2).cov(df_a)
expected = df_a.groupby("idx1").apply(lambda x: x.rolling(2).cov(df_a))
Now these two give the same result!

[...]

In the above, the NaNs in group 1 from index 5-9 must always be NaN because this is where idx1 takes on the value 2. Similar remarks apply to group 2 from index 0-4. This again looks like incorrect output, but coming from apply as well as rolling.cov.

@rhshadrach I believe the result might be correct for this specific example though. If we return to the example in the OP and apply the operation to each group individually:

df_a = pd.DataFrame({"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}).set_index(["idx1", "idx2"])
df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")
test = df_a.groupby(level=0)
for idx, group in test:
    print(group.rolling(2).cov(df_b))

In this case, each group is the same size as df_b, and so is the result of each loop iteration:

           value
idx1 idx2       
1    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5
           value
idx1 idx2       
2    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5

However, in your example you are doing the covariance with the whole of df_a, which is larger than each group:

for idx, group in test:
    print(group.rolling(2).cov(df_a))

In this case, the group in each iteration is expanded to the size of df_a to provide the result:

           value
idx1 idx2       
1    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5
2    1       NaN
     2       NaN
     3       NaN
     4       NaN
     5       NaN
           value
idx1 idx2       
1    1       NaN
     2       NaN
     3       NaN
     4       NaN
     5       NaN
2    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5

I believe this is the correct behavior, and it is consistent with other operations:

print("short results:")
for idx, group in test:
    print(group + df_b)
print("expanded results:")
for idx, group in test:
    print(group + df_a)

short results:
           value
idx1 idx2       
1    1         0
     2         2
     3         4
     4         6
     5         8
           value
idx1 idx2       
2    1         5
     2         7
     3         9
     4        11
     5        13
expanded results:
           value
idx1 idx2       
1    1       0.0
     2       2.0
     3       4.0
     4       6.0
     5       8.0
2    1       NaN
     2       NaN
     3       NaN
     4       NaN
     5       NaN
           value
idx1 idx2       
1    1       NaN
     2       NaN
     3       NaN
     4       NaN
     5       NaN
2    1      10.0
     2      12.0
     3      14.0
     4      16.0
     5      18.0

So .groupby().apply() is correct on both cases, and .groupby().rolling().cov() is incorrect in the OP, but correct in your example.

Dr-Irv · 2021-12-08T19:39:34Z

Following up on the discussion we had at the pandas dev meeting on 12/8/2021, I think that the results prior to pandas 1.3 were correct in terms of the semantic meaning of groupby and rolling in this case.

Xnot added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 6, 2021

rhshadrach added the Window rolling, ewma, expanding label Aug 7, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 19, 2021

code sample for pandas-dev#42915

641695f

simonjayhawkins added this to the 1.3.3 milestone Aug 19, 2021

simonjayhawkins added Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2021

mroeschke added the Groupby label Aug 26, 2021

simonjayhawkins mentioned this issue Sep 4, 2021

BUG: RollingGroupby.corr() producing incorrect results #43386

Closed

3 tasks

simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021

simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021

simonjayhawkins mentioned this issue Nov 28, 2021

RLS: 1.3.5 #44080

Closed

simonjayhawkins added the Blocker Blocking issue or pull request for an upcoming release label Dec 3, 2021

mroeschke mentioned this issue Dec 9, 2021

BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group #44824

Merged

4 tasks

jreback closed this as completed in #44824 Dec 10, 2021

rtpsw mentioned this issue Jan 31, 2022

ENH: Rolling window with step size (GH-15354) #45735

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: df.groupy().rolling().cov() does not group properly and the cartesian product of all groups is returned instead #42915

BUG: df.groupy().rolling().cov() does not group properly and the cartesian product of all groups is returned instead #42915

Xnot commented Aug 6, 2021

INSTALLED VERSIONS

simonjayhawkins commented Aug 19, 2021

simonjayhawkins commented Oct 16, 2021

rhshadrach commented Dec 5, 2021

simonjayhawkins commented Dec 7, 2021

rhshadrach commented Dec 7, 2021 •

edited

Loading

rhshadrach commented Dec 7, 2021

Xnot commented Dec 8, 2021 •

edited

Loading

Dr-Irv commented Dec 8, 2021

BUG: df.groupy().rolling().cov() does not group properly and the cartesian product of all groups is returned instead #42915

BUG: df.groupy().rolling().cov() does not group properly and the cartesian product of all groups is returned instead #42915

Comments

Xnot commented Aug 6, 2021

Output of pd.show_versions()

INSTALLED VERSIONS

simonjayhawkins commented Aug 19, 2021

simonjayhawkins commented Oct 16, 2021

rhshadrach commented Dec 5, 2021

simonjayhawkins commented Dec 7, 2021

rhshadrach commented Dec 7, 2021 • edited Loading

rhshadrach commented Dec 7, 2021

Xnot commented Dec 8, 2021 • edited Loading

Dr-Irv commented Dec 8, 2021

Output of `pd.show_versions()`

rhshadrach commented Dec 7, 2021 •

edited

Loading

Xnot commented Dec 8, 2021 •

edited

Loading