Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: df.groupy().rolling().cov() does not group properly and the cartesian product of all groups is returned instead #42915

Closed
2 of 3 tasks
Xnot opened this issue Aug 6, 2021 · 8 comments · Fixed by #44824
Labels
Blocker Blocking issue or pull request for an upcoming release Bug Groupby Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding
Milestone

Comments

@Xnot
Copy link
Contributor

Xnot commented Aug 6, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


import pandas as pd

print(pd.__version__)
df_a = pd.DataFrame({"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}).set_index(["idx1", "idx2"])
df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")
test = df_a.groupby(level=0).rolling(2).cov(df_b)
print(test)

On pandas 1.3.1:

1.3.1
                value
idx1 idx1 idx2       
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
     2    1       NaN
          2       NaN
          3       NaN
          4       NaN
          5       NaN
2    1    1       NaN
          2       NaN
          3       NaN
          4       NaN
          5       NaN
     2    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5

On pandas 1.0.5:

1.0.5
                value
idx1 idx1 idx2       
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
2    2    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5

1.0.5 has the correct expected output. In 1.3.1, everything in index 1/2/x and 2/1/x is useless and will always be NaN.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443
python : 3.9.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.2
setuptools : 54.1.2
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.22
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

@Xnot Xnot added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 6, 2021
@rhshadrach rhshadrach added the Window rolling, ewma, expanding label Aug 7, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 19, 2021
@simonjayhawkins
Copy link
Member

Thanks @Xnot for the report

1.0.5 has the correct expected output. In 1.3.1, everything in index 1/2/x and 2/1/x is useless and will always be NaN.

1.2.5 gives the same result as 1.0.5

first bad commit: [b2671cc] PERF: Rolling/Expanding.cov/corr (#39591)

cc @mroeschke

@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Aug 19, 2021
@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@simonjayhawkins
Copy link
Member

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@simonjayhawkins simonjayhawkins added the Blocker Blocking issue or pull request for an upcoming release label Dec 3, 2021
@rhshadrach
Copy link
Member

From test_rolling_corr_cov, it looks like the current behavior tested. But it does appear undesirable to me, originating from this block:

if other is not None:
# When we have other, we must reindex (expand) the result
# from flex_binary_moment to a "transform"-like result
# per groupby combination

As @Xnot points out, when reindexing here we are taking, for each group, the entirety of the result. But result contains the results for all groups, so I'm not sure why we'd want to do that. Maybe this entire block can be removed?

@simonjayhawkins
Copy link
Member

From test_rolling_corr_cov, it looks like the current behavior tested. But it does appear undesirable to me,

df_a.groupby(level=0).apply(lambda x: x.rolling(2).cov(df_b))

on master gives the same result as

result = df_a.groupby(level=0).rolling(2).cov(df_b)

on 1.2.5

I assume these two should be equivalent and that the 1.2.5 behavior is therefore correct and test_rolling_corr_cov incorrect?

@rhshadrach
Copy link
Member

rhshadrach commented Dec 7, 2021

I find this convincing, however I'm still confused by test_rolling_corr_cov. Modifying the example in this issue to replicate the test, it would be:

df_a = pd.DataFrame({"idx1": [1] * 5 + [2] * 5, "value": range(10, 20)})
result = df_a.groupby("idx1").rolling(2).cov(df_a)
expected = df_a.groupby("idx1").apply(lambda x: x.rolling(2).cov(df_a))

Now these two give the same result!

idx1               
1    0   NaN    NaN
     1   0.0    0.5
     2   0.0    0.5
     3   0.0    0.5
     4   0.0    0.5
     5   NaN    NaN
     6   NaN    NaN
     7   NaN    NaN
     8   NaN    NaN
     9   NaN    NaN
2    0   NaN    NaN
     1   NaN    NaN
     2   NaN    NaN
     3   NaN    NaN
     4   NaN    NaN
     5   NaN    NaN
     6   0.0    0.5
     7   0.0    0.5
     8   0.0    0.5
     9   0.0    0.5

In the above, the NaNs in group 1 from index 5-9 must always be NaN because this is where idx1 takes on the value 2. Similar remarks apply to group 2 from index 0-4. This again looks like incorrect output, but coming from apply as well as rolling.cov.

Edit: Removing the block I highlighted in #42915 (comment) gives what I think is the expected result:

        idx1  value
idx1               
1    0   NaN    NaN
     1   NaN    0.5
     2   NaN    0.5
     3   NaN    0.5
     4   NaN    0.5
2    5   NaN    NaN
     6   NaN    0.5
     7   NaN    0.5
     8   NaN    0.5
     9   NaN    0.5

@rhshadrach
Copy link
Member

cc @mroeschke for any thoughts.

@Xnot
Copy link
Contributor Author

Xnot commented Dec 8, 2021

[...] Modifying the example in this issue to replicate the test, it would be:

df_a = pd.DataFrame({"idx1": [1] * 5 + [2] * 5, "value": range(10, 20)})
result = df_a.groupby("idx1").rolling(2).cov(df_a)
expected = df_a.groupby("idx1").apply(lambda x: x.rolling(2).cov(df_a))

Now these two give the same result!

[...]

In the above, the NaNs in group 1 from index 5-9 must always be NaN because this is where idx1 takes on the value 2. Similar remarks apply to group 2 from index 0-4. This again looks like incorrect output, but coming from apply as well as rolling.cov.

@rhshadrach I believe the result might be correct for this specific example though. If we return to the example in the OP and apply the operation to each group individually:

df_a = pd.DataFrame({"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}).set_index(["idx1", "idx2"])
df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")
test = df_a.groupby(level=0)
for idx, group in test:
    print(group.rolling(2).cov(df_b))

In this case, each group is the same size as df_b, and so is the result of each loop iteration:

           value
idx1 idx2       
1    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5
           value
idx1 idx2       
2    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5

However, in your example you are doing the covariance with the whole of df_a, which is larger than each group:

for idx, group in test:
    print(group.rolling(2).cov(df_a))

In this case, the group in each iteration is expanded to the size of df_a to provide the result:

           value
idx1 idx2       
1    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5
2    1       NaN
     2       NaN
     3       NaN
     4       NaN
     5       NaN
           value
idx1 idx2       
1    1       NaN
     2       NaN
     3       NaN
     4       NaN
     5       NaN
2    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5

I believe this is the correct behavior, and it is consistent with other operations:

print("short results:")
for idx, group in test:
    print(group + df_b)
print("expanded results:")
for idx, group in test:
    print(group + df_a)
short results:
           value
idx1 idx2       
1    1         0
     2         2
     3         4
     4         6
     5         8
           value
idx1 idx2       
2    1         5
     2         7
     3         9
     4        11
     5        13
expanded results:
           value
idx1 idx2       
1    1       0.0
     2       2.0
     3       4.0
     4       6.0
     5       8.0
2    1       NaN
     2       NaN
     3       NaN
     4       NaN
     5       NaN
           value
idx1 idx2       
1    1       NaN
     2       NaN
     3       NaN
     4       NaN
     5       NaN
2    1      10.0
     2      12.0
     3      14.0
     4      16.0
     5      18.0

So .groupby().apply() is correct on both cases, and .groupby().rolling().cov() is incorrect in the OP, but correct in your example.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 8, 2021

Following up on the discussion we had at the pandas dev meeting on 12/8/2021, I think that the results prior to pandas 1.3 were correct in terms of the semantic meaning of groupby and rolling in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Bug Groupby Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding
Projects
None yet
5 participants