BUG:dataframe.groupby('some_column').timedelta.sum() results wrong when timedelta contains NaT (pandas=1.3.0) #42659

Closed
1 task
minientertainment opened this issue Jul 22, 2021 · 9 comments · Fixed by #44658
Labels: Bug; Groupby; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Reduction Operations (sum, mean, min, max, etc.); Regression (functionality that used to work in a prior pandas version); Timedelta (Timedelta data type)

minientertainment commented Jul 22, 2021

  • [y] I have checked that this issue has not already been reported.

  • [y] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

This is a tiny example:

import numpy as np
import pandas as pd

tmp = pd.DataFrame({'id': ['1624477460271-3908654213', '1624477460271-3908654213'], 'dt': [np.nan, pd.to_timedelta('0:0:2')]})
tmp
Out[6]: 
                         id              dt
0  1624477460271-3908654213             NaT
1  1624477460271-3908654213 0 days 00:00:02

Problem description

dataframe.groupby('some_column').timedelta.sum() returns a wrong result when the timedelta column contains NaT:

tmp.groupby('id').dt.sum()
Out[7]: 
id
1624477460271-3908654213   -106752 days +00:12:45.145224192
Name: dt, dtype: timedelta64[ns]
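
A note on where this particular value may come from. The sketch below assumes that NaT is stored as the smallest int64 value and that the grouped sum added that raw integer to the 2-second timedelta; the thread does not confirm this at this point, so treat it as a plausible reading only. The helper format_ns is a hypothetical name introduced for illustration.

```python
# Assumption: NaT is stored internally as the smallest int64 value.
INT64_MIN = -2**63
NS_PER_SECOND = 10**9
NS_PER_DAY = 86_400 * NS_PER_SECOND


def format_ns(total_ns: int) -> str:
    """Render a nanosecond count in the 'D days +HH:MM:SS.nnnnnnnnn' layout."""
    days, rem = divmod(total_ns, NS_PER_DAY)  # floor division, so rem >= 0
    seconds, nanos = divmod(rem, NS_PER_SECOND)
    hours, seconds = divmod(seconds, 3600)
    minutes, seconds = divmod(seconds, 60)
    sign = "+" if days < 0 else ""
    return f"{days} days {sign}{hours:02d}:{minutes:02d}:{seconds:02d}.{nanos:09d}"


# Treat the NaT sentinel as an ordinary number and add 2 seconds to it.
two_seconds_ns = 2 * NS_PER_SECOND
naive_sum = INT64_MIN + two_seconds_ns
print(format_ns(naive_sum))  # -106752 days +00:12:45.145224192
```

The printed string matches the grouped result above, which is consistent with the missing value being summed as a number rather than skipped.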

Expected Output

tmp.dt.sum()
Out[8]: Timedelta('0 days 00:00:02')

Output of pd.show_versions()

[pandas 1.3.0]

minientertainment added the Bug and Needs Triage labels on Jul 22, 2021
phofl (Member) commented Jul 22, 2021

Please provide something that is copy-pasteable (see https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).

Also, please format your code.

phofl added the Groupby and Needs Info labels and removed the Needs Triage label on Jul 22, 2021
minientertainment (Author) commented

> Please provide something that is copy-pasteable (see https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).
>
> Also, please format your code.

Updated.

minientertainment changed the title from "BUG:dataframe.groupby('some_column').timedelta.sum() results wrong when timedelta contains NaT" to "BUG:dataframe.groupby('some_column').timedelta.sum() results wrong when timedelta contains NaT (pandas=1.3.0)" on Jul 27, 2021
nicolaslegrand91 commented

I also encountered this problem; it seems to be quite a major bug.

@phofl Here is a full copy-pastable example:

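import pandas as pd

# Group 1 has two valid timedeltas; group 2 has one valid value and one NaT.
df = pd.DataFrame({
    'a': [1, 1, 2, 2],
    'b': [pd.Timedelta('1d'), pd.Timedelta('2d'), pd.Timedelta('3d'), pd.NaT],
})

print('\ndf:')
print(df)

# Column b should be reported as timedelta64[ns].
print('\ndf.dtypes:')
print(df.dtypes)

# On affected versions, the sum for group 2 (which contains the NaT) is wrong.
print("\ndf.groupby('a').b.sum():")
print(df.groupby('a').b.sum())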

[Image attachment: 326DD059-EB73-4C3A-9AE3-41E8A8AD75B1]
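
For anyone hitting this on an affected version, one possible workaround is sketched below. It relies on the observation from the original report that a plain Series sum (tmp.dt.sum()) returns the expected value, and simply applies that per group. This is an illustration only; it has not been checked against 1.3.x here, and it may be slower than the built-in grouped sum.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2],
    "b": [pd.Timedelta("1d"), pd.Timedelta("2d"), pd.Timedelta("3d"), pd.NaT],
})

# Series.sum skips NaT by default (skipna=True), so call it once per group
# instead of using the grouped sum directly.
per_group = df.groupby("a")["b"].apply(lambda s: s.sum())
print(per_group)
```

Both groups should come out as 3 days: group 1 from 1d + 2d, and group 2 from 3d with the NaT skipped.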

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443
python : 3.9.5.final.0
python-bits : 64
OS : Darwin
OS-release : 20.4.0
Version : Darwin Kernel Version 20.4.0: Thu Apr 22 21:46:47 PDT 2021; root:xnu-7195.101.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.2.1
setuptools : 49.6.0
Cython : 0.29.24
pytest : 6.2.3
hypothesis : None
sphinx : 3.5.4
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : 4.6.3
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.25.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.22
tables : None
tabulate : 0.8.9
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

phofl removed the Needs Info label on Jul 28, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 29, 2021
simonjayhawkins (Member) commented

First bad commit: [fa8f68e] BUG/API: SeriesGroupBy reduction with numeric_only=True (#41342)

cc @jbrockmendel

simonjayhawkins added this to the 1.3.2 milestone on Jul 29, 2021
simonjayhawkins added the Timedelta label on Jul 29, 2021
simonjayhawkins changed the milestone from 1.3.2 to 1.3.3 on Aug 15, 2021
mroeschke added the Reduction Operations and Missing-data labels on Aug 21, 2021
simonjayhawkins added the Regression label on Aug 31, 2021
simonjayhawkins changed the milestone from 1.3.3 to 1.3.4 on Sep 11, 2021
simonjayhawkins (Member) commented

Changing milestone to 1.3.5.

simonjayhawkins changed the milestone from 1.3.4 to 1.3.5 on Oct 16, 2021
nicolaslegrand91 commented Oct 26, 2021

@phofl @simonjayhawkins, thanks for handling this request.

Do you think this problem will be fixed, or is it considered bad practice to use groupby with mean/sum on pd.Timedelta values?

jbrockmendel (Member) commented

A PR addressing this would be welcome.

jreback (Contributor) commented Nov 28, 2021

Good to fix, but not necessary for 1.3.x.

jbrockmendel (Member) commented

Best guess is that an is_datetimelike argument is needed in group_add.
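
To make that suggestion concrete, here is a rough pure-Python sketch of what such a flag could do in a grouped-sum routine. It is not the pandas implementation: group_add_sketch and NAT_SENTINEL are hypothetical names, and the sketch assumes that timedelta values arrive as int64 nanoseconds with NaT encoded as the smallest int64 value.

```python
NAT_SENTINEL = -2**63  # assumption: int64 value used to store NaT


def group_add_sketch(values, labels, ngroups, is_datetimelike=False):
    """Sum integer `values` per group id in `labels`.

    Hypothetical illustration only; not the pandas implementation.
    """
    sums = [0] * ngroups
    for value, label in zip(values, labels):
        if label < 0:
            continue  # negative label: row belongs to no group
        if is_datetimelike and value == NAT_SENTINEL:
            continue  # treat the sentinel as missing, not as a number
        sums[label] += value
    return sums


day_ns = 86_400 * 10**9
values = [1 * day_ns, 2 * day_ns, 3 * day_ns, NAT_SENTINEL]
labels = [0, 0, 1, 1]

# Without the flag, the sentinel is added to group 1 as if it were a number.
print(group_add_sketch(values, labels, 2))
# With the flag, the sentinel is skipped and group 1 sums to 3 days.
print(group_add_sketch(values, labels, 2, is_datetimelike=True))
```

Without the flag, the second group ends up with a huge negative number; with it, both groups sum to 3 days in nanoseconds.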

7 participants