Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pd.cut regression since version 1.4.0 when operating on datetime #46218

Closed
3 tasks done
sergiykhan opened this issue Mar 3, 2022 · 18 comments · Fixed by #47771
Closed
3 tasks done

BUG: pd.cut regression since version 1.4.0 when operating on datetime #46218

sergiykhan opened this issue Mar 3, 2022 · 18 comments · Fixed by #47771
Labels
Bug cut cut, qcut Regression Functionality that used to work in a prior pandas version Timestamp pd.Timestamp and associated methods
Milestone

Comments

@sergiykhan
Copy link

sergiykhan commented Mar 3, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

pd.cut(
    pd.Series([pd.Timestamp('2022-02-26')]),
    bins=pd.interval_range(
        pd.Timestamp('2022-02-25'),
        pd.Timestamp('2022-02-27'),
        freq='1D',
    )
).value_counts()

Issue Description

Binning is not correct in Pandas 1.4.x.

(2022-02-25, 2022-02-26]    0
(2022-02-26, 2022-02-27]    0

Note that the same series gets processed correctly, when the dtype is not datetime64[ns]

pd.cut(
    pd.Series(['2022-02-26']),
    bins=pd.interval_range(
        pd.Timestamp('2022-02-25'),
        pd.Timestamp('2022-02-27'),
        freq='1D',
    )
).value_counts()

Expected Behavior

The expected behavior is

(2022-02-25, 2022-02-26]    1
(2022-02-26, 2022-02-27]    0

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.60.1-microsoft-standard-WSL2
Version : #1 SMP Wed Aug 25 23:20:18 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.1
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.7.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@sergiykhan sergiykhan added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 3, 2022
@rhshadrach
Copy link
Member

Thanks for the report! Confirmed on main.

@rhshadrach rhshadrach added cut cut, qcut Regression Functionality that used to work in a prior pandas version Timestamp pd.Timestamp and associated methods and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 5, 2022
@rhshadrach rhshadrach added this to the 1.4.2 milestone Mar 5, 2022
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Mar 8, 2022
@simonjayhawkins
Copy link
Member

first bad commit: [0283df6] REF: implement Index._should_partial_index (#42227)

cc @jbrockmendel

@jbrockmendel
Copy link
Member

@simonjayhawkins you never ping me on my good commits

@simonjayhawkins
Copy link
Member

The bad is copy and pasted from the git bisect output and is the default label. All your commits are good, but some are maybe less good than others?

I cc you on these since IIRC you requested me to do so.

@jbrockmendel
Copy link
Member

Just kidding around simon. my bad that wasn't clear.

@simonjayhawkins
Copy link
Member

first bad commit: [0283df6] REF: implement Index._should_partial_index (#42227)

in #42227, if not self._should_compare(target) and not is_interval_dtype(self.dtype): was changed to if not self._should_compare(target) and not self._should_partial_index(target): in Index.get_indexer in pandas/core/indexes/base.py

This is the changed code that the code sample in the OP hits.

it appears is_interval_dtype(self.dtype) is True whereas self._should_partial_index(target) is False for the code sample accounting for the changed behavior.

@jbrockmendel
Copy link
Member

In _bins_to_cuts there is a if isinstance(bins, IntervalIndex): fastpath. When we get there with the OP example, we have converted the Timestamp to a float64. The bins.get_indexer(x) result is now correct, it's the x we have at this point that is the problem.

@sergiykhan sergiykhan changed the title BUG: pd.cut regression starting in version 1.4.0 when operating on datetime BUG: pd.cut regression since version 1.4.0 when operating on datetime Mar 23, 2022
@simonjayhawkins
Copy link
Member

moving to 1.4.3

@simonjayhawkins simonjayhawkins modified the milestones: 1.4.2, 1.4.3 Apr 1, 2022
@dom-insytesys
Copy link

Is there a recommended workaround? Some of my code stopped working and it took me a couple of hours of debugging to realize that pd.cut was broken in recent Pandas releases, and then find this page.

@simonjayhawkins
Copy link
Member

@dom-insytesys comments like this are not helpful. pandas is a community project. Feel free to post a workaround for the benefit of other users or submit a PR to fix and it will be included in the next pandas patch release.

@dom-insytesys
Copy link

@simonjayhawkins I'm so sorry if my comment annoyed you. I didn't intend to be rude.

What I was hoping to convey was firstly that this is not an obscure corner case. One of the most common uses of pd.cut() is to bin time series data into windows. So I assume this bug affects a lot of Pandas users, myself included.

Until Pandas 1.4.3 is released, I need to figure out some sort of a workaround. Obviously, because Pandas is open-source (hoorah!) I can theoretically get the source code, find the bug, and fix it myself. Or I can use pd.cut() source as a guide to writing my own alternative version.

However, that has a pretty steep learning curve (I'm not very proficient in Python, nor familiar with Pandas internals, so I'd guess it would take me a day or more), and I was hoping that there was some sort of quick workaround that would be childishly obvious to guys like you who have a good understanding of this code.

For example, @jbrockmendel provides an explanation above that I think says that the bug involves the conversion of Timestamp data to float64. That suggests to me that if I convert the timeseries data that needs binning from datetime64[ns] to a float64 (seconds since epoch?), that will avoid the bug. I can't figure out how to do this (e.g. "astype('float64') produces TypeError) , but I bet it is trivial for you. That's why I was hoping one of the Pandas team would say, "here's a quick way to avoid the bug: just convert your data from datetime64 to float64 using XXXX function". That would help affected users get quickly back on course.

Again, I apologize wholeheartedly if I caused offense. I hugely appreciate the work of the volunteer community that works on Pandas. It's an amazing project. I'm just asking for some help, with the expectation that any guidance would probably be useful to other users who end up stumbling to this page for the same reason.

@sergiykhan
Copy link
Author

The quick fix is described in my bug report. Use strings.

@dom-insytesys
Copy link

dom-insytesys commented May 18, 2022

In the spirit of attempting to help out others, here's my crude effort at a workaround.

Firstly, some code to illustrate the bug:

# Example code to illustrate pd.cut() bug
df = pd.DataFrame({"clock":pd.date_range(start='2020/1/1', end='2020/1/2', periods=25), "randomdata":[random.random() for _ in range(25)]})
windows = pd.IntervalIndex.from_breaks(pd.date_range(start='2019/12/31 23:30', end='2020/1/1 23:30', periods=13))
bins = pd.cut(df.clock, bins=windows)
df['randomdata'].groupby(bins).mean()

This produces the following output:

clock
(2019-12-31 23:30:00, 2020-01-01 01:30:00] NaN
(2020-01-01 01:30:00, 2020-01-01 03:30:00] NaN
(2020-01-01 03:30:00, 2020-01-01 05:30:00] NaN
(2020-01-01 05:30:00, 2020-01-01 07:30:00] NaN
(2020-01-01 07:30:00, 2020-01-01 09:30:00] NaN
(2020-01-01 09:30:00, 2020-01-01 11:30:00] NaN
(2020-01-01 11:30:00, 2020-01-01 13:30:00] NaN
(2020-01-01 13:30:00, 2020-01-01 15:30:00] NaN
(2020-01-01 15:30:00, 2020-01-01 17:30:00] NaN
(2020-01-01 17:30:00, 2020-01-01 19:30:00] NaN
(2020-01-01 19:30:00, 2020-01-01 21:30:00] NaN
(2020-01-01 21:30:00, 2020-01-01 23:30:00] NaN
Name: randomdata, dtype: float64

If I understand correctly, the bug relates to the fact that df.clock has a Timestamp-like dtype. This means it gets on a fasttrack code path. That can be avoided by converting the datetime to a string. Changing the line:

bins = pd.cut(df.clock.dt.strftime("%Y/%m/%d %H:%M:%S.%f'"), bins=windows)

Produces the desired output:

clock
(2019-12-31 23:30:00, 2020-01-01 01:30:00] 0.686914
(2020-01-01 01:30:00, 2020-01-01 03:30:00] 0.462383
(2020-01-01 03:30:00, 2020-01-01 05:30:00] 0.667617
(2020-01-01 05:30:00, 2020-01-01 07:30:00] 0.503656
(2020-01-01 07:30:00, 2020-01-01 09:30:00] 0.615401
(2020-01-01 09:30:00, 2020-01-01 11:30:00] 0.428048
(2020-01-01 11:30:00, 2020-01-01 13:30:00] 0.631899
(2020-01-01 13:30:00, 2020-01-01 15:30:00] 0.357134
(2020-01-01 15:30:00, 2020-01-01 17:30:00] 0.624220
(2020-01-01 17:30:00, 2020-01-01 19:30:00] 0.495276
(2020-01-01 19:30:00, 2020-01-01 21:30:00] 0.617120
(2020-01-01 21:30:00, 2020-01-01 23:30:00] 0.112893
Name: randomdata, dtype: float64

This seems to be working for me. But I'm a little concerned that it is fragile. (What happens, for example if IntervalIndex is in a different timezone? etc.)

@simonjayhawkins
Copy link
Member

moving to 1.4.4

@simonjayhawkins simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 22, 2022
@mroeschke
Copy link
Member

Closing as a duplicate of #46218. Discussion of a solution seems further along there

@sergiykhan
Copy link
Author

I think you closed the wrong one.

@mroeschke mroeschke reopened this Jul 5, 2022
@mroeschke
Copy link
Member

Thanks @sergiykhan. Reopened

@jorisvandenbossche
Copy link
Member

I opened #47771 with a potential fix.

It doesn't fix the root cause inside cut of converting the datetime64 data to numeric but then still passing that to pd.cut (although in hindsight that should actually also be a straightforward change), but it restores the behaviour of IntervalIndex.get_indexer to accept numeric target values for datetime64 intervals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug cut cut, qcut Regression Functionality that used to work in a prior pandas version Timestamp pd.Timestamp and associated methods
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants