BUG: pd.cut regression since version 1.4.0 when operating on datetime #46218

sergiykhan · 2022-03-03T23:37:13Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

pd.cut(
    pd.Series([pd.Timestamp('2022-02-26')]),
    bins=pd.interval_range(
        pd.Timestamp('2022-02-25'),
        pd.Timestamp('2022-02-27'),
        freq='1D',
    )
).value_counts()

Issue Description

Binning is not correct in Pandas 1.4.x.

(2022-02-25, 2022-02-26]    0
(2022-02-26, 2022-02-27]    0

Note that the same series gets processed correctly, when the dtype is not datetime64[ns]

pd.cut(
    pd.Series(['2022-02-26']),
    bins=pd.interval_range(
        pd.Timestamp('2022-02-25'),
        pd.Timestamp('2022-02-27'),
        freq='1D',
    )
).value_counts()

Expected Behavior

The expected behavior is

(2022-02-25, 2022-02-26]    1
(2022-02-26, 2022-02-27]    0

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.60.1-microsoft-standard-WSL2
Version : #1 SMP Wed Aug 25 23:20:18 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.1
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.7.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

The text was updated successfully, but these errors were encountered:

rhshadrach · 2022-03-05T14:10:26Z

Thanks for the report! Confirmed on main.

simonjayhawkins · 2022-03-08T14:55:47Z

first bad commit: [0283df6] REF: implement Index._should_partial_index (#42227)

cc @jbrockmendel

jbrockmendel · 2022-03-08T16:14:38Z

@simonjayhawkins you never ping me on my good commits

simonjayhawkins · 2022-03-09T11:32:16Z

The bad is copy and pasted from the git bisect output and is the default label. All your commits are good, but some are maybe less good than others?

I cc you on these since IIRC you requested me to do so.

jbrockmendel · 2022-03-09T17:30:49Z

Just kidding around simon. my bad that wasn't clear.

simonjayhawkins · 2022-03-20T21:37:34Z

first bad commit: [0283df6] REF: implement Index._should_partial_index (#42227)

in #42227, if not self._should_compare(target) and not is_interval_dtype(self.dtype): was changed to if not self._should_compare(target) and not self._should_partial_index(target): in Index.get_indexer in pandas/core/indexes/base.py

This is the changed code that the code sample in the OP hits.

it appears is_interval_dtype(self.dtype) is True whereas self._should_partial_index(target) is False for the code sample accounting for the changed behavior.

jbrockmendel · 2022-03-20T22:34:39Z

In _bins_to_cuts there is a if isinstance(bins, IntervalIndex): fastpath. When we get there with the OP example, we have converted the Timestamp to a float64. The bins.get_indexer(x) result is now correct, it's the x we have at this point that is the problem.

simonjayhawkins · 2022-04-01T18:24:28Z

moving to 1.4.3

dom-insytesys · 2022-05-17T21:19:51Z

Is there a recommended workaround? Some of my code stopped working and it took me a couple of hours of debugging to realize that pd.cut was broken in recent Pandas releases, and then find this page.

simonjayhawkins · 2022-05-18T09:29:57Z

@dom-insytesys comments like this are not helpful. pandas is a community project. Feel free to post a workaround for the benefit of other users or submit a PR to fix and it will be included in the next pandas patch release.

dom-insytesys · 2022-05-18T16:53:13Z

@simonjayhawkins I'm so sorry if my comment annoyed you. I didn't intend to be rude.

What I was hoping to convey was firstly that this is not an obscure corner case. One of the most common uses of pd.cut() is to bin time series data into windows. So I assume this bug affects a lot of Pandas users, myself included.

Until Pandas 1.4.3 is released, I need to figure out some sort of a workaround. Obviously, because Pandas is open-source (hoorah!) I can theoretically get the source code, find the bug, and fix it myself. Or I can use pd.cut() source as a guide to writing my own alternative version.

However, that has a pretty steep learning curve (I'm not very proficient in Python, nor familiar with Pandas internals, so I'd guess it would take me a day or more), and I was hoping that there was some sort of quick workaround that would be childishly obvious to guys like you who have a good understanding of this code.

For example, @jbrockmendel provides an explanation above that I think says that the bug involves the conversion of Timestamp data to float64. That suggests to me that if I convert the timeseries data that needs binning from datetime64[ns] to a float64 (seconds since epoch?), that will avoid the bug. I can't figure out how to do this (e.g. "astype('float64') produces TypeError) , but I bet it is trivial for you. That's why I was hoping one of the Pandas team would say, "here's a quick way to avoid the bug: just convert your data from datetime64 to float64 using XXXX function". That would help affected users get quickly back on course.

Again, I apologize wholeheartedly if I caused offense. I hugely appreciate the work of the volunteer community that works on Pandas. It's an amazing project. I'm just asking for some help, with the expectation that any guidance would probably be useful to other users who end up stumbling to this page for the same reason.

sergiykhan · 2022-05-18T16:59:15Z

The quick fix is described in my bug report. Use strings.

dom-insytesys · 2022-05-18T17:03:03Z

In the spirit of attempting to help out others, here's my crude effort at a workaround.

Firstly, some code to illustrate the bug:

# Example code to illustrate pd.cut() bug
df = pd.DataFrame({"clock":pd.date_range(start='2020/1/1', end='2020/1/2', periods=25), "randomdata":[random.random() for _ in range(25)]})
windows = pd.IntervalIndex.from_breaks(pd.date_range(start='2019/12/31 23:30', end='2020/1/1 23:30', periods=13))
bins = pd.cut(df.clock, bins=windows)
df['randomdata'].groupby(bins).mean()

This produces the following output:

clock
(2019-12-31 23:30:00, 2020-01-01 01:30:00] NaN
(2020-01-01 01:30:00, 2020-01-01 03:30:00] NaN
(2020-01-01 03:30:00, 2020-01-01 05:30:00] NaN
(2020-01-01 05:30:00, 2020-01-01 07:30:00] NaN
(2020-01-01 07:30:00, 2020-01-01 09:30:00] NaN
(2020-01-01 09:30:00, 2020-01-01 11:30:00] NaN
(2020-01-01 11:30:00, 2020-01-01 13:30:00] NaN
(2020-01-01 13:30:00, 2020-01-01 15:30:00] NaN
(2020-01-01 15:30:00, 2020-01-01 17:30:00] NaN
(2020-01-01 17:30:00, 2020-01-01 19:30:00] NaN
(2020-01-01 19:30:00, 2020-01-01 21:30:00] NaN
(2020-01-01 21:30:00, 2020-01-01 23:30:00] NaN
Name: randomdata, dtype: float64

If I understand correctly, the bug relates to the fact that df.clock has a Timestamp-like dtype. This means it gets on a fasttrack code path. That can be avoided by converting the datetime to a string. Changing the line:

bins = pd.cut(df.clock.dt.strftime("%Y/%m/%d %H:%M:%S.%f'"), bins=windows)

Produces the desired output:

clock
(2019-12-31 23:30:00, 2020-01-01 01:30:00] 0.686914
(2020-01-01 01:30:00, 2020-01-01 03:30:00] 0.462383
(2020-01-01 03:30:00, 2020-01-01 05:30:00] 0.667617
(2020-01-01 05:30:00, 2020-01-01 07:30:00] 0.503656
(2020-01-01 07:30:00, 2020-01-01 09:30:00] 0.615401
(2020-01-01 09:30:00, 2020-01-01 11:30:00] 0.428048
(2020-01-01 11:30:00, 2020-01-01 13:30:00] 0.631899
(2020-01-01 13:30:00, 2020-01-01 15:30:00] 0.357134
(2020-01-01 15:30:00, 2020-01-01 17:30:00] 0.624220
(2020-01-01 17:30:00, 2020-01-01 19:30:00] 0.495276
(2020-01-01 19:30:00, 2020-01-01 21:30:00] 0.617120
(2020-01-01 21:30:00, 2020-01-01 23:30:00] 0.112893
Name: randomdata, dtype: float64

This seems to be working for me. But I'm a little concerned that it is fragile. (What happens, for example if IntervalIndex is in a different timezone? etc.)

simonjayhawkins · 2022-06-22T11:43:48Z

moving to 1.4.4

mroeschke · 2022-07-05T23:01:40Z

Closing as a duplicate of #46218. Discussion of a solution seems further along there

sergiykhan · 2022-07-05T23:09:54Z

I think you closed the wrong one.

mroeschke · 2022-07-05T23:11:49Z

Thanks @sergiykhan. Reopened

jorisvandenbossche · 2022-07-17T20:16:13Z

I opened #47771 with a potential fix.

It doesn't fix the root cause inside cut of converting the datetime64 data to numeric but then still passing that to pd.cut (although in hindsight that should actually also be a straightforward change), but it restores the behaviour of IntervalIndex.get_indexer to accept numeric target values for datetime64 intervals.

sergiykhan added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 3, 2022

rhshadrach added cut cut, qcut Regression Functionality that used to work in a prior pandas version Timestamp pd.Timestamp and associated methods and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 5, 2022

rhshadrach added this to the 1.4.2 milestone Mar 5, 2022

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Mar 8, 2022

code sample for pandas-dev#46218

68519b6

sergiykhan changed the title ~~BUG: pd.cut regression starting in version 1.4.0 when operating on datetime~~ BUG: pd.cut regression since version 1.4.0 when operating on datetime Mar 23, 2022

simonjayhawkins modified the milestones: 1.4.2, 1.4.3 Apr 1, 2022

simonjayhawkins mentioned this issue Jun 6, 2022

BUG: Using pd.cut with IntervalIndex containing Timedeltas returns only NaN #47171

Closed

3 tasks

simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 22, 2022

mroeschke closed this as completed Jul 5, 2022

mroeschke reopened this Jul 5, 2022

jorisvandenbossche mentioned this issue Jul 17, 2022

REGR: fix pd.cut with datetime IntervalIndex as bins #47771

Merged

jorisvandenbossche mentioned this issue Jul 17, 2022

REF: avoid converting input to cut if passing IntervalIndex bins #47773

Closed

simonjayhawkins closed this as completed in #47771 Aug 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: pd.cut regression since version 1.4.0 when operating on datetime #46218

BUG: pd.cut regression since version 1.4.0 when operating on datetime #46218

sergiykhan commented Mar 3, 2022 •

edited

Loading

INSTALLED VERSIONS

rhshadrach commented Mar 5, 2022

simonjayhawkins commented Mar 8, 2022

jbrockmendel commented Mar 8, 2022

simonjayhawkins commented Mar 9, 2022

jbrockmendel commented Mar 9, 2022

simonjayhawkins commented Mar 20, 2022

jbrockmendel commented Mar 20, 2022

simonjayhawkins commented Apr 1, 2022

dom-insytesys commented May 17, 2022

simonjayhawkins commented May 18, 2022

dom-insytesys commented May 18, 2022

sergiykhan commented May 18, 2022

dom-insytesys commented May 18, 2022 •

edited

Loading

simonjayhawkins commented Jun 22, 2022

mroeschke commented Jul 5, 2022

sergiykhan commented Jul 5, 2022

mroeschke commented Jul 5, 2022

jorisvandenbossche commented Jul 17, 2022

BUG: pd.cut regression since version 1.4.0 when operating on datetime #46218

BUG: pd.cut regression since version 1.4.0 when operating on datetime #46218

Comments

sergiykhan commented Mar 3, 2022 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

rhshadrach commented Mar 5, 2022

simonjayhawkins commented Mar 8, 2022

jbrockmendel commented Mar 8, 2022

simonjayhawkins commented Mar 9, 2022

jbrockmendel commented Mar 9, 2022

simonjayhawkins commented Mar 20, 2022

jbrockmendel commented Mar 20, 2022

simonjayhawkins commented Apr 1, 2022

dom-insytesys commented May 17, 2022

simonjayhawkins commented May 18, 2022

dom-insytesys commented May 18, 2022

sergiykhan commented May 18, 2022

dom-insytesys commented May 18, 2022 • edited Loading

simonjayhawkins commented Jun 22, 2022

mroeschke commented Jul 5, 2022

sergiykhan commented Jul 5, 2022

mroeschke commented Jul 5, 2022

jorisvandenbossche commented Jul 17, 2022

sergiykhan commented Mar 3, 2022 •

edited

Loading

dom-insytesys commented May 18, 2022 •

edited

Loading