BUG: DatetimeIndex.intersection not working for localized indices #46702

Closed
2 of 3 tasks
rwijtvliet opened this issue Apr 8, 2022 · 9 comments · Fixed by #48167
Labels
Bug Datetime Datetime data dtype Index Related to the Index class or subclasses Regression Functionality that used to work in a prior pandas version setops union, intersection, difference, symmetric_difference Timezones Timezone data dtype
Milestone

Comments

@rwijtvliet

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# this works:
i1 = pd.date_range('2020-03-27', periods=5, freq='D')
i2 = pd.date_range('2020-03-30', periods=5, freq='D')
i1.intersection(i2)  # DatetimeIndex(['2020-03-30', '2020-03-31'], dtype='datetime64[ns]', freq='D')

# but this does not:
i1 = pd.date_range('2020-03-27', periods=5, freq='D', tz='Europe/Berlin')
i2 = pd.date_range('2020-03-30', periods=5, freq='D', tz='Europe/Berlin')
i1.intersection(i2) # DatetimeIndex([], dtype='datetime64[ns, Europe/Berlin]', freq='D')
# should be DatetimeIndex(['2020-03-30 00:00:00+02:00', '2020-03-31 00:00:00+02:00'], dtype='datetime64[ns, Europe/Berlin]', freq='D')

Issue Description

intersection is not working correctly for timezone-aware DatetimeIndex instances.

Expected Behavior

That all timestamps that are in both indices are kept.
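One way to state the expected result independently of `DatetimeIndex.intersection` is to compute it with plain Python sets and compare. This is only a sketch: `expected_intersection` is a hypothetical helper name, not part of pandas, and whether `affected` comes out `True` depends on the installed pandas version.

```python
import pandas as pd


def expected_intersection(a: pd.DatetimeIndex, b: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Hypothetical helper: timestamps present in both indices, sorted."""
    # Tz-aware Timestamps hash and compare by instant, so a set intersection is safe.
    common = set(a) & set(b)
    return pd.DatetimeIndex(sorted(common))


i1 = pd.date_range('2020-03-27', periods=5, freq='D', tz='Europe/Berlin')
i2 = pd.date_range('2020-03-30', periods=5, freq='D', tz='Europe/Berlin')

expected = expected_intersection(i1, i2)
actual = i1.intersection(i2)

# Index.equals compares values only, so a differing freq does not matter here.
affected = not actual.equals(expected)
print("expected:", list(expected))
print("installed pandas affected:", affected)
```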

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252

pandas : 1.4.1
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : 7.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.1.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
sqlalchemy : 1.4.32
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@rwijtvliet rwijtvliet added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 8, 2022
@rhshadrach rhshadrach added Index Related to the Index class or subclasses datetime.date stdlib datetime.date support labels Apr 9, 2022
@rwijtvliet
Author

rwijtvliet commented Apr 10, 2022

Current workaround:

i1 = pd.date_range('2020-03-27', periods=5, freq='D', tz='Europe/Berlin')
i2 = pd.date_range('2020-03-30', periods=5, freq='D', tz='Europe/Berlin')
pd.DatetimeIndex([i for i in i1 if i in i2], freq=i1.freq)
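A variant of the same idea without the Python-level loop is boolean masking with `Index.isin`. This is a sketch under the assumption that `isin` does not go through the affected set-operation code path, which was not checked in this thread. Note that the result carries no `freq`.

```python
import pandas as pd

i1 = pd.date_range('2020-03-27', periods=5, freq='D', tz='Europe/Berlin')
i2 = pd.date_range('2020-03-30', periods=5, freq='D', tz='Europe/Berlin')

# Keep the elements of i1 that also occur in i2; order follows i1.
common = i1[i1.isin(i2)]
print(common)
```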

@jreback
Contributor

jreback commented Apr 10, 2022

Can you try on master and see if these are working?

@jreback
Contributor

jreback commented Apr 10, 2022

Does this happen only when a DST transition falls in the range, or for all tz-aware indices? (I know we test the latter, so this is odd.)

@rwijtvliet
Author

rwijtvliet commented Apr 10, 2022

It's only an issue when there's a DST transition
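To see where the transition falls in the example range, the UTC offset of each timestamp can be printed. A small sketch using the same `i1` as in the report:

```python
import pandas as pd

i1 = pd.date_range('2020-03-27', periods=5, freq='D', tz='Europe/Berlin')

# utcoffset() returns the offset from UTC in effect at each timestamp.
offsets = [ts.utcoffset() for ts in i1]
for ts, off in zip(i1, offsets):
    print(ts, off)
```

The offset changes from +01:00 to +02:00 between 2020-03-29 and 2020-03-30, so both `i1` and `i2` touch the start of DST.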

@rwijtvliet
Author

rwijtvliet commented Apr 10, 2022

Just to make sure I'm doing it right: I created and activated a fresh virtual environment, ran git clone https://github.com/pandas-dev/pandas.git newdir, then pip install -e ., installed the missing C++ build tools, and ran the above code. Is that correct?

The result is the same:

>>> i1 = pd.date_range('2020-03-27', periods=5, freq='D', tz='Europe/Berlin')
>>> i2 = pd.date_range('2020-03-30', periods=5, freq='D', tz='Europe/Berlin')
>>> i1.intersection(i2)
DatetimeIndex([], dtype='datetime64[ns, Europe/Berlin]', freq='D')

Whereas time periods not containing DST-start do work:

>>> i1 = pd.date_range('2020-03-20', periods=5, freq='D', tz='Europe/Berlin')
>>> i2 = pd.date_range('2020-03-23', periods=5, freq='D', tz='Europe/Berlin')
>>> i1.intersection(i2)
DatetimeIndex(['2020-03-23 00:00:00+01:00', '2020-03-24 00:00:00+01:00'], dtype='datetime64[ns, Europe/Berlin]', freq='D')
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 83f0d71a2cbde738c459e4feca54288b77d00e3d
python           : 3.9.12.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19044
machine          : AMD64
processor        : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : de_DE.cp1252

pandas           : 1.5.0.dev0+666.g83f0d71a2c
numpy            : 1.22.3
pytz             : 2022.1
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 58.0.4
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
markupsafe       : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None

@mroeschke mroeschke added Datetime Datetime data dtype Timezones Timezone data dtype setops union, intersection, difference, symmetric_difference and removed Needs Triage Issue that has not been reviewed by a pandas team member datetime.date stdlib datetime.date support labels Apr 10, 2022
@mroeschke
Member

Thanks for the report. The set-ops in general have issues dealing with DST transitions (see the DatetimeIndex.union issue, #45863).

@rwijtvliet
Author

My current workaround, for anyone interested:

def intersection(*indices: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Intersect several DatetimeIndices.

    Parameters
    ----------
    *indices : pd.DatetimeIndex
        The indices to intersect.

    Returns
    -------
    pd.DatetimeIndex
        The intersection, i.e., datetimeindex with values that exist in each index.

    Notes
    -----
    If indices have distinct timezones or names, the values from the first index are used.
    """
    if len(indices) == 0:
        raise ValueError("Must specify at least one index.")

    distinct_freqs = {i.freq for i in indices}
    if len(indices) > 1 and len(distinct_freqs) != 1:
        raise ValueError(
            f"Indices must have equal frequencies; got {distinct_freqs}."
        )

    # Mixing timezone-aware and timezone-naive indices is not supported.
    tznaive = [i.tz is None for i in indices]
    if any(tznaive) and not all(tznaive):
        raise ValueError(
            "All indices must be either timezone-aware or timezone-naive; "
            f"got {sum(tznaive)} naive out of {len(indices)}."
        )

    freq, name, tz = indices[0].freq, indices[0].name, indices[0].tz

    # Calculation is cumbersome: pandas DatetimeIndex.intersection not working correctly on timezone-aware indices.
    values = set(indices[0])
    for idx in indices[1:]:
        values = values.intersection(set(idx))
    idx = pd.DatetimeIndex(sorted(list(values)), freq=freq, name=name, tz=tz)

    return idx

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 4, 2022
@simonjayhawkins
Member

Expected Behavior

That all timestamps that are in both indices are kept.

pandas 1.3.5 shows the expected behavior.

first bad commit: [ba6b4eb] REF: dispatch DTI/TDI setops to RangeIndex (#44039)
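One plausible reading of why a range-based set operation would trip here (an assumption on my part, not confirmed in this thread): a range needs a constant step, but across the DST start the daily timestamps are not evenly spaced in absolute time. A short sketch that measures the spacing:

```python
import pandas as pd

i1 = pd.date_range('2020-03-27', periods=5, freq='D', tz='Europe/Berlin')

# Elapsed time between consecutive timestamps, in hours.
steps = i1[1:] - i1[:-1]
step_hours = list(steps / pd.Timedelta(hours=1))
print(step_hours)
```

The step from 2020-03-29 to 2020-03-30 is 23 hours while the others are 24, so the underlying values do not form an arithmetic progression.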

cc @jbrockmendel

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Jun 4, 2022
@simonjayhawkins simonjayhawkins added this to the 1.4.3 milestone Jun 4, 2022
@simonjayhawkins
Member

moving to 1.4.4
