
BUG: Regression - AmbiguousTimeError creating DataFrame #42505

Closed
3 tasks done
kenahoo opened this issue Jul 12, 2021 · 14 comments · Fixed by #44518
Labels
Bug · Constructors · Regression · Timezones
Milestone

Comments

@kenahoo

kenahoo commented Jul 12, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Pandas 1.2.5:

import pandas as pd
dt = pd.to_datetime('2019-11-03 01:00:00-0700').tz_convert('America/Los_Angeles')
pd.DataFrame({'dt': dt, 'value': [1]})
#                          dt  value
# 0 2019-11-03 01:00:00-07:00      1

Pandas 1.3:

import pandas as pd
dt = pd.to_datetime('2019-11-03 01:00:00-0700').tz_convert('America/Los_Angeles')
pd.DataFrame({'dt': dt, 'value': [1]})

#  File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
# pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2019-11-03 01:00:00, try using the 'ambiguous' argument

Problem description

The dt object already has a timezone attribute, so it shouldn't be converting/inferring at all.
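A quick sanity check (my addition, not part of the original report) confirms the scalar is already tz-aware, so no localization or DST inference should be needed:

```python
import pandas as pd

dt = pd.to_datetime('2019-11-03 01:00:00-0700').tz_convert('America/Los_Angeles')
# The scalar already carries a timezone; nothing is left to infer
assert dt.tz is not None
assert str(dt.tz) == 'America/Los_Angeles'
```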

Expected Output

1.2.5 and 1.3.0 should produce the same output.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f00ed8f
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@kenahoo kenahoo added the Bug and Needs Triage labels Jul 12, 2021
@kenahoo kenahoo changed the title BUG: Regression - AmbiguousTimeError creating BUG: Regression - AmbiguousTimeError creating DataFrame Jul 12, 2021
@mzeitlin11
Member

Thanks for reporting this @kenahoo! Bisection indicates this changed with #39767, further investigations welcome!

@mzeitlin11 mzeitlin11 added the Constructors, Regression and Timezones labels and removed the Bug and Needs Triage labels Jul 12, 2021
@mzeitlin11 mzeitlin11 added this to the 1.3.1 milestone Jul 12, 2021
@rhshadrach
Member

@mzeitlin11 - I think standard practice is to cc the author of a PR where a regression has (or may have) occurred.

cc @jbrockmendel

@jmcomie
Contributor

jmcomie commented Jul 16, 2021

Brainstorm: is it ever good practice to support a conversion from an offset to a named timezone? If we already "know" that 2019-11-03 01:00:00-0700 corresponds to the Los Angeles timezone, we can of course say that this is the first 1 AM of that evening and therefore not an ambiguous time, but can that be generalized? If not, it strikes me that the AmbiguousTimeError should simply occur on the initial tz_convert, and that it should be up to the user to strip the offset from the datetime string and force the conversion a different way. Getting timezones right is often a pain for an end user, but ultimately I think it's better not to shoulder ambiguities that the end user should resolve.

@jbrockmendel
Member

based on a quick look:

  1. if you pass {"dt": [dt]} instead of {"dt": dt} it works as expected
  2. otherwise, inside sequence_to_dt64ns we go through objects_to_datetime64ns, after which we cast to naive datetime64; that is my best guess as to where we need to tinker
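A minimal illustration of point 1 (my sketch, not from the thread): wrapping the scalar in a list takes the list code path and constructs the frame correctly.

```python
import pandas as pd

dt = pd.to_datetime('2019-11-03 01:00:00-0700').tz_convert('America/Los_Angeles')
# Passing [dt] instead of dt avoids the scalar-broadcast path that raises on 1.3.0
df = pd.DataFrame({'dt': [dt], 'value': [1]})
print(df.dtypes['dt'])
```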

@kenahoo
Author

kenahoo commented Jul 17, 2021

@jmcomie - I don't think that analysis is correct. Once the timezone or offset is attached, you know the exact instant (in seconds past 1970 GMT), so attaching a new timezone is a simple matter of reinterpreting the clock/calendar time on the wall.

Additional fact: in my domain, the data 2019-11-03 01:00:00-0700 does not actually correspond to the Los Angeles timezone - some of our sources quite literally use a fixed -0700 offset all year (someone's idea of a way to avoid DST problems), so it's 1 hour off of LA time in the winter. So our process needs to be: read in the time with the given -0700 offset (a perfectly valid piece of data, even if weird to a user's eyes), then convert to a user-friendly timezone for display and for calculations that depend on what day it is (energy used per day, etc.).
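To illustrate the reinterpretation point (my example, not from the thread): tz_convert never changes the underlying instant, only the wall-clock rendering, so the epoch value is preserved exactly.

```python
import pandas as pd

ts = pd.to_datetime('2019-11-03 01:00:00-0700')   # fixed-offset source data
converted = ts.tz_convert('America/Los_Angeles')  # user-friendly display timezone
# The instant (nanoseconds since 1970 UTC) is identical before and after;
# only the wall-clock representation differs
assert ts.value == converted.value
```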

@jmcomie
Contributor

jmcomie commented Jul 17, 2021

@jmcomie - I don't think that analysis is correct. Once the timezone or offset is attached, you know the exact instant (in seconds past 1970 GMT), so attaching a new timezone is a simple matter of reinterpreting the clock/calendar time on the wall.

Touché. You're right -- both time representations are precise and the conversion should be valid.

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 19, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.1, 1.3.2 Jul 24, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@mroeschke mroeschke added the Bug label Aug 21, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@simonjayhawkins
Member

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@kenahoo
Author

kenahoo commented Nov 4, 2021

@simonjayhawkins the DST transition is coming this weekend and this issue does not seem to be fixed. Is there a workaround, or a chance of fixing it yet this week?

@kenahoo
Author

kenahoo commented Nov 4, 2021

Additional test case - under pandas version 1.2.5, this works:

pd.DataFrame({'dt': dt, 'value': [1]})

but this throws the above error:

pd.DataFrame({'dt': dt, 'value': [1, 2, 3]})

(where dt is pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York').)
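For reference, a variant of this second case that sidesteps the scalar path by broadcasting manually (an assumed workaround, mirroring the [dt] suggestion earlier in the thread):

```python
import pandas as pd

dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
# Broadcasting by hand means the constructor never sees a bare tz-aware scalar
df = pd.DataFrame({'dt': [dt] * 3, 'value': [1, 2, 3]})
```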

@jbrockmendel
Member

Is there a workaround, or a chance of fixing it yet this week?

Does the workaround #42505 (comment) not work for you? There is essentially no chance of a release with a fix for this in the next week, no. The next planned release is end of November I think. A PR addressing this would be welcome.

@kenahoo
Author

kenahoo commented Nov 6, 2021

@jbrockmendel - no, that workaround doesn't apply in all cases. For example, the second case I showed above would have to use [dt] * 3 or similar, rather than just [dt].

Am I correct that the proposed workaround is to simply never use a scalar datetime when creating a DataFrame? That's not really feasible to implement in all cases, partly because there are so many "array types" that can be passed to DataFrame().

In my code, I'm handed a dict containing the data to put into a DataFrame. To apply that workaround, we would have to:

  1. scan through all the columns and detect any that pandas will interpret as an array (including lists, numpy arrays, Series, "list-like objects", tuples, etc. - I'm not sure of the complete set),
  2. find the max length across all such columns,
  3. find all datetime-like columns that are specified as scalars (I'm not sure of the extent of types to consider - is it just pd.Timestamp? Or are numpy and datetime items problematic too?), and
  4. finally do some [val] * max_n transformation on them to protect them from whatever's causing this error inside numpy.
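A rough sketch of such a defensive pre-processing step (a hypothetical helper of my own, handling only pd.Timestamp scalars; as noted above, the full set of scalar and array-like types pandas recognizes is larger):

```python
import pandas as pd

def wrap_datetime_scalars(data):
    """Hypothetical helper: wrap scalar Timestamps in lists sized to the
    longest list-like column, so the DataFrame constructor never sees a
    bare tz-aware scalar."""
    # Crude list-like detection: anything sized that isn't a string
    lengths = [len(v) for v in data.values()
               if hasattr(v, '__len__') and not isinstance(v, str)]
    n = max(lengths, default=1)
    return {k: ([v] * n if isinstance(v, pd.Timestamp) else v)
            for k, v in data.items()}

dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
df = pd.DataFrame(wrap_datetime_scalars({'dt': dt, 'value': [1, 2, 3]}))
```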

@simonjayhawkins
Member

I'm not sure working around this issue is necessarily the best way forward.

The release note from the commit that caused the change in behavior is...

Deprecated casting datetime.date objects to datetime64 when used as fill_value in :meth:DataFrame.unstack, :meth:DataFrame.shift, :meth:Series.shift, and :meth:DataFrame.reindex, pass pd.Timestamp(dateobj) instead (:issue:39767)

This is in no way related to the change in behavior reported in this issue.

PR restoring the 1.2.5 behavior is welcome.


AFAICT, the exception is raised from pd.core.arrays.datetimes.DatetimeArray._from_sequence when the dtype is specified.

on master and 1.2.5 we have

dt = pd.to_datetime("2019-11-03 01:00:00-0700").tz_convert("America/Los_Angeles")
print(dt)
dtype = pd.DataFrame({"dt": [dt], "value": [1]}).dtypes[0]
print(dtype)
arr = pd.core.arrays.datetimes.DatetimeArray._from_sequence([dt])
dtype_inferred = arr.dtype
print(dtype_inferred)
print(dtype == dtype_inferred)
print(arr)
pd.core.arrays.datetimes.DatetimeArray._from_sequence([dt], dtype=dtype)

2019-11-03 01:00:00-07:00
datetime64[ns, America/Los_Angeles]
datetime64[ns, America/Los_Angeles]
True
<DatetimeArray>
['2019-11-03 01:00:00-07:00']
Length: 1, dtype: datetime64[ns, America/Los_Angeles]
---------------------------------------------------------------------------
AmbiguousTimeError                        Traceback (most recent call last)
/tmp/ipykernel_4117/1065977013.py in <module>
      8 print(dtype == dtype_inferred)
      9 print(arr)
---> 10 pd.core.arrays.datetimes.DatetimeArray._from_sequence([dt], dtype=dtype)

~/miniconda3/envs/pandas-1.2.5/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py in _from_sequence(cls, scalars, dtype, copy)
    305     @classmethod
    306     def _from_sequence(cls, scalars, *, dtype=None, copy: bool = False):
--> 307         return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
    308 
    309     @classmethod

~/miniconda3/envs/pandas-1.2.5/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py in _from_sequence_not_strict(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
    324         freq, freq_infer = dtl.maybe_infer_freq(freq)
    325 
--> 326         subarr, tz, inferred_freq = sequence_to_dt64ns(
    327             data,
    328             dtype=dtype,

~/miniconda3/envs/pandas-1.2.5/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py in sequence_to_dt64ns(data, dtype, copy, tz, dayfirst, yearfirst, ambiguous)
   1999             # Convert tz-naive to UTC
   2000             tz = timezones.maybe_get_tz(tz)
-> 2001             data = tzconversion.tz_localize_to_utc(
   2002                 data.view("i8"), tz, ambiguous=ambiguous
   2003             )

pandas/_libs/tslibs/tzconversion.pyx in pandas._libs.tslibs.tzconversion.tz_localize_to_utc()

AmbiguousTimeError: Cannot infer dst time from 2019-11-03 01:00:00, try using the 'ambiguous' argument

Now, this behavior of raising when the dtype is specified (even though it is the very dtype that would be inferred if none were passed) does indeed seem strange, and maybe it is this that needs to change to fix the regression, even though this behavior is unchanged from 1.2.5.

So what has changed?

There were two changes in #39767: maybe_promote and infer_dtype_from_scalar. Although there were extensive changes to maybe_promote, that path is not hit by the code sample from the OP.

The relevant change is that infer_dtype_from_scalar now returns a different val:

dt = pd.to_datetime("2019-11-03 01:00:00-0700").tz_convert("America/Los_Angeles")
print(dt)
dtype, val = pd.core.dtypes.cast.infer_dtype_from_scalar(dt, pandas_dtype=True)
print(dtype, type(dtype))
print(val, type(val))

on 1.2.5

2019-11-03 01:00:00-07:00
datetime64[ns, America/Los_Angeles] <class 'pandas.core.dtypes.dtypes.DatetimeTZDtype'>
1572768000000000000 <class 'int'>

on master

2019-11-03 01:00:00-07:00
datetime64[ns, America/Los_Angeles] <class 'pandas.core.dtypes.dtypes.DatetimeTZDtype'>
2019-11-03 01:00:00-07:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Now, the docstring for infer_dtype_from_scalar doesn't give any details about what happens to val, so I'm not sure whether the behavior on master is correct or not.

Suffice it to say, using the integer val in pd.core.arrays.datetimes.DatetimeArray._from_sequence with the dtype specified does not raise an exception:

dt = pd.to_datetime("2019-11-03 01:00:00-0700").tz_convert("America/Los_Angeles")
print(dt)
dtype = pd.DataFrame({"dt": [dt], "value": [1]}).dtypes[0]
print(dtype)
int_val = dt.value
print(int_val)
result = pd.core.arrays.datetimes.DatetimeArray._from_sequence([int_val], dtype=dtype)
print(result)

2019-11-03 01:00:00-07:00
datetime64[ns, America/Los_Angeles]
1572768000000000000
<DatetimeArray>
['2019-11-03 01:00:00-07:00']
Length: 1, dtype: datetime64[ns, America/Los_Angeles]

@simonjayhawkins
Member

@jbrockmendel any thoughts on the best way to fix?

Maybe three options

  1. reverting the change in return type of infer_dtype_from_scalar
  2. changing the handling of types in pd.core.arrays.datetimes.DatetimeArray._from_sequence
  3. special case the passing of the return of infer_dtype_from_scalar onto pd.core.arrays.datetimes.DatetimeArray._from_sequence. (i.e. convert to integer)

I'm not sure about option 1, but I assume that the new behaviour is correct? (Needs a better docstring)

I'm not keen on option 2 for a backport PR this late in the 1.3.x series, even though it's probably the correct long-term solution.

That leaves option 3. I'm not normally keen on special-casing, but this is probably a trivial, safe fix suitable for a backport.
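In miniature (my sketch, mirroring the earlier snippet in the thread), option 3 amounts to handing _from_sequence the epoch integer rather than the tz-aware Timestamp once the dtype is known; note this touches private pandas internals.

```python
import pandas as pd
from pandas.core.arrays.datetimes import DatetimeArray

dt = pd.to_datetime('2019-11-03 01:00:00-0700').tz_convert('America/Los_Angeles')
dtype = pd.DatetimeTZDtype(tz='America/Los_Angeles')
# Passing the UTC epoch integer skips the ambiguous wall-time localization
arr = DatetimeArray._from_sequence([dt.value], dtype=dtype)
```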

@jbrockmendel
Member

Best guess remains sequence_to_dt64ns (#42505 (comment)). i'll try to confirm this guess today.
