Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP BUG: Inconsistent date parsing of to_datetime #35428

Closed

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Jul 27, 2020

This PR is an attempt to address concerns about datetime parsing. I'll update the docs if people approve this change.

Original problem

From #12585:

In [3]: pd.to_datetime(["31/12/2014", "10/03/2011"]) 
   ...:                                                                                                               
Out[3]: DatetimeIndex(['2014-12-31', '2011-03-10'], dtype='datetime64[ns]', freq=None)

The issue is that the first item is parsed as DD\MM\YYYY whereas for the second the format is MM\DD\YYYY, and there is no message to alert the user of inconsistency in the output.

The problem lies with the dayfirst argument that selects between DD\MM\YYYY and MM\DD\YYYY. In the parser:

cdef inline object _parse_delimited_date(str date_string, bint dayfirst):
"""
Parse special cases of dates: MM/DD/YYYY, DD/MM/YYYY, MM/YYYY.
At the beginning function tries to parse date in MM/DD/YYYY format, but
if month > 12 - in DD/MM/YYYY (`dayfirst == False`).
With `dayfirst == True` function makes an attempt to parse date in
DD/MM/YYYY, if an attempt is wrong - in DD/MM/YYYY
For MM/DD/YYYY, DD/MM/YYYY: delimiter can be a space or one of /-.
For MM/YYYY: delimiter can be a space or one of /-
If `date_string` can't be converted to date, then function returns
None, None
Parameters
----------
date_string : str
dayfirst : bool
Returns:
--------
datetime or None
str or None
Describing resolution of the parsed string.
"""
cdef:
const char* buf
Py_ssize_t length
int day = 1, month = 1, year
bint can_swap = 0
buf = get_c_string_buf_and_size(date_string, &length)
if length == 10:
# parsing MM?DD?YYYY and DD?MM?YYYY dates
if _is_not_delimiter(buf[2]) or _is_not_delimiter(buf[5]):
return None, None
month = _parse_2digit(buf)
day = _parse_2digit(buf + 3)
year = _parse_4digit(buf + 6)
reso = 'day'
can_swap = 1
elif length == 7:
# parsing MM?YYYY dates
if buf[2] == b'.' or _is_not_delimiter(buf[2]):
# we cannot reliably tell whether e.g. 10.2010 is a float
# or a date, thus we refuse to parse it here
return None, None
month = _parse_2digit(buf)
year = _parse_4digit(buf + 3)
reso = 'month'
else:
return None, None
if month < 0 or day < 0 or year < 1000:
# some part is not an integer, so
# date_string can't be converted to date, above format
return None, None
if 1 <= month <= MAX_DAYS_IN_MONTH and 1 <= day <= MAX_DAYS_IN_MONTH \
and (month <= MAX_MONTH or day <= MAX_MONTH):
if (month > MAX_MONTH or (day <= MAX_MONTH and dayfirst)) and can_swap:
day, month = month, day
if PY_VERSION_HEX >= 0x03060100:
# In Python <= 3.6.0 there is no range checking for invalid dates
# in C api, thus we call faster C version for 3.6.1 or newer
return datetime_new(year, month, day, 0, 0, 0, 0, None), reso
return datetime(year, month, day, 0, 0, 0, 0, None), reso
raise DateParseError(f"Invalid date specified ({month}/{day})")

we try DD\MM\YYYY or MM\DD\YYYY first depending on dayfirst but if the result is invalid we try the other.

Proposed change

I'd like to raise warnings whenever we parse in contradiction to the directive of dayfirst. The OP, where dayfirst defaults to False, now raises a Warning:

In [4]: pd.to_datetime(["31/12/2014", "10/03/2011"], dayfirst=False) 
   ...:                                                                                                               
/workspaces/pandas-arw2019/pandas/core/arrays/datetimes.py:2044: UserWarning: Parsing 31/12/2014 DD/MM format.
  result, tz_parsed = tslib.array_to_datetime(
Out[4]: DatetimeIndex(['2014-12-31', '2011-10-03'], dtype='datetime64[ns]', freq=None)

A downside of this change is that sometimes a consistent output produces unnecessary warnings (due to mismatch with the default value of dayfirst):

In [11]: pd.to_datetime(["31/12/2014", "30/04/2011", "31/05/2016"] )                                                 
/workspaces/pandas-arw2019/pandas/core/arrays/datetimes.py:2044: UserWarning: Parsing 31/12/2014 DD/MM format.
  result, tz_parsed = tslib.array_to_datetime(
/workspaces/pandas-arw2019/pandas/core/arrays/datetimes.py:2044: UserWarning: Parsing 30/04/2011 DD/MM format.
  result, tz_parsed = tslib.array_to_datetime(
/workspaces/pandas-arw2019/pandas/core/arrays/datetimes.py:2044: UserWarning: Parsing 31/05/2016 DD/MM format.
  result, tz_parsed = tslib.array_to_datetime(
Out[11]: DatetimeIndex(['2014-12-31', '2011-04-30', '2016-05-31'], dtype='datetime64[ns]', freq=None)

but these can be silenced with infer_datetime_format:

In [12]: pd.to_datetime(["31/12/2014", "30/04/2011", "31/05/2016"], infer_datetime_format=True )                      
Out[12]: DatetimeIndex(['2014-12-31', '2011-04-30', '2016-05-31'], dtype='datetime64[ns]', freq=None)

For invalid input (if it cannot be parsed according to a single format) we will always raise a warning, even with infer_datetime_format=True:

In [13]: pd.to_datetime(["31/12/2014", "03/31/2011"], infer_datetime_format=True)                                     
/workspaces/pandas-arw2019/pandas/core/arrays/datetimes.py:2044: UserWarning: Parsing 31/12/2014 DD/MM format.
  result, tz_parsed = tslib.array_to_datetime(
Out[13]: DatetimeIndex(['2014-12-31', '2011-03-31'], dtype='datetime64[ns]', freq=None)

Remaining issues

Personally I can see a case for raising an error with

In [13]: pd.to_datetime(["31/12/2014", "03/31/2011"], infer_datetime_format=True)                                     

since there's no valid way to process that. I also think that

In [11]: pd.to_datetime(["31/12/2014", "30/04/2011", "31/05/2016"])                                                 

shouldn't be raising warnings, but I feel much less strongly about that. That said, changing this behaviour will require more heavy-handed alterations to the codebase (and maybe performance implications - not sure).

I think that the warnings I added here will allow users to figure out the origin of these problems when they turn up and that should do away with most of the headaches.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think the warning approach is reasonable.

Might be preferable to the user if they just get 1 warning along the lines of "Parsed {date} as DD/MM/YYY and {other date} as MM/DD/YYY. Provide format or specify infer_datetime_format=True for consistent parsing"

@arw2019
Copy link
Member Author

arw2019 commented Jul 29, 2020

Might be preferable to the user if they just get 1 warning along the lines of "Parsed as DD/MM/YYY and as MM/DD/YYY. Provide format or specify infer_datetime_format=True for consistent parsing"

Sounds good. I'll keep working on this

@simonjayhawkins
Copy link
Member

@arw2019 doc build failing

>>>-------------------------------------------------------------------------
Warning in /home/runner/work/pandas/pandas/doc/source/user_guide/timeseries.rst at block ending on line 222
Specify :okwarning: as an option in the ipython:: block to suppress this message
----------------------------------------------------------------------------
/home/runner/work/pandas/pandas/pandas/core/arrays/datetimes.py:2044: UserWarning: Parsing '01-14-2012' in MM/DD/YYYY format.
  result, tz_parsed = tslib.array_to_datetime(
<<<-------------------------------------------------------------------------

@arw2019 arw2019 changed the title BUG: Inconsistent date parsing of to_datetime WIP BUG: Inconsistent date parsing of to_datetime Aug 7, 2020
@arw2019 arw2019 marked this pull request as draft August 7, 2020 21:39
@simonjayhawkins
Copy link
Member

@arw2019 are you still working on this?

@arw2019
Copy link
Member Author

arw2019 commented Sep 15, 2020

@simonjayhawkins yes, will push what I have shortly. Sorry about keeping this idle

@dsaxton dsaxton added Bug Datetime Datetime data dtype labels Sep 15, 2020
@arw2019
Copy link
Member Author

arw2019 commented Sep 18, 2020

Closing this for now. I'll circle back when I'm ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent date parsing of to_datetime
4 participants