BUG: to_datetime(): Unstable default dateformat assumptions #42584

Closed
olippuner opened this issue Jul 17, 2021 · 1 comment
Labels
Datetime (Datetime data dtype), Duplicate Report (Duplicate issue or pull request)

Comments

@olippuner

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

for k in range(9, 17):
    # Intended format: '%d-%m-%Y %H:%M:%S', as used e.g. in CH/DE locales
    dstr = '{}-07-2021 15:57:40'.format(k)
    print(' dstr: {}  pd.to_datetime(): {}'.format(dstr, pd.to_datetime(dstr)))

Output:
dstr: 9-07-2021 15:57:40  pd.to_datetime(): 2021-**09**-07 15:57:40
 dstr: 10-07-2021 15:57:40  pd.to_datetime(): 2021-**10**-07 15:57:40
 dstr: 11-07-2021 15:57:40  pd.to_datetime(): 2021-**11**-07 15:57:40
 dstr: 12-07-2021 15:57:40  pd.to_datetime(): 2021-**12**-07 15:57:40
 dstr: 13-07-2021 15:57:40  pd.to_datetime(): 2021-07-13 15:57:40
 dstr: 14-07-2021 15:57:40  pd.to_datetime(): 2021-07-14 15:57:40
 dstr: 15-07-2021 15:57:40  pd.to_datetime(): 2021-07-15 15:57:40
 dstr: 16-07-2021 15:57:40  pd.to_datetime(): 2021-07-16 15:57:40

Problem description

It would be OK if pd.to_datetime() assumed a default for the format parameter, for example either the US format or the ISO-style format '%Y/%m/%d %H:%M:%S'.

However, as the code sample shows, to_datetime() changes its assumptions for each input string. This is not acceptable and can be very harmful. There is nothing to gain from such a dynamic parsing strategy.

I guess pd.to_datetime() first assumes a date format of '%m-%d-%Y %H:%M:%S'. Once this assumption is violated, the decoding strategy changes to '%d-%m-%Y %H:%M:%S' instead of throwing an exception. Such instability is harmful. pd.to_datetime() should parse according to one given date format, and the first string that violates that format should raise an exception.
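The strict behaviour asked for here can be obtained by passing `format` explicitly. The following is a minimal sketch, not a fix for the default behaviour: it reuses the date strings from the code sample, and the constant names `DAY_FIRST` and `MONTH_FIRST` are introduced only for illustration.

```python
import pandas as pd

DAY_FIRST = "%d-%m-%Y %H:%M:%S"    # the format the reporter intends
MONTH_FIRST = "%m-%d-%Y %H:%M:%S"  # the format the reporter suspects is tried first

# With an explicit format, every day from 9 to 16 is parsed as a day in July.
parsed = [
    pd.to_datetime("{}-07-2021 15:57:40".format(k), format=DAY_FIRST)
    for k in range(9, 17)
]
months = {ts.month for ts in parsed}
days = [ts.day for ts in parsed]
print(months, days)

# With an explicit format that the string violates (there is no month 13),
# parsing raises instead of silently switching to another interpretation.
try:
    pd.to_datetime("13-07-2021 15:57:40", format=MONTH_FIRST)
    strict_failure = False
except ValueError as exc:
    strict_failure = True
    print("rejected:", exc)
```

With `format` given, the same rule applies to every string, so a value that does not fit is reported rather than reinterpreted.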

Expected Output

pd.to_datetime() should always work under the control of a single datetime format string. This could be either a default or a format string passed as a parameter.

Currently, if no format string is passed as a parameter, to_datetime() uses a flexible strategy. As pandas' main scope is working with DataFrame and Series objects, time series are a prominent application, and flexible date format interpretation can be very destructive there. Users should not have to expect this behaviour from to_datetime(); it is NOT failsafe.
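Short of a full format string, `to_datetime()` also accepts a `dayfirst` argument. The sketch below applies it to the strings from the code sample. Note that `dayfirst=True` is only a parsing preference, not a strict check, so it does not provide the exception-on-violation behaviour requested here.

```python
import pandas as pd

# dayfirst=True tells the parser to prefer day-before-month when a date
# string is ambiguous. It is a hint, not a strict format.
results = {}
for k in range(9, 17):
    dstr = "{}-07-2021 15:57:40".format(k)
    results[dstr] = pd.to_datetime(dstr, dayfirst=True)

for dstr, ts in results.items():
    print(dstr, "->", ts)
```

For these inputs, every string is read as a day in July, including 9 through 12, which the default call reads month-first.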
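For the Series case, one defensive pattern is to combine an explicit `format` with `errors="coerce"` and then inspect the rows that failed to parse. This is a sketch under assumptions: the sample values in `raw` are made up for illustration, and the column is assumed to be day-first throughout.

```python
import pandas as pd

# Hypothetical raw column: two day-first strings and one stray entry
# written in a different layout.
raw = pd.Series([
    "09-07-2021 15:57:40",
    "13-07-2021 15:57:40",
    "2021/07/14 15:57:40",
])

# One format for the whole column; values that do not match become NaT
# instead of being reinterpreted under a different format.
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M:%S", errors="coerce")

# Rows that violated the format can then be reviewed explicitly.
bad_rows = raw[parsed.isna()]
print(parsed)
print(bad_rows)
```

Leaving `errors` at its default of `"raise"` instead stops at the first violating value, which is closer to the behaviour requested in this report.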

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 7c48ff4
python : 3.8.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : German_Switzerland.1252

pandas : 1.2.5
numpy : 1.20.2

pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.06.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.19
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

olippuner added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jul 17, 2021

@jreback
Contributor

jreback commented Jul 17, 2021

see #12585

this is a long time issue - happy to have someone step up to fix it

it's actually not very hard

jreback added the Datetime (Datetime data dtype) and Duplicate Report (Duplicate issue or pull request) labels and removed the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jul 17, 2021
jreback added this to the No action milestone Jul 17, 2021
jreback closed this as completed Jul 17, 2021