Skip to content

to_datetime gives todays date when coercing errors of time values #23447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nickwoelk opened this issue Nov 1, 2018 · 4 comments
Closed

to_datetime gives todays date when coercing errors of time values #23447

nickwoelk opened this issue Nov 1, 2018 · 4 comments
Labels
Datetime Datetime data dtype Docs Needs Discussion Requires discussion from core team before further action

Comments

@nickwoelk
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd

d = {'start_date': ['37:04.0', '55:20.0', '44:05.3', '43:41.7', '40:35.0', '26:14.4', '57:00.7', '18:53.8', '12:23.2', '48:33.2', '36:38.1', '29:35.7', '56:02.4', '27:41.4', '11:49.9', '17:36.6', '50:44.2', '36:36.1', '21:14.7']}
data = pd.DataFrame(data=d)

data['customer_created_at'] = pd.to_datetime(data['start_date'], errors='coerce')

print (data['customer_created_at'])

Problem description

When coercing the list of time values I expect all of them to return NaT. However, a few of these come back with today's date. I tried yesterday and the returned value was:

0                     NaT
1                     NaT
2                     NaT
3                     NaT
4                     NaT
5                     NaT
6                     NaT
7     2018-10-31 18:53:48
8     2018-10-31 12:23:12
9                     NaT
10                    NaT
11                    NaT
12                    NaT
13                    NaT
14    2018-10-31 11:49:54
15    2018-10-31 17:36:36
16                    NaT
17                    NaT
18    2018-10-31 21:14:42

Today I get:

0                     NaT
1                     NaT
2                     NaT
3                     NaT
4                     NaT
5                     NaT
6                     NaT
7     2018-11-01 18:53:48
8     2018-11-01 12:23:12
9                     NaT
10                    NaT
11                    NaT
12                    NaT
13                    NaT
14    2018-11-01 11:49:54
15    2018-11-01 17:36:36
16                    NaT
17                    NaT
18    2018-11-01 21:14:42

Expected Output

All time values should return NaT when coerced

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Darwin OS-release: 18.0.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.5.1
pip: 18.0
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

@mroeschke
Copy link
Member

I am guessing this is hitting our datutil parser as it will interpret these as times corresponding to the current date:

In [1]: from dateutil import parser

In [2]: parser.parse('18:53:48')
Out[2]: datetime.datetime(2018, 11, 1, 18, 53, 48)

In general we try to stay consistent with dateutil, but I can see how this is a surprising (undocumented) behavior. Could use some discussion on whether we should just document this case (which I am in more favor of) or should coerce these to NaT.

@mroeschke mroeschke added Docs Datetime Datetime data dtype Needs Discussion Requires discussion from core team before further action labels Nov 1, 2018
@nickwoelk
Copy link
Author

Thanks for the update. It's strange that it is only happening for certain rows. I've created a workaround on my end using an if statement to prevent the to.datetime from affecting rows without a date.

@mroeschke
Copy link
Member

mroeschke commented Nov 1, 2018

Well the rows that have "time looking" values are getting parsed like this. As you can see, rows where the first value is < 23 (i.e an hour looking value) are getting parsed and the rest to NaT

@MarcoGorelli
Copy link
Member

This now throws a warning making it clear that each row will be parsed by dateutil:

In [29]: import pandas as pd
    ...: 
    ...: d = {'start_date': ['37:04.0', '55:20.0', '44:05.3', '43:41.7', '40:35.0', '26:14.4', '57:00.7', '18:53.8', '12:23.2', '48:33.2', '36:38.1', '29:35.7', '56:02.4', '27:41.4',
    ...: '11:49.9', '17:36.6', '50:44.2', '36:36.1', '21:14.7']}
    ...: data = pd.DataFrame(data=d)
    ...: 
    ...: data['customer_created_at'] = pd.to_datetime(data['start_date'], errors='coerce')
    ...: 
    ...: print (data['customer_created_at'])
<ipython-input-29-c6a01a69a58d>:6: UserWarning: Could not infer format, so each element will be parsed individually by `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  data['customer_created_at'] = pd.to_datetime(data['start_date'], errors='coerce')
0                    NaT
1                    NaT
2                    NaT
3                    NaT
4                    NaT
5                    NaT
6                    NaT
7    2022-12-17 18:53:48
8    2022-12-17 12:23:12
9                    NaT
10                   NaT
11                   NaT
12                   NaT
13                   NaT
14   2022-12-17 11:49:54
15   2022-12-17 17:36:36
16                   NaT
17                   NaT
18   2022-12-17 21:14:42
Name: customer_created_at, dtype: datetime64[ns]

I think we can close then, but thanks for the report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Docs Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

3 participants