Skip to content

Hour overflow in tz-aware datetime conversions #13306

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dlpd opened this issue May 27, 2016 · 10 comments
Closed

Hour overflow in tz-aware datetime conversions #13306

dlpd opened this issue May 27, 2016 · 10 comments
Labels
Bug Datetime Datetime data dtype Timezones Timezone data dtype
Milestone

Comments

@dlpd
Copy link

dlpd commented May 27, 2016

Comparison of tz-aware timestamps fails across DST boundaries. The comment in tslib.pyx:3845

# TODO: this assumed sortedness :/

perhaps implies this is a known problem that was never resolved, so apologies if a new issue is not appropriate.

Self contained example

import pandas as pd
import numpy as np

# Manifesting issue:                                                                                                                                                                                                                                 
ts_listA = ['2008-05-12 09:50:00-04:00',
            '2008-05-12 09:50:18-04:00',
            '2008-05-12 09:50:33-04:00',
            '2008-05-12 09:50:33-04:00',
            '2008-12-12 09:50:35-05:00',
            '2008-05-12 09:50:32-04:00',
            '2008-05-12 09:49:15-04:00',
            '2008-05-12 09:50:49-04:00',
            '2008-05-12 09:50:54-04:00']

ts_listB = ['2008-05-12 09:50:34-04:00',
            '2008-05-12 09:50:35-04:00',
            '2008-05-12 09:50:36-04:00',
            '2008-05-12 09:50:40-04:00',
            '2008-05-12 09:50:42-04:00',
            '2008-05-12 09:50:43-04:00',
            '2008-05-12 09:50:55-04:00',
            '2008-05-12 09:50:55-04:00',
            '2008-05-12 09:50:57-04:00']

df = pd.DataFrame({'dtA' : pd.to_datetime(ts_listA).tz_localize('US/Eastern'),
                   'dtB' : pd.to_datetime(ts_listB).tz_localize('US/Eastern')})
print (df.dtA - df.dtB) / pd.Timedelta('1 minutes')

# Underlying tslib.tz_convert call - similar problem                                                                                                                                                                                                 
asi8 = np.array([1210600200000000000,
                 1210600218000000000,
                 1210600233000000000,
                 1210600233000000000,
                 1229093435000000000,
                 1210600232000000000,
                 1210600155000000000,
                 1210600249000000000,
                 1210600254000000000])

tz = pd.Timestamp('2008-05-12 12:00:00').tz_localize('US/Eastern').tz
result = pd.tslib.tz_convert(asi8, 'UTC', tz)

print pd.to_datetime(asi8)
print pd.to_datetime(result)

Output

Expected output:

0        -0.566667
1        -0.283333
2        -0.050000
3        -0.116667
4    308219.883333
5        -0.183333
6        -1.666667
7        -0.100000
8        -0.050000
Name: dtA, dtype: float64

Actual output of timedelta computation:

0        -0.566667
1        -0.283333
2        -0.050000
3        -0.116667
4    308219.883333
5       -60.183333
6       -61.666667
7       -60.100000
8       -60.050000
Name: dtA, dtype: float64

Computed timedelta for rows after the 2008-12-12 date are off by an hour.

Output of tz_convert:

DatetimeIndex(['2008-05-12 13:50:00', '2008-05-12 13:50:18',
               '2008-05-12 13:50:33', '2008-05-12 13:50:33',
               '2008-12-12 14:50:35', '2008-05-12 13:50:32',
               '2008-05-12 13:49:15', '2008-05-12 13:50:49',
               '2008-05-12 13:50:54'],
              dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2008-05-12 09:50:00', '2008-05-12 09:50:18',
               '2008-05-12 09:50:33', '2008-05-12 09:50:33',
               '2008-12-12 09:50:35', '2008-05-12 08:50:32',
               '2008-05-12 08:49:15', '2008-05-12 08:50:49',
               '2008-05-12 08:50:54'],
              dtype='datetime64[ns]', freq=None)

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-504.1.3.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: None
pip: 8.1.1
setuptools: 20.3
Cython: None
numpy: 1.8.1
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
Jinja2: None
@jreback
Copy link
Contributor

jreback commented May 27, 2016

I get exactly the output you have above. What exactly is the issue?

@dlpd
Copy link
Author

dlpd commented May 27, 2016

Sorry for being unclear, the output is wrong - these timestamps are not an hour apart. The output above is the actual (buggy) output, not what it should be.

@jreback
Copy link
Contributor

jreback commented May 27, 2016

maybe the example is not clear. Can you show something and have an expected which is obvious.

@dlpd
Copy link
Author

dlpd commented May 27, 2016

The expected output is the correct timedelta - updated to reflect that.

@jreback
Copy link
Contributor

jreback commented May 27, 2016

In [36]: df = DataFrame(
{'A' : pd.to_datetime(['2008-05-12 13:50:33',
                       '2008-12-12 14:50:35',
                       '2008-05-12 13:50:32']).tz_localize('US/Eastern'), 
'B' : Timestamp('2008-05-12 13:50:33',tz='US/Eastern')}
)

In [37]: df
Out[37]: 
                          A                         B
0 2008-05-12 13:50:33-04:00 2008-05-12 13:50:33-04:00
1 2008-12-12 14:50:35-05:00 2008-05-12 13:50:33-04:00
2 2008-05-12 13:50:32-04:00 2008-05-12 13:50:33-04:00

In [39]: df.dtypes
Out[39]: 
A    datetime64[ns, US/Eastern]
B    datetime64[ns, US/Eastern]
dtype: object

In [40]: df.B-df.A
Out[40]: 
0       0 days 00:00:00
1   -215 days +22:59:58
2       0 days 01:00:01
Name: B, dtype: timedelta64[ns]

@jreback
Copy link
Contributor

jreback commented May 27, 2016

So we are not converting to UTC somewhere. Should be a straightforward fix. want to do a pull-request?

In [41]: df2 = df.astype('M8[ns]')

In [42]: df2
Out[42]: 
                    A                   B
0 2008-05-12 17:50:33 2008-05-12 17:50:33
1 2008-12-12 19:50:35 2008-05-12 17:50:33
2 2008-05-12 17:50:32 2008-05-12 17:50:33

# this also seems to have lost the name of the Series (B), oddly.
In [43]: df2.B-df2.A
Out[43]: 
0       0 days 00:00:00
1   -215 days +21:59:58
2       0 days 00:00:01
dtype: timedelta64[ns]

@jreback jreback added this to the Next Major Release milestone May 27, 2016
@dlpd
Copy link
Author

dlpd commented May 27, 2016

Please look at the second half of my comment. The (perhaps known?) bug is in tz_convert() itself, whose implementation assumes the array is sorted.

@jreback
Copy link
Contributor

jreback commented May 27, 2016

@dlpd you are reaching into the implementation. So not really sure what you are doing. You would have to demonstrate a bug using the public API which you did for the case above but not for anything else.

@dlpd
Copy link
Author

dlpd commented May 27, 2016

@jreback Thanks for the comments. You wrote:

So we are not converting to UTC somewhere

My point is that the bug is in the tz-conversion itself, not lack thereof.

If pandas.tslib.tz_convert() is not considered public, I'm happy to reproduce with DatetimeIndex.tz_convert() which is. If you trace the execution of the timedelta computation, you will see that these two examples are actually the same.

@jreback
Copy link
Contributor

jreback commented May 27, 2016

well if you'd like to submit a pull request would be great

@jreback jreback added the Bug label May 28, 2016
@jreback jreback modified the milestones: 0.18.2, Next Major Release May 31, 2016
uwedeportivo added a commit to uwedeportivo/pandas that referenced this issue Jun 2, 2016
uwedeportivo added a commit to uwedeportivo/pandas that referenced this issue Jun 2, 2016
@jreback jreback closed this as completed in ce56542 Jun 2, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants