combine_first loses index type information with MultiIndices and different timezones #13650

Closed
multiloc opened this issue Jul 14, 2016 · 4 comments
Labels: Bug, Duplicate Report, MultiIndex

Comments

@multiloc
Contributor

multiloc commented Jul 14, 2016

See the title and the example below. I believe this happens because combining indices with different timezones first converts them to object dtype, then rebases all timestamps to UTC for comparison, and finally constructs a DatetimeIndex from the result. However, this does not seem to be applied to the individual levels of a MultiIndex. This is on the latest stable release, 0.18.1.

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:tz1, tz2 = 'America/New_York', 'UTC'
:
:from1, to1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)], [pd.Timestamp('20160102', tz=tz1), pd.Timestamp('20160103', tz=tz1)]
:
:from2, to2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)], [pd.Timestamp('20160104', tz=tz2), pd.Timestamp('20160105', tz=tz2)]
:
:index1 = pd.MultiIndex.from_arrays([from1, to1])
:df1 = pd.DataFrame([1, 2], index=index1)
:
:index2 = pd.MultiIndex.from_arrays([from2, to2])
:df2 = pd.DataFrame([1, 2], index=index2)
:
:result = df1.combine_first(df2)
:--

In [4]: df1.index.get_level_values(0)
Out[4]: DatetimeIndex(['2016-01-01 00:00:00-05:00', '2016-01-02 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None)

In [5]: df2.index.get_level_values(0)
Out[5]: DatetimeIndex(['2016-01-03', '2016-01-04'], dtype='datetime64[ns, UTC]', freq=None)

In [6]: result.index.get_level_values(0)
Out[6]: 
Index([2016-01-01 00:00:00-05:00, 2016-01-02 00:00:00-05:00,
       2016-01-03 00:00:00+00:00, 2016-01-04 00:00:00+00:00],
      dtype='object')

It works correctly if the inputs have the same timezone:

In [12]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:tz1, tz2 = 'America/New_York', 'America/New_York' 
:
:from1, to1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)], [pd.Timestamp('20160102', tz=tz1), pd.Timestamp('20160103', tz=tz1)]
:
:from2, to2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)], [pd.Timestamp('20160104', tz=tz2), pd.Timestamp('20160105', tz=tz2)]
:
:index1 = pd.MultiIndex.from_arrays([from1, to1])
:df1 = pd.DataFrame([1, 2], index=index1)
:
:index2 = pd.MultiIndex.from_arrays([from2, to2])
:df2 = pd.DataFrame([1, 2], index=index2)
:
:result = df1.combine_first(df2)
:
:--

In [13]: result.index.get_level_values(0)
Out[13]: 
DatetimeIndex(['2016-01-01 00:00:00-05:00', '2016-01-02 00:00:00-05:00',
               '2016-01-03 00:00:00-05:00', '2016-01-04 00:00:00-05:00'],
              dtype='datetime64[ns, America/New_York]', freq=None)
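
Since the same-timezone case keeps the DatetimeIndex, one possible workaround is to convert every level of both MultiIndexes to a common timezone before calling combine_first. The snippet below is only a sketch under that assumption: `levels_to_utc` is a hypothetical helper written for this illustration (it is not part of pandas), and it assumes every level is already tz-aware.

```python
import pandas as pd


def levels_to_utc(df):
    """Return a copy of df whose MultiIndex levels are converted to UTC.

    Hypothetical helper for illustration; assumes every level is tz-aware.
    """
    arrays = [
        df.index.get_level_values(i).tz_convert("UTC")
        for i in range(df.index.nlevels)
    ]
    out = df.copy()
    out.index = pd.MultiIndex.from_arrays(arrays, names=df.index.names)
    return out


# Same inputs as the failing example: one frame in New York time, one in UTC.
tz1, tz2 = "America/New_York", "UTC"
from1 = [pd.Timestamp("20160101", tz=tz1), pd.Timestamp("20160102", tz=tz1)]
to1 = [pd.Timestamp("20160102", tz=tz1), pd.Timestamp("20160103", tz=tz1)]
from2 = [pd.Timestamp("20160103", tz=tz2), pd.Timestamp("20160104", tz=tz2)]
to2 = [pd.Timestamp("20160104", tz=tz2), pd.Timestamp("20160105", tz=tz2)]

df1 = pd.DataFrame([1, 2], index=pd.MultiIndex.from_arrays([from1, to1]))
df2 = pd.DataFrame([1, 2], index=pd.MultiIndex.from_arrays([from2, to2]))

# Normalize both frames first, so combine_first only sees one timezone.
result = levels_to_utc(df1).combine_first(levels_to_utc(df2))
level0 = result.index.get_level_values(0)
print(level0.dtype)
```

Converting to UTC keeps the underlying instants unchanged, so the rows line up exactly as before; only the displayed offset differs. If the original timezone matters for display, the levels can be converted back with `tz_convert` after combining.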

Behavior is correct for single indices:

In [7]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:
:tz1, tz2 = 'America/New_York', 'UTC'
:
:index1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)]
:index2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)]
:
:df1 = pd.DataFrame([1, 2], index=index1)
:df2 = pd.DataFrame([1, 2], index=index2)
:
:result = df1.combine_first(df2)
:--

In [8]: df2.index
Out[8]: DatetimeIndex(['2016-01-03', '2016-01-04'], dtype='datetime64[ns, UTC]', freq=None)

In [9]: df1.index
Out[9]: DatetimeIndex(['2016-01-01 00:00:00-05:00', '2016-01-02 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None)

In [10]: result.index
Out[10]: 
DatetimeIndex(['2016-01-01 05:00:00+00:00', '2016-01-02 05:00:00+00:00',
               '2016-01-03 00:00:00+00:00', '2016-01-04 00:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

output of pd.show_versions()

In [1]: import pandas as pd

In [2]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-88-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.6
pip: 8.1.1
setuptools: 20.3
Cython: 0.22
numpy: 1.9.2
scipy: 0.17.0
statsmodels: 0.6.1.post1
xarray: None
IPython: 3.1.0
sphinx: None
patsy: 0.2.1
dateutil: 2.4.2
pytz: 2015.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.4.3
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
@jorisvandenbossche
Member

I agree that the coercion to UTC should behave consistently between a single index and a MultiIndex. Thanks for the report!

@jreback
Contributor

jreback commented Jul 14, 2016

This is a duplicate of #10567.

@jreback jreback closed this as completed Jul 14, 2016
@jreback
Contributor

jreback commented Jul 14, 2016

The multi-index is a red herring.

@jorisvandenbossche
Member

Ah, I didn't notice that with a single index you actually get wrong datetimes (in that regard, the multi-index result is actually more correct...).

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Jul 14, 2016
jreback pushed a commit that referenced this issue Aug 6, 2016
xref #13650
Author: sinhrks <[email protected]>

Closes #13926 from sinhrks/dttz_shift_dst and squashes the following commits:

c079ee3 [sinhrks] BUG: DatetimeTz shift raises AmbiguousTimeError near DST