Skip to content

BUG: edge case when reading from postgresl with read_sql_query and datetime with tz and chunksize #11216

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 3, 2015

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Oct 2, 2015

  • When we don't specifiy a chunksize we get an object dtype which is ok
  • We create a propery datetime64[ns, tz] type, but its a pytz.FixedOffset(....)
    which ATM is not really a useful/palatable type and is mostly confusing for now.
    In the future could attempt to coerce this to a nice tz, e.g. US/Eastern, ,not sure if
    this is possible.
  • Note that this is w/o parse_dates specified

@jreback jreback added Bug Datetime Datetime data dtype IO SQL to_sql, read_sql, read_sql_query labels Oct 2, 2015
@jreback jreback added this to the 0.17.0 milestone Oct 2, 2015
@jreback
Copy link
Contributor Author

jreback commented Oct 2, 2015

@jorisvandenbossche pls have a look.

This is not a bug per-se, more of not wanting to actually coerce these ATM (as this is a new type and might be unexpected) as a user attested.

(Pdb) p data_frame
  TextCol    DateCol             DateColWithTz  IntDateCol  FloatCol  IntCol  \
0   first 2000-01-03 2000-01-01 03:00:00-05:00   535852800      10.1       1   

  BoolCol  IntColWithNull BoolColWithNull  
0   False               1           False  
(Pdb) p data_frame.dtypes
TextCol                                                       object
DateCol                                               datetime64[ns]
DateColWithTz      datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
IntDateCol                                                     int64
FloatCol                                                     float64
IntCol                                                         int64
BoolCol                                                         bool
IntColWithNull                                                 int64
BoolColWithNull                                                 bool
dtype: object
(Pdb) data_frame.DateColWithTz
0   2000-01-01 03:00:00-05:00
Name: DateColWithTz, dtype: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=-300, name=None)]

@jorisvandenbossche jorisvandenbossche added the Timezones Timezone data dtype label Oct 2, 2015
@jorisvandenbossche
Copy link
Member

@jreback thanks for working on this! xref #7364 were we also discussed some of these issues.

The main hesitation I feel for this is that with or without chunksize can give different results (what it can do anyway, but still ..). So I was thinking, maybe we should coerce the datetimes to utc either way, also if it are datetime objects.

It maybe also makes sense to return it as tz-aware data (but with utc timezone), since it is specified as aware in the database.

I didn't yet look into your updated commit, but will come back to it this evening

@jorisvandenbossche
Copy link
Member

Note that what exactly is returned from postgres depends on the postgres server timezone settings (it stores it internally as UTC, and converts to the timezone of that setting on output)

@jreback
Copy link
Contributor Author

jreback commented Oct 2, 2015

@jorisvandenbossche but that's exactly the point. I wan to coerce always to a naive tz (this is what this fixes). Its irrelevant whether you use chunksize, pass parse_dates, or use query.

as I said I think that we can remove this at some point to pass thru a 'better' tz.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 2, 2015

@jreback I was reading through the gitter chat, and the _harmonize_columns etc is only used for reading tables (so read_sql_table), as in this case we have information about the supposed types for each column. For reading a query, we just receive the values as they are fetched by the driver and feed that to DataFrame.from_records. So there is only automatic coercing of types that happens in pandas anyways (eg list of datetime.datetime objects are coerced to the datetime64 dtype).

So in this case, the starting point if we have a column 'timestamp with timezone' (for postgresql in this case), is the following:

[datetime.datetime(2012, 1, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)), 
 datetime.datetime(2012, 6, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))]

When this is feeded into a pandas objects, previously this gave an object dtype preserving the above objects. Now, in master after the introduction of the datetime tz, this gives:

In [38]: s = pd.Series([datetime.datetime(2012, 1, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)), 
               datetime.datetime(2012, 6, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))])

In [39]: s
Out[39]:
0   2012-01-01 09:00:00+01:00
1   2012-06-01 09:00:00+01:00
dtype: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)]

I agree that the above is not very useful, so coercing it to UTC is probably a good idea (= what you do in this PR, the only question is do we want naive or aware UTC).
The problem, however, is that once there is a DST change in the timeseries, you still get object dtype because the datetime tz obviously supports only a uniform timezone:

In [40]: sb = pd.Series([datetime.datetime(2012, 1, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)), 
               datetime.datetime(2012, 6, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120, name=None))])

In [41]: sb
Out[41]: 
0    2012-01-01 09:00:00+01:00
1    2012-06-01 09:00:00+02:00
dtype: object

And what I meant with chunksize giving a different result: if you chunk the above (like in the tests) in two sets of one row, you get two times a series with a uniform timezone, so they are coerced to datetime64. And when combined together with concat they are casted to naive:

In [24]: s1 = pd.Series([datetime.datetime(2012, 1, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))])

In [25]: s2 = pd.Series([datetime.datetime(2012, 6, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120, name=None))])

In [26]: s1.dtype
Out[26]:
datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)]

In [27]: s2.dtype
Out[27]:
datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=120, name=None)]

In [28]: pd.concat([s1, s2])
Out[28]:
0   2012-01-01 08:00:00
0   2012-06-01 07:00:00
dtype: datetime64[ns]

@jreback
Copy link
Contributor Author

jreback commented Oct 2, 2015

@jorisvandenbossche

We can't currently force postgres to actually create a datetime with timezone column (unless you pass the dtype manually). So unless the table is existing and created this way (and in this case it was), I am not really sure whether to return naive or converted to UTC or the actual timezone. Hmm. I think its reasonable to return datetime64[ns, UTC] as that preserves the fact that this was a timezone type (though we are 'loosing' the timezone nature itself), but that can be remedied later.

…tetime with timezone types and a chunksize, pandas-dev#11216

- When we don't specifiy a chunksize we get an object dtype
  which is ok
- We create a propery datetime64[ns, tz] type, but its a
  pytz.FixedOffset(....), which ATM is not really a
  useful/palatable type and is mostly confusing for now.
  In the future could attempt to coerce this to a nice tz,
  e.g. US/Eastern, not sure if this is possible
- Note that this is w/o parse_dates specified
jreback added a commit that referenced this pull request Oct 3, 2015
BUG: edge case when reading from postgresl with read_sql_query and datetime with tz and chunksize
@jreback jreback merged commit 071cffd into pandas-dev:master Oct 3, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype IO SQL to_sql, read_sql, read_sql_query Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants