BUG: edge case when reading from postgresl with read_sql_query and datetime with tz and chunksize #11216

jreback · 2015-10-02T01:55:57Z

When we don't specify a chunksize we get an object dtype, which is ok.
We create a proper datetime64[ns, tz] type, but it's a pytz.FixedOffset(....),
which ATM is not really a useful/palatable type and is mostly confusing for now.
In the future we could attempt to coerce this to a nice tz, e.g. US/Eastern, but I'm not sure if
this is possible.
Note that this is w/o parse_dates specified.

jreback · 2015-10-02T02:00:04Z

This is not a bug per se; it's more that we don't want to actually coerce these ATM (as this is a new type and might be unexpected), as a user attested.

(Pdb) p data_frame
  TextCol    DateCol             DateColWithTz  IntDateCol  FloatCol  IntCol  \
0   first 2000-01-03 2000-01-01 03:00:00-05:00   535852800      10.1       1   

  BoolCol  IntColWithNull BoolColWithNull  
0   False               1           False  
(Pdb) p data_frame.dtypes
TextCol                                                       object
DateCol                                               datetime64[ns]
DateColWithTz      datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
IntDateCol                                                     int64
FloatCol                                                     float64
IntCol                                                         int64
BoolCol                                                         bool
IntColWithNull                                                 int64
BoolColWithNull                                                 bool
dtype: object

(Pdb) data_frame.DateColWithTz
0   2000-01-01 03:00:00-05:00
Name: DateColWithTz, dtype: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=-300, name=None)]

jorisvandenbossche · 2015-10-02T14:16:46Z

@jreback thanks for working on this! xref #7364, where we also discussed some of these issues.

My main hesitation is that reading with or without chunksize can give different results (which can happen anyway, but still ..). So I was thinking, maybe we should coerce the datetimes to UTC either way, even if they are datetime objects.

It may also make sense to return it as tz-aware data (but with UTC timezone), since it is specified as aware in the database.

I haven't looked into your updated commit yet, but will come back to it this evening.

jorisvandenbossche · 2015-10-02T14:18:05Z

Note that what exactly is returned from postgres depends on the postgres server timezone settings (it stores the value internally as UTC, and converts it to the timezone of that setting on output).

jreback · 2015-10-02T15:07:18Z

@jorisvandenbossche but that's exactly the point. I want to always coerce to a naive tz (this is what this fixes). It's irrelevant whether you use chunksize, pass parse_dates, or use query.

As I said, I think that we can remove this at some point to pass thru a 'better' tz.

jorisvandenbossche · 2015-10-02T19:30:03Z

@jreback I was reading through the gitter chat, and the _harmonize_columns etc is only used for reading tables (so read_sql_table), as in this case we have information about the supposed types for each column. For reading a query, we just receive the values as they are fetched by the driver and feed that to DataFrame.from_records. So the only coercing of types is the automatic one that happens in pandas anyway (e.g. a list of datetime.datetime objects is coerced to the datetime64 dtype).
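A minimal sketch of that automatic coercion, assuming hand-written rows in place of a real driver result (the `rows` data and column names are made up for illustration):

```python
import datetime as dt

import pandas as pd

# Hypothetical rows, standing in for what a DB driver would return for a query.
rows = [
    (1, dt.datetime(2012, 1, 1, 9, 0)),
    (2, dt.datetime(2012, 6, 1, 9, 0)),
]

# No column type information is supplied; pandas infers the dtypes itself.
df = pd.DataFrame.from_records(rows, columns=["id", "ts"])

# The naive datetime.datetime objects end up in a datetime64 column.
print(df.dtypes)
```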

So in this case, the starting point if we have a column 'timestamp with timezone' (for postgresql in this case), is the following:

[datetime.datetime(2012, 1, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)), 
 datetime.datetime(2012, 6, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))]

When this is fed into a pandas object, previously this gave an object dtype preserving the above objects. Now, in master after the introduction of the datetime tz, this gives:

In [38]: s = pd.Series([datetime.datetime(2012, 1, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)), 
               datetime.datetime(2012, 6, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))])

In [39]: s
Out[39]:
0   2012-01-01 09:00:00+01:00
1   2012-06-01 09:00:00+01:00
dtype: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)]

I agree that the above is not very useful, so coercing it to UTC is probably a good idea (= what you do in this PR, the only question is do we want naive or aware UTC).
The problem, however, is that once there is a DST change in the timeseries, you still get object dtype because the datetime tz obviously supports only a uniform timezone:

In [40]: sb = pd.Series([datetime.datetime(2012, 1, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)), 
               datetime.datetime(2012, 6, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120, name=None))])

In [41]: sb
Out[41]: 
0    2012-01-01 09:00:00+01:00
1    2012-06-01 09:00:00+02:00
dtype: object

And what I meant by chunksize giving a different result: if you chunk the above (like in the tests) into two sets of one row, you get two series, each with a uniform timezone, so they are coerced to datetime64. And when combined with concat, they are cast to naive:

In [24]: s1 = pd.Series([datetime.datetime(2012, 1, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))])

In [25]: s2 = pd.Series([datetime.datetime(2012, 6, 1, 9, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120, name=None))])

In [26]: s1.dtype
Out[26]:
datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)]

In [27]: s2.dtype
Out[27]:
datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=120, name=None)]

In [28]: pd.concat([s1, s2])
Out[28]:
0   2012-01-01 08:00:00
0   2012-06-01 07:00:00
dtype: datetime64[ns]
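A minimal sketch of the "coerce to UTC either way" idea. It uses the stdlib `datetime.timezone` as a stand-in for `psycopg2.tz.FixedOffsetTimezone` (an assumption, so that it runs without psycopg2), a hypothetical helper `to_utc`, and `pd.to_datetime(..., utc=True)` as one possible way to do the coercion. It is not necessarily what this PR does internally:

```python
import datetime as dt

import pandas as pd

# Stand-ins (assumption) for psycopg2.tz.FixedOffsetTimezone(offset=60 / 120).
plus_one = dt.timezone(dt.timedelta(minutes=60))
plus_two = dt.timezone(dt.timedelta(minutes=120))

values = [
    dt.datetime(2012, 1, 1, 9, 0, tzinfo=plus_one),
    dt.datetime(2012, 6, 1, 9, 0, tzinfo=plus_two),
]


def to_utc(chunk):
    """Hypothetical helper: coerce one chunk of aware datetimes to UTC."""
    return pd.to_datetime(pd.Series(chunk), utc=True)


# All rows at once (mixed offsets) versus two chunks of one row each.
whole = to_utc(values)
chunked = pd.concat([to_utc(values[:1]), to_utc(values[1:])], ignore_index=True)

print(whole.dtype, chunked.dtype)
```

With the coercion applied per chunk, both paths should end up tz-aware in UTC and hold the same instants, so the result no longer depends on whether the offsets happened to be uniform within a chunk.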

jreback · 2015-10-02T21:35:23Z

@jorisvandenbossche

We can't currently force postgres to actually create a datetime with timezone column (unless you pass the dtype manually). So unless the table already exists and was created this way (and in this case it was), I am not really sure whether to return naive or converted to UTC or the actual timezone. Hmm. I think it's reasonable to return datetime64[ns, UTC] as that preserves the fact that this was a timezone type (though we are 'losing' the timezone nature itself), but that can be remedied later.
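As a rough illustration of how a user could recover a local view from a UTC-aware column afterwards, here is a sketch using `Series.dt.tz_convert`. The zone name `Europe/Brussels` is an assumption chosen to match the +01:00/+02:00 offsets in the example above; the database itself does not tell us which named zone was intended:

```python
import pandas as pd

# A UTC-aware column, as it might come back from the reader.
s = pd.Series(pd.to_datetime(["2012-01-01 08:00", "2012-06-01 07:00"], utc=True))

# Convert to a named zone (assumed here); the instants are unchanged,
# only the wall-clock representation differs.
local = s.dt.tz_convert("Europe/Brussels")

print(local)
```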

jreback added the Bug, Datetime (Datetime data dtype), and IO SQL (to_sql, read_sql, read_sql_query) labels Oct 2, 2015

jreback added this to the 0.17.0 milestone Oct 2, 2015

jreback force-pushed the datetime_with_tz branch from 111ff2a to 705e7c5 Compare October 2, 2015 01:56

jreback force-pushed the datetime_with_tz branch from 705e7c5 to 1b34ae4 Compare October 2, 2015 01:57

jreback force-pushed the datetime_with_tz branch from 1b34ae4 to f77ef3a Compare October 2, 2015 02:18

jorisvandenbossche added the Timezones (Timezone data dtype) label Oct 2, 2015

jreback force-pushed the datetime_with_tz branch from f77ef3a to 3b12f16 Compare October 2, 2015 12:46

jreback force-pushed the datetime_with_tz branch from 3b12f16 to ebb634c Compare October 2, 2015 21:47

jreback force-pushed the datetime_with_tz branch from ebb634c to bbbd5d7 Compare October 3, 2015 14:52

jreback added 2 commits October 3, 2015 11:11

use datetime64[ns, UTC] for 'datetime with timezone' sql types

bd26dec

jreback force-pushed the datetime_with_tz branch from bbbd5d7 to bd26dec Compare October 3, 2015 15:15

jreback added a commit that referenced this pull request Oct 3, 2015

Merge pull request #11216 from jreback/datetime_with_tz

071cffd

jreback merged commit 071cffd into pandas-dev:master Oct 3, 2015

jorisvandenbossche mentioned this pull request Jan 16, 2017

Inconsistent parsing for timestamp with timezone with read_sql_query #15119

Closed

ThibTrip mentioned this pull request Dec 23, 2019

pd.read_sql timestamptz converted to object dtype #30207

Open

mroeschke mentioned this pull request Nov 3, 2022

BUG: Allow tz-aware Datetime SQL columns to be passed to parse_dates kwarg. #49506

Closed
