
pd.to/read_sql_table silently corrupts Categorical columns #8624


Closed
kay1793 opened this issue Oct 24, 2014 · 4 comments · Fixed by #8682
Labels: Bug · Categorical (Categorical Data Type) · IO SQL (to_sql, read_sql, read_sql_query)
Milestone: 0.15.1

Comments


kay1793 commented Oct 24, 2014

In [29]: import pandas as pd
    ...: from sqlalchemy import create_engine
    ...: engine = create_engine('sqlite://')
    ...: df = pd.DataFrame([[1, 'John P. Doe'], [2, 'Jane Dove'], [1, 'John P. Doe']],
    ...:                   columns=['person_id', 'person_name'])
    ...: df.to_sql('data1', engine)
    ...: df['person_name'] = pd.Categorical(df.person_name)
    ...: df.to_sql('data2', engine)
    ...: print(pd.read_sql_table('data1', engine))
    ...: print(pd.read_sql_table('data2', engine))
   index  person_id  person_name
0      0          1  John P. Doe
1      1          2    Jane Dove
2      2          1  John P. Doe
   index  person_id person_name
0      0          1           J
1      1          2           o
2      2          1           h

Using a relational DB to store categorical columns in separate tables would be very cool, and rebuilding the frame in pandas via a JOIN across those tables would save time on the wire, and memory as well if the Categorical were built directly (see the sketch below).
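A minimal sketch of that normalization idea, assuming illustrative table names (person_names, people) and a hand-written JOIN; this is not an existing pandas feature:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')
df = pd.DataFrame({'person_id': [1, 2, 1],
                   'person_name': pd.Categorical(['John P. Doe', 'Jane Dove', 'John P. Doe'])})

# Store the categories once in a lookup table, and only the integer codes in the data table.
cat = df['person_name'].cat
pd.DataFrame({'code': range(len(cat.categories)),
              'person_name': cat.categories}).to_sql('person_names', engine, index=False)
df.assign(person_name=cat.codes).to_sql('people', engine, index=False)

# Rebuild the frame with a JOIN and restore the categorical dtype on the pandas side.
out = pd.read_sql_query(
    'SELECT p.person_id, n.person_name '
    'FROM people p JOIN person_names n ON p.person_name = n.code', engine)
out['person_name'] = out['person_name'].astype('category')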

kay1793 changed the title from "pd.to_sql_table silently corrupts Categorial columns" to "pd.to/read_sql_table silently corrupts Categorical columns" on Oct 24, 2014
jorisvandenbossche added the Categorical, IO SQL, and Bug labels on Oct 24, 2014
jorisvandenbossche added this to the 0.15.1 milestone on Oct 24, 2014
jorisvandenbossche (Member) commented

Thanks for the catch! This was untested.

It seems there is a problem with CategoricalBlock.values.

jorisvandenbossche (Member) commented

@jreback The values are stored differently in a CategoricalBlock compared to other blocks (different dimensions: a Categorical is always one-dimensional, in contrast to the 2D numpy arrays backing regular blocks):

In [48]: df
Out[48]:
    person_id   person_name
0   1   John P. Doe
1   2   Jane Dove
2   1   John P. Doe

In [49]:   df._data.blocks
Out[49]:
(IntBlock: slice(0L, 1L, 1), 1 x 3, dtype: int64,
 CategoricalBlock: slice(1, 2, 1), 1 x 3, dtype: category)

In [50]:   df._data.blocks[0].values.shape
Out[50]:
(1L, 3L)

In [51]:   df._data.blocks[1].values.shape
Out[51]:
(3,)

@jreback Should this be fixed in CategoricalBlock itself? Or should I just catch it in the sql functions and reshape it there appropriately when b.is_categorical?

jreback (Contributor) commented Oct 24, 2014

This is how they are stored; they are unconsolidated (as are sparse), e.g. you cannot usually combine 2 different categorical columns into a single block.

You will need to turn them into a full array, something like np.array(cat.values), but I think you use the blocks directly IIRC.

You should be using get_values() if this is a block (which will densify these types of structures).

Note that you obviously lose the fact that it is a categorical, but CSV/SQL are not able to store this type of metadata.
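As a rough illustration of that densifying step at the public Categorical level (using np.asarray here as a stand-in for the internal block-level get_values(); treating them as equivalent is an assumption):

import numpy as np
import pandas as pd

cat = pd.Categorical(['John P. Doe', 'Jane Dove', 'John P. Doe'])

# A Categorical is stored as integer codes plus a categories index, not as a 2D block.
cat.codes        # array([1, 0, 1], dtype=int8)
cat.categories   # Index(['Jane Dove', 'John P. Doe'], dtype='object')

# Densifying produces a plain object ndarray of the actual values, which is what the
# SQL writer can hand to the database driver (the category information is lost here).
np.asarray(cat)  # array(['John P. Doe', 'Jane Dove', 'John P. Doe'], dtype=object)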

jorisvandenbossche (Member) commented

Ah, yes, using get_values solves it (it gives the required 2D array).
I see: for regular Blocks, get_values just returns values; for non-consolidatable blocks it reshapes them (what I wanted to do myself). OK, easy fix then!

Yes, the categorical is just written to and read back from SQL as plain strings.
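Since the dtype does not round-trip, restoring it after reading back is straightforward (a usage sketch, assuming the data2 table and the engine from the report above):

import pandas as pd

# engine: the SQLAlchemy engine created in the report above
df2 = pd.read_sql_table('data2', engine)
df2['person_name'] = df2['person_name'].astype('category')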
