
pd.to/read_sql_table silently corrupts Categorical columns #8624


Closed
kay1793 opened this issue Oct 24, 2014 · 4 comments · Fixed by #8682
Labels: Bug · Categorical (Categorical Data Type) · IO SQL (to_sql, read_sql, read_sql_query)
Milestone: 0.15.1

Comments


kay1793 commented Oct 24, 2014

In [29]: import pandas as pd
    ...: from sqlalchemy import create_engine
    ...: engine = create_engine('sqlite://')
    ...: df = pd.DataFrame([[1, 'John P. Doe'], [2, 'Jane Dove'], [1, 'John P. Doe']],
    ...:                   columns=['person_id', 'person_name'])
    ...: df.to_sql('data1', engine)
    ...: df['person_name'] = pd.Categorical(df.person_name)
    ...: df.to_sql('data2', engine)
    ...: print(pd.read_sql_table('data1', engine))
    ...: print(pd.read_sql_table('data2', engine))
   index  person_id  person_name
0      0          1  John P. Doe
1      1          2    Jane Dove
2      2          1  John P. Doe
   index  person_id person_name
0      0          1           J
1      1          2           o
2      2          1           h

Using a relational DB to store categorical columns in separate tables would be very cool, and rebuilding the frame in pandas via a JOIN across those tables would save time on the wire, and memory as well if the Categorical were built directly (see the sketch below).
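A minimal sketch of that normalization idea, assuming illustrative table names (person_names, people) and a hand-written JOIN; this is not an existing pandas feature:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')
df = pd.DataFrame({'person_id': [1, 2, 1],
                   'person_name': pd.Categorical(['John P. Doe', 'Jane Dove', 'John P. Doe'])})

# Store the categories once in a lookup table, and only the integer codes in the data table.
cat = df['person_name'].cat
pd.DataFrame({'code': range(len(cat.categories)),
              'person_name': cat.categories}).to_sql('person_names', engine, index=False)
df.assign(person_name=cat.codes).to_sql('people', engine, index=False)

# Rebuild the frame with a JOIN and restore the categorical dtype on the pandas side.
out = pd.read_sql_query(
    'SELECT p.person_id, n.person_name '
    'FROM people p JOIN person_names n ON p.person_name = n.code', engine)
out['person_name'] = out['person_name'].astype('category')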

kay1793 changed the title from "pd.to_sql_table silently corrupts Categorial columns" to "pd.to/read_sql_table silently corrupts Categorical columns" on Oct 24, 2014
jorisvandenbossche added the Categorical, IO SQL, and Bug labels on Oct 24, 2014
jorisvandenbossche added this to the 0.15.1 milestone on Oct 24, 2014
jorisvandenbossche (Member) commented

Thanks for the catch! This was untested.

It seems there is a problem with CategoricalBlock.values.

jorisvandenbossche (Member) commented

@jreback The values are stored differently in a CategoricalBlock compared to other blocks (different dimensions: a Categorical is always one-dimensional, in contrast to the 2D numpy arrays backing regular blocks):

In [48]: df
Out[48]:
    person_id   person_name
0   1   John P. Doe
1   2   Jane Dove
2   1   John P. Doe

In [49]:   df._data.blocks
Out[49]:
(IntBlock: slice(0L, 1L, 1), 1 x 3, dtype: int64,
 CategoricalBlock: slice(1, 2, 1), 1 x 3, dtype: category)

In [50]:   df._data.blocks[0].values.shape
Out[50]:
(1L, 3L)

In [51]:   df._data.blocks[1].values.shape
Out[51]:
(3,)

@jreback Should this be fixed in CategoricalBlock itself? Or should I just catch it in the sql functions and reshape it there appropriately when b.is_categorical?

jreback (Contributor) commented Oct 24, 2014

This is how they are stored; they are unconsolidated (as are sparse), e.g. you cannot usually combine 2 different categorical columns into a single block.

You will need to turn them into a full array, something like np.array(cat.values), but I think you use the blocks directly IIRC.

You should be using get_values() if this is a block (which will densify these types of structures).

Note that you obviously lose the fact that it is a categorical, but CSV/SQL are not able to store this type of metadata.
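As a rough illustration of that densifying step at the public Categorical level (using np.asarray here as a stand-in for the internal block-level get_values(); treating them as equivalent is an assumption):

import numpy as np
import pandas as pd

cat = pd.Categorical(['John P. Doe', 'Jane Dove', 'John P. Doe'])

# A Categorical is stored as integer codes plus a categories index, not as a 2D block.
cat.codes        # array([1, 0, 1], dtype=int8)
cat.categories   # Index(['Jane Dove', 'John P. Doe'], dtype='object')

# Densifying produces a plain object ndarray of the actual values, which is what the
# SQL writer can hand to the database driver (the category information is lost here).
np.asarray(cat)  # array(['John P. Doe', 'Jane Dove', 'John P. Doe'], dtype=object)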

jorisvandenbossche (Member) commented

Ah, yes, using get_values solves it (it gives the required 2D array).
I see: for regular Blocks, get_values just returns values; for non-consolidatable blocks it reshapes them (what I wanted to do myself). OK, easy fix then!

Yes, the categorical is just written to and read back from SQL as plain strings.
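Since the dtype does not round-trip, restoring it after reading back is straightforward (a usage sketch, assuming the data2 table and the engine from the report above):

import pandas as pd

# engine: the SQLAlchemy engine created in the report above
df2 = pd.read_sql_table('data2', engine)
df2['person_name'] = df2['person_name'].astype('category')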
