Unable to write to HDF5 table if DataFrame has mixed object types (pd.Timestamp and str) #8284
Comments
You can work around this by setting the non-string object columns as data_columns (that will segregate them up front). If these are truly UTC tz-aware, then to be honest you should simply make them datetime64[ns] columns, and the problem also goes away. You are right, though: see https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L1734 for the inference on an object column (note that they could be a period type, datetime tz-aware, or an actual string). So the object block handling needs to be fixed up a bit, by further splitting of object blocks if necessary. Pull requests welcome!
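A minimal sketch of the datetime64 suggestion above, assuming a frame with one string column and one object column of tz-aware pd.Timestamp values. The column names (Strings, Timestamps) and the sample values are illustrative, not the reporter's actual data. The data_columns variant is only shown as a comment because it needs PyTables and writes a file.

```python
import pandas as pd

# Hypothetical data: two object-dtype columns holding different Python types.
df = pd.DataFrame({
    "Strings": pd.Series(["a", "b"], dtype=object),
    "Timestamps": pd.Series(
        [pd.Timestamp("2017-01-01", tz="UTC"), pd.Timestamp("2018-01-01", tz="UTC")],
        dtype=object,
    ),
})

# Convert the object column to a real datetime dtype so it no longer
# shares the object block with the strings.
fixed = df.assign(Timestamps=pd.to_datetime(df["Timestamps"], utc=True))
print(fixed.dtypes)

# Alternative from the comment above (requires PyTables, writes a file):
# df.to_hdf("test.h5", key="data", format="table", data_columns=["Timestamps"])
```

After the conversion only the Strings column remains object dtype, so there is nothing left to mix it with.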
See #7796 as well (for the period support).
FYI, the columns normally aren't handled separately; they are stored as a single block instead, as that's much more efficient (this can be controlled by specifying data_columns, though).
Thanks for the fast response. They aren't actually UTC in my application; that was just the easiest way to create a simple example. Setting as a data_column will work though, thanks for the tip. If I get a bit of time I'll look into a fix.
I can't reproduce the original example. This simple example seems to work:
In [36]: df = pd.DataFrame({"A": [1, 2], 'B': ['a', 'b'], 'C': pd.to_datetime(['2017', '2018']).tz_localize("UTC")})
In [37]: df.to_hdf('test.h5', 'data', format='table')
Let me know if that isn't representative of the original.
I am having the same issue where the use-case is storing multidimensional and variable-shape np arrays (unflattened images). I store in 'table' format and I tried adding the column to
Are there other workarounds that I can try? Also, is this issue still open to contributions (beefing up the object-block handling to work with types other than strings)?
There is no support for non-scalar types at all.
I don’t mind converting them to bytes and saving that, but that too is not supported at the moment.
@petiop you are welcome to submit a PR for this, but it’s non-trivial. I would use parquet for this.
When attempting to store data in an HDF5 table, I found a problem where an error is raised if there are multiple object columns containing different data.
This leads to an exception: TypeError: Cannot serialize the column [Timestamps] because its data contents are [datetime] object dtype
However, if I remove the string column:
Now it works fine - so it isn't a problem with using the pd.Timestamp type.
Digging a little deeper, it appears the problem is that pandas.io.pytables.Table.create_axes groups the columns by data type, with all columns of type object being grouped into one set of data. Then when set_atom is called, it does this:
This leads to an inferred type of 'mixed', since there are multiple types of objects present. That case isn't handled, so the exception is raised.
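The effect can be illustrated without writing any HDF5 file. This is only a sketch: it assumes the inference in question behaves like the public pandas.api.types.infer_dtype, and the column names and values are made up for the example.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical frame with two object columns of different Python types.
df = pd.DataFrame({
    "Strings": pd.Series(["a", "b"], dtype=object),
    "Timestamps": pd.Series(
        [pd.Timestamp("2017-01-01", tz="UTC"), pd.Timestamp("2018-01-01", tz="UTC")],
        dtype=object,
    ),
})

# Inferred per column, each one has a single clean type.
per_column = {name: infer_dtype(df[name].to_numpy()) for name in df.columns}
print(per_column)

# Flattened together, as when both columns share one object block,
# the values no longer have a single type.
block = df.to_numpy().ravel()
block_type = infer_dtype(block)
print(block_type)
```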
As a fix, it seems that each object column should be handled separately, or at least grouped by the inferred type. I haven't committed to pandas before, or dug this deeply into this section of code, so I'm not sure of the best way to fix this and what other implications there may be, but I'd be happy to help however I can.
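One way to picture the "grouped by the inferred type" idea is the sketch below. The helper group_object_columns is hypothetical and is not part of pandas or of pandas.io.pytables; it only shows how object columns could be bucketed before being written, again using the public infer_dtype.

```python
from collections import defaultdict

import pandas as pd
from pandas.api.types import infer_dtype


def group_object_columns(df):
    """Hypothetical helper: map inferred type -> names of object columns."""
    groups = defaultdict(list)
    for name in df.columns:
        if df[name].dtype == object:
            groups[infer_dtype(df[name].to_numpy())].append(name)
    return dict(groups)


# Illustrative data only; the non-object column "A" is skipped.
df = pd.DataFrame({
    "A": [1, 2],
    "Strings": pd.Series(["a", "b"], dtype=object),
    "More": pd.Series(["c", "d"], dtype=object),
    "Timestamps": pd.Series(
        [pd.Timestamp("2017-01-01"), pd.Timestamp("2018-01-01")], dtype=object
    ),
})

groups = group_object_columns(df)
print(groups)
```

Each resulting group holds a single inferred type, so it could be handled as its own block instead of one 'mixed' block.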