Unable to store np.ndarray objects as elements and store this DF as HDF #20440

Closed
fwillo opened this issue Mar 21, 2018 · 7 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO HDF5 read_hdf, HDFStore

Comments

@fwillo

fwillo commented Mar 21, 2018

I have a database storing time series of many channels in SQLite. At the moment I'm converting them to HDF5 via pandas. The database also stores many characteristic quantities such as mass and time in seconds, as well as the time signal, which is stored as a BLOB. When reading them with pd.read_sql(), the BLOBs are interpreted as strings.

When using to_hdf with format='table' to store the pandas DataFrame as HDF, the resulting file is 6-7x larger than the SQLite database. I assumed the strings are the problem, which is why I convert them to a numpy array.
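That BLOB-to-array conversion might look like the following sketch. The int32 little-endian layout is an assumption here; the real dtype would come from how the signals were written into the database:

```python
import numpy as np

# Stand-in for a SQLite BLOB: raw little-endian int32 bytes
# (hypothetical layout -- the real dtype depends on the writer).
blob = np.array([1, 2, 3], dtype='<i4').tobytes()

# Reinterpret the raw bytes with the known dtype instead of
# treating the BLOB as a string.
signal = np.frombuffer(blob, dtype='<i4')
```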

I'm running a for loop over all rows, converting each string into a numpy array with the corresponding data type; see the following Python script:

# Libraries
import pandas as pd
import numpy as np

df = pd.DataFrame({'time': [0, 1, 2],
                   'signal': [np.array([1, 2, 3], dtype='int'),
                              np.array([2, 0, 1], dtype='int'),
                              np.array([3, 3, 4], dtype='int')]})

df.to_hdf('events.h5', key='tab', mode='w', format='table',
          append=True, data_columns=['time'])

When trying to save this as HDF, I receive the following error.

<ipython-input-233-c0f602436672> in <module>()
----> 1 df.to_hdf('test.h5', format='table', append=True, key='table')

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, **kwargs)
   1469 
   1470         from pandas.io import pytables
-> 1471         return pytables.to_hdf(path_or_buf, key, self, **kwargs)
   1472 
   1473     def to_msgpack(self, path_or_buf=None, encoding='utf-8', **kwargs):

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, **kwargs)
    279         with HDFStore(path_or_buf, mode=mode, complevel=complevel,
    280                       complib=complib) as store:
--> 281             f(store)
    282     else:
    283         f(path_or_buf)

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in <lambda>(store)
    271 
    272     if append:
--> 273         f = lambda store: store.append(key, value, **kwargs)
    274     else:
    275         f = lambda store: store.put(key, value, **kwargs)

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in append(self, key, value, format, append, columns, dropna, **kwargs)
    961         kwargs = self._validate_format(format, kwargs)
    962         self._write_to_group(key, value, append=append, dropna=dropna,
--> 963                              **kwargs)
    964 
    965     def append_to_multiple(self, d, value, selector, data_columns=None,

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1339 
   1340         # write the object
-> 1341         s.write(obj=value, append=append, complib=complib, **kwargs)
   1342 
   1343         if s.is_table and index:

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
   3905         self.create_axes(axes=axes, obj=obj, validate=append,
   3906                          min_itemsize=min_itemsize,
-> 3907                          **kwargs)
   3908 
   3909         for a in self.axes:

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   3577                 self.values_axes.append(col)
   3578             except (NotImplementedError, ValueError, TypeError) as e:
-> 3579                 raise e
   3580             except Exception as detail:
   3581                 raise Exception(

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   3572                              encoding=self.encoding,
   3573                              info=self.info,
-> 3574                              **kwargs)
   3575                 col.set_pos(j)
   3576 

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in set_atom(self, block, block_items, existing_col, min_itemsize, nan_rep, info, encoding, **kwargs)
   1923                 min_itemsize,
   1924                 nan_rep,
-> 1925                 encoding)
   1926 
   1927         # set as a data block

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in set_atom_string(self, block, block_items, existing_col, min_itemsize, nan_rep, encoding)
   1955                         "Cannot serialize the column [%s] because\n"
   1956                         "its data contents are [%s] object dtype"
-> 1957                         % (item, inferred_type)
   1958                     )
   1959 

TypeError: Cannot serialize the column [signal] because
its data contents are [mixed] object dtype

Something I don't understand is why it complains about a [mixed] object dtype, although every element within a row has the same data type. I already saw #8284, which did not help in my case. I could not find other similar reports (if they exist, I'm sorry for missing them). Interestingly, this works when changing from format='table' to format='fixed'. That is not an option here, though, because I need to be able to query the store (e.g. where='time >= 132450000') as well as append data, due to the conversion process and lack of memory.

Is pandas able to store numpy arrays as elements with format='table'? This would be a very nice feature for time series in DataFrames.

Best,
fwillo

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.17.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Could you try making a smaller example? It's hard to see what's going on. Is sqlite necessary to demonstrate the bug?

@fwillo
Author

fwillo commented Mar 21, 2018

@TomAugspurger I've now removed most of the things that might get in the way; you can see the updated example in my main post. SQLite is necessary here, although it only occupies 1-2 lines of code.

@TomAugspurger
Contributor

You might want to glance through http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports :)

@fwillo
Author

fwillo commented Mar 21, 2018

@TomAugspurger I hope I've addressed your point correctly now. I updated the minimal example and the error message that comes from it; I get the same error. Sorry for the inconvenience.

@TomAugspurger
Contributor

Perfect, thanks. So you're storing arrays within the column. No, I don't think that's currently supported, and I'm not sure whether it's possible with pytables / HDF5.

In this case, your signal arrays are all the same length. Is that true in general? You might be able to store them separately as a 2-D array and concatenate later.
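The 2-D approach could look like the following sketch, assuming equal-length signals; the column names s0..s2 are placeholders, not anything pandas generates:

```python
import numpy as np
import pandas as pd

# Same toy data as in the issue.
df = pd.DataFrame({'time': [0, 1, 2],
                   'signal': [np.array([1, 2, 3]),
                              np.array([2, 0, 1]),
                              np.array([3, 3, 4])]})

# Stack the equal-length signals into one (n_rows, n_samples) array ...
signals = np.stack(df['signal'].tolist())

# ... and spread the samples over scalar columns, which format='table'
# accepts (hypothetical column names s0, s1, s2).
wide = pd.DataFrame(signals, columns=['s0', 's1', 's2'])
wide.insert(0, 'time', df['time'].values)

# This frame could then be written with, e.g.:
# wide.to_hdf('events.h5', key='tab', mode='w', format='table',
#             append=True, data_columns=['time'])

# After reading back, the per-row arrays can be reconstructed:
restored = wide[['s0', 's1', 's2']].values
```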

@TomAugspurger
Contributor

It seems like pytables and HDF5 do have some support for ragged (variable-length) arrays (http://www.pytables.org/usersguide/libref/homogenous_storage.html#the-vlarray-class), so that might be an option.

In general, storing arrays inside a Series column isn't well supported by pandas at the moment.
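A minimal sketch of the VLArray route, assuming PyTables is installed and bypassing pandas for the ragged column entirely:

```python
import os
import tempfile

import numpy as np
import tables  # PyTables

# Ragged signals of different lengths -- exactly what a plain HDF
# table cannot hold as one column.
signals = [np.array([1, 2, 3]), np.array([2, 0]), np.array([3, 3, 4, 4])]

path = os.path.join(tempfile.mkdtemp(), 'signals.h5')

# Write each variable-length signal as one VLArray row.
with tables.open_file(path, mode='w') as h5:
    vl = h5.create_vlarray(h5.root, 'signal', tables.Int64Atom(),
                           'variable-length time signals')
    for s in signals:
        vl.append(s)

# Read them back; each row comes out as an array again.
with tables.open_file(path, mode='r') as h5:
    back = [np.asarray(row) for row in h5.root.signal]
```

The scalar columns (time, mass, ...) could stay in a normal pandas HDF table in the same file, keyed by row number.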

@jreback
Contributor

jreback commented Mar 22, 2018

this is not supported in the table format at all. Columns must be of a consistent scalar dtype.
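Given that constraint, one sketch of a workaround (not from this thread's code) is a long-format table with one scalar row per sample, which keeps the store queryable and appendable:

```python
import numpy as np
import pandas as pd

# Same toy data as in the issue.
df = pd.DataFrame({'time': [0, 1, 2],
                   'signal': [np.array([1, 2, 3]),
                              np.array([2, 0, 1]),
                              np.array([3, 3, 4])]})

# One row per sample: every column is now a plain scalar dtype.
lengths = [len(s) for s in df['signal']]
long = pd.DataFrame({
    'time': np.repeat(df['time'].values, lengths),
    'sample': np.concatenate([np.arange(n) for n in lengths]),
    'value': np.concatenate(df['signal'].tolist()),
})

# This frame satisfies the table format's requirements, e.g.:
# long.to_hdf('events_long.h5', key='tab', mode='w', format='table',
#             append=True, data_columns=['time'])
```

Queries like where='time >= 1' then work on the long table, at the cost of repeating the time value once per sample.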

@jreback jreback closed this as completed Mar 22, 2018
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions IO HDF5 read_hdf, HDFStore labels Mar 22, 2018
@jreback jreback added this to the won't fix milestone Mar 22, 2018
@TomAugspurger TomAugspurger modified the milestones: won't fix, No action Jul 6, 2018