Unable to store np.ndarray objects as elements and store this DF as HDF #20440

Closed
fwillo opened this issue Mar 21, 2018 · 7 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO HDF5 read_hdf, HDFStore

Comments

@fwillo

fwillo commented Mar 21, 2018

I have a database storing time series of many channels in SQLite. At the moment I'm converting them to HDF5 via pandas. The database also stores many characteristic quantities such as mass and time in seconds, as well as the time signal, which is stored as a BLOB. When reading them with pd.read_sql(), the BLOBs are interpreted as strings.

When using to_hdf with format='table' to store the pandas DataFrame as HDF, the resulting file is 6-7x larger than the SQLite database. I assumed the strings are the problem, which is why I convert them to a numpy array.
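That BLOB-to-array conversion might look like the following sketch. The int32 little-endian layout is an assumption here; the real dtype would come from how the signals were written into the database:

```python
import numpy as np

# Stand-in for a SQLite BLOB: raw little-endian int32 bytes
# (hypothetical layout -- the real dtype depends on the writer).
blob = np.array([1, 2, 3], dtype='<i4').tobytes()

# Reinterpret the raw bytes with the known dtype instead of
# treating the BLOB as a string.
signal = np.frombuffer(blob, dtype='<i4')
```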

I'm running a for loop over all rows, converting each string into a numpy array with the corresponding data type; see the following Python script:

# Libraries
import pandas as pd
import numpy as np

df = pd.DataFrame({'time': [0, 1, 2],
                   'signal': [np.array([1, 2, 3], dtype='int'),
                              np.array([2, 0, 1], dtype='int'),
                              np.array([3, 3, 4], dtype='int')]})

df.to_hdf('events.h5', key='tab', mode='w', format='table',
          append=True, data_columns=['time'])

When trying to save this as HDF, I receive the following error.

<ipython-input-233-c0f602436672> in <module>()
----> 1 df.to_hdf('test.h5', format='table', append=True, key='table')

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, **kwargs)
   1469 
   1470         from pandas.io import pytables
-> 1471         return pytables.to_hdf(path_or_buf, key, self, **kwargs)
   1472 
   1473     def to_msgpack(self, path_or_buf=None, encoding='utf-8', **kwargs):

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, **kwargs)
    279         with HDFStore(path_or_buf, mode=mode, complevel=complevel,
    280                       complib=complib) as store:
--> 281             f(store)
    282     else:
    283         f(path_or_buf)

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in <lambda>(store)
    271 
    272     if append:
--> 273         f = lambda store: store.append(key, value, **kwargs)
    274     else:
    275         f = lambda store: store.put(key, value, **kwargs)

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in append(self, key, value, format, append, columns, dropna, **kwargs)
    961         kwargs = self._validate_format(format, kwargs)
    962         self._write_to_group(key, value, append=append, dropna=dropna,
--> 963                              **kwargs)
    964 
    965     def append_to_multiple(self, d, value, selector, data_columns=None,

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1339 
   1340         # write the object
-> 1341         s.write(obj=value, append=append, complib=complib, **kwargs)
   1342 
   1343         if s.is_table and index:

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
   3905         self.create_axes(axes=axes, obj=obj, validate=append,
   3906                          min_itemsize=min_itemsize,
-> 3907                          **kwargs)
   3908 
   3909         for a in self.axes:

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   3577                 self.values_axes.append(col)
   3578             except (NotImplementedError, ValueError, TypeError) as e:
-> 3579                 raise e
   3580             except Exception as detail:
   3581                 raise Exception(

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   3572                              encoding=self.encoding,
   3573                              info=self.info,
-> 3574                              **kwargs)
   3575                 col.set_pos(j)
   3576 

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in set_atom(self, block, block_items, existing_col, min_itemsize, nan_rep, info, encoding, **kwargs)
   1923                 min_itemsize,
   1924                 nan_rep,
-> 1925                 encoding)
   1926 
   1927         # set as a data block

/cluster/programs/miniconda/envs/miniconda-36/lib/python3.6/site-packages/pandas/io/pytables.py in set_atom_string(self, block, block_items, existing_col, min_itemsize, nan_rep, encoding)
   1955                         "Cannot serialize the column [%s] because\n"
   1956                         "its data contents are [%s] object dtype"
-> 1957                         % (item, inferred_type)
   1958                     )
   1959 

TypeError: Cannot serialize the column [signal] because
its data contents are [mixed] object dtype

Something I don't understand is why it complains about a [mixed] object dtype, although every element within a row has the same data type. I already saw #8284, which did not help in my case. I could not find other similar reports (if they exist, I'm sorry for missing them). Interestingly, this works when changing from format='table' to format='fixed'. That is not an option here, though, because I need to be able to query the store (e.g. where='time >= 132450000') as well as append data, due to the conversion process and lack of memory.

Is pandas able to store numpy arrays as elements with format='table'? This would be a very nice feature for time series in DataFrames.

Best,
fwillo

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.17.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Could you try making a smaller example? It's hard to see what's going on. Is sqlite necessary to demonstrate the bug?

@fwillo
Author

fwillo commented Mar 21, 2018

@TomAugspurger I've now removed most of the things that might get in the way; you can see the updated example in my main post. SQLite is necessary here, although it only occupies 1-2 lines of code.

@TomAugspurger
Contributor

You might want to glance through http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports :)

@fwillo
Author

fwillo commented Mar 21, 2018

@TomAugspurger I hope I've addressed your point correctly now. I updated the minimal example and the error message that comes from it; I get the same error. Sorry for the inconvenience.

@TomAugspurger
Contributor

Perfect, thanks. So you're storing arrays within the column. No, I don't think that's currently supported, and I'm not sure whether it's possible with pytables / HDF5.

In this case, your signal arrays are all the same length. Is that true in general? You might be able to store them separately as a 2-D array and concatenate later.
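The 2-D approach could look like the following sketch, assuming equal-length signals; the column names s0..s2 are placeholders, not anything pandas generates:

```python
import numpy as np
import pandas as pd

# Same toy data as in the issue.
df = pd.DataFrame({'time': [0, 1, 2],
                   'signal': [np.array([1, 2, 3]),
                              np.array([2, 0, 1]),
                              np.array([3, 3, 4])]})

# Stack the equal-length signals into one (n_rows, n_samples) array ...
signals = np.stack(df['signal'].tolist())

# ... and spread the samples over scalar columns, which format='table'
# accepts (hypothetical column names s0, s1, s2).
wide = pd.DataFrame(signals, columns=['s0', 's1', 's2'])
wide.insert(0, 'time', df['time'].values)

# This frame could then be written with, e.g.:
# wide.to_hdf('events.h5', key='tab', mode='w', format='table',
#             append=True, data_columns=['time'])

# After reading back, the per-row arrays can be reconstructed:
restored = wide[['s0', 's1', 's2']].values
```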

@TomAugspurger
Contributor

It seems like pytables and HDF5 do have some support for ragged (variable-length) arrays (http://www.pytables.org/usersguide/libref/homogenous_storage.html#the-vlarray-class), so that might be an option.

In general, storing arrays inside a Series column isn't well supported by pandas at the moment.
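A minimal sketch of the VLArray route, assuming PyTables is installed and bypassing pandas for the ragged column entirely:

```python
import os
import tempfile

import numpy as np
import tables  # PyTables

# Ragged signals of different lengths -- exactly what a plain HDF
# table cannot hold as one column.
signals = [np.array([1, 2, 3]), np.array([2, 0]), np.array([3, 3, 4, 4])]

path = os.path.join(tempfile.mkdtemp(), 'signals.h5')

# Write each variable-length signal as one VLArray row.
with tables.open_file(path, mode='w') as h5:
    vl = h5.create_vlarray(h5.root, 'signal', tables.Int64Atom(),
                           'variable-length time signals')
    for s in signals:
        vl.append(s)

# Read them back; each row comes out as an array again.
with tables.open_file(path, mode='r') as h5:
    back = [np.asarray(row) for row in h5.root.signal]
```

The scalar columns (time, mass, ...) could stay in a normal pandas HDF table in the same file, keyed by row number.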

@jreback
Contributor

jreback commented Mar 22, 2018

this is not supported in the table format at all. Columns must be of a consistent scalar dtype.
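Given that constraint, one sketch of a workaround (not from this thread's code) is a long-format table with one scalar row per sample, which keeps the store queryable and appendable:

```python
import numpy as np
import pandas as pd

# Same toy data as in the issue.
df = pd.DataFrame({'time': [0, 1, 2],
                   'signal': [np.array([1, 2, 3]),
                              np.array([2, 0, 1]),
                              np.array([3, 3, 4])]})

# One row per sample: every column is now a plain scalar dtype.
lengths = [len(s) for s in df['signal']]
long = pd.DataFrame({
    'time': np.repeat(df['time'].values, lengths),
    'sample': np.concatenate([np.arange(n) for n in lengths]),
    'value': np.concatenate(df['signal'].tolist()),
})

# This frame satisfies the table format's requirements, e.g.:
# long.to_hdf('events_long.h5', key='tab', mode='w', format='table',
#             append=True, data_columns=['time'])
```

Queries like where='time >= 1' then work on the long table, at the cost of repeating the time value once per sample.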

@jreback jreback closed this as completed Mar 22, 2018
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions IO HDF5 read_hdf, HDFStore labels Mar 22, 2018
@jreback jreback added this to the won't fix milestone Mar 22, 2018
@TomAugspurger TomAugspurger modified the milestones: won't fix, No action Jul 6, 2018