
Another HDFStore error #2784


Closed
jostheim opened this issue Jan 31, 2013 · 26 comments
Labels
Bug, IO Data (IO issues that don't fit into a more specific label)

Comments

@jostheim

After a long run of extracting features for some random forest action I ran into this when serializing the features:

Traceback (most recent call last):
File "XXXXX.py", line 1043, in <module>
write_dataframe("features", all_df, store)
File "XXXXX.py", line 55, in write_dataframe
store[name] = df
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 218, in __setitem__
self.put(key, value)
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 458, in put
self._write_to_group(key, value, table=table, append=append, **kwargs)
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 788, in _write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 1837, in write
self.write_array('block%d_values' % i, blk.values)
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 1639, in write_array
self.handle.createArray(self.group, key, value)
File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/file.py", line 780, in createArray
object=object, title=title, byteorder=byteorder)
File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/array.py", line 167, in __init__
byteorder, _log)
File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/leaf.py", line 263, in __init__
super(Leaf, self).__init__(parentNode, name, _log)
File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/node.py", line 250, in __init__
self._v_objectID = self._g_create()
File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/array.py", line 200, in _g_create
nparr, self._v_new_title, self.atom)
File "hdf5Extension.pyx", line 884, in tables.hdf5Extension.Array._createArray (tables/hdf5Extension.c:8498)
tables.exceptions.HDF5ExtError: Problems creating the Array.

The error is pretty undefined. I know the table I was writing was big, >17000 columns by >20000 rows, and there are lots of np.nan's in the columns.

Since I seem to be one of the few who are serializing massive data sets, and I have a 64GB RAM machine sitting next to me, are there any test cases I could run, or write, that would help? I'm thinking of setting up large mixed DataFrames, etc...

@jreback
Contributor

jreback commented Jan 31, 2013

you are actually creating Storer objects here (i.e. this is NOT queryable); is this what you want?

there are some test cases that are skipped; look in pandas\io\tests\test_pytables.py,
named like test_big_.....;

these are pretty big tables, mainly with lots of rows, but also a few hundred columns wide.
they don't directly test what you are doing, as I never tried to store a VERY wide table (as a put); in theory it should work.

are you going to retrieve it in its entirety?

can you give a summary picture of the frame that you are storing?
the nans: what type of columns are they in?

do you get the performance warning when trying to store this?

and try storing these as tables, store.append('df', df); you can pass chunksize= to write it a chunk at a time (see the sketch below)
http://pandas.pydata.org/pandas-docs/stable/io.html#performance

you might want to take a look at: http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
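
For illustration, a minimal sketch of the table/chunksize approach suggested above; the file name, column names, and the random frame are placeholders, not data from this issue:

import numpy as np
import pandas as pd

# placeholder frame standing in for the real feature matrix
df = pd.DataFrame(np.random.randn(20000, 500),
                  columns=['f%d' % i for i in range(500)])

store = pd.HDFStore('features.h5')
# write as a table (queryable) in row chunks instead of a single put
store.append('features', df, chunksize=5000)
result = store.select('features')   # read it back in full
store.close()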

@jostheim
Author

@jreback

Yes I am creating a store object and putting data in like

store['features'] = df

Thanks for the pointers to the tests, I'll take a look and write some large ones.

Yes I want to retrieve it entirely, I have my 64GB machine to run ML algorithms on and most of those (though not SVM) want the data in a big chunk.

I do not get a performance warning on this store b/c in actuality I try to guarantee it is all floats (again the ML libraries blow up on non-floats). However I do get the performance warnings on other df's I write that are most definitely mixed.

I'll try using append and a chunk size next, unfortunately it takes 8 hours to extract all these features, so playing with my real data is painful. Hence why I want to write some tests...

Thanks again, and I know I still owe you some data on my other ticket.

@jreback
Contributor

jreback commented Jan 31, 2013

ok..let me know.....but as an FYI....I would for sure split this into sub-tables (e.g. split by columns), then get them and concat; PyTables seems to work better (for me) when you have more row-oriented tables (rather than lots of columns)
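
A minimal sketch of that split-by-columns idea; the key names, the 100-column slice width, and the random frame are arbitrary choices for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 1000),
                  columns=['f%d' % i for i in range(1000)])

store = pd.HDFStore('features.h5')
ncols = 100  # columns per sub-table
pieces = [df[df.columns[i:i + ncols]] for i in range(0, len(df.columns), ncols)]
for n, piece in enumerate(pieces):
    store.append('features_%d' % n, piece)

# read the pieces back and glue them together column-wise on the shared index
full = pd.concat([store.select('features_%d' % n) for n in range(len(pieces))],
                 axis=1)
store.close()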

@jostheim
Author

You are surely right, I am just being lazy. I guess I'll write some wrappers to split and reform my tables for me so I can facade away the complexity.

@jreback
Contributor

jreback commented Jan 31, 2013

another plug for Tables! Look at append_to_multiple...this will split for you (you can specify, via a dict, how to split)
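
As I read the API, append_to_multiple takes a dict mapping each sub-table name to the columns it should hold (one entry may be None to catch the remaining columns) plus a selector table, and select_as_multiple joins the pieces back together. A hedged sketch with toy data and made-up key names:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 6),
                  columns=['a', 'b', 'c', 'd', 'e', 'f'])

store = pd.HDFStore('features.h5')
# split across two tables; None means "all remaining columns"
d = {'features_1': ['a', 'b', 'c'], 'features_2': None}
store.append_to_multiple(d, df, selector='features_1')

# join the sub-tables back into one frame on the shared index
full = store.select_as_multiple(['features_1', 'features_2'],
                                selector='features_1')
store.close()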

@jreback
Contributor

jreback commented Jan 31, 2013

could add this same functionality to storers as well.....

@jostheim
Author

Awesome, I'll check that out! Thanks again!

@jostheim
Author

Can't test these yet b/c my big machine is running the random forest (I wrote to csv as well as hdf5 after the last failure, so I used the csv file to read back in), but here is my code for breaking up a big table:

def write_dataframe(name, df, store):
    ''' Write a set of keys to our store representing N columns each of a larger table '''
    keys = {}
    buffered = []
    n = 0
    for col in df.columns:
        buffered.append(col)
        if len(buffered) == 100:
            # number the keys sequentially so read_dataframe can find them
            keys["{0}_{1}".format(name, n)] = buffered
            buffered = []
            n += 1
    if len(buffered) > 0:
        keys["{0}_{1}".format(name, n)] = buffered
    store.append_to_multiple(keys, df, keys.keys()[0])


def read_dataframe(name, store):
    ''' Read a set of keys from our store representing N columns each of a larger table
     and then join the pieces back into the full table. '''
    keys = []
    i = 0
    while True:
        key = "{0}_{1}".format(name, i)
        # HDFStore.keys() returns absolute path names, hence the leading "/"
        if "/" + key in store.keys():
            keys.append(key)
        else:
            break
        i += 1
    return store.select_as_multiple(keys)

Does this look reasonable?

@jreback
Contributor

jreback commented Jan 31, 2013

yes...since you are only storing 20000 rows, each table will be very fast to store. I'd say you could
do 500 columns in each table (try out combinations), assuming this is all similar-dtype data, e.g. float64/32?

and btw...I am thinking about wrapping this type of code in a Splitter class, which will save alongside the tables, to make this easier

here is a similar idiom for batching:

def create_batches(l, maxn = 250):
    """ create a list of batches, maxed at maxn """
    batches = []
    while(True):
        if len(l) <= maxn:
            if len(l) > 0:
                batches.append(l)
            break
        batches.append(l[0:maxn])
        l = l[maxn:]
    return batches
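
One hypothetical way to wire create_batches into the write path above (this assumes create_batches as defined in this comment is in scope; the frame, file name, and batch size are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 600),
                  columns=['f%d' % i for i in range(600)])
name = 'features'

store = pd.HDFStore('features.h5')
# create_batches is the helper defined just above
batches = create_batches(list(df.columns), maxn=250)
keys = dict(('{0}_{1}'.format(name, i), cols) for i, cols in enumerate(batches))
store.append_to_multiple(keys, df, selector=list(keys.keys())[0])

full = store.select_as_multiple(sorted(keys.keys()))
store.close()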

@jostheim
Author

If indeed there is a problem with storing large tables, then encapsulating this inside pandas might be a good idea. My features data is all float64/32, but my other intermediate files are not (they are mixed types), so it might be slow until I get to the actual feature extraction. I don't really care about speed as long as I can be sure that my work in progress is not going to be lost. With pickle not working on large amounts of data, I think pandas really needs a reliable alternative for saving work; hopefully this will be it.

I'll test as soon as my random forest finishes...

@jreback
Contributor

jreback commented Jan 31, 2013

this should work, and your dataset actually isn't that big (in row-space); you have a large column-space.
pytables/hdf5 is designed for many rows (like many millions)
mixed types should work as well

I think this solution will work; you may also want to give carray a try (it's actually by the same author as PyTables), but there is no interface from pandas
https://github.com/FrancescAlted/carray

@jostheim
Author

Agreed, my data isn't really THAT big, but it is big the wrong way :) If this works then I am a happy camper. I'd like to keep everything in pandas; it was invaluable in joining together all the data into this monstrosity I am serializing.

At the very least this thread will hopefully point out a solution to anyone with the same problems! Once I test I'll close this up.

@jreback
Contributor

jreback commented Jan 31, 2013

your use case is interesting (and different) from others I have been seeing lately
it's all about data orientation and the use case for how HDFStore should be storing things
we have straight stores and tables now; each has different uses
I think yours (and one which is essentially a really giant table on disk)
need to be handled (prob with some hints when you are storing)

btw I think you might want to pass index=False to append (as you really don't need indices and they take some time to create)

also definitely try using a csv file - as I think you may need to experiment a bit to optimize your performance

u can directly email if u would like
[email protected]

@jostheim
Author

jostheim commented Feb 1, 2013

I tried running with the code above and got the following:

time.datetime' has no len()
Traceback (most recent call last):
  File "XxXXX.py", line 803, in get_joined_data
    write_dataframe("{0}joined_{1}".format(prefix, date_prefix), df, store)
  File "XXXX.py", line 74, in write_dataframe
    store.append_to_multiple(keys, df, keys.keys()[0])
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 591, in append_to_multiple
    self.append(k, val, data_columns=dc, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 532, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 788, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 2491, in write
    min_itemsize=min_itemsize, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 2254, in create_axes
 raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
Exception: cannot find the correct atom type -> [dtype->object,items->Index([....object of type 'datetime.datetime' has no len()

Same thing for datetime64's. The columns have nan's expressed as np.nan. Not quite sure what is going on.

@jreback
Contributor

jreback commented Feb 1, 2013

assume you are running with 0.10.1
have a look at PR #2752 (first example)

you MUST have datetime64[ns] dtypes in the columns in order to store them; you CANNOT have np.nan (this is a float type, so the column will end up as object). Instead the nan MUST be NaT

if you do not (and the exception tells you that), then cast them like this (this is a bug which is fixed in the above PR)

df['datetime'] = Series(df['datetime'].values,dtype='M8[ns]')

you also can't have datetime.datetime objects; again, cast like the above and they will be converted to Timestamps

PyTables cannot store 'object' types efficiently, so we don't allow it; HDFStore also cannot guess, as 'object' generally represents a string (but can also be unicode, or other types which cannot be serialized)

if you can post a str(df), that would be helpful
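
To make the cast above concrete, a minimal sketch on a toy frame (the column and file names are arbitrary): the object column holding datetime.datetime values and np.nan becomes datetime64[ns] with NaT, after which the append works:

import datetime
import numpy as np
import pandas as pd
from pandas import Series

df = pd.DataFrame({'datetime': [datetime.datetime(2013, 1, 31), np.nan]})
# dtype here is object: python datetimes mixed with a float nan

df['datetime'] = Series(df['datetime'].values, dtype='M8[ns]')
# dtype is now datetime64[ns]; the nan has become NaT

store = pd.HDFStore('test.h5')
store.append('df', df)   # no atom-type error once the dtype is datetime64[ns]
store.close()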

@jreback
Contributor

jreback commented Feb 2, 2013

I added some docs here: jreback@673f91c

@jostheim
Author

jostheim commented Feb 2, 2013

No luck for me. I tried making sure my parser used np.datetime64('nat') and np.datetime64(dateutil.parser.parse()), and did the casting you suggested, and I got nowhere. At this point I am just going to go back to how I was doing it before.

At some point I will end up just writing a (or probably overriding the default) csv serializer that maintains column types, b/c it is getting a bit ridiculous how difficult it is to simply save out a dataframe and not lose all my type information.

@jreback
Contributor

jreback commented Feb 2, 2013

can you post a small data sample, OS, pandas, and numpy versions,
and how you are reading and converting (the code)?

you should not normally even use np.datetime64 directly;
np.datetime64('nat') doesn't do anything (and is not valid code)

read_csv is a fantastic piece of code, no need to reinvent the wheel

@jostheim
Author

jostheim commented Feb 6, 2013

Sorry, trying to get stuff done and ignoring my serialization problems for now. read_csv is good, except it does not keep dtypes on the columns, which makes it painful when I have a datetime column that comes back as an object...

I'd suggest to_csv/read_csv should optionally include column type information in the header, and maybe even do the proper casting, but I can write that myself.

@jreback
Contributor

jreback commented Feb 7, 2013

@jostheim
Author

jostheim commented Feb 7, 2013

Sure, I just want that embedded in the header, so I don't have to re-specify it or serialize the dtypes separately and apply them separately after read_csv. I think it would be a good idea to have a to_csv that when you read_csv it recreates the dataframe exactly as it was when you wrote it without further work. Right now that is not the case (I think).

@jreback
Contributor

jreback commented Feb 7, 2013

that's being worked on, and datetimes are being improved. if you would post a sample of your data I can try to help you out.

@jostheim
Author

jostheim commented Feb 7, 2013

No doubt and I am going to try and help out, I am just on a deadline and just hacking away with whatever works.

@jostheim
Author

Alright I finally got around to attempting to patch to_csv and read_csv to preserve types. I just monkey-patched over the existing methods to start b/c I am not very familiar with the inner-workings and didn't want to interfere with the things I did have working.

import pandas as pd
import numpy as np
from pandas.core import frame
from pandas.core.index import MultiIndex
from pandas.tseries.period import PeriodIndex
import pandas.lib as lib
import pandas.core.common as com
import csv
import dateutil
import pytz
from pytz import timezone



def parse_date_time(val):
    if val is not np.nan:
        #'2012-11-12 17:30:00+00:00
        try:
            datetime_obj = dateutil.parser.parse(val)
            datetime_obj = datetime_obj.replace(tzinfo=timezone('UTC'))
            datetime_obj = datetime_obj.astimezone(timezone('UTC'))
            return datetime_obj
        except ValueError as e:
#            print e
            return np.nan
    else:
        return np.nan


def _my_helper_csv(self, writer, na_rep=None, cols=None,
                header=True, index=True,
                index_label=None, float_format=None, write_dtypes=None):
    if cols is None:
        cols = self.columns

    series = {}
    for k, v in self._series.iteritems():
        series[k] = v.values

    has_aliases = isinstance(header, (tuple, list, np.ndarray))
    if has_aliases or header:
        if index:
            # should write something for index label
            if index_label is not False:
                if index_label is None:
                    if isinstance(self.index, MultiIndex):
                        index_label = []
                        for i, name in enumerate(self.index.names):
                            if name is None:
                                name = ''
                            index_label.append(name)
                    else:
                        index_label = self.index.name
                        if index_label is None:
                            index_label = ['']
                        else:
                            index_label = [index_label]
                elif not isinstance(index_label, (list, tuple, np.ndarray)):
                    # given a string for a DF with Index
                    index_label = [index_label]

                encoded_labels = list(index_label)
            else:
                encoded_labels = []

            if has_aliases:
                if len(header) != len(cols):
                    raise ValueError(('Writing %d cols but got %d aliases'
                                      % (len(cols), len(header))))
                else:
                    write_cols = header
            else:
                write_cols = cols
            encoded_cols = list(write_cols)
            if write_dtypes:
                for j, col in enumerate(cols):
                    encoded_cols[j] = "{0}:{1}".format(col, self._series[col].dtype)
            writer.writerow(encoded_labels + encoded_cols)
        else:
            encoded_cols = list(cols)
            if write_dtypes:
                for j, col in enumerate(cols):
                    encoded_cols[j] = "{0}:{1}".format(col, self._series[col].dtype)
            writer.writerow(encoded_cols)

    data_index = self.index
    if isinstance(self.index, PeriodIndex):
        data_index = self.index.to_timestamp()

    nlevels = getattr(data_index, 'nlevels', 1)
    for j, idx in enumerate(data_index):
        row_fields = []
        if index:
            if nlevels == 1:
                row_fields = [idx]
            else:  # handle MultiIndex
                row_fields = list(idx)
        for i, col in enumerate(cols):
            val = series[col][j]
            if lib.checknull(val):
                val = na_rep

            if float_format is not None and com.is_float(val):
                val = float_format % val
            elif isinstance(val, np.datetime64):
                val = lib.Timestamp(val)._repr_base

            row_fields.append(val)

        writer.writerow(row_fields)

def my_to_csv(self, path_or_buf, sep=",", na_rep='', float_format=None,
               cols=None, header=True, index=True, index_label=None,
               mode='w', nanRep=None, encoding=None, quoting=None,
               line_terminator='\n', write_dtypes=None):
        """
        Write DataFrame to a comma-separated values (csv) file

        Parameters
        ----------
        path_or_buf : string or file handle / StringIO
            File path
        sep : character, default ","
            Field delimiter for the output file.
        na_rep : string, default ''
            Missing data representation
        float_format : string, default None
            Format string for floating point numbers
        cols : sequence, optional
            Columns to write
        header : boolean or list of string, default True
            Write out column names. If a list of string is given it is
            assumed to be aliases for the column names
        index : boolean, default True
            Write row names (index)
        index_label : string or sequence, or False, default None
            Column label for index column(s) if desired. If None is given, and
            `header` and `index` are True, then the index names are used. A
            sequence should be given if the DataFrame uses MultiIndex.  If
            False do not print fields for index names. Use index_label=False
            for easier importing in R
        nanRep : deprecated, use na_rep
        mode : Python write mode, default 'w'
        encoding : string, optional
            a string representing the encoding to use if the contents are
            non-ascii, for python versions prior to 3
        line_terminator: string, default '\n'
            The newline character or character sequence to use in the output
            file
        quoting : optional constant from csv module
            defaults to csv.QUOTE_MINIMAL
        """
        if nanRep is not None:  # pragma: no cover
            import warnings
            warnings.warn("nanRep is deprecated, use na_rep",
                          FutureWarning)
            na_rep = nanRep

        if hasattr(path_or_buf, 'read'):
            f = path_or_buf
            close = False
        else:
            f = com._get_handle(path_or_buf, mode, encoding=encoding)
            close = True

        if quoting is None:
            quoting = csv.QUOTE_MINIMAL

        try:
            if encoding is not None:
                csvout = com.UnicodeWriter(f, lineterminator=line_terminator,
                                           delimiter=sep, encoding=encoding,
                                           quoting=quoting)
            else:
                csvout = csv.writer(f, lineterminator=line_terminator,
                                    delimiter=sep, quoting=quoting)
            self._helper_csv(csvout, na_rep=na_rep,
                             float_format=float_format, cols=cols,
                             header=header, index=index,
                             index_label=index_label, write_dtypes=write_dtypes)

        finally:
            if close:
                f.close()

pd.core.frame.DataFrame.to_csv = my_to_csv
pd.core.frame.DataFrame._helper_csv = _my_helper_csv

from pandas.io import parsers

def my_read_csv(filepath_or_buffer, sep=',', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine='c', delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, read_dtypes=None):
    df = pd.read_csv(filepath_or_buffer, sep=sep, dialect=dialect, compression=compression, doublequote=doublequote, escapechar=escapechar, quotechar=quotechar, quoting=quoting, skipinitialspace=skipinitialspace, lineterminator=lineterminator, header=header, index_col=index_col, names=names, prefix=prefix, skiprows=skiprows, skipfooter=skip_footer, skip_footer=skip_footer, na_values=na_values, true_values=true_values, false_values=false_values, delimiter=delimiter, converters=converters, dtype=dtype, usecols=usecols, engine=engine, delim_whitespace=delim_whitespace, as_recarray=as_recarray, na_filter=na_filter, compact_ints=compact_ints, use_unsigned=use_unsigned, low_memory=low_memory, buffer_lines=buffer_lines, warn_bad_lines=warn_bad_lines, error_bad_lines=error_bad_lines, keep_default_na=keep_default_na, thousands=thousands, comment=comment, decimal=decimal, parse_dates=parse_dates, keep_date_col=keep_date_col, dayfirst=dayfirst, date_parser=date_parser, memory_map=memory_map, nrows=nrows, iterator=iterator, chunksize=chunksize, verbose=verbose, encoding=encoding, squeeze=squeeze)
    if read_dtypes:
        for col, series in df.iteritems():
            splits= col.split(":")
            read_dtype = splits[1]
            if str(series.dtype) != read_dtype:
                if read_dtype == "datetime.datetime":
                    series = series.apply(lambda x: parse_date_time(x))
                    series = pd.Series(series.values,dtype='M8[ns]')
                elif "datetime64" in read_dtype:
                    series = series.apply(lambda x: parse_date_time(x))
                    series = pd.Series(series.values, dtype='M8[ns]')
                elif read_dtype == "float" or read_dtype == "float64" or read_dtype == "float32":
                    series = series.astype(np.float64)
                elif read_dtype == "int" or read_dtype == "int64" or read_dtype == "int32":
                    series = series.astype(np.int64)
                elif read_dtype == "bool" or read_dtype == "bool_":
                    series = series.astype(np.bool_)
                elif read_dtype == "complex_":
                    series = series.astype(np.complex_)
                df[col] = series
    return df

pd.my_read_csv = my_read_csv

Quick test case:

'''
Created on Feb 18, 2013

@author: jostheim
'''
import unittest
import pandas as pd
import numpy as np
import datetime
import pandas_extensions

class Test(unittest.TestCase):


    def test_reading_and_writing(self):
        df = pd.DataFrame({'a':[1,2,4,7], 'b':[1.2, 2.3, 5.1, 6.3], 
                    'c':list('abcd'), 
                    'd':[datetime.datetime(2001,1,1),datetime.datetime(2001,1,2),np.nan, datetime.datetime(2012,11,2)] })
        df['d'] = pd.Series(df['d'].values, dtype='M8[ns]')
        df.to_csv("/tmp/test.csv", write_dtypes=True)
        new_df = pd.my_read_csv("/tmp/test.csv", index_col=0, read_dtypes=True)
        for i, t in enumerate(df.dtypes):
            print t, new_df.dtypes[i]
            self.assertEqual(t, new_df.dtypes[i], "dtypes match")


if __name__ == "__main__":
    #import sys;sys.argv = ['', 'Test.test_reading_and_writing']
    unittest.main()

The general idea is just to write out (optionally) the dtype into the header and then, when we read the csv back in, parse the dtype out and do a proper conversion. Obviously what I have here is very simple and doesn't take into account many things (like a column name that has a ":" in it, since I use that as the dtype delimiter during serialization).

Anyway, I wanted to post this b/c it does allow the current version to be monkey-patched to handle dates a little bit better with read_csv. I'd be happy to try to integrate this into the core code for the next release, but I probably need some coaching on the ideal way to handle this in a "pandas" way.

@jreback
Contributor

jreback commented Feb 18, 2013

this is an interesting idea.....more of @wesm's purview...

on the original issue, however:

if you update to master you can use

df.convert_objects(convert_dates='coerce') to clean up your frames
(this will force conversion to datetime64[ns]); invalid entries will be marked as NaT
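
A small sketch of that call on a toy frame, assuming the then-current master where convert_objects accepts convert_dates='coerce':

import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'d': [datetime.datetime(2013, 2, 18), 'not a date', np.nan]})
# 'd' is object dtype; coerce anything date-like, mark the rest as NaT
cleaned = df.convert_objects(convert_dates='coerce')
# cleaned['d'].dtype is now datetime64[ns]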

@jostheim
Author

Good stuff, I need to give the update to master another shot. I think even in master, it would be useful to have an option for to_csv and read_csv to explicitly keep dtype information through serialization/deserialization. HDFStore is really more than a lot of people need, especially when just saving working copies of things as backup.
