Another HDFStore error #2784
How are you actually creating/storing this? There are some test cases that are skipped that cover pretty big tables, mainly ones that have lots of rows, but also some that are a few hundred columns wide. Are you going to retrieve it in its entirety? Can you give a summary picture of the frame that you are storing? Do you get the PerformanceWarning? Try to store these as tables; you might want to take a look at: http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
Yes, I am creating a store object and putting data in with straight assignment (store[name] = df).
Thanks for the pointers to the tests, I'll take a look and write some large ones. Yes, I want to retrieve it entirely; I have my 64GB machine to run ML algorithms on and most of those (though not SVM) want the data in a big chunk. I do not get a PerformanceWarning on this store b/c in actuality I try to guarantee it is all floats (again, the ML libraries blow up on non-floats). However, I do get the performance warnings on other df's I write that are most definitely mixed. I'll try using append and a chunk size next; unfortunately it takes 8 hours to extract all these features, so playing with my real data is painful. Hence why I want to write some tests... Thanks again, and I know I still owe you some data on my other ticket.
ok..let me know.....but as an fyi....I would for sure split this into sub-tables (e.g. split by columns), then get them and concat - pytables seems to work better (for me) when you have more row-oriented tables (rather than lots of columns)
You are surely right, I am just being lazy. I guess I'll write some wrappers to split and reform my tables for me so I can facade away the complexity.
another plug for Tables!, look at append_to_multiple...this will split for you (you can specify a dict for how to split)
could add this same functionality to storers as well.....
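For reference, a minimal sketch of how that split-by-columns API can be used (the table names, column split, and file name below are just illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(1000, 6),
                      columns=['a', 'b', 'c', 'd', 'e', 'f'])

    store = pd.HDFStore('split_example.h5')

    # split one frame across two tables; None means "all remaining columns"
    store.append_to_multiple({'piece_1': ['a', 'b', 'c'], 'piece_2': None},
                             df, selector='piece_1')

    # join the pieces back into the original frame
    result = store.select_as_multiple(['piece_1', 'piece_2'], selector='piece_1')
    store.close()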
Awesome, I'll check that out! Thanks again!
Can't test these yet b/c my big machine is running the random forest (I wrote to csv as well as hdf5 after the last failure, so I used the csv file to read back in), but here is my code for breaking up a big table:

    def write_dataframe(name, df, store):
        ''' Write a set of keys to our store representing N columns each of a larger table '''
        keys = {}
        buffered = []
        n = 0
        for col in df.columns:
            buffered.append(col)
            if len(buffered) == 100:
                # flush every 100 columns into its own sub-table, numbered sequentially
                keys["{0}_{1}".format(name, n)] = buffered
                buffered = []
                n += 1
        if len(buffered) > 0:
            # whatever columns are left over
            keys["{0}_{1}".format(name, n)] = buffered
        # the first sub-table acts as the selector
        store.append_to_multiple(keys, df, keys.keys()[0])

    def read_dataframe(name, store):
        ''' Read a set of keys from our store representing N columns each of a larger table
        and then join the pieces back into the full table. '''
        keys = []
        i = 0
        while True:
            # store.keys() returns '/'-prefixed paths
            if "/{0}_{1}".format(name, i) in store.keys():
                keys.append("{0}_{1}".format(name, i))
            else:
                break
            i += 1
        return store.select_as_multiple(keys)

Does this look reasonable?
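A quick usage sketch for the two helpers above (the toy frame and file name are made up):

    import numpy as np
    import pandas as pd

    all_df = pd.DataFrame(np.random.randn(500, 250),
                          columns=['c%d' % i for i in range(250)])  # wide toy frame

    store = pd.HDFStore('features.h5')
    write_dataframe('features', all_df, store)    # splits into features_0, features_1, ...
    restored = read_dataframe('features', store)  # joins the pieces back together
    store.close()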
yes...since you are only storing 20000 rows, each table will be very fast to store. btw, I am thinking about wrapping this type of code in a Splitter class, which will save alongside the tables, to make this easier. here is a similar idiom to batch the appends:
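A sketch of such a batching idiom, assuming the frame is appended in row chunks (the chunk size, key, and file name are illustrative, and .iloc assumes pandas 0.11 or later):

    import numpy as np
    import pandas as pd

    def append_in_chunks(store, key, df, chunksize=10000):
        ''' append df to store[key] one block of rows at a time '''
        for start in range(0, len(df), chunksize):
            store.append(key, df.iloc[start:start + chunksize])

    df = pd.DataFrame(np.random.randn(20000, 50),
                      columns=['c%d' % i for i in range(50)])

    store = pd.HDFStore('batched.h5')
    append_in_chunks(store, 'features', df)
    store.close()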
If indeed there is a problem with storing large tables, then encapsulation of this inside pandas might be a good idea. My features data is all float64/32, but my other intermediate files are not (they are mixed types), so it might be slow till I get to the actual feature extraction. I don't really care about speed as long as I can be sure that my work in progress is not going to be lost. Without pickle working on large amounts of data I think pandas really needs a reliable alternative for saving work, hopefully this will be it. I'll test as soon as my random forest finishes...
this should work, and your dataset actually isn't that big (in row-space); you have a large column-space. I think this solution will work.
Agreed, my data isn't really THAT big, but it is big the wrong way :) If this works then I am a happy camper, I'd like to keep everything in pandas, it was invaluable in joining together all the data into this monstrosity I am serializing. At the very least this thread will hopefully point out a solution to anyone with the same problems! Once I test I'll close this up.
your use case is interesting (and different) from others I have been seeing lately. btw I think u might be able to pass index=False to append (as u really don't need indices and they take some time to create). also definitely try using a csv file, as I think u may need to experiment a bit to optimize your performance. u can directly email me if u would like
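For reference, skipping index creation on append looks something like this (file and key names are illustrative); the table index can still be built once at the end with create_table_index:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(10000, 5), columns=list('abcde'))

    store = pd.HDFStore('no_index.h5')
    store.append('features', df, index=False)   # don't build the PyTables index on each append
    store.create_table_index('features')        # optionally create it once, after all appends
    store.close()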
I tried running with the code above and got the following:
Same thing for datetime64's. The columns have nan's expressed as np.nan. Not quite sure what is going on.
assume you are running with 0.10.1. you MUST have datetime64[ns] dtypes in the columns in order to store; you CANNOT have np.nan (this is a float type), and thus the column will be object. instead the nan MUST be NaT. if you do not (and the exception tells you that), then cast them like this (this is a bug which is fixed in the above PR):
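A sketch of one way to do such a cast with pd.to_datetime (the column name ts is just for illustration); the np.nan holes come back as NaT and the column ends up datetime64[ns]:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'ts': ['2013-01-01', np.nan, '2013-02-15']})

    # object column with np.nan -> datetime64[ns] column with NaT
    df['ts'] = pd.to_datetime(df['ts'])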
you also can't have python datetimes; again cast like the above and they will be converted to Timestamps. PyTables cannot store 'object' types efficiently, so we don't allow it; HDFStore also cannot guess, as 'object' generally represents a string (but can also be unicode, or other types which cannot be serialized). if you can post a str(df) that would be helpful

I added some docs here: jreback@673f91c
No luck for me. I tried making sure my parser used np.datetime64('nat') and np.datetime64(dateutil.parser.parse()) and doing the casting you suggested, and I got nowhere. At this point I am just going to go back to how I was doing it before. At some point I will end up writing a (or probably overriding the default) csv serializer that maintains column types, b/c it is getting a bit ridiculous how difficult it is to simply save out a dataframe and not lose all my type information.

can u post a small data sample, os, pandas, and numpy version? you should not normally even use np.datetime64 directly. read_csv is a fantastic piece of code, no need to reinvent the wheel

Sorry, trying to get stuff done and ignore my serialization problems for now. read_csv is good, except it does not keep dtypes on the columns, which makes it painful when I have a datetime column that then comes back as object... I'd suggest read_csv should optionally include column type information in the header, and maybe even do the proper casting, but I can write that myself.

Sure, I just want that embedded in the header, so I don't have to re-specify it or serialize the dtypes separately and apply them after read_csv. I think it would be a good idea to have a to_csv such that when you read_csv it recreates the dataframe exactly as it was when you wrote it, without further work. Right now that is not the case (I think).

that's being worked on, and datetimes are being improved. if you would post a sample of your data I can try to help you out.

No doubt, and I am going to try and help out; I am just on a deadline and hacking away with whatever works.
Alright I finally got around to attempting to patch to_csv and read_csv to preserve types. I just monkey-patched over the existing methods to start b/c I am not very familiar with the inner-workings and didn't want to interfere with the things I did have working.
Quick test case:
The general idea is just to write out (optionally) the dtype into the header and then, when we read the csv back in, parse out the dtype and do a proper conversion. Obviously what I have here is very simple and doesn't take into account many things (like a column name having a ":" in it, since I use that as the dtype delimiter during serialization). Anyway, I wanted to post this b/c it does allow the current version to be monkey-patched to handle dates a little bit better with read_csv. I'd be happy to try to integrate this into the core code for the next release, but I probably need some coaching on the ideal way to handle this in a "pandas" way.
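A rough sketch of that idea, written as standalone helpers rather than a monkey-patch (the helper names are made up, and this ignores the edge cases mentioned, e.g. ':' inside a column name):

    import pandas as pd

    def to_csv_with_dtypes(df, path):
        ''' write the frame with headers encoded as "colname:dtype" '''
        out = df.copy()
        out.columns = ['%s:%s' % (c, df[c].dtype) for c in df.columns]
        out.to_csv(path, index=False)

    def read_csv_with_dtypes(path):
        ''' read the frame back and re-apply the dtype encoded in each header '''
        df = pd.read_csv(path)
        names, dtypes = zip(*[c.rsplit(':', 1) for c in df.columns])
        df.columns = list(names)
        for name, dtype in zip(names, dtypes):
            if dtype.startswith('datetime'):
                df[name] = pd.to_datetime(df[name])   # datetime columns come back as object otherwise
            else:
                df[name] = df[name].astype(dtype)
        return df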
this is an interesting idea.....more of @wesm's purview... on the original issue however, if you update to master you can use the datetime fix from the PR referenced above
Good stuff, I need to give the update to master another shot. I think even in master, it would be useful to have an option for to_csv and read_csv to explicitly keep dtype information through serialization/deserialization. HDFStore is really more than a lot of people need, especially when just saving working copies of things as backup.
After a long run of extracting features for some random forest action I ran into this when serializing the features:
Traceback (most recent call last):
  File "XXXXX.py", line 1043, in <module>
    write_dataframe("features", all_df, store)
  File "XXXXX.py", line 55, in write_dataframe
    store[name] = df
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 218, in __setitem__
    self.put(key, value)
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 458, in put
    self._write_to_group(key, value, table=table, append=append, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 788, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 1837, in write
    self.write_array('block%d_values' % i, blk.values)
  File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 1639, in write_array
    self.handle.createArray(self.group, key, value)
  File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/file.py", line 780, in createArray
    object=object, title=title, byteorder=byteorder)
  File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/array.py", line 167, in __init__
    byteorder, _log)
  File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/leaf.py", line 263, in __init__
    super(Leaf, self).__init__(parentNode, name, _log)
  File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/node.py", line 250, in __init__
    self._v_objectID = self._g_create()
  File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/array.py", line 200, in _g_create
    nparr, self._v_new_title, self.atom)
  File "hdf5Extension.pyx", line 884, in tables.hdf5Extension.Array._createArray (tables/hdf5Extension.c:8498)
tables.exceptions.HDF5ExtError: Problems creating the Array.
The error is pretty undefined; I know the table I was writing was big, >17000 columns by >20000 rows. There are lots of np.nan's in the columns.
Since I seem to be one of the few who are serializing massive sets, and I have a 64GB RAM machine sitting next to me, are there some test cases that I can run, or write that would help? Thinking setting up large mixed dataframes etc...
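For instance, a sketch of how such a test frame could be constructed (sizes here are small defaults for a quick run; bump nrows/nfloat toward the 20000 x 17000 shape above on a big-memory box; all names are illustrative):

    import numpy as np
    import pandas as pd

    def make_mixed_frame(nrows=2000, nfloat=1700, nother=25):
        ''' build a wide, mostly-float frame with nans plus some string/datetime columns '''
        data = {}
        for i in range(nfloat):
            col = np.random.randn(nrows)
            col[np.random.rand(nrows) < 0.1] = np.nan   # sprinkle in some nans
            data['f%d' % i] = col
        for i in range(nother):
            data['s%d' % i] = ['item%d' % j for j in range(nrows)]
            data['d%d' % i] = pd.date_range('2013-01-01', periods=nrows, freq='T')
        return pd.DataFrame(data)

    big = make_mixed_frame()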