WIP: add pd.read_ipc and DataFrame.to_ipc to provide efficient serialization to/from memory #15907

Closed
wants to merge 2 commits into from

Conversation

jreback
Contributor

@jreback jreback commented Apr 5, 2017

No description provided.

@jreback jreback added IO Data IO issues that don't fit into a more specific label Enhancement labels Apr 5, 2017
@jreback
Contributor Author

jreback commented Apr 5, 2017

cc @mrocklin
@wesm @cpcloud

@chris-b1
Contributor

chris-b1 commented Apr 5, 2017

Given that this is a pretty 'advanced' feature (and presumably an unstable format?), maybe it would make sense to expose these functions somewhere like pandas.api.lib rather than in the main namespace?

@jreback
Contributor Author

jreback commented Apr 5, 2017

@chris-b1 yeah, probably. I'm just trying to get this working locally (today). :> (It's not actually exposed anywhere atm, only in the local module.)

@codecov

codecov bot commented Apr 5, 2017

Codecov Report

Merging #15907 into master will decrease coverage by 0.1%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master   #15907      +/-   ##
==========================================
- Coverage      91%    90.9%   -0.11%     
==========================================
  Files         145      146       +1     
  Lines       49576    49631      +55     
==========================================
  Hits        45118    45118              
- Misses       4458     4513      +55
| Flag | Coverage Δ |
|---|---|
| #multiple | 88.67% <0%> (-0.1%) ⬇️ |
| #single | 40.53% <0%> (-0.05%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/io/ipc.py | 0% <0%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f35209e...f727477.

@wesm
Member

wesm commented Apr 5, 2017

I agree putting such things into an experimental namespace would be a good idea

@jorisvandenbossche
Member

Genuine question: since we are typically trying to limit pandas' scope, why should we include this in the core package? Instead of having this live in the package implementing this / a separate package ('pandas-ipc') providing this API?

@jreback
Contributor Author

jreback commented Apr 5, 2017

> Genuine question: since we are typically trying to limit pandas' scope, why should we include this in the core package? Instead of having this live in the package implementing this / a separate package ('pandas-ipc') providing this API?

True, we are trying to limit scope. This is essentially an IPC version of arrow (which also backs feather). So ATM this expands scope, though eventually it would limit it somewhat (as the back-end would have less tooling).

Could consider this as pandas-ipc, though this is pretty 'simple' and not meant as a full-fledged package; it is mostly a pass-through / convenient interface for others (e.g. dask).

@wesm
Member

wesm commented Apr 5, 2017

I think it would be useful for pandas to have more robust support for various transient serialization formats, with a choice between pickle (most compatible) vs msgpack vs arrow vs other things. Whether the implementation of this goes into core pandas, or into a "leaf" library that gets imported, I don't have a strong opinion
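
For illustration only (not code from this PR): a minimal sketch of what a "choice of transient serialization format" could look like. The names `to_bytes`, `from_bytes`, and the `_FORMATS` registry are hypothetical. Only pickle is wired up here; msgpack or arrow back-ends would, under this assumption, be registered the same way.

```python
import pickle
from typing import Callable, Dict, Tuple

import pandas as pd

# Hypothetical registry: format name -> (dumps, loads).
# Only pickle is registered in this sketch; other formats would be added here.
_FORMATS: Dict[str, Tuple[Callable, Callable]] = {
    "pickle": (
        lambda df: pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL),
        pickle.loads,
    ),
}


def to_bytes(df: pd.DataFrame, fmt: str = "pickle") -> bytes:
    """Serialize a DataFrame to bytes using the named format."""
    if fmt not in _FORMATS:
        raise ValueError(f"unknown serialization format: {fmt!r}")
    dumps, _ = _FORMATS[fmt]
    return dumps(df)


def from_bytes(data: bytes, fmt: str = "pickle") -> pd.DataFrame:
    """Deserialize bytes produced by to_bytes with the same format."""
    if fmt not in _FORMATS:
        raise ValueError(f"unknown serialization format: {fmt!r}")
    _, loads = _FORMATS[fmt]
    return loads(data)
```

Whether such a dispatcher lives in core pandas or in a leaf library is exactly the open question in this thread; the sketch takes no position on that.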

@jreback
Contributor Author

jreback commented Jun 10, 2017

Closing for now. This will be pretty transparent with pyarrow, so this interface would just be a simple wrapper. Let's see if it's even needed.

@jreback jreback closed this Jun 10, 2017
@jreback
Contributor Author

jreback commented Sep 12, 2017

So this is now available in a released version of arrow: https://arrow.apache.org/docs/python/ipc.html (IIRC 0.5.0 has full support). Is there appetite for this in main pandas as read_ipc/to_ipc?

@wesm @cpcloud @jorisvandenbossche

@wesm
Member

wesm commented Sep 12, 2017

The memory format these functions create may affect the name. If it's a vanilla arrow stream (schema + sequence of record batches), then it might be better to call it read_arrow_stream / to_arrow_stream (in the latter case, you could select a batch size to use for chunking the DataFrame once someone implements this).

Labels
Enhancement IO Data IO issues that don't fit into a more specific label
4 participants