pandas dataframe #17
What is your use case? A sync driver + SQLAlchemy is usually more than enough.
I'm also curious to hear about the use cases. Also, does Pandas provide an async API? We can add an API to register a custom row-decoder callback (a function accepting …).
Efficiently decoding data to the exact memory representation that pandas requires is currently somewhat complicated. It uses NumPy arrays as its internal representation, but null handling adds some complexity.
The business logic for producing these arrays is best written in C/C++/Cython with limited involvement of the Python C API (only where …). pandas does not have any async APIs at the moment, AFAIK.
As a possible added benefit for pandas users, it would be nice to have the option for any string columns to be returned as the pandas categorical dtype, which has significant performance benefits in analytics.
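For illustration, a minimal sketch of both points above (nullable-integer upcasting and the categorical dtype) in plain pandas; the data is made up and none of this is asyncpg code:

import pandas as pd

# Null handling: a nullable integer column cannot stay int64 in
# classic pandas/NumPy -- a single NULL forces an upcast to float64
# with NaN as the sentinel, which any decoder has to account for.
ints = pd.Series([1, 2, None, 4])
print(ints.dtype)  # float64, not int64

# Categorical strings: repeated values are stored once, plus small
# integer codes, instead of one Python object per row.
strings = pd.Series(["red", "green", "red", "blue"] * 250_000)
print(strings.memory_usage(deep=True))                     # object column
print(strings.astype("category").memory_usage(deep=True)) # much smaller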
@wesm I think we can build this functionality into asyncpg -- instead of Records we can return a list of columns (numpy arrays, with the decoding semantics you've outlined). Would that be enough to integrate pandas? Also, would it be possible to provide a benchmark that we can tweak to use asyncpg and work with?
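For a sense of the proposed shape, a hypothetical sketch only: fetch_columns() does not exist in asyncpg, and the DSN and table name are made up. Today the same column-oriented result has to be built by hand from row-oriented Records:

import asyncio
import numpy as np
import asyncpg

async def main():
    conn = await asyncpg.connect("postgresql://localhost/test")
    # Proposed (hypothetical): columns = await conn.fetch_columns(query)
    # Hand-rolled equivalent, assuming a non-empty result:
    rows = await conn.fetch("SELECT id, price FROM trades")
    columns = {
        key: np.array([r[key] for r in rows])
        for key in rows[0].keys()
    }
    print({k: v.dtype for k, v in columns.items()})
    await conn.close()

asyncio.run(main())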
Yeah, that would definitely work. We could put this NumPy deserializer in an optional extension so that people can still use asyncpg if they don't have NumPy installed (since you will have to build against NumPy's headers, etc.). We should be able to help kick the tires and come up with some typical benchmarking scenarios (e.g. numeric-heavy reads, string-heavy reads, etc.).
I actually wanted to make it an optional dependency. Let's say we hide this functionality behind an argument to the …
That is something we need to have before we can start the process (we aren't Pandas users ourselves). |
It sounds like what you need is a test suite, not benchmarks; am I interpreting that right?
Sorry, I should have clarified my request. I wanted to ask for a small script that uses one type (say int32), fetches some data from the DB and performs some rudimentary calculation. We could then use that script to prototype the implementation and see how it compares to existing solutions. In any case, never mind, I think I can jot down a simple Pandas script myself.
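Something along these lines is presumably what was meant; a sketch, with a made-up DSN and a temp table standing in for real data:

import asyncio
import asyncpg
import pandas as pd

async def main():
    conn = await asyncpg.connect("postgresql://localhost/test")
    await conn.execute(
        "CREATE TEMP TABLE t AS "
        "SELECT generate_series(1, 1000000)::int4 AS x"
    )
    rows = await conn.fetch("SELECT x FROM t")
    # One int32 column, fetched and fed into a rudimentary calculation.
    df = pd.DataFrame([tuple(r) for r in rows], columns=["x"])
    print(df["x"].mean())
    await conn.close()

asyncio.run(main())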
I was interested in attempting to implement this. I did some naive benchmarks using asyncpg to load large tables instead of psycopg and saw about a 3x speed improvement. I suspect it could be possible to load data into pandas tables even faster by generating numpy arrays directly in the Cython layer. Any pointers on where to start poking around would be welcome; I am aiming to produce just a simple benchmark to start with, to see if it's worth pursuing lower-level integration (as opposed to just using the normal asyncpg interface). (Edit: for context, I actually wrote quite a bit of the pandas SQLAlchemy-based SQL interface, but I was never too satisfied with its performance and type support.)
@mangecoeur Interface-wise, it would make most sense to integrate with …
Step number 3 is the tricky part. For this whole thing to make sense performance-wise, the decoding pipeline must not dip into the interpreter loop, so the decoders must be plugged in as C function pointers (maybe as a special argument to …).
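A pure-Python stand-in for that pipeline, just to show the flow; the real decoders would be C function pointers in asyncpg's Cython layer, and all names here are invented:

import numpy as np

def decode_columns(raw_rows, decoders, dtypes):
    # Preallocate one typed array per column, then fill them in a
    # single pass without ever materializing Record objects.
    n = len(raw_rows)
    columns = [np.empty(n, dtype=dt) for dt in dtypes]
    for i, row in enumerate(raw_rows):
        for j, decode in enumerate(decoders):
            columns[j][i] = decode(row[j])
    return columns

# Toy usage: wire values already split into per-field bytes.
cols = decode_columns(
    [(b"1", b"2.5"), (b"2", b"3.5")],
    decoders=[int, float],
    dtypes=["i8", "f8"],
)
print(cols)  # [array([1, 2]), array([2.5, 3.5])]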
From a data representation point of view, it may also be worth looking at Apache Arrow as an intermediary: then you don't have to deal with all the weird pandas stuff and can just deal with strongly-typed nullable columns. The turbodbc folks have been using Arrow for data en route to/from pandas, and that's been working very well. cc @MathMagique @xhochy
Thanks for the pointers. I will try to implement some really simple cases just to see what the performance is like; if it's promising, I could follow @wesm's suggestion and integrate with Arrow (although turbodbc looks a bit baroque with its mixing of C++ and Python, compared to Cython only). (Edit: I notice that there is a WIP Cython API for Arrow.)
The main thing you will benefit from in Arrow is the simple construction of the columnar buffers using the Builder classes. The transformation from the Arrow structures to a pandas DataFrame is then taken care of on the Arrow side. As Arrow is more simply structured than pandas, the implementation is much simpler, yet still very efficient. API-wise, I have used Arrow with Cython, …
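In Python the same hand-off looks roughly like this, with pa.array standing in for the C++ builders and the data made up:

import pyarrow as pa

# Nulls are first-class in Arrow, so no float64/NaN contortions;
# the Arrow -> pandas conversion is then a single call.
ids = pa.array([1, 2, None, 4], type=pa.int64())
names = pa.array(["a", "b", None, "d"], type=pa.string())
table = pa.Table.from_arrays([ids, names], names=["id", "name"])
print(table.to_pandas())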
I would definitely be interested in being able to return the results of a query as a sequence of NumPy arrays representing the columns of the result. I don't use pandas very much, but I do use NumPy, and I would love an API that allows the user to pass in column dtypes and null handling instructions. |
I generally use:

res = await pool.fetch(...)
df = pd.DataFrame([dict(r) for r in res])
This didn't work for me; I had to use:

dfs = pd.DataFrame([dict(r.items()) for r in results])
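A column-wise variant of the same conversion may also be worth trying; it skips the per-row dicts entirely (a sketch, assuming a non-empty result with uniform keys):

import pandas as pd

def records_to_df(records):
    keys = list(records[0].keys())
    return pd.DataFrame({k: [r[k] for r in records] for k in keys})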
Any news/updates on this enhancement? |
Hi all, I tried to implement a naive version of @elprans' proposal. Here's what I came up with. A few remarks: …
I don't know what the timings at the bottom are worth, but at least they don't look bad.
We haven't invested too much in the Cython API for the array builders, so using the C++ API (in Cython) would be the way to go (if Cython is the right implementation language).
We have the …
@wesm Thanks!
I switched to C++. Writing Python bindings will be quite easy considering how simple the interface is. It supports lists, composite types and enums. It basically tries to do the same as https://github.com/heterodb/pg2arrow. I'm also looking for advice on how to improve my ugly C++ patterns.
Hi @0x0L, great news. I'm interested to see if there is a way to create a small "C middleware library" in Apache Arrow that uses the C data interface to exchange data structures: https://arrow.apache.org/docs/format/CDataInterface.html. The idea would be to have some C code that provides a minimal Arrow builder API along with a minimal C implementation of this data interface, so downstream applications don't necessarily need to use C++ or link to the Arrow C++ library. cc @pitrou for thoughts
I'd like to hear if people have a problem with using Arrow C++ here. Since only a minimal build of Arrow C++ should be required, you can probably easily link statically to it.
Hi all. I finally got a product that should be usable enough for others to test. Any feedback or help would be greatly welcomed :) |
I also thought about this in my blog post and will try to hack an alternative production-ready deserializer to numpy recarrays. My requirements are: …
Let's see how much of the original native asyncpg code will be left and how much faster it will work.
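For readers unfamiliar with recarrays, the target representation looks like this; a sketch with made-up columns:

import numpy as np

# One contiguous structured array instead of a list of Records;
# downstream numeric code gets typed columns with no conversion.
out = np.empty(3, dtype=[("id", "i8"), ("price", "f8")])
out["id"] = [1, 2, 3]
out["price"] = [10.5, 11.0, 9.75]
rec = out.view(np.recarray)
print(rec.price.mean())  # attribute access per column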
Done. As I wrote in the post, merging back would require some effort, so I want to check my options with the maintainers.
I'd love to see it merged back, if NumPy will be an optional dependency.
@vmarkovtsev it would be awesome if you were able to merge it back.
How would you recommend converting Records to a pandas dataframe?

Also, what do you think about giving fetch the option of returning a dataframe instead of a list? There might be performance concerns.