pandas dataframe #17

Open · kernelmachine opened this issue Aug 5, 2016 · 29 comments


kernelmachine commented Aug 5, 2016

How would you recommend converting Records to a pandas dataframe?

Also, what do you think about giving fetch the option of returning a dataframe instead of a list? There might be performance concerns.


jettify commented Aug 5, 2016

What is your use case? A sync driver + SQLAlchemy is usually more than sufficient for such things; you even get it for free:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html



1st1 commented Aug 5, 2016

I'm also curious to hear about the use cases. Also, does Pandas provide an async API?

We can add an API to register a custom row-decoder callback (a function accepting bytes and returning a decoded row). That would make it possible for Pandas to implement its own decoding to any kind of data types.

@1st1 1st1 added the question label Aug 5, 2016

wesm commented Aug 5, 2016

Efficiently decoding data to the exact memory representation that pandas requires is currently somewhat complicated. It uses NumPy arrays as its internal representation, but null handling adds some complexity (see the sketch after this list):

  • float32/float64 columns — normal np.float32/float64 arrays with NaN for nulls
  • any integer columns: if there are no nulls, use int8/16/32/64; otherwise use float64 (with nulls encoded as NaN). Users should be warned about data loss for int64 values outside the range exactly representable in floating point
  • boolean columns: numpy.bool_ for no nulls, and numpy.object_ if there are nulls (encode null as None)
  • string columns: numpy.object_ dtype with PyString/PyUnicode objects internally. Use None to encode nulls. It is preferable to avoid creating two copies of the same string (see https://github.com/cloudera/hs2client/blob/master/python/hs2client/converters.h#L193 where I deal with this problem for HiveServer2)
  • datetime columns: nanoseconds since the UNIX epoch as numpy.int64 (logically: datetime64[ns]) array
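As a rough illustration of the conventions above (a sketch using plain NumPy with made-up values, not asyncpg or pandas API):

import numpy as np

# float columns: NaN marks nulls
floats = np.array([1.5, np.nan, 3.0], dtype=np.float64)
# integer column containing nulls: promoted to float64, nulls become NaN
ints_with_nulls = np.array([1.0, np.nan, 3.0], dtype=np.float64)
# boolean column containing nulls: object dtype, None marks nulls
bools_with_nulls = np.array([True, None, False], dtype=np.object_)
# string column: object dtype holding str objects, None for nulls
strings = np.array(["a", None, "c"], dtype=np.object_)
# datetime column: int64 nanoseconds since the UNIX epoch, viewed as datetime64[ns]
ts = np.array([1470355200 * 10**9, 1470441600 * 10**9], dtype=np.int64).view("datetime64[ns]")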

The business logic for producing these arrays is best written in C/C++/Cython with limited involvement of the Python C API (only where numpy.object_ arrays are involved). I am happy to help advise on the implementation / do some code review, the C++ code in https://github.com/cloudera/hs2client/blob/master/python/hs2client/converters.h can form a reference for what pandas expects in the NumPy arrays (it does not do timestamp conversion — depending on PG's datetime representation you may be able to perform the conversion in C or leave it to pandas's internal timestamp methods to convert).

pandas does not have any async APIs at the moment AFAIK


wesm commented Aug 5, 2016

As a possible added benefit for pandas users, it would be nice to have the option for any string columns to be returned as pandas categorical dtype, which has significant performance benefits in analytics
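For reference, a string column can be stored with the categorical dtype like this (an illustrative snippet, not tied to asyncpg):

import pandas as pd

colors = pd.Series(["red", "green", "red", "red"], dtype="category")
# values are stored as small integer codes plus a table of unique categories
print(colors.cat.categories)      # Index(['green', 'red'], dtype='object')
print(colors.cat.codes.tolist())  # [1, 0, 1, 1]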


1st1 commented Aug 5, 2016

@wesm I think we can build this functionality into asyncpg -- instead of Records we can return a list of columns (numpy arrays, with the decoding semantics you've outlined). Would that be enough to integrate with pandas?

Also, would it be possible to provide a benchmark that we can tweak to use asyncpg and work with?


wesm commented Aug 5, 2016

Yeah, that would definitely work. Could put this NumPy deserializer in an optional extension so that people can still use asyncpg if they don't have NumPy installed (since you will have to build against NumPy's headers, etc.)

We should be able to help kick the tires and come up with some typical benchmarking scenarios (i.e. numeric heavy reads, string heavy reads, etc.)


1st1 commented Aug 5, 2016

Yeah, that would definitely work. Could put this NumPy deserializer in an optional extension so that people can still use asyncpg if they don't have NumPy installed (since you will have to build against NumPy's headers, etc.)

I actually wanted to make it an optional dependency. Let's say we hide this functionality behind an argument to the connect function (return_numpy_dataframes=False).
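Usage of the proposed flag might look something like this (purely illustrative -- return_numpy_dataframes is not an actual asyncpg parameter, and the columnar return type is hypothetical):

import asyncio
import asyncpg

async def main():
    # hypothetical flag from the proposal above; a real connect() call today
    # would reject this keyword argument
    conn = await asyncpg.connect('postgresql://localhost/test',
                                 return_numpy_dataframes=True)
    columns = await conn.fetch('SELECT id, value FROM measurements')
    # under the proposal, `columns` would be a list of NumPy arrays
    # (one per column) instead of a list of Record objects
    await conn.close()

asyncio.run(main())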

We should be able to help kick the tires and come up with some typical benchmarking scenarios (i.e. numeric heavy reads, string heavy reads, etc.)

That is something we need to have before we can start the process (we aren't Pandas users ourselves).

@1st1 1st1 added enhancement and removed question labels Aug 5, 2016

wesm commented Aug 5, 2016

That is something we need to have before we can start the process (we aren't Pandas users ourselves).

It sounds like what you need is a test suite, not benchmarks, am I interpreting that right?


1st1 commented Aug 6, 2016

It sounds like what you need is a test suite, not benchmarks, am I interpreting that right?

Sorry, I should have clarified my request. I wanted to ask for a small script that uses one type (say int32), fetches some data from the DB and performs some rudimentary calculation. We could then use that script to prototype the implementation and see how it compares to existing solutions. In any case, never mind, I think I can jot down a simple Pandas script myself.
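For what it's worth, the kind of script being described could be as small as this (a sketch; the DSN, table, and column names are made up):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://localhost/test')
# one integer column, fetched into a DataFrame via the existing sync path
df = pd.read_sql('SELECT value FROM measurements', engine)
# a rudimentary calculation over the column
print(df['value'].sum(), df['value'].mean())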


mangecoeur commented Dec 7, 2017

I was interested in attempting to implement this - I did some naive benchmarks using asyncpg to load large tables instead of psycopg and I saw about a 3x speed improvement. I suspect it could be possible to load data into pandas tables even faster by generating numpy arrays directly in the cython layer. Any pointers on where to start poking around would be welcome - I am aiming to just produce some simple benchmarks to start with, to see if it's worth pursuing lower-level integration (as opposed to just using the normal asyncpg interface).

(Edit: context, I actually wrote quite a bit of the pandas SQLAlchemy-based SQL interface, but I was never too satisfied with performance and type support.)


elprans commented Dec 7, 2017

@mangecoeur Interface-wise, it would make most sense to integrate with Connection.copy_*. Basically, the algorithm for "copy_from_*_to_pandas" would look something like this:

  1. Introspect the source relation to get the types of the attributes.
  2. Initialize the dataframe accordingly.
  3. Build a codec profile (see _ensure_rows_decoder), using decoder versions that produce pandas native format as output.
  4. Initiate COPY OUT ... (FORMAT binary), and decode the data chunks into the target dataframe.

Step number 3 is the tricky part. For this whole thing to make sense performance-wise, the decoding pipeline must not dip into the interpreter loop, thus the decoders must be plugged in as C function pointers (maybe as a special argument to copy*()).
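A much simpler (and slower) pure-Python approximation of the same idea stays on the existing copy_from_query API and skips the binary-format/C-decoder part described as the hard bit (a sketch, assuming copy_from_query accepts a file-like output object; the query is made up):

import io
import pandas as pd

async def copy_query_to_dataframe(conn, query: str) -> pd.DataFrame:
    buf = io.BytesIO()
    # stream COPY output as CSV into memory and let pandas do the parsing
    await conn.copy_from_query(query, output=buf, format='csv', header=True)
    buf.seek(0)
    return pd.read_csv(buf)

# df = await copy_query_to_dataframe(conn, 'SELECT id, value FROM measurements')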


wesm commented Dec 7, 2017

From a data representation point of view, it may also be worth looking at Apache Arrow as an intermediary to use -- then you don't have to deal with all the weird pandas stuff and just deal with strongly-typed nullable columns. The turbodbc folks have been using Arrow for data en route to/from pandas and that's been working very well cc @MathMagique @xhochy


mangecoeur commented Dec 7, 2017

Thanks for the pointers. I will try to implement some really simple cases just to see what the performance is like; if it's promising I could follow @wesm's suggestion and integrate with Arrow (although turbodbc looks a bit baroque w.r.t. mixing C++ and Python, compared to Cython only).

(Edit: I notice that there is a WIP Cython API for Arrow.)


xhochy commented Dec 9, 2017

The main thing you will benefit from in Arrow is simple construction of the columnar buffers using the Builder classes. The transformation from the Arrow structures to a pandas DataFrame is then taken care of on the Arrow side. As Arrow is structured more simply than pandas, the implementation is much simpler, yet still very efficient.
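In Python terms, the Arrow-to-pandas hand-off looks roughly like this with pyarrow (a small sketch with made-up data; the builder work discussed here would happen in C++/Cython):

import pyarrow as pa

# nulls are tracked in a validity bitmap, independent of the value type
arr = pa.array([1, None, 3], type=pa.int64())
table = pa.table({'value': arr})
df = table.to_pandas()   # Arrow handles the conversion to pandas dtypes
print(df['value'])       # an int column with a null comes back as float64 with NaN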

API-wise, I have used Arrow with Cython, boost::python and pybind11. If you're fluent in C++ and Python, then pybind11 is the best option from my perspective, as you get the best of both worlds. Cython is the more natural choice when you come from a Python-only perspective, but it limits your use of C++, as not all C++ features are supported in Cython. Nevertheless, in Arrow we want to support all three options, and it should be simple to use with Cython. Always happy to help you get it working, independent of the wrapper ;)


deepers commented Jul 15, 2019

I would definitely be interested in being able to return the results of a query as a sequence of NumPy arrays representing the columns of the result. I don't use pandas very much, but I do use NumPy, and I would love an API that allows the user to pass in column dtypes and null handling instructions.

@jedmitten

I generally use

res = await pool.fetch(...)
df = pd.DataFrame([dict(r) for r in res])


foxx commented Mar 6, 2020

I generally use

res = await pool.fetch(...)
df = pd.DataFrame([dict(r) for r in res])

This didn't work for me, I had to use:

dfs = pd.DataFrame([dict(r.items()) for r in results])
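Another variant that skips building per-row dicts (a sketch; the query is made up, and it assumes at least one row comes back):

import pandas as pd

async def fetch_df(pool):
    res = await pool.fetch('SELECT id, value FROM measurements')
    if not res:
        return pd.DataFrame()
    # Record objects iterate over their values, and keys() gives the column names
    return pd.DataFrame([tuple(r) for r in res], columns=list(res[0].keys()))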

@RiccardoGiro

Any news/updates on this enhancement?


0x0L commented Nov 8, 2021

Hi all,

I tried to implement a naive version of @elprans proposal. Here's what I came up with:
https://gist.github.com/0x0L/3d55d47f6329eb0d7e7a46d1c895b5be

A few remarks:

  • the Cython bindings for array builders are a bit shady: default ctors are missing, and there is no reset method
  • I couldn't find a canonical way to get an array builder from a given datatype - which looks strange to me. Am I missing something? @wesm
  • I couldn't reuse the py-pgproto codecs, as they spit out Python objects and I didn't want the overhead.

I don't know what the timings at the bottom are worth but at least it doesn't look bad.


wesm commented Nov 8, 2021

We haven't invested too much in the Cython API for the array builders, so using the C++ API (in Cython) would be the way to go (if Cython is the right implementation language).

I couldn't find a canonical way to get an array builder from a given datatype - which looks strange to me. Am I missing something? @wesm

We have the arrow::MakeBuilder function in C++ https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_base.h#L280


0x0L commented Nov 8, 2021

@wesm Thanks! MakeBuilder is exactly what I was looking for.
In terms of interop, a generic C++ lib would probably be a much better candidate here given the narrow and well-defined scope: convert PG binary buffers to a table or record batch of a given schema. At least for primitive types, the schema translation from SQL to Arrow looks straightforward.


0x0L commented Nov 14, 2021

I switched to C++. Writing Python bindings will be quite easy considering how simple the interface is.

It supports lists, composite types and enums.
Still a WIP, but you can check it out at https://github.com/0x0L/pg2arrow

It basically tries to do the same as https://github.com/heterodb/pg2arrow
Since we're using the Arrow API, the code is much simpler.

Also looking for advice on how to improve my ugly C++ patterns.


wesm commented Nov 14, 2021

hi @0x0L, great news. I'm interested to see if there is a way to create a small "C middleware library" in Apache Arrow that uses the C data interface to exchange data structures:

https://arrow.apache.org/docs/format/CDataInterface.html

The idea would be to have some C code that provides a minimal Arrow builder API along with a minimal C implementation of this data interface, so downstream applications don't necessarily need to use C++ or link to the Arrow C++ library.

cc @pitrou for thoughts


pitrou commented Nov 16, 2021

I'd like to hear if people have a problem with using Arrow C++ here. Since only a minimal build of Arrow C++ should be required, you can probably easily link statically to libarrow.a, and there shouldn't be any additional dependencies.


0x0L commented Apr 16, 2022

Hi all.

I finally got a product that should be usable enough for others to test.
You can have a look at https://github.com/0x0L/pgeon

Any feedback or help would be very welcome :)

@vmarkovtsev

I also thought about this in my blog post and will try to hack an alternative production-ready deserializer to numpy recarrays. My requirements are:

  • Drop-in replacement for asyncpg interface. I plan to prefix the query with 🚀 to activate the columnar mode.
  • Handle nulls well. Our production needs different strategies for nulls: remove rows; coerce to a constant; fall back to an object array.
  • Play nice with custom types such as JSON and HSTORE. Support them and not crash, that is.
  • Do not convert strings to fixed S/U, because the memory can explode. Find a way to inform the internals whether such a conversion must be done.
  • Of course, there will be no intermediate protocol parsing to Record-s. That's the whole point 😄
  • No third-party dependencies except numpy.

Let's see how much of the original native asyncpg code will be left and how much faster it will work.
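As a point of reference, a columnar result packed into a NumPy structured array (the recarray-style target mentioned above) might look like this (illustration only, not the asyncpg-rkt API):

import numpy as np

dt = np.dtype([('id', np.int64), ('value', np.float64)])
rows = np.array([(1, 0.5), (2, np.nan)], dtype=dt)   # NaN standing in for a NULL float
print(rows['value'])   # column access without per-row Python objects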

@vmarkovtsev

Done.
Blog post
Repository: https://github.com/athenianco/asyncpg-rkt

As I wrote in the post, merging back would require some effort, so I want to check the options with the maintainers.


eirnym commented Jun 5, 2022

I'd love to see it merged back, if NumPy is an optional dependency.


ale-dd commented Jun 13, 2023

@vmarkovtsev it would be awesome if you were able to merge
