pandas dataframe #17

Open · kernelmachine opened this issue Aug 5, 2016 · 29 comments


kernelmachine commented Aug 5, 2016

How would you recommend converting Records to a pandas dataframe?

Also, what do you think about giving fetch the option of returning a dataframe instead of a list? There might be performance concerns.


jettify commented Aug 5, 2016

What is your use case? A sync driver + SQLAlchemy is usually more than sufficient for such things; you even get it for free:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html



1st1 commented Aug 5, 2016

I'm also curious to hear about the use cases. Also, does Pandas provide an async API?

We can add an API to register a custom row-decoder callback (a function accepting bytes and returning a decoded row). That would make it possible for Pandas to implement its own decoding to any kind of data types.

@1st1 1st1 added the question label Aug 5, 2016

wesm commented Aug 5, 2016

Efficiently decoding data to the exact memory representation that pandas requires is currently somewhat complicated. It uses NumPy arrays as its internal representation, but null handling adds some complexity (see the sketch after this list):

  • float32/float64 columns — normal np.float32/float64 arrays with NaN for nulls
  • any integer columns: if there are no nulls, use int8/16/32/64; otherwise use float64 (with nulls encoded as NaN). Users should be warned about data loss for int64 values outside the range exactly representable in floating point
  • boolean columns: numpy.bool_ for no nulls, and numpy.object_ if there are nulls (encode null as None)
  • string columns: numpy.object_ dtype with PyString/PyUnicode objects internally. Use None to encode nulls. It is preferable to avoid creating two copies of the same string (see https://github.com/cloudera/hs2client/blob/master/python/hs2client/converters.h#L193 where I deal with this problem for HiveServer2)
  • datetime columns: nanoseconds since the UNIX epoch as numpy.int64 (logically: datetime64[ns]) array
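As a rough illustration of the conventions above (a sketch using plain NumPy with made-up values, not asyncpg or pandas API):

import numpy as np

# float columns: NaN marks nulls
floats = np.array([1.5, np.nan, 3.0], dtype=np.float64)
# integer column containing nulls: promoted to float64, nulls become NaN
ints_with_nulls = np.array([1.0, np.nan, 3.0], dtype=np.float64)
# boolean column containing nulls: object dtype, None marks nulls
bools_with_nulls = np.array([True, None, False], dtype=np.object_)
# string column: object dtype holding str objects, None for nulls
strings = np.array(["a", None, "c"], dtype=np.object_)
# datetime column: int64 nanoseconds since the UNIX epoch, viewed as datetime64[ns]
ts = np.array([1470355200 * 10**9, 1470441600 * 10**9], dtype=np.int64).view("datetime64[ns]")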

The business logic for producing these arrays is best written in C/C++/Cython with limited involvement of the Python C API (only where numpy.object_ arrays are involved). I am happy to help advise on the implementation / do some code review, the C++ code in https://github.com/cloudera/hs2client/blob/master/python/hs2client/converters.h can form a reference for what pandas expects in the NumPy arrays (it does not do timestamp conversion — depending on PG's datetime representation you may be able to perform the conversion in C or leave it to pandas's internal timestamp methods to convert).

pandas does not have any async APIs at the moment AFAIK


wesm commented Aug 5, 2016

As a possible added benefit for pandas users, it would be nice to have the option for any string columns to be returned as pandas categorical dtype, which has significant performance benefits in analytics
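For reference, a string column can be stored with the categorical dtype like this (an illustrative snippet, not tied to asyncpg):

import pandas as pd

colors = pd.Series(["red", "green", "red", "red"], dtype="category")
# values are stored as small integer codes plus a table of unique categories
print(colors.cat.categories)      # Index(['green', 'red'], dtype='object')
print(colors.cat.codes.tolist())  # [1, 0, 1, 1]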


1st1 commented Aug 5, 2016

@wesm I think we can build this functionality into asyncpg -- instead of Records we can return a list of columns (numpy arrays, with the decoding semantics you've outlined). Would that be enough to integrate with pandas?

Also, would it be possible to provide a benchmark that we can tweak to use asyncpg and work with?


wesm commented Aug 5, 2016

Yeah, that would definitely work. Could put this NumPy deserializer in an optional extension so that people can still use asyncpg if they don't have NumPy installed (since you will have to build against NumPy's headers, etc.)

We should be able to help kick the tires and come up with some typical benchmarking scenarios (i.e. numeric heavy reads, string heavy reads, etc.)


1st1 commented Aug 5, 2016

Yeah, that would definitely work. Could put this NumPy deserializer in an optional extension so that people can still use asyncpg if they don't have NumPy installed (since you will have to build against NumPy's headers, etc.)

I actually wanted to make it an optional dependency. Let's say we hide this functionality behind an argument to the connect function (return_numpy_dataframes=False).
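Usage of the proposed flag might look something like this (purely illustrative -- return_numpy_dataframes is not an actual asyncpg parameter, and the columnar return type is hypothetical):

import asyncio
import asyncpg

async def main():
    # hypothetical flag from the proposal above; a real connect() call today
    # would reject this keyword argument
    conn = await asyncpg.connect('postgresql://localhost/test',
                                 return_numpy_dataframes=True)
    columns = await conn.fetch('SELECT id, value FROM measurements')
    # under the proposal, `columns` would be a list of NumPy arrays
    # (one per column) instead of a list of Record objects
    await conn.close()

asyncio.run(main())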

We should be able to help kick the tires and come up with some typical benchmarking scenarios (i.e. numeric heavy reads, string heavy reads, etc.)

That is something we need to have before we can start the process (we aren't Pandas users ourselves).

@1st1 1st1 added enhancement and removed question labels Aug 5, 2016

wesm commented Aug 5, 2016

That is something we need to have before we can start the process (we aren't Pandas users ourselves).

It sounds like what you need is a test suite, not benchmarks, am I interpreting that right?


1st1 commented Aug 6, 2016

It sounds like what you need is a test suite, not benchmarks, am I interpreting that right?

Sorry, I should have clarified my request. I wanted to ask for a small script that uses one type (say int32), fetches some data from the DB and performs some rudimentary calculation. We could then use that script to prototype the implementation and see how it compares to existing solutions. In any case, never mind, I think I can jot down a simple Pandas script myself.
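For what it's worth, the kind of script being described could be as small as this (a sketch; the DSN, table, and column names are made up):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://localhost/test')
# one integer column, fetched into a DataFrame via the existing sync path
df = pd.read_sql('SELECT value FROM measurements', engine)
# a rudimentary calculation over the column
print(df['value'].sum(), df['value'].mean())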


mangecoeur commented Dec 7, 2017

I was interested in attempting to implement this - I did some naive benchmarks using asyncpg to load large tables instead of psycopg and I saw about a 3x speed improvement. I suspect it could be possible to load data into pandas tables even faster by generating numpy arrays directly in the cython layer. Any pointers on where to start poking around would be welcome - I am aiming to just produce some simple benchmarks to start with, to see if it's worth pursuing lower-level integration (as opposed to just using the normal asyncpg interface).

(Edit: context, I actually wrote quite a bit of the pandas SQLAlchemy-based SQL interface, but I was never too satisfied with performance and type support.)


elprans commented Dec 7, 2017

@mangecoeur Interface-wise, it would make most sense to integrate with Connection.copy_*. Basically, the algorithm for "copy_from_*_to_pandas" would look something like this:

  1. Introspect the source relation to get the types of the attributes.
  2. Initialize the dataframe accordingly.
  3. Build a codec profile (see _ensure_rows_decoder), using decoder versions that produce pandas native format as output.
  4. Initiate COPY OUT ... (FORMAT binary), and decode the data chunks into the target dataframe.

Step number 3 is the tricky part. For this whole thing to make sense performance-wise, the decoding pipeline must not dip into the interpreter loop, thus the decoders must be plugged in as C function pointers (maybe as a special argument to copy*()).
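A much simpler (and slower) pure-Python approximation of the same idea stays on the existing copy_from_query API and skips the binary-format/C-decoder part described as the hard bit (a sketch, assuming copy_from_query accepts a file-like output object; the query is made up):

import io
import pandas as pd

async def copy_query_to_dataframe(conn, query: str) -> pd.DataFrame:
    buf = io.BytesIO()
    # stream COPY output as CSV into memory and let pandas do the parsing
    await conn.copy_from_query(query, output=buf, format='csv', header=True)
    buf.seek(0)
    return pd.read_csv(buf)

# df = await copy_query_to_dataframe(conn, 'SELECT id, value FROM measurements')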


wesm commented Dec 7, 2017

From a data representation point of view, it may also be worth looking at Apache Arrow as an intermediary to use -- then you don't have to deal with all the weird pandas stuff and just deal with strongly-typed nullable columns. The turbodbc folks have been using Arrow for data en route to/from pandas and that's been working very well cc @MathMagique @xhochy


mangecoeur commented Dec 7, 2017

Thanks for the pointers. I will try to implement some really simple cases just to see what the performance is like; if it's promising I could follow @wesm's suggestion and integrate with Arrow (although turbodbc looks a bit baroque w.r.t. mixing C++ and Python, compared to Cython only).

(Edit: I notice that there is a WIP Cython API for Arrow.)


xhochy commented Dec 9, 2017

The main thing you will benefit from in Arrow is simple construction of the columnar buffers using the Builder classes. The transformation from the Arrow structures to a pandas DataFrame is then taken care of on the Arrow side. As Arrow is structured more simply than pandas, the implementation is much simpler, yet still very efficient.
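In Python terms, the Arrow-to-pandas hand-off looks roughly like this with pyarrow (a small sketch with made-up data; the builder work discussed here would happen in C++/Cython):

import pyarrow as pa

# nulls are tracked in a validity bitmap, independent of the value type
arr = pa.array([1, None, 3], type=pa.int64())
table = pa.table({'value': arr})
df = table.to_pandas()   # Arrow handles the conversion to pandas dtypes
print(df['value'])       # an int column with a null comes back as float64 with NaN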

API-wise, I have used Arrow with Cython, boost::python and pybind11. If you're fluent in C++ and Python, then pybind11 is the best option from my perspective, as you get the best of both worlds. Cython is the more natural choice when you come from a Python-only perspective, but it limits your use of C++, as not all C++ features are supported in Cython. Nevertheless, in Arrow we want to support all three options, and it should be simple to use with Cython. Always happy to help you get it working, independent of the wrapper ;)


deepers commented Jul 15, 2019

I would definitely be interested in being able to return the results of a query as a sequence of NumPy arrays representing the columns of the result. I don't use pandas very much, but I do use NumPy, and I would love an API that allows the user to pass in column dtypes and null handling instructions.

@jedmitten

I generally use

res = await pool.fetch(...)
df = pd.DataFrame([dict(r) for r in res])


foxx commented Mar 6, 2020

I generally use

res = await pool.fetch(...)
df = pd.DataFrame([dict(r) for r in res])

This didn't work for me, I had to use:

dfs = pd.DataFrame([dict(r.items()) for r in results])
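Another variant that skips building per-row dicts (a sketch; the query is made up, and it assumes at least one row comes back):

import pandas as pd

async def fetch_df(pool):
    res = await pool.fetch('SELECT id, value FROM measurements')
    if not res:
        return pd.DataFrame()
    # Record objects iterate over their values, and keys() gives the column names
    return pd.DataFrame([tuple(r) for r in res], columns=list(res[0].keys()))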

@RiccardoGiro

Any news/updates on this enhancement?


0x0L commented Nov 8, 2021

Hi all,

I tried to implement a naive version of @elprans proposal. Here's what I came up with:
https://gist.github.com/0x0L/3d55d47f6329eb0d7e7a46d1c895b5be

A few remarks:

  • the Cython bindings for array builders are a bit shady: default ctors are missing, and there is no reset method
  • I couldn't find a canonical way to get an array builder from a given datatype - which looks strange to me. Am I missing something? @wesm
  • I couldn't reuse the py-pgproto codecs, as they spit out Python objects and I didn't want the overhead.

I don't know what the timings at the bottom are worth but at least it doesn't look bad.


wesm commented Nov 8, 2021

We haven't invested too much in the Cython API for the array builders, so using the C++ API (in Cython) would be the way to go (if Cython is the right implementation language).

I couldn't find a canonical way to get an array builder from a given datatype - which looks strange to me. Am I missing something? @wesm

We have the arrow::MakeBuilder function in C++ https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_base.h#L280


0x0L commented Nov 8, 2021

@wesm Thanks! MakeBuilder is exactly what I was looking for.
In terms of interop, a generic C++ lib would probably be a much better candidate here given the narrow and well-defined scope: convert PG binary buffers to a table or record batch of a given schema. At least for primitive types, the schema translation from SQL to Arrow looks straightforward.


0x0L commented Nov 14, 2021

I switched to C++. Writing Python bindings will be quite easy considering how simple the interface is.

It supports lists, composite types and enums.
Still a WIP, but you can check it out at https://github.com/0x0L/pg2arrow

It basically tries to do the same as https://github.com/heterodb/pg2arrow
Since we're using the Arrow API, the code is much simpler.

Also looking for advice on how to improve my ugly C++ patterns.


wesm commented Nov 14, 2021

hi @0x0L, great news. I'm interested to see if there is a way to create a small "C middleware library" in Apache Arrow that uses the C data interface to exchange data structures:

https://arrow.apache.org/docs/format/CDataInterface.html

The idea would be to have some C code that provides a minimal Arrow builder API along with a minimal C implementation of this data interface, so downstream applications don't necessarily need to use C++ or link to the Arrow C++ library.

cc @pitrou for thoughts


pitrou commented Nov 16, 2021

I'd like to hear if people have a problem with using Arrow C++ here. Since only a minimal build of Arrow C++ should be required, you can probably easily link statically to libarrow.a, and there shouldn't be any additional dependencies.


0x0L commented Apr 16, 2022

Hi all.

I finally got a product that should be usable enough for others to test.
You can have a look at https://github.com/0x0L/pgeon

Any feedback or help would be very welcome :)

@vmarkovtsev

I also thought about this in my blog post and will try to hack an alternative production-ready deserializer to numpy recarrays. My requirements are:

  • Drop-in replacement for asyncpg interface. I plan to prefix the query with 🚀 to activate the columnar mode.
  • Handle nulls well. Our production needs different strategies for nulls: remove rows; coerce to a constant; fall back to an object array.
  • Play nice with custom types such as JSON and HSTORE. Support them and not crash, that is.
  • Do not convert strings to fixed S/U, because the memory can explode. Find a way to inform the internals whether such a conversion must be done.
  • Of course, there will be no intermediate protocol parsing to Record-s. That's the whole point 😄
  • No third-party dependencies except numpy.

Let's see how much of the original native asyncpg code will be left and how much faster it will work.
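As a point of reference, a columnar result packed into a NumPy structured array (the recarray-style target mentioned above) might look like this (illustration only, not the asyncpg-rkt API):

import numpy as np

dt = np.dtype([('id', np.int64), ('value', np.float64)])
rows = np.array([(1, 0.5), (2, np.nan)], dtype=dt)   # NaN standing in for a NULL float
print(rows['value'])   # column access without per-row Python objects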

@vmarkovtsev

Done.
Blog post
Repository: https://github.com/athenianco/asyncpg-rkt

As I wrote in the post, merging back would require some effort, so I want to check the options with the maintainers.


eirnym commented Jun 5, 2022

I'd love to see it merged back, if NumPy is an optional dependency.


ale-dd commented Jun 13, 2023

@vmarkovtsev it would be awesome if you were able to merge
