API: port the magic X from pandas_ply/dplython to pandas proper? #13133

shoyer · 2016-05-11T02:29:26Z

Many DataFrame methods (now including __getitem__) accept callables that take the DataFrame as input, e.g., df[lambda x: x.sepal_length > 3].

However, this is annoyingly verbose. I recently suggested (#13040) enabling argument-free lambdas like df[lambda: sepal_length > 3], but this isn't a viable solution (too much magic!) because it's impossible to implement with Python's standard scoping rules.
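To illustrate the scoping problem: a free name inside a lambda body is looked up in the enclosing, global, and builtin scopes at call time, so the caller has no standard way to inject column names. A minimal sketch, where the `columns` dict is only a hypothetical stand-in for a DataFrame's columns:

```python
# Hypothetical stand-in for DataFrame columns; not a pandas object.
columns = {"sepal_length": 5.1}

# The lambda takes no arguments, so `sepal_length` is a free variable.
predicate = lambda: sepal_length > 3

try:
    # Nothing in ordinary scoping connects `columns` to the lambda body,
    # so the name lookup fails when the lambda runs.
    predicate()
except NameError as exc:
    message = str(exc)

print(message)
```

Making this work would mean rewriting the function's globals or its code object, which is the kind of magic the paragraph above rules out.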

pandas-ply and dplython provide an alternative approach, based on a magic X operator, e.g.,

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

pandas-ply also introduces (injects onto pandas.DataFrame) two new DataFrame methods, ply_select and ply_where, that accept these symbolic expressions built from X. dplython takes a different approach, introducing its own dplyr-like API for chaining expressions instead of using method chaining. The pandas-ply approach is much closer to what makes sense for pandas proper, given that we already support method chaining.

I think we should consider introducing an object like X into pandas proper and supporting its use on all pandas methods that accept callables that take the DataFrame as input.

I don't think we need to port ply_select and ply_where, because support for expressions in DataFrame.assign and indexing is a good substitute.

So my proposed syntax (after from pandas import X) looks like the following:

(flights
 .groupby(['year', 'month', 'day'])
 .assign(
     arr = X.arr_delay.mean(),
     dep = X.dep_delay.mean())
 [(X.arr > 30) & (X.dep > 30)])

Indexing is a little uglier than using the ply_where method, but otherwise this is a nice improvement.

Best of all, we don't need to do any special tricks to introduce new scopes -- we simply define X.__getattr__ to look up attributes as columns in the DataFrame context. I expect we could even reuse the expression engines from pandas-ply or dplython directly, perhaps with a few modifications.
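As a rough sketch of the idea (not the pandas-ply or dplython implementation), X can be an object whose attribute access and operators record a computation instead of performing it. The `Deferred` class and `_value` helper below are hypothetical names, and a `SimpleNamespace` stands in for a DataFrame so the example runs without pandas:

```python
import operator
from types import SimpleNamespace


class Deferred:
    """A recorded computation that runs once a context object is supplied."""

    def __init__(self, resolve):
        self._resolve = resolve

    def __call__(self, context):
        # Evaluate the recorded expression against a concrete object.
        return self._resolve(context)

    def __getattr__(self, name):
        # Only called for names not found normally; record the lookup.
        if name.startswith("_"):
            raise AttributeError(name)
        return Deferred(lambda ctx: getattr(self(ctx), name))


def _value(obj, ctx):
    # Resolve nested deferred expressions; pass plain values through.
    return obj(ctx) if isinstance(obj, Deferred) else obj


def _binary(op):
    def method(self, other):
        return Deferred(lambda ctx: op(self(ctx), _value(other, ctx)))
    return method


# Install a few operators; reflected variants (e.g. 30 < X.arr) are omitted.
for _name in ("add", "sub", "mul", "gt", "lt", "and", "or"):
    setattr(Deferred, f"__{_name}__", _binary(getattr(operator, f"__{_name}__")))

X = Deferred(lambda ctx: ctx)

row = SimpleNamespace(arr=45, dep=10)  # stand-in for a DataFrame
both = (X.arr > 30) & (X.dep > 30)

print((X.arr > 30)(row))      # True
print(both(row))              # False
print((X.arr + X.dep)(row))   # 55
```

Because the expression is itself callable, the assumption here is that it could be handed to any method that accepts a callable taking the DataFrame today.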

In my mind, this would mostly obviate the need for pandas-ply, though the alternate API provided by dplython would still be independently useful. In an ideal world, our X implementation in pandas would be something that could be reused by dplython.

cc @joshuahhh @dodger487

datnamer · 2016-05-11T02:40:44Z

There is also this: https://github.com/dodger487/dplython

shoyer · 2016-05-11T02:48:31Z

@datnamer Thanks -- I had a feeling I was missing something! I updated my post to include discussion of dplython as well.

joshuahhh · 2016-05-11T02:49:35Z

I have mixed thoughts.

On the one hand, I agree that having to put lambda x: everywhere is awkward and verbose; probably awkward and verbose enough to discourage using the syntax.

But the X solution isn't perfect. The main problem is that if you have a function f and call f(X), everything breaks. (Unless f has a particularly simple implementation which doesn't look at its argument too closely.) This is why I added sym_call, but sym_call looks crappy, and you get cryptic error messages if you forget to use it. The introduction of .pipe on pandas dataframes/series made X.pipe(f) a nice option, but the "forget to use it" problem is still real.
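A small sketch of the failure mode, assuming a stripped-down placeholder class: `Deferred` and `deferred_call` are hypothetical names chosen for illustration and are not the actual sym_call signature.

```python
import math
from types import SimpleNamespace


class Deferred:
    """Placeholder for a value that only exists once a context is supplied."""

    def __init__(self, resolve):
        self._resolve = resolve

    def __call__(self, context):
        return self._resolve(context)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return Deferred(lambda ctx: getattr(self(ctx), name))


X = Deferred(lambda ctx: ctx)


def deferred_call(func, *args):
    """Hypothetical helper: postpone func(*args) until a context is known."""
    def resolve(ctx):
        return func(*(a(ctx) if isinstance(a, Deferred) else a for a in args))
    return Deferred(resolve)


record = SimpleNamespace(delay=16.0)  # stand-in for a DataFrame

error = None
try:
    # math.sqrt runs immediately and receives the placeholder, not a number.
    math.sqrt(X.delay)
except TypeError as exc:
    error = exc

# Wrapping the call records it instead, so it runs against real data later.
safe = deferred_call(math.sqrt, X.delay)
print(type(error).__name__)  # TypeError
print(safe(record))          # 4.0
```

The error raised in the first case comes from `math.sqrt` and says nothing about X, which is one way the "cryptic error messages" can arise.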

I like using X a lot, since I understand it well and have built it into my habits. But Python isn't powerful enough to make it work in a totally predictable way, and I don't know if half-solutions like this belong in pandas.

(Thanks for asking!)

dodger487 · 2016-05-11T03:19:35Z

To add onto @joshuahhh's comment about passing X as an input to a function: in dplython we use a decorator (DelayFunction) that makes the function check its arguments for any X expressions and, if it finds one, delays the call until the arguments can be supplied. I'd echo that it isn't ideal -- "you get cryptic error messages if you forget to use it." I've toyed with the idea of applying this to all functions in a module upon import, but that seems like it could have some difficulties.

On the other hand, I've found that I don't often need to apply functions to X arguments because there are so many methods on Series. Also, if I'm writing a function that will be applied to an X, it's not too bad to use the DelayFunction decorator.
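For readers unfamiliar with the pattern, here is a minimal sketch of a decorator in this spirit. It is not dplython's DelayFunction; `Deferred`, `delay_if_deferred`, and `to_hours` are hypothetical names used only for illustration.

```python
import functools
from types import SimpleNamespace


class Deferred:
    """Placeholder for a value that only exists once a context is supplied."""

    def __init__(self, resolve):
        self._resolve = resolve

    def __call__(self, context):
        return self._resolve(context)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return Deferred(lambda ctx: getattr(self(ctx), name))


X = Deferred(lambda ctx: ctx)


def _value(obj, ctx):
    # Resolve deferred arguments; pass ordinary values through unchanged.
    return obj(ctx) if isinstance(obj, Deferred) else obj


def delay_if_deferred(func):
    """Call func normally unless an argument is deferred; then delay the call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        values = list(args) + list(kwargs.values())
        if not any(isinstance(v, Deferred) for v in values):
            return func(*args, **kwargs)  # ordinary, immediate call

        def resolve(ctx):
            real_args = [_value(a, ctx) for a in args]
            real_kwargs = {k: _value(v, ctx) for k, v in kwargs.items()}
            return func(*real_args, **real_kwargs)

        return Deferred(resolve)
    return wrapper


@delay_if_deferred
def to_hours(minutes):
    return minutes / 60


flight = SimpleNamespace(arr_delay=120)  # stand-in for a DataFrame

print(to_hours(90))                   # 1.5, evaluated immediately
print(to_hours(X.arr_delay)(flight))  # 2.0, evaluated once data is supplied
```

Undecorated functions still receive the placeholder directly, so the "forget to use it" problem remains.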

Overall, I agree there are some difficulties but I'm optimistic about X being a useful solution to include in pandas.

Thanks for including me on the thread!

shoyer · 2016-05-11T03:43:29Z

@dodger487 @joshuahhh thanks for sharing your thoughts! I think pandas supports method chaining enough that the inability to use arbitrary functions is OK. X.pipe(np.log) feels a little unnatural but is not so terrible. (Note that there are tentative plans, possibly as part of the pandas 1.0 rewrite, to port commonly used numpy ufuncs such as np.log to methods on Series/DataFrame.)

It occurs to me that dask.delayed contains yet another implementation of deferred evaluation that might be a useful reference.

dpavlic · 2016-06-09T15:05:20Z

Very intriguing. I've tested out dplython and pandas-ply based on this issue and they both look very interesting. It looks like both use X for their own functions, but it can't be used elsewhere; i.e.:

df[(X.arr > 30) & (X.dep > 30)]

doesn't actually work with either implementation as it is. Your proposal sounds like it would allow its use there, and elsewhere; for example, I'm assuming instead of (please forgive the obviously highly contrived example):

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).assign(c=lambda x: x.a + x.b)

I could instead do:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).assign(c=X.a + X.b)

? While pandas is never going to have some of the sheer convenience of the R syntax for these types of things, that brings it a lot closer from what I can see.

shoyer · 2016-06-09T16:18:09Z

Despite these limitations, I still think pandas.X would be a clear win. The core functionality seems useful enough on its own to merit inclusion in core pandas, even though add-ons like DelayFunction are probably best left to third-party libraries.

@jreback @jorisvandenbossche @TomAugspurger any opinions?

jreback · 2016-06-09T18:50:37Z

I think exposing pd.X is pretty reasonable, assuming it is well documented with nice use cases. It's opt-in, so +1.

lpenguin · 2017-11-03T14:43:07Z

Guys, I didn't see this issue earlier. I think I did something very similar to the X magic; see #18077.
A proof-of-concept implementation is at https://github.com/lpenguin/pandas-query. Just use from pandas_query import _ as X and you will get similar functionality. Though I didn't implement separate ply_select and ply_where functions, I hacked DataFrame.__getitem__, DataFrame.__setitem__ (column assignment) and DataFrame.__assign__.

jbrockmendel · 2023-02-22T22:08:48Z

Discussed on today's dev call and the consensus is we don't want to add to the API. Closing.

shoyer added the API Design and Needs Discussion (requires discussion from core team before further action) labels May 11, 2016

shoyer added this to the Next Major Release milestone May 11, 2016

shoyer mentioned this issue May 11, 2016

API: use argument-free lambdas for injecting DataFrames columns as variables? #13040

Closed

shoyer changed the title ~~API: port the magic X from pandas_ply to pandas proper?~~ API: port the magic X from pandas_ply/dplython to pandas proper? May 11, 2016

chris-b1 mentioned this issue Sep 13, 2016

WIP/API: add magic 'X' for selection #14209

Closed

4 tasks

chris-b1 mentioned this issue Jun 15, 2017

API: Generic namespace thunk ibis-project/ibis#1037

Closed

chris-b1 mentioned this issue Nov 2, 2017

Shorter syntax for selecting data and expression evaluation (proposal) #18077

Closed

max-sixty mentioned this issue Jun 26, 2019

Pipe Operator? pydata/xarray#3050

Closed

mroeschke added Enhancement and removed API Design labels Apr 30, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel closed this as completed Feb 22, 2023
