API: port the magic X from pandas_ply/dplython to pandas proper? #13133

Closed
shoyer opened this issue May 11, 2016 · 10 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action


@shoyer
Member

shoyer commented May 11, 2016

Many DataFrame methods (now including __getitem__) accept callables that take the DataFrame as input, e.g., df[lambda x: x.sepal_length > 3].

However, this is annoyingly verbose. I recently suggested (#13040) enabling argument-free lambdas like df[lambda: sepal_length > 3], but this isn't a viable solution (too much magic!) because it's impossible to implement with Python's standard scoping rules.

pandas-ply and dplython provide an alternative approach, based on a magic X operator, e.g.,

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

pandas-ply also introduces (injects onto pandas.DataFrame) two new dataframe methods, ply_select and ply_where, that accept these symbolic expressions built from X. dplython takes a different approach, introducing its own dplyr-like API for chaining expressions instead of using method chaining. The pandas-ply approach is much closer to what makes sense for pandas proper, given that we already support method chaining.

I think we should consider introducing an object like X into pandas proper and supporting its use on all pandas methods that accept callables that take the DataFrame as input.

I don't think we need to port ply_select and ply_where, because support for expressions in DataFrame.assign and indexing is a good substitute.

So my proposed syntax (after from pandas import X) looks like the following:

(flights
 .groupby(['year', 'month', 'day'])
 .assign(
     arr = X.arr_delay.mean(),
     dep = X.dep_delay.mean())
 [(X.arr > 30) & (X.dep > 30)])

Indexing is a little uglier than using the ply_where method, but otherwise this is a nice improvement.

Best of all, we don't need to do any special tricks to introduce new scopes -- we simply define X.__getattr__ to look up attributes as columns in the DataFrame context. I expect we could even reuse the expression engines from pandas-ply or dplython directly, perhaps with a few modifications.
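To make the __getattr__ idea concrete, here is a minimal pure-Python sketch of a deferred-expression object. It is not the pandas-ply or dplython engine; the names Expr, _eval and _resolve are made up for illustration, and a SimpleNamespace stands in for the DataFrame so the example runs without pandas.

```python
import operator
from types import SimpleNamespace


def _resolve(value, ctx):
    """Evaluate deferred operands; pass ordinary values through unchanged."""
    return value._eval(ctx) if isinstance(value, Expr) else value


class Expr:
    """Hypothetical deferred expression: a function awaiting a context object."""

    def __init__(self, func):
        self._func = func

    def _eval(self, ctx):
        return self._func(ctx)

    def __getattr__(self, name):
        # Only reached when normal lookup fails. Private names are refused so
        # that copy/pickle machinery does not receive bogus Expr objects.
        if name.startswith("_"):
            raise AttributeError(name)
        return Expr(lambda ctx: getattr(self._eval(ctx), name))

    def __call__(self, *args, **kwargs):
        # Calling an expression defers a call on whatever it evaluates to.
        return Expr(lambda ctx: self._eval(ctx)(
            *[_resolve(a, ctx) for a in args],
            **{k: _resolve(v, ctx) for k, v in kwargs.items()}))


def _binary(op):
    def method(self, other):
        return Expr(lambda ctx: op(self._eval(ctx), _resolve(other, ctx)))
    return method


# Only a few operators are wired up in this sketch.
for _name, _op in {"__add__": operator.add,
                   "__gt__": operator.gt,
                   "__and__": operator.and_}.items():
    setattr(Expr, _name, _binary(_op))

X = Expr(lambda ctx: ctx)  # the root placeholder simply returns the context

# Stand-in for a DataFrame: any object with attributes will do here.
row = SimpleNamespace(arr=40, dep=35, carrier="ua")
both_late = (X.arr > 30) & (X.dep > 30)
total = X.arr + X.dep
code = X.carrier.upper()
```

Nothing is computed until _eval is handed the real object, e.g. both_late._eval(row). One wrinkle in this sketch: because Expr defines __call__ to defer method calls, callable(X.arr) is true, yet calling it with a DataFrame would only build another Expr. A method that accepts callables would therefore have to recognise such objects explicitly instead of relying on callable().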

In my mind, this would mostly obviate the need for pandas-ply, though the alternate API provided by dplython would still be independently useful. In an ideal world, our X implementation in pandas would be something that could be reused by dplython.

cc @joshuahhh @dodger487

@shoyer shoyer added API Design Needs Discussion Requires discussion from core team before further action labels May 11, 2016
@shoyer shoyer added this to the Next Major Release milestone May 11, 2016
@datnamer

There is also this: https://github.com/dodger487/dplython

@shoyer shoyer changed the title API: port the magic X from pandas_ply to pandas proper? API: port the magic X from pandas_ply/dplython to pandas proper? May 11, 2016
@shoyer
Member Author

shoyer commented May 11, 2016

@datnamer Thanks -- I had a feeling I was missing something! I updated my post to include discussion of dplython as well.

@joshuahhh

I have mixed thoughts.

On the one hand, I agree that having to put lambda x: everywhere is awkward and verbose; probably awkward and verbose enough to discourage using the syntax.

But the X solution isn't perfect. The main problem is that if you have a function f and call f(X), everything breaks. (Unless f has a particularly simple implementation which doesn't look at its argument too closely.) This is why I added sym_call, but sym_call looks crappy, and you get cryptic error messages if you forget to use it. The introduction of .pipe on pandas dataframes/series made X.pipe(f) a nice option, but the "forget to use it" problem is still real.
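A small sketch of the f(X) problem, using a made-up Deferred class in place of X. The helper deferred_call is a hypothetical analogue of sym_call, not its real signature, and pipe here only mimics the shape of X.pipe(f).

```python
import math


class Deferred:
    """Hypothetical stand-in for X: wraps a function of the eventual data."""

    def __init__(self, func=lambda data: data):
        self._func = func

    def _eval(self, data):
        return self._func(data)

    def pipe(self, f, *args):
        # Defer the call: f runs only once real data is available.
        return Deferred(lambda data: f(self._eval(data), *args))


def deferred_call(f, *args):
    """Hypothetical sym_call-like helper: defer f over placeholder arguments."""
    def run(data):
        return f(*[a._eval(data) if isinstance(a, Deferred) else a
                   for a in args])
    return Deferred(run)


X = Deferred()

# An ordinary function inspects its argument right away, finds a placeholder
# instead of a number, and fails before any data exists.
try:
    math.sqrt(X)
    error = None
except TypeError as exc:
    error = exc

# Both workarounds postpone the call until _eval supplies the data.
via_helper = deferred_call(math.sqrt, X)
via_pipe = X.pipe(math.sqrt)
```

The failure happens at expression-building time, far from where the data is eventually supplied, which is part of why forgetting the wrapper is hard to diagnose.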

I like using X a lot, since I understand it well and have built it into my habits. But Python isn't powerful enough to make it work in a totally predictable way, and I don't know if half-solutions like this belong in pandas.

(Thanks for asking!)

@dodger487

To add onto @joshuahhh's comment about passing X as an input to a function: in dplython we use a decorator (DelayFunction) that makes the function check its arguments for any X expressions and, if it finds one, delays the call until the arguments can be supplied. I'd echo that it isn't ideal -- "you get cryptic error messages if you forget to use it." I've toyed with the idea of applying this to all functions in a module upon import, but that seems like it could have some difficulties.
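For readers unfamiliar with the pattern, a decorator of this kind could look roughly like the sketch below. This is not dplython's DelayFunction; delay_function, Deferred and _resolve are hypothetical names, and the placeholder class is reduced to the bare minimum.

```python
import functools


class Deferred:
    """Hypothetical placeholder: a function of data that arrives later."""

    def __init__(self, func=lambda data: data):
        self._func = func

    def _eval(self, data):
        return self._func(data)


def _resolve(value, data):
    return value._eval(data) if isinstance(value, Deferred) else value


def delay_function(f):
    """Call f immediately on plain arguments, lazily on placeholder ones."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        values = list(args) + list(kwargs.values())
        if not any(isinstance(v, Deferred) for v in values):
            return f(*args, **kwargs)  # nothing deferred: behave normally
        # At least one placeholder: return a new placeholder that resolves
        # every argument against the data and then calls f.
        return Deferred(lambda data: f(
            *[_resolve(a, data) for a in args],
            **{k: _resolve(v, data) for k, v in kwargs.items()}))
    return wrapper


@delay_function
def scale(value, factor=1):
    return value * factor


X = Deferred()
immediate = scale(3, factor=2)   # plain arguments, evaluated right away
delayed = scale(X, factor=10)    # placeholder argument, evaluated on demand
```

The decorated function keeps its normal behaviour for ordinary arguments, so the cost is only paid where a placeholder appears. The catch described above remains: an undecorated function still receives the raw placeholder.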

On the other hand, I've found that I don't often need to apply functions to X arguments because there are so many methods on Series. Also, if I'm writing a function that will be applied to an X, it's not too bad to use the DelayFunction decorator.

Overall, I agree there are some difficulties but I'm optimistic about X being a useful solution to include in pandas.

Thanks for including me on the thread!

@shoyer
Member Author

shoyer commented May 11, 2016

@dodger487 @joshuahhh thanks for sharing your thoughts! I think pandas supports method chaining enough that the inability to use arbitrary functions is OK. X.pipe(np.log) feels a little unnatural but is not so terrible. (Note that there are tentative plans, possibly as part of the pandas 1.0 rewrite, to port commonly used numpy ufuncs such as np.log to methods on Series/DataFrame.)

It occurs to me that dask.delayed contains yet another implementation of deferred evaluation that might be a useful reference.

@dpavlic

dpavlic commented Jun 9, 2016

Very intriguing. I've tested out dplython and pandas-ply based on this issue and they both look very interesting. It looks like both use X for their own functions, but it can't be used elsewhere; i.e.:

df[(X.arr > 30) & (X.dep > 30)]

doesn't actually work with either implementation as it is. Your proposal sounds like it would allow its use there, and elsewhere; for example, I'm assuming instead of (please forgive the obviously highly contrived example):

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).assign(c=lambda x: x.a + x.b)

I could instead do:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).assign(c=X.a + X.b)

? While pandas is never going to have some of the sheer convenience of the R syntax for these types of things, that brings it a lot closer from what I can see.

@shoyer
Member Author

shoyer commented Jun 9, 2016

Despite these limitations, I still think pandas.X would be a clear win. The core functionality seems useful enough on its own to merit inclusion in core pandas, even though add-ons like DelayFunction are probably best left to third-party libraries.

@jreback @jorisvandenbossche @TomAugspurger any opinions?

@jreback
Contributor

jreback commented Jun 9, 2016

I think exposing pd.X is pretty reasonable, assuming it's well documented with nice use cases. It's opt-in, so +1.

@lpenguin

lpenguin commented Nov 3, 2017

Guys, I didn't see this issue earlier. I think I did something very similar to the X magic, see #18077.
A proof-of-concept implementation is in https://github.com/lpenguin/pandas-query. Just use from pandas_query import _ as X and you will get similar functionality. Though I didn't implement separate ply_select and ply_where functions, I hacked DataFrame.__getitem__, DataFrame.__setitem__ (column assignment) and DataFrame.__assign__.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel
Member

Discussed on today's dev call, and the consensus is that we don't want to add this to the API. Closing.

9 participants