
API: use argument-free lambdas for injecting DataFrames columns as variables? #13040


Closed
shoyer opened this issue Apr 30, 2016 · 20 comments
Labels: API Design, Needs Discussion (requires discussion from core team before further action)

Comments

@shoyer
Member

shoyer commented Apr 30, 2016

With a little bit of magic, we could make the following syntax work:

(df[lambda: sepal_length > 3]
 .groupby(lambda: pd.cut(sepal_width, 5))
 .apply(lambda: petal_width.mean()))

Syntax like df[lambda: sepal_length > 3] is a less verbose alternative to the recently added df[lambda x: x.sepal_length > 3] (#11485). Here we use lambda essentially in place of a macro that would allow for delayed evaluation (which, of course, Python syntax does not support).

My proposal is to add support for such "thunks" to every pandas method that accepts a callable, where the callable currently must take the DataFrame as its single argument.

Under the covers, this works by (1) copying the globals() dictionary at evaluation time and (2) injecting the current DataFrame into it. We would further ensure that this only works on lambda functions, by checking f.func_name == '<lambda>'.
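As a rough sketch (a hypothetical helper, not an existing pandas API), that check might look something like this:

import types

def _is_argument_free_lambda(f):
    # heuristic: a lambda that takes no positional arguments
    return (isinstance(f, types.FunctionType)
            and f.__name__ == '<lambda>'  # f.func_name on Python 2
            and f.__code__.co_argcount == 0)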

The main gotcha is that it isn't possible to dynamically override local non-global variables without some true dark magic. This means that code like the following is going to behave contrary to expectations:

import numpy as np
import pandas as pd

def x_plus_one(df):
    x = 0
    # uses the local x = 0 instead of df.x
    return df.pipe(lambda: x + 1)

df = pd.DataFrame({'x': np.arange(100)})
result = x_plus_one(df)  # all 1s, not range(1, 101)

Is this so bad? Shadowing variables in an outer scope is already poor design, but this is a pretty serious departure from expectations.

The other danger is that this could mask bugs, e.g., if a user mistakenly types df.pipe(lambda: x) instead of df.pipe(lambda x: x). This is an unavoidable danger of spelling two APIs with similar syntax.

On the plus side, this proposal is safer than @njsmith's "true dark magic" context manager (see above) for injecting DataFrame columns, because there's no possibility of variable assignment inside a lambda.

Would this be a good idea for pandas?

@jreback
Contributor

jreback commented Apr 30, 2016

cc @mrocklin iirc you had some thoughts about macro type things

@jreback
Contributor

jreback commented Apr 30, 2016

We already do a similar thing with .query() by copying scope, though it's much simpler: we don't have all of the function machinery, and until recently it only allowed a single statement (so the possibility of assignment to a local inside was really small); and it's string-based (not lambda-based), so the calling convention can't be confused.
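For reference, .query() resolves column names directly and uses the @ prefix to pull variables from the calling scope, e.g. (illustrative data):

import pandas as pd

df = pd.DataFrame({'sepal_length': [2.5, 3.5, 4.0]})
threshold = 3
# column names resolve against the DataFrame; @threshold comes from the calling scope
df.query('sepal_length > @threshold')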

@njsmith

njsmith commented Apr 30, 2016

Interesting idea :-). I don't have anything particularly useful to say, but two somewhat tangential thoughts:

  • In your example of a lambda accessing a variable from the "local" scope: this is actually creating a closure, so the variable isn't a local inside the lambda -- lookup for closed-over variables is implemented differently than for locals. For locals, the bytecode uses LOAD_FAST, which accesses the special hidden array of local variables:
In [9]: def f():
   ...:     x = 1
   ...:     return x
   ...: 

In [10]: dis.dis(f)
[...]
  3           6 LOAD_FAST                0 (x)
              9 RETURN_VALUE

But for closed-over variables (nonlocal scope -- this is really what it's called :-)), Python uses LOAD_DEREF:

In [11]: def f():
   ....:     x = 1
   ....:     return lambda: x
   ....: 

In [12]: dis.dis(f())
  3           0 LOAD_DEREF               0 (x)
              3 RETURN_VALUE

I'm not sure what LOAD_DEREF's semantics are exactly, but it involves loading a cell object attached to the function (func.__closure__ on py3, maybe .func_closure on py2), and it might be possible to intercept that loading without all the dark magic. Or not, I haven't checked :-) (there's a small cell-inspection snippet after this list).

  • I may have mentioned this before, but I think we could make a reasonable proposal for getting actual runtime-evaluated macros in py 3.6 if someone cares enough to push it forward, with syntax like df![sepal_length > 3] where the ! is a rust-style marker meaning "this invocation gets passed the AST of its arguments instead of the actual values". I don't have time / care enough to take point on this, but I'm happy to help out anyone who does want to take point.
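As a side note on the cell objects mentioned in the first bullet, they can at least be inspected through public attributes (a small illustrative snippet):

def f():
    x = 1
    return lambda: x

g = f()
cell = g.__closure__[0]         # the cell holding the closed-over x
print(cell.cell_contents)       # -> 1
print(g.__code__.co_freevars)   # -> ('x',)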

@shoyer changed the title from "API: use argument-free lambdas for injecting DataFrames columns?" to "API: use argument-free lambdas for injecting DataFrames columns as variables?" Apr 30, 2016
@jreback added the API Design and Needs Discussion labels Apr 30, 2016
@jreback added this to the 0.19.0 milestone Apr 30, 2016
@TomAugspurger
Contributor

I think I'm +0.5 on this :) I need to read through your pandas-magic library again first. I've been mildly annoyed with the verbosity of df.assign(y=lambda x: x...) in the past when doing many assigns.

Now we just need a PEP for accepting λ in place of the lambda keyword :)
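For illustration, the verbosity difference with .assign(): the first call is today's API, the second is the hypothetical syntax proposed here and does not actually run:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# current API: each callable receives the intermediate DataFrame
df.assign(y=lambda d: d.x + 1, z=lambda d: d.x * 2)

# proposed (hypothetical): columns visible directly inside the thunk
# df.assign(y=lambda: x + 1, z=lambda: x * 2)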

@fperez
Contributor

fperez commented May 1, 2016

As much as I've wanted for a long time some type of delayed-evaluation semantics in Python, I'm very leery of these hacks that try to manage the global scope under the hood. They are brittle and hard to understand/debug (as @njsmith's caveats in his dark magic gist indicate). And there's always the possibility that something changes down the road in the CPython implementation itself that breaks this, I don't know the extent to which the semantics of these low-level pieces are official...

But a cleaner solution would be great!

@rsdenijs

rsdenijs commented May 2, 2016

Instead of a lambda, would it be possible to do this with some magic dataframe that defers all evaluations until the context is given?

@datnamer

datnamer commented May 2, 2016

For the longer term, maybe it would be better to marshal behind @Haypo's code transformer PEP: https://www.python.org/dev/peps/pep-0511/

And then get a concerted effort going behind something like macropy.

@mrocklin had some ideas about python macros.

@njsmith

njsmith commented May 2, 2016

I'm not a big fan of the idea of pandas trying to run a global search-replace over all my code, which is what that code transformer PEP basically would give. Along with the obvious concerns about spooky action at a distance, there's the problem that when doing a static search/replace we don't know which bits of code that look like df[...] are actually data frame indexing, and figuring this out would require some kind of nasty static analysis.

But again, if someone wants to write a more targeted pep for what pandas would actually want, then I'm happy to help.

Folks following this might also be interested in the current Python-ideas thread(s) discussing syntax like using some_namespace: ...

@shoyer
Member Author

shoyer commented May 2, 2016

And there's always the possibility that something changes down the road in the CPython implementation itself that breaks this, I don't know the extent to which the semantics of these low-level pieces are official...

This specific implementation is less brittle than most -- far safer than @njsmith's context manager. Again, the implementation is based entirely on public APIs, using the types module from the standard library:

import types

def injected(df, thunk):
    """Evaluate a thunk in the context of DataFrame

    >>> df = pd.DataFrame({'x': [0, 1, 2]}, index=['a', 'b', 'c'])
    >>> injected(df, lambda: x ** 2)
    a    0
    b    1
    c    4
    Name: x, dtype: int64
    """
    new_globals = thunk.__globals__.copy()
    new_globals.update(df)
    new_thunk = types.FunctionType(thunk.__code__, new_globals, thunk.__name__,
                                   thunk.__defaults__, thunk.__closure__)
    return new_thunk()

The problem is that we create the new injected variables as globals in the context of the function evaluation, which is not what one would expect. It's a bad thing if something like this works differently in IPython or a script than when wrapped in a function:

x = 1
df.pipe(lambda: x)
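Concretely, using the injected helper above with a DataFrame that has a column x, the two contexts diverge (illustrative sketch, assuming injected is in scope):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(3)})

x = 1
injected(df, lambda: x)   # module level: x is looked up as a global, so the injected column wins

def wrapped(frame):
    x = 1
    # here x is closed over by the lambda, so injection cannot shadow it
    return injected(frame, lambda: x)

wrapped(df)               # returns 1, not the column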

In theory, I think we could construct a new closure object to create a new scope instead, but that starts to go down the dark magic path.

One thing we could do is raise an error if injecting new variables would overwrite any global or non-local variables. We could do this by checking to make sure that thunk.__closure__ is None and that no dataframe columns are found in thunk.__globals__. But I worry that this wouldn't be very satisfying, either, because it's very common to write code where you do use columns as local variables first, e.g.,

x = ...
df['x'] = x
df[lambda: x > 500]
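For concreteness, the error-raising check described above might be sketched like this (hypothetical helper):

def _check_no_shadowing(df, thunk):
    # refuse to inject if the thunk closes over any local variables
    if thunk.__closure__ is not None:
        raise ValueError("lambda closes over local variables")
    # refuse to inject if a column name would shadow an existing global
    shadowed = set(df.columns) & set(thunk.__globals__)
    if shadowed:
        raise ValueError("columns would shadow globals: %r" % sorted(shadowed))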

Instead of a lambda, would it be possible to do this with some magic dataframe that defers all evaluations until the context is given?

No, unfortunately not unless we're able to change Python itself, because Python builds in eager evaluation.

One viable alternative is the magic X from pandas-ply: http://pythonhosted.org/pandas-ply/

@rsdenijs

rsdenijs commented May 3, 2016

Instead of a lambda, would it be possible to do this with some magic dataframe that defers all evaluations until the context is given?

No, unfortunately not unless we're able to change Python itself, because Python builds in eager evaluation.

One viable alternative is the magic X from pandas-ply: http://pythonhosted.org/pandas-ply/

I was thinking about something in that direction. A "lambda DataFrame" λ could represent a deferred object or an expression tree. A normal DataFrame could then pass self to the λ, which is then evaluated.

@rsdenijs

rsdenijs commented May 5, 2016

@shoyer On Python 2.7.6 and pandas 0.18.0, when I try to play with the magic:

import pandas as pd
import pandas_magic.monkeypatched
pd.DataFrame([1])

I get

TypeError: unbound method _patched_new() must be called with DataFrame instance as first argument (got type instance instead)

Seems like the __new__ method is not being overridden properly?

@shoyer
Member Author

shoyer commented May 5, 2016

@rsdenijs I wrote this some months ago, and it worked on an earlier version of pandas with Python 2.7. It is quite likely that something has broken my hack -- you are welcome to look into fixing it.

@takluyver
Contributor

I like the idea, but the scopes don't work out the way you'd expect without some darker magic than the original proposal. One reason I like Python is that stuff like scoping mostly makes sense, so my mental model of what's going on usually works. I don't think it's worth sacrificing that for this convenience.

My take is that things like this have to be built into the language itself, as Nathaniel mentioned - adding it on top of the language is not going to be very robust or very widely understood by people reading the code.

@shoyer
Member Author

shoyer commented May 10, 2016

@takluyver Agreed. Closing this issue as "won't fix". We need changes to the Python language to make this viable.

@shoyer closed this as completed May 10, 2016
@datnamer

datnamer commented May 10, 2016

@shoyer do you think that will ever happen? Python is at quite a deficit compared to R and Julia for data manipulation syntax.

IIRC Guido said he would entertain a macro PEP.

@shoyer removed this from the 0.19.0 milestone May 10, 2016
@shoyer
Member Author

shoyer commented May 10, 2016

@datnamer I agree with @njsmith above that if someone has sufficient interest to push this, there is a plausible chance of getting this in. I'm also happy to help but not ready to write the PEP myself -- I don't understand Python's internals well enough to provide the necessary technical detail.

@datnamer

Oh sorry, I missed that.

@rsdenijs

But wouldn't it be possible to reduce the amount of required magic by going for the following syntax? It would require a magic object L. Maybe I'm missing the point, but it would only need to "resolve" L, so there would be no conflicts with other local variables.

from pandas import MagicLambda as L

...

(df[L.sepal_length > 3]
 .groupby(pd.cut(L.sepal_width, 5))
 .apply(L.petal_width.mean()))
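A minimal sketch of what such a deferred object could look like (hypothetical; MagicLambda is not an actual pandas class, and only attribute access and > are implemented here):

import pandas as pd

class MagicLambda(object):
    """Builds up a deferred expression; calling it with a DataFrame evaluates it."""
    def __init__(self, func=lambda df: df):
        self._func = func

    def __getattr__(self, name):
        # L.sepal_length -> defer the column lookup
        return MagicLambda(lambda df, f=self._func, n=name: getattr(f(df), n))

    def __gt__(self, other):
        return MagicLambda(lambda df, f=self._func, o=other: f(df) > o)

    def __call__(self, df):
        return self._func(df)

L = MagicLambda()

df = pd.DataFrame({'sepal_length': [2.5, 3.5, 4.0]})
mask = L.sepal_length > 3      # nothing evaluated yet
print(df[mask(df)])            # evaluate against a concrete DataFrame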

@shoyer
Member Author

shoyer commented May 10, 2016

But wouldn't it be possible to reduce the amount of required magic by going for the following syntax? It would require a magic object L.

Yes, in fact pandas-ply already provides almost exactly this object in the form of X, though it might need a little bit of work to ensure __call__ methods are defined appropriately such that we don't need to use install_ply. I'll open a new issue to discuss.

@shoyer
Member Author

shoyer commented May 11, 2016

Opened a new issue to propose porting the magic X from pandas-ply to pandas proper: #13133
