
API: use argument-free lambdas for injecting DataFrames columns as variables? #13040


Closed
shoyer opened this issue Apr 30, 2016 · 20 comments
Labels: API Design, Needs Discussion (requires discussion from core team before further action)

Comments

@shoyer
Member

shoyer commented Apr 30, 2016

With a little bit of magic, we could make the following syntax work:

(df[lambda: sepal_length > 3]
 .groupby(lambda: pd.cut(sepal_width, 5))
 .apply(lambda: petal_width.mean()))

Syntax like df[lambda: sepal_length > 3] is a less verbose alternative to the recently added df[lambda x: x.sepal_length > 3] (#11485). Here we use lambda essentially in place of a macro that would allow for delayed evaluation (which, of course, Python syntax does not support).

My proposal is to add support for such "thunks" to every pandas method that accepts a callable, where the callable currently must take the DataFrame as its single argument.

Under the covers, this works by (1) copying the globals() dictionary at evaluation time and (2) injecting the current DataFrame into it. We would further ensure that this only works on lambda functions, by checking f.func_name == '<lambda>'.
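As a rough sketch (a hypothetical helper, not an existing pandas API), that check might look something like this:

import types

def _is_argument_free_lambda(f):
    # heuristic: a lambda that takes no positional arguments
    return (isinstance(f, types.FunctionType)
            and f.__name__ == '<lambda>'  # f.func_name on Python 2
            and f.__code__.co_argcount == 0)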

The main gotcha is that it isn't possible to dynamically override local non-global variables without some true dark magic. This means that code like the following is going to behave contrary to expectations:

import numpy as np
import pandas as pd

def x_plus_one(df):
    x = 0
    # uses the local x = 0 instead of df.x
    return df.pipe(lambda: x + 1)

df = pd.DataFrame({'x': np.arange(100)})
result = x_plus_one(df)  # all 1s, not range(1, 101)

Is this so bad? Shadowing variables in an outer scope is already poor design, but this is a pretty serious departure from expectations.

The other danger is that this could mask bugs, e.g., if a user mistakenly types df.pipe(lambda: x) instead of df.pipe(lambda x: x). This is an unavoidable danger of spelling two APIs with similar syntax.

On the plus side, this proposal is safer than @njsmith's "true dark magic" context manager (see above) for injecting DataFrame columns, because there's no possibility of variable assignment inside a lambda.

Would this be a good idea for pandas?

@jreback
Contributor

jreback commented Apr 30, 2016

cc @mrocklin iirc you had some thoughts about macro type things

@jreback
Contributor

jreback commented Apr 30, 2016

We already do a similar thing with .query() by copying scope, though it's much simpler: we don't have all of the function machinery, and until recently it only allowed a single statement (so the possibility of assignment to a local inside was really small); and it's string-based (not lambda-based), so the calling convention can't be confused.
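For reference, .query() resolves column names directly and uses the @ prefix to pull variables from the calling scope, e.g. (illustrative data):

import pandas as pd

df = pd.DataFrame({'sepal_length': [2.5, 3.5, 4.0]})
threshold = 3
# column names resolve against the DataFrame; @threshold comes from the calling scope
df.query('sepal_length > @threshold')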

@njsmith

njsmith commented Apr 30, 2016

Interesting idea :-). I don't have anything particularly useful to say, but two somewhat tangential thoughts:

  • In your example of a lambda accessing a variable from the "local" scope: this is actually creating a closure, so the variable isn't a local inside the lambda -- lookup for closed-over variables is implemented differently than for locals. For locals, the bytecode uses LOAD_FAST, which accesses the special hidden array of local variables:
In [9]: def f():
   ...:     x = 1
   ...:     return x
   ...: 

In [10]: dis.dis(f)
[...]
  3           6 LOAD_FAST                0 (x)
              9 RETURN_VALUE

But for closed-over variables (nonlocal scope -- this is really what it's called :-)), Python uses LOAD_DEREF:

In [11]: def f():
   ....:     x = 1
   ....:     return lambda: x
   ....: 

In [12]: dis.dis(f())
  3           0 LOAD_DEREF               0 (x)
              3 RETURN_VALUE

I'm not sure what LOAD_DEREF's semantics are exactly, but it involves loading a cell object attached to the function (func.__closure__ on py3, maybe .func_closure on py2), and it might be possible to intercept that loading without all the dark magic. Or not, I haven't checked :-) (there's a small cell-inspection snippet after this list).

  • I may have mentioned this before, but I think we could make a reasonable proposal for getting actual runtime-evaluated macros in py 3.6 if someone cares enough to push it forward, with syntax like df![sepal_length > 3] where the ! is a rust-style marker meaning "this invocation gets passed the AST of its arguments instead of the actual values". I don't have time / care enough to take point on this, but I'm happy to help out anyone who does want to take point.
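As a side note on the cell objects mentioned in the first bullet, they can at least be inspected through public attributes (a small illustrative snippet):

def f():
    x = 1
    return lambda: x

g = f()
cell = g.__closure__[0]         # the cell holding the closed-over x
print(cell.cell_contents)       # -> 1
print(g.__code__.co_freevars)   # -> ('x',)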

@shoyer changed the title from "API: use argument-free lambdas for injecting DataFrames columns?" to "API: use argument-free lambdas for injecting DataFrames columns as variables?" Apr 30, 2016
@jreback added the API Design and Needs Discussion labels Apr 30, 2016
@jreback added this to the 0.19.0 milestone Apr 30, 2016
@TomAugspurger
Contributor

I think I'm +0.5 on this :) I need to read through your pandas-magic library again first. I've been mildly annoyed with the verbosity of df.assign(y=lambda x: x...) in the past when doing many assigns.

Now we just need a PEP for accepting λ in place of the lambda keyword :)
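For illustration, the verbosity difference with .assign(): the first call is today's API, the second is the hypothetical syntax proposed here and does not actually run:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# current API: each callable receives the intermediate DataFrame
df.assign(y=lambda d: d.x + 1, z=lambda d: d.x * 2)

# proposed (hypothetical): columns visible directly inside the thunk
# df.assign(y=lambda: x + 1, z=lambda: x * 2)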

@fperez
Contributor

fperez commented May 1, 2016

As much as I've wanted for a long time some type of delayed-evaluation semantics in Python, I'm very leery of these hacks that try to manage the global scope under the hood. They are brittle and hard to understand/debug (as @njsmith's caveats in his dark magic gist indicate). And there's always the possibility that something changes down the road in the CPython implementation itself that breaks this, I don't know the extent to which the semantics of these low-level pieces are official...

But a cleaner solution would be great!

@rsdenijs

rsdenijs commented May 2, 2016

Instead of a lambda, would it be possible to do this with some magic dataframe that defers all evaluations until the context is given?

@datnamer

datnamer commented May 2, 2016

For the longer term, maybe it would be better to marshal behind @Haypo's code transformer PEP: https://www.python.org/dev/peps/pep-0511/

And then get a concerted effort going behind something like macropy.

@mrocklin had some ideas about python macros.

@njsmith

njsmith commented May 2, 2016

I'm not a big fan of the idea of pandas trying to run a global search-replace over all my code, which is what that code transformer PEP basically would give. Along with the obvious concerns about spooky action at a distance, there's the problem that when doing a static search/replace we don't know which bits of code that look like df[...] are actually data frame indexing, and figuring this out would require some kind of nasty static analysis.

But again, if someone wants to write a more targeted pep for what pandas would actually want, then I'm happy to help.

Folks following this might also be interested in the current Python-ideas thread(s) discussing syntax like using some_namespace: ...

@shoyer
Member Author

shoyer commented May 2, 2016

And there's always the possibility that something changes down the road in the CPython implementation itself that breaks this, I don't know the extent to which the semantics of these low-level pieces are official...

This specific implementation is less brittle than most -- far safer than @njsmith's context manager. Again, the implementation is based entirely on public APIs, using the types module from the standard library:

import types

def injected(df, thunk):
    """Evaluate a thunk in the context of DataFrame

    >>> df = pd.DataFrame({'x': [0, 1, 2]}, index=['a', 'b', 'c'])
    >>> injected(df, lambda: x ** 2)
    a    0
    b    1
    c    4
    Name: x, dtype: int64
    """
    new_globals = thunk.__globals__.copy()
    new_globals.update(df)
    new_thunk = types.FunctionType(thunk.__code__, new_globals, thunk.__name__,
                                   thunk.__defaults__, thunk.__closure__)
    return new_thunk()

The problem is that we create the new injected variables as globals in the context of the function evaluation, which is not what one would expect. It's a bad thing if something like this works differently in IPython or a script than when wrapped in a function:

x = 1
df.pipe(lambda: x)
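Concretely, using the injected helper above with a DataFrame that has a column x, the two contexts diverge (illustrative sketch, assuming injected is in scope):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(3)})

x = 1
injected(df, lambda: x)   # module level: x is looked up as a global, so the injected column wins

def wrapped(frame):
    x = 1
    # here x is closed over by the lambda, so injection cannot shadow it
    return injected(frame, lambda: x)

wrapped(df)               # returns 1, not the column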

In theory, I think we could construct a new closure object to create a new scope instead, but that starts to go down the dark magic path.

One thing we could do is raise an error if injecting new variables would overwrite any global or non-local variables. We could do this by checking to make sure that thunk.__closure__ is None and that no dataframe columns are found in thunk.__globals__. But I worry that this wouldn't be very satisfying, either, because it's very common to write code where you do use columns as local variables first, e.g.,

x = ...
df['x'] = x
df[lambda: x > 500]
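For concreteness, the error-raising check described above might be sketched like this (hypothetical helper):

def _check_no_shadowing(df, thunk):
    # refuse to inject if the thunk closes over any local variables
    if thunk.__closure__ is not None:
        raise ValueError("lambda closes over local variables")
    # refuse to inject if a column name would shadow an existing global
    shadowed = set(df.columns) & set(thunk.__globals__)
    if shadowed:
        raise ValueError("columns would shadow globals: %r" % sorted(shadowed))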

Instead of a lambda, would it be possible to do this with some magic dataframe that defers all evaluations until the context is given?

No, unfortunately not unless we're able to change Python itself, because Python builds in eager evaluation.

One viable alternative is the magic X from pandas-ply: http://pythonhosted.org/pandas-ply/

@rsdenijs

rsdenijs commented May 3, 2016

Instead of a lambda, would it be possible to do this with some magic dataframe that defers all evaluations until the context is given?

No, unfortunately not unless we're able to change Python itself, because Python builds in eager evaluation.

One viable alternative is the magic X from pandas-ply: http://pythonhosted.org/pandas-ply/

I was thinking about something in that direction. A "lambda DataFrame" λ could represent a deferred object or an expression tree. A normal DataFrame could then pass self to the λ, which is then evaluated.

@rsdenijs

rsdenijs commented May 5, 2016

@shoyer On Python 2.7.6 and pandas 0.18.0, when I try to play with the magic:

import pandas as pd
import pandas_magic.monkeypatched
pd.DataFrame([1])

I get

TypeError: unbound method _patched_new() must be called with DataFrame instance as first argument (got type instance instead)

Seems like the __new__ method is not being overridden properly?

@shoyer
Member Author

shoyer commented May 5, 2016

@rsdenijs I wrote this some months ago, and it worked on an earlier version of pandas with Python 2.7. It is quite likely that something has broken my hack -- you are welcome to look into fixing it.

@takluyver
Contributor

I like the idea, but the scopes don't work out the way you'd expect without some darker magic than the original proposal. One reason I like Python is that stuff like scoping mostly makes sense, so my mental model of what's going on usually works. I don't think it's worth sacrificing that for this convenience.

My take is that things like this have to be built into the language itself, as Nathaniel mentioned - adding it on top of the language is not going to be very robust or very widely understood by people reading the code.

@shoyer
Member Author

shoyer commented May 10, 2016

@takluyver Agreed. Closing this issue as "won't fix". We need changes to the Python language to make this viable.

@shoyer closed this as completed May 10, 2016
@datnamer

datnamer commented May 10, 2016

@shoyer do you think that will ever happen? Python is at quite a deficit compared to R and Julia for data manipulation syntax.

IIRC Guido said he would entertain a macro PEP.

@shoyer removed this from the 0.19.0 milestone May 10, 2016
@shoyer
Member Author

shoyer commented May 10, 2016

@datnamer I agree with @njsmith above that if someone has sufficient interest to push this, there is a plausible chance of getting this in. I'm also happy to help but not ready to write the PEP myself -- I don't understand Python's internals well enough to provide the necessary technical detail.

@datnamer

Oh sorry, I missed that.

@rsdenijs

But wouldn't it be possible to reduce the amount of required magic by going for the following syntax? It would require a magic object L. Maybe I'm missing the point, but it would only need to "resolve" L, so there would be no conflicts with other local variables.

from pandas import MagicLambda as L

...

(df[L.sepal_length > 3]
 .groupby(pd.cut(L.sepal_width, 5))
 .apply(L.petal_width.mean()))
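A minimal sketch of what such a deferred object could look like (hypothetical; MagicLambda is not an actual pandas class, and only attribute access and > are implemented here):

import pandas as pd

class MagicLambda(object):
    """Builds up a deferred expression; calling it with a DataFrame evaluates it."""
    def __init__(self, func=lambda df: df):
        self._func = func

    def __getattr__(self, name):
        # L.sepal_length -> defer the column lookup
        return MagicLambda(lambda df, f=self._func, n=name: getattr(f(df), n))

    def __gt__(self, other):
        return MagicLambda(lambda df, f=self._func, o=other: f(df) > o)

    def __call__(self, df):
        return self._func(df)

L = MagicLambda()

df = pd.DataFrame({'sepal_length': [2.5, 3.5, 4.0]})
mask = L.sepal_length > 3      # nothing evaluated yet
print(df[mask(df)])            # evaluate against a concrete DataFrame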

@shoyer
Member Author

shoyer commented May 10, 2016

But wouldn't it be possible to reduce the amount of required magic by going for the following syntax? It would require a magic object L.

Yes, in fact pandas-ply already provides almost exactly this object in the form of X, though it might need a little bit of work to ensure __call__ methods are defined appropriately such that we don't need to use install_ply. I'll open a new issue to discuss.

@shoyer
Member Author

shoyer commented May 11, 2016

Opened a new issue to propose porting the magic X from pandas-ply to pandas proper: #13133
