API: use argument-free lambdas for injecting DataFrames columns as variables? #13040
Comments
cc @mrocklin IIRC you had some thoughts about macro-type things
we already do a similar thing with
Interesting idea :-). I don't have anything particularly useful to say, but two somewhat tangential thoughts:

```
In [9]: def f():
   ...:     x = 1
   ...:     return x
   ...:

In [10]: dis.dis(f)
[...]
  3           6 LOAD_FAST                0 (x)
              9 RETURN_VALUE
```

But for closed-over variables:

```
In [11]: def f():
   ....:     x = 1
   ....:     return lambda: x
   ....:

In [12]: dis.dis(f())
  3           0 LOAD_DEREF               0 (x)
              3 RETURN_VALUE
```

I'm not sure what LOAD_DEREF's semantics are exactly, but it involves loading a cell object attached to the function.
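For anyone who wants to poke at this, here is a small self-contained illustration (standard CPython introspection, nothing pandas-specific) of the cell objects that LOAD_DEREF reads:

```python
import dis

def f():
    x = 1
    return lambda: x

g = f()
dis.dis(g)                             # shows LOAD_DEREF for x
print(g.__code__.co_freevars)          # ('x',)  -- names loaded via cells
print(g.__closure__[0].cell_contents)  # 1       -- the value stored in the cell
```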
I think I'm +0.5 on this :) I need to read through your pandas-magic library again first. I've been mildly annoyed with the verbosity of the one-argument lambda spelling. Now we just need a PEP for accepting something like this natively.
As much as I've wanted some type of delayed-evaluation semantics in Python for a long time, I'm very leery of these hacks that try to manage the global scope under the hood. They are brittle and hard to understand/debug (as @njsmith's caveats in his dark magic gist indicate). And there's always the possibility that something changes down the road in the CPython implementation itself that breaks this; I don't know the extent to which the semantics of these low-level pieces are official... But a cleaner solution would be great!
Instead of a lambda, would it be possible to do this with some magic DataFrame that defers all evaluations until the context is given?
For the longer term, maybe it would be better to marshal behind @Haypo's code transformer PEP: https://www.python.org/dev/peps/pep-0511/ And then get a concerted effort going behind something like MacroPy. @mrocklin had some ideas about Python macros.
I'm not a big fan of the idea of pandas trying to run a global search-and-replace over all my code, which is what that code transformer PEP basically would give. Along with the obvious concerns about spooky action at a distance, there's the problem that when doing a static search/replace we don't know which bits of code that look like pandas indexing actually are pandas indexing. But again, if someone wants to write a more targeted PEP for what pandas would actually want, then I'm happy to help. Folks following this might also be interested in the current Python-ideas thread(s) discussing syntax along these lines.
This specific implementation is less brittle than most -- far safer than @njsmith's context manager. Again, the implementation is all based on public API, using the types.FunctionType constructor:

```python
import types

import pandas as pd


def injected(df, thunk):
    """Evaluate a thunk in the context of a DataFrame's columns.

    >>> df = pd.DataFrame({'x': [0, 1, 2]}, index=['a', 'b', 'c'])
    >>> injected(df, lambda: x ** 2)
    a    0
    b    1
    c    4
    Name: x, dtype: int64
    """
    new_globals = thunk.__globals__.copy()
    new_globals.update(df)
    new_thunk = types.FunctionType(thunk.__code__, new_globals, thunk.__name__,
                                   thunk.__defaults__, thunk.__closure__)
    return new_thunk()
```

The problem is that we create the new injected variables as globals in the context of the function evaluation, which is not what one would expect. It's a bad thing if something like this works differently in IPython or a script than when wrapped in a function:
```python
x = ...
df['x'] = x
df[lambda: x > 500]
```

In theory, I think we could create a new closure object to create a new scope instead, but that starts to get down the dark magic path. One thing we could do is raise an error if injecting new variables would overwrite any global or non-local variables. We could do this by checking the DataFrame's column names against the thunk's globals and closure variables (a rough sketch is below).
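A minimal sketch of that check, assuming the injected() helper above; check_for_collisions is a hypothetical name, not pandas API, and the check is deliberately coarse (it looks at all globals, not just the names the thunk actually uses):

```python
def check_for_collisions(df, thunk):
    # Names the thunk can resolve through closure cells or module globals.
    closure_names = set(thunk.__code__.co_freevars)
    global_names = set(thunk.__globals__)

    collisions = [name for name in df.columns
                  if name in closure_names or name in global_names]
    if collisions:
        raise ValueError(
            'injecting DataFrame columns would shadow existing variables: '
            + ', '.join(map(str, collisions)))
    # injected() could call this before building the new function.
```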
No, unfortunately not, unless we're able to change Python itself, because Python builds in eager evaluation. One viable alternative is the magic X object from pandas-ply (more on that below).
I was thinking about something in that direction. A "lambda DataFrame" λ could represent a deferred object or an expression tree. A normal DataFrame could then evaluate such an expression against its own columns.
@shoyer On Python 2.7.6 and pandas 0.18.0, when I want to play with the magic, I get an error. Seems like the example no longer works.
@rsdenijs I wrote this some months ago, and it worked on an earlier version of pandas with Python 2.7. It is quite likely that something has broken my hack -- you are welcome to look into fixing it.
I like the idea, but the scopes don't work out the way you'd expect without some darker magic than the original proposal. One reason I like Python is that stuff like scoping mostly makes sense, so my mental model of what's going on usually works. I don't think it's worth sacrificing that for this convenience. My take is that things like this have to be built into the language itself, as Nathaniel mentioned; adding it on top of the language is not going to be very robust or very widely understood by people reading the code.
@takluyver Agreed. Closing this issue as "won't fix". We need changes to the Python language to make this viable.
@shoyer do you think that will ever happen? Python is at quite a deficit compared to R and Julia for data manipulation syntax. IIRC Guido said he would entertain a macro PEP.
Oh sorry, I missed that.
But wouldn't it be possible to reduce the amount of required magic by going for syntax built around a magic object L? Maybe I'm missing the point, but it would only need to "resolve" L, so there would be no conflicts with other local variables. A rough sketch of such an object is below.
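As an illustration of what such a magic L object could look like (hypothetical names; this is not pandas-ply's actual implementation), operator overloading can build a deferred expression that ends up as an ordinary one-argument callable, which pandas indexing already accepts (#11485):

```python
import pandas as pd


class Lazy:
    """Records attribute access and comparisons instead of evaluating them."""

    def __init__(self, func=lambda df: df):
        self._func = func

    def __getattr__(self, name):
        # L.sepal_length -> a new Lazy that will fetch that attribute later.
        return Lazy(lambda df, f=self._func: getattr(f(df), name))

    def __gt__(self, other):
        # Comparisons return a plain callable of one DataFrame argument.
        return lambda df, f=self._func: f(df) > other


L = Lazy()

df = pd.DataFrame({'sepal_length': [2.5, 3.5, 4.5]})
print(df[L.sepal_length > 3])  # rows where sepal_length > 3
```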
Yes, in fact pandas-ply already provides almost exactly this object in the form of X.
Opened a new issue to propose porting the magic X from pandas-ply to pandas proper: #13133
With a little bit of magic, we could make the following syntax work:

```python
df[lambda: sepal_length > 3]
```

This is an alternative to more verbose options like the recently added `df[lambda x: x.sepal_length > 3]` (#11485). Here we use `lambda` essentially in place of a macro that would allow for delayed evaluation (which of course is not supported by Python syntax). My proposal is to add support for such "thunks" in every pandas method that currently accepts a callable taking a single DataFrame argument.
Under the covers, this works by (1) copying the `globals()` dictionary at evaluation time and (2) injecting the current DataFrame into it. We would further ensure that this only works on lambda functions, by checking `f.func_name == '<lambda>'`; a rough sketch of that dispatch is below.
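A minimal sketch of how a method-level dispatch could look. apply_callable is a hypothetical helper, not pandas API; it reuses the injected() function from earlier in the thread and uses __name__, the Python 3 spelling of func_name:

```python
import inspect


def apply_callable(df, func):
    is_lambda = getattr(func, '__name__', None) == '<lambda>'
    takes_no_args = len(inspect.signature(func).parameters) == 0
    if is_lambda and takes_no_args:
        # Argument-free lambda: treat it as a thunk and evaluate it with the
        # DataFrame's columns injected as (global) variables.
        return injected(df, func)
    # Otherwise keep the existing behavior: call it with the DataFrame.
    return func(df)
```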
The main gotcha is that it isn't possible to dynamically override local (non-global) variables without some true dark magic. This means that code like the following is going to behave contrary to expectations; an illustrative sketch follows.
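As a sketch of the kind of code meant here (the names and values are illustrative, and it reuses the injected() helper from earlier in the thread rather than the proposed `df[lambda: ...]` syntax):

```python
import pandas as pd

# assumes injected() from earlier in the thread is already defined

df = pd.DataFrame({'x': [100, 600, 900]})
x = 1000

# At module level, x is a global, so injecting the columns overrides it:
# the thunk sees the column and this returns a boolean Series.
injected(df, lambda: x > 500)

def wrapped(frame):
    x = 1000
    # Inside a function, x is a closure variable (loaded via LOAD_DEREF), so
    # replacing the lambda's __globals__ cannot override it: the thunk sees
    # the int 1000 and this returns the scalar True instead of a Series.
    return injected(frame, lambda: x > 500)

wrapped(df)
```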
Is this so bad? Shadowing variables in an outer scope is already poor design, but this is a pretty serious departure from expectations.

The other danger is that this could mask bugs, e.g., if a user mistakenly types `df.pipe(lambda: x)` instead of `df.pipe(lambda x: x)`. This is an unavoidable danger of spelling two APIs with similar syntax.

On the plus side, this proposal is safer than @njsmith's "true dark magic" context manager (see above) for injecting DataFrame columns, because there's no possibility of variable assignment inside a `lambda`.

Would this be a good idea for pandas?