WIP/API: add magic 'X' for selection #14209

Closed
wants to merge 2 commits into from

Conversation

chris-b1
Contributor

This is very WIP, but I wanted to put it up to show the general direction. It adds what is essentially a modified version of pandas_ply that produces plain callables, which can be passed to the existing []/assign methods. Short demo below.

One thing that's tricky is figuring out when an expression is "complete." pandas_ply and dplython don't have to do this because they use a special method to instantiate the selection, but I'd prefer not to do this if possible, so this doesn't touch any pandas internals. There's one example below (X.c.str.upper()) that shows where the current heuristic is failing.
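To make the "plain callables" idea concrete, here is a minimal sketch of a deferred-expression object. It is not the code in this PR: the class name `Deferred`, its methods, and the small set of supported operators are all assumptions chosen for illustration. pandas already accepts callables in `[]` and `assign`, so an object along these lines could in principle be passed to them directly; the demo below uses a plain namespace object so that it runs without pandas.

```python
import operator
from types import SimpleNamespace


class Deferred:
    """Hypothetical stand-in for X: records an expression, evaluates it later."""

    def __init__(self, func=None):
        # func maps the eventual target (e.g. a DataFrame) to a value;
        # the root object simply returns the target itself.
        self._func = func if func is not None else (lambda obj: obj)

    def __call__(self, obj):
        # Calling means "evaluate against obj", which is what lets the
        # expression be used wherever an ordinary callable is accepted.
        return self._func(obj)

    def __getattr__(self, name):
        # Only reached for names not found normally, e.g. X.a
        if name.startswith("_"):
            raise AttributeError(name)
        return Deferred(lambda obj: getattr(self._func(obj), name))

    def __getitem__(self, key):
        # X['col'] -> callable that returns obj['col']
        return Deferred(lambda obj: self._func(obj)[key])

    def _binop(self, other, op):
        # Build a new deferred node; the other operand may itself be deferred
        def evaluate(obj):
            rhs = other(obj) if isinstance(other, Deferred) else other
            return op(self._func(obj), rhs)
        return Deferred(evaluate)

    def __gt__(self, other):
        return self._binop(other, operator.gt)

    def __eq__(self, other):
        return self._binop(other, operator.eq)

    def __add__(self, other):
        return self._binop(other, operator.add)

    def __sub__(self, other):
        return self._binop(other, operator.sub)


X = Deferred()
row = SimpleNamespace(a=3, b=1.5)

print((X.a > 1)(row))  # True
print((X.a + 1)(row))  # 4
```

Because `__call__` is reserved for evaluation in this sketch, a deferred method call such as `X.c.str.upper()` cannot be distinguished from an attempt to evaluate the expression. That is one form of the "when is the expression complete" ambiguity described above.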

cc @shoyer, @jreback, @joshuahhh @dodger487, welcome any thoughts

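import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[1,2,3], 'b':[1.5, 2.5, 3.4],
                   'c':['abc', 'def', 'efg'],
                   'd':pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-03'])})

# X is added by this PR; it is not available in released pandas
from pandas import X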

df[X.a > 1]
Out[3]: 
   a    b    c          d
1  2  2.5  def 2014-01-02
2  3  3.4  efg 2014-01-03

df[X.d.dt.day == 2]
Out[4]: 
   a    b    c          d
1  2  2.5  def 2014-01-02

df.assign(e=X.a+1)
Out[5]: 
   a    b    c          d  e
0  1  1.5  abc 2014-01-01  2
1  2  2.5  def 2014-01-02  3
2  3  3.4  efg 2014-01-03  4

df.assign(e=X.b.pipe(np.exp))
Out[6]: 
   a    b    c          d          e
0  1  1.5  abc 2014-01-01   4.481689
1  2  2.5  def 2014-01-02  12.182494
2  3  3.4  efg 2014-01-03  29.964100

# this should work, but doesn't
df.assign(e=X.c.str.upper())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-65b030f82188> in <module>()
----> 1 df.assign(e=X.c.str.upper())

# this can't work, but need to give a good error msg
df.assign(e=np.log(X.a))
---------------------------------------------------------------------------
TypeError: 
pandas `X` is a deferred object that cannot be passed into
functions. The object was attempted to be converted to a numpy array which is invalid.
To pass a deferred Series into a function, use the .pipe
function, for example, X.a.pipe(np.log), instead np.log(df.a)
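For comparison, the following sketch shows how the same computation can be written with tools that already exist in pandas, without `X`. The frame and column names are made up for illustration. A lambda defers the NumPy call until `assign` supplies the frame, and `Series.pipe` applies the function to a concrete Series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# A plain callable: np.log runs only when assign() passes in the frame
out = df.assign(e=lambda x: np.log(x.a))

# Series.pipe on a concrete Series gives the same values
same = df.assign(e=df.a.pipe(np.log))
```

Neither form asks NumPy to convert a deferred object into an array, which is the step that fails in the `np.log(X.a)` case above.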

@wesm
Member

wesm commented Sep 13, 2016

I'm neutral to negative on this type of solution until we do a more thorough analysis of what kind of deferred expression API we might want in pandas. I feel like it might want to wait for pandas 2.0 to have time to incubate and see some hardening through use.

@codecov-io

codecov-io commented Sep 13, 2016

Current coverage is 85.21% (diff: 72.72%)

Merging #14209 into master will decrease coverage by 0.02%

@@             master     #14209   diff @@
==========================================
  Files           140        141     +1   
  Lines         50563      50684   +121   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43103      43191    +88   
- Misses         7460       7493    +33   
  Partials          0          0          

Powered by Codecov. Last update 461e0e9...3c50338

@chris-b1
Contributor Author

chris-b1 commented Sep 13, 2016

That's reasonable, and I'm not sure myself it should be in pandas. That said, a couple points. First, if this were included, I'd definitely mark it opt-in, experimental, etc.

Second, I don't see this as a delayed API solution, just a way to smooth over a particular, existing use. The deferred pieces already exist in pandas; this just provides an arguably nicer way to express them. E.g., you can already do:

   (pd.read_csv(...)
    .assign(a=lambda x: x.b + 1,
            c=lambda x: x.c - 2)
    [lambda x: x.a > 100])

Whereas X allows this:

   (pd.read_csv(...)
    .assign(a=X.b + 1,
            c=X.c - 2)
    [X.a > 100])
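A runnable version of the lambda-based chain, using a small in-memory frame in place of `pd.read_csv(...)`. The data values are invented for illustration; the `assign` and `[]` calls are the existing pandas behaviour referred to above.

```python
import pandas as pd

# Stand-in for the CSV data; values are made up
df = pd.DataFrame({"b": [50, 150, 250], "c": [5, 6, 7]})

result = (
    df
    .assign(a=lambda x: x.b + 1,   # new column from the frame at call time
            c=lambda x: x.c - 2)   # overwrite an existing column
    [lambda x: x.a > 100]          # filter with a callable indexer
)
```

Each lambda receives the DataFrame as it exists at that point in the chain, which is why the filter can refer to the column `a` created one step earlier.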

@chris-b1 chris-b1 added API Design Needs Discussion Requires discussion from core team before further action labels Sep 18, 2016
@bkandel
Contributor

bkandel commented Sep 19, 2016

What's the intended relationship between X and the .query method on dataframes? I.e. you can already do df.query("a > 1") or df.query("exp(a) > 3"). To my eyes, the query syntax is less opaque than the "magic X" syntax, which looks to the uninitiated like a dataframe that has not been created.
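As a point of reference for the comparison, this sketch puts three existing spellings of the same filter side by side: the string-based `query`, a boolean mask, and a callable indexer. The frame is a cut-down, made-up version of the demo data.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.4]})

by_query = df.query("a > 1")         # expression given as a string
by_mask = df[df.a > 1]               # needs the frame bound to a name
by_callable = df[lambda x: x.a > 1]  # deferred, works mid-chain
```

All three select the same rows; the proposed `X.a > 1` would be another way to write the callable form.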

@chris-b1
Contributor Author

Technically this is a bit more flexible because it could handle column names that aren't valid Python names (e.g. .query can't express df[X['col with spaces'] == 5]), but basically this is just an alternative that avoids coding in strings.
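A short sketch of the column-name case, with a made-up frame. The callable form works with any column label because it uses ordinary `[]` lookup. Note that pandas releases later than this discussion (0.25 and newer) also accept backtick-quoted names inside `query`, so the second line assumes such a version.

```python
import pandas as pd

df = pd.DataFrame({"col with spaces": [4, 5, 6]})

# Callable indexer: plain item lookup, no restriction on the label
picked = df[lambda x: x["col with spaces"] == 5]

# Assumes pandas >= 0.25, where backticks quote such names in query()
picked_query = df.query("`col with spaces` == 5")
```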

@chris-b1
Contributor Author

I'm going to close this for now - I still do think something like the X would be helpful for usability, but adding a 4th way to do something probably isn't the answer, especially given the 2.0 work.

@chris-b1 chris-b1 closed this Oct 30, 2016
Successfully merging this pull request may close these issues.

API: port the magic X from pandas_ply/dplython to pandas proper?
4 participants