WIP/API: add magic 'X' for selection #14209

chris-b1 · 2016-09-13T01:51:32Z

could close API: port the magic X from pandas_ply/dplython to pandas proper? #13133
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

This is still very much a WIP, but I wanted to put it up to show the general direction. It adds what is essentially a modified version of pandas_ply that produces plain callables, which can be passed to the existing []/assign methods. Short demo below.

One tricky part is figuring out when an expression is "complete." pandas_ply and dplython don't have to do this because they use a special method to instantiate the selection, but I'd prefer to avoid that if possible, so that this doesn't touch any pandas internals. There's one example below (X.c.str.upper()) that shows where the current heuristic fails.
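To make the "complete expression" problem concrete, here is a minimal sketch of a deferred object that builds plain callables. It is not the code in this PR: the names `Deferred` and `_resolve` are hypothetical, and the rule in `__call__` (a lone DataFrame argument means "evaluate now", anything else means "record a method call") is an assumed heuristic chosen for illustration. It relies only on the fact that `[]` and `assign` already call any callable they are given with the frame.

```python
import operator

import pandas as pd


class Deferred:
    """Hypothetical stand-in for the PR's `X`; not the PR's actual code."""

    def __init__(self, func=lambda df: df):
        # func maps a DataFrame to the value of the expression built so far
        self._func = func

    def __getattr__(self, name):
        # Only reached for attributes not found normally. Refuse private and
        # dunder names so protocol probes (e.g. __array__) fail cleanly.
        if name.startswith('_'):
            raise AttributeError(name)
        return Deferred(lambda df: getattr(self._func(df), name))

    def __getitem__(self, key):
        return Deferred(lambda df: self._func(df)[key])

    def _binary(op):
        # Build an operator method that defers `op` until a frame is supplied
        def method(self, other):
            return Deferred(lambda df: op(self._func(df), _resolve(other, df)))
        return method

    __gt__ = _binary(operator.gt)
    __eq__ = _binary(operator.eq)
    __add__ = _binary(operator.add)
    __sub__ = _binary(operator.sub)
    del _binary

    def __call__(self, *args, **kwargs):
        # Assumed heuristic: a lone DataFrame argument means pandas is
        # evaluating the finished expression; anything else is treated as a
        # deferred method call such as X.c.str.upper().
        if len(args) == 1 and not kwargs and isinstance(args[0], pd.DataFrame):
            return self._func(args[0])
        return Deferred(lambda df: self._func(df)(
            *[_resolve(a, df) for a in args],
            **{k: _resolve(v, df) for k, v in kwargs.items()}))


def _resolve(value, df):
    # Evaluate nested deferred expressions; pass ordinary values through
    return value(df) if isinstance(value, Deferred) else value


X = Deferred()

df = pd.DataFrame({'a': [1, 2, 3], 'c': ['abc', 'def', 'efg']})
filtered = df[X.a > 1]                   # rows where a > 1
shifted = df.assign(e=X.a + 1)           # new column from arithmetic
upper = df.assign(e=X.c.str.upper())     # deferred method call
```

The ambiguity is visible in `__call__`: the same method has to serve both as "pandas is evaluating me" and as "the user is calling a method on the eventual Series". This sketch would misfire for a method whose only argument is itself a DataFrame, which is the kind of corner case any such heuristic has to decide on.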

cc @shoyer, @jreback, @joshuahhh @dodger487, welcome any thoughts

import numpy as np
import pandas as pd
from pandas import X  # added by this PR

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1.5, 2.5, 3.4],
                   'c': ['abc', 'def', 'efg'],
                   'd': pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-03'])})

df[X.a > 1]
Out[3]: 
   a    b    c          d
1  2  2.5  def 2014-01-02
2  3  3.4  efg 2014-01-03

df[X.d.dt.day == 2]
Out[4]: 
   a    b    c          d
1  2  2.5  def 2014-01-02

df.assign(e=X.a+1)
Out[5]: 
   a    b    c          d  e
0  1  1.5  abc 2014-01-01  2
1  2  2.5  def 2014-01-02  3
2  3  3.4  efg 2014-01-03  4

df.assign(e=X.b.pipe(np.exp))
Out[6]: 
   a    b    c          d          e
0  1  1.5  abc 2014-01-01   4.481689
1  2  2.5  def 2014-01-02  12.182494
2  3  3.4  efg 2014-01-03  29.964100

# this should work, but doesn't
df.assign(e=X.c.str.upper())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-65b030f82188> in <module>()
----> 1 df.assign(e=X.c.str.upper())

# this can't work, but need to give a good error msg
df.assign(e=np.log(X.a))
---------------------------------------------------------------------------
TypeError: 
pandas `X` is a deferred object that cannot be passed into
functions. The object was attempted to be converted to a numpy array which is invalid.
To pass a deferred Series into a function, use the .pipe
function, for example, X.a.pipe(np.log), instead np.log(df.a)
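One way to produce an error like the one above is to define `__array__` on the deferred object and raise from it, since NumPy calls that method when it tries to coerce an argument to an array. The following is a sketch under that assumption, not the PR's code; the class name `Deferred` and the message wording are illustrative.

```python
import numpy as np


class Deferred:
    """Hypothetical deferred placeholder; only the error path is sketched."""

    def __array__(self, *args, **kwargs):
        # NumPy calls __array__ when coercing the object to an ndarray;
        # raising here turns np.log(X) into an explanatory TypeError.
        raise TypeError(
            "deferred object cannot be passed into functions; "
            "use .pipe instead, e.g. X.a.pipe(np.log)")


X = Deferred()

try:
    np.log(X)
except TypeError as exc:
    message = str(exc)
```

Raising from `__array__` covers any function that coerces its input to an array, not just ufuncs, at the cost of only firing once NumPy actually attempts the conversion.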

wesm · 2016-09-13T03:50:04Z

I'm neutral to negative on this type of solution until we do a more thorough analysis of what kind of deferred expression API we might want in pandas. I feel like it might want to wait for pandas 2.0 to have time to incubate and see some hardening through use.

codecov-io · 2016-09-13T04:14:34Z

Current coverage is 85.21% (diff: 72.72%)

Merging #14209 into master will decrease coverage by 0.02%

@@             master     #14209   diff @@
==========================================
  Files           140        141     +1   
  Lines         50563      50684   +121   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43103      43191    +88   
- Misses         7460       7493    +33   
  Partials          0          0

Powered by Codecov. Last update 461e0e9...3c50338

chris-b1 · 2016-09-13T10:50:03Z

That's reasonable, and I'm not sure myself that it should be in pandas. That said, a couple of points. First, if this were included, I'd definitely mark it opt-in, experimental, etc.

Second, I don't see this as a delayed API solution, just a smoothover for a particular, existing use. The deferred pieces already exist in pandas; this just provides an arguably nicer way to express them. E.g., you can already do:

   (pd.read_csv(...)
    .assign(a=lambda x: x.b + 1,
            c=lambda x: x.c - 2)
    [lambda x: x.a > 100])

whereas X allows this:

   (pd.read_csv(...)
    .assign(a=X.b + 1,
            c=X.c - 2)
    [X.a > 100])
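For reference, the lambda form of that chain runs on pandas as it stands. The sketch below substitutes a small in-memory frame for the elided `pd.read_csv(...)` call; the column values are made up purely for illustration.

```python
import pandas as pd

# Stand-in for the elided pd.read_csv(...) call; values are illustrative
df = pd.DataFrame({'b': [50, 150, 250], 'c': [10, 20, 30]})

result = (df
          .assign(a=lambda x: x.b + 1,    # callable receives the frame
                  c=lambda x: x.c - 2)
          [lambda x: x.a > 100])          # callable returns a boolean mask
```

Each lambda is called with the frame at that point in the chain, so the final filter can refer to the column `a` that `assign` just created.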

bkandel · 2016-09-19T14:06:28Z

What's the intended relationship between X and the .query method on dataframes? I.e. you can already do df.query("a > 1") or df.query("exp(a) > 3"). To my eyes, the query syntax is less opaque than the "magic X" syntax, which looks to the uninitiated like a dataframe that has not been created.

chris-b1 · 2016-09-19T14:15:38Z

Technically this is a bit more flexible because it can handle column names that aren't valid Python names (e.g. .query can't express df[X['col with spaces'] == 5]), but basically this is just an alternative that avoids coding in strings.
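The comparison can be tried with the callable form that pandas already accepts, since `X` itself is not available outside this PR. A small sketch, with a made-up frame and a column name chosen to contain spaces:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'col with spaces': [5, 6, 5]})

by_query = df.query("a > 1")            # expression written as a string
by_callable = df[lambda x: x.a > 1]     # same filter as a plain callable

# Item access works for any column label, valid identifier or not
spaced = df[lambda x: x['col with spaces'] == 5]
```

Both filters select the same rows; the difference is whether the expression lives in a string or in ordinary Python code.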

chris-b1 · 2016-10-30T14:59:50Z

I'm going to close this for now - I still do think something like the X would be helpful for usability, but adding a 4th way to do something probably isn't the answer, especially given the 2.0 work.

chris-b1 added 2 commits September 12, 2016 20:35

API: add magic 'X' for selection

68a24a3

some fixups

3c50338

chris-b1 added the API Design and Needs Discussion labels Sep 18, 2016

chris-b1 closed this Oct 30, 2016
