ENH: Add pipe method #10253
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -624,6 +624,77 @@ We can also pass infinite values to define the bins: | |
Function application | ||
-------------------- | ||
|
||
To apply your own or another library's functions to pandas objects, | ||
you should be aware of the three methods below. The appropriate | ||
method to use depends on whether your function expects to operate | ||
on an entire ``DataFrame`` or ``Series``, row- or column-wise, or elementwise. | ||
|
||
1. `Tablewise Function Application`_: :meth:`~DataFrame.pipe` | ||
2. `Row or Column-wise Function Application`_: :meth:`~DataFrame.apply` | ||
3. Elementwise_ function application: :meth:`~DataFrame.applymap` | ||
|
||
.. _basics.pipe: | ||
|
||
Tablewise Function Application | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
.. versionadded:: 0.16.2 | ||
|
||
``DataFrames`` and ``Series`` can of course just be passed into functions. | ||
However, if the function needs to be called in a chain, consider using the :meth:`~DataFrame.pipe` method. | ||
Compare the following | ||
|
||
.. code-block:: python | ||
|
||
# f, g, and h are functions taking and returning ``DataFrames`` | ||
>>> f(g(h(df), arg1=1), arg2=2, arg3=3) | ||
|
||
with the equivalent | ||
|
||
.. code-block:: python | ||
|
||
>>> (df.pipe(h) | ||
.pipe(g, arg1=1) | ||
.pipe(f, arg2=2, arg3=3) | ||
Review comment: I think the commas at the end of the lines are not correct here?
||
) | ||
|
||
Pandas encourages the second style, which is known as method chaining. | ||
``pipe`` makes it easy to use your own or another library's functions | ||
in method chains, alongside pandas' methods. | ||
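
As a rough illustration of the two styles, the sketch below checks that the nested and chained forms give the same result. The functions ``drop_missing``, ``scale``, and ``add_offset`` are hypothetical stand-ins for ``h``, ``g``, and ``f``; they are not part of pandas or of this pull request.

```python
import pandas as pd


# Hypothetical stand-ins for h, g, and f: each takes and returns a DataFrame.
def drop_missing(df):
    return df.dropna()


def scale(df, factor=1):
    return df * factor


def add_offset(df, offset=0, column="A"):
    out = df.copy()
    out[column] = out[column] + offset
    return out


df = pd.DataFrame({"A": [1.0, None, 3.0]})

# Nested style: reads from the inside out.
nested = add_offset(scale(drop_missing(df), factor=2), offset=1, column="A")

# Chained style: reads from top to bottom, arguments next to their function.
chained = (df.pipe(drop_missing)
             .pipe(scale, factor=2)
             .pipe(add_offset, offset=1, column="A"))

assert nested.equals(chained)
```

Both forms call the same functions in the same order; only the reading direction changes.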
|
||
In the example above, the functions ``f``, ``g``, and ``h`` each expected the ``DataFrame`` as the first positional argument. | ||
What if the function you wish to apply takes its data as, say, the second argument? | ||
In this case, provide ``pipe`` with a tuple of ``(callable, data_keyword)``. | ||
``.pipe`` will route the ``DataFrame`` to the argument specified in the tuple. | ||
|
||
For example, we can fit a regression using statsmodels. Their API expects a formula first and a ``DataFrame`` as the second argument, ``data``. We pass in the function, keyword pair ``(sm.poisson, 'data')`` to ``pipe``: | ||
|
||
.. ipython:: python | ||
|
||
import statsmodels.formula.api as sm | ||
|
||
bb = pd.read_csv('data/baseball.csv', index_col='id') | ||
|
||
(bb.query('h > 0') | ||
.assign(ln_h = lambda df: np.log(df.h)) | ||
.pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)') | ||
.fit() | ||
.summary() | ||
) | ||
|
||
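The tuple form can also be tried without statsmodels. The sketch below assumes a hypothetical function ``summarize`` that, like the formula API, takes its data as a second argument named ``data``; the column names are made up for illustration.

```python
import pandas as pd


# Hypothetical function whose data argument is second and named ``data``.
def summarize(column, data):
    return data[column].sum()


df = pd.DataFrame({"hr": [1, 2, 3], "g": [10, 20, 30]})

# The tuple tells pipe to pass the DataFrame as the ``data`` keyword;
# "hr" is forwarded unchanged as the first positional argument.
total = df.pipe((summarize, "data"), "hr")

# The equivalent direct call.
assert total == summarize("hr", data=df)
```
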
The pipe method is inspired by Unix pipes and more recently dplyr_ and magrittr_, which | |
have introduced the popular ``(%>%)`` (read pipe) operator for R_. | |
The implementation of ``pipe`` here is quite clean and feels right at home in Python. | |
We encourage you to view the source code (``pd.DataFrame.pipe??`` in IPython). | ||
|
||
.. _dplyr: https://github.com/hadley/dplyr | ||
.. _magrittr: https://github.com/smbache/magrittr | ||
.. _R: http://www.r-project.org | ||
|
||
|
||
|
||
Row or Column-wise Function Application | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Arbitrary functions can be applied along the axes of a DataFrame or Panel | ||
using the :meth:`~DataFrame.apply` method, which, like the descriptive | ||
statistics methods, takes an optional ``axis`` argument: | |
|
@@ -678,6 +749,7 @@ Series operation on each column or row: | |
tsdf | ||
tsdf.apply(pd.Series.interpolate) | ||
|
||
|
||
Finally, :meth:`~DataFrame.apply` takes an argument ``raw`` which is False by default, which | ||
converts each row or column into a Series before applying the function. When | ||
set to True, the passed function will instead receive an ndarray object, which | ||
|
@@ -690,6 +762,8 @@ functionality. | |
functionality for grouping by some criterion, applying, and combining the | ||
results into a Series, DataFrame, etc. | ||
|
||
.. _Elementwise: | ||
|
||
|
||
Applying elementwise Python functions | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,6 +10,7 @@ We recommend that all users upgrade to this version. | |
Highlights include: | ||
|
||
- Documentation on how to use ``numba`` with *pandas*, see :ref:`here <enhancingperf.numba>` | ||
- A new ``pipe`` method, see :ref:`here <whatsnew_0162.enhancements.pipe>` | ||
|
||
Check the :ref:`API Changes <whatsnew_0162.api>` before updating. | ||
|
||
|
@@ -22,6 +23,62 @@ Check the :ref:`API Changes <whatsnew_0162.api>` before updating. | |
New features | ||
~~~~~~~~~~~~ | ||
|
||
.. _whatsnew_0162.enhancements.pipe: | ||
|
||
Pipe | ||
^^^^ | ||
|
||
We've introduced a new method :meth:`DataFrame.pipe`. As suggested by the name, ``pipe`` | ||
Review comment: add a link and
||
should be used to pipe data through a chain of function calls. | ||
The goal is to avoid confusing nested function calls like | ||
|
||
.. code-block:: python | ||
|
||
# df is a DataFrame | ||
# f, g, and h are functions that take and return DataFrames | ||
f(g(h(df), arg1=1), arg2=2, arg3=3) | ||
|
||
The logic flows from inside out, and function names are separated from their keyword arguments. | ||
This can be rewritten as | ||
|
||
.. code-block:: python | ||
|
||
(df.pipe(h) | ||
.pipe(g, arg1=1) | ||
.pipe(f, arg2=2, arg3=3) | |
) | ||
|
||
Now both the code and the logic flow from top to bottom. Keyword arguments are next to | ||
their functions. Overall the code is much more readable. | ||
|
||
In the example above, the functions ``f``, ``g``, and ``h`` each expected the DataFrame as the first positional argument. | ||
When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple | ||
of ``(function, keyword)`` indicating where the DataFrame should flow. For example: | ||
|
||
.. ipython:: python | ||
|
||
import statsmodels.formula.api as sm | ||
|
||
bb = pd.read_csv('data/baseball.csv', index_col='id') | ||
|
||
# sm.poisson takes (formula, data) | ||
(bb.query('h > 0') | ||
.assign(ln_h = lambda df: np.log(df.h)) | ||
.pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)') | ||
.fit() | ||
.summary() | ||
) | ||
|
||
The pipe method is inspired by Unix pipes, which stream text through | |
processes. More recently dplyr_ and magrittr_ have introduced the | ||
popular ``(%>%)`` pipe operator for R_. | ||
|
||
See the :ref:`documentation <basics.pipe>` for more. (:issue:`10129`) | ||
|
||
.. _dplyr: https://github.com/hadley/dplyr | ||
.. _magrittr: https://github.com/smbache/magrittr | ||
.. _R: http://www.r-project.org | ||
|
||
.. _whatsnew_0162.enhancements.other: | ||
|
||
Other enhancements | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,4 +57,3 @@ | |
from pandas.util.print_versions import show_versions | ||
import pandas.util.testing | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2045,6 +2045,68 @@ def sample(self, n=None, frac=None, replace=False, weights=None, random_state=No | |
locs = rs.choice(axis_length, size=n, replace=replace, p=weights) | ||
return self.take(locs, axis=axis) | ||
|
||
_shared_docs['pipe'] = (""" | ||
Apply func(self, *args, **kwargs) | ||
|
||
.. versionadded:: 0.16.2 | ||
|
||
Parameters | ||
---------- | ||
func : function | ||
function to apply to the %(klass)s. | ||
``args`` and ``kwargs`` are passed into ``func``. | |
Alternatively a ``(callable, data_keyword)`` tuple where | ||
``data_keyword`` is a string indicating the keyword of | ||
``callable`` that expects the %(klass)s. | ||
args : positional arguments passed into ``func``. | ||
kwargs : a dictionary of keyword arguments passed into ``func``. | ||
|
||
Returns | ||
------- | ||
object : the return type of ``func``. | ||
|
||
Notes | ||
----- | ||
|
||
Use ``.pipe`` when chaining together functions that expect | |
Series or DataFrames. Instead of writing | |
|
||
>>> f(g(h(df), arg1=a), arg2=b, arg3=c) | ||
|
||
You can write | ||
|
||
>>> (df.pipe(h) | ||
... .pipe(g, arg1=a) | ||
... .pipe(f, arg2=b, arg3=c) | ||
... ) | ||
|
||
Review comment: maybe show an example of using the callable & data_keyword in the Notes? (can do later)
Reply: Added.
||
If you have a function that takes the data as (say) the second | ||
argument, pass a tuple indicating which keyword expects the | ||
data. For example, suppose ``f`` takes its data as ``arg2``: | ||
|
||
>>> (df.pipe(h) | ||
... .pipe(g, arg1=a) | ||
... .pipe((f, 'arg2'), arg1=a, arg3=c) | ||
... ) | ||
|
||
See Also | ||
-------- | ||
pandas.DataFrame.apply | ||
pandas.DataFrame.applymap | ||
pandas.Series.map | ||
""" | ||
) | ||
@Appender(_shared_docs['pipe'] % _shared_doc_kwargs) | ||
def pipe(self, func, *args, **kwargs): | ||
if isinstance(func, tuple): | ||
func, target = func | ||
if target in kwargs: | ||
msg = '%s is both the pipe target and a keyword argument' % target | ||
raise ValueError(msg) | ||
kwargs[target] = self | ||
Review comment: Should we add some validation logic here to ensure that
if target in kwargs:
raise ValueError('%s is both the pipe target and a keyword argument' % target)
I'm not entirely sure it's worth complexifying things here.
Reply: Are you saying maybe the user did something like df.pipe((sns.violinplot, 'data'), x='x', y='y', data=df) I guess the precedent here is with Python itself raising when you call
Reply: yes, that's the case I was thinking about
||
return func(*args, **kwargs) | ||
else: | ||
return func(self, *args, **kwargs) | ||
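
The dispatch logic of the method above can be exercised on its own. This is a minimal sketch that mirrors it as a free function; the name ``pipe`` as a standalone function and the helper ``power`` are assumptions for illustration, not part of the pull request.

```python
def pipe(obj, func, *args, **kwargs):
    """Free-function sketch of the dispatch logic in the method above."""
    if isinstance(func, tuple):
        # (callable, data_keyword): route obj to the named keyword.
        func, target = func
        if target in kwargs:
            raise ValueError(
                "%s is both the pipe target and a keyword argument" % target
            )
        kwargs[target] = obj
        return func(*args, **kwargs)
    # Plain callable: obj becomes the first positional argument.
    return func(obj, *args, **kwargs)


# Hypothetical helper used only to show where the piped object lands.
def power(x, y):
    return x ** y


assert pipe(3, power, 2) == 9          # calls power(3, 2)
assert pipe(3, (power, "y"), 2) == 8   # calls power(2, y=3)

# Passing the target keyword explicitly as well is rejected.
try:
    pipe(3, (power, "y"), x=2, y=5)
except ValueError as err:
    message = str(err)
```

Raising here gives a clearer error than letting the later assignment silently overwrite the caller's keyword argument.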
|
||
#---------------------------------------------------------------------- | ||
# Attribute access | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1649,6 +1649,48 @@ def test_describe_raises(self): | |
with tm.assertRaises(NotImplementedError): | ||
tm.makePanel().describe() | ||
|
||
def test_pipe(self): | ||
df = DataFrame({'A': [1, 2, 3]}) | ||
Review comment: Can you add a test with a Panel (of some sort just to validate)
||
f = lambda x, y: x ** y | ||
result = df.pipe(f, 2) | ||
expected = DataFrame({'A': [1, 4, 9]}) | ||
self.assert_frame_equal(result, expected) | ||
|
||
result = df.A.pipe(f, 2) | ||
self.assert_series_equal(result, expected.A) | ||
|
||
def test_pipe_tuple(self): | ||
df = DataFrame({'A': [1, 2, 3]}) | ||
f = lambda x, y: y | ||
result = df.pipe((f, 'y'), 0) | ||
self.assert_frame_equal(result, df) | ||
|
||
result = df.A.pipe((f, 'y'), 0) | ||
self.assert_series_equal(result, df.A) | ||
|
||
def test_pipe_tuple_error(self): | ||
df = DataFrame({"A": [1, 2, 3]}) | ||
f = lambda x, y: y | ||
with tm.assertRaises(ValueError): | ||
result = df.pipe((f, 'y'), x=1, y=0) | ||
|
||
with tm.assertRaises(ValueError): | ||
result = df.A.pipe((f, 'y'), x=1, y=0) | ||
|
||
def test_pipe_panel(self): | ||
wp = Panel({'r1': DataFrame({"A": [1, 2, 3]})}) | ||
f = lambda x, y: x + y | ||
result = wp.pipe(f, 2) | ||
expected = wp + 2 | ||
assert_panel_equal(result, expected) | ||
|
||
result = wp.pipe((f, 'y'), x=1) | ||
expected = wp + 1 | ||
assert_panel_equal(result, expected) | ||
|
||
with tm.assertRaises(ValueError): | ||
result = wp.pipe((f, 'y'), x=1, y=1) | ||
|
||
if __name__ == '__main__': | ||
nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], | ||
exit=False) |
Review comment: this needs backticks to pick up the references
Reply: backticks on the "Elementwise_"? It works w/o the backticks since it's a single word. Or do you mean the method references?
Reply: oh, ok, then didn't know that.