ENH: Add pipe method #10253

Merged: 2 commits, Jun 6, 2015
74 changes: 74 additions & 0 deletions doc/source/basics.rst
@@ -624,6 +624,77 @@ We can also pass infinite values to define the bins:
Function application
--------------------

To apply your own or another library's functions to pandas objects,
you should be aware of the three methods below. The appropriate
method to use depends on whether your function expects to operate
on an entire ``DataFrame`` or ``Series``, row- or column-wise, or elementwise.

1. `Tablewise Function Application`_: :meth:`~DataFrame.pipe`
2. `Row or Column-wise Function Application`_: :meth:`~DataFrame.apply`
3. Elementwise_ function application: :meth:`~DataFrame.applymap`

Contributor: this needs backticks to pick up the references

Contributor Author: backticks on the "Elementwise_"? It works without the backticks since it's a single word. Or do you mean the method references?

Contributor: oh, ok, I didn't know that.


.. _basics.pipe:

Tablewise Function Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.16.2

``DataFrames`` and ``Series`` can of course just be passed into functions.
However, if the function needs to be called in a chain, consider using the :meth:`~DataFrame.pipe` method.
Compare the following

.. code-block:: python

   # f, g, and h are functions taking and returning ``DataFrames``
   >>> f(g(h(df), arg1=1), arg2=2, arg3=3)

with the equivalent

.. code-block:: python

   >>> (df.pipe(h)
          .pipe(g, arg1=1)
          .pipe(f, arg2=2, arg3=3)
       )

Member: I think the commas at the end of the lines are not correct here?

Pandas encourages the second style, which is known as method chaining.
``pipe`` makes it easy to use your own or another library's functions
in method chains, alongside pandas' methods.
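
As a rough, self-contained sketch of this style (the helper functions below are invented purely for illustration and are not part of pandas):

.. code-block:: python

   import pandas as pd

   # hypothetical helpers, defined only to show chaining with ``pipe``
   def drop_missing(df):
       return df.dropna()

   def add_ratio(df, numerator, denominator):
       return df.assign(ratio=df[numerator] / df[denominator])

   df = pd.DataFrame({'hits': [10, 20, None], 'at_bats': [40.0, 50.0, 60.0]})

   result = (df.pipe(drop_missing)
               .pipe(add_ratio, 'hits', 'at_bats'))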

In the example above, the functions ``f``, ``g``, and ``h`` each expected the ``DataFrame`` as the first positional argument.
What if the function you wish to apply takes its data as, say, the second argument?
In this case, provide ``pipe`` with a tuple of ``(callable, data_keyword)``.
``.pipe`` will route the ``DataFrame`` to the argument specified in the tuple.

For example, we can fit a regression using statsmodels. Its API expects a formula first and a ``DataFrame`` as the second argument, ``data``. We pass the ``(function, keyword)`` pair ``(sm.poisson, 'data')`` to ``pipe``:

.. ipython:: python

   import statsmodels.formula.api as sm

   bb = pd.read_csv('data/baseball.csv', index_col='id')

   (bb.query('h > 0')
      .assign(ln_h = lambda df: np.log(df.h))
      .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
      .fit()
      .summary()
   )
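
If statsmodels is not at hand, the same ``(callable, data_keyword)`` routing can be seen with a plain function; the ``summarize`` helper below is invented for this sketch:

.. code-block:: python

   def summarize(title, data):
       # ``data`` is the keyword that receives the DataFrame
       return '%s: %d rows' % (title, len(data))

   df = pd.DataFrame({'hr': [10, 20, 30]})

   df.pipe((summarize, 'data'), 'home runs')   # -> 'home runs: 3 rows'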

The pipe method is inspired by unix pipes and, more recently, dplyr_ and magrittr_, which
have introduced the popular ``(%>%)`` (read pipe) operator for R_.
The implementation of ``pipe`` here is quite clean and feels right at home in Python.
We encourage you to view the source code (``pd.DataFrame.pipe??`` in IPython).
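
In spirit, ``pipe`` boils down to something like the following simplified sketch (an approximation for illustration, not the exact pandas source):

.. code-block:: python

   def pipe_sketch(obj, func, *args, **kwargs):
       # simplified stand-in for DataFrame.pipe
       if isinstance(func, tuple):
           # (callable, data_keyword) form: route the data to the named keyword
           func, target = func
           if target in kwargs:
               raise ValueError('%s is both the pipe target and a keyword '
                                'argument' % target)
           kwargs[target] = obj
           return func(*args, **kwargs)
       # default: the data goes in as the first positional argument
       return func(obj, *args, **kwargs)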

.. _dplyr: https://github.com/hadley/dplyr
.. _magrittr: https://github.com/smbache/magrittr
.. _R: http://www.r-project.org


Contributor: basics.apply (though I don't think it's used anywhere ATM)

Row or Column-wise Function Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Arbitrary functions can be applied along the axes of a DataFrame or Panel
using the :meth:`~DataFrame.apply` method, which, like the descriptive
statistics methods, takes an optional ``axis`` argument:
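
As a quick illustration of the ``axis`` argument (a minimal example added for clarity; the frame below is arbitrary):

.. code-block:: python

   import numpy as np
   import pandas as pd

   df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])

   df.apply(np.mean)           # column-wise (axis=0, the default)
   df.apply(np.mean, axis=1)   # row-wise
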
@@ -678,6 +749,7 @@ Series operation on each column or row:
tsdf
tsdf.apply(pd.Series.interpolate)


Finally, :meth:`~DataFrame.apply` takes an argument ``raw``, which is False by default and
converts each row or column into a Series before applying the function. When
set to True, the passed function will instead receive an ndarray object, which
@@ -690,6 +762,8 @@ functionality.
functionality for grouping by some criterion, applying, and combining the
results into a Series, DataFrame, etc.

.. _Elementwise:

Contributor: basics.elementwise


Applying elementwise Python functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
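
As a minimal reminder of what elementwise application looks like (an illustrative example, not the collapsed section text):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'a': [1, 22, 333]})

   # applymap applies a function to every single element of the DataFrame
   df.applymap(lambda x: len(str(x)))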

40 changes: 0 additions & 40 deletions doc/source/faq.rst
@@ -89,46 +89,6 @@ representation; i.e., 1KB = 1024 bytes).

See also :ref:`Categorical Memory Usage <categorical.memory>`.

.. _ref-monkey-patching:

Adding Features to your pandas Installation
-------------------------------------------

pandas is a powerful tool and already has a plethora of data manipulation
operations implemented; most of them are very fast as well.
It's very possible however that certain functionality that would make your
life easier is missing. In that case you have several options:

1) Open an issue on `Github <https://github.com/pydata/pandas/issues/>`__ , explain your need and the sort of functionality you would like to see implemented.
2) Fork the repo, implement the functionality yourself and open a PR
on Github.
3) Write a method that performs the operation you are interested in and
Monkey-patch the pandas class as part of your IPython profile startup
or PYTHONSTARTUP file.

For example, here is how to add a ``just_foo_cols()``
method to the DataFrame class:

::

import pandas as pd
def just_foo_cols(self):
"""Get a list of column names containing the string 'foo'

"""
return [x for x in self.columns if 'foo' in x]

pd.DataFrame.just_foo_cols = just_foo_cols # monkey-patch the DataFrame class
df = pd.DataFrame([list(range(4))], columns=["A","foo","foozball","bar"])
df.just_foo_cols()
del pd.DataFrame.just_foo_cols # you can also remove the new method


Monkey-patching is usually frowned upon because it makes your code
less portable and can cause subtle bugs in some circumstances.
Monkey-patching existing methods is usually a bad idea in that respect.
When used with proper care, however, it's a very useful tool to have.


.. _ref-scikits-migration:

2 changes: 1 addition & 1 deletion doc/source/internals.rst
@@ -101,7 +101,7 @@ Subclassing pandas Data Structures

.. warning:: There are some easier alternatives before considering subclassing ``pandas`` data structures.

1. Monkey-patching: See :ref:`Adding Features to your pandas Installation <ref-monkey-patching>`.
1. Extensible method chains with :ref:`pipe <basics.pipe>`.

2. Use *composition*. See `here <http://en.wikipedia.org/wiki/Composition_over_inheritance>`_.

57 changes: 57 additions & 0 deletions doc/source/whatsnew/v0.16.2.txt
@@ -10,6 +10,7 @@ We recommend that all users upgrade to this version.
Highlights include:

- Documentation on how to use ``numba`` with *pandas*, see :ref:`here <enhancingperf.numba>`
- A new ``pipe`` method, see :ref:`here <whatsnew_0162.enhancements.pipe>`

Check the :ref:`API Changes <whatsnew_0162.api>` before updating.

@@ -22,6 +23,62 @@ Check the :ref:`API Changes <whatsnew_0162.api>` before updating.
New features
~~~~~~~~~~~~

.. _whatsnew_0162.enhancements.pipe:

Pipe
^^^^

Contributor: add a link and :ref: from the Highlights section

We've introduced a new method :meth:`DataFrame.pipe`. As suggested by the name, ``pipe``
should be used to pipe data through a chain of function calls.
The goal is to avoid confusing nested function calls like

.. code-block:: python

   # df is a DataFrame
   # f, g, and h are functions that take and return DataFrames
   f(g(h(df), arg1=1), arg2=2, arg3=3)

The logic flows from inside out, and function names are separated from their keyword arguments.
This can be rewritten as

.. code-block:: python

   (df.pipe(h)
      .pipe(g, arg1=1)
      .pipe(f, arg2=2, arg3=3)
   )

Now both the code and the logic flow from top to bottom. Keyword arguments are next to
their functions. Overall the code is much more readable.

In the example above, the functions ``f``, ``g``, and ``h`` each expected the DataFrame as the first positional argument.
When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple
of ``(function, keyword)`` indicating where the DataFrame should flow. For example:

.. ipython:: python

   import statsmodels.formula.api as sm

   bb = pd.read_csv('data/baseball.csv', index_col='id')

   # sm.poisson takes (formula, data)
   (bb.query('h > 0')
      .assign(ln_h = lambda df: np.log(df.h))
      .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
      .fit()
      .summary()
   )

The pipe method is inspired by unix pipes, which stream text through
processes. More recently, dplyr_ and magrittr_ have introduced the
popular ``(%>%)`` pipe operator for R_.

See the :ref:`documentation <basics.pipe>` for more. (:issue:`10129`)

.. _dplyr: https://github.com/hadley/dplyr
.. _magrittr: https://github.com/smbache/magrittr
.. _R: http://www.r-project.org

.. _whatsnew_0162.enhancements.other:

Other enhancements
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.17.0.txt
@@ -21,6 +21,7 @@ Check the :ref:`API Changes <whatsnew_0170.api>` and :ref:`deprecations <whatsne
New features
~~~~~~~~~~~~


.. _whatsnew_0170.enhancements.other:

Other enhancements
1 change: 0 additions & 1 deletion pandas/__init__.py
@@ -57,4 +57,3 @@
from pandas.util.print_versions import show_versions
import pandas.util.testing


62 changes: 62 additions & 0 deletions pandas/core/generic.py
@@ -2045,6 +2045,68 @@ def sample(self, n=None, frac=None, replace=False, weights=None, random_state=No
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
return self.take(locs, axis=axis)

_shared_docs['pipe'] = ("""
Apply func(self, *args, **kwargs)

.. versionadded:: 0.16.2

Parameters
----------
func : function
    function to apply to the %(klass)s.
    ``args`` and ``kwargs`` are passed into ``func``.
    Alternatively a ``(callable, data_keyword)`` tuple where
    ``data_keyword`` is a string indicating the keyword of
    ``callable`` that expects the %(klass)s.
args : positional arguments passed into ``func``.
kwargs : a dictionary of keyword arguments passed into ``func``.

Returns
-------
object : the return type of ``func``.

Notes
-----

Use ``.pipe`` when chaining together functions that expect
Series or DataFrames. Instead of writing

>>> f(g(h(df), arg1=a), arg2=b, arg3=c)

You can write

>>> (df.pipe(h)
... .pipe(g, arg1=a)
... .pipe(f, arg2=b, arg3=c)
... )

Contributor: maybe show an example of using the callable & data_keyword in the Notes? (can do later)

Contributor Author: Added.

If you have a function that takes the data as (say) the second
argument, pass a tuple indicating which keyword expects the
data. For example, suppose ``f`` takes its data as ``arg2``:

>>> (df.pipe(h)
... .pipe(g, arg1=a)
... .pipe((f, 'arg2'), arg1=a, arg3=c)
... )

See Also
--------
pandas.DataFrame.apply
pandas.DataFrame.applymap
pandas.Series.map
"""
)
@Appender(_shared_docs['pipe'] % _shared_doc_kwargs)
def pipe(self, func, *args, **kwargs):
    if isinstance(func, tuple):
        # (callable, data_keyword) form: route self to the named keyword
        func, target = func
        if target in kwargs:
            msg = '%s is both the pipe target and a keyword argument' % target
            raise ValueError(msg)
        kwargs[target] = self

Member: Should we add some validation logic here to ensure that target is not overwriting a key in kwargs? Something like this:

    if target in kwargs:
        raise ValueError('%s is both the pipe target and a keyword argument' % target)

I'm not entirely sure it's worth complexifying things here.

Contributor Author: Are you saying maybe the user did something like

    df.pipe((sns.violinplot, 'data'), x='x', y='y', data=df)

I guess the precedent here is with Python itself raising when you call f(a=1, a=2).

Member: yes, that's the case I was thinking about

        return func(*args, **kwargs)
    else:
        return func(self, *args, **kwargs)

#----------------------------------------------------------------------
# Attribute access
42 changes: 42 additions & 0 deletions pandas/tests/test_generic.py
@@ -1649,6 +1649,48 @@ def test_describe_raises(self):
    with tm.assertRaises(NotImplementedError):
        tm.makePanel().describe()

def test_pipe(self):
    df = DataFrame({'A': [1, 2, 3]})

Contributor: Can you add a test with a Panel (of some sort just to validate)

    f = lambda x, y: x ** y
    result = df.pipe(f, 2)
    expected = DataFrame({'A': [1, 4, 9]})
    self.assert_frame_equal(result, expected)

    result = df.A.pipe(f, 2)
    self.assert_series_equal(result, expected.A)

def test_pipe_tuple(self):
    df = DataFrame({'A': [1, 2, 3]})
    f = lambda x, y: y
    result = df.pipe((f, 'y'), 0)
    self.assert_frame_equal(result, df)

    result = df.A.pipe((f, 'y'), 0)
    self.assert_series_equal(result, df.A)

def test_pipe_tuple_error(self):
    df = DataFrame({"A": [1, 2, 3]})
    f = lambda x, y: y
    with tm.assertRaises(ValueError):
        result = df.pipe((f, 'y'), x=1, y=0)

    with tm.assertRaises(ValueError):
        result = df.A.pipe((f, 'y'), x=1, y=0)

def test_pipe_panel(self):
    wp = Panel({'r1': DataFrame({"A": [1, 2, 3]})})
    f = lambda x, y: x + y
    result = wp.pipe(f, 2)
    expected = wp + 2
    assert_panel_equal(result, expected)

    result = wp.pipe((f, 'y'), x=1)
    expected = wp + 1
    assert_panel_equal(result, expected)

    with tm.assertRaises(ValueError):
        result = wp.pipe((f, 'y'), x=1, y=1)

if __name__ == '__main__':
    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
                   exit=False)