Commit e4e9e94

adamkleinwesm authored and committed
more working on v0.6
1 parent 6ef6cc8 commit e4e9e94

File tree

7 files changed: +118 −61 lines changed


doc/source/basics.rst

Lines changed: 26 additions & 9 deletions
@@ -380,6 +380,19 @@ maximum value for each column occurred:
          index=DateRange('1/1/2000', periods=1000))
    tsdf.apply(lambda x: x.index[x.dropna().argmax()])
 
+You may also pass additional arguments and keyword arguments to the ``apply``
+method. For instance, consider the following function you would like to apply:
+
+.. code-block:: python
+
+   def subtract_and_divide(x, sub, divide=1):
+       return (x - sub) / divide
+
+You may then apply this function as follows:
+
+.. code-block:: python
+
+   df.apply(subtract_and_divide, args=(5,), divide=3)
 
 Another useful feature is the ability to pass Series methods to carry out some
 Series operation on each column or row:
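The ``args``/keyword forwarding this hunk documents can be sketched without pandas. The helper below is hypothetical (``apply_columns`` and the dict-of-lists "frame" are stand-ins, not pandas API); it only mimics the ``DataFrame.apply(func, args=..., **kwds)`` calling convention:

```python
def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

def apply_columns(data, func, args=(), **kwds):
    # Mimic DataFrame.apply(func, args=..., **kwds): call func on every
    # value of every column, forwarding the extra positional and keyword
    # arguments unchanged.
    return {col: [func(v, *args, **kwds) for v in values]
            for col, values in data.items()}

data = {'A': [8, 11], 'B': [5, 14]}
result = apply_columns(data, subtract_and_divide, args=(5,), divide=3)
print(result)  # each value x becomes (x - 5) / 3
```

The point of the sketch is only the argument plumbing: ``args`` supplies the extra positionals and any remaining keywords reach the function untouched.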
@@ -396,6 +409,12 @@ Series operation on each column or row:
    tsdf
    tsdf.apply(Series.interpolate)
 
+Finally, ``apply`` takes an argument ``raw`` which is False by default and
+converts each row or column into a Series before applying the function. When
+set to True, the passed function will instead receive an ndarray object, which
+has positive performance implications if you do not need the indexing
+functionality.
+
 .. seealso::
 
    The section on :ref:`GroupBy <groupby>` demonstrates related, flexible
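What the new ``raw`` flag changes can be illustrated with a small pandas-free sketch (``LabeledColumn`` and ``apply_one`` are hypothetical stand-ins, not pandas internals): with ``raw=False`` the function receives a labeled, Series-like wrapper; with ``raw=True`` it receives the bare values and the wrapping cost is skipped.

```python
class LabeledColumn:
    """Stand-in for a Series: values plus an index of row labels."""
    def __init__(self, values, index):
        self.values = values
        self.index = index

def apply_one(values, index, func, raw=False):
    if raw:
        # raw=True: hand the function the bare values and skip building
        # the labeled wrapper entirely.
        return func(values)
    # raw=False: wrap the values first so the function can use the index.
    return func(LabeledColumn(values, index))

values, index = [3, 1, 2], ['a', 'b', 'c']
print(apply_one(values, index, max, raw=True))   # plain sequence in, 3 out
# With the wrapper, the function can use the labels, e.g. label of the max:
print(apply_one(values, index,
                lambda s: s.index[s.values.index(max(s.values))]))  # 'a'
```

Use ``raw=True`` only when the function needs nothing but the values; the indexing functionality is simply not there to use.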
@@ -673,11 +692,10 @@ produces the "keys" of the objects, namely:
 
 Thus, for example:
 
-.. ipython::
+.. ipython:: python
 
-   In [0]: for col in df:
-      ...:     print col
-      ...:
+   for col in df:
+       print col
 
 iteritems
 ~~~~~~~~~
@@ -691,12 +709,11 @@ key-value pairs:
 
 For example:
 
-.. ipython::
+.. ipython:: python
 
-   In [0]: for item, frame in wp.iteritems():
-      ...:     print item
-      ...:     print frame
-      ...:
+   for item, frame in wp.iteritems():
+       print item
+       print frame
 
 .. _basics.sorting:
 

doc/source/groupby.rst

Lines changed: 4 additions & 3 deletions
@@ -178,7 +178,8 @@ number:
    s.groupby(level='second').sum()
 
 As of v0.6, the aggregation functions such as ``sum`` will take the level
-parameter directly:
+parameter directly. Additionally, the resulting index will be named according
+to the chosen level:
 
 .. ipython:: python
 
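The level-based aggregation this hunk documents can be sketched with plain Python (``sum_by_level`` and the tuple-keyed data are hypothetical stand-ins, not pandas API): group (key-tuple, value) pairs by one position of the key tuple and sum within each group, as ``s.groupby(level='second').sum()`` would.

```python
from collections import defaultdict

def sum_by_level(items, level):
    # Group by one component of each MultiIndex-like key tuple and sum
    # the values within each group.
    totals = defaultdict(float)
    for key, value in items:
        totals[key[level]] += value
    return dict(totals)

s = [(('bar', 'one'), 1.0), (('bar', 'two'), 2.0),
     (('foo', 'one'), 3.0), (('foo', 'two'), 4.0)]
print(sum_by_level(s, level=1))  # sum over the 'second' level
print(sum_by_level(s, level=0))  # sum over the 'first' level
```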
@@ -424,8 +425,8 @@ Flexible ``apply``
 
 Some operations on the grouped data might not fit into either the aggregate or
 transform categories. Or, you may simply want GroupBy to infer how to combine
-the results. For these, use the ``apply`` function, which can be substitute for
-both ``aggregate`` and ``transform`` in many standard use cases. However,
+the results. For these, use the ``apply`` function, which can be substituted
+for both ``aggregate`` and ``transform`` in many standard use cases. However,
 ``apply`` can handle some exceptional use cases, for example:
 
 .. ipython:: python

doc/source/indexing.rst

Lines changed: 5 additions & 2 deletions
@@ -756,8 +756,11 @@ integer index. This is the inverse operation to ``set_index``
    df.reset_index()
 
 The output is more similar to a SQL table or a record array. The names for the
-columns derived from the index are the ones stored in the ``names``
-attribute.
+columns derived from the index are the ones stored in the ``names`` attribute.
+
+.. note::
+
+   The ``reset_index`` method used to be called ``delevel``, which is now deprecated.
 
 Adding an ad hoc index
 ~~~~~~~~~~~~~~~~~~~~~~
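The ``reset_index`` behavior described above reduces to a simple transformation, sketched here without pandas (``reset_index`` below and its record layout are hypothetical, not the pandas implementation): move the named index back into each record as an ordinary column, leaving an implicit integer index.

```python
def reset_index(index_name, index_values, records):
    # Re-insert the named index into each record as a regular column;
    # the list position becomes the new default integer index.
    return [{index_name: idx, **rec}
            for idx, rec in zip(index_values, records)]

rows = reset_index('id', ['a', 'b'], [{'x': 1}, {'x': 2}])
print(rows)
```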

doc/source/merging.rst

Lines changed: 3 additions & 3 deletions
@@ -75,9 +75,9 @@ new DataFrame as above:
 
 .. ipython:: python
 
-   df = DataFrame(np.random.randn(8, 4))
+   df = DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
    df
-   s = df.xs(5)
+   s = df.xs(3)
    df.append(s, ignore_index=True)
 
 
@@ -115,7 +115,7 @@ passed DataFrame's index. This is best illustrated by example:
 
 .. ipython:: python
 
-   df['key'] = ['foo', 'bar'] * 3
+   df['key'] = ['foo', 'bar'] * 4
    to_join = DataFrame(randn(2, 2), index=['bar', 'foo'],
                        columns=['j1', 'j2'])
    df

doc/source/reshaping.rst

Lines changed: 54 additions & 20 deletions
@@ -7,6 +7,7 @@
    import numpy as np
    np.random.seed(123456)
    from pandas import *
+   from pandas.core.reshape import *
    import pandas.util.testing as tm
    randn = np.random.randn
    np.set_printoptions(precision=4, suppress=True)
@@ -61,19 +62,20 @@ To select out everything for variable ``A`` we could do:
 
    df[df['variable'] == 'A']
 
-But if we wished to do time series operations between variables, this will
-hardly do at all. This is really just a representation of a DataFrame whose
-``columns`` are formed from the unique ``variable`` values and ``index`` from
-the ``date`` values. To reshape the data into this form, use the ``pivot``
-function:
+But suppose we wish to do time series operations with the variables. A better
+representation would be one where the ``columns`` are the unique variables and
+an ``index`` of dates identifies individual observations. To reshape the data
+into this form, use the ``pivot`` function:
 
 .. ipython:: python
 
    df.pivot(index='date', columns='variable', values='value')
 
-If the ``values`` argument is omitted, the resulting "pivoted" DataFrame will
-have :ref:`hierarchical columns <indexing.hierarchical>` with the top level
-being the set of value columns:
+If the ``values`` argument is omitted, and the input DataFrame has more than
+one column of values which are not used as column or index inputs to ``pivot``,
+then the resulting "pivoted" DataFrame will have :ref:`hierarchical columns
+<indexing.hierarchical>` whose topmost level indicates the respective value
+column:
 
 .. ipython:: python
 
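The long-to-wide transformation ``pivot`` performs can be sketched with plain dicts (the ``pivot`` function and records below are hypothetical stand-ins, not pandas code): each record's ``columns`` value becomes a column label and its ``index`` value keys the row.

```python
def pivot(records, index, columns, values):
    # Long format in, wide format out: one nested dict per index value,
    # with the unique `columns` values as its keys.
    table = {}
    for rec in records:
        table.setdefault(rec[index], {})[rec[columns]] = rec[values]
    return table

long_form = [
    {'date': '2000-01-03', 'variable': 'A', 'value': 0.5},
    {'date': '2000-01-04', 'variable': 'A', 'value': 0.6},
    {'date': '2000-01-03', 'variable': 'B', 'value': -1.2},
    {'date': '2000-01-04', 'variable': 'B', 'value': 0.1},
]
wide = pivot(long_form, index='date', columns='variable', values='value')
print(wide['2000-01-03'])  # one row per date, one key per variable
```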
@@ -90,22 +92,26 @@ You of course can then select subsets from the pivoted DataFrame:
 
 Note that this returns a view on the underlying data in the case where the data
 are homogeneously-typed.
 
+.. _reshaping.stacking:
+
 Reshaping by stacking and unstacking
 ------------------------------------
 
 Closely related to the ``pivot`` function are the related ``stack`` and
 ``unstack`` functions currently available on Series and DataFrame. These
-functions are designed to tie together with ``MultiIndex`` objects (see the
+functions are designed to work together with ``MultiIndex`` objects (see the
 section on :ref:`hierarchical indexing <indexing.hierarchical>`). Here is
 essentially what these functions do:
 
-- ``stack``: collapse level in ``axis=1`` to produce new object whose index
-  has the collapsed columns as its lowest level
-- ``unstack``: inverse operation from ``stack``; "pivot" index level to
-  produce reshaped DataFrame
+- ``stack``: "pivot" a level of the (possibly hierarchical) column labels,
+  returning a DataFrame with an index that has a new inner-most level of row
+  labels.
+- ``unstack``: inverse operation of ``stack``: "pivot" a level of the
+  (possibly hierarchical) row index to the column axis, producing a reshaped
+  DataFrame with a new inner-most level of column labels.
 
-Actually very hard to explain in words; the clearest way is by example. Let's
-take a prior example data set from the hierarchical indexing section:
+The clearest way to explain is by example. Let's take a prior example data set
+from the hierarchical indexing section:
 
 .. ipython:: python
 
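The two bullet definitions above reduce to a pair of inverse dict transformations; a pandas-free sketch (``stack``/``unstack`` below are hypothetical stand-ins for the real methods): the "wide" table maps row labels to {column label: value}, and stacking moves the column label into the row key.

```python
def stack(wide):
    # Move each column label into the row key as a new inner-most level.
    return {(row, col): val
            for row, cols in wide.items()
            for col, val in cols.items()}

def unstack(stacked):
    # Inverse of stack: pivot the inner-most row-key level back to columns.
    wide = {}
    for (row, col), val in stacked.items():
        wide.setdefault(row, {})[col] = val
    return wide

wide = {'one': {'a': 1, 'b': 2}, 'two': {'a': 3, 'b': 4}}
stacked = stack(wide)
print(stacked[('one', 'b')])
print(unstack(stacked) == wide)  # round-trip: unstack inverts stack
```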
@@ -151,10 +157,14 @@ the level numbers:
 
    stacked.unstack('second')
 
-These functions are very intelligent about handling missing data and do not
-expect each subgroup within the hierarchical index to have the same set of
-labels. They also can handle the index being unsorted (but you can make it
-sorted by calling ``sortlevel``, of course). Here is a more complex example:
+You may also stack or unstack more than one level at a time by passing a list
+of levels, in which case the end result is as if each level in the list were
+processed individually.
+
+These functions are intelligent about handling missing data and do not expect
+each subgroup within the hierarchical index to have the same set of labels.
+They also can handle the index being unsorted (but you can make it sorted by
+calling ``sortlevel``, of course). Here is a more complex example:
 
 .. ipython:: python
 
@@ -181,6 +191,29 @@ the right thing:
 
    df[:3].unstack(0)
    df2.unstack(1)
 
+.. _reshaping.melt:
+
+Reshaping by Melt
+-----------------
+
+The ``melt`` function found in ``pandas.core.reshape`` is useful to massage a
+DataFrame into a format where one or more columns are identifier variables,
+while all other columns, considered measured variables, are "pivoted" to the
+row axis, leaving just two non-identifier columns, "variable" and "value".
+
+For instance,
+
+.. ipython:: python
+
+   df = DataFrame({'first' : ['John', 'Mary'],
+                   'last' : ['Doe', 'Bo'],
+                   'height' : [5.5, 6.0],
+                   'weight' : [130, 150]})
+   df
+   melt(df, id_vars=['first', 'last'])
+
 Combining with stats and GroupBy
 --------------------------------
 
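The wide-to-long transformation ``melt`` performs is easy to sketch without pandas (the ``melt`` function and record layout below are hypothetical stand-ins, not the library implementation): keep the ``id_vars`` columns on every output row, and pivot each remaining column into a ("variable", "value") pair.

```python
def melt(records, id_vars):
    # Every non-identifier column of each record becomes its own row,
    # carrying the identifier columns along unchanged.
    out = []
    for rec in records:
        ids = {k: rec[k] for k in id_vars}
        for k, v in rec.items():
            if k not in id_vars:
                out.append({**ids, 'variable': k, 'value': v})
    return out

df = [{'first': 'John', 'last': 'Doe', 'height': 5.5, 'weight': 130},
      {'first': 'Mary', 'last': 'Bo', 'height': 6.0, 'weight': 150}]
rows = melt(df, id_vars=['first', 'last'])
print(len(rows))  # 2 people x 2 measured variables = 4 rows
```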

@@ -210,7 +243,7 @@ The function ``pandas.pivot_table`` can be used to create spreadsheet-style pivot
 tables. It takes a number of arguments
 
 - ``data``: A DataFrame object
-- ``values``: column to aggregate
+- ``values``: a column or a list of columns to aggregate
 - ``rows``: list of columns to group by on the table rows
 - ``cols``: list of columns to group by on the table columns
 - ``aggfunc``: function to use for aggregation, defaulting to ``numpy.mean``
@@ -232,6 +265,7 @@ We can produce pivot tables from this data very easily:
 
    pivot_table(df, values='D', rows=['A', 'B'], cols=['C'])
    pivot_table(df, values='D', rows=['B'], cols=['A', 'C'], aggfunc=np.sum)
+   pivot_table(df, values=['D','E'], rows=['B'], cols=['A', 'C'], aggfunc=np.sum)
 
 The result object is a DataFrame having potentially hierarchical indexes on the
 rows and columns. If the ``values`` column name is not given, the pivot table
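The multi-values behavior the new line exercises can be sketched in plain Python (``pivot_table`` below and its tuple-keyed result are hypothetical stand-ins, not pandas code): group records by their (rows, cols) key, then aggregate each requested values column within every group.

```python
from collections import defaultdict

def pivot_table(records, values, rows, cols, aggfunc=sum):
    # Group by the combined (rows, cols) key, then aggregate every
    # values column within each group; `values` may list several columns.
    groups = defaultdict(list)
    for rec in records:
        key = (tuple(rec[r] for r in rows), tuple(rec[c] for c in cols))
        groups[key].append(rec)
    return {key: {v: aggfunc(r[v] for r in recs) for v in values}
            for key, recs in groups.items()}

data = [{'A': 'foo', 'C': 'small', 'D': 1, 'E': 2},
        {'A': 'foo', 'C': 'small', 'D': 2, 'E': 4},
        {'A': 'bar', 'C': 'large', 'D': 3, 'E': 6}]
table = pivot_table(data, values=['D', 'E'], rows=['A'], cols=['C'])
print(table[(('foo',), ('small',))])  # both D and E aggregated per cell
```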

doc/source/visualization.rst

Lines changed: 11 additions & 1 deletion
@@ -28,6 +28,8 @@ We use the standard convention for referencing the matplotlib API:
 
    import matplotlib.pyplot as plt
 
+.. _visualization.basic:
+
 Basic plotting: ``plot``
 ------------------------
 

@@ -43,7 +45,7 @@ The ``plot`` method on Series and DataFrame is just a simple wrapper around
4345
ts.plot()
4446
4547
If the index consists of dates, it calls ``gca().autofmt_xdate()`` to try to
46-
format the x-axis nicely as per above. THe method takes a number of arguments
48+
format the x-axis nicely as per above. The method takes a number of arguments
4749
for controlling the look of the plot:
4850

4951
.. ipython:: python
@@ -62,6 +64,14 @@ On DataFrame, ``plot`` is a convenience to plot all of the columns with labels:
6264
@savefig frame_plot_basic.png width=4.5in
6365
plt.figure(); df.plot(); plt.legend(loc='best')
6466
67+
You may set the ``legend`` argument to ``False`` to hide the legend, which is
68+
shown by default.
69+
70+
.. ipython:: python
71+
72+
@savefig frame_plot_basic_noleg.png width=4.5in
73+
df.plot(legend=False)
74+
6575
Some other options are available, like plotting each Series on a different axis:
6676

6777
.. ipython:: python

doc/source/whatsnew/v0.6.0.txt

Lines changed: 15 additions & 23 deletions
@@ -5,13 +5,12 @@ v.0.6.0 (November 25, 2011)
 
 New Features
 ~~~~~~~~~~~~
-- Add ``melt`` function to ``pandas.core.reshape``
+- :ref:`Added <reshaping.melt>` ``melt`` function to ``pandas.core.reshape``
 - :ref:`Added <groupby.multiindex>` ``level`` parameter to group by level in Series and DataFrame descriptive statistics (PR313_)
 - :ref:`Added <basics.head_tail>` ``head`` and ``tail`` methods to Series, analogous to DataFrame (PR296_)
 - :ref:`Added <indexing.boolean>` ``Series.isin`` function which checks if each value is contained in a passed sequence (GH289_)
 - :ref:`Added <io.formatting>` ``float_format`` option to ``Series.to_string``
 - :ref:`Added <io.parse_dates>` ``skip_footer`` (GH291_) and ``converters`` (GH343_) options to ``read_csv`` and ``read_table``
-- Added proper, tested weighted least squares to standard and panel OLS (GH303_)
 - :ref:`Added <indexing.duplicate>` ``drop_duplicates`` and ``duplicated`` functions for removing duplicate DataFrame rows and checking for duplicate rows, respectively (GH319_)
 - :ref:`Implemented <dsintro.boolean>` operators '&', '|', '^', '-' on DataFrame (GH347_)
 - :ref:`Added <basics.stats>` ``Series.mad``, mean absolute deviation
@@ -33,33 +32,26 @@ New Features
 - :ref:`Added <io.html>` ``DataFrame.to_html`` for writing DataFrame to HTML (PR387_)
 - :ref:`Added <basics.dataframe>` support for MaskedArray data in DataFrame, masked values converted to NaN (PR396_)
 - :ref:`Added <visualization.box>` ``DataFrame.boxplot`` function (GH368_)
-- Can pass extra args, kwds to DataFrame.apply (GH376_)
-- Arithmetic methods like ``sum`` will attempt to sum dtype=object values by default instead of excluding them (GH382_)
-- Print level names in hierarchical index in Series repr (GH305_)
-- Return DataFrame when performing GroupBy on selected column and as_index=False (GH308_)
-- Can pass vector to ``on`` argument in ``DataFrame.join`` (GH312_)
-- Show legend by default in ``DataFrame.plot``, add ``legend`` boolean flag
-  (GH324_) np.unique called on a Series faster (GH327_) "empty" combinations
-  ``Series.map`` significantly when passed elementwise Python function,
-  motivated by PR355_ enhancements throughout the codebase (GH361_) with 3-5x
-  better performance than ``np.apply_along_axis`` (GH309_) the passed function
-  only requires an ndarray (GH309_)
-- Can pass multiple levels to ``stack`` and ``unstack`` (GH370_)
-- Can pass multiple values columns to ``pivot_table`` (GH381_)
-- Can call ``DataFrame.delevel`` with standard Index with name set (GH393_)
-- Use Series name in GroupBy for result index (GH363_)
-- MAYBE? Refactor Series/DataFrame stat methods to use common set of NaN-friendly function
+- :ref:`Can <basics.apply>` pass extra args, kwds to DataFrame.apply (GH376_)
+- :ref:`Implement <merging.multikey_join>` ``DataFrame.join`` with vector ``on`` argument (GH312_)
+- :ref:`Added <visualization.basic>` ``legend`` boolean flag to ``DataFrame.plot`` (GH324_)
+- :ref:`Can <reshaping.stacking>` pass multiple levels to ``stack`` and ``unstack`` (GH370_)
+- :ref:`Can <reshaping.pivot>` pass multiple values columns to ``pivot_table`` (GH381_)
+- :ref:`Use <groupby.multiindex>` Series name in GroupBy for result index (GH363_)
+- :ref:`Added <basics.apply>` ``raw`` option to ``DataFrame.apply`` for performance if only need ndarray (GH309_)
+- Added proper, tested weighted least squares to standard and panel OLS (GH303_)
 
 Performance Enhancements
 ~~~~~~~~~~~~~~~~~~~~~~~~
-- VBENCH Cythonized ``cache_readonly``, resulting in substantial micro-performance
-- VBENCH Improve performance of ``MultiIndex.from_tuples``
+- VBENCH Cythonized ``cache_readonly``, resulting in substantial micro-performance enhancements throughout the codebase (GH361_)
+- VBENCH Special Cython matrix iterator for applying arbitrary reduction operations with 3-5x better performance than ``np.apply_along_axis`` (GH309_)
+- VBENCH Improved performance of ``MultiIndex.from_tuples``
 - VBENCH Special Cython matrix iterator for applying arbitrary reduction operations
 - VBENCH + DOCUMENT Add ``raw`` option to ``DataFrame.apply`` for getting better performance when
 - VBENCH Faster cythonized count by level in Series and DataFrame (GH341_)
-- VBENCH? Significant GroupBy performance enhancement with multiple keys with many
-- VBENCH New Cython vectorized function ``map_infer`` speeds up ``Series.apply`` and
-- VBENCH Significantly improved performance of ``Series.order``, which also makes
+- VBENCH? Significant GroupBy performance enhancement with multiple keys with many "empty" combinations
+- VBENCH New Cython vectorized function ``map_infer`` speeds up ``Series.apply`` and ``Series.map`` significantly when passed elementwise Python function, motivated by PR355_
+- VBENCH Significantly improved performance of ``Series.order``, which also makes np.unique called on a Series faster (GH327_)
 - VBENCH Vastly improved performance of GroupBy on axes with a MultiIndex (GH299_)
 
 .. _GH65: https://github.com/wesm/pandas/issues/65
