
Commit 196acb8

DOC: update docs about file parsing functions

1 parent b133013

3 files changed: +185 −44 lines

RELEASE.rst

Lines changed: 2 additions & 3 deletions

@@ -144,7 +144,8 @@ feedback on the library.
 - Add support for different delimiters in `DataFrame.to_csv` (PR #244)
 - Add more helpful error message when importing pandas post-installation from
   the source directory (GH #250)
-
+- Significantly speed up DataFrame `__repr__` and `count` on large mixed-type
+  DataFrame objects
 
 **Bug fixes**
 
@@ -305,8 +306,6 @@ infrastructure are the main new additions
   retrieve groups
 - Added informative Exception when passing dict to DataFrame groupby
   aggregation with axis != 0
-- Significantly speed up DataFrame `__repr__` and `count` on large mixed-type
-  DataFrame objects
 
 **API Changes**
 
TODO.rst

Lines changed: 45 additions & 5 deletions

@@ -5,10 +5,50 @@ DONE
 
 TODO
 ----
-- .name pickling / unpicking / HDFStore handling
-- Is there a way to write hierarchical columns to csv?
-- Possible to blow away existing name when creating MultiIndex?
-- prettytable output with index names
-- Add load/save functions to top level pandas namespace
 - _consolidate, does it always copy?
 - Series.align with fill method. Will have to generate more Cython code
+
+TODO docs
+---------
+
+- read_csv / read_table
+  - auto-sniff delimiter
+  - MultiIndex
+  - generally more documentation
+
+- pivot_table
+
+- Set mixed-type values with .ix
+- get_dtype_counts / dtypes
+- save / load functions
+- combine_first
+- describe for Series
+- DataFrame.to_string
+- Index / MultiIndex names
+- Unstack / stack by level name
+- ignore_index in DataFrame.append
+- Inner join on key
+- Multi-key joining
+- as_index=False in groupby
+- is_monotonic
+- isnull/notnull as instance methods
+- name attribute on Series
+- DataFrame.to_csv: different delimiters?
+- groupby with level name
+- MultiIndex
+  - get_level_values
+
+- Update to reflect Python 3 support in intro
+- align functions
+- df[col_list]
+- Panel.rename_axis
+- & and | for intersection / union
+- IPython tab complete hook
+
+Performance blog
+----------------
+- Series / Time series data alignment
+- DataFrame alignment
+- Groupby
+- joining
+- Take
doc/source/io.rst

Lines changed: 138 additions & 36 deletions
@@ -6,6 +6,7 @@
    :suppress:
 
    import numpy as np
+   import os
    np.random.seed(123456)
    from pandas import *
    from StringIO import StringIO
@@ -29,9 +30,8 @@ data into a DataFrame object. They can take a number of arguments:
 
 - ``path_or_buffer``: Either a string path to a file, or any object with a
   ``read`` method (such as an open file or ``StringIO``).
-- ``delimiter``: For ``read_table`` only, a regular expression to split
-  fields on. ``read_csv`` uses the ``csv`` module to do this and hence only
-  supports comma-separated values.
+- ``sep``: A delimiter / separator to split fields on. `read_csv` is capable
+  of automatically inferring ("sniffing") the delimiter in some cases
 - ``header``: row number to use as the column names, and the start of the data.
   Defaults to 0 (first row); specify None if there is no header row.
 - ``names``: List of column names to use if header is None.
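The delimiter "sniffing" that the new ``sep`` text refers to can be illustrated with the standard library's ``csv.Sniffer``, which implements the same idea; this is a stdlib sketch of the technique, not the pandas parser's actual code:

```python
import csv
from io import StringIO

# A delimited, but not comma-separated, sample.
data = "date|A|B|C\n20090101|a|1|2\n20090102|b|3|4\n"

# Sniffer inspects the sample and guesses which character separates fields.
dialect = csv.Sniffer().sniff(data)
print(dialect.delimiter)

# The guessed dialect can then drive ordinary csv parsing.
rows = list(csv.reader(StringIO(data), dialect))
print(rows[0])
```

Sniffing can fail on ambiguous samples, which is why an explicit ``sep`` remains the reliable option.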
@@ -47,45 +47,89 @@ data into a DataFrame object. They can take a number of arguments:
   ``dateutil.parser``. Specifying this implicitly sets ``parse_dates`` as True.
 - ``na_values``: optional list of strings to recognize as NaN (missing values),
   in addition to a default set.
-
-
-.. code-block:: ipython
-
-   In [1]: print open('foo.csv').read()
-   date,A,B,C
-   20090101,a,1,2
-   20090102,b,3,4
-   20090103,c,4,5
-
-   # A basic index is created by default:
-   In [3]: read_csv('foo.csv')
-   Out[3]:
-      date      A  B  C
-   0  20090101  a  1  2
-   1  20090102  b  3  4
-   2  20090103  c  4  5
-
-   # Use a column as an index, and parse it as dates.
-   In [3]: df = read_csv('foo.csv', index_col=0, parse_dates=True)
-
-   In [4]: df
-   Out[4]:
-               A  B  C
-   2009-01-01  a  1  2
-   2009-01-02  b  3  4
-   2009-01-03  c  4  5
-
-   # These are python datetime objects
-   In [16]: df.index
-   Out[16]: Index([2009-01-01 00:00:00, 2009-01-02 00:00:00,
-   2009-01-03 00:00:00], dtype=object)
+- ``nrows``: Number of rows to read out of the file. Useful to only read a
+  small portion of a large file
+- ``chunksize``: A number of rows to be used to "chunk" a file into
+  pieces. Will cause a ``TextParser`` object to be returned. More on this
+  below in the section on :ref:`iterating and chunking <io.chunking>`
+- ``iterator``: If True, return a ``TextParser`` to enable reading a file
+  into memory piece by piece
+
+.. ipython:: python
+   :suppress:
+
+   f = open('foo.csv', 'w')
+   f.write('date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5')
+   f.close()
+
+Consider a typical CSV file containing, in this case, some time series data:
+
+.. ipython:: python
+
+   print open('foo.csv').read()
 
+The default for `read_csv` is to create a DataFrame with simple numbered rows:
+
+.. ipython:: python
+
+   read_csv('foo.csv')
+
+In the case of indexed data, you can pass the column number (or a list of
+column numbers, for a hierarchical index) you wish to use as the index. If the
+index values are dates and you want them to be converted to ``datetime``
+objects, pass ``parse_dates=True``:
+
+.. ipython:: python
+
+   # Use a column as an index, and parse it as dates.
+   df = read_csv('foo.csv', index_col=0, parse_dates=True)
+   df
+   # These are python datetime objects
+   df.index
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('foo.csv')
 
 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
 integer dtype without altering the contents, it will do so. Any non-numeric
 columns will come through as object dtype as with the rest of pandas objects.
 
+Files with an "implicit" index column
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. ipython:: python
+   :suppress:
+
+   f = open('foo.csv', 'w')
+   f.write('A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5')
+   f.close()
+
+Consider a file with one less entry in the header than the number of data
+columns:
+
+.. ipython:: python
+
+   print open('foo.csv').read()
+
+In this special case, ``read_csv`` assumes that the first column is to be used
+as the index of the DataFrame:
+
+.. ipython:: python
+
+   read_csv('foo.csv')
+
+Note that the dates weren't automatically parsed. In that case you would need
+to do as before:
+
+.. ipython:: python
+
+   df = read_csv('foo.csv', parse_dates=True)
+   df.index
+
+
 Reading DataFrame objects with ``MultiIndex``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -104,6 +148,65 @@ column numbers to turn multiple columns into a ``MultiIndex``:
   df
   df.ix[1978]
+
+.. .. _io.sniff:
+
+.. Automatically "sniffing" the delimiter
+.. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. ``read_csv`` is capable of inferring delimited, but not necessarily
+.. comma-separated, files in some cases:
+
+.. .. ipython:: python
+
+..    print open('tmp.csv').read()
+..    read_csv('tmp.csv')
+
+
+.. _io.chunking:
+
+Iterating through files chunk by chunk
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Suppose you wish to iterate through a (potentially very large) file lazily
+rather than reading the entire file into memory, such as the following:
+
+.. ipython:: python
+   :suppress:
+
+   df[:7].to_csv('tmp.sv', delimiter='|')
+
+.. ipython:: python
+
+   print open('tmp.sv').read()
+   table = read_table('tmp.sv', sep='|')
+   table
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('tmp.csv')
+
+By specifying a ``chunksize`` to ``read_csv`` or ``read_table``, the return
+value will be an iterable object of type ``TextParser``:
+
+.. ipython::
+
+   In [1]: reader = read_table('tmp.sv', sep='|', chunksize=4)
+
+   In [1]: reader
+
+   In [2]: for chunk in reader:
+      ...:     print chunk
+      ...:
+
+Specifying ``iterator=True`` will also return the ``TextParser`` object:
+
+.. ipython:: python
+
+   reader = read_table('tmp.sv', sep='|', iterator=True)
+   reader.get_chunk(5)
+
 Excel 2003 files
 ----------------
 
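The chunked and iterator behaviour added above can be mimicked with the standard library alone, yielding a fixed number of parsed rows at a time instead of reading everything at once; this is a stdlib sketch of the idea, not the ``TextParser`` implementation:

```python
import csv
from io import StringIO
from itertools import islice

def read_in_chunks(f, chunksize, delimiter=','):
    """Yield lists of parsed rows, chunksize rows at a time."""
    reader = csv.reader(f, delimiter=delimiter)
    while True:
        # islice pulls at most chunksize rows without consuming the rest.
        chunk = list(islice(reader, chunksize))
        if not chunk:
            break
        yield chunk

data = StringIO("a|1\nb|2\nc|3\nd|4\ne|5\n")
for chunk in read_in_chunks(data, chunksize=2, delimiter='|'):
    print(chunk)
```

Because the generator only materializes one chunk at a time, peak memory stays proportional to ``chunksize``, not to the file size.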
@@ -132,7 +235,6 @@ performance HDF5 format using the excellent `PyTables
 .. ipython:: python
    :suppress:
 
-   import os
    os.remove('store.h5')
 
 .. ipython:: python