
Commit 196acb8

DOC: update docs about file parsing functions

1 parent b133013

3 files changed: +185 −44 lines

RELEASE.rst

Lines changed: 2 additions & 3 deletions

@@ -144,7 +144,8 @@ feedback on the library.
 - Add support for different delimiters in `DataFrame.to_csv` (PR #244)
 - Add more helpful error message when importing pandas post-installation from
   the source directory (GH #250)
-
+- Significantly speed up DataFrame `__repr__` and `count` on large mixed-type
+  DataFrame objects
 
 **Bug fixes**
 
@@ -305,8 +306,6 @@ infrastructure are the main new additions
   retrieve groups
 - Added informative Exception when passing dict to DataFrame groupby
   aggregation with axis != 0
-- Significantly speed up DataFrame `__repr__` and `count` on large mixed-type
-  DataFrame objects
 
 **API Changes**
 
TODO.rst

Lines changed: 45 additions & 5 deletions

@@ -5,10 +5,50 @@ DONE
 
 TODO
 ----
-- .name pickling / unpicking / HDFStore handling
-- Is there a way to write hierarchical columns to csv?
-- Possible to blow away existing name when creating MultiIndex?
-- prettytable output with index names
-- Add load/save functions to top level pandas namespace
 - _consolidate, does it always copy?
 - Series.align with fill method. Will have to generate more Cython code
+
+TODO docs
+---------
+
+- read_csv / read_table
+  - auto-sniff delimiter
+  - MultiIndex
+  - generally more documentation
+
+- pivot_table
+
+- Set mixed-type values with .ix
+- get_dtype_counts / dtypes
+- save / load functions
+- combine_first
+- describe for Series
+- DataFrame.to_string
+- Index / MultiIndex names
+- Unstack / stack by level name
+- ignore_index in DataFrame.append
+- Inner join on key
+- Multi-key joining
+- as_index=False in groupby
+- is_monotonic
+- isnull/notnull as instance methods
+- name attribute on Series
+- DataFrame.to_csv: different delimiters?
+- groupby with level name
+- MultiIndex
+  - get_level_values
+
+- Update to reflect Python 3 support in intro
+- align functions
+- df[col_list]
+- Panel.rename_axis
+- & and | for intersection / union
+- IPython tab complete hook
+
+Performance blog
+----------------
+- Series / Time series data alignment
+- DataFrame alignment
+- Groupby
+- joining
+- Take
doc/source/io.rst

Lines changed: 138 additions & 36 deletions
@@ -6,6 +6,7 @@
    :suppress:
 
    import numpy as np
+   import os
    np.random.seed(123456)
    from pandas import *
    from StringIO import StringIO
@@ -29,9 +30,8 @@ data into a DataFrame object. They can take a number of arguments:
 
 - ``path_or_buffer``: Either a string path to a file, or any object with a
   ``read`` method (such as an open file or ``StringIO``).
-- ``delimiter``: For ``read_table`` only, a regular expression to split
-  fields on. ``read_csv`` uses the ``csv`` module to do this and hence only
-  supports comma-separated values.
+- ``sep``: A delimiter / separator to split fields on. `read_csv` is capable
+  of automatically inferring ("sniffing") the delimiter in some cases
 - ``header``: row number to use as the column names, and the start of the data.
   Defaults to 0 (first row); specify None if there is no header row.
 - ``names``: List of column names to use if header is None.
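The delimiter "sniffing" that the new ``sep`` text refers to can be illustrated with the standard library's ``csv.Sniffer``, which implements the same idea; this is a stdlib sketch of the technique, not the pandas parser's actual code:

```python
import csv
from io import StringIO

# A delimited, but not comma-separated, sample.
data = "date|A|B|C\n20090101|a|1|2\n20090102|b|3|4\n"

# Sniffer inspects the sample and guesses which character separates fields.
dialect = csv.Sniffer().sniff(data)
print(dialect.delimiter)

# The guessed dialect can then drive ordinary csv parsing.
rows = list(csv.reader(StringIO(data), dialect))
print(rows[0])
```

Sniffing can fail on ambiguous samples, which is why an explicit ``sep`` remains the reliable option.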
@@ -47,45 +47,89 @@ data into a DataFrame object. They can take a number of arguments:
   ``dateutil.parser``. Specifying this implicitly sets ``parse_dates`` as True.
 - ``na_values``: optional list of strings to recognize as NaN (missing values),
   in addition to a default set.
-
-
-.. code-block:: ipython
-
-   In [1]: print open('foo.csv').read()
-   date,A,B,C
-   20090101,a,1,2
-   20090102,b,3,4
-   20090103,c,4,5
-
-   # A basic index is created by default:
-   In [3]: read_csv('foo.csv')
-   Out[3]:
-      date      A  B  C
-   0  20090101  a  1  2
-   1  20090102  b  3  4
-   2  20090103  c  4  5
-
-   # Use a column as an index, and parse it as dates.
-   In [3]: df = read_csv('foo.csv', index_col=0, parse_dates=True)
-
-   In [4]: df
-   Out[4]:
-               A  B  C
-   2009-01-01  a  1  2
-   2009-01-02  b  3  4
-   2009-01-03  c  4  5
-
-   # These are python datetime objects
-   In [16]: df.index
-   Out[16]: Index([2009-01-01 00:00:00, 2009-01-02 00:00:00,
-   2009-01-03 00:00:00], dtype=object)
+- ``nrows``: Number of rows to read out of the file. Useful to only read a
+  small portion of a large file
+- ``chunksize``: A number of rows to be used to "chunk" a file into
+  pieces. Will cause a ``TextParser`` object to be returned. More on this
+  below in the section on :ref:`iterating and chunking <io.chunking>`
+- ``iterator``: If True, return a ``TextParser`` to enable reading a file
+  into memory piece by piece
+
+.. ipython:: python
+   :suppress:
+
+   f = open('foo.csv', 'w')
+   f.write('date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5')
+   f.close()
+
+Consider a typical CSV file containing, in this case, some time series data:
+
+.. ipython:: python
+
+   print open('foo.csv').read()
 
+The default for `read_csv` is to create a DataFrame with simple numbered rows:
+
+.. ipython:: python
+
+   read_csv('foo.csv')
+
+In the case of indexed data, you can pass the column number (or a list of
+column numbers, for a hierarchical index) you wish to use as the index. If the
+index values are dates and you want them to be converted to ``datetime``
+objects, pass ``parse_dates=True``:
+
+.. ipython:: python
+
+   # Use a column as an index, and parse it as dates.
+   df = read_csv('foo.csv', index_col=0, parse_dates=True)
+   df
+   # These are python datetime objects
+   df.index
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('foo.csv')
 
 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
 integer dtype without altering the contents, it will do so. Any non-numeric
 columns will come through as object dtype as with the rest of pandas objects.
 
+Files with an "implicit" index column
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. ipython:: python
+   :suppress:
+
+   f = open('foo.csv', 'w')
+   f.write('A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5')
+   f.close()
+
+Consider a file with one less entry in the header than the number of data
+columns:
+
+.. ipython:: python
+
+   print open('foo.csv').read()
+
+In this special case, ``read_csv`` assumes that the first column is to be used
+as the index of the DataFrame:
+
+.. ipython:: python
+
+   read_csv('foo.csv')
+
+Note that the dates weren't automatically parsed. In that case you would need
+to do as before:
+
+.. ipython:: python
+
+   df = read_csv('foo.csv', parse_dates=True)
+   df.index
+
+
 Reading DataFrame objects with ``MultiIndex``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -104,6 +148,65 @@ column numbers to turn multiple columns into a ``MultiIndex``:
   df
   df.ix[1978]
+
+.. .. _io.sniff:
+
+.. Automatically "sniffing" the delimiter
+.. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. ``read_csv`` is capable of inferring delimited, but not necessarily
+.. comma-separated, files in some cases:
+
+.. .. ipython:: python
+
+..    print open('tmp.csv').read()
+..    read_csv('tmp.csv')
+
+
+.. _io.chunking:
+
+Iterating through files chunk by chunk
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Suppose you wish to iterate through a (potentially very large) file lazily
+rather than reading the entire file into memory, such as the following:
+
+.. ipython:: python
+   :suppress:
+
+   df[:7].to_csv('tmp.sv', delimiter='|')
+
+.. ipython:: python
+
+   print open('tmp.sv').read()
+   table = read_table('tmp.sv', sep='|')
+   table
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('tmp.csv')
+
+By specifying a ``chunksize`` to ``read_csv`` or ``read_table``, the return
+value will be an iterable object of type ``TextParser``:
+
+.. ipython::
+
+   In [1]: reader = read_table('tmp.sv', sep='|', chunksize=4)
+
+   In [1]: reader
+
+   In [2]: for chunk in reader:
+      ...:     print chunk
+      ...:
+
+Specifying ``iterator=True`` will also return the ``TextParser`` object:
+
+.. ipython:: python
+
+   reader = read_table('tmp.sv', sep='|', iterator=True)
+   reader.get_chunk(5)
+
 Excel 2003 files
 ----------------
 
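The chunked and iterator behaviour added above can be mimicked with the standard library alone, yielding a fixed number of parsed rows at a time instead of reading everything at once; this is a stdlib sketch of the idea, not the ``TextParser`` implementation:

```python
import csv
from io import StringIO
from itertools import islice

def read_in_chunks(f, chunksize, delimiter=','):
    """Yield lists of parsed rows, chunksize rows at a time."""
    reader = csv.reader(f, delimiter=delimiter)
    while True:
        # islice pulls at most chunksize rows without consuming the rest.
        chunk = list(islice(reader, chunksize))
        if not chunk:
            break
        yield chunk

data = StringIO("a|1\nb|2\nc|3\nd|4\ne|5\n")
for chunk in read_in_chunks(data, chunksize=2, delimiter='|'):
    print(chunk)
```

Because the generator only materializes one chunk at a time, peak memory stays proportional to ``chunksize``, not to the file size.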
@@ -132,7 +235,6 @@ performance HDF5 format using the excellent `PyTables
 .. ipython:: python
    :suppress:
 
-   import os
    os.remove('store.h5')
 
 .. ipython:: python