:suppress:

import numpy as np
+ import os
np.random.seed(123456)
from pandas import *
from StringIO import StringIO
@@ -29,9 +30,8 @@ data into a DataFrame object. They can take a number of arguments:

  - ``path_or_buffer``: Either a string path to a file, or any object with a
    ``read`` method (such as an open file or ``StringIO``).
- - ``delimiter``: For ``read_table`` only, a regular expression to split
-   fields on. ``read_csv`` uses the ``csv`` module to do this and hence only
-   supports comma-separated values.
+ - ``sep``: A delimiter / separator to split fields on. ``read_csv`` is capable
+   of automatically inferring ("sniffing") the delimiter in some cases.
  - ``header``: row number to use as the column names, and the start of the data.
    Defaults to 0 (first row); specify None if there is no header row.
  - ``names``: List of column names to use if header is None.
@@ -47,45 +47,89 @@ data into a DataFrame object. They can take a number of arguments:
    ``dateutil.parser``. Specifying this implicitly sets ``parse_dates`` as True.
  - ``na_values``: optional list of strings to recognize as NaN (missing values),
    in addition to a default set.
-
-
- .. code-block:: ipython
-
-    In [1]: print open('foo.csv').read()
-    date,A,B,C
-    20090101,a,1,2
-    20090102,b,3,4
-    20090103,c,4,5
-
-    # A basic index is created by default:
-    In [3]: read_csv('foo.csv')
-    Out[3]:
-           date  A  B  C
-    0  20090101  a  1  2
-    1  20090102  b  3  4
-    2  20090103  c  4  5
-
-    # Use a column as an index, and parse it as dates.
-    In [3]: df = read_csv('foo.csv', index_col=0, parse_dates=True)
-
-    In [4]: df
-    Out[4]:
-                A  B  C
-    2009-01-01  a  1  2
-    2009-01-02  b  3  4
-    2009-01-03  c  4  5
-
-    # These are python datetime objects
-    In [16]: df.index
-    Out[16]: Index([2009-01-01 00:00:00, 2009-01-02 00:00:00,
-                    2009-01-03 00:00:00], dtype=object)
+ - ``nrows``: Number of rows to read out of the file. Useful for reading only a
+   small portion of a large file (see the sketch after this list).
+ - ``chunksize``: A number of rows to be used to "chunk" a file into
+   pieces. Will cause a ``TextParser`` object to be returned. More on this
+   below in the section on :ref:`iterating and chunking <io.chunking>`.
+ - ``iterator``: If True, return a ``TextParser`` to enable reading a file
+   into memory piece by piece.
+
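To make the interplay of these arguments concrete, here is a minimal sketch, assuming a hypothetical headerless, pipe-delimited file that uses -999 as a missing-value sentinel; it supplies column names, flags the sentinel, and reads only the first two rows:

.. code-block:: python

   # contents of the hypothetical file 'data.txt':
   #
   #   20090101|a|1|-999
   #   20090102|b|3|4
   #   20090103|c|4|5

   df = read_csv('data.txt', sep='|', header=None,
                 names=['date', 'A', 'B', 'C'],
                 na_values=['-999'], nrows=2)
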
+ .. ipython:: python
+    :suppress:
+
+    f = open('foo.csv', 'w')
+    f.write('date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5')
+    f.close()
+
+ Consider a typical CSV file containing, in this case, some time series data:
+
+ .. ipython:: python
+
+    print open('foo.csv').read()

+ The default for ``read_csv`` is to create a DataFrame with simple numbered rows:
+
+ .. ipython:: python
+
+    read_csv('foo.csv')
+
+ In the case of indexed data, you can pass the column number (or a list of
+ column numbers, for a hierarchical index) you wish to use as the index. If the
+ index values are dates and you want them to be converted to ``datetime``
+ objects, pass ``parse_dates=True``:
+
+ .. ipython:: python
+
+    # Use a column as an index, and parse it as dates.
+    df = read_csv('foo.csv', index_col=0, parse_dates=True)
+    df
+    # These are python datetime objects
+    df.index
+
+ .. ipython:: python
+    :suppress:
+
+    os.remove('foo.csv')

The parsers make every attempt to "do the right thing" and not be very
fragile. Type inference is a pretty big deal. So if a column can be coerced to
integer dtype without altering the contents, it will do so. Any non-numeric
columns will come through as object dtype as with the rest of pandas objects.

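As a small illustration of that inference, consider a hypothetical file with one textual and two numeric columns; the sketch below (the file name and contents are made up) shows the dtypes you would expect after parsing:

.. code-block:: python

   # hypothetical file 'mixed.csv':
   #
   #   x,y,z
   #   a,1,2.5
   #   b,3,4.0

   data = read_csv('mixed.csv')
   data['y'].dtype   # coerced to an integer dtype
   data['z'].dtype   # parsed as floating point
   data['x'].dtype   # non-numeric, comes through as object dtype
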
+ Files with an "implicit" index column
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ .. ipython:: python
+    :suppress:
+
+    f = open('foo.csv', 'w')
+    f.write('A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5')
+    f.close()
+
+ Consider a file with one less entry in the header than the number of data
+ columns:
+
+ .. ipython:: python
+
+    print open('foo.csv').read()
+
+ In this special case, ``read_csv`` assumes that the first column is to be used
+ as the index of the DataFrame:
+
+ .. ipython:: python
+
+    read_csv('foo.csv')
+
+ Note that the dates weren't automatically parsed. In that case you would need
+ to do as before:
+
+ .. ipython:: python
+
+    df = read_csv('foo.csv', parse_dates=True)
+    df.index
+
+

Reading DataFrame objects with ``MultiIndex``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -104,6 +148,65 @@ column numbers to turn multiple columns into a ``MultiIndex``:
   df
   df.ix[1978]

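The call that produces a frame like this is sketched below; the file name and column layout are hypothetical, the essential point being that ``index_col`` receives a list of column numbers:

.. code-block:: python

   # hypothetical 'data.csv' whose first two columns form the hierarchical index:
   #
   #   year,quarter,realgdp,infl
   #   1978,1,2464.0,6.6
   #   1978,2,2527.9,9.3

   df = read_csv('data.csv', index_col=[0, 1])
   df.ix[1978]        # cross-section of all rows with outer index value 1978
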
+ .. .. _io.sniff:
+
+ .. Automatically "sniffing" the delimiter
+ .. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ .. ``read_csv`` is capable of inferring delimited, but not necessarily
+ .. comma-separated, files in some cases:
+
+ .. .. ipython:: python
+
+ .. print open('tmp.csv').read()
+ .. read_csv('tmp.csv')
+
+
+ .. _io.chunking:
+
+ Iterating through files chunk by chunk
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Suppose you wish to iterate through a (potentially very large) file lazily
+ rather than reading the entire file into memory, such as the following:
+
+ .. ipython:: python
+    :suppress:
+
+    df[:7].to_csv('tmp.sv', delimiter='|')
+
+ .. ipython:: python
+
+    print open('tmp.sv').read()
+    table = read_table('tmp.sv', sep='|')
+    table
+
+ .. ipython:: python
+    :suppress:
+
+    os.remove('tmp.csv')
+
+ By specifying a ``chunksize`` to ``read_csv`` or ``read_table``, the return
+ value will be an iterable object of type ``TextParser``:
+
+ .. ipython::
+
+    In [1]: reader = read_table('tmp.sv', sep='|', chunksize=4)
+
+    In [2]: reader
+
+    In [3]: for chunk in reader:
+       ...:     print chunk
+       ...:
+
+ Specifying ``iterator=True`` will also return the ``TextParser`` object:
+
+ .. ipython:: python
+
+    reader = read_table('tmp.sv', sep='|', iterator=True)
+    reader.get_chunk(5)
+

Excel 2003 files
----------------

@@ -132,7 +235,6 @@ performance HDF5 format using the excellent `PyTables
.. ipython:: python
   :suppress:

-   import os
    os.remove('store.h5')

.. ipython:: python