
Commit 1381c39

Author: y-p (committed)
Merge pull request #3504 from pjob/s3-support
ENH: Support reading from S3
2 parents bf667e3 + f06b43c

4 files changed (+38 / -14 lines)


README.rst

Lines changed: 1 addition & 0 deletions
@@ -90,6 +90,7 @@ Optional dependencies
 * openpyxl version 1.6.1 or higher, for writing .xlsx files
 * xlrd >= 0.9.0
    * Needed for Excel I/O
+* `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3 access.
 
 
 Installation from sources

RELEASE.rst

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ pandas 0.11.1
 
 - pd.read_html() can now parse HTML string, files or urls and return dataframes
   courtesy of @cpcloud. (GH3477_)
+- Support for reading Amazon S3 files. (GH3504_)
 
 **Improvements to existing features**
 

doc/source/io.rst

Lines changed: 3 additions & 2 deletions
@@ -40,8 +40,9 @@ for some advanced strategies
 
 They can take a number of arguments:
 
-  - ``filepath_or_buffer``: Either a string path to a file, or any object with a
-    ``read`` method (such as an open file or ``StringIO``).
+  - ``filepath_or_buffer``: Either a string path to a file, url
+    (including http, ftp, and s3 locations), or any object with a ``read``
+    method (such as an open file or ``StringIO``).
   - ``sep`` or ``delimiter``: A delimiter / separator to split fields
     on. `read_csv` is capable of inferring the delimiter automatically in some
     cases by "sniffing." The separator may be specified as a regular
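
To make the accepted forms concrete, a short usage sketch (paths and URLs below are hypothetical):

    import pandas as pd
    from StringIO import StringIO

    df = pd.read_csv('data/table.csv')                  # a plain string path
    df = pd.read_csv('http://example.com/table.csv')    # a url (http, ftp, or s3)
    df = pd.read_csv(StringIO('a,b\n1,2\n3,4'))         # any object with a read method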

pandas/io/parsers.py

Lines changed: 33 additions & 12 deletions
@@ -34,7 +34,7 @@ class DateConversionError(Exception):
 Parameters
 ----------
 filepath_or_buffer : string or file handle / StringIO. The string could be
-    a URL. Valid URL schemes include http, ftp, and file. For file URLs, a host
+    a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host
     is expected. For instance, a local file could be
     file ://localhost/path/to/table.csv
 %s
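
Per the updated docstring, file URLs carry an explicit host; a minimal sketch (path hypothetical):

    import pandas as pd

    # file URL with a localhost host, matching the docstring's example form
    df = pd.read_csv('file://localhost/tmp/table.csv')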
@@ -188,6 +188,12 @@ def _is_url(url):
     except:
         return False
 
+def _is_s3_url(url):
+    """ Check for an s3 url """
+    try:
+        return urlparse.urlparse(url).scheme == 's3'
+    except:
+        return False
 
 def _read(filepath_or_buffer, kwds):
     "Generic reader of line files."
@@ -196,17 +202,32 @@ def _read(filepath_or_buffer, kwds):
     if skipfooter is not None:
         kwds['skip_footer'] = skipfooter
 
-    if isinstance(filepath_or_buffer, basestring) and _is_url(filepath_or_buffer):
-        from urllib2 import urlopen
-        filepath_or_buffer = urlopen(filepath_or_buffer)
-        if py3compat.PY3:  # pragma: no cover
-            if encoding:
-                errors = 'strict'
-            else:
-                errors = 'replace'
-                encoding = 'utf-8'
-            bytes = filepath_or_buffer.read()
-            filepath_or_buffer = StringIO(bytes.decode(encoding, errors))
+    if isinstance(filepath_or_buffer, basestring):
+        if _is_url(filepath_or_buffer):
+            from urllib2 import urlopen
+            filepath_or_buffer = urlopen(filepath_or_buffer)
+            if py3compat.PY3:  # pragma: no cover
+                if encoding:
+                    errors = 'strict'
+                else:
+                    errors = 'replace'
+                    encoding = 'utf-8'
+                bytes = filepath_or_buffer.read()
+                filepath_or_buffer = StringIO(bytes.decode(encoding, errors))
+
+        if _is_s3_url(filepath_or_buffer):
+            try:
+                import boto
+            except:
+                raise ImportError("boto is required to handle s3 files")
+            # Assuming AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
+            # are environment variables
+            parsed_url = urlparse.urlparse(filepath_or_buffer)
+            conn = boto.connect_s3()
+            b = conn.get_bucket(parsed_url.netloc)
+            k = boto.s3.key.Key(b)
+            k.key = parsed_url.path
+            filepath_or_buffer = StringIO(k.get_contents_as_string())
 
     if kwds.get('date_parser', None) is not None:
         if isinstance(kwds['parse_dates'], bool):
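
End to end, with boto installed and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY exported, reading an S3-hosted CSV should reduce to a one-liner (bucket and key are placeholders, not from the commit):

    import pandas as pd

    # boto picks the credentials up from the environment, as the comment
    # in the patch notes
    df = pd.read_csv('s3://my-bucket/path/to/table.csv')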
