read_csv(compression='gzip') fails while reading compressed file from s3 #14222

mandarup · 2016-09-14T15:12:09Z

Reading a gzipped CSV from a private S3 bucket fails:

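import pandas as pd
import boto3

# Placeholder values; substitute your own bucket and object key.
bucket_name = 'bucket_name'
objkey = 'objkey'

session = boto3.Session()
s3client = session.client('s3')
obj = s3client.get_object(Bucket=bucket_name, Key=objkey)
# obj['Body'] is a botocore StreamingBody, which is passed straight to pandas.
df = pd.read_csv(obj['Body'], compression='gzip', nrows=5, engine='python')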

Error

 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
   return _read(filepath_or_buffer, kwds)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 315, in _read
   parser = TextFileReader(filepath_or_buffer, **kwds)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
   self._make_engine(self.engine)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 805, in _make_engine
   self._engine = klass(self.f, **self.options)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1608, in __init__
   self.columns, self.num_original_columns = self._infer_columns()
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1823, in _infer_columns
   line = self._buffered_line()
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1975, in _buffered_line
   return self._next_line()
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 2006, in _next_line
   orig_line = next(self.data)
 File "/Users/username/anaconda/lib/python2.7/gzip.py", line 464, in readline
   c = self.read(readsize)
 File "/Users/username/anaconda/lib/python2.7/gzip.py", line 268, in read
   self._read(readsize)
 File "/Users/username/anaconda/lib/python2.7/gzip.py", line 295, in _read
   pos = self.fileobj.tell()   # Save current position
AttributeError: 'StreamingBody' object has no attribute 'tell'

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.4.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None


jreback · 2016-09-15T10:36:03Z

See #13137. I don't think boto3 is fully compatible with pandas yet. @TomAugspurger

TomAugspurger · 2016-09-15T12:15:06Z

@mandarup, our current implementation isn't file-like enough for the gzip decompression. As a workaround for now you can use https://github.com/dask/s3fs, which implements more file operations (like `.tell`) and is what pandas might use in the future. For now, it would be:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()

with fs.open('s3://bucket_name/objkey') as f:
    df = pd.read_csv(f, compression='gzip', nrows=5)

Going to close this for now, as it'll be taken care of with #13137.
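For objects small enough to fit in memory, another possible workaround is to read the whole body into an `io.BytesIO`, which does provide `tell` and `seek`. The sketch below is not from the thread: `buffer_stream` and `ReadOnlyStream` are hypothetical names, and `ReadOnlyStream` is only a stand-in for the `StreamingBody` in the traceback so that the example runs without S3 access.

```python
import gzip
import io


def buffer_stream(body):
    """Read a read()-only stream fully into a seekable in-memory buffer."""
    return io.BytesIO(body.read())


class ReadOnlyStream(object):
    """Hypothetical stand-in for a StreamingBody: has read() but no tell()/seek()."""

    def __init__(self, data):
        self._raw = io.BytesIO(data)

    def read(self, amt=None):
        return self._raw.read() if amt is None else self._raw.read(amt)


# Build a small gzipped CSV payload in memory for the demonstration.
payload = io.BytesIO()
with gzip.GzipFile(fileobj=payload, mode="wb") as gz:
    gz.write(b"a,b\n1,2\n3,4\n")

stream = ReadOnlyStream(payload.getvalue())
buf = buffer_stream(stream)

# The buffer supports tell() and seek(), so gzip can decompress from it.
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    text = gz.read().decode("utf-8")

# With pandas, the buffered body could be passed in the same way (assumption,
# not verified against pandas 0.18.1):
#     df = pd.read_csv(buffer_stream(obj['Body']), compression='gzip', nrows=5)
```

This trades memory for compatibility, since the entire compressed object is held in RAM; for large objects the s3fs approach above is likely the better fit.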

jreback added the labels "IO CSV" (read_csv, to_csv) and "IO Data" (IO issues that don't fit into a more specific label) Sep 15, 2016

TomAugspurger closed this as completed Sep 15, 2016

jorisvandenbossche added this to the No action milestone Sep 15, 2016

wmitsuda mentioned this issue May 4, 2017

read_csv(compression='gzip') fails while reading compressed file with tf.gfile.GFile in Python 2 #16241

Closed
