read_csv(compression='gzip') fails while reading compressed file from s3 #14222

Closed
mandarup opened this issue Sep 14, 2016 · 2 comments
Labels: IO CSV (read_csv, to_csv); IO Data (IO issues that don't fit into a more specific label)

mandarup commented Sep 14, 2016

Reading a gzipped CSV from a private S3 bucket fails:

import pandas as pd
import boto3

session = boto3.Session()
s3client = session.client('s3')
obj = s3client.get_object(Bucket=bucket_name, Key=objkey)
df = pd.read_csv(obj['Body'], compression='gzip', nrows=5, engine='python')

Error

 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
   return _read(filepath_or_buffer, kwds)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 315, in _read
   parser = TextFileReader(filepath_or_buffer, **kwds)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
   self._make_engine(self.engine)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 805, in _make_engine
   self._engine = klass(self.f, **self.options)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1608, in __init__
   self.columns, self.num_original_columns = self._infer_columns()
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1823, in _infer_columns
   line = self._buffered_line()
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1975, in _buffered_line
   return self._next_line()
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 2006, in _next_line
   orig_line = next(self.data)
 File "/Users/username/anaconda/lib/python2.7/gzip.py", line 464, in readline
   c = self.read(readsize)
 File "/Users/username/anaconda/lib/python2.7/gzip.py", line 268, in read
   self._read(readsize)
 File "/Users/username/anaconda/lib/python2.7/gzip.py", line 295, in _read
   pos = self.fileobj.tell()   # Save current position
AttributeError: 'StreamingBody' object has no attribute 'tell'

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.4.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

jreback commented Sep 15, 2016

See #13137. I don't think this (boto3) is fully compatible with pandas. @TomAugspurger

jreback added the IO CSV and IO Data labels on Sep 15, 2016

TomAugspurger commented

@mandarup, our current implementation isn't file-like enough for gzip decompression. As a workaround, you can use https://github.com/dask/s3fs, which implements more file operations (like .tell) and is what pandas might use in the future. For now, it'd be:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()

with fs.open('s3://bucket_name/objkey') as f:
    df = pd.read_csv(f, compression='gzip', nrows=5)

Going to close this for now, as it'll be taken care of with #13137.
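
For readers who cannot install s3fs, another common workaround is to copy the stream into an in-memory buffer first, since io.BytesIO provides the tell() and seek() methods that the traceback above shows gzip calling. The sketch below is not from this thread: ForwardOnlyStream and to_seekable are hypothetical names used only for illustration, the stand-in class merely imitates an object that has read() but no tell(), and the code assumes Python 3 and an object small enough to fit in memory.

```python
import gzip
import io


class ForwardOnlyStream:
    """Hypothetical stand-in for a streaming body: read() only, no tell()/seek()."""

    def __init__(self, payload):
        self._raw = io.BytesIO(payload)

    def read(self, amt=None):
        return self._raw.read() if amt is None else self._raw.read(amt)


def to_seekable(body):
    """Copy a read-only stream into a buffer that supports tell() and seek().

    Note: this reads the entire object into memory.
    """
    return io.BytesIO(body.read())


# Build a small gzipped CSV in memory so the sketch runs without S3 access.
payload = gzip.compress(b"a,b\n1,2\n3,4\n")
body = ForwardOnlyStream(payload)

buf = to_seekable(body)
with gzip.GzipFile(fileobj=buf) as gz:
    text = gz.read().decode("utf-8")

# With pandas installed, the same kind of buffer could be handed to read_csv:
#   buf = to_seekable(obj['Body'])
#   df = pd.read_csv(buf, compression='gzip', nrows=5)
```

The trade-off is memory: unlike the s3fs approach, the whole compressed object is downloaded before parsing starts, even when only a few rows are requested.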
