read_csv c engine accepts binary mode data and python engine rejects it #23779

Closed
mgeens opened this issue Nov 19, 2018 · 4 comments · Fixed by #27925
Labels
Compat (pandas objects compatibility with NumPy or Python functions) · good first issue · IO CSV (read_csv, to_csv) · Testing (pandas testing functions or related to the test suite)

Comments

@mgeens

mgeens commented Nov 19, 2018

Code Sample

import pandas as pd

if __name__ == "__main__":
    with open('test.csv', 'w') as f:
        f.write('1,2,3\n4,5,6')
    with open('test.csv', 'rt') as f:
        pd.read_csv(f, header=None)
    with open('test.csv', 'rb') as f:
        pd.read_csv(f, header=None)
    with open('test.csv', 'rt') as f:
        pd.read_csv(f, header=None, engine='python')
    with open('test.csv', 'rb') as f:
        pd.read_csv(f, header=None, engine='python')

Problem description

The second read_csv call (using the C engine and a file opened in binary mode) correctly reads the CSV. The fourth read_csv call (using the Python engine and a file opened in binary mode) raises an exception stating that the file needs to be opened in text mode:

pandas.errors.ParserError: iterator should return strings, not bytes (did you open the file in text mode?)

Perhaps this is intended behavior, but I found the difference between the engines surprising, and I was also surprised that binary mode was accepted at all.

Expected Output

Either the C engine should reject binary-mode files, or the Python engine should accept them.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@mgeens mgeens changed the title read_csv c engine accepts binary data and python engine rejects it read_csv c engine accepts binary mode data and python engine rejects it Nov 19, 2018
@mroeschke
Member

cc @gfyoung

@mroeschke mroeschke added the IO CSV read_csv, to_csv label Jan 13, 2019
@gfyoung gfyoung added the Compat pandas objects compatibility with NumPy or Python functions label Jan 13, 2019
@gfyoung
Member

gfyoung commented Jan 13, 2019

@mgeens : Thanks for opening this! Sorry that this got lost in the pile of issues that we have 😞

I strongly believe that this discrepancy is symptomatic of limitations in Python's native csv library. The Python engine is largely a wrapper around it, and the error message you're showing in fact comes from the csv module.

To illustrate my point:

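import csv

# Write a small CSV file in text mode.
with open('test.csv', 'w') as f:
    f.write('1,2,3\n4,5,6')

# Reopen it in binary mode: csv.reader now receives bytes, not str.
with open('test.csv', 'rb') as f:
    r = csv.reader(f)
    print(next(r))  # raises _csv.Error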

This outputs:

...
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

This is beyond our control, so the best we can do is to test the behavior for the C engine (if it doesn't exist already). You're more than welcome to do that!
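For reference, the same csv behavior can be reproduced without touching the filesystem. The following is only a sketch using in-memory buffers from the standard library; the variable names are illustrative and nothing here is pandas code. It shows that csv.reader rejects an iterator yielding bytes and accepts one yielding str.

```python
import csv
import io

data = b"1,2,3\n4,5,6"

# A binary buffer yields bytes lines, which csv.reader refuses.
try:
    next(csv.reader(io.BytesIO(data)))
    bytes_error = None
except csv.Error as exc:
    bytes_error = exc

# A text buffer yields str lines, which csv.reader parses normally.
rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))

print(type(bytes_error).__name__)
print(rows)
```

The exact wording of the error message may differ between Python versions, so the sketch only checks that csv.Error is raised.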

@gfyoung gfyoung added Testing pandas testing functions or related to the test suite good first issue labels Jan 13, 2019
gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 16, 2019
Python's native CSV library doesn't accept such
files, but we do for the C parser.

Closes pandas-devgh-23779.
gfyoung added a commit that referenced this issue Jan 16, 2019
Python's native CSV library doesn't accept such
files, but we do for the C parser.

Closes gh-23779.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Python's native CSV library doesn't accept such
files, but we do for the C parser.

Closes pandas-devgh-23779.
@fiendish
Contributor

fiendish commented Aug 14, 2019

I would like to have this reopened, because it's not actually beyond pandas' control to fix this cleanly, and because right now the read_csv function doc specifies "file-like object", not "file-like object opened in text mode".

Just conditionally wrap the object in an io.TextIOWrapper, since you know that CSV is by definition a text format:

import csv
import io

def __ascii_wrap(potentially_binary_buffer):
    try:
        return io.TextIOWrapper(potentially_binary_buffer)
    except Exception:
        return potentially_binary_buffer


with open('test.csv', 'w') as f:
    f.write('1,2,3\n4,5,6')

r = csv.reader(__ascii_wrap(open('test.csv', 'rb')))
print(next(r))
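A variant of the same idea, offered only as a sketch: instead of relying on the TextIOWrapper constructor to raise for text buffers, check the buffer's type explicitly and wrap only binary ones. The helper name ensure_text and the utf-8 default are assumptions made for illustration, and this is not necessarily how the eventual fix was implemented in pandas.

```python
import csv
import io


def ensure_text(buffer, encoding="utf-8"):
    """Hypothetical helper: wrap binary buffers so csv.reader receives str."""
    # Files opened with 'rb' and io.BytesIO are BufferedIOBase instances;
    # unbuffered binary files (buffering=0) are RawIOBase instances.
    if isinstance(buffer, (io.RawIOBase, io.BufferedIOBase)):
        # newline="" leaves line-ending handling to the csv module.
        return io.TextIOWrapper(buffer, encoding=encoding, newline="")
    # Text buffers (TextIOBase) are returned unchanged.
    return buffer


binary_rows = list(csv.reader(ensure_text(io.BytesIO(b"1,2,3\n4,5,6"))))
text_rows = list(csv.reader(ensure_text(io.StringIO("1,2,3\n4,5,6"))))

print(binary_rows)
print(text_rows)
```

Both calls should yield the same rows, so a caller would not need to care which mode the buffer was opened in. A real implementation would also have to decide which encoding to use, which this sketch simply hard-codes.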

@gfyoung
Member

gfyoung commented Aug 14, 2019

@fiendish : That's a good point. You're more than welcome to open a PR to implement this.

@gfyoung gfyoung reopened this Aug 14, 2019
TomAugspurger pushed a commit that referenced this issue Aug 19, 2019
* BUG: Help python csv engine read binary buffers

The file buffer given to read_csv could have been opened in
binary mode, but the python csv reader errors on binary buffers.

closes #23779
EunSeop pushed a commit to EunSeop/pandas that referenced this issue Aug 20, 2019
* BUG: Help python csv engine read binary buffers

The file buffer given to read_csv could have been opened in
binary mode, but the python csv reader errors on binary buffers.

closes pandas-dev#23779
galuhsahid pushed a commit to galuhsahid/pandas that referenced this issue Aug 25, 2019
* BUG: Help python csv engine read binary buffers

The file buffer given to read_csv could have been opened in
binary mode, but the python csv reader errors on binary buffers.

closes pandas-dev#23779

4 participants