read_csv: Infers different column types in different runs #13604

Closed
aptiko opened this issue Jul 10, 2016 · 9 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv Testing pandas testing functions or related to the test suite

@aptiko

aptiko commented Jul 10, 2016

#!/usr/bin/env python3

from io import StringIO

import pandas as pd

test_timeseries = """\
2008-02-07 09:40,1032.43
2008-02-07 09:50,1042.54
2008-02-07 10:00,1051.65
"""

df = pd.read_csv(StringIO(test_timeseries), parse_dates=[0],
                 usecols=['date', 'value'], index_col=0, header=None,
                 names=('date', 'value'))
print(df.value.dtype)

I run this program 10 times and the result is sometimes float64 and sometimes object.

This happens with pandas 0.18.1 on Debian Jessie amd64 with Python 3.4.2 and numpy 1.11.1. I don't see it happening with Debian's packaged pandas 0.14.1.

I can work around this by specifying the dtype argument; but shouldn't pandas behave deterministically when it's omitted?
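The workaround mentioned above can be sketched as follows. This is a minimal illustration, assuming the same three-row input; it pins the `value` column to `float64` through the `dtype` argument so the result no longer depends on type inference. Passing `names` as a list rather than a tuple is a choice made for this sketch.

```python
from io import StringIO

import pandas as pd

test_timeseries = """\
2008-02-07 09:40,1032.43
2008-02-07 09:50,1042.54
2008-02-07 10:00,1051.65
"""

df = pd.read_csv(
    StringIO(test_timeseries),
    header=None,
    names=["date", "value"],
    usecols=["date", "value"],
    parse_dates=[0],
    index_col=0,
    # Pin the column type explicitly instead of relying on inference.
    dtype={"value": "float64"},
)
print(df["value"].dtype)
```

With the dtype pinned, the inferred type is no longer involved, which is why the workaround sidesteps the behaviour reported here.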

aptiko added a commit to openmeteo/pd2hts that referenced this issue Jul 10, 2016
If the dtype argument is not specified in read_csv, the result is not
always the same in all runs. This is probably a pandas bug
(pandas-dev/pandas#13604).
@sinhrks sinhrks added Can't Repro IO CSV read_csv, to_csv Dtype Conversions Unexpected or buggy dtype conversions labels Jul 10, 2016
@sinhrks
Member

sinhrks commented Jul 10, 2016

Thanks for the report. Unfortunately, I couldn't reproduce it on my Mac, where the dtype always appears to be object (I suppose it should be float64).

If no options are specified, dtypes are object (date) and float64 (value).

@jreback
Contributor

jreback commented Jul 10, 2016

Please post the output of pd.show_versions() and the exact code that you are running, and print the pandas version in the running code.

@aptiko
Author

aptiko commented Jul 11, 2016

Here's the program I'm running, which I call test13604.py (the only difference from the version I initially presented is in the last two lines, which print things):

#!/usr/bin/env python3

from io import StringIO

import pandas as pd

test_timeseries = """\
2008-02-07 09:40,1032.43
2008-02-07 09:50,1042.54
2008-02-07 10:00,1051.65
"""

df = pd.read_csv(StringIO(test_timeseries), parse_dates=[0],
                 usecols=['date', 'value'], index_col=0, header=None,
                 names=('date', 'value'))
print('Result: {}'.format(df.value.dtype))
pd.show_versions()

Here is some output:

anthony@seska:pd$ mkvirtualenv --python=/usr/bin/python3 pandas
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in pandas/bin/python3
Also creating executable in pandas/bin/python
Installing setuptools, pip...done.

(pandas)anthony@seska:pd$ pip install pandas
[snip]
Successfully installed pandas python-dateutil pytz numpy six
Cleaning up...

(pandas)anthony@seska:pd$ for i in 1 2 3 4 5 6 7 8 9 10; do python test13604.py|grep Result; done
Result: object
Result: object
Result: object
Result: float64
Result: float64
Result: object
Result: float64
Result: float64
Result: float64
Result: object

(pandas)anthony@seska:pd$ python test13604.py 
Result: object

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 1.5.6
setuptools: 5.5.1
Cython: None
numpy: 1.11.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@jorisvandenbossche
Member

jorisvandenbossche commented Jul 11, 2016

I can reproduce this, but only when called from a script. If I repeat it multiple times in an interactive console, it always gives the same result.
EDIT: I also see it in the interactive console, but there it only changes after restarting the kernel. So within one interactive session, it always returns float64 or always object.

I only seem to see this with Python 3 and not with Python 2, but there are also many other differences between the two environments, so I'm not sure this is the cause of the difference.

@vrajmohan

I have been able to isolate it to https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1457. We are building a list from a set and expecting a consistent order. I would love to help, but I don't know Cython well enough to take it any further.
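The set-ordering mechanism described above can be illustrated without pandas. The sketch below is not the parser's code: `order_with_seed` and `stable_columns` are hypothetical helpers, and the column names are assumed to be strings held in a set. In Python 3, string hashing is randomized per process unless `PYTHONHASHSEED` is fixed, so a list built from a set may come out in a different order in each run while staying constant within one process.

```python
import os
import subprocess
import sys

# Child program: build a list from a set of column names and print it.
SNIPPET = "print(list({'date', 'value'}))"


def order_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed string-hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def stable_columns(usecols):
    """One way to remove the order dependence: sort before building the list."""
    return sorted(usecols)


# Collect the distinct orders seen across several hash seeds. More than one
# order may appear, but that depends on the interpreter build.
orders = {order_with_seed(seed) for seed in range(10)}
print(orders)
print(stable_columns({"value", "date"}))
```

Sorting is only one possible remedy and is shown here as an assumption; it is not necessarily how the issue was eventually patched.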

@gfyoung
Member

gfyoung commented Jan 3, 2017

@jreback : This was patched in #14984. Can be closed.

@jreback
Contributor

jreback commented Jan 3, 2017

@gfyoung is there a replicating test for this?

@jreback jreback added this to the 0.20.0 milestone Jan 3, 2017
@jreback jreback added the Testing pandas testing functions or related to the test suite label Jan 3, 2017
@gfyoung
Member

gfyoung commented Jan 4, 2017

I added a test here that replicates the same situation. Should I explicitly add this example to that set of tests as well?

@jreback
Contributor

jreback commented Jan 4, 2017

Yes, I think a replica of this issue would be good.
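A replica of this issue might look roughly like the sketch below. It is not the test that was added to pandas. It assumes the flakiness comes from per-process hash randomization, so it runs the reported `read_csv` call in fresh interpreters under several `PYTHONHASHSEED` values and checks that the `value` column is `float64` every time. `REPRO`, `value_dtype_under_seed`, and the test name are hypothetical, and the number of seeds is arbitrary.

```python
import os
import subprocess
import sys

# The reported call, run in a child process so that each run gets its own hash seed.
REPRO = r'''
from io import StringIO
import pandas as pd

data = "2008-02-07 09:40,1032.43\n2008-02-07 09:50,1042.54\n2008-02-07 10:00,1051.65\n"
df = pd.read_csv(StringIO(data), parse_dates=[0], usecols=["date", "value"],
                 index_col=0, header=None, names=["date", "value"])
print(df["value"].dtype)
'''


def value_dtype_under_seed(seed):
    """Return the dtype printed by REPRO when run with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", REPRO],
        env=env, capture_output=True, text=True, check=True,
    )
    # Take the last line in case anything else was written to stdout.
    return result.stdout.strip().splitlines()[-1]


def test_dtype_is_stable_across_hash_seeds():
    dtypes = {seed: value_dtype_under_seed(seed) for seed in range(4)}
    assert set(dtypes.values()) == {"float64"}, dtypes
```

Running the call in subprocesses matters because, as noted earlier in the thread, the result does not change within a single interpreter session.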
