read_csv: Infers different column types in different runs #13604

aptiko · 2016-07-10T06:04:42Z

#!/usr/bin/env python3

from io import StringIO

import pandas as pd

test_timeseries = """\
2008-02-07 09:40,1032.43
2008-02-07 09:50,1042.54
2008-02-07 10:00,1051.65
"""

df = pd.read_csv(StringIO(test_timeseries), parse_dates=[0],
                 usecols=['date', 'value'], index_col=0, header=None,
                 names=('date', 'value'))
print (df.value.dtype)

I run this program 10 times and the result is sometimes float64 and sometimes object.

This happens with pandas 0.18.1 on Debian Jessie amd64 with Python 3.4.2 and numpy 1.11.1. I don't see it happening with Debian's packaged pandas 0.14.1.

I can work around this by specifying the dtype argument; but shouldn't pandas behave deterministically when it's omitted?
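For reference, a minimal sketch of the workaround, assuming the `dtype` mapping is keyed by the column name given in `names` (the exact form used is not shown in this report). It is the same call as above with one extra argument:

```python
from io import StringIO

import pandas as pd

test_timeseries = """\
2008-02-07 09:40,1032.43
2008-02-07 09:50,1042.54
2008-02-07 10:00,1051.65
"""

# Same call as in the report, plus an explicit dtype for the 'value' column,
# so that pandas does not have to infer the type of that column.
df = pd.read_csv(StringIO(test_timeseries), parse_dates=[0],
                 usecols=['date', 'value'], index_col=0, header=None,
                 names=('date', 'value'), dtype={'value': 'float64'})
print(df['value'].dtype)
```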

sinhrks · 2016-07-10T15:22:01Z

Thanks for the report. Unfortunately I couldn't reproduce it on my Mac. There it always comes out as object (though I suppose it should be float64).

If no options are specified, dtypes are object (date) and float64 (value).

jreback · 2016-07-10T17:05:27Z

Please post the output of pd.show_versions() and the exact code that you are running, and print the pandas version in the running code.

aptiko · 2016-07-11T08:30:30Z

Here's the program I'm running, which I call test13604.py (the only difference from the one I initially presented is in the last two lines, which print things):

#!/usr/bin/env python3

from io import StringIO

import pandas as pd

test_timeseries = """\
2008-02-07 09:40,1032.43
2008-02-07 09:50,1042.54
2008-02-07 10:00,1051.65
"""

df = pd.read_csv(StringIO(test_timeseries), parse_dates=[0],
                 usecols=['date', 'value'], index_col=0, header=None,
                 names=('date', 'value'))
print ('Result: {}'.format(df.value.dtype))
pd.show_versions()

Here is some output:

anthony@seska:pd$ mkvirtualenv --python=/usr/bin/python3 pandas
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in pandas/bin/python3
Also creating executable in pandas/bin/python
Installing setuptools, pip...done.

(pandas)anthony@seska:pd$ pip install pandas
[snip]
Successfully installed pandas python-dateutil pytz numpy six
Cleaning up...

(pandas)anthony@seska:pd$ for i in 1 2 3 4 5 6 7 8 9 10; do python test13604.py|grep Result; done
Result: object
Result: object
Result: object
Result: float64
Result: float64
Result: object
Result: float64
Result: float64
Result: float64
Result: object

(pandas)anthony@seska:pd$ python test13604.py 
Result: object

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 1.5.6
setuptools: 5.5.1
Cython: None
numpy: 1.11.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

jorisvandenbossche · 2016-07-11T09:00:04Z

I can reproduce this, but only when called from a script. If I repeat it multiple times in an interactive console, it always gives the same result.
EDIT: I also see it in the interactive console, but there it only changes after restarting the kernel. So within one interactive session, it always returns float64 or always returns object.

I only seem to see this with Python 3 and not with Python 2, but there are also many other differences between the two environments, so I'm not sure this is the cause of the difference.

vrajmohan · 2016-08-12T14:15:50Z

I have been able to isolate it to https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1457. We are building a list from a set and expecting a consistent order. I would love to help, but I don't know Cython well enough to take it any further.
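To illustrate why a list built from a set can come out in a different order from run to run: Python 3 randomizes string hashes per process by default (this is controlled by the `PYTHONHASHSEED` environment variable), so the iteration order of a set of strings is only stable within one interpreter process. The sketch below is not the pandas code. It starts fresh interpreters with different seeds and collects the orders seen; the helper name `order_for_seed` is made up for this example.

```python
import os
import subprocess
import sys

# One-liner run in a child interpreter: turn a set of column names into a list.
SNIPPET = "print(list({'date', 'value'}))"


def order_for_seed(seed):
    """Hypothetical helper: run SNIPPET in a fresh interpreter with a fixed hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output([sys.executable, "-c", SNIPPET], env=env)
    return out.decode().strip()


# Each seed stands in for one separate run of a script.
orders = {order_for_seed(seed) for seed in range(10)}
print(orders)

# Sorting the set removes the dependence on its iteration order.
stable = sorted({'date', 'value'})
print(stable)
```

Within a single process the order does not change, which would be consistent with the interactive-session behaviour reported above. Python 2 leaves hash randomization off by default, which may be why the problem was only seen on Python 3.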

gfyoung · 2017-01-03T22:57:52Z

@jreback : This was patched in #14984. Can be closed.

jreback · 2017-01-03T23:47:03Z

@gfyoung is there a replicating test for this?

gfyoung · 2017-01-04T02:19:01Z

I added a test here that replicates the same situation. Should I explicitly add this example to that set of tests as well?

jreback · 2017-01-04T02:20:21Z

yes i think a replica of this issue would be good

aptiko added a commit to openmeteo/pd2hts that referenced this issue Jul 10, 2016

Make read_csv more robust in unit tests

8aab7f9

If the dtype argument is not specified in read_csv, the result is not always the same in all runs. This is probably a pandas bug (pandas-dev/pandas#13604).

sinhrks added the Can't Repro, IO CSV (read_csv, to_csv), and Dtype Conversions (unexpected or buggy dtype conversions) labels Jul 10, 2016

jorisvandenbossche added Bug and removed Can't Repro labels Jul 11, 2016

jreback added this to the 0.20.0 milestone Jan 3, 2017

jreback added the Testing (pandas testing functions or related to the test suite) label Jan 3, 2017

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 4, 2017

TST: Add new test for flaky usecols

f59565c

Closes gh-13604.

gfyoung mentioned this issue Jan 4, 2017

TST: Add new test for flaky usecols #15051

Merged

jorisvandenbossche closed this as completed in #15051 Jan 4, 2017

jorisvandenbossche pushed a commit that referenced this issue Jan 4, 2017

TST: Add new test for flaky usecols (#15051)

098831d

Closes gh-13604.
