BUG: pandas.read_csv corrupts integers on reading #48170

Ark-kun · 2022-08-20T00:09:26Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

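import io

import pandas

# col1 holds one missing value and one large integer; col2 holds the reverse.
df = pandas.read_csv(io.StringIO("""col1,col2
,1
1234567890123456789,
"""))
print(df)
print(df.dtypes)
# convert_dtypes() casts back to a nullable integer, but precision is already lost.
print(df.convert_dtypes())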

           col1  col2
0           NaN   1.0
1  1.234568e+18   NaN

col1    float64
col2    float64
dtype: object

                  col1  col2
0                 <NA>     1
1  1234567890123456768  <NA>

The DataFrame value is 1234567890123456768, while the original value is 1234567890123456789.



### Issue Description

`pandas.read_csv` reads CSV files in a way that corrupts data.
Pandas reads integer columns that contain missing values as floats, which loses numerical precision for large values.
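
The size of the error matches what rounding to a 64-bit float would produce. A minimal sketch in plain Python (no pandas involved, so it only illustrates the rounding itself, not pandas' actual parsing path; the variable names are illustrative):

```python
# The integer from the report.
original = 1234567890123456789

# Converting to float rounds to the nearest representable double.
as_float = float(original)
round_tripped = int(as_float)

print(round_tripped)             # 1234567890123456768
print(original - round_tripped)  # 21

# float64 has a 53-bit significand, so every integer up to 2**53 is exact;
# beyond that, gaps appear between representable values.
print(original > 2**53)          # True
```

The round-tripped value is the same one that `convert_dtypes()` shows in the output above.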

### Expected Behavior

I expect Pandas to read integers as integers.

I expect Pandas to not corrupt data.
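
One possible workaround, sketched under the assumption that reading every column as text is acceptable: pass `dtype=str` so no float conversion happens during parsing, then convert each column to the nullable `Int64` dtype. The helper `to_nullable_int` is hypothetical, not part of pandas.

```python
import io

import pandas

csv_text = """col1,col2
,1
1234567890123456789,
"""

# Read every column as text so the parser never converts the values to float.
raw = pandas.read_csv(io.StringIO(csv_text), dtype=str)


def to_nullable_int(column):
    """Hypothetical helper: parse non-missing cells with Python's exact int()."""
    values = [pandas.NA if pandas.isna(cell) else int(cell) for cell in column]
    return pandas.array(values, dtype="Int64")


# Rebuild the frame column by column with nullable integer arrays.
df = pandas.DataFrame({name: to_nullable_int(raw[name]) for name in raw.columns})

print(df)
print(df.dtypes)
```

This only suits columns known to contain integers or blanks; any other text would make `int()` raise a `ValueError`.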

### Installed Versions

<details>

INSTALLED VERSIONS
------------------
commit           : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python           : 3.7.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.19.0-10-cloud-amd64
Version          : #1 SMP Debian 4.19.132-1 (2020-07-24)
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.1
numpy            : 1.21.6
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 22.1.2
setuptools       : 49.6.0.post20200814
Cython           : 0.29.21
pytest           : 6.0.1
hypothesis       : None
sphinx           : 3.2.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.3
lxml.etree       : 4.5.2
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 7.17.0
pandas_datareader: None
bs4              : 4.9.1
bottleneck       : 1.3.2
fsspec           : 0.8.0
fastparquet      : None
gcsfs            : 0.7.0
matplotlib       : 3.3.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : 9.0.0
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.19
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.49.1

</details>


phofl · 2022-08-20T12:40:51Z

We are converting to float under the hood, which is considered a bug. This causes the loss of precision. There is an open issue about this somewhere on the tracker.

… data Bugs: pandas-dev/pandas#48170 pandas-dev/pandas#48173 pandas-dev/pandas#48175 See also: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html

Ark-kun added the Bug and Needs Triage labels on Aug 20, 2022

phofl closed this as completed Aug 20, 2022

Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Aug 31, 2022

fix: Pandas - Fixed Pandas data mangling for components that do not l…

8c78aae

…ook at column data Bugs: pandas-dev/pandas#48170 pandas-dev/pandas#48173 pandas-dev/pandas#48175
