to_csv with UTF16 Incorrectly Treats BOM as column #26446

WillAyd · 2019-05-18T15:59:38Z

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd
In [2]: df = pd.DataFrame([['foo']])
In [4]: df.to_csv('afile.csv', encoding='utf16')
In [5]: with open('afile.csv', 'rb') as afile:
   ...:     print(afile.read())
   ...:
b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00'

Note the comma after the BOM '\xff\xfe'. This has the unintended side effect of creating a malformed structure, especially when trying to read the file back in.
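To make the role of the BOM concrete, here is a minimal sketch using only the Python standard library (no pandas involved). It decodes the bytes printed above in two ways; the variable names are illustrative only. It shows that the generic utf-16 codec consumes the BOM, while the explicit utf-16-le codec leaves it in the text as U+FEFF in front of the first comma.

```python
import codecs

# Bytes copied from the report above (written on a little-endian machine).
raw = b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00'

# The payload starts with the UTF-16 little-endian byte order mark.
starts_with_bom = raw.startswith(codecs.BOM_UTF16_LE)

# The generic 'utf-16' codec uses the BOM to pick the byte order, then drops it.
text = raw.decode('utf-16')

# The explicit 'utf-16-le' codec does not strip the BOM; it survives as U+FEFF.
text_le = raw.decode('utf-16-le')

print(starts_with_bom)
print(repr(text))
print(repr(text_le))
```

Whether a reader sees the BOM as part of the first header cell therefore depends on which codec variant it uses to decode the file.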

INSTALLED VERSIONS

commit: 9d5f110
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.25.0.dev0+587.g9d5f1105c
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.7
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.0
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.1
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.18
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.0
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


shantanu-gontia · 2019-05-20T12:34:36Z

It appears there are problems reading with UTF-8 encoding as well, however not the same as that of UTF-16, specifically here the indexes aren't being constructed correctly. But I guess this is how it is supposed to read in UTF-8 when no options are specified. I will investigate this further.

In [6]: df = pd.DataFrame([['foo']])                                                                                                

In [7]: df                                                                                                                          
Out[7]: 
     0
0  foo

In [8]: df.to_csv('utf8file.csv', encoding='utf8')                                                                                  

In [9]: dfr_8 = pd.read_csv('utf8file.csv', encoding='utf8')                                                                        

In [10]: dfr_8                                                                                                                      
Out[10]: 
   Unnamed: 0    0
0           0  foo

In [11]: df.to_csv('utf16file.csv', encoding='utf16')                                                                               

In [12]: dfr_16 = pd.read_csv('utf16file.csv', encoding='utf16')                                                                    

In [14]: dfr_16                                                                                                                     
Out[14]: 
   Unnamed: 0  Unnamed: 1
0         NaN         NaN
1         NaN         NaN

In [15]: dfr_8._data                                                                                                                
Out[15]: 
BlockManager
Items: Index(['Unnamed: 0', '0'], dtype='object')
Axis 1: RangeIndex(start=0, stop=1, step=1)
IntBlock: slice(0, 1, 1), 1 x 1, dtype: int64
ObjectBlock: slice(1, 2, 1), 1 x 1, dtype: object

In [16]: df._data                                                                                                                   
Out[16]: 
BlockManager
Items: RangeIndex(start=0, stop=1, step=1)
Axis 1: RangeIndex(start=0, stop=1, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 1, dtype: object

In [17]: dfr_16._data                                                                                                               
Out[17]: 
BlockManager
Items: Index(['Unnamed: 0', 'Unnamed: 1'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(0, 2, 1), 2 x 2, dtype: float64

shantanu-gontia · 2019-05-20T15:22:17Z

Also, I think the correct interface requires writing utf-16 instead of utf16. With that spelling, reading the file produces the following DataFrame, which is still wrong because of the incorrect comma.

In [24]: pd.read_csv('afile.csv', encoding='utf-16')                                                                   
Out[24]: 
   Unnamed: 0    0
0           0  foo

WillAyd · 2019-05-20T16:07:51Z

> It appears there are problems reading with UTF-8 encoding as well, however not the same as that of UTF-16, specifically here the indexes aren't being constructed correctly.

This is already a known nuance to roundtripping (see #24468)

shantanu-gontia · 2019-05-24T13:43:36Z

Looking at this again, the encoding looks correct to me.

b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00' breaks down into the following UTF-16-LE code units:

\xff \xfe --> 0xfeff (BOM)
,    \x00 --> 0x002c (comma)
0    \x00 --> 0x0030 (zero)
\n   \x00 --> 0x000a (newline)
0    \x00 --> 0x0030 (zero)
,    \x00 --> 0x002c (comma)
f    \x00 --> 0x0066 (f)
o    \x00 --> 0x006f (o)
o    \x00 --> 0x006f (o)
\n   \x00 --> 0x000a (newline)
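The same breakdown can be reproduced mechanically. This is a small sketch, assuming only the standard library, that splits the bytes into 2-byte little-endian code units; the names raw and units are chosen here for illustration.

```python
# Bytes copied from the original report.
raw = b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00'

# Read the payload two bytes at a time, least significant byte first.
units = [int.from_bytes(raw[i:i + 2], 'little') for i in range(0, len(raw), 2)]

for unit in units:
    # 0xFEFF is the byte order mark; every other unit is an ordinary character.
    label = 'BOM' if unit == 0xFEFF else repr(chr(unit))
    print(f'0x{unit:04x} {label}')
```

The first unit is the BOM, and the remaining nine units spell out the same text that the default encoding writes.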

jbrockmendel · 2019-12-12T23:11:15Z

@WillAyd I'm not clear on what the issue is here:

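import pandas as pd

path = 'afile.csv'
df = pd.DataFrame([['foo']])

# Round trip with an explicit UTF-16 encoding.
df.to_csv(path, encoding="UTF-16")
df2 = pd.read_csv(path, index_col=0, encoding="UTF-16")

# Round trip with the default encoding.
df.to_csv(path)
df3 = pd.read_csv(path, index_col=0)

# Both reads produce the same frame.
assert df2.equals(df3)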

Is the issue that you shouldn't have to pass the encoding on the read_csv?

WillAyd · 2019-12-12T23:19:19Z

The issue is that a comma is placed after the BOM

jbrockmendel · 2019-12-12T23:35:17Z

> The issue is that a comma is placed after the BOM

I'm not understanding why that's a problem. If you call to_csv without an encoding, you end up writing b',0\n0,foo\n', which has a leading comma too.
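A rough sketch of that comparison, using only the standard library's csv module rather than pandas (so it illustrates the file structure, not pandas' own parser). It assumes the two byte strings quoted in this thread; the variable names are made up for the example.

```python
import codecs
import csv
import io

# Output of to_csv without an encoding, as quoted above.
plain = b',0\n0,foo\n'
# Output of to_csv with encoding='utf16', from the original report.
utf16 = b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00'

# The leading comma is an empty header cell for the unnamed index column.
rows = list(csv.reader(io.StringIO(plain.decode('ascii'))))

# Once decoded, the UTF-16 file holds exactly the same text.
same_text = utf16.decode('utf-16') == plain.decode('ascii')

# Prepending the BOM to the little-endian encoding reproduces the UTF-16 bytes.
rebuilt = codecs.BOM_UTF16_LE + plain.decode('ascii').encode('utf-16-le')

print(rows)
print(same_text, rebuilt == utf16)
```

In both files the first field of the header row is empty, so the comma after the BOM is the index column's blank header rather than an extra column.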

WillAyd · 2019-12-13T22:36:49Z

Hmm, OK, maybe not an issue then. Closing for now; we can always come back to it.

WillAyd added the IO Data, IO CSV and Bug labels and removed the IO CSV label May 18, 2019

WillAyd added this to the Contributions Welcome milestone May 18, 2019

WillAyd closed this as completed Dec 13, 2019
