to_csv with UTF16 Incorrectly Treats BOM as column #26446

Closed
WillAyd opened this issue May 18, 2019 · 8 comments
Labels
Bug; IO Data (IO issues that don't fit into a more specific label)

Comments

@WillAyd
Member

WillAyd commented May 18, 2019

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd
In [2]: df = pd.DataFrame([['foo']])
In [4]: df.to_csv('afile.csv', encoding='utf16')
In [5]: with open('afile.csv', 'rb') as afile:
   ...:     print(afile.read())
   ...:
b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00'

Note the comma after the BOM '\xff\xfe'. This has the unintended side effect of creating a malformed structure, especially when trying to read the file back in.
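
A quick way to inspect what follows the BOM is to decode the bytes with Python's built-in codec. This is only a sketch using the standard library, not pandas' own reader, and the variable names are illustrative.

```python
# Bytes copied from the output above.
raw = b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00'

# The 'utf-16' codec reads the BOM to pick the byte order, then drops it.
text = raw.decode('utf-16')
print(repr(text))

# The first line of the decoded text is the CSV header row.
header = text.splitlines()[0]
print(header.split(','))
```

Splitting the header on the comma shows which fields a CSV reader would see once the BOM has been consumed.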

INSTALLED VERSIONS

commit: 9d5f110
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.25.0.dev0+587.g9d5f1105c
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.7
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.0
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.1
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.18
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.0
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd added the labels IO Data (IO issues that don't fit into a more specific label), IO CSV (read_csv, to_csv) and Bug, and removed the IO CSV (read_csv, to_csv) label on May 18, 2019
@WillAyd added this to the Contributions Welcome milestone on May 18, 2019
@shantanu-gontia
Contributor

shantanu-gontia commented May 20, 2019

It appears there are problems reading with UTF-8 encoding as well, however not the same as that of UTF-16, specifically here the indexes aren't being constructed correctly. But I guess this is how it is supposed to read in UTF-8 without any options specified. I will investigate this further.

In [6]: df = pd.DataFrame([['foo']])

In [7]: df
Out[7]:
     0
0  foo

In [8]: df.to_csv('utf8file.csv', encoding='utf8')

In [9]: dfr_8 = pd.read_csv('utf8file.csv', encoding='utf8')

In [10]: dfr_8
Out[10]:
   Unnamed: 0    0
0           0  foo

In [11]: df.to_csv('utf16file.csv', encoding='utf16')

In [12]: dfr_16 = pd.read_csv('utf16file.csv', encoding='utf16')

In [14]: dfr_16
Out[14]:
   Unnamed: 0  Unnamed: 1
0         NaN         NaN
1         NaN         NaN

In [15]: dfr_8._data
Out[15]:
BlockManager
Items: Index(['Unnamed: 0', '0'], dtype='object')
Axis 1: RangeIndex(start=0, stop=1, step=1)
IntBlock: slice(0, 1, 1), 1 x 1, dtype: int64
ObjectBlock: slice(1, 2, 1), 1 x 1, dtype: object

In [16]: df._data
Out[16]:
BlockManager
Items: RangeIndex(start=0, stop=1, step=1)
Axis 1: RangeIndex(start=0, stop=1, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 1, dtype: object

In [17]: dfr_16._data
Out[17]:
BlockManager
Items: Index(['Unnamed: 0', 'Unnamed: 1'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(0, 2, 1), 2 x 2, dtype: float64

@shantanu-gontia
Contributor

Also, I think the correct interface requires writing utf-16 instead of utf16. That actually produces the following DataFrame, which is still wrong because of the incorrect comma.

In [24]: pd.read_csv('afile.csv', encoding='utf-16')
Out[24]: 
   Unnamed: 0    0
0           0  foo
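
As a side check on the two spellings, here is a minimal sketch that asks Python's own codec registry about them. It uses only the standard library and says nothing about how read_csv treated the encoding string in this pandas version; the variable names are illustrative.

```python
import codecs

# Bytes from the original report.
raw = b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00'

# Look up both spellings in the standard-library codec registry.
name_a = codecs.lookup('utf16').name
name_b = codecs.lookup('utf-16').name
print(name_a, name_b)

# Decode the same bytes with each spelling and compare the results.
text_a = raw.decode('utf16')
text_b = raw.decode('utf-16')
print(text_a == text_b)
```

If both lookups report the same codec, any difference between the two read_csv results would have to come from pandas' handling of the string rather than from Python's codecs.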

@WillAyd
Member Author

WillAyd commented May 20, 2019

It appears there are problems reading with UTF-8 encoding as well, however not the same as that of UTF-16, specifically here the indexes aren't being constructed correctly.

This is already a known nuance to roundtripping (see #24468)
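
A minimal sketch of that roundtrip nuance, assuming a default to_csv call and an in-memory buffer in place of a file (the buffer and variable names are illustrative, not from the thread):

```python
import io

import pandas as pd

df = pd.DataFrame([['foo']])

# By default to_csv writes the index as a first column with an empty header.
buf = io.StringIO()
df.to_csv(buf)

# Reading back without options turns that column into 'Unnamed: 0'.
buf.seek(0)
naive = pd.read_csv(buf)
print(list(naive.columns))

# Passing index_col=0 tells read_csv to use the first column as the index.
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)
print(list(restored.columns), list(restored.index))
```

Note that the column label comes back as the string '0' rather than the integer 0, so the result is still not identical to the original frame.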

@shantanu-gontia
Contributor

shantanu-gontia commented May 24, 2019

Looking at this again, the encoding looks correct to me.

b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00' breaks down into the following UTF-16-LE code units:

\xff \xfe --> 0xfeff (BOM)
,    \x00 --> 0x002c (comma)
0    \x00 --> 0x0030 (zero)
\n   \x00 --> 0x000a (newline)
0    \x00 --> 0x0030 (zero)
,    \x00 --> 0x002c (comma)
f    \x00 --> 0x0066 (f)
o    \x00 --> 0x006f (o)
o    \x00 --> 0x006f (o)
\n   \x00 --> 0x000a (newline)
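
The same breakdown can be reproduced programmatically. This is a small sketch using the standard-library struct module; the names are illustrative.

```python
import struct

# Bytes from the original report.
raw = b'\xff\xfe,\x000\x00\n\x000\x00,\x00f\x00o\x00o\x00\n\x00'

# Each UTF-16-LE code unit is two bytes, least significant byte first.
count = len(raw) // 2
units = struct.unpack('<%dH' % count, raw)

for unit in units:
    label = 'BOM' if unit == 0xFEFF else repr(chr(unit))
    print('0x%04x  %s' % (unit, label))
```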

@jbrockmendel
Member

@WillAyd I'm not clear on what the issue is here:

path = 'afile.csv'
df = pd.DataFrame([['foo']])
df.to_csv(path, encoding="UTF-16")

df2 = pd.read_csv(path, index_col=0, encoding="UTF-16")

df.to_csv(path)
df3 = pd.read_csv(path, index_col=0)
assert df2.equals(df3)

Is the issue that you shouldn't have to pass the encoding to read_csv?

@WillAyd
Member Author

WillAyd commented Dec 12, 2019

The issue is that a comma is placed after the BOM

@jbrockmendel
Member

The issue is that a comma is placed after the BOM

I'm not understanding why that's a problem. If you do to_csv without an encoding, you end up writing b',0\n0,foo\n', which has a leading comma too.
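
To make the comparison concrete, here is a sketch using only the standard library. It parses the unencoded text quoted above and then encodes it as little-endian UTF-16 behind a BOM; the variable names are illustrative.

```python
import codecs
import csv
import io

# Text that to_csv writes without an encoding, per the comment above.
plain = ',0\n0,foo\n'

# Parse it as CSV: the leading comma yields an empty first header cell.
rows = list(csv.reader(io.StringIO(plain)))
print(rows)

# Encode the same text as UTF-16-LE and put the BOM in front.
utf16 = codecs.BOM_UTF16_LE + plain.encode('utf-16-le')
print(utf16)
```

The resulting bytes match the output in the original report, so the comma after the BOM is the same leading comma as in the unencoded file.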

@WillAyd
Member Author

WillAyd commented Dec 13, 2019

Hmm, OK, maybe not an issue then. Closing for now; we can always come back to it.

@WillAyd WillAyd closed this as completed Dec 13, 2019

3 participants