to_csv with UTF16 Incorrectly Treats BOM as column #26446
Comments
Reading with UTF-8 encoding also appears to have a problem, though not the same one as UTF-16: here the index isn't reconstructed correctly. I guess this is how read_csv is supposed to behave for UTF-8 when no options are specified. Will investigate this further.
In [6]: df = pd.DataFrame([['foo']])
In [7]: df
Out[7]:
0
0 foo
In [8]: df.to_csv('utf8file.csv', encoding='utf8')
In [9]: dfr_8 = pd.read_csv('utf8file.csv', encoding='utf8')
In [10]: dfr_8
Out[10]:
Unnamed: 0 0
0 0 foo
In [11]: df.to_csv('utf16file.csv', encoding='utf16')
In [12]: dfr_16 = pd.read_csv('utf16file.csv', encoding='utf16')
In [14]: dfr_16
Out[14]:
Unnamed: 0 Unnamed: 1
0 NaN NaN
1 NaN NaN
In [15]: dfr_8._data
Out[15]:
BlockManager
Items: Index(['Unnamed: 0', '0'], dtype='object')
Axis 1: RangeIndex(start=0, stop=1, step=1)
IntBlock: slice(0, 1, 1), 1 x 1, dtype: int64
ObjectBlock: slice(1, 2, 1), 1 x 1, dtype: object
In [16]: df._data
Out[16]:
BlockManager
Items: RangeIndex(start=0, stop=1, step=1)
Axis 1: RangeIndex(start=0, stop=1, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 1, dtype: object
In [17]: dfr_16._data
Out[17]:
BlockManager
Items: Index(['Unnamed: 0', 'Unnamed: 1'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(0, 2, 1), 2 x 2, dtype: float64
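The `Unnamed: 0` column in the UTF-8 read above comes from the index that `to_csv` writes by default, not from the encoding. A minimal sketch of the two usual ways to get the original shape back, using an in-memory buffer instead of the files from the session (the variable names are illustrative, not from the thread):

```python
import io

import pandas as pd

df = pd.DataFrame([["foo"]])

# Default to_csv() writes the index as a first column with an empty header.
csv_with_index = df.to_csv()

# Reading without options turns that empty header into "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_with_index))

# Option 1: tell read_csv that the first column is the index.
with_index = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the index in the first place.
csv_no_index = df.to_csv(index=False)
no_index = pd.read_csv(io.StringIO(csv_no_index))

print(list(default_read.columns))
print(list(with_index.columns))
print(list(no_index.columns))
```

Note that the column label comes back as the string `'0'` rather than the integer `0` in both cases, since CSV headers carry no type information.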
Also, I think the correct interface requires writing
In [24]: pd.read_csv('afile.csv', encoding='utf-16')
Out[24]:
Unnamed: 0 0
0 0 foo
This is already a known nuance to roundtripping (see #24468)
Looking at this again, the encoding looks correct to me,
@WillAyd I'm not clear on what the issue is here.
Is the issue that you shouldn't have to pass the encoding to read_csv?
The issue is that a comma is placed after the BOM
I'm not understanding why that's a problem. If you do
Hmm OK, maybe not an issue then. Closing for now; can always come back to it.
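For anyone landing here later, the comma after the BOM can be reproduced with the standard library alone. This is only a sketch: `csv_text` is an assumption about what `to_csv` emits for the one-cell frame in this thread (with `\n` line endings), where the leading comma is the empty header of the index column rather than part of the BOM.

```python
import codecs

# Assumed CSV text for pd.DataFrame([['foo']]) written with the default index:
# the header row starts with an empty index label, hence the leading comma.
csv_text = ",0\n0,foo\n"

# The generic "utf-16" codec prepends a BOM in the platform's byte order.
raw = csv_text.encode("utf-16")

bom = raw[:2]
codec = "utf-16-le" if bom == codecs.BOM_UTF16_LE else "utf-16-be"

# The first code unit after the BOM is the comma from the empty index header.
first_char = raw[2:4].decode(codec)

# Decoding with the generic "utf-16" codec consumes the BOM again.
decoded = raw.decode("utf-16")

print(repr(bom), repr(first_char), repr(decoded))
```

So the bytes start with the BOM followed by an encoded `,`, and a reader that is told the encoding sees a normal header row with an empty first field.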
Code Sample, a copy-pastable example if possible
Note the comma after the BOM
'\xff\xfe
- this has the unintended side effect of creating a malformed structure, especially when trying to read the file back in
INSTALLED VERSIONS
commit: 9d5f110
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.25.0.dev0+587.g9d5f1105c
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.7
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.0
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.1
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.18
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.0
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None