
read_excel with dtype=str converts empty cells to the string 'nan' #20377


Closed
arnau126 opened this issue Mar 16, 2018 · 17 comments · Fixed by #23162
Labels
IO Data (IO issues that don't fit into a more specific label) · IO Excel (read_excel, to_excel) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@arnau126

Code Sample, a copy-pastable example if possible

In [10]: df = pd.DataFrame({'a': ['x', 'y', '', 'z']})

In [11]: df
Out[11]: 
   a
0  x
1  y
2   
3  z

In [12]: df.to_excel('temp.xlsx')

In [13]: df = pd.read_excel('temp.xlsx', dtype=str)

In [14]: df
Out[14]: 
     a
0    x
1    y
2  nan
3    z

In [15]: df.loc[2, 'a']
Out[15]: 'nan'

In [16]: type(df.loc[2, 'a'])
Out[16]: str

Problem description

The empty string of the original dataframe becomes the string 'nan', instead of numpy.nan.

Expected Output

In [14]: df
Out[14]: 
     a
0    x
1    y
2  NaN
3    z

In [15]: df.loc[2, 'a']
Out[15]: nan

In [16]: type(df.loc[2, 'a'])
Out[16]: float

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-36-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: ca_ES.UTF-8
LOCALE: ca_ES.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger added the IO Data, Missing-data, IO Excel, and Difficulty Intermediate labels on Mar 16, 2018
@TomAugspurger
Contributor

Thanks for the report. Mind taking a look to see what's going on?

@arnau126
Author

arnau126 commented Mar 16, 2018

I've found the function that converts np.nan to 'nan': pandas.core.dtypes.cast.astype_nansafe.

read_excel uses PythonParser. This parser has a function called self._cast_types, which uses astype_nansafe.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L1570

An example of astype_nansafe:

In [4]: from pandas.core.dtypes.cast import astype_nansafe

In [5]: arr = np.array([np.nan, 'a', 'b'], object)

In [6]: arr
Out[6]: array([nan, 'a', 'b'], dtype=object)

In [7]: astype_nansafe(arr, str)
Out[7]: array(['nan', 'a', 'b'], dtype=object)

I don't know if it's the expected behavior or not.
Should we modify astype_nansafe, or _cast_types?

I'm not sure how to fix the issue.

@cbertinato
Contributor

cbertinato commented Mar 17, 2018

Is that what one should expect when using dtype=str? Not specifying dtype produces the expected result. The default value is None and dtype conversion is done by row.

In [16]: df = pd.read_excel('temp.xlsx')

In [17]: df
Out[17]:
     a
0    x
1    y
2  NaN
3    z

In [18]: df.loc[2, 'a']
Out[18]: nan

In [19]: type(df.loc[2, 'a'])
Out[19]: float

In [20]: type(df.loc[1, 'a'])
Out[20]: str

I think that the output for dtype=str is expected behavior.

@nikoskaragiannakis
Contributor

read_csv returns what you would expect

from io import StringIO
import pandas as pd

arr = pd.read_csv(StringIO('col1,col2,col3\n,a,b'), dtype=str)

it will give you

  col1  col2  col3
0  NaN     a     b

So what you expect from read_excel is consistent with read_csv's behavior.

I guess, if this needs to be changed, the best would be to change astype_unicode in pandas/_libs/lib.pyx.
What do you think?

@nikoskaragiannakis
Contributor

@cbertinato just saw your comment! Good point.

@arnau126
Author

arnau126 commented Mar 19, 2018

@cbertinato, if I have a column with numbers but I want to read them as strings, I need to specify dtype=str; otherwise they are read as float.

In [1]: df = pd.DataFrame({'a': ['1', '2', '', '3']})

In [2]: df.to_excel('temp.xlsx')

In [3]: pd.read_excel('temp.xlsx', dtype=str)
Out[3]: 
     a
0    1
1    2
2  nan
3    3

In [5]: pd.read_excel('temp.xlsx')
Out[5]: 
     a
0  1.0
1  2.0
2  NaN
3  3.0

Moreover, as @nikoskaragiannakis says, it should be consistent with read_csv.
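(A possible user-side workaround, sketched here outside the thread's proposed fix: read with default dtypes and cast only the non-missing values to str, so empty cells stay np.nan. The DataFrame below stands in for a read_excel result.)

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel('temp.xlsx'): numbers come back as float,
# with np.nan for the empty cell.
df = pd.DataFrame({'a': [1.0, 2.0, np.nan, 3.0]})

# Cast non-missing values to str; mask() restores np.nan where the
# original value was missing.
as_str = df['a'].astype(str).mask(df['a'].isna())
```

Note that the numbers are stringified from their float form ('1.0' rather than '1'), which is exactly why the reporter wants dtype=str honored at read time instead.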

@cbertinato
Contributor

I missed the use case for numbers, and @nikoskaragiannakis's comment. It should probably be consistent with read_csv.

@nikoskaragiannakis
Contributor

nikoskaragiannakis commented Mar 19, 2018

@cbertinato if the case is as @arnau126 says, then I can make a fix for this.

@nikoskaragiannakis
Contributor

nikoskaragiannakis commented Mar 20, 2018

Question: when we use dtype=str, do we want np.nan to turn into the string "nan", or to stay as it is?
I made a fix for this issue, but the above functionality is affected.

@arnau126
Author

The purpose of this issue is precisely to change that behavior: currently np.nan turns into the string 'nan', and I think it should stay as np.nan.

@cbertinato
Contributor

As @arnau126 points out, the result from pd.read_excel with dtype=str is inconsistent with that from pd.read_csv. The value of placing np.nan instead of its string representation is that you can use pd.isna, which does not work for 'nan'. So the thought is to make read_excel consistent with read_csv.
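(A minimal illustration of that distinction: pd.isna recognizes the real missing value but treats the string 'nan' as ordinary data.)

```python
import numpy as np
import pandas as pd

pd.isna(np.nan)   # True: a real missing value
pd.isna('nan')    # False: just a three-character string
```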

@nikoskaragiannakis
Contributor

@cbertinato @arnau126 maybe my question wasn't very clear.
I suggested changing astype_unicode and astype_str (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L460-L483), which would mean that 'nan' would always become np.nan.

If we only want to change the read_excel function, then we could just substitute every 'nan' string with np.nan before returning the result to the user. Not very elegant, but it would only affect read_excel.

Which one do we want here?

Sorry if I misunderstood what you guys are saying.

@cbertinato
Contributor

cbertinato commented Mar 20, 2018

I think that the latter would be safer and less likely to break something else. This is clearly only an issue for read_excel and not read_csv.
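(On an affected pandas version, the same substitution can be done by the user after reading; a sketch, with the DataFrame standing in for a read_excel(..., dtype=str) result:)

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel('temp.xlsx', dtype=str) on an affected
# version, where the empty cell came back as the literal string 'nan'.
df = pd.DataFrame({'a': ['x', 'y', 'nan', 'z']})

# Exact-match replacement turns the spurious 'nan' strings back into np.nan.
fixed = df.replace('nan', np.nan)
```

The obvious caveat is that this also clobbers any cell that legitimately contained the text "nan", which is why fixing it inside read_excel is preferable.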

@nikoskaragiannakis
Contributor

It is clear now, thanks. I'll do the changes.

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018
nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018
@jreback jreback added this to the 0.23.0 milestone Apr 5, 2018
nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 8, 2018
@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 15, 2018
@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Oct 15, 2018
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 16, 2018
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 17, 2018
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 18, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
@yoonghm

yoonghm commented Dec 11, 2018

Treating an empty value in Excel as NaN has another side effect: an integer column gets converted to float. Subsequent operations on that column can then misbehave in other ways.
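(The float upcast can be demonstrated directly; the nullable Int64 extension dtype, available since pandas 0.24, is one way to keep integers alongside missing values.)

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 3])
s.dtype            # float64: np.nan forces the upcast

s_nullable = pd.Series([1, 2, np.nan, 3], dtype='Int64')
s_nullable.dtype   # Int64: values stay integers, the hole becomes pd.NA
```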

Also, read_excel() does not honor the empty-value handling offered by the functions passed in converters:

I have an Excel file temp.xlsx with the following data:

Key1,Key2,Key3,Key4
0,11.0,  Apple  ,1.12
1,12.0,,1.02
2,13.0,  Orange,
3,,  Banana  ,0.01

This is the code:

import numpy as np
import pandas as pd

def handle_string(value):
    return value.replace(' ', '')

def handle_integer(value):
    if value == '':
        return 0
    else:
        int(value)

def handle_float(value):
    if value == '':
        return 0.0
    else:
        float(value)

df = pd.read_excel(
        'temp.xlsx',
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")

print('')

df = pd.read_excel(
        'temp.xlsx',
        converters={
            'Key1' : handle_integer,
            'Key2' : handle_integer,
            'Key3' : handle_string,
            'Key4' : handle_float,
        }
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")

The output:

   Key1  Key2    Key3  Key4
0     0  11.0   Apple  1.12
1     1  12.0     NaN  1.02
2     2  13.0  Orange   NaN
3     3   NaN  Banana  0.01
type(df.loc[3,'Key2']) = <class 'numpy.float64'>
type(df.loc[1,'Key3']) = <class 'float'>
type(df.loc[2,'Key4']) = <class 'numpy.float64'>

   Key1  Key2    Key3  Key4
0  None   NaN   Apple   NaN
1  None   NaN     NaN   NaN
2  None   NaN  Orange   0.0
3  None   0.0  Banana   NaN
type(df.loc[3,'Key2']) = <class 'numpy.float64'>
type(df.loc[1,'Key3']) = <class 'float'>
type(df.loc[2,'Key4']) = <class 'numpy.float64'>

@daveknave

Hello everyone, I think I have encountered the same issue while using read_fwf: NaNs are converted to the string 'nan' rather than np.nan when dtype=str is set.

It would be great if the behavior were consistent with read_csv for this one as well. Since it seems to be exactly this issue, I didn't want to open a new one.
I am using Python 3.6 with pandas 0.25.0.

Best David

@ptth222

ptth222 commented Apr 20, 2022

This change ended up breaking some code for me. It was a simple enough fix, and the implementation should have been better in the first place, but it still took me a whole day to track it down.

I am commenting to say that I feel like read_csv should have been changed to be like read_excel instead of the other way around. I find it rather confusing that when you tell the function to give you str data type it actually gives you a mix of str and float (for NaN). My expectation was that everything would be of str type since that is what I asked for. I understand some of the history and reasoning for why it does what it does, but I still think that if you tell a function to return a certain type that's what it should return.

There seem to be some recent developments on this described here: https://pandas.pydata.org/docs/user_guide/text.html. Once this is more stable I think it would make sense for dtype=str and dtype="string" to have the same functionality.
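(The nullable string dtype referred to above, available since pandas 1.0, gives roughly that behavior: every present value is a string, and missing values are pd.NA, which pd.isna recognizes. A small sketch:)

```python
import numpy as np
import pandas as pd

s = pd.Series(['x', 'y', np.nan, 'z'], dtype='string')
s.isna().tolist()   # [False, False, True, False]: the hole is pd.NA, not 'nan'
```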
