Skip to content

Python dependencies: Upgrade pandas and matplotlib, check xlrd #211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mknaranja opened this issue Feb 16, 2022 · 5 comments · Fixed by #221
Closed

Python dependencies: Upgrade pandas and matplotlib, check xlrd #211

mknaranja opened this issue Feb 16, 2022 · 5 comments · Fixed by #221
Assignees
Labels
loc::data handling This issue concerns any kind of data handling. prio::moderate The priority of this task is moderate.

Comments

@mknaranja
Copy link
Member

mknaranja commented Feb 16, 2022

memilio-epidata currently uses quite old pandas and matplotlib versions as of mid-2020:

https://github.com/DLR-SC/memilio/blob/main/pycode/memilio-epidata/setup.py#L71-L72

These should be upgraded to more recent versions.

@mknaranja
Copy link
Member Author

@annette-lutz Could you take this one over and check whether our programs still work well with some new(er) versions ?

@mknaranja
Copy link
Member Author

In this context, we should have a look a xlrd.

I run into the issue:

python getPopulationData.py
Information: Using new population data file 12411-02-03-4
Traceback (most recent call last):
  File "getPopulationData.py", line 555, in <module>
    main()
  File "getPopulationData.py", line 551, in main
    get_population_data(**arg_dict)
  File "getPopulationData.py", line 287, in get_population_data
    df_pop_raw = gd.loadExcel(
  File "/home/kueh_mj/memilio-main/pycode/memilio-epidata/memilio/epidata/getDataIntoPandasDataFrame.py", line 137, in loadExcel
    df = pd.read_excel(url, **param_dict)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/util/_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_base.py", line 304, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_base.py", line 867, in __init__
    self._reader = self._engines[engine](self._io)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_xlrd.py", line 22, in __init__
    super().__init__(filepath_or_buffer)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_base.py", line 353, in __init__
    self.book = self.load_workbook(filepath_or_buffer)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_xlrd.py", line 37, in load_workbook
    return open_workbook(filepath_or_buffer)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/xlrd/__init__.py", line 170, in open_workbook
    raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
xlrd.biffh.XLRDError: Excel xlsx file; not supported

with xlrd 2.0.1. However, an old version 1.2.0 worked well.

@mknaranja mknaranja changed the title Upgrade numpy and matplotlib Upgrade numpy and matplotlib, check xlrd Feb 18, 2022
@mknaranja mknaranja changed the title Upgrade numpy and matplotlib, check xlrd Python dependencies: Upgrade numpy and matplotlib, check xlrd Feb 18, 2022
@dabele dabele added the loc::data handling This issue concerns any kind of data handling. label Feb 21, 2022
@mknaranja mknaranja added the prio::moderate The priority of this task is moderate. label Feb 21, 2022
@annette-lutz
Copy link
Contributor

annette-lutz commented Feb 22, 2022

With the more recent pandas version 1.3.5 I get a warning and it uses openpyxl to read the file.

python getPopulationData.py
Information: Using new population data file 12411-02-03-4
/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/openpyxl-3.0.9-py3.8.egg/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default
warn("Workbook contains no default style, apply openpyxl's default")
Information: Data has been written to /localdata1/lutz_an/memilio/data/pydata/Germany/county_current_population_dim401.json
Information: Data has been written to
/localdata1/lutz_an/memilio/data/pydata/Germany/county_current_population.json

Reading the file with another engine is not possible since xlrd supports only .xls files (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html).
If I define the engine to be xlrd I get

python getPopulationData.py
Information: Using new population data file 12411-02-03-4
Traceback (most recent call last):
File "getPopulationData.py", line 555, in
main()
File "getPopulationData.py", line 551, in main
get_population_data(**arg_dict)
File "getPopulationData.py", line 287, in get_population_data
df_pop_raw = gd.loadExcel(
File "/localdata1/lutz_an/memilio/pycode/memilio-epidata/memilio/epidata/getDataIntoPandasDataFrame.py", line 137, in loadExcel
df = pd.read_excel(url, **param_dict)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 364, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 1216, in init
raise ValueError(
ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.

So we can use only openpyxl but it issues a warning and only works for pandas version>=1.3.0. I think the warning is about the wrong defined dimensions of the exel sheet that causes older pandas versions to define the number of lines in the file to be 2 and to read an empty dataframe (see explaination in openpyxls documentation https://openpyxl.readthedocs.io/en/default/optimized.html?highlight=dimension#worksheet-dimensions).

If I use openpyxl with an older pandas version I get

python getPopulationData.py
Information: Using new population data file 12411-02-03-4
/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/openpyxl-3.0.9-py3.8.egg/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default
warn("Workbook contains no default style, apply openpyxl's default")
Traceback (most recent call last):
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2877, in _next_line
line = self._check_comments([self.data[self.pos]])[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2611, in _infer_columns
line = self._next_line()
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2890, in _next_line
raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "getPopulationData.py", line 555, in
main()
File "getPopulationData.py", line 551, in main
get_population_data(**arg_dict)
File "getPopulationData.py", line 287, in get_population_data
df_pop_raw = gd.loadExcel(
File "/localdata1/lutz_an/memilio/pycode/memilio-epidata/memilio/epidata/getDataIntoPandasDataFrame.py", line 137, in loadExcel
df = pd.read_excel(url, **param_dict)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 344, in read_excel
data = io.parse(
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 1153, in parse
return self._reader.parse(
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 532, in parse
parser = TextParser(
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2224, in TextParser
return TextFileReader(*args, **kwds)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 814, in init
self._engine = self._make_engine(self.engine)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 1045, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2301, in init
) = self._infer_columns()
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2615, in _infer_columns
raise ValueError(
ValueError: Passed header=4 but only 2 lines in file

A similar issue was reported to pandas (see pandas-dev/pandas#38956).

For pandas version 1.2.0 I tried all engines but could not read the file. For pandas>=1.3.0 the file can only be read with openpyxls and a warning is issued.

@mknaranja mknaranja changed the title Python dependencies: Upgrade numpy and matplotlib, check xlrd Python dependencies: Upgrade pandas and matplotlib, check xlrd Feb 22, 2022
@mknaranja
Copy link
Member Author

This pandas issue seems to be an issue with empty rows and incorrectly or not set dimensions in the excel sheet. Is it possible to set the dimensions in any way?

From the openpyxl docs:

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly. You can check the apparent dimensions of a worksheet using ws.calculate_dimension(). If this returns a range that you know is incorrect, say A1:A1 then simply resetting the max_row and max_column attributes should allow you to work with the file

Furthermore, since header=N can lead to different results for visually the same file, we should check that the correct row is parsed as header. So for every file, we should verify the header line (first row) after loading the xls(x)

@annette-lutz
Copy link
Contributor

The bug is fixed in pandas version 1.2.2, so requirering a pandas version>=1.2.2 solves the problem (see https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.2.html).

@annette-lutz annette-lutz linked a pull request Feb 23, 2022 that will close this issue
14 tasks
@annette-lutz annette-lutz self-assigned this Feb 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
loc::data handling This issue concerns any kind of data handling. prio::moderate The priority of this task is moderate.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants