Python dependencies: Upgrade pandas and matplotlib, check xlrd #211

mknaranja · 2022-02-16T10:31:12Z

memilio-epidata currently uses quite old pandas and matplotlib versions as of mid-2020:

https://github.com/DLR-SC/memilio/blob/main/pycode/memilio-epidata/setup.py#L71-L72

These should be upgraded to more recent versions.

mknaranja · 2022-02-16T10:31:16Z

@annette-lutz Could you take this one over and check whether our programs still work well with some new(er) versions ?

mknaranja · 2022-02-18T13:23:03Z

In this context, we should have a look a xlrd.

I run into the issue:

python getPopulationData.py
Information: Using new population data file 12411-02-03-4
Traceback (most recent call last):
  File "getPopulationData.py", line 555, in <module>
    main()
  File "getPopulationData.py", line 551, in main
    get_population_data(**arg_dict)
  File "getPopulationData.py", line 287, in get_population_data
    df_pop_raw = gd.loadExcel(
  File "/home/kueh_mj/memilio-main/pycode/memilio-epidata/memilio/epidata/getDataIntoPandasDataFrame.py", line 137, in loadExcel
    df = pd.read_excel(url, **param_dict)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/util/_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_base.py", line 304, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_base.py", line 867, in __init__
    self._reader = self._engines[engine](self._io)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_xlrd.py", line 22, in __init__
    super().__init__(filepath_or_buffer)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_base.py", line 353, in __init__
    self.book = self.load_workbook(filepath_or_buffer)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/pandas-1.1.5-py3.8-linux-x86_64.egg/pandas/io/excel/_xlrd.py", line 37, in load_workbook
    return open_workbook(filepath_or_buffer)
  File "/home/kueh_mj/virtualenv_corona/lib/python3.8/site-packages/xlrd/__init__.py", line 170, in open_workbook
    raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
xlrd.biffh.XLRDError: Excel xlsx file; not supported

with xlrd 2.0.1. However, an old version 1.2.0 worked well.

annette-lutz · 2022-02-22T10:09:42Z

With the more recent pandas version 1.3.5 I get a warning and it uses openpyxl to read the file.

python getPopulationData.py
Information: Using new population data file 12411-02-03-4
/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/openpyxl-3.0.9-py3.8.egg/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default
warn("Workbook contains no default style, apply openpyxl's default")
Information: Data has been written to /localdata1/lutz_an/memilio/data/pydata/Germany/county_current_population_dim401.json
Information: Data has been written to
/localdata1/lutz_an/memilio/data/pydata/Germany/county_current_population.json

Reading the file with another engine is not possible since xlrd supports only .xls files (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html).
If I define the engine to be xlrd I get

python getPopulationData.py
Information: Using new population data file 12411-02-03-4
Traceback (most recent call last):
File "getPopulationData.py", line 555, in
main()
File "getPopulationData.py", line 551, in main
get_population_data(**arg_dict)
File "getPopulationData.py", line 287, in get_population_data
df_pop_raw = gd.loadExcel(
File "/localdata1/lutz_an/memilio/pycode/memilio-epidata/memilio/epidata/getDataIntoPandasDataFrame.py", line 137, in loadExcel
df = pd.read_excel(url, **param_dict)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 364, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 1216, in init
raise ValueError(
ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.

So we can use only openpyxl but it issues a warning and only works for pandas version>=1.3.0. I think the warning is about the wrong defined dimensions of the exel sheet that causes older pandas versions to define the number of lines in the file to be 2 and to read an empty dataframe (see explaination in openpyxls documentation https://openpyxl.readthedocs.io/en/default/optimized.html?highlight=dimension#worksheet-dimensions).

If I use openpyxl with an older pandas version I get

python getPopulationData.py
Information: Using new population data file 12411-02-03-4
/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/openpyxl-3.0.9-py3.8.egg/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default
warn("Workbook contains no default style, apply openpyxl's default")
Traceback (most recent call last):
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2877, in _next_line
line = self._check_comments([self.data[self.pos]])[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2611, in _infer_columns
line = self._next_line()
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2890, in _next_line
raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "getPopulationData.py", line 555, in
main()
File "getPopulationData.py", line 551, in main
get_population_data(**arg_dict)
File "getPopulationData.py", line 287, in get_population_data
df_pop_raw = gd.loadExcel(
File "/localdata1/lutz_an/memilio/pycode/memilio-epidata/memilio/epidata/getDataIntoPandasDataFrame.py", line 137, in loadExcel
df = pd.read_excel(url, **param_dict)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 344, in read_excel
data = io.parse(
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 1153, in parse
return self._reader.parse(
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 532, in parse
parser = TextParser(
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2224, in TextParser
return TextFileReader(*args, **kwds)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 814, in init
self._engine = self._make_engine(self.engine)
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 1045, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2301, in init
) = self._infer_columns()
File "/localdata1/lutz_an/virtual_env_corona/lib/python3.8/site-packages/pandas/io/parsers.py", line 2615, in _infer_columns
raise ValueError(
ValueError: Passed header=4 but only 2 lines in file

A similar issue was reported to pandas (see pandas-dev/pandas#38956).

For pandas version 1.2.0 I tried all engines but could not read the file. For pandas>=1.3.0 the file can only be read with openpyxls and a warning is issued.

mknaranja · 2022-02-22T11:01:07Z

This pandas issue seems to be an issue with empty rows and incorrectly or not set dimensions in the excel sheet. Is it possible to set the dimensions in any way?

From the openpyxl docs:

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly. You can check the apparent dimensions of a worksheet using ws.calculate_dimension(). If this returns a range that you know is incorrect, say A1:A1 then simply resetting the max_row and max_column attributes should allow you to work with the file

Furthermore, since header=N can lead to different results for visually the same file, we should check that the correct row is parsed as header. So for every file, we should verify the header line (first row) after loading the xls(x)

annette-lutz · 2022-02-22T13:10:03Z

The bug is fixed in pandas version 1.2.2, so requirering a pandas version>=1.2.2 solves the problem (see https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.2.html).

mknaranja changed the title ~~Upgrade numpy and matplotlib~~ Upgrade numpy and matplotlib, check xlrd Feb 18, 2022

mknaranja changed the title ~~Upgrade numpy and matplotlib, check xlrd~~ Python dependencies: Upgrade numpy and matplotlib, check xlrd Feb 18, 2022

dabele added the loc::data handling This issue concerns any kind of data handling. label Feb 21, 2022

mknaranja added the prio::moderate The priority of this task is moderate. label Feb 21, 2022

mknaranja changed the title ~~Python dependencies: Upgrade numpy and matplotlib, check xlrd~~ Python dependencies: Upgrade pandas and matplotlib, check xlrd Feb 22, 2022

annette-lutz linked a pull request Feb 23, 2022 that will close this issue

211 Python dependencies #221

Merged

14 tasks

annette-lutz self-assigned this Feb 23, 2022

annette-lutz mentioned this issue Mar 24, 2022

211 Python dependencies #221

Merged

14 tasks

xsaschako closed this as completed in #221 Apr 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python dependencies: Upgrade pandas and matplotlib, check xlrd #211

Python dependencies: Upgrade pandas and matplotlib, check xlrd #211

mknaranja commented Feb 16, 2022 •

edited

Loading

mknaranja commented Feb 16, 2022

mknaranja commented Feb 18, 2022

annette-lutz commented Feb 22, 2022 •

edited

Loading

mknaranja commented Feb 22, 2022

annette-lutz commented Feb 22, 2022

Python dependencies: Upgrade pandas and matplotlib, check xlrd #211

Python dependencies: Upgrade pandas and matplotlib, check xlrd #211

Comments

mknaranja commented Feb 16, 2022 • edited Loading

mknaranja commented Feb 16, 2022

mknaranja commented Feb 18, 2022

annette-lutz commented Feb 22, 2022 • edited Loading

mknaranja commented Feb 22, 2022

annette-lutz commented Feb 22, 2022

mknaranja commented Feb 16, 2022 •

edited

Loading

annette-lutz commented Feb 22, 2022 •

edited

Loading