
Openpyxl22 #11144


Closed


Themanwithoutaplan
Contributor

closes #10125

I've added preliminary tests for openpyxl >= 2.2 styles. Unfortunately, I don't know how to get a whole test class to be skipped, so for now the tests will only run with either the Openpyxl20 or the Openpyxl22 test class disabled, depending on which version is installed. I hope this can be fixed fairly easily. Some code somewhere is still calling openpyxl.styles.Style(), which needs changing. The aggregate Style object is an irrelevancy in client code.

@sinhrks
Member

sinhrks commented Sep 19, 2015

I don't know how to get a whole test class to be skipped

You can refer to this, which can be used as a decorator for a test class.
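
The helper linked above is not preserved in this page, so the following is only a minimal sketch of the general idea using the standard library's `unittest.skipIf` and `unittest.skipUnless` as class decorators. `INSTALLED_RELEASE` and both test class bodies are hypothetical stand-ins; real code would derive the release from the installed openpyxl.

```python
import unittest

# Hypothetical stand-in for the installed openpyxl release as a tuple;
# real code would derive this from openpyxl.__version__.
INSTALLED_RELEASE = (2, 2)

IS_22_OR_LATER = INSTALLED_RELEASE >= (2, 2)


# Decorating the class skips every test method it contains.
@unittest.skipIf(IS_22_OR_LATER, "requires openpyxl < 2.2")
class Openpyxl20Tests(unittest.TestCase):
    def test_old_style_api(self):
        self.assertTrue(True)


@unittest.skipUnless(IS_22_OR_LATER, "requires openpyxl >= 2.2")
class Openpyxl22Tests(unittest.TestCase):
    def test_new_style_api(self):
        self.assertTrue(True)
```

With this arrangement exactly one of the two classes runs for any installed version, and the other is reported as skipped rather than failing.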

@jreback jreback added the IO Excel read_excel, to_excel label Sep 19, 2015
if isinstance(cell.val, datetime.datetime):
    xcell.number_format = self.datetime_format

elif isinstance(cell.val, datetime.date):
Contributor

this would almost never be true, as internally we only have Timestamp (which is a subclass of datetime.datetime). prob what you want is something like core.format._is_dates_only (which you should only call on an entire array/column, before iterating over the cells)
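
To illustrate the suggestion, here is a rough standard-library sketch of choosing the number format once per column instead of per cell. It is not pandas' `_is_dates_only`; the function names and the two format strings are placeholders chosen for the example.

```python
import datetime


def is_dates_only(values):
    """Return True if every non-missing value is a datetime at midnight."""
    midnight = datetime.time(0, 0)
    seen = False
    for value in values:
        if value is None:
            continue  # treat None as a missing value
        if not isinstance(value, datetime.datetime) or value.time() != midnight:
            return False
        seen = True
    return seen


def pick_number_format(values,
                       date_format="YYYY-MM-DD",
                       datetime_format="YYYY-MM-DD HH:MM:SS"):
    """Choose one format for the whole column before writing any cell."""
    return date_format if is_dates_only(values) else datetime_format
```

Note also that `datetime.datetime` is itself a subclass of `datetime.date`, so an `isinstance(value, datetime.date)` check on its own does not tell the two apart.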

Contributor Author

The code has been kept around from the previous implementation. openpyxl itself will automatically assign date and time formatting to relevant objects. The preferred method for adding a Pandas DataFrame to an openpyxl worksheet is described in this snippet: https://bitbucket.org/snippets/openpyxl/jgbak (I hope to add support for NumPy types soon, so that only the conversion from a DataFrame to lists of lists will be required).

I haven't used that here because it doesn't look like a Dataframe is being passed in but rather some kind of cell collection abstraction: offsets and styling.

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Sep 21, 2015
@jreback
Contributor

jreback commented Sep 21, 2015

Include this: Themanwithoutaplan@e4e02a8

for testing on travis

@jreback jreback added this to the 0.17.0 milestone Sep 21, 2015
@jreback
Contributor

jreback commented Sep 21, 2015

@sinhrks @TomAugspurger @chris-b1
can u guys give this a try
thxs

@Themanwithoutaplan
Contributor Author

It's worth noting that this implementation is significantly slower than simply streaming the data to a worksheet. This is probably related to the cell abstraction.

@chris-b1
Contributor

I tried a handful of examples, all of which seemed to work fine.

Maybe it's a separate PR, and I don't think it's a new issue, but could performance be improved like you suggested? It gets pretty slow for even medium-size frames. The cell abstraction itself is actually pretty light-weight, but maybe needs to be refactored a bit for whatever openpyxl works best with?

In [1]: df = pd.DataFrame({'a': np.linspace(0, 100, 20000),
                           'b': range(20000),
                           'c': pd.date_range('1900-1-1', periods=20000)})

In [3]: %timeit df.to_excel('xlwswriter.xlsx', engine='xlsxwriter')
1 loops, best of 3: 2.42 s per loop

In [4]: %timeit df.to_excel('openpy.xlsx', engine='openpyxl')
1 loops, best of 3: 10.4 s per loop

Looks like a lot of time spent on the styles?

In [9]: %prun df.to_excel('openpy.xlsx', engine='openpyxl')

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1380478 1.068   0.000   1.579   0.000 base.py:31(__set__)
2880368 1.041   0.000   1.327   0.000 style.py:98(__iter__)
940344  0.783   0.000   1.856   0.000 base.py:45(__set__)
80003   0.604   0.000   3.694   0.000 lxml_worksheet.py:64(write_cell)
120012  0.570   0.000   1.422   0.000 style.py:108(__eq__)
480085/360067   0.569   0.000   1.049   0.000 hashable.py:55(key)
4600889/3520831 0.559   0.000   1.858   0.000 {getattr}
2441079 0.412   0.000   0.412   0.000 {isinstance}
1580529 0.394   0.000   0.394   0.000 base.py:20(__set__)
1   0.351   0.351   8.493   8.493 excel.py:1186(write_cells)
420198  0.324   0.000   1.205   0.000 base.py:143(__set__)
60000/40000 0.301   0.000   1.653   0.000 functools.py:105(wrapper)
1   0.267   0.267   6.991   6.991 lxml_worksheet.py:42(write_rows)
20000   0.267   0.000   0.267   0.000 datetime.py:72(time_to_days)
120019  0.266   0.000   0.770   0.000 style.py:114(__hash__)
60009   0.262   0.000   1.478   0.000 style.py:46(__init__)
80001   0.219   0.000   0.357   0.000 format.py:1770(_format_regular_rows)
80012   0.217   0.000   0.640   0.000 excel.py:1032(_convert_to_side)
60008   0.216   0.000   4.077   0.000 styleable.py:81(style_id)
1080203 0.212   0.000   0.364   0.000 hashable.py:59(<genexpr>)
160024  0.202   0.000   3.745   0.000 indexed_list.py:45(add)
80003   0.178   0.000   0.178   0.000 cell.py:111(__init__)
20000   0.174   0.000   0.245   0.000 jdcal.py:203(jd2gcal)
120000  0.171   0.000   0.236   0.000 threading.py:147(acquire)
20003   0.164   0.000   1.061   0.000 fonts.py:77(__init__)
160024  0.158   0.000   2.425   0.000 indexed_list.py:40(append)
180010  0.154   0.000   0.154   0.000 {method 'element' of 'lxml.etree._IncrementalFileWriter' objects}
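
For readers without IPython, a comparable report can be produced with the standard library. The sketch below is only an approximation of what `%prun` prints; `profile_call` is a hypothetical helper name, and sorting by `tottime` is chosen to match the listing above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, limit=10, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by time spent inside each function itself, then show the
    # top `limit` entries.
    stats.sort_stats("tottime").print_stats(limit)
    return result, buffer.getvalue()
```

Something like `profile_call(df.to_excel, 'openpy.xlsx', engine='openpyxl')` would then profile the same call as in the session above.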

@Themanwithoutaplan
Contributor Author

Yeah, styles are a pain if they're being applied individually to cells but that shouldn't be the case for that dataset. They get decomposed to their constituents which get hashed to remove duplicates but they shouldn't be called when just setting datetimes.
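
As a rough illustration of "hashed to remove duplicates", the sketch below stores each distinct style constituent once and hands back its position. It is a simplified stand-in, not openpyxl's actual code; `Font` is a hypothetical hashable record used only for the example.

```python
from collections import namedtuple

# Hypothetical, hashable stand-in for one style constituent.
Font = namedtuple("Font", "name size bold")


class IndexedList:
    """Append-only list that keeps each distinct value exactly once."""

    def __init__(self):
        self._items = []
        self._index = {}  # value -> position in _items

    def add(self, value):
        """Return the position of value, appending it only if unseen."""
        try:
            return self._index[value]
        except KeyError:
            position = len(self._items)
            self._items.append(value)
            self._index[value] = position
            return position

    def __len__(self):
        return len(self._items)

    def __getitem__(self, position):
        return self._items[position]
```

Each `add` still has to hash its argument, which is consistent with `__hash__` and `__eq__` appearing prominently in the profile when a fresh style object is built for every cell.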

Excel worksheets are row-oriented, which is why append() is the standard call for bulk inserts. We've harmonised the API across the implementations, but this means you'd have to pass in WriteOnlyCells among the values for anything that needs styling in order to use the streaming mode.

These are times on my machine using append() with openpyxl 2.3-b2

xlsxwriter 6.116043s
openpyxl 18.04805s
openpyxl direct 5.844062s

It would be slightly slower if it was keeping cells in memory.

@jreback
Contributor

jreback commented Sep 21, 2015

what is openpyxl direct ?

can you add a whatsnew note (e.g. saying which versions people should use). Further, let's update install.rst and io.rst with the recommendations (e.g. a warning box in the Excel section of io.rst; you can just list the versions in install.rst)

@Themanwithoutaplan
Contributor Author

openpyxl direct is how the snippet does it, based on the code you showed me at PyCon: just pass rows of data (so a Dataframe just needs expanding and transposing).
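
The snippet itself is not reproduced here, so what follows is only a sketch of the "expand and transpose" idea under simplifying assumptions: the frame is represented as a plain dict of equal-length columns, and `sheet` is anything with an `append(row)` method. Both helper names are hypothetical.

```python
def columns_to_rows(columns):
    """Yield a header row, then one list per record, from a dict of columns."""
    names = list(columns)
    yield names
    # zip(*columns) transposes column-wise data into row-wise records.
    for record in zip(*(columns[name] for name in names)):
        yield list(record)


def stream_to_sheet(sheet, columns):
    """Append every row to sheet and return the number of rows written."""
    count = 0
    for row in columns_to_rows(columns):
        sheet.append(row)
        count += 1
    return count
```

A plain list works as a stand-in sheet for testing. With openpyxl, `sheet` could be a worksheet created from `Workbook(write_only=True).create_sheet()` in recent releases, followed by saving the workbook.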

I'd personally advise against any version of openpyxl < 2.2. I hope to add support for NumPy types in the soon to be released 2.3.

2.0 & 2.1 make sense if you're editing a worksheet in place because they avoid the side-effects when working with styles. Otherwise the guards, used to avoid side-effects when working with styles, exact a huge penalty on performance.

@Themanwithoutaplan
Contributor Author

I'd also really appreciate a standalone version of the "cells" generator to work with and test in openpyxl so that the responsibility of the API is with openpyxl.

@jreback
Contributor

jreback commented Sep 21, 2015

@Themanwithoutaplan you can put whatever you think is best for users. I think we have something pointing people to the 1.x series; now you can just say use >= 2.2 for styles.

not sure what a 'standalone version of the cells' generator is?

@Themanwithoutaplan
Contributor Author

Just for testing purposes: whatever gets passed into the write_cells method.

@chris-b1
Contributor

This is the formatter, which is relatively standalone. IIRC xlsxwriter could also benefit from cells being yielded in a row-oriented manner, so it probably makes sense to change.

https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1615
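
One way to picture a row-oriented view of a cell stream is the sketch below. It assumes a hypothetical `Cell` record with zero-based `row` and `col` plus a `val`, which only loosely mirrors what the formatter yields, and it sorts the whole stream first, so it is not a streaming solution.

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

# Hypothetical cell record; the real formatter's cells also carry styling.
Cell = namedtuple("Cell", "row col val")


def cells_to_rows(cells, fill=None):
    """Group cells by row and yield (row_number, dense list of values)."""
    ordered = sorted(cells, key=attrgetter("row", "col"))
    for row_number, group in groupby(ordered, key=attrgetter("row")):
        group = list(group)
        # The last cell in the sorted group has the largest column index.
        values = [fill] * (group[-1].col + 1)
        for cell in group:
            values[cell.col] = cell.val
        yield row_number, values
```

Each yielded list could then be handed to a row-based call such as a worksheet's `append()`.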

@Themanwithoutaplan
Contributor Author

Thanks. Can I assume the workbook will be saved as soon as data has been added? I.e. can I switch to using write-only (formerly optimized-write) mode?

In the current in-memory implementation we're actually still using a dictionary to hold the cells, but I plan to move this to some kind of matrix structure. This could have several memory benefits, would also facilitate aggregate functions (deleting / adding rows or columns) and might also speed things up, but it is essentially an implementation detail. Write-only mode could benefit from a coroutine.

I've also tried looking at the xlrd reader code, because openpyxl's read support is more extensive but I gave up. Would be happy to work on something using read-only mode for fast, sheet-by-sheet access but I think that going from rows (here the implementation really matters) to typed columns is definitely "tricky".

@jreback
Contributor

jreback commented Sep 21, 2015

@Themanwithoutaplan you can certainly optimize / change internals in a future version. Further, I think reading is by definition read-only (i.e. you can assume that).

Trying to get this version out-the-door. Let me know when you can make those doc changes.

Also pls squash a bit.

@Themanwithoutaplan
Contributor Author

I'm working on the docs now. What should I be squashing?

@jreback
Contributor

jreback commented Sep 21, 2015

the commits. Ideally just 1-2 or so.

@Themanwithoutaplan
Contributor Author

I think I'll need to look that up. I'm not very familiar with git, if it's a feature there, and I tend to write a lot of commits. Does this affect when the CI runs? I thought that would run once per push.

@jreback
Contributor

jreback commented Sep 21, 2015

http://pandas.pydata.org/pandas-docs/stable/contributing.html#contributing-your-changes-to-pandas

you just

git rebase -i master

then change pick to s

I will do it if you don't

@Themanwithoutaplan
Contributor Author

@jreback I tried squash but wasn't sure what it was doing.

Improvements to style handling can be done later. They're slow because the same style (cell with border) is being created over and over again. This can be avoided by keeping the style locally and just binding it when required: xcell.border = CellWithBorders. This will allow for future improvements like named styles, which will just assign the name of a style to a cell, thus avoiding even the need to hash the style. But avoiding style object creation should be the biggest win.

@@ -2230,6 +2230,8 @@ Writing Excel Files to Memory
Pandas supports writing Excel files to buffer-like objects such as ``StringIO`` or
``BytesIO`` using :class:`~pandas.io.excel.ExcelWriter`.

Added support for Openpyxl >= 2.2

Contributor

use a versionadded directive here

@Themanwithoutaplan
Contributor Author

I hope the changes are okay.

I looked at using a dict to cache styles (this is something openpyxl does internally as well) but cell.style isn't hashable. Something like that will have to be added in the future.

@Themanwithoutaplan
Contributor Author

I've added a naive style caching strategy. This speeds things up quite a bit (25 %) in openpyxl 2.2 and even more (33 %) with openpyxl 2.3.
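
The comment does not show the caching code, so the following is only a guess at the general shape of such a cache, not the PR's implementation. It assumes styles arrive as nested dicts with JSON-serialisable values and that `convert` is whatever (possibly expensive) function turns such a dict into an engine-specific style object; `StyleCache` is a hypothetical name.

```python
import json


class StyleCache:
    """Memoise converted styles, keyed by a canonical string of the dict."""

    def __init__(self, convert):
        self._convert = convert
        self._cache = {}
        self.misses = 0  # number of times convert() actually ran

    def get(self, style_dict):
        if style_dict is None:
            return None
        # Dicts are not hashable, so build a stable key instead;
        # sort_keys makes key order irrelevant, including in nested dicts.
        key = json.dumps(style_dict, sort_keys=True)
        try:
            return self._cache[key]
        except KeyError:
            self.misses += 1
            converted = self._cache[key] = self._convert(style_dict)
            return converted
```

With a cache like this, a frame whose header cells all share one border style would trigger a single conversion, and every later cell would reuse the same object.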

BTW. I had trouble running tests against openpyxl 2.3-b2 because LooseVersion doesn't like it: AttributeError: 'unicode' object has no attribute 'version'.

I thought I was following convention? I'd like to be able to run the tests against our betas.
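
One possible workaround, sketched here as an assumption rather than what pandas does, is to compare only the leading numeric release components and ignore any pre-release suffix. `release_tuple` is a hypothetical helper.

```python
import re

# Leading dotted run of integers, e.g. "2.3" in "2.3-b2".
_RELEASE = re.compile(r"\d+(?:\.\d+)*")


def release_tuple(version):
    """Return the numeric release part of a version string as a tuple."""
    match = _RELEASE.match(version.strip())
    if match is None:
        raise ValueError("no numeric release in %r" % (version,))
    return tuple(int(part) for part in match.group().split("."))
```

A check such as `release_tuple(version) >= (2, 2)` then also works for betas, with the caveat that a beta compares equal to its final release under this scheme.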

@jreback
Contributor

jreback commented Sep 22, 2015

@Themanwithoutaplan ok, going to merge when this passes: https://travis-ci.org/jreback/pandas/builds/81571705

just squashed yours + minor doc edits.

@Themanwithoutaplan
Contributor Author

@jreback thanks very much and sorry for any trouble.

@jreback
Contributor

jreback commented Sep 22, 2015

merged via c6bcc99

thanks for the fixes @Themanwithoutaplan

Labels
Compat pandas objects compatability with Numpy or Python functions IO Excel read_excel, to_excel
Development

Successfully merging this pull request may close these issues.

COMPAT: openpyxl >= 2.2 failing
4 participants