Openpyxl22 #11144

Themanwithoutaplan · 2015-09-18T17:29:52Z

I've added preliminary tests for openpyxl >= 2.2 styles. Unfortunately, I don't know how to get a whole test class to be skipped so tests will only run with either the Openpyxl20 or Openpyxl22 test class disabled, depending upon which version is installed. I hope this can be fixed fairly easily. Some code somewhere is still calling openpyxl.styles.Style() which needs changing. The aggregate Style object is an irrelevancy in client code.

sinhrks · 2015-09-19T12:48:53Z

I don't know how to get a whole test class to be skipped

You can refer to this, which can be used as a decorator for a test class.

https://github.com/pydata/pandas/blob/master/pandas/util/testing.py#L183
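
As a rough illustration of skipping a whole test class, here is a minimal sketch using the standard library's `unittest.skipIf` as a class decorator. It is not the pandas helper linked above; `INSTALLED_VERSION` and both class names are hypothetical stand-ins for a real version lookup and the real test classes.

```python
import unittest

# Hypothetical stand-in for a real lookup such as parsing openpyxl.__version__.
INSTALLED_VERSION = (2, 3)


@unittest.skipIf(INSTALLED_VERSION >= (2, 2), "requires openpyxl < 2.2")
class Openpyxl20Tests(unittest.TestCase):
    def test_legacy_style(self):
        # Never executed while the class-level skip condition holds.
        self.fail("should have been skipped")


@unittest.skipIf(INSTALLED_VERSION < (2, 2), "requires openpyxl >= 2.2")
class Openpyxl22Tests(unittest.TestCase):
    def test_new_style(self):
        self.assertTrue(True)


def run_suite():
    """Run both classes and return the collected TestResult."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(Openpyxl20Tests))
    suite.addTests(loader.loadTestsFromTestCase(Openpyxl22Tests))
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With the decorator on the class, every test method in it is reported as skipped, so only the class matching the installed version contributes results.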

jreback · 2015-09-20T23:35:29Z

pandas/io/excel.py

+            if isinstance(cell.val, datetime.datetime):
+                xcell.number_format = self.datetime_format
+
+            elif isinstance(cell.val, datetime.date):


this would almost never be true, as internally we only have Timestamp (which is a subclass of datetime.datetime). What you probably want is something like core.format._is_dates_only, which you should only call on an entire array/column, before iterating over the cells.
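
To make the branch-ordering point concrete, here is a small standard-library sketch. `FakeTimestamp` is a hypothetical stand-in for a `datetime.datetime` subclass such as Timestamp, and `is_dates_only` is a simplified illustration of a per-column check, not the pandas implementation.

```python
import datetime


class FakeTimestamp(datetime.datetime):
    """Hypothetical stand-in for a datetime.datetime subclass."""


def is_dates_only(values):
    """Return True if every value falls exactly on midnight."""
    return all(v.time() == datetime.time(0) for v in values)


def pick_format(values,
                date_format="YYYY-MM-DD",
                datetime_format="YYYY-MM-DD HH:MM:SS"):
    # Decide once per column, before iterating over individual cells.
    return date_format if is_dates_only(values) else datetime_format


stamp = FakeTimestamp(2015, 9, 20)

# Both checks pass, so an `if datetime / elif date` chain never
# reaches the second branch for this kind of value.
matches_datetime = isinstance(stamp, datetime.datetime)
matches_date = isinstance(stamp, datetime.date)
```

Because `datetime.datetime` is itself a subclass of `datetime.date`, the per-cell `elif` can only fire for plain `datetime.date` objects, which is why a column-level check is suggested instead.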

The code has been kept around from the previous implementation. openpyxl itself will automatically assign date and time formatting to relevant objects. The preferred method for adding a Pandas Dataframe to an openpyxl worksheet is described here: https://bitbucket.org/snippets/openpyxl/jgbak. I hope to add support for NumPy types soon so that only the conversion from a Dataframe to lists of lists will be required.

I haven't used that here because it doesn't look like a Dataframe is being passed in but rather some kind of cell collection abstraction: offsets and styling.

jreback · 2015-09-21T01:06:33Z

Include this: Themanwithoutaplan@e4e02a8

for testing on travis

jreback · 2015-09-21T06:30:45Z

@sinhrks @TomAugspurger @chris-b1
can u guys give this a try
thxs

Themanwithoutaplan · 2015-09-21T07:35:46Z

It's worth noting that this implementation is significantly slower than simply streaming the data to a worksheet. This is probably related to the cell abstraction.

chris-b1 · 2015-09-21T11:35:19Z

I tried a handful of examples, all of which seemed to work fine.

Maybe it's a separate PR, and I don't think it's a new issue, but could performance be improved like you suggested? It gets pretty slow for even medium-size frames. The cell abstraction itself is actually pretty light-weight, but maybe needs to be refactored a bit for whatever openpyxl works best with?

In [1]: df = pd.DataFrame({'a': np.linspace(0, 100, 20000),
                           'b': range(20000),
                           'c': pd.date_range('1900-1-1', periods=20000)})

In [3]: %timeit df.to_excel('xlwswriter.xlsx', engine='xlsxwriter')
1 loops, best of 3: 2.42 s per loop

In [4]: %timeit df.to_excel('openpy.xlsx', engine='openpyxl')
1 loops, best of 3: 10.4 s per loop

Looks like a lot of time spent on the styles?

In [9]: %prun df.to_excel('openpy.xlsx', engine='openpyxl')

ncalls           tottime  percall  cumtime  percall  filename:lineno(function)
1380478            1.068    0.000    1.579    0.000  base.py:31(__set__)
2880368            1.041    0.000    1.327    0.000  style.py:98(__iter__)
940344             0.783    0.000    1.856    0.000  base.py:45(__set__)
80003              0.604    0.000    3.694    0.000  lxml_worksheet.py:64(write_cell)
120012             0.570    0.000    1.422    0.000  style.py:108(__eq__)
480085/360067      0.569    0.000    1.049    0.000  hashable.py:55(key)
4600889/3520831    0.559    0.000    1.858    0.000  {getattr}
2441079            0.412    0.000    0.412    0.000  {isinstance}
1580529            0.394    0.000    0.394    0.000  base.py:20(__set__)
1                  0.351    0.351    8.493    8.493  excel.py:1186(write_cells)
420198             0.324    0.000    1.205    0.000  base.py:143(__set__)
60000/40000        0.301    0.000    1.653    0.000  functools.py:105(wrapper)
1                  0.267    0.267    6.991    6.991  lxml_worksheet.py:42(write_rows)
20000              0.267    0.000    0.267    0.000  datetime.py:72(time_to_days)
120019             0.266    0.000    0.770    0.000  style.py:114(__hash__)
60009              0.262    0.000    1.478    0.000  style.py:46(__init__)
80001              0.219    0.000    0.357    0.000  format.py:1770(_format_regular_rows)
80012              0.217    0.000    0.640    0.000  excel.py:1032(_convert_to_side)
60008              0.216    0.000    4.077    0.000  styleable.py:81(style_id)
1080203            0.212    0.000    0.364    0.000  hashable.py:59(<genexpr>)
160024             0.202    0.000    3.745    0.000  indexed_list.py:45(add)
80003              0.178    0.000    0.178    0.000  cell.py:111(__init__)
20000              0.174    0.000    0.245    0.000  jdcal.py:203(jd2gcal)
120000             0.171    0.000    0.236    0.000  threading.py:147(acquire)
20003              0.164    0.000    1.061    0.000  fonts.py:77(__init__)
160024             0.158    0.000    2.425    0.000  indexed_list.py:40(append)
180010             0.154    0.000    0.154    0.000  {method 'element' of 'lxml.etree._IncrementalFileWriter' objects}
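
For anyone wanting a comparable listing outside IPython, the sketch below uses the standard library's `cProfile` and `pstats`, sorted by `tottime`. `write_rows` here is a hypothetical placeholder workload, not the pandas or openpyxl function of the same name; in practice one would profile the `df.to_excel(...)` call instead.

```python
import cProfile
import io
import pstats


def write_rows(n_rows):
    """Hypothetical placeholder workload standing in for the export call."""
    return sum(len(str(i)) for i in range(n_rows))


def profile_top(func, *args, limit=10):
    """Profile one call and return the top entries sorted by tottime."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(limit)
    return buffer.getvalue()


report = profile_top(write_rows, 20000)
```

Printing `report` gives the same column layout as the table above (ncalls, tottime, percall, cumtime, percall, filename:lineno(function)).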

Themanwithoutaplan · 2015-09-21T12:56:42Z

Yeah, styles are a pain if they're being applied individually to cells but that shouldn't be the case for that dataset. They get decomposed to their constituents which get hashed to remove duplicates but they shouldn't be called when just setting datetimes.
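
The de-duplication idea described here can be sketched roughly as follows. This is a hypothetical, simplified structure written for illustration only, not openpyxl's actual code: each distinct hashable component is stored once and callers get back its index.

```python
class IndexedList:
    """Hypothetical sketch: store each distinct value once, hand out its index."""

    def __init__(self):
        self._items = []
        self._index = {}  # value -> position in self._items

    def add(self, value):
        idx = self._index.get(value)
        if idx is None:
            idx = len(self._items)
            self._items.append(value)
            self._index[value] = idx
        return idx

    def __len__(self):
        return len(self._items)


fonts = IndexedList()
# Style components are represented as hashable tuples: (name, size, bold).
ids = [fonts.add(("Calibri", 11, bold)) for bold in (False, True, False, False)]
```

Every `add` call hashes its argument, which is cheap once but adds up when the same style is rebuilt and re-added for every cell.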

Excel worksheets are row-oriented, which is why append() is the standard call for bulk inserts. We've harmonised the API across the implementations but this means you'd have to pass in WriteOnlyCells among values for anything that needs styling to use the streaming mode.

These are times on my machine using append() with openpyxl 2.3-b2:

xlsxwriter 6.116043s
openpyxl 18.04805s
openpyxl direct 5.844062s

It would be slightly slower if it was keeping cells in memory.

jreback · 2015-09-21T15:34:59Z

what is openpyxl direct ?

can you add a whatsnew note (e.g. saying which versions people should use). Further, let's update the install.rst and io.rst with the recommendations (e.g. a warning box in the Excel section of io.rst, you can just list the versions in install.rst)

Themanwithoutaplan · 2015-09-21T15:42:56Z

openpyxl direct is how the snippet does it, based on the code you showed me at PyCon: just pass rows of data (so a Dataframe just needs expanding and transposing).
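
A minimal sketch of the "expand and transpose" step, assuming the data is available as a mapping of column names to equal-length sequences. `RowSink` is a hypothetical stand-in for any object exposing `append(row)`, such as a worksheet; no openpyxl or pandas calls are made here.

```python
def columns_to_rows(columns):
    """Yield a header row, then one list per record, from a dict of columns."""
    names = list(columns)
    yield names
    # zip(*...) transposes the column sequences into per-record tuples.
    for record in zip(*(columns[name] for name in names)):
        yield list(record)


class RowSink:
    """Hypothetical stand-in for a worksheet that accepts rows via append()."""

    def __init__(self):
        self.rows = []

    def append(self, row):
        self.rows.append(row)


columns = {"a": [0.0, 0.5, 1.0], "b": [0, 1, 2]}
sheet = RowSink()
for row in columns_to_rows(columns):
    sheet.append(row)
```

Because the rows are produced lazily by a generator, nothing beyond the current row needs to be held by the writer, which is what makes a streaming mode practical.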

I'd personally advise against any version of openpyxl < 2.2. I hope to add support for NumPy types in the soon to be released 2.3.

2.0 & 2.1 make sense if you're editing a worksheet in place, because their guards avoid side-effects when working with styles. Otherwise those guards exact a huge penalty on performance.

Themanwithoutaplan · 2015-09-21T15:46:48Z

I'd also really appreciate a standalone version of the "cells" generator to work with and test in openpyxl so that the responsibility of the API is with openpyxl.

jreback · 2015-09-21T15:50:31Z

@Themanwithoutaplan you can put whatever you think is best for users. I think we have something pointing people to your 1.x series; now you can just say use >= 2.2 for styles.

not sure what a 'standalone version of the cells' generator is?

Themanwithoutaplan · 2015-09-21T15:56:47Z

Just for testing purposes: whatever gets passed into the write_cells method.

chris-b1 · 2015-09-21T16:07:36Z

This is the formatter, which is relatively standalone. IIRC xlsxwriter could also benefit from cells being yielded in a row-oriented manner too, so it probably makes sense to change.

https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1615
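
One way to picture yielding cells in a row-oriented manner is the sketch below. The `Cell` record is a hypothetical simplification (the real formatter yields richer objects with styling), and the sketch assumes every row is dense, i.e. has a value for every column.

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

# Hypothetical cell record carrying only position and value.
Cell = namedtuple("Cell", "row col val")


def rows_from_cells(cells):
    """Group an arbitrary stream of cells into (row_index, values) pairs."""
    # Sorting materialises the stream; a formatter that already emits cells
    # in row order could skip this step and group lazily.
    ordered = sorted(cells, key=attrgetter("row", "col"))
    for row_idx, group in groupby(ordered, key=attrgetter("row")):
        yield row_idx, [cell.val for cell in group]


cells = [Cell(0, 0, "a"), Cell(1, 0, 1.5), Cell(0, 1, "b"), Cell(1, 1, 2)]
rows = list(rows_from_cells(cells))
```

Each yielded list can then be handed to a row-based writer in a single call instead of addressing cells one at a time.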

Themanwithoutaplan · 2015-09-21T16:15:17Z

Thanks. Can I assume the workbook will be saved as soon as data has been added, i.e. switch to using write-only (formerly optimized-write) mode?

In the current in-memory implementation we're actually still using a dictionary to hold the cells, but I plan to move this to some kind of matrix structure. This could have several memory benefits, would also facilitate aggregate functions (deleting / adding rows or columns) and might also speed things up, but it is essentially an implementation detail. Write-only mode could benefit from a co-routine.

I've also tried looking at the xlrd reader code, because openpyxl's read support is more extensive, but I gave up. I would be happy to work on something using read-only mode for fast, sheet-by-sheet access, but I think that going from rows (here the implementation really matters) to typed columns is definitely "tricky".

jreback · 2015-09-21T16:18:56Z

@Themanwithoutaplan you can certainly optimize / change internals in a future version. Further, I think reading is by definition read-only (e.g. you can assume that).

Trying to get this version out-the-door. Let me know when you can make those doc changes.

Also pls squash a bit.

Themanwithoutaplan · 2015-09-21T16:51:40Z

I'm working on the docs now. What should I be squashing?

jreback · 2015-09-21T16:54:58Z

the commits. Ideally just 1-2 or so.

Themanwithoutaplan · 2015-09-21T17:08:27Z

I think I'll need to look that up. I'm not very familiar with git, if it's a feature there, and I tend to write a lot of commits. Does this affect when the CI runs? I thought that would run once per push.

jreback · 2015-09-21T17:15:11Z

http://pandas.pydata.org/pandas-docs/stable/contributing.html#contributing-your-changes-to-pandas

you just

git rebase -i master

then change pick to s

I will do it if you don't

Themanwithoutaplan · 2015-09-21T17:59:15Z

@jreback I tried squash but wasn't sure what it was doing.

Improvements on style handling can be done later. They're slow because the same style (cell with border) is being created over and over again. This can be avoided by having the style locally and just binding it when required: xcell.border = CellWithBorders. This will allow for future improvements like named styles, which will just assign the name of a style to a cell, thus avoiding even the need to hash the style. But avoiding style object creation should be the biggest win.

jreback · 2015-09-21T18:16:21Z

doc/source/io.rst

@@ -2230,6 +2230,8 @@ Writing Excel Files to Memory
 Pandas supports writing Excel files to buffer-like objects such as ``StringIO`` or
 ``BytesIO`` using :class:`~pandas.io.excel.ExcelWriter`.

+Added support for Openpyxl >= 2.2
+


use a versionadded directive here

Themanwithoutaplan · 2015-09-21T19:26:00Z

I hope the changes are okay.

I looked at using a dict to cache styles (this is something openpyxl does internally as well) but cell.style isn't hashable. Something like that will have to be added in the future.

Themanwithoutaplan · 2015-09-22T10:29:26Z

I've added a naive style caching strategy. This speeds things up quite a bit (25 %) in openpyxl 2.2 and even more (33 %) with openpyxl 2.3.
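
A naive cache along these lines might look like the sketch below. It is an illustration under stated assumptions, not the code from this pull request: `Border` is a hypothetical stand-in for an expensive-to-build style object, and the cache key is a canonical JSON string because plain style dicts are not hashable.

```python
import json


class Border:
    """Hypothetical stand-in for a style object that is costly to construct."""

    instances = 0

    def __init__(self, **options):
        Border.instances += 1
        self.options = options


_style_cache = {}


def cached_border(style_dict):
    """Return one shared Border per distinct style description."""
    # sort_keys makes the key independent of dict insertion order.
    key = json.dumps(style_dict, sort_keys=True)
    try:
        return _style_cache[key]
    except KeyError:
        border = _style_cache[key] = Border(**style_dict)
        return border


style = {"top": "thin", "bottom": "thin"}
borders = [cached_border(dict(style)) for _ in range(1000)]
```

With the cache in place, a thousand cells sharing one border description trigger a single object construction; the per-cell cost is reduced to building the key and a dictionary lookup.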

BTW. I had trouble running tests against openpyxl 2.3-b2 because LooseVersion doesn't like it: AttributeError: 'unicode' object has no attribute 'version'.

I thought I was following convention? I'd like to be able to run the tests against our betas.
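
As one possible way to gate tests on a version string that carries a beta suffix, the sketch below extracts only the leading numeric release with a regular expression. It does not use LooseVersion and is not what pandas does; `release_tuple` and `supports_new_styles` are hypothetical helper names.

```python
import re


def release_tuple(version):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("no numeric release in %r" % (version,))
    return tuple(int(part) for part in match.group(0).split("."))


def supports_new_styles(version):
    # Assumption: a beta of 2.3 should count as >= 2.2 for test selection.
    return release_tuple(version) >= (2, 2)
```

Note the trade-off: the pre-release suffix is ignored entirely, so a hypothetical "2.2-b1" would be treated the same as "2.2".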

jreback · 2015-09-22T12:06:13Z

@Themanwithoutaplan ok, going to merge when this passes: https://travis-ci.org/jreback/pandas/builds/81571705

just squashed yours + minor doc edits.

Themanwithoutaplan · 2015-09-22T12:22:03Z

@jreback thanks very much and sorry for any trouble.

jreback · 2015-09-22T14:12:37Z

merged via c6bcc99

thanks for the fixes @Themanwithoutaplan

Themanwithoutaplan added 3 commits September 18, 2015 17:38

Create separate environments for testing openpyxl.

6391f1a

Subclass Openpyxl2Writer for >= 2.2

6115d89

Add openpyxl >= 2.2 specific tests.

f266990

Themanwithoutaplan added 2 commits September 19, 2015 15:22

Use class decorator for skipping TestClass

63fb961

Invert order for reading number format.

5900483

jreback added the IO Excel read_excel, to_excel label Sep 19, 2015

jreback reviewed Sep 20, 2015

jreback added the Compat pandas objects compatability with Numpy or Python functions label Sep 21, 2015

jreback added this to the 0.17.0 milestone Sep 21, 2015

Themanwithoutaplan added 3 commits September 21, 2015 19:42

Update docs.

2597a45

Allow openpyxl to handle the formatting for dates and times.

692fccd

Make function call clearer.

79d1cf1

jreback reviewed Sep 21, 2015

Themanwithoutaplan added 2 commits September 21, 2015 21:22

Add version flag.

9255e99

Remove comments.

fcab59c

Add a naive cache for styles.

921da27

jreback closed this Sep 22, 2015

chris-b1 mentioned this pull request Oct 17, 2015

PERF: refactor ExcelFormatter #11355

Closed

2 tasks

sinhrks mentioned this pull request Aug 29, 2016

DOC: small update to install.rst page #14115

Merged

Themanwithoutaplan deleted the openpyxl22 branch November 28, 2017 11:37
