-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Openpyxl22 #11144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Openpyxl22 #11144
Conversation
You can refer this which can be used as a decorator for test class. |
if isinstance(cell.val, datetime.datetime): | ||
xcell.number_format = self.datetime_format | ||
|
||
elif isinstance(cell.val, datetime.date): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would almost never be true, as we only have internally Timestamp
(which is a sub-class of datetime.datetime
). prob what you want is something like core.format._is_dates_only
(which you should only call on an entire array/columns (a-priori to iterating over the cells)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code has been kept around from the previous implementation. openpyxl itself will automatically assign date and time formatting to relevant objects. The preferred method for adding a Pandas Dataframe to an openpyxl worksheet is described here: https://bitbucket.org/snippets/openpyxl/jgbak .Hope to add support for NumPy types soon so that only the conversion from a Dataframe to lists of lists will be required.
I haven't used that here because it doesn't look like a Dataframe is being passed in but rather some kind of cell collection abstraction: offsets and styling.
Include this: Themanwithoutaplan@e4e02a8 for testing on travis |
@sinhrks @TomAugspurger @chris-b1 |
It's worth noting that this implementation is significantly slower than simply streaming the data to a worksheet. This is probably related to the cell abstraction. |
I tried a handful of examples, all of which seemed to work fine. Maybe it's a separate PR, and I don't think it's a new issue, but could performance be improved like you suggested? It gets pretty slow for even medium-size frames. The cell abstraction itself is actually pretty light-weight, but maybe needs to be refactored a bit for whatever openpyxl works best with?
Looks like a lot of time spent on the styles?
|
Yeah, styles are a pain if they're being applied individually to cells but that shouldn't be the case for that dataset. They get decomposed to their constituents which get hashed to remove duplicates but they shouldn't be called when just setting datetimes. Excel worksheets are row-oriented which is why this is These are times on my machine using
It would be slightly slower if it was keeping cells in memory. |
what is can you add a whatsnew note (e.g. saying which versions people should use). Futher let's update the |
I'd personally advise against any version of openpyxl < 2.2. I hope to add support for NumPy types in the soon to be released 2.3. 2.0 & 2.1 make sense if you're editing a worksheet in place because they avoid the side-effects when working with styles. Otherwise the guards, used to avoid side-effects when working with styles, exact a huge penalty on performance. |
I'd also really appreciate a standalone version of the "cells" generator to work with and test in openpyxl so that the responsibility of the API is with openpyxl. |
@Themanwithoutaplan you can put whatever you think is best for users. I think we have something pointing to use your 1.x Series, now you can just say use > 2.2 for styles. not sure what a 'standalone version of the cells' generator is? |
Just for testing purposes: whatever gets passed into the |
This is the formatter, which is relatively standalone. IIRC https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1615 |
Thanks. Can I assume the workbook will be saved as soon as data has been added? ie. Switch to using to In the current in-memory implementation we're actually still using a dictionary to hold the cells but I plan to move this using some kind of matrix structure. This could have several memory benefits and would also facilitate aggregate functions (deleting / adding rows or columns) and might also speed things up, but is essentially an implementation detail I've also tried looking at the xlrd reader code, because openpyxl's read support is more extensive but I gave up. Would be happy to work on something using |
@Themanwithoutaplan you can certainly optimize / change internals in a future version. Further I think both reading is by definition read-only (e.g. you can assume that). Trying to get this version out-the-door. Let me know when you can make those doc changes. Also pls squash a bit. |
I'm working on the docs now. What should I be squashing? |
the commits. Ideally just 1-2 or so. |
I think I'll need to look that up. I'm not very familiar with git, if it's a feature there, and I tend to write a lot of commits. Does this affect when the CI work? I thought that would run once per push. |
http://pandas.pydata.org/pandas-docs/stable/contributing.html#contributing-your-changes-to-pandas you just
then change pick to s I will do it if you don't |
@jreback I tried squash but wasn't sure what it was doing. Improvements on style handling can be done later. They're slow because the same style (cell with border) is being created over and over again. This can be avoided by having the style locally and just binding it when required: |
@@ -2230,6 +2230,8 @@ Writing Excel Files to Memory | |||
Pandas supports writing Excel files to buffer-like objects such as ``StringIO`` or | |||
``BytesIO`` using :class:`~pandas.io.excel.ExcelWriter`. | |||
|
|||
Added support for Openpyxl >= 2.2 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a versionadded directive here
I hope the changes are okay. I looked at using a dict to cache styles (this is something openpyxl does internally as well) but |
I've added a naive style caching strategy. This speeds things up quite a bit (25 %) in openpyxl 2.2 and even more (33 %) with openpyxl 2.3. BTW. I had trouble running tests against openpyxl 2.3-b2 because I thought I was following convention? I'd like to be able to run the tests against our betas. |
@Themanwithoutaplan ok, going to merge when this passes: https://travis-ci.org/jreback/pandas/builds/81571705 just squashed yours + minor doc edits. |
@jreback thanks very much and sorry for any trouble. |
merged via c6bcc99 thanks for the fixes @Themanwithoutaplan |
closes #10125
I've added preliminary tests for openpyxl >= 2.2 styles. Unfortunately, I don't know how to get a whole test class to be skipped so tests will only run with either the Openpyxl20 or Openpyxl22 test class disabled, depending upon which version is installed. I hope this can be fixed fairly easily. Some code somewhere is still calling
openpyxl.styles.Style()
which needs changing. The aggregate Style object is an irrelevancy in client code.