Performance regression in combine_first() #6557


Closed
qwhelan opened this issue Mar 5, 2014 · 4 comments

qwhelan commented Mar 5, 2014

I took another look at the testing that led to #6479 and am still seeing a ~73% slowdown. Benchmarking my application with v0.12.0 against 549a390 found a 48x slowdown in Series.combine_first when using a DatetimeIndex with MonthEnd offsets. The regression is also present in v0.13.1, so it wasn't introduced by the fix for #6479. The test case:

import numpy as np
from pandas import *
rows = 2000
a = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))
b = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))

a.ix[-500:] = float('nan')  # blank the tail so combine_first has NaNs to fill

%timeit a.combine_first(b)

And the comparison:
[benchmark comparison image: perf_head]

This regression does not exist for daily indices. It probably also exists for DataFrames and other offset frequencies; I'll create some vbenches later.
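For readers on current pandas, a hypothetical modern equivalent of the reproduction above is sketched below; the DatetimeIndex(start=..., periods=..., freq=...) constructor and the .ix indexer used in the original report were removed from later pandas, so date_range and .iloc stand in for them:

```python
import numpy as np
import pandas as pd

# Sketch of the original repro against current pandas APIs (an assumption:
# the issue itself was filed against pandas 0.12/0.13).
rows = 2000
idx = pd.date_range(start="1900-01-01", periods=rows, freq=pd.offsets.MonthEnd())
a = pd.Series(np.random.random(rows), index=idx)
b = pd.Series(np.random.random(rows), index=idx)

a.iloc[-500:] = np.nan          # blank the tail so combine_first has gaps to fill
combined = a.combine_first(b)   # NaNs in `a` are filled from `b`
```

Timing `combined = a.combine_first(b)` with a monthly versus a daily index is what exposed the slowdown reported here.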


jreback commented Mar 5, 2014

I think this is just related to the indexing issue that was fixed regarding non-daily caching (I don't remember the issue number).

In master this looks fine.

In [1]: from pandas import *

In [2]: rows = 2000

In [3]: a = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))

In [4]: b = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))

In [5]: a.ix[-500:] = float('nan')

In [6]: %timeit a.combine_first(b)
1000 loops, best of 3: 382 µs per loop

In [7]: a = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='D'))

In [8]: b = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='D'))

In [9]: a.ix[-500:] = float('nan')

In [10]: %timeit a.combine_first(b)
1000 loops, best of 3: 339 µs per loop
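The IPython session above can be reproduced without magics using the stdlib timeit module; this is a sketch, and the absolute numbers will vary by machine and pandas version (it also substitutes the modern date_range/.iloc APIs for the removed ones):

```python
import timeit
import numpy as np
import pandas as pd

def bench(freq):
    # Build two overlapping series on the given frequency and time combine_first.
    rows = 2000
    idx = pd.date_range(start="1900-01-01", periods=rows, freq=freq)
    a = pd.Series(np.random.random(rows), index=idx)
    b = pd.Series(np.random.random(rows), index=idx)
    a.iloc[-500:] = np.nan
    return timeit.timeit(lambda: a.combine_first(b), number=100)

monthly = bench(pd.offsets.MonthEnd())  # the case reported as regressed
daily = bench("D")                      # the baseline that stayed fast
print(f"monthly: {monthly:.4f}s  daily: {daily:.4f}s")
```

On master at the time, as the transcript shows, the two cases were comparable.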

jreback closed this as completed Mar 5, 2014

qwhelan commented Mar 5, 2014

Sorry about that; looks like my kernel didn't pick up the build from HEAD.


jreback commented Mar 5, 2014

No problem, keep reporting!


qwhelan commented Mar 6, 2014

Ok, so the performance regression does exist, but I don't think it's a bug.

This test case reproduces the regression (yes, it's ugly and un-idiomatic):

from pandas.tseries.offsets import MonthEnd
import numpy as np
from pandas import *
rows = 2000

s = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))

# Extend the index via .append, which (unlike .union) drops the offset
s = s.reindex(s.index.append(DatetimeIndex([s.index[-1] + MonthEnd()])))

a = s.copy()

s.ix[-500:] = float('nan')

print s.index._can_fast_union(s.index)  # False: the appended index lost its offset
print s.index.offset                    # None

%timeit s.combine_first(a)

Unlike .union, .append doesn't preserve index.offset in either HEAD or v0.12.0, which causes s.index._can_fast_union(s.index) to be False in both cases.

However, if you run %lprun -f DatetimeIndex.union s.combine_first(a) under v0.12.0 after running the above, you get this:

   844         1          273    273.0      0.2          if this._can_fast_union(other):
   845         1           83     83.0      0.1              return this._fast_union(other)
   846                                                   else:
   847                                                       result = Index.union(this, other)
   848                                                       if isinstance(result, DatetimeIndex):
   849                                                           result.tz = this.tz
   850                                                           if result.freq is None:
   851                                                               result.offset = to_offset(result.inferred_freq)
   852                                                       return result

which is just incorrect given that _can_fast_union was False above. So v0.12.0 was incorrectly taking the fast path here... sometimes...

If you put print this.offset, other.offset in DatetimeIndex.union, you'd expect None None every time. Here's what you actually get with v0.12.0:

%timeit s.combine_first(a)
None None
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>

My guess is that a caching mechanism went and inferred the frequency and set .offset on the cached copy.

So the performance regression isn't a bug, though .append may want to preserve or infer .offset where possible.
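The frequency that the cache apparently re-attached can be recovered explicitly via inferred_freq, which remains the public route to what the profiled snippet's to_offset(result.inferred_freq) does; this is a sketch against current pandas, where the exact alias string returned varies by version:

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# An index built without a freq, e.g. the result of .append above.
idx = pd.DatetimeIndex(["1900-01-31", "1900-02-28", "1900-03-31", "1900-04-30"])
assert idx.freq is None

# pandas can still infer the month-end pattern from the values...
inferred = idx.inferred_freq
print(inferred)  # a month-end alias ('M' or 'ME', depending on pandas version)

# ...and to_offset turns the alias back into an offset object.
offset = to_offset(inferred)
print(type(offset).__name__)  # MonthEnd
```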
