Performance regression in combine_first() #6557


Closed
qwhelan opened this issue Mar 5, 2014 · 4 comments

qwhelan commented Mar 5, 2014

I took another look at the testing that led to #6479 and am still seeing a ~73% slowdown. Benchmarking my application with v0.12.0 against 549a390 found a 48x slowdown in Series.combine_first when using a DatetimeIndex with MonthEnd offsets. The regression is also present in v0.13.1, so it wasn't introduced by the fix for #6479. The test case:

import numpy as np
from pandas import *
rows = 2000
a = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))
b = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))

a.ix[-500:] = float('nan')  # blank the tail so combine_first has NaNs to fill

%timeit a.combine_first(b)

And the comparison:
[benchmark comparison image: perf_head]

This regression does not exist for daily indices. It probably also exists for DataFrames and other offset frequencies; I'll create some vbenches later.
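For readers on current pandas, a hypothetical modern equivalent of the reproduction above is sketched below; the DatetimeIndex(start=..., periods=..., freq=...) constructor and the .ix indexer used in the original report were removed from later pandas, so date_range and .iloc stand in for them:

```python
import numpy as np
import pandas as pd

# Sketch of the original repro against current pandas APIs (an assumption:
# the issue itself was filed against pandas 0.12/0.13).
rows = 2000
idx = pd.date_range(start="1900-01-01", periods=rows, freq=pd.offsets.MonthEnd())
a = pd.Series(np.random.random(rows), index=idx)
b = pd.Series(np.random.random(rows), index=idx)

a.iloc[-500:] = np.nan          # blank the tail so combine_first has gaps to fill
combined = a.combine_first(b)   # NaNs in `a` are filled from `b`
```

Timing `combined = a.combine_first(b)` with a monthly versus a daily index is what exposed the slowdown reported here.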


jreback commented Mar 5, 2014

I think this is just related to the indexing issue that was fixed regarding non-daily caching (I don't remember the issue number).

In master this looks fine.

In [1]: from pandas import *

In [2]: rows = 2000

In [3]: a = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))

In [4]: b = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))

In [5]: a.ix[-500:] = float('nan')

In [6]: %timeit a.combine_first(b)
1000 loops, best of 3: 382 µs per loop

In [7]: a = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='D'))

In [8]: b = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='D'))

In [9]: a.ix[-500:] = float('nan')

In [10]: %timeit a.combine_first(b)
1000 loops, best of 3: 339 µs per loop
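The IPython session above can be reproduced without magics using the stdlib timeit module; this is a sketch, and the absolute numbers will vary by machine and pandas version (it also substitutes the modern date_range/.iloc APIs for the removed ones):

```python
import timeit
import numpy as np
import pandas as pd

def bench(freq):
    # Build two overlapping series on the given frequency and time combine_first.
    rows = 2000
    idx = pd.date_range(start="1900-01-01", periods=rows, freq=freq)
    a = pd.Series(np.random.random(rows), index=idx)
    b = pd.Series(np.random.random(rows), index=idx)
    a.iloc[-500:] = np.nan
    return timeit.timeit(lambda: a.combine_first(b), number=100)

monthly = bench(pd.offsets.MonthEnd())  # the case reported as regressed
daily = bench("D")                      # the baseline that stayed fast
print(f"monthly: {monthly:.4f}s  daily: {daily:.4f}s")
```

On master at the time, as the transcript shows, the two cases were comparable.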

jreback closed this as completed Mar 5, 2014

qwhelan commented Mar 5, 2014

Sorry about that; looks like my kernel didn't pick up the build from HEAD.


jreback commented Mar 5, 2014

No problem, keep reporting!


qwhelan commented Mar 6, 2014

Ok, so the performance regression does exist, but I don't think it's a bug.

This test case reproduces the regression (yes, it's ugly and un-idiomatic):

from pandas.tseries.offsets import MonthEnd
import numpy as np
from pandas import *
rows = 2000

s = Series(np.random.random(rows), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))

# Extend the index via .append, which (unlike .union) drops the offset
s = s.reindex(s.index.append(DatetimeIndex([s.index[-1] + MonthEnd()])))

a = s.copy()

s.ix[-500:] = float('nan')

print s.index._can_fast_union(s.index)  # False: the appended index lost its offset
print s.index.offset                    # None

%timeit s.combine_first(a)

Unlike .union, .append doesn't preserve index.offset in either HEAD or v0.12.0, which causes s.index._can_fast_union(s.index) to be False in both cases.

However, if you run %lprun -f DatetimeIndex.union s.combine_first(a) under v0.12.0 after running the above, you get this:

   844         1          273    273.0      0.2          if this._can_fast_union(other):
   845         1           83     83.0      0.1              return this._fast_union(other)
   846                                                   else:
   847                                                       result = Index.union(this, other)
   848                                                       if isinstance(result, DatetimeIndex):
   849                                                           result.tz = this.tz
   850                                                           if result.freq is None:
   851                                                               result.offset = to_offset(result.inferred_freq)
   852                                                       return result

which is just incorrect given that _can_fast_union was False above. So v0.12.0 was incorrectly taking the fast path here... sometimes...

If you put print this.offset, other.offset in DatetimeIndex.union, you'd expect None None every time. Here's what you actually get with v0.12.0:

%timeit s.combine_first(a)
None None
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>
<1 MonthEnd> <1 MonthEnd>

My guess is that a caching mechanism went and inferred the frequency and set .offset on the cached copy.

So the performance regression isn't a bug, though .append may want to preserve or infer .offset where possible.
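The frequency that the cache apparently re-attached can be recovered explicitly via inferred_freq, which remains the public route to what the profiled snippet's to_offset(result.inferred_freq) does; this is a sketch against current pandas, where the exact alias string returned varies by version:

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# An index built without a freq, e.g. the result of .append above.
idx = pd.DatetimeIndex(["1900-01-31", "1900-02-28", "1900-03-31", "1900-04-30"])
assert idx.freq is None

# pandas can still infer the month-end pattern from the values...
inferred = idx.inferred_freq
print(inferred)  # a month-end alias ('M' or 'ME', depending on pandas version)

# ...and to_offset turns the alias back into an offset object.
offset = to_offset(inferred)
print(type(offset).__name__)  # MonthEnd
```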
