PERF: use .values in index difference #11279

max-sixty · 2015-10-10T01:16:40Z

The existing .difference method 'unboxed' all the objects, which has a severe performance impact on PeriodIndex in particular.

In [3]: long_index = pd.period_range(start='2000', freq='s', periods=1000)

In [4]: empty_index = pd.PeriodIndex([],freq='s')


In [24]: %timeit long_index.difference(empty_index)

# existing:
1 loops, best of 1: 1.02 s per loop
# updated: 
1000 loops, best of 3: 538 µs per loop

...so around 2000x faster

I haven't worked with asv or the like - is this a case where a test like that is required?
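For reference, an asv-style benchmark for this case could look roughly like the sketch below. The class and method names are hypothetical and not taken from the pandas benchmark suite; the only asv convention relied on is that `setup` runs before each timed method and that timed methods are prefixed with `time_`. The inputs mirror the `%timeit` session above.

```python
import pandas as pd


class PeriodIndexDifference:
    """Hypothetical asv-style benchmark; names are illustrative only."""

    def setup(self):
        # Same inputs as the %timeit session in the PR description.
        self.long_index = pd.period_range(start="2000", freq="s", periods=1000)
        self.empty_index = pd.PeriodIndex([], freq="s")

    def time_difference_empty(self):
        # asv times the body of methods whose names start with "time_".
        self.long_index.difference(self.empty_index)


# The class is plain Python, so it can also be exercised by hand:
bench = PeriodIndexDifference()
bench.setup()
bench.time_difference_empty()
```

Running the class by hand like this only checks that the benchmark body executes; the actual timing comparison would come from asv or `%timeit`.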

jorisvandenbossche · 2015-10-10T10:25:07Z

pandas/core/index.py

@@ -1605,12 +1605,12 @@ def difference(self, other):
        self._assert_can_do_setop(other)

        if self.equals(other):
-            return Index([], name=self.name)
+            return self._shallow_copy([])


maybe nice to add a test for this one? (that it keeps the correct class)
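A minimal sketch of the kind of test suggested here, assuming a pandas version in which the equal-index path of `difference` preserves the index class. The helper name `check_difference_keeps_class` is hypothetical and is not part of the pandas test suite.

```python
import pandas as pd


def check_difference_keeps_class(index):
    # Hypothetical helper: the difference of an index with itself should be
    # empty and should have the same class as the input index.
    result = index.difference(index)
    assert len(result) == 0
    assert type(result) is type(index)
    return result


period_result = check_difference_keeps_class(
    pd.period_range("2000", freq="D", periods=3)
)
datetime_result = check_difference_keeps_class(
    pd.date_range("2000-01-01", periods=3)
)
```

With the old `return Index([], name=self.name)` line shown in the diff, the class check would be expected to fail for a PeriodIndex, since the result would be a plain `Index`.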

jreback · 2015-10-10T18:53:58Z

there are quite a number of tests in tseries/tests/test_base for this type of behavior FYI

max-sixty · 2015-10-10T19:28:21Z

OK cheers @jreback. At the moment I'm getting a number of failures similar to the one below - I think it's where this operates on MultiIndexes.
I don't know how well multi_index._shallow_copy(multi_index.values) == multi_index works?
I can branch the logic depending on whether it's a MultiIndex or not - unless you have an alternative?

======================================================================
ERROR: test_stack_partial_multiIndex (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 13998, in test_stack_partial_multiIndex
    _test_stack_with_multiindex(full_multiindex[multiindex_columns])
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 13969, in _test_stack_with_multiindex
    result = df.stack(level=level, dropna=False)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 3745, in stack
    return stack(self, level, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 481, in stack
    return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 648, in _stack_multi_columns
    result = DataFrame(new_data, index=new_index, columns=new_columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 227, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 322, in _init_dict
    data = dict((k, v) for k, v in compat.iteritems(data)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 323, in <genexpr>
    if k in columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/index.py", line 1116, in __contains__
    return key in self._engine
  File "pandas/index.pyx", line 99, in pandas.index.IndexEngine.__contains__ (pandas/index.c:2749)
  File "pandas/index.pyx", line 261, in pandas.index.IndexEngine._ensure_mapping_populated (pandas/index.c:5304)
  File "pandas/index.pyx", line 267, in pandas.index.IndexEngine.initialize (pandas/index.c:5408)
  File "pandas/hashtable.pyx", line 703, in pandas.hashtable.PyObjectHashTable.map_locations (pandas/hashtable.c:12850)
ValueError: Does not understand character buffer dtype format string ('w')

jreback · 2015-10-10T19:38:29Z

looks like something else is going on
_shallow_copy should work; it is overridden for MultiIndex

jreback · 2015-10-15T22:26:21Z

any progress?

max-sixty · 2015-10-15T22:37:58Z

@jreback not yet - will look at it this weekend. Thanks for the ping

jreback · 2015-11-18T20:16:12Z

@MaximilianR if you'd like to update would be gr8

max-sixty · 2015-11-19T01:16:06Z

I had a go at debugging this. But I'm struggling, since the errors happen on the Cython side - I need to get up to speed on how to debug those.
If anyone has any guidance, I'm very open to ideas. Otherwise it'll be a few weeks at least, I think.

jreback · 2015-12-06T19:17:55Z

@MaximilianR can you rebase / update

max-sixty · 2015-12-08T04:07:51Z

I still get this error below. I'm really not sure how to debug the pyx files - although keen to learn. Any guidance?

======================================================================
ERROR: test_stack_partial_multiIndex (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 14305, in test_stack_partial_multiIndex
    _test_stack_with_multiindex(full_multiindex[multiindex_columns])
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 14276, in _test_stack_with_multiindex
    result = df.stack(level=level, dropna=False)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 3803, in stack
    return stack(self, level, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 481, in stack
    return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 648, in _stack_multi_columns
    result = DataFrame(new_data, index=new_index, columns=new_columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 323, in _init_dict
    data = dict((k, v) for k, v in compat.iteritems(data)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 324, in <genexpr>
    if k in columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/index.py", line 1161, in __contains__
    return key in self._engine
  File "pandas/index.pyx", line 99, in pandas.index.IndexEngine.__contains__ (pandas/index.c:2749)
  File "pandas/index.pyx", line 261, in pandas.index.IndexEngine._ensure_mapping_populated (pandas/index.c:5304)
  File "pandas/index.pyx", line 267, in pandas.index.IndexEngine.initialize (pandas/index.c:5408)
  File "pandas/hashtable.pyx", line 703, in pandas.hashtable.PyObjectHashTable.map_locations (pandas/hashtable.c:12518)
ValueError: Does not understand character buffer dtype format string ('w')

----------------------------------------------------------------------

jreback · 2015-12-09T15:12:17Z

go up the stack when debugging. Somehow new_columns is created with a dtype of S1, which is invalid and violates some guarantees there. So you have to trace where this is happening (prob the _shallow_copy may need a hint)

> /Users/jreback/pandas/pandas/core/reshape.py(648)_stack_multi_columns()
-> result = DataFrame(new_data, index=new_index, columns=new_columns)
(Pdb) p new_data
{'A': array([ nan,   2.,  nan,  nan,   5.,  nan,  nan,   8.,  nan]), 'B': array([  0.,  nan,   1.,   3.,  nan,   4.,   6.,  nan,   7.])}
(Pdb) p new_index
MultiIndex(levels=[[0, 1, 2], [u'u', u'x', u'y', u'z']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3, 1, 2, 3]],
           names=[None, u'Lower'])
(Pdb) p new_columns
Index([u'A', u'B'], dtype='|S1', name=u'Upper')
(Pdb) !new_columns = Index(new_columns.values,name=new_columns.name)
*** NameError: name 'Index' is not defined
(Pdb) from pandas import Index
(Pdb) !new_columns = Index(new_columns.values,name=new_columns.name)
(Pdb) p new_columns
Index([u'A', u'B'], dtype='object', name=u'Upper')
(Pdb) p DataFrame(new_data, index=new_index, columns=new_columns)
Upper     A   B
  Lower        
0 x     NaN   0
  y       2 NaN
  z     NaN   1
1 x     NaN   3
  y       5 NaN
  z     NaN   4
2 x     NaN   6
  y       8 NaN
  z     NaN   7
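The workaround in the session above rebuilds the index through the public `Index` constructor instead of reusing the existing object. The sketch below illustrates only that step in isolation: it does not reproduce the failing `_shallow_copy` path, and the fixed-width array `raw` is a stand-in assumption for the `|S1`-typed values seen in the debugger.

```python
import numpy as np
import pandas as pd

# Fixed-width NumPy strings, standing in for the '|S1' values above.
raw = np.array(["A", "B"])

# Going through the public constructor lets pandas infer a dtype it
# supports for string labels, rather than keeping the raw NumPy dtype.
rebuilt = pd.Index(raw, name="Upper")

print(raw.dtype, "->", rebuilt.dtype)
print("A" in rebuilt)
```

The exact dtype printed for `rebuilt` depends on the pandas version, so it is not asserted here; the point is only that it differs from the fixed-width NumPy dtype and that membership checks work on the rebuilt index.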

max-sixty · 2015-12-09T15:34:38Z

OK thanks, I'll try that angle

jreback · 2016-01-06T17:18:33Z

@MaximilianR pls reopen if you would like to update

max-sixty · 2016-01-06T17:33:28Z

OK, I will aim to come back to this one at some point

jreback · 2016-01-06T17:35:24Z

np. just trying to keep outstanding PRs to a minimum.

max-sixty changed the title ~~PER: use .values in index difference~~ PERF: use .values in index difference Oct 10, 2015

max-sixty force-pushed the index-setops-speed branch from c610191 to d483846 Compare October 10, 2015 03:21

jorisvandenbossche reviewed Oct 10, 2015

jreback added the Indexing, Performance, and Period labels Oct 10, 2015

max-sixty force-pushed the index-setops-speed branch from d483846 to b3fbdd5 Compare October 10, 2015 19:05

max-sixty force-pushed the index-setops-speed branch from b3fbdd5 to 224791a Compare October 17, 2015 18:51

use .values in index difference

19cc65d

max-sixty force-pushed the index-setops-speed branch from 224791a to 19cc65d Compare December 8, 2015 04:02

jreback closed this Jan 6, 2016

jreback mentioned this pull request Jan 15, 2016

Index.difference performance #12044

Closed

max-sixty deleted the index-setops-speed branch December 22, 2016 05:57
