PERF: use .values in index difference #11279

Closed
wants to merge 1 commit into from

max-sixty
Contributor

The existing .difference method 'unboxed' all the objects, which has a severe performance impact on PeriodIndex in particular.

In [3]: long_index = pd.period_range(start='2000', freq='s', periods=1000)

In [4]: empty_index = pd.PeriodIndex([],freq='s')


In [24]: %timeit long_index.difference(empty_index)

# existing:
1 loops, best of 1: 1.02 s per loop
# updated: 
1000 loops, best of 3: 538 µs per loop

...so around 1900x faster (1.02 s vs. 538 µs)

I haven't worked with asv or the like - is this a case where a test like that is required?
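
For reference, asv benchmarks are plain classes with a `setup` method and methods whose names start with `time_`. A minimal sketch of what one might look like for this case, assuming a hypothetical class name `PeriodIndexDifference` (not taken from the pandas benchmark suite):

```python
import pandas as pd


class PeriodIndexDifference:
    # Hypothetical benchmark class; the name and layout are illustrative only.

    def setup(self):
        # Same inputs as the %timeit session above.
        self.long_index = pd.period_range(start="2000", freq="s", periods=1000)
        self.empty_index = pd.PeriodIndex([], freq="s")

    def time_difference_empty(self):
        # asv times the body of every method prefixed with `time_`.
        self.long_index.difference(self.empty_index)


# Outside asv, the benchmark can be exercised by hand.
bench = PeriodIndexDifference()
bench.setup()
bench.time_difference_empty()
```

Keeping the index construction in `setup` means only the `difference` call itself is timed.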

@max-sixty max-sixty changed the title from "PER: use .values in index difference" to "PERF: use .values in index difference" Oct 10, 2015
@@ -1605,12 +1605,12 @@ def difference(self, other):
         self._assert_can_do_setop(other)

         if self.equals(other):
-            return Index([], name=self.name)
+            return self._shallow_copy([])
Member

maybe nice to add a test for this one? (that it keeps the correct class)
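
A minimal sketch of such a check, written against the public API only and assuming a pandas version in which the empty result keeps the caller's class (the variable names are illustrative):

```python
import pandas as pd

index = pd.period_range(start="2000", freq="s", periods=5)

# An index minus itself is empty, but should still be a PeriodIndex.
result = index.difference(index)

assert isinstance(result, pd.PeriodIndex)
assert len(result) == 0
# The period dtype carries the frequency, so this also checks freq is kept.
assert result.dtype == index.dtype
```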

@jreback
Contributor

jreback commented Oct 10, 2015

there are quite a number of tests in tseries/tests/test_base for this type of behavior FYI

@jreback jreback added the Indexing, Performance, and Period labels Oct 10, 2015
@max-sixty
Contributor Author

OK cheers @jreback. At the moment I'm getting a number of failures similar to the one below - I think it's where this operates on MultiIndexes.
I don't know how well multi_index._shallow_copy(multi_index.values) == multi_index works?
I can branch the logic depending on whether it's a MultiIndex or not - unless you have an alternative?
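
One way to probe the round-trip question without touching the private `_shallow_copy` is to rebuild the MultiIndex from its `.values` through the public constructor. This is only a sketch: the level values and the names `outer` and `inner` are made up for illustration, and it says nothing about how `_shallow_copy` itself behaves.

```python
import pandas as pd

multi_index = pd.MultiIndex.from_product(
    [[0, 1], ["x", "y"]], names=["outer", "inner"]
)

# .values flattens the MultiIndex into a 1-D object array of tuples.
values = multi_index.values

# Rebuild from the tuples using the public constructor.
rebuilt = pd.MultiIndex.from_tuples(list(values), names=multi_index.names)
round_trips = rebuilt.equals(multi_index)
```

The point is that `.values` on a MultiIndex is an array of tuples rather than a typed array, so any code path that feeds it back into a generic `Index` constructor has to know to rebuild the levels.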

======================================================================
ERROR: test_stack_partial_multiIndex (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 13998, in test_stack_partial_multiIndex
    _test_stack_with_multiindex(full_multiindex[multiindex_columns])
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 13969, in _test_stack_with_multiindex
    result = df.stack(level=level, dropna=False)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 3745, in stack
    return stack(self, level, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 481, in stack
    return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 648, in _stack_multi_columns
    result = DataFrame(new_data, index=new_index, columns=new_columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 227, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 322, in _init_dict
    data = dict((k, v) for k, v in compat.iteritems(data)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 323, in <genexpr>
    if k in columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/index.py", line 1116, in __contains__
    return key in self._engine
  File "pandas/index.pyx", line 99, in pandas.index.IndexEngine.__contains__ (pandas/index.c:2749)
  File "pandas/index.pyx", line 261, in pandas.index.IndexEngine._ensure_mapping_populated (pandas/index.c:5304)
  File "pandas/index.pyx", line 267, in pandas.index.IndexEngine.initialize (pandas/index.c:5408)
  File "pandas/hashtable.pyx", line 703, in pandas.hashtable.PyObjectHashTable.map_locations (pandas/hashtable.c:12850)
ValueError: Does not understand character buffer dtype format string ('w')

@jreback
Contributor

jreback commented Oct 10, 2015

looks like something else is going on
_shallow_copy should work; it is overridden for MultiIndex

@jreback
Contributor

jreback commented Oct 15, 2015

any progress?

@max-sixty
Contributor Author

@jreback not yet - will look at it this weekend. Thanks for the ping

@jreback
Contributor

jreback commented Nov 18, 2015

@MaximilianR if you'd like to update, that would be great

@max-sixty
Contributor Author

I had a go at debugging this. But I'm struggling, since the errors happen on the Cython side - I need to get up to speed on how to debug those.
If anyone has any guidance, I'm very open to ideas. Otherwise it'll be a few weeks at least, I think.

@jreback
Contributor

jreback commented Dec 6, 2015

@MaximilianR can you rebase / update

@max-sixty
Contributor Author

I still get this error below. I'm really not sure how to debug the pyx files - although keen to learn. Any guidance?

======================================================================
ERROR: test_stack_partial_multiIndex (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 14305, in test_stack_partial_multiIndex
    _test_stack_with_multiindex(full_multiindex[multiindex_columns])
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 14276, in _test_stack_with_multiindex
    result = df.stack(level=level, dropna=False)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 3803, in stack
    return stack(self, level, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 481, in stack
    return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 648, in _stack_multi_columns
    result = DataFrame(new_data, index=new_index, columns=new_columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 323, in _init_dict
    data = dict((k, v) for k, v in compat.iteritems(data)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 324, in <genexpr>
    if k in columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/index.py", line 1161, in __contains__
    return key in self._engine
  File "pandas/index.pyx", line 99, in pandas.index.IndexEngine.__contains__ (pandas/index.c:2749)
  File "pandas/index.pyx", line 261, in pandas.index.IndexEngine._ensure_mapping_populated (pandas/index.c:5304)
  File "pandas/index.pyx", line 267, in pandas.index.IndexEngine.initialize (pandas/index.c:5408)
  File "pandas/hashtable.pyx", line 703, in pandas.hashtable.PyObjectHashTable.map_locations (pandas/hashtable.c:12518)
ValueError: Does not understand character buffer dtype format string ('w')

----------------------------------------------------------------------

@jreback
Contributor

jreback commented Dec 9, 2015

go up the stack when debugging. Somehow new_columns is created with a dtype of S1, which is invalid and violates some guarantees there. So you have to trace where this is happening (prob the _shallow_copy may need a hint)

> /Users/jreback/pandas/pandas/core/reshape.py(648)_stack_multi_columns()
-> result = DataFrame(new_data, index=new_index, columns=new_columns)
(Pdb) p new_data
{'A': array([ nan,   2.,  nan,  nan,   5.,  nan,  nan,   8.,  nan]), 'B': array([  0.,  nan,   1.,   3.,  nan,   4.,   6.,  nan,   7.])}
(Pdb) p new_index
MultiIndex(levels=[[0, 1, 2], [u'u', u'x', u'y', u'z']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3, 1, 2, 3]],
           names=[None, u'Lower'])
(Pdb) p new_columns
Index([u'A', u'B'], dtype='|S1', name=u'Upper')
(Pdb) !new_columns = Index(new_columns.values,name=new_columns.name)
*** NameError: name 'Index' is not defined
(Pdb) from pandas import Index
(Pdb) !new_columns = Index(new_columns.values,name=new_columns.name)
(Pdb) p new_columns
Index([u'A', u'B'], dtype='object', name=u'Upper')
(Pdb) p DataFrame(new_data, index=new_index, columns=new_columns)
Upper     A   B
  Lower        
0 x     NaN   0
  y       2 NaN
  z     NaN   1
1 x     NaN   3
  y       5 NaN
  z     NaN   4
2 x     NaN   6
  y       8 NaN
  z     NaN   7
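
As background for the session above, `|S1` is NumPy's fixed-width byte-string dtype, while the rebuilt index holds Python objects. The following NumPy-only sketch shows that distinction; it does not reproduce the pandas failure, and the variable names are illustrative:

```python
import numpy as np

# Fixed-width byte strings: one byte per element, dtype '|S1'.
fixed_width = np.array(["A", "B"], dtype="S1")

# Converting to object dtype stores each element as a separate Python object.
as_objects = fixed_width.astype(object)

print(fixed_width.dtype)
print(as_objects.dtype)
```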

@max-sixty
Contributor Author

OK thanks, I'll try that angle

@jreback
Contributor

jreback commented Jan 6, 2016

@MaximilianR pls reopen if you would like to update

@jreback jreback closed this Jan 6, 2016
@max-sixty
Contributor Author

OK, I will aim to come back to this one at some point

@jreback
Contributor

jreback commented Jan 6, 2016

np. just trying to keep outstanding PRs to a minimum.

@max-sixty max-sixty deleted the index-setops-speed branch December 22, 2016 05:57
3 participants