PERF: pandas 0.15.2 multi-indexed DataFrame sum #9049

Closed
xdliao opened this issue Dec 9, 2014 · 4 comments · Fixed by #9177

xdliao commented Dec 9, 2014

Problem:
For a multi-indexed table, data.sum(level=...) produces a different result (lots of NAs) than
data.groupby(level=...).sum() in certain cases. It is also much slower than groupby. It seems that
the new version produces a cross join of the keys and fills in NAs for key pairs with no data,
which makes the result bigger and the computation significantly slower.
This happens in the following example:

Code:

import pandas as pd
print "-------------- pandas version: ", pd.__version__
max_num_of_syms = 4000
list_of_df = []
for i,a in enumerate(pd.Series(range(100)).astype(str)):
    # Each 'A' has a different number of 'B' entries in order to reproduce the problem
    num_of_syms = int(i*max_num_of_syms/100.0)# if i<3 else max_num_of_syms
    #print num_of_syms
    d = pd.DataFrame({'A': [a]*num_of_syms , 'B': pd.Series(range(num_of_syms)).astype(str), 'C':1})
    list_of_df.append(d)
data = pd.concat(list_of_df).set_index(['A','B'])


%time a= data.sum(level=['A','B'])
print a.shape
#This is a lot faster
%time a= data.reset_index().groupby(['A','B']).sum()
print a.shape

-------------- pandas version: 0.15.1.dev
CPU times: user 876 ms, sys: 17 ms, total: 893 ms
Wall time: 894 ms
(392040, 1)
CPU times: user 109 ms, sys: 0 ns, total: 109 ms
Wall time: 108 ms
(198000, 1)

-------------- pandas version: 0.14.1
CPU times: user 94 ms, sys: 0 ns, total: 94 ms
Wall time: 94.2 ms
(198000, 1)
CPU times: user 120 ms, sys: 0 ns, total: 120 ms
Wall time: 120 ms
(198000, 1)
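
The two shapes reported above are consistent with a cross join of the keys. As a rough sketch, the snippet below first redoes the arithmetic for the sizes in the example, then shows on scaled-down data how expanding to every (A, B) pair introduces NAs. The small dataset and the explicit reindex to MultiIndex.from_product are illustrative assumptions, not the code path pandas takes internally.

```python
import pandas as pd

# Arithmetic for the sizes used in the issue: 'A' = str(i) gets 40 * i rows.
sizes = [int(i * 4000 / 100.0) for i in range(100)]
observed_pairs = sum(sizes)                   # (A, B) pairs that actually have data
nonempty_a = sum(1 for s in sizes if s > 0)   # 'A' = '0' contributes no rows
cross_join_pairs = nonempty_a * max(sizes)    # every 'A' paired with every 'B'
print(observed_pairs, cross_join_pairs)

# Scaled-down data (assumption): 'A' = str(i) has 2 * i distinct 'B' keys.
frames = []
for i in range(1, 5):
    n = 2 * i
    frames.append(pd.DataFrame({"A": [str(i)] * n,
                                "B": [str(b) for b in range(n)],
                                "C": 1}))
data = pd.concat(frames).set_index(["A", "B"])

# Only the key pairs that occur in the data.
grouped = data.groupby(level=["A", "B"]).sum()

# Illustrative expansion: reindex to the full product of the observed keys.
full_index = pd.MultiIndex.from_product(
    [grouped.index.get_level_values("A").unique(),
     grouped.index.get_level_values("B").unique()],
    names=["A", "B"],
)
expanded = grouped.reindex(full_index)
print(grouped.shape, expanded.shape, int(expanded["C"].isna().sum()))
```

The first print gives 198000 and 392040, matching the groupby shape and the sum(level=...) shape in the timings. On the small data the grouped result has 20 rows, the expanded one has 32, and 12 of those are NA.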

@jreback jreback changed the title BUG: pandas 0.15.2 multi-indexed DataFrame sum PERF: pandas 0.15.2 multi-indexed DataFrame sum Dec 9, 2014
@jreback jreback added Performance Memory or execution speed performance Numeric Operations Arithmetic, Comparison, and Logical operations labels Dec 9, 2014
@jreback jreback added this to the 0.16.0 milestone Dec 9, 2014

jreback commented Dec 9, 2014

So, these are 'exactly' the same, except for a final step that happens in the .sum(level=...) case.

There is a reindex that expands the multi-output groupby space (e.g. A and B) in this case. This was necessary for some categorical groupings. I don't think it should be needed here (the reindex ends up doing nothing, but constructing the MultiIndex actually takes some time, as it's independently constructed).

So this is a bug/perf issue.

You are welcome to have a look; see core/groupby/DataFrameGroupBy/_reindex_output.
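
For context on the categorical case mentioned above: in later pandas releases this kind of group expansion is controlled by the `observed` argument of `groupby`, which did not exist in 0.15. A minimal sketch, assuming a recent pandas and made-up data, of when expansion is wanted and when it is not:

```python
import pandas as pd

# Made-up data: the combination ('y', 'q') never occurs.
df = pd.DataFrame({
    "A": pd.Categorical(["x", "x", "y"], categories=["x", "y"]),
    "B": pd.Categorical(["p", "q", "p"], categories=["p", "q"]),
    "C": [1, 2, 3],
})

# observed=False: the result covers every combination of categories.
expanded = df.groupby(["A", "B"], observed=False).sum()

# observed=True: only combinations present in the data are returned.
compact = df.groupby(["A", "B"], observed=True).sum()

print(len(expanded), len(compact))
```

Here the expanded result has 4 rows and the compact one has 3. Plain string keys, as in the example in this issue, have no declared categories, so there is nothing for an expansion step to add.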

@jreback jreback self-assigned this Dec 9, 2014

xdliao commented Dec 10, 2014

I did confirm that returning the result immediately in _reindex_output solves the problem for this case (as a hack). I printed out the value of "[ping._was_factor for ping in groupings]".
It seems that in the sum(level=) call, "ping._was_factor" was True for all groupings, while
in the regular d.groupby().sum() call, _reindex_output() is also called but "ping._was_factor" was False.
Why are the same groupings treated differently?


jreback commented Dec 10, 2014

Yeah, I think something is wrong. This should only be necessary for a Categorical type of grouping (where you want group expansion). Not 100% sure why it hits this path.

I would put a halt at the end of _reindex_output (e.g. after all the ifs) and see where it hits in the test suite. Then, for the case above, see what's the same and what's different.

ledmonster commented

I ran into the same issue and made a pull request. Please check it out.
