-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
fix series.isin slow issue with Dtype IntegerArray #38379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
109c0e7
e9f96ea
a6be9c8
415b590
f3e5afb
14579fc
562c918
1449d3c
3ccc917
98a0683
6e2917e
a4b6503
f95fde9
2348a60
c963183
c151102
ef99e86
f23ba94
13dc64f
cc38088
94846cc
f4cb5ce
2ee8b05
7763b18
87dfff9
d2d32d1
90ef57a
a48e00b
2279519
c726d4a
1134ad6
bce0e3e
bf788e5
40950be
570d640
0f89578
199c11c
9f35b5b
2238dc5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,17 +25,32 @@ def time_constructor(self, data): | |
|
||
class IsIn: | ||
|
||
params = ["int64", "uint64", "object"] | ||
params = ["int64", "uint64", "object", "Int64"] | ||
param_names = ["dtype"] | ||
|
||
def setup(self, dtype): | ||
self.s = Series(np.random.randint(1, 10, 100000)).astype(dtype) | ||
N = 10000 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You changed the exact value here (compared to the original), maybe by accident? |
||
self.s = Series(np.random.randint(1, 10, N)).astype(dtype) | ||
self.values = [1, 2] | ||
|
||
def time_isin(self, dtypes): | ||
self.s.isin(self.values) | ||
|
||
|
||
class IsInBoolean: | ||
|
||
params = ["boolean", "bool"] | ||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||
param_names = ["dtype"] | ||
|
||
def setup(self, dtype): | ||
N = 10000 | ||
self.s = Series(np.random.randint(0, 2, N)).astype(dtype) | ||
self.values = [True, False] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same as above factor out N There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||
|
||
def time_isin(self, dtypes): | ||
self.s.isin(self.values) | ||
|
||
|
||
class IsInDatetime64: | ||
def setup(self): | ||
dti = date_range( | ||
|
@@ -59,21 +74,27 @@ def time_isin_empty(self): | |
|
||
|
||
class IsInFloat64: | ||
def setup(self): | ||
self.small = Series([1, 2], dtype=np.float64) | ||
self.many_different_values = np.arange(10 ** 6, dtype=np.float64) | ||
self.few_different_values = np.zeros(10 ** 7, dtype=np.float64) | ||
self.only_nans_values = np.full(10 ** 7, np.nan, dtype=np.float64) | ||
|
||
def time_isin_many_different(self): | ||
params = [np.float64, "Float64"] | ||
param_names = ["dtype"] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. use 5 and 6 for the powers rather than 6 and 7 you can factor these to the class as N_many, N_few There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback is there a reason to reduce the number? (on my laptop, the current takes around 80ms, which is fine I think?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the originals take multiple seconds which is very wasteful There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The 80ms is what I get for the above code on current master. Where does the "multiple seconds" comes from? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. look at the benchmarks above There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you mean the comment at #38379 (comment), then the specific benchmark I am commenting on (
and for the others it is around 60ms after this PR, which seems like a fine number. It are other benchmarks (eg There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pls read my comments some of those benchmarks take 2 or more s these all need to be normalized There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Then can you point me to the comment I might have missed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. look at the posted benchmarks |
||
|
||
def setup(self, dtype): | ||
N_many = 10 ** 5 | ||
N_few = 10 ** 6 | ||
self.small = Series([1, 2], dtype=dtype) | ||
self.many_different_values = np.arange(N_many, dtype=np.float64) | ||
self.few_different_values = np.zeros(N_few, dtype=np.float64) | ||
self.only_nans_values = np.full(N_few, np.nan, dtype=np.float64) | ||
|
||
def time_isin_many_different(self, dtypes): | ||
# runtime is dominated by creation of the lookup-table | ||
self.small.isin(self.many_different_values) | ||
|
||
def time_isin_few_different(self): | ||
def time_isin_few_different(self, dtypes): | ||
# runtime is dominated by creation of the lookup-table | ||
self.small.isin(self.few_different_values) | ||
|
||
def time_isin_nan_values(self): | ||
def time_isin_nan_values(self, dtypes): | ||
# runtime is dominated by creation of the lookup-table | ||
self.small.isin(self.few_different_values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not related to your PR, but noticing this now: I suppose the above should use |
||
|
||
|
@@ -114,7 +135,7 @@ def time_isin_long_series_long_values_floats(self): | |
|
||
class IsInLongSeriesLookUpDominates: | ||
params = [ | ||
["int64", "int32", "float64", "float32", "object"], | ||
["int64", "int32", "float64", "float32", "object", "Int64", "Float64"], | ||
[5, 1000], | ||
["random_hits", "random_misses", "monotone_hits", "monotone_misses"], | ||
] | ||
|
@@ -141,7 +162,7 @@ def time_isin(self, dtypes, MaxNumber, series_type): | |
|
||
class IsInLongSeriesValuesDominate: | ||
params = [ | ||
["int64", "int32", "float64", "float32", "object"], | ||
["int64", "int32", "float64", "float32", "object", "Int64", "Float64"], | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is causing failures with numpy 1.20 Int64 and Float64 now raise TypeError https://github.com/pandas-dev/pandas/runs/1802727214 |
||
["random", "monotone"], | ||
] | ||
param_names = ["dtype", "series_type"] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make the size a parameter in the class, e.g. N and make it 10000 is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I make it similar to the
N
in theIsInLongSeriesLookUpDominates
class.