PERF/ENH: add fast astyping for Categorical #37355
@@ -67,6 +67,24 @@ def time_existing_series(self):
        pd.Categorical(self.series)


class AsType:
    def setup(self):
        N = 10 ** 6

        self.df = pd.DataFrame(
            np.random.default_rng()
            .choice(np.array(list("abcde")), 4 * N)
            .reshape(N, 4),
            columns=list("ABCD"),
        )

        for col in self.df.columns:
            self.df[col] = self.df[col].astype("category")

    def astype_unicode(self):
Comment: Can you add benchmarks for other types of categories (int, dti, for example) and show the benchmark results?
Reply: OK. I posted the int benchmark in the main thread and will add and post more.
Comment: Right, please update these for int, float, string, and datetime.
        [self.df[col].astype("unicode") for col in self.df.columns]


class Concat:
    def setup(self):
        N = 10 ** 5
|
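The review asks for benchmarks covering int, float, string, and datetime categories. A minimal sketch of what that could look like follows. The class name, method names, and sizes are hypothetical and not taken from the PR. Each method uses the `time_` prefix, which is how asv recognises timing benchmarks.

```python
import numpy as np
import pandas as pd


class AsTypeByCategoryKind:
    # Hypothetical asv-style benchmark class; names and sizes are illustrative.
    def setup(self):
        N = 10 ** 5
        rng = np.random.default_rng(0)
        idx = rng.integers(0, 5, N)  # positions into each 5-element pool

        dates = pd.date_range("2020-01-01", periods=5)

        # One categorical Series per kind of category requested in the review.
        self.int_cat = pd.Series(idx).astype("category")
        self.float_cat = pd.Series(np.linspace(0.5, 4.5, 5)[idx]).astype("category")
        self.str_cat = pd.Series(np.array(list("abcde"))[idx]).astype("category")
        self.dt_cat = pd.Series(dates[idx]).astype("category")

    def time_astype_int(self):
        self.int_cat.astype("int64")

    def time_astype_float(self):
        self.float_cat.astype("float64")

    def time_astype_str(self):
        self.str_cat.astype(str)

    def time_astype_datetime(self):
        self.dt_cat.astype("datetime64[ns]")
```

Outside asv, the class can be exercised by hand by calling `setup()` once and then any `time_*` method.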
@@ -347,6 +347,7 @@ Performance improvements
- Small performance decrease to :meth:`Rolling.min` and :meth:`Rolling.max` for fixed windows (:issue:`36567`)
- Reduced peak memory usage in :meth:`DataFrame.to_pickle` when using ``protocol=5`` in python 3.8+ (:issue:`34244`)
- Performance improvement in :class:`ExpandingGroupby` (:issue:`37064`)
- Performance improvement in :meth:`DataFrame.astype` for :class:`Categorical` (:issue:`8628`)
Comment: It's actually for Series.astype, but you can mention both.
Reply: Done.

.. ---------------------------------------------------------------------------

@@ -46,6 +46,7 @@
    is_re,
    is_re_compilable,
    is_sparse,
    is_string_like_dtype,
    is_timedelta64_dtype,
    pandas_dtype,
)

@@ -596,6 +597,17 @@ def astype(self, dtype, copy: bool = False, errors: str = "raise"):

            return self.make_block(Categorical(self.values, dtype=dtype))

        elif (  # GH8628
            is_categorical_dtype(self.values.dtype)
            and not (is_object_dtype(dtype) or is_string_like_dtype(dtype))
Comment: Could define a new method for this in
            and copy is True
        ):
Comment: This seems really convoluted, in a method that is already too complicated as it is (xref #22369). Do you have a good idea where the perf improvement comes from? E.g. could we push this down into
Reply: Agreed that the amount of special-casing here is not good (and even with that, a bunch of tests are still failing). The perf improvement comes from astyping just the category labels instead of astyping each array entry separately.

            return self.make_block(
                Categorical.from_codes(
                    self.values.codes, categories=self.values.categories.astype(dtype)
                )
            )

        dtype = pandas_dtype(dtype)

        # astype processing
|
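The reply above attributes the speedup to converting only the category labels rather than every array entry. A small standalone sketch of that idea follows, using made-up values and only public pandas calls. It illustrates the technique and is not the code path pandas ended up with. It assumes the converted categories stay unique, since `Categorical` categories must be unique.

```python
import numpy as np
import pandas as pd

# Small integer categorical; the values are chosen for illustration only.
cat = pd.Categorical([1, 2, 1, 3, 2])

# Element-wise path: materialize every entry, convert, then re-encode.
slow = pd.Categorical(np.asarray(cat).astype("float64"))

# Label-only path: convert the three categories and reuse the integer codes.
fast = pd.Categorical.from_codes(
    cat.codes, categories=cat.categories.astype("float64")
)

# Both paths yield the same values with float64 categories.
assert list(fast) == list(slow)
assert fast.categories.dtype == np.dtype("float64")
```

The element-wise path does work proportional to the number of rows, while the label-only path converts one value per category, which is why the gap grows when a long column has few distinct categories.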