Value counts normalize #33652
Changes from 7 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,6 +22,8 @@ Fixed regressions

Bug fixes
~~~~~~~~~
Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Fixed DataFrame groupby value_counts with bins (:issue:`32471`)

Review comment: Can you revert unrelated changes? Looks like blank space and file permissions were changed here
Review comment: @DataInformer can you address this
Review comment: can you do this.
Reply: Sorry, I thought I did this before, but apparently it reverted. Should be undone again now.

Contributors
~~~~~~~~~~~~
@@ -663,12 +663,16 @@ def value_counts( | |
ascending : bool, default False | ||
Sort in ascending order | ||
normalize: bool, default False | ||
If True then compute a relative histogram | ||
bins : integer, optional | ||
Rather than count values, group them into half-open bins, | ||
convenience for pd.cut, only works with numeric data | ||
If True, then compute a relative histogram that outputs the | ||
proportion of each value. | ||
bins : integer or iterable of numeric, optional | ||
Rather than count values, group them into half-open bins. | ||
Only works with numeric data. | ||
If int, interpreted as number of bins and will use pd.cut. | ||
Review comment: hmm this will use pd.cut either way, so pls amend the doc to say that. needs a versionchanged tag 1.2
||
If iterable of numeric, will use provided numbers as bin endpoints. | ||
dropna : bool, default True | ||
Don't include counts of NaN | ||
Don't include counts of NaN. | ||
If False and NaNs are present, NaN will be a key in the output. | ||
|
||
Returns | ||
------- | ||
|
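The `dropna` wording above is the behavior this PR fixes for GH25970: with `dropna=False`, NaN becomes a key in the output and `normalize=True` computes proportions over all values. A minimal sketch of the post-fix behavior, assuming a pandas release that includes the fix:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, np.nan])

# NaN is kept as a key, and proportions are taken over all 4 values,
# so the result sums to 1 (before the fix it did not)
out = s.value_counts(dropna=False, normalize=True)
total = out.sum()
```

The three keys here are 2.0, 1.0, and NaN, with proportions 0.5, 0.25, and 0.25.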
@@ -689,16 +693,15 @@ def value_counts( | |
|
||
# count, remove nulls (from the index), and sort the bins | ||
result = ii.value_counts(dropna=dropna) | ||
result = result[result.index.notna()] | ||
result.index = result.index.astype("interval") | ||
result = result.sort_index() | ||
|
||
# if we are dropna and we have NO values | ||
if dropna and (result._values == 0).all(): | ||
result = result.iloc[0:0] | ||
|
||
# normalizing is by len of all (regardless of dropna) | ||
counts = np.array([len(ii)]) | ||
# normalizing is by len of what gets included in the bins | ||
counts = result._values | ||
|
||
else: | ||
|
||
|
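The hunk above changes the normalization denominator from `len(ii)` (every value, including rows that became NA under the interval index) to the binned counts themselves. The difference can be sketched in plain NumPy, using the counts from the test case later in this PR (8 of 10 values land in bins, 2 are NaN; the array values are illustrative):

```python
import numpy as np

counts = np.array([3, 3, 2])   # per-bin counts after NA rows are dropped
n_total = 10                   # len(ii): includes the 2 NaN rows

old = counts / n_total         # pre-fix: proportions sum to 0.8, not 1
new = counts / counts.sum()    # post-fix with dropna=True: sums to 1
```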
@@ -1176,17 +1176,20 @@ def value_counts( | |
Parameters | ||
---------- | ||
normalize : bool, default False | ||
If True then the object returned will contain the relative | ||
frequencies of the unique values. | ||
If True, outputs the relative frequencies of the unique values. | ||
sort : bool, default True | ||
Sort by frequencies. | ||
ascending : bool, default False | ||
Sort in ascending order. | ||
bins : int, optional | ||
Rather than count values, group them into half-open bins, | ||
a convenience for ``pd.cut``, only works with numeric data. | ||
bins : integer or iterable of numeric, optional | ||
Rather than count individual values, group them into half-open bins. | ||
Only works with numeric data. | ||
If int, interpreted as number of bins and will use ``pd.cut``. | ||
If iterable of numeric, will use provided numbers as bin endpoints. | ||
|
||
dropna : bool, default True | ||
Don't include counts of NaN. | ||
If False and NaNs are present, NaN will be a key in the output. | ||
|
||
Returns | ||
------- | ||
|
@@ -1223,15 +1226,26 @@ def value_counts( | |
|
||
Bins can be useful for going from a continuous variable to a | ||
categorical variable; instead of counting unique | ||
apparitions of values, divide the index in the specified | ||
number of half-open bins. | ||
instances of values, count the number of values that fall | ||
into half-open intervals. | ||
|
||
Bins can be an int. | ||
|
||
>>> s.value_counts(bins=3) | ||
(2.0, 3.0] 2 | ||
(0.996, 2.0] 2 | ||
(3.0, 4.0] 1 | ||
dtype: int64 | ||
|
||
Bins can also be an iterable of numbers. These numbers are treated | ||
as endpoints for the intervals. | ||
|
||
>>> s.value_counts(bins=[0,2,4,9]) | ||
Review comment: use spaces between the bins
||
(2.0, 4.0] 3 | ||
Review comment: is there a space missing here?
||
(-0.001, 2.0] 2 | ||
(4.0, 9.0] 0 | ||
dtype: int64 | ||
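Since `value_counts` with explicit bin endpoints delegates to `pd.cut`, the iterable-of-endpoints example above can be reproduced directly. A sketch, assuming a series consistent with the docstring output (the exact example series is not shown in this excerpt):

```python
import pandas as pd

# Assumed series matching the counts shown above: 2 + 3 + 0 = 5 values
s = pd.Series([3, 1, 2, 3, 4])

# value_counts(bins=...) uses pd.cut under the hood; include_lowest=True
# widens the first edge slightly, which is why (-0.001, 2.0] appears
binned = pd.cut(s, bins=[0, 2, 4, 9], include_lowest=True)
counts = binned.value_counts()
```

Empty bins such as (4.0, 9.0] still appear with a count of 0 because every category of the cut result is reported.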
|
||
**dropna** | ||
|
||
With `dropna` set to `False` we can also see NaN index values. | ||
|
@@ -1244,6 +1258,7 @@ def value_counts( | |
1.0 1 | ||
dtype: int64 | ||
""" | ||
|
||
result = value_counts( | ||
self, | ||
sort=sort, | ||
|
@@ -7,7 +7,6 @@ | |
""" | ||
from collections import abc, namedtuple | ||
import copy | ||
from functools import partial | ||
from textwrap import dedent | ||
import typing | ||
from typing import ( | ||
|
@@ -41,11 +40,8 @@ | |
maybe_downcast_to_dtype, | ||
) | ||
from pandas.core.dtypes.common import ( | ||
ensure_int64, | ||
ensure_platform_int, | ||
is_bool, | ||
is_integer_dtype, | ||
is_interval_dtype, | ||
is_numeric_dtype, | ||
is_object_dtype, | ||
is_scalar, | ||
|
@@ -671,128 +667,14 @@ def describe(self, **kwargs): | |
def value_counts( | ||
self, normalize=False, sort=True, ascending=False, bins=None, dropna=True | ||
): | ||
|
||
from pandas.core.reshape.tile import cut | ||
from pandas.core.reshape.merge import _get_join_indexers | ||
|
||
if bins is not None and not np.iterable(bins): | ||
# scalar bins cannot be done at top level | ||
# in a backward compatible way | ||
return self.apply( | ||
Series.value_counts, | ||
normalize=normalize, | ||
sort=sort, | ||
ascending=ascending, | ||
bins=bins, | ||
) | ||
|
||
ids, _, _ = self.grouper.group_info | ||
val = self.obj._values | ||
|
||
# groupby removes null keys from groupings | ||
mask = ids != -1 | ||
ids, val = ids[mask], val[mask] | ||
|
||
if bins is None: | ||
lab, lev = algorithms.factorize(val, sort=True) | ||
llab = lambda lab, inc: lab[inc] | ||
else: | ||
|
||
# lab is a Categorical with categories an IntervalIndex | ||
lab = cut(Series(val), bins, include_lowest=True) | ||
lev = lab.cat.categories | ||
lab = lev.take(lab.cat.codes) | ||
llab = lambda lab, inc: lab[inc]._multiindex.codes[-1] | ||
|
||
if is_interval_dtype(lab): | ||
# TODO: should we do this inside II? | ||
sorter = np.lexsort((lab.left, lab.right, ids)) | ||
else: | ||
sorter = np.lexsort((lab, ids)) | ||
|
||
ids, lab = ids[sorter], lab[sorter] | ||
|
||
# group boundaries are where group ids change | ||
idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]] | ||
|
||
# new values are where sorted labels change | ||
lchanges = llab(lab, slice(1, None)) != llab(lab, slice(None, -1)) | ||
inc = np.r_[True, lchanges] | ||
inc[idx] = True # group boundaries are also new values | ||
out = np.diff(np.nonzero(np.r_[inc, True])[0]) # value counts | ||
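The block being deleted here counts values with a single `np.lexsort`: sort rows by group id and then by label, mark where ids or labels change, and take differences of the change positions to get run lengths. A small NumPy walkthrough of the same trick, with made-up ids and labels:

```python
import numpy as np

ids = np.array([0, 1, 0, 1, 0])   # group id per row (hypothetical)
lab = np.array([2, 0, 1, 0, 2])   # factorized value label per row

sorter = np.lexsort((lab, ids))   # primary key ids, secondary key lab
ids_s, lab_s = ids[sorter], lab[sorter]

# group boundaries are where the sorted ids change
idx = np.r_[0, 1 + np.nonzero(ids_s[1:] != ids_s[:-1])[0]]

# new values start where sorted labels change; group boundaries also
# start new runs
inc = np.r_[True, lab_s[1:] != lab_s[:-1]]
inc[idx] = True

# run lengths between change points are the per-(group, value) counts:
# group 0 has one 1-label and two 2-labels; group 1 has two 0-labels
out = np.diff(np.nonzero(np.r_[inc, True])[0])
```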
|
||
# num. of times each group should be repeated | ||
rep = partial(np.repeat, repeats=np.add.reduceat(inc, idx)) | ||
|
||
# multi-index components | ||
codes = self.grouper.reconstructed_codes | ||
codes = [rep(level_codes) for level_codes in codes] + [llab(lab, inc)] | ||
levels = [ping.group_index for ping in self.grouper.groupings] + [lev] | ||
names = self.grouper.names + [self._selection_name] | ||
|
||
if dropna: | ||
mask = codes[-1] != -1 | ||
if mask.all(): | ||
dropna = False | ||
else: | ||
out, codes = out[mask], [level_codes[mask] for level_codes in codes] | ||
|
||
if normalize: | ||
out = out.astype("float") | ||
d = np.diff(np.r_[idx, len(ids)]) | ||
if dropna: | ||
m = ids[lab == -1] | ||
np.add.at(d, m, -1) | ||
acc = rep(d)[mask] | ||
else: | ||
acc = rep(d) | ||
out /= acc | ||
|
||
if sort and bins is None: | ||
cat = ids[inc][mask] if dropna else ids[inc] | ||
sorter = np.lexsort((out if ascending else -out, cat)) | ||
out, codes[-1] = out[sorter], codes[-1][sorter] | ||
|
||
if bins is None: | ||
mi = MultiIndex( | ||
levels=levels, codes=codes, names=names, verify_integrity=False | ||
) | ||
|
||
if is_integer_dtype(out): | ||
out = ensure_int64(out) | ||
return Series(out, index=mi, name=self._selection_name) | ||
|
||
# for compat. with libgroupby.value_counts need to ensure every | ||
# bin is present at every index level, null filled with zeros | ||
diff = np.zeros(len(out), dtype="bool") | ||
for level_codes in codes[:-1]: | ||
diff |= np.r_[True, level_codes[1:] != level_codes[:-1]] | ||
|
||
ncat, nbin = diff.sum(), len(levels[-1]) | ||
|
||
left = [np.repeat(np.arange(ncat), nbin), np.tile(np.arange(nbin), ncat)] | ||
|
||
right = [diff.cumsum() - 1, codes[-1]] | ||
|
||
_, idx = _get_join_indexers(left, right, sort=False, how="left") | ||
out = np.where(idx != -1, out[idx], 0) | ||
|
||
if sort: | ||
sorter = np.lexsort((out if ascending else -out, left[0])) | ||
out, left[-1] = out[sorter], left[-1][sorter] | ||
|
||
# build the multi-index w/ full levels | ||
def build_codes(lev_codes: np.ndarray) -> np.ndarray: | ||
return np.repeat(lev_codes[diff], nbin) | ||
|
||
codes = [build_codes(lev_codes) for lev_codes in codes[:-1]] | ||
codes.append(left[-1]) | ||
|
||
mi = MultiIndex(levels=levels, codes=codes, names=names, verify_integrity=False) | ||
|
||
if is_integer_dtype(out): | ||
out = ensure_int64(out) | ||
return Series(out, index=mi, name=self._selection_name) | ||
return self.apply( | ||
Review comment: This is most likely significantly slower than the existing implementation - can you run the appropriate groupby benchmarks to check?
||
Series.value_counts, | ||
normalize=normalize, | ||
sort=sort, | ||
ascending=ascending, | ||
bins=bins, | ||
dropna=dropna, | ||
) | ||
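The replacement shown here routes `SeriesGroupBy.value_counts` through `apply` with `Series.value_counts`, which is what the reviewer flags as a likely performance regression. A sketch of what the fallback computes, on toy data not taken from the PR:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "a", "b"], "v": [1, 1, 2, 2]})

# Equivalent to the fallback path: run Series.value_counts per group and
# stitch the results into a (group, value) MultiIndex
out = df.groupby("g")["v"].apply(pd.Series.value_counts, normalize=True)
```

Each group's proportions sum to 1, so the full result sums to the number of groups.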
|
||
def count(self) -> Series: | ||
""" | ||
|
@@ -190,6 +190,20 @@ def test_value_counts_bins(index_or_series): | |
|
||
assert s.nunique() == 0 | ||
|
||
# handle normalizing bins with NA's properly | ||
Review comment: make a new test
Reply: Just to make sure I understand: are you saying this is a badly written test, so I should make a different one? Or are you saying add an additional test beyond this one?
Review comment: test_value_counts_bins is already doing too much. make this a separate test.
Reply: done
||
# see GH25970 | ||
s2 = Series([1, 2, 2, 3, 3, 3, np.nan, np.nan, 4, 5]) | ||
intervals = IntervalIndex.from_breaks([0.995, 2.333, 3.667, 5.0]) | ||
expected_dropna = Series([0.375, 0.375, 0.25], intervals.take([1, 0, 2])) | ||
expected_keepna_vals = np.array([0.3, 0.3, 0.2, 0.2]) | ||
tm.assert_series_equal( | ||
s2.value_counts(dropna=True, normalize=True, bins=3), expected_dropna | ||
) | ||
tm.assert_numpy_array_equal( | ||
Review comment: you don't need to do this; already done in assert_series_equal
||
s2.value_counts(dropna=False, normalize=True, bins=3).values, | ||
expected_keepna_vals, | ||
) | ||
|
||
|
||
def test_value_counts_datetime64(index_or_series): | ||
klass = index_or_series | ||
|
Review comment: move this to 1.1
Reply: done