Value counts normalize #33652

Closed
Commits (38; the diff below shows changes from 7 of them)
All commits are by DataInformer.

bd9011a (Apr 18, 2020): made nan count when dropna=False
d9d5ec1 (Apr 18, 2020): updated changelog
86fe7f9 (Apr 18, 2020): trivial
c34a863 (Apr 18, 2020): added specific test for groupby valuecount interval fix
9c1c269 (Apr 18, 2020): merge master
5f8eb1d (Apr 19, 2020): updated value_count docstrings
1276166 (Apr 19, 2020): fixed pep8 style
a1b7197 (Apr 19, 2020): fixed more minor style
9c3ede3 (Apr 20, 2020): added test for na in bins
0cff92b (Apr 20, 2020): added release notes to 1.1
27aa460 (Apr 25, 2020): trying to avoid docstring warning
27c9856 (Apr 25, 2020): trying to avoid docstring warning
f5e9aeb (Jun 27, 2020): include nan count when dropna=False
99b7112 (Jun 27, 2020): listed bugfix
75374b2 (Jun 27, 2020): avoided tests that highlight groupby.value_count bug
25b6c14 (Jul 4, 2020): Revert "avoided tests that highlight groupby.value_count bug"
277ce52 (Jul 4, 2020): use series value_counts for groupby
73ef54b (Jul 4, 2020): Merge branch 'master' of https://github.com/pandas-dev/pandas
6b97e0b (Jul 4, 2020): Merge branch 'master' into value_counts_part1
797f668 (Jul 4, 2020): added groupby bin test
fce6998 (Jul 16, 2020): passing groupy valcount tests
637a609 (Jul 26, 2020): nan doesnt work for times
c9a4383 (Jul 27, 2020): passing all value count tests
7ae1280 (Jul 27, 2020): Merge remote-tracking branch 'upstream/master' into value_counts_part1
d2399ea (Jul 27, 2020): speedups 1
3299a36 (Jul 27, 2020): Merge branch 'value_counts_part1' into value_counts_normalize
ec92f15 (Aug 2, 2020): speedup?
d6179b0 (Aug 2, 2020): Revert "speedup?"
5abfb16 (Aug 10, 2020): fixed comments
83ccfd2 (Aug 30, 2020): Merge remote-tracking branch 'upstream/master' into value_counts_norm…
5f33834 (Sep 12, 2020): removed unneeded import
8562f1b (Sep 12, 2020): Merge remote-tracking branch 'upstream/master' into value_counts_norm…
f685cb2 (Sep 12, 2020): updated to use na_sentinal param
c21bdbb (Sep 12, 2020): fixed bad test
e4c2552 (Sep 13, 2020): moved doc, reverted permissions
74b13d8 (Sep 13, 2020): more doc and permission fix
f0e630a (Sep 14, 2020): fixed docstrings
9763e83 (Sep 14, 2020): file perm
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.0.3.rst
100644 → 100755
@@ -22,6 +22,8 @@ Fixed regressions

Bug fixes
~~~~~~~~~
- Fixed ``Series.value_counts`` so that normalize excludes NA values when ``dropna=False`` (:issue:`25970`)
- Fixed ``DataFrame.groupby().value_counts()`` with bins (:issue:`32471`)
Review comment (Member): move this to 1.1

Reply (Author): done

Review comment (Member): Can you revert unrelated changes? Looks like blank space and file permissions were changed here.

Review comment (Member): @DataInformer can you address this?

Review comment (Member): can you do this.

Reply (Author): Sorry, I thought I did this before, but apparently it reverted. Should be undone again now.

Contributors
~~~~~~~~~~~~
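For context, a minimal sketch of the behavior these two entries target (a hypothetical session, assuming a pandas build that includes this PR; see the tests further down for the exact expected numbers):

    import numpy as np
    import pandas as pd

    s = pd.Series([1, 2, 2, 3, 3, 3, np.nan])
    # GH25970: proportions now sum to 1 over the returned keys; with
    # dropna=False, NaN is one of those keys.
    s.value_counts(normalize=True, dropna=False)

    # GH32471: per-group bin counts with explicit bin edges.
    df = pd.DataFrame({"key": [0, 0, 1], "score": [5, 95, 50]})
    df.groupby("key")["score"].value_counts(bins=[0, 20, 80, 100])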
19 changes: 11 additions & 8 deletions pandas/core/algorithms.py
100644 → 100755
@@ -663,12 +663,16 @@ def value_counts(
ascending : bool, default False
Sort in ascending order
normalize : bool, default False
If True then compute a relative histogram
bins : integer, optional
Rather than count values, group them into half-open bins,
convenience for pd.cut, only works with numeric data
If True, then compute a relative histogram that outputs the
proportion of each value.
bins : integer or iterable of numeric, optional
Rather than count values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as number of bins and will use pd.cut.
Review comment (Contributor): hmm this will use pd.cut either way, so pls amend the doc to say that. Needs a versionchanged tag 1.2.
If iterable of numeric, will use provided numbers as bin endpoints.
dropna : bool, default True
Don't include counts of NaN
Don't include counts of NaN.
If False and NaNs are present, NaN will be a key in the output.

Returns
-------
@@ -689,16 +693,15 @@ def value_counts(

# count, remove nulls (from the index), and sort the bins
result = ii.value_counts(dropna=dropna)
result = result[result.index.notna()]
result.index = result.index.astype("interval")
result = result.sort_index()

# if we are dropna and we have NO values
if dropna and (result._values == 0).all():
result = result.iloc[0:0]

# normalizing is by len of all (regardless of dropna)
counts = np.array([len(ii)])
# normalizing is by len of what gets included in the bins
counts = result._values

else:

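The upshot of the change above: with bins, normalization now divides by the number of values that actually land in the output (the binned values, plus the NaN bucket when dropna=False) rather than by len(ii). A rough standalone sketch of that logic, using a hypothetical helper name rather than the actual pandas internals:

    import numpy as np
    import pandas as pd

    def value_counts_bins(values, bins, normalize=False, dropna=True):
        # bin the values into half-open intervals, as pd.cut does
        ii = pd.cut(pd.Series(values), bins, include_lowest=True)
        result = ii.value_counts(dropna=dropna).sort_index()
        if normalize:
            # denominator is whatever the output includes, so the
            # proportions always sum to 1
            result = result / result.to_numpy().sum()
        return result

    value_counts_bins([1, 2, 2, 3, 3, 3, np.nan, np.nan, 4, 5],
                      bins=3, normalize=True, dropna=False)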
29 changes: 22 additions & 7 deletions pandas/core/base.py
@@ -1176,17 +1176,20 @@ def value_counts(
Parameters
----------
normalize : bool, default False
If True then the object returned will contain the relative
frequencies of the unique values.
If True, outputs the relative frequencies of the unique values.
sort : bool, default True
Sort by frequencies.
ascending : bool, default False
Sort in ascending order.
bins : int, optional
Rather than count values, group them into half-open bins,
a convenience for ``pd.cut``, only works with numeric data.
bins : integer or iterable of numeric, optional
Rather than count individual values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as number of bins and will use ``pd.cut``.
If iterable of numeric, will use provided numbers as bin endpoints.

dropna : bool, default True
Don't include counts of NaN.
If False and NaNs are present, NaN will be a key in the output.

Returns
-------
@@ -1223,15 +1226,26 @@ def value_counts(

Bins can be useful for going from a continuous variable to a
categorical variable; instead of counting unique
apparitions of values, divide the index in the specified
number of half-open bins.
instances of values, count the number of values that fall
into half-open intervals.

Bins can be an int.

>>> s.value_counts(bins=3)
(2.0, 3.0] 2
(0.996, 2.0] 2
(3.0, 4.0] 1
dtype: int64

Bins can also be an iterable of numbers. These numbers are treated
as endpoints for the intervals.

>>> s.value_counts(bins=[0,2,4,9])
Review comment (Contributor): use spaces between the bins
(2.0, 4.0] 3
Review comment (Member): is there a space missing here?
(-0.001, 2.0] 2
(4.0, 9.0] 0
dtype: int64

**dropna**

With `dropna` set to `False` we can also see NaN index values.
@@ -1244,6 +1258,7 @@
1.0 1
dtype: int64
"""

result = value_counts(
self,
sort=sort,
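The **dropna** example above is truncated by the diff view; assuming the docstring's series is s = pd.Series([3, 1, 2, 3, 4, np.nan]) as in current pandas, the full behavior it illustrates is roughly this (tie order among the count-1 rows may vary):

    import numpy as np
    import pandas as pd

    s = pd.Series([3, 1, 2, 3, 4, np.nan])
    s.value_counts(dropna=False)
    # 3.0 occurs twice; 1.0, 2.0, 4.0 and NaN occur once each, and NaN
    # shows up as an index key only because dropna=False.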
134 changes: 8 additions & 126 deletions pandas/core/groupby/generic.py
100644 → 100755
@@ -7,7 +7,6 @@
"""
from collections import abc, namedtuple
import copy
from functools import partial
from textwrap import dedent
import typing
from typing import (
@@ -41,11 +40,8 @@
maybe_downcast_to_dtype,
)
from pandas.core.dtypes.common import (
ensure_int64,
ensure_platform_int,
is_bool,
is_integer_dtype,
is_interval_dtype,
is_numeric_dtype,
is_object_dtype,
is_scalar,
@@ -671,128 +667,14 @@ def describe(self, **kwargs):
def value_counts(
self, normalize=False, sort=True, ascending=False, bins=None, dropna=True
):

from pandas.core.reshape.tile import cut
from pandas.core.reshape.merge import _get_join_indexers

if bins is not None and not np.iterable(bins):
# scalar bins cannot be done at top level
# in a backward compatible way
return self.apply(
Series.value_counts,
normalize=normalize,
sort=sort,
ascending=ascending,
bins=bins,
)

ids, _, _ = self.grouper.group_info
val = self.obj._values

# groupby removes null keys from groupings
mask = ids != -1
ids, val = ids[mask], val[mask]

if bins is None:
lab, lev = algorithms.factorize(val, sort=True)
llab = lambda lab, inc: lab[inc]
else:

# lab is a Categorical with categories an IntervalIndex
lab = cut(Series(val), bins, include_lowest=True)
lev = lab.cat.categories
lab = lev.take(lab.cat.codes)
llab = lambda lab, inc: lab[inc]._multiindex.codes[-1]

if is_interval_dtype(lab):
# TODO: should we do this inside II?
sorter = np.lexsort((lab.left, lab.right, ids))
else:
sorter = np.lexsort((lab, ids))

ids, lab = ids[sorter], lab[sorter]

# group boundaries are where group ids change
idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]

# new values are where sorted labels change
lchanges = llab(lab, slice(1, None)) != llab(lab, slice(None, -1))
inc = np.r_[True, lchanges]
inc[idx] = True # group boundaries are also new values
out = np.diff(np.nonzero(np.r_[inc, True])[0]) # value counts

# num. of times each group should be repeated
rep = partial(np.repeat, repeats=np.add.reduceat(inc, idx))

# multi-index components
codes = self.grouper.reconstructed_codes
codes = [rep(level_codes) for level_codes in codes] + [llab(lab, inc)]
levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
names = self.grouper.names + [self._selection_name]

if dropna:
mask = codes[-1] != -1
if mask.all():
dropna = False
else:
out, codes = out[mask], [level_codes[mask] for level_codes in codes]

if normalize:
out = out.astype("float")
d = np.diff(np.r_[idx, len(ids)])
if dropna:
m = ids[lab == -1]
np.add.at(d, m, -1)
acc = rep(d)[mask]
else:
acc = rep(d)
out /= acc

if sort and bins is None:
cat = ids[inc][mask] if dropna else ids[inc]
sorter = np.lexsort((out if ascending else -out, cat))
out, codes[-1] = out[sorter], codes[-1][sorter]

if bins is None:
mi = MultiIndex(
levels=levels, codes=codes, names=names, verify_integrity=False
)

if is_integer_dtype(out):
out = ensure_int64(out)
return Series(out, index=mi, name=self._selection_name)

# for compat. with libgroupby.value_counts need to ensure every
# bin is present at every index level, null filled with zeros
diff = np.zeros(len(out), dtype="bool")
for level_codes in codes[:-1]:
diff |= np.r_[True, level_codes[1:] != level_codes[:-1]]

ncat, nbin = diff.sum(), len(levels[-1])

left = [np.repeat(np.arange(ncat), nbin), np.tile(np.arange(nbin), ncat)]

right = [diff.cumsum() - 1, codes[-1]]

_, idx = _get_join_indexers(left, right, sort=False, how="left")
out = np.where(idx != -1, out[idx], 0)

if sort:
sorter = np.lexsort((out if ascending else -out, left[0]))
out, left[-1] = out[sorter], left[-1][sorter]

# build the multi-index w/ full levels
def build_codes(lev_codes: np.ndarray) -> np.ndarray:
return np.repeat(lev_codes[diff], nbin)

codes = [build_codes(lev_codes) for lev_codes in codes[:-1]]
codes.append(left[-1])

mi = MultiIndex(levels=levels, codes=codes, names=names, verify_integrity=False)

if is_integer_dtype(out):
out = ensure_int64(out)
return Series(out, index=mi, name=self._selection_name)
return self.apply(
Series.value_counts,
normalize=normalize,
sort=sort,
ascending=ascending,
bins=bins,
dropna=dropna,
)

Review comment (Member): This is most likely significantly slower than the existing implementation - can you run the appropriate groupby benchmarks to check?
https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

def count(self) -> Series:
"""
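The net effect of this rewrite is that SeriesGroupBy.value_counts simply defers to Series.value_counts once per group. Conceptually it behaves like the sketch below (using the DataFrame from the new test further down; both calls assume a build with this PR):

    import pandas as pd

    df = pd.DataFrame(
        [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100]],
        columns=["key", "score"],
    )
    # the rewritten method is essentially the second line
    direct = df.groupby("key")["score"].value_counts(bins=[0, 20, 80, 100])
    via_apply = df.groupby("key")["score"].apply(
        pd.Series.value_counts, bins=[0, 20, 80, 100]
    )

The reviewer's concern stands: apply runs Series.value_counts in Python once per group, so the asv groupby benchmarks are the right way to compare it against the previous vectorized path.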
14 changes: 14 additions & 0 deletions pandas/tests/base/test_value_counts.py
100644 → 100755
@@ -190,6 +190,20 @@ def test_value_counts_bins(index_or_series):

assert s.nunique() == 0

# handle normalizing bins with NA's properly
Review comment (Contributor): make a new test

Reply (Author): Just to make sure I understand: are you saying this is a badly written test, so I should make a different one? Or are you saying add an additional test beyond this one?

Reply (Member): test_value_counts_bins is already doing too much. make this a separate test.

Reply (Author): done

# see GH25970
s2 = Series([1, 2, 2, 3, 3, 3, np.nan, np.nan, 4, 5])
intervals = IntervalIndex.from_breaks([0.995, 2.333, 3.667, 5.0])
expected_dropna = Series([0.375, 0.375, 0.25], intervals.take([1, 0, 2]))
expected_keepna_vals = np.array([0.3, 0.3, 0.2, 0.2])
tm.assert_series_equal(
s2.value_counts(dropna=True, normalize=True, bins=3), expected_dropna
)
tm.assert_numpy_array_equal(
s2.value_counts(dropna=False, normalize=True, bins=3).values,
expected_keepna_vals,
)

Review comment (Contributor): you don't need to do this; already done in assert_series_equal
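The expected values in this test follow directly from the counts: s2 has 10 entries, two of them NaN, so 8 values fall into the three bins with counts (3, 3, 2). With dropna=True the denominator is the 8 binned values, giving 3/8 = 0.375, 3/8 = 0.375 and 2/8 = 0.25. With dropna=False, NaN becomes its own key and the denominator is all 10 entries, giving 3/10, 3/10, 2/10 for the bins and 2/10 for NaN, i.e. 0.3, 0.3, 0.2, 0.2.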


def test_value_counts_datetime64(index_or_series):
klass = index_or_series
22 changes: 19 additions & 3 deletions pandas/tests/groupby/test_value_counts.py
@@ -9,7 +9,7 @@
import numpy as np
import pytest

from pandas import DataFrame, Grouper, MultiIndex, Series, date_range, to_datetime
from pandas import DataFrame, Grouper, MultiIndex, Series, cut, date_range, to_datetime
import pandas._testing as tm


@@ -41,13 +40,12 @@ def seed_df(seed_nans, n, m):
ids = []
for seed_nans in [True, False]:
for n, m in product((100, 1000), (5, 20)):

df = seed_df(seed_nans, n, m)
bins = None, np.arange(0, max(5, df["3rd"].max()) + 1, 2)
keys = "1st", "2nd", ["1st", "2nd"]
for k, b in product(keys, bins):
binned.append((df, k, b, n, m))
ids.append(f"{k}-{n}-{m}")
ids.append(f"{k}-{n}-{m}-{seed_nans} ")


@pytest.mark.slow
@@ -71,6 +70,7 @@ def rebuild_index(df):

gr = df.groupby(keys, sort=isort)
left = gr["3rd"].value_counts(**kwargs)
left.index.names = left.index.names[:-1] + ["3rd"]

gr = df.groupby(keys, sort=isort)
right = gr["3rd"].apply(Series.value_counts, **kwargs)
@@ -81,6 +81,22 @@
tm.assert_series_equal(left.sort_index(), right.sort_index())


def test_groupby_value_counts_bins():
# GH32471
BINS = [0, 20, 80, 100]
df = DataFrame(
[[0, 0], [1, 100], [0, 100], [2, 0], [3, 100]], columns=["key", "score"]
)
result = df.groupby("key")["score"].value_counts(bins=BINS)
result.sort_index(inplace=True)
intervals = cut(Series([0]), bins=BINS, include_lowest=True).cat.categories
index = MultiIndex.from_product(
[[0, 1, 2, 3], sorted(intervals)], names=("key", None)
)
expected = Series([1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1], index, name="score")
tm.assert_series_equal(result, expected)

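Why those expected counts: with BINS = [0, 20, 80, 100], key 0 has scores {0, 100}, so its three bins count (1, 0, 1); keys 1 and 3 each have only a score of 100, counting (0, 0, 1); key 2 has only 0, counting (1, 0, 0). Flattened across the MultiIndex product of 4 keys x 3 intervals, that is exactly the expected Series [1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1].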

def test_series_groupby_value_counts_with_grouper():
# GH28479
df = DataFrame(