Dataframe Groupby value_counts with bins parameter #32471

Open
scottboston opened this issue Mar 5, 2020 · 3 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby

Comments
scottboston commented Mar 5, 2020

Found in a Stack Overflow post.

import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)
df.groupby("key")["score"].value_counts(bins=[0, 20, 40, 60, 80, 100])

Problem description

Outputs counts with the wrong index labels. Note: the data only falls in the 0–20 and 80–100 bins.

key  score         
0    (-0.001, 20.0]    1
     (20.0, 40.0]      1
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
1    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
2    (-0.001, 20.0]    1
     (20.0, 40.0]      0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
3    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
4    (20.0, 40.0]      3
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
Name: score, dtype: int64

Expected Output

df.groupby('key')['score'].apply(pd.Series.value_counts, bins=[0,20,40,60,80,100])

key                
0    (80.0, 100.0]     1
     (-0.001, 20.0]    1
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
1    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
2    (-0.001, 20.0]    1
     (80.0, 100.0]     0
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
3    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
4    (80.0, 100.0]     3
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
Name: score, dtype: int64
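A possible workaround (not from this thread, just a sketch): bin the values once with pd.cut and then count per (key, bin). Because the interval labels are fixed categories computed once for the whole column, they cannot be reshuffled per group; `include_lowest=True` is assumed here to mirror how value_counts(bins=...) labels the first bin.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)

# Bin once for the whole column; the resulting Categorical carries
# fixed interval labels that every group shares.
binned = pd.cut(df["score"], bins=[0, 20, 40, 60, 80, 100], include_lowest=True)

# observed=False keeps the empty bins, matching value_counts(bins=...)
counts = df.groupby(["key", binned], observed=False).size()
```

With fixed labels, key 0 correctly shows one count in (-0.001, 20.0] and one in (80.0, 100.0].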

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.1
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.0
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.0.1
pyxlsb : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.5
tables : 3.5.2
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8
numba : 0.45.0

@DataInformer

I think this is related to #25970. SeriesGroupBy.value_counts has a strange section that notes 'scalar bins cannot be done at top level in a backward compatible way' and then performs further awkward manual operations that appear to contain an error. I'm looking at whether the tests will pass and the behavior will be correct without this special-cased section.

@DataInformer

take

@jreback jreback added this to the Contributions Welcome milestone Sep 13, 2020
@mroeschke mroeschke added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jul 29, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@swierh

swierh commented Dec 22, 2022

This bug seems to occur when there are bins that are always empty. I've tested versions 1.3.4 and 1.5.1; the bug is present in both.

Given the following dataframes, the groupby operation works as expected for df1, but not for df2.

df1 = pd.DataFrame(
    [["a", 1.5], ["b", 0.5], ["b", 1.5]],
    columns=["group", "value"],
)
df2 = pd.DataFrame(
    [["a", 1.5], ["b", 1.5], ["b", 1.5]],
    columns=["group", "value"],
)

And the following two groupby methods:

# groupby A (substitute df1 or df2 for df):
df.groupby("group")["value"].value_counts(bins=[0, 1, 2]).unstack(level=0)

# groupby B:
pd.concat(
    {
        group: df.loc[df["group"] == group, "value"].value_counts(bins=[0, 1, 2])
        for group in df["group"].unique()
    }
).unstack(level=0)

I would expect both groupby operations to give identical results.

For df1 they give identical results, but for df2, which has an empty first bin, they do not. Method B gives the expected outcome, but A gives:

group           a  b
value
(-0.001, 1.0]   1  2
(1.0, 2.0]      0  0

I haven't looked into it properly yet, but my prime suspect is currently this line: https://github.com/pandas-dev/pandas/blob/main/pandas/core/groupby/generic.py#L673
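As a cross-check of the empty-first-bin case above (a sketch, not part of the thread): cutting first with pd.cut fixes the interval labels as shared categories, so a groupby count matches method B even when df2's first bin is empty.

```python
import pandas as pd

df2 = pd.DataFrame(
    [["a", 1.5], ["b", 1.5], ["b", 1.5]],
    columns=["group", "value"],
)

# Cut first so the bin labels are fixed categories shared by all groups,
# then count per (group, bin); empty bins survive via observed=False.
binned = pd.cut(df2["value"], bins=[0, 1, 2], include_lowest=True)
result = df2.groupby(["group", binned], observed=False).size().unstack(level=0)
```

Here the (-0.001, 1.0] row stays all zeros and the (1.0, 2.0] row reads a=1, b=2, agreeing with method B.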
