Dataframe Groupby value_counts with bins parameter #32471

Open
scottboston opened this issue Mar 5, 2020 · 3 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby

Comments
scottboston commented Mar 5, 2020

Found in a Stack Overflow post.

import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)
df.groupby("key")["score"].value_counts(bins=[0, 20, 40, 60, 80, 100])

Problem description

Outputs counts with the wrong index labels. Note: the data only falls in the 0–20 and 80–100 bins.

key  score         
0    (-0.001, 20.0]    1
     (20.0, 40.0]      1
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
1    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
2    (-0.001, 20.0]    1
     (20.0, 40.0]      0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
3    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
4    (20.0, 40.0]      3
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
Name: score, dtype: int64

Expected Output

df.groupby('key')['score'].apply(pd.Series.value_counts, bins=[0,20,40,60,80,100])

key                
0    (80.0, 100.0]     1
     (-0.001, 20.0]    1
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
1    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
2    (-0.001, 20.0]    1
     (80.0, 100.0]     0
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
3    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
4    (80.0, 100.0]     3
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
Name: score, dtype: int64
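A possible workaround (not from this thread, just a sketch): bin the values once with pd.cut and then count per (key, bin). Because the interval labels are fixed categories computed once for the whole column, they cannot be reshuffled per group; `include_lowest=True` is assumed here to mirror how value_counts(bins=...) labels the first bin.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)

# Bin once for the whole column; the resulting Categorical carries
# fixed interval labels that every group shares.
binned = pd.cut(df["score"], bins=[0, 20, 40, 60, 80, 100], include_lowest=True)

# observed=False keeps the empty bins, matching value_counts(bins=...)
counts = df.groupby(["key", binned], observed=False).size()
```

With fixed labels, key 0 correctly shows one count in (-0.001, 20.0] and one in (80.0, 100.0].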

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.1
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.0
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.0.1
pyxlsb : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.5
tables : 3.5.2
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8
numba : 0.45.0

@DataInformer

I think this is related to #25970. SeriesGroupBy.value_counts has a strange section that notes 'scalar bins cannot be done at top level in a backward compatible way' and then performs further awkward manual operations that appear to contain an error. I'm looking at whether the tests will pass and the behavior will be correct without this special-cased section.

@DataInformer

take

@jreback jreback added this to the Contributions Welcome milestone Sep 13, 2020
@mroeschke mroeschke added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jul 29, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@swierh

swierh commented Dec 22, 2022

This bug seems to occur when there are bins that are always empty. I've tested versions 1.3.4 and 1.5.1; the bug is present in both.

Given the following dataframes, the groupby operation works as expected for df1, but not for df2.

df1 = pd.DataFrame(
    [["a", 1.5], ["b", 0.5], ["b", 1.5]],
    columns=["group", "value"],
)
df2 = pd.DataFrame(
    [["a", 1.5], ["b", 1.5], ["b", 1.5]],
    columns=["group", "value"],
)

And the following two groupby methods:

# groupby A (substitute df1 or df2 for df):
df.groupby("group")["value"].value_counts(bins=[0, 1, 2]).unstack(level=0)

# groupby B:
pd.concat(
    {
        group: df.loc[df["group"] == group, "value"].value_counts(bins=[0, 1, 2])
        for group in df["group"].unique()
    }
).unstack(level=0)

I would expect both groupby operations to give identical results.

For df1 they give identical results, but for df2, which has an empty first bin, they do not. Method B gives the expected outcome, but A gives:

group           a  b
value
(-0.001, 1.0]   1  2
(1.0, 2.0]      0  0

I haven't looked into it properly yet, but my prime suspect is currently this line: https://github.com/pandas-dev/pandas/blob/main/pandas/core/groupby/generic.py#L673
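As a cross-check of the empty-first-bin case above (a sketch, not part of the thread): cutting first with pd.cut fixes the interval labels as shared categories, so a groupby count matches method B even when df2's first bin is empty.

```python
import pandas as pd

df2 = pd.DataFrame(
    [["a", 1.5], ["b", 1.5], ["b", 1.5]],
    columns=["group", "value"],
)

# Cut first so the bin labels are fixed categories shared by all groups,
# then count per (group, bin); empty bins survive via observed=False.
binned = pd.cut(df2["value"], bins=[0, 1, 2], include_lowest=True)
result = df2.groupby(["group", binned], observed=False).size().unstack(level=0)
```

Here the (-0.001, 1.0] row stays all zeros and the (1.0, 2.0] row reads a=1, b=2, agreeing with method B.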
