
BUG: Bins are unexpected for qcut when the edges are duplicated #16328

Open
artturib opened this issue May 11, 2017 · 9 comments
Labels
Bug, cut

Comments

@artturib

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

# 10 values: three 0's, two each of 1, 2 and 3, and a single 4
values = np.empty(shape=10)
values[:3] = 0
values[3:5] = 1
values[5:7] = 2
values[7:9] = 3
values[9:] = 4

pd.qcut(values, 5, duplicates='drop')

Problem description

The first bin contains both 0 and 1. Since I'm looking to put 20% of the data in each bin, I would expect the first bin to contain only 0's (30% of the data) rather than 0's and 1's (50% of the data).
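A quick way to see what is happening (not part of the original report, continuing with the values array above): under the default linear interpolation the 0% and 20% quantiles are both 0, so duplicates='drop' merges the first two edges and the first bin ends up spanning both the 0's and the 1's.

pd.Series(values).quantile([0, .2, .4, .6, .8, 1])
# 0.0    0.0
# 0.2    0.0
# 0.4    1.0
# 0.6    2.0
# 0.8    3.0
# 1.0    4.0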

Expected Output

Output of pd.show_versions()

# INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: None.None

pandas: 0.20.1
pytest: 2.8.5
pip: 8.1.1
setuptools: 21.2.1
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
xarray: None
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.5.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1

@jreback
Contributor

jreback commented May 11, 2017

You are effectively doing this.

In [11]: pd.qcut([0, 1, 2, 3, 4, 5], 5)
 
Out[11]: 
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0], (4.0, 5.0]]
Categories (5, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]

You have to have one bin that holds two values. duplicates='drop' does exactly what it sounds like: it keeps only the unique edges.

So this looks like the right answer. If you don't want this, I would specify the bins yourself.
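For instance, a minimal sketch of that suggestion (the edge values here are an assumption chosen for the example data, not part of the comment):

# Pin the edges by hand so each distinct value in `values` gets its own bin
pd.cut(values, bins=[-0.5, 0.5, 1.5, 2.5, 3.5, 4.5])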

@jreback added the Usage Question and Numeric Operations labels May 11, 2017
@jreback
Contributor

jreback commented May 11, 2017

@TomAugspurger

@artturib
Author

artturib commented May 12, 2017

@jreback
I agree that the function does what is advertised, i.e. it drops duplicated bins. I still think the result is surprising in my example.

I think there is a slight difference between our examples. In my example I'd like each bin to hold 20% of the data, and qcut turns the bin that would hold 30% with only 0's into one that holds 50% with {0, 1}, because the 0's don't fit completely into 20%.
In your example each bin should also hold 20%, but every single value accounts for less than 20% of the data, so two values have to share a bin, since the 0's don't completely fill the first one.

Your suggestion of specifying the bins myself is not feasible since I don't know the distribution of the values beforehand.

@TomAugspurger
Contributor

Your suggestion of specifying the bins myself is not feasible since I don't know the distribution of the values beforehand.

pandas has the same problem :) Doing qcut(x, 5) is just qcut(x, [0, .2, .4, .6, .8, 1.]), which can't give you your desired outcome since the 0th and 20th percentiles are the same.

I did a brief skim of other packages, and it seems like they get around this by iteratively adjusting the quantiles until things work. @artturib would you mind writing up a function that does what you want, and we can see if we can integrate it into pandas?
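One possible shape such a function could take, purely as an illustrative sketch of the "iteratively adjust the quantiles" idea (qcut_adjust and its parameters are made up for this example, not pandas API):

import numpy as np
import pandas as pd

def qcut_adjust(x, q, step=0.01, max_iter=100):
    # Whenever two consecutive quantile edges coincide, nudge the later
    # probability upward and recompute, instead of dropping the duplicated edge.
    x = pd.Series(x)
    probs = np.linspace(0, 1, q + 1)
    for _ in range(max_iter):
        edges = x.quantile(probs).to_numpy()
        dup = np.flatnonzero(np.diff(edges) == 0)
        if len(dup) == 0:
            break
        probs = probs.copy()
        probs[dup + 1] = np.minimum(probs[dup + 1] + step, 1.0)
    # Bin with whatever distinct edges we ended up with
    return pd.cut(x, np.unique(edges), include_lowest=True)

With the values array from the original report, this should give a first bin containing only the 0's.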

@wyegelwel

I ran into this today. Consider the case:

In [3]: pd.qcut([1,1,1,1,2,3,4], 3, duplicates='drop')
Out[3]: 
[(0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (2.0, 4.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(0.999, 2.0] < (2.0, 4.0]]

In [9]: pd.Series([1,1,1,1,2,3,4]).quantile([ 0.        ,  0.33333333,  0.66666667,  1.        ])
Out[9]: 
0.000000    1.0
0.333333    1.0
0.666667    2.0
1.000000    4.0

Given this data and these quantile values, I would expect the bins to be [(0.999, 1] < [2, 4)], whereas they are [(0.999, 2.0] < (2.0, 4.0]].

I think this is a bug in the qcut logic with duplicates.

Specifically, qcut decides on the quantiles using linspace if they aren't specified explicitly: np.linspace(0, 1, num_quantiles + 1). The bucket ranges are then constructed by taking consecutive pairs of the quantile values.

The problem is that if the minimum and the first quantile value are duplicates, one of them is dropped and the first quantile is then treated as the minimum of the first bucket.

I think the fix is: if the 0th and 1st bin edges are equal, update the 0th edge by subtracting a small epsilon instead of filtering it out.
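A rough sketch of that idea (illustrative only, not the actual pandas internals):

import numpy as np

def adjust_first_edge(edges, epsilon=1e-3):
    # If the first two quantile edges coincide, lower the first edge by a
    # small epsilon instead of dropping it, so the minimum keeps its own bin.
    edges = np.asarray(edges, dtype=float)
    if len(edges) > 1 and edges[0] == edges[1]:
        edges = edges.copy()
        edges[0] -= epsilon
    return edges

adjust_first_edge([1.0, 1.0, 2.0, 4.0])  # -> array([0.999, 1., 2., 4.])

Feeding those adjusted edges to pd.cut would keep the 1's in a bin of their own instead of merging them with the 2.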

@jreback
Contributor

jreback commented Dec 1, 2017

@wyegelwel thanks for the report! a PR to fix would be welcome!

@jreback added this to the Next Major Release milestone Dec 1, 2017
@jreback changed the title Bins are unexpected for qcut when the edges are duplicated BUG: Bins are unexpected for qcut when the edges are duplicated Dec 1, 2017
wyegelwel added a commit to wyegelwel/pandas that referenced this issue Dec 6, 2017
@jbrockmendel added the quantile label Nov 1, 2019
@mroeschke added the cut label and removed the Numeric Operations and quantile labels Apr 5, 2020
@burk

burk commented May 19, 2022

I believe I've hit the same, or a very closely related, issue. When there are not enough distinct values to create the bins, the output depends on how large the input array is. I would expect both of these to generate two bins:

>>> pd.qcut([-1]*7 + [1] * 2, 5, labels=False, duplicates="drop")
array([0, 0, 0, 0, 0, 0, 0, 1, 1])
>>> pd.qcut([-1]*70 + [1] * 20, 5, labels=False, duplicates="drop")
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0])
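The difference comes from the interpolated quantile edges (assuming the default linear interpolation): with 9 values the 80% quantile falls between -1 and 1, which leaves a second distinct edge, while with 90 values it lands exactly on 1 and only one bin survives the duplicate-dropping.

pd.Series([-1]*7 + [1]*2).quantile([0, .2, .4, .6, .8, 1])    # 0.8 quantile is -0.2
pd.Series([-1]*70 + [1]*20).quantile([0, .2, .4, .6, .8, 1])  # 0.8 quantile is 1.0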

@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@louisguichard

Hello, this issue still looks relevant. Would a PR be welcome?

@jjlevinemusic

I still see this bug, and it has caused significant problems for me. Here is an example:

pd.qcut(pd.Series(np.concatenate([np.ones(100)*i for i in range(9)])), q=10, duplicates='drop').value_counts()

The data has 9 distinct values; I request 10 bins and expect 9, but I get 8 instead.
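To see why (assuming the default linear interpolation): with 900 values the 0% and 10% quantiles are both 0 and the 90% and 100% quantiles are both 8, so two edges are dropped and only 9 distinct edges, i.e. 8 bins, remain.

pd.Series(np.concatenate([np.ones(100)*i for i in range(9)])).quantile(np.linspace(0, 1, 11))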
