BUG: Bins are unexpected for qcut when the edges are duplicated #16328
Comments
You are effectively doing this.
You have to have 1 bin that has 2 values, so this looks like the right answer. If you don't want this, I would specify the bins yourself.
@jreback I think there is a slight difference between our examples. In my example I'd like each bin to hold 20% of the data. Because the 0's make up 30% of the data and don't fit in a single 20% bin, qcut merges them with the 1's into a first bin holding 50% of the data ({0, 1}) instead of a 30% bin with only 0's. Your suggestion of specifying the bins myself is not feasible since I don't know the distribution of the values beforehand.
pandas has the same problem :) Doing
I did a brief skim of other packages, and it seems like they get around this by iteratively adjusting the quantiles until things work. @artturib would you mind writing up a function that does what you want, and we can see if we can integrate it into pandas?
I ran into this today. Consider the case:
In [3]: pd.qcut([1,1,1,1,2,3,4], 3, duplicates='drop')
Out[3]:
[(0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (2.0, 4.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(0.999, 2.0] < (2.0, 4.0]]
In [9]: pd.Series([1,1,1,1,2,3,4]).quantile([ 0. , 0.33333333, 0.66666667, 1. ])
Out[9]:
0.000000 1.0
0.333333 1.0
0.666667 2.0
1.000000 4.0
Given this data with these quantile values, I would expect the bins to be
I think this is a bug in the qcut logic with duplicates. Specifically, qcut decides on the quantiles using linspace if they aren't specified. The linspace is
The problem is that if the min and first quantile values are duplicates, then we drop one, and the first quantile is then treated as the min for the first bucket constructed. I think the fix is, if the 0th and 1st bin values are equal, to update the 0th bin value by subtracting a small epsilon instead of filtering it out.
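A rough sketch of the epsilon idea suggested above, applied outside of pandas internals. The helper name `qcut_keep_min_bin` and the `eps` default are made up for illustration; this is not how pandas implements `qcut`, and it assumes the quantile probabilities come from `np.linspace(0, 1, q + 1)`.

```python
import numpy as np
import pandas as pd


def qcut_keep_min_bin(values, q, eps=1e-3):
    """Hypothetical helper (not a pandas API): quantile-bin `values` into
    at most `q` bins, keeping a separate lowest bin when the minimum and
    the first quantile edge coincide."""
    s = pd.Series(values, dtype="float64")
    # Evenly spaced probabilities (assumed here), turned into bin edges.
    edges = s.quantile(np.linspace(0, 1, q + 1)).to_numpy(copy=True)
    if edges[0] == edges[1]:
        # Nudge the lowest edge down instead of dropping it as a duplicate.
        edges[0] -= eps
    # Any remaining duplicate edges are still dropped.
    edges = np.unique(edges)
    return pd.cut(s, bins=edges)


binned = qcut_keep_min_bin([1, 1, 1, 1, 2, 3, 4], 3)
print(binned.value_counts(sort=False))
```

With this input the four 1's end up in a bin of their own and three bins are produced, rather than the two bins shown in `Out[3]`.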
@wyegelwel thanks for the report! A PR to fix would be welcome!
I believe I've hit the same, or a very related, issue. When there are not enough distinct values to create bins, the output depends on how large the input array is. I would expect both of these to generate two bins:
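A minimal sketch of how this kind of dependence on the input can show up, using made-up inputs (these are illustrative only and not the sample referred to above). Both inputs contain the same two distinct values, yet the number of bins differs once duplicate edges are dropped.

```python
import pandas as pd

# Two distinct values, two requested bins: edges are 0, 0.5, 1.
small = pd.qcut([0, 1], 2, duplicates="drop")

# Same distinct values, but the median is now 0, so the edges 0, 0, 1
# collapse to 0, 1 after duplicates are dropped.
large = pd.qcut([0, 0, 0, 1], 2, duplicates="drop")

print(len(small.categories))
print(len(large.categories))
```

The first call yields two categories, while the second yields only one.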
Hello, this issue still looks relevant. Would a PR be welcome?
I still see this bug, and it has caused significant problems for me:
Code Sample, a copy-pastable example if possible
Problem description
The first bin contains both 0 and 1. Since I'm looking to put 20% in each bin I would expect to have the first bin to contain only 0's (for 30% of the data) rather than 0's and 1's (for 50% of the data).
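A small reproduction of the behaviour described here, using made-up data with the same shape (30% zeros, 20% ones); it is not the data from the original code sample. The `duplicates="drop"` argument is needed because the 0% and 20% quantile edges are both 0.

```python
import pandas as pd

# Made-up data: 30% zeros, 20% ones, then a spread of larger values.
data = [0, 0, 0, 1, 1, 2, 2, 3, 3, 4]

# Ask for five 20% bins; the duplicated edge at 0 is dropped.
binned = pd.qcut(data, 5, duplicates="drop")

# Share of the data that lands in the first bin.
first_bin_share = (binned.codes == 0).mean()

print(binned.categories)
print(first_bin_share)
```

Only four bins come back instead of five, and the first one holds half of the data because the 0's and 1's are merged.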
Expected Output
Output of pd.show_versions()
pandas: 0.20.1
pytest: 2.8.5
pip: 8.1.1
setuptools: 21.2.1
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
xarray: None
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.5.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1