
BUG: Bins are unexpected for qcut when the edges are duplicated #16328

Open
artturib opened this issue May 11, 2017 · 9 comments
Labels
Bug, cut

Comments

@artturib

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

# 10 values: three 0's, two each of 1, 2 and 3, and a single 4
values = np.empty(shape=10)
values[:3] = 0
values[3:5] = 1
values[5:7] = 2
values[7:9] = 3
values[9:] = 4

pd.qcut(values, 5, duplicates='drop')

Problem description

The first bin contains both 0 and 1. Since I'm looking to put 20% of the data in each bin, I would expect the first bin to contain only 0's (30% of the data) rather than 0's and 1's (50% of the data).
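A quick way to see what is happening (not part of the original report, continuing with the values array above): under the default linear interpolation the 0% and 20% quantiles are both 0, so duplicates='drop' merges the first two edges and the first bin ends up spanning both the 0's and the 1's.

pd.Series(values).quantile([0, .2, .4, .6, .8, 1])
# 0.0    0.0
# 0.2    0.0
# 0.4    1.0
# 0.6    2.0
# 0.8    3.0
# 1.0    4.0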

Expected Output

Output of pd.show_versions()

# INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: None.None

pandas: 0.20.1
pytest: 2.8.5
pip: 8.1.1
setuptools: 21.2.1
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
xarray: None
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.5.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1

@jreback
Contributor

jreback commented May 11, 2017

You are effectively doing this.

In [11]: pd.qcut([0, 1, 2, 3, 4, 5], 5)
 
Out[11]: 
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0], (4.0, 5.0]]
Categories (5, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]

You have to have one bin that holds two values. duplicates='drop' does exactly what it sounds like: it keeps only the unique edges.

So this looks like the right answer. If you don't want this, I would specify the bins yourself.
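For instance, a minimal sketch of that suggestion (the edge values here are an assumption chosen for the example data, not part of the comment):

# Pin the edges by hand so each distinct value in `values` gets its own bin
pd.cut(values, bins=[-0.5, 0.5, 1.5, 2.5, 3.5, 4.5])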

@jreback added the Usage Question and Numeric Operations labels May 11, 2017
@jreback
Contributor

jreback commented May 11, 2017

@TomAugspurger

@artturib
Author

artturib commented May 12, 2017

@jreback
I agree that the function does what is advertised, i.e. it drops duplicated bins. I still think the result is surprising in my example.

I think there is a slight difference between our examples. In my example I'd like each bin to hold 20% of the data, and qcut turns the bin that would hold 30% with only 0's into one that holds 50% with {0, 1}, because the 0's don't fit completely into 20%.
In your example each bin should also hold 20%, but every single value accounts for less than 20% of the data, so two values have to share a bin, since the 0's don't completely fill the first one.

Your suggestion of specifying the bins myself is not feasible since I don't know the distribution of the values beforehand.

@TomAugspurger
Contributor

Your suggestion of specifying the bins myself is not feasible since I don't know the distribution of the values beforehand.

pandas has the same problem :) Doing qcut(x, 5) is just qcut(x, [0, .2, .4, .6, .8, 1.]), which can't give you your desired outcome since the 0th and 20th percentiles are the same.

I did a brief skim of other packages, and it seems like they get around this by iteratively adjusting the quantiles until things work. @artturib would you mind writing up a function that does what you want, and we can see if we can integrate it into pandas?
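One possible shape such a function could take, purely as an illustrative sketch of the "iteratively adjust the quantiles" idea (qcut_adjust and its parameters are made up for this example, not pandas API):

import numpy as np
import pandas as pd

def qcut_adjust(x, q, step=0.01, max_iter=100):
    # Whenever two consecutive quantile edges coincide, nudge the later
    # probability upward and recompute, instead of dropping the duplicated edge.
    x = pd.Series(x)
    probs = np.linspace(0, 1, q + 1)
    for _ in range(max_iter):
        edges = x.quantile(probs).to_numpy()
        dup = np.flatnonzero(np.diff(edges) == 0)
        if len(dup) == 0:
            break
        probs = probs.copy()
        probs[dup + 1] = np.minimum(probs[dup + 1] + step, 1.0)
    # Bin with whatever distinct edges we ended up with
    return pd.cut(x, np.unique(edges), include_lowest=True)

With the values array from the original report, this should give a first bin containing only the 0's.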

@wyegelwel

I ran into this today. Consider the case:

In [3]: pd.qcut([1,1,1,1,2,3,4], 3, duplicates='drop')
Out[3]: 
[(0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (2.0, 4.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(0.999, 2.0] < (2.0, 4.0]]

In [9]: pd.Series([1,1,1,1,2,3,4]).quantile([ 0.        ,  0.33333333,  0.66666667,  1.        ])
Out[9]: 
0.000000    1.0
0.333333    1.0
0.666667    2.0
1.000000    4.0

Given this data and these quantile values, I would expect the bins to be [(0.999, 1] < [2, 4)], whereas they are [(0.999, 2.0] < (2.0, 4.0]].

I think this is a bug in the qcut logic with duplicates.

Specifically, qcut decides on the quantiles using linspace if they aren't specified explicitly: np.linspace(0, 1, num_quantiles + 1). The bucket ranges are then constructed by taking consecutive pairs of the quantile values.

The problem is that if the minimum and the first quantile value are duplicates, one of them is dropped and the first quantile is then treated as the minimum of the first bucket.

I think the fix is: if the 0th and 1st bin edges are equal, update the 0th edge by subtracting a small epsilon instead of filtering it out.
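A rough sketch of that idea (illustrative only, not the actual pandas internals):

import numpy as np

def adjust_first_edge(edges, epsilon=1e-3):
    # If the first two quantile edges coincide, lower the first edge by a
    # small epsilon instead of dropping it, so the minimum keeps its own bin.
    edges = np.asarray(edges, dtype=float)
    if len(edges) > 1 and edges[0] == edges[1]:
        edges = edges.copy()
        edges[0] -= epsilon
    return edges

adjust_first_edge([1.0, 1.0, 2.0, 4.0])  # -> array([0.999, 1., 2., 4.])

Feeding those adjusted edges to pd.cut would keep the 1's in a bin of their own instead of merging them with the 2.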

@jreback
Contributor

jreback commented Dec 1, 2017

@wyegelwel thanks for the report! a PR to fix would be welcome!

@jreback added this to the Next Major Release milestone Dec 1, 2017
@jreback changed the title Bins are unexpected for qcut when the edges are duplicated BUG: Bins are unexpected for qcut when the edges are duplicated Dec 1, 2017
wyegelwel added a commit to wyegelwel/pandas that referenced this issue Dec 6, 2017
@jbrockmendel added the quantile label Nov 1, 2019
@mroeschke added the cut label and removed the Numeric Operations and quantile labels Apr 5, 2020
@burk

burk commented May 19, 2022

I believe I've hit the same, or a very closely related, issue. When there are not enough distinct values to create the bins, the output depends on how large the input array is. I would expect both of these to generate two bins:

>>> pd.qcut([-1]*7 + [1] * 2, 5, labels=False, duplicates="drop")
array([0, 0, 0, 0, 0, 0, 0, 1, 1])
>>> pd.qcut([-1]*70 + [1] * 20, 5, labels=False, duplicates="drop")
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0])
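The difference comes from the interpolated quantile edges (assuming the default linear interpolation): with 9 values the 80% quantile falls between -1 and 1, which leaves a second distinct edge, while with 90 values it lands exactly on 1 and only one bin survives the duplicate-dropping.

pd.Series([-1]*7 + [1]*2).quantile([0, .2, .4, .6, .8, 1])    # 0.8 quantile is -0.2
pd.Series([-1]*70 + [1]*20).quantile([0, .2, .4, .6, .8, 1])  # 0.8 quantile is 1.0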

@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@louisguichard

Hello, this issue still looks relevant. Would a PR be welcome?

@jjlevinemusic

I still see this bug, and it has caused significant problems for me. Here is an example:

pd.qcut(pd.Series(np.concatenate([np.ones(100)*i for i in range(9)])), q=10, duplicates='drop').value_counts()

The data has 9 distinct values; I request 10 bins and expect 9, but I get 8 instead.
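To see why (assuming the default linear interpolation): with 900 values the 0% and 10% quantiles are both 0 and the 90% and 100% quantiles are both 8, so two edges are dropped and only 9 distinct edges, i.e. 8 bins, remain.

pd.Series(np.concatenate([np.ones(100)*i for i in range(9)])).quantile(np.linspace(0, 1, 11))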
