[Improvement] Deterministic value_counts #15833

Closed
tomspur opened this issue Mar 29, 2017 · 5 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Duplicate Report Duplicate issue or pull request

tomspur commented Mar 29, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(["a", "b", "c"], columns=["test"])

print(df["test"].value_counts())

Problem description

Using value_counts in a test suite can be a problem when several values have the same count, because their order is permuted from one run to the next, e.g.:

$ python pandas_value_counts.py
a    1
b    1
c    1
Name: test, dtype: int64

$ python pandas_value_counts.py
c    1
a    1
b    1
Name: test, dtype: int64

Expected Output

Some stable/deterministic output, or optionally an additional sort of the keys when they have the same count.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

Contributor

jreback commented Mar 29, 2017

See the discussion in these related issues:

xref #12679
xref #11227
xref #14860

This is not guaranteed in any way, nor is it performant to do so. Further, why should this be anything but an arbitrary ordering? This is a mapping of value -> count.

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Mar 29, 2017
Contributor

jreback commented Mar 29, 2017

If you need this for testing, the easiest and best approach is simply to call .sort_index() on the result and compare.
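As a minimal sketch of that suggestion (not code from this thread): sort the counts by index before comparing, so the test no longer depends on how ties are ordered. The `expected` Series and the use of `pd.testing.assert_series_equal` are assumptions for illustration; that helper lives in newer pandas releases than the 0.19.2 reported above.

```python
import pandas as pd

df = pd.DataFrame(["a", "b", "c"], columns=["test"])

# Sorting by index makes the order independent of how ties are arranged.
result = df["test"].value_counts().sort_index()

# Hypothetical expected value for the test; names are ignored because the
# name of the value_counts result differs between pandas versions.
expected = pd.Series([1, 1, 1], index=["a", "b", "c"])
pd.testing.assert_series_equal(result, expected, check_names=False)
```

If only the value -> count mapping matters, comparing `result.to_dict()` against a plain dict would also avoid any dependence on order.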

Contributor

jreback commented Mar 29, 2017

Actually, this is a duplicate of #12679.

There is no ordering guarantee with sort=False (not the default). This could be done as a post-processing step.

.unique() does guarantee that it returns the values in the order in which they were first seen.

In [23]: s.value_counts(sort=False).reindex(s.unique())
Out[23]: 
a    1
b    1
c    1
dtype: int64

In [24]: s = Series(list('bac'))

In [25]: s.value_counts(sort=False).reindex(s.unique())
Out[25]: 
b    1
a    1
c    1
dtype: int64
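For anyone who wants to run the transcript above as a script, here is a self-contained sketch of the same post-processing step. The helper name `value_counts_in_order` is hypothetical and not part of pandas, and the sketch assumes the Series has no missing values (by default `value_counts` drops them while `unique` keeps them, so the reindex would introduce NaN counts).

```python
import pandas as pd


def value_counts_in_order(s):
    """Hypothetical helper: counts ordered by first appearance in `s`."""
    # sort=False skips the sort by count; reindexing on s.unique() then
    # imposes the order in which the values were first seen.
    return s.value_counts(sort=False).reindex(s.unique())


s = pd.Series(list("bacab"))
counts = value_counts_in_order(s)
print(counts)
```

The result lists `b`, `a`, `c` in that order, regardless of how the hash table happened to arrange them.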

If you'd like to do a PR for #12679, that would be great.

@jreback jreback closed this as completed Mar 29, 2017
@jreback jreback added the Duplicate Report Duplicate issue or pull request label Mar 29, 2017
@jreback jreback added this to the No action milestone Mar 29, 2017
Author

tomspur commented Apr 3, 2017

Yes, I need this for testing, and I would like to get the value with the highest count. Taking just the first item of value_counts returns an arbitrary one of the values that share the maximum count, and it would be great to always get the same value.
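One possible workaround for this use case, sketched under assumptions rather than taken from the thread: move the counts into a DataFrame and sort on two keys, count descending and then value ascending, so that ties are always broken the same way. The column names `value` and `count` and the sample data are made up for illustration.

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a", "b"])

# Turn the value -> count mapping into two explicit columns.
counts = (
    s.value_counts(sort=False)
    .rename_axis("value")
    .reset_index(name="count")
)

# Primary key: count (descending). Tie-breaker: the value itself (ascending).
ordered = counts.sort_values(["count", "value"], ascending=[False, True])

top = ordered["value"].iloc[0]
print(top)
```

Here `a` and `b` both occur twice, and the tie-breaker always selects `a`. This only works when the values themselves are sortable.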

Contributor

jreback commented Apr 3, 2017

Would love to have a PR as described above!
