
Value counts normalize #33652


Closed

Conversation

@DataInformer commented Apr 19, 2020

This pull request resolves issues with binning and NA values in both Series.value_counts and SeriesGroupBy.value_counts, adding new tests to check the problematic cases.
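For context on the normalize-with-bins behavior this PR targets, here is a toy sketch in plain Python (not pandas internals, and the function name is made up for illustration): values that fall outside every bin become NA and are excluded from the normalization denominator, so the reported proportions still sum to 1.

```python
from collections import Counter

def binned_proportions(values, edges):
    """Toy sketch of the intended semantics: bin values into half-open
    intervals (edges[i], edges[i+1]] and normalize by the number of values
    that actually landed in a bin, so out-of-range values do not inflate
    the denominator."""
    bins = list(zip(edges, edges[1:]))
    counts = Counter()
    for v in values:
        for lo, hi in bins:
            if lo < v <= hi:
                counts[(lo, hi)] += 1
                break  # values outside every bin are simply not counted
    denom = max(sum(counts.values()), 1)  # avoid division by zero on empty input
    return {b: counts.get(b, 0) / denom for b in bins}
```

With `binned_proportions([1, 2, 3, 10], [0, 2, 4])`, the value 10 falls outside both bins and is dropped, and the two bin proportions (2/3 and 1/3) sum to 1.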

@pep8speaks commented Apr 19, 2020

Hello @DataInformer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-14 02:11:04 UTC

@jreback (Contributor) left a comment

will look soon

@@ -190,6 +190,14 @@ def test_value_counts_bins(index_or_series):

assert s.nunique() == 0

# handle normalizing bins with NA's properly
Contributor

make a new test

Author

Just to make sure I understand: are you saying this is a badly written test, so I should replace it with a different one? Or are you saying I should add a test beyond this one?

Member

test_value_counts_bins is already doing too much. make this a separate test.

Author

done

@@ -22,6 +22,8 @@ Fixed regressions

Bug fixes
~~~~~~~~~
Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Fixed Dataframe Groupby value_counts with bins (:issue:`32471`)
Member

move this to 1.1

Author

done

@DataInformer (Author)

I'm not sure why the Web and docs check is failing. Looking through the output, I only see warnings (for my part, only a block-quote issue that I think is being detected on pd.cut).

@kevin-meyers

I'm not positive but it could have something to do with https://pandas.pydata.org/docs/development/contributing_docstring.html

/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/IPython/sphinxext/ipython_directive.py:1023: UserWarning: Code input with no code at /home/runner/work/pandas/pandas/doc/source/user_guide/computation.rst, line 622
  warnings.warn(message)
/home/runner/work/pandas/pandas/pandas/core/base.py:docstring of pandas.Index.value_counts:18: WARNING: Block quote ends without a blank line; unexpected unindent.
/home/runner/work/pandas/pandas/pandas/core/base.py:docstring of pandas.Series.value_counts:18: WARNING: Block quote ends without a blank line; unexpected unindent.
build finished with problems, 2 warnings.
##[error]Process completed with exit code 1.

@DataInformer (Author)

I'm not positive but it could have something to do with https://pandas.pydata.org/docs/development/contributing_docstring.html


Right, that's what I thought, but I don't see any block quotes without blank lines. I was hoping maybe someone could help me identify more specifically what the problem is.

@@ -23,6 +23,7 @@ Fixed regressions
Bug fixes
~~~~~~~~~


Member

Can you revert unrelated changes? Looks like blank space and file permissions were changed here

Member

@DataInformer can you address this

Member

can you do this.

Author

Sorry, I thought I did this before, but apparently it reverted. Should be undone again now.

if is_integer_dtype(out):
out = ensure_int64(out)
return Series(out, index=mi, name=self._selection_name)
return self.apply(
Member

This is most likely significantly slower than the existing implementation - can you run the appropriate groupby benchmarks to check?

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

@simonjayhawkins (Member) left a comment

Is the fix for #32471 dependent on the fix for #25970? If so is it possible to address #25970 independently as a pre-cursor PR?

@@ -434,7 +434,8 @@ Performance improvements

Bug fixes
~~~~~~~~~

Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Member

can you move this down into the Numeric section. starts at L482

Author

Done

@@ -434,7 +434,8 @@ Performance improvements

Bug fixes
~~~~~~~~~

Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Fixed Dataframe Groupby value_counts with bins (:issue:`32471`)
Member

can you move this down into the Groupby/resample/rolling section. starts on L596.

@simonjayhawkins simonjayhawkins added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby labels Jun 2, 2020
@DataInformer DataInformer force-pushed the value_counts_normalize branch from 27c9856 to 99b7112 Compare June 27, 2020 21:05
@DataInformer (Author)

Is the fix for #32471 dependent on the fix for #25970? If so is it possible to address #25970 independently as a pre-cursor PR?

I cleaned up the pull request so it has minimal changes only addressing #25970. I kept the branch name so this request history would remain.

However, I had only run the base tests, and the groupby tests now give conflicting results, since Series.value_counts is doing the right thing but DataFrame.groupby.value_counts is not. It seems wrong to change the test, but I have temporarily reduced the normalize and dropna parameters to skip the problematic cases. Would you rather I combine the fixes again as a single pull request?

@DataInformer (Author)

I'm still a bit confused by the results from the benchmarking (pasted below). In particular, I don't see how my changes would have made some of the methods faster, when I didn't change those, which casts doubt on the results showing that value_counts is significantly slower. My code handles a couple special cases separately to avoid the identified bugs, but I would not expect it to be more than 20% slower.
This run is with the computer idle (I get very different results while still using it, which makes sense). I reached out to the Pandas-dev alias to see if someone could advise on the performance issues, but didn't get any response. Do you have other suggestions for tracking down what might be causing these methods to run slower?

asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods
            before            after     ratio
         [04e9e0a]        [5abfb16]
<value_counts_part1~1^2> <value_counts_normalize>

  1.03±0.03ms   1.85±0.05ms   1.80  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'transformation')
  1.03±0.06ms    1.81±0.1ms   1.75  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'transformation')
  1.02±0.07ms   1.70±0.06ms   1.67  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'transformation')
   1.01±0.1ms   1.65±0.07ms   1.64  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'direct')
  1.05±0.04ms    1.64±0.2ms   1.56  groupby.GroupByMethods.time_dtype_as_group('datetime', 'value_counts', 'transformation')
  1.08±0.08ms   1.67±0.09ms   1.55  groupby.GroupByMethods.time_dtype_as_group('datetime', 'value_counts', 'direct')
  1.11±0.04ms   1.69±0.06ms   1.52  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'direct')
     962±20μs   1.41±0.03ms   1.47  groupby.GroupByMethods.time_dtype_as_field('object', 'value_counts', 'direct')
  1.09±0.04ms   1.55±0.08ms   1.42  groupby.GroupByMethods.time_dtype_as_field('object', 'value_counts', 'transformation')
  1.21±0.02ms   1.69±0.05ms   1.40  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'transformation')
  1.17±0.06ms   1.64±0.06ms   1.40  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'direct')
  1.28±0.09ms    1.72±0.1ms   1.35  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'transformation')
  1.23±0.03ms   1.66±0.06ms   1.34  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'direct')
  1.31±0.08ms    1.71±0.1ms   1.30  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'direct')
      191±9μs       241±9μs   1.26  groupby.GroupByMethods.time_dtype_as_group('object', 'head', 'transformation')
     865±50μs   1.08±0.05ms   1.25  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'transformation')
     947±20μs   1.17±0.06ms   1.24  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'direct')
     577±30μs      684±20μs   1.18  groupby.GroupByMethods.time_dtype_as_field('int', 'min', 'direct')
     595±30μs      698±30μs   1.17  groupby.GroupByMethods.time_dtype_as_field('float', 'quantile', 'transformation')
     703±70μs      635±30μs   0.90  groupby.GroupByMethods.time_dtype_as_group('datetime', 'min', 'direct')
      287±7μs      255±10μs   0.89  groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'direct')
  1.26±0.04ms   1.10±0.07ms   0.87  groupby.GroupByMethods.time_dtype_as_group('int', 'cumprod', 'direct')
     237±10μs       206±2μs   0.87  groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'direct')
      183±6μs      158±10μs   0.86  groupby.GroupByMethods.time_dtype_as_field('float', 'count', 'direct')
     516±20μs      440±20μs   0.85  groupby.GroupByMethods.time_dtype_as_field('object', 'nunique', 'direct')
     993±20μs      835±10μs   0.84  groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct')

@WillAyd (Member) commented Sep 10, 2020

@DataInformer is this still active? If so can you merge master and try to get CI green?

@DataInformer (Author)

@DataInformer is this still active? If so can you merge master and try to get CI green?

Yes, this is active. I merged master again and checks pass now. Hopefully we're good to close this out.

@simonjayhawkins (Member) left a comment

Thanks @DataInformer can you move release note to 1.2

it also looks like file modes were changed and codecov is reporting several sections of code not covered by tests.

@@ -23,6 +23,7 @@ Fixed regressions
Bug fixes
~~~~~~~~~


Member

can you do this.

if dropna:
mask = ~isna(val)
if not mask.all():
ids, val = ids[mask], val[mask]
Member

codecov reports no testing here

Author

Most features are tested in test_series_groupby_value_counts, which is parameterized and includes dropna as a parameter. That should address the below case as well, since it includes seed_nans in the dataframe generation. I could add specific tests for these cases, but it seems like they are covered by the existing tests.

Member

@simonjayhawkins IIUC codecov gets results from the travis-37-cov build, which runs with not slow, so we wouldn't expect GroupBy.value_counts to show up there

if (not dropna) and (-1 in val_lab):
# in this case we need to explicitly add NaN as a level
val_lev = np.r_[Index([np.nan]), val_lev]
val_lab += 1
Member

codecov reports no testing for this

@jreback (Contributor) left a comment

wow this is a huge amount of change. It would be really nice to do this in smaller pieces, otherwise review is going to take an extended amount of time.

@@ -247,6 +247,7 @@ Numeric
^^^^^^^
- Bug in :func:`to_numeric` where float precision was incorrect (:issue:`31364`)
- Bug in :meth:`DataFrame.any` with ``axis=1`` and ``bool_only=True`` ignoring the ``bool_only`` keyword (:issue:`32432`)
- Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Contributor

can you change this to :meth:`Series.value_counts`, otherwise it won't render; also put ``dropna=False`` in double backticks.

@@ -315,7 +316,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
-
- Fixed Dataframe Groupby value_counts with bins (:issue:`32471`)
Contributor

not sure what this means, so expand on this, and use a fully-qualified reference for value_counts
:meth:`DataFrameGroupby.value_counts`

bins : integer or iterable of numeric, optional
Rather than count values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as number of bins and will use pd.cut.
Contributor

hmm this will use pd.cut either way, so pls amend the doc to say that.

needs a versionchanged tag 1.2

bins : integer or iterable of numeric, optional
Rather than count individual values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as number of bins and will use `pd.cut`.
Contributor

update this doc-string the same way (in theory these could be shared, but that's another day)

Bins can also be an iterable of numbers. These numbers are treated
as endpoints for the intervals.

>>> s.value_counts(bins=[0,2,4,9])
Contributor

use spaces between the bins

@DataInformer (Author)

wow this is a huge amount of change. It would be really nice to do this in smaller pieces, otherwise review is going to take an extended amount of time.

I tried to do this request in pieces (see June 27 comments), but because the existing tests compare the results of Series.value_counts with SeriesGroupby.value_counts, it was impossible to fix only one of these and still have tests pass.

@DataInformer (Author)

I know it's a lot to go through, but is there a time you think you'll get to the full review? I can merge master again when you're ready.

@jbrockmendel (Member)

I'm still a bit confused by the results from the benchmarking (pasted below). In particular, I don't see how my changes would have made some of the methods faster, when I didn't change those, which casts doubt on the results showing that value_counts is significantly slower.

It is, unfortunately, normal to get a lot of noise in asv results. Two main ways to handle it: 1) re-run and manually discard results that are not consistent(ish) across runs; 2) use %timeit on targeted snippets.
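A minimal stdlib illustration of the targeted-timing approach (the workload here is made up for the example; in practice you would time the actual pandas snippet, e.g. a groupby value_counts call): run the snippet several times and take the minimum across repeats to damp machine noise.

```python
import timeit

# Compare two ways of counting values on plain-Python data, as a
# stand-in for timing a real pandas snippet under investigation.
setup = "from collections import Counter; data = list(range(1000)) * 10"
naive = "counts = {}\nfor x in data:\n    counts[x] = counts.get(x, 0) + 1"
counter = "Counter(data)"

# min over repeats is the conventional noise-resistant summary.
t_naive = min(timeit.repeat(naive, setup=setup, number=20, repeat=5))
t_counter = min(timeit.repeat(counter, setup=setup, number=20, repeat=5))
print(f"dict.get loop: {t_naive:.4f}s  Counter: {t_counter:.4f}s")
```

The same idea applies in IPython with `%timeit`, which handles the repeat-and-summarize loop for you.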

as endpoints for the intervals.

>>> s.value_counts(bins=[0, 2, 4, 9])
(2.0, 4.0] 3
Member

is there a space missing here?

if normalize:
result = result / float(counts.sum())
counts = result._values
result = result / float(max(counts.sum(), 1))
Member

it looks like part of this PR is about normalize and another part is about dropna. Could these be split into independent pieces?
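As an aside on the diff above, the `max(counts.sum(), 1)` denominator is a divide-by-zero guard; a minimal sketch of the same idea in plain Python (function name is illustrative, not from the PR):

```python
def normalize_counts(counts):
    """Dividing by max(total, 1) keeps an empty or all-zero counts
    sequence from raising ZeroDivisionError, while leaving any
    nonzero total unchanged."""
    total = sum(counts)
    return [c / max(total, 1) for c in counts]
```

For example, `normalize_counts([1, 3])` gives `[0.25, 0.75]`, while `normalize_counts([])` safely returns `[]`.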

@DataInformer (Author)

@jreback @jbrockmendel @WillAyd @simonjayhawkins is there someone who could have a 5min phone conversation with me about this PR and how to make it easier to evaluate? 858-205-8203
Thanks,
Evan

@jreback (Contributor) commented Oct 28, 2020

@DataInformer i won't have time to look at this for at least a few weeks

we just have quite a backlog

if @jbrockmendel can look in depth would be great

@mroeschke (Member)

Mind merging in master and resolving conflicts? It would help the review process

@DataInformer (Author)

Mind merging in master and resolving conflicts? It would help the review process

I'm happy to do so when you all are ready for a full review. Is that time now?

@mroeschke (Member)

It will be easier for the core devs to perform a full review with an updated pull request: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#tips-for-a-successful-pull-request

@DataInformer (Author)

It will be easier for the core devs to perform a full review with an updated pull request: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#tips-for-a-successful-pull-request

I definitely want to make things easy for the devs, but I've gotten very sporadic feedback about this PR and I prefer not to repeatedly merge dev or respond to comments made weeks apart. What can I do to make this easier to review while also being reasonable for me?

@mroeschke (Member)

Sorry for the sporadic feedback; since pandas is a community project, the core devs' time commitments can be all over the place.

The best approach would be to keep the PR up to date with master and ping the devs that have responded already to this PR when you're ready for another review.

@simonjayhawkins (Member)

@DataInformer Thanks for the PR. closing as stale. ping if you want to continue and will reopen

Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby Stale

Successfully merging this pull request may close these issues.

Dataframe Groupby value_counts with bins parameter
value_counts unexpected behaviour - bins and dropna
9 participants