
Value counts normalize #33652


Closed

Conversation

@DataInformer commented Apr 19, 2020

This pull request resolves issues with binning and NA values in both Series.value_counts and SeriesGroupBy.value_counts, adding new tests to check the problematic cases.
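For context on the normalize-with-bins behavior this PR targets, here is a toy sketch in plain Python (not pandas internals, and the function name is made up for illustration): values that fall outside every bin become NA and are excluded from the normalization denominator, so the reported proportions still sum to 1.

```python
from collections import Counter

def binned_proportions(values, edges):
    """Toy sketch of the intended semantics: bin values into half-open
    intervals (edges[i], edges[i+1]] and normalize by the number of values
    that actually landed in a bin, so out-of-range values do not inflate
    the denominator."""
    bins = list(zip(edges, edges[1:]))
    counts = Counter()
    for v in values:
        for lo, hi in bins:
            if lo < v <= hi:
                counts[(lo, hi)] += 1
                break  # values outside every bin are simply not counted
    denom = max(sum(counts.values()), 1)  # avoid division by zero on empty input
    return {b: counts.get(b, 0) / denom for b in bins}
```

With `binned_proportions([1, 2, 3, 10], [0, 2, 4])`, the value 10 falls outside both bins and is dropped, and the two bin proportions (2/3 and 1/3) sum to 1.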

@pep8speaks commented Apr 19, 2020

Hello @DataInformer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-14 02:11:04 UTC

@jreback (Contributor) left a comment

will look soon

@@ -190,6 +190,14 @@ def test_value_counts_bins(index_or_series):

assert s.nunique() == 0

# handle normalizing bins with NA's properly
Contributor

make a new test

Author

Just to make sure I understand: are you saying this is a badly written test, so I should replace it with a different one? Or are you saying I should add a test beyond this one?

Member

test_value_counts_bins is already doing too much. make this a separate test.

Author

done

@@ -22,6 +22,8 @@ Fixed regressions

Bug fixes
~~~~~~~~~
Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Fixed Dataframe Groupby value_counts with bins (:issue:`32471`)
Member

move this to 1.1

Author

done

@DataInformer (Author)

I'm not sure why the Web and docs check is failing. Looking through the output, I only see warnings (for my part, only a block-quote issue that I think is being detected on pd.cut).

@kevin-meyers

I'm not positive but it could have something to do with https://pandas.pydata.org/docs/development/contributing_docstring.html

/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/IPython/sphinxext/ipython_directive.py:1023: UserWarning: Code input with no code at /home/runner/work/pandas/pandas/doc/source/user_guide/computation.rst, line 622
  warnings.warn(message)
/home/runner/work/pandas/pandas/pandas/core/base.py:docstring of pandas.Index.value_counts:18: WARNING: Block quote ends without a blank line; unexpected unindent.
/home/runner/work/pandas/pandas/pandas/core/base.py:docstring of pandas.Series.value_counts:18: WARNING: Block quote ends without a blank line; unexpected unindent.
build finished with problems, 2 warnings.
##[error]Process completed with exit code 1.

@DataInformer (Author)

I'm not positive but it could have something to do with https://pandas.pydata.org/docs/development/contributing_docstring.html


Right, that's what I thought, but I don't see any block quotes without blank lines. I was hoping maybe someone could help me identify more specifically what the problem is.

@@ -23,6 +23,7 @@ Fixed regressions
Bug fixes
~~~~~~~~~


Member

Can you revert unrelated changes? Looks like blank space and file permissions were changed here

Member

@DataInformer can you address this

Member

can you do this.

Author

Sorry, I thought I did this before, but apparently it reverted. Should be undone again now.

if is_integer_dtype(out):
out = ensure_int64(out)
return Series(out, index=mi, name=self._selection_name)
return self.apply(
Member

This is most likely significantly slower than the existing implementation - can you run the appropriate groupby benchmarks to check?

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

@simonjayhawkins (Member) left a comment

Is the fix for #32471 dependent on the fix for #25970? If so is it possible to address #25970 independently as a pre-cursor PR?

@@ -434,7 +434,8 @@ Performance improvements

Bug fixes
~~~~~~~~~

Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Member

can you move this down into the Numeric section. starts at L482

Author

Done

@@ -434,7 +434,8 @@ Performance improvements

Bug fixes
~~~~~~~~~

Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Fixed Dataframe Groupby value_counts with bins (:issue:`32471`)
Member

can you move this down into the Groupby/resample/rolling section. starts on L596.

@simonjayhawkins simonjayhawkins added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby labels Jun 2, 2020
@DataInformer DataInformer force-pushed the value_counts_normalize branch from 27c9856 to 99b7112 Compare June 27, 2020 21:05
@DataInformer (Author)

Is the fix for #32471 dependent on the fix for #25970? If so is it possible to address #25970 independently as a pre-cursor PR?

I cleaned up the pull request so it has minimal changes only addressing #25970. I kept the branch name so this request history would remain.

However, I had only run the base tests, and the groupby tests now give conflicting results, since Series.value_counts is doing the right thing but DataFrame.groupby.value_counts is not. It seems wrong to change the test, but I have temporarily reduced the normalize and dropna parameters to skip the problematic cases. Would you rather I combine the fixes again as a single pull request?

@DataInformer (Author)

I'm still a bit confused by the results from the benchmarking (pasted below). In particular, I don't see how my changes would have made some of the methods faster, when I didn't change those, which casts doubt on the results showing that value_counts is significantly slower. My code handles a couple special cases separately to avoid the identified bugs, but I would not expect it to be more than 20% slower.
This run is with the computer idle (I get very different results while still using it, which makes sense). I reached out to the Pandas-dev alias to see if someone could advise on the performance issues, but didn't get any response. Do you have other suggestions for tracking down what might be causing these methods to run slower?

asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods
            before            after     ratio
         [04e9e0a]        [5abfb16]
<value_counts_part1~1^2> <value_counts_normalize>

  1.03±0.03ms   1.85±0.05ms   1.80  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'transformation')
  1.03±0.06ms    1.81±0.1ms   1.75  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'transformation')
  1.02±0.07ms   1.70±0.06ms   1.67  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'transformation')
   1.01±0.1ms   1.65±0.07ms   1.64  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'direct')
  1.05±0.04ms    1.64±0.2ms   1.56  groupby.GroupByMethods.time_dtype_as_group('datetime', 'value_counts', 'transformation')
  1.08±0.08ms   1.67±0.09ms   1.55  groupby.GroupByMethods.time_dtype_as_group('datetime', 'value_counts', 'direct')
  1.11±0.04ms   1.69±0.06ms   1.52  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'direct')
     962±20μs   1.41±0.03ms   1.47  groupby.GroupByMethods.time_dtype_as_field('object', 'value_counts', 'direct')
  1.09±0.04ms   1.55±0.08ms   1.42  groupby.GroupByMethods.time_dtype_as_field('object', 'value_counts', 'transformation')
  1.21±0.02ms   1.69±0.05ms   1.40  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'transformation')
  1.17±0.06ms   1.64±0.06ms   1.40  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'direct')
  1.28±0.09ms    1.72±0.1ms   1.35  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'transformation')
  1.23±0.03ms   1.66±0.06ms   1.34  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'direct')
  1.31±0.08ms    1.71±0.1ms   1.30  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'direct')
      191±9μs       241±9μs   1.26  groupby.GroupByMethods.time_dtype_as_group('object', 'head', 'transformation')
     865±50μs   1.08±0.05ms   1.25  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'transformation')
     947±20μs   1.17±0.06ms   1.24  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'direct')
     577±30μs      684±20μs   1.18  groupby.GroupByMethods.time_dtype_as_field('int', 'min', 'direct')
     595±30μs      698±30μs   1.17  groupby.GroupByMethods.time_dtype_as_field('float', 'quantile', 'transformation')
     703±70μs      635±30μs   0.90  groupby.GroupByMethods.time_dtype_as_group('datetime', 'min', 'direct')
      287±7μs      255±10μs   0.89  groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'direct')
  1.26±0.04ms   1.10±0.07ms   0.87  groupby.GroupByMethods.time_dtype_as_group('int', 'cumprod', 'direct')
     237±10μs       206±2μs   0.87  groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'direct')
      183±6μs      158±10μs   0.86  groupby.GroupByMethods.time_dtype_as_field('float', 'count', 'direct')
     516±20μs      440±20μs   0.85  groupby.GroupByMethods.time_dtype_as_field('object', 'nunique', 'direct')
     993±20μs      835±10μs   0.84  groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct')

@WillAyd (Member) commented Sep 10, 2020

@DataInformer is this still active? If so can you merge master and try to get CI green?

@DataInformer (Author)

@DataInformer is this still active? If so can you merge master and try to get CI green?

Yes, this is active. I merged master again and checks pass now. Hopefully we're good to close this out.

@simonjayhawkins (Member) left a comment

Thanks @DataInformer can you move release note to 1.2

it also looks like file modes were changed and codecov is reporting several sections of code not covered by tests.

@@ -23,6 +23,7 @@ Fixed regressions
Bug fixes
~~~~~~~~~


Member

can you do this.

if dropna:
mask = ~isna(val)
if not mask.all():
ids, val = ids[mask], val[mask]
Member

codecov reports no testing here

Author

Most features are tested in test_series_groupby_value_counts, which is parameterized and includes dropna as a parameter. That should address the below case as well, since it includes seed_nans in the dataframe generation. I could add specific tests for these cases, but it seems like they are covered by the existing tests.

Member

@simonjayhawkins IIUC codecov gets results from the travis-37-cov build, which runs with not slow, so we wouldn't expect GroupBy.value_counts to show up there

if (not dropna) and (-1 in val_lab):
# in this case we need to explicitly add NaN as a level
val_lev = np.r_[Index([np.nan]), val_lev]
val_lab += 1
Member

codecov reports no testing for this

@jreback (Contributor) left a comment

wow this is a huge amount of change. It would be really nice to do this in smaller pieces, otherwise review is going to take an extended amount of time.

@@ -247,6 +247,7 @@ Numeric
^^^^^^^
- Bug in :func:`to_numeric` where float precision was incorrect (:issue:`31364`)
- Bug in :meth:`DataFrame.any` with ``axis=1`` and ``bool_only=True`` ignoring the ``bool_only`` keyword (:issue:`32432`)
- Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Contributor

can you change this to :meth:`Series.value_counts`, otherwise it won't render; also put ``dropna=False`` in double backticks.

@@ -315,7 +316,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
-
- Fixed Dataframe Groupby value_counts with bins (:issue:`32471`)
Contributor

not sure what this means, so expand on this, and use a fully-qualified reference for value_counts
:meth:`DataFrameGroupby.value_counts`

bins : integer or iterable of numeric, optional
Rather than count values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as number of bins and will use pd.cut.
Contributor

hmm this will use pd.cut either way, so pls amend the doc to say that.

needs a versionchanged tag 1.2

bins : integer or iterable of numeric, optional
Rather than count individual values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as number of bins and will use `pd.cut`.
Contributor

update this doc-string the same way (in theory these could be shared, but that's another day)

Bins can also be an iterable of numbers. These numbers are treated
as endpoints for the intervals.

>>> s.value_counts(bins=[0,2,4,9])
Contributor

use spaces between the bins

@DataInformer (Author)

wow this is a huge amount of change. It would be really nice to do this in smaller pieces, otherwise review is going to take an extended amount of time.

I tried to do this request in pieces (see June 27 comments), but because the existing tests compare the results of Series.value_counts with SeriesGroupby.value_counts, it was impossible to fix only one of these and still have tests pass.

@DataInformer (Author)

I know it's a lot to go through, but is there a time you think you'll get to the full review? I can merge master again when you're ready.

@jbrockmendel (Member)

I'm still a bit confused by the results from the benchmarking (pasted below). In particular, I don't see how my changes would have made some of the methods faster, when I didn't change those, which casts doubt on the results showing that value_counts is significantly slower.

It is, unfortunately, normal to get a lot of noise in asv results. Two main ways to handle it: 1) re-run and manually discard results that are not consistent(ish) across runs; 2) use %timeit on targeted snippets.
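A minimal stdlib illustration of the targeted-timing approach (the workload here is made up for the example; in practice you would time the actual pandas snippet, e.g. a groupby value_counts call): run the snippet several times and take the minimum across repeats to damp machine noise.

```python
import timeit

# Compare two ways of counting values on plain-Python data, as a
# stand-in for timing a real pandas snippet under investigation.
setup = "from collections import Counter; data = list(range(1000)) * 10"
naive = "counts = {}\nfor x in data:\n    counts[x] = counts.get(x, 0) + 1"
counter = "Counter(data)"

# min over repeats is the conventional noise-resistant summary.
t_naive = min(timeit.repeat(naive, setup=setup, number=20, repeat=5))
t_counter = min(timeit.repeat(counter, setup=setup, number=20, repeat=5))
print(f"dict.get loop: {t_naive:.4f}s  Counter: {t_counter:.4f}s")
```

The same idea applies in IPython with `%timeit`, which handles the repeat-and-summarize loop for you.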

as endpoints for the intervals.

>>> s.value_counts(bins=[0, 2, 4, 9])
(2.0, 4.0] 3
Member

is there a space missing here?

if normalize:
result = result / float(counts.sum())
counts = result._values
result = result / float(max(counts.sum(), 1))
Member

it looks like part of this PR is about normalize and another part is about dropna. Could these be split into independent pieces?
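As an aside on the diff above, the `max(counts.sum(), 1)` denominator is a divide-by-zero guard; a minimal sketch of the same idea in plain Python (function name is illustrative, not from the PR):

```python
def normalize_counts(counts):
    """Dividing by max(total, 1) keeps an empty or all-zero counts
    sequence from raising ZeroDivisionError, while leaving any
    nonzero total unchanged."""
    total = sum(counts)
    return [c / max(total, 1) for c in counts]
```

For example, `normalize_counts([1, 3])` gives `[0.25, 0.75]`, while `normalize_counts([])` safely returns `[]`.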

@DataInformer (Author)

@jreback @jbrockmendel @WillAyd @simonjayhawkins is there someone who could have a 5min phone conversation with me about this PR and how to make it easier to evaluate? 858-205-8203
Thanks,
Evan

@jreback (Contributor) commented Oct 28, 2020

@DataInformer i won't have time to look at this for at least a few weeks

we just have quite a backlog

if @jbrockmendel can look in depth would be great

@mroeschke (Member)

Mind merging in master and resolving conflicts? It would help the review process

@DataInformer (Author)

Mind merging in master and resolving conflicts? It would help the review process

I'm happy to do so when you all are ready for a full review. Is that time now?

@mroeschke (Member)

It will be easier for the core devs to perform a full review with an updated pull request: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#tips-for-a-successful-pull-request

@DataInformer (Author)

It will be easier for the core devs to perform a full review with an updated pull request: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#tips-for-a-successful-pull-request

I definitely want to make things easy for the devs, but I've gotten very sporadic feedback about this PR and I prefer not to repeatedly merge dev or respond to comments made weeks apart. What can I do to make this easier to review while also being reasonable for me?

@mroeschke (Member)

Sorry for the sporadic feedback; since pandas is a community project, the core devs' time commitments can be all over the place.

The best approach would be to keep the PR up to date with master and ping the devs that have responded already to this PR when you're ready for another review.

@simonjayhawkins (Member)

@DataInformer Thanks for the PR. closing as stale. ping if you want to continue and will reopen

Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby Stale

Successfully merging this pull request may close these issues.

Dataframe Groupby value_counts with bins parameter
value_counts unexpected behaviour - bins and dropna
9 participants