Value counts normalize #33652
Conversation
Hello @DataInformer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-09-14 02:11:04 UTC
will look soon
@@ -190,6 +190,14 @@ def test_value_counts_bins(index_or_series):

    assert s.nunique() == 0

    # handle normalizing bins with NA's properly
make a new test
Just to make sure I understand: are you saying this is a badly written test, so I should make a different one? Or are you saying add an additional test beyond this one?
test_value_counts_bins is already doing too much. make this a separate test.
done
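As an aside, a minimal sketch of what that standalone test could look like; the data, bin edges, and the sum-to-one assertion are my own illustration of the GH 25970 report (normalized bin counts should not include dropped NaNs in the denominator) and are not code taken from this PR:

import numpy as np
import pandas as pd


def test_value_counts_bins_normalize_with_nans():
    # NaN falls in no bin, so it should also be excluded from the
    # denominator that normalize divides by
    ser = pd.Series([1, 2, 2, np.nan])
    result = ser.value_counts(bins=[0, 1, 2, 3], normalize=True)
    assert np.isclose(result.sum(), 1.0)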
doc/source/whatsnew/v1.0.3.rst
Outdated
@@ -22,6 +22,8 @@ Fixed regressions

Bug fixes
~~~~~~~~~
Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Fixed Dataframe Groupby value_counts with bins (:issue:`32471`)
move this to 1.1
done
I'm not sure why the Web and docs check is failing. Looking through the output, I only see warnings (for my part, only about a block-quote issue that I think it is detecting).
I'm not positive but it could have something to do with https://pandas.pydata.org/docs/development/contributing_docstring.html
Right, that's what I thought, but I don't see any block quotes without blank lines. I was hoping maybe someone could help me identify more specifically what the problem is.
doc/source/whatsnew/v1.0.3.rst
Outdated
@@ -23,6 +23,7 @@ Fixed regressions
Bug fixes
~~~~~~~~~

Can you revert unrelated changes? Looks like blank space and file permissions were changed here
@DataInformer can you address this
can you do this.
Sorry, I thought I did this before, but apparently it reverted. Should be undone again now.
pandas/core/groupby/generic.py
Outdated
if is_integer_dtype(out):
    out = ensure_int64(out)
return Series(out, index=mi, name=self._selection_name)
return self.apply(
This is most likely significantly slower than the existing implementation - can you run the appropriate groupby benchmarks to check?
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -434,7 +434,8 @@ Performance improvements

Bug fixes
~~~~~~~~~

Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
can you move this down into the Numeric section. starts at L482
Done
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -434,7 +434,8 @@ Performance improvements

Bug fixes
~~~~~~~~~

Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Fixed Dataframe Groupby value_counts with bins (:issue:`32471`)
can you move this down into the Groupby/resample/rolling section. starts on L596.
Force-pushed from 27c9856 to 99b7112.
However, I had only run the base tests, and the groupby tests now cause conflicting results, since Series.value_counts is doing the right thing but DataFrame.groupby.value_counts is not. It seems wrong to change the test, but I have temporarily reduced the normalize and dropna parameters to skip the problematic cases. Would you rather I combine the fixes again as a single pull request?

I'm still a bit confused by the results from the benchmarking (pasted below). In particular, I don't see how my changes would have made some of the methods faster, when I didn't change those, which casts doubt on the results showing that value_counts is significantly slower. My code handles a couple of special cases separately to avoid the identified bugs, but I would not expect it to be more than 20% slower. 'asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods'
@DataInformer is this still active? If so can you merge master and try to get CI green?

Yes, this is active. I merged master again and checks pass now. Hopefully we're good to close this out.
Thanks @DataInformer can you move release note to 1.2
It also looks like file modes were changed, and codecov is reporting several sections of code not covered by tests.
doc/source/whatsnew/v1.0.3.rst
Outdated
@@ -23,6 +23,7 @@ Fixed regressions
Bug fixes
~~~~~~~~~

can you do this.
if dropna:
    mask = ~isna(val)
    if not mask.all():
        ids, val = ids[mask], val[mask]
codecov reports no testing here
Most features are tested in test_series_groupby_value_counts, which is parameterized and includes dropna as a parameter. That should address the below case as well (since it includes seed_nans in the dataframe generation). I could add specific tests for these cases, but it seems like they are covered in the existing tests.
@simonjayhawkins IIUC codecov gets results from the travis-37-cov build, which runs with ``not slow``, so we wouldn't expect GroupBy.value_counts to show up there.
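For context, a simplified sketch of the comparison pattern that test relies on, as I understand it (the real test generates random frames with seeded NaNs and many more parameter combinations; the data here are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, np.nan, 1.0, 2.0]})

for dropna in (True, False):
    for normalize in (True, False):
        grouped = df.groupby("key")["val"].value_counts(dropna=dropna, normalize=normalize)
        # expected result: apply Series.value_counts group by group
        per_group = df.groupby("key")["val"].apply(
            lambda s: s.value_counts(dropna=dropna, normalize=normalize)
        )
        # with both fixes in place the two results should agree
        print(dropna, normalize, grouped.sort_index().equals(per_group.sort_index()))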
if (not dropna) and (-1 in val_lab):
    # in this case we need to explicitly add NaN as a level
    val_lev = np.r_[Index([np.nan]), val_lev]
    val_lab += 1
codecov reports no testing for this
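For readers unfamiliar with the -1 convention used in this hunk, pandas factorization marks missing values with the sentinel code -1, which is why the branch has to prepend NaN as an explicit level and shift the labels when dropna=False. A small illustrative snippet (not taken from the PR):

import numpy as np
import pandas as pd

codes, uniques = pd.factorize(np.array([1.0, np.nan, 2.0, 1.0]))
print(codes)    # [ 0 -1  1  0] -> NaN gets the sentinel code -1
print(uniques)  # [1. 2.]       -> NaN is not a level of its own

# to keep NaN countable, prepend it as a level and shift the codes,
# analogous to the np.r_ / val_lab += 1 lines above
uniques_with_na = np.r_[[np.nan], uniques]
codes_with_na = codes + 1
print(uniques_with_na[codes_with_na])  # [ 1. nan  2.  1.]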
wow this is a huge amount of change. It would be really nice to do this in smaller pieces, otherwise review is going to take an extended amount of time.
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -247,6 +247,7 @@ Numeric
^^^^^^^
- Bug in :func:`to_numeric` where float precision was incorrect (:issue:`31364`)
- Bug in :meth:`DataFrame.any` with ``axis=1`` and ``bool_only=True`` ignoring the ``bool_only`` keyword (:issue:`32432`)
- Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`)
Can you change this to :meth:`Series.value_counts`, otherwise it won't render; also put ``dropna=False`` in double-backticks.
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -315,7 +316,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
-
- Fixed Dataframe Groupby value_counts with bins (:issue:`32471`)
Not sure what this means, so expand on this, and use a fully-qualified reference for value_counts: :meth:`DataFrameGroupby.value_counts`.
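For reference, the call pattern that :issue:`32471` covers looks roughly like this; the data and bin edges are illustrative, not taken from the issue:

import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "a", "b", "b"], "x": [1, 2, 3, 4, 5]})

# per-group binned counts; before the fix this path could give
# incorrect results when bins are passed (see GH 32471)
print(df.groupby("grp")["x"].value_counts(bins=[0, 3, 6]))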
pandas/core/algorithms.py
Outdated
bins : integer or iterable of numeric, optional
    Rather than count values, group them into half-open bins.
    Only works with numeric data.
    If int, interpreted as number of bins and will use pd.cut.
hmm this will use pd.cut either way, so pls amend the doc to say that.
needs a versionchanged tag 1.2
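The reviewer's point is visible from a small example: an integer and an iterable of edges both go through pd.cut-style binning and both return an interval-labelled result (the data here are illustrative):

import pandas as pd

s = pd.Series([1, 2, 2, 3, 5, 8])

# integer bins: equal-width edges are chosen for you
print(s.value_counts(bins=3))

# iterable bins: the numbers are used as the half-open interval edges
print(s.value_counts(bins=[0, 2, 4, 9]))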
pandas/core/base.py
Outdated
bins : integer or iterable of numeric, optional
    Rather than count individual values, group them into half-open bins.
    Only works with numeric data.
    If int, interpreted as number of bins and will use `pd.cut`.
update this doc-string the same way (in theory these could be shared, but that's another day)
pandas/core/base.py
Outdated
Bins can also be an iterable of numbers. These numbers are treated
as endpoints for the intervals.

>>> s.value_counts(bins=[0,2,4,9])
use spaces between the bins
I tried to do this request in pieces (see June 27 comments), but because the existing tests compare the results of Series.value_counts with SeriesGroupby.value_counts, it was impossible to fix only one of these and still have tests pass.

I know it's a lot to go through, but is there a time you think you'll get to the full review? I can merge master again when you're ready.

It is, unfortunately, normal to get a lot of noise in asv results. Two main ways to handle it: 1) re-run and manually discard results that are not consistent(ish) across runs, 2) use %timeit on targeted snippets.
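A sketch of the targeted-snippet approach, using the standard-library timeit module rather than the IPython %timeit magic (the data shape and the method being timed are illustrative):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": np.random.randint(0, 100, size=100_000),
        "val": np.random.randn(100_000),
    }
)

# time only the code path under discussion, on master and on the branch
t = timeit.timeit(lambda: df.groupby("key")["val"].value_counts(), number=20)
print(f"value_counts: {t / 20:.4f}s per call")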
as endpoints for the intervals.

>>> s.value_counts(bins=[0, 2, 4, 9])
(2.0, 4.0] 3
is there a space missing here?
if normalize:
    result = result / float(counts.sum())
    counts = result._values
    result = result / float(max(counts.sum(), 1))
It looks like part of this PR is about ``normalize`` and another part is about ``dropna``. Could these be split into independent pieces?
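On the normalize side, my reading of the hunk above is that the max(..., 1) guard covers the edge case where every value is NA and gets dropped, so the counts sum to zero. A minimal illustration of the arithmetic, not of the pandas internals themselves:

import pandas as pd

counts = pd.Series([], dtype="int64")  # e.g. all values were NA and dropped

# counts.sum() is 0 here; the guard keeps the denominator at least 1
normalized = counts / float(max(counts.sum(), 1))
print(normalized)  # empty Series, no zero denominator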
@jreback @jbrockmendel @WillAyd @simonjayhawkins is there someone who could have a 5min phone conversation with me about this PR and how to make it easier to evaluate? 858-205-8203
@DataInformer I won't have time to look at this for at least a few weeks; we just have quite a backlog. If @jbrockmendel can look in depth that would be great.

Mind merging in master and resolving conflicts? It would help the review process.

I'm happy to do so when you all are ready for a full review. Is that time now?

It will be easier for the core devs to perform a full review with an updated pull request: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#tips-for-a-successful-pull-request

I definitely want to make things easy for the devs, but I've gotten very sporadic feedback about this PR, and I prefer not to repeatedly merge dev or respond to comments made weeks apart. What can I do to make this easier to review while also being reasonable for me?

Sorry for the sporadic feedback; since pandas is a community project, the core devs' time commitments can be all over the place. The best approach would be to keep the PR up to date with master and ping the devs that have responded already to this PR when you're ready for another review.

@DataInformer Thanks for the PR. Closing as stale. Ping if you want to continue and we will reopen.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This pull request resolves issues with binning and NA values in both Series.value_counts and SeriesGroupBy.value_counts, adding new tests to check the problematic cases.
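Putting the two pieces together, the intended end state as I read the PR description is that normalized counts treat NA values the same way on the Series path and on the grouped path. The data below are my illustration, and the comments describe the behaviour this PR proposes rather than any particular released version:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, 2.0, np.nan])

# Series path: with the fix, normalize excludes the NA value from the denominator
print(s.value_counts(normalize=True, dropna=False))

# grouped path: the same data within one group should give the same proportions
df = pd.DataFrame({"g": ["a", "a", "a", "a"], "v": s})
print(df.groupby("g")["v"].value_counts(normalize=True, dropna=False))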