Some refinements #5

rhshadrach · 2021-11-30T00:42:54Z

I've done some work and wanted to share; this seemed like the easiest way. No need to merge, just steal whatever you like. It is certainly a work in progress. The main change is moving from name-based logic to positional logic. This allows the method to work when the columns have duplicate or unexpected names, e.g.

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8], 'level_1': [9, 10]})
# Rename the columns so that "b" appears twice
df.columns = list('abbd') + ['level_1']
# I think this fails on your branch for two reasons - duplicate names for "b" and "level_1" causing issues
df.groupby(['a', [0, 1], 'd']).value_counts()
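
To illustrate why positional logic is more robust here, the following is a minimal sketch (not code from either branch) showing how label-based and position-based selection differ once a column name is duplicated. The variable names are illustrative only.

```python
import pandas as pd

# Same shape as the example above: two rows, five columns, "b" duplicated.
df = pd.DataFrame([[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]])
df.columns = list("abbd") + ["level_1"]

# Label-based lookup is ambiguous: "b" matches two columns,
# so a DataFrame comes back instead of a Series.
by_label = df["b"]

# Positional lookup is unambiguous: each position is exactly one column.
second = df.iloc[:, 1]
third = df.iloc[:, 2]

print(type(by_label).__name__, by_label.shape)
print(second.tolist(), third.tolist())
```

Any logic that assumes `df[name]` yields a single column can break on such a frame, whereas `iloc` keeps working.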

It also avoids alignment when normalize is True, giving somewhat of a speedup in, e.g.

import numpy as np
import pandas as pd

size, bins = 100000, 10
df = pd.DataFrame({k: np.random.randint(bins, size=size) for k in 'abcdef'})
# Run in IPython/Jupyter (%timeit is an IPython magic)
%timeit df.groupby(['a', 'b']).value_counts(normalize=True)

gives

46.2 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.3 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
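
As a rough sketch of what "avoiding alignment" can look like (an assumption about the general idea, not necessarily how this branch implements it), the per-group totals can be laid out row-for-row with the counts and then divided as plain NumPy arrays:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({k: rng.integers(10, size=1_000) for k in "abc"})

# Count each (a, b, c) combination.
counts = df.groupby(["a", "b", "c"]).size()

# transform("sum") returns the (a, b) total for every row of `counts`,
# in the same order as `counts`.
totals = counts.groupby(level=["a", "b"]).transform("sum")

# Dividing the underlying arrays involves no index alignment.
proportions = pd.Series(counts.to_numpy() / totals.to_numpy(), index=counts.index)

# Sanity check: proportions within each (a, b) group sum to 1.
group_sums = proportions.groupby(level=["a", "b"]).sum()
print(np.allclose(group_sums.to_numpy(), 1.0))
```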

Finally, it simplifies some of the logic involved with dropna and as_index.

rhshadrach · 2021-12-01T03:10:17Z

@johnzangwill - friendly ping.

johnzangwill · 2021-12-02T14:19:03Z

Thanks, Richard.
Your code does simplify a few things and solve a few problems, so I just merged it in. Hopefully that makes cooperation simpler...
I added test_column_name_clashes with your example data, but only with as_index=True.
Both these examples fail with as_index=False, because Series.reset_index() fails.
Do you know how to make reset_index() cope with duplicated column names?
Perhaps this is a bug, since Pandas is supposed to allow duplicates by default? Surely frame.py line 5850 should be passing the allows_duplicate_labels flag?
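
A minimal sketch of the behaviour in question, using a hypothetical Series rather than the actual groupby result: the object itself permits duplicate labels by default, yet `reset_index()` refuses to create a duplicate column.

```python
import pandas as pd

# A Series whose name matches its index level name: resetting the index
# would produce two columns both called "b".
s = pd.Series([10, 20], index=pd.Index([1, 2], name="b"), name="b")

# Duplicate labels are permitted on the object by default...
allows = s.flags.allows_duplicate_labels

# ...but reset_index() with default arguments still raises.
try:
    s.reset_index()
    raised = False
except ValueError as err:
    raised = True
    print(err)

print(allows, raised)
```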

This also raises the question: just how pathological does the data that I need to support get?...

johnzangwill · 2021-12-02T14:27:14Z

There will be another issue with this PR: just how to make Jeff happy!
I am concentrating on your issues and advice for the moment, that is, improving the code and toughening up the tests.
But at some point Jeff will need convincing that any of this is necessary...

rhshadrach · 2021-12-02T22:02:21Z

@johnzangwill Certainly. I'm not concerned - it appears to me what you're doing is the correct way. Explaining why this is novel to pandas (no other op groups by all the remaining columns!) will go a long way, and I'll be happy to help out there. Wanted to get things in a good state first though.
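
To make "groups by all the remaining columns" concrete, here is a conceptual sketch with made-up data. It is name-based for readability, so unlike the positional logic discussed above it would not cope with duplicate column names, and it is not the PR's implementation.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 1, 1, 2], "b": [0, 0, 1, 1], "c": ["x", "x", "y", "y"]}
)

keys = ["a"]
# Every column that is not a grouping key also becomes a grouping key.
remaining = [col for col in df.columns if col not in keys]

# Counting rows per (keys + remaining) combination gives the per-group value counts.
counts = df.groupby(keys + remaining).size()
print(counts)
```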

rhshadrach · 2021-12-02T22:02:42Z

I'll take a look at the reset index in the next few days.

johnzangwill · 2021-12-03T12:09:40Z

I changed frame.py to cope with duplicate column labels. This makes Series.reset_index() work when the resulting frame would have duplicate labels. Your cases then work with as_index=False.
Clearly, this potentially changes the rest of Pandas... Overall, it causes some warnings to change and some warnings to not appear.
I temporarily changed 3 tests to cope and switched my PR to Draft.
Green tick, so all the other tests pass!
But this came up recently in pandas-dev#44410 and it looks like it was decided to just document it.
My change fixes the "Reproducible Example" in pandas-dev#44410 and gives the "Expected Behaviour". So it would close that Issue.
So I am not sure quite what to do next.
In any case, I imagine that my change to frame.py should probably be in a separate PR. No?

johnzangwill · 2021-12-23T20:27:21Z

@rhshadrach I am not sure that I am winning this one: pandas-dev#44755 (comment), which was started by you and your duplicate label examples. Please either explain to Jeff why he is wrong, or tell me to just do it his way and forget about trapping duplicates in existing code. Thanks!

Some refinements

085e8c9

johnzangwill merged commit 92c718b into johnzangwill:DataFrameGroupBy.value_counts Dec 2, 2021

johnzangwill mentioned this pull request Dec 4, 2021

ENH: Add DataFrameGroupBy.value_counts pandas-dev/pandas#44267

Merged

4 tasks

rhshadrach deleted the DataFrameGroupBy.value_counts branch February 14, 2022 17:26
