Skip to content

Some refinements #5

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

rhshadrach
Copy link

@rhshadrach rhshadrach commented Nov 30, 2021

I've done some work and wanted to share, this seemed like the easiest way. No need to merge, just steal whatever you like. It is certainly a work in progress. The main changes are moving away from name-based logic to positional logic. This allows the method to work when the columns have duplicate or unexpected names, e.g.

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8], 'level_1': [9, 10]})
df.columns = list('abbd') + ['level_1']
# I think this fails on your branch for two reasons - duplicate names for "b" and "level_1" causing issues
df.groupby(['a', [0, 1], 'd']).value_counts()

It also avoids alignment when normalize is True, giving somewhat of a speedup in, e.g.

size, bins = 100000, 10
df = pd.DataFrame({k: np.random.randint(bins, size=size) for k in 'abcdef'})
%timeit df.groupby(['a', 'b']).value_counts(normalize=True)

gives

46.2 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.3 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Finally, it simplifies some of the logic involved with dropna and as_index.

@rhshadrach
Copy link
Author

@johnzangwill - friendly ping.

@johnzangwill johnzangwill merged commit 92c718b into johnzangwill:DataFrameGroupBy.value_counts Dec 2, 2021
@johnzangwill
Copy link
Owner

Thanks, Richard.
Your code does simplify a few things and solve a few problems, so I just merged it in. Hopefully that makes cooperation simpler...
I added test_column_name_clashes with your example data, but only with as_index=True.
Both these examples fail with as_index=False, because Series.reset_index() fails.
Do you know how to make reset_index() cope with duplicated column names?
Perhaps this is a bug, since Pandas is supposed to allow duplicates by default? Surely frame.py line 5850 should be passing the allows_duplicate_labels flag?

This also raises the question: just how pathological data do I need to support?...

@johnzangwill
Copy link
Owner

There will be another issue with this PR: just how to make Jeff happy!
I am concentrating on your issues and advice for the moment. That is to improve the code and toughen up the tests.
But at some point Jeff will need convincing that any of this is necessary...

@rhshadrach
Copy link
Author

@johnzangwill Certainly. I'm not concerned - it appears to me what you're doing is the correct way. Explaining why this is novel to pandas (no other op groups by all the remaining columns!) will go a long way, and I'll be happy to help out there. Wanted to get things in a good state first though.

@rhshadrach
Copy link
Author

I'll take a look at the reset index in the next few days.

@johnzangwill
Copy link
Owner

johnzangwill commented Dec 3, 2021

I changed frame.py to cope with duplicate column labels. This makes Series.reset_index() work in the case that duplicate labels result in the frame. Your cases then work with as_index=False.
Clearly, this potentially changes the rest of Pandas... Overall, it causes some warnings to change and some warnings to not appear.
I temporarily changed 3 tests to cope and switched my PR to Draft.
Green tick, so all the other tests pass!
But this came up recently in pandas-dev#44410 and it looks like it was decided to just document it.
My change fixes the "Reproducible Example" in pandas-dev#44410 and gives the "Expected Behaviour". So it would close that Issue.
So I am not sure quite what to do next.
In any case, I imagine that my change to frame.py should probably be in a separate PR. No?

@johnzangwill
Copy link
Owner

@rhshadrach I am not sure that I am winning this one pandas-dev#44755 (comment) which was started by you and your duplicate label examples. Please, either explain to Jeff why he is wrong, or tell me to just do it his way and forget about trapping duplicates in existing code. Thanks!

@rhshadrach rhshadrach deleted the DataFrameGroupBy.value_counts branch February 14, 2022 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants