Fixed Inconsistent GroupBy Output Shape with Duplicate Column Labels #29124

Merged: 61 commits, Nov 20, 2019
Commits
fd53827
Added any_all test
WillAyd Oct 17, 2019
6f60cd0
Added complete test for output shape of elements
WillAyd Oct 17, 2019
0aa1813
Fixed shape issues with bool aggs
WillAyd Oct 17, 2019
4af22f6
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 21, 2019
9756e74
Added test for transformation shape
WillAyd Oct 21, 2019
b675963
Changed transform output
WillAyd Oct 21, 2019
444d542
Updated test with required args
WillAyd Oct 21, 2019
98a9901
Fixed cummin issue
WillAyd Oct 21, 2019
c8648b1
Fixed ohlc one-off handling
WillAyd Oct 21, 2019
1626de1
Fixed output shape of nunique
WillAyd Oct 21, 2019
2a6b8d7
Fixed tshift
WillAyd Oct 21, 2019
12d1ca0
lint and black
WillAyd Oct 21, 2019
5a3fcd7
Added whatsnew
WillAyd Oct 21, 2019
dee597a
Quantile special case hack
WillAyd Oct 21, 2019
a2f1b64
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 22, 2019
fdb36f6
Test fix for np dev
WillAyd Oct 22, 2019
8975009
Used position as labels
WillAyd Oct 22, 2019
9adde1b
Code cleanup
WillAyd Oct 22, 2019
63b35f9
docstrings
WillAyd Oct 22, 2019
9eb7c73
style and more typing
WillAyd Oct 22, 2019
0e49bdb
Fixed parallel test collection
WillAyd Oct 22, 2019
2ad7632
numpy dev warning fix
WillAyd Oct 22, 2019
11fda39
More generic ohlc handling
WillAyd Oct 22, 2019
caf8f11
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 22, 2019
7c4bad9
Converted Dict -> Mapping
WillAyd Oct 22, 2019
b9dca96
doc fixups
WillAyd Oct 22, 2019
a878e67
numpy dev compat
WillAyd Oct 22, 2019
037f9af
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 22, 2019
dd3b1dc
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 23, 2019
a66d37f
Renamed index -> idx
WillAyd Oct 23, 2019
6d50448
Removed rogue TODO
WillAyd Oct 23, 2019
4dd8f5b
Added failing test for multiindex
WillAyd Oct 23, 2019
9d39862
MultiIndex support
WillAyd Oct 23, 2019
d6b197b
More exact typing
WillAyd Oct 23, 2019
3cfd1a2
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 29, 2019
16e9512
jbrockmendel feedback
WillAyd Oct 29, 2019
d297684
Removed failing type
WillAyd Oct 29, 2019
a234bed
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 31, 2019
e3959b0
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 2, 2019
391d106
Aligned annotations and comments
WillAyd Nov 2, 2019
c30ca82
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 6, 2019
3a78051
mypy fix
WillAyd Nov 6, 2019
f4f9e61
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 7, 2019
936591e
asserts and mypy fixes
WillAyd Nov 7, 2019
5dd131e
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 13, 2019
6cc1607
Fix issue in merge resolution
WillAyd Nov 13, 2019
d5ce753
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 17, 2019
ce97eff
Correct merge with 29629
WillAyd Nov 17, 2019
d1a92b4
Fixed issue with dict assignment
WillAyd Nov 17, 2019
23eb803
modernized annotations
WillAyd Nov 17, 2019
4aa9f4c
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 18, 2019
faa08c9
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 18, 2019
b07335b
Changed to assert, removed collections
WillAyd Nov 18, 2019
fb71185
Removed breakpoint and whatsnew space
WillAyd Nov 18, 2019
d7b84a2
Fixed issue with DTA
WillAyd Nov 18, 2019
7934422
typing fixup
WillAyd Nov 18, 2019
c8f0b19
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 19, 2019
acf22d3
packed key into namedtuple
WillAyd Nov 19, 2019
a0aae64
mypy fixes
WillAyd Nov 19, 2019
a9b411a
docstring cleanup
WillAyd Nov 19, 2019
51b8050
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 19, 2019
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -453,6 +453,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.agg` not able to use lambda function with named aggregation (:issue:`27519`)
- Bug in :meth:`DataFrame.groupby` losing column name information when grouping by a categorical column (:issue:`28787`)
- Bug in :meth:`DataFrameGroupBy.rolling().quantile()` ignoring ``interpolation`` keyword argument (:issue:`28779`)
- Bug in :meth:`DataFrame.groupby` where ``any``, ``all``, ``nunique`` and transform functions would incorrectly handle duplicate column labels (:issue:`21668`)

Reshaping
^^^^^^^^^
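The behavior this whatsnew entry describes can be seen in a minimal sketch (illustrative only, not taken from the PR's test suite): with duplicate column labels, transforms and boolean reductions should preserve every column rather than collapsing duplicates.

```python
import pandas as pd

# A frame with a duplicated column label -- the case this PR fixes.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])

# Transforms keep the input shape, duplicates included.
out = df.groupby([0, 0]).cumsum()
assert out.shape == df.shape
assert list(out.columns) == ["a", "a", "b"]

# Boolean reductions such as any/all keep all three columns too.
res = df.groupby([0, 0]).any()
assert list(res.columns) == ["a", "a", "b"]
```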
4 changes: 4 additions & 0 deletions pandas/core/groupby/base.py
@@ -3,8 +3,12 @@
hold the whitelist of methods that are exposed on the
SeriesGroupBy and the DataFrameGroupBy objects.
"""
import collections

from pandas.core.dtypes.common import is_list_like, is_scalar

OutputKey = collections.namedtuple("OutputKey", ["label", "position"])


class GroupByMixin:
"""
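The reason the PR keys outputs by ``OutputKey`` rather than by label alone: with duplicate column labels, plain labels collide as dict keys, while (label, position) pairs stay distinct. A minimal sketch of the idea:

```python
import collections

OutputKey = collections.namedtuple("OutputKey", ["label", "position"])

# Plain labels collide as dict keys when columns are duplicated:
by_label = {}
by_label["a"] = [1]
by_label["a"] = [2]  # silently overwrites the first column
assert len(by_label) == 1

# (label, position) pairs keep every column distinct:
output = {
    OutputKey(label="a", position=0): [1],
    OutputKey(label="a", position=1): [2],
}
assert len(output) == 2
assert [key.label for key in output] == ["a", "a"]
```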
175 changes: 137 additions & 38 deletions pandas/core/groupby/generic.py
@@ -10,7 +10,17 @@
from functools import partial
from textwrap import dedent
import typing
from typing import Any, Callable, FrozenSet, Iterable, Sequence, Type, Union, cast
from typing import (
Any,
Callable,
FrozenSet,
Iterable,
Mapping,
Sequence,
Type,
Union,
cast,
)

import numpy as np

@@ -309,28 +319,91 @@ def _aggregate_multiple_funcs(self, arg):

return DataFrame(results, columns=columns)

def _wrap_series_output(self, output, index, names=None):
""" common agg/transform wrapping logic """
output = output[self._selection_name]
def _wrap_series_output(
self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]], index: Index,
) -> Union[Series, DataFrame]:
"""
Wraps the output of a SeriesGroupBy operation into the expected result.

Parameters
----------
output : Mapping[base.OutputKey, Union[Series, np.ndarray]]
Data to wrap.
index : pd.Index
Index to apply to the output.

if names is not None:
return DataFrame(output, index=index, columns=names)
Returns
-------
Series or DataFrame

Notes
-----
In the vast majority of cases output and columns will only contain one
element. The exception is operations that expand dimensions, like ohlc.
"""
indexed_output = {key.position: val for key, val in output.items()}
columns = Index(key.label for key in output)

result: Union[Series, DataFrame]
if len(output) > 1:
result = DataFrame(indexed_output, index=index)
result.columns = columns
else:
name = self._selection_name
if name is None:
name = self._selected_obj.name
return Series(output, index=index, name=name)
result = Series(indexed_output[0], index=index, name=columns[0])

return result

def _wrap_aggregated_output(
self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]]
) -> Union[Series, DataFrame]:
"""
Wraps the output of a SeriesGroupBy aggregation into the expected result.

def _wrap_aggregated_output(self, output, names=None):
Parameters
----------
output : Mapping[base.OutputKey, Union[Series, np.ndarray]]
Data to wrap.

Returns
-------
Series or DataFrame

Notes
-----
In the vast majority of cases output will only contain one element.
The exception is operations that expand dimensions, like ohlc.
"""
result = self._wrap_series_output(
output=output, index=self.grouper.result_index, names=names
output=output, index=self.grouper.result_index
)
return self._reindex_output(result)._convert(datetime=True)

def _wrap_transformed_output(self, output, names=None):
return self._wrap_series_output(
output=output, index=self.obj.index, names=names
)
def _wrap_transformed_output(
self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]]
) -> Series:
"""
Wraps the output of a SeriesGroupBy transformation into the expected result.

Parameters
----------
output : dict[base.OutputKey, Union[Series, np.ndarray]]
Dict with a sole key (position 0) whose value holds the result values.

Returns
-------
Series

Notes
-----
output should always contain one element. It is specified as a dict
for consistency with DataFrame methods and _wrap_aggregated_output.
"""
assert len(output) == 1
result = self._wrap_series_output(output=output, index=self.obj.index)

# No transformations increase the ndim of the result
assert isinstance(result, Series)
return result

def _wrap_applied_output(self, keys, values, not_indexed_same=False):
if len(keys) == 0:
@@ -1082,17 +1155,6 @@ def _aggregate_item_by_item(self, func, *args, **kwargs) -> DataFrame:

return DataFrame(result, columns=result_columns)

def _decide_output_index(self, output, labels):
if len(output) == len(labels):
output_keys = labels
else:
output_keys = sorted(output)

if isinstance(labels, MultiIndex):
output_keys = MultiIndex.from_tuples(output_keys, names=labels.names)

return output_keys

def _wrap_applied_output(self, keys, values, not_indexed_same=False):
if len(keys) == 0:
return DataFrame(index=keys)
@@ -1559,27 +1621,62 @@ def _insert_inaxis_grouper_inplace(self, result):
if in_axis:
result.insert(0, name, lev)

def _wrap_aggregated_output(self, output, names=None):
agg_axis = 0 if self.axis == 1 else 1
agg_labels = self._obj_with_exclusions._get_axis(agg_axis)
def _wrap_aggregated_output(
self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]]
) -> DataFrame:
"""
Wraps the output of DataFrameGroupBy aggregations into the expected result.

output_keys = self._decide_output_index(output, agg_labels)
Parameters
----------
output : Mapping[base.OutputKey, Union[Series, np.ndarray]]
Data to wrap.

Returns
-------
DataFrame
"""
indexed_output = {key.position: val for key, val in output.items()}
columns = Index(key.label for key in output)

result = DataFrame(indexed_output)
result.columns = columns

if not self.as_index:
result = DataFrame(output, columns=output_keys)
self._insert_inaxis_grouper_inplace(result)
result = result._consolidate()
else:
index = self.grouper.result_index
result = DataFrame(output, index=index, columns=output_keys)
result.index = index

if self.axis == 1:
result = result.T

return self._reindex_output(result)._convert(datetime=True)
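The wrapping pattern used here (and in ``_wrap_series_output`` above) sidesteps the duplicate-label problem by constructing the result from integer positions and only assigning labels afterwards. A standalone sketch:

```python
import pandas as pd

# Build the frame keyed by position, then relabel. Positions are unique,
# so duplicate labels can no longer collapse entries during construction.
indexed_output = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
columns = pd.Index(["a", "a", "b"])

result = pd.DataFrame(indexed_output)
result.columns = columns
assert list(result.columns) == ["a", "a", "b"]
assert result.shape == (2, 3)
```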

def _wrap_transformed_output(self, output, names=None) -> DataFrame:
return DataFrame(output, index=self.obj.index)
def _wrap_transformed_output(
self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]]
) -> DataFrame:
"""
Wraps the output of DataFrameGroupBy transformations into the expected result.

Parameters
----------
output : Mapping[base.OutputKey, Union[Series, np.ndarray]]
Data to wrap.

Returns
-------
DataFrame
"""
indexed_output = {key.position: val for key, val in output.items()}
columns = Index(key.label for key in output)

Reviewer (Contributor): same comment as above, I think you need to sort the columns (index is ok)

result = DataFrame(indexed_output)
result.columns = columns
result.index = self.obj.index

return result

def _wrap_agged_blocks(self, items, blocks):
if not self.as_index:
@@ -1699,9 +1796,11 @@ def groupby_series(obj, col=None):
if isinstance(obj, Series):
results = groupby_series(obj)
else:
# TODO: this is duplicative of how GroupBy naturally works
# Try to consolidate with normal wrapping functions
from pandas.core.reshape.concat import concat

results = [groupby_series(obj[col], col) for col in obj.columns]
results = [groupby_series(content, label) for label, content in obj.items()]
results = concat(results, axis=1)
results.columns.names = obj.columns.names

@@ -1743,7 +1842,7 @@ def _normalize_keyword_aggregation(kwargs):
"""
Normalize user-provided "named aggregation" kwargs.

Transforms from the new ``Dict[str, NamedAgg]`` style kwargs
Transforms from the new ``Mapping[str, NamedAgg]`` style kwargs
to the old OrderedDict[str, List[scalar]].

Parameters
@@ -1764,7 +1863,7 @@ def _normalize_keyword_aggregation(kwargs):
>>> _normalize_keyword_aggregation({'output': ('input', 'sum')})
(OrderedDict([('input', ['sum'])]), ('output',), [('input', 'sum')])
"""
# Normalize the aggregation functions as Dict[column, List[func]],
# Normalize the aggregation functions as Mapping[column, List[func]],
# process normally, then fixup the names.
# TODO(Py35): When we drop python 3.5, change this to
# defaultdict(list)