Skip to content

BUG: incorrect EA casting in groubpy.agg #38254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
99073b8
REF: consolidate casting
jbrockmendel Dec 1, 2020
1ed9d8a
lighter-weight casting
jbrockmendel Dec 1, 2020
21b4f0f
simplify casting
jbrockmendel Dec 1, 2020
56b42bb
REF: minimize python_agg_general groupby casting
jbrockmendel Dec 2, 2020
e01487f
Handle Float64
jbrockmendel Dec 3, 2020
cf5bd53
TST: PeriodDtype
jbrockmendel Dec 3, 2020
02bffdb
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 3, 2020
f950c75
dont cast
jbrockmendel Dec 4, 2020
0544c8b
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 4, 2020
5a30b93
whatsnew
jbrockmendel Dec 4, 2020
3410102
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 8, 2020
b481049
move whatsnew to 1.3.0
jbrockmendel Dec 8, 2020
2047f3c
CLN: remove unused import
jbrockmendel Dec 8, 2020
b96835f
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 8, 2020
7bc78eb
retain Float64
jbrockmendel Dec 8, 2020
643fab6
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 11, 2020
eddb089
fix test
jbrockmendel Dec 11, 2020
2572dda
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 4, 2021
5590ad7
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 5, 2021
5b73850
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 12, 2021
896a97a
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 16, 2021
5e611b2
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 19, 2021
961304f
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 21, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 2 additions & 3 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ class providing the base-class of operations.
from pandas.errors import AbstractMethodError
from pandas.util._decorators import Appender, Substitution, cache_readonly, doc

from pandas.core.dtypes.cast import maybe_cast_result, maybe_downcast_to_dtype
from pandas.core.dtypes.cast import maybe_downcast_to_dtype
from pandas.core.dtypes.common import (
ensure_float,
is_bool_dtype,
Expand Down Expand Up @@ -1188,8 +1188,7 @@ def _python_agg_general(self, func, *args, **kwargs):

if self.grouper._filter_empty_groups:
mask = counts.ravel() > 0

# since we are masking, make sure that we have a float object
# since we are masking, make sure that we have a float dtype
values = result
if is_numeric_dtype(values.dtype):
values = ensure_float(values)
Expand Down
15 changes: 14 additions & 1 deletion pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,7 @@
is_extension_array_dtype,
is_integer_dtype,
is_numeric_dtype,
is_object_dtype,
is_period_dtype,
is_sparse,
is_timedelta64_dtype,
Expand Down Expand Up @@ -725,7 +726,19 @@ def _aggregate_series_pure_python(self, obj: Series, func: F):
result[label] = res

result = lib.maybe_convert_objects(result, try_float=0)
result = maybe_cast_result(result, obj, numeric_only=True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you doing this here, rather than inside maybe_cast_results itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I'm trying to get rid of maybe_cast_result altogether. It uses maybe_cast_to_extension_array whereas we should be casting yoda-style (do or do not)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(as mentioned in the OP, id actually rather remove this chunk of code entirely)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure but I'd rather NOT do it in groupby at all (any casting like this), and instead push it to dtypes/cast.py this seems like going backwards.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you're good with just ripping this out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and instead push it to dtypes/cast.py this seems like going backwards.

depends on how we think of dtypes.cast. I think of it as "low-level helper functions related to casting" NOT "anything related to casting". I don't want to put groupby-specific casting code in there. (i also dont like having DataFrame.reset_index code in there)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and instead push it to dtypes/cast.py this seems like going backwards.

depends on how we think of dtypes.cast. I think of it as "low-level helper functions related to casting" NOT "anything related to casting". I don't want to put groupby-specific casting code in there. (i also dont like having DataFrame.reset_index code in there)

i don't think virtually any casting code should be in groupby / frame. but we have to put it somewhere (and of course ideally there isn't any special casing on the class of the parent container).

so i think its the lesser of evils to keep it all together in dtypes/cast.py potentially allowing for re-use / refactoring.

if is_numeric_dtype(obj.dtype):
# Needed to cast float64 back to Float64
result = maybe_cast_result(result, obj, numeric_only=True)

elif is_object_dtype(result.dtype) and is_extension_array_dtype(obj.dtype):
# FIXME: kludge; we have tests for DecimalArray that get here
# but they only work because DecimalArray._from_sequence is
# strict in what inputs it accepts, which we cannot rely on.
inferred = lib.infer_dtype(result, skipna=False)
if inferred == obj.dtype.name == "decimal":
cls = obj.dtype.construct_array_type()
result = cls._from_sequence(result)

return result, counts

Expand Down
7 changes: 6 additions & 1 deletion pandas/tests/groupby/aggregate/test_other.py
Original file line number Diff line number Diff line change
Expand Up @@ -432,10 +432,13 @@ def test_agg_over_numpy_arrays():
tm.assert_frame_equal(result, expected)


def test_agg_tzaware_non_datetime_result():
@pytest.mark.parametrize("as_period", [True, False])
def test_agg_tzaware_non_datetime_result(as_period):
# discussed in GH#29589, fixed in GH#29641, operating on tzaware values
# with function that is not dtype-preserving
dti = pd.date_range("2012-01-01", periods=4, tz="UTC")
if as_period:
dti = dti.tz_localize(None).to_period("D")
df = DataFrame({"a": [0, 0, 1, 1], "b": dti})
gb = df.groupby("a")

Expand All @@ -454,6 +457,8 @@ def test_agg_tzaware_non_datetime_result():
result = gb["b"].agg(lambda x: x.iloc[-1] - x.iloc[0])
expected = Series([pd.Timedelta(days=1), pd.Timedelta(days=1)], name="b")
expected.index.name = "a"
if as_period:
expected = expected.astype(object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the correct expected result? When subtracting Periods, you get offset objects, not Timedelta objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. probably not great that tm.assert_series_equal can't tell the difference either

tm.assert_series_equal(result, expected)


Expand Down