-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG/API: prohibit dtype-changing IntervalArray.__setitem__ #32782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Is this linking back to the correct issue? The linkage there isn't immediately obvious to me (may be lack of understanding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. cc @jschendel
#27147 is about the data backing an IntervalArray (a pair of Index objects) being swapped out under certain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably have a whatsnew entry since it's a breaking change.
pandas/core/arrays/interval.py
Outdated
@@ -543,7 +543,11 @@ def __setitem__(self, key, value): | |||
msg = f"'value' should be an interval type, got {type(value)} instead." | |||
raise TypeError(msg) from err | |||
|
|||
if needs_float_conversion: | |||
raise ValueError("Cannot set float values for integer-backed IntervalArray") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The needs_float_conversion
name might be a bit misleading here because it's not triggered when setting arbitrary float values but rather just when np.nan
is included, so we're not generically disallowing setting float values. Maybe a more accurate error message would be along the lines "Cannot set float NaN to integer-backed IntervalArray"
.
FWIW the behavior when setting float values without np.nan
included is to truncate the float and just use the integer component, which is consistent with numpy's behavior:
In [1]: import pandas as pd; import numpy as np
In [2]: ia = pd.arrays.IntervalArray.from_breaks(range(5))
In [3]: ia[0] = pd.Interval(0.9, 1.1)
In [4]: ia
Out[4]:
<IntervalArray>
[(0, 1], (1, 2], (2, 3], (3, 4]]
Length: 4, closed: right, dtype: interval[int64]
In [5]: a = np.arange(4)
In [6]: a[0] = 0.9
In [7]: a[1] = 1.1
In [8]: a
Out[8]: array([0, 1, 2, 3])
I'm not crazy about that behavior but I'm fine with it since it's consistent with numpy.
@@ -543,7 +543,11 @@ def __setitem__(self, key, value): | |||
msg = f"'value' should be an interval type, got {type(value)} instead." | |||
raise TypeError(msg) from err | |||
|
|||
if needs_float_conversion: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're raising here we can remove the future references that are no longer being used:
pandas/pandas/core/arrays/interval.py
Lines 554 to 555 in 23b6b93
if needs_float_conversion: | |
left = left.astype("float") |
pandas/pandas/core/arrays/interval.py
Lines 560 to 561 in 23b6b93
if needs_float_conversion: | |
right = right.astype("float") |
I'm fine with prohibiting dtype changing here. Seems more consistent with existing behavior in pandas/numpy and makes the logic easier. It looks like the existing In [2]: ia = pd.arrays.IntervalArray.from_breaks(range(5))
In [3]: ia[:2] = [np.nan, pd.Interval(0, 5)]
In [4]: ia
Out[4]:
<IntervalArray>
[(-9223372036854775808, -9223372036854775808], (0, 5], (2, 3], (3, 4]]
Length: 4, closed: right, dtype: interval[int64] I think we can address this by shifting the logic around a little bit. The following diff addresses the issue locally for me and at first glance don't appear to break anything: diff --git a/pandas/core/arrays/interval.py b/pandas/core/arrays/interval.py
index 22ce5a6f8..2e6e4bd0c 100644
--- a/pandas/core/arrays/interval.py
+++ b/pandas/core/arrays/interval.py
@@ -513,18 +513,15 @@ class IntervalArray(IntervalMixin, ExtensionArray):
return self._shallow_copy(left, right)
def __setitem__(self, key, value):
- # na value: need special casing to set directly on numpy arrays
- needs_float_conversion = False
if is_scalar(value) and isna(value):
- if is_integer_dtype(self.dtype.subtype):
- # can't set NaN on a numpy integer array
- needs_float_conversion = True
- elif is_datetime64_any_dtype(self.dtype.subtype):
+ if is_datetime64_any_dtype(self.dtype.subtype):
# need proper NaT to set directly on the numpy array
value = np.datetime64("NaT")
elif is_timedelta64_dtype(self.dtype.subtype):
# need proper NaT to set directly on the numpy array
value = np.timedelta64("NaT")
+ else:
+ value = np.nan
value_left, value_right = value, value
# scalar interval
@@ -542,18 +539,18 @@ class IntervalArray(IntervalMixin, ExtensionArray):
msg = f"'value' should be an interval type, got {type(value)} instead."
raise TypeError(msg) from err
+ if is_integer_dtype(self.dtype.subtype) and np.any(isna(value_left)):
+ raise ValueError("Cannot set float NaN to integer-backed IntervalArray") |
@jbrockmendel can you rebase and does @jschendel comments make sense? |
rebased+green; addressed some comments by @jschendel, commented above on reasons for punting on others |
thanks @jbrockmendel and @jschendel |
closes BUG: IntervalArray.__setitem__ creates copies incorrectly #27147Not quiteblack pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Still needs a dedicated test, but this is also a non-trivial API change, so I want to get the ball rolling on discussion. cc @jschendel