-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
TST: base test for ExtensionArray.astype to its own type + copy keyword #35116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST: base test for ExtensionArray.astype to its own type + copy keyword #35116
Conversation
- Issue is that .astype of self's type always returned a copy - Change BooleanArray fix to have consistent test order
-Just check if dtype is same, then return self or self.copy() - Make test order consistent w/ other classes
-Continue to make style consistent
- I turned off the test that was failing temporarily because of __eq__ so that I could continue with the other tests
Notes: The test_string tests still fail, I have some ideas about how to fix it but I'm not sure about them. I'll try to give it a go. I'm also planning to squash&merge and clean up the commit names. If I should do that beforehand so other folks can see it, that's fine too. I'll just have to open another PR? |
This branch is not building because of an issue I'm not sure what folks want to do with- the "test_string" tests. pandas/tests/extension/test_string.py::TestCasting::test_astype_own_type[True] FAILED [ 87%] I believe it fails because StringArray does not have eq implemented, which ExtensionArray requires. |
I don't know what to do next. I can take a look at implementing eq, but I wanted an okay to move forward on that. |
Sorry for the delay, prepping the release. No need to squash. We do that on merge. There are some merge conflicts now. Can you fix those?
Let's keep each PR as small as possible. |
Sounds good, I should get around to the merge conflicts tomorrow. |
Fixed up the merge conflicts. There are still two tests that aren't passing due to the eq issue.
|
I temporarily disabled the failing test, in case that's why nobody was looking at this PR yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
pandas/core/arrays/sparse/array.py
Outdated
if dtype == self.dtype and copy: | ||
return self.copy() | ||
elif dtype == self.dtype and not copy: | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do this one using the same structure as the others (first only checking if dtype == self.dtype:
and then if not copy: ... else: ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Of course.
pass | ||
# | ||
# class TestCasting(base.BaseCastingTests): | ||
# pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you re-enable this again? Then we can look into the failure (and it will need to work anyway, before we can merge this PR)
(__eq__
is certainly implemented for StringArray, but something else might be going wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. (Sorry about that, from what I could see a lack of eq looked like the issue. I'll try taking another look at it.)
I think I found the problem, and I'm not sure of the correct layer to fix. I'd appreciate your input.
The result is it goes through to the base numpy_. extensionarray astype method, which blows up when you try to compare dtype('O') (not sure what the name should be. Pandas object dtype?) and self.dtype, which is StringDtype. You should be able to replicate easily enough by pulling the pr and running |
pandas/core/arrays/string_.py
Outdated
if not copy: | ||
return self | ||
elif copy: | ||
return self.copy() | ||
dtype = pandas_dtype(dtype) | ||
if isinstance(dtype, StringDtype): | ||
if copy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you can remove the new code, as you can see on this line and the two lines below, the logic is already present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point; must've not been paying attention.
pandas/core/arrays/numpy_.py
Outdated
elif copy: | ||
return self.copy() | ||
else: | ||
return super().astype(dtype, copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to move this to the base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, tried it and it worked, seems logically consistent. Going to also look to see if I can drop the checks elsewhere by leaving it to the base class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem so, without doing more restructuring than I'm comfortable with ATM.
pandas/core/arrays/numpy_.py
Outdated
@@ -276,6 +276,15 @@ def __setitem__(self, key, value) -> None: | |||
|
|||
self._ndarray[key] = value | |||
|
|||
def astype(self, dtype, copy: bool = True) -> ArrayLike: | |||
if dtype == self.dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the problem is that np.dtype(object) == ExtensionDtype
raises an error, which is an error coming from numpy and not directly solvable in pandas.
So to workaround it, can you do is_dtype_equal(dtype, self.dtype):
, with is_dtype_equal
imported from pandas.core.dtypes.common
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed and moved checks to base class. This just calls super().astype now.
- Remove corresponding check from numpy_.py - Note that all tests now pass
elif copy: | ||
return self.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche Is this implemented correctly? I'm concerned because of the way returning a copy is implemented on line 141, which seems to take into account dtype.context
. Maybe the fix would be to remove lines 137-138 and let that case fall to 141-142?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is correct, yes. The first is_dtype_equal(dtype, self._dtype)
will only catch decimal dtypes with an equal context
, and then the if isinstance(dtype, type(self.dtype)):
will work correctly in case the context differs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but we might need to pass through the copy
(anyway, this array implementation is only for testing, it's not actually used, so it's not super important)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easy enough to check, so I changed it to pass through the copy
. It passes the test suite the same way, and seems right, so I'll go with that version.
Welp, ran the full test suite and saw things failing. I'll go in and start trying to fix them. Should've been doing that locally given that the builds were failing. edit: Cool, looks like it was the same problem you mentioned with is_dtype_equal in the other checks as well. |
Something I don't understand is going wrong with pytests' xfail marking- tests are still failing despite being marked. You can run (I was unfamiliar with this pytest feature, so I might be misinterpreting.) |
A copy/paste of the (failing) output of the entire test suite, for reference:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
The failures you mention, you can ignore I think, as they seem unrelated (eg the network failure). But the tests that were xfailed and are now passing need to be checked:
=================================== FAILURES ===================================
_____________ TestGetitem.test_loc_iloc_frame_single_dtype[float] ______________
[gw0] linux -- Python 3.7.9 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#33125 astype doesn't recognize data.dtype
____________ TestGroupby.test_groupby_extension_apply[object-float] ____________
[gw0] linux -- Python 3.7.9 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#33125 astype doesn't recognize data.dtype
_____________________ TestPrinting.test_series_repr[float] _____________________
[gw0] linux -- Python 3.7.9 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#33125 PandasArray.astype does not recognize PandasDtype
____________________ TestPrinting.test_series_repr[object] _____________________
[gw0] linux -- Python 3.7.9 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#33125 PandasArray.astype does not recognize PandasDtype
=============================== warnings summary ===============================
They are all in test_numpy.py
, I think, and I checked the first one: it seems that now for float dtype this test is working correctly, so we no longer need to xfail it:
--- a/pandas/tests/extension/test_numpy.py
+++ b/pandas/tests/extension/test_numpy.py
@@ -177,7 +177,7 @@ class TestGetitem(BaseNumPyTests, base.BaseGetitemTests):
def test_loc_iloc_frame_single_dtype(self, data, request):
npdtype = data.dtype.numpy_dtype
- if npdtype == object or npdtype == np.float64:
+ if npdtype == object:
# GH#33125
mark = pytest.mark.xfail(
reason="GH#33125 astype doesn't recognize data.dtype"
pandas/core/arrays/numpy_.py
Outdated
def astype(self, dtype, copy: bool = True) -> ArrayLike: | ||
return super().astype(dtype, copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it is only a call to super()
, you can leave out it entirely (so it will just use the inherited method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I should've pulled that yesterday ¯_(ツ)_/¯
elif copy: | ||
return self.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is correct, yes. The first is_dtype_equal(dtype, self._dtype)
will only catch decimal dtypes with an equal context
, and then the if isinstance(dtype, type(self.dtype)):
will work correctly in case the context differs.
elif copy: | ||
return self.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but we might need to pass through the copy
(anyway, this array implementation is only for testing, it's not actually used, so it's not super important)
if not copy: | ||
return self | ||
elif copy: | ||
return self.copy() | ||
dtype = pandas_dtype(dtype) | ||
if isinstance(dtype, type(self.dtype)): | ||
return type(self)(self._data, context=dtype.context) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return type(self)(self._data, context=dtype.context) | |
return type(self)(self._data, copy=copy, context=dtype.context) |
You're welcome! Gotta do something to stay sane while applying for jobs, and I got tired of working on a data science thing.
Cool, makes sense.
ok good, makes sense. re: one of the tests- Should the ValueError comment in this test be left alone? Not sure why it's there.
Just running the full test suite now. Once that's done I'll push. |
Figured out that the linting was what was causing the failure; should've looked more closely at the details. Fixed the issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last comments!
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
@tomaszps Thanks a lot for the contribution! |
…rd (pandas-dev#35116) Co-authored-by: Joris Van den Bossche <[email protected]>
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
n.b. I'm not certain. I get no output on the command line when I run this command.