Skip to content

BUG: inconsistency in dtype of replace() #44897

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Dec 22, 2021
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 5 additions & 2 deletions pandas/core/internals/blocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -661,6 +661,7 @@ def replace(
return [blk]

regex = should_use_regex(regex, to_replace)
mask = missing.mask_missing(values, to_replace)

if regex:
return self._replace_regex(to_replace, value, inplace=inplace)
Expand All @@ -672,7 +673,6 @@ def replace(
# replace_list instead of replace.
return [self] if inplace else [self.copy()]

mask = missing.mask_missing(values, to_replace)
if not mask.any():
# Note: we get here with test_replace_extension_other incorrectly
# bc _can_hold_element is incorrect.
Expand Down Expand Up @@ -739,7 +739,10 @@ def _replace_regex(
replace_regex(new_values, rx, value, mask)

block = self.make_block(new_values)
return [block]
if self.ndim == 1 or self.shape[0] == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we use this pattern the 'else' path usually involves split_and_operate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous case where the dimensions are as follows split and operate don't operate on these dimensions, for split and operate you need

assert self.ndim == 2 and self.shape[0] != 1, which we do not use here.

Since we have a single element here, we should not split and upcast, right?

Copy link
Member

@jbrockmendel jbrockmendel Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i understand. im saying its weird to see this check on L734 and not see a split_and_operate on L737.

e.g. in the test you implemented, what happens if you use pd.DataFrame({"A": ["0"], "B": ["0"]})?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In replace_regex, we have aldready applied the replacement with the only issue of the type cast. So, if I apply split_and_operate again, it won't give any specific result except for converting dtype before breaking a lot of tests. The solution is correct with the dtype being wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.DataFrame({"A": ["0"], "B: ["0"]}) is giving a dtype assertion error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is giving a dtype assertion error

can you paste it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rectified it, the testcase has been added.

return block.convert(numeric=False, copy=False)
else:
return [block]

@final
def _replace_list(
Expand Down
9 changes: 9 additions & 0 deletions pandas/tests/series/methods/test_replace.py
Original file line number Diff line number Diff line change
Expand Up @@ -512,3 +512,12 @@ def test_pandas_replace_na(self):
result = ser.replace(regex_mapping, regex=True)
exp = pd.Series(["CC", "CC", "CC-REPL", "DD", "CC", "", pd.NA], dtype="string")
tm.assert_series_equal(result, exp)

def test_replace_regex_dtype(self):
# GH-48644
s = pd.Series(["0"])
expected = pd.Series([1])
result_series_false = s.replace(to_replace="0", value=1, regex=False)
tm.assert_series_equal(result_series_false, expected)
result_series_true = s.replace(to_replace="0", value=1, regex=True)
tm.assert_series_equal(result_series_true, expected)