-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Bug in DataFrame.drop_duplicates for empty DataFrame throws error #22394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ror (:issue:`20516`)
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -711,7 +711,7 @@ Reshaping | |||
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`) | |||
- Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`) | |||
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`) | |||
- | |||
- Bug in :meth:`DataFrame.drop_duplicates`for empty DataFrame throws error (:issue:`20516`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this link render? Curious if a space is required after the end backtick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for noticing! Fixed it
Codecov Report
@@ Coverage Diff @@
## master #22394 +/- ##
==========================================
- Coverage 92.05% 92.04% -0.01%
==========================================
Files 169 169
Lines 50709 50744 +35
==========================================
+ Hits 46679 46708 +29
- Misses 4030 4036 +6
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -711,7 +711,7 @@ Reshaping | |||
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`) | |||
- Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`) | |||
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`) | |||
- | |||
- Bug in :meth:`DataFrame.drop_duplicates` for empty DataFrame throws error (:issue:`20516`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double backticks on DataFrame
throws error -> which incorrectly raises
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, will apply the changes right away
@@ -263,6 +263,13 @@ def test_drop_duplicates_tuple(): | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
|
|||
def test_drop_duplicates_empty(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize this on columns with [], ['A', 'B','C']
as the values
IOW test the DataFrame(columns=[])
and DataFrame(columns=['A', 'B', 'C'])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually also need a test with no columns but an index e.g. DataFrame(index=[1, 2])
so don't need to parameterize but test all of these cases (where the result is the input)
DataFrame.drop_duplicates
for empty DataFrame throws error
Test assert failed due to the following issue: #22409. Posted the details regarding this issue. Will try to update on it if I have time. |
@jreback @WillAyd @datapythonista Should I just not include the test for columns until the #22409 is closed (as it will constantly tell that the shape is different due to column info is not there) or put this PR on hold until the above issue is closed? |
See note for the referenced issue but I don't think that is related (?). As @jreback mentioned you should parametrize the test case you have created. If so it will help more easily identify which if any case is failing |
@WillAyd Yes, I did parameterize the issue as @jreback mentioned, and while others are fine, the problem arose from this test:
current method for
Since we are dealing with an empty current assert_frame_equal checks for the shapes and column comparison within the code, as such:
and
Considering that when selecting in
If this is the expected behavior, I think we can consider the fact and change the above snippet to following:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks OK but the test needs to be parametrized still. Take a look at some of the other functions in the same module if unsure how to do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments, ping on green.
|
||
df = DataFrame(columns=['A', 'B', 'C']) | ||
result = df.drop_duplicates() | ||
expected = DataFrame(columns=[]) # The column infos are not carrying over |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find this comment useful, rather put a comment about testing with empty columns, and below about an empty index
Hello @HyunTruth! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 22, 2018 at 13:38 Hours UTC |
result = df.drop_duplicates() | ||
if df.columns.empty is False: | ||
df = DataFrame(columns=[]) | ||
tm.assert_frame_equal(result, df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tests feels a bit tricky to me. So, when there are columns, we set them to an empty list. So, that would be the same as simply using as expected value DataFrame(columns=[])
, or am I missing something?
But not sure if I'm missing something, but I'd expect DataFrame(columns=['A', 'B', 'C']).drop_duplicates()
to keep the columns. Any reason to drop them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the closed issue of #22409. DataFrame(columns=['A', 'B', 'C']).drop_duplicates()
passes through empty selection, and whilst index remains, column infos do not as the selection itself is done on the column basis if I understood the comment correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the problem may be your approach implementing this story. I agree on what @WillAyd said in #22409, but not in the side effect in this PR.
I guess df[pandas.Series()]
is the same as df[[]]
, so you're telling the DataFrame to select an empty list of columns, so none of them. So, if we have a DataFrame with 4 columns and 10 rows, I expect this to return the 10 rows, but 0 columns.
But the case here with drop_duplicates
, if I understand correctly, is that you have a DataFrame with 4 columns and 0 rows, you remove the duplicates, that there are none, as the DataFrame has 0 rows, so I should still have 4 columns and 0 rows.
And in your test you're asserting that the return will be 0 columns and 0 rows.
Am I right? Does it makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. However, if we want to leave the column when there exists no value, then we'd need to change the DataFrame.drop_duplicates
itself to not use the logic of self[~duplicated]
, as it currently does, as this is what is causing this problem at least at this moment with the empty return of DataFrame.duplicated
... I thought that DataFrame.duplicated
returns a Series
, and then if DataFrame
has no value, returning an empty Series
is logical. However, since DataFrame.drop_duplicates
use the logic of self[~duplicated]
(which works for every other case except on empty DataFrame
with only columns) returning empty Series
leads to df[[]]
, resulting in a DataFrame
with no value without the columns (as it is used in selection which has no value in it). Maybe returning an empty wasn't a good idea. Completely stuck then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I understand the problem now. There is some magic on df[something]
in pandas. In some cases something
would be understood as the list of columns, like in df[[]]
or df[['A', 'B', 'C']]
, but if something
is a list of boolean values, then it filters on rows instead of columns, like in df[df['A'].isnull()]
(df['A'].isnull()
returns a list (Series) of booleans).
That's why your approach is not working as expected. You may try to check with df.iloc[something]
, which should always filter on rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments. ping on green.
pandas/core/frame.py
Outdated
@@ -4335,6 +4335,9 @@ def drop_duplicates(self, subset=None, keep='first', inplace=False): | |||
------- | |||
deduplicated : DataFrame | |||
""" | |||
if self.empty: | |||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return self.copy() here
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -711,7 +711,7 @@ Reshaping | |||
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`) | |||
- Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`) | |||
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`) | |||
- | |||
- Bug in :meth:`DataFrame.drop_duplicates` for empty ``DataFrame`` which incorrectly raises error (:issue:`20516`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raises an error
@@ -4335,6 +4335,9 @@ def drop_duplicates(self, subset=None, keep='first', inplace=False): | |||
------- | |||
deduplicated : DataFrame | |||
""" | |||
if self.empty: | |||
return self.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of this, can you try to replace the last line return self[-duplicated]
by return self.iloc[-duplicated]
, for the reasons I mentioned earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then wouldn't index be lost in the case of DataFrame(index=['A', 'B', 'C'])
? As it will be self.iloc[[]]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use self.iloc[~duplicated]
, as you said, we will be able to maintain the columns if we have 4 columns and 0 rows. However, if we have 0 columns and 4 rows, we have the same problem - the outcome is self.iloc[[]]
, which means that the rows won't be selected, for the same reason with using self[[]]
for columns, thus ending up with 0 columns and 0 rows, once again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether it'd make sense for a dataframe with rows but no columns to drop the empty rows as duplicate.
But as @jreback if happy with this implementation, just leave it like this. I didn't see his comment earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried a test, and here's the result:
import pandas as pd
a = pd.DataFrame(index=['A', 'B'])
b = a.iloc[[]]
a.shape # (2, 0)
b.shape # (0, 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes sense to me. If we consider that empty rows are equal among them, then pd.DataFrame(index=['A', 'B']).duplicated()
would return False, True
, and .iloc[-duplicated]
would return the first row.
But as I said, I'm happy to keep the original DataFrame as is for this case, as @jreback is happy with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, if you see the empty rows as equals, then it does make sense. I haven't thought of it that way. Thanks.
def test_drop_duplicates_empty(df): | ||
# GH 20516 | ||
result = df.drop_duplicates() | ||
tm.assert_frame_equal(result, df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test case with inplace=True
please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside of changes requested by other reviewers this lgtm
lgtm. @datapythonista merge when satisfied. |
Thanks @HyunTruth |
@jreback @datapythonista @WillAyd Thank you all |
git diff upstream/master -u -- "*.py" | flake8 --diff