Skip to content

Bug in DataFrame.drop_duplicates for empty DataFrame throws error #22394

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Aug 23, 2018

Conversation

HyunTruth
Copy link
Contributor

@@ -711,7 +711,7 @@ Reshaping
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
- Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`)
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`)
-
- Bug in :meth:`DataFrame.drop_duplicates`for empty DataFrame throws error (:issue:`20516`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this link render? Curious if a space is required after the end backtick

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for noticing! Fixed it

@codecov
Copy link

codecov bot commented Aug 17, 2018

Codecov Report

Merging #22394 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22394      +/-   ##
==========================================
- Coverage   92.05%   92.04%   -0.01%     
==========================================
  Files         169      169              
  Lines       50709    50744      +35     
==========================================
+ Hits        46679    46708      +29     
- Misses       4030     4036       +6
Flag Coverage Δ
#multiple 90.45% <100%> (-0.01%) ⬇️
#single 42.24% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.2% <100%> (-0.05%) ⬇️
pandas/util/_depr_module.py 65.11% <0%> (-2.33%) ⬇️
pandas/core/reshape/pivot.py 96.55% <0%> (-0.63%) ⬇️
pandas/core/arrays/integer.py 94.55% <0%> (-0.12%) ⬇️
pandas/util/testing.py 85.75% <0%> (-0.11%) ⬇️
pandas/core/reshape/merge.py 94.15% <0%> (-0.01%) ⬇️
pandas/core/groupby/grouper.py 98.16% <0%> (-0.01%) ⬇️
pandas/core/generic.py 96.44% <0%> (-0.01%) ⬇️
pandas/io/parsers.py 95.48% <0%> (ø) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b5d81cf...fc61899. Read the comment docs.

@datapythonista datapythonista added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 17, 2018
@@ -711,7 +711,7 @@ Reshaping
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
- Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`)
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`)
-
- Bug in :meth:`DataFrame.drop_duplicates` for empty DataFrame throws error (:issue:`20516`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks on DataFrame

throws error -> which incorrectly raises

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, will apply the changes right away

@@ -263,6 +263,13 @@ def test_drop_duplicates_tuple():
tm.assert_frame_equal(result, expected)


def test_drop_duplicates_empty():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this on columns with [], ['A', 'B','C'] as the values
IOW test the DataFrame(columns=[]) and DataFrame(columns=['A', 'B', 'C'])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually also need a test with no columns but an index e.g. DataFrame(index=[1, 2])

so don't need to parameterize but test all of these cases (where the result is the input)

@jreback jreback added this to the 0.24.0 milestone Aug 17, 2018
@jreback jreback changed the title Bug in :meth:DataFrame.drop_duplicatesfor empty DataFrame throws error Bug in DataFrame.drop_duplicates for empty DataFrame throws error Aug 17, 2018
@HyunTruth
Copy link
Contributor Author

Test assert failed due to the following issue: #22409. Posted the details regarding this issue. Will try to update on it if I have time.

@HyunTruth
Copy link
Contributor Author

@jreback @WillAyd @datapythonista Should I just not include the test for columns until the #22409 is closed (as it will constantly tell that the shape is different due to column info is not there) or put this PR on hold until the above issue is closed?

@WillAyd
Copy link
Member

WillAyd commented Aug 18, 2018

See note for the referenced issue but I don't think that is related (?). As @jreback mentioned you should parametrize the test case you have created. If so it will help more easily identify which if any case is failing

@HyunTruth
Copy link
Contributor Author

HyunTruth commented Aug 19, 2018

@WillAyd Yes, I did parameterize the issue as @jreback mentioned, and while others are fine, the problem arose from this test:

    expected = DataFrame(columns=['A', 'B', 'C'])
    result = expected.drop_duplicates()
    tm.assert_frame_equal(result, expected)

current method for DataFrame.drop_duplicates follows this logic:

    duplicated = self.duplicated(subset, keep=keep)

    if inplace:
        inds, = (-duplicated).nonzero()
        new_data = self._data.take(inds)
        self._update_inplace(new_data)
    else:
        return self[-duplicated]

Since we are dealing with an empty DataFrame, the duplicated should be returning an empty Series as there exists no value to iterate over.

current assert_frame_equal checks for the shapes and column comparison within the code, as such:

    if left.shape != right.shape:
        raise_assert_detail(obj,
                            'DataFrame shape mismatch',
                            '{shape!r}'.format(shape=left.shape),
                            '{shape!r}'.format(shape=right.shape))

and

    assert_index_equal(left.columns, right.columns, exact=check_column_type,
                       check_names=check_names,
                       check_less_precise=check_less_precise,
                       check_exact=check_exact,
                       check_categorical=check_categorical,
                       obj='{obj}.columns'.format(obj=obj))

Considering that when selecting in df[pd.Series()] manner, column infos are by design not carrying over and thus fail the test, which spots different shape and index from the one the expected and result of the below test:

    expected = DataFrame(columns=['A', 'B', 'C'])
    result = expected.drop_duplicates()
    tm.assert_frame_equal(result, expected)

If this is the expected behavior, I think we can consider the fact and change the above snippet to following:

    df = DataFrame(columns=['A', 'B', 'C'])
    result = df.drop_duplicates()
    expected = DataFrame(columns=[]) # Since the column infos are not carrying over
    tm.assert_frame_equal(result, expected)

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks OK but the test needs to be parametrized still. Take a look at some of the other functions in the same module if unsure how to do that

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments, ping on green.


df = DataFrame(columns=['A', 'B', 'C'])
result = df.drop_duplicates()
expected = DataFrame(columns=[]) # The column infos are not carrying over
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find this comment useful, rather put a comment about testing with empty columns, and below about an empty index

@pep8speaks
Copy link

pep8speaks commented Aug 22, 2018

Hello @HyunTruth! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 22, 2018 at 13:38 Hours UTC

result = df.drop_duplicates()
if df.columns.empty is False:
df = DataFrame(columns=[])
tm.assert_frame_equal(result, df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests feels a bit tricky to me. So, when there are columns, we set them to an empty list. So, that would be the same as simply using as expected value DataFrame(columns=[]), or am I missing something?

But not sure if I'm missing something, but I'd expect DataFrame(columns=['A', 'B', 'C']).drop_duplicates() to keep the columns. Any reason to drop them?

Copy link
Contributor Author

@HyunTruth HyunTruth Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the closed issue of #22409. DataFrame(columns=['A', 'B', 'C']).drop_duplicates() passes through empty selection, and whilst index remains, column infos do not as the selection itself is done on the column basis if I understood the comment correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the problem may be your approach implementing this story. I agree on what @WillAyd said in #22409, but not in the side effect in this PR.

I guess df[pandas.Series()] is the same as df[[]], so you're telling the DataFrame to select an empty list of columns, so none of them. So, if we have a DataFrame with 4 columns and 10 rows, I expect this to return the 10 rows, but 0 columns.

But the case here with drop_duplicates, if I understand correctly, is that you have a DataFrame with 4 columns and 0 rows, you remove the duplicates, that there are none, as the DataFrame has 0 rows, so I should still have 4 columns and 0 rows.

And in your test you're asserting that the return will be 0 columns and 0 rows.

Am I right? Does it makes sense?

Copy link
Contributor Author

@HyunTruth HyunTruth Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. However, if we want to leave the column when there exists no value, then we'd need to change the DataFrame.drop_duplicates itself to not use the logic of self[~duplicated], as it currently does, as this is what is causing this problem at least at this moment with the empty return of DataFrame.duplicated... I thought that DataFrame.duplicated returns a Series, and then if DataFrame has no value, returning an empty Series is logical. However, since DataFrame.drop_duplicates use the logic of self[~duplicated] (which works for every other case except on empty DataFrame with only columns) returning empty Series leads to df[[]], resulting in a DataFrame with no value without the columns (as it is used in selection which has no value in it). Maybe returning an empty wasn't a good idea. Completely stuck then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I understand the problem now. There is some magic on df[something] in pandas. In some cases something would be understood as the list of columns, like in df[[]] or df[['A', 'B', 'C']], but if something is a list of boolean values, then it filters on rows instead of columns, like in df[df['A'].isnull()] (df['A'].isnull() returns a list (Series) of booleans).

That's why your approach is not working as expected. You may try to check with df.iloc[something], which should always filter on rows.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments. ping on green.

@@ -4335,6 +4335,9 @@ def drop_duplicates(self, subset=None, keep='first', inplace=False):
-------
deduplicated : DataFrame
"""
if self.empty:
return self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return self.copy() here

@@ -711,7 +711,7 @@ Reshaping
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
- Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`)
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`)
-
- Bug in :meth:`DataFrame.drop_duplicates` for empty ``DataFrame`` which incorrectly raises error (:issue:`20516`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raises an error

@@ -4335,6 +4335,9 @@ def drop_duplicates(self, subset=None, keep='first', inplace=False):
-------
deduplicated : DataFrame
"""
if self.empty:
return self.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this, can you try to replace the last line return self[-duplicated] by return self.iloc[-duplicated], for the reasons I mentioned earlier?

Copy link
Contributor Author

@HyunTruth HyunTruth Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then wouldn't index be lost in the case of DataFrame(index=['A', 'B', 'C'])? As it will be self.iloc[[]]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use self.iloc[~duplicated], as you said, we will be able to maintain the columns if we have 4 columns and 0 rows. However, if we have 0 columns and 4 rows, we have the same problem - the outcome is self.iloc[[]], which means that the rows won't be selected, for the same reason with using self[[]] for columns, thus ending up with 0 columns and 0 rows, once again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether it'd make sense for a dataframe with rows but no columns to drop the empty rows as duplicate.

But as @jreback if happy with this implementation, just leave it like this. I didn't see his comment earlier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a test, and here's the result:

import pandas as pd

a = pd.DataFrame(index=['A', 'B'])
b = a.iloc[[]]
a.shape # (2, 0)
b.shape # (0, 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense to me. If we consider that empty rows are equal among them, then pd.DataFrame(index=['A', 'B']).duplicated() would return False, True, and .iloc[-duplicated] would return the first row.

But as I said, I'm happy to keep the original DataFrame as is for this case, as @jreback is happy with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, if you see the empty rows as equals, then it does make sense. I haven't thought of it that way. Thanks.

def test_drop_duplicates_empty(df):
# GH 20516
result = df.drop_duplicates()
tm.assert_frame_equal(result, df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test case with inplace=True please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside of changes requested by other reviewers this lgtm

@jreback
Copy link
Contributor

jreback commented Aug 23, 2018

lgtm. @datapythonista merge when satisfied.

@datapythonista datapythonista merged commit 9122952 into pandas-dev:master Aug 23, 2018
@datapythonista
Copy link
Member

Thanks @HyunTruth

@HyunTruth
Copy link
Contributor Author

@jreback @datapythonista @WillAyd Thank you all

@HyunTruth HyunTruth deleted the dropdup branch August 23, 2018 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Calling drop_duplicates method for empty pandas dataframe throws error
5 participants