Bug in DataFrame.drop_duplicates for empty DataFrame throws error #22394

HyunTruth · 2018-08-17T02:56:14Z

closes Calling drop_duplicates method for empty pandas dataframe throws error #20516
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

WillAyd · 2018-08-17T03:11:31Z

doc/source/whatsnew/v0.24.0.txt

@@ -711,7 +711,7 @@ Reshaping
 - Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
 - Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`)
 - Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`)
-
+- Bug in :meth:`DataFrame.drop_duplicates`for empty DataFrame throws error (:issue:`20516`)


Does this link render? Curious if a space is required after the end backtick

Thank you for noticing! Fixed it

codecov · 2018-08-17T04:03:35Z

Codecov Report

Merging #22394 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22394      +/-   ##
==========================================
- Coverage   92.05%   92.04%   -0.01%     
==========================================
  Files         169      169              
  Lines       50709    50744      +35     
==========================================
+ Hits        46679    46708      +29     
- Misses       4030     4036       +6

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.45% <100%> (-0.01%)` | ⬇️ |
| #single | `42.24% <0%> (-0.02%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/frame.py | `97.2% <100%> (-0.05%)` | ⬇️ |
| pandas/util/_depr_module.py | `65.11% <0%> (-2.33%)` | ⬇️ |
| pandas/core/reshape/pivot.py | `96.55% <0%> (-0.63%)` | ⬇️ |
| pandas/core/arrays/integer.py | `94.55% <0%> (-0.12%)` | ⬇️ |
| pandas/util/testing.py | `85.75% <0%> (-0.11%)` | ⬇️ |
| pandas/core/reshape/merge.py | `94.15% <0%> (-0.01%)` | ⬇️ |
| pandas/core/groupby/grouper.py | `98.16% <0%> (-0.01%)` | ⬇️ |
| pandas/core/generic.py | `96.44% <0%> (-0.01%)` | ⬇️ |
| pandas/io/parsers.py | `95.48% <0%> (ø)` | ⬆️ |

... and 7 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b5d81cf...fc61899.

jreback · 2018-08-17T10:24:37Z

doc/source/whatsnew/v0.24.0.txt

@@ -711,7 +711,7 @@ Reshaping
 - Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
 - Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`)
 - Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`)
-
+- Bug in :meth:`DataFrame.drop_duplicates` for empty DataFrame throws error (:issue:`20516`)


use double backticks on DataFrame

throws error -> which incorrectly raises

Okay, will apply the changes right away

jreback · 2018-08-17T10:28:05Z

pandas/tests/frame/test_duplicates.py

@@ -263,6 +263,13 @@ def test_drop_duplicates_tuple():
    tm.assert_frame_equal(result, expected)


+def test_drop_duplicates_empty():


can you parameterize this on columns with [], ['A', 'B','C'] as the values
IOW test the DataFrame(columns=[]) and DataFrame(columns=['A', 'B', 'C'])

actually also need a test with no columns but an index e.g. DataFrame(index=[1, 2])

so don't need to parameterize but test all of these cases (where the result is the input)
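The three cases listed above can be written out as a short check. This is a minimal sketch, not the test that was merged: it uses the public `pandas.testing.assert_frame_equal` instead of the internal `tm` helper, the names `cases` and `results` are illustrative, and it assumes a pandas version that already contains the fix from this PR, so each result is expected to equal its input.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# The three "empty" shapes mentioned in the review:
# no columns, columns but no rows, and an index but no columns.
cases = [
    pd.DataFrame(columns=[]),
    pd.DataFrame(columns=["A", "B", "C"]),
    pd.DataFrame(index=[1, 2]),
]

results = []
for df in cases:
    result = df.drop_duplicates()
    # With the fix applied, the result should be equal to the input.
    assert_frame_equal(result, df)
    results.append(result)
```

Writing the cases as a list keeps each input visible, which is the same benefit parametrizing the test would give: a failure points at one specific shape.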

HyunTruth · 2018-08-18T15:39:04Z

The test assertion failed because of #22409; I posted the details there. I will try to follow up on it if I have time.

HyunTruth · 2018-08-18T15:46:18Z

@jreback @WillAyd @datapythonista Should I leave out the test for columns until #22409 is closed (it will keep reporting a shape mismatch because the column info is not there), or put this PR on hold until that issue is closed?

WillAyd · 2018-08-18T21:50:09Z

See my note on the referenced issue, but I don't think it is related (?). As @jreback mentioned, you should parametrize the test case you have created. That will make it easier to identify which case, if any, is failing.

HyunTruth · 2018-08-19T07:22:06Z

@WillAyd Yes, I did parameterize the test as @jreback mentioned, and while the other cases are fine, the problem arose from this one:

    expected = DataFrame(columns=['A', 'B', 'C'])
    result = expected.drop_duplicates()
    tm.assert_frame_equal(result, expected)

The current DataFrame.drop_duplicates method follows this logic:

    duplicated = self.duplicated(subset, keep=keep)

    if inplace:
        inds, = (-duplicated).nonzero()
        new_data = self._data.take(inds)
        self._update_inplace(new_data)
    else:
        return self[-duplicated]

Since we are dealing with an empty DataFrame, duplicated should return an empty Series, as there are no values to iterate over.

The current assert_frame_equal compares the shapes and the columns, as follows:
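The expectation that `duplicated` yields an empty Series for a frame without rows can be checked directly. A small sketch, assuming a pandas version where the call does not raise for empty frames (the variable names are illustrative):

```python
import pandas as pd

# A frame with three columns and zero rows.
df = pd.DataFrame(columns=["A", "B", "C"])

# There are no rows to compare, so the mask has nothing in it.
mask = df.duplicated()

print(type(mask).__name__)  # Series
print(len(mask))            # 0
```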

    if left.shape != right.shape:
        raise_assert_detail(obj,
                            'DataFrame shape mismatch',
                            '{shape!r}'.format(shape=left.shape),
                            '{shape!r}'.format(shape=right.shape))

and

    assert_index_equal(left.columns, right.columns, exact=check_column_type,
                       check_names=check_names,
                       check_less_precise=check_less_precise,
                       check_exact=check_exact,
                       check_categorical=check_categorical,
                       obj='{obj}.columns'.format(obj=obj))

When selecting in the df[pd.Series()] manner, column information is by design not carried over. The test below therefore fails, because assert_frame_equal finds that expected and result differ in shape and in their column index:

    expected = DataFrame(columns=['A', 'B', 'C'])
    result = expected.drop_duplicates()
    tm.assert_frame_equal(result, expected)

If this is the expected behavior, I think we can account for it and change the above snippet to the following:

    df = DataFrame(columns=['A', 'B', 'C'])
    result = df.drop_duplicates()
    expected = DataFrame(columns=[]) # Since the column infos are not carrying over
    tm.assert_frame_equal(result, expected)

WillAyd

Change looks OK but the test needs to be parametrized still. Take a look at some of the other functions in the same module if unsure how to do that

jreback

small comments, ping on green.

jreback · 2018-08-20T10:35:10Z

pandas/tests/frame/test_duplicates.py

+
+    df = DataFrame(columns=['A', 'B', 'C'])
+    result = df.drop_duplicates()
+    expected = DataFrame(columns=[])  # The column infos are not carrying over


I don't find this comment useful, rather put a comment about testing with empty columns, and below about an empty index

pep8speaks · 2018-08-22T08:43:18Z

Hello @HyunTruth! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 22, 2018 at 13:38 Hours UTC

datapythonista · 2018-08-22T09:55:51Z

pandas/tests/frame/test_duplicates.py

+    result = df.drop_duplicates()
+    if df.columns.empty is False:
+        df = DataFrame(columns=[])
+    tm.assert_frame_equal(result, df)


This test feels a bit tricky to me. When there are columns, we set them to an empty list. So that would be the same as simply using DataFrame(columns=[]) as the expected value, or am I missing something?

I'm not sure if I'm missing something, but I'd expect DataFrame(columns=['A', 'B', 'C']).drop_duplicates() to keep the columns. Any reason to drop them?

Please check the closed issue #22409. DataFrame(columns=['A', 'B', 'C']).drop_duplicates() goes through an empty selection, and while the index remains, the column information does not, because the selection itself is done on a column basis, if I understood the comment correctly.

I don't think the problem may be your approach implementing this story. I agree with what @WillAyd said in #22409, but not with the side effect in this PR.

I guess df[pandas.Series()] is the same as df[[]], so you're telling the DataFrame to select an empty list of columns, so none of them. So, if we have a DataFrame with 4 columns and 10 rows, I expect this to return the 10 rows, but 0 columns.

But the case here with drop_duplicates, if I understand correctly, is that you have a DataFrame with 4 columns and 0 rows. You remove the duplicates, of which there are none because the DataFrame has 0 rows, so I should still have 4 columns and 0 rows.

And in your test you're asserting that the return will be 0 columns and 0 rows.

Am I right? Does it make sense?

Yeah. However, if we want to keep the columns when there are no values, then we'd need to change DataFrame.drop_duplicates itself so that it does not use self[~duplicated], as it currently does. That is what causes this problem, at least for now, given the empty return of DataFrame.duplicated... I thought that since DataFrame.duplicated returns a Series, returning an empty Series for a DataFrame with no values would be logical. However, since DataFrame.drop_duplicates uses self[~duplicated] (which works for every case except an empty DataFrame with only columns), returning an empty Series leads to df[[]], which results in a DataFrame with no values and no columns (the empty Series is treated as a column selection with nothing in it). Maybe returning an empty Series wasn't a good idea. Completely stuck then.

Oh, I understand the problem now. There is some magic on df[something] in pandas. In some cases something would be understood as the list of columns, like in df[[]] or df[['A', 'B', 'C']], but if something is a list of boolean values, then it filters on rows instead of columns, like in df[df['A'].isnull()] (df['A'].isnull() returns a list (Series) of booleans).

That's why your approach is not working as expected. You may try to check with df.iloc[something], which should always filter on rows.
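The difference between the two kinds of key can be seen on a small non-empty frame. This is only an illustrative sketch: the frame, the variable names, and the choice to pass the mask to `.iloc` as a NumPy array are assumptions made for the example, not code from this PR.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan], "B": ["x", "y"]})

# A list of labels selects columns; an empty list selects none of them.
no_columns = df[[]]

# A boolean Series filters rows instead.
null_rows = df[df["A"].isnull()]

# .iloc is purely positional along the rows; the mask is passed as a
# NumPy array here (an assumption made for this sketch).
mask = df["A"].isnull().to_numpy()
null_rows_iloc = df.iloc[mask]

print(no_columns.shape)      # (2, 0)
print(null_rows.shape)       # (1, 2)
print(null_rows_iloc.shape)  # (1, 2)
```

The first result keeps both rows and drops every column, which is the same effect that removed the columns in the failing test.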

jreback

minor comments. ping on green.

jreback · 2018-08-22T12:25:47Z

pandas/core/frame.py

@@ -4335,6 +4335,9 @@ def drop_duplicates(self, subset=None, keep='first', inplace=False):
        -------
        deduplicated : DataFrame
        """
+        if self.empty:
+            return self


return self.copy() here
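The review does not spell out the reason for the copy; one plausible reading is that a non-inplace call should hand back a new object rather than the caller's own frame. A minimal sketch of how that can be observed, assuming a pandas version that includes this fix:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df = pd.DataFrame(columns=["A", "B", "C"])
result = df.drop_duplicates()

# Same contents as the input ...
assert_frame_equal(result, df)

# ... but not the very same object.
print(result is df)  # False
```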

jreback · 2018-08-22T12:26:07Z

doc/source/whatsnew/v0.24.0.txt

@@ -711,7 +711,7 @@ Reshaping
 - Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
 - Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`)
 - Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`)
-
+- Bug in :meth:`DataFrame.drop_duplicates` for empty ``DataFrame`` which incorrectly raises error (:issue:`20516`)


raises an error

datapythonista · 2018-08-22T12:32:27Z

pandas/core/frame.py

@@ -4335,6 +4335,9 @@ def drop_duplicates(self, subset=None, keep='first', inplace=False):
        -------
        deduplicated : DataFrame
        """
+        if self.empty:
+            return self.copy()


instead of this, can you try to replace the last line return self[-duplicated] by return self.iloc[-duplicated], for the reasons I mentioned earlier?

Then wouldn't the index be lost in the case of DataFrame(index=['A', 'B', 'C'])? As it will be self.iloc[[]]?

If we use self.iloc[~duplicated], as you said, we will be able to keep the columns if we have 4 columns and 0 rows. However, if we have 0 columns and 4 rows, we have the same problem: the outcome is self.iloc[[]], which means that no rows are selected, for the same reason as using self[[]] for columns, so we end up with 0 columns and 0 rows once again.

I'm not sure whether it'd make sense for a dataframe with rows but no columns to drop the empty rows as duplicate.

But as @jreback is happy with this implementation, just leave it like this. I didn't see his comment earlier.

I tried a test, and here's the result:

    import pandas as pd
    a = pd.DataFrame(index=['A', 'B'])
    b = a.iloc[[]]
    a.shape  # (2, 0)
    b.shape  # (0, 0)

Yes, that makes sense to me. If we consider that empty rows are equal to each other, then pd.DataFrame(index=['A', 'B']).duplicated() would return False, True, and .iloc[-duplicated] would return the first row.

But as I said, I'm happy to keep the original DataFrame as is for this case, as @jreback is happy with it.

Oh, if you see the empty rows as equals, then it does make sense. I haven't thought of it that way. Thanks.

datapythonista · 2018-08-22T12:32:57Z

pandas/tests/frame/test_duplicates.py

+def test_drop_duplicates_empty(df):
+    # GH 20516
+    result = df.drop_duplicates()
+    tm.assert_frame_equal(result, df)


can you add a test case with inplace=True please?
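A check along these lines might look as follows. It is a sketch of the idea rather than the test that was added to the PR, and it assumes a pandas version that includes the fix; the name `work` is illustrative.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df = pd.DataFrame(columns=["A", "B", "C"])

# Work on a copy so the original stays available for comparison.
work = df.copy()
work.drop_duplicates(inplace=True)

# Nothing should have been removed or reshaped.
assert_frame_equal(work, df)
```

Comparing the mutated copy against the untouched original avoids relying on the return value of the inplace call.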

WillAyd

Outside of changes requested by other reviewers this lgtm

jreback · 2018-08-23T10:34:42Z

lgtm. @datapythonista merge when satisfied.

datapythonista · 2018-08-23T13:13:41Z

Thanks @HyunTruth

HyunTruth · 2018-08-23T13:25:35Z

@jreback @datapythonista @WillAyd Thank you all

Bug in :meth:DataFrame.drop_duplicatesfor empty DataFrame throws error (:issue:`20516`)

03af0c1

WillAyd reviewed Aug 17, 2018

fixed what's new to render well

79ee155

datapythonista added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 17, 2018

datapythonista approved these changes Aug 17, 2018

jreback requested changes Aug 17, 2018

jreback added this to the 0.24.0 milestone Aug 17, 2018

jreback mentioned this pull request Aug 17, 2018

BUG: Fix drop_duplicates failure when DataFrame has no column #20974

Closed

4 tasks

jreback changed the title ~~Bug in :meth:DataFrame.drop_duplicatesfor empty DataFrame throws error~~ Bug in DataFrame.drop_duplicates for empty DataFrame throws error Aug 17, 2018

HyunTruth added 2 commits August 18, 2018 09:33

Applied changes according to reviews by @jreback

31f6099

removed an additional line

6eb53f6

HyunTruth mentioned this pull request Aug 18, 2018

selection within the DataFrame by empty series loses column schema #22409

Closed

changed test to accomodate the column behavior in selection

de745bb

WillAyd requested changes Aug 20, 2018

jreback requested changes Aug 20, 2018

Parameterized the tests

3a5d97d

hyuntruth added 4 commits August 22, 2018 17:45

Parameterized the tests

1f58a85

Merge branch 'dropdup' of https://github.com/HyunTruth/pandas into dropdup

4f299c5

Adhere to flake8

cab0958

switched df

68d69db

datapythonista reviewed Aug 22, 2018

Try catching for empty dataframes and return self

fb8845d

jreback requested changes Aug 22, 2018

change requested applied

1f12545

datapythonista requested changes Aug 22, 2018

added inplace=True tests

20c03ef

WillAyd approved these changes Aug 22, 2018

rectified inplace test to reflect actual usage

fc61899

jreback approved these changes Aug 23, 2018

datapythonista merged commit 9122952 into pandas-dev:master Aug 23, 2018

HyunTruth deleted the dropdup branch August 23, 2018 13:25

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: DataFrame.drop_duplicates for empty DataFrame raises exception (pandas-dev#22394)

3114e93
