BUG: Enable Series.equals to compare numpy arrays to scalars #36161

avinashpancham · 2020-09-06T12:50:08Z

closes BUG: Series.equals fails when comparing numpy arrays to scalars #35267
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

avinashpancham · 2020-09-06T12:54:54Z

How do we test functions in the _libs directory, or are they only tested indirectly? They don't have their own directory under the tests directory.

jreback

@avinashpancham use the original example as a test in pandas/tests/series/methods/test_equals.py

also a release note for 1.2

avinashpancham · 2020-09-06T21:46:58Z

Bit stuck here. Two tests in the CI keep failing due to a JSONArray being used in pandas\tests\extension\base\interface.py and pandas\tests\extension\base\casting.py, but locally pytest does not find these tests.

dsaxton · 2020-09-08T00:24:25Z

@avinashpancham I believe the tests you're looking for are in pandas/tests/extension/json/test_json.py (the files you mention are defining base testing classes from which others inherit)

avinashpancham · 2020-09-08T06:24:00Z

This fix solves the problem described in the PR, but I found that ValueErrors are still thrown when different data types, such as dictionaries, are provided. Will work on that.

import pandas as pd
import numpy as np

arr = np.array([1, 2])
s1 = pd.Series([arr, arr])
s2 = s1.copy()

s1[1] = 9
s1.equals(s2) # Works

s1[1] = {"a": 1}
s1.equals(s2) # This throws a ValueError
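For context, a minimal sketch of where a ValueError of this kind typically comes from. It assumes the element comparison ends up taking the truth value of `x == y`, which is an assumption about the internals rather than something shown in this comment. It uses only NumPy:

```python
import numpy as np

arr = np.array([1, 2])

# Comparing an array with a scalar is elementwise, so the result is an array
elementwise = arr == 9

# Taking the truth value of a multi-element array is ambiguous and raises
try:
    bool(elementwise)
    raised = False
except ValueError:
    raised = True
```

Any code path that treats the result of `==` as a single boolean can hit this when one side is an array with more than one element.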

avinashpancham · 2020-09-08T06:25:06Z

@dsaxton Thanks for the explanation, that helped me with testing it locally

jreback

finally this is a heavily used path, need to run the entire asv suite and see what comes up

jreback · 2020-09-08T22:47:44Z

pandas/_libs/lib.pyx

+            elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y):
+                return False
+            # Check if arrays have the same type
+            elif not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y)) \


can you format this with parens rather than \ much easier to read

jreback · 2020-09-08T22:49:12Z

pandas/tests/series/methods/test_equals.py

-    # s1[1] = 9
-    # assert not s1.equals(s2)
+    s1[1] = 9
+    assert not s1.equals(s2)


can you add all of the missing cases
e.g. not s1.equals('foo')

also we have a bunch of tests for array_equivalent, these need a battery of tests

parameterized values in test_equals and added tests for array_equivalent with a Series
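A rough sketch of what looping over replacement values could look like. This is not the test code added in this PR. The helper name `series_with_replacement` is hypothetical, and the expected results assume a pandas version that includes this fix:

```python
import numpy as np
import pandas as pd


def series_with_replacement(value):
    """Hypothetical helper: two equal object Series, then replace one element."""
    arr = np.array([1, 2])
    s1 = pd.Series([arr, arr])
    s2 = s1.copy()
    s1[1] = value
    return s1, s2


results = {}
for value in [9, 9.5]:
    s1, s2 = series_with_replacement(value)
    # With the fix, comparing an array element to a scalar yields False
    # instead of raising ValueError
    results[value] = s1.equals(s2)
```

In a real test module the loop would more naturally be a `pytest.mark.parametrize` over the values.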

avinashpancham · 2020-09-10T05:30:30Z

@jreback ran entire asv suite, see the top 10 below. It was ofc expected to be slower, but what would be acceptable? If necessary I can also share the complete list.

   before           after         ratio
 [8c7efd1c]       [bfb36ed4]
 <master>         <gh35267>
   25.9±2ms         48.8±3ms     1.88  frame_methods.Equals.time_frame_nonunique_equal
   26.6±2ms         48.6±3ms     1.83  frame_methods.Equals.time_frame_nonunique_unequal
  754±200ns       1.25±0.7μs     1.66  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
   36.5±3ms         57.7±3ms     1.58  frame_methods.Equals.time_frame_object_unequal
   47.4±2ms         68.2±4ms     1.44  frame_methods.Equals.time_frame_object_equal
   19.8±3ms         27.9±3ms     1.41  index_object.SetOperations.time_operation('strings', 'symmetric_difference')
2.48±0.06ms       3.35±0.2ms     1.35  stat_ops.SeriesMultiIndexOps.time_op(0, 'sem')
   83.8±1ms          112±8ms     1.34  rolling.ExpandingMethods.time_expanding('DataFrame', 'int', 'median')
1.84±0.03ms       2.45±0.2ms     1.33  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')
 2.83±0.6μs         3.69±2μs     1.31  index_cached_properties.IndexCache.time_engine('UInt64Index')

avinashpancham · 2020-09-10T05:34:43Z

The running of the entire asv suite took close to 8 hours on my laptop. The docs state: "Running the full test suite can take up to one hour and use up to 3GB of RAM".

Did I do something wrong? This is a significant difference. If not, we should give a more realistic estimate in the docs.

jreback · 2020-09-12T21:42:01Z

docs are likely out of date and the suite is now much bigger.

jreback · 2020-09-12T21:42:38Z

hmm these perf numbers are a big change. can you run the biggest ones again; sometimes the suite is flaky.

avinashpancham · 2020-09-12T22:10:35Z

Is it ok if I submit an issue to get this corrected in the docs? I can imagine that people wouldn't mind running tests on their laptop for 1h, but not for approx 8h. Then they would rather do it on another machine/server

jreback · 2020-09-12T22:14:39Z

sure

avinashpancham · 2020-09-12T23:12:40Z

frame_methods.Equals and index_cached_properties.IndexCache still show a perf decrease in the 2nd run, see below. Can look if there is another way to solve the bug while maintaining similar perf.

The others (index_object.SetOperations, stat_ops.SeriesMultiIndexOps and rolling.ExpandingMethods) did not have a perf decrease in the 2nd run.

   before           after         ratio
 [822dc6f9]       [e94db6db]
                  <gh35267>
   23.9±1ms       46.8±0.3ms     1.96  frame_methods.Equals.time_frame_nonunique_unequal
   32.6±2ms         60.7±6ms     1.86  frame_methods.Equals.time_frame_object_unequal
   25.0±1ms       45.5±0.7ms     1.82  frame_methods.Equals.time_frame_nonunique_equal
 37.9±0.7ms         62.8±2ms     1.66  frame_methods.Equals.time_frame_object_equal
 1.04±0.2μs       1.73±0.9μs     1.66  index_cached_properties.IndexCache.time_is_all_dates('UInt64Index')
 1.10±0.1μs       1.62±0.7μs     1.46  index_cached_properties.IndexCache.time_inferred_type('UInt64Index')
 1.37±0.2μs       2.00±0.7μs     1.46  index_cached_properties.IndexCache.time_is_all_dates('TimedeltaIndex')
 1.23±0.2μs       1.75±0.5μs     1.42  index_cached_properties.IndexCache.time_inferred_type('TimedeltaIndex')
 4.06±0.6μs         5.74±2μs     1.41  index_cached_properties.IndexCache.time_engine('TimedeltaIndex')
   814±90ns       1.09±0.4μs     1.34  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
 1.35±0.1μs       1.72±0.2μs     1.27  index_cached_properties.IndexCache.time_is_all_dates('MultiIndex')
  828±200ns        986±300ns     1.19  index_cached_properties.IndexCache.time_is_all_dates('Float64Index')
 2.91±0.3μs         3.44±1μs     1.18  index_cached_properties.IndexCache.time_engine('UInt64Index')
 1.94±0.1μs       2.23±0.1μs     1.15  index_cached_properties.IndexCache.time_inferred_type('IntervalIndex')

jreback · 2020-09-12T23:28:26Z

@jbrockmendel WDYT

jbrockmendel · 2020-09-12T23:34:06Z

I think apocalyptic smoke makes weekends really productive. will take a look

jbrockmendel · 2020-09-12T23:35:12Z

pandas/_libs/lib.pyx

@@ -581,6 +581,13 @@ def array_equivalent_object(left: object[:], right: object[:]) -> bool:
                    return False
            elif (x is C_NA) ^ (y is C_NA):
                return False
+            # Only compare scalars to scalars and arrays to arrays
+            elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y):


can you put these comments indented inside the conditions

i think id rather use our is_scalar than cnp.PyArray_IsAnyScalar

If I change it to is_scalar one unit test fails and the perf decreases even more. It is almost twice as slow as using cnp.PyArray_IsAnyScalar

   before           after         ratio
 [822dc6f9]       [eeabb266]
                  <gh35267>
   25.6±1ms         99.8±2ms     3.90  frame_methods.Equals.time_frame_nonunique_unequal
   28.2±3ms         97.3±1ms     3.44  frame_methods.Equals.time_frame_nonunique_equal
   31.2±2ms        106±0.9ms     3.40  frame_methods.Equals.time_frame_object_unequal
   43.3±3ms        112±0.7ms     2.59  frame_methods.Equals.time_frame_object_equal

Put comments indented inside the conditions.

im pre-caffeine so bear with me: what does PyArray_IsAnyScalar include? or more specifically, what does it not include that is_scalar does?

is_scalar one unit test fails

what breaks?

PyArray_IsAnyScalar does not include None, datetime.datetime, datetime.timedelta, Period, decimal.Decimal, Fraction, numbers.Number, Interval, DateOffset, while is_scalar does.

If I see this I think it would be good to use is_scalar for consistency. But the drawback is the worse perf, because under the hood is_scalar makes use of PyArray_IsAnyScalar and some other functions.
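To make the `is_scalar` side of that comparison concrete, here is a small sketch using the public `pandas.api.types.is_scalar`. `PyArray_IsAnyScalar` is a NumPy C-API macro and cannot be called from plain Python, so only the pandas function is exercised here. The variable names are illustrative:

```python
import datetime
import decimal
import fractions

import numpy as np
from pandas.api.types import is_scalar

# Values from the list above that is_scalar treats as scalars
scalar_like = [
    None,
    datetime.datetime(2020, 9, 6),
    datetime.timedelta(days=1),
    decimal.Decimal("1.5"),
    fractions.Fraction(1, 2),
]

# Containers that is_scalar does not treat as scalars
non_scalar = [np.array([1, 2]), [1, 2], {"a": 1}]

scalar_flags = [is_scalar(v) for v in scalar_like]
non_scalar_flags = [is_scalar(v) for v in non_scalar]
```

Every entry in `scalar_flags` is `True` and every entry in `non_scalar_flags` is `False`, which is the broader definition of "scalar" being weighed against the extra cost.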

Nvm the failed unit test, I see that it is a broken test on master that appeared after I merged master (pandas/tests/io/test_parquet.py::test_s3_roundtrip_for_dir)

@jbrockmendel how do we want to continue with this? Shall I replace PyArray_IsAnyScalar with is_scalar, or leave it as it is?

IIRC not is_listlike(obj) is more general and more performant than is_scalar(obj). would that work in this context?

If I replace PyArray_IsAnyScalar with is_list_like and update the code like below, the perf becomes (much) worse. Will try the try/except option jreback proposed.

elif is_list_like(x) != is_list_like(y):
    # Only compare scalars to scalars and non-scalars to non-scalars
    return False
elif (is_list_like(x) and is_list_like(y)
      and not (isinstance(x, type(y)) or isinstance(y, type(x)))):
    # Check if non-scalars have the same type
    return False

   before           after         ratio
 [822dc6f9]       [98cea4b3]
 <master~31>      <gh35267_clean>
 25.1±0.6ms          695±2ms    27.65  frame_methods.Equals.time_frame_nonunique_equal
 25.4±0.6ms          702±3ms    27.65  frame_methods.Equals.time_frame_nonunique_unequal
 34.0±0.3ms          711±9ms    20.92  frame_methods.Equals.time_frame_object_unequal
 44.5±0.5ms          718±3ms    16.14  frame_methods.Equals.time_frame_object_equal
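The two `is_list_like` conditions tried here can be read as a single "same kind" predicate. Below is a plain-Python sketch using the public `pandas.api.types.is_list_like`. The helper name `same_kind` is hypothetical, and calling it from Python says nothing about the Cython-level cost measured above:

```python
import numpy as np
from pandas.api.types import is_list_like


def same_kind(x, y):
    """Hypothetical helper mirroring the two conditions from the comment above."""
    if is_list_like(x) != is_list_like(y):
        # One side is list-like and the other is not
        return False
    if is_list_like(x) and is_list_like(y):
        # Both are list-like: require related types (e.g. not dict vs list)
        return isinstance(x, type(y)) or isinstance(y, type(x))
    # Both are non-list-like: leave the comparison to the normal path
    return True
```

For example, `same_kind(np.array([1, 2]), 9)` and `same_kind([1], {"a": 1})` are both `False`, while `same_kind(1, 2.0)` is `True`.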

jbrockmendel · 2020-09-12T23:37:52Z

pandas/_libs/lib.pyx

+            elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y):
+                return False
+            # Check if arrays have the same type
+            elif (not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y))


are you using not cnp.PyArray_IsPythonScalar(x) as a standin for isinstance(x, np.ndarray)?

I don't believe I am. My reasoning for the code is as follows:

The first condition checks if both values are of different types, i.e. scalar vs non-scalar comparison. If they are different then fail.

If both sides are scalars then the current logic in Pandas can correctly handle the comparison. The code still fails though if both sides have a different non-scalar type (e.g. dict vs list comparison). Hence the check to see if we are dealing with non scalars (not cnp.PyArray_IsPythonScalar(x)) and then check if they are of the same type.

can we not just wrap a try/except around the RichCompareBool? and handle there if there are odd types

this will avoid the perf hit.

Put check in a try/except clause. frame_methods.Equals never has a significant perf decrease now (<1.1). Some test from index_cached_properties.IndexCache are sometimes a little bit higher than 1.1 (e.g. 1.11-1.16)

ok let's go with this (add a comment as well)

Great. I've added the comment to the try/except clause.
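The idea behind the try/except approach can be sketched in plain Python. This is not the Cython code from `lib.pyx`. The function name `objects_equal` is hypothetical, `np.isscalar` is used as a rough stand-in for `PyArray_IsAnyScalar`, and the missing-value handling of the real function is left out:

```python
import numpy as np


def objects_equal(x, y):
    """Hypothetical pure-Python sketch: compare first, classify only on failure."""
    try:
        if isinstance(x, np.ndarray) and isinstance(y, np.ndarray):
            return bool(np.array_equal(x, y))
        # May raise ValueError if the comparison yields a multi-element array
        return bool(x == y)
    except ValueError:
        if np.isscalar(x) != np.isscalar(y):
            # Only compare scalars to scalars and non-scalars to non-scalars
            return False
        if not (isinstance(x, type(y)) or isinstance(y, type(x))):
            # Non-scalars of unrelated types are never equal
            return False
        raise
```

The type checks run only after a comparison has already failed, so the common path pays no extra cost. That is consistent with the benchmark ratios dropping back below 1.1 once the checks moved into the `except` branch.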

jreback · 2020-09-18T23:10:03Z

pandas/_libs/lib.pyx

+            elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y):
+                return False
+            # Check if arrays have the same type
+            elif (not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y))


can we not just wrap a try/except around the RichCompareBool? and handle there if there are odd types

this will avoid the perf hit.

jreback · 2020-09-19T11:53:56Z

try with the cython is_list_like need to cimport it

avinashpancham · 2020-09-19T13:22:20Z

is_list_like is defined in lib.pyx, so then I don't need to import it, right?

jreback · 2020-09-19T20:29:49Z

thanks @avinashpancham very nice!

jreback requested changes Sep 6, 2020

jreback added Bug Numeric Operations Arithmetic, Comparison, and Logical operations labels Sep 6, 2020

avinashpancham force-pushed the gh35267 branch from 42a8c38 to e31dfe1 Compare September 8, 2020 21:49

jreback requested changes Sep 8, 2020

dsaxton reviewed Sep 9, 2020

avinashpancham changed the title ~~BUG: Enable Series.equals to compare iterables to non-iterables~~ BUG: Enable Series.equals to compare numpy arrays to scalars Sep 10, 2020

avinashpancham force-pushed the gh35267 branch from dd21fe2 to 232f581 Compare September 12, 2020 13:02

avinashpancham added 5 commits September 12, 2020 15:05

BUG: Enable Series.equals to compare numpy arrays to scalars

9bd42bf

Refactor condition with parentheses instead of slash

c84a39d

Update whatsnew message

8b26aec

Extend tests in test_equals.py with different types

6367f9a

Add tests for array_equivalent with Series

e94db6d

avinashpancham force-pushed the gh35267 branch from 232f581 to e94db6d Compare September 12, 2020 13:06

jreback added this to the 1.2 milestone Sep 12, 2020

jbrockmendel reviewed Sep 12, 2020

Reword comments and put comments indented inside conditions

dc67138

avinashpancham mentioned this pull request Sep 13, 2020

DOC: Projected time for running entire performance test suite is outdated #36344

Closed

jreback requested changes Sep 18, 2020

Put checks to prevent ValueError in try/except clause

61a58a9

avinashpancham added 3 commits September 19, 2020 15:54

Assert FutureWarning that is thrown when comparing strings

993cd93

Add comment for try/except clause

98c580f

Move import statement to right place and dont support import python3.6 and lower

3b1d788

avinashpancham commented Sep 19, 2020

jreback approved these changes Sep 19, 2020

jreback merged commit a22cf43 into pandas-dev:master Sep 19, 2020

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

BUG: Enable Series.equals to compare numpy arrays to scalars (pandas-dev#36161)

4f845bf
