BUG: Enable Series.equals to compare numpy arrays to scalars #36161

Merged
merged 10 commits into from
Sep 19, 2020

Conversation

avinashpancham
Contributor

@avinashpancham avinashpancham commented Sep 6, 2020

@avinashpancham
Contributor Author

avinashpancham commented Sep 6, 2020

How do we test functions in the _libs directory? Do we only test them indirectly, since they have no directory of their own under the tests directory?

Contributor

@jreback jreback left a comment

@avinashpancham use the original example as a test in pandas/tests/series/methods/test_equals.py

also a release note for 1.2

@jreback jreback added Bug Numeric Operations Arithmetic, Comparison, and Logical operations labels Sep 6, 2020
@avinashpancham
Contributor Author

Bit stuck here. Two tests in the CI keep failing due to a JSONArray being used in pandas\tests\extension\base\interface.py and pandas\tests\extension\base\casting.py, but locally pytest does not find these tests.

@dsaxton
Member

dsaxton commented Sep 8, 2020

Bit stuck here. Two tests in the CI keep failing due to a JSONArray being used in pandas\tests\extension\base\interface.py and pandas\tests\extension\base\casting.py, but locally pytest does not find these tests.

@avinashpancham I believe the tests you're looking for are in pandas/tests/extension/json/test_json.py (the files you mention are defining base testing classes from which others inherit)

@avinashpancham
Contributor Author

This fix solves the problem described in the PR, but I found that ValueErrors are still thrown when different data types, such as dictionaries, are provided. Will work on that.

import pandas as pd
import numpy as np

arr = np.array([1, 2])
s1 = pd.Series([arr, arr])
s2 = s1.copy()

s1[1] = 9
s1.equals(s2) # Works

s1[1] = {"a": 1}
s1.equals(s2) # This throws a ValueError
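
For context, a ValueError like this usually originates in NumPy rather than in pandas: comparing an array with a scalar is elementwise, and the resulting array has no single truth value. A minimal NumPy-only sketch of that mechanism follows. It is not the pandas internals, and `truth_value_error` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

arr = np.array([1, 2])

# Comparing an array with a scalar is elementwise, so the result is an array.
elementwise = arr == 9


def truth_value_error(obj):
    """Return the ValueError message raised by bool(obj), or None."""
    try:
        bool(obj)
    except ValueError as exc:
        return str(exc)
    return None


# A two-element boolean array cannot be collapsed to a single True/False.
message = truth_value_error(elementwise)
print(elementwise, message)
```

Any code path that asks for one boolean from such a comparison will hit the same error, which is presumably what happens inside the element-by-element equality check.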

@avinashpancham
Contributor Author

@dsaxton Thanks for the explanation, that helped me with testing it locally

Contributor

@jreback jreback left a comment

finally this is a heavily used path, need to run the entire asv suite and see what comes up

elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y):
    return False
# Check if arrays have the same type
elif not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y)) \
Contributor

can you format this with parens rather than \? much easier to read

Contributor Author

Done

# s1[1] = 9
# assert not s1.equals(s2)
s1[1] = 9
assert not s1.equals(s2)
Contributor

can you add all of the missing cases
e.g. not s1.equals('foo')

also we have a bunch of tests for array_equivalent, these need a battery of tests

Contributor Author

parameterized values in test_equals and added tests for array_equivalent with a Series
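
As a rough illustration of what such parameterized cases might look like, here is a sketch that assumes pandas 1.2 or later, where this fix is included. `MISMATCHED_VALUES`, `object_series` and `equals_with_replacement` are illustrative names, not the tests actually added in this PR; in a pytest suite the list would normally feed `pytest.mark.parametrize`.

```python
import numpy as np
import pandas as pd

# Values of a different kind than the NumPy array they are compared against.
MISMATCHED_VALUES = [9, 9.5, {"a": 1}]


def object_series(values):
    """Build an object-dtype Series holding each value as a single element."""
    data = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        data[i] = value
    return pd.Series(data)


def equals_with_replacement(value):
    """Compare [arr, value] against [arr, arr] using Series.equals."""
    arr = np.array([1, 2])
    return object_series([arr, value]).equals(object_series([arr, arr]))


for value in MISMATCHED_VALUES:
    # Expected to be False rather than raising ValueError.
    assert not equals_with_replacement(value)

# Sanity check: identical contents still compare equal.
assert equals_with_replacement(np.array([1, 2]))
```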

@avinashpancham
Contributor Author

avinashpancham commented Sep 10, 2020

@jreback ran entire asv suite, see the top 10 below. It was ofc expected to be slower, but what would be acceptable? If necessary I can also share the complete list.

   before           after         ratio
 [8c7efd1c]       [bfb36ed4]
 <master>         <gh35267>
+      25.9±2ms         48.8±3ms     1.88  frame_methods.Equals.time_frame_nonunique_equal
+      26.6±2ms         48.6±3ms     1.83  frame_methods.Equals.time_frame_nonunique_unequal
+     754±200ns       1.25±0.7μs     1.66  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
+      36.5±3ms         57.7±3ms     1.58  frame_methods.Equals.time_frame_object_unequal
+      47.4±2ms         68.2±4ms     1.44  frame_methods.Equals.time_frame_object_equal
+      19.8±3ms         27.9±3ms     1.41  index_object.SetOperations.time_operation('strings', 'symmetric_difference')
+   2.48±0.06ms       3.35±0.2ms     1.35  stat_ops.SeriesMultiIndexOps.time_op(0, 'sem')
+      83.8±1ms          112±8ms     1.34  rolling.ExpandingMethods.time_expanding('DataFrame', 'int', 'median')
+   1.84±0.03ms       2.45±0.2ms     1.33  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')
+    2.83±0.6μs         3.69±2μs     1.31  index_cached_properties.IndexCache.time_engine('UInt64Index')

@avinashpancham
Contributor Author

The running of the entire asv suite took close to 8 hours on my laptop. The docs state: "Running the full test suite can take up to one hour and use up to 3GB of RAM".

Did I do something wrong, cause this is a significant difference, else we should give a more realistic estimate in the docs.

@avinashpancham avinashpancham changed the title BUG: Enable Series.equals to compare iterables to non-iterables BUG: Enable Series.equals to compare numpy arrays to scalars Sep 10, 2020
@jreback
Contributor

jreback commented Sep 12, 2020

The running of the entire asv suite took close to 8 hours on my laptop. The docs state: "Running the full test suite can take up to one hour and use up to 3GB of RAM".

Did I do something wrong, cause this is a significant difference, else we should give a more realistic estimate in the docs.

docs are likely out of date and the suite is now much bigger.

@jreback
Contributor

jreback commented Sep 12, 2020

hmm these perf numbers are a big change. can you run the biggest ones again; sometimes the suite is flaky.

@avinashpancham
Contributor Author

avinashpancham commented Sep 12, 2020

The running of the entire asv suite took close to 8 hours on my laptop. The docs state: "Running the full test suite can take up to one hour and use up to 3GB of RAM".
Did I do something wrong, cause this is a significant difference, else we should give a more realistic estimate in the docs.

docs are likely out of date and the suite is now much bigger.

Is it ok if I submit an issue to get this corrected in the docs? I can imagine that people wouldn't mind running tests on their laptop for 1h, but not for approx 8h. Then they would rather do it on another machine/server

@jreback
Contributor

jreback commented Sep 12, 2020

The running of the entire asv suite took close to 8 hours on my laptop. The docs state: "Running the full test suite can take up to one hour and use up to 3GB of RAM".

Did I do something wrong, cause this is a significant difference, else we should give a more realistic estimate in the docs.

docs are likely out of date and the suite is now much bigger.

Is it ok if I submit an issue to get this corrected in the docs? I can imagine that people wouldn't mind running tests on their laptop for 1h, but not for approx 8h. Then they would rather do it on another machine/server

sure

@avinashpancham
Contributor Author

frame_methods.Equals and index_cached_properties.IndexCache still show a perf decrease in the 2nd run, see below. Can look if there is another way to solve the bug while maintaining similar perf.

The others (index_object.SetOperations, stat_ops.SeriesMultiIndexOps and rolling.ExpandingMethods) did not have a perf decrease in the 2nd run.

   before           after         ratio
 [822dc6f9]       [e94db6db]
                  <gh35267>
+      23.9±1ms       46.8±0.3ms     1.96  frame_methods.Equals.time_frame_nonunique_unequal
+      32.6±2ms         60.7±6ms     1.86  frame_methods.Equals.time_frame_object_unequal
+      25.0±1ms       45.5±0.7ms     1.82  frame_methods.Equals.time_frame_nonunique_equal
+    37.9±0.7ms         62.8±2ms     1.66  frame_methods.Equals.time_frame_object_equal
+    1.04±0.2μs       1.73±0.9μs     1.66  index_cached_properties.IndexCache.time_is_all_dates('UInt64Index')
+    1.10±0.1μs       1.62±0.7μs     1.46  index_cached_properties.IndexCache.time_inferred_type('UInt64Index')
+    1.37±0.2μs       2.00±0.7μs     1.46  index_cached_properties.IndexCache.time_is_all_dates('TimedeltaIndex')
+    1.23±0.2μs       1.75±0.5μs     1.42  index_cached_properties.IndexCache.time_inferred_type('TimedeltaIndex')
+    4.06±0.6μs         5.74±2μs     1.41  index_cached_properties.IndexCache.time_engine('TimedeltaIndex')
+      814±90ns       1.09±0.4μs     1.34  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
+    1.35±0.1μs       1.72±0.2μs     1.27  index_cached_properties.IndexCache.time_is_all_dates('MultiIndex')
+     828±200ns        986±300ns     1.19  index_cached_properties.IndexCache.time_is_all_dates('Float64Index')
+    2.91±0.3μs         3.44±1μs     1.18  index_cached_properties.IndexCache.time_engine('UInt64Index')
+    1.94±0.1μs       2.23±0.1μs     1.15  index_cached_properties.IndexCache.time_inferred_type('IntervalIndex')

@jreback
Contributor

jreback commented Sep 12, 2020

@jbrockmendel WDYT

@jbrockmendel
Member

I think apocalyptic smoke makes weekends really productive. will take a look

@@ -581,6 +581,13 @@ def array_equivalent_object(left: object[:], right: object[:]) -> bool:
        return False
    elif (x is C_NA) ^ (y is C_NA):
        return False
    # Only compare scalars to scalars and arrays to arrays
    elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y):
Member

can you put these comments indented inside the conditions

Member

i think id rather use our is_scalar than cnp.PyArray_IsAnyScalar

Contributor Author

If I change it to is_scalar one unit test fails and the perf decreases even more. It is almost twice as slow as using cnp.PyArray_IsAnyScalar

   before           after         ratio
 [822dc6f9]       [eeabb266]
                  <gh35267>
+      25.6±1ms         99.8±2ms     3.90  frame_methods.Equals.time_frame_nonunique_unequal
+      28.2±3ms         97.3±1ms     3.44  frame_methods.Equals.time_frame_nonunique_equal
+      31.2±2ms        106±0.9ms     3.40  frame_methods.Equals.time_frame_object_unequal
+      43.3±3ms        112±0.7ms     2.59  frame_methods.Equals.time_frame_object_equal

Contributor Author

Put comments indented inside the conditions.

Member

im pre-caffeine so bear with me: what does PyArray_IsAnyScalar include? or more specifically, what does it not include that is_scalar does?

is_scalar one unit test fails

what breaks?

Contributor Author

PyArray_IsAnyScalar does not include None, datetime.datetime, datetime.timedelta, Period, decimal.Decimal, Fraction, numbers.Number, Interval, or DateOffset, while is_scalar does.

If I see this I think it would be good to use is_scalar for consistency. But the drawback is the worse perf, because under the hood is_scalar makes use of PyArray_IsAnyScalar and some other functions.

Nvm the failed unit test, I see that it is a broken test on master that appeared after I merged master (pandas/tests/io/test_parquet.py::test_s3_roundtrip_for_dir)
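
To make the difference in coverage concrete, here is a small sketch using the public `pandas.api.types.is_scalar`. The C macro `PyArray_IsAnyScalar` is not callable from Python, so `np.isscalar` is used below purely as a rough stand-in. That substitution is an assumption: `np.isscalar` is broader than the macro (for example, it also accepts `numbers.Number` instances), so only values where the two are expected to agree are shown.

```python
import datetime

import numpy as np
from pandas.api.types import is_scalar

samples = {
    "None": None,
    "datetime": datetime.datetime(2020, 9, 6),
    "timedelta": datetime.timedelta(days=1),
    "int": 3,
    "ndarray": np.array([1, 2]),
}

for name, obj in samples.items():
    # np.isscalar is only a rough Python-level stand-in for the C macro.
    print(f"{name:>9}: is_scalar={is_scalar(obj)}, np.isscalar={np.isscalar(obj)}")
```

The first three entries are the interesting ones: pandas treats them as scalars, while the NumPy-level check does not.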

Contributor Author

@jbrockmendel how do we want to continue with this? Shall I replace PyArray_IsAnyScalar with is_scalar, or leave it as it is?

Member

IIRC not is_list_like(obj) is more general and more performant than is_scalar(obj). would that work in this context?
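
A quick sketch of what "more general" could mean here, using the public `pandas.api.types` functions. `Opaque` is a hypothetical class added only for illustration; the point is that an arbitrary non-iterable object is not a scalar in the pandas sense, yet it is also not list-like.

```python
import numpy as np
from pandas.api.types import is_list_like, is_scalar


class Opaque:
    """Hypothetical user-defined object that is neither scalar nor iterable."""


candidates = [1, "foo", None, Opaque(), [1, 2], {"a": 1}, np.array([1, 2])]

for obj in candidates:
    # Columns: type name, is_scalar(obj), not is_list_like(obj)
    print(type(obj).__name__, is_scalar(obj), not is_list_like(obj))
```

For `Opaque()` the two predicates disagree, so swapping one for the other changes which pairs are treated as comparable. This says nothing about their relative speed inside the Cython loop.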

Contributor Author

If I replace PyArray_IsAnyScalar with is_list_like and update the code as shown below, the perf becomes (much) worse. Will try the try/except option jreback proposed.

elif is_list_like(x) != is_list_like(y):
    # Only compare scalars to scalars and non-scalars to non-scalars
    return False
elif (is_list_like(x) and is_list_like(y)
      and not (isinstance(x, type(y)) or isinstance(y, type(x)))):
    # Check if non-scalars have the same type
    return False

   before           after         ratio
 [822dc6f9]       [98cea4b3]
 <master~31>       <gh35267_clean>
+    25.1±0.6ms          695±2ms    27.65  frame_methods.Equals.time_frame_nonunique_equal
+    25.4±0.6ms          702±3ms    27.65  frame_methods.Equals.time_frame_nonunique_unequal
+    34.0±0.3ms          711±9ms    20.92  frame_methods.Equals.time_frame_object_unequal
+    44.5±0.5ms          718±3ms    16.14  frame_methods.Equals.time_frame_object_equal

elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y):
    return False
# Check if arrays have the same type
elif (not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y))
Member

are you using not cnp.PyArray_IsPythonScalar(x) as a standin for isinstance(x, np.ndarray)?

Contributor Author

I don't believe I am. My reasoning for the code is as follows:

The first condition checks whether the two values are of different kinds, i.e. a scalar vs non-scalar comparison. If they are, return False.

If both sides are scalars, the current logic in pandas handles the comparison correctly. The code still fails, though, if the two sides are non-scalars of different types (e.g. a dict vs list comparison). Hence the check for non-scalars (not cnp.PyArray_IsPythonScalar(x)), followed by a check that both have the same type.

Contributor

can we not just wrap a try/except around the RichCompareBool? and handle there if there are odd types

this will avoid the perf hit.
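
A pure-Python sketch of the try/except idea, for illustration only. The real check lives in Cython around `PyObject_RichCompareBool`, so `elements_equal` is a hypothetical helper and `np.isscalar` is an assumed stand-in for `cnp.PyArray_IsAnyScalar`. The point is that the extra type checks run only after a comparison has already failed, so the common path pays nothing for them.

```python
import numpy as np


def elements_equal(x, y):
    """Hypothetical Python helper; not the pandas implementation."""
    if isinstance(x, np.ndarray) and isinstance(y, np.ndarray):
        return bool(np.array_equal(x, y))
    try:
        # Fast path: no extra type checks when the comparison just works.
        return bool(x == y)
    except ValueError:
        # Reached only for odd pairs such as ndarray vs. scalar, where
        # `x == y` is an array with no single truth value.
        if np.isscalar(x) != np.isscalar(y):
            return False  # scalar vs. non-scalar
        if not (isinstance(x, type(y)) or isinstance(y, type(x))):
            return False  # two non-scalars of unrelated types
        raise  # same kind and still ambiguous: surface the error


arr = np.array([1, 2])
print(elements_equal(arr, 9), elements_equal(1, 1), elements_equal(arr, arr.copy()))
```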

Contributor Author

Put check in a try/except clause. frame_methods.Equals never has a significant perf decrease now (<1.1). Some tests from index_cached_properties.IndexCache are sometimes a little bit higher than 1.1 (e.g. 1.11-1.16)
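
For readers unfamiliar with the ratio column: it appears to be simply after/before, and 1.1 is the cut-off used in this comment for a "significant" slowdown. A small sketch that recomputes the frame_methods.Equals ratios from the first asv table in this thread. The timings are the rounded values shown there, so recomputed ratios could in principle differ in the last digit from what asv prints; `FIRST_RUN_MS` and `slowdown_ratio` are illustrative names.

```python
# (before, after) timings in milliseconds, copied from the first asv table above.
FIRST_RUN_MS = {
    "time_frame_nonunique_equal": (25.9, 48.8),
    "time_frame_nonunique_unequal": (26.6, 48.6),
    "time_frame_object_unequal": (36.5, 57.7),
    "time_frame_object_equal": (47.4, 68.2),
}

THRESHOLD = 1.1  # cut-off mentioned in the comment above


def slowdown_ratio(before, after):
    """Ratio > 1 means the benchmark got slower."""
    return after / before


ratios = {
    name: round(slowdown_ratio(before, after), 2)
    for name, (before, after) in FIRST_RUN_MS.items()
}
regressions = [name for name, ratio in ratios.items() if ratio > THRESHOLD]
print(ratios)
```

With the first-run numbers all four benchmarks exceed the threshold, which is what prompted the search for a cheaper fix.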

Contributor

ok let's go with this (add a comment as well)

Contributor Author

Great. I've added the comment to the try/except clause.

@jreback
Contributor

jreback commented Sep 19, 2020

try with the cython is_list_like; you need to cimport it

@avinashpancham
Contributor Author

is_list_like is defined in lib.pyx, so then I don't need to import it, right?

@jreback jreback merged commit a22cf43 into pandas-dev:master Sep 19, 2020
@jreback
Contributor

jreback commented Sep 19, 2020

thanks @avinashpancham very nice!

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Development

Successfully merging this pull request may close these issues.

BUG: Series.equals fails when comparing numpy arrays to scalars
4 participants