-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Enable Series.equals to compare numpy arrays to scalars #36161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
How do we test functions in the _libs directory, or do we only test them indirectly? Since they don't have a directory in the test directory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avinashpancham use the original example as a test in pandas/tests/series/methods/test_equals.py
also a release note for 1.2
Bit stuck here. Two tests in the CI keep failing due to a JSONArray being used in |
@avinashpancham I believe the tests you're looking for are in pandas/tests/extension/json/test_json.py (the files you mention are defining base testing classes from which others inherit) |
This fix solves the problem described in the PR, but I found that ValueErrors are still thrown when different data types, such as dictionaries, are provided. Will work on that.
|
@dsaxton Thanks for the explanation, that helped me with testing it locally |
42a8c38
to
e31dfe1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finally this is a heavily used path, need to run the entire asv suite and see what comes up
pandas/_libs/lib.pyx
Outdated
elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y): | ||
return False | ||
# Check if arrays have the same type | ||
elif not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y)) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you format this with parens rather than \
much easier to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
# s1[1] = 9 | ||
# assert not s1.equals(s2) | ||
s1[1] = 9 | ||
assert not s1.equals(s2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add all of the missing cases
e.q. not s1.equals('foo')
also we have a bunch of tests for array_equivalent, these need a battery of tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parameterized values in test_equals and added tests for array_equivalent with a Series
@jreback ran entire asv suite, see the top 10 below. It was ofc expected to be slower, but what would be acceptable? If necessary I can also share the complete list.
|
The running of the entire asv suite took close to 8 hours on my laptop. The docs state: "Running the full test suite can take up to one hour and use up to 3GB of RAM". Did I do something wrong, cause this is a significant difference, else we should give a more realistic estimate in the docs. |
dd21fe2
to
232f581
Compare
232f581
to
e94db6d
Compare
docs are likely out of data and the suite is now much bigger. |
hmm these perf numbers are a big change. can you run the biggest ones again; sometimes the suite is flaky. |
Is it ok if I submit an issue to get this corrected in the docs? I can imagine that people wouldn't mind running tests on their laptop for 1h, but not for approx 8h. Then they would rather do it on another machine/server |
sure |
The others (
|
@jbrockmendel WDYT |
I think apocalyptic smoke makes weekends really productive. will take a look |
pandas/_libs/lib.pyx
Outdated
@@ -581,6 +581,13 @@ def array_equivalent_object(left: object[:], right: object[:]) -> bool: | |||
return False | |||
elif (x is C_NA) ^ (y is C_NA): | |||
return False | |||
# Only compare scalars to scalars and arrays to arrays | |||
elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put these comments indented inside the conditions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think id rather use our is_scalar
than cnp.PyArray_IsAnyScalar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I change it to is_scalar
one unit test fails and the perf decreases even more. It is almost twice as slow as using cnp.PyArray_IsAnyScalar
before after ratio
[822dc6f9] [eeabb266]
<gh35267>
-
25.6±1ms 99.8±2ms 3.90 frame_methods.Equals.time_frame_nonunique_unequal
-
28.2±3ms 97.3±1ms 3.44 frame_methods.Equals.time_frame_nonunique_equal
-
31.2±2ms 106±0.9ms 3.40 frame_methods.Equals.time_frame_object_unequal
-
43.3±3ms 112±0.7ms 2.59 frame_methods.Equals.time_frame_object_equal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put comments indented inside the conditons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im pre-caffeine so bear with me: what does PyArray_IsAnyScalar include? or more specifically, what does it not include that is_scalar does?
is_scalar one unit test fails
what breaks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyArray_IsAnyScalar
does not include None, datetime.datetime, datetime.timedelta, Period, decimal.Decimal, Fraction, numbers.Number, Interval DateOffset, while is_scalar
does.
If I see this I think it would be good to use is_scalar
for consistency. But the drawback is the worse perf, because under the hood is_scalar
makes use of PyArray_IsAnyScalar
and some other functions.
Nvm the failed unit test, I see that it is a broken test on master that appeared after I merged master (pandas/tests/io/test_parquet.y::test_s3_roundtrip_for_dir)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel how do we want to continue with this? Shall I replace PyArray_IsAnyScalar
with is_scalar
, or leave it as it is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC not is_listlike(obj)
is more general and more performant than is_scalar(obj)
. would that work in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I replace PyArray_IsAnyScalar
with is_listlike
and update the code like below, the perf becomes (much) worse. Will try the try/except option jreback proposed.
elif is_list_like(x) != is_list_like(y):
# Only compare scalars to scalars and non-scalars to non-scalars
return False
elif (is_list_like(x) and is_list_like(y)
and not (isinstance(x, type(y)) or isinstance(y, type(x)))):
# Check if non-scalars have the same type
return False
before after ratio
[822dc6f9] [98cea4b3]
<master~31> <gh35267_clean>
-
25.1±0.6ms 695±2ms 27.65 frame_methods.Equals.time_frame_nonunique_equal
-
25.4±0.6ms 702±3ms 27.65 frame_methods.Equals.time_frame_nonunique_unequal
-
34.0±0.3ms 711±9ms 20.92 frame_methods.Equals.time_frame_object_unequal
-
44.5±0.5ms 718±3ms 16.14 frame_methods.Equals.time_frame_object_equal
pandas/_libs/lib.pyx
Outdated
elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y): | ||
return False | ||
# Check if arrays have the same type | ||
elif (not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you using not cnp.PyArray_IsPythonScalar(x)
as a standin for isinstance(x, np.ndarray)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe I am. My reasoning for the code is as follows:
The first condition checks if both values are of different types, i.e. scalar vs non-scalar comparison. If they are different then fail.
If both sides are scalars then the current logic in Pandas can correctly handle the comparison. The code still fails though if both sides have a different non-scalar type (e.g. dict vs list comparison). Hence the check to see if we are dealing with non scalars (not cnp.PyArray_IsPythonScalar(x)
) and then check if they are of the same type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not just wrap a try/except around the RichCompareBool? and handle there if there are odd types
this will avoid the perf hit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put check in a try/except clause. frame_methods.Equals never has a significant perf decrease now (<1.1). Some test from index_cached_properties.IndexCache are sometimes a little bit higher than 1.1 (e.g. 1.11-1.16)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok let's go with this (add a comment as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. I've added the comment to the try/except clause.
pandas/_libs/lib.pyx
Outdated
elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y): | ||
return False | ||
# Check if arrays have the same type | ||
elif (not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not just wrap a try/except around the RichCompareBool? and handle there if there are odd types
this will avoid the perf hit.
try with the cython is_list_like need to cimport it |
|
thanks @avinashpancham very nice! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff