Skip to content

BUG: Hash and compare tuple subclasses as builtin tuples #59286

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jul 22, 2024

Conversation

matiaslindgren
Copy link
Contributor

@matiaslindgren matiaslindgren commented Jul 20, 2024

Relaxing the PyTuple_CheckExact in khash_python.h to PyTuple_Check would allow the usage of types created with collections.namedtuple as index keys everywhere tuple is expected.

It is worth noting this would change the behaviour of all client code expecting custom __eq__ and __hash__ functions in tuple subclasses to be respected.

This also removes the non-deterministic behaviour in cases where collections.namedtuple types and tuples are used interchangeably.
I traced the intermittent KeyErrors to this part in PyObjectHashTable.
I did not verify this, but I suspect the custom tuple hasher leads to a scenario where builtin tuples are hashed using the custom hasher, while all tuple subclasses are hashed with their own __hash__, resulting in different hash values.
For small hash tables, the resulting index might point to the correct bucket just by coincidence.

A similar issue, but for different types, was reported in #39585.

@matiaslindgren matiaslindgren marked this pull request as draft July 20, 2024 09:40
@matiaslindgren matiaslindgren changed the title [BUG] Cast all tuple subclass index keys to tuple [BUG] Jul 21, 2024
@matiaslindgren matiaslindgren changed the title [BUG] [BUG] Hash and compare tuple subclasses as builtin tuples Jul 21, 2024
@matiaslindgren matiaslindgren marked this pull request as ready for review July 21, 2024 18:11
@matiaslindgren matiaslindgren requested a review from WillAyd as a code owner July 21, 2024 18:11
@matiaslindgren matiaslindgren changed the title [BUG] Hash and compare tuple subclasses as builtin tuples BUG: Hash and compare tuple subclasses as builtin tuples Jul 22, 2024
assert table.get_item(nan2) == 42
with pytest.raises(KeyError, match=None) as error:
table.get_item(other)
assert str(error.value) == str(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead could you use the match argument instead of setting it to None?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line should be removed now that you added the match argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes more sense. I'll fix the other tests too while I'm at it.

@mroeschke mroeschke added Indexing Related to indexing on series/frames, not to indexes themselves Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). labels Jul 22, 2024
@mroeschke mroeschke added this to the 3.0 milestone Jul 22, 2024
@mroeschke mroeschke merged commit b8bc510 into pandas-dev:main Jul 22, 2024
45 checks passed
@mroeschke
Copy link
Member

Thanks @matiaslindgren

@matiaslindgren matiaslindgren deleted the fix-57922 branch July 25, 2024 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: NamedTuples do no match tuples in pandas.Index
2 participants