Skip to content

Multiindex recurse error fix #29260

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jan 1, 2020

Conversation

endremborza
Copy link
Contributor

@endremborza endremborza commented Oct 28, 2019

test and 3 line fix for a small bug mentioned in 2 issues (1 closed). test might be expanded, I only checked for the error disappearing

@WillAyd
Copy link
Member

WillAyd commented Oct 29, 2019

Hmm it doesn't appear a consensus was ever made on this, but the suggestion was to raise. Why do you think this is better than raising?

@endremborza
Copy link
Contributor Author

there are join operations that modify the order of MultiIndex names, like

x=pd.DataFrame(np.arange(4),pd.MultiIndex.from_product([[1,2],[3,4]],names=['a','b']))
x2=pd.DataFrame(np.arange(8),pd.MultiIndex.from_product([[1,2],[5,6],[3,4]],names=['a','c','b']))
x2.join(x, lsuffix='x1')

so I thought it is not such a definitive feature and I couldn't think of a use case where it might be a problem

also, this change doesn't break any of the tests, so there seems to be no agreement on the other side either

anyway, raising a recursion error is definitely not an intended effect, changing it to a more informative error can also be a way to go. I just feel that it is more in line with the general way of python (and pandas) for the api to be more practical, convenient and flexible than restrictive

@@ -3549,8 +3549,12 @@ def _join_multi(self, other, how, return_indexers=True):
ldrop_names = list(self_names - overlap)
rdrop_names = list(other_names - overlap)

self_jnlevels = self.droplevel(ldrop_names)
other_jnlevels = other.droplevel(rdrop_names)
if len(ldrop_names + rdrop_names) == 0: # if only the order differs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than add a special case, is there an issue droplevel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, "issue droplevel"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you are trying to fix here. is there somthing that is actually wrong with droplevel instead of adding a special case here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing is wrong with droplevel, it just gets called twice with empty lists if len(ldrop_names + rdrop_names) == 0 is true, which seems pointless. and then because this doesn't change anything, it ends up in an infinite recursion loop. this can be stopped with checking for this condition, and the reordering of the names makes the join work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, this is not idiomatic, use

if not len(....)
 ....

put he comment on the line above

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a bug fix whatsnew in reshaping in 1.0

midx1 = pd.MultiIndex.from_product([[1, 2], [3, 4]], names=["a", "b"])
midx2 = pd.MultiIndex.from_product([[1, 2], [3, 4]], names=["b", "a"])

midx1.join(midx2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert the results of the join here

@@ -3549,8 +3549,12 @@ def _join_multi(self, other, how, return_indexers=True):
ldrop_names = list(self_names - overlap)
rdrop_names = list(other_names - overlap)

self_jnlevels = self.droplevel(ldrop_names)
other_jnlevels = other.droplevel(rdrop_names)
if len(ldrop_names + rdrop_names) == 0: # if only the order differs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, this is not idiomatic, use

if not len(....)
 ....

put he comment on the line above

@@ -421,6 +421,7 @@ Reshaping
- :func:`qcut` and :func:`cut` now handle boolean input (:issue:`20303`)
- Fix to ensure all int dtypes can be used in :func:`merge_asof` when using a tolerance value. Previously every non-int64 type would raise an erroneous ``MergeError`` (:issue:`28870`).
- Better error message in :func:`get_dummies` when `columns` isn't a list-like value (:issue:`28383`)
- Bug in :meth:`Index.join` that caused infinite recursion error for mismatched multiindex name orders (:issue:`25760`, :issue:`28956`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this more user-facing? I somehow doubt the test case in this PR is something users would do in actual code, so this seems to be hinting at addressing some other use case(s)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

putting a use case into the docs or expanding the tests?

assert midx1.equals(join_idx)
assert midx2.equals(join_idx)
assert lidx is None
tm.assert_numpy_array_equal(ridx, exp_ridx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use tm.assert_index_equal here instead of breaking up into individual arrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can only do this for midx1 as Index.equals is not sensitive for the order of names, but tm.assert_index_equal is.

Anyway, I expanded both the test and the docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not sure I totally follow; can you not set the proper expectation inclusive of the order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mdx does not equal join_idx, so midx2 just remains the same, basically, which is fine for Index.equals, but there is nothing to compare it with in this example using tm.assert_index_equal. That would require a more complicated example. I didn't add any, as the order of the indices is not agreed upon, I just corrected a bug one way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyway, since this I expanded the test a little

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - the last batch of updates look good. Do you know if this works when there are unused levels in the MultiIndex as well? A test case to cover that might be good

assert midx1.equals(join_idx)
assert midx2.equals(join_idx)
assert lidx is None
tm.assert_numpy_array_equal(ridx, exp_ridx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not sure I totally follow; can you not set the proper expectation inclusive of the order?

@jreback
Copy link
Contributor

jreback commented Nov 20, 2019

can you merge master

@jreback jreback added this to the 1.0 milestone Jan 1, 2020
@jreback jreback added the Bug label Jan 1, 2020
@jreback jreback merged commit 8778760 into pandas-dev:master Jan 1, 2020
@jreback
Copy link
Contributor

jreback commented Jan 1, 2020

thanks @endremborza

hweecat pushed a commit to hweecat/pandas that referenced this pull request Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RecursionError when aligning DataFrames based on MultiIndex with different order of names
3 participants