-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
regression in 0.10.1 with boolean indexing? #2745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
this worked by 'accident' before 0.10.1, see #2686 |
The provided reference is difficult for me to follow. Can you provide a simple example where this would be ambiguous? |
your example is ambiguous, e.g. the rhs side is a 1 element list, you are assigning to 2 elements should numpy by default will take whatever elements that you supply (so if you have 2 on the left but 3 on the right it will take the first 2, with 1 element on the rhs it will cycle them). since you have a series with defined labels on the lhs, the rhs needs to be aligned so that the labels match, in this case you don't have labels, so its impossible to match unambigously (a constant is a special case where all labels from the lhs get the value, an equal length series or ndarray is unambigous, there is a 1-1 match between lhs and rhs) |
sorry....number got mixed up...its PR #2686 |
Thanks for the corrected link, that makes more sense. I would not expect df['test'] = [0] to work after first assignment because of the length mismatch, but in the case where the result of the boolean vector on LHS matches the shape on the RHS it's unambiguous though. |
the alignment happens before the indexing, so it IS ambiguous, as I said, you can simply make the rhs a series and it will work (you example was dtype int, so I changed to floats and it works, (with ints I think this is a bug, cause the reindexing should cast the ints to floats so you can put Nans on the )
|
see issue #2746 |
I suppose that if you provide a list on the rhs that matches the indexed vector then it SHOULD work, but a priori you almost never know (otherwise why would you need to do the boolean indexing?) - e.g. in your example you are explicity using True/False...using this is an expression though |
?
?
In our code we are interested doing multiple, separate operations on a slice that we refer to by using the boolean vector as a variable - ndarray of dtype bool which makes sense in our code. |
yes, this could be tested AFTER the vector is applied what I meant (my language is unclear!) - is that if you have a boolean vector that is already indicative of true/false (e.g. its not a computed vector), then use reindex by that and assign directly to your ndarray, the point of an alignment is so you don't make errors by assigning an unlabeled vector to something, everything always has (or can be converted to something) like a series you can certianly do what you are doing, but seems a lot clearer to make your rhs a series (which semantically is very close to a ndarray), and has the BIG advantage of having labels for the values |
I'm not clear on the internal mechanics, so why isn't it done this way?
can you clarify how? To elaborate on our usage:
in the above, why would having a Series/labels on RHS be an advantage? Thanks for trying to help and explain, perhaps we should move this to the ML ? My biggest concern is the change in behaviour that (to me at least) was not ambiguous and hard to identify in a large code-base. |
what is the ['some extra work will happen here']? here's a psedo example
of course the rhs could be any series (from this df or other), that aligns by labels, that's the key |
when you suggest a Series in RHS to "align by labels", I presume you mean, to have matching/valid index values? ['some extra work will happen here'] in my use case is a list returned from a server operation on the interesting_subset whose only relationship to the DataFrame is positional alignment of the rows, so I can work around the issue, but it still feels like a regression in a case like this where there is no ambiguity from lengths. |
yes, the issue you have is that you are providing a guarantee that the ['some extra work will happen here'] are in exactly the same order and exactly the same length as the indexing array, this is an extremely strong statement; it might be true in your case, but in general this is not. what if you happen to off by 1 or 1 extra value is returned? doesn't it make more sense to have the operation 'figure' it out by aligning by labels? |
of course, but that's extra work than previously required. Thanks for clarifying. |
@changhiskhan or @wesm any comments on this? |
This looks like a bug to me. Marking as such and will try to fix for 0.10.2/0.11 |
closed by #3139 |
this used to work in 0.10 but now fails in 0.10.1:
Now gives:
ValueError: Length of replacements must equal series length
possibly related to closed issue #2703
The text was updated successfully, but these errors were encountered: