regression in 0.10.1 with boolean indexing? #2745

Closed
ruidc opened this issue Jan 24, 2013 · 19 comments · Fixed by #3139

Comments

@ruidc
Contributor

ruidc commented Jan 24, 2013

this used to work in 0.10 but now fails in 0.10.1:

import pandas
df = pandas.DataFrame(index=[1,2])
df['test'] = [1,2]
df['test'][[True, False]] = [0]

Now gives:
ValueError: Length of replacements must equal series length

possibly related to closed issue #2703

@jreback
Contributor

jreback commented Jan 24, 2013

this worked by 'accident' before 0.10.1, see #2686
this will work if the rhs is a list/ndarray of the same length, a constant expression, or an alignable series
the problem with a list that is not the correct length is that it is ambiguous what should be assigned (e.g. do you cycle the values or not?)

@ruidc
Contributor Author

ruidc commented Jan 24, 2013

The provided reference is difficult for me to follow. Can you provide a simple example where this would be ambiguous?

@jreback
Contributor

jreback commented Jan 24, 2013

your example is ambiguous: the rhs is a 1-element list, but you are assigning to 2 elements

should df['test'] = [0] work? (after your first assignment where 'test' is created)
what if there are 3 elements on the rhs?

numpy by default will take whatever elements you supply (so if you have 2 on the left but 3 on the right it will take the first 2; with 1 element on the rhs it will cycle it).

since you have a series with defined labels on the lhs, the rhs needs to be aligned so that the labels match. in this case you don't have labels, so it's impossible to match unambiguously (a constant is a special case where all labels from the lhs get the value; an equal-length series or ndarray is unambiguous, because there is a 1-1 match between lhs and rhs)
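
Editorial note: the "take the first N, cycle if too short" behaviour described above can be reproduced in a recent NumPy with `np.place`, whereas plain boolean-index assignment rejects a length mismatch. This is a minimal sketch assuming a current NumPy release, not the 2013 version under discussion; the array and mask values are made up for illustration.

```python
import numpy as np

arr = np.array([10, 20, 30, 40])
mask = np.array([True, False, True, False])  # two True entries

# np.place uses the first N values, where N is the number of True entries.
a = arr.copy()
np.place(a, mask, [1, 2, 3])   # the third value is ignored

# With fewer values than True entries, np.place repeats them.
b = arr.copy()
np.place(b, mask, [7])

# Plain boolean-index assignment is stricter: 3 values for 2 slots is an error.
c = arr.copy()
try:
    c[mask] = [1, 2, 3]
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(a.tolist(), b.tolist(), mismatch_raised)
```

Either way, the values are matched purely by position, which is the ambiguity the comment above is pointing at.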

@jreback
Contributor

jreback commented Jan 24, 2013

sorry... the number got mixed up, it's PR #2686

@ruidc
Contributor Author

ruidc commented Jan 24, 2013

Thanks for the corrected link, that makes more sense.

I would not expect df['test'] = [0] to work after the first assignment because of the length mismatch, but where the selection made by the boolean vector on the LHS matches the shape of the RHS, it is unambiguous.
I could understand the error if the lengths were different.

@jreback
Contributor

jreback commented Jan 24, 2013

the alignment happens before the indexing, so it IS ambiguous. as I said, you can simply make the rhs a series and it will work (your example was dtype int, so I changed it to floats and it works; with ints I think this is a bug, because the reindexing should cast the ints to floats so that NaNs can be stored)

df['test'] = [1.,2.]
df['test'][[True,False]] = pd.Series([0.],index=[1])
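
Editorial note: the snippet above relies on chained indexing as it worked in 2013. A sketch of the same label-aligned assignment for a recent pandas, assuming `.loc` is used as the single indexing step (that spelling is an assumption of this note, not something the thread prescribes):

```python
import pandas as pd

df = pd.DataFrame({"test": [1.0, 2.0]}, index=[1, 2])

# The rhs is a Series, so it is aligned on the label 1 rather than by position.
df.loc[[True, False], "test"] = pd.Series([0.0], index=[1])

print(df["test"].tolist())
```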

@jreback
Contributor

jreback commented Jan 24, 2013

see issue #2746

@jreback
Contributor

jreback commented Jan 24, 2013

I suppose that if you provide a list on the rhs that matches the indexed vector then it SHOULD work, but a priori you almost never know (otherwise why would you need to do the boolean indexing?) - e.g. in your example you are explicitly using True/False...using this is an expression though

@ruidc
Contributor Author

ruidc commented Jan 24, 2013

but a priori you almost never know

?
isn't it just a matter of testing the length AFTER the vector is applied?

using this is an expression though

?
how so?

otherwise why would you need to do the boolean indexing

In our code we are interested in doing multiple, separate operations on a slice that we refer to by using the boolean vector as a variable (an ndarray of dtype bool), which makes sense in our code.

@jreback
Contributor

jreback commented Jan 24, 2013

yes, this could be tested AFTER the vector is applied

what I meant (my language is unclear!) is that if you have a boolean vector that is already indicative of true/false (e.g. it's not a computed vector), then use reindex by that and assign directly to your ndarray. the point of an alignment is so you don't make errors by assigning an unlabeled vector to something; everything always has (or can be converted to) something like a series

you can certainly do what you are doing, but it seems a lot clearer to make your rhs a series (which semantically is very close to a ndarray), and has the BIG advantage of having labels for the values

@ruidc
Contributor Author

ruidc commented Jan 24, 2013

yes, this could be tested AFTER the vector is applied

I'm not clear on the internal mechanics, so why isn't it done this way?

use reindex by that and assign directly to your ndarray

can you clarify how? To elaborate on our usage:

import numpy
import pandas
df = pandas.DataFrame([1, 2, 3], index=[0, 1, 2], columns=['test'], dtype=object)
interesting_subset = numpy.greater(df['test'], 1)
df['test'][interesting_subset] = ['some extra work will happen here']

and has the BIG advantage of having labels for the values

in the above, why would having a Series/labels on the RHS be an advantage? Thanks for trying to help and explain; perhaps we should move this to the ML? My biggest concern is that behaviour which (to me at least) was not ambiguous has changed, and that the change is hard to identify in a large code-base.

@jreback
Contributor

jreback commented Jan 24, 2013

what is the ['some extra work will happen here']?

here's a pseudo example

mask = df['test'] > 1
df['test'][mask] = df['test'] + 5

of course the rhs could be any series (from this df or other), that aligns by labels, that's the key
it makes it so you don't have to worry about sub-setting the rhs at all
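
Editorial note: a runnable sketch of the pseudo example above, assuming a recent pandas and `.loc` in place of chained indexing. The second assignment uses a deliberately shuffled Series (the values 30 and 20 are made up) to show that matching is by label, not by position.

```python
import pandas as pd

df = pd.DataFrame({"test": [1, 2, 3]}, index=[0, 1, 2])
mask = df["test"] > 1

# The rhs covers every row; only the rows selected by the mask are written.
df.loc[mask, "test"] = df["test"] + 5
after_full_rhs = df["test"].tolist()

# A rhs in a different order still lands on the right rows.
df.loc[mask, "test"] = pd.Series([30, 20], index=[2, 1])
after_shuffled_rhs = df["test"].tolist()

print(after_full_rhs, after_shuffled_rhs)
```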

@ruidc
Contributor Author

ruidc commented Jan 25, 2013

when you suggest a Series in RHS to "align by labels", I presume you mean, to have matching/valid index values?

['some extra work will happen here'] in my use case is a list returned from a server operation on the interesting_subset whose only relationship to the DataFrame is positional alignment of the rows, so I can work around the issue, but it still feels like a regression in a case like this where there is no ambiguity from lengths.
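
Editorial note: one way to express the positional workaround mentioned above is to check the length after the mask is applied and then attach the selected labels to the values. This is only a sketch for a recent pandas; `assign_by_position`, the column names, and the sample values are hypothetical and do not come from the thread or from pandas itself.

```python
import numpy as np
import pandas as pd


def assign_by_position(df, mask, column, values):
    """Hypothetical helper: write `values` to the masked rows, in order."""
    mask = np.asarray(mask, dtype=bool)
    selected = df.index[mask]
    values = list(values)
    # Test the length AFTER the mask is applied, so a mismatch fails loudly.
    if len(values) != len(selected):
        raise ValueError(
            f"expected {len(selected)} values for the selected rows, got {len(values)}"
        )
    # Label the values with the selected index so the assignment aligns by label.
    df.loc[mask, column] = pd.Series(values, index=selected)


df = pd.DataFrame({"value": [1, 2, 3], "score": [0.0, 0.0, 0.0]})
mask = df["value"] > 1

server_results = [2.5, 3.5]  # stand-in for the list returned by the server
assign_by_position(df, mask, "score", server_results)
print(df["score"].tolist())
```

The explicit length check turns an off-by-one result from the server into an error rather than a silent misassignment.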

@jreback
Contributor

jreback commented Jan 25, 2013

yes, the issue you have is that you are providing a guarantee that the ['some extra work will happen here'] values are in exactly the same order and of exactly the same length as the indexing array. this is an extremely strong statement; it might be true in your case, but in general it is not. what if you happen to be off by 1, or 1 extra value is returned? doesn't it make more sense to have the operation 'figure' it out by aligning by labels?

@ruidc
Contributor Author

ruidc commented Jan 25, 2013

of course, but that's more work than was previously required. Thanks for clarifying.

@jreback
Contributor

jreback commented Jan 25, 2013

@changhiskhan or @wesm any comments on this?

@wesm
Member

wesm commented Feb 7, 2013

This looks like a bug to me. Marking as such and will try to fix for 0.10.2/0.11

@jreback
Contributor

jreback commented Mar 22, 2013

closed by #3139

@jreback
Contributor

jreback commented Apr 2, 2013

@ruidc FYI #3236 fixes the more general issue of what you were doing here
