Series/DataFrame sample method with/without replacement #2419
Comments
Something like |
Or even just |
This doesn't need to get done for 0.10 |
I would like to propose that we should copy the API from dplyr for this method: namely, we should have two methods, CC @hayd |
Steal all the dplyr! To keep the number of new methods low, would you favor a single method
And we can have a |
@TomAugspurger Hmm. I've used |
Good enough for me. On Wed, Jan 21, 2015 at 1:20 PM, Stephan Hoyer wrote:
|
+1 |
I'd be happy to take a look at this in about a week (after a presentation). How would people feel about an implementation built around a numpy sampling of the index, followed by a .loc[] call, similar (though with the suggested
|
That sounds fine. You'll also want to accept a The only wrinkle is how to handle duplicates in the index. If you use |
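To illustrate the duplicate-index wrinkle, here is a minimal sketch on a toy frame (the data and the "drawn" labels are hypothetical, chosen only for illustration). Looking up sampled labels with `.loc[]` returns every row that shares a label, so the result can be larger than requested, while sampling integer positions returns exactly the requested number of rows.

```python
import numpy as np
import pandas as pd

# Toy frame whose index contains the label "a" twice (hypothetical data).
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "a", "b", "c"])

# Label-based lookup: suppose the sampler drew the two labels "a" and "b".
# .loc returns *all* rows carrying those labels, so we get 3 rows, not 2.
by_label = df.loc[["a", "b"]]

# Position-based lookup: draw 2 row positions and take them.
rs = np.random.RandomState(0)
locs = rs.choice(len(df), size=2, replace=False)
by_position = df.take(locs)

print(len(by_label), len(by_position))
```

Sampling positions rather than labels sidesteps the question of what a duplicated label should mean.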
Sounds great -- I'll get to it next week! |
@nickeubank glad you're excited about this! It would be great if you could get this finished :). Here are the rough versions (mostly untested) that I wrote a few weeks ago:

    import numpy as np

    def sample_n(df, n, replace=False, weight=None, seed=None):
        """Sample n rows from a DataFrame at random."""
        rs = np.random.RandomState(seed)
        # Draw integer row positions, so duplicate index labels are not an issue.
        locs = rs.choice(df.shape[0], size=n, replace=replace, p=weight)
        return df.take(locs, axis=0)

    def sample_frac(df, frac, replace=False, weight=None, seed=None):
        """Sample some fraction of a DataFrame at random."""
        # Convert the fraction into a whole number of rows.
        n = int(round(frac * df.shape[0]))
        return sample_n(df, n, replace=replace, weight=weight, seed=seed)

I think these get a couple of things right:
What this needs:
Also, it would be really nice for these methods to work with grouped operations, so you could write something like |
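One way the grouped case could look is sketched below. This is a rough illustration under assumptions, not an existing pandas API: `sample_n_by_group` is a hypothetical helper, and `sample_n` is a simplified variant of the draft above that takes a shared `RandomState` so that each group gets an independent draw.

```python
import numpy as np
import pandas as pd

def sample_n(df, n, replace=False, rs=None):
    """Sample n rows by position (simplified variant of the draft above)."""
    rs = rs if rs is not None else np.random.RandomState()
    locs = rs.choice(df.shape[0], size=n, replace=replace)
    return df.take(locs, axis=0)

def sample_n_by_group(df, by, n, replace=False, seed=None):
    """Hypothetical helper: sample n rows from every group of `by`."""
    # One shared random state, so groups do not all reuse the same draw.
    rs = np.random.RandomState(seed)
    pieces = [sample_n(group, n, replace=replace, rs=rs)
              for _, group in df.groupby(by)]
    return pd.concat(pieces)

df = pd.DataFrame({"g": ["a"] * 3 + ["b"] * 4, "x": range(7)})
out = sample_n_by_group(df, "g", n=2, seed=0)
print(out)
```

Without replacement, this raises if any group has fewer than `n` rows, which a real implementation would need to decide how to handle.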
@shoyer Great! looks like this is in great shape. I'll start by building some tests and look into a weight implementation and get back to you, then we can pivot to the groupby once that's done. Do you have an existing fork I should work on? |
@nickeubank Nope, feel free to start from scratch. I needed |
Quick poll: I'm inclined to call the function "rand()" and accept both "size" and "size_type = {number, frac}" to accommodate both requests for an exact number of rows and a fraction of rows. My personal interest in this is mostly for being able to quickly query a random set of rows to examine my data frame, so having "df.rand()" return 5 random rows in a manner analogous to "df.head()" feels more appealing than longer function names like sample_n() or sample_frac(). But I'm open to input -- would people prefer sample_n() and sample_frac()? Or does rand() seem ok? |
I am not a fan of For me, adding a few characters to the length of the function is not such a big concern, because I'm almost always using auto-complete in IPython, anyways. I'm afraid I'm also not a fan of returning 5 random rows as the default. That feels like a very arbitrary number to me -- and again, something that would be hard to guess. |
I'm also in favor of |
@nickeubank be sure to also check #7274, a closed PR trying to implement this for some inspiration (comments, tests) I also like |
OK, sounds like a consensus in favor of like @jorisvandenbossche, I'm inclined to one method with a @shoyer Regarding the default return of five rows, it's a little arbitrary, but is analogous to what |
Like I said before, my main issue with plain |
Ah, I see -- you were thinking that if a size value is between 0 and 1, the function infers the user wants a share of rows; if size is an integer greater than 1, the function assumes they want N rows? I was just going to make it a function option. That gets rid of the corner case. Basically:
|
If we would make it one But also ok to make two functions of it |
Also, I would use |
|
and actually |
Ha! Do you think this is the exact conversation that the dplyr developers had? Sounds like there's a pretty good consensus around 2 functions -- I'll code that up! |
Actually, I think @jorisvandenbossche and I are now voting for one function, two arguments :). |
Oh! Misread post on length. :) OK, so something like the following, with an error thrown if both n and frac values are provided:
|
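The "one function, two arguments" shape discussed above might be sketched as follows. This is an illustrative stand-in, not the code from the eventual pull request: the free function `sample` and its exact error messages are assumptions, and it deliberately requires exactly one of `n` or `frac` rather than picking a default row count, since that default was still under debate.

```python
import numpy as np
import pandas as pd

def sample(obj, n=None, frac=None, replace=False, seed=None):
    """Hypothetical sketch: sample rows from a Series or DataFrame.

    Exactly one of `n` (row count) or `frac` (share of rows) must be given,
    which avoids guessing whether a value like 1 means one row or 100%.
    """
    if n is not None and frac is not None:
        raise ValueError("provide either n or frac, not both")
    if n is None and frac is None:
        raise ValueError("provide one of n or frac")
    if frac is not None:
        n = int(round(frac * len(obj)))
    rs = np.random.RandomState(seed)
    locs = rs.choice(len(obj), size=n, replace=replace)
    return obj.take(locs)

df = pd.DataFrame({"x": range(10)})
three_rows = sample(df, n=3, seed=0)
half_rows = sample(df, frac=0.5, seed=0)
two_values = sample(df["x"], n=2, seed=0)
print(len(three_rows), len(half_rows), len(two_values))
```

Because the lookup is positional, the same body works for both Series and DataFrame.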
Yes, that looks very close. One thing to note is that you'll need to make Also, |
On first point: Great. On weights: I was coding this into "core/generic.py" so it would also work with Series, and in a Series the string wouldn't mean anything. With that in mind, I thought I'd just ask for a Series in the Or do you think we need an |
Never mind -- I'll just add an "if dataframe" clause. :) |
Little late to the party here, but I am -1 on passing in a string to weights to mean a column. Why not just accept a single thing--a Series--and it works with both series and frame without having to know what the type of self is. It's also more clear what the meaning is IMO. |
I agree this functionality is not essential, but we already use this sort of syntax as a shortcut (e.g., with |
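The two positions on weights can be reconciled in a small helper, sketched here under assumptions: `resolve_weights` is a hypothetical name, and the normalization and validation rules are illustrative choices rather than anything settled in this thread. A string is treated as a column name only when the object is a DataFrame; anything else is taken as array-like weights.

```python
import numpy as np
import pandas as pd

def resolve_weights(obj, weights):
    """Hypothetical helper: turn `weights` into row probabilities for `obj`."""
    if isinstance(weights, str):
        # The "if dataframe" clause: a column name has no meaning for a Series.
        if not isinstance(obj, pd.DataFrame):
            raise ValueError("string weights are only valid for a DataFrame")
        weights = obj[weights]
    w = np.asarray(weights, dtype="float64")
    if len(w) != len(obj):
        raise ValueError("weights must have one entry per row")
    if (w < 0).any():
        raise ValueError("weights must be non-negative")
    # Normalize so the result can be passed as `p` to RandomState.choice.
    return w / w.sum()

df = pd.DataFrame({"x": [1, 2, 3], "w": [1, 1, 2]})
p_from_name = resolve_weights(df, "w")
p_from_series = resolve_weights(df, df["w"])
print(p_from_name, p_from_series)
```

Either spelling yields the same probabilities, so the string form is purely a shortcut.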
Submitted as pull request #9666. Input welcome! |
closed by #9666 |
Should use a more intelligent algorithm than using
np.random.permutation