Series/DataFrame sample method with/without replacement #2419

wesm · 2012-12-03T21:45:11Z

Should use a more intelligent algorithm than using np.random.permutation


changhiskhan · 2012-12-06T01:36:31Z

Something like Series/DataFrame.sample(ntrials, shape=None, axis=0, replace=True, iterator=False)?

wesm · 2012-12-07T17:14:47Z

Or even just .sample(size, replace=True/False) would be fine. @rkern had a reservoir sampling impl floating around (for efficient sampling w/o replacement), maybe only on the mailing list
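For readers unfamiliar with the term, here is a minimal sketch of reservoir sampling (the classic "Algorithm R"). It is not @rkern's implementation, which is not reproduced in this thread; the function name and signature are made up for illustration, and it uses only the standard library.

```python
import random


def reservoir_sample(iterable, k, seed=None):
    """Draw up to k items uniformly at random, without replacement, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1):
            # pick a slot in [0, i] and keep the item only if the slot is < k.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


print(reservoir_sample(range(1000), 5, seed=0))
```

The appeal is that the input is read once and never has to be held in memory or permuted in full, unlike an approach based on np.random.permutation.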

wesm · 2012-12-07T17:27:59Z

This doesn't need to get done for 0.10

shoyer · 2015-01-21T02:15:01Z

I would like to propose that we should copy the API from dplyr for this method: namely, we should have two methods, sample_n and sample_frac. These methods are especially nice when coupled with groupby.

CC @hayd

TomAugspurger · 2015-01-21T13:15:21Z

Steal all the dplyr!

To keep the number of new methods low, would you favor a single method df.sample(sample_size) where the behavior is like sample_frac if sample_size is a float in (0, 1), and like sample_n if it's a positive integer? There's precedent for this in scikit-learn's train_test_split:

test_size: If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples...

And we can have a with_replacement keyword argument as well. np.random.choice has a default of replace=True.

shoyer · 2015-01-21T19:20:49Z

@TomAugspurger Hmm. I've used train_test_split, but don't like the degeneracy of size = 1. I think Hadley Wickham has the right idea in dplyr with "each function does only one thing, but does it well." So I would prefer two methods with the same prefix. In my opinion, similarly named methods do not cause much more cognitive load than a single method.
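To make the size = 1 degeneracy concrete, here is a small sketch of a single-argument interpretation. The helper resolve_sample_size is hypothetical, and it assumes floats in (0, 1] are read as fractions and positive ints as row counts; it is not how train_test_split or pandas is implemented.

```python
def resolve_sample_size(size, n_rows):
    """Hypothetical helper: turn an int-or-float `size` into a row count."""
    if isinstance(size, float):
        if not 0.0 < size <= 1.0:
            raise ValueError("a float size must be in (0, 1]")
        # Floats are read as a fraction of the rows.
        return int(round(size * n_rows))
    if isinstance(size, int) and size > 0:
        # Positive ints are read as an absolute number of rows.
        return size
    raise ValueError("size must be a positive int or a float in (0, 1]")


# 1 == 1.0 in Python, yet the two calls ask for very different samples.
print(resolve_sample_size(1, 1000))    # one row
print(resolve_sample_size(1.0, 1000))  # every row
```

The result depends on the type of the argument rather than its value, which is easy to get wrong when the size comes out of a computation.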

TomAugspurger · 2015-01-21T19:26:57Z

Good enough for me.


stared · 2015-01-26T23:23:31Z

+1

nickeubank · 2015-03-01T20:55:58Z

I'd be happy to take a look at this in about a week (after a presentation).

How would people feel about an implementation built around a numpy sampling of the index, followed by a .loc[] call, similar to the following (though with the df.sample_n() and .sample_frac() naming suggested above)?

import numpy as np
import pandas as pd

def rand_rows(df, num_rows=5):
    # Sample index labels (with replacement, np.random.choice's default), then select by label.
    subset = np.random.choice(df.index.values, size=num_rows)
    return df.loc[subset]

a_data_frame = pd.DataFrame({'col1': range(10, 20), 'col2': range(20, 30)})
rand_rows(a_data_frame)
rand_rows(a_data_frame, 6)

TomAugspurger · 2015-03-01T21:01:58Z

That sounds fine. You'll also want to accept a seed parameter.

The only wrinkle is how to handle duplicates in the index. If you use .loc you could potentially get more than num_rows rows if a duplicated index label is selected. I think you should use .iloc and make everything position based.
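A small illustration of the duplicate-index problem, using a toy frame made up for this example:

```python
import numpy as np
import pandas as pd

# Toy frame whose index contains the label "a" twice.
df = pd.DataFrame({"col1": [10, 11, 12]}, index=["a", "a", "b"])

# Label-based: asking for the single label "a" returns both rows that carry it.
by_label = df.loc[["a"]]

# Position-based: one sampled position always yields exactly one row.
rs = np.random.RandomState(0)
positions = rs.choice(len(df), size=1, replace=False)
by_position = df.iloc[positions]

print(len(by_label))     # 2
print(len(by_position))  # 1
```

Sampling integer positions keeps the output size equal to the requested size no matter what the index looks like.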

nickeubank · 2015-03-01T21:12:50Z

Sounds great -- I'll get to it next week!

shoyer · 2015-03-01T21:14:56Z

@nickeubank glad you're excited about this! It would be great if you could get this finished :).

Here are the rough versions (mostly untested) that I wrote a few weeks ago:

import numpy as np

def sample_n(df, n, replace=False, weight=None, seed=None):
    """Sample n rows from a DataFrame at random
    """
    rs = np.random.RandomState(seed)
    locs = rs.choice(df.shape[0], size=n, replace=replace, p=weight)
    return df.take(locs, axis=0)

def sample_frac(df, frac, replace=False, weight=None, seed=None):
    """Sample some fraction of a DataFrame at random
    """
    n = int(round(frac * df.shape[0]))
    return sample_n(df, n, replace=replace, weight=weight, seed=seed)

I think these get a couple of things right:

Accepts a random number seed, which is essential for reproducibility.
Samples integers and does position-based indexing. This lets us side-step the complexity of .loc and label-based indexing.
Uses .take, which is actually usually considerably faster than indexing with .iloc.
API borrowed from dplyr

What this needs:

Tests!
Documentation!
Probably should accept a string for the weight argument, which would map to a DataFrame column.

Also, it would be really nice for these methods to work with grouped operations, so you could write something like df.groupby('category').sample_n(100) -> get 100 samples from each category.
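Until something like a grouped sample_n exists, the per-group behavior can be approximated by looping over the groups. This is only a sketch: sample_n_per_group is a hypothetical helper, not a pandas method, and it assumes every group has at least n rows when replace=False.

```python
import numpy as np
import pandas as pd


def sample_n_per_group(df, by, n, replace=False, seed=None):
    """Hypothetical helper: draw n rows at random from each group of df[by]."""
    rs = np.random.RandomState(seed)
    pieces = []
    for _, group in df.groupby(by):
        # Sample positions within the group, then select by position.
        locs = rs.choice(group.shape[0], size=n, replace=replace)
        pieces.append(group.take(locs, axis=0))
    return pd.concat(pieces)


df = pd.DataFrame({"category": list("aaaabbbb"), "value": range(8)})
result = sample_n_per_group(df, "category", n=2, seed=0)
print(result)
```

A single RandomState is shared across groups so that one seed reproduces the whole result.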

nickeubank · 2015-03-01T21:31:47Z

@shoyer Great! looks like this is in great shape. I'll start by building some tests and look into a weight implementation and get back to you, then we can pivot to the groupby once that's done.

Do you have an existing fork I should work on?

shoyer · 2015-03-01T21:39:22Z

@nickeubank Nope, feel free to start from scratch. I needed sample_n for a notebook, but didn't have time to clean it up for a PR.

nickeubank · 2015-03-12T19:25:34Z

Quick poll: I'm inclined to call the function "rand()" and accept both "size" and "size_type = {number, frac}" to accommodate requests for either an exact number of rows or a fraction of rows.

My personal interest in this is mostly for being able to quickly query a random set of rows to examine my data frame, so having "df.rand()" return 5 random rows in a manner analogous to "df.head()" feels more appealing than longer function names like sample_n() or sample_frac().

But I'm open to input -- would people prefer sample_n() and sample_frac()? Or does rand() seem ok?

shoyer · 2015-03-13T06:22:36Z

I am not a fan of df.rand() because it's not clear what rand means in the context of a DataFrame. Sure, it means something random is happening, but rand makes me think of generating random numbers (e.g., with np.random.rand()), not sampling at random.

For me, adding a few characters to the length of the function is not such a big concern, because I'm almost always using auto-complete in IPython, anyways.

I'm afraid I'm also not a fan of returning 5 random rows as the default. That feels like a very arbitrary number to me -- and again, something that would be hard to guess.

TomAugspurger · 2015-03-13T12:30:57Z

I'm also in favor of sample_n and sample_frac. Long method names don't bother me (up to a point). The only trouble is that tab completion doesn't work through method chains.

jorisvandenbossche · 2015-03-13T13:03:37Z

@nickeubank be sure to also check #7274, a closed PR trying to implement this for some inspiration (comments, tests)

I also like sample more than rand. As for whether it should be two functions or one function with two kwargs, I have no real opinion. I'm slightly leaning towards one method, but if the others prefer two, that is OK with me.

nickeubank · 2015-03-13T16:52:21Z

OK, sounds like a consensus in favor of .sample() over .rand().

Like @jorisvandenbossche, I'm inclined to one method with a n_or_frac option, but am open to following @TomAugspurger's suggestion if that's what people prefer.

@shoyer Regarding the default return of five rows, it's a little arbitrary, but is analogous to what head() and tail() provide. And while I realize not everyone will use this for quick data interrogations, I don't see a lot of harm in a default for those who are -- I have trouble imagining a situation in which having a default N would cause problems in analysis.

shoyer · 2015-03-13T17:16:04Z

Like I said before, my main issue with plain sample is that size=1 is degenerate. And unfortunately, getting one sample at random and getting a number of samples equal to the length of the frame (e.g., for bootstrapping) are both common use cases. What's your proposal for this edge case?

nickeubank · 2015-03-13T19:19:58Z

Ah, I see -- you were thinking that if a size value is between 0 and 1, the function infers the user wants a share of rows; if size is an integer greater than 1, the function assumes they want N rows?

I was just going to make it a function option. That gets rid of the corner case. Basically:

def sample(self, size=5, n_or_frac='n', replacement=False, weights=None, seed=None):
    """
    Returns a sample of rows from object. 

    Parameters
    ----------
        size: Number of rows (if n_or_frac = 'n') or 
              share of rows (if n_or_frac = 'frac'). Default 5.
        n_or_frac {'n', 'frac'}: 
              If 'n': return a sample with 'size' number of rows. 
              If 'frac', return 'size' fraction of rows. 
              Default is 'n'. 
        replacement {True, False}: Sample with or without replacement.
        weights: Series or ndarray of weights. Must be same length as index.  
                 Default 'None' results in equal probability weighting.
        seed: seed to be fed to numpy random.RandomState() Function. Default None. 
    """

jorisvandenbossche · 2015-03-13T19:29:00Z

If we would make it one sample function, I think it should have two separate keywords like sample(n=None, frac=None) instead of one keyword controlling what the other does.

But also ok to make two functions of it

jorisvandenbossche · 2015-03-13T19:29:43Z

Also, I would use replace instead of replacement to be consistent with numpy

shoyer · 2015-03-13T19:34:13Z

sample(n=None, frac=None) looks pretty nice to me, actually. I suppose if it's called like df.sample() then we could even default to sampling five rows (not entirely sure that's a good idea, though).

jorisvandenbossche · 2015-03-13T19:36:35Z

And actually df.sample_frac(0.5) is no shorter than df.sample(frac=0.5), and the latter looks a bit nicer to me.

nickeubank · 2015-03-13T20:04:53Z

Ha! Do you think this is the exact conversation that the dplyr developers had?

Sounds like there's a pretty good consensus around 2 functions -- I'll code that up!

shoyer · 2015-03-13T20:07:07Z

Actually, I think @jorisvandenbossche and I are now voting for one function, two arguments :).

nickeubank · 2015-03-13T20:21:24Z

Oh! Misread post on length. :)

OK, so something like the following, with an error thrown if both n and frac values are provided:

    def sample(self, n=5, frac=None, replace=False, weights=None, seed=None):
        """
        Returns a sample of rows from object. 

        Parameters
        ----------
            n: Number of rows to return. Cannot be used with frac.
               Default = 5 if frac = None. 
            frac: share of rows to return. Cannot be used with n. 
            replace {True, False}: Sample with or without replacement.
            weights: Series or ndarray of weights. Must be same length as index.  
                     Default 'None' results in equal probability weighting.
            seed: seed to be fed to numpy random.RandomState() Function. Default None. 
        """

shoyer · 2015-03-13T20:36:07Z

Yes, that looks very close. One thing to note is that you'll need to make n=None in the function signature -- otherwise we can't tell cleanly if n=5 was intentional or merely the default value. This matters because of the alternative frac option.

Also, weights (on DataFrame) should accept a string, which tries to look up the weights from that column of the data frame.
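Putting the points from this exchange together, a rough sketch of the one-function, two-argument design might look like the following. It is written as a free function on a DataFrame for illustration and is not the code from any pull request. Two choices here are assumptions: it raises when neither n nor frac is given (the default was still under discussion), and it normalizes the weights so they sum to 1.

```python
import numpy as np
import pandas as pd


def sample(df, n=None, frac=None, replace=False, weights=None, seed=None):
    """Sketch: sample rows of df by count (n) or by fraction (frac), not both."""
    if n is not None and frac is not None:
        raise ValueError("Pass either n or frac, not both")
    if n is None and frac is None:
        # Assumption: no default size in this sketch.
        raise ValueError("Pass one of n or frac")
    if frac is not None:
        n = int(round(frac * len(df)))

    # A string is looked up as a column of weights.
    if isinstance(weights, str):
        weights = df[weights]
    p = None
    if weights is not None:
        p = np.asarray(weights, dtype=float)
        p = p / p.sum()  # RandomState.choice needs probabilities that sum to 1

    rs = np.random.RandomState(seed)
    locs = rs.choice(len(df), size=n, replace=replace, p=p)
    return df.take(locs, axis=0)


df = pd.DataFrame({"x": range(10), "w": [0] * 9 + [1]})
print(sample(df, n=3, seed=0))
print(sample(df, frac=0.5, seed=0))
print(sample(df, n=1, weights="w", seed=0))  # only the last row has weight
```

Defaulting both n and frac to None is what makes the "both provided" check possible, since an explicit n=5 could not otherwise be told apart from the default.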

nickeubank · 2015-03-13T20:38:52Z

On first point: Great.

On weights: I was coding this into "core/generic.py" so it would also work with Series, and in a Series the string wouldn't mean anything. With that in mind, I thought I'd just ask for a Series in the weights field, and the user could pass df.weightColumn if they had one.

Or do you think we need an if type(self) == pd.core.frame.DataFrame: clause to allow strings if DataFrame?

nickeubank · 2015-03-13T20:51:30Z

Never mind -- I'll just add an "if DataFrame" clause. :)

cpcloud · 2015-03-16T11:23:41Z

Little late to the party here, but I am -1 on passing in a string to weights to mean a column. Why not just accept a single thing--a Series--and it works with both series and frame without having to know what the type of self is. It's also more clear what the meaning is IMO.

shoyer · 2015-03-16T15:36:46Z

Little late to the party here, but I am -1 on passing in a string to weights to mean a column.

I agree this functionality is not essential, but we already use this sort of syntax as a shortcut (e.g., with groupby), so I doubt it will be confusing. The main advantage, from my perspective, is enhanced chain-ability (similar to assign), because you don't need to write the variable for the containing frame again.

nickeubank · 2015-03-16T21:09:55Z

Submitted as pull request #9666. Input welcome!

jreback · 2015-05-01T12:05:09Z

closed by #9666

jreback mentioned this issue Sep 21, 2013

Ideas about random sampling #2282

Closed

hayd mentioned this issue May 29, 2014

ENH add sample #2419 #7274

Closed

jreback modified the milestones: 0.14.1, Someday May 29, 2014

hayd added a commit to hayd/pandas that referenced this issue May 30, 2014

ENH add sample pandas-dev#2419

1bf0d3c

jreback modified the milestones: 0.15.0, 0.14.1 Jun 26, 2014

jreback mentioned this issue Dec 27, 2014

Add shuffling behavior / utility method #9159

Closed

TomAugspurger mentioned this issue Mar 1, 2015

Feature Request: .rand() to call random rows to compliment .head() and .tail() #9569

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

nickeubank mentioned this issue Mar 17, 2015

New function to sample from frames (Issue #2419 ) #9666

Closed

jreback modified the milestones: 0.16.1, Next Major Release Mar 17, 2015

jreback closed this as completed May 1, 2015
