New function to sample from frames (Issue #2419 ) #9666
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,11 +12,12 @@ Highlights include: | |
- Support for a ``CategoricalIndex``, a category based index, see :ref:`here <whatsnew_0161.enhancements.categoricalindex>` | ||
- New section on how-to-contribute to *pandas*, see :ref:`here <contributing>` | ||
|
||
- New method ``sample`` for drawing random samples from Series, DataFrames and Panels. See :ref:`here <whatsnew_0161.enhancements.sample>` | ||
|
||
.. contents:: What's new in v0.16.1 | ||
:local: | ||
:backlinks: none | ||
|
||
|
||
.. _whatsnew_0161.enhancements: | ||
|
||
Enhancements | ||
|
@@ -137,6 +138,47 @@ values NOT in the categories, similarly to how you can reindex ANY pandas index. | |
|
||
See the :ref:`documentation <advanced.categoricalindex>` for more. (:issue:`7629`) | ||
|
||
.. _whatsnew_0161.enhancements.sample: | ||
|
||
Sample | ||
^^^^^^ | ||
|
||
Review comment: needs to be exactly length of the text
||
Series, DataFrames, and Panels now have a new method: :meth:`~pandas.DataFrame.sample`. | ||
The method accepts a specific number of rows or columns to return, or a fraction of the | ||
total number of rows or columns. It also has options for sampling with or without replacement, | ||
for passing in a column for weights for non-uniform sampling, and for setting seed values to facilitate replication. | ||
|
||
.. ipython :: python | ||
|
||
example_series = Series([0,1,2,3,4,5]) | ||
|
||
# When no arguments are passed, returns a single row | ||
example_series.sample() | ||
|
||
# One may specify either a number of rows: | ||
example_series.sample(n=3) | ||
|
||
# Or a fraction of the rows: | ||
example_series.sample(frac=0.5) | ||
|
||
# weights are accepted. | ||
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4] | ||
example_series.sample(n=3, weights=example_weights) | ||
|
||
# weights will also be normalized if they do not sum to one, | ||
# and missing values will be treated as zeros. | ||
example_weights2 = [0.5, 0, 0, 0, None, np.nan] | ||
example_series.sample(n=1, weights=example_weights2) | ||
|
||
|
||
When applied to a DataFrame, one may pass the name of a column to specify sampling weights | ||
when sampling from rows. | ||
|
||
.. ipython :: python | ||
|
||
df = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]}) | ||
df.sample(n=3, weights='weight_column') | ||
|
||
.. _whatsnew_0161.api: | ||
|
||
API changes | ||
|
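The weight handling described in the what's-new entry above (missing values treated as zero, weights rescaled to sum to one, infinite and negative values rejected) can be sketched in plain Python. This is only an illustration, not the code from this PR: `normalize_weights` is a hypothetical helper name, and the zero-sum guard is an assumption that the diff itself does not include.

```python
import math


def normalize_weights(weights, axis_length):
    """Hypothetical helper mirroring the weight rules described above."""
    # Treat None and NaN as missing, and treat missing as zero.
    cleaned = []
    for w in weights:
        if w is None or (isinstance(w, float) and math.isnan(w)):
            cleaned.append(0.0)
        else:
            cleaned.append(float(w))

    if len(cleaned) != axis_length:
        raise ValueError("Weights and axis to be sampled must be of same length")
    if any(math.isinf(w) for w in cleaned):
        raise ValueError("weight vector may not include `inf` values")
    if any(w < 0 for w in cleaned):
        raise ValueError("weight vector may not include negative values")

    total = sum(cleaned)
    if total == 0:
        # Assumption: not part of the diff. Without a guard like this,
        # dividing by a zero total would fail or produce NaN probabilities.
        raise ValueError("weights sum to zero")

    # Rescale so the weights form a probability vector.
    return [w / total for w in cleaned]


print(normalize_weights([0.5, 0, 0, 0, None, float("nan")], 6))
# → [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

With the second weight vector from the what's-new example, all of the probability mass ends up on the first element, so a draw of one item would always return that element.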
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1948,6 +1948,120 @@ def tail(self, n=5): | |
return self | ||
return self.iloc[-n:] | ||
|
||
|
||
def sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None): | ||
""" | ||
Returns a random sample of items from an axis of object. | ||
|
||
Parameters | ||
---------- | ||
n : int, optional | ||
Number of items from axis to return. Cannot be used with `frac`. | ||
Default = 1 if `frac` = None. | ||
frac : float, optional | ||
Fraction of axis items to return. Cannot be used with `n`. | ||
replace : boolean, optional | ||
Sample with or without replacement. Default = False. | ||
Review comment: I usually let function signatures document default values like
||
weights : str or ndarray-like, optional | ||
Default 'None' results in equal probability weighting. | ||
If called on a DataFrame, will accept the name of a column | ||
when axis = 0. | ||
Weights must be same length as axis being sampled. | ||
If weights do not sum to 1, they will be normalized to sum to 1. | ||
Missing values in the weights column will be treated as zero. | ||
inf and -inf values not allowed. | ||
random_state : int or numpy.random.RandomState, optional | ||
Seed for the random number generator (if int), or numpy RandomState | ||
object. | ||
axis : int or string, optional | ||
Axis to sample. Accepts axis number or name. Default is stat axis | ||
for given data type (0 for Series and DataFrames, 1 for Panels). | ||
|
||
Returns | ||
------- | ||
Same type as caller. | ||
""" | ||
|
||
### | ||
# Process axis argument | ||
Review comment: I still think this function would be clearer if you removed all the comments. Yes, really! All your comments are pretty literal restatements of what the code itself does.
Review comment: @nickeubank maybe you can at least remove the
||
### | ||
|
||
if axis is None: | ||
axis = self._stat_axis_number | ||
|
||
axis = self._get_axis_number(axis) | ||
|
||
axis_length = self.shape[axis] | ||
Review comment: you can eliminate some of these comments. Too much verbiage here.
||
|
||
### | ||
# Process random_state argument | ||
### | ||
|
||
rs = com._random_state(random_state) | ||
|
||
### | ||
# Process weights | ||
### | ||
|
||
# Check weights for compliance | ||
if weights is not None: | ||
|
||
# Strings acceptable if a dataframe and axis = 0 | ||
if isinstance(weights, string_types): | ||
if isinstance(self, pd.DataFrame): | ||
if axis == 0: | ||
try: | ||
weights = self[weights] | ||
except KeyError: | ||
raise KeyError("String passed to weights not a valid column") | ||
else: | ||
raise ValueError("Strings can only be passed to weights when sampling from rows on a DataFrame") | ||
else: | ||
raise ValueError("Strings cannot be passed as weights when sampling from a Series or Panel.") | ||
|
||
#normalize format of weights to Series. | ||
weights = pd.Series(weights, dtype='float64') | ||
|
||
if len(weights) != axis_length: | ||
raise ValueError("Weights and axis to be sampled must be of same length") | ||
|
||
if (weights == np.inf).any() or (weights == -np.inf).any(): | ||
raise ValueError("weight vector may not include `inf` values") | ||
|
||
if (weights < 0).any(): | ||
raise ValueError("weight vector may not include negative values") | ||
|
||
# If has nan, set to zero. | ||
weights = weights.fillna(0) | ||
|
||
# Renormalize if don't sum to 1 | ||
if weights.sum() != 1: | ||
weights = weights / weights.sum() | ||
|
||
weights = weights.values | ||
|
||
### | ||
# Process n and frac arguments | ||
### | ||
|
||
# If no frac or n, default to n=1. | ||
if n is None and frac is None: | ||
n = 1 | ||
elif n is not None and frac is None and n % 1 != 0: | ||
raise ValueError("Only integers accepted as `n` values") | ||
elif n is None and frac is not None: | ||
n = int(round(frac * axis_length)) | ||
elif n is not None and frac is not None: | ||
raise ValueError('Please enter a value for `frac` OR `n`, not both') | ||
|
||
# Check for negative sizes | ||
if n < 0: | ||
raise ValueError("A negative number of rows requested. Please provide positive value.") | ||
|
||
locs = rs.choice(axis_length, size=n, replace=replace, p=weights) | ||
return self.take(locs, axis=axis) | ||
|
||
|
||
#---------------------------------------------------------------------- | ||
# Attribute access | ||
|
||
|
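The `n` / `frac` branch of `sample` in the diff above can be read as a small standalone function. The following is a minimal sketch under stated assumptions, not the PR's implementation: `resolve_sample_size` is a hypothetical name, and the error messages are copied from the diff only for illustration.

```python
def resolve_sample_size(axis_length, n=None, frac=None):
    """Hypothetical helper: turn `n` / `frac` into a concrete sample size."""
    if n is not None and frac is not None:
        raise ValueError("Please enter a value for `frac` OR `n`, not both")

    if n is None and frac is None:
        # Neither given: default to a single item.
        n = 1
    elif frac is not None:
        # Fraction of the axis, rounded to the nearest whole item.
        n = int(round(frac * axis_length))
    elif n % 1 != 0:
        raise ValueError("Only integers accepted as `n` values")

    if n < 0:
        raise ValueError("A negative number of rows requested.")
    return int(n)


print(resolve_sample_size(6))            # → 1
print(resolve_sample_size(6, n=3))       # → 3
print(resolve_sample_size(6, frac=0.5))  # → 3
print(resolve_sample_size(5, frac=0.5))  # → 2
```

The last call is worth noting: on Python 3, `round` rounds exact halves to the nearest even integer, so `frac=0.5` on an axis of length 5 yields 2 items rather than 3.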
Can you add highlights description, linking to this section, like: