
Commit e22ede3
Parent: 40fb1bf
Author: Chang She

DOC: groupby.transform examples

2 files changed: +89 −31 lines changed

doc/source/groupby.rst

Lines changed: 79 additions & 17 deletions
@@ -9,6 +9,8 @@
    from pandas import *
    randn = np.random.randn
    np.set_printoptions(precision=4, suppress=True)
+   import matplotlib.pyplot as plt
+   plt.close('all')

 *****************************
 Group By: split-apply-combine
@@ -283,14 +285,6 @@ the ``aggregate`` or equivalently ``agg`` method:
    grouped = df.groupby(['A', 'B'])
    grouped.aggregate(np.sum)

-Another simple example is to compute the size of each group. This is included
-in GroupBy as the ``size`` method. It returns a Series whose index are the
-group names and whose values are the sizes of each group.
-
-.. ipython:: python
-
-   grouped.size()
-
 As you can see, the result of the aggregation will have the group names as the
 new index along the grouped axis. In the case of multiple keys, the result is a
 :ref:`MultiIndex <indexing.hierarchical>` by default, though this can be
@@ -310,6 +304,14 @@ same result as the column names are stored in the resulting ``MultiIndex``:

    df.groupby(['A', 'B']).sum().reset_index()

+Another simple aggregation example is to compute the size of each group.
+This is included in GroupBy as the ``size`` method. It returns a Series whose
+index are the group names and whose values are the sizes of each group.
+
+.. ipython:: python
+
+   grouped.size()
+

 .. _groupby.aggregate.multifunc:

@@ -385,29 +387,86 @@ Transformation
 The ``transform`` method returns an object that is indexed the same (same size)
 as the one being grouped. Thus, the passed transform function should return a
 result that is the same size as the group chunk. For example, suppose we wished
-to standardize a data set within a group:
+to standardize the data within each group:

 .. ipython:: python

-   tsdf = DataFrame(randn(1000, 3),
-                    index=DateRange('1/1/2000', periods=1000),
-                    columns=['A', 'B', 'C'])
-   tsdf
+   index = date_range('10/1/1999', periods=1100)
+   ts = Series(np.random.normal(0.5, 2, 1100), index)
+   ts = rolling_mean(ts, 100, 100).dropna()

+   ts.head()
+   ts.tail()
+   key = lambda x: x.year
    zscore = lambda x: (x - x.mean()) / x.std()
-   transformed = tsdf.groupby(lambda x: x.year).transform(zscore)
+   transformed = ts.groupby(key).transform(zscore)

 We would expect the result to now have mean 0 and standard deviation 1 within
 each group, which we can easily check:

 .. ipython:: python

-   grouped = transformed.groupby(lambda x: x.year)
-
-   # OK, close enough to zero
+   # Original Data
+   grouped = ts.groupby(key)
    grouped.mean()
    grouped.std()

+   # Transformed Data
+   grouped_trans = transformed.groupby(key)
+   grouped_trans.mean()
+   grouped_trans.std()
+
+We can also visually compare the original and transformed data sets.
+
+.. ipython:: python
+
+   compare = DataFrame({'Original': ts, 'Transformed': transformed})
+
+   @savefig groupby_transform_plot.png width=4in
+   compare.plot()
+
+Another common data transform is to replace missing data with the group mean.
+
+.. ipython:: python
+   :suppress:
+
+   cols = ['A', 'B', 'C']
+   values = randn(1000, 3)
+   values[np.random.randint(0, 1000, 100), 0] = np.nan
+   values[np.random.randint(0, 1000, 50), 1] = np.nan
+   values[np.random.randint(0, 1000, 200), 2] = np.nan
+   data_df = DataFrame(values, columns=cols)
+
+.. ipython:: python
+
+   data_df
+
+   countries = np.array(['US', 'UK', 'GR', 'JP'])
+   key = countries[np.random.randint(0, 4, 1000)]
+
+   grouped = data_df.groupby(key)
+
+   # Non-NA count in each group
+   grouped.count()
+
+   f = lambda x: x.fillna(x.mean())
+
+   transformed = grouped.transform(f)
+
+We can verify that the group means have not changed in the transformed data
+and that the transformed data contains no NAs.
+
+.. ipython:: python
+
+   grouped_trans = transformed.groupby(key)
+
+   grouped.mean() # original group means
+   grouped_trans.mean() # transformation did not change group means
+
+   grouped.count() # original has some missing data points
+   grouped_trans.count() # counts after transformation
+   grouped_trans.size() # Verify non-NA count equals group size
+
 .. _groupby.dispatch:

 Dispatching to instance methods
@@ -439,6 +498,9 @@ next). This enables some operations to be carried out rather succinctly:

 .. ipython:: python

+   tsdf = DataFrame(randn(1000, 3),
+                    index=DateRange('1/1/2000', periods=1000),
+                    columns=['A', 'B', 'C'])
    tsdf.ix[::2] = np.nan
    grouped = tsdf.groupby(lambda x: x.year)
    grouped.fillna(method='pad')
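
Note: the examples added in this commit target the pandas of its era; ``DateRange``, ``rolling_mean`` and ``.ix`` have since been removed from pandas. For readers following along on a current release, here is a minimal standalone sketch of the same two ``transform`` patterns, standardizing within groups and filling missing values with the group mean. The variable names mirror the documentation example; the modern API calls (``date_range``, ``Series.rolling``) are my substitution, not part of the commit:

    # A minimal modern-pandas sketch of the two transform patterns shown above.
    # Assumptions: a current pandas/numpy; data is random, so output will vary.
    import numpy as np
    import pandas as pd

    # Pattern 1: standardize (z-score) within each yearly group
    index = pd.date_range('10/1/1999', periods=1100)
    ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
    ts = ts.rolling(window=100, min_periods=100).mean().dropna()

    key = lambda x: x.year                      # group by the year of each timestamp
    zscore = lambda x: (x - x.mean()) / x.std()
    standardized = ts.groupby(key).transform(zscore)

    print(standardized.groupby(key).mean())     # ~0 within each year
    print(standardized.groupby(key).std())      # ~1 within each year

    # Pattern 2: fill missing values with the group mean
    values = np.random.randn(1000, 3)
    values[np.random.randint(0, 1000, 100), 0] = np.nan
    data_df = pd.DataFrame(values, columns=['A', 'B', 'C'])

    countries = np.array(['US', 'UK', 'GR', 'JP'])
    group_key = countries[np.random.randint(0, 4, 1000)]

    filled = data_df.groupby(group_key).transform(lambda x: x.fillna(x.mean()))

    print(data_df.groupby(group_key).mean())    # group means before filling
    print(filled.groupby(group_key).mean())     # unchanged after filling
    print(filled.isna().sum())                  # expect all zeros

Either way, the key property of ``transform`` is unchanged: the result is indexed like the original object, so it can be compared element-wise with, or assigned straight back onto, the input.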

doc/source/indexing.rst

Lines changed: 10 additions & 14 deletions
@@ -246,11 +246,10 @@ index positions.

    positions = [0, 9, 3]

-   index.ix[positions]
+   index[positions]
    index.take(positions)

    ser = Series(randn(10))
-   ser

    ser.ix[positions]
    ser.take(positions)
@@ -260,31 +259,28 @@ row or column positions.

 .. ipython:: python

-   df = DataFrame(randn(5, 3))
-   df
+   frm = DataFrame(randn(5, 3))

-   df.take([0, 2])
+   frm.take([1, 4, 3])

-   df.take([1, 4, 6], axis=1)
+   frm.take([0, 2], axis=1)

-Like ndarray, the ``take`` method on pandas objects are not intended
-to work on boolean indices and may return unexpected results.
+It is important to note that the ``take`` method on pandas objects are not
+intended to work on boolean indices and may return unexpected results.

 .. ipython:: python

    arr = randn(10)
-   arr
-   arr.take([False, True])
+   arr.take([False, False, True, True])
    arr[[0, 1]]

    ser = Series(randn(10))
-   ser
-   ser.take([False, True])
+   ser.take([False, False, True, True])
    ser.ix[[0, 1]]

 Finally, as a small note on performance, because the ``take`` method handles
-more a narrower range of inputs, it is more optimized internally in numpy
-and thus offers performance that is a good deal faster than indexing.
+a narrower range of inputs, it can offer performance that is a good deal
+faster than fancy indexing.

 .. ipython::
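
The indexing.rst changes are mostly tightened wording and examples, but the behavioural point still holds today. A brief sketch under the assumption of a current pandas/numpy, where ``.ix`` no longer exists so ``[]`` and ``.iloc`` are used instead; variable names follow the example above:

    # take() expects integer positions, while boolean masks belong in
    # regular [] indexing -- mixing the two gives surprising results.
    import numpy as np
    import pandas as pd

    positions = [0, 9, 3]

    arr = np.random.randn(10)
    ser = pd.Series(np.random.randn(10))
    frm = pd.DataFrame(np.random.randn(5, 3))

    # Positional selection: take() mirrors integer-location indexing
    print(arr.take(positions))                              # same elements as arr[positions]
    print(ser.take(positions).equals(ser.iloc[positions]))  # True
    print(frm.take([1, 4, 3]))                              # rows 1, 4 and 3
    print(frm.take([0, 2], axis=1))                         # columns 0 and 2

    # Boolean selection: use a mask with [], not take()
    mask = arr > 0
    print(arr[mask])
    # arr.take([False, False, True, True]) would be read as positions
    # 0, 0, 1, 1 (or rejected outright on newer versions), which is exactly
    # the surprise the documentation warns about.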
