from pandas import *
randn = np.random.randn
np.set_printoptions(precision=4, suppress=True)
+ import matplotlib.pyplot as plt
+ plt.close('all')

*****************************
Group By: split-apply-combine
@@ -283,14 +285,6 @@ the ``aggregate`` or equivalently ``agg`` method:
   grouped = df.groupby(['A', 'B'])
   grouped.aggregate(np.sum)
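A self-contained sketch of the multi-key aggregation above, under the modern `import pandas as pd` convention rather than the `from pandas import *` style this document uses; the frame and its columns are hypothetical, and the string alias `'sum'` stands in for `np.sum`, which newer pandas discourages passing directly:

```python
import pandas as pd

# Hypothetical frame with two key columns and two value columns
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'two'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [10.0, 20.0, 30.0, 40.0]})

# agg is an alias for aggregate; 'sum' names the built-in reduction
result = df.groupby(['A', 'B']).agg('sum')

# the group keys become a two-level MultiIndex on the result
print(result.index.nlevels)
```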

- Another simple example is to compute the size of each group. This is included
- in GroupBy as the ``size`` method. It returns a Series whose index are the
- group names and whose values are the sizes of each group.
-
- .. ipython:: python
-
-    grouped.size()
-
As you can see, the result of the aggregation will have the group names as the
new index along the grouped axis. In the case of multiple keys, the result is a
:ref:`MultiIndex <indexing.hierarchical>` by default, though this can be
@@ -310,6 +304,14 @@ same result as the column names are stored in the resulting ``MultiIndex``:
   df.groupby(['A', 'B']).sum().reset_index()

+ Another simple aggregation example is to compute the size of each group.
+ This is included in GroupBy as the ``size`` method. It returns a Series whose
+ index are the group names and whose values are the sizes of each group.
+
+ .. ipython:: python
+
+    grouped.size()
+
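A minimal runnable sketch of ``size`` under the modern import convention; the data here is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'one', 'two', 'two', 'one']})

# size() counts the rows in each group and returns a Series
# whose index holds the group names
sizes = df.groupby(['A', 'B']).size()
print(sizes)
```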

.. _groupby.aggregate.multifunc:

@@ -385,29 +387,86 @@ Transformation
The ``transform`` method returns an object that is indexed the same (same size)
as the one being grouped. Thus, the passed transform function should return a
result that is the same size as the group chunk. For example, suppose we wished
- to standardize a data set within a group:
+ to standardize the data within each group:

.. ipython:: python

-    tsdf = DataFrame(randn(1000, 3),
-                     index=DateRange('1/1/2000', periods=1000),
-                     columns=['A', 'B', 'C'])
-    tsdf
+    index = date_range('10/1/1999', periods=1100)
+    ts = Series(np.random.normal(0.5, 2, 1100), index)
+    ts = rolling_mean(ts, 100, 100).dropna()
+
+    ts.head()
+    ts.tail()
+    key = lambda x: x.year

    zscore = lambda x: (x - x.mean()) / x.std()
-    transformed = tsdf.groupby(lambda x: x.year).transform(zscore)
+    transformed = ts.groupby(key).transform(zscore)

We would expect the result to now have mean 0 and standard deviation 1 within
each group, which we can easily check:

.. ipython:: python

-    grouped = transformed.groupby(lambda x: x.year)
-
-    # OK, close enough to zero
+    # Original Data
+    grouped = ts.groupby(key)
    grouped.mean()
    grouped.std()

+    # Transformed Data
+    grouped_trans = transformed.groupby(key)
+    grouped_trans.mean()
+    grouped_trans.std()
+
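The same group standardization can be reproduced end to end with current pandas, where `rolling_mean` and `DateRange` from this diff's era no longer exist; `Series.rolling(...).mean()` and `pd.date_range` are the current spellings, so this is a sketch rather than the document's exact code:

```python
import numpy as np
import pandas as pd

# rolling_mean() is gone in modern pandas; Series.rolling is the
# current equivalent of the smoothing step in the example above
index = pd.date_range('10/1/1999', periods=1100)
ts = pd.Series(np.random.normal(0.5, 2, 1100), index=index)
ts = ts.rolling(window=100, min_periods=100).mean().dropna()

key = lambda x: x.year
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

# each yearly group should now have mean 0 and std 1
check = transformed.groupby(key).agg(['mean', 'std'])
print(check)
```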
+ We can also visually compare the original and transformed data sets.
+
+ .. ipython:: python
+
+    compare = DataFrame({'Original': ts, 'Transformed': transformed})
+
+    @savefig groupby_transform_plot.png width=4in
+    compare.plot()
+
+ Another common data transform is to replace missing data with the group mean.
+
+ .. ipython:: python
+    :suppress:
+
+    cols = ['A', 'B', 'C']
+    values = randn(1000, 3)
+    values[np.random.randint(0, 1000, 100), 0] = np.nan
+    values[np.random.randint(0, 1000, 50), 1] = np.nan
+    values[np.random.randint(0, 1000, 200), 2] = np.nan
+    data_df = DataFrame(values, columns=cols)
+
+ .. ipython:: python
+
+    data_df
+
+    countries = np.array(['US', 'UK', 'GR', 'JP'])
+    key = countries[np.random.randint(0, 4, 1000)]
+
+    grouped = data_df.groupby(key)
+
+    # Non-NA count in each group
+    grouped.count()
+
+    f = lambda x: x.fillna(x.mean())
+
+    transformed = grouped.transform(f)
+
+ We can verify that the group means have not changed in the transformed data
+ and that the transformed data contains no NAs.
+
+ .. ipython:: python
+
+    grouped_trans = transformed.groupby(key)
+
+    grouped.mean()         # original group means
+    grouped_trans.mean()   # transformation did not change group means
+
+    grouped.count()        # original has some missing data points
+    grouped_trans.count()  # counts after transformation
+    grouped_trans.size()   # Verify non-NA count equals group size
+
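The fill-with-group-mean recipe can be sketched as a self-contained script with the modern NumPy `Generator` API; the seed and the exact NaN counts are illustrative choices, not part of the original example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for reproducibility

values = rng.standard_normal((1000, 3))
values[rng.integers(0, 1000, 100), 0] = np.nan
values[rng.integers(0, 1000, 50), 1] = np.nan
data_df = pd.DataFrame(values, columns=['A', 'B', 'C'])

countries = np.array(['US', 'UK', 'GR', 'JP'])
key = countries[rng.integers(0, 4, 1000)]

grouped = data_df.groupby(key)
# each group's NaNs are replaced by that group's own column mean
filled = grouped.transform(lambda x: x.fillna(x.mean()))

print(filled.isna().sum().sum())
```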

.. _groupby.dispatch:

Dispatching to instance methods
@@ -439,6 +498,9 @@ next). This enables some operations to be carried out rather succinctly:

.. ipython:: python

+    tsdf = DataFrame(randn(1000, 3),
+                     index=DateRange('1/1/2000', periods=1000),
+                     columns=['A', 'B', 'C'])
    tsdf.ix[::2] = np.nan
    grouped = tsdf.groupby(lambda x: x.year)
    grouped.fillna(method='pad')
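In current pandas, `.ix` is removed and `fillna(method='pad')` is deprecated; `iloc` and `GroupBy.ffill` are the current spellings of this dispatch example, so the equivalent sketch today would be:

```python
import numpy as np
import pandas as pd

tsdf = pd.DataFrame(np.random.randn(1000, 3),
                    index=pd.date_range('1/1/2000', periods=1000),
                    columns=['A', 'B', 'C'])
tsdf.iloc[::2] = np.nan
grouped = tsdf.groupby(lambda x: x.year)

# ffill is dispatched to each group's frame, so values are never
# padded across a year boundary
padded = grouped.ffill()
print(padded.head())
```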