tapply in Python

tapply is a super convenient function in R for computing statistics on a “ragged array”. It lets you separate a dataset into groups based on a categorical variable, then compute any function you want on each group. Suppose we have the following made-up data

name state age
alice maine 26
bob washington 23
claire california 28
david florida 26
ellen florida 26
frank maine 25
gina texas 20
jerome maine 26
kira texas 26
larry washington 19

stored in an R data frame called roster. If we want to figure out the maximum age in each state we can use tapply:

max_age = tapply(roster$age, INDEX=roster$state, FUN='max')
california 28
florida 26
maine 26
texas 26
washington 23

One of the annoying parts of switching from R to the Python/NumPy/SciPy universe is the lack of a tapply analog. The Python Pandas library solves the problem quite smoothly. Suppose the data is saved in the file ‘ages.csv’.

import pandas as pd
roster = pd.read_csv('ages.csv')
grouped = roster.groupby('state')
age_max = grouped.max()
print(age_max)
name age
california claire 28
florida ellen 26
maine jerome 26
texas kira 26
washington larry 23
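
Note that grouped.max() takes the maximum of every column, which is why name shows up above (it is the alphabetically last name in each state). To get closer to what tapply returns, one option is to select the age column before aggregating. This is a minimal sketch, with the roster rebuilt inline as an assumption so it runs without ages.csv:

```python
import pandas as pd

# Inline copy of the roster so the sketch runs without ages.csv
roster = pd.DataFrame({
    'name': ['alice', 'bob', 'claire', 'david', 'ellen',
             'frank', 'gina', 'jerome', 'kira', 'larry'],
    'state': ['maine', 'washington', 'california', 'florida', 'florida',
              'maine', 'texas', 'maine', 'texas', 'washington'],
    'age': [26, 23, 28, 26, 26, 25, 20, 26, 26, 19],
})

# Selecting one column first gives a Series indexed by state,
# the closest match to tapply(roster$age, roster$state, max).
max_age = roster.groupby('state')['age'].max()
print(max_age)
```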

This is the “split-apply-combine” paradigm in Pandas. The grouped object in my example above comes with several built-in functions for standard counting and statistics:

size – gives the number of rows in each group
groups – a dictionary attribute (not a method) mapping each group to the row labels that belong to it
min, max
mean, median
sum, prod – sum and product
var, std
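
A quick illustration of a few of these, again as a sketch with the roster rebuilt inline. Passing a list of names to agg is one way to get several statistics in a single table:

```python
import pandas as pd

# Inline copy of the roster so the sketch runs without ages.csv
roster = pd.DataFrame({
    'name': ['alice', 'bob', 'claire', 'david', 'ellen',
             'frank', 'gina', 'jerome', 'kira', 'larry'],
    'state': ['maine', 'washington', 'california', 'florida', 'florida',
              'maine', 'texas', 'maine', 'texas', 'washington'],
    'age': [26, 23, 28, 26, 26, 25, 20, 26, 26, 19],
})
grouped = roster.groupby('state')

# Number of rows in each state
sizes = grouped.size()

# Row labels of the maine records (groups is an attribute, not a call)
maine_rows = list(grouped.groups['maine'])

# Several built-in statistics on the age column at once
summary = grouped['age'].agg(['min', 'max', 'mean'])

print(sizes)
print(maine_rows)
print(summary)
```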

For more complicated functions, or when other arguments need to be passed, there are a few options. The least elegant is to simply loop through the groups, applying the desired function within the loop.

for state, record in grouped:
    print(state)
    print(record)


     name       state  age
2  claire  california   28

    name    state  age
3  david  florida   26
4  ellen  florida   26

     name  state  age
0   alice  maine   26
5   frank  maine   25
7  jerome  maine   26

   name  state  age
6  gina  texas   20
8  kira  texas   26

    name       state  age
1    bob  washington   23
9  larry  washington   19

A more elegant solution is to use a lambda function within the aggregate method.

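import numpy as np
# Select the age column first; np.percentile cannot handle the name strings
quintile = grouped['age'].aggregate(lambda x: np.percentile(x, 20))
print(quintile)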
california 28.0
florida 26.0
maine 25.4
texas 21.2
washington 19.8
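
When the function takes extra arguments, another option is to pass them through aggregate as keyword arguments instead of wrapping everything in a lambda. The sketch below assumes a helper I am calling percentile_of (a made-up name) and rebuilds the roster inline. For this particular statistic the built-in quantile method should give the same numbers:

```python
import numpy as np
import pandas as pd

# Inline copy of the roster so the sketch runs without ages.csv
roster = pd.DataFrame({
    'name': ['alice', 'bob', 'claire', 'david', 'ellen',
             'frank', 'gina', 'jerome', 'kira', 'larry'],
    'state': ['maine', 'washington', 'california', 'florida', 'florida',
              'maine', 'texas', 'maine', 'texas', 'washington'],
    'age': [26, 23, 28, 26, 26, 25, 20, 26, 26, 19],
})


def percentile_of(values, q):
    """Return the q-th percentile of one group's values (hypothetical helper)."""
    return np.percentile(values, q)


ages = roster.groupby('state')['age']

# Extra keyword arguments given to aggregate are forwarded to the function
quintile = ages.aggregate(percentile_of, q=20)

# Built-in alternative: quantile takes a fraction rather than a percentage
quintile_builtin = ages.quantile(0.2)

print(quintile)
```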

For more on the split-apply-combine framework, see the Pandas documentation and various blog posts [1] [2].
