Installing Rpy2 in Linux Mint

While installing Rpy2 on Linux Mint 17, I ran into the same problem described here. This was apparently a StackOverflow question at some point, but it was removed for some reason. Here’s my solution.

R version: 3.0.2
Python version: 2.7.6

Initially, I tried

$ pip install readline
$ pip install rpy2

But got the following (abridged) error:

/rpy/rinterface/_rinterface.c:86:31: fatal error: readline/readline.h: No such file or directory
#include <readline/readline.h>
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Cleaning up...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)

Although the Rpy2 documentation just lists “readline” as a dependency, the Python readline package alone is not sufficient: the compiler needs the C header files as well. Instead, try

$ sudo apt-get install libreadline-dev
$ pip install rpy2


Implementing the Chaudhuri-Dasgupta density cluster tree

A few months ago Siva Balakrishnan wrote a nice post on Larry Wasserman’s blog about the Chaudhuri-Dasgupta algorithm for estimating a density cluster tree. The method, described in the NIPS 2010 paper “Rates of convergence for the cluster tree,” is conceptually simple and supported by nice theoretical guarantees, but somewhat difficult to use in practice.

As Siva described, clustering is a challenging problem because the term “cluster” is often poorly defined. One statistically principled way to cluster (due to Hartigan (1975)) is to find the connected components of upper level sets of the data-generating probability density function. Assume the data are drawn from a density f and fix some threshold value \lambda; then the upper level set is \{x: f(x) \geq \lambda \} and the high-density clusters are the connected components of this set. Choosing a value for \lambda is difficult, so we instead find the clusters for all values of \lambda \geq 0. The compilation of high-density clusters over all \lambda values yields the level set tree (aka the density cluster tree).

Of course, f is unknown in real problems, but can be estimated by \widehat{f}. Unfortunately, computing upper level sets and connected components of \widehat{f} is impractical. Chaudhuri and Dasgupta propose an estimator for the level set tree that is based on geometric clustering.

  1. Fix parameter values k and \alpha.
  2. For each observation x_i, i = 1,\ldots,n, set r_k(x_i) to be the distance from x_i to its (k-1)'th closest neighbor.
  3. Let r grow from 0 to \infty. For each value of r:
    1. Construct a similarity graph G_r with node set \{x_i: r_k(x_i) \leq r\} and edge (x_i, x_j) if \|x_i - x_j\| \leq \alpha r.
    2. Let \widehat{C}(r) be the connected components of G_r.

Chaudhuri and Dasgupta observe that single linkage corresponds to the special case \alpha = 1 and k = 2. Compiled over all values of r, the connected components \widehat{C}(r) also form a tree, which is the estimated level set tree.
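To make steps 3.1 and 3.2 concrete, here is a rough R sketch that builds G_r and finds its connected components for a single value of r. This is only an illustration, not the DeBaCl code; it assumes the igraph package, an n-by-d data matrix X, and values of k and \alpha fixed as in step 1.

library(igraph)

# Pairwise distances, and each point's (k-1)'th nearest-neighbor distance.
# Each sorted row starts with the self-distance 0, so position k is the
# distance to the (k-1)'th closest neighbor.
D <- as.matrix(dist(X))
r_k <- apply(D, 1, function(d) sort(d)[k])

# Connected components of G_r for a single value of r
graph_components <- function(D, r_k, r, alpha) {
    keep <- which(r_k <= r)                            # node set: r_k(x_i) <= r
    A <- 1 * (D[keep, keep, drop=FALSE] <= alpha * r)  # edges: distance <= alpha * r
    diag(A) <- 0                                       # no self-loops
    g <- graph_from_adjacency_matrix(A, mode="undirected")
    split(keep, components(g)$membership)              # the clusters of G_r
}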

Using this estimator in practice is not straightforward because r varies continuously from 0 to \infty. Fortunately, there is only a finite set of r values at which the similarity graph G_r can change. Let W be the set of edge lengths in the complete graph (aka G_\infty), and let W_\alpha be the set W with each member divided by \alpha. Then G_r can only change at values in the set

R = W \cup W_\alpha.

The vertex set changes only when r is equal to some edge length: as r grows, node i first appears in G_r precisely when r = r_k(x_i), which is the (k-1)'th smallest edge length incident to vertex i. Similarly, the edge set changes only when r is equal to an edge length divided by \alpha, because edge (x_i, x_j) first appears in G_r when \alpha r = \|x_i - x_j\|, which happens only if r = w_{ij}/\alpha.

An exact implementation of the Chaudhuri-Dasgupta algorithm starts with r equal to the longest pairwise distance and iterates over the decreasing values in the set R, constructing G_r and finding its connected components at each step; a rough sketch appears below. To illustrate the method, I sampled 200 points in two dimensions from a mixture of three Gaussians, colored in the first figure according to the true group membership.
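Here is a sketch of that iteration, reusing graph_components and the quantities defined above. Again, this is an illustration rather than the DeBaCl implementation: rebuilding the graph from scratch at every value of r is wasteful, and a serious implementation would update the graph incrementally.

W <- D[upper.tri(D)]                      # edge lengths of the complete graph
R_vals <- sort(unique(c(W, W / alpha)), decreasing=TRUE)
clusters <- lapply(R_vals, function(r) graph_components(D, r_k, r, alpha))

The result is one set of connected components per value in R, which can then be assembled into the tree.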

[Figure 1: the simulated data, colored by true group membership]

[Figure 2: the Chaudhuri-Dasgupta level set tree for the simulated data]

The second figure is a visualization of the Chaudhuri-Dasgupta level set tree for these data. Vertical line segments represent connected components: the bottom of a segment indicates where (in terms of r) the component first appears (by splitting from a larger cluster), and the top indicates where it dies (by vanishing or splitting further). The longer a segment, the more persistent the cluster. The plot clearly captures the three main clusters.

There are many ways to improve on this method, some of which are addressed in our upcoming publications (stay tuned). A Python implementation of the Chaudhuri-Dasgupta algorithm will be available soon. In fact, the code is already available at https://github.com/CoAxLab/DeBaCl/ in the “develop” branch. Accompanying demos, tutorials, and documentation will be posted shortly.

References

  • Chaudhuri, K. and Dasgupta, S. (2010). Rates of convergence for the cluster tree. Advances in Neural Information Processing Systems, 23, 343–351.
  • Hartigan, J. (1975). Clustering Algorithms. John Wiley & Sons.

Getting started with statistical computing in Python

The world hardly needs yet another tutorial on statistical computing with Python, but I made one for a live demo and I might as well post it. Also, the IPython Notebook makes me happy.

http://nbviewer.ipython.org/urls/raw.github.com/papayawarrior/public_talks/master/statBytes-python.ipynb

In a very loose sense, the topics are:

  • what packages are in the Python statistical computing ecosystem
  • how to install Python packages
  • generating and plotting very simple data
  • ordinary least squares
  • reading and summarizing data with pandas
  • 3D plotting with matplotlib
  • applying algorithms from scikit-learn
  • resorting to R functions (if necessary) with Rpy2

tapply in Python

tapply is a super convenient function in R for computing statistics on a “ragged array”. It lets you separate a dataset into groups based on a categorical variable, then compute any function you want on each group. Suppose we have the following made-up data

name state age
alice maine 26
bob washington 23
claire california 28
david florida 26
ellen florida 23
frank maine 25
gina texas 20
jerome maine 26
kira texas 26
larry washington 19

stored in an R data frame called roster. If we want to figure out the maximum age in each state we can use tapply:

max_age = tapply(roster$age, INDEX=roster$state, FUN='max')
print(max_age)
california 28
florida 26
maine 26
texas 26
washington 23

One of the annoying parts of switching from R to the Python/Numpy/Scipy universe is the lack of a tapply analog. The Python Pandas library solves the problem quite smoothly. Suppose the data are saved in the file 'ages.csv'.

import pandas as pd
roster = pd.read_csv('ages.csv')
grouped = roster.groupby('state')
age_max = grouped.max()
print age_max
name age
state
california claire 28
florida ellen 26
maine jerome 26
texas kira 26
washington larry 23

This is the “split-apply-combine” paradigm in Pandas. The grouped object in my example above comes with several built-in functions for standard counting and statistics:

size – gives the number of rows in each group
groups – row indices belonging to each group
min, max
mean, median
sum, prod – sum and product
var, std

For more complicated functions, or when other arguments need to be passed, there are a few options. The least elegant is to simply loop through the groups, applying the desired function within the loop.

for state, record in grouped:
    print state
    print record
    print

returns

california
     name       state  age
2  claire  california   28

florida
    name    state  age
3  david  florida   26
4  ellen  florida   26

maine
     name  state  age
0   alice  maine   26
5   frank  maine   25
7  jerome  maine   26

texas
   name  state  age
6  gina  texas   20
8  kira  texas   26

washington
    name       state  age
1    bob  washington   23
9  larry  washington   19

A more elegant solution is to use a lambda function within the aggregate method.

import numpy as np
quintile = grouped.aggregate(lambda x: np.percentile(x, 20))
print quintile
age
state
california 28.0
florida 26.0
maine 25.4
texas 21.2
washington 19.8

For more on the split-apply-combine framework, see the Pandas documentation and various blog posts.


Stat programming workshop – tutorial for a very simple R choropleth map

Eye candy is the best way to get your poster noticed at a conference, and choropleth maps are some of the tastiest morsels. Here’s a quick tutorial on how to make one in R. It’s not a beginner topic, so bear with it. The approach below probably requires the least overhead in terms of new libraries or plotting paradigms, but it is also probably the least powerful method.
-Brian

  1. Install the maps library. The command install.packages("maps") usually does the trick, although I’m not too familiar with non-linux environments.
  2. Load the necessary libraries and data objects. The last line loads the state names and some data to play with.
    library(maps)
    data(state)
  3. Make a basic map, just to see how it looks.
    map("state", fill=FALSE, boundary=TRUE, col="red")
  4. Now come the hard parts. Presumably you want to color each state according to some numerical variable. I’m going to use land area, which is called state.area (loaded in step 2). The problem is that the map command actually draws 63 regions that comprise the continental US, while our data vector has one entry for each of the 50 states.
  5. To get the 63 regions, make an invisible map.
    mapnames <- map("state", plot=FALSE)$names
  6. Note that each region’s name is either a state or a state followed by a colon and then an island. We’ll use the colon to isolate the state names.
    region_list <- strsplit(mapnames, ":")
    mapnames2 <- sapply(region_list, "[", 1)
  7. Now we need to match our data to the state names as they appear in the map.
    m <- match(mapnames2, tolower(state.name))
    map.area <- state.area[m]

    Now each of the 63 regions is associated with the land area corresponding to its state (except poor Washington, DC, which is missing from the original state.area variable).
  8. The next step is to define the colors for our map. Of the basic R palettes, I think the heat.colors() look best. I use 8 colors; use a higher number if you want more subtle distinctions. I also reverse the color vector so that higher land area states show up redder.
    clr <- rev(heat.colors(8))
  9. Now we need to collapse our map areas into bins, one per color.
    area.buckets <- cut(map.area, breaks=8)
  10. And finally, for all the glory…
    map("state", fill=TRUE, col=clr[area.buckets])


Stat programming workshop – week 6 tasks

April 2013 update: this post is part 6 of 6 that were designed to help beginning R programmers get up and running with some simple data analyses. They were originally private for a specific course in Summer 2012, but they’re now public in case the tips might be useful for a broader audience. -Brian

In week 6 we look at loops in R and how to avoid them at all costs. If you’ve had some programming experience in other languages like C, C++, or Python, you know that looping is an important part of many programs, but looping in R is really slow. When dealing with large data sets, it becomes very important to avoid loops so your programs finish in a reasonable amount of time. Fortunately R has several built-in functions that are very good at replacing loops, although it is tricky to learn to think in R’s “vectorized” fashion.
-Brian

  1. The first task is to get the average of each row in a matrix. Create a 10 x 5 matrix:
    m = matrix(1:50, nrow=10, byrow=T)
  2. Use a for loop to find the row averages. Here’s how I would do it:
    avg = rep(NA, 10)
    for(i in 1:10) {
        avg[i] = mean(m[i,])
    }
  3. The apply function allows us to avoid the for loop in this case. Check the help file for apply then use it to find the row means. Make sure the answer is correct.
    avg = apply(m, 1, mean)

    Note the second argument is a 1 (instead of a 2) because we want to apply the function mean to each row of our matrix. Making this a 2 would find the column means.

  4. For common operations, R often has built-in functions that accomplish the task most efficiently. In this case we can use the function rowMeans to do this job.
    avg = rowMeans(m)
  5. The second task is to compute statistics about a data set arranged as a “ragged array”. The cars data set that we’ve used in previous weeks is a good example. Suppose we want to find the mean gas mileage for each year; this is a ragged array because the number of rows differs from year to year.
  6. Use the table command to get a list of the years and the number of observations for each year.
  7. Think about how you would use a for-loop to find the mean gas mileage for each year. It’s kind of a pain; see the sketch after this list.
  8. The tapply command is more elegant and much faster than the for-loop solution, so let’s use that instead. Look up the command’s help file, and use it to find the mean mileage for each year.
    avg = tapply(cars$mpg, INDEX=cars$year, FUN=mean)
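For comparison, here is a sketch of how the for-loop in step 7 might look (assuming the cars data frame from previous weeks):

years <- sort(unique(cars$year))
avg <- rep(NA, length(years))
for (i in seq_along(years)) {
    avg[i] <- mean(cars$mpg[cars$year == years[i]])
}
names(avg) <- years

Compare this to the one-line tapply version in step 8: five lines of bookkeeping just to label the results correctly.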

Stat programming workshop – week 5 tasks

April 2013 update: this post is part 5 of 6 that were designed to help beginning R programmers get up and running with some simple data analyses. They were originally private for a specific course in Summer 2012, but they’re now public in case the tips might be useful for a broader audience. -Brian

This week’s focus is on estimating and plotting a linear regression.
-Brian

  1. Load the cars data from previous weeks and name the variables (see the week 3 tasks for details).
  2. We want to explore the relationship between engine displacement (which I assume is an indicator of an engine’s size) and gas mileage. Make a scatterplot of mileage (‘mpg’) against engine displacement (‘displacement’).
  3. Fit a linear regression of gas mileage on engine displacement. If you are familiar with linear regression, awesome. If not, think of it as the line that best fits the data points you just plotted. The command to do this is lm, which is short for linear model. We want to save the output so we can access it later, so we assign it to a new variable (I call mine ‘fit’): fit = lm(cars$mpg ~ cars$displacement).
  4. Display the results of the regression. The quick way to do this is with the summary(fit) command.
  5. Add the regression line to your plot. If your plot is still open, the command abline(fit) is built to do this automagically. Make the line red and thicker for easier viewing.
  6. The ‘fit’ object now has a bunch of stuff in it. Use the command names(fit) to list this stuff. Any of these objects can be accessed with the $ operator. Make a plot of the regression residuals compared to displacement. plot(cars$displacement, fit$residuals).
  7. Bonus: Access the standard error of the intercept term. Hint: the summary(fit) output can also be treated as an object, whose attributes can be accessed with the $ operator. See the sketches after this list.
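Here are sketches for the line styling in step 5 and the bonus in step 7, assuming the fit object from step 3:

abline(fit, col="red", lwd=2)   # step 5: a red line at double thickness

# Step 7: summary(fit) is itself an object; its coefficients attribute is a
# matrix with one row per term and a "Std. Error" column.
fit.summary <- summary(fit)
fit.summary$coefficients["(Intercept)", "Std. Error"]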

Stat programming workshop – week 4 tasks

April 2013 update: this post is part 4 of 6 that were designed to help beginning R programmers get up and running with some simple data analyses. They were originally private for a specific course in Summer 2012, but they’re now public in case the tips might be useful for a broader audience. -Brian

In weeks 1-3 we installed and opened R, imported data from a text file, and started editing and saving code in source files. This week’s focus is R’s basic plot functions. One of the most important parts of any statistical analysis is communicating the results to other people, and plots are often a very effective way to do this.
-Brian

  1. Open R and load the car data from weeks 2 and 3. Name the variables in the car data appropriately (see the week 3 tasks for the variable names).
  2. Last week we used the table command to see how many cars there are in our data set for each number of cylinders. Now we’ll check out the same information visually, using the command hist to produce a histogram. In the command window, type hist(cardata$cylinders).
  3. Save this plot by using a sequence of three commands: 1. png("[filename].png"); 2. hist([some variable]); 3. dev.off(). If you want a pdf file instead of a png image, you can substitute the command pdf in place of png. Similarly, we will work with various plot commands in addition to hist; just substitute your command in place of hist.
  4. Figure out how to make the plot pretty: change the default title and axis labels. Hint: use the help(hist) command to see descriptions of the arguments to hist.
  5. Now make a histogram of the weight variable. Notice that hist chooses a default number of break points, because the weights don’t naturally fall into a nice number of bins like the number of cylinders. Re-do the plot with 20 bins.
  6. Suppose we want to investigate how mpg changes over the years. Make a scatterplot of mpg vs year with the plot(year, mpg) command.
  7. Bonus: Add a horizontal line to the plot that shows where the overall mean of mpg is. Use the command abline(h=mean(cardata$mpg)). Try to make this line thicker than the default, and try to make it red (the default is black). See the sketch after this list.
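Here are sketches for steps 3, 5, and 7 (the filename and labels are just examples):

# Step 3: save a histogram as a png image
png("cylinders.png")
hist(cardata$cylinders, main="Cars by cylinder count", xlab="Cylinders")
dev.off()

# Step 5: histogram of weight with roughly 20 bins
hist(cardata$weight, breaks=20)

# Steps 6 and 7: mpg over the years, with a thick red line at the overall mean
plot(cardata$year, cardata$mpg)
abline(h=mean(cardata$mpg), col="red", lwd=2)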

Stat programming workshop – week 3 tasks

April 2013 update: this post is part 3 of 6 that were designed to help beginning R programmers get up and running with some simple data analyses. They were originally private for a specific course in Summer 2012, but they’re now public in case the tips might be useful for a broader audience. -Brian

In week 1 we looked at installing and running R, while week 2 focused on importing data. This week we start exploring how to write re-usable R code.
– Brian

  1. There are many advantages to doing statistical analysis with a programming language like R. Obviously, we use it to make the computer do what we want, but more importantly R enables us to write reusable code. This means a couple things: first, we want to be able to come back and run our programs long after we wrote them; second, we want to save time by not re-writing everything from scratch for each analysis; and lastly we want our analyses to be reproducible by other researchers.
  2. Open R, and open the text editor of your choice. If you’re using R on Windows or Mac you can use R’s built-in text editor. On Macs, go to File >> New Document. It should be similar in Windows. You can also use a separate text editor like Notepad (Windows), gedit (Linux), or TextEdit (Mac).
  3. Enter the code for opening the car MPG data (from week 2 – you should have it saved as 'cars.txt') in the new document. cardata = read.table('cars.txt')
  4. On the next line edit the names of the variables in the cardata data frame. names(cardata) = c('mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name')
  5. On the next line compute and display a table that shows the unique values in the cylinders column, and how many observations there are for each number of cylinders. print(table(cardata$cylinders)). Note the $ syntax to refer to a variable name in the cardata data frame.
  6. Save your program with a .R extension. For example, mine is called “mpg_analysis.R”.
  7. Run your program. Switch back to the R command line and type source([your program name]). For me, this is source('mpg_analysis.R'). You should see the table of cylinder values and counts come up.
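Collected together, the complete program from steps 3 through 5 (my mpg_analysis.R) is just three statements:

cardata = read.table('cars.txt')
names(cardata) = c('mpg', 'cylinders', 'displacement', 'horsepower',
                   'weight', 'acceleration', 'year', 'origin', 'name')
print(table(cardata$cylinders))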

Stat programming workshop – week 2 tasks

April 2013 update: this post is part 2 of 6 that were designed to help beginning R programmers get up and running with some simple data analyses. They were originally private for a specific course in Summer 2012, but they’re now public in case the tips might be useful for a broader audience. -Brian

Hopefully after working through last week’s tasks you have installed R, can open R, and can look up help files for new commands. This week’s tasks are intended to lead you through importing a data set and computing some basic summary statistics.
– Brian

  1. Copy the car MPG data at http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data and paste it into a text file. Save the file as “cars.txt”.
  2. Open R and change the working directory to the location where “cars.txt” is saved. Hint: the setwd() command is what you want for this, and the getwd() command can be used to check that you’re in the right directory. List the files in the directory with the dir() command to make sure “cars.txt” is there.
  3. Import the data in “cars.txt” into R with the read.table() command. cars = read.table("cars.txt")
  4. How many rows of data are there? How many variables? Hint: check out the dim() command.
  5. View the first 10 rows of the cars data.
  6. The second variable (called “V2” unless you’ve already added column names) has integer values. Figure out what the unique values are for this variable. Hint 1: check out the unique() command. Hint 2: to access an individual column of our data set use the $ operator. For example, to get the unique values of variable 8, you could do unique(cars$V8).
  7. Bonus: How many rows are there for each value of variable 2? (Sketch answers to steps 4 through 7 appear after this list.)
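If you get stuck, here is one possible set of answers for steps 4 through 7 (a sketch; cars is the data frame created in step 3):

dim(cars)         # step 4: number of rows and columns
cars[1:10, ]      # step 5: the first 10 rows (head(cars, 10) also works)
unique(cars$V2)   # step 6: unique values of the second variable
table(cars$V2)    # step 7 bonus: row count for each value of V2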