Implementing the Chaudhuri-Dasgupta density cluster tree

A few months ago Siva Balakrishnan wrote a nice post on Larry Wasserman’s blog about the Chaudhuri-Dasgupta algorithm for estimating a density cluster tree. The method, described in the NIPS 2010 paper “Rates of convergence for the cluster tree” (Chaudhuri and Dasgupta, 2010), is conceptually simple and supported by nice theoretical guarantees, but somewhat difficult to use in practice.

As Siva described, clustering is a challenging problem because the term “cluster” is often poorly defined. One statistically principled way to cluster, due to Hartigan (1975), is to find the connected components of upper level sets of the data-generating probability density function. Assume the data are drawn from a density f and fix some threshold value \lambda; then the upper level set is \{x: f(x) \geq \lambda\} and the high-density clusters are the connected components of this set. Choosing a single value for \lambda is difficult, so we instead find the clusters for all values of \lambda \geq 0. Compiling the high-density clusters over all values of \lambda yields the level set tree (a.k.a. the density cluster tree).

Of course, f is unknown in real problems, but it can be estimated by \widehat{f}. Unfortunately, computing upper level sets and connected components of \widehat{f} directly is impractical. Chaudhuri and Dasgupta instead propose an estimator for the level set tree based on geometric clustering (a minimal code sketch follows the steps below):

  1. Fix parameter values k and \alpha.
  2. For each observation x_i, i = 1,\ldots,n, set r_k(x_i) to be the distance from x_i to its (k-1)th closest neighbor.
  3. Let r grow from 0 to \infty. For each value of r:
    1. Construct a similarity graph G_r with node set \{x_i: r_k(x_i) \leq r\} and an edge (x_i, x_j) whenever \|x_i - x_j\| \leq \alpha r.
    2. Let \widehat{C}(r) be the connected components of G_r.
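
To make these steps concrete, here is a minimal sketch in Python with NumPy and SciPy, assuming Euclidean distances. The helper names knn_radii and components_at are mine, chosen for illustration; this is a sketch of the recipe above, not the DeBaCl implementation.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components
    from scipy.spatial.distance import pdist, squareform

    def knn_radii(X, k):
        # r_k(x_i): distance from x_i to its (k-1)th closest neighbor.
        D = squareform(pdist(X))
        return np.sort(D, axis=1)[:, k - 1]   # column 0 is x_i itself

    def components_at(X, k, alpha, r):
        # Steps 3.1 and 3.2 above, for a single radius r.
        D = squareform(pdist(X))
        keep = np.where(knn_radii(X, k) <= r)[0]    # node set of G_r
        if keep.size == 0:
            return keep, np.array([], dtype=int)
        A = D[np.ix_(keep, keep)] <= alpha * r      # edge iff ||x_i - x_j|| <= alpha*r
        _, labels = connected_components(csr_matrix(A), directed=False)
        return keep, labels                         # \widehat{C}(r) as integer labels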

Chaudhuri and Dasgupta observe that single linkage is the special case \alpha = 1 and k = 2. Compiled over all values of r, the connected components \widehat{C}(r) also form a tree, which is the estimated level set tree.

Using this estimator in practice is not straightforward, because r varies continuously from 0 to \infty. Fortunately, there is a finite set of values of r at which the similarity graph G_r can change. Let W be the set of edge lengths in the complete graph (a.k.a. G_\infty), and let W_\alpha be the set W with each member divided by \alpha. Then G_r can change only at values in the set

R = W \cup W_\alpha.

The vertex set changes only when r equals some edge length, because as r grows, node i first appears in G_r precisely when r = r_k(x_i), which is the (k-1)th smallest edge length incident to vertex i. Similarly, the edge set changes only when r equals an edge length divided by \alpha, because edge (i, j) first appears in G_r when \alpha r = \|x_i - x_j\| = w_{ij}, i.e. when r = w_{ij}/\alpha.
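
In code, under the same assumptions as the earlier sketch (critical_radii is again a name I made up):

    import numpy as np
    from scipy.spatial.distance import pdist

    def critical_radii(X, alpha):
        # R = W union W_alpha: the only radii at which G_r can change.
        W = pdist(X)                      # edge lengths of the complete graph G_\infty
        return np.union1d(W, W / alpha)   # sorted union, duplicates removed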

An exact implementation of the Chaudhuri-Dasgupta algorithm therefore starts with r equal to the longest pairwise distance and iterates over the values in R in decreasing order, constructing G_r and finding its connected components at each step; a sketch follows below. To illustrate, I sampled 200 points in two dimensions from a mixture of 3 Gaussians, colored in the first figure below according to the true group membership.
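
Here is a sketch of that outer loop, reusing the hypothetical helpers from the sketches above; the bookkeeping that compiles the recorded components into an actual tree is omitted. Note that this naive version revisits the distance matrix at each of the O(n^2) critical radii, which is part of why the exact algorithm is expensive in practice.

    def cd_tree_scan(X, k, alpha):
        # Walk r downward through the critical radii, recording the
        # connected components of G_r at each value.
        history = []
        for r in critical_radii(X, alpha)[::-1]:   # largest value first
            keep, labels = components_at(X, k, alpha, r)
            history.append((r, keep, labels))
        return history

    # Example on data like the figures below (the mixture centers and the
    # values of k and alpha are arbitrary stand-ins, not the ones I used):
    rng = np.random.default_rng(0)
    centers = [(0.0, 0.0), (3.0, 3.0), (0.0, 3.0)]
    X = np.vstack([rng.multivariate_normal(c, 0.2 * np.eye(2), n)
                   for c, n in zip(centers, (67, 67, 66))])
    history = cd_tree_scan(X, k=10, alpha=1.4)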

[Figure: the 200 sampled points, colored according to true group membership.]

[Figure: the Chaudhuri-Dasgupta level set tree for the simulated data.]

The second figure is a visualization of the Chaudhuri-Dasgupta level set tree for these data. Vertical line segments represent connected components, with the bottom indicating where (in terms of r) the component first appears (by splitting from a larger cluster) and the top indicating where the component dies (by vanishing or splitting further). The longer a line segment, the more persistent the cluster. The plot clearly captures the three main clusters.

There are many ways to improve on this method, some of which are addressed in our upcoming publications (stay tuned). A Python implementation of the Chaudhuri-Dasgupta algorithm is already available at https://github.com/CoAxLab/DeBaCl/ in the “develop” branch; accompanying demos, tutorials, and documentation will be posted shortly.

References

  • Chaudhuri, K. and Dasgupta, S. (2010). Rates of convergence for the cluster tree. Advances in Neural Information Processing Systems, 23, 343–351.
  • Hartigan, J. (1975). Clustering Algorithms. John Wiley & Sons.

2 Responses to Implementing the Chaudhuri-Dasgupta density cluster tree

  1. Peter says:

    Hi Brian,
    Very interesting post. I have some microscope tracking data that I thought I’d try your package out on.

    Unfortunately I can’t seem to get it working. When running gauss_demo.py it throws “critical packages not installed”, originating from geom_tree.py. It’s a fresh Python 2.7.5 environment with numpy 1.7.1, matplotlib 1.3.0, igraph 0.6.5, and scipy 0.12.0. The statement in geom_tree.py that I can’t execute from the command line is ‘import utils’.

    Any thoughts?

    • Brian says:

      Hi Peter, thanks for checking it out, and sorry it broke on you. Three possibilities come to mind, although I’m not too sanguine about any of them.

      1) The lines

      14 import sys
      15 sys.path.append('..')

      aren’t needed if DeBaCl is installed elsewhere on the path, so you can try deleting them.

      2) Is the utils.py script located in the same directory as geom_tree.py?

      3) I’m still using matplotlib 1.2.1, so I wonder if there’s something not compatible with 1.3. You might try commenting out the matplotlib import statements and the “plotForeground” and “setPlotParams” functions. I’ll check it against matplotlib 1.3 on my end as well.
