Some memos to self on how to do hierarchical cluster analysis in R

The standard function to use is hclust(), which works on a distance matrix. If the variables to cluster on are numerical, use dist() to create the distance matrix. If you have factors, use daisy() from the package cluster instead, since it can compute dissimilarities for nominal and ordinal variables.
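A minimal sketch of the numerical case. The built-in USArrests data set, the "average" linkage and k = 4 are my own picks for illustration only, not something these notes prescribe.

```r
# Numerical case: distances with dist(), then agglomerative clustering.
# USArrests ships with R and is used here only as a stand-in.
x <- scale(USArrests)                 # standardise so no variable dominates
d <- dist(x, method = "euclidean")    # object of class "dist"
hc <- hclust(d, method = "average")   # the linkage method is a choice
groups <- cutree(hc, k = 4)           # k = 4 is an arbitrary illustration
table(groups)                         # cluster sizes
```

plot(hc) draws the dendrogram, which is usually the first thing to look at before deciding where to cut.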

Nominal (or ordinal) data

str(sweden)
'data.frame':	1929 obs. of  52 variables:
 $ Country      : Factor w/ 23 levels "AT","BE","BG",..: 20 20 20 20 20 20 20 20 20 20 ...
 $ pid          : num  97260001 98050001 98080001 98100001 98340001 ...
 $ Sex          : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 1 2 1 2 ...
 $ Year.of.Birth: Factor w/ 84 levels "1925","1926",..: 1 1 1 4 8 2 3 21 6 23 ...
 $ MA.2005.A    : Factor w/ 9 levels "1","2","3","4",..: NA NA NA NA 1 NA NA 5 4 1 ...
 $ MA.2005.B    : Factor w/ 9 levels "1","2","3","4",..: NA NA NA NA 1 NA NA 5 4 1 ...
...
 $ MA.2008.K    : Factor w/ 9 levels "1","2","3","4",..: 6 NA 6 6 6 6 6 6 6 1 ...
 $ MA.2008.L    : Factor w/ 9 levels "1","2","3","4",..: 6 NA 6 6 6 6 6 6 6 1 ...

The variables MA.200X.Y represent "main activity" in the labour market in a given month, and they are nominal variables.

Before we can do cluster analysis on these data, missing data (NA) must be taken care of. [...] Since that is a general problem, see this page.
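One possible first step, not necessarily what the page referred to above does, is to count the missing months per person and drop the rows that carry no information at all. The data frame toy and its columns m1 to m3 are hypothetical stand-ins for the MA.* columns.

```r
# Hypothetical toy data standing in for the MA.* columns.
toy <- data.frame(
  m1 = factor(c("1", NA, "6", "6", NA)),
  m2 = factor(c("1", NA, "6", "1", NA)),
  m3 = factor(c("5", NA, "6", "1", "4"))
)
n.missing <- rowSums(is.na(toy))        # missing months per person
all.missing <- n.missing == ncol(toy)   # rows with no information at all
toy.clean <- toy[!all.missing, , drop = FALSE]
n.missing
nrow(toy.clean)
```

Rows that are only partly missing are kept here. Whether to keep, impute or drop them is a separate decision.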

To apply daisy() on these variables:

library(cluster)
the.distance.matrix <- daisy(sweden[, 5:52])
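To see what daisy() does with factors, here is a small sketch on made-up data (toy and its columns are hypothetical). With only nominal columns, the Gower dissimilarity between two rows is the share of columns in which they differ.

```r
library(cluster)

# Hypothetical toy data: three nominal "main activity" columns, five people.
toy <- data.frame(
  m1 = factor(c("1", "1", "6", "6", "5")),
  m2 = factor(c("1", "1", "6", "1", "5")),
  m3 = factor(c("1", "5", "6", "1", "4"))
)
d <- daisy(toy, metric = "gower")  # for factors: share of columns that differ
round(as.matrix(d), 2)             # full symmetric matrix, easier to read
```

Rows 1 and 2 differ in one column out of three, so their dissimilarity is 1/3. Rows 1 and 3 differ in every column, so theirs is 1.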

Now, apply a clustering function on the distance matrix. There are a few to choose among:

fanny ("fuzzy")
hierarchical
- agnes ("agglomerative", similar to hclust above)
- diana ("divisive", like agnes but starts with one cluster which is then divided)
partition based
- pam ("partitioning around medoids": like k-means, or actually even better, k-medoids)
model based (see http://cran.r-project.org/web/views/Cluster.html)
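A sketch of how the hierarchical functions in this list can be run on the same dissimilarity object. The toy data, the "average" linkage and k = 2 are assumptions made for illustration.

```r
library(cluster)

# Hypothetical toy data with two obvious groups (rows 1-3 and rows 4-6).
toy <- data.frame(
  m1 = factor(c("1", "1", "1", "6", "6", "6")),
  m2 = factor(c("1", "1", "5", "6", "6", "6")),
  m3 = factor(c("1", "5", "1", "6", "6", "4"))
)
d <- daisy(toy, metric = "gower")

ag <- agnes(d, method = "average")   # agglomerative, package cluster
di <- diana(d)                       # divisive, package cluster
hc <- hclust(d, method = "average")  # agglomerative, base R

# agnes and diana results convert to "hclust" objects, so cutree() works.
cutree(as.hclust(ag), k = 2)
cutree(as.hclust(di), k = 2)
cutree(hc, k = 2)
```

On data this clean all three should agree. On real data they often will not, which is a reason to compare them.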

my.clusters <- pam(the.distance.matrix, k = 4)  # k is required; 4 is only a placeholder
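pam() has no default for the number of clusters k, so it has to be supplied. A small sketch on hypothetical toy data, with k = 2 chosen only because the made-up data have two obvious groups:

```r
library(cluster)

# Hypothetical toy data with two obvious groups (rows 1-3 and rows 4-6).
toy <- data.frame(
  m1 = factor(c("1", "1", "1", "6", "6", "6")),
  m2 = factor(c("1", "1", "5", "6", "6", "6")),
  m3 = factor(c("1", "5", "1", "6", "6", "4"))
)
d <- daisy(toy, metric = "gower")

fit <- pam(d, k = 2)     # k = 2 is an arbitrary illustration
fit$clustering           # cluster membership, one entry per row
fit$medoids              # the rows acting as cluster representatives
fit$silinfo$avg.width    # average silhouette width of the solution
```

Because the input is a dissimilarity object, the medoids are reported as row labels rather than as coordinates.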

Deciding on the number of clusters

A k-medoids based analysis requires the researcher to decide on the number of clusters. There are several possible ways to reach a conclusion on how many clusters the data should be partitioned into. Wikipedia mentions the following:

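The elbow test: if F(k) is the percentage of variance explained by a k-medoids (or k-means) solution, then choose the lowest k for which F'(k) roughly equals F'(k+1), that is, the point where adding another cluster no longer improves the fit noticeably.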
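A rough sketch of an elbow-style check with pam(). Note the substitution: instead of percentage of variance explained, it uses the objective value that pam() reports (the average dissimilarity of observations to their medoid), since that is what is available for a dissimilarity matrix. The toy data and the range of k are assumptions.

```r
library(cluster)

# Hypothetical toy data with two obvious groups (rows 1-3 and rows 4-6).
toy <- data.frame(
  m1 = factor(c("1", "1", "1", "6", "6", "6")),
  m2 = factor(c("1", "1", "5", "6", "6", "6")),
  m3 = factor(c("1", "5", "1", "6", "6", "4"))
)
d <- daisy(toy, metric = "gower")

ks <- 1:4
# Second element of $objective = value after the swap phase of pam().
objective <- unname(sapply(ks, function(k) pam(d, k = k)$objective[2]))
gain <- -diff(objective)   # improvement from k to k + 1

data.frame(k = ks, objective = objective)
gain
```

The elbow is the k after which the entries of gain become small and similar. Plotting objective against k makes this easier to judge than reading the numbers.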


Last modified: October 17, 2019