The standard function to use is hclust(). hclust() works on a distance matrix. If the variable to cluster on is numerical, then use dist() to create the distance matrix. If you have a factor, then the function daisy() in package cluster is needed to calculate a distance matrix.
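As a minimal sketch of the numerical case, the lines below run dist() and hclust() on R's built-in USArrests data, which is not the data set discussed on this page. The standardisation step, the "average" linkage and the choice of three groups are assumptions made only for illustration.

```r
# Toy illustration on the built-in USArrests data (numeric columns only);
# this is NOT the data set used elsewhere on this page.
x <- scale(USArrests)                 # standardise so no single variable dominates
d <- dist(x)                          # Euclidean distance is the default
hc <- hclust(d, method = "average")   # linkage method is an arbitrary choice here
groups <- cutree(hc, k = 3)           # cut the tree into 3 groups (arbitrary)
table(groups)                         # number of observations per group
```

plot(hc) draws the dendrogram, which is often the first thing to look at before deciding where to cut the tree.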
str(sweden)
'data.frame': 1929 obs. of 52 variables:
 $ Country      : Factor w/ 23 levels "AT","BE","BG",..: 20 20 20 20 20 20 20 20 20 20 ...
 $ pid          : num 97260001 98050001 98080001 98100001 98340001 ...
 $ Sex          : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 1 2 1 2 ...
 $ Year.of.Birth: Factor w/ 84 levels "1925","1926",..: 1 1 1 4 8 2 3 21 6 23 ...
 $ MA.2005.A    : Factor w/ 9 levels "1","2","3","4",..: NA NA NA NA 1 NA NA 5 4 1 ...
 $ MA.2005.B    : Factor w/ 9 levels "1","2","3","4",..: NA NA NA NA 1 NA NA 5 4 1 ...
 ...
 $ MA.2008.K    : Factor w/ 9 levels "1","2","3","4",..: 6 NA 6 6 6 6 6 6 6 1 ...
 $ MA.2008.L    : Factor w/ 9 levels "1","2","3","4",..: 6 NA 6 6 6 6 6 6 6 1 ...
The variables MA.200X.Y represent the "main activity" in the labour market in a certain month, and they are nominal variables.
Before we can do cluster analysis on these data, missing data (NA) must be taken care of. [...] Since that is a general problem, see this page.
To apply daisy() on these variables:
library(cluster)  # daisy() lives in package cluster
the.distance.matrix <- daisy(sweden[, 5:52])
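To see what daisy() produces for nominal variables, here is a small sketch on a made-up data frame. The name toy and the columns m1 to m3 are hypothetical stand-ins for the MA.200X.Y variables. When the columns are factors, daisy() switches to Gower's coefficient, which for purely nominal data is the share of variables on which two rows disagree.

```r
library(cluster)

# Hypothetical toy data: four people, three monthly activity codes (nominal)
toy <- data.frame(
  m1 = factor(c("1", "1", "2", "6")),
  m2 = factor(c("1", "1", "2", "6")),
  m3 = factor(c("1", "2", "2", "6"))
)

d <- daisy(toy)   # Gower's coefficient is used because the columns are factors
as.matrix(d)      # full symmetric matrix, easier to read than the "dist" print
```

Rows 1 and 2 differ on one of the three variables, so their dissimilarity is 1/3. Row 4 shares no value with any other row, so its dissimilarity to each of them is 1.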
Now, apply a clustering function on the distance matrix. There are a few to choose among:
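hclust() (see above)
pam() in package cluster, which does a k-medoids analysis:
my.clusters <- pam(the.distance.matrix, k = 4)  # k is required; 4 is only a placeholder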
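The two options can be compared on a small made-up example. This is only a sketch: the data frame toy, its columns m1 to m3, the "average" linkage and k = 2 are all assumptions chosen so that the expected grouping is obvious (rows 1 to 3 share no activity code with rows 4 to 6).

```r
library(cluster)

# Hypothetical toy data: six people, three monthly activity codes (nominal)
toy <- data.frame(
  m1 = factor(c("1", "1", "1", "6", "6", "6")),
  m2 = factor(c("1", "1", "2", "6", "6", "5")),
  m3 = factor(c("1", "2", "1", "6", "5", "6"))
)
d <- daisy(toy)

# Hierarchical clustering: build the tree, then cut it into k groups
hc <- hclust(d, method = "average")
hc.groups <- cutree(hc, k = 2)

# k-medoids: pam() needs the number of clusters k up front
my.clusters <- pam(d, k = 2)
my.clusters$clustering   # cluster membership for each row
my.clusters$medoids      # the observations chosen as medoids
```

On this toy data both methods put rows 1 to 3 in one cluster and rows 4 to 6 in the other. On real data the two approaches can disagree.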
A k-medoids based analysis requires the researcher to decide the number of clusters. There are several possible ways to reach a conclusion on how many clusters the data should be partitioned into. Wikipedia mentions the following: