DSCI 563 Lab1 Tutorial: k-means

This tutorial discusses the implementation of \(k\)-means (and variants) in R.

Let’s use a simulated dataset. Throughout, we’ll take \(k=3\) groups.

suppressMessages(library(ggplot2))
## Warning: package 'ggplot2' was built under R version 3.5.2
set.seed(345)
x <- rnorm(250)
y <- rnorm(250)
## Shift group 2:
x[51:150] <- x[51:150] + 2.5
y[51:150] <- y[51:150] - 1
## Shift group 3:
x[151:250] <- x[151:250] - 1.5
y[151:250] <- y[151:250] - 1.5
## Plot:
dat <- data.frame(x=x, y=y)
qplot(x, y, alpha=I(0.5))

\(k\)-means

Basics

Let’s fit \(k\)-means to the data with the kmeans function, using \(k=3\) groups. Pass the data first, then \(k\) as the second (centers) argument.

set.seed(22)
(fit1 <- kmeans(dat, 3))
## K-means clustering with 3 clusters of sizes 60, 87, 103
## 
## Cluster means:
##            x          y
## 1  0.3725131  0.2750934
## 2  2.5109499 -1.2199721
## 3 -1.5793637 -1.5174017
## 
## Clustering vector:
##   [1] 1 3 3 3 1 3 3 1 2 1 1 1 1 1 1 1 1 2 3 1 1 1 3 3 1 1 3 1 1 1 1 1 1 3 1
##  [36] 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 2 2 2 2 2 2 1 3 2 2 2 2 2 1 2 2 2 2 2 2
##  [71] 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2
## [106] 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3
## [176] 3 3 1 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3
## [211] 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 100.3076 154.1080 170.9162
##  (between_SS / total_SS =  68.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

The output tells you what you can extract, under “Available components”. You can extract them using the $ symbol. Let’s look at some of them.

Here are the assigned clusters:

fit1$cluster
##   [1] 1 3 3 3 1 3 3 1 2 1 1 1 1 1 1 1 1 2 3 1 1 1 3 3 1 1 3 1 1 1 1 1 1 3 1
##  [36] 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 2 2 2 2 2 2 1 3 2 2 2 2 2 1 2 2 2 2 2 2
##  [71] 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2
## [106] 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3
## [176] 3 3 1 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3
## [211] 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3
qplot(x, y, alpha=I(0.5), colour=factor(fit1$cluster))

Here are the centers (means) of the three groups:

fit1$centers
##            x          y
## 1  0.3725131  0.2750934
## 2  2.5109499 -1.2199721
## 3 -1.5793637 -1.5174017

We can also extract the various sums of squares.

fit1$totss        # Total sum of squares.
## [1] 1343.956
fit1$withinss     # Within group sum of squares.
## [1] 100.3076 154.1080 170.9162
fit1$tot.withinss # Total within group. **Objective function**
## [1] 425.3318
fit1$betweenss    # Between group sum of squares.
## [1] 918.6239
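These quantities fit together in the usual decomposition: the total sum of squares splits into the within-group plus the between-group parts, and $tot.withinss is just the sum of the per-cluster $withinss. As a quick sanity check (an illustrative aside, not part of the original tutorial; the data and fit are recreated so this chunk runs on its own):

```r
## Recreate the data and fit from above so this chunk runs standalone.
set.seed(345)
x <- rnorm(250); y <- rnorm(250)
x[51:150] <- x[51:150] + 2.5;  y[51:150] <- y[51:150] - 1
x[151:250] <- x[151:250] - 1.5; y[151:250] <- y[151:250] - 1.5
dat <- data.frame(x=x, y=y)
set.seed(22)
fit1 <- kmeans(dat, 3)
## totss decomposes into tot.withinss + betweenss:
all.equal(fit1$totss, fit1$tot.withinss + fit1$betweenss)
## and tot.withinss is the sum of the per-cluster withinss:
all.equal(fit1$tot.withinss, sum(fit1$withinss))
```

Both checks return TRUE (up to floating-point tolerance).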

Tweaks

In addition to running basic \(k\)-means once, we can run it multiple times using the nstart argument. Each run uses a different random initialization (different starting centroids), and the run with the best fit is the one reported. What is the “best fit” anyway? It’s the run with the smallest total within-group sum of squares – that is, $tot.withinss.

Let’s take the best of 20 runs:

(fit2 <- kmeans(dat, 3, nstart=20))
## K-means clustering with 3 clusters of sizes 60, 103, 87
## 
## Cluster means:
##            x          y
## 1  0.3725131  0.2750934
## 2 -1.5793637 -1.5174017
## 3  2.5109499 -1.2199721
## 
## Clustering vector:
##   [1] 1 2 2 2 1 2 2 1 3 1 1 1 1 1 1 1 1 3 2 1 1 1 2 2 1 1 2 1 1 1 1 1 1 2 1
##  [36] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 3 3 3 3 3 3 1 2 3 3 3 3 3 1 3 3 3 3 3 3
##  [71] 3 3 3 3 1 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3
## [106] 3 3 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 3 3 1 1 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 1 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [176] 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2
## [211] 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [246] 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 100.3076 170.9162 154.1080
##  (between_SS / total_SS =  68.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

The structure of the output is no different from before. (Note that the cluster labels themselves are arbitrary: here, the groups labelled 2 and 3 are swapped relative to fit1.)
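Because nstart keeps the run with the smallest $tot.withinss, the multi-start fit should never have a worse objective than a single run. A quick check (illustrative, not from the original tutorial; the data and both fits are recreated so this chunk runs on its own):

```r
## Recreate the data and both fits from above.
set.seed(345)
x <- rnorm(250); y <- rnorm(250)
x[51:150] <- x[51:150] + 2.5;  y[51:150] <- y[51:150] - 1
x[151:250] <- x[151:250] - 1.5; y[151:250] <- y[151:250] - 1.5
dat <- data.frame(x=x, y=y)
set.seed(22)
fit1 <- kmeans(dat, 3)
fit2 <- kmeans(dat, 3, nstart=20)
## The best-of-20 objective is no worse than the single-run objective:
fit2$tot.withinss <= fit1$tot.withinss
```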

You may also choose the centroids with which to start the algorithm. To do this, instead of passing \(k\) as the centers argument, pass a matrix of the centroids you wish to begin with (one row per centroid).

Let’s start with the true means: (0,0), (2.5,-1), and (-1.5, -1.5):

(xstart <- matrix(c(0,0, 2.5,-1, -1.5,-1.5), ncol=2, byrow=TRUE))
##      [,1] [,2]
## [1,]  0.0  0.0
## [2,]  2.5 -1.0
## [3,] -1.5 -1.5
(fit3 <- kmeans(dat, xstart))
## K-means clustering with 3 clusters of sizes 60, 87, 103
## 
## Cluster means:
##            x          y
## 1  0.3725131  0.2750934
## 2  2.5109499 -1.2199721
## 3 -1.5793637 -1.5174017
## 
## Clustering vector:
##   [1] 1 3 3 3 1 3 3 1 2 1 1 1 1 1 1 1 1 2 3 1 1 1 3 3 1 1 3 1 1 1 1 1 1 3 1
##  [36] 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 2 2 2 2 2 2 1 3 2 2 2 2 2 1 2 2 2 2 2 2
##  [71] 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2
## [106] 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3
## [176] 3 3 1 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3
## [211] 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 100.3076 154.1080 170.9162
##  (between_SS / total_SS =  68.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

\(k\)-means++

Basics

The kmeans function is quite specific – it only works with squared Euclidean distances. The flexclust package allows for more flexible cluster analysis; its kcca function is particularly useful.

At its most basic, kcca works similarly to kmeans. In fact, this runs a standard \(k\)-means algorithm:

suppressMessages(library(flexclust))
(fit4 <- kcca(dat, 3))
## kcca object of family 'kmeans' 
## 
## call:
## kcca(x = dat, k = 3)
## 
## cluster sizes:
## 
##  1  2  3 
## 92 59 99

But the output is less friendly than that of kmeans (because kcca uses a formal S4 object-oriented format). Its components are extracted with @ instead of $. For example, here are the cluster assignments and the final three centers:

fit4@cluster
##   [1] 2 3 3 3 2 3 3 2 1 2 1 2 2 2 2 2 2 1 3 2 2 2 3 3 2 2 2 2 2 2 2 2 2 3 2
##  [36] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 1 1 1 1 1 1 2 3 1 1 1 1 1 2 1 1 1 1 1 1
##  [71] 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 2 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [176] 3 3 2 3 3 3 3 3 3 3 3 1 2 3 3 3 3 2 3 3 2 3 3 3 2 3 3 3 3 2 3 3 3 3 3
## [211] 3 3 3 2 2 3 3 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3
fit4@centers
##               x          y
## [1,]  2.4487745 -1.1781922
## [2,]  0.1910489  0.3025144
## [3,] -1.6003047 -1.5694850

Notice that there is no sum-of-squares output like there is with kmeans – this is because kcca doesn’t necessarily use the Euclidean distance. However, there is a way to convert a kcca object to a kmeans output, though this output is more limited than usual (the only sum of squares available is $withinss):

fit4b <- as(fit4, "kmeans")
fit4b$withinss
## [1] 167.92353  96.14623 161.87687
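The converted $withinss really is the familiar within-cluster sum of squares: recomputing it directly from the cluster assignments gives the same numbers. (An illustrative check, not from the original tutorial; the data and fit are recreated so this chunk runs on its own. Note that without a fixed seed your kcca fit will differ from the one shown above, but the identity holds regardless.)

```r
suppressMessages(library(flexclust))
## Recreate the data and the kcca fit from above.
set.seed(345)
x <- rnorm(250); y <- rnorm(250)
x[51:150] <- x[51:150] + 2.5;  y[51:150] <- y[51:150] - 1
x[151:250] <- x[151:250] - 1.5; y[151:250] <- y[151:250] - 1.5
dat <- data.frame(x=x, y=y)
fit4 <- kcca(dat, 3)
fit4b <- as(fit4, "kmeans")
## Recompute each cluster's within-SS by centering its points and
## summing the squared deviations:
wss <- sapply(1:3, function(g)
  sum(scale(dat[fit4b$cluster == g, ], scale=FALSE)^2))
all.equal(unname(wss), unname(fit4b$withinss))
```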

Tweaks

The most useful feature of kcca for Lab 1 is its ability to do \(k\)-means++ initialization. This can be done through the control argument.

The control argument should be a named list, with each name corresponding to a property you’d like to set. See ?flexclustControl for the various options. The one we’re interested in is “initcent”, which controls how the initial centers are chosen. From the documentation, this should be:

Character string, name of function for initial centroids, currently “randomcent” (the default) and “kmeanspp” are available.

So to do \(k\)-means++, this amounts to:

(fit5 <- kcca(dat, 3, control=list(initcent="kmeanspp")))
## kcca object of family 'kmeans' 
## 
## call:
## kcca(x = dat, k = 3, control = list(initcent = "kmeanspp"))
## 
## cluster sizes:
## 
##  1  2  3 
## 95 68 87

Another feature of kcca is that it allows for distance metrics besides the Euclidean. This can be indicated through the family argument. Take a look at the kcca documentation under “Predefined Families” to see what distance metrics you can use.
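For example, here is a sketch using the predefined “kmedians” family, which pairs the Manhattan (absolute) distance with column-wise medians as centroids (an illustrative aside, not from the original tutorial; the data are recreated so this chunk runs on its own):

```r
suppressMessages(library(flexclust))
## Recreate the data from above.
set.seed(345)
x <- rnorm(250); y <- rnorm(250)
x[51:150] <- x[51:150] + 2.5;  y[51:150] <- y[51:150] - 1
x[151:250] <- x[151:250] - 1.5; y[151:250] <- y[151:250] - 1.5
dat <- data.frame(x=x, y=y)
## k-medians: Manhattan distance, centroids are medians.
fit6 <- kcca(dat, 3, family=kccaFamily("kmedians"))
fit6@family@name   # confirms which family was used
table(fit6@cluster)
```

Because medians are less sensitive to outliers than means, k-medians can give more robust clusters when the data have heavy tails.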

Vincenzo Coia

I’m a data scientist at the University of British Columbia, Vancouver.
