R
This post shows how Base R
can be used to visualize a cluster analysis in such a way that all clusters can be well distinguished. I use the classical iris
data set.
# Load data
data(iris)
cluster_data <- iris[, 1:4]
# Do k-means clustering, store the clustering vector
clusters <- kmeans(cluster_data, 3)$cluster
# define 3 colors, 3 cex values, and 3 pch values that differ between
# clusters - for more clusters, define more values.
colors <- c("#a9a9a9", "#df536b", "#61d04f")
cex <- c(0.7, 1.2, 1.5)
pch <- c(19, 15, 17)
# Plot the data while visualizing the different clusters
plot(
cluster_data,
col = colors[clusters],
cex = cex[clusters],
pch = pch[clusters]
)
This solution exploits the beautiful indexing capabilities of R
, seen by the call to the col
, cex
and pch
arguments in plot()
. The solution requires that the clustering vector is integer and only has values \(1, ..., K\) where \(K\) is the number of clusters. If a clustering vector x
has any other form (e.g., is of type character
), you can convert it to such an integer vector by calling as.numeric(factor(x))
.
As an addon, I also visualize the results of an anticlustering analysis, where instead of maximizing homogeneity within clusters, we maximize heterogeneity within clusters (and equivalently: maximize homogeneity between clusters, i.e., make the different clusters as similar as possible). To this end, I make use of the package anticlust
.
library(anticlust)
# generate some random data, N = 120
data <- data.frame(
x = rnorm(120),
y = rnorm(120)
)
# Generate 6 anticlusters that are as similar as possible
groups <- anticlustering(
data,
K = 6
)
# define 6 colors, 6 cex values, and 6 pch values:
colors <- c("#a9a9a9", "#df536b", "#61d04f", "#2297e6", "#28e2e5", "#eec12f")
cex <- c(0.7, 0.9, 1.2, 1.5, 1.7, 2)
pch <- 15:20
plot(
data,
col = colors[groups],
cex = cex[groups],
pch = pch[groups]
)
As we can see, anticlustering is a lot more messy than clustering. All 6 groups overlap to a large degree. This is what anticlustering does: the groups should be similar to each other, as opposed to cluster analysis that seeks groups that are dissimilar from each other.
Last updated: 2020-10-22