fast_anticlustering()
The latest version of anticlust (version 0.8.0) was just published on CRAN. It is mostly concerned with optimizing anticlust for (very) large data sets, so most users are probably not affected by the changes. Several internal changes were implemented, all of which were triggered by an issue on Github. This post highlights what I consider the most important changes. You can also read the entire changelog here.
The most pronounced changes affect the function fast_anticlustering(), which had been pretty much a second-class citizen in anticlust ever since anticlustering() was re-implemented in C (starting with anticlust version 0.5.4 in October 2020). For anticlustering, using C really makes a huge difference in run time as compared to using plain R code. After publishing anticlust 0.5.4, fast_anticlustering() was ironically slower than anticlustering() and only had limited use cases.[1] With version 0.8.0, fast_anticlustering() is once again the best choice for processing large data sets. This is due to two changes. First, fast_anticlustering() now also uses a C implementation. Second, it now uses a different way of re-computing the k-means objective during the optimization process, thereby gaining an order of magnitude in run time. In the following, I will talk about the changed objective for k-means anticlustering.
The k-means objective is the “variance”, which is the sum of the squared Euclidean distances between data points and their cluster centers. It can be visualized as follows using the well-known iris data set:
library(anticlust)
kmeans_clusters <- kmeans(iris[, 1:2], centers = 3)$cluster
plot_clusters(iris[, 1:2], kmeans_clusters, illustrate_variance = TRUE, show_axes = TRUE)
Here, the anticlust function plot_clusters() is used to illustrate the k-means objective by highlighting the cluster affiliations (which were obtained by applying the k-means clustering algorithm) and the cluster centers (shown as triangles). While the k-means clustering method minimizes the sum of squared distances between cluster centers and data points, k-means anticlustering maximizes it. Thereby, k-means anticlustering leads to cluster centers that are very close to each other:
kmeans_anticlusters <- anticlustering(
  iris[, 1:2],
  K = 3,
  objective = "variance" # k-means criterion
)
plot_clusters(iris[, 1:2], kmeans_anticlusters, illustrate_variance = TRUE, show_axes = TRUE)
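To make the objective concrete, here is a minimal sketch (the helper function is my own and not part of anticlust) that computes the k-means "variance" by hand for both partitions from above. The anticlustering partition yields a much larger value, and its cluster centers end up nearly identical:

# Sketch: sum of squared Euclidean distances between data points and their cluster centers
kmeans_variance_by_hand <- function(data, clusters) {
  data <- as.matrix(data)
  sum(sapply(unique(clusters), function(k) {
    x <- data[clusters == k, , drop = FALSE]
    sum(sweep(x, 2, colMeans(x))^2)  # squared distances to this cluster's center
  }))
}
kmeans_variance_by_hand(iris[, 1:2], kmeans_clusters)      # small: clustering minimizes it
kmeans_variance_by_hand(iris[, 1:2], kmeans_anticlusters)  # large: anticlustering maximizes it
# The anticluster centers nearly coincide:
aggregate(iris[, 1:2], by = list(cluster = kmeans_anticlusters), FUN = mean)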
Anticlustering is an iterative method that repeatedly exchanges data points between clusters and re-computes the objective after each exchange. Therefore, computing the distances between data points and cluster centers is the major factor driving the run time of k-means anticlustering. While the cluster centers can be updated rather quickly after an exchange (see the short aside below), computing the distances always requires iterating through all data points[2]. Thus, re-computing the objective depends on the size of the data set N, and it has to be done very often during the optimization process.
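As an aside, here is a minimal sketch (my own illustration, not anticlust's internal C code) of why updating a cluster center itself is cheap: when a single data point leaves or joins a cluster, the new center can be derived from the old one without revisiting the cluster's other members.

# Sketch: incremental update of a cluster center when one point leaves or joins
# (center is the current center, x is the moved data point, n the current cluster size)
center_after_removal <- function(center, x, n) (center * n - x) / (n - 1)
center_after_addition <- function(center, x, n) (center * n + x) / (n + 1)

# Check against recomputing the center from scratch
cluster <- as.matrix(iris[1:50, 1:2])
updated <- center_after_removal(colMeans(cluster), cluster[1, ], nrow(cluster))
all.equal(updated, colMeans(cluster[-1, ]))  # TRUE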
There is a faster way to update the objective, though, which was introduced in anticlust 0.8.0: instead of maximizing the sum of squared distances between cluster centers and data points, it is also possible (and equivalent) to minimize the (weighted) sum of the squared Euclidean distances between the cluster centers and the overall centroid. In this case, far fewer distances need to be re-computed during each iteration of the optimization algorithm (in the case of three groups: only three distances). In particular, re-computing the objective no longer depends on the number of data points N. Therefore, we gain an order of magnitude in run time as compared to computing the variance. However, this is only practically relevant for rather large data sets, because for up to several hundred elements the variance implementation is already very fast.
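Why are the two objectives equivalent? For a fixed data set, the total sum of squares is constant, and it decomposes into the within-cluster part (the variance objective from above) plus the weighted sum of squared distances between the cluster centers and the overall centroid. Maximizing one part therefore amounts to minimizing the other. A short numerical check, again using my own sketch code rather than the anticlust implementation:

# Sketch: within-cluster SS + weighted between-cluster SS = (constant) total SS
X <- as.matrix(iris[, 1:2])
overall_centroid <- colMeans(X)
total_ss <- sum(sweep(X, 2, overall_centroid)^2)

within_ss <- kmeans_variance_by_hand(X, kmeans_anticlusters)
between_ss <- sum(sapply(unique(kmeans_anticlusters), function(k) {
  x <- X[kmeans_anticlusters == k, , drop = FALSE]
  nrow(x) * sum((colMeans(x) - overall_centroid)^2)  # weighted center-to-centroid term
}))

all.equal(within_ss + between_ss, total_ss)  # TRUE: maximizing one = minimizing the other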
We can illustrate the difference between computing the two k-means objectives by comparing anticlustering() (which computes the variance) and fast_anticlustering() (which computes the distances between the cluster centers and the overall centroid):
# 1. Compute k-means anticlustering for N = 500, 2 variables, 5 groups
N <- 500
M <- 2
K <- 5
data <- matrix(rnorm(N * M), ncol = M)
# 1.1 Using the variance computation
system.time(anticlustering(data, K = K, objective = "variance"))
##    user  system elapsed 
##   0.171   0.000   0.171
# 1.2 Using the cluster-center-to-overall-center computation
system.time(fast_anticlustering(data, K = K))
##    user  system elapsed 
##   0.027   0.004   0.031
For N = 500 (which is not a small data set for most anticlustering applications), the new implementation is already much faster than the old one. Still, both versions are very fast and there is no practical gain from using fast_anticlustering(). However, let's see what happens with a larger data set:
# 2. Compute k-means anticlustering for N = 2000, 2 variables, 5 groups
N <- 2000
M <- 2
K <- 5
data <- matrix(rnorm(N * M), ncol = M)
# 2.1 Using the variance computation
system.time(anticlustering(data, K = K, objective = "variance"))
##    user  system elapsed 
##  14.182   0.000  14.180
# 2.2 Using the cluster-center-to-overall-center computation
system.time(fast_anticlustering(data, K = K))
##    user  system elapsed 
##   0.519   0.052   0.571
When N gets larger, gaining an order of magnitude in computation time makes a huge difference. Usually, for large data sets, we would also specify the argument k_neighbours when using fast_anticlustering(), which gains another order of magnitude in run time. Then, we can even process 1 million data points in a few minutes. The user viv-analytics, who opened the Github issue reporting that anticlustering fails for large N, reported that the old R implementation of fast_anticlustering() took about 6000 seconds (almost two hours) to process about 295k data points with 2 numeric variables. Let's see how that works using the new anticlust version:
N <- 295000
M <- 2
K <- 2
data <- matrix(rnorm(N * M), ncol = M)
system.time(fast_anticlustering(data, K = K, k_neighbours = 2))
##    user  system elapsed 
##  12.907   0.113  13.018
About 13 seconds vs. about 6000 seconds – what an improvement!
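Like anticlustering(), fast_anticlustering() returns a vector of group assignments. The following is a small sketch of mine for verifying that the resulting groups are balanced and have near-identical means, even for this large data set:

groups <- fast_anticlustering(data, K = K, k_neighbours = 2)
table(groups)                  # two groups of equal size
colMeans(data[groups == 1, ])  # the group means are almost identical ...
colMeans(data[groups == 2, ])  # ... which is exactly what anticlustering aims for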
To conclude, some questions remain as to why the improved computation is not also implemented in anticlustering() for the objectives "variance" and "kplus", and in kplus_anticlustering(). While this may be done in the future, there are some reasons that prevented me from already doing it with the current release:

- anticlustering(..., objective = "variance") – and hence, anticlustering(..., objective = "kplus") – does run quickly for most practical sizes of data sets, and its implementation has proven stable over several years. I cannot say for sure that the new implementation is as stable and as free of bugs.
- fast_anticlustering() and anticlustering() handle categorical variables (passed via the argument categories) differently. In anticlustering(), I would like to keep the handling that has been used previously. So, some work would remain to port the code used in fast_anticlustering() to anticlustering().
- The new computation does not really fit the option name objective = "variance", because the variance is no longer computed. It might be possible to introduce a new option objective = "kmeans" that uses the new computation. I am leaning towards this option, because I have wanted the option objective = "kmeans" in anticlustering() for some time anyway. Frankly, I am no longer happy with the name objective = "variance" and I would prefer objective = "kmeans". For backwards compatibility, I would still keep objective = "variance" in future versions. I think that including objective = "kmeans" as part of a different computation would be quite elegant.

Last updated: 2023-09-14
[1] The function fast_anticlustering() has an improved theoretical run time (by an order of magnitude) because the optimization algorithm can employ fewer iterations. For very large data sets, this improved theoretical run time outweighs the worse implementation (R vs. C). However, this is only relevant for very large data sets. Still, at least one user profited from fast_anticlustering() using the old R implementation, as evident from the issue on Github.

[2] Or at least the data points that are currently assigned to the two groups that are affected by the exchange.