Rewriting it in C: Anticlustering (part 1)

Surely, everyone using R[1] has heard the complaint that “R is slow”. Often enough, this is not true. For example, many built-in methods such as rowSums() are lightning fast for processing quite large data sets. This should not be surprising because rowSums() basically just runs a highly optimized C program. However, there are cases where “R is slow”, in particular when using (potentially nested) for loops[2] with many iterations, which usually occurs when (repeatedly) iterating over all elements in a data set. In this post and the following parts, I document a case of slow behaviour in R, which could be fixed by – aside from implementing algorithmic improvements, which were at least equally important – rewriting the code using the C programming language. In particular, in this series I will show how I was able to speed up the anticlustering algorithm used in the R package anticlust by a factor of many hundreds.

Anticlustering is used to partition a data set into parts in such a way that all parts are similar according to some mathematical criterion. In this post I talk about the k-means criterion, which is well-known in cluster analysis. When assigning data points to clusters in such a way that the k-means criterion has a small value, we obtain well separated clusters. A classical “pedagogical” example to illustrate k-means clustering is to apply the k-means algorithm to the popular iris data set, which is also available in R. For illustrative purposes, I only use the sepal width and sepal length features and not the other two numeric features in that data set (petal length and petal width).

kmeans_clusters <- kmeans(iris[, 1:2], centers = 3)$cluster

I use the anticlust function plot_clusters() to illustrate the results of k-means. It shows the cluster affiliation of each data point and draws the cluster centers (as triangles):

library(anticlust)
plot_clusters(iris[, 1:2], kmeans_clusters, illustrate_variance = TRUE, show_axes = TRUE)

K-means clustering leads to cluster centers that are as far away from each other as possible. K-means anticlustering, on the other hand, leads to cluster centers that are as close to each other as possible:

kmeans_anticlusters <- anticlustering(
  iris[, 1:2],
  K = 3,
  objective = "variance" # k-means criterion
)

plot_clusters(iris[, 1:2], kmeans_anticlusters, illustrate_variance = TRUE, show_axes = TRUE)

Thus, k-means anticlustering leads to groups that are similar in the sense that the mean values of the numeric attributes used in the analysis are close to each other. The package anticlust has a function variance_objective() to compute the k-means objective[3] for a given clustering and data set. We can use it to verify that the anticlustering and clustering results differ markedly with regard to this criterion:

variance_objective(iris[, 1:2], kmeans_clusters) # low value
## [1] 37.0507
variance_objective(iris[, 1:2], kmeans_anticlusters) # high value
## [1] 130.475
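
For intuition, footnote 3 describes the k-means criterion as the sum of squared deviations of the data points from their cluster centers. Here is a minimal sketch of that computation in plain R (the helper kmeans_criterion is my own name for illustration and not part of anticlust); its result should agree with variance_objective() up to numerical precision:

# Sketch: the k-means criterion as the sum of squared Euclidean distances
# between each data point and its cluster centroid.
# (kmeans_criterion is a hypothetical helper, not part of anticlust.)
kmeans_criterion <- function(data, clusters) {
  data <- as.matrix(data)
  # one row of centroids per cluster, one column per feature
  centroids <- apply(data, 2, function(col) tapply(col, clusters, mean))
  sum((data - centroids[as.character(clusters), ])^2)
}
kmeans_criterion(iris[, 1:2], kmeans_clusters) # should match variance_objective()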

Out of interest, let’s also compare these values with the objective value obtained when randomly assigning the elements to groups:

random_clusters <- sample(kmeans_anticlusters)
variance_objective(iris[, 1:2], random_clusters) # high value
## [1] 128.103

In terms of the objective value, the random assignment falls between the clustering and anticlustering solutions, and is typically closer to the anticlustering result. This should not come as a surprise because randomly assigning elements to groups is oftentimes done to obtain similar groups, but it is less adequate than anticlustering for this purpose:

plot_clusters(iris[, 1:2], random_clusters, illustrate_variance = TRUE, show_axes = TRUE)

The anticlustering algorithm: The exchange method

Anticlustering is an optimization procedure. The elements are assigned to clusters in such a way that the objective (here: the k-means objective) is maximized. There is not really a “clever way” of doing this, so the algorithm is an iterative procedure that repeatedly walks through the data set and tries to improve the group assignment little by little. This procedure is called the exchange method. Details on the exchange method that is used in anticlust can be found in Papenberg and Klau (2021; https://doi.org/10.1037/met0000301), Papenberg (2023; https://doi.org/10.1111/bmsp.12315), or the anticlust documentation (?anticlustering).

At the beginning of the exchange algorithm, all elements are randomly assigned to a group. Usually, this is done under the restriction of obtaining the same number of elements in each group, but any grouping is possible. For example, the following code creates such a balanced random assignment for the iris data set:

sample(rep_len(1:3, nrow(iris)))
##   [1] 3 1 2 3 2 3 3 1 1 3 2 1 1 3 1 3 3 3 1 3 1 2 3 2 2 3 3 1 2 3 3 1 1 3 2 3 2
##  [38] 1 1 2 2 2 3 2 1 1 1 2 1 1 2 2 1 2 2 3 3 1 1 3 3 3 1 3 2 3 2 2 2 2 1 3 2 1
##  [75] 3 1 1 3 3 2 2 2 2 3 2 1 3 1 2 2 1 3 2 1 3 2 1 2 1 1 2 3 2 3 3 3 1 3 1 2 3
## [112] 2 2 1 2 1 2 1 3 1 1 3 2 3 2 1 2 3 2 3 1 3 3 1 1 1 2 3 2 2 3 1 1 3 2 1 1 1
## [149] 2 3

Based on this initial grouping, the default exchange algorithm (anticlustering(..., method = "exchange")) iterates through all input elements and attempts to improve the results by swapping each input element with an element that is currently assigned to a different group. For each element, the exchange is performed that leads to the largest possible improvement in the criterion; therefore, each possible exchange – i.e., with all elements that are currently assigned to a different group – has to be simulated. No exchange is performed if an element cannot be swapped in such a way that the anticlustering objective is improved. The process stops after all possible exchanges have been evaluated for each element.[4] To give a concrete example, if we have \(N = 150\) data points and \(K = 3\) equal-sized groups (as in the iris data example above), 100 swaps are evaluated for each element, leading to 150 * 100 = 15000 exchanges that have to be simulated during the entire exchange algorithm. For each simulated exchange, the objective function variance_objective() has to be re-evaluated, so the total run time approximately corresponds to computing the k-means objective 15000 times.
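
The following is a minimal sketch of this exchange procedure in plain R. It is my own simplified illustration of the logic described above, not the actual anticlust implementation, and the function name exchange_method is made up for this example:

# Sketch of the exchange method: for each element, simulate all swaps with
# elements from other groups and perform the best improving swap (if any).
# (exchange_method is an illustrative helper, not the anticlust implementation.)
exchange_method <- function(data, clusters, objective = variance_objective) {
  best <- objective(data, clusters)
  for (i in seq_along(clusters)) {
    partners <- which(clusters != clusters[i]) # elements in other groups
    best_partner <- NA
    for (j in partners) {
      candidate <- clusters
      candidate[c(i, j)] <- candidate[c(j, i)] # simulate the swap
      obj <- objective(data, candidate)
      if (obj > best) {
        best <- obj
        best_partner <- j
      }
    }
    if (!is.na(best_partner)) { # perform the best improving swap
      clusters[c(i, best_partner)] <- clusters[c(best_partner, i)]
    }
  }
  clusters
}

Calling exchange_method(iris[, 1:2], sample(rep_len(1:3, nrow(iris)))) would run one full pass of this exchange heuristic, evaluating the objective function once for every simulated swap.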

In fact, we can use anticlustering() just this way: The argument objective specifies which anticlustering objective is maximized, and it can be a function that computes the objective, taking the data set as its first argument and the clustering as its second (just like variance_objective() above). So, the following code works:

kmeans_anticlusters <- anticlustering(
  iris[, 1:2],
  K = 3,
  objective = variance_objective # k-means criterion, passed as a function
)
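
Any function with this signature can be plugged in. As a toy illustration (the objective mean_similarity below is entirely made up for this example and is not part of anticlust), we could maximize the similarity of the group means directly by returning their negated spread:

# Toy user-defined objective (not part of anticlust): the negated sum of
# pairwise distances between the group means. Maximizing it pulls the
# group means together, similar in spirit to the k-means criterion.
mean_similarity <- function(data, clusters) {
  group_means <- apply(data, 2, function(col) tapply(col, clusters, mean))
  -sum(dist(group_means))
}
anticlustering(iris[, 1:2], K = 3, objective = mean_similarity)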

Usually, however, we would use the following code where we specify the objective as a string[5]:

kmeans_anticlusters <- anticlustering(
  iris[, 1:2],
  K = 3,
  objective = "variance" # k-means criterion, passed as a string
)

Let us compare these equivalent calls with regard to their running time.

start1 <- Sys.time()
kmeans_anticlusters1 <- anticlustering(
  iris[, 1:2],
  K = 3,
  objective = variance_objective # k-means criterion, passed as a function
)
time1 <- difftime(Sys.time(), start1, units = "s")

start2 <- Sys.time()
kmeans_anticlusters2 <- anticlustering(
  iris[, 1:2],
  K = 3,
  objective = "variance" # k-means criterion, passed as a string
)
time2 <- difftime(Sys.time(), start2, units = "s")
c(time1, time2)
## Time differences in secs
## [1] 4.669100046 0.007648706

The running times differ by an astonishing factor of about 610, even though the very same algorithm was performed in both cases. And believe it or not, when I first committed the exchange method to anticlust[6], the slower code (or at least very similar code) was used. In the change logs on July 01, 2019, I noted with regard to the exchange method: “This procedure is repeated for each element; because each possible swap is investigated for each element, the total number of exchanges grows quadratically with input size, rendering the exchange method unsuitable for large N.” By “large” I meant more than a few hundred back then. Now, anticlust is much more performant.
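
To make the quadratic growth concrete, here is a quick back-of-the-envelope computation (my own, using the setup from the example above): with K equal-sized groups, each of the N elements is compared against the N - N/K elements that are currently in a different group, so one pass of the exchange method simulates about N * (N - N/K) exchanges.

N <- c(150, 300, 600) # doubling N roughly quadruples the number of exchanges
K <- 3
N * (N - N / K)
## [1]  15000  60000 240000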

The following posts will highlight the reasons why these stark differences occur. As an outlook, there are several differences in the underlying code when using objective = variance_objective versus objective = "variance". In particular, using objective = variance_objective will call the R function variance_objective() repeatedly during each iteration of the exchange algorithm. When using objective = "variance", a specialized method is called that does not recompute the objective entirely from scratch during each iteration (algorithmic improvement), and this method has been implemented in C (implementation improvement). In the following posts, I will go into more detail regarding the differences between the two anticlustering implementations. In the next part, I will focus on the exchange method and the R code used to implement it, and discuss potential run time improvements that result from investigating the R code.


Last updated: 2023-08-29

References

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301.

Papenberg, M. (2023). K-plus anticlustering: An improved k-means criterion for maximizing between-group similarity. British Journal of Mathematical and Statistical Psychology. Advance online publication. https://doi.org/10.1111/bmsp.12315

Weitz, R., & Lakshminarayanan, S. (1998). An empirical comparison of heuristic methods for creating maximally diverse groups. Journal of the Operational Research Society, 49(6), 635–646. https://doi.org/10.1057/palgrave.jors.2600510


  1. Programmers from other languages are probably even more prone to hearing / sharing this sentiment.↩︎

  2. The same applies to the lapply() family, which does not increase speed as compared to for loops, but has other advantages in some settings.↩︎

  3. The k-means criterion is the sum of the squared deviations of the data points from their cluster centers, which is also called the variance.↩︎

  4. The results of the exchange method can be improved by not stopping after a single iteration through the data set; instead we may repeat the process until no single exchange is able to further improve the anticlustering objective, i.e., until a local maximum is found. This happens if we use anticlustering(..., method = "local-maximum"). This method corresponds to the algorithm “LCW” in Weitz and Lakshminarayanan (1998). Using the local maximum method leads to more exchanges and thus to a longer running time, but also to better results than the default exchange method.↩︎

  5. Four anticlustering objectives are natively supported: “diversity” (which is the default), “variance”, “kplus”, and “dispersion”. For each of these objectives, a fast C implementation is available.↩︎

  6. This was in version 0.2.7; the commit where this version was published seems to be this one. Unfortunately I did not include tags for all early versions. There is a gap between version 0.1.0, which was the very first version on GitHub, and version 0.3.0, which was the current version when I submitted the first anticlustering paper.↩︎