anticlust 0.8.6: A quick overview

A new anticlust version was released two days ago. This one will probably not go to CRAN, so for now we can install it via R Universe:

install.packages('anticlust', repos = c('https://m-py.r-universe.dev', 'https://cloud.r-project.org'))

or via Github:

library("remotes") # if not available: install.packages("remotes")
install_github("m-Py/anticlust")

I think I will add some tweaks to the documentation and do some additional testing on other platforms before submitting the following version to CRAN.

However, it is quite a big update, and this post gives an overview of some important changes. Not everything is covered in depth here, because some of the changes will be explained in more detail in an upcoming preprint (especially the changes to the function bicriterion_anticlustering()).

A new objective: average diversity

The diversity is the default objective in anticlust, and it is very popular. It is defined as the overall sum of all pairwise intra-cluster distances (by default using the Euclidean distance). It has the disadvantage that it does not represent between-group similarity when not all clusters are of the same size:

cl <- anticlustering(
  iris[, 1:4],
  K = c(100, 25, 25),
  objective = "diversity",
  method = "local-maximum"
)

mean_sd_tab(iris[, 1:4], cl)
##   Sepal.Length  Sepal.Width   Petal.Length  Petal.Width  
## 1 "5.84 (0.96)" "3.10 (0.48)" "3.66 (2.02)" "1.17 (0.87)"
## 2 "5.86 (0.49)" "2.97 (0.29)" "3.96 (1.13)" "1.26 (0.48)"
## 3 "5.86 (0.50)" "2.96 (0.32)" "3.95 (1.10)" "1.25 (0.46)"

The standard deviation (i.e., the spread of the data) is much larger in the largest group than in the other groups. The k-plus objective, for example, does not have this problem:

cl <- anticlustering(
  iris[, 1:4],
  K = c(100, 25, 25),
  objective = "kplus",
  method = "local-maximum"
)

mean_sd_tab(iris[, 1:4], cl)
##   Sepal.Length  Sepal.Width   Petal.Length  Petal.Width  
## 1 "5.84 (0.83)" "3.06 (0.44)" "3.76 (1.77)" "1.20 (0.76)"
## 2 "5.84 (0.84)" "3.06 (0.44)" "3.74 (1.80)" "1.20 (0.78)"
## 3 "5.85 (0.85)" "3.06 (0.44)" "3.76 (1.79)" "1.20 (0.78)"

We can now also use the objective “average-diversity”, which normalizes the sum of within-cluster distances by cluster size. It better represents between-group similarity for unequal-sized groups:

cl <- anticlustering(
  iris[, 1:4],
  K = c(100, 25, 25),
  objective = "average-diversity",
  method = "local-maximum"
)

mean_sd_tab(iris[, 1:4], cl)
##   Sepal.Length  Sepal.Width   Petal.Length  Petal.Width  
## 1 "5.84 (0.84)" "3.06 (0.45)" "3.76 (1.77)" "1.20 (0.76)"
## 2 "5.83 (0.81)" "3.05 (0.43)" "3.77 (1.78)" "1.20 (0.78)"
## 3 "5.85 (0.84)" "3.06 (0.40)" "3.76 (1.81)" "1.20 (0.77)"

For equal-sized groups (which is the default), the objectives diversity and average diversity are equivalent.
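
This equivalence is easy to verify by hand. The following base R sketch computes both objectives for a random equal-sized partition, taking "average diversity" to mean each cluster's sum of pairwise distances divided by its size (as described above). With a common group size n, the average diversity is just the diversity divided by n, so both objectives rank partitions identically:

# Compute diversity and average diversity by hand for one partition
distances <- as.matrix(dist(iris[, 1:4]))
cl <- sample(rep(1:3, each = 50)) # a random equal-sized partition

within_sums <- sapply(1:3, function(k) {
  members <- which(cl == k)
  sum(distances[members, members]) / 2 # each pair counted once
})

sum(within_sums)      # diversity
sum(within_sums / 50) # average diversity = diversity / 50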

Improved speed for objective = “diversity”

I implemented some changes in the C code underlying the algorithm that optimizes the diversity (and average diversity) objective when using method = "local-maximum" and repetitions. In particular, the restart algorithm is now entirely implemented in C and no longer calls method = "exchange" repeatedly from R. I used these specifications to compare the implementations in versions 0.8.5 and 0.8.6:

# This code generated the data:
# N <- 100
# M <- 5
# data <- matrix(rnorm(N * M), ncol = M)
# write.csv(data, "test_data.csv", row.names = FALSE)


# This code was called with the two versions:
library(anticlust)
data <- read.csv("test_data.csv")
K <- 5
start <- Sys.time()
cl <- anticlustering(
  data, 
  K = K,
  objective = "diversity", 
  method = "local-maximum", 
  repetitions = 100
)
Sys.time() - start

The new implementation needed 0.50 seconds on my personal computer, whereas the old implementation needed 0.90 seconds, which really is a substantial speedup (and indicates that a similar change in implementation for the k-means objective would also be desirable).
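
For context, the repetitions argument conceptually corresponds to a restart loop like the one sketched below (this is only an illustration, not the actual 0.8.5 code); in 0.8.6, this loop runs inside the C implementation instead of repeatedly calling back into R:

# Conceptual sketch of an R-level restart scheme: run the local search
# several times and keep the partition with the highest diversity
best_objective <- -Inf
best_clusters <- NULL
for (i in 1:100) {
  candidate <- anticlustering(
    data,
    K = K,
    objective = "diversity",
    method = "local-maximum"
  )
  value <- diversity_objective(data, candidate)
  if (value > best_objective) {
    best_objective <- value
    best_clusters <- candidate
  }
}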

lpSolve as new (and default) solver for ILP methods

anticlust now depends on the R package lpSolve, which it uses by default to solve the optimal anticlustering methods. These are the methods that can be called via optimal_anticlustering(), optimal_dispersion(), and anticlustering(..., method = "ilp"). Previously, anticlust only supported the solvers GLPK (via the R package Rglpk) and Symphony (via the R package Rsymphony), which were only “suggested” dependencies. This means that users had to install these packages themselves; they were not installed together with anticlust. A potential problem with these packages is that they require the additional installation of a “system dependency”: users had to install the solver library from external websites (or system package repositories), not through the R environment. I am not sure how many “average users” would actually do this.
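
As a small illustration (a sketch only; the exact ILP methods are only feasible for rather small data sets, so this uses a subset of the iris data):

library(anticlust)
# Optimal (ILP-based) anticlustering now works without installing a solver
# separately, because lpSolve is installed together with anticlust
small <- iris[1:12, 1:4]
cl <- anticlustering(small, K = 3, method = "ilp", objective = "diversity")
table(cl) # three groups of four elements each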

The advantage of the lpSolve package compared to the other two is that it has no system dependencies: the solver library “lpsolve” is included in the source code of the R package. So, for now, everyone at least has the possibility to apply the optimal methods (even though the Symphony solver really is better for maximizing the “dispersion” than the lpSolve solver).

Note that when we use the argument cannot_link in anticlustering(), we basically solve the same problem (although possibly a less difficult version of it) as optimal_dispersion(), so this argument also requires a solver package. This is also the main reason why anticlust now depends on lpSolve: everyone can now use the argument cannot_link. However, when cannot_link is used, anticlustering() will by default select the Symphony solver if the package Rsymphony is found, because it is much better at this particular problem; it is not possible for the user to explicitly specify the solver here. If Rsymphony is not available, anticlustering() falls back to lpSolve, which is always present when anticlust is installed.
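
A minimal sketch of the cannot_link argument (assuming it takes a two-column matrix of element indices, where each row lists two elements that must not end up in the same group):

# Each row of the matrix lists two elements that must not share a group
constraints <- matrix(c(1, 2,
                        1, 3), ncol = 2, byrow = TRUE)

cl <- anticlustering(
  iris[, 1:4],
  K = 3,
  cannot_link = constraints
)
cl[1:3] # element 1 is in a different group than elements 2 and 3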


Last updated: 2024-07-18