A new anticlust version was released two days ago. This one will probably not go to CRAN, so for now we can install it via R Universe:
install.packages('anticlust', repos = c('https://m-py.r-universe.dev', 'https://cloud.r-project.org'))
or via GitHub:
library("remotes") # if not available: install.packages("remotes")
install_github("m-Py/anticlust")
I think I will add some tweaks to the documentation and do some additional testing on other platforms before submitting the next version to CRAN.
However, it is quite a big update, and this post gives an overview of some important changes. Not everything is covered in detail here, because some of the changes will be explained more thoroughly in an upcoming preprint (especially the changes to the function bicriterion_anticlustering()).
This version allows specifying that pairs of items must not be part of the same (anti)cluster. The most straightforward way to request this feature is the argument cannot_link. It is a 2-column matrix, where each row contains the indices of two elements that must not be part of the same group. For example, this is how I prevent the first flower in the iris data set from being part of the same cluster as flowers 2 to 20:
library(anticlust)
# Use cannot_link constraints: Element 1 must not be linked with elements 2 to 20:
cl_matrix <- matrix(c(rep(1, 19), 2:20), ncol = 2)
cl_matrix
## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 1 4
## [4,] 1 5
## [5,] 1 6
## [6,] 1 7
## [7,] 1 8
## [8,] 1 9
## [9,] 1 10
## [10,] 1 11
## [11,] 1 12
## [12,] 1 13
## [13,] 1 14
## [14,] 1 15
## [15,] 1 16
## [16,] 1 17
## [17,] 1 18
## [18,] 1 19
## [19,] 1 20
cl <- anticlustering(
iris[, 1:4],
K = 3,
cannot_link = cl_matrix,
objective = "kplus",
method = "local-maximum"
)
cl[1:20]
## [1] 3 2 2 2 1 1 2 2 1 1 1 1 2 1 1 2 2 2 2 1
all(cl[1] != cl[2:20])
## [1] TRUE
mean_sd_tab(iris[, 1:4], cl)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 "5.84 (0.83)" "3.06 (0.44)" "3.75 (1.77)" "1.20 (0.77)"
## 2 "5.84 (0.83)" "3.06 (0.44)" "3.76 (1.78)" "1.20 (0.77)"
## 3 "5.84 (0.84)" "3.06 (0.44)" "3.76 (1.78)" "1.19 (0.77)"
Using such constraints may decrease the quality of the solution, but here the three groups are still very similar.
One possibility is to use the maximum dispersion to derive cannot-link constraints. That means: we prevent the most similar items from being part of the same cluster. On the iris data set, this works as follows:
opt <- optimal_dispersion(
iris[, 1:4],
K = 3
)
The output of optimal_dispersion() is a list with several elements, one of which is "edges":
opt$edges
## [,1] [,2]
## [1,] 102 143
## [2,] 8 40
## [3,] 1 18
## [4,] 10 35
## [5,] 129 133
## [6,] 11 49
## [7,] 18 41
## [8,] 5 38
## [9,] 20 22
## [10,] 1 5
## [11,] 30 31
## [12,] 58 94
## [13,] 81 82
## [14,] 117 138
## [15,] 9 39
## [16,] 20 47
## [17,] 1 40
## [18,] 2 35
## [19,] 4 48
## [20,] 8 50
## [21,] 28 29
## [22,] 83 93
## [23,] 96 97
## [24,] 128 139
## [25,] 3 48
## [26,] 2 46
## [27,] 2 13
Each row of this matrix contains the indices of two plants that must not be part of the same (anti)cluster to ensure that the most similar plants are not in the same group. So we can use this output directly as input for the argument cannot_link:
cl <- anticlustering(
iris[, 1:4],
K = 3,
objective = "kplus",
method = "local-maximum",
cannot_link = opt$edges
)
mean_sd_tab(iris[, 1:4], cl)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 "5.84 (0.83)" "3.06 (0.44)" "3.76 (1.78)" "1.20 (0.77)"
## 2 "5.84 (0.83)" "3.06 (0.44)" "3.75 (1.78)" "1.19 (0.77)"
## 3 "5.85 (0.83)" "3.06 (0.44)" "3.76 (1.78)" "1.20 (0.77)"
dispersion_objective(iris[, 1:4], cl)
## [1] 0.1414214
This way, we obtain the maximum possible dispersion (the minimum distance between any two plants in the same cluster) while also maximizing between-group similarity (here, using the k-plus objective).
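As a quick check (assuming, as documented, that the return list of optimal_dispersion() also contains the optimal value as opt$dispersion), we can verify that the constrained solution attains the optimum:
# Hedged check: opt$dispersion is assumed to hold the optimal dispersion value
all.equal(dispersion_objective(iris[, 1:4], cl), opt$dispersion)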
The diversity is the default objective in anticlust, and it is very popular. It is defined as the overall sum of all pairwise intra-cluster distances (by default using the Euclidean distance). Its disadvantage is that it does not adequately represent between-group similarity when the clusters are not all of the same size:
cl <- anticlustering(
iris[, 1:4],
K = c(100, 25, 25),
objective = "diversity",
method = "local-maximum"
)
mean_sd_tab(iris[, 1:4], cl)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 "5.84 (0.96)" "3.10 (0.48)" "3.66 (2.02)" "1.17 (0.87)"
## 2 "5.86 (0.49)" "2.97 (0.29)" "3.96 (1.13)" "1.26 (0.48)"
## 3 "5.86 (0.50)" "2.96 (0.32)" "3.95 (1.10)" "1.25 (0.46)"
The standard deviation (i.e., the spread of the data) is much larger in the largest group than in the other groups. The k-plus objective, for example, does not have this problem:
cl <- anticlustering(
iris[, 1:4],
K = c(100, 25, 25),
objective = "kplus",
method = "local-maximum"
)
mean_sd_tab(iris[, 1:4], cl)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 "5.84 (0.83)" "3.06 (0.44)" "3.76 (1.77)" "1.20 (0.76)"
## 2 "5.84 (0.84)" "3.06 (0.44)" "3.74 (1.80)" "1.20 (0.78)"
## 3 "5.85 (0.85)" "3.06 (0.44)" "3.76 (1.79)" "1.20 (0.78)"
We can now also use the objective “average-diversity”, which normalizes the sum of within-cluster distances by cluster size. It better represents between-group similarity for unequal-sized groups:
cl <- anticlustering(
iris[, 1:4],
K = c(100, 25, 25),
objective = "average-diversity",
method = "local-maximum"
)
mean_sd_tab(iris[, 1:4], cl)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 "5.84 (0.84)" "3.06 (0.45)" "3.76 (1.77)" "1.20 (0.76)"
## 2 "5.83 (0.81)" "3.05 (0.43)" "3.77 (1.78)" "1.20 (0.78)"
## 3 "5.85 (0.84)" "3.06 (0.40)" "3.76 (1.81)" "1.20 (0.77)"
For equal-sized groups (which is the default), the objectives diversity and average diversity are equivalent.
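This equivalence is easy to see: with a common group size, the average diversity is just the diversity divided by that group size, so both objectives rank all partitions identically. A minimal sketch on iris (the by-hand computation below is not part of the package):
# Random equal-sized partition of the 150 iris plants into 3 groups
groups <- sample(rep(1:3, each = 50))
# Diversity: sum of all pairwise within-cluster (Euclidean) distances
div <- diversity_objective(iris[, 1:4], groups)
# Average diversity, computed by hand: within-cluster distance sums divided by group size
avg_div <- sum(sapply(1:3, function(k) sum(dist(iris[groups == k, 1:4])) / 50))
all.equal(div / 50, avg_div)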
I implemented some changes in the C code underlying the algorithm that optimizes the diversity (and average diversity) objective when using method = "local-maximum" together with repetitions. In particular, the restart algorithm is now entirely implemented in C and no longer calls method = "exchange" repeatedly from R. I used the following specifications to compare the implementations in versions 0.8.5 and 0.8.6:
# This code generated the data:
# N <- 100
# M <- 5
# data <- matrix(rnorm(N * M), ncol = M)
# write.csv(data, "test_data.csv", row.names = FALSE)
# This code was called with the two versions:
library(anticlust)
data <- read.csv("test_data.csv")
K <- 5
start <- Sys.time()
cl <- anticlustering(
data,
K = K,
objective = "diversity",
method = "local-maximum",
repetitions = 100
)
Sys.time() - start
On my personal computer, the new implementation needed 0.5 seconds and the old implementation needed 0.90 seconds, which is a substantial speedup (and indicates that a similar change in implementation would also be desirable for the k-means objective).
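For intuition, here is a conceptual sketch (not the actual package internals) of what the R-level restart logic roughly did before: repeatedly apply the exchange method, feeding its result back in as the starting partition, until the diversity no longer improves.
# Conceptual sketch only; the real restart algorithm now lives in C
local_maximum_diversity <- function(data, K) {
  current <- anticlustering(data, K = K, objective = "diversity", method = "exchange")
  repeat {
    # Restart the exchange method from the current partition
    candidate <- anticlustering(data, K = current, objective = "diversity", method = "exchange")
    if (diversity_objective(data, candidate) <= diversity_objective(data, current)) break
    current <- candidate
  }
  current
}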
lpSolve as new (and default) solver for ILP methods

anticlust now depends on the R package lpSolve, which it uses by default to solve the optimal anticlustering methods. These are the methods that can be called via optimal_anticlustering(), optimal_dispersion(), and anticlustering(..., method = "ilp"). Previously, anticlust only supported the solvers GLPK (via the R package Rglpk) and Symphony (via the R package Rsymphony), which, however, were only "suggested" dependencies. This means that users had to install these packages themselves; they were not installed together with anticlust. A potential problem with these packages is that they require the additional installation of a "system dependency": users have to obtain the solver library from an external website (or their system's package repository) rather than through R. I am not sure how many "average users" would actually do this.
The advantage of the lpSolve package compared to the other two is that it has no system dependency: the solver library "lpsolve" is included in the source code of the R package. So, for now, everyone at least has the possibility to apply the optimal methods (even though the Symphony solver really is better for maximizing the dispersion than the lpSolve solver).
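For example, the optimal methods now run right after installing anticlust, with no extra setup. A small hedged sketch (using a tiny subset of iris, because the optimal ILP methods do not scale to large data sets; the call assumes the optimal_anticlustering() interface with data, K, and objective):
# Optimal (ILP-based) anticlustering on a small subset, solved via the bundled lpSolve
small <- iris[1:12, 1:4]
cl_opt <- optimal_anticlustering(small, K = 3, objective = "diversity")
table(cl_opt)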
Note that when we use the argument cannot_link in anticlustering(), we basically solve the same problem (although possibly a less difficult version of it) as optimal_dispersion(), so this argument also requires a solver package. This is the main reason why anticlust now depends on lpSolve: everyone can now use the argument cannot_link. However, when using cannot_link, anticlustering() will by default select the Symphony solver if the package Rsymphony is found, because it is much better at this problem; it is not possible for the user to explicitly specify the solver here. If Rsymphony is not available, anticlustering() falls back to lpSolve, which is always present when anticlust is installed.
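A quick way to check which solver will be picked for cannot_link is to test whether Rsymphony is installed:
# TRUE means anticlustering(..., cannot_link = ...) will use the Symphony solver;
# FALSE means it falls back to the bundled lpSolve solver.
requireNamespace("Rsymphony", quietly = TRUE)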
Last updated: 2024-07-18