Maximize dispersion for K groups
optimal_dispersion(
x,
K,
solver = NULL,
max_dispersion_considered = NULL,
min_dispersion_considered = NULL,
npartitions = 1,
time_limit = NULL
)
The data input. Can be one of two structures: (1) A
feature matrix where rows correspond to elements and columns
correspond to variables (a single numeric variable can be
passed as a vector). (2) An N x N matrix dissimilarity matrix;
can be an object of class dist
(e.g., returned by
dist
or as.dist
) or a matrix
where the entries of the upper and lower triangular matrix
represent pairwise dissimilarities.
The number of groups or a vector describing the size of each group.
Optional argument; currently supports "lpSolve",
"glpk", and "symphony". See optimal_anticlustering
.
Optional argument used for early stopping. If the dispersion found is equal to or exceeds this value, a solution having the previous best dispersion is returned.
Optional argument used for speeding up the algorithm computation. If passed, the dispersion is optimized starting from this value instead the global minimum distance.
The number of groupings that are returned, each having an optimal dispersion value (defaults to 1).
Time limit in seconds, given to the solver. Default is there is no time limit.
A list with four elements:
The optimal dispersion
An assignment of elements to groups (vector)
A matrix of 2 columns. Each row contains the indices of elements that had to be investigated to find the dispersion (i.e., each pair of elements cannot be part of the same group in order to achieve maximum dispersion).
All distances that were tested until the dispersion was found.
The dispersion is the minimum distance between two elements
within the same group. This function implements an optimal method
to maximize the dispersion. If the data input x
is a
feature matrix and not a dissimilarity matrix, the pairwise
Euclidean distance is used. It uses the algorithm presented in
Max Diekhoff's Bachelor thesis at the Computer Science Department
at Heinrich Heine University Düsseldorf.
To find out which items are not allowed to be grouped in the same
cluster for maximum dispersion, the algorithm sequentially builds
instances of a graph coloring problem, using an integer linear
programming (ILP) representation (also see Fernandez et al.,
2013). It is possible to specify the ILP solver via the argument
solver
(See optimal_anticlustering
for more
information on this argument). Optimally solving the maximum
dispersion problem is NP-hard for K > 2 and therefore
computationally infeasible for larger data sets. For K = 3 and K
= 4, it seems that this approach scales up to several 100
elements, or even > 1000 for K = 3 (at least when using the
Symphony solver). For larger data sets, use the heuristic
approaches in anticlustering
or
bicriterion_anticlustering
. However, note that for
K = 2, the optimal approach is usually much faster than the
heuristics.
In the output, the element edges
defines which elements
must be in separate clusters in order to achieve maximum
dispersion. All elements not listed here can be changed
arbitrarily between clusters without reducing the dispersion. If
the maximum possible dispersion corresponds to the minimum
dispersion in the data set, the output elements edges
and
groups
are set to NULL
because all possible
groupings have the same value of dispersion. In this case the
output element dispersions_considered
has length 1.
If a time_limit
is set and the function cannot find in the optimal
dispersion in the given time, it will throw an error.
If the SYMPHONY solver is used, an unfortunate "message" is printed to the console when this function terminates:
sym_get_col_solution(): No solution has been stored!
This message is no reason to worry and instead is a direct result
of the algorithm finding the optimal value for the dispersion.
Unfortunately, this message is generated in the C code underlying
the SYMPHONY library (via the printing function printf
),
which cannot be prevented in R.
Diekhoff (2023). Maximizing dispersion for anticlustering. Retrieved from https://www.cs.hhu.de/fileadmin/redaktion/Fakultaeten/Mathematisch-Naturwissenschaftliche_Fakultaet/Informatik/Algorithmische_Bioinformatik/Bachelor-_Masterarbeiten/2831963_ba_ifo_AbschlArbeit_klau_mapap102_madie120_20230203_1815.pdf
Fernández, E., Kalcsics, J., & Nickel, S. (2013). The maximum dispersion problem. Omega, 41(4), 721–730. https://doi.org/10.1016/j.omega.2012.09.005
N <- 30
M <- 5
K <- 3
data <- matrix(rnorm(N*M), ncol = M)
distances <- dist(data)
opt <- optimal_dispersion(distances, K = K)
opt
#> $dispersion
#> [1] 1.646841
#>
#> $groups
#> [1] 3 3 3 2 1 1 2 1 3 3 3 3 2 1 2 2 3 2 2 1 2 2 3 2 1 1 3 1 1 1
#>
#> $groups_fixated
#> [1] NA 3 3 NA 1 1 2 1 NA NA 3 3 2 1 2 2 NA 2 2 1 2 NA 3 NA 1
#> [26] 1 3 1 1 1
#>
#> $edges
#> [,1] [,2]
#> [1,] 20 23
#> [2,] 11 19
#> [3,] 8 11
#> [4,] 5 13
#> [5,] 2 6
#> [6,] 12 29
#> [7,] 6 18
#> [8,] 5 18
#> [9,] 3 26
#> [10,] 2 8
#> [11,] 3 15
#> [12,] 5 19
#> [13,] 18 27
#> [14,] 5 27
#> [15,] 5 16
#> [16,] 7 25
#> [17,] 2 18
#> [18,] 14 16
#> [19,] 21 28
#> [20,] 15 26
#> [21,] 16 27
#> [22,] 5 11
#> [23,] 6 15
#> [24,] 8 19
#> [25,] 23 26
#> [26,] 21 23
#> [27,] 11 29
#> [28,] 19 29
#> [29,] 13 27
#> [30,] 19 23
#> [31,] 23 30
#>
#> $dispersions_considered
#> [1] 1.028233 1.075159 1.144307 1.159567 1.182799 1.226687 1.260399 1.338457
#> [9] 1.349471 1.353310 1.367298 1.394381 1.399823 1.404871 1.410823 1.436337
#> [17] 1.466566 1.504632 1.519661 1.522956 1.532941 1.568069 1.572953 1.577198
#> [25] 1.578936 1.580271 1.601445 1.628697 1.628733 1.634875 1.636778 1.646841
#>
# Compare to bicriterion heuristic:
groups_heuristic <- anticlustering(
distances,
K = K,
method = "brusco",
objective = "dispersion",
repetitions = 100
)
c(
OPT = dispersion_objective(distances, opt$groups),
HEURISTIC = dispersion_objective(distances, groups_heuristic)
)
#> OPT HEURISTIC
#> 1.646841 1.646841
# Different group sizes are possible:
table(optimal_dispersion(distances, K = c(15, 10, 5))$groups)
#>
#> 1 2 3
#> 15 10 5
# Induce cannot-link constraints by maximizing the dispersion:
solvable <- matrix(1, ncol = 6, nrow = 6)
solvable[2, 1] <- -1
solvable[3, 1] <- -1
solvable[4, 1] <- -1
solvable <- as.dist(solvable)
solvable
#> 1 2 3 4 5
#> 2 -1
#> 3 -1 1
#> 4 -1 1 1
#> 5 1 1 1 1
#> 6 1 1 1 1 1
# An optimal solution has to put item 1 in a different group than
# items 2, 3 and 4 -> this is possible for K = 2
optimal_dispersion(solvable, K = 2)$groups
#> [1] 2 1 1 1 2 2
# It no longer works when item 1 can also not be linked with item 5:
# (check out output!)
unsolvable <- as.matrix(solvable)
unsolvable[5, 1] <- -1
unsolvable <- as.dist(unsolvable)
unsolvable
#> 1 2 3 4 5
#> 2 -1
#> 3 -1 1
#> 4 -1 1 1
#> 5 -1 1 1 1
#> 6 1 1 1 1 1
optimal_dispersion(unsolvable, K = 2)
#> $dispersion
#> [1] -1
#>
#> $groups
#> NULL
#>
#> $edges
#> NULL
#>
#> $dispersions_considered
#> [1] -1
#>
# But:
optimal_dispersion(unsolvable, K = c(2, 4)) # group sizes, not number of groups
#> $dispersion
#> [1] 1
#>
#> $groups
#> [1] 2 1 1 1 1 2
#>
#> $groups_fixated
#> [1] 2 1 1 1 1 NA
#>
#> $edges
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 1 3
#> [3,] 1 4
#> [4,] 1 5
#>
#> $dispersions_considered
#> [1] -1 1
#>