R/optimal_anticlustering.R
optimal_anticlustering.Rd
Wrapper function that gives access to all optimal algorithms for anticlustering that are available in anticlust.
optimal_anticlustering(x, K, objective, solver = NULL, time_limit = NULL)
The data input. Can be one of two structures: (1) A feature
matrix where rows correspond to elements and columns correspond
to variables (a single numeric variable can be passed as a
vector). (2) An N x N matrix dissimilarity matrix; can be an
object of class dist
(e.g., returned by
dist
or as.dist
) or a matrix
where the entries of the upper and lower triangular matrix
represent pairwise dissimilarities.
How many anticlusters should be created or alternatively:
(a) A vector describing the size of each group (the latter
currently only works for objective = "dispersion")
.
The anticlustering objective, can be "diversity", "variance", "kplus" or "dispersion".
Optional. The solver used to obtain the optimal method. Currently supports "glpk", "symphony", and "lpSolve". See details.
Time limit in seconds, given to the solver. Default is there is no time limit.
A vector of length N that assigns a group (i.e, a number
between 1 and K
) to each input element.
This is a wrapper for all optimal methods supported in anticlust
(currently and in the future). As compared to
anticlustering
, it allows to specify the solver to
obtain an optimal solution and it can be used to obtain optimal
solutions for all supported anticlustering objectives (variance,
diversity, k-plus, dispersion). For the objectives "variance",
"diversity" and "kplus", the optimal ILP method in Papenberg and
Klau (2021) is used, which maximizes the sum of all pairwise
intra-cluster distances (given user specified number of clusters,
for equal-sized clusters). To employ k-means anticlustering
(i.e. set objective = "variance"
), the squared Euclidean
distance is used. For k-plus anticlustering, the squared Euclidean
distance based on the extended k-plus data matrix is used (see
kplus_moment_variables
). For the diversity (and the
dispersion), the Euclidean distance is used by default, but any
user-defined dissimilarity matrix is possible.
The dispersion is solved optimal using the approach described in
optimal_dispersion
.
The optimal methods make use of "solvers" that actually implement the algorithm for finding optimal solutions. The package anticlust supports three solvers:
The default solver lpSolve (<https://sourceforge.net/projects/lpsolve/>).
GNU linear programming kit (<http://www.gnu.org/software/glpk/>),
available from the package Rglpk and requested using solver = "glpk"
.
The R package Rglpk has to be installed manually if this solver should be used.
The Symphony solver (<https://github.com/coin-or/SYMPHONY>),
available from the package Rsymphony and requested using solver = "symphony"
.
(The package Rsymphony has to be installed manually if this solver should be used).
For the maximum dispersion problem, it seems that the Symphony solver is fastest, while the lpSolve solver seems to be good for maximum diversity. However, note that in general the dispersion can be solved optimally for much larger data sets than the diversity.
If a time_limit
is set and the function cannot find in the optimal
objective in the given time, it will throw an error.
data <- matrix(rnorm(24), ncol = 2)
# These calls are equivalent for k-means anticlustering:
optimal_anticlustering(data, K = 2, objective = "variance")
#> [1] 1 2 1 2 2 2 1 2 1 2 1 1
optimal_anticlustering(dist(data)^2, K = 2, objective = "diversity")
#> [1] 1 2 1 2 2 2 1 2 1 2 1 1
# These calls are equivalent for k-plus anticlustering:
optimal_anticlustering(data, K = 2, objective = "kplus")
#> [1] 1 2 2 1 2 1 2 2 1 1 1 2
optimal_anticlustering(dist(kplus_moment_variables(data, 2))^2, K = 2, objective = "diversity")
#> [1] 1 2 2 1 2 1 2 2 1 1 1 2