Wrapper function that gives access to all optimal algorithms for anticlustering that are available in anticlust.

optimal_anticlustering(x, K, objective, solver = NULL, time_limit = NULL)

Arguments

x

The data input. Can be one of two structures: (1) A feature matrix where rows correspond to elements and columns correspond to variables (a single numeric variable can be passed as a vector). (2) An N x N matrix dissimilarity matrix; can be an object of class dist (e.g., returned by dist or as.dist) or a matrix where the entries of the upper and lower triangular matrix represent pairwise dissimilarities.

K

How many anticlusters should be created or alternatively: (a) A vector describing the size of each group (the latter currently only works for objective = "dispersion").

objective

The anticlustering objective, can be "diversity", "variance", "kplus" or "dispersion".

solver

Optional. The solver used to obtain the optimal method. Currently supports "glpk", "symphony", and "lpSolve". See details.

time_limit

Time limit in seconds, given to the solver. Default is there is no time limit.

Value

A vector of length N that assigns a group (i.e, a number between 1 and K) to each input element.

Details

This is a wrapper for all optimal methods supported in anticlust (currently and in the future). As compared to anticlustering, it allows to specify the solver to obtain an optimal solution and it can be used to obtain optimal solutions for all supported anticlustering objectives (variance, diversity, k-plus, dispersion). For the objectives "variance", "diversity" and "kplus", the optimal ILP method in Papenberg and Klau (2021) is used, which maximizes the sum of all pairwise intra-cluster distances (given user specified number of clusters, for equal-sized clusters). To employ k-means anticlustering (i.e. set objective = "variance"), the squared Euclidean distance is used. For k-plus anticlustering, the squared Euclidean distance based on the extended k-plus data matrix is used (see kplus_moment_variables). For the diversity (and the dispersion), the Euclidean distance is used by default, but any user-defined dissimilarity matrix is possible.

The dispersion is solved optimal using the approach described in optimal_dispersion.

The optimal methods make use of "solvers" that actually implement the algorithm for finding optimal solutions. The package anticlust supports three solvers:

  • The default solver lpSolve (<https://sourceforge.net/projects/lpsolve/>).

  • GNU linear programming kit (<http://www.gnu.org/software/glpk/>), available from the package Rglpk and requested using solver = "glpk". The R package Rglpk has to be installed manually if this solver should be used.

  • The Symphony solver (<https://github.com/coin-or/SYMPHONY>), available from the package Rsymphony and requested using solver = "symphony". (The package Rsymphony has to be installed manually if this solver should be used).

For the maximum dispersion problem, it seems that the Symphony solver is fastest, while the lpSolve solver seems to be good for maximum diversity. However, note that in general the dispersion can be solved optimally for much larger data sets than the diversity.

If a time_limit is set and the function cannot find in the optimal objective in the given time, it will throw an error.

Author

Martin Papenberg martin.papenberg@hhu.de

Examples


data <- matrix(rnorm(24), ncol = 2)

# These calls are equivalent for k-means anticlustering:
optimal_anticlustering(data, K = 2, objective = "variance")
#>  [1] 1 1 2 2 1 1 2 1 2 1 2 2
optimal_anticlustering(dist(data)^2, K = 2, objective = "diversity")
#>  [1] 1 1 2 2 1 1 2 1 2 1 2 2

# These calls are equivalent for k-plus anticlustering:
optimal_anticlustering(data, K = 2, objective = "kplus")
#>  [1] 1 2 2 2 2 1 2 1 2 1 1 1
optimal_anticlustering(dist(kplus_moment_variables(data, 2))^2, K = 2, objective = "diversity")
#>  [1] 1 2 2 2 2 1 2 1 2 1 1 1