Anticlustering — anticlustering • anticlust

Partition a pool of elements into groups (i.e., anticlusters) with the aim of creating high within-group heterogeneity and high between-group similarity. Anticlustering is accomplished by maximizing instead of minimizing a clustering objective function. Implements anticlustering methods as described in Papenberg and Klau (2021; <doi:10.1037/met0000301>), Brusco et al. (2020; <doi:10.1111/bmsp.12186>), Papenberg (2024; <doi:10.1111/bmsp.12315>), and Papenberg et al. (2025; <doi:10.1101/2025.03.03.641320>).

anticlustering(
  x,
  K,
  objective = "diversity",
  method = "exchange",
  preclustering = FALSE,
  categories = NULL,
  repetitions = NULL,
  standardize = FALSE,
  cannot_link = NULL,
  must_link = NULL
)

Arguments

x: The data input. Can be one of two structures: (1) A feature matrix where rows correspond to elements and columns correspond to variables (a single numeric variable can be passed as a vector). (2) An N x N matrix dissimilarity matrix; can be an object of class dist (e.g., returned by dist or as.dist) or a matrix where the entries of the upper and lower triangular matrix represent pairwise dissimilarities.
K: How many anticlusters should be created. Alternatively: (a) A vector describing the size of each group, or (b) a vector of length nrow(x) describing how elements are assigned to anticlusters before the optimization starts.
objective: The objective to be maximized. The options "diversity" (default; previously called "distance", which is still supported), "average-diversity", "variance", "kplus" and "dispersion" are natively supported. May also be a user-defined function. See Details.
method: One of "exchange" (default) , "local-maximum", "brusco", "ilp", or "2PML". See Details.
preclustering: Boolean. Should a preclustering be conducted before anticlusters are created? Defaults to FALSE. See Details.
categories: A vector, data.frame or matrix representing one or several categorical variables whose distribution should be similar between groups. See Details.
repetitions: The number of times a search heuristic is initiated when using method = "exchange", method = "local-maximum", method = "brusco", or method = "2PML". In the end, the best objective found across the repetitions is returned.
standardize: Boolean. If TRUE and x is a feature matrix, the data is standardized through a call to scale before the optimization starts. This argument is silently ignored if x is a distance matrix.
cannot_link: A 2 column matrix where each row has the indices of two elements that must not be assigned to the same anticluster.
must_link: A numeric vector of length nrow(x). Elements having the same value in this vector are assigned to the same anticluster.

Value

A vector of length N that assigns a group (i.e, a number between 1 and K) to each input element.

Details

This function is used to solve anticlustering. That is, the data input is divided into K groups in such a way that elements within groups are heterogeneous and the different groups are similar. Anticlustering is accomplished by maximizing instead of minimizing a clustering objective function. The maximization of five objectives is natively supported (other functions can also defined by the user as described below):

the diversity, setting objective = "diversity" (this is the default objective)
the average diversity, which normalizes the diversity by cluster size, setting objective = "average-diversity"
the k-means (or "variance") objective, setting objective = "variance"
the k-plus objective, an extension of the k-means objective, setting objective = "kplus"
the dispersion, which is the minimum distance between any two elements within the same cluster (setting objective = "dispersion")

The k-means objective is the within-group variance—that is, the sum of the squared distances between each element and its cluster center (see variance_objective). K-means anticlustering focuses on minimizing differences with regard to the means of the input variables (that is, the columns in x), but it ignores any other distribution characterstics such as the variance / standard deviation. K-plus anticlustering (using objective = "kplus") is an extension of the k-means criterion that also minimizes differences with regard to the standard deviations between groups (for details see kplus_anticlustering). K-plus anticlustering can also be extended towards higher order moments such as skew and kurtosis; to consider these additional distribution characteristics, use the function kplus_anticlustering. Setting objective = "kplus" in anticlustering function will only consider means and standard deviations (in my experience, this is what users usually want). It is strongly recommended to set the argument standardize = TRUE when using the k-plus objective.

The "diversity" objective is the sum of pairwise distances of elements within the same groups (see diversity_objective). Hence, anticlustering using the diversity criterion maximizes between-group similarity by maximizing within-group heterogeneity (represented as the sum of all pairwise distances). If it is computed on the basis of the Euclidean distance (which is the default behaviour if x is a feature matrix), the diversity is an all rounder objective that tends to equate all distribution characteristics between groups (such as means, variances, ...). Note that the equivalence of within-group heterogeneity and between-group similarity only holds for equal-sized groups. For unequal-sized groups, it is recommended to use a different objective when striving for overall between-group similarity, e.g., the k-plus objective or the "average-diversity". The average diversity was introduced in version 0.8.6, and it is more useful if groups are not equal-sized. The average diversity normalizes the sum of intra-cluster distances by group size. If all groups are equal-sized, it is equivalent to the regular diversity. In the publication that introduces the anticlust package (Papenberg & Klau, 2021), we used the term "anticluster editing" to refer to the maximization of the diversity, because the reversed procedure - minimizing the diversity - is also known as "cluster editing".

The "dispersion" is the minimum distance between any two elements that are part of the same cluster; maximization of this objective ensures that any two elements within the same group are as dissimilar from each other as possible. Applications that require high within-group heterogeneity often require to maximize the dispersion. Oftentimes, it is useful to also consider the diversity and not only the dispersion; to optimize both objectives at the same time, see the function bicriterion_anticlustering.

If the data input x is a feature matrix (that is: each row is a "case" and each column is a "variable") and the option objective = "diversity" or objective = "dispersion" is used, the Euclidean distance is computed as the basic unit of the objectives. If a different measure of dissimilarity is preferred, you may pass a self-generated dissimilarity matrix via the argument x.

In the standard case, groups of equal size are generated. Adjust the argument K to create groups of different size (see Examples).

Algorithms for anticlustering

By default, a heuristic method is employed for anticlustering: the exchange method (method = "exchange"). First, elements are randomly assigned to anticlusters (It is also possible to explicitly specify the initial assignment using the argument K; in this case, K has length nrow(x).) Based on the initial assignment, elements are systematically swapped between anticlusters in such a way that each swap improves the objective value. For an element, each possible swap with elements in other clusters is simulated; then, the one swap is performed that improves the objective the most, but a swap is only conducted if there is an improvement at all. This swapping procedure is repeated for each element. When using method = "local-maximum", the exchange method does not terminate after the first iteration over all elements; instead, the swapping continues until a local maximum is reached. This method corresponds to the algorithm "LCW" by Weitz and Lakshminarayanan (1998). This means that after the exchange process has been conducted once for each data point, the algorithm restarts with the first element and proceeds to conduct exchanges until the objective cannot be improved.

When setting preclustering = TRUE, only the K - 1 most similar elements serve as exchange partners for each element, which can speed up the optimization (more information on the preclustering heuristic follows below). If the categories argument is used, only elements having the same value in categories serve as exchange partners.

Using method = "brusco" implements the local bicriterion iterated local search (BILS) heuristic by Brusco et al. (2020) and returns the partition that best optimized either the diversity or the dispersion during the optimization process. The function bicriterion_anticlustering can also be used to run the algorithm by Brusco et al., but it returns multiple partitions that approximate the optimal pareto efficient set according to both objectives (diversity and dispersion). Thus, to fully utilize the BILS algorithm, use the function bicriterion_anticlustering.

Optimal anticlustering

Usually, heuristics are employed to tackle anticlustering problems, and their performance is generally very satisfying. However, heuristics do not investigate all possible group assignments and therefore do not (necessarily) find the "globally optimal solution", i.e., a partitioning that has the best possible value with regard to the objective that is optimized. Enumerating all possible partitions in order to find the best solution, however, quickly becomes impossible with increasing N, and therefore it is not possible to find a global optimum this way. Because all anticlustering problems considered here are also NP-hard, there is also no (known) clever algorithm that might identify the best solution without considering all possibilities - at least in the worst case. Integer linear programming (ILP) is an approach for tackling NP hard problems that nevertheless tries to be clever when finding optimal solutions: It does not necessarily enumerate all possibilities but is still guaranteed to return the optimal solution. Still, for NP hard problems such as anticlustering, ILP methods will also fail at some point (i.e., when N increases).

anticlust implements optimal solution algorithms via integer linear programming. In order to use the ILP methods, set method = "ilp". The integer linear program optimizing the diversity was described in Papenberg & Klau, (2021; (8) - (12)). It can also be used to optimize the k-means and k-plus objectives, but you actually have to use the function optimal_anticlustering for these objectives. The documentation of the function optimal_dispersion and optimal_anticlustering contain more information on the optimal anticlustering algorithms.

Categorical variables

There are two ways to balance categorical variables among anticlusters (also see the package vignette "Using categorical variables with anticlustering"). The first way is to treat them as "hard constraints" via the argument categories (see Papenberg & Klau, 2021). If done so, balancing the categorical variable is accomplished via categorical_sampling through a stratified split before the anticlustering optimization. After that, the balance is never changed when the algorithm runs (hence, it is a "hard constraint"). When categories has multiple columns (i.e., there are multiple categorical variables), each combination of categories is treated as a distinct category by the exchange method (i.e., the multiple columns are "merged" into a single column). This behaviour may lead to less than optimal results on the level of each single categorical variable. In this case, it may be useful to treat the categorical variables as part of the numeric data, i.e., the first argument x via binary coding (e.g. using categories_to_binary). The examples show how to do this when using the bicriterion algorithm by Bruso et al. Using the argument categories is only available for the classical exchange procedures, that is, for method = "exchange" and method = "local-maximum".

Anticlustering with constraints

Versions 0.8.6 and 0.8.7 of anticlust introduced the possibility to induce cannot-link and must-link constraints with anticlustering with the arguments cannot_link and must_link, respectively. Cannot-link constraints ensure that pairs of items are assigned to different clusters. They are given as a 2-column matrix, where each row has the indices of two elements, which must not be assigned to the same cluster. It is possible that a set of cannot-link constraints cannot be fulfilled. To verify whether the constraints cannot be fulfilled (and to actually assign elements while respecting the constraints), a graph coloring algorithm algorithm is used. This algorithm is is actually the same method as used in optimal_dispersion. The graph coloring algorithm uses an ILP solver and it greatly profits (that is, it may be much faster) from the Rsymphony package, which is not installed as a necessary dependency with anticlust. It is therefore recommended to manually install the Rsymphony package, which is then automatically selected as solver when using the must_link argument. If you have access to the gurobi solver and have the gurobi R package installed, it will be selected as solver (which is even faster than Symphony).

Must-link constraints are passed as a single vector of length nrow(x). Positions that have the same numeric index are assigned to the same anticluster (if the constraints can be fulfilled). When including must-link constraints, method = "2PML" performs a specialized search heuristic that potentially yields better results than method = "local-maximum". The must-link functionality and the 2PML algorithm was introduced in Papenberg et al. (2025).

The examples illustrate the usage of the must_link and cannot_link arguments. Currently, the different kinds of constraints (arguments must_link, cannot_link, and categories) cannot be used together, but this may change in future versions.

Preclustering

A useful heuristic for anticlustering is to form small groups of very similar elements and assign these to different groups. This logic is used as a preprocessing when setting preclustering = TRUE. That is, before the anticlustering objective is optimized, a cluster analysis identifies small groups of similar elements (pairs if K = 2, triplets if K = 3, and so forth). The optimization of the anticlustering objective is then conducted under the constraint that these matched elements cannot be assigned to the same group. When using the exchange algorithm, preclustering is conducted using a call to matching. When using method = "ilp", the preclustering optimally finds groups of minimum pairwise distance by solving the integer linear program described in Papenberg and Klau (2021; (8) - (10), (12) - (13)). Note that when combining preclustering restrictions with method = "ilp", the anticlustering result is no longer guaranteed to be globally optimal, but only optimal given the preclustering restrictions.

Optimize a custom objective function

It is possible to pass a function to the argument objective. See dispersion_objective for an example. If objective is a function, the exchange method assigns elements to anticlusters in such a way that the return value of the custom function is maximized (hence, the function should return larger values when the between-group similarity is higher). The custom function has to take two arguments: the first is the data argument, the second is the clustering assignment. That is, the argument x will be passed down to the user-defined function as first argument. However, only after as.matrix has been called on x. This implies that in the function body, columns of the data set cannot be accessed using data.frame operations such as $. Objects of class dist will be converted to matrix as well.

References

Brusco, M. J., Cradit, J. D., & Steinley, D. (2020). Combining diversity and dispersion criteria for anticlustering: A bicriterion approach. British Journal of Mathematical and Statistical Psychology, 73, 275-396. https://doi.org/10.1111/bmsp.12186

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301.

Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. British Journal of Mathematical and Statistical Psychology, 77(1), 80-102. https://doi.org/10.1111/bmsp.12315

Papenberg, M., Wang, C., Diop, M., Bukhari, S. H., Oskotsky, B., Davidson, B. R., Vo, K. C., Liu, B., Irwin, J. C., Combes, A., Gaudilliere, B., Li, J., Stevenson, D. K., Klau, G. W., Giudice, L. C., Sirota, M., & Oskotsky, T. T. (2025). Anticlustering for sample allocation to minimize batch effects. bioRxiv. https://doi.org/10.1101/2025.03.03.641320

Späth, H. (1986). Anticlustering: Maximizing the variance criterion. Control and Cybernetics, 15, 213-218.

Weitz, R. R., & Lakshminarayanan, S. (1998). An empirical comparison of heuristic methods for creating maximally diverse groups. Journal of the Operational Research Society, 49(6), 635-646. https://doi.org/10.1057/palgrave.jors.2600510

Author

Martin Papenberg martin.papenberg@hhu.de

Examples


# Use default method ("exchange") and the default diversity criterion, also include
# a categorical variable via argument `categories`:
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  K = 3,
  categories = schaper2019$room
)
# Compare feature means and standard deviations by anticluster
mean_sd_tab(schaper2019[, 3:6], anticlusters)
#>   rating_consistent rating_inconsistent syllables     frequency     
#> 1 "4.49 (0.25)"     "1.11 (0.06)"       "3.44 (0.91)" "18.31 (2.47)"
#> 2 "4.49 (0.24)"     "1.10 (0.07)"       "3.38 (0.94)" "18.28 (2.37)"
#> 3 "4.50 (0.26)"     "1.10 (0.07)"       "3.44 (0.95)" "18.34 (2.39)"
# Verify that the "room" is balanced across anticlusters:
table(anticlusters, schaper2019$room)
#>             
#> anticlusters bathroom kitchen
#>            1       16      16
#>            2       16      16
#>            3       16      16

# Use multiple starts of the algorithm to improve the objective and
# optimize the k-means criterion ("variance")
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  objective = "variance",
  K = 3,
  categories = schaper2019$room,
  method = "local-maximum", # better search algorithm
  repetitions = 20 # multiple restarts of the algorithm 
)
# Compare means and standard deviations by anticluster
mean_sd_tab(schaper2019[, 3:6], anticlusters)
#>   rating_consistent rating_inconsistent syllables     frequency     
#> 1 "4.49 (0.25)"     "1.10 (0.08)"       "3.41 (0.87)" "18.31 (2.21)"
#> 2 "4.49 (0.23)"     "1.10 (0.06)"       "3.41 (0.84)" "18.31 (2.26)"
#> 3 "4.49 (0.27)"     "1.10 (0.06)"       "3.44 (1.08)" "18.31 (2.73)"

# Use different group sizes and optimize the extended k-means
# criterion ("kplus")
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  objective = "kplus",
  K = c(24, 24, 48),
  categories = schaper2019$room,
  repetitions = 20,
  method = "local-maximum",
  standardize = TRUE # ususally recommended
)

# Use cannot_link constraints: Element 1 must not be linked with elements 2 to 10:
cl_matrix <- matrix(c(rep(1, 9), 2:10), ncol = 2)
cl <- anticlustering(
  schaper2019[, 3:6],
  K = 10,
  cannot_link = cl_matrix
)
all(cl[1] != cl[2:10])
#> [1] TRUE

# Use cannot_link constraints: Element 1 must be linked with elements 2 to 10.
# Element 11 must be linked with elements 12-20.
must_link <- rep(NA, nrow(schaper2019))
must_link[1:10] <- 1
must_link[11:20] <- 2
cl <- anticlustering(
  schaper2019[, 3:6],
  K = 3,
  must_link = must_link
)
cl[1:10]
#>  [1] 1 1 1 1 1 1 1 1 1 1
cl[11:20]
#>  [1] 1 1 1 1 1 1 1 1 1 1

# Use the heuristic by Brusco et al. (2020) for k-plus anticlustering
# Include categorical variable as part of the optimization criterion rather 
# than the argument categories!
anticlusters <- anticlustering(
  cbind(
    kplus_moment_variables(schaper2019[, 3:6], 2), 
    categories_to_binary(schaper2019$room)
  ),
  objective = "variance", # k-plus anticlustering because of the input above!
  K = 3,
  repetitions = 20,
  method = "brusco"
)

mean_sd_tab(schaper2019[, 3:6], anticlusters)
#>   rating_consistent rating_inconsistent syllables     frequency     
#> 1 "4.49 (0.25)"     "1.10 (0.07)"       "3.44 (0.91)" "18.31 (2.42)"
#> 2 "4.50 (0.25)"     "1.10 (0.07)"       "3.41 (0.95)" "18.31 (2.42)"
#> 3 "4.49 (0.25)"     "1.10 (0.07)"       "3.41 (0.95)" "18.31 (2.40)"
table(anticlusters, schaper2019$room)
#>             
#> anticlusters bathroom kitchen
#>            1       16      16
#>            2       16      16
#>            3       16      16