Fast anticlustering

Increasing the speed of (k-means / k-plus) anticlustering by (1) conducting fewer exchanges during the optimization and (2) using an alternative formulation of the k-means objective. Makes anticlustering applicable to quite large data sets.

fast_anticlustering(
  x,
  K,
  k_neighbours = Inf,
  categories = NULL,
  exchange_partners = NULL
)

Arguments

x: A numeric vector, matrix or data.frame of data points. Rows correspond to elements and columns correspond to features. A vector represents a single numeric feature.
K: How many anticlusters should be created. Alternatively: (a) A vector describing the size of each group, or (b) a vector of length nrow(x) describing how elements are assigned to anticlusters before the optimization starts.
k_neighbours: The number of nearest neighbours that serve as exchange partner for each element. See details.
categories: A vector, data.frame or matrix representing one or several categorical constraints.
exchange_partners: Optional argument. A list of length NROW(x) specifying for each element the indices of the elements that serve as exchange partners. If used, this argument overrides the k_neighbours argument. See examples.

Details

This function was created to make anticlustering applicable to large data sets (e.g., several 100,000 elements). It optimizes the k-means objective because computing all pairwise distances as is done when optimizing the "diversity" (i.e., the default in anticlustering) is not feasible for very large data sets (for about N > 20000 on my personal computer). Using fast_anticlustering for k-plus anticlustering is also possible by applying kplus_moment_variables on the input (and possibly by using the argument exchange_partners, see Examples).

The function fast_anticlustering employs a speed-optimized exchange method, which is basically equivalent to method = "exchange" in anticlustering, but may reduce the number of exchanges that are investigated for each input element. The number of exchange partners per element has to be set using the argument k_neighbours. By default, it is set to Inf, meaning that all possible swaps are tested. If k_neighbours is set differently (which is usually recommended when running this function), the default behaviour is to generate exchange partners using a nearest neighbour search (using the function nn2 from the RANN package). Using more exchange partners can improve the quality of the results, but also increase run time. Note that for very large data sets, anticlustering generally becomes "easier" (even a random split may yield satisfactory results), so using few exchange partners is usually not a problem.

It is possible to specify custom exchange partners using the argument exchange_partners instead of relying on the default nearest neighbour search. When using exchange_partners, it is not necessary that each element has the same number of exchange partners; this is why the argument exchange_partners has to be a list instead of a matrix or data.frame. Exchange partners can for example be generated by generate_exchange_partners (see Examples), but a custom list may also be used. Note that categorical constraints induced via categories may not be respected during the optimization if the exchange_partners argument allows exchanges between members of different categories, so care must be taken when combining the arguments exchange_partners and categories.

In anticlustering(..., objective = "variance"), the run time of computing the k-means objective is in O(M N), where N is the total number of elements and M is the number of variables. This is because the variance is computed as the sum of squared distances between all data points and their cluster centers. The function fast_anticlustering uses a different - but equivalent - formulation of the k-means objective, where the re-computation of the objective only depends and M but no longer on N. In particular, this variant of k-means anticlustering minimizes the weighted sum of squared distances between cluster centroids and the overall data centroid; the distances between all individual data points and their cluster center are not computed (Späth, 1986). Using the different objective formulation reduces the run time by an order of magnitude and makes k-means anticlustering applicable to very large data sets (even in the millions). For a fixed number of exchange partners (specified using the argument k_neighbours), the approximate run time of fast_anticlustering is in O(M N). The algorithm method = "exchange" in anticlustering with objective = "variance" has a run time of O(M N^3). Thus, fast_anticlustering can improve the run time by two orders of magnitude as compared to the standard exchange algorithm. The nearest neighbour search, which is done in the beginning usually does not strongly contribute to the overall run time. It is nevertheless possible to suppress the nearest neighbour search by using the exchange_partners argument.

When setting the categories argument, exchange partners (i.e., nearest neighbours) will be generated from the same category. Note that when categories has multiple columns, each combination of these categories is treated as a distinct category by the exchange method. You can also use categories_to_binary to potentially improve results for several categorical variables, instead of using the argument categories.

References

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301.

Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. British Journal of Mathematical and Statistical Psychology, 77(1), 80–102. https://doi.org/10.1111/bmsp.12315

Späth, H. (1986). Anticlustering: Maximizing the variance criterion. Control and Cybernetics, 15, 213-218.

Author

Martin Papenberg martin.papenberg@hhu.de

Examples


## Use fewer or more exchange partners to adjust speed (vs. quality tradeoff)
features <- iris[, - 5]
N <- nrow(features)
init <- sample(rep_len(1:3, N)) # same starting point for all calls:
groups1 <- fast_anticlustering(features, K = init) # default: all exchanges
groups2 <- fast_anticlustering(features, K = init, k_neighbours = 20) 
groups3 <- fast_anticlustering(features, K = init, k_neighbours = 2)

variance_objective(features, groups1)
#> [1] 681.369
variance_objective(features, groups2)
#> [1] 681.3682
variance_objective(features, groups3)
#> [1] 680.5326

# K-plus anticlustering is straight forward when sticking with the default
# for k_neighbours
kplus_anticlusters <- fast_anticlustering(
  kplus_moment_variables(features, T = 2), 
  K = 3
)
mean_sd_tab(features, kplus_anticlusters)
#>   Sepal.Length  Sepal.Width   Petal.Length  Petal.Width  
#> 1 "5.84 (0.84)" "3.06 (0.44)" "3.76 (1.78)" "1.20 (0.77)"
#> 2 "5.84 (0.83)" "3.06 (0.44)" "3.75 (1.78)" "1.21 (0.77)"
#> 3 "5.85 (0.83)" "3.06 (0.44)" "3.76 (1.77)" "1.19 (0.77)"

# Some care is needed when applying k-plus using with this function 
# while using a reduced number of exchange partners generated in the 
# nearest neighbour search. Then we:
# 1) Use kplus_moment_variables() on the numeric input
# 2) Generate custom exchange_partners because otherwise nearest 
#    neighbours are internally selected based on the extended k-plus 
#    variables returned by kplus_moment_variables() 
#    (which does not really make sense)
kplus_anticlusters <- fast_anticlustering(
  kplus_moment_variables(features, T = 2), 
  K = 3,
  exchange_partners = generate_exchange_partners(120, features = features, method = "RANN")
 )
mean_sd_tab(features, kplus_anticlusters)
#>   Sepal.Length  Sepal.Width   Petal.Length  Petal.Width  
#> 1 "5.84 (0.84)" "3.06 (0.44)" "3.72 (1.79)" "1.21 (0.77)"
#> 2 "5.84 (0.83)" "3.06 (0.44)" "3.76 (1.78)" "1.20 (0.77)"
#> 3 "5.84 (0.83)" "3.06 (0.44)" "3.79 (1.77)" "1.19 (0.77)"
# Or we use random exchange partners: 
kplus_anticlusters <- fast_anticlustering(
  kplus_moment_variables(features, T = 2), 
  K = 3,
  exchange_partners = generate_exchange_partners(120, N = nrow(features), method = "random")
)
mean_sd_tab(features, kplus_anticlusters)
#>   Sepal.Length  Sepal.Width   Petal.Length  Petal.Width  
#> 1 "5.84 (0.83)" "3.06 (0.44)" "3.77 (1.78)" "1.20 (0.77)"
#> 2 "5.84 (0.83)" "3.06 (0.44)" "3.75 (1.78)" "1.20 (0.77)"
#> 3 "5.85 (0.84)" "3.05 (0.44)" "3.75 (1.78)" "1.20 (0.77)"


# Working on several 1000 elements is very fast (Here n = 10000, m = 2)
data <- matrix(rnorm(10000 * 2), ncol = 2)
start <- Sys.time()
groups <- fast_anticlustering(data, K = 5, k_neighbours = 5)
Sys.time() - start 
#> Time difference of 0.4193084 secs

Arguments

Details

References

See also

Author

Examples