Increasing the speed of (k-means / k-plus) anticlustering by (1) conducting fewer exchanges during the optimization and (2) using an alternative formulation of the k-means objective. Makes anticlustering applicable to quite large data sets.
fast_anticlustering(
x,
K,
k_neighbours = Inf,
categories = NULL,
exchange_partners = NULL
)
A numeric vector, matrix or data.frame of data points. Rows correspond to elements and columns correspond to features. A vector represents a single numeric feature.
How many anticlusters should be created. Alternatively:
(a) A vector describing the size of each group, or (b) a vector
of length nrow(x)
describing how elements are assigned
to anticlusters before the optimization starts.
The number of nearest neighbours that serve as exchange partner for each element. See details.
A vector, data.frame or matrix representing one or several categorical constraints.
Optional argument. A list of length
NROW(x)
specifying for each element the indices of the
elements that serve as exchange partners. If used, this
argument overrides the k_neighbours
argument. See
examples.
This function was created to make anticlustering applicable to
large data sets (e.g., several 100,000 elements). It optimizes the
k-means objective because computing all pairwise distances as is
done when optimizing the "diversity" (i.e., the default in
anticlustering
) is not feasible for very large data
sets (for about N > 20000 on my personal computer). Using
fast_anticlustering
for k-plus anticlustering is also
possible by applying kplus_moment_variables
on the
input (and possibly by using the argument exchange_partners
,
see Examples).
The function fast_anticlustering
employs a speed-optimized
exchange method, which is basically equivalent to method =
"exchange"
in anticlustering
, but may reduce the number
of exchanges that are investigated for each input element. The number of
exchange partners per element has to be set using the argument k_neighbours
. By
default, it is set to Inf
, meaning that all possible swaps are
tested. If k_neighbours
is set differently (which is usually recommended when running
this function), the default behaviour is to generate exchange partners using a
nearest neighbour search (using the function nn2
from the RANN
package). Using more exchange partners can improve the quality of
the results, but also increase run time. Note that for very large
data sets, anticlustering generally becomes "easier" (even a random
split may yield satisfactory results), so using few exchange
partners is usually not a problem.
It is possible to specify custom exchange partners using the
argument exchange_partners
instead of relying on the default
nearest neighbour search. When using exchange_partners
, it
is not necessary that each element has the same number of exchange
partners; this is why the argument exchange_partners
has to
be a list
instead of a matrix
or
data.frame
. Exchange partners can for example be generated
by generate_exchange_partners
(see Examples), but a
custom list may also be used. Note that categorical constraints
induced via categories
may not be respected during the
optimization if the exchange_partners
argument allows
exchanges between members of different categories, so care must be
taken when combining the arguments exchange_partners
and
categories
.
In anticlustering(..., objective = "variance")
, the run time
of computing the k-means objective is in O(M N), where N is the
total number of elements and M is the number of variables. This is
because the variance is computed as the sum of squared distances
between all data points and their cluster centers. The function
fast_anticlustering
uses a different - but equivalent -
formulation of the k-means objective, where the re-computation of
the objective only depends and M but no longer on N. In
particular, this variant of k-means anticlustering minimizes
the weighted sum of squared distances between
cluster centroids and the overall data centroid; the distances
between all individual data points and their cluster center are not
computed (Späth, 1986). Using the different objective formulation
reduces the run time by an
order of magnitude and makes k-means anticlustering applicable to
very large data sets (even in the millions). For a fixed number of
exchange partners (specified using the argument
k_neighbours
), the approximate run time of
fast_anticlustering
is in O(M N). The algorithm
method = "exchange"
in anticlustering
with
objective = "variance"
has a run time of O(M N^3).
Thus, fast_anticlustering
can improve the run time
by two orders of magnitude as compared to the standard exchange
algorithm. The nearest neighbour search, which is done in the
beginning usually does not strongly contribute to the overall
run time. It is nevertheless possible to suppress the nearest
neighbour search by using the exchange_partners
argument.
When setting the categories
argument, exchange partners
(i.e., nearest neighbours) will be generated from the same
category. Note that when categories
has multiple columns,
each combination of these categories is treated as a distinct
category by the exchange method. You can also use
categories_to_binary
to potentially improve results
for several categorical variables, instead of using the argument
categories
.
Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301.
Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. British Journal of Mathematical and Statistical Psychology, 77(1), 80–102. https://doi.org/10.1111/bmsp.12315
Späth, H. (1986). Anticlustering: Maximizing the variance criterion. Control and Cybernetics, 15, 213-218.
## Use fewer or more exchange partners to adjust speed (vs. quality tradeoff)
features <- iris[, - 5]
N <- nrow(features)
init <- sample(rep_len(1:3, N)) # same starting point for all calls:
groups1 <- fast_anticlustering(features, K = init) # default: all exchanges
groups2 <- fast_anticlustering(features, K = init, k_neighbours = 20)
groups3 <- fast_anticlustering(features, K = init, k_neighbours = 2)
variance_objective(features, groups1)
#> [1] 681.369
variance_objective(features, groups2)
#> [1] 681.057
variance_objective(features, groups3)
#> [1] 675.105
# K-plus anticlustering is straight forward when sticking with the default
# for k_neighbours
kplus_anticlusters <- fast_anticlustering(
kplus_moment_variables(features, T = 2),
K = 3
)
mean_sd_tab(features, kplus_anticlusters)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 "5.84 (0.83)" "3.06 (0.44)" "3.78 (1.77)" "1.21 (0.77)"
#> 2 "5.84 (0.84)" "3.06 (0.44)" "3.74 (1.78)" "1.19 (0.77)"
#> 3 "5.84 (0.83)" "3.06 (0.44)" "3.76 (1.78)" "1.20 (0.77)"
# Some care is needed when applying k-plus using with this function
# while using a reduced number of exchange partners generated in the
# nearest neighbour search. Then we:
# 1) Use kplus_moment_variables() on the numeric input
# 2) Generate custom exchange_partners because otherwise nearest
# neighbours are internally selected based on the extended k-plus
# variables returned by kplus_moment_variables()
# (which does not really make sense)
kplus_anticlusters <- fast_anticlustering(
kplus_moment_variables(features, T = 2),
K = 3,
exchange_partners = generate_exchange_partners(120, features = features, method = "RANN")
)
mean_sd_tab(features, kplus_anticlusters)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 "5.84 (0.83)" "3.06 (0.44)" "3.77 (1.78)" "1.20 (0.77)"
#> 2 "5.84 (0.83)" "3.06 (0.44)" "3.75 (1.78)" "1.20 (0.77)"
#> 3 "5.85 (0.84)" "3.06 (0.44)" "3.76 (1.78)" "1.20 (0.77)"
# Or we use random exchange partners:
kplus_anticlusters <- fast_anticlustering(
kplus_moment_variables(features, T = 2),
K = 3,
exchange_partners = generate_exchange_partners(120, N = nrow(features), method = "random")
)
mean_sd_tab(features, kplus_anticlusters)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 "5.84 (0.83)" "3.06 (0.44)" "3.76 (1.78)" "1.20 (0.77)"
#> 2 "5.85 (0.83)" "3.06 (0.44)" "3.77 (1.77)" "1.20 (0.77)"
#> 3 "5.84 (0.83)" "3.06 (0.44)" "3.75 (1.78)" "1.20 (0.77)"
# Working on several 1000 elements is very fast (Here n = 10000, m = 2)
data <- matrix(rnorm(10000 * 2), ncol = 2)
start <- Sys.time()
groups <- fast_anticlustering(data, K = 5, k_neighbours = 5)
Sys.time() - start
#> Time difference of 0.374614 secs