vignettes/Best_practices.Rmd
This vignette documents some “best practices” for anticlustering using the R package `anticlust`. In many cases, the suggestions pertain to overriding the default values of arguments of `anticlustering()`, which seems to be a difficult decision for users. However, I advise you: Do not stick with the defaults; check out the results of different anticlustering specifications; repeat the process; play around; read the documentation (especially `?anticlustering`); change arguments arbitrarily; compare the output. Nothing can break.[^1]
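As one way to follow this advice, the sketch below runs `anticlustering()` with three different objectives on the same data and prints descriptive statistics for each result. The built-in `iris` data, `K = 3`, and the seed are assumptions chosen purely for illustration; `mean_sd_tab()` is the `anticlust` helper that tabulates means and standard deviations by group.

```r
library(anticlust)

# Numeric features of the built-in iris data set (150 rows, 4 measurements)
features <- iris[, 1:4]

# The exchange procedure starts from a random partition, so fix the seed
set.seed(123)

# Same data, same number of groups, three different objectives
groups_diversity <- anticlustering(features, K = 3, objective = "diversity")
groups_variance  <- anticlustering(features, K = 3, objective = "variance")
groups_kplus     <- anticlustering(features, K = 3, objective = "kplus")

# Means and standard deviations (in parentheses) per group, one table per result
mean_sd_tab(features, groups_diversity)
mean_sd_tab(features, groups_variance)
mean_sd_tab(features, groups_kplus)
```

Reading the three tables side by side is usually enough to see how the choice of objective changes the balance of means and standard deviations across groups.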
This document uses somewhat imperative language; nuance and explanations are given in the package documentation, the other vignettes, and the papers by Papenberg and Klau (2021; https://doi.org/10.1037/met0000301) and Papenberg (2024; https://doi.org/10.1111/bmsp.12315). Note that deciding which anticlustering objective to use usually requires substantial content considerations and cannot be reduced to “which one is better”. However, some hints are given below.

- Use `method = "local-maximum"` instead of the default `method = "exchange"`. It is unambiguously better.
- Use the `repetitions` argument.
- Use `standardize = TRUE` instead of the default `standardize = FALSE`.[^2]
- Do not use `objective = "diversity"` when the group sizes are not equal (preferably, use `objective = "kplus"` or `objective = "average-diversity"`).
- Do not use `objective = "variance"`.
- Prefer `objective = "kplus"` over `objective = "variance"` (or check out the function `kplus_anticlustering()`).
- Use `objective = "kplus"` instead of the default `objective = "diversity"`.
- Use `standardize = TRUE`.
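
Taken together, the hints above might translate into calls like the following. This is a minimal sketch, not a definitive recipe: the `iris` data, the seed, `repetitions = 10`, and the group sizes `c(30, 50, 70)` are assumptions made for illustration only.

```r
library(anticlust)

features <- iris[, 1:4]
set.seed(42)

# Equal-sized groups, overriding several defaults at once
groups <- anticlustering(
  features,
  K = 3,
  objective = "kplus",       # instead of the default "diversity"
  method = "local-maximum",  # instead of the default "exchange"
  repetitions = 10,          # repeat the search from several random starts
  standardize = TRUE         # instead of the default FALSE
)
table(groups)
mean_sd_tab(features, groups)

# Unequal group sizes: pass the sizes as K and avoid objective = "diversity"
groups_unequal <- anticlustering(
  features,
  K = c(30, 50, 70),
  objective = "kplus",
  method = "local-maximum",
  repetitions = 10,
  standardize = TRUE
)
table(groups_unequal)
```

Larger values of `repetitions` cost more run time, so the value used here is only a starting point to adjust for the size of your data.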
Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301

Papenberg, M. (2024). K-plus anticlustering: An improved k-means criterion for maximizing between-group similarity. British Journal of Mathematical and Statistical Psychology, 77(1), 80–102. https://doi.org/10.1111/bmsp.12315

[^1]: Well, actually, your R session can break if you use an optimal method (`method = "ilp"`) with a data set that is too large.

[^2]: You might ask why `standardize = TRUE` is not the default. There are two reasons. First, the argument was not always available in `anticlust`, and changing the default behaviour of a function when releasing a new version is often undesirable. Second, it seems like a big decision to me to change users’ data by default (which is what standardizing does). If in doubt, compare the results of using `standardize = TRUE` and `standardize = FALSE` and decide for yourself which you like best. Standardization may not be the best choice in all settings.
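
The comparison suggested in the note on standardization can be sketched as follows. The choice of the built-in `mtcars` data, the three columns, `K = 2`, and the seed are assumptions for illustration; the columns were picked only because they are measured on very different scales.

```r
library(anticlust)

# Three variables on very different scales (miles per gallon, horsepower, weight)
features <- mtcars[, c("mpg", "hp", "wt")]

# Use the same seed for both calls so that only the standardize argument differs
set.seed(1)
groups_raw <- anticlustering(features, K = 2, standardize = FALSE)
set.seed(1)
groups_std <- anticlustering(features, K = 2, standardize = TRUE)

# Compare means and standard deviations per group on the original scale
mean_sd_tab(features, groups_raw)
mean_sd_tab(features, groups_std)
```

Without standardization, variables with large numeric ranges tend to dominate distance-based objectives, so checking both tables shows whether that matters for your data.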