Perform anticlustering using the k-plus objective to maximize between-group similarity. This function implements the k-plus anticlustering method described in Papenberg (2024; <doi:10.1111/bmsp.12315>).

kplus_anticlustering(
  x,
  K,
  variance = TRUE,
  skew = FALSE,
  kurtosis = FALSE,
  covariances = FALSE,
  T = NULL,
  standardize = TRUE,
  ...
)

Arguments

x

A feature matrix where rows correspond to elements and columns correspond to variables (a single numeric variable can be passed as a vector).

K

How many anticlusters should be created. Alternatively: (a) A vector describing the size of each group, or (b) a vector of length nrow(x) describing how elements are assigned to anticlusters before the optimization starts.

variance

Boolean: Should the k-plus objective include a term to maximize between-group similarity with regard to the variance? (Default = TRUE)

skew

Boolean: Should the k-plus objective include a term to maximize between-group similarity with regard to skewness? (Default = FALSE)

kurtosis

Boolean: Should the k-plus objective include a term to maximize between-group similarity with regard to kurtosis? (Default = FALSE)

covariances

Boolean: Should the k-plus objective include a term to maximize between-group similarity with regard to covariance structure? (Default = FALSE)

T

Optional argument: An integer specifying how many distribution moments should be equalized between groups.

standardize

Boolean. If TRUE, the data is standardized through a call to scale before the optimization starts. Defaults to TRUE. See details.

...

Arguments passed down to anticlustering. All of the arguments are supported except for objective.

Details

This function implements the unweighted sum approach for k-plus anticlustering. Details are given in Papenberg (2024).

The optional argument T denotes the number of distribution moments that are considered in the anticlustering process. For example, T = 4 will lead to similar means, variances, skew and kurtosis. For the first four moments, it is also possible to use the boolean convenience arguments variance, skew and kurtosis; the mean (the first moment) is always included and cannot be "turned off". If the argument T is used, it overrides the arguments variance, skew and kurtosis (corresponding to the second, third and fourth moment), ignoring their values.

The standardization is applied to all original features and the additional k-plus features that are appended to the data set in order to optimize the k-plus criterion. When using standardization, all criteria such as means, variances and skewness receive a comparable weight during the optimization. It is usually recommended not to change the default setting standardization = TRUE.

This function can use any arguments that are also possible in anticlustering (except for `objective` because the objective optimized here is the k-plus objective; to use a different objective, call anticlustering directly). Any arguments that are not explicitly changed here (i.e., standardize = TRUE) receive the default given in anticlustering (e.g., method = "exchange".)

References

Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. British Journal of Mathematical and Statistical Psychology, 77(1), 80–102. https://doi.org/10.1111/bmsp.12315

Examples


# Generate some data
N <- 180
M <- 4
features <- matrix(rnorm(N * M), ncol = M)
# standard k-plus anticlustering: optimize similarity with regard to mean and variance:
cl <- kplus_anticlustering(features, K = 3, method = "local-maximum")
mean_sd_tab(features, cl)
#>   [,1]           [,2]          [,3]           [,4]          
#> 1 "-0.07 (1.06)" "0.02 (0.92)" "-0.12 (1.00)" "-0.02 (1.13)"
#> 2 "-0.07 (1.06)" "0.02 (0.92)" "-0.11 (1.00)" "-0.02 (1.13)"
#> 3 "-0.07 (1.06)" "0.02 (0.92)" "-0.12 (1.00)" "-0.01 (1.13)"
# Visualize an anticlustering solution:
plot(features, col = palette()[2:4][cl], pch = c(16:18)[cl])


# Also optimize with regard to skewness and kurtosis
cl2 <- kplus_anticlustering(
  features, 
  K = 3, 
  method = "local-maximum", 
  skew = TRUE, 
  kurtosis = TRUE
)

# The following two calls are equivalent: 
init_clusters <- sample(rep_len(1:3, nrow(features)))
# 1.
x1 <- kplus_anticlustering(
  features, 
  K = init_clusters, 
  variance = TRUE,
  skew = TRUE
)
# 2. 
x2 <- kplus_anticlustering(
  features, 
  K = init_clusters, 
  T = 3
)
# Verify: 
all(x1 == x2)
#> [1] TRUE