Get binary representation of categorical variables

categories_to_binary(categories, use_combinations = FALSE)

Arguments

categories

A vector, data.frame or matrix representing one or several categorical variables

use_combinations

Logical. Should the output also include columns representing the combinations (interactions) of the categories? Defaults to FALSE.

Value

A matrix representing the categorical variables in binary form ("dummy coding")

Details

The conversion of categorical variables to binary variables is done via model.matrix. This function can be used to include categorical variables as part of the optimization criterion in k-means / k-plus anticlustering, rather than as hard constraints, which is how anticlustering treats them when they are passed via its categories argument. This can be useful when there are several categorical variables, when the group sizes are unequal, or both. See examples.
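
To make the dummy coding concrete, the following is a minimal base R sketch of what a model.matrix-based conversion can look like. It is not the package's implementation: the toy data frame, the formulas, and the decision to drop the intercept column are assumptions made for illustration only.

```r
# Toy data with two categorical variables (hypothetical, for illustration only)
toy <- data.frame(
  room = factor(c("kitchen", "bathroom", "kitchen", "bathroom")),
  size = factor(c("small", "small", "large", "large"))
)

# Main effects only: one 0/1 column per non-reference level of each factor
main_effects <- model.matrix(~ room + size, data = toy)

# Main effects plus interaction columns (the idea behind use_combinations = TRUE)
with_interaction <- model.matrix(~ room * size, data = toy)

# Assumption of this sketch: drop the constant intercept column, since it
# carries no information about category membership
main_effects <- main_effects[, -1, drop = FALSE]
with_interaction <- with_interaction[, -1, drop = FALSE]

colnames(main_effects)
colnames(with_interaction)
```

With R's default treatment contrasts, the first level of each factor (in alphabetical order) serves as the reference level, so a two-level variable is represented by a single binary column.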

References

Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. British Journal of Mathematical and Statistical Psychology, 77(1), 80–102. https://doi.org/10.1111/bmsp.12315

Author

Martin Papenberg martin.papenberg@hhu.de

Examples


# Use Schaper data set for example
data(schaper2019)
features <- schaper2019[, 3:6]
K <- 3
N <- nrow(features) 

# - Generate data input for k-means anticlustering -
# We conduct k-plus anticlustering by first generating k-plus variables.
# The categorical variable is then added as "numeric" (binary) input for the
# k-means optimization, rather than being passed via the argument `categories`.

input_data <- cbind(
  kplus_moment_variables(features, T = 2), 
  categories_to_binary(schaper2019$room) 
)

kplus_groups <- anticlustering(
  input_data, 
  K = K,
  objective = "variance",
  method = "local-maximum", 
  repetitions = 10
)
mean_sd_tab(features, kplus_groups)
#>   rating_consistent rating_inconsistent syllables     frequency     
#> 1 "4.49 (0.25)"     "1.10 (0.07)"       "3.44 (0.91)" "18.31 (2.40)"
#> 2 "4.49 (0.25)"     "1.10 (0.07)"       "3.41 (0.95)" "18.31 (2.42)"
#> 3 "4.49 (0.25)"     "1.10 (0.07)"       "3.41 (0.95)" "18.31 (2.42)"
table(kplus_groups, schaper2019$room) # the argument `categories` was not used!
#>             
#> kplus_groups bathroom kitchen
#>            1       16      16
#>            2       16      16
#>            3       16      16