Get binary representation of categorical variables

categories_to_binary(categories, use_combinations = FALSE)

Arguments

categories: A vector, data.frame or matrix representing one or several categorical variables
use_combinations: Logical. Should the output also include columns representing the combinations (interactions) of the categories? Defaults to FALSE.

Value

A matrix encoding the categorical variable(s) in binary form.

Details

The conversion of categorical variables to binary variables is done via model.matrix. Since version 0.8.9, each category of a categorical variable is coded by a separate binary variable. This is therefore not 'dummy' coding, which is often used to encode predictors in statistical analysis. Dummy coding uses a reference category that is coded as zero on every variable, while each other category is coded as 1 on exactly one variable and as 0 on the rest. As a result, the distance between the reference category and any other category differs from the distance among the other categories, which is unwarranted in anticlustering.
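
The difference in distances can be illustrated with a minimal base R sketch. It is not the package's internal code: the factor x is hypothetical, and the two model.matrix calls are only one way to produce the two codings for comparison.

```r
# Hypothetical three-level factor, one observation per level
x <- factor(c("a", "b", "c"))

# Dummy (treatment) coding: level "a" is the reference and is coded as
# zero on both remaining columns (the intercept column is dropped)
dummy <- model.matrix(~ x)[, -1, drop = FALSE]

# One indicator column per level: removing the intercept from the
# formula keeps a column for every level, so there is no reference level
full <- model.matrix(~ x - 1)

# Euclidean distances between the three coded observations
dist_dummy <- as.matrix(dist(dummy))
dist_full  <- as.matrix(dist(full))

dist_dummy  # "a" is closer to "b" and "c" than "b" and "c" are to each other
dist_full   # all pairs of categories are equally distant
```

Under dummy coding, the reference level lies at distance 1 from each other level, whereas the two non-reference levels lie at distance sqrt(2) from each other. With one column per level, every pair of levels is sqrt(2) apart, so no category is treated as special.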

This function can be used to include categorical variables as part of the optimization criterion in anticlustering, rather than including them as hard constraints as done when using the argument categories in anticlustering (or fast_anticlustering). This way, categorical variables are treated as numeric variables, which can be useful when there are several categorical variables or when the group sizes are unequal (or both). See examples. Please see the vignette 'Using categorical variables with anticlustering' for more information on this approach.

References

Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. British Journal of Mathematical and Statistical Psychology, 77(1), 80–102. https://doi.org/10.1111/bmsp.12315

Author

Martin Papenberg martin.papenberg@hhu.de

Examples


# How to encode a categorical variable with three levels:
unique(iris$Species)
#> [1] setosa     versicolor virginica 
#> Levels: setosa versicolor virginica
categories_to_binary(iris$Species)[c(1, 51, 101), ]
#>     categoriessetosa categoriesversicolor categoriesvirginica
#> 1                  1                    0                   0
#> 51                 0                    1                   0
#> 101                0                    0                   1

# Use Schaper data set for anticlustering example
data(schaper2019)
features <- schaper2019[, 3:6]
K <- 3
N <- nrow(features) 

# - Generate data input for k-means anticlustering -
# We conduct k-plus anticlustering by first generating k-plus variables, 
# and also include the categorical variable as "numeric" input for the 
# k-means optimization (rather than as input for the argument categories)

input_data <- cbind(
  kplus_moment_variables(features, T = 2), 
  categories_to_binary(schaper2019$room) 
)

kplus_groups <- anticlustering(
  input_data, 
  K = K,
  objective = "variance",
  method = "local-maximum", 
  repetitions = 10
)
mean_sd_tab(features, kplus_groups)
#>   rating_consistent rating_inconsistent syllables     frequency     
#> 1 "4.49 (0.25)"     "1.10 (0.07)"       "3.41 (0.95)" "18.28 (2.43)"
#> 2 "4.49 (0.25)"     "1.10 (0.07)"       "3.44 (0.91)" "18.31 (2.40)"
#> 3 "4.49 (0.25)"     "1.10 (0.07)"       "3.41 (0.95)" "18.34 (2.40)"
table(kplus_groups, schaper2019$room) # argument categories was not used!
#>             
#> kplus_groups bathroom kitchen
#>            1       16      16
#>            2       16      16
#>            3       16      16