vignettes/Categorical_vars.Rmd
In this vignette I explore two ways to incorporate categorical variables with anticlustering. The main function of `anticlust` is `anticlustering()`, and it has an argument `categories`. It can be used easily enough: We just pass the numeric variables as first argument (`x`) and our categorical variable(s) to `categories`. I will use the penguin data set from the `palmerpenguins` package to illustrate the usage:
library(anticlust)
library(palmerpenguins)
# First exclude cases with missing values
df <- na.omit(penguins)
head(df)
#> # A tibble: 6 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen 36.7 19.3 193 3450
#> 5 Adelie Torgersen 39.3 20.6 190 3650
#> 6 Adelie Torgersen 38.9 17.8 181 3625
#> # ℹ 2 more variables: sex <fct>, year <int>
nrow(df)
#> [1] 333
In the data set, each row represents a penguin, and the data set has four numeric variables (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) and several categorical variables (species, island, sex) as descriptions of the penguins.
Let’s call `anticlustering()` to divide the 333 penguins into 3 groups. We use the four numeric variables as first argument (i.e., the anticlustering objective is computed on the basis of the numeric variables), and the penguins’ sex as categorical variable:
numeric_vars <- df[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
groups <- anticlustering(
  numeric_vars,
  K = 3,
  categories = df$sex
)
Let’s check how well the categorical variable is balanced between the groups:
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
A perfect split! Similarly, we could use the species as categorical variable:
groups <- anticlustering(
  numeric_vars,
  K = 3,
  categories = df$species
)
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40
As good as it could be! Now, let’s use both categorical variables at the same time:
groups <- anticlustering(
  numeric_vars,
  K = 3,
  categories = df[, c("species", "sex")]
)
table(groups, df$sex)
#>
#> groups female male
#> 1 54 57
#> 2 56 55
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40
The results for the sex variable are worse than previously when we only considered one variable at a time. This is because when using multiple variables with the `categories` argument, all columns are “merged” into a single column, and each combination of sex / species is treated as a separate category. Some information on the original variables is lost, and the results may become less optimal, although they are still pretty good here. Alas, using only the `categories` argument, we cannot improve this balancing even if a better split with regard to both categorical variables would be possible.
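To make the merging step concrete, here is a minimal base R sketch on a hypothetical toy data frame (`toy` and its values are invented for illustration). It assumes the merge behaves like `interaction()`, which may differ in detail from what `anticlust` does internally.

```r
# Hypothetical toy data: 12 animals described by two categorical variables
toy <- data.frame(
  species = factor(rep(c("A", "B", "C"), each = 4)),
  sex     = factor(rep(c("female", "male"), times = 6))
)

# Merging both columns yields one category per species/sex combination
merged <- interaction(toy$species, toy$sex, drop = TRUE)

nlevels(toy$species)  # number of species categories
nlevels(toy$sex)      # number of sex categories
nlevels(merged)       # number of combined categories

# The merged variable no longer records that, e.g., A.female and
# B.female share the same sex; they are simply two different categories
table(merged)
```

Balancing each combined category to within one element per group does not guarantee that the marginal sex counts are also within one element, because small imbalances can add up across species.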
A second possibility to incorporate categorical variables is to treat them as numeric variables and use them as part of the first argument `x`, which is used to compute the anticlustering objective (e.g., the diversity or variance). This approach can lead to better results when multiple categorical variables are available and/or the group sizes are unequal. I discuss the approach using the example of k-means anticlustering, but using the diversity objective is also possible (in principle, any reasonable way to transform categorical variables to pairwise dissimilarities would work).
To use categorical variables as part of the anticlustering objective, we first generate a matrix of the categorical variables in binary representation using the `anticlust` convenience function `categories_to_binary()`.[^1] Because k-means anticlustering optimizes similarity with regard to means, k-means anticlustering applied to this binary matrix will even out the proportion of each category in each group (this is because the mean of a binary variable is the proportion of `1`s in that variable).
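The claim that the mean of a binary column is a category proportion can be checked with a small base R sketch. The toy data are hypothetical, and `model.matrix()` with identity contrasts is only one way to obtain such a coding; it is not necessarily what `categories_to_binary()` does in detail.

```r
# Hypothetical toy data (not the penguin data)
toy <- data.frame(
  species = factor(c("A", "A", "B", "B", "B", "C")),
  sex     = factor(c("female", "male", "male", "female", "male", "male"))
)

# Full dummy coding with base R: one 0/1 column per category.
# Identity contrasts keep all levels of each factor; the intercept
# column is dropped afterwards.
full_contrasts <- lapply(toy, contrasts, contrasts = FALSE)
binary <- model.matrix(~ ., data = toy, contrasts.arg = full_contrasts)
binary <- binary[, colnames(binary) != "(Intercept)", drop = FALSE]

# Column means of the binary matrix ...
colMeans(binary)

# ... match the category proportions
prop.table(table(toy$species))
prop.table(table(toy$sex))
```

Making the column means similar between groups is therefore the same as making the category proportions similar between groups.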
binary_categories <- categories_to_binary(df[, c("species", "sex")])
# see ?categories_to_binary
head(binary_categories)
#> speciesAdelie speciesChinstrap speciesGentoo sexfemale sexmale
#> 1 1 0 0 0 1
#> 2 1 0 0 1 0
#> 3 1 0 0 1 0
#> 4 1 0 0 1 0
#> 5 1 0 0 0 1
#> 6 1 0 0 1 0
groups <- anticlustering(
  binary_categories,
  K = 3,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10,
  standardize = TRUE
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
The results are quite convincing. In particular, the penguins’ sex is better balanced than previously when we used the argument `categories`. If we have multiple categorical variables and/or unequal-sized groups, it may be useful to try out the k-means optimization version of including categorical variables, instead of (only) using the `categories` argument. If we also wish to ensure that the categorical variables in their combination are balanced between groups (i.e., the proportion of the penguins’ sex is roughly the same for each species in each group), we could set the optional argument `use_combinations` of `categories_to_binary()` to `TRUE`:
binary_categories <- categories_to_binary(df[, c("species", "sex")], use_combinations = TRUE)
groups <- anticlustering(
  binary_categories,
  K = 3,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10,
  standardize = TRUE
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
table(groups, df$sex, df$species)
#> , , = Adelie
#>
#>
#> groups female male
#> 1 24 25
#> 2 25 24
#> 3 24 24
#>
#> , , = Chinstrap
#>
#>
#> groups female male
#> 1 12 11
#> 2 11 11
#> 3 11 12
#>
#> , , = Gentoo
#>
#>
#> groups female male
#> 1 19 20
#> 2 19 21
#> 3 20 20
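Reading balance off several tables by eye gets tedious, so a small summary can help. The following base R sketch defines a hypothetical helper, `max_imbalance()` (not part of `anticlust`), that reports the largest difference in category counts between groups; the toy vectors are invented for illustration.

```r
# Hypothetical helper: largest difference in category counts between groups
max_imbalance <- function(groups, categories) {
  counts <- table(groups, categories)
  # For each category (column), the range of its counts across groups
  max(apply(counts, 2, function(n) max(n) - min(n)))
}

# Toy example (invented data): 24 elements, two categorical variables
toy_species <- rep(c("A", "B"), each = 12)
toy_sex     <- rep(c("female", "male"), times = 12)

balanced <- rep(1:3, times = 8)  # groups alternate across elements
blocked  <- rep(1:3, each = 8)   # groups are filled one after another

max_imbalance(balanced, toy_species)
max_imbalance(blocked, toy_species)   # species is badly balanced ...
max_imbalance(blocked, toy_sex)       # ... although sex is balanced

# Balance of the species/sex combinations
max_imbalance(balanced, interaction(toy_species, toy_sex))
```

With the objects from this vignette, the same helper could be called as `max_imbalance(groups, df$sex)` or `max_imbalance(groups, interaction(df$species, df$sex))`.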
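Note that so far we only distributed the categorical variables evenly between groups and did not consider any numeric variables. Fortunately, the numeric variables can be considered as well, and we can accomplish that in two different ways: sequentially, by passing the balanced grouping to a second call to `anticlustering()` that optimizes the numeric variables, or simultaneously, by optimizing numeric and categorical variables in a single call. We discuss both approaches in the following.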
In the first approach, we use the output vector `groups` of the previous call to `anticlustering()`, which convincingly balanced our categorical variables, as input to the `K` argument in an additional call to `anticlustering()`. The `groups` vector is used as the initial group assignment before the anticlustering optimization starts. In this group assignment, the categories are already well balanced. We additionally pass the two categorical variables to `categories`, thus ensuring that the balancing of the categorical variables is never changed throughout the optimization process:[^2]
final_groups <- anticlustering(
  numeric_vars,
  K = groups,
  standardize = TRUE,
  method = "local-maximum",
  categories = df[, c("species", "sex")]
)
table(final_groups, df$sex)
#>
#> final_groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(final_groups, df$species)
#>
#> final_groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
mean_sd_tab(numeric_vars, final_groups)
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1 "43.99 (5.58)" "17.17 (1.96)" "201.00 (14.09)" "4206.31 (811.03)"
#> 2 "43.99 (5.38)" "17.16 (1.98)" "201.00 (14.04)" "4207.66 (803.24)"
#> 3 "44.00 (5.49)" "17.17 (1.98)" "200.90 (14.05)" "4207.21 (808.66)"
The results are convincing, both with regard to the numeric variables and the categorical variables.
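Why the category balance survives the second call can be illustrated with a toy exchange step. This is a minimal base R sketch under the assumption that the optimization only swaps elements of the same category between groups; `swap_if_same_category()` and the toy vectors are hypothetical and not part of `anticlust`.

```r
# Toy data (invented): 8 elements, 2 groups, one categorical variable
toy_category <- c("a", "a", "a", "a", "b", "b", "b", "b")
toy_groups   <- c(1, 2, 1, 2, 1, 2, 1, 2)

# Hypothetical helper: swap the group labels of elements i and j,
# but only if both elements belong to the same category
swap_if_same_category <- function(groups, category, i, j) {
  if (category[i] == category[j]) {
    groups[c(i, j)] <- groups[c(j, i)]
  }
  groups
}

before <- table(toy_groups, toy_category)

# Elements 1 and 2 are both "a": their group labels are swapped
after_swap <- swap_if_same_category(toy_groups, toy_category, i = 1, j = 2)
# Elements 1 and 6 differ in category: nothing changes
after_skip <- swap_if_same_category(toy_groups, toy_category, i = 1, j = 6)

# The group-by-category counts are unchanged in both cases
all(before == table(after_swap, toy_category))
all(before == table(after_skip, toy_category))
```

Because every permitted swap leaves the group-by-category table untouched, the optimization is free to improve the numeric variables without disturbing the initial balance.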
In the second approach, we simultaneously consider the numeric and categorical variables in the optimization process. Note that this approach only works with the k-means and k-plus objectives, because only k-means adequately deals with the categorical variables (at least when using the approach described here). Using the simultaneous approach, we just pass all variables (representing binary categories and numeric variables) as a single matrix to the first argument of `anticlustering()`. Do not use the `categories` argument here!
final_groups <- anticlustering(
  cbind(numeric_vars, binary_categories),
  K = 3,
  standardize = TRUE,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10
)
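# `groups` holds the categorical-only solution from above; tabulate
# `final_groups` instead to inspect the simultaneous solution
table(groups, df$sex)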
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
mean_sd_tab(numeric_vars, final_groups)
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1 "44.00 (5.48)" "17.17 (1.96)" "200.96 (13.70)" "4206.98 (828.22)"
#> 2 "43.99 (5.54)" "17.16 (1.94)" "200.96 (14.58)" "4206.98 (803.28)"
#> 3 "43.99 (5.43)" "17.16 (2.02)" "200.97 (13.88)" "4207.21 (791.01)"
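To illustrate what the simultaneous approach optimizes, the following base R sketch computes the k-means criterion (the within-group sum of squared deviations from the group means) for a matrix that combines a numeric and a binary column. `kmeans_criterion()` and the toy data are hypothetical, and the sketch skips the standardization used in the call above.

```r
# Hypothetical helper: k-means criterion (within-group sum of squares)
kmeans_criterion <- function(x, groups) {
  x <- as.matrix(x)
  total <- 0
  for (g in unique(groups)) {
    members <- x[groups == g, , drop = FALSE]
    # Subtract each column's group mean, then sum the squared deviations
    centered <- sweep(members, 2, colMeans(members))
    total <- total + sum(centered^2)
  }
  total
}

# Toy matrix (invented): one numeric column and one binary category column
toy <- cbind(
  size   = c(1, 2, 3, 4, 5, 6, 7, 8),
  female = c(1, 0, 1, 0, 1, 0, 1, 0)
)

clustered <- c(1, 1, 1, 1, 2, 2, 2, 2)  # similar sizes end up together
mixed     <- c(1, 2, 2, 1, 1, 2, 2, 1)  # both groups span the whole range

kmeans_criterion(toy, clustered)
kmeans_criterion(toy, mixed)  # larger: the group means are more alike
```

Since the total sum of squares is fixed, a larger within-group value means smaller differences between the group means, and this holds for the numeric and the binary columns alike.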
The following code extends the simultaneous optimization approach towards k-plus anticlustering, which ensures that standard deviations as well as means are similar between groups (and not only the means, which is achieved via standard k-means anticlustering):
final_groups <- anticlustering(
  cbind(kplus_moment_variables(numeric_vars, T = 2), binary_categories),
  K = 3,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10
)
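# `groups` holds the categorical-only solution from above; tabulate
# `final_groups` instead to inspect the k-plus solution
table(groups, df$sex)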
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
mean_sd_tab(numeric_vars, final_groups)
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1 "44.00 (5.48)" "17.16 (1.98)" "200.97 (14.05)" "4207.88 (807.64)"
#> 2 "43.99 (5.48)" "17.17 (1.98)" "200.97 (14.07)" "4207.43 (807.94)"
#> 3 "43.99 (5.49)" "17.16 (1.97)" "200.95 (14.06)" "4205.86 (807.37)"
While we use `objective = "variance"` (indicating that the k-means objective is used), this code actually performs k-plus anticlustering because the first argument takes as input the augmented k-plus variable matrix.[^3] We see that the standard deviations are now also quite evenly matched between groups (unlike with standard k-means anticlustering).
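The idea behind the augmented matrix can be sketched in base R. This sketch assumes that the second-moment k-plus variable is the squared deviation from the variable's overall mean; the actual output of `kplus_moment_variables()` may differ, for example through additional standardization. The toy vector is invented.

```r
# Toy numeric variable (invented data)
x <- c(2, 4, 4, 6, 9, 11)

# Squared deviations from the overall mean act as an extra "variable"
x_squared_dev <- (x - mean(x))^2
augmented <- cbind(x = x, x_squared_dev = x_squared_dev)

# The mean of the extra column is the (population) variance of x,
# so balancing its mean between groups balances the spread of x
mean(augmented[, "x_squared_dev"])
var(x) * (length(x) - 1) / length(x)
```

Because k-means anticlustering equalizes column means, adding such a column lets the same criterion equalize variances as well.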
In the end: You should try out the different approaches for dealing with categorical variables and see which one works best for you!
[^1]: Internally, `categories_to_binary()` is a wrapper around the base R function `model.matrix()`.
[^2]: Only elements that have the same value in `categories` are exchanged between clusters throughout the optimization algorithm, so the initial balancing of the categories is never changed when the algorithm runs.
[^3]: This is how k-plus anticlustering actually works: It reuses the k-means criterion but uses additional “k-plus” variables as input. More information on the k-plus approach is given in the documentation: `?kplus_moment_variables` and `?kplus_anticlustering`.