In this vignette I explore some ways to incorporate categorical
variables with anticlustering. The main function of
anticlust is anticlustering(), and it has an
argument categories. It can be used easily enough: We just
pass the numeric variables as first argument (x) and our
categorical variable(s) to categories. I will use the
penguin data set to illustrate the usage:
data(penguins)
# First exclude cases with missing values
df <- na.omit(penguins)
head(df)
#> species island bill_len bill_dep flipper_len body_mass sex year
#> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#> 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
#> 3 Adelie Torgersen 40.3 18.0 195 3250 female 2007
#> 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
#> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#> 7 Adelie Torgersen 38.9 17.8 181 3625 female 2007
nrow(df)
#> [1] 333
In the data set, each row represents a penguin. It has four numeric variables (bill_len, bill_dep, flipper_len, body_mass) and several categorical variables (species, island, sex) that describe the penguins.
Let’s call anticlustering() to divide the 333 penguins
into 3 groups. We use the four numeric variables as first argument
(i.e., the anticlustering objective is computed on the basis of the
numeric variables), and the penguins’ sex as categorical variable:
numeric_vars <- df[, c("bill_len", "bill_dep", "flipper_len", "body_mass")]
groups <- anticlustering(
numeric_vars,
K = 3,
categories = df$sex
)
Let’s check how well the categorical variable is balanced:
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
A perfect split! Similarly, we could use the species as categorical variable:
groups <- anticlustering(
numeric_vars,
K = 3,
categories = df$species
)
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40
As good as it could be! Now, let’s use both categorical variables at the same time:
groups <- anticlustering(
numeric_vars,
K = 3,
categories = df[, c("species", "sex")]
)
table(groups, df$sex)
#>
#> groups female male
#> 1 54 57
#> 2 56 55
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40
The results for the sex variable are worse than previously when we
only considered one variable at a time. This is because when using
multiple variables with the categories argument, all
columns are “merged” into a single column, and each combination of sex /
species is treated as a separate category. Some information on the
original variables is lost, and the results may become less
optimal, although they are still pretty okay here. Alas, using only the
categories argument, we cannot improve this balancing even
if a better split with regard to both categorical variables would be
possible.
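To make the "merging" more concrete, here is a minimal sketch in base R. It uses `interaction()` on a small made-up data frame (`toy` is purely illustrative) and is only meant to show the idea of combining two categorical columns into one. I am not claiming that this is how `anticlustering()` implements it internally.

```r
# Toy data: three species crossed with two sexes (made-up values)
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"),
  sex = c("female", "male", "female", "male", "female", "male")
)

# Combine both columns into a single factor; each species / sex
# combination becomes its own level
merged <- interaction(toy$species, toy$sex, drop = TRUE)

nlevels(merged)
table(merged)
```

With three species and two sexes, the merged variable has six levels. Each of these six cells is balanced on its own, so small leftovers in several cells can add up when we look at sex or species alone.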
A second possibility to incorporate categorical variables is to treat
them as numeric variables and use them as part of the first argument
x, which is used to compute the anticlustering objective
(e.g., the diversity or variance). This approach can lead to better
results when multiple categorical variables are available, and / or if
the group sizes are unequal. Since version 0.8.12, we can use
categorical variables as part of the first argument when they are
defined as factors. Before that, we manually had to convert categorical
variables into a binary representation via
categories_to_binary(). Manual conversion can still be
useful, as shown further below.
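As a rough illustration of what a binary representation looks like, the following sketch one-hot encodes a factor with the base R function `model.matrix()`. The data frame `toy` is made up, and the exact columns produced by `categories_to_binary()` may differ from this sketch (for example, in whether one reference level is dropped).

```r
# Toy data with one categorical variable (made-up values)
toy <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie"))
)

# "- 1" removes the intercept, so every level of species gets its own
# 0/1 column instead of being coded relative to a reference level
onehot <- model.matrix(~ species - 1, data = toy)
onehot
```

Each row contains exactly one 1, which marks the species of that case. Numeric columns like these can be passed to an objective such as the diversity or the variance in the same way as any other numeric variable.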
In the penguin data set, all variables are already correctly coded, i.e., categorical variables are defined as factors. So I generate a data frame that includes all features, both numeric and categorical, and use it as input for anticlustering:
all_features <- data.frame(numeric_vars, df[, c("species", "sex")])
groups <- anticlustering(
all_features,
K = 3,
method = "local-maximum",
standardize = TRUE
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 48 23 40
#> 3 49 23 39
The results are quite convincing. In particular, the penguins’ sex is
better balanced than previously when we used the argument
categories. If we have multiple categorical variables and /
or unequal-sized groups, it may be useful to try out using categorical
variables as factors, instead of using the categories
argument.
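When comparing several solutions, it can be handy to summarize the balance in a single number instead of reading the tables by eye. The helper below is a hypothetical convenience function (`max_imbalance()` is not part of anticlust) that returns the largest difference in counts between any two groups, taken over all levels of one categorical variable. It is a sketch on made-up data.

```r
# Hypothetical helper, not part of anticlust: largest difference in
# counts between any two groups, over all levels of `category`
max_imbalance <- function(groups, category) {
  counts <- table(groups, category)
  max(apply(counts, 2, function(x) max(x) - min(x)))
}

# Made-up example: two groups of three cases each
groups <- c(1, 1, 1, 2, 2, 2)
sex <- c("female", "female", "male", "female", "male", "male")
max_imbalance(groups, sex)
#> [1] 1
```

A value of 0 means that every level occurs equally often in all groups. A value of 1 is the best possible result whenever the number of cases in a level is not divisible by the number of groups.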
If we also wish to ensure that the categorical variables in their
combination are balanced between groups, we must do some manual
data preparation. For anticlustering, categorical variables are
converted into a binary representation via “one hot” encoding. The
anticlust package has the convenience function
categories_to_binary() for this purpose.[1] This conversion is done
internally by anticlustering() when using categorical
variables as part of the data input (as factors). In that case, however,
combinations of categorical variables are not considered. To consider
combinations, we can manually create our data set with binary
categorical variables, setting the optional argument
use_combinations of categories_to_binary() to
TRUE. First, let’s see how we would manually encode
categorical variables without considering their combinations. We will
use collection year (2007, 2008, 2009) and species as categorical
variables:
binary_categories <- categories_to_binary(df[, c("species", "year")], use_combinations = FALSE)
data_input <- data.frame(binary_categories, numeric_vars)
groups <- anticlustering(
data_input,
K = 3,
method = "local-maximum",
standardize = TRUE
)
table(groups, df$year, df$species)
#> , , = Adelie
#>
#>
#> groups 2007 2008 2009
#> 1 15 17 17
#> 2 14 16 18
#> 3 15 17 17
#>
#> , , = Chinstrap
#>
#>
#> groups 2007 2008 2009
#> 1 8 7 8
#> 2 9 6 8
#> 3 9 5 8
#>
#> , , = Gentoo
#>
#>
#> groups 2007 2008 2009
#> 1 11 14 14
#> 2 11 16 13
#> 3 11 15 14
When setting use_combinations = TRUE, we will also
balance the proportions of species collected in each year across groups,
which was not explicitly done before:
binary_categories <- categories_to_binary(df[, c("species", "year")], use_combinations = TRUE)
data_input <- data.frame(binary_categories, numeric_vars)
groups <- anticlustering(
data_input,
K = 3,
method = "local-maximum",
standardize = TRUE
)
table(groups, df$year, df$species)
#> , , = Adelie
#>
#>
#> groups 2007 2008 2009
#> 1 14 17 18
#> 2 15 16 17
#> 3 15 17 17
#>
#> , , = Chinstrap
#>
#>
#> groups 2007 2008 2009
#> 1 9 6 8
#> 2 9 6 8
#> 3 8 6 8
#>
#> , , = Gentoo
#>
#>
#> groups 2007 2008 2009
#> 1 11 15 13
#> 2 11 15 14
#> 3 11 15 14
Now, the year of data collection is balanced as evenly as possible across groups
for each of the three species (the counts differ by at most one). This was not
achieved with use_combinations = FALSE above, and it is not guaranteed when
using the categories as factors, which internally sets
use_combinations = FALSE.
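To get an intuition for why combinations need extra columns, the following sketch compares two design matrices built with `model.matrix()` on made-up data. I have not verified that `categories_to_binary()` builds exactly these columns, so treat this as an assumption-laden illustration of main effects versus interactions and not as a description of the package internals.

```r
# Toy data: every combination of three species and three years
toy <- expand.grid(
  species = factor(c("Adelie", "Chinstrap", "Gentoo")),
  year = factor(c(2007, 2008, 2009))
)

# Main effects only: columns describe species and year separately
main_only <- model.matrix(~ species + year, data = toy)

# With interactions: additional columns describe species-by-year cells
with_combinations <- model.matrix(~ species * year, data = toy)

ncol(main_only)
ncol(with_combinations)
```

The second matrix has more columns because it also encodes the species-by-year cells. Only when such columns are part of the input can the anticlustering objective "see" whether a particular species from a particular year is distributed evenly across groups.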
[1] Internally, categories_to_binary() is a
wrapper around the base R function
model.matrix().