Random sampling employing a categorical constraint — categorical

This function can be used to obtain a stratified split of a data set.

categorical_sampling(categories, K)

Arguments

categories: A matrix or vector of one or more categorical variables.
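K: The number of groups that are returned. As the examples show, a vector of group sizes (e.g., `K = c(24, 24, 48)`) can be passed instead to obtain groups of unequal size.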

Value

A vector representing the sample each element was assigned to.

Details

Using this function is like calling `anticlustering()` with the argument `categories`, but without optimizing a clustering objective: each category is split as evenly as possible between the samples. Apart from the restriction that categories are balanced between samples, the split is random.
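
To make the idea of an even but otherwise random split concrete, here is a minimal base R sketch. It is not the implementation used by `categorical_sampling`. It assumes a single categorical vector without missing values and an integer `K` (equal-sized groups only). The helper name `stratified_split` and the `rooms` vector are made up for illustration.

```r
# Hypothetical helper, not part of any package: stratified random split
stratified_split <- function(categories, K) {
  groups <- integer(length(categories))
  # Random starting label, so leftover elements do not always go to group 1
  offset <- sample.int(K, 1)
  for (level in unique(categories)) {
    idx <- which(categories == level)
    # Shuffle the elements of this category (safe for length-1 vectors)
    idx <- idx[sample.int(length(idx))]
    # Deal group labels round-robin, continuing where the last category stopped
    groups[idx] <- ((seq_along(idx) + offset - 2) %% K) + 1
    offset <- offset + length(idx)
  }
  groups
}

set.seed(123)
rooms <- rep(c("bathroom", "kitchen"), each = 48)  # made-up data
groups <- stratified_split(rooms, K = 6)
table(groups, rooms)  # every cell is 8 (48 / 6)
```

Dealing labels round-robin within each shuffled category keeps the per-category counts of any two groups within one of each other, while the shuffle decides which elements end up in which group.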

Examples


data(schaper2019)
categories <- schaper2019$room
groups <- categorical_sampling(categories, K = 6)
table(groups, categories)
#>       categories
#> groups bathroom kitchen
#>      1        8       8
#>      2        8       8
#>      3        8       8
#>      4        8       8
#>      5        8       8
#>      6        8       8

# Unequal-sized groups
groups <- categorical_sampling(categories, K = c(24, 24, 48))
table(groups, categories)
#>       categories
#> groups bathroom kitchen
#>      1       12      12
#>      2       12      12
#>      3       24      24

# Heavily unequal group sizes: the categories are harder to balance
groups <- categorical_sampling(categories, K = c(51, 19, 26))
table(groups, categories)
#>       categories
#> groups bathroom kitchen
#>      1       26      25
#>      2       10       9
#>      3       12      14
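
The imbalance in the last example follows from simple arithmetic, sketched here in base R. The per-category counts (48 each) are read off the tables above; the variable names are made up for illustration.

```r
n_per_category <- c(bathroom = 48, kitchen = 48)
group_sizes <- c(51, 19, 26)
# Ideal (proportional) number of elements per group and category
ideal <- outer(group_sizes / sum(group_sizes), n_per_category)
ideal
# Rows 1 and 2 are not whole numbers (25.5 and 9.5), so the counts
# in these groups must differ between the two categories
```

Because groups of size 51 and 19 would each need half an element per category, a perfectly balanced split does not exist for these group sizes.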