R/sample-by-category.R
categorical_sampling.Rd
This function can be used to obtain a stratified split of a data set.
categorical_sampling(categories, K)
A vector representing the sample each element was assigned to.
Using this function is like calling `anticlustering()` with the argument `categories`, but without optimizing a clustering objective. The categories are just evenly split between samples. Apart from the restriction that categories are balanced between samples, the split is random.
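The idea of a random split that is balanced on categories can be sketched in base R. This is only an illustration under simplifying assumptions, not the code used by the package: the helper name `stratified_split_sketch` is hypothetical, it accepts only a single number `K` of equal-sized samples, and the toy `categories` vector is made up for the example.

```r
# Hypothetical helper, not part of the package: assign each element to one
# of K samples so that every category is spread evenly across the samples.
stratified_split_sketch <- function(categories, K) {
  groups <- integer(length(categories))
  # Process the positions of each category separately
  for (idx in split(seq_along(categories), categories)) {
    # Shuffle the positions of this category (the random part) ...
    shuffled <- idx[sample.int(length(idx))]
    # ... then deal them out to samples 1, 2, ..., K in turn (the balance part)
    groups[shuffled] <- rep_len(seq_len(K), length(idx))
  }
  groups
}

set.seed(1)
# Toy data: two categories with 48 elements each
categories <- rep(c("bathroom", "kitchen"), each = 48)
groups <- stratified_split_sketch(categories, K = 6)
# Every sample holds 8 elements of each category
table(groups, categories)
```

Because each category is dealt out in turn starting from sample 1, this sketch gives the lower-numbered samples any leftover elements when a category size is not divisible by `K`.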
library(anticlust)
data(schaper2019)
categories <- schaper2019$room
groups <- categorical_sampling(categories, K = 6)
table(groups, categories)
#>       categories
#> groups bathroom kitchen
#>      1        8       8
#>      2        8       8
#>      3        8       8
#>      4        8       8
#>      5        8       8
#>      6        8       8
# Unequal sized groups
groups <- categorical_sampling(categories, K = c(24, 24, 48))
table(groups, categories)
#>       categories
#> groups bathroom kitchen
#>      1       12      12
#>      2       12      12
#>      3       24      24
# Heavily unequal group sizes: it is harder to balance the categories across groups
groups <- categorical_sampling(categories, K = c(51, 19, 26))
table(groups, categories)
#>       categories
#> groups bathroom kitchen
#>      1       25      26
#>      2        9      10
#>      3       14      12