Maximize dispersion for K groups

optimal_dispersion(
  x,
  K,
  solver = NULL,
  max_dispersion_considered = NULL,
  min_dispersion_considered = NULL,
  npartitions = 1,
  time_limit = NULL
)

Arguments

x

The data input. Can be one of two structures: (1) A feature matrix where rows correspond to elements and columns correspond to variables (a single numeric variable can be passed as a vector). (2) An N x N matrix dissimilarity matrix; can be an object of class dist (e.g., returned by dist or as.dist) or a matrix where the entries of the upper and lower triangular matrix represent pairwise dissimilarities.

K

The number of groups or a vector describing the size of each group.

solver

Optional argument; currently supports "lpSolve", "glpk", and "symphony". See optimal_anticlustering.

max_dispersion_considered

Optional argument used for early stopping. If the dispersion found is equal to or exceeds this value, a solution having the previous best dispersion is returned.

min_dispersion_considered

Optional argument used for speeding up the algorithm computation. If passed, the dispersion is optimized starting from this value instead the global minimum distance.

npartitions

The number of groupings that are returned, each having an optimal dispersion value (defaults to 1).

time_limit

Time limit in seconds, given to the solver. Default is there is no time limit.

Value

A list with four elements:

dispersion

The optimal dispersion

groups

An assignment of elements to groups (vector)

edges

A matrix of 2 columns. Each row contains the indices of elements that had to be investigated to find the dispersion (i.e., each pair of elements cannot be part of the same group in order to achieve maximum dispersion).

dispersions_considered

All distances that were tested until the dispersion was found.

Details

The dispersion is the minimum distance between two elements within the same group. This function implements an optimal method to maximize the dispersion. If the data input x is a feature matrix and not a dissimilarity matrix, the pairwise Euclidean distance is used. It uses the algorithm presented in Max Diekhoff's Bachelor thesis at the Computer Science Department at Heinrich Heine University Düsseldorf.

To find out which items are not allowed to be grouped in the same cluster for maximum dispersion, the algorithm sequentially builds instances of a graph coloring problem, using an integer linear programming (ILP) representation (also see Fernandez et al., 2013). It is possible to specify the ILP solver via the argument solver (See optimal_anticlustering for more information on this argument). Optimally solving the maximum dispersion problem is NP-hard for K > 2 and therefore computationally infeasible for larger data sets. For K = 3 and K = 4, it seems that this approach scales up to several 100 elements, or even > 1000 for K = 3 (at least when using the Symphony solver). For larger data sets, use the heuristic approaches in anticlustering or bicriterion_anticlustering. However, note that for K = 2, the optimal approach is usually much faster than the heuristics.

In the output, the element edges defines which elements must be in separate clusters in order to achieve maximum dispersion. All elements not listed here can be changed arbitrarily between clusters without reducing the dispersion. If the maximum possible dispersion corresponds to the minimum dispersion in the data set, the output elements edges and groups are set to NULL because all possible groupings have the same value of dispersion. In this case the output element dispersions_considered has length 1.

If a time_limit is set and the function cannot find in the optimal dispersion in the given time, it will throw an error.

Note

If the SYMPHONY solver is used, an unfortunate "message" is printed to the console when this function terminates:

sym_get_col_solution(): No solution has been stored!

This message is no reason to worry and instead is a direct result of the algorithm finding the optimal value for the dispersion. Unfortunately, this message is generated in the C code underlying the SYMPHONY library (via the printing function printf), which cannot be prevented in R.

References

Diekhoff (2023). Maximizing dispersion for anticlustering. Retrieved from https://www.cs.hhu.de/fileadmin/redaktion/Fakultaeten/Mathematisch-Naturwissenschaftliche_Fakultaet/Informatik/Algorithmische_Bioinformatik/Bachelor-_Masterarbeiten/2831963_ba_ifo_AbschlArbeit_klau_mapap102_madie120_20230203_1815.pdf

Fernández, E., Kalcsics, J., & Nickel, S. (2013). The maximum dispersion problem. Omega, 41(4), 721–730. https://doi.org/10.1016/j.omega.2012.09.005

Author

Max Diekhoff

Martin Papenberg martin.papenberg@hhu.de

Examples


N <- 30
M <- 5
K <- 3
data <- matrix(rnorm(N*M), ncol = M)
distances <- dist(data)

opt <- optimal_dispersion(distances, K = K)
opt
#> $dispersion
#> [1] 1.646841
#> 
#> $groups
#>  [1] 3 3 3 2 1 1 2 1 3 3 3 3 2 1 2 2 3 2 2 1 2 2 3 2 1 1 3 1 1 1
#> 
#> $groups_fixated
#>  [1] NA  3  3 NA  1  1  2  1 NA NA  3  3  2  1  2  2 NA  2  2  1  2 NA  3 NA  1
#> [26]  1  3  1  1  1
#> 
#> $edges
#>       [,1] [,2]
#>  [1,]   20   23
#>  [2,]   11   19
#>  [3,]    8   11
#>  [4,]    5   13
#>  [5,]    2    6
#>  [6,]   12   29
#>  [7,]    6   18
#>  [8,]    5   18
#>  [9,]    3   26
#> [10,]    2    8
#> [11,]    3   15
#> [12,]    5   19
#> [13,]   18   27
#> [14,]    5   27
#> [15,]    5   16
#> [16,]    7   25
#> [17,]    2   18
#> [18,]   14   16
#> [19,]   21   28
#> [20,]   15   26
#> [21,]   16   27
#> [22,]    5   11
#> [23,]    6   15
#> [24,]    8   19
#> [25,]   23   26
#> [26,]   21   23
#> [27,]   11   29
#> [28,]   19   29
#> [29,]   13   27
#> [30,]   19   23
#> [31,]   23   30
#> 
#> $dispersions_considered
#>  [1] 1.028233 1.075159 1.144307 1.159567 1.182799 1.226687 1.260399 1.338457
#>  [9] 1.349471 1.353310 1.367298 1.394381 1.399823 1.404871 1.410823 1.436337
#> [17] 1.466566 1.504632 1.519661 1.522956 1.532941 1.568069 1.572953 1.577198
#> [25] 1.578936 1.580271 1.601445 1.628697 1.628733 1.634875 1.636778 1.646841
#> 

# Compare to bicriterion heuristic:
groups_heuristic <- anticlustering(
  distances, 
  K = K,
  method = "brusco", 
  objective = "dispersion", 
  repetitions = 100
)
c(
  OPT = dispersion_objective(distances, opt$groups),
  HEURISTIC = dispersion_objective(distances, groups_heuristic)
)
#>       OPT HEURISTIC 
#>  1.646841  1.646841 

# Different group sizes are possible:
table(optimal_dispersion(distances, K = c(15, 10, 5))$groups)
#> 
#>  1  2  3 
#> 15 10  5 

# Induce cannot-link constraints by maximizing the dispersion:
solvable <- matrix(1, ncol = 6, nrow = 6)
solvable[2, 1] <- -1
solvable[3, 1] <- -1
solvable[4, 1] <- -1
solvable <- as.dist(solvable)
solvable
#>    1  2  3  4  5
#> 2 -1            
#> 3 -1  1         
#> 4 -1  1  1      
#> 5  1  1  1  1   
#> 6  1  1  1  1  1

# An optimal solution has to put item 1 in a different group than 
# items 2, 3 and 4 -> this is possible for K = 2
optimal_dispersion(solvable, K = 2)$groups
#> [1] 2 1 1 1 2 2

# It no longer works when item 1 can also not be linked with item 5:
# (check out output!)
unsolvable <- as.matrix(solvable)
unsolvable[5, 1] <- -1
unsolvable <- as.dist(unsolvable)
unsolvable
#>    1  2  3  4  5
#> 2 -1            
#> 3 -1  1         
#> 4 -1  1  1      
#> 5 -1  1  1  1   
#> 6  1  1  1  1  1
optimal_dispersion(unsolvable, K = 2)
#> $dispersion
#> [1] -1
#> 
#> $groups
#> NULL
#> 
#> $edges
#> NULL
#> 
#> $dispersions_considered
#> [1] -1
#> 
# But:
optimal_dispersion(unsolvable, K = c(2, 4)) # group sizes, not number of groups
#> $dispersion
#> [1] 1
#> 
#> $groups
#> [1] 2 1 1 1 1 2
#> 
#> $groups_fixated
#> [1]  2  1  1  1  1 NA
#> 
#> $edges
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    1    3
#> [3,]    1    4
#> [4,]    1    5
#> 
#> $dispersions_considered
#> [1] -1  1
#>