New features

  • An exact ILP method for maximizing the dispersion is now available, contributed by Max Diekhoff.
  • kplus_moment_variables() is a new exported function that generates k-plus variables from a data set.
    • It offers some additional flexibility as compared to calling kplus_anticlustering(), which generates these variables internally (e.g., use k-plus augmentation on some variables but not all, such as binary variables).
  • categories_to_binary() is a new exported function that converts one or several categorical variables into a binary representation.
    • It can be used to include categorical variables as part of the optimization criterion in k-means / k-plus anticlustering; see the new vignette “Using categorical variables with anticlustering”.
  • It is now possible to use the SYMPHONY solver as backend for the optimal ILP methods.
  • kplus_anticlustering() now has an argument T instead of moments, where T denotes the number of distribution moments considered during k-plus anticlustering (moments was an integer vector specifying each individual moment that should be considered).
    • Explanation: Lower order moments should not be skipped in favour of higher order moments, so the new interface makes more sense.

Internal changes

  • Fixed a bug in kplus_anticlustering() that did not correctly implement preclustering = TRUE.
  • Implements some fixes in the internal function gdc_set(), which finds the greatest common divisor of a set of numbers. The fixes prevent categorical_sampling() (which is also called by anticlustering() when using the categories argument) from potentially running into an infinite loop when combining uneven group sizes via K with a categories argument.

Documentation

  • Expanded documentation of fast_anticlustering().
  • The vignette “Speeding up anticlustering” has been rewritten to reflect that fast_anticlustering() is now again the best choice for processing (very) large data sets.
  • Three new vignettes have been added to the anticlust documentation.
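A minimal sketch of how the new helpers might be combined with anticlustering(). The choice of the iris data, T = 2, and the k-means objective are for illustration only; consult the function help pages for the exact argument semantics:

```r
library(anticlust)

numeric_vars <- iris[, 1:4]

# Generate k-plus variables by hand (means + variances, i.e., T = 2),
# then run plain k-means anticlustering on the augmented data:
kplus_data <- kplus_moment_variables(numeric_vars, T = 2)
groups <- anticlustering(kplus_data, K = 3, objective = "variance")

# Include a categorical variable in the optimization criterion via
# its binary representation:
binary_species <- categories_to_binary(iris$Species)
groups2 <- anticlustering(
  cbind(kplus_data, binary_species),
  K = 3,
  objective = "variance"
)
table(groups2, iris$Species)
```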

Major changes

  • This release adds a new exported function and removes two others (I very much doubt anyone used those, though – see below – if your code is affected, please email me).
    • kplus_anticlustering() is a new exported function: A new interface function to k-plus anticlustering, implementing the k-plus method as described in “K-plus Anticlustering: An Improved K-means Criterion for Maximizing Between-Group Similarity” (Papenberg, 2023; https://doi.org/10.1111/bmsp.12315). Using anticlustering(x, K, objective = "kplus") is still supported and remains unchanged. The new function kplus_anticlustering(), however, offers more functionality and nuance with regard to optimizing the k-plus objective family.
    • The function kplus_objective() was removed.
    • The function mean_sd_obj() was removed.
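For reference, both entry points to k-plus anticlustering might be called as follows; a sketch using the iris data, assuming the default settings of kplus_anticlustering() mirror objective = "kplus":

```r
library(anticlust)

features <- iris[, 1:4]

# Classical call, still supported and unchanged:
g1 <- anticlustering(features, K = 3, objective = "kplus")

# New dedicated interface with more options around the
# k-plus objective family:
g2 <- kplus_anticlustering(features, K = 3)
```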

Explanations for the rather drastic changes, i.e., removing instead of deprecating functions (that very likely do not affect anyone):

  • Given the advanced theoretical background for k-plus anticlustering, the function kplus_objective() no longer makes sense. Because the k-plus objective is a family of objectives, keeping a function that computes only one special case is more harmful than removing it now. Moreover, since the k-plus objective basically re-uses the k-means criterion, maintaining a function such as kplus_objective() was questionable to begin with.

  • Since there is the k-plus anticlustering method now, I did not want to keep the “hacky” way to optimize similarity with regard to means and standard deviations, i.e., using the mean_sd_obj() function as objective in anticlustering. Please use the k-plus method to optimize similarity with regard to means and standard deviations (you can even extend to skewness, kurtosis, and other higher order moments; see the new kplus_anticlustering() function).

Minor changes

  • Finally added Marie Luisa Schaper as a contributor, for contributing her data set
  • Some work on documentation and examples
  • Minor bug fix in C code base via c1a5604f
  • anticlust now includes the bicriterion algorithm for simultaneously maximizing diversity and dispersion, proposed by Brusco et al. (doi:10.1111/bmsp.12186) and implemented by Martin Breuer (for details see his bachelor thesis)
    • It can be called from the main function anticlustering() by setting method = "brusco"; in this case only either dispersion or diversity is maximized
    • bicriterion_anticlustering() – newly exported in this version – can be used for more fine-grained usage of the Brusco et al. algorithm, fully using its main functionality of optimizing both dispersion and diversity
  • Just an update to the documentation: all references to the Papenberg & Klau paper have been updated after its publication in Psychological Methods:

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301
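The Brusco et al. algorithm mentioned above can be invoked in either way; a minimal sketch (the iris data and K = 3 are only for illustration, and additional tuning arguments of bicriterion_anticlustering() are not spelled out here):

```r
library(anticlust)

distances <- dist(iris[, 1:4])

# Via the main interface: only one of the two criteria is maximized
groups <- anticlustering(distances, K = 3, method = "brusco")

# Via the dedicated function, which optimizes both dispersion
# and diversity simultaneously:
pareto <- bicriterion_anticlustering(distances, K = 3)
```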

User-visible changes

  • plot_clusters() now uses the default color palette to highlight the different clusters
  • plot_clusters() now uses different pch symbols when the number of clusters is low (K < 8)

Internal changes

  • anticlustering() and categorical_sampling() now better balance categorical variables when the output groups require different sizes (i.e., if the group sizes do not share a common divisor)

  • Some additional input validations for more useful error messages when arguments in anticlustering() are not correctly specified

New feature

  • anticlustering() has a new argument standardize to standardize the data input before the optimization starts. This is useful to give all variables the same weight in the anticlustering process, regardless of their scaling. It is especially useful for objective = "kplus" to ensure that minimizing differences with regard to means and minimizing differences with regard to variances are equally important.
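A minimal sketch of the new argument; the simulated two-column data set is made up for illustration:

```r
library(anticlust)

# Variables on very different scales:
x <- data.frame(
  a = rnorm(120, mean = 0, sd = 1),
  b = rnorm(120, mean = 0, sd = 1000)
)

# Without standardization, variable b would dominate the objective;
# standardize = TRUE gives both variables equal weight:
groups <- anticlustering(x, K = 4, objective = "kplus", standardize = TRUE)
```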

Bug fix

  • Fixes a memory leak in the C code base, via 2c4fe6d
  • Internal change: anticlustering() with objective = "dispersion" now implements the local updating procedure proposed by Martin Breuer. This leads to a considerable speedup when maximizing the dispersion, enabling the fast processing of large data sets.

User-visible changes

  • anticlustering() now has native support for maximizing the dispersion objective via objective = "dispersion". The dispersion is the minimum distance between any two elements within the same cluster; see ?dispersion_objective.

Internal changes

  • The exchange optimization algorithm for anticlustering has been reimplemented in C, leading to a substantial boost in performance when using one of the supported objectives “diversity”, “variance”, “dispersion”, or “kplus”. (Optimizing user-defined objective functions still has to be done in plain R and therefore has not been sped up.)
  • kplus_objective() is a new function to compute the value of the k-plus criterion given a clustering. See ?kplus_objective for details.

  • In anticlustering() and categorical_sampling(), the argument K can now be used to specify the size of the groups, not just the number of groups. This way, it is easy to request groups of different size. See the help pages ?anticlustering and ?categorical_sampling for examples.
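A sketch of the extended K argument; the sizes chosen below are arbitrary and must sum to N (here, the 150 rows of iris):

```r
library(anticlust)

x <- iris[, 1:4]  # N = 150

# K as number of groups: three groups of 50
g_equal <- anticlustering(x, K = 3)

# K as group sizes: groups of 50, 60, and 40
g_unequal <- anticlustering(x, K = c(50, 60, 40))
table(g_unequal)

# Works analogously for stratified splitting:
strata <- categorical_sampling(iris$Species, K = c(75, 75))
```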

  • Fixed two minor bugs that prevented the correct transformation of class dist to class matrix when using the repeated exchange (or “local-maximum”) method, see c42e136 and e6fdae5.

User-visible changes

  • In anticlustering(), there is a new option for the argument method: “local-maximum”. When using method = "local-maximum", the exchange method is repeated until a local maximum is reached. That means that after the exchange process has been conducted for each data point, the algorithm restarts with the first element and continues to conduct exchanges until the objective can no longer be improved. This procedure is more in line with classical neighbourhood search, which only terminates when a local optimum is reached.

  • In anticlustering(), there is now a new argument repetitions. It can be used to specify the number of times the exchange procedure (either method = "exchange" or method = "local-maximum") is called. anticlustering() returns the best partitioning found across all repetitions.
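The two options above can be combined; a short sketch (the iris data and 10 repetitions are just for illustration):

```r
library(anticlust)

# Restart the local-maximum search 10 times and
# return the best partition found overall:
groups <- anticlustering(
  iris[, 1:4],
  K = 3,
  method = "local-maximum",
  repetitions = 10
)
```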

  • anticlustering() now implements a new objective function, extending the classical k-means criterion, given by objective = "kplus". Using objective = "kplus" will minimize differences with regard to both means and standard deviations of the input variables, whereas k-means only focuses on the means. Details on this objective will follow.

  • Fixes a bug in anticlustering() that led to an incorrect computation of cluster centers with option objective = "variance" for unequal cluster sizes, see 2ef6547

User-visible changes

Major

  • A new exported function: categorical_sampling(). Categorical sampling can be used to obtain a stratified split of a data set. Using this function is like calling anticlustering() with argument categories, but no clustering objective is maximized. The categories are just evenly split between samples, which is very fast (in contrast to the exchange optimization that may take some time for large data sets). Apart from the categorical restriction that balances the frequency of categories between samples, the split is random.
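A minimal sketch of the new function, using the iris species as the stratification variable (any categorical vector works):

```r
library(anticlust)

# Stratified split of the iris data into 5 samples; each sample
# receives an even share of each species (here, 10 plants per
# species), otherwise the split is random:
samples <- categorical_sampling(iris$Species, K = 5)
table(samples, iris$Species)
```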

  • The function distance_objective() was renamed to diversity_objective() because there are several clustering objectives based on pairwise distances, e.g., the new function dispersion_objective().

  • dispersion_objective() is a new function to compute the dispersion of a given clustering, i.e., the minimum distance between two elements within the same group. Maximizing the dispersion is an anticlustering task, see the help page of dispersion_objective() for an example.
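A sketch of computing the dispersion for a given clustering; the data and K are illustrative:

```r
library(anticlust)

features <- iris[, 1:4]
groups <- anticlustering(features, K = 3)

# Minimum distance between two elements within the same group:
dispersion_objective(features, groups)
```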

Minor

  • Several changes to the documentation, in particular now highlighting the publication of the paper “Using Anticlustering to Partition Data Sets Into Equivalent Parts” (https://doi.org/10.1037/met0000301) describing the algorithms and criteria used in the package anticlust

  • In anticlustering(), anticluster editing is now by default requested using objective = "diversity" (but objective = "distance" is still supported and leads to the same behaviour). This change was done because there are several anticlustering objectives based on pairwise distances.

  • anticlustering() can no longer use an argument K of length > 1 with preclustering = TRUE because this resulted in undocumented behaviour (this is a good change because it does not make sense to specify an initial assignment of elements to groups via K and at the same time request that preclustering handles the initial assignment)

  • When using a custom objective function, the order of the required arguments is now reversed: The data comes first, the clustering second.

  • Because the order of arguments in custom objective functions was reversed, the function mean_sd_obj() now has reversed arguments as well.

  • The package vignettes are no longer distributed with the package itself because rendering R Markdown resulted in an error with the development version of R. This may change again in the future when R Markdown no longer throws an error with R devel. The vignette is currently available via the package website (https://m-py.github.io/anticlust/).

Internal changes

  • Improved running speed of generating constraints in integer linear programming variant of (anti)clustering, via 0a870240f8

User-visible changes

  • In anticlustering(), preclustering and categorical constraints can now be used at the same time. In this case, exchange partners are clustered within the same category, using a call to matching() passing categories to argument match_within.

  • In anticlustering(), it is now possible to use preclustering = TRUE for unbalanced data size (e.g., if N = 9 and K = 2).

  • In matching(), it is now possible to prevent sorting the output by similarity using a new argument sort_output. Its default is TRUE, setting it to FALSE prevents sorting. This prevents some extra computation that is necessary to determine similarity for each cluster.
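A sketch of the new argument, here matching triplets of similar elements; the choice of data columns and p = 3 is made up for illustration:

```r
library(anticlust)

# Find triplets of similar elements; skip the extra computation
# that sorts the output clusters by within-cluster similarity:
m <- matching(iris[, 1:4], p = 3, sort_output = FALSE)
```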

Minor

Internal

  • Improvements to implementation of k-means anticlustering (i.e., in anticlustering() with objective == "variance" or in fast_anticlustering())
    • In each exchange iteration, distances are only recomputed for clusters whose elements have been swapped (improves run time, especially relevant for larger K).
    • Previously, each element only had as many exchange partners as there were members in the least frequent category (if the argument categories was passed). This was undocumented behaviour and undesirable. Now, all members of a category may serve as exchange partners, even if the categories have different sizes.

Major changes

  • matching() is a new function for unrestricted or K-partite matching to find groups of similar elements.

  • plot_similarity() is a new function to plot similarity by cluster (according to the cluster editing criterion).

  • All clustering and anticlustering functions now only take one data argument (called x) instead of either features or distances.

  • The argument iv was removed from anticlustering() because it does not fit the anticlustering semantic (anticlustering should make sets «similar» and not dissimilar).

  • The random sampling method for anticlustering was removed. This implies that the anticlustering() function no longer has an argument nrep.

  • The functions initialize_K() and generate_exchange_partners() were removed.

  • Dropped support for the commercial integer linear programming solvers CPLEX and gurobi for exact (anti)cluster editing. If this functionality is needed, install version 0.3.0 from Github:

remotes::install_github("m-Py/anticlust", ref = "v0.3.0")
  • mean_sd_obj() no longer computes the discrepancy in medians, only the discrepancy in means and standard deviations (as the name suggests).

  • In plot_clusters(), the arguments col and pch were removed.

  • In plot_clusters(), the argument clustering was renamed to clusters.

  • In generate_partitions(), the order of the arguments N and K was switched (the order is now consistent with n_partitions()).

  • In balanced_clustering(), the default method was renamed to "centroid" from "heuristic".

  • Release of the package version used in the manuscript »Using anticlustering to partition a stimulus pool into equivalent parts« (Papenberg & Klau, 2019; https://doi.org/10.31234/osf.io/3razc)