I just submitted anticlust version 0.8.3 to CRAN and hope for the best that it remains there without any CRAN issues. I use this opportunity to give an update on the status of anticlust development.
Version 0.8.3 brings back fast_anticlustering() to CRAN, which I removed temporarily due to CRAN issues described here (which were probably not related to fast_anticlustering(), however). While preparing the re-release, a unit test revealed that anticlustering() and fast_anticlustering() produced different results for unequal cluster sizes. Past Martin was smart enough to write that test; Present Martin was not as smart and initially thought the test result was bogus and reacted by removing the test, but later restored it and fixed the implementation. The new version thus includes this bug fix in anticlustering() and otherwise is the same as 0.8.2.

I previously described some important development happening on this development branch. Since then, I implemented an interface for employing
cannot-link constraints with anticlustering, which is given through the argument cannot_link in anticlustering(). I strongly assume that this is how it will remain. The implementation is already quite potent: it can be used in conjunction with all implemented algorithms ("exchange", "local-maximum", "brusco", "ilp"), even the optimal integer linear programming algorithm, and for the objectives "diversity", "variance", and "kplus". I still need to implement unit tests for this argument, and the selection of the partition is currently not correctly implemented for method = "brusco" (this is an easy fix; it just should not be forgotten prior to a release).
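To make the new interface concrete, here is a minimal usage sketch. The input format for cannot_link (a two-column matrix of element indices, where each row marks a pair that must not be assigned to the same group) is my assumption based on the description above, not a confirmed specification.

```r
library(anticlust)

features <- iris[, 1:4]

# Assumed input format: each row gives the indices of two elements
# that must not end up in the same group
constraints <- matrix(
  c(1, 2,
    1, 3),
  ncol = 2,
  byrow = TRUE
)

groups <- anticlustering(
  features,
  K = 3,
  objective = "kplus",
  method = "local-maximum",
  cannot_link = constraints
)

# The constrained pairs should be separated
groups[1] != groups[2]  # expected: TRUE
groups[1] != groups[3]  # expected: TRUE
```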
K-means anticlustering in anticlustering()

This section gives an outlook on the implementation of k-means anticlustering in anticlustering().
It is possible to unite the implementations of the diversity objective and the k-means objective because the k-means objective can also be formulated as a “normalized” diversity, i.e., the sum of the squared pairwise distances within each cluster, divided by the number of objects in that cluster. The standard diversity is just defined as the plain sum of within-cluster dissimilarities, but the addition of a normalizing factor is straightforward and has already been done (as already described here). Thus, I can just re-use the diversity implementation for k-means and k-plus anticlustering, which has some advantages pertaining to maintenance. I just need to compute the correct input distances depending on the user input, and I am already doing this when cannot-link constraints are inserted.
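As a quick sanity check of this equivalence, the following sketch compares the k-means criterion, computed via variance_objective(), with the normalized diversity computed by hand from squared Euclidean distances (a verification on random data, not part of the package):

```r
library(anticlust)

set.seed(123)
x <- matrix(rnorm(40), ncol = 2)
clusters <- sample(rep(1:2, each = 10))

# k-means objective: sum of squared distances between elements
# and their cluster centroids
kmeans_objective <- variance_objective(x, clusters)

# "Normalized" diversity: per cluster, the sum of squared pairwise
# distances divided by the cluster size. The full symmetric distance
# matrix counts each pair twice, hence the division by 2 * n_k.
d2 <- as.matrix(dist(x))^2
normalized_diversity <- sum(sapply(unique(clusters), function(k) {
  members <- clusters == k
  sum(d2[members, members]) / (2 * sum(members))
}))

all.equal(kmeans_objective, normalized_diversity)  # TRUE
```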
This means that I could retire the current implementation of k-means anticlustering in anticlustering(), which maximizes the sum of squared Euclidean distances between the data points and their cluster centroids (the “variance”). I would keep the k-means implementation in fast_anticlustering(), which minimizes the normalized sum of squared distances between the cluster centroids and the overall centroid. This minimization is equivalent to maximizing the variance, and it is by far the most efficient way to optimize the k-means criterion.
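The equivalence follows from the standard decomposition of the total sum of squares, which is constant for a given data set (my notation: $C_k$ is cluster $k$ with $n_k$ elements, $\bar{x}_k$ its centroid, and $\bar{x}$ the overall centroid):

$$\underbrace{\sum_{i=1}^{N} \lVert x_i - \bar{x} \rVert^2}_{\text{constant}} = \underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2}_{\text{within clusters (the variance objective)}} + \underbrace{\sum_{k=1}^{K} n_k \, \lVert \bar{x}_k - \bar{x} \rVert^2}_{\text{between clusters}}$$

Because the left-hand side is fixed, minimizing the between-cluster term amounts to maximizing the within-cluster term.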
It is particularly well suited for large data sets. Given that computing the quadratic distance matrix is usually the bottleneck for large data sets when maximizing the diversity, replacing variance maximization with diversity maximization is not a problem: if the data set is too large, just use fast_anticlustering(), which is still possible and even recommended. Maximizing the variance can be faster in some constellations than maximizing the diversity (as discussed in this package vignette), but the discrepancy is not very pronounced. And as stated, we can just use fast_anticlustering() for speed.
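For reference, here is a minimal sketch of the large-data route. I am assuming the k_neighbours argument of fast_anticlustering(), which restricts the exchange partners of each element to its nearest neighbours and thereby avoids quadratic run time; the data is simulated:

```r
library(anticlust)

# Simulated large data set; no N x N distance matrix is required
set.seed(42)
large_data <- matrix(rnorm(100000 * 2), ncol = 2)

# Restricting exchanges to a few nearest neighbours keeps the
# optimization fast even for large N
groups <- fast_anticlustering(
  large_data,
  K = 5,
  k_neighbours = 10
)

table(groups)
```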
I will consider this option, as I have not decided yet.
Last updated: 2024-04-24