Since the anticlust
Github repository is
currently quite the mess for different reasons, it seems reasonable to
write down the status of the different branches that are currently
developed, mostly for myself to stay on top of things.
The main branch has version 0.8.2 of anticlust. This version is not on CRAN, and ironically, version 0.8.1, which is currently on CRAN, is not released on Gihub. The reason for this is my very personal CRAN horror story. Two days after my paternal leave started in September 2023, an error occurred (“additional issue”) on the CRAN results page, indicating that on a specific version of a mac computer (“M1mac”), there was a segfault. This error was not removed by submitting an intermediate patch 0.8.0-1, and even remained shortly after I submitted version 0.8.1, which is a 100% copy of the previous version 0.7.0, which did not have any issues on CRAN.1 However, the issue disappeared shortly after. This does mean that right now, the new features from version 0.8.0 are not available from CRAN, but only from Github or r-universe.
Version 0.8.2, which is the latest published version of Github
introduces one regression as compared to the version 0.8.0: the fix, which
addressed anticlustering()
crashing for very large data
sets, is not included, and it probably never will.2 Otherwise, it is the
same code of anticlust
. I will soon resubmit a version of
anticlust
to CRAN, which will probably feature one
additional change to version 0.8.2, which is that I could make
fast_anticlustering()
even
faster, especially for many clusters, by removing some unneeded
computations and copying. So (hopefully) expect anticlust version 0.8.3
on CRAN in the near future.
When a new version is published on CRAN, I will also update the package homepage
to the new version; here, the latest version is still version 0.8.0. I
will not include version 0.8.0-1 and 0.8.1 in the change logs
of anticlust
, because they were only introduced to fix some
CRAN issues.
Here, most development is happening with regard to my latest research project, which addresses the bicriterion anticlustering algorithm BILS by Brusco et al. (2020). The NEWS page in that branch already lists some of the changes.
Some other notable changes include:
anticlustering()
from R and checks when no
improvement occurs), as well as the argument repetitions
.
This speeds up computation a bit and facilitates some features that I
currently implement with regard to combining cannot-link constraints
with anticlustering.repetitions
in
anticlustering(..., objective = "diversity")
. This may be
useful to include cannot-link constraints.cannot_link_anticlustering()
,
which however will not remain in the package as is; rather, it will be
made available as part of other functions, but I am not yet sure how
(e.g., add an argument cannot_link
to
anticlustering()
)."average-diversity"
. It is
computed just as the diversity, but the sum of within-group distances is
divided by the group size before the global sum of within-group
distances is computed. Actually, this does not change the computation
for equal-sized anticlusters, but using the average diversity is much
more useful if the groups are different-sized (if the diversity is used
as a measure of between-group similarity, which it usually is). The
average diversity objective is now also available from the function
bicriterion_anticlustering()
.This branch includes a heuristic local maximum exchange method for
the weighted cluster editing / clique partitioning problem. It is
actually quite fast, and I think that with this function I kind of
perfected the local updating procedure for clustering problems.
Unfortunately it is based on the old master branch of
anticlust
, which has many unused commits that are not part
of the current main branch. So I guess I need to conduct some cherry picking if
this feature is included in the released version of anticlust at some
point.
This branch probably has the most unfortunate name and unfortunately
is based on the “wce_heuristic” branch. It implements yet another
version of k-means anticlustering and uses the fact that the sum of
within-cluster squared Euclidean distances (which is the diversity,
using the squared Euclidean distance as input) is equivalent to the
k-means objective; at least when the group sizes are equal - otherwise,
we would need the “average” diversity to maintain the equivalence. The
diversity implementation of anticlustering profits from having many
groups (because increasing the number of groups leads to a quadratic
decrease in the number of within-group distances) and I assumed that
using this implementation may be best suited for very many groups (e.g.,
K = N/2). Unfortunately, we cannot use the quadratic matrix of pairwise
distances for very large N. For this reason, the new implementation does
not store this matrix and instead computes the distances when they are
needed. It seems that this implementation can actually outperform the
very fast implementation in fast_anticlustering()
, but only
for highly specific conditions (such as K = N/2 for very large data
sets).
Based on my experienced implementing this new k-means anticlustering
version, I was however able to speed up
fast_anticlustering()
even more than before by preventing
some redundant computations and copying. It uses the logic that I
first used for the heuristic cluster editing function, as discussed
above. This functionality is now also included in the “diversity2”
branch, but I would like to have it in the next anticlust release, so
some more cherry picking will be required.
In the anticlust
repository, there are even some more
branches, many are old and stale or there is not really anything going
on there. The branch “gurobisolver” may be useful at some point for
research, because it includes the commercial gurobi mathematical
programming solver for integer linear programming. The “must-link”
branch will probably be unearthed at some point when I proceed to
working on these kind of constraints with anticlustering. I am not sure
if the other branches will see the light of the day again.
Last updated: 2024-02-07
For some strange reason, the issue persisted even with the old anticlust code (i.e., version 0.8.1) at first, but after I wrote another email to the CRAN maintainers, it disappeared from the results page.↩︎
This is not so bad, because you should use
fast_anticlustering()
for such data sets anyway.↩︎