the anticlust blog

Anticlust Development Diary, Issue 1, 2024

Since the anticlust Github repository is currently quite the mess for different reasons, it seems reasonable to write down the status of the different branches that are currently developed, mostly for myself to stay on top of things.

The “main” branch (previously master)

The main branch has version 0.8.2 of anticlust. This version is not on CRAN, and ironically, version 0.8.1, which is currently on CRAN, is not released on Gihub. The reason for this is my very personal CRAN horror story. Two days after my paternal leave started in September 2023, an error occurred (“additional issue”) on the CRAN results page, indicating that on a specific version of a mac computer (“M1mac”), there was a segfault. This error was not removed by submitting an intermediate patch 0.8.0-1, and even remained shortly after I submitted version 0.8.1, which is a 100% copy of the previous version 0.7.0, which did not have any issues on CRAN.¹ However, the issue disappeared shortly after. This does mean that right now, the new features from version 0.8.0 are not available from CRAN, but only from Github or r-universe.

Version 0.8.2, which is the latest published version of Github introduces one regression as compared to the version 0.8.0: the fix, which addressed anticlustering() crashing for very large data sets, is not included, and it probably never will.² Otherwise, it is the same code of anticlust. I will soon resubmit a version of anticlust to CRAN, which will probably feature one additional change to version 0.8.2, which is that I could make fast_anticlustering() even faster, especially for many clusters, by removing some unneeded computations and copying. So (hopefully) expect anticlust version 0.8.3 on CRAN in the near future.

When a new version is published on CRAN, I will also update the package homepage to the new version; here, the latest version is still version 0.8.0. I will not include version 0.8.0-1 and 0.8.1 in the change logs of anticlust, because they were only introduced to fix some CRAN issues.

The “bils-e” branch

Here, most development is happening with regard to my latest research project, which addresses the bicriterion anticlustering algorithm BILS by Brusco et al. (2020). The NEWS page in that branch already lists some of the changes.

Some other notable changes include:

(Internal change only) For diversity anticlustering, the local maximum method is now directly implemented in C (it no longer repeatedly calls anticlustering() from R and checks when no improvement occurs), as well as the argument repetitions. This speeds up computation a bit and facilitates some features that I currently implement with regard to combining cannot-link constraints with anticlustering.
(This is currently only available internally and I do not know how this functionality will eventually be made available to users) It is now possible to pass many initial partitions for all repetitions in anticlustering(..., objective = "diversity"). This may be useful to include cannot-link constraints.
Right now, there is an exported function cannot_link_anticlustering(), which however will not remain in the package as is; rather, it will be made available as part of other functions, but I am not yet sure how (e.g., add an argument cannot_link to anticlustering()).
There is now an objective "average-diversity". It is computed just as the diversity, but the sum of within-group distances is divided by the group size before the global sum of within-group distances is computed. Actually, this does not change the computation for equal-sized anticlusters, but using the average diversity is much more useful if the groups are different-sized (if the diversity is used as a measure of between-group similarity, which it usually is). The average diversity objective is now also available from the function bicriterion_anticlustering().

The “wce_heuristic” branch

This branch includes a heuristic local maximum exchange method for the weighted cluster editing / clique partitioning problem. It is actually quite fast, and I think that with this function I kind of perfected the local updating procedure for clustering problems. Unfortunately it is based on the old master branch of anticlust, which has many unused commits that are not part of the current main branch. So I guess I need to conduct some cherry picking if this feature is included in the released version of anticlust at some point.

The “diversity2” branch

This branch probably has the most unfortunate name and unfortunately is based on the “wce_heuristic” branch. It implements yet another version of k-means anticlustering and uses the fact that the sum of within-cluster squared Euclidean distances (which is the diversity, using the squared Euclidean distance as input) is equivalent to the k-means objective; at least when the group sizes are equal - otherwise, we would need the “average” diversity to maintain the equivalence. The diversity implementation of anticlustering profits from having many groups (because increasing the number of groups leads to a quadratic decrease in the number of within-group distances) and I assumed that using this implementation may be best suited for very many groups (e.g., K = N/2). Unfortunately, we cannot use the quadratic matrix of pairwise distances for very large N. For this reason, the new implementation does not store this matrix and instead computes the distances when they are needed. It seems that this implementation can actually outperform the very fast implementation in fast_anticlustering(), but only for highly specific conditions (such as K = N/2 for very large data sets).

Based on my experienced implementing this new k-means anticlustering version, I was however able to speed up fast_anticlustering() even more than before by preventing some redundant computations and copying. It uses the logic that I first used for the heuristic cluster editing function, as discussed above. This functionality is now also included in the “diversity2” branch, but I would like to have it in the next anticlust release, so some more cherry picking will be required.

Other branches

In the anticlust repository, there are even some more branches, many are old and stale or there is not really anything going on there. The branch “gurobisolver” may be useful at some point for research, because it includes the commercial gurobi mathematical programming solver for integer linear programming. The “must-link” branch will probably be unearthed at some point when I proceed to working on these kind of constraints with anticlustering. I am not sure if the other branches will see the light of the day again.

Last updated: 2024-02-07

Back to the front page

For some strange reason, the issue persisted even with the old anticlust code (i.e., version 0.8.1) at first, but after I wrote another email to the CRAN maintainers, it disappeared from the results page.↩︎
This is not so bad, because you should use fast_anticlustering() for such data sets anyway.↩︎