This post is concerned with the very reason why the anticlust package came to exist: Originally, I sought a method to assign stimuli to different sets in experiments while minimizing the differences between sets.1 As it turns out, I was not alone with this problem: it frequently occurs in experimental psychology but previously lacked an accessible software solution. This post asks the awkward question of whether my effort was even needed: Is it necessary to strive for similarity between stimulus sets, or is some other assignment method (e.g., random assignment) sufficient?
The basic anticlustering problem in psychological research occurs in within-subjects designs: Each participant processes two experimental conditions and in each condition, a set of stimuli is shown. For example, in Papenberg and Klau (2021) we discuss the following application:
Lahl [et al.] (2008) investigated the effect of napping on recall memory. In their study, each participant completed a napping session and a wake session, separated by a 1-week wash out period. Before each session, participants had to memorize a list of 30 words; after each session, word recall was tested. Due to possible carry-over effects, presenting the same word list in both conditions was not feasible. Instead, two word lists had to be created and counterbalanced across the experimental conditions (wake vs. sleep). (p. 162)
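To make this concrete: if the 30 words were characterized by numeric features such as word length and (log) word frequency, anticlustering could split them into two lists with nearly identical feature means. The following is a minimal sketch with made-up feature values; the features, their distributions, and the object names are purely illustrative and not taken from Lahl et al.:

library(anticlust)

# Hypothetical features for 30 words (values invented for illustration)
words <- data.frame(
  length    = sample(3:10, size = 30, replace = TRUE),
  frequency = rnorm(30, mean = 5, sd = 1.5)  # e.g., log word frequency
)

# Split the 30 words into 2 lists that are as similar as possible (k-means criterion)
lists <- anticlustering(words, K = 2, objective = "variance")

# Compare the feature means of the two lists
aggregate(words, by = list(stimulus_set = lists), FUN = mean)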
In practice, researchers often opt for using fixed sets,2 i.e., the same stimuli are grouped together and are always shown in the same condition. In the study by Lahl et al. (2008), two sets of stimuli—let's denote them as S1 and S2—were shown either in the wake condition or in the sleep condition. So, for some participants, the wake condition was paired with set S1 and the sleep condition with set S2; for the other participants, the wake condition was paired with set S2 and the sleep condition with set S1. Researchers intuitively wish that the sets S1 and S2 be similar to each other on any factors that may influence the responses of the participants—this is one of the reasons why the anticlust package is being used with increasing frequency. Lintz et al. (2021) gave two reasons for this intuition and explained why balancing stimuli between sets is really needed:
First, failing to balance lists well within subjects will drastically inflate the variance between those subjects (with some having extreme bias in one direction, some having low bias, and still others having extreme bias in the other direction), and it will correspondingly lower statistical power. (p. 18)
Their first reason is statistical in nature: If the sets are dissimilar, this will increase the effect of the experimental manipulation for some participants while decreasing it for others. The increased variance in within-person effects is expected to reduce statistical power. Below, I present a small simulation to investigate this claim by Lintz et al. (2021). It turns out that the concern is justified, but only if there is no statistical control of the stimulus sets. A simple ANOVA that adds the counterbalancing variable as an additional independent variable can remedy the problem on a statistical level. However, proper statistical control cannot be expected in all cases. For example, Lahl et al. (2008) only reported a simple t-test and did not control for the counterbalancing of the stimulus sets, which I assume is the norm rather than the exception.3
Lintz et al. (2021) go on to point out an additional problem with failing to create similar stimulus sets:
Second, sufficiently unbalanced lists could have secondary effects well beyond the low-level influences of the lexical properties on measures like response time. For instance, if a participant becomes consciously aware that one condition’s words are consistently longer than another’s, they might change their strategy, suspect deception, lose focus on the task, or behave in other unpredictable ways that could distort the results to a degree that violates the typical expectations underlying law-of-large-numbers logic. (p. 18)
Arguably, this second concern is more severe because neither can it be identified in the data, nor can it be controlled for by statistical analysis. If participants adjust their response strategy depending on the perceived similarity between stimulus sets, the comparison between experimental conditions can be strongly biased, rendering comparisons between conditions useless in the worst case.
While I actually find this second concern of Lintz et al. to be the more convincing argument for why balancing stimuli is needed, this post investigates the claims in their first concern, because these can actually be examined via statistical simulation.
I implemented a function that generates a data set containing a numeric response variable that is influenced by (a) an experimental effect, (b) the effect of the item set, and (c) some random (normally distributed) error. It also has arguments to adjust the imbalance of the counterbalancing variable (i.e., how many people see condition A with set S1 and how many see condition A with set S2), and to request that stimuli are assigned to the two sets via k-means anticlustering (via the argument anticlust).
library(anticlust)

# N = sample size; total number of participants
# D = effect size of the experimental condition (the same true effect is assumed for each person)
# D_M = effect size of the materials (i.e., average difference between the two sets)
# balance = ratio of persons in each balancing condition; default is "random" assignment
# anticlust = are the 20 stimuli balanced between experimental sets via k-means anticlustering?
get_data_set <- function(N, D, D_M = 0, balance = "random", anticlust = FALSE) {
  # Generate group sizes according to the balancing variable:
  if (balance == "random") {
    tab <- table(sample(1:2, size = N, replace = TRUE))
    N1 <- tab[1]
    N2 <- tab[2]
  } else {
    N1 <- round((N / 2) * balance)
    N2 <- N - N1
  }
  # Generate two sets of 10 stimuli each whose population means differ by D_M
  stimuli1 <- rnorm(10, D_M)
  stimuli2 <- rnorm(10, 0)
  M <- mean(stimuli1) - mean(stimuli2) # realized effect of the stimulus sets in the sample
  if (anticlust) {
    # Re-assign the 20 stimuli to two sets via k-means anticlustering
    all_stimuli <- c(stimuli1, stimuli2)
    stimulus_groups <- anticlustering(all_stimuli, K = 2, objective = "variance", method = "local-maximum")
    M <- diff(tapply(all_stimuli, stimulus_groups, mean))
  }
  # Actually generate the data (condition x material cells):
  C1M1 <- rnorm(N1, D + M)
  C2M2 <- rnorm(N1)
  C1M2 <- rnorm(N2, D)
  C2M1 <- rnorm(N2, M)
  data.frame(
    value = c(C1M1, C2M2, C1M2, C2M1),
    condition = rep(c(1, 2, 1, 2), c(N1, N1, N2, N2)),
    balancing = rep(c(1, 1, 2, 2), c(N1, N1, N2, N2)),
    set = rep(c(1, 2, 2, 1), c(N1, N1, N2, N2)),
    casenum = c(rep(1:N1, 2), rep((N1 + 1):N, 2))
  )
}
The function returns a data frame in long format with \(2N\) rows, where \(N\) is the number of participants. The function arguments adjust the sample size, the effect size between conditions, the effect size of the stimulus sets, the relative sizes of the counterbalancing groups (via the argument balance), and whether anticlustering is used for stimulus assignment.
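Because the balance argument is not exercised in the simulations below, here is a quick, purely illustrative check of how it translates into group sizes (the specific values 1 and 1.2 and the object names are arbitrary):

# balance = 1 yields equal counterbalancing groups; larger values make group 1 larger
d_equal   <- get_data_set(N = 50, D = .5, balance = 1)
d_unequal <- get_data_set(N = 50, D = .5, balance = 1.2)
table(d_equal$balancing) / 2    # 25 and 25 participants
table(d_unequal$balancing) / 2  # 30 and 20 participants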
How I actually implemented the effect of the stimulus sets may be up for debate, but this implementation lets me induce the anticlustering of the stimuli conveniently: In the function body, I generate two sets of stimuli, which are simply normal variates produced by rnorm(). They are drawn from potentially different normal distributions if \(D_M\) is set to a value other than 0. For each set, I (arbitrarily) generate 10 stimuli.4 Then, I compute the difference in means between the two sets, which is used as the actual effect of the item sets on responses. Therefore, even if \(D_M\) remains at its default value of 0, there will be a random influence of the item sets, because the mean values of the item sets vary due to random fluctuation. So the default setting actually resembles the common use case in which stimuli are assigned randomly to sets.
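To illustrate this point (a minimal sketch, not part of the simulation itself): with 10 stimuli per set and \(D_M = 0\), the realized mean difference between the two sets still has a standard deviation of about \(\sqrt{2/10} \approx 0.45\), which we can verify empirically:

# Realized mean difference between two randomly drawn sets of 10 stimuli, D_M = 0
set_differences <- replicate(10000, mean(rnorm(10)) - mean(rnorm(10)))
sd(set_differences)  # close to sqrt(2 / 10) = 0.447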
For purposes of illustration, let's use the function get_data_set() to simulate data for four respondents, assuming an effect size of .5 for the experimental condition and an effect size of .3 for the materials (i.e., the expected difference between the stimulus sets):
get_data_set(4, D = .5, D_M = .3)
##        value condition balancing set casenum
## 1  2.2603253         1         1   1       1
## 2  0.6485547         1         1   1       2
## 3 -0.3446353         2         1   2       1
## 4  0.1429062         2         1   2       2
## 5 -1.7332583         1         2   2       3
## 6  0.3293714         1         2   2       4
## 7  1.2203604         2         2   1       3
## 8  0.9996620         2         2   1       4
We obtain 8 rows because the data is in long format and there are two responses for each participant. We see that there is already some bookkeeping required for this simple design.5 The variable balancing encodes the pairing of condition with item set, so it could in principle be deduced from the columns condition and set. Next, I define a function that generates a data set and then computes a t-test on the response variable by condition. It returns the p-value of the t-test, which can be used to estimate statistical power.
# X is only the iteration index passed by sapply() and is ignored
sim_ttest <- function(X, N, D, D_M = 0, balance = "random", anticlust = FALSE) {
  data <- get_data_set(N = N, D = D, D_M = D_M, balance = balance, anticlust = anticlust)
  t.test(data$value[data$condition == 1], data$value[data$condition == 2], paired = TRUE)$p.value
}
Using sapply(), I can call it repeatedly to conduct a small-scale simulation (i.e., for the same parameter combination). In the following, I define some parameters that I use throughout my examples; a real simulation would vary the input parameters systematically. To estimate statistical power, 10000 repeated calls to sim_ttest() are conducted.6
nsim <- 10000 # number of simulation runs
N <- 50
D <- .5
First, I simulate the power of the t-test for an effect of \(D = 0.5\) in a sample of 50 participants when there is no systematic effect of the item set. Still, in a given sample there is a random difference caused by the item sets: while the implementation of get_data_set() assumes no systematic difference between item sets, a difference may occur due to random sampling. This actually simulates what happens in a real study when stimuli are assigned to sets at random!
Power
# Baseline power: no effect of the item set
pvalues1 <- sapply(1:nsim, sim_ttest, N = N, D = D)
pvalues2 <- sapply(1:nsim, sim_ttest, N = N, D = D, anticlust = TRUE)
mean(pvalues1 <= .05)
## [1] 0.6519
mean(pvalues2 <= .05)
## [1] 0.6909
We see that power is slightly improved (by about 4 percentage points) when using anticlustering assignment instead of random assignment of stimuli.
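As a side note on whether such a difference could be simulation noise (a quick sketch, using the standard binomial formula for the Monte Carlo standard error of a proportion): with 10000 runs, power estimates around .65 to .70 have a standard error of roughly .005, so a difference of about 4 percentage points is well beyond what simulation error alone would produce.

# Monte Carlo standard error of a power estimate of about .65 based on nsim = 10000 runs
sqrt(.65 * (1 - .65) / nsim)  # about 0.005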
In the next example, I increase the bias induced by the difference between item sets. This simulates the case where we got unlucky with a random assignment, or where some other (suboptimal) method of stimulus assignment was used. I even assume that \(D_M\), the expected effect of the item set (i.e., the mean difference in the dependent variable between the item sets), is quite large and even larger than the effect of the experimental manipulation. This may not be a realistic assumption when dividing stimuli randomly into fixed sets. However, this setting makes it possible to show the potentially detrimental effects of the stimulus sets on our analysis, specifically on the statistical power of our study. As a baseline comparison, I also simulate p-values for standard random assignment (pvalues3):
D_M <- 1
pvalues3 <- sapply(1:nsim, sim_ttest, N = N, D = D, D_M = D_M)
pvalues4 <- sapply(1:nsim, sim_ttest, N = N, D = D, D_M = D_M, anticlust = TRUE)
mean(pvalues3 <= .05)
## [1] 0.5083
mean(pvalues4 <= .05)
## [1] 0.6923
The power of the standard t-test is strongly reduced when there is a systematic bias between stimulus sets. Using anticlustering on these stimuli (which are effectively drawn from two populations when \(D_M\) is specified) leads to much higher statistical power.
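To see why anticlustering helps so much here, we can look directly at the realized set difference \(M\) that enters the data generation. The following small sketch mirrors the relevant lines of get_data_set(); the function and object names are mine:

# Realized set difference with random assignment vs. after anticlustering, D_M = 1
set_diff <- function(anticlust = FALSE) {
  stimuli1 <- rnorm(10, 1)  # D_M = 1
  stimuli2 <- rnorm(10, 0)
  if (!anticlust) {
    return(mean(stimuli1) - mean(stimuli2))
  }
  # Re-partition the pooled 20 stimuli into two balanced sets
  all_stimuli <- c(stimuli1, stimuli2)
  groups <- anticlustering(all_stimuli, K = 2, objective = "variance", method = "local-maximum")
  diff(tapply(all_stimuli, groups, mean))
}
mean(abs(replicate(1000, set_diff())))                  # large: around 1
mean(abs(replicate(1000, set_diff(anticlust = TRUE))))  # close to 0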
Instead of using anticlustering—or, even better, in addition to it—we can use statistical control to remedy the problem of variation caused by differences in item sets. A simple tool7 is a mixed ANOVA that includes the counterbalancing variable as a between-subjects factor. I define the function sim_aov() analogously to sim_ttest() to compute this ANOVA instead of the simple t-test.
library(afex)

sim_aov <- function(X, N, D, D_M = 0, balance = "random", anticlust = FALSE) {
  data <- get_data_set(N = N, D = D, D_M = D_M, balance = balance, anticlust = anticlust)
  data$balancing <- factor(data$balancing)
  aov_data <- suppressMessages(aov_ez(data, id = "casenum", dv = "value",
                                      between = "balancing", within = "condition"))
  # Return the p-value of the within-subjects condition effect
  p_value_condition <- aov_data$anova_table["condition", ][["Pr(>F)"]]
  p_value_condition
}
The function uses the powerful afex package (which I generally recommend) to compute the mixed ANOVA. The function sim_aov() returns the p-value of the within-subjects condition effect and can be called repeatedly to conduct a simulation. Again, I stick with the same parameters (i.e., \(D = 0.5\), \(N = 50\)) and do not vary them. I redo the simulation for the t-test (once using anticlustering and once not) for comparison.
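As an aside, the model that aov_ez() fits here can also be spelled out with base R's aov(). The following sketch (using one data set from get_data_set() for illustration; the object names are mine) specifies the same mixed design, although the resulting p-values can differ slightly from afex's default type-3 sums of squares when the counterbalancing groups are of unequal size:

# Mixed ANOVA in base R: balancing is between subjects, condition is within subjects
d <- get_data_set(N = 50, D = .5, D_M = 1)
d$balancing <- factor(d$balancing)
d$condition <- factor(d$condition)
d$casenum   <- factor(d$casenum)
summary(aov(value ~ balancing * condition + Error(casenum / condition), data = d))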
pvalues5 <- sapply(1:nsim, sim_ttest, N = N, D = D)
pvalues6 <- sapply(1:nsim, sim_aov, N = N, D = D)
pvalues7 <- sapply(1:nsim, sim_ttest, N = N, D = D, anticlust = TRUE)
# Power
mean(pvalues5 <= .05)
## [1] 0.6459
mean(pvalues6 <= .05)
## [1] 0.6848
mean(pvalues7 <= .05)
## [1] 0.693
We see the same pattern: the power of the plain t-test is lowest, the ANOVA improves power compared to the t-test, and anticlustering combined with the t-test has comparable power.
I repeat this simulation for the case of highly biased sets:
D_M <- 1
pvalues8 <- sapply(1:nsim, sim_ttest, N = N, D = D, D_M = D_M)
pvalues9 <- sapply(1:nsim, sim_aov, N = N, D = D, D_M = D_M)
pvalues10 <- sapply(1:nsim, sim_ttest, N = N, D = D, D_M = D_M, anticlust = TRUE)
# Power
mean(pvalues8 <= .05)
## [1] 0.5147
mean(pvalues9 <= .05)
## [1] 0.6767
mean(pvalues10 <= .05)
## [1] 0.69
Again, we see that the t-test suffers a lot when stimulus sets are highly different. Interestingly, the ANOVA approach deals with the bias quite well and performs comparably to using anticlustering (which removes the bias).
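In terms of the data-generating model used here, the reason is easy to see: in counterbalancing group 1, the expected within-person condition difference is \(D + M\) (condition 1 is paired with the stronger set), whereas in group 2 it is \(D - M\). The mixed ANOVA estimates the condition effect averaged over the two groups, \(\frac{(D + M) + (D - M)}{2} = D\), while the balancing \(\times\) condition interaction absorbs the set difference \(M\). The plain t-test, in contrast, pools both groups, so \(M\) inflates the variance of the within-person differences and thereby lowers power.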
We can conclude that anticlustering can remedy the problem of reduced power caused by (random or systematic) differences between item sets when fixed sets are used and there is no statistical control of the stimuli. When using stimulus sets, statistical control via ANOVA should be preferred over the simple t-test, ideally combined with anticlustering. Using anticlustering as well as statistical control is expected to (a) maximize statistical power and (b) reduce the risk of bias due to differences in response strategy between conditions.
So do we need anticlustering in psychological experiments? I would say it definitely helps. From a purely statistical standpoint, controlling for stimuli in the analysis should sufficiently remedy imbalances in stimulus sets. However, actively balancing stimuli still has advantages: Even though there are loud proponents of sophistication in analysis (e.g., controlling for stimuli in mixed models; Baayen et al., 2008), statistical control is not always applied, for different reasons. For example, it still seems that statistically controlling for imbalance instead of removing it is regarded with more scepticism (e.g., Treasure & MacRae, 1998; Senn, 2005). More complex analyses also tend to cause problems for practical researchers, such as non-convergence of the model fitting algorithm or uncertainty about which interactions to include. Moreover, as Lintz et al. (2021) pointed out, we cannot rule out that participants react to differences in stimulus sets in unanticipated ways. In the end, I would personally advocate using anticlustering (when using fixed stimulus sets) while still employing statistical control of the stimuli. If statistical control is not applied, I would definitely advocate anticlustering of stimulus sets: power is indeed enhanced with anticlustering, and if we get unlucky with a single random assignment, differences between stimulus sets may completely interfere with the experimental effect.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.
Lahl, O., & Pietrowsky, R. (2006). EQUIWORD: A software application for the automatic creation of truly equivalent word lists. Behavior Research Methods, 38, 146–152. http://dx.doi.org/10.3758/BF03192760
Lahl, O., Wispel, C., Willigens, B., & Pietrowsky, R. (2008). An ultra short episode of sleep is sufficient to promote declarative memory performance. Journal of Sleep Research, 17, 3–10. http://dx.doi.org/10.1111/j.1365-2869.2008.00622.x
Lintz, E. N., Lim, P. C., & Johnson, M. R. (2021). A new tool for equating lexical stimuli across experimental conditions. MethodsX, 8, 101545.
Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301
Senn, S. (2005). An unreasonable prejudice against modelling? Pharmaceutical Statistics: The Journal of Applied Statistics in the Pharmaceutical Industry, 4(2), 87–89.
Treasure, T., & MacRae, K. D. (1998). Minimisation: The platinum standard for trials? Randomisation doesn't guarantee similarity of groups; minimisation does. BMJ, 317(7155), 362–363.
Last updated: 2025-03-18
I even wrote a less sophisticated R package to do this: minDiff.↩︎
It is also feasible to generate a random subset of stimuli for each participant. In this case, we do not need anticlustering. For practical reasons, it seems that fixed sets are often used. It makes sense, given that we usually try to keep as many things fixed as possible in experiments to maximize the precision of our experimental manipulation. Using fixed sets may facilitate implementing the experiment (e.g., via computer or pen and paper) and the data analysis. In some cases, the experimental logic might even demand the use of fixed sets. Still, experiments do exist where random sets of stimuli are generated for each participant, which potentially circumvents the need for anticlustering.↩︎
To be fair, Lahl et al. (2008) used a matching method (see Lahl & Pietrowsky, 2006) that, like anticlustering, strives for similarity between stimulus sets. So, while not using statistical control, they did attempt to ensure comparability between sets.↩︎
In a larger simulation, the number of stimuli might be varied and not fixed.↩︎
And in reality, there could even be an additional balancing variable pertaining to the sequence in which the experimental conditions are processed. So a simple design with 2 conditions quickly becomes a three-factorial data analysis.↩︎
I guess 10000 is a nice number for simulation runs. Maybe this choice should be justified.↩︎
If we do not apply a linear mixed model, which would be the preferred choice of many for this type of design.↩︎