This post is concerned with the very reason why the anticlust package came to exist: Originally, I sought a method to assign stimuli to different sets in experiments while minimizing the differences between sets.1 As it turns out, I was not alone with this problem: it frequently occurs in experimental psychology but previously lacked an accessible software solution. This post asks the awkward question of whether my effort was even needed: Is it necessary to strive for similarity between stimulus sets, or is some other assignment method (e.g., random assignment) sufficient?
The basic anticlustering problem in psychological research occurs in within-subjects designs: Each participant processes two experimental conditions and in each condition, a set of stimuli is shown. For example, in Papenberg and Klau (2021) we discuss the following application:
Lahl [et al.] (2008) investigated the effect of napping on recall memory. In their study, each participant completed a napping session and a wake session, separated by a 1-week wash out period. Before each session, participants had to memorize a list of 30 words; after each session, word recall was tested. Due to possible carry-over effects, presenting the same word list in both conditions was not feasible. Instead, two word lists had to be created and counterbalanced across the experimental conditions (wake vs. sleep). (p. 162)
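To make this concrete: if the 30 words were characterized by numeric features such as word length and (log) word frequency, anticlustering could split them into two lists with nearly identical feature means. The following is a minimal sketch with made-up feature values; the features, their distributions, and the object names are purely illustrative and not taken from Lahl et al.:

library(anticlust)

# Hypothetical features for 30 words (values invented for illustration)
words <- data.frame(
  length    = sample(3:10, size = 30, replace = TRUE),
  frequency = rnorm(30, mean = 5, sd = 1.5)  # e.g., log word frequency
)

# Split the 30 words into 2 lists that are as similar as possible (k-means criterion)
lists <- anticlustering(words, K = 2, objective = "variance")

# Compare the feature means of the two lists
aggregate(words, by = list(stimulus_set = lists), FUN = mean)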
In practice, researchers often opt for using fixed sets,2 i.e., the same stimuli are grouped together and are always shown in the same condition. In the study by Lahl et al. (2008), two sets of stimuli—let's denote them as S1 and S2—were shown either in the wake condition or in the sleep condition. So, for some participants, the wake condition was paired with set S1 and the sleep condition with set S2; for the other participants, the wake condition was paired with set S2 and the sleep condition with set S1. Researchers intuitively wish that the sets S1 and S2 be similar to each other on any factors that may influence the responses of the participants—this is one of the reasons why the anticlust package is being used with increasing frequency. Lintz et al. (2021) gave two reasons for this intuition and explained why balancing stimuli between sets is really needed:
First, failing to balance lists well within subjects will drastically inflate the variance between those subjects (with some having extreme bias in one direction, some having low bias, and still others having extreme bias in the other direction), and it will correspondingly lower statistical power. (p. 18)
Their first reason is statistical in nature: If the sets are dissimilar, this will increase the effect of the experimental manipulation for some participants while decreasing it for others. The increased variance in within-person effects is expected to reduce statistical power. Below, I present a small simulation to investigate this claim by Lintz et al. (2021). It turns out that the concern is justified, but only if there is no statistical control of the stimulus sets. A simple ANOVA that adds the counterbalancing variable as an additional independent variable can remedy the problem on a statistical level. However, proper statistical control cannot be expected in all cases. For example, Lahl et al. (2008) only reported a simple t-test and did not control for the counterbalancing of the stimulus sets, which I assume is the norm rather than the exception.3
Lintz et al. (2021) go on to point out an additional problem with failing to create similar stimulus sets:
Second, sufficiently unbalanced lists could have secondary effects well beyond the low-level influences of the lexical properties on measures like response time. For instance, if a participant becomes consciously aware that one condition’s words are consistently longer than another’s, they might change their strategy, suspect deception, lose focus on the task, or behave in other unpredictable ways that could distort the results to a degree that violates the typical expectations underlying law-of-large-numbers logic. (p. 18)
Arguably, this second concern is more severe because neither can it be identified in the data, nor can it be controlled for by statistical analysis. If participants adjust their response strategy depending on the perceived similarity between stimulus sets, the comparison between experimental conditions can be strongly biased, rendering comparisons between conditions useless in the worst case.
While I actually find this second concern of Lintz et al. to be the more convincing argument for why balancing stimuli is needed, this post investigates the claims in their first concern, because these can actually be examined via statistical simulation.
I implemented a function that generates a data set containing a numeric response variable that is influenced by (a) an experimental effect, (b) the effect of the item set, and (c) some random (normally distributed) error. It also has arguments to adjust the imbalance of the counterbalancing variable (i.e., how many people see condition A with set S1 and how many see condition A with set S2), and to request that stimuli are assigned to the two sets via k-means anticlustering (via the argument anticlust).
library(anticlust)

# N = sample size; total number of participants
# D = effect size of the experimental condition (the same true effect is assumed for each person)
# D_M = effect size of the materials (i.e., average difference between the two sets)
# balance = ratio of persons in each balancing condition; default is "random" assignment
# anticlust = are the 20 stimuli balanced between experimental sets via k-means anticlustering?
get_data_set <- function(N, D, D_M = 0, balance = "random", anticlust = FALSE) {
  # Generate group sizes according to the balancing variable:
  if (balance == "random") {
    tab <- table(sample(1:2, size = N, replace = TRUE))
    N1 <- tab[1]
    N2 <- tab[2]
  } else {
    N1 <- round((N / 2) * balance)
    N2 <- N - N1
  }
  # Generate two sets of 10 stimuli each whose population means differ by D_M
  stimuli1 <- rnorm(10, D_M)
  stimuli2 <- rnorm(10, 0)
  M <- mean(stimuli1) - mean(stimuli2) # realized effect of the stimulus sets in the sample
  if (anticlust) {
    # Re-assign the 20 stimuli to two sets via k-means anticlustering
    all_stimuli <- c(stimuli1, stimuli2)
    stimulus_groups <- anticlustering(all_stimuli, K = 2, objective = "variance", method = "local-maximum")
    M <- diff(tapply(all_stimuli, stimulus_groups, mean))
  }
  # Actually generate the data (condition x material cells):
  C1M1 <- rnorm(N1, D + M)
  C2M2 <- rnorm(N1)
  C1M2 <- rnorm(N2, D)
  C2M1 <- rnorm(N2, M)
  data.frame(
    value = c(C1M1, C2M2, C1M2, C2M1),
    condition = rep(c(1, 2, 1, 2), c(N1, N1, N2, N2)),
    balancing = rep(c(1, 1, 2, 2), c(N1, N1, N2, N2)),
    set = rep(c(1, 2, 2, 1), c(N1, N1, N2, N2)),
    casenum = c(rep(1:N1, 2), rep((N1 + 1):N, 2))
  )
}
The function returns a data frame in long format with \(2N\) rows, where \(N\) is the number of participants. The function arguments adjust the sample size, the effect size between conditions, the effect size of the stimulus sets, the relative sizes of the counterbalancing groups (via the argument balance), and whether anticlustering is used for stimulus assignment.
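Because the balance argument is not exercised in the simulations below, here is a quick, purely illustrative check of how it translates into group sizes (the specific values 1 and 1.2 and the object names are arbitrary):

# balance = 1 yields equal counterbalancing groups; larger values make group 1 larger
d_equal   <- get_data_set(N = 50, D = .5, balance = 1)
d_unequal <- get_data_set(N = 50, D = .5, balance = 1.2)
table(d_equal$balancing) / 2    # 25 and 25 participants
table(d_unequal$balancing) / 2  # 30 and 20 participants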
How I actually implemented the effect of the stimulus sets may be up for debate, but this implementation lets me induce the anticlustering of the stimuli conveniently: In the function body, I generate two sets of stimuli, which are simply normal variates produced by rnorm(). They are drawn from potentially different normal distributions if \(D_M\) is set to a value other than 0. For each set, I (arbitrarily) generate 10 stimuli.4 Then, I compute the difference in means between the two sets, which is used as the actual effect of the item sets on responses. Therefore, even if \(D_M\) remains at its default value of 0, there will be a random influence of the item sets, because the mean values of the item sets vary due to random fluctuation. So the default setting actually resembles the common use case in which stimuli are assigned randomly to sets.
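To illustrate this point (a minimal sketch, not part of the simulation itself): with 10 stimuli per set and \(D_M = 0\), the realized mean difference between the two sets still has a standard deviation of about \(\sqrt{2/10} \approx 0.45\), which we can verify empirically:

# Realized mean difference between two randomly drawn sets of 10 stimuli, D_M = 0
set_differences <- replicate(10000, mean(rnorm(10)) - mean(rnorm(10)))
sd(set_differences)  # close to sqrt(2 / 10) = 0.447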
For purposes of illustration, let's use the function get_data_set() to simulate data for four respondents, assuming an effect size of .5 for the experimental condition and an effect size of .3 for the materials (i.e., the expected difference between the stimulus sets):
get_data_set(4, D = .5, D_M = .3)
##        value condition balancing set casenum
## 1  2.2603253         1         1   1       1
## 2  0.6485547         1         1   1       2
## 3 -0.3446353         2         1   2       1
## 4  0.1429062         2         1   2       2
## 5 -1.7332583         1         2   2       3
## 6  0.3293714         1         2   2       4
## 7  1.2203604         2         2   1       3
## 8  0.9996620         2         2   1       4
We obtain 8 rows because the data is in long format and there are two responses for each participant. We see that there is already some bookkeeping required for this simple design.5 The variable balancing encodes the pairing of condition with item set, so it could in principle be deduced from the columns condition and set. Next, I define a function that generates a data set and then computes a t-test on the response variable by condition. It returns the p-value of the t-test, which can be used to estimate statistical power.
# X is only the iteration index passed by sapply() and is ignored
sim_ttest <- function(X, N, D, D_M = 0, balance = "random", anticlust = FALSE) {
  data <- get_data_set(N = N, D = D, D_M = D_M, balance = balance, anticlust = anticlust)
  t.test(data$value[data$condition == 1], data$value[data$condition == 2], paired = TRUE)$p.value
}
Using sapply(), I can call it repeatedly to conduct a small-scale simulation (i.e., for the same parameter combination). In the following, I define some parameters that I use throughout my examples; a real simulation would vary the input parameters systematically. To estimate statistical power, 10000 repeated calls to sim_ttest() are conducted.6
nsim <- 10000 # number of simulation runs
N <- 50
D <- .5
First, I simulate the power of the t-test for an effect of \(D = 0.5\) in a sample of 50 participants when there is no systematic effect of the item set. Still, in a given sample there is a random difference caused by the item sets: while the implementation of get_data_set() assumes no systematic difference between item sets, a difference may occur due to random sampling. This actually simulates what happens in a real study when stimuli are assigned to sets at random!
Power
# Baseline power: no effect of the item set
pvalues1 <- sapply(1:nsim, sim_ttest, N = N, D = D)
pvalues2 <- sapply(1:nsim, sim_ttest, N = N, D = D, anticlust = TRUE)
mean(pvalues1 <= .05)
## [1] 0.6519
mean(pvalues2 <= .05)
## [1] 0.6909
We see that power is slightly improved (by about 4 percentage points) when using anticlustering assignment instead of random assignment of stimuli.
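As a side note on whether such a difference could be simulation noise (a quick sketch, using the standard binomial formula for the Monte Carlo standard error of a proportion): with 10000 runs, power estimates around .65 to .70 have a standard error of roughly .005, so a difference of about 4 percentage points is well beyond what simulation error alone would produce.

# Monte Carlo standard error of a power estimate of about .65 based on nsim = 10000 runs
sqrt(.65 * (1 - .65) / nsim)  # about 0.005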
In the next example, I increase the bias induced by the difference between item sets. This simulates the case where we got unlucky with a random assignment, or where some other (suboptimal) method of stimulus assignment was used. I even assume that \(D_M\), the expected effect of the item set (i.e., the mean difference in the dependent variable between the item sets), is quite large and even larger than the effect of the experimental manipulation. This may not be a realistic assumption when dividing stimuli randomly into fixed sets. However, this setting makes it possible to show the potentially detrimental effects of the stimulus sets on our analysis, specifically on the statistical power of our study. As a baseline comparison, I also simulate p-values for standard random assignment (pvalues3):
D_M <- 1
pvalues3 <- sapply(1:nsim, sim_ttest, N = N, D = D, D_M = D_M)
pvalues4 <- sapply(1:nsim, sim_ttest, N = N, D = D, D_M = D_M, anticlust = TRUE)
mean(pvalues3 <= .05)
## [1] 0.5083
mean(pvalues4 <= .05)
## [1] 0.6923
The power of the standard t-test is strongly reduced when there is a systematic bias between stimulus sets. Using anticlustering on these stimuli (which are effectively drawn from two populations when \(D_M\) is specified) leads to much higher statistical power.
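To see why anticlustering helps so much here, we can look directly at the realized set difference \(M\) that enters the data generation. The following small sketch mirrors the relevant lines of get_data_set(); the function and object names are mine:

# Realized set difference with random assignment vs. after anticlustering, D_M = 1
set_diff <- function(anticlust = FALSE) {
  stimuli1 <- rnorm(10, 1)  # D_M = 1
  stimuli2 <- rnorm(10, 0)
  if (!anticlust) {
    return(mean(stimuli1) - mean(stimuli2))
  }
  # Re-partition the pooled 20 stimuli into two balanced sets
  all_stimuli <- c(stimuli1, stimuli2)
  groups <- anticlustering(all_stimuli, K = 2, objective = "variance", method = "local-maximum")
  diff(tapply(all_stimuli, groups, mean))
}
mean(abs(replicate(1000, set_diff())))                  # large: around 1
mean(abs(replicate(1000, set_diff(anticlust = TRUE))))  # close to 0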
Instead of using anticlustering—or, even better, in addition to it—we can use statistical control to remedy the problem of variation caused by differences in item sets. A simple tool7 is a mixed ANOVA that includes the counterbalancing variable as a between-subjects factor. I define the function sim_aov() analogously to sim_ttest() to compute this ANOVA instead of the simple t-test.
library(afex)

sim_aov <- function(X, N, D, D_M = 0, balance = "random", anticlust = FALSE) {
  data <- get_data_set(N = N, D = D, D_M = D_M, balance = balance, anticlust = anticlust)
  data$balancing <- factor(data$balancing)
  aov_data <- suppressMessages(aov_ez(data, id = "casenum", dv = "value",
                                      between = "balancing", within = "condition"))
  # Return the p-value of the within-subjects condition effect
  p_value_condition <- aov_data$anova_table["condition", ][["Pr(>F)"]]
  p_value_condition
}
The function uses the powerful afex package (which I generally recommend) to compute the mixed ANOVA. The function sim_aov() returns the p-value of the within-subjects condition effect and can be called repeatedly to conduct a simulation. Again, I stick with the same parameters (i.e., \(D = 0.5\), \(N = 50\)) and do not vary them. I redo the simulation for the t-test (once using anticlustering and once not) for comparison.
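As an aside, the model that aov_ez() fits here can also be spelled out with base R's aov(). The following sketch (using one data set from get_data_set() for illustration; the object names are mine) specifies the same mixed design, although the resulting p-values can differ slightly from afex's default type-3 sums of squares when the counterbalancing groups are of unequal size:

# Mixed ANOVA in base R: balancing is between subjects, condition is within subjects
d <- get_data_set(N = 50, D = .5, D_M = 1)
d$balancing <- factor(d$balancing)
d$condition <- factor(d$condition)
d$casenum   <- factor(d$casenum)
summary(aov(value ~ balancing * condition + Error(casenum / condition), data = d))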
pvalues5 <- sapply(1:nsim, sim_ttest, N = N, D = D)
pvalues6 <- sapply(1:nsim, sim_aov, N = N, D = D)
pvalues7 <- sapply(1:nsim, sim_ttest, N = N, D = D, anticlust = TRUE)
# Power
mean(pvalues5 <= .05)
## [1] 0.6459
mean(pvalues6 <= .05)
## [1] 0.6848
mean(pvalues7 <= .05)
## [1] 0.693
We see the same pattern: the power of the plain t-test is lowest, the ANOVA improves power compared to the t-test, and anticlustering combined with the t-test has comparable power.
I repeat this simulation for the case of highly biased sets:
D_M <- 1
pvalues8 <- sapply(1:nsim, sim_ttest, N = N, D = D, D_M = D_M)
pvalues9 <- sapply(1:nsim, sim_aov, N = N, D = D, D_M = D_M)
pvalues10 <- sapply(1:nsim, sim_ttest, N = N, D = D, D_M = D_M, anticlust = TRUE)
# Power
mean(pvalues8 <= .05)
## [1] 0.5147
mean(pvalues9 <= .05)
## [1] 0.6767
mean(pvalues10 <= .05)
## [1] 0.69
Again, we see that the t-test suffers a lot when stimulus sets are highly different. Interestingly, the ANOVA approach deals with the bias quite well and performs comparably to using anticlustering (which removes the bias).
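In terms of the data-generating model used here, the reason is easy to see: in counterbalancing group 1, the expected within-person condition difference is \(D + M\) (condition 1 is paired with the stronger set), whereas in group 2 it is \(D - M\). The mixed ANOVA estimates the condition effect averaged over the two groups, \(\frac{(D + M) + (D - M)}{2} = D\), while the balancing \(\times\) condition interaction absorbs the set difference \(M\). The plain t-test, in contrast, pools both groups, so \(M\) inflates the variance of the within-person differences and thereby lowers power.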
We can conclude that anticlustering can remedy the problem of reduced power caused by (random or systematic) differences between item sets when fixed sets are used and there is no statistical control of the stimuli. When using stimulus sets, statistical control via ANOVA should be preferred over the simple t-test, ideally combined with anticlustering. Using anticlustering as well as statistical control is expected to (a) maximize statistical power and (b) reduce the risk of bias due to differences in response strategy between conditions.
So do we need anticlustering in psychological experiments? I would say it definitely helps. From a purely statistical standpoint, controlling for stimuli in the analysis should sufficiently remedy imbalances in stimulus sets. However, actively balancing stimuli still has advantages: Even though there are loud proponents of sophistication in analysis (e.g., controlling for stimuli in mixed models; Baayen et al., 2008), statistical control is not always applied, for different reasons. For example, it still seems that statistically controlling for imbalance instead of removing it is regarded with more scepticism (e.g., Treasure & MacRae, 1998; Senn, 2005). More complex analyses also tend to cause problems for practical researchers, such as non-convergence of the model fitting algorithm or uncertainty about which interactions to include. Moreover, as Lintz et al. (2021) pointed out, we cannot rule out that participants react to differences in stimulus sets in unanticipated ways. In the end, I would personally advocate using anticlustering (when using fixed stimulus sets) while still employing statistical control of the stimuli. If statistical control is not applied, I would definitely advocate anticlustering of stimulus sets: power is indeed enhanced with anticlustering, and if we get unlucky with a single random assignment, differences between stimulus sets may completely interfere with the experimental effect.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.
Lahl, O., & Pietrowsky, R. (2006). EQUIWORD: A software application for the automatic creation of truly equivalent word lists. Behavior Research Methods, 38, 146–152. http://dx.doi.org/10.3758/BF03192760
Lahl, O., Wispel, C., Willigens, B., & Pietrowsky, R. (2008). An ultra short episode of sleep is sufficient to promote declarative memory performance. Journal of Sleep Research, 17, 3–10. http://dx.doi.org/10.1111/j.1365-2869.2008.00622.x
Lintz, E. N., Lim, P. C., & Johnson, M. R. (2021). A new tool for equating lexical stimuli across experimental conditions. MethodsX, 8, 101545.
Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301
Senn, S. (2005). An unreasonable prejudice against modelling? Pharmaceutical Statistics: The Journal of Applied Statistics in the Pharmaceutical Industry, 4(2), 87–89.
Treasure, T., & MacRae, K. D. (1998). Minimisation: The platinum standard for trials? Randomisation doesn't guarantee similarity of groups; minimisation does. BMJ, 317(7155), 362–363.
Last updated: 2025-03-18
I even wrote a less sophisticated R package to do this: minDiff.↩︎
It is also feasible to generate a random subset of stimuli for each participant. In this case, we do not need anticlustering. For practical reasons, it seems that fixed sets are often used. It makes sense, given that we usually try to keep as many things fixed as possible in experiments to maximize the precision of our experimental manipulation. Using fixed sets may facilitate implementing the experiment (e.g., via computer or pen and paper) and the data analysis. In some cases, the experimental logic might even demand the use of fixed sets. Still, experiments do exist where random sets of stimuli are generated for each participant, which potentially circumvents the need for anticlustering.↩︎
To be fair, Lahl et al. (2008) used a matching method (see Lahl & Pietrowsky, 2006) that, like anticlustering, strives for similarity between stimulus sets. So, while not using statistical control, they did attempt to ensure comparability between sets.↩︎
In a larger simulation, the number of stimuli might be varied and not fixed.↩︎
And in reality, there could even be an additional balancing variable pertaining to the sequence in which the experimental conditions are processed. So a simple design with 2 conditions quickly becomes a three-factorial data analysis.↩︎
I guess 10000 is a nice number for simulation runs. Maybe this choice should be justified.↩︎
If we do not apply a linear mixed model, which would be the preferred choice of many for this type of design.↩︎