In this post, I discuss permutation tests by the example of the two group t-test. I illustrate that the number of possible permutations often assumed in the statistical literature is incorrect when the two groups have the same size.

I only recently – I think it was in the beginning of 2019 – learned about permutation tests. In my view, permutation tests are pretty neat because they seem to avoid some of the problems of traditional significance tests, e.g., they make no assumptions with regard to the data distribution.

As an example, assume we are comparing the responses of 12 persons between two conditions. We would have data set that looks like this:

n <- 12
data <- data.frame(
  response = c(rnorm(n / 2, 115, 15), rnorm(n / 2, 100, 15)),
  condition = c(rep(1, n / 2), rep(2, n / 2))
)

print(data)
##    response condition
## 1       107         1
## 2       112         1
## 3       138         1
## 4       116         1
## 5       117         1
## 6       141         1
## 7       107         2
## 8        81         2
## 9        90         2
## 10       93         2
## 11      118         2
## 12      105         2

Note that in this case the null hypothesis is false, because I generated responses from two normal distributions that had different population means (\(115\) vs. \(100\)). The responses sampled from the two populations were assigned different labels (\(1\) and \(2\)).

In the case of a two-sample t-test, permutation tests are based on “reshuffling” the condition labels. For different arrangements of the labels, we compute a t value comparing the responses between reshuffled conditions. For example, we might randomly shuffle the condition labels many times (e.g., 10,000 times) and compute a t statistic each time, comparing the conditions. The following code would accomplish just that:

reshuffle_and_compare <- function(responses, conditions) {
  t.test(responses ~ sample(conditions))$statistic
}
t_values <- replicate(10000, reshuffle_and_compare(data$response, data$condition))

We obtain a distribution of t values:

hist(t_values)
# add actually observed t-value:
abline(
  v = t.test(response ~ condition, data = data)$statistic, 
  col = "red",
  lwd = 2,
  lty = 2
)

This distribution results from a true null hypothesis because we assigned conditions to responses at random; the condition was not systematically related to the responses. The red line indicates the observed t-value based on the real condition labels. We compare it to the sampling distribution to assess its extremeness, i.e., to decide if it is statistically significant.

Why are permutation tests called permutation tests? In the example, we reshuffled the condition labels using the function sample(). The function sample() returns a random sequence of the condition labels; any possible sequence of the labels is called a permutation. That is, each permutation consists of the same labels, but the order of the labels differs between different permutations.

Generating all permutations

Instead of randomly reshuffling the condition labels, there is a more systematic (i.e., exact) way of doing a permutation test: to try out all possible assignments of responses to groups. To do an exact permutation test, we can systematically generate all possible permutations without generating duplicates. This is not an easy task,1 but fortunately we can use my R package anticlust for this purpose. The function next_permutation()2 generates a new permutation based on an existing one.

library(anticlust)

data$permutation1 <- anticlust:::next_permutation(data$condition)
data$permutation2 <- anticlust:::next_permutation(data$permutation1)

print(data)
##    response condition permutation1 permutation2
## 1       107         1            1            1
## 2       112         1            1            1
## 3       138         1            1            1
## 4       116         1            1            1
## 5       117         1            1            1
## 6       141         1            2            2
## 7       107         2            1            2
## 8        81         2            2            1
## 9        90         2            2            2
## 10       93         2            2            2
## 11      118         2            2            2
## 12      105         2            2            2

The function next_permutation() chooses a successor for a given permutation based on the lexicographic order of the condition labels.3 The column permutation1 differs by two elements from the original condition labels: these elements have been swapped; permutation2 differs by two elements from its own predecessor permutation1. Repeatedly calling next_permutation(), we would be able to generate all possible group assignments. The function next_permutation() is smart and ensures that no duplicate permutations are generated until we get back to the initial labels contained in the column condition.

This is a neat approach: We generate all possible assignments of persons to groups using the function next_permutation(). For each assignment, we compute a t statistic. In the end, we know how extreme the observed t value is with reference to all possible t statistics.

There is only one problem with generating all permutations: There are too many of them. For a sample of 12 persons, a computer can generate all partitions in acceptable running time; for \(N = 100\), it is impossible to do that in several life times.

Okay – after some introduction, we are finally ready to come to the actual point of my post: How many permutations do we have to enumerate to conduct an exact permutation test? Put differently: How many possible ways are there to divide people into groups? I will deal with this question assuming that all groups have the same size because this constitutes a special case when computing the number of possible divisions.

A permutation test is not a permutation test

When I found out about permutation tests, I was at first a bit confused about the naming. I would prefer have preferred the name partitioning test, or something similar. Partitioning is the process of dividing a group into several subsets, which is what we actually do when computing a permutation test: We partition the data set into multiple subsets—many times.

My question is: When generating all possible subsets for an exact permutation test, is it necessary to generate all permutations of the condition labels—as the name permutation test would suggest? The answer is: not if the groups have the same size. Consider the following permutations of four group labels:

These are different permutations of the group labels \(\{1, 1, 2, 2\}\), but they clearly correspond to the same partitioning of four elements into two sets: in both cases, the first element is grouped with the forth element and the second element is grouped with the third element. When conducting an exact permutation test, it would be sufficient to consider one of these permutations.

The anticlust function generate_partitions() can be used to illustrate the issue more generally:

generate_partitions(
  N = 4, 
  K = 2,
  generate_permutations = FALSE
)
## [[1]]
## [1] 1 1 2 2
## 
## [[2]]
## [1] 1 2 1 2
## 
## [[3]]
## [1] 1 2 2 1

The function generated all equal-sized partitionings of 4 elements into two groups. We can set the argument generate_permutations to TRUE, in which case we also generate all corresponding permutations.

generate_partitions(
  N = 4, 
  K = 2,
  generate_permutations = TRUE
)
## [[1]]
## [1] 1 1 2 2
## 
## [[2]]
## [1] 1 2 1 2
## 
## [[3]]
## [1] 1 2 2 1
## 
## [[4]]
## [1] 2 1 1 2
## 
## [[5]]
## [1] 2 1 2 1
## 
## [[6]]
## [1] 2 2 1 1

For two groups, we generate the double amount of permutations as compared to partitions. For more than two groups, the discrepancy between the number of permutations and the number of partitions increases even more.4

This matter seems to have been overlooked at least in some of the the statistical literature5 on permutation tests. For example, Ernst (2004) writes in a publication in Statistical Science:

Computation of the permutation distribution of a test statistic involves careful enumeration of all \(\binom{N}{n}\) divisions of the observations. This poses two computational challenges. First, the sheer number of calculations required becomes very large as the samples become only moderate in size. There are over 155 million ways to divide 30 observations into two groups of size 15, and over 5.5 trillion ways to divide them into three groups of size 10.

Ernst is correct with remarking the computational complexity of the enumeration process, but is wrong with regard to the numbers he presents. For equal-sized groups (which he refers to), the number of divisions of the observations is not simply given by the binomial coefficient that would treat the partitionings (\(1, 2, 2, 1\)) and (\(2, 1, 1, 2\)) as distinct groupings—even though they represent the same “division of the observations”. We have to divide \(\binom{30}{15}\) by \(2\) to obtain the actual number of partitions:

choose(30, 15) / 2
## [1] 77558760

We can use the anticlust function n_partitions() to compute the number of equal-sized partitions in the general setting of \(N\) elements and \(K\) groups:

n_partitions(
  N = 30, 
  K = 2
)
## [1] 77558760
n_partitions(
  N = 30, 
  K = 3
)
## [1] 925166131890

The number of ways to divide 30 elements into three groups of 10 is 925,166,131,890 (~ 900 billion) and not over over 5.5 trillion (approx. 5.5 trillion divided by \(6\)). The equation to compute the number of ways to partition \(N\) elements into \(K\) equal-sized groups can be formulated as follows (see Papenberg & Klau, 2020):

\[ P(N, K) = \frac{1}{K!} \cdot \prod\limits_{i=0}^{K - 1} \binom{N - \frac{iN}{K}}{\frac{N}{K}} \]

I am not sure if this formula has been discussed in the literature on permutation tests. Anyway, the discrepancy between the number of permutations and partitions probably does not matter a lot. The practical relevance is negligible because we usually can neither generate all permutations nor all partitions—there is an exponential amount of either. Hence, in most cases, we cannot conduct an exact permutation test and instead rely on a Monte-Carlo test. Moreover, the number of partitions and permutations only differs if some groups are of equal size, e.g., there is no duplicate partition for the permutation \((1, 1, 2, 2, 2)\).

Conclusions

In this post, I discussed the general logic of permutation tests and illustrated that exact permutation tests are usually not feasible. The main point of my post was to show how to compute the number of required iterations for an exact permutation test when experimental groups are of equal size. This number has not been presented correctly in some of the statistical literature on permutation tests.

Update on December 17, 2019: I found out that the special case of equal sample sizes has indeed been recognized in the statistical literature (Phipson & Smyth, 2010), but since it seems that the issue has not been appreciated everywhere since then (e.g., Craig & Fisher, 2019).

References

Craig, A. R., & Fisher, W. W. (2019). Randomization tests as alternative analysis methods for behavior-analytic data. Journal of the Experimental Analysis of Behavior, 111, 309–328.

Ernst, M. D. (2004). Permutation methods: a basis for exact inference. Statistical Science, 19, 676-685.

Papenberg, M., & Klau, G. W. (2020). Using anticlustering to partition data sets into equivalent parts. Psychological Methods. Advance Online Publication. https://doi.org/10.1037/met0000301.

Phipson, B., and Smyth, G. K. (2010). Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology 9, Article 39.


Last updated: 2020-10-23


  1. When repeatedly calling sample(), we can be fairly certain that some duplicates will occur in the long run.

  2. Currently, this function is not an exported function, for which reason I have to use the ::: operator to access it.

  3. If you want to learn more about the next permutation algorithm, check out this post.

  4. For \(k = 3\) groups: \(6\) times more permutations than partitions; for \(k = 4\) groups: \(24\) times as many permutations; in general we have \(k!\) times more permutations than partitions if all groups are of equal size.

  5. The wikipedia article on permutation tests also ignores this issue.