There are quite a few ways to select a column from a data frame in R. In this post I compare some of them with regard to their running time. Note that this experiment is of purely academic interest; it has hardly any practical consequences because the speed of column selection will rarely be the bottleneck of your program.

The competitors

The $ operator selects a single column as a vector; it is the first option I was introduced to, and I still use it often:

sleep$extra
##  [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0  1.9  0.8  1.1  0.1
## [15] -0.1  4.4  5.5  1.6  4.6  3.4

The [[·]] notation selects a single column from a data frame as a vector, just like the $ notation. However, the column name has to be quoted, i.e., we pass a one-element vector of type character. This has the advantage that a variable or a function call can be inserted into the double brackets:

sleep[["extra"]]
##  [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0  1.9  0.8  1.1  0.1
## [15] -0.1  4.4  5.5  1.6  4.6  3.4
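
To illustrate the point about variables and function calls, here is a minimal sketch. The names col_name and get_column are made up for this example and do not come from any package:

```r
# Column name stored in a variable (hypothetical name: col_name)
col_name <- "extra"
from_variable <- sleep[[col_name]]

# Column name produced by a function call; "extra" is the first column of sleep
from_call <- sleep[[names(sleep)[1]]]

# Hypothetical helper that receives the column name as an argument
get_column <- function(df, column) df[[column]]
from_helper <- get_column(sleep, "extra")

# All three give the same vector as sleep$extra
identical(from_variable, sleep$extra)
identical(from_call, sleep$extra)
identical(from_helper, sleep$extra)

# The $ notation does not evaluate the variable: it looks for a column
# literally named "col_name", finds none, and returns NULL
is.null(sleep$col_name)
```

This is the main reason to prefer the double brackets inside functions, where the column name is usually not known in advance.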

The [·,·] notation can be used to select multiple columns, but can also read just one column as a vector:

sleep[, "extra"]
##  [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0  1.9  0.8  1.1  0.1
## [15] -0.1  4.4  5.5  1.6  4.6  3.4

We obtain a data frame if we include the argument drop = FALSE:

sleep[, "extra", drop = FALSE]
##    extra
## 1    0.7
## 2   -1.6
## 3   -0.2
## 4   -1.2
## 5   -0.1
## 6    3.4
## 7    3.7
## 8    0.8
## 9    0.0
## 10   2.0
## 11   1.9
## 12   0.8
## 13   1.1
## 14   0.1
## 15  -0.1
## 16   4.4
## 17   5.5
## 18   1.6
## 19   4.6
## 20   3.4

The [·] notation can be used to select multiple columns, but can also read just one column as a data frame:

sleep["extra"]
##    extra
## 1    0.7
## 2   -1.6
## 3   -0.2
## 4   -1.2
## 5   -0.1
## 6    3.4
## 7    3.7
## 8    0.8
## 9    0.0
## 10   2.0
## 11   1.9
## 12   0.8
## 13   1.1
## 14   0.1
## 15  -0.1
## 16   4.4
## 17   5.5
## 18   1.6
## 19   4.6
## 20   3.4
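
Both bracket notations also accept a vector of several column names, which is what "multiple columns" refers to above. A short sketch with the sleep data; the object names are made up for illustration:

```r
# Two columns via the [ , ] form: rows left empty, columns given by name
two_cols_matrix_style <- sleep[, c("extra", "group")]

# The same two columns via the single-bracket form
two_cols_list_style <- sleep[c("extra", "group")]

# Both results are data frames with 20 rows and 2 columns
is.data.frame(two_cols_matrix_style)
is.data.frame(two_cols_list_style)
dim(two_cols_matrix_style)
dim(two_cols_list_style)
```

With more than one column selected, the drop argument makes no difference; it only matters when a single column is requested with the [·,·] notation.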

The function pull() from the package dplyr can be used to select a column as a vector:

dplyr::pull(sleep, extra)
##  [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0  1.9  0.8  1.1  0.1
## [15] -0.1  4.4  5.5  1.6  4.6  3.4

The function select() from the package dplyr can be used to select one or several columns as a data frame:

dplyr::select(sleep, extra)
##    extra
## 1    0.7
## 2   -1.6
## 3   -0.2
## 4   -1.2
## 5   -0.1
## 6    3.4
## 7    3.7
## 8    0.8
## 9    0.0
## 10   2.0
## 11   1.9
## 12   0.8
## 13   1.1
## 14   0.1
## 15  -0.1
## 16   4.4
## 17   5.5
## 18   1.6
## 19   4.6
## 20   3.4
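
To keep track of which competitor returns what, the results can be collected and checked in one go. This is only a sketch: the list names are invented for the example, and the dplyr calls are skipped if the package is not installed.

```r
# Options that return a plain vector
as_vector <- list(
  dollar          = sleep$extra,
  double_brackets = sleep[["extra"]],
  row_col         = sleep[, "extra"]
)

# Options that return a one-column data frame
as_data_frame <- list(
  row_col_no_drop = sleep[, "extra", drop = FALSE],
  single_brackets = sleep["extra"]
)

# dplyr is optional here; add its two functions only if it is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  as_vector$pull <- dplyr::pull(sleep, extra)
  as_data_frame$select <- dplyr::select(sleep, extra)
}

# Check the return type of every option
sapply(as_vector, is.numeric)
sapply(as_data_frame, is.data.frame)
```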

Running time

Using the package microbenchmark, I compare the speed of selecting single columns. First, I use the sleep data frame consisting of 20 rows; then, I use the iris data frame consisting of 150 rows.

nrow(sleep)
## [1] 20
library(microbenchmark)
microbenchmark(
  sleep$extra, 
  sleep[["extra"]],
  sleep[, "extra"],
  sleep[, "extra", drop = FALSE],
  sleep["extra"],
  dplyr::pull(sleep, extra),
  dplyr::select(sleep, extra)
)
## Unit: nanoseconds
##                            expr    min     lq   mean median     uq     max
##                     sleep$extra    509    929   1427   1576   1778    4158
##                sleep[["extra"]]   2782   3763   4693   4172   5056   21572
##                sleep[, "extra"]   4534   5663   7760   7036   9036   24360
##  sleep[, "extra", drop = FALSE]   8067   9878  13500  13467  16298   31459
##                  sleep["extra"]   8180   9860  13350  12768  15588   32086
##       dplyr::pull(sleep, extra) 238909 253218 266381 260607 279340  350748
##     dplyr::select(sleep, extra) 486947 500610 538289 522484 531972 2211427
##  neval
##    100
##    100
##    100
##    100
##    100
##    100
##    100

I find it quite interesting that the differences are rather substantial. Judging by the medians, the $ notation is fastest: roughly 2.6 times faster than the [[·]] notation and about 8 times faster than the [·] notation. The function select() was by far the slowest. However, this operation still took only about 500 microseconds. As I said, this is definitely not the bottleneck of your program.
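
A quick way to put numbers on these differences is to divide each median by the median of the fastest option. The values below are copied by hand from the benchmark output above; the object names are made up for this sketch:

```r
# Median running times in nanoseconds, copied from the sleep benchmark
median_ns <- c(
  "sleep$extra"                    = 1576,
  "sleep[[\"extra\"]]"             = 4172,
  "sleep[, \"extra\"]"             = 7036,
  "sleep[, \"extra\", drop = FALSE]" = 13467,
  "sleep[\"extra\"]"               = 12768,
  "dplyr::pull(sleep, extra)"      = 260607,
  "dplyr::select(sleep, extra)"    = 522484
)

# How many times slower is each option than the $ notation?
relative_to_dollar <- round(median_ns / median_ns[["sleep$extra"]], 1)
relative_to_dollar

# Median of select() converted from nanoseconds to microseconds
select_us <- median_ns[["dplyr::select(sleep, extra)"]] / 1000
select_us
```

Medians are used rather than means because a few slow runs (see the max column) can pull the mean up considerably.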

Let’s repeat the test with a larger data frame:

nrow(iris)
## [1] 150
microbenchmark(
  iris$Sepal.Length, 
  iris[["Sepal.Length"]],
  iris[, "Sepal.Length"],
  iris[, "Sepal.Length", drop = FALSE],
  iris["Sepal.Length"],
  dplyr::pull(iris, Sepal.Length),
  dplyr::select(iris, Sepal.Length)
)
## Unit: nanoseconds
##                                  expr    min     lq   mean median     uq
##                     iris$Sepal.Length    533   1195   1736   1606   1902
##                iris[["Sepal.Length"]]   2832   3713   4989   4222   5190
##                iris[, "Sepal.Length"]   4461   5711   7378   7038   8774
##  iris[, "Sepal.Length", drop = FALSE]   7733   9676  12931  12006  15452
##                  iris["Sepal.Length"]   8186  10485  13456  12835  15248
##       dplyr::pull(iris, Sepal.Length) 240854 260052 294382 272116 288716
##     dplyr::select(iris, Sepal.Length) 484614 520556 539652 538021 560159
##      max neval
##    19612   100
##    22344   100
##    12889   100
##    31784   100
##    33054   100
##  2021288   100
##   625882   100

Interestingly, the operations take about the same time for the larger data frame, and the rank order in speed remains the same.

Let’s check out a “very large” data frame having 100,000 rows:

data <- data.frame(
  col1 = rnorm(100000), 
  col2 = rnorm(100000)
)

nrow(data)
## [1] 100000
microbenchmark(
  data$col1, 
  data[["col1"]],
  data[, "col1"],
  data[, "col1", drop = FALSE],
  data["col1"],
  dplyr::pull(data, col1),
  dplyr::select(data, col1)
)
## Unit: nanoseconds
##                          expr    min     lq   mean median     uq     max
##                     data$col1    518   1092   1485   1555   1763    3020
##                data[["col1"]]   2795   3556   4562   4185   5007    8374
##                data[, "col1"]   4428   5747   8392   7698   9490   25354
##  data[, "col1", drop = FALSE]   7527   9735  13354  12041  15162   33227
##                  data["col1"]   8153  10589  14893  13222  16443   67393
##       dplyr::pull(data, col1) 236861 253765 285870 262367 283163 1965278
##     dplyr::select(data, col1) 481516 497876 519348 515912 532538  692486
##  neval
##    100
##    100
##    100
##    100
##    100
##    100
##    100

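Within the range tested here (20 to 100,000 rows), the speed of reading a column from a data frame is not noticeably affected by the number of rows! Presumably this is because these operations hand back the existing column rather than copying its values.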
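
The comparison across data frame sizes can also be scripted instead of repeated by hand. The following is a rough sketch in base R only, so it uses system.time() on a batch of selections rather than microbenchmark and is less precise than the results above. The helper time_selection and its arguments are hypothetical, and the row count of one million goes beyond what was tested in this post.

```r
# Hypothetical helper: estimated time in nanoseconds of one df[["col1"]]
# call for a data frame with n_rows rows. A single selection is too fast for
# system.time(), so each measurement times a batch of `reps` selections and
# the median over `batches` measurements is divided by `reps`.
time_selection <- function(n_rows, reps = 10000, batches = 5) {
  df <- data.frame(col1 = rnorm(n_rows), col2 = rnorm(n_rows))
  elapsed <- replicate(batches, {
    system.time(for (i in seq_len(reps)) df[["col1"]])[["elapsed"]]
  })
  median(elapsed) / reps * 1e9
}

# Row counts used in this post, plus one larger size as an extra check
row_counts <- c(20, 150, 1e5, 1e6)
timings <- sapply(row_counts, time_selection)

# One row per data frame size; the absolute numbers depend on the machine
data.frame(rows = row_counts, approx_ns = round(timings))
```

Because the loop overhead is included in each batch, the absolute values are not comparable to the microbenchmark results; only the comparison between row counts is of interest.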


Last updated: 2019-11-29