There are many ways to select a column from a data frame in R. In this post I compare some of them with regard to their running time. Note that this experiment is of purely scientific interest; it has hardly any practical consequences because the speed of column selection will rarely be the bottleneck of your program.
The $ notation selects a single column as a vector; it is the first option I was introduced to and I still use it often:
sleep$extra
## [1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1
## [15] -0.1 4.4 5.5 1.6 4.6 3.4
The [[·]] notation also selects a single column as a vector, just like the $ notation. However, the column name has to be quoted, i.e., we pass a one-element character vector. This has the advantage that a variable or a function call can be placed inside the double brackets:
sleep[["extra"]]
## [1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1
## [15] -0.1 4.4 5.5 1.6 4.6 3.4
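As a small illustration of the point about variables and function calls, the sketch below passes the column name indirectly. The names `col`, `by_variable`, `by_call`, `partial_dollar`, and `partial_brackets` are made up for this example. It also shows a difference worth knowing: `$` accepts an unambiguous prefix of a column name, whereas `[[·]]` matches names exactly by default.

```r
# Column name stored in a variable (hypothetical name `col`)
col <- "extra"
by_variable <- sleep[[col]]

# Column name produced by a function call
by_call <- sleep[[paste0("ex", "tra")]]

# Both give the same vector as the $ notation
identical(by_variable, sleep$extra)
identical(by_call, sleep$extra)

# $ does partial matching on the name, [[ ]] does not by default
partial_dollar   <- sleep$ext       # resolves to the column "extra"
partial_brackets <- sleep[["ext"]]  # no exact match, so the result is NULL
is.null(partial_brackets)
```

Because partial matching can hide typos, `[[·]]` is arguably the safer choice inside functions.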
The [·,·] notation can be used to select multiple columns, but it can also read just one column as a vector:
sleep[, "extra"]
## [1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1
## [15] -0.1 4.4 5.5 1.6 4.6 3.4
We obtain a data frame if we include the argument drop = FALSE:
sleep[, "extra", drop = FALSE]
## extra
## 1 0.7
## 2 -1.6
## 3 -0.2
## 4 -1.2
## 5 -0.1
## 6 3.4
## 7 3.7
## 8 0.8
## 9 0.0
## 10 2.0
## 11 1.9
## 12 0.8
## 13 1.1
## 14 0.1
## 15 -0.1
## 16 4.4
## 17 5.5
## 18 1.6
## 19 4.6
## 20 3.4
The [·] notation can be used to select multiple columns, but it can also read just one column as a data frame:
sleep["extra"]
## extra
## 1 0.7
## 2 -1.6
## 3 -0.2
## 4 -1.2
## 5 -0.1
## 6 3.4
## 7 3.7
## 8 0.8
## 9 0.0
## 10 2.0
## 11 1.9
## 12 0.8
## 13 1.1
## 14 0.1
## 15 -0.1
## 16 4.4
## 17 5.5
## 18 1.6
## 19 4.6
## 20 3.4
The function pull() from the package dplyr can be used to select a column as a vector:
dplyr::pull(sleep, extra)
## [1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1
## [15] -0.1 4.4 5.5 1.6 4.6 3.4
The function select() from the package dplyr can be used to select one or several columns as a data frame:
dplyr::select(sleep, extra)
## extra
## 1 0.7
## 2 -1.6
## 3 -0.2
## 4 -1.2
## 5 -0.1
## 6 3.4
## 7 3.7
## 8 0.8
## 9 0.0
## 10 2.0
## 11 1.9
## 12 0.8
## 13 1.1
## 14 0.1
## 15 -0.1
## 16 4.4
## 17 5.5
## 18 1.6
## 19 4.6
## 20 3.4
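To keep track of which of the options above return a vector and which return a data frame, a quick check along the following lines may help. This is only a sketch: the list name `results` and its element names are invented for illustration, and the dplyr calls are skipped if the package is not installed.

```r
# Collect the result of each selection method in a named list
results <- list(
  dollar          = sleep$extra,
  double_bracket  = sleep[["extra"]],
  matrix_style    = sleep[, "extra"],
  matrix_no_drop  = sleep[, "extra", drop = FALSE],
  single_bracket  = sleep["extra"]
)

# Add the dplyr variants only when dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  results$pull   <- dplyr::pull(sleep, extra)
  results$select <- dplyr::select(sleep, extra)
}

# TRUE marks the methods that return a data frame rather than a vector
is_df <- vapply(results, is.data.frame, logical(1))
print(is_df)
```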
Using the package microbenchmark, I compare the speed of selecting single columns. First, I use the sleep data frame consisting of 20 rows; then, I use the iris data frame consisting of 150 rows.
nrow(sleep)
## [1] 20
library(microbenchmark)
microbenchmark(
  sleep$extra,
  sleep[["extra"]],
  sleep[, "extra"],
  sleep[, "extra", drop = FALSE],
  sleep["extra"],
  dplyr::pull(sleep, extra),
  dplyr::select(sleep, extra)
)
## Unit: nanoseconds
##                            expr    min     lq   mean median     uq     max neval
##                     sleep$extra    509    929   1427   1576   1778    4158   100
##                sleep[["extra"]]   2782   3763   4693   4172   5056   21572   100
##                sleep[, "extra"]   4534   5663   7760   7036   9036   24360   100
##  sleep[, "extra", drop = FALSE]   8067   9878  13500  13467  16298   31459   100
##                  sleep["extra"]   8180   9860  13350  12768  15588   32086   100
##       dplyr::pull(sleep, extra) 238909 253218 266381 260607 279340  350748   100
##     dplyr::select(sleep, extra) 486947 500610 538289 522484 531972 2211427   100
I find it quite interesting that the differences are rather substantial. Judging by the median times, the $ notation is fastest: roughly 2.6 times faster than the [[·]] notation and about 8 times faster than the [·] notation. The function select() was by far the slowest, more than 300 times slower than $. However, this operation still only took about 500 microseconds. As I said, this is very unlikely to be the bottleneck of your program.
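Ratios like these can be computed from the benchmark object rather than read off the table. The sketch below assumes that the object returned by microbenchmark() is a data frame with a column `expr` (the expression) and a column `time` (nanoseconds per evaluation). The names `mb`, `med`, and `rel` are chosen here for illustration, and the block does nothing if microbenchmark is not installed.

```r
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  # Benchmark a subset of the selection methods
  mb <- microbenchmark::microbenchmark(
    sleep$extra,
    sleep[["extra"]],
    sleep["extra"],
    times = 100
  )

  # Median time per expression, in nanoseconds
  med <- tapply(mb$time, mb$expr, median)

  # Express each median as a multiple of the fastest one
  rel <- med / min(med)
  print(round(rel, 1))
}
```

The exact ratios will vary from run to run and from machine to machine, so they should be read as orders of magnitude rather than precise factors.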
Let’s repeat the test with a larger data frame:
nrow(iris)
## [1] 150
microbenchmark(
  iris$Sepal.Length,
  iris[["Sepal.Length"]],
  iris[, "Sepal.Length"],
  iris[, "Sepal.Length", drop = FALSE],
  iris["Sepal.Length"],
  dplyr::pull(iris, Sepal.Length),
  dplyr::select(iris, Sepal.Length)
)
## Unit: nanoseconds
##                                  expr    min     lq   mean median     uq     max neval
##                     iris$Sepal.Length    533   1195   1736   1606   1902   19612   100
##                iris[["Sepal.Length"]]   2832   3713   4989   4222   5190   22344   100
##                iris[, "Sepal.Length"]   4461   5711   7378   7038   8774   12889   100
##  iris[, "Sepal.Length", drop = FALSE]   7733   9676  12931  12006  15452   31784   100
##                  iris["Sepal.Length"]   8186  10485  13456  12835  15248   33054   100
##       dplyr::pull(iris, Sepal.Length) 240854 260052 294382 272116 288716 2021288   100
##     dplyr::select(iris, Sepal.Length) 484614 520556 539652 538021 560159  625882   100
Interestingly, the operations take about the same time even for the larger data frame, and the rank order in speed remains the same.
Let’s check out a “very large” data frame having 100,000 rows:
data <- data.frame(
  col1 = rnorm(100000),
  col2 = rnorm(100000)
)
nrow(data)
## [1] 100000
microbenchmark(
  data$col1,
  data[["col1"]],
  data[, "col1"],
  data[, "col1", drop = FALSE],
  data["col1"],
  dplyr::pull(data, col1),
  dplyr::select(data, col1)
)
## Unit: nanoseconds
##                          expr    min     lq   mean median     uq     max neval
##                     data$col1    518   1092   1485   1555   1763    3020   100
##                data[["col1"]]   2795   3556   4562   4185   5007    8374   100
##                data[, "col1"]   4428   5747   8392   7698   9490   25354   100
##  data[, "col1", drop = FALSE]   7527   9735  13354  12041  15162   33227   100
##                  data["col1"]   8153  10589  14893  13222  16443   67393   100
##       dplyr::pull(data, col1) 236861 253765 285870 262367 283163 1965278   100
##     dplyr::select(data, col1) 481516 497876 519348 515912 532538  692486   100
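Across these three data frames, the speed of reading a column from a data frame is not noticeably affected by the number of rows! A likely explanation is that R does not copy the column's values when it is merely selected, so the cost lies in the function call itself rather than in the amount of data.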
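The same observation can be probed without any extra packages. The following is a rough sketch using base R's system.time(); the helper `time_select()` and its arguments `n_rows` and `reps` are hypothetical names introduced here, and the measurement is much cruder than what microbenchmark provides.

```r
# Average time (in nanoseconds) of one [[ ]] selection
# from a data frame with n_rows rows
time_select <- function(n_rows, reps = 10000) {
  df <- data.frame(col1 = rnorm(n_rows), col2 = rnorm(n_rows))
  elapsed <- system.time(
    for (i in seq_len(reps)) df[["col1"]]
  )[["elapsed"]]
  elapsed / reps * 1e9
}

# Same row counts as the sleep, iris, and "very large" data frames above
sizes <- c(20, 150, 100000)
timings <- vapply(sizes, time_select, numeric(1))
names(timings) <- format(sizes, scientific = FALSE, trim = TRUE)
print(timings)
```

If the row count mattered, the last entry should be far larger than the first; the numbers are noisy, so only large differences are meaningful.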
Last updated: 2019-11-29