Three things to know beyond base R

I think it’s fair to say that most academics who learn about R do so in the process of training or applying quantitative research methods. As a consequence, knowledge of R among academics tends to be limited to core (base) R packages (R Core Team, 2018) and a small handful of speciality statistical packages, e.g., {lavaan}, {lme4}, {MASS}, {car}, etc. With this in mind, the goal of this post is to provide an overview of three things to know beyond base R.

A more appropriate title for this post could be, “A quick introduction to the tidyverse,” as all of the following things to know beyond base R come from the {tidyverse}–a collection of high-powered, consistent, and easy-to-use packages developed by a number of thoughtful and talented R developers, so this should really be considered an exceptionally brief introduction to parts of the tidyverse. But users don’t need to know know everything about the tidyverse to reap the benefits of it. However, if you’re interested in a more formal/thorough introduction to the tidyverse, I would strongly encourage you to checkout R for Data Science by Garrett Groleman and Hadley Wickham.

For those of you who still might be hestitant about moving beyond base R, consider the plot below, which shows the download counts of {tidyverse} packages compared to several well-known and highly-regarded statistical packages.

This plot hopefully demonstrates two things. First, that a lot of people use (and therefore test, troubleshoot, and write documentation for) tidyverse packages. Second, that use of tidyverse packages is not merely a fad or momentary trend. Indeed, compared to widely used statistical packages, there is a considerably higher download rate among tidyverse packages (even above and beyond the general uptick in overall R usage).

1. The pipe

The first thing to know beyond base R is the pipe. The pipe refers to the %>% operator from the {magrittr} package.

library(magrittr)

It may seem complicated at first, but what the pipe does is actually quite simple. That is, it allows users to write linear code. To illustrate use of the pipe, consider the following code that takes the mean of the log of three numbers:

mean(log(c(1, 3, 9)))
#> [1] 1.098612

Notice how the numbers c(1, 3, 9) are nested inside log(), which is then nested inside mean()? If you’re reading the code from left-to-right, it means the functions are performed in reverse order from how they are written. If we broke the code down into its three functions, we would actually expect the order of operations to proceed as follows:

Concacenate numbers into vector c(...)
Log the numeric vector log(...)
Estimate the mean of the logged numeric vector mean(...)

With this order in mind, now consider the following piped code, which takes a numeric vector c(1, 3, 9), calculates the log(), and then estimates the mean(). Hopefully you notice that, in contrast to the nested code above, the code below is linear; in other words, the code appears in the same order (moving from left to right) as the operations are performed.

c(1, 3, 9) %>% log() %>% mean()
#> [1] 1.098612

As a convention designed to make piped code even easier to read, users are encouraged to place each piped statement on its own line. So the code above should be rewritten as follows:

c(1, 3, 9) %>%
  log() %>%
  mean()
#> [1] 1.098612

2. The tibble

R is my favorite programming language because nearly everyone who uses it either works with data frames or is extremely familiar with them. With all of the use and attention, it is hardly surprising that there would be some improvements to the traditional data.frame, which is why the tibble is the second thing to know beyond base R. The tibble refers to a data frame-like class produced by the {tibble} package. Tibbles (class tbl_df) are essentially a special variant of data frames that have desirable properties for printing and joining. And because they also inherit the data.frame class, they also behave like data frames 99.9% (and, seriously, I wouldn’t worry about the 0.1%).

As the code illustrates below, it’s easy to convert nearly any data frame into a tibble via tibble::as_tibble():

(mtcars <- tibble::as_tibble(mtcars))
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> # … with 27 more rows

It’s also possible to create tibbles directly via tibble::tibble(...), e.g.,

tibble::tibble(
  x = rnorm(100),
  y = rnorm(100),
  z = sample(letters, 100, replace = TRUE)
)
#> # A tibble: 100 x 3
#>       x      y z
#>   <dbl>  <dbl> <chr>
#> 1 0.138  0.851 u
#> 2 0.894 -0.295 m
#> 3 0.377  1.61  u
#> 4 0.345 -0.332 r
#> 5 0.508  1.24  m
#> # … with 95 more rows

You hopefully noticed in the printing of the two previous code chunks that tibbles print out a lot prettier than normal data frames. Each observations is limited to a single line (no horizontal scrolling or wrapping). Not all rows are printed by default. And the printout also includes meta information about the classes of variables and the number of rows and columns and in the data set.

If you were really paying attention, you may have also noticed the z variable in the tibble built from scratch was stored (by default) as a character vector and not a factor. This is another important difference in tibbles compared to data frames. Tibbles are lazy, which is this case is useful for avoiding join or mutate errors later on related to a limited set of observed factor levels.

3. Select, filter, arrange, and mutate

If I could only use one package beyond base R, it’d probably be {dplyr}, which is why key dplyr functions are the third thing to learn beyond base R. Compared to base R, the beauty of these {dplyr} functions is that they feature consistent design principles, easily work with non-standard evaluation (i.e., you don’t have to put quotes around variable names), and even leverage c++ behind the scenes for improved performance.

{dplyr} has tons of useful features, but its fundamental building blocks allow users to select columns,

dplyr::select(mtcars, cyl, wt, mpg, gear)
#> # A tibble: 32 x 4
#>     cyl    wt   mpg  gear
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     6  2.62  21       4
#> 2     6  2.88  21       4
#> 3     4  2.32  22.8     4
#> 4     6  3.22  21.4     3
#> 5     8  3.44  18.7     3
#> # … with 27 more rows

mutate columns (i.e., transform, add),

mtcars %>%
  dplyr::mutate(mpg_per_cyl = mpg / cyl,
    car = row.names(datasets::mtcars)) %>%
  dplyr::select(car, mpg, cyl, mpg_per_cyl)
#> # A tibble: 32 x 4
#>   car                 mpg   cyl mpg_per_cyl
#>   <chr>             <dbl> <dbl>       <dbl>
#> 1 Mazda RX4          21       6        3.5
#> 2 Mazda RX4 Wag      21       6        3.5
#> 3 Datsun 710         22.8     4        5.7
#> 4 Hornet 4 Drive     21.4     6        3.57
#> 5 Hornet Sportabout  18.7     8        2.34
#> # … with 27 more rows

filter rows,

dplyr::filter(mtcars, cyl == 4, mpg >= 10)
#> # A tibble: 11 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#> 2  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#> 3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#> 4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 5  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> # … with 6 more rows

and arrange rows.

dplyr::arrange(mtcars, dplyr::desc(mpg))
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#> 5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> # … with 27 more rows

Chain these {dplyr} functions together with the pipe operator %>% to achieve linear/tidyverse-style coding excellence.

mtcars %>%
  dplyr::filter(gear == 4) %>%
  dplyr::mutate(wt_mpg = wt / mpg) %>%
  dplyr::select(cyl, wt, mpg, wt_mpg) %>%
  dplyr::arrange(wt_mpg)
#> # A tibble: 12 x 4
#>     cyl    wt   mpg wt_mpg
#>   <dbl> <dbl> <dbl>  <dbl>
#> 1     4  1.62  30.4 0.0531
#> 2     4  1.84  33.9 0.0541
#> 3     4  2.2   32.4 0.0679
#> 4     4  1.94  27.3 0.0709
#> 5     4  2.32  22.8 0.102
#> # … with 7 more rows