Dplyr, Pipes, and More

Statistical Computing, 36-350

Tuesday September 27, 2022

Last week: Purrr and a bit of dplyr

Part I

Motivation: tidyverse, revisited

What is the tidyverse?

The tidyverse is a coherent collection of packages in R for data science (and tidyverse is itself a actually package that loads all its constituent packages). Packages include:

Last week we covered purrr and a bit of dplyr. This week we’ll do more dplyr, and some tidyr. (Many of you will learn ggplot2 in Statistical Graphics 36-315)

Loading the tidyverse so that we can get all this functionality (plus more):

library(tidyverse)

Why the tidyverse?

Data wrangling the tidy way

Part II

Mastering the pipe

All behold the glorius pipe

ls -l | grep tidy | wc -l

How to read pipes: single arguments

Passing a single argument through pipes, we interpret something like:

x %>% f %>% g %>% h

as h(g(f(x)))

Key takeaway: in your mind, when you see %>%, read this as “and then”

Simple example

We can write exp(1) with pipes as 1 %>% exp, and log(exp(1)) as 1 %>% exp %>% log

exp(1)
## [1] 2.718282
1 %>% exp
## [1] 2.718282
1 %>% exp %>% log
## [1] 1

How to read pipes: multiple arguments

Now for multi-arguments functions, we interpret something like:

x %>% f(y) 

as f(x,y)

Simple example

mtcars %>% head(4) 

And what’s the “old school” (base R) way?

head(mtcars, 4)

Notice that, with pipes:

The dot

The command x %>% f(y) can be equivalently written in dot notation as:

x %>% f(., y)

What’s the advantage of using dots? Sometimes you want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in:

x %>% f(y, .)

which is equivalent to f(y,x)

Simple example

Again, see if you can interpret the code below without running it, then run it in your R console as a way to check your understanding:

state_df = data.frame(state.x77)
state.region %>% 
  tolower %>%
  tapply(state_df$Income, ., summary)

A more complicated example:

x = "Prof Rinaldo really loves piping"
x %>% 
  strsplit(split = " ") %>% 
  .[[1]] %>% # indexing, could also use `[[`(1)
  nchar %>% 
  max 
## [1] 7

Part III

dyplr verbs

dplyr verbs

Some of the most important dplyr verbs (functions):

We’ve learned filter(), group_by(), summarize() in the last lecture. (Go back and rewrite the examples using pipes!)

arrange(): order rows by values of a column

mtcars %>% 
  arrange(mpg) %>% 
  head(4)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
# Base R
mpg_inds = order(mtcars$mpg)
head(mtcars[mpg_inds, ], 4)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4

We can ask for descending order:

mtcars %>% 
  arrange(desc(mpg)) %>% 
  head(4)
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
# Base R
mpg_inds_decr = order(mtcars$mpg, decreasing = TRUE)
head(mtcars[mpg_inds_decr, ], 4)
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

We can order by multiple columns too:

mtcars %>% 
  arrange(desc(gear), desc(hp)) %>%
  head(8)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4

select(): pick out given columns

mtcars %>% 
  select(cyl, disp, hp) %>% 
  head(2)
##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110
# Base R
head(mtcars[, c("cyl", "disp", "hp")], 2)
##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110

Some handy select() helpers

mtcars %>% 
  select(starts_with("d")) %>% 
  head(2)
##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9
# Base R (yikes!)
d_colnames = grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)
##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9

We can do many other things as well:

mtcars %>% select(ends_with('t')) %>% head(2)
##               drat    wt
## Mazda RX4      3.9 2.620
## Mazda RX4 Wag  3.9 2.875
mtcars %>% select(ends_with('yl')) %>% head(2)
##               cyl
## Mazda RX4       6
## Mazda RX4 Wag   6
mtcars %>% select(contains('ar')) %>% head(2)
##               gear carb
## Mazda RX4        4    4
## Mazda RX4 Wag    4    4

(If you’re interested go and read more here)

mutate(): create one or several columns

mtcars = mtcars %>% 
  mutate(hp_wt = hp/wt, 
         mpg_wt = mpg/wt) 

# Base R
mtcars$hp_wt = mtcars$hp/mtcars$wt
mtcars$mpg_wt = mtcars$mpg/mtcars$wt

Newly created variables are useable immediately:

mtcars = mtcars %>% 
  mutate(hp_wt_again = hp/wt,
         hp_wt_cyl = hp_wt_again/cyl) 

# Base R
mtcars$hp_wt_again = mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl = mtcars$hp_wt_again/mtcars$cyl

mutate_at(): apply a function to one or several columns

mtcars = mtcars %>% 
  mutate_at(c("hp_wt", "mpg_wt"), log) 

# Base R
mtcars$hp_wt = log(mtcars$hp_wt)
mtcars$mpg_wt = log(mtcars$mpg_wt)

Important note

Calling dplyr verbs always outputs a new data frame, it does not alter the existing data frame

So to keep the changes, we have to reassign the data frame to be the output of the pipe! (Look back at the examples for mutate() and mutate_at())

dplyr and SQL

Part IV

tidyr verbs

tidyr verbs

Two of the most important tidyr verbs (functions):

There are many others like spread(), gather(), nest(), unnest(), etc. (If you’re interested go and read about them here)

pivot_longer(): make “wide” data longer

#devtools::install_github("rstudio/EDAWR") or remotes::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets
EDAWR::cases 
##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000
EDAWR::cases %>% 
  pivot_longer(names_to = "year", 
               values_to = "n", 
               cols = 2:4) 
## # A tibble: 9 × 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000
## 6 DE      2013   6200
## 7 US      2011  15000
## 8 US      2012  14000
## 9 US      2013  13000
# Different approach to do the same thing
EDAWR::cases %>% 
  pivot_longer(names_to = "year", 
               values_to = "n", 
               -country) 
## # A tibble: 9 × 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000
## 6 DE      2013   6200
## 7 US      2011  15000
## 8 US      2012  14000
## 9 US      2013  13000
# Could also do: 
# EDAWR::cases %>% 
#   pivot_longer(names_to = "year", 
#                values_to = "n", 
#                c(`2011`, `2012`, `2013`)) 

pivot_wider(): make “long” data wider

EDAWR::pollution
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56
EDAWR::pollution %>% 
  pivot_wider(names_from = "size", 
              values_from = "amount")
## # A tibble: 3 × 3
##   city     large small
##   <chr>    <dbl> <dbl>
## 1 New York    23    14
## 2 London      22    16
## 3 Beijing    121    56

Summary