Last week: Data frames and apply

Data frames are a representation of the “classic” data table in R: rows are observations/cases, columns are variables/features
Each column can be a different data type (but must be the same length)
subset(): function for extracting rows of a data frame meeting a condition
split(): function for splitting up rows of a data frame, according to a factor variable
apply(): function for applying a given routine to rows or columns of a matrix or data frame
lapply(): similar, but used for applying a routine to elements of a vector or list
sapply(): similar, but will try to simplify the return type, in comparison to lapply()
tapply(): function for applying a given routine to groups of elements in a vector or list, according to a factor variable
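
As a quick refresher, here is a minimal sketch of each of these functions on the built-in mtcars data. The variable names are made up for illustration only

```r
# Recap sketch on the built-in mtcars data (32 cars x 11 variables)
efficient = subset(mtcars, mpg >= 25)                  # rows meeting a condition
by_cyl = split(mtcars, mtcars$cyl)                     # list of data frames, one per cyl value
col_means = apply(mtcars[, c("mpg", "hp")], 2, mean)   # MARGIN=2 means "over columns"
group_sizes_list = lapply(by_cyl, nrow)                # always returns a list
group_sizes = sapply(by_cyl, nrow)                     # simplified to a named integer vector
mpg_by_cyl = tapply(mtcars$mpg, INDEX=mtcars$cyl, FUN=mean)  # mean mpg within each cyl group
```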

Part I

Motivation: tidyverse

Common iteration tasks

Here’s a basic breakdown for common iteration tasks that we encounter in R: we iterate over …

elements of a list
dimensions of an array (e.g., rows/columns of a matrix)
sub data frames induced by one or more factors

All of this is possible in base R, using the apply family of functions: lapply(), sapply(), apply(), tapply(), etc. So why look anywhere else?

Answer: because some alternatives offer better consistency

With the apply family of functions, there are some inconsistencies in both the interfaces to the functions and their outputs
This can both slow down learning and also lead to inefficiencies in practice (frequent checking and post-processing of results)
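
One way to see the output inconsistency is to watch what sapply() returns as the applied function changes. The following is a small sketch in base R only; the variable names are illustrative

```r
# sapply() chooses its return type based on what the function happens to return
squares = sapply(1:3, function(i) i^2)          # numeric vector of length 3
pairs   = sapply(1:3, function(i) c(i, i^2))    # 2 x 3 numeric matrix
ragged  = sapply(1:3, function(i) seq_len(i))   # list, since the result lengths differ
empty   = sapply(integer(0), function(i) i^2)   # empty list, not an empty numeric vector

is.numeric(squares); is.matrix(pairs); is.list(ragged); is.list(empty)
```

Code that consumes such results often has to check which of these cases it received, which is the "frequent checking and post-processing" mentioned above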

However, the world isn’t black-and-white: base R still has its advantages, and the best thing you can do is to be informed and well-versed in using all the major options!

Why not `plyr`?

The plyr package used to be one of the most popular (most downloaded) R packages of all time. Its popularity peaked in the late 2000s and early 2010s

All plyr functions are of the form **ply()
Replace ** with characters denoting types:
- First character: input type, one of a, d, l
- Second character: output type, one of a, d, l, or _ (drop)

plyr is no longer under active development; that work now happens elsewhere (mainly in the tidyverse). However, some people still like it. If you want to learn about it, you can check out the notes from a previous offering of this course.

What is `purrr`?

purrr is a package that is part of the tidyverse. It offers a family of functions for iterating (mainly over lists) that can be seen as alternatives to base R’s family of apply functions

Compared to base R, they are more consistent
Compared to plyr, they can often be faster

Credit: Jenny Bryan’s tutorial on purrr provided the inspiration for this lecture. Another good reference is Chapter 21: Iteration of the book R for Data Science.

What is the tidyverse?

The tidyverse is a coherent collection of packages in R for data science (and tidyverse is actually itself a package that loads all its constituent packages). Packages include:

Data wrangling: dplyr, tidyr, readr
Iteration: purrr
Visualization: ggplot2

This week we’ll cover purrr and a bit of dplyr. Next week we’ll do more dplyr, and some tidyr. (Many of you will learn ggplot2 in Statistical Graphics 36-315)

Loading the tidyverse so that we can get all this functionality (plus more):

library(tidyverse)

Part II

map() and friends

The map family

purrr offers a family of map functions, which allow you to apply a function across different chunks of data (primarily used with lists). They offer an alternative to base R’s apply functions. Summary of functions:

map(): apply a function across elements of a list or vector
map_dbl(), map_lgl(), map_chr(): same, but return a vector of a particular data type
map_dfr(), map_dfc(): same, but return a data frame

`map()`: list in, list out

The map() function is an alternative to lapply(). It has the following simple form: map(x, f), where x is a list or vector, and f is a function. It always returns a list

my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12], 
               bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
map(my.list, length)

## $nums
## [1] 6
## 
## $chars
## [1] 12
## 
## $bools
## [1] 6

# Base R is just as easy
lapply(my.list, length)

## $nums
## [1] 6
## 
## $chars
## [1] 12
## 
## $bools
## [1] 6
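
Since x can also be a plain vector, here is a small sketch of map() on atomic vectors, assuming only that purrr is installed. The inputs are made up for illustration

```r
library(purrr)

# An atomic vector works as input too; the output is still a list
squares = map(1:3, function(i) i^2)

# Names on the input carry over to the output list, and extra
# arguments after the function are passed along to it
rounded = map(c(a = 1.234, b = 5.678), round, digits = 1)

str(squares)
str(rounded)
```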

`map_dbl()`: list in, numeric out

The map_dbl() function is an alternative to sapply(). It has the form: map_dbl(x, f), where x is a list or vector, and f is a function that returns a numeric value (when applied to each element of x)

Similarly:

map_int() returns an integer vector
map_lgl() returns a logical vector
map_chr() returns a character vector

map_dbl(my.list, length)

##  nums chars bools 
##     6    12     6

# Note: purrr 1.0.0 and later deprecate this automatic integer-to-character coercion
map_chr(my.list, length)

##  nums chars bools 
##   "6"  "12"   "6"

# Base R is a bit more complicated
as.numeric(sapply(my.list, length))

## [1]  6 12  6

as.numeric(unlist(lapply(my.list, length)))

## [1]  6 12  6

vapply(my.list, FUN=length, FUN.VALUE=numeric(1))

##  nums chars bools 
##     6    12     6
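
A practical consequence of the typed map functions is that a wrong return type is reported as an error instead of being silently coerced. The sketch below uses a made-up list (mixed) to contrast the two behaviors

```r
library(purrr)

mixed = list(1:3, letters[1:2])   # hypothetical input: one numeric, one character element

# sapply() silently coerces everything to character
loose = sapply(mixed, function(x) x[1])

# map_dbl() refuses, because the second result is not numeric
strict = tryCatch(map_dbl(mixed, function(x) x[1]),
                  error = function(e) e)

class(loose)               # "character"
inherits(strict, "error")  # TRUE
```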

Applying a custom function

As before (with the apply family), we can of course apply a custom function, and define it “on-the-fly”

library(repurrrsive) # Load Game of Thrones data set
class(got_chars)

## [1] "list"

class(got_chars[[1]])

## [1] "list"

names(got_chars[[1]])

##  [1] "url"         "id"          "name"        "gender"      "culture"    
##  [6] "born"        "died"        "alive"       "titles"      "aliases"    
## [11] "father"      "mother"      "spouse"      "allegiances" "books"      
## [16] "povBooks"    "tvSeries"    "playedBy"

map_chr(got_chars, function(x) { return(x$name) })

##  [1] "Theon Greyjoy"      "Tyrion Lannister"   "Victarion Greyjoy" 
##  [4] "Will"               "Areo Hotah"         "Chett"             
##  [7] "Cressen"            "Arianne Martell"    "Daenerys Targaryen"
## [10] "Davos Seaworth"     "Arya Stark"         "Arys Oakheart"     
## [13] "Asha Greyjoy"       "Barristan Selmy"    "Varamyr"           
## [16] "Brandon Stark"      "Brienne of Tarth"   "Catelyn Stark"     
## [19] "Cersei Lannister"   "Eddard Stark"       "Jaime Lannister"   
## [22] "Jon Connington"     "Jon Snow"           "Aeron Greyjoy"     
## [25] "Kevan Lannister"    "Melisandre"         "Merrett Frey"      
## [28] "Quentyn Martell"    "Samwell Tarly"      "Sansa Stark"
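
purrr also accepts a one-sided formula as shorthand for an on-the-fly function, with .x standing for the current element. Here is a minimal sketch on a toy list (people is hypothetical, standing in for got_chars so that the example needs only purrr)

```r
library(purrr)

# Hypothetical toy data with the same list-of-lists shape as got_chars
people = list(list(name = "Ada", age = 36),
              list(name = "Grace", age = 85))

named_fn   = map_chr(people, function(x) { return(x$name) })  # full function definition
formula_fn = map_chr(people, ~ .x$name)                       # formula shorthand, same result
ages_next_year = map_dbl(people, ~ .x$age + 1)

identical(named_fn, formula_fn)
```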

Extracting elements

Handily, the map functions all allow the second argument to be an integer or string, and treat this internally as an appropriate extractor function

map_chr(got_chars, "name")

##  [1] "Theon Greyjoy"      "Tyrion Lannister"   "Victarion Greyjoy" 
##  [4] "Will"               "Areo Hotah"         "Chett"             
##  [7] "Cressen"            "Arianne Martell"    "Daenerys Targaryen"
## [10] "Davos Seaworth"     "Arya Stark"         "Arys Oakheart"     
## [13] "Asha Greyjoy"       "Barristan Selmy"    "Varamyr"           
## [16] "Brandon Stark"      "Brienne of Tarth"   "Catelyn Stark"     
## [19] "Cersei Lannister"   "Eddard Stark"       "Jaime Lannister"   
## [22] "Jon Connington"     "Jon Snow"           "Aeron Greyjoy"     
## [25] "Kevan Lannister"    "Melisandre"         "Merrett Frey"      
## [28] "Quentyn Martell"    "Samwell Tarly"      "Sansa Stark"

map_lgl(got_chars, "alive")

##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [13]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [25] FALSE  TRUE FALSE FALSE  TRUE  TRUE

Interestingly, we can actually do the same thing in base R: `[`() and `[[`() are functions that act in the following way for a vector or list x and index i

`[`(x, i) is equivalent to x[i]
`[[`(x, i) is equivalent to x[[i]]

(This works whether i is an integer or a string)

sapply(got_chars, `[[`, "name")

##  [1] "Theon Greyjoy"      "Tyrion Lannister"   "Victarion Greyjoy" 
##  [4] "Will"               "Areo Hotah"         "Chett"             
##  [7] "Cressen"            "Arianne Martell"    "Daenerys Targaryen"
## [10] "Davos Seaworth"     "Arya Stark"         "Arys Oakheart"     
## [13] "Asha Greyjoy"       "Barristan Selmy"    "Varamyr"           
## [16] "Brandon Stark"      "Brienne of Tarth"   "Catelyn Stark"     
## [19] "Cersei Lannister"   "Eddard Stark"       "Jaime Lannister"   
## [22] "Jon Connington"     "Jon Snow"           "Aeron Greyjoy"     
## [25] "Kevan Lannister"    "Melisandre"         "Merrett Frey"      
## [28] "Quentyn Martell"    "Samwell Tarly"      "Sansa Stark"

sapply(got_chars, `[[`, "alive")

##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [13]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [25] FALSE  TRUE FALSE FALSE  TRUE  TRUE
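
To make the difference between the two extractor functions concrete, here is a small sketch on a toy list (x and people are hypothetical). It also shows extraction by integer position, which the map functions support in the same way as extraction by name

```r
library(purrr)

x = list(name = "Ada", age = 36)
elem = `[[`(x, "name")   # same as x[["name"]]: the element itself
sub  = `[`(x, "name")    # same as x["name"]: a list of length 1

people = list(x, list(name = "Grace", age = 85))
by_name  = map_chr(people, "name")         # extract by string
by_pos   = map_chr(people, 1)              # extract by integer position
base_way = sapply(people, `[[`, "name")    # base R equivalent
```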

Part III

map_dfr(), map_dfc(), dplyr

`map_dfr()` and `map_dfc()`: list in, data frame out

The map_dfr() and map_dfc() functions iterate a function call over a list or vector, but automatically combine the results into a data frame. They differ in whether that data frame is formed by row-binding or column-binding

map_dfr(got_chars, `[`, c("name", "alive"))

## # A tibble: 30 × 2
##    name               alive
##    <chr>              <lgl>
##  1 Theon Greyjoy      TRUE 
##  2 Tyrion Lannister   TRUE 
##  3 Victarion Greyjoy  TRUE 
##  4 Will               FALSE
##  5 Areo Hotah         TRUE 
##  6 Chett              FALSE
##  7 Cressen            FALSE
##  8 Arianne Martell    TRUE 
##  9 Daenerys Targaryen TRUE 
## 10 Davos Seaworth     TRUE 
## # … with 20 more rows

# Base R is much less convenient
data.frame(name = sapply(got_chars, `[[`, "name"),
           alive = sapply(got_chars, `[[`, "alive"))

##                  name alive
## 1       Theon Greyjoy  TRUE
## 2    Tyrion Lannister  TRUE
## 3   Victarion Greyjoy  TRUE
## 4                Will FALSE
## 5          Areo Hotah  TRUE
## 6               Chett FALSE
## 7             Cressen FALSE
## 8     Arianne Martell  TRUE
## 9  Daenerys Targaryen  TRUE
## 10     Davos Seaworth  TRUE
## 11         Arya Stark  TRUE
## 12      Arys Oakheart FALSE
## 13       Asha Greyjoy  TRUE
## 14    Barristan Selmy  TRUE
## 15            Varamyr FALSE
## 16      Brandon Stark  TRUE
## 17   Brienne of Tarth  TRUE
## 18      Catelyn Stark FALSE
## 19   Cersei Lannister  TRUE
## 20       Eddard Stark FALSE
## 21    Jaime Lannister  TRUE
## 22     Jon Connington  TRUE
## 23           Jon Snow  TRUE
## 24      Aeron Greyjoy  TRUE
## 25    Kevan Lannister FALSE
## 26         Melisandre  TRUE
## 27       Merrett Frey FALSE
## 28    Quentyn Martell FALSE
## 29      Samwell Tarly  TRUE
## 30        Sansa Stark  TRUE

Note: the first example uses extra arguments; the map functions work just like the apply functions in this regard
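
The example above only shows row-binding. As a rough sketch of the difference between the two functions, using made-up inputs (pets and cols are hypothetical) and assuming both purrr and dplyr are installed:

```r
library(purrr)  # map_dfr() and map_dfc() also need dplyr to be installed

# Row-binding: each list element becomes one row
pets = list(list(name = "Rex", legs = 4),
            list(name = "Tweety", legs = 2))
by_row = map_dfr(pets, `[`, c("name", "legs"))

# Column-binding: each result becomes one column
cols = list(x = 1:3, y = 4:6)
by_col = map_dfc(cols, function(v) v * 2)

dim(by_row)   # 2 rows, 2 columns
dim(by_col)   # 3 rows, 2 columns
```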

`dplyr`

The map_dfr() and map_dfc() functions actually depend on another package called dplyr, and hence require it to be installed

What is dplyr? It is another tidyverse package that is very useful for data frame computations. You’ll learn more soon, but for now, you can think of it as providing the tidyverse alternative to the base R functions subset(), split(), tapply()

`filter()`: subset rows based on a condition

head(mtcars) # Built in data frame of cars data, 32 cars x 11 variables

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

filter(mtcars, (mpg >= 20 & disp >= 200) | (drat <= 3))

##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive      21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant             18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Dodge Challenger    15.5   8  318 150 2.76 3.520 16.87  0  0    3    2

# Base R is just as easy with subset(), more complicated with direct indexing
subset(mtcars, (mpg >= 20 & disp >= 200) | (drat <= 3))

##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive      21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant             18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Dodge Challenger    15.5   8  318 150 2.76 3.520 16.87  0  0    3    2

mtcars[(mtcars$mpg >= 20 & mtcars$disp >= 200) | (mtcars$drat <= 3), ]

##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive      21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant             18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Dodge Challenger    15.5   8  318 150 2.76 3.520 16.87  0  0    3    2
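
Two details are worth a quick sketch: conditions separated by commas in filter() are combined with "and", and loading dplyr masks the unrelated base function stats::filter(). The variable names below are illustrative

```r
library(dplyr)  # note: dplyr::filter() masks stats::filter()

# These two calls select the same rows
both_and   = filter(mtcars, mpg >= 20 & disp >= 200)
both_comma = filter(mtcars, mpg >= 20, disp >= 200)

identical(both_and, both_comma)
```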

`group_by()`: define groups of rows based on columns or conditions

head(group_by(mtcars, cyl), 2)

## # A tibble: 2 × 11
## # Groups:   cyl [1]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    21     6   160   110   3.9  2.62  16.5     0     1     4     4
## 2    21     6   160   110   3.9  2.88  17.0     0     1     4     4

This doesn’t actually change anything about the way the data frame looks
Only difference is that when it prints, we’re told about the groups
But it will play a big role in how dplyr functions act on the data frame
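
The grouping information can also be inspected directly. A minimal sketch using dplyr helper functions (the names grouped and plain are just for illustration):

```r
library(dplyr)

grouped = group_by(mtcars, cyl)

is_grouped_df(grouped)                 # TRUE: grouping is stored as metadata
group_vars(grouped)                    # "cyl"
n_groups(grouped)                      # 3 distinct values of cyl
identical(dim(grouped), dim(mtcars))   # TRUE: the data themselves are unchanged

plain = ungroup(grouped)               # drop the grouping again
is_grouped_df(plain)                   # FALSE
```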

`summarize()`: apply computations to (groups of) rows of a data frame

# Ungrouped
summarize(mtcars, mpg = mean(mpg), hp = mean(hp))

##        mpg       hp
## 1 20.09062 146.6875

# Grouped by number of cylinders
summarize(group_by(mtcars, cyl), mpg = mean(mpg), hp = mean(hp))

## # A tibble: 3 × 3
##     cyl   mpg    hp
##   <dbl> <dbl> <dbl>
## 1     4  26.7  82.6
## 2     6  19.7 122. 
## 3     8  15.1 209.

Note: the use of group_by() makes the difference here

# Base R, ungrouped calculation is not so bad
c("mpg" = mean(mtcars$mpg), "hp" = mean(mtcars$hp))

##       mpg        hp 
##  20.09062 146.68750

# Base R, grouped calculation is getting a bit ugly
cbind(tapply(mtcars$mpg, INDEX=mtcars$cyl, FUN=mean),
      tapply(mtcars$hp, INDEX=mtcars$cyl, FUN=mean))

##       [,1]      [,2]
## 4 26.66364  82.63636
## 6 19.74286 122.28571
## 8 15.10000 209.21429

sapply(split(mtcars, mtcars$cyl), FUN=function(df) {
  return(c("mpg" = mean(df$mpg), "hp" = mean(df$hp)))
})

##            4         6        8
## mpg 26.66364  19.74286  15.1000
## hp  82.63636 122.28571 209.2143

aggregate(mtcars[, c("mpg", "hp")], by=list(mtcars$cyl), mean)

##   Group.1      mpg        hp
## 1       4 26.66364  82.63636
## 2       6 19.74286 122.28571
## 3       8 15.10000 209.21429
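
For comparison with the base R versions, the grouped dplyr calculation is often written with the pipe operator %>%, which passes the result on its left as the first argument of the call on its right. The pipe and the n() helper are not covered above, so treat this as a preview sketch only

```r
library(dplyr)

by_cyl = mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg), hp = mean(hp), n = n())  # n() counts rows per group

by_cyl
```

This is the same computation as summarize(group_by(mtcars, cyl), ...), just read from top to bottom instead of inside out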

Summary

The tidyverse is a collection of packages for common data science tasks
purrr is one such package that provides a consistent family of iteration functions
map(): list in, list out
map_dbl(), map_lgl(), map_chr(): list in, vector out (of a particular data type)
map_dfr(), map_dfc(): list in, data frame out (row-bound or column-bound)
dplyr is another such package that provides functions for data frame computations
filter(): subset rows based on a condition
group_by(): define groups of rows based on columns or conditions
summarize(): apply computations across groups of rows

Purrr and a Bit of Dplyr

Statistical Computing, 36-350

Tuesday September 20, 2022
