subset(): function for extracting rows of a data frame meeting a condition
split(): function for splitting up rows of a data frame, according to a factor variable
apply(): function for applying a given routine to rows or columns of a matrix or data frame
lapply(): similar, but used for applying a routine to elements of a vector or list
sapply(): similar, but will try to simplify the return type, in comparison to lapply()
tapply(): function for applying a given routine to groups of elements in a vector or list, according to a factor variable
Motivation: tidyverse
Here’s a basic breakdown for common iteration tasks that we encounter in R: we iterate over …
All of this is possible in base R, using the apply family of functions: lapply(), sapply(), apply(), tapply(), etc. So why look anywhere else?
Answer: because some alternatives offer better consistency
However, the world isn’t black-and-white: base R still has its advantages, and the best thing you can do is to be informed and well-versed in using all the major options!
plyr?
The plyr package used to be one of the most popular (most downloaded) R packages of all time. It was more popular in the late 2000s and early 2010s
plyr functions are of the form **ply(), with ** replaced by two characters denoting types:
First character: input type, one of a, d, l
Second character: output type, one of a, d, l, or _ (drop)
It is no longer under active development, and that development is now happening elsewhere (mainly in the tidyverse). However, some people still like it. If you want to learn about it, you can check out the notes from a previous offering of this course.
purrr?
purrr is a package that is part of the tidyverse. It offers a family of functions for iterating (mainly over lists) that can be seen as alternatives to base R’s family of apply functions
Compared to plyr, they can often be faster
Credit: Jenny Bryan’s tutorial on purrr provided the inspiration for this lecture. Another good reference is Chapter 21: Iteration of the book R for Data Science.
The tidyverse is a coherent collection of packages in R for data science (and tidyverse is itself actually a package that loads all its constituent packages). Packages include:
dplyr, tidyr, readr
purrr
ggplot2
This week we’ll cover purrr and a bit of dplyr. Next week we’ll do more dplyr, and some tidyr. (Many of you will learn ggplot2 in Statistical Graphics 36-315)
Loading the tidyverse so that we can get all this functionality (plus more):
library(tidyverse)
map() and friends
purrr offers a family of map functions, which allow you to apply a function across different chunks of data (primarily used with lists). It offers an alternative to base R’s apply functions. Summary of functions:
map(): apply a function across elements of a list or vector
map_dbl(), map_lgl(), map_chr(): same, but return a vector of a particular data type
map_dfr(), map_dfc(): same, but return a data frame
map(): list in, list out
The map() function is an alternative to lapply(). It has the following simple form: map(x, f), where x is a list or vector, and f is a function. It always returns a list
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
map(my.list, length)
## $nums
## [1] 6
##
## $chars
## [1] 12
##
## $bools
## [1] 6
# Base R is just as easy
lapply(my.list, length)
## $nums
## [1] 6
##
## $chars
## [1] 12
##
## $bools
## [1] 6
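The "list in, list out" behavior holds even when the input is a plain vector and each result is a single number. A minimal sketch, assuming purrr is installed; the vector `x` and the choice of `sqrt` are illustrative and not from the lecture:

```r
library(purrr)

# Hypothetical input: a plain numeric vector rather than a list
x <- c(1, 4, 9)

# map() applies sqrt() to each element and still returns a list
roots <- map(x, sqrt)

# lapply() is the base R counterpart and returns the same structure
roots_base <- lapply(x, sqrt)

is.list(roots)
```

If a vector is wanted instead of a list, one of the typed variants such as map_dbl() is the more natural choice.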
map_dbl(): list in, numeric out
The map_dbl() function is an alternative to sapply(). It has the form: map_dbl(x, f), where x is a list or vector, and f is a function that returns a numeric value (when applied to each element of x)
Similarly:
map_int() returns an integer vector
map_lgl() returns a logical vector
map_chr() returns a character vector
map_dbl(my.list, length)
## nums chars bools
## 6 12 6
map_chr(my.list, length)
## nums chars bools
## "6" "12" "6"
# Base R is a bit more complicated
as.numeric(sapply(my.list, length))
## [1] 6 12 6
as.numeric(unlist(lapply(my.list, length)))
## [1] 6 12 6
vapply(my.list, FUN=length, FUN.VALUE=numeric(1))
## nums chars bools
## 6 12 6
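One practical consequence of the typed variants is that they refuse to return anything other than one value per element, whereas sapply() quietly changes its return shape. A small sketch of that difference, assuming purrr is installed; the list `sizes` and the use of `range` are made up for illustration:

```r
library(purrr)

# Hypothetical data
sizes <- list(a = 1:3, b = letters[1:5])

# length() returns an integer, so map_int() is the closest match
n <- map_int(sizes, length)

# range() returns two values per element, which cannot be simplified
# to one number each, so map_dbl() signals an error
res <- tryCatch(
  map_dbl(list(1:2, 3:4), range),
  error = function(e) "error"
)

# sapply() does not complain; it returns a matrix instead of a vector
m <- sapply(list(1:2, 3:4), range)
```

This is one sense in which the map functions are more consistent: the return type is fixed by the function name rather than by what the data happens to look like.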
As before (with the apply family), we can of course apply a custom function, and define it “on-the-fly”
library(repurrrsive) # Load Game of Thrones data set
class(got_chars)
## [1] "list"
class(got_chars[[1]])
## [1] "list"
names(got_chars[[1]])
## [1] "url" "id" "name" "gender" "culture"
## [6] "born" "died" "alive" "titles" "aliases"
## [11] "father" "mother" "spouse" "allegiances" "books"
## [16] "povBooks" "tvSeries" "playedBy"
map_chr(got_chars, function(x) { return(x$name) })
## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
Handily, the map functions all allow the second argument to be an integer or string, and treat this internally as an appropriate extractor function
map_chr(got_chars, "name")
## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
map_lgl(got_chars, "alive")
## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [13] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [25] FALSE TRUE FALSE FALSE TRUE TRUE
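The extractor shorthand can be checked on a toy list, where it is easy to see that a string extracts by name and an integer extracts by position. A minimal sketch, assuming purrr is installed; the list `records` is hypothetical:

```r
library(purrr)

# Hypothetical nested list: each element has a "name" and a "value"
records <- list(
  list(name = "a", value = 1),
  list(name = "b", value = 2)
)

# String: extract the element called "name" from each record
by_name <- map_chr(records, "name")

# Integer: extract the first element of each record (also "name" here)
by_pos <- map_chr(records, 1)

# Equivalent explicit function, for comparison
explicit <- map_chr(records, function(x) x[["name"]])
```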
Interestingly, we can actually do the following in base R: `[`() and `[[`() are functions that act in the following way for a vector or list x and index i
`[`(x, i) is equivalent to x[i]
`[[`(x, i) is equivalent to x[[i]]
(This works whether i is an integer or a string)
sapply(got_chars, `[[`, "name")
## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
sapply(got_chars, `[[`, "alive")
## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [13] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [25] FALSE TRUE FALSE FALSE TRUE TRUE
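The equivalences for `[`() and `[[`() can be verified directly in base R. A short sketch using made-up objects `x` and `records`:

```r
# Hypothetical list with two named elements
x <- list(name = "a", value = 10)

# Calling the bracket operators as ordinary functions
same_double <- identical(`[[`(x, "value"), x[["value"]])  # element itself
same_single <- identical(`[`(x, 2), x[2])                 # sub-list of length 1

# Because they are ordinary functions, they can be passed to sapply(),
# with the index supplied as an extra argument
records <- list(x, list(name = "b", value = 20))
vals <- sapply(records, `[[`, "value")
```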
map_dfr(), map_dfc(), dplyr
map_dfr() and map_dfc(): list in, data frame out
The map_dfr() and map_dfc() functions iterate a function call over a list or vector, but automatically combine the results into a data frame. They differ in whether that data frame is formed by row-binding or column-binding
map_dfr(got_chars, `[`, c("name", "alive"))
## # A tibble: 30 × 2
## name alive
## <chr> <lgl>
## 1 Theon Greyjoy TRUE
## 2 Tyrion Lannister TRUE
## 3 Victarion Greyjoy TRUE
## 4 Will FALSE
## 5 Areo Hotah TRUE
## 6 Chett FALSE
## 7 Cressen FALSE
## 8 Arianne Martell TRUE
## 9 Daenerys Targaryen TRUE
## 10 Davos Seaworth TRUE
## # … with 20 more rows
# Base R is much less convenient
data.frame(name = sapply(got_chars, `[[`, "name"),
alive = sapply(got_chars, `[[`, "alive"))
## name alive
## 1 Theon Greyjoy TRUE
## 2 Tyrion Lannister TRUE
## 3 Victarion Greyjoy TRUE
## 4 Will FALSE
## 5 Areo Hotah TRUE
## 6 Chett FALSE
## 7 Cressen FALSE
## 8 Arianne Martell TRUE
## 9 Daenerys Targaryen TRUE
## 10 Davos Seaworth TRUE
## 11 Arya Stark TRUE
## 12 Arys Oakheart FALSE
## 13 Asha Greyjoy TRUE
## 14 Barristan Selmy TRUE
## 15 Varamyr FALSE
## 16 Brandon Stark TRUE
## 17 Brienne of Tarth TRUE
## 18 Catelyn Stark FALSE
## 19 Cersei Lannister TRUE
## 20 Eddard Stark FALSE
## 21 Jaime Lannister TRUE
## 22 Jon Connington TRUE
## 23 Jon Snow TRUE
## 24 Aeron Greyjoy TRUE
## 25 Kevan Lannister FALSE
## 26 Melisandre TRUE
## 27 Merrett Frey FALSE
## 28 Quentyn Martell FALSE
## 29 Samwell Tarly TRUE
## 30 Sansa Stark TRUE
Note: the first example uses extra arguments; the map functions work just like the apply functions in this regard
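To make the row-binding versus column-binding distinction concrete, here is a small sketch on toy data frames. It assumes both purrr and dplyr are installed; the objects `pieces`, `left`, and `right` are hypothetical:

```r
library(purrr)  # map_dfr() and map_dfc() also need dplyr to be installed

# Hypothetical data: two small data frames with the same columns
pieces <- list(
  data.frame(id = 1:2, score = c(0.5, 0.7)),
  data.frame(id = 3:4, score = c(0.1, 0.9))
)

# Row-binding: each call returns a one-row data frame, and the rows
# are stacked on top of each other
stacked <- map_dfr(pieces, function(df) {
  data.frame(n = nrow(df), mean_score = mean(df$score))
})

# Column-binding: each result contributes columns, placed side by side
left  <- data.frame(id = 1:2)
right <- data.frame(score = c(0.5, 0.7))
side_by_side <- map_dfc(list(left, right), identity)
```

Here `stacked` has one row per element of `pieces`, while `side_by_side` has one set of columns per element of its input list.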
dplyr
The map_dfr() and map_dfc() functions actually depend on another package called dplyr, hence require the latter to be installed
What is dplyr? It is another tidyverse package that is very useful for data frame computations. You’ll learn more soon, but for now, you can think of it as providing the tidyverse alternative to the base R functions subset(), split(), tapply()
filter(): subset rows based on a condition
head(mtcars) # Built-in data frame of cars data, 32 cars x 11 variables
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
filter(mtcars, (mpg >= 20 & disp >= 200) | (drat <= 3))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2
# Base R is just as easy with subset(), more complicated with direct indexing
subset(mtcars, (mpg >= 20 & disp >= 200) | (drat <= 3))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2
mtcars[(mtcars$mpg >= 20 & mtcars$disp >= 200) | (mtcars$drat <= 3), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2
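One reason direct indexing is more complicated shows up when the condition evaluates to NA. A minimal sketch on a made-up data frame `df`, assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical data frame with a missing value in x
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

# filter() and subset() both drop rows where the condition is NA
f <- filter(df, x > 1)
s <- subset(df, x > 1)

# Direct indexing keeps a row of NAs for the missing condition
d <- df[df$x > 1, ]

c(filter = nrow(f), subset = nrow(s), indexing = nrow(d))
```

With direct indexing, a guard such as which(df$x > 1) would be needed to get the same rows as the other two approaches.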
group_by(): define groups of rows based on columns or conditions
head(group_by(mtcars, cyl), 2)
## # A tibble: 2 × 11
## # Groups: cyl [1]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
Grouping affects how dplyr functions act on the data frame
summarize(): apply computations to (groups of) rows of a data frame
# Ungrouped
summarize(mtcars, mpg = mean(mpg), hp = mean(hp))
## mpg hp
## 1 20.09062 146.6875
# Grouped by number of cylinders
summarize(group_by(mtcars, cyl), mpg = mean(mpg), hp = mean(hp))
## # A tibble: 3 × 3
## cyl mpg hp
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
Note: the use of group_by() makes the difference here
# Base R, ungrouped calculation is not so bad
c("mpg" = mean(mtcars$mpg), "hp" = mean(mtcars$hp))
## mpg hp
## 20.09062 146.68750
# Base R, grouped calculation is getting a bit ugly
cbind(tapply(mtcars$mpg, INDEX=mtcars$cyl, FUN=mean),
tapply(mtcars$hp, INDEX=mtcars$cyl, FUN=mean))
## [,1] [,2]
## 4 26.66364 82.63636
## 6 19.74286 122.28571
## 8 15.10000 209.21429
sapply(split(mtcars, mtcars$cyl), FUN=function(df) {
return(c("mpg" = mean(df$mpg), "hp" = mean(df$hp)))
})
## 4 6 8
## mpg 26.66364 19.74286 15.1000
## hp 82.63636 122.28571 209.2143
aggregate(mtcars[, c("mpg", "hp")], by=list(mtcars$cyl), mean)
## Group.1 mpg hp
## 1 4 26.66364 82.63636
## 2 6 19.74286 122.28571
## 3 8 15.10000 209.21429
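The effect of grouping is easy to check by hand on a tiny data frame. The following sketch assumes dplyr is installed; the data frame `scores` and its columns are invented for illustration:

```r
library(dplyr)

# Hypothetical data: points scored by members of two teams
scores <- data.frame(
  team = c("a", "a", "b", "b", "b"),
  pts  = c(1, 3, 2, 4, 6)
)

# Ungrouped: a single row, averaging over all five observations
overall <- summarize(scores, pts = mean(pts))

# Grouped: one row per team, averaging within each team
by_team <- summarize(group_by(scores, team), pts = mean(pts))

# Base R counterpart for the grouped calculation
base_means <- tapply(scores$pts, INDEX = scores$team, FUN = mean)
```

The same summarize() call yields one row or one row per group depending only on whether its input was passed through group_by() first.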
purrr is one such package that provides a consistent family of iteration functions
map(): list in, list out
map_dbl(), map_lgl(), map_chr(): list in, vector out (of a particular data type)
map_dfr(), map_dfc(): list in, data frame out (row-bound or column-bound)
dplyr is another such package that provides functions for data frame computations
filter(): subset rows based on a condition
group_by(): define groups of rows according to a condition
summarize(): apply computations across groups of rows