Statistical Computing, 36-350
Tuesday September 15, 2022
[ ]
and [[ ]]
)if()
, else if()
, else
:
standard conditionalsifelse()
: shortcut for using if()
and
else
in combinationswitch()
: shortcut for using if()
,
elseif()
, and else
in combinationfor()
, while()
: standard loop
constructsfor()
loops, vectorization is
your friend!Data frames
The format for the “classic” data table in statistics: data frame. Lots of the “really-statistical” parts of the R programming language presume data frames
Difference between data frames and lists? Each column in a data frame must have the same length (each element in the list can be of different lengths)
Use data.frame()
, similar to how we create lists
my.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:6],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.df
## nums chars bools
## 1 0.1 a TRUE
## 2 0.2 b FALSE
## 3 0.3 c TRUE
## 4 0.4 d TRUE
## 5 0.5 e FALSE
## 6 0.6 f FALSE
# Recall, a list can have different lengths for different elements!
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] TRUE FALSE TRUE FALSE FALSE FALSE
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## [1] "a" "b" "c" "d" "e" "f"
Often times it’s helpful to start with a matrix, and add columns (of different data types) to make it a data frame
## [1] "matrix" "array"
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
## [1] "factor"
## [1] South West West South West West
## Levels: Northeast South North Central West
## [1] "factor"
## [1] East South Central Pacific Mountain West South Central Pacific Mountain
## 9 Levels: New England Middle Atlantic South Atlantic East South Central West South Central East North Central West North Central ... Pacific
# Combine these into a data frame with 50 rows and 10 columns
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
class(state.df)
## [1] "data.frame"
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area Region Division
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708 South East South Central
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432 West Pacific
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417 West Mountain
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945 South West South Central
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361 West Pacific
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766 West Mountain
To add columns: we can either use data.frame()
, or
directly define a new named column
# First way: use data.frame() to concatenate on a new column
state.df = data.frame(state.df, Cool=sample(c(T,F), nrow(state.df), rep=TRUE))
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area Region Division Cool
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708 South East South Central TRUE
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432 West Pacific FALSE
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417 West Mountain FALSE
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945 South West South Central FALSE
# Second way: just directly define a new named column
state.df$Score = sample(1:100, nrow(state.df), replace=TRUE)
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area Region Division Cool Score
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708 South East South Central TRUE 65
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432 West Pacific FALSE 17
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417 West Mountain FALSE 59
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945 South West South Central FALSE 85
To delete columns: we can either use negative integer indexing, or
set a column to NULL
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area Region Division Cool
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708 South East South Central TRUE
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432 West Pacific FALSE
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417 West Mountain FALSE
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945 South West South Central FALSE
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area Region Division
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708 South East South Central
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432 West Pacific
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417 West Mountain
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945 South West South Central
With matrices or data frames, we’ll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with Boolean indexing
# Compare the averages of the Frost column between states in New England and
# Pacific divisions
mean(state.df[state.df$Division == "New England", "Frost"])
## [1] 145.3333
## [1] 49.6
subset()
: extract rows based on a conditionThe subset()
function provides a convenient alternative
way of accessing rows for data frames
# Using subset(), we can just use the column names directly (i.e., no need for
# using $)
state.df.ne.1 = subset(state.df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state.df.ne.2 = state.df[state.df$Division == "New England", ]
all(state.df.ne.1 == state.df.ne.2)
## [1] TRUE
# Same calculation as in the last slide, using subset()
mean(subset(state.df, Division == "New England")$Frost)
## [1] 145.3333
## [1] 49.6
apply()
R offers a family of apply functions, which allow
you to apply a function across different chunks of data. Offers an
alternative to explicit iteration using for()
loop; can be
simpler and faster, though not always. Summary of functions:
apply()
: apply a function to rows or columns of a
matrix or data framelapply()
: apply a function to elements of a list or
vectorsapply()
: same as the above, but simplify the output
(if possible)tapply()
: apply a function to levels of a factor
vectorapply()
: rows or columns of a matrix or data frameThe apply()
function takes inputs of the following
form:
apply(x, MARGIN=1, FUN=my.fun)
, to apply
my.fun()
across rows of a matrix or data frame
x
apply(x, MARGIN=2, FUN=my.fun)
, to apply
my.fun()
across columns of a matrix or data frame
x
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## 365.00 3098.00 0.50 67.96 1.40 37.80 0.00 1049.00
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## 21198.0 6315.0 2.8 73.6 15.1 67.3 188.0 566432.0
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## 5 2 18 11 1 44 28 2
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Min. 365.00 3098.00 0.500 67.9600 1.400 37.800 0.00 1049.00
## 1st Qu. 1079.50 3992.75 0.625 70.1175 4.350 48.050 66.25 36985.25
## Median 2838.50 4519.00 0.950 70.6750 6.850 53.250 114.50 54277.00
## Mean 4246.42 4435.80 1.170 70.8786 7.378 53.108 104.46 70735.88
## 3rd Qu. 4968.50 4813.50 1.575 71.8925 10.675 59.150 139.75 81162.50
## Max. 21198.00 6315.00 2.800 73.6000 15.100 67.300 188.00 566432.00
For a custom function, we can just define it before hand, and the use
apply()
as usual
# Our custom function: trimmed mean
trimmed.mean = function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750 104.68293 56575.72500
We’ll learn more about functions later (don’t worry too much at this point about the details of the function definition)
Instead of defining a custom function before hand, we can just define it “on-the-fly”. Sometimes this is more convenient
# Compute trimmed means, defining this on-the-fly
apply(state.x77, MARGIN=2, FUN=function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
})
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750 104.68293 56575.72500
Can tell apply()
to pass extra
arguments to the function in question. E.g., can use:
apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2)
,
for two extra arguments extra.arg.1
,
extra.arg.2
to be passed to my.fun()
# Our custom function: trimmed mean, with user-specified percentiles
trimmed.mean = function(v, p1, p2) {
q1 = quantile(v, prob=p1)
q2 = quantile(v, prob=p2)
return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean, p1=0.01, p2=0.99)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## 3974.125000 4424.520833 1.136735 70.882708 7.341667 53.131250 104.895833 61860.687500
What kind of data type will apply()
give us? Depends on
what function we pass. Summary, say, with FUN=my.fun()
:
my.fun()
returns a single value, then
apply()
will return a vectormy.fun()
returns k values, then apply()
will return a matrix with k rows (note: this is true regardless
of whether MARGIN=1
or MARGIN=2
)my.fun()
returns different length outputs for
different inputs, then apply()
will return a listmy.fun()
returns a list, then apply()
will return a listWe’ll grapple with this on the lab. This is one main advantage of
purrr
package: there is a much more transparent return
object type
Don’t overuse the apply paradigm! There’s lots of
special functions that optimized are will be both
simpler and faster than using apply()
. E.g.,
rowSums()
, colSums()
: for computing row,
column sums of a matrixrowMeans()
, colMeans()
: for computing row,
column means of a matrixmax.col()
: for finding the maximum position in each row
of a matrixCombining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?
x = matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })
## [1] 2 3 2
## [1] 2 3 2
lapply()
,
sapply()
, tapply()
lapply()
: elements of a list or vectorThe lapply()
function takes inputs as in:
lapply(x, FUN=my.fun)
, to apply my.fun()
across elements of a list or vector x
. The output is always
a list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] TRUE FALSE TRUE FALSE FALSE FALSE
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical: returning NA
## $nums
## [1] 0.35
##
## $chars
## [1] NA
##
## $bools
## [1] 0.3333333
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE
## logical 4 2
sapply()
: elements of a list or vectorThe sapply()
function works just like
lapply()
, but tries to simplify the return
value whenever possible. E.g., most common is the conversion from a list
to a vector
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical: returning NA
## nums chars bools
## 0.3500000 NA 0.3333333
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE
## logical 4 2
tapply()
: levels of a factor vectorThe function tapply()
takes inputs as in:
tapply(x, INDEX=my.index, FUN=my.fun)
, to apply
my.fun()
to subsets of entries in x
that share
a common level in my.index
# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=mean)
## Northeast South North Central West
## 132.7778 64.6250 138.8333 102.1538
## Northeast South North Central West
## 30.89408 31.30682 23.89307 68.87652
split()
: split by levels of a factorThe function split()
split up the rows of a data frame
by levels of a factor, as in: split(x, f=my.index)
to split
a data frame x
according to levels of
my.index
# Split up the state.x77 matrix according to region
state.by.reg = split(data.frame(state.x77), f=state.region)
class(state.by.reg) # The result is a list
## [1] "list"
## [1] "Northeast" "South" "North Central" "West"
## [1] "data.frame"
## $Northeast
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Connecticut 3100 5348 1.1 72.48 3.1 56.0 139 4862
## Maine 1058 3694 0.7 70.39 2.7 54.7 161 30920
## Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103 7826
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Illinois 11197 5107 0.9 70.14 10.3 52.6 127 55748
## Indiana 5313 4458 0.7 70.88 7.1 52.9 122 36097
## Iowa 2861 4628 0.5 72.56 2.3 59.0 140 55941
##
## $West
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
# For each region, average each of the 8 numeric variables
lapply(state.by.reg, FUN=function(df) {
return(apply(df, MARGIN=2, mean))
})
## $Northeast
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## 5495.111111 4570.222222 1.000000 71.264444 4.722222 53.966667 132.777778 18141.000000
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## 4208.12500 4011.93750 1.73750 69.70625 10.58125 44.34375 64.62500 54605.12500
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## 4803.00000 4611.08333 0.70000 71.76667 5.27500 54.51667 138.83333 62652.00000
##
## $West
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## 2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00 6.200000e+01 1.021538e+02 1.344630e+05
subset()
: function for extracting rows of a data frame
meeting a conditionsplit()
: function for splitting up rows of a data
frame, according to a factor variableapply()
: function for applying a given routine to rows
or columns of a matrix or data framelapply()
: similar, but used for applying a routine to
elements of a vector or listsapply()
: similar, but will try to simplify the return
type, in comparison to lapply()
tapply()
: function for applying a given routine to
groups of elements in a vector or list, according to a factor
variable