Statistical Computing, 36-350
Tuesday November 1, 2022
Simulation basics
R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons: often, simulations are easier than hand calculations, and often they can be made more realistic than hand calculations.
To sample from a given vector, use sample()
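For example (a sketch of calls consistent with the output below; the exact draws depend on the random seed):
sample(x=letters, size=10)               # 10 letters, without replacement
sample(x=c(0,1), size=10, replace=TRUE)  # 10 coin flips
sample(x=10)                             # A random permutation of 1:10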
## [1] "n" "j" "x" "u" "w" "f" "r" "p" "b" "h"
## [1] 1 1 1 1 1 0 0 1 1 0
## [1] 4 10 1 2 9 8 5 3 6 7
To sample from a normal distribution, we have the utility functions:
rnorm(): generate normal random variables
pnorm(): normal distribution function, \(\Phi(x)=P(Z \leq x)\)
dnorm(): normal density function, \(\phi(x)=\Phi'(x)\)
qnorm(): normal quantile function, \(q(y)=\Phi^{-1}(y)\), i.e., \(\Phi(q(y))=y\)
Replace “norm” with the name of another distribution and all the same functions apply. E.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, etc.
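For instance, a few calls across distributions (the first two values are standard facts; the last gives random draws):
pnorm(0)         # 0.5, since the standard normal is symmetric around 0
qnorm(0.975)     # About 1.96, the familiar 97.5% quantile
rexp(3, rate=1)  # Three draws from an Exponential(1) distribution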
Standard normal random variables (mean 0 and variance 1)
n = 100
z = rnorm(n, mean=0, sd=1) # These are the defaults for mean, sd
mean(z) # Check: sample mean is approximately 0
## [1] 0.07550971
var(z)  # Check: sample variance is approximately 1
## [1] 1.0815
To compute the empirical cumulative distribution function (ECDF)—the standard estimator of the cumulative distribution function (CDF)—use ecdf():
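E.g. (a sketch; the object name is ours, and we evaluate at 0, consistent with the output below):
ecdf.fun = ecdf(z)  # z: the standard normals from above
class(ecdf.fun)     # What kind of object is this?
ecdf.fun(0)         # Should be close to Phi(0) = 0.5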
## [1] "ecdf" "stepfun" "function"
## [1] 0.54
One of the most celebrated tests in statistics is due to Kolmogorov in 1933. The Kolmogorov-Smirnov (KS) statistic is: \[ \sqrt{\frac{n}{2}} \sup_{x} |F_n(x)-G_n(x)| \] Here \(F_n\) is the ECDF of \(X_1,\ldots,X_n \sim F\), and \(G_n\) is the ECDF of \(Y_1,\ldots,Y_n \sim G\) (two samples of equal size \(n\)). Under the null hypothesis \(F=G\) (the two distributions are the same), as \(n \to \infty\), the KS statistic approaches the supremum of a Brownian bridge: \[ \sup_{t \in [0,1]} |B(t)| \]
Here \(B\) is a Gaussian process with \(B(0)=B(1)=0\), mean \(\mathbb{E}(B(t))=0\) for all \(t\), and covariance function \(\mathrm{Cov}(B(s), B(t)) = s(1-t)\) for \(s \leq t\).
Two remarkable facts about the KS test:
1. It is distribution-free, meaning that the null distribution doesn’t depend on \(F,G\)!
2. We can actually compute the null distribution and use this test, e.g., via ks.test():
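For example (a sketch; the sample size n = 500 here is our guess, chosen to be consistent with the printed D and p-values below):
n = 500
ks.test(rnorm(n), rt(n, df=1))  # t with 1 df has much heavier tails than normal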
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: rnorm(n) and rt(n, df = 1)
## D = 0.142, p-value = 8.365e-05
## alternative hypothesis: two-sided
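In contrast, a t distribution with 10 degrees of freedom is much closer to normal (again a sketch, with the same assumed n):
ks.test(rnorm(n), rt(n, df=10))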
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: rnorm(n) and rt(n, df = 10)
## D = 0.06, p-value = 0.3291
## alternative hypothesis: two-sided
To compute a histogram—a basic estimator of the density based on binning—use hist():
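E.g. (a sketch; the object name is ours, and the breaks are inferred from the output below):
hist.obj = hist(z, breaks=seq(-3, 2.8, by=0.2), plot=FALSE)  # Assumes all z lie in [-3, 2.8]
class(hist.obj)
hist.obj$breaks   # Endpoints of the bins
hist.obj$density  # Estimated density over each bin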
## [1] "histogram"
## [1] -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6
## [20] 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8
## [1] 0.05 0.00 0.00 0.00 0.00 0.00 0.05 0.25 0.05 0.35 0.30 0.20 0.40 0.50 0.55 0.35 0.10 0.45 0.15
## [20] 0.35 0.10 0.05 0.30 0.20 0.05 0.05 0.10 0.00 0.05
Pseudorandomness and seeds
Not surprisingly, we get different draws each time we call rnorm():
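For example, four calls in a row (a sketch; each mean is computed over a fresh sample):
mean(rnorm(n))
mean(rnorm(n))
mean(rnorm(n))
mean(rnorm(n))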
## [1] 0.01911518
## [1] -0.07090431
## [1] -0.01171319
## [1] -0.03441602
Random numbers generated in R (as in any language) are not “truly” random; they are what we call pseudorandom. Type ?Random in your R console to read more about this (and to read how to change the algorithm used for pseudorandom number generation, which you should never really have to do, by the way). All pseudorandom number generators depend on what is called a seed value, which we set with set.seed(); once the seed is set, the sequence of numbers that follows is completely determined.
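For example (a sketch consistent with the output below):
# All the same! The seed fixes the sequence
set.seed(0); rnorm(5)
set.seed(0); rnorm(5)
set.seed(0); rnorm(5)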
## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414
## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414
## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414
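Different seeds give different (but individually reproducible) sequences; the seed values 1, 2, 3 below are our guesses, consistent with the output:
set.seed(1); rnorm(5)
set.seed(2); rnorm(5)
set.seed(3); rnorm(5)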
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176
## [1] -0.9619334 -0.2925257 0.2587882 -1.1521319 0.1957828
# Each time the seed is set, the same sequence follows (indefinitely)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1] 1.2629543 -0.3262334 1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
## [1] 1.2629543 -0.3262334 1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
## [1] 1.2629543 -0.3262334 1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
Iteration and simulation
What would you do if you had such a model, and your scientist collaborators asked you: how many patients would we need to have in each group (drug, no drug), in order to reliably see that the average reduction in tumor size is large?
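The measurements x.nodrug and x.drug used in the plotting code below aren’t defined in this excerpt; as a stand-in, here is one purely hypothetical way to generate them (the group sizes, means, and sds are made up):
set.seed(0)
x.nodrug = rnorm(60, mean=1.5)  # Hypothetical no-drug group
x.drug = rnorm(60, mean=2.0)    # Hypothetical drug group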
# Find the range of all the measurements together, and define breaks
x.range = range(c(x.nodrug,x.drug))
breaks = seq(min(x.range),max(x.range),length=20)
# Plot a histogram of the no-drug measurements first
hist(x.nodrug, breaks=breaks, probability=TRUE, xlim=x.range,
col="lightgray", xlab="Percentage reduction in tumor size",
main="Comparison of tumor reduction")
# Plot a histogram of the drug measurements, on top
hist(x.drug, breaks=breaks, probability=TRUE, col=rgb(1,0,0,0.2), add=TRUE)
# Draw estimated densities on top, for each dist
lines(density(x.nodrug), lwd=3, col=1)
lines(density(x.drug), lwd=3, col=2)
legend("topright", legend=c("No drug","Drug"), lty=1, lwd=3, col=1:2)
Consider the code below for a generic simulation; note how it uses set.seed() so that runs are reproducible. Think about how you would frame this for the drug effect example, which you’ll revisit in lab
# Function to do one simulation run
one.sim = function(param1, param2=value2, param3=value3) {
# Possibly intricate simulation code goes here
}
# Function to do repeated simulation runs
rep.sim = function(nreps, param1, param2=value2, param3=value3, seed=NULL) {
# Set the seed, if we need to
if(!is.null(seed)) set.seed(seed)
# Run the simulation over and over
sim.objs = vector(length=nreps, mode="list")
for (r in 1:nreps) {
sim.objs[[r]] = one.sim(param1, param2, param3)
}
# Aggregate the results somehow, and then return something
}
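As one hypothetical way to fill in this template for the drug example (the function bodies, parameter names, and values below are all made up):
one.sim = function(n.patients, mu.drug=2.0, mu.nodrug=1.5) {
  # Simulate percentage reduction in tumor size for each group
  x.drug = rnorm(n.patients, mean=mu.drug)
  x.nodrug = rnorm(n.patients, mean=mu.nodrug)
  return(mean(x.drug) - mean(x.nodrug))  # Observed difference in means
}

rep.sim = function(nreps, n.patients, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  diffs = numeric(nreps)
  for (r in 1:nreps) diffs[r] = one.sim(n.patients)
  return(mean(diffs))  # Aggregate: average difference across runs
}

rep.sim(nreps=500, n.patients=50, seed=0)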
Sometimes simulations take a long time to run, and we want to save intermediate or final output, for quick reference later
There are two different ways of saving things from R (there are more than two, but here are two useful ones):
saveRDS(): allows us to save a single R object (like a vector, matrix, list, etc.) in (say) .rds format.
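E.g., a hypothetical example (the object and file names are made up):
my.mat = matrix(rnorm(20), nrow=4)     # A single object to save
saveRDS(my.mat, file="my.matrix.rds")  # Write it to an .rds file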
save(): allows us to save any number of R objects in (say) .rdata format.
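E.g., continuing the hypothetical example above:
my.vec = 1:10
save(my.mat, my.vec, file="my.objects.rdata")  # Write both objects, with their names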
Note: there is a big difference between how these two treat variable names: saveRDS() saves only the object’s value (we choose a variable name when reading it back in), whereas save() records each object together with its variable name.
Corresponding to the two different ways of saving, we have two ways of loading things into R:
readRDS(): allows us to load an object that has been saved by saveRDS(), and assign it to a new variable name.
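E.g., reading back the hypothetical .rds file from above (the new variable name is ours):
my.new.mat = readRDS("my.matrix.rds")  # We pick the name my.new.mat ourselves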
load(): allows us to load all objects that have been saved through save(), according to their original variable names.
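E.g., with the hypothetical .rdata file from above:
load("my.objects.rdata")  # Restores my.mat and my.vec, under their original names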