Simulation

Statistical Computing, 36-350

Tuesday November 1, 2022

Last week: Functions

Part I

Simulation basics

Why simulate?

R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:

Sampling from a given vector

To sample from a given vector, use sample()

sample(x=letters, size=10) # Without replacement, the default
##  [1] "n" "j" "x" "u" "w" "f" "r" "p" "b" "h"
sample(x=c(0,1), size=10, replace=TRUE) # With replacement
##  [1] 1 1 1 1 1 0 0 1 1 0
sample(x=10) # Arguments set as x=1:10, size=10, replace=FALSE
##  [1]  4 10  1  2  9  8  5  3  6  7

Random number generation

To sample from a normal distribution, we have the utility functions:

Replace “norm” with the name of another distribution, all the same functions apply. E.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, etc.

Random number examples

Standard normal random variables (mean 0 and variance 1)

n = 100
z = rnorm(n, mean=0, sd=1) # These are the defaults for mean, sd
mean(z)  # Check: sample mean is approximately 0
## [1] 0.07550971
var(z)   # Check: sample variance is approximately 1
## [1] 1.0815

Estimated distribution function

To compute empirical cumulative distribution function (ECDF)—the standard estimator of the cumulative distribution function (CDF)—use ecdf()

x = seq(-3, 3, length=100)
ecdf.fun = ecdf(z) # Create the ECDF
class(ecdf.fun) # It's a function!
## [1] "ecdf"     "stepfun"  "function"
ecdf.fun(0)
## [1] 0.54
# We can plot it 
plot(x, ecdf.fun(x), lwd=2, col="red", type="l", ylab="CDF", main="ECDF")
lines(x, pnorm(x), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2, 
       col=c("red","black"))

Interlude: Kolmogorov-Smirnov test

One of the most celebrated tests in statistics is due to Kolmogorov in 1933. The Kolmogorov-Smirnoff (KS) statistic is: \[ \sqrt{\frac{n}{2}} \sup_{x} |F_n(x)-G_n(x)| \] Here \(F_n\) is the ECDF of \(X_1,\ldots,X_n \sim F\), and \(G_n\) is the ECDF of \(Y_1,\ldots,Y_n \sim G\). Under the null hypothesis \(F=G\) (two distributions are the same), as \(n \to \infty\), the KS statistic approaches the supremum of a Brownian bridge: \[ \sup_{t \in [0,1]} |B(t)| \]

Here \(B\) is a Gaussian process with \(B(0)=B(1)=0\), mean \(\mathbb{E}(B(t))=0\) for all \(t\), and covariance function \(\mathrm{Cov}(B(s), B(t)) = s(1-t)\)

n = 500
t = 1:n/n
Sig = t %o% (1-t)
Sig = pmin(Sig, t(Sig))
eig = eigen(Sig)
Sig.half = eig$vec %*% diag(sqrt(eig$val)) %*% t(eig$vec)
B = Sig.half %*% rnorm(n)
plot(t, B, type="l")

Two remarkable facts about the KS test:

  1. It is distribution-free, meaning that the null distribution doesn’t depend on \(F,G\)!

  2. We can actually compute the null distribution and use this test, e.g., via ks.test():

ks.test(rnorm(n), rt(n, df=1)) # Normal versus t1
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  rnorm(n) and rt(n, df = 1)
## D = 0.142, p-value = 8.365e-05
## alternative hypothesis: two-sided
ks.test(rnorm(n), rt(n, df=10)) # Normal versus t10
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  rnorm(n) and rt(n, df = 10)
## D = 0.06, p-value = 0.3291
## alternative hypothesis: two-sided

Estimated density function

To compute histogram—a basic estimator of the density based on binning—use hist()

hist.obj = hist(z, breaks=30, plot=FALSE) 
class(hist.obj) # It's a list
## [1] "histogram"
hist.obj$breaks # These are the break points that were used
##  [1] -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6
## [20]  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8
hist.obj$density # These are the estimated probabilities
##  [1] 0.05 0.00 0.00 0.00 0.00 0.00 0.05 0.25 0.05 0.35 0.30 0.20 0.40 0.50 0.55 0.35 0.10 0.45 0.15
## [20] 0.35 0.10 0.05 0.30 0.20 0.05 0.05 0.10 0.00 0.05
# We can plot it
plot(hist.obj, col="pink", freq=FALSE, main="Histogram")
lines(x, dnorm(x), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2, 
       col=c("pink","black"))

Part II

Pseudorandomness and seeds

Same function call, different results

Not surprisingly, we get different draws each time we call rnorm()

mean(rnorm(n))
## [1] 0.01911518
mean(rnorm(n))
## [1] -0.07090431
mean(rnorm(n))
## [1] -0.01171319
mean(rnorm(n))
## [1] -0.03441602

Is it really random?

Random numbers generated in R (in any language) are not “truly” random; they are what we call pseudorandom

Setting the random seed

All pseudorandom number generators depend on what is called a seed value

Seed examples

# Getting the same 5 random normals over and over
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414
# Different seeds, different numbers
set.seed(1); rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
set.seed(2); rnorm(5)
## [1] -0.89691455  0.18484918  1.58784533 -1.13037567 -0.08025176
set.seed(3); rnorm(5)
## [1] -0.9619334 -0.2925257  0.2587882 -1.1521319  0.1957828
# Each time the seed is set, the same sequence follows (indefinitely)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995

Part III

Iteration and simulation

Drug effect model

What would you do?

What would you do if you had such a model, and your scientist collaborators asked you: how many patients would we need to have in each group (drug, no drug), in order to reliably see that the average reduction in tumor size is large?

So, let’s simulate!

# Simulate, supposing 60 subjects in each group 
set.seed(0)
n = 60 
mu.drug = 2
mu.nodrug = runif(n, min=0, max=1)
x.drug = 100*rexp(n, rate=1/mu.drug) 
x.nodrug = 100*rexp(n, rate=1/mu.nodrug)
# Find the range of all the measurements together, and define breaks
x.range = range(c(x.nodrug,x.drug))
breaks = seq(min(x.range),max(x.range),length=20)

# Produce hist of the non drug measurements, then drug measurements on top
hist(x.nodrug, breaks=breaks, probability=TRUE, xlim=x.range, 
     col="lightgray", xlab="Percentage reduction in tumor size", 
     main="Comparison of tumor reduction")

# Plot a histogram of the drug measurements, on top
hist(x.drug, breaks=breaks, probability=TRUE, col=rgb(1,0,0,0.2), add=TRUE) 

# Draw estimated densities on top, for each dist
lines(density(x.nodrug), lwd=3, col=1)
lines(density(x.drug), lwd=3, col=2)
legend("topright", legend=c("No drug","Drug"), lty=1, lwd=3, col=1:2)

Repetition and reproducibility

Iteration and simulation (and functions): good friends

Code sketch

Consider the code below for a generic simulation. Think about how you would frame this for the drug effect example, which you’ll revisit in lab

# Function to do one simulation run
one.sim = function(param1, param2=value2, param3=value3) {
  # Possibly intricate simulation code goes here
}

# Function to do repeated simulation runs
rep.sim = function(nreps, param1, param2=value2, param3=value3, seed=NULL) {
  # Set the seed, if we need to
  if(!is.null(seed)) set.seed(seed)
  
  # Run the simulation over and over
  sim.objs = vector(length=nreps, mode="list")
  for (r in 1:nreps) {
    sim.objs[r] = one.sim(param1, param2, param3)
  }
  
  # Aggregate the results somehow, and then return something
}

Saving results

Sometimes simulations take a long time to run, and we want to save intermediate or final output, for quick reference later

There two different ways of saving things from R (there are more than two, but here are two useful ones):

Note: there is a big difference between how these two treat variable names

Loading results

Corresponding to the two different ways of saving, we have two ways of loading things into R:

Summary