Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class time if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted PDF file on Gradescope, by Friday 9pm, this week.
This week’s agenda: practice writing functions and running simulations.
1a. Generate a data frame with a column x that contains 100 draws from \(\mathrm{Unif}(0, 1)\), and a column y that contains 100 draws of the form \(y_i \sim \mathrm{Unif}(0, x_i)\). Do this without using explicit iteration. Then run head() on those variables to check your result.

# YOUR CODE GOES HERE
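A minimal sketch of one approach (df is just an illustrative name), using the fact that runif() is vectorized over its max argument, so no explicit loop is needed:

# x: 100 draws from Unif(0,1); y: one draw from Unif(0, x_i) for each x_i
x = runif(100)
y = runif(100, min = 0, max = x)  # max is recycled elementwise
df = data.frame(x = x, y = y)
head(df)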
1b. We’ve written a function plot.cum.means() below, which plots the cumulative sample mean as the sample size increases. The first argument rfun stands for a function which takes one argument n and generates this many random numbers when called as rfun(n). The second argument n.max is an integer which tells the number of samples to draw. As a side effect, the function plots the cumulative mean against the number of samples.
# plot.cum.means: plot cumulative sample mean as a function of sample size
# Inputs:
# - rfun: function which generates random draws
# - n.max: number of samples to draw
# Output: none
plot.cum.means = function(rfun, n.max) {
  samples = rfun(n.max)
  plot(1:n.max, cumsum(samples) / 1:n.max, type = "l")
}
Use this function to make plots for the following distributions, with n.max=1000. Then answer: do the sample means start concentrating around the appropriate value as the sample size increases? Hint: for each, you should construct a new single-argument random number generator to pass as the rfun argument to plot.cum.means(), as in function(n) rnorm(n, mean=-3, sd=sqrt(10)) for the first case.
# YOUR CODE GOES HERE
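For the first case named in the hint, the call would look like the sketch below; the remaining distributions follow the same pattern, each wrapped as a single-argument generator:

plot.cum.means(function(n) rnorm(n, mean=-3, sd=sqrt(10)), n.max=1000)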
Then call plot.cum.means() with the appropriate random number generator and n.max=1000.

# YOUR CODE GOES HERE
1c. For the same distributions as in Q1b, we will now plot ECDFs alongside the true CDF. To do this, we’ll write a function plot.ecdf(rfun, pfun, sizes) which takes as its arguments the single-argument random number generating function rfun, the corresponding single-argument cumulative distribution function pfun, and a vector of sample sizes sizes for which to plot the ECDF. We’ve already started to define plot.ecdf() below, but we’ve left it incomplete. Fill in the definition by editing the lines with “##” and “??”, and then run it on the same distributions as in Q1b. Examine the plots and discuss how the ECDFs converge as the sample size increases. Note: make sure to remove eval=FALSE, after you’ve edited the function, to see the results.
# plot.ecdf: plots ECDFs along with the true CDF, for varying sample sizes
# Inputs:
# - rfun: function which generates n random draws, when called as rfun(n)
# - pfun: function which calculates the true CDF at x, when called as pfun(x)
# - sizes: a vector of sample sizes
# Output: none
plot.ecdf = function(rfun, pfun, sizes) {
  # Draw the random numbers
  ## samples = lapply(sizes, ??)
  # Calculate the grid for the CDF
  grid.min = min(sapply(samples, min))
  grid.max = max(sapply(samples, max))
  grid = seq(grid.min, grid.max, length=1000)
  # Calculate the ECDFs
  ## ecdfs = lapply(samples, ??)
  evals = lapply(ecdfs, function(f) f(grid))
  # Plot the true CDF
  ## plot(grid, ??, type="l", col="black", xlab="x", ylab = "P(X <= x)")
  # Plot the ECDFs on top
  n.sizes = length(sizes)
  cols = rainbow(n.sizes)
  for (i in 1:n.sizes) {
    lines(grid, evals[[i]], col=cols[i])
  }
  legend("bottomright", legend=sizes, col=cols, lwd=1)
}
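One possible completion, as a sketch (plot.ecdf.complete is an illustrative name; the blanks follow directly from the comments in the skeleton):

plot.ecdf.complete = function(rfun, pfun, sizes) {
  # Draw the random numbers, one vector of draws per sample size
  samples = lapply(sizes, rfun)
  # Calculate the grid for the CDF
  grid.min = min(sapply(samples, min))
  grid.max = max(sapply(samples, max))
  grid = seq(grid.min, grid.max, length=1000)
  # Calculate the ECDFs; ecdf() returns a step function we can evaluate
  ecdfs = lapply(samples, ecdf)
  evals = lapply(ecdfs, function(f) f(grid))
  # Plot the true CDF
  plot(grid, pfun(grid), type="l", col="black", xlab="x", ylab="P(X <= x)")
  # Plot the ECDFs on top
  n.sizes = length(sizes)
  cols = rainbow(n.sizes)
  for (i in 1:n.sizes) {
    lines(grid, evals[[i]], col=cols[i])
  }
  legend("bottomright", legend=sizes, col=cols, lwd=1)
}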
We’re going to continue studying the drug effect model that was discussed in the “Simulation” lecture. Recall, we suppose that there is a new drug that can be optionally given before chemotherapy. We believe those who aren’t given the drug experience a reduction in tumor size of percentage:
\[
X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1),
\]
whereas those who were given the drug experience a reduction in tumor size of percentage:
\[
X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2).
\]
2a. Write a function, simulate.data(), that takes two arguments: n, the sample size (number of subjects in each group), with a default value of 60; and mu.drug, the mean for the exponential distribution that defines the drug tumor reduction measurements, with a default value of 2. Your function should return a list with two vectors called no.drug and drug. Each of these two vectors should have length n, containing the percentage reduction in tumor size under the appropriate condition (not taking the drug or taking the drug).

# YOUR CODE GOES HERE
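A sketch of one implementation of the model above, assuming (as one natural reading of the preamble) that each no-drug subject gets an independent draw of R:

simulate.data = function(n=60, mu.drug=2) {
  # Exp(mean=m) has rate 1/m; rexp() accepts a vector of rates,
  # so each no-drug subject can get their own R ~ Unif(0,1)
  no.drug = 100 * rexp(n, rate = 1 / runif(n))
  drug = 100 * rexp(n, rate = 1 / mu.drug)
  list(no.drug = no.drug, drug = drug)
}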
2b. Run simulate.data() without any arguments (hence, relying on the default values of n and mu.drug), and store the output in results1. Print out the first 6 values in both the results1$no.drug and results1$drug vectors. Now, run simulate.data() again, and store its output in results2. Again, print out the first 6 values in both the results2$no.drug and results2$drug vectors. We have effectively simulated two hypothetical datasets. Note that we shouldn’t expect the values from results1 and results2 to be the same.

# YOUR CODE GOES HERE
2c. Compute the absolute difference in the mean values of no.drug between results1 and results2, the absolute difference in the mean values of drug between results1 and results2, and the absolute difference in mean values of no.drug and drug in results1. Of these three numbers, which one is the largest, and does this make sense?

# YOUR CODE GOES HERE
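The three quantities could be computed as in this sketch:

abs(mean(results1$no.drug) - mean(results2$no.drug))
abs(mean(results1$drug) - mean(results2$drug))
abs(mean(results1$no.drug) - mean(results1$drug))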
2d. Write a function, plot.data(), that takes just one argument data, which is a list with components drug and no.drug. To be clear, this function should create a single plot, with two overlaid histograms, one for data$no.drug (in gray) and one for data$drug (in red), with the same 20 bins. It should also overlay a density curve for each histogram in the appropriate colors, and produce a legend. Once written, call plot.data() on each of results1 and results2.

# YOUR CODE GOES HERE
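A sketch of one way to build this plot in base R; the shared breaks vector is what forces both histograms onto the same 20 bins, and the semi-transparent red is a design choice so the overlap stays visible:

plot.data = function(data) {
  # Common breaks, so that both histograms use the same 20 bins
  rng = range(c(data$no.drug, data$drug))
  breaks = seq(rng[1], rng[2], length=21)
  hist(data$no.drug, breaks=breaks, probability=TRUE, col="gray",
       xlab="Percentage reduction in tumor size", main="")
  hist(data$drug, breaks=breaks, probability=TRUE,
       col=rgb(1, 0, 0, 0.5), add=TRUE)
  # Overlaid density curves, in matching colors
  lines(density(data$no.drug), col="darkgray", lwd=2)
  lines(density(data$drug), col="red", lwd=2)
  legend("topright", legend=c("No drug", "Drug"),
         fill=c("gray", rgb(1, 0, 0, 0.5)))
}
plot.data(results1)
plot.data(results2)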
2e. Run simulate.data() with n=1000 and mu.drug=1.1, and plot the results using plot.data(). In one or two sentences, explain the differences that you see between this plot and the two you produced in the last problem.

# YOUR CODE GOES HERE
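As a sketch (results3 is just an illustrative name):

results3 = simulate.data(n=1000, mu.drug=1.1)
plot.data(results3)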
2f. Write a function, simulate.difference(), which takes in the same two arguments as simulate.data(), namely n and mu.drug, with the same default parameters as before. Your function should generate a new data set using simulate.data() with the appropriate inputs, and then just return the difference in means of drug and no.drug (no absolute value). Run this function twice with no arguments (hence, using the default parameters) to see that your function is returning different numbers, and run the function once with n=1000 and mu.drug=10. Print out all three return values. This last value should be substantially larger than the first two.

# YOUR CODE GOES HERE
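A sketch of the function and the three requested calls:

simulate.difference = function(n=60, mu.drug=2) {
  data = simulate.data(n=n, mu.drug=mu.drug)
  mean(data$drug) - mean(data$no.drug)
}
simulate.difference()
simulate.difference()
simulate.difference(n=1000, mu.drug=10)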
For the next few questions, we will work with this hypothetical: suppose we work for a drug company that wants to put this new drug out on the market. In order to get FDA approval, your company must demonstrate that the patients who had the drug had on average a reduction in tumor size at least 100 percent greater than those who didn’t receive the drug, or in math:
\[
\overline{X}_{\mathrm{drug}} - \overline{X}_{\mathrm{no\,drug}} \geq 100.
\]
Your drug company wants to spend as little money as possible. They want the smallest number \(n\) such that, if they were to run a clinical trial with \(n\) patients in each of the drug / no drug groups, they would likely succeed in demonstrating that the effect size (as above) is at least 100. Of course, the result of a clinical trial is random; your drug company is willing to take “likely” to mean successful with probability 0.95, i.e., successful in 190 of 200 hypothetical clinical trials (though only 1 will be run in reality).
3a. Write a function called rep.sim(). This function takes four arguments: nreps (the number of repetitions, with a default value of 200), n and mu.drug (the values needed for simulate.difference(), with the same defaults as before), and seed (with default value NULL). Your function should run simulate.difference() nreps times, and then return the number of successes, i.e., the number of times that the output of simulate.difference() exceeds 100. Demonstrate that your function works by using it with mu.drug=1.5. Hint: to implement rep.sim(), you could use a for() loop, as shown in the slides, or if you’re interested in trying an alternative route, you could use the replicate() function. Check the documentation to understand how the latter function works.

# YOUR CODE GOES HERE
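A sketch using replicate(); the handling of seed is an assumption (we take a non-NULL seed to mean: seed the random number generator before simulating):

rep.sim = function(nreps=200, n=60, mu.drug=2, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)  # assumed use of the seed argument
  diffs = replicate(nreps, simulate.difference(n=n, mu.drug=mu.drug))
  # Success criterion from the Q3 preamble: effect size of at least 100
  sum(diffs >= 100)
}
rep.sim(mu.drug=1.5)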
3b. Now we vary the sample size n, fixing mu.drug to be 2. For each value of n in between 5 and 100 (inclusive), run your function rep.sim(). You can do this using a for() loop or an apply function. Store the number of successes in a vector. Just to be clear: for each sample size in between 5 and 100, you should have a corresponding number of successes. Plot the number of successes versus the sample size, and label the axes appropriately. Based on your simulation, what is the smallest sample size for which the number of successes is 190 or more?

# YOUR CODE GOES HERE
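A sketch using sapply():

n.values = 5:100
n.successes = sapply(n.values, function(n) rep.sim(n=n, mu.drug=2))
plot(n.values, n.successes, type="l",
     xlab="Sample size n", ylab="Number of successes (out of 200)")
min(n.values[n.successes >= 190])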
3c. How large does mu.drug, the mean proportion of tumor reduction in the drug group, have to be in order to have probability 0.95 of a successful drug trial? Run a simulation, much like your simulation in the last problem, to answer this question. Specifically, similar to before, for each value of the input mu.drug in between 0 and 5, in increments of 0.25, run your function rep.sim(), with n=20 and nreps=200. Plot the number of successes versus the value of mu.drug, and label the axes appropriately. What is the smallest value of mu.drug for which the number of successes exceeds 190?

# YOUR CODE GOES HERE
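A sketch, following the same pattern as Q3b:

mu.values = seq(0, 5, by=0.25)
mu.successes = sapply(mu.values,
                      function(mu) rep.sim(nreps=200, n=20, mu.drug=mu))
plot(mu.values, mu.successes, type="l",
     xlab="mu.drug", ylab="Number of successes (out of 200)")
min(mu.values[mu.successes > 190])  # "exceeds 190", per the question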
3d. Now suppose the trial is run sequentially rather than all at once, starting from n=5 subjects (as always, this means 5 subjects with the drug, 5 subjects without the drug). Modify simulate.data(), simulate.difference() and rep.sim() (whatever is necessary) to accommodate this new scheme. Then run this simulation with 200 repetitions with mu.drug=1.5, and print out how many successes there were. How does this number compare with the result you saw earlier in Q3a? Should it be much different?

# YOUR CODE GOES HERE
A common task in modern data science is to analyze the results of an AB test. AB tests are essentially controlled experiments: we obtain data from two different conditions, such as the different versions of a website we want to show to users, to try to determine which condition gives better results.
4a. Write a function to simulate collecting data from an AB test where the responses from the A condition follow a normal distribution with mean a.mean and standard deviation a.sd, whereas responses from the B condition follow a normal distribution with mean b.mean and standard deviation b.sd.

Your function’s signature should be ab.collect(n, a.mean, a.sd, b.mean, b.sd) where n is the number of samples to collect from each condition and the other arguments are as described above. Your function should return a list with two named components a.responses and b.responses which contain the responses for each condition respectively. Try your function out for several values of a.mean, a.sd, b.mean, and b.sd and check that the sample means and standard deviations approximately match the appropriate theoretical values.
# YOUR CODE GOES HERE
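A sketch of this function and one check (the parameter values here are illustrative):

ab.collect = function(n, a.mean, a.sd, b.mean, b.sd) {
  list(a.responses = rnorm(n, mean=a.mean, sd=a.sd),
       b.responses = rnorm(n, mean=b.mean, sd=b.sd))
}
trial = ab.collect(n=10000, a.mean=100, a.sd=20, b.mean=104, b.sd=10)
sapply(trial, mean)  # should be close to 100 and 104
sapply(trial, sd)    # should be close to 20 and 10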
4b. Next, write a function test.at.end(n, a.mean, a.sd, b.mean, b.sd) which uses your function from Q4a to draw samples of size n and then runs a t-test to determine whether there is a significant difference. We’ll define this as having a p-value at most 0.05. If there is a significant difference, we return either “A” or “B” for whichever condition has the higher mean. If there isn’t a significant difference, we return “Inconclusive”. Hint: recall t.test(), and examine its output on a trial run to figure out how to extract the p-value. Run your function with n=2000, a.mean=100, a.sd=20, b.mean=104, b.sd=10 and display the result.

# YOUR CODE GOES HERE
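A sketch; the p-value lives in the p.value component of the object that t.test() returns:

test.at.end = function(n, a.mean, a.sd, b.mean, b.sd) {
  data = ab.collect(n, a.mean, a.sd, b.mean, b.sd)
  p.value = t.test(data$a.responses, data$b.responses)$p.value
  if (p.value <= 0.05) {
    if (mean(data$a.responses) > mean(data$b.responses)) "A" else "B"
  } else {
    "Inconclusive"
  }
}
test.at.end(n=2000, a.mean=100, a.sd=20, b.mean=104, b.sd=10)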
4c. Waiting until you collect all of the samples can take a while. So you instead decide to take the following approach: each day, collect n.per.day new samples from each condition, then run a t-test (as in Q4b) on all of the data collected so far. If the test is significant, stop and declare the condition with the higher mean the winner; otherwise, continue for up to n.days days, returning “Inconclusive” if no test along the way is significant. Note that this kind of sequential sampling is very common in AB testing. Note also the similarity to what we had you do in Q3d.

Write a function test.as.you.go(n.per.day, n.days, a.mean, a.sd, b.mean, b.sd) to implement this procedure. Your function should return a list with the winner (or “Inconclusive”), as well as the amount of data you needed to collect.

Run this function on the same example as before with n.per.day=100 and n.days=20 (to match the final sample sizes). Do you get the same result? Do you save time collecting data?
# YOUR CODE GOES HERE
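A sketch of one implementation of this procedure:

test.as.you.go = function(n.per.day, n.days, a.mean, a.sd, b.mean, b.sd) {
  a.responses = c()
  b.responses = c()
  for (day in 1:n.days) {
    # Collect one more day's worth of data from each condition
    new.data = ab.collect(n.per.day, a.mean, a.sd, b.mean, b.sd)
    a.responses = c(a.responses, new.data$a.responses)
    b.responses = c(b.responses, new.data$b.responses)
    # Test on everything collected so far; stop early if significant
    p.value = t.test(a.responses, b.responses)$p.value
    if (p.value <= 0.05) {
      winner = if (mean(a.responses) > mean(b.responses)) "A" else "B"
      return(list(winner=winner, n.collected=2*length(a.responses)))
    }
  }
  list(winner="Inconclusive", n.collected=2*length(a.responses))
}
test.as.you.go(n.per.day=100, n.days=20,
               a.mean=100, a.sd=20, b.mean=104, b.sd=10)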
4d. In practice, most AB tests won’t have a clear winner; instead both conditions A and B will be roughly equivalent. In this case we want to avoid false positives: saying there’s a difference when there isn’t really a difference (with respect to the true distributions). Let’s run a simulation that checks the false positive rate of the two testing regimes.
Setting a.mean = b.mean = 100, a.sd = b.sd = 20, and retaining the number of samples as in the previous examples, conduct 1000 AB experiments using each of the previous two setups, in test.at.end() and test.as.you.go().
For each, calculate the number of “A” results, “B” results, and “Inconclusive” results. Is this what you would expect to see—recalling that we are declaring significance if the p-value from the t-test is at most 0.05? Does either method of sampling (all-at-once, or as-you-go) perform better than the other, with respect to controlling false positives? Challenge: can you explain the behavior you’re seeing, with the sequential sampling?
# YOUR CODE GOES HERE
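A sketch of the simulation (1000 repetitions of each regime, so this can take a little while to run):

null.at.end = replicate(1000, test.at.end(n=2000, a.mean=100, a.sd=20,
                                          b.mean=100, b.sd=20))
table(null.at.end)
null.as.you.go = replicate(1000, test.as.you.go(n.per.day=100, n.days=20,
                                                a.mean=100, a.sd=20,
                                                b.mean=100, b.sd=20)$winner)
table(null.as.you.go)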