Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class time if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted PDF file on Gradescope, by Friday 9pm, this week.
This week’s agenda: getting familiar with data frames; practicing how to use the apply family of functions.
Below we construct a data frame, of 50 states x 10 variables. The first 8 variables are numeric and the last 2 are factors. The numeric variables here come from the built-in state.x77 matrix, which records various demographic factors on 50 US states, measured in the 1970s. You can learn more about this state data set by typing ?state.x77 into your R console.
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
Add a column to state.df, containing the state abbreviations that are stored in the built-in vector state.abb. Name this column Abbr. You can do this in (at least) two ways: by using a call to data.frame(), or by directly defining state.df$Abbr. Display the first 3 rows and all 11 columns of the new state.df.
# YOUR CODE GOES HERE
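As a rough illustration of the two approaches, here is a minimal sketch on a small made-up data frame. The names toy.df, toy.df1, toy.df2, and labels are hypothetical and are not part of the lab data.

```r
# Toy data frame (hypothetical, not the lab's state.df)
toy.df = data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))
labels = c("a", "b", "c")

# Way 1: rebuild the data frame with an extra named column
toy.df1 = data.frame(toy.df, Label = labels)

# Way 2: assign directly to a new column name
toy.df2 = toy.df
toy.df2$Label = labels

# Display the first 2 rows and all 3 columns
head(toy.df1, 2)
```

Both versions end up with the same column names; which one reads better is largely a matter of taste.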
Remove the Region column from state.df. You can do this in (at least) two ways: by using negative indexing, or by directly setting state.df$Region to be NULL. Display the first 3 rows and all 10 columns of state.df.
# YOUR CODE GOES HERE
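The two ways of dropping a column can be sketched as follows on a made-up data frame. The names toy.df, drop1, and drop2 are assumptions for illustration only.

```r
# Toy data frame (hypothetical)
toy.df = data.frame(a = 1:3, b = 4:6, c = 7:9)

# Way 1: negative indexing drops the column at a given position
drop1 = toy.df[, -which(names(toy.df) == "b")]

# Way 2: setting a column to NULL deletes it
drop2 = toy.df
drop2$b = NULL

names(drop1)
names(drop2)
```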
Add two columns to state.df, containing the x and y coordinates (longitude and latitude, respectively) of the center of the states, that are stored in the (existing) list state.center. Hint: take a look at this list in the console, to see what its elements are named. Name these two columns Center.x and Center.y. Display the first 3 rows and all 12 columns of state.df.
# YOUR CODE GOES HERE
subset(). Check that they are equal to each other, using an appropriate function call.
# YOUR CODE GOES HERE
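For reference, here is a small sketch comparing subset() with manual row indexing on made-up data, and two common ways to check that the results agree. The data frame toy.df and the condition x < 0 are hypothetical.

```r
# Toy data frame (hypothetical)
toy.df = data.frame(x = c(-3, 5, -1, 8), y = c("p", "q", "r", "s"))

# Manual indexing: a logical vector selects the rows to keep
neg1 = toy.df[toy.df$x < 0, ]

# subset(): the condition is evaluated inside the data frame
neg2 = subset(toy.df, x < 0)

# identical() tests exact equality; all.equal() allows small numeric tolerance
identical(neg1, neg2)
all.equal(neg1, neg2)
```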
# YOUR CODE GOES HERE
Below we read in the prostate cancer data set that we looked at in the last lab. You can remind yourself about what’s been measured by looking back at the lab.
pros.dat =
read.table("http://www.stat.cmu.edu/~arinaldo/Teaching/36350/F22/data/pros.dat")
Using sapply(), calculate the mean of each variable. Also, calculate the standard deviation of each variable. Each should require just one line of code. Display your results.
# YOUR CODE GOES HERE
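As a reminder of how sapply() treats a data frame (as a list of columns), here is a minimal sketch on a made-up data frame. The names toy.df, col.means, and col.sds are hypothetical.

```r
# Toy data frame (hypothetical)
toy.df = data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# sapply() applies the function to each column and simplifies to a named vector
col.means = sapply(toy.df, mean)
col.sds = sapply(toy.df, sd)

col.means
col.sds
```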
Using lapply(), plot each column, excluding SVI, on the y-axis with SVI on the x-axis. This should require just one line of code. Challenge: label the y-axes in your plots appropriately. Your solution should still consist of just one line of code and use an apply function. Hint: for this part, consider using mapply().
# YOUR CODE GOES HERE
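The reason mapply() helps with labels is that it walks over several arguments in parallel, so each column can be paired with its own name. A rough sketch on made-up data follows. The data frame toy.df and its columns are hypothetical, and the sketch is spread over several lines for readability. The call to pdf(NULL) just opens a null graphics device so that the code runs without producing a file.

```r
# Toy data frame (hypothetical): one 0/1 indicator and two numeric columns
toy.df = data.frame(group = c(0, 0, 1, 1), u = c(1, 2, 3, 4), v = c(4, 3, 2, 1))
others = toy.df[, names(toy.df) != "group"]

pdf(NULL)  # null graphics device; omit this when working interactively

# mapply() passes the i-th column and the i-th name together to the function
out = mapply(function(y, y.name) {
  plot(toy.df$group, y, xlab = "group", ylab = y.name)
  y.name  # return the label that was used
}, others, names(others))

invisible(dev.off())
out
```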
Use lapply() to perform t-tests for each variable in the data set, between SVI and non-SVI groups. To be precise, you will perform a t-test for each variable excluding the SVI variable itself. For convenience, we’ve defined a function t.test.by.ind() below, which takes a numeric variable x, and then an indicator variable ind (of 0s and 1s) that defines the groups. Run this function on the columns of pros.dat, excluding the SVI column itself, and save the result as tests. What kind of data structure is tests? Print it to the console.
# Two-sample t-test of x between the ind == 0 and ind == 1 groups
t.test.by.ind = function(x, ind) {
  stopifnot(all(ind %in% c(0, 1)))
  return(t.test(x[ind == 0], x[ind == 1]))
}
# YOUR CODE GOES HERE
Using lapply() again, extract the p-values from the tests object you created in the last question, with just a single line of code. Hint: first, take a look at the first element of tests: what kind of object is it, and how is the p-value stored? Second, run the command `[[`(pros.dat, "lcavol") in your console: what does this do? Now use what you’ve learned to extract p-values from the tests object.
# YOUR CODE GOES HERE
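To see why the hint is useful: in R, `[[` is itself a function, so it can be handed to lapply() along with the name of the element to pull out. A minimal sketch on a made-up list of lists follows. The names records, id, and score are hypothetical.

```r
# Toy list of named lists (hypothetical)
records = list(first = list(id = 1, score = 0.5),
               second = list(id = 2, score = 0.9))

# Calling `[[` as a function is the same as records[["first"]]
`[[`(records, "first")

# lapply() passes "score" as the second argument to `[[` for each element
scores = lapply(records, `[[`, "score")
scores
```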
Now we’re going to examine data from the 2016 Summer Olympics in Rio
de Janeiro, taken from https://github.com/flother/rio2016 (complete data on the
2020 Summer Olympics in Tokyo doesn’t appear to be available yet). Below
we read in the data and store it as rio.
rio = read.csv("http://www.stat.cmu.edu/~arinaldo/Teaching/36350/F22/data/rio.csv")
rio? What does each row represent? Is there any missing data?
# YOUR CODE GOES HERE
Use rio to answer the following questions. How many athletes competed in the 2016 Summer Olympics? How many countries were represented? What were these countries, and how many athletes competed for each one? Which country brought the most athletes, and how many was this? Hint: for a factor variable f, you can use table(f) to see how many elements in f are in each level of the factor.
# YOUR CODE GOES HERE
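The hint about table() can be sketched on a small made-up factor. The factor f below and its levels are hypothetical.

```r
# Toy factor (hypothetical)
f = factor(c("red", "blue", "red", "green", "red", "blue"))

# table() counts how many elements fall in each level
counts = table(f)
counts

# Number of distinct levels
nlevels(f)

# Level(s) with the largest count; comparing to max() keeps any ties
names(counts)[counts == max(counts)]
max(counts)
```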
# YOUR CODE GOES HERE
Create a variable called total, which adds the number of gold, silver, and bronze medals for each athlete, and add this column to rio. Which athlete had the most medals, and how many was this? Gold medals? Silver medals? In the case of ties, here, display all the relevant athletes.
# YOUR CODE GOES HERE
Using tapply(), calculate the total medal count for each country. Save the result as total.by.nat, and print it to the console. Which country had the most medals, and how many was this? How many countries had zero medals?
# YOUR CODE GOES HERE
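As a reminder of the tapply() pattern (values, grouping variable, function), here is a minimal sketch on made-up data. The data frame toy.df and the names team, points, and points.by.team are hypothetical.

```r
# Toy data frame (hypothetical)
toy.df = data.frame(team = c("x", "y", "x", "z", "y"),
                    points = c(2, 0, 3, 0, 1))

# Sum the points within each team
points.by.team = tapply(toy.df$points, toy.df$team, sum)
points.by.team

# Team(s) with the largest total, and how many teams have a total of zero
names(points.by.team)[points.by.team == max(points.by.team)]
sum(points.by.team == 0)
```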
# YOUR CODE GOES HERE
4a. The variable date_of_birth contains strings of the date of birth of each athlete. Use the substr() function to extract the year of birth for each athlete, and then create a new numeric variable called age, equal to 2016 - (the year of birth). (Here we’re ignoring days and months for simplicity.) Hint: to extract the first 4 characters of a string str, you can use substr(str, 1, 4). As always, you can also look at the help file for substr() for more details. Add the age variable to the rio data frame. Who is the oldest athlete, and how old is he/she? Who is the youngest athlete, and how old is he/she? In the case of ties, here, display all the relevant athletes.
# YOUR CODE GOES HERE
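The substr() step can be sketched on a couple of made-up date strings. The vector dob and its year-first format are assumptions for illustration; check how the dates in the actual data are formatted.

```r
# Toy date strings (hypothetical), assumed to start with a 4-digit year
dob = c("1990-05-17", "2001-11-02")

# substr() is vectorized; as.numeric() converts the extracted text to numbers
birth.year = as.numeric(substr(dob, 1, 4))
age = 2016 - birth.year
age  # 26 15
```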
# YOUR CODE GOES HERE
Using tapply(), answer: how old are the youngest and oldest athletes, for each sport?
# YOUR CODE GOES HERE
The output of tapply() in the last part is a list, which is not particularly convenient. Convert this list into a matrix that has one row for each sport, and two columns that display the ages of the youngest and oldest athletes in that sport. The first 3 rows should look like this:
          Youngest Oldest
aquatics        14     41
archery         17     44
athletics       16     47
You’ll notice that we set the row names according to the sports, and we also set appropriate column names. Hint: unlist() will unravel all the values in a list; and matrix(), as you’ve seen before, can be used to create a matrix from a vector of values. After you’ve converted the results to a matrix, print it to the console (and make sure its first 3 rows match those displayed above).
# YOUR CODE GOES HERE
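The unlist() and matrix() hint can be sketched on made-up data as follows. The names toy.df, ranges, and range.mat, and the column labels Min and Max, are hypothetical. Note the use of byrow = TRUE, since unlist() returns the two values for the first group, then the two values for the second group, and so on.

```r
# Toy data frame (hypothetical)
toy.df = data.frame(grp = c("b", "a", "b", "a", "a"),
                    val = c(5, 1, 9, 4, 2))

# range() returns two values per group, so tapply() gives back a list
ranges = tapply(toy.df$val, toy.df$grp, range)

# Flatten the list, then fill a 2-column matrix one row per group
range.mat = matrix(unlist(ranges), ncol = 2, byrow = TRUE,
                   dimnames = list(names(ranges), c("Min", "Max")))
range.mat
```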
plyr or dplyr if you want to.)
# YOUR CODE GOES HERE
Create a new data frame called sports, which we’ll populate with information about each sporting event at the Summer Olympics. Initially, define sports to contain a single variable called sport which contains the names of the sporting events in alphabetical order. Then, add a column called n_participants which contains the number of participants in each sport. Use one of the apply functions to determine the number of gold medals given out for each sport, and add this as a column called n_gold. Using your newly created sports data frame, calculate the ratio of the number of gold medals to participants for each sport. Which sport has the highest ratio? Which has the lowest?
# YOUR CODE GOES HERE
Using one of the apply functions, compute the average weight of the athletes in each sport, and add it as a column to sports called ave_weight. Important: there are missing weights in the data set coded as NA, but your column ave_weight should ignore these, i.e., it should be itself free of NA values. You will have to pass an additional argument to your apply call in order to achieve this. Hint: look at the help file for the mean() function; what argument can you set to ignore NA values? Once computed, display the average weights along with corresponding sport names, in decreasing order of average weight.
# YOUR CODE GOES HERE
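Extra arguments given to an apply function after the function name are passed along to that function on every call. A minimal sketch on made-up data, using na.rm = TRUE from mean(), is below. The data frame toy.df and the names grp, w, and ave.w are hypothetical.

```r
# Toy data frame (hypothetical) with one missing value
toy.df = data.frame(grp = c("a", "a", "b", "b"),
                    w = c(60, NA, 80, 90))

# na.rm = TRUE is forwarded to mean() for each group
ave.w = tapply(toy.df$w, toy.df$grp, mean, na.rm = TRUE)

# order() gives the positions that sort the values, largest first
ave.w[order(ave.w, decreasing = TRUE)]
```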
Add two columns, called ave_weight_men and ave_weight_women, to sports, containing the average weights computed separately for men and women. Once computed, display the average weights along with corresponding sports, for men and women, each list sorted in decreasing order of average weight. Are the orderings roughly similar?
# YOUR CODE GOES HERE
Average the ave_weight_men and ave_weight_women columns, and define a new column ave_weight2 accordingly. Does ave_weight2 differ from ave_weight? It should. Explain why. Then show how to recompute the average weight from ave_weight_men and ave_weight_women in a way that exactly recreates ave_weight.
# YOUR CODE GOES HERE
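The general point is that a simple average of group means only matches the overall mean when the groups have the same size; otherwise each group mean has to be weighted by its group's count. A small sketch on made-up data with no missing values follows. All names below are hypothetical, and with missing values the counts would need to be counts of non-missing entries.

```r
# Toy data frame (hypothetical): three observations in one group, one in the other
toy.df = data.frame(sex = c("m", "m", "m", "f"),
                    w = c(80, 90, 100, 60))

overall = mean(toy.df$w)                               # mean over all rows
by.sex = tapply(toy.df$w, toy.df$sex, mean)            # mean within each group
n.by.sex = table(toy.df$sex)                           # group sizes, same order

# Simple average of the group means ignores the group sizes
simple = mean(by.sex)

# Weighting each group mean by its size recovers the overall mean
weighted = sum(as.numeric(by.sex) * as.numeric(n.by.sex)) / sum(n.by.sex)

c(overall = overall, simple = simple, weighted = weighted)
```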