Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class time if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted PDF file on Gradescope, by Friday 9pm, this week.
This week’s agenda: exploratory data analysis, cleaning data, fitting linear/logistic models, and using associated utility functions.
Below we read in the prostate cancer data set that we looked at in previous labs.
pros.df =
read.table("https://www.stat.cmu.edu/~arinaldo/Teaching/36350/F22/data/pros.dat")
dim(pros.df)
## [1] 97 9
head(pros.df, 3)
## lcavol lweight age lbph svi lcp gleason
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7
## pgg45 lpsa
## 1 0 -0.4307829
## 2 0 -0.1625189
## 3 20 -0.1625189
Define pros.df.subset to be the subset of observations (rows) of the prostate data set such that the lcp measurement is greater than the minimum value (the minimum value happens to be log(0.25), but you should not hardcode this value and should work it out from the data). As in lecture, plot histograms of all of the variables in pros.df.subset. Comment on any differences you see between these distributions and the ones in lecture.

# YOUR CODE GOES HERE
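For reference, a minimal sketch of one way to do this (the 3x3 grid assumes the 9 columns shown above):

pros.df.subset = pros.df[pros.df$lcp > min(pros.df$lcp), ] # keep rows above the minimum lcp
par(mfrow = c(3, 3)) # one panel per variable
for (j in 1:ncol(pros.df.subset)) {
  hist(pros.df.subset[, j], main = colnames(pros.df.subset)[j],
       xlab = colnames(pros.df.subset)[j], col = "lightblue")
}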
Compute the correlation matrix of the variables in pros.df.subset. Report the two highest correlations between pairs of (distinct) variables, and also report the names of the associated variables. Are these different from the answers that were computed on the full data set?

# YOUR CODE GOES HERE
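One possible sketch, zeroing out duplicate entries so each distinct pair is counted once:

pros.cor = cor(pros.df.subset) # correlation matrix
pros.cor[lower.tri(pros.cor, diag = TRUE)] = 0 # zero out self- and duplicate entries
top2 = arrayInd(order(abs(pros.cor), decreasing = TRUE)[1:2], dim(pros.cor))
pros.cor[top2] # the two highest correlations
cbind(rownames(pros.cor)[top2[, 1]], colnames(pros.cor)[top2[, 2]]) # their variable names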
Plot a heatmap of the correlation matrix of pros.df.subset. For this heatmap, use the full matrix (not just its upper triangular part). Make sure your heatmap is displayed in a sensible way and that it’s clear what the variables are in the plot. For full points, create your heatmap using base R graphics (hint: the clockwise90() function from the “Plotting tools” lecture will be handy); for partial points, use an R package.

# YOUR CODE GOES HERE
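A sketch using base R graphics; the clockwise90() definition below is assumed to match the one from the lecture slides, so check it against your notes:

clockwise90 = function(a) { t(a[nrow(a):1, ]) } # rotate a matrix so image() shows it in reading order
pros.cor.full = cor(pros.df.subset) # full correlation matrix
image(clockwise90(pros.cor.full), col = heat.colors(20), axes = FALSE)
axis(1, at = seq(0, 1, length = ncol(pros.cor.full)), labels = colnames(pros.cor.full))
axis(2, at = seq(0, 1, length = nrow(pros.cor.full)),
     labels = rev(rownames(pros.cor.full)), las = 2) # reversed so row 1 sits at the top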
Fit, using lm(), a linear regression model of lpsa (log PSA score) on lcavol (log cancer volume). Do this twice: once with the full data set, pros.df, and once with the subsetted data, pros.df.subset. Save the results as pros.lm and pros.subset.lm, respectively. Using coef(), display the coefficients (intercept and slope) from each linear regression. Are they different?

# YOUR CODE GOES HERE
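A minimal sketch:

pros.lm = lm(lpsa ~ lcavol, data = pros.df) # fit on the full data
pros.subset.lm = lm(lpsa ~ lcavol, data = pros.df.subset) # fit on the subset
coef(pros.lm)
coef(pros.subset.lm)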
Plot lpsa versus lcavol, using the full set of observations, in pros.df. Label the axes appropriately. Then, mark the observations in pros.df.subset by small filled red circles. Add a thick black line to your plot, displaying the fitted regression line from pros.lm. Add a thick red line, displaying the fitted regression line from pros.subset.lm. Add a legend that explains the color coding.

# YOUR CODE GOES HERE
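One way to build the plot, layer by layer:

plot(pros.df$lcavol, pros.df$lpsa,
     xlab = "Log cancer volume", ylab = "Log PSA score")
points(pros.df.subset$lcavol, pros.df.subset$lpsa, pch = 19, cex = 0.8, col = "red")
abline(pros.lm, lwd = 3, col = "black") # fit on the full data
abline(pros.subset.lm, lwd = 3, col = "red") # fit on the subset
legend("bottomright", legend = c("All observations", "Subset (lcp > min)"),
       pch = c(1, 19), col = c("black", "red"))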
Fit, again using lm(), a linear regression model of lpsa on lcavol, but now on two different subsets of the data: the first consisting of patients with SVI, and the second consisting of patients without SVI. Display the resulting coefficients (intercept and slope) from each model, and produce a plot just like the one in the last question, to visualize the different regression lines on top of the data. Do these two regression lines differ, and in what way?

# YOUR CODE GOES HERE
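A sketch along the same lines (the object names here are illustrative):

pros.lm.svi = lm(lpsa ~ lcavol, data = pros.df, subset = (svi == 1)) # patients with SVI
pros.lm.nosvi = lm(lpsa ~ lcavol, data = pros.df, subset = (svi == 0)) # patients without SVI
coef(pros.lm.svi)
coef(pros.lm.nosvi)
plot(pros.df$lcavol, pros.df$lpsa, xlab = "Log cancer volume", ylab = "Log PSA score")
points(pros.df$lcavol[pros.df$svi == 1], pros.df$lpsa[pros.df$svi == 1], pch = 19, col = "red")
abline(pros.lm.svi, lwd = 3, col = "red")
abline(pros.lm.nosvi, lwd = 3, col = "black")
legend("bottomright", legend = c("SVI", "No SVI"), pch = c(19, 1), col = c("red", "black"))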
Read the wage data into R using read.csv(), setting stringsAsFactors = TRUE in this function call, and save the resulting data frame as wage.df. Check that wage.df has the right dimensions, and display its first 3 rows. Hint: the first several lines of the linked file just explain the nature of the data; open up the file (either directly in your web browser or after you download it to your computer), and count how many lines must be skipped before getting to the data; then use an appropriate setting for the skip argument to read.csv().

# YOUR CODE GOES HERE
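A sketch, with wage.url standing in for the link given in the assignment (a placeholder, not the real address), and skip = 16 only illustrative; count the preamble lines in the actual file:

wage.url = "URL-from-the-assignment" # placeholder only: paste in the actual link
wage.df = read.csv(wage.url, skip = 16, stringsAsFactors = TRUE) # adjust skip after inspecting the file
dim(wage.df)
head(wage.df, 3)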
Identify all of the factor variables in wage.df, set up a plotting grid of appropriate dimensions, and then plot each of these factor variables, with appropriate titles. What do you notice about the distributions?

# YOUR CODE GOES HERE
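One programmatic approach:

is.fac = sapply(wage.df, is.factor) # flag the factor columns
sum(is.fac) # how many panels we need
par(mfrow = c(3, 3)) # adjust the grid to that count
for (j in which(is.fac)) {
  plot(wage.df[, j], main = colnames(wage.df)[j]) # plotting a factor gives a barplot
}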
Identify all of the numeric variables in wage.df, set up a plotting grid of appropriate dimensions, and then plot histograms of each of these numeric variables, with appropriate titles and x-axis labels. What do you notice about the distributions? In particular, what do you notice about the distribution of the wage column? Does it appear to be unimodal (having a single mode)? Does what you see make sense?

# YOUR CODE GOES HERE
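The numeric analogue of the previous sketch:

is.num = sapply(wage.df, is.numeric) # flag the numeric columns
par(mfrow = c(2, 2)) # adjust the grid to the count of numeric columns
for (j in which(is.num)) {
  hist(wage.df[, j], main = colnames(wage.df)[j],
       xlab = colnames(wage.df)[j], col = "lightblue")
}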
Fit a linear regression model, using lm(), with response variable wage and predictor variables year and age, using the wage.df data frame. Call the result wage.lm. Display the coefficient estimates, using coef(), for year and age. Do they have the signs you would expect, i.e., can you explain their signs? Display a summary, using summary(), of this linear model. Report the standard errors and p-values associated with the coefficient estimates for year and age. Do both of these predictors appear to be significant, based on their p-values?

# YOUR CODE GOES HERE
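A minimal sketch:

wage.lm = lm(wage ~ year + age, data = wage.df)
coef(wage.lm) # coefficient estimates
summary(wage.lm) # standard errors and p-values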
Save the standard errors of the coefficient estimates for year and age into a vector called wage.se, and print it out to the console. Don’t just type in the values you see from summary(); you need to determine these values programmatically. Hint: define wage.sum to be the result of calling summary() on wage.lm; then figure out what kind of R object wage.sum is, and how you can extract the standard errors.

# YOUR CODE GOES HERE
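Following the hint, a sketch:

wage.sum = summary(wage.lm)
class(wage.sum) # a "summary.lm" object, which is a list underneath
wage.se = wage.sum$coefficients[c("year", "age"), "Std. Error"] # pull from the coefficient matrix
wage.se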
Plot diagnostics of the linear model fit by calling plot() on wage.lm. Look at the “Residuals vs Fitted”, “Scale-Location”, and “Residuals vs Leverage” plots—are there any groups of points away from the main bulk of points along the x-axis? Look at the “Normal Q-Q” plot—do the standardized residuals lie along the line \(y=x\)? Note: don’t worry too much if you’re generally unsure how to interpret these diagnostic plots; you’ll learn a lot more in your Modern Regression 36-401 course; for now, you can just answer the questions we asked. Challenge: what is causing the discrepancies you are (should be) seeing in these plots? Hint: look back at the histogram of the wage column you plotted above.

# YOUR CODE GOES HERE
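A sketch showing all four diagnostic plots at once:

par(mfrow = c(2, 2)) # 2x2 grid for the four diagnostic plots
plot(wage.lm)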
Refit the linear regression model, with response variable wage and predictor variables year and age, but this time only using observations in the wage.df data frame for which the wage variable is less than or equal to 250 (note, this is measured in thousands of dollars!). Call the result wage.lm.lt250. Display a summary, reporting the coefficient estimates of year and age, their standard errors, and associated p-values. Are these coefficients different than before? Are the predictors year and age still significant? Finally, plot diagnostics. Do the “Residuals vs Fitted”, “Normal Q-Q”, “Scale-Location”, and “Residuals vs Leverage” plots still have the same problems as before?

# YOUR CODE GOES HERE
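A sketch using the subset argument:

wage.lm.lt250 = lm(wage ~ year + age, data = wage.df, subset = (wage <= 250))
summary(wage.lm.lt250)
par(mfrow = c(2, 2))
plot(wage.lm.lt250) # diagnostics for the refitted model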
Use your model wage.lm.lt250 to predict: (a) what a 30 year old person should be making this year; (b) what President Trump should be making this year; (c) what you should be making 5 years from now. Comment on the results—which do you think is the most accurate prediction?

# YOUR CODE GOES HERE
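A sketch with predict(); the years and ages below are illustrative stand-ins only, so fill in the values that match your own answers to (a), (b), and (c):

predict(wage.lm.lt250, data.frame(year = 2022, age = 30)) # (a) a 30 year old, this year
predict(wage.lm.lt250, data.frame(year = 2022, age = 76)) # (b) substitute the appropriate age
predict(wage.lm.lt250, data.frame(year = 2027, age = 25)) # (c) yourself, 5 years from now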
Fit a logistic regression model, using glm() with family="binomial", with the response variable being the indicator that wage is larger than 250, and the predictor variables being year and age. Call the result wage.glm. Note: you can set this up in two different ways: (i) you can manually define a new column (say) wage.high in the wage.df data frame to be the indicator that the wage column is larger than 250; or (ii) you can define an indicator variable “on-the-fly” in the call to glm() with an appropriate usage of I(). Display a summary, reporting the coefficient estimates for year and age, their standard errors, and associated p-values. Are the predictors year and age both significant?

# YOUR CODE GOES HERE
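A sketch using option (ii), with I() building the indicator on-the-fly:

wage.glm = glm(I(wage > 250) ~ year + age, data = wage.df, family = "binomial")
summary(wage.glm)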
Refit the logistic regression model, now with the predictor variables being year, age, and education. Note that the third predictor is stored as a factor variable, which we call a categorical variable (rather than a continuous variable, like the first two predictors) in the context of regression modeling. Display a summary. What do you notice about the predictor education: how many coefficients are associated with it in the end? Can you explain why the number of coefficients associated with education makes sense?

# YOUR CODE GOES HERE
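A sketch (the object name wage.glm.edu is illustrative):

wage.glm.edu = glm(I(wage > 250) ~ year + age + education,
                   data = wage.df, family = "binomial")
summary(wage.glm.edu) # one coefficient per non-reference education level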
In order for the logistic regression coefficients to be well-defined, for each level of the education variable, we should have people at this education level that have a wage less than or equal to 250, and also people at this education level that have a wage above 250. Which levels of education fail to meet this criterion? Let’s call these levels “incomplete”, and the other levels “complete”.

# YOUR CODE GOES HERE
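A quick way to check this is to cross-tabulate education against the indicator:

table(wage.df$education, wage.df$wage > 250) # levels with a zero in the "TRUE" column are incomplete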
Refit the logistic regression model, now excluding the subset of wage.df that corresponds to the incomplete education levels (equivalently, using only the data from the complete education levels). Display a summary, and comment on the differences seen compared to the summary for the logistic regression model fitted in Q4b. Did any predictors become more significant, according to their p-values?

# YOUR CODE GOES HERE
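A sketch building on the table from the previous question (object names illustrative):

complete.levels = names(which(table(wage.df$education, wage.df$wage > 250)[, "TRUE"] > 0))
wage.df.complete = droplevels(subset(wage.df, education %in% complete.levels)) # drop empty levels
wage.glm.complete = glm(I(wage > 250) ~ year + age + education,
                        data = wage.df.complete, family = "binomial")
summary(wage.glm.complete)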
Install the gam package, if you haven’t already, and load it into your R session with library(gam). Fit a generalized additive model, using gam() with family="binomial", with the response variable being the indicator that wage is larger than 250, and the predictor variables being year, age, and education; as in the last question, only use observations in wage.df corresponding to the complete education levels. Also, in the call to gam(), allow for age to have a nonlinear effect by using s() (leave year and education alone, and they will have the default, linear effects). Call the result wage.gam. Display a summary with summary(). Is the age variable more or less significant, in terms of its p-value, compared to what you saw in the logistic regression model fitted in the last question? Also, plot the fitted effect for each predictor, using plot(). Comment on the plots—does the fitted effect make sense to you? In particular, is there a strong nonlinearity associated with the effect of age, and does this make sense?

# YOUR CODE GOES HERE
Using wage.gam, predict the probability that a 30 year old person, who earned a Ph.D., will make over $250,000 in 2018.

# YOUR CODE GOES HERE
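A sketch; the education label below assumes the level naming used in this data set, so check levels(wage.df$education) and adjust as needed:

predict(wage.gam,
        data.frame(year = 2018, age = 30, education = "5. Advanced Degree"),
        type = "response") # type = "response" gives the prediction on the probability scale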