Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class time if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted PDF file on Gradescope, by Friday 9pm, this week.
This week’s agenda: practicing SQLite queries, performing simple computations and joins, and testing our understanding by writing equivalent R code for these database manipulations.
Thanks to Sean Lahman, extensive baseball data is freely available all the way back to the 1871 season. We’re going ot use a SQLite version of the baseball database put together by Jeff Knecht, at https://github.com/jknecht/baseball-archive-sqlite. The most recent SQLite database was recently updated to include the 2016 season. It has been posted to the class website at https://www.stat.cmu.edu/~arinaldo/Teaching/36350/F22/data/lahman2016.sqlite. Download this file (it’s about 50 MB) and save it in the working directory for your lab.
DBI
,
RSQLite
if you haven’t done so already, and load them into
your R session. Using dbDriver()
, dbConnect()
,
set up a connection called con
the SQLite database stored
in lahman2016.sqlite. Then, use dbListTables()
to list the
tables in the database.# YOUR CODE GOES HERE
dbReadTable()
, grab the
table named “Batting” and save it as a data frame in your R session,
called batting
. Check that batting
is indeed a
data frame, and that it has dimension 102816 x 24.# YOUR CODE GOES HERE
eval=FALSE
from the
preamble in the R code chunks below. Then, after each SQL query (each
call to dbGetQuery()
), explain in words what is being
extracted, and write one line of base R code (sometimes you might need
two lines) to get the same result using the batting
data
frame.dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"ORDER BY yearID",
"LIMIT 10"))
# YOUR CODE GOES HERE
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"ORDER BY HR DESC",
"LIMIT 10"))
# YOUR CODE GOES HERE
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"WHERE HR > 55",
"ORDER BY HR DESC"))
# YOUR CODE GOES HERE
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"WHERE yearID >= 1990 AND yearID <= 2000",
"ORDER BY HR DESC",
"LIMIT 10"))
# YOUR CODE GOES HERE
dplyr
verbs and pipes.# YOUR CODE GOES HERE
eval=FALSE
from
the preamble in the following R code chunks. Then, after each SQL query,
explain in words what is being extracted, and write one line of base R
code to get the same result using the batting
data frame.
Hint: often you’ll have to use na.rm=TRUE
to deal with NA
values, for example mean(x, na.rm=TRUE)
computes the mean
of a vector x
after removing any NA values.dbGetQuery(con, paste("SELECT AVG(HR)",
"FROM Batting"))
# YOUR CODE GOES HERE
dbGetQuery(con, paste("SELECT SUM(HR)",
"FROM Batting"))
# YOUR CODE GOES HERE
dbGetQuery(con, paste("SELECT playerID, yearID, teamID, MAX(HR)",
"FROM Batting"))
# YOUR CODE GOES HERE
dbGetQuery(con, paste("SELECT AVG(HR)",
"FROM Batting",
"WHERE yearID >= 1990"))
# YOUR CODE GOES HERE
batting
data frame. You may
use base R, dplyr
, pipes, or whatever means you want.dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
"FROM Batting",
"WHERE yearID >= 1990",
"GROUP BY teamID",
"LIMIT 5"))
# YOUR CODE GOES HERE
dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
"FROM Batting",
"WHERE yearID < 1960",
"GROUP BY teamID",
"ORDER BY AVG(HR) DESC",
"LIMIT 5"))
# YOUR CODE GOES HERE
dbGetQuery(con, paste("SELECT teamID, yearID, AVG(HR)",
"FROM Batting",
"WHERE yearID == 1991 OR yearID == 1992",
"GROUP BY teamID, yearID",
"ORDER BY AVG(HR) DESC",
"LIMIT 15"))
# YOUR CODE GOES HERE
H
) over the
seasons they played, and display the players with the 10 highest hit
averages, along with their hit averages. Hint: AVG()
,
GROUP BY
, ORDER BY
.# YOUR CODE GOES HERE
HAVING
.# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
JOIN
, merge the “Batting”
and “Salaries” tables based on matching the yearID
,
playerID
pairs. Display the year, player, salary, and
number of hits for the first 10 records.# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
E
) committed by a player in a
season, only considering the year 1990 and later. In addition to the
number of errors, list the year and player ID for each record.# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
dbGetQuery()
actually returns a data frame. You should have
a data frame of dimension 46 x 2, and the 2 columns should display the
team ID and the payroll. Check that your data frame has the right
dimensions and display its first 10 rows. Then, answer: what team has
the highest total payroll? The lowest payroll? Where do the Pirates
rank?# YOUR CODE GOES HERE
GROUP BY
can take two arguments, separated by a comma. You
should have a data frame of dimension 918 x 3, and the 3 columns should
be display the team ID, year, and payroll. Check that your data frame
has the proper dimensions, and display its last 10 rows.# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
H
) divided by number
of at bats (AB
) achieved by a player in a given season, but
(let’s say) it is only defined for players that have at least 400 at
bats in that season. Your resulting data frame from the SQL query should
be 10 x 3, with the 3 columns displaying the playerID, yearID, and
batting average.# YOUR CODE GOES HERE
breaks
argument to get a good
sense of the shape of the histogram. Does this look like a normal
distribution to you? What is the estimated mean and the standard
deviation? Overlay the normal density curve on top of your histogram,
with the appropriate mean and variance, and comment on how it fits.
Perform a rigorous hypothesis test for normality of batting averages
here; you might consider using ks.test()
.# YOUR CODE GOES HERE
ks.test()
.# YOUR CODE GOES HERE
# YOUR CODE GOES HERE