Statistical Computing, 36-350
Tuesday October 4, 2022
Pipes (%>% operator)
dplyr is a package for data wrangling, with several key verbs (functions)
filter(): subset rows based on a condition
group_by(): define groups of rows according to a condition
summarize(): apply computations across groups of rows
arrange(): order rows by value of a column
select(): pick out given columns
mutate(): create new columns
mutate_at(): apply a function to given columns
tidyr is a package for manipulating the structure of data frames
pivot_longer(): make “wide” data longer
pivot_wider(): make “long” data wider
String basics
The simplest distinction:
Character: a symbol in a written language, like letters, numerals, punctuation, space, etc.
String: a sequence of characters bound together
## [1] "character"
## [1] "character"
Why do we care about strings?
Whitespaces count as characters and can be included in strings:
" " for space
"\n" for newline
"\t" for tab
## [1] "Dear Mr. Carnegie,\n\nThanks for the great school!\n\nSincerely, Ryan"
Use cat() to print strings to the console, displaying whitespaces properly
## Dear Mr. Carnegie,
##
## Thanks for the great school!
##
## Sincerely, Ryan
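A minimal sketch of the difference between printing a string and passing it to cat(). The variable name my.letter is an assumption made for illustration; the string itself is the one shown in the output above.

```r
# my.letter is a hypothetical name; the string matches the output shown above
my.letter = "Dear Mr. Carnegie,\n\nThanks for the great school!\n\nSincerely, Ryan"
my.letter        # Printing shows the escape sequences literally
cat(my.letter)   # cat() renders each "\n" as an actual line break
nchar("\n")      # A newline is one character, even though we type two symbols
```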
The character is a basic data type in R (like numeric, or logical), so we can make vectors or matrices out of them, just like we would with numbers
str.vec = c("Statistical", "Computing", "isn't that bad") # Collect 3 strings
str.vec # All elements of the vector
## [1] "Statistical" "Computing" "isn't that bad"
## [1] "isn't that bad"
## [1] "isn't that bad"
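The two single-element outputs above are consistent with ordinary vector indexing. The calls below are a guess at what produced them, not necessarily the original code.

```r
str.vec = c("Statistical", "Computing", "isn't that bad")
str.vec[3]        # Just the 3rd element
str.vec[-(1:2)]   # All but the 1st and 2nd elements
```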
str.mat = matrix("", 2, 3) # Build an empty 2 x 3 matrix
str.mat[1,] = str.vec # Fill the 1st row with str.vec
str.mat[2,1:2] = str.vec[1:2] # Fill the 2nd row, only entries 1 and 2, with
# those of str.vec
str.mat[2,3] = "isn't a fad" # Fill the 2nd row, 3rd entry, with a new string
str.mat # All elements of the matrix
## [,1] [,2] [,3]
## [1,] "Statistical" "Computing" "isn't that bad"
## [2,] "Statistical" "Computing" "isn't a fad"
## [,1] [,2]
## [1,] "Statistical" "Statistical"
## [2,] "Computing" "Computing"
## [3,] "isn't that bad" "isn't a fad"
Easy! Make things into strings with as.character()
## [1] "0.8"
## [1] "8e+09"
## [1] "1" "2" "3" "4" "5"
## [1] "TRUE"
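A short sketch of calls that would give outputs like those above. The specific inputs are assumptions chosen to match the printed results.

```r
as.character(0.8)    # A number becomes the string "0.8"
as.character(8e9)    # Large numbers may come back in scientific notation
as.character(1:5)    # Vectorized: one string per element
as.character(TRUE)   # Logical values convert too
```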
Going the other way, from strings to numbers or logicals, is not as easy! It depends on the given string, of course
## [1] 0.5
## [1] 0.5
## [1] 5e-11
## Warning: NAs introduced by coercion
## [1] NA
## [1] TRUE
## [1] NA
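A sketch of conversions in the other direction, using as.numeric() and as.logical(). The input strings are assumptions picked to reproduce the kinds of results shown above.

```r
as.numeric("0.5")       # A clean numeric string converts fine
as.numeric("0.5 ")      # Trailing whitespace is tolerated
as.numeric("0.5e-10")   # Scientific notation is understood
as.numeric("Hi!")       # Not a number: gives NA, with a warning
as.logical("True")      # A recognized spelling of TRUE
as.logical("TRU")       # Not a recognized spelling: gives NA
```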
Use nchar() to count the number of characters in a string
## [1] 6
## [1] 11
## [1] 1
## [1] 2
## [1] 6 11
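A small sketch contrasting nchar() with length(). The example strings are hypothetical, chosen so that the counts match the outputs above.

```r
nchar("coffee")                      # 6
nchar("code monkey")                 # 11, since the space counts as a character
length("code monkey")                # 1: a single string is a vector of length 1
length(c("coffee", "code monkey"))   # 2: two strings
nchar(c("coffee", "code monkey"))    # 6 11: nchar() works element by element
```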
Substrings, splitting and combining strings
Use substr() to grab a subsequence of characters from a string, called a substring
## [1] "Give"
## [1] "break"
## [1] ""
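A sketch of substr() calls that would give the three outputs above. The phrase is taken from a later output in these notes; the exact start and stop arguments are assumptions.

```r
phrase = "Give me a break"
substr(phrase, 1, 4)                                # First 4 characters
substr(phrase, nchar(phrase)-4, nchar(phrase))      # Last 5 characters
substr(phrase, nchar(phrase)+1, nchar(phrase)+10)   # Past the end: empty string
```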
substr() vectorizes
Just like nchar(), and many other string functions
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
substr(presidents, 1, 2) # Grab the first 2 letters from each
## [1] "Cl" "Bu" "Re" "Ca" "Fo"
## [1] "C" "u" "a" "t" ""
## [1] "C" "Bu" "Rea" "Cart" "Ford"
## [1] "on" "sh" "an" "er" "rd"
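The last three outputs above can be reproduced by letting the start and stop arguments be vectors as well. These calls are a reconstruction and may differ from the original code.

```r
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
substr(presidents, 1:5, 1:5)   # 1st letter of the 1st name, 2nd of the 2nd, etc.
substr(presidents, 1, 1:5)     # First letter of the 1st, first 2 of the 2nd, etc.
substr(presidents, nchar(presidents)-1, nchar(presidents))   # Last 2 letters of each
```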
Can also use substr() to replace a character, or a substring
## [1] "Give me a break"
## [1] "Live me a break"
## [1] "Live me a break"
## [1] "Show me a break"
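A sketch of substr() used on the left-hand side of an assignment, in a sequence that would give the four outputs above. The out-of-range positions in the second replacement are an assumption.

```r
phrase = "Give me a break"
phrase
substr(phrase, 1, 1) = "L"         # Replace the 1st character
phrase
substr(phrase, 1000, 1001) = "R"   # Positions past the end: nothing happens
phrase
substr(phrase, 1, 4) = "Show"      # Replace the first 4 characters
phrase
```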
Use the strsplit() function to split based on a keyword
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.obj = strsplit(ingredients, split=",")
split.obj
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
## [1] "list"
## [1] 1
Note that the output is actually a list! (With just one element, which is a vector of strings)
strsplit() vectorizes
Just like nchar(), substr(), and the many others
great.profs = "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
favorite.cats = "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
split.list
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
##
## [[2]]
## [1] "Nugent" " Genovese" " Greenhouse" " Seltman" " Shalizi" " Ventura"
##
## [[3]]
## [1] "tiger" " leopard" " jaguar" " lion"
See why strsplit() needs to return a list now? Each input string can split into a different number of pieces (here 5, 6, and 4)
Finest splitting you can do is character-by-character: use strsplit() with split=""
## [1] "c" "h" "i" "c" "k" "p" "e" "a" "s" "," " " "t" "a" "h" "i" "n" "i" "," " " "o"
## [21] "l" "i" "v" "e" " " "o" "i" "l" "," " " "g" "a" "r" "l" "i" "c" "," " " "s" "a"
## [41] "l" "t"
## [1] 42
## [1] 42
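A sketch of character-by-character splitting on the ingredients string defined earlier. The name ingredient.chars is hypothetical.

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
ingredient.chars = strsplit(ingredients, split="")[[1]]   # hypothetical name
ingredient.chars           # One element per character, commas and spaces included
length(ingredient.chars)   # Number of pieces after splitting
nchar(ingredients)         # Same number: one piece per character
```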
Use the paste() function to join two (or more) strings into one, separated by a keyword
## [1] "Spider Man"
## [1] "Spider-Man"
## [1] "Spider, Man, does whatever"
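A sketch of paste() calls that would give the three outputs above. The sep values are inferred from those outputs.

```r
paste("Spider", "Man")                               # Default separator is a space
paste("Spider", "Man", sep="-")                      # Custom separator
paste("Spider", "Man", "does whatever", sep=", ")    # More than two strings
```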
paste() vectorizes
Just like nchar(), substr(), strsplit(), etc. Seeing a theme yet?
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
## [1] "Clinton D" "Bush R" "Reagan R" "Carter D" "Ford R"
## [1] "Clinton D" "Bush R" "Reagan D" "Carter R" "Ford D"
## [1] "Clinton (42)" "Bush (41)" "Reagan (40)" "Carter (39)" "Ford (38)"
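A sketch of vectorized paste() calls consistent with the outputs above, including the case where a shorter vector is recycled. The argument vectors are reconstructed from the printed results.

```r
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
paste(presidents, c("D", "R", "R", "D", "R"))   # Element-by-element pairing
paste(presidents, c("D", "R"))                  # Shorter vector is recycled: D, R, D, R, D
paste(presidents, " (", 42:38, ")", sep="")     # Numbers are converted to strings
```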
Can condense a vector of strings into one big string by using paste() with the collapse argument
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
## [1] "Clinton; Bush; Reagan; Carter; Ford"
## [1] "Clinton (42); Bush (41); Reagan (40); Carter (39); Ford (38)"
## [1] "Clinton (D42); Bush (R41); Reagan (R40); Carter (D39); Ford (R38)"
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
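A sketch of how sep and collapse interact: sep joins the arguments element by element, then collapse joins the resulting vector into one string. The calls are reconstructed from the outputs above.

```r
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
paste(presidents, collapse="; ")                             # One big string
paste(presidents, " (", 42:38, ")", sep="", collapse="; ")   # Join pieces, then collapse
paste(presidents, " (", c("D", "R", "R", "D", "R"), 42:38, ")",
      sep="", collapse="; ")
paste(presidents, collapse=NULL)                             # The default: no collapsing
```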
Reading in text, summarizing text
How to get text, from an external source, into R? Use the readLines() function
king.lines = readLines("https://www.stat.cmu.edu/~arinaldo/Teaching/36350/F22/data/king.txt")
class(king.lines) # We have a character vector
## [1] "character"
## [1] 59
## [1] "Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of their captivity."
## [2] ""
## [3] "But 100 years later, the Negro still is not free. One hundred years later, the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. One hundred years later, the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. One hundred years later..."
(This was Martin Luther King Jr.’s famous “I Have a Dream” speech at the March on Washington for Jobs and Freedom on August 28, 1963)
We don’t need to use the web; readLines() can be used on a local file. The following code would read in a text file from Professor Rinaldo’s computer:
This will cause an error for you, unless your folder is set up exactly like Professor Rinaldo’s laptop! So using web links is more robust
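To try readLines() on a local file without depending on anyone's folder layout, one option is to write a small temporary file first. This is only a sketch: the file contents and the names tmp.file and local.lines are made up for illustration.

```r
tmp.file = tempfile(fileext=".txt")   # hypothetical temporary file
writeLines(c("First line", "", "Third line"), tmp.file)
local.lines = readLines(tmp.file)     # One string per line of the file
local.lines
length(local.lines)                   # Empty lines are kept, as empty strings
unlink(tmp.file)                      # Clean up the temporary file
```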
Fancy word, but all it means: make one long string, then split the words
king.text = paste(king.lines, collapse=" ")
king.words = strsplit(king.text, split=" ")[[1]]
# Sanity check
substr(king.text, 1, 150)
## [1] "Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a"
## [1] "Five" "score" "years" "ago," "a"
## [6] "great" "American," "in" "whose" "symbolic"
## [11] "shadow" "we" "stand" "today," "signed"
## [16] "the" "Emancipation" "Proclamation." "This" "momentous"
Our most basic tool for summarizing text: word counts, retrieved using table()
## [1] "table"
## [1] 622
## king.words
##             - ...the  ...to   'tis    100   1963      a   able  Again
##     29      2      1      1      1      1      1     37      8      1
What did we get? Alphabetically sorted unique words, and their counts = number of appearances
Note: this is actually a vector of numbers, and the words are the names of the vector
## king.words
##             - ...the  ...to   'tis
##     29      2      1      1      1
## -
## TRUE
## [1] TRUE
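A self-contained sketch of building a word table and inspecting its names, on a made-up sentence rather than the speech. For the speech itself the analogous call would presumably be king.wordtab = table(king.words); that variable name is an assumption based on the names used later in these notes.

```r
# toy.words and toy.wordtab are hypothetical names for a small example
toy.words = strsplit("the cat and the hat and the bat", split=" ")[[1]]
toy.wordtab = table(toy.words)
class(toy.wordtab)        # "table"
length(toy.wordtab)       # Number of unique words
names(toy.wordtab)        # The unique words, sorted alphabetically
as.numeric(toy.wordtab)   # The counts alone, without names
```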
So with named indexing, we can now use this to look up whatever words we want
## dream
## 9
## Negro
## 13
## freedom
## 18
## <NA>
## NA
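A sketch of named indexing on a small made-up table; the lookups on the speech would work the same way. A word that never appears comes back as NA.

```r
# toy.wordtab is a hypothetical example table
toy.wordtab = table(c("the", "cat", "and", "the", "hat", "and", "the", "bat"))
toy.wordtab["the"]   # Appears 3 times
toy.wordtab["and"]   # Appears 2 times
toy.wordtab["dog"]   # Not in the table: NA
```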
Let’s sort in decreasing order, to get the most frequent words
## [1] 622
## king.words
##      of     the      to     and       a      be            will      is    that
##      98      97      57      40      37      32      29      25      23      23
##      as freedom      in      we    from    have     our       I   Negro     not
##      19      18      18      18      17      17      16      14      13      13
## king.words
## walk, wallow warm waters, well were When
## 1 1 1 1 1 1 1
## whirlwinds whites whose winds with. withering wrongful
## 1 1 1 1 1 1 1
## wrote yes, York York. You your
## 1 1 1 1 1 1
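A sketch of sorting a word table by count, on a small made-up table. The plotting code later in these notes uses king.wordtab.sorted, which was presumably created along the lines of sort(king.wordtab, decreasing=TRUE); that is an assumption, since the original call is not shown here.

```r
# toy.wordtab and toy.wordtab.sorted are hypothetical example names
toy.wordtab = table(c("the", "cat", "and", "the", "hat", "and", "the", "bat"))
toy.wordtab.sorted = sort(toy.wordtab, decreasing=TRUE)
head(toy.wordtab.sorted, 2)   # Most frequent words first
tail(toy.wordtab.sorted, 2)   # Least frequent words last
```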
Notice that punctuation matters, e.g., “York” and “York.” are treated as separate words. This is not ideal; we’ll learn a little about how to fix it in lab, using regular expressions
Let’s use a plot to visualize frequencies
nw = length(king.wordtab.sorted)
plot(1:nw, as.numeric(king.wordtab.sorted), type="l",
     xlab="Rank", ylab="Frequency")
A pretty drastic looking trend! It looks as if \(\mathrm{Frequency} \propto (1/\mathrm{Rank})^a\) for some \(a>0\)
This phenomenon, that frequency tends to be inversely proportional to a power of rank, is called Zipf’s law
For our data, Zipf’s law approximately holds, with \(\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a\) for \(C=100\) and \(a=0.65\)
C = 100; a = 0.65
king.wordtab.zipf = C*(1/1:nw)^a
cbind(king.wordtab.sorted[1:8], king.wordtab.zipf[1:8])
##      [,1]      [,2]
## of     98 100.00000
## the    97  63.72803
## to     57  48.96336
## and    40  40.61262
## a      37  35.12930
## be     32  31.20338
##        29  28.22840
## will   25  25.88162
Not perfect, but not bad. We can also plot the original sorted word counts, with those estimated by our formula on top
plot(1:nw, as.numeric(king.wordtab.sorted), type="l",
     xlab="Rank", ylab="Frequency")
curve(C*(1/x)^a, from=1, to=nw, col="red", add=TRUE)
We’ll learn about plotting tools in detail a bit later
nchar(), substr(): functions for substring extractions and replacements
strsplit(), paste(): functions for splitting and combining strings
table(): function to get word counts, useful way of summarizing text data