Strings

class: center, middle, inverse, title-slide

.title[
# Strings
]
.author[
### Becky Tang
]
.date[
### 10/7/22
]

---

## Housekeeping

- Homework 4 due Tuesday, Oct. 11 at 11:59pm

- This is an application exercise today (the last one before the midterm!)

- You are welcome to come to any of my office hours next week:

  - Monday: 10:30am-12:00pm
  - Tuesday: 11:00am-12:00pm
  - Wednesday: 10:30am-12:00pm

- Please fill out this anonymous survey: https://forms.gle/jcdJhoFRUHcx4vsR9

---

class: middle, center

## `stringr`

---

## `stringr`

In addition to the `tidyverse`, we will use the package `stringr`.

```r
library(tidyverse)
library(stringr)
```

`stringr` provides tools to work with character strings.

- Functions in `stringr` have consistent and memorable names

- All begin with `str_` (`str_count`, `str_detect`, `str_trim`, etc.)

- All take a vector of strings as their first argument

---

## Preliminaries

Character strings in R are defined by quotation marks. Double quotes are the convention, though single quotes also work.

They can include letters, numbers, punctuation, whitespace, etc.

```r
string1 <- "MATH 118 is my favorite class."
string1
```

```
## [1] "MATH 118 is my favorite class."
```

You can combine character strings in a **vector**.

```r
string2 <- c("MATH 118", "Data Science", "Middlebury")
string2
```

```
## [1] "MATH 118"     "Data Science" "Middlebury"
```

---

## To quote or not to quote?

```r
datascience %>%
  filter(Major == "Computer Science")
```

Why does `Major` not have quotes, but `"Computer Science"` does?

- If you do not use quotes, R assumes you are referring to the name of an object

- If you use quotes, R assumes you are simply entering a character string

---

## To quote or not to quote?

```r
# This defines an object
object_name <- 2

# This is the name of an object
object_name
```

```
## [1] 2
```

```r
# This is a character string
"object_name"
```

```
## [1] "object_name"
```

---

## Include a quotation in a string?

Why doesn't the code below work?

.midi[

```r
string3 <- "I said "Hello" to my class"
```
]

To include a double quote in a string, *escape it* using a backslash `\`.

.midi[

```r
string4 <- "I said \"Hello\" to my class."
```
]

What if you want to include an actual backslash?

.midi[

```r
string5 <- "\\"
```
]

This may seem tedious but it will come up later!
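
To see what an escaped string actually contains, it can help to compare how R *prints* it with its raw contents. The sketch below reuses `string4` and `string5` from this slide; `n_chars` is a name introduced here only for illustration.

```r
library(stringr)

string4 <- "I said \"Hello\" to my class."
string5 <- "\\"

# print() shows the escaped form, backslashes included
print(string4)

# writeLines() shows the raw contents of the string
writeLines(string4)  # I said "Hello" to my class.
writeLines(string5)  # a single backslash

# The escaping backslash is not part of the string itself,
# so string5 contains exactly one character
n_chars <- str_length(string5)
n_chars
```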

---

## U.S. States

To demonstrate functions from `stringr` we will use a vector of all 50 states.

.midi[

```r
states
```

```
##  [1] "alabama"        "alaska"         "arizona"        "arkansas"      
##  [5] "california"     "colorado"       "connecticut"    "delaware"      
##  [9] "florida"        "georgia"        "hawaii"         "idaho"         
## [13] "illinois"       "indiana"        "iowa"           "kansas"        
## [17] "kentucky"       "louisiana"      "maine"          "maryland"      
## [21] "massachusetts"  "michigan"       "minnesota"      "mississippi"   
## [25] "missouri"       "montana"        "nebraska"       "nevada"        
## [29] "new hampshire"  "new jersey"     "new mexico"     "new york"      
## [33] "north carolina" "north dakota"   "ohio"           "oklahoma"      
## [37] "oregon"         "pennsylvania"   "rhode island"   "south carolina"
## [41] "south dakota"   "tennessee"      "texas"          "utah"          
## [45] "vermont"        "virginia"       "washington"     "west virginia" 
## [49] "wisconsin"      "wyoming"
```
]
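
The slides do not show how `states` was created. One plausible way, assuming it comes from base R's built-in `state.name` vector, is sketched below.

```r
library(stringr)

# Assumption: `states` is the built-in `state.name` vector converted to lower case
states <- str_to_lower(state.name)

length(states)   # number of states in the vector
head(states, 4)  # first few elements
```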

---

## `str_length`

Given a string, return the number of characters.

.midi[

```r
string1 <- "Math 118 is my favorite class."
str_length(string1)
```

```
## [1] 30
```
]

Given a vector of strings, return the number of characters in each string.

.midi[

```r
str_length(states)
```

```
##  [1]  7  6  7  8 10  8 11  8  7  7  6  5  8  7  4  6  8  9  5  8 13  8  9 11  8
## [26]  7  8  6 13 10 10  8 14 12  4  8  6 12 12 14 12  9  5  4  7  8 10 13  9  7
```
]

.pull-left[
- Alabama: 7
- Alaska: 6
- Arizona: 7
- Arkansas: 8
]
.pull-right[
- California: 10
- Colorado: 8
- Connecticut: 11
- ...
]

---

## `str_c`

Combine two or more strings.

```r
str_c("Math 118", "is", "my", "favorite", "class")
```

```
## [1] "Math 118ismyfavoriteclass"
```

Use `sep` to specify how the strings are separated.

```r
str_c("Math 118", "is", "my", "favorite", "class", sep = "-")
```

```
## [1] "Math 118-is-my-favorite-class"
```

```r
str_c("Math 118", "is", "my", "favorite", "class", sep = " ")
```

```
## [1] "Math 118 is my favorite class"
```

---

## `str_c`

Combine the strings within a character vector into a single string using `collapse`:

```r
string_vec <- c("Math", "118", "is", "my", "favorite", "class")
str_c(string_vec, collapse = "")
```

```
## [1] "Math118ismyfavoriteclass"
```

```r
str_c(string_vec, collapse = " ")
```

```
## [1] "Math 118 is my favorite class"
```

What happens if you don't use `collapse`?

```r
str_c(string_vec)
```

```
## [1] "Math"     "118"      "is"       "my"       "favorite" "class"
```
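
A small sketch of how `sep` and `collapse` differ: `sep` goes between the arguments of `str_c`, while `collapse` joins the elements of the result into one string. The vector `courses` is made up for illustration.

```r
library(stringr)

courses <- c("118", "216")  # hypothetical course numbers

# str_c() is vectorized: "Math" is recycled across both elements
labels <- str_c("Math", courses, sep = " ")
labels
# "Math 118" "Math 216"

# Adding collapse joins those results into a single string
joined <- str_c("Math", courses, sep = " ", collapse = ", ")
joined
# "Math 118, Math 216"
```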

---

## `str_to_lower` and `str_to_upper`

Convert the case of a string from lower to upper or upper to lower.

.midi[

```r
str_to_upper(states)
```

```
##  [1] "ALABAMA"        "ALASKA"         "ARIZONA"        "ARKANSAS"      
##  [5] "CALIFORNIA"     "COLORADO"       "CONNECTICUT"    "DELAWARE"      
##  [9] "FLORIDA"        "GEORGIA"        "HAWAII"         "IDAHO"         
## [13] "ILLINOIS"       "INDIANA"        "IOWA"           "KANSAS"        
## [17] "KENTUCKY"       "LOUISIANA"      "MAINE"          "MARYLAND"      
## [21] "MASSACHUSETTS"  "MICHIGAN"       "MINNESOTA"      "MISSISSIPPI"   
## [25] "MISSOURI"       "MONTANA"        "NEBRASKA"       "NEVADA"        
## [29] "NEW HAMPSHIRE"  "NEW JERSEY"     "NEW MEXICO"     "NEW YORK"      
## [33] "NORTH CAROLINA" "NORTH DAKOTA"   "OHIO"           "OKLAHOMA"      
## [37] "OREGON"         "PENNSYLVANIA"   "RHODE ISLAND"   "SOUTH CAROLINA"
## [41] "SOUTH DAKOTA"   "TENNESSEE"      "TEXAS"          "UTAH"          
## [45] "VERMONT"        "VIRGINIA"       "WASHINGTON"     "WEST VIRGINIA" 
## [49] "WISCONSIN"      "WYOMING"
```
]

---

## `str_to_title`

Convert a string to title case: the first character of each word is converted to uppercase and the remaining characters to lowercase.

```r
str_to_title("math 118 is my favorite class")
```

```
## [1] "Math 118 Is My Favorite Class"
```

---

## `str_sub`

Extract the part of each string from position `start` to position `end`, inclusive. Negative positions count from the end of the string.

.midi[

```r
str_sub(states, 1, 4)
```

```
##  [1] "alab" "alas" "ariz" "arka" "cali" "colo" "conn" "dela" "flor" "geor"
## [11] "hawa" "idah" "illi" "indi" "iowa" "kans" "kent" "loui" "main" "mary"
## [21] "mass" "mich" "minn" "miss" "miss" "mont" "nebr" "neva" "new " "new "
## [31] "new " "new " "nort" "nort" "ohio" "okla" "oreg" "penn" "rhod" "sout"
## [41] "sout" "tenn" "texa" "utah" "verm" "virg" "wash" "west" "wisc" "wyom"
```
]

.midi[

```r
str_sub(states, -4, -1)
```

```
##  [1] "bama" "aska" "zona" "nsas" "rnia" "rado" "icut" "ware" "rida" "rgia"
## [11] "waii" "daho" "nois" "iana" "iowa" "nsas" "ucky" "iana" "aine" "land"
## [21] "etts" "igan" "sota" "ippi" "ouri" "tana" "aska" "vada" "hire" "rsey"
## [31] "xico" "york" "lina" "kota" "ohio" "homa" "egon" "ania" "land" "lina"
## [41] "kota" "ssee" "exas" "utah" "mont" "inia" "gton" "inia" "nsin" "ming"
```
]

---

## `str_sub` and `str_to_upper`

We can combine `str_sub` and `str_to_upper` to capitalize the first letter of each state.

.midi[

```r
str_sub(states, 1, 1) <- str_to_upper(str_sub(states, 1, 1))
```
]

.question[What is this code doing?]

```r
states
```

```
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New hampshire"  "New jersey"     "New mexico"     "New york"      
## [33] "North carolina" "North dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode island"   "South carolina"
## [41] "South dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West virginia" 
## [49] "Wisconsin"      "Wyoming"
```
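
Note that only the first character of each string changes, so two-word names such as "New hampshire" keep a lower-case second word. A minimal sketch comparing this approach with `str_to_title`, using a small made-up vector rather than the full `states` vector:

```r
library(stringr)

two_word_states <- c("new hampshire", "west virginia")

# Capitalizing only the first character leaves the second word in lower case
first_only <- two_word_states
str_sub(first_only, 1, 1) <- str_to_upper(str_sub(first_only, 1, 1))
first_only
# "New hampshire" "West virginia"

# str_to_title() capitalizes the first letter of every word
every_word <- str_to_title(two_word_states)
every_word
# "New Hampshire" "West Virginia"
```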

---

## `str_sort`

.question[What do we think this code is doing?]

.midi[

```r
str_sort(states, decreasing = TRUE)
```
]

Sort a vector of strings, in decreasing alphabetical order.

```
##  [1] "Wyoming"        "Wisconsin"      "West virginia"  "Washington"    
##  [5] "Virginia"       "Vermont"        "Utah"           "Texas"         
##  [9] "Tennessee"      "South dakota"   "South carolina" "Rhode island"  
## [13] "Pennsylvania"   "Oregon"         "Oklahoma"       "Ohio"          
## [17] "North dakota"   "North carolina" "New york"       "New mexico"    
## [21] "New jersey"     "New hampshire"  "Nevada"         "Nebraska"      
## [25] "Montana"        "Missouri"       "Mississippi"    "Minnesota"     
## [29] "Michigan"       "Massachusetts"  "Maryland"       "Maine"         
## [33] "Louisiana"      "Kentucky"       "Kansas"         "Iowa"          
## [37] "Indiana"        "Illinois"       "Idaho"          "Hawaii"        
## [41] "Georgia"        "Florida"        "Delaware"       "Connecticut"   
## [45] "Colorado"       "California"     "Arkansas"       "Arizona"       
## [49] "Alaska"         "Alabama"
```

---

## Regular Expressions

A .vocab[regular expression] is a sequence of characters that allows you to 
describe string patterns. We use them to search for patterns.

- extract a phone number from text data
- determine if an email address is valid
- determine if a password has the required number of letters, digits, and symbols
- count the number of times "statistics" occurs in a corpus of text
- ...
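
As a rough sketch of the last bullet, the code below counts occurrences of "statistics" with `str_count`. The two-sentence `corpus` is made up for illustration. Matching is case-sensitive, so the text is converted to lower case first to catch "Statistics" as well.

```r
library(stringr)

# Hypothetical "corpus", made up for illustration
corpus <- c("I like statistics.",
            "Statistics is fun, and statistics is useful.")

# Case-sensitive count: "Statistics" with a capital S is not matched
exact_counts <- str_count(corpus, "statistics")
exact_counts
# 1 1

# Converting to lower case first counts both spellings
all_counts <- str_count(str_to_lower(corpus), "statistics")
all_counts
# 1 2

total <- sum(all_counts)
total
# 3
```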

---

## Regular Expressions

To demonstrate, we will use a vector containing Vermont and the states that 
border it.

```r
vt_neighbors <- c("Vermont", "New York", "Massachusetts", "New Hampshire")
vt_neighbors
```

```
## [1] "Vermont"       "New York"      "Massachusetts" "New Hampshire"
```

---

## Basic Match

We can match exactly.

```r
str_view_all(vt_neighbors, "ew")
```

- Vermont
- N**ew** York
- Massachusetts
- N**ew** Hampshire

---

## Basic Match

Match any character using `.`

```r
str_view_all(vt_neighbors, "e.")
```

- V**er**mont
- N**ew** York
- Massachus**et**ts
- N**ew** Hampshire

---

## Extract matches

Pull the first match (if it exists) from each element of the character vector.

```r
str_extract(vt_neighbors, "New")
```

```
## [1] NA    "New" NA    "New"
```

---

## Anchors

Match the start of a string using `^`

```r
str_view_all(vt_neighbors, "^V")
```

- **V**ermont
- New York
- Massachusetts
- New Hampshire

---

## Anchors

Match the end of a string using `$`

```r
str_view_all(vt_neighbors, "s$")
```

- Vermont
- New York
- Massachusett**s**
- New Hampshire

---

## `str_detect`

Determine whether each element of a character vector matches a pattern.

```r
vt_neighbors
```

```
## [1] "Vermont"       "New York"      "Massachusetts" "New Hampshire"
```

```r
str_detect(vt_neighbors, "a")
```

```
## [1] FALSE FALSE  TRUE  TRUE
```
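
Because `str_detect` returns a logical vector, it can be combined with `sum` and `mean` to count matches. A minimal sketch using the `^` anchor from the earlier slide (the variable names are introduced here for illustration):

```r
library(stringr)

vt_neighbors <- c("Vermont", "New York", "Massachusetts", "New Hampshire")

# TRUE for each element that starts with "New"
starts_with_new <- str_detect(vt_neighbors, "^New")
starts_with_new
# FALSE TRUE FALSE TRUE

# TRUE counts as 1, so sum() gives the number of matching elements
n_new <- sum(starts_with_new)
n_new
# 2

# mean() gives the proportion of matching elements
prop_new <- mean(starts_with_new)
prop_new
# 0.5
```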

---

## `str_subset`

Select elements from the character vector that match a pattern.

```r
str_subset(vt_neighbors, "e$")
```

```
## [1] "New Hampshire"
```

---

## `str_count`

How many matches are there in a string?

```r
vt_neighbors
```

```
## [1] "Vermont"       "New York"      "Massachusetts" "New Hampshire"
```

```r
str_count(vt_neighbors, "a")
```

```
## [1] 0 0 2 1
```

---

## `str_replace`

Replace the first match in each string with a new string.

```r
str_replace(vt_neighbors, "s", "-")
```

```
## [1] "Vermont"       "New York"      "Ma-sachusetts" "New Hamp-hire"
```

---

## `str_replace_all`

Replace all matches in each string with a new string.

```r
str_replace_all(vt_neighbors, "s", "-")
```

```
## [1] "Vermont"       "New York"      "Ma--achu-ett-" "New Hamp-hire"
```

Don't forget to save results:

```r
vt_neighbors
```

```
## [1] "Vermont"       "New York"      "Massachusetts" "New Hampshire"
```

```r
vt_neighbors_replaced <- str_replace_all(vt_neighbors, "s", "-")
vt_neighbors_replaced
```

```
## [1] "Vermont"       "New York"      "Ma--achu-ett-" "New Hamp-hire"
```

---

## Many Matches

The regular expressions below match a whole class of characters rather than one specific character.

- Match any single digit using `\d` or `[[:digit:]]`
- Match one or more consecutive digits using `\d+` or `[[:digit:]]+`
- Match any whitespace using `\s` or `[[:space:]]`
- Match f, g, or h using `[fgh]` 
- Match anything but f, g, or h using `[^fgh]`
- Match lower-case letters using `[a-z]` or `[[:lower:]]`
- Match upper-case letters using `[A-Z]` or `[[:upper:]]`
- Match alphabetic characters using `[A-Za-z]` or `[[:alpha:]]`

Remember that these regular expressions are written as R strings! To match digits
you'll need to *escape* the backslash, so use `"\\d"`, not `"\d"`
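
A short sketch of a few of these patterns in use, with the backslashes doubled as described above. The vector `x` is made up for illustration.

```r
library(stringr)

x <- c("MATH 118", "Room 204B", "no digits here")  # hypothetical strings

# First run of one or more digits in each string (NA if there is none)
first_number <- str_extract(x, "\\d+")
first_number
# "118" "204" NA

# Does the string contain any digit at all?
has_digit <- str_detect(x, "[[:digit:]]")
has_digit
# TRUE TRUE FALSE

# Replace every whitespace character with an underscore
no_spaces <- str_replace_all(x, "\\s", "_")
no_spaces
# "MATH_118" "Room_204B" "no_digits_here"

# Count the upper-case letters in each string
n_upper <- str_count(x, "[A-Z]")
n_upper
# 4 2 0
```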

---

## Working within pipeline

```r
vt_df %>%
  mutate(state_code = str_to_lower(state_code))
```

```
##           state state_code
## 1       Vermont         vt
## 2      New York         ny
## 3 Massachusetts         ma
## 4 New Hampshire         nh
```

---

## Working within pipeline

```r
vt_df %>%
  mutate(state = str_replace_all(state, "[ea]", "-"))
```

```
##           state state_code
## 1       V-rmont         VT
## 2      N-w York         NY
## 3 M-ss-chus-tts         MA
## 4 N-w H-mpshir-         NH
```
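
The slides do not show how `vt_df` was built. The sketch below assumes a data frame with the two columns that appear in the printed output, then uses `str_detect` inside `filter` to keep only the rows whose state name starts with "New".

```r
library(dplyr)
library(stringr)

# Assumption: vt_df has these two columns, as the printed output suggests
vt_df <- data.frame(
  state = c("Vermont", "New York", "Massachusetts", "New Hampshire"),
  state_code = c("VT", "NY", "MA", "NH")
)

# Keep only the rows where `state` begins with "New"
new_states <- vt_df %>%
  filter(str_detect(state, "^New"))
new_states
```

Any `stringr` function that returns one value per string can be used this way inside `mutate` or `filter`.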

---

## Additional resources

- `stringr` website: https://stringr.tidyverse.org/
- `stringr` [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf)
- Regular Expressions [Cheat Sheet](https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf)
- [Chapter 14: Strings](https://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions) in R for Data Science