class: center, middle, inverse, title-slide

.title[
# Strings
]
.author[
### Becky Tang
]
.date[
### 10/7/22
]

---

## Housekeeping

- Homework 4 due Tuesday, Oct. 11 at 11:59pm

- This is an application exercise today (the last one before the midterm!)

- You are welcome to come to any of my office hours next week:
  - Monday: 10:30am-12:00pm
  - Tuesday: 11:00am-12:00pm
  - Wednesday: 10:30am-12:00pm

- Please fill out this anonymous survey: https://forms.gle/jcdJhoFRUHcx4vsR9

---

class: middle, center

## `stringr`

---

## `stringr`

In addition to the `tidyverse`, we will use the package `stringr`.

```r
library(tidyverse)
library(stringr)
```

`stringr` provides tools to work with character strings.

- Functions in `stringr` have consistent and memorable names
  - All begin with `str_` (`str_count`, `str_detect`, `str_trim`, etc)
  - All take a vector of strings as their first argument

---

## Preliminaries

Character strings in R are defined by double quotation marks. They can include letters, numbers, punctuation, whitespace, etc.

```r
string1 <- "MATH 118 is my favorite class."
string1
```

```
## [1] "MATH 118 is my favorite class."
```

You can combine character strings in a **vector**.

```r
string2 <- c("MATH 118", "Data Science", "Middlebury")
string2
```

```
## [1] "MATH 118"     "Data Science" "Middlebury"
```

---

## To quote or not to quote?

```r
datascience %>%
  filter(Major == "Computer Science")
```

Why does `Major` not have quotes, but `"Computer Science"` does?

--

- If you do not use quotes, R assumes you are referring to the name of an object

- If you use quotes, R assumes you are simply entering a character string

---

## To quote or not to quote?

```r
# This defines an object
object_name <- 2

# This is the name of an object
object_name
```

```
## [1] 2
```

```r
# This is a character string
"object_name"
```

```
## [1] "object_name"
```

---

## Include a quotation in a string?

Why doesn't the code below work?
.midi[
```r
string3 <- "I said "Hello" to my class"
```
]

--

To include a double quote in a string, *escape it* using a backslash `\`.

--

.midi[
```r
string4 <- "I said \"Hello\" to my class."
```
]

--

What if you want to include an actual backslash?

--

.midi[
```r
string5 <- "\\"
```
]

This may seem tedious but it will come up later!

---

## U.S. States

To demonstrate functions from `stringr` we will use a vector of all 50 states.

.midi[
```r
states
```

```
##  [1] "alabama"        "alaska"         "arizona"        "arkansas"
##  [5] "california"     "colorado"       "connecticut"    "delaware"
##  [9] "florida"        "georgia"        "hawaii"         "idaho"
## [13] "illinois"       "indiana"        "iowa"           "kansas"
## [17] "kentucky"       "louisiana"      "maine"          "maryland"
## [21] "massachusetts"  "michigan"       "minnesota"      "mississippi"
## [25] "missouri"       "montana"        "nebraska"       "nevada"
## [29] "new hampshire"  "new jersey"     "new mexico"     "new york"
## [33] "north carolina" "north dakota"   "ohio"           "oklahoma"
## [37] "oregon"         "pennsylvania"   "rhode island"   "south carolina"
## [41] "south dakota"   "tennessee"      "texas"          "utah"
## [45] "vermont"        "virginia"       "washington"     "west virginia"
## [49] "wisconsin"      "wyoming"
```
]

---

## `str_length`

Given a string, return the number of characters.

.midi[
```r
string1 <- "Math 118 is my favorite class."
str_length(string1)
```

```
## [1] 30
```
]

Given a vector of strings, return the number of characters in each string.

.midi[
```r
str_length(states)
```

```
##  [1]  7  6  7  8 10  8 11  8  7  7  6  5  8  7  4  6  8  9  5  8 13  8  9 11  8
## [26]  7  8  6 13 10 10  8 14 12  4  8  6 12 12 14 12  9  5  4  7  8 10 13  9  7
```
]

--

.pull-left[
- Alabama: 7
- Alaska: 6
- Arizona: 7
- Arkansas: 8
]

.pull-right[
- California: 10
- Colorado: 8
- Connecticut: 11
- ...
]

---

## `str_c`

Combine two or more strings.

```r
str_c("Math 118", "is", "my", "favorite", "class")
```

```
## [1] "Math 118ismyfavoriteclass"
```

--

Use `sep` to specify how the strings are separated.
```r
str_c("Math 118", "is", "my", "favorite", "class", sep = "-")
```

```
## [1] "Math 118-is-my-favorite-class"
```

```r
str_c("Math 118", "is", "my", "favorite", "class", sep = " ")
```

```
## [1] "Math 118 is my favorite class"
```

---

## `str_c`

Combine the strings within a character vector into a single string using `collapse`:

```r
string_vec <- c("Math", "118", "is", "my", "favorite", "class")
str_c(string_vec, collapse = "")
```

```
## [1] "Math118ismyfavoriteclass"
```

```r
str_c(string_vec, collapse = " ")
```

```
## [1] "Math 118 is my favorite class"
```

--

What happens if you don't use `collapse`?

```r
str_c(string_vec)
```

```
## [1] "Math"     "118"      "is"       "my"       "favorite" "class"
```

---

## `str_to_lower` and `str_to_upper`

Convert the case of a string from lower to upper or upper to lower.

.midi[
```r
str_to_upper(states)
```

```
##  [1] "ALABAMA"        "ALASKA"         "ARIZONA"        "ARKANSAS"
##  [5] "CALIFORNIA"     "COLORADO"       "CONNECTICUT"    "DELAWARE"
##  [9] "FLORIDA"        "GEORGIA"        "HAWAII"         "IDAHO"
## [13] "ILLINOIS"       "INDIANA"        "IOWA"           "KANSAS"
## [17] "KENTUCKY"       "LOUISIANA"      "MAINE"          "MARYLAND"
## [21] "MASSACHUSETTS"  "MICHIGAN"       "MINNESOTA"      "MISSISSIPPI"
## [25] "MISSOURI"       "MONTANA"        "NEBRASKA"       "NEVADA"
## [29] "NEW HAMPSHIRE"  "NEW JERSEY"     "NEW MEXICO"     "NEW YORK"
## [33] "NORTH CAROLINA" "NORTH DAKOTA"   "OHIO"           "OKLAHOMA"
## [37] "OREGON"         "PENNSYLVANIA"   "RHODE ISLAND"   "SOUTH CAROLINA"
## [41] "SOUTH DAKOTA"   "TENNESSEE"      "TEXAS"          "UTAH"
## [45] "VERMONT"        "VIRGINIA"       "WASHINGTON"     "WEST VIRGINIA"
## [49] "WISCONSIN"      "WYOMING"
```
]

---

## `str_to_title`

Convert a string to title case: the first character of each word becomes uppercase and the remaining characters become lowercase.

```r
str_to_title("math 118 is my favorite class")
```

```
## [1] "Math 118 Is My Favorite Class"
```

---

## `str_sub`

Extract subsets (parts) of a string from `start` to `end`, inclusive.
.midi[
```r
str_sub(states, 1, 4)
```

```
##  [1] "alab" "alas" "ariz" "arka" "cali" "colo" "conn" "dela" "flor" "geor"
## [11] "hawa" "idah" "illi" "indi" "iowa" "kans" "kent" "loui" "main" "mary"
## [21] "mass" "mich" "minn" "miss" "miss" "mont" "nebr" "neva" "new " "new "
## [31] "new " "new " "nort" "nort" "ohio" "okla" "oreg" "penn" "rhod" "sout"
## [41] "sout" "tenn" "texa" "utah" "verm" "virg" "wash" "west" "wisc" "wyom"
```
]

--

.midi[
```r
str_sub(states, -4, -1)
```

```
##  [1] "bama" "aska" "zona" "nsas" "rnia" "rado" "icut" "ware" "rida" "rgia"
## [11] "waii" "daho" "nois" "iana" "iowa" "nsas" "ucky" "iana" "aine" "land"
## [21] "etts" "igan" "sota" "ippi" "ouri" "tana" "aska" "vada" "hire" "rsey"
## [31] "xico" "york" "lina" "kota" "ohio" "homa" "egon" "ania" "land" "lina"
## [41] "kota" "ssee" "exas" "utah" "mont" "inia" "gton" "inia" "nsin" "ming"
```
]

---

## `str_sub` and `str_to_upper`

We can combine `str_sub` and `str_to_upper` to capitalize the first letter of each state.

.midi[
```r
str_sub(states, 1, 1) <- str_to_upper(str_sub(states, 1, 1))
```
]

.question[What is this code doing?]

--

```r
states
```

```
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
## [29] "New hampshire"  "New jersey"     "New mexico"     "New york"
## [33] "North carolina" "North dakota"   "Ohio"           "Oklahoma"
## [37] "Oregon"         "Pennsylvania"   "Rhode island"   "South carolina"
## [41] "South dakota"   "Tennessee"      "Texas"          "Utah"
## [45] "Vermont"        "Virginia"       "Washington"     "West virginia"
## [49] "Wisconsin"      "Wyoming"
```

---

## `str_sort`

.question[What do we think this code is doing?]

.midi[
```r
str_sort(states, decreasing = TRUE)
```
]

--

Sort a vector of strings in decreasing alphabetical order.
```
##  [1] "Wyoming"        "Wisconsin"      "West virginia"  "Washington"
##  [5] "Virginia"       "Vermont"        "Utah"           "Texas"
##  [9] "Tennessee"      "South dakota"   "South carolina" "Rhode island"
## [13] "Pennsylvania"   "Oregon"         "Oklahoma"       "Ohio"
## [17] "North dakota"   "North carolina" "New york"       "New mexico"
## [21] "New jersey"     "New hampshire"  "Nevada"         "Nebraska"
## [25] "Montana"        "Missouri"       "Mississippi"    "Minnesota"
## [29] "Michigan"       "Massachusetts"  "Maryland"       "Maine"
## [33] "Louisiana"      "Kentucky"       "Kansas"         "Iowa"
## [37] "Indiana"        "Illinois"       "Idaho"          "Hawaii"
## [41] "Georgia"        "Florida"        "Delaware"       "Connecticut"
## [45] "Colorado"       "California"     "Arkansas"       "Arizona"
## [49] "Alaska"         "Alabama"
```

---

## Regular Expressions

A .vocab[regular expression] is a sequence of characters that allows you to describe string patterns. We use them to search for patterns.

- extract a phone number from text data
- determine if an email address is valid
- determine if a password has the required number of letters, characters, and symbols
- count the number of times "statistics" occurs in a corpus of text
- ...

---

## Regular Expressions

To demonstrate, we will use a vector containing Vermont and the states that border it.

```r
vt_neighbors <- c("Vermont", "New York", "Massachusetts", "New Hampshire")
vt_neighbors
```

```
## [1] "Vermont"       "New York"      "Massachusetts" "New Hampshire"
```

---

## Basic Match

We can match exactly.

```r
str_view_all(vt_neighbors, "ew")
```
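
`str_view_all` shows its highlighted matches in the Viewer rather than the console. As a rough console sketch of the same exact match (not part of the original exercise; `ew_matches` is just an illustrative name, and `vt_neighbors` is repeated so the chunk runs on its own), `str_extract_all` lists every match found in each string:

```r
library(stringr)

# Same vector as on the previous slide, repeated so this chunk is self-contained
vt_neighbors <- c("Vermont", "New York", "Massachusetts", "New Hampshire")

# One list element per state; character(0) means "ew" does not occur in that state
ew_matches <- str_extract_all(vt_neighbors, "ew")
ew_matches
```

If you are on `stringr` 1.5.0 or later, `str_view()` highlights all matches and `str_view_all()` is deprecated.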

---

## Basic Match

Match any character using `.`

```r
str_view_all(vt_neighbors, "e.")
```
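
To see in the console what `"e."` picks up, here is a small sketch (the name `e_dot_matches` is only for illustration). Each match is an `e` plus whatever single character follows it:

```r
library(stringr)

# Repeated so this chunk is self-contained
vt_neighbors <- c("Vermont", "New York", "Massachusetts", "New Hampshire")

# "." stands for any one character, so each match is exactly two characters long.
# The final "e" in "New Hampshire" is not matched: nothing follows it.
e_dot_matches <- str_extract_all(vt_neighbors, "e.")
e_dot_matches
```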

---

## Extract matches

Pull the first match (if it exists) from each element of the character vector; elements without a match return `NA`.

```r
str_extract(vt_neighbors, "New")
```

```
## [1] NA    "New" NA    "New"
```

---

## Anchors

Match the start of a string using `^`

```r
str_view_all(vt_neighbors, "^V")
```
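
A quick console check of the `^` anchor, sketched with `str_extract` (the names `starts_with_V` and `starts_with_v` are illustrative only). Matching is case-sensitive, so a lowercase `v` finds nothing:

```r
library(stringr)

# Repeated so this chunk is self-contained
vt_neighbors <- c("Vermont", "New York", "Massachusetts", "New Hampshire")

# Only strings whose first character is an uppercase V produce a match
starts_with_V <- str_extract(vt_neighbors, "^V")
starts_with_V

# Lowercase v at the start of a string: no state qualifies
starts_with_v <- str_extract(vt_neighbors, "^v")
starts_with_v
```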

---

## Anchors

Match the end of a string using `$`

```r
str_view_all(vt_neighbors, "s$")
```
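
To see what the `$` anchor changes, this sketch compares an unanchored `"s"` with `"s$"` (the names `any_s` and `ends_with_s` are illustrative only):

```r
library(stringr)

# Repeated so this chunk is self-contained
vt_neighbors <- c("Vermont", "New York", "Massachusetts", "New Hampshire")

# Without an anchor, the first "s" anywhere in the string counts
any_s <- str_extract(vt_neighbors, "s")
any_s

# With "$", the "s" must be the last character of the string
ends_with_s <- str_extract(vt_neighbors, "s$")
ends_with_s
```

"New Hampshire" contains an `s` but does not end in one, so it matches only the unanchored pattern.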

---

## `str_detect`

Determine whether each element of a character vector matches a pattern.

```r
vt_neighbors
```

```
## [1] "Vermont"       "New York"      "Massachusetts" "New Hampshire"
```

```r
str_detect(vt_neighbors, "a")
```

```
## [1] FALSE FALSE  TRUE  TRUE
```

---

## `str_subset`

Select elements from the character vector that match a pattern.

```r
str_subset(vt_neighbors, "e$")
```

```
## [1] "New Hampshire"
```

---

## `str_count`

How many matches are there in a string?

```r
vt_neighbors
```

```
## [1] "Vermont"       "New York"      "Massachusetts" "New Hampshire"
```

```r
str_count(vt_neighbors, "a")
```

```
## [1] 0 0 2 1
```

---

## `str_replace`

Replace the first match in each string with a new string.

```r
str_replace(vt_neighbors, "s", "-")
```

```
## [1] "Vermont"       "New York"      "Ma-sachusetts" "New Hamp-hire"
```

---

## `str_replace_all`

Replace all matches in each string with a new string.

```r
str_replace_all(vt_neighbors, "s", "-")
```

```
## [1] "Vermont"       "New York"      "Ma--achu-ett-" "New Hamp-hire"
```

--

Don't forget to save results:

```r
vt_neighbors
```

```
## [1] "Vermont"       "New York"      "Massachusetts" "New Hampshire"
```

```r
vt_neighbors_replaced <- str_replace_all(vt_neighbors, "s", "-")
vt_neighbors_replaced
```

```
## [1] "Vermont"       "New York"      "Ma--achu-ett-" "New Hamp-hire"
```

---

## Many Matches

Each regular expression below can match more than one possible character.

- Match any single digit using `\d` or `[[:digit:]]`
- Match one or more consecutive digits using `\d+` or `[[:digit:]]+`
- Match any whitespace character using `\s` or `[[:space:]]`
- Match f, g, or h using `[fgh]`
- Match anything but f, g, or h using `[^fgh]`
- Match lower-case letters using `[a-z]` or `[[:lower:]]`
- Match upper-case letters using `[A-Z]` or `[[:upper:]]`
- Match alphabetic characters using `[A-Za-z]` or `[[:alpha:]]`

Remember these are regular expressions!
To match digits in R, you'll need to *escape* the backslash itself inside the string, so use `"\\d"`, not `"\d"`.

---

## Working within a pipeline

```r
vt_df %>%
  mutate(state_code = str_to_lower(state_code))
```

```
##           state state_code
## 1       Vermont         vt
## 2      New York         ny
## 3 Massachusetts         ma
## 4 New Hampshire         nh
```

---

## Working within a pipeline

```r
vt_df %>%
  mutate(state = str_replace_all(state, "[ea]", "-"))
```

```
##           state state_code
## 1       V-rmont         VT
## 2      N-w York         NY
## 3 M-ss-chus-tts         MA
## 4 N-w H-mpshir-         NH
```

---

## Additional resources

- `stringr` website: https://stringr.tidyverse.org/
- `stringr` [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf)
- Regular Expressions [Cheat Sheet](https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf)
- [Chapter 14: Strings](https://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions) in R for Data Science