class: center, middle, inverse, title-slide

.title[
# Reshaping
]
.author[
### Becky Tang
]
.date[
### 9/26/2022
]

---

class: center, middle

## Housekeeping

---

class: center, middle

# Reshaping

---

# Why reshape?

- Our data is not always in the format that we would like
- We want to preserve all the information, but "massage" it prior to analysis
- Move information between rows and columns

---

# Types of data frames

- Data frames are often described as **wide** or **long**
- **Wide**: when a row has more than one observation, and the units of observation (e.g., individuals, countries, households) are on one row each
- **Long**: when a row has only one observation, but the units of observation are repeated down a column

---

# Long to wide data

- This is our original data, where `country` is our unit of observation.
- Why is this considered long data?

```
##   country month avgtemp
## 1  Sweden   jan       5
## 2 Denmark   jan       6
## 3  Norway   jan       8
## 4  Sweden   feb      11
## 5 Denmark   feb       5
## 6  Norway   feb      11
## 7  Sweden march      12
## 8 Denmark march       9
## 9  Norway march       9
```

---

# Long to wide data

- We want to reshape from long to wide data, where each unit of observation (`country`) has exactly one row

.pull-left[
Long data:

```
##   country month avgtemp
## 1  Sweden   jan       5
## 2 Denmark   jan       6
## 3  Norway   jan       8
## 4  Sweden   feb      11
## 5 Denmark   feb       5
## 6  Norway   feb      11
## 7  Sweden march      12
## 8 Denmark march       9
## 9  Norway march       9
```
]

.pull-right[
Wide data:

```
## # A tibble: 3 × 4
##   country   jan   feb march
##   <fct>   <dbl> <dbl> <dbl>
## 1 Sweden      5    11    12
## 2 Denmark     6     5     9
## 3 Norway      8    11     9
```
]

---

# Long to wide data

- The `pivot_wider()` function turns data from long to wide
- Takes two main arguments:
  - `names_from`: the variable(s) in the data frame whose values become the names of the output columns
  - `values_from`: the variable(s) in the data frame to get the cell values from

---

# Long to wide data

.pull-left[
Original long data:

```
##   country month avgtemp
## 1  Sweden   jan       5
## 2 Denmark   jan       6
## 3  Norway   jan       8
## 4  Sweden   feb      11
## 5 Denmark   feb       5
## 6  Norway   feb      11
## 7  Sweden march      12
## 8 Denmark march       9
## 9  Norway march       9
```
]

.pull-right[
Desired wide data:

```
## # A tibble: 3 × 4
##   country   jan   feb march
##   <fct>   <dbl> <dbl> <dbl>
## 1 Sweden      5    11    12
## 2 Denmark     6     5     9
## 3 Norway      8    11     9
```

```r
country_long %>%
  pivot_wider(names_from = ____, values_from = _____)
```

.question[What should I pass in for `names_from` and `values_from`?]
]

---

# Long to wide data

```r
country_long %>%
* pivot_wider(names_from = month,
*             values_from = avgtemp)
```

```
## # A tibble: 3 × 4
##   country   jan   feb march
##   <fct>   <dbl> <dbl> <dbl>
## 1 Sweden      5    11    12
## 2 Denmark     6     5     9
## 3 Norway      8    11     9
```

---

# Wide to long data

- Long data structure is often preferred
- Often required for advanced statistical analysis and graphing

.pull-left[
Now this is our original data:

```
##   country jan feb march
## 1  Sweden   5  11    12
## 2 Denmark   6   5     9
## 3  Norway   8  11     9
```
]

.pull-right[
And this is what we want:

```
## # A tibble: 9 × 3
##   country month avg_temp
##   <chr>   <chr>    <dbl>
## 1 Sweden  jan          5
## 2 Sweden  feb         11
## 3 Sweden  march       12
## 4 Denmark jan          6
## 5 Denmark feb          5
## 6 Denmark march        9
## 7 Norway  jan          8
## 8 Norway  feb         11
## 9 Norway  march        9
```
]

---

# Wide to long data

- The `pivot_longer()` function can be used to go from wide to long data
- Takes three main arguments:
  - `cols`: specifies which columns in the data frame to pivot into longer format. That is, the ones that should "move"
  - `names_to`: a string (your choice) specifying the name of the new column created from the column names specified by `cols`
  - `values_to`: a string (your choice) specifying the name of the new column created from the data stored in the cell values

---

# Wide to long data

This is my original data.

```r
country_wide
```

```
##   country jan feb march
## 1  Sweden   5  11    12
## 2 Denmark   6   5     9
## 3  Norway   8  11     9
```

For each row, I want a country, the month, and the temperature within that month. What should I choose for the arguments `cols`, `names_to`, and `values_to`?

--

- `cols`: the columns `jan`, `feb`, and `march`

--

- `names_to`: anything you'd like, so long as it makes sense. I will choose "month"
- `values_to`: anything you'd like, so long as it makes sense. I will choose "avg_temp"

---

# Wide to long data

```r
country_wide %>%
* pivot_longer(cols = c(jan, feb, march),
*              names_to = "month",
*              values_to = "avg_temp")
```

```
## # A tibble: 9 × 3
##   country month avg_temp
##   <chr>   <chr>    <dbl>
## 1 Sweden  jan          5
## 2 Sweden  feb         11
## 3 Sweden  march       12
## 4 Denmark jan          6
## 5 Denmark feb          5
## 6 Denmark march        9
## 7 Norway  jan          8
## 8 Norway  feb         11
## 9 Norway  march        9
```

---

# Wide to long data

- Can specify columns by position in the `cols` argument, or exclude columns with `-`
- What you choose for `values_to` or `names_to` only changes the new column names, not the resulting structure

.pull-left[
```r
country_wide %>%
* pivot_longer(cols = 2:4,
               names_to = "month",
*              values_to = "temp")
```

```
## # A tibble: 9 × 3
##   country month  temp
##   <chr>   <chr> <dbl>
## 1 Sweden  jan       5
## 2 Sweden  feb      11
## 3 Sweden  march    12
## 4 Denmark jan       6
## 5 Denmark feb       5
## 6 Denmark march     9
## 7 Norway  jan       8
## 8 Norway  feb      11
## 9 Norway  march     9
```
]

.pull-right[
```r
country_wide %>%
* pivot_longer(cols = -country,
               names_to = "month",
               values_to = "temp")
```

```
## # A tibble: 9 × 3
##   country month  temp
##   <chr>   <chr> <dbl>
## 1 Sweden  jan       5
## 2 Sweden  feb      11
## 3 Sweden  march    12
## 4 Denmark jan       6
## 5 Denmark feb       5
## 6 Denmark march     9
## 7 Norway  jan       8
## 8 Norway  feb      11
## 9 Norway  march     9
```
]
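
---

# Wide to long and back

To tie the two functions together, here is a minimal sketch of a round trip: pivot the wide data to long, then pivot it back to wide. The slides only show `country_wide` in printed form, so the `tibble()` call that builds it here is my assumption, and `round_trip` is just a name chosen for illustration.

```r
library(tidyr)  # provides pivot_longer(), pivot_wider(), tibble(), and %>%

# Assumed construction of the example data (the slides only print it)
country_wide <- tibble(
  country = c("Sweden", "Denmark", "Norway"),
  jan     = c(5, 6, 8),
  feb     = c(11, 5, 11),
  march   = c(12, 9, 9)
)

# Wide to long: every column except `country` "moves" into month/avg_temp pairs
country_long <- country_wide %>%
  pivot_longer(cols = -country,
               names_to = "month",
               values_to = "avg_temp")

# Long to wide: month values become column names, avg_temp fills the cells
round_trip <- country_long %>%
  pivot_wider(names_from = month,
              values_from = avg_temp)

# Compare the round-tripped data with the starting point
all.equal(round_trip, country_wide)
```

If the comparison holds, no information was lost in either direction, which is the point of reshaping: the same values, arranged differently.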