class: center, middle, inverse, title-slide .title[ # Spatial data & visualization ] .author[ ### Becky Tang ] .date[ ### 12/5/2022 ] --- ## Housekeeping - You can sign up to meet with me about the project via Calendly, or stop by during office hours! - If you want to try the code you see during this lecture, as well as live code with me, please go ahead and clone the `appex-18` github repository - The code for lecture is in the `spatial-live-code.Rmd` file --- ## Final project reminders - Due Thursday at 11:59pm: - PDF of final report to Canvas - PDF of slides to Canvas (5 min. presentation) - Data dictionary in README - Everything pushed to Github (including slides) - For your final report, please set the argument `echo = F` in your code chunks such that when you knit, you only see the output of the code by not the code itself. ````markdown ```{r echo = F} # your code here ``` ```` --- class: center, middle # Intro to spatial visualization --- ## Spatial data is important - exploratory data analysis - detecting spatial patterns and trends - understanding spatial data relationships - analysis of spatial data should reflect spatial structure --- ## Context - `R` is great for interactive data visualizations, but it is not the gold standard for spatial data viz - GIS is more commonly used for intensive spatial mapping (think layers) - Packages such as `sf` and `rgdal` make it easier to create beautiful maps in `R`, but are not the most friendly for beginners --- ## 1854 London Cholera Outbreak <img src="img/24/cholera.png" width="50%" style="display: block; margin: auto;" /> --- ## Spatial data - While it's often easier for us to understand longitude-latitude, this system is not used in real world mapping - Instead, professionals use *vector* data that are encoded using "simple features" - The `sf` package developed by Edzer Pebesma provides an excellent toolset for working with such data --- ## Spatial data is different Our typical tidy data frame: .midi[ ```r nycflights13::flights ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA ## 2 2013 1 1 533 529 4 850 830 20 UA ## 3 2013 1 1 542 540 2 923 850 33 AA ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 ## 5 2013 1 1 554 600 -6 812 837 -25 DL ## 6 2013 1 1 554 558 -4 740 728 12 UA ## 7 2013 1 1 555 600 -5 913 854 19 B6 ## 8 2013 1 1 557 600 -3 709 723 -14 EV ## 9 2013 1 1 557 600 -3 838 846 -8 B6 ## 10 2013 1 1 558 600 -2 753 745 8 AA ## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm>, and abbreviated variable names ## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay ``` ] --- ## Spatial data is different Our (new) .vocab[simple feature] object: .midi[ ``` ## Simple feature collection with 100 features and 3 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 ## Geodetic CRS: NAD27 ## First 10 features: ## name regstrd voted geometry ## 1 ASHE 19414 8428 MULTIPOLYGON (((-81.47276 3... ## 2 ALLEGHANY 7556 4101 MULTIPOLYGON (((-81.23989 3... ## 3 SURRY 46666 23660 MULTIPOLYGON (((-80.45634 3... ## 4 CURRITUCK 21803 7536 MULTIPOLYGON (((-76.00897 3... ## 5 NORTHAMPTON 13891 6196 MULTIPOLYGON (((-77.21767 3... ## 6 HERTFORD 14945 6955 MULTIPOLYGON (((-76.74506 3... ## 7 CAMDEN 8128 3472 MULTIPOLYGON (((-76.00897 3... ## 8 GATES 8294 3105 MULTIPOLYGON (((-76.56251 3... ## 9 WARREN 13441 6878 MULTIPOLYGON (((-78.30876 3... ## 10 STOKES 31649 14444 MULTIPOLYGON (((-80.02567 3... ``` ] -- <br/> **What differences do you observe?** --- ## Raster versus vector spatial data .vocab[Vector] spatial data describes the world using shapes (points, lines, polygons, etc). .vocab[Raster] spatial data describes the world using cells of constant size. <img src="img/24/vector_raster_comparison.png" width="35%" style="display: block; margin: auto;" /> The choice to use vector or raster data depends on the problem context. We will focus on **vector** data. .tiny[*Source:* https://commons.wikimedia.org/wiki/File:Raster_vector_tikz.png] --- ## Simple features ```r #install.packages("sf") library(sf) ``` A .vocab[simple feature] is a standard way to describe how real-world spatial objects (country, building, tree, road, etc) can be represented by a computer. -- The package `sf` implements simple features and other spatial functionality using **tidy** principles. --- ## Simple features Simple features have a geometry type. Common choices are below. <img src="19-spatial-1_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- ## A simple feature object - Simple features are stored in a data frame, with the geographic information in a column called `geometry` - Simple features can contain both spatial and non-spatial data - Functions for spatial data in `sf` begin `st_` --- ## A simple feature object .midi[ ``` ## Simple feature collection with 100 features and 6 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 ## Geodetic CRS: NAD27 ## First 10 features: ## name regstrd voted mailed rejectd ml_rqst ## 1 ASHE 19414 8428 NA NA 2666 ## 2 ALLEGHANY 7556 4101 NA NA 971 ## 3 SURRY 46666 23660 4366 7 7088 ## 4 CURRITUCK 21803 7536 NA NA 2472 ## 5 NORTHAMPTON 13891 6196 828 2 1441 ## 6 HERTFORD 14945 6955 NA NA 1524 ## 7 CAMDEN 8128 3472 416 1 739 ## 8 GATES 8294 3105 NA NA 847 ## 9 WARREN 13441 6878 NA NA 1913 ## 10 STOKES 31649 14444 2162 2 3648 ## geometry ## 1 MULTIPOLYGON (((-81.47276 3... ## 2 MULTIPOLYGON (((-81.23989 3... ## 3 MULTIPOLYGON (((-80.45634 3... ## 4 MULTIPOLYGON (((-76.00897 3... ## 5 MULTIPOLYGON (((-77.21767 3... ## 6 MULTIPOLYGON (((-76.74506 3... ## 7 MULTIPOLYGON (((-76.00897 3... ## 8 MULTIPOLYGON (((-76.56251 3... ## 9 MULTIPOLYGON (((-78.30876 3... ## 10 MULTIPOLYGON (((-80.02567 3... ``` ] --- class: center, middle # Visualizing spatial data --- ## `nc_votes` This data was pulled from the [North Carolina Early Voting Statistics](https://electproject.github.io/Early-Vote-2020G/NC.html) website and is through 10-28-2020. The dataset contains the following variables: - `name`: county name - `regstrd`: number of registered voters - `voted`: number of individuals who have voted - `mailed`: number of mail ballots returned - `rejectd`: number of mail ballots rejected - `ml_rqst`: number of mail ballots requested --- ## Getting `sf` objects To read simple features from a file or database use the function `st_read()`. .small[ ```r library(sf) *nc <- st_read("data/nc_votes.shp", quiet = TRUE) nc ``` ``` ## Simple feature collection with 100 features and 6 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 ## Geodetic CRS: NAD27 ## First 10 features: ## name regstrd voted mailed rejectd ml_rqst ## 1 ASHE 19414 8428 NA NA 2666 ## 2 ALLEGHANY 7556 4101 NA NA 971 ## 3 SURRY 46666 23660 4366 7 7088 ## 4 CURRITUCK 21803 7536 NA NA 2472 ## 5 NORTHAMPTON 13891 6196 828 2 1441 ## 6 HERTFORD 14945 6955 NA NA 1524 ## 7 CAMDEN 8128 3472 416 1 739 ## 8 GATES 8294 3105 NA NA 847 ## 9 WARREN 13441 6878 NA NA 1913 ## 10 STOKES 31649 14444 2162 2 3648 ## geometry ## 1 MULTIPOLYGON (((-81.47276 3... ## 2 MULTIPOLYGON (((-81.23989 3... ## 3 MULTIPOLYGON (((-80.45634 3... ## 4 MULTIPOLYGON (((-76.00897 3... ## 5 MULTIPOLYGON (((-77.21767 3... ## 6 MULTIPOLYGON (((-76.74506 3... ## 7 MULTIPOLYGON (((-76.00897 3... ## 8 MULTIPOLYGON (((-76.56251 3... ## 9 MULTIPOLYGON (((-78.30876 3... ## 10 MULTIPOLYGON (((-80.02567 3... ``` ] --- ## Plotting with `ggplot()` ```r *ggplot(nc) + * geom_sf() + labs(title = "North Carolina counties") ``` <img src="19-spatial-1_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## A look at some aesthetics ```r ggplot(nc) + * geom_sf(color = "lightblue") + labs(title = "North Carolina counties with theme and aesthetics") ``` <img src="19-spatial-1_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ## A look at some aesthetics ```r ggplot(nc) + * geom_sf(color = "lightblue", size = 1.5) + labs(title = "North Carolina counties with theme and aesthetics") ``` <img src="19-spatial-1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- ## A look at some aesthetics ```r ggplot(nc) + * geom_sf(color = "lightblue", size = 1.5, fill = "purple") + labs(title = "North Carolina counties with theme and aesthetics") ``` <img src="19-spatial-1_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- ## A look back at some of our data .small[ ``` ## Simple feature collection with 100 features and 6 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 ## Geodetic CRS: NAD27 ## First 10 features: ## name regstrd voted mailed rejectd ml_rqst ## 1 ASHE 19414 8428 NA NA 2666 ## 2 ALLEGHANY 7556 4101 NA NA 971 ## 3 SURRY 46666 23660 4366 7 7088 ## 4 CURRITUCK 21803 7536 NA NA 2472 ## 5 NORTHAMPTON 13891 6196 828 2 1441 ## 6 HERTFORD 14945 6955 NA NA 1524 ## 7 CAMDEN 8128 3472 416 1 739 ## 8 GATES 8294 3105 NA NA 847 ## 9 WARREN 13441 6878 NA NA 1913 ## 10 STOKES 31649 14444 2162 2 3648 ## geometry ## 1 MULTIPOLYGON (((-81.47276 3... ## 2 MULTIPOLYGON (((-81.23989 3... ## 3 MULTIPOLYGON (((-80.45634 3... ## 4 MULTIPOLYGON (((-76.00897 3... ## 5 MULTIPOLYGON (((-77.21767 3... ## 6 MULTIPOLYGON (((-76.74506 3... ## 7 MULTIPOLYGON (((-76.00897 3... ## 8 MULTIPOLYGON (((-76.56251 3... ## 9 MULTIPOLYGON (((-78.30876 3... ## 10 MULTIPOLYGON (((-80.02567 3... ``` ] Let's incorporate these variables into our plot using `ggplot`. --- ## Choropleth map ```r ggplot(nc) + * geom_sf(aes(fill = voted)) + labs(title = "Higher population counties have more votes cast", fill = "Total votes cast") ``` <img src="19-spatial-1_files/figure-html/choropleth-1-code-1.png" style="display: block; margin: auto;" /> It is sometimes helpful to pick diverging colors; [colorbrewer2](http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3) can help. --- ## Choropleth map: colors One way to set fill colors is with `scale_fill_gradient()`. ```r ggplot(nc) + geom_sf(aes(fill = voted)) + * scale_fill_gradient(low = "#fee8c8", high = "#7f0000") + labs(title = "Higher population counties have more votes cast", fill = "Total votes cast") ``` <img src="19-spatial-1_files/figure-html/choropleth-2-1.png" style="display: block; margin: auto;" /> --- ## Choropleth map: colors - The `RColorBrewer` package provides fun color palettes for you ```r library(RColorBrewer) ggplot(nc) + geom_sf(aes(fill = voted)) + * scale_fill_gradientn(colors = brewer.pal(8, "Spectral")) + labs(title = "Higher population counties have more votes cast", fill = "Total votes cast") ``` <img src="19-spatial-1_files/figure-html/RColorBrewer_code-1.png" style="display: block; margin: auto;" /> --- ## Adding labels ```r ggplot(nc) + geom_sf(aes(fill = voted)) + scale_fill_gradientn(colors = brewer.pal(8, "Spectral")) + * geom_sf_text(data = nc %>% filter(voted > 300000), aes(label = name), size = 2, col = "white") ``` ``` ## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not ## give correct results for longitude/latitude data ``` <img src="19-spatial-1_files/figure-html/map_label_error-1.png" style="display: block; margin: auto;" /> --- ## Let's make it more informative ```r ggplot(nc) + * geom_sf(aes(fill = voted/regstrd)) + scale_fill_gradient(low = "#fff7f3", high = "#49006a") + labs(fill = "Votes cast per registered voter", title = "Early vote turnout varies by county") + * theme(legend.position = "bottom", legend.direction="horizontal") ``` --- ## Let's make it more informative <img src="19-spatial-1_files/figure-html/choropleth-3-1.png" style="display: block; margin: auto;" /> --- class: center, middle ## Live code! --- ## `us_states` ```r library(sf) library(spData) #this packages contains the dataset (with sf objects) that we will be using today data("us_states") us_states ``` ``` ## Simple feature collection with 49 features and 6 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436 ## Geodetic CRS: NAD83 ## First 10 features: ## GEOID NAME REGION AREA total_pop_10 total_pop_15 ## 1 01 Alabama South 133709.27 [km^2] 4712651 4830620 ## 2 04 Arizona West 295281.25 [km^2] 6246816 6641928 ## 3 08 Colorado West 269573.06 [km^2] 4887061 5278906 ## 4 09 Connecticut Norteast 12976.59 [km^2] 3545837 3593222 ## 5 12 Florida South 151052.01 [km^2] 18511620 19645772 ## 6 13 Georgia South 152725.21 [km^2] 9468815 10006693 ## 7 16 Idaho West 216512.66 [km^2] 1526797 1616547 ## 8 18 Indiana Midwest 93648.40 [km^2] 6417398 6568645 ## 9 20 Kansas Midwest 213037.08 [km^2] 2809329 2892987 ## 10 22 Louisiana South 122345.76 [km^2] 4429940 4625253 ## geometry ## 1 MULTIPOLYGON (((-88.20006 3... ## 2 MULTIPOLYGON (((-114.7196 3... ## 3 MULTIPOLYGON (((-109.0501 4... ## 4 MULTIPOLYGON (((-73.48731 4... ## 5 MULTIPOLYGON (((-81.81169 2... ## 6 MULTIPOLYGON (((-85.60516 3... ## 7 MULTIPOLYGON (((-116.916 45... ## 8 MULTIPOLYGON (((-87.52404 4... ## 9 MULTIPOLYGON (((-102.0517 4... ## 10 MULTIPOLYGON (((-92.01783 2... ``` - The CRS is `NAD83`, which projects longitude-latitude coordinates. It is most commonly used by U.S. federal agencies. --- ## Mapping states ```r ggplot(us_states) + geom_sf() ``` <img src="19-spatial-1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- ## Mapping with points! - Recall the `nycflights13` dataset from the practice midterm which had all kinds information about all the flights out of NYC airports in 2013 ```r library(nycflights13) data(airports) ``` - Let's try to plot these airports onto our map! --- ## Mapping airports ```r ggplot(us_states) + geom_sf() + * geom_point(data = airports, aes(x = lon, y = lat), size = 0.1) ``` <img src="19-spatial-1_files/figure-html/us-airports-1-1.png" style="display: block; margin: auto;" /> .question[What's happening here?] --- ## Mapping airports ```r ggplot(us_states) + geom_sf() + geom_point(data = airports, aes(x = lon, y = lat), pch = 19, size = 0.1) + * coord_sf(xlim = c(-130, -65), ylim = c(20, 50)) ``` <img src="19-spatial-1_files/figure-html/us-airports-2-1.png" style="display: block; margin: auto;" /> --- ## Mapping airports .alert[Recall that `airports` is a regular data frame. Why am I able to plot `airports` if it is not an `sf` object?] --- ## Making more informative - What if we wanted each point to represent the number of flights that flew to those airports from NYC? - Need the data about flight info: ```r data(flights) dest_count <- flights %>% count(dest, name = "count") dest_count %>% slice(1:3) ``` ``` ## # A tibble: 3 × 2 ## dest count ## <chr> <int> ## 1 ABQ 254 ## 2 ACK 265 ## 3 ALB 439 ``` --- ## Making more informative (cont.) - We will merge our new `dest_count` object with the original `airports` data frame: ```r airports_count <- airports %>% inner_join(dest_count, by=c("faa" = "dest")) ``` - Then we can add `airports_count` data to our plot! ```r ggplot(us_states) + geom_sf() + coord_sf(xlim = c(-130, -65), ylim = c(20, 50)) + * geom_point(data=airports_count, aes(x=lon, y=lat, size=count)) + * scale_size(range = c(0.2,5), breaks=c(0,1,10,100,1000,10000,100000)) ``` --- ## Making more informative (cont.) <img src="19-spatial-1_files/figure-html/airports_count-1.png" style="display: block; margin: auto;" /> --- ## Challenge 1 - Manipulating spatial data objects is similar, but not identical to manipulating data frames. - Note the core data-wrangling functions from `dplyr` do work (we used `filter()`earlier!) --- ## Geometries are "sticky" The geometry is kept until it is deliberated dropped using `st_drop_geometry`. -- ```r nc %>% select(name, regstrd) %>% filter(regstrd > 100000) %>% * st_drop_geometry() ``` ``` ## name regstrd ## 1 FORSYTH 270818 ## 2 GUILFORD 381797 ## 3 ALAMANCE 110127 ## 4 ORANGE 111765 ## 5 DURHAM 243045 ## 6 WAKE 791821 ## 7 IREDELL 130013 ## 8 DAVIDSON 112872 ## 9 PITT 122925 ## 10 CATAWBA 108098 ## 11 BUNCOMBE 206178 ## 12 JOHNSTON 142255 ## 13 MECKLENBURG 789547 ## 14 CABARRUS 149584 ## 15 GASTON 150779 ## 16 CUMBERLAND 225029 ## 17 UNION 167068 ## 18 ONSLOW 116300 ## 19 NEW HANOVER 177056 ## 20 BRUNSWICK 114115 ``` --- ## Challenge 2 The coordinate reference system (CRS) matters. The following is output from the `sf` object `nc`: ```r Simple feature collection with 100 features and 6 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 * Geodetic CRS: NAD27 name regstrd ... geometry 1 Ashe 19414 ... <MULTIPOLYGON> (((-81.47276 36.23436, -81.54084 36.27251, -... ``` --- ## References - [North Carolina Early Voting Statistics](https://electproject.github.io/Early-Vote-2020G/NC.html) - [Simple Features for R vignette](https://r-spatial.github.io/sf/) - [mapview vignette](https://r-spatial.github.io/mapview/index.html) - [Coordinate Reference Systems in R](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/OverviewCoordinateReferenceSystems.pdf) - [Geocomputation with R](https://geocompr.robinlovelace.net/spatial-class.html)