class: center, middle, inverse, title-slide

.title[
# Data and visualization
]
.author[
### Becky Tang
]
.date[
### 9/16/22
]

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

# Announcements

- Homework 1 released today, due Tuesday 9/20 11:59pm
  - Abbreviated assignment
  - This is the only homework where you won't have a week to complete it

---

class: center, middle

# Exploratory data analysis

---

## What is EDA?

- .vocab[Exploratory data analysis (EDA)] is an approach to analyzing data sets to summarize their main characteristics.

<br>

- Often, EDA is visual. That's what we're focusing on today.

<br>

- We can also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis.

---

class: center, middle

# Data visualization

---

## Data visualization

> *"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey*

<br>

- .vocab[Data visualization] is the creation and study of the visual representation of data.

<br>

- There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations
- We'll use **`ggplot2`**.

---

## What function is doing the plotting?

```r
ggplot(data = countries_footprint,
       mapping = aes(x = GDP, y = Total)) +
* geom_point() +
  labs(title = "GDP vs. Total Ecological Footprint of countries (2016)",
       x = "GDP ($10k)", y = "Total footprint (hectare)")
```

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## What is the dataset being plotted?

```r
*ggplot(data = countries_footprint,
       mapping = aes(x = GDP, y = Total)) +
  geom_point() +
  labs(title = "GDP vs. Total Ecological Footprint of countries (2016)",
       x = "GDP ($10k)", y = "Total footprint (hectare)")
```

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Which variable is on the x-axis? On the y-axis?

```r
ggplot(data = countries_footprint,
*      mapping = aes(x = GDP, y = Total)) +
  geom_point() +
  labs(title = "GDP vs. Total Ecological Footprint of countries (2016)",
       x = "GDP ($10k)", y = "Total footprint (hectare)")
```

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## What does `geom_smooth()` do?

```r
ggplot(data = countries_footprint,
       mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* geom_smooth() +
  labs(title = "GDP vs. Total Ecological Footprint of countries (2016)",
       x = "GDP ($10k)", y = "Total footprint (hectare)")
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Hello ggplot2!

- `ggplot()` is the main function in ggplot2, and plots are constructed in layers
- The structure of the code for plots can often be summarized as

```r
ggplot +
  geom_xxx
```

<br>

--

or, more precisely

.small[
```r
ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) +
  geom_xxx() +
  other options
```
]

---

## Hello ggplot2!
To use ggplot2 functions, first load the tidyverse

```r
library(tidyverse)
```

For help with ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/)

---

class: center, middle

# Visualizing Ecological Footprint

---

## Dataset terminology

.small[
```r
countries_footprint
```

```
## # A tibble: 185 × 14
##    Country      Region Popul…¹   HDI    GDP Cropl…² Grazing Forest Carbon  Fish
##    <chr>        <chr>    <dbl> <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <dbl>
##  1 Afghanistan  Middl…   29.8   0.46 0.0615    0.3     0.2    0.08   0.18  0
##  2 Albania      East …    3.16  0.73 0.453     0.78    0.22   0.25   0.87  0.02
##  3 Algeria      Africa   38.5   0.73 0.543     0.6     0.16   0.17   1.14  0.01
##  4 Angola       Africa   20.8   0.52 0.467     0.33    0.15   0.12   0.2   0.09
##  5 Antigua and… Latin…    0.09  0.78 1.32     NA      NA     NA     NA    NA
##  6 Argentina    Latin…   41.1   0.83 1.35      0.78    0.79   0.29   1.08  0.1
##  7 Armenia      Middl…    2.97  0.73 0.343     0.74    0.18   0.34   0.89  0.01
##  8 Aruba        Latin…    0.1  NA   NA        NA      NA     NA     NA    NA
##  9 Australia    Asia-…   23.0   0.93 6.66      2.68    0.63   0.89   4.85  0.11
## 10 Austria      EU        8.46  0.88 5.13      0.82    0.27   0.63   4.14  0.06
## # … with 175 more rows, 4 more variables: Total <dbl>, EarthsRequired <dbl>,
## #   CountriesRequired <dbl>, DataQuality <chr>, and abbreviated variable names
## #   ¹Population, ²Cropland
```
]

Each row is an .vocab[observation].

Each column is a .vocab[variable].

Data obtained from [https://www.kaggle.com/footprintnetwork/ecological-footprint](https://www.kaggle.com/footprintnetwork/ecological-footprint)

---

## What's in the Ecological Footprint data?
Take a `glimpse` of the data:

```r
glimpse(countries_footprint)
```

```
## Rows: 185
## Columns: 14
## $ Country           <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Anti…
## $ Region            <chr> "Middle East", "East Europe", "Africa", "Africa", "L…
## $ Population        <dbl> 29.82, 3.16, 38.48, 20.82, 0.09, 41.09, 2.97, 0.10, …
## $ HDI               <dbl> 0.46, 0.73, 0.73, 0.52, 0.78, 0.83, 0.73, NA, 0.93, …
## $ GDP               <dbl> 0.061466, 0.453437, 0.543057, 0.466591, 1.320510, 1.…
## $ Cropland          <dbl> 0.30, 0.78, 0.60, 0.33, NA, 0.78, 0.74, NA, 2.68, 0.…
## $ Grazing           <dbl> 0.20, 0.22, 0.16, 0.15, NA, 0.79, 0.18, NA, 0.63, 0.…
## $ Forest            <dbl> 0.08, 0.25, 0.17, 0.12, NA, 0.29, 0.34, NA, 0.89, 0.…
## $ Carbon            <dbl> 0.18, 0.87, 1.14, 0.20, NA, 1.08, 0.89, NA, 4.85, 4.…
## $ Fish              <dbl> 0.00, 0.02, 0.01, 0.09, NA, 0.10, 0.01, NA, 0.11, 0.…
## $ Total             <dbl> 0.79, 2.21, 2.12, 0.93, 5.38, 3.14, 2.23, 11.88, 9.3…
## $ EarthsRequired    <dbl> 0.46, 1.27, 1.22, 0.54, 3.11, 1.82, 1.29, 6.86, 5.37…
## $ CountriesRequired <dbl> 1.60, 1.87, 3.61, 0.37, 5.70, 0.45, 2.52, 20.69, 0.5…
## $ DataQuality       <chr> "High", "High", "Medium", "High", "Low", "High", "Lo…
```

---

## Example: What's in the Star Wars data?

A dataset that has been made available in R for anyone to use comes with a help file. Run the following **<u>in the Console</u>** to view the help file for the `starwars` dataset:

```r
?starwars
```

<img src="img/02/starwars-help.png" width="60%" style="display: block; margin: auto auto auto 0;" />

---

## GDP vs. Total Footprint

```r
ggplot(data = countries_footprint,
       mapping = aes(x = GDP, y = Total)) +
  geom_point()
```

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## What's that warning?

- Not all countries have GDP and Total Footprint information (hence 15 of them are not plotted)

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

- We can suppress warnings to save space in the output document, but it's important to note them
- To suppress warnings:

.center[
`{r code-chunk-label, warning=FALSE}`
]

---

## GDP vs. Total Footprint

.question[
How would you describe this **relationship**?
]

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## Additional variables

We can map additional variables to various features of the plot:

- **aesthetics**
  - shape
  - color
  - size
  - alpha (transparency)
- **faceting**: small multiples displaying different subsets

---

class: center, middle

# Aesthetics

---

## Aesthetics options

Visual characteristics of plotting characters that can be **mapped to a specific variable** in the data are

- `color`
- `size`
- `shape`
- `alpha` (transparency)

---

## GDP + Total Footprint + Data Quality

```r
ggplot(data = countries_footprint,
       mapping = aes(x = GDP, y = Total,
*                    color = DataQuality)) +
  geom_point()
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## GDP + Total Footprint + Data Quality

Let's map `shape` and `color` to `DataQuality`

```r
ggplot(data = countries_footprint,
       mapping = aes(x = GDP, y = Total,
*                    color = DataQuality,
*                    shape = DataQuality)) +
  geom_point()
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

### GDP + Total Footprint + Data Quality + Fish

```r
ggplot(data = countries_footprint,
       mapping = aes(x = GDP, y = Total,
                     color = DataQuality,
                     shape = DataQuality,
*                    size = Fish)) +
  geom_point()
```

<img src="02-data-and-viz_files/figure-html/plot-birth-year-1.png" style="display: block; margin: auto;" />

---

## GDP + Total Footprint + Data Quality

Let's increase the size of all points across the board:

```r
ggplot(data = countries_footprint,
       mapping = aes(x = GDP, y = Total,
                     color = DataQuality,
                     shape = DataQuality)) +
* geom_point(size = 3)
```

<img
src="02-data-and-viz_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## Aesthetics summary

- Continuous variables are measured on a continuous scale
- Discrete variables are measured (or often counted) on a discrete scale

.small[
aesthetics    | discrete                 | continuous
------------- | ------------------------ | ------------
color         | rainbow of colors        | gradient
size          | discrete steps           | linear mapping between area and value
shape         | different shape for each | shouldn't (and doesn't) work
]

<br>

.alert[
Use aesthetics (`aes()`) to map features of a plot to a variable; set features directly in `geom_xxx()` for customization that is **<u>not</u>** mapped to a variable.
]

---

class: center, middle

# Faceting

---

## Faceting options

- Smaller plots that display different subsets of the data
- Useful for exploring conditional relationships and large data

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
  labs(title = "GDP vs. Total Footprint of countries (2016)",
*      subtitle = "Faceted by region",
       x = "GDP ($10k)", y = "Total footprint (hectare)") +
* facet_grid(. ~ Region)
```

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
  labs(title = "GDP vs. Total Footprint of countries (2016)",
*      subtitle = "Faceted by region",
       x = "GDP ($10k)", y = "Total footprint (hectare)") +
* facet_grid(. ~ Region)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

.question[
In the next few slides, describe what each plot displays. Think about how the code relates to the output.
]

--

<br><br><br>

.alert[
The plots in the next few slides do not have proper titles, axis labels, etc., so you can more easily focus on what's happening in the plots. But you should always label your plots!
]

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* facet_grid(DataQuality ~ Region)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* facet_grid(Region ~ .)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* facet_wrap(Region ~ .)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* facet_wrap(Region ~ ., nrow = 1)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* facet_wrap(Region ~ ., scales = "free_x")
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---

## Facet summary

- `facet_grid()`:
  - 2d grid
  - `row variable ~ column variable`
  - use `.` for no split

--

- `facet_wrap()`: 1d ribbon wrapped into 2d
  - `variable ~ .`
  - specify the number of rows or columns using the `nrow` or `ncol` argument

--

- set scales using `scales = ` ("free_x", "free_y", "free")

---

## Modifications

You can omit the names of the first two arguments when building plots with `ggplot()`.

```r
*ggplot(countries_footprint, aes(x = GDP, y = Total)) +
  geom_point() +
  facet_grid(. ~ Region)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---

## Removing legend

.pull-left[
```r
ggplot(countries_footprint,
       aes(x = GDP, y = Total,
           col = DataQuality)) +
  geom_point()
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

.pull-right[
```r
ggplot(countries_footprint,
       aes(x = GDP, y = Total,
           col = DataQuality)) +
  geom_point() +
* guides(col = "none")
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]

---

## `ggplot2` supplementary resources

1. [ggplot2.tidyverse.org](https://ggplot2.tidyverse.org/)
2. `ggplot2` [cheat sheet](files/data-visualization.pdf)
3. [Top 50 `ggplot2` visualizations](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html)
4. [How the BBC uses `ggplot2`](https://medium.com/bbc-visual-and-data-journalism/how-the-bbc-visual-and-data-journalism-team-works-with-graphics-in-r-ed0b35693535)
5. [ggplot2: Elegant Graphics for Data Analysis](https://ggplot2-book.org/)
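
---

## Recap sketch: mapping, setting, and faceting

A minimal sketch that pulls together the aesthetics and facet summaries: one aesthetic mapped to a variable inside `aes()`, one set to a constant inside the geom, and a facet split. The `toy_footprint` data frame and its values are made up for illustration (they are not from the Ecological Footprint data), so treat this as a pattern under those assumptions rather than a result.

```r
library(ggplot2)

# Hypothetical toy data, invented for this sketch only.
# Column names mirror the ones used in the slides.
toy_footprint <- data.frame(
  GDP         = c(0.5, 1.2, 3.4, 5.1, 0.8, 2.6),
  Total       = c(1.1, 2.3, 4.8, 6.0, 1.5, 3.9),
  Region      = c("A", "A", "A", "B", "B", "B"),
  DataQuality = c("High", "Low", "High", "Medium", "High", "Low")
)

p <- ggplot(toy_footprint,
            aes(x = GDP, y = Total,
                color = DataQuality)) +   # mapped: color varies with a variable
  geom_point(size = 3) +                  # set: every point gets the same size
  facet_wrap(~ Region, nrow = 1) +        # one panel per region, in a single row
  labs(title = "Toy example: GDP vs. Total footprint",
       x = "GDP", y = "Total footprint")

p
```

Swapping `facet_wrap(~ Region, nrow = 1)` for `facet_grid(. ~ Region)` should give the same one-row layout here, since there is only one faceting variable.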