class: center, middle, inverse, title-slide .title[ # Introducing linear models ] .author[ ### Becky Tang ] .date[ ### 11/11/2022 ] --- layout: true <div class="my-footer"> <span> <a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a> </span> </div> --- ## Housekeeping - HW 07 due Tuesday, 11/15 at 11:59pm - Project proposals are due to Canvas this Friday, 11/18 at 11:59pm - I will send out an e-mail where you and your partner should sign up for a brief Zoom meeting with me on 11/28 --- class: middle, center ## The language of models --- ## Modeling - We use models to... - understand relationships - assess differences - make predictions - We will focus on .vocab[linear] models, but there are many other types. --- class: center, middle # Data: Teacher salaries --- ## Teacher salaries This data set contains 2009-2010 salaries for 71 teachers employed by the St. Louis Public Schools in Michigan. ```r glimpse(teacher) ``` ``` ## Rows: 71 ## Columns: 9 ## $ id <fct> 01, 02, 03, 04, 05, 06, 07, 08, 09, 11, 12, 13, 14, 15, 16,… ## $ degree <fct> BA, MA, MA, BA, BA, BA, BA, MA, BA, BA, BA, BA, BA, BA, MA,… ## $ fte <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ years <dbl> 5.0, 15.0, 16.0, 10.0, 26.0, 28.5, 12.0, 32.0, 25.0, 12.0, … ## $ base <int> 45388, 60649, 60649, 54466, 65360, 65360, 58097, 68230, 653… ## $ fica <dbl> 3472.18, 4639.65, 4639.65, 4166.65, 5000.04, 5000.04, 4444.… ## $ retirement <dbl> 7688.73, 10273.94, 10273.94, 9226.54, 11071.98, 11071.98, 9… ## $ total <dbl> 56548.91, 75562.59, 75562.59, 67859.19, 81432.02, 81432.02,… ## $ grade <chr> "high", "middle", "elementary", "middle", "middle", "elemen… ``` --- ## Data - `degree`: highest educational degree attained (`"BA"` or `"MA"`) - `fte`: full-time equivalent status (1 = full-time, 0.5 = part-time) - `years`: number of years employed by the school district - `base`: base annual salary, in dollars - `fica`: amount paid into Social Security and Medicare per year, in dollars - `retirement`: amount paid into the teacher's retirement fund per year, in dollars - `total`: total annual salary of the teacher, i.e. the sum of `base` + `fica` + `retirement`, in dollars --- class: center, middle ## Modeling the relationship between variables --- ## Describe the distribution of `base` .midi[ ```r ggplot(data = teacher, aes(x = base)) + geom_histogram(bins = 20) + labs(title = "Distribution of base salary (in $)") ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] --- ## Years and Degree .pull-left[ ```r ggplot(data = teacher, aes(x = years)) + geom_histogram(bins = 20) + theme(text = element_text(size = 20)) ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r ggplot(data = teacher, aes(x = degree)) + geom_bar() + theme(text = element_text(size = 20)) ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] --- ## Models as functions - We can represent relationships between variables using **functions** - A .vocab[function] in the *mathematical* sense is the relationship between one or more inputs and an output created from those inputs. - Plug in the inputs and receive back the output -- - The formula `\(y = 3x + 7\)` is a function with input `\(x\)` and output `\(y\)`, as sketched below
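A minimal sketch of this same function written in R (the name `f` is just for illustration, not part of any package):

```r
# The function y = 3x + 7: takes the input x, returns the output y
f <- function(x) {
  3 * x + 7
}
```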
- When `\(x\)` is `\(5\)`, the output `\(y\)` is `\(22\)` ``` y = 3 * 5 + 7 = 22 ``` -- - When `\(x\)` is `\(12\)`, the output `\(y\)` is `\(43\)` ``` y = 3 * 12 + 7 = 43 ``` --- ## Visualizing the linear model .midi[ ```r ggplot(data = teacher, aes(x = years, y = base)) + geom_point() + ggtitle("Base salary vs years") + * geom_smooth(method = "lm") ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] --- ## Visualizing the linear model .midi[ ```r ggplot(data = teacher, aes(x = years, y = base)) + geom_point() + ggtitle("Base salary vs years") + * geom_smooth(method = "lm", se = FALSE, * col = "red", lty = 2, lwd = 1) ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> ] --- ## Vocabulary - .vocab[Response variable]: Variable whose behavior or variation you are trying to understand, on the y-axis. Also called the **dependent variable**, and referred to as `\(y\)`. For linear regression, the response is quantitative. - .vocab[Explanatory variables]: Other variables that you want to use to explain the variation in the response, on the x-axis. Also called **independent variables**, **predictors**, or **features**. Each explanatory variable is referred to using an `\(x\)`. -- .question[What might be the dependent and explanatory variables in our `teacher` data?] -- - .vocab[Predicted value]: Output of the **model function** - The model function gives the typical value of the response variable *conditional* on the explanatory variables (what does this mean?) --- ## Data .pull-left[ ```r teacher %>% dplyr::select(base, years) %>% slice(1:5) ``` ``` ## # A tibble: 5 × 2 ## base years ## <int> <dbl> ## 1 45388 5 ## 2 60649 15 ## 3 60649 16 ## 4 54466 10 ## 5 65360 26 ``` ] .pull-right[ ```r teacher %>% dplyr::select(base, years) %>% rename("y" = 1, "x" = 2) %>% slice(1:5) ``` ``` ## # A tibble: 5 × 2 ## y x ## <int> <dbl> ## 1 45388 5 ## 2 60649 15 ## 3 60649 16 ## 4 54466 10 ## 5 65360 26 ``` ] --- ## The linear model with a single predictor - .vocab[Linear regression] is the statistical method for fitting a line to data where the relationship between a predictor `\(x\)` and a quantitative response `\(y\)` can be modeled by a straight line with some error `\(\epsilon\)`: `$$y = \beta_{0} + \beta_{1}x + \epsilon$$` - For this class, we will not worry about the `\(\epsilon\)` too much. But we should remember there is always a little bit of error! - When we ignore `\(\epsilon\)`, we replace `\(y\)` with `\(\hat{y}\)` --- ## The linear model with a single predictor (cont.) - Thus, we have the following model: `$$\hat{y} = \beta_0 + \beta_1~x$$` where `\(\beta_1\)` is often called the .vocab[coefficient] or .vocab[slope] for the explanatory variable `\(x\)` -- - Unfortunately, we cannot observe the true values of `\(\beta_0\)` and `\(\beta_1\)` -- - So we use sample statistics to estimate them: `$$\hat{y} = b_0 + b_1~x$$` --- ## Vocabulary - .vocab[Residuals]: Show how far each case `\(i\)` is from its predicted value - **Residual = Observed value - Predicted value = `\(y_{i} - \hat{y}_{i}\)`** - A residual tells us how far above/below the model function each case is --- .question[ What does a positive residual mean? Which salaries on the plot have positive residuals: those below or above the line? ]
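One way to see the answer is to compute a residual by hand. A small sketch of the arithmetic, using the example point highlighted on the upcoming slides (observed and predicted salary at years = 13):

```r
# Residual = observed - predicted
y_obs <- 61216     # true base salary when years = 13
y_hat <- 54108.45  # predicted base salary from the fitted line
y_obs - y_hat      # 7107.55: positive, so this point lies above the line
```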
<img src="16-intro-linear-models_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## Residuals <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ## Residuals - True salary when years = 13: 61216 - Predicted salary when years = 13: 54108.45 <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- ## Residuals <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- ## Residuals - True salary when years = 34: 65360 - Predicted salary when years = 34: 70973.64 <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- ## Models - upsides and downsides - Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data. - There is a real risk, however, that a model imposes structure on the scatter of data that is not really there, just as people imagine animal shapes in the stars. A skeptical approach is always warranted. --- ## Variation around the model... is just as important as the model, if not more! .question[ *Statistics is the explanation of variation in the context of what remains unexplained.* ] - The scatter suggests that there might be other factors that account for large parts of the salary-to-salary variability, or perhaps just that randomness plays a big role. - Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.) --- ## How do we use models? 1. .vocab[Explanation:] Characterize the relationship between `\(y\)` and `\(x\)` via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables 2. .vocab[Prediction:] Plug in `\(x\)`, get the predicted `\(y\)` --- class: middle, center ## Interpreting Models --- ## Packages .pull-left[   ] .pull-right[ - You're familiar with the tidyverse: ```r library(tidyverse) ``` - The **broom** package takes the messy output of built-in R functions, such as `lm()`, and turns it into tidy data frames. ```r library(broom) ``` ] --- ## broom .pull-left[ .middle[  ] ] .pull-right[ - **broom** follows tidyverse principles and tidies up regression output - **`tidy`**: Constructs a tidy data frame summarizing a model's statistical findings - **`glance`**: Constructs a concise one-row summary of the model - **`augment`**: Adds columns (e.g. predictions, residuals) to the original data that was modeled ] [https://broom.tidyverse.org/](https://broom.tidyverse.org/) --- ## Interpreting the slope - Recall the linear regression model: `$$\hat{\color{purple}{y}} = \beta_{0} + \beta_{1}x$$` -- - Let's increase `\(x\)` by 1, i.e. plug `\((x+1)\)` into the equation: $$ `\begin{align*} \beta_{0} + \beta_{1}(x+1) &= \beta_{0} + \beta_{1}x + \beta_{1}\\ &= (\beta_{0} + \beta_{1}x) + \beta_{1}\\ &= \hat{\color{purple}{y}} + \beta_{1} \end{align*}` $$ -- - .question[What is the interpretation of the slope?] --- ## Interpreting the intercept - Recall the linear regression model: `$$\hat{\color{purple}{y}} = \beta_{0} + \beta_{1}x$$` -- - Let's set `\(x=0\)`: `$$\hat{\color{purple}{y}} = \beta_{0} + \beta_{1}\times 0 = \beta_{0}$$` -- - .question[What is the interpretation of the intercept?]
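A quick numeric check of the algebra on these two slides, using made-up coefficients (`b0 = 10` and `b1 = 3` are illustrative values only, not estimates from our data):

```r
b0 <- 10                         # illustrative intercept
b1 <- 3                          # illustrative slope
y_hat <- function(x) b0 + b1 * x
y_hat(6) - y_hat(5)              # = 3: increasing x by 1 changes y-hat by exactly b1
y_hat(0)                         # = 10: at x = 0, y-hat equals the intercept b0
```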
--- ## Years and base salary .midi[ ```r mod_years <- lm(base ~ years, data = teacher) tidy(mod_years) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 43668. 1054. 41.4 1.79e-50 ## 2 years 803. 55.0 14.6 8.88e-23 ``` ] -- `$$\widehat{baseSalary} = \color{blue}{43668} + \color{orange}{803}\times years$$` -- - .vocab[Slope]: For each additional year a teacher is employed at the school district, the base salary is expected to be *higher*, on average, by `\(\color{orange}{803}\)` dollars. -- - .vocab[Intercept]: Teachers who have worked 0 years in the district are expected to have a base salary of `\(\color{blue}{43668}\)` dollars, on average. - Does this make sense? -- - Why is there a "hat" on the response variable? --- ## Least squares regression Why these coefficients? In other words, why this particular line? The regression line minimizes the sum of squared residuals. -- - .vocab[Residuals]: `\(e_i = y_i - \hat{y}_i\)` - The regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`. - Equivalently, it minimizes `\(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\)` .question[ Why do we minimize the *squares* of the residuals? ] --- ## Visualizing residuals <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> Check out the applet [here](http://www.rossmanchance.com/applets/RegShuffle.htm) to see this process in action. --- class: middle, center ## Live code: Exercises 1-3 --- class: middle, center ## Correlation does not imply causation!! Remember this when interpreting model coefficients --- class: center, middle # Prediction with models --- ## Predict base salary from years .question[ On average, how much does a teacher in this district make if they have worked for 15 years?] -- - Recall our fitted model: `$$\widehat{baseSalary} = 43668.1 + 803.1\times years$$` -- - Let's just plug in 15 for `\(years\)`: ```r 43668.1 + 803.1 * 15 ``` ``` ## [1] 55714.6 ``` - "On average, we expect a teacher who has been employed for 15 years to have a base salary of $55714.60." - **Warning:** We "expect" this to happen, but there will be some variability. --- ## Prediction vs. extrapolation .question[ On average, what is the base salary for someone who has worked 80 years? ] <img src="16-intro-linear-models_files/figure-html/extrapolate-1.png" style="display: block; margin: auto;" /> --- ## Watch out for extrapolation! > "When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on."<sup>1</sup> <br> Stephen Colbert, April 6th, 2010 .footnote[[1] "Extrapolation is treacherous." OpenIntro Statistics.] --- class: middle, center ## Live code: Exercises 4-5 --- class: middle, center ## Categorical Predictors --- ## What about non-continuous predictors?
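Let's peek at the two categorical variables in our data. Below is a sketch of the kind of `dplyr` call that would produce this output (the first 8 teachers' `degree` and `grade`):

```r
# Look at degree and grade for the first 8 teachers
teacher %>%
  dplyr::select(degree, grade) %>%
  slice(1:8)
```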
``` ## # A tibble: 8 × 2 ## degree grade ## <fct> <chr> ## 1 BA high ## 2 MA middle ## 3 MA elementary ## 4 BA middle ## 5 BA middle ## 6 BA elementary ## 7 BA high ## 8 MA high ``` --- ### Categorical predictor with 2 levels `base` salary regressed on `degree` ```r mod_degree <- lm(base ~ degree, data = teacher) tidy(mod_degree) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 56610. 1777. 31.8 5.44e-43 ## 2 degreeBA -352. 2398. -0.147 8.84e- 1 ``` -- <br> `$$\widehat{baseSalary} = 56610 - 352\times degreeBA$$` .question[What is happening here?] --- ### Indicator variables - Under the hood, `R` has actually converted the categorical variable `degree` into an .vocab[indicator] variable called `\(degreeBA\)` `$$degreeBA = \begin{cases} 1 & \text{ if teacher has a BA} \\ 0 & \text{ if teacher does not have a BA (i.e. has MA)} \end{cases}$$` -- - More generally, indicator variables are .vocab[binary] variables that take the value 1 when a condition is true, and 0 otherwise --- ### Categorical predictor with 2 levels (cont.) - The .vocab[baseline] level or category is the level of the variable when all indicators are 0 -- `$$\widehat{baseSalary} = 56610 - 352\times degreeBA$$` - .question[What is the baseline level in our model?] -- - Whenever there is a categorical variable in the model, we *always* interpret the coefficients with respect to the baseline! --- ### Categorical predictor with 2 levels (cont.) `$$\widehat{baseSalary} = \color{blue}{56610} \color{orange}{- 352}\times degreeBA$$` - Intercept `\(\beta_{0}\)`: Teachers with an MA are expected, on average, to have a base salary of `\(\color{blue}{56610}\)` dollars - Plug in 0 for `degreeBA` and interpret accordingly -- - `\(\beta_{1}\)`: Teachers with a BA are expected, on average, to have a base salary `\(\color{orange}{352}\)` dollars *less* than teachers with an MA - Notice that we explicitly compare to the baseline! --- ## Regression lines by `degree` `$$\widehat{baseSalary} = \color{blue}{56610} \color{orange}{- 352}\times degreeBA$$` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> --- ### Categorical predictors with more than 2 levels .midi[ ```r mod_grade <- lm(base ~ grade, data = teacher) tidy(mod_grade) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 58985. 2195. 26.9 6.16e-38 ## 2 gradehigh -1448. 2920. -0.496 6.22e- 1 ## 3 grademiddle -5790. 2945. -1.97 5.34e- 2 ``` ] .question[ What do these rows correspond to? Why are there only two `grade` levels listed, when we know there are three in total? ] --- ### Categorical predictors with more than 2 levels .midi[ ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 58985. 2195. 26.9 6.16e-38 ## 2 gradehigh -1448. 2920. -0.496 6.22e- 1 ## 3 grademiddle -5790. 2945. -1.97 5.34e- 2 ``` ] - When the categorical explanatory variable has many levels, the levels are encoded as .vocab[dummy variables] - Each coefficient describes the expected difference in `base` salary between that particular `grade` level and the baseline level. --- ## How dummy variables are made ``` ## # A tibble: 3 × 4 ## # Groups: grade [3] ## grade elementary middle high ## <chr> <int> <int> <int> ## 1 elementary 1 0 0 ## 2 high 0 0 1 ## 3 middle 0 1 0 ``` --- class: middle, center ## Live code: Exercises 6-9
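---

## Aside: inspecting the dummy encoding

If you want to see the dummy variables R builds for you, one option is base R's `model.matrix()`, which constructs the design matrix that `lm()` uses under the hood. A minimal sketch (not code from the exercises):

```r
# One row per teacher: an intercept column plus one dummy
# column for each non-baseline level of grade
model.matrix(base ~ grade, data = teacher) %>%
  head(3)
```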