class: center, middle, inverse, title-slide .title[ # Introducing linear models ] .author[ ### Becky Tang ] .date[ ### 11/11/2022 ] --- layout: true <div class="my-footer"> <span> <a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a> </span> </div> --- ## Housekeeping - HW 07 due Tuesday, 11/15 at 11:59pm - Project proposals are due to Canvas this Friday, 11/18 at 11:59pm - I will send out an e-mail where you and your partner should sign up for a brief Zoom meeting with me on 11/28 --- class: middle, center ## The language of models --- ## Modeling - We use models to... - understand relationships - assess differences - make predictions - We will focus on .vocab[linear] models, but there are many other types. --- class: center, middle # Data: Teacher salaries --- ## Teacher salaries This data set contains 2009-2010 salaries for 71 teachers employed by the St. Louis Public Schools in Michigan. ```r glimpse(teacher) ``` ``` ## Rows: 71 ## Columns: 9 ## $ id <fct> 01, 02, 03, 04, 05, 06, 07, 08, 09, 11, 12, 13, 14, 15, 16,… ## $ degree <fct> BA, MA, MA, BA, BA, BA, BA, MA, BA, BA, BA, BA, BA, BA, MA,… ## $ fte <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ years <dbl> 5.0, 15.0, 16.0, 10.0, 26.0, 28.5, 12.0, 32.0, 25.0, 12.0, … ## $ base <int> 45388, 60649, 60649, 54466, 65360, 65360, 58097, 68230, 653… ## $ fica <dbl> 3472.18, 4639.65, 4639.65, 4166.65, 5000.04, 5000.04, 4444.… ## $ retirement <dbl> 7688.73, 10273.94, 10273.94, 9226.54, 11071.98, 11071.98, 9… ## $ total <dbl> 56548.91, 75562.59, 75562.59, 67859.19, 81432.02, 81432.02,… ## $ grade <chr> "high", "middle", "elementary", "middle", "middle", "elemen… ``` --- ## Data - `degree`: highest educational degree attained (`"BA"` or `"MA"`) - `fte`: full-time equivalent status (1 = full-time, 0.5 = part-time) - `years`: number of years employed by the school district - `base`: base annual salary, in dollars - `fica`: amount paid into Social Security and Medicare per year, in dollars - `retirement`: amount paid into the teacher's retirement fund per year, in dollars - `total`: total annual salary of the teacher, i.e. the sum of `base` + `fica` + `retirement`, in dollars --- class: center, middle ## Modeling the relationship between variables --- ## Describe the distribution of `base` .midi[ ```r ggplot(data = teacher, aes(x = base)) + geom_histogram(bins = 20) + labs(title = "Distribution of base salary (in $)") ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] --- ## Years and Degree .pull-left[ ```r ggplot(data = teacher, aes(x = years)) + geom_histogram(bins = 20) + theme(text = element_text(size = 20)) ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r ggplot(data = teacher, aes(x = degree)) + geom_bar() + theme(text = element_text(size = 20)) ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] --- ## Models as functions - We can represent relationships between variables using **functions** - A .vocab[function] in the *mathematical* sense is the relationship between one or more inputs and an output created from those inputs. - Plug in the inputs and receive back the output -- - The formula `\(y = 3x + 7\)` is a function with input `\(x\)` and output `\(y\)`, as sketched below
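A minimal sketch of this same function written in R (the name `f` is just for illustration, not part of any package):

```r
# The function y = 3x + 7: takes the input x, returns the output y
f <- function(x) {
  3 * x + 7
}
```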
- When `\(x\)` is `\(5\)`, the output `\(y\)` is `\(22\)` ``` y = 3 * 5 + 7 = 22 ``` -- - When `\(x\)` is `\(12\)`, the output `\(y\)` is `\(43\)` ``` y = 3 * 12 + 7 = 43 ``` --- ## Visualizing the linear model .midi[ ```r ggplot(data = teacher, aes(x = years, y = base)) + geom_point() + ggtitle("Base salary vs years") + * geom_smooth(method = "lm") ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] --- ## Visualizing the linear model .midi[ ```r ggplot(data = teacher, aes(x = years, y = base)) + geom_point() + ggtitle("Base salary vs years") + * geom_smooth(method = "lm", se = FALSE, * col = "red", lty = 2, lwd = 1) ``` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> ] --- ## Vocabulary - .vocab[Response variable]: Variable whose behavior or variation you are trying to understand, on the y-axis. Also called the **dependent variable**, and referred to as `\(y\)`. For linear regression, the response is quantitative. - .vocab[Explanatory variables]: Other variables that you want to use to explain the variation in the response, on the x-axis. Also called **independent variables**, **predictors**, or **features**. Each explanatory variable is referred to using an `\(x\)`. -- .question[What might be the dependent and explanatory variables in our `teacher` data?] -- - .vocab[Predicted value]: Output of the **model function** - The model function gives the typical value of the response variable *conditional* on the explanatory variables (what does this mean?) --- ## Data .pull-left[ ```r teacher %>% dplyr::select(base, years) %>% slice(1:5) ``` ``` ## # A tibble: 5 × 2 ## base years ## <int> <dbl> ## 1 45388 5 ## 2 60649 15 ## 3 60649 16 ## 4 54466 10 ## 5 65360 26 ``` ] .pull-right[ ```r teacher %>% dplyr::select(base, years) %>% rename("y" = 1, "x" = 2) %>% slice(1:5) ``` ``` ## # A tibble: 5 × 2 ## y x ## <int> <dbl> ## 1 45388 5 ## 2 60649 15 ## 3 60649 16 ## 4 54466 10 ## 5 65360 26 ``` ] --- ## The linear model with a single predictor - .vocab[Linear regression] is the statistical method for fitting a line to data where the relationship between a predictor `\(x\)` and a quantitative response `\(y\)` can be modeled by a straight line with some error `\(\epsilon\)`: `$$y = \beta_{0} + \beta_{1}x + \epsilon$$` - For this class, we will not worry about the `\(\epsilon\)` too much. But we should remember there is always a little bit of error! - When we ignore `\(\epsilon\)`, we replace `\(y\)` with `\(\hat{y}\)` --- ## The linear model with a single predictor (cont.) - Thus, we have the following model: `$$\hat{y} = \beta_0 + \beta_1~x$$` where `\(\beta_1\)` is often called the .vocab[coefficient] or .vocab[slope] for the explanatory variable `\(x\)` -- - Unfortunately, we cannot observe the true values of `\(\beta_0\)` and `\(\beta_1\)` -- - So we use sample statistics to estimate them: `$$\hat{y} = b_0 + b_1~x$$` --- ## Vocabulary - .vocab[Residuals]: Show how far each case `\(i\)` is from its predicted value - **Residual = Observed value - Predicted value = `\(y_{i} - \hat{y}_{i}\)`** - A residual tells us how far above/below the model function each case is --- .question[ What does a positive residual mean? Which salaries on the plot have positive residuals: those below or above the line? ]
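One way to see the answer is to compute a residual by hand. A small sketch of the arithmetic, using the example point highlighted on the upcoming slides (observed and predicted salary at years = 13):

```r
# Residual = observed - predicted
y_obs <- 61216     # true base salary when years = 13
y_hat <- 54108.45  # predicted base salary from the fitted line
y_obs - y_hat      # 7107.55: positive, so this point lies above the line
```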
<img src="16-intro-linear-models_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## Residuals <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ## Residuals - True salary when years = 13: 61216 - Predicted salary when years = 13: 54108.45 <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- ## Residuals <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- ## Residuals - True salary when years = 34: 65360 - Predicted salary when years = 34: 70973.64 <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- ## Models - upsides and downsides - Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data. - There is a real risk, however, that a model imposes structure on the scatter of data that is not really there, just as people imagine animal shapes in the stars. A skeptical approach is always warranted. --- ## Variation around the model... is just as important as the model, if not more! .question[ *Statistics is the explanation of variation in the context of what remains unexplained.* ] - The scatter suggests that there might be other factors that account for large parts of the salary-to-salary variability, or perhaps just that randomness plays a big role. - Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.) --- ## How do we use models? 1. .vocab[Explanation:] Characterize the relationship between `\(y\)` and `\(x\)` via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables 2. .vocab[Prediction:] Plug in `\(x\)`, get the predicted `\(y\)` --- class: middle, center ## Interpreting Models --- ## Packages .pull-left[   ] .pull-right[ - You're familiar with the tidyverse: ```r library(tidyverse) ``` - The **broom** package takes the messy output of built-in R functions, such as `lm()`, and turns it into tidy data frames. ```r library(broom) ``` ] --- ## broom .pull-left[ .middle[  ] ] .pull-right[ - **broom** follows tidyverse principles and tidies up regression output - **`tidy`**: Constructs a tidy data frame summarizing a model's statistical findings - **`glance`**: Constructs a concise one-row summary of the model - **`augment`**: Adds columns (e.g. predictions, residuals) to the original data that was modeled ] [https://broom.tidyverse.org/](https://broom.tidyverse.org/) --- ## Interpreting the slope - Recall the linear regression model: `$$\hat{\color{purple}{y}} = \beta_{0} + \beta_{1}x$$` -- - Let's increase `\(x\)` by 1, i.e. plug `\((x+1)\)` into the equation: $$ `\begin{align*} \beta_{0} + \beta_{1}(x+1) &= \beta_{0} + \beta_{1}x + \beta_{1}\\ &= (\beta_{0} + \beta_{1}x) + \beta_{1}\\ &= \hat{\color{purple}{y}} + \beta_{1} \end{align*}` $$ -- - .question[What is the interpretation of the slope?] --- ## Interpreting the intercept - Recall the linear regression model: `$$\hat{\color{purple}{y}} = \beta_{0} + \beta_{1}x$$` -- - Let's set `\(x=0\)`: `$$\hat{\color{purple}{y}} = \beta_{0} + \beta_{1}\times 0 = \beta_{0}$$` -- - .question[What is the interpretation of the intercept?]
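A quick numeric check of the algebra on these two slides, using made-up coefficients (`b0 = 10` and `b1 = 3` are illustrative values only, not estimates from our data):

```r
b0 <- 10                         # illustrative intercept
b1 <- 3                          # illustrative slope
y_hat <- function(x) b0 + b1 * x
y_hat(6) - y_hat(5)              # = 3: increasing x by 1 changes y-hat by exactly b1
y_hat(0)                         # = 10: at x = 0, y-hat equals the intercept b0
```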
--- ## Years and base salary .midi[ ```r mod_years <- lm(base ~ years, data = teacher) tidy(mod_years) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 43668. 1054. 41.4 1.79e-50 ## 2 years 803. 55.0 14.6 8.88e-23 ``` ] -- `$$\widehat{baseSalary} = \color{blue}{43668} + \color{orange}{803}\times years$$` -- - .vocab[Slope]: For each additional year a teacher is employed at the school district, the base salary is expected to be *higher*, on average, by `\(\color{orange}{803}\)` dollars. -- - .vocab[Intercept]: Teachers who have worked 0 years in the district are expected to have a base salary of `\(\color{blue}{43668}\)` dollars, on average. - Does this make sense? -- - Why is there a "hat" on the response variable? --- ## Least squares regression Why these coefficients? In other words, why this particular line? The regression line minimizes the sum of squared residuals. -- - .vocab[Residuals]: `\(e_i = y_i - \hat{y}_i\)` - The regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`. - Equivalently, it minimizes `\(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\)` .question[ Why do we minimize the *squares* of the residuals? ] --- ## Visualizing residuals <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> Check out the applet [here](http://www.rossmanchance.com/applets/RegShuffle.htm) to see this process in action. --- class: middle, center ## Live code: Exercises 1-3 --- class: middle, center ## Correlation does not imply causation!! Remember this when interpreting model coefficients --- class: center, middle # Prediction with models --- ## Predict base salary from years .question[ On average, how much does a teacher in this district make if they have worked for 15 years?] -- - Recall our fitted model: `$$\widehat{baseSalary} = 43668.1 + 803.1\times years$$` -- - Let's just plug in 15 for `\(years\)`: ```r 43668.1 + 803.1 * 15 ``` ``` ## [1] 55714.6 ``` - "On average, we expect a teacher who has been employed for 15 years to have a base salary of $55714.60." - **Warning:** We "expect" this to happen, but there will be some variability. --- ## Prediction vs. extrapolation .question[ On average, what is the base salary for someone who has worked 80 years? ] <img src="16-intro-linear-models_files/figure-html/extrapolate-1.png" style="display: block; margin: auto;" /> --- ## Watch out for extrapolation! > "When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on."<sup>1</sup> <br> Stephen Colbert, April 6th, 2010 .footnote[[1] "Extrapolation is treacherous." OpenIntro Statistics.] --- class: middle, center ## Live code: Exercises 4-5 --- class: middle, center ## Categorical Predictors --- ## What about non-continuous predictors?
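Let's peek at the two categorical variables in our data. Below is a sketch of the kind of `dplyr` call that would produce this output (the first 8 teachers' `degree` and `grade`):

```r
# Look at degree and grade for the first 8 teachers
teacher %>%
  dplyr::select(degree, grade) %>%
  slice(1:8)
```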
``` ## # A tibble: 8 × 2 ## degree grade ## <fct> <chr> ## 1 BA high ## 2 MA middle ## 3 MA elementary ## 4 BA middle ## 5 BA middle ## 6 BA elementary ## 7 BA high ## 8 MA high ``` --- ### Categorical predictor with 2 levels `base` salary regressed on `degree` ```r mod_degree <- lm(base ~ degree, data = teacher) tidy(mod_degree) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 56610. 1777. 31.8 5.44e-43 ## 2 degreeBA -352. 2398. -0.147 8.84e- 1 ``` -- <br> `$$\widehat{baseSalary} = 56610 - 352\times degreeBA$$` .question[What is happening here?] --- ### Indicator variables - Under the hood, `R` has actually converted the categorical variable `degree` into an .vocab[indicator] variable called `\(degreeBA\)` `$$degreeBA = \begin{cases} 1 & \text{ if teacher has a BA} \\ 0 & \text{ if teacher does not have a BA (i.e. has MA)} \end{cases}$$` -- - More generally, indicator variables are .vocab[binary] variables that take the value 1 when a condition is true, and 0 otherwise --- ### Categorical predictor with 2 levels (cont.) - The .vocab[baseline] level or category is the level of the variable when all indicators are 0 -- `$$\widehat{baseSalary} = 56610 - 352\times degreeBA$$` - .question[What is the baseline level in our model?] -- - Whenever there is a categorical variable in the model, we *always* interpret the coefficients with respect to the baseline! --- ### Categorical predictor with 2 levels (cont.) `$$\widehat{baseSalary} = \color{blue}{56610} \color{orange}{- 352}\times degreeBA$$` - Intercept `\(\beta_{0}\)`: Teachers with an MA are expected, on average, to have a base salary of `\(\color{blue}{56610}\)` dollars - Plug in 0 for `degreeBA` and interpret accordingly -- - `\(\beta_{1}\)`: Teachers with a BA are expected, on average, to have a base salary `\(\color{orange}{352}\)` dollars *less* than teachers with an MA - Notice that we explicitly compare to the baseline! --- ## Regression lines by `degree` `$$\widehat{baseSalary} = \color{blue}{56610} \color{orange}{- 352}\times degreeBA$$` <img src="16-intro-linear-models_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> --- ### Categorical predictors with more than 2 levels .midi[ ```r mod_grade <- lm(base ~ grade, data = teacher) tidy(mod_grade) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 58985. 2195. 26.9 6.16e-38 ## 2 gradehigh -1448. 2920. -0.496 6.22e- 1 ## 3 grademiddle -5790. 2945. -1.97 5.34e- 2 ``` ] .question[ What do these rows correspond to? Why are there only two `grade` levels listed, when we know there are three in total? ] --- ### Categorical predictors with more than 2 levels .midi[ ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 58985. 2195. 26.9 6.16e-38 ## 2 gradehigh -1448. 2920. -0.496 6.22e- 1 ## 3 grademiddle -5790. 2945. -1.97 5.34e- 2 ``` ] - When the categorical explanatory variable has many levels, the levels are encoded as .vocab[dummy variables] - Each coefficient describes the expected difference in `base` salary between that particular `grade` level and the baseline level. --- ## How dummy variables are made ``` ## # A tibble: 3 × 4 ## # Groups: grade [3] ## grade elementary middle high ## <chr> <int> <int> <int> ## 1 elementary 1 0 0 ## 2 high 0 0 1 ## 3 middle 0 1 0 ``` --- class: middle, center ## Live code: Exercises 6-9
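---

## Aside: inspecting the dummy encoding

If you want to see the dummy variables R builds for you, one option is base R's `model.matrix()`, which constructs the design matrix that `lm()` uses under the hood. A minimal sketch (not code from the exercises):

```r
# One row per teacher: an intercept column plus one dummy
# column for each non-baseline level of grade
model.matrix(base ~ grade, data = teacher) %>%
  head(3)
```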