Linear regression

.title[
# Linear regression
]
.subtitle[
## Conditions + Fit
]
.author[
### Becky Tang
]
.date[
### 11/16/2022
]

---

---

## Data: Old Faithful

- We have data about the Old Faithful geyser in Yellowstone National Park

- `eruptions`: Eruption time (in mins)
  - `waiting`: Waiting time to next eruption (in mins)

```r
faithful %>%
  slice(1:3)
```

```
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
```

---

## EDA

.question[How would you describe the relationship between eruption time and waiting time for Old Faithful?]

---

## Linear regression model

```r
mod_faithful <- lm(waiting ~ eruptions, data = faithful)
tidy(mod_faithful)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     33.5     1.15       29.0 7.14e- 85
## 2 eruptions       10.7     0.315      34.1 8.13e-100
```

---

## Tidy regression output

Achieved with functions from the `broom` package:

- .vocab[`tidy`]: Constructs a data frame that summarizes the model's statistical findings. We've talked about coefficient estimates and standard errors, but it also displays *test statistics and p-values*

- .vocab[`augment`]: Adds columns to the original data that was modeled. This includes predictions and residuals.

- .vocab[`glance`]: Constructs a concise one-row summary of the model, computed once for the entire model.

---

## Tidy your model's statistical findings

```r
tidy(mod_faithful) %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
```

```
## # A tibble: 2 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)     33.5
## 2 eruptions       10.7
```
 
---

## Model diagnostics

---

## Conditions

- **Linearity**: The relationship between response and predictor(s) is linear

- Independence: The residuals are independent

- **Normality**: The residuals are nearly normally distributed

- Equal Variance: The residuals have constant variance

---

## Conditions

- .vocab[L]**inearity**: The relationship between response and predictor(s) is linear

- .vocab[I]ndependence: The residuals are independent

- .vocab[N]**ormality**: The residuals are nearly normally distributed

- .vocab[E]qual Variance: The residuals have constant variance

---

## Examples

- Top row: scatterplot of `\(x\)` vs `\(y\)`

- Bottom row: residual plot

---

## `augment` data with model results

- `.fitted`: Predicted value of the response variable `\((\hat{y})\)`
- `.resid`: Residuals `\((y - \hat{y})\)`

```r
*mod_faithful_aug <- augment(mod_faithful)
mod_faithful_aug %>%
  slice(1:3)
```

```
## # A tibble: 3 × 8
##   waiting eruptions .fitted .resid    .hat .sigma  .cooksd .std.resid
##     <dbl>     <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>      <dbl>
## 1      79      3.6     72.1   6.90 0.00371   5.91 0.00254       1.17 
## 2      54      1.8     52.8   1.21 0.0117    5.92 0.000253      0.206
## 3      74      3.33    69.2   4.76 0.00374   5.92 0.00122       0.807
```
]

We will use the fitted values (`.fitted`) and residuals (`.resid`) to check the conditions by 
constructing .vocab[diagnostic plots].

---

### Residuals vs fitted plot

Use to check .vocab[**L**inearity] and .vocab[**E**qual variance.]

```r
ggplot(mod_faithful_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() + geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
  labs(x = "Predicted waiting time (min.)", y = "Residuals") 
```

<img src="17-linreg-conditions_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]

---

### Residuals vs fitted plot

.midi[
<img src="17-linreg-conditions_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]

---

### Histogram of residuals

Use to check .vocab[**N**ormality]

```r
ggplot(mod_faithful_aug, mapping = aes(x = .resid)) +
  geom_histogram(bins = 15) + labs(x = "Residuals")
```

<img src="17-linreg-conditions_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]

---

### Histogram of residuals

---

## Assessing quality of model fit

---

## Assessing the quality of the fit

- The strength of the fit of a linear model is commonly evaluated using `\(R^2\)`, sometimes called the .vocab[coefficient of determination]

- It tells us what percentage of the variability in the response variable is explained by the model. The remainder of the variability is unexplained.

- The `\(R^2\)` is always between 0 and 100 (or 0 and 1)
  
--

---

## Obtaining `\(R^2\)` in R

We can use the `glance()` function to obtain the `\(R^2\)` of our model:

```r
glance(mod_faithful)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squ…¹ sigma stati…²   p.value    df logLik   AIC   BIC devia…³
##       <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.811       0.811  5.91   1162. 8.13e-100     1  -868. 1743. 1754.   9443.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹adj.r.squared, ²statistic, ³deviance
```

```r
glance(mod_faithful) %>%
  pull(r.squared)
```

```
## [1] 0.8114608
```

- Interpretation: about 81.1% of the variability in Old Faithful's `waiting` times  can be explained by the previous `eruption` time.

---

## Your turn!

---

### Ames housing prices

- This data set contains information about homes sold in Ames, Iowa from 2006 to 2010

- We will focus today on two variables:

- `price`: sale price in USD
  
  - `area`: Above grade (ground) living area square feet