class: center, middle, inverse, title-slide .title[ # Linear regression ] .subtitle[ ## Conditions + Fit ] .author[ ### Becky Tang ] .date[ ### 11/16/2022 ] --- layout: true --- ## Data: Old Faithful - We have data about the Old Faithful geyser in Yellowstone National Park - `eruptions`: Eruption time (in mins) - `waiting`: Waiting time to next eruption (in mins) ```r faithful %>% slice(1:3) ``` ``` ## eruptions waiting ## 1 3.600 79 ## 2 1.800 54 ## 3 3.333 74 ``` --- ## EDA .question[How would you describe the relationship between eruption time and waiting time for Old Faithful?] <img src="17-linreg-conditions_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- ## Linear regression model ```r mod_faithful <- lm(waiting ~ eruptions, data = faithful) tidy(mod_faithful) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 33.5 1.15 29.0 7.14e- 85 ## 2 eruptions 10.7 0.315 34.1 8.13e-100 ``` .question[How would you interrupt the intercept and slope?] --- ## Tidy regression output Achieved with functions from the `broom` package: - .vocab[`tidy`]: Constructs a data frame that summarizes the model's statistical findings. We've talked about coefficient estimates and standard errors, but it also displays *test statistics and p-values* - .vocab[`augment`]: Adds columns to the original data that was modeled. This includes predictions and residuals. - .vocab[`glance`]: Constructs a concise one-row summary of the model, computed once for the entire model. --- ## Tidy your model's statistical findings ```r tidy(mod_faithful) %>% select(term, estimate) %>% mutate(estimate = round(estimate, 3)) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 33.5 ## 2 eruptions 10.7 ``` --- class: center, middle ## Model diagnostics --- ## Conditions - **Linearity**: The relationship between response and predictor(s) is linear - Independence: The residuals are independent - **Normality**: The residuals are nearly normally distributed - Equal Variance: The residuals have constant variance .alert[We will just focus on linearity and normality in this class.] --- ## Conditions - .vocab[L]**inearity**: The relationship between response and predictor(s) is linear - .vocab[I]ndependence: The residuals are independent - .vocab[N]**ormality**: The residuals are nearly normally distributed - .vocab[E]qual Variance: The residuals have constant variance --- ## Examples <img src="img/19/violated_conds.png" width="70%" style="display: block; margin: auto;" /> - Top row: scatterplot of `\(x\)` vs `\(y\)` - Bottom row: residual plot --- ## `augment` data with model results - `.fitted`: Predicted value of the response variable `\((\hat{y})\)` - `.resid`: Residuals `\((y - \hat{y})\)` .midi[ ```r *mod_faithful_aug <- augment(mod_faithful) mod_faithful_aug %>% slice(1:3) ``` ``` ## # A tibble: 3 × 8 ## waiting eruptions .fitted .resid .hat .sigma .cooksd .std.resid ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 79 3.6 72.1 6.90 0.00371 5.91 0.00254 1.17 ## 2 54 1.8 52.8 1.21 0.0117 5.92 0.000253 0.206 ## 3 74 3.33 69.2 4.76 0.00374 5.92 0.00122 0.807 ``` ] -- We will use the fitted values (`.fitted`) and residuals (`.resid`) to check the conditions by constructing .vocab[diagnostic plots]. --- ### Residuals vs fitted plot Use to check .vocab[**L**inearity] and .vocab[**E**qual variance.] .midi[ ```r ggplot(mod_faithful_aug, mapping = aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) + labs(x = "Predicted waiting time (min.)", y = "Residuals") ``` <img src="17-linreg-conditions_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> ] --- ### Residuals vs fitted plot .alert[ This is a good example! We don't see any patterns in the residuals. ] .midi[ <img src="17-linreg-conditions_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> ] --- ### Histogram of residuals Use to check .vocab[**N**ormality] .midi[ ```r ggplot(mod_faithful_aug, mapping = aes(x = .resid)) + geom_histogram(bins = 15) + labs(x = "Residuals") ``` <img src="17-linreg-conditions_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] --- ### Histogram of residuals .alert[ This is a good example! We want to see symmetric, bell-shaped curve here. ] <img src="17-linreg-conditions_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- class: center, middle ## Assessing quality of model fit --- ## Assessing the quality of the fit - The strength of the fit of a linear model is commonly evaluated using `\(R^2\)`, sometimes called the .vocab[coefficient of determination] - It tells us what percentage of the variability in the response variable is explained by the model. The remainder of the variability is unexplained. - The `\(R^2\)` is always between 0 and 100 (or 0 and 1) -- .question[ What does "explained variability in the response variable" mean? ] -- .question[What does an `\(R^2 = 1\)` tell us? What about an `\(R^2 = 0\)`?] --- ## Obtaining `\(R^2\)` in R We can use the `glance()` function to obtain the `\(R^2\)` of our model: ```r glance(mod_faithful) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.811 0.811 5.91 1162. 8.13e-100 1 -868. 1743. 1754. 9443. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod_faithful) %>% pull(r.squared) ``` ``` ## [1] 0.8114608 ``` -- - Interpretation: about 81.1% of the variability in Old Faithful's `waiting` times can be explained by the previous `eruption` time. --- class: center, middle ## Your turn! --- ### Ames housing prices - This data set contains information about homes sold in Ames, Iowa from 2006 to 2010 - We will focus today on two variables: - `price`: sale price in USD - `area`: Above grade (ground) living area square feet