class: center, middle, inverse, title-slide

.title[
# Multiple linear regression
]
.author[
### Becky Tang
]
.date[
### 11/30/2022
]

---

## Housekeeping

- HW 08 due tonight at 11:59pm!

- Sign up for meetings regarding your project via the Calendly link found on the project description!

---

class: center, middle

## Review

---

## Vocabulary

- .vocab[Response variable]: Variable whose behavior or variation you are trying to understand.

- .vocab[Explanatory variables]: Other variables that you want to use to explain the variation in the response.

- .vocab[Residuals]: Show how far each case is from its predicted value
  - **Residual = Observed value - Predicted value**

---

## The linear model with a single predictor

- We're interested in the `\(\beta_0\)` (population parameter for the intercept) and the `\(\beta_1\)` (population parameter for the slope) in the following model:

$$ \hat{y} = \beta_0 + \beta_1~x $$

--

- Unfortunately, we can't observe these population values directly

- So we use sample statistics to estimate them:

$$ \hat{y} = b_0 + b_1~x $$

---

## Data and Packages

```r
library(tidyverse)
library(broom)
```

```
## Rows: 60
## Columns: 4
## $ mileage <dbl> 21500, 43000, 19900, 36000, 44000, 49800, 1300, 670, 13400, 97…
## $ price   <dbl> 69400, 56900, 49900, 47400, 42900, 36900, 83000, 72900, 69900,…
## $ age     <dbl> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, 10, 3,…
## $ type    <chr> "Porsche", "Porsche", "Porsche", "Porsche", "Porsche", "Porsch…
```

The data set contains prices for Porsche and Jaguar cars for sale on cars.com.
.vocab[`type`]: car make (Jaguar or Porsche)

.vocab[`price`]: price in USD

.vocab[`age`]: age of the car in years

.vocab[`mileage`]: previous miles driven

---

## Single numerical predictor

```r
mod_pr_age <- lm(price ~ age, data = sports_car_prices)
tidy(mod_pr_age)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   53246.     3322.     16.0  5.70e-23
## 2 age           -2149.      466.     -4.62 2.22e- 5
```

`$$\widehat{price} = 53246 - 2149 ~age_{years}$$`

---

## Single categorical predictor (2 levels)

```r
mod_pr_type <- lm(price ~ type, data = sports_car_prices)
tidy(mod_pr_type)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   31957.     2954.     10.8  1.56e-15
## 2 typePorsche   18580      4178.      4.45 4.00e- 5
```

`$$\widehat{price} = 31956.67 + 18580 ~typePorsche$$`

---

## Single categorical predictor (2 levels)

`$$\widehat{price} = 31956.67 + 18580 ~typePorsche$$`

What is the average price of a Porsche?

--

For Porsches, `\(typePorsche = 1\)`. So the average price of Porsches is

`$$\widehat{price} = 31956.67 + 18580 \times 1 = 50536.67$$`

---

class: center, middle

## The linear model with multiple predictors

---

## The linear model with multiple predictors

- Population model:

`$$\hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_p~x_p$$`

where `\(p\)` is the number of explanatory variables.

--

- Sample model that we use to estimate the population model:

`$$\hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_p~x_p$$`

---

## Price and age

<img src="18-mlr-intro_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Price vs. age and make

.question[
Can we simultaneously model the relationships between the age and make of a used car and its price?
]

<img src="18-mlr-intro_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

### Modeling with multiple predictors

.question[What is the linear regression model for `price` that uses both `age` and `type` of the car as predictors?]

`$$\widehat{price} = \beta_{0} + \beta_{1}~ age + \beta_{2}~type$$`

--

- Our estimated linear regression model:

```r
m_main <- lm(price ~ age + type, data = sports_car_prices)
m_main %>%
  tidy() %>%
  select(term, estimate)
```

```
## # A tibble: 3 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   44310.
## 2 age           -2487.
## 3 typePorsche   21648.
```

--

.midi[
$$ \widehat{price} = 44310 - 2487~age + 21648~typePorsche $$
]

---

### Different lines for each level

.alert[
$$ \widehat{price} = 44310 - 2487~age + 21648~typePorsche $$
]

- What is the linear model for Porsches? Plug in 1 for `typePorsche`:

`$$\begin{align}\widehat{price}_{porsche} &= 44310 - 2487~age + 21648 \times 1 \\ &= 65958 - 2487~age\\\end{align}$$`

--

- What is the linear model for Jaguars? Plug in 0 for `typePorsche`:

`$$\begin{align}\widehat{price}_{jaguar} &= 44310 - 2487~age + 21648 \times 0\\ &= 44310 - 2487~age\\\end{align}$$`

---

### Different lines for each level (cont.)

.alert[
**Jaguar**

`$$\begin{align}\widehat{price}_{jaguar} = 44310 - 2487~age\\\end{align}$$`

**Porsche**

`$$\begin{align}\widehat{price}_{porsche} = 65958 - 2487~age\\\end{align}$$`
]

- Rate of change in price as the age of the car increases does not depend on make of car (.vocab[same coefficient for age!])

- Porsches are consistently more expensive than Jaguars (.vocab[different intercepts])

---

### Different lines for each level (cont.)

<img src="18-mlr-intro_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Interpretation

```
## # A tibble: 3 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   44310.
## 2 age           -2487.
## 3 typePorsche   21648.
```

.alert[
$$ \widehat{price} = 44310 - 2487~age + 21648~typePorsche $$
]

--

- **All else held constant**, for each additional year of a car's age, the price of the car is predicted to *decrease*, on average, by $2,487.

--

- **All else held constant**, Porsches are predicted, on average, to have a price that is $21,648 greater than Jaguars.

--

- Jaguars that are new (`age` = 0) are predicted, on average, to have a price of $44,310.
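
---

### Checking the interpretation in R

As a rough check of the three statements above, the fitted equation can be evaluated by hand. This is only a sketch: it hard-codes the rounded estimates from the `tidy()` output, and `predict_price()` is a hypothetical helper written for this slide, not a function from `broom` or base R. With the fitted model in hand, `predict(m_main, newdata = ...)` should give similar values, up to rounding.

```r
# Rounded estimates copied from the tidy() output (assumed exact for this sketch)
b0        <- 44310   # intercept: predicted price of a new Jaguar
b_age     <- -2487   # change in predicted price per additional year of age
b_porsche <- 21648   # Porsche vs. Jaguar difference at any fixed age

# Hypothetical helper: evaluates the main-effects equation by hand
predict_price <- function(age, type) {
  type_porsche <- as.numeric(type == "Porsche")  # indicator: 1 = Porsche, 0 = Jaguar
  b0 + b_age * age + b_porsche * type_porsche
}

predict_price(0, "Jaguar")                                # new Jaguar
predict_price(5, "Porsche") - predict_price(5, "Jaguar")  # gap between makes at age 5
predict_price(6, "Jaguar") - predict_price(5, "Jaguar")   # effect of one more year
```

The three calls should return the intercept, the `typePorsche` coefficient, and the `age` coefficient, in that order. The gap between makes is the same at every age, and the yearly change is the same for both makes, because this model has no interaction term.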