HW 04: Examining smoking and health outcomes

due Tuesday, 10/11 at 11:59pm

Introduction

The goal of today’s homework is to practice visualizing and calculating probabilities using the tidyverse.

Set-up

Clone the hw-04-smoking assignment repo into GitHub desktop. Then open the .Rmd file in RStudio sto get started!

Packages

In this lab we will work with the tidyverse and mosaicData packages.

You may need to install the mosaicData package!

library(tidyverse) 
library(mosaicData) 

Note that these packages are also loaded in your R Markdown document.

The data

Today’s data comes from a study of conducted in Whickham, England. In this study, the researchers recorded each participant’s age, smoking status at the start of the study, and their health outcome 20 years later.

The data is in the mosaicData package. You can load it with

data(Whickham)

Take a peek at the codebook with

?Whickham

Good coding practice

Please follow these coding practices in this homework and all coding moving forward. Your code style will be assessed according to the following guidelines:

Exercises

Don’t forget to commit often!

  1. How many observations are in this dataset? What does each observation represent?

  2. How many variables are in this dataset? What type of variable is each? Display each variable using an appropriate visualization.

  3. What would you expect the relationship between smoking status and health outcome to be?

  4. Create a visualization depicting the relationship between smoking status and health outcome.

  5. Calculate the conditional probabilities of death for each smoking status, only reporting probabilities for the outcome of Dead. Briefly describe the relationship, and evaluate whether or not it is what you expected. Use the visualization from the previous exercise and the conditional probabilities to support your narrative.

  6. Create a new variable called age_cat using the following scheme:

    • age <= 44 ~ "18-44"
    • age > 44 & age <= 64 ~ "45-64"
    • age > 64 ~ "65+"
  7. Re-create the visualization from Exercise 4, this time faceting by age_cat.

  8. Extend the table from Exercise 5 by breaking it down by age category.

  9. Compare the visualization from Exercise 7 and the table from Exercise 8 to what you previously observed in Exercises 4 and 5. What changed, and what might explain the change? Use the table you calculated in Exercise 8 to support your narrative.

Submission

Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Please upload your PDF document to Canvas.



This lab was adapted from Data Science in a Box.