Introduction

Life expectancy is a key metric for assessing population health, as it takes into account both infant and elderly mortality. It can be used to compare countries, with the generally accepted belief that a higher life expectancy is associated with a higher quality of life. However, the inequality of life expectancy is still very large across and within countries. We will be examining data about life expectancy across countries. In particular, you will analyze data from the World Health Organization using linear regression models in order to learn what factors are important for determining a country’s average life expectancy in a given year.

The data were modified from this Kaggle dataset. According to the website, “the Global Health Observatory (GHO) data repository under World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The datasets are made available to public for the purpose of health data analysis. The dataset related to life expectancy, health factors for 193 countries has been collected from the same WHO data repository website and its corresponding economic data was collected from United Nation website. Among all categories of health-related factors only those critical factors were chosen which are more representative.” However, due to missing data for certain countries, they ultimately decided to remove ten countries from the final dataset. Thus the resultant dataset contains information about 183 countries.

Data and Packages

⊕The broom package is installed with the tidyverse, but we need to load it separately in order to make use of it.

In this lab we will work with the tidyverse and broom packages. You will need to load them in yourself!

In this homework you will once again upload the data on your own! The life_expectancy.csv data can be found on Canvas $\Rightarrow$ Files and click on the file called life_expectancy.csv. Follow the instructions from HW 07 to upload the life_expectancy.csv data into RStudio.

Then, you can load the data as usual using the following. Be sure to remove the eval = FALSE before knitting!

life_expect <- read.csv("data/life_expectancy.csv")

Codebook

Variable name	Description
`country`	Country
`year`	Year (2000-2015)
`status`	Developed or Developing country status
`life_expectancy`	life expentancy in age
`adult_mortality`	Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
`infant_deaths`	Number of Infant Deaths per 1000 population
`alcohol`	Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
`percentage_expenditure`	Expenditure on health as a percentage of Gross Domestic Product per capita (%)
`hepB`	Hepatitis B immunization coverage among 1-year-olds (%)
`measles`	Number of reported cases of Measles per 1000 population
`BMI`	Average Body Mass Index (BMI) of entire population
`under_five_deaths`	Number of under-five deaths per 1000 population
`polio`	Polio immunization coverage among 1-year-olds (%)
`total expenditure`	General government expenditure on health as a percentage of total government expenditure (%)
`diphtheria`	Diphtheria tetanus toxoid and pertussis immunization coverage among 1-year-olds (%)
`HIV_AIDS`	Deaths per 1000 live births HIV/AIDS (0-4 years)
`GDP`	Gross Domestic Product per capita (in USD)
`population`	Population of the country
`thinness_10_19`	Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
`thinness_5_9`	Prevalence of thinness among children for Age 5 to 9(%)
`income_composition`	Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
`schooling`	Number of years of schooling

To following resource provides code needed to make useful symbols. You may use the code to typeset the characters of interest in the narrative of your document:

$\hat{y}$: $\hat{y}$
$\widehat{response}$: $\widehat{response}$
$y_{unit}$: $y_{unit}$

Exercises

Make sure to follow the tidy coding practices discussed throughout the semester.

Part 1: Data Manipulation

While the data provided spans from 2000-2015, we will be focusing on a single year of data. Filter the data to only retain observations from 2015 that have non-NA values for schooling.

Part 2: Exploratory Data Analysis

Visualize and describe the distribution of life_expectancy.What does that tell you about the average life expectancy across countries? Is this what you expected to see? Why, or why not? Include any summary statistics and visualizations you use in your response.

⊕

Visualize and describe the relationship between life_expectancy and schooling.

⊕

Part 3: Simple linear regression with a numerical predictor

⊕Linear model is in the form $\hat{y} = b_0 + b_1 x$.

Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called m_school to predict average life expectancy by average number of years of schooling (schooling), and display the output. Based on the regression output, write the linear model.

⊕

Interpret the slope of the linear model in context of the data.

⊕

Interpret the intercept of the linear model in context of the data. Comment on whether or not the intercept makes sense in this context.

⊕

Part 4: Linear regression with a categorical predictor

Fit a new linear model called m_status to predict the life expectancy of a country based on its status, and display the output. Based on the regression output, write the linear model.

⊕

Interpret the slope and intercept from m_status in the context of the data.

⊕

Part 5: Model diagnostics and evaluation

Store an augmented version of your m_school model. Using this new augmented data frame, create diagnostic plots to determine if the conditions of linearity and normality are met. For each condition, briefly explain why or why not based off the appropriate diagnostic plot.

⊕

Using appropriate functions, obtain and interpret the $R^2$ of your m_school model.

⊕

Submission

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. we will be checking these to make sure you have been practicing how to commit and push changes.

Once your work is finalized in your GitHub repo, submit the final PDF to Canvas.

HW 08 - Life Expectancy

due Wednesday, 11/30 at 11:59pm