Introduction and Data

The goal of this week’s homework is to practice creating bootstrap confidence intervals, and visualizing bootstrap distributions.

Tips

Use a small number of reps (about 100) as you write and test out your code. Once you have finalized all of your code, increase the number of reps to at least 1000 (10,000 if possible) to produce your final results.
Write your code and narrative in a reproducible way, so you can easily change the number of reps. Tips on making your code more reproducible:

Tip 1: Set your seed

When dealing with randomness (as often the case in simulation in statistics), it is important to specify which pseudo-random draw you used in your analysis, so that you or someone else can reproduce the exact numbers you initially report. The set.seed() function in R allows you to ensure that all of your analysis relies on a specific pseudo-random draw:

set.seed(42)

Tip 2: Define global parameters

Often, we rely on specific parameters values throughout our analysis, and at a later point, we may want to replace them. In order to minimize the need to change your code later, we can assign the parameter values to a name, and use the name (rather than the hard-coded value) downstream. Then, to update your code at a later point, you can just change the value. Here, we are assigning the number of reps to a variable called num_reps:

num_reps <- 100
boot_dist <- diamonds %>%
  specify(response = price) %>%
  generate(reps=num_reps, type = "bootstrap") %>%
  calculate(stat="mean")

The data

The data may be found by cloning your hw-05-boston- repository available in the GitHub course organization.

Today’s data comes from the city of Boston, courtesy of the U.S. Census Bureau. In particular, the Boston dataset contains data about median value of owner-occupied housing units in 506 suburbs of Boston. “Owner-occupied housing units” is defined as: one-family houses on less than 10 acres without a business or medical office on the property. The variables and their definitions are as follows:

crim: per capita crime rate by town
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centers
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes (in $1000s)

You may load in the data with the following code, where ____ should be replaced by a meaningful name of your choosing. Don’t forget to set eval = TRUE before knitting:

___ <- read.csv("data/Boston.csv")

Exercises

Write all R code according to the style guidelines discussed in class. Make sure that your plots have appropriate labels and titles.

⊕Hint: Don’t forget to set a seed in order to ensure reproducibility!

Based on this data set, provide a point estimate for the population mean of the median value of owner-occupied homes in Boston, medv.

⊕

Construct a 95% bootstrap interval for the mean of the median value of owner-occupied homes in Boston. Use at least 1,000 bootstrap samples. Make sure your interval is reproducible.

⊕

Visualize the bootstrap distribution and your confidence interval from Exercise 2. Interpret the confidence interval you constructed.

⊕

Considering towns with a pupil-teacher ratio of at most 15, provide a point estimate of the proportion of owner-occupied houses in Boston with median values over $40,000.

⊕

For towns with a pupil-teacher ratio of at most 15, construct a 99% bootstrap interval for the proportion of owner-occupied houses in Boston with median values over $40,000. Make sure your interval is reproducible.

⊕

Visualize the bootstrap distribution and your confidence interval from Exercise 5. Interpret the confidence interval you constructed.

⊕

Provide a point estimate of the correlation between the median housing value of owner-occupied houses in Boston medv and the pupil-teacher ratio in the town.

⊕Hint: To simulate the correlation between two variables, use specify(var1 ~ var2). Remember that correlation is still a numerical quantity, so that should help you choose the type of simulation you want to perform.

Construct a 95% bootstrap interval for the correlation between the median housing value of owner-occupied houses in Boston medv and the pupil-teacher ratio in the town. Make sure your interval is reproducible.

Note: you do not need to visualize your confidence interval.

⊕Hint: To simulate the correlation between two variables var1 and var2, use specify(var1 ~ var2). Remember that correlation is still a numerical quantity, so that should help you choose the type of simulation you want to perform.

Construct and report your 90% and 99% bootstrap interval using the bootstrap distribution from Exercise 8. Then answer the following questions:

How does the 95% bootstrap interval in the previous exercise compare to the intervals calculated here?
In general, how does the bootstrap interval change when the confidence level increases?

⊕

Submission

Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo. Please only upload your PDF document to Canvas.

HW 05 - Bootstrap estimation

due Tuesday, November 1 at 11:59pm