The goal of this week’s homework is to practice creating bootstrap confidence intervals and visualizing bootstrap distributions.
When dealing with randomness (as is often the case in statistical
simulation), it is important to specify which pseudo-random draw you
used in your analysis, so that you or someone else can reproduce the
exact numbers you initially report. The set.seed() function
in R allows you to ensure that all of your analysis relies on a specific
pseudo-random draw:
set.seed(42)
Often, we rely on specific parameter values throughout our analysis,
and at a later point, we may want to replace them. To minimize
the need to change your code later, we can assign each parameter value
to a name and use the name (rather than the hard-coded value)
downstream. Then, to update your code at a later point, you only need to
change the value in one place. Here, we are assigning the number of reps to a
variable called num_reps:
library(ggplot2)  # provides the diamonds data
library(infer)    # provides specify(), generate(), calculate(), and %>%
num_reps <- 100
boot_dist <- diamonds %>%
  specify(response = price) %>%
  generate(reps = num_reps, type = "bootstrap") %>%
  calculate(stat = "mean")
The data may be found by cloning your hw-05-boston- repository, available in the GitHub course organization.
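As a minimal sketch of why the seed matters, the base R example below draws the same bootstrap means twice from the same seed and checks that the two runs match. The vector toy_values and the helper boot_means() are hypothetical names invented for this illustration; they are not part of the homework data or of any package.

```r
# Hypothetical toy data, used only to illustrate seeding (not the homework data)
toy_values <- c(12, 15, 9, 20, 17, 11, 14)

# Named parameter: change it here and every use downstream follows
num_reps <- 100

# Hypothetical helper: resample x with replacement `reps` times
# and return the mean of each resample
boot_means <- function(x, reps) {
  replicate(reps, mean(sample(x, size = length(x), replace = TRUE)))
}

set.seed(42)
first_run <- boot_means(toy_values, num_reps)

set.seed(42)  # resetting the seed restarts the same pseudo-random sequence
second_run <- boot_means(toy_values, num_reps)

# Same seed and same code give exactly the same bootstrap means
identical(first_run, second_run)
```

Because num_reps is defined once, switching to a larger number of resamples only requires editing that single line.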
Today’s data comes from the city of Boston, courtesy of the U.S.
Census Bureau. In particular, the Boston dataset contains
data on the median value of owner-occupied housing units in 506 suburbs
of Boston. “Owner-occupied housing units” are defined as one-family
houses on less than 10 acres without a business or medical office on the
property. The variables and their definitions are as follows:
crim: per capita crime rate by town
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centers
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes (in $1000s)
You may load in the data with the following code, where
___ should be replaced by a meaningful name of your
choosing. Don’t forget to set eval = TRUE before
knitting:
___ <- read.csv("data/Boston.csv")
Write all R code according to the style guidelines discussed in class. Make sure that your plots have appropriate labels and titles.
Hint: Don’t forget to set a seed in order to ensure reproducibility!
medv.
medv and the pupil-teacher ratio in the town.
Hint:
To simulate the correlation between two variables, use
specify(var1 ~ var2). Remember that correlation is still a
numerical quantity, so that should help you choose the type of
simulation you want to perform.
medv and the pupil-teacher ratio in the town. Make sure
your interval is reproducible.
Note: you do not need to visualize your confidence interval.
Hint:
To simulate the correlation between two variables var1 and
var2, use specify(var1 ~ var2). Remember that
correlation is still a numerical quantity, so that should help you
choose the type of simulation you want to perform.
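The formula form of specify() in the hint can be sketched on a different dataset. The example below uses R's built-in mtcars data as a stand-in (an assumption made for illustration; it is not the Boston data), and the names boot_cor and cor_ci are hypothetical. It assumes the infer package is installed.

```r
library(infer)  # provides specify(), generate(), calculate(), and %>%

set.seed(42)      # makes the resamples, and so the interval, reproducible
num_reps <- 1000  # number of bootstrap resamples

# Bootstrap distribution of the correlation between two numerical variables
boot_cor <- mtcars %>%
  specify(mpg ~ wt) %>%                              # response ~ explanatory
  generate(reps = num_reps, type = "bootstrap") %>%
  calculate(stat = "correlation")

# Percentile interval: the middle 95% of the bootstrap correlations
cor_ci <- get_confidence_interval(boot_cor, level = 0.95, type = "percentile")
cor_ci
```

The result is a one-row tibble with columns lower_ci and upper_ci; rerunning the chunk with the same seed should return the same two numbers.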
How does the 95% bootstrap interval in the previous exercise compare to the intervals calculated here?
In general, how does the bootstrap interval change when the confidence level increases?
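One way to line up intervals at several confidence levels is to compute them all from a single bootstrap distribution. The base R sketch below does this for the mean of mpg in the built-in mtcars data, which is a stand-in chosen for illustration rather than the homework data; conf_levels, boot_means, and intervals are hypothetical names.

```r
set.seed(42)
num_reps <- 1000
x <- mtcars$mpg  # stand-in data, not the Boston data

# One bootstrap distribution of the sample mean
boot_means <- replicate(num_reps, mean(sample(x, replace = TRUE)))

# Percentile interval at each confidence level, from the same distribution
conf_levels <- c(0.90, 0.95, 0.99)
intervals <- t(sapply(conf_levels, function(level) {
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}))
colnames(intervals) <- c("lower", "upper")

# One row per confidence level, ready to compare side by side
data.frame(level = conf_levels, intervals)
```

Reading down the lower and upper columns of the printed table gives a direct comparison across levels.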
Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo. Please only upload your PDF document to Canvas.