source("beginner_sampling_distributions_tests.r") # load some background tests
library(tidyverse)
library(dplyr)
library(infer)
#install.packages("gridExtra") # uncomment this line to install the package before running the section
library(gridExtra)
1.8 - Beginner - Sampling Distributions
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to R
- Introduction to Visualization
- Introduction to Central Tendency
- Introduction to Distribution
- Dispersion and Dependence
Outcomes
- Define a simple random sample and explain why it is useful in econometrics
- Use a simple random sample to infer population parameters
- Produce a sampling distribution through repeated random sampling
- Understand how sampling distributions relate to the Law of Large Numbers and the Central Limit Theorem
Part 1: Introduction to Statistical Inference
Let’s say we wanted to determine a particular characteristic of a population, the population parameter. For example, if we want to determine the average height of a Canadian resident, our population would be every Canadian resident and our population parameter would be average height.
Since collecting data from the entire population is often unrealistic, we can instead survey a sample and gather a sample estimate. From our example, it would be much more practical to survey 1000 people in Canada and calculate their average height. While this sample estimate won’t be a true match, it’s probably a good estimate of the population parameter. But how do we know we picked a representative sample of the Canadian population?
Simple random samples
A simple random sample is a subset of a population in which each member of the population has an equal chance of being selected for the sample. In our case, if we only sampled professional athletes to determine the average height of the Canadian population, we wouldn’t get a very good estimate of our population parameter. Simple random samples allow us to compute sample estimates that are unbiased estimators of the population parameter.
There are two statistical approaches:
We could conduct a census. In our case we could measure the height of all 38 million people living in Canada. While this may produce little error, it would be costly and we may not actually be able to collect all of this data - what if we miss someone?
We could take a simple random sample of the Canadian population, in which all 38 million Canadians have an equal probability of being chosen. This would allow us to make statistical inferences on the population mean, without the cost of conducting a census.
Note: while there are other types of sampling in economics, simple random sampling is a good starting point. In any case, we must try our best to get an unbiased estimator through a random sample.
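To see what “equal chance of being selected” looks like in code, here is a minimal sketch using base R’s sample function (the heights vector is hypothetical, not data from this lesson):

heights <- c(165, 172, 158, 180, 169, 175, 163, 171) # hypothetical heights in cm
srs <- sample(heights, size = 3) # every person has an equal chance of selection
mean(srs) # the sample estimate of average height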
Simulating data
We will work with simulated data drawn from normal and uniform distributions. First, we need to set a random number seed, which allows us to reproduce results drawn from a distribution of a random variable.
Note: the set.seed() function in R uses a random number generator that produces a sequence of numbers completely determined by a seed value - they are effectively random, but also reproducible. This is useful if you wanted to check your classmates’ work with the same set of numbers. See this reference for more information on pseudo-random numbers.
seed_number <- 1 # my seed number
set.seed(seed_number) # set the random number seed to reproduce results

In the following, you can try changing the seed number and seeing how some of the results differ. Remember that the default seed we used was seed_number <- 1.
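As a quick check of reproducibility (the exact values depend only on the seed), resetting the same seed reproduces identical draws:

set.seed(1)
rnorm(3) # three draws from the standard normal
set.seed(1)
rnorm(3) # resetting the seed reproduces the same three draws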
Example: simulating a distribution
Let’s say we wanted to create a hypothetical population with a variable called x of size 10,000, distributed normally with mean 0 and standard deviation 1. In this case x represents “household indebtedness.” Given that it is normally distributed, most households have close to no debt, a few households have an extreme amount of debt, and a few others have large investments with consistent cash flow (negative debt). Below, we will create a data frame population using the rnorm function.
x <- rnorm(10000, 0, 1) # draw 10,000 observations from the normal distribution with mean 0 and sd 1
population <- data.frame(x) # store it in a data frame
We should have a population mean of 0 and standard deviation of 1. We can also visualize our distribution using a histogram of the population distribution. In this example, our population distribution should be normally distributed, centered around 0, with the mean represented as a blue line on the histogram.
table <- population %>%
  summarize(mean = mean(x), sd = sd(x)) %>%
  round(digits = 2) # two decimal places
table
pop_dist <- population %>%
  ggplot(aes(x = x)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = mean(population$x), colour = "blue", linetype = "solid", lwd = 1)
pop_dist
If we want to estimate the population mean above but do not have the time to conduct a census of the population, we can instead draw a sample in which everyone in the population has an equal probability of being selected.
To take a simple random sample in R, we can use the slice_sample function, which randomly selects a given number of rows from a data frame. Let’s start with a very small sample of size 5 and see how it compares to our population mean.
sample_data <- population %>%
  slice_sample(n = 5) # 5 observations in our sample

s_table <- sample_data %>%
  summarize(mean = mean(x), sd = sd(x)) %>%
  round(digits = 2) # two decimal places
s_table
Notice the sample average is very different from our true population average, because such a small sample carries a lot of sampling variability. Therefore, we should take a larger sample to produce an estimate that is more likely to be close to the population mean.
Let’s take another simple random sample with a larger sample of size 30.
sample_data <- population %>%
  slice_sample(n = 30, # 30 observations in our sample
               replace = TRUE) # with replacement

s_table <- sample_data %>%
  summarize(mean = mean(x), sd = sd(x)) %>%
  round(digits = 2) # two decimal places
s_table
Just by increasing our sample from n = 5 to n = 30, our estimate has gotten much closer to the population mean. Hence, a larger sample size gets us much closer to the true mean of the population.
In this exercise, we have accomplished the following:
- Simulated a standard normally distributed population of size 10,000.
- Estimated the population mean, using the sample mean as an unbiased estimator.
Next we will create a distribution of sample means through repeated random samples from our population.
Test your knowledge
In this exercise:
- Create an object a which is a draw from the uniform distribution, with 5,000 observations between 10 and 20, using the random seed 20. Store your answer in answer_1.
- Take a simple random sample of size 30 from our population, calculate the sample mean, and store it in an object answer_2.
set.seed(...)
a <- runif(..., ..., ...) # variable you are creating
answer_1 <- data.frame(a)
answer_1
test_1()
set.seed(...)
answer_2 <- answer_1 %>%
  slice_sample(n = ...) %>%
  summarize(mean = mean(...)) %>%
  pull() %>%
  round(3)
answer_2
test_2()
Repeated random samples and sampling distributions
Remember that when we survey a sample, we are only observing a subset of the population and determining a sample estimate based on that sample. Hence, if we surveyed a different 1000 Canadian residents every time, we would almost certainly get a different estimate for the average height of a Canadian resident.
Let’s say we can draw many samples of size \(n\) from our normal distribution population. Every time we draw a new sample, we get a different estimate of the mean. Now, if we were to plot each of those estimates onto a histogram, we would get a sampling distribution.
A sampling distribution of a statistic is a probability distribution based on repeated independent samples of size \(n\) from a population of size \(N\). We can now produce a distribution of sample means based on repeated and independent simple random samples from our population.
If we take enough samples, we will find that the mean of the sampling distribution will be nearly equal the mean of our population distribution (our population parameter). With an infinite number of random samples, the mean of the sampling distribution will be exactly equal to the population parameter.
As sample sizes increase, the standard error of the sampling distribution decreases. In other words, a larger sample size \(n\) gives a lower probability of producing a sample estimate far away from the true population parameter. This is a very important concept in statistical inference and econometrics.
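Concretely, for independent draws from a population with standard deviation \(\sigma\), the standard error of the sample mean is \(\sigma/\sqrt{n}\), so quadrupling the sample size halves the standard error. As a quick sketch for our simulated population (where \(\sigma = 1\)):

sigma <- 1 # standard deviation of our simulated population
sigma / sqrt(5) # roughly 0.45
sigma / sqrt(50) # roughly 0.14
sigma / sqrt(500) # roughly 0.045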
Scenario 1: Population is normally distributed
We will need to take \(R\) independent simple random samples of size \(n\) from our population of size \(N\) to produce our sampling distribution (\(R\) observations).
To do this in R, we will need to take a simple random sample of size \(n\), compute the sample mean, and store it in a vector. We will then repeat this \(R\) times, appending each sample mean into our vector. Our vector will represent the observations of our sampling distribution!
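As a minimal sketch of that procedure in base R (before we introduce a helper function; the values of R and n here are illustrative), assuming the population data frame from above:

n <- 30 # sample size
R <- 1000 # number of repeated samples
sample_means <- numeric(R) # empty vector to hold one mean per sample
for (r in 1:R) {
  s <- slice_sample(population, n = n) # one simple random sample of size n
  sample_means[r] <- mean(s$x) # store its sample mean
}
hist(sample_means) # a rough look at the sampling distribution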
Standard error is the standard deviation of the sampling distribution, which indicates how much our sample mean will vary from sample to sample. For this exercise, we will keep the number of samples constant and progressively increase our sample size to see how it affects the standard error of the sampling distribution.
We will use the rep_sample_n function from the infer package in R. This function allows us to repeatedly take samples from a population.
- The size parameter indicates the sample size and the reps parameter indicates the number of samples we wish to draw from our population.
- The solid blue line represents the true mean of the normal distribution population (0).
- The dashed red line shows us the mean of the sampling distribution. Since we are taking many samples, these two values should be very similar.
- Suppose we take 1000 simple random samples (\(R = 1000\)), with a sample size of 5 (\(n = 5\)):
# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE
reps_5_1000 <- population %>%
  rep_sample_n(size = 5, reps = 1000) %>% # creates 1000 samples of size 5
  group_by(replicate) %>% # groups the observations by sample
  summarize(mean = mean(x), sd = sd(x)) # calculates the mean and sd of each sample

# CALCULATES THE MEAN AND STANDARD ERROR OF THE SAMPLING DISTRIBUTION
sample_5_mean_se <- reps_5_1000 %>%
  summarize(sd_mean = mean(mean), se = sd(mean))
sample_5_mean_se
# VISUAL DEPICTION OF SAMPLING DISTRIBUTION
sampling_dist_5_1000 <- reps_5_1000 %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = 0, colour = "blue", linetype = "solid", lwd = 1) +
  geom_vline(xintercept = mean(reps_5_1000$mean), colour = "red", linetype = "dashed", lwd = 1)
sampling_dist_5_1000
- Suppose we take 1,000 simple random samples (\(R = 1000\)), with a sample size of 50 (\(n = 50\)):
# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE
reps_50_1000 <- population %>%
  rep_sample_n(size = 50, reps = 1000) %>% # creates 1000 samples of size 50
  group_by(replicate) %>% # groups the observations by sample
  summarize(mean = mean(x), sd = sd(x)) # calculates the mean and sd of each sample

# CALCULATES THE MEAN AND STANDARD ERROR OF THE SAMPLING DISTRIBUTION
sample_50_mean_se <- reps_50_1000 %>%
  summarize(sd_mean = mean(mean), se = sd(mean))
sample_50_mean_se
# VISUAL DEPICTION OF SAMPLING DISTRIBUTION
sampling_dist_50_1000 <- reps_50_1000 %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = 0, colour = "blue", linetype = "solid", lwd = 1) +
  geom_vline(xintercept = mean(reps_50_1000$mean), colour = "red", linetype = "dashed", lwd = 1)
sampling_dist_50_1000
Note: see how the scale of the x-axis for \(n = 50\) has adjusted from \(n = 5\) in response to the reduction in error?
Now that we have increased our sample size from \(n = 5\) to \(n = 50\), the sample means cluster much more tightly around the population mean. Let’s see how we can further reduce the standard error of our estimate by increasing our sample size.
- Suppose we take 1,000 simple random samples (\(R = 1000\)), with a sample size of 500 (\(n = 500\)):
# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES THE MEAN OF EACH SAMPLE
reps_500_1000 <- population %>%
  rep_sample_n(size = 500, reps = 1000, replace = FALSE) %>% # creates 1000 samples of size 500
  group_by(replicate) %>% # groups the observations by sample
  summarize(mean = mean(x)) # calculates the mean of each sample

# CALCULATES THE MEAN AND STANDARD ERROR OF THE SAMPLING DISTRIBUTION
sample_500_mean_se <- reps_500_1000 %>%
  summarize(sd_mean = mean(mean), se = sd(mean))
sample_500_mean_se
# VISUAL DEPICTION OF SAMPLING DISTRIBUTION
sampling_dist_500_1000 <- reps_500_1000 %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = 0, colour = "blue", linetype = "solid", lwd = 1) +
  geom_vline(xintercept = mean(reps_500_1000$mean), colour = "red", linetype = "dashed", lwd = 1)
sampling_dist_500_1000
Test your knowledge
Let’s take a sample of size 50 with 1000 reps from the same population data frame and produce a data frame with only the means. Assign your answer to the object answer_3.
set.seed(30)
answer_3 <- population %>%
  rep_sample_n(... = ..., ... = ...) %>%
  group_by(...) %>%
  summarize(mean = ...(x))
answer_3
test_3()
Because the number of samples we drew from our population was held constant, the mean of our sampling distribution was roughly equivalent to the true population mean in all of our examples. As we increased our sample size, the standard error decreased.
Having a low standard error means that the probability of having a sample produce a sample estimate that is far away from the true population parameter is very low.
With a sample size of 500, the sample means are mostly between -0.1 and +0.1.
With a sample size of 5, the sample means are mostly between -1 and 1.
Because it’s often only possible to collect one sample in real life, we tend to collect as large a sample as possible to minimize the standard error. This gives us high confidence that our sample estimate will be close to the true population parameter.
In the section about Bootstrapping later in this lesson, we will explore how economists calculate the standard error of a sample estimate using only one sample.
Scenario 2: Population is not normally distributed
A few things to note about increasing the number of samples and the sample size:
- As we take more samples, the mean of the sampling distribution comes closer to the true value of the population parameter.
- As the sample size grows, the sampling distribution will resemble a normal distribution curve regardless of the shape of the original population distribution (note that it is just the sampling distribution that starts to look more like a normal curve, not the population distribution).
This exercise will be nearly identical to the one above, except now our population will be uniformly distributed. For this exercise, we will see how varying the samples taken affects the sampling distribution. We can create a uniform population in R using the runif() function.
y <- runif(10000, 0, 1) # draw 10,000 observations from the uniform distribution between 0 and 1
population_unif <- data.frame(y)

unif_pop_dist <- population_unif %>%
  ggplot(aes(x = y)) +
  geom_histogram(bins = 20, color = "black", fill = "gray")
unif_pop_dist

unif_pop_mean_se <- population_unif %>%
  summarize(true_mean = mean(y))
unif_pop_mean_se
- Suppose we take 1000 simple random samples (\(R = 1000\)), with a sample size of 5 (\(n = 5\)):
# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES THE MEAN OF EACH SAMPLE
unif_reps_5_1000 <- population_unif %>%
  rep_sample_n(size = 5, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(mean = mean(y))

# MEAN AND STANDARD ERROR OF SAMPLING DIST
sample_5_mean_se <- unif_reps_5_1000 %>%
  summarize(sd_mean = mean(mean), se = sd(mean))
sample_5_mean_se

# VISUAL DEPICTION OF SAMPLING DISTRIBUTION
unif_sampling_dist_5_1000 <- unif_reps_5_1000 %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = mean(population_unif$y), colour = "blue", linetype = "solid", lwd = 1) +
  geom_vline(xintercept = mean(unif_reps_5_1000$mean), colour = "red", linetype = "dashed", lwd = 1)
unif_sampling_dist_5_1000
- Suppose we take 1,000 simple random samples (\(R = 1000\)), with a sample size of 50 (\(n = 50\)):
# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES THE MEAN OF EACH SAMPLE
unif_reps_50_1000 <- population_unif %>%
  rep_sample_n(size = 50, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(mean = mean(y))

# MEAN AND STANDARD ERROR OF SAMPLING DIST
sample_50_mean_se <- unif_reps_50_1000 %>%
  summarize(sd_mean = mean(mean), se = sd(mean))
sample_50_mean_se

# VISUAL DEPICTION OF SAMPLING DISTRIBUTION
unif_sampling_dist_50_1000 <- unif_reps_50_1000 %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = mean(population_unif$y), colour = "blue", linetype = "solid", lwd = 1) +
  geom_vline(xintercept = mean(unif_reps_50_1000$mean), colour = "red", linetype = "dashed", lwd = 1)
unif_sampling_dist_50_1000
- Suppose we take 1,000 simple random samples (\(R = 1000\)), with a sample size of 500 (\(n = 500\)):
# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES THE MEAN OF EACH SAMPLE
unif_reps_500_1000 <- population_unif %>%
  rep_sample_n(size = 500, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(mean = mean(y))

# MEAN AND STANDARD ERROR OF SAMPLING DIST
sample_500_mean_se <- unif_reps_500_1000 %>%
  summarize(sd_mean = mean(mean), se = sd(mean))
sample_500_mean_se

# VISUAL DEPICTION OF SAMPLING DISTRIBUTION
unif_sampling_dist_500_1000 <- unif_reps_500_1000 %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = mean(population_unif$y), colour = "blue", linetype = "solid", lwd = 1) +
  geom_vline(xintercept = mean(unif_reps_500_1000$mean), colour = "red", linetype = "dashed", lwd = 1)
unif_sampling_dist_500_1000
Notice how the difference between the mean of the sampling distribution and the true value of the population parameter becomes smaller and smaller as we increase the sample size. As mentioned above, we can also see that the sampling distribution more strongly resembles a normal distribution curve as the sample size increases. This is true regardless of the fact that we began with a population that was uniformly distributed.
SIZE <- 500
REPS <- 2000

pop_dist_eg <- population %>%
  rep_sample_n(size = SIZE, reps = REPS) %>% # CHANGE PARAMETERS HERE
  group_by(replicate) %>%
  summarize(mean = mean(x)) %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = 0, colour = "blue", linetype = "solid", lwd = 1)

unif_dist_eg <- population_unif %>%
  rep_sample_n(size = SIZE, reps = REPS) %>% # CHANGE PARAMETERS HERE
  group_by(replicate) %>%
  summarize(mean = mean(y)) %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = mean(population_unif$y), colour = "blue", linetype = "solid", lwd = 1)

grid.arrange(pop_dist_eg, unif_dist_eg, ncol = 2)
In the above graph, we fixed the sample size and number of samples for both the normal distribution and uniform distribution examples, and we obtained two sampling distributions that are strikingly similar to a normal distribution curve.
Hence, regardless of the initial population distribution, as we increase our sample size and the number of independent simple random samples, our sampling distribution concentrates around our population mean!
Try it out yourself by increasing SIZE and REPS to see how higher values make the distribution become tighter around the true population mean.
This convergence of the distribution of sample means to a normal distribution is referred to as the Central Limit Theorem in statistics.
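Formally, if \(\bar{x}_n\) is the mean of a simple random sample of size \(n\) drawn from a population with mean \(\mu\) and standard deviation \(\sigma\), the CLT states that the standardized sample mean converges in distribution to a standard normal as \(n\) grows:
\[ \frac{\bar{x}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \]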
Test your knowledge
- Use the population_unif data frame and draw 500 samples of sample size 50. Then calculate the mean for each sample and store your answer in answer_4.
- Create a visualization of the sampling distribution and store it in answer_5.
- Calculate the standard error of this sampling distribution and assign it to the object answer_6. Remember the sd function calculates the standard deviation of a column.
Template code has been provided below.
set.seed(40) # don't change this
answer_4 <- ... %>%
  ...(... = ..., ... = ...) %>%
  ...(...) %>%
  summarize(mean = mean(y))
answer_4
test_4()
set.seed(40) # don't change this
answer_5 <- answer_4 %>%
  ggplot(...(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = mean(population_unif$y), colour = "blue", linetype = "solid", lwd = 1)
answer_5
test_5()
set.seed(40)
answer_6 <- answer_4 %>%
  summarize(se = ...(mean)) %>%
  pull() %>%
  round(5)
answer_6
test_6()
What does the standard error tell us?
A. the average variation in the population distribution
B. the average variation in our sample estimates of the population parameter
C. the probability that our sample estimate is correct
D. exactly what the sample estimate tells us

# enter your answer as either "A", "B", "C" or "D"
answer_7 <- "..."
test_7()
Review of Most Important Concepts
The population is the collection of all individuals or entities we are interested in studying. From the above example, the population is the entire population of Canada.
A population parameter is a particular quantity from the population that we’re interested in. In our example, this is the average height of the population.
A simple random sample is a small subset of the population which we use to get a sample estimate of the parameter we are interested in. Since we are often unable to conduct a comprehensive census in the real world, we have to use the sample estimate as an estimate of the population parameter.
A sampling distribution is the distribution of the possible values of the sample estimate. In a hypothetical world, we assume that we are able to take many samples of size n from the same population and attain a sample estimate from each of those samples. When we plot those sample estimates on a histogram, we get a sampling distribution. When we take many samples, we find that the mean of the sampling distribution (the most frequently observed estimate of the population parameter) is very close to the actual value of the population parameter.
Increasing the sample size decreases our standard error. When we increase the number of samples drawn from the population, the mean of the sampling distribution moves closer and closer to the population parameter, and our sampling distribution more closely resembles a normal distribution curve.
Part 2: Central Limit Theorem (CLT) and the Law of Large Numbers (LLN)
The Central Limit Theorem tells us that no matter the distribution of the population, with sufficient sample size (usually above 30 and no more than 5% of the population), the distribution of sample means will tend towards a normal distribution.
Note: When sampling more than 5% of a population, we should use the Finite Population Correction Factor (FPC). This is because the Central Limit Theorem does not hold, and the uncorrected formula overstates the standard error: FPC = \(\left(\frac{N-n}{N-1}\right)^{\frac{1}{2}}\).
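As a minimal sketch of applying this correction (the numbers here are illustrative, not from our simulations), the corrected standard error multiplies the usual \(\sigma/\sqrt{n}\) by the FPC:

N <- 10000 # population size
n <- 1000 # sample size (10% of the population, so the correction matters)
sigma <- 1 # population standard deviation
fpc <- sqrt((N - n) / (N - 1)) # finite population correction factor
(sigma / sqrt(n)) * fpc # corrected standard error, slightly below sigma/sqrt(n)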
The Law of Large Numbers tells us that with a sufficiently large number of repeated and independent random samples, the mean of our distribution of sample means will tend toward the population mean! Notice the visualization of our sample mean as we increase \(R\) in the example below!
set.seed(55)
# RUN THIS CODE.
r = 1000 # Number of iterations (number of sample means produced for our distribution)
num = 10 # Sample size
R = 1:r # Range of iterations
means <- numeric(r) # Create empty vector to store sample means
sample_mean <- numeric(r) # Create empty vector to store the running mean

for (val in R) { # loop over all values in R
  means[val] = mean(slice_sample(population_unif, n = num)$y) # Take a sample of size num from the population, calculate its mean, and store it at the index of the current iteration
  sample_mean[val] <- mean(means[1:val]) # Running value of the sample mean at the index of the current iteration
}

options(scipen = 10)
plot(sample_mean, cex = 0.5, col = "darkblue")
abline(h = mean(population_unif$y))
See how many iterations (in this case, 1000) eventually produce a relatively stable line around 0.5 in this example?
Showing the CLT in practice
In this exercise, we will create a new simulation for you to play around with \(R\) and \(n\) to get a better understanding of how the sampling distribution changes with sample size and the number of repeated independent random samples. You can alter the values by changing the SIZE and REPS constants at the top of the code block. Try different values and see for yourself how the shape of the sampling distribution changes.
Note: You can also try different distributions for our population; try the following: rchisq(n, df), rexp(n, rate), rf(n, df1, df2).
x <- rexp(10000, 0.2) # draw 10,000 observations from the exponential distribution with rate 0.2
population_ex <- data.frame(x)

pop_dist_int <- population_ex %>%
  ggplot(aes(x = x)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = mean(population_ex$x), colour = "blue", linetype = "solid", lwd = 1)
pop_dist_int
SIZE = 50 # change me here
REPS = 5000 # change me here

population_sample_estimates <- population_ex %>%
  rep_sample_n(size = SIZE, reps = REPS) %>%
  group_by(replicate) %>%
  summarize(mean = mean(x))

pop_mean_se <- population_sample_estimates %>%
  summarize(sd_mean = mean(mean), se = sd(mean))
pop_mean_se
samp_dist_int <- population_sample_estimates %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = mean(population_ex$x), colour = "blue", linetype = "solid", lwd = 1) +
  geom_vline(xintercept = mean(population_sample_estimates$mean), colour = "red", linetype = "dashed", lwd = 1)
samp_dist_int
Bootstrapping
From our example, we see that calculating the standard error this way is quite demanding: it requires access to the population (often impossible) and then drawing hundreds of samples from that population to build a sampling distribution (and in terms of data collected, drawing two samples of 10 individuals is the same as drawing one sample of 20).
One common technique that economists use to calculate the standard error of their sample estimates is bootstrapping samples.
We do this by taking one sample of size \(n\) from our population and then taking a sample of size \(n\) from our sample, with replacement.
Let’s imagine we have a paper bag with 5 red balls and 5 green balls, which we will treat as our sample.
In order to draw a bootstrap sample, we pick 1 ball out of the bag ten times while allowing replacement.
We take out the first ball, note whether it’s a red or green ball, then put it back in the bag and repeat the process 9 more times to arrive at our bootstrap sample.
Note that with this method, it is entirely possible to draw any combination of red and green balls. The bootstrap sampling method introduces sampling variability to our sample.
If we repeat this process by drawing many bootstrap samples, we can arrive at a bootstrap distribution which serves as an approximation of our sampling distribution.
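As a minimal sketch of the paper-bag example in R (the labels are illustrative), one bootstrap sample draws 10 balls with replacement from the original 10:

bag <- c(rep("red", 5), rep("green", 5)) # our original sample of 10 balls
boot_draw <- sample(bag, size = length(bag), replace = TRUE) # one bootstrap sample
table(boot_draw) # the red/green mix can differ from the original 5-5 split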
Example: bootstrapping
Let’s demonstrate this using our normal distribution data frame population.
We first start off by taking a sample from our population using the slice_sample function.
To connect it back to the average height example, imagine that we had resources to sample 100 people about their height out of the entire Canadian population.
set.seed(300)
boot_main_sample <- population %>%
  slice_sample(n = 100)
head(boot_main_sample)
paste("This is the mean of our sample:", mean(boot_main_sample$x))
Now, we want to know the uncertainty behind this estimate. Thinking back to our sampling distribution, how would we know if our sample estimate is close to the true population parameter? It’s entirely possible we got a sample estimate in the tails of the sampling distribution. To understand the uncertainty around our sample estimate, we must compute the standard error.
Let’s draw bootstrap samples from our sample. The code is largely similar to before, except for the key difference that we now add replace = TRUE to our rep_sample_n function. This tells R to allow replacement when taking samples.
boot_samples <- boot_main_sample %>%
  rep_sample_n(size = 500, reps = 1000, replace = TRUE)
head(boot_samples)
Now that we have our bootstrap samples, let’s calculate the bootstrap sample means and make a bootstrap sampling distribution. We use the same methods as we used above.
boot_means <- boot_samples %>%
  group_by(replicate) %>%
  summarize(mean = mean(x))

boot_mean_se <- boot_means %>%
  summarize(sd_mean = mean(mean), se = sd(mean))
boot_mean_se

boot_sampling_dist <- boot_means %>%
  ggplot(aes(x = mean)) +
  geom_histogram(bins = 50, color = "black", fill = "gray") +
  geom_vline(xintercept = 0, colour = "blue", linetype = "solid", lwd = 1) +
  geom_vline(xintercept = mean(boot_means$mean), colour = "red", linetype = "dashed", lwd = 1) +
  ggtitle("Bootstrap Sampling Distribution")

grid.arrange(boot_sampling_dist, sampling_dist_500_1000 + ggtitle("Sampling Distribution"))
The key takeaway here is that the standard error of the bootstrap sampling distribution serves as a great approximation of the standard error of the sampling distribution.
A few things to note:
- The graphs show that our bootstrap sampling distribution visually serves as a good approximation of our sampling distribution. The standard error of the bootstrap sampling distribution (0.0451) is close to our sampling distribution standard error (0.0433).
- The key difference between the graphs is the mean of the distributions.
- The mean of the sampling distribution is nearly equivalent to the population parameter.
- However, the mean of the bootstrap distribution will be the value of the sample estimate from the first sample we took from the population.
Think Deeper: why is the bootstrapping distribution centered around the mean of our sample?
Test your knowledge
For this exercise, we have taken a sample of size 50 from the population data frame and assigned it to the object boot_eg_sample.
set.seed(60)
boot_eg_sample <- population %>%
  slice_sample(n = 50)
boot_eg_sample
- Make a data frame of bootstrap sample estimates with 1000 reps and assign your answer to answer_8.
- Calculate the standard error of the bootstrap sampling distribution and store your answer in answer_9.
We have provided a code template below.
set.seed(60) # don't change this
answer_8 <- boot_eg_sample %>%
  ...(... = ..., ... = ..., replace = ...) %>%
  ...(...) %>%
  summarize(mean = ...(x))
answer_8
test_8()
set.seed(60) # don't change this
answer_9 <- answer_8 %>%
  summarize(se = ...(mean)) %>%
  pull() %>%
  round(3)
answer_9
test_9()