# Run this cell
source("beginner_intro_to_confidence_intervals_tests.R")
1.3.1 - Beginner - Introduction to Confidence Intervals
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to R
Outcomes
In this notebook, you will learn about:
- What Confidence Intervals are and why they are useful for drawing inferential claims about the populations of interest.
- How to compute a Confidence Interval at a chosen level of confidence for estimating true population parameters such as the mean.
- What the different methods commonly used for constructing Confidence Intervals are, namely the analytical method, the sampling method, and the bootstrapped sampling method.
To begin, run the code cell below.
#Import the required packages for this tutorial:
library(dplyr)
library(ggplot2)
1. Introduction to Confidence Intervals for the Population Mean
A confidence interval is a range, computed from a sample, that is likely (at a stated level of confidence) to contain the “true” mean of the entire population of interest. The formula for computing a 95% Confidence Interval for the population mean is as follows:
\[ \text{95% Confidence Interval} = \bar{x} \pm \text{Critical Value} \times \text{Standard Error} \] Here, \(\bar{x}\) is the sample mean, or the point estimate, obtained from a single randomly drawn sample from the population.
The \(\text{Standard Error}\) is computed as \(\frac{\sigma}{\sqrt{n}}\), serving as a measure of variability for sample means. In other words, if we were to obtain all possible samples from the population and calculate their means, the Standard Error would capture the variability of those sample means around their overall mean (the mean of the sample means). Later, we’ll explain the concept of the distribution of sample means, or the sampling distribution of sample means, since it is a crucial underlying concept here!
The Critical Value is a “quantile” value obtained typically from the Standard Normal Distribution (Mean 0, SD 1) such that approximately 95% of the values lie below it. This is often called the Z-score, as denoted below:
\[ z_{\alpha/2} = qnorm(1 - \alpha/2, mean = 0, sd = 1) \]
The subscript under \(z\) represents the tail probability in a standard normal distribution. The value of \(\alpha\) is the significance level, defined as \(1 - \text{Confidence Level}\).
#RUN THIS CELL BEFORE CONTINUING
library(ggplot2)
# Significance level (alpha)
alpha <- 0.05

# Calculate the critical value for a two-tailed 95% confidence interval
z_critical <- qnorm(1 - alpha / 2, mean = 0, sd = 1)

# Create a sequence of x values for the standard normal curve
x <- seq(-3, 3, length.out = 1000)

# Calculate the standard normal probability density function
pdf <- dnorm(x, mean = 0, sd = 1)
# Create the plot
ggplot(data.frame(x), aes(x = x)) +
geom_line(aes(y = pdf), color = "blue", size = 1.5) +
geom_vline(xintercept = z_critical, color = "red", linetype = "dashed") +
geom_vline(xintercept = -z_critical, color = "red", linetype = "dashed") +
annotate("text", x = z_critical + 0.1, y = 0.25, label = "α/2", color = "red") +
annotate("text", x = -z_critical - 0.5, y = 0.25, label = "α/2", color = "red") +
labs(x = "Z", y = "Probability Density", title = "Two-Tailed 95% Confidence Interval") +
theme_minimal()
In a 95% Confidence Interval, \(\alpha\) is \(1 - 0.95 = 0.05\) and the critical value \(z_{\frac{\alpha}{2}}\) corresponds to the point beyond which the area in the tail is \(\alpha/2\), leaving the central 95% area under the standard normal curve.
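This relationship between the critical values and the central 95% area can be checked numerically. The quick sanity check below is ours, not part of the tutorial's original code:

```r
# The area between -z and +z should equal the chosen confidence level
z <- qnorm(0.975, mean = 0, sd = 1)
pnorm(z) - pnorm(-z)  # central area: 0.95
```

Because `pnorm()` and `qnorm()` are inverses, the two tails each hold exactly \(\alpha/2 = 0.025\) of the area, leaving 0.95 in the middle.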
We use the qnorm() function in R to generate Z-scores for the chosen level of significance:
#defining some params for our use
conf.level <- 0.95
alpha <- 1 - conf.level

qnorm(1 - alpha/2, mean = 0, sd = 1)
or simply,
qnorm(0.975, mean = 0, sd = 1)
This means \(P(Y \leq 1.96) = 0.975\) and \(P(Y \leq -1.96) = 0.025\), where \(Y \sim N(0,1)\). Thus, \(P(-1.96 < Y < 1.96) = 0.95\), or 95%.
Notice the symmetry of the Standard Normal Distribution. This is why we only have to compute the Z-score once, using the right-hand critical value (below which 97.5% of values lie). We multiply this Z-score by the Standard Error and ADD the result to the point estimate to get the upper bound of the C.I. We then subtract the same quantity (the Z-score multiplied by the Standard Error) from the sample mean to get the lower bound of the Confidence Interval.
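As a sketch of this add-and-subtract computation, here it is end to end with made-up numbers (the sample mean, population SD, and sample size below are purely illustrative):

```r
# Hypothetical inputs, for illustration only
x_bar <- 20   # sample mean (the point estimate)
sigma <- 4    # population standard deviation
n     <- 100  # sample size

standard_error <- sigma / sqrt(n)
z <- qnorm(0.975, mean = 0, sd = 1)  # computed once, thanks to symmetry

lower_bound <- x_bar - z * standard_error  # subtract the margin of error
upper_bound <- x_bar + z * standard_error  # add the same margin of error
c(lower_bound, upper_bound)
```

The quantity `z * standard_error` is often called the margin of error; the interval is simply the point estimate plus or minus that margin.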
Just as we rarely know the true population mean, we rarely know the true population standard deviation either. It is common practice to use the sample standard deviation, \(s\), to calculate the Standard Error as \(\frac{s}{\sqrt{n}}\).
When the population standard deviation is unknown, the critical values should be drawn from the \(t\)-distribution rather than the Normal Distribution. We use the qt() function in R to obtain the quantile values under the \(t\)-distribution with specified degrees of freedom.
\[ t_{n-1, \alpha/2} = qt(1 - \alpha/2,df=n-1) \]
sample_size <- 15
degrees_of_freedom <- sample_size - 1 #degrees of freedom of a t-distribution is equal to the sample size minus 1

qt(0.975, df = degrees_of_freedom)
This means \(P(Y \leq 2.14) = 0.975\) and \(P(Y \leq -2.14) = 0.025\), where \(Y \sim \text{t}_{n-1}\) with \(n - 1 = 14\). Thus, \(P(-2.14 < Y < 2.14) = 0.95\), or 95%.
For higher degrees of freedom (hence, higher sample sizes), the \(t\)-distribution does a better job of approximating the normal distribution. In general, the \(t\)-distribution mimics the bell shape of the normal distribution but has fatter (thicker) tails.
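This convergence is easy to see by comparing qt() across increasing degrees of freedom against qnorm() (the degrees of freedom chosen below are arbitrary):

```r
# t critical values shrink toward the normal critical value as df grows
sapply(c(5, 15, 30, 1000), function(df) qt(0.975, df = df))

qnorm(0.975)  # the limiting value, roughly 1.96
```

By df = 1000 the t critical value differs from the normal one only in the third decimal place, which is why the two are often used interchangeably for large samples.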
Let’s summarize the assumptions, requirements and appropriate methods for calculating Confidence Intervals:
If the population standard deviation \(\sigma\) is known and the sample size is above 30, obtain the critical value from the standard normal distribution (Z-score) and calculate the 95% Confidence Interval as:
\[ \bar{x} \pm qnorm(0.975, mean = 0, sd = 1) \times \frac{\sigma}{\sqrt{n}} \]
If the population standard deviation is known but the sample size is small (i.e. below 30), approximate the critical value using the \(t\)-distribution (t-score) and calculate the 95% Confidence Interval as:
\[ \bar{x} \pm qt(0.975, df = (n-1)) \times \frac{\sigma}{\sqrt{n}} \]
If the population standard deviation is unknown and the sample size is large (i.e. above 30), obtain the critical value using the standard normal distribution (i.e. a Z-score) and calculate the 95% Confidence Interval as:
\[ \bar{x} \pm qnorm(0.975, mean = 0, sd = 1) \times \frac{s}{\sqrt{n}} \]
If the population standard deviation \(\sigma\) is unknown and the sample size is small (below 30), but the population is known to be normally distributed, approximate the critical value using the \(t\)-distribution (t-score) and calculate the 95% Confidence Interval as:
\[ \bar{x} \pm qt(0.975, df = (n-1)) \times \frac{s}{\sqrt{n}} \]
1.1 95% Confidence Intervals for Reading Comprehension Scores
A teacher is interested in knowing if Grade 8 students are meeting the expectations for reading ability set by the governing body of the country.
She nominates 15 randomly chosen students who then take a standardized reading and comprehension exam. The average score for this sample of 15 students is 17 out of 32 and the sample standard deviation is 4.2.
Suppose the reading comprehension scores for all students in the country are known to be normally distributed. Which distribution should be used to calculate the Critical Value for 95% Confidence Intervals:
A: The Standard Normal Distribution (ie. obtain the Z-score)
B: The $t$-distribution (ie. obtain the t-score)
# which distribution should we use?
answer_1.1 <- "" #Type either A or B here

test_1.1()
Suppose the teacher now nominates a bigger sample of 45 students randomly chosen to sit for the exam. The governing body has also announced that the national scores are normally distributed with a population standard deviation of 3.
The teacher’s sample of students on average score 18 (out of 32) with a standard deviation of 5. Now use the appropriate function in R (qnorm or qt) to calculate a 95% Confidence Interval for the Mean Reading & Comprehension Score for the entire class:
# use the correct function and formula to calculate the lower and upper bounds of the interval. Think about what 1 - alpha/2 will be.
lower_ci <- 5  # REPLACE 5 WITH THE CORRECT CODE FOR THE LOWER BOUND
upper_ci <- 10 # REPLACE 10 WITH THE CORRECT CODE FOR THE UPPER BOUND

answer_1.2 <- tibble(lower_ci = round(lower_ci, 2), upper_ci = round(upper_ci, 2))

test_1.2()
confidence_interval_analytical <- tibble(lower_ci = 17.26, upper_ci = 18.74)

confidence_interval_analytical
The teacher using the bigger sample obtains the 95% Confidence Interval: (17.2644, 18.7356). She interprets this in the following two ways:
She is 95% confident that the true average reading and comprehension score for the entire class of Grade 8 students is between 17.2644 and 18.7356.
If she drew samples repeatedly from the population of Grade 8 students at her school and calculated 95% confidence intervals using the same methodology, then 95% of such confidence intervals would capture the true mean score for all Grade 8 students at the school.
Note how the first interpretation allows the teacher to infer whether her students are performing on par with the nation. For example, if the national average score is 18, which is contained within the 95% Confidence Interval, she cannot be certain that her students are, on average, performing better, since the interval includes values below 18.
She could try increasing the sample size, which would lower the Standard Error of the point estimate and thus decrease the width of the Confidence Interval, helping her obtain more precise estimates of the true mean score for the entire class.
Obviously, the best approach would be to ask all Grade 8 students at the school to sit for the exam and compute the true mean score for all students. However, this can be both costly and time-consuming, which argues in favor of using small samples to infer claims about the population of Grade 8 students at the school and their reading and comprehension abilities!
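The effect of sample size on precision is easy to quantify. Reusing the announced population SD of 3 from the teacher's scenario (the larger sample sizes below are hypothetical), the margin of error of a 95% interval shrinks as n grows:

```r
# Margin of error = critical value * standard error, for increasing sample sizes
sigma <- 3
for (n in c(45, 100, 500)) {
  margin <- qnorm(0.975) * sigma / sqrt(n)
  cat("n =", n, "-> margin of error:", round(margin, 3), "\n")
}
```

Because the margin scales with \(1/\sqrt{n}\), quadrupling the sample size only halves the interval width, which is why very precise estimates get expensive quickly.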
2. Simulating the Sampling Distribution of Sample Means
You must have noticed that Confidence Intervals are always centered around the sample mean or the point estimate obtained from the population. When we construct a confidence interval, we’re determining a range around the sample mean within which we’re confident the population parameter lies. This range is influenced by the variability of the sample statistic as indicated by the standard error.
The Standard Error describes the variability of sample means around their mean. This implies two things: there must be a distribution of sample means, and this distribution must have a mean/center, called the mean of sample means!
Let’s use R to simulate the distribution of sample means.
# Set the random seed for reproducibility
set.seed(123) #DON'T CHANGE
# Generate 500 reading scores from a normal distribution
reading_scores <- data.frame(scores = rnorm(n = 500, mean = 17.5, sd = 3)) %>%
  mutate(scores = round(scores, 2))

head(reading_scores$scores)
reading_scores$scores is a vector containing the true reading and comprehension scores of all 500 Grade 8 students at a school XYZ.
Next, we’ll repeatedly draw 1000 samples, each of size 50, from the population.
num_samples <- 1000
samples <- data.frame(sample_id = c(), scores = c())

for (i in 1:num_samples) {
  sample_id <- i
  sample_scores <- sample(reading_scores$scores, size = 50, replace = TRUE)

  to_add <- data.frame(sample_id = sample_id, scores = sample_scores)

  samples <- rbind(samples, to_add)
}

head(samples)
We chose to draw 1000 samples, but note that there are in fact infinitely many samples that we could have drawn from the population with replacement!
To obtain the distribution of sample means, we will first calculate the sample mean for each of the 1000 samples.
sampling_dist_mean_scores <- samples %>%
  group_by(sample_id) %>%
  summarise(sample_mean = mean(scores))

head(sampling_dist_mean_scores, 10)
Let’s now plot these sample means as a distribution:
sampling_dist_plot <- sampling_dist_mean_scores %>%
  ggplot(aes(x = sample_mean)) +
  geom_density(fill = "lightblue") +
  geom_vline(xintercept = mean(sampling_dist_mean_scores$sample_mean)) +
  ggtitle("Sampling Distribution of Sample Average Scores (n = 50)") +
  xlab("Average Score from Sample") +
  ylab("Density")

sampling_dist_plot
The black line marks the mean of the sampling distribution (mean of mean scores) which is equal to 17.61.
Let’s now compare the means of the population with the mean of the sampling distribution:
pop_mean <- mean(reading_scores$scores)
sampling_dist_mean <- mean(sampling_dist_mean_scores$sample_mean)

tibble(pop_mean = pop_mean, sampling_dist_mean = sampling_dist_mean)
Note how similar the two means are! It is an important concept in Statistics that the actual sampling distribution of the sample means will be centered approximately at the mean of the population from which it was drawn.
The standard deviation of the sample average scores is:
sd_sampling_dist <- sd(sampling_dist_mean_scores$sample_mean)
pop_sd <- sd(reading_scores$scores)

tibble(sd_sampling_dist = sd_sampling_dist, pop_sd = pop_sd)

2.918379/(sqrt(50))
These two are different but observe that:
\[ \frac{\text{pop_sd}}{\sqrt{n}} = \frac{2.918379}{\sqrt{50}} \approx 0.4127211 \approx \text{sd_sampling_dist} \]
As you might recall, we popularly use \(\frac{\sigma}{\sqrt{n}}\) to compute the Standard Error for the Confidence Interval, which is actually the standard deviation of the sampling distribution of sample means.
Now consider again the distribution of sample estimates we had generated from the population:
head(sampling_dist_mean_scores)
Note that if we compute the 2.5th and 97.5th percentile values of the vector sample_mean, the resulting range would also be considered a 95% Confidence Interval.
lower_ci <- quantile(sampling_dist_mean_scores$sample_mean, 0.025)
upper_ci <- quantile(sampling_dist_mean_scores$sample_mean, 0.975)

conf_interval <- tibble(lower_ci = lower_ci, upper_ci = upper_ci)

conf_interval
Hence, this is a valid 95% Confidence Interval generated by taking repeated samples (1000 in total, each of size 50) out of the original population (500 student scores) and then calculating the 2.5th and 97.5th percentile values of the collection of sample means.
That is to say, 5% of the 1000 sample averages generated from the population fall outside of (16.80755, 18.3585). Equivalently, 95% of the sample average scores fall within the range.
Note that the 95% Confidence Interval we have computed using the repeated sampling technique has different values than the single sample Confidence Interval computed by the teacher. But both are valid!
In fact, the teacher’s confidence interval is less resource-intensive and more practical. It is almost always impossible to obtain the actual sampling distribution of the sample means since (1) the population is usually unknown, and (2) taking all possible repeated samples is extremely unfeasible.
However, the Confidence Interval method involving the use of a single sample draws its theoretical credibility from concepts such as the sampling distribution of sample means and the standard error for the sample mean!
Thinking Critically: When using a single point estimate to draw claims about the population parameter, it is important to also use measures of variability to capture the variability both in the sample observations and between the different possible samples (and estimates) that can be obtained from the population. As you learn more about the Central Limit Theorem in future classes/tutorials, you will see that the theorem can be used to hypothesize what the distribution of sample estimates, i.e. the sampling distribution, will look like. This then governs our choice of Critical Value – whether we use the Z-score (normal distribution) or the t-score (t-distribution).
Hopefully this section helped you understand the ideas behind why we use the Standard Error (or its approximations) for computing Confidence Intervals.
3. Using the Bootstrapping Method for Constructing Confidence Intervals for Population Means
As we discussed earlier, obtaining the actual sampling distribution of sample means is almost impossible in the real world.
But other techniques exist that help us compute Confidence Intervals by going a step further than simply using one single sample mean and one single sample standard deviation.
One such technique, quite popular in Data Science, is the bootstrapping method.
- Take a random and representative sample of an appropriate size (above 30)
- Replicate the sampling process, i.e. draw repeated samples of the same size from the single sample we have – with replacement. This is equivalent to saying: treat the sample as a population, and draw samples with replacement, ideally of the same size as the original sample.
- As usual, calculate the sample means and find the 2.5th and 97.5th percentile values to get a range within which 95% of the sample means obtained lie. This is a valid 95% Confidence Interval for the true mean.
Note: The Bootstrapping Distribution of Sample Means is different from the actual sampling distribution of sample means, if there was a realistic way of obtaining the latter.
The center of the bootstrap distribution is equal to the sample mean of the original sample, not the population mean as was the case for sampling distributions.
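We can verify this centering claim with a quick simulation on a made-up sample (the numbers below are illustrative, not the tutorial's reading scores):

```r
set.seed(42)  # for reproducibility of this illustration
orig_sample <- rnorm(40, mean = 10, sd = 2)

# 2000 bootstrap resamples of the same size, each summarized by its mean
boot_means <- replicate(2000, mean(sample(orig_sample, size = 40, replace = TRUE)))

mean(boot_means)   # close to the ORIGINAL SAMPLE mean...
mean(orig_sample)  # ...which itself only approximates the population mean of 10
```

Whatever sampling error the original sample carries, the bootstrap distribution inherits it: resampling cannot recover information about the population that the sample does not contain.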
Consider the population of 500 Grade 8 students for whom we are interested in obtaining the 95% Confidence Interval of Mean Reading and Comprehension Scores.
Suppose it is unfeasible to ask all the students to take the standardized exam, and thus, 65 students are randomly chosen to sit for it:
set.seed(1234) #DO NOT CHANGE
random_sample <- sample(reading_scores$scores, size = 65)

head(random_sample)
Next, we’ll use the bootstrapping method to generate 1000 samples of size 65 out of this single sample. This is only possible via resampling with replacement.
num_samples <- 1000
bootstrap_samples <- data.frame(sample_id = c(), scores = c())

for (i in 1:num_samples) {
  sample_id <- i
  sample_scores <- sample(random_sample, size = 65, replace = TRUE)

  to_add <- data.frame(sample_id = sample_id, scores = sample_scores)

  bootstrap_samples <- rbind(bootstrap_samples, to_add)
}

nrow(bootstrap_samples) #Total number of observations: 65 * 1000
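As an aside, a loop that grows a data frame with rbind() works but is slow for many iterations. If only the resample means are needed, the same bootstrap can be sketched more compactly with replicate(). This is an alternative formulation, not the tutorial's approach, and the scores vector below is a made-up stand-in for random_sample:

```r
set.seed(1)  # arbitrary seed for this illustration
scores <- round(rnorm(65, mean = 17.5, sd = 3), 2)  # stand-in for random_sample

# 1000 bootstrap resample means in one line, no rbind() needed
boot_means <- replicate(1000, mean(sample(scores, size = 65, replace = TRUE)))

quantile(boot_means, c(0.025, 0.975))  # percentile 95% CI, as before
```

In the notebook you would call replicate() on the existing random_sample vector instead of defining the stand-in.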
Next, let’s calculate the sample means for each of the 1000 samples:
bootstrap_sampling_dist <- bootstrap_samples %>%
  group_by(sample_id) %>%
  summarise(sample_mean = mean(scores))

head(bootstrap_sampling_dist, 10)
To obtain the 95% Confidence Interval for the Population Mean Score (of 500 students), we’ll obtain the 2.5th and 97.5th percentile values from the column sample_mean of bootstrap_sampling_dist.
conf_interval_bootstrap <- tibble(
  lower_ci = quantile(bootstrap_sampling_dist$sample_mean, 0.025),
  upper_ci = quantile(bootstrap_sampling_dist$sample_mean, 0.975)
)

conf_interval_bootstrap
This is considered a valid 95% Confidence Interval for the true mean score of the 500 Grade 8 students. Notably, we had obtained only a single sample of a manageable size and then used the bootstrapped distribution of sample means to obtain the 2.5th and 97.5th percentile values for the sample means.
Let’s now compare the three Confidence Intervals obtained via:
- Single Sample using the analytical approach
- Repeated Sampling from Original Population using R
- Bootstrapped Sampling from Single Sample using R
library(tibble)

comparison_table <- tibble(
  analytical_method = paste("[", confidence_interval_analytical$lower_ci, ", ", confidence_interval_analytical$upper_ci, "]", sep = ""),
  sampling_dist_method = paste("[", conf_interval$lower_ci, ", ", conf_interval$upper_ci, "]", sep = ""),
  bootstrap_method = paste("[", conf_interval_bootstrap$lower_ci, ", ", conf_interval_bootstrap$upper_ci, "]", sep = "")
)

print(comparison_table)
Contrary to what happens in many real-world scenarios, we have access to the population data and the population mean (since we had simulated the 500 scores).
# Calculate the population mean
pop_mean <- mean(reading_scores$scores)

comparison_table <- comparison_table %>%
  mutate(pop_mean = pop_mean)

comparison_table
All of the 95% Confidence Intervals were able to capture the true population mean score. However, note how they differ in terms of the width and the values for the upper and lower bounds.
4. Conclusion
In Part 1, we introduced confidence intervals using a very simple example – something you’ve probably crossed paths with in your basic statistics classes. We took one sample of test scores and used the analytical method, together with assumptions about the underlying population, to compute a single 95% confidence interval.
Recall how we used both the standard error (based on the sample standard deviation and sample size) and the critical value (obtained from the \(t\)-distribution) to get the margin of error? The idea of variability around the mean is thus essential to constructing a 95% confidence interval.
We introduced an important concept in Part 2 - the theoretical framework of the Sampling Distribution of Sample Means. Easily imagined as the distribution of all possible sample means that can be generated from the original population, this distribution of sample means is useful for both Confidence Intervals construction and Hypothesis Testing, as you will learn in further tutorials.
Part 3 is where the real fun began. We took note of the fact that obtaining the actual sampling distribution is unfeasible in the real world. So, we instead used a popular method used in the Data Sciences – the bootstrapping method. We essentially “replicated” samples out of a single sample, generating a sampling distribution that is centered at the original sample’s mean. We then used this “bootstrapped” sampling distribution to get the 2.5th and 97.5th percentile values for the average sample scores to finally obtain the 95% Confidence Interval.
Lastly, we compared the three intervals and found that they all captured the true mean. While this might not always be true – due to the variability (randomness, luck, chance, etc.) inherent in the sampling techniques involved – all three methods discussed lead to valid Confidence Intervals, provided the underlying assumptions and requirements discussed previously hold!
And, that wraps up this tutorial.