# importing packages
library(tidyverse)
library(haven)
library(ggplot2)
# load self-tests
source("beginner_confidence_intervals_tests.r")
# loading the dataset
<- read_dta("../datasets_beginner/01_census2016.dta")
census_data
# cleaning the dataset
<- filter(census_data, !is.na(census_data$wages))
census_data <- filter(census_data, !is.na(census_data$mrkinc))
census_data <- filter(census_data, census_data$pkids != 9) census_data
1.3.2 - Beginner - Confidence Intervals
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to R
- Introduction to Visualization
- Central Tendency
- Distribution
- Dispersion and Dependence
Outcomes
After completing this notebook, you will be able to:
- Interpret and report confidence intervals
- Calculate confidence intervals under a variety of conditions
- Understand how the scope of sampling impacts confidence intervals
References
Introduction
So far, we have developed a strong grasp of core concepts in statistics. We’ve learned about measures of central tendency and variation, as well as how these measures relate to distributions. We have also learned about random sampling and how sampling distributions can shed light on the parameters of a population distribution.
So, how can we apply this knowledge to real empirical work? In this notebook, we will learn about a key concept which relates to how we report our results empirically when sampling from a population. This is the idea of a confidence interval.
Part 1: Introducing Confidence Intervals
A confidence interval is an estimate that gives us a range of values within which we expect a population parameter to fall. Put another way, it provides a range within which we can have a certain degree of confidence that a desired parameter, such as a population mean, lies.
This is in contrast to a point estimate, which is a specific estimated value of another object, like a population parameter. The point estimate of the population mean is the sample mean and the point estimate of the population standard deviation is the sample standard deviation.
Let’s make this concrete with an example.
Let’s say we’re interested in finding the mean GPA of undergraduate students at universities across Canada.Instead of collecting the GPA of every single undergraduate student in the country without error, we collect a sample of students and find the mean of their GPAs (the sample mean). However, due to sampling variability, our sample mean will probably not be exactly equal to the population mean. It would be better to provide a range of values in which we think the population mean lies. This is where confidence intervals become useful - instead of just providing the sample mean, confidence intervals allow us to combine information about central tendency and dispersion into a single object.
Confidence levels
The confidence interval describes a range in which we think the population parameter lies. To calculate this confidence interval, we must choose a confidence level. The confidence level indicates the probability that the parameter lies within our confidence interval.
A higher confidence level means greater certainty that our confidence interval serves as a good range of values for the population parameter of interest.
The most commonly chosen confidence level is 95%, but other percentages (90%, 99%) are also used.
If we choose the 95% confidence level for our mean GPA example, we could say that we are 95% confident that the true mean GPA of all Canadian undergraduates lies in the range of the confidence interval (which we’ll learn to calculate below). Being 95% confident means that if we drew random samples of undergraduate students (e.g., 1000 samples), got different sample means for each sample (1000 sample means), and calculated confidence intervals from those sample means (1000 confidence intervals), we would expect 95% of the confidence intervals (950 confidence intervals) to contain the actual average GPA of all Canadian undergraduates.
Calculating confidence intervals
The official representation of a confidence interval is the following:
\[ (\text{point estimate} - \text{margin of error}, \text{point estimate} + \text{margin of error}) \]
or
\[ \text{point estimate} \pm \text{margin of error} \]
We add and subtract a margin of error from our point estimate to find the lower bound and upper bound of our confidence interval estimate. The calculation for the margin of error varies depending on the sample statistic. Let’s look at a few important special cases: the confidence intervals for the sample mean, sample proportion, and sample variance.
Part 2: Confidence Intervals for the Sample Mean
To construct a confidence interval for a sample mean (e.g., the mean GPA of a sample of undergraduates), we must meet the following three conditions:
Random sampling
The sampling distribution of the sample means must be approximately normal, either because
- The original population is normally distributed
- The sample size is larger than 30 (invokes the Central Limit Theorem)
Our sample observations must be independent either because
- We sample with replacement (when we record an observation, we put it back in the population with the possibility of drawing it again)
- Our sample size is less than 10% of the population size
If each of conditions 1-3 are met, we are able to construct a valid confidence interval around our sample mean. There are two different cases for this construction.
Case 1: when we know the population standard deviation
In rare instances when we may know the standard deviation of the population of interest, we use the following formula to calculate the confidence interval:
\[ \bar x \pm z_{\alpha / 2} \cdot \frac{\sigma}{\sqrt n} \]
where
- \(\bar x\) is the sample mean
- \(z\) is the critical value (from the standard normal distribution)
- \(1-\alpha\) is the confidence level
- \(\sigma\) is the population standard deviation
- \(n\) is the sample size
Note: this case is extremely rare as it requires us to know the standard deviation but not the mean of a population! Typically we either know both the mean and standard deviation of the population or we know neither.
Case 2: when we don’t know the population standard deviation
In this case, we estimate the population standard deviation with the sample standard deviation and we use the \(t\)-distribution to calculate the margin of error. Otherwise, the calculation procedure follows exactly as in Case 1.
\[ \bar x \pm t_{\alpha / 2} \cdot \frac{s}{\sqrt n} \]
where
- \(\bar x\) is the sample mean
- \(t\) is the critical value (from the \(t\)-distribution)
- \(1-\alpha\) is the confidence level
- \(s\) is the sample standard deviation
- \(n\) is the sample size
To illustrate, let’s construct a 95% confidence interval for the sample mean of the variable wages
. We can use the function mean()
to calculate its mean, which will serve as our point estimate.
# calculating the sample mean of wages
<- mean(census_data$wages)
sample_mean sample_mean
Now that we have this point estimate, we can calculate our margin of error around it. To do so, we must first find
- The \(t\) value corresponding to a 95% confidence level
- The sample standard deviation of
wages
- The sample size (the number of observations recorded for
wages
)
# finding the sample size and associated degrees of freedom
<- nrow(census_data)
n <- n - 1
df
# finding the t value for a confidence level of 95%
<- qt(p = 0.05, df = df)
t
# finding the sample standard deviation of wages
<- sd(census_data$wages)
s
# calculating the lower and upper bounds of the desired confidence interval
<- sample_mean - (t*s/sqrt(n))
lower_bound <- sample_mean + (t*s/sqrt(n))
upper_bound
lower_bound upper_bound
We are 95% confident that the mean wage of all Canadians ranges between \(55708\) and \(55293\). We also know this is a valid confidence interval estimate because our wages
variable and the procedure for sampling meets all of the three criteria outlined:
- Random sampling: Statistics Canada (the source for this data) utilizes random sampling
- Our sample size is \(n > 30\) and thus we don’t even need to check the distribution of
wages
- Our observations are independent because our sample size \(n\) is less than 10% of the total population (the total population of Canada is about 38 million)
Relative to the large value of the sample mean, this confidence interval is very narrow. This means our confidence interval estimate is very precise.
Test your knowledge
Matilda takes a random sample of 20 books from a library in order to estimate the average number of pages among all books in the library.
Does it make more sense for Matilda to use a standard normal distribution or a t-distribution to calculate the margin of error for her confidence interval?
# replace "..." by your answer for "z" or "t" here
<- "..."
answer_1
test_1()
Matilda finds a sample mean of 280 pages and sample variance of 400 pages. She wants to construct a 90% confidence interval for her sample mean. What will be the upper and lower bounds of this interval (assuming it’s a valid confidence interval)?
# here are the values you need to calculate
<-
sample_mean <-
s <-
n <- n-1 # this is given to help you out
df <- t
<- # your answer for the lower bound here, rounded to the nearest integer
lower_bound <- # your answer for the upper bound here, rounded to the nearest integer
upper_bound
<- lower_bound
answer_2 <- upper_bound
answer_3
test_2()
test_3()
Part 3: Confidence Intervals for the Sample Proportion
Similar to the sample mean, we can also calculate confidence intervals for sample proportions.
Suppose a population must vote for one of the two political parties (A or B), and we want to find out the proportion of the population that voted for party A.
To do that we must
- Collect a random sample and calculate the corresponding sample proportion
- Estimate a confidence interval for the sample proportion
Just like in the case of the sample mean, confidence intervals for sample proportions must satisfy three conditions:
Random sampling
The sampling distribution of the sample means must be approximately normal, because
- There are at least 10 “successes” and 10 “failures” in our sample (e.g., at least 10 people in our sample voted for party A and at least 10 people voted for party)
Our sample observations must be independent either because
- We sample with replacement (when we record an observation, we put it back in the population with the possibility of drawing it again)
- Our sample size is less than 10% of the population size
If conditions 1-3 are all met, we are able to construct a valid confidence interval around our sample proportion. There is a single case for calculating the margin of error and confidence interval for sample proportions.
Case: when we don’t know the population standard deviation
When we don’t know the standard deviation of the population, we use the following formula to construct confidence intervals for sample proportions:
\[ \hat P \pm z_{\alpha / 2} \cdot \sqrt \frac {\hat P \cdot(1 - \hat P)}{n} \]
where - \(\hat P\) is the sample proportion - \(z\) is the critical value (from the standard normal distribution) - \(1-\alpha\) is the confidence level - \(n\) is the sample size
Think deeper: why is there only one case for confidence intervals of sample proportions?
Let’s calculate a 95% confidence interval for the sample proportion of the variable pkids
from our census_data
. pkids == 1
for when the respondent lives in a household with kids, and pkids == 0
otherwise. We can calculate our sample proportion, which serves as our point estimate for the confidence interval.
# calculating our sample proportion of observations with pkids == 1
<- sum(census_data$pkids == 1) / nrow(census_data)
p p
Now that we have our sample proportion, we can find our \(z\) critical value for a 95% confidence level and calculate our confidence interval.
# finding the z value for a confidence level of 95%
<- qnorm(p = 0.05, lower.tail = FALSE) # lower.tail = FALSE gives us the right tail of the distribution
z
# calculating the lower and upper bounds of the desired confidence interval
<- nrow(census_data)
n <- p - z*sqrt(p*(1-p)/n)
lower_bound <- p + z*sqrt(p*(1-p)/n)
upper_bound
lower_bound upper_bound
From our above calculations, we can say that we are 95% confident that the true proportion of Canadians with a child in their household ranges between \(70.75\%\) - \(71.04\%\).
Note: In rare cases when our sample proportion estimate is either too high or too low, we may find that the confidence interval may surpass the domain [0,1]. When that happens we can cap the confidence interval at 0 or 1 and make a note about it when reporting the results.
Test your knowledge
Matilda now wants to know the proportion of students in her school who are left-handed. Let’s assume her sampling procedure meets all of the criteria for constructing a valid confidence interval. She takes a sample of 200 students and finds that 22 of them are left-handed. What is the upper and lower bound of a 98% confidence interval for the proportion of the school’s overall student body that are left-handed?
# your code here
<- # your answer for the lower bound here, rounded to 3 decimal places (in proportion form, i.e. 10% = 0.1)
lower_bound <- # your answer for the upper bound here, rounded to 3 decimal places (in proportion form, i.e. 10% = 0.1)
upper_bound
<- lower_bound
answer_4 <- upper_bound
answer_5
test_4()
test_5()
Let’s imagine that our sample size and confidence level are fixed and cannot be changed. What sample proportion of students who are left-handed would result in the smallest valid confidence interval possible?
# enter your answers in the vector below in ascending order (smaller number first)
<- c(...,...) # your answer for the sample proportions here (i.e. 10% = 0.1)
answer_6 test_6()
Part 4: Confidence Intervals for the Sample Variance
If for some reason we want to estimate the standard deviation of a population, we could also create confidence intervals for the sample variance. The logic is the same as confidence intervals explained above, but instead of the sample mean or proportion, our point estimate is the sample variance. The conditions needed to construct a confidence interval for the sample variance are the same as the sample mean. We restate them below.
Random sampling
The sampling distribution of the sample means must be approximately normal, either because
- The original population is normally distributed
- The sample size is larger than 30 (invokes the Central Limit Theorem)
Our sample observations must be independent either because
- We sample with replacement (when we record an observation, we put it back in the population with the possibility of drawing it again)
- Our sample size is less than 10% of the population size
If conditions 1-3 are all met, we are able to construct a valid confidence interval for our sample variance.
Case: when we don’t know the population standard deviation
We only need worry about this case when calculating confidence intervals for the sample variance. If we knew the population standard deviation, we would also know the population variance and therefore not need to construct a confidence interval to estimate this number.
The formula for confidence intervals works a bit differently for the sample variance: instead of adding and subtracting a margin of error to our point estimate, we will use our point estimate to calculate the lower and upper bounds of our confidence interval directly.
\[ (\frac{(n - 1) \cdot s^2}{\chi^2_{\alpha/{2}}}, \frac{(n - 1) \cdot s^2}{\chi^2_{1 - \alpha/{2}}}) \]
where
- \(n\) is the sample size
- \(s^2\) is the sample variance
- \(\chi^2\) is the critical value from the chi-squared distribution with \(n - 1\) degrees of freedom
- \(1 - \alpha\) is the confidence level
Note: Constructing this type of confidence interval is different from previous instances with the sample mean and sample proportion. This is because, unlike the sample mean and sample proportion, the sample variance follows a non-normal distribution: the \(\chi^2\) distribution.
Let’s construct a 95% confidence interval for the sample variance of mrkinc
. Our procedure will follow exactly the steps above. First, let’s calculate the sample variance:
# calculating the variance of mrkinc
<- var(census_data$mrkinc)
var var
Now that we have our sample variance let’s find the other statistics necessary to calculate our confidence interval estimate.
# finding sample size
<-nrow(census_data)
n
# finding the chi-squared values for a 95% confidence level and n - 1 degrees of freedom
<- qchisq(p = 0.05, df = (n - 1), lower.tail = TRUE)
upper_chi <- qchisq(p = 0.05, df = (n - 1), lower.tail = FALSE)
lower_chi
# calculating the upper and lower bounds of the desired confidence interval
<- ((n - 1)*var)/lower_chi
lower_bound <- ((n - 1)*var)/upper_chi
upper_bound
lower_bound upper_bound
Therefore, we are 95% confident that the variance of market income among all Canadians is within (\(7677615858\), \(7745209769\)). This is quite a large interval, but given the size of the variance for this variable, it seems reasonable.
Test your knowledge
Matilda wants to know the variance of weights of all cars sold at her father’s car dealership. She takes the steps below.
- She takes a random sample of 40 cars and records their weights
- She finds that they have a sample mean weight of 5,000 pounds and a sample variance of 250,000
Matilda wants to construct a 95% confidence interval estimate for the population variance. What are the upper and lower bounds of this confidence interval?
# write your code here
# here are the variables you need to calculate
<-
n <-
var
<-
lower_chi <- upper_chi
<- # your answer for the lower bound here, rounded to the nearest whole number
lower_bound <- # your answer for the upper bound here, rounded to the nearest whole number
upper_bound
<- lower_bound
answer_7 <- upper_bound
answer_8
test_7()
test_8()
Let’s now say that Matilda draws a new random sample of 40 cars and reports with 95% confidence that the population variance of car weights falls within the confidence interval (490000, 640000). Under this sampling procedure, what is the 95% confidence interval estimate for the standard deviation of weights of all cars ever sold at the dealership?
<- # your answer for the lower bound here
answer_9 <- # your answer for the upper bound here
answer_10
test_9()
test_10()
Part 5: What Affects the Width of Confidence Intervals?
We can see that no matter the parameter we are estimating, we always need to specify:
- The confidence level
- The sample size
Let’s explore what happens to our confidence intervals when we change each of these numbers.
Changing the sample size
Let’s say we want to change our sample size \(n\).
If we increase our sample size \(n\), both our margin of error and confidence interval will decrease. That happens because a larger sample makes our estimates more precise.
If we decrease our sample size \(n\), both our margin of error and confidence interval will increase. That happens because a smaller sample makes our estimates less precise.
To see this point interactively, run the code below in which we change the input \(n\) for our function to create 95% confidence intervals. We can see that the size of the confidence intervals increase or decrease depending on whether we decrease or increase the simulated sample size.
# simulating data
<- rnorm(10000, 0, 1)
population
# defining a function which outputs a confidence interval for the sample mean given an input `n` for the sample size
<- function(n) {
create_confidence_intervals = mean(sample(population, n))
x = qnorm(p = 0.05, lower.tail=FALSE)
z = x - (z*1/sqrt(n))
lower = x + (z*1/sqrt(n))
upper = data.frame(lower, upper)
df return(c(lower, upper))
}
# confidence intervals with sample size 10
<- create_confidence_intervals(10)
cf_10 <- cf_10[2] - cf_10[1]
difference_between_bounds
cf_10 difference_between_bounds
# confidence intervals with sample size 20
<- create_confidence_intervals(20)
cf_20 <- cf_20[2] - cf_20[1]
difference_between_bounds
cf_20 difference_between_bounds
# confidence intervals with sample size 100
<- create_confidence_intervals(100)
cf_100 <- cf_100[2] - cf_100[1]
difference_between_bounds
cf_100 difference_between_bounds
# confidence intervals with sample size 10000
<- create_confidence_intervals(10000)
cf_10000 <- cf_10000[2] - cf_10000[1]
difference_between_bounds
cf_10000 difference_between_bounds
# Try it yourself!
# create_confidence_intervals(...)
Changing the confidence level
If we increase the confidence level to a higher percentage, then the new confidence interval will be wider.
If we decrease the confidence level to a lower percentage, then the new confidence interval will be narrower.
The logic is simple: to be more confident that our confidence interval actually does contain the true value of the population parameter, we have to increase its range of values.
This all occurs mathematically through an increase/decrease in our margin of error (or bounds) due to the increase/decrease in our \(z\) or \(t\) critical values.
To see this point interactively, modify the code below by changing the input for \(\alpha\). We can see that the vertical length (width) of the confidence intervals increase or decrease depending on whether we increase or decrease the simulated confidence level.
<- rnorm(10000, 0, 1)
population
# defining a function which outputs a confidence interval for a given confidence level
<- function(alpha) {
create_confidence_intervals = mean(sample(population, 100))
x = qnorm(p = alpha, lower.tail=FALSE)
z = x - (z*1/sqrt(100))
lower = x + (z*1/sqrt(100))
upper = data.frame(lower, upper)
df return(c(lower, upper))
}
# confidence intervals with significance level 0.01
<- create_confidence_intervals(0.01)
cf_1_pct <- cf_1_pct[2] - cf_1_pct[1]
difference_between_bounds
cf_1_pct difference_between_bounds
# confidence intervals with significance level 0.05
<- create_confidence_intervals(0.05)
cf_5_pct <- cf_5_pct[2] - cf_5_pct[1]
difference_between_bounds
cf_5_pct difference_between_bounds
# confidence intervals with significance level 0.1
<- create_confidence_intervals(0.1)
cf_10_pct <- cf_10_pct[2] - cf_10_pct[1]
difference_between_bounds
cf_10_pct difference_between_bounds
# confidence intervals with significance level 0.2
<- create_confidence_intervals(0.2)
cf_20_pct <- cf_20_pct[2] - cf_20_pct[1]
difference_between_bounds
cf_20_pct difference_between_bounds
# Try it yourself!
# create_confidence_intervals(...)
Test your knowledge
Matilda thinks that one of her confidence intervals above is too wide and wishes to narrow it. What could she do in order to achieve this goal?
- increase the sample size and increase the confidence level
- decrease the sample size and decrease the confidence level
- increase the sample size and decrease the confidence level
- decrease the sample size and increase the confidence level
<- "..." # your answer here
answer_11
test_11()
Part 6: Common Misconceptions
Up to this point, we’ve covered what confidence intervals are, how we calculate them, and how they’re sensitive to two key parameters. To wrap up, let’s clarify a couple of misconceptions about the interpretation of confidence intervals.
Misconception 1:
If we have a 95% confidence interval, this is a concrete range under which our estimated population parameter must fall.
If we repeated our sampling procedure many times and constructed a confidence interval each time, we would expect about 95% of these confidence intervals to contain our true parameter. Note that this does not imply that the confidence interval we calculated contains the true parameter. Since about 5% of our confidence intervals will not contain the true parameter, nothing prevents the confidence interval we calculate from being one of those 5% that don’t contain the parameter.
Main takeaway is that the confidence interval is an estimator and not an official range of possible values for the population parameter.
Misconception 2:
If we have a confidence level of 95%, 95% of our population data must lie within the calculated confidence interval.
This is not true since our confidence level indicates the long run percentage of constructed confidence intervals which contain our true parameter but says nothing about the spread of our actual data. To find the range within which 95% of our data lie, we must consult a histogram for the population and calculate percentiles.
For instance, if our data is quite bimodal distributed (around half of our data is clustered far to the left of our mean, and the other half is clustered far to the right of our mean), our calculated 95% confidence interval will likely contain very little (much less than 95%) of the data.
The confidence level does not determine the spread of the actual data
Misconception 3:
If we have a confidence level of 95%, a confidence interval calculated from a sample of 500 observations will more likely contain the true parameter than a confidence interval calculated from a sample of 100 observations.
We know from the previous section that a confidence interval generated from the sample \(n = 500\) will be narrower than one generated from \(n = 100\). However, a confidence level by definition is the percentage of calculated intervals we expect to contain the true parameter of interest if we calculated these intervals over and over. This means any one interval from a sample of \(n = 100\) has a 95% chance of containing the true parameter, just as any one interval from a sample of \(n = 500\) has a 95% chance of containing the true parameter. Each interval (the wider one from \(n = 100\) and narrower one from \(n= 500\)) has a chance of containing the true parameter relative to all other calculated intervals for that same sample size.
Hence, whether we have an interval from a sample of \(n = 100\) or \(n = 500\), we are still 95% confident in both cases that the true parameter lies within that interval. The probability of a given interval containing the true parameter is not affected by the sample size. This probability only changes when we change our confidence level.
- 🟠 Every research context will drastically shape how confidence intervals are approached. As we have seen, the volume and quality of data affect how accurate data analyses can be, and many rules of thumb in data science are simply that - rules of thumb, as opposed to hard facts about how to report statistics.
- 🟠 What are some situations where you want to know that something is true with nearly 100% confidence?
- 🟠 What are some situations where the uncertainty of statistic is maybe not so bad?
- 🟠 What does it really mean to have something within or outside of a confidence interval?