source("testing_confidence_intervals.r")
# importing typical packages
library(tidyverse)
library(haven)
library(ggplot2)
# loading the dataset
<- read_dta("../datasets/01_census2016.dta")
census_data
# cleaning the dataset
<- filter(census_data, !is.na(census_data$wages))
census_data <- filter(census_data, !is.na(census_data$mrkinc))
census_data <- filter(census_data, census_data$pkids != 9) census_data
03 - ECON 325: Confidence Intervals
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to R
- Introduction to Visualization
- Central Tendency
- Distribution
- Dispersion and Dependence
Outcomes
After completing this notebook, you will be able to: * Interpret and report confidence intervals * Calculate confidence intervals under a variety of conditions * Understand how the scope of sampling impacts confidence intervals
References
Introduction
So far, we have developed a strong grasp of core concepts in statistics. We’ve learned about measures of central tendency and variation, as well as how these measures relate to distributions. We have also learned about random sampling and how sampling distributions can shed light on the parameters of a population distribution.
So, how can we apply this knowledge to real empirical work? While another notebook that covers hypothesis testing will provide a deeper answer to this question, this current notebook provides a starting point. In this notebook, we will learn about a key concept which relates to how we report our results empirically when sampling from a population. This is the idea of a confidence interval.
Confidence Intervals and Point Estimates
A confidence interval is an estimate that gives us a range of values within which we expect a population parameter to fall. Put another way, it provides a range within which we can have a certain degree of confidence that a desired parameter, such as a population mean, lies.
This is in contrast to a point estimate. A point estimate is a specific estimated value of another object, like a population parameter. For instance, the sample mean and sample standard deviation are point estimates of the population mean and the population standard deviation (respectively).
Let’s make this concrete with an example.
Example
Suppose that we want to know the mean GPA of undergraduate students at universities across Canada. Finding this exact mean (the population mean) would require us to collect the GPA of every single undergraduate student in the country without error: a nearly impossible task. As a result, and as we have already seen, we collect a sample of students and find the mean of their GPAs (the sample mean). This allows us to make inferences about the desired, yet unobtainable, population mean. The sample mean we find is called our point estimate.
However, as we already learned, this sample mean will be different every time we draw a different random sample of undergraduate students. Suppose one random sample we draw just so happens to include many high-achieving students, while another does not. The first sample will give us a very high point estimate for the population mean, while the second sample will give us a point estimate much lower.
So our question becomes: how can we report an estimate for the population mean GPA if we draw a different mean GPA for every possible sample? This is where confidence intervals become useful. They allow us to combine information about central tendency and dispersion into a single object.
Confidence Levels
As we will see, every time we draw a sample and get a new point estimate, we can compute a confidence interval to describe the precision of this estimate. To calculate this confidence interval, we always must choose a confidence level. This is a percentage which represents the long-run percentage of confidence intervals within which we would expect to actually find our desired population parameter.
We choose a higher confidence level when we want to have more certainty in our confidence interval serving as a good estimate for the population parameter of interest. The most commonly chosen confidence level is 95%, but other values are also common.
In our GPA example, this means that if we drew random samples of undergraduate students 1000 different times and got 1000 sample mean point estimates and corresponding confidence intervals, we would expect 950 of these confidence intervals to contain the actual average GPA of all Canadian undergraduates.
Of course, we cannot conclusively find the desired population mean to prove this; however, choosing a high confidence level gives us more certainty that any one of our hypothetical confidence intervals (including the one we actually calculate from our specific sampling) includes our unknowable parameter of interest. When we choose a confidence level of 95% and calculate a confidence interval around our sample mean point estimate, we say that we are 95% confident that the true mean GPA of all Canadian undergraduates lies in this range.
Let’s now see how we actually calculate a confidence interval for a given point estimate and confidence level.
Calculating Confidence Intervals
The official representation of a confidence interval is the following:
\[ (\text{point estimate} - \text{margin of error}, \text{point estimate} + \text{margin of error}) \]
or
\[ \text{point estimate} \pm \text{margin of error} \]
where our point estimate is just the sample statistic we find from our random sample. The margin of error is the more cumbersome piece to calculate. It is subtracted and added from our point estimate to find the lower bound and upper bound of our confidence interval estimate. While this general formula and format for the confidence interval always holds, calculating the margin of error varies depending on what sample statistic we are looking at and what we know about our population. Let’s look at few important special cases.
Confidence Intervals for the Sample Mean
When we want to construct a confidence interval for a sample mean we’ve found (e.g. the mean GPA of a sample of Canadian undergraduates), we must first reflect on how we gathered our data and what its sampling distribution looks like. This is because we must meet the following three conditions in order to construct a valid confidence interval for a sample mean in the first place: > 1. We must have a random sample (typically found through simple random sampling) 2. The sampling distribution of the sample means is approximately normal, which can be met through one of the following ways:
a. our original population is normally distributed
b. our sample size is > 120 (invokes the Central Limit Theorem) 3. Our sample observations must be independent, for instance:
a. we sample with replacement (when we record an observation, we put it back in the population with the possibility of drawing it again)
b. our sample size is < 10% of the population size
If each of conditions 1-3 are met, we are able to construct a valid confidence interval around our sample mean point estimate. There are two different cases for this construction, but they’re pretty similar.
Case 1: We Know the Population Standard Deviation
In rare instances, we may know the variance (and thus standard deviation) of our original population of interest. In this case, we are able to consult our trustworthy \(z\)-statistic when calculating the margin of error in our confidence interval. We use the following formula to calculate the confidence interval for a sample mean when our population standard deviation is known:
\[ \bar x \pm z_{\alpha / 2} \cdot \frac{\sigma}{\sqrt n} \]
where \(\bar x\) is the sample mean, \(z\) is the critical value (from the standard normal distribution) for a chosen confidence level \(1-\alpha\), \(\sigma\) is the population standard deviation, and \(n\) is the sample size.
However, this case is extremely rare. After all, it requires us to know the standard deviation but not the mean of a population! Typically we either have very good information on a population (and thus know both its mean and standard deviation) or we don’t (and thus don’t know either its mean or standard deviation). Nonetheless, it is good to keep this case in mind since it occasionally comes up.
Case 2: We Don’t Know the Population Standard Deviation
Much more frequently, we won’t know the population standard deviation. In this case, we must instead invoke the \(t\)-distribution when calculating the margin of error for our confidence intervals. We also use the sample standard deviation in place of the population standard deviation, since we do know this statistic. The calculation procedure otherwise follows exactly as before in Case 1.
\[
\bar x \pm t_{\alpha / 2} \cdot \frac{s}{\sqrt n}
\]
where \(\bar x\) is the sample mean, \(t\) is the critical value (from the \(t\)-distribution) for a chosen confidence level \(1-\alpha\), \(s\) is the sample standard deviation, and \(n\) is the sample size.
We can use an example from our dataset to emphasize this point. Let’s construct a 95% confidence interval for the sample mean of the variable wages
. We can immediately calculate its mean, which serves as our sample mean point estimate.
# calculating the sample mean of wages
<- mean(census_data$wages) x
Now that we have this point estimate, we can calculate our margin of error around it. To do so, we must first find 3 other statistics: the \(t\) value corresponding to a 95% confidence level, the standard deviation of wages
, and the sample size (the number of observations recorded for wages
).
# finding the sample size and associated degrees of freedom
<- nrow(census_data)
n <- n - 1
df
# finding the t value for a confidence level of 95% (noticing this value converges on the z value as so we could have used this too)
<- qt(p = 0.05, df = df)
t
# finding the sample standard deviation of wages
<- sd(census_data$wages)
s
# calculating the lower and upper bounds of the desired confidence interval
<- x - (t*s/sqrt(n))
lower_bound <- x + (t*s/sqrt(n))
upper_bound
lower_bound upper_bound
In a formal setting, we would thus report the following: We are 95% confident that the mean wage of all Canadians ranges between \(54274\) and \(54690\) . We also know this is a valid confidence interval estimate because our wages
variable and the procedure for sampling it meets all of the three criteria outlined: Statistics Canada (the source for this data) utilizes random sampling, our sample size is \(n > 30\) and thus we don’t even need to check the distribution of wages
, and our sample size \(n\) is < 10% of the total population (since the total population of Canada is about 38 million). This is a very small confidence interval. We can understand this by looking at our formula for calculating it and realizing that our sample size, \(n\), is very large. This adds precision to our confidence interval estimate, highlighted in the narrowness of the interval we found above.
Exercise
Matilda (she/her) takes a random sample of 10 books from a library in order to estimate the average number of pages among all books in the library. Let’s assume the library is very large and the library does not keep record of the specifics of its overall population of books in terms of their pages. Does it make more sense for Matilda to use a standard z distribution or student’s t distribution when calculating the margin of error for her confidence interval?
<- "X" # your answer for "z" or "t" in place of "X"
answer_1
test_1()
From her sample, Matilda finds a sample mean of 280 and sample variance of 400. She wants to construct a 90% confidence interval to estimate the population mean number of pages. What will be the upper and lower bounds of this interval (assuming its a valid confidence interval)?
# your code here
<- # your answer for the lower bound here, rounded to 2 decimal places
answer_2 <- # your answer for the upper bound here, rounded to 2 decimal places
answer_3
test_2()
test_3()
Confidence Intervals for the Sample Proportion
While we’ve looked at the example of mean GPA throughout this notebook, we can also calculate confidence intervals for sample proportions as well. Imagine we have just two political parties (A and B), compulsory voting, and we want to know the proportion of the population that voted for party A. Of course, this would be quite costly and time-consuming to calculate, so we instead collect a sample and corresponding sample proportion. This sample proportion becomes our point estimate around which we construct a confidence interval to estimate the population proportion with a certain degree of confidence. Before we begin constructing this interval, we must make sure that our sampling process again satisfies three conditions; this time, however, the second of these three conditions will be different:
- We must have a random sample (typically found through simple random sampling)
- The sampling distribution of the sample proportions is normally distributed, which typically requires there to be at least 10 “successes” and 10 “failures” in our sample (in the example above, this would mean at least 10 people in our sample voted for party A and at least 10 people voted for party B). We can see here that very small sample sizes (i.e. \(n = 5\), \(n = 10\), etc. will fail this condition)
- Our sample observations must be independent, which can be met through one of the following two ways:
- we sample with replacement (when we record an observation, we put it back in the population with the possibility of drawing it again)
- our sample size is < 10% of the population size
If conditions 1-3 are all met, we are able to construct a valid confidence interval around our sample proportion point estimate. We now turn to the one case we must consider when calculating the margin of error and confidence interval for sample proportions.
The Only Case: We Don’t Know the Population Standard Deviation
This is the only case we have to worry about when calculating the sample proportion. This is because the population standard deviation is necessarily a function of the population proportion and sample size \(n\). Thus, if we knew the population standard deviation, we would necessarily know the population proportion and there would be no point in sampling and constructing confidence intervals to estimate it! For this reason, we worry about only the one case where we don’t know the standard deviation of the population. The formula for the confidence interval of a sample proportion is below:
\[ \hat P \pm z_{\alpha / 2} \cdot \sqrt \frac {\hat P \cdot(1 - \hat P)}{n} \]
where \(\hat P\) is the sample proportion, \(z\) is the critical value (from the standard normal table) for a chosen confidence level \(1-\alpha\), and \(n\) is the sample size.
Let’s do an example. Let’s calculate a 95% confidence interval for the sample proportion of the census dataset who has one or more kids in their household (pkids == 1
). We can immediately calculate our sample proportion, which serves us our point estimate.
# calculating our sample proportion of observations with pkids == 1
<- sum(census_data$pkids == 1) / n
p p
Now that we have our sample proportion, we can find our \(z\) critical value for a 95% confidence level, as well as use our sample proportion \(\hat{p}\) and sample size \(n\), to calculate our confidence interval.
# finding the z value for a confidence level of 95%
<- qnorm(p = 0.05, lower.tail=FALSE)
z
# calculating the lower and upper bounds of the desired confidence interval
<- p - z*sqrt(p*(1-p)/n)
lower_bound <- p + z*sqrt(p*(1-p)/n)
upper_bound
lower_bound upper_bound
From our above calculations, we can say that we are 95% confident that the true proportion of Canadians with a child in their household ranges between 0.7075% - 0.7104%. Importantly, it is possible to run into cases where the upper or lower bound of the confidence interval for a sample proportion is outside of the accepted domain of [0, 1]. This is particularly likely when our sample proportion point estimate is already very high or low, and then our sample size is not very large. In these cases, we may choose to either report the true interval or cap or interval at 0 or 1 accordingly, with a note that this does not reflect the full confidence interval found. However, these cases are rare. We can check for ourselves that the above confidence interval is valid since it obeys all three of the criteria we have layed out for confidence interval estimating sample proportions.
Exercise
Matilda now wants to know the proportion of students in her school who are left-handed. Let’s assume her sampling procedure meets all of the criteria for constructing a valid confidence interval. She takes a sample of 200 students and finds that 22 of them are left-handed. What is the upper and lower bound of a 98% confidence interval for the proportion of the school’s overall student body that are left-handed?
# your code here
<- # your answer for the lower bound here, rounded to 3 decimal places (in proportion form, i.e. 10% = 0.1)
answer_4 <- # your answer for the upper bound here, rounded to 3 decimal places (in proportion form, i.e. 10% = 0.1)
answer_5
test_4()
test_5()
Let’s imagine that our sample size and confidence level are fixed and cannot be changed. What sample proportion of students who are left-handed would result in the smallest confidence interval possible?
<- # your answer for the sample proportion here (i.e. 10% = 0.1)
answer_6
test_6()
Confidence Intervals for the Sample Variance
Finally, we may want to construct confidence intervals for a sample variation itself in order to estimate the population standard deviation that we do not know. The following conditions must be met for this confidence interval to be valid:
- We must have a random sample (typically found through simple random sampling)
- Our original population is normally distributed or at least symmetrically distributed without many outliers. If this does not hold, our sample size must be > 120 (invokes the Central Limit Theorem)
- Our sample observations must be independent, which can be met through one of the following two ways:
- we sample with replacement (when we record an observation, we put it back in the population with the possibility of drawing it again)
- our sample size is < 10% of the population size
If conditions 1-3 are all met, we are able to construct a valid confidence interval for our sample variance point estimate.
The Only Case: We Don’t Know the Population Standard Deviation
We only need worry about this case when calculating confidence intervals for the sample variance. This is because, if we knew the population standard deviation, we would necessarily know the population variance and thus constructing a confidence interval to estimate this number would be useless! Instead, we assume we have only a sample variance to rely on. It should be noted that the formula we will use works a bit differently in this case. Rather than add and subtract a margin of error to our point estimate, we will instead use our point estimate to calculate the lower and upper bounds of our confidence interval directly.
\[ (\frac{(n - 1) \cdot s^2}{\chi^2_{\alpha/{2}}}, \frac{(n - 1) \cdot s^2}{\chi^2_{1 - \alpha/{2}}}) \]
where \(n\) is the sample size, \(s^2\) is the sample variance, and \(\chi^2\) is the chi-squared value for a chosen confidence level \(1 - \alpha\) and degrees of freedom \(n - 1\).
Constructing this type of confidence interval may feel a bit less intuitive and familiar than the previous two. This is because the sample variance does not follow a normal distribution. Unlike the sample mean and sample proportion, it follows a different, decisively non-normal distribution: the \(\chi^2\) distribution. For this reason, we construct our confidence intervals for this sample statistic differently, as depicted above.
Let’s do one final example to reinforce the calculation of confidence intervals for this type of sample statistic. We will construct a 95% confidence interval for the sample mean of mrkinc
. Our procedure will follow exactly the steps above, although this time we need to use the chi-squared distribution in place of the t or z distributions. We can calculate our sample variance first.
# calculating the variance of mrkinc
<- var(census_data$mrkinc)
var var
Now that we have our sample variance (which is quite large), we can find the other statistics necessary to calculate our confidence interval estimate.
# finding the chi-squared values for a 95% confidence level and n - 1 degrees of freedom
<- qchisq(p = 0.05, df =df, lower.tail = TRUE)
upper_chi <- qchisq(p = 0.05, df = df, lower.tail = FALSE)
lower_chi
# calculating the upper and lower bounds of the desired confidence interval
<- (df*var)/lower_chi
lower_bound <- (df*var)/upper_chi
upper_bound
lower_bound upper_bound
From the above, we can say that we are 95% confident that the variance of market income among all Canadians is within (767761585, 7745209769). This is quite a large interval, but given the size of the variance for this variable, this is reasonable.
Exercise
Finally, Matilda wants to know the variance of weights of all cars ever sold at her father’s car dealership. Naturally, since she can’t find the variance of the thousands of cars sold, she consults the dealership archives and takes a random sample of 40 cars and records their weights. She finds that they have a sample mean weight of 5,000 pounds and a sample variance of 250,000. Matilda wants to construct a 95% confidence interval estimate for the population variance. Given the information above, what are the upper and lower bounds of this confidence interval?
# your code here
<- # your answer for the lower bound here, rounded to the nearest whole number
answer_7 <- # your answer for the upper bound here, rounded to the nearest whole number
answer_7
test_7()
test_8()
Let’s now say that Matilda draws a new random sample of 40 cars and reports 95% confidence that the population variance of car weights falls within the confidence interval (490000, 640000). Under this sampling procedure, what is the 95% confidence interval estimate for the standard deviation of weights of all cars ever sold at the dealership?
<- # your answer for the lower bound here
answer_10 <- # your answer for the upper bound here
answer_11
test_10()
test_11()
Factors Which Impact the Width of Confidence Intervals
Looking at the above formulas, we now have a better understanding of exactly what factors go into calculating a confidence interval. More specifically, we can see that no matter the parameter we are estimating, we always need the following numbers: a confidence level and sample size. These numbers are chosen early on during the sampling procedure. Thus, they can easily be changed. Let’s explore what happens to our confidence intervals when we change each of these numbers.
Changing the Sample Size
Let’s say we want to change our sample size \(n\). If we increase our sample size \(n\), we can see mathematically that in all cases our margin of error goes down (or our bounds explicitly come closer together in the case of sampling variance). As a result, our confidence interval shrinks. This makes sense intuitively. If we draw a larger sample, then for any confidence level we can expect to estimate our desired population parameter more precisely. This is indicated by a narrower confidence interval. The same logic applies when we decrease our sample size \(n\). Both our margin of error and confidence interval will increase, indicative of the fact that our sample is smaller and we therefore have less precision in estimating our population parameter of interest. To see this point interactively, modify the code below by changing the input for \(n\) (currently set at 30). We can see that the size of the confidence intervals increases or decreases depending on whether we decrease or increase the simulated sample size.
<- rnorm(10000, 0, 1)
population set.seed(2)
# defining a function which outputs a confidence interval for a given sample size
<- function(n) {
create_confidence_intervals = mean(sample(population, n))
x = qnorm(p = 0.05, lower.tail=FALSE)
z = x - (z*1/sqrt(n))
lower = x + (z*1/sqrt(n))
upper = data.frame(lower, upper)
df return(c(lower, upper))
}
# calling the function, tweak default sample size 30 here!
create_confidence_intervals(30)
Changing the Confidence Level
Now suppose we instead want to change our confidence level. If we increase our confidence level, we are saying that if we hypothetically calculated many confidence intervals, the percentage of these intervals containing our desired population parameter should increase. We are thus asking for confidence interval estimates which capture the true parameter of interest more often. To capture this parameter more frequently for a given sample size, the width of our confidence interval (the range of possibilities for capturing the true parameter) must naturally increase. Similar logic applies to decreasing our confidence level: our confidence interval increases. This all occurs mathematically through an increase or decrease in our margin of error (or bounds) respectively, engendered by an increase or decrease in our \(z\) or \(t\) critical value. To see this point interactively, modify the code below by changing the input for \(\alpha\) (currently set at 0.05, indicating a 95% confidence level). We can see that the vertical length (width) of the confidence intervals increases or decreases depending on whether we increase or decrease the simulated confidence level.
<- rnorm(10000, 0, 1)
population set.seed(2)
# defining a function which outputs a confidence interval for a given confidence level
<- function(alpha) {
create_confidence_intervals = mean(sample(population, 100))
x = qnorm(p = alpha, lower.tail=FALSE)
z = x - (z*1/sqrt(100))
lower = x + (z*1/sqrt(100))
upper = data.frame(lower, upper)
df return(c(lower, upper))
}
# calling the function, tweak default 0.05 alpha (95% confidence level) here!
create_confidence_intervals(0.05)
Exercise
Matilda thinks that one of her confidence intervals above is too wide and wishes to narrow it. What could she do in order to achieve this goal?
- A. increase the sample size and higher the confidence level
- B. decrease the sample size and lower the confidence level
- C. increase the sample size and lower the confidence level
- D. decrease the sample size and higher the confidence level
<- "..." # enter your choice here
answer_1
test_11()
Common Misconceptions
Up to this point, we’ve covered what confidence intervals are, how we calculate them, and how they’re sensitive to two key parameters. Let’s lastly clarify a couple of misconceptions about the interpretation of confidence intervals.
Misconception 1:
If we have a 95% confidence interval, this is a concrete range under which our estimated population paramater must fall.
Hopefully the error in this way of thinking is quite clear now. If we repeated our sampling procedure many times and constructed a confidence interval each time, we would expect about 95% of these confidence intervals to contain our true parameter. However, this is not 100%. Theoretically, about 5% of our confidence intervals will not contain the true paramater. There is no stopping the actual confidence interval we calculate from being one of those 5%. Due to this, we cannot say with absolute certainty that our true paramater lies within the interval that we calculate. It is a common mistake to assume that a confidence interval is an official range within which the true paramater need fall. It can fall anywhere, it is just quite likely (very likely if our confidence level is high enough) that it falls within the interval calculated, hence we can have some trust in it as an estimator. This is why the confidence interval is an estimator and not a concrete range of possible values for our population paramater. If we had 100% certainty the true paramater fell within our calculated interval, it would not be much of an estimator but instead just a complete spectrum of the possible values the paramater could take on.
Misconception 2:
If we have a confidence level of 95%, 95% of our population data must lie within the calculated confidence interval.
This is not true. Our confidence level indicates the long run percentage of constructed confidence intervals which contain our true parameter. It says nothing about the spread of our actual data. To find the range within which 95% of our data lie, we must consult a histogram for the population. It could easily be the case that our data is quite bimodally distributed (around half of our data is clustered far to the left of our mean, and the other half is clustered far to the right of our mean). In this case, our calculated 95% confidence interval will likely contain very little (much less than 95%) of the data.
Misconception 3:
If we have a confidence level of 95%, a confidence interval calculated from a sample of 500 observations will more likely contain the true paramater than a confidence interval calculated from a sample of 100 observations.
This one might feel quite counterintuitive. After all, we already know from the previous section that a confidence interval generated from the sample \(n = 500\) will be smaller than one generated from \(n = 100\). However, think about the nuance of this statement. A confidence level by definition is the percentage of calculated intervals we expect to contain the true paramater of interest if we calculated these intervals over and over. In this situation, any one interval from a sample of \(n = 100\) has a 95% of containing the true paramater, just as any one interval from a sample of \(n = 500\) has a 95% of containing the true paramater. Each interval (the wider one from \(n = 100\) and narrower one from \(n= 500\)) has a chance of containing the true parameter in relation to all other calculated intervals for that same sample size. The probability of a given interval containing the true paramater is in no way influenced by or varies with the sample size. This probability only changes when we change our confidence level.
🟠 Every research context will drastically shape how confidence intervals are approached. As we have seen, the volume and quality of data affect how accurate data analyses can be, and many rules of thumb in data science are simply that - rules of thumb, as opposed to hard facts about how to report statistics.
🟠 What are some situations where you want to know that something is true with nearly 100% confidence?
🟠 What are some situations where the uncertainty of statistic is maybe not so bad?
🟠 What does it really mean to have something within or outside of a confidence interval?