{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.3.1 - Beginner - Introduction to Confidence Intervals\n", "\n", "COMET Team
*Mridul Manas* \n", "2023-07-20\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Jupyter\n", "- Introduction to R\n", "\n", "### Outcomes\n", "\n", "In this notebook, you will learn about:\n", "\n", "- What Confidence Intervals are and why they are useful for drawing\n", " inferential claims about the populations of interest.\n", "\n", "- How to compute a Confidence Interval for a chosen level of confidence\n", " for estimating true population parameters such as the mean.\n", "\n", "- What the different methods commonly used for the construction of\n", " Confidence Intervals are, namely the analytical method, the sampling\n", " method, and the bootstrapped sampling method.\n", "\n", "To begin, run the code cell below." ], "id": "a76a8fa1-4805-4c49-98eb-677912d7e5b1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell\n", "\n", "source(\"beginner_intro_to_confidence_intervals_tests.R\")" ], "id": "facab9f6-3a3b-4643-ad6f-d985b6549185" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Import the required packages for this tutorial:\n", "library(dplyr)\n", "library(ggplot2)" ], "id": "524468c3-d970-43e9-9dec-5065a9993570" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Introduction to Confidence Intervals for the Population Mean\n", "\n", "A confidence interval is a range, computed from a sample of a\n", "population, that is likely to contain the “true” mean for the entire\n", "population of interest. The confidence level describes how often ranges\n", "built this way capture that mean. 
The formula for computing a 95% Confidence\n", "Interval for the population mean is as follows:\n", "\n", "$$\n", "\\text{95\\% Confidence Interval} = \\bar{x} \\pm \\text{Critical Value} \\times \\text{Standard Error}\n", "$$\n", "\n", "Here, $\\bar{x}$ is the sample mean, or the *point estimate*, obtained\n", "from a single randomly drawn sample from the population.\n", "\n", "The $\\text{Standard Error}$ is computed as $\\frac{\\sigma}{\\sqrt{n}}$,\n", "serving as a measure of variability for sample means. In other words, if\n", "we were to obtain all possible samples from the population and calculate\n", "their means, the Standard Error would capture the variability of such\n", "sample means around the *mean of the sample means*. Later, we’ll try to\n", "explain the concept of distribution of sample means or the sampling\n", "distribution of sample means since it is a crucial underlying concept\n", "here!\n", "\n", "The Critical Value is a “quantile” value obtained typically from the\n", "Standard Normal Distribution (Mean 0, SD 1). For a 95% Confidence\n", "Interval it is chosen such that 97.5% of the values lie below it, leaving\n", "2.5% in each tail. This is often called the Z-score, as denoted\n", "below:\n", "\n", "$$\n", "z_{\\alpha/2} = qnorm(1 - \\alpha/2, mean = 0, sd = 1)\n", "$$\n", "\n", "The subscript of $z$ is the probability in each tail of the standard\n", "normal distribution. The value of $\\alpha$ represents the\n", "significance level, defined as $1 - \\text{Confidence Level}$."
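, "\n", "\n", "To make the formula concrete, here is a small worked illustration using\n", "hypothetical numbers that are not taken from this notebook’s data:\n", "suppose $\\bar{x} = 50$, $\\sigma = 10$ and $n = 100$, and take the\n", "critical value for 95% confidence to be approximately $1.96$.\n", "\n", "$$\n", "\\text{Standard Error} = \\frac{10}{\\sqrt{100}} = 1\n", "$$\n", "\n", "$$\n", "50 \\pm 1.96 \\times 1 = (48.04, \\; 51.96)\n", "$$\n", "\n", "Under these assumed values, the interval for the population mean would\n", "run from 48.04 to 51.96."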
], "id": "adc4cc60-95cc-4cb0-aac2-e79c5b0407ac" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#RUN THIS CELL BEFORE CONTINUING\n", "library(ggplot2)\n", "\n", "# Significance level (alpha)\n", "alpha <- 0.05\n", "\n", "# Calculate the critical value for a two-tailed 95% confidence interval\n", "z_critical <- qnorm(1 - alpha / 2, mean = 0, sd = 1)\n", "\n", "# Create a sequence of x values for the standard normal curve\n", "x <- seq(-3, 3, length.out = 1000)\n", "\n", "# Calculate the standard normal probability density function\n", "pdf <- dnorm(x, mean = 0, sd = 1)\n", "\n", "# Create the plot\n", "ggplot(data.frame(x), aes(x = x)) +\n", " geom_line(aes(y = pdf), color = \"blue\", size = 1.5) +\n", " geom_vline(xintercept = z_critical, color = \"red\", linetype = \"dashed\") +\n", " geom_vline(xintercept = -z_critical, color = \"red\", linetype = \"dashed\") +\n", " annotate(\"text\", x = z_critical + 0.1, y = 0.25, label = \"α/2\", color = \"red\") +\n", " annotate(\"text\", x = -z_critical - 0.5, y = 0.25, label = \"α/2\", color = \"red\") +\n", " labs(x = \"Z\", y = \"Probability Density\", title = \"Two-Tailed 95% Confidence Interval\") +\n", " theme_minimal()" ], "id": "b019b29c-38b9-44d9-9005-b7e02f7c1b90" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a 95% Confidence Interval, $\\alpha$ is $1 - 0.95 = 0.05$ and the\n", "critical value $z_{\\frac{\\alpha}{2}}$ corresponds to the point beyond\n", "which the area in the tail is $\\alpha/2$, leaving the central 95% area\n", "under the standard normal curve.\n", "\n", "We use the `qnorm()` function in R to generate Z-scores for the chosen\n", "level of significance:" ], "id": "36f5dc25-058d-42d3-9017-6952525d09d1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#defining some params for our use\n", "conf.level = 0.95\n", "alpha = 1 - conf.level \n", "\n", "qnorm(1 - alpha/2, mean = 0, sd = 1)" ], "id": 
"827c1ae7-e8ee-4c01-a8cf-e28429b687a0" }, { "cell_type": "markdown", "metadata": {}, "source": [ "or simply," ], "id": "c569c99d-a9c0-480e-841a-b2c095b787f6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "qnorm(0.975, mean = 0, sd = 1)" ], "id": "2b62f32f-067d-4870-9a38-a564db75169a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, This means $P(Y \\leq 1.96) = 0.975$ and\n", "$P(Y \\leq -1.96) = 0.025$ where $Y \\sim N(0,1)$. Thus,\n", "$P(-1.96 < Y < 1.96) = 0.95$ or 95%.\n", "\n", "Notice the symmetry of the Standard Normal Distribution. This is the\n", "reason why we only have to compute the Z-score once, using the\n", "right-hand side critical value (below which 97.5% value lie). We then\n", "multiply this with the Standard Error and ADD to the point estimate to\n", "get the upper bound for the C.I. We can then subtract the same Z-score\n", "(after multiplying with the Standard Error) from the sample mean to get\n", "the lower bound of the Confidence Interval.\n", "\n", "Just like we rarely ever know the true population means, knowing the\n", "true population standard deviation is quite rare. It is common practice\n", "to use the sample standard deviation, or $s$, to calculate the Standard\n", "Error as $\\frac{s}{\\sqrt{n}}$.\n", "\n", "In the case population standard deviations are unknown, the critical\n", "values must be drawn from the $t$-distribution and not the Normal\n", "Distribution. 
We use the `qt()` function in R to obtain the quantile\n", "values under the $t$-distribution with specified degrees of freedom.\n", "\n", "$$\n", "t_{n-1, \\alpha/2} = qt(1 - \\alpha/2,df=n-1)\n", "$$" ], "id": "4bfe929f-aedd-40cc-9e2f-f782dbe2083e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_size <- 15\n", "degrees_of_freedom <- sample_size - 1 #degrees of freedom of a t-distribution is equal to the sample size minus 1\n", "\n", "qt(0.975, df = degrees_of_freedom)" ], "id": "8cae0c93-bc44-4b25-b618-c2257d1c0e03" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means $P(Y \\leq 2.14) = 0.975$ and $P(Y \\leq -2.14) = 0.025$ where\n", "$Y \\sim \\text{t}_{n-1}$ with $n - 1 = 14$. Thus, $P(-2.14 < Y < 2.14) = 0.95$ or 95%.\n", "\n", "> For higher degrees of freedom (hence, higher sample sizes) the\n", "> $t$-distribution more closely approximates the normal\n", "> distribution. In general, the t-distribution will mimic the\n", "> bell-shaped nature of the normal distribution but will have fatter or\n", "> thicker tails.\n", "\n", "Let’s summarize the assumptions, requirements and appropriate methods\n", "for calculating Confidence Intervals:\n", "\n", "- If population standard deviation $\\sigma$ is known and sample size\n", " is above 30, obtain the critical value from the standard normal\n", " distribution (z-score) and calculate the 95% Confidence Interval as:\n", "\n", " $$\n", " \\bar{x} \\pm qnorm(0.975, mean = 0, sd = 1) \\times \\frac{\\sigma}{\\sqrt{n}}\n", " $$\n", "\n", "- If population standard deviation is known but the sample size is\n", " small (ie. 
below 30), *approximate* the critical value using\n", " the t-distribution (t-score) and calculate the 95% Confidence\n", " Interval as:\n", "\n", " $$\n", " \\bar{x} \\pm qt(0.975, df = (n-1)) \\times \\frac{\\sigma}{\\sqrt{n}}\n", " $$\n", "\n", "- If the standard deviation of the population is unknown and the\n", " sample size is **large** (ie. above 30), obtain the critical value\n", " using the standard normal distribution (ie. get a Z-score) and\n", " calculate the 95% Confidence Interval as:\n", "\n", " $$\n", " \\bar{x} \\pm qnorm(0.975, mean = 0, sd = 1) \\times \\frac{s}{\\sqrt{n}}\n", " $$\n", "\n", "- If the population standard deviation, $\\sigma$, is unknown and the\n", " sample size is small (below 30), **but the population is known to be\n", " normally distributed**, *approximate* the critical value using the\n", " $t$-distribution (t-score) and calculate the 95% Confidence Interval\n", " as:\n", "\n", " $$\n", " \\bar{x} \\pm qt(0.975, df = (n-1)) \\times \\frac{s}{\\sqrt{n}}\n", " $$\n", "\n", "### 1.1 95% Confidence Intervals for Reading Comprehension Scores\n", "\n", "A teacher is interested in knowing if Grade 8 students are meeting the\n", "expectations for reading ability set by the governing body of the\n", "country.\n", "\n", "She nominates 15 randomly chosen students who then take a standardized\n", "reading and comprehension exam. The average score for this sample of 15\n", "students is 17 out of 32 and the sample standard deviation is 4.2.\n", "\n", "Suppose the reading comprehension scores for *all* students in the\n", "country are known to be normally distributed. Which distribution should\n", "be used to calculate the Critical Value for a 95% Confidence Interval?\n", "\n", "A: The Standard Normal Distribution (ie. obtain the Z-score)\n", "\n", "B: The $t$-distribution (ie. 
obtain the t-score)" ], "id": "4048f90a-f7c3-4822-9857-8d7812b5f9b7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# which distribution should we use?\n", "answer_1.1 <- \"\" #Type either A or B here\n", "test_1.1()" ], "id": "a528a48d-2679-4517-b4fb-8f765309d15e" }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": "question" }, "outputs": [], "source": [ "# which distribution should we use?\n", "answer_1.1 <- \"\" #Type either A or B here\n", "test_1.1()" ], "id": "22334a68-5e07-40ca-ab78-08664227a3f5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose the teacher now nominates a bigger sample of 45 students\n", "randomly chosen to sit for the exam. The governing body has also\n", "announced that the national scores are normally distributed with a\n", "population standard deviation of 3.\n", "\n", "The teacher’s sample of students on average score 18 (out of 32) with a\n", "standard deviation of 5. Now use the appropriate function in R (`qnorm`\n", "or `qt`) to calculate a 95% Confidence Interval for Mean Reading &\n", "Comprehension Score for the entire class:" ], "id": "ff28ddd6-0729-4c7c-8d40-5f7717a5585c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the correct function and formula to calculate the lower and upper bounds of the interval. Think about what 1 - alpha/2 will be.\n", "\n", "lower_ci <- # REPLACE 5 WITH THE CORRECT CODE FOR THE LOWER BOUND\n", "upper_ci <- # REPLACE 10 WITH THE CORRECT CODE FOR THE UPPER BOUND\n", "\n", "answer_1.2 <- tibble(lower_ci = lower_ci, upper_ci = upper_ci)\n", "test_1.2()" ], "id": "b6aed43d-9bd5-4ff5-87a6-5b96362a3e80" }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": "question" }, "outputs": [], "source": [ "# use the correct function and formula to calculate the lower and upper bounds of the interval. 
Think about what 1 - alpha/2 will be.\n", "\n", "lower_ci <- 5 # REPLACE 5 WITH THE CORRECT CODE FOR THE LOWER BOUND\n", "upper_ci <- 5 # REPLACE 10 WITH THE CORRECT CODE FOR THE UPPER BOUND\n", "\n", "answer_1.2 <- tibble(lower_ci = round(lower_ci, 2), upper_ci = round(upper_ci, 2))\n", "test_1.2()" ], "id": "818a5bbd-7c4a-496e-a2ce-85ccdbcbe47a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "confidence_interval_analytical <- tibble(lower_ci = 17.26, upper_ci = 18.74)\n", "confidence_interval_analytical" ], "id": "605f4229-09b8-4c2b-b5fd-2e791a90a5a8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The teacher using the bigger sample obtains the 95% Confidence Interval:\n", "(17.2644, 18.7356). She interprets this in the following two ways:\n", "\n", "- She is 95% confident that the true average reading and comprehension\n", " score for the entire class of Grade 8 students is between 17.2644\n", " and 18.7356.\n", "\n", "- If she drew samples repeatedly from the population of Grade 8\n", " students at her school and calculated 95% confidence intervals using\n", " the same methodology, then 95% of such confidence intervals would\n", " capture the true mean score for all Grade 8 students at the school.\n", "\n", "Note how the first interpretation allows the teacher to infer if her\n", "students are performing at par with the nation. 
For example, if the\n", "national average score is 18, which is contained within the 95%\n", "Confidence Interval, she cannot conclude that her students are, on\n", "average, performing better, since the interval includes values below\n", "18.\n", "\n", "She should try increasing the sample size, which in turn will cause the\n", "Standard Error of the point estimate to be lower, thus decreasing the\n", "*width* of the Confidence Interval and helping her obtain more precise\n", "estimates for the true mean score for the entire class.\n", "\n", "Obviously, the best approach would be to ask all Grade 8 students at the\n", "school to sit for the exam and get the true mean score for all students.\n", "However, this can be both costly and time-consuming, which argues in favor of\n", "using small samples to *infer* claims about the population of Grade 8\n", "students at the school and their reading and comprehension abilities!\n", "\n", "### 2. Simulating the Sampling Distribution of Sample Means\n", "\n", "You must have noticed that Confidence Intervals are always *centered*\n", "around the sample mean or the point estimate obtained from the\n", "population. When we construct a confidence interval, we’re determining a\n", "range around the sample mean within which we’re confident the population\n", "parameter lies. This range is influenced by the variability of the\n", "sample statistic as indicated by the standard error.\n", "\n", "The Standard Error describes the variability of sample means around their\n", "mean. This implies two things: there must be a distribution of sample means\n", "and this distribution must have a mean/center, called the *mean of\n", "sample means*!\n", "\n", "Let’s use R to *simulate* the distribution of sample means." 
], "id": "1e517eba-4ee8-49d1-8151-0ab2be79045a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set the random seed for reproducibility\n", "set.seed(123) #DON'T CHANGE\n", "\n", "# Generate 500 reading scores from a normal distribution\n", "reading_scores <- data.frame(scores = rnorm(n = 500, mean = 17.5, sd = 3)) %>%\n", " mutate(scores = round(scores, 2))\n", "\n", "head(reading_scores$scores)" ], "id": "a45320d1-c2a4-49e0-8dd0-389b9b6a828f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "`reading_scores$scores` is a vector containing the true reading and\n", "comprehension scores of *all 500 Grade 8 students at a school XYZ*.\n", "\n", "Next, we’ll repeatedly draw 1000 samples, each of size 50, from the\n", "population." ], "id": "dcbf4efd-c746-448f-9009-62e2272a8ace" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_samples <- 1000\n", "samples <- data.frame(sample_id = c(), scores = c())\n", "\n", "for (i in 1:num_samples) {\n", " sample_id <- i\n", " sample_scores <- sample(reading_scores$scores, size = 50, replace = TRUE)\n", " \n", " to_add <- data.frame(sample_id = sample_id, scores = sample_scores)\n", " \n", " samples <- rbind(samples, to_add)\n", "}\n", "\n", "\n", "head(samples)" ], "id": "dca66b37-089d-4f33-af51-fb3e7b49773c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We chose to draw 1000 samples, but keep in mind that the number of\n", "distinct samples we could have drawn from the population with\n", "replacement is finite yet astronomically large!\n", "\n", "To obtain the distribution of sample means, we will first calculate the\n", "sample mean for each of the 1000 samples." 
], "id": "ec6406cf-074c-4eaa-adb5-1df65878e2a1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampling_dist_mean_scores <- samples %>% group_by(sample_id) %>%\n", " summarise(sample_mean = mean(scores))\n", "\n", "head(sampling_dist_mean_scores, 10)" ], "id": "c0ada9dc-a3b3-49e9-9f20-6097109b2fac" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s now plot these sample means as a distribution:" ], "id": "1e96e879-9450-47f8-8a20-fa5ccd6dd7bc" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampling_dist_plot <- sampling_dist_mean_scores %>%\n", " ggplot(aes(x = sample_mean)) +\n", " geom_density(fill = \"lightblue\") +\n", " geom_vline(xintercept = mean(sampling_dist_mean_scores$sample_mean)) +\n", " ggtitle(\"Sampling Distribution of Sample Average Scores (n = 50)\") +\n", " xlab(\"Average Score from Sample\") +\n", " ylab(\"Density\")\n", "\n", "sampling_dist_plot" ], "id": "64ab6df3-037f-466e-b3af-576cb9ca93ab" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The black line marks the mean of the sampling distribution (mean of mean\n", "scores) which is equal to 17.61.\n", "\n", "Let’s now compare the means of the population with the mean of the\n", "sampling distribution:" ], "id": "a906748f-42b4-4e69-8943-836d1ea91d04" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_mean <- mean(reading_scores$scores)\n", "sampling_dist_mean <- mean(sampling_dist_mean_scores$sample_mean)\n", "\n", "tibble(pop_mean = pop_mean, sampling_dist_mean = sampling_dist_mean)" ], "id": "3193bc21-ed71-40f3-ae36-58f966ad2796" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how similar the two means are! 
It is an important concept in\n", "Statistics that the actual sampling distribution of the sample means\n", "will be centered approximately at the mean of the population from which\n", "the samples were drawn.\n", "\n", "The standard deviation of the sample average scores is:" ], "id": "61167a41-06be-4b09-905f-7fadbb3e19de" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sd_sampling_dist <- sd(sampling_dist_mean_scores$sample_mean)\n", "pop_sd <- sd(reading_scores$scores)\n", "\n", "tibble(sd_sampling_dist = sd_sampling_dist, pop_sd = pop_sd)" ], "id": "219dc0ac-909e-441a-9b15-1aec9f03c7f3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Population SD divided by the square root of the sample size (n = 50)\n", "pop_sd / sqrt(50)" ], "id": "11540549-2071-43e4-80f1-ee2e1f8822e4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "These two are different but observe that:\n", "\n", "$$\n", "\\frac{\\text{pop_sd}}{\\sqrt{n}} = \\frac{2.918379}{\\sqrt{50}} \\approx 0.4127211 \\approx \\text{sd_sampling_dist}\n", "$$\n", "\n", "As you might recall, we commonly use $\\frac{\\sigma}{\\sqrt{n}}$ to\n", "compute the Standard Error for the Confidence Interval, which is\n", "actually the standard deviation of the sampling distribution of sample\n", "means.\n", "\n", "Now consider again the distribution of sample estimates we generated\n", "from the population:" ], "id": "45704f20-45c1-4f63-b7c5-00c81482a4d7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "head(sampling_dist_mean_scores)" ], "id": "a43d141b-1101-4843-8b44-8482b45109ea" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that if we compute the 2.5th and 97.5th percentile values\n", "of the vector `sample_mean`, the resulting range can also be treated as a 95%\n", "Confidence Interval." 
], "id": "b0747680-d111-4ba4-87ce-7cc351a78f52" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lower_ci <- quantile(sampling_dist_mean_scores$sample_mean, 0.025)\n", "upper_ci <- quantile(sampling_dist_mean_scores$sample_mean, 0.975)\n", "\n", "conf_interval <- tibble(lower_ci = lower_ci, upper_ci = upper_ci)\n", "conf_interval" ], "id": "7937c5be-24b6-45c5-bee9-10a168a9a7a3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hence, this is a valid 95% Confidence Interval generated by taking\n", "repeated samples (total 1000 of size 50) out of the original population\n", "(500 student scores) and then calculating the 2.5th and 97.5th\n", "percentile values of the collection of sample means.\n", "\n", "That is to say, 5% of the 1000 sample averages generated from the\n", "population fall outside of (16.80755, 18.3585). Equivalently, 95% of the\n", "sample average scores fall within the range.\n", "\n", "Note that the 95% Confidence Interval we have computed using the\n", "repeated sampling technique has different values than the *single sample\n", "Confidence Interval* computed by the teacher. But both are valid!\n", "\n", "In fact, the teacher’s confidence interval is less resource-intensive\n", "and more practical. 
It is almost always impossible to obtain the actual\n", "sampling distribution of the sample means because (1) the population is usually\n", "unknown and (2) taking *all possible* repeated samples is\n", "infeasible.\n", "\n", "However, the Confidence Interval method involving the use of a single\n", "sample draws its theoretical credibility from concepts such as the\n", "sampling distribution of sample means and the standard error for the\n", "sample mean!\n", "\n", "> ***Thinking Critically:*** When using a single point estimate to draw\n", "> claims about the population parameter, it is important to also\n", "> use *measures of variability* to capture or illustrate the variability\n", "> both within the sample observations and between different possible samples\n", "> (and estimates) that can be obtained from the population. As you will\n", "> learn in future classes/tutorials, the Central Limit Theorem\n", "> can be used to hypothesize what the\n", "> distribution of sample estimates or the sampling distribution will\n", "> look like. This then governs our choice for the Critical Value –\n", "> whether we will use the Z-score (normal distribution) or the t-score\n", "> (t-distribution).\n", "\n", "Hopefully this section helped you understand the ideas behind why we use\n", "the Standard Error (or its approximations) for computing Confidence\n", "Intervals.\n", "\n", "### 3. Using the Bootstrapping Method for Constructing Confidence Intervals for Population Means\n", "\n", "As we discussed earlier, obtaining the actual sampling distribution of\n", "sample means is almost impossible in the real world.\n", "\n", "But other techniques exist that help us compute Confidence Intervals by\n", "going a step further than simply using one single sample mean and one\n", "single sample standard deviation.\n", "\n", "One such technique, quite popular in Data Science, is the\n", "**bootstrapping method**, which proceeds as follows:\n", "\n", "1. 
Take a random and representative sample of an appropriate size\n", " (above 30).\n", "2. *Replicate* the sampling process, ie. draw repeated samples of the\n", " same size from the single sample we have – with replacement.\n", " This is equivalent to treating the sample as a population and\n", " drawing samples with replacement, ideally of the same size as the\n", " original sample.\n", "3. As usual, calculate the sample means and find the 2.5th and 97.5th\n", " percentile values to get a range within which 95% of the sample\n", " means obtained lie. This is a valid 95% Confidence Interval for the\n", " true mean.\n", "\n", "> ***Note:*** The Bootstrapping Distribution of Sample Means is\n", "> different from the *actual* sampling distribution of sample means, if\n", "> there were a realistic way of obtaining the latter.\n", ">\n", "> The center of the bootstrap distribution is approximately equal to the sample mean\n", "> of the original sample, not the population mean as was the case for\n", "> sampling distributions.\n", "\n", "Consider the population of 500 Grade 8 students for whom we are\n", "interested in obtaining the 95% Confidence Interval of Mean Reading and\n", "Comprehension Scores.\n", "\n", "Suppose it is infeasible to ask all the students to take the\n", "standardized exam, and thus, 65 students are randomly chosen to sit for\n", "it:" ], "id": "502faa6f-e4fc-4aaa-8811-2b633ae74306" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(1234) #DO NOT CHANGE\n", "random_sample <- sample(reading_scores$scores, size = 65)\n", "head(random_sample)" ], "id": "fc5304b6-caf4-4f76-9843-f302180c2b71" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we’ll use the **bootstrapping method** to generate 1000 samples of\n", "size 65 out of this single sample. This is only possible via resampling\n", "with replacement." 
], "id": "ebb89a7c-5ca3-46d1-b610-97ad97233936" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_samples <- 1000\n", "bootstrap_samples <- data.frame(sample_id = c(), scores = c())\n", "\n", "for (i in 1:num_samples) {\n", " sample_id <- i\n", " sample_scores <- sample(random_sample, size = 65, replace = TRUE)\n", " \n", " to_add <- data.frame(sample_id = sample_id, scores = sample_scores)\n", " \n", " bootstrap_samples <- rbind(bootstrap_samples, to_add)\n", "}\n", "\n", "nrow(bootstrap_samples) #Total number of observations: 65 * 1000" ], "id": "a7e0f506-46d8-4eb9-b7c1-2543b39210cd" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let’s calculate the sample means for each of the 1000 samples:" ], "id": "cc205d02-c1c1-4bb1-9613-8b54455df54d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bootstrap_sampling_dist <- bootstrap_samples %>%\n", " group_by(sample_id) %>% summarise(sample_mean = mean(scores))\n", " \n", "head(bootstrap_sampling_dist, 10)" ], "id": "881cbf8c-b11b-47f2-88aa-0d582df9c0d5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the 95% Confidence Interval for the Population Mean Score (of\n", "500 students), we’ll obtain the 2.5th and 97.5th percentile values from\n", "the column `sample_mean` of the `bootstrap_sampling_dist`." ], "id": "b08037ff-4e92-4389-bce7-9cbbf1840dfe" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conf_interval_bootstrap <- tibble(lower_ci = quantile (bootstrap_sampling_dist$sample_mean, 0.025),\n", " upper_ci = quantile(bootstrap_sampling_dist$sample_mean, 0.975))\n", "\n", "conf_interval_bootstrap" ], "id": "e23eda27-e7d4-4d4f-9224-7c17e9121543" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is considered a valid 95% Confidence Interval for the true mean\n", "score of the 500 Grade 8 students. 
Notably, we obtained only a\n", "single sample of a manageable size and then used the **bootstrapped\n", "distribution of sample means** to obtain the 2.5th and 97.5th percentile\n", "values for the sample means.\n", "\n", "Let’s now compare the three Confidence Intervals obtained via:\n", "\n", "1. Single Sample using the analytical approach\n", "2. Repeated Sampling from Original Population using R\n", "3. Bootstrapped Sampling from Single Sample using R" ], "id": "f2e6681e-4df9-4f69-81d9-fea950ee2e3e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(tibble)\n", "\n", "comparison_table <- tibble(\n", " analytical_method = paste(\"[\", confidence_interval_analytical$lower_ci, \", \", confidence_interval_analytical$upper_ci, \"]\", sep = \"\"),\n", " sampling_dist_method = paste(\"[\", conf_interval$lower_ci, \", \", conf_interval$upper_ci, \"]\", sep = \"\"),\n", " bootstrap_method = paste(\"[\", conf_interval_bootstrap$lower_ci, \", \", conf_interval_bootstrap$upper_ci, \"]\", sep = \"\")\n", ")\n", "\n", "print(comparison_table)" ], "id": "5f0962ef-ed55-4be2-88ea-fbd60c4d5f90" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Contrary to what happens in many real-world scenarios, we have access to\n", "the population data and the population mean (since we simulated the\n", "500 scores)." ], "id": "6fbb22f6-9299-460e-a279-108b4e759f81" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate the population mean\n", "pop_mean <- mean(reading_scores$scores)\n", "\n", "comparison_table <- comparison_table %>%\n", " mutate(pop_mean = pop_mean)\n", " \n", "comparison_table" ], "id": "73f06cf6-eb53-4761-881e-b5f4b1619a08" }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the 95% Confidence Intervals were able to capture the true\n", "population mean score. 
However, note how they differ in terms of the\n", "width and the values for the upper and lower bounds.\n", "\n", "### 4. Conclusion\n", "\n", "In Part 1, we introduced confidence intervals using a very simple\n", "example – something you’ve probably crossed paths with in your basic\n", "statistics classes. We took one sample of test scores and used the\n", "*analytical method* and assumptions about the underlying population to\n", "compute a single 95% confidence interval.\n", "\n", "Recall how we used both the standard error (using the standard\n", "deviation and sample size) and critical value (obtained from the\n", "standard normal or $t$-distribution) to finally get the margin of error? Thus the idea of\n", "*variability around the mean* is essential to the process of\n", "constructing a 95% confidence interval.\n", "\n", "We introduced an important concept in Part 2 - the theoretical framework\n", "of the Sampling Distribution of Sample Means. Easily imagined as the\n", "distribution of *all possible sample means* that can be generated from\n", "the original population, this distribution of sample means is useful for\n", "both Confidence Interval construction and Hypothesis Testing, as you\n", "will learn in further tutorials.\n", "\n", "Part 3 is where the real fun began. We took note of the fact that\n", "obtaining the actual sampling distribution is infeasible in the real\n", "world. So, we instead used a method popular in Data Science –\n", "the bootstrapping method. We essentially “replicated” samples out of a\n", "single sample, generating a sampling distribution that is centered at\n", "the original sample’s mean. We then used this “bootstrapped” sampling\n", "distribution to get the 2.5th and 97.5th percentile values for the\n", "average sample scores to finally obtain the 95% Confidence Interval.\n", "\n", "Lastly, we compared the 3 intervals and found that they all captured the\n", "true mean. 
While this might not always be true, due to variability\n", "(randomness, luck, chance, etc.) inherent in the sampling techniques\n", "involved, all three methods discussed lead to *valid* Confidence\n", "Intervals, provided the underlying assumptions and requirements\n", "discussed previously hold!\n", "\n", "And that wraps up this tutorial." ], "id": "a6152587-9ed0-49d8-81c5-9887e711474b" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }