{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.3.2 - Beginner - Confidence Intervals\n", "\n", "COMET Team
*Anneke Dresselhuis, Colby Chambers, Jonathan Graves* \n", "2023-01-12\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Jupyter
\n", "- Introduction to R
\n", "- Introduction to Visualization
\n", "- Central Tendency
\n", "- Distribution
\n", "- Dispersion and Dependence
\n", "\n", "### Outcomes\n", "\n", "After completing this notebook, you will be able to:\n", "\n", "- Interpret and report confidence intervals\n", "- Calculate confidence intervals under a variety of conditions\n", "- Understand how the scope of sampling impacts confidence intervals\n", "\n", "### References\n", "\n", "- [Simulating the Construction of Confidence Intervals for Sample\n", " Means](https://rpubs.com/pgrosse/545955)" ], "id": "477d5356-3821-4c05-b8b4-a357387b7b69" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source(\"beginner_confidence_intervals_tests.r\")\n", "\n", "# importing typical packages\n", "library(tidyverse)\n", "library(haven)\n", "library(ggplot2)\n", "\n", "# loading the dataset\n", "census_data <- read_dta(\"../datasets_beginner/01_census2016.dta\")\n", "\n", "# cleaning the dataset\n", "census_data <- filter(census_data, !is.na(census_data$wages))\n", "census_data <- filter(census_data, !is.na(census_data$mrkinc))\n", "census_data <- filter(census_data, census_data$pkids != 9)" ], "id": "cb76a9de-5772-4c1c-8f35-8f7c0c4adf84" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "So far, we have developed a strong grasp of core concepts in statistics.\n", "We’ve learned about measures of central tendency and variation, as well\n", "as how these measures relate to distributions. We have also learned\n", "about random sampling and how sampling distributions can shed light on\n", "the parameters of a population distribution.\n", "\n", "So, how can we apply this knowledge to real empirical work? In this\n", "notebook, we will learn about a key concept which relates to how we\n", "report our results empirically when sampling from a population. 
This is\n", "the idea of a **confidence interval**.\n", "\n", "# Confidence Intervals and Point Estimates\n", "\n", "A **confidence interval** is an estimate that gives us a range of values\n", "within which we expect a population parameter to fall. Put another way,\n", "it provides a range within which we can have a certain degree of\n", "*confidence* that a desired parameter, such as a population mean, lies.\n", "\n", "> This is in contrast to a **point estimate**, which is a single\n", "> estimated value of a quantity such as a population parameter.\n", ">\n", "> For example, the point estimate of the population mean is the sample mean and\n", "> the point estimate of the population standard deviation is the sample\n", "> standard deviation.\n", "\n", "Let’s make this concrete with an example.\n", "\n", "## Example\n", "\n", "**Aim:** Find the mean GPA of undergraduate students at universities\n", "across Canada.\n", "\n", "**Method:** Instead of collecting the GPA of every single undergraduate\n", "student in the country without error, we can collect a sample of\n", "students and find the mean of their GPAs (the sample mean).\n", "\n", "**Evaluation:** Make inferences about the desired, yet unobtainable,\n", "population mean using the sample mean (point estimate).\n", "\n", "But how can we report an estimate for the population mean GPA if we draw\n", "a different mean GPA for every possible sample? This is where\n", "**confidence intervals** become useful. They allow us to combine\n", "information about central tendency and dispersion into a single object.\n", "\n", "# Confidence Levels\n", "\n", "The confidence interval describes the precision of the point estimate from\n", "the sample.\n", "\n", "To calculate this confidence interval, we must choose a **confidence\n", "level**. 
The confidence level is the long-run proportion of confidence\n", "intervals, built by the same procedure from repeated random samples,\n", "that contain the true population parameter (e.g. the mean).\n", "\n", "A higher confidence level means greater certainty that our confidence\n", "interval serves as a good estimate for the population parameter of\n", "interest.\n", "\n", "The most commonly chosen confidence level is 95%, but other percentages\n", "(90%, 99%) are also used sometimes.\n", "\n", "If the confidence level is established at 95% for our sample scenario,\n", "this would mean that if we drew random samples of undergraduate students\n", "1000 different times and got 1000 sample mean point estimates and\n", "corresponding confidence intervals, we would expect about 950 of these\n", "confidence intervals to contain the actual average GPA of all Canadian\n", "undergraduates.\n", "\n", "We say that we are **95% confident** that the true mean GPA of all\n", "Canadian undergraduates lies in this range.\n", "\n", "# Calculating Confidence Intervals\n", "\n", "The general form of a confidence interval is the following:\n", "\n", "$$\n", "(\\text{point estimate} - \\text{margin of error}, \\text{point estimate} + \\text{margin of error})\n", "$$\n", "\n", "or\n", "\n", "$$ \n", "\\text{point estimate} \\pm \\text{margin of error}\n", "$$\n", "\n", "Point estimate:\n", "\n", "- The sample statistic we find from our random sample.\n", "\n", "Margin of error:\n", "\n", "- This is subtracted from and added to our point estimate to find the\n", " **lower bound** and **upper bound** of our confidence interval\n", " estimate. Calculating the margin of error varies depending on what\n", " sample statistic we are looking at and what we know about our\n", " population. 
Let’s look at a few important special cases.\n", "\n", "# Confidence Intervals for the Sample Mean\n", "\n", "To construct a confidence interval for a sample mean we’ve found (e.g.\n", "the mean GPA of a sample of undergraduates), we must meet the following\n", "three conditions:\n", "\n", "1\\. The sample must be **obtained randomly** (typically through simple\n", "random sampling)\n", "\n", "2\\. The sampling distribution of the sample means is **approximately\n", "normal**, either because\n", "\n", "- a\\) the original population is normally distributed\n", "\n", "- b\\) the sample size is \\> 120 (invokes the Central Limit Theorem)\n", "\n", "3\\. Our sample observations must be **independent**, either because\n", "\n", "- a\\) we sample with replacement (when we record an observation, we\n", " put it back in the population with the possibility of drawing it\n", " again)\n", "- b\\) our sample size is \\< 10% of the population size\n", "\n", "If each of conditions 1-3 is met, we are able to construct a valid\n", "confidence interval around our sample mean point estimate. There are two\n", "different cases for this construction.\n", "\n", "## Case 1: We Know the Population Standard Deviation\n", "\n", "In rare instances when we may know the variance (and thus standard\n", "deviation) of our original population of interest, we use the following\n", "formula to calculate the confidence interval:\n", "\n", "$$\n", "\\bar x \\pm z_{\\alpha / 2} \\cdot \\frac{\\sigma}{\\sqrt n}\n", "$$\n", "\n", "where $\\bar x$ is the sample mean, $z_{\\alpha / 2}$ is the critical value (from the\n", "standard normal distribution) for a chosen confidence level $1-\\alpha$,\n", "$\\sigma$ is the population standard deviation, and $n$ is the sample\n", "size.\n", "\n", "Note that this case is extremely rare, as it requires us to know the standard\n", "deviation but not the mean of a population! 
Typically we either know\n", "both the mean and standard deviation of the population or we know\n", "neither.\n", "\n", "## Case 2: We Don’t Know the Population Standard Deviation\n", "\n", "In this case we invoke the $t$-distribution when calculating the margin\n", "of error for our confidence intervals. When we don’t know the population\n", "standard deviation, we use the sample standard deviation instead.\n", "The calculation procedure otherwise follows exactly as before in Case 1.\n", "\n", "$$\n", "\\bar x \\pm t_{\\alpha / 2} \\cdot \\frac{s}{\\sqrt n}\n", "$$
\n", "\n", "where $\\bar x$ is the sample mean, $t_{\\alpha / 2}$ is the critical value (from the\n", "$t$-distribution with $n - 1$ degrees of freedom) for a chosen confidence level $1-\\alpha$, $s$ is the\n", "sample standard deviation, and $n$ is the sample size.\n", "\n", "For example, let’s construct a 95% confidence interval for the sample\n", "mean of the variable `wages`. We can immediately calculate its mean,\n", "which serves as our sample mean point estimate." ], "id": "51e15295-fb01-4c27-bae5-8277d0d40be1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculating the sample mean of wages\n", "x <- mean(census_data$wages)" ], "id": "4a1ff963-783d-4fd4-9ff1-b246a23ae431" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have this point estimate, we can calculate our margin of\n", "error around it. To do so, we must first find\n", "\n", "1. The $t$ value corresponding to a 95% confidence level\n", "2. The standard deviation of `wages`\n", "3. The sample size (the number of observations recorded for `wages`)." ], "id": "a6ef8bf5-018b-48c3-8896-50d6080b8df7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# finding the sample size and associated degrees of freedom\n", "n <- nrow(census_data)\n", "df <- n - 1\n", "\n", "# finding the t critical value for a 95% confidence level: alpha = 0.05, so we need the 1 - alpha/2 = 0.975 quantile\n", "# (with this many degrees of freedom it is almost identical to the z value, so we could have used that too)\n", "t <- qt(p = 0.975, df = df)\n", "\n", "# finding the sample standard deviation of wages\n", "s <- sd(census_data$wages)\n", "\n", "# calculating the lower and upper bounds of the desired confidence interval\n", "\n", "lower_bound <- x - (t*s/sqrt(n))\n", "upper_bound <- x + (t*s/sqrt(n))\n", "\n", "lower_bound\n", "upper_bound" ], "id": "c300bebf-9c07-4fbe-91dd-5fa0f176aa9a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are 95% confident that the mean wage of all Canadians lies between\n", "$54274$ and $54690$. 
We also know this is a valid confidence interval\n", "estimate because our `wages` variable and the procedure for sampling\n", "meet all three of the criteria outlined:\n", "\n", "1. Random sampling: Statistics Canada (the source for this data)\n", "   utilizes random sampling.\n", "2. Our sample size $n$ is large enough to satisfy condition 2b, so we\n", "   don’t even need to check the distribution of `wages`.\n", "3. Our observations are independent because our sample size $n$ is\n", "   \\< 10% of the total population (since the total population of\n", "   Canada is about 38 million).\n", "\n", "This is a very narrow confidence interval, which reflects our large\n", "sample size $n$: our confidence interval estimate is very precise.\n", "\n", "## Exercise\n", "\n", "Matilda takes a random sample of 10 books from a library in order to\n", "estimate the average number of pages among all books in the library.\n", "Let’s assume the library is very large and the library does not keep\n", "record of the specifics of its overall population of books in terms of\n", "their pages.\n", "\n", "Does it make more sense for Matilda to use a standard z distribution or\n", "Student’s t distribution when calculating the margin of error for her\n", "confidence interval?" ], "id": "470ae61a-5507-45e4-a8a3-7142188ddc4b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_1 <- \"X\" # your answer for \"z\" or \"t\" in place of \"X\"\n", "\n", "test_1()" ], "id": "68234a46-8a4d-492a-b19e-9ff62069e4ae" }, { "cell_type": "markdown", "metadata": {}, "source": [ "From her sample, Matilda finds a sample mean of 280 and sample variance\n", "of 400. She wants to construct a 90% confidence interval to estimate the\n", "population mean number of pages. What will be the upper and lower bounds\n", "of this interval (assuming it is a valid confidence interval)?" 
], "id": "c8cf6d95-b562-4141-ab14-b88f179d616a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here\n", "\n", "answer_2 <- # your answer for the lower bound here, rounded to 2 decimal places\n", "answer_3 <- # your answer for the upper bound here, rounded to 2 decimal places\n", "\n", "test_2()\n", "test_3()" ], "id": "19e9ee2f-cbd6-4a04-bdd2-a55a6b276c47" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Confidence Intervals for the Sample Proportion\n", "\n", "While we’ve looked at the example of mean GPA throughout this notebook,\n", "we can also calculate confidence intervals for sample proportions.\n", "Let’s try another example:\n", "\n", "Condition: A population must vote for either of the two political\n", "parties (A or B)\n", "\n", "Aim: Find out the proportion of the population that voted for party A\n", "\n", "Method:\n", "\n", "1. Collect a sample and corresponding sample proportion\n", "2. Using the point estimate, we establish a confidence interval\n", "3. Estimate the proportion who voted for party A within a certain degree\n", " of confidence\n", "\n", "Before we establish the confidence interval, we must make sure that our\n", "sampling process again satisfies three conditions; this time, however,\n", "the second of these three conditions will be different:\n", "\n", "1. **Random sample**\n", "\n", "- Typically found through simple random sampling\n", "\n", "2. **The sampling distribution of the sample proportions is approximately\n", " normally distributed**\n", "\n", "- We must have at least 10 “successes” and 10 “failures” in our\n", " sample; this means at least 10 people in our sample voted for party\n", " A and at least 10 people voted for party B.\n", "\n", "- Therefore, very small sample sizes (e.g. $n = 5$ or $n = 10$)\n", " will fail this condition.\n", "\n", "3. 
**Sample observations must be independent either because**\n", "\n", "- A\\) We sample with replacement (when we record an observation, we\n", " put it back in the population with the possibility of drawing it\n", " again)\n", "\n", "- B\\) Our sample size is \\< 10% of the population size\n", "\n", "If conditions 1-3 are all met, we are able to construct a valid\n", "confidence interval around our sample proportion point estimate. We now\n", "turn to the one case we must consider when calculating the margin of\n", "error and confidence interval for sample proportions.\n", "\n", "## The Only Case: We Don’t Know the Population Standard Deviation\n", "\n", "When we don’t know the standard deviation for the population, we use the\n", "following formula to construct the confidence interval of a sample\n", "proportion:\n", "\n", "$$\n", "\\hat P \\pm z_{\\alpha / 2} \\cdot \\sqrt{\\frac{\\hat P \\cdot (1 - \\hat P)}{n}}\n", "$$\n", "\n", "where $\\hat P$ is the sample proportion, $z_{\\alpha / 2}$ is the critical value (from\n", "the standard normal table) for a chosen confidence level $1-\\alpha$, and\n", "$n$ is the sample size.\n", "\n", "Note: if we knew the population standard deviation, we would also know the population proportion and there would be no point in sampling and constructing confidence intervals to estimate it!\n", "\n", "For example:\n", "\n", "Let’s calculate a 95% confidence interval for the sample proportion of\n", "the census dataset who have one or more kids in their household\n", "(`pkids == 1`). We can immediately calculate our sample proportion,\n", "which serves as our point estimate." 
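, "\n", "\n", "As a quick sanity check on the formula above, here is the same calculation done once by hand. The numbers are hypothetical and chosen only for illustration (they are not from the census data): suppose $\\hat P = 0.6$ in a sample of $n = 100$ voters, with a 95% confidence level so that $z_{\\alpha/2} \\approx 1.96$:\n", "\n", "$$\n", "0.6 \\pm 1.96 \\cdot \\sqrt{\\frac{0.6 \\cdot (1 - 0.6)}{100}} \\approx 0.6 \\pm 1.96 \\cdot 0.049 \\approx 0.6 \\pm 0.096\n", "$$\n", "\n", "Under these assumed numbers, the interval is roughly $(0.504, 0.696)$. The code below carries out the same steps for the census data."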
], "id": "bbe38e7b-18b5-4865-9fff-7d2e90a123e4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculating our sample proportion of observations with pkids == 1\n", "p <- sum(census_data$pkids == 1) / n\n", "p" ], "id": "1b4d9bca-8365-42ff-8a3a-e7e12d2a971d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our sample proportion, we can find our $z$ critical\n", "value for a 95% confidence level, and then use it together with our sample proportion\n", "$\\hat P$ and sample size $n$ to calculate our confidence interval." ], "id": "842cbea5-6c68-47ad-a0e8-a7f590ccedeb" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# finding the z critical value for a 95% confidence level: alpha = 0.05, so alpha/2 = 0.025 sits in each tail\n", "z <- qnorm(p = 0.025, lower.tail=FALSE)\n", "\n", "# calculating the lower and upper bounds of the desired confidence interval\n", "lower_bound <- p - z*sqrt(p*(1-p)/n)\n", "upper_bound <- p + z*sqrt(p*(1-p)/n)\n", "\n", "lower_bound\n", "upper_bound" ], "id": "fa551166-a75c-400a-90b3-5e5b79ebf0ba" }, { "cell_type": "markdown", "metadata": {}, "source": [ "From our above calculations, we can say that we are 95% confident that\n", "the true proportion of Canadians with a child in their household lies\n", "between 0.7075 and 0.7104 (that is, between 70.75% and 71.04%).\n", "\n", "- Note: In rare cases when our sample proportion point estimate is either very high or low and our sample size is small, we may find that the upper or lower bound of the confidence interval for a sample proportion is outside of the accepted domain of \\[0, 1\\]. We may choose to either report the true interval or cap our interval at 0 or 1, while noting that this does not reflect the full confidence interval found.\n", "\n", "## Exercise\n", "\n", "Matilda now wants to know the proportion of students in her school who\n", "are left-handed. Let’s assume her sampling procedure meets all of the\n", "criteria for constructing a valid confidence interval. 
She takes a\n", "sample of 200 students and finds that 22 of them are left-handed. What\n", "are the upper and lower bounds of a 98% confidence interval for the\n", "proportion of the school’s overall student body that is left-handed?" ], "id": "e593f46a-2c3b-4081-b1ef-e4bfde5eed15" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here\n", "\n", "answer_4 <- # your answer for the lower bound here, rounded to 3 decimal places (in proportion form, e.g. 10% = 0.1)\n", "answer_5 <- # your answer for the upper bound here, rounded to 3 decimal places (in proportion form, e.g. 10% = 0.1)\n", "\n", "test_4()\n", "test_5()" ], "id": "0ee44022-a0ae-4392-a79e-333711e2c7b3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s imagine that our sample size and confidence level are fixed and\n", "cannot be changed. What sample proportion of students who are\n", "left-handed would result in the narrowest confidence interval possible?" ], "id": "c67be30b-c224-4e58-b531-22643364a994" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_6 <- # your answer for the sample proportion here (e.g. 10% = 0.1)\n", "\n", "test_6()" ], "id": "9b136d23-a154-4c92-8f44-e7066665ecde" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Confidence Intervals for the Sample Variance\n", "\n", "Finally, we may want to construct confidence intervals for the sample\n", "variance itself in order to estimate the population variance (and thus\n", "standard deviation) that we do not know. The following conditions must be met for this\n", "confidence interval to be valid:\n", "\n", "> 1. Sample collected randomly\n", "> 2. Original population is normally distributed or at least\n", "> symmetrically distributed without many outliers.\n", "> - If this does not hold, our sample size must be \\> 120 (invokes\n", "> the Central Limit Theorem)\n", "> 3. 
Our sample observations must be independent either because\n", "> - A\\) We sample with replacement (when we record an observation,\n", "> we put it back in the population with the possibility of\n", "> drawing it again)
\n", ">\n", "> - B\\) Our sample size is \\< 10% of the population size\n", "\n", "If conditions 1-3 are all met, we are able to construct a valid\n", "confidence interval for our sample variance point estimate.\n", "\n", "## The Only Case: We Don’t Know the Population Standard Deviation\n", "\n", "We only need to worry about this case when calculating confidence intervals\n", "for the sample variance, since if we knew the population standard\n", "deviation, we would also know the population variance and therefore not\n", "need to construct a confidence interval to estimate this number.\n", "\n", "Instead, we assume we have only a sample variance to rely on. The\n", "formula works a bit differently in this case: instead of adding and\n", "subtracting a margin of error to our point estimate, we will use our\n", "point estimate to calculate the lower and upper bounds of our confidence\n", "interval directly.\n", "\n", "$$\n", "\\left( \\frac{(n - 1) \\cdot s^2}{\\chi^2_{\\alpha/2}}, \\; \\frac{(n - 1) \\cdot s^2}{\\chi^2_{1 - \\alpha/2}} \\right)\n", "$$\n", "\n", "where $n$ is the sample size, $s^2$ is the sample variance, and\n", "$\\chi^2_{\\alpha/2}$ and $\\chi^2_{1 - \\alpha/2}$ are the chi-squared critical\n", "values with $n - 1$ degrees of freedom that leave areas of $\\alpha/2$ and\n", "$1 - \\alpha/2$ to their right, for a chosen confidence level $1 - \\alpha$.\n", "\n", "> Note: Constructing this type of confidence interval is different from\n", "> previous instances with the sample mean and sample proportion. This is\n", "> because the (scaled) sample variance follows a **non-normal distribution**,\n", "> the $\\chi^2$ distribution, which is not symmetric, rather than a normal\n", "> distribution like the sample mean and sample proportion.\n", "\n", "Let’s do one final example to reinforce the calculation of confidence\n", "intervals for this type of sample statistic. We will construct a 95%\n", "confidence interval for the sample variance of `mrkinc`. Our procedure will\n", "follow exactly the steps above, although this time we need to **use the\n", "chi-squared distribution in place of the t or z distributions**. 
We can\n", "calculate our sample variance first." ], "id": "48fd07c1-5ed8-4ae5-9383-7e53ef260f28" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculating the variance of mrkinc (stored under a name that does not mask the var() function)\n", "var_mrkinc <- var(census_data$mrkinc)\n", "var_mrkinc" ], "id": "eb62418c-9e84-4296-ba3f-2ab6136e177c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our sample variance (which is quite large), we can find\n", "the other statistics necessary to calculate our confidence interval\n", "estimate." ], "id": "4f046e44-f22f-4638-ab84-0c386e74808a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# finding the chi-squared values for a 95% confidence level (alpha/2 = 0.025 in each tail) and n - 1 degrees of freedom\n", "# upper_chi is the small quantile (used for the upper bound); lower_chi is the large quantile (used for the lower bound)\n", "upper_chi <- qchisq(p = 0.025, df = df, lower.tail = TRUE)\n", "lower_chi <- qchisq(p = 0.025, df = df, lower.tail = FALSE)\n", "\n", "# calculating the upper and lower bounds of the desired confidence interval\n", "lower_bound <- (df*var_mrkinc)/lower_chi\n", "upper_bound <- (df*var_mrkinc)/upper_chi\n", "\n", "lower_bound\n", "upper_bound" ], "id": "9ea6d867-fb7b-4984-b4f8-3970bb84f460" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Therefore, we are 95% confident that the variance of market income among\n", "all Canadians is within (767761585, 7745209769). 
This is quite a large\n", "interval, but given the size of the variance for this variable, this is\n", "reasonable.\n", "\n", "## Exercise\n", "\n", "Finally, Matilda wants to know the variance of weights of all cars ever\n", "sold at her father’s car dealership.\n", "\n", "- Since she can’t find the variance of the thousands of cars sold, she\n", " takes a random sample of 40 cars and records their weights.\n", "\n", "- She finds that they have a sample mean weight of 5,000 pounds and a\n", " sample variance of 250,000.\n", "\n", "- Matilda wants to construct a 95% confidence interval estimate for\n", " the population variance.\n", "\n", "Given the information above, what are the upper and lower bounds of this\n", "confidence interval?" ], "id": "fb74fb1e-dd61-4179-a816-570103afd6ff" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here\n", "\n", "answer_7 <- # your answer for the lower bound here, rounded to the nearest whole number\n", "answer_8 <- # your answer for the upper bound here, rounded to the nearest whole number\n", "\n", "test_7()\n", "test_8()" ], "id": "43931bd9-d412-44bb-9516-ddb6907c3da4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s now say that Matilda draws a new random sample of 40 cars and\n", "reports 95% confidence that the population variance of car weights falls\n", "within the confidence interval (490000, 640000). Under this sampling\n", "procedure, what is the 95% confidence interval estimate for the standard\n", "deviation of weights of all cars ever sold at the dealership?" 
], "id": "10f0c483-95fa-4d25-92a9-a7c699d0f2b1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_10 <- # your answer for the lower bound here\n", "answer_11 <- # your answer for the upper bound here\n", "\n", "test_10()\n", "test_11()" ], "id": "e851c912-e4c3-4208-b1d7-e7d8f5335dbe" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Factors Which Impact the Width of Confidence Intervals\n", "\n", "We can see that no matter the parameter we are estimating, we always\n", "need to establish\n", "\n", "- The confidence level\n", "\n", "- The sample size.\n", "\n", "Because these numbers are chosen early on during the sampling procedure,\n", "they can easily be changed. Let’s explore what happens to our confidence\n", "intervals when we change each of these numbers.\n", "\n", "# Changing the Sample Size\n", "\n", "Let’s say we want to change our sample size $n$.\n", "\n", "- **If we increase our sample size** $n$,\n", "\n", "    - Both our margin of error and the width of our confidence interval\n", "      will decrease since our sample is larger and our estimates are\n", "      therefore more precise.\n", "\n", "- **If we decrease our sample size** $n$,\n", "\n", "    - Both our margin of error and the width of our confidence interval\n", "      will increase since our sample is smaller and our estimates are\n", "      therefore less precise.\n", "\n", "To see this point interactively, modify the code below by changing the\n", "input for $n$ (currently set at 30). We can see that the width of the\n", "confidence intervals increases or decreases depending on whether we\n", "decrease or increase the simulated sample size." 
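, "\n", "\n", "For a rough sense of scale before running the code, consider the Case 1 margin of error $z_{\\alpha/2} \\cdot \\sigma / \\sqrt n$ with assumed values $\\sigma = 1$ and $z_{\\alpha/2} \\approx 1.96$ (a 95% confidence level). These values are chosen only for illustration:\n", "\n", "$$\n", "n = 30: \\quad \\frac{1.96 \\cdot 1}{\\sqrt{30}} \\approx 0.358 \\qquad\\qquad n = 120: \\quad \\frac{1.96 \\cdot 1}{\\sqrt{120}} \\approx 0.179\n", "$$\n", "\n", "Quadrupling the sample size halves the margin of error, because the margin shrinks in proportion to $1/\\sqrt n$."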
], "id": "6173ab23-1013-4c50-87b2-c91dff325ff6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# setting the seed first so that both the simulated population and the sample are reproducible\n", "set.seed(2)\n", "\n", "# simulating a population with mean 0 and (known) standard deviation 1\n", "population <- rnorm(10000, 0, 1)\n", "\n", "# defining a function which outputs a 95% confidence interval for a given sample size\n", "create_confidence_intervals <- function(n) {\n", "    x <- mean(sample(population, n))\n", "    z <- qnorm(p = 0.025, lower.tail = FALSE) # critical value for a 95% confidence level\n", "    lower <- x - (z * 1 / sqrt(n)) # the population standard deviation is 1\n", "    upper <- x + (z * 1 / sqrt(n))\n", "    return(c(lower, upper))\n", "}\n", "\n", "# calling the function, tweak default sample size 30 here!\n", "create_confidence_intervals(30)" ], "id": "2ee7e646-9959-41e9-8db5-8c03a11d6f36" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Changing the Confidence Level\n", "\n", "- If we increase the confidence level to a higher percentage, then the\n", " new confidence interval will be wider.\n", "\n", "- If we decrease the confidence level to a lower percentage, then the\n", " new confidence interval will be narrower.\n", "\n", "The logic is simple: to be more confident that our confidence interval\n", "actually contains the true value of the population parameter, the\n", "interval must be wider; likewise, a lower confidence level allows a\n", "narrower interval.\n", "\n", "Increased confidence level → Higher error bound → Wider confidence interval\n", "\n", "Decreased confidence level → Lower error bound → Narrower confidence interval\n", "\n", "This all occurs mathematically through an increase or decrease in our\n", "margin of error (or bounds), which in turn is due to the increase or decrease\n", "in our $z$ or $t$ critical value.\n", "\n", "To see this point interactively, modify the code below by changing the\n", "input for $\\alpha$ (currently set at 0.05, indicating a 95% confidence\n", "level). 
We can see that the width of the confidence\n", "intervals increases or decreases depending on whether we increase or\n", "decrease the simulated confidence level." ], "id": "d361eb6d-68bf-470c-9a74-590067bd114e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# setting the seed first so that both the simulated population and the sample are reproducible\n", "set.seed(2)\n", "\n", "# simulating a population with mean 0 and (known) standard deviation 1\n", "population <- rnorm(10000, 0, 1)\n", "\n", "# defining a function which outputs a confidence interval for a given alpha (confidence level 1 - alpha)\n", "create_confidence_intervals <- function(alpha) {\n", "    x <- mean(sample(population, 100))\n", "    z <- qnorm(p = alpha / 2, lower.tail = FALSE) # two-sided critical value: alpha/2 in each tail\n", "    lower <- x - (z * 1 / sqrt(100)) # the population standard deviation is 1\n", "    upper <- x + (z * 1 / sqrt(100))\n", "    return(c(lower, upper))\n", "}\n", "\n", "# calling the function, tweak default 0.05 alpha (95% confidence level) here!\n", "create_confidence_intervals(0.05)" ], "id": "36b42875-8fec-4b0a-bcdb-ac90559ea650" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n", "Matilda thinks that one of her confidence intervals above is too wide\n", "and wishes to narrow it. What could she do in order to achieve this\n", "goal?\n", "\n", "- A. increase the sample size and raise the confidence level\n", "- B. decrease the sample size and lower the confidence level\n", "- C. increase the sample size and lower the confidence level\n", "- D. decrease the sample size and raise the confidence level" ], "id": "9ce92041-5e3c-484a-a065-30f5da7f54b2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_1 <- \"...\" # enter your choice here\n", "\n", "test_11()" ], "id": "d24a5943-a78c-47ee-9f26-e168949336d0" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Common Misconceptions\n", "\n", "Up to this point, we’ve covered what confidence intervals are, how we\n", "calculate them, and how they’re sensitive to two key parameters. 
Let’s\n", "lastly clarify a few common misconceptions about the interpretation of\n", "confidence intervals.\n", "\n", "## Misconception 1:\n", "\n", "*If we have a 95% confidence interval, this is a **concrete range**\n", "within which our estimated population parameter **must** fall*.\n", "\n", "- If we repeated our sampling procedure many times and constructed a\n", " confidence interval each time, we would expect about 95% of these\n", " confidence intervals to contain our true parameter.\n", "\n", "- **However, this is not 100%**, since about 5% of our confidence\n", " intervals will not contain the true parameter. There is no stopping\n", " the actual confidence interval we calculate from being one of those\n", " 5%.\n", "\n", "- Therefore, we cannot say with absolute certainty that our true\n", " parameter lies within the interval that we calculate.\n", "\n", "The confidence interval is an *estimator* and not an official range of\n", "possible values for the population parameter.\n", "\n", "## Misconception 2:\n", "\n", "*If we have a confidence level of 95%, 95% of our population data must\n", "lie within the calculated confidence interval*.\n", "\n", "- This is not true since our confidence level indicates the long-run\n", " percentage of constructed confidence intervals which contain our\n", " true parameter but says nothing about the spread of our actual data.\n", "\n", "- To find the range within which 95% of our data lie, we must consult\n", " a histogram for the population.\n", "\n", "- For instance, if our data is strongly bimodally distributed (around half\n", " of our data is clustered far to the left of our mean, and the other\n", " half is clustered far to the right of our mean), our calculated 95%\n", " confidence interval will likely contain very little (much less than\n", " 95%) of the data.\n", "\n", "The confidence level does **not** determine the spread of the actual\n", "data.\n", "\n", "## Misconception 3:\n", "\n", "*If we have a confidence level of 
95%, a confidence interval calculated\n", "from a sample of 500 observations will more likely contain the true\n", "parameter than a confidence interval calculated from a sample of 100\n", "observations.*\n", "\n", "- We know from the previous section that a confidence interval\n", " generated from a sample of $n = 500$ will be narrower than one\n", " generated from $n = 100$.\n", "\n", "- However, a confidence level by definition is the percentage of\n", " calculated intervals we expect to contain the true parameter of\n", " interest if we calculated these intervals over and over.\n", "\n", "- This means any one interval from a sample of $n = 100$ has a 95% chance of\n", " containing the true parameter, just as any one interval from a\n", " sample of $n = 500$ has a 95% chance of containing the true parameter. Each\n", " interval (the wider one from $n = 100$ and the narrower one from\n", " $n = 500$) has a chance of containing the true parameter in relation\n", " to all other calculated intervals for that same sample size.\n", "\n", "- Hence, whether we have an interval from a sample of $n = 100$ or $n = 500$,\n", " we are still 95% confident in both cases that the true parameter\n", " lies within that interval.\n", "\n", "The probability of a given interval containing the true parameter is not\n", "affected by the sample size. This probability only changes when we\n", "change our confidence level.\n", "\n", "> **🔎 Let’s think critically**\n", ">\n", "> > 🟠 Every research context will drastically shape how confidence\n", "> > intervals are approached. As we have seen, the volume and quality of\n", "> > data affect how accurate data analyses can be, and many rules of\n", "> > thumb in data science are simply that - rules of thumb, as opposed\n", "> > to hard facts about how to report statistics. \n", "> > 🟠 What are some situations where you want to know that something is\n", "> > true with nearly 100% confidence? 
\n", "> > 🟠 What are some situations where the uncertainty of a statistic is\n", "> > maybe not so bad? \n", "> > 🟠 What does it *really* mean to have something within or outside of\n", "> > a confidence interval?" ], "id": "dd1bd647-7958-4aa5-b582-ce08910986c0" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }