*Anneke Dresselhuis, Colby Chambers, Jonathan Graves* \n", "2023-01-12\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Jupyter

\n", "- Introduction to R

\n", "- Introduction to Visualization

\n", "- Central Tendency

\n", "- Distribution

\n", "- Dispersion and Dependence

\n", "\n", "### Outcomes\n", "\n", "After completing this notebook, you will be able to:\n", "\n", "- Interpret and report confidence intervals\n", "- Calculate confidence intervals under a variety of conditions\n", "- Understand how the scope of sampling impacts confidence intervals\n", "\n", "### References\n", "\n", "- [Simulating the Construction of Confidence Intervals for Sample\n", " Means](https://rpubs.com/pgrosse/545955)" ], "id": "729fb229-49b8-43f2-a99f-f8c46f1dbe69" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source(\"testing_confidence_intervals.r\")\n", "\n", "# importing typical packages\n", "library(tidyverse)\n", "library(haven)\n", "library(ggplot2)\n", "\n", "# loading the dataset\n", "census_data <- read_dta(\"../datasets/01_census2016.dta\")\n", "\n", "# cleaning the dataset: drop rows with missing wages or market income,\n", "# and rows where pkids is coded 9\n", "census_data <- filter(census_data, !is.na(wages))\n", "census_data <- filter(census_data, !is.na(mrkinc))\n", "census_data <- filter(census_data, pkids != 9)" ], "id": "59c606d1-970e-4613-b80b-526f29f2fc23" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "So far, we have developed a strong grasp of core concepts in statistics.\n", "We’ve learned about measures of central tendency and variation, as well\n", "as how these measures relate to distributions. We have also learned\n", "about random sampling and how sampling distributions can shed light on\n", "the parameters of a population distribution.\n", "\n", "So, how can we apply this knowledge to real empirical work? While\n", "another notebook that covers hypothesis testing will provide a deeper\n", "answer to this question, this current notebook provides a starting\n", "point. In this notebook, we will learn about a key concept which relates\n", "to how we report our results empirically when sampling from a\n", "population. 
This is the idea of a **confidence interval**.\n", "\n", "# Confidence Intervals and Point Estimates\n", "\n", "A **confidence interval** is an estimate that gives us a range of values\n", "within which we expect a population parameter to fall. Put another way,\n", "it provides a range within which we can have a certain degree of\n", "*confidence* that a desired parameter, such as a population mean, lies.\n", "\n", "This is in contrast to a **point estimate**. A point estimate is a\n", "specific estimated value of another object, like a population parameter.\n", "For instance, the sample mean and sample standard deviation are point\n", "estimates of the population mean and the population standard deviation\n", "(respectively).\n", "\n", "Let’s make this concrete with an example.\n", "\n", "## Example\n", "\n", "Suppose that we want to know the mean GPA of undergraduate students at\n", "universities across Canada. Finding this exact mean (the population\n", "mean) would require us to collect the GPA of every single undergraduate\n", "student in the country without error: a nearly impossible task. As a\n", "result, and as we have already seen, we collect a sample of students and\n", "find the mean of their GPAs (the sample mean). This allows us to make\n", "inferences about the desired, yet unobtainable, population mean. The\n", "sample mean we find is called our point estimate.\n", "\n", "However, as we already learned, this sample mean will be different every\n", "time we draw a different random sample of undergraduate students.\n", "Suppose one random sample we draw just so happens to include many\n", "high-achieving students, while another does not. 
The first sample will\n", "give us a very high point estimate for the population mean, while the\n", "second sample will give us a point estimate much lower.\n", "\n", "So our question becomes: how can we report an estimate for the\n", "population mean GPA if we draw a different mean GPA for every possible\n", "sample? This is where **confidence intervals** become useful. They allow\n", "us to combine information about central tendency and dispersion into a\n", "single object.\n", "\n", "# Confidence Levels\n", "\n", "As we will see, every time we draw a sample and get a new point\n", "estimate, we can compute a confidence interval to describe the precision\n", "of this estimate. To calculate this confidence interval, we always must\n", "choose a **confidence level**. This is a percentage which represents the\n", "long-run percentage of confidence intervals within which we would expect\n", "to actually find our desired population parameter.\n", "\n", "We choose a higher confidence level when we want to have more certainty\n", "in our confidence interval serving as a good estimate for the population\n", "parameter of interest. The most commonly chosen confidence level is 95%,\n", "but other values are also common.\n", "\n", "> In our GPA example, this means that if we drew random samples of\n", "> undergraduate students 1000 different times and got 1000 sample mean\n", "> point estimates and corresponding confidence intervals, we would\n", "> expect 950 of these confidence intervals to contain the actual average\n", "> GPA of all Canadian undergraduates.\n", "\n", "Of course, we cannot conclusively *find* the desired population mean to\n", "prove this; however, choosing a high confidence level gives us more\n", "certainty that any one of our hypothetical confidence intervals\n", "(including the one we actually calculate from our specific sampling)\n", "includes our unknowable parameter of interest. 
When we choose a\n", "confidence level of 95% and calculate a confidence interval around our\n", "sample mean point estimate, we say that we are **95% confident** that\n", "the true mean GPA of all Canadian undergraduates lies in this range.\n", "\n", "Let’s now see how we actually calculate a confidence interval for a\n", "given point estimate and confidence level.\n", "\n", "# Calculating Confidence Intervals\n", "\n", "The general form of a confidence interval is the following:\n", "\n", "$$\n", "(\\text{point estimate} - \\text{margin of error}, \\text{point estimate} + \\text{margin of error})\n", "$$\n", "\n", "or\n", "\n", "$$ \n", "\\text{point estimate} \\pm \\text{margin of error}\n", "$$\n", "\n",
"where our point estimate is just the sample statistic we find from our\n", "random sample. The margin of error is the more cumbersome piece to\n", "calculate. It is subtracted from and added to our point estimate to find\n", "the **lower bound** and **upper bound** of our confidence interval\n", "estimate. While this general formula and format for the confidence\n", "interval always holds, calculating the margin of error varies depending\n", "on what sample statistic we are looking at and what we know about our\n", "population. Let’s look at a few important special cases.\n", "\n", "# Confidence Intervals for the Sample Mean\n", "\n", "When we want to construct a confidence interval for a sample mean we’ve\n", "found (e.g. the mean GPA of a sample of Canadian undergraduates), we\n", "must first reflect on how we gathered our data and what its sampling\n", "distribution looks like. This is because we must meet the following\n", "three conditions in order to construct a valid confidence interval for a\n", "sample mean in the first place:\n", "\n",
"> 1. We must have a random sample (typically found through simple\n", ">    random sampling)\n", "> 2. The sampling distribution of the sample means is approximately\n", ">    normal, which can be met in one of the following ways:\n", ">    1. our original population is normally distributed\n", ">    2. our sample size is \\> 120 (invokes the Central Limit Theorem)\n", "> 3. Our sample observations must be independent, for instance:\n", ">    1. we sample with replacement (when we record an observation, we\n", ">       put it back in the population with the possibility of drawing\n", ">       it again)\n", ">    2. our sample size is \\< 10% of the population size\n", "\n",
"If each of conditions 1-3 is met, we are able to construct a valid\n", "confidence interval around our sample mean point estimate. There are two\n", "different cases for this construction, but they’re pretty similar.\n", "\n", "## Case 1: We Know the Population Standard Deviation\n", "\n", "In rare instances, we may know the variance (and thus standard\n", "deviation) of our original population of interest. In this case, we are\n", "able to consult our trustworthy $z$-statistic when calculating the\n", "margin of error in our confidence interval. We use the following formula\n", "to calculate the confidence interval for a sample mean when our\n", "population standard deviation is known:\n", "\n", "$$\n", "\\bar x \\pm z_{\\alpha / 2} \\cdot \\frac{\\sigma}{\\sqrt n}\n", "$$\n", "\n", "where $\\bar x$ is the sample mean, $z$ is the critical value (from the\n", "standard normal distribution) for a chosen confidence level $1-\\alpha$,\n", "$\\sigma$ is the population standard deviation, and $n$ is the sample\n", "size.\n", "\n",
"However, this case is extremely rare. After all, it requires us to know\n", "the standard deviation but not the mean of a population! Typically we\n", "either have very good information on a population (and thus know both\n", "its mean and standard deviation) or we don’t (and thus don’t know either\n", "its mean or standard deviation). Nonetheless, it is good to keep this\n", "case in mind since it occasionally comes up.\n", "\n", "## Case 2: We Don’t Know the Population Standard Deviation\n", "\n", "Much more frequently, we won’t know the population standard deviation.\n", "In this case, we must instead invoke the $t$-distribution when\n", "calculating the margin of error for our confidence intervals. We also\n", "use the sample standard deviation in place of the population standard\n", "deviation, since we do know this statistic. 
The calculation procedure\n", "otherwise follows exactly as before in Case 1.\n", "\n", "$$\n", "\\bar x \\pm t_{\\alpha / 2} \\cdot \\frac{s}{\\sqrt n}\n", "$$

where $\\bar x$ is the sample mean, $t$ is the critical value\n", "(from the $t$-distribution) for a chosen confidence level $1-\\alpha$,\n", "$s$ is the sample standard deviation, and $n$ is the sample size.\n", "\n", "We can use an example from our dataset to emphasize this point. Let’s\n", "construct a 95% confidence interval for the sample mean of the variable\n", "`wages`. We can immediately calculate its mean, which serves as our\n", "sample mean point estimate." ], "id": "48d22a3d-788b-473a-9994-6992eebc642e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculating the sample mean of wages\n", "x <- mean(census_data$wages)" ], "id": "889c305e-1d5b-482f-b32b-933d9c18f44a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have this point estimate, we can calculate our margin of\n", "error around it. To do so, we must first find 3 other statistics: the\n", "$t$ value corresponding to a 95% confidence level, the standard\n", "deviation of `wages`, and the sample size (the number of observations\n", "recorded for `wages`)." 
], "id": "d38bda9f-89ab-435a-a718-60f01ab25fc6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# finding the sample size and associated degrees of freedom\n", "n <- nrow(census_data)\n", "df <- n - 1\n", "\n", "# finding the two-sided t critical value for a confidence level of 95%:\n", "# alpha = 0.05, so we need the 1 - alpha/2 = 0.975 quantile\n", "# (for large n this value converges on the z value, so we could have used that too)\n", "t <- qt(p = 0.975, df = df)\n", "\n", "# finding the sample standard deviation of wages\n", "s <- sd(census_data$wages)\n", "\n", "# calculating the lower and upper bounds of the desired confidence interval\n", "lower_bound <- x - (t*s/sqrt(n))\n", "upper_bound <- x + (t*s/sqrt(n))\n", "\n", "lower_bound\n", "upper_bound" ], "id": "94f8c49d-10c5-4503-b219-283de2244adc" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a formal setting, we would thus report the following: we are 95%\n", "confident that the mean wage of all Canadians lies between the\n", "`lower_bound` and `upper_bound` values printed above. We also know this\n", "is a valid confidence interval estimate because our `wages` variable and\n", "the procedure for sampling it meet all three of the criteria outlined:\n", "Statistics Canada (the source for this data) utilizes random sampling,\n", "our sample size $n$ is large enough to invoke the Central Limit Theorem\n", "(so we don’t even need to check the distribution of `wages`), and our\n", "sample size $n$ is \\< 10% of the total population (since the total\n", "population of Canada is about 38 million). This is a very narrow\n", "confidence interval, only a few hundred dollars wide. We can understand\n", "this by looking at our formula for calculating it and realizing that our\n", "sample size, $n$, is very large. This adds precision to our confidence\n", "interval estimate, highlighted in the narrowness of the interval we\n", "found above.\n", "\n", "## Exercise\n", "\n", "Matilda (she/her) takes a random sample of 10 books from a library in\n", "order to estimate the average number of pages among all books in the\n", "library. 
Let’s assume the library is very large and does not keep records of\n", "the page counts of its overall population of books. Does it make more\n", "sense for Matilda to use a standard normal ($z$) distribution or\n", "Student’s $t$ distribution when calculating the margin of\n", "error for her confidence interval?" ], "id": "058bf263-fd1c-4419-b284-e32ba0a6b1ae" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_1 <- \"X\" # your answer for \"z\" or \"t\" in place of \"X\"\n", "\n", "test_1()" ], "id": "a687c885-824a-4681-a368-6ea4ad78b7e1" }, { "cell_type": "markdown", "metadata": {}, "source": [ "From her sample, Matilda finds a sample mean of 280 and sample variance\n", "of 400. She wants to construct a 90% confidence interval to estimate the\n", "population mean number of pages. What will be the upper and lower bounds\n", "of this interval (assuming it’s a valid confidence interval)?" ], "id": "331d70c1-4663-470c-a8aa-819280b91477" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here\n", "\n", "answer_2 <- # your answer for the lower bound here, rounded to 2 decimal places\n", "answer_3 <- # your answer for the upper bound here, rounded to 2 decimal places\n", "\n", "test_2()\n", "test_3()" ], "id": "6626b25e-073f-42a4-848e-1566ce23091b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Confidence Intervals for the Sample Proportion\n", "\n", "While we’ve looked at the example of mean GPA throughout this notebook,\n", "we can also calculate confidence intervals for sample proportions.\n", "Imagine we have just two political parties (A and B), compulsory\n", "voting, and we want to know the proportion of the population that voted\n", "for party A. Of course, this would be quite costly and time-consuming to\n", "calculate, so we instead collect a sample and corresponding sample\n", "proportion. 
This sample proportion becomes our point estimate around\n", "which we construct a confidence interval to estimate the population\n", "proportion with a certain degree of confidence. Before we begin\n", "constructing this interval, we must make sure that our sampling process\n", "again satisfies three conditions; this time, however, the second of\n", "these three conditions will be different:\n", "\n", "> 1. We must have a random sample (typically found through simple\n", "> random sampling)\n", "> 2. The sampling distribution of the sample proportions is normally\n", "> distributed, which typically requires there to be at least 10\n", "> “successes” and 10 “failures” in our sample (in the example above,\n", "> this would mean at least 10 people in our sample voted for party A\n", "> and at least 10 people voted for party B). We can see here that\n", "> very small sample sizes (i.e. $n = 5$, $n = 10$, etc. will fail\n", "> this condition)\n", "> 3. Our sample observations must be independent, which can be met\n", "> through one of the following two ways:

\n", ">\n", "> \n", ">\n", "> 1. we sample with replacement (when we record an observation, we put\n", "> it back in the population with the possibility of drawing it\n", "> again)

\n", "> 2. our sample size is \\< 10% of the population size\n", "\n", "If conditions 1-3 are all met, we are able to construct a valid\n", "confidence interval around our sample proportion point estimate. We now\n", "turn to the one case we must consider when calculating the margin of\n", "error and confidence interval for sample proportions.\n", "\n", "## The Only Case: We Don’t Know the Population Standard Deviation\n", "\n", "This is the only case we have to worry about when calculating the sample\n", "proportion. This is because the population standard deviation is\n", "necessarily a function of the population proportion and sample size $n$.\n", "Thus, if we knew the population standard deviation, we would necessarily\n", "know the population proportion and there would be no point in sampling\n", "and constructing confidence intervals to estimate it! For this reason,\n", "we worry about only the one case where we don’t know the standard\n", "deviation of the population. The formula for the confidence interval of\n", "a sample proportion is below:\n", "\n", "$$\n", "\\hat P \\pm z_{\\alpha / 2} \\cdot \\sqrt \\frac {\\hat P \\cdot(1 - \\hat P)}{n}\n", "$$\n", "\n", "where $\\hat P$ is the sample proportion, $z$ is the critical value (from\n", "the standard normal table) for a chosen confidence level $1-\\alpha$, and\n", "$n$ is the sample size.\n", "\n", "Let’s do an example. Let’s calculate a 95% confidence interval for the\n", "sample proportion of the census dataset who has one or more kids in\n", "their household (`pkids == 1`). We can immediately calculate our sample\n", "proportion, which serves us our point estimate." 
], "id": "3abc1d9b-b271-4bd3-b083-4841a5bf952c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculating our sample proportion of observations with pkids == 1\n", "p <- sum(census_data$pkids == 1) / n\n", "p" ], "id": "bec78e2a-9f71-4975-81d3-0b73578ed8f6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our sample proportion, we can find our $z$ critical\n", "value for a 95% confidence level, as well as use our sample proportion\n", "$\\hat P$ and sample size $n$, to calculate our confidence interval." ], "id": "11ee1b71-7e76-4c33-b3ad-749ba972438c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# finding the two-sided z critical value for a confidence level of 95%\n", "# (alpha = 0.05, so the upper-tail area is alpha/2 = 0.025)\n", "z <- qnorm(p = 0.025, lower.tail = FALSE)\n", "\n", "# calculating the lower and upper bounds of the desired confidence interval\n", "lower_bound <- p - z*sqrt(p*(1-p)/n)\n", "upper_bound <- p + z*sqrt(p*(1-p)/n)\n", "\n", "lower_bound\n", "upper_bound" ], "id": "ae79b643-4b6c-4204-a722-c74007739319" }, { "cell_type": "markdown", "metadata": {}, "source": [ "From our above calculations, we can say that we are 95% confident that\n", "the true proportion of Canadians with a child in their household lies\n", "between the `lower_bound` and `upper_bound` values printed above\n", "(roughly 71%). Importantly, it is possible to run into cases\n", "where the upper or lower bound of the confidence interval for a sample\n", "proportion is outside of the accepted domain of \\[0, 1\\]. This is\n", "particularly likely when our sample proportion point estimate is already\n", "very high or low and our sample size is not very large. In these\n", "cases, we may choose to either report the true interval or cap our\n", "interval at 0 or 1 accordingly, with a note that this does not reflect\n", "the full confidence interval found. However, these cases are rare. 
We\n", "can check for ourselves that the above confidence interval is valid\n", "since it obeys all three of the criteria we have laid out for\n", "confidence intervals for sample proportions.\n", "\n", "## Exercise\n", "\n", "Matilda now wants to know the proportion of students in her school who\n", "are left-handed. Let’s assume her sampling procedure meets all of the\n", "criteria for constructing a valid confidence interval. She takes a\n", "sample of 200 students and finds that 22 of them are left-handed. What\n", "are the upper and lower bounds of a 98% confidence interval for the\n", "proportion of the school’s overall student body that is left-handed?" ], "id": "5454537f-019b-4cd5-b779-5870716afeb2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here\n", "\n", "answer_4 <- # your answer for the lower bound here, rounded to 3 decimal places (in proportion form, i.e. 10% = 0.1)\n", "answer_5 <- # your answer for the upper bound here, rounded to 3 decimal places (in proportion form, i.e. 10% = 0.1)\n", "\n", "test_4()\n", "test_5()" ], "id": "ce98de66-2718-48c1-af93-bb58f26ac840" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s imagine that our sample size and confidence level are fixed and\n", "cannot be changed. What sample proportion of students who are\n", "left-handed would result in the narrowest confidence interval possible?" ], "id": "5580463a-0d83-4b39-a9d0-3830fc8ad515" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_6 <- # your answer for the sample proportion here (i.e. 
10% = 0.1)\n", "\n", "test_6()" ], "id": "b9bc5746-987c-4766-953d-36f4892b142e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Confidence Intervals for the Sample Variance\n", "\n", "Finally, we may want to construct confidence intervals for a sample\n", "variance itself in order to estimate the population variance (and hence\n", "the population standard deviation) that we do not know. The following\n", "conditions must be met for this confidence interval to be valid:\n", "\n", "> 1. We must have a random sample (typically found through simple\n", "> random sampling)\n", "> 2. Our original population is normally distributed or at least\n", "> symmetrically distributed without many outliers. If this does not\n", "> hold, our sample size must be \\> 120 (invokes the Central Limit\n", "> Theorem)\n", "> 3. Our sample observations must be independent, which can be met\n", "> through one of the following two ways:

\n", ">\n", "> \n", ">\n", "> 1. we sample with replacement (when we record an observation, we put\n", "> it back in the population with the possibility of drawing it\n", "> again)

\n", "> 2. our sample size is \\< 10% of the population size\n", "\n", "If conditions 1-3 are all met, we are able to construct a valid\n", "confidence interval for our sample variance point estimate.\n", "\n", "## The Only Case: We Don’t Know the Population Standard Deviation\n", "\n", "We only need worry about this case when calculating confidence intervals\n", "for the sample variance. This is because, if we knew the population\n", "standard deviation, we would necessarily know the population variance\n", "and thus constructing a confidence interval to estimate this number\n", "would be useless! Instead, we assume we have only a sample variance to\n", "rely on. It should be noted that the formula we will use works a bit\n", "differently in this case. Rather than add and subtract a margin of error\n", "to our point estimate, we will instead use our point estimate to\n", "calculate the lower and upper bounds of our confidence interval\n", "directly.\n", "\n", "$$\n", "(\\frac{(n - 1) \\cdot s^2}{\\chi^2_{\\alpha/{2}}}, \\frac{(n - 1) \\cdot s^2}{\\chi^2_{1 - \\alpha/{2}}})\n", "$$\n", "\n",
"where $n$ is the sample size, $s^2$ is the sample variance, and\n", "$\\chi^2_{\\alpha/2}$ and $\\chi^2_{1 - \\alpha/2}$ are the chi-squared\n", "critical values with upper-tail areas $\\alpha/2$ and $1 - \\alpha/2$, for\n", "a chosen confidence level $1 - \\alpha$ and degrees of freedom $n - 1$.\n", "\n", "Constructing this type of confidence interval may feel a bit less\n", "intuitive and familiar than the previous two. This is because the sample\n", "variance does not follow a normal distribution. Unlike the sample mean\n", "and sample proportion, it follows a different, decidedly non-normal\n", "distribution: the $\\chi^2$ distribution. For this reason, we construct\n", "our confidence intervals for this sample statistic differently, as\n", "depicted above.\n", "\n", "Let’s do one final example to reinforce the calculation of confidence\n", "intervals for this type of sample statistic. We will construct a 95%\n", "confidence interval for the variance of `mrkinc`. Our procedure will\n", "follow exactly the steps above, although this time we need to use the\n", "chi-squared distribution in place of the t or z distributions. We can\n", "calculate our sample variance first." ], "id": "9ddede3d-90bd-4ab2-b705-80c0329b10f6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculating the variance of mrkinc\n", "var <- var(census_data$mrkinc)\n", "var" ], "id": "2879d463-9f7d-4e94-b19a-da8e2d63fca6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our sample variance (which is quite large), we can find\n", "the other statistics necessary to calculate our confidence interval\n", "estimate." ], "id": "41c1c330-8062-4fff-a4f2-cccf71e9dcca" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# finding the chi-squared values for a 95% confidence level and n - 1 degrees of freedom\n", "# (alpha = 0.05, so each tail has area alpha/2 = 0.025)\n", "upper_chi <- qchisq(p = 0.025, df = df, lower.tail = TRUE)\n", "lower_chi <- qchisq(p = 0.025, df = df, lower.tail = FALSE)\n", "\n", "# calculating the upper and lower bounds of the desired confidence interval\n", "lower_bound <- (df*var)/lower_chi\n", "upper_bound <- (df*var)/upper_chi\n", "\n", "lower_bound\n", "upper_bound" ], "id": "6f389f1d-56d7-4893-85b3-569c14a90e9f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the above, we can say that we are 95% confident that the variance\n", "of market income among all Canadians lies between the `lower_bound` and\n", "`upper_bound` values printed above. This is quite a wide interval in\n", "absolute terms, but given the size of the variance for this variable,\n", "this is reasonable.\n", "\n", "## Exercise\n", "\n", "Finally, Matilda wants to know the variance of weights of all cars ever\n", "sold at her father’s car dealership. 
Naturally, since she can’t find the\n", "variance of the thousands of cars sold, she consults the dealership\n", "archives and takes a random sample of 40 cars and records their weights.\n", "She finds that they have a sample mean weight of 5,000 pounds and a\n", "sample variance of 250,000. Matilda wants to construct a 95% confidence\n", "interval estimate for the population variance. Given the information\n", "above, what are the upper and lower bounds of this confidence interval?" ], "id": "4c8ae1c2-ba3a-4ebf-ab74-5c749d80233e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here\n", "\n", "answer_7 <- # your answer for the lower bound here, rounded to the nearest whole number\n", "answer_8 <- # your answer for the upper bound here, rounded to the nearest whole number\n", "\n", "test_7()\n", "test_8()" ], "id": "b2a32618-ee08-453b-924a-97f811f0ffa6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s now say that Matilda draws a new random sample of 40 cars and\n", "reports 95% confidence that the population variance of car weights falls\n", "within the confidence interval (490000, 640000). Under this sampling\n", "procedure, what is the 95% confidence interval estimate for the standard\n", "deviation of weights of all cars ever sold at the dealership?" ], "id": "861e02bb-e992-488c-a38a-18a9abca566b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_10 <- # your answer for the lower bound here\n", "answer_11 <- # your answer for the upper bound here\n", "\n", "test_10()\n", "test_11()" ], "id": "975b096d-a524-47cc-9bf9-d8e84c67eaeb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Factors Which Impact the Width of Confidence Intervals\n", "\n", "Looking at the above formulas, we now have a better understanding of\n", "exactly what factors go into calculating a confidence interval. 
More\n", "specifically, we can see that no matter the parameter we are estimating,\n", "we always need the following numbers: a confidence level and sample\n", "size. These numbers are chosen early on during the sampling procedure.\n", "Thus, they can easily be changed. Let’s explore what happens to our\n", "confidence intervals when we change each of these numbers.\n", "\n", "# Changing the Sample Size\n", "\n", "Let’s say we want to change our sample size $n$. If we increase our\n", "sample size $n$, we can see mathematically that in all cases our margin\n", "of error goes down (or our bounds explicitly come closer together in the\n", "case of sampling variance). As a result, our confidence interval\n", "shrinks. This makes sense intuitively. If we draw a larger sample, then\n", "for any confidence level we can expect to estimate our desired\n", "population parameter more precisely. This is indicated by a narrower\n", "confidence interval. The same logic applies when we decrease our sample\n", "size $n$. Both our margin of error and confidence interval will\n", "increase, indicative of the fact that our sample is smaller and we\n", "therefore have less precision in estimating our population parameter of\n", "interest. To see this point interactively, modify the code below by\n", "changing the input for $n$ (currently set at 30). We can see that the\n", "size of the confidence intervals increases or decreases depending on\n", "whether we decrease or increase the simulated sample size." 
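, "\n", "\n", "As a rough worked illustration of this scaling (a sketch assuming a known\n", "population standard deviation $\\sigma = 1$ and the 95% critical value\n", "$z_{\\alpha/2} \\approx 1.96$, both chosen here purely for illustration),\n", "the margin of error at $n = 30$ and at $n = 120$ is approximately:\n", "\n", "$$\n", "1.96 \\cdot \\frac{1}{\\sqrt{30}} \\approx 0.358, \\qquad 1.96 \\cdot \\frac{1}{\\sqrt{120}} \\approx 0.179\n", "$$\n", "\n", "Under these assumptions, quadrupling the sample size halves the margin of\n", "error, since the margin shrinks in proportion to $1/\\sqrt{n}$."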
], "id": "55a4f869-d885-49ab-9996-e31ec836287a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# setting the seed before any random draws so the results are reproducible\n", "set.seed(2)\n", "population <- rnorm(10000, 0, 1)\n", "\n", "# defining a function which outputs a 95% confidence interval for a given sample size\n", "# (the population is drawn from a standard normal, so we take sigma = 1 and use z)\n", "create_confidence_intervals <- function(n) {\n", "  x <- mean(sample(population, n))\n", "  z <- qnorm(p = 0.025, lower.tail = FALSE)\n", "  lower <- x - (z*1/sqrt(n))\n", "  upper <- x + (z*1/sqrt(n))\n", "  return(c(lower, upper))\n", "}\n", "\n", "# calling the function, tweak default sample size 30 here!\n", "create_confidence_intervals(30)" ], "id": "71e071b8-2e59-4722-aaff-ab59d75f8e7b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Changing the Confidence Level\n", "\n", "Now suppose we instead want to change our confidence level. If we\n", "increase our confidence level, we are saying that if we hypothetically\n", "calculated many confidence intervals, the percentage of these intervals\n", "containing our desired population parameter should increase. We are thus\n", "asking for confidence interval estimates which capture the true\n", "parameter of interest more often. To capture this parameter more\n", "frequently for a given sample size, the width of our confidence interval\n", "(the range of possibilities for capturing the true parameter) must\n", "naturally increase. Similar logic applies to decreasing our confidence\n", "level: our confidence interval narrows. This all occurs mathematically\n", "through an increase or decrease in our margin of error (or bounds)\n", "respectively, driven by an increase or decrease in our $z$ or $t$\n", "critical value. To see this point interactively, modify the code below\n", "by changing the input for $\\alpha$ (currently set at 0.05, indicating a\n", "95% confidence level). 
We can see that the width of\n", "the confidence interval increases or decreases depending on whether we\n", "increase or decrease the simulated confidence level." ], "id": "5a063963-200b-4a02-8b1a-bebfc2de5620" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fix the seed before any random draws so the results are reproducible\n", "set.seed(2)\n", "population <- rnorm(10000, 0, 1)\n", "\n", "# defining a function which outputs a confidence interval for a given confidence level\n", "create_confidence_intervals <- function(alpha) {\n", " x <- mean(sample(population, 100))\n", " # two-sided interval: alpha is split evenly between the two tails\n", " z <- qnorm(p = alpha / 2, lower.tail = FALSE)\n", " lower <- x - (z * 1 / sqrt(100))\n", " upper <- x + (z * 1 / sqrt(100))\n", " return(c(lower, upper))\n", " }\n", "\n", "# calling the function, tweak default 0.05 alpha (95% confidence level) here!\n", "create_confidence_intervals(0.05)" ], "id": "d9e27b83-2b91-4f2b-ac52-811ebf15fc68" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n", "Matilda thinks that one of her confidence intervals above is too wide\n", "and wishes to narrow it. What could she do in order to achieve this\n", "goal?\n", "\n", "- A. increase the sample size and raise the confidence level\n", "- B. decrease the sample size and lower the confidence level\n", "- C. increase the sample size and lower the confidence level\n", "- D. decrease the sample size and raise the confidence level" ], "id": "4a9f11f7-6916-46e2-8c13-7af94861e3c9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_1 <- \"...\" # enter your choice here\n", "\n", "test_11()" ], "id": "b4a1dec8-c37d-428d-9190-9ae297b95855" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Common Misconceptions\n", "\n", "Up to this point, we’ve covered what confidence intervals are, how we\n", "calculate them, and how they respond to two key choices: the sample size\n", "and the confidence level. 
Let’s\n", "lastly clarify a few misconceptions about the interpretation of\n", "confidence intervals.\n", "\n", "## Misconception 1:\n", "\n", "*If we have a 95% confidence interval, this is a concrete range within\n", "which the population parameter we are estimating must fall*.\n", "\n", "Hopefully the error in this way of thinking is quite clear now. If we\n", "repeated our sampling procedure many times and constructed a confidence\n", "interval each time, we would expect about 95% of these confidence\n", "intervals to contain our true parameter. However, this is not 100%.\n", "Theoretically, about 5% of our confidence intervals will not contain the\n", "true parameter. There is no stopping the actual confidence interval we\n", "calculate from being one of those 5%. Due to this, we cannot say with\n", "absolute certainty that our true parameter lies within the interval that\n", "we calculate. It is a common mistake to assume that a confidence\n", "interval is an official range within which the true parameter must fall.\n", "The parameter can lie anywhere; intervals built this way simply capture\n", "it most of the time (very often if our confidence level is high enough),\n", "so we can place some trust in the interval as an estimator. This is\n", "why the confidence interval is an *estimator* and not a concrete range\n", "of possible values for our population parameter. If we had 100%\n", "certainty the true parameter fell within our calculated interval, it\n", "would not be much of an estimator but instead just a complete spectrum\n", "of the possible values the parameter could take on.\n", "\n", "## Misconception 2:\n", "\n", "*If we have a confidence level of 95%, 95% of our population data must\n", "lie within the calculated confidence interval*.\n", "\n", "This is not true. Our confidence level indicates the long-run percentage\n", "of constructed confidence intervals which contain our true parameter. It\n", "says nothing about the spread of our actual data. 
To find the range\n", "within which 95% of our data lie, we must consult a histogram for the\n", "population. It could easily be the case that our data are strongly\n", "bimodal (around half of our data is clustered far to the left of our\n", "mean, and the other half is clustered far to the right of our mean). In\n", "this case, our calculated 95% confidence interval will likely contain\n", "very little (much less than 95%) of the data.\n", "\n", "## Misconception 3:\n", "\n", "*If we have a confidence level of 95%, a confidence interval calculated\n", "from a sample of 500 observations will more likely contain the true\n", "parameter than a confidence interval calculated from a sample of 100\n", "observations.*\n", "\n", "This one might feel quite counterintuitive. After all, we already know\n", "from the previous section that a confidence interval generated from a\n", "sample of $n = 500$ will be narrower than one generated from $n = 100$.\n", "However, think about the nuance of this statement. A confidence level by\n", "definition is the percentage of calculated intervals we expect to\n", "contain the true parameter of interest if we calculated these intervals\n", "over and over. In this situation, any one interval from a sample of\n", "$n = 100$ has a 95% chance of containing the true parameter, just as any\n", "one interval from a sample of $n = 500$ has a 95% chance of containing the\n", "true parameter. Each interval (the wider one from $n = 100$ and narrower one\n", "from $n = 500$) has a chance of containing the true parameter in relation\n", "to all other calculated intervals for that same sample size. The\n", "probability of a given interval containing the true parameter does not\n", "depend on the sample size. This probability only\n", "changes when we change our confidence level.\n", "\n", "> **🔎 Let’s think critically**\n", ">\n", "> > 🟠 Every research context will drastically shape how confidence\n", "> > intervals are approached. 
As we have seen, the volume and quality of\n", "> > data affect how accurate data analyses can be, and many rules of\n", "> > thumb in data science are simply that - rules of thumb, as opposed\n", "> > to hard facts about how to report statistics. \n", "> > 🟠 What are some situations where you want to know that something is\n", "> > true with nearly 100% confidence? \n", "> > 🟠 What are some situations where the uncertainty of a statistic is\n", "> > maybe not so bad? \n", "> > 🟠 What does it *really* mean to have something within or outside of\n", "> > a confidence interval?" ], "id": "14873aee-1019-411d-a5b4-10ffbaa1c428" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }