{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.8 - Beginner - Sampling Distributions\n", "\n", "COMET Team
*Colin Grimes, Valeria Zolla, Rathin Dharani, Jonathan\n", "Graves* \n", "2023-01-12\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Jupyter\n", "- Introduction to R\n", "- Introduction to Visualization\n", "- Introduction to Central Tendency\n", "- Introduction to Distribution\n", "- Dispersion and Dependence\n", "\n", "### Learning Objectives\n", "\n", "- Define a simple random sample and why it is useful in econometrics\n", "- Use a simple random sample to infer population parameters\n", "- Produce a sampling distribution through repeated random sampling\n", "- Understand how sampling distributions relate to the Law of Large\n", " Numbers and the Central Limit Theorem\n", "\n", "# Introduction to Statistical Inference\n", "\n", "Let’s say we wanted to determine a particular characteristic of a\n", "population, the **population parameter**. For example, if we want to\n", "determine the average height of a Canadian resident, our population\n", "would be every Canadian resident and our population parameter would be\n", "average height.\n", "\n", "Since collecting data from the entire population is often unrealistic,\n", "we can instead survey a **sample** and gather a **sample estimate.**\n", "From our example, it would be much more practical to survey a 1000\n", "people in Canada and calculate their average height. While this sample\n", "estimate won’t be a true match, it’s probably a good estimate. But how\n", "do we know we picked a representative sample of the Canadian population?\n", "\n", "## Simple Random Samples\n", "\n", "A **simple random sample** is a subset of a population, in which each\n", "member of the population has an equal chance of being selected for the\n", "sample. In our case if we only sampled professional athletes to\n", "determine the average height of the Canadian population, we wouldn’t get\n", "a very good estimate of our population parameter. Simple random samples\n", "allow us to create sample parameters that are **unbiased estimators** of\n", "the population parameter.\n", "\n", "There are two statistical approaches:\n", "\n", "1. We could conduct a census. In our case we could measure the height\n", " of all 38 million people living in Canada. While this may produce\n", " little error, it would be costly and we may not actually be able to\n", " collect all of this data - what if we miss someone?\n", "\n", "2. We could take a **simple random sample** of the Canadian population,\n", " in which all 38 million Canadians have an equal probability of being\n", " chosen. This would allow us to make **statistical inferences** on\n", " the population mean, without the cost of conducting a census.\n", "\n", "> **Note**: While there are other types of sampling in economics, simple\n", "> random sampling is a good starting point. In any case, we must try our\n", "> best to get an unbiased estimator through a random sample.\n", "\n", "## Simulating Data\n", "\n", "We will work with **simulated data** drawing from normal and uniform\n", "distributions. First we need to set a **random number seed** which would\n", "allow us to reproduce results drawn from a distribution of a random\n", "variable.\n", "\n", "Note: The `set.seed()` function in R uses a random number generator that\n", "produces a sequence of numbers that are completely determined by a seed\n", "value - they are effectively random, but also reproducible. This is\n", "useful if you wanted to check your classmates’ work with the same set of\n", "numbers. See [this\n", "reference](https://datasciencebook.ca/classification2.html#randomseeds)\n", "for more information on pseudo-random numbers." ], "id": "be7957e6-f141-4825-b9e4-2c0ce9aa0714" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source(\"beginner_sampling_distributions_tests.r\") # load some background tests\n", "library(tidyverse)\n", "library(dplyr)\n", "library(infer)\n", "#install.packages(\"gridExtra\") #uncomment these to run the section\n", "#library(gridExtra)" ], "id": "ec3c4a28-3f94-40e9-b986-c44400a44ec0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "seed_number <- 1 # my seed number\n", "\n", "set.seed(seed_number) # set the random number seed to reproduce results" ], "id": "cefe16e7-0691-4902-ac47-4c7dedcdd41a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following, you can try changing the seed number and seeing how\n", "some of the results are different. Remember that the default seed we\n", "used was `seed_number <- 1`.\n", "\n", "## Example 1: Simulating a Distribution\n", "\n", "Let’s say we wanted to create a hypothetical population with a variable\n", "called `x` of size 10,000, distributed normally with mean 0 and standard\n", "deviation 1. In this case `x` represents “household indebtedness.” Given\n", "that it is normally distributed, this means that most households have\n", "close to no debt, but few households have an extreme amount of debt and\n", "few others have large investments with consistent cash flow (negative\n", "debt). Below, we will create a data frame `population`, using the\n", "`rnorm` function." ], "id": "bd1b1676-fd58-4e98-acd8-4395c51fcbf2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x <- rnorm(10000, 0, 1) # Draw 10,000 observations from the random normal distribution with mean 0, and variance 1.\n", "\n", "population <- data.frame(x) #store it in a data frame" ], "id": "2c1fb780-c981-45b2-84b3-466a589b6915" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We should have a population mean of 0, and standard deviation of 1.We\n", "can also visualize our distribution using a histogram/ **population\n", "distribution**. In this example, our population distribution should be\n", "normally distributed, centered around 0, with the mean represeted as a\n", "blue line on the histogram." ], "id": "ffdfadae-766e-4b7d-8cf5-c382c2dda1a5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table <- population %>% \n", " summarize(mean = mean(x), sd = sd(x)) %>%\n", " round(digits = 2) #two decimal places\n", "\n", "table " ], "id": "4539ae9b-be18-42b0-b5fd-f9345d33e356" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_dist <- population %>%\n", " ggplot(aes(x = x)) + \n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population$x), colour=\"blue\", linetype = \"solid\", lwd = 1)\n", "pop_dist" ], "id": "cef1d148-e40f-41be-9559-952a8386660b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted to estimate the population mean above but we do not have\n", "the time to conduct a census of the population, we can randomly sample\n", "everyone in the population with equal probability.\n", "\n", "To take a simple random sample in R, we can use the `slice_sample`\n", "function. Let’s start with a very small sample of size 5, and see how it\n", "compares to our population mean. We will select the observations from\n", "the data set by their number." ], "id": "1507d5a9-906a-46f2-b1c9-21260fa45a39" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_data <- population %>% \n", " slice_sample(n = 5) # 5 observations in our sample\n", " \n", "s_table <- sample_data %>% \n", " summarize(mean = mean(x), sd = sd(x)) %>%\n", " round(digits = 2) # two decimal places \n", "\n", "s_table " ], "id": "467a0a08-8009-4b15-a2d8-e0b3646c4d91" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the sample average is very different from our true population\n", "average due to **small sample bias**. Therefore, we must take a larger\n", "sample to produce a more unbiased estimate of the population mean,\n", "\n", "Let’s take another simple random sample with a larger sample of size 30." ], "id": "081d1326-9c94-4ba6-ae50-4cffaeb372af" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_data <- population %>% \n", " slice_sample(n = 30, # 30 observations in our sample\n", " replace = TRUE) # with replacement\n", "\n", "s_table <- sample_data %>% \n", " summarize(mean = mean(x), sd = sd(x)) %>%\n", " round(digits = 2) # two decimal places\n", "\n", "s_table" ], "id": "c6348739-d817-402b-87ae-c74a4343c721" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just by increasing our sample from n = 5, to n = 30, our estimate has\n", "gotten much closer to the population mean. Hence, having a larger sample\n", "size gets us much closer to the true mean of the population.\n", "\n", "In this exercise, we have accomplished the following:\n", "\n", "1. Simulated a standard normally distributed population of size 10,000.\n", "2. Estimated the population mean, using the sample mean as an unbiased\n", " estimator.\n", "\n", "Next we will create a distribution of sample means through repeated\n", "random samples from our population.\n", "\n", "### Exercise 1: Simple random samples\n", "\n", "In this exercise:\n", "\n", "1. Create an object `a` which is a draw from the unifom distribution,\n", " with 5,000 observations between 10 and 20, using the random seed 20.\n", "2. Take a simple random sample of size 30 from our population, and\n", " calculate the sample mean and store it in an object `answer_1`" ], "id": "1dc32b0d-2566-424d-992b-9df54de38942" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(...)\n", "\n", "a <- runif(... , ... , ...) # variable you are creating\n", "eg_data_frame <- data.frame(a)\n", "\n", "answer_1 <- eg_data_frame %>%\n", " slice_sample(n = ...) %>%\n", " summarize(mean = mean(...)) %>%\n", " pull() %>%\n", " round(3)\n", "\n", "answer_1 \n", "\n", "test_1()" ], "id": "b3c3417a-6e89-475b-a487-f5b90e79125a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Repeated Random Samples and Sampling Distributions\n", "\n", "Remember that when we survey a sample, we are only observing a subset of\n", "the population and determining a sample estimate based off that sample.\n", "Hence, If we surveyed a different 1000 Canadian residents every time, we\n", "would certainly get a different estimate for the average height of a\n", "Canadian resident.\n", "\n", "Let’s say we can draw many samples of size $n$ from our normal\n", "distribution population. Every time we draw a new sample, we get a\n", "different estimate of the mean. Now, if we were to plot each of those\n", "estimates onto a histogram, we would get a **sampling distribution**.\n", "\n", "A sampling distribution of a statistic is a probability distribution\n", "based on repeated independent samples of size $n$ from a population of\n", "size $N$. We can now produce a distribution of sample means based on\n", "repeated and independent simple random samples from our population.\n", "\n", "If we take enough samples, we will find that the mean of the sampling\n", "distribution will be nearly equal the mean of our population\n", "distribution (our population parameter). With an infinite number of\n", "random samples, the mean of the sampling distribution will be exactly\n", "equal to the population parameter.\n", "\n", "As samples sizes increase, the standard error of the sampling\n", "distribution decreases. In other words, a larger sized sample $n$ would\n", "have a lower probability of having a sample estimate far away from the\n", "true population parameter. This is a very important concept in\n", "statistical inference and econometrics.\n", "\n", "### Scenario 1: Population is normally distributed\n", "\n", "We will need to take $R$ independent simple random samples of size $n$\n", "from our `population` of size $N$, to produce our sampling distribution\n", "($R$ observations).\n", "\n", "To do this in R, we will need to take a simple random sample of size\n", "$n$, compute the sample mean, and store it in a vector. We will then\n", "repeat this $R$ times, appending each sample mean into our vector. Our\n", "vector will represent the observations of our sampling distribution!\n", "\n", "**Standard error** is the standard deviation of the sampling\n", "distribution which indicates how much our sample mean will vary from\n", "sample to sample. For this exercise, we will keep the number of samples\n", "constant and progressively increase our sample size to see how it\n", "affects the standard error of the sampling distribution.\n", "\n", "- We will use the `rep_sample_n` function from the `infer` package\n", " in R. This function allows us to repeatedly take samples from a\n", " population.\n", "\n", "- The `size` parameter indicates the sample size and the `reps`\n", " parameter indicates the number of samples we wish to draw from our\n", " population." ], "id": "604e88cd-ef6e-48b0-9fdf-0878ec3d99d4" }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/html" }, "source": [ "" ], "id": "1794b8df-ea2d-42f8-8b67-7af350e076f3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The solid blue line represents the true mean of the normal\n", " distribution population (0)\n", "\n", "- The dashed red line shows us the mean of the sampling distribution.\n", " Since we are taking many samples, these two values should be very\n", " similar.\n", "\n", "#### 1. Suppose we take 1000 simple random samples ($R = 1000$), with a sample size of 5 ($n = 5$):" ], "id": "06243326-aaa2-4fe9-89a7-d606cc54060c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "reps_5_1000 <- population %>%\n", " rep_sample_n(size = 5, reps = 1000) %>% # creates 5 samples of size 5\n", " group_by(replicate) %>% # groups each of the samples \n", " summarize(mean = mean(x), sd = sd(x)) # calculates the mean of each sample\n", "\n", "#CALCULATES THE MEAN AND STANDARD ERROR OF THE SAMPLING DISTRIBUTION\n", "sample_5_mean_se <- reps_5_1000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_5_mean_se\n", "\n", "#VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "sampling_dist_5_1000 <- reps_5_1000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(reps_5_1000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "sampling_dist_5_1000" ], "id": "740236c6-2da3-4d6a-a667-945ce9961ba2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Suppose we take 1,000 simple random samples ($R = 1000$), with a sample size of 50 ($n = 50$):" ], "id": "88fc0585-6bca-4bad-97af-1c0ac5e476e5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "reps_50_1000 <- population %>%\n", " rep_sample_n(size = 50, reps = 1000) %>% # creates 1000 samples of size 5\n", " group_by(replicate) %>% # groups each of the samples \n", " summarize(mean = mean(x), sd = sd(x)) # calculates the mean of each sample\n", "\n", "# CALCULATES THE MEAN AND STANDARD ERROR OF THE SAMPLING DISTRIBUTION\n", "sample_50_mean_se <- reps_50_1000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_50_mean_se\n", "\n", "# VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "sampling_dist_50_1000 <- reps_50_1000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(reps_50_1000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "sampling_dist_50_1000" ], "id": "32741833-6a71-4b39-aa85-b9c8d7dc4139" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Pro tip:** See how the scale of the x axis for $n = 50$ has adjusted\n", "> from $n = 5$ in response to the reduction in error?\n", "\n", "Now that we have increased the number of repeated and independent simple\n", "random samples from 5 to 1000, the mean of our sampling distribution is\n", "much closer to our population mean. Let’s see how we can further reduce\n", "the standard error of our estimate by increasing our sample size.\n", "\n", "#### 3. Suppose we take 1,000 simple random samples ($R = 1000$), with a sample size of 500 ($n = 500$):" ], "id": "c6593956-0639-40a7-8d21-5c268a1c8e15" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "reps_500_1000 <- population %>%\n", " rep_sample_n(size = 500, reps = 1000, replace = FALSE) %>% #creates 1000 samples of size 5\n", " group_by(replicate) %>% #groups each of the samples \n", " summarize(mean = mean(x)) #calculates the mean of each sample\n", "\n", "# CALCULATES THE MEAN AND STANDARD ERROR OF THE SAMPLING DISTRIBUTION\n", "sample_500_mean_se <- reps_500_1000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_500_mean_se\n", "\n", "# VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "sampling_dist_500_1000 <- reps_500_1000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) +\n", " geom_vline(xintercept = mean(reps_500_1000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "sampling_dist_500_1000" ], "id": "69f2e1f3-baa3-47e9-ac2e-0ecb08433c9f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2: Taking Random Samples\n", "\n", "Let’s take a sample of size 50 with 1000 reps from the same `population`\n", "data frame and produce a data frame with only the means. Assign your\n", "answer to the object `answer_2`." ], "id": "c12a1fe0-39e2-41f3-8737-5ddd6d0a44b9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# FILL IN THE BLANKS\n", "set.seed(30)\n", "\n", "answer_2 <- population %>%\n", " rep_sample_n(... = ..., ... = ...) %>%\n", " group_by(...) %>%\n", " summarize(mean = ...(x))\n", "\n", "test_2()" ], "id": "3c497f1c-86cc-4777-8d99-60c161fd83b7" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the number of samples we drew from our population is constant,\n", "we saw that the mean of our sampling distribution was roughly equivalent\n", "to the true population mean for all of our examples. As we increased our\n", "samples size, the standard error decreased.\n", "\n", "- Having a low standard error means that the probability of having a\n", " sample produce a sample estimate that is far away from the true\n", " population parameter is very low.\n", "\n", " - With a sample size of 500, the sample means are mostly between\n", " -0.1 and +0.1.\n", "\n", " - With a sample size of 5, the sample means are mostly between -1\n", " and 1.\n", "\n", "Because it’s often only possible to collect *one* sample in real life,\n", "we tend to collect as large a sample as possible to minimize the\n", "standard error. This gives us high confidence that our sample estimate\n", "will be close to the true population parameter.\n", "\n", "In the section about Bootstrapping later in this lesson, we will explore\n", "how economists calculate the standard error of a sample estimate using\n", "only one sample.\n", "\n", "### Scenario 2: Population is not normally distributed\n", "\n", "A few things to note about increasing the number of samples:\n", "\n", "1. The mean of the sampling distribution comes closer to the true value\n", " of the population parameter.\n", "2. The sampling distribution will resemble a normal distribution curve\n", " regardless of the shape of the original population distribution\n", " (note that this is just the *sampling distribution* that starts to\n", " look more like a normal curve, and not the *population\n", " distribution*).\n", "\n", "This exercise will be nearly identical to above except now our\n", "population will be random uniformly distributed. For this exercise, we\n", "will see how varying the samples taken affects the sampling\n", "distribution. We can achieve this in R using the `runif()` function to\n", "create a uniform distribution." ], "id": "dbe3c154-aeaf-426f-870b-6a68f694d871" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y <- runif(10000, 0, 1) # Draw 10,000 observations from the random uniform distribution between 0 and 1\n", "\n", "population_unif <- data.frame(y)\n", "\n", "unif_pop_dist <- population_unif %>%\n", " ggplot(aes(x = y)) + \n", " geom_histogram(bins = 20, color = \"black\", fill = \"gray\")\n", "\n", "unif_pop_dist\n", "\n", "unif_pop_mean_se <- population_unif %>%\n", " summarize(true_mean = mean(y))\n", "\n", "unif_pop_mean_se" ], "id": "06ded72e-0e30-4e52-aa65-a1ccefd8eba3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Suppose we take 100 simple random samples (`reps` = 100), with a sample size of 100 (`size` = 100):" ], "id": "886d9c6b-6b9a-4db0-949c-abaf1b93063b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 100 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "unif_reps_100_100 <- population_unif %>%\n", " rep_sample_n(size = 100, reps = 100) %>% #creates 5 samples of size 5\n", " group_by(replicate) %>% #groups each of the samples \n", " summarize(mean = mean(y)) #calculates the mean of each sample\n", "\n", "# MEAN AND STANDARD ERROR OF SAMPLING DIST\n", "sample_100_mean_se <- unif_reps_100_100 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_100_mean_se\n", "\n", "# VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "unif_sampling_dist_100_100 <- unif_reps_100_100 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(unif_reps_100_100$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "unif_sampling_dist_100_100" ], "id": "51696b29-358b-4442-b9f9-2d3c1e343800" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Suppose we take 1,000 simple random samples ($R = 1000$), with a sample size of 5 ($n = 5$):" ], "id": "d5cb3112-1d69-4598-ba34-e80bc31fb237" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 100 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "unif_reps_100_1000 <- population_unif %>%\n", " rep_sample_n(size = 100, reps = 1000) %>% # creates 5 samples of size 5\n", " group_by(replicate) %>% # groups each of the samples \n", " summarize(mean = mean(y)) # calculates the mean of each sample\n", "\n", "# MEAN AND STANDARD ERROR OF SAMPLING DIST\n", "sample_1000_mean_se <- unif_reps_100_1000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_1000_mean_se\n", "\n", "# VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "unif_sampling_dist_100_1000 <- unif_reps_100_1000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(unif_reps_100_1000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "unif_sampling_dist_100_1000" ], "id": "df148af9-06dc-4312-9e74-9e08cbc09088" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. Suppose we take 20,000 simple random samples (`reps` = 20,000), with a sample size of 500 (`size` = 500\\$):" ], "id": "38251e06-33f8-49a9-8c0a-34d7b9a6455e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 100 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "unif_reps_100_20000 <- population_unif %>%\n", " rep_sample_n(size = 100, reps = 20000) %>% #creates 5 samples of size 5\n", " group_by(replicate) %>% #groups each of the samples \n", " summarize(mean = mean(y)) #calculates the mean of each sample\n", "\n", "# MEAN AND STANDARD ERROR OF SAMPLING DIST\n", "sample_20000_mean_se <- unif_reps_100_20000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_20000_mean_se\n", "\n", "#VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "unif_sampling_dist_100_20000 <- unif_reps_100_20000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(unif_reps_100_20000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "unif_sampling_dist_100_20000" ], "id": "347c4c5d-0e32-4538-8ebb-593365e3f03c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the difference between the mean of the sampling distribution\n", "and the true value of the population parameter become closer and closer\n", "as we increase the number of samples drawn. As mentioned above, we can\n", "also see that the sampling distribution curve more strongly resembles a\n", "normal distribution curve as we increase the number of samples drawn.\n", "This is true regardless of the fact that we began with a population that\n", "was uniformly distributed." ], "id": "4bcbcafb-bf6b-4ba4-b40b-2faeb222031c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SIZE <- 500\n", "REPS <- 2000\n", "\n", "pop_dist_eg <- population %>%\n", " rep_sample_n(size = SIZE, reps = REPS) %>% # CHANGE PARAMETERS HERE\n", " group_by(replicate) %>% \n", " summarize(mean = mean(x)) %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) \n", "\n", "unif_dist_eg <- population_unif %>%\n", " rep_sample_n(size = SIZE, reps = REPS) %>% # CHANGE PARAMETERS HERE\n", " group_by(replicate) %>% \n", " summarize(mean = mean(y)) %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) \n", "\n", "grid.arrange(pop_dist_eg, unif_dist_eg, ncol = 2)" ], "id": "4cedb5b4-f7e9-40f4-813b-6e32bc021fdb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above graph, we fixed the sample size and number of samples for\n", "both the normal distribution and uniform distribution examples, and we\n", "attained two sampling distributions that are strikingly similar to a\n", "normal distribution curve.\n", "\n", "Hence, regardless of the initial population distribution, as we increase\n", "our sample size and the number of simple and independent simple random\n", "samples, our sampling distribution converges to our population mean!\n", "\n", "Try it out yourself by increasing `SIZE` and `REPS` to see how higher\n", "values make the distribution become tighter around the true population\n", "mean.\n", "\n", "This converging of the sample mean to a normal distribution is referred\n", "to as the **Central Limit Theorem** in statistics.\n", "\n", "### Exercise 3: Taking Samples and Visualizing the Sampling Distribution\n", "\n", "Use the `population_unif` data frame and draw 500 samples of sample size\n", "50. Then create a visualization of the sampling distribution. Finally,\n", "calculate the standard error of this sampling distribution and assign it\n", "to the object `answer_3`. Remember the `sd` function calculates the\n", "standard deviation of a column. Template code has been provided." ], "id": "e2ffaf37-7a4a-4c6f-9cd0-120f3178396e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(40) # don't change this\n", "\n", "eg_3 <- ... %>%\n", " ...(... = ..., ... = ...) %>%\n", " ...(...) %>%\n", " summarize(mean = mean(y)) \n", "\n", "eg_3_graph <- eg_3 %>%\n", " ggplot(...(x = mean)) + \n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) \n", "\n", "answer_3 <- eg_3 %>%\n", " summarize(se = ...(mean)) %>%\n", " pull() %>%\n", " round(5)\n", "\n", "test_3()" ], "id": "769a2c77-3859-4a93-9441-37bef72681f9" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4: Understanding Standard Error\n", "\n", "What does the standard error tell us?\n", "\n", "**A**: the average variation in the population distribution \n", "**B**: the average variation in our sample estimates of the population\n", "parameter \n", "**C**: the probability that our sample estimate is correct \n", "**D**: exactly what the sample estimate tells us\n", "\n", "Answer in capital letters surrounded by quotation marks. eg… `\"E\"`" ], "id": "9586e950-bce0-4761-b01e-aba470721535" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_4 <- \"...\"\n", "\n", "test_4(answer_4)" ], "id": "d9ba43dc-c45b-4f3b-be69-73ec079ce343" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quick Review\n", "\n", "- The **population** is the collection of all individuals or entities\n", " we are interested in studying. From the above example, the\n", " population is entire population of Canada.\n", "\n", "- **Population parameter:** A particular quantity from the population\n", " that we’re interested in. In our example, this is the height.\n", "\n", "- A **simple sample** is a small subset of the population which we use\n", " to get the **sample estimate** of the parameter we are interested\n", " in. Since we are often unable to conduct a comprehensive census in\n", " the real world, we have to use the sample estimate as an estimate\n", " for the population parameter.\n", "\n", "- **Sampling distribution**: In a hypothetical world, we assume that\n", " we are able to take many samples of size `n` from the same\n", " population and attain a sample estimate from all of those samples.\n", " When we plot those sample estimates on a histogram, we get a\n", " **sampling distribution**. When we take many samples, we find that\n", " the mean of the sampling distribution (the most frequently observed\n", " estimate of the population parameter) is very close to the actual\n", " value of the population parameter.\n", "\n", "- Increasing the sample size decreases our standard error. When we\n", " increase the number of samples drawn from the population, the mean\n", " of the sampling distribution progresses closer and closer to the\n", " population parameter and our sampling distribution has a closer\n", " resemblance to a normal distribution curve.\n", "\n", "## Central Limit Theorem (CLT) and the Law of Large Numbers (LLN)\n", "\n", "The **Central Limit Theorem** tells us that no matter the distribution\n", "of the population, with sufficient sample size (usually above 30 and no\n", "more than 5% of the population), the distribution of sample means will\n", "tend toward normal.\n", "\n", "> Note: When sampling more than 5% of a population, we should use the\n", "> Finite Population Correct Factor (FPC). This is because the Central\n", "> Limit Thereom does not hold, and the standard error will be too large:\n", "> FPC = $((N-n)/(N-1))^{1/2}$\n", "\n", "The **Law of Large Numbers** tells us that with a sufficiently large\n", "number of repeated and independent random samples, the mean of our\n", "distribution of sample means will tend toward the population mean!\n", "Notice the visualization of our sample mean as we increase $R$ in the\n", "example below!" ], "id": "f7ee3d01-b2e2-43cb-ac1e-19df542def43" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(55)\n", "\n", "# RUN THIS CODE. \n", "\n", "r = 1000 # Number of iterations (number of sample means produced for our distribution)\n", "num = 10 # Sample size\n", "R = 1:r # Range of iterations\n", "means <- vector(,r) # Create empty vector to store sample means\n", "sample_mean <- vector(,r)\n", "\n", "for (val in R) { # loop over all values in R \n", " means[val] = mean(slice_sample(population_unif, n = num)$y) # Take a sample of size n from the population, caluclate the mean, and store in vector means, with index of the current iteration\n", " sample_mean[val] <- mean(means[1:val]) # Running value of sample mean at index equal to current iteration\n", "}\n", "\n", "options(scipen = 10)\n", "plot(sample_mean, cex=0.5, col=\"darkblue\")\n", "abline(h = mean(population_unif$y))" ], "id": "709f7530-f99c-4660-8997-449673868426" }, { "cell_type": "markdown", "metadata": {}, "source": [ "See how many many iterations (in this case, 1000) eventually produce a\n", "relatively stable line around 0.5 for this example?\n", "\n", "### INTERACTIVE EXERCISE DEMONSTRATING CLT\n", "\n", "In this exercise, we will create a new simulation for you to play around\n", "with $R$ and $n$ to get a better understanding of how sampling\n", "distribution changes with sample size and the number of repeated\n", "independent random samples. You can alter the values by changing the\n", "`SIZE` and `REPS` constants at the top of the code block. Try different\n", "values and see for yourself how the shape of the sampling distribution\n", "changes.\n", "\n", "Note: You can also try different distributions of our population, try\n", "the following: `rchisq(n, df)`, `rexp(n, rate)`, `rf(n, df1, df2)`" ], "id": "1f718952-271b-4600-bc1f-17a2c5ab8ba6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population_ex <- rexp(10000, 0.2) # Draw 10,000 observations from the random uniform distribution between 0 and 1\n", "population_ex <- data.frame(x)\n", "pop_dist_int <- population_ex %>% \n", " ggplot(aes(x = x)) +\n", " geom_histogram(bins = 50) +\n", " geom_vline(xintercept = mean(population_ex$x), colour=\"blue\", linetype = \"solid\", lwd = 1) \n", "pop_dist_int" ], "id": "d18dab16-98a4-4817-8080-4798b7906b6d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SIZE = 50 # change me here\n", "REPS = 5000 # change me here\n", "\n", "population_sample_estimates <- population_ex %>%\n", " rep_sample_n(size = SIZE, reps = REPS) %>%\n", " group_by(replicate) %>%\n", " summarize(mean = mean(x))\n", "\n", "pop_mean_se <- population_sample_estimates %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "pop_mean_se \n", "\n", "samp_dist_int <- population_sample_estimates %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_ex$x), colour=\"blue\", linetype = \"solid\", lwd = 1) +\n", " geom_vline(xintercept = mean(population_sample_estimates$mean), colour=\"red\", linetype = \"dashed\", lwd = 1)\n", "\n", "samp_dist_int" ], "id": "c87db8e1-aae6-431e-8ff5-82614f30af4a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bootstrapping\n", "\n", "From our example, we see that it can be quite difficult to calculate the\n", "standard error by first having access to the population (often\n", "impossible) and then drawing hundreds of samples from that population to\n", "make a sampling distribution (drawing two samples of 10 individuals is\n", "also the same as drawing one sample of 20).\n", "\n", "One common technique that economists use to calculate the standard error\n", "of their sample estimates is **bootstrapping** samples.\n", "\n", "We do this by taking one sample of size $n$ from our population and then\n", "take a sample of size $n$ from our sample, with replacement.\n", "\n", "Let’s imagine we have a paper bag with 5 red balls and 5 green balls,\n", "which we will treat as our sample.\n", "\n", "- In order to draw a bootstrap sample, we pick 1 ball out of the bag\n", " ten times while allowing replacement.\n", "\n", "- We take out the first ball, note whether it’s a red or green ball,\n", " then put it back in the bag and repeat the process 9 more times to\n", " arrive at our bootstrap sample.\n", "\n", "- Note that with this method, it is entirely possible to draw any\n", " combination of red and green balls. The bootstrap sampling method\n", " introduces sampling variability to our sample.\n", "\n", "If we repeat this process by drawing many bootstrap samples, we can\n", "arrive at a bootstrap distribution which serves as an approximation of\n", "our sampling distribution.\n", "\n", "### Example\n", "\n", "Let’s demonstrate this using our normal distribution data frame\n", "`population`.\n", "\n", "We first start off by taking a sample from our population using the\n", "`slice_sample` function.\n", "\n", "To connect it back to the average height example, imagine that we had\n", "resources to sample 100 people about their height out of the entire\n", "Canadian population." ], "id": "bcea725d-ce85-425f-9d66-7e461d5daab0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boot_main_sample <- population %>%\n", " slice_sample(n = 100)\n", "\n", "head(boot_main_sample)\n", "paste(\"This is the mean of our sample: \", mean(boot_main_sample$x))" ], "id": "fcdb6971-ae23-4878-a3ce-a7302e34650a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we want to know the uncertainty behind this estimate. Thinking back\n", "to our sampling distribution, how would we know if our sample estimate\n", "of 0.008 is close to the true population parameter? It’s entirely\n", "possible we got a sample estimate in the tails of the sampling\n", "distribution. To understand the uncertainty around our sample estimate,\n", "we must compute the standard error.\n", "\n", "Let’s draw bootstrap samples from our sample. The code is largely\n", "similar from before, except the key difference that we now add\n", "`replace = TRUE` to our `rep_sample_n` function. This tells R to allow\n", "replacement when taking samples." ], "id": "5d4a8264-0d76-470f-a5a0-ec27de95706c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boot_samples <- boot_main_sample %>%\n", " rep_sample_n(size = 500, reps = 1000, replace = TRUE)\n", "\n", "head(boot_samples)" ], "id": "a4dbbac8-ccf1-4f3c-92db-41e1ee0eea35" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our bootstrap samples, let’s calculate the bootstrap\n", "sample means and make a bootstrap sampling distribution. We use the same\n", "methods as we used above." ], "id": "77e31a63-e0e3-4158-8397-e652ccccf737" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boot_means <- boot_samples %>%\n", " group_by(replicate) %>%\n", " summarize(mean = mean(x))\n", "\n", "boot_mean_se <- boot_means %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean)) \n", "\n", "boot_mean_se\n", "\n", "boot_sampling_dist <- boot_means %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(boot_means$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) +\n", " ggtitle(\"Boostrap Sampling Distribution\")\n", "\n", "grid.arrange(boot_sampling_dist, sampling_dist_500_1000 + ggtitle(\"Sampling Distribution\"))" ], "id": "425b195e-d34f-4a04-9425-53de842654f8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The key takeaway here is that the standard error of the bootstrap\n", "sampling distribution serves as a great approximation of the standard\n", "error of the sampling distribution.\n", "\n", "Few things to note:\n", "\n", "1. The graphs that our bootstrap sampling distribution visually serves\n", " as a good approximation of our sampling distribution. The standard\n", " error for the bootstrap sampling distribution (0.00402) is close to\n", " our sampling distribution standard error (0.0433).\n", "2. The key difference in the graphs is the mean of the distributions.\n", " - The mean of the sampling distribution is nearly equivalent to\n", " the population parameter.\n", " - However, the mean of the bootstrap distribution will be the\n", " value of the sample estimate from the first sample we took from\n", " the population. This is the reason why the bootstrap\n", " distribution sits to the right of the sampling distribution.\n", "\n", "### Exercise 5: Taking Bootstrap Samples\n", "\n", "For this exercise, we have taken a sample from the `population` data\n", "frame and assigned it to the object `boot_eg_sample` of sample size 50.\n", "You are going to make a data frame of bootstrap sample estimates with\n", "1000 reps. Afterwards, you will assign the standard error of the\n", "bootstrap sampling distribution to `answer_5`. We have provided a code\n", "template below." ], "id": "84ba8beb-5818-4fcc-914d-5ecd6fe36629" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(60)\n", "\n", "boot_eg_sample <- ... %>%\n", " slice_sample(n = ...)\n", "\n", "boot_eg_sample_est <- boot_eg_sample %>%\n", " ...(... = ..., ... = ..., replace = ...) %>%\n", " ...(...) %>%\n", " summarize(mean = ...(x))\n", "\n", "answer_5 <- boot_eg_sample_est %>%\n", " summarize(se = ...(mean)) %>%\n", " pull() %>%\n", " round(3)\n", "\n", "test_5()" ], "id": "20bf5b39-73d1-48a6-997a-521c28b24365" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }