*Colin Grimes, Valeria Zolla, Rathin Dharani, Jonathan\n", "Graves* \n", "2023-01-12\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Jupyter\n", "- Introduction to R\n", "- Introduction to Visualization\n", "- Introduction to Central Tendency\n", "- Introduction to Distribution\n", "- Dispersion and Dependence\n", "\n", "### Learning Objectives\n", "\n", "- Define a simple random sample and why it is useful in econometrics\n", "- Use a simple random sample to infer population parameters\n", "- Produce a sampling distribution through repeated random sampling\n", "- Understand how sampling distributions relate to the Law of Large\n", " Numbers and the Central Limit Theorem\n", "\n", "# Introduction to Statistical Inference\n", "\n", "It’s important to begin with the fundamental problem of statistical\n", "inference. Let’s start with a **population**, which is commonly defined\n", "as the complete collection of all individuals or entities we are\n", "studying. From here, we want to determine a particular characteristic of\n", "the population, the **population parameter**. For example, if we want to\n", "determine the average height of a Canadian resident, our population\n", "would be every Canadian resident and our population parameter would be\n", "average height. Imagine how easy life would be for economists if we had\n", "access to any population parameter we desired. If we had perfect data\n", "from our population, we would be able to determine the average returns\n", "to education, the impact of stimulus checks for every resident, and so\n", "much more with absolute certainty all by surveying every member of the\n", "population.\n", "\n", "Unfortunately collecting data from the entire population is often too\n", "expensive and impractical. A more realistic way to find what we’re\n", "looking is to survey a **sample**, a subset of individuals or entities\n", "from the population. Through this sample, we can gather a **sample\n", "estimate**, an estimate of the population parameter from our sample.\n", "From our example, it would be much more practical to survey a 1000\n", "people in Canada and calculate their average height. Evidently, this\n", "won’t exactly the match the true average height of a Canadian resident,\n", "because we haven’t based our calculation off every Canadian resident,\n", "but it’s probably a good estimate. But how do we know we picked a\n", "representative sample of the Canadian population?\n", "\n", "## Simple Random Samples\n", "\n", "A **simple random sample** is a subset of a population, in which each\n", "member of the population has an equal chance of being selected for the\n", "sample. It’s important that this condition is met when selecting a\n", "sample. Imagine if we try to determine the average height of the\n", "Canadian population by only sampling professional athletes. We wouldn’t\n", "get a very good estimate of our population parameter. Simple random\n", "samples allow us to create sample parameters, which are **unbiased\n", "estimators** of the population parameter.\n", "\n", "There are two statistical approaches:\n", "\n", "1. We could conduct a census. To do this, we would need to measure the\n", " height of all 38 million people living in Canada. This technique\n", " produces little error, but comes at a very high cost. Moreover, we\n", " may not actually be able to collect all of this data - what if we\n", " miss someone?\n", "\n", "2. We could take a **simple random sample** of the Canadian population,\n", " in which all 38 million Canadians have an equal probability of being\n", " chosen. This would allow us to make **statistical inferences** on\n", " the population mean, without the cost of conducting a census.\n", "\n", "> **Note**: It’s important to keep in mind that there are other types of\n", "> sampling in economics. Simple random sampling is just a good starting\n", "> point for our discussion! In econometrics, it is crucial that we try\n", "> to get as close as possible to an unbiased estimator through a random\n", "> sample. Imagine if we are trying to determine the gender pay gap in\n", "> British Columbia and we only take a sample from those living in\n", "> downtown Vancouver or we only took a sample from those working in blue\n", "> collar jobs. That wouldn’t give us very accurate results. You will\n", "> learn more about this in ECON 326.\n", "\n", "## Simulating Data\n", "\n", "It will be helpful in this notebook to work with **simulated data**. We\n", "can simulate data by drawing from some well known distributions such as\n", "normal and uniform. Before we begin simulating data to work with, we\n", "need to be familar the concept of a **random number seed**. Setting a\n", "random number seed allows us to reproduce results drawn from a\n", "distribution of a random variable.\n", "\n", "In R and all programming languages, random numbers are not truly random,\n", "they are known as **pseudo-random numbers**. R uses a random number\n", "generator that produces a sequence of numbers that are completely\n", "determined by a seed value - they are effectively random, but also\n", "reproducible. Once you set the seed value using the `set.seed` function,\n", "everything after that point may look random, but is actually totally\n", "reproducible. For example, if you were to pick a random sample out of a\n", "100,000 data points, every time we ran the code we would get a different\n", "random sample. You would never be able to check the work of your\n", "classmates or ever attain the same results again. By using the same seed\n", "before conducting our analyses, we are able to reproduce the results we\n", "attained earlier.\n", "\n", "See [this\n", "reference](https://datasciencebook.ca/classification2.html#randomseeds)\n", "for more information on how the `set.seed()` function achieves\n", "reproducible randomness in R." ], "id": "bad47b21-6b20-4a19-87b2-5a893e67b3fb" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source(\"sampling_tests.r\") # load some background tests\n", "library(tidyverse)\n", "library(dplyr)\n", "library(infer)\n", "#install.packages(\"gridExtra\") #uncomment these to run the section\n", "#library(gridExtra)" ], "id": "9fad2860-d395-4af6-be40-67aa32cebf81" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "seed_number <- 1 # my seed number\n", "\n", "set.seed(seed_number) # set the random number seed to reproduce results" ], "id": "ccc3b63d-819a-4b5d-aa28-3053514f9bf4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following, you can try changing the seed number and seeing how\n", "some of the results are different. Remember that the default seed we\n", "used was `seed_number <- 1`.\n", "\n", "## Example 1: Simulating a Distribution\n", "\n", "Let’s now imagine that we want to create a hypothetical population with\n", "a variable called `x` of size 10,000, distributed normally with mean 0\n", "and standard deviation 1. Just to give an example, we can imagine `x` to\n", "represent “household indebtedness.” In our example, this variable is\n", "normally distributed, meaning that most households have close to no\n", "debt, but few households have an extreme amount of debt and few others\n", "have large investments with consistent cash flow (negative debt). Below,\n", "we will create an data frame `population`, using the `rnorm` function." ], "id": "fdad45a6-03da-43d4-872d-6d40b9c03795" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x <- rnorm(10000, 0, 1) # Draw 10,000 observations from the random normal distribution with mean 0, and variance 1.\n", "\n", "population <- data.frame(x) #store it in a data frame" ], "id": "e1e49cfa-0659-481a-bd7d-77ec933ab828" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s ensure that our function has worked as expected. We should have a\n", "population mean of 0, and standard deviation of 1. Additionally, we will\n", "visualize our distribution using a histogram, otherwise known as a\n", "**population distribution**. In this example, our population\n", "distribution should be normally distributed, centered around 0. We will\n", "visually depict the mean as a blue line on the histogram." ], "id": "3847b516-5963-48c5-b4e8-84d6f0e5ab6d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table <- population %>% \n", " summarize(mean = mean(x), sd = sd(x)) %>%\n", " round(digits = 2) #two decimal places\n", "\n", "table " ], "id": "6043d65d-1723-4a72-94c4-2fc36d232d25" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_dist <- population %>%\n", " ggplot(aes(x = x)) + \n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population$x), colour=\"blue\", linetype = \"solid\", lwd = 1)\n", "pop_dist" ], "id": "b8132077-0f15-43d4-af6d-4a2a8f06808b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we would like to estimate the population mean above,\n", "however, we do not have the time nor money to conduct a census of the\n", "population. Additionally, we **are** able to randomly sample everyone in\n", "the population with equal probability.\n", "\n", "To take a simple random sample in R, we can use the `slice_sample`\n", "function. Let’s start with a very small sample of size 5, and see how it\n", "compares to our population mean. We will select the observations from\n", "the data set by their number." ], "id": "d56ce318-ebe5-44f5-a717-aa997640d303" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_data <- population %>% \n", " slice_sample(n = 5) # 5 observations in our sample\n", " \n", "s_table <- sample_data %>% \n", " summarize(mean = mean(x), sd = sd(x)) %>%\n", " round(digits = 2) # two decimal places \n", "\n", "s_table " ], "id": "61e542bb-d773-4410-8f4d-49bb391a5d75" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the sample average is very different from our true population\n", "average. Because we have taken such a small sample, we have introduced\n", "bias into the estimation of our population parameter - this error is\n", "commonly referred to as **small sample bias**. To produce a more\n", "unbiased estimate of the population mean, we must take a larger sample.\n", "\n", "We will now take another simple random sample with a larger sample of\n", "size 30." ], "id": "1c6170e6-2668-45da-8bfe-935e8272f301" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_data <- population %>% \n", " slice_sample(n = 30, # 30 observations in our sample\n", " replace = TRUE) # with replacement\n", "\n", "s_table <- sample_data %>% \n", " summarize(mean = mean(x), sd = sd(x)) %>%\n", " round(digits = 2) # two decimal places\n", "\n", "s_table" ], "id": "7d62a629-ec17-4fcf-9098-32dfc05849a8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just by increasing our sample from n = 5, to n = 30, our estimate has\n", "gotten much closer to the population mean. We will soon see with even\n", "greater confidence that having a larger sample size gets us closer and\n", "closer to the true mean of the population.\n", "\n", "In this exercise, we have accomplished the following:\n", "\n", "1. Simulated a standard normally distributed population of size 10,000.\n", "2. Estimated the population mean, using the sample mean as an unbiased\n", " estimator.\n", "\n", "Next we will create a distribution of sample means through repeated\n", "random samples from our population.\n", "\n", "### Exercise 1: Simple random samples\n", "\n", "In this exercise:\n", "\n", "1. Create an object `a` which is a draw from the unifom distribution,\n", " with 5,000 observations between 10 and 20, using the random seed 20.\n", "2. Take a simple random sample of size 30 from our population, and\n", " calculate the sample mean and store it in an object `answer_1`" ], "id": "807a85be-dd22-482b-88d8-7139d98ffb81" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(...)\n", "\n", "a <- runif(... , ... , ...) # variable you are creating\n", "eg_data_frame <- data.frame(a)\n", "\n", "answer_1 <- eg_data_frame %>%\n", " slice_sample(n = ...) %>%\n", " summarize(mean = mean(...)) %>%\n", " pull() %>%\n", " round(3)\n", "\n", "answer_1 \n", "\n", "test_1()" ], "id": "49bce499-a98e-467b-9347-247df2b698c8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Repeated Random Samples and Sampling Distributions\n", "\n", "Remember that when we survey a sample, we are only observing a subset of\n", "the population and determining a sample estimate based off that sample.\n", "If we surveyed a different 1000 Canadian residents every time, we would\n", "certainly get a different estimate for the average height of a Canadian\n", "resident. Let’s assume for now that we can draw many samples of size $n$\n", "from our normal distribution population. Every time we draw a new\n", "sample, we will get a different estimate of the mean. Now, if we were to\n", "plot each of those estimates onto a histogram, we would get a **sampling\n", "distribution**. A sampling distribution of a statistic is a probability\n", "distribution based on repeated independent samples of size $n$ from a\n", "population of size $N$. Using the example introduced above, we can\n", "produce a distribution of sample means, based on repeated and\n", "independent simple random samples from our population. If we take enough\n", "samples, we will find that the mean of the sampling distribution will be\n", "nearly equal the mean of our population distribution (our population\n", "parameter). Amazing, isn’t it? With an infinite number of random\n", "samples, the mean of the sampling distribution will be exactly equal to\n", "the population parameter.\n", "\n", "We are more interested in the effect of increasing sample size. We will\n", "find that with larger and larger samples sizes, the standard error of\n", "the sampling distribution decreases. In other words, in a given sample\n", "of a larger sized $n$, the probability of having a sample estimate far\n", "away from the true population parameter decreases. This is a very\n", "important concept in statistical inference and econometrics.\n", "\n", "### Scenario 1: Population is normally distributed\n", "\n", "We will need to take $R$ independent simple random samples of size $n$\n", "from our `population` of size $N$, to produce our sampling distribution\n", "($R$ observations). To do this in R, we will need to take a simple\n", "random sample of size $n$, compute the sample mean, and store it in a\n", "vector. We will then repeat this $R$ times, appending each sample mean\n", "into our vector. Our vector, in this case, will represent the\n", "observations of our sampling distribution!\n", "\n", "We can define the **standard error** as the standard deviation of the\n", "sampling distribution. The standard error will inform us on how much our\n", "sample mean will vary from sample to sample. For this exercise, we will\n", "keep the number of samples constant and progressively increase our\n", "sample size to see how it affects the standard error of the sampling\n", "distribution.\n", "\n", "We will use the `rep_sample_n` function from the `infer` package in R.\n", "This function allows us to repeatedly take samples from a population.\n", "The `size` parameter indicates the sample size and the `reps` parameter\n", "indicates the number of samples we wish to draw from our population. The\n", "solid blue line represents the true mean of the normal distribution\n", "population (0) and the dashed red line shows us the mean of the sampling\n", "distribution. Given that we are taking many samples, these two values\n", "should be very similar.\n", "\n", "#### 1. Suppose we take 1000 simple random samples ($R = 1000$), with a sample size of 5 ($n = 5$):" ], "id": "44ac8ac2-21aa-4c7e-83c3-e578b51556d8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "reps_5_1000 <- population %>%\n", " rep_sample_n(size = 5, reps = 1000) %>% # creates 5 samples of size 5\n", " group_by(replicate) %>% # groups each of the samples \n", " summarize(mean = mean(x), sd = sd(x)) # calculates the mean of each sample\n", "\n", "#CALCULATES THE MEAN AND STANDARD ERROR OF THE SAMPLING DISTRIBUTION\n", "sample_5_mean_se <- reps_5_1000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_5_mean_se\n", "\n", "#VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "sampling_dist_5_1000 <- reps_5_1000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(reps_5_1000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "sampling_dist_5_1000" ], "id": "9cb8185f-4d6e-40fd-96d6-d242c2992832" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Suppose we take 1,000 simple random samples ($R = 1000$), with a sample size of 50 ($n = 50$):" ], "id": "1f966a7e-d555-4c34-86af-124f4e285bc2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "reps_50_1000 <- population %>%\n", " rep_sample_n(size = 50, reps = 1000) %>% # creates 1000 samples of size 5\n", " group_by(replicate) %>% # groups each of the samples \n", " summarize(mean = mean(x), sd = sd(x)) # calculates the mean of each sample\n", "\n", "# CALCULATES THE MEAN AND STANDARD ERROR OF THE SAMPLING DISTRIBUTION\n", "sample_50_mean_se <- reps_50_1000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_50_mean_se\n", "\n", "# VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "sampling_dist_50_1000 <- reps_50_1000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(reps_50_1000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "sampling_dist_50_1000" ], "id": "d0bc36cc-2849-4c59-a68c-41be08401a84" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Pro tip:** See how the scale of the x axis for $n = 50$ has adjusted\n", "> from $n = 5$ in response to the reduction in error?\n", "\n", "Now that we have increased the number of repeated and independent simple\n", "random samples from 5 to 1000, notice that the mean of our sampling\n", "distribution is much closer to our population mean. Let’s further\n", "increase our sample size to see how we can reduce the standard error of\n", "our estimate.\n", "\n", "#### 3. Suppose we take 1,000 simple random samples ($R = 1000$), with a sample size of 500 ($n = 500$):" ], "id": "6d1f3cbf-0061-4f06-ad32-bdcb71213138" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 1000 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "reps_500_1000 <- population %>%\n", " rep_sample_n(size = 500, reps = 1000, replace = FALSE) %>% #creates 1000 samples of size 5\n", " group_by(replicate) %>% #groups each of the samples \n", " summarize(mean = mean(x)) #calculates the mean of each sample\n", "\n", "# CALCULATES THE MEAN AND STANDARD ERROR OF THE SAMPLING DISTRIBUTION\n", "sample_500_mean_se <- reps_500_1000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_500_mean_se\n", "\n", "# VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "sampling_dist_500_1000 <- reps_500_1000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) +\n", " geom_vline(xintercept = mean(reps_500_1000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "sampling_dist_500_1000" ], "id": "03ea5bc2-4215-4a64-8b40-5e85ee2d221c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2: Taking Random Samples\n", "\n", "Let’s try it for ourselves. Take a sample of size 50 with 1000 reps from\n", "the same `population` data frame and produce a data frame with only the\n", "means. Assign your answer to the object `answer_2`." ], "id": "64421dd0-6093-4423-bfe7-86ccbb0cab71" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# FILL IN THE BLANKS\n", "set.seed(30)\n", "\n", "answer_2 <- population %>%\n", " rep_sample_n(... = ..., ... = ...) %>%\n", " group_by(...) %>%\n", " summarize(mean = ...(x))\n", "\n", "test_2()" ], "id": "273712f8-6809-4b7d-ab04-4ca9c72ba813" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the number of samples we drew from our population is constant,\n", "we saw that the mean of our sampling distribution was roughly equivalent\n", "to the true population mean for all of our examples. However, we varied\n", "our sample sizes. It’s important to notice that as we increased our\n", "samples size, the standard error decreased. As a reminder, having a low\n", "standard error means that the probability of having a sample produce a\n", "sample estimate that is far away from the true population parameter is\n", "very low. In the previous examples with a sample size of 500, we can see\n", "that the sample means are mostly between -0.1 and +0.1. With a sample\n", "size of 5, the sample means are mostly between -1 and 1.\n", "\n", "Having a low standard error is something we economists seek to have in\n", "our experiments. Because it’s often only possible to collect *one*\n", "sample in real life, we tend to collect as large a sample as possible to\n", "minimize the standard error. This gives us high confidence that our\n", "sample estimate will be close to the true population parameter.\n", "\n", "You may have wondered how economists calculate the standard error of a\n", "sample estimate using only one sample. We will get to that in the\n", "section about Bootstrapping later in this lesson.\n", "\n", "### Scenario 2: Population is not normally distributed\n", "\n", "Another important concept in statistics is that as the number of samples\n", "drawn from the population increases, the mean of the sampling\n", "distribution converges to the true value of the population parameter.\n", "Furthermore, by increasing the number of samples, the sampling\n", "distribution become a closer resemblance of a normal distribution curve.\n", "This is true regardless of the shape of the original population\n", "distribution - again here, it is just the *sampling distribution* that\n", "starts to look more like a normal curve, and not the *population\n", "distribution*.\n", "\n", "This exercise will be nearly identical to above, with the difference\n", "being our population will now be random uniformly distributed. Similar\n", "to the previous example, increasing the the sample size will reduce the\n", "standard error of the sampling distribution. For this exercise, we will\n", "see how varying the samples taken affects the sampling distribution. We\n", "can achieve this in R using the `runif()` function to create a uniform\n", "distribution." ], "id": "3c0ace5f-f36b-403a-bb41-94ad39350037" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y <- runif(10000, 0, 1) # Draw 10,000 observations from the random uniform distribution between 0 and 1\n", "\n", "population_unif <- data.frame(y)\n", "\n", "unif_pop_dist <- population_unif %>%\n", " ggplot(aes(x = y)) + \n", " geom_histogram(bins = 20, color = \"black\", fill = \"gray\")\n", "\n", "unif_pop_dist\n", "\n", "unif_pop_mean_se <- population_unif %>%\n", " summarize(true_mean = mean(y))\n", "\n", "unif_pop_mean_se" ], "id": "beae4790-344a-4616-b5d4-fa9477803203" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Suppose we take 100 simple random samples (`reps` = 100), with a sample size of 100 (`size` = 100):" ], "id": "dc110ee8-cdb9-46a8-a014-ca4fc8a64993" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 100 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "unif_reps_100_100 <- population_unif %>%\n", " rep_sample_n(size = 100, reps = 100) %>% #creates 5 samples of size 5\n", " group_by(replicate) %>% #groups each of the samples \n", " summarize(mean = mean(y)) #calculates the mean of each sample\n", "\n", "# MEAN AND STANDARD ERROR OF SAMPLING DIST\n", "sample_100_mean_se <- unif_reps_100_100 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_100_mean_se\n", "\n", "# VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "unif_sampling_dist_100_100 <- unif_reps_100_100 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(unif_reps_100_100$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "unif_sampling_dist_100_100" ], "id": "2447d964-fc9e-4150-b51d-a0ac3cb109d2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Suppose we take 1,000 simple random samples ($R = 1000$), with a sample size of 5 ($n = 5$):" ], "id": "1e666f55-fad3-4640-a7ae-71938f633017" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 100 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "unif_reps_100_1000 <- population_unif %>%\n", " rep_sample_n(size = 100, reps = 1000) %>% # creates 5 samples of size 5\n", " group_by(replicate) %>% # groups each of the samples \n", " summarize(mean = mean(y)) # calculates the mean of each sample\n", "\n", "# MEAN AND STANDARD ERROR OF SAMPLING DIST\n", "sample_1000_mean_se <- unif_reps_100_1000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_1000_mean_se\n", "\n", "# VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "unif_sampling_dist_100_1000 <- unif_reps_100_1000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(unif_reps_100_1000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "unif_sampling_dist_100_1000" ], "id": "b00e81ac-f5f4-40bc-80f2-61a0b03692a9" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. Suppose we take 20,000 simple random samples (`reps` = 20,000), with a sample size of 500 (`size` = 500\\$):" ], "id": "0935eb37-7035-41c6-a3f7-fde37a8ec699" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CREATES A DATA FRAME WITH ALL 100 RANDOM SAMPLES, AND CALCULATES MEAN AND STANDARD DEVIATION OF EACH SAMPLE \n", "unif_reps_100_20000 <- population_unif %>%\n", " rep_sample_n(size = 100, reps = 20000) %>% #creates 5 samples of size 5\n", " group_by(replicate) %>% #groups each of the samples \n", " summarize(mean = mean(y)) #calculates the mean of each sample\n", "\n", "# MEAN AND STANDARD ERROR OF SAMPLING DIST\n", "sample_20000_mean_se <- unif_reps_100_20000 %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "sample_20000_mean_se\n", "\n", "#VISUAL DEPICTION OF SAMPLING DISTRIBUTION\n", "unif_sampling_dist_100_20000 <- unif_reps_100_20000 %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(unif_reps_100_20000$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) \n", "\n", "unif_sampling_dist_100_20000" ], "id": "e6c47dcd-3571-490c-9876-c98ab260d851" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the difference between the mean of the sampling distribution\n", "and the true value of the population parameter become closer and closer\n", "as we increase the number of samples drawn. As mentioned above, we can\n", "also see that the sampling distribution curve more strongly resembles a\n", "normal distribution curve as we increase the number of samples drawn -\n", "pretty neat! This is true regardless of the fact that we began with a\n", "population that was uniformly distributed." ], "id": "0fb4d3a9-4be9-469e-b0d1-1e298e06f19c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SIZE <- 500\n", "REPS <- 2000\n", "\n", "pop_dist_eg <- population %>%\n", " rep_sample_n(size = SIZE, reps = REPS) %>% # CHANGE PARAMETERS HERE\n", " group_by(replicate) %>% \n", " summarize(mean = mean(x)) %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) \n", "\n", "unif_dist_eg <- population_unif %>%\n", " rep_sample_n(size = SIZE, reps = REPS) %>% # CHANGE PARAMETERS HERE\n", " group_by(replicate) %>% \n", " summarize(mean = mean(y)) %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) \n", "\n", "grid.arrange(pop_dist_eg, unif_dist_eg, ncol = 2)" ], "id": "bec0889e-23aa-4217-b8e1-aa3aa6c780ec" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above graph, we fixed the sample size and number of samples for\n", "both the normal distribution and uniform distribution examples, and we\n", "attained two sampling distributions that are strikingly similar to a\n", "normal distribution curve. You have just witnessed something magical in\n", "statistics ✨! Regardless of the initial population distribution, as we\n", "increase our sample size and the number of simple and independent simple\n", "random samples, our sampling distribution converges to our population\n", "mean! Try it out yourself by increasing `SIZE` and `REPS` to see how\n", "higher values make the distribution become tighter around the true\n", "population mean.\n", "\n", "This converging of the sample mean to a normal distribution is referred\n", "to as the **Central Limit Theorem** in statistics.\n", "\n", "### Exercise 3: Taking Samples and Visualizing the Sampling Distribution\n", "\n", "Use the `population_unif` data frame and draw 500 samples of sample size\n", "50. Then create a visualization of the sampling distribution. Finally,\n", "calculate the standard error of this sampling distribution and assign it\n", "to the object `answer_3`. Remember the `sd` function calculates the\n", "standard deviation of a column. Template code has been provided." ], "id": "26ecec9c-7f1e-40d8-a823-92657b0e9af8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(40) # don't change this\n", "\n", "eg_3 <- ... %>%\n", " ...(... = ..., ... = ...) %>%\n", " ...(...) %>%\n", " summarize(mean = mean(y)) \n", "\n", "eg_3_graph <- eg_3 %>%\n", " ggplot(...(x = mean)) + \n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_unif$y), colour=\"blue\", linetype = \"solid\", lwd = 1) \n", "\n", "answer_3 <- eg_3 %>%\n", " summarize(se = ...(mean)) %>%\n", " pull() %>%\n", " round(5)\n", "\n", "test_3()" ], "id": "de64c38f-e5de-4744-88c2-b2902ad5870b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4: Understanding Standard Error\n", "\n", "What does the standard error tell us?\n", "\n", "**A**: the average variation in the population distribution \n", "**B**: the average variation in our sample estimates of the population\n", "parameter \n", "**C**: the probability that our sample estimate is correct \n", "**D**: exactly what the sample estimate tells us\n", "\n", "Answer in capital letters surrounded by quotation marks. eg… `\"E\"`" ], "id": "feee32c2-eb73-44cb-986e-11071758be51" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_4 <- \"...\"\n", "\n", "test_4(answer_4)" ], "id": "de2a8206-2c41-4fd4-994c-04a3ab560d00" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quick Review\n", "\n", "We just learned a lot of new terms which all sound very similar, so\n", "let’s take a moment to review. The **population** is the collection of\n", "all individuals or entities we are interested in studying. From the\n", "above example, the population is entire population of Canada. From the\n", "population, we are interested in knowing a particular quantity, the\n", "**population parameter**. From our example, this is the height.\n", "\n", "A **simple sample** is a small subset of the population which we use to\n", "get an estimate of the parameter we are interested. This is the **sample\n", "estimate** for the population parameter. Because we are often unable to\n", "conduct a comprehensive census in the real world, we have to use the\n", "sample estimate as an estimate for the population parameter.\n", "\n", "In a hypothetical world, we assume that we are able to take many samples\n", "of size `n` from the same population and attain a sample estimate from\n", "all of those samples. When we plot those sample estimates on a\n", "histogram, we get a **sampling distribution**. When we take many\n", "samples, we find that the mean of the sampling distribution (the most\n", "frequently observed estimate of the population parameter) is very close\n", "to the actual value of the population parameter.\n", "\n", "We have also noticed that as we increase sample size, our standard error\n", "decreases. When we increase the number of samples drawn from the\n", "population, the mean of the sampling distribution progresses closer and\n", "closer to the population parameter and our sampling distribution has a\n", "closer resemblance to a normal distribution curve.\n", "\n", "## Central Limit Theorem (CLT) and the Law of Large Numbers (LLN)\n", "\n", "The **Central Limit Theorem** tells us that no matter the distribution\n", "of the population, with sufficient sample size (usually above 30 and no\n", "more than 5% of the population), the distribution of sample means will\n", "tend toward normal.\n", "\n", "> Note: When sampling more than 5% of a population, we should use the\n", "> Finite Population Correct Factor (FPC). This is because the Central\n", "> Limit Thereom does not hold, and the standard error will be too large:\n", "> FPC = $((N-n)/(N-1))^{1/2}$\n", "\n", "The **Law of Large Numbers** tells us that with a sufficiently large\n", "number of repeated and independent random samples, the mean of our\n", "distribution of sample means will tend toward the population mean!\n", "Notice the visualization of our sample mean as we increase $R$ in the\n", "example below!" ], "id": "a697289c-aa40-4c1f-bd14-d6793ca61653" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(55)\n", "\n", "# RUN THIS CODE. \n", "\n", "r = 1000 # Number of iterations (number of sample means produced for our distribution)\n", "num = 10 # Sample size\n", "R = 1:r # Range of iterations\n", "means <- vector(,r) # Create empty vector to store sample means\n", "sample_mean <- vector(,r)\n", "\n", "for (val in R) { # loop over all values in R \n", " means[val] = mean(slice_sample(population_unif, n = num)$y) # Take a sample of size n from the population, caluclate the mean, and store in vector means, with index of the current iteration\n", " sample_mean[val] <- mean(means[1:val]) # Running value of sample mean at index equal to current iteration\n", "}\n", "\n", "options(scipen = 10)\n", "plot(sample_mean, cex=0.5, col=\"darkblue\")\n", "abline(h = mean(population_unif$y))" ], "id": "58bd6d03-a492-4cd8-98f7-0e66a1215d7a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "See how many many iterations (in this case, 1000) eventually produce a\n", "relatively stable line around 0.5 for this example?\n", "\n", "### INTERACTIVE EXERCISE DEMONSTRATING CLT\n", "\n", "In this exercise, we will create a new simulation for you to play around\n", "with $R$ and $n$ to get a better understanding of how sampling\n", "distribution changes with sample size and the number of repeated\n", "independent random samples. You can alter the values by changing the\n", "`SIZE` and `REPS` constants at the top of the code block. Try different\n", "values and see for yourself how the shape of the sampling distribution\n", "changes.\n", "\n", "Note: You can also try different distributions of our population, try\n", "the following: `rchisq(n, df)`, `rexp(n, rate)`, `rf(n, df1, df2)`" ], "id": "d5285d5c-a86c-4a64-8e06-87ae13f5ee0d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population_ex <- rexp(10000, 0.2) # Draw 10,000 observations from the random uniform distribution between 0 and 1\n", "population_ex <- data.frame(x)\n", "pop_dist_int <- population_ex %>% \n", " ggplot(aes(x = x)) +\n", " geom_histogram(bins = 50) +\n", " geom_vline(xintercept = mean(population_ex$x), colour=\"blue\", linetype = \"solid\", lwd = 1) \n", "pop_dist_int" ], "id": "3ccb89ae-0fb2-463e-975d-d083f722c839" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SIZE = 50 # change me here\n", "REPS = 5000 # change me here\n", "\n", "population_sample_estimates <- population_ex %>%\n", " rep_sample_n(size = SIZE, reps = REPS) %>%\n", " group_by(replicate) %>%\n", " summarize(mean = mean(x))\n", "\n", "pop_mean_se <- population_sample_estimates %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean))\n", "\n", "pop_mean_se \n", "\n", "samp_dist_int <- population_sample_estimates %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = mean(population_ex$x), colour=\"blue\", linetype = \"solid\", lwd = 1) +\n", " geom_vline(xintercept = mean(population_sample_estimates$mean), colour=\"red\", linetype = \"dashed\", lwd = 1)\n", "\n", "samp_dist_int" ], "id": "8db895fd-0768-4038-a7b7-68d5e34e5fab" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bootstrapping\n", "\n", "From the previous examples, it was quite a process to calculate the\n", "standard error. We had to first have access to the population and draw\n", "hundreds of samples from that population to make a sampling\n", "distribution. This is impossible for multiple reasons (1) we almost\n", "never access to the entire population (2) drawing two samples of 10\n", "individuals is the same as drawing one sample of 20. Naturally, you may\n", "have been wondering how economists calculate the standard error of their\n", "sample estimates. One common technique is **bootstrapping** samples.\n", "\n", "In the bootstrapping process, we start off by taking one sample of size\n", "$n$ from our population. We then take a sample of size $n$ from our\n", "sample, with replacement. Let’s imagine we have a paper bag with 5 red\n", "balls and 5 green balls, which we will treat as our sample. In order to\n", "draw a bootstrap sample, we pick 1 ball out of the bag ten times while\n", "allowing replacement. We take out the first ball, note whether it’s a\n", "red or green ball, then put it back in the bag and repeat the process 9\n", "more times to arrive at our bootstrap sample. Note that with this\n", "method, it is entirely possible to draw any combination of red and green\n", "balls. The bootstrap sampling method introduces sampling variability to\n", "our sample.\n", "\n", "If we repeat this process by drawing many bootstrap samples, we can\n", "arrive at a bootstrap distribution which serves as an approximation of\n", "our sampling distribution.\n", "\n", "### Example\n", "\n", "Let’s demonstrate this using our normal distribution data frame\n", "`population`. We first start off by taking a sample from our population.\n", "Once again we use the `slice_sample` function to achieve this. To\n", "connect it back to the average height example, imagine that we had\n", "resources to sample 100 people about their height out of the entire\n", "Canadian population." ], "id": "14a57287-7d26-438a-bf67-42ebc8b5bfc1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boot_main_sample <- population %>%\n", " slice_sample(n = 100)\n", "\n", "head(boot_main_sample)\n", "paste(\"This is the mean of our sample: \", mean(boot_main_sample$x))" ], "id": "87361f1a-1632-4248-9b28-6ab2da8a83e3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we want to know the uncertainty behind this estimate. Thinking back\n", "to our sampling distribution, how would we know if our sample estimate\n", "of 0.008 is close to the true population parameter. It’s entirely\n", "possible we got a sample estimate far off from our true population\n", "parameter - for example, we could have gotten a value in the tails of\n", "the sampling distribution. To begin to understand the uncertainty around\n", "our sample estimate, we must compute the standard error.\n", "\n", "Let’s draw bootstrap samples from our sample. The code is largely\n", "similar from before, except the key difference that we now add\n", "`replace = TRUE` to our `rep_sample_n` function. This tells R to allow\n", "replacement when taking samples." ], "id": "ac2210e6-ccd3-45df-96e5-0d6b3e2e8ab8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boot_samples <- boot_main_sample %>%\n", " rep_sample_n(size = 500, reps = 1000, replace = TRUE)\n", "\n", "head(boot_samples)" ], "id": "6f1f9667-9e13-416a-950f-9eca15e2a351" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our bootstrap samples, let’s calculate the bootstrap\n", "sample means and make a bootstrap sampling distribution. We use the same\n", "methods as we used above." ], "id": "51d6fb0a-c4dc-474e-ae76-3989cb0384c1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boot_means <- boot_samples %>%\n", " group_by(replicate) %>%\n", " summarize(mean = mean(x))\n", "\n", "boot_mean_se <- boot_means %>%\n", " summarize(sd_mean = mean(mean), se = sd(mean)) \n", "\n", "boot_mean_se\n", "\n", "boot_sampling_dist <- boot_means %>%\n", " ggplot(aes(x = mean)) +\n", " geom_histogram(bins = 50, color = \"black\", fill = \"gray\") + \n", " geom_vline(xintercept = 0, colour=\"blue\", linetype = \"solid\", lwd = 1) + \n", " geom_vline(xintercept = mean(boot_means$mean), colour=\"red\", linetype = \"dashed\", lwd = 1) +\n", " ggtitle(\"Boostrap Sampling Distribution\")\n", "\n", "grid.arrange(boot_sampling_dist, sampling_dist_500_1000 + ggtitle(\"Sampling Distribution\"))" ], "id": "62dca170-4f5b-4fa0-8307-ce5e0cab9bc0" }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many things to unpack here. Notice from the graphs that our\n", "bootstrap sampling distribution visually serves as a good approximation\n", "of our sampling distribution. We can also see that the standard error\n", "for the bootstrap sampling distribution (0.00402) is very close to our\n", "sampling distribution standard error, which was (0.0433).\n", "\n", "The key difference in the graphs is the mean of the distributions. The\n", "mean of the sampling distribution is nearly equivalent to the population\n", "parameter. However, the mean of the bootstrap distribution will be the\n", "value of the sample estimate from the first sample we took from the\n", "population. This is the reason why the bootstrap distribution sits to\n", "the right of the sampling distribution.\n", "\n", "The key takeaway here is that the standard error of the bootstrap\n", "sampling distribution serves as a great approximation of the standard\n", "error of the sampling distribution.\n", "\n", "This gives us a solid foundation to learn about confidence intervals and\n", "hypothesis testing in the coming weeks.\n", "\n", "### Exercise 5: Taking Bootstrap Samples\n", "\n", "For this exercise, we have taken a sample from the `population` data\n", "frame and assigned it to the object `boot_eg_sample` of sample size 50.\n", "You are going to make a data frame of bootstrap sample estimates with\n", "1000 reps. Afterwards, you will assign the standard error of the\n", "bootstrap sampling distribution to `answer_5`. We have provided a code\n", "template below." ], "id": "e7937d3f-d924-4c8c-b1d6-42ffcc5fec6a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(60)\n", "\n", "boot_eg_sample <- ... %>%\n", " slice_sample(n = ...)\n", "\n", "boot_eg_sample_est <- boot_eg_sample %>%\n", " ...(... = ..., ... = ..., replace = ...) %>%\n", " ...(...) %>%\n", " summarize(mean = ...(x))\n", "\n", "answer_5 <- boot_eg_sample_est %>%\n", " summarize(se = ...(mean)) %>%\n", " pull() %>%\n", " round(3)\n", "\n", "test_5()" ], "id": "48decf7a-4478-425e-a3d3-0743719ba565" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }