{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.2 - Beginner - Dispersion and Dependence\n", "\n", "COMET Team
*Oliver (Junye) Xu, Anneke Dresselhuis, Jonathan\n", "Graves* \n", "2023-01-12\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Jupyter\n", "- Introduction to R\n", "- Introduction to Visualization\n", "- Central Tendency\n", "- Distribution\n", "\n", "### Outcomes\n", "\n", "This notebook explains the concepts of dispersion and dependence. After\n", "completing this notebook, you will be able to:\n", "\n", "- Understand and interpret measures of dispersion, including variance\n", " and standard deviation\n", "- Understand and interpret measures of dependence, including\n", " covariance and correlation\n", "- Investigate, compute, and interpret common descriptive statistics\n", "- Create summary tables for variables, including qualitative variables\n", "- Parse summary statistics for meaning\n", "\n", "### References\n", "\n", "- [Introductory\n", " Statistics](https://openstax.org/books/introductory-statistics/pages/2-7-measures-of-the-spread-of-the-data)\n", "\n", "# Introduction\n", "\n", "In this notebook, we will continue learning about how to use descriptive\n", "statistics to represent sets of data. We’ve already seen how to compute\n", "measures of central tendency and determine which measures are\n", "appropriate for given situations. We’ll now focus on computing measures\n", "of dispersion and dependence in order to better understand both the\n", "variation of variables, as well as relationships between variables in a\n", "data set. We’ll dedicate time to both measures, but we’ll look at\n", "dispersion first. Let’s first import our familiar 2016 Census data set\n", "from Statistics Canada." 
], "id": "fa8542d2-6d02-46d6-a399-ad4e33222e09" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source(\"beginner_dispersion_and_dependence_tests.r\")\n", "\n", "# load packages\n", "library(tidyverse)\n", "library(haven)\n", "library(ggplot2)\n", "\n", "# Reading in the data\n", "census_data <- read_dta(\"../datasets_beginner/01_census2016.dta\")" ], "id": "0b5eba44-bf21-4931-943e-af53e87fad87" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Understanding Measures of Dispersion\n", "\n", "Measures of dispersion describe the spread of data, that is, how far\n", "apart the values of a variable in a data set tend to be. Common\n", "measures of dispersion which we’ll look at include the range,\n", "interquartile range, standard deviation and variance.\n", "\n", "## Range and Interquartile Range\n", "\n", "- **Range**: Difference between the maximum and minimum value that a\n", " variable takes on\n", "\n", "- **Interquartile range**: Difference between the 75th and 25th\n", " percentile values. We can use functions like `quantile()` and\n", " `fivenum()` to calculate these statistics quite quickly.\n", "\n", "> Both functions return similar output: a vector of five summary\n", "> values. By default, these are the minimum, 25th percentile, 50th\n", "> percentile (median), 75th percentile, and maximum values\n", "> (`fivenum()` computes the quartiles slightly differently, so the two\n", "> can differ a little in small samples). 
In this way, these commands allow us to see both the spread of\n", "> the middle 50% of data around the median **(interquartile range),**\n", "> and the spread of the data in its entirety **(range).**\n", "\n", "## Variance\n", "\n", "The **variance** is the average of the squared differences from the\n", "mean.\n", "\n", "- Small variance: observations tend to fall close to the mean.\n", "\n", "- High variance: observations are very spread out from the mean.\n", "\n", "The formula for the sample variance is:\n", "\n", "$$\n", "s_{x}^2 = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})^2}{n - 1}\n", "$$\n", "\n", "The formula for the variance in a population is:\n", "\n", "$$\n", "\\sigma_{x}^2 = \\int(x - \\mu)^2 f(x) dx\n", "$$\n", "\n", "# Standard Deviation\n", "\n", "The **standard deviation** is the square root of the variance. It also\n", "measures dispersion around the mean, similar to the variance. For a\n", "sample this is:\n", "\n", "$$\n", "s_{x} = \\sqrt{s_{x}^2} = \\sqrt{\\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})^2}{n - 1}}\n", "$$\n", "\n", "For the population:\n", "\n", "$$\n", "\\sigma_{x} = \\sqrt{\\sigma_{x}^2}\n", "$$\n", "\n", "> For example, a normal distribution with `mean = 30` and `sd = 5` is\n", "> exactly the same thing as a normal distribution with `mean = 30` and\n", "> `variance = 25`.\n", "\n", "We usually use standard deviation rather than variance. This is because\n", "variance is measured in *squared units* of the original variable, while\n", "the standard deviation has the *same units* as the original variable.\n", "\n", "> **Advanced Note**: In econometrics, we use samples to estimate\n", "> population parameters. 
Some samples have more information than others\n", "> about the population.\n", ">\n", "> For example, an estimate of the population variance based on a sample\n", "> size of 100 certainly has more information than one based on a sample\n", "> size of 10.\n", ">\n", "> We measure this using the **degrees of freedom** of an **estimate**.\n", ">\n", "> - The degrees of freedom for our estimate of variance (sample\n", "> variance) is equal to $n - 1$.\n", "\n", "In R, we use the `var()` function to calculate the variance of a\n", "variable, and the `sd()` function for the standard deviation." ], "id": "f28257c3-a2b8-4764-8149-351081de923a" }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# calculate the variance of wages\n", "variance <- var(census_data$wages, na.rm = TRUE)" ], "id": "26976ddd-d9be-4d89-85ef-f7a932f775e0" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can illustrate the relationship between standard deviation and\n", "variance by taking the `sqrt()` of the variance:" ], "id": "f44295f2-c3f2-4509-a0eb-cf31b764febd" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fill in the ... with your code below to find the sd of wages\n", "answer_6 <- ...(var(census_data$wages, na.rm = TRUE))\n", "test_6()" ], "id": "06d6eb54-d4cf-4769-a7f3-601daa143b2d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fill in the ... 
with your code below to find the sd of wages\n", "answer_6 <- sqrt(var(census_data$wages, na.rm = TRUE))\n", "test_6()" ], "id": "3f1d53f6-73ee-4e2d-8714-7f34a3375156" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# recall the mean of wages\n", "mean(census_data$wages, na.rm = TRUE) # remember that we need to remove all NA values, otherwise R returns NA for our summary statistics!\n", "\n", "# calculate the standard deviation of wages\n", "sd(census_data$wages, na.rm = TRUE) " ], "id": "cf9aff72-49eb-49f5-bb44-eb1171879825" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interpreting Variation\n", "\n", "Let’s say we’re interested in understanding how the `wages` variable is\n", "dispersed around the mean.\n", "\n", "In the example above, we have a pretty large standard deviation, even\n", "larger than our mean! This tells us that the wages of Canadians in the\n", "data set typically lie about \\$64275.27 away from the\n", "mean of \\$54482.52.\n", "\n", "This large standard deviation tells us that there is high variation in\n", "wages and that some of them are very far from the mean. This can be for\n", "many reasons, but one possibility is that we have **outliers** in the\n", "data set: extreme values of wages. This is common for wage distributions\n", "in the presence of income inequality.\n", "\n", "> *General rule*: the standard deviation is small when the data are all\n", "> concentrated close to the mean, while the standard deviation is large\n", "> when the data are spread out away from the mean.\n", "\n", "# Empirical Rule\n", "\n", "Recall, from the *Central Tendency* notebook, that many variables\n", "approximately follow a **normal distribution**. For a variable with\n", "values distributed in this way, there is a rule used when discussing\n", "their standard deviation. 
This is called the **68-95-99.7 rule** or\n", "**Empirical Rule**:\n", "\n", "- About 68% of the values are within 1 standard deviation of the mean\n", "- About 95% are within 2 standard deviations of the mean\n", "- About 99.7% are within 3 standard deviations of the mean\n", "- The remaining values are extreme and incredibly rare\n", "\n", "This gives us a helpful frame of reference when discussing the standard\n", "deviation of a variable. Although we already saw that the `wages`\n", "variable follows a relatively skewed distribution, imagine a variable\n", "that doesn’t.\n", "\n", "For example: test scores. If the mean score on a test is 70 and the\n", "standard deviation is 10, this tells us that approximately 68% of\n", "students who wrote that test earned a score between 60 and 80 (1\n", "standard deviation), approximately 95% earned a score between 50 and 90\n", "(2 standard deviations) and virtually everyone earned a score between 40\n", "and 100 (3 standard deviations).\n", "\n", "# Understanding Measures of Dependence\n", "\n", "Measures of **dependence** describe relationships between variables.\n", "The two most common are *covariance* and *correlation*.\n", "\n", "# Covariance\n", "\n", "**Covariance** is a measure of the direction of a relationship between\n", "two variables.\n", "\n", "- Positive covariance: two variables are positively related.\n", " - When one variable goes up, the other goes up, and vice versa.\n", "- Negative covariance: two variables are negatively related.\n", " - When one variable goes up, the other goes down, and vice versa.\n", "\n", "This is similar to the idea of variance, but where variance measures how\n", "a *single* variable varies, covariance measures how *two* variables vary\n", "together. They also have similar formulas.\n", "\n", "Sample Covariance:\n", "\n", "$$\n", "s_{x,y}=\\frac{\\sum_{i=1}^{n}(x_{i}-\\bar{x})(y_{i}-\\bar{y})}{n-1}\n", "$$\n", "\n", "Population Covariance:\n", "\n", "$$\n", 
"\\sigma_{x,y}=\\int\\int(x-\\mu_x)(y-\\mu_y)f(x,y)dxdy\n", "$$\n", "\n", "This is tedious to calculate, especially for large samples. In R, we can\n", "use the `cov()` function to calculate the covariance between two\n", "variables. Let’s say we’re interested in exploring the covariance\n", "between the `wages` variable and `mrkinc` variable in the dataset." ], "id": "d7c688c3-d092-4621-8c29-c435f179f539" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cov() function requires use=\"complete.obs\" to remove NA entries\n", "cov(census_data$wages, census_data$mrkinc, use=\"complete.obs\") " ], "id": "ab07b2e7-527c-49c9-85be-42902c07e3ac" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The calculated covariance between the `wages` variable and `mrkinc`\n", "variable in the dataset is positive, indicating the two variables are\n", "positively related. As one variable changes, the other variable tends\n", "to change in the same direction.\n", "\n", "Let’s try computing the covariance “by hand” to understand how the\n", "formula really works. To simplify the process, we will construct a\n", "hypothetical data set with variables $x$ and $y$." 
], "id": "8216677e-2794-4898-b268-6c1e91bd9f2d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x <- c(6, 8, 10)\n", "y <- c(25, 100, 125)" ], "id": "f6a3ee55-5722-44ff-bf28-5a592f81427a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1. Take the difference between each value and the mean of its variable\n", "# 2. Multiply the paired differences\n", "# 3. Sum the products\n", "# 4. Divide by one less than the sample size\n", "sum((x - mean(x))*(y - mean(y)))/(length(x) - 1)" ], "id": "c9dd3cb9-a2ff-4f35-8f0f-f9531cb1b67c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Confirming the previous calculation\n", "cov(x,y)" ], "id": "aa00842b-d953-4f09-bc4f-0e119598af1f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interpreting covariances directly is difficult because the size of the\n", "covariance depends on the scale of $x$ and $y$. Repeat the preceding\n", "calculation, but with variables that are 10x as large. What do you see?" ], "id": "18cf3581-23fa-4e45-8e47-702bfae5f810" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x <- c(60, 80, 100)\n", "y <- c(250, 1000, 1250)\n", "\n", "cov(x,y)" ], "id": "03fc7c24-8392-40ee-8178-e18f5c49ac3c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The covariance is now 100 times larger, even though the relationship\n", "between the two variables has not changed. The solution to this problem\n", "is the next topic: correlation.\n", "\n", "# Correlation\n", "\n", "A **correlation coefficient** measures the relationship between two\n", "variables. 
It tells us whether two variables move in the same\n", "direction (positive correlation), in opposite directions\n", "(negative correlation), or have no linear relationship (no\n", "correlation).\n", "\n", "> **Note**: even though a covariance or correlation may be zero, this\n", "> does not mean that there is no relationship between the variables:\n", "> this only means that there is no *linear* relationship.\n", "\n", "Correlation fixes the scale problem with covariance by standardizing\n", "covariance to a scale of -1 to 1. It does this by dividing the\n", "covariance by the product of the two standard deviations. The most\n", "popular version is **Pearson’s correlation coefficient**, which is\n", "calculated as follows in a sample.\n", "\n", "$$\n", "r_{x,y} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})(y_i - \\overline{y})}{\\sqrt{\\sum_{i=1}^{n} (x_i - \\overline{x})^2 \\sum_{i=1}^{n}(y_i - \\overline{y})^2}}=\\frac{s_{x,y}}{s_{x} s_{y}}\n", "$$\n", "\n", "Once again, let’s try to compute the correlation “by hand” using the\n", "formula." 
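, "\n", "\n", "As a rough worked sketch of the arithmetic (values rounded to two\n", "decimal places), take the rescaled data from the previous code cell,\n", "$x = (60, 80, 100)$ and $y = (250, 1000, 1250)$, so that\n", "$\\overline{x} = 80$ and $\\overline{y} = 2500/3 \\approx 833.33$:\n", "\n", "$$\n", "r_{x,y} = \\frac{(-20)(-583.33) + (0)(166.67) + (20)(416.67)}{\\sqrt{(400 + 0 + 400)(340277.78 + 27777.78 + 173611.11)}} \\approx \\frac{20000}{20816.66} \\approx 0.96\n", "$$\n", "\n", "The second form of the formula gives the same answer: here\n", "$s_{x,y} = 20000/2 = 10000$, $s_{x} = 20$ and $s_{y} \\approx 520.42$,\n", "so $r_{x,y} \\approx 10000/(20 \\times 520.42) \\approx 0.96$. The\n", "unscaled values $x = (6, 8, 10)$ and $y = (25, 100, 125)$ give\n", "$200/208.17 \\approx 0.96$ as well, because the factor of 10 cancels\n", "between the numerator and the denominator."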
], "id": "bd6bd5ce-11ef-4dcc-9c9a-ebf8c5632abd" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First form of the formula: sums of deviations from the means\n", "numerator <- sum((x - mean(x))*(y - mean(y)))\n", "denominator <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))\n", "numerator/denominator" ], "id": "77229a48-ce86-44dc-b5f1-e658f31adfac" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Second form of the formula: covariance over the product of the standard deviations\n", "numerator <- cov(x,y)\n", "denominator <- sd(x) * sd(y)\n", "numerator/denominator" ], "id": "526d1afa-0fd5-4b7c-ad24-f525a1702417" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In R, we can use the `cor()` function to calculate the correlation\n", "between two variables:" ], "id": "1d9c22e6-fe26-4e96-815e-fb21a2c77d56" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Confirming the previous calculation\n", "cor(x,y)" ], "id": "b44e510b-9054-40b3-8b19-9ee37a35537b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "To calculate the correlation between the `wages` variable and `mrkinc`\n", "variable in the dataset:" ], "id": "77f57a5f-1620-46ac-9162-d8a5e2b250eb" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cor() function requires use=\"complete.obs\" to remove NA entries\n", "cor(census_data$wages, census_data$mrkinc, use=\"complete.obs\") " ], "id": "2ed1e5fd-676f-4c09-ab07-2936f1fd8b92" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have the number 0.8898687 $\\approx$ 0.89 as our correlation\n", "coefficient. What does it really mean?\n", "\n", "A correlation coefficient ranges from -1 to 1, which tells us two\n", "things:\n", "\n", "1. The direction of the relationship between the 2 variables.\n", "\n", "- A negative correlation coefficient means that two variables evolve\n", " in opposite directions. 
If one variable increases, the other decreases,\n", " and vice versa.\n", "\n", "- A positive correlation implies that the two variables evolve in the\n", " same direction, that is, if one variable increases the other also\n", " increases and vice versa.\n", "\n", "2. The strength of the relationship between the 2 variables.\n", "\n", "- The more extreme the correlation coefficient (the closer to -1 or\n", " 1), the stronger the relationship. The less extreme the correlation\n", " coefficient (the closer to 0), the weaker the relationship.\n", "- Two variables are uncorrelated if the correlation coefficient is\n", " close to 0. As one variable increases, there is no tendency in the\n", " other variable to either decrease or increase.\n", "\n", "> **Test your knowledge**:\n", ">\n", "> True or False? The correlation can measure linear relationships but\n", "> the covariance can’t." ], "id": "c43dfc8c-4ec9-4778-bb4e-011d272c9ede" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_5 <- \"...\" # enter True or False\n", "\n", "test_5()" ], "id": "a459fb69-9b13-4920-a018-1dc061303934" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also visualize correlation with a scatter plot, using the\n", "`ggplot()` function." ], "id": "23c8eee8-50a2-4965-a24c-48298f972483" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplot(census_data, aes(x = mrkinc, y = wages)) +\n", " geom_point(shape = 1)" ], "id": "c94e29a3-8662-4dd3-b822-c0b4eee7b235" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding a trend line to the scatter plot helps us interpret the\n", "direction of the relationship between two variables. We can do it via\n", "the `geom_smooth()` function with the `method = lm` argument, which fits\n", "a straight line through the points and makes the overall pattern easier\n", "to see when many points overlap. 
You will learn\n", "more about how trend lines are mathematically formulated in advanced\n", "econometrics classes." ], "id": "84a66709-c10d-44f5-947d-29611a15805b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplot(census_data, aes(x = mrkinc, y = wages)) +\n", " geom_point(shape = 1) +\n", " geom_smooth(method = lm)" ], "id": "4ef0b9f1-430a-49cd-add1-9c82cd01d09a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can see the apparent positive correlation!\n", "\n", "> **Try it yourself!** \n", "> Brainstorm some real-world examples that best demonstrate the\n", "> correlation relationships below. The first one is already done for\n", "> you!\n", ">\n", "> - zero or near zero: the number of forks in your house vs the\n", "> average rainfall where you live\n", "> - weak negative: *\\[your text here\\]*\n", "> - strong positive: *\\[your text here\\]*\n", "> - weak positive: *\\[your text here\\]*\n", "> - strong negative: *\\[your text here\\]*\n", "\n", "# Making Tables: Visualizing Results\n", "\n", "Tables can be a useful way to generate large lists of different\n", "statistics that are relevant to your analysis." 
], "id": "7c19c473-f195-439b-ba5d-ab1a2f4d26cf" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "census_data <- census_data %>% drop_na(wages)" ], "id": "3dae5f24-a634-4d76-a1c3-d5f69c5f7130" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table2 <- census_data %>%\n", " group_by(immstat) %>%\n", " # we're interested in calculating different statistics based on immigration status \n", " # (1 or 2: 1 = immigrant, 2 = non-immigrant)\n", " summarize(avg_wage = mean(wages),\n", " # this will calculate all statistics twice, once for each group\n", " sd_wage = sd(wages),\n", " median_wage = quantile(wages,0.5),\n", " r_wm = cor(wages, mrkinc))\n", "\n", "table2" ], "id": "31a8bdc0-9bbd-4afc-802e-c10cc22543d8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **A Note on Reshaping Tables** \n", "> These tables can be tough to read. Fortunately, R has a nice set of\n", "> reshaping commands which allow you to reorganize these tables: \n", "> `pivot_wider`: turn selected row-values into columns (usually what you\n", "> want to do) \n", "> `pivot_longer`: turn selected columns into rows\n", "\n", "You do this by specifying what the new names and rows should look like.\n", "Here’s an example, using the above table:" ], "id": "78e948c9-2d4d-40d7-b19f-94cc55d386d3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pivot_wider(table2, \n", " names_from = c(immstat),\n", " values_from = c(avg_wage, sd_wage, median_wage, r_wm),\n", " names_sep = \".\") #the divider for new variable names" ], "id": "eb9b8080-668e-4278-b17e-93db31c1d008" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Practice Exercises\n", "\n", "## Exercise 1\n", "\n", "Suppose the weights of packages (in lbs) at a particular post office are\n", "recorded as below. Assume the weights follow a normal distribution." 
], "id": "acb673d9-5059-4b52-88a8-c2640e8f5fef" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "package_data <- c(95, 130, 148, 183, 100, 98, 137, 110, 188, 166)" ], "id": "1c7c7117-5516-4f81-abf6-2387f90094d5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate the mean, standard deviation \n", "# and variance of the weights of packages\n", "# round all answers to 2 decimal places\n", "\n", "answer_1 <- # enter your answer here for mean\n", "answer_2 <- # enter your answer here for standard deviation \n", "answer_3 <- # enter your answer here for variance\n", "\n", "test_1()\n", "test_2()\n", "test_3()" ], "id": "048440ed-e727-4a44-bfce-c7664b1cc28d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2\n", "\n", "Use the example above to answer: 68% of packages at the post office\n", "weigh how much?\n", "\n", "- A - 68% of packages weigh between 65.30 and 150.70 lbs\n", "- B - 68% of packages weigh between 100.40 and 170.60 lbs\n", "- C - 68% of packages weigh between 120.40 and 150.60 lbs\n", "- D - 68% of packages weigh between 80.56 and 120.60 lbs" ], "id": "68dd5ad0-ff93-4886-bb23-22cdac304ed8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_4 <- \"...\" # enter your choice here (ex, \"F\")\n", "\n", "test_4()" ], "id": "c5aa8f19-e83d-41cf-8840-bc1825a6103e" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }