{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.6 - Beginner - Distributions\n", "\n", "COMET Team
*Valeria Zolla, Colby Chambers, Jonathan Graves* \n", "2023-01-12\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Jupyter\n", "- Introduction to R\n", "- Introduction to Visualization\n", "- Central Tendency\n", "\n", "### Outcomes\n", "\n", "After completing this notebook, you will be able to:\n", "\n", "- Understand and work with Probability Density Functions (PDFs) and\n", "  Cumulative Distribution Functions (CDFs)\n", "- Use tables to find joint, marginal, and conditional probabilities\n", "- Interpret uniform, normal, and $t$ distributions\n", "\n", "### References\n", "\n", "- [Introduction to Probability and Statistics Using\n", "  R](https://mran.microsoft.com/snapshot/2018-09-28/web/packages/IPSUR/vignettes/IPSUR.pdf)\n", "\n", "## Introduction\n", "\n", "This notebook will explore the concept of distributions, both in terms\n", "of their functional forms for probability and how they represent\n", "different sets of data.\n", "\n", "Let’s first load the 2016 Census from Statistics Canada, which we will\n", "consult throughout this lesson." 
], "id": "21f48c5d-cd0a-4188-a439-52f6a91a68e0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# loading in our packages\n", "library(tidyverse)\n", "library(haven)\n", "library(digest)\n", "\n", "source(\"beginner_distributions_tests.r\")" ], "id": "7a49cdf7-9d37-4cb8-a475-cce56cc5ad4b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reading in the data\n", "census_data <- read_dta(\"../datasets_beginner/01_census2016.dta\")\n", "\n", "# cleaning up factors\n", "census_data <- as_factor(census_data)\n", "\n", "# cleaning up missing data\n", "census_data <- filter(census_data, !is.na(census_data$wages))\n", "census_data <- filter(census_data, !is.na(census_data$mrkinc))\n", "\n", "# inspecting the data\n", "glimpse(census_data)" ], "id": "cc7c9dd5-2c89-4786-9090-78c9bdeae078" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our data set ready on stand-by for analysis, let’s\n", "start looking at distributions as a concept more generally.\n", "\n", "## Part 1: Introduction to Concepts in Probability\n", "\n", "### What is a Probability?\n", "\n", "The probability of an event is a number that indicates the likelihood of\n", "that event happening.\n", "\n", "When the possible values of a certain event are discrete (e.g., `1,2,3`\n", "or `adult, child`), we refer to this as the **frequency**.\n", "\n", "When the possible values are continuous (e.g., any number between `0.5`\n", "and `3.75`), we refer to this as the **density**.\n", "\n", "There is a difference between *population* probabilities and *empirical*\n", "or *sample* probabilities. 
Generally, when we talk about distributions\n", "we will be referring to *population* objects, but there are also sample\n", "versions, which are often easier to think about.\n", "\n", "For instance, let’s say we have a dataset with 5,000 observations and a\n", "variable called `birthmonth` which records the month of birth of every\n", "participant captured in the dataset. If 500 people in the data were born\n", "in October, then `birthmonth == \"October\"` would have an *empirical*\n", "probability of 500/5,000 = 10%. We can’t be sure what the population\n", "probability would be unless we knew more about the population.\n", "\n", "### What is a Random Variable?\n", "\n", "A **random variable** is a variable whose possible values are numerical\n", "outcomes of a random phenomenon, such as rolling a die. A random\n", "variable can be either discrete or continuous.\n", "\n", "- A **discrete random variable** is one which may take on only a\n", "  finite number of distinct values (e.g., the number of children in a\n", "  family).\n", "\n", "    - In this notebook we’ll see that `agegrp` is an example of a\n", "      discrete variable.\n", "\n", "- A **continuous random variable** is one which takes an infinite\n", "  number of possible values and can be *measured* rather than merely\n", "  *categorized* (e.g., height, weight, or how much people earn).\n", "\n", "    - In the data, we can see that `wages` and `mrkinc` are good\n", "      examples of continuous random variables.\n", "\n", "### What is a Probability Distribution?\n", "\n", "A **probability distribution** refers to the pattern or arrangement of\n", "probabilities in a population. Distributions are usually described by\n", "*functions* that give the probability of each possible event occurring. 
As\n", "we explained above, there is a difference between *population* and\n", "*sample* distributions:\n", "\n", "- A *population* distribution (which is the typical way we describe\n", "  these) describes population probabilities\n", "\n", "- An *empirical* or *sample* distribution reports empirical\n", "  probabilities from within a particular sample\n", "\n", "> **Note**: we typically use *empirical* distributions as a way to learn\n", "> about the *population* distributions, which is what we’re primarily\n", "> interested in.\n", "\n", "Distribution functions come in several standard forms; let’s learn about\n", "them.\n", "\n", "## Part 2: Distribution Functions of Single Random Variables\n", "\n", "### Probability Density Functions (PDFs)\n", "\n", "**Probability Density Functions** are usually abbreviated as PDFs. For\n", "discrete variables they are also called probability mass functions. We\n", "usually use lower case letters like $f$ or $p$ to describe these\n", "functions.\n", "\n", "#### Discrete PDFs\n", "\n", "> “The probability distribution of a discrete random variable is the\n", "> list of all possible values of the variable and their probabilities\n", "> which sum to 1.” - Econometrics with R\n", "\n", "A **PDF**, also referred to as **density** or **frequency**, gives the\n", "probability of occurrence of each of the different values of a variable.\n", "\n", "Suppose a random variable $X$ may take $k$ different values, with the\n", "probability that $X = x_{i}$ defined to be\n", "$\\mathrm{P}(X = x_{i}) = p_{i}$. The probabilities $p_{i}$ must satisfy\n", "the following:\n", "\n", "- For each $i$, $0 \\le p_{i} \\le 1$\n", "- $p_{1} + p_{2} + \\dots + p_{k} = 1$\n", "\n", "We can view the empirical PDF of a discrete variable by creating either\n", "a frequency table or a graph.\n", "\n", "Let’s start by creating a frequency table of age groups using the\n", "variable `agegrp` in our `census_data`." 
], "id": "4089b0d6-c458-4cca-a7ac-20370520255e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "census_data_pdf <- filter(census_data, agegrp != \"not available\") # drop the \"not available\" category\n", "\n", "table_1 <- census_data_pdf %>% \n", "    group_by(agegrp) %>% # for every age group\n", "    summarize(count = n()) %>% # count the observations in the group\n", "    mutate(frequency = count/sum(count)) # calculate the frequency with which they occur\n", "\n", "table_1" ], "id": "d853279e-0667-4038-97fc-ca5cc86951be" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let’s try creating a graph to show the data. PDFs of discrete\n", "variables are best visualized with bar charts. To show the probabilities\n", "on the y-axis we use the function `geom_bar`.\n", "\n", "> Refer to *Introduction to Visualization* for a refresher." ], "id": "575a1eac-bfa9-48e7-aa8e-bd6eb94bcb35" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_pdf <- ggplot(data = table_1, aes(x = agegrp, y = frequency)) +\n", "    geom_bar(stat = 'identity') + # specify identity to plot the values in frequency\n", "    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))\n", "\n", "plot_pdf" ], "id": "3a9dee9f-bb61-4f4f-9fa8-4e2e899b5c9d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Continuous PDFs\n", "\n", "> “Since a continuous random variable takes on a continuum of possible\n", "> values, we cannot use the concept of a probability distribution as\n", "> used for discrete random variables.” - Econometrics with R\n", "\n", "Unlike a discrete variable, a continuous random variable is not defined\n", "by specific values. 
Instead, probabilities are defined over intervals of values, and\n", "are represented by the area under a curve (in Calculus, that’s an\n", "integral).\n", "\n", "The curve, which represents the probability function, is also called a\n", "**density** curve and it must satisfy the following:\n", "\n", "- The curve must have no negative values: $f(x) \\ge 0$ for all $x$ (the\n", "  probability of observing a value can’t be negative)\n", "- The total area under the curve must be equal to 1\n", "\n", "Let’s imagine a random variable that can take any value over an interval\n", "of real numbers. The probability of observing values between $a$ and $b$\n", "is the area below the density curve for the region defined by $a$ and\n", "$b$:\n", "\n", "$$\n", "\\mathrm{P}(a \\le X \\le b) = \\int_{a}^{b} f(x) \\; dx\n", "$$\n", "\n", "Since the number of values which may be assumed by the random variable\n", "is infinite, the probability of observing any single value is equal to\n", "0.\n", "\n", "> **Example**: If we take height as a continuous random variable, the\n", "> probability of observing an exact height (e.g., exactly 173.4827 or\n", "> exactly 187.19283 centimeters) is zero because the number of values\n", "> which may be assumed by the random variable is infinite.\n", "\n", "We will use graphs to visualize continuous PDFs rather than tables, as\n", "a graph can represent the entire continuum of possible values. 
Since the probability of observing values\n", "between $a$ and $b$ is the area underneath the curve, a continuous PDF\n", "should be visualized as a line graph instead of bar graphs or\n", "scatterplots.\n", "\n", "Suppose we would like to visualize the continuous empirical PDF of\n", "wages, highlighting wages between 25000 and 75000:" ], "id": "d716c7a2-9847-483d-b586-3adadf9374d2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# estimating the density (named wage_density to avoid masking the density() function)\n", "wage_density <- density(census_data$wages)\n", "plot(wage_density)\n", "\n", "# finding the grid positions of our lower and upper bounds\n", "l <- min(which(wage_density$x >= 25000))\n", "h <- max(which(wage_density$x < 75000))\n", "\n", "# visualizing our specified range in red \n", "polygon(c(wage_density$x[c(l, l:h, h)]),\n", "        c(0, wage_density$y[l:h], 0),\n", "        col = \"red\")" ], "id": "84f7ab3c-221b-4688-b654-ba621f220153" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cumulative Distribution Functions (CDFs)\n", "\n", "When we have a variable which is rankable, we can define a related\n", "object: the **Cumulative Distribution Function (CDF)**.\n", "\n", "- The CDF for both discrete and continuous random variables is the\n", "  probability that the random variable is *less than or equal to* a\n", "  particular value.\n", "\n", "- Hence, the CDF can never decrease as the value increases. 
Think of\n", "  the example of rolling a die:\n", "\n", "    - $F(1)$ would indicate the probability of rolling 1\n", "    - $F(2)$ would indicate the probability of rolling 2 *or lower*\n", "    - Evidently, $F(2)$ would be greater than $F(1)$\n", "\n", "- A CDF can only take values between 0 and 1.\n", "\n", "    - 0 (or 0%) is the probability that the random variable falls\n", "      below its smallest possible value\n", "    - 1 (or 100%) is the probability that the random variable is less\n", "      than or equal to its largest possible value\n", "\n", "- Therefore, if we have a variable $X$ that can take the value of $x$,\n", "  the CDF is the probability that $X$ will take a value less than\n", "  or equal to $x$.\n", "\n", "Since we use the lowercase $f(y)$ to represent the PDF of $y$, we use\n", "the uppercase $F(y)$ to represent the CDF of $y$. Mathematically, since\n", "$f_{X}(x)$ denotes the probability density function of $X$, the\n", "probability that $X$ falls between $a$ and $b$ where $a \\leq b$ is:\n", "\n", "$$\n", "\\mathrm{P}(a \\leq X \\leq b) = \\int_{a}^{b} f_{X}(x) \\; dx\n", "$$\n", "\n", "The CDF accumulates this density from the far left up to $x$:\n", "\n", "$$\n", "F_{X}(x) = \\mathrm{P}(X \\le x) = \\int_{-\\infty}^{x} f_{X}(t) \\; dt\n", "$$\n", "\n", "Because $X$ must take some value, the probability of $X$ falling\n", "anywhere on the real line is 1, so the CDF approaches 1 as $x$ grows:\n", "\n", "$$\n", "\\mathrm{P}(-\\infty < X < \\infty) = \\int_{-\\infty}^{\\infty} f_{X}(x) \\; dx = 1\n", "$$\n", "\n", "Below we plot the empirical CDF of the continuous variable `wages`. The\n", "graph indicates that most people earn between 0 and 200000, as the\n", "probability of wages being less than or equal to 200000 is over 90%." 
], "id": "73d42674-5aa9-477b-8866-8e1ed26617aa" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate CDF\n", "p <- ecdf(census_data$wages)\n", "\n", "# plot CDF\n", "plot(p)" ], "id": "c7e335d4-d7f7-439f-9f08-c0721c0c053e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Distribution Functions of Multiple Random Variables\n", "\n", "So far, we’ve looked at distributions for single random variables.\n", "However, we can also use **joint distributions** to analyze the\n", "probability of multiple random variables taking on certain values.\n", "\n", "### Joint Probability Distribution\n", "\n", "For two random variables $X$ and $Y$, the **joint distribution** is the\n", "probability distribution over all possible pairs of values that $X$ and\n", "$Y$ can take on.\n", "\n", "Let’s suppose both $X$ and $Y$ are discrete random variables which can\n", "take on the values 1, 2, and 3. We show the joint probability table\n", "(values of $X$ across the columns, values of $Y$ down the rows) below:\n", "\n", "|         | $X = 1$ | $X = 2$ | $X = 3$ |\n", "|---------|---------|---------|---------|\n", "| $Y = 1$ | 1/6     | 1/6     | 1/12    |\n", "| $Y = 2$ | 1/12    | 0       | 1/12    |\n", "| $Y = 3$ | 1/4     | 1/6     | 0       |\n", "\n", "The table shows the probability that $X$ and $Y$ take on certain values.\n", "For example, the third row of the first column states that\n", "$\\mathrm{P}(X = 1, Y = 3) = 1/4$.\n", "\n", "Notice that the probabilities sum to $1$.\n", "\n", "Every joint distribution can be represented by a PDF and CDF, just like\n", "single random variables. 
The formal notation of a PDF for two jointly\n", "distributed discrete random variables is\n", "\n", "$$f(x, y) = \\mathrm{P} (X = x, Y = y)$$\n", "\n", "where $f(x, y)$ is the joint probability that the random\n", "variable $X$ takes on a value of $x$, and the random variable $Y$ takes\n", "on a value of $y$.\n", "\n", "The CDF for jointly distributed random variables follows the same logic\n", "as with single variables, though this time it represents the probability\n", "of multiple variables taking on values less than or equal to those\n", "specified all at once.\n", "\n", "This is less commonly used for two discrete random variables, but it is\n", "useful if both variables are continuous. The formal notation of a CDF\n", "for two jointly distributed random variables is\n", "\n", "$$\n", "F(x, y) = \\mathrm{P}({X \\leq x}, {Y \\leq y})\n", "$$\n", "\n", "where $F(x, y)$ is the joint cumulative probability that the random\n", "variable $X$ takes on a value less than or equal to $x$ and the random\n", "variable $Y$ takes on a value less than or equal to $y$ simultaneously.\n", "\n", "### Marginal Probability Distribution\n", "\n", "The **marginal distribution** is the probability density function for\n", "each individual random variable. If we add up all of the joint\n", "probabilities from the same row or the same column, we get the\n", "probability of one random variable taking on a particular value,\n", "regardless of the value of the other. 
We can represent the marginal probability density function as\n", "follows:\n", "\n", "$$\n", "f_{X}(x) = \\sum_{y} \\mathrm{P}(X = x, Y = y)\n", "$$\n", "\n", "where we sum the joint probabilities across all possible values of $Y$\n", "for a given $x$ (and likewise across all values of $X$ for the marginal\n", "distribution of $Y$).\n", "\n", "If we wanted the marginal empirical probability distribution function of\n", "$X$, we would need to find the marginal probability for all possible\n", "values of $X$.\n", "\n", "The marginal probability that $X = 1$ in our example above is simply the\n", "sum of the joint probabilities that $X$ takes on $1$ across every\n", "possible value of $Y$:\n", "\n", "$$\n", "\\mathrm{P}(X = 1, Y = 1) + \\mathrm{P}(X = 1, Y = 2) + \\mathrm{P}(X = 1, Y = 3) = 1/6 + 1/12 + 1/4 = 1/2\n", "$$\n", "\n", "One important point to consider is that of **statistical independence of\n", "random variables**.\n", "\n", "- Two random variables are independent if and only if their joint\n", "  probability of occurrence equals the product of their marginal\n", "  probabilities for all possible combinations of values of the random\n", "  variables.\n", "- In mathematical notation, this means that two random variables are\n", "  statistically independent if and only if, for all $x$ and $y$:\n", "\n", "$$\n", "f(x, y) = f_{X}(x) f_{Y}(y)\n", "$$\n", "\n", "> **Think Deeper**: Can you tell whether the two random variables in our\n", "> example are statistically independent?\n", "\n", "### Conditional Probability Distribution\n", "\n", "The **conditional distribution** function indicates the probability of\n", "seeing each possible value of one random variable conditional on a\n", "specified value of another random variable, provided that the two random\n", "variables are jointly distributed.\n", "\n", "Below is the formula for the conditional probability density function of\n", "$X$ given $Y$:\n", "\n", "$$\n", "f(x | y) = \\frac {\\mathrm{P} (X = x \\cap Y = y)} {\\mathrm{P}(Y = y)}\n", "$$\n", "\n", "where\n", "\n", "- $f(x | y)$ represents the conditional probability that the random\n", " 
variable $X$ will take on a value of $x$ when the random variable\n", "  $Y$ takes on a value of $y$\n", "- $\\cap$ represents the event that both $X = x$ and $Y = y$ occur\n", "  simultaneously (a joint probability)\n", "\n", "> **Note**: the marginal probability that $Y = y$ must not be 0 as that\n", "> would make the conditional probability undefined.\n", "\n", "Let’s say we want to find the conditional probability of $X = 1$ given\n", "$Y = 2$, using the joint probability table in our example above. To find\n", "it, we first need to find $\\mathrm{P}(Y = 2)$ by summing the $Y = 2$ row,\n", "and then divide $\\mathrm{P}(X = 1, Y = 2)$ by that number. We get:\n", "$(1/12) \\div (1/12 + 0 + 1/12) = 1/2$.\n", "\n", "Until now, we have referred to the joint, marginal and conditional\n", "distribution of two discrete random variables; however, **the logic\n", "extends to continuous variables**.\n", "\n", "We focused on discrete random variables since they are much easier to\n", "represent in table format. While the same logic for discrete variables\n", "applies to continuous random variables, we often refer to mathematical\n", "formulas when finding the marginal and conditional probability functions\n", "for continuous random variables, since their PDFs and CDFs can be\n", "represented by mathematical functions.\n", "\n", "> **Note**: we can also have more than two jointly distributed random\n", "> variables. While it is possible to represent the probability of 3 or\n", "> more variables taking on certain values at once, it is hard to\n", "> represent that graphically or in table format. That is why we have\n", "> stuck to investigating two jointly distributed random variables in\n", "> this notebook.\n", "\n", "### Test your knowledge\n", "\n", "Let the random variable $X$ denote the time (in hours) Omar waits for\n", "his flight. Omar could have to wait up to 2 hours for this flight. 
Use\n", "this information to answer questions 1, 2, and 3 below.\n", "\n", "Is $X$ a discrete or continuous random variable?" ], "id": "606767ba-69eb-4460-8b64-f67942c4fb3e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer of \"discrete\" or \"continuous\" in place of ...\n", "answer_1 <- \"...\" \n", "\n", "test_1()" ], "id": "701a441d-7fa9-4d6f-aee3-312f5463b608" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Say a potential probability density function representing this random\n", "variable is the following:\n", "\n", "$$ \n", "f(x) = \\begin{cases}\n", "x & \\text{if } 0 \\leq x \\leq 1,\\\\\n", "2 - x & \\text{if } 1 \\leq x \\leq 2,\\\\\n", "0 & \\text{otherwise}\n", "\\end{cases}\n", "$$\n", "\n", "Is this a valid PDF?" ], "id": "79dc5fec-663a-47a5-98e7-2065d86b6626" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer of \"yes\" or \"no\" in place of ...\n", "answer_2 <- \"...\"\n", "\n", "test_2()" ], "id": "4d41ae60-fe47-4917-aa75-09c7a630f465" }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the probability of a person waiting up to 1.5 hours for their\n", "flight? Answer to 3 decimal places.\n", "\n", "> **Hint**: this is not the same as the probability of waiting precisely\n", "> 1.5 hours." ], "id": "af392f6d-1971-4147-bba8-1b993d03736f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here" ], "id": "bede065a-54ff-4ee2-86c8-a267dceab8be" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer for the cumulative probability (in decimal format, i.e. 95% = 0.95) here\n", "answer_3 <- ... 
\n", "\n", "test_3()" ], "id": "4ab1ff91-1dbf-4010-a6d5-1a460a961071" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let’s change gears and look at the joint distribution of discrete\n", "random variables `immstat` (rows) and `kol` (columns).\n", "\n", "|     | $1$ | $2$ | $3$  |\n", "|-----|-----|-----|------|\n", "| $1$ | 1/4 | 1/6 | 1/6  |\n", "| $2$ | 1/5 | 1/5 | 1/60 |\n", "| $3$ | 0   | 0   | 0    |\n", "\n", "Use the following legend to answer questions 4 and 5 below:\n", "\n", "- `immstat` takes values 1 == non-immigrant; 2 == immigrant; 3 == NA\n", "- `kol` takes values 1 == English only; 2 == French only; 3 == both\n", "  French and English\n", "\n", "What is the probability that someone is both an immigrant and knows both\n", "English and French? Answer in fractional form." ], "id": "248a8364-eb9d-406b-a0f0-87d116bcf781" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer for the probability (in fractional format, e.g. 10% = 1/10) here\n", "answer_4 <- ... \n", "\n", "test_4()" ], "id": "6e719144-39d5-4888-9aea-b494587b59c2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the probability that someone is an immigrant given that they\n", "know only English? Answer in fractional form." ], "id": "2e026767-d439-400d-aab2-4c4a4e966f9f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer for the probability (in fractional format, e.g. 10% = 1/10) here\n", "answer_5 <- ... \n", "\n", "test_5()" ], "id": "59ee7d35-7af1-4b84-aa6e-dbdecd61fff9" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $Y$ be a continuous random variable uniformly distributed on the\n", "range of values \\[20, 80\\]. Use this information to answer questions 6,\n", "7, and 8.\n", "\n", "What is the probability of $Y$ taking on the value of 30? You may use a\n", "graph to help you." 
], "id": "550c8779-8b9e-49b4-9b03-25936ebdd720" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here" ], "id": "bcabca18-8adb-4a62-a0f6-f6df156520a8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer for the probability (in fractional format, i.e. 25% = 1/4) here\n", "answer_6 <- ... \n", "\n", "test_6()" ], "id": "01d75868-6889-41dc-852e-50f44585ba78" }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the probability of $Y$ taking on a value of 60 or more?" ], "id": "971874e5-ae15-4a08-823b-81ba43794422" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer for the probability (in fractional format, i.e. 25% = 1/4) here\n", "answer_7 <- ... \n", "\n", "test_7()" ], "id": "6552a4ca-4063-4c39-9085-612df172cfd0" }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the range of $Y$ expanded to \\[20, 100\\], would the probability that\n", "$Y$ takes on a value of 60 or more increase or decrease?" ], "id": "1c94b742-3551-49a8-aecf-18400beecd27" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer of \"increase\" or \"decrease\" in place of \"...\"\n", "answer_8 <- \"...\"\n", "\n", "test_8()" ], "id": "c25b7661-e4e3-4046-96f5-37259ff21e39" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let $Z$ be a normally distributed random variable representing the\n", "length of a piece of classical music (in minutes), with a mean of 5 and\n", "standard deviation of 1.5. Use this information to answer questions 9,\n", "10 and 11.\n", "\n", "What is the probability that a given piece will last between 3 and 7\n", "minutes? Answer to 4 decimal places. You may use code to help you." 
], "id": "7fb6017a-3c6c-4bc6-8ff7-88eced4a3e96" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here" ], "id": "7ab8e426-feec-41a1-9490-c5cff64b36ec" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer for the probability (in decimal format, i.e. 95.23% = 0.9523) here\n", "answer_9 <- ...\n", "\n", "test_9()" ], "id": "b1cc0f11-fcde-46f2-8c7e-829fbea406a8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "If $Z$ were to remain normally distributed and have the same standard\n", "deviation, but the mean piece length was changed to 3 minutes, how would\n", "this probability change?" ], "id": "48382c4a-07af-499e-857d-9bba18b04201" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer of \"increase\" or \"decrease\" in place of \"...\"\n", "answer_10 <- \"...\"\n", "\n", "test_10()" ], "id": "64f592cd-b492-4a39-a0eb-429f757987c8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Returning to our original $Z$ variable (with mean 5), if the standard\n", "deviation were to decrease to 1, how would the original probability\n", "change?" ], "id": "f621f9bd-9e59-4fb8-989a-e43ac3411e30" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your answer of \"increase\" or \"decrease\" in place of \"...\"\n", "answer_11 <- \"...\"\n", "\n", "test_11()" ], "id": "aedaba49-11a1-42b5-9986-f5a53b19f5a8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: Parametric Distributions\n", "\n", "All of the examples we used so far were for *empirical* distributions\n", "since we didn’t know what the *population* distributions were. 
However,\n", "many statistics *do* have known distributions which are very important\n", "to understand.\n", "\n", "Let’s look at the three most famous examples of distributions:\n", "\n", "- uniform distribution\n", "- normal (or Gaussian) distribution\n", "- Student’s $t$-distribution\n", "\n", "These are called **parametric** distributions because they can be\n", "described by a set of numbers called *parameters*. For instance, the\n", "normal distribution’s two *parameters* are the mean and standard\n", "deviation.\n", "\n", "Each parametric distribution explained in this module is analyzed\n", "using four R functions. The four function names start with the\n", "following prefixes (for example, `dunif`, `punif`, `qunif`, and `runif`\n", "for the uniform distribution):\n", "\n", "- `d` for “density”: it produces the probability density function\n", "  (PDF)\n", "- `p` for “probability”: it produces the cumulative distribution\n", "  function (CDF)\n", "- `q` for “quantile”: it produces the inverse cumulative distribution\n", "  function, also called the quantile function\n", "- `r` for “random”: it generates random numbers from a particular\n", "  parametric distribution\n", "\n", "### Uniform Distribution\n", "\n", "A continuous variable has a **uniform distribution** if all values have\n", "the same likelihood of occurring.\n", "\n", "- A discrete analogue of a uniform distribution is rolling a die: it\n", "  is equally likely to roll any of the six numbers.\n", "\n", "- The variable’s density curve is therefore a rectangle, with constant\n", "  height across the interval and 0 height elsewhere.\n", "\n", "- Since the area under the curve must be equal to 1, the length of the\n", "  interval determines the height of the curve.\n", "\n", "Let’s see what this kind of distribution looks like.\n", "\n", "- First, we will generate random values from this distribution using\n", "  the function `runif()`.\n", "- This command is written as `runif(n, min = , max = )`, where `n` is\n", "  the number of observations, and `min` and `max` provide the interval\n", " 
between which the random values are drawn." ], "id": "8c800e30-9624-470a-8d82-7a990db3d99b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "example_unif <- runif(10000, min = 10, max = 100)\n", "hist(example_unif, freq = FALSE, xlab = 'x', xlim = c(0,100), main = \"Empirical PDF for uniform random values on [10,100]\")" ], "id": "d3be417e-2179-4b47-b7d8-938d1dc793ac" }, { "cell_type": "markdown", "metadata": {}, "source": [ "While each number within the specified range is equally likely to be\n", "drawn, by random chance, some ranges of numbers are drawn more\n", "frequently than others, hence the bars are not all the exact same\n", "height. The shape of the distribution will change each time you re-run\n", "the previous code cell.\n", "\n", "If we know the underlying distribution, we can infer many\n", "characteristics of the data. For instance, suppose we have a uniform\n", "random variable $X$ defined on the interval $(10,50)$.\n", "\n", "Since the interval has a width of 40, the curve must have a height of\n", "$\\frac{1}{40} = 0.025$ over the interval and 0 elsewhere. The\n", "probability that $X \\leq 25$ is the area between 10 and 25, or\n", "$(25-10)\\cdot 0.025 = 0.375$.\n", "\n", "#### PDF\n", "\n", "The `dunif()` function evaluates the uniform probability density\n", "function, either at a single value or at every value in a vector." 
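, "\n", "\n", "As a quick numerical check of the worked example above, here is a minimal\n", "sketch that evaluates the same uniform distribution on $(10, 50)$ with\n", "base R’s `dunif()` and `punif()`. The variable names (`height`,\n", "`area_by_hand`, `area_cdf`) are illustrative only and are not used\n", "elsewhere in this notebook.\n", "\n", "```r\n", "# height of the Uniform(10, 50) density at x = 25, i.e. 1 / (50 - 10)\n", "height <- dunif(25, min = 10, max = 50)\n", "\n", "# P(X <= 25) by hand: width of the interval [10, 25] times the height\n", "area_by_hand <- (25 - 10) * height\n", "\n", "# the same probability taken directly from the uniform CDF\n", "area_cdf <- punif(25, min = 10, max = 50)\n", "\n", "# print the three numbers side by side\n", "c(height = height, by_hand = area_by_hand, cdf = area_cdf)\n", "```\n", "\n", "Both routes should agree, since the area of a rectangle is all the\n", "uniform CDF has to compute."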
], "id": "ff493975-e2e3-436f-95e3-2992e4930bed" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "range <- seq(0, 100, by = 1) # grid of values at which to evaluate the distribution\n", "ex.dunif <- dunif(range, min = 10, max = 50) # uniform PDF evaluated at each value in \"range\"\n", "plot(range, ex.dunif, type = \"o\") # plotting the PDF against the values" ], "id": "5ba8dbaa-191f-4064-8dbf-d32ac892a6e9" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### CDF\n", "\n", "The `punif()` function evaluates the uniform cumulative distribution\n", "function at each value in a set of values." ], "id": "8f9d3834-a0b0-4bb6-8651-50d8d8680e15" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_cdf <- punif(range, # vector of quantiles\n", "               min = 10, # lower limit of the distribution (a)\n", "               max = 50, # upper limit of the distribution (b)\n", "               lower.tail = TRUE, # if TRUE, probabilities are P(X <= x); if FALSE P(X > x)\n", "               log.p = FALSE) # if TRUE, probabilities are given as log(p)\n", "plot(range, x_cdf, type = \"l\") # plotting the CDF against the values" ], "id": "e8c0a027-e401-4e27-8b60-dbcf5b369c17" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `qunif()` function is the inverse of `punif()`: given a cumulative\n", "probability $p$, it returns the value $x$ for which\n", "$\\mathrm{P}(X \\le x) = p$. This lets us look up quantiles (such as the\n", "median or the 90th percentile) of the distribution." 
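, "\n", "\n", "To see the inverse relationship between `punif()` and `qunif()`\n", "concretely, here is a small sketch on the same uniform distribution on\n", "$(10, 50)$. The variable names (`p_25`, `x_back`, `median_x`) are\n", "illustrative assumptions and are not used elsewhere in this notebook.\n", "\n", "```r\n", "# cumulative probability P(X <= 25) for a Uniform(10, 50) variable\n", "p_25 <- punif(25, min = 10, max = 50)\n", "\n", "# qunif() maps that cumulative probability back to the value 25\n", "x_back <- qunif(p_25, min = 10, max = 50)\n", "\n", "# the median is the 0.5 quantile: the midpoint of the interval\n", "median_x <- qunif(0.5, min = 10, max = 50)\n", "\n", "c(p_25 = p_25, x_back = x_back, median = median_x)\n", "```\n", "\n", "Going from a value to a probability and back should return the value we\n", "started with."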
], "id": "bc84be7c-f493-4148-acc3-123c2ea2a222" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "quantiles <- seq(0, 1, by = 0.01) # cumulative probabilities from 0 to 1\n", "y_qunif <- qunif(quantiles, min = 10, max = 50) # value at each cumulative probability\n", "plot(quantiles, y_qunif, type = \"l\") # plotting the quantile function" ], "id": "e92b0780-c68b-4679-99b6-08c00fd0e08d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normal (Gaussian) Distribution\n", "\n", "We first saw the normal distribution in the *Central Tendency* notebook.\n", "The normal distribution is fundamental to many statistical processes as\n", "many random variables in natural and social sciences are approximately\n", "normally distributed (e.g., height and SAT scores both roughly follow a\n", "normal distribution). The normal distribution is recognizable by its\n", "symmetrical, bell-shaped density curve.\n", "\n", "A normal distribution is **parameterized** by its mean $\\mu$ and its\n", "standard deviation $\\sigma$, and it is expressed as $N(\\mu,\\sigma)$. We\n", "cannot calculate the normal distribution without knowing the mean and\n", "the standard deviation.\n", "\n", "The PDF has a complex equation, which can be written as:\n", "\n", "$$\n", "f(x; \\mu, \\sigma) = \\displaystyle \\frac{e^{-(x-\\mu)^{2}/(2\\sigma^{2})}}{\\sigma\\sqrt{2\\pi}}\n", "$$\n", "\n", "A **standard normal distribution** is a special normal distribution: it\n", "has a mean equal to zero and a standard deviation equal to 1 ($\\mu=0$\n", "and $\\sigma=1$), hence we can denote it $N(0,1)$. A couple of notation\n", "points to keep in mind include:\n", "\n", "- Standard normal variables are often denoted by $Z$\n", "- Standard normal PDF is denoted by $\\phi$\n", "- Standard normal CDF is denoted by $\\Phi$\n", "\n", "To generate simulated normal random variables, we can use the\n", "`rnorm()` function, which is similar to the `runif()` function." 
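, "\n", "\n", "Before simulating, it can help to confirm that the normal density\n", "formula (with base $e$) matches what R computes. The sketch below\n", "writes the formula out by hand and compares it with the built-in\n", "`dnorm()`. The helper `normal_pdf()` and the names `x_vals`, `by_hand`\n", "and `built_in` are hypothetical, introduced only for this illustration.\n", "\n", "```r\n", "# hypothetical helper: the normal density formula written out by hand\n", "normal_pdf <- function(x, mu, sigma) {\n", "  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))\n", "}\n", "\n", "# a few points at which to compare the two calculations\n", "x_vals <- c(-1, 0, 0.5, 2)\n", "by_hand <- normal_pdf(x_vals, mu = 0, sigma = 1)\n", "built_in <- dnorm(x_vals, mean = 0, sd = 1)\n", "\n", "# TRUE if the two vectors agree up to numerical tolerance\n", "all.equal(by_hand, built_in)\n", "```\n", "\n", "The same comparison should hold for other choices of `mu` and `sigma`,\n", "as long as `sigma` is positive."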
], "id": "d7ec4c0f-5c71-49d2-bf29-18263de85fcd" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " x <- rnorm(10000, # number of observations\n", " mean = 0, # mean\n", " sd = 1) # sd\n", " hist(x, probability=TRUE) # the command hist() creates a histogram using variable x\n", " xx <- seq(min(x), max(x), length=100)\n", " lines(xx, dnorm(xx, mean=0, sd=1))" ], "id": "1924774c-8724-48fd-8c14-4405ec85421e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### PDF\n", "\n", "As with the uniform distribution, we can use `dnorm` to plot the\n", "standard normal pdf." ], "id": "5b186618-3a85-47b6-860d-64ef61c52aa3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # create a sequence of 100 equally spaced numbers between -4 and 4\n", " x <- seq(-4, 4, length=100)\n", "\n", " # create a vector of values that shows the height of the probability distribution\n", " # for each value in x\n", " y <- dnorm(x)\n", "\n", " # plot x and y as a scatterplot with connected lines (type = \"l\") and add\n", " # an x-axis with custom labels\n", " plot(x,y, type = \"l\", lwd = 2, axes = FALSE, xlab = \"\", ylab = \"\")\n", " axis(1, at = -3:3, labels = c(\"-3s\", \"-2s\", \"-1s\", \"mean\", \"1s\", \"2s\", \"3s\"))" ], "id": "a0850d44-84db-493b-b195-f24df281ffb5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have used the random values generated to observe its bell shaped\n", "distribution. This is a standard normal PDF because the mean is zero and\n", "the standard deviation is one.\n", "\n", "We can also change the numbers of mean and sd in the `rnorm()` command\n", "to make the distribution not standard.\n", "\n", "`dnorm()` gives the height of the probability distribution at each point\n", "for a given mean and standard deviation. 
Since the height of the PDF\n", "curve is the density, `dnorm()` can also be used to calculate the entire\n", "density curve, as observed in the command\n", "`lines(xx, dnorm(xx, mean=0, sd=1))` in the previous section." ], "id": "4f9a9cce-d0a9-4066-a11a-dc4bf452eeb3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " dnorm(100, mean=100, sd=15) # height of the N(100, 15) density at x = 100" ], "id": "cc54166f-ab39-4ed2-bbfb-b3d01a29c0b3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### CDF\n", "\n", "The `pnorm()` function can (1) give the entire CDF curve of a normally\n", "distributed random *variable* and (2) give the probability that the\n", "variable takes a value less than or equal to a *given number*." ], "id": "6957f9a5-8789-4d5e-9027-34b6b4ba8422" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " curve(pnorm(x), \n", " xlim = c(-3.5, 3.5), \n", " ylab = \"Probability\", \n", " main = \"Standard Normal Cumulative Distribution Function\")" ], "id": "38cb521a-8853-40cb-9ad5-adceb4ccbc79" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " pnorm(27.4, mean=50, sd=20) # gives you the CDF at that specific location" ], "id": "581f5df7-ee15-4ce2-b0bb-c753ce482aac" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `qnorm()` function can create a percent point function (ppf), which\n", "is the inverse curve of the cumulative distribution function. 
The\n", "`qnorm()` function gives the inverse of the CDF by taking a cumulative\n", "probability and returning the value whose CDF equals that probability.\n", "\n", "- The CDF at a *given number* is the probability that a normally\n", " distributed random variable is less than or equal to that\n", " *given number*\n", "- To create the ppf, we start with that probability and use the\n", " `qnorm()` function to compute the corresponding *given number* for\n", " the cumulative distribution" ], "id": "95995b5e-3a1c-4e89-95d8-7792b7d0f436" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " curve(qnorm(x), \n", " xlim = c(0, 1), \n", " xlab = \"Probability\",\n", " ylab = \"x\", \n", " main = \"Quantile (inverse CDF) Function\")" ], "id": "ff18aa60-8f76-4588-98f9-d0cb5bd94990" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " qnorm(0.84, mean=100, sd=25)" ], "id": "f4f90d1e-af24-44f6-a35a-48f3fb16464c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Think Deeper**: The output of the function above shows that the 84th\n", "> percentile is approximately 1 standard deviation to the right of the\n", "> mean. Do you recognize this property of normally distributed random\n", "> variables?\n", "\n", "### Student’s $t$-Distribution\n", "\n", "The **Student’s t-distribution** is a continuous distribution that\n", "arises when we estimate the mean of a normally distributed population\n", "from a small sample and the population standard deviation is unknown.\n", "This is an important concept that we will explore in future\n", "notebooks.\n", "\n", "The $t$-distribution is based on the number of observations and the\n", "degrees of freedom.\n", "\n", "A degree of freedom ($\\nu$) is the maximum number of logically\n", "independent values. You can think of it as the number of values that\n", "need to be known in order for the remaining values to be determined. 
For\n", "example, let’s say you have 3 data points and you know that their\n", "average value is 5. If you randomly select two of the values (let’s say,\n", "4 and 5) even without sampling the last data point, you know that its\n", "value needs to be 6. Hence, there is “no freedom” in the last data\n", "point.\n", "\n", "In the case of the $t$-distribution, the degree(s) of freedom are\n", "calculated as $\\nu = n-1$, with $n$ being the sample size.\n", "\n", "When $\\nu$ is large, the $t$-distribution begins to look like a standard\n", "normal distribution. This approximation between standard normal and\n", "$t$-distribution can start to be noticed around $\\nu \\geq 30$.\n", "\n", "As with the uniform and normal distribution, to generate random values\n", "that together have a $t$-distribution we add the prefix `r` to the name\n", "of the distribution, `rt()`." ], "id": "17a3395a-47a5-4eab-81ed-cb72f52f033c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " n <- 100\n", " df <- n - 1\n", "\n", " samples <- rt(n, df)\n", " hist(samples, breaks = 20, freq = FALSE)\n", " xx <- seq(min(samples), max(samples), length=100)\n", " lines(xx, dt(xx, df))" ], "id": "df12570b-2623-497b-9b03-5ccdd3a5f40b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the t-distribution is bell-shaped and symmetrical like the\n", "normal distribution, it is not as thin as a normal distribution. 
Hence,\n", "its values are more spread out, with more probability in the tails.\n", "\n", "> **Note**: this is explained by the central limit theorem (CLT) and the\n", "> law of large numbers (LLN), which we will explore in future notebooks.\n", "\n", "#### PDF\n", "\n", "The function `dt()` calculates the PDF (the density) at a given value\n", "for the specified degrees of freedom.\n", "\n", "In the examples shown below we use the variable `ex.tvalues`, which is a\n", "sequence of numbers ranging from -4 to 4 with increments of 0.01.\n", "This gives 801 values, at which we evaluate a $t$-distribution with 799\n", "degrees of freedom." ], "id": "f00ac0e7-bb96-4441-8adb-2c9530af4e66" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " ex.tvalues <- seq(- 4, 4, by = 0.01) # generating a sequence of numbers \n", " ex_dt <- dt(ex.tvalues, df = 799) # calculating the PDF\n", " plot(ex_dt, type=\"l\") " ], "id": "adf4213f-50ca-4c7a-a3cc-b8365974784a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### CDF\n", "\n", "The `pt()` function calculates the entire CDF curve of a t-distributed\n", "random *variable* and gives the probability that a t-distributed random\n", "variable is less than or equal to a *given number*." ], "id": "803e5d70-54de-46ff-b52f-6a7253fdd8af" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " ex_pt <- pt(ex.tvalues, df = 799) # calculating CDF\n", " plot(ex_pt, type = \"l\") " ], "id": "8c5e759a-2153-409f-90db-d87c23bd30b2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `qt()` function takes a probability value and gives the number\n", "whose cumulative value matches that probability. This function can\n", "also create a percent point function (ppf)." 
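, "\n", "\n", "As a sketch in symbols (writing $F_\\nu$ for the CDF of a $t$-distribution\n", "with $\\nu$ degrees of freedom and $Q_\\nu$ for its quantile function; both\n", "names are introduced here only for illustration):\n", "\n", "$$\n", "Q_\\nu(p) = F_\\nu^{-1}(p), \\qquad Q_\\nu(0.5) = 0, \\qquad Q_\\nu(1 - p) = -Q_\\nu(p)\n", "$$\n", "\n", "The last two identities follow from the symmetry of the $t$-distribution\n", "around zero."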
], "id": "94586bd9-88fb-4c54-a36a-b8ec4c80ec75" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " ex.qtvalues <- seq(0, 1, by = 0.01) # generating a sequence of numbers \n", " ex_qt <- qt(ex.qtvalues, df = 99) # calculating the ppf\n", " plot(ex_qt, type = \"l\") # plotting the ppf " ], "id": "9f241a7a-585a-4b1b-b5e0-5aaf77b4c933" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beyond these three common distributions, there are many other types of\n", "distributions, such as the chi-square and $F$ distributions. In some cases\n", "we may also have variables that do not fit any common distribution. In\n", "those cases, we describe those distributions as non-parametric.\n", "\n", "### Test your knowledge\n", "\n", "Which of the following random variables is most likely to be\n", "uniformly distributed?\n", "\n", "1. The height of a UBC student\n", "2. The wages of a UBC student\n", "3. The birthday of a UBC student" ], "id": "c4a65826-b5a2-44d2-bd1d-ae1a4ef805d4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# enter your answer here as \"A\", \"B\", or \"C\"\n", "answer_12 <- \"...\"\n", "\n", "test_12()" ], "id": "ff4b57b3-029f-4c8e-8b95-a69fb1c2829b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which of the following random variables is most likely to be normally\n", "distributed?\n", "\n", "1. The height of a UBC student\n", "2. The wages of a UBC student\n", "3. 
The birthday of a UBC student" ], "id": "c28a15d1-4b32-403c-8157-837a9e8bbf0e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# enter your answer here as \"A\", \"B\", or \"C\"\n", "answer_13 <- \"...\"\n", "\n", "test_13()" ], "id": "a3f05248-5174-4c0d-9320-caf5613ca220" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given our uniform distribution `example_unif`, find $F(72)$.\n", "\n", "> **Hint**: you don’t need to calculate the exact probability given the\n", "> distribution. You only need to know that this random variable is\n", "> uniformly distributed for values between 10 and 100." ], "id": "7b8a6d20-b7bd-42c8-98ed-2c3edaa4af43" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here" ], "id": "1e868b04-abce-4bc9-a575-48f5375b1ea7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# enter your answer as a fraction below\n", "answer_14 <- ...\n", "\n", "test_14()" ], "id": "dc0589b7-f47f-4fbb-a6c7-7e849191fd4d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assume we have a standard normal distribution $N(0,1)$. Find $F(0)$." ], "id": "6247e5fd-182d-44aa-888c-057a9d4eca63" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# enter your answer as a fraction below\n", "answer_15 <- ...\n", "\n", "test_15()" ], "id": "bd84bea6-f5ad-4571-8880-e4bf90996515" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_15 <- 1/2\n", "\n", "test_15()" ], "id": "8cf68b39-02e3-4263-92ac-1ea016e9a457" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s assume we have a Student’s $t$-distribution that approximates a\n", "normal distribution really well. What must be true?\n", "\n", "1. The degrees of freedom parameter must be very large\n", "2. The degrees of freedom parameter must be very small\n", "3. 
The degrees of freedom parameter must be equal to the mean of the\n", " normal distribution" ], "id": "f8a5c044-f61b-4e63-8e66-2baa76799dc2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Enter your answer here as \"A\", \"B\", or \"C\"\n", "answer_16 <- \"...\"\n", "\n", "test_16()" ], "id": "967e62ac-fab6-4f1e-9b32-5b35fee24be9" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }