{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 1.4.1 - Beginner - Hypothesis Testing (226)\n",
        "\n",
        "COMET Team <br> *Mridul Manas*  \n",
        "2023-08-03\n",
        "\n",
        "### Prerequisites:\n",
        "\n",
        "-   Introduction to Jupyter\n",
        "\n",
        "-   Introduction to R\n",
        "\n",
        "-   Confidence Intervals\n",
        "\n",
        "### Learning Outcomes:\n",
        "\n",
        "1.  Modelling real-world problems as a null hypothesis tests carefully\n",
        "    defining the hypotheses using correct notation.\n",
        "\n",
        "2.  Simulating the distribution of population observations as\n",
        "    hypothesized under the null assumption.\n",
        "\n",
        "3.  Simulating the process of repeated sampling in R.\n",
        "\n",
        "4.  Verifying if **Central Limit Theorem** holds by simulating the\n",
        "    sampling distribution under the null hypothesis.\n",
        "\n",
        "5.  Obtaining the null distribution of test-statistics and understand\n",
        "    its significance to hypothesis testing.\n",
        "\n",
        "6.  Computing **p-values** and using them to reject or not reject the\n",
        "    null hypothesis.\n",
        "\n",
        "7.  Interpreting **significance levels (**$\\alpha$) and how they are\n",
        "    used in hypothesis testing.\n",
        "\n",
        "### 1. An Introduction to Hypothesis Testing using R\n",
        "\n",
        "Hypothesis Testing is a formal approach to choosing between two possible\n",
        "interpretations of an observed relationship in a sample. You are\n",
        "comparing two populations A and B and draw one independent sample\n",
        "randomly from each population to find that the point estimate obtained\n",
        "from sample A (eg. sample mean) is higher than the point estimate\n",
        "obtained from sample B. Thinking about the true population parameters\n",
        "for the two populations, we can choose between two interpretations:\n",
        "\n",
        "-   $H_0$ or null hypothesis: there is no relationship between the two\n",
        "    population parameters, and the observed relationship between the two\n",
        "    point estimates is a result of sampling variability\n",
        "-   $H_1$ or alternative hypothesis: there *is* a relationship between\n",
        "    the two population parameters, as sampling variability alone cannot\n",
        "    explain the observed relationship between the two point estimates\n",
        "\n",
        "Please note that we neither reject or accept the alternative hypothesis.\n",
        "Hypothesis testing always concludes with us either accepting or\n",
        "rejecting the null hypothesis or $H_0$.\n",
        "\n",
        "You might have heard of the *right tailed* hypothesis tests. When\n",
        "interested in a *single* population, this indicates that we are asking\n",
        "if the true parameter (could be the mean, standard deviation or\n",
        "variance) is strictly “greater than” a fixed constant value(eg.\n",
        "$\\mu_A > 23$ or when checking if the coefficient from a regression is\n",
        "significant, ie. $\\beta_1 > 0$). In the two-sample case, we are always\n",
        "interested in knowing whether one population’s true parameter is greater\n",
        "than another population’s (eg. $\\mu_A > \\mu_B$).\n",
        "\n",
        "> This notebook explains how to conduct a *right-tailed* using R. The\n",
        "> left-tailed test is similar to right-tailed except we are interested\n",
        "> in “lesser than” relationships observed within populations. Two-tailed\n",
        "> tests are used to check for strict equality, such as asking if\n",
        "> $\\mu_A = mu_B$ or whether one parameter (eg. $\\mu_A$, $\\sigma_A$ or\n",
        "> $\\beta_A$ (regression) is strictly equal to a certain fixed value or\n",
        "> the true parameter of some other population.\n",
        "\n",
        "### 1.1: Do Mandatory Tutorials Increase GPAs?\n",
        "\n",
        "The UWC School at the start of 2022-23 launched a policy requiring all\n",
        "of the boarding students in the school to attend supervised “prep\n",
        "sessions” or tutorials. You are given the challenge to infer whether the\n",
        "average GPA for 2022-23 year was higher than the 2021-22 GPA average\n",
        "(72.5%) while only having access to 2022-23 grades for 50 (randomly\n",
        "picked) students from the whole class of 500.\n",
        "\n",
        "### 1.2: Formulating The Null And Alternate Hypotheses\n",
        "\n",
        "Let’s first define our population parameters:\n",
        "\n",
        "-   $\\mu_0 = 72.5$: Population mean GPA (%) in 2021-22 (base)\n",
        "\n",
        "-   $\\mu_1 (unknown)$: Population mean GPA (%) in 2022-23 (treatment)\n",
        "\n",
        "We are essentially interested in determining whether $\\mu_1 > \\mu_0$.\n",
        "Since we know the true value for $\\mu_0$, we can choose the\n",
        "*relationship to be tested* as: $\\mu_1 > 72.5$ where $\\mu_1$ is the true\n",
        "average GPA of all 500 students in 2022-23.\n",
        "\n",
        "> Side note: This is also equivalent to asking if $\\mu_1 - mu_0 > 0$,\n",
        "> and that is a right-tailed test for population difference in means.\n",
        "\n",
        "Our hypotheses for the right-tailed test are:\n",
        "\n",
        "-   $H_0$: After-school tutorials did not increase the average GPA\n",
        "    $\\mu_1 = 72.5$\n",
        "\n",
        "-   $H_1$: After-school tutorials did increase the average GPA\n",
        "    ($\\mu_1 > 72.5)$\n",
        "\n",
        "Here, $\\H_0$ is the **null hypothesis** and we always begin with\n",
        "assuming that this is true. Our job essentially is to use statistical\n",
        "inference and reasoning to conclude whether null hypothesis is true or\n",
        "not, based on a single sample from the population!\n",
        "\n",
        "### 1.2: The Distribution Individual GPAs In 2022-23 Under Null Hypothesis"
      ],
      "id": "21a5ee55-2532-43aa-ac75-520ff647495a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#RUN THIS CELL BEFORE CONTINUING\n",
        "# Load the necessary library\n",
        "library(dplyr)\n",
        "library(ggplot2)"
      ],
      "id": "eec39f8a-57b8-4ef5-ad1e-04d4ccedea27"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since the values for GPAs are both random and of continuous type, we can\n",
        "visualize the **distribution** to see the number of times we observe\n",
        "specific values for the GPAs across a continuous x-axis. For this\n",
        "tutorial, we assume that we know the GPAs in both years (2021-22 and\n",
        "2022-23) are **normally distributed**. This means that the distribution\n",
        "is symmetric and bell-like in shape with its center situated at its\n",
        "mean.\n",
        "\n",
        "As always, we begin the test with assuming the null hypothesis is true.\n",
        "Hence, in the following code cell, we have set the mean of the\n",
        "**hypothesized under null** distribution of GPAs as 72.5%."
      ],
      "id": "c8033252-85de-4b54-a1f8-ed49f860be50"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Set the seed for reproducibility (optional)\n",
        "set.seed(42)\n",
        "n = 500 # number of observations\n",
        "mean_h0 = 72.5 #hypothesized population mean under the null\n",
        "sd_pop = 2.8 #population standard deviation of GPAs in 2022-23 \n",
        "\n",
        "# Generate a bounded normal distribution of GPAs with mean 72.5% and standard deviation of 2.8%. \n",
        "gpa_null_dist <- data.frame(GPA = rnorm(n, mean_h0, sd_pop))\n",
        "\n",
        "# Create the density plot and add the vertical line and annotation\n",
        "gpa_null_dist_plot <- ggplot(gpa_null_dist, aes(x = GPA)) +\n",
        "  geom_density(fill = \"skyblue\", color = \"black\") +\n",
        "  geom_vline(xintercept = mean_h0, color = \"red\", linetype = \"dashed\") +\n",
        "  geom_text(aes(label = sprintf(\"Mean: %.2f\", mean_h0), x = mean_h0, y = 0.01), vjust = 1, color = \"red\") +\n",
        "  labs(title = \"Hypothesized Distribution of GPAs (2022-23) Under Null Hypothesis\",\n",
        "       x = \"GPA (%)\", y = \"Density\")\n",
        "gpa_null_dist_plot"
      ],
      "id": "c1422a43-32b3-418d-9948-8fb4520120f8"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Our choice of the population mean for this distribution has been\n",
        "borrowed directly from the **null hypothesis**. Hence, we can call it\n",
        "the distribution of 2022-23 GPAs *under null*.\n",
        "\n",
        "> Data simulated in R comes with inherent variability that should\n",
        "> explain the *imperfections* in the shape of the distribution or why\n",
        "> the center (mean) is not exactly equal to the set mean,\n",
        "> $\\mu_0 = 72.5$.\n",
        "\n",
        "****Thinking Critically*:*** Consider one student the population has\n",
        "obtained a GPA of 82.3%, placing themselves in the top 0.15% of the\n",
        "2022-23 class (assuming the the population mean $\\mu_1$ is still equal\n",
        "to $\\mu_0$ under null).\n",
        "\n",
        "Would you reject the null hypothesis based on this observation alone?\n",
        "\n",
        "Our answer is a clear NO. Outliers might challenge our null hypothesis\n",
        "but they can occur in all fairness the null hypothesis. They tell\n",
        "nothing about the validity of the null hypothesis!\n",
        "\n",
        "You might be wondering how we calculated the $P(GPA > 82.3) = 0.15$. We\n",
        "used a rule called the 68-95-99.7 rule that only works for normal\n",
        "distributions.\n",
        "\n",
        "> **The 68-95-99.7 rule** – also known as the empirical rule or\n",
        "> three-sigma rule – is a statistical guideline that describes the\n",
        "> percentage of data that falls within a certain number of standard\n",
        "> deviations from the mean in a normal distribution.\n",
        ">\n",
        "> For a normal distribution, approximately 68% of the data falls within\n",
        "> one standard deviation (σ) of the mean (μ). Approximately 95% of the\n",
        "> data falls within two standard deviations (2σ) of the mean (μ).\n",
        "> Approximately 99.7% of the data falls within three standard deviations\n",
        "> (3σ) of the mean (μ).\n",
        "\n",
        "Instead of considering individual values from the population, let’s\n",
        "explore the idea of taking a representative set of 50 randomly chosen\n",
        "students from the 2022-23 class.\n",
        "\n",
        "### 1.3: The Distribution of Sample Means Under Null Hypothesis\n",
        "\n",
        "*Essentially, the sampling distribution of sample means can be generated\n",
        "through the following steps:*\n",
        "\n",
        "1.  Draw all possible samples of a fixed size $n$ from the population\n",
        "    (drawing observations randomly with replacement)\n",
        "\n",
        "2.  Record the point estimate or the sample statistic for each sample.\n",
        "    This is the $\\bar{x_i}$ or the sample mean GPA for sample $i$.\n",
        "\n",
        "3.  Plot *each and every* point estimate obtainable from the population,\n",
        "    ie. the ($\\bar{x_i}$s), as a distribution (just like we did for\n",
        "    2022-23 individual GPAs). This distribution will be called the\n",
        "    **sampling distribution of sample means**.\n",
        "\n",
        "*What does the sampling distribution look like under the null\n",
        "hypothesis? Where is it centered under null, and what is its standard\n",
        "deviation?*\n",
        "\n",
        "The **Central Limit Theorem** states that for large enough sample sizes,\n",
        "the sampling distribution of sample means will approach a normal\n",
        "distribution, regardless of the shape of the population distribution. As\n",
        "the sample size increases, the mean of the sampling distribution of\n",
        "sample means will get closer and closer to the population mean.\n",
        "\n",
        "Assume that our sample size is 50 which is big enough for CLT ($> 30$).\n",
        "Assuming the GPAs (2022-23) follow a normal distribution with a\n",
        "population mean of $\\mu_0 = 72.5$, the sampling distribution of sample\n",
        "means will be distributed as:\n",
        "\n",
        "$$\n",
        "\\text{N}\\sim(\\mu_0 = 72.5, \\frac{\\sigma}{\\sqrt{n}})\n",
        "$$\n",
        "\n",
        "Here, $\\frac{\\sigma}{\\sqrt{n}}$is called the *standard error* of the\n",
        "sampling distribution with $\\sigma$ being the population standard\n",
        "deviation and $n$ the sample size.\n",
        "\n",
        "However, population standard deviations ($\\sigma$) are rarely known in\n",
        "real-world cases, and we can use $s$ or the sample standard deviation as\n",
        "an estimator. But, using $s$ instead of $\\sigma$ means that sample means\n",
        "will follow the `t-distribution` instead. $$\n",
        "\\bar{X} \\sim \\text{t}_{n-1}(\\mu, \\frac{\\sigma}{\\sqrt{n}})\n",
        "$$\n",
        "\n",
        "> A t-distribution has fatter tails than a normal distribution but it\n",
        "> does a good job at approximating a normal distribution when sample\n",
        "> size, n, is large ($> 30$). The $n-1$ notation denotes the “degrees of\n",
        "> freedom” which you would learn is important for calculating the right\n",
        "> probabilities.\n",
        "\n",
        "### 1.4: Simulating the Sampling Distribution under Null Hypothesis\n",
        "\n",
        "Here is a function that simulates the process of taking repeated samples\n",
        "(with replacement). Though this is beyond the scope of this tutorial,\n",
        "handy pre-built methods exist for R which simulate repeated sampling\n",
        "procedures such as those offered as part of the `infer` package."
      ],
      "id": "465b0ee6-01b4-4af6-83f9-505480879091"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#RUN THIS CELL BEFORE CONTINUING:\n",
        "rep_sample_n <- function(reps, n, pop_array) {\n",
        "  \n",
        "  output <- data.frame(replicate = integer(), GPA = numeric())\n",
        "\n",
        "  for (i in 1:reps) {\n",
        "    sample_vals <- sample(pop_array, n, replace = TRUE)\n",
        "    temp_df <- data.frame(sample_id = rep(i, n), GPA = sample_vals)\n",
        "    output <- rbind(output, temp_df)\n",
        "  }\n",
        "\n",
        "  return(output)\n",
        "}"
      ],
      "id": "7b6568d2-c399-4ba9-8f9c-562a1f7999e4"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Example usage of the function:\n",
        "test <- rep_sample_n(reps = 1500, n = 50, gpa_null_dist$GPA)\n",
        "\n",
        "head(test)"
      ],
      "id": "c3470186-41d9-4696-b837-cd2e570c7be4"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let’s compute the sample means for each sample:"
      ],
      "id": "e26470b4-3b66-4b0f-bf6a-e66d79083708"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#RUN THIS CELL\n",
        "set.seed(80) #DO NOT CHANGE for reproducibility. \n",
        "\n",
        "sampling_dist_null <- rep_sample_n(reps = 1500, n = 50, gpa_null_dist$GPA)\n",
        "\n",
        "sampling_dist_means_null <- sampling_dist_null %>%\n",
        "  group_by(sample_id) %>% summarise(mean_GPA = mean(GPA))\n",
        "\n",
        "head(sampling_dist_means_null, 10)"
      ],
      "id": "fed66da6-8425-4cf5-9d91-b050c06d37f3"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we will visualize the distribution of the 1500 sample means. Can\n",
        "you guess what the mean for this distribution will be?"
      ],
      "id": "54a8593f-8c7d-4f0d-b3cc-4c740a66fd87"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "mean_sample_means_null <- mean(sampling_dist_means_null$mean_GPA)\n",
        "\n",
        "# Create the density plot for the sampling distribution and add the vertical line and annotation\n",
        "sampling_dist_means_null_plot <- ggplot(sampling_dist_means_null, aes(x = mean_GPA)) +\n",
        "  geom_density(fill = \"skyblue\", color = \"black\") +\n",
        "  geom_vline(xintercept = mean_sample_means_null, color = \"red\", linetype = \"dashed\") +\n",
        "  geom_text(aes(label = sprintf(\"Mean: %.2f\", mean_sample_means_null), x = mean_sample_means_null, y = 0.15), vjust = 1, color = \"red\") +\n",
        "  labs(title = \"Sampling Distribution of Sample Means (2022-23) Under Null Hypothesis\",\n",
        "       x = \"Sample Mean GPA (%)\", y = \"Density\")\n",
        "\n",
        "sampling_dist_means_null_plot"
      ],
      "id": "9a6c6925-29d2-4045-ba6f-e414240682b8"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The mean of our sampling distribution under null is 72.43% which is the\n",
        "(quite the) same as the hypothesized mean for the population under null.\n",
        "The Central Limit Theorem clearly holds in our case!\n",
        "\n",
        "> ***Think Critically:*** Researchers and statisticians *rarely* ever\n",
        "> get the chance to take *all possible* samples from the population.\n",
        "> Hence, we rely on classical inferential theory and the CLT assumption\n",
        "> to *hypothesize* what the sampling distribution will look like under\n",
        "> the null and alternate models. R is able to simulate and visualize the\n",
        "> process for us which helps us verify that the classical theory and\n",
        "> assumption we’re taught in classes are actually quite reliable!\n",
        "\n",
        "### 1.5: Calculating the `test-statistic` Under Null Hypothesis\n",
        "\n",
        "Suppose you were given two samples: **Sample A** shows an average GPA of\n",
        "78%, while **Sample B** boasts a higher average of 83%. At first glance,\n",
        "we might lean towards Sample B being more convincing evidence of an\n",
        "increase in the true mean GPA in 2022-23. But let’s consider the\n",
        "**standard deviations**.\n",
        "\n",
        "In **Sample A**, the standard deviation is only 1.2% which suggests the\n",
        "GPAs are huddled close to the average. **Sample B** wields a 7.4%\n",
        "standard deviation, implying the GPAs are more spread out, resembling a\n",
        "diverse set of students. Sample A’s small standard deviation supports\n",
        "the idea of a genuine increase as the GPAs are well-clustered around the\n",
        "average. In contrast, Sample B’s larger standard deviation introduces\n",
        "some doubt as the wide spread of GPAs within the sample doesn’t strongly\n",
        "back the claim of an increase in average GPA.\n",
        "\n",
        "The **test-statistic** incorporates the observed sample statistics and\n",
        "the sample size against the backdrop of assumptions implied by the null\n",
        "hypothesis, such as what the true population mean is. The result is *a\n",
        "numeric value* that we then use to calculate the **p-value** for the\n",
        "test.\n",
        "\n",
        "The formula for the test-statistic, assuming we don’t know the true\n",
        "population standard deviation ($\\sigma$), is as follows:\n",
        "\n",
        "$$\n",
        "\\frac{\\bar{x} - \\mu_0}{\\frac{{s}}{\\sqrt{n}}}\n",
        "$$\n",
        "\n",
        "where $s$ is the standard deviation of the sample, and $n$ is the sample\n",
        "size (50 in our case).\n",
        "\n",
        "Let’s now draw a sample randomly from the **true** 2022-23 population of\n",
        "GPAs, and *continue to assume* that this sample has been drawn from a\n",
        "population with mean equal to $\\mu_0 = 72.5$."
      ],
      "id": "2e968740-e615-437b-9998-2fc5160b58a9"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#RUN THE FOLLOWING CELL BEFORE CONTINUING:\n",
        "# Set the seed for reproducibility\n",
        "set.seed(42)\n",
        "\n",
        "# Alternate Distribution of 500 GPAs with mean 83.30% and standard deviation of 3.2%. \n",
        "gpa_dist_alt <- rnorm(n = 500, mean = 83.40, sd = 3.2)\n",
        "\n",
        "# Create a data frame with the population GPA data\n",
        "gpa_dist_alt <- data.frame(GPA = gpa_dist_alt)\n",
        "\n",
        "random_sample <- data.frame(GPA = sample(gpa_dist_alt$GPA, size = 50, replace = FALSE))\n",
        "\n",
        "head(random_sample, 10)"
      ],
      "id": "ff0bb2ba-babb-4995-a570-e4076c6e9e2a"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let’s now calculate and store the mean and standard deviation for this\n",
        "sample:"
      ],
      "id": "9f06f08d-220e-4c65-89ea-c8ee7b361d38"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "sample_mean <- mean(random_sample$GPA)\n",
        "sample_sd <- sd(random_sample$GPA)\n",
        "\n",
        "print(sample_mean)\n",
        "print(sample_sd)"
      ],
      "id": "343ac1a0-8004-4064-8f5e-005b446d4be4"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Calculating the test-statistic:"
      ],
      "id": "b4b46156-bf63-425b-927d-74c25c50a437"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# First, let's estimate the standard error using the sample S.D. \n",
        "standard_error <- sample_sd / sqrt(50)\n",
        "\n",
        "# Calculate the test-statistic\n",
        "test_stat <- (sample_mean - 72.5) / standard_error\n",
        "test_stat"
      ],
      "id": "2da0bfbd-7fc6-4ea5-a3dc-6e983696c1fa"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you might ask, how do we interpret this test-statistic? Does this\n",
        "provide enough evidence to reject or not reject the null hypothesis? We\n",
        "will introduce the concept of the **p-value** to answer these questions.\n",
        "\n",
        "### 1.6: The Null Distribution\n",
        "\n",
        "The null distribution describes how the test-statistic will be\n",
        "distributed **under the null hypothesis**.\n",
        "\n",
        "Consider individual 2022-23 GPAs distributed as:\n",
        "\n",
        "$$\n",
        "x \\sim N(\\mu_0, \\sigma)\n",
        "$$\n",
        "\n",
        "Our null hypothesis makes no assumptions about how the population is\n",
        "distributed or what its true variance or standard deviation is.\n",
        "\n",
        "The Central Limit Theorem, however, tells us that the sample mean\n",
        "($\\bar{X}$) will follow the normal distribution with a standard\n",
        "deviation that depends on the an estimator called $s$ or the sample\n",
        "standard deviation (obstained from any arbitary sample). The\n",
        "distribution of sample estimates will follow:\n",
        "\n",
        "$$\n",
        "\\bar{X} \\sim \\text{t}_{n-1}(\\mu_0, \\frac{s}{\\sqrt{n}})\n",
        "$$\n",
        "\n",
        "Here, $n$ is the sample size, $s$ the sample standard deviation and\n",
        "$n - 1$ denotes the degrees of freedom (a parameter which descibes the\n",
        "$t$-distribution).\n",
        "\n",
        "We can take this a bit further and describe how the test-statistic\n",
        "$\\frac{\\bar{x} - \\mu_0}{\\frac{s}{\\sqrt{n}}}$ is distributed under the\n",
        "null hypothesis:\n",
        "\n",
        "$$\n",
        "\\frac{\\bar{x} - \\mu_0}{\\frac{s}{\\sqrt{n}}} \\sim \\text{t}_{n-1}(0, 1)\n",
        "$$\n",
        "\n",
        "This is in face is the **null distribution**, ie. the distribution of\n",
        "the test-statistic under the null hypothesis.\n",
        "\n",
        "Let’s use R to visualize the distribution of test-statistics, ie. the\n",
        "$t_{n-1}$ distribution or “the null distribution”:"
      ],
      "id": "ca134570-c908-4f57-ba53-1eebaa4f9cb7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "null_dist <- sampling_dist_null %>% \n",
        "  group_by(sample_id) %>% \n",
        "  summarise(sample_mean = mean(GPA),\n",
        "            sample_sd = sd(GPA),\n",
        "            sample_standard_error = sample_sd / sqrt(50)) %>%\n",
        "  mutate(test_statistic = (sample_mean - 72.5)/ (sample_standard_error)) %>%\n",
        "  ggplot(aes(x = test_statistic)) +\n",
        "  geom_density(fill = \"lightblue\", color = \"darkgrey\") +\n",
        "  labs(x = \"Test-Statistic\", y = \"Density\", title = \"Null Distribution of Test-Statistics\") +\n",
        "  geom_vline(xintercept = test_stat, color = \"red\", type = \"dashed\")\n",
        "\n",
        "null_dist"
      ],
      "id": "1d6950e6-bf46-4613-ac6b-7cd63d08fc83"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> You’ll learn in further classes that the test-statistic is a\n",
        "> standardized version of the sample estimates. Hence, the\n",
        "> $t$-distribution is centered at a mean of 0 and has standard deviation\n",
        "> of 1.\n",
        "\n",
        "The red line indicates the test-statistic we had calculated in Part 1.5.\n",
        "We will calculate a probability of observing a test-statistic as extreme\n",
        "as the one indicated by the red line using the null distribution.\n",
        "\n",
        "### 1.6: Calculating the `p-value` From The Null Distribution\n",
        "\n",
        "Let’s now calculate the p-value for the test-statistic. For\n",
        "$t$-distribution, we need to know the degrees of freedom, which is\n",
        "simply equal to $n - 1$ in the case of single-sample one-tailed\n",
        "hypothesis testing."
      ],
      "id": "255f9e2a-f058-45d0-8640-19e4b26be068"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "df = 50 - 1 #degrees of freedom = sample size minus 1\n",
        "\n",
        "# Calculate the p-value using the t-distribution\n",
        "p_val <- 1 - pt(test_stat, df = df)\n",
        "\n",
        "print(paste(\"P-value:\", p_val))"
      ],
      "id": "8796aefa-6798-417d-b1a9-ce3310945b4d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> `pt(x, df = df)` is the probability of observing a test-statistic\n",
        "> equal to or **smaller** than x. We are interested in p-value which\n",
        "> denotes the probability of observing a test statistic equal to or\n",
        "> greater than x. Hence, the correct code for the p-value is\n",
        "> `1 - pt(x, df = df)`.\n",
        "\n",
        "> **P-values** explain the probability of observing a value of a\n",
        "> test-statistic as big as 23.7 under the assumption that the null model\n",
        "> for sample means holds. In other words, assuming the null model holds,\n",
        "> ie. the distribution of sample means in centered at $\\mu_0$, what is\n",
        "> the probability of observing a test-statistic (from a single sample)\n",
        "> of 23.7 or bigger?\n",
        "\n",
        "### 1.7: Rejecting or Not Rejecting the Null Model\n",
        "\n",
        "Let’s now compare our p-value to a **threshold** to decide whether to\n",
        "reject or not reject the null hypothesis.\n",
        "\n",
        "The threshold we commonly use for hypothesis tests are called $\\alpha$\n",
        "that are commonly set as: **0.10, 0.05, or 0.01**.\n",
        "\n",
        "In the following code cell, we will first visualize the **distribution\n",
        "of sample means under the null hypothesis** and mark where the\n",
        "percentile corresponding to $\\alpha = 0.05$ falls."
      ],
      "id": "213dfe21-39e5-4184-a971-d30f11f47bc8"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Calculate the alpha = 0.05 critical value under the null assumption\n",
        "critical_value_h0 <- quantile(sampling_dist_means_null$mean_GPA, probs = 1 - 0.05)\n",
        "\n",
        "sampling_dist_means_null_plot <- sampling_dist_means_null_plot +\n",
        "  # Annotate the mean of the sampling distribution under null\n",
        "  geom_vline(xintercept = mean_sample_means_null, color = \"purple\", linetype = \"dashed\") +\n",
        "  # Annotate where the quantile for alpha 0.05 falls under null\n",
        "  geom_vline(xintercept = critical_value_h0, color = \"red\", linetype = \"dashed\") +\n",
        "  # Annotate alpha = 0.05\n",
        "  annotate(\"text\", x = critical_value_h0 + 0.02, y = 0.9, label = \"alpha = 0.05\", color = \"red\") +\n",
        "  labs(title = \"Null Model Distribution of Sample Means (2022-23)\", \n",
        "       x = \"Sample Mean GPA (%)\",\n",
        "       y = \"Density\") +\n",
        "  theme_minimal()\n",
        "\n",
        "sampling_dist_means_null_plot"
      ],
      "id": "a345b8d7-a682-47bb-a28b-1f80520cfc60"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The purple line indicates the quantile to the left of which lie\n",
        "($1-\\alpha$)% of all the observations. Since we chose $\\alpha = 0.05$,\n",
        "95% of all the possible means should lie to the left of it.\n",
        "\n",
        "Whenever we observe a sample mean which falls to the **right** of the\n",
        "purple line, we say we have *enough evidence to reject the null\n",
        "hypothesis at $\\alpha = 0.05$*.\n",
        "\n",
        "We can also choose to simply verify if $\\alpha > p-value$ in order to\n",
        "decide whether to reject or not reject $H_0$ at that chosen level of\n",
        "alpha."
      ],
      "id": "bfe8295e-7fef-49f6-9485-8487021155ff"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#RUN THIS CELL:\n",
        "print(p_val < 0.05)\n",
        "print(p_val < 0.01)\n",
        "print(p_val < 0.1)"
      ],
      "id": "ec8cde79-f735-4e60-952c-0489c14392db"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Not only our p-value (0) is smaller than $\\alpha = 0.05$ but it’s\n",
        "smaller than all of the other commonly chosen levels. Hence we choose to\n",
        "reject the null hypothesis at each of those significance levels (ie.\n",
        "$\\alpha$’s)\n",
        "\n",
        "*Conclude and interpret the test as follows*:\n",
        "\n",
        "The p-value obtained from our single sample is smaller than\n",
        "$\\alpha = 0.05$. We thus have enough evidence to reject the null\n",
        "hypothesis which is equivalent to rejecting the hypothesis that\n",
        "$\\mu_1 = 72.5$.\n",
        "\n",
        "As one can infer using the plots displayed above, the sample mean we had\n",
        "obtained earlier seems to be *too unlikely* for us to not refute the\n",
        "assumption that the sample observations belong to the distribution of\n",
        "GPAs hypothesized under null (See the plot in Part 1.2).\n",
        "\n",
        "Observe that our reasoning is quite similar to a “*proof by\n",
        "contradiction*”, except that we use a chosen threshold such as\n",
        "$\\alpha = 0.05$ to decide if we have likely reached a contradiction. If\n",
        "the p-value is too small from a right-tailed hypothesis test, we have\n",
        "enough evidence to *informally conclude* that the true mean (ie. the\n",
        "center of both the population and sampling distribution must be to the\n",
        "right of previously hypothesized population mean).\n",
        "\n",
        "We can also use the null-distribution to visualize and understand *why*\n",
        "the null hypothesis was rejected:"
      ],
      "id": "e0f3785a-dc45-4019-aec2-007abc4538d8"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "null_dist <- null_dist +\n",
        "  geom_vline(xintercept = qt(0.95, df = 49), color = \"blue\", type = \"dashed\")\n",
        "\n",
        "null_dist"
      ],
      "id": "1d259f71-3487-4c9f-bf0b-6b83d75d6346"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We make the following observations: 1. The blue line indicates the\n",
        "$\\alpha = 0.05$ percentile for the null distribution. The area under the\n",
        "curve to the right of the blue marker is equal to 0.05, or we say, the\n",
        "probability of observing a test-statistic as extreme as the one marked\n",
        "by the blue line is 0.05.\n",
        "\n",
        "1.  The red line indicates the test-statistic we obtained from the\n",
        "    single sample we had drawn earlier. As you can see, this falls way\n",
        "    beyond to the right of the blue line.\n",
        "\n",
        "Both statements above conclude with the idea that the observed\n",
        "test-statistic is that unlikely under the null distribution that it has\n",
        "more chances of belonging to a different distribution which we call the\n",
        "alternative distribution.\n",
        "\n",
        "### 2. Conclusions\n",
        "\n",
        "### 2.1: Sanity Checks\n",
        "\n",
        "So did we make the correct decision by rejecting the null hypothesis?\n",
        "\n",
        "Let’s look at the *actual* distribution of 2022-23 GPAs that we used to\n",
        "draw our random sample:"
      ],
      "id": "13c78ef4-1adc-4bf3-b986-474c72d1c4a4"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#RUN THIS CELL:\n",
        "gpa_dist_alt_plot <- gpa_dist_alt %>% \n",
        "  ggplot(aes(x = GPA)) +\n",
        "  geom_density(fill = \"skyblue\", color = \"black\") +\n",
        "  labs(x = \"True GPAs (%) in 2022-23\", y = \"Density\") +\n",
        "  geom_vline(xintercept = mean(gpa_dist_alt$GPA), color = \"red\", type = \"dashed\") +\n",
        "  geom_text(aes(label = sprintf(\"True Population Mean: %.2f\", mean(gpa_dist_alt$GPA)), x = mean(gpa_dist_alt$GPA), y = 0.15), vjust = 1, color = \"red\")\n",
        "\n",
        "gpa_dist_alt_plot"
      ],
      "id": "f2ead23b-1ee4-4c5a-9822-cc88b27ee51c"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "SO we DID correctly conclude that the true mean GPA is not equal to\n",
        "72.5% by just observing one single sample! How impressive is that?\n",
        "\n",
        "Now, you might find it helpful to learn that we didn’t actually need to\n",
        "go all the way to Part 1.6 to make a decision about the null hypothesis.\n",
        "We could have instead followed either of the two methods:\n",
        "\n",
        "1.  Compute a 95% Confidence Interval for the True Population Mean GPA\n",
        "    (2022-23) using the random sample we had obtained, and then, verify\n",
        "    if the interval contains the hypothesized mean $\\mu_0 = 72.5$.\n",
        "\n",
        "2.  Calculate the probability of observing the sample mean (not the\n",
        "    test-statistic) under the distribution of sample means assumed under\n",
        "    null (ie. the one which is centered at $\\mu_0$ due to CLT), and see\n",
        "    if the probability (can call this p-val) is less than alpha.\n",
        "\n",
        "### 2.2: Remarks\n",
        "\n",
        "Suppose the organizers know the true **set in stone** average GPA for\n",
        "2022-23. If you conclude that the 2023 mean GPA is higher than 2022 mean\n",
        "GPA, when it actually is, you score 150 points. This is equivalent to\n",
        "saying, if you reject the null hypothesis when the null hypothesis in\n",
        "fact is not true, you score 150 points. However, if you reject the null\n",
        "hypothesis when the null hypothesis is in fact true, you lose 100\n",
        "points!\n",
        "\n",
        "When you reject the null hypothesis when it’s actually true, you are\n",
        "comitting a Type 1 error and the probability of comitting a Type 1 error\n",
        "is equal to the chosen significance level, ie. $\\alpha$ used for the\n",
        "test. This follows from the fact that $\\alpha = 0.05$ marks the point on\n",
        "the distributions (hypothesized under null) to the right of which lie 5%\n",
        "of the observations. Hence there is a 5% probability of comitting a Type\n",
        "1 error in this case.\n",
        "\n",
        "An **alternate distribution** describes how the test-statistic would be\n",
        "distributed under the alternative hypothesis. In our case, we found\n",
        "enough evidence supporting that the test-statistic did not belong to the\n",
        "null model. Instead, they belong to an alternative model where the\n",
        "population mean is different than the hypothesized value. Since our test\n",
        "was **right-tailed**, both of the distributions for the sample means and\n",
        "test-statistics must, according to our test’s conclusions, have centers\n",
        "(means) which lie to the right of the hypothesized mean $\\mu_0$.\n",
        "\n",
        "There’s also the two-tailed hypothesis test where the null hypothesis\n",
        "assumes that $\\mu_1 != \\mu_0$. Here, the alternate model would say that\n",
        "the true means (centers) for the distributions (eg. of null, sample\n",
        "means, or population itself) can lie either to the left or right of the\n",
        "$\\mu_0$."
      ],
      "id": "714054b4-3972-4e14-96cb-4bd49d306aa8"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}