{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 1.7 - Beginner - Simple Regression\n",
        "\n",
        "COMET Team <br> *Jonathan Graves, Jonah Heyl, Anneke Dresselhuis, Rathin\n",
        "Dharani, Devan Rawlings, Jasmine Arora*  \n",
        "2023-07-11\n",
        "\n",
        "## Outline\n",
        "\n",
        "### Prerequisites\n",
        "\n",
        "-   Introduction to Jupyter\n",
        "-   Introduction to Data\n",
        "-   Introduction to R\n",
        "-   Hypothesis testing\n",
        "\n",
        "### Outcomes\n",
        "\n",
        "-   Build a simple linear regression using R\n",
        "-   Create and interpret regression outputs in R including: coefficient\n",
        "    estimates\n",
        "-   Examine the various elements of regression objects in R (including\n",
        "    fitted values, residuals and coefficients)\n",
        "-   Explain the role of qualitative variables in regression analysis as\n",
        "    dummy variables"
      ],
      "id": "01534ac1-4503-461b-9d8d-3496c3ce4dae"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "library(tidyverse)\n",
        "library(haven)\n",
        "\n",
        "source(\"beginner_simple_regression_tests.r\")"
      ],
      "id": "a526235d-a0fc-4ea3-b30c-cf55f188724e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Part 1: Learning About Regressions\n",
        "\n",
        "-   What is a regression?\n",
        "\n",
        "-   What is the relationship of a regression to other statistical\n",
        "    concepts?\n",
        "\n",
        "-   How do we use regressions to answer economic questions?\n",
        "\n",
        "In this notebook, we will explore these questions using our census data\n",
        "set and will learn more about the immigrant wage gap. If you remember\n",
        "from last lecture, we were interested in the relationship between\n",
        "`wages` and `immstat` - wages and immigration status. However we can\n",
        "also use to the variable, `mrkinc` (market income) to measure income.\n",
        "Let’s explore the relationship between `wages` and `mrkinc` in this data\n",
        "set.\n",
        "\n",
        "Let’s start off with a visualization:"
      ],
      "id": "d106514a-c0c1-44c7-8355-3ba4c8e5a885"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "census_data <- read_dta(\"../datasets_beginner/01_census2016.dta\")\n",
        "\n",
        "census_data <- as_factor(census_data)\n",
        "\n",
        "census_data <- filter(census_data, !is.na(census_data$wages)) #Removing the rows in which wages = NA\n",
        "census_data <- filter(census_data, !is.na(census_data$mrkinc)) #Removing the rows in which mrkinc = NA\n",
        "\n",
        "glimpse(census_data)"
      ],
      "id": "b4389ee6-1284-464d-9a2a-c11d283fa278"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "options(repr.plot.width=6,repr.plot.height=4) #controls the image size\n",
        "\n",
        "f <- ggplot(data = census_data, aes(x = wages, y = mrkinc)) + xlab(\"Wages\") + ylab(\"Market Income\")\n",
        "f + geom_point()"
      ],
      "id": "9b36730b-4184-4169-970c-beaeacf1c0f1"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> *Think Deeper*: What do you see here? Is there anything about this\n",
        "> relationship that sticks out to you? Why does it have the shape it\n",
        "> does?\n",
        "\n",
        "While you can probably tell that there is some relationship between\n",
        "wages and market income, it can be difficult to visualize using a\n",
        "scatterplot alone since there are far too many points.\n",
        "\n",
        "## Conditional Expectation:\n",
        "\n",
        "The expectation of $X$ is what outcome we expect $X$ to *typically* be\n",
        "after *a lot* of sampling. We calculate the predicted value by\n",
        "multiplying the different values $X$ can take by the various\n",
        "probabilities of $X$ taking on that value. For instance, the expectation\n",
        "of a die throw is 3. Essentially, $$\n",
        "E[X] =  \\sum_{i=1}^n P(X_i=x) X_i\n",
        "$$\n",
        "\n",
        "You can think of conditional expectation as the expected value based on\n",
        "some condition: $$\n",
        "E[X|y_i=y] =  \\sum_{i=1}^n P(X_i=x|y_i=y) X_i\n",
        "$$\n",
        "\n",
        "-   The conditional expectation: **the expectation of a random variable\n",
        "    X, conditional on the value taken by another random variable Y** .\n",
        "    If the value of Y affects the value of X (i.e. X and Y are\n",
        "    dependent), the conditional expectation of X given the value of Y\n",
        "    will be different from the overall expectation of X.\n",
        "\n",
        "-   In other words, we use conditional expectation when we predict that\n",
        "    there is a relationship between a predictor variable and the\n",
        "    response variable, such that we want our predictions to be made in\n",
        "    the context of a specific value of the predictor(s).\n",
        "\n",
        "-   The shape of the conditional expectation function indicates the\n",
        "    relationship between the two variables we are interested in. For\n",
        "    example, the conditional expectation of a dice roll *given* that the\n",
        "    number is even, is 4.\n",
        "\n",
        "Linear regression assumes a linear conditional expectation function\n",
        "which means that the conditional expectation function can be described\n",
        "by a straight line:\n",
        "\n",
        "$E[Y|X=x]= \\beta_0 +\\beta_1X$\n",
        "\n",
        "We can split a regression model into two parts:\n",
        "\n",
        "1\\. The conditional expectation\n",
        "\n",
        "1.  An error term\n",
        "\n",
        "To be clear, let’s look at an example of this linear conditional\n",
        "expectation function, where our $Y$ (outcome variable) is wages and our\n",
        "$X$ (explanatory variable) is years of education.\n",
        "\n",
        "The conditional expectation function would\n",
        "be:$E[WAGES|YEARS=years_i] = \\beta_0 +\\beta_1X$.\n",
        "\n",
        "This means that the *given* a particular value of years of education\n",
        "($years_i$) for an individual $i$, the wages of that individual will\n",
        "follow the linear regression form $\\beta_0 +\\beta_1X$.\n",
        "\n",
        "## Regression Models\n",
        "\n",
        "A regression model specifies (the *specification*) the relationship\n",
        "between two variables. For example, a linear relationship would be:\n",
        "\n",
        "$$\n",
        "M_i = \\beta_0 + \\beta_1 W_i\n",
        "$$\n",
        "\n",
        "-   $M_i$, the market income of individual $i$ is our outcome variable.\n",
        "\n",
        "-   $W_i$, their wage is our explanatory variable.\n",
        "\n",
        "-   In econometrics, we use the terms **outcome** variable and\n",
        "    **explanatory** variable rather than the dependent and *independent*\n",
        "    variable respectively.\n",
        "\n",
        "A model like this describes the relationship between the variables - but\n",
        "it also depends on two unknowns: $\\beta_0$, $\\beta_1$.\n",
        "\n",
        "-   The $\\beta_0$ and $\\beta_1$ are **parameters** of the model: they\n",
        "    are numbers that determine the relationship (intercept and slope)\n",
        "    between $M_i$ and $W_i$\n",
        "-   This is a linear relationship as indicated by the linear\n",
        "    coefficients. It is also linear in the variables, but that isn’t\n",
        "    required (we will explore this later).\n",
        "\n",
        "It is highly unlikely that $M_i = \\beta_0 + \\beta_1 W_i$) can explain\n",
        "everything about our data. We also need to include the *residual* term\n",
        "(meaning “leftover”).\n",
        "\n",
        "-   The $\\epsilon_i$ is the residual: a component that corresponds to\n",
        "    the part of the data which is *not* described by the model\n",
        "-   These residual terms will usually have certain assumed properties\n",
        "    that allow us to estimate the model.\n",
        "\n",
        "We can think about a regression as two parts:\n",
        "\n",
        "1.  The part of the relationship explained by our model\n",
        "    ($M_i = \\beta_0 + \\beta_1 W_i$)\n",
        "2.  The part which is not explained ($\\epsilon_i$). The process of\n",
        "    “fitting” or estimating a regression model selects certain values\n",
        "    for $\\beta_0$ and $\\beta_1$ such that we minimize the amount that\n",
        "    needs to be explained by the residual term. We write the complete\n",
        "    *regression equation* by combining the two parts of the model:\n",
        "\n",
        "$$\n",
        "M_i = \\beta_0 + \\beta_1 W_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "The goal of regression analysis is to:\n",
        "\n",
        "1.  Accurately estimate this equation (and especially the model\n",
        "    parameters)\n",
        "2.  Learn about the relationship between $M_i$ and $W_i$ from the\n",
        "    results of that estimation.\n",
        "\n",
        "While there are several ways we can define “accurately as possible” and\n",
        "“estimate” the equation. In this course, we use *ordinary least squares*\n",
        "(OLS):\n",
        "\n",
        "$$\n",
        "(\\hat{\\beta_0},\\hat{\\beta_1}) = \\arg \\min_{b_0,b_1} \\sum_{i=1}^{n} (M_i - b_0 - b_1 W_i)^2 = \\sum_{i=1}^{n} (e_i)^2\n",
        "$$\n",
        "\n",
        "While this may look complicated, it is just the calculus way of writing\n",
        "“choose $\\beta_0$ and $\\beta_1$ ( $\\hat{\\beta_0},\\hat{\\beta_1}$) such\n",
        "that they minimize the sum of the squared residuals”.\n",
        "\n",
        "-   A regression should aim to have the bulk of the results explained by\n",
        "    the parameters ($\\beta_0, \\beta_1$) and as little as possible using\n",
        "    $\\epsilon_i$.\n",
        "-   Our statistical problem has now transformed into a calculus problem,\n",
        "    which you could solve for instance by taking derivatives.\n",
        "\n",
        "There are numerous ways to solve this estimation problem - either\n",
        "through R or Math. Let’s start by drawing a best fit line through points\n",
        "with a linear regression in R.\n",
        "\n",
        "## Example: Manual Estimation\n",
        "\n",
        "A bad way to solve this is the good-’ole eyeball method by observing the\n",
        "scatter plot and guessing how some values may perform.\n",
        "\n",
        "Try to get the best fit you can by playing around with the following\n",
        "example."
      ],
      "id": "2889a064-e490-4ab6-8e50-afc0e2d2ee8a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# set the value of B_0 and B_1 with these values\n",
        "\n",
        "B_0 = 0  #change me\n",
        "B_1 = 1  #change me\n",
        "\n",
        "# don't touch the rest of this code - but see if you can understand it!\n",
        "\n",
        "SSE = sum((census_data$mrkinc - B_0 - B_1*census_data$wages)^2)\n",
        "\n",
        "# here is the SSE from your model\n",
        "\n",
        "round(SSE/1000000,0)"
      ],
      "id": "8995f87c-0d29-4729-b785-cfcb15e23139"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "What was the lowest value you got? Here is what your guess looks like in\n",
        "a graph:"
      ],
      "id": "50df5610-4cca-4fec-9062-63d4bf915222"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# just run this cell to see your results\n",
        "# re-run it if you change the values\n",
        "\n",
        "options(repr.plot.width=6,repr.plot.height=4) #controls the image size\n",
        "\n",
        "fitted_line = data.frame(wages = census_data$wages, mrkinc = B_0 + B_1*census_data$wages)\n",
        "\n",
        "f <- ggplot(data = census_data, aes(x = wages, y = mrkinc)) + xlab(\"Wages\") + ylab(\"Market Income\")\n",
        "f <- f + geom_point() + geom_line(color = \"red\", data = fitted_line)\n",
        "\n",
        "f"
      ],
      "id": "b0ff9f63-53e3-42fa-ae81-1e6331ea026e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Interactive Visualization of OLS\n",
        "\n",
        "Understanding OLS is fundamental to understanding regressions and other\n",
        "opics in econometrics. Let’s try and understand the formula for OLS\n",
        "above through a more visual approach:.\n",
        "\n",
        "$$\n",
        "(\\hat{\\beta_0},\\hat{\\beta_1}) = \\arg \\min_{b_0,b_1} \\sum_{i=1}^{n} (M_i - b_0 - b_1 W_i)^2 = \\sum_{i=1}^{n} (e_i)^2\n",
        "$$\n",
        "\n",
        "To demonstrate this, we will use a small scatter plot with just 4\n",
        "points.\n",
        "\n",
        "-   The straight line through the scatter plot is modelled by the simple\n",
        "    regression formula $B_0 + B_1X$.\n",
        "\n",
        "-   Since it’s nearly impossible for a regression to perfectly predict\n",
        "    the relationship between two variables, we will almost always\n",
        "    include an **unobservable error** $e_i$ with our regression\n",
        "    estimation. This is the vertical distance between the regression\n",
        "    line and the actual data points\n",
        "\n",
        "-   Hence each of the points can be modelled by the equation\n",
        "    $Y_i = B_0 + B_1X + e_i$.\n",
        "\n",
        "-   Instead of minimizing the error terms, we will try to minimize the\n",
        "    squared errors which are represented by the size of those red boxes.\n",
        "\n",
 Try your own">
        "> Try your own values for `beta_0` and `beta_1`. Make sure to try\n",
        "> values only roughly within the specified range. The values of\n",
        "> `beta_0` and `beta_1` that minimize the residual sum of squares are\n",
        "> 0.65 and 0.82 respectively. The code block below also displays the\n",
        "> area of the red boxes; deviation from these optimal values will\n",
        "> increase the area of the red boxes."
      ],
      "id": "48c5b2e4-7df7-4899-a0c8-a043f9f60136"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "beta_0 <- 0.65 #CHANGE THIS VALUE, TRY VALUES BETWEEN 0 - 1\n",
        "beta_1 <- 0.82 #CHANGE THIS VALUE, TRY VALUES BETWEEN 0.6 - 1.4\n",
        "\n",
        "x <- c(1, 2, 3, 4)\n",
        "y <- c(1.7, 1.5, 4, 3.6)\n",
        "\n",
        "# don't worry about this code, just run it!\n",
        "dta <- data.frame(x, y)\n",
        "example_df_graph <- dta %>%\n",
        "                    ggplot(aes(x = x, y = y)) +\n",
        "                    geom_point() +\n",
        "                    geom_abline(intercept = beta_0, slope = beta_1) +\n",
        "                    xlim(0, 5) +\n",
        "                    ylim(0, 5) +\n",
        "                    geom_rect(aes(xmin = (dta[1, \"x\"] + (beta_0 + (beta_1 * dta[1, \"x\"])) - dta[1, \"y\"]), xmax = dta[1, \"x\"], \n",
        "                                  ymin = (beta_0 + (beta_1 * dta[1, \"x\"])), ymax = dta[1, \"y\"]),\n",
        "                            alpha = 0.1,\n",
        "                            fill = \"red\") +\n",
        "                    geom_rect(aes(xmin = dta[2, \"x\"], xmax = (dta[2, \"x\"] + (beta_0 + (beta_1 * dta[2, \"x\"])) - dta[2, \"y\"]), \n",
        "                                  ymin = dta[2, \"y\"], ymax = (beta_0 + (beta_1 * dta[2, \"x\"]))), \n",
        "                            alpha = 0.1, \n",
        "                            fill = \"red\") +\n",
        "                    geom_rect(aes(xmin = (dta[3, \"x\"] + (beta_0 + (beta_1 * dta[3, \"x\"])) - dta[3, \"y\"]), xmax = dta[3, \"x\"], \n",
        "                                  ymin = (beta_0 + (beta_1 * dta[3, \"x\"])), ymax = dta[3, \"y\"]), \n",
        "                            alpha = 0.1, \n",
        "                            fill = \"red\") +\n",
        "                    geom_rect(aes(xmin = dta[4, \"x\"], xmax = (dta[4, \"x\"] + (beta_0 + (beta_1 * dta[4, \"x\"])) - dta[4, \"y\"]), \n",
        "                                  ymin = dta[4, \"y\"], ymax = (beta_0 + (beta_1 * dta[4, \"x\"]))), \n",
        "                            alpha = 0.1, \n",
        "                            fill = \"red\")\n",
        "example_df_graph\n",
        "\n",
        "area_1 <- ((dta[1, \"x\"] - (dta[1, \"x\"] + (beta_0 + (beta_1 * dta[1, \"x\"])) - dta[1, \"y\"])) * \n",
        "        ((beta_0 + (beta_1 * dta[2, \"x\"])) - dta[2, \"y\"]))\n",
        "area_2 <- ((dta[2, \"x\"] + (beta_0 + (beta_1 * dta[2, \"x\"])) - dta[2, \"y\"]) - dta[2, \"x\"]) * \n",
        "          ((beta_0 + (beta_1 * dta[2, \"x\"])) - dta[2, \"y\"])\n",
        "area_3 <- (dta[3, \"x\"] - (dta[3, \"x\"] + (beta_0 + (beta_1 * dta[3, \"x\"])) - dta[3, \"y\"])) * \n",
        "          (dta[3, \"y\"]) - (beta_0 + (beta_1 * dta[3, \"x\"]))\n",
        "area_4 <- ((dta[4, \"x\"] + (beta_0 + (beta_1 * dta[4, \"x\"])) - dta[4, \"y\"]) - dta[4, \"x\"]) * \n",
        "          ((beta_0 + (beta_1 * dta[4, \"x\"])) - dta[4, \"y\"])\n",
        "\n",
        "area <- area_1 + area_2 + area_3 + area_4\n",
        "print(\"Area of red boxes is: \")\n",
        "area"
      ],
      "id": "a4683ea5-c794-44b7-9f1a-58327a7613f6"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Simple Regressions in R\n",
        "\n",
        "Now, let’s see how we could use a regression in R to do this.\n",
        "\n",
        "-   Regression models look like: `Y ~ X` where `Y` is regressed on `X`\n",
        "    and the `~` symbol is called “tilde”.\n",
        "\n",
        "We can ignore the residual terms and parameters when writing the model\n",
        "in R and just focus on the variables for now.\n",
        "\n",
        "So, for example, our regression model is\n",
        "\n",
        "$$\n",
        "M_i = \\beta_0 + \\beta_1 W_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "Which can be written in R as\n",
        "\n",
        "`mrkinc ~ wages`\n",
        "\n",
        "Regressions are estimated in R using the `lm` function, which takes the\n",
        "data as an argument.\n",
        "\n",
        "-   This creates a *linear model* object, which can be used to calculate\n",
        "    things (using prediction) or perform tests\n",
        "-   It also stores all of the information about the model, such as the\n",
        "    coefficient and fit\n",
        "-   These models can also be printed and summarized to give important\n",
        "    basic information about a regression\n",
        "\n",
        "For an example linear model called `my_model`, some of the most\n",
        "important elements are:\n",
        "\n",
        "-   `my_model$coefficients`: the parameter coefficients\n",
        "-   `my_model$residuals`: the residuals\n",
        "-   `my_model$fitted.values`: the predicted values\n",
        "\n",
        "Let’s see our model in action here."
      ],
      "id": "740cbbe2-33d9-43a1-aa40-dd84469a9e76"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "regression1 = lm(mrkinc ~ wages, data = census_data)\n",
        "\n",
        "summary(regression1)\n",
        "\n",
        "head(regression1$coefficients)"
      ],
      "id": "ce14c149-104f-4280-8c0c-ad679e605a1b"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Take a close look at the results. Identify the following elements:\n",
        "\n",
        "-   The values of the parameters\n",
        "-   The standard errors of the parameters\n",
        "-   The %-of the data explained by the model (R-sqaured)\n",
        "\n",
        "> **Test Your Knowledge**: What % of the data is explained by the model?\n",
        "> Answer to 2 decimal places."
      ],
      "id": "d20b11ba-6a91-414e-b276-89268972ab41"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Hint: Convert the Multiple-R squared value into a percentage \n",
        "ans1 <- ...  # answer goes here\n",
        "\n",
        "test_1()"
      ],
      "id": "e79d2411-9907-44f2-a5c0-828ce11c9855"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The underlying model and the parameters tell us about the relationship\n",
        "between the different values:\n",
        "\n",
        "$$\n",
        "M_i = -2455 + 1.61 W_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "Notice, for example:\n",
        "\n",
        "$$\n",
        "\\frac{\\partial M_i}{\\partial W_i} = \\beta_1 = 1.61\n",
        "$$\n",
        "\n",
        "In other words, when wages go up by 1 dollar, we would expect that\n",
        "market income will rise by 1.61 dollars. This kind of analysis is key to\n",
        "*interpreting* what this model is telling us.\n",
        "\n",
        "Finally, let’s visualize our fitted model on the scatterplot from\n",
        "before. How does it compare to your original model?"
      ],
      "id": "a6fd29e2-20f6-4b75-a748-609c38067db7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "fitted_line2 = data.frame(wages = census_data$wages, mrkinc = regression1$fitted.values)\n",
        "\n",
        "f <- ggplot(data = census_data, aes(x = wages, y = mrkinc)) + xlab(\"Wages\") + ylab(\"Market Income\")\n",
        "f <- f + geom_point() + geom_line(color = \"red\", data = fitted_line) + geom_line(color = \"blue\", data = fitted_line2)\n",
        "\n",
        "f"
      ],
      "id": "5e78229c-df62-4c6c-b25f-13de691c8a71"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see - there’s a very close relationship between `mrkinc` and\n",
        "`wages`. This implies that we can focus our attention on wages in our\n",
        "analysis of the immigrant wage gap.\n",
        "\n",
        "# Part 2: Simple Regressions and $t$-Tests\n",
        "\n",
        "Previously, we looked at the relationship between market income and\n",
        "wages. However, these are both *quantitative* variables. However, what\n",
        "if we wanted to work with a *qualitative* variable like `immstat`?\n",
        "\n",
        "Thankfully regression models can still incorporate this kind of variable\n",
        "as this is the most common type of variable in real-world data. How is\n",
        "this possible?\n",
        "\n",
        "Let’s start out with the simplest kind of qualitative variable: a\n",
        "**dummy** (0 or 1) variable. Consider the regression equation:\n",
        "\n",
        "$$\n",
        "W_i = \\beta_0 + \\beta_1 I_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "The conditional expectation when I_i is 0 and when it is 1 is :\n",
        "\n",
        "$$\n",
        "E[W_i|I_i = 1] = \\beta_0 + \\beta_1 \\cdot 1 + E[\\epsilon_i|I_i = 1]\n",
        "$$\n",
        "\n",
        "$$\n",
        "E[W_i|I_i = 0] = \\beta_0 + \\beta_1 \\cdot 0 + E[\\epsilon_i|I_i = 0]\n",
        "$$\n",
        "\n",
        "Under Assumption 1, we have that $E[\\epsilon_i|I_i] = 0$, so:\n",
        "\n",
        "$$\n",
        "E[W_i|I_i = 1] = \\beta_0 + \\beta_1\n",
        "$$\n",
        "\n",
        "$$\n",
        "E[W_i|I_i = 0] = \\beta_0\n",
        "$$\n",
        "\n",
        "Combining these two expressions:\n",
        "\n",
        "$$\n",
        "\\beta_1 = E[W_i|I_i = 1] - E[W_i|I_i = 0]\n",
        "$$\n",
        "\n",
        "What this tells us:\n",
        "\n",
        "1.  You can include **dummy** variables in regressions\n",
        "2.  The coefficients of the dummy variable have meaning in terms of the\n",
        "    regression model\n",
        "3.  They measure the (average) difference in the dependent variable\n",
        "    between the two levels of the dummy variable\n",
        "\n",
        "Therefore dummy variables can be included in a regression model just\n",
        "like quantitative variables.\n",
        "\n",
        "Let’s look at this in terms of `immstat`. We can create our regression\n",
        "equation as:\n",
        "\n",
        "$$\n",
        "W_i = \\beta_0 + \\beta_1 I_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "Then we can estimate this using R."
      ],
      "id": "f8359ee2-fed0-4081-871e-ad1b46415bca"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "regression2 <- lm(wages ~ immstat, data = census_data)\n",
        "\n",
        "summary(regression2)"
      ],
      "id": "c6176ed6-2fa0-42f8-93e4-6ed41e29754d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "What do you see here?\n",
        "\n",
        "> **Test Your Knowledge**: What is the difference in average wage\n",
        "> between immigrants and non-immigrants?"
      ],
      "id": "61229fd0-1937-4a5b-9338-743b53dc4ade"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# input the answer (to 1 decimal place)\n",
        "answer2 <- ...  \n",
        "\n",
        "test_2()"
      ],
      "id": "ecea6e6d-09ba-4b3a-a67e-a7e33a843618"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The number about might seem familiar, if you remember what we learned\n",
        "about a $t$-test from earlier. Remember this result?"
      ],
      "id": "c8081614-a993-4cf2-9632-620e93fda1c0"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "t1 <- t.test(x = filter(census_data, immstat == \"immigrants\")$wages,\n",
        "       y = filter(census_data, immstat == \"non-immigrants\")$wages,\n",
        "       alternative = \"two.sided\",\n",
        "       mu = 0,\n",
        "       conf.level = 0.95)\n",
        "t1\n",
        "\n",
        "t1$estimate[1] - t1$estimate[2]"
      ],
      "id": "840724fd-7a6d-4338-a94d-932f1ce1eba9"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Look closely at the results. What is the relationship here?\n",
        "\n",
        "Regression exemplifies the comparison that two sample variables make\n",
        "when the explanatory variable is a dummy. Recall:\n",
        "\n",
        "$$\n",
        "\\beta_1 = E[W_i|I_i = 1] - E[W_i|I_i = 0]\n",
        "$$\n",
        "\n",
        "The regression coefficient of $\\beta_1$ here is a comparison of two\n",
        "means. This is the same as how a $t$-test compares two means by\n",
        "different groups - groups which are specified by $I_i = 0$ or $I_i = 1$.\n",
        "\n",
        "-   In other words, another way of thinking about a regression is like a\n",
        "    form of a comparison of means test.\n",
        "-   It can handle the same kind of analysis (i.e. with dummies), but can\n",
        "    also include quantitative variables - which regular comparison of\n",
        "    means tests cannot handle.\n",
        "\n",
        "# Part 3: Exercises\n",
        "\n",
        "## Activity 1\n",
        "\n",
        "In this activity, we’ll explore how the immigrant wage gap could depend\n",
        "on sex (male vs. female). We can now examine this issue directly using\n",
        "regressions.\n",
        "\n",
        "Estimate the immigrant wage gap for males and for females using\n",
        "regressions.\n",
        "\n",
        "<em>Tested objects:</em> `regm` (the regression for males), `regf` (the\n",
        "regression for females)."
      ],
      "id": "357408c6-1f0a-4521-a7de-59a390504497"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Activity 1\n",
        "\n",
        "# Regression for males\n",
        "regm <- lm(... ~ ..., data = filter(census_data, ... == ...)) #what should replace the ...\n",
        "#Hint: Don't forget the quotation marks when specifying the subset \n",
        "\n",
        "# Regression for females\n",
        "regf <-  ... # what should replace the ...\n",
        "\n",
        "summary(regm) # Allow us to view regm's coefficient estimates\n",
        "summary(regf) # Same as above, but for regf\n",
        "\n",
        "test_3() # Quiz1\n",
        "test_4() # Quiz2"
      ],
      "id": "b1cd8357-cd10-45f6-89b9-974e6ebdeab0"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Short Answer 1\n",
        "\n",
        "**Prompt:** How do we interpret the coefficient (Intercept) estimate on\n",
        "`immstat` in each of these regressions?\n",
        "\n",
        "**A** The average wage of a non-immigrant  \n",
        "**B** The average wage of an immigrant  \n",
        "**C** The difference between the average wage of an immigrant and\n",
        "non-immigrant  \n",
        "**D** Nothing we should worry about"
      ],
      "id": "a8c31491-19ec-4550-a2a7-6c7d5bcbd17a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Enter your answer below as \"A\", \"B\", \"C\", or \"D\"\n",
        "\n",
        "answer20 <- \"...\"\n",
        "test_20(answer20)"
      ],
      "id": "24561832-9c33-47b0-9b41-7a398f3839d6"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Short Answer 2\n",
        "\n",
        "**Prompt:** Compare the gaps. Is the immigrant wage gap larger for males\n",
        "or females? Why do you think that might that be?\n",
        "\n",
        "**A** The immigrant pay gap for females is much greater than that of\n",
        "males  \n",
        "**B** The immigrant pay gap for males is much greater than that of\n",
        "females  \n",
        "**C** The immigrant pay gap is roughly the same"
      ],
      "id": "38cec4c8-632f-4c4a-91f7-de4867243212"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Enter your answer below as \"A\", \"B\", or \"C\"\n",
        "\n",
        "answer21 <- \"\"\n",
        "test_21(answer21)"
      ],
      "id": "cf20bb41-936a-43d1-969f-9f98dfb3fd95"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Activity 2\n",
        "\n",
        "Many studies have suggested that workers’ wages increase as they age. In\n",
        "this activity, we will explore how the immigrant wage gap varies by age.\n",
        "First, let’s see the factor levels of `agegrp`:"
      ],
      "id": "cc55c855-2575-421b-a4df-863303e0326f"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "levels(census_data$agegrp) # Run this!"
      ],
      "id": "dcbe95ea-57da-435e-9b7c-5b43b86b7006"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we can see, there are several age groups in this dataframe, including\n",
        "ones that would not be particularly informative (have you ever seen a\n",
        "3-year-old doing salary work?). Let’s estimate the immigrant wage gap\n",
        "(with no controls) for five of these groups separately:\n",
        "\n",
        "-   20 to 24 years\n",
        "-   30 to 34 years\n",
        "-   40 to 44 years\n",
        "-   50 to 54 years\n",
        "-   60 to 64 years\n",
        "\n",
        "<em>Tested objects:</em> `reg5_20` (20 to 24 years), `reg5_50` (50 to 54\n",
        "years)"
      ],
      "id": "d58fe973-6825-4f74-8eb4-2ff75bb00bf6"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "reg5_20 <- lm(wages ~ immstat, data = filter(census_data, agegrp == '20 to 24 years')) \n",
        "\n",
        "reg5_30 <- ... # what should go here? Use the code above as a template\n",
        "\n",
        "reg5_40 <- ...\n",
        "\n",
        "reg5_50 <- ...\n",
        "\n",
        "reg5_60 <- ... \n",
        "\n",
        "# store the summaries (but don't show them!  too many!)\n",
        "sum20 <- summary(reg5_20)\n",
        "sum30 <- summary(reg5_30)\n",
        "sum40 <- summary(reg5_40)\n",
        "sum50 <- summary(reg5_50)\n",
        "sum60 <- summary(reg5_60)\n",
        "\n",
        "test_12() # Quiz3\n",
        "test_16() # Quiz4"
      ],
      "id": "b7d366a3-2f73-4ab3-b319-0474a164f544"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The code below will tabulate a brief summary of each regression:"
      ],
      "id": "09f85ba6-b347-4df5-bdad-0a35599b2e1e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Just run me!  You don't need to edit this\n",
        "\n",
        "Age_Group <- c(\"20-24\", \"30-34\", \"40-44\", \"50-54\", \"60-64\")\n",
        "Wage_Gap <- c(reg5_20$coefficients[2], reg5_30$coefficients[2], reg5_40$coefficients[2], reg5_50$coefficients[2], reg5_60$coefficients[2])\n",
        "Std._Error <- c(sum20$coefficients[2,2], sum30$coefficients[2,2], sum40$coefficients[2,2], sum50$coefficients[2,2], sum60$coefficients[2,2])\n",
        "t_Value <- c(sum20$coefficients[2,3], sum30$coefficients[2,3], sum40$coefficients[2,3], sum50$coefficients[2,3], sum60$coefficients[2,3])\n",
        "p_Value <- c(sum20$coefficients[2,4], sum30$coefficients[2,4], sum40$coefficients[2,4], sum50$coefficients[2,4], sum60$coefficients[2,4])\n",
        "\n",
        "tibble(Age_Group, Wage_Gap, Std._Error, t_Value, p_Value) # it's like a table but a tibble"
      ],
      "id": "31ac2c44-41ed-411d-b591-7ba8fa6f2c4f"
    },
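    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The cell above pulls individual numbers out of each summary by position.\n",
        "As a hedged illustration of what those positions mean, the sketch below\n",
        "uses R’s built-in `mtcars` data (not the census data) and the\n",
        "hypothetical object names `fit` and `coefs`. The coefficient matrix\n",
        "returned by `summary()` has one row per coefficient and four columns, so\n",
        "`[2, 2]` is the standard error of the second coefficient.\n",
        "\n",
        "```r\n",
        "# Built-in example data; `fit` and `coefs` are hypothetical names for illustration only\n",
        "fit <- lm(mpg ~ am, data = mtcars)\n",
        "coefs <- summary(fit)$coefficients\n",
        "\n",
        "colnames(coefs)            # \"Estimate\" \"Std. Error\" \"t value\" \"Pr(>|t|)\"\n",
        "coefs[2, 2]                # standard error of the slope, selected by position\n",
        "coefs[\"am\", \"Std. Error\"]  # the same number, selected by name\n",
        "```\n",
        "\n",
        "Selecting by name is longer to type but is less likely to silently pick\n",
        "the wrong cell if the order of the regressors changes."
      ],
      "id": "4e7a91c3-2d58-4b0f-9a6c-1f3b8d5e7a20"
    },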
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Short Answer 3\n",
        "\n",
        "**Prompt**: What happens to the immigrant wage gap as we move across age\n",
        "groups? What do you think might explain these changes?\n",
        "\n",
        "**A** Wage gap declines as age group decreases  \n",
        "**B** Wage gap increases as age group increases  \n",
        "**C** Wage gap is the highest at the “40 to 44 years” age group"
      ],
      "id": "f22b6a1d-03e1-496c-952a-158d3eac7911"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Enter your answer below as \"A\", \"B\", or \"C\"\n",
        "\n",
        "answer22 <- \"\"\n",
        "test_22(answer22)"
      ],
      "id": "caf3cf96-2fcb-4d60-b15b-22323d9b95e3"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Activity 3\n",
        "\n",
        "As we observed in last week’s worksheet, the immigrant wage gap could\n",
        "differ by education level. As there are many education categories, it\n",
        "may be tedious to run a regression for each individual education level.\n",
        "\n",
        "Instead, we could run a single regression and add education level as a\n",
        "second regressor, $E_i$:\n",
        "\n",
        "$$\n",
        "W_i = \\beta_0 + \\beta_1 I_i + \\beta_2 E_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "This is actually a **multiple regression**, which we will learn about\n",
        "later - but from the point of view of this lesson, the idea is that it is\n",
        "“run” in R essentially in the same way as a simple regression. Estimate\n",
        "the regression model above without $E_i$, then re-estimate the model\n",
        "with $E_i$ added.\n",
        "\n",
        "<em>Tested objects:</em> `reg2A` (regression without controls), `reg2B`\n",
        "(regression with controls)."
      ],
      "id": "2a743bd2-3ad7-4c82-a046-62216dab9793"
    },
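    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a minimal sketch of how a second regressor is added in R, the example\n",
        "below uses simulated data rather than the census data. The names `toy`,\n",
        "`y`, `group`, and `level` are hypothetical and chosen only for\n",
        "illustration; the point is simply that the extra regressor is appended to\n",
        "the formula with `+`.\n",
        "\n",
        "```r\n",
        "# Simulated data: every name here is hypothetical, not a census variable\n",
        "set.seed(1)\n",
        "n <- 200\n",
        "toy <- data.frame(\n",
        "  group = factor(sample(c(\"a\", \"b\"), n, replace = TRUE)),\n",
        "  level = factor(sample(c(\"low\", \"mid\", \"high\"), n, replace = TRUE))\n",
        ")\n",
        "toy$y <- 10 + 2 * (toy$group == \"b\") + 3 * (toy$level == \"high\") + rnorm(n)\n",
        "\n",
        "# Simple regression: one regressor\n",
        "fit_simple <- lm(y ~ group, data = toy)\n",
        "\n",
        "# Same call with a second regressor added using +\n",
        "fit_multi <- lm(y ~ group + level, data = toy)\n",
        "\n",
        "# The coefficient table gains one row per non-reference level of the new factor\n",
        "rownames(summary(fit_simple)$coefficients)\n",
        "rownames(summary(fit_multi)$coefficients)\n",
        "```\n",
        "\n",
        "Because `level` is a factor, R expands it into dummy variables\n",
        "automatically, which is why the second table has more rows than the\n",
        "first."
      ],
      "id": "8c2f6d14-7b3a-4e95-a0d1-6b9e2c4f8a35"
    },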
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Naive regression (just immstat)\n",
        "reg2A <- lm(... ~ ..., data = census_data) # fill in the ... with the simple model (no controls)\n",
        "\n",
        "# Regression with controls\n",
        "reg2B <- lm(... ~ immstat + ..., data = census_data) # what should replace the ...? Think about the model\n",
        "\n",
        "# This will look ugly; try to look carefully at the output\n",
        "summary(reg2A)$coefficients\n",
        "summary(reg2B)$coefficients\n",
        "\n",
        "test_7()\n",
        "test_8() # Quiz 5"
      ],
      "id": "e91a18ac-a1da-425e-9c48-cadcaa6a70a7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Short Answer 4\n",
        "\n",
        "**Prompt**: Compare the estimated immigrant wage gap with and without\n",
        "$E_i$ in the regression. What happens to the gap when we add $E_i$? How\n",
        "do we interpret this?\n",
        "\n",
        "**A** The estimated immigrant wage gap has increased after adding\n",
        "controls  \n",
        "**B** The estimated immigrant wage gap has decreased after adding\n",
        "controls  \n",
        "**C** The estimated immigrant wage gap has not changed after adding\n",
        "controls"
      ],
      "id": "f225b83c-cd2c-4edc-be94-4c7eeb1c3540"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Enter your answer below as \"A\", \"B\", or \"C\"\n",
        "\n",
        "answer23 <- \"...\"\n",
        "test_23(answer23)"
      ],
      "id": "f09a2ff4-0563-428f-b459-6975760b2a2d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Activity 4\n",
        "\n",
        "Another topic of interest for labor economists that is related to the\n",
        "immigrant wage gap is racial wage discrimination - the issue of workers\n",
        "of similar productivity being paid different wages on average because of\n",
        "their race. Consequently, we can also use regressions to estimate the\n",
        "racial wage gap.\n",
        "\n",
        "Let’s suppose that we want to estimate this racial wage gap. Run a\n",
        "regression (without controls) that does this.\n",
        "\n",
        "<em>Tested objects:</em> `reg_race`."
      ],
      "id": "71d9a0df-5c7e-4b24-acc5-329e58ebdcf4"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Do not modify this line (sets \"not a visible minority\" as the reference level):\n",
        "census_data$vismin <- relevel(census_data$vismin, ref = \"not a visible minority\")\n",
        "# this is also how you set a different base level for a factor (handy!)\n",
        "\n",
        "# Racial Wage Gap Regression\n",
        "\n",
        "reg_race <- lm(wages ~ ..., data = census_data) # what model should we use here?\n",
        "\n",
        "summary(reg_race)\n",
        "\n",
        "test_10() #Quiz6"
      ],
      "id": "993cfcda-9614-4763-901c-2e9b00bec082"
    },
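    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see what `relevel()` does on its own, here is a small sketch with a\n",
        "made-up factor; the names `f` and `f2` and the colour levels are\n",
        "hypothetical. With R’s default contrasts, the first level of a factor\n",
        "serves as the base group in a regression, and `relevel()` moves the\n",
        "chosen level to the front.\n",
        "\n",
        "```r\n",
        "# A made-up factor; names and levels are hypothetical\n",
        "f <- factor(c(\"red\", \"blue\", \"green\", \"blue\"))\n",
        "levels(f)                      # alphabetical by default: \"blue\" \"green\" \"red\"\n",
        "\n",
        "f2 <- relevel(f, ref = \"red\")\n",
        "levels(f2)                     # \"red\" is now first, so it becomes the base level\n",
        "```\n",
        "\n",
        "The data themselves are unchanged; only the ordering of the levels, and\n",
        "therefore the comparison group, is different."
      ],
      "id": "d3a5b7e9-1c4f-4268-b8e0-9f2a6c1d4e73"
    },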
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Short Answer 5\n",
        "\n",
        "**Prompt**: How should we interpret the regression estimate for\n",
        "`visminblack`?\n",
        "\n",
        "**A** People from the Black community make on average about 14,795\n",
        "dollars less as compared to an average white person.  \n",
        "**B** Black immigrants make 14,795 dollars less than Black\n",
        "non-immigrants on average  \n",
        "**C** On average, a person from the Black community makes 14,795 dollars\n",
        "less than an average white person, holding all other variables constant"
      ],
      "id": "b8d12124-9505-4af3-85b7-370c916d42bd"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Enter your answer below as \"A\", \"B\", or \"C\"\n",
        "\n",
        "answer24 <- \"...\"\n",
        "test_24(answer24)"
      ],
      "id": "5e91bcd7-b448-4b59-8cc1-99f198c99970"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Short Answer 6\n",
        "\n",
        "**Prompt**: With this racial wage gap in mind, let’s return to the\n",
        "immigrant wage gap. Should we add explanatory variables for race to our\n",
        "regressions from Activities 2 and 3? Why or why not?\n",
        "\n",
        "**A** No, we should not  \n",
        "**B** Yes, we should because there could be other factors explaining the\n",
        "wage gap  \n",
        "**C** Yes, we should control for education only.  \n",
        "**D** Yes, we should control for immigrant status only."
      ],
      "id": "0ebcc2ff-0a9a-4a5b-955e-4e3f045355b4"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Enter your answer below as \"A\", \"B\", \"C\", or \"D\"\n",
        "\n",
        "answer25 <- \"...\"\n",
        "test_25(answer25)"
      ],
      "id": "94dffd21-156b-4e3f-aa98-fd9515e253da"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "tags": "aswer"
      },
      "outputs": [],
      "source": [
        "# Enter your answer below as \"A\", \"B\", \"C\", or \"D\"\n",
        "\n",
        "answer25 <- \"B\"\n",
        "test_25(answer25)"
      ],
      "id": "9c8952f2-d9b0-4d28-baa8-4ec117601e3b"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}