{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 2.1 - Intermediate - Introduction to Regression (326)\n",
        "\n",
        "COMET Team <br> *Emrul Hasan, Jonah Heyl, Shiming Wu, William Co,\n",
        "Jonathan Graves*  \n",
        "2022-12-08\n",
        "\n",
        "## Outline\n",
        "\n",
        "### Prerequisites\n",
        "\n",
        "-   Basic R and Jupyter skills\n",
        "-   A theoretical understanding of simple linear relationship\n",
        "-   An understanding of hypothesis testing\n",
        "-   Types of variables (qualitative, quantitative)\n",
        "\n",
        "### Outcomes\n",
        "\n",
        "By the end of this notebook, you will be able to:\n",
        "\n",
        "-   Learn how to run a simple linear regression using R\n",
        "\n",
        "-   Create and understand regression outputs in R\n",
        "\n",
        "-   Understand how to interpret coefficient estimates from simple linear\n",
        "    regressions in terms of an econometric model\n",
        "\n",
        "-   Examine the various elements of regression objects in R (including\n",
        "    fitted values, residuals and coefficients)\n",
        "\n",
        "-   Understand the relationship between $t$-tests and the estimates from\n",
        "    simple linear regressions\n",
        "\n",
        "-   Understand the role of qualitative variables in regression analysis,\n",
        "    with a particular emphasis on dummies\n",
        "\n",
        "-   Explain how adding variables to a model changes the results\n",
        "\n",
        "Note that the data in this exercise is provided under the Statistics\n",
        "Canada Open License: \\> <span id=\"fn1\">[<sup>1</sup>](#fn1s)Statistics\n",
        "Canada, Survey of Financial Security, 2019, 2021. Reproduced and\n",
        "distributed on an “as is” basis with the permission of Statistics\n",
        "Canada.Adapted from Statistics Canada, Survey of Financial Security,\n",
        "2019, 2021. This does not constitute an endorsement by Statistics Canada\n",
        "of this product.</span>"
      ],
      "id": "fb732fa0-21a1-4c9d-9321-ef52ccc9c011"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "library(tidyverse)\n",
        "library(haven)\n",
        "library(dplyr)\n",
        "source(\"intermediate_intro_to_regression_tests.r\")"
      ],
      "id": "bbdca182-f08b-4d40-a528-92f65529c1a0"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "SFS_data <- read_dta(\"../datasets_intermediate/SFS_2019_Eng.dta\")  #this code is discussed in module 1\n",
        "\n",
        "SFS_data <- filter(SFS_data, !is.na(SFS_data$pefmtinc))\n",
        "SFS_data <- rename(SFS_data, income_before_tax = pefmtinc)\n",
        "SFS_data <- rename(SFS_data, income_after_tax = pefatinc)\n",
        "SFS_data <- rename(SFS_data, wealth = pwnetwpg)\n",
        "SFS_data <- rename(SFS_data, gender = pgdrmie)\n",
        "SFS_data <- rename(SFS_data, education = peducmie)\n",
        "\n",
        "SFS_data <- SFS_data[!(SFS_data$education==\"9\"),]\n",
        "SFS_data$education <- as.numeric(SFS_data$education)\n",
        "SFS_data <- SFS_data[order(SFS_data$education),]\n",
        "SFS_data$education <- as.character(SFS_data$education)\n",
        "SFS_data$education[SFS_data$education == \"1\"] <- \"Less than high school\"\n",
        "SFS_data$education[SFS_data$education == \"2\"] <- \"High school\"\n",
        "SFS_data$education[SFS_data$education == \"3\"] <- \"Non-university post-secondary\"\n",
        "SFS_data$education[SFS_data$education == \"4\"] <- \"University\"\n",
        "\n",
        "SFS_data$gender <- as_factor(SFS_data$gender)\n",
        "SFS_data$education <- as_factor(SFS_data$education)"
      ],
      "id": "8ca15a13-67ec-48b1-8dcd-248b903792eb"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Part 1: Learning About Regressions\n",
        "\n",
        "What is a regression? What is the relationship of a regression to other\n",
        "statistical concepts? How do we use regressions to answer economic\n",
        "questions?\n",
        "\n",
        "In this notebook, we will explore these questions using our SFS data\n",
        "from Module 1 and learn more about the gender wealth gap. If you\n",
        "remember from last module, we were interested in the wealth gap between\n",
        "male and female lead households.\n",
        "\n",
        "We’ll begin our analysis by exploring the relationship between wealth\n",
        "and income. Let’s start off with a visualization:"
      ],
      "id": "6d73aaba-869a-46ae-aba9-8637c7e9d220"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "options(repr.plot.width=8,repr.plot.height=8) #controls the image size\n",
        "f <- ggplot(data = SFS_data, xlim=c(0,2.4*10^6), ylim=c(0,3.4*10^7), aes(x = income_after_tax, y = wealth)) + \n",
        "        xlab(\"Income After Tax\") + \n",
        "        ylab(\"Wealth\") + scale_x_continuous()\n",
        "\n",
        "f + geom_point()"
      ],
      "id": "6e3f56e8-37df-4a3d-9a36-a55a9df0d9d7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> *Think Deeper*: What do you see here? Is there anything about this\n",
        "> relationship that sticks out to you? Why does it have the shape it\n",
        "> does?\n",
        "\n",
        "You can probably tell that there is definitely some relationship between\n",
        "wealth and after-tax income - but it can be difficult to visualize using\n",
        "a scatterplot alone. There are far too many points to make out a\n",
        "discernable pattern or relationship here.\n",
        "\n",
        "### Regression Models\n",
        "\n",
        "This is where a **regression model** comes in. A regression model\n",
        "specifies the relationship between two variables. For example, a linear\n",
        "relationship would be:\n",
        "\n",
        "$$ W_i = \\beta_0 + \\beta_1I_i$$\n",
        "\n",
        "Where $W_i$ is wealth of family $i$, and $I_i$ is their after-tax\n",
        "income. In econometrics, we typically refer to $W_i$ as the **outcome**\n",
        "variable, and $I_i$ as the **explanatory** variable; you may have also\n",
        "heard the terms *dependent* and *independent* variables respectively,\n",
        "but these aren’t actually very good descriptions of what these variables\n",
        "are in econometrics which is why we won’t use them here.\n",
        "\n",
        "A model like this is our description of what this relationship is - but\n",
        "it depends on two unknowns: $\\beta_0$, $\\beta_1$.\n",
        "\n",
        "-   $\\beta_0$ and $\\beta_1$ are **parameters** of the model: they are\n",
        "    numbers that determine the relationship (intercept and slope,\n",
        "    respectively) between $W_i$ and $I_i$\n",
        "-   This is a *linear* relationship because the model we have specified\n",
        "    uses coefficients that are characteristic of linear model formulas -\n",
        "    note that there are many other kinds of models beyond the linear\n",
        "    type seen here.\n",
        "\n",
        "It is unlikely, if not impossible, for the relationship we observe here\n",
        "to completely explain everything about our data. We also need to include\n",
        "a term which captures everything that is *not* described by the\n",
        "relationship we described in the model. This is called the *residual*\n",
        "term (meaning “leftover”).\n",
        "\n",
        "-   The $\\epsilon_i$ is the **residual**: it is a component that\n",
        "    corresponds to the part of the data which is *not* described by the\n",
        "    model\n",
        "-   Residual terms will usually have certain assumed properties that\n",
        "    allow us to estimate the model\n",
        "\n",
        "Conceptually, you can think about a regression as two parts: the part of\n",
        "the relationship explained by your model ($W_i = \\beta_0 + \\beta_1 I_i$)\n",
        "and the part which is not explained ($\\epsilon_i$). The process of\n",
        "“fitting” or estimating a regression model refers to finding values for\n",
        "$\\beta_0$ and $\\beta_1$ such that as little as possible of the model is\n",
        "explained by the residual term. We write the complete regression\n",
        "equation by combining the two parts of the model:\n",
        "\n",
        "$$W_i = \\beta_0 + \\beta_1 I_i + \\epsilon_i$$\n",
        "\n",
        "The goal of regression analysis is to:\n",
        "\n",
        "1.  Estimate this equation (and especially the model parameters) as\n",
        "    accurately as possible.\n",
        "2.  Learn about the relationship between $W_i$ and $I_i$ from the\n",
        "    results of that estimation\n",
        "\n",
        "There are many ways to define “as accurately as possible” and similarly\n",
        "there are many ways to “estimate” the equation. In this course, we often\n",
        "use *ordinary least squares* (OLS) as our estimation method which can be\n",
        "understood as the following:\n",
        "\n",
        "$$(\\hat{\\beta_0},\\hat{\\beta_1}) = \\arg \\min_{b_0,b_1} \\sum_{i=1}^{n} (M_i - b_0 - b_1 W_i)^2 =\\arg \\min_{b_0,b_1} \\sum_{i=1}^{n} (e_i)^2$$\n",
        "\n",
        "It is just the calculus way of writing “choose $\\beta_0$ and $\\beta_1$\n",
        "(call them $\\hat{\\beta_0},\\hat{\\beta_1}$) such that they minimize the\n",
        "sum of the squared residuals”. Ultimately, the goal of doing a\n",
        "regression is to explain as much as possible using the parameters\n",
        "($\\beta_0, \\beta_1$) and as little as possible using $\\epsilon_i$.\n",
        "Through this equation, we have transformed our statistical problem into\n",
        "a calculus problem, one that can can be solved, for example, by taking\n",
        "derivatives.\n",
        "\n",
        "There are many, many ways to solve this estimation problem - most of\n",
        "which are built into R. Before getting into how we can estimate using R\n",
        "commands, we’ll discuss on how we can estimate manually.\n",
        "\n",
        "### Example: Manual Estimation\n",
        "\n",
        "If we think about the residuals as a gauge of error in our model\n",
        "(remember we want to think about the error in absolute terms, we can\n",
        "look at the scatterplot and guess how the model might perform based on\n",
        "how small or large the residuals are from the regression line. As you\n",
        "can probably imagine, this is not the most efficient nor the most\n",
        "accurate way to solve our estimate problem!\n",
        "\n",
        "Try to get the best fit you can by playing around with the following\n",
        "example."
      ],
      "id": "a5632ec2-7764-44d9-8a4f-9b6c662d1229"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#set the value of B_0 and B_1 with these values\n",
        "\n",
        "B_0 <- 10000  #change me\n",
        "B_1 <- 2  #change me\n",
        "\n",
        "# don't touch the rest of this code - but see if you can understand it!\n",
        "SSE <- sum((SFS_data$wealth - B_0 - B_1*SFS_data$income_after_tax)^2) #sum of our squared errors\n",
        "\n",
        "SSE_rounded <- round(SSE/1000000,0) \n",
        "print(paste(\"Your SSE is now,\", SSE_rounded,\", How low can you go?\")) #prints our SSE value\n",
        "\n",
        "options(repr.plot.width=10,repr.plot.height=8) #controls the image size\n",
        "\n",
        "fitted_line <- data.frame(income_before_tax = SFS_data$income_before_tax, wealth = B_0 + B_1*SFS_data$income_before_tax) #makes the regression line\n",
        "\n",
        "f <- ggplot(data = SFS_data, aes(x = income_before_tax, y = wealth),xlim=c(0,3*10^6),ylim=c(0,3*10^7)) + xlab(\"before tax income\") + ylab(\"wealth\")+scale_x_continuous() \n",
        "f <- f + geom_point() + geom_line(color = \"#330974\", data = fitted_line) #style preferences\n",
        "\n",
        "f #prints our graph with the line"
      ],
      "id": "856de209-7f4e-4946-be16-17e790735a86"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we change our $\\beta_0, \\beta_1$, notice how the best fit line\n",
        "changes as well. The closer we fit our line to the data the lower SSE we\n",
        "have\n",
        "\n",
        "### Simple Regressions in R\n",
        "\n",
        "Now, let’s see how we could use a regression in R to do this. Regression\n",
        "models look like: `Y ~ X` (the `~` symbol is called “tilde” FYI).\n",
        "\n",
        "> For now you can ignore the residual terms and parameters when writing\n",
        "> the model in R - just focus on the variables.\n",
        "\n",
        "So, for example, our regression model is\n",
        "\n",
        "$$W_i = \\beta_0 + \\beta_1 I_i + \\epsilon_i$$\n",
        "\n",
        "Which can be written in R as\n",
        "\n",
        "`wealth ~ income_before_tax`\n",
        "\n",
        "Regressions are estimated in R using the `lm` command, which contains an\n",
        "argument to specify the dataset. This creates a **linear model object**,\n",
        "which can be used to calculate things (through prediction) or perform\n",
        "tests. It also stores all of the information about the model, such as\n",
        "the coefficient and fit. The model generated using the lm() command can\n",
        "also be printed and summarized to give important basic information about\n",
        "a regression.\n",
        "\n",
        "Below are a few of the most important elements of a linear model. Let’s\n",
        "say, for example, that we called the model `my_model.`\n",
        "\n",
        "-   `my_model$coefficients`: gives us the parameter coefficients\n",
        "-   `my_model$residuals`: gives us the residuals\n",
        "-   `my_model$fitted.values`: gives us the predicted values\n",
        "\n",
        "Enough talk! Let’s see our model in action here."
      ],
      "id": "7e7d4070-89bd-4b66-b099-75933242bf6e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "regression1 = lm(wealth ~ income_after_tax, data = SFS_data) # take note this is very important!\n",
        "\n",
        "summary(regression1)\n",
        " \n",
        "head(regression1$coefficients)"
      ],
      "id": "f2449d90-1301-4bab-a6ae-58aa11af486b"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Take a close look at the results. Identify the following elements:\n",
        "\n",
        "-   The values of the parameters\n",
        "-   The standard errors of the parameters\n",
        "-   The %-of the data explained by the model\n",
        "\n",
        "> **Test Your Knowledge**: What %-of the variance in wealth is explained\n",
        "> by the model?  \n",
        "> Write the percentage in *decimal form* and include all decimals given\n",
        "> by the model (example, x.xxx - where x are numbers)"
      ],
      "id": "ae6a4bb9-bfd3-4d1a-87bc-91f82430913a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer1 <- ???   #answer goes here\n",
        "\n",
        "test_1()"
      ],
      "id": "4344956a-9816-4c57-aa6f-74a334c2008d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The underlying model and the parameters tells us about the relationship\n",
        "between the different values:\n",
        "\n",
        "$$W_i = 169826.16 + 9.96 I_i + \\epsilon_i$$\n",
        "\n",
        "Notice, for example:\n",
        "\n",
        "$$\\frac{\\partial W_i}{\\partial I_i} = \\beta_1 = 9.96$$\n",
        "\n",
        "In other words, when incomes goes up by 1 dollar, we would expect that\n",
        "the wealth accumulated for this given family will rise by 9.96 dollars.\n",
        "This kind of analysis is key to interpreting what this model is telling\n",
        "us.\n",
        "\n",
        "Finally, let’s visualize our fitted model on the scatterplot from\n",
        "before. How does it compare to your original model?"
      ],
      "id": "2123edeb-6217-4ae1-9ed8-bdb48ded5c58"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "options(repr.plot.width=10,repr.plot.height=8) #style preferences\n",
        "\n",
        "fitted_line2 = data.frame(income_before_tax = SFS_data$income_before_tax, wealth = regression1$coefficients[1] + regression1$coefficients[2]*SFS_data$income_before_tax)\n",
        "#this is our estimated fitted line\n",
        "\n",
        "f <- ggplot(data = SFS_data, aes(x = income_before_tax, y = wealth)) + xlab(\"Wealth\") + ylab(\"Income before tax\")+scale_x_continuous() #defines our x and y\n",
        "f <- f + geom_point() + geom_line(color = \"#070069\", data = fitted_line) + geom_line(color = \"#ff0000\", data = fitted_line2) #style preferences\n",
        "\n",
        "f #prints  graph"
      ],
      "id": "0c497382-1f3b-46d4-9f50-d3c1bdc2bbc6"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see - there’s a very close relationship between\n",
        "`after_tax_income` and `wealth`. The red line is a regression line of\n",
        "wealth and after_tax_income.\n",
        "\n",
        "Notice as well we have negative values? Negative income and negative\n",
        "wealth is weird. We will deal with this later.\n",
        "\n",
        "## Part 2: Simple Regressions and $t$-Tests\n",
        "\n",
        "What if we wanted to work with a qualitative variable like `gender`?\n",
        "\n",
        "Regression models can still incorporate this kind of variable - which is\n",
        "good, because (as the Census makes clear) this is the most common type\n",
        "of variable in real-world data. How is this possible?\n",
        "\n",
        "Let’s start out with the simplest kind of qualitative variable: a\n",
        "**dummy** (0 or 1) variable. Let’s use Male = $0$ and Female = $1$.\n",
        "Consider the regression equation:\n",
        "\n",
        "$$W_i = \\beta_0 + \\beta_1 G_i + \\epsilon_i ~, \\text{where}\\ G_i \\ \\text{is Gender}$$\n",
        "\n",
        "Consider the conditional expectation:\n",
        "\n",
        "$$E[W_i|G_i = \\text{Male}] = \\beta_0 + \\beta_1 \\cdot 1 + \\epsilon_i$$\n",
        "\n",
        "$$E[W_i|G_i = \\text{Female}] = \\beta_0 + \\beta_1 \\cdot 0 + \\epsilon_i$$\n",
        "\n",
        "By the OLS reggression assumptions, we have that \\$E\\[\\_i\\|G_i\\] = 0 \\$,\n",
        "so:\n",
        "\n",
        "$$E[W_i|G_i = \\text{Female}] = \\beta_0 + \\beta_1$$\n",
        "\n",
        "$$E[W_i|G_i = \\text{Male}] = \\beta_0$$\n",
        "\n",
        "Combining these two expressions:\n",
        "\n",
        "$$\\beta_1 = E[W_i|G_i = \\text{Female}] - E[W_i|G_i = \\text{Male}] = \\beta_1-\\beta_0$$\n",
        "\n",
        "What this tells us:\n",
        "\n",
        "1.  We can include **dummy** variables in regressions just like\n",
        "    quantitative variables\n",
        "2.  The coefficients on the dummy variable have meaning in terms of the\n",
        "    regression model\n",
        "3.  The coefficients measure the (average) difference in the dependent\n",
        "    variable between the two levels of the dummy variable\n",
        "\n",
        "We can estimate this relationship of gender and wealth using R. As we\n",
        "investigate the wealth gap between male and female lead households, we\n",
        "might expect to see a negative sign on the coefficient - that is, if we\n",
        "anticipate that female lead households will have less wealth than male\n",
        "lead households."
      ],
      "id": "2cedb3e0-179b-4c43-a577-cf3f56d98d22"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "regression2 <- lm(wealth ~ gender, data = SFS_data)\n",
        "\n",
        "summary(regression2)"
      ],
      "id": "02947d3c-7ec5-4126-b47f-c3b25ce53bae"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "What do you see here?\n",
        "\n",
        "> **Test Your Knowledge**: What is the difference in average wealth\n",
        "> between male and female lead households?"
      ],
      "id": "60bba6bf-844e-46fc-8460-01eb530b7634"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# input the answer (to 1 decimal place, don't forget to add a negative sign, if relevant)\n",
        "answer2 <-  ???  # your answer here\n",
        "\n",
        "test_2()"
      ],
      "id": "bb8d0ab8-c54a-4183-b0b7-88d6ee73f4e0"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The number might seem familiar if you remember what we learned about a\n",
        "$t$-test from earlier. Remember this result?"
      ],
      "id": "a2a7fe2d-276b-4180-8cc9-733ec4d6eaad"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "t1 = t.test(\n",
        "       x = filter(SFS_data, gender == \"Male\")$wealth,\n",
        "       y = filter(SFS_data, gender == \"Female\")$wealth,\n",
        "       alternative = \"two.sided\",\n",
        "       mu = 0,\n",
        "       conf.level = 0.95)\n",
        "\n",
        "t1 \n",
        "\n",
        "t1$estimate[1] - t1$estimate[2]"
      ],
      "id": "8c87f458-77b7-432e-a51e-aa8eebfb7bed"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Look closely at this result, and the result above. What do you see? What\n",
        "is the relationship here?\n",
        "\n",
        "This is a very important result because a dummy variable regression is\n",
        "an example a two sample comparison. Why is this? Recall:\n",
        "\n",
        "$$\\beta_1 = E[W_i|G_i = \\text{Female}] - E[W_i|G_i = \\text{Male}]$$\n",
        "\n",
        "The regression coefficient of $\\beta_1$ can be interpreted as a\n",
        "comparison of two means. This is exactly the same as what the $t$-test\n",
        "is doing. Comparing two means by different groups - groups which are\n",
        "specified by $G_i = \\text{Male}$ or $G_i = \\text{Female}$.\n",
        "\n",
        "In other words, another way of thinking about a regression is as a\n",
        "`super` comparison of means test. However, regressions can handle\n",
        "analysis using qualitative (dummy) variables as a well as quantitative\n",
        "variables, which regular comparison of means tests cannot handle.\n",
        "\n",
        "### Multiple Levels\n",
        "\n",
        "Okay, but what if you have a qualitative variable that takes on *more*\n",
        "than two levels? For example, the `education` variable includes four\n",
        "different education classes."
      ],
      "id": "d3d62249-7e13-485a-9e5f-147dcb60b7da"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "SFS_data %>%\n",
        "group_by(education) %>%\n",
        "summarize(number_of_observations = n())"
      ],
      "id": "9334b402-d2b3-4adb-9ffb-33163fa6dfca"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this case, the idea is that you can replace a qualitative variable by\n",
        "a *set of dummies*. Consider the following set of variables:\n",
        "\n",
        "-   `d_1`: Is highest education less than high school? (Yes/No)\n",
        "-   `d_2`: Is highest education high school? (Yes/No)\n",
        "-   `d_3`: Is highest education non-university post-secondary? (Yes/No)\n",
        "-   `d_4`: Is highest education university? (Yes/No)\n",
        "\n",
        "These four dummy variables capture the same information as the\n",
        "qualitative variable `education`. In other words, if we were told the\n",
        "value of `education` we could discern which of these dummies were `Yes`\n",
        "or `No`, and vice-versa. In fact, if wetake a closer look, we’ll notice\n",
        "that we actually only need three of the four to figure our the value of\n",
        "`education`. For example, if I told you that `d_4`, `d_3`, `d_2` were\n",
        "all “No”, what would the value of `education` be?\n",
        "\n",
        "In other words, one of the dummies is redundant in helping us understand\n",
        "the qualitative variable. This property is important; we usually will\n",
        "omit one possible dummy to include only the minimum number of variables\n",
        "needed to explain the qualitative variable in question. This omitted\n",
        "dummy is called the **base level**. If we forget about this and still\n",
        "add 4 dummy variables, we would be committing a dummy variable trap.\n",
        "\n",
        "-   Which one should be the base level? It doesn’t matter, from a\n",
        "    technical perspective.\n",
        "\n",
        "> **Test Your Knowledge**: suppose you have a qualitative variable with\n",
        "> $k$ distinct levels. What is the minimum number of *possible* ways to\n",
        "> represent a set of dummies if you don’t want to include any redundant\n",
        "> variables?\n",
        "\n",
        "-   **A**: $k$\n",
        "-   **B**: $k-1$\n",
        "-   **C**: $k+1$\n",
        "-   **D**: $k^2$"
      ],
      "id": "3703c9de-dffd-4724-b151-d1ac1337578d"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer2.5 <- ??? # type in your answer here \n",
        "\n",
        "test_2.5()"
      ],
      "id": "2a77ab0c-ba22-42f5-bd56-a8ae5d1209bf"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In general, in R, most commands will automatically handle this process\n",
        "of creating dummies from qualitative variables. As you saw with the\n",
        "simple regression, R created them for you. You can also create dummies\n",
        "using a variety of commands, if necessary - but in general, if you tell\n",
        "R that your variables are factors, it will automatically handle the\n",
        "creation of dummies properly.\n",
        "\n",
        "Technically, the example above which includes multiple variables is\n",
        "called a **multiple regression** model, which we haven’t covered yet.\n",
        "\n",
        "Let’s explore regression some more, in the following series of\n",
        "exercises.\n",
        "\n",
        "## Part 3: Exercises\n",
        "\n",
        "### Activity 1\n",
        "\n",
        "Last week, we briefly explored the idea of the wealth gap and explored\n",
        "the idea that it could be caused by some income related factors. We can\n",
        "now examine this issue directly using regressions. Run a regression with\n",
        "\\* before tax income \\* on male and female lead households.\n",
        "\n",
        "<em>Tested objects:</em> `regm` (the regression for males).<em>Tested\n",
        "objects:</em> `regm` (the regression for females)."
      ],
      "id": "c9adc4c1-3d88-4104-9cb2-80b042aa26ed"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Quiz 1\n",
        "\n",
        "# Regression for males\n",
        "regm <- lm(??? ~ income_before_tax, filter(SFS_data, ??? == \"Male\")) \n",
        "# Replace \"...\" with the appropriate variables \n",
        "#remember answers are case sensitive!\n",
        "\n",
        "# Quiz 2\n",
        "# Regression for females\n",
        "regf <- lm(??? ~ income_before_tax, data = filter(SFS_data, ??? == \"Female\")) \n",
        "#remember answers are case sensitive!\n",
        "\n",
        "summary(regm) # Allow us to view regm's coefficient estimates\n",
        "summary(regf) # Same as above, but for regf\n",
        "\n",
        "test_3() # Quiz1\n",
        "test_4() # Quiz2"
      ],
      "id": "ad7abbb9-54d2-4294-8419-85cbab082d1f"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Short Answer 1\n",
        "\n",
        "**Prompt:** How do we interpret the coefficient estimate on `income` in\n",
        "each of these regressions?\n",
        "\n",
        "<font style=\"color:red\">Answer in red here!</font>"
      ],
      "id": "fb9061dd-6cf9-4f75-a685-bce85e15eb65"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_1 <- #fill in your short answer"
      ],
      "id": "ba4b1460-2df0-47e6-a1c1-d2f94fb212e7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Activity 2\n",
        "\n",
        "We might think that income inequality between females and males might\n",
        "depend on the educational gaps between these two groups. In this\n",
        "activity, we will explore how the income gap varies by education. First,\n",
        "let’s see the factor levels of the `education`:"
      ],
      "id": "c94c3ce4-4db5-4444-9e30-6516a54fc95a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "levels(SFS_data$education) # Run this"
      ],
      "id": "88be9837-4743-43e2-8399-49daa1edc873"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we can see, there are a few education groups in this dataframe. Let’s\n",
        "estimate the income gap (with no controls) for each of the four groups\n",
        "separately:\n",
        "\n",
        "-   Less than high school\n",
        "-   High school\n",
        "-   Non-university post-secondary\n",
        "-   University\n",
        "\n",
        "<em>Tested objects:</em> `rege2` (High School), `rege4` (University)\n",
        "\n",
        "Notice we don’t need to do 4 regressions we could just do three."
      ],
      "id": "a44c4990-1638-4090-85f3-8c8b8d920e3a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#reg1 is a regression performed on people, with a less than high scchool education\n",
        "reg1 <- lm(??? ~ ???, data = filter(SFS_data, education == \"Less than high school\")) #what should replace the ...\n",
        "#reg2 is the same as rege1,but we are looking at people with a high school education\n",
        "reg2 <- lm(??? ~ ???, data = filter(SFS_data, education == \"High school\")) #fill in the blanks\n",
        "\n",
        "reg3 <- lm(??? ~ ???, data = filter(SFS_data, education == \"Non-university post-secondary\")) #remember answers are case sensitive!\n",
        "\n",
        "reg4 <- lm(??? ~ ???, data = filter(SFS_data,education == \"University\"))\n",
        "\n",
        "# store the summaries (but don't show them!  too many!)\n",
        "sum20 <- summary(reg1)\n",
        "sum30 <- summary(reg2)\n",
        "sum40 <- summary(reg3)\n",
        "sum50 <- summary(reg4)\n",
        "\n",
        "test_9() \n",
        "test_10() \n",
        "test_11() \n",
        "test_12() "
      ],
      "id": "335dfa31-3747-4c43-acdc-87a8b1f5b516"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The code below will tabulate a brief summary of each regression:"
      ],
      "id": "26c02efd-e3ef-46ca-8b11-df135186de61"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# just run me.  You don't need to edit this\n",
        "\n",
        "Educ_Group <- c(\"Less than high school\", \"High School\", \"Non-university post-secondary\", \"University\") #defines column 1\n",
        "Income_Gap <- c(reg1$coefficients[2], reg2$coefficients[2], reg3$coefficients[2], reg4$coefficients[2]) #defines column 2\n",
        "Std._Error <- c(sum20$coefficients[2,2], sum30$coefficients[2,2], sum40$coefficients[2,2], sum50$coefficients[2,2]) #defines column 3\n",
        "t_Value <- c(sum20$coefficients[2,3], sum30$coefficients[2,3], sum40$coefficients[2,3], sum50$coefficients[2,3]) #defines column 4\n",
        "p_Value <- c(sum20$coefficients[2,4], sum30$coefficients[2,4], sum40$coefficients[2,4], sum50$coefficients[2,4]) #defines column 5\n",
        "\n",
        "tibble(Educ_Group, Income_Gap, Std._Error, t_Value, p_Value) #it's like a table but a tibble"
      ],
      "id": "5c1cf16c-44cb-415d-ab92-d6657c209a18"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Short Answer 3\n",
        "\n",
        "**Prompt**: What happens to the income gap as we move across eduction\n",
        "groups? What might explain these changes? (hint: think back to module\n",
        "1!)\n",
        "\n",
        "<font style=\"color:red\">Answer in red here!</font>"
      ],
      "id": "6d33104e-ed32-4b85-a46d-be8038abc88e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_3 <- #fill in your short answer"
      ],
      "id": "2c9390d4-358d-476e-8d10-cf3d42df1531"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Activity 3\n",
        "\n",
        "As we observed in last week’s worksheet, the income gap could differ by\n",
        "education level. Since there are many education categories, however, we\n",
        "may not want to examine this by running a regression for each education\n",
        "level separately.\n",
        "\n",
        "Instead, we could run a single regression and add education level as a\n",
        "second regressor, $E_i$:\n",
        "\n",
        "$$I_i = \\beta_0 + \\beta_1 G_i + \\beta_2 E_i + \\epsilon_i$$\n",
        "\n",
        "This is actually a **multiple regression**, which we will learn about\n",
        "next week - but from the point of the this lesson, the idea is that it\n",
        "is “run” in R essentially in the same way as a simple regression.\n",
        "Estimate the regression model above without $E_i$, then re-estimate the\n",
        "model with $E_i$ added. **USE INCOME BEFORE TAX**.\n",
        "\n",
        "<em>Tested objects:</em> `reg2A` (regression without controls), `reg2B`\n",
        "(regression with controls)."
      ],
      "id": "69457bd5-54d3-4dd0-b1c1-2042515c14a5"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Simple regression (just gender)\n",
        "reg2A <- lm(income_before_tax ~ gender, data = SFS_data) # this one works already\n",
        "\n",
        "# Regression with controls\n",
        "reg2B <-  lm(income_before_tax ~ ??? + education, data = SFS_data) # replace the ...\n",
        "\n",
        "summary(reg2A)\n",
        "summary(reg2B)\n",
        "#this will look ugly; try to look carefully at the output\n",
        "\n",
        "test_7()\n",
        "test_8() "
      ],
      "id": "2dd323a2-b929-405b-894e-360746683ba0"
    },
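    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following is a minimal sketch, on simulated data rather than\n",
        "`SFS_data`, of why a coefficient can change when a control is added. All\n",
        "names here (`group`, `educ`, `y`, `short`, `long`) and the numbers used to\n",
        "simulate the data are hypothetical and chosen only for illustration.\n",
        "Because `educ` is correlated with `group` and also affects `y`, the\n",
        "regression that leaves out `educ` attributes part of its effect to `group`."
      ],
      "id": "7d2f4c1a-93be-4e57-a6c0-5b81f0e2d934"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Illustration on simulated data (hypothetical names and values)\n",
        "set.seed(1)\n",
        "n <- 1000\n",
        "group <- rbinom(n, 1, 0.5)               # binary regressor of interest\n",
        "educ  <- rbinom(n, 1, 0.3 + 0.4 * group) # control that is correlated with group\n",
        "y     <- 10 + 2 * group + 5 * educ + rnorm(n)\n",
        "\n",
        "short <- lm(y ~ group)        # regression without the control\n",
        "long  <- lm(y ~ group + educ) # regression with the control\n",
        "\n",
        "# compare the coefficient on group across the two regressions\n",
        "c(short = coef(short)[[\"group\"]], long = coef(long)[[\"group\"]])"
      ],
      "id": "c8a1e6f3-2b74-4d09-8e5a-1f6b93d7a042"
    },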
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Short Answer 4\n",
        "\n",
        "Prompt: Compare the estimated income gap with and without $E_i$ in the\n",
        "regression. What happens to the gap when we add $E_i$?\n",
        "\n",
        "<font style=\"color:red\">Answer in red here!</font>"
      ],
      "id": "b7f3b044-55ed-4378-877e-1ba7def57542"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_4 <- #fill in your short answer"
      ],
      "id": "470448ac-f03b-493c-8f36-2820e4978937"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Theoretical Activity 1\n",
        "\n",
        "When we deal with large quantitative variables, we often take the\n",
        "natural log of it:"
      ],
      "id": "5c19bfba-0db3-4431-ba71-238e33036a2c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "W = log(SFS_data$wealth[SFS_data$wealth>0]) "
      ],
      "id": "96c47239-662b-4001-a18f-badcb8ce0cb8"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You may recall that the derivative of the log of a variable is\n",
        "approximately equal to percentage change in the variables:\n",
        "\n",
        "$$\\frac{dln(x)}{dx} \\approx \\frac{\\Delta x}{x}$$\n",
        "\n",
        "Thus, when we find the marginal effect of some continuous regressor\n",
        "$X_i$ (say, `income`):\n",
        "\n",
        "$$ln(W_i) = \\beta_0 + \\beta_1 I_i + \\epsilon_i \\implies \\frac{\\Delta W_i}{W_i} \\approx \\beta_1 \\Delta I_{i}$$\n",
        "\n",
        "This allows us to interpret the changes in a continuous variable as\n",
        "associated with a percentage change in wealth; for instance, if we\n",
        "estimate a coefficient of $0.02$ on `income_before_tax`, we say that\n",
        "when a family’s income before tax increases by 1 CAD, the corresponding\n",
        "wealth increases by 2 percent on average.\n",
        "\n",
        "Notice as well we are now talking about percent changes, rather than\n",
        "units.\n",
        "\n",
        "Let’s generate two variables that take the natural log of the wealth\n",
        "<em>and</em> market income from the `SFS_data` dataframe (hint: use a\n",
        "technique that we introduced last week). Then, estimate the effect of\n",
        "logarithmic market income on logarithmic wealth.\n",
        "\n",
        "<em>Tested Objects:</em> `lnreg`"
      ],
      "id": "3fa93d65-01c1-405b-88ce-41a2537a64f0"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Generate log wage variable\n",
        "SFS_data <- SFS_data %>%\n",
        "               mutate(lnincome = log(SFS_data$income_before_tax)) %>% # what goes here?\n",
        "               mutate(lnwealth = log(SFS_data$wealth)) # what goes here?"
      ],
      "id": "1fa2f408-f001-4b9e-886c-abe76e2d405c"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Notice warning message “NaNs produced”. NaN means “Not a Number”. This\n",
        "happens because we had negative income and negative wealth. No matter\n",
        "how low our incomes are, the more we work, wealth and income should\n",
        "increase."
      ],
      "id": "2334f856-1150-4a0e-b447-b4544b76f95b"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# fix NANs\n",
        "SFS_data_logged <- SFS_data %>%\n",
        "               filter(income_before_tax>0) %>% #removes negative values\n",
        "               filter(wealth>0)  #removes negative values\n",
        "    \n",
        "# Log Regression \n",
        "lnreg <- lm(lnwealth ~ ???, data = SFS_data_logged) #the new and improved regression\n",
        "\n",
        "\n",
        "summary(lnreg)\n",
        "\n",
        "test_5() #Quiz7"
      ],
      "id": "85b6db49-773a-4fef-be48-6bbf12ac0f03"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Short Answer 5\n",
        "\n",
        "Prompt: How do we interpret each of these estimates? (Hint: what does a\n",
        "1-unit change in the explanatory variable mean here?)\n",
        "\n",
        "<font style=\"color:red\"> Answer here in red</font>"
      ],
      "id": "5dc616a5-0778-4768-ad15-48557f1518b6"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_5 <- #fill in your short answer"
      ],
      "id": "13843125-c770-4e02-b076-b879c0c017f1"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Optional: Bonus Activity 4\n",
        "\n",
        "You have learned about a linear regression model of income; however,\n",
        "income often follows a Pareto distribution. For now, using a linear\n",
        "approximation to find the wage gap is fine. We may want to know stuff\n",
        "about the underlying distribution of income in male and female lead\n",
        "households, however. Here’s the PDF of pareto distribution:\n",
        "\n",
        "$$f(x) = {\\displaystyle {\\frac {\\alpha x_{\\mathrm {m} }^{\\alpha }}{x^{\\alpha +1}}}} $$\n",
        "\n",
        "Ok, now with regression remember we said that we estimate the parameter\n",
        "given the data. To do this we said you could use Calcus or methods other\n",
        "than OLS. Here the probability of the data can be approximated by\n",
        "assuming independence between each $x_i$. If we do this, the probability\n",
        "of the data is given by:\n",
        "\n",
        "$$\\Pi_{i=1}^n f(x)$$\n",
        "\n",
        "Now we can just make a function in r and optimize over it which performs\n",
        "essentially the same operation as a linear regression."
      ],
      "id": "a6794722-5498-4f38-a8cc-3ab201a4e8f8"
    },
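    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a sketch of the step from this product to the function optimized in\n",
        "the cells below: taking logs turns the product into a sum, which gives the\n",
        "log-likelihood. Here $n$ is the number of observations, and we assume\n",
        "that every $x_i \\geq x_m$, which a hand-picked $x_m$ need not guarantee in\n",
        "the data.\n",
        "\n",
        "$$\\ell(\\alpha) = \\sum_{i=1}^n \\ln f(x_i) = n \\ln(\\alpha) + n \\alpha \\ln(x_m) - (\\alpha + 1) \\sum_{i=1}^n \\ln(x_i)$$\n",
        "\n",
        "Under the same assumption, setting the derivative with respect to\n",
        "$\\alpha$ to zero gives a closed-form estimate:\n",
        "\n",
        "$$\\frac{d\\ell}{d\\alpha} = \\frac{n}{\\alpha} + n \\ln(x_m) - \\sum_{i=1}^n \\ln(x_i) = 0 \\implies \\hat{\\alpha} = \\frac{n}{\\sum_{i=1}^n \\ln(x_i) - n \\ln(x_m)}$$"
      ],
      "id": "e3b7a915-4f62-4c8d-b0a7-6d9e21c5f8a3"
    },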
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "scrolled": true
      },
      "outputs": [],
      "source": [
        "x=filter(SFS_data,gender=='Female')\n",
        "x <- filter(x, is.numeric(income_before_tax))\n",
        "x <- x$income_before_tax"
      ],
      "id": "3a493b29-ef01-45e8-ace7-89896405a6cc"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "calc <- function (x){\n",
        "    q=0\n",
        "for (i in x){\n",
        "    if (i >0){\n",
        "      a= log(i[1]) }\n",
        "        if (is.numeric(a)==TRUE){\n",
        "            q=q+a }\n",
        "    }\n",
        "return (q)\n",
        "}"
      ],
      "id": "788ad3d9-636c-4781-95b8-1acdd4240b00"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "calc(x)"
      ],
      "id": "ffe56536-b938-4a89-ba5e-e174800488f8"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ell <- function(a,q,xm,n) { # we use the log function of the pareto distrubtion instead\n",
        "    d=(n*log(a))\n",
        "    b=(-1)*(a+1)*q\n",
        "    c=a*log(xm)*n \n",
        "    return (d+b+c)\n",
        "}"
      ],
      "id": "6c0d1ae9-c3ad-46b6-816d-0c40254344cd"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "a = optimize(ell,c(2,50),maximum=TRUE,q=43074.1853103325,xm=40000,n=length(x))\n",
        "a\n",
        "a_women=a$maximum "
      ],
      "id": "58142a87-aac4-424f-81ed-7140ddcceaa1"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "y=filter(SFS_data,gender=='Male')\n",
        "y <- filter(y, is.numeric(income_before_tax))\n",
        "y <- y$income_before_tax\n",
        "a_men = optimize(ell,c(2,1000),maximum=TRUE,q=calc(y),xm=65000,n=length(y))\n",
        "a_men = a_men$maximum\n",
        "a_men"
      ],
      "id": "f344edd2-f693-46b6-835c-d87aad38b0d2"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The theoretical mean of the Pareto distribution is,\n",
        "\n",
        "$$ \\frac{\\alpha x_m}{\\alpha -1} $$ Can you calculate the expected income\n",
        "gap with the Pareto distribution assumption?"
      ],
      "id": "e246546e-fabf-4264-aedf-7063bfe5209f"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "xmw=40000\n",
        "xmm=65000\n",
        "income_gap =((a_women* xmw )/ (a_women-1)) - ((a_men* xmm )/ (a_men-1))\n",
        "income_gap #note we set xm ourselves (I did this by playing around with xm, and doing a bit of research) see if you can get a better xm."
      ],
      "id": "1cfd804f-0112-47e9-9c69-b63158a12917"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}