{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 2.5 - Intermediate - Interactions and Non-linear Terms\n",
        "\n",
        "COMET Team <br> *Emrul Hasan, Jonah Heyl, Shiming Wu, William Co,\n",
        "Jonathan Graves*  \n",
        "2022-12-08\n",
        "\n",
        "## Outline\n",
        "\n",
        "### Prerequisites\n",
        "\n",
        "-   Multiple regression\n",
        "-   Simple regression\n",
        "-   Data analysis and introduction\n",
        "\n",
        "### Outcomes\n",
        "\n",
        "In this worksheet, you will learn:\n",
        "\n",
        "-   How to incorporate interaction terms into a regression analysis\n",
        "-   How to interpret models with interaction terms\n",
        "-   How to create models which include non-linear terms\n",
        "-   How to compute simple marginal effects for models with non-linear\n",
        "    terms\n",
        "-   How to explain polynomial regressions as approximations to a\n",
        "    non-linear regression function\n",
        "\n",
        "### Notes\n",
        "\n",
        "<span id=\"fn1\">[<sup>1</sup>](#fn1s)Statistics Canada, Survey of\n",
        "Financial Security, 2019, 2021. Reproduced and distributed on an “as is”\n",
        "basis with the permission of Statistics Canada.Adapted from Statistics\n",
        "Canada, Survey of Financial Security, 2019, 2021. This does not\n",
        "constitute an endorsement by Statistics Canada of this product.</span>\n",
        "\n",
        "<span id=\"fn2\">[<sup>2</sup>](#fn2s)Stargazer package is due to: Hlavac,\n",
        "Marek (2022). stargazer: Well-Formatted Regression and Summary\n",
        "Statistics Tables. R package version 5.2.3.\n",
        "https://cran.r-project.org/web/packages/stargazer/index.html </span>"
      ],
      "id": "d1f3f140-65df-4b21-a688-98f2e0660c56"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "library(tidyverse)\n",
        "library(haven)\n",
        "library(dplyr)\n",
        "library(scales)\n",
        "library(stargazer)\n",
        "library(car)\n",
        "\n",
        "source(\"intermediate_interactions_and_nonlinear_terms_functions.r\")\n",
        "\n",
        "SFS_data <- read_dta(\"../datasets_intermediate/SFS_2019_Eng.dta\")\n",
        "SFS_data <- clean_up_data()"
      ],
      "id": "02434978-67c6-4970-8e1f-966db8ad049f"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Part 1: Interactions in Regression Models\n",
        "\n",
        "One of the most common extensions to multiple regression models is to\n",
        "include **interaction** terms. What is an interaction term? It’s\n",
        "basically a term which represents the product of two (or more) variables\n",
        "in a model.\n",
        "\n",
        "For example, if we have a dummy variable for being a female ($F_i$) and\n",
        "a dummy variable for having a university degree ($D_i$), the\n",
        "*interaction* of these two variables is the product $D_i \\times F_i$.\n",
        "This can seem complicated but it also has a simple interpretation: it is\n",
        "now a dummy for being *both* a female and having a university degree.\n",
        "You can see why this is true:\n",
        "\n",
        "$$ \n",
        "D_i \\times F_i = 1 \\iff D_i = 1 \\text{ and } F_i = 1\n",
        "$$\n",
        "\n",
        "This is why these terms are so important for understanding regressions:\n",
        "they provide us with a simple way to describe and study how\n",
        "*combinations* of our explanatory variables impact our model. These\n",
        "variables enter into our regression models in exactly the same way as\n",
        "usual:\n",
        "\n",
        "$$\n",
        "Y_i = \\beta_0 + \\beta_1 F_i + \\beta_2 D_i + \\beta_3 D_i \\times F_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "At this point, you can see that this is just a multiple regression\n",
        "model - the only difference is that one of the variables is a\n",
        "combination of the other variables. From an estimation perspective,\n",
        "there’s no issue - you can use OLS to estimate a model with interaction\n",
        "terms, just like normal. However, as we have seen, there are important\n",
        "differences when it comes to the *interpretation* of these models. Let’s\n",
        "learn more about this in this worksheet.\n",
        "\n",
        "There are (in general) two ways to create interactions in R: (i)\n",
        "manually (i.e. creating a new variables which is $D_i \\times F_i$ then\n",
        "adding it to the regression), or (ii) using the built-in tools in R.\n",
        "However, method (i) is a trap! You should never use this method. Why?\n",
        "There are two reasons:\n",
        "\n",
        "1.  The main reason is that R (and you, the analyst) lose track of the\n",
        "    relationship between the created interaction variable and the\n",
        "    underlying variables. This means that you can’t use other tools to\n",
        "    analyze these relationship (there are many packages such as\n",
        "    `margins` which allow you to investigate complex interaction) which\n",
        "    is a big loss. You also can’t perform post-regression analysis on\n",
        "    the underlying variables in a simple way anymore.\n",
        "2.  The second reason is that it’s easy to make mistakes. You might\n",
        "    define the interaction incorrectly (possible!). However, it’s more\n",
        "    of an issue if later on you change the underlying variables and then\n",
        "    forget to re-compute the interactions. It also makes your code\n",
        "    harder to read.\n",
        "\n",
        "Bottom line: don’t do it. Interaction in R are easy to create: you\n",
        "simply use the `:` or `*` operator when defining an interaction term.\n",
        "\n",
        "-   The `:` operator creates the interaction(s) of the two variables in\n",
        "    question\n",
        "-   The `*` operation creates the interactions(s) *and* the main effects\n",
        "    of the variables as well\n",
        "\n",
        "Even better: if you are interacting two qualitative (factor) variables,\n",
        "it will automatically “expand” the interaction into every possible\n",
        "combination of the variables. A lot less work!\n",
        "\n",
        "For example, let’s look at a regression model which interacts gender and\n",
        "education. Before we run regression, let’s first summarize education\n",
        "into ‘university’ and ‘non-university’."
      ],
      "id": "4670c051-8483-48ea-9a0a-57ca910bbcb5"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "SFS_data <- SFS_data %>% #cretes a Education dummy variable\n",
        "               mutate( \n",
        "               Education = case_when(\n",
        "                     education == \"University\" ~ \"University\", #the ~ seperates the original from the new name\n",
        "                     education == \"Non-university post-secondary\" ~ \"Non-university\",\n",
        "                     education == \"High school\" ~ \"Non-university\",\n",
        "                     education == \"Less than high school\" ~ \"Non-university\")) %>%\n",
        "             mutate(Education = as_factor(Education)) #remember, it's a factor!"
      ],
      "id": "09cb2efa-e4d9-48a6-823f-229e82c8dc8e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "regression1 <- lm(wealth ~ gender + Education + gender:Education, data = SFS_data)\n",
        "\n",
        "#regression1 <- lm(wealth ~ gender*Education, data = SFS_data) #an alternative way to run the same regression\n",
        "\n",
        "summary(regression1)"
      ],
      "id": "e4af0d60-20db-4f6c-81cf-72b638653b61"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "There are a few important things to notice about this regression result.\n",
        "First, take a close look at the terms:\n",
        "\n",
        "-   `genderFemale` this is the main effect for being a female. You\n",
        "    *might* immediately say that this is the impact of being a female -\n",
        "    but this is *not true*. Why? Because female shows up in two places!\n",
        "    We have to be a little more careful - this is the effect of being a\n",
        "    female in the base group (non-university)\n",
        "-   `genderFemale:EducationUniversity` this is the interaction effect of\n",
        "    being a female and having a university degree. Basically, family\n",
        "    with female (university degree) as main earner accumulates\n",
        "    $143,396+324,112=467,508$ less wealth, compared with male\n",
        "    counterpart.\n",
        "\n",
        "You can see this interpretation in the regression model itself:\n",
        "\n",
        "$$\n",
        "W_i = \\beta_0 + \\beta_1 F_i + \\beta_2 D_i + \\beta_3 F_i \\times D_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "Consider:\n",
        "\n",
        "$$\n",
        "\\frac{\\Delta W_i}{\\Delta F_i} = \\beta_1 + \\beta_3 D_i\n",
        "$$\n",
        "\n",
        "The marginal effect of being a female-lead household *changes* depending\n",
        "on what the value of $D_i$ is! For non-university degree (the level\n",
        "where $D_i = 0$) it’s $\\beta_1$. For university degree (the level where\n",
        "$D_i =1$), it’s $\\beta_1 + \\beta_3$. This is why, in an interaction\n",
        "model, it doesn’t really make sense to talk about the “effect of\n",
        "female” - because there isn’t a single, immutable effect. It is\n",
        "different for different education degrees!\n",
        "\n",
        "You can talk about the *average* effect, which is just\n",
        "$\\beta_1 + \\beta_3 \\bar{D_i}$ - but that’s not really what people are\n",
        "asking about when they are discussing the gender effect, in general.\n",
        "\n",
        "This is why it’s very important to carefully think about a regression\n",
        "model with interaction terms - the model may seem simple to estimate,\n",
        "but the interpretation is more complex.\n",
        "\n",
        "### Interactions with Continuous Variables\n",
        "\n",
        "So far, we have just looked at interacting qualitative variables - but\n",
        "you can interact any types of variables!\n",
        "\n",
        "-   Qualitative-Qualitative\n",
        "-   Qualitative-Quantitative\n",
        "-   Quantitative-Quantitative\n",
        "\n",
        "The format and syntax in R is similar, with some small exceptions to\n",
        "deal with certain combinations of variables. However (again), you do\n",
        "need to be careful with interpretation.\n",
        "\n",
        "For example, let’s look at the interaction of income and sex on wealth.\n",
        "In a regression equation, this would be expressed like:\n",
        "\n",
        "$$\n",
        "W_i = \\beta_0  + \\beta_1 Income_i + \\beta_2 F_i + \\beta_3 Income_i \\times F_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "Notice that, just like before:\n",
        "\n",
        "$$\n",
        "\\frac{\\partial W_i}{\\partial Income_i} = \\beta_1 + \\beta_3 F_i\n",
        "$$\n",
        "\n",
        "There are two *different* “slope” coefficients; basically, male and\n",
        "female lead family can have a different return to wealth. Let’s see this\n",
        "in R:"
      ],
      "id": "fb0c13c7-fdc4-4e74-a677-964bad657cc7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "regression2 <- lm(wealth ~ income_before_tax + gender + income_before_tax:gender, data = SFS_data)\n",
        "\n",
        "summary(regression2)"
      ],
      "id": "76fae099-89ee-4ffd-b3ae-baa6675436e7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we can see here, the female-lead households in our model accumulate\n",
        "about 3.946 dollars more in wealth per dollar of income earned than\n",
        "male-lead respondents. But female-lead households accumulate 343,300\n",
        "dollars less than male counterparts. So the overall effects depend on\n",
        "average income before tax.\n",
        "\n",
        "This addresses the common problem of estimating a regression model where\n",
        "you think the impact of a continuous variable might be different across\n",
        "the two groups. One approach would be to run the model only for men, and\n",
        "only for women, and then compare - but this isn’t a good idea. Those\n",
        "regressions have a much smaller sample size, and if you have other\n",
        "controls in the model, you will “throw away” information. The\n",
        "interaction method is much better.\n",
        "\n",
        "## Part 2: Non-linear Terms in Regression Models\n",
        "\n",
        "You might have been puzzled by why these models were called “linear”\n",
        "regressions. The reason is because they are linear in the\n",
        "*coefficients*: the dependent variable is expressed as a linear\n",
        "combination of the explanatory variables.\n",
        "\n",
        "This implies that we can use the same methods (OLS) to estimate models\n",
        "that including linear combinations of *non-linear functions* of the\n",
        "explanatory variables. We have actually already seen an example of this:\n",
        "remember using `log` of a variable? That’s a non-linear function!\n",
        "\n",
        "As we learned when considering `log`, the most important difference here\n",
        "is again regarding interpretations, not the actual estimation.\n",
        "\n",
        "In R, there is one small complication: when you want to include\n",
        "mathematical expressions in a model formula, you need to “isolate” then\n",
        "using the `I()` function. This is because many operations in R, like `+`\n",
        "or `*` have a special meaning in a regression model.\n",
        "\n",
        "For example, let’s consider a quadratic regression - that is, including\n",
        "both $Income_i$ and $Income_i^2$ (income squared) in our model."
      ],
      "id": "b44a834e-14f2-4b79-a5e7-edfa0ba98311"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "regression3 <- lm(wealth ~ income_before_tax + I(income_before_tax^2), data = SFS_data)\n",
        "\n",
        "summary(regression3)"
      ],
      "id": "25a5f382-3057-42a6-9040-890a38dd8814"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see, we get regression results much like we would expect.\n",
        "However, how do we interpret them? The issue is that *income* enters\n",
        "into two places. We need to carefully interpret this model, using our\n",
        "knowledge of the equation:\n",
        "\n",
        "$$\n",
        "W_i = \\beta_0 + \\beta_1 Income_i + \\beta_2 Income_i^2 + \\epsilon_i\n",
        "$$\n",
        "\n",
        "$$\n",
        "\\implies \\frac{\\partial W_i}{\\partial Income_i} = \\beta_1 + 2 \\beta_2 Income_i\n",
        "$$\n",
        "\n",
        "You will notice something special about this; the marginal effect is\n",
        "*non-linear*. As $Income_i$ changes, the effect of income on $W_i$\n",
        "changes. This is because we have estimated a quadratic relationship; the\n",
        "slope of a quadratic changes as the explanatory variable changes. That’s\n",
        "what we’re seeing here!\n",
        "\n",
        "This makes these models relatively difficult to interpret, since the\n",
        "marginal effects change (often dramatically) as the explanatory\n",
        "variables change. You frequently need to carefully interpret the model\n",
        "and often (to get estimates) perform tests on *combinations* of\n",
        "coefficients, which can be done using things like the `car` package or\n",
        "the `lincom` function. You can also compute this manually, using the\n",
        "formula for the sum of variances.\n",
        "\n",
        "For example, let’s test if the marginal effect of income is significant\n",
        "at $Income_i = \\overline{Income}_i$. This is the most frequently\n",
        "reported version of this effects, often called the “marginal effect at\n",
        "the means”."
      ],
      "id": "44698cea-d86f-422a-8a3d-082bb1a1ee9f"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "m <- mean(SFS_data$income_before_tax)\n",
        "\n",
        "linearHypothesis(regression3, hypothesis.matrix = c(0, 1, 2*m), rhs=0) "
      ],
      "id": "dfb19665-067a-4779-90a4-13836b0db8eb"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we can see, it is highly significant\n",
        "\n",
        "> *Think Deeper*: what is the vector `c(0, 1, 2*m)` doing in the above\n",
        "> expression?\n",
        "\n",
        "Let’s see exactly what those values are. Recall the formula:\n",
        "\n",
        "$$\n",
        "V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2abCov(X,Y)\n",
        "$$\n",
        "\n",
        "In our situation, $X = Y = W_i$, so this is:\n",
        "\n",
        "$$\n",
        "V(\\beta_1 + 2\\bar{W_i}\\beta_2) = V(\\beta_1) + 4\\bar{W_i}^2V(\\beta_2) + 2(2\\bar{W_i})Cov(\\beta_1,\\beta_2)\n",
        "$$\n",
        "\n",
        "Fortunately, these are all things we have from the regression and its\n",
        "variance-covariance matrix:"
      ],
      "id": "b09e3af0-0055-419e-b787-21765dc6e61b"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "v <- vcov(regression3)\n",
        "coefs <- regression3$coefficients\n",
        "v\n",
        "\n",
        "var <- v[2,2] + 4*(m^2)*v[3,3] + 4*m*v[3,2]\n",
        "\n",
        "var\n",
        "\n",
        "coef <-  coefs[[2]] + 2*m*coefs[[3]]\n",
        "\n",
        "print(\"Coefficent Combination and SD\")\n",
        "round(coef,3)\n",
        "round(sqrt(var),3)"
      ],
      "id": "2bc45f4c-827e-457e-895b-d8fd2aeb5c50"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see, this gets fairly technical and is not something you will\n",
        "want to do without a very good reason. In general, it’s a better idea to\n",
        "rely on some of the packages written for R that handle this task for the\n",
        "(specific) model you are interested in evaluating.\n",
        "\n",
        "### Aside: Why Polynomial Terms?\n",
        "\n",
        "You might be wondering why econometricians spend so much time talking\n",
        "about models that included polynomial terms, when those are\n",
        "(realistically) a very small set of the universe of possible functions\n",
        "of an explanatory variable (you already know why we talk about `log` so\n",
        "much!).\n",
        "\n",
        "The reason is actually approximation. Consider the following\n",
        "*non-linear* model:\n",
        "\n",
        "$$\n",
        "Y_i = f(X_i) + e_i\n",
        "$$\n",
        "\n",
        "This model is truly non-linear (and not just in terms of the\n",
        "parameters). How can we estimate this model? It’s hard! There are\n",
        "techniques to estimate complex models like this, but how far can we get\n",
        "with good-old OLS? The answer is - provided that $f$ is “smooth” -\n",
        "pretty far.\n",
        "\n",
        "Think back to introductory calculus; you might remember a theorem called\n",
        "[Taylor’s Theorem](https://en.wikipedia.org/wiki/Taylor%27s_theorem). It\n",
        "says that a smoothly differentiable function can be arbitrarily\n",
        "well-approximated (about a point) by a polynomial expansion:\n",
        "\n",
        "$$\n",
        "f(x) = f(a) + f'(a)(x-a) + \\frac{f''(a)}{2!}(x-a)^2 + \\cdots + \\frac{f^{(k)}(a)}{k!}(x-a)^k + R_k(x)\n",
        "$$\n",
        "\n",
        "and the error term $R_k(x) \\to 0$ as $x \\to a$ and $k \\to \\infty$.\n",
        "\n",
        "Look closely at this expression. Most of the terms (like $f'(a)$) are\n",
        "constants. In fact, you can show that this can be written like:\n",
        "\n",
        "$$\n",
        "f(x) = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k + r\n",
        "$$\n",
        "\n",
        "Putting this into our expression above gives us the relationship:\n",
        "\n",
        "$$\n",
        "Y_i = \\beta_0 + \\beta_1 X_i + \\beta_2 X_i^2 + \\cdots + \\beta_k X_i^k+ \\epsilon_i\n",
        "$$\n",
        "\n",
        "Which is a linear regression model! What this say is actually very\n",
        "important: linear regression models can be viewed as *approximations* to\n",
        "nonlinear regressions, provided we have enough polynomial terms. This is\n",
        "one complication: the error term is definitely not uncorrelated. You can\n",
        "learn more about how to address this issue in other courses, but at the\n",
        "most the omitted variable bias is relatively small as $k \\to \\infty$.\n",
        "\n",
        "## Part 3: Exercises\n",
        "\n",
        "In this set of exercises, you will get some hands-on experience with\n",
        "estimating and interpreting non-linear regression models, as well as\n",
        "learning to scrutinize multiple regression models for their limitations.\n",
        "\n",
        "### Activity 1\n",
        "\n",
        "Consider the following regression model:\n",
        "\n",
        "$W_i$ denotes wealth, $F_i$ is a dummy variable for the gender of main\n",
        "earner in the household ($F_i=1$ if female is the main earner), $E_i$ is\n",
        "a factor variable for education and $P_i$ is a factor variable for\n",
        "province.\n",
        "\n",
        "### Short Answer 1\n",
        "\n",
        "<em>Prompt:</em> How should we interpret the coefficients $\\beta_5$ and\n",
        "$\\beta_6$? Why might these effects be important to estimate?"
      ],
      "id": "bde84fe1-be49-498e-a0d4-171d9273fb99"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_1 <- #fill in your short answer"
      ],
      "id": "2ca76dab-d0a4-48a1-a1e6-dcb8c629bbeb"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let’s estimate the model and interpret it concretely. (Please\n",
        "follow the order of variables in regression model:\n",
        "\n",
        "    reg1 <- gender, Education, province\n",
        "\n",
        "    reg0 <- gender, education, province\n",
        "\n",
        "What are your interaction variables? (remember to follow the same order)"
      ],
      "id": "302ced8f-2875-4d69-8c92-6b524d34d255"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Quiz 1\n",
        "reg1 <- lm(???, data = SFS_data)\n",
        "reg0 <- lm(???, data = SFS_data)\n",
        "\n",
        "summary(reg1)\n",
        "summary(reg0)\n",
        "\n",
        "test_1() #quiz1"
      ],
      "id": "c9878150-045c-4733-9860-7bdeb9a9abb7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Short Answer 2\n",
        "\n",
        "<em>Prompt:</em> How do we interpret the coefficient estimate on\n",
        "`gender:Education`? What education level do female-lead households\n",
        "appear to be most discriminated in? How might we explain this\n",
        "intuitively?\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "9c823373-e7f2-432b-8452-db16b5989b60"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_2 <- #fill in your short answer"
      ],
      "id": "2ae43406-846f-4804-81c2-42434e7e7def"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Short Answer 3\n",
        "\n",
        "<em>Prompt:</em> How do you interpret the coefficient estimate on\n",
        "`genderFemale:provinceAlberta`? (Hint: Write out the average wealth\n",
        "equations for female, male in Alberta, and female in Alberta\n",
        "separately.)\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "b19ea97e-97fd-4524-ad4c-fa6e5fe4bb0e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_3 <- #fill in your short answer"
      ],
      "id": "b6300384-a8e7-4396-9faf-42b61ecbe430"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now let’s test whether the returns to education increase if people are\n",
        "entrepreneurs. `business` is a factor variable which suggests whether\n",
        "the household owns a business. Please add terms to the regression\n",
        "equation that allow us to run this test. Then, estimate this new model.\n",
        "We don’t need `province` and `gender` variables in this exercise. For\n",
        "education, please use `education` variable. And we will continue to\n",
        "study wealth accumulated in households."
      ],
      "id": "41add1d9-7dd0-48db-b2a3-d47e122b2c6a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Quiz 2\n",
        "SFS_data$business <- relevel(SFS_data$business, ref = \"No\") #Do not change; makes \"not a business owner\" the reference level for business\n",
        "\n",
        "reg2 <- lm(???, data = SFS_data)\n",
        "\n",
        "summary(reg2)\n",
        "test_1.5()"
      ],
      "id": "d53ffaff-5f2c-4606-b842-e76765b68b32"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Short Answer 4\n",
        "\n",
        "<em>Prompt:</em> Do returns to education increase when people are\n",
        "entrepreneurs? Explain why or why not with reference to the regression\n",
        "estimates.\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "88cfd700-ce2f-4a25-970f-559dc4068ba8"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_4 <- #fill in your short answer"
      ],
      "id": "7c7d20e3-094a-4745-b8bc-819a7fa94c3d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Activity 2\n",
        "\n",
        "A topic that many labour economists are concerned with, and one that we\n",
        "have discussed before, is the gender-wage gap. In this activity, we will\n",
        "construct a “difference-in-difference” regression to explore this gap\n",
        "using the `SFS_data2`.\n",
        "\n",
        "Suppose that we want to estimate the relationship between age, sex and\n",
        "wages. Within this relationship, we suspect that women earn less than\n",
        "men from the beginning of their working lives, **but this gap does not\n",
        "change as workers age**.\n",
        "\n",
        "Estimate a regression model (with no additional control variables) that\n",
        "estimates this relationship using `SFS_data2`. We will use\n",
        "`income_before_tax` variable. Order: list `gender` before `agegr`.\n",
        "\n",
        "<em>Tested Objects:</em> `reg3A`.\n",
        "\n",
        "Let’s first simplify levels of age group using following codes."
      ],
      "id": "9403dd1f-23b3-43c6-aa32-d2f8a4a15381"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Some Data cleaning, Just run this!\n",
        "SFS_data <- \n",
        "        SFS_data %>%\n",
        "        mutate(agegr = case_when(\n",
        "              age == \"01\" ~ \"Under 30\", #under 20\n",
        "              age == \"02\" ~ \"Under 30\", #20-24\n",
        "              age == \"03\" ~ \"20s\", #25-29\n",
        "              age == \"04\" ~ \"30s\",\n",
        "            age == \"05\" ~ \"30s\",\n",
        "              age == \"06\" ~ \"40s\",\n",
        "              age == \"07\" ~ \"40s\",\n",
        "              age == \"08\" ~ \"50s\",\n",
        "              age == \"09\" ~ \"50s\",\n",
        "              age == \"10\" ~ \"60s\", #60-64\n",
        "              age == \"11\" ~ \"Above 65\", #65-69\n",
        "              age == \"12\" ~ \"Above 65\", #70-74\n",
        "              age == \"13\" ~ \"Above 75\", #75-79\n",
        "              age == \"14\" ~ \"Above 75\", #80 and above\n",
        "              )) %>%\n",
        "        mutate(agegr = as_factor(agegr))\n",
        "\n",
        "SFS_data$agegr <- relevel(SFS_data$agegr, ref = \"Under 30\") #Set \"Under 30\" as default factor level"
      ],
      "id": "b93a884a-212b-4f9d-94fd-9682f9c7a464"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let’s restrict the sample to main working groups. Just run the following\n",
        "line."
      ],
      "id": "ebb096d2-11ab-483a-a1b3-4ba905adf7ed"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "SFS_data2 <- subset(SFS_data, agegr == \"20s\" | agegr == \"30s\" | agegr == \"40s\" | agegr == \"50s\" | agegr == \"60s\" )"
      ],
      "id": "1dfe88d0-c815-43ee-8bbd-018ca0e5ffa4"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Quiz 3\n",
        "SFS_data2$agegr <- relevel(SFS_data2$agegr, ref = \"20s\") #Do not change; makes \"20s\" the reference level for age\n",
        "\n",
        "reg3A <- lm(???, data = SFS_data2)\n",
        "\n",
        "summary(reg3A)\n",
        "\n",
        "test_2() #quiz3"
      ],
      "id": "5ecbd4b7-281e-45b7-972c-7d493854f8ed"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Short Answer 5\n",
        "\n",
        "<em>Prompt:</em> What is the relationship between age and wages? Between\n",
        "sex and earnings? Is there a significant wage gap? Why might the\n",
        "regression above not give us the “full picture” of the sex wage gap?\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "506af4b4-596f-43a1-a645-83c809c3c4b5"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_5 <- #fill in your short answer"
      ],
      "id": "49bc246e-a3aa-4faf-ab89-7ed6a6a5ee9d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Estimate the relationship between wages and age for male-lead households\n",
        "and female-lead households separately, then compare their returns to\n",
        "age. Let’s continue to use `income_before_tax`\n",
        "\n",
        "<em>Tested objects:</em> `reg3M` (for males), `reg3F` (for females)."
      ],
      "id": "c63ad975-1e66-4f4b-8aa6-0e2ad6299254"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Quiz 4\n",
        "reg3M <- lm(..., data = filter(SFS_data2, gender == \"Male\"))\n",
        "reg3F <- lm(..., data = filter(SFS_data2, gender == \"Female\"))\n",
        "\n",
        "summary(reg3M)\n",
        "summary(reg3F)\n",
        "\n",
        "test_3()\n",
        "test_4() #quiz4"
      ],
      "id": "d1974285-9a36-4cb4-b310-714deb172624"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Short Answer 6\n",
        "\n",
        "<em>Prompt:</em> Do these regression estimates support your argument\n",
        "from Short Answer 5? Explain.\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "4f4646a8-60f0-4327-a8d9-d2ba482d1adc"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_6 <- #fill in your short answer"
      ],
      "id": "0d317ed9-2786-4b1d-957e-b2de20c5bc58"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Add one additional term to the multiple regression from Quiz 3 that\n",
        "accounts for the possibility that the sex wage gap can change as workers\n",
        "age. Please list gender before age.\n",
        "\n",
        "<em>Tested Objects:</em> `reg4`."
      ],
      "id": "2647b61b-c6c9-47e5-9a99-fb72fc724c10"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Quiz 5\n",
        "reg4 <- lm(???, data = SFS_data2)\n",
        "\n",
        "summary(reg4)\n",
        "\n",
        "test_5() #quiz5"
      ],
      "id": "887ff5f0-7736-4814-bd99-bf6588516a7a"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Short Answer 7\n",
        "\n",
        "<em>Prompt:</em> According to the regression you estimated above, what\n",
        "is the nature of the sex wage gap?\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "01f37d55-b887-47d1-84a1-6b82958928c9"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_7 <- #fill in your short answer"
      ],
      "id": "0707a18f-f55e-4db7-be83-72216534159e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Theoretical Activity 1\n",
        "\n",
        "Suppose that a team of researchers is interested in the relationship\n",
        "between the price of a popular children’s chocolate brand (let’s call it\n",
        "“Jumbo Chocolate Egg”) and its demand. The team conducts a series of\n",
        "surveys over a five-year period where they ask 200 households in a\n",
        "Vancouver neighborhood to report how many packs of Jumbo Chocolate Egg\n",
        "they bought in each quarter of the year. The company that produces Jumbo\n",
        "Chocolate Egg is interested in estimating the price elasticity of demand\n",
        "for their chocolate, so they changed the price of a pack of chocolate\n",
        "each quarter over this period. This survey is voluntary - the team went\n",
        "door-to-door to gather the data, and people could refuse to participate.\n",
        "\n",
        "After they compile a dataset from the survey responses, the team\n",
        "estimates this model:\n",
        "\n",
        "$$\n",
        "Q_i^2 = \\alpha_1 + \\alpha_2 ln(P_i) + \\alpha_3 H_i + \\epsilon_i\n",
        "$$\n",
        "\n",
        "$Q_i$ denotes the quantity of chocolate packs that household i purchased\n",
        "in a given quarter of a given year. That is, each quarter for a given\n",
        "household is a separate observation. $P_i$ is the price of the pack of\n",
        "chocolate in the given quarter, and $H_i$ is the household size (in\n",
        "number of people). Note that $\\hat{\\alpha_2}$ is <em>supposed to be</em>\n",
        "the estimated elasticity of demand.\n",
        "\n",
        "You join the team as a research advisor - in other words, you get to\n",
        "criticize their project and get paid doing so. Sounds great, but you\n",
        "have a lot of work ahead.\n",
        "\n",
        "### Short Answer 8\n",
        "\n",
        "<em>Prompt:</em> Are there any omitted variables that the team should be\n",
        "worried about when estimating the model? Give 2 examples of such\n",
        "variables if so, and explain how each variable’s omission could affect\n",
        "the estimated elasticity of demand using the formula that we discussed\n",
        "in class.\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "a9a1cb0f-e4f8-4256-81c0-833b74bd8568"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_8 <- #fill in your short answer"
      ],
      "id": "159c1ea0-7baf-4967-840c-d2101ba25300"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Short Answer 9\n",
        "\n",
        "<em>Prompt:</em> Is there anything wrong with the specification of the\n",
        "regression model? If so, explain how to correct it; if not, explain why\n",
        "the specification is correct.\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "892ae850-b213-4e8e-9744-3d1be740e475"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_9 <- #fill in your short answer"
      ],
      "id": "e92c955b-7ef7-4df3-9d19-88273f3a9bbe"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Short Answer 10\n",
        "\n",
        "<em>Prompt:</em> Is there any potential for sample selection bias in\n",
        "this study? Explain by referencing specific aspects of the experiment.\n",
        "What effect might this bias have on the estimated elasticity of demand?\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "82a367c0-ab96-40e0-b000-ecf4d3c02d0d"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_10 <- #fill in your short answer"
      ],
      "id": "57cf4b7e-39c6-4689-a44d-87e7b91b16ee"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Short Answer 11\n",
        "\n",
        "<em>Prompt:</em> A member of your team writes in the research report\n",
        "that “this estimated elasticity of demand tells us about the preferences\n",
        "of consumers around Canada.” Do you have an issue with this statement?\n",
        "Why or why not?\n",
        "\n",
        "<font color = \"red\">Answer here in red</font>"
      ],
      "id": "c440531e-a093-4a05-b579-39765728a514"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "answer_11 <- #fill in your short answer"
      ],
      "id": "93ce4841-3fcc-4226-a94f-83607f63bc1b"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}