{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.5 - Intermediate - Interactions and Non-linear Terms\n", "\n", "COMET Team
*Emrul Hasan, Jonah Heyl, Shiming Wu, William Co,\n", "Jonathan Graves* \n", "2022-12-08\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Multiple regression\n", "- Simple regression\n", "- Data analysis and introduction\n", "\n", "### Outcomes\n", "\n", "In this worksheet, you will learn:\n", "\n", "- How to incorporate interaction terms into a regression analysis\n", "- How to interpret models with interaction terms\n", "- How to create models which include non-linear terms\n", "- How to compute simple marginal effects for models with non-linear\n", " terms\n", "- How to explain polynomial regressions as approximations to a\n", " non-linear regression function\n", "\n", "### References\n", "\n", "- Statistics Canada, Survey of Financial Security, 2019, 2021.\n", " Reproduced and distributed on an “as is” basis with the permission\n", " of Statistics Canada. Adapted from Statistics Canada, Survey of\n", " Financial Security, 2019, 2021. This does not constitute an\n", " endorsement by Statistics Canada of this product.\n", "\n", "- The Stargazer package is due to: Hlavac, Marek (2022). stargazer:\n", " Well-Formatted Regression and Summary Statistics Tables. R package\n", " version 5.2.3.\n", " https://cran.r-project.org/web/packages/stargazer/index.html" ], "id": "0857a2fa-729b-43c9-bbc0-2185e8a5f474" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(tidyverse)\n", "library(haven)\n", "library(dplyr)\n", "library(scales)\n", "library(stargazer)\n", "library(car)\n", "\n", "source(\"intermediate_interactions_and_nonlinear_terms_functions.r\")\n", "source(\"intermediate_interactions_and_nonlinear_terms_tests.r\")\n", "\n", "SFS_data <- read_dta(\"../datasets_intermediate/SFS_2019_Eng.dta\")\n", "SFS_data <- clean_up_data()" ], "id": "8d4e0c31-892e-41b6-afe3-8fba90fec90c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Interactions in Regression Models\n", "\n", "One of the most common extensions to multiple regression models is to\n", "include **interaction** terms. What is an interaction term? It’s\n", "basically a term which represents the product of two (or more) variables\n", "in a model.\n", "\n", "For example, if we have a dummy variable for being a female ($F_i$) and\n", "a dummy variable for having a university degree ($D_i$), the\n", "*interaction* of these two variables is the product $D_i \\times F_i$.\n", "This can seem complicated but it also has a simple interpretation: it is\n", "now a dummy for being *both* a female and having a university degree.\n", "You can see why this is true:\n", "\n", "$$ \n", "D_i \\times F_i = 1 \\iff D_i = 1 \\text{ and } F_i = 1\n", "$$\n", "\n", "This is why these terms are so important for understanding regressions:\n", "they provide us with a simple way to describe and study how\n", "*combinations* of our explanatory variables impact our model. These\n", "variables enter into our regression models in exactly the same way as\n", "usual:\n", "\n", "$$\n", "Y_i = \\beta_0 + \\beta_1 F_i + \\beta_2 D_i + \\beta_3 D_i \\times F_i + \\epsilon_i\n", "$$\n", "\n", "At this point, you can see that this is just a multiple regression\n", "model - the only difference is that one of the variables is a\n", "combination of the other variables. From an estimation perspective,\n", "there’s no issue - you can use OLS to estimate a model with interaction\n", "terms, just like simple regressions.
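\n", "\n", "To see the product-dummy claim above concretely, here is a tiny sketch using hypothetical 0/1 vectors (not the SFS data):\n", "\n", "    female <- c(0, 1, 0, 1)   # hypothetical dummy playing the role of F_i\n", "    degree <- c(0, 0, 1, 1)   # hypothetical dummy playing the role of D_i\n", "    female * degree           # equals 1 only when both dummies equal 1\n", "\n", "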
However, as we have seen, there are\n", "important differences when it comes to the *interpretation* of these\n", "models. Let’s learn more about this in this worksheet.\n", "\n", "There are (in general) two ways to create interactions in R: (i)\n", "manually (i.e. creating a new variable which is $D_i \\times F_i$ then\n", "adding it to the regression), or (ii) using the built-in tools in R.\n", "However, method (i) is a trap! You should never use this method. Why?\n", "There are two reasons:\n", "\n", "1. The main reason is that R (and you, the analyst) lose track of the\n", " relationship between the created interaction variable and the\n", " underlying variables. This means that you can’t use other tools to\n", " analyze this relationship (there are many packages such as `margins`\n", " which allow you to investigate complex interactions) which is a big\n", " loss. You also can’t perform post-regression analysis on the\n", " underlying variables in a simple way anymore.\n", "2. The second reason is that it’s easy to make mistakes. You might\n", " define the interaction incorrectly (possible!). However, it’s more\n", " of an issue if later on you change the underlying variables and then\n", " forget to re-compute the interactions. It also makes your code\n", " harder to read.\n", "\n", "Bottom line: don’t do it. Interactions in R are easy to create: you\n", "simply use the `:` or `*` operator when defining an interaction term.\n", "\n", "- The `:` operator creates the interaction(s) of the two variables in\n", " question\n", "- The `*` operator creates the interaction(s) *and* the main effects\n", " of the variables as well\n", "\n", "Even better: if you are interacting two qualitative (factor) variables,\n", "it will automatically “expand” the interaction into every possible\n", "combination of the variables. A lot less work!\n", "\n", "For example, let’s look at a regression model which interacts gender and\n", "education. Before we run the regression, let’s first summarize education\n", "into ‘university’ and ‘non-university’." ], "id": "854cb952-fd3d-4bf8-b473-014a9a6255a1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- SFS_data %>% # creates an Education dummy variable\n", " mutate( \n", " Education = case_when(\n", " education == \"University\" ~ \"University\", # the ~ separates the original from the new name\n", " education == \"Non-university post-secondary\" ~ \"Non-university\",\n", " education == \"High school\" ~ \"Non-university\",\n", " education == \"Less than high school\" ~ \"Non-university\")) %>%\n", " mutate(Education = as_factor(Education)) # remember, it's a factor!" ], "id": "698cd353-8ece-4896-902a-951504783ce9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression1 <- lm(wealth ~ gender + Education + gender:Education, data = SFS_data)\n", "\n", "# regression1 <- lm(wealth ~ gender*Education, data = SFS_data) # an alternative way to run the same regression\n", "\n", "summary(regression1)" ], "id": "de9fb320-4802-4863-a0d6-096be8ce5262" }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few important things to notice about this regression result.\n", "First, take a close look at the terms:\n", "\n", "- `genderFemale`: this is the main effect for being a female. You\n", " *might* immediately say that this is the impact of being a female -\n", " but this is *not true*. Why?
Because female shows up in two places!\n", " We have to be a little more careful - this is the effect of being a\n", " female in the base group (non-university).\n", "- `genderFemale:EducationUniversity`: this is the interaction effect of\n", " being a female and having a university degree. Combined with the main\n", " effect, a family whose main earner is a female with a university degree\n", " accumulates $143,396 + 324,112 = 467,508$ less wealth, compared with its\n", " male counterpart.\n", "\n", "You can see this interpretation in the regression model itself:\n", "\n", "$$\n", "W_i = \\beta_0 + \\beta_1 F_i + \\beta_2 D_i + \\beta_3 F_i \\times D_i + \\epsilon_i\n", "$$\n", "\n", "Consider:\n", "\n", "$$\n", "\\frac{\\Delta W_i}{\\Delta F_i} = \\beta_1 + \\beta_3 D_i\n", "$$\n", "\n", "The marginal effect of being a female-led household *changes* depending\n", "on what the value of $D_i$ is! For a non-university degree (the level\n", "where $D_i = 0$) it’s $\\beta_1$. For a university degree (the level where\n", "$D_i = 1$), it’s $\\beta_1 + \\beta_3$. This is why, in an interaction\n", "model, it doesn’t really make sense to talk about the “effect of\n", "female” - because there isn’t a single, immutable effect. It is\n", "different for different education levels!\n", "\n", "You can talk about the *average* effect, which is just\n", "$\\beta_1 + \\beta_3 \\bar{D_i}$ - but that’s not really what people are\n", "asking about when they are discussing the gender effect, in general.\n", "\n", "This is why it’s very important to carefully think about a regression\n", "model with interaction terms - the model may seem simple to estimate,\n", "but the interpretation is more complex.\n", "\n", "### Interactions with Continuous Variables\n", "\n", "So far, we have just looked at interacting qualitative variables - but\n", "you can interact any combination of variable types!\n", "\n", "- Qualitative-Qualitative\n", "- Qualitative-Quantitative\n", "- Quantitative-Quantitative\n", "\n", "The format and syntax in R are similar, with some small exceptions to\n", "deal with certain combinations of variables. However (again), you do\n", "need to be careful with interpretation.\n", "\n", "For example, let’s look at the interaction of income and sex on wealth.\n", "In a regression equation, this would be expressed like:\n", "\n", "$$\n", "W_i = \\beta_0 + \\beta_1 Income_i + \\beta_2 F_i + \\beta_3 Income_i \\times F_i + \\epsilon_i\n", "$$\n", "\n", "Notice that, just like before:\n", "\n", "$$\n", "\\frac{\\partial W_i}{\\partial Income_i} = \\beta_1 + \\beta_3 F_i\n", "$$\n", "\n", "There are two *different* “slope” coefficients; basically, male-led and\n", "female-led families can have a different rate of wealth accumulation per\n", "dollar of income. Let’s see this in R:" ], "id": "cedc27b8-2d22-4a20-ad92-a73518fb0965" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression2 <- lm(wealth ~ income_before_tax + gender + income_before_tax:gender, data = SFS_data)\n", "\n", "summary(regression2)" ], "id": "0bdd7974-259e-4586-bc96-a952abe0ecd5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see here, the female-led households in our model accumulate\n", "about 3.946 dollars more in wealth per dollar of income earned than\n", "male-led respondents. But the intercept for female-led households is about\n", "343,300 dollars lower than for their male counterparts.
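\n", "\n", "To make the two slopes concrete, here is a minimal sketch (assuming the coefficient names that R assigns to `regression2`; check `names(coef(regression2))` if in doubt):\n", "\n", "    b <- coef(regression2)                                          # fitted coefficients\n", "    b[\"income_before_tax\"]                                          # slope for male-led households\n", "    b[\"income_before_tax\"] + b[\"income_before_tax:genderFemale\"]    # slope for female-led households\n", "\n", "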
So the overall gap between female-led and\n", "male-led households depends on the level of income before tax.\n", "\n", "This addresses the common problem of estimating a regression model where\n", "you think the impact of a continuous variable might be different across\n", "the two groups. One approach would be to run the model only for men, and\n", "only for women, and then compare - but this isn’t a good idea. Those\n", "regressions have a much smaller sample size, and if you have other\n", "controls in the model, you will “throw away” information. The\n", "interaction method is much better.\n", "\n", "## Part 2: Non-linear Terms in Regression Models\n", "\n", "You might have been puzzled by why these models were called “linear”\n", "regressions. The reason is that they are linear in the\n", "*coefficients*: the dependent variable is expressed as a linear\n", "combination of the explanatory variables.\n", "\n", "This implies that we can use the same methods (OLS) to estimate models\n", "that include linear combinations of *non-linear functions* of the\n", "explanatory variables. We have actually already seen an example of this:\n", "remember using `log` of a variable? That’s a non-linear function!\n", "\n", "As we learned when considering `log`, the most important difference here\n", "is again regarding interpretations, not the actual estimation.\n", "\n", "In R, there is one small complication: when you want to include\n", "mathematical expressions in a model formula, you need to “isolate” them\n", "using the `I()` function. This is because many operations in R, like `+`\n", "or `*`, have a special meaning in a regression model.\n", "\n", "For example, let’s consider a quadratic regression - that is, including\n", "both $Income_i$ and $Income_i^2$ (income squared) in our model." ], "id": "4b6ad70f-60fe-40e5-8c84-ab7ddc25caba" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression3 <- lm(wealth ~ income_before_tax + I(income_before_tax^2), data = SFS_data)\n", "\n", "summary(regression3)" ], "id": "4b3a9ca4-a910-4430-aead-f35360529e5d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, we get regression results much like we would expect.\n", "However, how do we interpret them? The issue is that *income* enters\n", "in two places. We need to carefully interpret this model, using our\n", "knowledge of the equation:\n", "\n", "$$\n", "W_i = \\beta_0 + \\beta_1 Income_i + \\beta_2 Income_i^2 + \\epsilon_i\n", "$$\n", "\n", "$$\n", "\\implies \\frac{\\partial W_i}{\\partial Income_i} = \\beta_1 + 2 \\beta_2 Income_i\n", "$$\n", "\n", "You will notice something special about this; the marginal effect is\n", "*non-linear*. As $Income_i$ changes, the effect of income on $W_i$\n", "changes. This is because we have estimated a quadratic relationship; the\n", "slope of a quadratic changes as the explanatory variable changes. That’s\n", "what we’re seeing here!\n", "\n", "This makes these models relatively difficult to interpret, since the\n", "marginal effects change (often dramatically) as the explanatory\n", "variables change. You frequently need to carefully interpret the model\n", "and often (to get estimates) perform tests on *combinations* of\n", "coefficients, which can be done using things like the `car` package or\n", "the `lincom` function. You can also compute this manually, using the\n", "formula for the variance of a sum.\n", "\n", "For example, let’s test if the marginal effect of income is significant\n", "at $Income_i = \\overline{Income}_i$.
This is the most frequently\n", "reported version of this effect, often called the “marginal effect at\n", "the means”." ], "id": "eb011e98-2e28-46c5-a590-dc8fef98f724" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m <- mean(SFS_data$income_before_tax)\n", "\n", "linearHypothesis(regression3, hypothesis.matrix = c(0, 1, 2*m), rhs=0) " ], "id": "7216528c-0a2b-4c02-a085-e26316f434d6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, it is highly significant.\n", "\n", "> **Think Deeper**: what is the vector `c(0, 1, 2*m)` doing in the above\n", "> expression?\n", "\n", "Let’s see exactly what those values are. Recall the formula:\n", "\n", "$$\n", "V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2abCov(X,Y)\n", "$$\n", "\n", "In our situation, $X = \\hat{\\beta_1}$, $Y = \\hat{\\beta_2}$, $a = 1$, and $b = 2\\overline{Income}_i$, so this is:\n", "\n", "$$\n", "V(\\hat{\\beta_1} + 2\\overline{Income}_i\\hat{\\beta_2}) = V(\\hat{\\beta_1}) + 4\\overline{Income}_i^2V(\\hat{\\beta_2}) + 2(2\\overline{Income}_i)Cov(\\hat{\\beta_1},\\hat{\\beta_2})\n", "$$\n", "\n", "Fortunately, these are all things we have from the regression and its\n", "variance-covariance matrix:" ], "id": "79cdb4ab-10de-471c-bc5f-3dd30a5ec70b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "v <- vcov(regression3)\n", "coefs <- regression3$coefficients\n", "v\n", "\n", "var <- v[2,2] + 4*(m^2)*v[3,3] + 4*m*v[3,2]\n", "\n", "var\n", "\n", "coef <- coefs[[2]] + 2*m*coefs[[3]]\n", "\n", "print(\"Coefficient Combination and SD\")\n", "round(coef,3)\n", "round(sqrt(var),3)" ], "id": "33aa637d-2fb8-43e7-a2f3-817b0a9bcb2d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, this gets fairly technical and is not something you will\n", "want to do without a very good reason. In general, it’s a better idea to\n", "rely on some of the packages written for R that handle this task for the\n", "(specific) model you are interested in evaluating.\n", "\n", "### Aside: Why Polynomial Terms?\n", "\n", "You might be wondering why econometricians spend so much time talking\n", "about models that include polynomial terms, when those are\n", "(realistically) a very small subset of the universe of possible functions\n", "of an explanatory variable (you already know why we talk about `log` so\n", "much!).\n", "\n", "The reason is actually approximation. Consider the following\n", "*non-linear* model:\n", "\n", "$$\n", "Y_i = f(X_i) + e_i\n", "$$\n", "\n", "This model is truly non-linear (and not just in terms of the\n", "parameters). How can we estimate this model? It’s hard! There are\n", "techniques to estimate complex models like this, but how far can we get\n", "with good-old OLS? The answer is - provided that $f$ is “smooth” -\n", "pretty far.\n", "\n", "Think back to introductory calculus; you might remember a theorem called\n", "[Taylor’s Theorem](https://en.wikipedia.org/wiki/Taylor%27s_theorem). It\n", "says that a smoothly differentiable function can be arbitrarily\n", "well-approximated (about a point) by a polynomial expansion:\n", "\n", "$$\n", "f(x) = f(a) + f'(a)(x-a) + \\frac{f''(a)}{2!}(x-a)^2 + \\cdots + \\frac{f^{(k)}(a)}{k!}(x-a)^k + R_k(x)\n", "$$\n", "\n", "and the error term $R_k(x) \\to 0$ as $x \\to a$ and $k \\to \\infty$.\n", "\n", "Look closely at this expression. Most of the terms (like $f'(a)$) are\n", "constants.
In fact, you can show that this can be written like:\n", "\n", "$$\n", "f(x) = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k + r\n", "$$\n", "\n", "Putting this into our expression above gives us the relationship:\n", "\n", "$$\n", "Y_i = \\beta_0 + \\beta_1 X_i + \\beta_2 X_i^2 + \\cdots + \\beta_k X_i^k + \\epsilon_i\n", "$$\n", "\n", "Which is a linear regression model! What this says is actually very\n", "important: linear regression models can be viewed as *approximations* to\n", "nonlinear regressions, provided we have enough polynomial terms. There is\n", "one complication: the error term now absorbs the approximation error, so\n", "it is generally correlated with the regressors. You can\n", "learn more about how to address this issue in other courses, but the\n", "resulting omitted variable bias becomes relatively small as $k \\to \\infty$.\n", "\n", "## Part 3: Exercises\n", "\n", "This section has both written and coding exercises for you to test your\n", "knowledge about interactions and non-linear terms in regression models.\n", "The answers to the written exercises are in the last section of the\n", "notebook.\n", "\n", "### Questions\n", "\n", "Consider the following regression model:\n", "\n", "$$\n", "\\begin{equation}\n", "W_i = \\beta_1 + \\beta_2 F_i + \\beta_3 E_i + \\beta_4 P_i + \\beta_5 F_i\\times E_i + \\beta_6 F_i \\times P_i + \\epsilon_i\n", "\\end{equation}\n", "$$\n", "\n", "where\n", "\n", "- $W_i$ denotes wealth\n", "- $F_i$ is a dummy variable for the gender of the main earner in the\n", " household ($F_i=1$ if female is the main earner)\n", "- $E_i$ is a factor variable for education\n", "- $P_i$ is a factor variable for province\n", "\n", "1. How should we interpret the coefficients $\\beta_5$ and $\\beta_6$?\n", " Why might these effects be important to estimate?\n", "\n", "Now, let’s estimate the model and interpret it concretely. (Please\n", "follow the order of variables in the regression model):\n", "\n", " reg1 <- gender, Education, province\n", "\n", " reg0 <- gender, education, province\n", "\n", "What are your interaction variables? (remember to follow the same order)" ], "id": "e9726cba-b56a-496b-9ed7-1f8a2c80c5cd" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg0 <- lm(???, data = SFS_data)\n", "\n", "reg1 <- lm(???, data = SFS_data)\n", "\n", "summary(reg0)\n", "summary(reg1)\n", "\n", "test_2()\n", "test_3()" ], "id": "5df1852a-36cc-46e7-8d5d-5103048970db" }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. How do we interpret the coefficient estimate on `gender:Education`?\n", " At what education level do female-led households appear to face the\n", " most discrimination? How might we explain this intuitively?\n", "\n", "5. How do you interpret the coefficient estimate on\n", " `genderFemale:provinceAlberta`? (Hint: Write out the average wealth\n", " equations for female, male in Alberta, and female in Alberta\n", " separately.)\n", "\n", "Now let’s test whether the returns to education increase if people are\n", "entrepreneurs. `business` is a factor variable which indicates whether\n", "the household owns a business. Please add terms to the regression\n", "equation that allow us to run this test. Then, estimate this new model.\n", "We don’t need `province` and `gender` variables in this exercise. For\n", "education, please use the `education` variable. And we will continue to\n", "study the wealth accumulated by households."
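, "\n", "As a quick reminder of the interaction syntax (a generic sketch with hypothetical variable names, not the answer to this exercise), the `*` operator includes both main effects and their interaction:\n", "\n", "    # hypothetical variables y, x and z in a hypothetical data frame df\n", "    reg_example <- lm(y ~ x*z, data = df)\n"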
], "id": "80d3a101-fb98-4dcf-8824-6fd59dede1ff" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data$business <- relevel(SFS_data$business, ref = \"No\") # do not change; makes \"not a business owner\" the reference level for business\n", "\n", "reg2 <- lm(???, data = SFS_data)\n", "\n", "summary(reg2)\n", "test_6()" ], "id": "bd79ac6e-0d57-456c-818d-0894b6b35320" }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Do returns to education increase when people are entrepreneurs?\n", " Explain why or why not with reference to the regression estimates.\n", "\n", "A topic that many labour economists are concerned with, and one that we\n", "have discussed before, is the gender-wage gap. In this activity, we will\n", "construct a “difference-in-difference” regression to explore this gap\n", "using `SFS_data2`.\n", "\n", "Suppose that we want to estimate the relationship between age, sex and\n", "wages. Within this relationship, we suspect that women earn less than\n", "men from the beginning of their working lives, ***but this gap does not\n", "change as workers age***.\n", "\n", "Estimate a regression model (with no additional control variables) that\n", "captures this relationship using `SFS_data2`. We will use the\n", "`income_before_tax` variable. Order: list `gender` before `agegr`.\n", "\n", "Tested Objects: `reg3A`\n", "\n", "Let’s first simplify the levels of the age group variable using the following code." ], "id": "30440751-d232-4385-84b8-c49764380769" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# some data cleaning - just run this!\n", "SFS_data <- \n", " SFS_data %>%\n", " mutate(agegr = case_when(\n", " age == \"01\" ~ \"Under 30\", #under 20\n", " age == \"02\" ~ \"Under 30\", #20-24\n", " age == \"03\" ~ \"20s\", #25-29\n", " age == \"04\" ~ \"30s\",\n", " age == \"05\" ~ \"30s\",\n", " age == \"06\" ~ \"40s\",\n", " age == \"07\" ~ \"40s\",\n", " age == \"08\" ~ \"50s\",\n", " age == \"09\" ~ \"50s\",\n", " age == \"10\" ~ \"60s\", #60-64\n", " age == \"11\" ~ \"Above 65\", #65-69\n", " age == \"12\" ~ \"Above 65\", #70-74\n", " age == \"13\" ~ \"Above 75\", #75-79\n", " age == \"14\" ~ \"Above 75\", #80 and above\n", " )) %>%\n", " mutate(agegr = as_factor(agegr))\n", "\n", "SFS_data$agegr <- relevel(SFS_data$agegr, ref = \"Under 30\") #Set \"Under 30\" as default factor level" ], "id": "263918df-5640-4e2a-a65e-062c58880108" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s restrict the sample to the main working-age groups. Just run the following\n", "line." ], "id": "00acbcf2-d838-4a38-b0cf-258dfa807a20" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data2 <- subset(SFS_data, agegr == \"20s\" | agegr == \"30s\" | agegr == \"40s\" | agegr == \"50s\" | agegr == \"60s\" )" ], "id": "84f4c073-42a5-4721-a593-0d8f6463842e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data2$agegr <- relevel(SFS_data2$agegr, ref = \"20s\") # do not change; makes \"20s\" the reference level for age\n", "\n", "reg3A <- lm(???, data = SFS_data2)\n", "\n", "summary(reg3A)\n", "\n", "test_8() " ], "id": "460bea4e-bd1e-4a77-9af4-a93f64625af1" }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. What is the relationship between age and wages? Between sex and\n", " earnings? Is there a significant wage gap?
Why might the regression\n", "above not give us the “full picture” of the sex wage gap?\n", "\n", "Now, estimate the relationship between wages and age for male-led\n", "households and female-led households separately, then compare their\n", "returns to age. Let’s continue to use `income_before_tax`.\n", "\n", "Tested objects: `reg3M` (for males), `reg3F` (for females)." ], "id": "0170a75f-d51d-45c9-a6a9-e53b7f7df625" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg3M <- lm(..., data = filter(SFS_data2, gender == \"Male\"))\n", "reg3F <- lm(..., data = filter(SFS_data2, gender == \"Female\"))\n", "\n", "summary(reg3M)\n", "summary(reg3F)\n", "\n", "test_10()\n", "test_11() " ], "id": "58151698-ae6f-47ec-8eb8-9b097333cfdb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "12. Do these regression estimates support your argument? Explain.\n", "\n", "Add one additional term to the multiple regression that accounts for the\n", "possibility that the sex wage gap can change as workers age. Please list\n", "gender before age.\n", "\n", "Tested Objects: `reg4`." ], "id": "274a5e3f-c41d-4d9b-8e2b-e6de0e625490" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg4 <- lm(???, data = SFS_data2)\n", "\n", "summary(reg4)\n", "\n", "test_13() " ], "id": "8ae14e0c-7c8a-41cb-aaad-09f06b521489" }, { "cell_type": "markdown", "metadata": {}, "source": [ "14. According to the regression you estimated above, what is the nature\n", " of the sex wage gap?\n", "\n", "Now, suppose that a team of researchers is interested in the\n", "relationship between the price of a popular children’s chocolate brand\n", "(let’s call it “Jumbo Chocolate Egg”) and its demand. The team conducts\n", "a series of surveys over a five-year period in which they ask 200\n", "households in a Vancouver neighborhood to report how many packs of Jumbo\n", "Chocolate Egg they bought in each quarter of the year. The company that\n", "produces Jumbo Chocolate Egg is interested in estimating the price\n", "elasticity of demand for their chocolate, so they changed the price of a\n", "pack of chocolate each quarter over this period. This survey is\n", "voluntary - the team went door-to-door to gather the data, and people\n", "could refuse to participate.\n", "\n", "After they compile a dataset from the survey responses, the team\n", "estimates this model:\n", "\n", "$$\n", "Q_i^2 = \\alpha_1 + \\alpha_2 ln(P_i) + \\alpha_3 H_i + \\epsilon_i\n", "$$\n", "\n", "$Q_i$ denotes the quantity of chocolate packs that household $i$\n", "purchased in a given quarter of a given year. That is, each quarter for\n", "a given household is a separate observation. $P_i$ is the price of the\n", "pack of chocolate in the given quarter, and $H_i$ is the household size\n", "(in number of people). Note that $\\hat{\\alpha_2}$ is *supposed to be*\n", "the estimated elasticity of demand.\n", "\n", "You join the team as a research advisor - in other words, you get to\n", "criticize their project and get paid for doing so. Sounds great, but you\n", "have a lot of work ahead.\n", "\n", "15. Are there any omitted variables that the team should be worried\n", " about when estimating the model? If so, give two examples of such\n", " variables, and explain how each variable’s omission could affect the\n", " estimated elasticity of demand.\n", "\n", "16. Is there anything wrong with the specification of the regression\n", " model?
If so, explain how to correct it; if not, explain why the\n", " specification is correct.\n", "\n", "17. Is there any potential for sample selection bias in this study?\n", " Explain by referencing specific aspects of the experiment. What\n", " effect might this bias have on the estimated elasticity of demand?\n", "\n", "18. A member of your team writes in the research report that “this\n", " estimated elasticity of demand tells us about the preferences of\n", " consumers around Canada.” Do you have an issue with this statement?\n", " Why or why not?\n", "\n", "### Solutions\n", "\n", "1. $\\beta_5$ is the difference in the ‘value-added’ of education between\n", " female-led and male-led households. $\\beta_6$ is the\n", " difference in province effects between female-led and\n", " male-led households.\n", "\n", "These interaction terms provide us with a simple way to describe and\n", "study how *combinations* of our explanatory variables impact our model.\n", "With these interaction terms, we can understand how education and\n", "geography affect the wealth of male-led and female-led households\n", "differently.\n", "\n", "2-3. Coding exercises.\n", "\n", "4. It means that if the main earner is a female with a university degree,\n", " the household accumulates 302,008 dollars less wealth than its male\n", " counterpart. University degree holders appear to be the most\n", " discriminated against, because $|\\hat{\\beta_5}|$ is higher than $|\\hat{\\beta_2}|$.\n", "\n", "One possible explanation is that males with a university degree are more\n", "likely than females to become managers or other higher-level employees.\n", "The inequality in incomes leads to the inequality in wealth.\n", "\n", "5. Average wealth for a female-led family:\n", " $E[wealth|female]=\\beta_{1}+\\beta_{2}=455918-87277$\n", "\n", "Average wealth for a male-led family in Alberta:\n", "$E[wealth|male,Alberta]=\\beta_{1}+\\beta_{4,Alberta}=455918+502683$\n", "\n", "Average wealth for a female-led family in Alberta:\n", "$E[wealth|female,Alberta]=\\beta_{1}+\\beta_{2}+\\beta_{4,Alberta}+\\beta_{6,Alberta}=455918-87277+502683-105798$\n", "\n", "Thus $\\beta_{6,Alberta}$ is the difference in the effect of living in\n", "Alberta on wealth between female-led and male-led households.\n", "\n", "6. Coding exercise.\n", "\n", "7. From the above regression results, owning a business increases\n", " wealth by 1.29 million dollars on average. But the estimated difference\n", " in the returns to education between entrepreneurs and non-entrepreneurs\n", " is not significant. Thus we cannot conclude how education affects the\n", " returns of entrepreneurs.\n", "\n", "8. Coding exercise.\n", "\n", "9. Income before tax increases as main earners become\n", " older, and it peaks when they are in their 50s. Female-led households\n", " generally earn 34,852 dollars less than male-led households. This\n", " regression does not provide us with the “full picture” of the sex wage\n", " gap, because the gender gap may change as people age.\n", "\n", "10-11. Coding exercises.\n", "\n", "12. Yes, these regression estimates support our hypothesis. First of\n", " all, for both male-led and female-led households, incomes increase\n", " as main earners become older, and the peak appears in their 50s.\n", " Second, there is a gender-wage gap in which female-led households\n", " earn less. But the gap changes with age.\n", "\n", "13. Coding exercise.\n", "\n", "14. Generally speaking, the gender-income gap increases as workers\n", " become older.
The gender-income gap is widest when workers are in their 40s.\n", "\n", "15. There are omitted variables, for example, the ages of kids in the\n", " household and a dummy variable for the Christmas and New Year season.\n", " Omitting the Christmas and New Year dummy will decrease $|\\hat{\\alpha}_{2}|$,\n", " because people buy more chocolate during Christmas even if the price of\n", " chocolate is high. Controlling for the Christmas dummy will recover the\n", " real elasticity of demand, which is a larger $|\\alpha_{2}|$.\n", "\n", "When kids are young, they tend to consume more chocolate. Thus part of\n", "the variation in consumption comes from the ages of kids. Without\n", "controlling for these ages, the estimated elasticity is not reliable.\n", "\n", "16. The dependent variable should be $ln(Q_{i})$ instead of $Q_{i}^{2}$.\n", "\n", "17. There is potential for sample selection bias, because people could refuse to\n", " participate. The people who responded might be households who\n", " like the Jumbo Chocolate Egg brand, and the estimated elasticity\n", " $|\\hat{\\alpha}_{2}|$ is smaller than the real elasticity among the\n", " general population.\n", "\n", "18. The result cannot represent the whole of Canada, because participants\n", " were from a single neighborhood in Vancouver." ], "id": "334370d4-efe5-4ee1-ae8d-3a68fa35857a" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }