{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 12 - Dummy Variables and Interactions\n",
        "\n",
        "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt  \n",
        "2024-05-29\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "1.  Importing data into R.\n",
        "2.  Examining data using `glimpse`.\n",
        "3.  Creating new variables in R.\n",
        "4.  Conducting linear regression analysis.\n",
        "\n",
        "## Learning Outcomes\n",
        "\n",
        "1.  Understand when a dummy variable is needed in analysis.\n",
        "2.  Create dummy variables from qualitative variables with two or more\n",
        "    categories.\n",
        "3.  Interpret coefficients on a dummy variable from an OLS regression.\n",
        "4.  Interpret coefficients on an interaction between a numeric variable\n",
        "    and a dummy variable from an OLS regression.\n",
        "\n",
        "## 12.1 Introduction to Dummy Variables for Regression Analysis\n",
        "\n",
        "We first took a look at dummy variables in [Module\n",
        "5](https://comet.arts.ubc.ca/docs/Research/econ490-r/05_Creating_Variables.html).\n",
        "There, we discussed both how to interpret and how to generate this type\n",
        "of variable. If you are unsure about what dummy variables measure,\n",
        "please make sure to review that module.\n",
        "\n",
        "Here we will discuss including qualitative variables as explanatory\n",
        "variables in a linear regression model as dummy variables.\n",
        "\n",
        "Imagine that we want to include a new explanatory variable in our\n",
        "multivariate regression from [Module\n",
        "10](https://comet.arts.ubc.ca/docs/Research/econ490-r/10_Linear_Reg.html)\n",
        "that indicates whether an individual is identified as female. To do\n",
        "this, we need to include a new dummy variable in our regression.\n",
        "\n",
        "For this module, we again will be using the fake data set. Recall that\n",
        "this data is simulating information for workers in the years 1982-2012\n",
        "in a fake country where a training program was introduced in 2003 to\n",
        "boost their earnings."
      ],
      "id": "922cca78-9028-4c08-94a1-3bb2c167886c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Clear the memory from any pre-existing objects\n",
        "rm(list=ls())\n",
        "\n",
        "# loading in our packages\n",
        "library(tidyverse) #This includes ggplot2! \n",
        "library(haven)\n",
        "library(IRdisplay)\n",
        "\n",
        "#Open the dataset \n",
        "fake_data <- read_dta(\"../econ490-stata/fake_data.dta\")\n",
        "\n",
        "# inspecting the data\n",
        "glimpse(fake_data)"
      ],
      "id": "83669d5c-cfe1-4dab-a3fd-60d15563a73d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let’s generate a variable that takes the log of *earnings*, as we did\n",
        "for our regression in the previous module."
      ],
      "id": "d6488c3c-c590-46f2-a6b8-0b223bbe61fc"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "fake_data <- fake_data %>%\n",
        "        mutate(log_earnings = log(earnings)) #the log function"
      ],
      "id": "7bf810fc-8708-4be6-90a2-2a37eaa5a6fc"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let’s take a look at the data."
      ],
      "id": "4f6279d4-99ca-4fd9-949c-41dc7c854d08"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "glimpse(fake_data)"
      ],
      "id": "69f6a440-3080-4596-bd10-8e82319ca197"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As expected, *logearnings* is a quantitative variable showing the\n",
        "logarithm of each value of *earnings*. We observe a variable named\n",
        "*sex*, but it doesn’t seem to be coded as a numeric variable. Notice\n",
        "that next to sex it says `<chr>`.\n",
        "\n",
        "As expected, *sex* is a string variable and is not numeric. We cannot\n",
        "use a string variable in a regression analysis; we have to create a new\n",
        "variable which indicates the sex of the individual represented by the\n",
        "observation in numeric form.\n",
        "\n",
        "A dummy variable is a numeric variable that takes either the value of 0\n",
        "or 1 depending on a condition. In this case, we want to create a\n",
        "variable that equals 1 whenever a worker is identified as “female”. A\n",
        "very simple way to create different categories for a variable in R is to\n",
        "use the `as.factor()` function."
      ],
      "id": "40c6bca5-373f-4b44-a384-87bc25af90f9"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "as.factor(fake_data$sex)"
      ],
      "id": "95baa35c-72ba-4840-868f-fb27303a5da4"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 12.2 Interpreting the Coefficient on Dummy Variables\n",
        "\n",
        "Whenever we interpret the coefficient on a dummy variable in a\n",
        "regression, we are making a direct comparison between the 1-category and\n",
        "the 0-category for that dummy. In the case of this *female* dummy, we\n",
        "are directly comparing the mean earnings of female-identified workers\n",
        "against the mean earnings of male-identified workers.\n",
        "\n",
        "Let’s consider the regression below."
      ],
      "id": "0a111f6d-2d16-4c2c-b833-9262d184c55b"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "lm(data=fake_data, log_earnings ~ as.factor(sex))"
      ],
      "id": "19e0e983-5813-41cb-88c7-890b1d0cdea1"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Notice that the regression by default used females as the reference\n",
        "point and only estimated a male premium. Typically, we want this to be\n",
        "the other way around. To change the reference group we write the code\n",
        "below."
      ],
      "id": "b889180e-9e6c-4367-a112-05332921ad5a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Change reference level\n",
        "fake_data = fake_data %>% mutate(female = relevel(as.factor(sex), \"M\"))"
      ],
      "id": "9491e7e1-c74f-413b-ac34-c471bd75c52c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "summary(lm(data=fake_data, log_earnings ~ female))"
      ],
      "id": "76afc4fa-0a7a-43e8-afa7-b70b4e192292"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We remember from [Module\n",
        "10](https://comet.arts.ubc.ca/docs/Research/econ490-r/10_Linear_Reg.html)\n",
        "that “\\_cons” is the constant $β_0$, and we know that here\n",
        "$β_0 = E[logearnings_{i}|female_{i}=0]$. Therefore, the results of this\n",
        "regression suggest that, on average, males have log-earnings of 10.8. We\n",
        "also know from the [Module\n",
        "10](https://comet.arts.ubc.ca/docs/Research/econ490-r/10_Linear_Reg.html)\n",
        "that\n",
        "\n",
        "$$\n",
        "\\beta_1 = E[logearnings_{i}|female_{i}=1]- E[logearnings_{i}|female_{i}=0].\n",
        "$$\n",
        "\n",
        "The regression results here suggest that female-identified persons earn\n",
        "on average 0.55 less than male-identified persons. As a result,\n",
        "female-identified persons earn on average 10.8 - 0.55 = 10.25.\n",
        "\n",
        "In other words, the coefficient on the female variable shows the mean\n",
        "difference in log-earnings relative to males. $\\hat{β}_1$ thus provides\n",
        "the measure of the raw gender gap.\n",
        "\n",
        "**Note:** We are only able to state this result because the p-value for\n",
        "both $\\hat{β}_0$ and $\\hat{β}_1$ is less than 0.05, allowing us to\n",
        "reject the null hypothesis that $β_0 = 0$ and $β_1 = 0$ at 95%\n",
        "confidence level.\n",
        "\n",
        "The interpretation remains the same once we control for more variables,\n",
        "although it is *ceteris paribus* the other observables now also included\n",
        "in the regression. An example is below."
      ],
      "id": "ca5bc08e-45cd-434d-9664-acf9a2366f0d"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "summary(lm(data=fake_data, log_earnings ~ female + age))"
      ],
      "id": "94116407-49f9-4a5f-b9b4-9fb035ab1b1b"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this case, among people that are the same age (i.e., holding *age*\n",
        "constant), the gender gap is (not surprisingly) slightly smaller than in\n",
        "our previous regression. That is expected, since previously we compared\n",
        "all females to all males, irrespective of the composition of age groups\n",
        "in those two categories of workers. As we control for age, we can see\n",
        "that the effect of *sex* decreases.\n",
        "\n",
        "## 12.3 Dummy Variables with Multiple Categories\n",
        "\n",
        "The previous section also holds when there is a variable with multiple\n",
        "categories, as is the case for region."
      ],
      "id": "725d5646-f859-43f7-9c37-fc7b557dfa91"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "lm(data=fake_data, log_earnings ~ as.factor(region))"
      ],
      "id": "d0027bfe-2e7c-451e-a6d7-49a9226566b7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Notice that the sum of the five dummies in any row is equal to 1. This\n",
        "is because every worker is located in only one region. If we included\n",
        "all of the regional dummies in a regression, we would introduce the\n",
        "problem of **perfect collinearity**: the full set of our dummy variables\n",
        "are perfectly correlated with one another. Think about it this way - if\n",
        "a person is in region 1 (*regdummy1* = 1), then we know that that person\n",
        "is not in region 2 (*regdummy2* = 0). Therefore being in region 1\n",
        "perfectly predicts **not** being in region 2.\n",
        "\n",
        "We must always exclude one of the dummies. Failing to do so means\n",
        "falling into the **dummy variable trap** of perfect collinearity\n",
        "described above. Essentially, if we include all of the five dummy\n",
        "variables, the fifth one will not add any new information. This is\n",
        "because, using the four other dummies, we can perfectly deduce whether a\n",
        "person is in region 5 (*regdummy5* = 1). To avoid this, we have to\n",
        "choose one region to serve as a base level for which we will not define\n",
        "a dummy. This dummy variable that we exclude will be the category of\n",
        "reference, or base level, when interpreting coefficients in the\n",
        "regression. This means that the coefficient on each region dummy\n",
        "variable will be comparing the mean earnings of people in that region to\n",
        "the mean earnings of people in the one region excluded.\n",
        "\n",
        "We have actually already seen this approach in action in the first\n",
        "regression we ran above; there, we didn’t add a separate dummy variable\n",
        "for “male”. Instead, we excluded the male dummy variable and interpreted\n",
        "the coefficient on *female* as the difference between female and male\n",
        "log-earnings.\n",
        "\n",
        "You may have noticed that R drops the first region dummy (region = 1)\n",
        "and includes dummy variables for the regions 2 - 5.\n",
        "\n",
        "We can use the same trick as the previous section to change the\n",
        "reference group! Let’s change the reference group to 3."
      ],
      "id": "e7bcd6fb-c5d2-48d2-baf7-4f23157213f7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "fake_data <- fake_data %>% mutate(region = relevel(as.factor(region), 3))"
      ],
      "id": "724d222e-49e3-45e1-8cae-5beba2d05521"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "summary(lm(data = fake_data, log_earnings ~ region))"
      ],
      "id": "0272db17-4095-4907-b3b2-2f02f360805e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "When interpreting the coefficients in the regression above, our\n",
        "intercept is again the mean log-earnings among those in the base level\n",
        "category, i.e. those for which all dummies in the regression are 0;\n",
        "here, that is the mean log-earnings for all people in region 3. Each\n",
        "individual coefficient gives the difference in average log-earnings\n",
        "among people in that region and in region 3. For instance, the mean\n",
        "log-earnings in region 1 are about 0.012 higher than in region 3 and the\n",
        "mean log-earnings in region 2 are about 0.017 lower than in region 3.\n",
        "Both of these differences are statistically significant at a high level\n",
        "(\\> 99%).\n",
        "\n",
        "We can also use this logic of interpretation to compare mean\n",
        "log-earnings between the non-reference groups. For example, the meaning\n",
        "log-earnings in region 3 are given by the intercept coefficient: about\n",
        "10.49. Since the mean log-earnings in region 1 are about 0.012 higher\n",
        "than this, they must be about 10.49 + 0.012 = 10.502. In region 2, the\n",
        "mean log-earnings are about 10.49 - 0.017 = 10.473. We can thus conclude\n",
        "that the mean log-earnings in region 1 are about 10.502 - 10.473 = 0.029\n",
        "higher than in region 2. In this way, we compared the levels of the\n",
        "dependent variable for 2 dummy variables, neither of which are in the\n",
        "reference group excluded from the regression. We could have much more\n",
        "quickly compared the levels of these groups by comparing their\n",
        "deviations from the base group. Region 1 has mean log-earnings about\n",
        "0.012 above the reference level, while region 2 has mean log-earnings\n",
        "about 0.017 below this same reference level; thus, region 1 has mean\n",
        "log-earnings about 0.012 - (-0.017) = 0.029 above region 2.\n",
        "\n",
        "## 12.4 Interactions\n",
        "\n",
        "It is an established fact that a wage gap exists between male and female\n",
        "workers. However, it is possible that the wage gap changes depending on\n",
        "the age of the workers. For example, female and male high school\n",
        "students tend to work minimum wage jobs; hence, we might believe that\n",
        "the wage gap between people within the 15-18 age bracket is very small.\n",
        "Conversely, once people have the experience to start looking for better\n",
        "paying jobs, we might believe the wage gap starts to increase, meaning\n",
        "that this gap might be much larger in higher age brackets. The way to\n",
        "capture that differential effect of age across males and females is to\n",
        "create a new variable that is the product of the female dummy and age.\n",
        "\n",
        "**Warning:** Whenever we do this, it is **very important** that we also\n",
        "include both the female dummy and age as control variables.\n",
        "\n",
        "Luckily, by simply regressing *log_earnings* on our interaction term,\n",
        "*female \\* age*, R automatically generates dummy variables for all\n",
        "female and age categories without inducing the dummy variable trap."
      ],
      "id": "6cb80f08-d1ab-47d8-8370-671d658856d6"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "summary(lm(data=fake_data, log_earnings ~ female * age))"
      ],
      "id": "44f8d028-d1d4-4515-b4d3-b5af43067f81"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From the coefficient on *female*, we can see that, on average, people\n",
        "who are identified as female earn about 0.27 less than those identified\n",
        "as male, holding age constant. We can also see, from the coefficient on\n",
        "*age*, that each additional year of age increases log-earnings by about\n",
        "0.013 for the reference category (males). Looking at the coefficient on\n",
        "our interaction term, this effect of age on log-earnings is lower for\n",
        "females by 0.007, meaning that an extra year of age increases\n",
        "log-earnings for women by about 0.013 + (-0.007) = 0.006. It thus seems\n",
        "that our theory is correct: the wage gap between males and females of\n",
        "the same age increases as they get older. For men and women who are both\n",
        "20, an extra year will be associated with the man earning a bit more\n",
        "than the woman on average. However, if the man and woman are both 50, an\n",
        "extra year will be associated with the man earning much more than the\n",
        "woman on average (or at least out-earning her by much more than before).\n",
        "We can also see from the statistical significance of the coefficient on\n",
        "our interaction term that it was worth including!\n",
        "\n",
        "Try this yourself below with the set of region dummies we created above,\n",
        "and think about what these results mean!"
      ],
      "id": "8774a883-3616-4758-8b58-5e21986dc730"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "summary(lm(data=fake_data, log_earnings ~ female * region))"
      ],
      "id": "2292248d-fb3a-4248-8cb8-a2d95d764835"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 12.5 Wrap Up\n",
        "\n",
        "There are very few empirical research projects using micro data that do\n",
        "not require researchers to use dummy variables. Important qualitative\n",
        "measures such as marital status, immigration status, occupation,\n",
        "industry, and race always require that we use dummy variables. Other\n",
        "important variables such as education, income, age and number of\n",
        "children often require us to use dummy variables even when they are\n",
        "sometimes measured using ranked categorical variables. For example, we\n",
        "could have a variable that measures years of education which is included\n",
        "as a continuous variable. However, we might instead want to include a\n",
        "variable that indicates if the person has a university degree. If that\n",
        "is the case we can use `as.factor()` to create a dummy variable\n",
        "indicating that level of education!\n",
        "\n",
        "Even empirical research projects that use macro data sometimes require\n",
        "that we use dummy variables. For example, we might have a data set that\n",
        "measures macro variables for African countries with additional\n",
        "information about historic colonization. We might want to create a dummy\n",
        "variable that indicates the origin of the colonizers, and then include\n",
        "that in our analysis to understand that effect. As another example, we\n",
        "might have a time series data set and want to indicate whether or not a\n",
        "specific policy was implemented in a certain time period. We will need a\n",
        "dummy variable for that, and can include it in our analysis using the\n",
        "same process described above. Finally, we can use interaction terms to\n",
        "capture the effect of one variable on another if we believe that it\n",
        "varies between groups. If the coefficient on this interaction term is\n",
        "statistically significant, it can justify this term’s inclusion in our\n",
        "regression. This impacts our interpretation of coefficients in the\n",
        "regression.\n",
        "\n",
        "Create dummy variables and/or interaction terms with any data set that\n",
        "you have downloaded in R as you see fit. You will find that this\n",
        "approach is not complicated, but has the power to yield meaningful\n",
        "results!\n",
        "\n",
        "## 12.6 Wrap-up Table\n",
        "\n",
        "| Command | Function |\n",
        "|----------------------------------|--------------------------------------|\n",
        "| `as.factor(data$varname)` | It automatically creates different categories for a variable. |\n",
        "| `relevel(data$varname, reference_level)` | It changes the reference level for a set of dummy variables. |\n",
        "| `lm(data, depvar ~ var1 * var2))` | It adds an interaction term between *var1* and *var2* as well as *var1* and *var2* themselves to the regression. |"
      ],
      "id": "2180d070-9c1c-4f72-adf0-7b2b5de55a22"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}