{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.2 - Intermediate - Multiple Regression\n", "\n", "COMET Team
*Emrul Hasan, Jonah Heyl, Shiming Wu, William Co,\n", "Jonathan Graves* \n", "2022-12-08\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Simple regression\n", "- Introduction to R\n", "- Introduction to Jupyter\n", "\n", "### Outcomes\n", "\n",
"- Understand how the theory of multiple regression models works in\n", "  practice\n", "- Be able to estimate multiple regression models using R\n", "- Interpret and explain the estimates from multiple regression models\n", "- Understand the relationship between simple linear regressions and\n", "  multiple regressions\n", "- Describe what a control variable is and its role in a regression\n", "- Explore the relationship between controls and causal interpretations\n", "  of regression model estimates\n", "\n", "### References\n", "\n",
"- Statistics Canada, Survey of Financial Security, 2019, 2021.\n", "  Reproduced and distributed on an “as is” basis with the permission\n", "  of Statistics Canada. Adapted from Statistics Canada, Survey of\n", "  Financial Security, 2019, 2021. This does not constitute an\n", "  endorsement by Statistics Canada of this product.\n", "- Hlavac, Marek (2018). stargazer: Well-Formatted Regression and\n", "  Summary Statistics Tables. R package version 5.2.2.\n", "  https://CRAN.R-project.org/package=stargazer" ], "id": "3c0ce51e-7b10-45f9-998c-4196d23583d3" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(tidyverse) \n", "library(haven)\n", "library(dplyr)\n", "library(stargazer)\n", "\n", "source(\"intermediate_multiple_regression_functions.r\")\n", "source(\"intermediate_multiple_regression_tests.r\")" ], "id": "a1d7f804-dc14-4d03-8c2d-2b73f2ab3b94" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- read_dta(\"../datasets_intermediate/SFS_2019_Eng.dta\")\n", "\n", "## massive data clean-up\n", "SFS_data <- clean_up_sfs(SFS_data) %>%\n", "    filter(!is.na(education)) # renaming things, etc.\n", "\n", "# if you want to see the cleaning code, it's in intermediate_multiple_regression_functions.r" ], "id": "38483758-d467-4f64-ae99-6fbf2eb0bac6" },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Introducing Multiple Regressions\n", "\n", "At this point, you are familiar with the simple regression model and its\n", "relationship to the comparison-of-means $t$-test. However, most\n", "econometric analyses don’t use simple regression. In general, economic\n", "data and models are far too complicated to be summarized with a single\n", "relationship. One of the features of most economic datasets is a\n", "complex, multi-dimensional relationship between different variables.\n", "This leads to the two key motivations for **multiple regression**:\n", "\n",
"- First, it can improve the *predictive* properties of a regression\n", "  model, by introducing other variables that play an important\n", "  econometric role in the relationship being studied.\n", "- Second, it allows the econometrician to *differentiate* the\n", "  importance of different variables in a relationship.\n", "\n", "This second motivation is usually part of **causal analysis**: when we\n", "believe that our model has a cause-and-effect interpretation.\n", "\n", "Let’s see this in practice by plotting the relationship between\n", "`education`, `wealth`, and `gender`.\n", "\n", "First, let’s summarize education into “university” and “non-university”.\n", "Then, let’s calculate log wealth and filter out NaN values."
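, "\n", "\n", "As a quick aside on why we need to filter: in base R, `log(0)` returns\n", "`-Inf` and the logarithm of a negative number returns `NaN`, so any\n", "household with zero or negative net wealth cannot be kept once we take\n", "logs. Here is a minimal sketch with made-up numbers:\n", "\n", "``` r\n", "log(c(100000, 0, -5000)) # returns 11.51, -Inf, and NaN (with a warning)\n", "```"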
], "id": "732c0ca3-e5fb-4fa0-ad08-b8709482a8c1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- SFS_data %>% \n", " mutate( \n", " Education = case_when(\n", " education == \"University\" ~ \"University\", # the ~ seperates the original from the new name\n", " education == \"Non-university post-secondary\" ~ \"Non-university\",\n", " education == \"High school\" ~ \"Non-university\",\n", " education == \"Less than high school\" ~ \"Non-university\")) %>%\n", " mutate(Education = as_factor(Education)) # remember, it's a factor!\n", "\n", "glimpse(SFS_data$Education) # we have now data that only considers if someone has finished university or not" ], "id": "0aacc21d-7af4-4dcf-892a-5180370ad673" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- SFS_data %>%\n", " mutate(lnwealth = log(SFS_data$wealth)) # calculate log" ], "id": "44b065ba-9583-4044-9337-d78f706e30a5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The calculation above gives us NaNs. We solve this by running the code\n", "below." ], "id": "28d99de2-9340-4d2f-b6ff-e59912320584" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data_logged <- SFS_data %>%\n", " filter(income_before_tax > 0) %>% # filters Nans\n", " filter(wealth > 0) # removes negative values" ], "id": "926f8d9b-66c7-444f-b676-83f9e5cbc480" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s look at the following plot, which depicts the relationships\n", "between `wealth`, `gender` and `education`. In the top panel, the colour\n", "of each cell is the (average) log of `wealth`. In the bottom panel, the\n", "size of each circle is the number of households in that combination of\n", "categories." ], "id": "3329a39e-2d29-49ca-90e0-3482524104e4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(repr.plot.width=6,repr.plot.height=4) # controls the image size\n", "\n", "f <- ggplot(data = SFS_data_logged, aes(x = gender, y = Education)) + xlab(\"Gender\") + ylab(\"Education\") # defines x and y\n", "f + geom_tile(aes(fill=lnwealth)) + scale_fill_distiller(palette=\"Set1\") # this gives us fancier colours\n", "\n", "f <- ggplot(data = SFS_data, aes(x = gender, y = Education)) # defines x and y\n", "f + geom_count() # prints our graph" ], "id": "f01e9b27-4fce-4e5e-9248-a46a520862b2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see immediately that there are *three* relationships happening\n", "at the same time:\n", "\n", "1. There is a relationship between `wealth` of households and `gender`\n", " of main earner\n", "2. There is a relationship between `wealth` and `Education`\n", "3. There is a relationship between `gender` and `Education`\n", "\n", "A simple regression can analyze any *one* of these relationships in\n", "isolation, but it cannot assess more than one of them at a time. For\n", "instance, let’s look at these regressions." ], "id": "61ba565c-82ab-4c87-ba26-ec8cd06ae599" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression1 <- lm(data = SFS_data, wealth ~ gender) # the effect of gender on wealth\n", "regression2 <- lm(data = SFS_data, wealth ~ Education) # the effect of education on wealth\n", "\n", "dummy_gender = as.numeric(SFS_data$gender)-1 # what is this line of code doing? 
\n", "# hint, the as.numeric variable treats a factor as a number\n", "# male is 0\n", "\n", "regression3 <- lm(data = SFS_data, dummy_gender ~ Education) # the effect of education on gender\n", "# this is actually be a very important regression model called \"linear probability\"\n", "# we will learn more about it later in the course\n", "\n", "stargazer(regression1, regression2, regression3, title=\"Comparison of Regression Results\",\n", " align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\")) # we will learn more about this command later on!" ], "id": "cf993d4d-64b4-40b1-896f-4823aaefd7be" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem here is that these results tell us:\n", "\n", "- Households with higher education accumulate more wealth (significant\n", " and positive coefficient on `EducationUniversity` in ($2$))\n", "- Among university degrees, the proportion of males is larger than\n", " females, with 42.6%(.38+.046) and 57.4%(1-42.6%) respectively.\n", " (coefficient on `EducationUniversity` in ($3$))\n", "- Families led by females accumulates less wealth than the male\n", " counterparts. (negative and significant coefficient on\n", " `genderFemale` in ($1$))\n", "\n", "This implies that when we measure the gender-wealth gap alone, we are\n", "*indirectly* including part of the education-wealth gap as well. This is\n", "bad; the “true” gender-wealth gap is probably lower, but it is being\n", "increased because men are more likely to have a university degree.\n", "\n", "This is both a practical and a theoretical problem. It’s not just about\n", "the model, it’s also about what we mean when we say “the gender wealth\n", "gap”.\n", "\n", "- If we mean “the difference in wealth between a male and female led\n", " family”, then the simple regression result is what we want.\n", "- However, this ignores all the other reasons that a male could have a\n", " different wealth (education, income, age, etc.)\n", "- If we mean “the difference in wealth between a male and female led\n", " family, holding other factors equal,” then the simple regression\n", " result is not suitable.\n", "\n", "The problem is that “holding other factors” equal is a debatable\n", "proposition. Which factors? Why? These different ways of computing the\n", "gender wealth gap make this topic very complex, contributing to ongoing\n", "debate in the economics discipline and in the media about various kinds\n", "of gaps (e.g., the education wealth gap). We will revisit this in the\n", "exercises.\n", "\n", "### Multiple Regression Models\n", "\n", "When we measure the gender wealth gap, we do not want to conflate our\n", "measurement with the *education wealth gap*. To ensure that these two\n", "different gaps are distinguished, we *must* add in some other variables.\n", "\n", "A multiple regression model simply adds more explanatory ($X_i$)\n", "variables to the model. 
In our case, we would take our simple regression\n", "model:\n", "\n", "$$\n", "W_i = \\beta_0 + \\beta_1 Gender_i + \\epsilon_i\n", "$$\n", "\n", "and augment it with a variable which captures `Education`:\n", "\n", "$$\n", "W_i = \\beta_0 + \\beta_1 Gender_i + \\beta_2 Edu_i + \\epsilon_i\n", "$$\n", "\n", "Just as in a simple regression, the goal of estimating a multiple\n", "regression model using OLS is to solve the problem:\n", "\n", "$$\n", "(\\hat{\\beta_0},\\hat{\\beta_1},\\hat{\\beta_2}) = \\arg \\min_{b_0,b_1,b_2} \\sum_{i=1}^{n} (W_i - b_0 - b_1 Gender_i - b_2 Edu_i)^2\n", "$$\n", "\n", "where the quantity being minimized is the sum of squared residuals,\n", "$\\sum_{i=1}^{n} e_i^2$.\n", "\n",
"In general, you can have any number of explanatory variables in a\n", "multiple regression model (as long as the number of variables is smaller\n", "than $n$, your sample size). However, there are costs to including more\n", "variables, which we will learn about later. For now, we will focus on\n", "building an appropriate model and worry about the number of variables\n", "afterwards.\n", "\n", "Adding variables to a regression is easy in R; you use the same command\n", "as in simple regression, and just add the new variable to the model. For\n", "instance, we can add the variable `Education` like this:\n", "\n", "`wealth ~ gender + Education`\n", "\n", "Let’s see it in action:" ], "id": "6b40bbe5-706b-4617-9974-fee76df3b4a3" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "multiple_model_1 <- lm(data = SFS_data, wealth ~ gender + Education)\n", "\n", "summary(multiple_model_1)" ], "id": "554a697e-5741-45a5-8261-fb63eb6e694a" },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, there are now three coefficients: one for\n", "`genderFemale`, one for `EducationUniversity` and one for the intercept.\n", "The important thing to remember is that these relationships are being\n", "calculated *jointly*. Compare the result above to the two simple\n", "regressions we saw earlier:" ], "id": "09578ecd-9cb2-4aaa-92c5-b100df91f95f" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stargazer(regression1, regression2, multiple_model_1, title=\"Comparison of Multiple and Simple Regression Results\",\n", "          align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"))\n", "\n", "# which column is the multiple regression?" ], "id": "2429d0b3-7ddd-4e0e-b100-f024b96be1c0" },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Notice the difference in the coefficients: *all* of them are different.\n", "\n", "> **Think Deeper**: why would all of these coefficients change? Why not\n", "> just the coefficient on `gender`?\n", "\n", "You will also notice that the standard errors are different. This is an\n", "important lesson: including (or not including) variables can change the\n", "statistical significance of a result. This is why it is so important to\n", "be very careful when designing regression models and thinking them\n", "through: a coefficient estimate is a consequence of the *whole model*,\n", "and should not be considered in isolation.\n", "\n", "### Interpreting Multiple Regression Coefficients\n", "\n", "Interpreting coefficients in a multiple regression is nearly the same as\n", "in a simple regression. 
After all, our regression equation is:\n", "\n", "$$\n", "W_i = \\beta_0 + \\beta_1 Gender_i + \\beta_2 Edu_i + \\epsilon_i\n", "$$\n", "\n", "You could (let’s pretend for a moment that $Edu_i$ was continuous)\n", "calculate:\n", "\n", "$$\n", "\\frac{\\partial W_i}{\\partial Edu_i} = \\beta_2\n", "$$\n", "\n", "This is the same interpretation as in a simple regression model:\n", "\n", "- $\\beta_2$ is the change in $W_i$ for a 1-unit change in $Edu_i$.\n", "- As you will see in the exercises, when $Edu_i$ is a dummy, we have\n", " the same interpretation as in a simple regression model: the\n", " (average) difference in the dependent variable between the two\n", " levels of the dummy variable.\n", "\n", "However, there is an important difference: we are *holding constant* the\n", "other explanatory variables. That’s what the $\\partial$ means when we\n", "take a derivative. This was actually always there (since we were holding\n", "constant the residual), but now this is something that is directly\n", "observable in our data (and in the model we are building)." ], "id": "42032c6c-d875-402f-9022-28d324181c05" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "summary(multiple_model_1)" ], "id": "4bc4627f-4da5-4d77-94b1-15a17300b302" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test your knowledge\n", "\n", "Based on the results above, how much more wealth do university graduates\n", "accumulate, relative to folks with non-university education levels, when\n", "we hold gender fixed?" ], "id": "a3dba10a-65c1-4843-a8b2-ab8ebc59d83d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# answer the question above by filling in the number \n", "\n", "answer_1 <- # your answer here\n", "\n", "test_1()" ], "id": "82a9c1ad-dd87-4f5b-9289-7f35a93d8379" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Control Variables: What Do They Mean?\n", "\n", "One very common term you may have heard, especially in the context of a\n", "multiple regression model, is the idea of a **control variable**. In a\n", "multiple regression model, control variables are just explanatory\n", "variables - there is nothing special about how they are included.\n", "However, there *is* something special about how we think about them.\n", "\n", "The idea of a control variable refers to how we *think about* a\n", "regression model, and in particular, the different variables. Recall\n", "that the interpretation of a coefficient in a multiple regression model\n", "is the effect of that variable *holding constant* the other variables.\n", "This is often referred to as **controlling** for the values of those\n", "other variables - we are not allowing their relationship with the\n", "variable in question, and the outcome variable, to affect our\n", "measurement of the result. This is very common when we are discussing a\n", "*cause and effect* relationship - control is essential to these kinds of\n", "models. However, it is also valuable even when we are just thinking\n", "about a predictive model.\n", "\n", "You can see how this works directly if you think about a multiple\n", "regression as a series of “explanations” for the outcome variable. Each\n", "variable, one-by-one “explains” part of the outcome variable. When we\n", "“control” for a variable, we remove the part of the outcome that can be\n", "explained by that variable alone. 
In terms of our model, this refers to\n", "the residual.\n", "\n", "However, we must remember that our control variable *also* explains part\n", "of the other variables, so we must “control” for it as well.\n", "\n", "For instance, our multiple regression:\n", "\n", "$$\n", "W_i = \\beta_0 + \\beta_1 Gender_i + \\beta_2 Edu_i + \\epsilon_i\n", "$$\n", "\n", "can be thought of as three sequential simple regressions:\n", "\n", "$$\n", "W_i = \\gamma_0 + \\gamma_1 Edu_i + u_i\n", "$$ $$\n", "Gender_i = \\pi_0 + \\pi_1 Edu_i + v_i\n", "$$\n", "\n", "$$\n", "\\hat{u_i} = \\delta_0 + \\delta_1 \\hat{v_i} + \\eta_i\n", "$$\n", "\n",
"- The first two regressions say: “explain `wealth` and `gender` using\n", "  `Education` (in simple regressions)”\n", "- The final regression says: “relate whatever is left over from the\n", "  `education-wealth` relationship ($\\hat{u_i}$) to whatever is left over\n", "  from the `education-gender` relationship ($\\hat{v_i}$).”\n", "\n", "This effectively “isolates” the variation in the data which has to do\n", "with `education`, removing it from the result of the model.\n", "\n", "Let’s see this in action:" ], "id": "b17eeb83-34aa-4a94-9b84-d19c963c5dc5" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression1 <- lm(wealth ~ Education, data = SFS_data)\n", "# regress wealth on education\n", "\n", "regression2 <- lm(dummy_gender ~ Education, data = SFS_data)\n", "# regress gender on education\n", "\n", "temp_data <- tibble(wealth_leftovers = regression1$residuals, gender_leftovers = regression2$residuals)\n", "# take whatever is left over from those regressions and save it" ], "id": "4bbc94a4-c037-455d-8127-5ff5cff94a59" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression3 <- lm(wealth_leftovers ~ gender_leftovers, data = temp_data)\n", "# regress the wealth leftovers on the gender leftovers\n", "\n", "# compare the results with the multiple regression\n", "\n", "stargazer(regression1, regression2, regression3, multiple_model_1, title=\"Comparison of Multiple and Simple Regression Results\",\n", "          align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"))" ], "id": "cc5cc247-d1dd-47b6-b7ff-6cb105c49044" },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Look closely at these results. You will notice that the coefficients on\n", "`gender_leftovers` in the “control” regression and `gender` in the\n", "multiple regression are *exactly the same*.\n", "\n", "> **Think Deeper**: what if we had done this experiment another way\n", "> (`wealth` and `Education` on `gender`)? Which coefficients would\n", "> match? Why?\n", "\n", "This result is a consequence of the **Frisch-Waugh-Lovell theorem**\n", "about OLS - a variant of which is referred to as the “regression\n", "anatomy” equation.\n", "\n", "For our purposes, it does a very useful thing: it gives us a concrete\n", "way of thinking about what “controls” are doing: they are “subtracting”\n", "part of the variation from both the outcome and other explanatory\n", "variables. In OLS, this is *exactly* what is happening - but for all\n", "variables at once! If you don’t get it, don’t worry about it too much.\n", "What is important is that we now have a way to disentangle the effects\n", "on wealth, whether they come from gender or education.\n", "\n", "## Part 2: Hands-On\n", "\n", "Now it’s time to continue our investigation of the gender-wealth gap,\n", "using our multiple regression tools. 
As we discussed before,\n", "when we investigate the gender-wealth gap, we usually want to “hold\n", "fixed” different kinds of variables. We have already seen this, using\n", "the `Education` variable to control for the education-wealth gap.\n", "However, there are many more variables we might want to include.\n", "\n", "For example, risky investments usually generate higher returns, and men\n", "are typically more willing to take risks - based on research that\n", "explores [psychological differences in how risk is processed between men\n", "and women](https://journals.sagepub.com/doi/abs/10.1177/0963721411429452)\n", "and research that explores [how the perception of a person’s gender\n", "shapes how risk tolerant or risk averse a person is thought to\n", "be](https://www.mendeley.com/catalogue/5a28efe5-479d-312a-bd80-32e6500a8f1c/).\n", "This implies that we may want to control for risky investments in the\n", "analysis.\n", "\n", "Let’s try that now:" ], "id": "5f715c48-1bf5-46c2-992d-ca68be7a732a" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "risk_regression1 <- lm(data = SFS_data, wealth ~ gender + Education + risk_proxy) \n", "# don't worry about what risk_proxy is for now\n", "\n", "summary(risk_regression1)" ], "id": "4eb38b54-bbdc-47f4-b719-d49f8fc2d6ac" },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Once we control for risky investments, what do you see? How has the\n", "gender-wealth gap changed?\n", "\n", "### Effects of adding too many controls\n", "\n", "Another way to model the wealth gap is to study financial assets and\n", "stocks at the same time, so that we can understand how different\n", "categories of assets affect wealth." ], "id": "deda2804-7010-40af-8a2c-de113f554ac0" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "risk_regression2 <- lm(wealth ~ financial_asset + stock + bond + bank_deposits + mutual_funds + other_investments, data = SFS_data)\n", "\n", "summary(risk_regression2)" ], "id": "28d41eec-3a99-4ad0-94b9-5998fb5b9c88" },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Look closely at this result. Do you see anything odd or problematic\n", "here?\n", "\n", "This is a topic we will revisit later in this course, but what you are\n", "seeing is **multicollinearity**. Essentially, what this means is that one\n", "of the variables we have added to our model does not add any new\n", "information.\n", "\n", "In other words, once we control for the other variables, there’s nothing\n", "left to explain. Can you guess which variables are interacting to cause\n", "this problem?\n", "\n", "Let’s dig deeper to see:" ], "id": "c6c81d77-3172-4355-86b9-c5a5ea4c7e02" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "risk_reg1 <- lm(wealth ~ Education + stock + bond + bank_deposits + mutual_funds + other_investments, data = SFS_data)\n", "\n", "\n", "summary(risk_reg1)\n", "\n", "print(\"Leftovers from wealth ~ education, stocks, bonds, ... 
\")\n", "head(round(risk_reg1$residuals,2))\n", "# peek at the leftover part of wealth\n", "\n", "risk_reg2 <- lm(financial_asset ~ Education + stock + bond + bank_deposits + mutual_funds + other_investments, data = SFS_data)\n", "\n", "\n", "summary(risk_reg2)\n", "\n", "print(\"Leftovers from financial asset ~ education, stock, bonds, ...\")\n", "head(round(risk_reg2$residuals,5))\n", "# peek at the leftover part of financial asset" ], "id": "b041f755-2e13-4ac2-baca-a649c7eb12d1" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Think Deeper**: why is “Leftovers from financial asset ~ Education +\n", "> stock, bonds, …” equal to 0?\n", "\n", "As you can see, the residual from regressing\n", "`financial_asset ~ Education + stock + ...` is exactly (to machine\n", "precision) zero. In other words, when you “control” for the asset\n", "classes, there’s nothing left to explain about `financial_assets`.\n", "\n", "If we think about this, it makes sense: these “controls” are all the\n", "types of financial assets you could have! So, if I tell you about them,\n", "you will immediately know the total value of my financial assets.\n", "\n", "This means that the final step of the multiple regression would be\n", "trying to solve this equation:\n", "\n", "$$\n", "\\hat{u_i} = \\delta_0 + \\delta_1 0 + \\eta_i\n", "$$\n", "\n", "which does not have a unique solution for $\\delta_1$. R tries to “fix”\n", "this problem by getting rid of some variables, but this usually\n", "indicates that our model wasn’t set-up properly in the first place.\n", "\n", "The lesson is that we can’t just include controls without thinking about\n", "them; we have to pay close attention to their role in our model, and\n", "their relationship to other variables.\n", "\n", "For example, a *better* way to do this would be to just include `stock`\n", "and the total value of assets instead of all the other classes (bank\n", "deposits, mutual funds, etc.). This is what `risk_proxy` is: the ratio\n", "of stocks to total assets.\n", "\n", "You can also include different sets of controls in your model. Often\n", "adding different “layers” of controls is a very good way to understand\n", "how different variables interact and affect your conclusions. Here’s an\n", "example, adding on several different “layers” of controls:" ], "id": "5a78edcc-d7f3-45a4-8ed3-4e898c57e50d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression1 <- lm(wealth ~ gender, data = SFS_data)\n", "regression2 <- lm(wealth ~ gender + Education, data = SFS_data)\n", "regression3 <- lm(wealth ~ gender + Education + risk_proxy, data = SFS_data)\n", "regression4 <- lm(wealth ~ gender + Education + risk_proxy + business + province + credit_limit, data = SFS_data)\n", "\n", "stargazer(regression1, regression2, regression3, regression4, title=\"Comparison of Controls\",\n", " align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"))" ], "id": "ad4ce7ad-ce93-495d-9873-54969614ec2d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "A pretty big table! Often, when we want to focus on just a single\n", "variable, we will simplify the table by just explaining which controls\n", "are included. 
Here’s an example which is much easier to read; it uses\n", "some formatting tricks which you don’t need to worry about right now:" ], "id": "cfe08215-9b3e-4319-bce8-17fbc91a5e9b" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var_omit = c(\"(province)\\\\w+\",\"(Education)\\\\w+\") # don't worry about this right now!\n", "\n", "stargazer(regression1, regression2, regression3, regression4, title=\"Comparison of Controls\",\n", "          align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"), \n", "          omit = var_omit,\n", "          add.lines = list(c(\"Education Controls\", \"No\", \"Yes\", \"Yes\", \"Yes\"),\n", "                           c(\"Province Controls\", \"No\", \"No\", \"No\", \"Yes\")))\n", "\n", "# this is very advanced code; don't worry about it right now; we will come back to it at the end of the course" ], "id": "4c072c4f-52cb-4442-b7f0-c6c2bb4c3f48" },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Notice in the table above how the coefficients change when we change the\n", "included control variables. Understanding this kind of variation is\n", "really important to interpreting a model and deciding whether or not the\n", "results are credible. For example, ask yourself why the gender-wealth\n", "gap decreases as we include more control variables. What do you think?\n", "\n", "### Using the `predict()` function\n", "\n", "In this section, we will learn how to use the regression model to make\n", "predictions using the `predict()` function in R. This is particularly\n", "useful when you have new data and want to estimate the dependent\n", "variable based on your model. Note that this function can be used for\n", "both simple and multiple regressions.\n", "\n", "First, let’s take another look at the regression model we fit earlier\n", "using our existing dataset:" ], "id": "ff7979b1-3fb1-4587-a854-396f2ed9efcc" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display the summary of the model\n", "summary(multiple_model_1)" ], "id": "48a331f6-4299-4fd8-b693-bd6a6d337a05" },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let’s say we have new data and we want to predict the wealth for these\n", "new observations. We’ll create a new data frame with the new data points\n", "and use the `predict()` function to make predictions.\n", "\n", "We will create three new data points:\n", "\n", "- a female individual with a university education\n", "- a male individual with a non-university education\n", "- a male individual with a university education" ], "id": "9b5b5e08-1d17-4c6d-8068-972d4978f25e" },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a new data frame with sample data points for prediction\n", "new_data <- data.frame(\n", "    gender = factor(c(\"2\", \"1\", \"1\"), levels = levels(SFS_data$gender)),\n", "    Education = factor(c(\"University\", \"Non-university\", \"University\"), levels = levels(SFS_data$Education)))\n", "\n", "new_data" ], "id": "a0149b02-752e-47f1-90aa-17bb29b9a3d0" },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the `predict()` function to predict the wealth of these\n", "individuals by specifying the regression model from above and the set of\n", "new data points."
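, "\n", "\n", "Under the hood, `predict()` for a linear model simply plugs each new row\n", "into the fitted equation. As a rough check (a sketch only - it assumes the\n", "coefficient names `genderFemale` and `EducationUniversity` shown in the\n", "summary output above), the prediction for the first row (a female-led,\n", "university-educated household) should equal:\n", "\n", "``` r\n", "b <- coef(multiple_model_1) # named vector of estimated coefficients\n", "b[\"(Intercept)\"] + b[\"genderFemale\"] + b[\"EducationUniversity\"] # manual prediction\n", "```\n", "\n", "You can compare this number with the first value returned by `predict()`\n", "below."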
], "id": "e8c24286-e9f4-4aee-8e2e-718fc30eb800" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict wealth using the multiple_model_1\n", "predicted_wealth <- predict(multiple_model_1, newdata = new_data)\n", "\n", "# Display the predicted values\n", "predicted_wealth" ], "id": "22aebfde-81ee-46c8-a1a4-1bda0d6fe34b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also show the confidence intervals of our predicitons by adding\n", "and additional argument to the `predict()` function\n", "`interval = 'confidence'`. These are automatically set at a 95%\n", "confidence level." ], "id": "7d664a52-7da0-43f9-84de-b3c0b9f7e545" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Additional argument for confidence intervals\n", "predicted_wealth_conf <- predict(multiple_model_1, newdata = new_data, interval = 'confidence')\n", "\n", "# Show the predictions\n", "predicted_wealth_conf" ], "id": "b7090dab-af7a-4a9b-8292-a0aa653af6a3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, fit refers to the fitted (or predicted) value our regression makes\n", "based off of the inputs we specified above. Lwr is the lower bound and\n", "Upr is the upper bound.\n", "\n", "### Omitted Variables\n", "\n", "Another important topic comes up in the context of multiple regression:\n", "**omitted variables**. In a simple regression, this didn’t really mean\n", "anything, but now it does. When we have a large number of variables in a\n", "dataset, which ones do we include in our regression? All of them? Some\n", "of them?\n", "\n", "This is actually a very important problem, since it has crucial\n", "implication for the interpretation of our model. For example, remember\n", "Assumption 1? This is a statement about the “true” model - not what you\n", "are actually running. It can very easily be violated when variables\n", "aren’t included.\n", "\n", "We will revisit this later in the course, since it only really makes\n", "sense in the context of causal models, but for now we should pay close\n", "attention to which variables we are including and why. Let’s explore\n", "this, using the exercises.\n", "\n", "## Part 3: Exercises\n", "\n", "This sections has both written and coding exercises for you to test your\n", "knowledge about multiple regression. The answers to the written\n", "exercises are on the last section of the notebook.\n", "\n", "### Questions\n", "\n", "Suppose you have a regression model that looks like:\n", "\n", "$$\n", "Y_i = \\beta_0 + \\beta_1 X_{i} + \\beta_2 D_{i} + \\epsilon_i\n", "$$\n", "\n", "Where $D_i$ is a dummy variable. Recall that Assumption 1 implies that\n", "$E[\\epsilon_i|D_{i}, X_{i}] = 0$. Suppose this assumption holds true.\n", "Answer the following:\n", "\n", "1. Compute $E[Y_i|X_i,D_i=1]$ and $E[Y_i|X_i,D_i=0]$\n", "2. What is the difference between these two terms?\n", "3. Interpret what the coefficient $\\beta_2$ means in this regression,\n", " using your answers in 1 and 2.\n", "\n", "To explore the mechanics of multiple regressions, let’s return to the\n", "analysis that we did in Module 1; that is, let’s re-examine the\n", "relationship between the gender income gap and education.\n", "\n", "Run a simple regression for the gender income gap (with a single\n", "regressor) for each education level. Use `income_before_tax` as the\n", "dependent variable. 
Then, run a multiple regression for the gender\n", "income gap that includes `education` as a control.\n", "\n", "Tested objects: `reg_LESS` (simple regression; less than high school),\n", "`reg_HS` (high school diploma), `reg_NU` (Non-university\n", "post-secondary), `reg_U` (university), `reg2` (multiple regression)." ], "id": "27d53f21-d304-42f4-b41a-370b26553b58" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Less than high school\n", "reg_LESS <- lm(???, data = filter(SFS_data, education == \"Less than high school\"))\n", "test_4() #For reg_LESS\n", "\n", "# High school diploma\n", "reg_HS <- lm(???, data = filter(SFS_data, education == \"High school\"))\n", "test_5() #For reg_HS\n", "\n", "# Non-university post-secondary\n", "reg_NU <- lm(???, data = filter(SFS_data, education == \"Non-university post-secondary\"))\n", "test_6() #For reg_NU\n", "\n", "# University\n", "reg_U <- lm(???, data = filter(SFS_data, education == \"University\"))\n", "test_7() #For reg_NU\n", "\n", "# Multiple regression\n", "reg2 <- lm(??? + education, data = SFS_data)\n", "test_8() #For reg2\n", "\n", "#Table comparing regressions\n", "stargazer(reg_LESS, reg_HS, reg_NU, reg_U, \n", " title = \"Comparing Conditional Regressions with Multiple Regression\", align = TRUE, type = \"text\", keep.stat = c(\"n\",\"rsq\")) \n", "summary(reg2)" ], "id": "81ac379a-5a6d-4ebf-b110-318e2608cf5e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. What variable “value” appears to be missing from the multiple\n", " regression in the table? How can we interpret the average income for\n", " the group associated with that value? Hint: Dummy Variables\n", "\n", "2. Compare the coefficient estimates for `gender` across each of the\n", " simple regressions. How does the gender income gap appear to vary\n", " across education levels? How should we interpret this variation?\n", "\n", "3. Compare the simple regressions’ estimates with those of the multiple\n", " regression. How does the multiple regression’s coefficient estimate\n", " on `gender` compare to those estimates in the simple regressions?\n", " How can we interpret this? Further, how do we interpret the\n", " coefficient estimates on the other regressors in the multiple\n", " regression?\n", "\n", "Now, consider the multiple regression that we estimated in the previous\n", "activity:\n", "\n", "$$\n", "W_i = \\beta_0 + \\beta_1 Gender_i + \\beta_2 S_i + \\epsilon_i\n", "$$\n", "\n", "Note that $Gender_i$ is `gender` and $S_i$ is `education`.\n", "\n", "1. Why might we be skeptical of the argument that $\\beta_1$ captures\n", " the gender income gap (i.e., the effect of having female as the main\n", " earner on household’s income, all else being equal)? What can we do\n", " to address these concerns?\n", "\n", "2. Suppose that a member of your research team suggests that we should\n", " add `age` as a control in the regression. Do you agree with this\n", " group member that this variable would be a good control? Why or why\n", " not?\n", "\n", "Now let’s turn back to coding. Let’s first simplify levels of age group\n", "using codes." 
], "id": "747a73a6-30db-492a-94b8-628dd5e9a3ad" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# run this!\n", "SFS_data <- \n", " SFS_data %>%\n", " mutate(agegr = case_when(\n", " age == \"01\" ~ \"Under 30\",\n", " age == \"02\" ~ \"Under 30\",\n", " age == \"03\" ~ \"Under 30\",\n", " age == \"04\" ~ \"30-45\",\n", " age == \"05\" ~ \"30-45\",\n", " age == \"06\" ~ \"30-45\",\n", " age == \"07\" ~ \"45-60\",\n", " age == \"08\" ~ \"45-60\",\n", " age == \"09\" ~ \"45-60\",\n", " age == \"10\" ~ \"60-75\",\n", " age == \"11\" ~ \"60-75\",\n", " age == \"12\" ~ \"60-75\",\n", " age == \"13\" ~ \"Above 75\",\n", " age == \"14\" ~ \"Above 75\",\n", " )) %>%\n", " mutate(agegr = as_factor(agegr))\n", "\n", "SFS_data$agegr <- relevel(SFS_data$agegr, ref = \"Under 30\") # set \"Under 30\" as default factor level" ], "id": "8f95f64b-f7e7-4ed1-95d5-8bde170379eb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add `agegr` to the given multiple regression and compare it with the\n", "model that we estimated in the previous activity.\n", "\n", "Tested Objects: `reg3` (the same multiple regression that we estimated\n", "before, but with age added as a control)." ], "id": "ada58396-8e0b-4e15-999e-d195fdc11e89" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add Age as Control\n", "# add them in the order: gender, education, age\n", "reg3 <- lm(???, data = SFS_data)\n", "\n", "# compare the regressions with and without this control\n", "stargazer(reg2, reg3, \n", " title = \"Multiple Regressions with and without Age Controls\", align = TRUE, type = \"text\", keep.stat = c(\"n\",\"rsq\")) \n", "\n", "test_14() #For reg3 " ], "id": "14ea28ea-4903-4e72-aa88-a1a27c5e2b1b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Compare the two regressions in the table above. What happens to the\n", " estimated gender income gap when we add age as a control? What might\n", " explain this effect?\n", "\n", "2. Suppose that one of your fellow researchers argues that `employment`\n", " (employment status) should be added to the multiple regression as a\n", " control. That way, they reason, we can account for differences\n", " between employed and unemployed workers. Do you agree with their\n", " reasoning?\n", "\n", "Let’s test this argument directly. Add `employment` as a control to the\n", "multiple regression with all previous controls. Estimate this new\n", "regression (`reg4`)." ], "id": "18ee8109-34b5-44f9-a8aa-ae6b06280476" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add in the order before, with employment last\n", "reg4 <- lm(???, data = SFS_data)\n", "\n", "summary(reg4)\n", "\n", "test_17() " ], "id": "46638ce0-fe45-425f-86ba-514816ad0817" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. What happens to the gender income gap when we run the regression\n", " with `employment`?\n", "\n", "Now consider the following scenario. In the middle of your team’s\n", "discussion of which controls they should add to the multiple regression\n", "(the same one as the previous activity), your roommate bursts into the\n", "room and yells “Just add them all!” After a moment of confused silence,\n", "the roommate elaborates that it never hurts to add controls as long as\n", "they don’t “break” the regression (like `employment` and `agegr`). 
“Data\n", "is hard to come by, so we should use as much of it as we can get,” he\n", "says.\n", "\n", "Recall: Below are all of the variables in the dataset." ], "id": "58dae0c2-ad74-42ac-8d00-97460b5bbdc5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "glimpse(SFS_data) # run me" ], "id": "23889921-b143-4dd6-9754-e3177ea9b0c4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Do you agree with your roommate’s argument? Why or why not?\n", "\n", "Let’s back up our argument with regression analysis. Estimate a\n", "regression that has the same controls as `reg3` from the previous\n", "activity, but add `pasrbuyg` as a control as well.\n", "\n", "What is “pasrbuyg”?" ], "id": "8a8b44cf-8b25-48a9-bd90-c455a10e1bdb" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dictionary(\"pasrbuyg\") # run me" ], "id": "9fd5ec6e-11de-4e7a-a115-7c99aa9c14bf" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tested Objects: `reg5`." ], "id": "3c655d19-b565-4420-9ffd-30e7592aaa79" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add pasrbuyg to regression\n", "# keep the order (gender, education, agegr, pasrbuyg)\n", "reg5 <- lm(???, data = SFS_data)\n", "\n", "# table comparing regressions with and without ppsort\n", "stargazer(reg3, reg5,\n", " title = \"Multiple Regressions with and without ppsort\", align = TRUE, type = \"text\", keep.stat = c(\"n\",\"rsq\")) \n", "\n", "test_20() # For reg5 " ], "id": "0b9044d4-c49e-4144-8126-30820e52993a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Does the table above suggest that we should add `pasrbuyg` as a\n", " control?\n", "\n", "2. What other variables can be added as controls?\n", "\n", "### Solutions\n", "\n", "1. $$\n", " \\begin{align}\n", " E[Y_i|X_i,D_i=1]&=\\beta_{0}+\\beta_{1}E[X_{i}|X_{i},D_{i}=1]+\\beta_{2} \\\\\n", " E[Y_i|X_i,D_i=0]&=\\beta_{0}+\\beta_{1}E[X_{i}|X_{i},D_{i}=0]\n", " \\end{align}\n", " $$\n", "\n", "2. $$\n", " \\begin{align}\n", " &E[E[Y_i|X_i,D_i=1]-E[Y_i|X_i,D_i=0]] \\\\\n", " &= E[\\beta_{1}(E[X_{i}|X_{i},D_{i}=1]-E[X_{i}|X_{i},D_{i}=0])]+\\beta_{2} \\\\\n", " &= \\beta_{1}(E[E[X_{i}|X_{i},D_{i}=1]]-E[E[X_{i}|X_{i},D_{i}=0]])+\\beta_{2}\\\\\n", " &= \\beta_{2}\n", " \\end{align}\n", " $$\n", "\n", "3. $\\beta_{2}$ is the average difference between outcome $Y_{i}$ of\n", " dummy variable $D_{i}$ to be 0 and 1.\n", "\n", "4-8. Coding exercises\n", "\n", "1. It seems like education “High School” is missing from the multiple\n", " regression. This value is represented when all the dummies included\n", " in the model equal zero. The average difference in income for male\n", " and female earners for this group is 37,638 dollars.\n", "\n", "2. The gender income gap becomes larger as education increases. The\n", " gender-income gap of each level of education may come from gender\n", " difference in occupations and position levels: as education\n", " increases, males may have higher chance to work in higher positions\n", " (e.g. managers), while females may have less opportunity.\n", "\n", "3. The coefficient estimate on `gender` of multiple regression is close\n", " to the (weighted) average estimates in the simple regressions. 
In\n", " multiple regression, the coefficient means female-lead households on\n", " average earn 37,638 dollars less than male counterpart.\n", "\n", "In multiple regression, the intercept means male-lead households with\n", "high school degree earn 84,169 dollars on average. For the coefficient\n", "of `educationLess than High school`, it means if male is the main\n", "earner, then the average incomes they will earn are\n", "$84,169 + 31,811 = 52,358$ dollars. Then for female with high school\n", "degree, the average before-tax income is $84,169-37,638 = 46,531$\n", "dollars. We can interpret coefficient estimates of\n", "`educationNon-university post-secondary` and `educationUniversity` in a\n", "similar way.\n", "\n", "1. Because there may be omitted variables that correlate with $Gender$.\n", " Maybe the omitted variables are the true reasons why there is an\n", " income gap, but we since do not control these variables, the\n", " coefficient estimate of $Gender$ is not correct.\n", "\n", "2. `age` is a good control, because age can affect incomes, and age can\n", " correlate with gender of main earners. If our regression omit `age`,\n", " the coefficient estimate of `gender` can be biased.\n", "\n", "3. Coding exercise\n", "\n", "4. After we control age, the estimated gender income gap shrinks. This\n", " means some of the income gap may come from difference in age, rather\n", " than gender gap.\n", "\n", "5. The reasoning is correct.\n", "\n", "6. Coding exercise.\n", "\n", "7. When we run regression with `employment`, the gender income gap\n", " shrinks even more, which means `employment` explains part of the\n", " income gap.\n", "\n", "8. The roommate’s reasoning is incorrect, because we could introduce\n", " multicollinearity if we add all variables.\n", "\n", "9. Coding exercise.\n", "\n", "10. No, we should not add `pasrbuyg` as a control, because the estimated\n", " coefficients are not significant, and it may have multicollinearity\n", " issue with `agegr`.\n", "\n", "11. This is an open question. Just make sure that your variables are\n", " included in dataset, and that you have a reasonable explanation as\n", " to why add them as controls." ], "id": "ddee88b5-a18d-40a5-b1be-9eb12282356d" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }