{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.2 - Intermediate - Multiple Regression\n", "\n", "COMET Team
*Emrul Hasan, Jonah Heyl, Shiming Wu, William Co,\n", "Jonathan Graves* \n", "2022-12-08\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Simple regression\n", "- Data analysis and introduction\n", "\n", "### Outcomes\n", "\n", "- Understand how the theory of multiple regression models works in\n", " practice\n", "- Be able to estimate multiple regression models using R\n", "- Interpret and explain the estimates from multiple regression models\n", "- Understand the relationship between simple linear regressions and\n", " similar multiple regressions\n", "- Describe a control variable and regression relationship\n", "- Explore the relationship between controls and causal interpretations\n", " of regression model estimates\n", "\n", "### Notes\n", "\n", "[1](#fn1s) Statistics Canada, Survey of\n", "Financial Security, 2019, 2021. Reproduced and distributed on an “as is”\n", "basis with the permission of Statistics Canada. Adapted from Statistics\n", "Canada, Survey of Financial Security, 2019, 2021. This does not\n", "constitute an endorsement by Statistics Canada of this product.\n", "\n", "[2](#fn2s) Stargazer package is due to: Hlavac,\n", "Marek (2018). stargazer: Well-Formatted Regression and Summary\n", "Statistics Tables. 
R package version 5.2.2.\n", "https://CRAN.R-project.org/package=stargazer " ], "id": "37176d7b-3595-4867-9eed-987fd3498341" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(tidyverse) \n", "library(haven)\n", "library(dplyr)\n", "library(stargazer)\n", "\n", "source(\"intermediate_multiple_regression_functions.r\")" ], "id": "50c9287d-68c9-4ad6-9bb8-355ad007fced" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- read_dta(\"../datasets_intermediate/SFS_2019_Eng.dta\")\n", "\n", "## massive data clean-up\n", "SFS_data <- clean_up_sfs(SFS_data) #renaming things, etc.\n", "\n", "#if you want to see, it's in intermediate_multiple_regression_functions.r" ], "id": "9bbe8c49-2ddb-4ac4-af3c-bffd9fb8389b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Introducing Multiple Regressions\n", "\n", "At this point, you are familiar with the simple regression model and its\n", "relationship to the comparison-of-means $t$-test. However, most\n", "econometric analyses don’t use simple regression - this is because, in\n", "general, economic data and models are far too complicated to be\n", "summarized with a single relationship. One of the features of most\n", "economic datasets is a complex, multi-dimensional relationship between\n", "different variables. 
This leads to the two key motivations for\n", "**multiple regression**:\n", "\n", "- First, it can improve the *predictive* properties of a regression\n", " model, by introducing other variables that play an important\n", " econometric role in the relationship being studied.\n", "- Second, it allows the econometrician to *differentiate* the\n", " importance of different variables in a relationship.\n", "\n", "This second motivation is usually part of **causal analysis** when we\n", "believe that our model has a cause-and-effect interpretation.\n", "However, even if it does not, it is still useful to understand which\n", "variables are “driving” the relationship in the data.\n", "\n", "Let’s look at the following plots, which depict the relationships between\n", "`wealth`, `gender` and `education`. In the top panel, the colour of each\n", "cell is the (average) log of `wealth`. In the bottom panel, the size of\n", "each circle is the number of households in that combination of\n", "categories.\n", "\n", "Let’s first summarize education into “university” and “non-university”.\n", "Since it’s easier to see the pattern from log wealth, we will calculate\n", "log wealth and filter out NaN values." 
], "id": "9b32ae72-a911-42af-9435-d0c48a77b94b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- SFS_data %>% \n", " mutate( \n", " Education = case_when(\n", " education == \"University\" ~ \"University\", # the ~ separates the original value from the new name\n", " education == \"Non-university post-secondary\" ~ \"Non-university\",\n", " education == \"High school\" ~ \"Non-university\",\n", " education == \"Less than high school\" ~ \"Non-university\")) %>%\n", " mutate(Education = as_factor(Education)) # remember, it's a factor!\n", "\n", "glimpse(SFS_data$Education) #we now have data that only considers whether someone has finished university or not" ], "id": "d35c9503-aca7-4a5c-a9eb-944ae92f5a94" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- SFS_data %>%\n", " mutate(lnwealth = log(wealth)) # calculate log of wealth" ], "id": "0bf2f32e-676e-47ba-b9c8-255d93cd5847" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oops, NaNs again: the log of a negative number is not defined. We solve this by running the code below." 
], "id": "2081c914-8504-43b4-ba6c-f63c1d831671" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data_logged <- SFS_data %>%\n", " filter(income_before_tax > 0) %>% #keeps households with positive income\n", " filter(wealth > 0) #keeps positive wealth, so that log(wealth) is defined" ], "id": "8bc0432c-b04f-41f2-af39-09907d70306c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(repr.plot.width=6,repr.plot.height=4) #controls the image size\n", "\n", "f <- ggplot(data = SFS_data_logged, aes(x = gender, y = Education)) + xlab(\"Gender\") + ylab(\"Education\") #defines x and y\n", "f + geom_tile(aes(fill=lnwealth)) + scale_fill_distiller(palette=\"Set1\") #this gives us fancier colours\n", "\n", "f <- ggplot(data = SFS_data, aes(x = gender, y = Education)) #defines x and y\n", "f + geom_count() #prints our graph" ], "id": "ad36a92f-5c75-4e33-b73c-59d809efa867" }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see immediately that there are *three* relationships happening\n", "at the same time:\n", "\n", "1. There is a relationship between `wealth` of households and `gender`\n", " of main earner\n", "2. There is a relationship between `wealth` and `Education`\n", "3. There is a relationship between `gender` and `Education`\n", "\n", "A simple regression can analyze any *one* of these relationships in\n", "isolation, but it cannot assess more than one of them at a time. For\n", "instance, let’s look at these regressions." ], "id": "2b42aa8d-ef9e-4e56-a984-0976bd46438f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression1 <- lm(data = SFS_data, wealth ~ gender) #the effect of gender on wealth\n", "regression2 <- lm(data = SFS_data, wealth ~ Education) #the effect of education on wealth\n", "\n", "dummy_gender = as.numeric(SFS_data$gender)-1 # what is this line of code doing? 
\n", "# hint: the as.numeric function converts a factor to its numeric level codes\n", "# male is 0\n", "\n", "regression3 <- lm(data = SFS_data, dummy_gender ~ Education) #the effect of education on gender\n", "# this is actually a very important regression model called the \"linear probability\" model\n", "# we will learn more about it later in the course\n", "\n", "stargazer(regression1, regression2, regression3, title=\"Comparison of Regression Results\",\n", " align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\")) # we will learn more about this command later on!" ], "id": "2936c482-171b-444e-9455-41f658e790fa" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem here is that these results tell us:\n", "\n", "- Households with higher education accumulate more wealth (significant\n", " and positive coefficient on `EducationUniversity` in (2))\n", "- Among households whose main earner has a university degree, the\n", " proportion of males is larger than that of females: 57.4% (1 - 0.426)\n", " versus 42.6% (0.38 + 0.046) (coefficient on `EducationUniversity` in\n", " (3))\n", "- Families led by females accumulate less wealth than their male\n", " counterparts (negative and significant coefficient on\n", " `genderFemale` in (1))\n", "\n", "This implies that when we measure the gender-wealth gap alone, we are\n", "*indirectly* including part of the education-wealth gap as well. This is\n", "bad; the “true” gender-wealth gap is probably lower, but it is being\n", "increased because men are more likely to have a university degree.\n", "\n", "This is both a practical and a theoretical problem. 
It’s not just about\n", "the model, it’s also about what we mean when we say “the gender wealth\n", "gap”.\n", "\n", "- If we mean “the difference in wealth between a male and female led\n", " family”, then the simple regression result is what we want.\n", "- However, this ignores all the other reasons that a male could have a\n", " different wealth (education, income, age, etc.)\n", "- If we mean “the difference in wealth between a male and female led\n", " family, holding other factors equal,” then the simple regression result\n", " is not suitable.\n", "\n", "The problem is that “holding other factors equal” is a debatable\n", "proposition. Which factors? Why? These different ways of computing the\n", "gender wealth gap make this topic very complex, contributing to ongoing\n", "debate in the economics discipline and in the media about various kinds\n", "of gaps (e.g. the education wealth gap). We will revisit this in the\n", "exercises.\n", "\n", "### Multiple Regression Models\n", "\n", "When we measure the gender wealth gap, we do not want to conflate our\n", "measurement with the *education wealth gap*. To ensure that these two\n", "different gaps are distinguished, we *must* add in some other variables.\n", "\n", "A multiple regression model simply adds more explanatory ($X_i$)\n", "variables to the model. 
In our case, we would take our simple regression\n", "model:\n", "\n", "$$W_i = \\beta_0 + \\beta_1 Gender_i + \\epsilon_i$$\n", "\n", "and augment it with a variable which captures `Education`:\n", "\n", "$$W_i = \\beta_0 + \\beta_1 Gender_i + \\color{red}{\\beta_2 Edu_i} + \\epsilon_i$$\n", "\n", "Just as in a simple regression, the goal of estimating a multiple\n", "regression model using OLS is to solve the problem:\n", "\n", "$$(\\hat{\\beta_0},\\hat{\\beta_1},\\hat{\\beta_2}) = \\arg \\min_{b_0,b_1,b_2} \\sum_{i=1}^{n} (W_i - b_0 - b_1 Gender_i -b_2 Edu_i)^2 = \\arg \\min_{b_0,b_1,b_2} \\sum_{i=1}^{n} (e_i)^2$$\n", "\n", "In general, you can have any number of explanatory variables in a\n", "multiple regression model (as long as it’s not larger than $n-1$, where\n", "$n$ is your sample size). However, there are costs to including more variables,\n", "which we will learn more about later. 
For now, we will focus on building\n", "an appropriate model and will worry about the number of variables later.\n", "\n", "Adding variables to a regression is easy in R; you use the same command\n", "as in simple regression, and just add the new variable to the model. For\n", "instance, we can add the variable `Education` like this:\n", "\n", "`wealth ~ gender + Education`\n", "\n", "Let’s see it in action:" ], "id": "08ac40dc-0901-42f8-867f-62aec2c7951a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "multiple_model_1 <- lm(data = SFS_data, wealth ~ gender + Education)\n", "\n", "summary(multiple_model_1)" ], "id": "a37c14db-ee17-4928-a27e-48159e59a049" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, there are now three coefficients: one for\n", "`genderFemale`, one for `EducationUniversity` and one for the intercept.\n", "The important thing to remember is that these relationships are being\n", "calculated *jointly*. Compare the result above to the two simple\n", "regressions we saw earlier:" ], "id": "3b7a5905-47e1-48a6-a790-a645141708da" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stargazer(regression1, regression2, multiple_model_1, title=\"Comparison of Multiple and Simple Regression Results\",\n", " align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"))\n", "\n", "# which column is the multiple regression?" ], "id": "38e97f8f-c56d-4ac0-b6c8-b2aaae250421" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the difference in the coefficients: *all* of them are different.\n", "\n", "> *Think Deeper*: Why would all of these coefficients change? Why not\n", "> just the coefficient on `gender`?\n", "\n", "You will also notice that the standard errors are different. This is an\n", "important lesson: including (or not including) variables can change the\n", "statistical significance of a result. 
This is why it is so important to\n", "be very careful when designing regression models and thinking them\n", "through: a coefficient estimate is a consequence of the *whole model*,\n", "and should not be considered in isolation.\n", "\n", "### Interpreting Multiple Regression Coefficients\n", "\n", "Interpreting coefficients in a multiple regression is nearly the same as\n", "in a simple regression. After all, our regression equation is:\n", "\n", "$$W_i = \\beta_0 + \\beta_1 Gender_i + \\beta_2 Edu_i + \\epsilon_i$$\n", "\n", "You could (let’s pretend for a moment that $Edu_i$ was continuous)\n", "calculate:\n", "\n", "$$\\frac{\\partial W_i}{\\partial Edu_i} = \\beta_2$$\n", "\n", "This is the same interpretation as in a simple regression model:\n", "\n", "- $\\beta_2$ is the change in $W_i$ for a 1-unit change in $Edu_i$.\n", "- As you will see in the exercises, when $Edu_i$ is a dummy, we have the\n", " same interpretation as in a simple regression model: the (average)\n", " difference in the dependent variable between the two levels of the\n", " dummy variable.\n", "\n", "However, there is an important difference: we are *holding constant* the\n", "other explanatory variables. That’s what the $\\partial$ means when we\n", "take a derivative. This was actually always there (since we were holding\n", "constant the residual), but now this is something that is directly\n", "observable in our data (and in the model we are building)." ], "id": "35f62c84-0246-49aa-9d58-1f37755ba549" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "summary(multiple_model_1)" ], "id": "1e8c9c7e-8525-4a6f-8c51-ded7878620ea" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Test your knowledge:** Based on the results above, how much more\n", "> wealth do university graduates accumulate, relative to folks with\n", "> non-university education levels, when we hold gender fixed?" 
], "id": "67981126-63d8-408e-af70-33f11b4a6d6d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# answer the question above by filling in the number \n", "\n", "answer1 <- ??? # your answer here\n", "\n", "test_1()" ], "id": "55caff93-ea2b-4246-8d29-20107319e7b4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Control Variables: What Do They Mean?\n", "\n", "One very common term you may have heard, especially in the context of a\n", "multiple regression model, is the idea of a **control variable**. In a\n", "multiple regression model, control variables are just explanatory\n", "variables - there is nothing special about how they are included.\n", "However, there *is* something special about how we think about them.\n", "\n", "The idea of a control variable refers to how we *think about* a\n", "regression model, and in particular, the different variables. Recall\n", "that the interpretation of a coefficient in a multiple regression model\n", "is the effect of that variable *holding constant* the other variables.\n", "This is often referred to as **controlling** for the values of those\n", "other variables - we are not allowing their relationship with the\n", "variable in question, and the outcome variable, to affect our\n", "measurement of the result. This is very common when we are discussing a\n", "*cause and effect* relationship - control is essential to these kinds of\n", "models. However, it is also valuable even when we are just thinking\n", "about a predictive model.\n", "\n", "You can see how this works directly if you think about a multiple\n", "regression as a series of “explanations” for the outcome variable. Each\n", "variable, one-by-one “explains” part of the outcome variable. When we\n", "“control” for a variable, we remove the part of the outcome that can be\n", "explained by that variable alone. 
In terms of our model, this refers to\n", "the residual.\n", "\n", "However, we must remember that our control variable *also* explains part\n", "of the other variables, so we must “control” for it as well.\n", "\n", "For instance, our multiple regression:\n", "\n", "$$W_i = \\beta_0 + \\beta_1 Gender_i + \\beta_2 Edu_i + \\epsilon_i$$\n", "\n", "can be thought of as three sequential simple regressions:\n", "\n", "$$W_i = \\gamma_0 + \\gamma_1 Edu_i + u_i$$\n", "\n", "$$Gender_i = \\alpha_0 + \\alpha_1 Edu_i + v_i$$\n", "\n", "$$\\hat{u_i} = \\delta_0 + \\delta_1 \\hat{v_i} + \\eta_i$$\n", "\n", "- The first two regressions say: “explain `wealth` and `gender` using\n", " `Education` (in simple regressions)”\n", "- The final regression says: “account for whatever is leftover\n", " ($\\hat{u_i}$) from the `education-wealth` relationship with whatever\n", " is leftover ($\\hat{v_i}$) from the `education-gender` relationship.”\n", "\n", "This has effectively “isolated” the variation in the data which has to\n", "do with `education` from the result of the model.\n", "\n", "Let’s see this in action:" ], "id": "dd548630-abd7-4f08-aed2-8041e1989ff9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression1 <- lm(wealth ~ Education, data = SFS_data)\n", "# regress wealth on education\n", "\n", "regression2 <- lm(dummy_gender ~ Education, data = SFS_data)\n", "# regress gender on education\n", "\n", "temp_data <- tibble(wealth_leftovers = regression1$residuals, gender_leftovers = regression2$residuals)\n", "# take whatever is left-over from those regressions, save it" ], "id": "4acd3f7a-9946-4185-8840-44dd2f12231c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression3 <- lm(wealth_leftovers ~ gender_leftovers, data = temp_data)\n", "# regress the wealth leftovers on the gender leftovers\n", "\n", "# compare the results with the multiple regression\n", "\n", 
"stargazer(regression1, regression2, regression3, 
multiple_model_1, title=\"Comparison of Multiple and Simple Regression Results\",\n", " align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"))" ], "id": "4d2e7d98-cc50-478a-8436-30752d368f1c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look closely at these results. You will notice that the coefficients on\n", "`gender_leftovers` in the “control” regression and `gender` in the\n", "multiple regression are *exactly the same*.\n", "\n", "> *Think Deeper:* What if we had done this experiment another way\n", "> (`wealth` and `Education` on `gender`)? Which coefficients would\n", "> match? Why?\n", "\n", "This result is a consequence of the **Frisch-Waugh-Lovell theorem**\n", "about OLS - a variant of which is referred to as the “regression\n", "anatomy” equation.\n", "\n", "For our purposes, it does a very useful thing: it gives us a concrete\n", "way of thinking about what “controls” are doing: they are “subtracting”\n", "part of the variation from both the outcome and other explanatory\n", "variables. In OLS, this is *exactly* what is happening - but for all\n", "variables at once! If you don’t get it, don’t worry about it too much.\n", "What is important is that we now have a way to disentangle the effects on\n", "wealth, whether they come from gender or education.\n", "\n", "## Part 2: Hands-On\n", "\n", "Now, it’s time to continue our investigation of the gender-wealth gap,\n", "but now using our multiple regression tools. As we discussed before,\n", "when we investigate the gender-wealth gap, we usually want to “hold\n", "fixed” different kinds of variables. 
We have already seen this, using\n", "the `Education` variable to control for the education-wealth gap.\n", "However, there are many more variables we might want to include.\n", "\n", "For example, risky investments usually generate more returns and men are\n", "typically more willing to take risks - based on research that explores\n", "[psychological differences in how risk is processed between men and\n", "women](https://journals.sagepub.com/doi/abs/10.1177/0963721411429452)\n", "and research that explores [how the perception of a person’s gender\n", "shapes how risk tolerant or risk averse a person is thought to\n", "be](https://www.mendeley.com/catalogue/5a28efe5-479d-312a-bd80-32e6500a8f1c/).\n", "This implies that we may want to control for risky investments in the\n", "analysis.\n", "\n", "Let’s try that now:" ], "id": "8215a936-32c3-463d-8e52-92a5d779c36b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "risk_regression1 <- lm(data = SFS_data, wealth ~ gender + Education + risk_proxy) \n", "#don't worry about what risk proxy is for now\n", "\n", "summary(risk_regression1)" ], "id": "9736d56c-b432-4283-b792-fefba5a36664" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we control for risky investments, what do you see? How has the\n", "gender-wealth gap changed?\n", "\n", "Another approach is to study financial assets and stocks at the same time, so\n", "that we can understand how different categories of assets affect wealth." ], "id": "bbb74f5d-ea5c-40a2-ab29-9361db7b5856" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "risk_regression2 <- lm(wealth ~ financial_asset + stock + bond + bank_deposits + mutual_funds + other_investments, data = SFS_data)\n", "\n", "summary(risk_regression2)" ], "id": "e22c3403-82f2-4016-b0d2-11daadaf791f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look closely at this result. 
Do you see anything odd or problematic\n", "here?\n", "\n", "This is a topic we will revisit later in this course, but this problem is called\n", "**multicollinearity**. Essentially, what this means is that one of the\n", "variables we have added to our model does not add any new information.\n", "\n", "In other words, once we control for the other variables, there’s nothing\n", "left to explain. Can you guess what variables are interacting to cause\n", "this problem?\n", "\n", "Let’s dig deeper to see what is happening here:" ], "id": "f69dd64e-ddc1-4413-8d03-db5ca3e9bcfd" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "risk_reg1 <- lm(wealth ~ Education + stock + bond + bank_deposits + mutual_funds + other_investments, data = SFS_data)\n", "\n", "\n", "summary(risk_reg1)\n", "\n", "print(\"Leftovers from wealth ~ education, stocks, bonds, ... \")\n", "head(round(risk_reg1$residuals,2))\n", "#peek at the leftover part of wealth\n", "\n", "risk_reg2 <- lm(financial_asset ~ Education + stock + bond + bank_deposits + mutual_funds + other_investments, data = SFS_data)\n", "\n", "\n", "summary(risk_reg2)\n", "\n", "print(\"Leftovers from financial asset ~ education, stock, bonds, ...\")\n", "head(round(risk_reg2$residuals,5))\n", "#peek at the leftover part of financial asset" ], "id": "387b5f45-98e4-4534-824b-4d7727e7ee66" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> *Think Deeper:* Why are the “Leftovers from financial asset ~\n", "> education, stock, bonds, …” equal to 0?\n", "\n", "As you can see, the residual from regressing\n", "`financial_asset ~ Education + stock + ...` is exactly (to machine\n", "precision) zero. In other words, when you “control” for the asset\n", "classes, there’s nothing left to explain about `financial_asset`.\n", "\n", "If we think about this, it makes sense: these “controls” are all the\n", "types of financial assets you could have! 
So, if I tell you about them,\n", "you will immediately know the total value of my financial assets.\n", "\n", "This means that the final step of the multiple regression would be\n", "trying to solve this equation:\n", "\n", "$$\\hat{u_i} = \\delta_0 + \\delta_1 \\cdot 0 + \\eta_i$$\n", "\n", "which does not have a unique solution for $\\delta_1$, meaning the\n", "regression model isn’t well-posed. R tries to “fix” this problem by\n", "getting rid of some variables, but this usually indicates that our model\n", "wasn’t set up properly in the first place.\n", "\n", "The lesson is that we can’t just include controls without thinking about\n", "them; we have to pay close attention to their role in our model, and\n", "their relationship to other variables.\n", "\n", "For example, a *better* way to do this would be to just include `stock`\n", "and the total value instead of all the other classes (bank deposits,\n", "mutual funds, etc.): this is what `risk_proxy` is: the ratio of stocks to\n", "total assets.\n", "\n", "You can also include different sets of controls in your model; often\n", "adding different “layers” of controls is a very good way to understand\n", "how different variables interact and affect your conclusions. 
Here’s an\n", "example, adding on several different “layers” of controls:" ], "id": "00faf937-8ff6-4b2b-afcc-6f42928a576f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression1 <- lm(wealth ~ gender, data = SFS_data)\n", "regression2 <- lm(wealth ~ gender + Education, data = SFS_data)\n", "regression3 <- lm(wealth ~ gender + Education + risk_proxy, data = SFS_data)\n", "regression4 <- lm(wealth ~ gender + Education + risk_proxy + business + province + credit_limit, data = SFS_data)\n", "\n", "stargazer(regression1, regression2, regression3, regression4, title=\"Comparison of Controls\",\n", " align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"))" ], "id": "35c2393b-97c4-4e82-aec4-ee670a93650e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "A pretty big table! Often, when we want to focus on just a single\n", "variable, we will simplify the table by just explaining which controls\n", "are included. Here’s an example which is much easier to read; it uses\n", "some formatting tricks which you don’t need to worry about right now:" ], "id": "49d197f7-1f88-4d47-b2c2-cf29b87d27a6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var_omit = c(\"(province)\\\\w+\",\"(Education)\\\\w+\") #don't worry about this right now!\n", "\n", "stargazer(regression1, regression2, regression3, regression4, title=\"Comparison of Controls\",\n", " align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"), \n", " omit = var_omit,\n", " add.lines = list(c(\"Education Controls\", \"No\", \"Yes\", \"Yes\", \"Yes\"),\n", " c(\"Province Controls\", \"No\", \"No\", \"No\", \"Yes\")))\n", "\n", "#this is very advanced code; don't worry about it right now; we will come back to it at the end of the course" ], "id": "9fc61efe-bd1f-46d8-87a2-94f439f319a5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice in the above how the coefficients change when we change the\n", 
"included control variables. Understanding this kind of variation is\n", "really important to interpreting a model, and whether or not the results\n", "are credible. For example - ask yourself why the gender-wealth gap\n", "decreases as we include more control variables. What do you think?\n", "\n", "### Omitted Variables\n", "\n", "Another important topic comes up in the context of multiple regression:\n", "**omitted variables**. In a simple regression, this didn’t really mean\n", "anything, but now it does. When we have a large number of variables in a\n", "dataset, which ones do we include in our regression? All of them? Some\n", "of them?\n", "\n", "This is actually a very important problem, since it has crucial\n", "implication for the interpretation of our model. For example, remember\n", "Assumption 1? This is a statement about the “true” model - not what you\n", "are actually running. It can very easily be violated when variables\n", "aren’t included.\n", "\n", "We will revisit this later in the course, since it only really makes\n", "sense in the context of causal models, but for now we should pay close\n", "attention to which variables we are including and why. Let’s explore\n", "this, using the exercises.\n", "\n", "## Part 3: Exercises\n", "\n", "### Theoretical Activity 1\n", "\n", "Suppose you have a regression model that looks like:\n", "\n", "$$Y_i = \\beta_0 + \\beta_1 X_{i} + \\beta_2 D_{i} + \\epsilon_i$$\n", "\n", "Where $D_i$ is a dummy variable. Recall that Assumption 1 implies that\n", "$E[\\epsilon_i|D_{i}, X_{i}] = 0$. Suppose this assumption holds true.\n", "Answer the following:\n", "\n", "1. Compute $E[Y_i|X_i,D_i=1]$ and $E[Y_i|X_i,D_i=0]$\n", "2. What is the difference between these two terms?\n", "3. Interpret what the coefficient $\\beta_2$ means in this regression,\n", " using your answers in 1 and 2.\n", "\n", "#### Theoretical Answer 1\n", "\n", "**Complete the Exercise**: Carefully write your solutions in the box\n", "below. 
Use mathematical notation where appropriate, and explain your\n", "results.\n", "\n", "**TA 1 Answer**: Answer in red here" ], "id": "dc309763-fde4-4abd-b428-7da6098b7d49" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_0 <- #fill in your short answer" ], "id": "a09204b2-2e0c-4712-b64d-f9b0a9e18e47" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Practical Activity 1\n", "\n", "To explore the mechanics of multiple regressions, let’s return to the\n", "analysis that we did in Module 1; that is, let’s re-examine the\n", "relationship between the gender income gap and education.\n", "\n", "Run a simple regression for the gender income gap (with a single\n", "regressor) for each education level. Then, run a multiple regression for\n", "the gender income gap that includes `education` (lowercase `e`, not the\n", "two-level `Education` variable) as a control.\n", "\n", "Tested objects: `reg_LESS` (simple regression; less than high\n", "school), `reg_HS` (high school diploma), `reg_NU` (Non-university\n", "post-secondary), `reg_U` (university), `reg2` (multiple regression)." 
], "id": "4420f6cd-70e8-4e91-9728-f47fceef0fd9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Less than high school\n", "reg_LESS <- lm(???, data = filter(SFS_data, education == \"Less than high school\"))\n", "test_2() #For reg_LESS\n", "\n", "#High school diploma\n", "reg_HS <- lm(???, data = filter(SFS_data, education == \"High school\"))\n", "test_2.5() #For reg_HS\n", "\n", "#Non-university post-secondary\n", "reg_NU <- lm(???, data = filter(SFS_data, education == \"Non-university post-secondary\"))\n", "test_3() #For reg_NU\n", "\n", "\n", "#University\n", "reg_U <- lm(???, data = filter(SFS_data, education == \"University\"))\n", "test_3.5() #For reg_U\n", "\n", "#Multiple regression\n", "reg2 <- lm(???, data = SFS_data)\n", "test_4() #For reg2\n", "\n", "#Table comparing regressions\n", "stargazer(reg_LESS, reg_HS, reg_NU, reg_U, \n", " title = \"Comparing Conditional Regressions with Multiple Regression\", align = TRUE, type = \"text\", keep.stat = c(\"n\",\"rsq\")) \n", "summary(reg2)" ], "id": "a2c7dcf5-f1d0-4e3a-9651-41a262231e27" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 1\n", "\n", "**Prompt**: What variable “value” appears to be missing from the\n", "multiple regression in the table? How can we interpret the average\n", "income for the group associated with that value? Hint: Dummy Variables\n", "\n", "Answer in red here" ], "id": "e9e32e6c-1c7a-445e-84a5-f7bba0fe4a49" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_1 <- #fill in your short answer" ], "id": "b7105712-90c3-46f6-8412-e054536bba0d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 2\n", "\n", "Prompt: Compare the coefficient estimates for `gender` across each of\n", "the simple regressions. How does the gender income gap appear to vary\n", "across education levels? 
How should we interpret this variation?\n", "\n", "Answer in red here" ], "id": "914a9105-408b-4a5a-b8fd-6dde242b3404" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_2 <- #fill in your short answer" ], "id": "e953e5c9-0231-40bb-aead-c208084fb137" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 3\n", "\n", "Prompt: Compare the simple regressions’ estimates with those of the\n", "multiple regression. How does the multiple regression’s coefficient\n", "estimate on `gender` compare to those estimates in the simple\n", "regressions? How can we interpret this? Further, how do we interpret the\n", "coefficient estimates on the other regressors in the multiple\n", "regression?\n", "\n", "Answer in red here" ], "id": "b5306a12-62ca-4283-8e82-78d3fdcf2fd4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_3 <- #fill in your short answer" ], "id": "2a503c01-e521-4037-8319-a62f43f3f042" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Activity 2\n", "\n", "Consider the multiple regression that we estimated in the previous\n", "activity:\n", "\n", "$$W_i = \\beta_0 + \\beta_1 Gender_i + \\beta_2 S_i + \\epsilon_i$$\n", "\n", "Note that $Gender_i$ is `gender` and $S_i$ is `education`.\n", "\n", "### Short Answer 4\n", "\n", "Prompt: Why might we be skeptical of the argument that $\\beta_1$\n", "captures the gender income gap (i.e., the effect of having a female\n", "main earner on the household’s income, all else being equal)? 
What can we do\n", "to address these concerns?\n", "\n", "Answer in red here" ], "id": "632a20b0-cc0a-4976-b36d-ea0c128afa1d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_4 <- #fill in your short answer" ], "id": "98039a6d-f419-45d8-9aa7-a83b4bca0287" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 5\n", "\n", "Prompt: Suppose that a member of your research team suggests that we\n", "should add `age` as a control in the regression. Do you agree with this\n", "team member that this variable would be a good control? Why or why not?\n", "\n", "Answer in red here" ], "id": "a4c90bde-bc53-4294-adc6-2458d66e8aad" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_5 <- #fill in your short answer" ], "id": "aa131413-4076-4cad-a44e-f8164a883aef" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s first simplify the levels of the age variable using the following code."
], "id": "1111b2ee-876c-4f86-bc21-31525de3e14d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Just run this!\n", "SFS_data <- \n", " SFS_data %>%\n", " mutate(agegr = case_when(\n", " age == \"01\" ~ \"Under 30\",\n", " age == \"02\" ~ \"Under 30\",\n", " age == \"03\" ~ \"Under 30\",\n", " age == \"04\" ~ \"30-45\",\n", " age == \"05\" ~ \"30-45\",\n", " age == \"06\" ~ \"30-45\",\n", " age == \"07\" ~ \"45-60\",\n", " age == \"08\" ~ \"45-60\",\n", " age == \"09\" ~ \"45-60\",\n", " age == \"10\" ~ \"60-75\",\n", " age == \"11\" ~ \"60-75\",\n", " age == \"12\" ~ \"60-75\",\n", " age == \"13\" ~ \"Above 75\",\n", " age == \"14\" ~ \"Above 75\",\n", " )) %>%\n", " mutate(agegr = as_factor(agegr))\n", "\n", "SFS_data$agegr <- relevel(SFS_data$agegr, ref = \"Under 30\") #Set \"Under 30\" as default factor level" ], "id": "faa4a124-e030-4564-85b0-eb4601c72b5b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add `agegr` to the given multiple regression and compare it with the\n", "model that we estimated in the previous activity.\n", "\n", "Tested Objects: `reg3` (the same multiple regression that we\n", "estimated before, but with age added as a control)." ], "id": "850ec157-abdd-435d-9b08-222d9c0009a0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Add Age as Control\n", "#Add them in the order: gender, education, agegr\n", "reg3 <- lm(???, data = SFS_data)\n", "\n", "#Compare the regressions with and without this control\n", "stargazer(reg2, reg3, \n", " title = \"Multiple Regressions with and without Age Controls\", align = TRUE, type = \"text\", keep.stat = c(\"n\",\"rsq\")) \n", "\n", "test_5() #For reg3 " ], "id": "8d1bc3c3-2d68-4a89-be5a-9b44c6c6ec00" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 6\n", "\n", "Prompt: Compare the two regressions in the table above. 
What happens to\n", "the estimated gender income gap when we add age as a control? What might\n", "explain this effect?\n", "\n", "Answer in red here" ], "id": "87b23eb7-fbf2-4fac-8720-e6ee6a7b042f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_6 <- #fill in your short answer" ], "id": "c9de4c9e-b239-43ba-823e-cac594e31e88" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 7\n", "\n", "Prompt: Suppose that one of your fellow researchers argues that\n", "`employment` (employment status) should be added to the multiple\n", "regression as a control. That way, they reason, we can account for\n", "differences between employed and unemployed workers. Do you agree with\n", "their reasoning? Why or why not?\n", "\n", "Answer in red here" ], "id": "472093c1-6b11-4a9f-95a1-2e0e64532992" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_7 <- #fill in your short answer" ], "id": "8ffe62c8-ff2e-4205-91a5-e5935d47cd22" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s test this argument directly. Add `employment` as a control to the\n", "multiple regression with all previous controls. Estimate this new\n", "regression (`reg4`)." ], "id": "5b353fdc-0925-4a60-a3a4-604fc21e7df8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add in the order before, with employment last\n", "reg4 <- lm(???, data = SFS_data)\n", "\n", "summary(reg4)\n", "\n", "test_5.5() " ], "id": "14cf2c73-c0cd-49f8-bb8a-0287a2111e77" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 8\n", "\n", "Prompt: What happened when we tried to run the regression with\n", "`employment`? 
Does this “result” agree or disagree with your explanation\n", "in Short Answer 7?\n", "\n", "Answer in red here" ], "id": "54019b43-dbdc-4efc-93d1-5b6ea7644a2d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_8 <- #fill in your short answer" ], "id": "a51a643a-8d23-4e44-85fe-fc1dc372b5b2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Activity 3\n", "\n", "In the middle of your team’s discussion of which controls they should\n", "add to the multiple regression (the same one as the previous activity),\n", "your roommate bursts into the room and yells “Just add them all!” After\n", "a moment of confused silence, the roommate elaborates that it never\n", "hurts to add controls as long as they don’t “break” the regression (like\n", "`employment` and `agegr`). “Data is hard to come by, so we should use as\n", "much of it as we can get,” he says.\n", "\n", "Recall: Below are all of the variables in the dataset." ], "id": "2a29ece3-b5fb-4499-8c98-9e8bfc819921" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "glimpse(SFS_data) #Run Me!" ], "id": "fe5886dc-bdd6-41ce-a358-864e82018ad8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 9\n", "\n", "Prompt: Do you agree with your roommate’s argument? Why or why not?\n", "\n", "Answer in red here" ], "id": "14e0ae1c-b075-4887-81f3-854fb257b8f0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_9 <- #fill in your short answer" ], "id": "06f3b38f-0dbe-49a5-8b96-f08c62bef340" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s back up our argument with regression analysis. Estimate a\n", "regression that has the same controls as `reg3` from the previous\n", "activity, but add `pasrbuyg` as a control as well.\n", "\n", "Tested Objects: `reg5`.\n", "\n", "What is “pasrbuyg”?" 
], "id": "ea72b0f6-6beb-4d3e-b86c-d930dfaa6efa" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dictionary(???) #What goes in here?" ], "id": "731e6324-eca8-4fe2-aaa8-07d92e7c29ac" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source(\"intermediate_multiple_regression_functions.r\")\n", "#Add pasrbuyg to regression\n", "#Keep the order (gender, education, agegr, pasrbuyg)\n", "reg5 <- lm(???, data = SFS_data)\n", "\n", "#Table comparing regressions with and without pasrbuyg\n", "stargazer(reg3, reg5,\n", " title = \"Multiple Regressions with and without pasrbuyg\", align = TRUE, type = \"text\", keep.stat = c(\"n\",\"rsq\")) \n", "\n", "test_6() #For reg5 " ], "id": "46ffbca8-bf44-4842-ae15-42bdd5b76d9c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 10\n", "\n", "Prompt: Does the table above suggest that we should add `pasrbuyg` as a\n", "control?\n", "\n", "Answer in red here" ], "id": "ad0b89b3-5394-4b65-bf0e-5e122a9f5522" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_10 <- #fill in your short answer" ], "id": "0366d851-d616-4173-9147-e24206615818" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Answer 11\n", "\n", "Prompt: What other variables can be added as controls?\n", "\n", "Answer in red here" ], "id": "8a560229-c71a-4a03-b170-878afbf781af" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_11 <- #fill in your short answer" ], "id": "938bc8a1-231b-4546-9c03-960eba66f377" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }