{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.4 - Intermediate - Issues in Regression\n", "\n", "COMET Team
*Emrul Hasan, Jonah Heyl, Shiming Wu, William Co,\n", "Jonathan Graves* \n", "2022-12-08\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Multiple regression\n", "- Simple regression\n", "- Data analysis and introduction\n", "\n", "### Outcomes\n", "\n", "- Understand the origin and meaning of multicollinearity in regression\n", " models\n", "- Perform simple tests for multicollinearity using VIF\n", "- Be able to demonstrate common methods to fix or resolve collinear\n", " data\n", "- Understand the origin and meaning of heteroskedasticity in\n", " regression models\n", "- Perform a variety of tests for heteroskedasticity\n", "- Compute robust standard errors for regression models\n", "- Understand other techniques for resolving heteroskedasticity in\n", " regression models\n", "\n", "### References\n", "\n", "- Adapted from Statistics Canada, Survey of Financial Security, 2019,\n", " 2021. Reproduced and distributed on an “as is” basis with the\n", " permission of Statistics Canada. This does not constitute an\n", " endorsement by Statistics Canada of this product.\n", "- Hlavac, Marek (2018). stargazer: Well-Formatted Regression and\n", " Summary Statistics Tables. R package version 5.2.2.\n", " https://CRAN.R-project.org/package=stargazer" ], "id": "2d76c9b4-0be8-409a-bf46-e970548c49dd" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the packages we need\n", "library(car)\n", "library(tidyverse)\n", "library(haven)\n", "library(stargazer)\n", "library(lmtest)\n", "library(sandwich)" ], "id": "9f674842-0047-4053-8121-b6d91d600ecf" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the data and set it up\n", "source(\"intermediate_issues_in_regression_functions.r\")\n", "source(\"intermediate_issues_in_regression_tests.r\")\n", "SFS_data <- read_dta(\"../datasets_intermediate/SFS_2019_Eng.dta\")\n", "SFS_data <- clean_up_data(SFS_data) # massive data cleanup" ], "id": "c55a4a7f-065d-4af5-98da-1338b2137021" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "glimpse(SFS_data)" ], "id": "ce99b7c8-6814-415e-8b5b-ca9a1f09eb07" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will examine several important issues in multiple\n", "regression models, and explore how to identify, evaluate, and correct\n", "them where appropriate. It is important to remember that there can be\n", "many other issues that arise in specific regression models; as you learn\n", "more about econometrics and create your own research questions,\n", "different issues will arise. Consider these “examples” of some of the\n", "most common issues that arise in regression models.\n", "\n", "## Part 1: Multicollinearity\n", "\n", "Multicollinearity is a surprisingly common issue in applied regression\n", "analysis, in which several explanatory variables are correlated with\n", "each other. For example, suppose we regress the likelihood of marriage\n", "on years of education and annual income. In this case, the two\n", "explanatory variables, income and years of education, are\n", "highly correlated.
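\n", "\n", "A quick simulation illustrates the idea (a minimal sketch; the\n", "variables and numbers here are invented purely for illustration):\n", "\n", "``` r\n", "set.seed(123)\n", "education <- rnorm(500, mean = 14, sd = 2) # years of education\n", "income <- 5 + 3 * education + rnorm(500, sd = 2) # income rises with education\n", "cor(education, income) # strongly correlated regressors\n", "```\n", "\n", "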
More generally, multicollinearity refers to the situation where a\n", "variable is “overdetermined” by the other variables in a model, which\n", "results in less reliable regression output. For example, if we find a\n", "high coefficient on education, how certain are we that this coefficient\n", "is not really the result of high annual income as well? Let’s look at\n", "this problem mathematically; in calculating an OLS estimation, you are\n", "estimating a relationship like:\n", "\n", "$$\n", "Y_i = \\beta_0 + \\beta_1 X_i + \\epsilon_i\n", "$$\n", "\n", "You find the estimates of the coefficients in this model using OLS;\n", "i.e., solving an equation like:\n", "\n", "$$ \\min_{\\beta_0, \\beta_1} \\sum_{i=1}^n(Y_i - \\beta_0 - \\beta_1 X_i)^2 $$\n", "\n", "Under the OLS regression assumptions, this has a unique solution; i.e.,\n", "you can find unique values for $\\beta_0$ and $\\beta_1$.\n", "\n", "However, what if you wrote down a model with *two* intercept terms?\n", "\n", "$$\n", "Y_i = \\beta_0 + \\beta_1 + \\beta_2 X_i + \\epsilon_i\n", "$$\n", "\n", "This *seems* like it would be fine, but remember what you are doing:\n", "trying to find a *line* of best fit. The problem is that this equation\n", "does not define a unique line; the “intercept” is the sum\n", "$\\beta_a = \\beta_0 + \\beta_1$. There are two “parameters”\n", "($\\beta_0, \\beta_1$) for a single “characteristic” (the intercept).\n", "This means that the resulting OLS problem:\n", "\n", "$$ \\min_{\\beta_0, \\beta_1, \\beta_2} \\sum_{i=1}^n(Y_i - \\beta_0 - \\beta_1 - \\beta_2 X_i)^2 $$\n", "\n", "does not have a unique solution. In algebraic terms, it means you can\n", "find many representations of a line with two intercept parameters. This\n", "is referred to in econometrics as a lack of **identification**;\n", "multicollinearity is one way that identification can fail in regression\n", "models.\n", "\n", "You can see this in the following example, which fits an OLS regression\n", "of `wealth` on `income_before_tax` and then compares the fit to a\n", "regression with two intercepts. Try changing the values to see what\n", "happens.\n", "\n", "> **Note**: Make sure to understand what the example below is doing.\n", "> Notice how the results are exactly the same, no matter what the value\n", "> of `k` is." ], "id": "7ec4456f-38c2-46b9-8b75-417d9b0eebd7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg <- lm(wealth ~ income_before_tax, data = SFS_data)\n", "\n", "b_0 <- reg$coef[[1]]\n", "b_1 <- reg$coef[[2]]\n", "\n", "resid1 <- SFS_data$wealth - b_0 - b_1*SFS_data$income_before_tax\n", "\n", "k <- 90 # change me!\n", "\n", "# the two \"intercepts\" always sum to the original intercept\n", "b_0 <- (reg$coef[[1]])/2 - k\n", "b_1 <- (reg$coef[[1]])/2 + k\n", "b_2 <- reg$coef[[2]]\n", "\n", "resid2 <- SFS_data$wealth - b_0 - b_1 - b_2*SFS_data$income_before_tax\n", "\n", "ggplot() + geom_density(aes(x = resid1), color = \"blue\") + xlab(\"Residuals from 1 Variable Model\") + ylab(\"Density\")\n", "ggplot() + geom_density(aes(x = resid2), color = \"red\") + xlab(\"Residuals from 2 Variable Model\") + ylab(\"Density\")" ], "id": "17b5d850-9d04-4bc0-a0db-67ca4e4c56a7" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the residuals look *exactly* the same - despite these being\n", "from (purportedly) two different models. This is because they are not\n", "really two different models: they *identify* the same model!
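\n", "\n", "To drive the point home, here is a minimal sketch (reusing the `reg`\n", "object fitted in the cell above) showing that the sum of squared\n", "residuals is identical for *any* value of `k`, so OLS has no way to\n", "prefer one $(\\beta_0, \\beta_1)$ pair over another:\n", "\n", "``` r\n", "for (k in c(-100, 0, 100)) {\n", "  b_0 <- reg$coef[[1]]/2 - k\n", "  b_1 <- reg$coef[[1]]/2 + k\n", "  ssr <- sum((SFS_data$wealth - b_0 - b_1 -\n", "              reg$coef[[2]]*SFS_data$income_before_tax)^2, na.rm = TRUE)\n", "  print(ssr) # the same number every time\n", "}\n", "```\n", "\n", "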
Okay, you’re probably thinking, that makes sense - but just don’t write\n", "down an equation like that. After all, it seems somewhat artificial that\n", "we added an extra intercept term.\n", "\n", "However, multicollinearity can occur with *any* set of variables in the\n", "model; not just the intercept. For example, suppose you have a multiple\n", "regression:\n", "\n", "$$\n", "Y_i = \\beta_0 + \\beta_1 X_{1,i} + \\beta_2 X_{2,i} + \\beta_3 X_{3,i} + \\epsilon_i\n", "$$\n", "\n", "What would happen if there was a relationship between $X_1, X_2$ and\n", "$X_3$ like:\n", "\n", "$$\n", "X_{1,i} = 0.4 X_{2,i} + 12 X_{3,i}\n", "$$\n", "\n", "We could then re-write the equation as:\n", "\n", "$$\n", "Y_i = \\beta_0 + \\beta_1 (0.4 X_{2,i} + 12 X_{3,i}) + \\beta_2 X_{2,i} + \\beta_3 X_{3,i} + \\epsilon_i\n", "$$\n", "\n", "$$\n", "\\implies Y_i = \\beta_0 + (\\beta_2 + 0.4 \\beta_1) X_{2,i} + (\\beta_3 + 12 \\beta_1)X_{3,i} + \\epsilon_i\n", "$$\n", "\n", "The same problem is now occurring, but with $X_2$ and $X_3$: the slope\n", "coefficients depend on a free parameter ($\\beta_1$). You cannot uniquely\n", "find the equation of a line (or plane) with this kind of equation.\n", "\n", "Essentially, you are trying to solve for 3 parameters (or $n$), but only\n", "2 (or $n-1$) independent pieces of information are available. The\n", "natural response is to leave one of the collinear variables out, so that\n", "the remaining parameters can be solved for; this is exactly what R does.\n", "\n", "You can also intuitively see the condition here: multicollinearity\n", "occurs when you can express one variable as a *linear* combination of\n", "the other variables in the model.\n", "\n", "- This is sometimes referred to as **perfect multicollinearity**,\n", " since the variable is *perfectly* expressed as a linear combination\n", " of the other variables.\n", "- The linearity is important because this is a linear model; you can\n", " have similar issues in other models, but it has a special name in\n", " linear regression.\n", "\n", "### Perfect Multicollinearity in Models\n", "\n", "In general, most statistical packages (like R) will automatically\n", "detect, warn, and remove perfectly multicollinear variables from a\n", "model; this is because the algorithm they use to solve problems like the\n", "OLS estimation equation detects the problem and avoids a “crash”. This\n", "is fine, from a mathematical perspective - since mathematically the two\n", "results are the same (in a well-defined sense, as we saw above).\n", "\n", "However, from an economic perspective this is very bad - it indicates\n", "that there was a problem with the *model* that you defined in the first\n", "place. Usually, this means one of three things:\n", "\n", "1. You included a set of variables which were, in combination,\n", " identical. For example, including “family size” and then “number of\n", " children” and “number of adults” in a regression.\n", "2. You did not understand the data well enough, and variables had less\n", " variation than you thought they did, conditional on the other\n", " variables in the model. For example, maybe you thought people in the\n", " dataset could have both graduate and undergraduate degrees, so that\n", " there was variation in “higher than high-school”, but that wasn’t\n", " true.\n", "3. You wrote down a model which was poorly defined in terms of the\n", " variables. For example, you included all levels of a dummy variable,\n", " or included the same variable measured in two different units (wages\n", " in dollars and wages in thousands of dollars).
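\n", "\n", "As a concrete illustration of the third case, here is a minimal sketch\n", "(the rescaled variable `income_k` is created purely for this\n", "demonstration) of what R does when the same variable enters a model in\n", "two different units:\n", "\n", "``` r\n", "# income in dollars and in thousands of dollars: perfectly collinear\n", "demo_data <- SFS_data %>%\n", "  mutate(income_k = income_before_tax / 1000)\n", "\n", "summary(lm(wealth ~ income_before_tax + income_k, data = demo_data))\n", "# one term is dropped: \"(1 not defined because of singularities)\"\n", "```\n", "\n", "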
In all of these cases, you need to go back to your original regression\n", "model and re-evaluate what you are trying to do in order to simplify the\n", "model or correct the error.\n", "\n", "Consider the following regression model, in which we want to study\n", "whether or not there is a penalty for families led by someone who is\n", "younger in the SFS data:" ], "id": "61d53680-87d2-4c00-b948-b5212a6ace66" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- SFS_data %>%\n", " mutate(ya = case_when( # ya = \"Yes\" if the family head has no post-secondary education\n", " education == \"Less than high school\" ~ \"Yes\",\n", " education == \"High school\" ~ \"Yes\",\n", " education == \"Non-university post-secondary\" ~ \"No\",\n", " TRUE ~ \"No\" # this is for all other cases\n", " )) %>%\n", " mutate(ya = as_factor(ya))\n", "\n", "regression2 <- lm(income_before_tax ~ ya + education, data = SFS_data)\n", "\n", "summary(regression2)" ], "id": "3a63b2d8-f68a-4ebb-ab4c-0eac655a8b62" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you see why the multicollinearity is occurring here? Try to write\n", "down an equation which points out what the problem is in this\n", "regression - why is it multicollinear? How could you fix this problem\n", "by changing the model?\n", "\n", "> **Think Deeper**: You will notice, above, that R excluded the\n", "> “University” education level. Did it have to exclude that one? Could\n", "> it have excluded another one instead? What do you think?\n", "\n", "### Imperfect Multicollinearity\n", "\n", "A related issue to perfect multicollinearity is “near” (or\n", "**imperfect**) multicollinearity. If you recall from the above, perfect\n", "multicollinearity occurs when you have a relationship like:\n", "\n", "$$\n", "X_{1,i} = 0.4 X_{2,i} + 12 X_{3,i}\n", "$$\n", "\n", "Notice that this relationship holds *for all* values of $i$.\n", "However, what if it held for *nearly* all $i$ instead? In that case, we\n", "would still have a solution to the equation… but there would be a\n", "problem. Let’s look at this in the simple regression case.\n", "\n", "$$\n", "Y_i = \\beta_0 + \\beta_1 X_i + \\epsilon_i\n", "$$\n", "\n", "Now, let’s suppose that $X_i$ is “almost” collinear with $\\beta_0$. To\n", "be precise, suppose that $X_i = 15$ for a fraction $k$ of the data ($k$\n", "will be large) and $X_i = 20$ for the remaining fraction $(1-k)$. This\n", "is *almost* constant, and so it is *almost* collinear with $\\beta_0$\n", "(the constant). Let’s also generate $Y_i$ so that\n", "$Y_i = X_i + \\epsilon_i$ (so $\\beta_1 = 1$), and scale things so that\n", "$\\sigma_Y = 1$.\n", "\n", "This implies that:\n", "\n", "$$\n", "\\beta_1 = \\frac{Cov(X_i,Y_i)}{Var(X_i)} = 1\n", "$$\n", "\n", "$$\n", "s_b = \\frac{1}{\\sqrt{n-2}}\\sqrt{\\frac{1}{r^2}-1}\n", "$$\n", "\n", "$$\n", "r = \\frac{\\sigma_X}{\\sigma_Y}\n", "$$\n", "\n", "As you can see, when $Var(X_i)$ goes down, $\\sigma_X$ falls, and the\n", "value of $r$ falls; intuitively, as $k$ rises toward 1, the variance\n", "goes to zero, which makes $r$ go to zero as well (since there’s no\n", "variation). You can then see that $s_b$ diverges to infinity.\n", "\n", "We can make this more precise. In this model, how does $Var(X_i)$ depend\n", "on $k$? Well, first notice that\n", "$\\bar{X_i} = 15\\cdot k + 20 \\cdot (1-k)$. Then,\n", "\n", "$$\n", "Var(X_i) = E[(X_i - \\bar{X_i})^2] = k (15 - \\bar{X_i})^2 + (1-k)(20 - \\bar{X_i})^2\n", "$$\n", "\n", "$$\n", "\\implies Var(X_i) = 25[k(1-k)^2 + (1-k)k^2] = 25k(1-k)\n", "$$
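\n", "\n", "A quick numerical check of this formula (a minimal sketch with\n", "$n = 1000$ and $k = 0.9$; the numbers are chosen only for illustration):\n", "\n", "``` r\n", "k <- 0.9\n", "x <- c(rep(15, 900), rep(20, 100)) # X_i = 15 for 90% of 1,000 observations\n", "mean((x - mean(x))^2)              # population variance: 2.25\n", "25 * k * (1 - k)                   # the formula gives the same: 2.25\n", "```\n", "\n", "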
Then,\n", "\n", "$$\n", "Var(X_i) = (X_i - \\bar{X_i})^2 = k (15 - \\bar{X_i})^2 + (1-k)(20 - \\bar{X_i})^2\n", "$$\n", "\n", "$$\n", "\\implies Var(X_i) = 25[k(1-k)^2 + (1-k)k^2]\n", "$$\n", "\n", "Okay, that looks awful - so let’s plot a graph of $s_b$ versus $k$ (when\n", "$n = 1000$):" ], "id": "22f5d9a8-9cf4-4951-9541-d2b60d7406e8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(repr.plot.width=6,repr.plot.height=4)\n", "\n", "r = 0.01 \n", "\n", "eq = function(k){(1/sqrt(1000-2))*(1/(25*(k*(1-k)^2 + (1-k)*k^2))-1)}\n", "s = seq(0.5, 1.00, by = r)\n", "n = length(s)\n", "\n", "plot(eq(s), type='l', xlab=\"Values of K\", ylab=\"Standard Error\", xaxt = \"n\")\n", "axis(1, at=seq(0, n-1, by = 10), labels=seq(0.5, 1.00, by = 10*r))\n", "\n", "# You will notice that the plot actually diverges to infinity\n", "# Try making R smaller to show this fact!\n", "# Notice the value at 1 increases" ], "id": "862861b7-2398-43fc-a581-02fb9613e2bf" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why does this happen? The reason actually has to do with *information*.\n", "\n", "When you estimate a regression, you are using the variation in the data\n", "to estimate each of the parameters. As the variation falls, the\n", "estimation gets less and less precise, because you are using less and\n", "less data to make an evaluation. The magnitude of this problem can be\n", "quantified using the **VIF** or **variance inflation factor** for each\n", "of the variables in question. Graphically you can think of regression as\n", "drawing a best fit line through data points. Now if the variance is $0$\n", "in the data, there is just one data point. If you remember from high\n", "school you need two points to draw a line, so with $0$ variance the OLS\n", "problem becomes ill-defined.\n", "\n", "We can calculate this directly in R by using the `vif` function. Let’s\n", "look at the collinearity in our model:" ], "id": "2d727f1d-dce6-4a85-81e9-faa35e2119d3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression2 <- lm(wealth ~ income_before_tax + income_after_tax, data = SFS_data)\n", "\n", "summary(regression2)" ], "id": "0ea38c15-66d1-450b-badf-bfc0b6b24a21" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cat(\"Variance inflation factor of income after tax on wealth: \",vif(regression2,SFS_data$income_after_tax,SFS_data$wealth),'\\n')\n", "cat(\"Variance inflation factor of income before tax on wealth: \",vif(regression2,SFS_data$income_before_tax,SFS_data$wealth),'\\n')\n", "cat(\"Variance inflation factor of income before tax on income after tax: \",vif(regression2,SFS_data$income_before_tax,SFS_data$income_after_tax),'\\n')" ], "id": "30b62904-52d2-43e8-b720-19451b1ee665" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the extremely large VIF. This would indicate that you have a\n", "problem with collinearity in your data.\n", "\n", "> **Think Deeper**: What happens to the VIF as `k` changes? Why? Can you\n", "> explain?\n", "\n", "There are no “hard” rules for what makes a VIF too large - you should\n", "think about your model holistically, and use it as a way to investigate\n", "whether you have any problems with your model evaluation and analysis.\n", "\n", "## Part 2: Heteroskedasticity\n", "\n", "**Heteroskedasticity** (Het-er-o-sked-as-ti-city) is another common\n", "problem in many economic models. 
## Part 2: Heteroskedasticity\n", "\n", "**Heteroskedasticity** (het-er-o-sked-as-ti-city) is another common\n", "problem in many economic models. It refers to the situation in which the\n", "distribution of the residuals changes as the explanatory variables\n", "change. We can usually visualize this problem by drawing a residual\n", "plot, where a fan or cone shape indicates the presence of\n", "heteroskedasticity. For example, consider this regression:" ], "id": "f90614c4-9b66-4312-88ca-5984f40c2eed" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression3 <- lm(income_before_tax ~ income_after_tax, data = SFS_data)\n", "\n", "ggplot(data = SFS_data, aes(x = as.numeric(income_after_tax), y = as.numeric(regression3$residuals))) + geom_point() + labs(x = \"After-tax income\", y = \"Residuals\")" ], "id": "43a04cd9-2c72-4bba-8d1b-54430767efb8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This obviously does not look like a distribution which is unchanging as\n", "income after tax changes. This is a good “eyeball test” for\n", "heteroskedasticity. Why does heteroskedasticity arise? For many reasons:\n", "\n", "1. It can be a property of the data; it just happens that some values\n", " show more variation, due to the process which creates the data. One\n", " of the most common ways this can arise is where there are several\n", " different economic processes creating the data.\n", "\n", "2. It can be because of an unobserved variable. This is similar to the\n", " above: some process that we could have quantified in a variable has\n", " been left out of the model. This could create bias in our model, but\n", " it will also show up in the standard errors in this way.\n", "\n", "3. It can be because of your model specification. Models, by their very\n", " nature, can be heteroskedastic (or not); we will explore one\n", " important example later in this worksheet.\n", "\n", "4. There are many other reasons, which we won’t get into here.\n", "\n", "Whatever the reason it exists, you need to correct for it - if you\n", "don’t, your coefficient estimates will be OK, but your standard errors\n", "will be incorrect. You can do this in a few ways. The first way is to\n", "try to transform your variables so that the “transformed” model (a)\n", "makes economic sense, and (b) no longer suffers from heteroskedasticity.\n", "For example, perhaps a *log-log* style model might work here:" ], "id": "9c06049f-c690-4978-83f3-11f2ee2354bc" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- SFS_data %>%\n", " filter(income_before_tax > 0) %>% # keep only positive incomes, so the log is defined\n", " mutate(lnincome_before_tax = log(income_before_tax))\n", "\n", "SFS_data <- SFS_data %>%\n", " filter(income_after_tax > 0) %>%\n", " mutate(lnincome_after_tax = log(income_after_tax))\n", "\n", "regression4 <- lm(lnincome_before_tax ~ lnincome_after_tax, data = SFS_data)\n", "\n", "ggplot(data = SFS_data, aes(x = lnincome_after_tax, y = regression4$residuals)) + geom_point() + labs(x = \"Log of after-tax income\", y = \"Residuals\")" ], "id": "2c069dda-d5e0-47b6-bf57-b7faeb0bdcff" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Think Deeper**: Do the errors of this model seem homoskedastic?\n", "\n", "As you can see, that didn’t work out. This is pretty typical: when you\n", "transform a model by changing the variables, what you are really doing\n", "is adjusting how you think the data process should be described so that\n", "it’s no longer heteroskedastic. If you aren’t correct with this, you\n", "won’t fix the problem.\n", "\n", "For example, in a *log-log* model, we are saying “there’s a\n", "multiplicative relationship”… but that probably doesn’t make sense here.\n", "This is one of the reasons why data transformations are not usually a\n", "good way to fix this problem unless you have a very clear idea of what\n", "the transformation *should* be.
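\n", "\n", "As an aside, base R offers a quick version of the eyeball test for any\n", "fitted model; a minimal sketch, using the `regression4` object from\n", "above:\n", "\n", "``` r\n", "plot(regression4, which = 1) # residuals vs. fitted values; look for a fan shape\n", "```\n", "\n", "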
The most robust (no pun intended) way is to simply use standard errors\n", "which are robust to heteroskedasticity. There are actually a number of\n", "different versions of these (which you don’t need to know about), but\n", "they are all called **HC** or **heteroskedasticity-corrected** standard\n", "errors. In economics, we typically adopt White’s version of these\n", "(called **HC1** in the literature); these are often referred to in\n", "economics papers as “robust” standard errors (for short).\n", "\n", "This is relatively easy to do in R. Basically, you run your model, as\n", "normal, and then re-test the coefficients with the corrected standard\n", "errors using the `coeftest` command, specifying which kind of errors you\n", "want to use. Here is an example:" ], "id": "17d00731-34e9-47a6-9606-6976c6fc2fa4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression5 <- lm(income_before_tax ~ income_after_tax, data = SFS_data)\n", "\n", "summary(regression5)\n", "\n", "coeftest(regression5, vcov = vcovHC(regression5, type = \"HC1\"))" ], "id": "1baccd10-3e7e-44a1-ac41-0930820aae34" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the standard errors (and significance tests) give\n", "different results; in particular, the HC1 errors are almost 10 times\n", "larger than the uncorrected errors. In this particular model, it didn’t\n", "make much of a difference to the conclusions (even though it changed the\n", "$t$ statistics a lot), but it can sometimes change your results.\n", "\n", "### Testing for Heteroskedasticity\n", "\n", "You can also perform some formal tests for heteroskedasticity.\n", "\n", "1. White’s Test, which relies on performing a regression using the\n", " residuals.\n", "2. Breusch-Pagan Test, which also relies on performing a simpler\n", " regression using the residuals.\n", "\n", "Both of them are, conceptually, very similar. Let’s try (2) for the\n", "above regression:" ], "id": "e2677ef4-d9e4-41af-bce2-ea857ff8fc5a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression2 <- lm(income_before_tax ~ income_after_tax, data = SFS_data)\n", "\n", "SFS_data$resid_sq <- (regression2$residuals)^2 # get the residuals, then square them\n", "\n", "regression3 <- lm(resid_sq ~ income_after_tax, data = SFS_data) # make the squared residuals a function of X\n", "\n", "summary(regression3)" ], "id": "6637a7ce-b8a7-44ee-9d32-f257864b6e92" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspecting the results, we can see from the $F$-statistic that we can\n", "strongly reject the assumption of homoskedasticity. This is denoted by\n", "the 3 asterisks. This data looks like it’s heteroskedastic, because the\n", "residuals can be predicted using the explanatory variables.\n", "\n", "There is one very important note:\n", "\n", "- If your model **fails** one of these tests (the null hypothesis of\n", " homoskedasticity is rejected), your data is heteroskedastic.\n", "- If your model **passes** one of these tests, it *does not* imply that\n", " your data is homoskedastic (i.e. not heteroskedastic).\n", "\n", "This is because these are statistical tests, and the null hypothesis is\n", "“not heteroskedastic”. Failing to reject the null does not mean that the\n", "null hypothesis is correct - it just means that you can’t rule it out.\n", "This is one of the reasons many economists recommend that you *always*\n", "use robust standard errors unless you have a really compelling reason to\n", "believe otherwise.
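\n", "\n", "The `lmtest` package loaded at the top of this notebook also automates\n", "the Breusch-Pagan test; a minimal sketch, applied to the same model as\n", "the manual version above:\n", "\n", "``` r\n", "bptest(regression2) # a small p-value rejects the null of homoskedasticity\n", "```\n", "\n", "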
### Linear Probability Models\n", "\n", "How can a model naturally have heteroskedastic errors? It turns\n", "out that many common, and important, models have this issue. In\n", "particular, the **linear probability** model has this problem. If you\n", "recall, a linear probability model is a linear regression in which the\n", "dependent variable is a dummy. For example:\n", "\n", "$$\n", "D_i = \\beta_0 + \\beta_1 X_{1,i} + \\beta_2 X_{2,i} + \\epsilon_i\n", "$$\n", "\n", "These models are quite useful because the coefficients can be\n", "interpreted as the change in the probability of the dummy condition\n", "occurring. For example, we have previously used `gender` (male or\n", "female) in models like these to investigate the wealth gap. However,\n", "this can easily cause a problem when estimated using OLS: the value of\n", "$D_i$ must be 0 or 1, and the fitted values (which are probabilities)\n", "must be between 0 and 1.\n", "\n", "However, *nothing* in the OLS model forces this to be true. For any\n", "estimated $\\beta_1 \\neq 0$, an $X_{1,i}$ that is high or low enough will\n", "push the fitted value above 1 or below 0 (respectively). This implies\n", "that *mechanically* you have heteroskedasticity, because high or low\n", "values of the explanatory variables will ALWAYS fit worse than\n", "intermediate values. For example, let’s look at the fitted values from\n", "this regression:" ], "id": "9937d032-3d2f-4781-a2cb-584ad45dd64a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- SFS_data %>%\n", " mutate(M_F = case_when(\n", " gender == \"Male\" ~ 0,\n", " gender == \"Female\" ~ 1\n", " ))\n", "\n", "SFS_data <- SFS_data[complete.cases(SFS_data$gender,SFS_data$income_before_tax), ]\n", "SFS_data$gender <- as.numeric(SFS_data$gender) # turns the gender factor into a numeric code\n", "SFS_data$income_before_tax <- as.numeric(SFS_data$income_before_tax)\n", "\n", "regression6 <- lm(gender ~ income_before_tax, data = SFS_data)\n", "\n", "SFS_data$fitted <- predict(regression6, SFS_data)\n", "\n", "summary(regression6)\n", "\n", "ggplot(data = SFS_data, aes(x = as.numeric(income_before_tax), y = fitted)) + geom_point() + labs(x = \"Before-tax income\", y = \"Predicted Probability\")" ], "id": "3a246889-88e3-4756-a07a-cdac0d8f6670" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the fitted value drops as income gets larger. If someone\n", "has an income of over 1 million dollars, they would be predicted to have\n", "a negative probability of being female - which is impossible.\n", "\n", "This is why you must *always* use robust standard errors in these\n", "models - even if a test says otherwise. Let’s think about what is\n", "happening here: remember the example of imperfect collinearity, where\n", "there was a $k$ chance of $x$ being 15 and a $(1-k)$ chance of $x$ being\n", "20, and $x$ was nearly collinear with $\\beta_0$, which caused large\n", "standard errors. In this scenario, the probability of someone being\n", "female, given that they earn over a million dollars a year, is very\n", "small: female-led households are only a small share of all households\n", "earning over a million dollars a year, so there is very little variation\n", "at that end of the data.
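\n", "\n", "Following that advice, here is a minimal sketch applying the HC1\n", "correction to the linear probability model estimated above:\n", "\n", "``` r\n", "coeftest(regression6, vcov = vcovHC(regression6, type = \"HC1\"))\n", "```\n", "\n", "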
## Part 3: Exercises\n", "\n", "This section has both written and coding exercises for you to test your\n", "knowledge of issues in regression. The answers to the written\n", "exercises are in the last section of the notebook.\n", "\n", "### Questions\n", "\n", "Multicollinearity may seem to be an abstract concept, so let’s explore\n", "this issue with a practical example.\n", "\n", "Suppose that we are looking to explore the relationship between family\n", "income and the gender of the major earner. We want to know whether\n", "families with higher incomes in Canada are more likely to have male\n", "major earners. Recall that we have two measures of income:\n", "`income_before_tax` and `income_after_tax`. Both measures of income are\n", "informative: `income_before_tax` refers to gross annual income (before\n", "taxes) that employers pay to employees; `income_after_tax` refers to net\n", "income after taxes have been deducted.\n", "\n", "Since they are both good measures of income, we decide to put them both\n", "in our regression:\n", "\n", "$$\n", "M_F = \\beta_0 + \\beta_1 I_{bi} + \\beta_2 I_{ai} + \\epsilon_i\n", "$$\n", "\n", "where\n", "\n", "- $M_F$ denotes the dummy variable for whether the person is male or\n", " female\n", "- $I_{bi}$ denotes income before taxes\n", "- $I_{ai}$ denotes income after taxes\n", "\n", "1. What concern should we have about this regression equation? Explain\n", " your intuition.\n", "\n", "Before we continue, let’s reduce the sample size of our data set to 200\n", "observations. We will also convert `gender` into a numeric variable:" ], "id": "d05be652-66e1-47fe-9656-723f80a45418" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# run this!\n", "SFS_data200 <- head(SFS_data, 200) %>%\n", " mutate(M_F = as.numeric(gender)) # keep the first 200 observations; code gender numerically" ], "id": "aeb2f40b-a549-4e83-9280-86a36b534719" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the regression described above between family income and the gender\n", "of the major earner.\n", "\n", "Tested Objects: `reg1`." ], "id": "25a84292-436d-4ca3-88a0-8d009c53717c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg1 <- lm(???, data = SFS_data200) \n", "\n", "summary(reg1)\n", "\n", "test_2()" ], "id": "36b1d81f-8ee9-4d10-8407-231a9f083bbe" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. What do you notice about the characteristics of the estimated\n", " regression? Does anything point to your concern being valid?\n", "\n", "Now, let’s suppose we drop 50 more observations:" ], "id": "9a1734dd-d819-4df1-ba46-79430b76e966" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# run this!\n", "SFS_data150 <- head(SFS_data200, 150)" ], "id": "883ce5f3-790d-4a41-8525-6a686fd3e023" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the regression model again and compare it with the previous\n", "regression.\n", "\n", "Tested Objects: `reg2`." ], "id": "a692d1a2-892a-47c3-ad30-a68a26a02bce" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg2 <- lm(???) 
\n", "\n", "summary(reg2)\n", "\n", "test_4() " ], "id": "ef57322e-6dc2-4452-8184-5e0b85234ad7" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. What happened to the regression estimates when we dropped 50\n", " observations? Does this point to your concern being valid?\n", "\n", "Next, increase the sample size back to its full size and run the\n", "regression once again.\n", "\n", "Tested Objects: `reg3`." ], "id": "9b4d8e80-efa9-43ce-ad8a-c2be35a58a1b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- SFS_data[complete.cases(SFS_data$income_after_tax), ] #do not modify this code\n", "SFS_data$income_after_tax <- as.numeric(SFS_data$income_after_tax) # do not modify this code\n", "\n", "reg3 <- lm(???) \n", "\n", "summary(reg3)\n", "\n", "test_6() " ], "id": "11909f8e-e60c-482a-b4ea-26066a2ffcb8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Did this change eliminate the concern? How do you know?\n", "\n", "Heteroskedasticity is another issue that researchers frequently deal\n", "with when they estimate regression models. Consider the following\n", "regression model:\n", "\n", "$$\n", "I_i = \\alpha_0 + \\alpha_1 E_i + \\alpha_2 G_i + \\epsilon_i\n", "$$\n", "\n", "where\n", "\n", "- $I_i$ denotes before tax income\n", "- $E_i$ is level of education\n", "- $D_i$ is a dummy variable for being female\n", "\n", "1. Should we be concerned about heteroskedasticity in this model? If\n", " so, what is the potential source of heteroskedasticity, and what do\n", " we suspect to be the relationship between the regressor and the\n", " error term?\n", "\n", "2. If we suppose that heteroskedasticity is a problem in this\n", " regression, what consequences will this have for our regression\n", " estimates?\n", "\n", "Run the regression below, and graph the residuals against the level of\n", "schooling." ], "id": "8dbffff7-7037-43fb-9d7d-dfcbf7719d93" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# run the regression\n", "reg5 <- lm(income_before_tax ~ education, data = SFS_data)" ], "id": "417cc46b-43f0-458e-a714-b8e25a0c9314" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg5 <- lm(income_before_tax~education, data = SFS_data)\n", "\n", "resiplot <- ggplot(reg5, aes(x = education, y = .resid)) + xlab(\"Education Level\") + ylab(\"Income (Residuals)\")\n", "resiplot + geom_point() + geom_hline(yintercept = 0) + scale_x_discrete(guide = guide_axis(n.dodge=3))" ], "id": "62ec7bab-2e60-4a24-ac75-3dccc8206bc2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Describe the relationship between education level and the residuals\n", " in the graph above. What does the graph tell us about the presence\n", " and nature of heteroskedasticity in the regression model?\n", "\n", "To test for heteroskedasticity formally, let’s perform the White Test.\n", "First, store the residuals from the previous regression in `SFS_data`.\n", "\n", "Tested Objects: `SFS_data` (checks to see that residuals were added\n", "properly)." 
], "id": "15b3223c-96a5-4f5b-a27c-fb4cd070a786" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data <- mutate(SFS_data, resid = ???)\n", "\n", "head(SFS_data$resid, 10) #Displays the residuals in the dataframe\n", "\n", "test_11() " ], "id": "58b8901e-d43b-4b63-b727-c192d1ce36f7" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, generate a variable for the squared residuals, then run the\n", "required auxiliary regression.\n", "\n", "Tested Objects: `WT` (the auxiliary regression)." ], "id": "8af2db95-cf88-4671-8f05-a6766acb18d6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model <- lm(income_before_tax~gender^2 + gender + education^2 +education+ education*gender, data = SFS_data)\n", "\n", "resid = reg5$residuals\n", "\n", "rsq=(resid)^2" ], "id": "2442ba0e-e949-466a-85d3-d33c34f55689" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SFS_data$rsq <- rsq\n", "\n", "WT <- lm(rsq ~ ???, data =SFS_data) # fill me in\n", "\n", "summary(WT)\n", "\n", "test_12() " ], "id": "3084e2c9-566a-4a29-856b-71ea43d4c7b5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. What does the white test suggest?\n", "\n", "2. Finish filling in this table:\n", "\n", "| Formal Issue Name | Problem | Meaning | Test | Solution |\n", "|------|--------------------|---------------|---------------|------------------|\n", "| ??? | Incorrect Standard errors, which can lead to incorrect confidence intervals etc | The distribution of residuals is not constant | White’s Test and Breusch-Pagan: `bptest()` | Add additional factors to regression or use robust standard errors |\n", "| Perfect Collinearity | ??? | One variable in the regression is a linear function of another variable in the regression | Collinearity test on the model, `ols_vif_tol(model)` | ??? |\n", "| Imperfect Collinearity | The model will have very large standard errors. R may need to omit a variable | One variable can almost be fully predicted with a linear function of another variable in the model | ??? | Omit one of the collinear variables, try using more data, or consider transformations (e.g., logarithms) |\n", "\n", "### Solutions\n", "\n", "1. These two variables, before tax and after tax income, may be close\n", " to co-linear, which will increase the error term.\n", "\n", "2. Coding exercise.\n", "\n", "3. It seems like `income_before_tax` was dropped from the regression.\n", " The reason for that is multicollinearity.\n", "\n", "4. Coding exercise\n", "\n", "5. Similar to 3.\n", "\n", "6. Coding exercise.\n", "\n", "7. This change does not eliminate the concern. Now we see both\n", " `income_before_tax` and `income_after_tax` on the regression output.\n", " However, only `income_after_tax` is significant. The model is\n", " picking up some differences between both variables (likely\n", " differences across tax brackets) but it still seems like the\n", " collinearity is affecting the model. Having collinear terms in a\n", " regression increases standard errors.\n", "\n", "8. The distribution of residuals will likely change relative to\n", " education. 
This is because the higher the level of education, the\n", " more additional factors are required to predict a person’s income.\n", " For instance, if someone has a four-year degree, the field of that\n", " degree will have a large impact on their income; by contrast, there\n", " is much less variance in income among high-school graduates.\n", "\n", "9. If we do not use robust standard errors, the standard errors will\n", " be understated.\n", "\n", "10. We can see that the residuals are increasing as the level of\n", " education increases, as predicted in the previous question. This\n", " indicates heteroskedasticity, as the distribution of errors is\n", " clearly not constant.\n", "\n", "11-12. Coding exercises.\n", "\n", "13. The White test suggests the model is heteroskedastic.\n", "\n", "14. Answers are: (A) Heteroskedasticity. (B) An explanatory variable\n", " can be written as a linear combination of other explanatory\n", " variables included in the model. (C) Drop one of the variables.\n", " (D) VIF test." ], "id": "18195a6e-1477-4ac8-8a27-99ab37857012" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }