{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 13 - Good Regression Practice\n", "\n", "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt \n", "2024-05-29\n", "\n", "## Prerequisites\n", "\n", "1. Importing data into R.\n", "2. Creating new variables in R.\n", "3. Running OLS regressions.\n", "\n", "## Learning Outcomes\n", "\n", "1. Identify and correct for outliers by trimming or winsorizing the\n", " dependent variable.\n", "2. Identify and correct for the problem of multicollinearity.\n", "3. Identify and correct for the problem of heteroskedasticity.\n", "4. Identify and correct for the problem of non-linearity.\n", "\n", "## 13.1 Dealing with Outliers\n", "\n", "Imagine that we have constructed a dependent variable which contains the\n", "earnings growth of individual workers and we see that some worker’s\n", "earnings increased by more than 400%. We might wonder if this massive\n", "change is just a coding error made by the statisticians that produced\n", "the data set. Even without that type of error, though, we might worry\n", "that the earnings growth of a small number of observations are driving\n", "the results of our analysis. If this is the case, we are producing an\n", "inaccurate analysis based on results that are not associated with the\n", "majority of our observations.\n", "\n", "The standard practice in these cases is to either winsorize or trim the\n", "subset of observations that are used in that regression. Both practices\n", "remove the outlier values in the dependent variable to allow us to\n", "produce a more accurate empirical analysis.\n", "\n", "**Warning:** We should only consider fixing outliers when there is a\n", "clear reason to address this issue. Do not apply the tools below if the\n", "summary statistics in your data make sense to you in terms of abnormal\n", "values. For example, outliers might be a sign that our dependent and\n", "explanatory variables have a non-linear relationship. If that is the\n", "case, we will want to consider including an interaction term that\n", "addresses that non-linearity. A good way to test for this is to create a\n", "scatter plot of our dependent and independent variables. This will help\n", "us to see if there are actually some outliers, or if there is just a\n", "non-linear relationship.\n", "\n", "### 13.1.1 Winsorizing a Dependent Variable\n", "\n", "Winsorizing is the process of limiting extreme values in the dependent\n", "variable to reduce the effect of (possibly erroneous) outliers. It\n", "consists of replacing values below the $a$th percentile by that\n", "percentile’s value, and values above the $b$th percentile by that\n", "percentile’s value. Consider the following example using our fake data\n", "set:" ], "id": "4b151757-ff1d-4a82-84e9-fa48555c5020" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Clear the memory from any pre-existing objects\n", "rm(list=ls())\n", "\n", "# loading in our packages\n", "library(tidyverse) #This includes ggplot2! \n", "library(haven)\n", "library(IRdisplay)\n", "\n", "#Open the dataset \n", "fake_data <- read_dta(\"../econ490-r/fake_data.dta\") " ], "id": "9c8e2831-2717-45a3-b467-2e29aa3f3db7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "quantile(fake_data$earnings, probs = c(0.01, 0.99))" ], "id": "c4f13de9-87e2-41ce-8f21-e881da4f369d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "min(fake_data$earnings)" ], "id": "c8142f07-8ce4-4306-8c3d-740cbf06592b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "max(fake_data$earnings)" ], "id": "05912603-8f9f-466d-8eeb-c16969a5eaa4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the summary statistics above, we can see that that the income\n", "earned by the individual at the 1st percentile is 2,831.03 and that the\n", "lowest earner in the data set earned 8.88.\n", "\n", "We can also see that income earned by the individual at the 99th\n", "percentile is only 607,140.32 and that the highest earner in the data\n", "earned over 60 millions!\n", "\n", "These facts suggest to us that there are large outliers in our dependent\n", "variable.\n", "\n", "We want to get rid of these outliers by winsorizing our data set. What\n", "that means is replacing the earnings of all observations below the 1st\n", "percentile by exactly the earnings of the individual at the 1st\n", "percentile, and replacing the earnings of all observations above the\n", "99th percentile by exactly the earnings of the individual at the 99th\n", "percentile.\n", "\n", "To winsorize this data, we do the following 3 step process:\n", "\n", "1. We create a new variable called *earnings_winsor* which is identical\n", " to our *earnings* variable using `mutate`. We choose to store the\n", " winsorized version of the dependent variable in a different variable\n", " so that we don’t overwrite the original data set.\n", "2. If earnings are smaller than the 1st percentile, we replace the\n", " values of *earnings_winsor* with the earnings of the individual at\n", " the 1st percentile:\n", " `(quantile(fake_data$earnings, probs = 0.01) = 2831)`.\n", "3. If earnings are larger than the 99th percentile, we replace the\n", " values of *earnings_winsor* with the earnings of the individual at\n", " the 99th percentile:\n", " `(quantile(fake_data$earnings, probs = 0.99) = 607140 )`.\n", "\n", "The values of this new variable will be created using the command\n", "`ifelse()`. If earnings are less than 2831, the value of\n", "*earnings_winsor* is replaced by 2831 using this command.\n", "\n", "We do this below:" ], "id": "eba8e2bf-3202-47de-ad95-49f10f1cdad4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data <- fake_data %>%\n", " mutate(earnings_winsor = ifelse(earnings<2831, 2831, ifelse(earnings>607140, 607140, earnings))) " ], "id": "bdf367ab-d487-42ff-aa00-e443ed03969e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will use this new dependent variable in our regression analysis.\n", "If the outliers were not creating problems, there will be no change in\n", "the results. If they were creating problems, those problems will now be\n", "fixed.\n", "\n", "Let’s take a look at this by first running the regression from [Module\n", "10](https://comet.arts.ubc.ca/docs/Research/econ490-r/10_Linear_Reg.html)\n", "with the original earning variable." ], "id": "c913d941-ceb3-430b-a969-6d36f4a0f29d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Generate the log of earnings_winsor\n", "fake_data <- fake_data %>%\n", " mutate(log_earnings_winsor = log(earnings_winsor)) " ], "id": "4d922570-cc7d-44c9-91d6-559e85c8fdb4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Run the regression from Module 10\n", "lm(data=fake_data, log(earnings) ~ as.factor(sex))" ], "id": "87189e6c-3954-4ad4-8930-b05151512bce" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Run the previous regression with the log of earnings_winsor\n", "lm(data=fake_data, log_earnings_winsor ~ as.factor(sex))" ], "id": "d64fc3e2-62ca-4969-a4fb-b8c9bf45eb39" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you think that in this case the outliers were having a significant\n", "impact before being winsorized?\n", "\n", "### 13.1.2 Trimming a Dependent Variable\n", "\n", "Trimming consists of replacing both values below the $a$th percentile\n", "and values above the $b$ percentile by a missing value. This is done to\n", "exclude these outliers from regression, since R automatically excluedes\n", "missing (`NA`) observations in the `lm` command.\n", "\n", "Below, we look at the commands for trimming a variable. Notice that the\n", "steps are quite similar to when we winsorized the same variable. Here,\n", "we are directly creating the log of trimmed earnings in one step. Don’t\n", "forget to create a new variable to avoid overwriting our original\n", "variable!" ], "id": "c2331a83-d04e-4c1c-a03c-b68742d13e46" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data <- fake_data %>%\n", " mutate(log_earnings_trimmed = ifelse(earnings<2831 , NA, ifelse( earnings > 607140 , NA, log(earnings)))) " ], "id": "40be6314-7114-4f3f-9705-bb993e29deb2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here is the result of the regression with the new dependent\n", "variable:" ], "id": "a896e0d0-670b-4af7-b672-5132a2c51cf1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lm(data=fake_data, log_earnings_trimmed ~ as.factor(sex))" ], "id": "3a88aa6c-3bcd-4931-92cd-e306fd5a30d2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 13.2 Multicollinearity\n", "\n", "If two variables are linear combinations of one another they are\n", "multicollinear. Ultimately, R does not allow us to include two variables\n", "in a regression that are perfect linear combinations of one another,\n", "such as a constant or a dummy variable for male and a dummy for female\n", "(since female = 1 - male). In all of the regressions above, we see that\n", "one of those variables was dropped from the regression “because of\n", "collinearity”." ], "id": "3971d4fc-421c-4128-ad9e-c37d7b31e8d7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data <- fake_data %>%\n", " mutate(male = case_when(sex == 'M' ~ 1, sex == 'F' ~ 0)) %>%\n", " mutate(female = case_when(sex == 'F' ~ 1, sex == 'M' ~ 0))" ], "id": "e15d6131-ea1b-478c-83d7-e9cb8d23a089" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lm(data=fake_data, log_earnings_trimmed ~ male + female)" ], "id": "ece69a5d-cfa7-4129-9c54-19f1ee70616a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is this a problem? Not really. Multicollinearity is a sign that a\n", "variable is not adding any new information. Notice that with the\n", "constant term and a male dummy we can know the mean earnings of females.\n", "In this case, the constant term is, by construction, the mean earnings\n", "of females, and the male dummy gives the earning premium paid to male\n", "workers.\n", "\n", "While there are some statistical tests for multicollinearity, nothing\n", "beats having the right intuition when running a regression. If there is\n", "an obvious case where two variables contain basically the same\n", "information, we’ll want to avoid including both in the analysis.\n", "\n", "For instance, we might have an age variable that includes both years and\n", "months (e.g. if a baby is 1 year and 1 month old, then this age variable\n", "would be coded as 1 + 1/12 = 1.083). If we include this variable in a\n", "regression that also includes an age variable for only years (e.g the\n", "baby’s age would be coded as 1) then we would have the problem of\n", "multicollinearity. Because they are not perfectly collinear, R might\n", "still produce some results, but the coefficients on these two variables\n", "would be biased.\n", "\n", "## 13.3 Heteroskedasticity\n", "\n", "When we run a linear regression, we essentially split the outcome into a\n", "(linear) part explained by observables ($x_i$) and an error term\n", "($e_i$):\n", "\n", "$$\n", "y_i = a + b x_i + e_i\n", "$$\n", "\n", "The standard errors in our coefficients depend on $e_i^2$ (as you might\n", "remember from your econometrics courses). Heteroskedasticity refers to\n", "the case where the variance of this projection error depends on the\n", "observables $x_i$. For instance, the variance of wages tends to be\n", "higher for people who are university educated (some of these people have\n", "very high wages) whereas it is small for people who are non-university\n", "educated (these people tend to be concentrated in lower paying jobs). R\n", "by default assumes that the variance does not depend on the observables,\n", "which is known as homoskedasticity. It is safe to say that this is an\n", "incredibly restrictive assumption.\n", "\n", "While there are tests for heteroskedasticity, the more empirical\n", "economists rely on including heteroskedastic consistent standard errors\n", "as a default in their regressions. The most standard way to do this is\n", "to use `feols`, another command similar to `lm()` that comes from the\n", "`fixest` package." ], "id": "b41b3ce8-08a6-4266-ab8b-988014f0f479" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#uncomment this line to install the package! install.packages('fixest')\n", "library(fixest)" ], "id": "95366676-c589-4ee7-913e-fc4a46e0ce79" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = feols(log_earnings_trimmed ~ as.factor(sex) , fake_data)" ], "id": "5ad69031-eacd-49cf-8f2f-99eb4c8600e8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "summary(model, vcov=\"HC1\")" ], "id": "2578d919-9dd7-44fc-8e8b-65fc71fb181c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Best practices are simply to always use robust standard errors in your\n", "own research project, since most standard errors will be\n", "heteroskedastic.\n", "\n", "## 13.4 Non-linearity\n", "\n", "Our regression analysis so far assumes that the relationship between our\n", "explained and explanatory variables is linear. If this is not the case,\n", "meaning the relationship is non-linear, then we will get inaccurate\n", "results from our analysis.\n", "\n", "Let’s consider an example. We know that earnings increases with age, but\n", "what if economic theory predicts that earnings increase by more for each\n", "year of age when workers are younger than when they are older? What we\n", "are asking here is whether earnings is increasing with age at a\n", "decreasing rate. In essence, we want to check whether there is a concave\n", "relation between age and earnings. We can think of several mechanisms\n", "for why this relationship might exist: for a young worker, as they age,\n", "they get higher wages through increased experience in the job; for an\n", "older worker, as they age, those wage increases will be smaller as there\n", "are smaller productity gains with each additional year working. In fact,\n", "if the productivity of workers decreaseas as they age, perhaps for\n", "reasons related to health, then it is possible to find a negative\n", "relationship between age and earning beyond a certain age – the\n", "relationship would be an inverted U-shape.\n", "\n", "We could check if this is the case in our model by including a new\n", "interaction term that is simply age interacted with itself, which is the\n", "equivalent of including age and age squared. We learned how to do this\n", "in [Module\n", "12](https://comet.arts.ubc.ca/docs/Research/econ490-r/12_Dummy.html).\n", "Let’s include this in the regression above." ], "id": "43e289b5-31ec-4a5e-a49a-1721fcefbcfa" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data <- fake_data %>% mutate(age2 = age^2) " ], "id": "97734e7e-ce2d-4fd3-be66-ea8911385732" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = lm(log_earnings_trimmed ~ age + age2, fake_data)" ], "id": "f17fcebf-bdbb-470c-9673-1be90d011212" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "summary(model, vcov=\"HC1\")" ], "id": "4fe02d90-ba4c-4f28-8695-309dd0a8525f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "There does seem to be some evidence in our regression results that this\n", "economic theory is correct, since the coefficient on the interaction\n", "term is both negative and statistically significant.\n", "\n", "How do we interpret these results? Let’s think about the equation we\n", "have just estimated: $$\n", "Earnings_i = \\beta_0 + \\beta_1 Age_i + \\beta_2 Age^2_i + \\varepsilon_i \n", "$$\n", "\n", "This means that earnings of an individual change in the following way\n", "with their age: $$\n", "\\frac{\\partial Earnings_i}{\\partial Age_i} = \\beta_1 + 2 \\beta_2 Age_i\n", "$$\n", "\n", "Due to the quadratic term, as age changes, the relationship between age\n", "and earnings changes as well.\n", "\n", "We have just estimated $\\beta_1$ to be positive and equal to 0.079, and\n", "$\\beta_2$ to be negative and equal to 0.001.\n", "\n", "This means that, as age increases, it’s correlation with earnings\n", "decrease: $$\n", "\\frac{\\partial Earnings_i}{\\partial Age_i} = 0.079 - 0.002 Age_i\n", "$$\n", "\n", "Since the marginal effect changes with the size of $Age_i$, providing\n", "one unique number for the marginal effect becomes difficult.\n", "\n", "The most frequently reported version of this effect is the “marginal\n", "effect at the means”: the marginal effect of age on earnings when age\n", "takes its average value. In our case, this will be equal to 0.079 minus\n", "0.002 times the average value of age." ], "id": "32d8e210-2b8f-4107-a46d-75b4468d8c75" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean(fake_data$age)" ], "id": "485c88f0-6cae-4bce-96ad-826e0e17a9c3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the average age of workers is 45, the marginal effect of age at\n", "the means is $$\n", "0.079 - 2 * 0.001 * 45 = -0.011\n", "$$\n", "\n", "What does that mean? It means that, for the average person, becoming one\n", "year older is associated with a 1% decrease in log earnings.\n", "\n", "Notice that this is the effect for the *average person*. Is the same\n", "true for young workers and older workers? To learn how to interpret this\n", "non-linearity in age, let’s see how the predicted earnings correlate\n", "with age.\n", "\n", "We can obtain the predicted earnings with the `predict` function and\n", "then use a scatterplot to eyeball its relationship with age. We covered\n", "how to create scatterplots in [Module\n", "8](https://comet.arts.ubc.ca/docs/Research/econ490-r/08_ggplot_graphs.html))\n", "and the `predict` function in [Module\n", "10](https://comet.arts.ubc.ca/docs/Research/econ490-r/10_Linear_Reg.html).\n", "\n", "Let’s see how to do this step by step.\n", "\n", "First, we store our regression in the object *fit*. Second, we use the\n", "function `model.frame` to keep only the observations and variables used\n", "in our regression. Then, we use `predict` to store in data frame *yhat*\n", "the predicted earnings obtained from our regression. Notice that\n", "`predict` creates a list, therefore we transform it into a data frame\n", "using the function `as.data.frame`. Finally, we merge the two data\n", "frames together using the function `cbind`." ], "id": "2a6a85e2-5660-4c67-b743-a919fb6e66ec" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run the regression with the quadratic term\n", "fit = lm(log_earnings_trimmed ~ age + age2, fake_data)\n", "\n", "# Predict earnigns and save them as yhat in the same data frame\n", "datareg <- model.frame(fit) # keep observations used in regression\n", "yhat <- as.data.frame(predict(fit)) # save predicted earnings as data frame\n", "datareg = cbind(datareg, yhat) # merge the two dataframes\n", "datareg <- datareg %>% rename(yhat = \"predict(fit)\") # rename" ], "id": "a6d0f56a-ab21-40f8-97c9-6763662dc299" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have all the information in one unique data frame called\n", "*datareg*, we can display a scatterplot with age on the x-axis and\n", "predicted log-earnings on the y-axis." ], "id": "1d845d81-db5c-4196-a359-294dc95f7e61" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create scatterplot\n", "ggplot(data = datareg, aes(x=age, y=yhat)) + geom_point()" ], "id": "b6b8d719-4077-4efd-94c8-d299e13a7a5f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The scatterplot shows an inverted-U relationship between age and the\n", "predicted log-earnings. This relationship implies that, when a worker is\n", "very young, aging is positively correlated with earnings. However, after\n", "a certain age, this correlation becomes negative and the worker gets\n", "lower earnings for each additional year of age. In fact, based on this\n", "graph, workers earnings start to decline just after the age of 50. Had\n", "we modelled this as a linear model, we would have missed this important\n", "piece of information!\n", "\n", "**Note:** If there is a theoretical reason for believing that\n", "non-linearity exists, R provides some tests for non-linearity. We can\n", "also create a scatter-plot to see if we observe a non-linear\n", "relationship in the data. We covered that approach in [Module\n", "8](https://comet.arts.ubc.ca/docs/Research/econ490-r/8_ggplot_graphs.html).\n", "\n", "## 13.5 Wrap Up\n", "\n", "It is important to always follow best practices for regression analysis.\n", "Nonetheless, checking and correcting for outliers, as well as addressing\n", "heteroskedasticity, multicollinearity and non-linearity can be more of\n", "an art than a science. If you need any guidance on whether or not you\n", "need to address these issues, please be certain to speak with your\n", "instructor, TA, or supervisor.\n", "\n", "## 13.6 Wrap-up Table\n", "\n", "| Command | Function |\n", "|----------------------------------|--------------------------------------|\n", "| `quantile(data$varname, probs = c(0.01, 0.99))` | It calculates the sample quantiles for the specified probabilities. |\n", "| `min(data$varname)` | It calculates the minimum value of a variable. |\n", "| `max(data$varname)` | It calculates the maximum value of a variable. |\n", "| `feols(depvar ~ indepvar, data)` | It performs a regression using robust standard errors. |\n", "| `model.frame(object)` | It saves the variables and observations specified by `object`. |\n", "\n", "## References\n", "\n", "[Syntax of quantile()\n", "function](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile)\n", "[Syntax of model.frame()\n", "function](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/model.frame)" ], "id": "cf04d35a-bb96-46ec-b195-aa1564c8fd36" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }