{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 17 - Instrumental Variable Analysis\n", "\n", "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt \n", "2024-05-29\n", "\n", "## Prerequisites\n", "\n", "1. Run OLS regressions.\n", "\n", "## Learning Outcomes\n", "\n", "1. Understand what an instrumental variable is and the conditions that\n", " must be satisfied to address the endogeneity problem.\n", "2. Implement a Two-Stage Least Squares (2SLS) regression-based approach\n", " using an instrument.\n", "3. Describe the weak instrument problem.\n", "4. Interpret the first stage test of whether or not the instrument is\n", " weak.\n", "\n", "## 17.1 The Linear Instrumental Variable Model\n", "\n", "Consider a case where we want to know the effect of education on\n", "earnings. We may want to estimate a model like the following:\n", "\n", "$$\n", "Y_{i} = \\alpha + \\beta X_i + \\epsilon_i,\n", "$$\n", "\n", "where $Y_i$ is earnings of individual $i$ and $X_i$ is years of\n", "education of individual $i$.\n", "\n", "One concern with this model is omitted variable bias: the decision to\n", "attend school may be influenced by other individual characteristics\n", "that are also correlated with earnings. For\n", "example, think of individuals with high innate ability. They may want to\n", "enroll in school for longer and obtain higher-level degrees. Moreover,\n", "their employers may compensate them for their high ability, regardless\n", "of their years of schooling.\n", "\n", "Instrumental variables (IVs) can help us when there are hidden factors\n", "affecting both the treatment (in our case, years of education) and the\n", "outcome (in our case, earnings). The instrumental variable approach\n", "relies on finding a variable that affects the treatment and that\n", "affects the outcome solely through the treatment. In short, the\n", "instrument must satisfy two assumptions:\n", "\n", "1. 
*Relevance*: the instrument should be correlated with the\n", " explanatory variable; in our case, it should be correlated with the\n", " years of education $X_i$;\n", "2. *Exclusion restriction*: the instrument should be correlated with\n", " the dependent variable only through the explanatory variable; in our\n", " case, it should be correlated with $Y_i$ only through its\n", " correlation with $X_i$.\n", "\n", "Let’s say we have found an instrumental variable $Z_i$ for the variable\n", "$X_i$. Then, IV analysis amounts to estimating the following model:\n", "$$\n", "\\begin{align}\n", "Y_i &= \\alpha_1 + \\beta X_i + u_i \\quad \\text{(Structural Equation)}\\\\\n", "X_i &= \\alpha_2 + \\gamma Z_i + e_i \\quad \\text{(First Stage Equation)}\n", "\\end{align}\n", "$$\n", "\n", "where the two conditions we have seen above imply that:\n", "\n", "1. $\\gamma \\neq 0$;\n", "2. $Z_i$ is uncorrelated with $u_i$.\n", "\n", "In practice, IV analysis is usually carried out with a Two-Stage Least\n", "Squares (2SLS) estimator. The two steps of 2SLS are:\n", "\n", "1. Estimate the first stage equation by OLS and obtain the predicted\n", " value of $X_i$. In this way, we have effectively split $X_i$ into $$\n", " X_i = \\underbrace{\\hat{X}_i}_{\\text{exogenous part}} + \\underbrace{\\hat{e}_i}_{\\text{endogenous part}} \n", " $$\n", "\n", "where $\\hat{X}_i \\equiv \\hat{\\alpha}_2 + \\hat{\\gamma} Z_i$.\n", "\n", "2. Plug $\\hat{X}_i$ instead of $X_i$ into the structural equation and\n", " estimate via OLS. We are then using the “exogenous” part of $X_i$ to\n", " capture $\\beta$.\n", "\n", "**Warning**: We can run 2SLS by hand following the steps above, but the\n", "standard errors from the manual second stage are incorrect, because OLS\n", "computes its residuals using $\\hat{X}_i$. Valid inference requires the\n", "residuals of the structural equation evaluated at the actual regressor,\n", "$\\hat{u}_i = Y_i - \\hat{\\alpha}_1 - \\hat{\\beta} X_i$. The Stata commands\n", "`ivregress` and `ivreg2` automatically use the right residuals.\n", "\n", "Let’s see how to estimate this in Stata. 
Once again, we can use our\n", "fictional data set simulating wages of workers in the years 1982-2012 in\n", "a fictional country.\n", "\n", "``` {stata}\n", "clear* \n", "*cd \"\"\n", "use fake_data, clear\n", "describe, detail\n", "```\n", "\n", "In Stata, we can perform IV analysis with a 2SLS estimator by using one\n", "of the following two commands: `ivregress` or `ivreg2`. The first is\n", "built into Stata; the second is a user-written command that can be\n", "installed with `ssc install ivreg2`. They have a similar syntax:\n", "\n", "``` stata\n", "ivregress 2sls <dep_var> (<indep_var> = <instrument>)\n", "\n", "ivreg2 <dep_var> (<indep_var> = <instrument>)\n", "```\n", "\n", "where instead of `<dep_var>`, `<indep_var>`, and `<instrument>`, we write\n", "the names of the corresponding dependent, independent, and instrument\n", "variables of our model.\n", "\n", "We now have to choose an IV that can work in our setting. A well-known\n", "example of an instrument for years of schooling is studied by Angrist\n", "and Krueger (1991): they propose using $Z$, the quarter of birth. The\n", "premise behind their IV is that students are required to enter school in\n", "the *year they turn 6* but not necessarily when they are *already* 6\n", "years old, creating a relationship between quarter of birth and\n", "schooling. At the same time, the time of the year one is born shouldn’t\n", "affect one’s earnings aside from its effect on schooling.\n", "\n", "Let’s see how to estimate a simple IV in Stata using our data and each\n", "one of the commands `ivregress` and `ivreg2`.\n", "\n", "``` {stata}\n", "ivregress 2sls earnings (schooling = quarter_birth)\n", "```\n", "\n", "``` {stata}\n", "ivreg2 earnings (schooling = quarter_birth)\n", "```\n", "\n", "Both Stata commands give us a standard output: the values of the\n", "coefficients, standard errors, p-values, and 95% confidence intervals.\n", "From the regression output, years of schooling does not seem to have any\n", "effect on earnings. 
However, before trusting these results, we should\n", "check that the two IV assumptions are met in this case.\n", "\n", "Notice that `ivreg2` gives us more details about tests we can perform to\n", "assess whether our instrument is valid. We will talk more about these\n", "tests, especially the weak identification test, in the next section.\n", "\n", "## 17.2 Weak Instrument Test\n", "\n", "While we cannot really test for the exclusion restriction, we can check\n", "whether our instrument is relevant. We do that by looking directly at\n", "the coefficient on the instrument in the first stage.\n", "\n", "In Stata, we only need to add the option `first` to get an explicit\n", "output for the first stage.\n", "\n", "``` {stata}\n", "ivregress 2sls earnings (schooling = quarter_birth), first\n", "```\n", "\n", "``` {stata}\n", "ivreg2 earnings (schooling = quarter_birth), first\n", "```\n", "\n", "In both outputs, we can see that the IV we have chosen is not relevant\n", "for our explanatory variable $X$: *quarter_birth* is not correlated with\n", "*schooling*. Another indicator of the lack of relevance is given by the\n", "F-statistic reported by `ivreg2` in the “Weak identification test” row:\n", "as a rule of thumb, a first-stage F-statistic below 10 signals a weak\n", "instrument.\n", "\n", "Whenever the correlation between $X$ and $Z$ is very close to zero (as\n", "in our case), we say we have a **weak instrument** problem. In practice,\n", "this problem will result in severe finite-sample bias and large variance\n", "in our estimates. Since our instrument is weak, we cannot trust the\n", "results we have obtained.\n", "\n", "## 17.3 Wrap Up\n", "\n", "In this module, we studied the linear IV model and how to estimate it\n", "by 2SLS with `ivregress` or `ivreg2`. We learned that we\n", "can overcome the endogeneity problem when we have access to a different\n", "type of variable: an instrumental variable. 
A good instrument must\n", "satisfy two important conditions:\n", "\n", "1. It must be uncorrelated with the error term (also referred to as the\n", " exclusion restriction).\n", "2. It must be correlated, after controlling for observables, with the\n", " variable of interest (there must be a first stage).\n", "\n", "While the second condition can be checked using the regression results\n", "of the first stage, the first condition is inherently not testable.\n", "Therefore, any project that uses IVs must include a discussion, using\n", "contextual knowledge, of why the first condition may hold.\n", "\n", "Finally, do not forget that for every endogenous variable in our\n", "regression, we require at least one instrument. For example, if we have\n", "a regression with two endogenous variables, we require at least two IVs!\n", "\n", "## 17.4 Wrap-up Table\n", "\n", "| Command | Function |\n", "|--------------------------------------|----------------------------------|\n", "| `ivregress 2sls` | It performs Instrumental Variable analysis using a Two-Stage Least Squares estimator. |\n", "| `ivreg2` | It performs Instrumental Variable analysis using a Two-Stage Least Squares estimator by default. |\n", "| `, first` | This option shows the results for the First Stage regression in the IV analysis. |\n", "\n", "## References\n", "\n", "[Instrumental-variables regression using\n", "Stata](https://www.youtube.com/watch?v=lbnswRJ1qV0)" ], "id": "1b3d5e11-97c2-4bd6-be28-cb84ccca243b" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3 (ipykernel)", "language": "python", "path": "/usr/local/share/jupyter/kernels/python3" }, "language_info": { "name": "python", "codemirror_mode": { "name": "ipython", "version": "3" }, "file_extension": ".py", "mimetype": "text/x-python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } } }