ECON 490: Instrumental Variable Analysis (17)

Prerequisites

Run OLS regressions.

Learning Outcomes

Understand what an instrumental variable is and the conditions it must satisfy to address the endogeneity problem.
Implement a Two Stage Least Squares (2SLS) regression-based approach using an instrument.
Describe the weak instrument problem.
Interpret the first stage test of whether or not the instrument is weak.

import stata_setup
# Forward slashes avoid Python reading "\P" in the Windows path as an escape sequence
stata_setup.config('C:/Program Files/Stata18/', 'se')

>>> import sys
>>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module - 1, Section 1.5.1: Setting Up PyStata
>>> from pystata import config
>>> config.init('se')

17.1 The linear IV model

Consider a case where we want to know the effect of education on earnings. We may want to estimate a model like the following:

\[ Y_{i} = \alpha + \beta X_i + \epsilon_i \]

where $Y_i$ is the earnings of individual $i$ and $X_i$ is the years of education of individual $i$.

A possible issue comes from omitted variable bias: it is possible that the decision to attend school is influenced by other individual characteristics that are also correlated with earnings. For example, think of individuals with high innate ability. They may want to enroll in school for longer and obtain higher-level degrees. Moreover, their employers may compensate them for their high ability, regardless of their years of schooling.

Instrumental variables can help us when there are hidden factors affecting both the treatment (in our case, years of education) and the outcome (in our case, earnings). The instrumental variables approach relies on finding a variable that affects the treatment and affects the outcome only through the treatment. In short, the instrument should satisfy two assumptions:

1. Relevance: the instrument should be correlated with the explanatory variable; in our case, it should be correlated with the years of education $X_i$.
2. Exclusion restriction: the instrument should be correlated with the dependent variable only through the explanatory variable; in our case, it should be correlated with $Y_i$ only through its correlation with $X_i$.

Let’s say we have found an instrumental variable $Z_i$ for the variable $X_i$. Then, an instrumental variable analysis amounts to estimating the following model:

\[ \begin{align} Y_i &= \alpha_1 + \beta X_i + u_i \quad \text{(Structural Equation)}\\ X_i &= \alpha_2 + \gamma Z_i + e_i \quad \text{(First Stage Equation)} \end{align} \]

where the two conditions we have seen above imply that:

1. $\gamma \neq 0$;
2. $Z_i$ is uncorrelated with $u_i$.
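To see why these two conditions are enough to recover $\beta$, here is a short derivation sketch. It is a standard textbook argument rather than something specific to this module's data, and it treats the First Stage Equation as a linear projection, so that $Z_i$ is uncorrelated with $e_i$ by construction. Take the covariance of both sides of the Structural Equation with $Z_i$:

\[ \begin{align} \text{Cov}(Z_i, Y_i) &= \beta \, \text{Cov}(Z_i, X_i) + \text{Cov}(Z_i, u_i) \\ &= \beta \, \text{Cov}(Z_i, X_i) \quad \text{(since } \text{Cov}(Z_i, u_i) = 0 \text{)} \end{align} \]

The condition $\gamma \neq 0$ guarantees that $\text{Cov}(Z_i, X_i) = \gamma \, \text{Var}(Z_i) \neq 0$, so we can divide by it:

\[ \beta = \frac{\text{Cov}(Z_i, Y_i)}{\text{Cov}(Z_i, X_i)} \]

With a single instrument, the IV estimator is the sample analogue of this ratio. The ratio also hints at what goes wrong when $\gamma$ is close to zero: the denominator is close to zero, and small sampling errors in it are blown up in the estimate of $\beta$.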

In practice, instrumental variable analysis is usually carried out with a Two-Stage Least Squares (2SLS) estimator. The two steps of 2SLS are:

1. Estimate the First Stage Equation by OLS and obtain the predicted value of $X_i$. In this way, we have effectively split $X_i$ into \[ X_i = \underbrace{\hat{X}_i}_{\text{exogenous part}} + \underbrace{\hat{e}_i}_{\text{endogenous part}} \] where $\hat{X}_i = \hat{\alpha}_2 + \hat{\gamma} Z_i$.

2. Plug $\hat{X}_i$ instead of $X_i$ into the Structural Equation and estimate it by OLS. We are then using only the “exogenous” part of $X_i$ to estimate $\beta$.
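The two steps can be sketched on simulated data. The block below is a minimal illustration in plain Python, not part of the course's Stata workflow. The data-generating process (a hidden `ability` term, a true $\beta$ of 2, a first-stage coefficient of 1) and the helper names `ols` and `simulate` are assumptions made up for this example.

```python
import random


def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def simulate(n=20000, beta=2.0, gamma=1.0, seed=490):
    """Simulate (z, x, y) where an unobserved 'ability' term drives both x and y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)  # hidden confounder, never used in estimation
        z_i = rng.gauss(0, 1)      # instrument: independent of ability
        x_i = 1.0 + gamma * z_i + ability + rng.gauss(0, 1)       # first stage
        y_i = 0.5 + beta * x_i + 3.0 * ability + rng.gauss(0, 1)  # structural equation
        z.append(z_i)
        x.append(x_i)
        y.append(y_i)
    return z, x, y


z, x, y = simulate()

# Naive OLS of y on x: biased upward because ability is omitted
_, beta_ols = ols(x, y)

# Step 1: regress x on z by OLS and keep the fitted values
alpha2_hat, gamma_hat = ols(z, x)
x_hat = [alpha2_hat + gamma_hat * z_i for z_i in z]

# Step 2: regress y on the fitted values from step 1
_, beta_2sls = ols(x_hat, y)

print(f"true beta = 2.0, OLS = {beta_ols:.3f}, 2SLS = {beta_2sls:.3f}")
```

In this simulated setting the OLS slope is pulled away from the true value by the omitted `ability` term, while the 2SLS slope should land close to 2, because the instrument is strong and independent of `ability` by construction.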

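Caution: We can run 2SLS by hand following the steps above, but for inference the standard errors must be based on the residuals of the Structural Equation computed with the actual $X_i$, that is $\hat{u}_i = Y_i - \hat{\alpha}_1 - \hat{\beta} X_i$, and not on the residuals of the second-step regression, which are computed with $\hat{X}_i$. The built-in Stata command ivregress and the user-written command ivreg2 both use the right residuals automatically.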
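The difference between the two sets of residuals can be made concrete with a small sketch. The block below again uses simulated data and plain Python; the data-generating process and the helper names `ols` and `slope_se` are assumptions for illustration only. It compares the slope's standard error computed from the second-step residuals with the one computed from the structural residuals, assuming homoskedastic errors and a simple degrees-of-freedom adjustment.

```python
import math
import random


def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def slope_se(resid, regressor):
    """Homoskedastic standard error of the slope, given residuals and the step-2 regressor."""
    n = len(resid)
    sigma2 = sum(r * r for r in resid) / (n - 2)
    mean_r = sum(regressor) / n
    s_rr = sum((v - mean_r) ** 2 for v in regressor)
    return math.sqrt(sigma2 / s_rr)


# Simulated data (illustrative assumption): ability confounds x and y, z is a valid instrument
rng = random.Random(17)
z, x, y = [], [], []
for _ in range(20000):
    ability = rng.gauss(0, 1)
    z_i = rng.gauss(0, 1)
    x_i = 1.0 + z_i + ability + rng.gauss(0, 1)
    y_i = 0.5 + 2.0 * x_i + 3.0 * ability + rng.gauss(0, 1)
    z.append(z_i)
    x.append(x_i)
    y.append(y_i)

# 2SLS by hand
a2, g = ols(z, x)
x_hat = [a2 + g * z_i for z_i in z]
a1, b = ols(x_hat, y)

# Residuals that the second-step OLS would report: built from x_hat
resid_naive = [y_i - a1 - b * xh for y_i, xh in zip(y, x_hat)]
# Structural residuals: same coefficients, but evaluated at the actual x
resid_correct = [y_i - a1 - b * x_i for y_i, x_i in zip(y, x)]

se_naive = slope_se(resid_naive, x_hat)
se_correct = slope_se(resid_correct, x_hat)

print(f"slope = {b:.3f}, naive SE = {se_naive:.4f}, corrected SE = {se_correct:.4f}")
```

The point estimate is the same in both cases; only the standard error changes. In this particular simulation the naive standard error comes out larger than the corrected one, but in other settings the direction can differ, which is why the second-step output should not be used for inference.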

Let’s see how to estimate this in Stata. Once again, we can use our fictional dataset simulating wages of workers in the years 1982-2012 in a fictional country.

%%stata

clear* 
use fake_data, clear
describe, de


. 
. clear* 

. use fake_data, clear

. describe, de

Contains data from fake_data.dta
 Observations:       138,138                  
    Variables:            11                  16 Jul 2023 17:25
        Width:            28                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
workerid        long    %12.0g                Worker Identifier
year            int     %8.0g                 Calendar Year
sex             str1    %9s                   Sex
age             byte    %9.0g                 Age (years)
start_year      int     %9.0g                 Initial year worker is observed
region          byte    %9.0g                 group(prov)
treated         byte    %8.0g                 Treatment Dummy
earnings        float   %9.0g                 Earnings
sample_weight   float   %9.0g                 
quarter_birth   float   %9.0g                 Quarter of birth
schooling       float   %9.0g                 Years of schooling
-------------------------------------------------------------------------------
Sorted by: workerid

.

In Stata, we can perform IV analysis with a 2SLS estimator by using one of the following two commands: ivregress or ivreg2. The command ivregress is built into Stata, while ivreg2 is a user-written command that must be installed once with ssc install ivreg2. They have a similar syntax:

* ivregress 2sls <Y> (<X> = <Z>)
* ivreg2 <Y> (<X> = <Z>)

where <Y>, <X>, and <Z> stand for the names of the corresponding Y, X, and Z variables of your model.

We now have to choose an instrumental variable that can work in our setting. A well-known instrument for years of schooling is studied by Angrist and Krueger (1991): they propose that $Z$ is the quarter of birth. The premise behind their IV is that children are required to enter school in the year they turn 6, not on their sixth birthday, so children born at different times of the year start school at different ages. Because compulsory schooling laws let students leave once they reach a fixed age, this creates a relationship between quarter of birth and schooling. At the same time, the time of the year one is born shouldn’t affect one’s earnings aside from its effect on schooling.

Let’s see how to estimate a simple IV in Stata using our data and each one of the commands ivregress and ivreg2.

%%stata

ivregress 2sls earnings (schooling = quarter_birth)


. 
. ivregress 2sls earnings (schooling = quarter_birth)

Instrumental variables 2SLS regression            Number of obs   =    138,138
                                                  Wald chi2(1)    =       0.03
                                                  Prob > chi2     =     0.8691
                                                  R-squared       =          .
                                                  Root MSE        =     1.5e+06

------------------------------------------------------------------------------
    earnings | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   schooling |   714972.8    4339032     0.16   0.869     -7789373     9219319
       _cons |  -1.09e+07   6.68e+07    -0.16   0.870    -1.42e+08    1.20e+08
------------------------------------------------------------------------------
Endogenous: schooling
Exogenous:  quarter_birth

.

%%stata

ivreg2 earnings (schooling = quarter_birth)


. 
. ivreg2 earnings (schooling = quarter_birth)

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =   138138
                                                      F(  1,138136) =     0.03
                                                      Prob > F      =   0.8691
Total (centered) SS     =  8.82816e+15                Centered R2   = -36.2867
Total (uncentered) SS   =  9.80603e+15                Uncentered R2 = -32.5684
Residual SS             =  3.29173e+17                Root MSE      =  1.5e+06

------------------------------------------------------------------------------
    earnings | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   schooling |   714972.8    4339032     0.16   0.869     -7789374     9219319
       _cons |  -1.09e+07   6.68e+07    -0.16   0.870    -1.42e+08    1.20e+08
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):           0.026
                                                   Chi-sq(1) P-val =    0.8719
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):                0.026
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         schooling
Excluded instruments: quarter_birth
------------------------------------------------------------------------------

.

Both Stata commands give us the standard output: coefficient estimates, standard errors, p-values, and 95% confidence intervals. From the regression output, the estimated effect of years of schooling on earnings is not statistically different from zero, and its confidence interval is extremely wide. However, before trusting these results we should check that the two IV assumptions are met in this case: relevance and exclusion restriction.

Notice that ivreg2 gives us more details about tests we can perform to assess whether our instrument is valid. We will talk more about these tests, especially the weak identification test, in the paragraphs below.

17.2 Weak instrument test

While we cannot really test the exclusion restriction, we can check whether our instrument is relevant. We do that by looking directly at the First Stage. In Stata, we only need to add the option first to either command to get explicit output for the First Stage.

%%stata

ivregress 2sls earnings (schooling = quarter_birth), first


. 
. ivregress 2sls earnings (schooling = quarter_birth), first

First-stage regressions
-----------------------

                                                       Number of obs = 138,138
                                                       F(1, 138136)  =    0.03
                                                       Prob > F      =  0.8719
                                                       R-squared     =  0.0000
                                                       Adj R-squared = -0.0000
                                                       Root MSE      =  2.2056

------------------------------------------------------------------------------
   schooling | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
quarter_bi~h |   .0009984   .0061896     0.16   0.872    -.0111332      .01313
       _cons |    15.3971   .0165699   929.22   0.000     15.36463    15.42958
------------------------------------------------------------------------------


Instrumental variables 2SLS regression            Number of obs   =    138,138
                                                  Wald chi2(1)    =       0.03
                                                  Prob > chi2     =     0.8691
                                                  R-squared       =          .
                                                  Root MSE        =     1.5e+06

------------------------------------------------------------------------------
    earnings | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   schooling |   714972.8    4339032     0.16   0.869     -7789373     9219319
       _cons |  -1.09e+07   6.68e+07    -0.16   0.870    -1.42e+08    1.20e+08
------------------------------------------------------------------------------
Endogenous: schooling
Exogenous:  quarter_birth

.

%%stata

ivreg2 earnings (schooling = quarter_birth), first


. 
. ivreg2 earnings (schooling = quarter_birth), first

First-stage regressions
-----------------------


First-stage regression of schooling:

Statistics consistent for homoskedasticity only
Number of obs =                 138138
------------------------------------------------------------------------------
   schooling | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
quarter_bi~h |   .0009984   .0061896     0.16   0.872    -.0111332      .01313
       _cons |    15.3971   .0165699   929.22   0.000     15.36463    15.42958
------------------------------------------------------------------------------
F test of excluded instruments:
  F(  1,138136) =     0.03
  Prob > F      =   0.8719
Sanderson-Windmeijer multivariate F test of excluded instruments:
  F(  1,138136) =     0.03
  Prob > F      =   0.8719



Summary results for first-stage regressions
-------------------------------------------

                                           (Underid)            (Weak id)
Variable     | F(  1,138136)  P-val | SW Chi-sq(  1) P-val | SW F(  1,138136)
schooling    |       0.03    0.8719 |        0.03   0.8719 |        0.03

Stock-Yogo weak ID F test critical values for single endogenous regressor:
                                   10% maximal IV size             16.38
                                   15% maximal IV size              8.96
                                   20% maximal IV size              6.66
                                   25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Sanderson-Windmeijer F statistic.

Underidentification test
Ho: matrix of reduced form coefficients has rank=K1-1 (underidentified)
Ha: matrix has rank=K1 (identified)
Anderson canon. corr. LM statistic       Chi-sq(1)=0.03     P-val=0.8719

Weak identification test
Ho: equation is weakly identified
Cragg-Donald Wald F statistic                                       0.03

Stock-Yogo weak ID test critical values for K1=1 and L1=1:
                                   10% maximal IV size             16.38
                                   15% maximal IV size              8.96
                                   20% maximal IV size              6.66
                                   25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.

Weak-instrument-robust inference
Tests of joint significance of endogenous regressors B1 in main equation
Ho: B1=0 and orthogonality conditions are valid
Anderson-Rubin Wald test           F(1,138136)=    1.01     P-val=0.3143
Anderson-Rubin Wald test           Chi-sq(1)=      1.01     P-val=0.3143
Stock-Wright LM S statistic        Chi-sq(1)=      1.01     P-val=0.3143

Number of observations               N  =     138138
Number of regressors                 K  =          2
Number of endogenous regressors      K1 =          1
Number of instruments                L  =          2
Number of excluded instruments       L1 =          1

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =   138138
                                                      F(  1,138136) =     0.03
                                                      Prob > F      =   0.8691
Total (centered) SS     =  8.82816e+15                Centered R2   = -36.2867
Total (uncentered) SS   =  9.80603e+15                Uncentered R2 = -32.5684
Residual SS             =  3.29173e+17                Root MSE      =  1.5e+06

------------------------------------------------------------------------------
    earnings | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   schooling |   714972.8    4339032     0.16   0.869     -7789374     9219319
       _cons |  -1.09e+07   6.68e+07    -0.16   0.870    -1.42e+08    1.20e+08
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):           0.026
                                                   Chi-sq(1) P-val =    0.8719
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):                0.026
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         schooling
Excluded instruments: quarter_birth
------------------------------------------------------------------------------

.

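From both methods, we can see that the instrumental variable we have chosen is not relevant for our explanatory variable $X$: the first-stage coefficient on quarter_birth is statistically indistinguishable from zero (p-value 0.872), so quarter_birth is essentially uncorrelated with schooling. Another indicator of lack of relevance is the F-statistic reported by Stata in the “Weak identification test” row: as a rule of thumb, whenever its value is below 10, the instrument is considered weak. Here it is 0.026, far below both that threshold and the Stock-Yogo critical values listed in the output.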
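As a quick check on where that F-statistic comes from, here is a worked calculation using the first-stage output above. It relies on the homoskedastic standard errors that both commands use by default: with a single instrument, the first-stage F-statistic is the square of the t-statistic on the instrument.

\[ F = t^2 = \left( \frac{\hat{\gamma}}{\text{se}(\hat{\gamma})} \right)^2 = \left( \frac{0.0009984}{0.0061896} \right)^2 \approx (0.1613)^2 \approx 0.026 \]

This matches the value reported in the ivreg2 output, and it is several hundred times smaller than the rule-of-thumb threshold of 10.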

Whenever the correlation between $X$ and $Z$ is very close to zero (as in our case), we say we have a weak instrument problem. In practice, this problem results in severe finite-sample bias and large variance in our estimates. Since our instrument is weak, we cannot trust the results we have obtained so far.
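A small Monte Carlo experiment can illustrate both symptoms. The sketch below is plain Python on simulated data, not the course dataset. The data-generating process, the first-stage coefficients (1.0 for a strong instrument, 0.02 for a weak one), and the helper names `iv_estimate`, `monte_carlo`, and `summarize` are all assumptions chosen for illustration. The true $\beta$ is 2 in both designs.

```python
import random
import statistics


def iv_estimate(z, x, y):
    """IV slope with one instrument: sample cov(z, y) divided by sample cov(z, x)."""
    n = len(z)
    mean_z, mean_x, mean_y = sum(z) / n, sum(x) / n, sum(y) / n
    c_zy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    c_zx = sum((zi - mean_z) * (xi - mean_x) for zi, xi in zip(z, x))
    return c_zy / c_zx


def monte_carlo(gamma, reps=300, n=500, beta=2.0, seed=17):
    """Draw `reps` samples of size `n` and return the IV estimate from each one."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        z, x, y = [], [], []
        for _ in range(n):
            ability = rng.gauss(0, 1)  # hidden confounder
            z_i = rng.gauss(0, 1)      # instrument, independent of ability
            x_i = gamma * z_i + ability + rng.gauss(0, 1)
            y_i = beta * x_i + 3.0 * ability + rng.gauss(0, 1)
            z.append(z_i)
            x.append(x_i)
            y.append(y_i)
        estimates.append(iv_estimate(z, x, y))
    return estimates


def summarize(estimates):
    """Return (median, interquartile range); both are robust to extreme draws."""
    q1, median, q3 = statistics.quantiles(estimates, n=4)
    return median, q3 - q1


strong_median, strong_iqr = summarize(monte_carlo(gamma=1.0))
weak_median, weak_iqr = summarize(monte_carlo(gamma=0.02))

print(f"strong instrument: median = {strong_median:.2f}, IQR = {strong_iqr:.2f}")
print(f"weak instrument:   median = {weak_median:.2f}, IQR = {weak_iqr:.2f}")
```

The median and interquartile range are used instead of the mean and variance because weak-instrument estimates have very heavy tails, so a handful of extreme draws would dominate the usual summaries. With the strong instrument the estimates should cluster tightly around 2; with the weak one they are both far more dispersed and centered away from the true value.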

17.3 Wrap Up

In this module we studied the linear IV model and how to estimate it by Two-Stage Least Squares with ivregress or ivreg2. We learned that we can overcome the endogeneity problem when we have access to a different type of variable: an instrumental variable. A good instrument must satisfy two important conditions:

1. It must be uncorrelated with the error term (also referred to as the exclusion restriction).
2. It must be correlated, after controlling for observables, with the variable of interest (there must be a first stage).

While condition 2 can be checked using the first-stage regression results, condition 1 cannot be tested directly. Therefore, any project that uses instrumental variables must include a discussion, using contextual knowledge, of why condition 1 may hold.

Finally, do not forget that for every endogenous variable in our regression, we require at least one instrument. For example, if we have a regression with 2 endogenous variables, we require at least 2 instrumental variables.


References

Instrumental-variables regression using Stata