{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 3.2.1 - Advanced - Instrumental Variables\n",
        "\n",
        "COMET Team <br>  \n",
        "2024-06-10\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "-   An intermediate understanding of Jupyter and R\n",
        "-   A theoretical understanding of linear regressions\n",
        "\n",
        "## Learning Outcomes\n",
        "\n",
        "After completing this notebook, you will be able to:\n",
        "\n",
        "-   Understand how instrumental variables solve omitted variable bias\n",
        "-   Choose appropriate instrumental variables\n",
        "-   Estimate causal effects with 2SLS estimators\n",
        "\n",
        "## References\n",
        "\n",
        "-   Baicker, K., Taubman, S. L., Allen, H. L., Bernstein, M., Gruber, J.\n",
        "    H., Newhouse, J. P., Schneider, E. C., Wright, B. J., Zaslavsky, A.\n",
        "    M., & Finkelstein, A. N. (2013). The Oregon Experiment: Effects of\n",
        "    Medicaid on clinical outcomes. New England Journal of Medicine,\n",
        "    368(18), 1713–1722.\n",
        "-   Card, D. (1993). Using geographic variation in college proximity to\n",
        "    estimate the return to schooling. National Bureau of Economic\n",
        "    Research.\n",
        "-   Hanck, C., Arnold, M., Gerber, A., & Schmelzer, M. (n.d.).\n",
        "    Introduction to econometrics with R \\[E-book\\]. University of\n",
        "    Duisburg-Essen.\n",
        "-   Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics\n",
        "    with R. New York: Springer-Verlag.\n",
        "\n",
        "## Outline\n",
        "\n",
        "The notebooks *Instrumental Variables 1 and 2* are structured as\n",
        "follows:\n",
        "\n",
        "-   Context: Oregon Health Insurance Experiment\n",
        "    -   Introducing instrumental variables in the context of partial\n",
        "        random assignment\n",
        "    -   Laying out the theory of instrumental variables and their\n",
        "        estimators\n",
        "-   Example 2: College Proximity and Returns to Education\n",
        "    -   Solving OVB with data from Card (1993)\n",
        "    -   Applying IV estimators on R using the `AER` package\n",
        "    -   Discussing the differences between OLS and IV estimates\n",
        "-   Example 3: Tariffs on Animal and Vegetable Oils\n",
        "    -   Understanding the first application of instrumental variables in\n",
        "        the context of endogeneity\n",
        "    -   Modeling solutions to endogeneity in demand and supply\n",
        "        relationships\n",
        "-   Example 4: Pigouvian Taxes on Cigarettes\n",
        "    -   Extending IV regressions to multiple instruments\n",
        "    -   Introducing statistical tests to quantify the validity of\n",
        "        instruments\n",
        "\n",
        "## Context: Oregon Health Insurance Experiment\n",
        "\n",
        "Universal healthcare is one of the most widely debated topics in\n",
        "economic policy. Since 1965, the federal government of the United States\n",
        "provides free healthcare to American citizens through two different\n",
        "health insurance programs: Medicare and Medicaid. These programs cover\n",
        "medical costs of at-risk and some low-income Americans. In 2010, the\n",
        "federal government approved the Affordable Care Act, which let US states\n",
        "extend Medicaid to all low-income adults within their jurisdictions.\n",
        "\n",
        "The key question the states faced was: **should we extend health\n",
        "insurance to all low-income adults?**\n",
        "\n",
        "This decision required the states to assess the costs and benefits of\n",
        "extending health insurance to the uninsured. Crucially, they need to\n",
        "know how much (or whether at all) health insurance improves health\n",
        "outcomes of individuals.\n",
        "\n",
        "A first approach might be to estimate the effect of health insurance\n",
        "using a regression. For instance, we could regress health outcomes on\n",
        "insurance status. Unfortunately, this model would have *omitted\n",
        "variables bias (OVB)*. It is likely that the older and lower-income\n",
        "population currently covered by Medicare and Medicaid is less healthy\n",
        "than the average American. This would result in a misleading estimate.\n",
        "We need another approach.\n",
        "\n",
        "Ideally, we would want to randomly select people from the uninsured\n",
        "population, randomly assign health insurance to them, then compare the\n",
        "health outcomes of the two groups. This was what the State of Oregon did\n",
        "in 2008.\n",
        "\n",
        "### The experiment: solving the problem of partial random assignment\n",
        "\n",
        "From 2008 to 2011, the state of Oregon randomly assigned Medicaid\n",
        "coverage to 30,000 uninsured citizens through a lottery system. The\n",
        "lottery winners were offered full coverage of the Oregon Health Plan\n",
        "(OHP) Standard Medicaid, once they submitted some documentation. The\n",
        "state recorded the health outcomes of both the individuals that won and\n",
        "lost the lottery over the course of several years.\n",
        "\n",
        "Although this is close to our ideal, it’s not a perfect randomly\n",
        "controlled trial (RCT). Lottery winners were only given free insurance\n",
        "if they submitted their documents and met some eligibility criteria[1].\n",
        "Many lottery winners either did not submit the required documents or\n",
        "turned out to be ineligible to the program. In the end, only about 25%\n",
        "of the lottery winners eventually enrolled in OHP Standard.\n",
        "\n",
        "While the *possibility to apply for insurance* was randomly assigned by\n",
        "the lottery system, *insurance status* was not randomly assigned. This\n",
        "means that if health outcomes were related to the reason why they didn’t\n",
        "fill out their forms, or were ineligible, we could still have OVB.\n",
        "\n",
        "Fortunately, there are ways around this “partial random assignment”. We\n",
        "can (1) isolate the variation in insurance status created by the lottery\n",
        "and (2) calculate the effect of insurance status on health outcomes just\n",
        "for this isolated variation. This approach is called **instrumental\n",
        "variables**.\n",
        "\n",
        "### The theory\n",
        "\n",
        "We are interested in the effect of *insurance status* on *health\n",
        "outcomes.* We know:\n",
        "\n",
        "-   Lottery results are randomly assigned\n",
        "-   Winning the lottery increases the probability of insurance coverage\n",
        "-   Insurance coverage affects health outcomes\n",
        "\n",
        "We can use these facts to isolate the effect of health insurance. The\n",
        "key insight is that winning the lottery can only affect your health\n",
        "through its impact on your health insurance. This means that the\n",
        "following relationship must be true:\n",
        "\n",
        "$$\n",
        "\\text{Effect of lottery on insurance} \\cdot \\text{Effect of insurance on health} = \\text{Effect of lottery on health}\n",
        "$$\n",
        "\n",
        "> **Example**: Imagine that all of the lottery winners were\n",
        "> automatically enrolled in the insurance program. That would mean that\n",
 the probability of insurance">
        "> the probability of being insured is 100% for lottery winners. In\n",
        "> this scenario, insurance status is determined *solely* by the lottery,\n",
        "> so we have a traditional RCT. A comparison of health outcomes between\n",
        "> those who won and those who lost the lottery (or those who have and\n",
        "> those who don’t have insurance) would be unbiased. This can be seen in\n",
        "> the equation above: if the effect of lottery on insurance $=$ 1, then\n",
        "> effect of lottery on health outcomes $=$ effect of insurance on health\n",
        "> outcomes.\n",
        "\n",
        "Since lottery winners are not automatically enrolled in the insurance\n",
        "program, winning the lottery increases the probability of insurance\n",
        "coverage by less than 100%. This means that insurance status is\n",
        "determined by both the lottery and other external variables (for\n",
        "example, maybe those who don’t submit the application on time care less\n",
        "about their health). In this case, a simple comparison of health\n",
        "outcomes between individuals would yield a biased estimate.\n",
        "\n",
        "To get the true effect of insurance on health, we can rearrange the\n",
        "relationship.\n",
        "\n",
        "$$\n",
        "\\text{Effect of insurance on health} = \\frac{\\text{Effect of lottery on health}}{\\text{Effect of lottery on insurance}}\n",
        "$$\n",
        "\n",
        "Let’s be a little more rigorous about what we mean by “effect”. Since\n",
        "lottery is a binary variable (you either win or lose the lottery), we\n",
        "can rewrite the “effects” as the difference in averages for the lottery\n",
        "dummy turned on and off. We get a ratio of differences of conditional\n",
        "averages: the difference in health outcomes conditional on lottery\n",
        "result divided by the difference in insurance status conditional on\n",
        "lottery result.\n",
        "\n",
        "$$\n",
        "    \\text{Effect of insurance on health} = \\frac{\\text{Average health of winners} - \\text{Average health of losers}}{\\text{Average insurance of winners} - \\text{Average insurance of losers}}\n",
        "$$\n",
        "\n",
        "That seems like something we can calculate. This difference in averages\n",
        "should adjust our estimates to reflect the “partial random assignment”\n",
        "situation. The effect of insurance on health equals the effect of the\n",
        "lottery results on health *adjusted for the probability that lottery\n",
        "winners enroll in the insurance program*.\n",
        "\n",
        "> **Think deeper**: Why can we interpret the effect of lottery on\n",
        "> insurance as the probability that lottery winners enroll in the\n",
        "> insurance program?\n",
        "\n",
        "### Health insurance example with simulated data\n",
        "\n",
        "Let’s calculate the effect of insurance on health with a very simple\n",
        "simulated dataset.\n",
        "\n",
        "`simulated_health_data` has data on 1,000 uninsured individuals who\n",
        "participated in a lottery system for public health insurance coverage in\n",
        "a fictitious state. The variables coded are:\n",
        "\n",
        "-   `lot_win` dummy == 1 for lottery winners\n",
        "-   `insurance_status` dummy == 1 for enrolled in the insurance program\n",
        "-   `health_outcome` for an aggregate measure of health outcomes after\n",
        "    12 months of insurance enrollment (the higher, the better!)\n",
        "\n",
        "[1] To be eligible for OHP Standard, individuals must be 19-64 years\n",
        "old, an Oregon resident who is a US citizen or legal immigrant,\n",
        "ineligible for other public health insurance, and uninsured for the past\n",
        "six months. Individuals must also earn less than the federal poverty\n",
        "level, and have assets worth no more than US\\$2,000."
      ],
      "id": "8a52f817-e069-46c5-9eec-f0ff24345fdd"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# load packages needed for the analysis\n",
        "library(tidyverse)\n",
        "library(AER)\n",
        "\n",
        "# load datasets\n",
        "source('advanced_instrumental_variables1_data1.r')\n",
        "source('advanced_instrumental_variables1_data2.r')"
      ],
      "id": "605ad592-91cf-4c5c-b192-320c6821f028"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# set seed to ensure reproducibility\n",
        "set.seed(123)\n",
        "\n",
        "# inspect the data\n",
        "head(simulated_health_data)"
      ],
      "id": "7486baf6-3d7b-4779-86e8-e6c0cb7b124f"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Similar to the Oregon experiment, lottery winners are only enrolled in\n",
        "the insurance program if they submit a set of required documents. We can\n",
        "see that winning the lottery does not guarantee enrollment since there\n",
        "are individuals for whom `lot_win == 1` and `insurance_status == 0`.\n",
        "\n",
        "Let’s find the share of individuals who won the lottery but did not\n",
        "enroll in the insurance program."
      ],
      "id": "0ecc9901-6714-4544-85e4-67290f14b196"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "share_enrolled <- simulated_health_data %>%\n",
        "                  filter(lot_win == 1) %>%    # filter for lottery winners\n",
        "                  summarize(share_enrolled = sum(insurance_status)/sum(lot_win))    # find % of winners who enrolled\n",
        "print(as.double(share_enrolled))"
      ],
      "id": "49d2c455-102c-48e9-ba2f-d44c7e543d28"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Only 65% of those who win the lottery actually enroll in the insurance\n",
        "program. This adds bias to the randomization process and deems a simple\n",
        "difference in means (and consequently, an OLS estimate) an inappropriate\n",
        "estimate of the causal effect.\n",
        "\n",
        "Let’s (1) calculate the OLS estimate as if this was a traditional RCT\n",
        "(2) calculate our adjusted estimate of the causal effect. Since we’re\n",
        "working with simulated data, we can compare our estimates to the true\n",
        "underlying relationships.\n",
        "\n",
        "#### Calculating the OLS estimate\n",
        "\n",
        "Let’s use the `lm()` function to calculate the OLS estimate as if this\n",
        "was a traditional RCT. We log-transform health outcomes to interpret the\n",
        "coefficient in percentage terms."
      ],
      "id": "b850a443-f47a-4c62-8de6-8a7e225d8d90"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#  run linear regression\n",
        "OLS_estimate <- lm(log(health_outcome) ~ insurance_status, data = simulated_health_data)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(OLS_estimate, vcov=vcovHC)"
      ],
      "id": "27445801-37e2-4400-9a97-5d689ac08e26"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The OLS estimate of the effect of insurance on health outcomes is very\n",
        "large: insured individuals have 21.5% better health outcomes on average.\n",
        "This estimate supports the idea that there are large benefits in\n",
        "extending health insurance coverage to the population that is currently\n",
        "uninsured.\n",
        "\n",
        "#### Calculating our adjusted estimate\n",
        "\n",
        "Let’s calculate the adjusted estimate that we derived in the previous\n",
        "section.\n",
        "\n",
        "First, we calculate the average values of insurance status and health\n",
        "outcomes conditional on lottery result. We store those values in a data\n",
        "frame called `conditional_means`."
      ],
      "id": "3d155cb8-1f52-403c-a1a8-d1e11cc93a60"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "conditional_means <- simulated_health_data %>%\n",
        "                     group_by(lot_win) %>%    # group data based on lottery results\n",
        "                     summarize(avg_insurance_status = mean(insurance_status),    # calculate sample averages of status and outcomes\n",
        "                     avg_health_outcome = mean(log(health_outcome)))            \n",
        "conditional_means"
      ],
      "id": "12e165ab-17b7-4215-ac86-fcc57490db86"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, we divide the difference in conditional means of health outcomes by\n",
        "the difference in conditional means of insurance status."
      ],
      "id": "14c1cad8-fe22-4795-8673-108d8770c320"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# calculate difference in conditional means of health outcomes\n",
        "effect_lot_health <- conditional_means[2,3] - conditional_means[1,3]\n",
        "\n",
        "# calculate difference in conditional means of insurance status\n",
        "effect_lot_insurance <- conditional_means[2,2] - conditional_means[1,2]\n",
        "\n",
        "# calculate adjusted estimate\n",
        "adjusted_estimate <- as.double(effect_lot_health/effect_lot_insurance)\n",
        "print(adjusted_estimate)"
      ],
      "id": "b6b4d5b2-d406-49fc-a2af-1f03dd75d963"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The estimated causal effect calculated with our adjusted estimate is\n",
        "approximately 5.7% - just a fraction of the OLS estimate.\n",
        "\n",
        "This stark difference suggests that there is OVB in our OLS estimate.\n",
        "Before we look at what’s happening under the hood, let’s formalize what\n",
        "we did in this example.\n",
        "\n",
        "### Formalizing instrumental variables\n",
        "\n",
        "We use **instrumental variables** (IVs) to isolate causal effects from\n",
        "models that might be plagued with omitted variable bias or endogeneity.\n",
        "IVs allow us to make causal inferences with observational data when OLS\n",
        "estimators are biased.\n",
        "\n",
        "We were interested in estimating $\\beta_1$ in the model below:\n",
        "\n",
        "$$\n",
        "Health_i = \\beta_0 + \\beta_1 Insurance_{i} + \\epsilon_i\n",
        "$$\n",
        "\n",
        "We say that a variable $Z$ can be used as an instrumental variable for\n",
        "$Insurance$ only if it satisfies all of the following conditions:\n",
        "\n",
        "-   $Z$ is randomly assigned (or as good as randomly assigned)\n",
        "-   $Z$ has a causal effect on $Insurance$\n",
        "-   $Z$ affects the outcome variable $Health$ exclusively through\n",
        "    $Insurance$ (that is, $Z$ does not have a direct effect on $Health$)\n",
        "\n",
        "It should be clear that lottery results can be used as an instrument\n",
        "since it is a variable that satisfies the three conditions: (1) is\n",
        "randomly assigned (2) has a causal effect on insurance status (3) only\n",
        "affects health outcomes through insurance status.\n",
        "\n",
        "### Formalizing the Wald estimator\n",
        "\n",
        "The **Wald estimator** uses instrumental variables to compute the causal\n",
        "effect of the variable of interest on the outcome. It does so through\n",
        "the relationship which we have derived in the context of the Oregon\n",
        "Health Insurance Experiment.\n",
        "\n",
        "As long as the three IV assumptions are met, the effect of the\n",
        "instrument on the variable of interest times the effect of the variable\n",
        "of interest on the outcome equals the effect of the instrument on the\n",
        "outcome. For an instrument $Z$, a treatment $D$, and an outcome $Y$:\n",
        "\n",
        "$$\n",
        "\\text{Effect of } Z \\text{ on } D \\cdot \\text{Effect of } D \\text{ on } Y = \\text{Effect of } Z \\text{ on } Y\n",
        "$$\n",
        "\n",
        "Rearranging the equation gets us to the effect we’re interested in\n",
        "calculating:\n",
        "\n",
        "$$\n",
        "\\text{Effect of } D \\text{ on } Y = \\frac{\\text{Effect of } Z \\text{ on } Y}{\\text{Effect of } Z \\text{ on } D}\n",
        "$$\n",
        "\n",
        "When $Z$ is a binary variable, our relationship of interest can be\n",
        "written as a differences in conditional means:\n",
        "\n",
        "$$\n",
        "    \\text{Effect of } D \\text{ on } Y = \\frac{\\mathbb{E}[Y_{i} \\mid Z_{i} = 1] - \\mathbb{E}[Y_{i} \\mid Z_{i} = 0]}{\\mathbb{E}[D_{i} \\mid Z_{i} = 1] - \\mathbb{E}[D_{i} \\mid Z_{i} = 0]}\n",
        "$$\n",
        "\n",
        "This is exactly what we did in our example with simulated data. Let’s\n",
        "turn back to our analysis and compare our estimates to the true\n",
        "underlying relationships.\n",
        "\n",
        "### Back to the insurance example with simulated data\n",
        "\n",
        "Let’s compare our OLS estimate (21.5%) and Wald estimate (5.7%) to the\n",
        "underlying relationships of the data.\n",
        "\n",
        "The following dataset `simulated_health_data_extended` includes the\n",
        "variable `income`, coded in thousands of dollars per year."
      ],
      "id": "e9bd29ec-3b9a-4cb2-9c2e-86118402f21e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# inspect the dataset\n",
        "head(simulated_health_data_extended)"
      ],
      "id": "e8baccc8-48dc-4c5f-ae73-384dd1677ccd"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this (very simplified) simulated world, `income` was the only source\n",
        "of bias in our original model.\n",
        "\n",
        "If we had income data from the start, we could have solved the OVB by\n",
        "simply controlling for income on a regression of health outcomes on\n",
        "insurance status. Let’s do that now."
      ],
      "id": "df3d1786-36ac-40d3-983d-79ef09822afc"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# run linear regression controlling for income\n",
        "extended_model <- lm(log(health_outcome) ~ insurance_status + log(income), data = simulated_health_data_extended)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(extended_model, vcov=vcovHC)"
      ],
      "id": "7eb689e3-021a-49b3-8dba-05ac3c353111"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The extended model shows that, controlling for income, the causal effect\n",
        "of insurance on outcomes is 5.8% - very similar to our Wald estimate.\n",
        "Income is positively related to both health outcomes and insurance\n",
        "status, and is the main determinant of health outcomes in our fictitious\n",
        "state.\n",
        "\n",
        "The true causal effect of insurance on outcomes is actually 5%[1]. The\n",
        "difference between the Wald estimate, the extended model, and the true\n",
        "causal effect can be attributed to sampling variance.\n",
        "\n",
        "This example shows that the Wald estimator can be used to solve OVB when\n",
        "we do not have data to control for the omitted variable in our model.\n",
        "That was only possible because we had an instrument that was (1)\n",
        "randomly assigned (2) causally related to the variable of interest (3)\n",
        "only affected the outcome through the variable of interest.\n",
        "\n",
        "#### From the Wald estimator to the 2SLS estimator\n",
        "\n",
        "The Wald estimator is useful to build the intuition around instrumental\n",
        "variables. However, researchers rarely use it in their research.\n",
        "Researchers typically use a much more flexible estimator, the **2SLS\n",
        "estimator**.\n",
        "\n",
        "The 2SLS estimator, which we will denote $\\beta^{TSLS}_{1}$, is\n",
        "equivalent to the Wald estimator. For an instrument $Z$, a treatment\n",
        "$D$, and an outcome $Y$:\n",
        "\n",
        "$$\n",
        "    \\text{Effect of } D \\text{ on } Y = \\frac{\\mathbb{E}[Y_{i} \\mid Z_{i} = 1] - \\mathbb{E}[Y_{i} \\mid Z_{i} = 0]}{\\mathbb{E}[D_{i} \\mid Z_{i} = 1] - \\mathbb{E}[D_{i} \\mid Z_{i} = 0]} = \\beta^{TSLS}_{1}\n",
        "$$\n",
        "\n",
        "To calculate $\\beta^{TSLS}_{1}$, we have to run 2 regressions. The first\n",
        "regression is a regression of the treatment on the instrument, called\n",
        "the **first stage regression**. The second regression is a regression of\n",
        "the outcome on the fitted values of the first stage regression, called\n",
        "the **second stage regression**. Follow the step-by-step below.\n",
        "\n",
        "1.  Run the first stage regression:\n",
        "\n",
        "$$\n",
        "    D_{i} = \\beta_{0} + \\beta_{1}Z_{i} + v_{i}\n",
        "$$\n",
        "\n",
        "1.  Store the fitted values from the first stage regression\n",
        "    $\\widehat{D_{i}}$:\n",
        "\n",
        "$$\n",
        "    \\widehat{D_{i}} = b_{0} + b_{1}Z_{i}\n",
        "$$\n",
        "\n",
        "where $b_{0}$, $b_{1}$ are the estimated coefficients of the first\n",
        "stage.\n",
        "\n",
        "1.  Run the second stage regression:\n",
        "\n",
        "$$\n",
        "    Y_{i} = \\beta^{TSLS}_{0} + \\beta^{TSLS}_{1}\\widehat{D_{i}} + u_{i}\n",
        "$$\n",
        "\n",
        "The coefficient on our second stage regression $\\beta^{TSLS}_{1}$ is the\n",
        "effect of interest: the causal effect of $D$ on $Y$.\n",
        "\n",
        "#### Calculating the 2SLS estimate\n",
        "\n",
        "Let’s calculate the 2SLS estimate of the effect of insurance on health\n",
        "with our simulated health insurance data.\n",
        "\n",
        "1.  Run the first stage:\n",
        "\n",
        "[1] We know because the data is simulated."
      ],
      "id": "1e53ca96-5daf-49d4-aabd-06849862aa73"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# run the first stage regression\n",
        "health_st1 <- lm(insurance_status ~ lot_win, data = simulated_health_data)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(health_st1, vcov=vcovHC)"
      ],
      "id": "16547cab-5a29-4c46-80d4-8672604c0c59"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Similar to our Wald estimate, the first stage indicates that winning the\n",
        "lottery increases the probability of insurance coverage by 65%. The\n",
        "effect is significant at the 0.1% significance level.\n",
        "\n",
        "> **Think deeper**: what would happen if the coefficient on `lot_win`\n",
        "> was not significant?\n",
        "\n",
        "1.  Store the fitted values of the first stage:"
      ],
      "id": "cd5f8275-349e-469e-9c54-9fb7cc3f8ce9"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# store fitted values as a new column named `insurance_status_hat`\n",
        "simulated_health_data$insurance_status_hat <- health_st1$fitted.values\n",
        "\n",
        "# look at a subset of the updated dataset\n",
        "head(simulated_health_data)"
      ],
      "id": "03141fbb-7e0f-4b35-8731-23afa18191e4"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "1.  Run the second stage:"
      ],
      "id": "ef7ca46c-2132-4b8f-a67f-5da7d3596708"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# run the second stage regression\n",
        "health_st2 <- lm(log(health_outcome) ~ insurance_status_hat, data = simulated_health_data)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(health_st2, vcov=vcovHC)"
      ],
      "id": "97691175-04e7-4500-b145-c93c17b51353"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Our 2SLS estimate is 5.7% - exactly equal to our Wald estimate. Let’s\n",
        "take a moment to understand why this works.\n",
        "\n",
        "#### Making sense of the 2SLS estimate\n",
        "\n",
        "Remember that the problem with just running a simple OLS when there is\n",
        "OVB is that the *variable of interest is correlated to the error term*\n",
        "of the model: $Cov(Insurance_{i}, \\epsilon_{i})\\neq 0$.\n",
        "\n",
        "When we run the first stage, we effectively decompose the treatment into\n",
        "two parts: the error term and the variation explained by the instrument\n",
        "(the fitted values). Since the instrument is randomly assigned (or as\n",
        "good as randomly assigned) and the fitted values are driven solely by\n",
        "the instrument, *the fitted values are necessarily not correlated to the\n",
        "error term*: $Cov(\\widehat{Insurance_{i}}, \\epsilon_{i}) = 0$.\n",
        "\n",
        "Problem solved. We throw away the bad variation of the treatment (the\n",
        "error term of the first stage) and run a regression of the outcome on\n",
        "the good variation of the treatment (the fitted values of the first\n",
        "stage). The estimated coefficient of this regression is our 2SLS\n",
        "estimate, an unbiased estimate of the causal effect.\n",
        "\n",
        "It is important to remember that this only works when the instrument\n",
        "satisfies the three criteria:\n",
        "\n",
        "1.  Random assignment: the instrument is as good as randomly assigned\n",
        "2.  Relevance: the instrument has a causal effect on the variable of\n",
        "    interest\n",
        "3.  Exogeneity: the instrument only affects the outcome through the\n",
        "    variable of interest\n",
        "\n",
        "> At this point, you should have the background knowledge to understand\n",
        "> most of what the researchers did to measure the effect of insurance on\n",
 health in the Oregon">
        "> health in the Oregon Health Insurance Experiment. The Baicker et al. (2013)\n",
        "> study can be read for free on the [New England Journal of\n",
        "> Medicine](https://www.nejm.org/doi/10.1056/NEJMsa1212321?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200www.ncbi.nlm.nih.gov).\n",
        "> We recommend reading the “Special Article” as well as the “Analytic\n",
        "> Specifications” section (pages 5-7) of the Supplementary Appendix.\n",
        "\n",
        "In the following example, we’ll show how we can take advantage of the\n",
        "flexibility of the 2SLS estimator to use instruments that are not\n",
        "randomly assigned.\n",
        "\n",
        "## Example 2: College Proximity and Returns to Education\n",
        "\n",
        "An interesting topic in labor economics is understanding how education\n",
        "affects future earnings. Card (1993) investigates this relationship by\n",
        "calculating the economic returns to schooling with college proximity as\n",
        "an instrumental variable.\n",
        "\n",
        "In this example, we’ll try to answer the same questions as Card (1993)\n",
        "with a simplified version of his dataset. Our dataset `cdist_data`\n",
        "contains the following variables for high school graduates:\n",
        "\n",
        "-   `distance` dummy == 1 for living close to 4-year college in 1966\n",
        "-   `momdad14` dummy == 1 for living with both parents at age 14\n",
        "-   `black` dummy == 1 for being black\n",
        "-   `south` dummy == 1 for living in the south in 1976\n",
        "-   `urban` dummy == 1 for living in urban area in 1976\n",
        "-   `wage` for wage in 1976\n",
        "-   `educ` for years of education in 1976\n",
        "-   `exper` for years of experience in 1976\n",
        "-   `fatheduc` for years of father’s education"
      ],
      "id": "f9ecad42-35dd-47f9-af66-55518dab96d7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# view data\n",
        "head(as.data.frame(cdist_data))"
      ],
      "id": "0e21dd05-c345-4707-aec2-90eaf913298b"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### The selection problem\n",
        "\n",
        "The question we want to answer is: **what is the effect of an extra year\n",
        "of education on wages?**\n",
        "\n",
        "A simple regression of `wage` on `education` would generate a biased\n",
        "estimate of the causal effect because education is not randomly assigned\n",
        "across the surveyed. As Card (1993) put it, “individuals make their own\n",
        "schooling choices; depending on how these choices are made, measured\n",
        "earnings differences between workers with difference levels of schooling\n",
        "may over-state or under-state the true return to education.” That is\n",
        "just another way of saying that the model contains selection bias.\n",
        "\n",
        "We have two potential solutions for this problem (1) solving the OVB\n",
        "with additional control variables (2) solving the OVB with an\n",
        "instrumental variable that is randomly assigned (or as good as randomly\n",
        "assigned). Let’s try both approaches and compare them with the (biased)\n",
        "OLS estimate.\n",
        "\n",
        "#### Calculating the OLS estimate\n",
        "\n",
        "First, let’s estimate the returns to education with simple regression of\n",
        "the form:\n",
        "\n",
        "$$\n",
        "\\log(wage_i) = \\beta_0 + \\beta_1 education_i + u_i, \n",
        "$$"
      ],
      "id": "1ae0f686-5b90-4f60-b95a-baf6b3679af4"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# run linear regression\n",
        "simple_OLS <- lm(log(wage) ~ educ, data=cdist_data)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(simple_OLS, vcov=vcovHC)"
      ],
      "id": "2a7819fd-029b-44d4-bcfe-2e5680927a87"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The OLS estimate for the returns to education is a 5.2% boost in\n",
        "earnings for every additional year of schooling.\n",
        "\n",
        "#### Controlling for observable differences\n",
        "\n",
        "Let’s try adding controls. Remember from the linear regression section\n",
        "that we should try controlling for confounding variables: variables that\n",
        "affect earnings and/or education but are not affected by education.\n",
        "Let’s follow Card (1993) and add the controls `momdad14`, `south`,\n",
        "`black`, `fatheduc`, `exper`, and `urban`.\n",
        "\n",
        "> Card (1993) runs additional specifications with `exper` as endogenous,\n",
        "> but we’ll limit our analysis to this single model."
      ],
      "id": "37828c0c-5eaa-4ce8-b693-e09a4113e24e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# run linear regression\n",
        "multiple_OLS <- lm(log(wage) ~ educ + momdad14 + south + black + fatheduc + exper + urban, data=cdist_data)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(multiple_OLS, vcov=vcovHC)"
      ],
      "id": "dcebdeff-c4f8-44d1-acad-854745b419a0"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "When adding controls, the estimated returns to education increase to\n",
        "7.3% higher wages for every additional year of schooling.\n",
        "\n",
        "#### Using an instrumental variable to estimate the causal effect\n",
        "\n",
        "Let’s estimate the returns to education using college proximity as an\n",
        "instrumental variable. The logic behind choosing this instrument is that\n",
        "students who live closer to colleges are more likely to pursue more\n",
        "education than those who live further away.\n",
        "\n",
        "The variable `distance` on our dataset maps whether the survey\n",
        "respondents live close to a 4-year college. Let’s estimate the returns\n",
        "to education with the 2SLS estimator.\n",
        "\n",
        "1.  Run the first stage regression:"
      ],
      "id": "a75cf53f-c2aa-459f-9484-4eb655471c4e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# run the first stage\n",
        "dist_s1 <- lm(educ ~ distance , data=cdist_data)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(dist_s1, vcov=vcovHC)"
      ],
      "id": "8c2b58cc-275a-4a0d-a661-8ff587267238"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The fitted values of our first stage are given by: $$\n",
        "\\widehat{education_{i}}= 12.698 + 0.829 distance_{i}\n",
        "$$\n",
        "\n",
        "We find that students living close to a 4-year college pursue 0.83 years\n",
        "more of education than those who don’t live close to a college. The\n",
        "effect is significant at the 0.1% significance level.\n",
        "\n",
        "1.  Store the fitted values from the first stage:\n",
        "\n",
        "We store the fitted values of our first stage regression,\n",
        "$\\widehat{education_{i}}$ as the variable `educ_hat` in our dataset\n",
        "`cdist_data`."
      ],
      "id": "e461ee69-0c5c-4d1b-a839-cbfa5e69246d"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# store fitted values as a new column named `educ_hat`\n",
        "cdist_data$educ_hat <- dist_s1$fitted.values\n",
        "\n",
        "# view the appended dataset\n",
        "head(cdist_data)"
      ],
      "id": "013a77e9-bf62-4fce-a28c-fd927824fa9e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "1.  Run the second stage regression:\n",
        "\n",
        "$$\n",
        "\\log(wage_{i}) = \\beta_0 + \\beta_1 \\widehat{education_{i}} + u_i.\n",
        "$$"
      ],
      "id": "4f674093-59a3-4bd5-aedc-9b882c9c4dc5"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# run the second stage\n",
        "dist_s2 <- lm(log(wage) ~ educ_hat, data=cdist_data)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(dist_s2, vcov = vcovHC)"
      ],
      "id": "f1a30a28-d144-4f5d-ab3b-e92ff2d27e97"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The 2SLS estimate for the returns to education is a staggering 18.8%\n",
        "increase in wages for every additional year of schooling. This result\n",
        "suggests substantial returns to education; even higher than the range of\n",
        "10-14% found by Card (1993). More importantly, this estimate differs\n",
        "significantly from the 5-7% range that we found with our OLS estimates.\n",
        "\n",
        "But we’re not done yet. We need to take a closer look at our model,\n",
        "understand the shortcomings of our modeling choices, and try to fix\n",
        "them. Before doing that though, let’s take a quick look at `ivreg()`.\n",
        "\n",
        "#### Estimating 2SLS directly with `ivreg()`\n",
        "\n",
        "The function `ivreg()` from the `AER` package carries out 2SLS\n",
        "automatically. It follows the same structure as `lm()`, with the added\n",
        "feature of specifying instruments with a vertical bar after the\n",
        "regression formula. Let’s run `ivreg()` on our college distance data."
      ],
      "id": "3019e941-e889-451a-a8ee-f02c7396c75a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# run 'ivreg()' regression\n",
        "dist_ivreg <- ivreg(log(wage) ~ educ | distance, data = cdist_data)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(dist_ivreg, vcov = vcovHC)"
      ],
      "id": "abb8be14-105a-4eef-bf72-23c3d4b9e388"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Notice that `ivreg()` gives us the same result as running the first and\n",
        "second stages independently: an additional year of education is\n",
        "associated with a 18.8% increase in wages. However, `ivreg()` gives us\n",
        "larger standard errors. This might be a problem for hypothesis testing…\n",
        "More on this later.\n",
        "\n",
        "> Although we report our main results with `ivreg()`, running first\n",
        "> stage regressions is useful for testing assumptions about instrument\n",
        "> relevance. We discuss this on *Instrumental Variables 2*.\n",
        "\n",
        "### Analyzing the results\n",
        "\n",
        "Now that we have estimated the returns to education with OLS, multiple\n",
        "regression, and 2SLS, let’s think critically about the estimated\n",
        "coefficients and reflect about any possible shortcomings of our modeling\n",
        "choices. Use the following questions to guide your reflection:\n",
        "\n",
        "-   Are we confident that our multiple regression specification solves\n",
        "    the selection problem? Are there any important effects that we have\n",
        "    failed to control? If yes, what is the probable direction of the\n",
        "    bias?\n",
        "\n",
        "-   Is college proximity a good instrument? If not, what assumptions\n",
        "    does it fail to meet? Is there any way that we could improve our IV\n",
        "    approach?\n",
        "\n",
        "The answers to some of these questions can be found on [Card\n",
        "(1993)](https://davidcard.berkeley.edu/papers/geo_var_schooling.pdf).\n",
        "Read through the paper to find out how he solves the selection problem.\n",
        "\n",
        "### Adding controls to the IV regression\n",
        "\n",
        "We have a problem with our instrument: college proximity is not randomly\n",
        "assigned. It is possible that distance from a 4-year college is\n",
        "correlated to the error term of the model (for example, marginalized\n",
        "groups living far from colleges could be less likely to both attend\n",
        "college and work high-paying jobs).\n",
        "\n",
        "To fix this we need to control for variables which undermine our\n",
        "instrument (for example, ethnicity) in the IV regression. If we are able\n",
        "to control for all sources of variation that affect our outcome directly\n",
        "(or indirectly, through a variable that is not our instrument), the\n",
        "instrument will be “as good as randomly assigned”\n",
        "\n",
        "In our example, we need to control for all potential determinants of\n",
        "college proximity that also affect wages directly. Controlling for these\n",
        "potential sources of variation (`momdad14`, `south`, `black`,\n",
        "`fatheduc`, `exper`, `urban`), we can be more confident (although not\n",
        "100% confident) that our instrument satisfies the three IV conditions.\n",
        "\n",
        "Let’s run our extended IV regression with `ivreg()`:\n",
        "\n",
        "> Note that `ivreg()` requires users to specify the control variables on\n",
        "> both sides of the vertical bar."
      ],
      "id": "713d8227-226d-448b-a56b-c612462c0618"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# run 'ivreg()' regression with controls\n",
        "dist_ivreg <- ivreg(log(wage) ~ educ + momdad14 + south + black + fatheduc + exper + urban | distance + momdad14 + south + black + fatheduc + exper + urban, data = cdist_data)\n",
        "\n",
        "# test significance of coefficients with robust standard errors\n",
        "coeftest(dist_ivreg, vcov = vcovHC)"
      ],
      "id": "0dc7ce86-fe0f-4f80-ba31-478a76072b8b"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Our estimate is a 14.2% increase in earnings for every additional year\n",
        "of education. This is well within the range of 10-14% found by Card\n",
        "(1993).\n",
        "\n",
        "### Local average treatment effect (LATE)\n",
        "\n",
        "Our 2SLS estimate with controls is twice as large as our multiple\n",
        "regression estimate. If we’re controlling for so many variables in the\n",
        "multiple regression, what is driving such a big difference in estimates?\n",
        "\n",
        "It is possible that the **treatment on the treated (TOT)** and **local\n",
        "average treatment effect (LATE)** are different for our subjects.\n",
        "\n",
        "When we use traditional regression methods to estimate causal effects,\n",
        "we use the entire (within group) variation of the treatment to calculate\n",
        "the causal effect. This is what we call treatment on the treated.\n",
        "\n",
        "When we use instrumental variables, we isolate the variation of the\n",
        "treatment driven by the instrument. If the instrument does not affect\n",
        "the entire population equally, we could systematically exclude\n",
        "observations with economic meaning from our calculation. We have three\n",
        "types of individuals in the college distance dataset:\n",
        "\n",
        "1.  Those who would go to college regardless of where they live: the\n",
        "    *always-takers*.\n",
        "2.  Those who wouldn’t go to college regardless of where they live: the\n",
        "    *never-takers*.\n",
        "3.  Those who would go to college only if they live close to a 4-year\n",
        "    college: the *compliers*.\n",
        "\n",
        "Although both always-takers and compliers are treated, an IV approach\n",
        "only uses the variation of compliers to estimate the causal effect. That\n",
        "makes sense: since the instrument doesn’t affect the choice of\n",
        "individuals type 1 and 2 of going to college, their instrument-driven\n",
        "variation must necessarily be zero. We call the causal effect on\n",
        "compliers the local average treatment effect.\n",
        "\n",
        "It is likely that the compliers in our dataset are lower-income\n",
        "students, who would only go to college if they could live with their\n",
        "parents and not pay housing costs. If that is true, then the difference\n",
        "in results between the multiple regression and the IV regression could\n",
        "be driven by differences in the TOT and LATE (and not just OVB).\n",
        "\n",
        "Our larger LATE could suggest that the returns to education for the poor\n",
        "are higher than the returns to education for the rich, a finding that\n",
        "could influence the decisions that both agents and policymakers make\n",
        "about education spending.\n",
        "\n",
        "### A note on standard errors\n",
        "\n",
        "Previously, we noted that standard errors from manually estimated 2SLS\n",
        "and `ivreg()` are different. It is important to note that ***the correct\n",
        "standard errors are those calculated by*** `ivreg()`.\n",
        "\n",
        "Standard errors from manually calculated 2SLS do not adjust for the\n",
        "added uncertainty inherent to IVs: the fact that we use predictions from\n",
        "the first stage regression as regressors in the second stage regression.\n",
        "The IV standard erros fix for this additional uncertainty by adding a\n",
        "term for the correlation between the instrument and the treatment. This\n",
        "makes IV standard errors from `ivreg()` larger and more accurate than\n",
        "`lm()` standard errors."
      ],
      "id": "cbd281b9-bc93-41c4-a2be-ced5173453d4"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}