{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 3.4 - Advanced - Synthetic Control\n", "\n", "COMET Team
*Avi Woodward-Kelen* \n", "2024-08-24\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Intermediate Econometrics (equivalent to ECON 326)\n", "- Panel Data\n", "- Difference in Differences\n", "\n", "### Learning Outcomes\n", "\n", "- Develop a strong intuition behind the synthetic control method of\n", "  analysis,\n", "- Develop an understanding of the econometric theory behind synthetic\n", "  control,\n", "- Be able to use synthetic control to estimate the causal effect of a\n", "  policy change in case study contexts, and\n", "- Apply methods of inference when sample sizes are very small.\n", "\n", "### References\n", "\n", "- Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control\n", "  Methods for Comparative Case Studies: Estimating the Effect of\n", "  California’s Tobacco Control Program. Journal of the American\n", "  Statistical Association, 105(490), 493–505.\n", "  https://doi.org/10.1198/jasa.2009.ap08746\n", "\n", "- Abadie, A., Diamond, A., & Hainmueller, J. (2015). Comparative\n", "  Politics and the Synthetic Control Method. American Journal of\n", "  Political Science, 59(2), 495–510.\n", "  https://doi.org/10.1111/ajps.12116\n", "\n", "- Cunningham, S. (2021). Causal inference: The mixtape. Yale\n", "  University Press. https://mixtape.scunning.com/10-synthetic_control\n", "\n", "- Hainmueller, Jens, 2014, “Replication data for: Comparative Politics\n", "  and the Synthetic Control Method”,\n", "  https://doi.org/10.7910/DVN/24714, Harvard Dataverse, V2,\n", "  UNF:5:AtEF45hDnFLetMIiv9tjpQ== \\[fileUNF\\]\n", "\n", "- Mendez, C. (n.d.). *Basic synthetic control tutorial*.\n", "  carlos-mendez.\n", "  https://carlos-mendez.quarto.pub/r-synthetic-control-tutorial/" ], "id": "b338c881-34b4-4402-836d-77aac5756379" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#install.packages(\"foreign\")\n", "#install.packages(\"Synth\")\n", "#install.packages(\"tidyverse\")\n", "#install.packages(\"haven\")\n", "#install.packages(\"SCtools\")\n", "#install.packages(\"skimr\")" ], "id": "d0a38b35-0b7d-4b9b-bcc9-2d01ebb761fe" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(foreign)\n", "library(Synth)\n", "library(haven)\n", "library(tidyverse)\n", "library(SCtools)\n", "library(skimr)" ], "id": "4d684c95-b226-4825-9243-c80388c3bc58" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "oecd_data <- read.dta(\"datasets/repgermany.dta\")\n", "#NB: using foreign::read.dta() instead of read_dta() is important here because portions of the `Synth` package we will be using accept only numeric and character columns, and read_dta() would give columns the dbl type, which is unsupported." ], "id": "c4ed700c-d5c1-4a4e-941c-b64da27562f7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source(\"advanced_synthetic_control_functions.r\") #minor data cleaning" ], "id": "ab55c4d8-87bb-4e82-b101-24e72ec9d299" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: What is Synthetic Control and Why Use It?\n", "\n", "The purpose of synthetic control is to make comparative case study\n", "analyses more rigorous. 
Three major issues which have traditionally\n", "plagued comparative case studies are a) the presence of confounders, b)\n", "a lack of a control group which shares parallel trends, and c) the\n", "selection of the control group.\n", "\n", "Suppose improvements in vehicle safety design and AI-assisted driving are\n", "leading to fewer road fatalities every year. Nevertheless, in order to\n", "improve road safety, Vancouver’s city council decides to amend municipal\n", "bylaws such that the new speed limit is 30km/h throughout the city.\n", "\n", "Researchers want to know what sort of impact that had, but the trend\n", "line for Canada’s national road fatalities is not similar to that of\n", "Vancouver’s. Moreover, behavior changes slowly and there’s an element of\n", "randomness to the number of people killed in car crashes every year; so\n", "even if everything else was held equal there might not be a sharp enough\n", "change in the trendline to do a simple comparison with pre-bylaw\n", "Vancouver.\n", "\n", "In such a situation, should the researchers compare Vancouver to\n", "Burnaby, because that’s the nearest city? Or perhaps we should compare\n", "it to Toronto or to Seattle because those cities could arguably have\n", "more in common with Vancouver? Thus the concern arises that whatever\n", "control group the researchers choose will be arbitrary - and potentially\n", "misleading.\n", "\n", "How do we get around this? **The essence of synthetic control is to\n", "define a “synthetic control group” as the weighted average of all\n", "available control units which best approximates the relevant\n", "characteristics of the treatment group.** The set of available control\n", "units is also called the “donor pool”.\n", "\n", "What does that mean? Suppose the characteristics of a city most relevant\n", "to the number of road fatalities are `average age of drivers`,\n", "`car ownership per capita`, `average speed driven`, and\n", "`alcohol consumption per capita`. Vancouver might be substantially\n", "different from Burnaby, Toronto, and Seattle on all of these metrics -\n", "but by assigning each variable and each city a specific weight in your\n", "analysis you can often get extremely close to replicating a “Synthetic\n", "Vancouver” which is highly representative of the real city.\n", "\n", "For instance, an extremely rudimentary (and arbitrary) version of\n", "synthetic control would be to assign a weight of 1/4 to Burnaby, 1/2 to\n", "Toronto, and 1/4 to Seattle (as well as applying weights to each of the\n", "characteristics noted above); and then comparing this rudimentary\n", "synthetic Vancouver to real Vancouver. The sophisticated version is to\n", "have R run an optimization program which, in a manner analogous to a\n", "simple regression, finds the optimal weights for each city and each\n", "characteristic by minimizing the distance between real Vancouver and\n", "synthetic Vancouver in the pre-treatment period (i.e. before the bylaw\n", "change). We then compare how synthetic Vancouver would have fared\n", "(based on the earlier weights) to how things actually turned out in\n", "Vancouver.\n", "\n",
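"To make that arithmetic concrete, here is a minimal sketch of the\n", "rudimentary version in R. Every number below is fabricated purely for\n", "illustration; only the mechanics (a weighted average of donor units)\n", "carry over to the real method.\n", "\n", "``` r\n", "#fabricated annual road fatality counts for three donor cities (illustrative only)\n", "burnaby <- c(25, 24, 22, 21)\n", "toronto <- c(180, 175, 171, 166)\n", "seattle <- c(170, 168, 161, 159)\n", "\n", "#the arbitrary weights from the example above: 1/4, 1/2, 1/4\n", "weights <- c(0.25, 0.50, 0.25)\n", "\n", "#\"synthetic Vancouver\" is the weighted average of the donors, year by year\n", "synthetic_vancouver <- weights[1]*burnaby + weights[2]*toronto + weights[3]*seattle\n", "synthetic_vancouver\n", "```\n", "\n",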
"Some famous examples of synthetic control include the effect of\n", "terrorism on GDP in the Basque region of Spain, California’s tobacco\n", "control laws, the impact of Texan prison construction on the number of\n", "people imprisoned, and the results of German reunification after the\n", "Berlin Wall fell - the last of which will be the example we work through\n", "together.\n", "\n", "**Think Deeper**: What sorts of bias might still creep in?\n", "\n", "## Part 2: Synthetic Control Theory & Practice\n", "\n", "### Counterfactual Estimation\n", "\n", "In a perfect world we would measure the true effect of a policy by\n", "randomly assigning individuals/cities/countries to control and treatment\n", "groups. Then, we would look at the difference in outcomes between units\n", "with (1) and without (0) treatment after the intervention has occurred.\n", "\n", "$$\n", "\\alpha = Y_{post}(1) - Y_{post}(0)\n", "$$\n", "\n", "but in the context of a case study $Y_{post}(0)$ doesn’t exist! Instead,\n", "if we want to find the effect we’re going to need some way to estimate\n", "what it might have been like.\n", "\n", "$$\n", "\\begin{align*}\n", "    \\hat{\\alpha}_t &= Y_{t,post}(1) - \\hat{Y}_{t,post}(0) \\\\\n", "    &= Y_{1,t}^{real} - Y_{1,t}^{synthetic}\n", "\\end{align*}\n", "$$\n", "\n", "How do we estimate $Y_{1,t}^{synthetic}$? Well, let:\n", "\n", "- $Y_{jt}$ be the outcome variable for unit $j$ of $J+1$ units at time\n", "  $t$\n", "- The treatment group be $j=1$\n", "- Treatment intervention occurs at $T_0$\n", "- $\\omega_j$ represent the weight placed on unit $j$\n", "\n", "Then define\n", "\n", "$$\n", "\\hat{Y}_{t,post}(0) \\equiv \\sum_{j=2}^{J+1}{\\omega_j^* Y_{jt}}\n", "$$\n", "\n", "This says that our counterfactual value is the optimally weighted\n", "average of all the other units, which raises the question of “how to\n", "optimally weight said units?” The answer is by *minimizing the distance\n", "between the units’ covariates in the pre-treatment period* (subject to\n", "the restriction that weights must be non-negative and must sum to one).\n", "\n", "$$\n", "\\omega^* = \\text{arg min}_{\\{\\omega_j\\}_{j=2}^{J+1}} \\sum_{t=1}^{T_0}({Y_{1t}}-\\sum_{j=2}^{J+1}\\omega_jY_{jt})^2 \\text{ s.t. } \\sum_{j=2}^{J+1} \\omega_j = 1, \\text{ and } \\omega_j \\geq 0\n", "$$\n", "\n", "And taking the average of this gives us what is known as the Mean\n", "Squared Prediction Error (MSPE).\n", "\n", "$$\n", "MSPE = \\frac{1}{T_0} \\sum_{t=1}^{T_0}({Y_{1t}}-\\sum_{j=2}^{J+1}\\omega_jY_{jt})^2\n", "$$\n", "\n", "The MSPE tells us how good a fit we have between the synthetic control\n", "and the treated group during the pre-treatment period; and this will be\n", "core to how we build and analyze our model as well as our inference\n", "tests.\n", "\n",
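"To see the mechanics of that minimization, here is a toy sketch in R\n", "with fabricated data. The real `Synth` package solves a more general\n", "version of this problem (over several covariates, with a weighting\n", "matrix $V$), so treat this only as intuition for what the optimizer is\n", "doing.\n", "\n", "``` r\n", "#fabricated example: one treated unit and three donors over ten pre-treatment periods\n", "set.seed(326)\n", "T0 <- 10\n", "Y1 <- 20 + cumsum(rnorm(T0))        #treated unit's outcome path\n", "Y0 <- cbind(19 + cumsum(rnorm(T0)), #donor 1\n", "            22 + cumsum(rnorm(T0)), #donor 2\n", "            24 + cumsum(rnorm(T0))) #donor 3\n", "\n", "#reparameterize so that the weights are non-negative and sum to one,\n", "#then minimize the pre-treatment sum of squared residuals\n", "ssr <- function(theta) {\n", "  w <- exp(c(theta, 0)); w <- w/sum(w)\n", "  sum((Y1 - Y0 %*% w)^2)\n", "}\n", "opt <- optim(c(0, 0), ssr)\n", "w_star <- exp(c(opt$par, 0)); w_star <- w_star/sum(w_star)\n", "round(w_star, 3)             #the optimal donor weights\n", "mean((Y1 - Y0 %*% w_star)^2) #the MSPE in the pre-treatment period\n", "```\n", "\n",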
"> **Extend Your Knowledge: Matrix Algebra and Econometrics**\n", ">\n", "> We can (and do) actually minimize the function across multiple\n", "> observed variables in the pre-treatment period by choosing\n", ">\n", "> $$\n", "> \\{\\omega^*\\} = \\text{arg min}_{\\vec{W}} ||\\vec{X_1} - \\vec{X_0}\\vec{W}|| = \\sqrt{(X_1 - X_0W)'V(X_1 - X_0W)}\n", "> $$\n", ">\n", "> For those who have a background in linear algebra and who want to dig\n", "> deeper, the following references provide increasingly sophisticated\n", "> backgrounders on the process\n", ">\n", "> - Cunningham, S. (2021). Causal inference: The mixtape. Yale\n", ">   University Press.\n", ">   https://mixtape.scunning.com/10-synthetic_control#formalization\n", "> - Abadie, A., Diamond, A., & Hainmueller, J. (2015). Comparative\n", ">   Politics and the Synthetic Control Method. American Journal of\n", ">   Political Science, 59(2), 495–510.\n", ">   https://doi.org/10.1111/ajps.12116\n", "> - Abadie, A. (2021). Using Synthetic Controls: Feasibility, Data\n", ">   Requirements, and Methodological Aspects. Journal of Economic\n", ">   Literature, 59(2), 391–425. https://doi.org/10.1257/jel.20191450\n", "\n", "Finally, in the context of synthetic control we will typically estimate\n", "the Average Treatment effect on the Treated (ATT), averaged over the\n", "post-treatment period.\n", "\n", "$$\n", "\\begin{aligned}\n", "    ATT &= \\frac{1}{T_1 - T_0} \\sum_{t=T_0+1}^{T_1} \\alpha_t \\\\\n", "    &= \\frac{1}{T_1 - T_0}\\sum_{t=T_0+1}^{T_1}({Y_{1t}}-\\sum_{j=2}^{J+1}\\omega_jY_{jt})\n", "\\end{aligned}\n", "$$\n", "\n", "### Implementation\n", "\n", "First things first, let’s take a peek at our data:" ], "id": "f080cc3e-65f1-4878-bc0a-4f583b40e69e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "head(oecd_data)" ], "id": "1dd91c82-991e-4b09-84f6-b47887e4efc2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "glimpse(oecd_data)" ], "id": "7f8dec33-de30-49ab-b52e-07ecbd5712eb" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "skim(oecd_data)" ], "id": "4491b01b-2d22-45e7-9091-5e1eda57a057" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Where:\n", "\n", "- `gdp`: GDP per capita, PPP adjusted in 2002 USD\n", "- `invest`: average investment rate for a given decade\n", "- `schooling`: percentage of secondary school attained in the total\n", "  population aged 25 and older\n", "- `industry`: share of value added to GDP by industrial processes\n", "- `infrate`: annual rate of inflation (base year 1995)\n", "- `trade`: an index of openness to international trade, exports +\n", "  imports as a percentage of GDP\n", "\n", "We have data available from 1960 to 2003, and we will split this into\n", "two major sections: **pre-treatment** (1960 to 1990) and\n", "**post-treatment** (1990 to 2003). During the pre-treatment phase we will\n", "be establishing our synthetic West Germany, and in the post-treatment we\n", "will see how it performs.\n", "\n", "We will *also* be splitting **pre-treatment** into two periods: a\n", "*training period* (1971 to 1980) during which we find the values of our\n", "explanatory variables; and a *validation period* (1981 to 1990) in which\n", "we optimize the weights based on the explanatory variables found during\n", "the previous period.[1] This is known as the process of cross-validation\n", "and it helps prevent us from overfitting our model.[2]\n", "\n", "While cross-validation is not strictly necessary, it is good practice.\n", "Moreover, it is sort of confusing to try and figure out both the\n", "rationale and the syntax without a little bit of hand-holding. So, we’ll\n", "do it together.\n", "\n", "> **Under Tips & Tricks at the bottom of the notebook there is a\n", "> non-cross-validated (simpler) version of synthetic control**\n", ">\n", "> I chose to teach synthetic control with cross-validation because\n", ">\n", "> 1. It is a good way to make sure you’re not overfitting the data\n", ">    (which is a real risk in synthetic control studies), and\n", ">\n", "> 2. 
Without a tutorial on how cross-validation works and what it looks\n", ">    like, it is quite difficult both to intuit how to do it yourself\n", ">    and to read/understand other people’s code when *they* are doing\n", ">    it.\n", ">\n", "> The downside is that it makes the creation and display of graphs and\n", "> tables significantly more complicated, as I think you’ll see if you\n", "> skip to the bottom of the notebook.\n", "\n", "Order of operations (with cross-validation) is\n", "\n", "- `dataset` -\\> `dataprep(training_period)` -\\>\n", "  `synth(training_period)`\n", "- `dataset` -\\> `dataprep(validation_period)` -\\>\n", "  `synth(training_period & validation_period)` -\\>\n", "  `output (graphs, tables, etc.)`\n", "\n", "[1] Quick counters will notice that there is actually a *third*\n", "sub-period within the pre-treatment era, 1960-1970, which we use for\n", "neither training nor validation. We’re going to revisit this period when\n", "we get to placebo studies, but until then it is yet another way to\n", "visually identify whether or not our model does a good job.\n", "\n", "[2] The exact years chosen here are somewhat arbitrary, so feel free to\n", "experiment with the dates on your own." ], "id": "64c71a77-6c20-44b5-b2cc-22cbe2567af2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#although our data is already cleaned we need to put it in a format that the `synth()` package understands, using the `dataprep()` command\n", "\n", "training_data <- dataprep(\n", "  \n", "  foo = oecd_data, #the dataset to be prepared... don't ask why it's called foo.\n", "  \n", "  predictors = c(\"gdp\", \"trade\", \"infrate\"), #predictors of the outcome variable (GDP)\n", "  \n", "  dependent = \"gdp\", #the outcome variable we want\n", "  \n", "  #special.predictors is used for variables which require special rules (e.g. it allows us to choose the time periods and which measure of central tendency to use), or when observations are only present in certain years\n", "  special.predictors = list(\n", "    list(\"industry\", 1971:1980, c(\"mean\")),\n", "    list(\"schooling\",c(1970,1975), c(\"mean\")),\n", "    list(\"invest70\" ,1980, c(\"mean\"))\n", "  ),\n", "  \n", "  unit.variable = \"index\", #tells the package which column is the unit of observation. It must be either the numerical value of the column (i.e. `unit.variable = 1` is an acceptable alternative), or the name of the column in string form as I have done.\n", "\n", "  treatment.identifier = 7, #the index value in the dataset for West Germany (our treatment group)\n", "  \n", "  controls.identifier = unique(oecd_data$index)[-7], #all country indexes other than West Germany \n", "\n", "  unit.names.variable = \"country\", #the column in the dataset which contains the names of the units. It must be either the numerical value of the column (i.e. `unit.names.variable = 2` is an acceptable alternative), or the name of the column in string form as I have done. \n", "\n", "  time.variable = \"year\", #tells the package which column is the time variable. It must be either the numerical value of the column (i.e. `time.variable = 3` is an acceptable alternative), or the name of the column in string form as I have done.\n", "  \n", "  time.predictors.prior = 1971:1980, #This is the training period! The mean of the predictors() argument above will be calculated over this span.\n", "  \n", "  time.optimize.ssr = 1981:1990, #This is the validation period! Here we designate the time frame over which we want to optimize weights for the synthetic West Germany.
\n", " \n", " time.plot = 1960:2003 #This is the time period we'll be plotting the data for.\n", " )" ], "id": "14960fb3-423f-4a20-9684-77880b3af7b4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you’ve prepared the data we’re going to optimize weights for\n", "our potential control countries by minimizing the sum of squared\n", "residuals (SSR). This is a multivariate optimization problem such as you\n", "may be familiar with from calculus… luckily for us, we don’t have to do\n", "it by hand! We do it with the `synth()` command." ], "id": "6dfa1593-3af0-4c5b-8e4f-797944f53121" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_model <- synth(data.prep.obj = training_data)" ], "id": "f1ffba43-9a06-4161-a40b-1f11dc1f60c2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case it’s not clear (it isn’t) `synth()` has generated optimized\n", "weights, `solution.v` and `solution.w`, for the variables and the\n", "countries respectively.\n", "\n", "Great. Next, we need to create the dataset for the validation period.\n", "Once that is done, we will apply our training model to it - the result\n", "of which is our main model.\n", "\n", "This is cross-validation in action! And it may seem like we’re sort of\n", "doing the same thing over and over again…because we are (but notice that\n", "the years are changing!)" ], "id": "156e1861-f3a0-469a-b8e2-0df3664c53dc" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "main_data <- dataprep(\n", " foo = oecd_data,\n", " predictors = c(\"gdp\",\"trade\",\"infrate\"),\n", " dependent = \"gdp\",\n", " special.predictors = list(\n", " list(\"industry\" ,1981:1990, c(\"mean\")),\n", " list(\"schooling\",c(1980,1985), c(\"mean\")),\n", " list(\"invest80\" ,1980, c(\"mean\"))\n", " ),\n", " unit.variable = \"index\",\n", " unit.names.variable = 2,\n", " treatment.identifier = 7,\n", " controls.identifier = unique(oecd_data$index)[-7],\n", " time.variable = \"year\",\n", " time.predictors.prior = 1981:1990, #take explainatory variable averages from the validation period\n", " time.optimize.ssr = 1960:1989, #optimize across the entire pre-treatment period\n", " time.plot = 1960:2003\n", ")" ], "id": "0db40cc9-2b56-438c-8863-588ca9091eae" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#apply training model weights to the main model\n", "main_model <- synth(\n", " data.prep.obj = main_data,\n", " custom.v = as.numeric(training_model$solution.v) #This is the cross-validation in action! This line specifies that, although we are optimizing across the whole period, we are doing so using weights derived from the training_model rather than the ones from the main model. \n", " )" ], "id": "758072d0-15fd-4445-bb71-d6b4bf910994" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, phew, that was a lot of work! I hope it wasn’t a waste of time,\n", "what if we actually could have just done a DiD between West Germany and\n", "the OECD avarage?\n", "\n", "Let’s look at a pretty picture, you’ve earned it." 
], "id": "c7208069-2aa7-4a95-be52-e28f96ade930" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Text.height <- 23000\n", "Cex.set <- .8\n", "plot(1960:2003,main_data$Y1plot,\n", " type=\"l\",ylim=c(0,33000),col=\"black\",lty=\"solid\",\n", " ylab =\"per-capita GDP (PPP, 2002 USD)\",\n", " xlab =\"year\",\n", " xaxs = \"i\", yaxs = \"i\",\n", " lwd=2\n", " )\n", "lines(1960:2003,aggregate(oecd_data[,c(\"gdp\")],by=list(oecd_data$year),mean,na.rm=T)[,2]\n", " ,col=\"black\",lty=\"dashed\",lwd=2) # mean 2\n", "abline(v=1990,lty=\"dotted\")\n", "legend(x=\"bottomright\",\n", " legend=c(\"West Germany\",\"rest of the OECD sample\")\n", " ,lty=c(\"solid\",\"dashed\"),col=c(\"black\",\"black\")\n", " ,cex=.8,bg=\"white\",lwd=c(2,2))\n", "arrows(1987,Text.height,1989,Text.height,col=\"black\",length=.1)\n", "text(1982.5,Text.height,\"reunification\",cex=Cex.set)" ], "id": "8044c758-a9cc-4d8f-aa1c-6370ab484d38" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh thank God they aren’t the same! That would have been quite the plot\n", "twist, eh?\n", "\n", "Now let’s look at how our synthetic West Germany compares to the real\n", "deal." ], "id": "34a7e6e5-32d6-4e43-93c3-1b224d61a680" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "synthY0 <- (main_data$Y0%*%main_model$solution.w)\n", "plot(1960:2003,main_data$Y1plot,\n", " type=\"l\",ylim=c(0,33000),col=\"black\",lty=\"solid\",\n", " ylab =\"per-capita GDP (PPP, 2002 USD)\",\n", " xlab =\"year\",\n", " xaxs = \"i\", yaxs = \"i\",\n", " lwd=2\n", " )\n", "lines(1960:2003,synthY0,col=\"black\",lty=\"dashed\",lwd=2)\n", "abline(v=1990,lty=\"dotted\")\n", "legend(x=\"bottomright\",\n", " legend=c(\"West Germany\",\"synthetic West Germany\")\n", " ,lty=c(\"solid\",\"dashed\"),col=c(\"black\",\"black\")\n", " ,cex=.8,bg=\"white\",lwd=c(2,2))\n", "arrows(1987,Text.height,1989,Text.height,col=\"black\",length=.1)\n", "text(1982.5,Text.height,\"reunification\",cex=Cex.set)" ], "id": "333b30c4-30fe-4e3f-9dc4-6b6396db130d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that sure looks like common trends to me! They’re practically on top\n", "of each other until 1990.\n", "\n", "“But”, you might be asking, “can we do better than the eyeball test?”\n", "\n", "We sure can!" ], "id": "4021273c-1267-4f9b-b03c-f0316b1d143b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "synth.tables <- synth.tab(\n", " dataprep.res = main_data,\n", " synth.res = main_model,\n", " round.digit = 2\n", " )\n", "synth.tables" ], "id": "e6cb5bb1-209d-42ca-a133-bf38a276601e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can hopefully see from `tab.pred`, this function compares the\n", "pre-treatment predictor values for the treated unit to the synthetic\n", "control unit, and to all the units in the sample[1]. `tab.w` gives us\n", "the weights for each country in the donor pool, and `tab.v` the weights\n", "for each variable. Finally, `tab.loss` gives us the loss function.\n", "\n", "Similarly to DiD analyses, we can also visualize this in terms of the\n", "gap that exists between the real and synthetic versions.\n", "\n", "[1] You can think of it as being roughly analagous to a balance table in\n", "an RCT." 
], "id": "876147a1-f0cf-4f98-b736-118c3877ff83" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gap <- main_data$Y1-(main_data$Y0%*%main_model$solution.w) # the difference between the treated unit and the synthetic control at a specific point in time" ], "id": "c001f86c-53ae-43ad-b69a-75f361515838" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot(1960:2003,gap,\n", " type=\"l\",ylim=c(-4500,4500),col=\"black\",lty=\"solid\",\n", " ylab =c(\"gap in per-capita GDP (PPP, 2002 USD)\"),\n", " xlab =\"year\",\n", " xaxs = \"i\", yaxs = \"i\",\n", " lwd=2\n", " )\n", "abline(v=1990,lty=\"dotted\")\n", "abline(h=0,lty=\"dotted\")\n", "arrows(1987,1000,1989,1000,col=\"black\",length=.1)\n", "text(1982.5,1000,\"reunification\",cex=Cex.set)" ], "id": "72f26449-b941-4451-a1e0-88aa12268294" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **This method of looking at the gaps will become important later when\n", "> we try to decide whether or not we can assign statistical significance\n", "> to a post-treatment change.**\n", "\n", "Before we get to that, let’s take a look at the size of the effect of\n", "reunification." ], "id": "f3ab8e96-8224-475c-bb2c-0fa8e6f18a53" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#ATT_t is the average size of `gap` between 1990 and 2003\n", "mean(gap[31:44, 1])" ], "id": "ad5b0a41-b3ac-437a-b183-31511c796352" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ouch! West Germans had their per capita GDP reduced by an average of\n", "US\\$1,600 per year after reunification." ], "id": "94c72af1-dca3-4c25-8659-47b8943ccfa8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#fraction of ATT_t to West German GDP in 1990\n", "round(\n", " mean(gap[31:44, 1]) / oecd_data$gdp[oecd_data$country == \"West Germany\" & oecd_data$year == 1990],\n", " 2\n", ")" ], "id": "6299c868-3e45-4f4f-a818-a68984c2495f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relative to national income in 1990, that’s an 8% average reduction!\n", "\n", "## Statistical Inference\n", "\n", "Okay, so we have our “balance table”, we have our common trends, and we\n", "have a large effect size. But how can we know if this is a statistically\n", "significant change? Unlike in traditional methods of estimation and\n", "inference we cannot easily draw upon the law of large numbers and the\n", "central limit theorem to save us. Not to belabor the point but we\n", "*literally* only have two observations per year, neither of which were\n", "randomly assigned to treatment or control, and no particularly good\n", "reason to think that such events would be independent and identically\n", "distributed.\n", "\n", "Recall that to test for significance in a random experiment what we do\n", "is randomly assign treatment to untreated units, collect data on the two\n", "groups, calculate coefficients, and then collecte those coefficients\n", "into a well behaved distribution in order to infer things about said\n", "coefficients.\n", "\n", "This is probably where the conceptual framework of synthetic control\n", "differs most profoundly from the traditional statistical methods you’re\n", "familiar with. 
"To get around these problems, we’ll use so-called “placebo\n", "studies” which “iteratively apply the synthetic control method to each\n", "\\[country\\] in the donor pool and obtain a distribution of placebo\n", "effects” (Cunningham, 2021).\n", "\n", "Let’s unpack what that means. First and foremost, it means **there will\n", "be no confidence intervals and p-values will not reflect how unlikely a\n", "result would be to occur under the null hypothesis**. Instead, our\n", "efforts will be focused on\n", "\n", "1. trying to falsify our findings, and\n", "\n", "2. trying to figure out how extreme the treatment effect on our treated\n", "   group is, *relative to other members of the donor pool*.\n", "\n", "By doing these two things we will attempt to uncover whether the effect\n", "was a statistical fluke or perhaps merely prediction error on the part\n", "of the model.\n", "\n", "## Part 3: Placebo Studies, Significance Tests, Distribution, and Robustness\n", "\n", "At the core of how we will attempt to falsify our findings is the basic\n", "assumption that if you found a similarly sized effect in cases where\n", "German reunification never happened (i.e. in a different year or in a\n", "different country), this would severely undermine the validity of the\n", "supposed effect of German reunification we just found. Working through\n", "this process of falsification is what we call “placebo studies”, which\n", "can broadly be broken down into “in-time placebos” and “in-space\n", "placebos”.\n", "\n", "### In-time Placebos\n", "\n", "Running an in-time placebo is no different than running the original\n", "synthetic control model, except that the dates change. For example, how\n", "would our model fare if German reunification had taken place 15 years\n", "earlier, in 1975?\n", "\n", "As before, we will cross-validate our model by choosing variable means\n", "and optimal weights across different time periods. Let the placebo\n", "training period be 1960-1964, the placebo validation period be\n", "1965-1975, and the placebo treatment occur in 1975."
], "id": "34492a7c-80c3-4e9d-b1a4-934a19f5786c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data prep for placebo_training model\n", "placebo_time_training_data <-\n", " dataprep(\n", " foo = oecd_data,\n", " predictors = c(\"gdp\",\"trade\",\"infrate\"),\n", " dependent = \"gdp\",\n", " unit.variable = \"index\",\n", " time.variable = \"year\",\n", " special.predictors = list(\n", " list(\"industry\",1971, c(\"mean\")),\n", " list(\"schooling\",c(1960,1965), c(\"mean\")),\n", " list(\"invest60\" ,1980, c(\"mean\"))\n", " ),\n", " treatment.identifier = 7,\n", " controls.identifier = unique(oecd_data$index)[-7],\n", " time.predictors.prior = 1960:1964,\n", " time.optimize.ssr = 1965:1975,\n", " unit.names.variable = 2,\n", " time.plot = 1960:1990\n", " )\n", "\n", "# fit placebo_time_training model\n", "placebo_time_training_model <- synth(\n", " data.prep.obj=placebo_time_training_data)" ], "id": "5e7c6c35-858f-49ad-a84c-5e3487f895c0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data prep for placebo_main model\n", "placebo_time_main_data <-\n", " dataprep(\n", " foo = oecd_data,\n", " predictors = c(\"gdp\",\"trade\",\"infrate\"),\n", " dependent = \"gdp\",\n", " unit.variable = 1,\n", " time.variable = 3,\n", " special.predictors = list(\n", " list(\"industry\" ,1971:1975, c(\"mean\")),\n", " list(\"schooling\",c(1970,1975), c(\"mean\")),\n", " list(\"invest70\" ,1980, c(\"mean\"))\n", " ),\n", " treatment.identifier = 7,\n", " controls.identifier = unique(oecd_data$index)[-7],\n", " time.predictors.prior = 1965:1975,\n", " time.optimize.ssr = 1960:1975,\n", " unit.names.variable = 2,\n", " time.plot = 1960:1990\n", " )\n", "\n", "# fit main model\n", "placebo_time_main_model <- synth(\n", " data.prep.obj=placebo_time_main_data,\n", " custom.v=as.numeric(placebo_time_training_model$solution.v)\n", ")" ], "id": "cb1bd36e-ae53-47f3-a1cb-4ce97d139a08" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Cex.set <- 1\n", "plot(1960:1990,placebo_time_main_data$Y1plot,\n", " type=\"l\",ylim=c(0,33000),col=\"black\",lty=\"solid\",\n", " ylab =\"per-capita GDP (PPP, 2002 USD)\",\n", " xlab =\"year\",\n", " xaxs = \"i\", yaxs = \"i\",\n", " lwd=2\n", " )\n", "lines(1960:1990,(placebo_time_main_data$Y0%*%placebo_time_main_model$solution.w),col=\"black\",lty=\"dashed\",lwd=2)\n", "abline(v=1975,lty=\"dotted\")\n", "legend(x=\"bottomright\",\n", " legend=c(\"West Germany\",\"synthetic West Germany\")\n", " ,lty=c(\"solid\",\"dashed\"),col=c(\"black\",\"black\")\n", " ,cex=.8,bg=\"white\",lwd=c(2,2))\n", "arrows(1973,20000,1974.5,20000,col=\"black\",length=.1)\n", "text(1967.5,20000,\"placebo reunification\",cex=Cex.set)" ], "id": "586f72e4-f92f-4cee-890f-7c48ae65b415" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, good. Just like in the real world, [nothing happened in\n", "1975](https://en.wikipedia.org/wiki/1975#Events). This is a good sign\n", "for our model! 
If an effect had been visible, given that nothing should have\n", "happened, that would have implied there were factors other than the\n", "reunification which caused synthetic West Germany to diverge from West\n", "Germany.\n", "\n", "### In-space Placebos\n", "\n", "In-space placebo studies are a little stranger to consider, and as I\n", "think you’ll see, they are how we try to estimate whether the effect of\n", "the intervention on our treatment group is extreme relative to other\n", "members of the donor pool.\n", "\n", "The centerpiece of in-space placebos is the amount of prediction error\n", "in our treatment unit, our synthetic unit, and the units from the donor\n", "pool. This is obtained by repeatedly applying the same process of\n", "synthetic control that we did with West Germany to every other unit in\n", "the donor pool (i.e. France, Japan, Spain, etc.).\n", "\n", "Thinking back to our earlier discussion of the MSPE, we’re now going to\n", "take its square root for each unit (now the RMSPE)[1] both before and\n", "after the intervention supposedly took place in 1990. Doing so gives us\n", "a tractable way to measure the magnitude of the gap in our outcome\n", "variable between each country and its synthetic counterpart.\n", "\n", "To be clear, a large post-treatment RMSPE does not necessarily indicate\n", "a large effect of the intervention if the pre-treatment RMSPE is also\n", "large. However, if the post-treatment RMSPE is large *and* the\n", "pre-treatment RMSPE is small, then that is a strong indication that the\n", "intervention had an effect.\n", "\n", "Once you’ve calculated the RMSPE in each period, the most\n", "straightforward way to decide what constitutes a large or small effect\n", "is to take the ratio\n", "\n", "$$\n", "\\frac{RMSPE_{post,j}}{RMSPE_{pre,j}}\n", "$$\n", "\n", "Once that’s done, rank the ratios in descending order (highest to\n", "lowest) and let\n", "\n", "$$\n", "p \\equiv \\frac{RANK}{TOTAL}\n", "$$\n", "\n", "Let’s do this now.\n", "\n", "[1] Taking the square root scales the values and makes it a little\n", "easier to interpret."
], "id": "53cde955-c934-46be-b6a6-72669f8b6467" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a dataframe to store the gaps between the actual and synthetic versions of each country in the donor pool\n", "storegaps <- matrix(NA, \n", " length(1960:2003),\n", " length(unique(oecd_data$index))-1\n", " )\n", "rownames(storegaps) <- 1960:2003\n", "i <- 1\n", "country_index <- unique(oecd_data$index)" ], "id": "fcb89f93-91e2-4983-9ead-ddf65499a395" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#looping over control units from the donor pool\n", "for(k in unique(oecd_data$index)[-7]){ # excluding index=7 because that is West Germany\n", "\n", " placebo_space_training_data <- dataprep(\n", " foo = oecd_data,\n", " predictors = c(\"gdp\", \"trade\", \"infrate\"),\n", " dependent = \"gdp\",\n", " unit.variable = \"index\",\n", " time.variable = \"year\",\n", " special.predictors = list(\n", " list(\"industry\", 1971:1980, c(\"mean\")),\n", " list(\"schooling\",c(1970,1975), c(\"mean\")),\n", " list(\"invest70\" ,1980, c(\"mean\"))\n", " ),\n", " treatment.identifier = k, #kth placebo unit being treated\n", " controls.identifier = country_index[-which(country_index==k)], #when kth placebo unit is being treated it cannot also be a control\n", " time.predictors.prior = 1971:1980,\n", " time.optimize.ssr = 1981:1990,\n", " unit.names.variable = \"country\",\n", " time.plot = 1960:2003\n", " )\n", "\n", " placebo_space_training_model <- synth(data.prep.obj=placebo_space_training_data)\n", " \n", "\n", "placebo_space_main_data <-\n", " dataprep(\n", " foo = oecd_data,\n", " predictors = c(\"gdp\",\"trade\",\"infrate\"),\n", " dependent = \"gdp\",\n", " unit.variable = 1,\n", " time.variable = 3,\n", " special.predictors = list(\n", " list(\"industry\" ,1981:1990, c(\"mean\")),\n", " list(\"schooling\",c(1980,1985), c(\"mean\")),\n", " list(\"invest80\" ,1980, c(\"mean\"))\n", " ),\n", " treatment.identifier = k,\n", " controls.identifier = country_index[-which(country_index==k)],\n", " time.predictors.prior = 1981:1990,\n", " time.optimize.ssr = 1960:1989,\n", " unit.names.variable = 2,\n", " time.plot = 1960:2003\n", " )\n", "\n", "\n", "placebo_space_main_model <- synth(\n", " data.prep.obj=placebo_space_main_data,\n", " custom.v=as.numeric(placebo_space_training_model$solution.v) #cross-validation\n", " )\n", "\n", " storegaps[,i] <- \n", " placebo_space_main_data$Y1-\n", " (placebo_space_main_data$Y0%*%placebo_space_main_model$solution.w)\n", " i <- i + 1\n", "} # close loop over control units" ], "id": "4fbf63c2-b424-4f4c-8262-69f3cb6357ef" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "oecd_data <- oecd_data[order(oecd_data$index,oecd_data$year),] #sorting our primary df\n", "colnames(storegaps) <- unique(oecd_data$country)[-7] #filling columns with donor group names\n", "\n", "storegaps <- cbind(gap,storegaps) #adding & then naming a column for West Germany to the df\n", "colnames(storegaps)[1] <- c(\"West Germany\")" ], "id": "3165df77-622f-4890-b256-e44b93818fbc" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compute ratio of post-reunification RMSPE to pre-reunification RMSPE \n", "rmspe <- function(x){sqrt(mean(x^2))} #function to calculate RMSPE\n", "pre_treat <- apply(storegaps[1:30,],2,rmspe)\n", "post_treat <- apply(storegaps[31:44,],2,rmspe)\n", "\n", 
"dotchart(sort(post_treat/pre_treat),\n", " xlab=\"Post-Period RMSE / Pre-Period RMSE\",\n", " pch=19)" ], "id": "bf0aa5d8-1436-4682-8ce0-e33da7e041f4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the ratio of post-treatment to pre-treatment RMSPE is\n", "quite high for West Germany, and significantly larger than for all other\n", "countries, which is another good indication that the reunification had a\n", "large effect. With our earlier definition of $p = \\frac{RANK}{TOTAL}$ we\n", "can now calculate that $p = 1/17 \\approx 0.059$. This $p$-value *not*\n", "how unlikely it would be to find this result under the null hypothesis -\n", "it answers the more subtle question of “if one were to pick a country at\n", "random from the sample, what are the chances of obtaining a ratio as\n", "high as this one?”\n", "\n", "Relatedly, we can also look at the distribution of the gaps between the\n", "actual and synthetic versions of each country in the donor pool. This is\n", "a way to see how much of an outlier our actually treated unit is from\n", "the placebo treated units.[1] This is also known as building a\n", "*distribution* of the placebo effects.\n", "\n", "[1] Recall: it’s a best fit in the pre-treatment period so some amount\n", "of gap is to be expected. However, often you will have a handful of\n", "donor units whose synthetic versions of themselves are a terrible fit -\n", "usually because they’re very unusual in their pre-treatment\n", "characteristics, which means no combination of samples from other units\n", "in the pool can reproduce the pre-treatment trends. In those cases, it\n", "is common to drop those observations from your in-place placebo\n", "distribution graph." ], "id": "ba7c40c1-d6d5-48a5-b3dc-b34315d7a2a4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Placebo Effect Distribution\n", "Cex.set <- .75\n", "plot(1960:2003,gap,\n", " ylim=c(-4500,4500),xlab=\"year\",\n", " xlim=c(1960,2003),ylab=\"Gap in real GDPpc\",\n", " type=\"l\",lwd=2,col=\"black\",\n", " xaxs=\"i\",yaxs=\"i\")\n", "\n", "# Add lines for control states\n", "for (i in 2:ncol(storegaps)) { lines(1960:2003,storegaps[1:nrow(storegaps),i],col=\"gray\") }\n", "\n", "\n", "# Add grid\n", "abline(v=1990,lty=\"dotted\",lwd=2)\n", "abline(h=0,lty=\"dashed\",lwd=2)\n", "legend(\"topleft\",legend=c(\"West Germany\",\"control regions\"),\n", "lty=c(1,1),col=c(\"black\",\"gray\"),lwd=c(2,1),cex=.8)\n", "arrows(1987,-2500,1989,-2500,col=\"black\",length=.1)\n", "text(1983.5,-2500,\"Reunification\",cex=Cex.set)\n", "abline(v=1960)\n", "abline(v=2003)\n", "abline(h=-2)\n", "abline(h=2)" ], "id": "c73a8972-51e3-4ed7-8249-d4fc6583cbde" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Excluding Extreme MSPE Values**\n", ">\n", "> As noted earlier, some papers make a point of excluding countries\n", "> whose pre-treatment MSPE is substantially larger than the treated\n", "> unit. This is a problem in the context of deriving a placebo\n", "> distribution because - by definition - such units’ pre-treatment\n", "> trends cannot be adequately modeled. However, this is not an issue\n", "> when taking ratios of post-treatment to pre-treatment MSPE because our\n", "> inability to model these units is generally symmetric across both\n", "> periods.\n", ">\n", "> As a rule of thumb, a conservative cut-off is 2x the treated unit’s\n", "> MSPE, a moderate cut-off is 5x the treated unit, and a lenient cut-off\n", "> is 20x the treated unit. 
"> (Abadie et al., 2010).\n", ">\n", "> Here is a piece of code that will exclude countries whose\n", "> pre-treatment MSPE is more than 20 times the pre-treatment MSPE of\n", "> West Germany.\n", ">\n", "> ``` r\n", "> mspe <- function(x){(mean(x^2))} #function to calculate MSPE\n", "> outliers <- apply(storegaps[1:30,],2,mspe) > 20*mspe(storegaps[1:30,][,1])\n", "> filtered_storegaps <- storegaps[, !outliers]\n", "> print(outliers)\n", "> ```\n", ">\n", "> I encourage you to experiment with this code and see how it changes\n", "> the placebo effect distribution graph.\n", "\n", "### Robustness Testing: Leave-one-out\n", "\n", "The next step in placebo studies is to do a leave-one-out test. This is\n", "a form of robustness check where we iteratively remove one country from\n", "the control group (starting with the least important) and re-run the\n", "model. This will tell us something about how sensitive our synthetic\n", "West Germany is to the idiosyncratic features of any particular country\n", "within the control group." ], "id": "cb953434-b9d2-4f3b-ad2f-d676381a47c3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#refresh ourselves on which countries have a positive weight in synthetic West Germany\n", "synth.tables$tab.w" ], "id": "aabc35ba-345c-4644-a5a9-1cbf3e68d711" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In decreasing order of importance we have: Austria, the USA, Japan,\n", "Switzerland, and the Netherlands." ], "id": "47b7df05-0213-499a-8a05-75b1ab5c40ea" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Leave-one-out distribution of the synthetic control for West Germany\n", "\n", "# loop over leave-one-outs\n", "#NB: this overwrites the `storegaps` matrix from the in-space placebos;\n", "#here each column will hold a leave-one-out synthetic West Germany series\n", "storegaps <- \n", "  matrix(NA,\n", "         length(1960:2003),\n", "         5)\n", "colnames(storegaps) <- c(1,3,9,11,12) #index values for countries with positive weight\n", "country <- unique(oecd_data$index)[-7]\n", "\n", "for(k in 1:5){\n", "\n", "# data prep for training model\n", "omit <- c(1,3,9,11,12)[k] \n", "  robustness_training_data <-\n", "    dataprep(\n", "      foo = oecd_data,\n", "      predictors = c(\"gdp\",\"trade\",\"infrate\"),\n", "      dependent = \"gdp\",\n", "      unit.variable = 1,\n", "      time.variable = 3,\n", "      special.predictors = list(\n", "        list(\"industry\",1971:1980, c(\"mean\")),\n", "        list(\"schooling\" ,c(1970,1975), c(\"mean\")),\n", "        list(\"invest70\" ,1980, c(\"mean\"))\n", "      ),\n", "      treatment.identifier = 7,\n", "      controls.identifier = country[-which(country==omit)],\n", "      time.predictors.prior = 1971:1980,\n", "      time.optimize.ssr = 1981:1990,\n", "      unit.names.variable = 2,\n", "      time.plot = 1960:2003\n", "    )\n", "  \n", "  # fit training model\n", "  robustness_training_model <- synth(\n", "    data.prep.obj=robustness_training_data)\n", "  \n", "# data prep for main model\n", "robustness_main_data <-\n", "  dataprep(\n", "    foo = oecd_data,\n", "    predictors = c(\"gdp\",\"trade\",\"infrate\"),\n", "    dependent = \"gdp\",\n", "    unit.variable = 1,\n", "    time.variable = 3,\n", "    special.predictors = list(\n", "      list(\"industry\" ,1981:1990, c(\"mean\")),\n", "      list(\"schooling\",c(1980,1985), c(\"mean\")),\n", "      list(\"invest80\" ,1980, c(\"mean\"))\n", "    ),\n", "    treatment.identifier = 7,\n", "    controls.identifier = country[-which(country==omit)],\n", "    time.predictors.prior = 1981:1990,\n", "    time.optimize.ssr = 1960:1989,\n", "    unit.names.variable = 2,\n", "    time.plot = 1960:2003\n", "  )\n", "  \n", "  # fit main model \n", "  robustness_main_model <- synth(\n",
"    data.prep.obj=robustness_main_data,\n", "    custom.v=as.numeric(robustness_training_model$solution.v)\n", "    )\n", "  storegaps[,k] <- (robustness_main_data$Y0%*%robustness_main_model$solution.w) #store the leave-one-out synthetic series\n", "} # close loop over leave-one-outs" ], "id": "c3dc0825-8456-4499-a11a-5f27d8c5bad7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Text.height <- 23000\n", "Cex.set <- .8\n", "plot(1960:2003,robustness_main_data$Y1plot,\n", "     type=\"l\",ylim=c(0,33000),col=\"black\",lty=\"solid\",\n", "     ylab =\"per-capita GDP (PPP, 2002 USD)\",\n", "     xlab =\"year\",\n", "     xaxs = \"i\", yaxs = \"i\",lwd=2\n", "     )\n", "\n", "abline(v=1990,lty=\"dotted\")\n", "arrows(1987,23000,1989,23000,col=\"black\",length=.1)\n", "for(i in 1:5){\n", "  lines(1960:2003,storegaps[,i],col=\"darkgrey\",lty=\"solid\")\n", "  }\n", "lines(1960:2003,synthY0,col=\"black\",lty=\"dashed\",lwd=2)\n", "lines(1960:2003,robustness_main_data$Y1plot,col=\"black\",lty=\"solid\",lwd=2)\n", "text(1982.5,23000,\"reunification\",cex=.8)\n", "legend(x=\"bottomright\",\n", "       legend=c(\"West Germany\",\n", "                \"synthetic West Germany\",\n", "                \"synthetic West Germany (leave-one-out)\")\n", "       ,lty=c(\"solid\",\"dashed\",\"solid\"),\n", "       col=c(\"black\",\"black\",\"darkgrey\")\n", "       ,cex=.8,bg=\"white\",lwd=c(2,2,1))" ], "id": "2bcf5f65-221a-4443-b561-80373fc4587c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this tutorial, we’ve walked through the process of the synthetic\n", "control method [using Abadie et al. (2015)’s excellent\n", "paper](https://doi.org/10.1111/ajps.12116) on German reunification as a\n", "template. If you want to dig into the replication package some more, you\n", "can find it [here](https://doi.org/10.7910/DVN/24714).\n", "\n", "In the process of working through this paper we’ve seen how to prepare\n", "the data, optimize weights, and cross-validate the model. We’ve also\n", "discussed how to assess the statistical significance of the estimated\n", "effect using placebo studies and leave-one-out tests.\n", "\n", "To recap, the synthetic control method is a powerful tool for estimating\n", "the effects of policy interventions when traditional methods are not\n", "feasible. It allows us to construct a counterfactual scenario by\n", "combining information from multiple control units, and to estimate the\n", "treatment effect by comparing the treated unit to its synthetic\n", "counterpart.\n", "\n", "I hope this tutorial has been helpful in understanding the synthetic\n", "control method and giving you the confidence to try it out in your\n", "495/499 research paper.\n", "\n", "> **Tips & Tricks**\n", ">\n", "> As I mentioned earlier, much of the code gets overcomplicated by the\n", "> process of cross-validation. You don’t actually *need* to\n", "> cross-validate the model with a `training_data`, a `training_model`, a\n", "> `main_data`, and a `main_model` each time. You *can* just run the\n", "> model on the full pre-treatment period.\n", ">\n", "> I chose to use cross-validation because\n", ">\n", "> 1. It is a good way to make sure you’re not overfitting the data\n", ">    (which is a real risk in synthetic control studies), and\n", ">\n", "> 2. 
Without a tutorial on how cross-validation works and what it looks\n", ">    like, it is quite difficult both to intuit how to do it yourself\n", ">    and to read/understand other people’s code when *they* are doing\n", ">    it.\n", ">\n", "> The downside is that it makes the creation and display of graphs and\n", "> tables significantly more complicated. So, let me give a quick\n", "> rundown on how you could simplify the code.\n", ">\n", "> ``` r\n", "> #prepare the data. The primary difference here is that there's only one block and the years span the entire pre-treatment period \n", "> dataprep_out <-\n", ">   dataprep(\n", ">     foo = oecd_data,\n", ">     predictors = c(\"gdp\",\"trade\",\"infrate\"),\n", ">     dependent = \"gdp\",\n", ">     unit.variable = \"index\",\n", ">     unit.names.variable = \"country\",\n", ">     time.variable = \"year\",\n", ">     special.predictors = list(\n", ">       list(\"industry\" ,1971:1990, c(\"mean\")),\n", ">       list(\"schooling\",c(1970,1985), c(\"mean\")),\n", ">       list(\"invest80\" ,1980, c(\"mean\"))\n", ">     ),\n", ">     treatment.identifier = 7,\n", ">     controls.identifier = c(1:6,8:17),\n", ">     time.predictors.prior = 1960:1990,\n", ">     time.optimize.ssr = 1960:1989,\n", ">     time.plot = 1960:2003\n", ">   )\n", ">\n", "> synth_out <- synth(data.prep.obj=dataprep_out)\n", ">\n", ">\n", "> #plot the results\n", "> path.plot(synth_out, dataprep_out, Ylab = \"per-capita GDP (PPP, 2002 USD)\", Xlab = \"year\", Main = NA)\n", ">\n", "> gaps.plot(synth_out, dataprep_out, Ylab = \"Gap in real GDPpc\", Xlab = \"year\", Ylim = c(-4500,4500), Main = NA)\n", ">\n", "> #placebo studies\n", "> placebos <- generate.placebos(dataprep_out, synth_out, Sigf.ipop = 3)\n", ">\n", "> plot_placebos(placebos)\n", ">\n", "> mspe.plot(placebos)\n", "> ```\n", "\n", "## Further reading\n", "\n", "- Abadie, A., & Gardeazabal, J. (2003). The Economic Costs of\n", "  Conflict: A Case Study of the Basque Country. American Economic\n", "  Review, 93(1), 113–132. https://doi.org/10.1257/000282803321455188\n", "\n", "- Abadie, A. (2021). Using Synthetic Controls: Feasibility, Data\n", "  Requirements, and Methodological Aspects. Journal of Economic\n", "  Literature, 59(2), 391–425. https://doi.org/10.1257/jel.20191450\n", "\n", "- Abadie, A., Diamond, A., & Hainmueller, J. (2011). Synth: An R\n", "  Package for Synthetic Control Methods in Comparative Case Studies.\n", "  Journal of Statistical Software, 42(13), 1–17.\n", "  https://doi.org/10.18637/jss.v042.i13" ], "id": "0a4dc899-cea6-43d5-9458-38c01cbc002c" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }