{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 16 - Differences-in-Differences Analysis\n",
        "\n",
        "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt  \n",
        "2024-05-29\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "1.  Run OLS regressions.\n",
        "2.  Run panel data regressions.\n",
        "\n",
        "## Learning Outcomes\n",
        "\n",
        "1.  Understand the parallel trends (PT) assumption.\n",
        "2.  Run the according OLS regression that retrieves the causal estimand.\n",
        "3.  Implement these regressions in the two-period case and in multiple\n",
        "    time periods (a.k.a event studies).\n",
        "4.  Conduct a test on the plausibility of the PT whenever there are more\n",
        "    than 1 pre-treatment periods.\n",
        "\n",
        "## 16.1 Difference-in-differences\n",
        "\n",
        "Difference-in-differences (diff-in-diff) is a **research design** used\n",
        "to estimate the causal impact of a treatment by comparing the changes in\n",
        "outcomes over time between a treated group and an untreated (or control)\n",
        "group. By comparing changes in outcomes over time, it relies on the use\n",
        "of multiple (at least two) time periods. Therefore, there is a link\n",
        "between diff-in-diff designs and panel data. Every time we want to use a\n",
        "diff-in-diff design, we will always have to make sure that we have panel\n",
        "data.\n",
        "\n",
        "Why are panel datasets crucial in diff-in-diff research designs? The\n",
        "idea is that panel data allows us to control for heterogeneity that is\n",
        "both unobserved and time invariant.\n",
        "\n",
        "Consider the following example. Earnings $y_{it}$ of worker $i$ at time\n",
        "$t$ can be split into two components:\n",
        "\n",
        "$$\n",
        "y_{it} = e_{it} + \\alpha_{i}\n",
        "$$\n",
        "\n",
        "where $\\alpha_i$ is a measure of worker quality and $e_{it}$ are the\n",
        "part of earnings not explained by $\\alpha_i$. This says that a bad\n",
        "quality worker (low $\\alpha_i$) will receive lower earnings *at any time\n",
        "period*, since $\\alpha_i$ is time invariant. Notice that worker quality\n",
        "is typically unobserved and is usually part of our error term, which\n",
        "should not be correlated with treatment. In many cases though, this\n",
        "invariant heterogeneity (in our case, worker quality) is the cause of\n",
        "endogeneity bias. In this example, it can be that workers who attend a\n",
        "training program also tend to be the ones that perform poorly at their\n",
        "job and *select* into this program.\n",
        "\n",
        "However, notice that if we take time differences, we get rid of this\n",
        "heterogeneity. Suppose we subtract earnings at time $1$ from earnings at\n",
        "time $0$, thus obtaining:\n",
        "\n",
        "$$\n",
        "y_{i1} - y_{i0} =   e_{i1} - e_{i0}\n",
        "$$\n",
        "\n",
        "where our new equation no longer depends on $\\alpha_i$! However, see how\n",
        "we are now measuring $y_{i1} - y_{i0}$ instead of $y_{it}$? Our model\n",
        "now has *changes* rather than levels. This is going to be the trick used\n",
        "implicitly throughout this module.\n",
        "\n",
        "For this module, we will keep working on our fake data set. Recall that\n",
        "this data is simulating information of workers in the years 1982-2012 in\n",
        "a fake country where a training program was introduced in 2003 to boost\n",
        "their earnings.\n",
        "\n",
        "Let’s start by loading our data and letting Stata know that it is panel\n",
        "data with panel variable *workerid* and time variable *year*. We’ve seen\n",
        "how to do this in [Module\n",
        "15](https://comet.arts.ubc.ca/docs/Research/econ490-stata/15_Panel_Data.html).\n",
        "\n",
        "``` {stata}\n",
        "* Load the data\n",
        "clear* \n",
        "*cd \"\"\n",
        "use fake_data, clear \n",
        "\n",
        "* Set as panel data\n",
        "xtset workerid year, yearly\n",
        "```\n",
        "\n",
        "## 16.2 Parallel Trends Assumption\n",
        "\n",
        "When using a diff-in-diff design, we first need to make sure our data\n",
        "has a binary treatment variable which takes the value 1 when our unit of\n",
        "observation is treated and 0 otherwise. In the example above, let’s\n",
        "denote such a binary treatment variable as $D_i$. It takes value 1 if a\n",
        "worker $i$ is enrolled in the training program at some point in time.\n",
        "\n",
        "In our fake data set, the binary treatment variable already exists and\n",
        "is called *treated*. Let’s check that it takes values 0 or 1.\n",
        "\n",
        "``` {stata}\n",
        "describe, full\n",
        "\n",
        "summarize treated, detail\n",
        "```\n",
        "\n",
        "The aim of diff-in-diff analysis is to estimate the causal impact of a\n",
        "treatment by comparing the changes in outcomes over time between a\n",
        "treated group and an untreated group.\n",
        "\n",
        "A crucial assumption needed to claim causal impact is that, *in the\n",
        "absence of treatment*, the treatment and control groups would follow\n",
        "similar trends over time. This assumption is called **parallel trends\n",
        "assumption**. Whenever we adopt a diff-in-diff design in our research,\n",
        "the first thing we need to check is that this assumption is satisfied.\n",
        "\n",
        "How do we do that?\n",
        "\n",
        "A common approach to check for parallel trends is to plot the mean\n",
        "outcome for both the treated and untreated group over time.\n",
        "\n",
        "Do you recall how to make these plots from [Module\n",
        "9](https://comet.arts.ubc.ca/docs/Research/econ490-stata/09_Stata_Graphs.html)?\n",
        "We start by generating the average log-earnings for each group in each\n",
        "year.\n",
        "\n",
        "``` {stata}\n",
        "* Generate log-earnings\n",
        "generate logearn = log(earnings)\n",
        "\n",
        "* Take the average by group and year\n",
        "bysort year treated: egen meanearn = mean(logearn)\n",
        "```\n",
        "\n",
        "Next, we plot the trend of average earnings by each group. It is common\n",
        "practice to add a vertical line in the period just before the treatment\n",
        "is assigned. In our case, that would be year 2002. The idea is that the\n",
        "treated workers receive the treatment between years 2002 and 2003.\n",
        "\n",
        "``` {stata}\n",
        "* Make graph\n",
        "twoway (line meanearn year if treated == 1, lcolor(gs12) lpattern(solid)) || ///\n",
        "    (line meanearn year if treated == 0, lcolor(gs6) lpattern(dash)), ///\n",
        "    graphregion(color(white))                     ///\n",
        "    legend(label(1 \"Treated\") label(2 \"Control\")) ///\n",
        "    ytitle(\"Average earnings\") xtitle(\"Year\")     ///\n",
        "    xline(2002, lpattern(dash) lcolor(black))\n",
        "graph export graph1.jpg, as(jpg) replace\n",
        "```\n",
        "\n",
        "Remember that we care about the two variables having similar trends\n",
        "*before* the year of the treatment. By looking at the graph, it seems\n",
        "that the average earnings of the two groups had similar trends up until\n",
        "year 2002, just before the treatment. This makes us confident that the\n",
        "parallel trends assumption is satisfied.\n",
        "\n",
        "This test for parallel trends assumption is very rudimentary, but\n",
        "perfectly fine for the early stage of our research project. In the next\n",
        "sections, we will see how to estimate the diff-in-diff design, and there\n",
        "we will see a more formal test for the parallel trends assumption.\n",
        "\n",
        "## 16.3 Difference-in-Differences and Regression\n",
        "\n",
        "Whenever we talk about diff-in-diff, we refer to a research design that\n",
        "relies on some version of the parallel trends assumption. To connect\n",
        "this design to regressions, we need to first build a model. To begin, we\n",
        "will assume a case where no control variables are involved.\n",
        "\n",
        "For simplicity, suppose there are only two periods: a period $t=0$ when\n",
        "no one is treated, and a period $t=1$ when some workers receive the\n",
        "treatment.\n",
        "\n",
        "We would then rely on a linear model of the form:\n",
        "\n",
        "$$\n",
        "y_{it} = \\beta D_i \\mathbf{1}\\{t=1\\}  +  \\lambda_t + \\alpha_i + e_{it} \\tag{1}\n",
        "$$\n",
        "\n",
        "where $y_{it}$ is earnings while $\\lambda_t$ and $\\alpha_i$ are year and\n",
        "worker fixed-effects.\n",
        "\n",
        "The key element in this linear model is the interaction between $D_i$\n",
        "and $\\mathbf{1}\\{t=1\\}$.\n",
        "\n",
        "Recall that $D_i$ is a dummy variable taking value 1 if worker $i$\n",
        "receives the treatment at any point in time, and $\\mathbf{1}\\{t=1\\}$ is\n",
        "an indicator function taking value 1 when $t=1$.\n",
        "\n",
        "Therefore, the interaction term $D_i \\mathbf{1}\\{t=1\\}$ will take value\n",
        "1 for treated workers only when the year is $t=1$, or when the treated\n",
        "workers are treated.\n",
        "\n",
        "The parameter $\\beta$ provides the average treatment effect (on the\n",
        "treated) at period $t=1$ (i.e. we get the effect for those with $D_i=1$\n",
        "at $t=1$). It is the average impact of the treatment on those workers\n",
        "who actually received the treatment. $\\beta$ states by how much the\n",
        "average earnings of treated individuals would have changed if they had\n",
        "not received the treatment.\n",
        "\n",
        "Let’s see how we can estimate this linear diff-in-diff model!\n",
        "\n",
        "Recall that we have information of workers in the years 1982-2012 and\n",
        "the training program (the treatment) was introduced in 2003. We’ll keep\n",
        "one year prior and one year after the program, to keep things consistent\n",
        "with the previous section. Specifically, we can think of year 2002 as\n",
        "$t=0$ and year 2003 as $t=1$.\n",
        "\n",
        "``` {stata}\n",
        "keep if year==2002 | year==2003\n",
        "```\n",
        "\n",
        "Notice that the diff-in-diff linear model in Equation (1) can be seen as\n",
        "a specific case of a linear model with many fixed-effects. We can use\n",
        "the command `reghdfe` and the option `absorb()` to run this type of\n",
        "regression, which we saw in [Module\n",
        "13](https://comet.arts.ubc.ca/docs/Research/econ490-stata/13_Dummy.html).\n",
        "We can also use the command `areg` alongside the option `absorb()` which\n",
        "has the same syntax. In either case, don’t forget to list the\n",
        "fixed-effects in `absorb()` to avoid seeing them in the regression\n",
        "output!\n",
        "\n",
        "Recall that we can create fixed-effects with the `i.` operator and\n",
        "interactions with the `#` operator.\n",
        "\n",
        "``` {stata}\n",
        "areg logearn treated#2003.year i.year, absorb(workerid)\n",
        "```\n",
        "\n",
        "This says that, *on average*, workers who entered the program received\n",
        "18 percentage points more earnings relative to a counterfactual scenario\n",
        "where they never entered the program (which in this case is captured by\n",
        "the control units). How did we get this interpretation? Recall that OLS\n",
        "estimates are interpreted as a 1 unit increase in the independent\n",
        "variable: a 1 unit increase of $D_i \\mathbf{1}\\{t=1\\}$ corresponds to\n",
        "those who started receiving treatment at $t=1$. Furthermore, the\n",
        "dependent variable is in log scale, so a 0.18 increase corresponds to a\n",
        "18 percentage point increase in earnings.\n",
        "\n",
        "### 16.3.1 Adding Covariates\n",
        "\n",
        "The first thing to notice is that our regression specification in\n",
        "Equation (1) involves worker fixed-effects $\\alpha_i$. This means that\n",
        "every worker characteristic that is fixed over time (for example, sex at\n",
        "birth) will be absorbed by the fixed-effects $\\alpha_i$. Therefore, if\n",
        "we added characteristics such as sex and race as covariates, those would\n",
        "be omitted from the regression due to perfect collinearity.\n",
        "\n",
        "This means that we can add covariates to the extent that they are time\n",
        "varying by nature (e.g. tenure, experience), or are trends based on\n",
        "fixed characteristics (e.g. time dummies interacted with sex). We refer\n",
        "to the latter as covariate-specific trends.\n",
        "\n",
        "Algebraically, we obtain a specification that is very similar to\n",
        "Equation (1): $$\n",
        "y_{it} = \\beta D_i \\mathbf{1}\\{t=1\\}  + \\gamma X_{it} +  \\lambda_t + \\alpha_i + e_{it} \\tag{2}\n",
        "$$\n",
        "\n",
        "where $X_{it}$ is a time-varying characteristic of worker $i$ and time\n",
        "$t$.\n",
        "\n",
        "## 16.4 Multiple Time Periods\n",
        "\n",
        "In keeping only the years 2002 and 2003, we have excluded substantial\n",
        "information from our analysis. We may want to keep our data set at its\n",
        "original state, with all its years.\n",
        "\n",
        "A very natural approach to extending this to multiple time periods is to\n",
        "attempt to get the average effect across all post-treatment time\n",
        "periods. For example, it may be that the effects of the training program\n",
        "decay over time, but we are interested in the average effect. We may\n",
        "think of maintaining the parallel trends assumption in a model like\n",
        "this:\n",
        "\n",
        "$$\n",
        "y_{it} = \\beta D_i \\mathbf{1}\\{t\\geq 1\\}  + \\lambda_t + \\alpha_i + e_{it} \\tag{3}\n",
        "$$\n",
        "\n",
        "where the $\\beta$ corresponds now to all time periods after the year in\n",
        "which treatment was applied: $t\\geq 1$. Some people rename\n",
        "$D_i \\mathbf{1}\\{t\\geq 1\\}$ to $D_{it}$, where $D_{it}$ is simply a\n",
        "variable that takes 0 before any treatment and 1 for those who are being\n",
        "treated at that particular time $t$. This is known as the *Two-Way\n",
        "Fixed-Effects (TWFE) Model* . It receives this name because we are\n",
        "including unit fixed-effects, time fixed-effects, and our treatment\n",
        "status.\n",
        "\n",
        "Let’s load our fake data set again and estimate a TWFE model\n",
        "step-by-step.\n",
        "\n",
        "``` {stata}\n",
        "* Load data\n",
        "clear* \n",
        "use fake_data, clear \n",
        "\n",
        "* Generate log-earnings\n",
        "generate logearn = log(earnings)\n",
        "```\n",
        "\n",
        "Remember that now we need to create $\\mathbf{1}\\{t\\geq 1\\}$, a dummy\n",
        "equal to 1 for all years following the year in which the treatment was\n",
        "administered. In our example, we need to create a dummy variable taking\n",
        "value 1 for all years greater than or equal to 2003.\n",
        "\n",
        "``` {stata}\n",
        "generate post2003 = year>=2003\n",
        "```\n",
        "\n",
        "We can again use `areg` or `reghdfe` to estimate Equation (3), but\n",
        "remember to use the new *post2003* dummy variable.\n",
        "\n",
        "``` {stata}\n",
        "areg logearn 1.treated#1.post2003 i.year, absorb(workerid)\n",
        "```\n",
        "\n",
        "The results say that a 1 unit increase in $D_i \\mathbf{1}\\{t\\geq 1\\}$\n",
        "corresponds to a 0.07 increase in log-earnings *on average*. That 1 unit\n",
        "increase only occurs for those who start receiving treatment in 2003.\n",
        "Given that the outcome is in a log scale, we interpret these results in\n",
        "percentage points. Therefore, the coefficient of interest says that\n",
        "those who started treatment in 2003 received, on average, a 7 percentage\n",
        "point increase in earnings.\n",
        "\n",
        "In this fake data set, everyone either starts treatment at year 2003 or\n",
        "does not enter the program at all. However, when there is variation in\n",
        "the timing of the treatment (i.e. people entering the training program\n",
        "earlier than others), a regression using this model may fail to capture\n",
        "the true parameter of interest. For a reference, see this\n",
        "[paper](https://www.sciencedirect.com/science/article/abs/pii/S0304407621001445).\n",
        "\n",
        "## 16.5 Event Studies\n",
        "\n",
        "The natural extension of the previous section, which is the standard\n",
        "approach today, is to estimate different treatment effects depending on\n",
        "the time period.\n",
        "\n",
        "It may be possible that the effect of the treatment fades over time: it\n",
        "was large right after the training program was received, but then\n",
        "decreased over time.\n",
        "\n",
        "To capture the evolution of treatment effects over time, we may want to\n",
        "compute treatment effects at different lags after the program was\n",
        "received: 1 year after, 2 years after, etc.\n",
        "\n",
        "Similarly, we may want to compute “treatment effects” at different years\n",
        "*prior* the program.\n",
        "\n",
        "This is a very powerful tool because it allows us to more formally test\n",
        "whether the parallel trends assumption holds or not: if there are\n",
        "treatment effects prior to receiving the treatment, then the treatment\n",
        "and control groups were likely not having the same trend before\n",
        "receiving the treatment. This is often known as a pre-trends test.\n",
        "\n",
        "A linear model where we test for different treatment effects in\n",
        "different years is usually called an *event study*.\n",
        "\n",
        "Essentially, we extend the diff-in-diff linear model to the following\n",
        "equation:\n",
        "\n",
        "$$\n",
        "y_{it} = \\sum_{k=-T,k\\neq-1}^T \\beta_k \\mathbf{1}\\{K_{it} = k\\}  + \\lambda_t + \\alpha_i + e_{it} \\tag{4}\n",
        "$$\n",
        "\n",
        "where $K_{it}$ are event time dummies (i.e. whether person $i$ is\n",
        "observed at event time $k$ in time $t$). These are essentially dummies\n",
        "for each year until and each year since the event, or “time to” and\n",
        "“time from” dummies. For example, there will be a dummy indicating that\n",
        "a treated individual is one year away from being treated, two years away\n",
        "from being treated, etc. Notice that, for workers who never enter\n",
        "treatment, it is as if the event time is $\\infty$: they are an infinite\n",
        "amount of years away from receiving the treatment. Due to\n",
        "multicollinearity, we need to omit one category of event time dummies\n",
        "$k$. The typical choice is $k=-1$ (one year prior to treatment), which\n",
        "will serve as our reference group. This means that we are comparing\n",
        "changes relative to event time -1.\n",
        "\n",
        "How do we estimate Equation (4) in practice?\n",
        "\n",
        "We begin by constructing a variable that identifies the time relative to\n",
        "the event. For instance, if a person enters the training program in\n",
        "2003, the observation corresponding to 2002 is time -1 relative to the\n",
        "event, the observation corresponding to 2003 is time 0 relative to the\n",
        "event, and so on. We call this variable *event_time* and we compute it\n",
        "as the difference between the current year and the year in which the\n",
        "treatment was received (stored in variable *time_entering_treatment*).\n",
        "\n",
        "In this fake data set, everyone enters the program in 2003, so it is\n",
        "very easy to construct the event time. If this is not the case, we need\n",
        "to make sure that we have a variable which states the year in which each\n",
        "person receives their treatment.\n",
        "\n",
        "``` {stata}\n",
        "* Load data\n",
        "clear* \n",
        "use fake_data, clear \n",
        "\n",
        "* Generate log-earnings\n",
        "generate logearn = log(earnings)\n",
        "\n",
        "* Generate a variable for year in which treatment was received\n",
        "capture drop time_entering_treatment \n",
        "generate time_entering_treatment = 2003 if treated==1 \n",
        "replace time_entering_treatment = . if treated==0\n",
        "\n",
        "* Generate a variable for time relative to the event\n",
        "capture drop event_time\n",
        "generate event_time = year - time_entering_treatment\n",
        "```\n",
        "\n",
        "To make sure we have created *event_time* properly, let’s see which\n",
        "values it takes.\n",
        "\n",
        "``` {stata}\n",
        "tabulate event_time , missing\n",
        "```\n",
        "\n",
        "Notice that all untreated workers have a missing value for the variable\n",
        "*event_time*. We want to include untreated workers in the reference\n",
        "category $k=-1$. Recall that we are still trying to understand the\n",
        "effect of being treated compared to the reference group, those that are\n",
        "untreated. Therefore, we code untreated units as if they always belonged\n",
        "to event time -1.\n",
        "\n",
        "``` {stata}\n",
        "replace event_time = -1 if treated==0\n",
        "```\n",
        "\n",
        "We then decide which *window* of time around the treatment we want to\n",
        "focus on (the $T$’s in Equation (4)). For instance, we may want to focus\n",
        "on 2 years prior to the treatment and 2 years after the treatment, and\n",
        "estimate those treatment effects. Our choice should depend on the amount\n",
        "of information we have in each year. In this case, notice that the\n",
        "number of workers 8 years after treatment is substantially lower than\n",
        "the number of workers 8 years before treatment is started.\n",
        "\n",
        "We could drop all observations before $k=-2$ and after $k=2$. This would\n",
        "once again reduce the amount of information we have in our dataset.\n",
        "\n",
        "An alternative approach, called *binning* the window around treatment,\n",
        "is usually preferred. It works by pretending that treated workers who\n",
        "are observed before *event_time* -2 were actually observed in\n",
        "*event_time* -2 and treated workers who are observed after *event_time*\n",
        "2 were actually observed in *event_time* 2.\n",
        "\n",
        "``` {stata}\n",
        "replace event_time = -2 if event_time<-2 & treated==1\n",
        "replace event_time = 2 if event_time>2 & treated==1\n",
        "```\n",
        "\n",
        "Notice how these steps have modified the values of variable\n",
        "*event_time*:\n",
        "\n",
        "``` {stata}\n",
        "tabulate event_time\n",
        "```\n",
        "\n",
        "The next step is to generate a dummy variable for each value of\n",
        "*event_time*.\n",
        "\n",
        "``` {stata}\n",
        "tabulate event_time, gen(event_time_dummy)\n",
        "```\n",
        "\n",
        "Notice that *event_time_dummy2* is the one that corresponds to\n",
        "*event_time* -1.\n",
        "\n",
        "Once again, Equation (4) is nothing but a linear model with many\n",
        "fixed-effects. We can again use either command `areg` or `reghdfe`.\n",
        "\n",
        "This time, we must include dummy variables for the different values of\n",
        "*event_time*, with the exception of the dummy variable for the baseline\n",
        "event time $k=-1$: *event_time_dummy2*.\n",
        "\n",
        "``` {stata}\n",
        "areg logearn event_time_dummy1 event_time_dummy3 event_time_dummy4 event_time_dummy5 i.year , absorb(workerid) // do you recall how we included worker and year fixed-effects?\n",
        "```\n",
        "\n",
        "Again, the interpretation is the same as before, only now we have\n",
        "dynamic effects. The coefficient on the *event_time1* dummy says that 2\n",
        "years prior to entering treatment, treated units experienced a 0.4\n",
        "percentage point increase in earnings relative to control units.\n",
        "\n",
        "Should we worry that we are finding a difference between treated and\n",
        "control units prior to the policy? Notice that the effect of the policy\n",
        "at event time -2 (*event_time_dummy1*, when there was no training\n",
        "program) is not statistically different than zero.\n",
        "\n",
        "This confirms that our parallel trends assumption is supported by the\n",
        "data. In other words, there are no observable differences in trends\n",
        "prior to the enactment of the training program. Checking the p-value of\n",
        "those coefficients prior to the treatment is called the **pre-trend\n",
        "test** and does not require any fancy work. A mere look at the\n",
        "regression results suffices!\n",
        "\n",
        "Furthermore, we can observe how the policy effect evolves over time. At\n",
        "the year of entering the training program, earnings are boosted by 20\n",
        "percentage points. The next year the effect decreases to 15 percentage\n",
        "points, and 2+ years after the policy, the effect significantly\n",
        "decreases towards 6 percentage points and is less statistically\n",
        "significant.\n",
        "\n",
        "### 16.5.1 Event Study Graph\n",
        "\n",
        "The table output is a correct way to convey the results, but it’s\n",
        "efficacy is limited, especially when we want to use a large time window.\n",
        "In those cases, a graph does a better job of representing all\n",
        "coefficients of interest.\n",
        "\n",
        "We can easily do that using the command `coefplot`, which we covered in\n",
        "[Module\n",
        "9](https://comet.arts.ubc.ca/docs/Research/econ490-stata/09_Stata_Graphs.html).\n",
        "We keep all coefficients of interest by including all *event_time*\n",
        "dummies as inputs in `keep()`, and we rename them one-by-one in\n",
        "`rename()` to increase clarity of the graph.\n",
        "\n",
        "``` {stata}\n",
        "coefplot, keep(event_time_*) vertical graphregion(color(white)) yline(0) ///\n",
        "    rename(event_time_dummy1=\"k=-2\" event_time_dummy3=\"k=0\" event_time_dummy4=\"k=+1\" event_time_dummy5=\"k=+2\") \n",
        "graph export graph2.jpg, as(jpg) replace\n",
        "```\n",
        "\n",
        "In the graph, it is easy to see that the parallel trends assumption is\n",
        "satisfied: the difference between the treatment and the control group\n",
        "before the treatment is administered (the coefficient for $k=-2$) is not\n",
        "statistically different than zero.\n",
        "\n",
        "## 16.6 Common Mistakes\n",
        "\n",
        "The most common mistake when dealing with a diff-in-diff research design\n",
        "is to add covariates that are already captured by the fixed-effects.\n",
        "\n",
        "Let’s see what happens if we try to estimate Equation (2) where $X$ is\n",
        "gender at birth.\n",
        "\n",
        "``` {stata}\n",
        "* Load the data\n",
        "clear* \n",
        "use fake_data, clear \n",
        "\n",
        "* Set as panel data\n",
        "xtset workerid year, yearly\n",
        "\n",
        "* Generate log-earnings\n",
        "generate logearn = log(earnings)\n",
        "\n",
        "* Keep only two years\n",
        "keep if year==2002 | year==2003\n",
        "\n",
        "* Estimate incorrect specification\n",
        "areg logearn treated#2003.year i.year sex, absorb(workerid)\n",
        "```\n",
        "\n",
        "We cannot estimate the specification above because *sex* does not change\n",
        "over time for the same individual. Remember: in diff-in-diff\n",
        "regressions, we can only add covariates that are time varying by nature\n",
        "(e.g. tenure, experience) or are trends based on fixed characteristics\n",
        "(e.g. time dummies interacted with sex).\n",
        "\n",
        "Another common mistake when dealing with event studies is to forget to\n",
        "re-assign untreated workers to the reference group $k=-1$. Let’s see\n",
        "what happens if we try to estimate Equation (4) without this adjustment.\n",
        "\n",
        "``` {stata}\n",
        "* Load data\n",
        "clear* \n",
        "use fake_data, clear \n",
        "\n",
        "* Generate log-earnings\n",
        "generate logearn = log(earnings)\n",
        "\n",
        "* Generate a variable for year in which treatment was received\n",
        "capture drop time_entering_treatment \n",
        "generate time_entering_treatment = 2003 if treated==1 \n",
        "replace time_entering_treatment = . if treated==0\n",
        "\n",
        "* Generate a variable for time relative to the event\n",
        "capture drop event_time\n",
        "generate event_time = year - time_entering_treatment\n",
        "\n",
        "* Binning\n",
        "replace event_time = -2 if event_time<-2 & treated==1\n",
        "replace event_time = 2 if event_time>2 & treated==1\n",
        "\n",
        "* Create event_time dummies\n",
        "tabulate event_time, gen(event_time_dummy)\n",
        "\n",
        "* Run regression\n",
        "areg logearn event_time_dummy1 event_time_dummy3 event_time_dummy4 event_time_dummy5 i.year , absorb(workerid)\n",
        "```\n",
        "\n",
        "There are no error messages from Stata, but do you notice anything\n",
        "different compared to our results in Section 16.5?\n",
        "\n",
        "The number of observations has decreased dramatically: instead of\n",
        "138,138 workers as in Section 16.5, we only have around 40,000 workers.\n",
        "We are estimating our linear model only on the treated workers. This is\n",
        "a conceptual mistake: we cannot uncover the effect of the treatment if\n",
        "we do not compare the earnings of treated workers with the earnings of\n",
        "untreated workers.\n",
        "\n",
        "## 16.7 Wrap Up\n",
        "\n",
        "In this module, we’ve seen how the difference-in-differences design\n",
        "relies on two components:\n",
        "\n",
        "1.  Panel data, in which units are observed over time, and\n",
        "2.  Time and unit fixed-effects.\n",
        "\n",
        "These two components make regressions mathematically equivalent to\n",
        "taking time-differences that eliminate any time-invariant components of\n",
        "the error term creating endogeneity. Furthermore, when we have access to\n",
        "more than 2 time periods, we are able to construct dynamic treatment\n",
        "effects (run an event study) and test whether the parallel trends\n",
        "condition holds.\n",
        "\n",
        "## 16.8 Wrap-up Table\n",
        "\n",
        "| Command | Function |\n",
        "|--------------------------------|----------------------------------------|\n",
        "| `areg depvar indepvar, absorb(fixed-effects))` | It runs a linear regression with fixed-effects, while suppressing the coefficients on the fixed-effects. |\n",
        "\n",
        "## References\n",
        "\n",
        "[Difference in differences using\n",
        "Stata](https://www.youtube.com/watch?v=OQCKafoCb9Q)"
      ],
      "id": "693c9060-a4b7-487d-a07b-ba06add00071"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "path": "/usr/local/share/jupyter/kernels/python3"
    },
    "language_info": {
      "name": "python",
      "codemirror_mode": {
        "name": "ipython",
        "version": "3"
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  }
}