>>> import sys
>>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module 01, Section 1.3: Setting Up the STATA Path
>>> from pystata import config
>>> config.init('se')
16 - Differences-in-Differences Analysis
Prerequisites
- Run OLS regressions.
- Run panel data regressions.
Learning Outcomes
- Understand the parallel trends (PT) assumption.
- Run the according OLS regression that retrieves the causal estimand.
- Implement these regressions in the two-period case and in multiple time periods (a.k.a event studies).
- Conduct a test on the plausibility of the PT whenever there are more than 1 pre-treatment periods.
16.0 Intro
16.1 Difference-in-differences
Difference-in-differences (diff-in-diff) is a research design used to estimate the causal impact of a treatment by comparing the changes in outcomes over time between a treated group and an untreated (or control) group. By comparing changes in outcomes over time, it relies on the use of multiple (at least two) time periods. Therefore, there is a link between diff-in-diff designs and panel data. Every time we want to use a diff-in-diff design, we will always have to make sure that we have panel data.
Why are panel datasets crucial in diff-in-diff research designs? The idea is that panel data allows us to control for heterogeneity that is both unobserved and time invariant.
Consider the following example. Earnings \(y_{it}\) of worker \(i\) at time \(t\) can be split into two components:
\[ y_{it} = e_{it} + \alpha_{i} \]
where \(\alpha_i\) is a measure of worker quality and \(e_{it}\) are the part of earnings not explained by \(\alpha_i\). This says that a bad quality worker (low \(\alpha_i\)) will receive lower earnings at any time period, since \(\alpha_i\) is time invariant. Notice that worker quality is typically unobserved and is usually part of our error term, which should not be correlated with treatment. In many cases though, this invariant heterogeneity (in our case, worker quality) is the cause of endogeneity bias. In this example, it can be that workers who attend a training program also tend to be the ones that perform poorly at their job and select into this program.
However, notice that if we take time differences, we get rid of this heterogeneity. Suppose we subtract earnings at time \(1\) from earnings at time \(0\), thus obtaining:
\[ y_{i1} - y_{i0} = e_{i1} - e_{i0} \]
where our new equation no longer depends on \(\alpha_i\)! However, see how we are now measuring \(y_{i1} - y_{i0}\) instead of \(y_{it}\)? Our model now has changes rather than levels. This is going to be the trick used implicitly throughout this module.
For this module, we will keep working on our fake data set. Recall that this data is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
Let’s start by loading our data and letting Stata know that it is panel data with panel variable workerid and time variable year. We’ve seen how to do this in Module 15.
%%stata
* Load the data
*
clear*cd ""
use fake_data, clear
* Set as panel data
xtset workerid year, yearly
16.2 Parallel Trends Assumption
When using a diff-in-diff design, we first need to make sure our data has a binary treatment variable which takes the value 1 when our unit of observation is treated and 0 otherwise. In the example above, let’s denote such a binary treatment variable as \(D_i\). It takes value 1 if a worker \(i\) is enrolled in the training program at some point in time.
In our fake data set, the binary treatment variable already exists and is called treated. Let’s check that it takes values 0 or 1.
%%stata
describe, full
summarize treated, detail
The aim of diff-in-diff analysis is to estimate the causal impact of a treatment by comparing the changes in outcomes over time between a treated group and an untreated group.
A crucial assumption needed to claim causal impact is that, in the absence of treatment, the treatment and control groups would follow similar trends over time. This assumption is called parallel trends assumption. Whenever we adopt a diff-in-diff design in our research, the first thing we need to check is that this assumption is satisfied.
How do we do that?
A common approach to check for parallel trends is to plot the mean outcome for both the treated and untreated group over time.
Do you recall how to make these plots from Module 9? We start by generating the average log-earnings for each group in each year.
%%stata
* Generate log-earnings
= log(earnings)
generate logearn
* Take the average by group and year
= mean(logearn) bysort year treated: egen meanearn
Next, we plot the trend of average earnings by each group. It is common practice to add a vertical line in the period just before the treatment is assigned. In our case, that would be year 2002. The idea is that the treated workers receive the treatment between years 2002 and 2003.
%%stata
* Make graph
if treated == 1, lcolor(gs12) lpattern(solid)) || ///
twoway (line meanearn year if treated == 0, lcolor(gs6) lpattern(dash)), ///
(line meanearn year ///
graphregion(color(white)) 1 "Treated") label(2 "Control")) ///
legend(label("Average earnings") xtitle("Year") ///
ytitle(2002, lpattern(dash) lcolor(black))
xline(as(jpg) replace graph export graph1.jpg,
Remember that we care about the two variables having similar trends before the year of the treatment. By looking at the graph, it seems that the average earnings of the two groups had similar trends up until year 2002, just before the treatment. This makes us confident that the parallel trends assumption is satisfied.
This test for parallel trends assumption is very rudimentary, but perfectly fine for the early stage of our research project. In the next sections, we will see how to estimate the diff-in-diff design, and there we will see a more formal test for the parallel trends assumption.
16.3 Difference-in-Differences and Regression
Whenever we talk about diff-in-diff, we refer to a research design that relies on some version of the parallel trends assumption. To connect this design to regressions, we need to first build a model. To begin, we will assume a case where no control variables are involved.
For simplicity, suppose there are only two periods: a period \(t=0\) when no one is treated, and a period \(t=1\) when some workers receive the treatment.
We would then rely on a linear model of the form:
\[ y_{it} = \beta D_i \mathbf{1}\{t=1\} + \lambda_t + \alpha_i + e_{it} \tag{1} \]
where \(y_{it}\) is earnings while \(\lambda_t\) and \(\alpha_i\) are year and worker fixed-effects.
The key element in this linear model is the interaction between \(D_i\) and \(\mathbf{1}\{t=1\}\).
Recall that \(D_i\) is a dummy variable taking value 1 if worker \(i\) receives the treatment at any point in time, and \(\mathbf{1}\{t=1\}\) is an indicator function taking value 1 when \(t=1\).
Therefore, the interaction term \(D_i \mathbf{1}\{t=1\}\) will take value 1 for treated workers only when the year is \(t=1\), or when the treated workers are treated.
The parameter \(\beta\) provides the average treatment effect (on the treated) at period \(t=1\) (i.e. we get the effect for those with \(D_i=1\) at \(t=1\)). It is the average impact of the treatment on those workers who actually received the treatment. \(\beta\) states by how much the average earnings of treated individuals would have changed if they had not received the treatment.
Let’s see how we can estimate this linear diff-in-diff model!
Recall that we have information of workers in the years 1982-2012 and the training program (the treatment) was introduced in 2003. We’ll keep one year prior and one year after the program, to keep things consistent with the previous section. Specifically, we can think of year 2002 as \(t=0\) and year 2003 as \(t=1\).
%%stata
if year==2002 | year==2003 keep
Notice that the diff-in-diff linear model in Equation (1) can be seen as a specific case of a linear model with many fixed-effects. We can use the command reghdfe
and the option absorb()
to run this type of regression, which we saw in Module 13. We can also use the command areg
alongside the option absorb()
which has the same syntax. In either case, don’t forget to list the fixed-effects in absorb()
to avoid seeing them in the regression output!
Recall that we can create fixed-effects with the i.
operator and interactions with the #
operator.
%%stata
#2003.year i.year, absorb(workerid) areg logearn treated
This says that, on average, workers who entered the program received 18 percentage points more earnings relative to a counterfactual scenario where they never entered the program (which in this case is captured by the control units). How did we get this interpretation? Recall that OLS estimates are interpreted as a 1 unit increase in the independent variable: a 1 unit increase of \(D_i \mathbf{1}\{t=1\}\) corresponds to those who started receiving treatment at \(t=1\). Furthermore, the dependent variable is in log scale, so a 0.18 increase corresponds to a 18 percentage point increase in earnings.
16.3.1 Adding Covariates
The first thing to notice is that our regression specification in Equation (1) involves worker fixed-effects \(\alpha_i\). This means that every worker characteristic that is fixed over time (for example, sex at birth) will be absorbed by the fixed-effects \(\alpha_i\). Therefore, if we added characteristics such as sex and race as covariates, those would be omitted from the regression due to perfect collinearity.
This means that we can add covariates to the extent that they are time varying by nature (e.g. tenure, experience), or are trends based on fixed characteristics (e.g. time dummies interacted with sex). We refer to the latter as covariate-specific trends.
Algebraically, we obtain a specification that is very similar to Equation (1): \[ y_{it} = \beta D_i \mathbf{1}\{t=1\} + \gamma X_{it} + \lambda_t + \alpha_i + e_{it} \tag{2} \]
where \(X_{it}\) is a time-varying characteristic of worker \(i\) and time \(t\).
16.4 Multiple Time Periods
In keeping only the years 2002 and 2003, we have excluded substantial information from our analysis. We may want to keep our data set at its original state, with all its years.
A very natural approach to extending this to multiple time periods is to attempt to get the average effect across all post-treatment time periods. For example, it may be that the effects of the training program decay over time, but we are interested in the average effect. We may think of maintaining the parallel trends assumption in a model like this:
\[ y_{it} = \beta D_i \mathbf{1}\{t\geq 1\} + \lambda_t + \alpha_i + e_{it} \tag{3} \]
where the \(\beta\) corresponds now to all time periods after the year in which treatment was applied: \(t\geq 1\). Some people rename \(D_i \mathbf{1}\{t\geq 1\}\) to \(D_{it}\), where \(D_{it}\) is simply a variable that takes 0 before any treatment and 1 for those who are being treated at that particular time \(t\). This is known as the Two-Way Fixed-Effects (TWFE) Model . It receives this name because we are including unit fixed-effects, time fixed-effects, and our treatment status.
Let’s load our fake data set again and estimate a TWFE model step-by-step.
%%stata
* Load data
*
clear
use fake_data, clear
* Generate log-earnings
= log(earnings) generate logearn
Remember that now we need to create \(\mathbf{1}\{t\geq 1\}\), a dummy equal to 1 for all years following the year in which the treatment was administered. In our example, we need to create a dummy variable taking value 1 for all years greater than or equal to 2003.
%%stata
= year>=2003 generate post2003
We can again use areg
or reghdfe
to estimate Equation (3), but remember to use the new post2003 dummy variable.
%%stata
1.treated#1.post2003 i.year, absorb(workerid) areg logearn
The results say that a 1 unit increase in \(D_i \mathbf{1}\{t\geq 1\}\) corresponds to a 0.07 increase in log-earnings on average. That 1 unit increase only occurs for those who start receiving treatment in 2003. Given that the outcome is in a log scale, we interpret these results in percentage points. Therefore, the coefficient of interest says that those who started treatment in 2003 received, on average, a 7 percentage point increase in earnings.
In this fake data set, everyone either starts treatment at year 2003 or does not enter the program at all. However, when there is variation in the timing of the treatment (i.e. people entering the training program earlier than others), a regression using this model may fail to capture the true parameter of interest. For a reference, see this paper.
16.5 Event Studies
The natural extension of the previous section, which is the standard approach today, is to estimate different treatment effects depending on the time period.
It may be possible that the effect of the treatment fades over time: it was large right after the training program was received, but then decreased over time.
To capture the evolution of treatment effects over time, we may want to compute treatment effects at different lags after the program was received: 1 year after, 2 years after, etc.
Similarly, we may want to compute “treatment effects” at different years prior the program.
This is a very powerful tool because it allows us to more formally test whether the parallel trends assumption holds or not: if there are treatment effects prior to receiving the treatment, then the treatment and control groups were likely not having the same trend before receiving the treatment. This is often known as a pre-trends test.
A linear model where we test for different treatment effects in different years is usually called Event study.
Essentially, we extend the diff-in-diff linear model to the following equation:
\[ y_{it} = \sum_{k=-T,k\neq1}^T \beta_k \mathbf{1}\{K_{it} = k\} + \lambda_t + \alpha_i + e_{it} \tag{4} \]
where \(K_{it}\) are event time dummies (i.e. whether person \(i\) is observed at event time \(k\) in time \(t\)). These are essentially dummies for each year until and each year since the event, or “time to” and “time from” dummies. For example, there will be a dummy indicating that a treated individual is one year away from being treated, two years away from being treated, etc. Notice that, for workers who never enter treatment, it is as if the event time is \(\infty\): they are an infinite amount of years away from receiving the treatment. Due to multicollinearity, we need to omit one category of event time dummies \(k\). The typical choice is \(k=-1\) (one year prior to treatment), which will serve as our reference group. This means that we are comparing changes relative to event time -1.
How do we estimate Equation (4) in practice?
We begin by constructing a variable that identifies the time relative to the event. For instance, if a person enters the training program in 2003, the observation corresponding to 2002 is time -1 relative to the event, the observation corresponding to 2003 is time 0 relative to the event, and so on. We call this variable event_time and we compute it as the difference between the current year and the year in which the treatment was received (stored in variable time_entering_treatment).
In this fake data set, everyone enters the program in 2003, so it is very easy to construct the event time. If this is not the case, we need to make sure that we have a variable which states the year in which each person receives their treatment.
%%stata
* Load data
*
clear
use fake_data, clear
* Generate log-earnings
= log(earnings)
generate logearn
* Generate a variable for year in which treatment was received
capture drop time_entering_treatment = 2003 if treated==1
generate time_entering_treatment = . if treated==0
replace time_entering_treatment
* Generate a variable for time relative to the event
capture drop event_time= year - time_entering_treatment generate event_time
To make sure we have created event_time properly, let’s see which values it takes.
%%stata
tabulate event_time , missing
Notice that all untreated workers have a missing value for the variable event_time. We want to include untreated workers in the reference category \(k=-1\). Recall that we are still trying to understand the effect of being treated compared to the reference group, those that are untreated. Therefore, we code untreated units as if they always belonged to event time -1.
%%stata
= -1 if treated==0 replace event_time
We then decide which window of time around the treatment we want to focus on (the \(T\)’s in Equation (4)). For instance, we may want to focus on 2 years prior to the treatment and 2 years after the treatment, and estimate those treatment effects. Our choice should depend on the amount of information we have in each year. In this case, notice that the number of workers 8 years after treatment is substantially lower than the number of workers 8 years before treatment is started.
We could drop all observations before \(k=-2\) and after \(k=2\). This would once again reduce the amount of information we have in our dataset.
An alternative approach, called binning the window around treatment, is usually preferred. It works by pretending that treated workers who are observed before event_time -2 were actually observed in event_time -2 and treated workers who are observed after event_time 2 were actually observed in event_time 2.
%%stata
= -2 if event_time<-2 & treated==1
replace event_time = 2 if event_time>2 & treated==1 replace event_time
Notice how these steps have modified the values of variable event_time:
%%stata
tabulate event_time
The next step is to generate a dummy variable for each value of event_time.
%%stata
tabulate event_time, gen(event_time_dummy)
Notice that event_time_dummy2 is the one that corresponds to event_time -1.
Once again, Equation (4) is nothing but a linear model with many fixed-effects. We can again use either command areg
or reghdfe
.
This time, we must include dummy variables for the different values of event_time, with the exception of the dummy variable for the baseline event time \(k=-1\): event_time_dummy2.
%%stata
// do you recall how we included worker and year fixed-effects? areg logearn event_time_dummy1 event_time_dummy3 event_time_dummy4 event_time_dummy5 i.year , absorb(workerid)
Again, the interpretation is the same as before, only now we have dynamic effects. The coefficient on the event_time1 dummy says that 2 years prior to entering treatment, treated units experienced a 0.4 percentage point increase in earnings relative to control units.
Should we worry that we are finding a difference between treated and control units prior to the policy? Notice that the effect of the policy at event time -2 (event_time_dummy1, when there was no training program) is not statistically different than zero.
This confirms that our parallel trends assumption is supported by the data. In other words, there are no observable differences in trends prior to the enactment of the training program. Checking the p-value of those coefficients prior to the treatment is called the pre-trend test and does not require any fancy work. A mere look at the regression results suffices!
Furthermore, we can observe how the policy effect evolves over time. At the year of entering the training program, earnings are boosted by 20 percentage points. The next year the effect decreases to 15 percentage points, and 2+ years after the policy, the effect significantly decreases towards 6 percentage points and is less statistically significant.
16.5.1 Event Study Graph
The table output is a correct way to convey the results, but it’s efficacy is limited, especially when we want to use a large time window. In those cases, a graph does a better job of representing all coefficients of interest.
We can easily do that using the command coefplot
, which we covered in Module 9. We keep all coefficients of interest by including all event_time dummies as inputs in keep()
, and we rename them one-by-one in rename()
to increase clarity of the graph.
%%stata
*) vertical graphregion(color(white)) yline(0) ///
coefplot, keep(event_time_="k=-2" event_time_dummy3="k=0" event_time_dummy4="k=+1" event_time_dummy5="k=+2")
rename(event_time_dummy1as(jpg) replace graph export graph2.jpg,
In the graph, it is easy to see that the parallel trends assumption is satisfied: the difference between the treatment and the control group before the treatment is administered (the coefficient for \(k=-2\)) is not statistically different than zero.
16.6 Common Mistakes
The most common mistake when dealing with a diff-in-diff research design is to add covariates that are already captured by the fixed-effects.
Let’s see what happens if we try to estimate Equation (2) where \(X\) is gender at birth.
%%stata
* Load the data
*
clear
use fake_data, clear
* Set as panel data
xtset workerid year, yearly
* Generate log-earnings
= log(earnings)
generate logearn
* Keep only two years
if year==2002 | year==2003
keep
* Estimate incorrect specification
#2003.year i.year sex, absorb(workerid) areg logearn treated
We cannot estimate the specification above because sex does not change over time for the same individual. Remember: in diff-in-diff regressions, we can only add covariates that are time varying by nature (e.g. tenure, experience) or are trends based on fixed characteristics (e.g. time dummies interacted with sex).
Another common mistake when dealing with event studies is to forget to re-assign untreated workers to the reference group \(k=-1\). Let’s see what happens if we try to estimate Equation (4) without this adjustment.
%%stata
* Load data
*
clear
use fake_data, clear
* Generate log-earnings
= log(earnings)
generate logearn
* Generate a variable for year in which treatment was received
capture drop time_entering_treatment = 2003 if treated==1
generate time_entering_treatment = . if treated==0
replace time_entering_treatment
* Generate a variable for time relative to the event
capture drop event_time= year - time_entering_treatment
generate event_time
* Binning
= -2 if event_time<-2 & treated==1
replace event_time = 2 if event_time>2 & treated==1
replace event_time
* Create event_time dummies
tabulate event_time, gen(event_time_dummy)
* Run regression
areg logearn event_time_dummy1 event_time_dummy3 event_time_dummy4 event_time_dummy5 i.year , absorb(workerid)
There are no error messages from Stata, but do you notice anything different compared to our results in Section 16.5?
The number of observations has decreased dramatically: instead of 138,138 workers as in Section 16.5, we only have around 40,000 workers. We are estimating our linear model only on the treated workers. This is a conceptual mistake: we cannot uncover the effect of the treatment if we do not compare the earnings of treated workers with the earnings of untreated workers.
16.7 Wrap Up
In this module, we’ve seen how the difference-in-differences design relies on two components:
- Panel data, in which units are observed over time, and
- Time and unit fixed-effects.
These two components make regressions mathematically equivalent to taking time-differences that eliminate any time-invariant components of the error term creating endogeneity. Furthermore, when we have access to more than 2 time periods, we are able to construct dynamic treatment effects (run an event study) and test whether the parallel trends condition holds.
16.8 Wrap-up Table
Command | Function |
---|---|
areg depvar indepvar, absorb(fixed-effects)) |
It runs a linear regression with fixed-effects, while suppressing the coefficients on the fixed-effects. |