# Load the plm library (for panel data)
install.packages("plm")
library(plm)
# Loading in our packages
library(tidyverse)
library(haven)
ECON 490: Differences-in-Differences Analysis (15)
Prerequisites
- Run OLS regressions.
- Run panel data regressions.
Learning Outcomes
- Understand the parallel trends (PT) assumption.
- Run the according OLS regression that retrieves the causal estimand.
- Implement these regressions in the two-period case and in multiple time periods (a.k.a event studies).
- Conduct a test on the plausibility of the PT whenever there are more than 1 pre-treatment periods.
For this module, we will keep working on our fake dataset. Recall that this data is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
Let’s start by loading our data and letting R know that it is a panel data with panel variable workerid and time variable year. You will remember that we have seen how to do this in the previous modules.
15.1 Loading the panel data
Difference-in-differences (Diff-in-Diff) is a research design used to estimate the causal impact of a treatment by comparing the changes in outcomes over time between a treated group and an untreated or control group. By comparing changes in outcomes over time, it relies on the use of multiple (at least two) time periods. Therefore, there is a link between Diff-in-Diff designs and panel data. Every time you want to use a Diff-in-Diff design, you always have to make sure you have panel data.
Why are panel datasets crucial in Diff-in-Diff research designs? The idea is that panel data allows us to control for unobserved time invariant heterogeneity. Consider the following example. Earnings \(y_{it}\) of worker \(i\) at time \(t\) can be split into two components:
\[ y_{it} = e_{it} + \alpha_{i} \]
where \(\alpha_i\) is a measure of worker quality and \(e_{it}\) are the part of earnings not explained by \(\alpha_i\). This says that a bad quality worker (low \(\alpha_i\)) will receive lower earnings at any time period. Notice that worker quality is typically unobserved and is usually part of our error term, which should not be correlated with treatment. In many cases, this invariant heterogeneity is the cause of endogeneity bias. In this example, it can be that workers who attend a training program also tend to be the ones that perform poorly at their job and select into this program.
However, notice that if we take time differences, we get rid of this heterogeneity. Suppose we subtract earnings at time \(1\) from earnings at time \(0\) obtaining:
\[ y_{i1} - y_{i0} = e_{i1} - e_{i0} \]
where our new equation no longer depends on \(\alpha_i\)! However, our model now has changes rather than levels. This is going to be the trick used implicitly throughout this module.
For this module, we will keep working on our fake dataset. Recall that this data is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
Let’s start by loading the packages we need.
The we import our data and let R know that it is a panel data with panel variable workerid and time variable year.
# Load data
<- read_dta("../econ490-stata/fake_data.dta")
fake_data
# Set as panel
<- pdata.frame(fake_data, index = c("workerid","year")) panel_data
15.2 Parallel trends assumption
When using a Diff-in-Diff design, you should always make sure that your data has a binary treatment variable which takes value 1 when your unit of observation is treated and 0 otherwise. In the example above, let’s denote such a binary treatment variable as \(D_i\). It takes value 1 if a worker \(i\) is enrolled in the training program at some point in time.
In our fake dataset, the binary treatment variable already exists and is called treated. Let’s check that it takes values 0 or 1.
summary(panel_data$treated)
The aim of Diff-in-Diff analysis is to estimate the causal impact of a treatment by comparing the changes in outcomes over time between a treated group and an untreated or control group. A crucial assumption needed to claim causal impact is that, in the absence of treatment, the treatment and control groups would follow similar trends over time. This assumption is called parallel trends assumption. Whenever we adopt a Diff-in-Diff design in our research, the first thing we need to check is that this assumption is satisfied. How do we do that?
A common approach to check that for parallel trends is to plot the mean outcome for each group (the treated and untreated group) over time. We start by generating the average log-earnings for each group in each year.
# Generate log-earnings
<- panel_data %>% mutate(logearn = log(earnings))
panel_data
# Generate average by group and year
<- panel_data %>%
mean_earn group_by(treated, year) %>%
summarise(meanearnings = mean(logearn)) %>%
mutate(treatment = case_when(treated == 1 ~ 'Treated', treated == 0 ~ 'Untreated'))
Next, we plot the trend of average earnings by each group. Remember that we have seen how to make graphs in Module 8. It is common practice to add a vertical line in the period just before the treatment is assigned. In our case, that would be year 2002. The idea is that the treated workers receive the treatment between years 2002 and 2003.
# Make graph
ggplot(mean_earn, aes(x=year, y=meanearnings, group=treatment, color=treatment)) +
geom_line() +
geom_vline(xintercept = "2002", linetype = "dashed", color = "red") + # add vertical line in 2002
labs(x = "Year", y = "Mean earnings", color = "Treatment")
Remember that we care about the two variables having similar trends before the year of the treatment. By looking at the graph, it seems that the average earnings of the two groups had similar trends up until year 2002, just right before the treatment. This makes us confident that the parallel trends assumption is satisfied.
This test for parallel trends assumption is very rudimentary, but perfectly fine for the early stage of our research project. In the next sections, we will see how to estimate the Diff-in-Diff design and there we will see a more formal test for parallel trends assumption.
15.3 Difference-in-Differences and Regression
Whenever we talk about difference-in-differences, we refer to a research design that relies on some version of the parallel trend assumption. To connect this design to regression, we need to first build a model. To begin, we will assume a case where no control variables are involved.
For simplicity, suppose there are only two periods: a period \(t=0\) when no-one is treated, and a period \(t=1\) when some workers receive the treatment. We would then rely on a linear model of the form:
\[ y_{it} = \beta D_i \mathbf{1}\{t=1\} + \lambda_t + \alpha_i + e_{it} \tag{1} \]
where \(y_{it}\) is earnings while \(\lambda_t\) and \(\alpha_i\) are year and worker fixed effects. The key element in this linear model is the interaction between \(D_i\) and \(\mathbf{1}\{t=1\}\). Recall that \(D_i\) is a dummy variable taking value 1 if worker \(i\) receives the treatment at any point in time and \(\mathbf{1}\{t=1\}\) is a dummy variable taking value 1 when \(t=1\). Therefore, the interaction term \(D_i \mathbf{1}\{t=1\}\) will take value 1 for treated workers only when the year is \(t=1\). The parameter \(\beta\) provides the average treatment effect (on the treated) at period \(t=1\) (i.e. the effect activates for those with \(D_i=1\) and at \(t=1\)). It is the average impact of the treatment on those workers who actually received the treatment. \(\beta\) states by how much the average earnings of treated individuals would have changed if they had not received the treatment.
Let’s see how we can estimate this Diff-in-Diff linear model. Recall that we have information of workers in the years 1982-2012 and the training program (the treatment) as introduced in 2003. We’ll keep one year prior and one year after the program, to keep things consistent with the previous section. Specifically, we can think of year 2002 as \(t=0\) and year 2003 as \(t=1\).
# Keep only years 2002 and 2003
<- panel_data[panel_data$year %in% c("2002", "2003"),] panel_data
Next, we create a dummy variable called time that takes value 1 when year is 2003 and 0 otherwise. It will be the equivalent of \(\mathbf{1}\{t=1\}\) from Equation (1).
# Create dummy variable
<- panel_data %>%
panel_data mutate(time = ifelse(panel_data$year == "2003", 1, 0))
Notice that the Diff-in-Diff linear model in Equation (1) can be seen as a specific case of a linear model with many fixed effects using a panel data. We can still use the plm()
function that we have studied in the panel data module. Remember to add the option effect = "twoways"
to tell R to add both time and worker Fixed Effects to the specification.
<- plm(logearn ~ treated * time, data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
did_model summary(did_model)
Our coefficient of interest is treated:time. This says that on average workers who entered the program received 18 percentage points more earnings relative to a counterfactual scenario where they never entered the program (which in this case is captured by the control units). How did we get this interpretation? Recall that OLS estimates are interpreted as a 1 unit increase in the independent variable: a 1 unit increase of $ D_i {t=1}$ corresponds to those who started receiving treatment at \(t=1\). Furthermore, the dependent variable is in log scale, so a 0.18 increase corresponds to a 18 percentage point increase in earnings.
15.3.1 Adding covariates
The first thing to notice is that our regression specification in Equation (1) involves workers fixed effects \(\alpha_i\). This means that every worker characteristic that is fixed over time (for example, sex at birth) will be absorbed by the fixed effects \(\alpha_i\). Therefore, if we added characteristics such as sex and race, those will be omitted from the regression due to perfect collinearity.
This means that we can add covariates to the extent that they are time varying by nature (e.g. tenure, experience) or are trends based on fixed characteristics (e.g. time dummies interacted with sex). We refer to the latter as covariate-specific trends.
Algebraically, we obtain a specification that is very similar to Equation (1): \[ y_{it} = \beta D_i \mathbf{1}\{t=1\} + \gamma X_{it} + \lambda_t + \alpha_i + e_{it} \tag{2} \]
where \(X_{it}\) is a time varying characteristic of worker \(i\) and time \(t\).
15.4 Multiple Time Periods
When we have kept only years 2002 and 2003, we have canceled substantial information. We may want to keep our dataset at its original state, with all its years. A very natural approach to extending this to multiple time periods is to attempt to get the average effect across all post-treatment time periods (i.e. maybe the effects of the training program decay over time, but we are interested in the average over time). We may think of maintaining the parallel trends assumption in a model like this:
\[ y_{it} = \beta D_i \mathbf{1}\{t\geq 1\} + \lambda_t + \alpha_i + e_{it} \tag{3} \]
where the \(\beta\) corresponds now to all time periods following the year in which treatment was applied: \(t\geq 1\). Some people rename $ D_i {t}$ to \(D_{it}\), where \(D_{it}\) is simply a variable that takes 0 before any treatment and 1 for those who are being treated at that particular time \(t\). This is known as the Two-way Fixed Effects Model . It receives this name because we are including unit fixed effects, time fixed effects, and our treatment status.
Let’s load our fake dataset again and estimate a Two-way Fixed Effects Model step-by-step.
# Load data
<- read_dta("../econ490-stata/fake_data.dta")
fake_data
# Set as panel
<- pdata.frame(fake_data, index = c("workerid","year"))
panel_data
# Generate log-earnings
<- panel_data %>% mutate(logearn = log(earnings)) panel_data
Remember that now we need to create \(\mathbf{1}\{t\geq 1\}\), a dummy equal to 1 for all years following the year in which the treatment was administered. In our example, we need to create a dummy variable taking value 1 for all years greater or equal than 2003.
# Create dummy for year >= 2003
$post2003 = ifelse(panel_data$year %in% c("2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011"), 1, 0) panel_data
We can use again plm
to estimate Equation (3), but remember to use the new post2003 dummy variable.
<- plm(logearn ~ treated * post2003, data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
did_model summary(did_model)
The results say that a 1 unit increase in \(D_i \mathbf{1}\{t\geq 1\}\) corresponds to a 0.07 increase in log-earnings on average. That 1 unit increase only occurs for those who start receiving treatment in 2003. Given that the outcome is in a log scale, we interpret these results in percentage points. Therefore, the coefficient of interest says that those who started treatment in 2003 received, on average, a 7 percentage point increase in earnings.
In this fake data set, everyone either starts treatment at year 2003 or does not enter the program at all. However, when there is variation in the timing of the program (i.e. people entering the training program earlier than others), regression using this model may fail to capture the true parameter of interest. For a reference, see this paper.
15.5 Event studies
The natural extension of the previous section, which is the standard approach today, is to estimate different treatment effects depending on the time period. It may be possible that the effect of the treatment fades over time: it was large right after the training program was received, but then decreased over time.
To capture the evolution of treatment effects over time, we may want to compute treatment effects at different time lags from when the program was received: 1 year after, 2 years after, and so on. Similarly, we may compute “treatment effects” at different years prior the program. This is a very powerful tool because it allows us to more formally test whether the parallel trend assumption holds or not: if there are treatment effects prior to receiving the treatment, then the treatment and control groups were likely not having the same trend before receiving the treatment. This is often known as a pre-trends test.
A linear model where we test for different treatment effects in different years is usually called Event study. Essentially, we extend the Diff-in-Diff linear model to the following equation:
\[ y_{it} = \sum_{k=-T,k\neq1}^T \beta_k \mathbf{1}\{K_{it} = k\} + \lambda_t + \alpha_i + e_{it} \tag{4} \]
where \(K_{it}\) are event time dummies (i.e. whether person \(i\) is observed at event time \(k\) in time \(t\)). Notice that, for workers who never enter treatment, it is as if the event time is \(\infty\): they are an infinite amount of years away from receiving the treatment. Due to multicollinearity, we need to omit one category of event time dummies \(k\). The typical choice is \(k=-1\) (one year prior to treatment), which will serve as our reference group. This means that we are comparing changes relative to event time -1.
How do we estimate Equation (4) in practice? We begin by constructing a variable that identifies the time relative to the event. For instance, if a person enters the training program in 2003, the observation corresponding to 2002 is time -1 relative to the event, the observation corresponding to 2003 is time 0 relative to the event, and so on. We call this variable event_time and we compute it as the difference between the current year and the year in which the treatment was received (stored in variable time_entering_treatment).
In this fake data set, everyone enters the program in 2003, so it is very easy to construct the event time. If this is not the case, you need to make sure that your data set contains a variable which states the year in which every person receives the treatment.
# Load data
<- read_dta("../econ490-stata/fake_data.dta")
fake_data
# Set as panel
<- pdata.frame(fake_data, index = c("workerid","year"))
panel_data
# Generate log-earnings
<- panel_data %>% mutate(logearn = log(earnings))
panel_data
# Generate a variable for year in which treatment was received
$time_entering_treatment = ifelse(panel_data$treated == 1, 2003, NA)
panel_data
# Convert year to numeric
$yearnum <- 1994 + as.numeric(panel_data$year)
panel_data
# Generate a variable for time relative to the event
$event_time = panel_data$yearnum - panel_data$time_entering_treatment panel_data
To make sure we have created event_time in the right way, let’s see which values it takes.
summary(panel_data$event_time)
Notice that all untreated workers have a missing value for variable event_time. We want to include untreated workers in the reference category \(k=-1\). Therefore, we code untreated units as if they always belonged to event time -1. We use ifelse
to replace variable event_time with value -1 when variable treated takes value 0.
$event_time <- ifelse(panel_data$treated == 0, -1, panel_data$event_time)
panel_datasummary(panel_data$event_time)
We then decide which window of time around the treatment we want to focus on (the \(T\)’s in Equation (4)). For instance, we may want to focus on 2 years prior to the treatment and 2 years after the treatment and estimate those treatment effects. Our choice should depend on the amount of information we have in each year. In this case, notice that the number of workers 8 years after treatment is substantially lower than the number of workers 8 years before treatment is started.
We could drop all observations before \(k=-2\) and after \(k=2\). This would once again reduce the amount of information we have in our dataset. An alternative approach, called binning the window around treatment, is usually preferred. It works by pretending that treated workers who are observed before event_time -2 were actually observed in event_time -2 and treated workers who are observed after event_time 2 were actually observed in event_time 2. Once again, we use the command ifelse
.
$event_time <- ifelse(panel_data$event_time < -2 & panel_data$treated == 1, -2, panel_data$event_time)
panel_data$event_time <- ifelse(panel_data$event_time > 2 & panel_data$treated == 1, 2, panel_data$event_time) panel_data
Notice how these steps have modified the values of variable event_time compared to before:
summary(panel_data$event_time)
The next step is to generate a dummy variable for each value of event_time. We use the function case_when()
to do it.
<- panel_data %>%
panel_data mutate(event_time_dummy1 = case_when(event_time == -2 ~ 1, TRUE ~ 0),
event_time_dummy2 = case_when(event_time == -1 ~ 1, TRUE ~ 0),
event_time_dummy3 = case_when(event_time == 0 ~ 1, TRUE ~ 0),
event_time_dummy4 = case_when(event_time == 1 ~ 1, TRUE ~ 0),
event_time_dummy5 = case_when(event_time == 2 ~ 1, TRUE ~ 0))
Notice that event_time_dummy2 is the one that corresponds to event_time -1.
Once again, Equation (4) is nothing but a linear model with many fixed effects. We can again use the command plm
. This time we include dummy variables for the different values of event_time, with the exception of the dummy variable for the baseline event time \(k=-1\): event_time_dummy2.
<- plm(logearn ~ event_time_dummy1 + event_time_dummy3 + event_time_dummy4 + event_time_dummy5 ,
did_model data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
summary(did_model)
Again, the interpretation is the same as before, only now we have dynamic effects. The coefficient on the event_time1 dummy says that 2 years prior to entering treatment, treated units experienced a 0.4 percentage point increase in earnings relative to control units.
Should we worry that we are finding a difference between treated and control units prior to the policy? Notice that the “effect” of the policy at event time -2 (event_time_dummy1, when there was no training program) is not statistically different than zero. This confirms that our parallel trend assumption is supported by the data. In other words, there are no observable differences in trends prior to the enactment of the training program. Checking the p-value of those coefficients prior to the treatment is called the pre-trend test and does not require any fancy work. A mere look at the regression results suffices!
Furthermore, we can observe how the policy effect evolves over time. At the year of entering the training program, earnings are boosted by 20 percentage points. The next year the effect decreases to 15 percentage points, and 2+ years after the policy the effect significantly decreases towards 6 percentage points and is less statistically significant.
15.5.1 Event study graph
The table output is a correct way to convey the results, but its efficacy is limited, especially when we want to use a large time window. In those cases, we prefer a graph representing all coefficients of interest.
We can easily do that using the library coefplot
. We use the function coefplot
from the same library and the coefficients we have saved in object did_model as input.
# Load coefplot
install.packages("coefplot")
library(coefplot)
# Create graph
coefplot(did_model, horizontal = TRUE)
In the graph it is easy to see that the parallel trend assumption is satisfied: the difference between treatment and control group before the treatment is administered (the coefficient for event_dummy_1) is not statistically different than zero.
15.6 Common mistakes
The most common mistake when dealing with a Diff-in-Diff research design is to add covariates that are already captured by the fixed effects. Let’s see what happens if we try to estimate Equation (2) where \(X\) is gender at birth.
# Load data
<- read_dta("../econ490-stata/fake_data.dta")
fake_data
# Set as panel
<- pdata.frame(fake_data, index = c("workerid","year"))
panel_data
# Generate log-earnings
<- panel_data %>% mutate(logearn = log(earnings))
panel_data
# Keep only years 2002 and 2003
<- panel_data[panel_data$year %in% c("2002", "2003"),]
panel_data
# Create dummy variable
<- panel_data %>%
panel_data mutate(time = ifelse(panel_data$year == "2003", 1, 0))
# Estimate incorrect specification
<- plm(logearn ~ treated * time + sex, data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
did_model summary(did_model)
We cannot estimate the coefficient for sex in the specification above because sex does not change over time for the same individual. Remember: you can add covariates that are time varying by nature (e.g. tenure, experience) or are trends based on fixed characteristics (e.g. time dummies interacted with sex).
A common mistake when dealing with event studies is to forget to re-assign untreated workers to the baseline time event \(k=0\). Let’s see what happens if we try to estimate Equation (4) without this adjustment.
# Load data
<- read_dta("../econ490-stata/fake_data.dta")
fake_data
# Set as panel
<- pdata.frame(fake_data, index = c("workerid","year"))
panel_data
# Generate log-earnings
<- panel_data %>% mutate(logearn = log(earnings))
panel_data
# Generate a variable for year in which treatment was received
$time_entering_treatment = ifelse(panel_data$treated == 1, 2003, NA)
panel_data
# Convert year to numeric
$yearnum <- 1994 + as.numeric(panel_data$year)
panel_data
# Generate a variable for time relative to the event
$event_time = panel_data$yearnum - panel_data$time_entering_treatment
panel_data
# Binning
$event_time <- ifelse(panel_data$event_time < -2 & panel_data$treated == 1, -2, panel_data$event_time)
panel_data$event_time <- ifelse(panel_data$event_time > 2 & panel_data$treated == 1, 2, panel_data$event_time)
panel_data
# Create event time dummies
<- panel_data %>%
panel_data mutate(event_time_dummy1 = case_when(event_time == -2 ~ 1, TRUE ~ 0),
event_time_dummy2 = case_when(event_time == -1 ~ 1, TRUE ~ 0),
event_time_dummy3 = case_when(event_time == 0 ~ 1, TRUE ~ 0),
event_time_dummy4 = case_when(event_time == 1 ~ 1, TRUE ~ 0),
event_time_dummy5 = case_when(event_time == 2 ~ 1, TRUE ~ 0))
# Run regression
<- plm(logearn ~ event_time_dummy1 + event_time_dummy3 + event_time_dummy4 + event_time_dummy5 ,
did_model data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
summary(did_model)
There are no error messages from R, but do you notice anything different compared to our results in Section 15.5?
The number of observations has decreased dramatically: instead of 138,138 workers as in Section 15.5, we now have less than 40,000 workers. We are estimating our linear model only on the treated workers. This is a conceptual mistake: we cannot uncover the effect of the treatment if we do not compare the earnings of treated workers with the earnings of untreated workers.
15.7 Wrap Up
In this module we’ve seen how the difference-in-differences design relies on two components:
- Panel data, in which units are observed over time.
- Including time and unit fixed effects
These two components make regressions mathematically equivalent to taking time-differences that eliminate any time-invariant components of the error term creating endogeneity. Furthermore, when we have access to more than 2 time periods, we are able to construct dynamic treatment effects (event study) and test whether the parallel trends condition holds.