15 - Difference-in-Differences Analysis

econ 490

difference-in-differences

panel data

regression

parallel trends

event study

This notebook introduces difference-in-differences analysis. We look at the assumptions required to perform this type of analysis, how to run the regressions, how to run event studies, and some common mistakes to avoid.

Author

Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt

Published

29 May 2024

Prerequisites

Run OLS regressions.
Run panel data regressions.

Learning Outcomes

Understand the parallel trends (PT) assumption.
Run the corresponding OLS regression that retrieves the causal estimand.
Implement these regressions in the two-period case and in multiple time periods (a.k.a. event studies).
Conduct a test on the plausibility of the PT assumption whenever there is more than one pre-treatment period.

15.1 Difference-in-differences

Difference-in-differences (diff-in-diff) is a research design used to estimate the causal impact of a treatment by comparing the changes in outcomes over time between a treated group and an untreated (or control) group. By comparing changes in outcomes over time, it relies on the use of multiple (at least two) time periods. Therefore, there is a link between diff-in-diff designs and panel data. Every time we want to use a diff-in-diff design, we will always have to make sure that we have panel data.

Why are panel datasets crucial in diff-in-diff research designs? The idea is that panel data allows us to control for heterogeneity that is both unobserved and time invariant.

Consider the following example. Earnings \(y_{it}\) of worker \(i\) at time \(t\) can be split into two components:

\[ y_{it} = e_{it} + \alpha_{i} \]

where \(\alpha_i\) is a measure of worker quality and \(e_{it}\) is the part of earnings not explained by \(\alpha_i\). This says that a low-quality worker (low \(\alpha_i\)) will receive lower earnings in every time period, since \(\alpha_i\) is time invariant. Notice that worker quality is typically unobserved and is usually part of our error term, which should not be correlated with treatment. In many cases though, this time-invariant heterogeneity (in our case, worker quality) is the cause of endogeneity bias. In this example, it may be that workers who attend a training program also tend to be the ones who perform poorly at their job and select into this program.

However, notice that if we take time differences, we get rid of this heterogeneity. Suppose we subtract earnings at time \(0\) from earnings at time \(1\), thus obtaining:

\[ y_{i1} - y_{i0} = e_{i1} - e_{i0} \]

where our new equation no longer depends on \(\alpha_i\)! However, see how we are now measuring \(y_{i1} - y_{i0}\) instead of \(y_{it}\)? Our model now has changes rather than levels. This is going to be the trick used implicitly throughout this module.
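To make the differencing trick concrete, here is a minimal simulation sketch. It does not use our fake data set: the sample size, the distributions, and all object names are assumptions chosen only for illustration.

```r
# Simulated two-period earnings; every name and number here is illustrative
set.seed(1)
n <- 500
alpha <- rnorm(n, mean = 0, sd = 2)   # time-invariant worker quality
e0 <- rnorm(n)                        # time-varying part of earnings at t = 0
e1 <- rnorm(n)                        # time-varying part of earnings at t = 1
y0 <- e0 + alpha                      # earnings at t = 0
y1 <- e1 + alpha                      # earnings at t = 1

# Levels are strongly related to alpha, changes are not
cor_levels  <- cor(y1, alpha)
cor_changes <- cor(y1 - y0, alpha)

# The change in earnings equals the change in e, up to floating-point rounding
max_gap <- max(abs((y1 - y0) - (e1 - e0)))
print(c(cor_levels = cor_levels, cor_changes = cor_changes, max_gap = max_gap))
```

In this sketch earnings in levels are highly correlated with worker quality, while the change in earnings is not, because \(\alpha_i\) cancels out of the difference.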

For this module, we will keep working on our fake data set. Recall that this data is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

Let’s start by loading the packages we need.

# Load the plm library (for panel data)
#uncomment to install the package! install.packages("plm")
library(plm)

# Loading in our packages
library(tidyverse)
library(haven)

Then we import our data and let R know that it is a panel data set with panel variable workerid and time variable year.

# Load data
fake_data <- read_dta("../econ490-stata/fake_data.dta") 

# Set as panel
panel_data <- pdata.frame(fake_data, index = c("workerid","year"))

15.2 Parallel Trends Assumption

When using a diff-in-diff design, we first need to make sure our data has a binary treatment variable which takes the value 1 when our unit of observation is treated and 0 otherwise. In the example above, let’s denote such a binary treatment variable as \(D_i\). It takes value 1 if a worker \(i\) is enrolled in the training program at some point in time.

In our fake data set, the binary treatment variable already exists and is called treated. Let’s check that it takes values 0 or 1.

summary(panel_data$treated)

The aim of diff-in-diff analysis is to estimate the causal impact of a treatment by comparing the changes in outcomes over time between a treated group and an untreated group.

A crucial assumption needed to claim causal impact is that, in the absence of treatment, the treatment and control groups would follow similar trends over time. This assumption is called the parallel trends assumption. Whenever we adopt a diff-in-diff design in our research, the first thing we need to check is that this assumption is plausible.

How do we do that?

A common approach to check for parallel trends is to plot the mean outcome for both the treated and untreated group over time.

Do you recall how to make these plots from Module 8?

We start by generating the average log-earnings for each group in each year.

# Generate log-earnings
panel_data <- panel_data %>% mutate(logearn = log(earnings))

# Generate average by group and year
 mean_earn <- panel_data %>% 
            group_by(treated, year) %>% 
            summarise(meanearnings = mean(logearn)) %>%
            mutate(treatment = case_when(treated == 1 ~ 'Treated', treated == 0 ~ 'Untreated'))

Next, we plot the trend of average earnings by each group. It is common practice to add a vertical line in the period just before the treatment is assigned. In our case, that would be year 2002. The idea is that the treated workers receive the treatment between years 2002 and 2003.

# Make graph
ggplot(mean_earn, aes(x=year, y=meanearnings, group=treatment, color=treatment)) +
  geom_line() +
  geom_vline(xintercept = "2002", linetype = "dashed", color = "red") + # add vertical line in 2002
  labs(x = "Year", y = "Mean earnings", color = "Treatment")

Remember that we care about the two groups having similar trends before the year of the treatment. By looking at the graph, it seems that the average earnings of the two groups had similar trends up until year 2002, just before the treatment. This makes us confident that the parallel trends assumption is satisfied.

This test for parallel trends assumption is very rudimentary, but perfectly fine for the early stage of our research project. In the next sections, we will see how to estimate the diff-in-diff design, and there we will see a more formal test for the parallel trends assumption.

15.3 Difference-in-Differences and Regression

Whenever we talk about diff-in-diff, we refer to a research design that relies on some version of the parallel trends assumption. To connect this design to regressions, we need to first build a model. To begin, we will assume a case where no control variables are involved.

For simplicity, suppose there are only two periods: a period \(t=0\) when no one is treated, and a period \(t=1\) when some workers receive the treatment.

We would then rely on a linear model of the form:

\[ y_{it} = \beta D_i \mathbf{1}\{t=1\} + \lambda_t + \alpha_i + e_{it} \tag{1} \]

where \(y_{it}\) is earnings while \(\lambda_t\) and \(\alpha_i\) are year and worker fixed-effects.

The key element in this linear model is the interaction between \(D_i\) and \(\mathbf{1}\{t=1\}\).

Recall that \(D_i\) is a dummy variable taking value 1 if worker \(i\) receives the treatment at any point in time, and \(\mathbf{1}\{t=1\}\) is an indicator function taking value 1 when \(t=1\).

Therefore, the interaction term \(D_i \mathbf{1}\{t=1\}\) will take value 1 for treated workers only when the year is \(t=1\), or when the treated workers are treated.

The parameter \(\beta\) provides the average treatment effect (on the treated) at period \(t=1\) (i.e. we get the effect for those with \(D_i=1\) at \(t=1\)). It is the average impact of the treatment on those workers who actually received the treatment. In other words, \(\beta\) states by how much the average earnings of treated individuals differ from what they would have been had they not received the treatment.
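To see where the name "difference-in-differences" comes from, we can sketch what Equation (1) implies for group averages. Taking the change over time within each group removes \(\alpha_i\), and the common time effect \(\lambda_1 - \lambda_0\) appears in both groups:

\[ E[y_{i1} - y_{i0} \mid D_i = 1] = \beta + (\lambda_1 - \lambda_0) + E[e_{i1} - e_{i0} \mid D_i = 1] \]

\[ E[y_{i1} - y_{i0} \mid D_i = 0] = (\lambda_1 - \lambda_0) + E[e_{i1} - e_{i0} \mid D_i = 0] \]

If the parallel trends assumption holds, in the sense that \(E[e_{i1} - e_{i0} \mid D_i = 1] = E[e_{i1} - e_{i0} \mid D_i = 0]\), subtracting the second line from the first gives

\[ \beta = E[y_{i1} - y_{i0} \mid D_i = 1] - E[y_{i1} - y_{i0} \mid D_i = 0] \]

which is the difference between the two groups in their differences over time.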

Let’s see how we can estimate this linear diff-in-diff model!

Recall that we have information of workers in the years 1982-2012 and the training program (the treatment) was introduced in 2003. We’ll keep one year prior and one year after the program, to keep things consistent with the previous section. Specifically, we can think of year 2002 as \(t=0\) and year 2003 as \(t=1\).

# Keep only years 2002 and 2003
panel_data <- panel_data[panel_data$year %in% c("2002", "2003"),]

Next, we create a dummy variable called time that takes the value 1 when the year is 2003 and 0 otherwise. It will be the equivalent of \(\mathbf{1}\{t=1\}\) from Equation (1).

# Create dummy variable
panel_data <- panel_data %>%
            mutate(time = ifelse(year == "2003", 1, 0)) # inside mutate() we can refer to year directly

Notice that the diff-in-diff linear model in Equation (1) can be seen as a specific case of a linear model with many fixed-effects using panel data. We can still use the plm() function that we studied in Module 14. Remember to add the option effect = "twoways" to tell R to add both time and worker fixed-effects to the specification.

did_model <- plm(logearn ~ treated * time, data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
summary(did_model)

Our coefficient of interest is treated:time. This says that, on average, workers who entered the program received approximately 18 percent more earnings relative to a counterfactual scenario where they never entered the program (which in this case is captured by the control units). How did we get this interpretation? Recall that OLS estimates are interpreted as the effect of a 1 unit increase in the independent variable: a 1 unit increase of \(D_i \mathbf{1}\{t=1\}\) corresponds to those who started receiving treatment at \(t=1\). Furthermore, the dependent variable is in log scale, so a 0.18 increase in log-earnings corresponds to an increase in earnings of approximately 18 percent.
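Reading a coefficient on a log outcome as a percent change is an approximation that works best for small coefficients. As a small sketch, the helper below (a hypothetical function name, not part of any package) computes the exact percent change implied by a change in log-earnings.

```r
# Convert a change in a log outcome into the exact percent change it implies
# (log_points_to_percent is a hypothetical helper defined only for this example)
log_points_to_percent <- function(b) {
  100 * (exp(b) - 1)
}

pct_018 <- log_points_to_percent(0.18)  # the 0.18 coefficient discussed above
print(round(pct_018, 1))
```

For a coefficient of 0.18 the exact figure is a little under 20 percent, so the "18 percent" reading is close but slightly understates the change.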

15.3.1 Adding Covariates

The first thing to notice is that our regression specification in Equation (1) involves worker fixed-effects \(\alpha_i\). This means that every worker characteristic that is fixed over time (for example, sex at birth) will be absorbed by the fixed-effects \(\alpha_i\). Therefore, if we added characteristics such as sex and race as covariates, those would be omitted from the regression due to perfect collinearity.

This means that we can add covariates to the extent that they are time varying by nature (e.g. tenure, experience), or are trends based on fixed characteristics (e.g. time dummies interacted with sex). We refer to the latter as covariate-specific trends.

Algebraically, we obtain a specification that is very similar to Equation (1): \[ y_{it} = \beta D_i \mathbf{1}\{t=1\} + \gamma X_{it} + \lambda_t + \alpha_i + e_{it} \tag{2} \]

where \(X_{it}\) is a time-varying characteristic of worker \(i\) and time \(t\).
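The following sketch illustrates the difference between a fixed characteristic and a covariate-specific trend. It uses simulated data and base R's lm() with worker and time dummies in place of plm(), and all variable names and numbers are assumptions made for the example.

```r
# Toy two-period panel; all names and values are made up for illustration
set.seed(2)
n <- 200
toy <- data.frame(
  workerid = rep(1:n, each = 2),
  time     = rep(c(0, 1), times = n),
  sex      = rep(rbinom(n, 1, 0.5), each = 2),   # fixed within worker
  treated  = rep(rbinom(n, 1, 0.4), each = 2)
)
toy$treat_post <- toy$treated * toy$time   # D_i * 1{t = 1}
toy$sex_time   <- toy$sex * toy$time       # covariate-specific trend
toy$logearn <- 0.5 * toy$sex + 0.1 * toy$time + 0.2 * toy$treat_post +
  0.05 * toy$sex_time + rnorm(2 * n, sd = 0.1)

# Worker and time fixed effects entered as dummies
fit_fixed <- lm(logearn ~ factor(workerid) + factor(time) + sex + treat_post,
                data = toy)
fit_trend <- lm(logearn ~ factor(workerid) + factor(time) + sex_time + treat_post,
                data = toy)

print(coef(fit_fixed)["sex"])                          # NA: absorbed by worker dummies
print(coef(fit_trend)[c("sex_time", "treat_post")])    # both are estimated
```

The coefficient on sex cannot be estimated because it is a linear combination of the worker dummies, while the interaction of sex with time varies within a worker and is estimated without problems.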

15.4 Multiple Time Periods

In keeping only the years 2002 and 2003, we have excluded substantial information from our analysis. We may want to keep our data set at its original state, with all its years.

A very natural approach to extending this to multiple time periods is to attempt to get the average effect across all post-treatment time periods. For example, it may be that the effects of the training program decay over time, but we are interested in the average effect. We may think of maintaining the parallel trends assumption in a model like this:

\[ y_{it} = \beta D_i \mathbf{1}\{t\geq 1\} + \lambda_t + \alpha_i + e_{it} \tag{3} \]

where \(\beta\) now corresponds to the average effect over all time periods from the year in which treatment was applied onward: \(t\geq 1\). Some people rename \(D_i \mathbf{1}\{t\geq 1\}\) to \(D_{it}\), where \(D_{it}\) is simply a variable that takes 0 before any treatment and 1 for those who are being treated at that particular time \(t\). This is known as the Two-Way Fixed-Effects (TWFE) Model. It receives this name because we are including unit fixed-effects, time fixed-effects, and our treatment status.

Let’s load our fake data set again and estimate a TWFE model step-by-step.

# Load data
fake_data <- read_dta("../econ490-stata/fake_data.dta") 

# Set as panel
panel_data <- pdata.frame(fake_data, index = c("workerid","year"))

# Generate log-earnings
panel_data <- panel_data %>% mutate(logearn = log(earnings))

Remember that now we need to create \(\mathbf{1}\{t\geq 1\}\), a dummy equal to 1 for all years following the year in which the treatment was administered. In our example, we need to create a dummy variable taking value 1 for all years greater than or equal to 2003.

# Create dummy for year >= 2003
# (year is stored as a factor in the panel, so we convert its labels to numbers first)
panel_data$post2003 <- ifelse(as.numeric(as.character(panel_data$year)) >= 2003, 1, 0)

We can again use plm to estimate Equation (3), but remember to use the new post2003 dummy variable.

did_model <- plm(logearn ~ treated * post2003, data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
summary(did_model)

The results say that a 1 unit increase in \(D_i \mathbf{1}\{t\geq 1\}\) corresponds to a 0.07 increase in log-earnings on average. That 1 unit increase only occurs for those who start receiving treatment in 2003. Given that the outcome is in a log scale, we interpret these results approximately as percent changes. Therefore, the coefficient of interest says that those who started treatment in 2003 received, on average, an increase in earnings of roughly 7 percent.

In this fake data set, everyone either starts treatment at year 2003 or does not enter the program at all. However, when there is variation in the timing of the treatment (i.e. people entering the training program earlier than others), a regression using this model may fail to capture the true parameter of interest. For a reference, see this paper.

15.5 Event Studies

The natural extension of the previous section, which is the standard approach today, is to estimate different treatment effects depending on the time period.

It may be possible that the effect of the treatment fades over time: it was large right after the training program was received, but then decreased over time.

To capture the evolution of treatment effects over time, we may want to compute treatment effects at different lags after the program was received: 1 year after, 2 years after, etc.

Similarly, we may want to compute “treatment effects” at different years prior to the program.

This is a very powerful tool because it allows us to more formally test whether the parallel trends assumption holds or not: if there are treatment effects prior to receiving the treatment, then the treatment and control groups were likely not having the same trend before receiving the treatment. This is often known as a pre-trends test.

A linear model where we test for different treatment effects in different years is usually called an event study.

Essentially, we extend the diff-in-diff linear model to the following equation:

\[ y_{it} = \sum_{k=-T,k\neq -1}^T \beta_k \mathbf{1}\{K_{it} = k\} + \lambda_t + \alpha_i + e_{it} \tag{4} \]

where \(K_{it}\) is the event time of person \(i\) in period \(t\), so that the terms \(\mathbf{1}\{K_{it} = k\}\) are event time dummies (i.e. whether person \(i\) is observed at event time \(k\) in time \(t\)). These are essentially dummies for each year until and each year since the event, or “time to” and “time from” dummies. For example, there will be a dummy indicating that a treated individual is one year away from being treated, two years away from being treated, etc. Notice that, for workers who never enter treatment, it is as if the event time is \(\infty\): they are an infinite number of years away from receiving the treatment. Due to multicollinearity, we need to omit one category of event time dummies \(k\). The typical choice is \(k=-1\) (one year prior to treatment), which will serve as our reference group. This means that we are comparing changes relative to event time -1.

How do we estimate Equation (4) in practice?

We begin by constructing a variable that identifies the time relative to the event. For instance, if a person enters the training program in 2003, the observation corresponding to 2002 is time -1 relative to the event, the observation corresponding to 2003 is time 0 relative to the event, and so on. We call this variable event_time and we compute it as the difference between the current year and the year in which the treatment was received (stored in variable time_entering_treatment).

In this fake data set, everyone enters the program in 2003, so it is very easy to construct the event time. If this is not the case, we need to make sure that we have a variable which states the year in which each person receives their treatment.

# Load data
fake_data <- read_dta("../econ490-stata/fake_data.dta") 

# Set as panel
panel_data <- pdata.frame(fake_data, index = c("workerid","year"))

# Generate log-earnings
panel_data <- panel_data %>% mutate(logearn = log(earnings))

# Generate a variable for year in which treatment was received
panel_data$time_entering_treatment = ifelse(panel_data$treated == 1, 2003, NA)

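# Convert year to numeric
# (year is a factor in the panel; converting its labels avoids relying on the position of the first year)
panel_data$yearnum <- as.numeric(as.character(panel_data$year))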

# Generate a variable for time relative to the event
panel_data$event_time = panel_data$yearnum - panel_data$time_entering_treatment

To make sure we have created event_time properly, let’s see which values it takes.

summary(panel_data$event_time)

Notice that all untreated workers have a missing value for the variable event_time. We want to include untreated workers in the reference category \(k=-1\). Recall that we are still trying to understand the effect of being treated compared to the reference group, those that are untreated. Therefore, we code untreated units as if they always belonged to event time -1. We use ifelse to replace variable event_time with value -1 when variable treated takes value 0.

panel_data$event_time <- ifelse(panel_data$treated == 0, -1, panel_data$event_time)
summary(panel_data$event_time)

We then decide which window of time around the treatment we want to focus on (the \(T\)’s in Equation (4)). For instance, we may want to focus on 2 years prior to the treatment and 2 years after the treatment, and estimate those treatment effects. Our choice should depend on the amount of information we have in each year. In this case, notice that the number of workers 8 years after treatment is substantially lower than the number of workers 8 years before treatment is started.

We could drop all observations before \(k=-2\) and after \(k=2\). This would once again reduce the amount of information we have in our dataset.

An alternative approach, called binning the window around treatment, is usually preferred. It works by pretending that treated workers who are observed before event_time -2 were actually observed in event_time -2 and treated workers who are observed after event_time 2 were actually observed in event_time 2. Once again, we use the command ifelse.

panel_data$event_time <- ifelse(panel_data$event_time < -2 & panel_data$treated == 1, -2, panel_data$event_time)
panel_data$event_time <- ifelse(panel_data$event_time > 2 & panel_data$treated == 1, 2, panel_data$event_time)

Notice how these steps have modified the values of variable event_time compared to before:

summary(panel_data$event_time)
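The binning step can also be written compactly with pmin() and pmax(), and table() shows how many observations fall into each bin. The sketch below uses a single hypothetical treated worker observed from 1995 to 2011 (an assumed span chosen only for the example) who enters treatment in 2003.

```r
# Hypothetical treated worker observed 1995-2011 who enters treatment in 2003
year <- 1995:2011
event_time <- year - 2003

# Bin the endpoints: values below -2 become -2, values above 2 become 2
event_time_binned <- pmax(pmin(event_time, 2), -2)

# Count the observations that end up in each bin
bin_counts <- table(event_time_binned)
print(bin_counts)
```

Because untreated workers are coded as -1, capping the values in this way leaves them unchanged. The two end bins collect many more observations than the others, which is worth keeping in mind when interpreting their coefficients.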

The next step is to generate a dummy variable for each value of event_time. We use the function case_when() to do it.

panel_data <- panel_data %>%
            mutate(event_time_dummy1 = case_when(event_time == -2 ~ 1, TRUE ~ 0),
                   event_time_dummy2 = case_when(event_time == -1 ~ 1, TRUE ~ 0),
                   event_time_dummy3 = case_when(event_time == 0 ~ 1, TRUE ~ 0),
                   event_time_dummy4 = case_when(event_time == 1 ~ 1, TRUE ~ 0),
                   event_time_dummy5 = case_when(event_time == 2 ~ 1, TRUE ~ 0))

Notice that event_time_dummy2 is the one that corresponds to event_time -1.
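As an alternative sketch, the same set of dummies can be obtained by treating event time as a factor and choosing -1 as its reference level. The vector below is a small hypothetical example, not our fake data set.

```r
# Hypothetical binned event times (untreated workers already coded as -1)
event_time <- c(-2, -1, 0, 1, 2, -1, -1, 2)

# Treat event time as a factor and make -1 the omitted reference level
event_factor <- relevel(factor(event_time), ref = "-1")

# model.matrix() expands the factor into one dummy per non-reference level;
# the first column is the intercept, which we drop
dummies <- model.matrix(~ event_factor)[, -1]
print(colnames(dummies))
```

With this approach the omitted category is set once through relevel(), which makes it harder to leave the wrong dummy out of the regression by accident.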

Once again, Equation (4) is nothing but a linear model with many fixed-effects. We can again use the command plm.

This time, we must include dummy variables for the different values of event_time, with the exception of the dummy variable for the baseline event time \(k=-1\): event_time_dummy2.

did_model <- plm(logearn ~ event_time_dummy1 + event_time_dummy3 + event_time_dummy4 + event_time_dummy5 , 
                 data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
summary(did_model)

Again, the interpretation is the same as before, only now we have dynamic effects. The coefficient on event_time_dummy1 says that 2 or more years prior to entering treatment, treated units had earnings about 0.4 percent higher relative to control units, compared to the gap at event time -1.

Should we worry that we are finding a difference between treated and control units prior to the policy? Notice that the effect of the policy at event time -2 (event_time_dummy1, when there was no training program) is not statistically different from zero.

This is consistent with our parallel trends assumption. In other words, we find no evidence of differences in trends prior to the enactment of the training program. Checking the p-value of those coefficients prior to the treatment is called the pre-trend test and does not require any fancy work. A mere look at the regression results suffices!

Furthermore, we can observe how the policy effect evolves over time. In the year of entering the training program, earnings are boosted by about 20 percent. The next year the effect decreases to about 15 percent, and 2 or more years after the policy, the effect falls to about 6 percent and is less statistically significant.

15.5.1 Event Study Graph

The table output is a correct way to convey the results, but its efficacy is limited, especially when we want to use a large time window. In those cases, a graph does a better job of representing all coefficients of interest.

We can easily do that using the library coefplot, which we covered in Module 8. We use the function coefplot from the same library and the coefficients we have saved in object did_model as inputs.

# Load coefplot
#uncomment to install the package! install.packages("coefplot")
library(coefplot)

# Create graph
coefplot(did_model, horizontal = TRUE)

In the graph, it is easy to see that the parallel trends assumption is supported: the difference between treatment and control group before the treatment is administered (the coefficient for event_time_dummy1) is not statistically different from zero.

15.6 Common Mistakes

The most common mistake when dealing with a diff-in-diff research design is to add covariates that are already captured by the fixed-effects.

Let’s see what happens if we try to estimate Equation (2) where \(X\) is sex at birth.

# Load data
fake_data <- read_dta("../econ490-stata/fake_data.dta") 

# Set as panel
panel_data <- pdata.frame(fake_data, index = c("workerid","year"))

# Generate log-earnings
panel_data <- panel_data %>% mutate(logearn = log(earnings))

# Keep only years 2002 and 2003
panel_data <- panel_data[panel_data$year %in% c("2002", "2003"),]

# Create dummy variable
panel_data <- panel_data %>%
            mutate(time = ifelse(year == "2003", 1, 0)) # inside mutate() we can refer to year directly

# Estimate incorrect specification
did_model <- plm(logearn ~ treated * time +  sex, data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
summary(did_model)

We cannot estimate the specification above because sex does not change over time for the same individual. Remember: in diff-in-diff regressions, we can only add covariates that are time varying by nature (e.g. tenure, experience) or are trends based on fixed characteristics (e.g. time dummies interacted with sex).

Another common mistake when dealing with event studies is to forget to re-assign untreated workers to the reference group \(k=-1\). Let’s see what happens if we try to estimate Equation (4) without this adjustment.

# Load data
fake_data <- read_dta("../econ490-stata/fake_data.dta") 

# Set as panel
panel_data <- pdata.frame(fake_data, index = c("workerid","year"))

# Generate log-earnings
panel_data <- panel_data %>% mutate(logearn = log(earnings))

# Generate a variable for year in which treatment was received
panel_data$time_entering_treatment = ifelse(panel_data$treated == 1, 2003, NA)

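# Convert year to numeric
# (year is a factor in the panel; converting its labels avoids relying on the position of the first year)
panel_data$yearnum <- as.numeric(as.character(panel_data$year))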

# Generate a variable for time relative to the event
panel_data$event_time = panel_data$yearnum - panel_data$time_entering_treatment

# Binning
panel_data$event_time <- ifelse(panel_data$event_time < -2 & panel_data$treated == 1, -2, panel_data$event_time)
panel_data$event_time <- ifelse(panel_data$event_time > 2 & panel_data$treated == 1, 2, panel_data$event_time)

# Create event time dummies
panel_data <- panel_data %>%
            mutate(event_time_dummy1 = case_when(event_time == -2 ~ 1, TRUE ~ 0),
                   event_time_dummy2 = case_when(event_time == -1 ~ 1, TRUE ~ 0),
                   event_time_dummy3 = case_when(event_time == 0 ~ 1, TRUE ~ 0),
                   event_time_dummy4 = case_when(event_time == 1 ~ 1, TRUE ~ 0),
                   event_time_dummy5 = case_when(event_time == 2 ~ 1, TRUE ~ 0))

# Run regression
did_model <- plm(logearn ~ event_time_dummy1 + event_time_dummy3 + event_time_dummy4 + event_time_dummy5 , 
                 data = panel_data, index=c("workerid", "year"), model = "within", effect = "twoways")
summary(did_model)

There are no error messages from R, but do you notice anything different compared to our results in Section 15.5?

The number of observations has decreased dramatically: instead of 138,138 observations as in Section 15.5, we only have around 40,000 observations. We are estimating our linear model only on the treated workers. This is a conceptual mistake: we cannot uncover the effect of the treatment if we do not compare the earnings of treated workers with the earnings of untreated workers.
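A quick safeguard against this kind of silent sample loss is to compare the number of rows in the data with the number of rows the model actually used. The sketch below uses toy data and lm(), with names and numbers that are assumptions for illustration. Keep in mind that how missing values propagate depends on how the dummies are constructed, so it is worth inspecting the dummies themselves as well.

```r
# Toy data: untreated rows have a missing event time, as in the mistake above
set.seed(3)
toy <- data.frame(
  treated    = rep(c(1, 0), each = 50),
  event_time = c(rep(c(-1, 0), times = 25), rep(NA, 50)),
  logearn    = rnorm(100)
)

# A dummy built with ifelse() is missing wherever event_time is missing
toy$post <- ifelse(toy$event_time == 0, 1, 0)

# lm() drops rows with missing regressors without raising an error
fit <- lm(logearn ~ post, data = toy)

n_missing <- sum(is.na(toy$event_time))
print(c(rows_in_data = nrow(toy), rows_used = nobs(fit), missing_event_time = n_missing))
```

Here half of the rows disappear from the estimation sample, and all of them belong to untreated workers.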

15.7 Wrap Up

In this module, we’ve seen how the difference-in-differences design relies on two components:

Panel data, in which units are observed over time, and
Time and unit fixed-effects.

These two components make regressions mathematically equivalent to taking time-differences that eliminate any time-invariant components of the error term creating endogeneity. Furthermore, when we have access to more than 2 time periods, we are able to construct dynamic treatment effects (run an event study) and test whether the parallel trends condition holds.
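The equivalence between the fixed-effects regression and time-differencing can be checked numerically in the two-period case. The sketch below simulates its own data, uses lm() with worker and time dummies in place of plm(), and sets an arbitrary true effect of 0.2; all names and numbers are assumptions for illustration.

```r
# Simulated two-period panel; every name and number below is illustrative
set.seed(4)
n <- 400
treated <- rbinom(n, 1, 0.3)
alpha   <- rnorm(n) - 0.5 * treated     # treated workers have lower quality on average
y0 <- alpha + rnorm(n, sd = 0.2)                         # t = 0, nobody is treated
y1 <- alpha + 0.1 + 0.2 * treated + rnorm(n, sd = 0.2)   # t = 1, true effect is 0.2

# Difference-in-differences by hand: change for treated minus change for untreated
change <- y1 - y0
did_by_hand <- mean(change[treated == 1]) - mean(change[treated == 0])

# The same estimate from a regression with worker and time dummies
long <- data.frame(
  workerid = factor(rep(1:n, times = 2)),
  time     = rep(c(0, 1), each = n),
  treated  = rep(treated, times = 2),
  logearn  = c(y0, y1)
)
long$treat_post <- long$treated * long$time
fit <- lm(logearn ~ workerid + factor(time) + treat_post, data = long)
did_by_regression <- unname(coef(fit)["treat_post"])

print(c(by_hand = did_by_hand, by_regression = did_by_regression))
```

The two numbers coincide up to rounding, even though treated workers differ systematically in their unobserved quality.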