#Clear the memory from any pre-existing objects
rm(list=ls())
# loading in our packages
library(tidyverse) #This includes ggplot2!
library(haven)
library(IRdisplay)
#Open the dataset
<- read_dta("../econ490-stata/fake_data.dta")
fake_data
# inspecting the data
glimpse(fake_data)
12 - Dummy Variables and Interactions
Prerequisites
- Importing data into R.
- Examining data using
glimpse
. - Creating new variables in R.
- Conducting linear regression analysis.
Learning Outcomes
- Understand when a dummy variable is needed in analysis.
- Create dummy variables from qualitative variables with two or more categories.
- Interpret coefficients on a dummy variable from an OLS regression.
- Interpret coefficients on an interaction between a numeric variable and a dummy variable from an OLS regression.
12.1 Introduction to Dummy Variables for Regression Analysis
We first took a look at dummy variables in Module 5. There, we discussed both how to interpret and how to generate this type of variable. If you are unsure about what dummy variables measure, please make sure to review that module.
Here we will discuss including qualitative variables as explanatory variables in a linear regression model as dummy variables.
Imagine that we want to include a new explanatory variable in our multivariate regression from Module 10 that indicates whether an individual is identified as female. To do this, we need to include a new dummy variable in our regression.
For this module, we again will be using the fake data set. Recall that this data is simulating information for workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
Let’s generate a variable that takes the log of earnings, as we did for our regression in the previous module.
<- fake_data %>%
fake_data mutate(log_earnings = log(earnings)) #the log function
Let’s take a look at the data.
glimpse(fake_data)
As expected, logearnings is a quantitative variable showing the logarithm of each value of earnings. We observe a variable named sex, but it doesn’t seem to be coded as a numeric variable. Notice that next to sex it says <chr>
.
As expected, sex is a string variable and is not numeric. We cannot use a string variable in a regression analysis; we have to create a new variable which indicates the sex of the individual represented by the observation in numeric form.
A dummy variable is a numeric variable that takes either the value of 0 or 1 depending on a condition. In this case, we want to create a variable that equals 1 whenever a worker is identified as “female”. A very simple way to create different categories for a variable in R is to use the as.factor()
function.
as.factor(fake_data$sex)
12.2 Interpreting the Coefficient on Dummy Variables
Whenever we interpret the coefficient on a dummy variable in a regression, we are making a direct comparison between the 1-category and the 0-category for that dummy. In the case of this female dummy, we are directly comparing the mean earnings of female-identified workers against the mean earnings of male-identified workers.
Let’s consider the regression below.
lm(data=fake_data, log_earnings ~ as.factor(sex))
Notice that the regression by default used females as the reference point and only estimated a male premium. Typically, we want this to be the other way around. To change the reference group we write the code below.
# Change reference level
= fake_data %>% mutate(female = relevel(as.factor(sex), "M")) fake_data
summary(lm(data=fake_data, log_earnings ~ female))
We remember from Module 10 that “_cons” is the constant \(β_0\), and we know that here \(β_0 = E[logearnings_{i}|female_{i}=0]\). Therefore, the results of this regression suggest that, on average, males have log-earnings of 10.8. We also know from the Module 10 that
\[ \beta_1 = E[logearnings_{i}|female_{i}=1]- E[logearnings_{i}|female_{i}=0]. \]
The regression results here suggest that female-identified persons earn on average 0.55 less than male-identified persons. As a result, female-identified persons earn on average 10.8 - 0.55 = 10.25.
In other words, the coefficient on the female variable shows the mean difference in log-earnings relative to males. \(\hat{β}_1\) thus provides the measure of the raw gender gap.
The interpretation remains the same once we control for more variables, although it is ceteris paribus the other observables now also included in the regression. An example is below.
summary(lm(data=fake_data, log_earnings ~ female + age))
In this case, among people that are the same age (i.e., holding age constant), the gender gap is (not surprisingly) slightly smaller than in our previous regression. That is expected, since previously we compared all females to all males, irrespective of the composition of age groups in those two categories of workers. As we control for age, we can see that the effect of sex decreases.
12.3 Dummy Variables with Multiple Categories
The previous section also holds when there is a variable with multiple categories, as is the case for region.
lm(data=fake_data, log_earnings ~ as.factor(region))
Notice that the sum of the five dummies in any row is equal to 1. This is because every worker is located in only one region. If we included all of the regional dummies in a regression, we would introduce the problem of perfect collinearity: the full set of our dummy variables are perfectly correlated with one another. Think about it this way - if a person is in region 1 (regdummy1 = 1), then we know that that person is not in region 2 (regdummy2 = 0). Therefore being in region 1 perfectly predicts not being in region 2.
We must always exclude one of the dummies. Failing to do so means falling into the dummy variable trap of perfect collinearity described above. Essentially, if we include all of the five dummy variables, the fifth one will not add any new information. This is because, using the four other dummies, we can perfectly deduce whether a person is in region 5 (regdummy5 = 1). To avoid this, we have to choose one region to serve as a base level for which we will not define a dummy. This dummy variable that we exclude will be the category of reference, or base level, when interpreting coefficients in the regression. This means that the coefficient on each region dummy variable will be comparing the mean earnings of people in that region to the mean earnings of people in the one region excluded.
We have actually already seen this approach in action in the first regression we ran above; there, we didn’t add a separate dummy variable for “male”. Instead, we excluded the male dummy variable and interpreted the coefficient on female as the difference between female and male log-earnings.
You may have noticed that R drops the first region dummy (region = 1) and includes dummy variables for the regions 2 - 5.
We can use the same trick as the previous section to change the reference group! Let’s change the reference group to 3.
<- fake_data %>% mutate(region = relevel(as.factor(region), 3)) fake_data
summary(lm(data = fake_data, log_earnings ~ region))
When interpreting the coefficients in the regression above, our intercept is again the mean log-earnings among those in the base level category, i.e. those for which all dummies in the regression are 0; here, that is the mean log-earnings for all people in region 3. Each individual coefficient gives the difference in average log-earnings among people in that region and in region 3. For instance, the mean log-earnings in region 1 are about 0.012 higher than in region 3 and the mean log-earnings in region 2 are about 0.017 lower than in region 3. Both of these differences are statistically significant at a high level (> 99%).
We can also use this logic of interpretation to compare mean log-earnings between the non-reference groups. For example, the meaning log-earnings in region 3 are given by the intercept coefficient: about 10.49. Since the mean log-earnings in region 1 are about 0.012 higher than this, they must be about 10.49 + 0.012 = 10.502. In region 2, the mean log-earnings are about 10.49 - 0.017 = 10.473. We can thus conclude that the mean log-earnings in region 1 are about 10.502 - 10.473 = 0.029 higher than in region 2. In this way, we compared the levels of the dependent variable for 2 dummy variables, neither of which are in the reference group excluded from the regression. We could have much more quickly compared the levels of these groups by comparing their deviations from the base group. Region 1 has mean log-earnings about 0.012 above the reference level, while region 2 has mean log-earnings about 0.017 below this same reference level; thus, region 1 has mean log-earnings about 0.012 - (-0.017) = 0.029 above region 2.
12.4 Interactions
It is an established fact that a wage gap exists between male and female workers. However, it is possible that the wage gap changes depending on the age of the workers. For example, female and male high school students tend to work minimum wage jobs; hence, we might believe that the wage gap between people within the 15-18 age bracket is very small. Conversely, once people have the experience to start looking for better paying jobs, we might believe the wage gap starts to increase, meaning that this gap might be much larger in higher age brackets. The way to capture that differential effect of age across males and females is to create a new variable that is the product of the female dummy and age.
Luckily, by simply regressing log_earnings on our interaction term, female * age, R automatically generates dummy variables for all female and age categories without inducing the dummy variable trap.
summary(lm(data=fake_data, log_earnings ~ female * age))
From the coefficient on female, we can see that, on average, people who are identified as female earn about 0.27 less than those identified as male, holding age constant. We can also see, from the coefficient on age, that each additional year of age increases log-earnings by about 0.013 for the reference category (males). Looking at the coefficient on our interaction term, this effect of age on log-earnings is lower for females by 0.007, meaning that an extra year of age increases log-earnings for women by about 0.013 + (-0.007) = 0.006. It thus seems that our theory is correct: the wage gap between males and females of the same age increases as they get older. For men and women who are both 20, an extra year will be associated with the man earning a bit more than the woman on average. However, if the man and woman are both 50, an extra year will be associated with the man earning much more than the woman on average (or at least out-earning her by much more than before). We can also see from the statistical significance of the coefficient on our interaction term that it was worth including!
Try this yourself below with the set of region dummies we created above, and think about what these results mean!
summary(lm(data=fake_data, log_earnings ~ female * region))
12.5 Wrap Up
There are very few empirical research projects using micro data that do not require researchers to use dummy variables. Important qualitative measures such as marital status, immigration status, occupation, industry, and race always require that we use dummy variables. Other important variables such as education, income, age and number of children often require us to use dummy variables even when they are sometimes measured using ranked categorical variables. For example, we could have a variable that measures years of education which is included as a continuous variable. However, we might instead want to include a variable that indicates if the person has a university degree. If that is the case we can use as.factor()
to create a dummy variable indicating that level of education!
Even empirical research projects that use macro data sometimes require that we use dummy variables. For example, we might have a data set that measures macro variables for African countries with additional information about historic colonization. We might want to create a dummy variable that indicates the origin of the colonizers, and then include that in our analysis to understand that effect. As another example, we might have a time series data set and want to indicate whether or not a specific policy was implemented in a certain time period. We will need a dummy variable for that, and can include it in our analysis using the same process described above. Finally, we can use interaction terms to capture the effect of one variable on another if we believe that it varies between groups. If the coefficient on this interaction term is statistically significant, it can justify this term’s inclusion in our regression. This impacts our interpretation of coefficients in the regression.
Create dummy variables and/or interaction terms with any data set that you have downloaded in R as you see fit. You will find that this approach is not complicated, but has the power to yield meaningful results!
12.6 Wrap-up Table
Command | Function |
---|---|
as.factor(data$varname) |
It automatically creates different categories for a variable. |
relevel(data$varname, reference_level) |
It changes the reference level for a set of dummy variables. |
lm(data, depvar ~ var1 * var2)) |
It adds an interaction term between var1 and var2 as well as var1 and var2 themselves to the regression. |