```
#Clear the memory from any pre-existing objects
rm(list=ls())
# loading in our packages
library(tidyverse) #This includes ggplot2!
library(haven)
library(IRdisplay)
#Open the dataset
fake_data <- read_dta("../econ490-stata/fake_data.dta")
# inspecting the data
glimpse(fake_data)
```

# ECON 490: Dummy Variables and Interactions (12)

## Prerequisites

- Importing data into R.
- Examining data using `glimpse`.
- Creating new variables in R.
- Conducting linear regression analysis.

## Learning Outcomes

- Understand when dummy variables are needed in analysis.
- Create dummy variables from qualitative variables with two or more categories.
- Interpret coefficients associated with dummy variables from an OLS regression.
- Interpret coefficients of an interaction between a numeric variable and a dummy variable from an OLS regression.

## 12.1 Introduction to Dummy Variables for Regression Analysis

You will remember dummy variables from when they were introduced in Module 6. There we discussed both how to interpret and to generate this type of variable. If you have any uncertainty about what this type of variable measures, please make sure you review that module.

Here we will discuss including qualitative variables as explanatory variables in a linear regression model.

Imagine that we want to include a new explanatory variable in our multivariate regression from Module 10 which indicates whether an individual represented by a given observation is female. To do this we will need to include a new dummy variable in our regression and then interpret the coefficient on that variable from the regression results.

For this module we will again be using the “fake_data” data set. Recall that this data set simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

In Module 5 we showed how to create new variables. Here, we are generating a new variable based on the values of the already existing variable *earnings*.

```
fake_data <- fake_data %>%
  mutate(log_earnings = log(earnings)) # log() takes the natural logarithm
```

Let’s take a look at the data.

`glimpse(fake_data)`

As expected, *log_earnings* is a quantitative variable showing the logarithm of each value of *earnings*. We observe a variable named *sex*, but it doesn’t seem to be coded as a numeric variable. Notice that next to *sex* it says `<chr>`.

This confirms that *sex* is a string variable rather than a numeric one. We cannot use a string variable directly in a regression analysis; we have to create a new variable which indicates the sex of the individual represented by the observation.

A dummy variable is a numeric variable that takes either the value of 0 or 1 depending on a condition. A very simple way to create different categories for a variable in R is to use the `as.factor()` function.

`as.factor(fake_data$sex)`
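To see what `as.factor()` does, here is a minimal sketch on a small made-up vector. The name `toy_sex` is a hypothetical stand-in for `fake_data$sex`, not the real data. `model.matrix()` displays the 0/1 columns that R builds from a factor when it is used in a regression.

```r
# Hypothetical toy values standing in for fake_data$sex
toy_sex <- c("F", "M", "M", "F", "M")

# Convert the character vector into a factor (a categorical variable)
toy_factor <- as.factor(toy_sex)
levels(toy_factor) # levels are sorted alphabetically, so "F" comes first

# model.matrix() shows the design matrix lm() would use:
# an intercept column plus a 0/1 dummy for every level except the first
toy_design <- model.matrix(~ toy_factor)
toy_design
```

The single dummy column is named after the non-reference level, which is why regression output later labels coefficients by pasting the variable name and the level together.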

## 12.2 Interpreting the Coefficient on Dummy Variables

Whenever we interpret the coefficient on a dummy variable in a regression, we are making a direct comparison between the 1-category and the 0-category for that dummy. In the case of a female dummy, we are directly comparing the mean log earnings of female-identified workers against the mean log earnings of male-identified workers.

Let’s consider the regression below.

`lm(data=fake_data, log_earnings ~ as.factor(sex))`

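Notice that the regression by default used females as the reference group and only estimated a male premium. This happens because R takes the first factor level as the reference, and levels are ordered alphabetically, so "F" comes before "M". Typically, we want this to be the other way around. To change the reference group we write the code below.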

```
# Change reference level
fake_data <- fake_data %>% mutate(female = relevel(as.factor(sex), "M"))
```

`summary(lm(data=fake_data, log_earnings ~ female))`

We remember from Module 10 that the “(Intercept)” row of the R output reports the constant \(β_0\), and we know that here \(β_0 = E[logearnings_{i}|female_{i}=0]\). Therefore, the results of this regression suggest that, on average, males have log earnings of 10.8. We also know from Module 10 that

\[ \beta_1 = E[logearnings_{i}|female_{i}=1]- E[logearnings_{i}|female_{i}=0]. \]

The regression results here suggest that female-identified persons have log earnings that are on average 0.55 lower than those of male-identified persons and, as a result, female-identified persons have average log earnings of 10.8 - 0.55 = 10.25.

In other words, the coefficient on the female variable shows the mean difference in log-earnings relative to males. \(\hat{β}_1\) thus provides the measure of the raw gender gap.
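The link between the regression coefficients and the group means can be checked directly. The sketch below uses a small simulated data set (`toy`, with made-up numbers loosely based on the values quoted above, not `fake_data`) and compares the estimated coefficients with the means computed by hand.

```r
# Simulated toy data; all values are made up for illustration
set.seed(490)
toy <- data.frame(sex = rep(c("M", "F"), each = 50))
toy$log_earnings <- ifelse(toy$sex == "M", 10.8, 10.25) + rnorm(100, sd = 0.3)

# Make males the reference group, as in the text
toy$female <- relevel(as.factor(toy$sex), "M")

toy_fit <- lm(log_earnings ~ female, data = toy)

# Group means computed by hand
male_mean   <- mean(toy$log_earnings[toy$sex == "M"])
female_mean <- mean(toy$log_earnings[toy$sex == "F"])

# Intercept matches the male mean; the dummy coefficient matches the gap
coef(toy_fit)
male_mean
female_mean - male_mean
```

With a single dummy and no other controls, OLS reproduces the two group means exactly, which is why the coefficient can be read as a raw gap.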

**Note:** We are only able to state this result because the p-values for both \(\hat{β}_0\) and \(\hat{β}_1\) are less than 0.05, allowing us to reject the null hypotheses that \(β_0 = 0\) and \(β_1 = 0\) at the 95% confidence level.

The interpretation remains the same once we control for more variables, although it now holds ceteris paribus, that is, holding constant the other observables in the regression.

`summary(lm(data=fake_data, log_earnings ~ female + age))`

In this case, among people that are the same age, the gender gap is (not surprisingly) slightly smaller than in our previous regression. That is expected since previously we compared all females to all males irrespective of the composition of age groups in those two categories of workers. As we control for age, we can see that this differential decreases.

## 12.3 Dummy Variables with Multiple Categories

The logic of the previous section also holds when a variable has more than two categories, as is the case for *region*.

`lm(data=fake_data, log_earnings ~ as.factor(region))`

Notice that the sum of the five dummies in any row is equal to 1. This is because every worker is located in exactly one region. If we included all of the regional dummies in a regression we would introduce the problem of multi-collinearity: the full set of dummy variables adds up to the constant term, so they are perfectly collinear with it. Think about it this way - if a person is in region 1 (regdummy1 = 1) then we know that the person is not in region 2 (regdummy2 = 0). Therefore being in region 1 predicts not being in region 2.

We must always exclude one of the dummies. Failing to do so means falling into the **dummy variable trap** of perfect collinearity described above. To avoid this, choose one region to serve as a base level for which you will not define a dummy. This dummy variable that you exclude will be the category of reference, or base level, when interpreting coefficients in the regression. That is, the coefficient on each region dummy variable will be comparing the mean earnings of people in that region to the mean earnings of people in the one region excluded.
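The dummy variable trap can be seen in a small example. The sketch below uses a hypothetical three-region toy data set with made-up outcome values (`toy`, `reg1`, `reg2`, `reg3` and `y` are illustrative names, not variables in `fake_data`) and builds the dummies by hand to show what happens when none is excluded.

```r
# Hypothetical toy data: 3 regions, made-up outcome values
toy <- data.frame(
  region = as.factor(c(1, 1, 2, 2, 3, 3)),
  y      = c(10.1, 10.3, 9.8, 9.9, 10.6, 10.4)
)

# Build one 0/1 dummy per region by hand
toy$reg1 <- as.integer(toy$region == "1")
toy$reg2 <- as.integer(toy$region == "2")
toy$reg3 <- as.integer(toy$region == "3")

# The dummies sum to 1 in every row, duplicating the intercept column
rowSums(toy[, c("reg1", "reg2", "reg3")])

# With an intercept and all three dummies, lm() cannot estimate the last one
trap_fit <- lm(y ~ reg1 + reg2 + reg3, data = toy)
coef(trap_fit) # the coefficient on reg3 is reported as NA

# Dropping one dummy (here reg1, the base level) removes the problem
ok_fit <- lm(y ~ reg2 + reg3, data = toy)
coef(ok_fit)
```

When a factor is passed to `lm()` directly, R drops the first level for us, so this situation mostly arises when dummies are created manually.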

We have actually already seen this approach in action in the regression we ran above; there we didn’t add a separate dummy variable for “male”. Instead, we essentially excluded the male dummy variable and interpreted the coefficient on “female” as the difference between female and male log-earnings.

You may have noticed that R drops the first region dummy (region = 1) and includes dummy variables for the regions 2 - 5.

We can use the same trick as the previous section to change the reference group! Let’s change the reference group to 3.

`fake_data <- fake_data %>% mutate(region = relevel(as.factor(region), 3))`

`summary(lm(data = fake_data, log_earnings ~ region))`

When interpreting the coefficients in the regression above, our intercept is again the mean log earnings among those for which all dummies in the regression are 0; here, that is the mean log earnings for all people in region 3. Each individual coefficient gives the difference in average log earnings between people in that region and people in region 3. For instance, the mean log earnings in region 1 are about 0.012 higher than in region 3, and the mean log earnings in region 2 are about 0.017 lower than in region 3. Both of these differences are statistically significant at a high level (> 99%).

It follows from this logic of interpretation that we can compare mean earnings among non-reference groups. For example, the mean log earnings in region 3 are given by the intercept coefficient: about 10.49. Since the mean log earnings in region 1 are about 0.012 higher than this, they must be about 10.49 + 0.012 = 10.502. In region 2, the mean log earnings are similarly about 10.49 - 0.017 = 10.473. We can thus conclude that the mean log earnings in region 1 are about 10.502 - 10.473 = 0.029 higher than in region 2. In this way, we compared the levels of the dependent variable for 2 dummy variables, neither of which is the reference group excluded from the regression.

We could have compared these groups much more quickly by comparing their deviations from the base group. Region 1 has mean log earnings about 0.012 above the reference level, while region 2 has mean log earnings about 0.017 below this same reference level; thus, region 1 should have mean log earnings about 0.012 - (-0.017) = 0.029 above region 2.
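The comparison between two non-reference groups can be sketched in code. The example below uses simulated toy data with three hypothetical regions (all numbers are made up and are not the `fake_data` estimates) and checks that the difference between two region coefficients equals the difference between the two group means.

```r
# Simulated toy data with three regions; all values are made up
set.seed(12)
toy <- data.frame(region = rep(1:3, each = 40))
toy$log_earnings <- 10.5 +
  0.1 * (toy$region == 1) -
  0.1 * (toy$region == 2) +
  rnorm(120, sd = 0.2)

# Use region 3 as the reference group, as in the text
toy$region <- relevel(as.factor(toy$region), "3")
toy_fit <- lm(log_earnings ~ region, data = toy)

# Gap between region 1 and region 2 from the two coefficients
gap_from_coefs <- coef(toy_fit)["region1"] - coef(toy_fit)["region2"]

# The same gap computed directly from the group means
gap_from_means <- mean(toy$log_earnings[toy$region == "1"]) -
  mean(toy$log_earnings[toy$region == "2"])

gap_from_coefs
gap_from_means
```

This only recovers the point estimate of the gap; judging its statistical significance would require a separate test or a regression with one of the two regions as the reference group.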

## 12.4 Interactions

It is an established fact that a wage gap exists between male and female workers. However, it is possible that the wage gap changes depending on the age of the workers. For example, female and male high school students tend to work minimum wage jobs, hence we might believe that the wage gap between people within the 15-18 age bracket is very small. Conversely, once people have the experience to start looking for better paying jobs, we might believe the wage gap starts to increase, meaning that this gap might be much larger in higher age brackets. This means that the wage gap between males and females may also vary as age increases. The way to capture that differential effect of age across males and females is to create a new variable that is the product of the female dummy and age.

Whenever we do this it is *very important* that we also include both the female dummy and age as control variables. Luckily, when we regress *log_earnings* on _female*age_, R does this for us: in a formula, `female * age` expands to `female + age + female:age`, so the regression includes the female dummy, age, and their interaction without inducing the dummy variable trap.

`summary(lm(data=fake_data, log_earnings ~ female * age))`

We can see that, on average, people who are identified as female have log earnings about 0.27 lower than those identified as male at the baseline where age equals zero; in a model with an interaction, the coefficient on female alone measures the gap at that baseline. We can also see that each additional year of age increases log-earnings by about 0.013 for the reference category (males). This effect of age on log-earnings is lower for females by 0.007, meaning that an extra year of age increases log earnings for women by about 0.013 + (-0.007) = 0.006. It thus seems that our theory is correct: the wage gap between males and females of the same age increases as they get older. Each extra year widens the gap by about 0.007, so the gap between a man and a woman who are both 20 is smaller than the gap between a man and a woman who are both 50. We can also see from the statistical significance of the coefficient on our interaction term that it was worth including!
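To make the age-dependent gap concrete, here is a minimal sketch on simulated data. The slopes used to generate the toy data (0.013, -0.27, -0.007) are borrowed from the estimates quoted above, while the intercept of 10, the noise level, and the helper `gap_at()` are assumptions made purely for illustration.

```r
# Simulated toy data; the intercept and noise level are made up
set.seed(1)
n <- 200
toy <- data.frame(
  age = sample(18:60, n, replace = TRUE),
  sex = sample(c("M", "F"), n, replace = TRUE)
)
toy$female <- relevel(as.factor(toy$sex), "M")
is_f <- as.integer(toy$female == "F")
toy$log_earnings <- 10 + 0.013 * toy$age - 0.27 * is_f -
  0.007 * toy$age * is_f + rnorm(n, sd = 0.1)

# female * age expands to female + age + female:age
toy_fit <- lm(log_earnings ~ female * age, data = toy)
names(coef(toy_fit))

# Hypothetical helper: estimated female-male gap in log earnings at a given age
b <- coef(toy_fit)
gap_at <- function(a) unname(b["femaleF"] + b["femaleF:age"] * a)

gap_at(20) # gap among 20-year-olds
gap_at(50) # gap among 50-year-olds (more negative when the interaction is negative)
```

The gap at any age is the female coefficient plus the interaction coefficient times that age, which is why the female coefficient on its own describes the gap only at age zero.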

Try this yourself below with the set of region dummies we created above, and think about what these results mean!

`summary(lm(data=fake_data, log_earnings ~ female * region))`

## 12.5 Wrap Up

There are very few empirical research projects using micro data that do not require researchers to use dummy variables. Important qualitative measures such as marital status, immigration status, occupation, industry, and race always require that we use dummy variables. Other important variables such as education, income, age and number of children often require us to use dummy variables, even when they are measured using ranked categorical variables. For example, we could have a variable that measures years of education as a continuous variable. However, we might instead want to include a variable that indicated if the person has a university degree. If that is the case we can use `as.factor()` to create a dummy variable indicating that level of education.
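As a rough sketch of the education example, the code below builds a degree indicator from a hypothetical `educ_years` variable. Both the variable name and the 16-year cutoff are assumptions for illustration; neither comes from `fake_data`.

```r
# Hypothetical toy data: years of education for five people
toy <- data.frame(educ_years = c(10, 12, 16, 18, 14))

# Assumed rule for illustration: 16 or more years counts as a university degree
toy$university <- as.integer(toy$educ_years >= 16)

# The 0/1 variable can also be stored as a factor for use in lm()
toy$university_f <- as.factor(toy$university)
toy
```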

Even empirical research projects that use macro data sometimes require that we use dummy variables. For example, we might have a data set that measures macro variables for African countries, including information about historic colonization. We might want to create dummy variables that indicate the origin of the colonizers, and then include that in our analysis to understand that effect. As another example, we might have a time series data set and want to indicate whether or not a specific policy was implemented in any one time period. We will need a dummy variable for that, and can include one in our analysis using the same process described above. Finally, we can use interaction terms to capture the effect of one variable on another if we believe that it varies between groups. If the coefficient on this interaction term is statistically significant, it is justified that this term be included in our regression for analysis. This impacts our interpretation of coefficients in the regression.
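The policy example can be sketched in the same way. The code below assumes a hypothetical yearly series with a `year` column; the 1982-2012 range and the 2003 start date mirror the description of the training program earlier in this module, but the variable names are illustrative.

```r
# Hypothetical yearly series covering the period described in the text
toy <- data.frame(year = 1982:2012)

# Dummy equal to 1 in every year from 2003 onward, 0 before
toy$post_policy <- as.integer(toy$year >= 2003)

# Count how many years fall before and after the policy
table(toy$post_policy)
```

Such a dummy can enter a regression on its own or be interacted with another variable using the `*` operator shown in section 12.4.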

Create dummy variables and/or interaction terms with any data set that you have downloaded in R as you see fit. You will find that this approach is not complicated, but has the power to yield meaningful results!