import stata_setup
'C:\Program Files\Stata18/','se') stata_setup.config(
ECON 490: Using Dummy Variables and Interactions (13)
Prerequisites
- Importing data into Stata.
- Examining data using
browse
andcodebook
. - Creating new variables using the commands
generate
andtabulate
. - Using globals in your analysis.
- Understanding linear regression analysis.
Learning Outcomes
- Understand when a dummy variable is needed in analysis.
- Create dummy variables from qualitative variables with two or more categories.
- Interpret coefficients on a dummy variable from an OLS regression.
- Interpret coefficients on an interaction between a numeric variable and a dummy variable from an OLS regression.
>>> import sys
>>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module - 1, Section 1.5.1: Setting Up PyStata
>>> from pystata import config
>>> config.init('se')
13.1 Introduction to Dummy Variables for Regression Analysis
You will remember dummy variables from when they were introduced in Module 6. There we discussed both how to interpret and how to generate this type of variable. If you have any uncertainty about what dummy variables measure, please make sure you review that module.
Here we will discuss including qualitative variables as explanatory variables in a linear regression model.
Imagine that we want to include a new explanatory variable in our multivariate regression from Module 11 that indicates whether an individual is identified as female. To do this we need to include a new dummy variable in our regression.
For this module we again will be using the fake data data set. Recall that this data is simulating information for workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
%%stata
** Below you will need to include the path on your own computer to where the data is stored between the quotation marks.
*
clear ** cd " "
use fake_data,clear
In Module 6 we introduced the command gen
(or generate
). It is used to create new variables. Here, we are generating a new variable based on the values of the already existing variable earnings.
%%stata
= log(earnings) gen logearnings
Let’s take a look at the data.
%%stata
%browse 10
As expected, logearnings is a quantitative variable showing the logarithm of each value of earnings. We observe a variable named sex, but it doesn’t seem to be coded as a numeric variable. Let’s take a closer look:
%%stata
codebook sex
As expected, sex is a string variable and is not numeric. We cannot use a string variable in a regression analysis; we have to create a new variable which indicates the sex of the individual represented by the observation in numeric form.
A dummy variable is a numeric variable that takes either the value of 0 or 1 depending on a condition. In this case, we want to create a variable that equals 1 whenever a worker is identified as “female”. We have seen how to do this in previous notebooks.
%%stata
= sex == "F" gen female
13.2 Interpreting the Coefficient on a Dummy Variable
Whenever we interpret the coefficient on a dummy variable in a regression, we are making a direct comparison between the 1-category and the 0-category for that dummy. In the case of this female dummy, we are directly comparing the mean earnings of female identified workers against the mean earnings of male identified workers.
Let’s consider the regression below.
%%stata
reg logearnings female
We remember from Module 11 that “_cons” is the constant \(β_0\), and we know that here \(β_0 = E[logearnings_{i}|female_{i}=0]\). Therefore, the results of this regression suggest that on average, males have log earnings of 10.8. We also know from the Module 11 that
\[ \beta_1 = E[logearnings_{i}|female_{i}=1]- E[logearnings_{i}|female_{i}=0]. \]
The regression results here suggest that female identified persons earn on average 0.55 less than male identified persons. As a result, female identified persons earn on average 10.8 - 0.55 = 10.25.
In other words, the coefficient on the female variable shows the mean difference in log-earnings relative to males. \(\hat{β}_1\) thus provides the measure of the raw gender gap.
Note: We are only able to state this result because the p-value for both \(\hat{β}_0\) and \(\hat{β}_1\) is less than 0.05, allowing us to reject the null hypothesis that \(β_0 = 0\) and \(β_1 = 0\) at 95% confidence level.
The interpretation remains the same once we control for more variables, although it is ceteris paribus (holding constant) the other observables now also included in the regression. An example is below.
%%stata
reg logearnings female age
In this case, among people that are the same age, the gender gap is (not surprisingly) slightly smaller than in our previous regression. That is expected since previously we compared all females to all males irrespective of the composition of age groups in those two categories of workers. As we control for age, we can see that this differential decreases.
13.3 Dummy Variables with Multiple Categories
In this data set we also have a region variable that has 5 different regions. As in Module 6, we can create dummies for each category using tabulate
.
First, we tabulate
the categorical variable we want to make into a set of dummy variables. Then we use the option gen
to create five new dummy variables for the 5 regions represented in the data.
%%stata
tab region, gen(regdummy)
%%stata
%browse 10
Notice that the sum of the five dummies in any row is equal to 1. This is because every worker is located in exactly one region. If we included all of the regional dummies in a regression, we would introduce the problem of perfect collinearity: the full set of dummy variables are perfectly correlated. Think about it this way - if a person is in region 1 (regdummy1 = 1) then we know that the person is not in region 2 (regdummy2 = 0). Therefore being in region 1 predicts not being in region 2.
We must always exclude one of the dummies. Failing to do so means falling into the dummy variable trap of perfect collinearity described above. To avoid this, choose one region to serve as a base level for which you will not define a dummy. This dummy variable that you exclude will be the category of reference, or base level, when interpreting coefficients in the regression. That is, the coefficient on each region dummy variable will be comparing the mean earnings of people in that region to the mean earnings of people in the one region excluded.
We have actually already seen this approach in action in the regression we ran above; there we didn’t add a separate dummy variable for “male”. Instead, we excluded the male dummy variable and interpreted the coefficient on “female” as the difference between female and male log-earnings.
The easiest way to include multiple categories in a regression is to write the list of variables using the notation i.variable
. Below you will see that Stata drops the first region dummy (region = 1) and includes dummy variables for the regions 2 - 5. In this way, Stata automatically helps us avoid the dummy variable trap.
%%stata
reg logearnings i.region
Often we will want to control which dummy variable is selected as the reference or base level category. If that is the case, we first have to control the reference dummy variable using the command fvset base
. We do this below by setting the base level category to be region 3.
%%stata
3 region fvset base
When you run the regression below, the reference is now region 3 and not region 1.
%%stata
reg logearnings i.region
Of course, we could also create a new global
as was learned in Module 4 that includes all of the dummy variables and includes that in the regression. Here is an example of what that would look like:
%%stata
global regiondummies "regdummy1 regdummy2 regdummy4 regdummy5"
reg logearnings ${regiondummies}
When interpreting the coefficients in the regression above, our intercept is again the mean log earnings among those for which all dummies in the regression are 0; here, that is the mean earnings for all people in region 3. Each individual coefficient gives the difference in average log earnings among people in that region and in region 3. For instance, the mean log earnings in region 1 are about 0.012 higher than in region 3 and the mean log earnings in region 2 are about 0.017 lower than in region 3. Both of these differences are statistically significant at a high level (> 99%).
It follows from this logic of interpretation that we can compare mean earnings among non-reference groups. For example, the meaning log earnings in region 3 are given by the intercept coefficient: about 10.49. Since the mean log earnings in region 1 are about 0.012 higher than this, they must be about 10.49 + 0.012 = 10.502. In region 2, the mean log earnings are similarly about 10.49 - 0.017 = 10.473. We can thus conclude that the mean log earnings in region 1 are about 10.502 - 10.473 = 0.029 higher than in region 2. In this way, we compared the levels of the dependent variable for 2 dummy variables, neither of which are in the reference group excluded from the regression. We could have much more quickly compared the levels of these groups by comparing their deviations from the base group. Region 1 has mean log earnings about 0.012 above the reference level, while region 2 has mean log earnings about 0.017 below this same reference level; thus, region 1 has mean log earnings about 0.012 - (-0.017) = 0.029 above region 2.
13.3.1 Dummy variables with many multiple categories
In your project, it may happen that a variable has many different categories. This issue is often referred to as high-dimensional fixed effects. Going back to our fictional dataset, imagine the case where we have data for all workers in the United States and we know the municipality in which they work. If that was the case, the variable municipality would take roughly 19,000 different values. To see how earnings vary by municipality, we would have to create 19,000-1 dummy variables. Using the approach described above would work in principle, but in practice it would require substantial computing power.
What can we do then? Luckily for us, there is a package that deals exactly with this issue. The package is called reghdfe
and needs to be installed with the command ssc install reghdfe
.
Using the package is very easy. The syntax is reghdfe depvar indepvars, absorb(fixedeffects)
, where depvar is our dependent variable of interest, indepvar is a list of explanatory variables, and fixedeffects is a list of variables for which we would like to create dummies.
To see how it works in practice, let’s say we want to study how earnings change with age for all regions. The code would then be reghdfe logearnings age, absorb(region)
.
%%stata
* Install reghdfe
ssc install reghdfe
%%stata
* Estimate the model
reghdfe logearnings age, absorb(region)
In practice, using reghdfe
is equivalent to asking Stata to create four dummy variables for region and use them as additional explanatory variables. As a matter of fact, reghdfe logearnings age, absorb(region)
produces the same results as reg logearnings age i.region
. You can check it by running the code below. Notice that by default reghdfe
suppresses the coefficients associated to each dummy variable for region.
%%stata
reg logearnings age i.region
13.4 Interactions
It is an established fact that a wage gap exists between male and female workers. However, it is possible that the wage gap changes depending on the age of the workers. For example, female and male high school students tend to work minimum wage jobs; hence, we might believe that the wage gap between people within the 15-18 age bracket is very small. Conversely, once people have the experience to start looking for better paying jobs, we might believe the wage gap starts to increase, meaning that this gap might be much larger in higher age brackets. Similarly, the wage gap between males and females may also vary as age increases. The way to capture that differential effect of age across males and females is to create a new variable that is the product of the female dummy and age.
Whenever we do this it is very important that we also include both the female dummy and age as control variables.
To run this in Stata, categorical variables must be preceded by a i.
, continuous variables must be preceded by c.
and terms are interacted with the ##
symbol. For our example, we have the categorical variable i.female
interacted with continuous variable c.age
and the regression looks like this:
%%stata
##c.age reg logearnings i.female
Notice that Stata automatically includes the female and age variables as dummy variables for controls. From our results, we can see that, on average, people who are identified as female earn about 0.27 less than those identified as male, holding age constant. We can also see that each additional year of age increases log-earnings by about 0.013 for the reference category (males). This affect of age on log-earnings is lower for females by 0.007, meaning that an extra year of age increase log earnings for women by about 0.013 + (-0.007) = 0.006. It thus seems that our theory is correct: the wage gap between males and females of the same age increases as they get older. For men and women who are both 20, an extra year will be associated with the man earning a bit more than the woman on average. However, if the man and woman are both 50, an extra year will be associated with the man earning much more than the woman on average (or at least out-earning her by much more than before). We can also see from the statistical significance of the coefficient on our interaction term that it was worth including!
Try this yourself below with the set of region dummies we created above. Think about what these results mean.
%%stata
##i.region reg logearnings i.female
13.5 Wrap Up
There are very few empirical research projects using micro data that do not require researchers to use dummy variables. Important qualitative measures such as marital status, immigration status, occupation, industry, and race always require that we use dummy variables. Other important variables such as education, income, age and number of children often require us to use dummy variables even when they are sometimes measured using ranked categorical variables. For example, we could have a variable that measures years of education which is included as a continuous variable. However, you might instead want to include a variable that indicates if the person has a university degree. If that is the case, you can use generate
to create a dummy variable indicating that specific level of education.
Even empirical research projects that use macro data sometimes require that we use dummy variables. For example, you might have a data set that measures macro variables for African countries with additional information about historic colonization. You might want to create a dummy variable that indicates the origin of the colonizers, and then include that in your analysis to understand that effect. As another example, you might have a time series data set and want to indicate whether or not a specific policy was implemented in any one time period. You will need a dummy variable for that, and can include one in your analysis using the same process described above. Finally, you can use interaction terms to capture the effect of one variable on another if you believe that it varies between groups. If the coefficient on this interaction term is statistically significant, it can justify this term’s inclusion in your regression. This impacts your interpretation of coefficients in the regression.
Try this yourself with any data set that you have downloaded in Stata. You will find that this approach is not complicated, but has the power to yield meaningful results!
The table below summarizes the main commands we have studied in this module:
Command | Function |
---|---|
reg depvar indepvar i.var |
It adds dummy variables for multiple categories of the categorical variable var in a regression. |
reghdfe depvar indepvar, absorb(vars) |
It adds dummy variables for multiple categories of variables vars. It is particularly efficient when vars takes on many different values. |
reg depvar var1#var2 |
It adds an interaction term between variable var1 and var2 in a regression. |
reg depvar var1##var2 |
It adds the interaction between var1 and var2 as well as var1 and var2 themselves to the regression. reg depvar var1##var2 is the same as reg depvar var1 var2 var1#var2 . |
13.6 Video tutorial
Click on the image below for a video tutorial on this module.
References
Use factor variables in Stata to estimate interactions between two categorical variables