05 - ECON 326: Interactions and Non-linear Terms in Regressions

introduction

econ 326

multiple regression

regression

collinearity

controls

heteroskedasticity

How do we specific non-linear models? Why would we want to do so? This notebooks covers non-linear and interaction models, including marginal effects and related issues.

Author

COMET Team
Emrul Hasan, Jonah Heyl, Shiming Wu, William Co, Jonathan Graves

Published

8 December 2022

Outline

Prerequisites

Multiple regression
Simple regression
Data analysis and introduction

Outcomes

In this worksheet, you will learn:

How to incorporate interaction terms into a regression analysis
How to interpret models with interaction terms
How to create models which include non-linear terms
How to compute simple marginal effects for models with non-linear terms
How to explain polynomial regressions as approximations to a non-linear regression function

Notes

¹Statistics Canada, Survey of Financial Security, 2019, 2021. Reproduced and distributed on an “as is” basis with the permission of Statistics Canada.Adapted from Statistics Canada, Survey of Financial Security, 2019, 2021. This does not constitute an endorsement by Statistics Canada of this product.

²Stargazer package is due to: Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.3. https://cran.r-project.org/web/packages/stargazer/index.html

library(tidyverse)
library(haven)
library(dplyr)
library(scales)
library(stargazer)
library(car)

source("functions5.r")

SFS_data <- read_dta("../datasets/SFS_2019_Eng.dta")
SFS_data <- clean_up_data()

Part 1: Interactions in Regression Models

One of the most common extensions to multiple regression models is to include interaction terms. What is an interaction term? It’s basically a term which represents the product of two (or more) variables in a model.

For example, if we have a dummy variable for being a female (\(F_i\)) and a dummy variable for having a university degree (\(D_i\)), the interaction of these two variables is the product \(D_i \times F_i\). This can seem complicated but it also has a simple interpretation: it is now a dummy for being both a female and having a university degree. You can see why this is true:

\[ D_i \times F_i = 1 \iff D_i = 1 \text{ and } F_i = 1 \]

This is why these terms are so important for understanding regressions: they provide us with a simple way to describe and study how combinations of our explanatory variables impact our model. These variables enter into our regression models in exactly the same way as usual:

\[ Y_i = \beta_0 + \beta_1 F_i + \beta_2 D_i + \beta_3 D_i \times F_i + \epsilon_i \]

At this point, you can see that this is just a multiple regression model - the only difference is that one of the variables is a combination of the other variables. From an estimation perspective, there’s no issue - you can use OLS to estimate a model with interaction terms, just like normal. However, as we have seen, there are important differences when it comes to the interpretation of these models. Let’s learn more about this in this worksheet.

There are (in general) two ways to create interactions in R: (i) manually (i.e. creating a new variables which is \(D_i \times F_i\) then adding it to the regression), or (ii) using the built-in tools in R. However, method (i) is a trap! You should never use this method. Why? There are two reasons:

The main reason is that R (and you, the analyst) lose track of the relationship between the created interaction variable and the underlying variables. This means that you can’t use other tools to analyze these relationship (there are many packages such as margins which allow you to investigate complex interaction) which is a big loss. You also can’t perform post-regression analysis on the underlying variables in a simple way anymore.
The second reason is that it’s easy to make mistakes. You might define the interaction incorrectly (possible!). However, it’s more of an issue if later on you change the underlying variables and then forget to re-compute the interactions. It also makes your code harder to read.

Bottom line: don’t do it. Interaction in R are easy to create: you simply use the : or * operator when defining an interaction term.

The : operator creates the interaction(s) of the two variables in question
The * operation creates the interactions(s) and the main effects of the variables as well

Even better: if you are interacting two qualitative (factor) variables, it will automatically “expand” the interaction into every possible combination of the variables. A lot less work!

For example, let’s look at a regression model which interacts gender and education. Before we run regression, let’s first summarize education into ‘university’ and ‘non-university’.

SFS_data <- SFS_data %>% #cretes a Education dummy variable
               mutate( 
               Education = case_when(
                     education == "University" ~ "University", #the ~ seperates the original from the new name
                     education == "Non-university post-secondary" ~ "Non-university",
                     education == "High school" ~ "Non-university",
                     education == "Less than high school" ~ "Non-university")) %>%
             mutate(Education = as_factor(Education)) #remember, it's a factor!

regression1 <- lm(wealth ~ gender + Education + gender:Education, data = SFS_data)

#regression1 <- lm(wealth ~ gender*Education, data = SFS_data) #an alternative way to run the same regression

summary(regression1)

There are a few important things to notice about this regression result. First, take a close look at the terms:

genderFemale this is the main effect for being a female. You might immediately say that this is the impact of being a female - but this is not true. Why? Because female shows up in two places! We have to be a little more careful - this is the effect of being a female in the base group (non-university)
genderFemale:EducationUniversity this is the interaction effect of being a female and having a university degree. Basically, family with female (university degree) as main earner accumulates \(143,396+324,112=467,508\) less wealth, compared with male counterpart.

You can see this interpretation in the regression model itself:

\[ W_i = \beta_0 + \beta_1 F_i + \beta_2 D_i + \beta_3 F_i \times D_i + \epsilon_i \]

Consider:

\[ \frac{\Delta W_i}{\Delta F_i} = \beta_1 + \beta_3 D_i \]

The marginal effect of being a female-lead household changes depending on what the value of \(D_i\) is! For non-university degree (the level where \(D_i = 0\)) it’s \(\beta_1\). For university degree (the level where \(D_i =1\)), it’s \(\beta_1 + \beta_3\). This is why, in an interaction model, it doesn’t really make sense to talk about the “effect of female” - because there isn’t a single, immutable effect. It is different for different education degrees!

You can talk about the average effect, which is just \(\beta_1 + \beta_3 \bar{D_i}\) - but that’s not really what people are asking about when they are discussing the gender effect, in general.

This is why it’s very important to carefully think about a regression model with interaction terms - the model may seem simple to estimate, but the interpretation is more complex.

Interactions with Continuous Variables

So far, we have just looked at interacting qualitative variables - but you can interact any types of variables!

Qualitative-Qualitative
Qualitative-Quantitative
Quantitative-Quantitative

The format and syntax in R is similar, with some small exceptions to deal with certain combinations of variables. However (again), you do need to be careful with interpretation.

For example, let’s look at the interaction of income and sex on wealth. In a regression equation, this would be expressed like:

\[ W_i = \beta_0 + \beta_1 Income_i + \beta_2 F_i + \beta_3 Income_i \times F_i + \epsilon_i \]

Notice that, just like before:

\[ \frac{\partial W_i}{\partial Income_i} = \beta_1 + \beta_3 F_i \]

There are two different “slope” coefficients; basically, male and female lead family can have a different return to wealth. Let’s see this in R:

regression2 <- lm(wealth ~ income_before_tax + gender + income_before_tax:gender, data = SFS_data)

summary(regression2)

As we can see here, the female-lead households in our model accumulate about 3.946 dollars more in wealth per dollar of income earned than male-lead respondents. But female-lead households accumulate 343,300 dollars less than male counterparts. So the overall effects depend on average income before tax.

This addresses the common problem of estimating a regression model where you think the impact of a continuous variable might be different across the two groups. One approach would be to run the model only for men, and only for women, and then compare - but this isn’t a good idea. Those regressions have a much smaller sample size, and if you have other controls in the model, you will “throw away” information. The interaction method is much better.

Part 2: Non-linear Terms in Regression Models

You might have been puzzled by why these models were called “linear” regressions. The reason is because they are linear in the coefficients: the dependent variable is expressed as a linear combination of the explanatory variables.

This implies that we can use the same methods (OLS) to estimate models that including linear combinations of non-linear functions of the explanatory variables. We have actually already seen an example of this: remember using log of a variable? That’s a non-linear function!

As we learned when considering log, the most important difference here is again regarding interpretations, not the actual estimation.

In R, there is one small complication: when you want to include mathematical expressions in a model formula, you need to “isolate” then using the I() function. This is because many operations in R, like + or * have a special meaning in a regression model.

For example, let’s consider a quadratic regression - that is, including both \(Income_i\) and \(Income_i^2\) (income squared) in our model.

regression3 <- lm(wealth ~ income_before_tax + I(income_before_tax^2), data = SFS_data)

summary(regression3)

As you can see, we get regression results much like we would expect. However, how do we interpret them? The issue is that income enters into two places. We need to carefully interpret this model, using our knowledge of the equation:

\[ W_i = \beta_0 + \beta_1 Income_i + \beta_2 Income_i^2 + \epsilon_i \]

\[ \implies \frac{\partial W_i}{\partial Income_i} = \beta_1 + 2 \beta_2 Income_i \]

You will notice something special about this; the marginal effect is non-linear. As \(Income_i\) changes, the effect of income on \(W_i\) changes. This is because we have estimated a quadratic relationship; the slope of a quadratic changes as the explanatory variable changes. That’s what we’re seeing here!

This makes these models relatively difficult to interpret, since the marginal effects change (often dramatically) as the explanatory variables change. You frequently need to carefully interpret the model and often (to get estimates) perform tests on combinations of coefficients, which can be done using things like the car package or the lincom function. You can also compute this manually, using the formula for the sum of variances.

For example, let’s test if the marginal effect of income is significant at \(Income_i = \overline{Income}_i\). This is the most frequently reported version of this effects, often called the “marginal effect at the means”.

m <- mean(SFS_data$income_before_tax)

linearHypothesis(regression3, hypothesis.matrix = c(0, 1, 2*m), rhs=0)

As we can see, it is highly significant

Think Deeper: what is the vector c(0, 1, 2*m) doing in the above expression?

Let’s see exactly what those values are. Recall the formula:

\[ V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2abCov(X,Y) \]

In our situation, \(X = Y = W_i\), so this is:

\[ V(\beta_1 + 2\bar{W_i}\beta_2) = V(\beta_1) + 4\bar{W_i}^2V(\beta_2) + 2(2\bar{W_i})Cov(\beta_1,\beta_2) \]

Fortunately, these are all things we have from the regression and its variance-covariance matrix:

v <- vcov(regression3)
coefs <- regression3$coefficients
v

var <- v[2,2] + 4*(m^2)*v[3,3] + 4*m*v[3,2]

var

coef <-  coefs[[2]] + 2*m*coefs[[3]]

print("Coefficent Combination and SD")
round(coef,3)
round(sqrt(var),3)

As you can see, this gets fairly technical and is not something you will want to do without a very good reason. In general, it’s a better idea to rely on some of the packages written for R that handle this task for the (specific) model you are interested in evaluating.

Aside: Why Polynomial Terms?

You might be wondering why econometricians spend so much time talking about models that included polynomial terms, when those are (realistically) a very small set of the universe of possible functions of an explanatory variable (you already know why we talk about log so much!).

The reason is actually approximation. Consider the following non-linear model:

\[ Y_i = f(X_i) + e_i \]

This model is truly non-linear (and not just in terms of the parameters). How can we estimate this model? It’s hard! There are techniques to estimate complex models like this, but how far can we get with good-old OLS? The answer is - provided that \(f\) is “smooth” - pretty far.

Think back to introductory calculus; you might remember a theorem called Taylor’s Theorem. It says that a smoothly differentiable function can be arbitrarily well-approximated (about a point) by a polynomial expansion:

\[ f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots + \frac{f^{(k)}(a)}{k!}(x-a)^k + R_k(x) \]

and the error term \(R_k(x) \to 0\) as \(x \to a\) and \(k \to \infty\).

Look closely at this expression. Most of the terms (like \(f'(a)\)) are constants. In fact, you can show that this can be written like:

\[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + r \]

Putting this into our expression above gives us the relationship:

\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \cdots + \beta_k X_i^k+ \epsilon_i \]

Which is a linear regression model! What this say is actually very important: linear regression models can be viewed as approximations to nonlinear regressions, provided we have enough polynomial terms. This is one complication: the error term is definitely not uncorrelated. You can learn more about how to address this issue in other courses, but at the most the omitted variable bias is relatively small as \(k \to \infty\).

Part 3: Exercises

In this set of exercises, you will get some hands-on experience with estimating and interpreting non-linear regression models, as well as learning to scrutinize multiple regression models for their limitations.

Activity 1

Consider the following regression model:

\[\begin{equation} W_i = \beta_1 + \beta_2 F_i + \beta_3 E_i + \beta_4 P_i + \beta_5 F_i\times E_i + \beta_6 F_i \times P_i + \epsilon_i \end{equation}\]

\(W_i\) denotes wealth, \(F_i\) is a dummy variable for the gender of main earner in the household (\(F_i=1\) if female is the main earner), \(E_i\) is a factor variable for education and \(P_i\) is a factor variable for province.

Short Answer 1

Prompt: How should we interpret the coefficients \(\beta_5\) and \(\beta_6\)? Why might these effects be important to estimate?

answer_1 <- #fill in your short answer

Now, let’s estimate the model and interpret it concretely. (Please follow the order of variables in regression model:

reg1 <- gender, Education, province

reg0 <- gender, education, province

What are your interaction variables? (remember to follow the same order)

#Quiz 1
reg1 <- lm(???, data = SFS_data)
reg0 <- lm(???, data = SFS_data)

summary(reg1)
summary(reg0)

test_1() #quiz1

Short Answer 2

Prompt: How do we interpret the coefficient estimate on gender:Education? What education level do female-lead households appear to be most discriminated in? How might we explain this intuitively?

Answer here in red

answer_2 <- #fill in your short answer

Short Answer 3

Prompt: How do you interpret the coefficient estimate on genderFemale:provinceAlberta? (Hint: Write out the average wealth equations for female, male in Alberta, and female in Alberta separately.)

Answer here in red

answer_3 <- #fill in your short answer

Now let’s test whether the returns to education increase if people are entrepreneurs. business is a factor variable which suggests whether the household owns a business. Please add terms to the regression equation that allow us to run this test. Then, estimate this new model. We don’t need province and gender variables in this exercise. For education, please use education variable. And we will continue to study wealth accumulated in households.

#Quiz 2
SFS_data$business <- relevel(SFS_data$business, ref = "No") #Do not change; makes "not a business owner" the reference level for business

reg2 <- lm(???, data = SFS_data)

summary(reg2)
test_1.5()

Short Answer 4

Prompt: Do returns to education increase when people are entrepreneurs? Explain why or why not with reference to the regression estimates.

Answer here in red

answer_4 <- #fill in your short answer

Activity 2

A topic that many labour economists are concerned with, and one that we have discussed before, is the gender-wage gap. In this activity, we will construct a “difference-in-difference” regression to explore this gap using the SFS_data2.

Suppose that we want to estimate the relationship between age, sex and wages. Within this relationship, we suspect that women earn less than men from the beginning of their working lives, but this gap does not change as workers age.

Estimate a regression model (with no additional control variables) that estimates this relationship using SFS_data2. We will use income_before_tax variable. Order: list gender before agegr.

Tested Objects: reg3A.

Let’s first simplify levels of age group using following codes.

#Some Data cleaning, Just run this!
SFS_data <- 
        SFS_data %>%
        mutate(agegr = case_when(
              age == "01" ~ "Under 30", #under 20
              age == "02" ~ "Under 30", #20-24
              age == "03" ~ "20s", #25-29
              age == "04" ~ "30s",
            age == "05" ~ "30s",
              age == "06" ~ "40s",
              age == "07" ~ "40s",
              age == "08" ~ "50s",
              age == "09" ~ "50s",
              age == "10" ~ "60s", #60-64
              age == "11" ~ "Above 65", #65-69
              age == "12" ~ "Above 65", #70-74
              age == "13" ~ "Above 75", #75-79
              age == "14" ~ "Above 75", #80 and above
              )) %>%
        mutate(agegr = as_factor(agegr))

SFS_data$agegr <- relevel(SFS_data$agegr, ref = "Under 30") #Set "Under 30" as default factor level

Let’s restrict the sample to main working groups. Just run the following line.

SFS_data2 <- subset(SFS_data, agegr == "20s" | agegr == "30s" | agegr == "40s" | agegr == "50s" | agegr == "60s" )

#Quiz 3
SFS_data2$agegr <- relevel(SFS_data2$agegr, ref = "20s") #Do not change; makes "20s" the reference level for age

reg3A <- lm(???, data = SFS_data2)

summary(reg3A)

test_2() #quiz3

Short Answer 5

Prompt: What is the relationship between age and wages? Between sex and earnings? Is there a significant wage gap? Why might the regression above not give us the “full picture” of the sex wage gap?

Answer here in red

answer_5 <- #fill in your short answer

Estimate the relationship between wages and age for male-lead households and female-lead households separately, then compare their returns to age. Let’s continue to use income_before_tax

Tested objects: reg3M (for males), reg3F (for females).

#Quiz 4
reg3M <- lm(..., data = filter(SFS_data2, gender == "Male"))
reg3F <- lm(..., data = filter(SFS_data2, gender == "Female"))

summary(reg3M)
summary(reg3F)

test_3()
test_4() #quiz4

Short Answer 6

Prompt: Do these regression estimates support your argument from Short Answer 5? Explain.

Answer here in red

answer_6 <- #fill in your short answer

Add one additional term to the multiple regression from Quiz 3 that accounts for the possibility that the sex wage gap can change as workers age. Please list gender before age.

Tested Objects: reg4.

#Quiz 5
reg4 <- lm(???, data = SFS_data2)

summary(reg4)

test_5() #quiz5

Short Answer 7

Prompt: According to the regression you estimated above, what is the nature of the sex wage gap?

Answer here in red

answer_7 <- #fill in your short answer

Theoretical Activity 1

Suppose that a team of researchers is interested in the relationship between the price of a popular children’s chocolate brand (let’s call it “Jumbo Chocolate Egg”) and its demand. The team conducts a series of surveys over a five-year period where they ask 200 households in a Vancouver neighborhood to report how many packs of Jumbo Chocolate Egg they bought in each quarter of the year. The company that produces Jumbo Chocolate Egg is interested in estimating the price elasticity of demand for their chocolate, so they changed the price of a pack of chocolate each quarter over this period. This survey is voluntary - the team went door-to-door to gather the data, and people could refuse to participate.

After they compile a dataset from the survey responses, the team estimates this model:

\[ Q_i^2 = \alpha_1 + \alpha_2 ln(P_i) + \alpha_3 H_i + \epsilon_i \]

\(Q_i\) denotes the quantity of chocolate packs that household i purchased in a given quarter of a given year. That is, each quarter for a given household is a separate observation. \(P_i\) is the price of the pack of chocolate in the given quarter, and \(H_i\) is the household size (in number of people). Note that \(\hat{\alpha_2}\) is supposed to be the estimated elasticity of demand.

You join the team as a research advisor - in other words, you get to criticize their project and get paid doing so. Sounds great, but you have a lot of work ahead.

Short Answer 8

Prompt: Are there any omitted variables that the team should be worried about when estimating the model? Give 2 examples of such variables if so, and explain how each variable’s omission could affect the estimated elasticity of demand using the formula that we discussed in class.

Answer here in red

answer_8 <- #fill in your short answer

Short Answer 9

Prompt: Is there anything wrong with the specification of the regression model? If so, explain how to correct it; if not, explain why the specification is correct.

Answer here in red

answer_9 <- #fill in your short answer

Short Answer 10

Prompt: Is there any potential for sample selection bias in this study? Explain by referencing specific aspects of the experiment. What effect might this bias have on the estimated elasticity of demand?

Answer here in red

answer_10 <- #fill in your short answer

Short Answer 11

Prompt: A member of your team writes in the research report that “this estimated elasticity of demand tells us about the preferences of consumers around Canada.” Do you have an issue with this statement? Why or why not?

Answer here in red

answer_11 <- #fill in your short answer