#Clear the memory from any pre-existing objects
rm(list=ls())
# loading in our packages
library(tidyverse) #This includes ggplot2!
library(haven)
library(IRdisplay)
#Open the dataset
fake_data <- read_dta("../econ490-stata/fake_data.dta")
# inspecting the data
glimpse(fake_data)
10 - Conducting Regression Analysis
Prerequisites
- Econometric approaches to linear regression taught in ECON326 or other introductory econometrics courses.
- Importing data into R.
- Creating new variables in R.
Learning Outcomes
- Implement the econometric theory for linear regressions learned in ECON326 or other introductory econometrics courses.
- Run simple univariate and multivariate regressions using the command lm().
- Understand the interpretation of the coefficients in linear regression output.
- Consider the quality of control variables in a proposed model.
10.1 A Word of Caution Before We Begin
Before conducting a regression analysis, a great deal of work must go into understanding the data and investigating the theoretical relationships between variables. The biggest mistake that students make at this stage is not in how they run the regression analysis; it is failing to spend enough time preparing the data for analysis.
Here are some common challenges that students run into. Please pay attention to these when conducting your own research project.
- A variable that is qualitative and not ranked cannot be used in an OLS regression without first being transformed into a dummy variable (or a series of dummy variables). Examples of variables that must always be included as dummy variables are sex, race, religiosity, immigration status, and marital status. Examples of variables that are sometimes included as dummy variables are education, income and age.
- You will want to take a good look at how your variables are coded before you begin running regressions and interpreting the results. Make sure that missing values are coded as NA and not as some numeric value (such as 99). Also, check that qualitative ranked variables are coded in the way you expect (e.g. higher education is coded with a larger number). If you do not do this, you could misinterpret your results (see the short sketch after this list).
- Some samples are not proper representations of the population and must be weighted accordingly (we will deal with this in depth later).
- You should always think about the theoretical relationship between your variables before you start your regression analysis: Does economic theory predict a linear relationship, independence between explanatory terms, or is there possibly an interaction at play?
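As an illustration of the first two points, here is a minimal sketch using a small made-up data set (the variable names and the 99 code are purely illustrative): we recode a numeric missing-value code to NA and turn an unranked qualitative variable into a dummy before running any regression.

toy_data <- tibble(earnings = c(30000, 45000, 52000),
                   age      = c(25, 99, 41),       # suppose 99 is this survey's code for "missing"
                   region   = c("A", "B", "B"))    # a qualitative, unranked variable

toy_data <- toy_data %>%
    mutate(age     = ifelse(age == 99, NA, age),   # missing values must be NA, not 99
           regionB = ifelse(region == "B", 1, 0))  # dummy variable: 1 if region B, 0 otherwise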
10.2 Linear Regression Models
Understanding how to run a well-structured OLS regression, and how to interpret the results of that regression, is one of the most important skills for undertaking empirical economic analysis. You have acquired a solid understanding of the theory behind the OLS regression in earlier econometrics courses; keep this in mind throughout your analysis. Here, we will cover the practical side of running regressions and, perhaps more importantly, how to interpret the results.
An econometric model is an equation (or set of equations) that imposes some structure on how the data were generated. The most natural way to summarize statistical information is the mean. Therefore, we typically model the mean of a (dependent) variable and how it depends on different factors (independent variables or covariates). The simplest way to describe the relationship between a dependent variable, $y$, and one or more independent variables, $x$, is a linear one.
Suppose we want to know what variables are needed to understand how and why earnings vary between each person in the world. What would be the measures needed to predict everyone’s earnings?
Some explanatory variables might be:
- Age
- Year (e.g. macroeconomic shocks in that particular year)
- Region (local determinants on earnings)
- Hours worked
- Education
- Labor Market Experience
- Industry / Occupation
- Number of children
- Level of productivity
- Passion for their job
- etc., there are so many factors which can be included!
For simplicity, let’s assume we want to predict earnings but we only have access to data sets with information regarding people’s age and earnings. If we want to generate a model which predicts the relationship between these two variables, we could create a linear model where the dependent variable (y) is annual earnings, the independent variable (x) is age, the slope (m) is how much an extra year of age affects earnings, and the y-intercept (b) is earnings when age is equal to 0. We would write this relationship as:

$$ y = mx + b $$
We only have access to annual earnings and age, so we are unable to observe the rest of the variables (independent variables or covariates) that also determine earnings. Everything we cannot observe is collected into a single term, $\varepsilon_i$, called the error term. Writing the intercept as $\beta_0$ and the slope as $\beta_1$, the model for worker $i$ becomes

$$ earnings_i = \beta_0 + \beta_1 \, age_i + \varepsilon_i \tag{1} $$

Where $\beta_0$ plays the role of $b$, $\beta_1$ plays the role of $m$, and $\varepsilon_i$ captures all the determinants of earnings other than age.

It’s important to understand what these coefficients represent. Since the most natural way to summarize the data is the mean, the object we are really describing is the conditional expectation function (CEF): the mean of earnings at each value of age,

$$ E[\, earnings_i \mid age_i \,] \tag{2} $$

How do equations (1) and (2) relate? If we take the expectation given age on equation (1), we can see that

$$ E[\, earnings_i \mid age_i \,] = \beta_0 + \beta_1 \, age_i + E[\, \varepsilon_i \mid age_i \,] $$

and, assuming that the unobservables have mean zero at every age ($E[\varepsilon_i \mid age_i] = 0$), this will leave us with

$$ E[\, earnings_i \mid age_i \,] = \beta_0 + \beta_1 \, age_i $$

If $age_i = a$, then $E[\, earnings_i \mid age_i = a \,] = \beta_0 + \beta_1 a$.

If $age_i = a + 1$, then $E[\, earnings_i \mid age_i = a + 1 \,] = \beta_0 + \beta_1 (a + 1)$.

Differencing the two equations above gives us the solution,

$$ \beta_1 = E[\, earnings_i \mid age_i = a + 1 \,] - E[\, earnings_i \mid age_i = a \,] $$

where $\beta_1$ is the difference in mean earnings between workers whose ages differ by one year, and $\beta_0 = E[\, earnings_i \mid age_i = 0 \,]$ is mean earnings at age zero.

If we know those conditional means, we know the coefficients.
This is the intuition that we should follow to interpret the coefficients!
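To make this concrete, here is a minimal sketch using the fake_data set loaded above: the sample mean of earnings within each age group is the sample analogue of the conditional expectation $E[\, earnings_i \mid age_i \,]$.

# sample analogue of E[earnings | age]: mean earnings within each age group
fake_data %>%
    group_by(age) %>%
    summarise(mean_earnings = mean(earnings, na.rm = TRUE)) %>%
    head()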
Consider a slightly more complicated example.
Let’s assume there are only two regions in this world: region A and region B. In this world, we’ll make it such that workers in region B earn, on average, $\beta_2$ log points more than otherwise identical workers in region A.
Furthermore, an extra year of age increases log-earnings by $\beta_1$, and everything else that matters for earnings is collected in an unobserved term $u_i$.
We could generate log-earnings of worker $i$ as

$$
\begin{aligned}
\log(earnings_i) &= \beta_1 \, age_i + \beta_2 \, regionB_i + u_i \\
&= \underbrace{E[\, \log(earnings_i) \mid regionB_i = 0,\ age_i = 0 \,]}_{\beta_0} + \beta_1 \, age_i + \beta_2 \, regionB_i + \varepsilon_i
\end{aligned}
$$

where $regionB_i$ is equal to 1 if worker $i$ lives in region B and 0 otherwise. In the second line we did one of the most powerful tricks in all of mathematics: add and subtract the same term! Specifically, we add and subtract the mean log-earnings for workers who are in region A and have age equal to zero. This term is the interpretation of the constant in our linear model. The re-defined unobservable term, $\varepsilon_i = u_i - \beta_0$, has mean zero among region A workers with age equal to zero by construction.
Be mindful of the interpretation of the coefficients in this new equation. As we have just seen, the constant $\beta_0$ is the mean log-earnings of workers who live in region A and have age equal to zero.
But what are the expected earnings of a worker living in region B and with age equal to zero?
If $regionB_i = 1$ and $age_i = 0$, then taking expectations of the equation above gives $E[\, \log(earnings_i) \mid regionB_i = 1,\ age_i = 0 \,] = \beta_0 + \beta_2$.
Therefore, $\beta_2 = E[\, \log(earnings_i) \mid regionB_i = 1,\ age_i = 0 \,] - E[\, \log(earnings_i) \mid regionB_i = 0,\ age_i = 0 \,]$: the coefficient on the region dummy is the difference in mean log-earnings between the two regions, holding age fixed.
Lastly, compare two workers who live in the same region but whose ages differ by exactly one year.
Therefore, $\beta_1 = E[\, \log(earnings_i) \mid regionB_i = r,\ age_i = a + 1 \,] - E[\, \log(earnings_i) \mid regionB_i = r,\ age_i = a \,]$: the coefficient on age is the difference in mean log-earnings associated with one extra year of age, holding region fixed.
Using the equations above, try computing the following difference in expected earnings for workers with different age and different region, $E[\, \log(earnings_i) \mid regionB_i = 1,\ age_i = a + 1 \,] - E[\, \log(earnings_i) \mid regionB_i = 0,\ age_i = a \,]$, and check that it is not equal to $\beta_2$: when region and age change at the same time, the difference mixes both coefficients. Each coefficient is therefore interpreted as the effect of changing one variable while holding the others constant.
So far, we have made an assumption at the population level. Remember that to know the CEF, we need to know the true coefficients $\beta_0$, $\beta_1$, and $\beta_2$, which we never observe directly. In practice we only have a sample of data, and we need a method to estimate these unknown parameters from it.
10.3 Ordinary Least Squares
If we are given some data set and we have to find the unknown coefficients of the linear model, we need a rule that maps the data into estimates of those coefficients: an estimator. Ordinary Least Squares (OLS) is the most widely used such estimator. Let $\hat{\beta}_0$ and $\hat{\beta}_1$
be the estimators of $\beta_0$ and $\beta_1$.
The formula for the estimators will return some values that will give rise to a sample version of the population model:

$$ y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\varepsilon}_i $$

where $\hat{\varepsilon}_i$ is the residual: the part of $y_i$ that the fitted line does not explain.
This expression can also be written as

$$ \hat{\varepsilon}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i $$

OLS is minimizing the squared residuals (the sample version of the error term) given our data:

$$ \min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2 $$

This minimization problem can be solved using calculus, specifically the chain rule for derivatives. The first order conditions are given by:

$$ \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0 \qquad \text{and} \qquad \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0 $$

From these first order conditions, we construct the most important restrictions for OLS:

$$ \sum_{i=1}^{n} \hat{\varepsilon}_i = 0 \qquad \text{and} \qquad \sum_{i=1}^{n} x_i \hat{\varepsilon}_i = 0 $$
In other words, by construction, the sample version of our error term will be uncorrelated with all the covariates. The constant term works the same way as including a variable equal to 1 in the regression (try it yourself!).
Notice that the formula for the OLS estimators comes directly from solving these two conditions: $\hat{\beta}_1 = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, so the estimates depend only on the sample we observe.
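As a check on the formulas above, here is a minimal sketch (assuming no missing values in age or earnings) that computes the OLS estimates by hand and compares them with lm(), the command we introduce formally in the next section.

# computing the OLS slope and intercept by hand (assumes no missing values)
x <- fake_data$age
y <- fake_data$earnings
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope formula
beta0_hat <- mean(y) - beta1_hat * mean(x)                              # intercept formula
c(beta0_hat, beta1_hat)
coef(lm(earnings ~ age, data = fake_data))  # lm() should return the same two numbers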
10.4 Ordinary Least Squares Regressions with R
For this module, we will be using the fake data set. Recall that this data is simulating information for workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
10.4.1 Univariate regressions
To run a linear regression using OLS in R, we use the command lm(). The basic syntax of the command is
lm(data=dataset_name, dep_varname ~ indep_varnames)
Feel free to look at the help file to see the different options that this command provides.
Let’s start by creating a new variable that is the natural log of earnings and then run our regression. We are using the log of earnings since earnings has a highly skewed distribution, and applying a log transformation allows us to more normally distribute our earnings variable. This will be helpful for a variety of analytical pursuits.
fake_data <- fake_data %>%
    mutate(log_earnings = log(earnings)) # taking the natural log of earnings

lm(data=fake_data, log_earnings ~ age)
By default, R includes a constant (which is usually what we want, since this will set residuals to 0 on average). The estimated coefficients are reported under (Intercept) and age; these are our $\hat{\beta}_0$ and $\hat{\beta}_1$.
The interpretation of coefficients in a univariate regression is fairly simple. Because the dependent variable is in logs, the coefficient on age tells us (approximately) the proportional change in earnings associated with one extra year of age.
Sometimes, we find that our coefficient is negative. This is not a concern. If it was the case that the coefficient on age were negative, it would simply mean that mean log-earnings decrease with age in our sample; the sign of the estimate is an empirical result, not a sign that the regression was run incorrectly.
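A minimal sketch, saving the regression into an object (reg_uni is just an illustrative name) so that we can extract the estimated coefficients directly:

reg_uni <- lm(data = fake_data, log_earnings ~ age)  # store the fitted model
coef(reg_uni)         # both estimated coefficients
coef(reg_uni)["age"]  # just the slope on age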
10.4.2 Multivariate Regressions
The command lm() also allows us to list multiple covariates. When we want to carry out a multivariate regression, we write
lm(data=dataset_name, dep_varname ~ indep_varname1 + indep_varname2 + ... )
and so on.
lm(data=fake_data, log_earnings ~ age + treated )
How would we interpret the coefficient corresponding to being treated? Consider the following two comparisons:
- Mean log-earnings of 18-year-old treated workers minus the mean log-earnings of 18-year-old untreated workers.
- Mean log-earnings of 20-year-old treated workers minus the mean log-earnings of 20-year-old untreated workers.
In the model, both of these comparisons are equal to the coefficient on treated. Therefore, the coefficient gives the increase in log-earnings between treated and untreated workers holding all other characteristics (here, age) equal. We economists usually refer to this as the ceteris paribus effect: everything else held constant.
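As a rough check of the first comparison, here is a minimal sketch computing mean log-earnings of treated and untreated workers at age 18 (this assumes the data contains 18-year-olds; the raw difference will not match the regression coefficient exactly, because the regression pools information across all ages):

fake_data %>%
    filter(age == 18) %>%                 # keep only 18-year-old workers
    group_by(treated) %>%                 # split into treated and untreated
    summarise(mean_log_earnings = mean(log_earnings, na.rm = TRUE))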
To check whether these coefficients are statistically significant, we can use another very helpful function: summary().
summary(lm(data = fake_data, log_earnings ~ age + treated))
This function provides us with standard errors for our beta coefficients, useful in testing whether these coefficients are statistically significantly different from 0. To test this, we set up the null hypothesis that a coefficient is equal to zero against the alternative that it is not, and compute the t-statistic: the estimated coefficient divided by its standard error.
If the t-statistic is roughly greater than 2 in absolute value, we reject the null hypothesis that there is no effect of the independent variable in question on earnings (at the 5% level of significance).
An alternative test can be performed using the p-value statistic: if the p-value is less than 0.05, we reject the null hypothesis at the 95% confidence level. In either case, when we reject the null hypothesis, we say that the coefficient is statistically significant.
No matter which of the two approaches we choose, the summary() function expedites the process by reporting the t-statistic and p-value for each coefficient directly, so that we can immediately decide whether to reject or fail to reject the null hypothesis.
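If we want these numbers as an object rather than printed output, a minimal sketch (reg_multi is just an illustrative name): the coefficient table returned by summary() contains the estimate, standard error, t-statistic, and p-value for each coefficient.

reg_multi <- lm(data = fake_data, log_earnings ~ age + treated)
summary(reg_multi)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)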
Thus, when working with either univariate or multivariate regressions, we must pay attention to two key features of our coefficient estimates:
- the sign of the coefficient (positive or negative), and
- the p-value or t-statistic of the coefficient (checking for statistical significance).
A subtler but also important point is to always inspect the magnitude of the coefficient. We could find a coefficient that is statistically significant yet so small in magnitude that it is economically unimportant, or a large coefficient that is too imprecisely estimated to be significant. Statistical significance on its own does not tell us whether an effect matters economically.
10.4.3 Interpreting Coefficients
While we have explored univariate and multivariate regressions of a log dependent variable and non-log independent variables (known as a log-linear model), the variables in linear regressions can take on many other forms. Each of these forms, whether it involves a transformation of the variables or not, influences how we interpret the estimated coefficients.
For instance, look at the following regression:
lm(data = fake_data, earnings ~ age)
This is a classic single-variable regression with no transformations (e.g. log) applied to the variables. In this regression, a one-unit change in the independent variable leads to a change in the dependent variable equal to the coefficient: one extra year of age is associated with an increase in annual earnings equal to the coefficient on age, measured in the same units as earnings.
Next, let’s look at the following regression, where a log transformation has now been applied to the independent variable and not the dependent variable:
fake_data <- fake_data %>%
    mutate(log_age = log(age)) # creating our log age variable first

lm(data = fake_data, earnings ~ log_age)
This is known as a linear-log regression, since only the independent variable has been transformed. It is a mirror image of the log-linear model we first looked at when we took the log of earnings. In this regression, we can say that a one-unit increase in log_age leads to a 37,482 increase in earnings, or that a 1% increase in age leads to an increase in earnings of about 374.82. To express this more neatly, a 10% increase in age leads to an increase in earnings of about 3,750, and a 100% increase in age (a doubling of age) leads to an increase in earnings of about 37,500.
We can even have a log-log regression, wherein both the dependent and independent variables in question have been transformed into log format.
lm(data = fake_data, log_earnings ~ log_age)
When interpreting the coefficients in this regression, we can say that a one-unit increase in log_age leads to a 0.52 unit increase in log_earnings, or that a 1% increase in age leads to a 0.52% increase in earnings. To express this more neatly, we can also say that a 10% increase in age leads to a 5.2% increase in earnings, or that a 100% increase in age (a doubling of age) leads to a 52% increase in earnings.
Additionally, while we have been looking at log transformations, we can apply other transformations to our variables. Suppose that we believe that age is not linearly related to earnings. Instead, we believe that age may have a quadratic relationship with earnings. We can define another variable for this term and then include it in our regression to create a multivariate regression as follows.
fake_data <- fake_data %>%
    mutate(age_sqr = age^2) # creating a squared age variable

lm(data = fake_data, earnings ~ age + age_sqr)
In this regression, we get coefficients on both age and age_sqr. Since the age variable appears in two places, neither coefficient can individually tell us the effect of age on earnings. Instead, we must take the partial derivative of earnings with respect to age. If our population regression model is

$$ earnings_i = \beta_0 + \beta_1 \, age_i + \beta_2 \, age_i^2 + \varepsilon_i $$

then the effect of age on earnings is

$$ \frac{\partial \, earnings_i}{\partial \, age_i} = \beta_1 + 2 \beta_2 \, age_i $$

which depends on the age at which it is evaluated.
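A minimal sketch of how we might evaluate this marginal effect from the fitted coefficients (the age of 40 is just an example; I(age^2) is an equivalent way to include the squared term without creating age_sqr first):

reg_quad <- lm(data = fake_data, earnings ~ age + I(age^2))  # same regression, written with I()
b <- coef(reg_quad)
b["age"] + 2 * b["I(age^2)"] * 40  # marginal effect of one extra year of age, evaluated at age 40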
In all of these examples, our explanatory variables have entered the regression as continuous measures, possibly after a transformation such as a log or a square.
Some regressions involve dummy variables and interaction terms. It is critical to understand how to interpret these coefficients, since these terms are quite common. The coefficient on a dummy variable effectively states the difference in the dependent variable between two groups, ceteris paribus, with one of the groups being the base level group left out of the regression entirely. The coefficient on interaction terms, conversely, emphasizes how the relationship between a dependent and independent variable differs between groups, or differs as another variable changes. We’ll look at both dummy variables and interaction terms in regressions in much more depth in Module 12.
10.4.4 Sample Weights
The data that is provided to us is often not statistically representative of the population as a whole. This is because the agencies that collect data (like Statistics Canada) often decide to over-sample some segments of the population. They do this to ensure that there is a large enough sample size of subgroups of the population to conduct meaningful statistical analysis of those sub-populations. For example, the population of Indigenous identity in Canada accounts for approximately 5% of the total population. If we took a representative sample of 10,000 Canadians, there would only be 500 people who identified as Indigenous in the sample.
This creates two problems. The first is that this is not a large enough sample to undertake any meaningful analysis of the characteristics of the Indigenous population in Canada. The second is that when the sample is this small, it might be possible for researchers to identify individuals in the data. This would be extremely unethical, and Statistics Canada works hard to make sure that data remains anonymized.
To resolve this issue, Statistics Canada over-samples people of Indigenous identity when they collect data. For example, they might survey 1000 people of Indigenous identity so that those people now account for 10% of observations in the sample. This would allow researchers who want to specifically look at the experiences of Indigenous people to conduct reliable research, and maintain the anonymity of the individuals represented by the data.
When we use this whole sample of 10,000, however, the data is no longer nationally representative since it overstates the share of the population of Indigenous identity - 10% instead of 5%. This sounds like a complex problem to resolve, but the solution is provided by the statistical agency that created the data in the form of “sample weights” that can be used to recreate data that is nationally representative.
Our sample weights will commonly be coded as an additional variable in our data set, such as weight_pct; however, sometimes this is not the case and we will need to identify the weight variable ourselves. Please reach out to an instructor, TA, or supervisor if you think this is the case. To include the weights in a regression analysis, we simply include the following option immediately after our independent variable(s) in the lm() function:
lm(data = data, y ~ x, weights = weight_pct)
We can do that with the variable sample_weight, which is provided to us in the "fake_data" data set, re-running the regression of log-earnings on age and treated from above.
lm(data = fake_data, log_earnings ~ age + treated, weights = sample_weight)
Often, after weighting our sample, the coefficients from our regression will change in magnitude. In these cases, there was some sub-sample of the population that was over-represented in the data and skewed the results of the unweighted regression.
Finally, while this section described the use of weighted regressions, it is important to know that there are many times we might want to apply weights to our sample that have nothing to do with running regressions. For example, if we wanted to calculate the mean of a variable using data from a skewed sample, we would want to make sure to use the weighted mean. While mean() is used in R to calculate means, R also has an incredibly useful command called weighted.mean(), which directly weights observations to calculate the weighted mean. Many packages exist that can calculate the weighted form of numerous other summary statistics.
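For example, a minimal sketch comparing the unweighted and weighted mean of earnings, using the sample_weight variable from above:

mean(fake_data$earnings, na.rm = TRUE)                                        # unweighted mean
weighted.mean(fake_data$earnings, w = fake_data$sample_weight, na.rm = TRUE)  # weighted mean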
10.5 What can we do with OLS?
Notice that OLS gives us a linear approximation to the conditional mean of some dependent variable, given some observables. We can use this information for prediction: if we had different observables, how would the expected mean differ? We can do this in R by using the predict() function. The syntax is predict(model). We first need to save our regression into an object (using the <- lm(...) syntax), and then we can pass that object as the model in the predict() function to obtain the predicted values of our dependent variable. We can do this with different regressions that have different observables (one might include age as an explanatory variable, while another might include education), and we can compare the predicted values.
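A minimal sketch of this workflow (reg_age is just an illustrative object name):

reg_age <- lm(data = fake_data, log_earnings ~ age)  # save the regression into an object
predicted_log_earnings <- predict(reg_age)           # predicted values of the dependent variable
head(predicted_log_earnings)                         # inspect the first few predictions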
Another thing we can do with OLS is discuss causality: how does manipulating one variable impact a dependent variable on average? To give a causal interpretation to our OLS estimates, we require that, in the population, the error term is unrelated to our covariates: $E[\varepsilon_i \mid x_i] = 0$.
We might be tempted to think that we can test this using the sample version, $\sum_{i} x_i \hat{\varepsilon}_i = 0$, but recall that this condition holds by construction of OLS, so it cannot tell us anything about the population assumption.
For instance, looking at the previous regression, if we want to say that the causal effect of being treated is equal to -0.81, it must be the case that treatment is not correlated (in the population sense) with the error term (our unobservables). However, it could be the case that treated workers are the ones that usually perform worse at their job, which would contradict a causal interpretation of our OLS estimates. This brings us to a short discussion of what distinguishes good and bad controls in a regression model:
- Good Controls: To think about good controls, we need to consider which unobserved determinants of the outcome are possibly correlated with our variable of interest.
- Bad Controls: It is bad practice to include variables that are themselves outcomes. For instance, consider studying the causal effect of college on earnings. If we include a covariate of working at a high paying job, then we’re blocking part of the causal channel between college and earnings (i.e. you are more likely to have a nice job if you study more years!)
10.6 Wrap Up
In this module we discussed the following concepts:
- Linear Model: an equation that describes how an outcome is generated and how it depends on some unknown coefficients ($\beta$s).
- Ordinary Least Squares: a method to obtain a good approximation of the true coefficients of a linear model from a given sample.
Notice that there is no such thing as an OLS model. More specifically, notice that we could apply a different method (estimator) to a linear model. For example, consider minimizing the sum of the absolute values of the error terms instead of their squares:

$$ \min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \left| y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right| $$

The model is still linear, but the solution to this problem is not an OLS estimate.
We also learned how to interpret coefficients in any linear model.
The constant term, $\hat{\beta}_0$, is the expected value of $y$ when $x = 0$. More precisely, because we only have a sample approximation of this true value, it would be the sample mean of $y$ when $x = 0$.
In the case of any other beta, $\hat{\beta}_k$ is going to be the difference in the expected value of $y$ due to a one-unit change in the corresponding $x_k$, holding the other covariates constant. Therefore, each coefficient is interpreted as a comparison of conditional means.
10.7 Wrap-up Table
| Command | Function |
|---|---|
| lm(data=<data>, <model>) | It estimates a linear model using <data> as the dataset and <model> as the specification. |
| predict(model) | It is used to obtain the predicted values of a fitted model. |