#Clear the memory from any pre-existing objects
rm(list=ls())
# loading in our packages
library(tidyverse) #This includes ggplot2!
library(haven)
library(IRdisplay)
#Open the dataset
fake_data <- read_dta("../econ490-stata/fake_data.dta")
fake_data
# inspecting the data
glimpse(fake_data)
ECON 490: Conducting Regression Analysis (10)
Prerequisites
- Econometric approaches to linear regression taught in ECON 326.
- Importing data into R.
- Creating new variables in R.
Learning Outcomes
- Implement the econometric theory for linear regressions learned in ECON 326.
- Run simple univariate and multivariate regressions using the command lm().
- Understand the interpretation of the coefficients in linear regression output.
- Consider the quality of control variables in a proposed model.
10.1 A Word of Caution Before We Begin
Before conducting a regression analysis, a great deal of work must go into understanding the data and investigating the theoretical relationships between variables. The biggest mistake that students make at this stage is not in how they run the regression analysis; it is failing to spend enough time preparing the data for analysis.
- A variable that is qualitative and not ranked cannot be used in an OLS regression without first creating a dummy variable (or a series of dummy variables). Examples of variables that must always be included as dummy variables are sex, race, religiosity, immigration status, and marital status. Examples of variables that are sometimes included as dummy variables are education, income, and age.
- You will want to take a good look at how your variables are coded before you begin running regressions and interpreting the results. Make sure that missing values are coded as NA and not as some numeric value (such as 99). Also, check that qualitative ranked variables are coded in the way you expect (e.g. higher education is coded with a larger number). If you do not do this, you could misinterpret your results.
- Some samples are not proper representations of the population and must be weighted accordingly (we will deal with this in depth later).
- You should always think about the theoretical relationship between your variables before you start your regression analysis. Does economic theory predict a linear relationship? Are the explanatory terms independent of each other, or is there possibly an interaction?
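The coding checks described above can be sketched in R as follows. This is a minimal illustration on a made-up data frame: the variable names (marital, educ) and the sentinel value 99 are assumptions for the example and are not taken from fake_data.

```r
library(dplyr)

# Hypothetical toy data: none of these columns come from fake_data
toy <- tibble(
  marital = c("single", "married", "married", "single"),
  educ    = c(1, 3, 99, 2)   # assume 99 was used to flag a missing answer
)

# Inspect how each variable is coded before doing anything else
count(toy, marital)
count(toy, educ)

toy_clean <- toy %>%
  mutate(
    educ    = na_if(educ, 99),                  # recode the sentinel value to NA
    married = as.integer(marital == "married")  # dummy: 1 if married, 0 otherwise
  )

toy_clean
```

In R, wrapping a qualitative variable in factor() inside a model formula has a similar effect, since lm() expands factors into a series of dummy variables.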
10.2 Linear Regression Models
Understanding how to run a well-structured OLS regression and how to interpret the results of that regression are the most important skills for undertaking empirical economic analysis. You have acquired a solid understanding of the theory behind the OLS regression in ECON 326; keep this in mind throughout your analysis. Here, we will cover the practical side of running regressions and, perhaps more importantly, how to interpret the results.
An econometric model describes an equation (or set of equations) that imposes some structure on how the data was generated. The most natural way to summarize statistical information is the mean. Therefore, we typically model the mean of a (dependent) variable and how it depends on different factors (independent variables or covariates). The easiest way to describe a relationship between a dependent variable, y, and one or more independent variables, x, is linearly.
Suppose we want to know what variables are needed to understand why and how earnings vary across people in the world. What measures would be needed to predict everyone’s earnings?
Some explanatory variables might be:
- Age
- Year (e.g. macroeconomic shocks in that particular year)
- Region (local determinants of earnings)
- Hours worked
- Education
- Labor Market Experience
- Industry / Occupation
- Number of children
- Level of productivity
- Passion about their job
- etc. (there are so many!)
For simplicity, let’s assume we want to predict earnings but we only have access to datasets relating to people’s age and earnings. If we want to generate a model that predicts the relationship between these two variables, we could create a linear model where the dependent variable (y) is annual earnings, the independent variable (x) is age, the slope (m) is how much an extra year of age affects earnings, and the y-intercept (b) is earnings when age is equal to 0. We would write this relationship as,
\[ y = b +mx. \]
We only have access to two variables, so we are unable to observe the rest of the variables (independent variables or covariates \(X_{i}\)) that might determine earnings. Even if we do not observe these variables, they still affect earnings, so our model above would have error: the observed values would diverge from the linear model.
We collect these deviations in an error term \(u_{i}\) and, from here on, use the log of earnings as the dependent variable. With \(\beta_0\) as the y-intercept, \(\beta_1\) as the slope, and \(i\) indexing the worker observations in the data we have, the model is
\[ logearnings_{i} =\beta_0 + \beta_1 age_{i} + u_{i}. \tag{1} \]
It’s important to understand what \(\beta_0\) and \(\beta_1\) stand for in the linear model. We said above that we typically model the mean of a (dependent) variable and how it can depend on different factors (independent variables or covariates). Therefore we are in fact modeling the expected value of earnings conditional on the value of age. This is called the conditional expectation function or CEF. We assume that it takes the form of:
\[ E[logearnings_{i}|age_{i}] =\beta_0 + \beta_1 age_{i} \tag{2} \]
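How do equations (1) and (2) relate? If you take the expectation of equation (1) conditional on age, you will notice that \[ E[age_{i}|age_{i}]=age_{i}, \] so that \[ E[logearnings_{i}|age_{i}] = \beta_0 + \beta_1 age_{i} + E[u_{i}|age_{i}]. \] For this to agree with equation (2), we must have \[ E[u_{i}|age_{i}]=0. \]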
If \(age=0\), then \(\beta_1 \times age=0\) and \[ E[logearnings_{i}|age_{i}=0]=\beta_0 \]
If \(age=1\), then \(\beta_1 \times age=\beta_1\) and \[ E[logearnings_{i}|age_{i}=1]=E[logearnings_{i}|age_{i}=0]+ \beta_1 \]
Differencing the two equations above gives us the solution,
\[ E[logearnings_{i}|age_{i}=1]- E[logearnings_{i}|age_{i}=0]= \beta_1, \]
where \(\beta_1\) is the difference in the expected value of logearnings when there is a one-unit increase in age. If you choose any two values of age that differ by 1 unit, you will also get \(\beta_1\) as the solution (try it yourself!).
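As a check on this claim, take any age \(a\) and compare it with \(a+1\), using the linear form of the CEF assumed in equation (2). Here \(a\) is just a placeholder for an arbitrary starting age:

\[ E[logearnings_{i}|age_{i}=a+1] - E[logearnings_{i}|age_{i}=a] = \left(\beta_0 + \beta_1 (a+1)\right) - \left(\beta_0 + \beta_1 a\right) = \beta_1. \]

The intercept and the term \(\beta_1 a\) cancel, so the result does not depend on the starting age \(a\). This relies entirely on the assumption that the CEF is linear in age.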
If we know those \(\beta\)s, we can learn a lot about the means of different sets of workers. For instance, we can compute the mean log-earnings of 18-year-old workers:
\[ E[logearnings_{i} \mid age_{i}=18] = \beta_0 + \beta_1 \times 18 \]
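To make this concrete, the sketch below simulates data from equation (1) and compares the average log-earnings at each age with the CEF. The parameter values (beta0 = 9, beta1 = 0.02), the sample size, and the age range are assumptions chosen only for illustration; they are not estimates from any dataset.

```r
library(dplyr)

set.seed(490)

# Assumed "true" parameters, chosen only for illustration
beta0 <- 9
beta1 <- 0.02

n <- 100000
sim <- tibble(
  age = sample(18:65, n, replace = TRUE),
  u   = rnorm(n, mean = 0, sd = 0.5)   # error term drawn so that E[u | age] = 0
) %>%
  mutate(logearnings = beta0 + beta1 * age + u)

# Sample mean of log-earnings at each age versus the CEF beta0 + beta1 * age
cef_check <- sim %>%
  group_by(age) %>%
  summarise(mean_logearnings = mean(logearnings)) %>%
  mutate(cef = beta0 + beta1 * age)

# Compare the two columns for 18-year-old workers
filter(cef_check, age == 18)
```

In a large simulated sample the two columns should be close at every age, because the error term averages out within each age group.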
This is the intuition that we should follow to interpret the coefficients!
Consider a slightly more complicated example.
Let’s assume there are only two regions in this world: region A and region B. In this world, we’ll make it such that workers in region B have log-earnings that are \(\beta_1\) higher than workers in region A on average (roughly \(100 \times \beta_1\) percent higher earnings when \(\beta_1\) is small). We are going to create a dummy variable called \(region\) that takes the value of 1 if the worker’s region is B and a value of 0 if the worker’s region is A.
Furthermore, an extra year of age increases log-earnings by \(\beta_2\) on average, and we take the same approach with every explanatory variable on the list above. The empirical economist (us!) only observes a subset of all these variables, which we call the observables or covariates \(X_{i}\). Let’s suppose that the empirical economist only observes the region and age of the workers.
We could generate log-earnings of worker \(i\) as follows.
\[\begin{align}
logearnings_{i} &= \beta_1 \{region_{i}=1\} + \beta_2 age_{i} + \underbrace{ \beta_3 education_{i} + \beta_4 hours_{i} + \dots }_{\text{Unobservable, so we'll call this }u_{i}^*} \\
&= E[logearnings_{i} \mid region_{i}=0, age_{i}=0] + \beta_1 \{region_{i}=1\} + \beta_2 age_{i} + u_{i}^* - E[logearnings_{i} \mid region_{i}=0, age_{i}=0] \\
&= \beta_0 + \beta_1 \{region_{i}=1\} + \beta_2 age_{i} + u_{i}
\end{align}\]
In the second line we did one of the most powerful tricks in all of mathematics: add and subtract the same term! Specifically, we add and subtract the mean log-earnings of workers who are in region A and have age equal to zero. This term is the interpretation of the constant \(\beta_0\) in our linear model. The re-defined unobservable term \(u_i = u_{i}^* - \beta_0\) is a deviation from that mean, which we expect to be zero on average.
Be mindful of the interpretation of the coefficients in this new equation. As we have just seen, the constant \(\beta_0\) is interpreted as the average log-earnings of workers living in region A and with age equal to zero: if \(age=0\) and \({region}_{i}=0\), then the indicator is zero, so \(\beta_1 \times \{{region}_{i}=1\} = 0\), and \(\beta_2 \times age=0\). All that remains is \(\beta_0\): \[ E[logearnings_{i}|age_{i}=0 \; \text{and} \; {region}_{i}=0]=\beta_0 \]
But what are the expected earnings of a worker living in region B and with age equal to zero? If \(age=0\) and \({region}_{i}=1\) then \(\beta_1 \times \{{region}_{i}=1\} = \beta_1\) and \(\beta_2 \times age=0\). As a result, we obtain \[ E[logearnings_{i}|age_{i}=0 \; \text{and} \; {region}_{i}=1]=\beta_0 + \beta_1 \]
Therefore, \(\beta_1\) is interpreted as the difference in average log-earnings between workers living in region B and workers of the same age living in region A. Lastly, \(\beta_2\) is interpreted as the extra average log-earnings obtained by individuals with one additional year of age compared to other individuals living in the same region. That ‘living in the same region’ portion of the sentence is key. Consider an individual living in region A and with age equal to 1. The expected log-earnings in that case are \[ E[logearnings_{i}|age_{i}=1 \; \text{and} \; {region}_{i}=0]=\beta_0 + \beta_2 \]
Therefore, \(\beta_2\) is equal to the extra average earnings obtained by workers of region A for each one additional year of age: \[ \beta_2 = E[logearnings_{i}|age_{i}=1 \; \text{and} \; {region}_{i}=0] - E[logearnings_{i}|age_{i}=0 \; \text{and} \; {region}_{i}=0] \]
Using the equations above, try computing the following difference in expected earnings for workers with different age and different region, and check that it is not equal to \(\beta_2\): \[ E[logearnings_{i}|age_{i}=1 \; \text{and} \; {region}_{i}=0] - E[logearnings_{i}|age_{i}=0 \; \text{and} \; {region}_{i}=1] \]
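One way to work through this comparison is to substitute the conditional means derived above, \(\beta_0 + \beta_2\) for a worker in region A with age 1 and \(\beta_0 + \beta_1\) for a worker in region B with age 0:

\[ E[logearnings_{i}|age_{i}=1 \; \text{and} \; {region}_{i}=0] - E[logearnings_{i}|age_{i}=0 \; \text{and} \; {region}_{i}=1] = (\beta_0 + \beta_2) - (\beta_0 + \beta_1) = \beta_2 - \beta_1. \]

This equals \(\beta_2\) only in the special case \(\beta_1 = 0\), because the comparison changes both age and region at the same time.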
So far we have made an assumption at the population level. Remember that to know the CEF we need to know the true \(\beta\)s, which in turn depend on the joint distribution of the outcome (\(Y_i\)) and covariates (\(X_i\)). However, in practice, we are given a random sample, where we can compute averages instead of expectations and empirical distributions instead of the true distributions. We can use these in a formula (also known as an estimator!) to obtain a reasonable guess of the true \(\beta\)s. For a given sample, the numbers produced by the estimator are known as estimates. One of the most powerful estimators out there is the Ordinary Least Squares (OLS) estimator.
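The distinction between the true \(\beta\)s and their estimates can be sketched with simulated data. In the example below, the population parameters (beta0 = 9, beta1 = 0.10, beta2 = 0.02), the sample size, and the variable names are assumptions made for illustration; the only real function being demonstrated is lm(), which computes the OLS estimates.

```r
set.seed(490)

# Assumed population parameters (illustrative only)
beta0 <- 9
beta1 <- 0.10
beta2 <- 0.02

n <- 50000
sim <- data.frame(
  region = rbinom(n, size = 1, prob = 0.5),  # 1 = region B, 0 = region A
  age    = sample(18:65, n, replace = TRUE)
)

# Generate log-earnings from the linear model plus a mean-zero error
sim$logearnings <- beta0 + beta1 * sim$region + beta2 * sim$age +
  rnorm(n, mean = 0, sd = 0.5)

# Apply the OLS estimator to this sample; coef() returns the estimates
fit <- lm(logearnings ~ region + age, data = sim)
coef(fit)
```

The estimates will not match the assumed parameters exactly, since they depend on the particular random sample drawn, but in a sample this large they should be close.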