10.3 Ordinary Least Squares

If we are given a dataset and need to find the unknown \(\beta\)s, the most common and powerful tool is known as OLS. Continuing with the example above, let the observations be indexed by \(j=1,2,\dots, n\), and let \[ \hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2 \] be the estimators of \[ \beta_0, \beta_1, \beta_2. \] The estimator will return values that give rise to a sample version of the population model:

\[ logearnings_{j} = \hat{\beta}_0 + \hat{\beta}_1\{region_{j}=1\} + \hat{\beta}_2 \, age_{j} + \hat{u}_{j}, \]

where \(u_j\) is the true error in the population and \(\hat{u}_{j}\) is called a residual (the sample version of the error, given the current estimates). OLS finds the values of the \(\hat{\beta}\)s that minimize the sum of squared residuals. This is given by the following minimization problem: \[ \min_{b} \frac{1}{n} \sum_{j}^n \hat{u}_{j}^2 \] This expression can also be written as

\[ \min_{b} \frac{1}{n} \sum_{j}^n (logearnings_{j} - b_0 - b_1 \{region_{j}=1\} - b_2age_{j} )^2 \]

OLS minimizes the squared residuals (the sample version of the error term) given our data. This minimization problem can be solved with calculus, using the chain rule to differentiate with respect to each coefficient. The first order conditions are:

\[\begin{align} \frac{1}{n} \sum_{j}^n 1 \times \hat{u}_{j} &= 0 \\ \frac{1}{n} \sum_{j}^n age_j \times \hat{u}_{j} &= 0 \\ \frac{1}{n} \sum_{j}^n \{region_j = 1\} \times \hat{u}_{j} &= 0 \end{align}\]

From these first order conditions we construct the most important restrictions for OLS:

\[ \frac{1}{n} \sum_{j}^n \hat{u}_j = \frac{1}{n} \sum_{j}^n \hat{u}_j \times age_j=\frac{1}{n} \sum_{j}^n \hat{u}_j\times\{region_j = 1\}=0 \]

In other words, by construction, the sample version of our error term will be uncorrelated with all the covariates. The constant term works the same way as including a variable equal to 1 in the regression (try it yourself!).
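As a quick check of these restrictions, here is a minimal sketch with simulated data (all variable names and numbers below are made up for illustration): it runs lm() and then verifies that the residuals average to zero and have zero sample covariance with each covariate.

# simulate a small fake sample (illustrative values only)
set.seed(123)
n <- 1000
age <- sample(25:60, n, replace = TRUE)
region1 <- rbinom(n, 1, 0.5)                         # stands in for the indicator {region = 1}
log_earnings <- 10 + 0.5 * region1 + 0.01 * age + rnorm(n)

# run OLS and recover the residuals
ols <- lm(log_earnings ~ region1 + age)
u_hat <- residuals(ols)

mean(u_hat)             # zero, up to numerical precision
mean(u_hat * age)       # zero, up to numerical precision
mean(u_hat * region1)   # zero, up to numerical precision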

Notice that the formula for \(\beta_0, \beta_1, \beta_2\) (the true values!) uses these same conditions, but with expectations in place of sample averages. Computing those expectations is infeasible because, as we argued before, we would need to know the true joint distribution of the variables. In fact, many useful estimators are built this way: replace an expectation with a sample average. This is called the sample analogue approach.

Note: Because this is an optimization problem, all of our variables must be numeric. If a variable is categorical we must be able to re-code it into a numerical variable. You will understand more about this after completing our next module.

10.4 Ordinary Least Squares Regressions with R

For this module we will be using the fake_data dataset. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country, where a training program was introduced in 2003 to boost their earnings.

#Clear the memory from any pre-existing objects
rm(list=ls())

# loading in our packages
library(tidyverse) #This includes ggplot2! 
library(haven)
library(IRdisplay)

#Open the dataset 
fake_data <- read_dta("../econ490-stata/fake_data.dta")  

# inspecting the data
glimpse(fake_data)

10.4.1 Univariate regressions

To run a linear regression using OLS we use the command lm(). The basic syntax of the command is

lm(data=dataset_name, dep_varname ~ indep_varnames)

You can consult the help file to see the different options that this command provides.
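For example, either of the following opens that help file from the R console:

?lm          # opens the help file for lm()
help(lm)     # equivalent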

Let’s start by creating a new variable that is the natural log of earnings and then run our regression. We use the log of earnings because earnings has a highly skewed distribution; applying a log transformation makes the distribution closer to normal, which is helpful for a variety of analyses.

fake_data <- fake_data %>%
        mutate(log_earnings = log(earnings)) #the log function
lm(data=fake_data, log_earnings ~ age)

By default R includes a constant (which is usually what we want, since this ensures that the residuals are 0 on average). The estimated coefficients are \(\hat{\beta}_0 = 10.014\) and \(\hat{\beta}_1 = 0.014\). Notice that we only included one covariate here; this is known as a univariate (linear) regression.

The interpretation of coefficients in a univariate regression is fairly simple. \(\hat{\beta}_1\) says that one extra year of age increases log earnings by \(0.014\) on average. In other words, one extra year of age is associated with approximately 1.4 percent higher earnings. Meanwhile, \(\hat{\beta}_0\) says that the average log earnings of individuals with a recorded age of 0 is about \(10\). This intercept is not particularly meaningful given that no one in the data set has an age of 0, and in general the \(\hat{\beta}_0\) intercept is often not economically meaningful. After all, \(\hat{\beta}_0\) is simply an OLS estimate resulting from minimizing the sum of squared residuals.

Sometimes we find that our coefficient is negative. This is not a concern. If it were the case that \(\hat{\beta}_1 = -0.014\), this would instead mean that one extra year of age is associated with a \(0.014\) decrease in log earnings, or approximately \(1.4\) percent lower earnings. The sign of a coefficient therefore matters for interpretation. We will look at how to interpret coefficients in a series of cases later in this notebook.

10.4.2 Multivariate regressions

The command lm() also allows us to list multiple covariates. When we want to carry out a multivariate regression we write,

lm(data=dataset_name, dep_varname ~ indep_varname1 + indep_varname2 + ... )

and so on.

lm(data=fake_data, log_earnings ~ age + treated )

How would we interpret the coefficient corresponding to being treated? Consider comparing two workers with the same age, one treated and one not: the coefficient gives the difference in expected log earnings between treated and untreated workers who otherwise share the same characteristics. Economists usually refer to this as \(\textit{ceteris paribus}\).

To check whether these coefficients are statistically significant, we can use another very helpful function: summary().

summary(lm(data = fake_data, log_earnings ~ age + treated))

This function provides standard errors for our beta coefficients, which are useful for testing whether the coefficients are statistically significantly different from 0. To run this test, we set up the null hypothesis that a coefficient \(\beta\) equals 0 (so that the estimate has mean 0 under the null) and standardize it using the standard error provided:

\[ t = \frac{ \hat{\beta} - 0 }{StdErr} \]

If the t-statistic is greater than roughly 2 in absolute value, we reject the null hypothesis of no effect. This means that the data support the hypothesis that the variable in question has some effect on earnings at the 95% confidence level.

An alternative test uses the p-value: if the p-value is less than 0.05, we reject the null hypothesis at the 95% confidence level. In either case, when we reject the null hypothesis, we say that the coefficient is statistically significant.

No matter which of the two approaches we choose, the summary() function expedites the process by reporting the p-value and t-statistic directly, so that we can immediately reject or fail to reject the null hypothesis.
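To see where these numbers come from, the sketch below extracts the coefficient table from summary() and recomputes the t-statistic and two-sided p-value for age by hand (the column names used are those of the table returned by summary()).

reg <- lm(data = fake_data, log_earnings ~ age + treated)
coef_table <- summary(reg)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)

beta_age <- coef_table["age", "Estimate"]
se_age   <- coef_table["age", "Std. Error"]

t_age <- (beta_age - 0) / se_age                                        # matches the "t value" column
p_age <- 2 * pt(abs(t_age), df = df.residual(reg), lower.tail = FALSE)  # matches the "Pr(>|t|)" column

t_age
p_age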

Note: Without statistical significance we cannot reject the null hypothesis that the coefficient is zero; in that case the data do not provide evidence that the independent variable of interest has an effect on the dependent variable.

Thus, when working with either univariate or multivariate regressions, we must pay attention to two key features of our coefficient estimates:

  1. the sign of the coefficient (positive or negative)
  2. the p-value or t-statistic of the coefficient (checking for statistical significance)

A subtler but also important point is to always inspect the magnitude of the coefficient. We could find \(\hat{\beta}_1 = 0.00005\) in our regression and determine that it is statistically significant. However, this would not change the fact that the effect is very weak: an extra year of age would increase earnings by only about 0.005 percent. Magnitude is always important when assessing whether a relationship, even if it is statistically significant (so we can be quite sure it is not 0), is actually large in size (whether positive or negative). Understanding whether the magnitude of a coefficient is economically meaningful typically requires a firm understanding of the economic literature in that area.

10.4.3 Interpreting coefficients

While we have explored univariate and multivariate regressions of a log dependent variable and non-log independent variables (known as a log-linear model), the variables in linear regressions can take on many other forms. Each of these forms, whether a transformation of variables or not, influences how we can interpret these \(\beta\) coefficient estimates.

For instance, look at the following regression:

lm(data = fake_data, earnings ~ age)

This is a classic single variable regression with no transformations (i.e. log) applied to the variables. In this regression, a one-unit change in the independent variable leads to a \(\beta\) unit change in the dependent variable. As such, we can interpret our coefficients in the following way: an extra year of age increases earnings by 1046.49 on average. The average earnings of individuals with an age of 0 is 35484, which, as we have already discussed, is not economically meaningful. The incredibly low p-value for the coefficient on age also indicates that this is a statistically significant effect.

Next look at the following regression, where a log transformation has now been applied to the independent variable and not the dependent variable:

fake_data <- fake_data %>% 
        mutate(log_age = log(age)) # creating our log age variable first
lm(data = fake_data, earnings ~ log_age)

This is known as a linear-log regression, since only the independent variable has been transformed. It is a mirror image of the log-linear model we first looked at when we took the log of earnings. In this regression, we can say that a 1 unit increase in log age leads to a 37482 increase in earnings, or that a 1% increase in age leads to an increase in earnings of about 374.82. To express this more neatly, a 10% increase in age leads to an increase in earnings of about 3750. Keep in mind that this percentage interpretation is an approximation that works best for small changes; for a large change such as a doubling of age, the implied increase is \(37482 \times \ln(2) \approx 26000\), not 37482.

We can even have a log-log regression, wherein both the dependent and independent variable in question have been transformed into log format.

lm(data = fake_data, log_earnings ~ log_age)

When interpreting the coefficients in this regression, we can say that a 1 unit increase in log age leads to a 0.52 unit increase in log earnings, or, approximately, that a 1% increase in age leads to a 0.52% increase in earnings. To express this more neatly, a 10% increase in age leads to roughly a 5.2% increase in earnings. As in the linear-log case, this elasticity interpretation is an approximation that is most accurate for small percentage changes.
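As a quick sketch of how to translate the estimated elasticity into percentage terms, the code below pulls the coefficient from the regression above and compares the approximate and exact implied changes for a 10% increase in age.

loglog <- lm(data = fake_data, log_earnings ~ log_age)
b <- coef(loglog)["log_age"]

10 * b                # approximate % change in earnings for a 10% increase in age
100 * (1.10^b - 1)    # exact implied % change in earnings for a 10% increase in age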

Additionally, while we have been looking at log transformations, we can apply other transformations to our variables. Suppose that we believe that age is not linearly related to earnings. Instead, we believe that age may have a quadratic relationship with earnings. We can define another variable for this term and then include it in our regression to create a multivariate regression as follows.

fake_data <- fake_data %>% 
        mutate(age_sqr = age^2) # creating a squared age variable
lm(data = fake_data, earnings ~ age + age_sqr)

In this regression, we get coefficients on both \(age\) and \(age^2\). Since the age variable appears in two places, neither coefficient can individually tell us the effect of age on earnings. Instead, we must take the partial derivative of earnings with respect to age. If our population regression model is

\[ earnings_i = \beta_0 + \beta_1 age_i + \beta_2 age^2_i + u_i \]

then the effect of an extra year of age on earnings is \(\beta_1 + 2\beta_2 \, age_i\), which depends on the age at which it is evaluated. Using the estimated coefficients, at age 40 an extra year of age is associated with a change in earnings of about \(3109.1 + 2(-27.7)(40) \approx 893\) (see the sketch below). There are many other types of transformations we can apply to variables in our regression models; this is just one example.
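Here is a short sketch of that calculation, evaluating the estimated marginal effect of age at a few different ages using the fitted coefficients from the quadratic regression above.

quad <- lm(data = fake_data, earnings ~ age + age_sqr)
b1 <- coef(quad)["age"]
b2 <- coef(quad)["age_sqr"]

# estimated effect of one extra year of age, evaluated at different ages
ages <- c(25, 40, 55)
b1 + 2 * b2 * ages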

In all of these examples, our \(\beta_0\) intercept coefficient gives us the expected value of the dependent variable when the independent variables equal 0. We can inspect the output of these regressions further, looking at their p-values or t-statistics, to determine whether the estimated coefficients are statistically significant.

Finally, some regressions involve dummy variables and interaction terms. It is critical to understand how to interpret these coefficients, since these terms are quite common. The coefficient on a dummy variable gives the difference in the dependent variable between two groups, ceteris paribus, with one of the groups being the base level group left out of the regression entirely. The coefficient on an interaction term, meanwhile, captures how the relationship between a dependent and independent variable differs between groups, or differs as another variable changes. We’ll look at both dummy variables and interaction terms in regressions in much more depth in Module 12.

10.4.4 Sample weights

The data that is provided to us is often not statistically representative of the population as a whole. This is because the agencies that collect data (like Statistics Canada) often decide to over-sample some segments of the population. They do this to ensure that there is a large enough sample size of subgroups of the population to conduct meaningful statistical analysis of those sub-populations. For example, the population of Indigenous identity in Canada accounts for approximately 5% of the total population. If we took a representative sample of 10,000 Canadians, there would only be 500 people who identified as Indigenous in the sample.

This creates two problems. The first is that this is not a large enough sample to undertake any meaningful analysis of characteristics of the Indigenous population in Canada. The second is that when the sample is this small, it might be possible for researchers to identify individuals in data. This would be extremely unethical, and Stats Canada works hard to make sure that data remains anonymized.

To resolve this issue, Statistics Canada over-samples people of Indigenous identity when they collect data. For example, they might survey 1000 people of Indigenous identity so that those people now account for 10% of observations in the sample. This would allow researchers who want to specifically look at the experiences of Indigenous people to conduct reliable research, and maintain the anonymity of the individuals represented by the data.

When we use this whole sample of 10,000, however, the data is no longer nationally representative since it overstates the share of the population of Indigenous identity - 10% instead of 5%. This sounds like a complex problem to resolve, but the solution is provided by the statistical agency that created the data in the form of “sample weights” that can be used to recreate data that is nationally representative.

Note: Before applying any weights in your regression, it is important that you read the user guide that comes with your data to see how weights should be applied. There are several options for weights and you should never apply weights without first understanding the intentions of the authors of the data.

Sample weights are commonly coded as an additional variable in our data set, such as weight_pct. To include the weights in a regression, we simply add the weights option after the model formula in the lm() function:

    lm(data = data, y ~ x, weights = weight_pct)  

We can do that with the variable sample_weight which is provided to us in the “fake_data” data set, re-running the regression of log earnings on age and treatment status from above.

lm(data = fake_data, log_earnings ~ age + treated, weights = sample_weight)

Often, after weighting our sample, the coefficients from our regression will change in magnitude. In these cases, there was some sub-sample of the population that was over-represented in the data and skewed the results of the unweighted regression.

Finally, while this section described the use of weighted regressions, it is important to know that there are many times we might want to apply weights to our sample that have nothing to do with running regressions. For example, if we wanted to calculate the mean of a variable using data from a skewed sample, we would want to use a weighted mean. While mean() calculates a simple mean in R, R also has an incredibly useful command called weighted.mean() which weights observations directly when calculating the mean. Many packages can also calculate weighted versions of numerous other summary statistics.
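For example, a minimal sketch comparing the unweighted and weighted mean of log earnings, using the sample_weight variable described above:

# unweighted mean of log earnings
mean(fake_data$log_earnings, na.rm = TRUE)

# weighted mean using the provided sample weights
weighted.mean(fake_data$log_earnings, w = fake_data$sample_weight, na.rm = TRUE)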

10.5 Frisch-Waugh-Lovell Theorem

The Frisch-Waugh-Lovell Theorem (FWL henceforth) is a very powerful result in theoretical econometrics that will help us understand what happens when we are interested in the relationship between \(Y\) and \(D\) once we control for covariates \(X\) in a linear fashion.

This theorem states that running the following regression \[ Y_i = \hat{\beta}_0 + \hat{\beta}_1 D_i + \hat{\Gamma} X_i + \hat{\varepsilon}_i \]

provides the same estimate \(\hat{\beta}_1\) as if we did the following procedure.

  1. Run the following OLS regressions and keep the residuals \(\tilde{D}_i\) and \(\tilde{Y}_i\):

\[ D_i = \hat{\lambda}_0 + \hat{\Lambda} X_i + \tilde{D}_i \]

\[ Y_i = \hat{\omega}_0 + \hat{\Omega} X_i + \tilde{Y}_i \]

  2. Run a univariate OLS regression of \(\tilde{Y}_i\) on \(\tilde{D}_i\). Notice that this excludes the use of a constant term.

Therefore, controlling (linearly) for covariates \(X\) works just as if we first project the variables of interest onto the covariates with OLS (a linear projection) and then run a univariate regression on the residuals. Intuitively, we are partialling-out the effect of \(X\) from both variables so that we can focus on the part of the relationship that does not depend on \(X\). That is why we interpret the results of a multivariate regression as “ceteris paribus” with respect to all the covariates.

Let’s see how it works using our data set:

lm(data = fake_data, log_earnings ~ treated + age + region)

Now let’s see if we can obtain the same coefficient on treated using the partialling-out procedure:

lm(data = fake_data, treated ~ age + region)

# storing the residuals from the regression of treated on age and region
Dtilde <- residuals(lm(data = fake_data, treated ~ age + region))
lm(data = fake_data, log_earnings ~ age + region)

# storing the residuals from the regression of log earnings on age and region
Ytilde <- residuals(lm(data = fake_data, log_earnings ~ age + region))

We include -1 in the following regression so that we exclude the use of a constant term in this final step.

# Ytilde and Dtilde are stand-alone vectors, so no data argument is needed
lm(Ytilde ~ Dtilde - 1)


Indeed, we obtain the same result!
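If we want to check this programmatically rather than by eye, here is a short comparison (assuming treated is stored as a numeric 0/1 variable, so that its coefficient is labelled treated in the output):

multi <- lm(data = fake_data, log_earnings ~ treated + age + region)
fwl   <- lm(Ytilde ~ Dtilde - 1)

coef(multi)["treated"]   # coefficient on treated from the full regression
coef(fwl)["Dtilde"]      # coefficient from the partialled-out regression
all.equal(unname(coef(multi)["treated"]), unname(coef(fwl)["Dtilde"]))

The two estimates agree up to numerical precision, which is exactly what the FWL theorem predicts.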

10.6 What can we do with OLS?

Notice that OLS gives us a linear approximation to the conditional mean of a dependent variable given some observables. We can use this for prediction: if the observables took different values, how would the expected mean differ? We can also use OLS to discuss causality: how does manipulating one variable impact the dependent variable on average?

To give a causal interpretation to our OLS estimates, we require that in the population \(\mathbf{E}[X_i u_i] = 0\): the unobservables are uncorrelated with the independent variables of the equation (remember that this is untestable, because we cannot compute such expectations in practice!). If the unobservables are correlated with an independent variable, then a change in that variable may move the dependent variable through the unobservables rather than through the variable itself, which prevents us from establishing causality. This is also called an endogeneity problem.

You might be tempted to test this using the sample version \(\frac{1}{n} \sum_{j}^n X_j \hat{u}_j = 0\), but notice from the first order conditions that this holds by construction! The argument would be circular: the OLS solution imposes this condition on the residuals regardless of whether the population condition holds.

For instance, to interpret the coefficient on treated in the previous regression, equal to -0.81, as the causal effect of being treated, it must be the case that treatment is uncorrelated (in the population sense) with the error term. However, it could be the case that treated workers are the ones who usually perform worse at their jobs, and that would invalidate a causal interpretation of our OLS estimates.
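The small simulation below (purely illustrative, with made-up numbers, and not based on the fake_data set) makes both points concrete: even when the true error is correlated with the regressor, the sample covariance between that regressor and the OLS residuals is still zero by construction, while the OLS estimate is pulled away from the true causal effect.

set.seed(42)
n <- 10000
ability <- rnorm(n)                              # unobserved confounder
treated <- as.numeric(ability + rnorm(n) > 0)    # treatment is correlated with ability
y <- 1 + 0.5 * treated + ability + rnorm(n)      # true causal effect of treatment is 0.5

fit <- lm(y ~ treated)

coef(fit)["treated"]              # biased: well above 0.5 because ability is omitted
mean(residuals(fit) * treated)    # still ~ 0, by construction of OLS
cor(ability, treated)             # but the unobservable is correlated with treated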

10.7 Wrap Up

In this module we distinguished between two concepts: a model (for example, a linear model relating log earnings to age and region) and an estimator (such as OLS) used to recover the unknown parameters of that model.

Therefore, notice that there is no such thing as an “OLS model”: OLS is an estimator, and we could apply a different estimator to the same linear model. For example, consider minimizing the average absolute residual \[ \min_{b} \frac{1}{n} \sum_{j}^n | \hat{u}_j | \]

The model is still linear, but the solution to this problem is not the OLS estimates.
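As a rough sketch of that alternative estimator (least absolute deviations), the code below uses optim() to minimize the mean absolute residual for the univariate regression of log earnings on age and compares the result with lm(); the filtering step is just a precaution in case the log transformation produced any non-finite values.

# keep only rows where both variables are well defined
dat <- fake_data %>% filter(is.finite(log_earnings), is.finite(age))

ols <- lm(data = dat, log_earnings ~ age)

# least-absolute-deviations objective for a univariate linear model
lad_objective <- function(b, y, x) mean(abs(y - b[1] - b[2] * x))

lad <- optim(par = coef(ols),            # start the search from the OLS estimates
             fn  = lad_objective,
             y   = dat$log_earnings,
             x   = dat$age)

coef(ols)   # OLS estimates
lad$par     # least-absolute-deviations estimates (generally different)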

We also learned how to interpret coefficients in a linear model. \(\beta_0\) is the y-intercept of the line, so \[ E[y_{i}|x_{i}=0]=\beta_0. \] It is the expected value of y when x = 0, and \(\hat{\beta}_0\) is its sample counterpart.

For any other coefficient, such as \(\beta_1\), \(\beta_2\), or \(\beta_3\), \[ E[y_{i}|x_{i}=1]- E[y_{i}|x_{i}=0]= \beta \]

is the change in the expected value of y due to a one-unit change in the corresponding x. Therefore, each \(\beta\) tells us the effect that a particular covariate has on y, ceteris paribus. Transformations can also be applied to the variables in question, which changes how we interpret the \(\beta\) coefficient. Overall, these coefficient estimates are of great importance when we are developing our research!

The following table summarizes the new commands we have seen in this module.

Command                       Function
lm(data=<data>, <model>)      Estimates a linear model using <data> as the dataset and <model> as the model specification.