library(tidyverse)
library(haven)
source("beginner_simple_regression_tests.r")
1.7 - Beginner - Simple Regression
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to Data
- Introduction to R
- Hypothesis testing
Outcomes
- Build a simple linear regression using R
- Create and interpret regression outputs in R, including coefficient estimates
- Examine the various elements of regression objects in R (including fitted values, residuals and coefficients)
- Explain the role of qualitative variables in regression analysis as dummy variables
Part 1: Learning About Regressions
What is a regression?
What is the relationship of a regression to other statistical concepts?
How do we use regressions to answer economic questions?
In this notebook, we will explore these questions using our census data set and will learn more about the immigrant wage gap. If you remember from the last lecture, we were interested in the relationship between wages and immstat - wages and immigration status. However, we can also use the variable mrkinc (market income) to measure income. Let’s explore the relationship between wages and mrkinc in this data set.
Let’s start off with a visualization:
census_data <- read_dta("../datasets_beginner/01_census2016.dta")
census_data <- as_factor(census_data)

census_data <- filter(census_data, !is.na(census_data$wages))  # removing the rows in which wages = NA
census_data <- filter(census_data, !is.na(census_data$mrkinc)) # removing the rows in which mrkinc = NA

glimpse(census_data)
options(repr.plot.width=6,repr.plot.height=4) #controls the image size
f <- ggplot(data = census_data, aes(x = wages, y = mrkinc)) + xlab("Wages") + ylab("Market Income")
f <- f + geom_point()
f
Think Deeper: What do you see here? Is there anything about this relationship that sticks out to you? Why does it have the shape it does?
While you can probably tell that there is some relationship between wages and market income, it can be difficult to visualize using a scatterplot alone since there are far too many points.
Conditional Expectation:
The expectation of \(X\) is the outcome we expect \(X\) to take on average after a lot of sampling. We calculate it by multiplying each value \(X\) can take by the probability of \(X\) taking that value, then summing. For instance, the expectation of a fair six-sided die throw is 3.5. Essentially, \[ E[X] = \sum_{i=1}^{n} P(X = x_i)\, x_i \]
You can think of conditional expectation as the expected value given that some condition holds: \[ E[X|Y = y] = \sum_{i=1}^{n} P(X = x_i|Y = y)\, x_i \]
The conditional expectation is the expectation of a random variable \(X\), conditional on the value taken by another random variable \(Y\). If the value of \(Y\) affects the value of \(X\) (i.e. \(X\) and \(Y\) are dependent), the conditional expectation of \(X\) given the value of \(Y\) will differ from the overall expectation of \(X\).
In other words, we use conditional expectation when we predict that there is a relationship between a predictor variable and the response variable, such that we want our predictions to be made in the context of a specific value of the predictor(s).
The shape of the conditional expectation function indicates the relationship between the two variables we are interested in. For example, the conditional expectation of a die roll, given that the number is even, is 4.
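As a quick sanity check, we can compute both of these expectations directly. This is a minimal sketch using base R only; it does not involve the census data.

```r
# Faces of a fair six-sided die
faces <- 1:6

# Unconditional expectation: each face is equally likely
E_X <- mean(faces)  # 3.5

# Conditional expectation given the roll is even:
# only {2, 4, 6} remain possible, each with probability 1/3
even_faces <- faces[faces %% 2 == 0]
E_X_given_even <- mean(even_faces)  # 4

E_X
E_X_given_even
```

Conditioning on the event "even" restricts the possible outcomes and re-weights their probabilities, which is why the conditional expectation (4) differs from the unconditional one (3.5).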
Linear regression assumes a linear conditional expectation function which means that the conditional expectation function can be described by a straight line:
\(E[Y|X = x] = \beta_0 + \beta_1 x\)
We can split a regression model into two parts:
- The conditional expectation
- An error term
To be clear, let’s look at an example of this linear conditional expectation function, where our \(Y\) (outcome variable) is wages and our \(X\) (explanatory variable) is years of education.
The conditional expectation function would be: \(E[WAGES|YEARS = years_i] = \beta_0 + \beta_1\, years_i\).
This means that, given a particular value of years of education (\(years_i\)) for an individual \(i\), the expected wages of that individual follow the linear form \(\beta_0 + \beta_1\, years_i\).
Regression Models
A regression model specifies (the specification) the relationship between two variables. For example, a linear relationship would be:
\[ M_i = \beta_0 + \beta_1 W_i \]
\(M_i\), the market income of individual \(i\) is our outcome variable.
\(W_i\), their wage is our explanatory variable.
In econometrics, we use the terms outcome variable and explanatory variable rather than the dependent and independent variable respectively.
A model like this describes the relationship between the variables - but it also depends on two unknowns: \(\beta_0\), \(\beta_1\).
- The \(\beta_0\) and \(\beta_1\) are parameters of the model: they are numbers that determine the relationship (intercept and slope) between \(M_i\) and \(W_i\)
- This is a linear relationship because the model is linear in the parameters. It also happens to be linear in the variables, but that isn’t required (we will explore this later).
It is highly unlikely that \(M_i = \beta_0 + \beta_1 W_i\) can explain everything about our data. We also need to include the residual term \(\epsilon_i\) (meaning “leftover”).
- The \(\epsilon_i\) is the residual: a component that corresponds to the part of the data which is not described by the model
- These residual terms will usually have certain assumed properties that allow us to estimate the model.
We can think about a regression as two parts:
- The part of the relationship explained by our model (\(M_i = \beta_0 + \beta_1 W_i\))
- The part which is not explained (\(\epsilon_i\))

The process of “fitting” or estimating a regression model selects values for \(\beta_0\) and \(\beta_1\) that minimize the amount left to be explained by the residual term. We write the complete regression equation by combining the two parts of the model:
\[ M_i = \beta_0 + \beta_1 W_i + \epsilon_i \]
The goal of regression analysis is to:
- Accurately estimate this equation (and especially the model parameters)
- Learn about the relationship between \(M_i\) and \(W_i\) from the results of that estimation.
There are several ways we could define “accurately” and “estimate” here. In this course, we use ordinary least squares (OLS):
\[ (\hat{\beta_0},\hat{\beta_1}) = \arg \min_{b_0,b_1} \sum_{i=1}^{n} (M_i - b_0 - b_1 W_i)^2 = \arg \min_{b_0,b_1} \sum_{i=1}^{n} (e_i)^2 \]
While this may look complicated, it is just the mathematical way of writing “choose \(\hat{\beta_0}\) and \(\hat{\beta_1}\) such that they minimize the sum of the squared residuals”.
- A regression should aim to have the bulk of the results explained by the parameters (\(\beta_0, \beta_1\)) and as little as possible using \(\epsilon_i\).
- Our statistical problem has now transformed into a calculus problem, which you could solve for instance by taking derivatives.
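Taking those derivatives and setting them to zero yields the well-known closed-form OLS solution: \(\hat{\beta_1} = \mathrm{Cov}(W, M)/\mathrm{Var}(W)\) and \(\hat{\beta_0} = \bar{M} - \hat{\beta_1}\bar{W}\). The sketch below verifies on made-up toy data (not the census data; the variables `w` and `m` are purely illustrative) that these formulas match what `lm()` produces:

```r
# Toy data standing in for wages (w) and market income (m); purely illustrative
set.seed(123)
w <- runif(100, 20000, 80000)
m <- 5000 + 1.5 * w + rnorm(100, sd = 4000)

# Closed-form OLS estimates from the first-order conditions
b1_hat <- cov(w, m) / var(w)
b0_hat <- mean(m) - b1_hat * mean(w)

# Compare with R's built-in estimator
fit <- lm(m ~ w)
c(manual = b1_hat, lm = unname(coef(fit)[2]))  # the two slope estimates agree
```

This is why OLS is so convenient: the minimization problem has an exact algebraic solution, so no iterative searching is needed for the simple linear case.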
There are numerous ways to solve this estimation problem, either computationally in R or analytically with calculus. Let’s start by trying to fit a line through the points ourselves.
Example: Manual Estimation
A bad (but instructive) way to solve this is the good ol’ eyeball method: observe the scatter plot and guess which values might perform well.
Try to get the best fit you can by playing around with the following example.
# set the values of B_0 and B_1 with these lines
B_0 <- 0 # change me
B_1 <- 1 # change me

# don't touch the rest of this code - but see if you can understand it!
SSE <- sum((census_data$mrkinc - B_0 - B_1*census_data$wages)^2)

# here is the SSE from your model
round(SSE/1000000, 0)
What was the lowest value you got? Here is what your guess looks like in a graph:
# just run this cell to see your results
# re-run it if you change the values
options(repr.plot.width=6,repr.plot.height=4) #controls the image size
fitted_line <- data.frame(wages = census_data$wages, mrkinc = B_0 + B_1*census_data$wages)

f <- ggplot(data = census_data, aes(x = wages, y = mrkinc)) + xlab("Wages") + ylab("Market Income")
f <- f + geom_point() + geom_line(color = "red", data = fitted_line)
f
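Instead of guessing by eye, you can also hand an SSE function to a numerical optimizer and let it search for the best pair of values. This is a sketch on made-up toy data; the `x` and `y` here are illustrative stand-ins, not the census columns.

```r
# Toy stand-in data; with the real data you would use wages and mrkinc instead
set.seed(42)
x <- runif(200, 0, 100)
y <- 10 + 2 * x + rnorm(200, sd = 5)

# Sum of squared errors for a candidate (B_0, B_1) pair
sse <- function(b) sum((y - b[1] - b[2] * x)^2)

# Numerically minimize the SSE, starting from the naive guess (0, 1)
result <- optim(c(0, 1), sse)
round(result$par, 2)  # close to the true values (10, 2)
```

`optim()` repeats the "guess, check the SSE, adjust" loop you just did by hand, only systematically, until no nearby pair of values does better.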
Interactive Visualization of OLS
Understanding OLS is fundamental to understanding regressions and other topics in econometrics. Let’s try to understand the formula for OLS above through a more visual approach:
\[ (\hat{\beta_0},\hat{\beta_1}) = \arg \min_{b_0,b_1} \sum_{i=1}^{n} (M_i - b_0 - b_1 W_i)^2 = \arg \min_{b_0,b_1} \sum_{i=1}^{n} (e_i)^2 \]
To demonstrate this, we will use a small scatter plot with just 4 points.
The straight line through the scatter plot is modelled by the simple regression formula \(\beta_0 + \beta_1 X\).
Since it’s nearly impossible for a regression to perfectly predict the relationship between two variables, we will almost always include an unobservable error \(e_i\) in our regression estimation. This is the vertical distance between the regression line and the actual data points.
Hence each of the points can be modelled by the equation \(Y_i = \beta_0 + \beta_1 X_i + e_i\).
Instead of minimizing the error terms, we will try to minimize the squared errors which are represented by the size of those red boxes.
Try your own values for beta_0 and beta_1. Make sure to try values only roughly within the specified ranges. The actual values of beta_0 and beta_1 that minimize the residual sum of squares are 0.65 and 0.82 respectively. The code block below also displays the area of the red boxes; deviation from these optimal values will increase that area.
beta_0 <- 0.65 # CHANGE THIS VALUE, TRY VALUES BETWEEN 0 - 1
beta_1 <- 0.82 # CHANGE THIS VALUE, TRY VALUES BETWEEN 0.6 - 1.4

x <- c(1, 2, 3, 4)
y <- c(1.7, 1.5, 4, 3.6)
# don't worry about this code, just run it!
dta <- data.frame(x, y)

example_df_graph <- dta %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = beta_0, slope = beta_1) +
  xlim(0, 5) +
  ylim(0, 5) +
  geom_rect(aes(xmin = (dta[1, "x"] + (beta_0 + (beta_1 * dta[1, "x"])) - dta[1, "y"]), xmax = dta[1, "x"],
                ymin = (beta_0 + (beta_1 * dta[1, "x"])), ymax = dta[1, "y"]),
            alpha = 0.1,
            fill = "red") +
  geom_rect(aes(xmin = dta[2, "x"], xmax = (dta[2, "x"] + (beta_0 + (beta_1 * dta[2, "x"])) - dta[2, "y"]),
                ymin = dta[2, "y"], ymax = (beta_0 + (beta_1 * dta[2, "x"]))),
            alpha = 0.1,
            fill = "red") +
  geom_rect(aes(xmin = (dta[3, "x"] + (beta_0 + (beta_1 * dta[3, "x"])) - dta[3, "y"]), xmax = dta[3, "x"],
                ymin = (beta_0 + (beta_1 * dta[3, "x"])), ymax = dta[3, "y"]),
            alpha = 0.1,
            fill = "red") +
  geom_rect(aes(xmin = dta[4, "x"], xmax = (dta[4, "x"] + (beta_0 + (beta_1 * dta[4, "x"])) - dta[4, "y"]),
                ymin = dta[4, "y"], ymax = (beta_0 + (beta_1 * dta[4, "x"]))),
            alpha = 0.1,
            fill = "red")

example_df_graph
# each red box is a square whose side is the vertical deviation from the line,
# so its area is (actual y - fitted y)^2
area_1 <- (dta[1, "y"] - (beta_0 + beta_1 * dta[1, "x"]))^2
area_2 <- (dta[2, "y"] - (beta_0 + beta_1 * dta[2, "x"]))^2
area_3 <- (dta[3, "y"] - (beta_0 + beta_1 * dta[3, "x"]))^2
area_4 <- (dta[4, "y"] - (beta_0 + beta_1 * dta[4, "x"]))^2

area <- area_1 + area_2 + area_3 + area_4
print("Area of red boxes is: ")
area
Simple Regressions in R
Now, let’s see how we could use a regression in R to do this.
- Regression models in R look like: Y ~ X, where Y is regressed on X, and the ~ symbol is called a “tilde”.
We can ignore the residual terms and parameters when writing the model in R and just focus on the variables for now.
So, for example, our regression model is
\[ M_i = \beta_0 + \beta_1 W_i + \epsilon_i \]
Which can be written in R as
mrkinc ~ wages
Regressions are estimated in R using the lm function, which takes a formula and a data frame as arguments.
- This creates a linear model object, which can be used to calculate things (using prediction) or perform tests
- It also stores all of the information about the model, such as the coefficient and fit
- These models can also be printed and summarized to give important basic information about a regression
For an example linear model called my_model, some of the most important elements are:
- my_model$coefficients: the parameter coefficients
- my_model$residuals: the residuals
- my_model$fitted.values: the predicted values
Let’s see our model in action here.
regression1 <- lm(mrkinc ~ wages, data = census_data)
summary(regression1)
head(regression1$coefficients)
Take a close look at the results. Identify the following elements:
- The values of the parameters
- The standard errors of the parameters
- The percentage of variation in the data explained by the model (R-squared)
Test Your Knowledge: What % of the data is explained by the model? Answer to 2 decimal places.
#Hint: Convert the Multiple-R squared value into a percentage
ans1 <- ... # answer goes here
test_1()
The underlying model and the parameters tell us about the relationship between the different values:
\[ M_i = -2455 + 1.61 W_i + \epsilon_i \]
Notice, for example:
\[ \frac{\partial M_i}{\partial W_i} = \beta_1 = 1.61 \]
In other words, when wages go up by 1 dollar, we would expect that market income will rise by 1.61 dollars. This kind of analysis is key to interpreting what this model is telling us.
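We can see this "one dollar up, \(\beta_1\) dollars up" logic directly with predict(). The sketch below uses simulated toy data whose coefficients roughly mirror the model above (the names wages_toy and mrkinc_toy are illustrative stand-ins, not the census columns):

```r
# Simulated toy data loosely mirroring the fitted model above
set.seed(1)
wages_toy <- runif(150, 20000, 90000)
mrkinc_toy <- -2000 + 1.6 * wages_toy + rnorm(150, sd = 3000)
toy_fit <- lm(mrkinc_toy ~ wages_toy)

# Predicted market income at two wage levels one dollar apart
new_points <- data.frame(wages_toy = c(50000, 50001))
preds <- predict(toy_fit, newdata = new_points)

# The predicted change for a $1 wage increase is exactly the slope coefficient
unname(diff(preds))
unname(coef(toy_fit)[2])  # same number
```

Because the model is linear, this marginal effect is the same at every wage level; that constant-slope property is exactly what the partial derivative above expresses.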
Finally, let’s visualize our fitted model on the scatterplot from before. How does it compare to your original model?
fitted_line2 <- data.frame(wages = census_data$wages, mrkinc = regression1$fitted.values)

f <- ggplot(data = census_data, aes(x = wages, y = mrkinc)) + xlab("Wages") + ylab("Market Income")
f <- f + geom_point() + geom_line(color = "red", data = fitted_line) + geom_line(color = "blue", data = fitted_line2)
f
As you can see, there’s a very close relationship between mrkinc and wages. This implies that we can focus our attention on wages in our analysis of the immigrant wage gap.
Part 2: Simple Regressions and \(t\)-Tests
Previously, we looked at the relationship between market income and wages, which are both quantitative variables. However, what if we wanted to work with a qualitative variable like immstat?
Thankfully, regression models can still incorporate this kind of variable, which is among the most common in real-world data. How is this possible?
Let’s start out with the simplest kind of qualitative variable: a dummy (0 or 1) variable. Consider the regression equation:
\[ W_i = \beta_0 + \beta_1 I_i + \epsilon_i \]
The conditional expectations of \(W_i\) when \(I_i = 1\) and when \(I_i = 0\) are:
\[ E[W_i|I_i = 1] = \beta_0 + \beta_1 \cdot 1 + E[\epsilon_i|I_i = 1] \]
\[ E[W_i|I_i = 0] = \beta_0 + \beta_1 \cdot 0 + E[\epsilon_i|I_i = 0] \]
Under Assumption 1, we have that \(E[\epsilon_i|I_i] = 0\), so:
\[ E[W_i|I_i = 1] = \beta_0 + \beta_1 \]
\[ E[W_i|I_i = 0] = \beta_0 \]
Subtracting the second expression from the first:
\[ \beta_1 = E[W_i|I_i = 1] - E[W_i|I_i = 0] \]
What this tells us:
- You can include dummy variables in regressions
- The coefficients of the dummy variable have meaning in terms of the regression model
- They measure the (average) difference in the dependent variable between the two levels of the dummy variable
Therefore dummy variables can be included in a regression model just like quantitative variables.
Let’s look at this in terms of immstat. We can create our regression equation as:
\[ W_i = \beta_0 + \beta_1 I_i + \epsilon_i \]
Then we can estimate this using R.
regression2 <- lm(wages ~ immstat, data = census_data)
summary(regression2)
What do you see here?
Test Your Knowledge: What is the difference in average wage between immigrants and non-immigrants?
# input the answer (to 1 decimal place)
answer2 <- ...
test_2()
The number above might seem familiar, if you remember what we learned about a \(t\)-test from earlier. Remember this result?
t1 <- t.test(x = filter(census_data, immstat == "immigrants")$wages,
             y = filter(census_data, immstat == "non-immigrants")$wages,
             alternative = "two.sided",
             mu = 0,
             conf.level = 0.95)
t1

t1$estimate[1] - t1$estimate[2]
Look closely at the results. What is the relationship here?
When the explanatory variable is a dummy, a regression carries out the same comparison of two sample means. Recall:
\[ \beta_1 = E[W_i|I_i = 1] - E[W_i|I_i = 0] \]
The regression coefficient of \(\beta_1\) here is a comparison of two means. This is the same as how a \(t\)-test compares two means by different groups - groups which are specified by \(I_i = 0\) or \(I_i = 1\).
- In other words, another way of thinking about a regression is like a form of a comparison of means test.
- It can handle the same kind of analysis (i.e. with dummies), but can also include quantitative variables - which regular comparison of means tests cannot handle.
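We can verify this equivalence on simulated data. This sketch uses a made-up two-group sample (the dummy and wage values are illustrative, not the census variables) and checks that the lm() coefficient on the dummy equals the raw difference in group means:

```r
# Simulated wages for two groups, coded as a 0/1 dummy (toy setup, not census data)
set.seed(7)
n <- 500
dummy <- rbinom(n, 1, 0.4)
wage_sim <- 45000 - 5000 * dummy + rnorm(n, sd = 8000)

# Difference in sample means between the two groups
mean_diff <- mean(wage_sim[dummy == 1]) - mean(wage_sim[dummy == 0])

# The regression slope on the dummy recovers exactly the same number
fit_dummy <- lm(wage_sim ~ dummy)
c(mean_diff = mean_diff, beta_1 = unname(coef(fit_dummy)[2]))
```

The two numbers agree to machine precision, which is the algebraic identity \(\beta_1 = E[W_i|I_i = 1] - E[W_i|I_i = 0]\) at work in the sample.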
Part 3: Exercises
Activity 1
In this activity, we’ll explore how the immigrant wage gap could depend on sex (male vs. female). We can now examine this issue directly using regressions.
Estimate the immigrant wage gap for males and for females using regressions.
Tested objects: regm
(the regression for males), regf
(the regression for females).
# Activity 1
# Regression for males
regm <- lm(... ~ ..., data = filter(census_data, ... == ...)) # what should replace the ...?
# Hint: don't forget the quotation marks when specifying the subset

# Regression for females
regf <- ... # what should replace the ...?

summary(regm) # allows us to view regm's coefficient estimates
summary(regf) # same as above, but for regf
test_3() # Quiz1
test_4() # Quiz2
Short Answer 1
Prompt: How do we interpret the coefficient (Intercept) estimate on immstat in each of these regressions?
A The average wage of a non-immigrant
B The average wage of an immigrant
C The difference between the average wage of an immigrant and non-immigrant
D Nothing we should worry about
# Enter your answer below as "A", "B", "C", or "D"
answer20 <- "..."
test_20(answer20)
Short Answer 2
Prompt: Compare the gaps. Is the immigrant wage gap larger for males or females? Why do you think that might be?
A The immigrant pay gap for females is much greater than that of males
B The immigrant pay gap for males is much greater than that of females
C The immigrant pay gap is roughly the same
# Enter your answer below as "A", "B", or "C"
answer21 <- ""
test_21(answer21)
Activity 2
Many studies have suggested that workers’ wages increase as they age. In this activity, we will explore how the immigrant wage gap varies by age. First, let’s see the factor levels of agegrp:
levels(census_data$agegrp) # Run this!
As we can see, there are several age groups in this dataframe, including ones that would not be particularly informative (have you ever seen a 3-year-old doing salary work?). Let’s estimate the immigrant wage gap (with no controls) for five of these groups separately:
- 20 to 24 years
- 30 to 34 years
- 40 to 44 years
- 50 to 54 years
- 60 to 64 years
Tested objects: reg5_20 (20 to 24 years), reg5_50 (50 to 54 years).
reg5_20 <- lm(wages ~ immstat, data = filter(census_data, agegrp == '20 to 24 years'))

reg5_30 <- ... # what should go here? Use the code above as a template
reg5_40 <- ...
reg5_50 <- ...
reg5_60 <- ...

# store the summaries (but don't show them! too many!)
sum20 <- summary(reg5_20)
sum30 <- summary(reg5_30)
sum40 <- summary(reg5_40)
sum50 <- summary(reg5_50)
sum60 <- summary(reg5_60)
test_12() # Quiz3
test_16() # Quiz4
The code below will tabulate a brief summary of each regression:
# Just run me! You don't need to edit this
Age_Group <- c("20-24", "30-34", "40-44", "50-54", "60-64")
Wage_Gap <- c(reg5_20$coefficients[2], reg5_30$coefficients[2], reg5_40$coefficients[2], reg5_50$coefficients[2], reg5_60$coefficients[2])
Std._Error <- c(sum20$coefficients[2,2], sum30$coefficients[2,2], sum40$coefficients[2,2], sum50$coefficients[2,2], sum60$coefficients[2,2])
t_Value <- c(sum20$coefficients[2,3], sum30$coefficients[2,3], sum40$coefficients[2,3], sum50$coefficients[2,3], sum60$coefficients[2,3])
p_Value <- c(sum20$coefficients[2,4], sum30$coefficients[2,4], sum40$coefficients[2,4], sum50$coefficients[2,4], sum60$coefficients[2,4])

tibble(Age_Group, Wage_Gap, Std._Error, t_Value, p_Value) # it's like a table, but a tibble
Short Answer 3
Prompt: What happens to the immigrant wage gap as we move across age groups? What do you think might explain these changes?
A Wage gap declines as age group decreases
B Wage gap increases as age group increases
C Wage gap is the highest at the “40 to 44 years” age group
# Enter your answer below as "A", "B", or "C"
answer22 <- ""
test_22(answer22)
Activity 3
As we observed in last week’s worksheet, the immigrant wage gap could differ by education level. As there are many education categories, it may be tedious to run a regression for each individual education level.
Instead, we could run a single regression and add education level as a second regressor, \(E_i\):
\[ W_i = \beta_0 + \beta_1 I_i + \beta_2 E_i + \epsilon_i \]
This is actually a multiple regression, which we will learn about later - but for the purposes of this lesson, the idea is that it is “run” in R in essentially the same way as a simple regression. Estimate the regression model above without \(E_i\), then re-estimate the model with \(E_i\) added.
Tested objects: reg2A (regression without controls), reg2B (regression with controls).
# Naive regression (just immstat)
reg2A <- lm(... ~ ..., data = census_data) # this one works already

# Regression with controls
reg2B <- lm(... ~ immstat + ..., data = census_data) # what should replace the ...? Think about the model
# This will look ugly; try to look carefully at the output
summary(reg2A)$coefficients
summary(reg2B)$coefficients
test_7()
test_8() # Quiz 5
Short Answer 4
Prompt: Compare the estimated immigrant wage gap with and without \(E_i\) in the regression. What happens to the gap when we add \(E_i\)? How do we interpret this?
A The estimated immigrant wage gap has increased after adding controls
B The estimated immigrant wage gap has decreased after adding controls
C The estimated immigrant wage gap has not changed after adding controls
# Enter your answer below as "A", "B", or "C"
answer23 <- "..."
test_23(answer23)
Activity 4
Another topic of interest for labor economists that is related to the immigrant wage gap is racial wage discrimination - the issue of workers of similar productivity being paid different wages on average because of their race. Consequently, we can also use regressions to estimate the racial wage gap.
Let’s suppose that we want to estimate this racial wage gap. Run a regression (without controls) that does this.
Tested objects: reg_race.
# Do not modify this line (sets "not a visible minority" as the reference level):
census_data$vismin <- relevel(census_data$vismin, ref = "not a visible minority")
# this is also how you set a different base level for a factor (handy!)
# Racial Wage Gap Regression
reg_race <- lm(wages ~ ..., data = census_data) # what model should we use here?
summary(reg_race)
test_10() #Quiz6
Short Answer 5
Prompt: How should we interpret the regression estimate for visminblack?
A People from the Black community make on average about 14,795 dollars less as compared to an average white person.
B Black immigrants make 14,795 dollars less than Black non-immigrants on average
C On average, a person from the Black community makes 14,795 dollars less than an average white person, holding all other variables constant
# Enter your answer below as "A", "B", or "C"
answer24 <- "..."
test_24(answer24)
Short Answer 6
Prompt: With this racial wage gap in mind, let’s return to the immigrant wage gap. Should we add explanatory variables for race to our regression from activity 2 and 3? Why or why not?
A No we should not
B Yes, we should because there could be other factors explaining the wage gap
C Yes, we should control for education only.
D Yes, we should control for immigrant status only.
# Enter your answer below as "A", "B", "C", or "D"
answer25 <- "..."
test_25(answer25)