2.4 - Intermediate - Issues in Regression
Outline
Prerequisites
- Multiple regression
- Simple regression
- Data analysis and introduction
Outcomes
- Understand the origin and meaning of multicollinearity in regression models
- Perform simple tests for multicollinearity using VIF
- Be able to demonstrate common methods to fix or resolve collinear data
- Understand the origin and meaning of heteroskedasticity in regression models
- Perform a variety of tests for heteroskedasticity
- Compute robust standard errors for regression models
- Understand other techniques for resolving heteroskedasticity in regression models
References
- Statistics Canada, Survey of Financial Security, 2019, 2021. Reproduced and distributed on an “as is” basis with the permission of Statistics Canada. Adapted from Statistics Canada, Survey of Financial Security, 2019, 2021. This does not constitute an endorsement by Statistics Canada of this product.
- Stargazer package is due to: Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
# load the data and set it up
library(car)
library(tidyverse)
library(haven)
library(stargazer)
library(lmtest)
library(sandwich)

source("intermediate_issues_in_regression_functions.r")
source("intermediate_issues_in_regression_tests.r")

SFS_data <- read_dta("../datasets_intermediate/SFS_2019_Eng.dta")
SFS_data <- clean_up_data(SFS_data) # massive data cleanup

glimpse(SFS_data)
In this notebook, we will explore several important issues that arise in multiple regression models, and learn how to identify, evaluate, and correct them where appropriate. It is important to remember that many other issues can arise in specific regression models; as you learn more about econometrics and develop your own research questions, different issues will come up. Consider these “examples” of some of the most common issues that arise in regression models.
Part 1: Multicollinearity
Multicollinearity is a surprisingly common issue in applied regression analysis: several explanatory variables are correlated with each other, so that one variable is “overdetermined” by the other variables in the model, which results in less reliable regression output. For example, suppose we are interested in regressing marriage rates on years of education and annual income. In this case, the two explanatory variables, income and years of education, are highly correlated. If we estimate a high coefficient on education, how certain are we that this coefficient is not simply the result of having a high annual income as well? Let’s look at this problem mathematically. In an OLS estimation, you are estimating a relationship like:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
You find the estimates of the coefficients in this model using OLS; i.e., solving an equation like:
\[ \min_{\beta_0, \beta_1} \sum_{i=1}^n(Y_i - \beta_0 - \beta_1 X_i)^2 \]
Under the OLS regression assumptions, this has a unique solution; i.e., you can find unique values for \(\beta_0\) and \(\beta_1\).
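For reference, this minimization has the familiar closed-form solution (a standard textbook result, included here as a reminder):

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]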
However, what if you wrote down a model like this:
\[ Y_i = \beta_0 + \beta_1 + \beta_2 X_i + \epsilon_i \]
Here the intercept is the combination \(\beta_a = \beta_0 + \beta_1\).
This seems like it would be fine, but remember what you are doing: trying to find a line of best fit. The problem is that this equation does not define a unique line; the “intercept” is \(\beta_0 + \beta_1\). There are two “parameters” (\(\beta_0, \beta_1\)) for a single “characteristic” (the intercept). This means that the resulting OLS problem:
\[ \min_{\beta_0, \beta_1, \beta_2} \sum_{i=1}^n(Y_i - \beta_0 - \beta_1 - \beta_2 X_i)^2 \]
does not have a unique solution. In algebraic terms, this means you can find many representations of a line with two intercept parameters. This is referred to in econometrics as a lack of identification; multicollinearity is one way that identification can fail in regression models.
You can see this in the following example, which fits an OLS estimate of wealth on income_before_tax, then compares the fit to a regression with two intercepts. Try changing the values to see what happens.
Note: make sure you understand what the example below is doing. Notice how the results are exactly the same, no matter what the value of k is?
reg <- lm(wealth ~ income_before_tax, data = SFS_data)

b_0 <- reg$coef[[1]]
b_1 <- reg$coef[[2]]

resid1 <- SFS_data$wealth - b_0 - b_1*SFS_data$income_before_tax

k <- 90 # change me!

b_0 = (reg$coef[[1]])/2 - k
b_1 = (reg$coef[[1]])/2 + k
b_2 = reg$coef[[2]]

resid2 <- SFS_data$wealth - b_0 - b_1 - b_2*SFS_data$income_before_tax
ggplot() + geom_density(aes(x = resid1), color = "blue") + xlab("Residuals from 1 Variable Model") + ylab("Density")
ggplot() + geom_density(aes(x = resid2), color = "red") + xlab("Residuals from 2 Variable Model") + ylab("Density")
Notice how the residuals look exactly the same - despite these being from (purportedly) two different models. This is because they are not really two different models: they identify the same model!
Okay, you’re probably thinking, that makes sense - but just don’t write down an equation like that. After all, it seems somewhat artificial that we added an extra intercept term.
However, multicollinearity can occur with any set of variables in the model, not just the intercept. For example, suppose you have a multiple regression:
\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \beta_3 X_{3,i} + \epsilon_i \]
What would happen if there was a relationship between \(X_1, X_2\) and \(X_3\) like:
\[ X_{1,i} = 0.4 X_{2,i} + 12 X_{3,i} \]
We could then re-write the equation as:
\[ Y_i = \beta_0 + \beta_1 (0.4 X_{2,i} + 12 X_{3,i}) + \beta_2 X_{2,i} + \beta_3 X_{3,i} + \epsilon_i \]
\[ \implies Y_i = \beta_0 + (\beta_2 + 0.4 \beta_1) X_{2,i} + (\beta_3 + 12 \beta_1)X_{3,i} + \epsilon_i \]
The same problem is now occurring, but with \(X_2\) and \(X_3\): the slope coefficients depend on a free parameter (\(\beta_1\)). You cannot uniquely find the equation of a line (or plane) with this kind of equation.
Essentially, you are trying to solve a system of equations for 3 variables (or \(n\) variables), but only 2 (or \(n-1\)) of them are independent. The solution is to leave one of the dependent variables out, so that you can solve for the rest - and that is exactly what R does.
You can also intuitively see the condition here: multicollinearity occurs when you can express one variable as a linear combination of the other variables in the model.
- This is sometimes referred to as perfect multicollinearity, since the variable is perfectly expressed as a linear combination of the other variables.
- The linearity is important because this is a linear model; you can have similar issues in other models, but it has a special name in linear regression.
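To see how R handles this in practice, here is a minimal sketch using simulated (hypothetical) data - not the SFS - where one regressor is constructed as an exact linear combination of the other two, exactly as in the relationship above. R estimates the model but reports one coefficient as NA, noting that it is “not defined because of singularities”.

# hypothetical simulated data: x1 is an exact linear combination of x2 and x3
set.seed(123)
demo_data <- tibble(x2 = rnorm(100), x3 = rnorm(100)) %>%
    mutate(x1 = 0.4*x2 + 12*x3,
           y = 1 + 2*x2 - x3 + rnorm(100))

summary(lm(y ~ x1 + x2 + x3, data = demo_data)) # one coefficient is reported as NA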
Perfect Multicollinearity in Models
In general, most statistical packages (like R) will automatically detect, warn, and remove perfectly multicollinear variables from a model; this is because the algorithm they use to solve problems like the OLS estimation equation detects the problem and avoids a “crash”. This is fine, from a mathematical perspective - since mathematically the two results are the same (in a well-defined sense, as we saw above).
However, from an economic perspective this is very bad - it indicates that there was a problem with the model that you defined in the first place. Usually, this means one of three things:
- You included a set of variables which were, in combination, identical. For example, including “family size” and then “number of children” and “number of adults” in a regression
- You did not understand the data well enough, and variables had less variation than you thought - conditional on the other variables in the model. For example, maybe you thought people in the dataset could hold both graduate and undergraduate degrees, so that there was variation in “higher than high school” - but that wasn’t true
- You wrote down a model which was poorly defined in terms of the variables. For example, you included all levels of a dummy variable, or included the same variable measured in two different units (wages in dollars and wages in 1000s of dollars).
In all of these cases, you need to go back to your original regression model and re-evaluate what you are trying to do in order to simplify the model or correct the error.
Consider the following regression model, in which we want to study whether or not there is a penalty for families led by someone who is younger in the SFS data:
SFS_data <- SFS_data %>%
    mutate(ya = case_when(
        education == "Less than high school" ~ "Yes",
        education == "High school" ~ "Yes",
        education == "Non-university post-secondary" ~ "No",
        TRUE ~ "No" # this is for all other cases
    )) %>%
    mutate(ya = as_factor(ya))

regression2 <- lm(income_before_tax ~ ya + education, data = SFS_data)

summary(regression2)
Can you see why the multi-collinearity is occurring here? Try to write down an equation which points out what the problem is in this regression - why is it multi-collinear? How could you fix this problem by changing the model?
Think Deeper: You will notice, above, that it excluded the “University” education. Did it have to exclude that one? Could it have excluded another one instead? What do you think?
Imperfect Multicollinearity
A related issue to perfect multicollinearity is “near” (or imperfect) multicollinearity. If you recall from the above, perfect multicollinearity occurs when you have a relationship like:
\[ X_{1,i} = 0.4 X_{2,i} + 12 X_{3,i} \]
Notice that this relationship holds for all values of \(i\). However, what if it held for nearly all \(i\) instead? In that case, we would still have a solution to the estimation problem… but there would be a problem. Let’s look at this in the simple regression case.
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
Now, let’s suppose that \(X_i\) is “almost” collinear with the constant term. To be precise, suppose that \(X_i = 15\) for a fraction \(k\) of the data (where \(k\) is large) and \(X_i = 20\) for the remaining fraction \(1-k\). This is almost constant, and so it is almost collinear with \(\beta_0\) (the constant). Let’s also generate the values of \(Y_i\) so that \(Y_i = X_i + \epsilon_i\) (so \(\beta_1 = 1\)), and we will set \(\sigma_Y = 1\).
This implies that:
\[ \beta_1 = \frac{Cov(X_i,Y_i)}{Var(X_i)} = 1 \]
\[ s_b = \frac{1}{\sqrt{n-2}}\sqrt{\frac{1}{r^2}-1} \]
\[ r = \frac{\sigma_X}{\sigma_Y} \]
As you can see, when \(Var(X_i)\) goes down, \(\sigma_X\) falls, and the value of \(r\) falls; intuitively, when \(k\) rises, the variance will go to zero, which makes \(r\) go to zero as well (since there’s no variation). You can then see that \(s_b\) diverges to infinity.
We can make this more precise. In this model, how does \(Var(X_i)\) depend on \(k\)? Well, first notice that \(\bar{X_i} = 15\cdot k + 20 \cdot (1-k)\). Then,
\[ Var(X_i) = E[(X_i - \bar{X_i})^2] = k (15 - \bar{X_i})^2 + (1-k)(20 - \bar{X_i})^2 \]
\[ \implies Var(X_i) = 25[k(1-k)^2 + (1-k)k^2] \]
Okay, that looks awful - although it simplifies to \(25k(1-k)\) - so let’s plot a graph of \(s_b\) versus \(k\) (when \(n = 1000\)):
options(repr.plot.width=6, repr.plot.height=4)

r = 0.01

eq = function(k){(1/sqrt(1000-2))*(1/(25*(k*(1-k)^2 + (1-k)*k^2))-1)}
s = seq(0.5, 1.00, by = r)
n = length(s)

plot(eq(s), type='l', xlab="Values of K", ylab="Standard Error", xaxt = "n")
axis(1, at=seq(0, n-1, by = 10), labels=seq(0.5, 1.00, by = 10*r))

# You will notice that the plot actually diverges to infinity
# Try making r smaller to show this fact!
# Notice the value at 1 increases
Why does this happen? The reason actually has to do with information.
When you estimate a regression, you are using the variation in the data to estimate each of the parameters. As the variation falls, the estimation gets less and less precise, because you are using less and less information to make an evaluation. The magnitude of this problem can be quantified using the VIF, or variance inflation factor, for each of the variables in question. Graphically, you can think of regression as drawing a line of best fit through data points. If the variance of \(X\) is \(0\), all of the data sits at a single value of \(X\) - and, as you may remember from high school, you need two points to draw a line - so with \(0\) variance the OLS problem becomes ill-defined.
We can calculate this directly in R by using the vif function. Let’s look at the collinearity in our model:
regression2 <- lm(wealth ~ income_before_tax + income_after_tax, data = SFS_data)

summary(regression2)

cat("Variance inflation factor of income after tax on wealth: ", vif(regression2, SFS_data$income_after_tax, SFS_data$wealth), '\n')
cat("Variance inflation factor of income before tax on wealth: ", vif(regression2, SFS_data$income_before_tax, SFS_data$wealth), '\n')
cat("Variance inflation factor of income before tax on income after tax: ", vif(regression2, SFS_data$income_before_tax, SFS_data$income_after_tax), '\n')
Notice the extremely large VIF. This would indicate that you have a problem with collinearity in your data.
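Note that the vif() call above appears to come from this module’s helper functions (it takes the model and two variables). As a cross-check, the variance inflation factor for a regressor can also be computed by hand from an auxiliary regression, since \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that variable on the other explanatory variables. A minimal sketch:

# auxiliary regression: regress one explanatory variable on the other(s)
aux_reg <- lm(income_before_tax ~ income_after_tax, data = SFS_data)

# VIF = 1 / (1 - R^2) from the auxiliary regression
1 / (1 - summary(aux_reg)$r.squared)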
Think Deeper: What happens to the VIF as \(k\) changes? Why? Can you explain?
There are no “hard” rules for what makes a VIF too large - you should think about your model holistically, and use it as a way to investigate whether you have any problems with your model evaluation and analysis.
Part 2: Heteroskedasticity
Heteroskedasticity (Het-er-o-sked-as-ti-city) is another common problem in many economic models. It refers to the situation in which the distribution of the residuals changes as the explanatory variables change. Usually, we can visualize this problem by drawing a residual plot: a fan or cone shape indicates the presence of heteroskedasticity. For example, consider this regression:
regression3 <- lm(income_before_tax ~ income_after_tax, data = SFS_data)

ggplot(data = SFS_data, aes(x = as.numeric(income_after_tax), y = as.numeric(regression3$residuals))) + geom_point() + labs(x = "After-tax income", y = "Residuals")
This obviously does not look like a distribution which is unchanging as income after tax changes. This is a good “eyeball test” for heteroskedasticity. Why does heteroskedasticity arise? For many reasons:
It can be a property of the data; it just happens that some values show more variation, due to the process which creates the data. One of the most common ways this can arise is where there are several different economic processes creating the data.
It can be because of an unobserved variable. This is similar to the above: there is a process we could, in principle, quantify in a variable, but we have left it out of the model. This can create bias in our model, but it will also show up in the standard errors in this way.
It can be because of your model specification. Models, by their very nature, can be heteroskedastic (or not); we will explore one important example later in this worksheet.
There are many other reasons, which we won’t get into here.
Whatever the reason it exists, you need to correct for it - if you don’t, your coefficients will be OK, but your standard errors will be incorrect. You can do this in a few ways. The first way is to transform your variables so that the “transformed” model (a) makes economic sense, and (b) no longer suffers from heteroskedasticity. For example, perhaps a log-log style model might work here:
SFS_data <- SFS_data %>%
    filter(income_before_tax > 0) %>% # getting rid of zero and missing values
    mutate(lnincome_before_tax = log(income_before_tax))

SFS_data <- SFS_data %>%
    filter(income_after_tax > 0) %>%
    mutate(lnincome_after_tax = log(income_after_tax))

regression4 <- lm(lnincome_before_tax ~ lnincome_after_tax, data = SFS_data)

ggplot(data = SFS_data, aes(x = lnincome_before_tax, y = regression4$residuals)) + geom_point() + labs(x = "Log of before-tax income", y = "Residuals")
Think Deeper: Do the errors of this model seem homoskedastic?
As you can see, that didn’t work out. This is pretty typical: when you transform a model by changing the variables, what you are really doing is adjusting how you think the data process should be described so that it’s no longer heteroskedastic. If you aren’t correct with this, you won’t fix the problem.
For example, in a log-log model, we are saying “there’s a multiplicative relationship”… but that probably doesn’t make sense here. This is one of the reasons why data transformations are not usually a good way to fix this problem unless you have a very clear idea of what the transformation should be.
The most robust (no pun intended) way is to simply use standard errors which are robust to heteroskedasticity. There are actually a number of different versions of these (which you don’t need to know about), but they are all called HC or heteroskedasticity-corrected standard errors. In economics, we typically adopt White’s version (called HC1 in the literature); these are often referred to in economics papers as “robust” standard errors (for short).
This is relatively easy to do in R. Basically, you run your model as normal, and then re-test the coefficients to get the correct errors using the coeftest command, specifying which kind of errors you want to use. Here is an example:
regression5 <- lm(income_before_tax ~ income_after_tax, data = SFS_data)

summary(regression5)

coeftest(regression5, vcov = vcovHC(regression5, type = "HC1"))
As you can see, the standard errors (and significance tests) give different results; in particular, the HC1 errors are almost 10 times larger than the uncorrected errors. In this particular model, it didn’t make much of a difference to the conclusions (even though it changed the \(t\)-statistics a lot), but it can sometimes change your results.
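If you want these robust standard errors to appear in a formatted regression table, one common pattern is to pass them to stargazer (which is loaded above) in place of the default standard errors. A minimal sketch:

# extract HC1 ("robust") standard errors from the corrected variance matrix
robust_se <- sqrt(diag(vcovHC(regression5, type = "HC1")))

# report the model with the robust standard errors substituted in
stargazer(regression5, se = list(robust_se), type = "text")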
Testing for Heteroskedasticity
You can also perform some formal tests for heteroskedasticity.
- White’s Test, which relies on performing a regression using the residuals
- Breusch-Pagan Test, which also relies on performing a simpler regression using the residuals
Both of them are, conceptually, very similar. Let’s try the second one (the Breusch-Pagan test) for the above regression:
regression2 <- lm(income_before_tax ~ income_after_tax, data = SFS_data)

SFS_data$resid_sq <- (regression2$residuals)^2 # get the residuals, then square them

regression3 <- lm(resid_sq ~ income_after_tax, data = SFS_data) # regress the squared residuals on X

summary(regression3)
Inspecting the results, we can see from the \(F\)-statistic that we can strongly reject the assumption of homoskedasticity. This is denoted by the 3 asterisks. This data looks like it’s heteroskedastic, because the residuals can be predicted using the explanatory variables.
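The lmtest package loaded at the top of this notebook also provides a packaged version of the Breusch-Pagan test, which you can run as a quick cross-check of the manual version above. A minimal sketch:

# Breusch-Pagan test (null hypothesis: homoskedasticity)
bptest(regression2)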
There is one very important note:
- If you fail one of these tests (i.e., the test rejects the null), it implies that your data is heteroskedastic
- If you pass one of these tests (i.e., the test fails to reject the null), it does not imply that your data is homoskedastic (i.e. not heteroskedastic)
This is because these are statistical tests, and the null hypothesis is “not heteroskedastic”. Failing to reject the null does not mean that the null hypothesis is correct - it just means that you can’t rule it out. This is one of the reasons many economists recommend that you always use robust standard errors unless you have a really compelling reason to believe otherwise.
Linear Probability Models
How can a model naturally have heteroskedastic errors? It turns out that many common, and important, models have this issue. In particular, the linear probability model has this problem. If you recall, a linear probability model is a linear regression in which the dependent variable is a dummy. For example:
\[ D_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i \]
These models are quite useful because the coefficients can be interpreted as the change in the probability of the dummy condition occurring. For example, we previously used gender (male or female) as the dependent variable in these models to investigate the wealth gap. However, this can easily cause a problem when estimated using OLS: the value of \(D_i\) must be 0 or 1, and the fitted values (which are probabilities) should be between 0 and 1.
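In fact, heteroskedasticity is built into this model: since \(D_i\) is a binary (Bernoulli) outcome with conditional probability \(p_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i}\), the variance of the error term is

\[ Var(\epsilon_i | X_{1,i}, X_{2,i}) = p_i(1 - p_i) \]

which changes with the explanatory variables - so the errors cannot be homoskedastic unless the probability never varies.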
However, nothing in the OLS model forces the fitted values to stay between 0 and 1. If you estimate a value for \(\beta_1\), then for \(X_{1,i}\) high or low enough, the fitted values will rise above 1 or fall below 0. This implies that, mechanically, you have heteroskedasticity, because high or low values of the explanatory variables will ALWAYS fit worse than intermediate values. For example, let’s look at the fitted values from this regression:
SFS_data <- SFS_data %>%
    mutate(M_F = case_when(
        gender == "Male" ~ 0,
        gender == "Female" ~ 1
    ))

SFS_data <- SFS_data[complete.cases(SFS_data$gender, SFS_data$income_before_tax), ]
SFS_data$gender <- as.numeric(SFS_data$gender)
SFS_data$income_before_tax <- as.numeric(SFS_data$income_before_tax)

regression6 <- lm(gender ~ income_before_tax, data = SFS_data)

SFS_data$fitted <- predict(regression6, SFS_data)

summary(regression6)

ggplot(data = SFS_data, aes(x = as.numeric(income_before_tax), y = fitted)) + geom_point() + labs(x = "before tax income", y = "Predicted Probability")
Notice how the fitted value drops as income gets larger. If someone has an income of over 1 million dollars, they would be predicted to have a negative probability of being female - which is impossible.
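You can check this directly by asking the model for its prediction at a (hypothetical) very high income; the exact number depends on the estimated coefficients, but the point is that nothing stops the prediction from leaving the valid range:

# predicted value for a hypothetical household with $1,000,000 in before-tax income
predict(regression6, newdata = data.frame(income_before_tax = 1000000))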
This is why you must always use robust standard errors in these models - even if a test says otherwise. Think about what is happening here: remember the example of imperfect collinearity, where \(X_i\) was 15 with probability \(k\) and 20 with probability \(1-k\), making \(X_i\) nearly collinear with \(\beta_0\), which caused large standard errors. Something similar is happening here: the probability that a household’s major earner is female, given that the household earns over a million dollars a year, is very small, because female-led households make up only a small share of all households earning over a million dollars a year.
Part 3: Exercises
This section has both written and coding exercises for you to test your knowledge about issues in regressions. The answers to the written exercises are in the last section of the notebook.
Questions
Multicollinearity may seem to be an abstract concept, so let’s explore this issue with a practical example.
Suppose that we are looking to explore the relationship between family income and the gender of the major earner. We want to know whether families with higher incomes in Canada are more likely to have male major earners. Recall that we have two measures of income: income_before_tax and income_after_tax. Both measures of income are informative: income_before_tax refers to gross annual income (before taxes) that employers pay to employees; income_after_tax refers to net income after taxes have been deducted.
Since they are both good measures of income, we decide to put them both in our regression:
\[ M_F = \beta_0 + \beta_1 I_{bi} + \beta_2 I_{ai} + \epsilon_i \]
where
- \(M_F\) denotes the dummy variable for whether the person is male or female
- \(I_{ai}\) denotes income after taxes
- \(I_{bi}\) denotes income before taxes
- What concern should we have about this regression equation? Explain your intuition.
Before we continue, let’s reduce the sample size of our data set to 200 observations. We will also revert gender into a numeric variable:
# run this!
SFS_data200 <- head(SFS_data, 200) %>%
    mutate(M_F = as.numeric(gender)) # gender (male or female) of the major earner, as a numeric variable
Run the regression between family income and the gender of the major earner described above.
Tested Objects: reg1.

reg1 <- lm(???, data = SFS_data200)

summary(reg1)

test_2()
- What do you notice about the characteristics of the estimated regression? Does anything point to your concern being valid?
Now, let’s suppose we drop 50 more observations:
# run this!
SFS_data150 <- head(SFS_data200, 150)
Run the regression model again and compare it with the previous regression.
Tested Objects: reg2.

reg2 <- lm(???)

summary(reg2)

test_4()
- What happened to the regression estimates when we dropped 50 observations? Does this point to your concern being valid?
Next, increase the sample size back to its full size and run the regression once again.
Tested Objects: reg3.

SFS_data <- SFS_data[complete.cases(SFS_data$income_after_tax), ] # do not modify this code
SFS_data$income_after_tax <- as.numeric(SFS_data$income_after_tax) # do not modify this code

reg3 <- lm(???)

summary(reg3)

test_6()
- Did this change eliminate the concern? How do you know?
Heteroskedasticity is another issue that researchers frequently deal with when they estimate regression models. Consider the following regression model:
\[ I_i = \alpha_0 + \alpha_1 E_i + \alpha_2 G_i + \epsilon_i \]
where
- \(I_i\) denotes before tax income
- \(E_i\) is level of education
- \(G_i\) is a dummy variable equal to 1 if the major earner is female
Should we be concerned about heteroskedasticity in this model? If so, what is the potential source of heteroskedasticity, and what do we suspect to be the relationship between the regressor and the error term?
If we suppose that heteroskedasticity is a problem in this regression, what consequences will this have for our regression estimates?
Run the regression below, and graph the residuals against the level of schooling.
# run the regression
reg5 <- lm(income_before_tax ~ education, data = SFS_data)

resiplot <- ggplot(reg5, aes(x = education, y = .resid)) + xlab("Education Level") + ylab("Income (Residuals)")
resiplot + geom_point() + geom_hline(yintercept = 0) + scale_x_discrete(guide = guide_axis(n.dodge=3))
- Describe the relationship between education level and the residuals in the graph above. What does the graph tell us about the presence and nature of heteroskedasticity in the regression model?
To test for heteroskedasticity formally, let’s perform the White Test. First, store the residuals from the previous regression in SFS_data.
Tested Objects: SFS_data (checks to see that residuals were added properly).
SFS_data <- mutate(SFS_data, resid = ???)

head(SFS_data$resid, 10) # displays the residuals in the dataframe

test_11()
Next, generate a variable for the squared residuals, then run the required auxiliary regression.
Tested Objects: WT (the auxiliary regression).

model <- lm(income_before_tax ~ gender^2 + gender + education^2 + education + education*gender, data = SFS_data)

resid = reg5$residuals
rsq = (resid)^2

SFS_data$rsq <- rsq

WT <- lm(rsq ~ ???, data = SFS_data) # fill me in

summary(WT)

test_12()
What does the White test suggest?
Finish filling in this table:
| Formal Issue Name | Problem | Meaning | Test | Solution |
|---|---|---|---|---|
| ??? | Incorrect standard errors, which can lead to incorrect confidence intervals, etc. | The distribution of residuals is not constant | White’s Test and Breusch-Pagan: bptest() | Add additional factors to the regression or use robust standard errors |
| Perfect Collinearity | ??? | One variable in the regression is a linear function of another variable in the regression | Collinearity test on the model: ols_vif_tol(model) | ??? |
| Imperfect Collinearity | The model will have very large standard errors; R may need to omit a variable | One variable can almost be fully predicted with a linear function of another variable in the model | ??? | Omit one of the collinear variables, try using more data, or consider transformations (e.g., logarithms) |
Solutions
1. These two variables, before-tax and after-tax income, are likely to be close to collinear, which will inflate the standard errors.
2. Coding exercise.
3. It seems like income_before_tax was dropped from the regression. The reason for that is multicollinearity.
4. Coding exercise.
5. Similar to 3.
6. Coding exercise.
7. This change does not eliminate the concern. Now we see both income_before_tax and income_after_tax in the regression output; however, only income_after_tax is significant. The model is picking up some differences between the two variables (likely differences across tax brackets), but it still seems like the collinearity is affecting the model. Having collinear terms in a regression increases standard errors.
8. The distribution of residuals will likely change with the level of education. This is because the higher the level of education, the more additional factors are required to predict a person’s income. For instance, if someone has a four-year degree, what that degree is in will have a large impact on their income, whereas there is less variance in income among high school graduates.
9. If we do not use robust standard errors, the standard errors will be understated.
10. We can see that the spread of the residuals increases as the level of education increases, as predicted in the previous question. This indicates heteroskedasticity, as the distribution of errors is clearly not constant.
11-12. Coding exercises.
13. The test suggests the model is heteroskedastic.
14. The answers are: (A) Heteroskedasticity; (B) an explanatory variable can be written as a linear combination of other explanatory variables included in the model; (C) drop one of the variables; (D) VIF test.