>>> import sys
>>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same path you set up in Module 1, Section 1.5.1: Setting Up PyStata
>>> from pystata import config
>>> config.init('se')

  ___  ____  ____  ____  ____ ®
 /__    /   ____/   /   ____/      18.0
___/   /   /___/   /   /___/       SE—Standard Edition

 Statistics and Data Science       Copyright 1985-2023 StataCorp LLC
                                   StataCorp
                                   4905 Lakeway Drive
                                   College Station, Texas 77845 USA
                                   800-STATA-PC        https://www.stata.com
                                   979-696-4600        stata@stata.com

Stata license: Unlimited-user network, expiring 19 Aug 2024
Serial number: 401809301518
  Licensed to: Irene Berezin
               UBC

Notes:
      1. Unicode is supported; see help unicode_advice.
      2. Maximum number of variables is set to 5,000 but can be increased;
         see help set_maxvar.
14.1 Dealing with Outliers
Imagine that we have constructed a dependent variable containing the earnings growth of individual workers and we see that some workers’ earnings increased by more than 400%. We might wonder whether this massive change is just a coding error made by the statisticians who produced the data set. Even without that type of error, though, we might worry that the earnings growth of a small number of observations is driving the results of our analysis. If this is the case, we will produce an inaccurate analysis based on results that are not representative of the majority of our observations.
The standard practice in these cases is to either winsorize or trim the subset of observations used in the regression. Both practices limit the influence of extreme values in the dependent variable, allowing us to produce a more accurate empirical analysis. In this section, we will look at both approaches.
Warning: You should only consider fixing outliers when there is a clear reason to address the issue. Do not apply the tools below if the abnormal values in your summary statistics make sense to you. For example, outliers might be a sign that your dependent and explanatory variables have a non-linear relationship. If that is the case, you will want to consider including an interaction term that addresses that non-linearity.
14.1.1 Winsorizing a dependent variable
Winsorizing is the process of limiting extreme values in the dependent variable to reduce the effect of (possibly erroneous) outliers. It consists of replacing values below the \(a\) percentile by that percentile value, and values above the \(b\) percentile by that percentile. Consider the following example using our fake data set:
%%stata
clear all
use fake_data, clear
.
. clear all
. use fake_data, clear
.
Let’s have a look at the distribution of earnings in the dataset. Specifically, focus on the earnings at four points of the distribution: the minimum, the maximum, the 1st percentile, and the 99th percentile. We can display them using locals, as seen in Module 4.
%%stata
su earnings, d
local ratio_lb = round(r(p1)/r(min))
local ratio_ub = round(r(max)/r(p99))
display "The earnings of the individual in the 1st percentile are `r(p1)'"
display "The lowest earner in the dataset earned `r(min)'"
display "The earnings of the individual in the 99th percentile are `r(p99)'"
display "The highest earner in the dataset earned `r(max)'"
display "The individual in the 1st pctile earned `ratio_lb' times as much as the lowest earner!"
display "The highest earner earned `ratio_ub' times as much as the individual in the 99th pctile!"
.
. su earnings, d
Earnings
-------------------------------------------------------------
Percentiles Smallest
1% 2830.869 36.19316
5% 6639.007 46.1867
10% 10220.93 51.08368 Obs 138,138
25% 20562.63 57.63494 Sum of wgt. 138,138
50% 43783.01 Mean 84136.44
Largest Std. dev. 252801.7
75% 92378.23 1.10e+07
90% 183237.5 1.38e+07 Variance 6.39e+10
95% 277388.4 3.80e+07 Skewness 143.227
99% 607200.1 6.36e+07 Kurtosis 32584.05
. local ratio_lb = round(r(p1)/r(min))
. local ratio_ub = round(r(max)/r(p99))
. display "The earnings of the individual in the 1st percentile are `r(p1)'"
The earnings of the individual in the 1st percentile are 2830.868896484375
. display "The lowest earner in the dataset earned `r(min)'"
The lowest earner in the dataset earned 36.19315719604492
. display "The earnings of the individual in the 99th percentile are `r(p99)' "
The earnings of the individual in the 99th percentile are 607200.125
. display "The highest earner in the dataset earned `r(max)'"
The highest earner in the dataset earned 63573580
. display "The individual in the 1st pctile earned `ratio_lb' times as much as
> the lowest earner!"
The individual in the 1st pctile earned 78 times as much as the lowest earner!
. display "The highest earner earned `ratio_ub' times as much as the individual
> in the 99th pctile!"
The highest earner earned 105 times as much as the individual in the 99th pctil
> e!
.
This table suggests to us that there are large outliers in our dependent variable.
We want to get rid of these outliers by winsorizing our data set. What that means is replacing the earnings of all observations below the 1st percentile by exactly the earnings of the individual at the 1st percentile, and replacing the earnings of all observations above the 99th percentile by exactly the earnings of the individual at the 99th percentile.
Recall that we can see how Stata stored the information in the previously run summarize command by using the command return list.
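For example, a quick way to inspect those stored results (re-running summarize first so that they are fresh) is:

%%stata
su earnings, d
return list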
To winsorize this data, we follow a three-step process:
1. We create a new variable called earnings_winsor which is identical to our earnings variable (gen earnings_winsor = earnings). We store the winsorized version of the dependent variable in a new variable so that we do not overwrite the original data set.
2. If earnings are smaller than the 1st percentile, we replace the values of earnings_winsor with the earnings of the individual at the 1st percentile (stored by Stata in r(p1)). Note that we need to ensure that Stata does not include missing values.
3. If earnings are larger than the 99th percentile, we replace the values of earnings_winsor with the earnings of the individual at the 99th percentile (stored by Stata in r(p99)). Note that we need to ensure that Stata does not include missing values.
You can run these commands yourself below:
%%stata
gen earnings_winsor = earnings
replace earnings_winsor = r(p1) if earnings_winsor<r(p1) & earnings_winsor!=.
replace earnings_winsor = r(p99) if earnings_winsor>r(p99) & earnings_winsor!=.
.
. gen earnings_winsor = earnings
. replace earnings_winsor = r(p1) if earnings_winsor<r(p1) & earnings_winsor!=.
(1,381 real changes made)
. replace earnings_winsor = r(p99) if earnings_winsor>r(p99) & earnings_winsor!
> =.
(1,381 real changes made)
.
Let’s take a look at the summary statistics of the original earnings variable and the new variable that we have created:
%%stata
su earnings earnings_winsor
.
. su earnings earnings_winsor
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
earnings | 138,138 84136.44 252801.7 36.19316 6.36e+07
earnings_w~r | 138,138 78637.28 101010.2 2830.869 607200.1
.
Now we will use this new dependent variable in our regression analysis. If the outliers were not driving our results, there will be little to no change when we re-run the regression. If they were, the new results will no longer be driven by those extreme values.
Let’s take a look at this by first running the regression from Module 11 with the original earnings variable.
%%stata
capture drop logearnings
gen logearnings = log(earnings)
regress logearnings age
.
. capture drop logearnings
. gen logearnings = log(earnings)
. regress logearnings age
Source | SS df MS Number of obs = 138,138
-------------+---------------------------------- F(1, 138136) = 682.89
Model | 884.189446 1 884.189446 Prob > F = 0.0000
Residual | 178856.105 138,136 1.29478272 R-squared = 0.0049
-------------+---------------------------------- Adj R-squared = 0.0049
Total | 179740.295 138,137 1.30117416 Root MSE = 1.1379
------------------------------------------------------------------------------
logearnings | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .0107776 .0004124 26.13 0.000 .0099692 .0115859
_cons | 10.19145 .0188753 539.94 0.000 10.15445 10.22844
------------------------------------------------------------------------------
.
Now we will run this again, using the new winsorized earnings variable.
%%stata
capture drop logearnings_winsor
gen logearnings_winsor = log(earnings_winsor)
regress logearnings_winsor age
.
. capture drop logearnings_winsor
. gen logearnings_winsor = log(earnings_winsor)
. regress logearnings_winsor age
Source | SS df MS Number of obs = 138,138
-------------+---------------------------------- F(1, 138136) = 670.64
Model | 825.649342 1 825.649342 Prob > F = 0.0000
Residual | 170063.031 138,136 1.23112752 R-squared = 0.0048
-------------+---------------------------------- Adj R-squared = 0.0048
Total | 170888.681 138,137 1.23709564 Root MSE = 1.1096
------------------------------------------------------------------------------
logearning~r | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .0104147 .0004022 25.90 0.000 .0096265 .0112029
_cons | 10.20861 .0184055 554.65 0.000 10.17253 10.24468
------------------------------------------------------------------------------
.
Do you think that in this case the outliers were having a significant impact before being winsorized?
14.1.2 Trimming a dependent variable
Trimming consists of replacing both values below the \(a\) percentile and values above the \(b\) percentile with a missing value. This effectively removes those observations from the analysis, since Stata automatically excludes observations with missing values when running regress.
Here are the commands for trimming a variable. Notice that the steps are quite similar to when we winsorized the same variable.
%%stata
su earnings, d
capture drop earnings_trim
gen earnings_trim = earnings
replace earnings_trim = . if earnings_trim < r(p1) & earnings_trim!=.
replace earnings_trim = . if earnings_trim > r(p99) & earnings_trim!=.
.
. su earnings, d
Earnings
-------------------------------------------------------------
Percentiles Smallest
1% 2830.869 36.19316
5% 6639.007 46.1867
10% 10220.93 51.08368 Obs 138,138
25% 20562.63 57.63494 Sum of wgt. 138,138
50% 43783.01 Mean 84136.44
Largest Std. dev. 252801.7
75% 92378.23 1.10e+07
90% 183237.5 1.38e+07 Variance 6.39e+10
95% 277388.4 3.80e+07 Skewness 143.227
99% 607200.1 6.36e+07 Kurtosis 32584.05
.
. capture drop earnings_trim
. gen earnings_trim = earnings
. replace earnings_trim = . if earnings_trim < r(p1) & earnings_trim!=.
(1,381 real changes made, 1,381 to missing)
. replace earnings_trim = . if earnings_trim > r(p99) & earnings_trim!=.
(1,381 real changes made, 1,381 to missing)
.
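As an aside, the community-contributed command winsor2 (installable from SSC) can winsorize or trim a variable in one line. A sketch, assuming you have installed it with ssc install winsor2:

%%stata
* ssc install winsor2    // run once to install from SSC
winsor2 earnings, cuts(1 99) suffix(_w)          // winsorized copy in earnings_w
winsor2 earnings, cuts(1 99) trim suffix(_tr)    // trimmed copy in earnings_tr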
And here is the result of the regression with the new dependent variable.
%%stata
capture drop logearnings_trim
gen logearnings_trim = log(earnings_trim)
regress logearnings_trim age
.
. capture drop logearnings_trim
. gen logearnings_trim = log(earnings_trim)
(2,762 missing values generated)
. regress logearnings_trim age
Source | SS df MS Number of obs = 135,376
-------------+---------------------------------- F(1, 135374) = 531.45
Model | 590.410164 1 590.410164 Prob > F = 0.0000
Residual | 150393.096 135,374 1.1109452 R-squared = 0.0039
-------------+---------------------------------- Adj R-squared = 0.0039
Total | 150983.506 135,375 1.11529829 Root MSE = 1.054
------------------------------------------------------------------------------
logearning~m | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .0089048 .0003863 23.05 0.000 .0081477 .0096618
_cons | 10.27775 .0176777 581.40 0.000 10.2431 10.31239
------------------------------------------------------------------------------
.
14.2 Multicollinearity
If two variables are linear combinations of one another, they are multicollinear. Stata will not allow you to include in a regression two variables that are perfect linear combinations of one another, such as a constant, a dummy variable for male, and a dummy variable for female (since female = 1 - male). If you try this yourself, you will see that one of those variables is dropped from the regression “because of collinearity”.
%%stata
cap drop male
gen male = sex == "M"
cap drop female
gen female = sex == "F"
.
. cap drop male
. gen male = sex == "M"
.
. cap drop female
. gen female = sex == "F"
.
%%stata
reg logearnings male female
.
. reg logearnings male female
note: female omitted because of collinearity.
Source | SS df MS Number of obs = 138,138
-------------+---------------------------------- F(1, 138136) = 5952.64
Model | 7425.4984 1 7425.4984 Prob > F = 0.0000
Residual | 172314.796 138,136 1.2474286 R-squared = 0.0413
-------------+---------------------------------- Adj R-squared = 0.0413
Total | 179740.295 138,137 1.30117416 Root MSE = 1.1169
------------------------------------------------------------------------------
logearnings | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | .5588429 .0072433 77.15 0.000 .5446463 .5730396
female | 0 (omitted)
_cons | 10.24279 .0063933 1602.12 0.000 10.23026 10.25532
------------------------------------------------------------------------------
.
Is this a problem? Not really. Multicollinearity is a sign that a variable is not adding any new information. Notice that with the constant term and a male dummy we can recover the mean earnings of females: the constant term is, by construction, the mean log earnings of females, and the male dummy gives the earnings premium paid to male workers.
While there are some statistical tests for multicollinearity, nothing beats having the right intuition when running a regression. If there is an obvious case where two variables contain basically the same information, you should avoid including both in the analysis.
For instance, we might have an age variable that includes both years and months (e.g. if a baby is 1 year and 1 month old, then this age variable would be coded as 1 + 1/12 = 1.083). If we included this variable in a regression alongside an age variable measured only in years (e.g. the baby’s age would be coded as 1), then we would have a multicollinearity problem. Because the two variables are not perfectly collinear, Stata will still produce results; however, the standard errors on these two coefficients will be inflated and the individual estimates will be unstable and hard to interpret.
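To see this in action, here is a hypothetical sketch: we build a second, nearly collinear age measure (age plus a random month component; the variable age_ym and the random draw are ours for illustration, not part of the fake data set) and then inspect the variance inflation factors with estat vif:

%%stata
* Hypothetical illustration: a second age measure in years + months
set seed 123
cap drop age_ym
gen age_ym = age + floor(runiform()*12)/12
reg logearnings age age_ym
estat vif   // very large VIF values signal multicollinearity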
14.3 Heteroskedasticity
When we run a linear regression, we essentially split the outcome into a (linear) part explained by observables and an error term: \[
y_i = a + b x_i + e_i
\]
The standard errors of our coefficients depend on \(e_i^2\) (as you might remember from ECON 326). Heteroskedasticity refers to the case where the variance of this projection error depends on the observables \(x_i\). For instance, the variance of wages tends to be higher for people who are university educated (some of these people have very high wages), whereas it is smaller for people who are not university educated (these people tend to be concentrated in lower-paying jobs). By default, Stata assumes that the variance does not depend on the observables, which is known as homoskedasticity. It is safe to say that this is an incredibly restrictive assumption.
While there are tests for heteroskedasticity, most empirical economists simply include the option robust at the end of the regress command to obtain heteroskedasticity-robust standard errors for the OLS regression.
%%stata
cap drop logearnings
gen logearnings = log(earnings)
regress logearnings age, robust
.
. cap drop logearnings
. gen logearnings = log(earnings)
. regress logearnings age, robust
Linear regression Number of obs = 138,138
F(1, 138136) = 666.21
Prob > F = 0.0000
R-squared = 0.0049
Root MSE = 1.1379
------------------------------------------------------------------------------
| Robust
logearnings | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .0107776 .0004176 25.81 0.000 .0099592 .011596
_cons | 10.19145 .0190919 533.81 0.000 10.15403 10.22887
------------------------------------------------------------------------------
.
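If you do want a formal test, one option is the Breusch-Pagan test, available through estat hettest after a (non-robust) regress; a minimal sketch:

%%stata
quietly regress logearnings age
estat hettest   // rejects the null of constant variance when heteroskedasticity is present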
Best practices are simply to always use robust standard errors in your ECON 490 project.
14.4 Non-linearity
Our regression analysis so far assumes that the relationship between our dependent and explanatory variables is linear. If this is not the case, and the relationship is non-linear, then our analysis will produce inaccurate results.
Let’s consider an example. We know that earnings increase with age, but what if economic theory predicts that earnings increase more with each year of age when workers are younger than when they are older? What we are asking here is whether earnings increase with age at a decreasing rate. In essence, we want to check whether there is a concave relationship between age and earnings. We can think of several mechanisms for why this relationship might exist: a young worker gains higher wages as they age through increased experience on the job, while for an older worker those wage increases become smaller because the productivity gains from each additional year of work shrink. In fact, if the productivity of workers decreases as they age, perhaps for reasons related to health, then it is possible to find a negative relationship between age and earnings beyond a certain age; the relationship is an inverted U-shape.
We could check if this is the case in our model by including a new interaction term that is simply age interacted with itself. You learned how to do this in Module 13. Let’s include this in the regression above, remembering that age is a continuous variable.
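A minimal version of that regression, creating the squared term by hand (the same code, with the estimates stored in locals, appears again later in this section):

%%stata
cap drop agesq
gen agesq = age*age
regress logearnings age agesq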
There does seem to be some evidence in our regression results that this economic theory is correct, since the coefficient on the interaction term is both negative and statistically significant.
How do we interpret these results? Let’s think about the equation we have just estimated: \[
Earnings_i = \beta_0 + \beta_1 Age_i + \beta_2 Age^2_i + \varepsilon_i
\] This means that earnings of an individual change in the following way with their age: \[
\frac{\partial Earnings_i}{\partial Age_i} = \beta_1 + 2 \beta_2 Age_i
\] Due to the quadratic term, as age changes, the relationship between age and earnings changes as well. We have just estimated \(\beta_1\) to be positive and equal to 0.079, and \(\beta_2\) to be negative and equal to -0.001. This means that as age increases, its correlation with earnings decreases: \[
\frac{\partial Earnings_i}{\partial Age_i} = 0.079 - 0.002 Age_i
\]
Since the marginal effect changes with the size of \(Age_i\), providing one unique number for the marginal effect becomes difficult. The most frequently reported version of this effect is the “marginal effect at the means”: the marginal effect of age on earnings when age takes its average value. In our case, this will be equal to 0.079 minus 0.002 times the average value of age.
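Plugging in the rounded estimates and the average age of 45 (both shown in the output below), this works out to: \[
0.079 + 2 \times (-0.001) \times 45 \approx -0.011
\]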
To do this in practice, we store the estimated coefficients and the average age in three locals: local agemean stores the average age, while locals beta1 and beta2 store the estimated coefficients. You learned how to do this in Module 4. Notice that Stata automatically stores the estimated coefficients after a regression, and we can retrieve them with the syntax _b[regressor name]. To retrieve the estimated coefficient \(\beta_2\), we manually create the variable \(Age^2_i\) and call it agesq.
%%stata
sum age
local agemean : display %2.0fc r(mean)
cap drop agesq
gen agesq = age*age
reg logearnings age agesq
local beta1 : display %5.3fc _b[age]
local beta2 : display %5.3fc _b[agesq]
local marg_effect = `beta1' + (2 * `beta2' * `agemean')
display "beta1 is `beta1', beta2 is `beta2', and average age is `agemean'."
display "Therefore, the marginal effect at the means is `beta1' + 2*(`beta2')*`agemean', which is equal to `marg_effect'."
.
. sum age
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
age | 138,138 45.16033 7.423285 23 63
. local agemean : display %2.0fc r(mean)
. cap drop agesq
. gen agesq = age*age
. reg logearnings age agesq
Source | SS df MS Number of obs = 138,138
-------------+---------------------------------- F(2, 138135) = 470.65
Model | 1216.53501 2 608.267504 Prob > F = 0.0000
Residual | 178523.76 138,135 1.29238614 R-squared = 0.0068
-------------+---------------------------------- Adj R-squared = 0.0068
Total | 179740.295 138,137 1.30117416 Root MSE = 1.1368
------------------------------------------------------------------------------
logearnings | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .079004 .0042745 18.48 0.000 .0706262 .0873819
agesq | -.0007587 .0000473 -16.04 0.000 -.0008514 -.000666
_cons | 8.699491 .0949292 91.64 0.000 8.513431 8.88555
------------------------------------------------------------------------------
. local beta1 : display %5.3fc _b[age]
. local beta2 : display %5.3fc _b[agesq]
. local marg_effect = `beta1' + (2 * `beta2' * `agemean')
. display "beta1 is `beta1', beta2 is `beta2', and average age is `agemean'."
beta1 is 0.079, beta2 is -0.001, and average age is 45.
. display "Therefore, the marginal effect at the means is `beta1' + 2*(`beta2')
> *`agemean', which is equal to `marg_effect'."
Therefore, the marginal effect at the means is 0.079 + 2*(-0.001)*45, which is
> equal to -.011.
.
We obtain that the marginal effect at the means is -0.011. What does that mean? Since our dependent variable is log earnings, it means that, for the average person, becoming one year older is associated with a decrease in earnings of roughly 1.1%.
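As an aside, Stata can compute this marginal effect directly with the margins command, provided the regression is specified with factor-variable notation for the quadratic term; a minimal sketch:

%%stata
quietly reg logearnings c.age##c.age
margins, dydx(age) atmeans   // marginal effect of age at the mean of age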
Notice that this is the effect for the average person. Is the same true for young workers and older workers? To learn how to interpret this non-linearity in age, let’s see how predicted earnings correlate with age. We can obtain the predicted earnings with the predict command and then use a scatterplot to eyeball their relationship with age. We covered how to create scatterplots in Module 9.
Note: Stata graphs will not appear in Jupyter Notebooks. To make the most out of this part of the module, it is recommended that you run this code in Stata installed locally on your computer.
%%stata
* Run the regression with the quadratic term
reg logearnings c.age##c.age
* Predict earnings and save them as yhat
predict yhat, xb
* Plot the scatterplot
twoway scatter yhat age
.
. * Run the regression with the quadratic term
. reg logearnings c.age##c.age
Source | SS df MS Number of obs = 138,138
-------------+---------------------------------- F(2, 138135) = 470.65
Model | 1216.53501 2 608.267504 Prob > F = 0.0000
Residual | 178523.76 138,135 1.29238614 R-squared = 0.0068
-------------+---------------------------------- Adj R-squared = 0.0068
Total | 179740.295 138,137 1.30117416 Root MSE = 1.1368
------------------------------------------------------------------------------
logearnings | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .079004 .0042745 18.48 0.000 .0706262 .0873819
|
c.age#c.age | -.0007587 .0000473 -16.04 0.000 -.0008514 -.000666
|
_cons | 8.699491 .0949292 91.64 0.000 8.513431 8.88555
------------------------------------------------------------------------------
.
. * Predict earnings and save them as yhat
. predict yhat, xb
.
. * Plot the scatterplot
. twoway scatter yhat age
.
You should obtain a scatterplot showing an inverted-U relationship between age and predicted log earnings. This relationship implies that, when a worker is very young, becoming older is positively correlated with earnings. However, after a certain age, this correlation becomes negative, and each additional year of age is associated with lower earnings. In fact, based on this graph, workers’ earnings start to decline just after the age of 50. Had we modelled this as a linear relationship, we would have missed this important piece of information!
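You can also recover the exact turning point of the inverted U from the estimated coefficients, since the derivative \(\beta_1 + 2\beta_2 Age_i\) equals zero at \(Age_i = -\beta_1/(2\beta_2)\); a quick sketch:

%%stata
quietly reg logearnings c.age##c.age
display "Predicted log earnings peak at age " -_b[age]/(2*_b[c.age#c.age])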
Note: If there is a theoretical reason to believe that non-linearity exists, Stata provides some tests for non-linearity. You can also create a scatterplot to see whether you can observe a non-linear relationship in the data. We covered that approach in Module 9.
14.5 Wrap Up
It is important to always follow best practices for regression analysis. Nonetheless, checking and correcting for outliers, as well as addressing heteroskedasticity, multicollinearity and non-linearity can be more of an art than a science. If you need any guidance on whether or not you need to address these issues, please be certain to speak with your instructor or TA.
14.6 Video tutorial
Click on the image below for a video tutorial on this module.