Projects - Example Project for ECON 326

econ 326

regression

projects

visualization

data wrangling

data cleaning

multiple regression

homoskedasticity

multicollinearity

Breusch-Pagan test

heteroskedasticity

vif

dummy variable

interaction terms

Let’s put it all together! This notebook is an example of what a “final project” might look like in ECON 326. It summarizes and uses all many of the empirical skills and R commands present in the other notebooks.

Author

COMET Team
Sarthak Kwatra, Charlotte White

Published

1 July 2023

Prerequisites

Introduction to Data in R
Introduction to Data Visualization - I and II
Simple Regression
Multiple Regression
Issues in Regression using R
Interactions and Non-Linear Terms in Regressions

Introduction:

Now that you are well armored with a statistical toolkit and experience with R, you are well on your way to embark on your own economic research adventure! This project serves as a sample to give you some intuition into the broad steps to a successful research project. It synthesizes the knowledge you have gained in your study of the ECON 325 and ECON 326 modules, and allows you to apply it to your own research project. It explains the steps involved in cleaning your data and preparing it for analysis, the actual analysis itself, and the careful interpretation and visualization of that analysis.

It is important to note that while the more minute tasks in each of these big steps may vary according to the needs of the project, these steps remain mostly the same. Let’s get started by importing all of the packages that we will use through out this module!

# If any of the packages happened to not be installed for you, use the command install.packages() with the name of the packages, like 'stargazer'

library(ggplot2) 
library(haven)
library(stargazer)
library(tidyverse)
library(car)
library(vtable)
library(sandwich)
library(corrplot)
library(lmtest)
source("Projects_Example_Project_ECON326_tests.r")

Note:

Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

Development and the Planet

Following a 2019 scientific report which revealed an alarming rate of climate change in the country, the [Government of Canada declared a national climate emergency.] (https://globalnews.ca/news/5401586/canada-national-climate-emergency/) Canada has been far from the only one to take notice and when it comes to the environment; threat of climate change continues to take priority at an international level. All over the world, people are seeking to better understand the causes and impacts of climate change, and looking for ways to mitigate and adapt to how these changes will affect our lives. In particular, the greenhouse gas carbon dioxide [\(CO_2\)] gets a lot of attention. While there are many other gasses that contribute to to the atmospheric greenhouse effect, [\(CO_2\)] is one of the most immediate concerns because of its role in industrialization and energy use.

Gross domestic product (GDP), is a measure that you’re likely very familiar with at this point. As a measure production, GDP is often used to infer the health of an economy or to some degree, the prosperity of the people operating within it. In general, a rising GDP is a desirable outcome. However, we might wonder whether all other outcomes associated with a higher GDP are desirable. In this project, we will be examining the connection between the production of [\(CO_2\)] and GDP in Canada.

****🔎 Let’s think critically**** > 🟠 GDP is commonly considered to not be a zero-sum measure, meaning that a rising GDP in one country does not mean another country’s GDP has to fall. What are the limitations on GDP growth, then? > 🟠 What are the implications of assuming that there can infinite GDP growth when it’s connected to finite measures such as the amount of [\(CO_2\)] that can be sustainably produced and recaptured? Is this just a reflection of what our current energy sources and technology allow, or is there more to the story in how we think about economic growth in general?

Part 1: Preparing our Data

For the sake of our analysis today, we hope to observe whether factors like Electricity Generation, GDP, and Population, have had any impact on CO2 Emissions across all the Canadian Provinces.

Importing Data into R

Once you have gathered data, R has great dependability and dexterity in the viewing and manipulation of that data. To do this, you will want to import your datasets into R, like you have observed in multiple other modules so far. The data that you have gathered could be in a host of different formats like,

.csv (Comma-Separated Values file),
.dta (STATA data file),
.xlsx (Excel file),
.sav (SPSS file) or,
.sas (SAS file)

All of these files correspond to different softwares, like Microsoft Excel, STATA, or SPSS, but can nonetheless be conveniently imported onto R. Fortunately, we will not be needing separate packages to import these files; haven is our jack-of-all-trades. We used the command library(haven) to load it at beginning of this module. In this case, since all of our data is in the .csv format, we use the function read_csv. The corresponding functions for the other formats are, read_dta, read_spss, and so on.

# Loading the Data into R

gdp_data <- read_csv("../datasets_projects/gdp_data.csv")
pollution_data <- read_csv("../datasets_projects/pollution_data.csv")
elec_data <- read_csv("../datasets_projects/elec_data.csv")
pop_data <- read_csv("../datasets_projects/pop_data.csv")

NOTE: By default, some functions in the Haven package, like read_csv(), assume that the CSV file has a header row with variable names. If your file does not have a header, or you would like different headers for your columns, you can use the argument col_names to adjust the column names manually.

Viewing the Data

Once you have imported your datasets in R, it is worthwhile to get an overview of the data. There are two main reasons for this:

Not every dataset will come formatted in a way that is suitable for your analysis, and therefore it is important to understand the structure of your dataset and its variables
An overview allows you to recognize any potential obvious issues that the data may have, like missing values, duplicates, or unnecessary variables, that would pose issues in your analysis at a later stage

Commands that can be used to view and understand the structure of your data include: head(), str(), summary(), and view(). These four functions can be used roughly interchangeably understand the structure of your data

# Make sure to run these commands individually!

head(gdp_data)
summary(pollution_data)
str(elec_data)
view(pop_data)

An overview of our data reveals a few interesting things. All of data has been collected for the years 2009 - 2020. However, while the GDP and CO2 Emissions data is Annual, the Electricity Generation Data is Monthly, and the Population Data is Quarterly. It is also interesting to note that the some of the values for Electricity Generation are missing for some years for the Provinces of Newfoundland and Labrador, and Prince Edward Island.

Cleaning the Data

Having recognized these potential issues, getting rid of them is important, and it deems the name “Cleaning the Data” to this section of the project. An important rough structure to keep in mind while cleaning your data is called “Tidy Data”, introduced by the statistician Hadley Wickham, where,

Each Variable has its own Column
Each Observation has its own Row, and,
Each Value has its own Cell

To begin with, we try to keep the column names of our variables such that they are short and easy to manipulate, so let’s change some of the column names in our datasets.

# Changing the Names across our Datasets

pollution_data <- pollution_data %>% rename(c(year = REF_DATE, province = GEO, sector = Sector, CO2 = VALUE))

gdp_data <- gdp_data %>% rename(c(year = REF_DATE, province = GEO, NAICS = `North American Industry Classification System (NAICS)`, GDP = VALUE))

elec_data <- elec_data %>% rename(c(year = REF_DATE, province = GEO, type = `Type of electricity generation`, elec = VALUE))

pop_data <- pop_data %>% rename(c(year = REF_DATE, province = GEO, pop = VALUE))

Next, note that across our Pollution and Population datasets, there are aggregations to a Canada-wide level, while our analysis is limited to the Provinces. Therefore, an inclusion of the Canada-wide aggregations will lead to a bias in our results. Let’s get rid of that by filtering them out.

# Filtering to keep every observation for which the GEO isn't equal to Canada

pop_data <- pop_data %>% filter(province != 'Canada')
pollution_data <- pollution_data %>% filter(province != 'Canada')

As noted before, there were some missing values in the Electricity Generation dataset. Although there are multiple ways of dealing with missing data, like using averages, or using advanced imputation techniques like multiple imputation, we choose to deal with missing values here by omitting them from our data.

elec_data <- elec_data %>% filter(elec != is.na(elec))

Similar aggregations also exist in the Pollution dataset for “Total, industries and households”, “Total, industries”, and “Total, households”. They also exist in the Electricity Generation dataset as “Total all types of electricity generation”. Let’s filter them, only this time, we will keep the aggregates across the categories of electricity generation and pollution, and get rid of the sub-categories.

pollution_data <- pollution_data %>% filter(sector == "Total, industries and households")
elec_data <- elec_data %>% filter(type == 'Total all types of electricity generation')

Next, as we previously noted, while the GDP and CO2 Emissions data is Annual, the Electricity Generation Data is Monthly, and the Population Data is Quarterly. Therefore, let’s group them both to Yearly levels. Before we do that, note that “REF_DATE” contains the variables Month and Year. Therefore, satisfying our principles Tidy Data, let’s use the Substring function to break it down into Month and Year.

elec_data <- elec_data %>%
  mutate(year = substr(year, 1, 4), month = substr(year, 6, 7))

pop_data <- pop_data %>%
  mutate(year = substr(year, 1, 4), month = substr(year, 6, 7))

Now, let’s work on making both the Electricity and Population datasets annual.

elec_data_grouped <- elec_data %>%
  group_by(year, province) %>%
  summarise(electricity = round(mean(elec)))

pop_data_grouped <- pop_data %>%
  group_by(year, province) %>%
  summarise(population = round(mean(pop)))

Our next step will be to merge our datasets, so that we can smoothly run the analysis from one clean reference.

# Making the Data types compatible for joining
pop_data_grouped <- pop_data_grouped %>% mutate(year = as.double(year))
elec_data_grouped <- elec_data_grouped %>% mutate(year = as.double(year))

# Merging the four datasets into two
merged_data_1 <- left_join(gdp_data, pop_data_grouped, by = c('year', 'province'))
merged_data_2 <- left_join(pollution_data, elec_data_grouped, by = c('year', 'province'))

# Performing the Final Merge
merged_data <- left_join(merged_data_1, merged_data_2, by = c('year', 'province'))

str(merged_data)

#renaming some categories
merged_data <- merged_data  %>%
  rename(gdp = GDP)

# Now we need some factors. `Province` should be a factor variable. 

merged_data  <- merged_data  %>%
  mutate(province = case_when(
    province == "Newfoundland and Labrador" ~ "1",
    province == "Prince Edward Island" ~ "2",
    province == "Nova Scotia" ~ "3",
    province == "New Brunswick" ~ "4",
    province == "Quebec" ~ "5",
    province == "Ontario" ~ "6",
    province == "Manitoba" ~ "7",
    province == "Saskatchewan" ~ "8",
    province == "Alberta" ~ "9",
    province == "British Columbia" ~ "10",
    province == "Yukon" ~ "11",
    province == "Northwest Territories" ~ "12",
    province == "Nunavut" ~ "13",
  )) %>%
  mutate(province = as_factor(province))

#Now for clarity, we'll rename the dataset.

CO2_data <- merged_data

Now let’s see how that’s changed our data structure.

str(CO2_data)

Great! We’re ready to start building our model.

Part 2: Building our Multiple Regression Model

Now that we have our dataset ready and cleaned, let’s start to think about building our model. What are the relationships that we’re interested in investigating? For the dataset that we’re working with, these would be : gross domestic product (GDP), electricity, and population.

Think Deeper: Why might we suspect that relationships exist between these variables? Is this consistent with economic theory? How would these relationships relate to your own experience?

Let’s begin investigating these relationships by making some visualizations.

corr_plot_data <- CO2_data[, c('CO2', 'gdp', 'population', 'electricity')]
corr_plot_data <- as.data.frame(corr_plot_data)

# Compute the correlation matrix)
cor_matrix = cor(corr_plot_data)

# Create the correlation plot
corrplot(cor_matrix, order = 'hclust', addrect = 2)

a <- ggplot(data = CO2_data, aes(x = gdp, y = CO2))+
  xlab("GDP in millions of dollars")+
  ylab("CO2 Emissions in kilotonnes") + scale_x_continuous()

a + geom_point()

b <- ggplot(data = CO2_data, aes(x = population, y = CO2))+
  xlab("Population")+
  ylab("CO2 Emissions in kilotonnes") + scale_x_continuous()

b + geom_point()

c <- ggplot(data = CO2_data, aes(x = electricity, y = CO2))+
  xlab("Electricity in megawatt hours)")+
  ylab("CO2 Emissions in kilotonnes") + scale_x_continuous()

c + geom_point()

view(gdp_data)

From these plots, we can see visually that there appears to be an approximate linear realtionship between the continuous independent variables and CO2. Now that we’ve established some of these relationships, let’s build our multiple regression model.

slr_1 <- lm(CO2 ~ gdp, data = CO2_data)
summary(slr_1)

slr_2 <- lm(CO2 ~ electricity, data = CO2_data)
summary(slr_2)

slr_3<- lm(CO2 ~ population, data = CO2_data)
summary(slr_3)

mlr_1 <- lm(CO2 ~ gdp + electricity + population, data = CO2_data)
summary(mlr_1)

As seen from the adjusted r-squared output, our multiple regression model has greater explanatory power than the any of the simple regressions alone, and all of the coefficients given are significant.

Part 3: Addressing Issues and Improving the Model

Underlying Assumptions - Homoskedasticity

Homoskedasticity, or constant variance, is an underlying assumption of OLS. Knowing that heteroskedasticity is another common issue in regression, we need to check our model to ensure that it meets this requirement. We’ll start by checking visually with a residual plot for our first variable, GDP.

slr_1 <- lm(CO2 ~ gdp, data = CO2_data)

ggplot(data = CO2_data, aes(x = as.numeric(gdp), y = as.numeric(slr_1$residuals))
        )+geom_point()+labs(x = "GDP", y = "Residuals")

From this “eyeball test”, it looks like the data display heterskedasticity. We’ll test for this formally now with a Breusch-Pagan test.

CO2_data$resid_sqgdp <-  (slr_1$residuals)^2
residualsslr_1 <- lm(resid_sqgdp ~ gdp, data = CO2_data)
summary(residualsslr_1)

We reject the null hypothesis and conclude that the GDP data are heteroskedastic. Let’s try to address this with a log transformation.

# transform the data
CO2_data <- CO2_data %>%
  mutate(ln_gdp = log(gdp))

# visualize the logged value residuals
slr_ln1 <- lm(CO2 ~ ln_gdp, data = CO2_data) 

ggplot(data = CO2_data, aes(x = as.numeric(ln_gdp), y = as.numeric(slr_ln1$residuals))) + 
  geom_point()+labs(x = "Log GDP", y = "Residuals")

# conduct another Breusch-Pagan Test
CO2_data$resid_sq_lngdp <-  (slr_ln1$residuals)^2
residualsslr_ln1 <- lm(resid_sq_lngdp ~ ln_gdp, data = CO2_data)
summary(residualsslr_ln1)

This doesn’t seem to have fixed the problem of heteroskedasticity within the gdp data. We’ll overcome this in our final model by using robust standard errors. We’ll now go through a similar process with the remaining two variables, and conclude that the electricity data are also heteroskedastic.

#For `electricity`
#Visualise the residuals
slr_2 <- lm(CO2 ~ electricity, data = CO2_data)
ggplot(data = CO2_data, aes(x = as.numeric(electricity), y = as.numeric(slr_2$residuals))
        )+geom_point()+labs(x = "Electricity", y = "Residuals")

#Fromally test the hypothesis
CO2_data$resid_sqelec <-  (slr_2$residuals)^2
residualsslr_2 <- lm(resid_sqelec ~ electricity, data = CO2_data)
summary(residualsslr_2)

#Log transform the data
CO2_data <- CO2_data %>%
  mutate(ln_electricity = log(electricity))

# visualize the logged value residuals
slr_ln2 <- lm(CO2 ~ ln_electricity, data = CO2_data)
ggplot(data = CO2_data, aes(x = as.numeric(ln_electricity), y = as.numeric(slr_ln2$residuals))
)+geom_point()+labs(x = "Log Electricity", y_ = "Residuals")

#Formally test 
CO2_data$resid_sq_lnelec <-  (slr_ln2$residuals)^2
residualsslr_ln2  <- lm(resid_sq_lnelec ~ ln_electricity, data = CO2_data)
summary(residualsslr_ln2)

# We reject the null hypothesis, and conclude heteroskedasticity in this data.

#For `population`
#Visualise the residuals
slr_3 <- lm(CO2 ~ population, data = CO2_data)
ggplot(data = CO2_data, aes(x = as.numeric(population), y = as.numeric(slr_3$residuals))
        )+geom_point()+labs(x = "Population", y = "Residuals")

#Formal hypothesis testing
CO2_data$resid_sqpop <- (slr_3$residuals)^2
residualsslr_3 <- lm(resid_sqpop ~ population, data = CO2_data)
summary(residualsslr_3)

# We cannot reject the null hypothesis, and therefore cannot conclude heteroscedasticity for the population data.

To run our regressions with heteroskedasticity-robust standard errors, we use the sandwich package, which we’ve called on earlier.

# The initial regression that we ran
mlr_1 <- lm(CO2 ~ gdp + population + electricity, data = CO2_data)

# Obtain robust standard errors
robust_se <- sqrt(diag(vcovHC(mlr_1, type = "HC1")))  # "HC1" is one of the robust variance estimators

# Print the robust standard errors
print(robust_se)

# Usiing Coeftest to print the Hetetrokedasticity-Robust Standard Errors
coeftest(mlr_1)

Multicollinearity

As we know, multicollinearity is a common issue that arises in developing a regression. To test how this shows up in our current model, we will calculate variance inflation factors (VIF), using the package car.

Think Deeper: Do you suspect any variables in the model may be affected by multicollinearity? How come?

car::vif(mlr_1)
coeftest(slr_3)
summary(slr_3)

The high VIF indicate that there’s a problem with collinearity in our data. We’ll try to combat this by creating a model which combines the effects of GDP and population into a single variable of per capita GDP.

#Create a new variable and view how it fits in the structure. 
CO2_data$per_capita_gdp <- CO2_data$gdp/CO2_data$population
str(CO2_data)

#Build a new model 
mlr_2 <- lm(CO2 ~ per_capita_gdp + electricity, data = CO2_data)
summary(mlr_2)

#Calculate VIF on the new model
car::vif(mlr_2)

This appears to have addressed out problems with multicollinearity in our model.

Adding Other Variables

There is still another variable that we haven’t considered in our model. Let’s take a look at how province affects CO2. Since our these variables are expressed qualitative factors, they need to be represented by dummy variables. R does this automatically when a dummy is used in regression, and excludes the first category (in our case Newfoundland and Labrador) as the reference variable.

slr_4 <- lm(CO2 ~ province, data = CO2_data)
summary(slr_4)

mlr_3 <- lm(CO2 ~ gdp + population + electricity + province, data = CO2_data)
summary(mlr_3)

This model has great explnatory power, but not all coefficients are significant anymore. We’ll remove electricity and see how this affects our model.

mlr_4 <- lm(CO2 ~ gdp + population + province, data = CO2_data)
summary(mlr_4)
car::vif(mlr_4)

#GVIF stands for Generalized Variance Inflation Factors, and appears here to show the combined VIF of all the province coefficients.

Unsurprisingly, this model has high multicollinearity. We expect this because we understand that things like population and gdp vary significantly with province. Recall how the data points were clustered around different centers in the original visualizations we made? If we summarize these mean values in a table, we can clearly see just how variable they are.

a + geom_point()

b + geom_point()

c + geom_point()

#create a summary table organized by province
sumtable(CO2_data, 
      vars = c("gdp", "population", "electricity"),
       summ = c('mean(x)'),
       group = 'province',
       digits = 6,
       out = 'return')

Let’s see what happens when we interact some of the terms.

#Province interacted with population 
mlr_5 <- lm(CO2 ~ gdp + electricity + population*province, data = CO2_data)
summary(mlr_5)

#Province interacted with gdp
mlr_6 <- lm(CO2 ~ population + electricity + gdp*province, data = CO2_data)
summary(mlr_6)

mlr_7 <- lm(CO2 ~ population + electricity + gdp:province, data = CO2_data)
summary(mlr_7)

vif(mlr_7)

In the final model the relationships appear to be significant. We would interpret the coefficients of the interacted terms as the combined effect of gdp and province- or how the effect of gdp on CO2 emissions changes by province. However this model is still severely affected by multicollinearity.

Now that we have a few different models, let’s compare some of our results.

stargazer(mlr_1, mlr_2, mlr_3, mlr_4, mlr_5, mlr_6, mlr_7, title="Comparison of Muliple Regression Results",
          align = TRUE, type="text", keep.stat = c("n","rsq"))

From the results of our simple regression, we know that there is a strong relationship between between province and [\(CO_2\)] emissions. But to choose the best model based on the data we have, we need to consider the models holistically, paying attention to the common issues that arise in regression, and not just the explanatory power alone.

Summary:

In conclusion, we note that CO2 Emissions in the Canadian Economy are severely impacted by the variables of our consideration: GDP, Electricity Generation, and Population. It is also important to note that we ought to explore our data by running multiple regressions, and clean it appropriately before running our analysis. This exploratory practice may lead us relationships that we expect, or something completely contrasting. Therefore, use this as a sample for your own project, and keep researching!

Exercises:

The below exercises are intended to help you check your conceptual understanding of the content.

Exercise 1:

What is an advantage of a multiple regression model over a simple regression model? Pick the best answer.

A: Multiple regression models often exhibit multicollinearity, which gives them better explanatory power compared to simple regressions.

B: Multiple regression models improve the predictive properties of a model by adding multiple regressors that play a role in the relationship.

C: Multiple regression models can display complicated relationships, but simple regression models are too simple to ever be useful.

D: By having multiple variables, multiple regressions allow us to see the interactions between different variables in a relationship.

answer_1 <- '...' # your answer here ('A', 'B', 'C', or 'D')

test_1()

Exercise 2:

Which of the below would not be a way of resolving heteroskedasticity within a regression model?

A: Performing a log transformation on the outcome variable

B: Using the appropriate code to generate HC1 standard errors

C: Searching for a different dataset to use

D: Adding more explanatory variables to the model (so long as all of the individual relationships remain significant)

answer_2 <- '...' # your answer here ('A', 'B', 'C', or 'D')

test_2()

Exercise 3:

What does a variance inflation factor of a variable tell us?

A: The magnitude of the heteroskedasticity (variance of error terms) in the data for a given variable.

B: The extent to which the variability of a dependent variable is inflated by multicollinearity in the model.

C: Whether or not there is multicollinearity in our model.

D: The extent to which the model is inflated by ommitted variable bias.

answer_3 <- '...' # your answer here ('A', 'B', 'C', or 'D')

test_3()

Exercise 4:

In what situation would it be appropriate to incorporate an interaction into your model? Pick the best answer.

A: There is reason to believe that the combined effect of two or more variables has an impact on the outcome variable.

B: There is reason to believe that the combined effect of two or more continuous variables has an impact on the outcome variable.

C: Your model is displaying multicollinearity with two variables.

D: Your model is boring, and you want to make it more interesting.

answer_4 <- '...' # your answer here ('A', 'B', 'C', or 'D')

test_4()