{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Projects - Example Project for ECON 326\n", "\n", "COMET Team
*Sarthak Kwatra, Charlotte White* \n", "2023-07-01\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Data in R\n", "- Introduction to Data Visualization - I and II\n", "- Simple Regression\n", "- Multiple Regression\n", "- Issues in Regression using R\n", "- Interactions and Non-Linear Terms in Regressions\n", "\n", "## Introduction:\n", "\n", "Now that you are well armored with a statistical toolkit and experience\n", "with R, you are well on your way to embark on your own economic research\n", "adventure! This project serves as a sample to give you some intuition\n", "into the broad steps to a successful research project. It synthesizes\n", "the knowledge you have gained in your study of the ECON 325 and ECON 326\n", "modules, and allows you to apply it to your own research project. It\n", "explains the steps involved in cleaning your data and preparing it for\n", "analysis, the actual analysis itself, and the careful interpretation and\n", "visualization of that analysis.\n", "\n", "It is important to note that while the more minute tasks in each of\n", "these big steps may vary according to the needs of the project, these\n", "steps remain mostly the same. Let’s get started by importing all of the\n", "packages that we will use through out this module!" ], "id": "f83aced9-7642-4338-bae1-60e33099c0b4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If any of the packages happened to not be installed for you, use the command install.packages() with the name of the packages, like 'stargazer'\n", "\n", "library(ggplot2) \n", "library(haven)\n", "library(stargazer)\n", "library(tidyverse)\n", "library(car)\n", "library(vtable)\n", "library(sandwich)\n", "library(corrplot)\n", "library(lmtest)\n", "source(\"Projects_Example_Project_ECON326_tests.r\")" ], "id": "1a0105b5-cebe-4b0c-9c95-eea19fc52b9a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Note:\n", "\n", "Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary\n", "Statistics Tables. R package version 5.2.3.\n", "https://CRAN.R-project.org/package=stargazer\n", "\n", "## Development and the Planet\n", "\n", "Following a 2019 scientific report which revealed an alarming rate of\n", "climate change in the country, the \\[Government of Canada declared a\n", "national climate emergency.\\]\n", "(https://globalnews.ca/news/5401586/canada-national-climate-emergency/)\n", "Canada has been far from the only one to take notice and when it comes\n", "to the environment; threat of climate change continues to take priority\n", "at an international level. All over the world, people are seeking to\n", "better understand the causes and impacts of climate change, and looking\n", "for ways to mitigate and adapt to how these changes will affect our\n", "lives. In particular, the greenhouse gas carbon dioxide \\[$CO_2$\\] gets\n", "a lot of attention. While there are many other gasses that contribute to\n", "to the atmospheric greenhouse effect, \\[$CO_2$\\] is one of the most\n", "immediate concerns because of its role in industrialization and energy\n", "use.\n", "\n", "Gross domestic product (GDP), is a measure that you’re likely very\n", "familiar with at this point. As a measure production, GDP is often used\n", "to infer the health of an economy or to some degree, the prosperity of\n", "the people operating within it. In general, a rising GDP is a desirable\n", "outcome. However, we might wonder whether all other outcomes associated\n", "with a higher GDP are desirable. In this project, we will be examining\n", "the connection between the production of \\[$CO_2$\\] and GDP in Canada.\n", "\n", " \\*\\*\\*\\*🔎 Let’s think critically\\*\\*\\*\\*\n", " \\> 🟠 GDP is commonly considered to *not* be a zero-sum measure,\n", "meaning that a rising GDP in one country does not mean another country’s\n", "GDP has to fall. What are the limitations on GDP growth, then? \\> 🟠\n", "What are the implications of assuming that there can infinite GDP growth\n", "when it’s connected to finite measures such as the amount of \\[$CO_2$\\]\n", "that can be sustainably produced and recaptured? Is this just a\n", "reflection of what our current energy sources and technology allow, or\n", "is there more to the story in how we think about economic growth in\n", "general?\n", "\n", "## Part 1: Preparing our Data\n", "\n", "For the sake of our analysis today, we hope to observe whether factors\n", "like Electricity Generation, GDP, and Population, have had any impact on\n", "CO2 Emissions across all the Canadian Provinces.\n", "\n", "### Importing Data into R\n", "\n", "Once you have gathered data, R has great dependability and dexterity in\n", "the viewing and manipulation of that data. To do this, you will want to\n", "import your datasets into R, like you have observed in multiple other\n", "modules so far. The data that you have gathered could be in a host of\n", "different formats like,\n", "\n", "- .csv (Comma-Separated Values file),\n", "- .dta (STATA data file),\n", "- .xlsx (Excel file),\n", "- .sav (SPSS file) or,\n", "- .sas (SAS file)\n", "\n", "All of these files correspond to different softwares, like Microsoft\n", "Excel, STATA, or SPSS, but can nonetheless be conveniently imported onto\n", "R. Fortunately, we will not be needing separate packages to import these\n", "files; `haven` is our jack-of-all-trades. We used the command\n", "`library(haven)` to load it at beginning of this module. In this case,\n", "since all of our data is in the .csv format, we use the function\n", "`read_csv`. The corresponding functions for the other formats are,\n", "`read_dta`, `read_spss`, and so on." ], "id": "99148cc9-f93c-4ae2-a4c8-55e214f01750" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loading the Data into R\n", "\n", "gdp_data <- read_csv(\"../datasets_projects/gdp_data.csv\")\n", "pollution_data <- read_csv(\"../datasets_projects/pollution_data.csv\")\n", "elec_data <- read_csv(\"../datasets_projects/elec_data.csv\")\n", "pop_data <- read_csv(\"../datasets_projects/pop_data.csv\")" ], "id": "fe6bca78-413a-4dee-932b-515701c91647" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> NOTE: By default, some functions in the Haven package, like\n", "> `read_csv()`, assume that the CSV file has a header row with variable\n", "> names. If your file does not have a header, or you would like\n", "> different headers for your columns, you can use the argument\n", "> `col_names` to adjust the column names manually.\n", "\n", "### Viewing the Data\n", "\n", "Once you have imported your datasets in R, it is worthwhile to get an\n", "overview of the data. There are two main reasons for this:\n", "\n", "- Not every dataset will come formatted in a way that is suitable for\n", " your analysis, and therefore it is important to understand the\n", " structure of your dataset and its variables\n", "- An overview allows you to recognize any potential obvious issues\n", " that the data may have, like missing values, duplicates, or\n", " unnecessary variables, that would pose issues in your analysis at a\n", " later stage\n", "\n", "Commands that can be used to view and understand the structure of your\n", "data include: `head()`, `str()`, `summary()`, and `view()`. These four\n", "functions can be used roughly interchangeably understand the structure\n", "of your data" ], "id": "ee768cff-7594-4ebb-9f0c-77a638f3cb39" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make sure to run these commands individually!\n", "\n", "head(gdp_data)\n", "summary(pollution_data)\n", "str(elec_data)\n", "view(pop_data)" ], "id": "20693380-518e-4293-90ec-f0bb85a32485" }, { "cell_type": "markdown", "metadata": {}, "source": [ "An overview of our data reveals a few interesting things. All of data\n", "has been collected for the years 2009 - 2020. However, while the GDP and\n", "CO2 Emissions data is Annual, the Electricity Generation Data is\n", "Monthly, and the Population Data is Quarterly. It is also interesting to\n", "note that the some of the values for Electricity Generation are missing\n", "for some years for the Provinces of Newfoundland and Labrador, and\n", "Prince Edward Island.\n", "\n", "### Cleaning the Data\n", "\n", "Having recognized these potential issues, getting rid of them is\n", "important, and it deems the name “Cleaning the Data” to this section of\n", "the project. An important rough structure to keep in mind while cleaning\n", "your data is called “Tidy Data”, introduced by the statistician Hadley\n", "Wickham, where,\n", "\n", "- Each Variable has its own Column\n", "- Each Observation has its own Row, and,\n", "- Each Value has its own Cell\n", "\n", "To begin with, we try to keep the column names of our variables such\n", "that they are short and easy to manipulate, so let’s change some of the\n", "column names in our datasets." ], "id": "3bcb7b9f-5e87-4314-b719-cb8e3897b19a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Changing the Names across our Datasets\n", "\n", "pollution_data <- pollution_data %>% rename(c(year = REF_DATE, province = GEO, sector = Sector, CO2 = VALUE))\n", "\n", "gdp_data <- gdp_data %>% rename(c(year = REF_DATE, province = GEO, NAICS = `North American Industry Classification System (NAICS)`, GDP = VALUE))\n", "\n", "elec_data <- elec_data %>% rename(c(year = REF_DATE, province = GEO, type = `Type of electricity generation`, elec = VALUE))\n", "\n", "pop_data <- pop_data %>% rename(c(year = REF_DATE, province = GEO, pop = VALUE))" ], "id": "c9642602-1239-48b7-93f1-4f6da8201d5b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, note that across our Pollution and Population datasets, there are\n", "aggregations to a Canada-wide level, while our analysis is limited to\n", "the Provinces. Therefore, an inclusion of the Canada-wide aggregations\n", "will lead to a bias in our results. Let’s get rid of that by filtering\n", "them out." ], "id": "fc1870dd-ccce-4022-9644-fbb0de52ddfa" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Filtering to keep every observation for which the GEO isn't equal to Canada\n", "\n", "pop_data <- pop_data %>% filter(province != 'Canada')\n", "pollution_data <- pollution_data %>% filter(province != 'Canada')" ], "id": "1eafabb2-92b5-4c1f-aa47-e0a5341d3d64" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted before, there were some missing values in the Electricity\n", "Generation dataset. Although there are multiple ways of dealing with\n", "missing data, like using averages, or using advanced imputation\n", "techniques like multiple imputation, we choose to deal with missing\n", "values here by omitting them from our data." ], "id": "ace2a6e6-ff9a-47ef-bfe1-ff6c49add4d5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elec_data <- elec_data %>% filter(elec != is.na(elec))" ], "id": "8666804e-f576-4e3e-9a8a-54714432f84d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar aggregations also exist in the Pollution dataset for “Total,\n", "industries and households”, “Total, industries”, and “Total,\n", "households”. They also exist in the Electricity Generation dataset as\n", "“Total all types of electricity generation”. Let’s filter them, only\n", "this time, we will keep the aggregates across the categories of\n", "electricity generation and pollution, and get rid of the sub-categories." ], "id": "90b27ee6-7bb4-4d5d-9ee7-53c2efe8034f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pollution_data <- pollution_data %>% filter(sector == \"Total, industries and households\")\n", "elec_data <- elec_data %>% filter(type == 'Total all types of electricity generation')" ], "id": "17a084a6-424a-4e9b-b206-bc12e65eb27d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, as we previously noted, while the GDP and CO2 Emissions data is\n", "Annual, the Electricity Generation Data is Monthly, and the Population\n", "Data is Quarterly. Therefore, let’s group them both to Yearly levels.\n", "Before we do that, note that “REF_DATE” contains the variables **Month**\n", "and **Year**. Therefore, satisfying our principles Tidy Data, let’s use\n", "the Substring function to break it down into Month and Year." ], "id": "fe1742e0-5c33-488b-8521-10463d92c016" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elec_data <- elec_data %>%\n", " mutate(year = substr(year, 1, 4), month = substr(year, 6, 7))\n", "\n", "pop_data <- pop_data %>%\n", " mutate(year = substr(year, 1, 4), month = substr(year, 6, 7))" ], "id": "2d3cac55-0445-476f-b813-438ec4aeffbb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let’s work on making both the Electricity and Population datasets\n", "annual." ], "id": "ccb94096-b12b-47da-ab75-d203c7e9ee5b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elec_data_grouped <- elec_data %>%\n", " group_by(year, province) %>%\n", " summarise(electricity = sum(elec))\n", "\n", "pop_data_grouped <- pop_data %>%\n", " group_by(year, province) %>%\n", " summarise(population = sum(pop))" ], "id": "6f0eb635-6a3f-43e0-af18-a5612fd258b3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our next step will be to merge our datasets, so that we can smoothly run\n", "the analysis from one clean reference." ], "id": "5562ff48-a4c3-4c09-a3e0-48746431b4ed" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Making the Data types compatible for joining\n", "pop_data_grouped <- pop_data_grouped %>% mutate(year = as.double(year))\n", "elec_data_grouped <- elec_data_grouped %>% mutate(year = as.double(year))\n", "\n", "# Merging the four datasets into two\n", "merged_data_1 <- left_join(gdp_data, pop_data_grouped, by = c('year', 'province'))\n", "merged_data_2 <- left_join(pollution_data, elec_data_grouped, by = c('year', 'province'))\n", "\n", "# Performing the Final Merge\n", "merged_data <- left_join(merged_data_1, merged_data_2, by = c('year', 'province'))" ], "id": "f22ff040-0466-49a7-9b75-19f3abe8106d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str(merged_data)" ], "id": "550f6f21-a180-4ed0-a62e-6cf1cf86dc64" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#renaming some categories\n", "merged_data <- merged_data %>%\n", " rename(gdp = GDP)\n", "\n", "# Now we need some factors. `Province` should be a factor variable. \n", "\n", "merged_data <- merged_data %>%\n", " mutate(province = case_when(\n", " province == \"Newfoundland and Labrador\" ~ \"1\",\n", " province == \"Prince Edward Island\" ~ \"2\",\n", " province == \"Nova Scotia\" ~ \"3\",\n", " province == \"New Brunswick\" ~ \"4\",\n", " province == \"Quebec\" ~ \"5\",\n", " province == \"Ontario\" ~ \"6\",\n", " province == \"Manitoba\" ~ \"7\",\n", " province == \"Saskatchewan\" ~ \"8\",\n", " province == \"Alberta\" ~ \"9\",\n", " province == \"British Columbia\" ~ \"10\",\n", " province == \"Yukon\" ~ \"11\",\n", " province == \"Northwest Territories\" ~ \"12\",\n", " province == \"Nunavut\" ~ \"13\",\n", " )) %>%\n", " mutate(province = as_factor(province))\n", "\n", "#Now for clarity, we'll rename the dataset.\n", "\n", "CO2_data <- merged_data" ], "id": "c03ed642-dfaf-4bfd-820b-58778d7616d5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let’s see how that’s changed our data structure." ], "id": "b989c63e-c862-495f-9afa-eff64cde60d3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str(CO2_data)" ], "id": "c31a6f34-1b92-4f09-9ce9-84ef07371312" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! We’re ready to start building our model.\n", "\n", "## Part 2: Building our Multiple Regression Model\n", "\n", "Now that we have our dataset ready and cleaned, let’s start to think\n", "about building our model. What are the relationships that we’re\n", "interested in investigating? For the dataset that we’re working with,\n", "these would be : gross domestic product (GDP), electricity, and\n", "population.\n", "\n", "> *Think Deeper*: Why might we suspect that relationships exist between\n", "> these variables? Is this consistent with economic theory? How would\n", "> these relationships relate to your own experience?\n", "\n", "Let’s begin investigating these relationships by making some\n", "visualizations." ], "id": "ec81824d-ffcd-41e2-8b0b-10f44068b4e3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corr_plot_data <- CO2_data[, c('CO2', 'gdp', 'population', 'electricity')]\n", "corr_plot_data <- as.data.frame(corr_plot_data)\n", "\n", "# Compute the correlation matrix)\n", "cor_matrix = cor(corr_plot_data)\n", "\n", "# Create the correlation plot\n", "corrplot(cor_matrix, order = 'hclust', addrect = 2)" ], "id": "4e07bfe1-99cd-40b4-90c0-bb87f7af9424" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a <- ggplot(data = CO2_data, aes(x = gdp, y = CO2))+\n", " xlab(\"GDP in millions of dollars\")+\n", " ylab(\"CO2 Emissions in kilotonnes\") + scale_x_continuous()\n", "\n", "a + geom_point()\n", "\n", "b <- ggplot(data = CO2_data, aes(x = population, y = CO2))+\n", " xlab(\"Population\")+\n", " ylab(\"CO2 Emissions in kilotonnes\") + scale_x_continuous()\n", "\n", "b + geom_point()\n", "\n", "c <- ggplot(data = CO2_data, aes(x = electricity, y = CO2))+\n", " xlab(\"Electricity in megawatt hours)\")+\n", " ylab(\"CO2 Emissions in kilotonnes\") + scale_x_continuous()\n", "\n", "c + geom_point()\n", "\n", "view(gdp_data)" ], "id": "d6e01d04-e28e-476e-afdb-6010251ae0d8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "From these plots, we can see visually that there appears to be an\n", "approximate linear realtionship between the continuous independent\n", "variables and `CO2`. Now that we’ve established some of these\n", "relationships, let’s build our multiple regression model." ], "id": "30b9b30e-1236-44e0-ad84-51933f85fe4d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "slr_1 <- lm(CO2 ~ gdp, data = CO2_data)\n", "summary(slr_1)\n", "\n", "slr_2 <- lm(CO2 ~ electricity, data = CO2_data)\n", "summary(slr_2)\n", "\n", "slr_3<- lm(CO2 ~ population, data = CO2_data)\n", "summary(slr_3)\n", "\n", "mlr_1 <- lm(CO2 ~ gdp + electricity + population, data = CO2_data)\n", "summary(mlr_1)" ], "id": "2bcfa35b-8466-4180-8779-3b3459a6c28a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As seen from the adjusted r-squared output, our multiple regression\n", "model has greater explanatory power than the any of the simple\n", "regressions alone, and all of the coefficients given are significant.\n", "\n", "## Part 3: Addressing Issues and Improving the Model\n", "\n", "### Underlying Assumptions - Homoskedasticity\n", "\n", "Homoskedasticity, or constant variance, is an underlying assumption of\n", "OLS. Knowing that heteroskedasticity is another common issue in\n", "regression, we need to check our model to ensure that it meets this\n", "requirement. We’ll start by checking visually with a residual plot for\n", "our first variable, GDP." ], "id": "b5c1d019-0563-48db-8569-69b1ea209ae0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "slr_1 <- lm(CO2 ~ gdp, data = CO2_data)\n", "\n", "ggplot(data = CO2_data, aes(x = as.numeric(gdp), y = as.numeric(slr_1$residuals))\n", " )+geom_point()+labs(x = \"GDP\", y = \"Residuals\")" ], "id": "80d43ed1-6755-466d-9f6c-db26e7c99751" }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this “eyeball test”, it looks like the data display\n", "heterskedasticity. We’ll test for this formally now with a Breusch-Pagan\n", "test." ], "id": "182a2b75-5599-41d4-a69e-9cadb4ec7a83" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CO2_data$resid_sqgdp <- (slr_1$residuals)^2\n", "residualsslr_1 <- lm(resid_sqgdp ~ gdp, data = CO2_data)\n", "summary(residualsslr_1)" ], "id": "8000574e-476e-458d-a08f-b7e432825841" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reject the null hypothesis and conclude that the GDP data are\n", "heteroskedastic. Let’s try to address this with a log transformation." ], "id": "502f4e99-5c2f-47eb-891b-4a9c82c51256" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# transform the data\n", "CO2_data <- CO2_data %>%\n", " mutate(ln_gdp = log(gdp))\n", "\n", "# visualize the logged value residuals\n", "slr_ln1 <- lm(CO2 ~ ln_gdp, data = CO2_data) \n", "\n", "ggplot(data = CO2_data, aes(x = as.numeric(ln_gdp), y = as.numeric(slr_ln1$residuals))) + \n", " geom_point()+labs(x = \"Log GDP\", y = \"Residuals\")\n", "\n", "# conduct another Breusch-Pagan Test\n", "CO2_data$resid_sq_lngdp <- (slr_ln1$residuals)^2\n", "residualsslr_ln1 <- lm(resid_sq_lngdp ~ ln_gdp, data = CO2_data)\n", "summary(residualsslr_ln1)" ], "id": "0f3b0f96-fa65-4b8b-b116-312c0473cd29" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This doesn’t seem to have fixed the problem of heteroskedasticity within\n", "the gdp data. We’ll overcome this in our final model by using robust\n", "standard errors. We’ll now go through a similar process with the\n", "remaining two variables, and conclude that the electricity data are also\n", "heteroskedastic." ], "id": "92f30aab-37e2-4a47-9fa2-0f5d29c7651d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#For `electricity`\n", "#Visualise the residuals\n", "slr_2 <- lm(CO2 ~ electricity, data = CO2_data)\n", "ggplot(data = CO2_data, aes(x = as.numeric(electricity), y = as.numeric(slr_2$residuals))\n", " )+geom_point()+labs(x = \"Electricity\", y = \"Residuals\")\n", "\n", "#Fromally test the hypothesis\n", "CO2_data$resid_sqelec <- (slr_2$residuals)^2\n", "residualsslr_2 <- lm(resid_sqelec ~ electricity, data = CO2_data)\n", "summary(residualsslr_2)\n", "\n", "#Log transform the data\n", "CO2_data <- CO2_data %>%\n", " mutate(ln_electricity = log(electricity))\n", "\n", "# visualize the logged value residuals\n", "slr_ln2 <- lm(CO2 ~ ln_electricity, data = CO2_data)\n", "ggplot(data = CO2_data, aes(x = as.numeric(ln_electricity), y = as.numeric(slr_ln2$residuals))\n", ")+geom_point()+labs(x = \"Log Electricity\", y_ = \"Residuals\")\n", "\n", "#Formally test \n", "CO2_data$resid_sq_lnelec <- (slr_ln2$residuals)^2\n", "residualsslr_ln2 <- lm(resid_sq_lnelec ~ ln_electricity, data = CO2_data)\n", "summary(residualsslr_ln2)\n", "\n", "# We reject the null hypothesis, and conclude heteroskedasticity in this data. " ], "id": "040b3b73-6cab-4fac-ab20-6b24301c9759" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#For `population`\n", "#Visualise the residuals\n", "slr_3 <- lm(CO2 ~ population, data = CO2_data)\n", "ggplot(data = CO2_data, aes(x = as.numeric(population), y = as.numeric(slr_3$residuals))\n", " )+geom_point()+labs(x = \"Population\", y = \"Residuals\")\n", "\n", "#Formal hypothesis testing\n", "CO2_data$resid_sqpop <- (slr_3$residuals)^2\n", "residualsslr_3 <- lm(resid_sqpop ~ population, data = CO2_data)\n", "summary(residualsslr_3)\n", "\n", "# We cannot reject the null hypothesis, and therefore cannot conclude heteroscedasticity for the population data. " ], "id": "096fde3e-eca3-4eac-88bf-b07524a811bc" }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run our regressions with heteroskedasticity-robust standard errors,\n", "we use the `sandwich` package, which we’ve called on earlier." ], "id": "a6bd504e-1786-4de4-aeaf-4a04d813e8e1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The initial regression that we ran\n", "mlr_1 <- lm(CO2 ~ gdp + population + electricity, data = CO2_data)\n", "\n", "# Obtain robust standard errors\n", "robust_se <- sqrt(diag(vcovHC(mlr_1, type = \"HC1\"))) # \"HC1\" is one of the robust variance estimators\n", "\n", "# Print the robust standard errors\n", "print(robust_se)\n", "\n", "# Usiing Coeftest to print the Hetetrokedasticity-Robust Standard Errors\n", "coeftest(mlr_1)" ], "id": "af875940-55c6-4477-b621-16e8ad034630" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multicollinearity\n", "\n", "As we know, multicollinearity is a common issue that arises in\n", "developing a regression. To test how this shows up in our current model,\n", "we will calculate variance inflation factors (VIF), using the package\n", "`car`.\n", "\n", "> *Think Deeper*: Do you suspect any variables in the model may be\n", "> affected by multicollinearity? How come?" ], "id": "011a2739-379f-4438-93ab-40a5dac337cb" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "car::vif(mlr_1)\n", "coeftest(slr_3)\n", "summary(slr_3)" ], "id": "33ebec31-ad1b-4c9c-9803-6f882ffb858f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The high VIF indicate that there’s a problem with collinearity in our\n", "data. We’ll try to combat this by creating a model which combines the\n", "effects of `GDP` and `population` into a single variable of per capita\n", "GDP." ], "id": "4b5596b5-e7b6-46bb-86c3-f2dc83932f8e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Create a new variable and view how it fits in the structure. \n", "CO2_data$per_capita_gdp <- CO2_data$gdp/CO2_data$population\n", "str(CO2_data)\n", "\n", "#Build a new model \n", "mlr_2 <- lm(CO2 ~ per_capita_gdp + electricity, data = CO2_data)\n", "summary(mlr_2)\n", "\n", "#Calculate VIF on the new model\n", "car::vif(mlr_2)" ], "id": "26a59559-5a87-4c45-9e09-5812e7731eac" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This appears to have addressed out problems with multicollinearity in\n", "our model.\n", "\n", "### Adding Other Variables\n", "\n", "There is still another variable that we haven’t considered in our model.\n", "Let’s take a look at how `province` affects CO2. Since our these\n", "variables are expressed qualitative factors, they need to be represented\n", "by dummy variables. R does this automatically when a dummy is used in\n", "regression, and excludes the first category (in our case Newfoundland\n", "and Labrador) as the reference variable." ], "id": "b1c07b3f-f430-44a2-b4f8-4d646c2d0960" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "slr_4 <- lm(CO2 ~ province, data = CO2_data)\n", "summary(slr_4)\n", "\n", "mlr_3 <- lm(CO2 ~ gdp + population + electricity + province, data = CO2_data)\n", "summary(mlr_3)" ], "id": "d6fd9803-d5ac-4a67-90b0-355bd25a2304" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model has great explnatory power, but not all coefficients are\n", "significant anymore. We’ll remove electricity and see how this affects\n", "our model." ], "id": "9a5fdad7-55ef-4ad6-8335-b11602680432" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mlr_4 <- lm(CO2 ~ gdp + population + province, data = CO2_data)\n", "summary(mlr_4)\n", "car::vif(mlr_4)\n", "\n", "#GVIF stands for Generalized Variance Inflation Factors, and appears here to show the combined VIF of all the province coefficients. " ], "id": "62727969-3023-4632-81d4-0f9612d13f70" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsurprisingly, this model has high multicollinearity. We expect this\n", "because we understand that things like population and gdp vary\n", "significantly with province. Recall how the data points were clustered\n", "around different centers in the original visualizations we made? If we\n", "summarize these mean values in a table, we can clearly see just how\n", "variable they are." ], "id": "e7ec8d8b-6791-4157-b543-a7a85d011d9d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a + geom_point()\n", "\n", "b + geom_point()\n", "\n", "c + geom_point()\n", "\n", "#create a summary table organized by province\n", "sumtable(CO2_data, \n", " vars = c(\"gdp\", \"population\", \"electricity\"),\n", " summ = c('mean(x)'),\n", " group = 'province',\n", " digits = 6,\n", " out = 'return')" ], "id": "949ed101-1b23-4d29-b283-d0458971d0b5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s see what happens when we interact some of the terms." ], "id": "b83d111d-b1a0-4b87-b4de-8922d28eb719" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Province interacted with population \n", "mlr_5 <- lm(CO2 ~ gdp + electricity + population*province, data = CO2_data)\n", "summary(mlr_5)\n", "\n", "#Province interacted with gdp\n", "mlr_6 <- lm(CO2 ~ population + electricity + gdp*province, data = CO2_data)\n", "summary(mlr_6)\n", "\n", "mlr_7 <- lm(CO2 ~ population + electricity + gdp:province, data = CO2_data)\n", "summary(mlr_7)\n", "\n", "vif(mlr_7)" ], "id": "6a164d65-b87d-4d09-b709-90139f02c4ca" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the final model the relationships appear to be significant. We would\n", "interpret the coefficients of the interacted terms as the *combined*\n", "effect of gdp and province- or how the effect of gdp on CO2 emissions\n", "changes by province. However this model is still severely affected by\n", "multicollinearity.\n", "\n", "Now that we have a few different models, let’s compare some of our\n", "results." ], "id": "c8b30ff5-153a-4248-aee9-1e11a1dd59dc" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stargazer(mlr_1, mlr_2, mlr_3, mlr_4, mlr_5, mlr_6, mlr_7, title=\"Comparison of Muliple Regression Results\",\n", " align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"))" ], "id": "5c159b38-5ac6-4d9c-a1f1-47a0979adb3f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the results of our simple regression, we know that there is a\n", "strong relationship between between province and \\[$CO_2$\\] emissions.\n", "But to choose the best model based on the data we have, we need to\n", "consider the models holistically, paying attention to the common issues\n", "that arise in regression, and not just the explanatory power alone.\n", "\n", "## Summary:\n", "\n", "In conclusion, we note that CO2 Emissions in the Canadian Economy are\n", "severely impacted by the variables of our consideration: GDP,\n", "Electricity Generation, and Population. It is also important to note\n", "that we ought to explore our data by running multiple regressions, and\n", "clean it appropriately before running our analysis. This exploratory\n", "practice may lead us relationships that we expect, or something\n", "completely contrasting. Therefore, use this as a sample for your own\n", "project, and keep researching!\n", "\n", "## Exercises:\n", "\n", "The below exercises are intended to help you check your conceptual\n", "understanding of the content.\n", "\n", "### Exercise 1:\n", "\n", "What is an advantage of a multiple regression model over a simple\n", "regression model? Pick the best answer.\n", "\n", "A: Multiple regression models often exhibit multicollinearity, which\n", "gives them better explanatory power compared to simple regressions.\n", "\n", "B: Multiple regression models improve the predictive properties of a\n", "model by adding multiple regressors that play a role in the\n", "relationship.\n", "\n", "C: Multiple regression models can display complicated relationships, but\n", "simple regression models are too simple to ever be useful.\n", "\n", "D: By having multiple variables, multiple regressions allow us to see\n", "the interactions between different variables in a relationship." ], "id": "b5511034-fde2-494a-b80f-e4a39cc7793b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_1 <- '...' # your answer here ('A', 'B', 'C', or 'D')\n", "\n", "test_1()" ], "id": "6f0bd172-aa57-4248-bbef-85c88e8dc30a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2:\n", "\n", "Which of the below would *not* be a way of resolving heteroskedasticity\n", "within a regression model?\n", "\n", "A: Performing a log transformation on the outcome variable\n", "\n", "B: Using the appropriate code to generate HC1 standard errors\n", "\n", "C: Searching for a different dataset to use\n", "\n", "D: Adding more explanatory variables to the model (so long as all of the\n", "individual relationships remain significant)" ], "id": "bbc91a38-0482-4e8b-8439-837dc60452f6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_2 <- '...' # your answer here ('A', 'B', 'C', or 'D')\n", "\n", "test_2()" ], "id": "884de564-0c6a-40c8-babc-a91388fd297c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3:\n", "\n", "What does a variance inflation factor of a variable tell us?\n", "\n", "A: The magnitude of the heteroskedasticity (variance of error terms) in\n", "the data for a given variable.\n", "\n", "B: The extent to which the variability of a dependent variable is\n", "inflated by multicollinearity in the model.\n", "\n", "C: Whether or not there is multicollinearity in our model.\n", "\n", "D: The extent to which the model is inflated by ommitted variable bias." ], "id": "7eb7e69e-ba43-4023-ada2-b1e6f76d2cd9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_3 <- '...' # your answer here ('A', 'B', 'C', or 'D')\n", "\n", "test_3()" ], "id": "cf8fbb8a-3ea9-4d07-bf14-a541a2443319" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4:\n", "\n", "In what situation would it be appropriate to incorporate an interaction\n", "into your model? Pick the best answer.\n", "\n", "A: There is reason to believe that the combined effect of two or more\n", "variables has an impact on the outcome variable.\n", "\n", "B: There is reason to believe that the combined effect of two or more\n", "continuous variables has an impact on the outcome variable.\n", "\n", "C: Your model is displaying multicollinearity with two variables.\n", "\n", "D: Your model is boring, and you want to make it more interesting." ], "id": "39df3241-8134-458c-a621-28faff0ac2d3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_4 <- '...' # your answer here ('A', 'B', 'C', or 'D')\n", "\n", "test_4()" ], "id": "2e77a087-88db-466e-9bbe-b7fe782125cb" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }