{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 14 - Panel Data Regressions\n",
        "\n",
        "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt  \n",
        "2024-05-29\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "1.  Run OLS Regressions.\n",
        "\n",
        "## Learning Outcomes\n",
        "\n",
        "1.  Prepare data for time-series analysis.\n",
        "2.  Run panel data regressions.\n",
        "3.  Create lagged variables.\n",
        "4.  Understand and work with fixed-effects.\n",
        "5.  Correct for heteroskedasticity and serial correlation.\n",
        "\n",
        "## 15.0 Intro\n",
        "\n",
        "This module uses the [Penn World\n",
        "Tables](https://www.rug.nl/ggdc/productivity/pwt/?lang=en) which measure\n",
        "income, input, output, and productivity, covering 183 countries between\n",
        "1950 and 2019. Before beginning this module, download this data in the\n",
        ".dta format.\n",
        "\n",
        "## 14.1 What is Panel Data?\n",
        "\n",
        "In economics, we typically have data consisting of many units observed\n",
        "at a particular point in time. This is called cross-sectional data.\n",
        "There may be several different versions of the data set that are\n",
        "collected over time (monthly, annually, etc.), but each version includes\n",
        "an entirely different set of individuals.\n",
        "\n",
        "For example, let’s consider a Canadian cross-sectional data set:\n",
        "*General Social Survey Cycle 31: Family, 2017*. In this data set, the\n",
        "first observation is a 55 year old married woman who lives in Alberta\n",
        "with two children. When the *General Social Survey Cycle 25: Family,\n",
        "2011* was collected six years earlier, there were probably similar women\n",
        "surveyed, but it is extremely unlikely that this exact same woman was\n",
        "included in that data set as well. Even if she was included, we would\n",
        "have no way to match her data over the two years of the survey.\n",
        "\n",
        "Cross-sectional data allows us to explore variation between individuals\n",
        "at one point in time but does not allow us to explore variation over\n",
        "time for those same individuals.\n",
        "\n",
        "Time-series data sets contain observations over several years for only\n",
        "one unit, such as country, state, province, etc. For example, measures\n",
        "of income, output, unemployment, and fertility for Canada from 1960 to\n",
        "2020 would be considered time-series data. Time-series data allows us to\n",
        "explore variation over time for one individual unit (e.g. Canada), but\n",
        "does not allow us to explore variation between individual units\n",
        "(i.e. multiple countries) at any one point in time.\n",
        "\n",
        "Panel data allows us to observe the same unit across multiple time\n",
        "periods. For example, the [Penn World\n",
        "Tables](https://www.rug.nl/ggdc/productivity/pwt/?lang=en) is a panel\n",
        "data set that measures income, output, input, and productivity, covering\n",
        "183 countries from 1950 to the near present. There are also microdata\n",
        "panel data sets that follow the same people over time. One example is\n",
        "the Canadian National Longitudinal Survey of Children and Youth (NLSCY),\n",
        "which followed the same children from 1994 to 2010, surveying them every\n",
        "two years as they progressed from childhood to adulthood.\n",
        "\n",
        "Panel data sets allow us to answer questions that we cannot answer with\n",
        "time-series and cross-sectional data. They allow us to simultaneously\n",
        "explore variation over time for individual countries (for example) and\n",
        "variation between individuals at one point in time. This approach is\n",
        "extremely productive for two reasons:\n",
        "\n",
        "1.  Panel data sets are large, much larger than if we were to use data\n",
        "    collected at one point in time.\n",
        "2.  Panel data regressions control for variables that do not change over\n",
        "    time and are difficult to measure, such as geography and culture.\n",
        "\n",
        "In this sense, panel data sets allow us to answer empirical questions\n",
        "that cannot be answered with other types of data such as cross-sectional\n",
        "or time-series data.\n",
        "\n",
        "Before we move forward exploring panel data sets in this module, we\n",
        "should understand the two main types of panel data:\n",
        "\n",
        "-   A **Balanced Panel** is a panel data set in which we observe *all*\n",
        "    units over *all* included time periods. Suppose we have a data set\n",
        "    following the school outcomes of a select group of $N$ children over\n",
        "    $T$ years. This is common in studies which investigate the effects\n",
        "    of early childhood interventions on relevant outcomes over time. If\n",
        "    the panel data set is balanced, we will see $T$ observations for\n",
        "    each child corresponding to the $T$ years they have been tracked. As\n",
        "    a result, our data set in total will have $n = N*T$ observations.\n",
        "-   An **Unbalanced Panel** is a panel data set in which we do *not*\n",
        "    observe all units over all included time periods. Suppose in our\n",
        "    data set tracking select children’s education outcomes over time,\n",
        "    and that some children drop out of the study. This panel data set\n",
        "    would be an unbalanced panel because it would necessarily have\n",
        "    $n < N*T$ observations, since the children who dropped out would not\n",
        "    have observations for the years they were no longer in the study.\n",
        "\n",
        "We learned the techniques to create a balanced panel in [Module\n",
        "6](https://comet.arts.ubc.ca/docs/Research/econ490-r/06_Within_Group.html).\n",
        "Essentially, all that is needed is to create a new data set that\n",
        "includes only the years for which there are no missing values.\n",
        "\n",
        "## 14.2 Preparing Our Data for Panel Analysis\n",
        "\n",
        "The first step in any panel data analysis is to identify which variable\n",
        "is the panel variable and which variable is the time variable. The panel\n",
        "variable is the identifier of the units that are observed over time. The\n",
        "second step is indicating that information to R.\n",
        "\n",
        "We are going to use the Penn World Data (discussed above) in this\n",
        "example. In that data set, the panel variable is either *country* or\n",
        "*countrycode*, and the time variable is *year*."
      ],
      "id": "e8820e7e-3b18-4a70-a051-6f7dd11a8e0b"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Clear the memory from any pre-existing objects\n",
        "rm(list=ls())\n",
        "\n",
        "# Load packages\n",
        "library(dplyr)\n",
        "library(tidyr)\n",
        "library(haven)"
      ],
      "id": "3758b118-1554-4398-b39d-025a2d6b2dda"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Import data (remember to change directory to the location of this data file)\n",
        "#setwd()\n",
        "pwt100 <- read_dta(\"../econ490-r/pwt100.dta\")  #change me!\n",
        "\n",
        "# Get summary of the data\n",
        "summary(pwt100)"
      ],
      "id": "74b8a259-c571-4046-a0c6-06c4c92a82f8"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You may have noticed that the variable *year* is an integer (i.e. a\n",
        "number like 2010) and that *country* and *countrycode* are character\n",
        "variables (i.e. they are words like “Canada”). Specifying the panel and\n",
        "time variables requires that both of the variables we are using are\n",
        "coded as numeric variables. Moireover, we need to sort our data by the\n",
        "unique identifier (*country* or *countrycode* in our case) and tme\n",
        "variable (*year*)."
      ],
      "id": "a813c560-80e5-4f5c-bc24-1d57bb81abd1"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Order data according to countrycode and year, and call it df\n",
        "df <- pwt100 %>% arrange(countrycode, year)"
      ],
      "id": "f9a385b9-6a75-4780-8113-1513bdb2ad13"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that we have sorted our data, we need to tell R that the data frame\n",
        "*df* contains panel data. We do so by relying on the package `plm`, a\n",
        "package containing various tools for Linear Models for Panel data. We\n",
        "load the package `plm` and use the `pdata.frame()` function to create a\n",
        "panel data frame. In the argument `index` of the function\n",
        "`pdata.frame()` we have to specify the name of the cross-sectional unit\n",
        "identifier (*countrycode*) and the time variable (*year*)."
      ],
      "id": "9ba09f67-d9e5-49ca-a8a6-2a373d6bb8d1"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Install and load plm package\n",
        "#uncomment to install the package! install.packages(\"plm\")\n",
        "library(plm)\n",
        "\n",
        "# Convert dataframe to panel data format\n",
        "panel_data <- pdata.frame(df, index=c(\"countrycode\", \"year\"))"
      ],
      "id": "df7ae3fd-7594-4e36-8eff-0a7aa7aece43"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To check that we have correctly converted our data in a panel data\n",
        "frame, we can use the `class` or the `pdim` functions. Note that `pdim`\n",
        "tells us if our data frame is balanced or not, as well as the number of\n",
        "cross-sectional unit identifiers and time periods."
      ],
      "id": "e72b4898-27e9-4a3f-aae1-bc6587094318"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "class(panel_data)"
      ],
      "id": "c88cef49-ba3e-49c4-9cd4-a794caf5d972"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "pdim(panel_data)"
      ],
      "id": "404d3303-2191-4e2f-914b-7b3f35ee371b"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 14.3 Basic Regressions with Panel Data\n",
        "\n",
        "For now, we are going to focus on the skills we need to run our own\n",
        "panel data regressions. In section 14.6, there are more details about\n",
        "the econometrics of panel data regressions that may help with the\n",
        "understanding of these approaches. Please make sure you understand that\n",
        "theory before beginning your own research.\n",
        "\n",
        "Now that we have specified the panel and time variables we are working\n",
        "with, we can begin to run regressions using our panel data. For panel\n",
        "data regressions, we simply replace `lm` with the command `plm`. The\n",
        "command `plm` takes another input, `model`. We can specify `model` to be\n",
        "fixed effect, random effect, or a pooled OLS. For now, let’s use a\n",
        "pooled OLS with `model=\"pooling\"`. More details on the other models will\n",
        "be addressed below.\n",
        "\n",
        "Let’s try this out by regressing the natural log of GDP per capita on\n",
        "the natural log of human capital."
      ],
      "id": "69b033cd-c3e7-47f5-998d-ca0b56479a0e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create the two new variables\n",
        "panel_data <- panel_data %>% mutate(lngdp = log(rgdpo/pop), lnhc = log(hc))\n",
        "\n",
        "# Estimate specification\n",
        "model <- plm(lngdp ~ lnhc, data = panel_data, model = \"pooling\")\n",
        "summary(model)"
      ],
      "id": "82d45aa3-3578-48f5-b1cf-e99971159c11"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The coefficients in a panel regression are interpreted similarly to\n",
        "those in a basic OLS regression. Because we have taken the natural log\n",
        "of our variables, we can interpret the coefficient on each explanatory\n",
        "variable as being a $\\beta$ % increase in the dependent variable\n",
        "associated with a 1% increase in the explanatory variable.\n",
        "\n",
        "Thus, in the regression results above, a 1% increase in human capital\n",
        "leads to a roughly 2% increase in real GDP per capita. That’s a huge\n",
        "effect, but then again this model is almost certainly misspecified due\n",
        "to omitted variable bias. Namely, we are likely missing a number of\n",
        "explanatory variables that explain variation in both GDP per capita and\n",
        "human capital, such as savings and population growth rates.\n",
        "\n",
        "One thing we know is that GDP per capita can be impacted by the\n",
        "individual characteristics of a country that do not change much over\n",
        "time. For example, it is known that distance from the equator has an\n",
        "impact on the standard of living of a country; countries that are closer\n",
        "to the equator are generally poorer than those farther from it. This is\n",
        "a time-invariant characteristic that we might want to control for in our\n",
        "regression. Similarly, we know that GDP per capita could be similarly\n",
        "impacted in many countries by a shock at one point in time. For example,\n",
        "a worldwide global recession would affect the GDP per capita of all\n",
        "countries at a given time such that values of GDP per capita in this\n",
        "time period are uniformly different in all countries from values in\n",
        "other periods. That seems like a time-variant characteristic (time\n",
        "trend) that we might want to control for in our regression. Fortunately,\n",
        "with panel data regressions, we can account for these sources of\n",
        "endogeneity. Let’s look at how panel data helps us do this.\n",
        "\n",
        "### 14.3.1 Fixed-Effects Models\n",
        "\n",
        "We refer to shocks that are invariant based on some variable\n",
        "(e.g. household level shocks that don’t vary with year or time-specific\n",
        "shocks that don’t vary with household) as **fixed-effects**. For\n",
        "instance, we can define household fixed-effects, time fixed-effects, and\n",
        "so on. Notice that this is an assumption on the error terms, and as\n",
        "such, when we include fixed-effects to our specification they become\n",
        "part of the model we assume to be true.\n",
        "\n",
        "When we ran our regression of log real GDP per capita on log human\n",
        "capital from earlier, we were concerned about omitted variable bias and\n",
        "endogeneity. Specifically, we were concerned about distance from the\n",
        "equator positively impacting both human capital and real GDP per capita,\n",
        "in which case our measure of human capital would be correlated with our\n",
        "error term, preventing us from interpreting our regression result as\n",
        "causal. We are now able to add country fixed-effects to our regression\n",
        "to account for this and come closer to determining the pure effect of\n",
        "human capital on GDP growth. There are two ways to do this. Let’s look\n",
        "at the more obvious one first.\n",
        "\n",
        "*Approach 1*: create a series of country dummy variables and include\n",
        "them in the regression. For example, we would have one dummy variable\n",
        "called “Canada” that would be equal to 1 if the country is Canada and 0\n",
        "if not. We would have dummy variables for all but one of the countries\n",
        "in this data set to avoid perfect collinearity. Rather than define all\n",
        "of these dummies manually and include them in our regression command, we\n",
        "can simply factorize them and R will include them automatically."
      ],
      "id": "4416c46c-0b68-4bc3-9acc-12911f84717b"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Factorize countrycode\n",
        "panel_data <- panel_data %>% mutate(countrycode = factor(countrycode))"
      ],
      "id": "2e25711f-725f-41bb-b4a7-ffeb9b1e75fb"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we can add the factorized version of country codes to our panel\n",
        "linear model."
      ],
      "id": "700a9a48-6859-47b1-b77b-2bd18427d684"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "model <- plm(lngdp ~ lnhc + countrycode, data = panel_data, model = \"pooling\")\n",
        "summary(model)"
      ],
      "id": "ca9f8bbd-37fd-485e-bfbc-42ab4ceca4fd"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The problem with this approach is that we end up with a huge table\n",
        "containing the coefficients of every country dummy, none of which we\n",
        "care about. We are interested in the relationship between GDP and human\n",
        "capital, not the mean values of GDP for each country relative to the\n",
        "omitted one. Luckily for us, a well-known result is that controlling for\n",
        "fixed-effects is equivalent to adding multiple dummy variables. This\n",
        "leads us into the second approach to including fixed-effects in a\n",
        "regression.\n",
        "\n",
        "*Approach 2*: We can alternatively apply fixed affects to the regression\n",
        "by adding `model=\"within\"` as an option on the regression."
      ],
      "id": "c840bacf-1ef6-43d6-9ef3-d30e21c0b1f8"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "model <- plm(lngdp ~ lnhc, data = panel_data, model = \"within\")\n",
        "summary(model)"
      ],
      "id": "bf480568-b02f-4a7b-87fe-f5f42bdb9273"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We obtained the same coefficient and standard errors on our explanatory\n",
        "variable using both approaches!\n",
        "\n",
        "### 14.3.2 Random-Effects Models\n",
        "\n",
        "One type of model we can also run is a **random-effects model**. The\n",
        "main difference between a random and fixed-effects model is that, with\n",
        "the random-effects model, differences across countries are assumed to be\n",
        "random. This allows us to treat time-invariant variables such as\n",
        "latitude as control variables. To run a random-effects model, just add\n",
        "`model=\"random\"` as argument of `plm`."
      ],
      "id": "5c08bb87-864a-46ce-b608-3c222ad60ebb"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "model <- plm(lngdp ~ lnhc, data = panel_data, model = \"random\")\n",
        "summary(model)"
      ],
      "id": "6c1775fe-b0d9-4715-96cc-4cd733ef2649"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we can see, with this data and choice of variables, there is little\n",
        "difference in results between all of these models.\n",
        "\n",
        "This, however, will not always be the case. The test to determine if you\n",
        "should use the fixed-effects model or the random-effects model is called\n",
        "the Hausman test.\n",
        "\n",
        "To run this test in R, we first have to store the fixed-effect and the\n",
        "random-effect models in two different objects, one called *fixed* and\n",
        "the other called *random*."
      ],
      "id": "d9501562-ee85-4ed3-b2be-1531f8dc0e43"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "fixed <- plm(lngdp ~ lnhc, data = panel_data, model = \"within\")\n",
        "random <- plm(lngdp ~ lnhc, data = panel_data, model = \"random\")"
      ],
      "id": "c575a207-7182-412e-8cb2-db4de81d04ac"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then, we perform the Hausman test by comparing the two objects *fixed*\n",
        "and *random* using the function `phtest`. Remember, the null hypothesis\n",
        "is that the preferred model is random-effects."
      ],
      "id": "f5fdd126-4f6d-4733-ba9b-3174e64dd19c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "phtest(fixed, random)"
      ],
      "id": "ecad316d-6bd9-4772-81b5-0477aa294aa3"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see, the p-values associated with this test suggest that we\n",
        "would reject the null hypothesis (random effect) and that we should\n",
        "adopt a fixed-effects model.\n",
        "\n",
        "### 14.3.3 What if We Want to Control for Multiple Fixed-Effects?\n",
        "\n",
        "Let’s say we have run a panel data regression with fixed-effects, and we\n",
        "think that no more needs to be done to control for factors that are\n",
        "constant across our cross-sectional variables (i.e. countries) at any\n",
        "one point in time (i.e. years). However, for very long series (for\n",
        "example those over 20 years), we will want to check that time dummy\n",
        "variables are not also needed.\n",
        "\n",
        "In R, we can easily do it using two functions: the `pFtest()` and the\n",
        "`plmtest()`.\n",
        "\n",
        "First, let’s save our models with and without time fixed-effects in two\n",
        "objects."
      ],
      "id": "f3d55aaa-8e54-41d2-8227-84536d014c0d"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# No time fixed-effects\n",
        "fixed <- plm(lngdp ~ lnhc, data = panel_data, model = \"within\")\n",
        "\n",
        "# Time fixed-effects\n",
        "fixed_yearfe <- plm(lngdp ~ lnhc + factor(year), data = panel_data, model = \"within\")"
      ],
      "id": "7c732fbd-b1f3-4d2e-ac46-d251cc824fff"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that we have saved both models, we can use the test. `pFtest()`\n",
        "requires us to use both models as inputs. `plmtest()` only needs the\n",
        "model without time fixed-effects as input."
      ],
      "id": "d84810f3-012a-4570-85a4-e5ce73347a3f"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Option 1: pFtest\n",
        "pFtest(fixed_yearfe, fixed)"
      ],
      "id": "ca507e44-f38e-4b56-83e5-9e30c9b0c1a8"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Option 2: plmtest\n",
        "plmtest(fixed, c(\"time\"), type=(\"bp\"))"
      ],
      "id": "14ccaac2-f78d-4e74-b9e7-7f0f2d51b3c9"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Both tests report a p-value smaller than 0.05, which suggests that we\n",
        "can reject the null hypothesis and need time-fixed-effects in our model.\n",
        "\n",
        "## 15.4 Creating New Panel Variables\n",
        "\n",
        "Panel data also provides us with a new source of variation: variation\n",
        "over time. This means that we have access to a wide variety of variables\n",
        "we can include. For instance, we can create lags (variables in previous\n",
        "periods) and leads (variables in future periods). Once we have defined\n",
        "our panel data set using the `pdata.frame` function (which we did\n",
        "earlier), we can create the lags using the `dplyr::lag()` function and\n",
        "the leads using the `dplyr::lead()` function.\n",
        "\n",
        "<b>Warning:</b> Many other packages have a lag() and a lead() function.\n",
        "To make sure that R knows which function you want to use, specify that\n",
        "the source library is `dplyr` by writing the functions in their full\n",
        "names: `dplyr::lag()` and `dplyr::lead()`. Failing to do so may result\n",
        "in lag() and lead() not to behave as expected.\n",
        "\n",
        "For example, let’s create a new variable that lags the natural log of\n",
        "GDP per capita by one period."
      ],
      "id": "72e758a9-a2ea-4a11-9caf-9782f229544b"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "panel_data <- panel_data %>% mutate(lag1_lngdp = dplyr::lag(lngdp,1))"
      ],
      "id": "c7661f2d-62e6-4a0d-afc2-1b1bc533c3fd"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we wanted to lag this same variable ten periods, we would write it as\n",
        "such:"
      ],
      "id": "95b98998-5b25-4688-83d4-044967222c98"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "panel_data <- panel_data %>% mutate(lag10_lngdp = dplyr::lag(lngdp,10))"
      ],
      "id": "70baee1f-eff5-4d27-91df-9a4e9ff1b2e7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let’s inspect the first 50 rows of our data frame to check that we have\n",
        "created lagged variables as expected."
      ],
      "id": "07c0fbfb-3830-403b-b297-2638b6704550"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "head(panel_data[, c(\"lngdp\", \"lag1_lngdp\", \"lag10_lngdp\")],50)"
      ],
      "id": "f377351c-b79b-42aa-80d8-b56f0668f68a"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can include lagged variables directly in our regression if we believe\n",
        "that past values of real GDP per capita influence current levels of real\n",
        "GDP per capita."
      ],
      "id": "2ad21dd7-e928-4878-ba01-15be1147fdd3"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "model <- plm(lngdp ~ lnhc + lag10_lngdp, data = panel_data, model = \"within\")\n",
        "summary(model)"
      ],
      "id": "b8ac5875-723c-4a0d-8027-424d28a58b85"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "While we included lags from the previous period and 10 periods back as\n",
        "examples, we can use any period for our lags. In fact, including lag\n",
        "variables as controls for recent periods such as one lag back and two\n",
        "lags back is the most common choice for inclusion of past values of\n",
        "independent variables as controls.\n",
        "\n",
        "Finally, these variables are useful if we are trying to measure the\n",
        "growth rate of a variable. Recall that the growth rate of a variable X\n",
        "is just equal to $ln(X_{t}) - ln(X_{t-1})$ where the subscripts indicate\n",
        "time.\n",
        "\n",
        "For example, if we want to now include the natural log of the population\n",
        "growth rate in our regression, we can create that new variable by taking\n",
        "the natural log of the population growth rate\n",
        "$ln(pop_{t}) - ln(pop_{t-1})$"
      ],
      "id": "2ce50bf3-250d-4a5e-b46f-f027e97cb6db"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create log of population\n",
        "panel_data$lnpop <- log(panel_data$pop)\n",
        "\n",
        "# Create the population growth rate\n",
        "panel_data <- panel_data %>% mutate(lnn = lnpop - dplyr::lag(lnpop,1))"
      ],
      "id": "745860cd-1eb4-42a0-883b-2b1ca02f3393"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Another variable that might also be useful is the natural log of the\n",
        "growth rate of GDP per capita."
      ],
      "id": "7ae63bb9-c26c-4a00-8393-c40dddd6f50c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "panel_data <- panel_data %>% mutate(dlngdp = lngdp - dplyr::lag(lngdp,1))"
      ],
      "id": "9cf62e71-bded-4d31-b602-8ac7176fd6ce"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let’s put this all together in a regression to see the effect of the\n",
        "growth rate of population on growth rate of GDP per capita, controlling\n",
        "for human capital and the level of GDP per capita in the previous year:"
      ],
      "id": "207177af-e4cf-4236-a76a-143306d7bbef"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "model <- plm(dlngdp ~ lag1_lngdp + lnn + lnhc, data = panel_data, model = \"within\")\n",
        "summary(model)"
      ],
      "id": "b624c710-bfcc-435f-a328-d7c397b5b59e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 14.5 Is Our Panel Data Regression Properly Specified?\n",
        "\n",
        "While there are the typical concerns with interpreting the coefficients\n",
        "of regressions (i.e. multicollinearity, inferring causality), there are\n",
        "some topics which require special treatment when working with panel\n",
        "data.\n",
        "\n",
        "### 14.5.1 Heteroskedasticity\n",
        "\n",
        "As always, when running regressions, we must consider whether our\n",
        "residuals are heteroskedastic (not constant for all values of $X$). To\n",
        "test our panel data regression for heteroskedasticity in the residuals,\n",
        "we need to calculate a modified Wald statistic. We use the Breusch-Pagan\n",
        "test that can be found in the `lmtest` package."
      ],
      "id": "9bb787e4-5228-4be7-bd06-b62e778ef295"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "library(lmtest)"
      ],
      "id": "480f44d8-6f21-422f-9f51-ea078b0aedc2"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Once we have loaded the `lmtest` package, we can call the Breusch-Pagan\n",
        "test in the `bptest()` function. The first argument of `bptest()` is the\n",
        "model we want to test; in our case, it is the specification for log GDP\n",
        "and log human capital. The second argument is the data frame."
      ],
      "id": "49434484-52b6-48d9-807f-bf918414ef6a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "bptest(lngdp ~ lnhc + countrycode, data = panel_data)"
      ],
      "id": "01263c1d-3982-46d7-ba59-5bfca96eea73"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The null hypothesis is homoskedasticity (or constant variance of the\n",
        "error term). From the output above, we can see that we reject the null\n",
        "hypothesis and conclude that the residuals in this regression are\n",
        "heteroskedastic.\n",
        "\n",
        "We can control for heteroskedasticity in different ways when we use a\n",
        "fixed-effects model. The `coeftest()` function allows us to estimate\n",
        "several heteroskedasticity-consistent covariance estimators."
      ],
      "id": "468dc129-2e16-4f1a-b69b-3730572c3644"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Estimate model\n",
        "fixed <- plm(lngdp ~ lnhc, data = panel_data, model=\"within\")\n",
        "\n",
        "# Show original coefficients\n",
        "coeftest(fixed)\n",
        "\n",
        "# Show heteroskedasticity consistent coefficients\n",
        "coeftest(fixed, vcovHC)"
      ],
      "id": "0f19d61f-1658-4b74-902f-f47912284be9"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 14.5.2 Serial Correlation\n",
        "\n",
        "In time-series setups where we only observe a single unit over time (no\n",
        "cross-sectional dimension) we might be worried that a linear regression\n",
        "model like\n",
        "\n",
        "$$\n",
        "Y_t = \\alpha + \\beta X_t + \\varepsilon_t \n",
        "$$\n",
        "\n",
        "can have errors that not only are heteroskedastic (i.e. that depend on\n",
        "observables $X_t$) but can also be correlated across time. For instance,\n",
        "if $Y_t$ was income, then $\\varepsilon_t$ may represent income shocks\n",
        "(including transitory and permanent components). The permanent income\n",
        "shocks are, by definition, very persistent over time. This would mean\n",
        "that $\\varepsilon_{t-1}$ affects (and thus is correlated with) shocks in\n",
        "the next period $\\varepsilon_t$. This problem is called serial\n",
        "correlation or autocorrelation, and if it exists, the assumptions of the\n",
        "regression model (i.e. unbiasedness, consistency, etc.) are violated.\n",
        "This can take the form of regressions where a variable is correlated\n",
        "with lagged versions of the same variable.\n",
        "\n",
        "To test our panel data regression for serial correlation, we need to run\n",
        "a Breusch-Godfrey/Woolridge test. In R, we can do it easily with\n",
        "`pbgtest()`."
      ],
      "id": "c2ec61a5-94d2-4ab2-b09b-c2cfdb7fa18a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Estimate model\n",
        "fixed <- plm(lngdp ~ lnhc, data = panel_data, model=\"within\")\n",
        "\n",
        "# Run test\n",
        "pbgtest(fixed)"
      ],
      "id": "6364a381-7581-4464-aa40-235eec64a081"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The null hypothesis is that there is no serial correlation between\n",
        "residuals. From the output, we see that we cannot reject the null\n",
        "hypothesis and conclude the variables are correlated with lagged\n",
        "versions of themselves. One method for dealing with this serial\n",
        "correlation in panel data regression is by using again the `coeftest()`\n",
        "function, this time with the Arellano method of computing the covariance\n",
        "matrix. Note that the Arellano method allows a fully general structure\n",
        "with respect to both heteroskedasticity and serial correlation, so that\n",
        "our standard errors would effectively be robust to both threats."
      ],
      "id": "f9827ea1-599b-461b-9c24-e6b1eb77ced7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Estimate model\n",
        "fixed <- plm(lngdp ~ lnhc, data = panel_data, model=\"within\")\n",
        "\n",
        "# Show original coefficients\n",
        "coeftest(fixed)\n",
        "\n",
        "# Show heteroskedasticity and serial correlation consistent coefficients\n",
        "coeftest(fixed, vcovHC(fixed, method=\"arellano\"))"
      ],
      "id": "602b6409-3d3a-4a8b-be6b-e0ebb1a2c4e1"
    },
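    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see how the choice of covariance estimator changes the standard\n",
        "errors, here is a minimal sketch. It is an illustration only: it uses the\n",
        "`Grunfeld` data set that ships with `plm` rather than the Penn World\n",
        "Tables, and the names `fe_demo` and `se_table` are made up for this\n",
        "example. In `plm`'s `vcovHC()`, `method = \"white1\"` is robust to\n",
        "heteroskedasticity only, while `method = \"arellano\"` (the default) is\n",
        "also robust to serial correlation. The coefficient estimates are the same\n",
        "in all three cases; only the standard errors differ."
      ],
      "id": "3f6c1a2e-8b47-4d19-9e52-a7c0d4b1e6f3"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Illustration only: built-in Grunfeld data, not the Penn World Tables\n",
        "library(plm)\n",
        "library(lmtest)\n",
        "\n",
        "data(\"Grunfeld\", package = \"plm\")\n",
        "\n",
        "# Fixed-effects model on the example data (firm = unit, year = time)\n",
        "fe_demo <- plm(inv ~ value + capital, data = Grunfeld,\n",
        "               index = c(\"firm\", \"year\"), model = \"within\")\n",
        "\n",
        "# Collect the standard errors under three covariance estimators\n",
        "se_table <- cbind(\n",
        "  default  = coeftest(fe_demo)[, \"Std. Error\"],\n",
        "  white1   = coeftest(fe_demo, vcovHC(fe_demo, method = \"white1\"))[, \"Std. Error\"],\n",
        "  arellano = coeftest(fe_demo, vcovHC(fe_demo, method = \"arellano\"))[, \"Std. Error\"]\n",
        ")\n",
        "se_table"
      ],
      "id": "b2d94e70-5c1a-4f38-8a6d-1e7f90c3a254"
    },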
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 14.5.3 Granger Causality\n",
        "\n",
        "In the regressions that we have been running in this example, we have\n",
        "found that the level of human capital is correlated with the level of\n",
        "GDP per capita. But have we proven that having high human capital causes\n",
        "countries to be wealthier? Or is is possible that wealthier countries\n",
        "can afford to invest in human capital? This is known as the issue of\n",
        "**reverse causality**, and arises when our independent variable\n",
        "determines our dependent variable.\n",
        "\n",
        "The Granger Causality test allows use to unpack some of the causality in\n",
        "these regressions. While understanding how this test works is beyond the\n",
        "scope of this notebook, we can look at an example using this data.\n",
        "\n",
        "The first thing we need to do is ensure that our panel is balanced. In\n",
        "the Penn World Tables, there are no missing values for real GDP and for\n",
        "population, but there are missing values for human capital. We can\n",
        "balance our panel by simply dropping all of the observations that do not\n",
        "include that measure."
      ],
      "id": "88f115c9-efe4-4e4b-ab40-38e8a224eb87"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "panel_data <- panel_data %>%\n",
        "            drop_na(lnhc)"
      ],
      "id": "2434216e-da21-4497-8269-91f55521de97"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we can run the test that is provided by R for Granger Causality:\n",
        "`grangertest()`. The first input is the model we want to use, the second\n",
        "input is the data, and the optional third input is the number of lags we\n",
        "want to use (by default, R uses only 1 lag)."
      ],
      "id": "290e3491-b239-48ad-aee9-bcab72c1f5fd"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "granger_test <- grangertest(lngdp ~ lnhc, data = panel_data, order=3)\n",
        "print(granger_test)"
      ],
      "id": "ec653fc7-3fe6-48be-b75b-0d1546f3f49d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Note that R gives us two models. In model 1, both previous values of GDP\n",
        "and human capital are included: this is an unrestricted model that\n",
        "includes all Granger-causal terms. In model 2, the Granger-causal terms\n",
        "are omitted and only previous values of GDP are included.\n",
        "\n",
        "From our results, we can reject the null hypothesis of lack of Granger\n",
        "causality. The evidence seems to suggest that high levels of human\n",
        "capital cause countries to be wealthier.\n",
        "\n",
        "Please speak to your instructor, supervisor, or TA if you need help with\n",
        "this test.\n",
        "\n",
        "## 14.6 How is Panel Data Helpful?\n",
        "\n",
        "In typical cross-sectional settings, it is hard to defend the selection\n",
        "on observables assumption (otherwise known as conditional independence).\n",
        "However, panel data allows us to control for unobserved time-invariant\n",
        "heterogeneity.\n",
        "\n",
        "Consider the following example. Household income $y_{jt}$ at time $t$\n",
        "can be split into two components:\n",
        "\n",
        "$$\n",
        "y_{jt} = e_{jt} + \\Psi_{j}\n",
        "$$\n",
        "\n",
        "where $\\Psi_{j}$ is a measure of unobserved household-level determinants\n",
        "of income, such as social programs targeted towards certain households.\n",
        "\n",
        "Consider what happens when we compute each $j$ household’s average\n",
        "income, average value of $e$, and average value of $\\Psi$ across time\n",
        "$t$ in the data:\n",
        "\n",
        "$$\n",
        "\\bar{y}_{J}= \\frac{1}{\\sum_{j,t}   \\mathbf{1}\\{ j = J \\}  } \\sum_{j,t}  y_{jt} \\mathbf{1}\\{ j = J \\}\n",
        "$$ $$\n",
        "\\bar{e}_{J}= \\frac{1}{\\sum_{j,t}   \\mathbf{1}\\{ j = J \\}  } \\sum_{j,t}  e_{jt} \\mathbf{1}\\{ j = J \\}\n",
        "$$ $$\n",
        "\\bar{\\Psi}_{J} =  \\Psi_{J}\n",
        "$$\n",
        "\n",
        "Notice that the mean of $\\Psi_{j}$ does not change over time for a fixed\n",
        "household $j$. Hence, we can subtract the two household level means from\n",
        "the original equation to get:\n",
        "\n",
        "$$\n",
        "y_{jt} - \\bar{y}_{j} = e_{jt} - \\bar{e}_{j}  + \\underbrace{ \\Psi_{j} - \\bar{\\Psi}_{j}  }_\\text{equals zero!}\n",
        "$$\n",
        "\n",
        "Therefore, we are able to get rid of the unobserved heterogeneity in\n",
        "household determinants of income via “de-meaning”! This is called a\n",
        "within-group or fixed-effects transformation. If we believe these types\n",
        "of unobserved errors/shocks are creating endogeneity, we can get rid of\n",
        "them using this powerful trick. In some cases, we may alternatively\n",
        "choose to do a first-difference transformation of our regression\n",
        "specification. This entails subtracting the regression in one period not\n",
        "from it’s expectation across time, but from the regression in the\n",
        "previous period. In this case, time-invariant characteristics are\n",
        "similarly removed from the regression since they are constant across all\n",
        "periods $t$.\n",
        "\n",
        "## 14.7 Common Mistakes\n",
        "\n",
        "One common mistake is not to respect the order set by R in defining the\n",
        "ordering variables. By default, R orders panel data based on a\n",
        "cross-sectional ID first and a time variable second. If we change the\n",
        "order of the indices, then the estimates produced by R will change.\n",
        "\n",
        "If we invert the order of the cross-sectional ID (*country*) and the\n",
        "time variable (*year*) we may get different results."
      ],
      "id": "70d7f5ea-09dd-4dfb-9304-dfc3effb0c0d"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Default order\n",
        "plm(lngdp ~ lnhc, data = panel_data, model=\"within\")\n",
        "\n",
        "# Inverted order\n",
        "plm(lngdp ~ lnhc, data = panel_data, model=\"within\", index=c(\"year\",\"countrycode\"))"
      ],
      "id": "ab0fa7f8-db82-4776-b065-5d16d28826e6"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Another common mistake happens with the `lag()` and `lead()` functions.\n",
        "Since there are several functions with this name, it’s always best to\n",
        "specify to R that we want to use the `lag()` and `lead()` functions from\n",
        "the package `dplyr`.\n",
        "\n",
        "See what happens when we forget to specify it: do you see any difference\n",
        "between *lag1_lngdp* and *new_lag1_lngdp*?"
      ],
      "id": "7afae7d1-191b-4ddb-9dca-26df61cd24b6"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create lag using dplyr::lag\n",
        "panel_data <- panel_data %>% mutate(lag1_lngdp = dplyr::lag(lngdp,1))\n",
        "\n",
        "# Create lag using lag\n",
        "panel_data <- panel_data %>% mutate(new_lag1_lngdp = lag(lngdp,1))\n",
        "\n",
        "# Check the difference\n",
        "head(panel_data[, c(\"lngdp\", \"lag1_lngdp\", \"new_lag1_lngdp\")],50)"
      ],
      "id": "ced456d5-3728-489e-a228-6a722263b92c"
    },
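    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A related pitfall can be sketched on a small made-up data frame (`toy`\n",
        "and its columns are hypothetical, and this is an ordinary data frame\n",
        "rather than the panel data used above). `dplyr::lag()` simply shifts a\n",
        "column down by one row, so without `group_by()` the first observation of\n",
        "each unit receives the last value of the previous unit. In the cell\n",
        "below, the row for country B in 2001 gets the 2003 value of country A in\n",
        "*lag_pooled*, but a missing value in *lag_by_country*."
      ],
      "id": "9a3d5c60-f1e8-4272-b4c9-0d6e7a2f1b95"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "library(dplyr)\n",
        "\n",
        "# Made-up data: two countries observed for three years each\n",
        "toy <- data.frame(\n",
        "  country = rep(c(\"A\", \"B\"), each = 3),\n",
        "  year    = rep(2001:2003, times = 2),\n",
        "  x       = c(1, 2, 3, 10, 20, 30)\n",
        ")\n",
        "\n",
        "toy <- toy %>%\n",
        "  arrange(country, year) %>%\n",
        "  # Lag that ignores the panel structure\n",
        "  mutate(lag_pooled = dplyr::lag(x, 1)) %>%\n",
        "  # Lag computed separately within each country\n",
        "  group_by(country) %>%\n",
        "  mutate(lag_by_country = dplyr::lag(x, 1)) %>%\n",
        "  ungroup()\n",
        "\n",
        "toy"
      ],
      "id": "c84f2b17-d0a9-4e53-8f6b-3b1c9e7d5a02"
    },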
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 14.8 Wrap Up\n",
        "\n",
        "In this module, we’ve learned how to address linear regression in the\n",
        "case where we have access to two dimensions: cross-sectional variation\n",
        "and time variation. The usefulness of time variation is that it allows\n",
        "us to control for time-invariant components of the error term which may\n",
        "be causing endogeneity. We also investigated different ways for\n",
        "addressing problems such as heteroskedasticity and autocorrelation in\n",
        "our standard errors when working specifically with panel data. In the\n",
        "next module, we will cover a popular research design method:\n",
        "difference-in-differences.\n",
        "\n",
        "## 14.9 Wrap-up Table\n",
        "\n",
        "| Command | Function |\n",
        "|--------------------------------|----------------------------------------|\n",
        "| `pdata.frame` | It transforms a data frame in panel data format. |\n",
        "| `plm` | It estimates a linear model with panel data. Use option “within” for Fixed-Effects and “random” for Random-Effects. |\n",
        "| `phtest` | It performs a test to choose between Fixed-Effects and Random-Effects model. |\n",
        "| `pFtest` | It performs a test to choose whether time fixed-effects are needed. |\n",
        "| `dplyr::lag` | It creates lag variables. |\n",
        "| `dplyr::lead` | It creates lead variables. |\n",
        "| `bptest` | It tests for heteroskedasticity. |\n",
        "| `pbgtest` | It tests for serial correlation. |\n",
        "| `grangertest` | It tests for Granger causality. |\n",
        "\n",
        "## References\n",
        "\n",
        "[Formatting and managing\n",
        "dates](https://www.youtube.com/watch?v=SOQvXICIRNY&t=149s) <br>\n",
        "[Time-series operators\n",
        "(lags)](https://www.youtube.com/watch?v=ik8r4WvrPkc&t=224s)"
      ],
      "id": "bfdb0968-1fe0-450e-aa84-b0dd0cf6550f"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}