{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 11 - Conducting Regression Analysis\n", "\n", "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt \n", "2024-05-29\n", "\n", "## Prerequisites\n", "\n", "1. Econometric approaches to linear regression taught in ECON326 or\n", " other introductory econometrics courses.\n", "2. Importing data into Stata.\n", "3. Creating new variables using `generate`.\n", "\n", "## Learning Outcomes\n", "\n", "1. Implement the econometric theory for linear regressions learned in\n", " ECON326 or other introductory econometrics courses.\n", "2. Run simple univariate and multivariate regressions using the command\n", " `regress`.\n", "3. Understand the interpretation of the coefficients in linear\n", " regression output.\n", "4. Consider the quality of control variables in a proposed model.\n", "\n", "## 11.0 Intro" ], "id": "0928ebeb-7793-4894-8af6-900816e97b01" }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import stata_setup\n", "stata_setup.config('C:\\Program Files\\Stata18/','se')" ], "id": "8d946b74" }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ ">>> import sys\n", ">>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module 01, Section 1.3: Setting Up the STATA Path\n", ">>> from pystata import config\n", ">>> config.init('se')" ], "id": "3c283502" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11.1 A Word of Caution Before We Begin\n", "\n", "Before conducting a regression analysis, a great deal of work must go\n", "into understanding the data and investigating the theoretical\n", "relationships between variables. The biggest mistake that students make\n", "at this stage is not how they run the regression analysis. It is failing\n", "to spend enough time preparing data for analysis.\n", "\n", "Here are some common challenges that students run into. Please pay\n", "attention to this when conducting your own research project.\n", "\n", "- A variable that is qualitative and not ranked cannot be used in an\n", " OLS regression without first being transformed into a dummy variable\n", " (or a series of dummy variables). Examples of variables that must\n", " always be included as dummy variables are sex, race, religiosity,\n", " immigration status, and marital status. Examples of variables that\n", " are sometimes included as dummy variables are education, income and\n", " age.\n", "- You will want to take a good look to see how your variables are\n", " coded before you begin running regressions and interpreting the\n", " results. Make sure that missing values are coded as “.” and not some\n", " value (such as “99”). Also, check that qualitative ranked variables\n", " are coded in the way you expect (e.g. higher education is coded with\n", " a larger number). 
If you do not do this, you could misinterpret your\n", " results.\n", "- Some samples are not proper representations of the population and\n", " must be weighted accordingly (we will deal with this in depth\n", " later).\n", "- You should always think about the theoretical relationship between\n", " your variables before you start your regression analysis: Does\n", " economic theory predict a linear relationship, independence between\n", " explanatory terms, or is there possibly an interaction at play?\n", "\n", "## 11.2 Linear Regression Models\n", "\n", "Understanding how to run a well structured OLS regression and how to\n", "interpret the results of that regression are the most important skills\n", "for undertaking empirical economic analysis. You have acquired a solid\n", "understanding of the theory behind the OLS regression in earlier\n", "econometrics courses; keep this in mind throughout your analysis. Here,\n", "we will cover the practical side of running regressions and, perhaps\n", "more importantly, how to interpret the results.\n", "\n", "An econometric model describes an equation (or set of equations) that\n", "impose some structure on how the data was generated. The most natural\n", "way to describe statistical information is the mean. Therefore, we\n", "typically model the mean of a (dependent) variable and how it can depend\n", "on different factors (independent variables or covariates). The easiest\n", "way to describe a relationship between a dependent variable, y, and one\n", "or more independent variables, x is linearly.\n", "\n", "Suppose we want to know what variables are needed to understand how and\n", "why earnings vary between each person in the world. What would be the\n", "measures needed to predict everyone’s earnings?\n", "\n", "Some explanatory variables might be:\n", "\n", "- Age\n", "- Year (e.g. macroeconomic shocks in that particular year)\n", "- Region (local determinants on earnings)\n", "- Hours worked\n", "- Education\n", "- Labor Market Experience\n", "- Industry / Occupation\n", "- Number of children\n", "- Level of productivity\n", "- Passion for their job\n", "- etc., there are so many factors which can be included!\n", "\n", "For simplicity, let’s assume we want to predict earnings but we only\n", "have access to data sets with information regarding people’s age and\n", "earnings. If we want to generate a model which predicts the relationship\n", "between these two variables, we could create a linear model where the\n", "dependent variable (y) is annual earnings, the independent variable (x)\n", "is age, the slope (m) is how much an extra year of age affects earnings,\n", "and the y-intercept (b) is earnings when age is equal to 0. We would\n", "write this relationship as:\n", "\n", "$$\n", "y = b +mx.\n", "$$\n", "\n", "We only have access to annual earnings and age, so we are unable to\n", "observe the rest of the variables (independent variables or covariates\n", "$X_{i}$) that might determine earnings. Even if we do not observe these\n", "variables, they still affect earnings. In other words, age does not\n", "perfectly predict earnings, so our model above would have some error:\n", "the true values for earnings would diverge from what is predicted by the\n", "linear model.\n", "\n", "Where $\\beta_0$ is the y-intercept, $\\beta_1$ is the slope, and $i$\n", "indicates the worker observation in the data, we have:\n", "\n", "$$\n", "logearnings_{i} =\\beta_0 + \\beta_1 age_{i} + u_{i}. 
\\tag{1}\n",
    "$$\n",
    "\n",
    "It’s important to understand what $\\beta_0$ and $\\beta_1$ stand for in\n",
    "the linear model. We said above that we typically model the mean of a\n",
    "(dependent) variable and how it can depend on different factors\n",
    "(independent variables or covariates). Therefore, we are in fact\n",
    "modeling the expected value of *logearnings*, conditional on the value\n",
    "of *age*. This is called the conditional expectation function, or\n",
    "**CEF**. We assume that it takes the form of:\n",
    "\n",
    "$$\n",
    "E[logearnings_{i}|age_{i}] = \\beta_0 + \\beta_1 age_i \\tag{2}\n",
    "$$\n",
    "\n",
    "How do equations (1) and (2) relate? If we take the expectation of\n",
    "equation (1) conditional on *age*, we can use the fact that $$\n",
    "E[age_{i}|age_{i}]=age_{i},\n",
    "$$\n",
    "\n",
    "so the two equations coincide as long as we assume that $$\n",
    "E[u_{i}|age_{i}]=0.\n",
    "$$\n",
    "\n",
    "If $age=0$, then $\\beta_1 \\times age=0$ and $$\n",
    "E[logearnings_{i}|age_{i}=0]=\\beta_0\n",
    "$$\n",
    "\n",
    "If $age=1$, then $\\beta_1 \\times age=\\beta_1$ and\n",
    "\n",
    "$$\n",
    "E[logearnings_{i}|age_{i}=1]=\\beta_0+ \\beta_1\n",
    "$$\n",
    "\n",
    "Differencing the two equations above gives us the solution,\n",
    "\n",
    "$$\n",
    "E[logearnings_{i}|age_{i}=1]- E[logearnings_{i}|age_{i}=0]= \\beta_1,\n",
    "$$\n",
    "\n",
    "where $β_1$ is the difference in the expected value of *logearnings*\n",
    "when there is a one unit increase in *age*. If we choose any two values\n",
    "that differ by 1 unit we will also get $\\beta_1$ as the solution (try it\n",
    "yourself!).\n",
    "\n",
    "If we know those ${\\beta}s$, we can know a lot of information about\n",
    "the mean earnings for different sets of workers. For instance, we can\n",
    "compute the mean log-earnings of 18 year old workers:\n",
    "\n",
    "$$\n",
    "E[logearnings_{i} \\mid age_{i}=18] = \\beta_0 + \\beta_1 \\times 18\n",
    "$$\n",
    "\n",
    "This is the intuition that we should follow to interpret the\n",
    "coefficients!\n",
    "\n",
    "Consider a slightly more complicated example.\n",
    "\n",
    "Let’s assume there are only two regions in this world: region **A** and\n",
    "region **B**. In this world, we’ll make it such that workers in region B\n",
    "earn $\\beta_1$ percentage points more than workers in region A on\n",
    "average. We are going to create a dummy variable called *region* that\n",
    "takes the value of 1 if the worker’s region is B and a value of 0 if the\n",
    "worker’s region is A.\n",
    "\n",
    "Furthermore, an extra year of age increases earnings by $\\beta_2$ on\n",
    "average. We take the same approach with every explanatory variable on\n",
    "the list above. The empirical economist (us!) only observes a subset of\n",
    "all these variables, which we call the observables or covariates\n",
    "$X_{i}$. Let’s suppose that the empirical economist only observes the\n",
    "region and age of the workers.\n",
    "\n",
    "We could generate the log-earnings of worker $i$ as follows, where\n",
    "$e_{i}$ collects all the unobserved determinants of earnings:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "logearnings_{i} &= \\beta_1 \\{region_{i}=1\\} + \\beta_2 age_{i} + e_{i} \\\\\n",
    "&= \\underbrace{E[logearnings_{i} \\mid region_{i}=0, age_{i}=0]}_{\\beta_0} + \\beta_1 \\{region_{i}=1\\} + \\beta_2 age_{i} + \\underbrace{\\left(e_{i} - E[logearnings_{i} \\mid region_{i}=0, age_{i}=0]\\right)}_{u_{i}}\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "In the second line we did one of the most powerful tricks in all of\n",
    "mathematics: add and subtract the same term! Specifically, we add and\n",
    "subtract the mean earnings for workers who are in region A and have\n",
    "*age* equal to zero. This term is the interpretation of the constant in\n",
    "our linear model. The re-defined unobservable term $u_i$ is a deviation\n",
    "from such mean, which we expect to be zero on average.\n",
    "\n",
    "Be mindful of the interpretation of the coefficients in this new\n",
    "equation. 
As we have just seen, the constant $\\beta_0$ is interpreted as\n", "the average earnings of workers living in region A and with *age* equal\n", "to zero: if $age=0$ and ${region}_{i}=0$ then\n", "$\\beta_1 \\times \\{{region}_{i}=0\\} = 0$ and $\\beta_2 \\times age=0$. All\n", "that remains is $\\beta_0$:\n", "\n", "$$\n", "E[logearnings_{i}|age_{i}=0 \\; \\text{and} \\; {region}_{i}=0]=\\beta_0\n", "$$\n", "\n", "But what are the expected earnings of a worker living in region B and\n", "with age equal to zero?\n", "\n", "If $age=0$ and ${region}_{i}=1$, then\n", "$\\beta_1 \\times \\{{region}_{i}=1\\} = \\beta_1$ and\n", "$\\beta_2 \\times age=0$. As a result, we obtain\n", "\n", "$$\n", "E[logearnings_{i}|age_{i}=0 \\; \\text{and} \\; {region}_{i}=1]=\\beta_0 + \\beta_1\n", "$$\n", "\n", "Therefore, $\\beta_1$ is interpreted as the difference in average\n", "earnings of workers living in region B compared to workers living in\n", "region A.\n", "\n", "Lastly, $\\beta_2$ is interpreted as the extra average earnings obtained\n", "by individuals with one additional year of age compared to other\n", "individuals **living in the same region**. That ‘living in the same\n", "region’ portion of the sentence is key. Consider an individual living in\n", "region A and with *age* equal to 1. The expected earnings in that case\n", "are\n", "\n", "$$\n", "E[logearnings_{i}|age_{i}=1 \\; \\text{and} \\; {region}_{i}=0]=\\beta_0 + \\beta_2\n", "$$\n", "\n", "Therefore, $\\beta_2$ is equal to the extra average earnings obtained by\n", "workers of region A for each one additional year of *age*: $$\n", "\\beta_2 = E[logearnings_{i}|age_{i}=1 \\; \\text{and} \\; {region}_{i}=0] - E[logearnings_{i}|age_{i}=0 \\; \\text{and} \\; {region}_{i}=0] \n", "$$\n", "\n", "Using the equations above, try computing the following difference in\n", "expected earnings for workers with different age and different region,\n", "and check that it is not equal to $\\beta_2$:\n", "\n", "$$\n", "E[logearnings_{i}|age_{i}=1 \\; \\text{and} \\; {region}_{i}=0] - E[logearnings_{i}|age_{i}=0 \\; \\text{and} \\; {region}_{i}=1] \n", "$$\n", "\n", "So far, we have made an assumption at the population level. Remember\n", "that to know the CEF, we need to know the true ${\\beta}s$, which in turn\n", "depend on the joint distribution of the outcome ($Y_i$) and covariates\n", "($X_i$). However, in practice, we typically work with a random sample\n", "where we compute averages instead of expectations and empirical\n", "distributions instead of the true distributions. Fortunately, we can use\n", "these in a formula (also known as an estimator!) to obtain a reasonable\n", "guess of the true ${\\beta}s$. For a given sample, the numbers that are\n", "output by the estimator or formula are known as estimates. One of the\n", "most powerful estimators out there is the Ordinary Least Squares\n", "Estimator (OLS).\n", "\n", "## 11.3 Ordinary Least Squares\n", "\n", "If we are given some data set and we have to find the unknown\n", "${\\beta}s$, the most common and powerful tool is known as OLS.\n", "Continuing with the example above, let all the observations be indexed\n", "by $j=1,2,\\dots, n$. 
Let\n",
    "\n",
    "$$\n",
    "\\hat{β_0}, \\hat{β_1}, \\hat{β_2}\n",
    "$$\n",
    "\n",
    "be the estimators of\n",
    "\n",
    "$$\n",
    "β_0, β_1, β_2.\n",
    "$$\n",
    "\n",
    "The formula for the estimators will return some values that will give\n",
    "rise to a sample version of the population model:\n",
    "\n",
    "$$\n",
    "logearnings_{j} = b_0 + b_1\\{region_{j}=1\\} + b_2 age_{j} + \\hat{u_{j}},\n",
    "$$\n",
    "\n",
    "where $u_j$ is the true error in the population, and $\\hat{u_{j}}$ is\n",
    "called a residual (the sample version of the error given the current\n",
    "estimates). OLS finds the values of ${\\hat{β}}s$ that minimize the sum\n",
    "of squared residuals. This is given by the following minimization\n",
    "problem: $$\n",
    "\\min_{b} \\frac{1}{n} \\sum_{j}^n \\hat{u}_{j}^2\n",
    "$$\n",
    "\n",
    "This expression can also be written as $$\n",
    "\\min_{b} \\frac{1}{n} \\sum_{j}^n (logearnings_{j} - b_0 - b_1 \\{region_{j}=1\\} - b_2 age_{j} )^2\n",
    "$$\n",
    "\n",
    "OLS is minimizing the squared residuals (the sample version of the error\n",
    "term) given our data. This minimization problem can be solved using\n",
    "calculus, specifically the derivative chain rule. The first order\n",
    "conditions, obtained by setting the derivative with respect to each of\n",
    "$b_0$, $b_1$, and $b_2$ equal to zero, are:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "-\\frac{2}{n} \\sum_{j}^n (logearnings_{j} - b_0 - b_1 \\{region_{j}=1\\} - b_2 age_{j}) &= 0 \\\\\n",
    "-\\frac{2}{n} \\sum_{j}^n (logearnings_{j} - b_0 - b_1 \\{region_{j}=1\\} - b_2 age_{j})\\{region_{j}=1\\} &= 0 \\\\\n",
    "-\\frac{2}{n} \\sum_{j}^n (logearnings_{j} - b_0 - b_1 \\{region_{j}=1\\} - b_2 age_{j}) age_{j} &= 0\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "From these first order conditions, we construct the most important\n",
    "restrictions for OLS:\n",
    "\n",
    "$$\n",
    "\\frac{1}{n} \\sum_{j}^n \\hat{u}_j = \\frac{1}{n} \\sum_{j}^n \\hat{u}_j \\times age_j=\\frac{1}{n} \\sum_{j}^n \\hat{u}_j\\times\\{region_j = 1\\}=0\n",
    "$$\n",
    "\n",
    "In other words, by construction, the sample version of our error term\n",
    "will be uncorrelated with all the covariates. The constant term works\n",
    "the same way as including a variable equal to 1 in the regression (try\n",
    "it yourself!).\n",
    "\n",
    "Notice that the true values $β_0, β_1, β_2$ satisfy the same conditions,\n",
    "but with expectations in place of the sample averages. Computing those\n",
    "expectations directly is infeasible, since we argued before that we\n",
    "would need to know the true joint distribution of the variables. As a\n",
    "matter of fact, many useful estimators rely on exactly this trick:\n",
    "replace an expectation by a sample average. This is called the sample\n",
    "analogue approach.\n",
    "\n",
    "**Note:** Because this is an optimization problem, all of our variables\n",
    "must be numeric. If a variable is categorical, we must re-code it into a\n",
    "numerical variable. You will understand more about this after completing\n",
    "our next module.\n",
    "\n",
    "## 11.4 Ordinary Least Squares Regressions with Stata\n",
    "\n",
    "For this module, we will be using the fake data set. Recall that this\n",
    "data is simulating information for workers in the years 1982-2012 in a\n",
    "fake country where a training program was introduced in 2003 to boost\n",
    "their earnings." ], "id": "2d615f55-d305-4472-b2a8-39f6a77957d4" }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "clear *\n", "*cd \"\"\n", "use \"fake_data.dta\", clear" ], "id": "cacc4cf9" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.4.1 Univariate Regressions\n",
    "\n",
    "To run a linear regression using OLS in Stata, we use the command\n",
    "`regress`. The basic syntax of the command is:\n",
    "\n",
    "``` stata\n",
    "regress dep_varname indep_varname\n",
    "```\n",
    "\n",
    "Feel free to look at the help file to see the different options that\n",
    "this command provides!\n",
    "\n",
    "Let’s start by creating a new variable that is the natural log of\n",
    "earnings and then run our regression. 
We are using the log of earnings\n", "since earnings has a highly skewed distribution, and applying a log\n", "transformation allows us to more normally distribute our earnings\n", "variable. This will be helpful for a variety of analytical pursuits." ], "id": "e35c879c-756f-4c82-a18a-461bb2b233c7" }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "gen logearn = log(earnings)\n", "regress logearn age " ], "id": "0bd0d281" }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, Stata includes a constant (which is usually what we want,\n", "since this will set residuals to 0 on average). The estimated\n", "coefficients are $\\hat{\\beta}_0 = 10$ and $\\hat{\\beta}_1 = 0.014$.\n", "Notice that we only included one covariate here. This is known as a\n", "univariate (linear) regression.\n", "\n", "The interpretation of coefficients in a univariate regression is fairly\n", "simple. $\\hat{\\beta}_1$ says that having one extra year of *age*\n", "increases *logearnings* by $0.014$ on average. In other words, one extra\n", "year in age returns 1.4 percentage points higher earnings. Meanwhile,\n", "$\\hat{\\beta}_0$ says that the average log earnings of individuals with a\n", "recorded age of 0 is about $10$. This intercept is not particularly\n", "meaningful given that no one in the data set has an age of 0. It is\n", "important to note that this often occurs: the $\\hat{\\beta}_0$ intercept\n", "is often not economically meaningful. After all, $\\hat{\\beta}_0$ is\n", "simply an OLS estimate resulting from minimizing the sum of squared\n", "residuals.\n", "\n", "Sometimes, we find that our coefficient is negative. This is not a\n", "concern. If it was the case that $\\hat{\\beta}_1 = -0.014$, this would\n", "instead mean that one extra year of *age* is associated with a $0.014$\n", "decrease in *logearnings*, or $1.4$ percentage point lower earnings.\n", "When interpreting coefficients, the sign is also important. We will look\n", "at how to interpret coefficients in a series of cases later.\n", "\n", "### 11.4.2 Multivariate Regressions\n", "\n", "The command `regress` also allows us to list multiple covariates. When\n", "we want to carry out a multivariate regression, we write:\n", "\n", "``` stata\n", "regress dep_varname indep_varname1 indep_varname2\n", "```\n", "\n", "and so on." ], "id": "f6b7798e-6c4c-429e-9598-32fce4df335c" }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "regress logearn age treated" ], "id": "b37980aa" }, { "cell_type": "markdown", "metadata": {}, "source": [ "How would we interpret the coefficient corresponding to being treated?\n", "Consider the following two comparisons:\n", "\n", "- Mean *logearnings* of 18 year old treated workers minus the mean\n", " *logearnings* of 18 year old untreated workers = $\\beta_2$.\n", "- Mean *logearnings* of 20 year old treated workers minus the mean\n", " *logearnings* of 20 year old untreated workers = $\\beta_2$.\n", "\n", "Therefore, the coefficient gives the increase in *logearnings* between\n", "treated and untreated workers **holding all other characteristics\n", "equal**. We economists usually refer to this as\n", "$\\textit{ceteris paribus}$.\n", "\n", "The second column shows the standard errors. Using those, we can compute\n", "the third column, which tests whether a given $\\beta$ coefficient is\n", "equal to zero. 
To test this, we set up the hypothesis that a coefficient\n",
    "$\\beta$ equals 0, and thus has a mean of 0, then standardize it using\n",
    "the standard error provided:\n",
    "\n",
    "$$\n",
    "t = \\frac{ \\hat{\\beta} - 0 }{StdErr}\n",
    "$$\n",
    "\n",
    "If the t-statistic is roughly greater than 2 in absolute value, we\n",
    "reject the null hypothesis that there is no effect of the independent\n",
    "variable in question on earnings ($\\beta = 0$). This would mean\n",
    "that the data support the hypothesis that the variable in question has\n",
    "some effect on earnings at a confidence level of 95%.\n",
    "\n",
    "An alternative test can be performed using the p-value statistic: if the\n",
    "p-value is less than 0.05, we reject the null hypothesis at the 95%\n",
    "confidence level. In either case, when we reject the null hypothesis, we\n",
    "say that the coefficient is statistically significant.\n",
    "\n",
    "No matter which of the two approaches we choose, Stata luckily provides\n",
    "us with the t-statistic and p-value for each coefficient immediately,\n",
    "allowing us to reject or fail to reject the null hypothesis that the\n",
    "coefficient is equal to 0 at a glance.\n",
    "\n",
    "**Note:** Without statistical significance, we cannot reject the null\n",
    "hypothesis that the coefficient is zero; in that case, the data provide\n",
    "no evidence that the independent variable of interest has an effect on\n",
    "the dependent variable.\n",
    "\n",
    "Thus, when working with either univariate or multivariate regressions,\n",
    "we must pay attention to two key features of our coefficient estimates:\n",
    "\n",
    "1. the sign of the coefficient (positive or negative), and\n",
    "2. the p-value or t-statistic of the coefficient (checking for\n",
    "    statistical significance).\n",
    "\n",
    "A subtler but also important point is to always inspect the magnitude of\n",
    "the coefficient. We could find $\\hat{\\beta}_1 = 0.00005$ in our\n",
    "regression and determine that it is statistically significant. However,\n",
    "this would not change the fact that an extra year of age increases your\n",
    "log earnings by only 0.00005, which is a very weak effect. Magnitude is\n",
    "always important when assessing whether a relationship is actually large\n",
    "in size, even if it is statistically significant and thus we can be quite\n",
    "sure it’s not 0. Understanding whether the magnitude of a coefficient is\n",
    "economically meaningful typically requires a firm understanding of the\n",
    "economic literature in that area.\n",
    "\n",
    "### 11.4.3 Interpreting Coefficients\n",
    "\n",
    "While we have explored univariate and multivariate regressions of a log\n",
    "dependent variable and non-log independent variables (known as a\n",
    "log-linear model), the variables in linear regressions can take on many\n",
    "other forms. 
Each of these forms, whether a transformation of variables\n", "or not, influences how we can interpret these $\\beta$ coefficient\n", "estimates.\n", "\n", "For instance, look at the following regression:" ], "id": "3b6e17d0-6286-41f6-a8a0-602fa244bb69" }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "regress earnings age" ], "id": "f960771b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a classic single variable regression with no transformations\n", "(e.g. log) applied to the variables. In this regression, a one-unit\n", "change in the independent variable leads to a $\\beta$ unit change in the\n", "dependent variable. As such, we can interpret our coefficients in the\n", "following way: an extra year of *age* increases *earnings* by 1046.49 on\n", "average. The average earnings of individuals with age equal to 0 is\n", "35484, which we have already discussed is not economically meaningful.\n", "The incredibly low p-value for the coefficient on age also indicates\n", "that this is a statistically significant effect.\n", "\n", "Next, let’s look at the following regression, where a log transformation\n", "has now been applied to the independent variable and not the dependent\n", "variable:" ], "id": "42c9220d-bc18-4d53-a25f-752b1935e3a7" }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "generate logage = log(age)\n", "\n", "regress earnings logage" ], "id": "e309ec27" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is known as a linear-log regression, since only the independent\n", "variable has been transformed. It is a mirror image of the log-linear\n", "model we first looked at when we took the log of earnings. In this\n", "regression, we can say that a 1 unit increase in *logage* leads to a\n", "37482 increase in *earnings*, or that a 1% increase in age leads to an\n", "increase in earnings of 374.82. To express this more neatly, a 10%\n", "increase in age leads to an increase in earnings of about 3750, or a\n", "100% increase in age (doubling of age) leads to an increase in earnings\n", "of about 37500.\n", "\n", "We can even have a log-log regression, wherein both the dependent and\n", "independent variables in question have been transformed into log format." ], "id": "0db37f9b-aa1b-4b60-ac8b-edcd9b185b82" }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "regress logearn logage" ], "id": "6470ffcb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "When interpreting the coefficients in this regression, we can say that a\n", "1 unit increase in *logage* leads to a 0.52 unit increase in *logearn*,\n", "or that a 1% increase in age leads to a 0.52% increase in earnings. To\n", "express this more neatly, we can also say that a 10% increase in age\n", "leads to a 5.2% increase in earnings, or that a 100% increase in age\n", "(doubling of age) leads to a 52% increase in earnings.\n", "\n", "Additionally, while we have been looking at log transformations, we can\n", "apply other transformations to our variables. Suppose that we believe\n", "that age is not linearly related to earnings. Instead, we believe that\n", "age may have a quadratic relationship with earnings. We can define\n", "another variable for this term and then include it in our regression to\n", "create a multivariate regression as follows." 
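,
    "\n",
    "As an aside, an equivalent and arguably more idiomatic route is to let\n",
    "Stata build the quadratic term for us with factor-variable notation (a\n",
    "minimal sketch using the same variables as the next code cell):\n",
    "\n",
    "``` stata\n",
    "* equivalent to generating agesqr = age^2 and regressing on age and agesqr\n",
    "regress earnings c.age##c.age\n",
    "```\n",
    "\n",
    "We create the squared variable explicitly in the next cell so that the\n",
    "coefficient names match the discussion that follows."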
], "id": "97770b53-61dd-4f6b-90aa-1d84d5e89b47" }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "generate agesqr = age^2\n", "\n", "regress earnings age agesqr" ], "id": "2ab90431" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this regression, we get coefficients on both *age* and *agesqr*.\n",
    "Since the age variable appears in two places, neither coefficient can\n",
    "individually tell us the effect of age on earnings. Instead, we must\n",
    "take the partial derivative of earnings with respect to age. If our\n",
    "population regression model is\n",
    "\n",
    "$$\n",
    "earnings_i = \\beta_0 + \\beta_1 age_i + \\beta_2 age^2_i + u_i,\n",
    "$$\n",
    "\n",
    "then the marginal effect of age on earnings is $\\beta_1 + 2\\beta_2 age_i$,\n",
    "which depends on the age at which it is evaluated. At $age_i = 1$, for\n",
    "example, one extra year of age leads to a 3109.1 + 2(-27.7) = 3053.7 unit\n",
    "increase in earnings, and because $\\beta_2$ is negative this effect\n",
    "shrinks as age rises. There are many other types of transformations we\n",
    "can apply to variables in our regression models. This is just one\n",
    "example.\n",
    "\n",
    "In all of these examples, our $\\beta_0$ intercept coefficient gives us\n",
    "the expected value of our dependent variable when our independent\n",
    "variables equal 0. We can inspect the output of these regressions\n",
    "further, looking at their p-values or t-statistics, to determine whether\n",
    "the coefficients we receive as output are statistically significant.\n",
    "\n",
    "Some regressions involve dummy variables and interaction terms. It is\n",
    "critical to understand how to interpret these coefficients, since these\n",
    "terms are quite common. The coefficient on a dummy variable effectively\n",
    "states the average difference in the dependent variable between two\n",
    "groups, *ceteris paribus*, with one of the groups being the base level\n",
    "group left out of the regression entirely. The coefficient on an\n",
    "interaction term, conversely, tells us how the relationship between a\n",
    "dependent and independent variable differs between groups, or differs as\n",
    "another variable changes. We’ll look at both dummy variables and\n",
    "interaction terms in regressions in much more depth in [Module\n",
    "13](https://comet.arts.ubc.ca/docs/Research/econ490-pystata/13_Dummy.html).\n",
    "\n",
    "### 11.4.4 Sample weights\n",
    "\n",
    "The data that is provided to us is often not statistically\n",
    "representative of the population as a whole. This is because the\n",
    "agencies that collect data (like Statistics Canada) often decide to\n",
    "over-sample some segments of the population. They do this to ensure that\n",
    "there is a large enough sample size of subgroups of the population to\n",
    "conduct meaningful statistical analysis of those sub-populations. For\n",
    "example, the population of Indigenous identity in Canada accounts for\n",
    "approximately 5% of the total population. If we took a representative\n",
    "sample of 10,000 Canadians, there would only be 500 people who\n",
    "identified as Indigenous in the sample.\n",
    "\n",
    "This creates two problems. The first is that this is not a large enough\n",
    "sample to undertake any meaningful analysis of characteristics of the\n",
    "Indigenous population in Canada. The second is that when the sample is\n",
    "this small, it might be possible for researchers to identify individuals\n",
    "in the data. This would be extremely unethical, and Statistics Canada\n",
    "works hard to make sure that data remains anonymized.\n",
    "\n",
    "To resolve this issue, Statistics Canada over-samples people of\n",
    "Indigenous identity when they collect data. 
For example, they might\n", "survey 1000 people of Indigenous identity so that those people now\n", "account for 10% of observations in the sample. This would allow\n", "researchers who want to specifically look at the experiences of\n", "Indigenous people to conduct reliable research, and maintain the\n", "anonymity of the individuals represented by the data.\n", "\n", "When we use this whole sample of 10,000, however, the data is no longer\n", "nationally representative since it overstates the share of the\n", "population of Indigenous identity - 10% instead of 5%. This sounds like\n", "a complex problem to resolve, but the solution is provided by the\n", "statistical agency that created the data in the form of “sample weights”\n", "that can be used to recreate data that is nationally representative.\n", "\n", "There are four ways to weight in Stata. We can include frequency weights\n", "(`fw`), analytic weights (`aw`), probability or sampling weights (`pw`),\n", "and importance weights (`iw`). All of these are used for different\n", "purposes. For example, `pw` is most frequently used with survey data, to\n", "indicate the probability that an observation was selected into the\n", "sample. You can find more information about this by typing `help weight`\n", "in the Command Window.\n", "\n", "**Note**: Before applying any weights in our regression, it is important\n", "that we read the user guide that comes with the data to see how weights\n", "should be applied. There are several options for weights and we should\n", "never apply weights without first understanding the intentions of the\n", "authors of the data.\n", "\n", "Our sample weights will be commonly coded as an additional variable in\n", "our data set such as *weight_pct*, however sometimes this is not the\n", "case, and we will need to select the variable ourselves. Please reach\n", "out to an instructor, TA, or supervisor if you think this is the case.\n", "To include probability weights in regression analysis, we can simply\n", "include the following command immediately after our independent\n", "variable(s):\n", "\n", "``` stata\n", " regress y x [pw = weight_pct] \n", "```\n", "\n", "We can do that with the variable *sample_weight* which is provided to us\n", "in the “fake_data” data set, re-running the regression of *logearnings*\n", "on *age* and *treated* from above." ], "id": "e979a864-c1a3-440d-9c0e-a45ae53c5c8d" }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "regress logearn age treated [pw = sample_weight]" ], "id": "f697df8c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, after weighting our sample, the coefficients from our regression\n", "will change in magnitude. In these cases, there was some subsample of\n", "the population that was over-represented in the data and skewed the\n", "results of the unweighted regression.\n", "\n", "Finally, while this section described the use of weighted regressions,\n", "it is important to know that there are many times we might want to apply\n", "weights to our sample that have nothing to do with running regressions.\n", "For example, if we wanted to calculate the mean of a variable using data\n", "from a skewed sample, we would want to make sure to use the weighted\n", "mean. 
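\n",
    "\n",
    "As a quick illustration, a weighted mean can be computed directly (a\n",
    "minimal sketch using the *sample_weight* variable from the fake data;\n",
    "always check your own data's documentation for the appropriate weight\n",
    "type):\n",
    "\n",
    "``` stata\n",
    "* weighted mean of earnings, using the data set's sampling weights\n",
    "mean earnings [pw = sample_weight]\n",
    "```\n",
    "\n",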
    "While `summarize` is used in Stata to calculate means, we can also use\n",
    "`collapse` to create summary statistics with sample weights factored\n",
    "into the calculations (see [Module\n",
    "7](https://comet.arts.ubc.ca/docs/Research/econ490-pystata/07_Within_Group.html)).\n",
    "\n",
    "## 11.5 What can we do with OLS?\n",
    "\n",
    "Notice that OLS gives us a linear approximation to the conditional mean\n",
    "of some dependent variable, given some observables. We can use this\n",
    "information for prediction: if we had different observables, how would\n",
    "the conditional mean differ? We can do this in Stata by using the\n",
    "`predict` command. All we need to do is run `predict varname` after we\n",
    "run our regression. `varname` represents a new variable that will hold\n",
    "the predicted values of our dependent variable. We can do this with\n",
    "different regressions that have different observables (one might include\n",
    "*age* as an explanatory variable, while another might include\n",
    "*education*), and we can compare the predicted values.\n",
    "\n",
    "Another thing we can do with OLS is discuss causality: how does\n",
    "manipulating one variable impact a dependent variable on average? To\n",
    "give a causal interpretation to our OLS estimates, we require that, in\n",
    "the population, it holds that $\\mathbf{E}[X_i u_i] = 0$. This is the\n",
    "same as saying that the unobservables are uncorrelated with the\n",
    "independent variables of the equation (remember, this is not testable\n",
    "because we cannot compute the expectations in practice!). If the\n",
    "unobservables are correlated with an independent variable, then part of\n",
    "the change in the dependent variable that we attribute to that\n",
    "independent variable is actually caused by changes in the unobservables.\n",
    "This prevents us from interpreting our coefficients causally and is\n",
    "known as the endogeneity problem.\n",
    "\n",
    "We might be tempted to think that we can test this using the sample\n",
    "version $\\frac{1}{n} \\sum_{j}^n X_j \\hat{u}_j = 0$, but notice from the\n",
    "first order conditions that this is true by construction! It is by\n",
    "design a circular argument; we are imposing that it holds when we\n",
    "compute the solution to OLS.\n",
    "\n",
    "For instance, looking at the previous regression, if we want to say that\n",
    "the causal effect of being treated is equal to -0.81, it must be the\n",
    "case that treatment is not correlated (in the population sense) with the\n",
    "error term (our unobservables). However, it could be the case that\n",
    "treated workers are the ones that usually perform worse at their job,\n",
    "which would contradict a causal interpretation of our OLS estimates.\n",
    "This brings us to a short discussion of what distinguishes good and bad\n",
    "controls in a regression model:\n",
    "\n",
    "- Good Controls: To think about good controls, we need to consider\n",
    "    which **unobserved** determinants of the outcome are possibly\n",
    "    correlated with our variable of interest; good controls are observed\n",
    "    variables that capture them.\n",
    "- Bad Controls: It is bad practice to include variables that are\n",
    "    themselves outcomes. For instance, consider studying the causal\n",
    "    effect of college on earnings. If we include a covariate indicating\n",
    "    whether someone works at a high paying job, then we’re blocking part\n",
    "    of the causal channel between college and earnings (i.e. 
you are more likely to have a\n",
    "nice job if you study more years!)\n",
    "\n",
    "## 11.6 Wrap Up\n",
    "\n",
    "In this module we discussed the following concepts:\n",
    "\n",
    "- Linear Model: an equation that describes how the outcome is\n",
    "    generated, and depends on some coefficients $\\beta$.\n",
    "- Ordinary Least Squares: a method to obtain a good approximation of\n",
    "    the true $\\beta$ of a linear model from a given sample.\n",
    "\n",
    "Notice that there is no such thing as an OLS model. More specifically,\n",
    "notice that we could apply a different method (estimator) to a linear\n",
    "model. For example, consider minimizing the sum of the absolute values\n",
    "of the residuals: $$\n",
    "\\min_{b} \\frac{1}{n} \\sum_{j}^n | \\hat{u}_j |\n",
    "$$\n",
    "\n",
    "The model is still linear, but the solution to this problem is not the\n",
    "OLS estimate.\n",
    "\n",
    "We also learned how to interpret coefficients in any linear model.\n",
    "$\\beta_0$ is the y-intercept of the line in a typical linear regression\n",
    "model. Therefore, it is equal to:
\n",
    "\n",
    "$$\n",
    "E[y_{i}|x_{i}=0]=\\beta_0.\n",
    "$$\n",
    "\n",
    "It is the expected value of y when x = 0. In practice, because we only\n",
    "have a sample approximation of this true value, our estimate corresponds\n",
    "to the sample mean of y when x = 0.\n",
    "\n",
    "In the case of any other beta, $\\beta_1$ or $\\beta_2$ or $\\beta_3$,\n",
    "\n",
    "$$\n",
    "E[y_{i}|x_{i}=1]- E[y_{i}|x_{i}=0]= \\beta\n",
    "$$\n",
    "\n",
    "is the difference in the expected value of y due to a one unit change in\n",
    "the corresponding x. Therefore, each $\\beta$ value tells us the effect\n",
    "that a particular covariate has on y, ceteris paribus. Transformations\n",
    "can also be applied to the variables in question, scaling the\n",
    "interpretation of this $\\beta$ coefficient. Overall, these coefficient\n",
    "estimates are values of great importance when we are developing our\n",
    "research!\n",
    "\n",
    "## 11.7 Wrap-up Table\n",
    "\n",
    "| Command | Function |\n",
    "|----------------------------------|--------------------------------------|\n",
    "| `regress dep_varname indep_varname1 indep_varname2 ... [pw = weight_pct]` | Estimates a linear model using OLS, here with probability weights included. |\n",
    "| `predict varname` | Creates a new variable holding the predicted values of the dependent variable from the most recent regression. |\n",
    "\n",
    "## References\n",
    "\n",
    "[Simple linear regression in\n",
    "Stata](https://www.youtube.com/watch?v=HafqFSB9x70&list=PLN5IskQdgXWnnIVeA_Y0OBGmnw21fvcmU&index=21)\n",
    "\n",
    "[(Non StataCorp) Summary of Interpreting a Regression Output from\n",
    "Stata](https://www.youtube.com/watch?v=-kr1BaEqlxg)\n",
    "\n",
    "[(Non StataCorp) Weighting in\n",
    "Stata](https://stats.oarc.ucla.edu/other/mult-pkg/faq/what-types-of-weights-do-sas-stata-and-spss-support/#:~:text=You%20need%20to%20read%20the,while%20another%20assumes%20probability%20weights.)\n",
    "\n",
    "[How to use the predict function in\n",
    "Stata](https://www.stata.com/manuals/rpredict.pdf)" ], "id": "f7bdb8da-60fb-47b4-910d-9e279f7dc2e9" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3 (ipykernel)", "language": "python", "path": "/usr/local/share/jupyter/kernels/python3" }, "language_info": { "name": "python", "codemirror_mode": { "name": "ipython", "version": "3" }, "file_extension": ".py", "mimetype": "text/x-python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } } }