{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 07 - Conducting Within Group Analysis\n",
        "\n",
        "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt  \n",
        "2024-05-29\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "1.  Be able to effectively use Stata do-files and generate log-files.\n",
        "2.  Be able to change your directory so that Stata can find your files.\n",
        "3.  Import data sets in .csv and .dta format.\n",
        "4.  Save data files.\n",
        "\n",
        "## Learning Outcomes\n",
        "\n",
        "1.  Create new variables using the command `egen`.\n",
        "2.  Know when to use the pre-command `by` and when to use `bysort`.\n",
        "3.  Use the command `collapse` to create a new data set of summary\n",
        "    statistics.\n",
        "4.  Change a panel data set to a cross-sectional data set using the\n",
        "    command `reshape`.\n",
        "\n",
        "## 7.1 Introduction to Working Within Groups\n",
        "\n",
        "There are times when we will want to analyze our data while considering\n",
        "observations as a part of a group. Consider some of the following\n",
        "examples:\n",
        "\n",
        "-   We would like to know the average wages of workers by educational\n",
        "    grouping, in each year of the data.\n",
        "-   We would like to know the standard deviation of men and women’s\n",
        "    earnings, by geographic region.\n",
        "-   We would like to know the top quintile of wealth, by birth cohort.\n",
        "\n",
        "In this module, we will go over how to calculate these statistics using\n",
        "the fake data set introduced in the previous lecture. Recall that this\n",
        "data set is simulating information of workers in the years 1982-2012 in\n",
        "a fake country where a training program was introduced in 2003 to boost\n",
        "their earnings.\n",
        "\n",
        "Let’s begin by loading that data set into Stata:\n",
        "\n",
        "``` {stata}\n",
        "clear *\n",
        "* cd \" \"\n",
        "use \"fake_data.dta\", clear\n",
        "```\n",
        "\n",
        "## 7.2 Generating Variables using `generate`\n",
        "\n",
        "When we are working on a particular project, it is important to know how\n",
        "to create variables that are computed for a group rather than an\n",
        "individual or an observation. For instance, we may have a data set that\n",
        "is divided by individual and by year. We might want the variables to\n",
        "show us the statistics of a particular individual throughout the years\n",
        "or the statistics of all individuals each year.\n",
        "\n",
        "Stata provides a function to easily compute such statistics. The key to\n",
        "this analysis is the pre-command `by`. A pre-command is simply a prefix\n",
        "that tells Stata how we want it to run the command. In the case of `by`,\n",
        "we tell Stata to run the command on the subsets of data. The only\n",
        "requirement to using this pre-command is to ensure that the data is\n",
        "sorted the correct way.\n",
        "\n",
        "Let’s take a look at our data by using the `browse` command we learned\n",
        "in [Module\n",
        "5](https://comet.arts.ubc.ca/docs/Research/econ490-stata/05_Opening_Datasets.html).\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "We can tell here that the data is sorted by the variable *workerid*.\n",
        "\n",
        "We use the pre-command `by` alongside the command `generate` to develop\n",
        "these group-compounded variables.\n",
        "\n",
        "If we use variables other than *workerid* (the variable by which the\n",
        "data is sorted) to group our new variable, we will not be able to\n",
        "generate the new variable. We can see this error when we run the command\n",
        "below.\n",
        "\n",
        "``` {stata}\n",
        "capture drop var_one //recall that capture drop tells Stata to ignore any errors if var_one does not exist, and to drop it if it does\n",
        "by year: generate var_one = 1 \n",
        "```\n",
        "\n",
        "If we want to group by year, Stata expects us to sort the data such that\n",
        "all observations corresponding to the same year are next to each other.\n",
        "We can use the `sort` pre-command as follows.\n",
        "\n",
        "``` {stata}\n",
        "sort year \n",
        "```\n",
        "\n",
        "Let’s take a look at our data now.\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "Let’s try the command above again, now with the sorted data.\n",
        "\n",
        "``` {stata}\n",
        "capture drop var_one \n",
        "by year: generate var_one = 1 \n",
        "```\n",
        "\n",
        "Now that the data is sorted by year, the code works!\n",
        "\n",
        "We could have also used the pre-command `bysort` instead of sorting the\n",
        "data with `sort` and then using `by`. Everything is done in one step!\n",
        "\n",
        "Let’s sort the data, so it is reverted back to the same ordering scheme\n",
        "as when we started (by *workerid*), and generate our new variable again.\n",
        "\n",
        "Stata also lets us sort by two variables. The following block of code\n",
        "tells Stata to first `sort` the data by *workerid*, and then within each\n",
        "*workerid*, to sort the data by *year*.\n",
        "\n",
        "``` {stata}\n",
        "sort workerid year \n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "capture drop var_one \n",
        "bysort year: generate var_one = 1 \n",
        "```\n",
        "\n",
        "The variable we have created is not interesting by any means. It simply\n",
        "takes the value of 1 everywhere. In fact, we haven’t done anything that\n",
        "we couldn’t have done with `gen var_one=1`. We can see this by using the\n",
        "`summarize` command.\n",
        "\n",
        "``` {stata}\n",
        "summarize var_one\n",
        "```\n",
        "\n",
        "You may not be aware, but Stata records the observation number as a\n",
        "hidden variable (a scalar) called \\*\\_n\\* and the total number of\n",
        "observations as \\*\\_N\\*.\n",
        "\n",
        "Let’s take a look at these by creating two newvariables: one that is the\n",
        "observation number and one that is the total number of observations.\n",
        "\n",
        "``` {stata}\n",
        "capture drop obs_number \n",
        "generate obs_number = _n \n",
        "\n",
        "capture drop tot_obs\n",
        "generate tot_obs = _N\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "As expected, the numbering of observations is sensitive to the way that\n",
        "the data is sorted! The cool thing is that whenever we use the\n",
        "pre-command `by`, the scalars `_n` and `_N` record the observation\n",
        "number and total number of observations for each group separately. Let’s\n",
        "check that below:\n",
        "\n",
        "``` {stata}\n",
        "capture drop obs_number \n",
        "bysort workerid: generate obs_number = _n \n",
        "\n",
        "capture drop tot_obs\n",
        "bysort workerid: generate tot_obs = _N\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "As we can see, some workers are observed only 2 times in the data (they\n",
        "were only surveyed in two years), whereas other workers are observed 8\n",
        "times (they were surveyed in 8 years). By knowing (and recording in a\n",
        "variable) the number of times a worker has been observed, we can do some\n",
        "analysis based on this information. For example, in some cases you might\n",
        "be interested in keeping only workers who are observed across all time\n",
        "periods. In this case, you could use the command:\n",
        "\n",
        "``` {stata}\n",
        "keep if tot_obs==8\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "## 7.3 Generating Variables Using Extended Generate\n",
        "\n",
        "The command `egen` is used whenever we want to create variables which\n",
        "require access to some functions (e.g. mean, standard deviation, min).\n",
        "The basic syntax works as follows:\n",
        "\n",
        "``` stata\n",
        " bysort groupvar: egen new_var = function() , options\n",
        "```\n",
        "\n",
        "Let’s see an example where we create a new variable called\n",
        "*avg_earnings*, which is the mean of earnings for every worker. We will\n",
        "need to reload our data since we dropped many observations above when we\n",
        "used the `keep` command.\n",
        "\n",
        "``` {stata}\n",
        "clear *\n",
        "use \"fake_data.dta\", clear\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "capture drop avg_earnings\n",
        "bysort workerid: egen avg_earnings = mean(earnings)\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "capture drop total_earnings\n",
        "bysort workerid: egen total_earnings = total(earnings)\n",
        "```\n",
        "\n",
        "By definition, these commands will create variables that use information\n",
        "across different observations. You can check the list of available\n",
        "functions by writing `help egen` in the Stata command window.\n",
        "\n",
        "In this documentation, we can see that there are some functions that do\n",
        "not allow for `by`. For example, suppose we want to create the total sum\n",
        "across different variables in the same row. We do this below by taking\n",
        "the sum of *start_year*, *region*, and *treated*.\n",
        "\n",
        "``` {stata}\n",
        "cap drop sum_of_vars\n",
        "egen sum_of_vars = rowtotal(start_year region treated)\n",
        "```\n",
        "\n",
        "The variable we are creating for the example has no particular meaning,\n",
        "but what we need to notice is that the function `rowtotal()` only sums\n",
        "the non-missing values in our variables. This means that if there is a\n",
        "missing value in any of the three variables, the sum only occurs between\n",
        "the two variables that do not have the missing value. We could also\n",
        "write this command as `gen sum_of_vars = start_year + region + treated`;\n",
        "however, if there is a missing value (`.`) in *start_year*, *region*, or\n",
        "*treated*, then the generated value for *sum_of_vars* will also be a\n",
        "missing value. The answer lies in the missing observations. If we sum\n",
        "any number with a missing value (`.`), then the sum will also be missing\n",
        "when using `generate`, but not when using `egen`.\n",
        "\n",
        "Just as with `sort`, we can also use `by` with multiple variables. Doing\n",
        "so tells Stata to run the command over all combinations of subgroups.\n",
        "Here will use *year* and *region* in one command. This tells Stata to\n",
        "generate a new variable for each *year*-*region* combination.\n",
        "\n",
        "``` {stata}\n",
        "capture drop regionyear_earnings\n",
        "bysort year region : egen regionyear_earnings = total(earnings)\n",
        "```\n",
        "\n",
        "What this command gives us is a new variable that records total earnings\n",
        "in each region for every year.\n",
        "\n",
        "## 7.4 Collapsing Data\n",
        "\n",
        "We can also compute statistics at some group level with the `collapse`\n",
        "command. `collapse` is extremely useful whenever we want to apply sample\n",
        "weights to our data (we will learn more about this in [Module\n",
        "11](https://comet.arts.ubc.ca/docs/Research/econ490-stata/11_Linear_Reg.html)).\n",
        "Sample weights cannot be applied using `egen` but are often extremely\n",
        "important when using micro data. These weights allow us to manipulate\n",
        "our data to better reflect the true composition of the data when the\n",
        "authorities that collected the data might have over-sampled some\n",
        "segments of the population.\n",
        "\n",
        "The syntax is:\n",
        "\n",
        "``` stata\n",
        "collapse (statistic1) new_name = existing_variable (statistic2) new_name2 = existing_variable2 ... [pweight = weight_variable], by(group) \n",
        "```\n",
        "\n",
        "We can find a full list of possible statistics that `collapse` can take\n",
        "by running the command `help collapse`. We can also learn more about\n",
        "using weights by typing `help weight`.\n",
        "\n",
        "Let’s suppose we want to create a data set at the region-year level\n",
        "using information in the current data set, but we want to use the sample\n",
        "weights that were provided with our data (*sample_weight*). First, we\n",
        "decide which statistics we want to keep from the original data set. For\n",
        "the sake of explanation, let’s suppose we want to keep the average\n",
        "earnings, the variance of earnings, and the total employment. We will\n",
        "have three new variables: *avg_earnings*, *sd_earnings*, *tot_emp*. We\n",
        "write the following:\n",
        "\n",
        "``` {stata}\n",
        "collapse (mean) avg_earnings = earnings (sd) sd_earnings = earnings (count) tot_emp = earnings [pweight = sample_weight], by(region year)\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "**Warning:** When we use `collapse`, Stata will produce a new data set\n",
        "with the results, and in the process it drops the data set that was\n",
        "loaded at the time the command was run. If we need to keep the original\n",
        "data, be certain to save the file before running this command.\n",
        "\n",
        "## 7.5 Reshaping\n",
        "\n",
        "We have collapsed our data and so we need to import the data again to\n",
        "gain access to the full data set.\n",
        "\n",
        "``` {stata}\n",
        "clear *\n",
        "\n",
        "use \"fake_data.dta\", clear\n",
        "```\n",
        "\n",
        "Notice that the nature of this particular data set is panel form;\n",
        "individuals have been followed over many years. Sometimes we are\n",
        "interested in working with a cross section (i.e. we have 1 observation\n",
        "per worker which includes all of the years). Is there a simple way to go\n",
        "back and forth between these two? Yes!\n",
        "\n",
        "The command’s name is `reshape` and has two main forms: `wide` and\n",
        "`long`. `wide` data is cross-sectional in nature, whereas `long` is the\n",
        "usual panel.\n",
        "\n",
        "Suppose we want to record the earnings of workers while keeping the\n",
        "information across years. This entails transforming our panel data into\n",
        "a cross sectional data set. We want one observation per worker, where\n",
        "each observation has all of the years. This is a `wide` transformation.\n",
        "\n",
        "``` {stata}\n",
        "reshape wide earnings region age start_year quarter_birth sample_weight, i(workerid) j(year)\n",
        "```\n",
        "\n",
        "**Warning:** This command acts on all of the variables in our data set.\n",
        "If we don’t include them in the list, Stata will assume that they do not\n",
        "vary across *i* (in this case *workerid*). If we don’t check this\n",
        "beforehand, we may get an error message!\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "There are so many missing values in the data! Should we worry? Not at\n",
        "all. As a matter of fact, we learned at the beginning of this module\n",
        "that many workers are not observed across all years. That’s what these\n",
        "missing values are representing.\n",
        "\n",
        "Notice that the variable *year* which was part of the command line (the\n",
        "`j(year)` part) has disappeared. We now have one observation per worker,\n",
        "with their information recorded across years in a cross-sectional way.\n",
        "\n",
        "How do we go from a `wide` data set to a regular panel form? We will use\n",
        "the `reshape long` command. Note that to do this, we need to specify the\n",
        "prefix variables. These are formally known as `stubs` in Stata. They are\n",
        "the variables that all share the same prefix (in this case, *year*),\n",
        "that will be transformed into one variable. When we write `j(year)`,\n",
        "Stata will create a new variable called *year*.\n",
        "\n",
        "``` {stata}\n",
        "reshape long earnings region age  start_year sample_weight, i(workerid) j(year) \n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "Notice that we now have an observation for every worker in every year,\n",
        "although we know some workers are only observed in a subset of these.\n",
        "This is known as a **balanced panel**.\n",
        "\n",
        "To retrieve the original data set, we get rid of such observations with\n",
        "missing values.\n",
        "\n",
        "``` {stata}\n",
        "keep if !missing(earnings)\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "## 7.6 Errors\n",
        "\n",
        "### 7.6.1. Sort\n",
        "\n",
        "To develop group-compounded variables, we first need to ensure that we\n",
        "sort the observations by the variable. Not sorting the obserations will\n",
        "return an error code.\n",
        "\n",
        "``` {stata}\n",
        "capture drop var\n",
        "by sex: generate var = _n\n",
        "```\n",
        "\n",
        "The correct method of of generating compounded variables is below:\n",
        "\n",
        "``` {stata}\n",
        "capture drop var\n",
        "bysort sex: generate var = _n\n",
        "```\n",
        "\n",
        "Take a look at it below:\n",
        "\n",
        "``` {stata}\n",
        "summarize var\n",
        "```\n",
        "\n",
        "### 7.6.2. Reshape Error\n",
        "\n",
        "Reshaping data can be tricky and doing so incorrectly can cause many\n",
        "variables to be dropped in the process. The command `reshape error` can\n",
        "be used to identify the issues encountered when reshaping data.\n",
        "\n",
        "``` {stata}\n",
        "clear *\n",
        "use \"fake_data.dta\", clear\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "reshape wide earnings sex, i(year) j(workerid)\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "reshape error\n",
        "```\n",
        "\n",
        "Can you tell what the error is here?\n",
        "\n",
        "## 7.7 Wrap Up\n",
        "\n",
        "In this module, we have covered some very useful skills that will be\n",
        "useful for exploring data sets. Namely, these skills will help us both\n",
        "prepare data for empirical analysis (i.e. turning cross sectional data\n",
        "into panel data) and create summary statistics that illustrate our\n",
        "results. In the next module, we will look at how to work with multiple\n",
        "data sets simultaneously and merge them into one.\n",
        "\n",
        "## 7.8 Wrap-up Table\n",
        "\n",
        "| Command | Function |\n",
        "|----------------------------------|--------------------------------------|\n",
        "| `by` | It is a pre-command used to Repeat Stata command on subsets of the data |\n",
        "| `generate` | It generates variables |\n",
        "| `sort` | It sorts data |\n",
        "| `summary` | It summarizes statistics of a data set |\n",
        "| `_n` | It records the observation number |\n",
        "| `_N` | It records the total number of observations for each group separately |\n",
        "| `drop` | It drops variables or observations |\n",
        "| `keep` | It keeps variables or observations that satisfy a specified condition |\n",
        "| `egenerate` | It create variables that require access to some functions |\n",
        "| `rowtotal()` | It sums non-missing values for each observation of a list of variables |\n",
        "| `collapse` | It makes a data set of a summary of statistics |\n",
        "| `reshape` | It converts data from wide to long and vice versa |\n",
        "\n",
        "## References\n",
        "\n",
        "[Reshape data from wide format to long\n",
        "format](https://www.youtube.com/watch?v=Bx9kVdkr9oY) <br> [(Non\n",
        "StataCorp) How to group data in STATA with SORT and\n",
        "BY](https://www.youtube.com/watch?v=nEOyH0AFKHc) [Syntax for\n",
        "pre-commands](https://www.stata.com/manuals/u11.pdf)"
      ],
      "id": "aa3ba017-c4f7-45a9-8832-05b64202a295"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "path": "/usr/local/share/jupyter/kernels/python3"
    },
    "language_info": {
      "name": "python",
      "codemirror_mode": {
        "name": "ipython",
        "version": "3"
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.3"
    }
  }
}