{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 08 - Combining Datasets\n",
        "\n",
        "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt  \n",
        "2024-05-29\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "1.  Change your directory so that Stata can find your files.\n",
        "2.  Import data sets in .csv and .dta format.\n",
        "\n",
        "## Learning Outcomes\n",
        "\n",
        "1.  Add new variables to an existing data set using `merge`.\n",
        "2.  Add new observations to already existing variables using `append`.\n",
        "\n",
        "## 8.1 Introduction to `merge` and `append`\n",
        "\n",
        "Often when we are working with data sets it is necessary to merge or\n",
        "append existing data to other data sets. For example, imagine that we\n",
        "want to do one of the following:\n",
        "\n",
        "-   We want to run a regression that has the number of births as the\n",
        "    main dependent variable and education level of the mother as an\n",
        "    explanatory variable. We have two such micro data sets, one from\n",
        "    Canada and one from the US, and we want to combine them into one\n",
        "    data set that includes observations from both countries. To do our\n",
        "    research, we need to take one data set (say, the Canadian data) and\n",
        "    append to it the second data set (here, the US data). This final\n",
        "    data set will have same number of variables as the initial data\n",
        "    set(s), but the number of observations will be the number of\n",
        "    observations in the Canadian data set plus the number of\n",
        "    observations in the US data set.\n",
        "-   We want to run a regression that has the national fertility rate as\n",
        "    the main dependent variable and GDP/capita as an explanatory\n",
        "    variable. We have one macro data set that has three variables -\n",
        "    country, year, and fertility rate - and a second macro data set also\n",
        "    with three variables - country, year, and GDP/capita. To do our\n",
        "    research, we need to merge these two data sets to create a final\n",
        "    data set. That final data set will have the same number of\n",
        "    observations as the initial data set(s), but now with four\n",
        "    variables: country, year, fertility rate and GDP/capita.\n",
        "\n",
        "In this notebook, we will learn how to undertake these two approaches to\n",
        "combining data sets: using `merge` and `append`.\n",
        "\n",
        "We’ll continue working with our fake data. Recall that this data set is\n",
        "simulating information for workers in the years 1982-2012 in a fake\n",
        "country where a training program was introduced in 2003 to boost their\n",
        "earnings.\n",
        "\n",
        "``` {stata}\n",
        "clear*\n",
        "*cd \" \"\n",
        "use fake_data, clear\n",
        "```\n",
        "\n",
        "## 8.2 Preparing the Data for Merging and Appending\n",
        "\n",
        "Before introducing the command `merge`, we need the follow the steps\n",
        "below in order to properly combine datasets.\n",
        "\n",
        "### 8.2.1 Check the Data Set’s Unique Identifiers\n",
        "\n",
        "The key to merging data sets is to understand which variable(s) uniquely\n",
        "identifies each observation. This (or these) variable(s) are called the\n",
        "unique identifiers.\n",
        "\n",
        "Let’s look at our data.\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "Here we can see that each observation in the “fake_data” data set is\n",
        "identified by the variables *workerid* and *year* (worker-year pairs).\n",
        "\n",
        "We can check to see if this is correct using the command\n",
        "`duplicates report`.\n",
        "\n",
        "``` {stata}\n",
        "duplicates report workerid year\n",
        "```\n",
        "\n",
        "What this table shows is that there are 138,138 *workerid-year*\n",
        "combinations (which is exactly equal to all of our observations). This\n",
        "means that every observation we have corresponds to a worker in a\n",
        "particular year.\n",
        "\n",
        "Let’s compare this to a different data set we made in [Module\n",
        "7](https://comet.arts.ubc.ca/docs/Research/econ490-stata/07_Within_Group.html)\n",
        "using the `collapse` command. Run the cell below to recreate that data\n",
        "set.\n",
        "\n",
        "``` {stata}\n",
        "generate log_earnings = log(earnings)\n",
        "collapse (mean) avg_log_earnings=log_earnings (count) total_employment = log_earnings, by(region year)\n",
        "label var avg_log_earnings \"Average Log-earnings in Region-Year Cell\"\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "%browse 10\n",
        "```\n",
        "\n",
        "In this case, it seems that every observation corresponds to a region\n",
        "and year combination. Again, we can use `duplicates report` to see if\n",
        "the variables *region* and *year* uniquely identify all observations.\n",
        "\n",
        "``` {stata}\n",
        "duplicates report region year\n",
        "```\n",
        "\n",
        "The table shows that there is not a single case of repeated observations\n",
        "(the surplus is zero). Hence, we will refer to these variables as the\n",
        "“unique identifiers”.\n",
        "\n",
        "### 8.2.2 Identify the “Master” and “Using” Data Sets\n",
        "\n",
        "When merging data we need to decide which data set will be the primary\n",
        "data set (Stata refers to this data set as “master”) and which will be\n",
        "the secondary data set (Stata refers to this data set as “using”).\n",
        "Often, it will not matter which is the master and which is the using\n",
        "data set; however, we need to know which is which in order to properly\n",
        "interpret our results.\n",
        "\n",
        "### 8.2.3 Identify the Matching Observations\n",
        "\n",
        "There are three main ways to match observations. The first is when both\n",
        "observations share the same unique identifiers, so that one observation\n",
        "in the master data set is matched to one observation in the using data\n",
        "set (referred to as a `1:1` merge). The other two ways arise when we\n",
        "match multiple observations in the master data set to one observation in\n",
        "the using data set (referred to as a `m:1` merge), or one observation in\n",
        "the master data set to multiple observations in the using data set\n",
        "(referred to as a `1:m` merge).\n",
        "\n",
        "## 8.3 Merging Data Sets\n",
        "\n",
        "Once we have the master and using data sets, we know our unique\n",
        "identifiers and what type of match we are doing, we are able to merge\n",
        "the data sets.\n",
        "\n",
        "We begin by having the master data opened in our current Stata session.\n",
        "For the sake example, let’s suppose we want to set “fake_data” as the\n",
        "master data set, and use “region-year” as the using data set. We will\n",
        "recreate “region_year_data” below.\n",
        "\n",
        "We already know that the “fake_data” data set’s unique identifiers are\n",
        "*workerid* and *year*, while the “region-year” data set’s unique\n",
        "identifiers are *region* and *year*. The variables we use to link both\n",
        "data sets have to be the unique identifiers that are present in both\n",
        "data sets. Because *workerid* does not exist in the region-level data\n",
        "set, we will use the variables *region* and *year* for our merge.\n",
        "\n",
        "This means that for every region in the using data set, there will be\n",
        "many observations in the individual level (master) data set to be\n",
        "matched. Therefore, this will be a `m:1` merge.\n",
        "\n",
        "``` {stata}\n",
        "use fake_data, clear \n",
        "generate log_earnings = log(earnings)\n",
        "collapse (mean) avg_log_earnings=log_earnings (count) total_employment = log_earnings, by(region year)\n",
        "label var avg_log_earnings \"Average Log-earnings in Region-Year Cell\"\n",
        "save region_year_data, replace\n",
        "```\n",
        "\n",
        "``` {stata}\n",
        "use fake_data, clear  // This sets this data set as the master\n",
        "```\n",
        "\n",
        "`merge` commands are written as follows:\n",
        "\n",
        "``` stata\n",
        "merge 1:1 unique_id1 unique_id2 ... using using_dataset\n",
        "```\n",
        "\n",
        "Let’s now merge the “region_year_data” to “fake_data” using an `m:1`\n",
        "merge.\n",
        "\n",
        "``` {stata}\n",
        "merge m:1 region year using region_year_data \n",
        "```\n",
        "\n",
        "Let’s analyze the table above. It says that there were 406,963\n",
        "observations in the master data which couldn’t be matched to any\n",
        "observation in the using data set. This is due to the fact that our data\n",
        "set at the region-year level does not have information for some years.\n",
        "\n",
        "Furthermore, the table shows that every observation from the using data\n",
        "set got matched to some observation in the master data set. The total\n",
        "number of matched observations is roughly 2.5 million. All of this\n",
        "information gets recorded in a new variable named <i>\\_merge</i>.\n",
        "Because of this, it is good practice to write `capture drop _merge`\n",
        "before running a merge command.\n",
        "\n",
        "Think about the following question: would we get the same results if we\n",
        "switched the master and using datasets?\n",
        "\n",
        "``` {stata}\n",
        "use region_year_data, clear\n",
        "merge 1:m region year using fake_data\n",
        "```\n",
        "\n",
        "Indeed, we get the same information. We typically want to restrict to\n",
        "observations that were correctly matched across data sets.\n",
        "\n",
        "``` {stata}\n",
        "keep if _merge==3\n",
        "```\n",
        "\n",
        "**Warning:** Before dropping the unmerged observations, it is important\n",
        "to think about why they did not merge and correct any errors that we\n",
        "identify. For example, maybe our identifier is country, and country\n",
        "names are different in the two data sets (i.e. one data set has\n",
        "“Barbados” and another data set has “The Barbados”). If this is the\n",
        "case, we will want to change one of these names and attempt to match a\n",
        "second time by merging again.\n",
        "\n",
        "## 8.4 Appending Data Sets\n",
        "\n",
        "We have used `merge` to combine data sets horizontally (we have added\n",
        "columns/variables to the master data set). However, if we want to\n",
        "combine datasets vertically (add observations to the master data set) we\n",
        "can use `append`. Adding new information with `append` is very simple\n",
        "compared to working with `merge`. When we have a master data set opened\n",
        "in our session, we can add observations using the syntax:\n",
        "\n",
        "``` stata\n",
        "    append using new_dataset\n",
        "```\n",
        "\n",
        "This command will add new observations to our master data set, so long\n",
        "as the variables across both data sets have the exact same names.\n",
        "\n",
        "## 8.5 Wrap Up\n",
        "\n",
        "In this module, we learned how to combine different data sets. This is\n",
        "an extremely useful skill, especially when we are undertaking panel data\n",
        "regressions. Next, we will switch our focus to data visualization and\n",
        "look at how to graph in Stata.\n",
        "\n",
        "## 8.6 Wrap-up Table\n",
        "\n",
        "| Command | Function |\n",
        "|--------------------------------|----------------------------------------|\n",
        "| `merge` | It pastes two data sets based on values of some variables. It creates a new data set with **more variables**, and potentially more observations. |\n",
        "| `append` | It pastes two data sets based on the names of some variables. It creates a new data set with **more observations**, and potentially more variables. |\n",
        "\n",
        "## References\n",
        "\n",
        "[Merge files into a single\n",
        "dataset](https://www.youtube.com/watch?v=niGZBRyyDuY) <br> [Append files\n",
        "into a single dataset](https://www.youtube.com/watch?v=AZGW8tohiqw)"
      ],
      "id": "eb2c18cd-43ef-4636-91ea-164ca309fde5"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "path": "/usr/local/share/jupyter/kernels/python3"
    },
    "language_info": {
      "name": "python",
      "codemirror_mode": {
        "name": "ipython",
        "version": "3"
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.3"
    }
  }
}