ECON 490: Combining Data Sets (8)

Prerequisites

Change your directory so that Stata can find your files.
Import datasets in csv and dta format.

Learning Outcomes

Add new variables to an existing data set using merge.
Add new observations to already existing variables append.

import stata_setup
stata_setup.config('C:\Program Files\Stata18/','se')

>>> import sys
>>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module - 1, Section 1.5.1: Setting Up PyStata
>>> from pystata import config
>>> config.init('se')

[0.005s][warning][os,thread] Attempt to protect stack guard pages failed (0x0000000169c3c000-0x0000000169c48000).
[0.006s][warning][os,thread] Attempt to deallocate stack guard pages failed.

  ___  ____  ____  ____  ____ ©
 /__    /   ____/   /   ____/      17.0
___/   /   /___/   /   /___/       SE—Standard Edition

 Statistics and Data Science       Copyright 1985-2021 StataCorp LLC
                                   StataCorp
                                   4905 Lakeway Drive
                                   College Station, Texas 77845 USA
                                   800-STATA-PC        https://www.stata.com
                                   979-696-4600        stata@stata.com

Stata license: Unlimited-user network, expiring 19 Aug 2024
Serial number: 401809301518
  Licensed to: Sarthak Kwatra
               University of British Columbia

Notes:
      1. Unicode is supported; see help unicode_advice.
      2. Maximum number of variables is set to 5,000; see help set_maxvar.

8.1 Introduction to Merge and Append

Often when we are working with data sets it is necessary to merge or append existing data with other data sets. For example, imagine that we want to do one of the following:

We want to run a regression that has the number of births as the main dependent variable and education level of the mother as an explanatory variable. We have two such micro data sets, one from Canada and one from the US, and we want to combine them into one data set that includes observations from both countries. To do our research, we need to take one data set (say, the Canadian data) and append to it the second data set (here, the US data). This final data set will have same number of variables as the initial data set(s), but the number of observations will be the number of observations in the Canadian data set plus the number of observations in the US data set.
We want to run a regression that has the national fertility rate as the main dependent variable and GDP/capita as an explanatory variable. We have one macro data set that has three variables - country, year, and fertility rate - and a second macro data set also with three variables - country, year, and GDP/capita. To do our research, we need to merge these two data sets to create a final data set. That final data set will have the same number of observations as the initial data set(s), but now with four variables: country, year, fertility rate and GDP/capita.

In this notebook, we will learn how to undertake these two approaches to combining data sets: using merge and append.

We’ll continue working with our fake data. Recall that this data set is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

%%stata

clear*

use fake_data, clear

8.2 Getting Ready to Merge and Append

Before introducing the command merge, we need the follow the steps below in order to properly combine datasets.

8.2.1 Check the data set’s unique identifiers

The key to merging data sets is to understand which variable(s) uniquely identifies each observation.

Let’s look at our data.

%%stata

%browse 10

Here we can see that each observation in the “fake_data” data set is identified by the variables workerid and year (worker-year pairs).

We can check to see if this is correct using the command duplicates report.

%%stata

duplicates report workerid year

What this table shows is that there are 2,861,772 workerid-year combinations (which is exactly equal to all of our observations). This means that every observation we have corresponds to a worker in a particular year.

Let’s take a look at a different data set now also stored in this folder.

%%stata

gen log_earnings = log(earnings)
collapse (mean) avg_log_earnings=log_earnings (count) total_employment = log_earnings, by(region year)
la var avg_log_earnings "Average Log-earnings in Region-Year Cell"

%%stata

%browse 10

In this case, it seems that every observation corresponds to a region and year combination. Again, we can use duplicates report to see if the variables region and year uniquely identify all observations.

%%stata

duplicates report region year

The table shows that there is not a single case of repeated observations. Hence, we will refer to these variables as the “unique identifiers”.

8.2.2 Identify the “master” and “using” data sets

When merging data we need to decide which data set will be the primary data set (Stata refers to this data set as “master”) and which will be the secondary data set (Stata refers to this data set as “using”). Often, it will not matter which is the master and which is the using data set; however, we need to know which is which in order to properly interpret our results.

8.2.3 Identify the matching observations

There are three main ways to match observations. The first is when both observations share the same unique identifiers, so that one observation in the master data set is matched to one observation in the using data set (referred to as a 1:1 merge). The other two ways arise when we match multiple observations in the master data set to one observation in the using data set (referred to as a m:1 merge), or one observation in the master data set to multiple observations in the using data set (referred to as a 1:m merge).

8.3 Merging Data Sets

Once we have the master and using data sets, we know our unique identifiers and what type of match we are doing, we are able to merge the data sets.

We begin by having the master data opened in our current Stata session. For the sake example, let’s suppose we want to set “fake_data” as the master data set, and use “region-year” as the using data set.

We already know that the “fake_data” data set’s unique identifiers are workerid and year, while the “region-year” data set’s unique identifiers are region and year. The variables we use to link both data sets have to be the unique identifiers that are present in both data sets. Because workerid does not exist in the region-level data set, we will use the variables region and year for our merge.

This means that for every region in the using data set, there will be many observations in the individual level (master) data set to be matched. Therefore, this will be a m:1 merge.

%%stata

use fake_data, clear 
gen log_earnings = log(earnings)
collapse (mean) avg_log_earnings=log_earnings (count) total_employment = log_earnings, by(region year)
la var avg_log_earnings "Average Log-earnings in Region-Year Cell"
save region_year_data, replace

%%stata

use fake_data, clear  // This sets this data set as the master

%%stata

merge m:1 region year using region_year_data

Let’s analyze the table above. It says that there were 406,963 observations in the master data which couldn’t be matched to any observation in the using data set. This is due to the fact that our data set at the region-year level does not have information for some years.

Furthermore, the table shows that every observation from the using data set got matched to some observation in the master data set. The total number of matched observations is roughly 2.5 million. All of this information gets recorded in a new variable named *_merge*. Because of this, it is good practice to write cap drop _merge before running a merge command.

Think about the following question: would we get the same results if we switched the master and using datasets?

%%stata

use region_year_data, clear
merge 1:m region year using fake_data

Indeed, we get the same information. We typically want to restrict to observations that were correctly matched across datasets.

%%stata

keep if _merge==3

Warning: Before dropping the unmerged observations, make sure you spend some time thinking about why they did not merge and correct any errors that you identify. For example, maybe your identifier is country and country names are different in the two data sets (i.e. one data set has “Barbados” and another data set has “The Barbados”). If this is the case, you will want to change one of these names and attempt your match a second time by merging again.

8.4 Appending Data Sets

We have used merge to combine data sets horizontally (we have added columns/variables to the master data set). However, if we want to combine datasets vertically (add observations to the master data set) we can use append. Adding new information with append is very simple compared to working with merge. When we have a master data set opened in our session, we can add observations using the syntax:

    append using new_dataset

This command will add new observations to our master data set, so long as the variables across both data sets have the exact same name.

8.5 Wrap Up

In this module we learned how to combine different data sets. This is an extremely useful skill, especially when we are undertaking panel data regressions. Next, we will switch our focus to data visualization and look at how to graph in Stata.

The following table summarizes the main commands we have seen in this module.

Command	Function
`merge`	It pastes two data sets based on values of some variables. It creates a new data set with more variables, and potentially more observations.
`append`	It pastes two data sets based on the names of some variables. It creates a new data set with more observations, and potentially more variables.

8.6 Video tutorial

Click on the image below for a video tutorial on this module.

References

Merge files into a single dataset
Append files into a single dataset