08 - Combining Datasets

econ 490

pystata

merge

append

m:m

m:1

1:m

This notebook explains how to append and merge data sets.

Author

Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt

Published

29 May 2024

Prerequisites

Change your directory so that Stata can find your files.
Import data sets in .csv and .dta format.

Learning Outcomes

Add new variables to an existing data set using merge.
Add new observations to already existing variables using append.

8.0 Intro

import stata_setup
stata_setup.config('C:\Program Files\Stata18/','se')

>>> import sys
>>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module 01, Section 1.3: Setting Up the STATA Path
>>> from pystata import config
>>> config.init('se')

8.1 Introduction to `merge` and `append`

Often when we are working with data sets it is necessary to merge or append existing data to other data sets. For example, imagine that we want to do one of the following:

We want to run a regression that has the number of births as the main dependent variable and education level of the mother as an explanatory variable. We have two such micro data sets, one from Canada and one from the US, and we want to combine them into one data set that includes observations from both countries. To do our research, we need to take one data set (say, the Canadian data) and append to it the second data set (here, the US data). This final data set will have same number of variables as the initial data set(s), but the number of observations will be the number of observations in the Canadian data set plus the number of observations in the US data set.
We want to run a regression that has the national fertility rate as the main dependent variable and GDP/capita as an explanatory variable. We have one macro data set that has three variables - country, year, and fertility rate - and a second macro data set also with three variables - country, year, and GDP/capita. To do our research, we need to merge these two data sets to create a final data set. That final data set will have the same number of observations as the initial data set(s), but now with four variables: country, year, fertility rate and GDP/capita.

In this notebook, we will learn how to undertake these two approaches to combining data sets: using merge and append.

We’ll continue working with our fake data. Recall that this data set is simulating information for workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

%%stata

clear*
*cd " "
use fake_data, clear

8.2 Preparing the Data for Merging and Appending

Before introducing the command merge, we need the follow the steps below in order to properly combine datasets.

8.2.1 Check the Data Set’s Unique Identifiers

The key to merging data sets is to understand which variable(s) uniquely identifies each observation. This (or these) variable(s) are called the unique identifiers.

Let’s look at our data.

%%stata

%browse 10

Here we can see that each observation in the “fake_data” data set is identified by the variables workerid and year (worker-year pairs).

We can check to see if this is correct using the command duplicates report.

%%stata

duplicates report workerid year

What this table shows is that there are 138,138 workerid-year combinations (which is exactly equal to all of our observations). This means that every observation we have corresponds to a worker in a particular year.

Let’s compare this to a different data set we made in Module 7 using the collapse command. Run the cell below to recreate that data set.

%%stata

generate log_earnings = log(earnings)
collapse (mean) avg_log_earnings=log_earnings (count) total_employment = log_earnings, by(region year)
label var avg_log_earnings "Average Log-earnings in Region-Year Cell"

%%stata

%browse 10

In this case, it seems that every observation corresponds to a region and year combination. Again, we can use duplicates report to see if the variables region and year uniquely identify all observations.

%%stata

duplicates report region year

The table shows that there is not a single case of repeated observations. Hence, we will refer to these variables as the “unique identifiers”.

8.2.2 Identify the “Master” and “Using” Data Sets

When merging data we need to decide which data set will be the primary data set (Stata refers to this data set as “master”) and which will be the secondary data set (Stata refers to this data set as “using”). Often, it will not matter which is the master and which is the using data set; however, we need to know which is which in order to properly interpret our results.

8.2.3 Identify the Matching Observations

There are three main ways to match observations. The first is when both observations share the same unique identifiers, so that one observation in the master data set is matched to one observation in the using data set (referred to as a 1:1 merge). The other two ways arise when we match multiple observations in the master data set to one observation in the using data set (referred to as a m:1 merge), or one observation in the master data set to multiple observations in the using data set (referred to as a 1:m merge).

8.3 Merging Data Sets

Once we have the master and using data sets, we know our unique identifiers and what type of match we are doing, we are able to merge the data sets.

We begin by having the master data opened in our current Stata session. For the sake example, let’s suppose we want to set “fake_data” as the master data set, and use “region-year” as the using data set. We will recreate “region-year” below.

We already know that the “fake_data” data set’s unique identifiers are workerid and year, while the “region-year” data set’s unique identifiers are region and year. The variables we use to link both data sets have to be the unique identifiers that are present in both data sets. Because workerid does not exist in the region-level data set, we will use the variables region and year for our merge.

This means that for every region in the using data set, there will be many observations in the individual level (master) data set to be matched. Therefore, this will be a m:1 merge.

%%stata

use fake_data, clear 
generate log_earnings = log(earnings)
collapse (mean) avg_log_earnings=log_earnings (count) total_employment = log_earnings, by(region year)
label var avg_log_earnings "Average Log-earnings in Region-Year Cell"
save region_year_data, replace

%%stata

use fake_data, clear  // This sets this data set as the master

merge commands are written as follows:

merge 1:1 unique_id1 unique_id2 ... using using_dataset

Let’s now merge the “region_year_data” to “fake_data” using an m:1 merge.

%%stata

merge m:1 region year using region_year_data

Let’s analyze the table above. It says that there were 406,963 observations in the master data which couldn’t be matched to any observation in the using data set. This is due to the fact that our data set at the region-year level does not have information for some years.

Furthermore, the table shows that every observation from the using data set got matched to some observation in the master data set. The total number of matched observations is roughly 2.5 million. All of this information gets recorded in a new variable named _merge. Because of this, it is good practice to write capture drop _merge before running a merge command.

Think about the following question: would we get the same results if we switched the master and using datasets?

%%stata

use region_year_data, clear
merge 1:m region year using fake_data

Indeed, we get the same information. We typically want to restrict to observations that were correctly matched across data sets.

%%stata

keep if _merge==3

Warning: Before dropping the unmerged observations, it is important to think about why they did not merge and correct any errors that we identify. For example, maybe our identifier is country, and country names are different in the two data sets (i.e. one data set has “Barbados” and another data set has “The Barbados”). If this is the case, we will want to change one of these names and attempt to match a second time by merging again.

8.4 Appending Data Sets

We have used merge to combine data sets horizontally (we have added columns/variables to the master data set). However, if we want to combine datasets vertically (add observations to the master data set) we can use append. Adding new information with append is very simple compared to working with merge. When we have a master data set opened in our session, we can add observations using the syntax:

    append using new_dataset

This command will add new observations to our master data set, so long as the variables across both data sets have the exact same names.

In this module, we learned how to combine different data sets. This is an extremely useful skill, especially when we are undertaking panel data regressions. Next, we will switch our focus to data visualization and look at how to graph in Stata.

8.6 Wrap-up Table

Command	Function
`merge`	It pastes two data sets based on values of some variables. It creates a new data set with more variables, and potentially more observations.
`append`	It pastes two data sets based on the names of some variables. It creates a new data set with more observations, and potentially more variables.

References

Merge files into a single dataset
Append files into a single dataset