08 - Combining Datasets
Prerequisites
- Change your directory so that Stata can find your files.
- Import data sets in .csv and .dta format.
Learning Outcomes
- Add new variables to an existing data set using
merge
. - Add new observations to already existing variables using
append
.
8.1 Introduction to merge
and append
Often when we are working with data sets it is necessary to merge or append existing data to other data sets. For example, imagine that we want to do one of the following:
- We want to run a regression that has the number of births as the main dependent variable and education level of the mother as an explanatory variable. We have two such micro data sets, one from Canada and one from the US, and we want to combine them into one data set that includes observations from both countries. To do our research, we need to take one data set (say, the Canadian data) and append to it the second data set (here, the US data). This final data set will have same number of variables as the initial data set(s), but the number of observations will be the number of observations in the Canadian data set plus the number of observations in the US data set.
- We want to run a regression that has the national fertility rate as the main dependent variable and GDP/capita as an explanatory variable. We have one macro data set that has three variables - country, year, and fertility rate - and a second macro data set also with three variables - country, year, and GDP/capita. To do our research, we need to merge these two data sets to create a final data set. That final data set will have the same number of observations as the initial data set(s), but now with four variables: country, year, fertility rate and GDP/capita.
In this notebook, we will learn how to undertake these two approaches to combining data sets: using merge
and append
.
We’ll continue working with our fake data. Recall that this data set is simulating information for workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
clear*
*cd " "
use fake_data, clear
8.2 Preparing the Data for Merging and Appending
Before introducing the command merge
, we need the follow the steps below in order to properly combine datasets.
8.2.1 Check the Data Set’s Unique Identifiers
The key to merging data sets is to understand which variable(s) uniquely identifies each observation. This (or these) variable(s) are called the unique identifiers.
Let’s look at our data.
%browse 10
Here we can see that each observation in the “fake_data” data set is identified by the variables workerid and year (worker-year pairs).
We can check to see if this is correct using the command duplicates report
.
duplicates report workerid year
What this table shows is that there are 138,138 workerid-year combinations (which is exactly equal to all of our observations). This means that every observation we have corresponds to a worker in a particular year.
Let’s compare this to a different data set we made in Module 7 using the collapse
command. Run the cell below to recreate that data set.
generate log_earnings = log(earnings)
collapse (mean) avg_log_earnings=log_earnings (count) total_employment = log_earnings, by(region year)
label var avg_log_earnings "Average Log-earnings in Region-Year Cell"
%browse 10
In this case, it seems that every observation corresponds to a region and year combination. Again, we can use duplicates report
to see if the variables region and year uniquely identify all observations.
duplicates report region year
The table shows that there is not a single case of repeated observations (the surplus is zero). Hence, we will refer to these variables as the “unique identifiers”.
8.2.2 Identify the “Master” and “Using” Data Sets
When merging data we need to decide which data set will be the primary data set (Stata refers to this data set as “master”) and which will be the secondary data set (Stata refers to this data set as “using”). Often, it will not matter which is the master and which is the using data set; however, we need to know which is which in order to properly interpret our results.
8.2.3 Identify the Matching Observations
There are three main ways to match observations. The first is when both observations share the same unique identifiers, so that one observation in the master data set is matched to one observation in the using data set (referred to as a 1:1
merge). The other two ways arise when we match multiple observations in the master data set to one observation in the using data set (referred to as a m:1
merge), or one observation in the master data set to multiple observations in the using data set (referred to as a 1:m
merge).
8.3 Merging Data Sets
Once we have the master and using data sets, we know our unique identifiers and what type of match we are doing, we are able to merge the data sets.
We begin by having the master data opened in our current Stata session. For the sake example, let’s suppose we want to set “fake_data” as the master data set, and use “region-year” as the using data set. We will recreate “region_year_data” below.
We already know that the “fake_data” data set’s unique identifiers are workerid and year, while the “region-year” data set’s unique identifiers are region and year. The variables we use to link both data sets have to be the unique identifiers that are present in both data sets. Because workerid does not exist in the region-level data set, we will use the variables region and year for our merge.
This means that for every region in the using data set, there will be many observations in the individual level (master) data set to be matched. Therefore, this will be a m:1
merge.
use fake_data, clear
generate log_earnings = log(earnings)
collapse (mean) avg_log_earnings=log_earnings (count) total_employment = log_earnings, by(region year)
label var avg_log_earnings "Average Log-earnings in Region-Year Cell"
save region_year_data, replace
use fake_data, clear // This sets this data set as the master
merge
commands are written as follows:
merge 1:1 unique_id1 unique_id2 ... using using_dataset
Let’s now merge the “region_year_data” to “fake_data” using an m:1
merge.
merge m:1 region year using region_year_data
Let’s analyze the table above. It says that there were 406,963 observations in the master data which couldn’t be matched to any observation in the using data set. This is due to the fact that our data set at the region-year level does not have information for some years.
Furthermore, the table shows that every observation from the using data set got matched to some observation in the master data set. The total number of matched observations is roughly 2.5 million. All of this information gets recorded in a new variable named _merge. Because of this, it is good practice to write capture drop _merge
before running a merge command.
Think about the following question: would we get the same results if we switched the master and using datasets?
use region_year_data, clear
merge 1:m region year using fake_data
Indeed, we get the same information. We typically want to restrict to observations that were correctly matched across data sets.
keep if _merge==3
8.4 Appending Data Sets
We have used merge
to combine data sets horizontally (we have added columns/variables to the master data set). However, if we want to combine datasets vertically (add observations to the master data set) we can use append
. Adding new information with append
is very simple compared to working with merge
. When we have a master data set opened in our session, we can add observations using the syntax:
append using new_dataset
This command will add new observations to our master data set, so long as the variables across both data sets have the exact same names.
8.5 Wrap Up
In this module, we learned how to combine different data sets. This is an extremely useful skill, especially when we are undertaking panel data regressions. Next, we will switch our focus to data visualization and look at how to graph in Stata.
8.6 Wrap-up Table
Command | Function |
---|---|
merge |
It pastes two data sets based on values of some variables. It creates a new data set with more variables, and potentially more observations. |
append |
It pastes two data sets based on the names of some variables. It creates a new data set with more observations, and potentially more variables. |
References
Merge files into a single dataset
Append files into a single dataset