import stata_setup
'C:\Program Files\Stata18/','se') stata_setup.config(
ECON 490: Combining Data Sets (8)
Prerequisites
- Change your directory so that Stata can find your files.
- Import datasets in csv and dta format.
Learning Outcomes
- Add new variables to an existing data set using
merge
. - Add new observations to already existing variables
append
.
>>> import sys
>>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module - 1, Section 1.5.1: Setting Up PyStata
>>> from pystata import config
>>> config.init('se')
[0.005s][warning][os,thread] Attempt to protect stack guard pages failed (0x0000000169c3c000-0x0000000169c48000).
[0.006s][warning][os,thread] Attempt to deallocate stack guard pages failed.
___ ____ ____ ____ ____ ©
/__ / ____/ / ____/ 17.0
___/ / /___/ / /___/ SE—Standard Edition
Statistics and Data Science Copyright 1985-2021 StataCorp LLC
StataCorp
4905 Lakeway Drive
College Station, Texas 77845 USA
800-STATA-PC https://www.stata.com
979-696-4600 stata@stata.com
Stata license: Unlimited-user network, expiring 19 Aug 2024
Serial number: 401809301518
Licensed to: Sarthak Kwatra
University of British Columbia
Notes:
1. Unicode is supported; see help unicode_advice.
2. Maximum number of variables is set to 5,000; see help set_maxvar.
8.1 Introduction to Merge and Append
Often when we are working with data sets it is necessary to merge or append existing data with other data sets. For example, imagine that we want to do one of the following:
We want to run a regression that has the number of births as the main dependent variable and education level of the mother as an explanatory variable. We have two such micro data sets, one from Canada and one from the US, and we want to combine them into one data set that includes observations from both countries. To do our research, we need to take one data set (say, the Canadian data) and append to it the second data set (here, the US data). This final data set will have same number of variables as the initial data set(s), but the number of observations will be the number of observations in the Canadian data set plus the number of observations in the US data set.
We want to run a regression that has the national fertility rate as the main dependent variable and GDP/capita as an explanatory variable. We have one macro data set that has three variables - country, year, and fertility rate - and a second macro data set also with three variables - country, year, and GDP/capita. To do our research, we need to merge these two data sets to create a final data set. That final data set will have the same number of observations as the initial data set(s), but now with four variables: country, year, fertility rate and GDP/capita.
In this notebook, we will learn how to undertake these two approaches to combining data sets: using merge and append.
We’ll continue working with our fake data. Recall that this data set is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
%%stata
*
clear
use fake_data, clear
8.2 Getting Ready to Merge and Append
Before introducing the command merge
, we need the follow the steps below in order to properly combine datasets.
8.2.1 Check the data set’s unique identifiers
The key to merging data sets is to understand which variable(s) uniquely identifies each observation.
Let’s look at our data.
%%stata
%browse 10
Here we can see that each observation in the “fake_data” data set is identified by the variables workerid and year (worker-year pairs).
We can check to see if this is correct using the command duplicates report
.
%%stata
duplicates report workerid year
What this table shows is that there are 2,861,772 workerid-year combinations (which is exactly equal to all of our observations). This means that every observation we have corresponds to a worker in a particular year.
Let’s take a look at a different data set now also stored in this folder.
%%stata
= log(earnings)
gen log_earnings =log_earnings (count) total_employment = log_earnings, by(region year)
collapse (mean) avg_log_earnings"Average Log-earnings in Region-Year Cell" la var avg_log_earnings
%%stata
%browse 10
In this case, it seems that every observation corresponds to a region and year combination. Again, we can use duplicates report
to see if the variables region and year uniquely identify all observations.
%%stata
duplicates report region year
The table shows that there is not a single case of repeated observations. Hence, we will refer to these variables as the “unique identifiers”.
8.2.2 Identify the “master” and “using” data sets
When merging data we need to decide which data set will be the primary data set (Stata refers to this data set as “master”) and which will be the secondary data set (Stata refers to this data set as “using”). Often, it will not matter which is the master and which is the using data set; however, we need to know which is which in order to properly interpret our results.
8.2.3 Identify the matching observations
There are three main ways to match observations. The first is when both observations share the same unique identifiers, so that one observation in the master data set is matched to one observation in the using data set (referred to as a 1:1
merge). The other two ways arise when we match multiple observations in the master data set to one observation in the using data set (referred to as a m:1
merge), or one observation in the master data set to multiple observations in the using data set (referred to as a 1:m
merge).
8.3 Merging Data Sets
Once we have the master and using data sets, we know our unique identifiers and what type of match we are doing, we are able to merge the data sets.
We begin by having the master data opened in our current Stata session. For the sake example, let’s suppose we want to set “fake_data” as the master data set, and use “region-year” as the using data set.
We already know that the “fake_data” data set’s unique identifiers are workerid and year, while the “region-year” data set’s unique identifiers are region and year. The variables we use to link both data sets have to be the unique identifiers that are present in both data sets. Because workerid does not exist in the region-level data set, we will use the variables region and year for our merge.
This means that for every region in the using data set, there will be many observations in the individual level (master) data set to be matched. Therefore, this will be a m:1
merge.
%%stata
use fake_data, clear = log(earnings)
gen log_earnings =log_earnings (count) total_employment = log_earnings, by(region year)
collapse (mean) avg_log_earnings"Average Log-earnings in Region-Year Cell"
la var avg_log_earnings save region_year_data, replace
%%stata
// This sets this data set as the master use fake_data, clear
%%stata
1 region year using region_year_data merge m:
Let’s analyze the table above. It says that there were 406,963 observations in the master data which couldn’t be matched to any observation in the using data set. This is due to the fact that our data set at the region-year level does not have information for some years.
Furthermore, the table shows that every observation from the using data set got matched to some observation in the master data set. The total number of matched observations is roughly 2.5 million. All of this information gets recorded in a new variable named *_merge*. Because of this, it is good practice to write cap drop _merge
before running a merge command.
Think about the following question: would we get the same results if we switched the master and using datasets?
%%stata
use region_year_data, clear1:m region year using fake_data merge
Indeed, we get the same information. We typically want to restrict to observations that were correctly matched across datasets.
%%stata
if _merge==3 keep
Warning: Before dropping the unmerged observations, make sure you spend some time thinking about why they did not merge and correct any errors that you identify. For example, maybe your identifier is country and country names are different in the two data sets (i.e. one data set has “Barbados” and another data set has “The Barbados”). If this is the case, you will want to change one of these names and attempt your match a second time by merging again.
8.4 Appending Data Sets
We have used merge
to combine data sets horizontally (we have added columns/variables to the master data set). However, if we want to combine datasets vertically (add observations to the master data set) we can use append
. Adding new information with append
is very simple compared to working with merge
. When we have a master data set opened in our session, we can add observations using the syntax:
append using new_dataset
This command will add new observations to our master data set, so long as the variables across both data sets have the exact same name.
8.5 Wrap Up
In this module we learned how to combine different data sets. This is an extremely useful skill, especially when we are undertaking panel data regressions. Next, we will switch our focus to data visualization and look at how to graph in Stata.
The following table summarizes the main commands we have seen in this module.
Command | Function |
---|---|
merge |
It pastes two data sets based on values of some variables. It creates a new data set with more variables, and potentially more observations. |
append |
It pastes two data sets based on the names of some variables. It creates a new data set with more observations, and potentially more variables. |
8.6 Video tutorial
Click on the image below for a video tutorial on this module.
References
Merge files into a single dataset
Append files into a single dataset