library(haven)
library(tidyverse)
library(IRdisplay)
<- read_csv("../econ490-stata/fake_data.csv") # change me! fake_data
ECON 490: Combining Data Sets (7)
Prerequisites
- Import data sets in csv and dta format.
- Create new variables for a variety of purposes.
- Use group_by and other functions to conduct group level analysis.
Learning Outcomes
- Append new observations and variables to an already existing data set using
rbind
andcbind
. - Merge variables and their values from one data set into another using
left_join
,right_join
,inner_join
, andfull_join
.
7.1 Working with Multiple Data Sets
We’ll continue working with the “fake_data” data set introduced in the previous lecture. Recall that this data set is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings. Let’s load in this data set now.
Since we are working with multiple data sets in this module, we will also import the “region_year_data” data set below. This data set is much smaller and gives the average log earnings and total number of people employed among each region in a series of years.
<- fake_data %>% #
region_year_data group_by(year, region) %>%
summarize(average_logearn = mean(log_earnings), n=n())
Often we will need to draw on data from multiple data sets such as these. Most of the time, these data sets will be available for download in different files (each for a given year, month, country, etc.) and may store different variables or observations. Thus, if we want to compile them we need to combine them into the same data frame.
There are two key ways of combining data, each reflecting different goals:
- When we want to paste data directly beside or under our existing data set, we call this appending data.
- If we think of a data set as a spreadsheet, this is like taking one data set and “pasting” it into the bottom of another to add more observations, or pasting one data set directly beside another to add more variables. We do this when two data sets have identical columns/variables (so that we can stack them vertically) or an equal number of observations (so that we can stick them beside each other horizontally).
- When we want to add new variables and their data from another data set into our existing data set, we call this merging data.
- This is like looking up values in a table and then adding a column; in Excel, this is like using
VLOOKUP
. Importantly, we can only merge data sets that share a common column or key to identify observations with particular values. For example, if we want to merge in data from a different year but for the same people (observations) as those we are currently working with, data sets will usually have an identifying number for the person that functions as our key when merging. Unlike with appending, this does not require column names or numbers of observations to be identical.
- This is like looking up values in a table and then adding a column; in Excel, this is like using
7.2 Appending Datasets
7.2.1 Append vertically with rbind
Let’s say that our “fake_data” data set is inexplicably missing 3 observations for worker 1; specifically, the earnings for this worker for the years 2003, 2005, and 2007 are missing. However, let’s say these observations exist in another data set, “missing_data”, which we can append to our “fake_data” data set since it contains all of the same variables. We can inspect this small data frame below.
<- data.frame(workerid = c(1, 1, 1), year = c(2003, 2005, 2007), sex = c("M", "M", "M"),
missing_data birth_year = c(1944, 1944, 1944), age = c(59, 61, 63), start_year = c(1997, 1997, 1997),
region = c(1, 1, 1), treated = c(0, 0, 0), earnings = c(30000, 35000, 36000))
missing_data
To append these four rows to the bottom of our data set, we can simply use the rbind
function (row bind). This function allows us to bind together data sets vertically, with the data set specified second being placed directly underneath the data set specified first. In this way, we can combine data sets vertically if they share the same column names.
<- rbind(fake_data, missing_data)
fake_data
tail(fake_data)
This is a fast way of concatenating data sets vertically. We can see that it also does not require us to have a designated “master” and “using” data set. We can have both data sets stored in our notebook and view them simultaneously, making the process of appending data sets simpler, especially if we want to check for identical column names or missing values.
7.2.2 Append horizontally with cbind
We may also want to concatenate data sets horizontally. Suppose that we have a new variable, religious, which is a dummy coded as 1 if the person self-identified as religious in that year and 0 if not. This data frame (which is technically a vector) is below.
set.seed(123)
<- data.frame(religious = sample(0:1, 2861772, replace = TRUE))
missing_data2
head(missing_data2)
Assuming it is ordered identically to our “fake_data” data set with respect to participants, we can simply bind this column to our existing data set using the cbind
function.
<- cbind(fake_data, missing_data2)
fake_data
head(fake_data)
We can see that this function appended our religious variable to the data set. However, it required us to have an identical number of observations between the two data frames, and for both data frames to be ordered identically with respect to people. Often this is not the case, so we must turn to a more commonly used and slightly more challenging concept next: merging datasets.
7.3 Merging Data Sets
Merging data sets means matching existing observations between datasets along specific variables, typically in order to add more information about existing participants to our current data set. This process, also known in R as joining data, is more complicated than simply appending data. Luckily, we have four functions with descriptive names which help to crystallize this process for us depending on how we want to merge two data sets. Before we start, we should look at the structure of each data set.
head(fake_data)
head(region_year_data)
To do a merge of any type, we need to specify a “key” or variable on which we will merge our data sets. It is best to choose a variable (or variables) which uniquely identifies each observation, otherwise merging will incur challenges. We can guess from our knowledge of the data set that every combination of workerid and year returns a unique observation in the “fake_data” data set. Looking at the “region_year_data” data set above, we can see that every combination of year and region identifies unique observations in this data set. This second data set, however, does not have the workerid variable, while the first data set has all three of the workerid, year and region. Since the unique identifiers common to both data sets are year and region, we will use these as our keys within the join functions. Since there are many observations with identical years and regions within the “fake_data” data set, we will be doing what is similar to a m:1 merge in Stata. However, we can specify how we would like matched and unmatched observations to be treated.
Tip: If we do not have any common identifiers between our data sets, but do have variables which express the exact same information, we can simply rename one of the variables so that they are identical.
7.3.1 Merge with left_join
The left join merge is a type of merge whereby we merge two data sets along one or more “keys”, but keep all observations without a match from the data set specified first in the function and discard all the unmatched observations in the data set specified second.
left_join(fake_data, region_year_data, by = c("year", "region"))
Notice here that this function preserves all rows in the first data set, in this case the “fake_data” data set, no matter what. The only rows of the second data set, “region_year_data”, which are kept are those which can be matched to a corresponding row from the first with identical key values (identical values for year and region). A direct partner to this function is the right_join
function, which operates identically but in reverse. That is, it keeps all observations in the second data set and keeps only those in the first which found a match with the second based on the identifier columns specified.
7.3.2 Merge with inner_join
The inner join merge is a type of merge whereby we keep only observations which have found a match between the two data sets. In this way, this function necessarily discards as many or more observations than the other types of merges.
inner_join(fake_data, region_year_data, by = c("year", "region"))
We can see that this function matched many identical region and year pairings to different workers. That is because there are many workers who have data reported for the same year and same region (i.e. many different workers in “fake_data” have earnings recorded for 1999 in region 1). In some data sets, however, especially those which are not as large as “fake_data”, we will lose many observations with inner_join
, since this function only preserves observations which can be matched across the key/s specified in both data sets.
7.3.3 Merge with full_join
This is the function that is closest to appending data horizontally. The process of full join ensures that all observations from both data sets are maintained; if observations from one data set do not find a match, they simply take on values of NA for the newly merged variables from the other data set.
full_join(fake_data, region_year_data, by = c("year", "region")
We can see that this function left many observations from our “fake_data” data set with missing values for variables from our “region_year_data” data set such as avg_log_earnings and total_employment. This is because the “fake_data” data set has observations for workers in years which are not included in the “region_year_data” data set (since the former records information from 1982 on and the latter records information from 1998 on). In this way, while full_join
typically retains the highest number of observations, it fills our data set with many missing observations.
When choosing which merge method to choose, it is important to consider if any observations will not find a match, which data sets these “unmatched” observations are in, and whether we would like for these observations to be recorded as missing or dropped. If we wish to drop unmatched observations in all cases, inner_join
is most appropriate. If we have two data sets and want to drop unmatched observations solely from the first, left_join
is most appropriate (and correspondingly right_join
if we want to drop unmatched observations solely from the second). Finally, if we wanted to keep all observations no matter what and have unmatched observations automatically marked with missing values for variables for which they have no recorded information, we should use full_join
. In all cases, unmatched observations refer to observations in a data set which do not share the same recorded value for the specified key/s (common identifier/s) with the data set they are being merged with.
7.4 Wrap Up
In this module, we learned how to combine different data sets. The most important lesson we should take away from this module is that we can append data sets vertically when they have identical variables and horizontally when they have identical observations (and when these variables and observations are identically ordered in both data sets). More generally, however, we want to merge different variables (columns) between two data sets using common identifier variables. We have a series of four types of merges we can use to accomplish this, each of which treats unmatched observations differently.
As a final note, throughout this module we used the join functions. However, base R has a merge
function which can accomplish all of the joins we have discussed. We didn’t cover this function in detail, however, because it operates much more slowly on large data sets. If you wish to learn more about this function, you can view its documentation by running the code cell below!
?merge
In the next module, we will look at graphing in R: the main types of graphs we can create, how to save these graphs, and best practices for data visualization more generally.
The following table summarizes the main commands we have seen in this module.
Command | Function |
---|---|
rbind() |
It stacks vertically two data sets based on certain variables. The resulting data set has more observations. |
cbind() |
It combines two data sets by binding them side by side. The resulting data set has more variables. |
left_join() |
It pastes two data sets together based on specific values of variables. It keeps all the observation of the main data frame but only the observations of the using data frame that are successfully matched. |
inner_join() |
It pastes two data sets together based on specific values of variables. It only keeps the matched observations from both the main and the using data set. |
full_join() |
It pastes two data sets together based on specific values of variables. It keeps all observations of both the main and the using data set, also the observations that are not matched. |