07 - Combining Data Sets

econ 490

append

cbind

rbind

merge

left join

inner join

full join

This notebook explains how to append and merge data sets.

Author

Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt

Published

29 May 2024

Prerequisites

Import data sets in .csv and .dta format.
Create new variables for a variety of purposes.
Use group_by and other functions to conduct group level analysis.

Learning Outcomes

Append new observations and variables to an already existing data set using rbind, bind_rows, cbind and bind_cols.
Merge variables and their values from one data set into another using left_join, right_join, inner_join, and full_join.

7.1 Working with Multiple Data Sets

We’ll continue working with the “fake_data” data set that was introduced in the previous lecture. Recall that this data set is simulating information for workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings. Let’s load in this data set now.

#As always, clear the memory!
rm(list=ls())

library(haven)
library(tidyverse)
library(IRdisplay)

fake_data <- read_dta("../econ490-r/fake_data.dta")  # change me!

For the purposes of this module, we will want to work with multiple datasets. Let’s create the “region_year_data” data set below. This data set is much smaller and gives the average log earnings and total number of people employed among each region in a series of years.

fake_data <- fake_data %>% mutate(log_earnings = log(earnings))

region_year_data <-  fake_data %>%    
    group_by(year, region) %>%
    summarize(average_logearn = mean(log_earnings), n=n())

Often we will need to draw on data from multiple data sets such as these. Most of the time, these data sets will be available for download in different files (each for a given year, month, country, etc.) and may store different variables or observations. Thus, if we want to compile them, we need to combine them into the same data frame.

There are two key ways of combining data, each reflecting different goals:

When we want to paste data directly beside or under our existing data set, we call this appending data.
- If we think of a data set as a spreadsheet, this is like taking one data set and “pasting” it into the bottom of another to add more observations, or pasting one data set directly beside another to add more variables. We do this when two data sets have identical columns/variables (so that we can stack them vertically) or an equal number of observations (so that we can stick them beside each other horizontally).
When we want to add new variables and their data from another data set into our existing data set, we call this merging data.
- This is like looking up values in a table and then adding a column; in Excel, this is like using VLOOKUP. Importantly, we can only merge data sets that share a common column or key to identify observations with particular values. For example, if we want to merge in data from a different year but for the same people (observations) as those we are currently working with, data sets will usually have an identifying number for the person that functions as our key when merging. Unlike with appending, this does not require column names or numbers of observations to be identical.

7.2 Appending Data Sets

7.2.1 Append Vertically with `rbind` and `bind_rows`

Let’s say that our “fake_data” data set is inexplicably missing 3 observations for worker 1; specifically, the earnings for this worker for the years 2003, 2005, and 2007 are missing. However, let’s say these observations exist in another data set, “missing_data”, which we can append to our “fake_data” data set since it contains all of the same variables. We can inspect this small data frame below.

missing_data <- data.frame(workerid = c(1, 1, 1), year = c(2003, 2005, 2007), sex = c("M", "M", "M"), 
                           age = c(59, 61, 63), start_year = c(1997, 1997, 1997),
                           region = c(1, 1, 1), treated = c(0, 0, 0), earnings = c(30000, 35000, 36000))

missing_data

To append these three rows to the bottom of our data set, we can simply use the rbind function (row bind). This function allows us to bind together data sets vertically, with the data set specified second being placed directly underneath the data set specified first. In this way, we can combine data sets vertically if they share the exact same column names. Try that below!

fake_data <- rbind(fake_data, missing_data)

tail(fake_data)

As we can see, we’ve run into an error. In this case, our missing_data does not have the exact same columns as fake_data. missing_data does not have a sample_weight column. This is a problem! To override this and append the data sets anyways, we can use the bind_rows command. This will append our missing_data to the end of fake_data, and leave the columns empty when the information is not available in the appending data set. See below:

fake_data <- bind_rows(fake_data, missing_data)

tail(fake_data)

This is a fast way of concatenating data sets vertically. We can see that it also does not require us to have a designated “master” and “using” data set. We can have both data sets stored in our notebook and view them simultaneously, making the process of appending data sets simpler, especially if we want to check for identical column names or missing values.

7.2.2 Append Horizontally with `cbind` and `bind_cols`

We may also want to concatenate data sets horizontally. Suppose that we have a new variable, religious, which is a dummy coded as 1 if the person self-identified as religious in that year and 0 if not. This data frame (which is technically a vector) is below.

set.seed(123)

missing_data2 <- data.frame(religious = sample(0:1, 138141, replace = TRUE))

head(missing_data2)

Assuming it is ordered identically to our “fake_data” data set with respect to participants, and that we have the exact same number of rows, we can simply bind this column to our existing data set using the cbind function.

fake_data <- cbind(fake_data, missing_data2)

head(fake_data)

We can see that this function appended our religious variable to the data set. However, it required us to have an identical number of observations between the two data frames, and for both data frames to be ordered identically with respect to people. Often this is not the case, so we must turn to a more commonly used and slightly more challenging concept next: merging datasets. Alternatively, we would have to use the command bind_cols.

7.3 Merging Data Sets

Merging data sets means matching existing observations between datasets along specific variables, typically in order to add more information about existing participants to our current data set. This process, also known in R as joining data, is more complicated than simply appending data. Luckily, we have four functions with descriptive names which help to crystallize this process for us depending on how we want to merge two data sets. Before we start, we should look at the structure of each data set.

head(fake_data)

head(region_year_data)

To do a merge of any type, we need to specify a “key” or variable on which we will merge our data sets – this is our unique identifier. It is best to choose a variable (or variables) which uniquely identify each observation, otherwise merging will incur challenges. We can guess from our knowledge of the data set that every combination of workerid and year returns a unique observation in the “fake_data” data set. Looking at the “region_year_data” data set above, we can see that every combination of year and region identifies unique observations in this data set. This second data set, however, does not have the workerid variable, while the first data set has all three of the workerid, year and region variables. Since the unique identifiers common to both data sets are year and region, we will use these as our keys within the join functions. Since there are many observations with identical years and regions within the “fake_data” data set, we will be doing what is similar to a m:1 merge in Stata. However, we can specify how we would like matched and unmatched observations to be treated.

Tip: If we do not have any common identifiers between our data sets, but do have variables which express the exact same information, we can simply rename one of the variables so that they are identical.

7.3.1 Merge with `left_join`

The left join merge is a type of merge whereby we merge two data sets along one or more “keys”, but keep all observations without a match from the first data set (specified first in the function) and discard all the unmatched observations from the second data set (specified second). This is similar to an m:1 merge in Stata, where we keep everything in the “master” data.

left_join(fake_data, region_year_data, by = c("year", "region"))

Notice here that this function preserves all rows in the first data set, in this case the “fake_data” data set, no matter what. The only rows of the second data set (“region_year_data”) which are kept are those which can be matched to a corresponding row from “fake_data” with identical key values (identical values for year and region). A direct partner to this function is the right_join function, which operates identically but in reverse (similar to 1:m). That is, it keeps all observations in the second data set and keeps only those in the first which found a match with the second based on the identifier columns specified.

7.3.2 Merge with `inner_join`

The inner join merge is a type of merge whereby we keep only observations which have found a match between the two data sets, i.e. the identifiers must be in both the first and second data sets (1:1 merge in Stata). In this way, this function necessarily discards as many or more observations than the other types of merges.

inner_join(fake_data, region_year_data, by = c("year", "region"))

We can see that this function matched many identical region and year pairings to different workers. That is because there are many workers who have data reported for the same year and same region (i.e. many different workers in “fake_data” have earnings recorded for 1999 in region 1). In some data sets, however, especially those which are not as large as “fake_data”, we will lose many observations with inner_join, since this function only preserves observations which can be matched across the key(s) specified in both data sets.

7.3.3 Merge with `full_join`

This is the function that is closest to appending data horizontally. The process of full join ensures that all observations from both data sets are maintained (similar to an m:m merge in Stata); if observations from one data set do not find a match, they simply take on values of NA for the newly merged variables from the other data set.

full_join(fake_data, region_year_data, by = c("year", "region"))

We can see that this function left many observations from our “fake_data” data set with missing values for variables from our “region_year_data” data set such as avg_log_earnings and total_employment. This is because the “fake_data” data set has observations for workers in years which are not included in the “region_year_data” data set (since the former records information from 1982 on and the latter records information from 1998 on). In this way, while full_join typically retains the highest number of observations, it fills our data set with many missing observations.

When choosing which merge method to choose, it is important to consider if any observations will not find a match, which data sets these “unmatched” observations are in, and whether we would like for these observations to be recorded as missing or dropped. If we wish to drop unmatched observations in all cases, inner_join is most appropriate. If we have two data sets and want to drop unmatched observations solely from the first, left_join is most appropriate (and correspondingly right_join if we want to drop unmatched observations solely from the second). Finally, if we wanted to keep all observations no matter what and have unmatched observations automatically marked with missing values for variables for which they have no recorded information, we should use full_join. In all cases, unmatched observations refer to observations in a data set which do not share the same recorded value for the specified key(s) (common identifier(s)) with the data set they are being merged with.

7.4 Wrap Up

In this module, we learned how to combine different data sets. The most important lesson we should take away from this module is that we can append data sets vertically when they have identical variables and horizontally when they have identical observations (and when these variables and observations are identically ordered in both data sets). More generally, however, we want to merge different variables (columns) between two data sets using common identifier variables. We have a series of four types of merges we can use to accomplish this, each of which treats unmatched observations differently.

As a final note, throughout this module we used the join functions. However, base R has a merge function which can accomplish all of the joins we have discussed. We didn’t cover this function in detail, however, because it operates much more slowly on large data sets. If you wish to learn more about this function, you can view its documentation by running the code cell below!

?merge

In the next module, we will look at graphing in R: the main types of graphs we can create, how to save these graphs, and best practices for data visualization more generally.

7.5 Wrap-up Table

Command	Function
`rbind()`	It stacks vertically two data sets based on certain variables. The resulting data set has more observations.
`cbind()`	It combines two data sets by binding them side by side. The resulting data set has more variables.
`left_join()`	It pastes two data sets together based on specific values of variables. It keeps all the observation of the main data frame but only the observations of the using data frame that are successfully matched.
`inner_join()`	It pastes two data sets together based on specific values of variables. It only keeps the matched observations from both the main and the using data set.
`full_join()`	It pastes two data sets together based on specific values of variables. It keeps all observations of both the main and the using data set, also the observations that are not matched.