```r
# Run this cell
source("intro_to_data_tests.r")
```
# 3.1 - Introduction to Data in R - Part 1

This notebook works with the `tidyverse` set of packages. It covers basic data curation and cleaning, including table-based inspection and handling of missing data.
## Outline

### Prerequisites

- Introduction to Jupyter
- Introduction to R

### Outcomes

After completing this notebook, you will be able to:

- Identify and understand the packages and commands needed to load, manipulate, and combine data frames in R
- Load data using R in a variety of forms
- Create and reformat data, including transforming data into factor variables
Drawing insights from data requires information to be presented in a way that is interpretable both to R and to our audiences. However, before you can wrangle data sets, you need to ensure that they are clean. A clean data set means:

- Observations where data for key variables is missing are removed or stored in a different data set (e.g. `df_raw`). Missing data can create bias in your analysis, and there are other reasons why researchers choose to drop variables with too many missing values.
- The data set is tidy, i.e. each row captures only one observation and each column captures only one variable/characteristic of the observation. Data scraped or collected manually or using automation often comes in untidy shapes (e.g. a column that stores both the price and the square-foot area separated by a hyphen `-`).

In this notebook, we teach you how to load data sets properly in R and then clean them using some common methods from the `haven` and `tidyverse` packages.
## Loading Data in R
R needs the appropriate packages to access the functions required to interpret our raw data. `install.packages('package_name')` is used to install a package for the first time, while `library(package_name)` imports the package into our notebook’s session run-time.

Let’s get started by loading them now.
```r
# loading in our packages
library(tidyverse)
library(haven)
```
Researchers usually work with data stored in STATA, Excel, or comma-separated values (CSV) files. The extension of a file tells us the file type we’re dealing with. Here are some common file types used in the data community:

- `.dta` for a STATA data file
- `.csv` for a comma-separated values file
- `.txt` for text files, which can store data separated by white-space(s)
Also take note of the following functions used to import data of different file types:

- `read_csv("file name")` for CSV files
- `read_dta("file name")` for STATA files, from the `haven` package
- `read_excel("file name")` for Excel files, from the `readxl` package
- `read_table("file name", header = FALSE)` for text files; the `header` argument indicates whether the first row of the data represents the column names or not
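As a quick, self-contained illustration (assuming the `readr` package, loaded as part of the tidyverse, is installed), we can write a tiny CSV to a temporary file and read it back; the file contents below are made up for this example:

```r
library(readr) # part of the tidyverse; provides read_csv()

# write a tiny, made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Ann,34", "Bob,29"), tmp)

# read it back in: read_csv() returns a tibble (a tidyverse data frame)
df <- read_csv(tmp, show_col_types = FALSE)
df # a 2-row, 2-column tibble
```

The same pattern works for the other readers: point the function at a file path and it returns a data frame.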
### Exercise

In this notebook, we will be working with data from the Canadian census, which is stored as `01_census2016.dta`. Which function should we use to load this file? Write the name of the function just before the brackets (e.g. `read_table`).
```r
# which function should we use?
answer0 <- "..."

test_0()
```
Did you get it? Okay, now replace the `???` in the code below with that function to load the data!
```r
# reading in the data
census_data <- ???("../datasets/01_census2016.dta") # change me!

# inspecting the data
glimpse(census_data)
```
## Cleaning Data

It’s important to ask what we mean by cleaning our data sets. This usually looks like:
- Loading the data into R by importing a local file or from the internet and telling R how to interpret it.
- Merging data frames from different sources, horizontally or vertically, in order to be able to answer certain questions about the populations.
- Renaming variables, creating new variables and removing observations where data for the new variables is missing.
- Removing outliers and/or creating subsets of the data based on values of different variables, using `filter`, `select`, and other reshaping methods in R.
We now begin to clean the census data. We want to redefine and factor variables, define new ones, and drop missing observations.
## Factor Variables
As discussed previously, two types of variables can be stored in R: quantitative and qualitative. Qualitative variables are usually stored in R as sequences of characters or letters, i.e. as character variables. They can also be stored as factor variables, which map qualitative responses to categorical values. In other words, the qualitative variable gets encoded so the levels of the variable are represented by numeric “codes”. This streamlines data interpretation and analysis.
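To see this encoding in action, here is a minimal base-R sketch; the toy `colors` vector is made up for illustration:

```r
# a qualitative variable stored as characters
colors <- c("red", "blue", "red", "green")

# convert it to a factor: each distinct value becomes a level
f <- as.factor(colors)

levels(f)     # the categories, sorted alphabetically: "blue" "green" "red"
as.integer(f) # the underlying numeric codes: 3 1 3 2
```

Notice that the codes themselves carry no meaning; the `levels` attribute is what maps each code back to its label.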
Look at the line for `pr` in the output from `glimpse` above:

```
pr <dbl+lbl> 35, 35, 11, 24, 35, 35, 35, 10, 35, 35, 59, 59, 46, 24, 59
```
The `pr` variable in the Census data stands for province. Do these look much like Canadian provinces to you? This is an example of encoding. We can also see the variable type is `<dbl+lbl>`: this is a labeled double. This is good: it means that R already understands what the levels of this variable mean.
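To make this concrete, here is a small sketch of how a `<dbl+lbl>` vector can be built by hand with `haven::labelled()`; the codes and labels below are illustrative stand-ins, not the full census coding:

```r
library(haven)

# build a labeled double by hand: numeric codes plus a label for each code
pr_demo <- labelled(c(35, 24, 59),
                    labels = c("ontario" = 35, "quebec" = 24, "british columbia" = 59))

pr_demo            # prints as numbers with attached labels
as_factor(pr_demo) # decodes each code into its label
```

This is exactly the structure `read_dta` produces when a STATA file carries value labels.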
There are three similar ways to change variables into factor variables.

- We can change a specific variable inside a data frame to a factor by using the `as_factor` command:
```r
census_data <- census_data %>% # we start by saying we want to update the data, AND THEN... (%>%)
  mutate(pr = as_factor(pr))   # mutate (update pr to be a factor variable)

glimpse(census_data)
```
Do you see the difference in the `pr` variable? You can also see that the type has changed to `<fct>`, for factor variable.

R knows how to decode province names out of the `<dbl+lbl>` type, since the variable specification captures both the numeric code (`dbl`) and the label (`lbl`).
- We can also supply a list of factors using the `factor` command. This command takes two other values:
  - A list of levels the qualitative variable will take on (e.g. 35, 11, 24, … in the case of `pr`)
  - A list of labels, one for each level, which describes what each level means (e.g. ‘ontario’, ‘prince edward island’, ‘quebec’, …)
Let’s look at the `pkids` (has children) variable as an example. Let’s suppose we didn’t notice that it is of type `<dbl+lbl>`, or that we decided we didn’t like the built-in labels.
```r
# first, we write down a list of levels
kids_levels = c(0, 1, 9)

# then, we write a list of our labels
kids_labels = c('none', 'one or more', 'not applicable')

# finally, we use the command but with some options - telling factor() how to interpret the levels
census_data <- census_data %>% # we start by saying we want to update the data, AND THEN... (%>%)
  mutate(pkids = factor(pkids, # notice the function is "factor", not "as_factor"
                        levels = kids_levels,
                        labels = kids_labels)) # mutate (update pkids) to be a factor of pkids

glimpse(census_data)
```
See the difference here and notice how we can customize factor labels when creating new variables.
- If we have a large data set, it can be tiresome to decode all of the variables one-by-one. Instead, we can use `as_factor` on the entire data set and it will convert all of the variables with appropriate types.

Note: `as_factor` will only match the levels (e.g. 35, 11, 24) to labels (e.g. ‘ontario’, ‘prince edward island’, ‘quebec’) if the variable is of `<dbl+lbl>` type.
```r
census_data <- as_factor(census_data)
glimpse(census_data)
```
Here is our final data set, all cleaned up! Notice that some of the variables (e.g. `ppsort`) were not converted into factor variables.

Test Your Knowledge: Can you tell why?
## Creating New Variables
Another important clean-up task is to make new variables. We can use the `mutate` command again. In fact, when we were making factor variables earlier, we were in a way making new variables.

The `mutate` command is an efficient way of manipulating the columns of our data frame, and we can specify a formula for creating the new variable:

```r
census_data <- census_data %>%
  mutate(new_variable_name = function(do some stuff...))
```
Let’s see it in action with the `log()` function, which we can use to create a new variable for the natural logarithm of wages.
```r
census_data <- census_data %>%
  mutate(log_wages = log(wages)) # the log function

glimpse(census_data)
```
Do you see our new variable at the bottom? Nice!
### The `case_when` function

We won’t cover a lot of complex functions in this notebook, but we will mention one very important example: the `case_when` function. This function creates different values for an input based on specified cases. You can read more about it by running the code block below.

```r
?case_when
```
The `case_when()` function in R operates like a series of ‘if-then’ statements. Put simply, for each line:

- The ‘case’ is the condition that you’re checking for.
- The ‘value’ is what you assign when that condition is met.
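Before applying it to the census data, here is a minimal sketch on a made-up numeric vector (assuming `dplyr` is loaded, e.g. via the tidyverse):

```r
library(dplyr) # provides case_when()

x <- c(-2, 0, 5)

sign_label <- case_when(
  x < 0  ~ "negative", # case: x is negative; value: "negative"
  x == 0 ~ "zero",     # case: x is exactly zero; value: "zero"
  TRUE   ~ "positive"  # default case: everything else
)

sign_label # "negative" "zero" "positive"
```

Each element of `x` is matched against the cases from top to bottom and takes the value of the first case it satisfies.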
Suppose we are working with the `pkids` variable and find it has three levels (`'none', 'one or more', 'not applicable'`). We are interested in creating a dummy variable which is equal to one if the respondent has children and zero otherwise. Let’s call it `has_kids`.

Here’s how you can use `case_when()` to achieve this:
```r
census_data <- census_data %>%
  mutate(has_kids = case_when(        # make a new variable, call it has_kids
    pkids == "none" ~ 0,              # case 1: pkids is "none"; output is 0 (no kids)
    pkids == "one or more" ~ 1,       # case 2: "one or more"; output is 1 (kids)
    pkids == "not applicable" ~ 0))   # case 3: "not applicable"; output is 0 (no kids)

glimpse(census_data)
```
Notice that `has_kids` is not a factor variable. We must add the appropriate line of code to do that.
### Exercise: Factorize has_kids

Create an object, stored in `answer1`, in which the `census_data` data frame is identical to the one above but in which the `has_kids` variable is also in factor form.
```r
answer1 <- census_data %>%
  ... # fill me in!

test_1()
```
## More Complex Variables

Dummy variables can also be created from more “complex” variables. For example, `agegrp` is a factor variable with 22 levels. Imagine we were interested in an analysis comparing people who are of retirement age with those who are not. Then we could create a variable called `retired` that simply tells us whether a person has reached retirement or not, i.e. `retired` equals 0 if they are younger than 65 and 1 otherwise.

As best practice, start by having a look at `agegrp`.
```r
glimpse(census_data$agegrp)
levels(census_data$agegrp)
```
`glimpse(census_data$agegrp)` told us that `agegrp` is a factor variable with 22 levels. We then use `levels()` to see the names of the levels.
We can now bunch together all levels that represent ages 65 and above and assign such observations a value of 1 (and 0 otherwise).
```r
census_data <- census_data %>%
  mutate(retired = case_when((agegrp == "65 to 69 years")|(agegrp == "70 to 74 years")|(agegrp == "75 to 79 years")|(agegrp == "80 to 84 years")|(agegrp == "85 years and over") ~ 1,
                             TRUE ~ 0)) %>% # otherwise
  mutate(retired = as_factor(retired))      # factor

glimpse(census_data)
```
A Tip: To assign a default value to all cases that don’t match your specified conditions, use `TRUE` as your last ‘case’. This works because `TRUE` will always be met if none of the previous conditions are. (Notice `TRUE ~ 0` in the code.)
Dummy variables are always useful, and sometimes a necessity, in econometric and ML models that need to work with complex types of qualitative data.
### Exercise: Adding a Dummy Variable

Create an object `answer2` in which the `census_data` data frame now has a dummy variable called `knows_english` which is equal to 1 if the respondent knows English and 0 if not. Make sure that this variable is factorized.

Hint: We’re telling you here that you need the `kol` variable to create the dummy. As best practice, you should have verified that `kol` and `fol` are the two language-related variables, and that it is `kol` (“knowledge of official languages”) that best suits `knows_english`.
```r
# Run this first:
glimpse(census_data$kol)
levels(census_data$kol)
```

```r
answer2 <- ... # fill me in!

test_2()
```
## Conclusion

In this notebook, you learned how to load and manipulate data using various R packages and commands. You also learned how to factor variables and create dummies to meet the needs of your statistical research.

Don’t hesitate to come back to this notebook and apply what you’ve learned here to new data sets. You may now proceed to Part 2 on Intro to Data.