# Run this cell
source("getting_started_intro_to_data_tests.r")
0.3.1 - Introduction to Data in R - Part 1
This notebook introduces working with data in R using the tidyverse set of packages. It covers basic data curation and cleaning, including table-based inspection and missing data.
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to R
Outcomes
After completing this notebook, you will be able to:
- Identify and understand the packages and commands needed to load, manipulate, and combine data frames in R
- Load data using R in a variety of forms
- Create and reformat data, including transforming data into factor variables
Introduction
Drawing insights from data requires information to be presented in a way that is interpretable to both R and our audiences. However, before you can wrangle data sets, you need to ensure that they are clean. A clean data set means:
- Observations where data for key variables are missing are removed or stored in a different data set (e.g., df_raw). Missing data can create bias in your analysis.
- The data set is tidy, i.e., each row captures only one observation and each column captures only one variable/characteristic of the observation. Data scraped and collected manually or using automation often comes in untidy shapes (e.g., two variables might be placed in the same column separated with a hyphen -).
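As a minimal sketch of these two clean-up steps, the example below uses a small made-up data frame. The column names id, age_sex, and income are hypothetical and do not come from the census data, and the sketch assumes the dplyr and tidyr packages (both part of the tidyverse) are installed.

```r
library(dplyr)
library(tidyr)

# hypothetical raw data: age and sex share one column, and one income is missing
df_raw <- data.frame(
  id      = c(1, 2, 3),
  age_sex = c("34-F", "51-M", "27-F"),
  income  = c(52000, NA, 38000),
  stringsAsFactors = FALSE
)

df_clean <- df_raw %>%
  separate(age_sex, into = c("age", "sex"), sep = "-") %>% # one variable per column
  mutate(age = as.numeric(age)) %>%                        # separate() returns text, so convert age
  filter(!is.na(income))                                   # drop rows missing a key variable

df_clean
```

Keeping df_raw untouched and writing the result to a new object means the original observations stay available if you later decide to treat the missing values differently.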
In this notebook, we teach you how to load datasets properly in R and then clean them using some common methods from the haven and tidyverse packages.
Part 1: Introduction to data in R
R needs to be provided with the appropriate packages to access the functions needed to interpret our raw data. install.packages('package_name') is used to install a package for the first time, while library(package_name) is used to import the package into our notebook’s session run-time.
Let’s get started by loading them now.
# loading in our packages
library(tidyverse)
library(haven)
Researchers usually work with data stored in STATA, Excel, or comma-separated values (CSV) files. The extension of a file tells us the file type we’re dealing with. For instance:
- .dta for a STATA data file
- .csv for a comma-separated values file
- .txt for text files (stores data separated by white-space)
In R, we import data with functions that specify the file names and types. These are the R functions used to import data from the most common file types:
# import csv files
read_csv("file_name")
# import stata data files
read_dta("file_name")
# import excel files
read_excel("file_name")
# import text files
read_table("file_name", col_names = FALSE)
To use the read_dta function you need to have the haven package installed, and to use read_excel you need to have the readxl package installed.
The col_names argument in the last function indicates whether the first row of the data represents the column names or not. (The similarly named header argument belongs to base R’s read.table, not to read_table.)
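To see an import function in action without needing a data file on hand, here is a small sketch that writes a made-up table to a temporary CSV file and reads it back. The values and the object names tmp_file and toy_data are illustrative only, and the sketch assumes a reasonably recent version of the readr package (part of the tidyverse).

```r
library(readr)

# write a small made-up table to a temporary CSV file
tmp_file <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, wages = c(41000, 58500, 36250)), tmp_file)

# read it back; read_csv returns a tibble and guesses the column types
toy_data <- read_csv(tmp_file, show_col_types = FALSE)
toy_data
```

The other import functions follow the same pattern: the first argument is the path to the file, and the result is assigned to an object so it can be used later.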
Cleaning data
Cleaning our dataset might mean:
- Loading the data into R by importing a local file or from the internet and telling R how to interpret it.
- Merging data frames from different sources, horizontally or vertically, in order to be able to answer certain questions about the populations.
- Renaming variables, creating new variables and removing observations where data for the new variables is missing.
- Removing outliers or creating subsets of the data based on values for different variables using filter, select, and other reshaping methods in R.
We now begin to clean the census data. Let’s redefine and factor some variables, create new ones, and drop missing observations.
Test your knowledge
In this notebook, we will be working with data from the Canadian census which is stored in the folder datasets as the file 01_census2016.dta.
Which function should we use to load this file? Complete the name of the function below.
answer_1 <- "read_..."
test_1()
Did you get it? Okay, now replace the ??? in the code below with that function to load the data!
census_data <- ???("../datasets_getting_started/01_census2016.dta")
answer_2 <- census_data # don't change this!
test_2()
# inspecting the data
glimpse(census_data)
Part 2: Factor variables
As explained in Intro to R, R usually stores qualitative variables as character variables. However, they can also be stored as factor variables, used to map a (usually predetermined) set of responses to categorical values. In other words, factors encode the data so that the levels of the variable are represented by numeric codes. This process is useful because it streamlines data interpretation and analysis.
Look at line pr in the output from glimpse above:
pr <dbl+lbl> 35, 35, 11, 24, 35, 35, 35, 10, 35, 35, 59, 59, 46, 24, 59
The pr variable in the Census data stands for province. Do these look much like Canadian provinces to you? We can see the variable type is <dbl+lbl>: this is a labeled double. Let’s transform this variable type into factors.
There are three ways to change variable types into factor variables.
- We can change a specific variable inside a dataframe to a factor by using the as_factor command.
Note: The operator %>% is called the pipe operator. It is used to indicate the “next operation”. For example, you could read the code below as: the final value will be assigned to the object census_data; the value should be calculated by (1) taking the data from census_data and (2) mutating pr to as_factor(pr). The pipe operator indicates that we’re going from operation (1) to operation (2).
census_data <- census_data %>% # overwrite the object census_data with `<-`
  mutate(pr = as_factor(pr)) # use mutate function to update variable type (more on this later)
glimpse(census_data)
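To make the pipe more concrete, here is a minimal sketch on a made-up numeric vector (the names x, nested, and piped are illustrative). It assumes the dplyr package is installed, which makes %>% available.

```r
library(dplyr) # loading dplyr makes the %>% operator available

x <- c(1, 4, 9)

nested <- sum(sqrt(x))            # nested calls are read inside-out
piped  <- x %>% sqrt() %>% sum()  # piped calls are read left to right: take x, then sqrt, then sum

identical(nested, piped)
```

Both versions compute the same value; the pipe only changes the order in which the steps are written.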
Do you see the difference in the pr variable now? Notice that the type has changed to <fct>, which stands for factor.
R knows how to decode province names out of the <dbl+lbl> type variable because the <dbl+lbl> specification captures both the numeric code as dbl and the label as lbl.
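The following sketch builds a tiny labelled vector by hand to show how a code and its label travel together. The object pr_toy is hypothetical; the three codes and province names are the ones used as examples in this notebook. It assumes the haven package is installed.

```r
library(haven)

# a hand-made labelled double that mimics the structure of `pr`
pr_toy <- labelled(
  c(35, 11, 24, 35),
  labels = c("Ontario" = 35, "Prince Edward Island" = 11, "Quebec" = 24)
)

pr_toy                           # prints the numeric codes along with their labels
pr_factor <- as_factor(pr_toy)   # replaces each code with its label
pr_factor
```

Because the labels are stored inside the variable itself, as_factor needs no extra information to do the conversion.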
- We can supply a list of factors using the factor command. This command takes three inputs:
  - The variable we’re trying to convert
  - A list of the codes the qualitative variable will take on (e.g., 35, 11, 24, …)
  - A list of labels corresponding to each of the codes (e.g., "Ontario", "Prince Edward Island", "Quebec", …)
Let’s take the variable pkids as an example. pkids stores whether the respondent has children or not. Let’s change the built-in labels to our own labels.
# write a list of levels
kids_levels <- c(0, 1, 9)
# write a list of our labels
kids_labels <- c('none', 'one_or_more', 'not_applicable')
# apply the new level-label combinations to the data
census_data <- census_data %>% # overwrite the object census_data with `<-`
  mutate(pkids = factor(pkids, # notice the function is "factor", not "as_factor"
                        levels = kids_levels,
                        labels = kids_labels)) # mutate (update pkids) to be a factor of pkids
glimpse(census_data)
Notice that now pkids has our customized factor labels.
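To see how factor pairs each level with a label, here is a small base R sketch on a made-up vector. The object names pkids_toy and pkids_factor are illustrative; the codes and labels mirror the ones used for pkids.

```r
# toy vector using the same codes as pkids
pkids_toy <- c(0, 1, 9, 1)

pkids_factor <- factor(pkids_toy,
                       levels = c(0, 1, 9),
                       labels = c("none", "one_or_more", "not_applicable"))

pkids_factor              # values are displayed by label
as.integer(pkids_factor)  # underlying integer codes give each value's position in `levels`
```

Levels and labels are matched by position, so the two lists must be written in the same order. Any value that does not appear in levels is turned into NA.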
- We can use as_factor on the entire dataset to convert all of the variables with appropriate types.
Note: as_factor will only match the levels (e.g., 35, 11, 24, …) to labels (e.g., "Ontario", "Prince Edward Island", "Quebec") if the variable is of <dbl+lbl> type.
census_data <- as_factor(census_data)
glimpse(census_data)
Here is our final dataset, all cleaned up! Notice that some of the variables (e.g., ppsort) were not converted into factor variables.
Think Deeper: Can you tell why?
Creating new variables
Another important clean-up task is to make new variables. The best way to create a new variable is using the mutate command.
The mutate command is an efficient way of manipulating the columns of our data frame. We can use it to create new columns out of existing columns or with completely new inputs. The structure of the mutate command is as follows:
census_data <- census_data %>%
mutate(new_variable_name = function(...))
It’s easier to understand with an example.
When working with economic data, we usually deal with wages in logarithmic form. Let’s use mutate to create a new variable on the dataset for the log of wages.
census_data <- census_data %>%
  mutate(log_wages = log(wages)) # we pass `wages` to the function `log()` to create log_wages
glimpse(census_data)
Do you see our new variable at the bottom? Nice!
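The same pattern can be tried on a tiny made-up data frame, which makes the result easy to check by eye. The object toy_wages and the column wages_thousands are hypothetical, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

toy_wages <- data.frame(wages = c(30000, 45000, 60000)) # made-up values

toy_wages <- toy_wages %>%
  mutate(log_wages       = log(wages),    # log() is the natural logarithm
         wages_thousands = wages / 1000)  # several new columns can be created in one mutate call

toy_wages
```

Keep in mind that log() returns -Inf for a value of zero and NaN (with a warning) for negative values, so it is worth checking the range of a variable before taking logs.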
Test your knowledge
In the following code, what is (1) the name of the new variable created, (2) the inputs used to make the new variable, and (3) the function used to transform the inputs into the values of the new variable?
- (A) grade_adjusted, grade and 2, mutate
- (B) mutate, grade and 2, mutate
- (C) round, data, mutate
- (D) mutate, data, round
- (E) grade_adjusted, grade and 2, round
- (F) round, data, round
data <- data %>%
  mutate(grade_adjusted = round(grade, 2))
# enter your answer as "A", "B", "C", "D", "E", or "F"
answer_3 <- "..."
test_3()
Part 3: Functions
We won’t cover a lot of complex functions in this notebook, but we will mention a very important one: the case_when function. This function acts like a combination of “if (…), then (…)” operators, creating different values for an input based on specified cases. You can read more about it by running the code block below.
# use the help operator `?` to read the details of `case_when`
# ?case_when
The case_when() function operates with the following parameters:
- The ‘case’, which is the condition that you’re checking for.
- The ‘value’, which is what you assign when that condition is met.
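Here is a minimal sketch of the case and value pairing on a made-up numeric vector (the names x and sign_label are illustrative). It assumes the dplyr package is installed.

```r
library(dplyr)

x <- c(-2, 0, 5)

sign_label <- case_when(
  x < 0  ~ "negative", # case: x < 0   value: "negative"
  x == 0 ~ "zero",     # case: x == 0  value: "zero"
  x > 0  ~ "positive"  # case: x > 0   value: "positive"
)

sign_label
```

The cases are checked from top to bottom, and each element receives the value of the first case it satisfies.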
Suppose we are working with the pkids variable and find it has three levels ('none', 'one_or_more', 'not_applicable'). We are interested in creating a dummy variable which equals one if the respondent has children and zero otherwise. Let’s call this new variable has_kids.
Here’s how you can use case_when() to achieve this:
census_data <- census_data %>%
  mutate(has_kids = case_when( # use mutate to make a new variable called `has_kids`
    pkids == "none" ~ 0, # case 1: when pkids is "none"; output is 0 (no kids)
    pkids == "one_or_more" ~ 1, # case 2: when pkids is "one_or_more"; output is 1 (kids)
    pkids == "not_applicable" ~ 0)) # case 3: when pkids is "not_applicable"; output is 0 (no kids)
glimpse(census_data)
Notice that our new variable has_kids is not a factor variable. We must add the appropriate line of code to make it a factor.
Dummy Variables
We might also want to use R to create dummy variables in our dataset. For example, suppose we want to create a variable that indicates whether the respondent is retired (dummy == 1) or not retired (dummy == 0). We can simply recode the data of the variable agegrp, which is currently a factor indicating the age group of the respondent.
Let’s start by taking a look at agegrp. Running glimpse(census_data$agegrp) tells us that agegrp is a factor variable with 22 levels. We can see the names of the levels with the function levels().
# inspect the data
glimpse(census_data$agegrp)
# understand levels
levels(census_data$agegrp)
We can now bunch together all levels that represent ages 65 and above (the retirement age) and assign such observations a value of 1 (and 0 otherwise).
census_data <- census_data %>%
  mutate(retired = case_when(
    (agegrp == "65 to 69 years")| # the vertical bar can be read as "or"
    (agegrp == "70 to 74 years")|
    (agegrp == "75 to 79 years")|
    (agegrp == "80 to 84 years")|
    (agegrp == "85 years and over") ~ 1, # the ~ separates the 'case' from the 'value'
    TRUE ~ 0)) %>% # use `TRUE` for the 'otherwise' condition
  mutate(retired = as_factor(retired)) # make the variable a factor
glimpse(census_data)
To Remember: To assign a default value on all cases that don’t match your specified conditions, use TRUE as your last ‘case’. This works because the condition TRUE will always be met if none of the previous conditions are.
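The sketch below shows what happens with and without the TRUE default, using a made-up character vector of age groups. The names age_toy, retired_levels, no_default, and with_default are illustrative, and the first entry of age_toy is an assumed label for a non-retired group. It also uses %in%, a base R operator that is a compact alternative to chaining several == comparisons with |. It assumes the dplyr package is installed.

```r
library(dplyr)

# made-up observations; the first label is assumed for illustration
age_toy <- c("60 to 64 years", "65 to 69 years", "85 years and over")

retired_levels <- c("65 to 69 years", "70 to 74 years", "75 to 79 years",
                    "80 to 84 years", "85 years and over")

# without a default, observations that match no case become NA
no_default <- case_when(age_toy %in% retired_levels ~ 1)

# with TRUE as the last case, every unmatched observation gets 0
with_default <- case_when(age_toy %in% retired_levels ~ 1,
                          TRUE ~ 0)

no_default
with_default
```

Forgetting the default is an easy way to introduce missing values by accident, so it is worth checking a new dummy variable for NA after creating it.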
Test your knowledge
Overwrite the existing has_kids variable with a new has_kids variable but with type factor.
Hint: To overwrite a variable, create a new variable with the same name as the name of the variable you want to overwrite.
# use this cell to write your code
# run this cell to check your answer - don't change the code here!
answer_4 <- class(census_data$has_kids)
test_4()
Create a new dummy variable called knows_english for whether the respondent speaks English (dummy == 1) or not (dummy == 0). Use data from the variable kol and assign the updated data frame to the object answer_5.
#Run this first:
glimpse(census_data$kol)
levels(census_data$kol)
# don't forget to factorize your new variable!
answer_5 <- census_data %>%
  mutate(... = case_...(
    (kol == ...)|
    (kol == ...) ~ ...,
    TRUE ~ ...)) %>%
  mutate(... = ...(...))
test_5()
Conclusion
In this notebook, you learned how to load and manipulate data using various R packages and commands. You also learned how to convert variables to factors and create dummies to meet the needs of your statistical research.
Don’t hesitate to come back to this notebook and apply what you’ve learned here to new data sets. You may now proceed to Part 2 on Intro to Data.