```r
# Run this cell
source("intro_to_data_tests.r")
```
# 3.1 - Introduction to Data in R - Part 1

This notebook works with the `tidyverse` set of packages. It covers basic data curation and cleaning, including table-based inspection and handling of missing data.
## Outline

### Prerequisites

- Introduction to Jupyter
- Introduction to R

### Outcomes

After completing this notebook, you will be able to:

- Identify and understand the packages and commands needed to load, manipulate, and combine data frames in R
- Load data using R in a variety of forms
- Create and reformat data, including transforming data into factor variables
Drawing insights from data requires information to be presented in a way that is interpretable both to R and to our audiences. However, before you can wrangle data sets, you need to ensure that they are clean. A clean data set means:

- Observations where data for key variables is missing are removed or stored in a different data set (e.g. `df_raw`). Missing data can create bias in your analysis, and there are other reasons why researchers choose to drop variables with too many missing values.
- The data set is tidy, i.e. each row captures only one observation and each column captures only one variable/characteristic of the observation. Data scraped or collected manually or using automation often comes in untidy shapes (e.g. a column that stores both the price and the square-foot area separated by a hyphen `-`).

In this notebook, we teach you how to load data sets properly in R and then clean them using some common methods from the `haven` and `tidyverse` packages.
## Loading Data in R
R needs the appropriate packages to access the functions required to interpret our raw data. `install.packages('package_name')` is used to install a package for the first time, while `library(package_name)` imports the package into our notebook’s session run-time.

Let’s get started by loading them now.
```r
# loading in our packages
library(tidyverse)
library(haven)
```
Researchers usually work with data stored in STATA, Excel, or comma-separated values (CSV) files. The extension of a file tells us the file type we’re dealing with. Here are some common file types used in the data community:

- `.dta` for a STATA data file
- `.csv` for a comma-separated values file
- `.txt` for text files, which can store data separated by white-space(s)
Also take note of the following functions used to import data of different file types:

- `read_csv("file name")` for CSV files
- `read_dta("file name")` for STATA files, from the `haven` package
- `read_excel("file name")` for Excel files, from the `readxl` package
- `read_table("file name", header = FALSE)` for text files; the `header` argument indicates whether the first row of the data represents the column names or not
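As a quick, self-contained illustration (assuming the `readr` package, loaded as part of the tidyverse, is installed), we can write a tiny CSV to a temporary file and read it back; the file contents below are made up for this example:

```r
library(readr) # part of the tidyverse; provides read_csv()

# write a tiny, made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Ann,34", "Bob,29"), tmp)

# read it back in: read_csv() returns a tibble (a tidyverse data frame)
df <- read_csv(tmp, show_col_types = FALSE)
df # a 2-row, 2-column tibble
```

The same pattern works for the other readers: point the function at a file path and it returns a data frame.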
### Exercise

In this notebook, we will be working with data from the Canadian census, which is stored as `01_census2016.dta`. Which function should we use to load this file? Write the name of the function just before the brackets (e.g. `read_table`).
```r
# which function should we use?
answer0 <- "..."

test_0()
```
Did you get it? Okay, now replace the `???` in the code below with that function to load the data!
```r
# reading in the data
census_data <- ???("../datasets/01_census2016.dta") # change me!

# inspecting the data
glimpse(census_data)
```
## Cleaning Data

It’s important to ask what we mean by cleaning our data sets. This usually looks like:
- Loading the data into R by importing a local file or from the internet and telling R how to interpret it.
- Merging data frames from different sources, horizontally or vertically, in order to be able to answer certain questions about the populations.
- Renaming variables, creating new variables and removing observations where data for the new variables is missing.
- Removing outliers and/or creating subsets of the data based on values of different variables, using `filter`, `select`, and other reshaping methods in R.
We now begin to clean the census data. We want to redefine and factor variables, define new ones, and drop missing observations.
## Factor Variables
As discussed previously, two types of variables can be stored in R: quantitative and qualitative. Qualitative variables are usually stored in R as sequences of characters or letters, i.e. as character variables. They can also be stored as factor variables, which map qualitative responses to categorical values. In other words, the qualitative variable gets encoded so the levels of the variable are represented by numeric “codes”. This streamlines data interpretation and analysis.
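To see this encoding in action, here is a minimal base-R sketch; the toy `colors` vector is made up for illustration:

```r
# a qualitative variable stored as characters
colors <- c("red", "blue", "red", "green")

# convert it to a factor: each distinct value becomes a level
f <- as.factor(colors)

levels(f)     # the categories, sorted alphabetically: "blue" "green" "red"
as.integer(f) # the underlying numeric codes: 3 1 3 2
```

Notice that the codes themselves carry no meaning; the `levels` attribute is what maps each code back to its label.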
Look at the line for `pr` in the output from `glimpse` above:

```
pr <dbl+lbl> 35, 35, 11, 24, 35, 35, 35, 10, 35, 35, 59, 59, 46, 24, 59
```
The `pr` variable in the Census data stands for province. Do these look much like Canadian provinces to you? This is an example of encoding. We can also see the variable type is `<dbl+lbl>`: this is a labeled double. This is good: it means that R already understands what the levels of this variable mean.
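To make this concrete, here is a small sketch of how a `<dbl+lbl>` vector can be built by hand with `haven::labelled()`; the codes and labels below are illustrative stand-ins, not the full census coding:

```r
library(haven)

# build a labeled double by hand: numeric codes plus a label for each code
pr_demo <- labelled(c(35, 24, 59),
                    labels = c("ontario" = 35, "quebec" = 24, "british columbia" = 59))

pr_demo            # prints as numbers with attached labels
as_factor(pr_demo) # decodes each code into its label
```

This is exactly the structure `read_dta` produces when a STATA file carries value labels.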
There are three similar ways to change variables into factor variables.

- We can change a specific variable inside a data frame to a factor by using the `as_factor` command:
```r
census_data <- census_data %>% # we start by saying we want to update the data, AND THEN... (%>%)
  mutate(pr = as_factor(pr))   # mutate (update pr to be a factor variable)

glimpse(census_data)
```
Do you see the difference in the `pr` variable? You can also see that the type has changed to `<fct>`, for factor variable.

R knows how to decode province names out of the `<dbl+lbl>` type, since the variable specification captures both the numeric code (`dbl`) and the label (`lbl`).
- We can also supply a list of factors using the `factor` command. This command takes two other values:
  - A list of levels the qualitative variable will take on (e.g. 35, 11, 24, … in the case of `pr`)
  - A list of labels, one for each level, which describes what each level means (e.g. ‘ontario’, ‘prince edward island’, ‘quebec’, …)
Let’s look at the `pkids` (has children) variable as an example. Let’s suppose we didn’t notice that it is of type `<dbl+lbl>`, or that we decided we didn’t like the built-in labels.
```r
# first, we write down a list of levels
kids_levels = c(0, 1, 9)

# then, we write a list of our labels
kids_labels = c('none', 'one or more', 'not applicable')

# finally, we use the command but with some options - telling factor() how to interpret the levels
census_data <- census_data %>% # we start by saying we want to update the data, AND THEN... (%>%)
  mutate(pkids = factor(pkids, # notice the function is "factor", not "as_factor"
                        levels = kids_levels,
                        labels = kids_labels)) # mutate (update pkids) to be a factor of pkids

glimpse(census_data)
```
See the difference here and notice how we can customize factor labels when creating new variables.
- If we have a large data set, it can be tiresome to decode all of the variables one-by-one. Instead, we can use `as_factor` on the entire data set and it will convert all of the variables with appropriate types.

Note: `as_factor` will only match the levels (e.g. 35, 11, 24) to labels (e.g. ‘ontario’, ‘prince edward island’, ‘quebec’) if the variable is of `<dbl+lbl>` type.
```r
census_data <- as_factor(census_data)
glimpse(census_data)
```
Here is our final data set, all cleaned up! Notice that some of the variables (e.g. `ppsort`) were not converted into factor variables.

Test Your Knowledge: Can you tell why?
## Creating New Variables
Another important clean-up task is to make new variables. We can use the `mutate` command again. In fact, when we were making factor variables earlier, we were in a way making new variables.

The `mutate` command is an efficient way of manipulating the columns of our data frame, and we can specify a formula for creating the new variable:

```r
census_data <- census_data %>%
  mutate(new_variable_name = function(do some stuff...))
```
Let’s see it in action with the `log()` function, which we can use to create a new variable for the natural logarithm of wages.
```r
census_data <- census_data %>%
  mutate(log_wages = log(wages)) # the log function

glimpse(census_data)
```
Do you see our new variable at the bottom? Nice!
### The `case_when` function

We won’t cover a lot of complex functions in this notebook, but we will mention one very important example: the `case_when` function. This function creates different values for an input based on specified cases. You can read more about it by running the code block below.

```r
?case_when
```
The `case_when()` function in R operates like a series of ‘if-then’ statements. Put simply, for each line:

- The ‘case’ is the condition that you’re checking for.
- The ‘value’ is what you assign when that condition is met.
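Before applying it to the census data, here is a minimal sketch on a made-up numeric vector (assuming `dplyr` is loaded, e.g. via the tidyverse):

```r
library(dplyr) # provides case_when()

x <- c(-2, 0, 5)

sign_label <- case_when(
  x < 0  ~ "negative", # case: x is negative; value: "negative"
  x == 0 ~ "zero",     # case: x is exactly zero; value: "zero"
  TRUE   ~ "positive"  # default case: everything else
)

sign_label # "negative" "zero" "positive"
```

Each element of `x` is matched against the cases from top to bottom and takes the value of the first case it satisfies.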
Suppose we are working with the `pkids` variable and find it has three levels (`'none', 'one or more', 'not applicable'`). We are interested in creating a dummy variable which is equal to one if the respondent has children and zero otherwise. Let’s call it `has_kids`.

Here’s how you can use `case_when()` to achieve this:
```r
census_data <- census_data %>%
  mutate(has_kids = case_when(        # make a new variable, call it has_kids
    pkids == "none" ~ 0,              # case 1: pkids is "none"; output is 0 (no kids)
    pkids == "one or more" ~ 1,       # case 2: "one or more"; output is 1 (kids)
    pkids == "not applicable" ~ 0))   # case 3: "not applicable"; output is 0 (no kids)

glimpse(census_data)
```
Notice that `has_kids` is not a factor variable. We must add the appropriate line of code to do that.
### Exercise: Factorize has_kids

Create an object, stored in `answer1`, in which the `census_data` data frame is identical to the one above but in which the `has_kids` variable is also in factor form.
```r
answer1 <- census_data %>%
  ... # fill me in!

test_1()
```
## More Complex Variables

Dummy variables can also be created from more “complex” variables. For example, `agegrp` is a factor variable with 22 levels. Imagine we were interested in an analysis comparing people who are of retirement age with those who are not. Then we could create a variable called `retired` that simply tells us whether a person has reached retirement or not, i.e. `retired` equals 0 if they are younger than 65 and 1 otherwise.

As best practice, start by having a look at `agegrp`.
```r
glimpse(census_data$agegrp)
levels(census_data$agegrp)
```
`glimpse(census_data$agegrp)` told us that `agegrp` is a factor variable with 22 levels. We then use `levels()` to see the names of the levels.
We can now bunch together all levels that represent ages 65 and above and assign such observations a value of 1 (and 0 otherwise).
```r
census_data <- census_data %>%
  mutate(retired = case_when((agegrp == "65 to 69 years")|(agegrp == "70 to 74 years")|(agegrp == "75 to 79 years")|(agegrp == "80 to 84 years")|(agegrp == "85 years and over") ~ 1,
                             TRUE ~ 0)) %>% # otherwise
  mutate(retired = as_factor(retired))      # factor

glimpse(census_data)
```
A Tip: To assign a default value to all cases that don’t match your specified conditions, use `TRUE` as your last ‘case’. This works because `TRUE` will always be met if none of the previous conditions are. (Notice `TRUE ~ 0` in the code.)
Dummy variables are always useful, and sometimes a necessity, in econometric and ML models that need to work with complex types of qualitative data.
### Exercise: Adding a Dummy Variable

Create an object `answer2` in which the `census_data` data frame now has a dummy variable called `knows_english` which is equal to 1 if the respondent knows English and 0 if not. Make sure that this variable is factorized.

Hint: We’re telling you here that you need the `kol` variable to create the dummy. As best practice, you should have verified that `kol` and `fol` are the two language-related variables, and that it is `kol` (“knowledge of official languages”) that best suits `knows_english`.
```r
# Run this first:
glimpse(census_data$kol)
levels(census_data$kol)
```

```r
answer2 <- ... # fill me in!

test_2()
```
## Conclusion

In this notebook, you learned how to load and manipulate data using various R packages and commands. You also learned how to factor variables and create dummies to meet the needs of your statistical research.

Don’t hesitate to come back to this notebook and apply what you’ve learned here to new data sets. You may now proceed to Part 2 on Intro to Data.