# 03.1 - Introduction to Data in R - Part 1

COMET Team <br> *Manas Mridul, Valeria Zolla, Colby Chamber, Colin
Grimes, Jonathan Graves*  
2023-01-12

## Outline

### Prerequisites

-   Introduction to Jupyter
-   Introduction to R

### Outcomes

After completing this notebook, you will be able to:

-   Identify and understand the packages and commands needed to load,
    manipulate, and combine data frames in R
-   Load data using R in a variety of forms
-   Create and reformat data, including transforming data into factor
    variables

### References

-   [Introduction to Probability and Statistics Using
    R](https://mran.microsoft.com/snapshot/2018-09-28/web/packages/IPSUR/vignettes/IPSUR.pdf)
-   [DSCI 100 Textbook](https://datasciencebook.ca/index.html)

In [None]:
# Run this cell

source("getting_started_intro_to_data_tests.r")

Drawing insights from data requires information to be presented in a way
that is interpretable both to R and to our audiences. However, before you
can wrangle data sets, you need to ensure that they are clean. A *clean*
data set means:

1.  Observations where data for key variables is missing are removed or
    stored in a different data set (e.g. `df_raw`). *Missing data* can
    create bias in your analysis, and researchers may also have other
    reasons to drop variables with too many missing values.
2.  The data set is *tidy*, i.e. each row captures only one observation
    and each column captures only one variable/characteristic of the
    observation. Data scraped or collected manually or using automation
    often comes in *untidy* shapes (e.g. a single column holds both the
    price and the square foot area, separated with a hyphen `-`).
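
The two conditions above can be sketched on a toy example. The data frame `listings_raw` and its column names are made up for illustration; the sketch assumes the `dplyr` and `tidyr` packages (both part of `tidyverse`) are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical untidy listings: price and square footage share one column
listings_raw <- tibble(
  id = 1:3,
  price_sqft = c("450000-900", "610000-1200", NA)
)

listings <- listings_raw %>%
  # split the combined column at the hyphen into two numeric columns
  separate(price_sqft, into = c("price", "sqft"), sep = "-", convert = TRUE) %>%
  # drop observations where either key variable is missing
  drop_na(price, sqft)

listings
```

Keeping `listings_raw` untouched means the original observations remain available if you later decide to treat missing values differently.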

In this notebook, we teach you how to load data sets properly in R and
then clean them using some common methods from the `haven` and
`tidyverse` packages.

# Loading Data in R

R needs the appropriate packages loaded before it can access the
functions needed to read and interpret our raw data.

> `install.packages('package_name')` installs a package for the first
> time, while `library('package_name')` loads the package into our
> notebook’s session run-time.

Let’s get started by loading them now.

In [None]:
# loading in our packages
library(tidyverse)
library(haven)

Researchers usually work with data stored in STATA, Excel, or
comma-separated values (CSV) files. The extension of a file tells us
the file type we’re dealing with. Here are three common file types used
in the data community:

-   `.dta` for a STATA data file
-   `.csv` for a comma-separated values file
-   `.txt` for a text file, which can store data separated by
    white space

Also take note of the following functions used to import data of
different file types:

-   `read_csv("file name")` for CSV, from the `readr` package (loaded
    with `tidyverse`)
-   `read_dta("file name")` for STATA, from the `haven` package
-   `read_excel("file name")` for Excel, from the `readxl` package
-   `read_table("file name", col_names = FALSE)` for text files, from
    the `readr` package
    -   The `col_names` argument indicates whether the first row of the
        file contains the column names (`TRUE`) or not (`FALSE`).
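
As a minimal sketch of how these import functions behave, the example below writes a small, made-up data frame to temporary CSV and STATA files and reads each one back. The object names (`toy`, `csv_path`, `dta_path`) are hypothetical, and the sketch assumes `readr` and `haven` are installed.

```r
library(readr)
library(haven)

# A made-up data frame, used only to create files we can read back in
toy <- data.frame(id = 1:3, wages = c(52000, 48000, 61000))

# Temporary file paths with the relevant extensions
csv_path <- tempfile(fileext = ".csv")
dta_path <- tempfile(fileext = ".dta")

write_csv(toy, csv_path)   # save as CSV
write_dta(toy, dta_path)   # save as a STATA file

# Each file type is read with its matching function
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_dta <- read_dta(dta_path)

from_csv
from_dta
```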

### Exercise

In this notebook, we will be working with data from the Canadian census
which is stored as `01_census2016.dta`. Which function should we use to
load this file? Write the name of the function just before the brackets
(e.g. `read_table`)

In [None]:
# which function should we use?

answer0 <- "..."

test_0()

Did you get it? Okay, now replace the `???` in the code below with that
function to load the data!

In [None]:
# reading in the data
census_data <- ???("../datasets_getting_started/01_census2016.dta")  # change me!

# inspecting the data
glimpse(census_data)

# Cleaning Data

What do we mean by **cleaning** our data sets? It usually involves some
of the following steps:

1.  Loading the data into R by importing a local file or from the
    internet and telling R how to interpret it.
2.  Merging data frames from different sources, horizontally or
    vertically, in order to be able to answer certain questions about
    the populations.
3.  Renaming variables, creating new variables and removing observations
    where data for the new variables is missing.
4.  Removing outliers and/or creating subsets of the data based on
    values of different variables, using `filter`, `select` and other
    reshaping methods in R.
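
The steps above can be sketched with a few `dplyr` verbs. All of the data frames here (`wave1`, `wave2`, `provinces`) are invented for illustration and are not part of the census data.

```r
library(dplyr)

# Hypothetical data: two batches of respondents and a lookup of provinces
wave1 <- tibble(id = 1:2, wages = c(52000, NA))
wave2 <- tibble(id = 3:4, wages = c(61000, 45000))
provinces <- tibble(id = 1:4, pr = c("ontario", "quebec", "ontario", "alberta"))

combined <- bind_rows(wave1, wave2) %>%   # merge vertically (stack rows)
  left_join(provinces, by = "id") %>%     # merge horizontally (add columns)
  rename(province = pr) %>%               # rename a variable
  filter(!is.na(wages)) %>%               # remove observations with missing wages
  select(id, province, wages)             # keep only the columns we need

combined
```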

We now begin to clean the census data. We want to redefine and factor
variables, define new ones, and drop missing observations.

## Factor Variables

As discussed previously, two types of variables can be stored in R:
quantitative and qualitative variables. Qualitative variables are
usually stored in R as sequences of characters or letters, i.e. as
**character** variables. They can also be stored as **factor** variables
which map qualitative responses to categorical values. In other words,
the qualitative variable gets **encoded** so the *levels* of the
variable are represented by numeric “codes”. This process further
streamlines data interpretation and analysis.

Look at the `pr` line in the output from `glimpse` above:

    pr      <dbl+lbl> 35, 35, 11, 24, 35, 35, 35, 10, 35, 35, 59, 59, 46, 24, 59

The `pr` variable in the Census data stands for province. Do these look
much like Canadian provinces to you? This is an example of **encoding**.
We can also see the variable type is `<dbl+lbl>`: this is a *labeled
double*. This is good: it means that R already understands what the
levels of this variable mean.

There are three similar ways to change variables into factor variables.

1.  We can change a specific variable inside a dataframe to a factor by
    using the `as_factor` command

In [None]:
census_data <- census_data %>%  #we start by saying we want to update the data, AND THEN... (%>%)
    mutate(pr = as_factor(pr)) #mutate (update pr to be a factor variable)

glimpse(census_data)

Do you see the difference in the `pr` variable? You can also see that
the type has changed to `<fct>` for **factor variable**.

R knows how to decode province names out of the `<dbl+lbl>` type
variable, since the variable specification captures both the numeric
code as `dbl` and the label as `lbl`.
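
To see what a labeled double looks like on its own, here is a minimal sketch that builds one by hand with `labelled()` from `haven`. The vector `pr_toy` is hypothetical; its codes and labels follow the examples mentioned in this notebook.

```r
library(haven)

# A hand-made labeled double: numeric codes plus a code-to-label mapping
pr_toy <- labelled(
  c(35, 24, 11, 35),
  labels = c("ontario" = 35, "quebec" = 24, "prince edward island" = 11)
)

attr(pr_toy, "labels")   # the mapping stored alongside the codes

# as_factor() uses that mapping to decode the codes into labels
pr_decoded <- as_factor(pr_toy)
pr_decoded
```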

2.  We can also **supply a list of factors** using the `factor` command.
    This command takes two additional arguments:
    -   A list of levels the qualitative variable will take on (e.g. 35,
        11, 24… in the case of `pr`)
    -   A list of labels, one for each level, which describes what each
        level means (e.g. ‘ontario’, ‘prince edward island’, ‘quebec’ …)

Let’s look at the `pkids` (has children) variable as an example. Let’s
suppose we didn’t notice that it is of type `<dbl+lbl>` *or* we decided
we didn’t like the built-in labels.

In [None]:
# first, we write down a list of levels
kids_levels <- c(0, 1, 9)

# then, we write a list of our labels
kids_labels <- c('none', 'one or more', 'not applicable')

# finally, we use the command but with some options - telling factor() how to interpret the levels

census_data <- census_data %>%  # we start by saying we want to update the data, AND THEN... (%>%)
    mutate(pkids = factor(pkids,   # notice the function is "factor", not "as_factor"
                          levels = kids_levels, 
                          labels = kids_labels)) # mutate (update pkids) to be a factor of pkids
glimpse(census_data)

See the difference here and notice how we can customize factor labels
when creating new variables.
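
The same idea can be tried on a plain numeric vector, outside of any data frame. This is a small base R sketch; the vector `codes` is made up, and the value 5 is included on purpose to show what happens to a code that has no matching level.

```r
# Made-up codes; 5 does not appear in the list of levels below
codes <- c(0, 1, 9, 1, 5)

kids <- factor(codes,
               levels = c(0, 1, 9),
               labels = c("none", "one or more", "not applicable"))

kids          # codes are replaced by their labels
is.na(kids)   # the unmatched code (5) becomes NA
```

Values that are not listed in `levels` are silently turned into `NA`, so it is worth checking that your list of levels is complete.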

3.  If we have a large data set, it can be tiresome to decode all of the
    variables one-by-one. Instead, we can use `as_factor` on the
    **entire dataset** and it will convert all of the variables with
    appropriate types.

> **Note**: `as_factor` will only match the levels (e.g. 35, 11, 24) to
> labels (e.g. ‘ontario’, ‘prince edward island’, ‘quebec’) if the
> variable is of `<dbl+lbl>` type.

In [None]:
census_data <- as_factor(census_data)
glimpse(census_data)

Here is our final dataset, all cleaned up! Notice that some of the
variables (e.g. `ppsort`) were *not* converted into factor variables.

> **Test Your Knowledge**: Can you tell why?
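
To check your reasoning on a small scale, the sketch below applies `as_factor` to a made-up two-column data frame (`toy` is hypothetical): one column is a labeled double and the other is a plain numeric column.

```r
library(dplyr)
library(haven)

# One labeled column and one ordinary numeric column
toy <- tibble(
  pr = labelled(c(35, 24), labels = c("ontario" = 35, "quebec" = 24)),
  wages = c(52000, 48000)
)

converted <- as_factor(toy)

# Compare the class of each column after the conversion
sapply(converted, class)
```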

## Creating New Variables

Another important clean-up task is to make new variables. We can use the
`mutate` command again. In fact, when we were making factor variables
earlier, we were in a way making new variables.

The `mutate` command is an efficient way of manipulating the columns of
our data frame, and we can specify a formula for creating the new
variable:

    census_data <- census_data %>%
            mutate(new_variable_name = function(do some stuff...))

Let’s see it in action with the `log()` function that can be used to
create a new variable for the natural logarithm of wages.

In [None]:
census_data <- census_data %>%
        mutate(log_wages = log(wages)) # the log function

glimpse(census_data)

Do you see our new variable at the bottom? Nice!
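
One detail worth knowing when taking logs: `log()` returns `-Inf` for zero and `NA` for missing values. The sketch below uses a made-up `toy` data frame to show this; it is an illustration, not a statement about what the census `wages` column contains.

```r
library(dplyr)

# Made-up wages, including a zero and a missing value
toy <- tibble(wages = c(50000, 0, NA))

toy <- toy %>%
  mutate(log_wages = log(wages))

toy   # log(0) is -Inf and log(NA) is NA
```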

## The `case_when` function

We won’t cover a lot of complex functions in this notebook, but we will
mention one very important example: the `case_when` function. This
function creates different values for an input based on specified cases.
You can read more about it by running the code block below.

In [None]:
?case_when

The `case_when()` function in R operates like a series of ‘if-then’
statements. Put simply, for each line:

-   The ‘case’ is the condition that you’re checking for.

-   The ‘value’ is what you assign when that condition is met.
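
As a minimal sketch of this case/value pattern, the example below applies `case_when()` to a small character vector. The vector `x` is made up, and the value `"unknown"` is included to show what happens when no case matches.

```r
library(dplyr)

# Made-up responses; "unknown" matches none of the cases below
x <- c("none", "one or more", "not applicable", "unknown")

result <- case_when(
  x == "none" ~ 0,            # case ~ value
  x == "one or more" ~ 1,
  x == "not applicable" ~ 0
)

result   # elements that match no case are set to NA
```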

Suppose we are working with the `pkids` variable and find it has three
levels (`'none', 'one or more', 'not applicable'`). We are interested in
creating a dummy variable which is equal to one if the respondent has
children and zero otherwise. Let’s call it `has_kids`.

Here’s how you can use `case_when()` to achieve this:

In [None]:
census_data <- census_data %>% 
    mutate(has_kids = case_when( # make a new variable, call it has_kids
        pkids == "none" ~ 0, # case 1: pkids is "none"; output is 0 (no kids)
        pkids == "one or more" ~ 1, # case 2: "one or more"; output is 1 (kids)
        pkids == "not applicable" ~ 0)) # case 3: "not applicable"; output is 0 (no kids)

glimpse(census_data)

Notice that `has_kids` is not a factor variable. We must add on the
appropriate line of code to do that.

## Exercise: Factorize `has_kids`

Create an object, stored in `answer1`, in which the `census_data` data
frame is identical to the one above but in which the `has_kids` variable
is also in factor form.

In [None]:
answer1 <- census_data %>% 
                ... # fill me in!

test_1()

## More Complex Variables

Dummy variables can be created from “complex” variables. For example,
`agegrp` is a **factor variable** with **22 levels**. Imagine we
were interested in an analysis comparing people who are of retirement
age with those who are not. Then we could create a variable called
`retired` that simply tells us whether a person has reached retirement
age or not, i.e. `retired` equals 0 if they are younger than 65 and 1
otherwise.

As a best practice, start by having a look at `agegrp`.

In [None]:
glimpse(census_data$agegrp)

levels(census_data$agegrp)

`glimpse(census_data$agegrp)` tells us that `agegrp` is a factor variable
with 22 levels. We then use `levels()` to see the names of the levels.

We can now bunch together all levels that represent **ages 65 and
above** and assign such observations a value of 1 (and 0 otherwise).

In [None]:
census_data <- census_data %>% 
    mutate(retired = case_when((agegrp == "65 to 69 years") |
                               (agegrp == "70 to 74 years") |
                               (agegrp == "75 to 79 years") |
                               (agegrp == "80 to 84 years") |
                               (agegrp == "85 years and over") ~ 1, # ages 65 and above
                               TRUE ~ 0)) %>% # otherwise
    mutate(retired = as_factor(retired)) # factor

glimpse(census_data)

**A Tip:** To assign a default value to all cases that don’t match your
specified conditions, use `TRUE` as your last ‘case’. This works because
`TRUE` is always satisfied, so it catches every observation not matched
by an earlier condition. (Notice `TRUE ~ 0` in the code.)
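
The default case can be sketched on a small, made-up factor. This example also uses `%in%` as a more compact alternative to chaining `==` with `|`, and shows one caveat: `TRUE` also catches missing values unless you handle them first. The vectors `agegrp_toy` and `older` are hypothetical.

```r
library(dplyr)

# Made-up age groups, including a missing value
agegrp_toy <- factor(c("60 to 64 years", "65 to 69 years", "85 years and over", NA))

older <- c("65 to 69 years", "70 to 74 years", "75 to 79 years",
           "80 to 84 years", "85 years and over")

# TRUE as the last case: everything not matched above gets 0, including NA
retired <- case_when(agegrp_toy %in% older ~ 1,
                     TRUE ~ 0)

# Handling missing values first keeps them as NA instead of 0
retired_keep_na <- case_when(is.na(agegrp_toy) ~ NA_real_,
                             agegrp_toy %in% older ~ 1,
                             TRUE ~ 0)

retired
retired_keep_na
```

Whether a missing age group should be coded as 0 or kept as `NA` depends on your analysis, so it is worth deciding this explicitly.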

Dummy variables are useful, and sometimes necessary, in many
econometric and ML models that need to work with complex types of
qualitative data.

## Exercise: Adding a Dummy Variable

Create an object `answer2` in which the `census_data` dataframe now has
a dummy variable called `knows_english` which is equal to 1 if the
respondent knows English and 0 if not. Make sure that this variable is
factorized.

**Hint:** We’re telling you here that you need the `kol` variable to
create the dummy. As a best practice, you should first verify that `kol`
and `fol` are the two language-related variables and that it is `kol`
(“knowledge of official languages”) that best suits `knows_english`.

In [None]:
#Run this first:
glimpse(census_data$kol)
levels(census_data$kol)

In [None]:
answer2 <- ... # fill me in!

test_2()

# Conclusion

In this notebook, you learned how to load and manipulate data using
various R packages and commands. You also learned how to factor
variables and create dummies to meet the needs of your statistical
research.

Don’t hesitate to come back to this notebook and apply what you’ve
learned here to new data sets. You may now proceed to Part 2 of
Introduction to Data in R.