0.2 - Introduction to R

basics

getting started

data types

data structures

introduction

dataframes

variables

operations

functions

This notebook introduces you to some fundamental concepts in R. It might be a little complex for a start, but it covers basically all of the fundamental syntax you need to know in later notebooks. Don’t get overwhelmed! Remember, you can always review this later!

Author

COMET Team
Colby Chambers, Anneke Dresselhuis, Oliver Xu, Colin Grimes, Jonathan Graves

Published

12 January 2023

Outline

Prerequisites:

Introduction to Jupyter

Learning Objectives

Understand variables, functions and objects in R
Import and load data into Jupyter Notebook
Access and perform manipulations on data

# Run this cell

source("getting_started_intro_to_r_tests.r")

Introduction

In this notebook, we will be introducing R, which is a programming language that is particularly well-suited for statistics, econometrics, and data science. If you are familiar with other programming languages, such as Python, this will likely be very familiar to you - if this is your first time coding, don’t be intimidated! Try to play around with the examples and exercises as you work through this notebook; it’s easiest to learn R (or any programming language) by trying things for yourself.

Part 1: Manipulating Data

To begin, it’s important to get a good grasp of the different data types in R and how to use them. Whenever we work with R, we will be manipulating different kinds of information, which is referred to as “data”. Data comes in many different forms, and these forms define how we can use it in calculations or visualizations - these are called types in R.

R has 6 basic data types. Data types are used to store information about a variable or object in R:

Character: data in text format, like “word” or “abc”
Numeric (real or decimal): data in real number format, like 6, or 18.8 (referred to as Double in R)
Integer: data in a whole number (integer) format, like 2L (the L tells R to store this as an integer)
Logical: truth values, like TRUE or FALSE
Complex: data in complex (i.e. imaginary) format, like 1+6i (where $i$ is the $\sqrt{-1}$)
Raw: raw digital data, an unusual type which we will not cover here

If we are ever wondering what type an object in R is, or what its properties are, we can use the following two functions, which allow us to examine the data type and elements contained within an object:

typeof(): this function returns a character string that corresponds to the data type of an object
str(): this function displays a compact internal structure of an R object

We will see some examples of these in just a moment.

Data Structures

We often need to store data in complex forms. Data can also be stored in different structures in R beyond basic data types. Data structures in R programming can be complicated, as they are tools for holding multiple values. However, some of them are very important and are worth discussing here.

Vectors: a vector of values, like $(1,3,5,7)$
Matrices: a matrix of values, like $[1,2; 3,4]$ (usually displayed as a square)
Lists: a list of elements, like $($pet = “cat,”dog”, “mouse”$)$, with named properties
Dataframe: a collection of vectors or lists, organized into rows and columns according to observations

Note that vectors don’t need to be numeric! There are some useful built-in functions to create data structures (we don’t have to create our own functions to do so).

c(): this function combines values into a vector
matrix(): this function creates a matrix from a given set of values
list(): this function creates a list from a given set of values
data.frame(): this function creates a data frame from a given set of lists or vectors

Okay, enough background - lets see this in action!

Working with Vectors

Vectors are important. We can create them from values or other elements, using the c() function:

# generates a vector containing values
z <- c(1, 2, 3)

# generates a vector containing characters
countries <- c("Canada", "Japan", "United Kingdom")

We can also access the elements of the vector. Since a vector is made of basic data, we can get those elements using the [ ] index notation. This is very similar to how in mathematical notation we refer to elements of a vector.

Note: if you’re familiar with other programming languages, it’s important to note that R is 1-indexed. So, the first element of a vector is 1, not 0. Keep this in mind!

# extract the 2nd component of "z"
z[2]

# extract the 3rd component of "countries"
countries[3]

As mentioned above, we can use the typeof() and str() functions to glimpse the kind of data stored in our objects. Run the cell below to see how this works:

# view the data type of countries
typeof(countries)

# view the data structure of countries
str(countries)

# view the data type of z
typeof(z)

# view the data structure of z
str(z)

The output of str(countries) begins by acknowledging that the contained data is of a character (chr) type. The information contained in the [1:3] first refers to the component number (there is only 1 component list here) and then the number of observations (the 3 countries).

Working with Matrices

Just like vectors, we can also create matrices; you can think of them as organized collections of rows (or columns), which are vectors. They’re a little bit more complicated to create manually, since you need to use a more complex function.

The simplest way to make a matrix is to provide a vector of all the values you are interested in including, and then tell R how the matrix is organized. R will then fill in the values:

# generates a 2 x 2 matrix
m <- matrix(c(2,3,6,7,7,3), nrow=2,ncol=3)

print(m)

Take note of the order in which the values are filled in; it might be unexpected!

Just like with vectors, we can also access parts of the matrix. If you look at the cell output above, you will see some notation like [1,] or [,2]. These are the rows and columns of the matrix. We can refer to them using this notation. We can also refer to elements using [1,2]. Again, this is very similar to the mathematical notation for matrices.

# If we want to access specific parts of the matrix:

# 2th column of matrix
m[,2]     

# 1st row of matrix
m[1,]  

# Element in row 1, column 2

m[1,2]

As with vectors, we can also observe and inspect the data structures of matrices using the helper function above.

# what type is m?

typeof(m)

# glimpse data structure of m
str(m)

The output of str(m) begins by displaying that the data in the matrix is of a numeric (num) type. The [1:2, 1:3] shows the structure of the rows and columns. The final part displays the values in the matrix.

Working with Lists

Lists are a little bit more complex because they can store many different data types and objects, each of which can be given names which are specific ways to refer to these objects. Names can be any useful descriptive term for an element of the list. You can think of lists like flexible vectors with names.

# generates a list with 3 components named "text" "a_vector" and "a_matrix"
my_list <- list(text="test", a_vector = z, a_matrix = m)

We can access elements of the list using the [ ] or [[ ]] operations. There is a difference:

[ ] accesses the elements of the list which is the name and object
[[ ]] accesses the object directly

We usually want to use [[ ]] when working with data stored in lists. One very nice feature is that you can refer to elements of a list by number (like a vector) or by their name.

# If we want to access specific parts of the list:

# 1st component in list
my_list[[1]] 

# 1st component in list by name (text)
my_list[["text"]]

# 1st part of the list (note the brackets)
my_list[1] 

# glimpse data type of my_list
typeof(my_list)

There is one final way to access elements of a list by name: using the $ or access operator. This works basically like [[name]] but is more transparent when writing code. You put the object you want to access, followed by the operator, followed by the property:

# access the named property "text"
my_list$text

#access the named property "a_matrix"
my_list$a_matrix

You will notice that this only works for named objects - which is particularly convenient for data frames, which we will discuss next.

Working with Dataframes

Dataframes are the most complex object you will work with in this course but also the most important. They represent data - like the kind of data we would use in econometrics. In this course, we will primarily focus on tidy data, which refers to data in which the columns represent variables, and the rows represent observations. In terms of R, you can think of data-frames as a combination of a matrix and a list.

We can access columns (variables) using their names, or their ordering

# generates a dataframe with 2 columns and 3 rows
df <- data.frame(ID=c(1:3),
                 Country=c('Canada', 'Japan', 'United Kingdom'))

# If we want access specific parts of the dataframe:

# 2nd column in dataframe
df[2] 

df$Country

# glimpse compact data structure of df
str(df)

Notice that the str(df) command shows us what the names of the columns are in this dataset and how we can access them.

Test your knowledge

Let’s see if you understand how to create new vectors! In the block below:

Create an object named my_vector which is a vector and which contains the numbers from 10 to 15, inclusive.
Extract the 4th element of my_vector and store it in the object answer_1.

my_vector <- c(...) # replace ... with the appropriate code

answer_1 <- my_vector[...] # replace ... with the appropriate code

test_1()

Create a new vector named my_second_vector with the elements my, second, and vector (all lowercase!).
Extract the 3th element of my_second_vector and store it in the object answer_2.

my_second_vector <- (...)

answer_2 <- (...)

test_2()

In this exercise:

Create an object named mat which is a matrix with 2 rows and 2 columns. The first column will take on values 11, 22, while the second column will take on values 33, 44.
Extract the value in the first row, second column from mat and store it in the object answer_2

mat <- matrix(..., nrow=...,ncol=...) # fill in the missing code

answer_3 <- mat[...]  # fill in the missing code

test_3()

Given the matrix below, extract the element on the 30th row and 10th column, store it in an object called first.
Extract the element on the 27th row and 13th column, store it in an object called second.
Extract the element on the 12th row and 33rd column, store it in an object called third.
Create a vector with the three elements and store it in the object answer_4.

mat2 <- matrix(c(1:1050), nrow=30,ncol=35)

first <- ...

second <- ...

third <- ...

answer_4 <- ...   

test_4()

Now, extract the 8th row of mat2 and store it in answer_5.

answer_5 <- ... 

test_5()

In this exercise, you will need to:

Create an object named a_list, which is a list with two components: an element called string which stores the character string “Hello World”, and an element called range which contains a vector with values 1 through 5.
Extract the elements of the second object, and store it in the variable answer_6.

a_list <- list(... = "Hello World", range = ...) # fill in the missing code

answer_6 <- a_list... # fill in the missing code

test_6()

In this exercise:

Create an object my_dataframe which is a dataframe with two variables and two observations. The first column var1 will take on values 1,2,3. The second column var2 will take on values "A", "B", "C".
Extract the column var1 and store it in the object answer_7

my_dataframe <- data.frame(var1=..., ...=c(...)) # fill in the missing code

answer_7 <- ... # fill in the missing code

test_7()

Now, we have created a data frame called hidden_df. Select the third element from the column col4 of the data frame hidden_df and store it in answer_8. Try to do that without actually seeing the dataset.

# Hint: you can think of a column of a dataset as a vector

answer_8 <- ... # fill in the missing code

test_8()

Part 2: Operations with Variables

At this point, you are familiar with some of the different types of data in R and how they work. However, let’s understand how we can work with these data types in more detail by writing R code. A variable or object is a name assigned to a memory location in the R workspace (working memory). For now we can use the terms variable and object interchangeably. An object will always have an associated type, determined by the information assigned to it. Clear and concise object assignment is essential for reproducible data analysis, as mentioned in the module Intro to Jupyter.

When it comes to code, we can assign information (stored in a specific data type) to variables and objects using the assignment operator <-. Using the assignment operator, the information on the right-hand side is assigned to the variable/object on the left-hand side; we’ve seen this before, in some of the examples earlier.

In the example [2] below, "Hello" has been assigned to the object var_1. "Hello" will be stored in the R workspace as an object named "var_1".

Note: R is case sensitive. When referring to an object, it must exactly match the assignment. Var_1 is not the same as var_1 or var1.

var_1 <- "Hello"

var_1

typeof(var_1)

You can create variables of many different types, including all of the basic and advanced types we discussed above.

var_2 <- 34.5 #numeric/double
var_3 <- 6L #integer
var_4 <- TRUE #logical/boolean
var_5 <- 1 + 3i #complex

Operations

In R, we can also perform operations on objects; the type of an object defines what operations are valid. All of the basic mathematical and logical operations you are familiar with are examples of these, but there are many more. For example:

a <- 4 # creates an object named "a" assigned to the value: 4
b <- 6 # creates an object named "b" assigned to the value: 6
c <- a + b # creates an object "c" assigned to the value (a = 4) + (b = 6)

Try and think about what value c holds!

We can view the assigned value of c in two different ways:

By printing a + b
By printing c

Run the code cell below to see for yourself!

a + b
c

It is also possible to change the value of our objects. In the example below, the object b has been reassigned the value 5.

b <- 5

R will now store the updated value of 5 in the object b. This overrides the original assignment of 6 to b. The ability to change object names is a key benefit using variables in R. We can simply reassign the value to a variable without having to change that value everywhere in our code. This will be quite useful when we want to do things such as change the name of a column in a dataset.

To Remember: use a unique object name that hasn’t been used before in order to avoid unplanned object reassignments when creating a new object. The more descriptive, the better!

Comments

While developing our code, we do not always have to use markdown cells to document our process. We can also write notes in code cells using something called a comment. A comment simply allows us to write lines in our code cell which will not run when we run the cell itself. By simply typing the # sign, anything written directly after this sign and on the same line will not run; it is a comment. To comment out multiple lines of code, simply include the # sign at the start of each line.

Note: In general, the purpose of comments is to make the source code easier for readers to understand. Remember the concept of reproducibility from our last notebook?

It is important to comment on our code for three main reasons:

It allows us to keep track of our actions and thought process: Commenting is a great way to help us stay organized. Code comments provide an ordered process for everyone to follow. In case we need to debug our codes, we can easily track which step is problematic and come back to that particular line of code that may be the source of the problem.
It helps readers understand why we’re coding in a particular way: While coding something like a + b may be a more or less straightforward computation, our reader may not be able to understand what a or b are referring to, or why they need to be added to each other. Our readers or other developers may ask: why is addition used instead of multiplication or division? With comments, we can explain why this particular method was used for this particular code block and how it relates to other code blocks.
It saves everyone’s time in the future, including yourself: It’s far easier than you might expect to forget what a piece of code does, or is supposed to do. Keeping good comments ensures that your code remains comprehensible.

To Remember: an old woodworker’s tip is to always label something when taking it apart so that a stranger could put the pieces back together. The same advice applies to comments and coding: write code so that a stranger could figure out what it is supposed to do.

Generally, it is always a good idea to add comments to our code. However, if we find ourselves needing to explain an important block of code using lines upon lines of comments, it is preferable to use a markdown cell instead to give ourselves more room. Comments are best served for the reasons above.

More on Operators

Earlier, we used discussed operations and used the example of + to run the addition of a and b. + is a type of R arithmetic operator, a symbol that tells R to perform a specific operation. We can use different R operators with variables. R has 4 types of operators:

Arithmetic operators: used to carry out mathematical operations. Ex. * for multiplication, / for division, ^ for exponent etc.
Assignment operators: used to assign values to variables. Ex. <-
Relational operators: used to compare between values. Ex. > for greater than, == for equal to, != for not equal to etc.
Logical operators: used to carry out Boolean operations. Ex. ! for Logical NOT, & for Logical AND etc.

We won’t cover all of these right now, but you can look them up online; for now, keep an eye out for them when they occur.

Functions

These simple operations are great to start with, but what if we want to do operations on different values of X and Y over and over again and don’t want to constantly rewrite this code? This is where functions come in. Functions allow us to carry out specific tasks. We simply pass in a parameter or parameters to the function. Code is then executed in the function body based on these parameters, and output may be returned.

# function_name <- function(arguments)
#  {code operating on the arguments
#   }

This structure says that we start with a name for our function (function_name) and we use the assignment operator similarly to when we assign values to variables. We then pass arguments or parameters to our function (which can be numeric, characters, vectors, collections such as lists, etc.): those are the inputs to the function.

Finally, within the curly brackets we write our code needed to accomplish our desired task. Once we have done this, we can call this function anywhere in our code (after having run the cell defining the function!) and evaluate it based on specific parameter values.

An example is shown below; can you figure out what this function does?

my_function <- function(x, y)
 { k <- 2*(x + y)
 print(k)
}

The function (1) takes the inputs (x,y) given by the user (2) assigns the value of 2 times x plus y to some object k (3) prints the object k. We can call this function with any parameters we want (as long as it is a number), and the function will output the operations for the inputs we specified. For example, let’s call the function for the values x = 2 and y = 3.

my_function(x = 2, y = 3)

The parameters placed into functions can be given defaults. Defaults are specific values for parameters that have been chosen and defined within the circular brackets of the function definition. For example, we can define y = 3 as a default in our my_function. Then, when we call our function, we do not have to specify an input for y unless we want to.

my_function <- function(x, y = 3)
 { k <- 2*(x + y)
 print(k)
}

my_function(2)

If we want to override this default parameter, we can simply call the function with a new input for y. This is done below for y = 4, allowing us to execute our code as though our default was actually y = 4.

my_function <- function(x, y = 3)
 { k <- 2*(x + y)
 print(k)
}

my_function(x = 2, y = 4)

Finally, note that we can nest functions within functions; meaning we can call functions inside of other functions, creating very complex arrangements. Just be sure that these inner functions have themselves already been defined.

my_function_1 <- function(x, y)
 { k <- 2*(x + y)
}

my_function_2 <- function(x, y)
 {j <- x + y - my_function_1(x, y)
  i <- 2 * j
  print(i)
}

my_function_2(2, 3)

Luckily, we usually don’t have to define our own functions, since most useful built-in functions we need already come with R - although we may need to import specific packages to access them. We can always use the help ? feature in R to learn more about a built-in function if we’re unsure. For example, ?max gives us more information about the max() function.

For more information about how you can read and use different functions, please refer to the Function Cheat Sheet.

Test Your Knowledge

In this exercise:

create an object u which is equal to 1
create an object y which is equal to 7
create an object w which is equal to 10.
create an object answer_9 which is equal to the sum of u and y, divided by w

... <- 1 # fill in the missing code
y <- ...
... <- ...

answer_9 <- ...

test_9()

Now multiply each of u, y, and w by 2, and store the answers in the objects v1, v2, and v3, respectively.
Create a vector with objects v1, v2, and v3, and store the vector in answer_10. Make sure the elements are exactly in that order!

v1 <- ...
v2 <- ...
v3 <- ...

answer_10 <- ...

test_10()

In this exercise:

Create a function divide which takes in two arguments, x and y. The function should return x divided by y.
Store the solution to divide(5,3) in the object answer_11:

divide <- function(x,y) {
    ...
    }


answer_11 <- ...(x = 5, y = 3)

test_11()

Part 3: Dealing with Errors

Sometimes in our analysis we can run into errors in our code. This happens to everyone - don’t worry - it’s not a reason to panic. Understanding the nature of the error we are confronted with can be a helpful first step to finding a solution. There are two common types of errors:

Syntax errors: This is the most common error type. These errors result from invalid code statements/structures that R doesn’t understand. Suppose R speaks English, speaking to it in German or broken English certainly would not work! Here are some examples of common syntax errors: the associated package is not loaded, misspelling of a command as R is case-sensitive, unmatched/incomplete parenthesis etc. How we handle syntax errors is case-by-case: we can usually solve syntax errors by reading the error message and finding what is often a typo or by looking up the error message on the internet using resources like stack overflow.
Semantic errors: These errors result from valid code that successfully executes but produces unintended outcomes. Again, let us suppose R speaks English. Although we asked it to hand us an apple in English and R successfully understood, it somehow handed us a banana! This is not okay! How we handle semantic errors is also case-by-case: we can usually solve semantic errors by reading the error message and searching it online.

Now that we have all of these terms and tools at our disposal, we can begin to load in data and operate on it using what we’ve learned.

Test your knowledge

In this exercise, you will be asked to extract data and perform basic operations with a dataset of statistics from violent crime rates in US states. This is a dataset that comes with R. For now, don’t worry about how we get and load the data (you’ll learn about it in the next modules); just focus on what the questions ask you to do. This is intended to be a hard exercise - if you complete it successfully, it means you understood the content well! Let’s get to it:

Use the function str to view the structure of the data frame us_arrests. How many columns and rows do we have in the dataset?

# use this cell to write your code if you need

# write your answers here
numrows <- ...
numcols <- ...

answer_12 <- c(numrows, numcols) # don't change this!

test_12()

The units of our numeric columns are as follows:

Murder: Murder arrests (per 100,000)
Assault: Assault arrests (per 100,000)
UrbanPop: Percent urban population
Rape: Rape arrests (per 100,000)

States are listed in alphabetical order in the column States.

Extract the murder rate in Arizona and store it in answer_13.

# use this cell to write your code if you need

# write your answer here
answer_13 <- ...

test_13()

What is the absolute value of the difference between the murder rate in New York and in Colorado? Store the answer in answer_14.

Hint: New York is the 32nd state and Colorado is the 6th state in alphabetical order

Hint 2: The function abs() returns the absolute value of your input

# use this cell to write your code if you need

# write your answer here
murder_ny <- ...

murder_co <- ...

answer_14 <- abs(...)

test_14()

And what about the differences in Assault and Rape? Create a vector with the absolute value of the differences as elements, and store it in answer_15. Make sure you order your elements as (1) murder, (2) assault, and (3) rape.

# use this cell to write your code if you need

# write your answer here
murder <- ...

assault <- ...

rape <- ...

answer_15 <- ...

test_15()