# Run this cell
source("getting_started_intro_to_r_tests.r")
0.2 - Introduction to R
Outline
Prerequisites:
- Introduction to Jupyter
Learning Objectives
- Understand variables, functions and objects in R
- Import and load data into Jupyter Notebook
- Access and perform manipulations on data
Introduction
In this notebook, we will be introducing R, which is a programming language that is particularly well-suited for statistics, econometrics, and data science. If you are familiar with other programming languages, such as Python, this will likely be very familiar to you - if this is your first time coding, don’t be intimidated! Try to play around with the examples and exercises as you work through this notebook; it’s easiest to learn R (or any programming language) by trying things for yourself.
Part 1: Manipulating Data
To begin, it’s important to get a good grasp of the different data types in R and how to use them. Whenever we work with R, we will be manipulating different kinds of information, which is referred to as “data”. Data comes in many different forms, and these forms define how we can use it in calculations or visualizations - these are called types in R.
R has 6 basic data types. Data types are used to store information about a variable or object in R:
- Character: data in text format, like “word” or “abc”
- Numeric (real or decimal): data in real number format, like 6, or 18.8 (referred to as Double in R)
- Integer: data in a whole number (integer) format, like 2L (the L tells R to store this as an integer)
- Logical: truth values, like TRUE or FALSE
- Complex: data in complex (i.e. imaginary) format, like 1+6i (where \(i\) is the \(\sqrt{-1}\))
- Raw: raw digital data, an unusual type which we will not cover here
If we are ever wondering what type an object in R is, or what its properties are, we can use the following two functions, which allow us to examine the data type and elements contained within an object:
typeof()
: this function returns a character string that corresponds to the data type of an objectstr()
: this function displays a compact internal structure of an R object
We will see some examples of these in just a moment.
Data Structures
We often need to store data in complex forms. Data can also be stored in different structures in R beyond basic data types. Data structures in R programming can be complicated, as they are tools for holding multiple values. However, some of them are very important and are worth discussing here.
- Vectors: a vector of values, like \((1,3,5,7)\)
- Matrices: a matrix of values, like \([1,2; 3,4]\) (usually displayed as a square)
- Lists: a list of elements, like \((\)pet = “cat,”dog”, “mouse”\()\), with named properties
- Dataframe: a collection of vectors or lists, organized into rows and columns according to observations
Note that vectors don’t need to be numeric! There are some useful built-in functions to create data structures (we don’t have to create our own functions to do so).
c()
: this function combines values into a vectormatrix()
: this function creates a matrix from a given set of valueslist()
: this function creates a list from a given set of valuesdata.frame()
: this function creates a data frame from a given set of lists or vectors
Okay, enough background - lets see this in action!
Working with Vectors
Vectors are important. We can create them from values or other elements, using the c()
function:
# generates a vector containing values
<- c(1, 2, 3)
z
# generates a vector containing characters
<- c("Canada", "Japan", "United Kingdom") countries
We can also access the elements of the vector. Since a vector is made of basic data, we can get those elements using the [ ]
index notation. This is very similar to how in mathematical notation we refer to elements of a vector.
Note: if you’re familiar with other programming languages, it’s important to note that R is 1-indexed. So, the first element of a vector is 1, not 0. Keep this in mind!
# extract the 2nd component of "z"
2]
z[
# extract the 3rd component of "countries"
3] countries[
As mentioned above, we can use the typeof()
and str()
functions to glimpse the kind of data stored in our objects. Run the cell below to see how this works:
# view the data type of countries
typeof(countries)
# view the data structure of countries
str(countries)
# view the data type of z
typeof(z)
# view the data structure of z
str(z)
The output of str(countries)
begins by acknowledging that the contained data is of a character (chr) type. The information contained in the [1:3]
first refers to the component number (there is only 1 component list here) and then the number of observations (the 3 countries).
Working with Matrices
Just like vectors, we can also create matrices; you can think of them as organized collections of rows (or columns), which are vectors. They’re a little bit more complicated to create manually, since you need to use a more complex function.
The simplest way to make a matrix is to provide a vector of all the values you are interested in including, and then tell R how the matrix is organized. R will then fill in the values:
# generates a 2 x 2 matrix
<- matrix(c(2,3,6,7,7,3), nrow=2,ncol=3)
m
print(m)
Take note of the order in which the values are filled in; it might be unexpected!
Just like with vectors, we can also access parts of the matrix. If you look at the cell output above, you will see some notation like [1,]
or [,2]
. These are the rows and columns of the matrix. We can refer to them using this notation. We can also refer to elements using [1,2]
. Again, this is very similar to the mathematical notation for matrices.
# If we want to access specific parts of the matrix:
# 2th column of matrix
2]
m[,
# 1st row of matrix
1,]
m[
# Element in row 1, column 2
1,2] m[
As with vectors, we can also observe and inspect the data structures of matrices using the helper function above.
# what type is m?
typeof(m)
# glimpse data structure of m
str(m)
The output of str(m)
begins by displaying that the data in the matrix is of a numeric (num) type. The [1:2, 1:3]
shows the structure of the rows and columns. The final part displays the values in the matrix.
Working with Lists
Lists are a little bit more complex because they can store many different data types and objects, each of which can be given names which are specific ways to refer to these objects. Names can be any useful descriptive term for an element of the list. You can think of lists like flexible vectors with names.
# generates a list with 3 components named "text" "a_vector" and "a_matrix"
<- list(text="test", a_vector = z, a_matrix = m) my_list
We can access elements of the list using the [ ]
or [[ ]]
operations. There is a difference:
[ ]
accesses the elements of the list which is the name and object[[ ]]
accesses the object directly
We usually want to use [[ ]]
when working with data stored in lists. One very nice feature is that you can refer to elements of a list by number (like a vector) or by their name.
# If we want to access specific parts of the list:
# 1st component in list
1]]
my_list[[
# 1st component in list by name (text)
"text"]]
my_list[[
# 1st part of the list (note the brackets)
1]
my_list[
# glimpse data type of my_list
typeof(my_list)
There is one final way to access elements of a list by name: using the $
or access operator. This works basically like [[name]]
but is more transparent when writing code. You put the object you want to access, followed by the operator, followed by the property:
# access the named property "text"
$text
my_list
#access the named property "a_matrix"
$a_matrix my_list
You will notice that this only works for named objects - which is particularly convenient for data frames, which we will discuss next.
Working with Dataframes
Dataframes are the most complex object you will work with in this course but also the most important. They represent data - like the kind of data we would use in econometrics. In this course, we will primarily focus on tidy data, which refers to data in which the columns represent variables, and the rows represent observations. In terms of R, you can think of data-frames as a combination of a matrix and a list.
We can access columns (variables) using their names, or their ordering
# generates a dataframe with 2 columns and 3 rows
<- data.frame(ID=c(1:3),
df Country=c('Canada', 'Japan', 'United Kingdom'))
# If we want access specific parts of the dataframe:
# 2nd column in dataframe
2]
df[
$Country
df
# glimpse compact data structure of df
str(df)
Notice that the str(df)
command shows us what the names of the columns are in this dataset and how we can access them.
Test your knowledge
Let’s see if you understand how to create new vectors! In the block below:
- Create an object named
my_vector
which is a vector and which contains the numbers from 10 to 15, inclusive. - Extract the 4th element of
my_vector
and store it in the objectanswer_1
.
<- c(...) # replace ... with the appropriate code
my_vector
<- my_vector[...] # replace ... with the appropriate code
answer_1
test_1()
- Create a new vector named
my_second_vector
with the elementsmy
,second
, andvector
(all lowercase!). - Extract the 3th element of
my_second_vector
and store it in the objectanswer_2
.
<- (...)
my_second_vector
<- (...)
answer_2
test_2()
In this exercise:
- Create an object named
mat
which is a matrix with 2 rows and 2 columns. The first column will take on values 11, 22, while the second column will take on values 33, 44. - Extract the value in the first row, second column from
mat
and store it in the objectanswer_2
<- matrix(..., nrow=...,ncol=...) # fill in the missing code
mat
<- mat[...] # fill in the missing code
answer_3
test_3()
- Given the matrix below, extract the element on the 30th row and 10th column, store it in an object called
first
. - Extract the element on the 27th row and 13th column, store it in an object called
second
. - Extract the element on the 12th row and 33rd column, store it in an object called
third
. - Create a vector with the three elements and store it in the object
answer_4
.
<- matrix(c(1:1050), nrow=30,ncol=35)
mat2
<- ...
first
<- ...
second
<- ...
third
<- ...
answer_4
test_4()
- Now, extract the 8th row of
mat2
and store it inanswer_5
.
<- ...
answer_5
test_5()
In this exercise, you will need to:
- Create an object named
a_list
, which is a list with two components: an element calledstring
which stores the character string “Hello World”, and an element calledrange
which contains a vector with values 1 through 5. - Extract the elements of the second object, and store it in the variable
answer_6
.
<- list(... = "Hello World", range = ...) # fill in the missing code
a_list
<- a_list... # fill in the missing code
answer_6
test_6()
In this exercise:
- Create an object
my_dataframe
which is a dataframe with two variables and two observations. The first columnvar1
will take on values1,2,3
. The second columnvar2
will take on values"A", "B", "C"
. - Extract the column
var1
and store it in the objectanswer_7
<- data.frame(var1=..., ...=c(...)) # fill in the missing code
my_dataframe
<- ... # fill in the missing code
answer_7
test_7()
Now, we have created a data frame called hidden_df
. Select the third element from the column col4
of the data frame hidden_df
and store it in answer_8
. Try to do that without actually seeing the dataset.
# Hint: you can think of a column of a dataset as a vector
<- ... # fill in the missing code
answer_8
test_8()
Part 2: Operations with Variables
At this point, you are familiar with some of the different types of data in R and how they work. However, let’s understand how we can work with these data types in more detail by writing R code. A variable or object is a name assigned to a memory location in the R workspace (working memory). For now we can use the terms variable and object interchangeably. An object will always have an associated type, determined by the information assigned to it. Clear and concise object assignment is essential for reproducible data analysis, as mentioned in the module Intro to Jupyter.
When it comes to code, we can assign information (stored in a specific data type) to variables and objects using the assignment operator <-
. Using the assignment operator, the information on the right-hand side is assigned to the variable/object on the left-hand side; we’ve seen this before, in some of the examples earlier.
In the example [2] below, "Hello"
has been assigned to the object var_1
. "Hello"
will be stored in the R workspace as an object named "var_1"
.
Note: R is case sensitive. When referring to an object, it must exactly match the assignment.
Var_1
is not the same asvar_1
orvar1
.
<- "Hello"
var_1
var_1
typeof(var_1)
You can create variables of many different types, including all of the basic and advanced types we discussed above.
<- 34.5 #numeric/double
var_2 <- 6L #integer
var_3 <- TRUE #logical/boolean
var_4 <- 1 + 3i #complex var_5
Operations
In R, we can also perform operations on objects; the type of an object defines what operations are valid. All of the basic mathematical and logical operations you are familiar with are examples of these, but there are many more. For example:
<- 4 # creates an object named "a" assigned to the value: 4
a <- 6 # creates an object named "b" assigned to the value: 6
b <- a + b # creates an object "c" assigned to the value (a = 4) + (b = 6) c
Try and think about what value c holds!
We can view the assigned value of c
in two different ways:
- By printing
a + b
- By printing
c
Run the code cell below to see for yourself!
+ b
a c
It is also possible to change the value of our objects. In the example below, the object b
has been reassigned the value 5.
<- 5 b
R will now store the updated value of 5 in the object b
. This overrides the original assignment of 6 to b
. The ability to change object names is a key benefit using variables in R. We can simply reassign the value to a variable without having to change that value everywhere in our code. This will be quite useful when we want to do things such as change the name of a column in a dataset.
To Remember: use a unique object name that hasn’t been used before in order to avoid unplanned object reassignments when creating a new object. The more descriptive, the better!
More on Operators
Earlier, we used discussed operations and used the example of +
to run the addition of a
and b
. +
is a type of R arithmetic operator, a symbol that tells R to perform a specific operation. We can use different R operators with variables. R has 4 types of operators:
- Arithmetic operators: used to carry out mathematical operations. Ex.
*
for multiplication,/
for division,^
for exponent etc. - Assignment operators: used to assign values to variables. Ex.
<-
- Relational operators: used to compare between values. Ex.
>
for greater than,==
for equal to,!=
for not equal to etc. - Logical operators: used to carry out Boolean operations. Ex.
!
for Logical NOT,&
for Logical AND etc.
We won’t cover all of these right now, but you can look them up online; for now, keep an eye out for them when they occur.
Functions
These simple operations are great to start with, but what if we want to do operations on different values of X and Y over and over again and don’t want to constantly rewrite this code? This is where functions come in. Functions allow us to carry out specific tasks. We simply pass in a parameter or parameters to the function. Code is then executed in the function body based on these parameters, and output may be returned.
# function_name <- function(arguments)
# {code operating on the arguments
# }
This structure says that we start with a name for our function (function_name
) and we use the assignment operator similarly to when we assign values to variables. We then pass arguments or parameters to our function (which can be numeric, characters, vectors, collections such as lists, etc.): those are the inputs to the function.
Finally, within the curly brackets we write our code needed to accomplish our desired task. Once we have done this, we can call this function anywhere in our code (after having run the cell defining the function!) and evaluate it based on specific parameter values.
An example is shown below; can you figure out what this function does?
<- function(x, y)
my_function <- 2*(x + y)
{ k print(k)
}
The function (1) takes the inputs (x,y)
given by the user (2) assigns the value of 2 times x
plus y
to some object k
(3) prints the object k
. We can call this function with any parameters we want (as long as it is a number), and the function will output the operations for the inputs we specified. For example, let’s call the function for the values x = 2
and y = 3
.
my_function(x = 2, y = 3)
The parameters placed into functions can be given defaults. Defaults are specific values for parameters that have been chosen and defined within the circular brackets of the function definition. For example, we can define y = 3
as a default in our my_function
. Then, when we call our function, we do not have to specify an input for y
unless we want to.
<- function(x, y = 3)
my_function <- 2*(x + y)
{ k print(k)
}
my_function(2)
If we want to override this default parameter, we can simply call the function with a new input for y
. This is done below for y = 4
, allowing us to execute our code as though our default was actually y = 4
.
<- function(x, y = 3)
my_function <- 2*(x + y)
{ k print(k)
}
my_function(x = 2, y = 4)
Finally, note that we can nest functions within functions; meaning we can call functions inside of other functions, creating very complex arrangements. Just be sure that these inner functions have themselves already been defined.
<- function(x, y)
my_function_1 <- 2*(x + y)
{ k
}
<- function(x, y)
my_function_2 <- x + y - my_function_1(x, y)
{j <- 2 * j
i print(i)
}
my_function_2(2, 3)
Luckily, we usually don’t have to define our own functions, since most useful built-in functions we need already come with R - although we may need to import specific packages to access them. We can always use the help ?
feature in R to learn more about a built-in function if we’re unsure. For example, ?max
gives us more information about the max()
function.
For more information about how you can read and use different functions, please refer to the Function Cheat Sheet.
Test Your Knowledge
In this exercise:
- create an object
u
which is equal to 1 - create an object
y
which is equal to 7 - create an object
w
which is equal to 10. - create an object
answer_9
which is equal to the sum ofu
andy
, divided byw
<- 1 # fill in the missing code
... <- ...
y <- ...
...
<- ...
answer_9
test_9()
- Now multiply each of
u
,y
, andw
by 2, and store the answers in the objectsv1
,v2
, andv3
, respectively. - Create a vector with objects
v1
,v2
, andv3
, and store the vector inanswer_10
. Make sure the elements are exactly in that order!
<- ...
v1 <- ...
v2 <- ...
v3
<- ...
answer_10
test_10()
In this exercise:
- Create a function
divide
which takes in two arguments,x
andy
. The function should returnx
divided byy
. - Store the solution to
divide(5,3)
in the objectanswer_11
:
<- function(x,y) {
divide
...
}
<- ...(x = 5, y = 3)
answer_11
test_11()
Part 3: Dealing with Errors
Sometimes in our analysis we can run into errors in our code. This happens to everyone - don’t worry - it’s not a reason to panic. Understanding the nature of the error we are confronted with can be a helpful first step to finding a solution. There are two common types of errors:
Syntax errors: This is the most common error type. These errors result from invalid code statements/structures that R doesn’t understand. Suppose R speaks English, speaking to it in German or broken English certainly would not work! Here are some examples of common syntax errors: the associated package is not loaded, misspelling of a command as R is case-sensitive, unmatched/incomplete parenthesis etc. How we handle syntax errors is case-by-case: we can usually solve syntax errors by reading the error message and finding what is often a typo or by looking up the error message on the internet using resources like stack overflow.
Semantic errors: These errors result from valid code that successfully executes but produces unintended outcomes. Again, let us suppose R speaks English. Although we asked it to hand us an apple in English and R successfully understood, it somehow handed us a banana! This is not okay! How we handle semantic errors is also case-by-case: we can usually solve semantic errors by reading the error message and searching it online.
Now that we have all of these terms and tools at our disposal, we can begin to load in data and operate on it using what we’ve learned.
Test your knowledge
In this exercise, you will be asked to extract data and perform basic operations with a dataset of statistics from violent crime rates in US states. This is a dataset that comes with R. For now, don’t worry about how we get and load the data (you’ll learn about it in the next modules); just focus on what the questions ask you to do. This is intended to be a hard exercise - if you complete it successfully, it means you understood the content well! Let’s get to it:
- Use the function
str
to view the structure of the data frameus_arrests
. How many columns and rows do we have in the dataset?
# use this cell to write your code if you need
# write your answers here
<- ...
numrows <- ...
numcols
<- c(numrows, numcols) # don't change this!
answer_12
test_12()
The units of our numeric columns are as follows:
Murder
: Murder arrests (per 100,000)Assault
: Assault arrests (per 100,000)UrbanPop
: Percent urban populationRape
: Rape arrests (per 100,000)
States are listed in alphabetical order in the column States
.
- Extract the murder rate in Arizona and store it in
answer_13
.
# use this cell to write your code if you need
# write your answer here
<- ...
answer_13
test_13()
- What is the absolute value of the difference between the murder rate in New York and in Colorado? Store the answer in
answer_14
.
Hint: New York is the 32nd state and Colorado is the 6th state in alphabetical order
Hint 2: The function
abs()
returns the absolute value of your input
# use this cell to write your code if you need
# write your answer here
<- ...
murder_ny
<- ...
murder_co
<- abs(...)
answer_14
test_14()
- And what about the differences in
Assault
andRape
? Create a vector with the absolute value of the differences as elements, and store it inanswer_15
. Make sure you order your elements as (1) murder, (2) assault, and (3) rape.
# use this cell to write your code if you need
# write your answer here
<- ...
murder
<- ...
assault
<- ...
rape
<- ...
answer_15
test_15()
Comments
While developing our code, we do not always have to use markdown cells to document our process. We can also write notes in code cells using something called a comment. A comment simply allows us to write lines in our code cell which will not run when we run the cell itself. By simply typing the
#
sign, anything written directly after this sign and on the same line will not run; it is a comment. To comment out multiple lines of code, simply include the#
sign at the start of each line.It is important to comment on our code for three main reasons:
It allows us to keep track of our actions and thought process: Commenting is a great way to help us stay organized. Code comments provide an ordered process for everyone to follow. In case we need to debug our codes, we can easily track which step is problematic and come back to that particular line of code that may be the source of the problem.
It helps readers understand why we’re coding in a particular way: While coding something like
a + b
may be a more or less straightforward computation, our reader may not be able to understand whata
orb
are referring to, or why they need to be added to each other. Our readers or other developers may ask: why is addition used instead of multiplication or division? With comments, we can explain why this particular method was used for this particular code block and how it relates to other code blocks.It saves everyone’s time in the future, including yourself: It’s far easier than you might expect to forget what a piece of code does, or is supposed to do. Keeping good comments ensures that your code remains comprehensible.
Generally, it is always a good idea to add comments to our code. However, if we find ourselves needing to explain an important block of code using lines upon lines of comments, it is preferable to use a markdown cell instead to give ourselves more room. Comments are best served for the reasons above.