library(IRdisplay)
03 - R Essentials
Prerequisites
- Understand how to effectively use R scripts or create Jupyter cells.
Learning Outcomes
- Understand objects, variables, and functions in R.
3.0 Setting Up
Run the code cell below before starting!
3.1 Basics of Using R
In this notebook, we will be introduced to R. R is a programming language which is particularly well-suited for statistics, econometrics, and data science. If you are familiar with other programming languages such as Python, this will likely be very familiar. If this is your first time working with a programming language, don’t be intimidated! Try to play around with the examples as you work through this notebook; it’s easiest to learn R (or any programming language) by playing around with it.
R is an object oriented programming language. This means that we can create many different things (e.g. data sets, matrices, vectors, scalars) within it and they will all be stored in the same environment and accessed the same way.
Every new line in R is understood as
function_name(input1 = valid_alternatives, input2 = valid_alternatives, ... )
Fairly simple stuff! However, we first need to understand the basic data types that exist in R and how we can put these into different objects/data structures. Usually we use functions that are provided in a given library
(package), but later we’ll also look at how to create our own functions.
3.2 Basic Data Types
To begin, it’s important to get a good grasp of the different data types in R and how to use them. Whenever we work with R, we will be manipulating different kinds of information referred to as “data”. Data comes in many different forms, called types, which define how we can use it in calculations or visualizations in R.
R has 6 basic data types. Data types are used to store information about a variable or object in R:
- Character: data in text format, like “word” or “abc”;
- Numeric (real or decimal): data in real number format, like 6 or 18.8 (referred to as Double or dbl in R);
- Integer: data in whole number (integer) format, like 2L (the L tells R to store this as an integer);
- Logical: truth values, like TRUE or FALSE;
- Complex: data in complex (i.e. imaginary) format, like 1+6i (where \(i\) is the \(\sqrt{-1}\));
- Raw: raw digital data, which is unusual and which will not be covered in this section.
If we are ever wondering what kind of type an object in R has, or what its properties are, we can use the following two functions that allow us to examine the data type and elements contained within an object:
typeof()
: this function returns a character string that corresponds to the data type of an object;str()
: this function displays a compact internal structure of an R object.
We will see some examples of these in just a moment.
3.3 Data Structures
Basic data is fine, but we often need to store data in more complex forms. Luckily, data can be stored in different structures in R beyond these basic data types. Below are some of the most important data structures in R, each of which we will look at individually in greater detail.
- Vectors: a vector of values, like \((1,3,5,7)\);
- Matrices: a matrix of values, like \([1,2; 3,4]\) (usually displayed as a square);
- Lists: a list of elements with named properties, like \((pet =''cat'', ''dog'', ''mouse'')\);
- Data frames: a collection of vectors or lists, organized into rows and columns according to observations.
Note that vectors don’t need to be numeric! We can also use some useful built-in functions to create data structures (we don’t have to create our own functions to do so).
c
: this function combines values into a vector;matrix
: this function creates a matrix from a given set of values;list
: this function creates a list from a given set of values;data.frame
: this function creates a data frame from a given set of lists or vectors.
Let’s look at each of these four data structures in turn.
3.3.1 Vectors
Vectors are important, and we can work with them by creating them from values or other elements using the c()
function:
# generate a vector containing values
<- c(1, 2, 3)
z
# generate a vector containing characters
<- c("Canada", "Japan", "United Kingdom") countries
We can also access the elements of a vector. Since a vector is made of basic data, we can access its elements using the [ ]
index notation. This is very similar to how we refer to elements of a vector in mathematical notation.
Below we access specific components of the z and countries vectors that have already been defined.
# the 2nd component of z
2]
z[
# the 2nd component of countries
2] countries[
As mentioned above, we can use the typeof
and str
functions to glimpse the kind of data stored in our objects.
Run the cell below to see how this works:
# view the data type of countries
typeof(countries)
# view the data structure of countries
str(countries)
# view the data type of z
typeof(z)
# view the data structure of z
str(z)
The output of str(countries)
begins by acknowledging that the contained data is of a character (chr) type. The information contained in the [1:3]
first refers to the component number (there is only 1 component list here) and then the number of observations (the 3 countries).
3.3.2 Matrices
Just like vectors, we can also create matrices; we can think of matrices as organized collections of row (or column) vectors. They’re a little bit more complicated to create manually since we need to use a more complex function: matrix
. However, the simplest way to create them is just to provide a vector of all the values to this function, then tell R how the matrix should be organized. R will then fill in the specified values. An example is below.
# generate a 2 x 2 matrix
<- matrix(c(2,3,6,7,7,3), nrow=2,ncol=3)
m
print(m)
Take note of the order in which the values are filled in; it might be unexpected to you!
Just like with vectors, we can also access parts of a matrix. If we look at the cell output above, we can see some notation like [1,]
and [,2]
. These are the rows and columns of the matrix. We can refer to them using this notation. We can also refer to specific elements using [1,2]
. Again, this is very similar to the mathematical notation for matrices. Below we access specific columns, rows, and elements of the matrix m.
# 2nd column of matrix
2]
m[,
# 1st row of matrix
1,]
m[
# Element in row 1, column 2
1,2] m[
As with vectors, we can also observe and inspect the data structures of matrices using the helper function above.
# what type is m?
typeof(m)
# glimpse data structure of m
str(m)
The output of str(m)
begins by displaying that the data in the matrix is of a numeric (num) type. The [1:2, 1:3]
shows the structure of the rows and columns. The final part displays the values in the matrix.
3.3.3 Lists
Lists are a little bit more complex because they can store many different data types and objects, each of which can be given names which are specific ways to refer to these objects. Names can be any useful descriptive term for an element of a list. You can think of lists as flexible vectors with names. Let’s generate a list below.
# generate a list with 3 components named "text" "a_vector" and "a_matrix"
<- list(text="test", a_vector = z, a_matrix = m) my_list
We can access elements of the list using the [ ]
or [[ ]]
operations. There is a difference:
[ ]
accesses the elements of the list: the name and object;[[ ]]
accesses the object directly.
We usually want to use [[ ]]
when working with data stored in lists. One very nice feature of lists is that we can refer to their elements by number (like a vector) or by their name. Let’s access specific components of the list both by name and number below.
# 1st component in list
1]]
my_list[[
#1st component in list by name (text)
"text"]]
my_list[[
# 1st part of the list (note the brackets)
1]
my_list[
# glimpse data type of my_list
typeof(my_list)
There is one final way to access elements of a list by name: using the $
or access operator. This works basically like [[name]]
but is more transparent when writing code. We write down the object we want to access, followed by the operator, followed by the property. Let’s do this below.
# get the named property "text"
$text
my_list
#get the name property
$a_matrix my_list
We can see that this only works for named objects. This is particularly convenient for data frames, which we will discuss next.
3.3.4 Data frames
Data frames are the most complex object we will work with in this module, but also the most important. They represent data - like the kind of data we use in econometrics. In this module, we will primarily focus on tidy data, data in which the columns represent variables and the rows represent observations. In terms of R, we can think of data frames as a combination of a matrix and a list. Let’s generate a data frame below using the data.frame
function.
# generates a dataframe with 2 columns and 3 rows
<- data.frame(ID=c(1:3),
df Country=countries)
We can access specific columns (variables) of this data frame using their names or their ordering. We can also use the str
function like before to inspect the data structure of this new data frame df.
# If we want access specific parts of the dataframe:
# 2nd column in dataframe
2]
df[
$Country
df
# glimpse compact data structure of df
str(df)
Notice that the str(df)
command shows us the names of the columns in this data set, as well as how we can access them.
3.4 Objects and Variables
At this point, we have now covered some of the different types of data in R and how they work. However, let’s see how we can work with them in more detail by writing R code. A variable or object is a name assigned to a memory location in the R workspace (working memory). For now we can use the terms variable and object interchangeably. An object will always have an associated type, determined by the information assigned to it. Clear and concise object assignment is essential for reproducible data analysis, as mentioned in Module 2.
When it comes to code, we can assign information (stored in a specific data type) to variables and objects using the assignment operator <-
. With the assignment operator, the information on the right-hand side is assigned to the variable/object on the left-hand side. We’ve seen this already with some of the vectors, lists, matrices, and data frames defined earlier in this notebook.
Var_1
is not the same as var_1
or var1
.
In the example below, "Hello"
has been assigned to the object var_1
. "Hello"
will be stored in the R workspace as an object named "var_1"
, which we can call at any time.
<- "Hello"
var_1
var_1
typeof(var_1)
We can create variables of many types, including all of the basic and advanced types discussed above. Below are some examples of four different type objects assigned to four different variables.
<- 34.5 # numeric/double
var_2 <- 6L # integer
var_3 <- TRUE # logical/boolean
var_4 <- 1 + 3i # complex var_5
3.5 Operations
In R, we can also perform operations on objects; the type of an object defines what operations are valid. All of the basic mathematical and logical operations we are familiar with are examples of these, but there are many more. For example:
<- 4 # creates an object named "a" assigned to the value: 4
a <- 6 # creates an object named "b" assigned to the value: 6
b <- a + b # creates an object "c" assigned to the value (a = 4) + (b = 6) c
Try and think about what value c holds!
We can view the assigned value of c in two different ways:
- By printing
a + b
- By printing
c
Run the code cell below to see for yourself!
+ b
a c
It is also possible to change the value of an object. In the example below, the object b has been reassigned the value 5.
<- 5 b
R will now store the updated value of 5 in the object b. This overrides the original assignment of 6 to b. The ability to change object names is a key benefit of using variables in R. We can simply reassign the value to a variable without having to change that value everywhere in our code. This will be quite useful when we want to do things such as change the name of a column in a data set in a future module.
Earlier, we discussed operations and used the example of +
to run the addition of a and b. +
is a type of arithmetic operator, meaning it is a symbol that tells R to perform a specific operation. R has 4 types of operators, some of which we’ve already seen and some of which we haven’t:
- Arithmetic operators: used to carry out mathematical operations. Ex.
*
for multiplication,/
for division,^
for exponentiation, etc.; - Assignment operators: used to assign values to variables. Ex.
<-
; - Relational operators: used to compare between values. Ex.
>
for greater than,==
for equal to,!=
for not equal to etc.; - Logical operators: used to carry out Boolean operations. Ex.
!
for Logical NOT,&
for Logical AND etc.
We won’t cover all of these right now, but you can look them up online. For now, keep an eye out for them when they appear.
3.6 Functions
These simple operations are great to start with, but what if we want to do operations on different values of X and Y over and over and don’t want to constantly rewrite this code? This is where functions come in. Functions allow us to carry out specific tasks. We simply pass in a parameter or parameters to the function. Code is then executed in the function body based on these parameters, and output may be returned.
Some functions are built-in to R, such as the library
function we have been using to load in packages. However, some functions are customized, meaning we created them ourselves. Below is the format for these customized functions that we create ourselves and customize for our intended purpose.
<- function(arguments)
Functionname
{code operating on the arguments }
This structure says that we start with a name for our function (Functionname
) and we use the assignment operator similarly to when we assign values to variables. We then pass arguments or parameters to our function (which can be numeric, characters, vectors, collections such as lists, etc.) in the (arguments)
space; think of them as the inputs to the function.
Finally, within the curly brackets we write the code needed to accomplish our desired task. Once we have done this, we can call this function anywhere in our code (after having run the cell defining the function!) and evaluate it based on the specific parameter values that we choose to pass in as inputs.
An example is shown below; can you figure out what this function does?
<- function(x, y)
my_function = x + y
{x 2 * x
}
The parameters passed as input to functions can be given defaults. Defaults are specific values for parameters that have been chosen and defined within the circular brackets of the function definition. For example, we can define y = 3
as a default in our my_function
. When we call this function, we then do not have to specify an input for y unless we want to.
<- function(x, y = 3)
my_function = x + y
{x 2 * x}
my_function(2)
However, if we want to override this default, we can simply call the function with a new input for y. This is done below for y = 4
, allowing us to execute our code as though our default was actually y = 4
.
<- function(x, y = 3)
my_function = x + y
{x 2 * x}
my_function(2, 4)
Finally, note that we can nest functions within functions, meaning we can call functions inside of other functions, creating very complex arrangements. Just be sure that these inner functions have themselves already been defined! An example is below.
<- function(x, y)
my_function_1 = x + y + 2
{x 2 * x}
<- function(x, y)
my_function_2 = x + y - my_function_1(x, y)
{x 2 * x}
my_function_2(2, 3)
Luckily, we usually don’t have to define our own functions, since most useful built-in functions we need already come with R and its core packages. They do not require creation; they already exist for us, although they may require the importing of specific packages to be operable. We can always use the help ?
feature in R to learn more about a built-in function if we’re unsure. For example, ?max
gives us more information about the max()
function.
For more information about how you should read and use key functions, please refer to the Function Cheat Sheet.
3.7 Errors
Sometimes in our analysis we run into errors in our code; this happens to everyone and is not a reason to panic.
Understanding the nature of the error we are confronted with is a helpful first step in finding a solution. There are two common types of errors:
Syntax errors: This is the most common error type. These errors result from invalid code statements/structures that R doesn’t understand. Suppose R speaks English. This error is representative of us asking it to help by speaking German, which would certainly not work! Here are some examples of common syntax errors: using a function for which an unloaded package is needed, misspelling of a command as R is case-sensitive, and unmatched parenthesis. How we handle syntax errors is case-by-case: we can usually solve syntax errors by reading the error message and searching it.
Semantic errors: These errors result from valid code that successfully executes but produces unintended outcomes. Again, let us suppose R speaks English. This error is representative of us asking R to hand us an apple in English, which R successfully understood, but it handed us a banana in return. This is not okay! How we handle semantic errors is also case-by-case: we can usually solve semantic errors by reading the associated error message and searching it for help/suggestions.
3.8 Wrap Up
In this notebook, we have learned the different ways data can be stored and structured in our R memory. We have also learned how to manipulate, extract and operate on data from different structures. Additionally, we have learned how to write a function to perform operations more efficiently. Now that we have all of this knowledge at our disposal, we can load in data and operate on it in the next module.
3.9 Wrap-up Table
Command | Function |
---|---|
c() |
It creates a vector. |
matrix() |
It creates a matrix. |
list() |
It creates a list. |
data.frame() |
It creates a data frame object. |
typeof() |
It prints the type of the object in parenthesis. |
str() |
It prints the structure of the object in parenthesis. |
function(arg){code} |
It creates a function that takes arg as inputs and uses them to run the code detailed within curly brackets. |