source("testing_intro_to_visualization.r")
06 - ECON 325: Introduction to Data Visualization I
ggplot
packages to help create impactful data visualizations using R. We will also discuss the rationale and principles of effective data design.
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to R
- Be able to load data and packages in R
- Be able to create variables and objects in R
- Be familiar with the general syntax of R commands
Learning Outcomes
- Identify best practices for data visualization design
- Describe when to use the following kinds of visualizations to answer specific questions using a data set:
- scatterplots
- line plots
- bar plots
- histograms
- Use the
ggplot2
package in R to create and refine the above visualizations using- geometric objects
- aesthetic mappings:
x
,y
,fill
,color
- labeling:
xlab
,ylab
,labs
- font control and legend positioning: theme
- Describe the difference between vector and raster file outputs
- Use
ggsave
to save visualizations in.png
and.svg
format
References
- Timbers, T., Campbell, T., Lee, M. (2022). Data Science: A First Introduction.
- Metwalli, S. A. (2021, July 15). Data Visualization 101: How to choose a chart type. Medium. Retrieved June 10, 2022, from https://towardsdatascience.com/data-visualization-101-how-to-choose-a-chart-type-9b8830e558d6
Part 1: Understanding Visualization
Introduction
“The purpose of a visualization is to answer a question about a data set of interest.”
Timbers, T., Campbell, T., Lee, M. (2022). Data Science: A First Introduction.
In econometrics, good data visualizations should always answer a well-thought-out and relevant economic research question. A good data visualization should also exist as a standalone explanation - that is, it should provide readers with a clear understanding of both the research question at hand and its answer in a way that doesn’t require further explanation.
As we learn the iterative practice of making data visualizations, it’s helpful to keep the following questions in mind as we make decisions about what to include/exclude in our graphic:
- Who is our audience?
- What do they know?
- What is the question we’re trying to answer?
Not only are data visualizations incredibly important as narrative outputs from data analysis, they are also a beneficial tool to use iteratively throughout our analysis to identify patterns or anomalies as we process our data. We’ll begin here with a few notes about best practices for visualizations.
Principles of Design: Data Visualization DOs and DONT’s
DO use data visualization to tell the story of the data truthfully
DO remember that a visualization’s accuracy is only as good as the data is
DO label your axes in font sizes that are readable and use descriptive titles
DON’T choose colours that are very similar to each other when trying to distinguish 2 variables (eg, choose red and blue rather than red and orange)
DON’T use design features (eg, exaggerated scaling) to manipulate readers into believing a particular narrative of the data
Types of Visualizations:
The four plot types we will be working with in this worksheet are Scatter plots, Bar plots, Line plots and Histograms which can all be found in the ggplot2
package.
Note that there are many more types of plots that can be generated using this package. We’ll explore some of these in the notebook, Introduction to Data Visualization II or you can check out R studio’s ggplot2 Cheat Sheet to see for yourself in the meantime.
- A Scatter plot visualizes the relationship between two quantitative variables
- This plot works great when we are interested in showing relationships and groupings among variables from relatively large datasets
- A Line plot visualizes trends with respect to an independent, ordered quantity (e.g., time)
- This plot works great when one of our variables is ordinal (time-like) or when we want to display multiple series on a common timeline
- A Bar plot visualizes comparisons of amounts
- This plot works great when we are interested in comparing a few categories as parts of a whole, or across time
- A Histogram visualizes the distribution of one quantitative variable
- This plot works great when we are working with a discrete variable and are interested in visualizing all its possible values and how often they occur
Definitions adapted from: Data Science: A First Introduction.
Figure 1. Examples of scatter, line and bar plots, as well as histograms. (from Data Science: A First Introduction)
Loading data
In this tutorial, we will be working with the Penn World Table 10.0. This data is via:
- Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), “The Next Generation of the Penn World Table” American Economic Review, 105(10), 3150-3182, available for download at https://www.rug.nl/ggdc/productivity/pwt/
To download the dataset we will be using for this notebook:
- Click the link provided above. The PWT page should appear
- Scroll down until three access options appear
- Click Stata and a Stata file (
.dta
) should immediately download. Now move that file to your media directory and we can start the analysis!
We can now get started by importing the packages and data into our notebook, so that we can use it. You should also check out the documentation on the link above, if you’re not sure what a variable does or represents.
# import packages
library(tidyverse) # contains ggplot2, which is what we'll be using!
library(haven)
# load the data
<- read_dta("../datasets/pwt100.dta") # make sure that the .dta file has this exact name
pwt_data
# declare factors
<- as_factor(pwt_data)
pwt_data
<- pwt_data %>%
pwt_data mutate(countrycode = as.factor(countrycode)) %>%
mutate(country = as.factor(country)) %>%
mutate(currency_unit = as.factor(currency_unit))
# check that it looks OK
# there will be a lot of missing data
glimpse(pwt_data)
As you can see, this data set includes 12,810 observations and many different variables.
Question time: How many variables are included in this data set?
Hint: variables are stored in columns
#Fill in the ... below with your answer to the above question
<- ...
answer_1
test_1()
Understanding ggplot2
R uses a “language” for how graphics are created called the grammar of graphics, which is a system of best practices from statistical visualization theory that centres data in the process.
Note: An important companion to this Notebook is the
ggplot2
cheatsheet that you can find among the RStudio collection of cheatsheets. This is CC-by-SA Material, and is available from RStudio’s website
Layers with ggplot
Essentially, in this “grammar” of graphics, we create a series of “layers” which implement a specific visual output:
- First, we identify a dataset from which we want to create our graph (
data=
) - Then, we associate variables in that dataset to aesthetics (
aes=
)- Aesthetics represent different properties of a graph, like “what goes on the \(x\)-axis”, or “what does the color of the line represent”. Each type of visualization is associated with a collection of necessary and optional aesthetic features.
- Next, we attach a coordinate system and a plot type to the graph using
geom
, which takes the aesthetics and describes them- This will also include options like
position
which indicates how to combine elements (e.g. stack the bars in a barchart, or place them side-by-side)
- This will also include options like
- Finally, we can tweak the visualization - such as adding labels or changing the colour scheme
Let’s see what this looks like in practice.
Interpreting the Data
As we can see from our preview, this data set is quite complex. After spending some time with our data source, we understand a few of the key variables represent the following:
rgdpe
= expenditure-side real GDP (millions of USD)
pop
= population of a given country (millions of people)
year
= year of data recording (1950-2019)
country
= country being studied (183 countries are captured in this data set)
hc
= an index of human capital per person, which is based on average years of schooling and the return to education
emp
= number of persons engaged in employment (millions)
Beginning our Analysis
Let’s say we are interested in creating a visualization that answers the following question:
How has real GDP per Capita changed over time in North American countries?
# First, filter the dataset to only include data on North American countries
<- filter(pwt_data, (countrycode == "CAN")|(countrycode == "USA")|(countrycode == "MEX"))
NA_data
# We can take a look at our the rgdpe/pop variable by making a quick histogram here
<- ggplot(data = NA_data, aes(x = rgdpe/pop)) +
histogram geom_histogram(colour = "black", bins = 20)
histogram
It looks like a solid number of GDP per capita measurements are under 20,000 - what might be driving this? Let’s get back to our main chart to find out!
# Use the ggplot command and specify the data frame that is to be used (NA_data in this case) and the set of plot aesthetics (which variables will be included)
<- ggplot(data = NA_data, # this declares the data for the chart; all variable names are in this data
plot aes(# this is a list of the aesthetic features of the chart
x = year, # for example, the x-axis will be "year" (a continuous variable)
y = rgdpe/pop, # the y-axis will be expenditure-based real GDP per capita
fill = country, # this means that the country variable in our dataset will determine the colour of the bars
color = country # country variable will also determine the color of the borders or outline
),
)
# Now, input the labels to the aesthetic features added above
<- plot + labs( # add human-readable, aesthetic labels
plot x = "Year", # label for the x aesthetic (x-axis title)
y = "Real GDP per capita (expenditure-based)", #y-axis title
color = "Country", # adds the label "Country" to the legend and tells us which colour is used to represent which country
fill = "Country", # similarly, tells us about the colours used to fill
title = "North American Real GDP per Capita over Time") # and title of plot
# Because the variable "country" is expressed by colours, we are able to change the colours used in the chart using the commands below. Try playing with different palettes. To display other palettes use the command display.brewer.all()
<- plot + scale_fill_brewer(palette="Accent") #set the colour palette for fills
plot <- plot + scale_color_brewer(palette="Accent") #set the colour palette for outlines
plot options(repr.plot.width = 15, repr.plot.height = 9) #adjusts plot size: try playing around with the dimensions, and then return the values to width = 15 and height = 9
# Finally, input the type of vizualisation of the chart
<- plot + geom_col( # now we add the visualization geom_col() produces a bar graph)
plot1 position = "dodge") # this places the visualizations side-by-side
# if you change position to "stack" it will be a stacked graph!
plot1
Beautiful! It’s also very easy to quickly change the visualization. For instance, if we wanted to make this a line graph instead of a bar chart, we could do the following:
# fig.width = 40
<- plot + geom_line()
plot2
# show the plot plot2
Nice! Don’t worry if any or all of this feels confusing right now - we’re going to work through a few more examples together. If you’re having trouble reading the text on your graphic outputs, we’ll walk through how to adjust text size in the next section as well!
Part 2: Building a Visualization
Next, let’s learn about how we build visualizations. It’s important to note that we should approach this iteratively by building our visualization piece-by-piece, making adjustments along the way. We don’t have to get it completely right on the first try!
Let’s say we are interested in creating a visualization that answers the following question:
What is the relationship between GDP per capita and human capital in the world today?
The first thing we want to do is identify what data we need:
- We will need GDP, which we can get using either
rdgpe
orrdgpo
(Quiz: what’s the difference?) - We will need population, which we can get using
pop
- We will need human capital, which is the variable
hc
- We will need data from “today”, which is
year == 2019
, the most recent data in our sample
Let’s start out by filter
-ing the data to just get 2019 data.
<- filter(pwt_data, year == 2019)
figure_data
head(figure_data$year)
Nice, it looks like we’ve got all the 2019 data! Next we need to start creating our object - however, we should pause for a moment an consider what kind of vizualization we want.
- Our question is about the relationship between two quantitative variables
- We are interested in understanding how they move together
While there are a couple of options, we’ll start with a scatterplot. If we consult our cheat-sheet, we can see that scatterplots are the geom_point()
command. This requires the aesthetic properties:
x
, the \(x\)-axisy
, the \(y\)-axis
We then have other optional ones, like alpha, color, fill, shape, size, stroke
which we won’t use here but you can check their function also on the R studio’s ggplot2 Cheat Sheet.
Notice an important fact: you can assign aesthetics on either the ggplot
layer or on a geom
. The difference is that the ggplot
aesthetics are automatically inherited by all other layers; there’s no other real difference.
Important Note: generally, any aesthetic property which can be assigned in
aes()
can also be assigned to thegeom
directly. For example, if you wanted to make a line dashed or a point red, you could do this by settinggeom_point(color = "red")
. However, this will apply to all parts of thegeom
so use it wisely!
Let’s start simple, and make x
represent human capital, and y
represent real GDP per capita. We can start our visualization by creating our ggplot
object and assigning all these properties:
<- ggplot(data = figure_data, # associate the data we chose
figure aes(
x = hc, # x is human capital
y = rgdpe/pop # we divide rgdpe by pop to get gdp per capita
))
<- figure + labs(x = "Human Capital",
figure y = "Real GDP per capita (expenditure-based)",
title = "Global GDP per Capita and Human Capital in 2019") +
theme(
text = element_text(
size = 15)) #increases text size: try playing around with this number!
# note: you can set aethestics to be simple functions of variables!
After running the previous cell, nothing was printed in our notebook; this is because we need to assign our visualization! Right now, it’s just data and properties. Let’s test it our by adding our geom_point()
layer:
+ geom_point() figure
Nice! Already looking pretty good. But, there’s still some work to do to make this graph more readable and tell the story of the dataset. How about if we made the size of each point relative to the population? That way, bigger countries would be more prominent on the graph. We can do this by assigning the aesthetic again:
+ geom_point(aes(
figure size = pop,)) # assigns the size of the point to be relative to the population values
Now, let’s make it more colourful. How about if we make colour of each point change as the employment (emp
) rate changes? So, darker colors would represent higher labour force utilization. Again, we can do this by assigning the aesthetic:
<- figure + geom_point(aes(
figure size = pop,
colour = 100*emp/pop))
figure
Great work! We now have a solid visualization. Say we’re not the biggest fan of the colour blue and want to change it. The easiest way to set colours in R is using palettes. The list of all the palette options are:
::display.brewer.all() RColorBrewer
Let’s choose YlOrRed
. We can apply this using the following (somewhat cryptic) command:
<- figure + scale_color_distiller(palette="YlOrRd")
figure
figure
options(repr.plot.width = 15, repr.plot.height = 9)
You might have noticed that we used color_brewer
earlier and color_distiller
here. color_brewer
is for visualizations with discrete variables while color_distiller
is for continuous. It doesn’t really matter why this is, but watch out for it; R will complain.
Good work! We now have a publication quality visualization, ready to go. This is a good example of the process we go through when we build visualizations: a lot of experimenting, a lot of trial and error. Always ask yourself: “Is this effective? Is this what I want it to do?”
Exporting Visualizations
Once we’ve decided that our graph can successfully answer our economic question, we’d like to export it from Jupyter. Once again, the ggplot
package comes to our rescue with the following command:
ggsave
allows us to save a visualization using the following key arguments("file_name.file_format", my_plot, width = #, height = #)
You can check out an expanded list of possible arguments at the R documentation page for
ggsave
- The first part of the argument,
"file_name.file_format"
is where we give our graphic a descriptive name and specify which file format we want our graphic to be saved in the Jupyter workspace. If you are saving to a specific folder, you can add this before the file name with a/
in between to delineate the two (example:"folder/file_name.file_format"
). The format you choose may depend on the context you plan to use the visualization in. Images are typically stored in either raster or vector formats. The following definitions are borrowed from Data Science: A First Introduction.
Raster images are represented as a 2-D grid of square pixels, each with its own color. Raster images are often compressed before storing so they take up less space. A compressed format is “lossy” if the image cannot be perfectly re-created when loading and displaying, with the hope that the change is not noticeable. “Lossless” formats, on the other hand, allow a perfect display of the original image.
Common raster file types:
- JPEG (.jpg, .jpeg): lossy, usually used for photographs
- PNG (.png): lossless, usually used for plots / line drawings
- BMP (.bmp): lossless, raw image data, no compression (rarely used)
- TIFF (.tif, .tiff): typically lossless, no compression, used mostly in graphic arts, publishing
- Open-source software: GIMP
Vector images are represented as a collection of mathematical objects (lines, surfaces, shapes, curves). When the computer displays the image, it redraws all of the elements using their mathematical formulas.
Common vector file types:
- SVG (.svg): general-purpose use
- EPS (.eps): general-purpose use (rarely used)
- Open-source software: Inkscape
Raster and vector images have opposing advantages and disadvantages. A raster image of a fixed width and height takes the same amount of space and time to load regardless of what the image shows (the one caveat is that the compression algorithms may shrink the image more or run faster for certain images). A vector image takes space and time to load corresponding to how complex the image is, since the computer has to draw all the elements each time it is displayed. For example, if you have a scatter plot with 1 million points stored as an SVG file, it may take your computer some time to open the image. On the upside, with vector graphics you can zoom into / scale up the image as much as you like without it looking bad, while raster images eventually start to look “pixelated.”
The second part of the argument,
my_plot
specifies which plot in our analysis we’d like to exportThe last key part of the argument
width =
andheight =
specifies the dimensions of our image. Because we’ve tinkered with the graph output size usingoptions(repr.plot.width = 15, repr.plot.height = 9)
above in our code, we’ll want to use these dimensions as we export to ensure that our visualization isn’t cut off by R’s default saving dimensions. If we haven’t made modifications to the size, these commands can be left out.
Try uncommenting the code section below and saving our “Global GDP per capita and Human Capital in 2019” graph in the Jupyter directory that this notebook is stored in.
# ggsave("gdp_hc_plot.png", figure, width = 15, height = 9)
Did you see file appear in the directory? Now try saving the same graph as an .svg
in the code cell below.
# ggsave("gdp_hc_plot. ...", figure, width = ..., height = ...)
As we have seen, R makes it easy to create high-quality, impactful graphics. We’ll let you try it on your own now!
Part 3: Making Your Own Chart
For the final section of this notebook, you’ll make your own visualization using the Penn Data again. Suppose you are tasked with building a visualization which explores the following question: What is the relationship between the economic development of China and United States over time?
Some variables you might want to consider are:
year
: the year of observationrtfpna
: total factor productivity (here’s a link, if you’re ECON 102 is rusty)rgdpe
: real GDP (expenditure-based)pop
: populationccon
: real consumption of householdsavh
: average hours worked
To be clear: you don’t need to use all of these variables in your visualization.
- Start by deciding what variables are essential and which ones are optional. Choose at least two to include in your visualization.
- Decide what kind of visualization you want to make. Relate your choices to the best practices for types of visualizations. > You might want to consult the cheat-sheet to see your options.
- Finally, decide how you want to present it; what should the final product look like?
A good idea is to create it in layers, like we did before - updating as you go. We’ll start you off with some of the data and code scaffolding:
# my_data <- filter(pwt_data, (countrycode == "USA")|(countrycode == "CHN"))
# my_figure # give your plot a descriptive title
# <- ggplot(data = my_data, aes( #add your aesthetics below
# x = ...,
# y = ...,
# color = ...)) + # remember this is optional
# labs(x = "...", # what labels do you want to add?
# y = "...",
# title = "...") +
# theme(text = element_text(size = ...))+
# geom_...() # what geom will you use? Does it need options?
#my_figure
# uncomment (delete the leading "#" symbol) to use these lines.
# Pro tip, you can uncomment an entire section by highlighting it and selecting "command + /"
Try to see if you can piece together a decent graph using your learning from this notebook before you continue. Depending on the direction you choose, your plot might look something like this one below.
This visualization was made using the following features:
y = year
,x = rtfpna
geom_line()
function with argument:size = 3
in between the parentheses to make the lines a bit more visiblecolor = country
to create two unique lines on the graph for China and the USlabs(color = "Country")
to give nice, human readable title to our color legend
If you’re stuck, try to re-create this one, before starting on your own.