source("testing_intro_to_visualization2.r")
07 - ECON 325: Introduction to Data Visualization II
ggplot
packages to help create impactful data visualizations using R. This builds on the first notebook, introducing more complex visualizations and techniques.
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to R
- Introduction to Data Visualization 1
Outcomes
By the end of this notebook, you will be able to: * Customize aesthetic labels on a graph to communicate the key message of a visualization * Use faceted graphs to visually represent complex data * Identify best practices for creating effective visualizations * Recognize ways in which visualizations can be used nefariously
References
- Deeb, Sameer. 2005. “The Molecular Basis of Variation in Human Color Vision.” Clinical Genetics 67: 369–77.
Introduction
In the notebook “Introduction to Visualization 1,” we learned about scatter, line and bar charts, as well as histograms. This notebook expands on the concepts introduced in Introduction to Data Visualization 1 and offers a more in depth and granular toolkit for creating visualizations. Because we’re already familiar with the theoretical concepts of ggplot2
, this notebook serves as more of a resource guide with practical case study examples than as a learning module.
The concept of layers is fundamental to ggplot2
- in this notebook we’ll explore a few more layering customizations that can help make our visualizations look extra polished. We’ll also see how our new tools perform in a few case study examples.
Loading the data
In this notebook, we’ll be working with the Penn World Tables data set again. Let’s load it now.
# import packages
library(tidyverse) # contains ggplot2, which is what we'll be using!
library(haven)
library(RColorBrewer)
library(ggthemes)
# library(lubridate)
# load the data
<- read_dta("../datasets/pwt100.dta")
pwt_data
# declare factors
<- as_factor(pwt_data)
pwt_data
<- pwt_data %>%
pwt_data mutate(countrycode = as.factor(countrycode)) %>%
mutate(country = as.factor(country)) %>%
mutate(currency_unit = as.factor(currency_unit))
<- filter(pwt_data, (countrycode == "CAN")|(countrycode == "USA"))
NA_data
# check that it looks OK
tail(pwt_data,10)
# there will be a lot of missing data
Part 1: Adding Techniques to our Data Visualization Toolkit
Adding and Adjusting Labels
In notebook 1 we introduced the labs()
function which allows us to specify the aesthetic outputs of different labels on our chart, such as the x and y axis titles, the legend title and main graph title.
Here are a few best practices to keep in mind when crafting and adding labels to charts (pulled from FusionCharts 2022): * Graph title: should summarize the graph in short, understandable language that is as objective as possible - avoid using unnecessary words such as “the”, “a”, or “an”, as well as adjectives like “amazing” or “poor” which can manipulate your reader’s perception of the graphic. * Example: “Online Grocery Order Growth 2018 vs 2020” (a better title) vs “The Significant Growth in Online Grocery Orders in the years 2018 and 2020” (a worse title) * Graph subtitle: should be used to add helpful supplemental information that will help your audience understand the graph. * Example: Units of measurement, time frames (if this is secondary information) * Axis Labels: units of measurement should always be made known! If the data is labeled inside the visualization, sometimes axis labels are not necessary - we should always ask ourselves: what does our audience need to know?
Here, we explore a few tips and tricks for customizing labels to suit the needs of our graph.
The labs()
function
As a refresher, a few labs()
arguments we’ve already covered are:
x =
specifies x-axis title.
y =
specifies y-axis title.
color =
specifies meaning of color outline.
fill =
specifies meaning of color fill.
title =
specifies title of plot.
NEW ARGUMENTS
subtitle =
specifies a subtitle for the graph (positioned below title).
caption =
specifies a caption at the bottom of graph, which can be great for listing the source of our data.
These arguments give us a basic infrastructure for a plot - we can demonstrate these features using a simple bar chart.
<- ggplot(data = NA_data,
basic_plot aes(x = year,
y = rgdpe/pop,
fill = country, # specifies the fill to be by country
color = country)) + # specifies the outline to also be by country
labs(x = "Year",
y = "Real GDP per capita (expenditure-based)",
fill = "Country",
color = "countyr",
title = "Canada & US Real GDP per Capita over Time") +
geom_col(position = "dodge") +
theme(text = element_text(size = 20, hjust = 0)) # specifies the x, y and legend label text size
options(repr.plot.width = 15, repr.plot.height = 7) # specifies the dimension
basic_plot
Exercise 1
Fix the typo from the graph above (in the code below!) so that the two legends on the righthand side are merged into one.
Hint: remember that arguments in the
ggplot()
contain aesthetic mappings, while arguments in thelabs()
function contain labels and other specifications that are set by us and not the data. What might be going on here that would create two legends on this graph?
# replace the error with your answer: your legend should begin with a Capital letter
<- ggplot(data = NA_data,
basic_plot_test aes(x = year,
y = rgdpe/pop,
fill = country, # specifies the fill to be by country
color = country)) + # specifies the outline to also be by country
labs(x = "Year",
y = "Real GDP per capita (expenditure-based)",
fill = "Country",
color = "countyr",
title = "Canada & US Real GDP per Capita over Time") +
geom_col(position = "dodge") +
theme(text = element_text(size = 20)) # specifies the x, y and legend label text size
options(repr.plot.width = 15, repr.plot.height = 7) # specifies the dimension
basic_plot_test
<- basic_plot_test
answer_1
test_1()
The theme()
function
We also learned about the theme()
function, which allows us to modify components of a theme such as text.
As a refresher, one theme()
argument we’ve already covered is:
text = element_text()
specifies text attributes that broadly apply to all text components in a graph * Example: title, labels, caption, etc.
NEW ARGUMENTS
plot.title = element_text()
allows specifications for the title text.
plot.subtitle.title = element_text()
allows specifications for the subtitle text.
plot.caption = element_text()
allows specifications for the caption text.
axis.text.x = element_text()
allows specifications for x axis text
axis.text.y = element_text()
allows specifications for y axis text.
legend.position =
allows specifications for legend position. * example,"top"
,"bottom"
or as a vector:c(x-coordinate, y-coordinate)
When considering text size, we always want to ensure that our text is readable to an audience with a range of seeing abilities - we might ask ourselves: could a grandparent or older person I know read this at first glance?
As we’ll explore below, adjusting label size and boldness, for example, can help us emphasize important information about a graph that we’d like our audience to focus on. In effect, this means leveraging the power of the size and thickness of text to encode visual cues that signal a hierarchy of importance. You can check out an extensive list of theme()
arguments in this documentation resource created by R Studio.
The element_text()
function
There are also a quite a few things we can specify using the element_text()
function:
size =
specifies x-axis title (default in R is set at size 11)
NEW ARGUMENTS
hjust =
specifies the position of the plot titles (default is left). *0.5
: centre,1
: right,0
: left
color =
specifies the text color.
face =
specifies typographical emphasis. * example:"bold"
,"italic"
,"bold.italic"
angle =
specifies angular rotation of text.
vjust =
specifies the vertical adjustment. * example: higher number is up, lower number is down from the graph (default is 0)
Let’s use some of these new functions and arguments to improve our earlier visualization!
<- ggplot(data = NA_data,
intermediate_plot aes(x = year,
y = rgdpe/pop,
fill = country,
color = country,
geom_text(mapping = country))) +
labs(x = "Year",
y = "Real GDP per capita (expenditure-based)",
fill = "Country",
color = "Country",
title = "Canada & US Real GDP per Capita over Time", subtitle = "1950-2019",
caption = "Source: Penn World Tables 2019") +
geom_col(position = "dodge") +
theme(text = element_text(size = 20, hjust = 0)) + # specifies the x, y and legend label text size
theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) + # specifies title text details
theme(plot.subtitle = element_text(size = 19, hjust = 0.5)) + # specifies subtitle text details
theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) + # specifies caption text details
theme(legend.position = "top") # places the legend at the top of the graph
options(repr.plot.width = 15, repr.plot.height = 9)
intermediate_plot
Formatting Graphs
Scales
At times, we may wish to visualize only a subsection of our data. To do this, we can use the following commands which manipulate the scale of our graphs: * xlim()
specifies scale adjustments on the x-axis. * ylim()
specifies scale adjustments on the y-axis.
Both of these functions take two arguments - one lower bound limit and one upper bound limit, like this: xlim(lowerbound,upperbound)
. We can use the xlim()
and ylim()
functions to examine subsections of our axis variables. In the GDP per capita over time plot that we’ve been working with, use xlim()
to view a subsection of data from 2000-2019 below. This will make the power of changing the scale of our graph much more obvious.
Note: scaling can occur on any quantitative attribute, that is, we can scale variables other than year.
# fill in the ... to make the graph
<- ggplot(data = NA_data,
scaled_plot aes(x = year,
y = rgdpe/pop,
fill = country,
color = country)) +
labs(x = "Year",
y = "Real GDP per capita (expenditure-based)",
fill = "Country",
color = "Country",
title = "Canada & US Real GDP per Capita over Time", subtitle = "...-2019", # adjust the subtitle to reflect the window of time we're working with
caption = "Source: Penn World Tables 2019") +
geom_col(position = "dodge") +
theme(text = element_text(size = 20, hjust = 0)) +
theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) +
theme(plot.subtitle = element_text(size = 19, hjust = 0.5)) +
theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) +
theme(legend.position = "top") +
xlim(2000,2019) + # adjust x-axis scale with lower bound = 2000 and upper bound = 2019
ylim(0,200000) # adjust the max y value to 200,000 rather than the automatic scale of ~64,000
options(repr.plot.width = 15, repr.plot.height = 9)
scaled_plot
What do you notice about this graph? How do the oscillations (ups and downs) appear different when zoomed in here versus in the large view in the other visualization?
🟠 What are some good reasons why we might want to adjust the scale limits of our axis?
🟠 Is there a subsection of data that would be of interest? Why?
🟠 Are we trying to make our oscillations seem less volatile (as seen above) to prove a point? What point?
🟠 Are our scale choices helping our audience gain an understanding of the data that is as objective and accurate as possible?
The points above are all important to consider in any data visualization project. This is because scaling in particular can communicate a variety of different messages depending on how it is used. This example provides just one demonstration of this idea.
Faceting
So far, we’ve only been creating one plot at a time; we may, however, be interested in piecing together a visualization that takes on a more dashboard-like effect to display multiple plots in a grid simultaneously. A function we can use to achieve this is facet_grid()
. To facet something simply means to split - in data visualization, faceting helps us arrange graphs into multiple views or layers which can help us explore multidimensionality and visualize complexity in a dataset.
NEW ARGUMENTS
facet_grid(rows = vars(variable))
creates a grid of plots using a variable or variables split into rows (horizontal split) or columns (vertical split). Thevars()
function allows our variable(s) of choice to be correctly evaluated in the context of the data frame.
geom_hline()
creates a horizontal line across our plot(s) at a y value of our choosing.
geom_vline()
creates a vertical line across our plot(s) at an x value of our choosing.
Both geom_hline()
and geom_vline()
can be used on single plots and subplots to emphasize particular thresholds, values or time periods. In the graph below, we’ll add a horizontal line to our faceted plot to help us see when the GDP from each G7 country rose above 40,000.
<- pwt_data %>%
g7_data filter(country == "Canada" | country == "United States" | country == "France" |
== "Germany" | country == "Italy" | country == "Japan" | country == "United Kingdom") # select G7 countries
country
<- ggplot(data = g7_data,
facet_plot aes(x = year,
y = rgdpe/pop,
fill = as_factor(country),
color = country)) +
labs(x = "Year",
y = "Real GDP per capita (expenditure-based)",
fill = "Country",
color = "Country",
title = "G7 GDP per Capita over Time", subtitle = "1950-2019",
caption = "Source: Penn World Tables 2019") +
geom_line(size = 2) +
theme(text = element_text(size = 18, hjust = 0)) +
theme(text = element_text(size = 20, hjust = 0)) +
theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) +
theme(plot.subtitle = element_text(size = 19, hjust = 0.5)) +
theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) +
theme(legend.position = "top") +
geom_hline(yintercept = 40000, linetype = "solid", size = 0.25) + # add horizontal line
facet_grid(rows = vars(country)) + # create a set of subplots organized by country
scale_color_brewer(palette="Paired")
options(repr.plot.width = 10, repr.plot.height = 15)
facet_plot
Faceting allows us to arrange charts to make comparisons while leveraging the power of white space to handle visual complexity. Without faceting, a consolidated line chart of all G7 countries such as the one below might look overcrowded and create challenges in observing the nuances between countries.
<- ggplot(data = g7_data,
line_plot aes(x = year,
y = rgdpe/pop,
fill = as_factor(country),
color = country)) +
labs(x = "Year",
y = "Real GDP per capita (expenditure-based)",
fill = "Country",
color = "Country",
title = "G7 GDP per Capita over Time", subtitle = "1950-2019",
caption = "Source: Penn World Tables 2019") +
geom_line(size = 2) +
theme(text = element_text(size = 18, hjust = 0)) +
theme(text = element_text(size = 20, hjust = 0)) +
theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) +
theme(plot.subtitle = element_text(size = 18, hjust = 0.5)) +
theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) +
theme(legend.position = "top") +
geom_hline(yintercept = 40000, linetype = "solid", size = 0.25) + # add horizontal line +
scale_color_brewer(palette="Paired")
options(repr.plot.width = 10, repr.plot.height = 12)
line_plot
We can see that this graph above, while it is quite consolidated, does not present the specific patterns of each country as clearly as does our faceted graph immediately before it.
Confidence Bands
When creating visualizations which use a predictive element like a regression line, we can summon graph features that visualize how accurate our model is at predicting a variable (or variables) with 95% confidence. By using this feature, we can visualize prediction error and invite our viewer to understand and take into account the possible error that our visualization holds.
NEW ARGUMENTS
geom_smooth()
creates a trendline that aids the eye in spotting patterns in the data.
Arguments within this function include: *method =
specifies the type of smoothing we want to use - in this notebook, we’ll use linear regression using thelm
function. *se =
uses a logical argument to specify the presence of a standard error/confidence band (default isTRUE
). *colour =
specifies the colour of the trendline. *fill =
specifies the colour of the confidence band.
Design
Before we move into some case studies, we’ll examine a few more design tips and tricks that we can add to our visualization toolkit to make our visualizations look extra polished from an aesthetic standpoint! ✨
Shapes
When working with scatterplots, we can change the shape of the data points using the pch =
argument. Changing the shape of data points should be done when the icon of choice can provide a useful visual code that supports the audience’s understanding of the graph. * For example, in the image below, option 2 (the upwards triangle icon) might be useful to visualize growth or movement upwards, whereas option 6 (the downwards triangle icon) might be useful to visualize decline or movement downwards
The shape options are listed in the chart below - the default, as we’ve seen before, is 19, a simple circle. Note that numbers 21-25 are shapes that have both fill and outline colour options which can be helpful if we are trying to make our data points stand out better. This is particularly useful if we are working with a large data set.
Try exploring a few different shape and size options in the scatterplot below!
<- ggplot(data = NA_data,
intermediate_scatterplot aes(x = hc,
y = rgdpe/pop,
fill = country,
color = country,
geom_text(mapping = country))) +
labs(x = "Human Capital",
y = "Real GDP per capita (expenditure-based)",
fill = "Country",
color = "Country",
title = "Canada & US Real GDP per Capita and Human Capital", subtitle = "1950-2019",
caption = "Source: Penn World Tables 2019") +
geom_point(pch = 19, size = 2) + # try exploring a few different shape and size options
theme(text = element_text(size = 20, hjust = 0)) +
theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) +
theme(plot.subtitle = element_text(size = 19, hjust = 0.5)) +
theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) +
theme(legend.position = "top")
options(repr.plot.width = 12, repr.plot.height = 9)
intermediate_scatterplot
Colour
In our last notebook, we introduced the RColorBrewer
theme options in which colour themes could be specified using character strings that include text like "blue"
or "Set3"
. Colours can also be set using hexadecimal colour codes, which are six digit codes that store information about a color by various levels of red (R), green (G), and blue (B) like this: (#RRGGBB). > ex. “#FF0000” (red), “#FF6347” (orange), “#FFD700” (yellow)
When encoding colours into a visualization, choose diverse colours when making comparisons, and where appropriate, use colour hue (i.e. the lightness or darkness of a colour) when demonstrating concentration or other quantitative measures. Be sure to not pick too many colours (no more than a dozen is recommended) that may distract your audience from the main message of the visualization. Also be sure not to choose colours which can signal a helpful semantic (semantics is a domain of philosophy concerned with meaning) association; for example, in the West, green might be used to indicate growth, while red might be used to indicate loss or warning. Colour semantics vary from place to place and culture to culture.
You may have noticed that the default background colour for visualization in R is a pale grey colour. This is fine for quick visualizations, but there may be cases where we would like our visualization to have a more intentional aesthetic. Using a white background, for example, can make a visualization look a bit cleaner and allow colours to stand out better.
To change our background colour, we can add layers to our ggplot visualizations using the following functions which invoke pre-designed themes from the ggthemes()
package we imported at the beginning of our notebook: >theme_bw()
white background, grey gridlines, black graph border
>theme_minimal()
white background, grey gridlines, no graph border
>theme_classic()
white background, no gridlines, no graph border
>theme_economist()
: a theme based on the plots in the The Economist magazine
> theme_hc()
: a theme based on Highcharts JS
>theme_wsj()
: a theme based on the plots in the The Wall Street Journal
You can read more about ggthemes here.
When working with colours, it’s important use colours that are “understandable by those with colorblindness (a surprisingly large fraction of the overall population — from about 1% to 10%, depending on sex and ancestry (Deeb 2005)) (Timbers, T., Campbell, T., Lee, M. 2022).
We can view a list of colorblind friendly palettes using the following command:
display.brewer.all(colorblindFriendly = TRUE)
Alternatively, once we have completed and exported a visualization, we can run it through a Colour Blindness Simulator like this one to test if our visualization looks okay from the perspective of those who have different colour blindness or impairment conditions.
Part 2: Case Studies
Let’s now apply what we’ve learned with a few case studies. In them, we will try to recreate a few different types of graphs with our PWT dataset!
Case Study 1: Labeled Scatterplots
Let’s see if we can use our tools to recreate a graph similar to the one below using our PWT data set.
What type of plot will we want to specify using the geom_X
command?
# Fill in the ... with your answer below (example: "geom_histogram")
<- "geom_..."
answer_2 test_2()
Source: Boston Consulting Group
# filter our data for a subset of African and South American Countires
<- filter(pwt_data, year == 2014, country == "Ethiopia" | country == "South Africa" | country == "Kenya" | country == "Rwanda" | country == "Botswana" | country == "Nigeria" | country == "Mali" | country == "Peru" | country == "Brazil" | country == "Chile" | country == "Paraguay" | country == "Colombia" | country == "Argentina") %>%
casestudy2_data
# assign each observation to a corresponding "region", since we are only given data at the country-level
mutate(region = c("South America","South America","Africa","South America","South America","Africa","Africa","Africa","Africa","South America","South America","Africa","Africa")) %>%
select(country, region, hc, rgdpe, pop)
Trace through the code below to see where we apply what we learned from Part 1!
<- ggplot(casestudy2_data, aes(x = hc, y = rgdpe/pop, fill = region, label = country)) +
labeled_scatter geom_point(color = "grey", size = 10, shape = 21, stroke = 1) + # shape 21 allows a stroke to be made visible, the default = 19
theme_hc() + # use a clean, white background
geom_text(size = 7, nudge_y = 10, nudge_x = 0.09) + # adjusting our text
labs(x = "Human Capital Index", y = "GDP per Capita", fill = "", title = "Human Capital and GDP per Capita in Africa and South America",
subtitle = "2014") +
theme(text = element_text(size = 18), legend.position = "top") + # place our legend at the top to maximize space
geom_smooth(method = lm, se = FALSE, colour = "black", size = 0.5) + # compute trendlines for the data
scale_fill_brewer(palette = "Set2") # select our colour palette
labeled_scatter
Case Study 2: Confidence Bands
Now, let’s see if we can make our own visualization that uses a confidence band!
<- pwt_data %>%
confidence_data filter(country == "Canada") %>%
mutate(adjusted_gdp = rgdpe/pop)
Trace through the code below to see where we apply what we learned from Part 1!
<- ggplot(confidence_data, aes(x = hc, y = adjusted_gdp)) +
confidence_plot geom_point(colour = "red", size = 3) + # set our data points to be a distinct colour from our confidence interval
labs(x = "Human Capital", y = "GDP per Capita", fill = "", title = "Human Capital and GDP Per Capita in Canada",
subtitle = "1950-2019", caption = "95% confidence interval") +
geom_smooth(method = lm, se = TRUE, colour = "black", size = 0.99, fill = "black") + # compute trendlines for the data
theme_minimal() + # select a theme with a light grid background
theme(text = element_text(size = 20), plot.caption = element_text(color = "grey")) # create a plot caption describing the confidence band d
confidence_plot
Exercise 2
Use the pwt_data
dataframe to create a graph which looks similar to the one below. Store your object in answer_3. (Hint: all relevant styling options are identical to those used in the Real GDP per capita comparison across G7 countries we worked through earlier, except options()
, which specifies a plot width and height of 10 each).
# your code here
<-
answer_3
test_3()
We’ve covered a lot of content in this notebook. For further reading or exploration, we recommend visiting the hyperlinks attached throughout this module, as all of them are valuable resources for deepening your understanding of visualization in R!