07 - ECON 325: Introduction to Data Visualization II

econ 325

visualization

ggplot

charts

advanced

How do we make visualizations of data? This notebook uses the ggplot packages to help create impactful data visualizations using R. This builds on the first notebook, introducing more complex visualizations and techniques.

Author

COMET Team
Anneke Dresselhuis, Jonathan Graves

Published

12 January 2023

Outline

Prerequisites

Introduction to Jupyter
Introduction to R
Introduction to Data Visualization 1

Outcomes

By the end of this notebook, you will be able to: * Customize aesthetic labels on a graph to communicate the key message of a visualization * Use faceted graphs to visually represent complex data * Identify best practices for creating effective visualizations * Recognize ways in which visualizations can be used nefariously

References

Deeb, Sameer. 2005. “The Molecular Basis of Variation in Human Color Vision.” Clinical Genetics 67: 369–77.

Introduction

In the notebook “Introduction to Visualization 1,” we learned about scatter, line and bar charts, as well as histograms. This notebook expands on the concepts introduced in Introduction to Data Visualization 1 and offers a more in depth and granular toolkit for creating visualizations. Because we’re already familiar with the theoretical concepts of ggplot2, this notebook serves as more of a resource guide with practical case study examples than as a learning module.

The concept of layers is fundamental to ggplot2 - in this notebook we’ll explore a few more layering customizations that can help make our visualizations look extra polished. We’ll also see how our new tools perform in a few case study examples.

Loading the data

In this notebook, we’ll be working with the Penn World Tables data set again. Let’s load it now.

source("testing_intro_to_visualization2.r")

# import packages
library(tidyverse) # contains ggplot2, which is what we'll be using!
library(haven)
library(RColorBrewer)
library(ggthemes)

# library(lubridate)

# load the data
pwt_data <- read_dta("../datasets/pwt100.dta")

# declare factors
pwt_data <- as_factor(pwt_data)

pwt_data <- pwt_data %>%
    mutate(countrycode = as.factor(countrycode)) %>%
    mutate(country = as.factor(country)) %>%
    mutate(currency_unit = as.factor(currency_unit))

NA_data <- filter(pwt_data, (countrycode == "CAN")|(countrycode == "USA"))

# check that it looks OK
tail(pwt_data,10)
# there will be a lot of missing data

Part 1: Adding Techniques to our Data Visualization Toolkit

Adding and Adjusting Labels

In notebook 1 we introduced the labs() function which allows us to specify the aesthetic outputs of different labels on our chart, such as the x and y axis titles, the legend title and main graph title.

Here are a few best practices to keep in mind when crafting and adding labels to charts (pulled from FusionCharts 2022): * Graph title: should summarize the graph in short, understandable language that is as objective as possible - avoid using unnecessary words such as “the”, “a”, or “an”, as well as adjectives like “amazing” or “poor” which can manipulate your reader’s perception of the graphic. * Example: “Online Grocery Order Growth 2018 vs 2020” (a better title) vs “The Significant Growth in Online Grocery Orders in the years 2018 and 2020” (a worse title) * Graph subtitle: should be used to add helpful supplemental information that will help your audience understand the graph. * Example: Units of measurement, time frames (if this is secondary information) * Axis Labels: units of measurement should always be made known! If the data is labeled inside the visualization, sometimes axis labels are not necessary - we should always ask ourselves: what does our audience need to know?

Here, we explore a few tips and tricks for customizing labels to suit the needs of our graph.

The `labs()` function

As a refresher, a few labs() arguments we’ve already covered are:

x = specifies x-axis title.
y = specifies y-axis title.
color = specifies meaning of color outline.
fill = specifies meaning of color fill.
title = specifies title of plot.

NEW ARGUMENTS
subtitle = specifies a subtitle for the graph (positioned below title).
caption = specifies a caption at the bottom of graph, which can be great for listing the source of our data.

These arguments give us a basic infrastructure for a plot - we can demonstrate these features using a simple bar chart.

basic_plot <- ggplot(data = NA_data,  
                aes(x = year,   
                    y = rgdpe/pop, 
                    fill = country, # specifies the fill to be by country
                    color = country)) + # specifies the outline to also be by country
        labs(x = "Year",  
             y = "Real GDP per capita (expenditure-based)",
             fill = "Country",
             color = "countyr",
             title = "Canada & US Real GDP per Capita over Time") +
        geom_col(position = "dodge") +
        theme(text = element_text(size = 20, hjust = 0)) # specifies the x, y and legend label text size
        options(repr.plot.width = 15, repr.plot.height = 7) # specifies the dimension 
  
basic_plot

Exercise 1

Fix the typo from the graph above (in the code below!) so that the two legends on the righthand side are merged into one.

Hint: remember that arguments in the ggplot() contain aesthetic mappings, while arguments in the labs() function contain labels and other specifications that are set by us and not the data. What might be going on here that would create two legends on this graph?

# replace the error with your answer: your legend should begin with a Capital letter

basic_plot_test <- ggplot(data = NA_data,  
                    aes(x = year,   
                    y = rgdpe/pop, 
                    fill = country, # specifies the fill to be by country
                    color = country)) + # specifies the outline to also be by country
        labs(x = "Year",  
             y = "Real GDP per capita (expenditure-based)",
             fill = "Country",
             color = "countyr",
             title = "Canada & US Real GDP per Capita over Time") +
        geom_col(position = "dodge") +
        theme(text = element_text(size = 20)) # specifies the x, y and legend label text size
        options(repr.plot.width = 15, repr.plot.height = 7) # specifies the dimension 

basic_plot_test

answer_1 <- basic_plot_test

test_1()

The `theme()` function

We also learned about the theme() function, which allows us to modify components of a theme such as text.
As a refresher, one theme() argument we’ve already covered is:

text = element_text() specifies text attributes that broadly apply to all text components in a graph * Example: title, labels, caption, etc.

NEW ARGUMENTS
plot.title = element_text() allows specifications for the title text.
plot.subtitle.title = element_text() allows specifications for the subtitle text.
plot.caption = element_text() allows specifications for the caption text.
axis.text.x = element_text() allows specifications for x axis text
axis.text.y = element_text() allows specifications for y axis text.
legend.position = allows specifications for legend position. * example, "top", "bottom" or as a vector: c(x-coordinate, y-coordinate)

When considering text size, we always want to ensure that our text is readable to an audience with a range of seeing abilities - we might ask ourselves: could a grandparent or older person I know read this at first glance?

As we’ll explore below, adjusting label size and boldness, for example, can help us emphasize important information about a graph that we’d like our audience to focus on. In effect, this means leveraging the power of the size and thickness of text to encode visual cues that signal a hierarchy of importance. You can check out an extensive list of theme() arguments in this documentation resource created by R Studio.

The `element_text()` function

There are also a quite a few things we can specify using the element_text() function:

size = specifies x-axis title (default in R is set at size 11)

NEW ARGUMENTS
hjust = specifies the position of the plot titles (default is left). * 0.5: centre, 1: right, 0: left

color = specifies the text color.
face = specifies typographical emphasis. * example: "bold", "italic", "bold.italic"

angle = specifies angular rotation of text.
vjust = specifies the vertical adjustment. * example: higher number is up, lower number is down from the graph (default is 0)

Let’s use some of these new functions and arguments to improve our earlier visualization!

intermediate_plot <- ggplot(data = NA_data,  
                aes(x = year,   
                    y = rgdpe/pop, 
                    fill = country, 
                    color = country,
                    geom_text(mapping = country))) +
        labs(x = "Year",  
             y = "Real GDP per capita (expenditure-based)",
             fill = "Country",
             color = "Country", 
             title = "Canada & US Real GDP per Capita over Time", subtitle = "1950-2019", 
             caption = "Source: Penn World Tables 2019") +
        geom_col(position = "dodge") +
        theme(text = element_text(size = 20, hjust = 0)) + # specifies the x, y and legend label text size
        theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) + # specifies title text details
        theme(plot.subtitle = element_text(size = 19, hjust = 0.5)) +  # specifies subtitle text details
        theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) + # specifies caption text details
        theme(legend.position = "top")  # places the legend at the top of the graph
        options(repr.plot.width = 15, repr.plot.height = 9) 

intermediate_plot

Formatting Graphs

Scales

At times, we may wish to visualize only a subsection of our data. To do this, we can use the following commands which manipulate the scale of our graphs: * xlim() specifies scale adjustments on the x-axis. * ylim() specifies scale adjustments on the y-axis.

Both of these functions take two arguments - one lower bound limit and one upper bound limit, like this: xlim(lowerbound,upperbound) . We can use the xlim() and ylim() functions to examine subsections of our axis variables. In the GDP per capita over time plot that we’ve been working with, use xlim() to view a subsection of data from 2000-2019 below. This will make the power of changing the scale of our graph much more obvious.

Note: scaling can occur on any quantitative attribute, that is, we can scale variables other than year.

# fill in the ... to make the graph

scaled_plot <- ggplot(data = NA_data,  
                aes(x = year,   
                    y = rgdpe/pop, 
                    fill = country, 
                    color = country)) +
        labs(x = "Year",  
             y = "Real GDP per capita (expenditure-based)",
             fill = "Country",
             color = "Country", 
             title = "Canada & US Real GDP per Capita over Time", subtitle = "...-2019",  # adjust the subtitle to reflect the window of time we're working with
             caption = "Source: Penn World Tables 2019") +
        geom_col(position = "dodge") +
        theme(text = element_text(size = 20, hjust = 0)) + 
        theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) +
        theme(plot.subtitle = element_text(size = 19, hjust = 0.5)) +  
        theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) + 
        theme(legend.position = "top") +
        xlim(2000,2019) + # adjust x-axis scale with lower bound = 2000 and upper bound = 2019
        ylim(0,200000) # adjust the max y value to 200,000 rather than the automatic scale of ~64,000
        options(repr.plot.width = 15, repr.plot.height = 9) 

scaled_plot

What do you notice about this graph? How do the oscillations (ups and downs) appear different when zoomed in here versus in the large view in the other visualization?

🔎 Let’s think critically

🟠 What are some good reasons why we might want to adjust the scale limits of our axis?
🟠 Is there a subsection of data that would be of interest? Why?
🟠 Are we trying to make our oscillations seem less volatile (as seen above) to prove a point? What point?
🟠 Are our scale choices helping our audience gain an understanding of the data that is as objective and accurate as possible?

The points above are all important to consider in any data visualization project. This is because scaling in particular can communicate a variety of different messages depending on how it is used. This example provides just one demonstration of this idea.

Faceting

So far, we’ve only been creating one plot at a time; we may, however, be interested in piecing together a visualization that takes on a more dashboard-like effect to display multiple plots in a grid simultaneously. A function we can use to achieve this is facet_grid(). To facet something simply means to split - in data visualization, faceting helps us arrange graphs into multiple views or layers which can help us explore multidimensionality and visualize complexity in a dataset.

NEW ARGUMENTS
facet_grid(rows = vars(variable)) creates a grid of plots using a variable or variables split into rows (horizontal split) or columns (vertical split). The vars() function allows our variable(s) of choice to be correctly evaluated in the context of the data frame.
geom_hline() creates a horizontal line across our plot(s) at a y value of our choosing.
geom_vline() creates a vertical line across our plot(s) at an x value of our choosing.

Both geom_hline() and geom_vline() can be used on single plots and subplots to emphasize particular thresholds, values or time periods. In the graph below, we’ll add a horizontal line to our faceted plot to help us see when the GDP from each G7 country rose above 40,000.

g7_data <- pwt_data %>%
filter(country == "Canada" | country == "United States" | country == "France" | 
       country == "Germany" |  country == "Italy" |  country == "Japan" | country == "United Kingdom")  # select G7 countries  

facet_plot <- ggplot(data = g7_data,      
                aes(x = year,   
                    y = rgdpe/pop, 
                    fill = as_factor(country), 
                    color = country)) +
        labs(x = "Year",  
             y = "Real GDP per capita (expenditure-based)",
             fill = "Country",
             color = "Country", 
             title = "G7 GDP per Capita over Time", subtitle = "1950-2019", 
             caption = "Source: Penn World Tables 2019") +
        geom_line(size = 2) +
        theme(text = element_text(size = 18, hjust = 0)) +
        theme(text = element_text(size = 20, hjust = 0)) + 
        theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) + 
        theme(plot.subtitle = element_text(size = 19, hjust = 0.5)) +  
        theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) + 
        theme(legend.position = "top") +
        geom_hline(yintercept = 40000, linetype = "solid", size = 0.25) + # add horizontal line 
        facet_grid(rows = vars(country)) + # create a set of subplots organized by country
        scale_color_brewer(palette="Paired")
        options(repr.plot.width = 10, repr.plot.height = 15) 

facet_plot

Faceting allows us to arrange charts to make comparisons while leveraging the power of white space to handle visual complexity. Without faceting, a consolidated line chart of all G7 countries such as the one below might look overcrowded and create challenges in observing the nuances between countries.

line_plot <- ggplot(data = g7_data,      
                aes(x = year,   
                    y = rgdpe/pop, 
                    fill = as_factor(country), 
                    color = country)) +
        labs(x = "Year",  
             y = "Real GDP per capita (expenditure-based)",
             fill = "Country",
             color = "Country", 
             title = "G7 GDP per Capita over Time", subtitle = "1950-2019", 
             caption = "Source: Penn World Tables 2019") +
        geom_line(size = 2) +
        theme(text = element_text(size = 18, hjust = 0)) +
        theme(text = element_text(size = 20, hjust = 0)) + 
        theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) + 
        theme(plot.subtitle = element_text(size = 18, hjust = 0.5)) +  
        theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) + 
        theme(legend.position = "top") +
        geom_hline(yintercept = 40000, linetype = "solid", size = 0.25) + # add horizontal line +
        scale_color_brewer(palette="Paired")
        options(repr.plot.width = 10, repr.plot.height = 12) 

line_plot

We can see that this graph above, while it is quite consolidated, does not present the specific patterns of each country as clearly as does our faceted graph immediately before it.

Confidence Bands

When creating visualizations which use a predictive element like a regression line, we can summon graph features that visualize how accurate our model is at predicting a variable (or variables) with 95% confidence. By using this feature, we can visualize prediction error and invite our viewer to understand and take into account the possible error that our visualization holds.

NEW ARGUMENTS
geom_smooth() creates a trendline that aids the eye in spotting patterns in the data.
Arguments within this function include: * method = specifies the type of smoothing we want to use - in this notebook, we’ll use linear regression using the lm function. * se = uses a logical argument to specify the presence of a standard error/confidence band (default is TRUE). * colour = specifies the colour of the trendline. * fill = specifies the colour of the confidence band.

Design

Before we move into some case studies, we’ll examine a few more design tips and tricks that we can add to our visualization toolkit to make our visualizations look extra polished from an aesthetic standpoint! ✨

Shapes

When working with scatterplots, we can change the shape of the data points using the pch = argument. Changing the shape of data points should be done when the icon of choice can provide a useful visual code that supports the audience’s understanding of the graph. * For example, in the image below, option 2 (the upwards triangle icon) might be useful to visualize growth or movement upwards, whereas option 6 (the downwards triangle icon) might be useful to visualize decline or movement downwards

The shape options are listed in the chart below - the default, as we’ve seen before, is 19, a simple circle. Note that numbers 21-25 are shapes that have both fill and outline colour options which can be helpful if we are trying to make our data points stand out better. This is particularly useful if we are working with a large data set.

Try exploring a few different shape and size options in the scatterplot below!

intermediate_scatterplot <- ggplot(data = NA_data,  
                aes(x = hc,   
                    y = rgdpe/pop, 
                    fill = country, 
                    color = country,
                    geom_text(mapping = country))) +
        labs(x = "Human Capital",  
             y = "Real GDP per capita (expenditure-based)",
             fill = "Country",
             color = "Country", 
             title = "Canada & US Real GDP per Capita and Human Capital", subtitle = "1950-2019", 
             caption = "Source: Penn World Tables 2019") +
        geom_point(pch = 19, size = 2) + # try exploring a few different shape and size options 
        theme(text = element_text(size = 20, hjust = 0)) + 
        theme(plot.title = element_text(size = 25, hjust = 0.5, color = "black", face = "bold")) +
        theme(plot.subtitle = element_text(size = 19, hjust = 0.5)) +  
        theme(plot.caption = element_text(size = 15, face = "italic", vjust = 0)) + 
        theme(legend.position = "top")  
        options(repr.plot.width = 12, repr.plot.height = 9) 

intermediate_scatterplot

Colour

In our last notebook, we introduced the RColorBrewer theme options in which colour themes could be specified using character strings that include text like "blue" or "Set3". Colours can also be set using hexadecimal colour codes, which are six digit codes that store information about a color by various levels of red (R), green (G), and blue (B) like this: (#RRGGBB). > ex. “#FF0000” (red), “#FF6347” (orange), “#FFD700” (yellow)

When encoding colours into a visualization, choose diverse colours when making comparisons, and where appropriate, use colour hue (i.e. the lightness or darkness of a colour) when demonstrating concentration or other quantitative measures. Be sure to not pick too many colours (no more than a dozen is recommended) that may distract your audience from the main message of the visualization. Also be sure not to choose colours which can signal a helpful semantic (semantics is a domain of philosophy concerned with meaning) association; for example, in the West, green might be used to indicate growth, while red might be used to indicate loss or warning. Colour semantics vary from place to place and culture to culture.

You may have noticed that the default background colour for visualization in R is a pale grey colour. This is fine for quick visualizations, but there may be cases where we would like our visualization to have a more intentional aesthetic. Using a white background, for example, can make a visualization look a bit cleaner and allow colours to stand out better.

To change our background colour, we can add layers to our ggplot visualizations using the following functions which invoke pre-designed themes from the ggthemes() package we imported at the beginning of our notebook: >theme_bw() white background, grey gridlines, black graph border
>theme_minimal() white background, grey gridlines, no graph border
>theme_classic() white background, no gridlines, no graph border
>theme_economist(): a theme based on the plots in the The Economist magazine
> theme_hc(): a theme based on Highcharts JS
>theme_wsj(): a theme based on the plots in the The Wall Street Journal

You can read more about ggthemes here.

When working with colours, it’s important use colours that are “understandable by those with colorblindness (a surprisingly large fraction of the overall population — from about 1% to 10%, depending on sex and ancestry (Deeb 2005)) (Timbers, T., Campbell, T., Lee, M. 2022).

We can view a list of colorblind friendly palettes using the following command:

display.brewer.all(colorblindFriendly = TRUE)

Alternatively, once we have completed and exported a visualization, we can run it through a Colour Blindness Simulator like this one to test if our visualization looks okay from the perspective of those who have different colour blindness or impairment conditions.

Part 2: Case Studies

Let’s now apply what we’ve learned with a few case studies. In them, we will try to recreate a few different types of graphs with our PWT dataset!

Case Study 1: Labeled Scatterplots

Let’s see if we can use our tools to recreate a graph similar to the one below using our PWT data set.

What type of plot will we want to specify using the geom_X command?

# Fill in the ... with your answer below (example: "geom_histogram")

answer_2 <- "geom_..."
test_2()

2014 Wellbeing and Financial Inclusion Map

Source: Boston Consulting Group

# filter our data for a subset of African and South American Countires
casestudy2_data <- filter(pwt_data, year == 2014, country == "Ethiopia" | country == "South Africa" | country == "Kenya" | country == "Rwanda" | country == "Botswana" | country == "Nigeria" | country == "Mali" | country == "Peru" | country == "Brazil" | country == "Chile" | country == "Paraguay" | country == "Colombia" | country == "Argentina") %>%

# assign each observation to a corresponding "region", since we are only given data at the country-level
mutate(region = c("South America","South America","Africa","South America","South America","Africa","Africa","Africa","Africa","South America","South America","Africa","Africa")) %>%
    select(country, region, hc, rgdpe, pop)

Trace through the code below to see where we apply what we learned from Part 1!

labeled_scatter <- ggplot(casestudy2_data, aes(x = hc, y = rgdpe/pop, fill = region, label = country)) +
    geom_point(color = "grey", size = 10, shape = 21, stroke = 1) + # shape 21 allows a stroke to be made visible, the default = 19
    theme_hc() + # use a clean, white background
    geom_text(size = 7, nudge_y = 10, nudge_x = 0.09) + # adjusting our text
    labs(x = "Human Capital Index", y = "GDP per Capita", fill = "", title = "Human Capital and GDP per Capita in Africa and South America",
        subtitle = "2014") +
    theme(text = element_text(size = 18), legend.position = "top") + # place our legend at the top to maximize space
    geom_smooth(method = lm, se = FALSE, colour = "black", size = 0.5) + # compute trendlines for the data
    scale_fill_brewer(palette = "Set2")  # select our colour palette
    
labeled_scatter

Case Study 2: Confidence Bands

Now, let’s see if we can make our own visualization that uses a confidence band!

2010 Line Graph with Probabilistic Population Projects for China

Source: Eviews User Forum

confidence_data <- pwt_data %>%
    filter(country == "Canada") %>%
    mutate(adjusted_gdp = rgdpe/pop)

Trace through the code below to see where we apply what we learned from Part 1!

confidence_plot <- ggplot(confidence_data, aes(x = hc, y = adjusted_gdp)) +
    geom_point(colour = "red", size = 3) + # set our data points to be a distinct colour from our confidence interval
    labs(x = "Human Capital", y = "GDP per Capita", fill = "", title = "Human Capital and GDP Per Capita in Canada",
        subtitle = "1950-2019", caption = "95% confidence interval") +
    geom_smooth(method = lm, se = TRUE, colour = "black", size = 0.99, fill = "black") + # compute trendlines for the data
    theme_minimal() + # select a theme with a light grid background
    theme(text = element_text(size = 20), plot.caption = element_text(color = "grey")) # create a plot caption describing the confidence band d
     
confidence_plot

Exercise 2

Use the pwt_data dataframe to create a graph which looks similar to the one below. Store your object in answer_3. (Hint: all relevant styling options are identical to those used in the Real GDP per capita comparison across G7 countries we worked through earlier, except options(), which specifies a plot width and height of 10 each).

# your code here

answer_3 <-

test_3()

We’ve covered a lot of content in this notebook. For further reading or exploration, we recommend visiting the hyperlinks attached throughout this module, as all of them are valuable resources for deepening your understanding of visualization in R!

Outline

Prerequisites

Outcomes

References

Introduction

Loading the data

Part 1: Adding Techniques to our Data Visualization Toolkit

Adding and Adjusting Labels

The labs() function

Exercise 1

The theme() function

The element_text() function

Formatting Graphs

Scales

Faceting

Confidence Bands

Design

Shapes

Colour

Part 2: Case Studies

Case Study 1: Labeled Scatterplots

Case Study 2: Confidence Bands

Exercise 2

The `labs()` function

The `theme()` function

The `element_text()` function