library(testthat)
library(digest)
disaster = readLines("../datasets/disaster_data.csv") # read the raw lines of the data set
GEOG 374: Wrangle and Visualize Climate Disaster Data
- Authors: Hewitt Lab (Nina Hewitt and Michael Jerowsky) × COMET Team (Jonathan Graves)
- Last Update: 15 September 2023
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to R
Outcomes
After completing this notebook, you will be able to:
- Explore data to gain a better understanding of its content and structure.
- Transform data to meet the needs of data analysis and visualization.
- Visualize data using time series plots and histograms.
References
- Datacamp, Introduction to Data Cleaning
- Climate Central: Disaster Fatigue
- How to Use geom_smooth in R
- Data Wrangling
- Data Wrangling: Dates and Times
This module has a suggested citation of: Hewitt, N. and Jerowsky, M., 2023. Interactive Notebooks for Statistics and Computation in Geography: Wrangle and Visualize Climate Disaster Data. In Adshade et al. 2023. The COMET Project: Creating Online Materials for Econometrics Teaching. https://comet.arts.ubc.ca/.
We also want to recognize the special contributions of Dr Dan Moore, who provided some of the code for this module.
Introduction
This module uses a data set on disasters to illustrate working with date variables. The data comes from Climate Central’s exploration of climate events from 1980 to the present, including the time interval between major events.
Disaster fatigue is a term used to describe the phenomenon of exhaustion and apathy that can arise among individuals and communities as a result of repeated exposure to disasters or crises. It refers to the sense of feeling overwhelmed and emotionally depleted by the incessant flow of news and information regarding disasters, as well as the long-term economic impacts that such disasters can have on the resiliency of local communities. In this module, we will be looking at disasters associated with climate change. As Mora et al. (2018) discuss, greenhouse gas (GHG) emissions are triggering many new climate hazards around the globe. Unless we substantially reduce these emissions, the results could be catastrophic.
Over time, disaster fatigue can lead to a reduced ability to respond, recover, and prepare, as the resources of communities are depleted. People may also become desensitized to the impacts of each new disaster, particularly at national and international levels, where such disasters are not felt equally across the population or by policy makers. This can make it more challenging to prepare for disasters, which may ultimately worsen their impact.
Disaster fatigue can also be exacerbated by other factors. Individuals may have a lack of trust in authorities or institutions, feelings of helplessness or hopelessness, and ongoing stress and trauma resulting from previous disasters or other life events. Meanwhile, communities may be overwhelmed financially or impacted negatively by rigid response policies that are not flexible enough to meet the emergent needs of a natural disaster as it unfolds.
Data
disaster_data.csv contains a subset of data from the Climate Central data on climate events, focusing on the United States. The data contains information on:
- Event Name
- Type of Event
- Year
- Month
- Day
- Full Date
- Cost
- Death
- Days between events
Prior to beginning this module, run the R code at the top of this notebook to read in the .csv file and save it to a variable. The source and library functions are included so that you can complete the test questions in this module and they can be autograded. Don’t be concerned if you see a warning that an incomplete final line was found in the data set; this will be worked around.
Packages
If the following packages are not already installed, please run the following code block; otherwise, you can skip this step.
install.packages("dplyr")
install.packages("magrittr")
install.packages("tidyr")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("here")
library(dplyr)
library(magrittr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(here)
Exercises
A) Briefly look at the records from the beginning and end of your data to understand its basic structure.
head(disaster)
tail(disaster)
B) You can also view a condensed summary of your data. Since the disaster dataframe doesn’t have a huge number of columns, you can view a quick snapshot of your data using the str() function. This will tell you the class of each variable and give you a preview of its contents.
str(disaster) # preview of data with helpful details
C) The glimpse() function from dplyr is a slightly cleaner alternative to str(). The reason to review the overall structure of your data is to determine if there are any issues with the way columns are labelled, how variables are encoded, etc.
glimpse(disaster) # better version of str() from dplyr
D) Neither of these provided much information, because the data is currently being treated as text and needs to be wrangled. Let’s save it as a data frame and use the mutate() function to create new columns that are functions of existing columns, to clean up the data a bit. The %>% operator is called the pipe operator and is used to chain together multiple functions into a single pipeline. In this case, we’re using it to apply the mutate() function to the .csv file data, as shown in the sketch below. In addition to simple arithmetic expressions, you can use any R function inside the mutate() function to create new variables.
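To make the pipeline mechanics concrete before applying them to the real data, here is a minimal sketch using a toy data frame with made-up values (the column names here are illustrative, not from the disaster data):

library(dplyr)

toy <- data.frame(cost_millions = c(1.2, 3.4, 0.7)) # made-up values
toy %>%
mutate(cost = cost_millions * 1e6) # new column computed from an existing column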
Specifically regarding the mutation of the time and date data, lubridate is an R package that provides a set of tools for working with dates and times in R. It makes it easy to parse, manipulate, and format date and time objects in a way that is both intuitive and efficient. You can find more information on working with dates and times here.
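As a quick illustration (the date strings below are made up and are not from the data set), each lubridate parser reads dates in the order its name suggests:

library(lubridate)

mdy("3/14/1998") # month/day/year text -> Date "1998-03-14"
ymd("2023-09-15") # year-month-day text -> Date "2023-09-15"
year(mdy("3/14/1998")) # extract a component from a parsed date: 1998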
dd2 <- read.csv("../datasets/disaster_data.csv") %>% # read the .csv file and save it as a data frame
mutate(date_lub = mdy(full_date)) %>% # parse the month/day/year text in full_date into a Date column
mutate(date_base = ISOdate(year, month, day)) %>% # create a date-time for each record from the year, month, and day columns
mutate(cost = as.numeric(cost)) # ensure the cost column is saved as numeric
E) Briefly look at the beginning and end of the records again. This time, the output is much more legible. This is an example of tidy data: each subject or observation is in a row; it is coded by different variables (disaster type, year, etc.) in successive columns. Values correspond to the measurement or category for that subject for the given variable listed in the column header.
head(dd2)
tail(dd2)
F) Review the structure of the new dataframe using the following functions.
summary(dd2) # summary statistics for each column
class(dd2) # class of data object
dim(dd2) # dimensions of data object
names(dd2) # column names
str(dd2) # preview of data with helpful details
glimpse(dd2) # better version of str() from dplyr
G) Now that we have wrangled our data, let’s visualize the climate disasters between 1980 and now. Begin by setting the ggplot2 theme to theme_bw(), which uses a white background with grey gridlines.
theme_set(theme_bw())
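Should you want to undo this later, the one-line sketch below (optional, and commented out so it does not run by accident) restores ggplot2’s default grey-background theme:

# theme_set(theme_grey()) # optional: restore the default grey theme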
H) Next, use the following R code to create a time series plot. These are useful tools when analyzing and understanding trends, patterns, and changes in data over time. There are several reasons why one might use a time series plot.
- They can reveal long-term trends in the data, such as seasonal or cyclical patterns, which may not be apparent from individual data points.
- They can help identify outliers, anomalies, and other irregularities in the data, which can be useful in detecting and correcting errors or anomalies.
- They can provide insight into the relationship between variables and their changes over time, which can be helpful in identifying cause-and-effect relationships and making predictions or forecasts.
Sometimes it can be difficult to see whether there is a trend in the data based on point data alone, which is why a trend line has been added here and smoothed using the geom_smooth() function. By default, this function uses locally estimated scatterplot smoothing (LOESS), which is a non-parametric method for fitting a smooth curve through a scatterplot. Specifically, the LOESS algorithm assigns weights to nearby data points according to their distance from the point being smoothed, and these weights are then used to fit a weighted least squares regression line through the data points in that neighbourhood. If you would like more information, check out this link, which provides a more detailed discussion of the function.
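For a rough sense of what geom_smooth() computes under the hood, the sketch below calls base R’s loess() directly with its default span of 0.75. It assumes dd2 from step D; note that because the plot below log-transforms the y-axis, geom_smooth() fits on the log10 scale, so this is an approximation rather than a reproduction of the plotted line:

loess_fit <- loess(cost ~ as.numeric(date_lub), data = dd2, span = 0.75) # LOESS fit of cost over time
summary(loess_fit) # residual standard error and smoothing details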
ggplot(data = dd2, aes(x = date_lub, y = cost)) +
geom_point(col = "red") + # style the points representing each individual disaster
labs(x = "Date", y = "Cost ($US)") + # label the x and y axis
scale_y_log10() + # transform the y scale to use log 10
geom_smooth(se = FALSE) # add a trend line over the data
I) The data would be better visualized if a trend line were given for each type of disaster. We can do this by grouping the data by disaster_type when creating the time series plot, then color coding the groups to make it easier for readers to compare different trends when looking at our visualization.
ggplot(data = dd2, aes(x = date_lub, y = cost,
group = disaster_type)) + # group the data by disaster type.
geom_point(aes(col = disaster_type, # color each point based on disaster type
shape = disaster_type)) + # assign a shape to each disaster type
labs(x = "Date", y = "Cost ($US)") + # label the x and y axis
scale_y_log10() + # transform the y scale to use log 10
geom_smooth(se = FALSE, aes(col = disaster_type), # create trend lines and color by disaster type
span = 1) + # controls the amount of smoothing. Larger numbers produce smoother lines.
labs(col = "Disaster type", # set the legend title for color
shape = "Disaster type") # match the shape legend title so the two merge into one legend
J) Next, let’s create a histogram to visualize how many days are generally between each major disaster.
ggplot(data = dd2) + # reference the data frame
geom_histogram(aes(x = days_between), # indicate the variable being tracked on the x-axis
boundary = 0, binwidth = 20, # start the bins at zero and make each bin 20 days wide
fill = "lightblue", col = "black") + # style your columns and their outline
labs(x = "Days between successive events", y = "Count") # label the x and y axis
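As a quick cross-check on the histogram (assuming dd2 from step D), a numeric summary reports the same distribution in table form:

summary(dd2$days_between) # min, quartiles, median, mean, and max of the gaps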
K) Once again, make a time series plot; this time, however, compare the number of days between natural disasters rather than their cost.
# time series of days_between
ggplot(data = dd2, aes(x = date_lub, y = days_between)) + # indicate which variables will be plotted on the x- and y-axes
geom_point(col = "red") + # color the points representing disasters
geom_smooth(span = 1, se = FALSE) + # controls the amount of smoothing. Larger numbers produce smoother lines.
labs(y = "Days between successive events", x = "Date") # label the x and y axis
L) While we could create a time series of days between disasters that plots each type of disaster on the same graph, sometimes it can be better to plot each as a separate visualization and compare them side by side. facet_wrap() makes a long ribbon of panels (generated by any number of variables) and wraps it into two dimensions.
ggplot(data = dd2, aes(x = date_lub, y = days_between)) + # indicate which variables will be plotted on the x- and y-axes
geom_point(col = "red") + # color the points representing disasters
geom_smooth(span = 1, se = FALSE) + # controls the amount of smoothing. Larger numbers produce smoother lines.
labs(y = "Days between successive events", x = "Date") + # label the x and y axis
facet_wrap(vars(disaster_type)) # create one panel for each disaster type
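One optional variant, sketched under the assumption that some disaster types have much longer gaps than others: passing scales = "free_y" to facet_wrap() gives each panel its own y-axis range, which can reveal within-type trends at the cost of making panels harder to compare directly.

ggplot(data = dd2, aes(x = date_lub, y = days_between)) +
geom_point(col = "red") +
geom_smooth(span = 1, se = FALSE) +
labs(y = "Days between successive events", x = "Date") +
facet_wrap(vars(disaster_type), scales = "free_y") # each panel gets its own y-axis range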