library(testthat)
library(digest)
disaster = readLines("../datasets/disaster_data.csv") # read the raw lines of the data set
GEOG 374: Wrangle and Visualize Climate Disaster Data
- Authors: Hewitt Lab (Nina Hewitt and Michael Jerowsky) × COMET Team (Jonathan Graves)
- Last Update: 15 September 2023
Outline
Prerequisites
- Introduction to Jupyter
- Introduction to R
Outcomes
After completing this notebook, you will be able to:
- Explore data to gain a better understanding of its content and structure.
- Transform data to meet the needs of data analysis and visualization.
- Visualize data using time series plots and histograms.
References
- Datacamp, Introduction to Data Cleaning
- Climate Central: Disaster Fatigue
- How to Use geom_smooth in R
- Data Wrangling
- Data Wrangling: Dates and Times
This module has a suggested citation of: Hewitt, N. and Jerowsky, M., 2023. Interactive Notebooks for Statistics and Computation in Geography: Wrangle and Visualize Climate Disaster Data. In Adshade et al. 2023. The COMET Project: Creating Online Materials for Econometrics Teaching. https://comet.arts.ubc.ca/.
We also want to recognize the special contributions of Dr Dan Moore, who provided some of the code for this module.
Introduction
This module uses a data set on disasters to illustrate working with date variables. The data comes from Climate Central’s exploration of climate events from 1980 to the present, including the time interval between major events.
Disaster fatigue is a term used to describe the phenomenon of exhaustion and apathy that can arise among individuals and communities as a result of repeated exposure to disasters or crises. It refers to the sense of feeling overwhelmed and emotionally depleted by the incessant flow of news and information regarding disasters, as well as the long-term economic impacts that such disasters can have on the resiliency of local communities. In this module, we will be looking at disasters associated with climate change. As Mora et al. (2018) discuss, greenhouse gas (GHG) emissions are triggering many new climate hazards around the globe. Unless we substantially reduce these emissions, the results could be catastrophic.
Over time, disaster fatigue can lead to a reduced ability to respond, recover, and prepare, as the resources of communities are depleted. People may also become desensitized to the impacts of each new disaster, particularly at national and international levels, where such disasters are not felt equally across the population or by policy makers. This can make it more challenging to prepare for disasters, which may ultimately worsen their impact.
Disaster fatigue can also be exacerbated by other factors. Individuals may have a lack of trust in authorities or institutions, feelings of helplessness or hopelessness, and ongoing stress and trauma resulting from previous disasters or other life events. Meanwhile, communities may be overwhelmed financially or impacted negatively by rigid response policies that are not flexible enough to meet the emergent needs of a natural disaster as it unfolds.
Data
disaster_data.csv contains a subset of data from the Climate Central data on climate events, focusing on the United States. The data contains information on:
- Event Name
- Type of Event
- Year
- Month
- Day
- Full Date
- Cost
- Death
- Days between events
Prior to beginning this module, run the R code at the top of this notebook to read in the .csv file and save it to a variable. The source and library functions are included so that you can complete the test questions in this module and they can be autograded. Don’t be concerned if you see a warning that an incomplete final line was found in the data set; this will be worked around.
Packages
If the following packages are not already installed, please run the following code block; otherwise, you can skip this step.
install.packages("dplyr")
install.packages("magrittr")
install.packages("tidyr")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("here")
library(dplyr)
library(magrittr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(here)
Exercises
A) Briefly look at the records from the beginning and end of your data to understand its basic structure.
head(disaster)
tail(disaster)
B) You can also view a condensed summary of your data. Since the disaster dataframe doesn’t have a huge number of columns, you can view a quick snapshot of your data using the str() function. This will tell you the class of each variable and give you a preview of its contents.
str(disaster) # preview of data with helpful details
C) The glimpse() function from dplyr is a slightly cleaner alternative to str(). The reason to review the overall structure of your data is to determine if there are any issues with the way columns are labelled, how variables are encoded, etc.
glimpse(disaster) # better version of str() from dplyr
D) Neither of these provided much information, because the data is currently being treated as text and needs to be wrangled. Let’s save it as a data frame and use the mutate() function to create new columns that are functions of existing columns, to clean up the data a bit. The %>% operator is called the pipe operator and is used to chain together multiple functions into a single pipeline. In this case, we’re using it to apply the mutate() function to the .csv file data, as shown in the sketch below. In addition to simple arithmetic expressions, you can use any R function inside the mutate() function to create new variables.
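To make the pipeline mechanics concrete before applying them to the real data, here is a minimal sketch using a toy data frame with made-up values (the column names here are illustrative, not from the disaster data):

library(dplyr)

toy <- data.frame(cost_millions = c(1.2, 3.4, 0.7)) # made-up values
toy %>%
mutate(cost = cost_millions * 1e6) # new column computed from an existing column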
Specifically regarding the mutation of the time and date data, lubridate is an R package that provides a set of tools for working with dates and times in R. It makes it easy to parse, manipulate, and format date and time objects in a way that is both intuitive and efficient. You can find more information on working with dates and times here.
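As a quick illustration (the date strings below are made up and are not from the data set), each lubridate parser reads dates in the order its name suggests:

library(lubridate)

mdy("3/14/1998") # month/day/year text -> Date "1998-03-14"
ymd("2023-09-15") # year-month-day text -> Date "2023-09-15"
year(mdy("3/14/1998")) # extract a component from a parsed date: 1998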
dd2 <- read.csv("../datasets/disaster_data.csv") %>% # read the .csv file and save it as a data frame
mutate(date_lub = mdy(full_date)) %>% # parse the month/day/year text in full_date into a Date column
mutate(date_base = ISOdate(year, month, day)) %>% # create a date-time for each record from the year, month, and day columns
mutate(cost = as.numeric(cost)) # ensure the cost column is saved as numeric
E) Briefly look at the beginning and end of the records again. This time, the output is much more legible. This is an example of tidy data: each subject or observation is in a row; it is coded by different variables (disaster type, year, etc.) in successive columns. Values correspond to the measurement or category for that subject for the given variable listed in the column header.
head(dd2)
tail(dd2)
F) Review the structure of the new dataframe using the following functions.
summary(dd2) # summary statistics for each column
class(dd2) # class of data object
dim(dd2) # dimensions of data object
names(dd2) # column names
str(dd2) # preview of data with helpful details
glimpse(dd2) # better version of str() from dplyr
G) Now that we have wrangled our data, let’s visualize the climate disasters between 1980 and now. Begin by setting the ggplot2 theme to theme_bw(), which uses a white background with grey gridlines.
theme_set(theme_bw())
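Should you want to undo this later, the one-line sketch below (optional, and commented out so it does not run by accident) restores ggplot2’s default grey-background theme:

# theme_set(theme_grey()) # optional: restore the default grey theme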
H) Next, use the following R code to create a time series plot. These are useful tools when analyzing and understanding trends, patterns, and changes in data over time. There are several reasons why one might use a time series plot.
- They can reveal long-term trends in the data, such as seasonal or cyclical patterns, which may not be apparent from individual data points.
- They can help identify outliers, anomalies, and other irregularities in the data, which can be useful in detecting and correcting errors or anomalies.
- They can provide insight into the relationship between variables and their changes over time, which can be helpful in identifying cause-and-effect relationships and making predictions or forecasts.
Sometimes it can be difficult to see whether there is a trend in the data based on point data alone, which is why a trend line has been added here and smoothed using the geom_smooth() function. By default, this function uses locally estimated scatterplot smoothing (LOESS), which is a non-parametric method for fitting a smooth curve through a scatterplot. Specifically, the LOESS algorithm assigns weights to nearby data points according to their distance from the point being smoothed, and these weights are then used to fit a weighted least squares regression line through the data points in that neighbourhood. If you would like more information, check out this link, which provides a more detailed discussion of the function.
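For a rough sense of what geom_smooth() computes under the hood, the sketch below calls base R’s loess() directly with its default span of 0.75. It assumes dd2 from step D; note that because the plot below log-transforms the y-axis, geom_smooth() fits on the log10 scale, so this is an approximation rather than a reproduction of the plotted line:

loess_fit <- loess(cost ~ as.numeric(date_lub), data = dd2, span = 0.75) # LOESS fit of cost over time
summary(loess_fit) # residual standard error and smoothing details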
ggplot(data = dd2, aes(x = date_lub, y = cost)) +
geom_point(col = "red") + # style the points representing each individual disaster
labs(x = "Date", y = "Cost ($US)") + # label the x and y axis
scale_y_log10() + # transform the y scale to use log 10
geom_smooth(se = FALSE) # add a trend line over the data
I) The data would be better visualized if a trend line were given for each type of disaster. We can do this by grouping the data by disaster_type when creating the time series plot, then color coding the groups to make it easier for readers to compare different trends when looking at our visualization.
ggplot(data = dd2, aes(x = date_lub, y = cost,
group = disaster_type)) + # group the data by disaster type.
geom_point(aes(col = disaster_type, # color each point based on disaster type
shape = disaster_type)) + # assign a shape to each disaster type
labs(x = "Date", y = "Cost ($US)") + # label the x and y axis
scale_y_log10() + # transform the y scale to use log 10
geom_smooth(se = FALSE, aes(col = disaster_type), # create trend lines and color by disaster type
span = 1) + # controls the amount of smoothing. Larger numbers produce smoother lines.
labs(col = "Disaster type", # set the legend title for color
shape = "Disaster type") # match the shape legend title so the two merge into one legend
J) Next, let’s create a histogram to visualize how many days are generally between each major disaster.
ggplot(data = dd2) + # reference the data frame
geom_histogram(aes(x = days_between), # indicate the variable being tracked on the x-axis
boundary = 0, binwidth = 20, # start the bins at zero and make each bin 20 days wide
fill = "lightblue", col = "black") + # style your columns and their outline
labs(x = "Days between successive events", y = "Count") # label the x and y axis
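As a quick cross-check on the histogram (assuming dd2 from step D), a numeric summary reports the same distribution in table form:

summary(dd2$days_between) # min, quartiles, median, mean, and max of the gaps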
K) Once again, make a time series plot; this time, however, compare the number of days between natural disasters rather than their cost.
# time series of days_between
ggplot(data = dd2, aes(x = date_lub, y = days_between)) + # indicate which variables will be plotted on the x- and y-axes
geom_point(col = "red") + # color the points representing disasters
geom_smooth(span = 1, se = FALSE) + # controls the amount of smoothing. Larger numbers produce smoother lines.
labs(y = "Days between successive events", x = "Date") # label the x and y axis
L) While we could create a time series of days between disasters that plots each type of disaster on the same graph, sometimes it can be better to plot each as a separate visualization and compare them side by side. facet_wrap() makes a long ribbon of panels (generated by any number of variables) and wraps it into two dimensions.
ggplot(data = dd2, aes(x = date_lub, y = days_between)) + # indicate which variables will be plotted on the x- and y-axes
geom_point(col = "red") + # color the points representing disasters
geom_smooth(span = 1, se = FALSE) + # controls the amount of smoothing. Larger numbers produce smoother lines.
labs(y = "Days between successive events", x = "Date") + # label the x and y axis
facet_wrap(vars(disaster_type)) # create one panel for each disaster type
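One optional variant, sketched under the assumption that some disaster types have much longer gaps than others: passing scales = "free_y" to facet_wrap() gives each panel its own y-axis range, which can reveal within-type trends at the cost of making panels harder to compare directly.

ggplot(data = dd2, aes(x = date_lub, y = days_between)) +
geom_point(col = "red") +
geom_smooth(span = 1, se = FALSE) +
labs(y = "Days between successive events", x = "Date") +
facet_wrap(vars(disaster_type), scales = "free_y") # each panel gets its own y-axis range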