# un-comment the packages that need to be installed. Installation may take a while.
#install.packages("tidytext")
#install.packages("ggplot2")
#install.packages("RcppParallel")
#install.packages("quanteda")
#install.packages("tidytext")
#install.packages("SentimentAnalysis")
#install.packages("textdata")
#install.packages("RColorBrewer")
#install.packages("RedditExtractoR")
#install.packages("wordcloud")
4.2 - Advanced - Introduction to Sentiment Analysis
Loading and Installing Packages
# these may take a while to load
library(tidytext)
library(ggplot2)
library(readr)
library(tidyverse)
library(RcppParallel)
library(quanteda)
library(janeaustenr)
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr)
library(RColorBrewer)
Prerequisites
- Introduction to Jupyter
- Introduction to R
Learning outcomes
After completing this notebook, you will be able to:
1. Understand and apply the principles of “tidy text” data to clean a textual dataset 2. Perform basic sentiment analysis using Reddit and the RedditExtractoR package.
What is Sentiment Analysis?
“Sentiment analysis is the practice of applying natural language processing and text analysis techniques to identify and extract subjective information from text” (Hussein, 2018). As this definition alludes, sentiment analysis is a part of natural language processing (NLP) which is a field that exists at the intersection of human language and computation. Because humans are complex, emotional beings, the language we use is often shaped by our affective (emotional) dispositions. Sentiment analysis, sometimes referred to as “opinion mining” is one way that researchers can methodologically understand the emotional intentions that lie in a textual dataset.
🔎 Let’s think critically****
🟠 At the heart of sentiment analysis is the assumption that language reveals interior, affective states, and that these states can be codified and generalized to broader populations. In her book, Atlas of AI the artificial intelligence scholar Kate Crawford explores how many assumptions found in contemporary sentiment research (ie, that there are 7 universal emotions, etc) are largely unsubstantiated notions that emerged from mid 20th century research funded by US Department of Defense. Rather than maintaining that emotions can be universally categorized, her work invites researchers to think about how emotional expression is highly contextualized by social and cultural factors and the distinct subject positions of content makers.
🟠 Consider the research question for your sentiment analysis project. How might the text you are working with be shaped by the distinct communities that have generated it?
🟠 Are there steps you can take to educate yourself around the unique language uses of your dataset (for example, directly speaking with someone from that group or learning from a qualified expert on the subject)?
🟠 If you’re interested, you can learn more about data justice in community research in a guide created by UBC’s Office for Regional and International Community Engagement.
The rise of web 2.0 has produced prolific volumes of user-generated content (UGC) on the internet, particularly as people engage in a variety of social platforms and forums to share opinions, ideas and express themselves. Maybe you are interested in understanding how people feel about a particular political candidate by examining tweets around election time, or you wonder what people think about a particular bus route on reddit. UGC is often unstructured data, meaning that it isn’t organized in a recognizable way.
Structured data for a microwave product review might look something like this:
Pro | Con | Neutral |
---|---|---|
Interface is visually appealing | Hard to change the time | Purchased from store #553 |
Heats up food perfectly | Plug cord length is too short | Product weighed 23lbs |
Unstructured data for a microwave product review might look something like this:
I bought the WAV0 X5K microwave last week. When i got home I was tryign to set it up and needed to go out and buy an extension cord because the one on the thing was too short. Took me 20 mins to figure out how to change the time, but teh interface was visually appealing. When I finally got working, it heated up my leftover take-out dinner perfectly.
In the structured data example above, the reviewer defines which parts of the feedback are positive, negative or neutral. In the unstructured example on the other hand, there are many typos and a given sentence might include a positive and a negative review as well as more nuanced contextual information (ie, that the person had to buy an additional product to make the microwave work). While messy, this contextual information often carries valuable insights that can be very useful for researchers.
The task of sentiment analysis is to make sense of these kinds of nuanced textual data - often for the purpose of understanding people, predicting human behaviour, or even in some cases, manipulating human behaviour.
Language is complex and always changing.
In the English language, for example, the word “present” has multiple meanings which could have positive, negative or neutral connotations. Further, a contemporary sentiment lexicon might code the word “miss” as being associated with negative or sad emotional experiences such as longing; if such a lexicon were applied to a 19th century novel which uses the word “miss” to describe single women, then, it might incorrectly associate negative sentiment where it shouldn’t be. While sentiment analysis can be a useful tool, it demands ongoing criticality and reflexivity from a researcher (you!). Throughout your analysis, be sure to continually ask yourself whether a particular sentiment lexicon is appropriate for your project.
Working with Textual Data
In this workshop we will consider a few basic, frequency-based approaches to performing sentiment analysis. Even while the methods explored here may be simple, they fundamentally underpin the kinds of learning and pattern recognition that more elaborate machine learning (ML) approaches used in technologies that employ voice recognition, face recognition and so on.
Before we can begin analyzing sentiment in a given dataset, we need to clean it to ensure that it’s formatted in a way that will be responsive to the operations we’d like to perform. As has been covered in other COMET notebooks, the tidy data principles are a set of guidelines for cleaning and structuring data that make it easier to work with and compatible with packages from the tidyverse. When formatting textual data according to the tidy text format, we want each token (or each meaningful unit of text, be it a sentence, phrase or individual word) to have its own row.
You may also be familiar with other formats that text can be stored in. Hicks (2022) outlines how textual data can be stored in strings, corpera and document-term matrices:
- String: text can, of course, be stored as strings, i.e., character vectors, within R, and often text data is first read into memory in this form. (example: “here is a string of text.”)
- Corpus: these types of objects typically contain raw strings annotated with additional metadata and details.
- Document-term matrix: This is a sparse matrix describing a collection (i.e., a corpus) of documents with one row for each document and one column for each term. The value in the matrix is typically word count.
If you have lots of experience with cleaning textual data sets, feel free to skip to the next section - if you’re looking for more practice with this, you’re right where you need to be!
The following list contains a few things we want to consider when preparing a textual dataset; many of these steps will be automatically taken by sentiment analysis functions: * Stop words are common words that aren’t typically useful in textual analysis because they don’t communicate very much meaning. Some common stop words include: “to”, “the,” and “and.”
You can look at a pre-defined list of stop words in the tidytext dataset using this command:
head(stop_words)
- Case matching is one way we can remove noise in a textual dataset by transforming all characters to lowercase, especially if we don’t want our model to treat capitalized words differently from non-capitalized words.
- Punctuation removal is another noise removal strategy. One important thing to consider is how punctuation removal can accidentally remove important meaning or structure to a text.
Next, we’ll run through a practice example using a predetermined lexicon titled, bing
which was created by Bing Liu and collaborators. The lexicon contains a list of around 6800 English words which are categorized as being either positive or negative. You can read more about the lexicon here. One thing to note about lexicon-based analyses that operate at the word level is that they are not able to understand the sentiment of phrases or pairs of words such as “not bad” or “not true.” Can you think of any other phrases that might not work with a word-based analysis?
# Construct a dataframe
<- c("@potus", "@abject.ron", "@tess888", "@ayden99", "@curious_reggie",
username "@peter.the.third", "@xavier_w", "@humble.pacifist",
"@krz4377", "@not.nat")
<- c("Today we changed prehistoric policies held our great country back from progress.",
policy_text "@potus this policy change is an abomination of everything America stands for",
"I have completely lost trust in the government",
"I am hopeful things will get better after this valuable change",
"Navigating the past is always a challenge, but one that can be overcome through hard work.",
"Can our country recover from this?",
"@ayden99 - Progress wins. A victory for America today.",
"Poor call @potus - old rules kept us from making the mistakes of the past...",
"I'm sick of aristocracy stamping out the people's power",
"Definitely some mixed feelings about today's decision. Some wins, some losses, but hey - that's democracy.")
<- tibble(username = username, text=policy_text) |>
policy_df group_by(username)
policy_df
Once we have a dataframe, we want to convert it into the tidytext format. The unnest_tokens
function helps us do that and includes arguments which we can use to specify how we want this done. You can learn more about the unnest_tokens
function by inputing ?unnest_tokens
into a code cell.
<- policy_df |>
policy_token unnest_tokens(output = word,
input = text,
token = "words", # this specifies that we want a token to be 1 word
to_lower = TRUE) # converts all text to uniform lowercase
head(policy_token)
In the above code, try changing the argument token = "words"
to token = "characters"
or token = "sentences"
What do you see?
If we were interested in running our sentiment analysis at a higher level, for example, by considering sentences as tokens, we could also do that. For the purpose of this analysis, we will be working at the word level; be sure to return the above argument to token = "words"
when you are ready to continue the analysis.
Negative Sentiment
If we are only interested in identifying the words in our corpus of tweets that contain negative (as opposed to positive) sentiment, we can use the bing
library. Let’s take a look at some of the words on the bing
list:
head(get_sentiments("bing"))
<- get_sentiments("bing") %>%
negative_sentiments filter(sentiment == "negative") # select only the negative words
<- policy_token %>%
negative_policy inner_join(negative_sentiments) %>%
count(word, sort = TRUE) # count the number of negative words
head(negative_policy)
Negative and Positive Sentiment
<- policy_token %>%
sentiment_policy inner_join(get_sentiments("bing")) %>% # adds column with binary sentiment library
count(word, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
head(sentiment_policy)
Summarizing Sentence-level Sentiment
<- policy_token %>%
summary_policy inner_join(get_sentiments("bing")) %>% # adds column with binary sentiment library
count(username, word, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
summary_policy
<- policy_token |>
bing_word_counts inner_join(get_sentiments("bing")) |>
count(word, sentiment, sort = TRUE) |>
ungroup()
bing_word_counts
We can further look at the summed score for a given user’s tweet beyond the individual words.
For example, if we ran the code below, we would get a value of 0
because (-1) + (+1) = 0
.
Try out a few different usernames in the blank ...
code below: * “@potus” * “@abject.ron” * “@tess888” * “@ayden99” * “@curious_reggie” * “@peter.the.third” * “@xavier_w” * “@humble.pacifist” * “@krz4377” * “@not.nat”
<- summary_policy |>
user_summary filter(username == "@curious_reggie")|>
mutate(sentiment_sum = sum(sentiment))|>
select(username, sentiment_sum)|>
distinct()
user_summary
Applied Analysis: Sentimental Analysis of r/UBC
In real use-cases, sentiment analysis isn’t done on small hand-made datasets. Rather, it’s usually done on large, messy datasets pulled from the internet. Now that we’ve looked at the basics of sentiment analysis, let’s apply our skills to analysing the r/UBC subreddit. The r/ubc subreddit is a subreddit ran by UBC students, for UBC students, that is frequented by over 100 thousand members. We’ll attempt to answer the following research question:
Which registertaion tool do students prefer most? Workday or the SSC?
To answer this question, we’ll apply the following steps:
- Scrape the top posts on r/UBC using the
RedditExtractoR
package and selected keywords. - Preprocess our resulting dataset using
tidyverse
. - Run sentimental analysis using the
quanteda
package. - Create visualizations to back-up our research question using
ggplot2
.
The RedditExtractoR
package will allow us to scrape reddit posts directly from R, without having to log into reddit ourselves. This makes it an effective tool for small-scale web scraping.
library(RedditExtractoR)
Downloaded Reddit posts
To find reddit posts, we’ll use the find_thread_urls()
command. The keywords
argument specifies which keywords we’d like to have in our reddit posts. We’ll create two dataframes, one with reddit posts with the keyword “workday”, and the other with the keyword “ssc”. Additionally, we’ll specify to sort by “top” for the most popular results, and ensure that our subreddit is set it “UBC”.
Depending on your internet speed, set sort=by = "top"
and period = "all"
for best results.
<- tibble(find_thread_urls(
rubc_urls_workday keywords = "workday",
sort_by = "new",
subreddit = "ubc",
period = "day"
))
<- tibble(find_thread_urls(
rubc_urls_ssc keywords = "ssc",
sort_by = "new",
subreddit = "ubc",
period = "day"
))
Running the new dataframes will give you some examples of the different comments and posts related to workday and the ssc, respectively.
rubc_urls_workday
rubc_urls_ssc
We’ll also have to preprocess the comments to allow for sentiment analysis. We’ll do so by removing individual numbers, punctuation, separators and symbols. We’ll also remove unnessesary stopwords, as seen earlier. Lastly, we’ll convert all the columns to characters.
<- rubc_urls_workday |>
r_ubc_workday_cleaned select(text)|>
as.character()|>
tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)|>
tokens_select(pattern = stopwords("en"), selection = "remove")
<- rubc_urls_ssc |>
r_ubc_ssc_cleaned select(text)|>
as.character()|>
tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)|>
tokens_select(pattern = stopwords("en"), selection = "remove")
Let’s create a visualization of our two datasets, using the wordcloud
package.
library(wordcloud)
set.seed(100)
wordcloud(r_ubc_workday_cleaned, min.freq=5, scale=c(5,0.25), colors=brewer.pal(8, "Dark2"))
wordcloud(r_ubc_ssc_cleaned, min.freq=5, scale=c(5,0.25), colors=brewer.pal(8, "Accent"))
Downloading a Dictionary
For our analysis, we’ll be using a lexicon-based sentiment analysis approach: grouping sentiment by either “negative” or “positive” using a pre-labelled dictionary. We’ll download and use a sentiment dictionary, a dataset of words which are split up into negative vesus positive sentiments. This is particularly usefull when all we want to do is quantitatively compare the amount of positive versus negative sentiment in a dataset. We’ll be using the opinion lexicon english dictonary for this task.
The dictionary itself is a set of two .txt
files located under data > opinion-lexicon-english
. We’ll combine the two into a tibble and associate the negative text file with negative sentiment, and the positive text file with positive sentiment.
<- scan("data/opinion-lexicon-English/positive-words.txt", what = "char", sep = "\n", skip = 35, quiet = T)
positive.words.bl <- scan("data/opinion-lexicon-English/negative-words.txt", what = "char", sep = "\n", skip = 35, quiet = T)
negative.words.bl
<- dictionary(list(positive = positive.words.bl, negative = negative.words.bl))
sentiment.dictionary str(sentiment.dictionary)
Conducting the Sentiment Analysis
We can now coduct sentiment analysis on the reddit comments. We’ll do so using the quanteda
package, as seen earlier. Using our dictionary from earlier, we’ll comb through the datasets and isolate their sentiment into a new dataframe. We’ll be doing a polar sentiment analysis, meaning that we will assigning strictly positive or negative sentiment to each entry. Additionally, we’ll standardize our sentiment values to be between 0 and 1. This will allow for a more acurate visualization, as the amount of rows in the ssc and workday datasets are different.
<- dictionary(list(positive = positive.words.bl, negative = negative.words.bl))
sentiment.dictionary
<- dfm(r_ubc_workday_cleaned)
dfm_ubc_workday <- dfm(r_ubc_ssc_cleaned)
dfm_ubc_ssc
<- dfm_lookup(dfm_ubc_workday, dictionary = sentiment.dictionary)
dfm_ubc_workday <- dfm_lookup(dfm_ubc_ssc, dictionary = sentiment.dictionary)
dfm_ubc_ssc
<- convert(dfm_ubc_workday, to = "data.frame") |>
sentiment_ubc_workday gather(positive, negative, key = "Polarity", value = "Words")
<- convert(dfm_ubc_ssc, to = "data.frame") |>
sentiment_ubc_ssc gather(positive, negative, key = "Polarity", value = "Words")
<- c("workday", "workday", "ssc", "ssc")
labels
<- tibble(bind_rows(sentiment_ubc_workday, sentiment_ubc_ssc)) |>
sentiment_ubc select(!doc_id) |>
bind_cols(labels) |>
rename(label = ...3)|>
group_by(label)|>
mutate(Words = Words/sum(Words))
sentiment_ubc
<- ggplot(sentiment_ubc, aes(x=label, y=Words, fill=Polarity,))+geom_bar(stat="identity", color="#faf3ff", linewidth=2)+ggtitle("Sentiment Analysis of posts related to the SSC versus Workday")+scale_fill_brewer(palette = "Paired")
sentiment_ubc_plot sentiment_ubc_plot
Our analysis shows that there is slightly more negative sentiment towards workday compared to the SSC, over the last year. However, it is worth mentioning that these results are not entirely representative of the sentiment of posts on r/UBC, primarily for two reasons:
1) Traditional dictionary-based sentiment analysis struggles with terms that are not in it’s reference dictionaries. Therefore, traditional sentiment analysis will not pick up negative “gen-Z” terms. Take, for instance, the following post by u/Zestyclose-Collar-76:
WHY DOES WORKDAY SHOW MY AVERAGE: Literally triggering. I don’t want to be shown this against my will.
The sentiment here is clearly negative: OP is expressing frustration about having to see their GPA every time they open workday. However, the word “triggering” is not listed as a negative word in the opinion english lexicon. Thus, this post will not be considered “negative”.
2) Dictonary-based entiment analysis struggles with irony, sarcasm, and not explicitly negative/positive statements. Consider the following post by u/imFrenly:
I made workday more usable: I made a chrome extension that changes workday’s current course browsing layout to one that looks like the old SSC one. Hopefully this will make everyone’s lives easier!! Feel free to give feedback and report bugs to me. Installation instructions here: https://github.com/JustintyC/Custom-Workday-Courselist-Extension (Unfortunately this doesn’t work with the other side by side calendar extension, but its other non-calendar related features such as pfp removal still work)
The context implied by this post is that workday is unusable, and therefore, this student put together a Chrome extension to bypass some of the issues. However, nothing in this post is explicitly critiquing workday. The critique is implied through the context, something traditional dictionary-based sentiment analysis cannot do.
References
- Air Force Institute of Technology. (n.d.). Text Mining: Sentiment Analysis · AFIT Data Science Lab R Programming Guide. Retrieved May 31, 2024, from https://afit-r.github.io/sentiment_analysis
- Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. https://doi.org/10.21105/joss.00774
- Hicks, S. (2022, October 13). Tidytext and sentiment analysis: Introduction to tidytext and sentiment analysis. https://www.stephaniehicks.com/jhustatcomputing2022/posts/2022-10-13-working-with-text-sentiment-analysis/
- Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004). https://doi.org/10.1145/1014052.1014073
- Hussein, D. M. E.-D. M. (2018). A survey on sentiment analysis challenges. Journal of King Saud University - Engineering Sciences, 30(4), 330–338. https://doi.org/10.1016/j.jksues.2016.04.002
- imFrenly. (2024, June 27). I made workday more usable. https://www.reddit.com/r/UBC/comments/1dpe5so/i_made_workday_more_usable/
- Liu, B. (2011). Sentiment Analysis and Opinion Mining. Department of Computer Science University Of Illinois at Chicago. https://www.cs.uic.edu/~liub/FBS/Sentiment-Analysis-tutorial-AAAI-2011.pdf
- Robinson, D. (2016, July 21). Does sentiment analysis work? A tidy analysis of Yelp reviews. Variance Explained. http://varianceexplained.org/r/yelp-sentiment/
- Silge, J., & Hvitfeldt, E. (2022). Supervised Machine Learning for Text Analysis in R. https://smltar.com/
- Silge, J., & Robinson, D. (2017). Welcome to Text Mining with R | Text Mining with R. https://www.tidytextmining.com/
- Zestyclose-Collar-76. (2024, May 21). WHY DOES WORKDAY SHOW MY AVERAGE. https://www.reddit.com/r/UBC/comments/1cxjf77/why_does_workday_show_my_average/