{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 4.4 - Advanced - Word Embeddings (R)\n", "\n", "*R Version*\n", "\n", "\n", "\n", "*This notebook was prepared by Laura Nelson in collaboration with [UBC\n", "COMET](https://comet.arts.ubc.ca/) team members: Jonathan Graves, Angela\n", "Chen and Anneke Dresselhuis*\n", "\n", "## Prerequisites\n", "\n", "1. Some familiarity programming in R\n", "2. Some familarity with natural language processing\n", "3. No computational text experience necessary!\n", "\n", "## Learning outcomes\n", "\n", "In the notebook you will\n", "\n", "1. Familiarize yourself with concepts such as word embeddings (WE)\n", " vector-space model of language, natural language processing (NLP)\n", " and how they relate to small and large language models (LMs)\n", "2. Import and pre-process a textual dataset for use in word embedding\n", "3. Use word2vec to build a simple language model for examining patterns\n", " and biases textual datasets\n", "4. Identify and select methods for saving and loading models\n", "5. Use critical and reflexive thinking to gain a deeper understanding\n", " of how the inherent social and cultural biases of language are\n", " reproduced and mapped into language computation models\n", "\n", "## Outline\n", "\n", "The goal of this notebook is to demystify some of the technical aspects\n", "of language models and to invite learners to start thinking about how\n", "these important tools function in society.\n", "\n", "In particular, this lesson is designed to explore features of word\n", "embeddings produced through the word2vec model. The questions we ask in\n", "this lesson are guided by Ben Schmidt’s blog post, [Rejecting the Gender\n", "Binary](%22http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html).\n", "\n", "The primary corpus we will use consists of the\n", "150 English-language novels made\n", "available by the .txtLab at McGill University. We also look at\n", "a Word2Vec model trained\n", "on the ECCO-TCP corpus of 2,350 eighteenth-century literary texts\n", "made available by Ryan Heuser. (Note that the number of terms in the\n", "model has been shortened by half in order to conserve memory.)\n", "\n", "## Key Terms\n", "\n", "Before we dive in, feel free to familiarize yourself with the following\n", "key terms and how they relate to each other." 
], "id": "d3bcd253-d9a3-4381-ab13-dce0115e3b66" }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/html" }, "source": [ "" ], "id": "f3152375-c59d-4ad2-8732-e939246204b8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ], "id": "f2065b82-c866-4391-8bdb-0b5c00171337" }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/html" }, "source": [ "" ], "id": "a47de43f-5ef1-4139-917f-0a0885cb18bc" }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Artificial Intelligence (AI):** this term is a broad category that\n", "includes the study and development of computer systems that can copy\n", "intelligent human behaviour (adapted from [*Oxford Learners\n", "Dictionary*](https://www.oxfordlearnersdictionaries.com/definition/english/ai#:~:text=%2F%CB%8Ce%C9%AA%20%CB%88a%C9%AA%2F-,%2F%CB%8Ce%C9%AA%20%CB%88a%C9%AA%2F,way%20a%20human%20brain%20does.))\n", "\n", "**Machine Learning (ML):** this is branch of AI which is uses\n", "statistical methods to imitate the way that humans learn (adapted from\n", "[*IBM*](https://www.ibm.com/topics/machine-learning))\n", "\n", "**Natural Language Processing (NLP):** this is branch of AI which\n", "focuses on training computers to interpret human text and spoken words\n", "(adapted from\n", "[*IBM*](https://www.ibm.com/topics/natural-language-processing#:~:text=the%20next%20step-,What%20is%20natural%20language%20processing%3F,same%20way%20human%20beings%20can.))\n", "\n", "**Word Embeddings (WE):** this is an NLP process through which human\n", "words are converted into numerical representations (usually vectors) in\n", "order for computers to be able to understand them (adapted from\n", "[*Turing*](https://www.turing.com/kb/guide-on-word-embeddings-in-nlp))\n", "\n", "**word2vec:** this is an NLP technique that is commonly used to generate\n", "word embeddings\n", "\n", "## What are Word Embeddings?\n", "\n", "Building off of the definition above, word embeddings are one way that\n", "humans can represent language in a way that is legible to a machine.\n", "More specifically, they are an NLP approach that use vectors to store\n", "textual data in multiple dimensions; by existing in the\n", "multi-dimensional space of vectors, word embeddings are able to include\n", "important semantic information within a given numeric representation.\n", "\n", "For example, if we are trying to answer a research question about how\n", "popular a term is on the web at a given time, we might use a simple word\n", "frequency analysis to count how many times the word “candidate” shows up\n", "in tweets during a defined electoral period. However, if we wanted to\n", "gain a more nuanced understanding of what kind of language, biases or\n", "attitudes contextualize the term, “candidate” in discourse, we would\n", "need to use a method like word embedding to encode meaning into our\n", "understanding of how people have talked about candidates over time.\n", "Instead of describing our text as a series of word counts, we would\n", "treat our text like coordinates in space, where similar words and\n", "concepts are closer to each other, and words that are different from\n", "each other are further away.\n", "\n", "\n", "\n", "For example, in the visualization above, a word frequency count returns\n", "the number of times the word “candidate” or “candidates” is used in a\n", "sample text corpus. 
"corpus, we are able to map concepts and phrases that are closely\n", "related to “candidate” as neighbours, while other words and phrases such\n", "as “experimental study” (which refers to the research paper in question,\n", "and not to candidates specifically) are further away.\n", "\n", "Here is another example of how different, but related, words might be\n", "represented in a word embedding: \n", "\n", "## Making a Word Embedding\n", "\n", "So, how do word embeddings work? To make a word embedding, an input word\n", "gets compressed into a dense vector.\n", "\n", "\n", "\n", "The magic and mystery of the word embedding process is that the vectors\n", "produced by the model often embed qualities of a word or phrase that are\n", "not directly interpretable by humans. However, for our purposes, having\n", "the text in vector format is all we need. With this format, we can\n", "perform tests like cosine similarity and other kinds of operations. Such\n", "operations can reveal many different kinds of relationships between\n", "words, as we’ll examine a bit later.\n", "\n", "## Using word2vec\n", "\n", "Word2vec is one NLP technique that is commonly used to generate word\n", "embeddings. More precisely, word2vec is an algorithmic learning tool\n", "rather than a specific neural net that is already trained. The example\n", "we will be working through today has been made using this tool.\n", "\n", "The algorithms inside the word2vec model try to describe and acquire\n", "parameters for a given word in terms of the words that appear\n", "immediately to its right and left in actual sentences. Essentially, it\n", "learns how to predict text.\n", "\n", "Without going too deep into the algorithm, suffice it to say that it\n", "involves a two-step process:\n", "\n", "1. First, the input word gets compressed into a dense vector, as seen\n", "    in the simplified diagram, “Creating a Word Embedding,” above.\n", "2. Second, the vector gets decoded into the set of context words.\n", "    Words that appear within similar contexts will have similar\n", "    vector representations between these two steps.\n", "\n", "Imagine that each word in a novel has its meaning determined by the ones\n", "that surround it in a limited window. For example, in Moby Dick’s first\n", "sentence, “me” is flanked on either side by “Call” and “Ishmael.” After\n", "observing the windows around every word in the novel (or many novels),\n", "the computer will notice a pattern in which “me” falls between pairs of\n", "words similar to those around “her,” “him,” or “them.” Of course, the\n", "computer has gone through a similar process for the words “Call” and\n", "“Ishmael,” for which “me” is reciprocally part of their contexts. This chaining of\n",
"signifiers to one another mirrors some of humanists’ most sophisticated\n", "interpretative frameworks of language.\n", "\n", "The two main model architectures of word2vec are **Continuous Bag of\n", "Words (CBOW)** and **Skip-Gram**, which can be distinguished partly by\n", "their input and output during training.\n", "\n", "**CBOW** takes the context words (for example, “Call”, “Ishmael”) as a\n", "single input and tries to predict the word of interest (“me”).\n", "\n", "\n", "\n", "**Skip-Gram** does the opposite, taking a word of interest as its input\n", "(for example, “me”) and trying to learn how to predict its context words\n", "(“Call”, “Ishmael”).\n", "\n", "\n", "\n", "In general, CBOW is faster and does well with frequent words, while\n", "Skip-Gram potentially represents rare words better.\n", "\n", "Since the word embedding is a vector, we are able to perform tests like\n", "cosine similarity (which we’ll learn more about in a bit!) and other\n", "kinds of operations. Those operations can reveal many different kinds of\n", "relationships between words, as we shall see.\n", "\n", "## Bias and Language Models\n", "\n", "You might already be piecing together that the encoding of meaning in\n", "word embeddings is entirely shaped by patterns of language use captured\n", "in the training data. That is, what is included in a word embedding\n", "directly reflects the complex social and cultural biases of everyday\n", "human language - in fact, exploring how these biases function and change\n", "over time (as we will do later) is one of the most interesting ways to\n", "use word embeddings in social research.\n", "\n", "#### It is simply impossible to have a bias-free language model (LM).\n", "\n", "In LMs, bias is not a bug or a glitch; rather, it is an essential\n", "feature that is baked into their fundamental structure. For example, LMs\n", "inevitably learn and absorb the pejorative dimensions of language,\n", "which in turn can result in reproducing harmful correlations of meaning\n", "for words about race, class or gender (among others). When unchecked,\n", "these harms can be “amplified in downstream applications of word\n", "embeddings” ([Arseniev-Koehler & Foster, 2020,\n", "p. 1](https://osf.io/preprints/socarxiv/b8kud/)).\n", "\n", "Just like any other computational model, it is important to critically\n", "engage with the source and context of the training data. One way that\n", "[Schiffers, Kern and Hienert](https://arxiv.org/abs/2302.06174v1)\n", "suggest doing this is by using domain-specific models (2023). Working\n", "with models that understand the nuances of your particular topic or\n", "field can better account for the “specialized vocabulary and semantic\n", "relationships” at play, which helps make applications of WE more\n", "effective.\n", "\n", "## Preparing for our Analysis\n", "\n", "#### Word2vec Features\n", "\n", "**Here are a few features of the word2vec tool that we can use to\n", "customize our analysis:**\n", "\n", "- `size`: Number of dimensions for the word embedding model\n", " \n", "- `window`: Number of context words to observe in each direction\n", " \n", "- `min_count`: Minimum frequency for words included in the model\n", " \n", "- `sg` (Skip-Gram): ‘0’ indicates a CBOW model; ‘1’ indicates Skip-Gram\n", " \n", "- `alpha`: Learning rate (initial); prevents the model from\n", " over-correcting and enables finer tuning\n", " \n", "- `iterations`: Number of passes through the dataset\n", " \n", "- `batch size`: Number of words to sample from the data during each pass\n", " \n", "\n", "Note: the script uses the default value for each argument; a minimal\n", "sketch of how these options can be passed to gensim appears below.\n", "\n",
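"To make these settings concrete, here is a minimal, illustrative sketch\n", "of how they might be passed to gensim’s `Word2Vec` through `reticulate`\n", "(which we import in Exercise #1 below). The toy sentences and every\n", "parameter value here are hypothetical; note that recent gensim releases\n", "(4.x) call the dimensionality `vector_size` and the number of passes\n", "`epochs`, while older releases use `size` and `iter`.\n", "\n", "``` r\n", "# A toy illustration only -- the real model is trained later in the notebook\n", "library(reticulate)\n", "gensim <- import(\"gensim\")\n", "\n", "# Two tiny tokenized \"sentences\"\n", "toy_sentences <- list(\n", "  c(\"call\", \"me\", \"ishmael\"),\n", "  c(\"call\", \"me\", \"late\", \"for\", \"dinner\")\n", ")\n", "\n", "# Hypothetical settings; integer arguments need the L suffix for Python\n", "toy_model <- gensim$models$Word2Vec(\n", "  toy_sentences,\n", "  vector_size = 10L,   # 'size' in the list above: dimensions per word vector\n", "  window      = 2L,    # context words to observe in each direction\n", "  min_count   = 1L,    # keep every word in this tiny example\n", "  sg          = 1L,    # 1 = Skip-Gram, 0 = CBOW\n", "  alpha       = 0.025, # initial learning rate\n", "  epochs      = 20L    # 'iterations' in the list above: passes through the data\n", ")\n", "\n", "# Each word now has a 10-dimensional vector\n", "toy_model$wv$get_vector(\"me\")\n", "```\n", "\n",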
Working\n", "with models that understand the nuances of your particular topic or\n", "field can better account for “specialized vocabulary and semantic\n", "relationships” that can help make applications of WE more effective.\n", "\n", "## Preparing for our Analysis\n", "\n", "#### Word2vec Features\n", "\n", "**Here are a few features of the word2vec tool that we can use to\n", "customize our analysis:**\n", "\n", "- `size`: Number of dimensions for word embedding model\n", " \n", "- `window`: Number of context words to observe in each direction\n", " \n", "- `min_count`: Minimum frequency for words included in model\n", " \n", "- `sg` (Skip-Gram): ‘0’ indicates CBOW model; ‘1’ indicates Skip-Gram\n", " \n", "- `alpha`: Learning rate (initial); prevents model from\n", " over-correcting, enables finer tuning\n", " \n", "- `iterations`: Number of passes through dataset\n", " \n", "- `batch size`: Number of words to sample from data during each pass\n", " \n", "\n", "Note: the script uses default value for each argument.\n", "\n", "**Some limitations of the word2vec Model**\n", "\n", "- Within word2vec, common articles or conjunctions, called **stop\n", " words** such as “the” and “and,” may not provide very rich\n", " contextual information for a given word, and may need additional\n", " subsampling or to be combined into a word phrase (Anwla, 2019).\n", "- Word2vec isn’t always the best at handling out-of-vocabulary words\n", " well (Chandran, 2021).\n", "\n", "Let’s begin our analysis!\n", "\n", "## Exercise #1: Eggs, Sausages and Bacon\n", "\n", "\n", "\n", "To begin, we are going to install and load a few packages that are\n", "necessary for our analysis. Run the code cells below if these packages\n", "are not already installed:\n", "\n", "``` r\n", "# uncomment these by deleting the \"#\" to install them\n", "\n", "#install.packages(\"tidyverse\")\n", "#install.packages(\"repr\")\n", "#install.packages(\"proxy\")\n", "#install.packages(\"scales\")\n", "#install.packages(\"tm\")\n", "#install.packages(\"MASS\")\n", "#install.packages(\"SentimentAnalysis\")\n", "#install.packages(\"reticulate\")\n", "```\n", "\n", "``` r\n", "# Load the required libraries\n", "library(tidyverse)\n", "library(repr)\n", "library(proxy)\n", "library(tm)\n", "library(scales)\n", "library(MASS)\n", "\n", "\n", "# Set up figures to save properly\n", "options(jupyter.plot_mimetypes = \"image/png\") \n", "```\n", "\n", "``` r\n", "# Time: 30s\n", "library(reticulate)\n", "gensim <- import(\"gensim\")\n", "```\n", "\n", "#### Create a Document-Term Matrix (DTM) with a Few Pseudo-Texts\n", "\n", "To start off, we’re going to create a mini dataframe based on the use of\n", "the words “eggs,” “sausages” and “bacon” found in three different\n", "novels: A, B and C.\n", "\n", "``` r\n", "# Construct dataframe\n", "columns <- c('eggs', 'sausage', 'bacon')\n", "indices <- c('Novel A', 'Novel B', 'Novel C')\n", "dtm <- data.frame(eggs = c(50, 90, 20),\n", " sausage = c(60, 10, 70),\n", " bacon = c(60, 10, 70),\n", " row.names = indices)\n", "\n", "# Show dataframe\n", "print(dtm)\n", "```\n", "\n", "#### Visualize\n", "\n", "``` r\n", "# Then, we'll create the scatter plot of our data using ggplot2\n", "ggplot(dtm, aes(x = eggs, y = sausage)) +\n", " geom_point() +\n", " geom_text(aes(label = rownames(dtm)), nudge_x = 2, nudge_y = 2, size = 3) +\n", " xlim(0, 100) +\n", " ylim(0, 100) +\n", " labs(x = \"eggs\", y = \"sausage\")\n", "```\n", "\n", "### Vectors\n", "\n", "At a glance, a couple of points are lying closer to 
"### Vectors\n", "\n", "At a glance, a couple of points are lying closer to one another. We used\n", "the word frequencies of just two words in order to plot our texts in a\n", "two-dimensional plane. The term-frequency “summaries” of Novel A\n", "and Novel C are pretty similar to one another: they both share a\n", "major concern with “sausage”, whereas Novel B seems to focus\n", "primarily on “eggs.”\n", "\n", "This raises a question: how can we operationalize our intuition that\n", "spatial distance expresses topical similarity?\n", "\n", "## Cosine Similarity\n", "\n", "The most common measurement of this kind of closeness is [Cosine\n", "Similarity](https://en.wikipedia.org/wiki/Cosine_similarity). Cosine\n", "similarity operates on vectors, like our word counts above, and allows\n", "us to identify, for example, how similar documents are to each other.\n", "Cosine similarity thus helps us understand how much content overlap a\n", "set of documents has with one another. To see where the vectors come\n", "from, imagine that we were to draw an arrow from the origin of the\n", "graph - point (0,0) - to the dot representing each text. This arrow is\n", "called a *vector*.\n", "\n", "Mathematically, the cosine similarity of two document vectors $A$ and\n", "$B$ can be represented as:\n", "\n", "$$\\cos(\\theta) = \\frac{A \\cdot B}{\\|A\\| \\|B\\|}$$\n", "\n", "Using our example above, we can see that the angle from (0,0) between\n", "Novel C and Novel A (orange triangle) is smaller than between Novel A\n", "and Novel B (navy triangle) or between Novel C and Novel B (both\n", "triangles together).\n", "\n", "\n", "\n", "Because this similarity measurement uses the cosine of the angle between\n", "vectors, the magnitudes of the vectors are not a matter of concern (this\n", "feature is really helpful for text vectors that can often be really\n", "long!). Instead, because our word counts are never negative, the output\n", "of cosine similarity is a value between 0 and 1 (we don’t have to work\n", "with something confusing like 18º!) that can be easily interpreted and\n", "compared - and thus we can also avoid the troubles associated with other\n", "distance measures such as\n", "[Euclidean Distance](https://en.wikipedia.org/wiki/Euclidean_distance).\n", "\n",
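"Before handing this calculation over to a package, it can help to apply\n", "the formula once by hand. The short chunk below is only a quick check,\n", "assuming the `dtm` data frame from Exercise #1 is still in memory; it\n", "computes the cosine similarity between Novel A and Novel C directly.\n", "\n", "``` r\n", "# Apply the cosine formula directly to two of our term-frequency vectors\n", "m <- as.matrix(dtm)\n", "a <- m[\"Novel A\", ]\n", "c_vec <- m[\"Novel C\", ]  # a distinct name so we don't shadow R's c()\n", "\n", "sum(a * c_vec) / (sqrt(sum(a^2)) * sqrt(sum(c_vec^2)))\n", "# This should match the Novel A / Novel C entry computed with proxy below\n", "```\n", "\n",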
that can be\n", "easily interpreted and compared - and thus we can also avoid the\n", "troubles associated with other dimensional distance measures such as\n", "[Euclidean Distance](https://en.wikipedia.org/wiki/Euclidean_distance).\n", "\n", "### Calculating Cosine Distance\n", "\n", "``` r\n", "# Assuming dtm_df is a data frame containing the document-term matrix\n", "dtm_matrix <- as.matrix(dtm)\n", "\n", "# Calculate cosine similarity\n", "cos_sim <- proxy::dist(dtm_matrix, method = \"cosine\")\n", "\n", "\n", "# Although we want the Cosine Distance, it is mathematically simpler to calculate its opposite: Cosine Similarity\n", "# The formula for Cosine Distance is = 1 - Cosine Similarity\n", "\n", "# Convert the cosine similarity matrix to a 2-dimensional array\n", "# So we will subtract the similarities from 1\n", "n <- nrow(dtm_matrix)\n", "cos_sim_array <- matrix(1 - as.vector(as.matrix(cos_sim)), n, n)\n", "\n", "# Print the result\n", "print(cos_sim_array)\n", "```\n", "\n", "``` r\n", "# Make it a little easier to read by rounding the values\n", "cos_sim_rounded <- round(cos_sim_array, 2)\n", "\n", "# Label the dataframe rows and columns with eggs, sausage and bacon\n", "cos_df <- data.frame(cos_sim_rounded, row.names = indices, check.names = FALSE)\n", "colnames(cos_df) <- indices\n", "\n", "# Print the data frame\n", "head(cos_df)\n", "```\n", "\n", "## Exercise #2: Working with 18th Century Literature\n", "\n", "\n", "\n", "Workshop Run Here at Start\n", "\n", "``` r\n", "# Load the required libraries\n", "library(tidyverse)\n", "library(repr)\n", "library(proxy)\n", "library(tm)\n", "library(scales)\n", "library(MASS)\n", "\n", "\n", "# Set up figures to save properly\n", "options(jupyter.plot_mimetypes = \"image/png\") \n", "\n", "# Time: 3 mins\n", "# File paths and names\n", "filelist <- c(\n", " 'txtlab_Novel450_English/EN_1850_Hawthorne,Nathaniel_TheScarletLetter_Novel.txt',\n", " 'txtlab_Novel450_English/EN_1851_Hawthorne,Nathaniel_TheHouseoftheSevenGables_Novel.txt',\n", " 'txtlab_Novel450_English/EN_1920_Fitzgerald,FScott_ThisSideofParadise_Novel.txt',\n", " 'txtlab_Novel450_English/EN_1922_Fitzgerald,FScott_TheBeautifulandtheDamned_Novel.txt',\n", " 'txtlab_Novel450_English/EN_1811_Austen,Jane_SenseandSensibility_Novel.txt',\n", " 'txtlab_Novel450_English/EN_1813_Austen,Jane_PrideandPrejudice_Novel.txt'\n", ")\n", "\n", "novel_names <- c(\n", " 'Hawthorne: Scarlet Letter',\n", " 'Hawthorne: Seven Gables',\n", " 'Fitzgerald: This Side of Paradise',\n", " 'Fitzgerald: Beautiful and the Damned',\n", " 'Austen: Sense and Sensibility',\n", " 'Austen: Pride and Prejudice'\n", ")\n", "\n", "# Function to read non-empty lines from the text file\n", "readNonEmptyLines <- function(filepath) {\n", " lines <- readLines(filepath, encoding = \"UTF-8\")\n", " non_empty_lines <- lines[trimws(lines) != \"\"]\n", " return(paste(non_empty_lines, collapse = \" \"))\n", "}\n", "\n", "# Read non-empty texts into a corpus\n", "text_corpus <- VCorpus(VectorSource(sapply(filelist, readNonEmptyLines)))\n", "\n", "# Preprocess the text data\n", "text_corpus <- tm_map(text_corpus, content_transformer(tolower))\n", "text_corpus <- tm_map(text_corpus, removePunctuation)\n", "text_corpus <- tm_map(text_corpus, removeNumbers)\n", "text_corpus <- tm_map(text_corpus, removeWords, stopwords(\"english\"))\n", "text_corpus <- tm_map(text_corpus, stripWhitespace)\n", "\n", "## Time: 5 mins\n", "# Create a custom control for DTM with binary term frequency\n", "custom_control <- list(\n", " tokenize = 
"  tokenize = function(x) SentimentAnalysis::ngram_tokenize(x, ngmax = 1),\n", "  bounds = list(global = c(3, Inf)),\n", "  weighting = weightTf\n", ")\n", "\n", "# Convert the corpus to a DTM using the custom control\n", "dtm <- DocumentTermMatrix(text_corpus, control = custom_control)\n", "\n", "# Convert the DTM to a logical (TRUE/FALSE) data frame recording whether\n", "# each word appears in each novel\n", "dtm_df_novel <- as.data.frame(as.matrix(dtm > 0))\n", "colnames(dtm_df_novel) <- colnames(dtm)\n", "\n", "# Set row names to the novel names\n", "rownames(dtm_df_novel) <- novel_names\n", "\n", "# Print the resulting data frame\n", "tail(dtm_df_novel)\n", "```\n", "\n", "``` r\n", "# Just as we did above with the small data frame, we'll compute the\n", "# cosine distances for these texts...\n", "cos_dist_novel <- as.matrix(proxy::dist(dtm_df_novel, method = \"cosine\"))\n", "\n", "# ...and subtract them from 1 to get an n x n cosine similarity array\n", "n <- nrow(dtm_df_novel)\n", "cos_sim_array <- matrix(1 - as.vector(cos_dist_novel), n, n)\n", "\n", "# Round the cosine similarity matrix to two decimal places\n", "cos_sim_novel_rounded <- round(cos_sim_array, 2)\n", "\n", "# Print the rounded cosine similarity matrix\n", "print(cos_sim_novel_rounded)\n", "```\n", "\n", "``` r\n", "# Again, we'll make this a bit more readable\n", "cos_df <- data.frame(cos_sim_novel_rounded, row.names = novel_names, check.names = FALSE)\n", "\n", "# Set column names to the novel names\n", "colnames(cos_df) <- novel_names\n", "\n", "# Print the data frame\n", "head(cos_df)\n", "```\n", "\n", "``` r\n", "# Transform cosine similarity back into cosine distance\n", "cos_dist <- 1 - cos_sim_novel_rounded\n", "\n", "# Perform multidimensional scaling (MDS) down to two dimensions\n", "mds <- cmdscale(cos_dist, k = 2)\n", "\n", "# Extract x and y coordinates from the MDS output\n", "xs <- mds[, 1]\n", "ys <- mds[, 2]\n", "\n", "# Create a data frame with the x, y coordinates and the novel names\n", "mds_df <- data.frame(x = xs, y = ys, novel_names = novel_names)\n", "\n", "ggplot(mds_df, aes(x, y, label = novel_names)) +\n", "  geom_point(size = 4) +\n", "  geom_text(hjust = 0.6, vjust = 0.2, size = 4, angle = 45, nudge_y = 0.01) + # Rotate text and adjust y position\n", "  labs(title = \"MDS Visualization of Novel Differences\") +\n", "  theme_minimal() +\n", "  theme(\n", "    plot.title = element_text(size = 20, hjust = 0.6, margin = margin(b = 10)),\n", "    plot.margin = margin(5, 5, 5, 5, \"pt\"), # Adjust the margin around the plot\n", "    plot.background = element_rect(fill = \"white\"), # Set the plot background to white\n", "    plot.caption = element_blank(), # Remove the default caption\n", "    axis.text = element_text(size = 12), # Adjust the size of axis text\n", "    legend.text = element_text(size = 12), # Adjust the size of legend text\n", "    legend.title = element_text(size = 14) # Adjust the size of legend title\n", "  )\n", "```\n", "\n", "The above method has a broad range of applications, such as unsupervised\n", "clustering. Common techniques include K-Means Clustering and\n", "Hierarchical Dendrograms. These attempt to identify groups of texts with\n", "shared content, based on these kinds of distance measures; a quick\n", "K-Means sketch follows below.\n", "\n",
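"As a quick, purely illustrative sketch of the first approach, the chunk\n", "below runs K-Means on the two MDS coordinates computed above (assuming\n", "`mds_df` is still in memory). With only six novels, the choice of three\n", "clusters is arbitrary and just for demonstration.\n", "\n", "``` r\n", "# Group the six novels into three clusters based on their MDS coordinates\n", "set.seed(42)  # k-means starts from random centers\n", "km <- kmeans(mds_df[, c(\"x\", \"y\")], centers = 3)\n", "\n", "# See which novels land in which cluster\n", "split(mds_df$novel_names, km$cluster)\n", "```\n", "\n",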
"Here’s an example of a dendrogram based on these six novels:\n", "\n", "``` r\n", "# Assuming you have already calculated the \"cos_dist\" matrix and have the \"novel_names\" vector\n", "\n", "# Optional: widen the right margin so the labels are not cut off\n", "par(mar = c(5, 4, 2, 10))\n", "\n", "# Perform hierarchical clustering\n", "hclust_result <- hclust(as.dist(cos_dist), method = \"ward.D\")\n", "\n", "# Plot the dendrogram\n", "plot(hclust_result, hang = -1, labels = novel_names)\n", "```\n", "\n", "#### Vector Semantics\n", "\n", "We can also turn this logic on its head. Rather than produce vectors\n", "representing texts based on their words, we will produce vectors for the\n", "words based on their contexts.\n", "\n", "``` r\n", "# Transpose the DTM data frame\n", "transposed_dtm <- t(dtm_df_novel)\n", "\n", "# Display the last few rows of the transposed DTM\n", "tail(transposed_dtm)\n", "```\n", "\n", "Because the number of words is so large, for memory reasons we’re going\n", "to work with just the last few, shown above.\n", "\n", "- If you are running this locally, you may want to try this with more\n", "  words\n", "\n", "``` r\n", "# Keep just the last few words of the transposed DTM\n", "tail_transposed_dtm <- tail(transposed_dtm)\n", "\n", "dtm_matrix <- as.matrix(tail_transposed_dtm) # remove 'tail_' to use all words\n", "\n", "# proxy::dist() returns cosine distances...\n", "cos_sim_words <- proxy::dist(dtm_matrix, method = \"cosine\")\n", "\n", "# ...so we subtract them from 1 to get an n x n similarity array\n", "n <- nrow(dtm_matrix)\n", "cos_sim_words <- matrix(1 - as.vector(as.matrix(cos_sim_words)), n, n)\n", "\n", "# Print the result\n", "head(cos_sim_words)\n", "```\n", "\n", "``` r\n", "# In a more readable format\n", "\n", "cos_sim_words <- data.frame(round(cos_sim_words, 2))\n", "row.names(cos_sim_words) <- row.names(tail_transposed_dtm) # remove 'tail_' to use all words\n", "colnames(cos_sim_words) <- row.names(tail_transposed_dtm) # remove 'tail_' to use all words\n", "\n", "head(cos_sim_words)\n", "```\n", "\n", "Theoretically we could visualize and cluster these as well - but it\n", "would take a lot of computational power!\n", "\n", "We’ll instead turn to the machine learning version: word embeddings.\n", "\n", "``` r\n", "# Check the objects in memory and delete the big ones\n", "\n", "sort(sapply(ls(), function(x) format(object.size(get(x)), units = 'auto')))\n", " \n", "rm(cos_sim_words, cos_sim_array, text_corpus, dtm_df_novel)\n", " \n", "sort(sapply(ls(), function(x) format(object.size(get(x)), units = 'auto')))\n", "```\n", "\n", "## Exercise #3: Using Word2vec with 150 English Novels\n", "\n", "In this exercise, we’ll use an English-language subset from a dataset\n", "about novels created by [Andrew\n", "Piper](https://www.mcgill.ca/langlitcultures/andrew-piper). Specifically,\n", "we’ll look at 150 novels by British and American authors spanning the\n", "years 1771-1930. These texts reside on disk, each in a separate\n", "plaintext file. Metadata is contained in a spreadsheet distributed with\n", "the novel files.\n", "\n", "#### Metadata Columns\n", "\n", "