{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 4.4 - Advanced - Word Embeddings (Python)\n", "\n", "*Python Version*\n", "\n", "\n", "\n", "*This notebook was prepared by Laura Nelson in collaboration with [UBC\n", "COMET](https://comet.arts.ubc.ca/) team members: Jonathan Graves, Angela\n", "Chen and Anneke Dresselhuis*\n", "\n", "## Prerequisites\n", "\n", "1. Some familiarity programming in R\n", "2. Some familarity with natural language processing\n", "3. No computational text experience necessary!\n", "\n", "## Learning outcomes\n", "\n", "In the notebook you will\n", "\n", "1. Familiarize yourself with concepts such as word embeddings (WE)\n", " vector-space model of language, natural language processing (NLP)\n", " and how they relate to small and large language models (LMs)\n", "2. Import and pre-process a textual dataset for use in word embedding\n", "3. Use word2vec to build a simple language model for examining patterns\n", " and biases textual datasets\n", "4. Identify and select methods for saving and loading models\n", "5. Use critical and reflexive thinking to gain a deeper understanding\n", " of how the inherent social and cultural biases of language are\n", " reproduced and mapped into language computation models\n", "\n", "## Outline\n", "\n", "The goal of this notebook is to demystify some of the technical aspects\n", "of language models and to invite learners to start thinking about how\n", "these important tools function in society.\n", "\n", "In particular, this lesson is designed to explore features of word\n", "embeddings produced through the word2vec model. The questions we ask in\n", "this lesson are guided by Ben Schmidt’s blog post, [Rejecting the Gender\n", "Binary](%22http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html).\n", "\n", "The primary corpus we will use consists of the\n", "150 English-language novels made\n", "available by the .txtLab at McGill University. We also look at\n", "a word2Vec model trained\n", "on the ECCO-TCP corpus of 2,350 eighteenth-century literary texts\n", "made available by Ryan Heuser. (Note that the number of terms in the\n", "model has been shortened by half in order to conserve memory.)\n", "\n", "## Key Terms\n", "\n", "Before we dive in, feel free to familiarize yourself with the following\n", "key terms and how they relate to each other." 
], "id": "ca749a74-a7d2-437a-9b23-158c952dfb33" }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/html" }, "source": [ "" ], "id": "eeb5fdb3-0083-4899-9884-225412b75ff5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ], "id": "89ecd32c-7911-4194-a9f3-23ea9fc0c734" }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/html" }, "source": [ "" ], "id": "7cafc0de-a64f-4afb-9d80-01ec4cefbe5f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Artificial Intelligence (AI):** is a broad category that includes the\n", "study and development of computer systems that can copy intelligent\n", "human behaviour (adapted from [*Oxford Learners\n", "Dictionary*](https://www.oxfordlearnersdictionaries.com/definition/english/ai#:~:text=%2F%CB%8Ce%C9%AA%20%CB%88a%C9%AA%2F-,%2F%CB%8Ce%C9%AA%20%CB%88a%C9%AA%2F,way%20a%20human%20brain%20does.))\n", "\n", "**Machine Learning (ML):** is branch of AI which is uses statistical\n", "methods to imitate the way that humans learn (adapted from\n", "[*IBM*](https://www.ibm.com/topics/machine-learning))\n", "\n", "**Natural Language Processing (NLP):** is branch of AI which focuses on\n", "training computers to interpret human text and spoken words (adapted\n", "from\n", "[*IBM*](https://www.ibm.com/topics/natural-language-processing#:~:text=the%20next%20step-,What%20is%20natural%20language%20processing%3F,same%20way%20human%20beings%20can.))\n", "\n", "**Word Embeddings (WE):** is one part of NLP where human words are\n", "converted into numerical representations (usually vectors) in order for\n", "computers to be able to understand them (adapted from\n", "[*Turing*](https://www.turing.com/kb/guide-on-word-embeddings-in-nlp))\n", "\n", "**word2vec:** is an NLP technique that is commonly used to generate word\n", "embeddings\n", "\n", "## What are Word Embeddings?\n", "\n", "Building off of the definition above, word embeddings are one way that\n", "humans can represent language in a way that is legible to a machine.\n", "More specifically, word embeddings are an NLP approach that use vectors\n", "to store textual data in multiple dimensions; by existing in the\n", "multi-dimensional space of vectors, word embeddings are able to include\n", "important semantic information within a given numeric representation.\n", "\n", "For example, if we are trying to answer a research question about how\n", "popular a term is on the web at a given time, we might use a simple word\n", "frequency analysis to count how many times the word “candidate” shows up\n", "in tweets during a defined electoral period. 
"However, if we wanted to gain a more nuanced understanding of what kind\n", "of language, biases or attitudes contextualize the term “candidate” in\n", "discourse, we would need to use a method like word embedding to encode\n", "meaning into our understanding of how people have talked about\n", "candidates over time. Instead of describing our text as a series of word\n", "counts, we would treat our text like coordinates in space, where similar\n", "words and concepts are closer to each other, and words that are\n", "different from each other are further away.\n", "\n", "\n", "\n", "For example, in the visualization above, a word frequency count returns\n", "the number of times the word “candidate” or “candidates” is used in a\n", "sample text corpus. When a word embedding is made from the same text\n", "corpus, we are able to map related concepts and phrases that are closely\n", "related to “candidate” as neighbours, while other words and phrases such\n", "as “experimental study” (which refers to the research paper in question,\n", "and not to candidates specifically) are further away.\n", "\n", "Here is another example of how different but related words might be\n", "represented in a word embedding: \n", "\n", "## Making a Word Embedding\n", "\n", "So, how do word embeddings work? To make a word embedding, an input word\n", "gets compressed into a dense vector.\n", "\n", "\n", "\n", "The magic and mystery of the word embedding process is that often the\n", "vectors produced by the model embed qualities of a word or phrase\n", "that are not interpretable by humans. However, for our purposes, having\n", "the text in vector format is all we need. With this format, we can\n", "perform tests like cosine similarity and other kinds of operations. Such\n", "operations can reveal many different kinds of relationships between\n", "words, as we’ll examine a bit later.\n", "\n",
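"To make this concrete, here is a small toy illustration of what a dense\n", "vector looks like and how cosine similarity compares two word vectors.\n", "The numbers below are invented for demonstration purposes only; they are\n", "not the output of a trained model, which would typically use hundreds of\n", "dimensions.\n", "\n", "``` python\n", "# Toy example: made-up 4-dimensional 'embeddings' for three words\n", "import numpy as np\n", "\n", "toy_vectors = {\n", "    'bacon':   np.array([0.9, 0.1, 0.0, 0.3]),\n", "    'sausage': np.array([0.8, 0.2, 0.1, 0.4]),\n", "    'ballot':  np.array([0.0, 0.9, 0.7, 0.1]),\n", "}\n", "\n", "def cosine(u, v):\n", "    # Cosine similarity: dot product divided by the product of the vector lengths\n", "    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))\n", "\n", "print(cosine(toy_vectors['bacon'], toy_vectors['sausage']))  # high: related words\n", "print(cosine(toy_vectors['bacon'], toy_vectors['ballot']))   # low: unrelated words\n", "```\n", "\n", "In a real embedding model the vectors are not hand-crafted like this;\n", "they are learned from the contexts in which words appear, which is what\n", "word2vec does in the next section.\n", "\n",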
"## Using word2vec\n", "\n", "Word2vec is one NLP technique that is commonly used to generate word\n", "embeddings. More precisely, word2vec is an algorithmic learning tool\n", "rather than a specific neural net that is already trained. The example\n", "we will be working through today has been made using this tool.\n", "\n", "The series of algorithms inside of the word2vec model try to describe\n", "and acquire parameters for a given word in terms of the words that\n", "appear immediately to its right and left in actual sentences.\n", "Essentially, it learns how to predict text.\n", "\n", "Without going too deep into the algorithm, suffice it to say that it\n", "involves a two-step process:\n", "\n", "1. First, the input word gets compressed into a dense vector, as seen\n", "   in the simplified diagram, “Creating a Word Embedding,” above.\n", "2. Second, the vector gets decoded into the set of context words.\n", "   Keywords that appear within similar contexts will have similar\n", "   vector representations in between steps.\n", "\n", "Imagine that each word in a novel has its meaning determined by the ones\n", "that surround it in a limited window. For example, in Moby Dick’s first\n", "sentence, “me” is flanked on either side by “Call” and “Ishmael.” After\n", "observing the windows around every word in the novel (or many novels),\n", "the computer will notice that “me” tends to fall between pairs of words\n", "similar to those that surround “her,” “him,” or “them.” Of course, the\n", "computer goes through the same process for the words “Call” and\n", "“Ishmael,” for which “me” is reciprocally part of their contexts. This\n", "chaining of signifiers to one another mirrors some of humanists’ most\n", "sophisticated interpretative frameworks of language.\n", "\n", "The two main model architectures of word2vec are **Continuous Bag of\n", "Words (CBOW)** and **Skip-Gram**, which can be distinguished partly by\n", "their input and output during training.\n", "\n", "**CBOW** takes the context words (for example, “Call”, “Ishmael”) as a\n", "single input and tries to predict the word of interest (“me”).\n", "\n", "\n", "\n", "**Skip-Gram** does the opposite, taking a word of interest as its input\n", "(for example, “me”) and trying to learn how to predict its context words\n", "(“Call”, “Ishmael”).\n", "\n", "\n", "\n", "In general, CBOW is faster and does well with frequent words, while\n", "Skip-Gram potentially represents rare words better.\n", "\n", "Since the word embedding is a vector, we are able to perform tests like\n", "cosine similarity (which we’ll learn more about in a bit!) and other\n", "kinds of operations. Those operations can reveal many different kinds of\n", "relationships between words, as we shall see.\n", "\n",
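"To see how these pieces fit together in code, here is a minimal sketch\n", "of training, querying, saving and loading a word2vec model with the\n", "gensim library. This is only an illustration: it assumes gensim (version\n", "4 or later) is installed, the three toy sentences are invented, and the\n", "file name is arbitrary. Exercise #3 below works with a real corpus of\n", "novels instead.\n", "\n", "``` python\n", "# A minimal sketch of word2vec training with gensim (assumes gensim >= 4.0;\n", "# older versions named `vector_size` as `size` and `epochs` as `iter`)\n", "from gensim.models import Word2Vec\n", "\n", "# word2vec expects a list of tokenized sentences\n", "toy_sentences = [\n", "    ['call', 'me', 'ishmael'],\n", "    ['call', 'her', 'jane'],\n", "    ['call', 'him', 'george'],\n", "]\n", "\n", "model = Word2Vec(\n", "    sentences=toy_sentences,\n", "    vector_size=50,  # number of dimensions in each word vector\n", "    window=2,        # number of context words to observe in each direction\n", "    min_count=1,     # minimum frequency for words included in the model\n", "    sg=1,            # 1 = Skip-Gram, 0 = CBOW\n", ")\n", "\n", "# Every word in the vocabulary now has a dense vector, and we can query it\n", "print(model.wv['me'][:5])\n", "print(model.wv.most_similar('me', topn=2))\n", "\n", "# Saving and re-loading a trained model\n", "model.save('toy_word2vec.model')\n", "reloaded = Word2Vec.load('toy_word2vec.model')\n", "```\n", "\n", "With a corpus this tiny the similarities are meaningless; the point is\n", "only to show how the pieces fit together.\n", "\n",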
"## Bias and Language Models\n", "\n", "You might already be piecing together that the encoding of meaning in\n", "word embeddings is entirely shaped by the patterns of language use\n", "captured in the training data. That is, what is included in a word\n", "embedding directly reflects the complex social and cultural biases of\n", "everyday human language - in fact, exploring how these biases function\n", "and change over time (as we will do later) is one of the most\n", "interesting ways to use word embeddings in social research.\n", "\n", "#### It is simply impossible to have a bias-free language model (LM).\n", "\n", "In LMs, bias is not a bug or a glitch; rather, it is an essential\n", "feature that is baked into their fundamental structure. LMs inevitably\n", "learn and absorb the pejorative dimensions of language, which in turn\n", "can result in reproducing harmful correlations of meaning for words\n", "about race, class or gender (among others). When unchecked, these harms\n", "can be “amplified in downstream applications of word embeddings”\n", "([Arseniev-Koehler & Foster, 2020,\n", "p. 1](https://osf.io/preprints/socarxiv/b8kud/)).\n", "\n", "Just like any other computational model, it is important to critically\n", "engage with the source and context of the training data. One way that\n", "[Schiffers, Kern and Hienert (2023)](https://arxiv.org/abs/2302.06174v1)\n", "suggest doing this is by using domain-specific models. Working with\n", "models that understand the nuances of your particular topic or field can\n", "better account for the “specialized vocabulary and semantic\n", "relationships” involved, which helps make applications of WE more\n", "effective.\n", "\n", "## Preparing for our Analysis\n", "\n", "#### Word2vec Features\n", "\n", "**Here are a few features of the word2vec tool that we can use to\n", "customize our analysis:**\n", "\n", "- `size`: Number of dimensions for word embedding model\n", "\n", "- `window`: Number of context words to observe in each direction\n", "\n", "- `min_count`: Minimum frequency for words included in model\n", "\n", "- `sg` (Skip-Gram): ‘0’ indicates CBOW model; ‘1’ indicates Skip-Gram\n", "\n", "- `alpha`: Initial learning rate; prevents the model from\n", "  over-correcting and enables finer tuning\n", "\n", "- `iterations`: Number of passes through the dataset\n", "\n", "- `batch size`: Number of words to sample from the data during each pass\n", "\n", "Note: if not specified, the script uses the default value for each\n", "argument.\n", "\n", "**Some limitations of the word2vec Model**\n", "\n", "- Within word2vec, common articles or conjunctions such as “the” and\n", "  “and”, called **stop words**, may not provide very rich contextual\n", "  information for a given word, and may need additional subsampling or\n", "  to be combined into a word phrase (Anwla, 2019).\n", "- word2vec does not always handle out-of-vocabulary words well\n", "  (Chandran, 2021).\n", "\n", "Let’s begin our analysis!\n", "\n", "## Exercise #1: Eggs, Sausages and Bacon\n", "\n", "\n", "\n", "To begin, we are going to load a few packages that are necessary for our\n", "analysis. Please run the code cells below.\n", "\n", "``` python\n", "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "plt.style.use('ggplot')\n", "```\n", "\n", "#### Create a Document-Term Matrix (DTM) with a Few Pseudo-Texts\n", "\n", "To start off, we’re going to create a mini dataframe based on the use of\n", "the words “eggs,” “sausage” and “bacon” found in three different\n", "novels: A, B and C.\n", "\n", "``` python\n", "# dataframes!\n", "import pandas\n", "\n", "# Construct dataframe with three novels each containing three words\n", "columns = ['eggs','sausage','bacon']\n", "indices = ['Novel A', 'Novel B', 'Novel C']\n", "dtm = [[50,60,60],[90,10,10], [20,70,70]]\n", "dtm_df = pandas.DataFrame(dtm, columns = columns, index = indices)\n", "\n", "# Show dataframe\n", "dtm_df\n", "```\n", "\n", "### Visualize\n", "\n", "``` python\n", "# Plot our points\n", "plt.scatter(dtm_df['eggs'], dtm_df['sausage'])\n", "\n", "# Make the graph look good\n", "plt.xlim([0, 100]), plt.ylim([0, 100])\n", "plt.xlabel('eggs'), plt.ylabel('sausage')\n", "plt.show()\n", "```\n", "\n", "### Vectors\n", "\n", "At a glance, a couple of points are lying closer to one another. We used\n", "the word frequencies of just two of the three words (eggs and sausage)\n", "in order to plot our texts in a two-dimensional plane.\n",
"The term frequency “summaries” of Novel A & Novel C are pretty\n", "similar to one another: they both share a major concern with “sausage”,\n", "whereas Novel B seems to focus primarily on “eggs.”\n", "\n", "This raises a question: how can we operationalize our intuition that the\n", "spatial distance presented here expresses topical similarity?\n", "\n", "## Cosine Similarity\n", "\n", "The most common way of measuring how close two of these points are is their\n", "[Cosine\n", "Similarity](https://en.wikipedia.org/wiki/Cosine_similarity). Cosine\n", "similarity can operate on textual data represented as word vectors and\n", "allows us to identify how similar documents are to each other, for\n", "example. Cosine Similarity thus helps us understand how much content\n", "overlap a set of documents has with one another. For example, imagine\n", "that we were to draw an arrow from the origin of the graph - point\n", "(0,0) - to the dot representing each text. This arrow is called a\n", "*vector*.\n", "\n", "Mathematically, the cosine similarity of two vectors $A$ and $B$ is:\n", "\n", "$$\\cos(\\theta) = \\frac{A \\cdot B}{\\|A\\| \\, \\|B\\|}$$\n", "\n", "Using our example above, we can see that the angle from (0,0) between\n", "Novel C and Novel A (orange triangle) is smaller than between Novel A\n", "and Novel B (navy triangle) or between Novel C and Novel B (both\n", "triangles together).\n", "\n", "\n", "\n", "Because this similarity measurement uses the cosine of the angle between\n", "vectors, the magnitude is not a matter of concern (this feature is\n", "really helpful for text vectors that can often be really long!).\n", "Instead, for word-count vectors, which are never negative, cosine\n", "similarity yields a value between 0 and 1 (we don’t have to work with\n", "something confusing like 18º!) that can be easily interpreted and\n", "compared - and thus we can also avoid the troubles associated with other\n", "dimensional distance measures such as\n", "[Euclidean Distance](https://en.wikipedia.org/wiki/Euclidean_distance).\n", "\n", "### Calculating Cosine Distance\n", "\n", "``` python\n", "# Although we ultimately want the cosine distance, it is simpler to start\n", "# from its complement, cosine similarity: distance = 1 - similarity\n", "\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "```\n", "\n", "``` python\n", "# Compute the pairwise cosine similarities between the three novels\n", "\n", "cos_sim = cosine_similarity(dtm_df)\n", "```\n", "\n", "``` python\n", "# Make it a little easier to read by rounding the values\n", "\n", "frame_2 = np.round(cos_sim, 2)\n", "\n", "# Label the dataframe rows and columns with the novel names\n", "\n", "frame_2 = pandas.DataFrame(frame_2, columns = indices, index = indices)\n", "frame_2\n", "```\n", "\n", "*From this output table, which novels appear to be more similar to each\n", "other?*\n", "\n",
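"As a quick check on what `cosine_similarity` is doing, we can reproduce\n", "one entry of this table by hand using the formula above. This sketch\n", "assumes the `dtm_df` and `frame_2` objects from the previous cells are\n", "still in memory.\n", "\n", "``` python\n", "# Verify one cell of the table: the similarity between Novel A and Novel C\n", "a = dtm_df.loc['Novel A'].to_numpy()\n", "c = dtm_df.loc['Novel C'].to_numpy()\n", "\n", "by_hand = np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c))\n", "\n", "print(np.round(by_hand, 2))               # computed directly from the formula\n", "print(frame_2.loc['Novel A', 'Novel C'])  # value reported by scikit-learn\n", "```\n", "\n", "The two numbers should match.\n", "\n",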
"## Exercise #2: Working with 18th Century Literature\n", "\n", "\n", "\n", "``` python\n", "# Compare the distance between novels\n", "\n", "filelist = ['txtlab_Novel450_English/EN_1850_Hawthorne,Nathaniel_TheScarletLetter_Novel.txt',\n", "            'txtlab_Novel450_English/EN_1851_Hawthorne,Nathaniel_TheHouseoftheSevenGables_Novel.txt',\n", "            'txtlab_Novel450_English/EN_1920_Fitzgerald,FScott_ThisSideofParadise_Novel.txt',\n", "            'txtlab_Novel450_English/EN_1922_Fitzgerald,FScott_TheBeautifulandtheDamned_Novel.txt',\n", "            'txtlab_Novel450_English/EN_1811_Austen,Jane_SenseandSensibility_Novel.txt',\n", "            'txtlab_Novel450_English/EN_1813_Austen,Jane_PrideandPrejudice_Novel.txt']\n", "\n", "novel_names = ['Hawthorne: Scarlet Letter',\n", "               'Hawthorne: Seven Gables',\n", "               'Fitzgerald: This Side of Paradise',\n", "               'Fitzgerald: Beautiful and the Damned',\n", "               'Austen: Sense and Sensibility',\n", "               'Austen: Pride and Prejudice']\n", "\n", "text_list = []\n", "\n", "for file in filelist:\n", "    with open(file, 'r', encoding = 'utf-8') as myfile:\n", "        text_list.append(myfile.read())\n", "\n", "# Import the function CountVectorizer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "cv = CountVectorizer(stop_words = 'english', min_df = 3, binary=True)\n", "\n", "novel_dtm = cv.fit_transform(text_list).toarray()\n", "feature_list = cv.get_feature_names_out()\n", "dtm_df_novel = pandas.DataFrame(novel_dtm, columns = feature_list, index = novel_names)\n", "dtm_df_novel\n", "```\n", "\n", "``` python\n", "cos_sim_novel = cosine_similarity(dtm_df_novel)\n", "cos_sim_novel = np.round(cos_sim_novel, 2)\n", "```\n", "\n", "``` python\n", "cos_df = pandas.DataFrame(cos_sim_novel, columns = novel_names, index = novel_names)\n", "cos_df\n", "```\n", "\n", "``` python\n", "# Visualizing differences\n", "\n", "from sklearn.manifold import MDS\n", "\n", "# Two components as we're plotting points in a two-dimensional plane\n", "# \"Precomputed\" because we provide a distance matrix\n", "# We will also specify `random_state` so that the plot is reproducible.\n", "\n", "# Transform cosine similarity to cosine distance\n", "cos_dist = 1 - cosine_similarity(dtm_df_novel)\n", "\n", "mds = MDS(n_components=2, dissimilarity=\"precomputed\", random_state=1, normalized_stress=\"auto\")\n", "\n", "pos = mds.fit_transform(cos_dist)  # shape (n_samples, n_components)\n", "xs, ys = pos[:, 0], pos[:, 1]\n", "\n", "for x, y, name in zip(xs, ys, novel_names):\n", "    plt.scatter(x, y)\n", "    plt.text(x, y, name)\n", "\n", "plt.show()\n", "```\n", "\n", "The above method has a broad range of applications, such as unsupervised\n", "clustering. Common techniques include K-Means Clustering and\n", "Hierarchical Dendrograms. These attempt to identify groups of texts with\n", "shared content, based on these kinds of distance measures.\n", "\n", "Here’s an example of a dendrogram based on these six novels:\n", "\n", "``` python\n", "from scipy.cluster.hierarchy import ward, dendrogram\n", "from scipy.spatial.distance import squareform\n", "\n", "# ward() expects a condensed distance matrix, so convert the square matrix first\n", "linkage_matrix = ward(squareform(cos_dist, checks=False))\n", "\n", "dendrogram(linkage_matrix, orientation=\"right\", labels=novel_names)\n", "\n", "plt.tight_layout()  # fixes margins\n", "\n", "plt.show()\n", "```\n", "\n",
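"For comparison, here is a minimal sketch of the K-Means option mentioned\n", "above, applied to the two-dimensional MDS coordinates we just computed.\n", "The choice of three clusters (one per author) is an assumption made for\n", "illustration, not part of the original analysis.\n", "\n", "``` python\n", "# Group the novels with K-Means using the 2-D MDS coordinates from above\n", "from sklearn.cluster import KMeans\n", "\n", "kmeans = KMeans(n_clusters=3, random_state=1, n_init=10)\n", "labels = kmeans.fit_predict(pos)\n", "\n", "# Print each novel with its assigned cluster number\n", "for name, label in zip(novel_names, labels):\n", "    print(label, name)\n", "```\n", "\n",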
"#### Vector Semantics\n", "\n", "We can also turn this logic on its head. Rather than produce vectors\n", "representing texts based on their words, we will produce vectors for the\n", "words based on their contexts.\n", "\n", "``` python\n", "# Turn our DTM sideways\n", "\n", "dtm_df_novel.T.head()\n", "```\n", "\n", "``` python\n", "# Find the cosine similarities between pairs of word-vectors\n", "\n", "cos_sim_words = cosine_similarity(dtm_df_novel.T)\n", "```\n", "\n", "``` python\n", "# In readable format\n", "\n", "np.round(cos_sim_words, 2)\n", "```\n", "\n", "Theoretically you could visualize and cluster these as well - but this\n", "takes a lot of computational power!\n", "\n", "We’ll thus turn to the machine learning version: word embeddings.\n", "\n", "``` python\n", "# Clean up memory\n", "import sys\n", "\n", "# These are the usual ipython objects, including this one you are creating\n", "ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']\n", "\n", "# Get a sorted list of the remaining objects and their sizes\n", "sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)\n", "\n", "del cos_sim_words\n", "del dtm_df_novel\n", "del novel_dtm\n", "del feature_list\n", "```\n", "\n", "At this point you should restart your kernel if you have less than 4 GB\n", "of memory available.\n", "\n", "- Do this by clicking on the “Kernel” menu and hitting “restart”\n", "\n", "## Exercise #3: Using word2vec with 150 English Novels\n", "\n", "In this exercise, we’ll use an English-language subset from a dataset\n", "about novels created by [Andrew\n", "Piper](https://www.mcgill.ca/langlitcultures/andrew-piper). Specifically\n", "we’ll look at 150 novels by British and American authors spanning the\n", "years 1771-1930. These texts reside on disk, each in a separate\n", "plaintext file. Metadata is contained in a spreadsheet distributed with\n", "the novel files.\n", "\n", "#### Metadata Columns\n", "\n", "