{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 3.3 - Advanced - Classification and Clustering\n", "\n", "- **Authors**: COMET Team (Colby Chambers, Jonathan Graves)\n", "- **Last Update**: 18 October 2022\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Jupyter\n", "- Introduction to R\n", "- Introduction to Visualization\n", "- Central Tendency\n", "\n", "### Learning Outcomes\n", "\n", "After completing this notebook, you will be able to:\n", "\n", "- Understand clustering and its purpose through the common method of\n", " K-means clustering.\n", "- Apply K-means clustering to predict rates of recidivism.\n", "\n", "### References\n", "\n", "- James, G., Witten, D., Hastie, T., & Tibshirani, R. *An Introduction\n", " to Statistical Learning: With Applications in R.* (2nd Ed.),\n", " Springer Texts in Statistics, 2021. https://www.statlearning.com/\n", "- Angwin, J., Larson, S., Mattu, S., & Kirchner, L. (23 May 2016)\n", " Machine Bias. *Propublica*.\n", " https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.\n", " Retrieved 18 October 2022. [Link to Propublica\n", " Data](https://github.com/propublica/compas-analysis)\n", "- StatQuest: K-means clustering.\n", " https://www.youtube.com/watch?v=4b5d3muPQmA&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=40\n", "- Cluster-then-predict for classification tasks.\n", " https://towardsdatascience.com/cluster-then-predict-for-classification-tasks-142fdfdc87d6\n", "- Ideal Choice for K. https://www.guru99.com/r-k-means-clustering.html\n", "\n", "## Introduction\n", "\n", "Many statistical models deal exclusively with data that is quantitative\n", "(numerical) in nature. For example, a comparison of means ($t$-test)\n", "might evaluate the difference in *average* incomes of two groups: a\n", "quantitative measure. However, many questions of interest involve trying\n", "to predict *qualitative* outcomes: will a person be arrested or not?\n", "Which university degree will they pursue? Answering these kinds of\n", "questions requires us to predict the qualities an individual will have,\n", "which in statistics is called **classification** (the process of placing\n", "observations into distinct categories based on certain traits).\n", "\n", "To understand classification, it helps to first look at a numerical\n", "example with some simulated data. Run the code cell below to see an\n", "example." ], "id": "d28eb7f1-a3dd-4373-91db-6eedc998328f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(123)\n", "\n", "source('advanced_classification_and_clustering_source.r')\n", "\n", "# creating a random data set \n", "dataset <- simulate_data3(c(1,1),c(1.5,2),c(2,3))\n", "\n", "# plotting the data points\n", "ggplot(dataset, aes(x = x, y = y)) + geom_point()" ], "id": "9d1ff601-682f-4e94-bc36-1856cde1179e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we can see that our observations can be roughly classified\n", "in terms of values of $y$ centered around 1, 2, and 3 (or potentially\n", "“low”, “medium” and “high” if these can be categorized in this way). We\n", "can make this classification even clearer with appropriate colours and\n", "linear boundaries separating our clusters." 
], "id": "662d0b29-5066-4454-b2cf-19716ded196b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# graphing our clusters with colour and linear demarcations\n", "ggplot(dataset, aes(x = x, y = y)) + geom_point(col = dataset$color) +\n", " geom_segment(x = 2.5, y = 0.8, xend = 0.5, yend = 2, linetype = \"dashed\") +\n", "geom_segment(x = 0, y = 5.7, xend = 3.4, yend = -0.4, linetype = \"dashed\")" ], "id": "94d50b81-d967-48da-a549-ea5309e6f80f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an early example of categorizing or classifying data. In this\n", "case, we found groups within our data manually, based on simply looking\n", "at the distribution of data points. We were even able to separate our\n", "data using lines - again a manual process.\n", "\n", "Often, however, our observations cannot be easily classified using a\n", "linear boundary that we can eye-ball. Instead, we may need to group our\n", "observations using more complicated functions. Even worse, sometimes we\n", "cannot observe how observations should be grouped by looking at them at\n", "all; observing the categorization of the data is part of the observation\n", "itself, making this an **unsupervised** classification task.\n", "\n", "We typically like to classify data using a more systematic approach. The\n", "process of finding groups, and then classifying observations as members\n", "of these groups, is called **clustering**. Once we have clustered our\n", "data, we can then interpret these clusters for meaning. Let’s look at\n", "one of the most common methods of clustering used in machine learning\n", "below.\n", "\n", "## $K$-means Clustering\n", "\n", "One very popular approach to clustering is called **$K$-means\n", "clustering**. This approach is centered on the idea that “clusters” of\n", "similar observations should be close to one another in terms of their\n", "observable characteristics. This means that if we picture our clusters\n", "graphically, observations in the same cluster lie in a similar region in\n", "terms of the relevant observables we are measuring. The $K$-means\n", "approach relies on the following step-by-step, iterative process:\n", "\n", "1. Choose a value for $K$ (the number of clusters you want, a\n", " deceptively simple choice that we will come back to later).\n", "2. Randomly select $K$ unique data points within your space of\n", " observations (from now on called cluster points).\n", "3. Assign every data point to the nearest cluster point in Euclidean\n", " distance (creating $K$ large groups of points).\n", "4. Calculate the mean point of each cluster group and redefine this\n", " mean point as the new clustering point (results in $K$ new cluster\n", " points).\n", "5. Repeat 3-4 until all data points remain in the same cluster as the\n", " previous iteration (so that no data points move to new clusters).\n", "\n", "We can see the following steps in an example below by using the `kmeans`\n", "function available to us in base R. This time, to demonstrate the\n", "strength of the algorithm, we will use a set of observations which\n", "cannot be easily categorized from a simple glance." 
], "id": "bf2db242-6b56-4579-b691-60619c3a281e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(123)\n", "\n", "# creating a new and less easily classifiable set of data\n", "dataset2 <- simulate_data2(c(1,1), c(1.65,1.55))\n", "\n", "# visualizing the data\n", "ggplot(dataset2, aes(x = x, y = y)) + geom_point()\n", "ggplot(dataset2, aes(x = x, y = y)) + geom_point(color = dataset2$color)" ], "id": "6f344dfa-58e4-4622-b9a5-d0a9d2738926" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the above data are not as easily classifiable as before.\n", "The `kmeans` function will now run the K-means clustering algorithm for\n", "us to cluster these 100 data points into $K$ groups. For now, we will\n", "choose to use $K = 2$ as our number of initial cluster points (number of\n", "eventual clusters). Remember, the algorithm will first choose the\n", "centers randomly within the dataset, then iterate." ], "id": "562bc5aa-4ad4-424f-9375-8cf92d2ce29b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(123)\n", "\n", "dataset3 <- within(dataset2, rm(color, cluster))\n", "# running the kmeans function to cluster our data\n", "basic_clusters <- kmeans(dataset3, 2)\n", "basic_clusters\n", "\n", "# visualizing the clustering of our data\n", "ggplot(dataset3, aes(x = x, y = y)) + geom_point(col = basic_clusters$cluster)" ], "id": "f8380f82-766e-4390-b905-b8c41633a9fb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We used the `$cluster` access above to assign colour to our data points,\n", "since this access assigns a value of 1 or 2 to each data point in every\n", "iteration depending on which of the current clusters it is in.\n", "\n", "From the above, we can look at some useful properties of the\n", "*basic_clusters* object we have created through use of the `kmeans`\n", "function. Firstly, the algorithm’s iterative process led to clusters of\n", "51 and 49 observations respectively. We can also see the suggested\n", "location of the centers for the cluster. Let’s visualize this as well:" ], "id": "0356c4b2-439f-41c0-a2fd-e13fe71b4d8c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualizing our same data with final cluster points indicated\n", "ggplot(dataset3, aes(x = x, y = y)) + geom_point(col = basic_clusters$cluster) + \n", " geom_point(data = data.frame(basic_clusters$center), aes(x = x, y = y), col = c(\"black\", \"red\"), size = 4) # new part for bolded points" ], "id": "0ee158d6-48e2-419a-b346-ba035859f71a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the help menu (i.e. `kmeans?`) to see some of the additional values\n", "of the `kmeans` output that are available for analysis, such as the\n", "total variance, within cluster variance, and between cluster variance.\n", "\n", "### How Close Did We Come?\n", "\n", "If you remember, we simulated this data - we actually know the answer\n", "for where the “center” of the two clusters should be! Let’s check:\n", "\n", "| | $x_1$ | $y_1$ | $x_2$ | $y_2$ |\n", "|---------|-------|-------|-------|-------|\n", "| Cluster | 1.01 | 1.03 | 1.60 | 1.58 |\n", "| Actual | 1.00 | 1.00 | 1.65 | 1.55 |\n", "| Error | 1% | 3% | 3% | 2% |\n", "\n", "Pretty close! We can also see which points matched and which ones\n", "didn’t." 
], "id": "7770fefd-6463-415d-bcf6-fda9fad1678e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplot(dataset3, aes(x = x, y = y)) + geom_point(col = basic_clusters$cluster - dataset2$cluster + 2)" ], "id": "82550f87-98d0-47f8-9ca4-80df13372e93" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Red points are the points which were correctly assigned to their group.\n", "The highlighted points are the ones the clustering algorithm got wrong:\n", "green points are ones which should have been in the lower group, but\n", "were assigned to the upper one. Black points are ones which should have\n", "been in the upper group, but were assigned to the lower one. There’s\n", "only 5 errors total, resulting in an accuracy rate of 95%. Pretty good!\n", "\n", "> **Think About It**: What do you think would happen if the clusters\n", "> were closer together? Further apart? You can test your intution by\n", "> changing the `mean` values in the cell earlier in this notebook (with\n", "> 1.55 and 1.65)\n", "\n", "## Key Issues in Clustering\n", "\n", "Our `kmeans` function above appeared to pretty cleanly classify our 100\n", "points into 2 groups. In applications, we can think of this as the\n", "algorithm taking the values of continuous variables for all available\n", "data points to create a categorical, or qualitative, variable with 2\n", "distinct values, indicative of the three clustered groups found among\n", "all of our data. In this way, the algorithm can allow us to “find”\n", "groupings within our data that are not even apparent to us at first\n", "glance.\n", "\n", "There are methods of clustering other than the $k$-means clustering\n", "technique, such as the hierarchical clustering technique mentioned\n", "earlier. However, the $k$-means approach is the most intuitive and by\n", "far most common technique used in machine learning to classify\n", "observations. Additionally, there are other versions of this algorithm\n", "which change how the cluster points (centers) are computed, such as\n", "using the `median` of all points within a cluster to find each cluster\n", "point; however, these approaches are conceptually similar to using the\n", "mean.\n", "\n", "Importantly, you may be wondering what the end of goal of clustering\n", "actually is. We used the $k$-means algorithm to group our 100\n", "observations into 2 clusters above, but how do we know whether this is a\n", "good classification? Are our results worthy of being presented, or is\n", "there a better way to cluster these points? Perhaps we can tweak our\n", "approach to get clusters which are compact, that is, clusters which\n", "don’t have wide variation from their mean cluster point. This is where\n", "that seemingly arbitrary choice of $K$ from earlier comes in.\n", "\n", "### Choosing $K$\n", "\n", "Perhaps the most important decision when doing k-means clustering is the\n", "selection of $K$, the number of clusters. Choice of this value, while it\n", "may seem arbitrary, is actually critical in ensuring that our clustering\n", "is accurate. The goal when choosing a value for $K$ is to minimize the\n", "sum of within-cluster variation across all clusters. This means creating\n", "$K$ clusters so that the individual points within each cluster are as\n", "close to the center point of that cluster as possible.\n", "\n", "An extremely bad value for $K$ is 1. 
With one cluster, there is actually\n", "no clustering occurring at all, so the total variance of all data points\n", "from their mean value is as large as possible. Increasing the value of\n", "$K$ allows for an increasing number of clusters, so that all available\n", "data points are crowded into increasingly small groups with consistently\n", "shrinking variances. From this, it may seem that the ideal value for $K$\n", "is $\\infty$, infinite clusters!\n", "\n", "However, this introduces the problem of **overfitting**. If we have an\n", "extremely large number of clusters, this means that our $k$-means\n", "algorithm is working incredibly hard to adapt to the specific set of\n", "points we have. Unfortunately, this means that it will perform\n", "substantially worse when new data is added. To put it simply, the\n", "machine has adapted so well to the specific data points we have that it\n", "cannot flexibly adjust for new data! As a result, the ideal choice of\n", "$K$ lies somewhere on $(1, \\infty)$. The question is, how do we find it?\n", "\n", "One very common approach for finding an optimal value for $K$ is to\n", "graph what is called an **Elbow Plot**. An Elbow Plot represents the\n", "relationship between the value of $K$ and the total within-cluster\n", "variance. This graph naturally decreases; as $K$ increases, the number\n", "of clusters is increasing and so the within-cluster variance is\n", "decreasing. However, it begins to exhibit diminishing marginal returns\n", "beyond a certain $K$, meaning that the benefits from a larger number of\n", "clusters (a decreasing total variance) begin to become smaller and\n", "smaller. It is at this point, where the diminishing marginal returns to\n", "$K$ set in, that we find our optimal $K$. Graphically, this is the point\n", "in our graph that looks like an “elbow”, hence the name.\n", "\n", "Let’s define a simple function below to create an Elbow Plot, then use\n", "it to find the optimal value of $K$ for our clustering of `dataset2`\n", "above." ], "id": "8ca449da-2de0-4712-b9cb-b50e277f5169" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## draw it!\n", "elbow_plot()" ], "id": "e19ee6e4-6f19-469a-a513-7dacb7bfa3e8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using an Elbow Plot to choose a value for $K$ is inherently subjective.\n", "However, we can approximate from the above graph that the optimal $K$ is\n", "likely one of 2, 3, or 4. Let’s choose 4, since this is where it most\n", "clearly looks like the graph is beginning to take on diminishing\n", "marginal returns." ], "id": "de0b9840-5a38-482f-9bb6-28f955d7e929" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(123)\n", "# running the kmeans function to cluster our data (now with k = 4)\n", "basic_clusters <- kmeans(dataset3, 4)\n", "\n", "# visualizing the clustering of our data\n", "ggplot(dataset3, aes(x = x, y = y)) + geom_point(col = basic_clusters$cluster)" ], "id": "682539e1-9852-499c-91b6-2aab28f37d6c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now see that our data has been clustered into four groups instead of\n", "two. Is this better? It’s hard to say! This kind of learning is called\n", "**unsupervised** because, in general, we don’t know what the right\n", "answer is. We know there are only two groups here, but only because we\n", "simulated the data.\n",
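"\n", "For reference, the `elbow_plot()` function we called above lives in this\n", "notebook’s source file. A minimal sketch of how such a plot can be built\n", "directly with `kmeans` (an illustration of the idea, not the actual\n", "source code) looks like this:\n", "\n", "``` r\n", "# total within-cluster sum of squares for K = 1, ..., 10\n", "wss <- sapply(1:10, function(k) kmeans(dataset3, k, nstart = 10)$tot.withinss)\n", "\n", "ggplot(data.frame(K = 1:10, wss = wss), aes(x = K, y = wss)) +\n", "  geom_line() + geom_point() +\n", "  labs(x = \"Number of clusters K\", y = \"Total within-cluster variation\")\n", "```\n", "\n", "(`nstart = 10` re-runs each clustering from ten random starting points\n", "and keeps the best result, smoothing out unlucky initializations.)"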
, "\n", "\n", "Generally, we don’t know the exact number of\n", "clusters that actually exist in our data.\n", "\n", "There is plenty of room for personal discretion. Sometimes you just have\n", "to use your best judgment when choosing a value for $K$.\n", "\n", "As a side note, we generated this Elbow Plot by adapting the code from\n", "Andrea Gustafsen in her article on $K$-Means clustering (listed in the\n", "References section above). Whenever you are struggling to create a more\n", "complicated function, looking for help on the internet is a great idea!\n", "Just be sure to be prudent when you’re reading others’ code so that you\n", "can apply it to your situation accordingly. Also be sure to cite/credit\n", "them appropriately.\n", "\n", "### Standardization\n", "\n", "Another important issue in K-means clustering is standardizing\n", "distances. Often, our variables are measured on different scales, and\n", "some observations take on values that are far from the rest. These\n", "**outliers**, and differences in scale more generally, can skew the\n", "calculation of our mean cluster point within each cluster and let one\n", "variable dominate the distance calculations. For this reason, we often\n", "standardize each variable to have a mean of 0 and a standard deviation\n", "of 1, so that every variable contributes comparably to the distances\n", "used by the algorithm. This often allows the algorithm to create more\n", "precise clusters. Luckily for us, R has the `scale` function that we can\n", "invoke to achieve this. Let’s use this function to standardize the data\n", "in our *dataset3* dataframe, then use our `kmeans` function again with\n", "our new value of $K = 4$ to create some new clusters." ], "id": "386a8382-fded-4d81-b3d0-f9a1433daa1b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(123)\n", "\n", "# standardizing all of our data points\n", "dataset3 <- dataset3 %>% mutate(x = scale(x), y = scale(y))\n", "\n", "# running our algorithm again\n", "basic_clusters <- kmeans(dataset3, 4)\n", "\n", "# generating our clusters\n", "ggplot(dataset3, aes(x = x, y = y)) + geom_point(col = basic_clusters$cluster)" ], "id": "83045ce8-2728-406b-90b4-1f0609cb96f1" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now our clusters look to be grouped more into top, middle, left and\n", "right regions. This reflects the fact that, once the variables are\n", "standardized, points that were formerly extreme carry less weight in the\n", "calculation of mean cluster points at each step in the algorithm,\n", "allowing us to arrive at final clusters that look more precise.\n", "\n", "While all of our data was relatively compact in this example, in the\n", "real world we often work with data containing extreme outliers. When\n", "looking at income, for instance, a few massive values can skew our\n", "K-means clustering process by distorting the mean value within each\n", "cluster at every step in our algorithm. In these cases, standardizing\n", "can be a good idea.\n", "\n", "## Application: Algorithmic Bias and Clustering\n", "\n", "So far in this module, we’ve worked with simulated data. However, the\n", "$k$-means clustering approach can be applied to real-world data to help\n", "us find groups within our observations and even make predictions. To see\n", "this more closely, we will work with data from COMPAS, an American risk\n", "assessment program used primarily to predict the rate of recidivism of\n", "convicted felons based on a host of personal characteristics. The data\n",
"below, cleaned and prepared by the authors of the following [GitHub\n", "repo](https://github.com/propublica/compas-analysis), has been retrieved\n", "from [ProPublica](https://www.propublica.org/), an American nonprofit\n", "newsroom specializing in investigative journalism. This data set looks\n", "specifically at arrests in Broward County, Florida, since Florida has a\n", "breadth of open records available and all detainees in the county must\n", "complete the COMPAS risk assessment survey.\n", "\n", "> **Reading**: before going further, [read the\n", "> article](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing)!\n", "\n", "Firstly, let’s import and prepare the data." ], "id": "d8c2dbbf-09d5-4ebf-9089-3e204907538e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reading in the data\n", "raw_data <- read.csv(\"compas-scores-two-years.csv\")\n", "\n", "# cleaning up the data\n", "raw_data <- clean_up_data(raw_data)\n", "\n", "# inspecting the data\n", "head(raw_data)" ], "id": "89f890ea-fea2-4f44-8a69-ad277168c9ff" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the COMPAS system, the idea is to predict who is likely to reoffend:\n", "the goal is to assign a person a rating of either **low**, **medium**,\n", "or **high** to represent their risk of recidivism. We don’t know exactly\n", "how the creators of COMPAS have done that since they have not specified\n", "their calculation mechanism, but we can apply the idea of clustering to\n", "see how they *might* have done it.\n", "\n", "Let’s do this by creating some dummies for the different categories,\n", "then creating three clusters.\n", "\n", "> *Note*: Technically, we should probably use the $k$-medoids or\n", "> $k$-modes algorithm here, but let’s run with $k$-means since this is\n", "> what we’ve learned!" ], "id": "a22ce388-2c1e-42ce-ba9b-1943b8a9e294" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(123)\n", "\n", "# make dummies and select the variables to cluster on\n", "cluster_data <- raw_data %>% select(age, is_recid, c_charge_degree, sex, priors_count)\n", "cluster_data <- make_dummies(cluster_data)\n", "\n", "# make the clusters\n", "recidivism_clusters <- kmeans(cluster_data, 3)\n", "\n", "# show the results\n", "centers <- data.frame(recidivism_clusters$centers)\n", "\n", "# adding some labels\n", "centers$cluster <- c(\"medium\", \"high\", \"low\")\n", "centers <- centers %>% mutate(cluster_id = as.factor(cluster))\n", "\n", "centers" ], "id": "4524876a-fbb0-48a5-b809-98b85d2c650d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, this has split the data into three groups, which differ\n", "in terms of their recidivism rate (`is_recid`).\n", "\n", "- Cluster 1 (“medium”) has a re-offense rate of about 48%\n", "- Cluster 2 (“high”) has a re-offense rate of about 55%\n", "- Cluster 3 (“low”) has a re-offense rate of about 34%\n", "\n", "The other variables reflect how these groups differ. We can see most of\n", "them are not very influential, except `age` (decreases as re-offense\n", "rate increases) and `priors_count` (increases and then decreases as\n", "re-offense rate increases!). However, look at the racial makeup of the\n", "three groups."
], "id": "d8e2482c-29e9-4268-aa77-ca83b197a80e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_data$cluster <- recidivism_clusters$cluster\n", "\n", "table <- raw_data %>%\n", " group_by(cluster) %>%\n", " summarize(\n", " black = mean(race == \"African-American\"),\n", " white = mean(race == \"Caucasian\"),\n", " other = mean(race == \"Other\")\n", " )\n", "\n", "table$cluster_name <- c(\"medium\", \"high\", \"low\")\n", "\n", "table" ], "id": "116635a0-f491-47ef-b6d1-8f50c0746fd2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplot(data = centers, aes(x = table$black, y = age, color = table$cluster_name)) + geom_point(size = 10) + \n", "labs(x = \"% Black\", y = \"Age\", color = \"Risk\")\n", "\n", "ggplot(data = centers, aes(x = table$black, y = priors_count, color = table$cluster_name)) + geom_point(size = 10) + \n", "labs(x = \"% Black\", y = \"Priors\", color = \"Risk\")" ], "id": "a1b868cd-3153-4d7b-aa03-2ef4379474f4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Being young and black makes you very likely to be in the “high risk”\n", "category - paradoxically, even if you have *few* prior offenses. This\n", "matches many of the paradoxical conclusions the ProPublica team found in\n", "their analysis:\n", "\n", "> James Rivelli \\[Caucasian\\], a 54-year old Hollywood, Florida, man,\n", "> was arrested two years ago for shoplifting seven boxes of Crest\n", "> Whitestrips from a CVS drugstore. Despite a criminal record that\n", "> included aggravated assault, multiple thefts and felony drug\n", "> trafficking, the Northpointe algorithm classified him as being at a\n", "> low risk of reoffending. \\[…\\] Less than a year later, he was charged\n", "> with two felony counts for shoplifting about \\$1,000 worth of tools\n", "> from Home Depot\n", "\n", "On the other hand, Brisha Borden, an 18-year old African American, with\n", "no prior offenses was rated a high-risk to re-offend.\n", "\n", "Based on our clustering analysis, can you see why?\n", "\n", "### Think Critically\n", "\n", "What is this algorithm picking up? It’s likely a complex combination of\n", "a couple of things:\n", "\n", "- Black individuals may be more likely to be arrested or criminally\n", " charged than white individuals, conditional on other relevant\n", " characteristics. This may be especially true when comparing young\n", " black individuals and old white individuals. This creates an\n", " algorithmic association with age and race, in addition to an\n", " associated between race and re-arrest. However, age has a strongly\n", " negative relationship with reoffense *and* a strong positive\n", " relationship with priors. Older people have less time to reoffend\n", " and have had more time to incur priors; this creates the paradoxical\n", " negative relationship described.\n", "\n", "In other words, the system is likely picking up **existing cultural**\n", "relationships, rather than any true causal relationship. 
ProPublica found:\n", "\n", "> \\[S\\]ignificant racial disparities … in forecasting who would\n", "> re-offend, the algorithm made mistakes with black and white defendants\n", "> at roughly the same rate but in very different ways.\n", ">\n", "> - The formula was particularly likely to falsely flag black\n", "> defendants as future criminals, wrongly labeling them this way at\n", "> almost twice the rate as white defendants.\n", "> - White defendants were mislabeled as low risk more often than black\n", "> defendants.\n", "\n", "This is called **algorithmic bias**: the algorithm is innately biased\n", "against black defendants. You will notice this is *despite the fact*\n", "that race was never used in the construction of the clusters. The bias\n", "notably arises from the relationship race has with *other factors* in\n", "the model.\n", "\n", "Moreover, it’s highly dependent on the algorithm used. Let’s try a\n", "different model (called a *linear probability* model):" ], "id": "2501f000-987a-432f-8966-879a217ba1c4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model <- lm(is_recid ~ c_charge_degree + race + age + priors_count + sex, data = raw_data)\n", "stargazer(model, type = \"text\")" ], "id": "101733f0-51dc-4bd2-be5e-0457a2e9bcb5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `raceAfrican-American` has a very small coefficient - this\n", "indicates that being black has very little impact on the prediction; in\n", "fact, the coefficient is not statistically significant at the 95% level.\n", "This model is not particularly biased against black individuals - and it\n", "correctly assigns a higher rating to people with more priors.\n", "\n", "This lesson illustrates the challenge of making predictions about\n", "individuals based on patterns in larger groups they belong to: it is\n", "likely that these predictive measures will misrepresent some\n", "individuals’ circumstances. This has applications to many debates beyond\n", "just how to predict recidivism. Affirmative action, which is often\n", "grounded in predicting the material circumstances of individuals by\n", "their racial background, is just one such example. When designing\n", "prediction mechanisms for individuals based on group patterns, these\n", "practical and moral concerns should be taken seriously.\n", "\n", "## Conclusion\n", "\n", "In this module, we looked at the process of clustering and how it can be\n", "used to classify observations. Specifically, we started with a general\n", "explanation of how clustering works, then worked more closely with the\n", "$K$-means clustering algorithm, the most common and basic clustering\n", "method available. We saw the importance of standardizing our\n", "observations and choosing the appropriate value for $K$ when using this\n", "model.\n", "\n", "Then, we applied what we learned from this algorithm to make predictions\n", "about rates of recidivism among various populations, comparing our\n", "predictions to actual recidivism rates and the accuracy of the COMPAS\n", "risk assessment tool. We have learned that - while clustering is a\n", "powerful tool - we need to think very critically about exactly what it\n", "is doing, and whether our model makes sense. Is it studying something\n", "fundamental, or is it just reinforcing existing biases and patterns?\n", "\n", "It is important to remember that the $K$-means clustering algorithm is\n", "just one of many clustering algorithms out there.\n", "
Its benefit lies in\n", "its simplicity. However, its main drawback is the requirement to choose\n", "a value for $K$, which can often be quite subjective. Other clustering\n", "methods exist which automatically find an optimal number of clusters for\n", "us. This is especially useful when we are doing unsupervised learning\n", "and looking for latent patterns in our data, patterns that we cannot see\n", "from just the observations themselves. If you want to look at brief\n", "overviews of some of these algorithms and their benefits/drawbacks,\n", "don’t hesitate to consult the [following\n", "resource](https://www.freecodecamp.org/news/8-clustering-algorithms-in-machine-learning-that-all-data-scientists-should-know/).\n", "\n", "### Addendum\n", "\n", "Some [food for\n", "thought](http://smbc-comics.com/comic/rise-of-the-machines).\n", "\n", "## Exercises\n", "\n", "### Exercise 1\n", "\n", "In this analysis, we only looked at black and white individuals (mainly\n", "to match the results). However, the data *also* contained information\n", "about other races. Consider the table below, which shows the average of\n", "several of the key variables we clustered on. Based on this table,\n", "hypothesize which groups would be *least* likely to be classified as\n", "high-risk." ], "id": "a109203a-49e7-4658-8cdd-e7ee6b15c426" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table <- raw_data %>%\n", " group_by(race) %>%\n", " summarize(\n", " mean_age = mean(age),\n", " mean_priors = mean(priors_count),\n", " frac_male = mean(sex == \"Male\"),\n", " charge_felony = mean(c_charge_degree == \"F\")\n", " )\n", "\n", "table" ], "id": "b704601a-62ca-4f1b-944b-bffd6b2eedee" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write your answer and reasoning here:\n", "\n", "### Exercise 2\n", "\n", "The COMPAS system produces both a code (low, medium, high) for risk, and\n", "a numerical measure. One way of thinking about such a numerical measure\n", "is that it’s a clustering process with a very high $K$.\n", "\n", "- Why do you think that in the sentencing recommendations they focused\n", " on the code, and not the measure?\n", "- Do you think the numerical measure is immune to algorithmic bias or\n", " not?\n", "\n", "Write your answer and reasoning here:\n", "\n", "### Exercise 3\n", "\n", "According to our linear probability model, which characteristics are\n", "strong predictors of an individual’s likelihood to reoffend? Would you\n", "use any of these characteristics to partially decide one’s sentence? If\n", "so, which ones and why?\n", "\n", "Write your answer and explain your thought\n", "process here:\n", "\n", "### Exercise 4\n", "\n", "In machine learning, we often like to split our dataset up into two\n", "mutually exclusive and collectively exhaustive groups: training and\n", "testing samples. We use the training sample to train our classification\n", "(creation of our model), then use the testing sample to ensure that this\n", "classification has good external validity (cross-validation of our\n", "model). This allows us to construct a good classification initially\n", "while also guarding against this initial classification being\n", "over-fitted to our chosen group of data. 
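\n", "\n", "Constructing such a split is straightforward in R. Here is a minimal\n", "sketch (illustrative only, assuming a generic data frame `df`):\n", "\n", "``` r\n", "set.seed(123)\n", "\n", "# hold out 20% of rows for testing; train on the remaining 80%\n", "train_rows <- sample(nrow(df), size = floor(0.8 * nrow(df)))\n", "training <- df[train_rows, ]\n", "testing <- df[-train_rows, ]\n", "```\n", "\n", "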
In the case of $k$-means\n", "clustering specifically, this hinges on the choice of $K$ that we make.\n", "\n", "Consider three choices of $K$ used to cluster points in a dataset, with\n", "training and testing subsamples randomly chosen from the data to\n", "maximize the accuracy of our classification procedure.
\n", "\n", "- **A**: $K = 2$\n", "- **B**: $K = 5$\n", "
- **C**: $K = 10$\n", "\n", "Assume that the distribution of points in our overall dataset looks\n", "roughly similar to those we have seen in this module.\n", "\n", "Which of the following choices of $K$ is most likely to create a\n", "classification that clusters our training data with poor accuracy?" ], "id": "f25fa8ed-a392-4461-93f2-1cb0a5164e71" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_1 <- \"X\" # your answer of A, B, or C in place of X here\n", "\n", "test_1()" ], "id": "34af3045-2426-42bf-96d4-a81e004c7c15" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which of the following choices of $K$ is most likely to create a\n", "classification that clusters our training data with high accuracy but\n", "our testing data with low accuracy?" ], "id": "fff63ef7-48dc-4c1b-a6cb-a659f5e11abc" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_2 <- \"X\" # your answer of A, B, or C in place of X here\n", "\n", "test_2()" ], "id": "68055d44-b1d1-406e-93c1-c16668184aa2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which of the following choices of $K$ is most likely to create a\n", "classification that clusters our training data with high accuracy and\n", "has high external validity?" ], "id": "be1e0e47-d454-4e75-8a19-9bef49f48cf6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer_3 <- \"X\" # your answer of A, B, or C in place of X here\n", "\n", "test_3()" ], "id": "4f3508fd-83e7-44ba-b462-2100cc00c928" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }