{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 1.1 - Beginner - Introduction to Central Tendency\n",
        "\n",
        "COMET Team <br> *Sarthak Kwatra and Jonathan Graves*  \n",
        "2023-10-13\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "-   Introduction to Jupyter\n",
        "-   Introduction to R"
      ],
      "id": "ab15d002-8b2e-40d9-a903-72f007c63e48"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Importing the packages we'll be using in this module!\n",
        "\n",
        "# If any of these packages isn't installed run the line - install.packages(\"ggplot2\") -  with the name of the package within the quotation marks\n",
        "library(ggplot2)\n",
        "library(tidyverse)"
      ],
      "id": "3b432cb9-ebae-4227-9772-301eddef57ec"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> **Work in Progress**\n",
        ">\n",
        "> We haven’t written self-test for this unit yet! You’ll have to check\n",
        "> with your friends if you’ve got them right or not!\n",
        ">\n",
        "> -   Want to submit some? [Contact us!](mailto:comet.project@ubc.ca)\n",
        "\n",
        "## Introduction to Central Tendency\n",
        "\n",
        "For a moment, let’s think of data as alphabets. This data, or these\n",
        "alphabets, are available to us, but they are disarrayed and scattered,\n",
        "and they may mean nothing by themselves. However, if we look at data as\n",
        "alphabets, statistics is the language that we use to put alphabets into\n",
        "words to understand and communicate. Statistics is how we make sense of\n",
        "data.\n",
        "\n",
        "Therefore, understanding different statistical tools is almost like\n",
        "knowing different languages. All these statistical tools use the same\n",
        "alphabets (the data), and yet communicate a variety of things.\n",
        "\n",
        "Statistics is an economist’s arsenal of techniques and tools that allows\n",
        "them to extract meaningful insights from data. Many of these tools\n",
        "calculate a single representative value that summarizes the data in one\n",
        "way or another. We call these **numerical statistics**.\n",
        "\n",
        "> Here’s a helpful way of thinking about it: Have you ever tried\n",
        "> summarizing a movie to a friend? You’d probably pick the most\n",
        "> significant events or themes, presenting a concise yet comprehensive\n",
        "> overview. Similarly, statistical concepts aims to “summarize” a data\n",
        "> set into a single **typical** value.\n",
        "\n",
        "If you don’t have any experience with statistics, don’t fret! This\n",
        "course starts with all of the concepts from the ground up. If you have\n",
        "experience with statistics, this course will allow you to associate each\n",
        "statistical concept with R Code to make you even more efficient. The\n",
        "first among these statistical tools is the idea **central tendency**.\n",
        "\n",
        "Central tendency is meant to talk about what is “typical” for a dataset.\n",
        "Specifically, as evident from the term, tools of *central* tendency are\n",
        "concerned with the *centrality* of the data, or the middle values of the\n",
        "data. However, the center or the middle can mean multiple different\n",
        "things as far as data is concerned.\n",
        "\n",
        "Imagine standing in a room full of people and trying to find an average\n",
        "height. Or imagine being in a city and trying to find the most common\n",
        "temperature during the summer. Or imagine trying to understand what is\n",
        "the most commonly purchased car in your city. Both these tasks involve\n",
        "finding a ‘central’ or ‘typical’ value, which is the essence of central\n",
        "tendency.\n",
        "\n",
        "In order to understand the concepts of central tendency and use them,\n",
        "we’ll need a data set to work with. For this purpose, we will be using\n",
        "the `swiss` dataset that comes in-built as a part of R. We don’t need to\n",
        "import it, we just need to call the dataset. Additionally, for\n",
        "convenience, we’ll try to have a glimpse of the data set to see if\n",
        "anything important jumps out to us immediately."
      ],
      "id": "1d5c0782-16f0-45e9-927f-5e86ffc848e4"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "glimpse(swiss)"
      ],
      "id": "4626366b-b7d3-440b-9e3c-4038f5be2312"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`swiss` is a data set with records for socio-economic indicators for\n",
        "each of 47 French-speaking provinces of Switzerland at about 1888. Each\n",
        "of these uniquely recognized administrative divisions are called\n",
        "*cantons*. This data set like a guidebook, giving us insights into each\n",
        "canton’s characteristics. As a budding economist building your\n",
        "statistical arsenal, the `swiss` data set is the perfect place to start!\n",
        "\n",
        "Aside from just looking at your data set, one of the more helpful ways\n",
        "to understand your data set is to visualize it. Across the 47 cantons,\n",
        "let’s try to observe the `Agriculture` variable in our data set, which\n",
        "stands for “% of males involved in agriculture as occupation”, and see\n",
        "how it varies.\n",
        "\n",
        "For this, we’ll rely on a plot that is known as a **histogram**. A\n",
        "histogram is a plot that groups data into ranges or “bins” and showcases\n",
        "the frequency of our data points within these ranges. Let’s go ahead and\n",
        "visualize it!"
      ],
      "id": "8117a3e4-0868-4439-ae1e-5ba520c358e2"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "agriculture_plot <- ggplot(swiss, aes(y = Agriculture)) + \n",
        "  geom_histogram( \n",
        "                 bins = 30, \n",
        "                 fill=\"lightgray\", \n",
        "                 color=\"black\", \n",
        "                 alpha = 0.7) +\n",
        "  labs(title = \"Histogram of Agriculture Rates\", \n",
        "       x = \"Frequency\", \n",
        "       y = \"% of Men Involved in Agriculture as an Occupation\") +\n",
        "  scale_x_continuous(breaks = seq(min(swiss$Agriculture), max(swiss$Agriculture))) +\n",
        "  scale_y_continuous(n.breaks = 10) +\n",
        "  theme_minimal()\n",
        "\n",
        "agriculture_plot"
      ],
      "id": "219297cd-1dea-4b2d-ace7-ac24f3df6f65"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In a very similar manner, let’s also look at another variable,\n",
        "`Education`, which stands for the percentage of draftees educated beyond\n",
        "primary school for each of the cantons."
      ],
      "id": "38368bdd-a7cc-4d57-9757-b98ca92cf2f3"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "education_plot <- ggplot(swiss) + \n",
        "  geom_histogram(aes(y = Education), \n",
        "                 bins = 10, \n",
        "                 fill=\"lightgray\", \n",
        "                 color=\"black\", \n",
        "                 alpha = 0.7) +\n",
        "  labs(title = \"Histogram of Education in Draftees\", \n",
        "       x = \"Frequency\", \n",
        "       y = \"% Education beyond Primary School for Draftees.\") +\n",
        "  scale_x_continuous(breaks = seq(min(swiss$Education), max(swiss$Education))) +\n",
        "  scale_y_continuous(n.breaks = 10) +\n",
        "  theme_minimal()\n",
        "\n",
        "education_plot"
      ],
      "id": "3ebc0f74-d806-4729-83f5-361afad462c5"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "These graphs allow us to observe the distribution of the observations\n",
        "across the different levels in our variables. For instance, we can\n",
        "observe that for `Agriculture`, observations between 60 - 70 tends to\n",
        "have the highest frequency. Or on the other hand, most of the\n",
        "observations in the `Education` variable are in the 0 - 10 area.\n",
        "\n",
        "What does this all mean? How do we interpret all of these? To do this,\n",
        "we’ll take assistance of a few statistical concepts, namely: **Mean,\n",
        "Median, and Mode**, all of which are different interpretations of the\n",
        "word middle. Mean, Median, and Mode are the three primary concepts in\n",
        "the idea of central tendency.\n",
        "\n",
        "> **Test Your Knowledge**: Before you move on, where does the “middle”\n",
        "> of the data look like for `Education`? `Agriculture`? Write down your\n",
        "> answers, and see how the relate to the numerical statistics we will\n",
        "> compute below."
      ],
      "id": "e13ae6f3-9aad-4df6-b581-a6c56ff254b2"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# My answers are:\n",
        "\n",
        "Middle_of_education <- ?\n",
        "Middle_of_agriculture <- ?"
      ],
      "id": "922ae7f8-7c38-403f-b33e-f2d2731d6717"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The Key Ideas of Central Tendency\n",
        "\n",
        "### Mean\n",
        "\n",
        "At its core, the mean[1] is a simple concept – it is what you get when\n",
        "you distribute the total equally among every entry in the data set. Or\n",
        "alteratively, when you sum all of the observations in a data set, then\n",
        "divide by the number of observations in that data set. Formulaically,\n",
        "\n",
        "$$\n",
        "\\overline{X} = \\frac{1}{n}\\sum_{i=1}^{n} X_i =  \\frac{\\text{Sum of Value of All Data Points}}{\\text{Total Number of Data Points}}\n",
        "$$\n",
        "\n",
        "Here, $\\sum$ stands for summation, and $\\overline{X}$ is what is used to\n",
        "represent the mean. While this may be enough for you to understand the\n",
        "concept, we can nuance this explanation a slight bit and make it more\n",
        "intuitive to interpret!\n",
        "\n",
        "> **Check Your Understanding**: can you see why the two explanations for\n",
        "> mean given above are the same?\n",
        "\n",
        "Let’s imagine a scenario within the context of our `swiss` dataset. If\n",
        "we considered all the cantons in Switzerland, what education level would\n",
        "a “typical” canton have? This is the **Mean Education Rate**. In R, the\n",
        "Mean is calculated quite simply through the `mean` function:\n",
        "\n",
        "[1] Specifically, the *arithmetic mean*."
      ],
      "id": "e96d6410-5fb1-4691-85a3-12d86cf1a973"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "mean_education <- mean(swiss$Education)\n",
        "mean_education"
      ],
      "id": "b3837629-f1d7-476f-94a1-de9f74d9aa8b"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We could also check our comparison by computing it manually as well:"
      ],
      "id": "ebb0c863-a616-40ff-83d0-7d79d315f075"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "total_education <- sum(swiss$Education)\n",
        "total_cantons <- nrow(swiss) #number of observations in `swiss`\n",
        "\n",
        "mean_education_manual <- total_education/total_cantons\n",
        "mean_education_manual"
      ],
      "id": "ab244919-9ecf-4f30-b796-a5f061fcc54e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This allows us to notice that the **Mean Education Rate** across all of\n",
        "the Swiss cantons is **10.98%**. You can practice this yourself as well!\n",
        "Try to calculate fraction of Catholics within a typical Swiss canton in\n",
        "the code block below:"
      ],
      "id": "6d9c51c1-80fb-45a3-b234-5b9d386eb5fa"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Note: The first blank is supposed to be the function, and the second blank is supposed to be the variable\n",
        "avg_mean_catholic <- ...(swiss$...)"
      ],
      "id": "78561621-0c18-480b-b381-f569de9605d2"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Having observed the mean numerically, we can make our understanding of\n",
        "the concept even more robust by observing it visually. We can do this by\n",
        "slightly adjusting one of the histograms we’ve come up with earlier."
      ],
      "id": "c0007a6e-beac-4cb2-b7ba-1156a037bf85"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "education_plot + \n",
        "  geom_hline(aes(yintercept=mean_education), color=\"red\", linetype=\"dashed\", linewidth=1)"
      ],
      "id": "5c152994-2a98-450a-bb79-4d4dc2dbff5e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "See the red line? This is the mean we calculated before! How does it\n",
        "compare to the guess you made based on the histogram?\n",
        "\n",
        "However, as with any tool, it is important to understand the appropriate\n",
        "use of the mean as well as its limitations. One of the primary\n",
        "limitations of the mean is that it is severely affected by extreme\n",
        "values.\n",
        "\n",
        "Let’s say there was an error in recording, and a canton accidentally\n",
        "reported an extremely high fertility rate, much beyond the actual range.\n",
        "We’ll simulate this and see its effect on the mean.\n",
        "\n",
        "First, let’s store your fertility measure from before as the original\n",
        "mean fertility rate:"
      ],
      "id": "acebdfe0-ad1a-4b2a-86c8-6d1996b2f2d8"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "original_mean <- mean(swiss$Fertility)\n",
        "original_mean"
      ],
      "id": "0aa81fad-241e-4d82-a25f-90f065c6471d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then, let’s introduce an extreme value. For the sake of illustration,\n",
        "we’ll assign an unrealistically high fertility rate (e.g., 1000) to the\n",
        "first canton:"
      ],
      "id": "d6cd5e8f-3e5d-419f-ae50-5f9f758ddae2"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "swiss_with_extreme <- swiss\n",
        "swiss_with_extreme$Fertility[1] <- 1000"
      ],
      "id": "7b3c1aca-5980-479b-bf46-afdd332f732d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let’s compute the original mean with this extreme value:"
      ],
      "id": "6f9048fb-747b-4a1b-80e4-de14ef70e832"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "extreme_mean <- mean(swiss_with_extreme$Fertility)\n",
        "extreme_mean"
      ],
      "id": "e2b6abc7-fcfe-4394-9d5d-fd97a9f4b8e0"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This allows us to observe how significant a change a single observation\n",
        "can bring around in the Mean, making it jump from 70 to 89.7. For good\n",
        "measure, let’s also observe this visually:"
      ],
      "id": "64df516c-d9eb-4717-a766-9e1c55db3785"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ggplot() +\n",
        "  geom_histogram(data = swiss, aes(x=log(Fertility)), color=\"blue\", alpha=0.1, boundary = 0) +\n",
        "  geom_vline(aes(xintercept=log(original_mean)), color=\"blue\", linetype=\"dashed\", size=1) +\n",
        "  geom_histogram(data = swiss_with_extreme, aes(x=log(Fertility)), fill=\"red\", alpha=0.3, boundary = 0) +\n",
        "  geom_vline(data = swiss, aes(xintercept=log(extreme_mean)), color=\"red\", linetype=\"dashed\", size=1) +\n",
        "  labs(title=\"Effect of Extreme Value on Mean Fertility Rate\",\n",
        "       x=\"Fertility Rate (in logs)\",\n",
        "       fill=\"Dataset\") +\n",
        "  scale_fill_manual(values = c(\"blue\", \"red\"), labels = c(\"Original Data\")) +\n",
        "  theme_minimal()"
      ],
      "id": "f160e024-ee6f-4608-851e-60b82fc18432"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this plot, the blue histogram represents the original `swiss` data\n",
        "set, while the red histogram represents the data set with the extreme\n",
        "value. See how they’re pretty similar?\n",
        "\n",
        "The dashed lines indicate the mean of each data set. It becomes evident\n",
        "how the mean shifts due to just one extreme value, showcasing the\n",
        "sensitivity of the mean to outliers.\n",
        "\n",
        "In conclusion, the mean is particularly susceptible to extremes in a\n",
        "data set. This sensitivity is a primary reason why, in skewed\n",
        "distributions or when outliers are suspected, one might also consider\n",
        "other metrics of central tendency, like the median, which remains robust\n",
        "in the presence of extreme values. To deal with this potential issue, we\n",
        "naturally move onto other measures of central tendency.\n",
        "\n",
        "## The Median\n",
        "\n",
        "The **median** is, quite literally, the **middle** of an ordered\n",
        "sequence. The idea of centrality with the Median is to essentially order\n",
        "the data set, be it in an ascending or a descending order, and then\n",
        "dividing the data set in half. However, one of the characteristics that\n",
        "makes the Median important is that it allows us to deal with the very\n",
        "problem that we just elaborated on about the Mean. It is resilient to\n",
        "outliers or the extreme values in the data.\n",
        "\n",
        "It provides a central location of your dataset. For a symmetrical\n",
        "dataset, the mean and median will be the same. However, for a skewed\n",
        "dataset, the median will lie closer to the bulk of the data, making it a\n",
        "more representative metric.\n",
        "\n",
        "To calculate the Median, you arrange data in ascending (or descending)\n",
        "order. Let $n$ be the number of data points. If $n$ is odd, then:\n",
        "\n",
        "$$\n",
        "\\text{Median} = \\frac{n+1}{2}\\text{th data point}\n",
        "$$\n",
        "\n",
        "Otherwise,\n",
        "\n",
        "$$\n",
        "\\text{Median}  = \\frac{1}{2} \\cdot [\\frac{n}{2}\\text{th data point} + (\\frac{n}{2} + 1)\\text{th data point}]\n",
        "$$\n",
        "\n",
        "Not nice! On the other hand, in R, computing the median is\n",
        "straightforward using the built-in `median()` function.\n",
        "\n",
        "Using the Fertility column of the swiss dataset as an example:"
      ],
      "id": "8347a784-798d-4d2a-b0ab-7ff3e22bfb06"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Calculating the median\n",
        "median_fertility <- median(swiss$Fertility)\n",
        "median_fertility"
      ],
      "id": "5d3802f8-3c5c-49da-a633-74b0515afc09"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To further visualize where the median lies in relation to the data:"
      ],
      "id": "373c0483-e3e6-408e-b4fa-e2dfd2928077"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Plotting the data and highlighting the median\n",
        "\n",
        "fertility_plot <- ggplot(swiss, aes(x=Fertility)) + \n",
        "  geom_histogram(binwidth=2, fill=\"lightgray\", color=\"black\", alpha=0.7) + \n",
        "  geom_vline(aes(xintercept=median_fertility), color=\"red\", linetype=\"dashed\", size=1) +\n",
        "  labs(title=\"Median Fertility Rate Across Swiss Cantons\", x=\"Fertility Rate\") +\n",
        "  annotate(\"text\", x = median_fertility + 10, y=8, label = paste(\"Median:\", round(median_fertility, 2)), color=\"red\")\n",
        "\n",
        "fertility_plot"
      ],
      "id": "f0c8bf87-dd39-4867-a29b-dc672224a73d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, to bring this concept home, let’s repeat this exercise with the\n",
        "`Education` variable:"
      ],
      "id": "3c007918-a810-435e-8b16-b9ad2301ffce"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Calculating the median\n",
        "median_education <- median(swiss$Education)\n",
        "median_education"
      ],
      "id": "5369ec01-d90d-4342-b3d8-260a03b6ace5"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To further visualize where the median lies in relation to the data:"
      ],
      "id": "a2e00158-4aa4-4dfe-a70b-87446df2ae1a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Plotting the data and highlighting the median\n",
        "\n",
        "education_plot + \n",
        "  geom_hline(aes(yintercept = median_education), color=\"red\", linetype=\"dashed\", size=1) +\n",
        "  labs(title=\"Median Education Rate Across Swiss Cantons\", x=\"Education Rate\") +\n",
        "  annotate(\"text\", y = median_education + 2, x=8, label = paste(\"Median:\", round(median_education, 2)), color=\"red\")"
      ],
      "id": "555e92d6-f74d-4fd5-a255-d9d045fdb675"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Outlier Robustness\n",
        "\n",
        "One important property of the median is that it is **robust to\n",
        "outliers**, unlike the mean. This make sense, since it only has to do\n",
        "with the *rank* of observations: it doesn’t matter how high the highest\n",
        "value is, or how low the lowest value is.\n",
        "\n",
        "We can see this with our `swiss` education situation before. Try it!"
      ],
      "id": "42c8922b-54a3-4a0f-98b5-e67ef883cc15"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# compute the median for education in the original data\n",
        "\n",
        "original_median <- ...(swiss$Education)\n",
        "original_median\n",
        "\n",
        "median_with_extreme <- ...(...)\n",
        "median_with_extreme"
      ],
      "id": "cd667944-f7c7-4f62-ba4c-5bf650498da7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "What do you see? If you want, try changing that extreme value (`1000`)\n",
        "to other values. Does it make a difference?\n",
        "\n",
        "## Mode\n",
        "\n",
        "The **mode** refers to the value(s) that **appears most frequently** in\n",
        "a data set. This stands in contrast to other measures like the mean,\n",
        "which gives an average, or the median, which provides a midpoint.\n",
        "\n",
        "The beauty of the mode is its *versatility*. It’s relevant for both\n",
        "numeric data sets and qualitative data. This means that the mode can be\n",
        "used to gauge whether a value, such as `5`, appears with the greatest\n",
        "frequency in a data set, as well as if a category like `\"Female\"` or\n",
        "`\"University Graduate\"` appears with the greatest frequency.\n",
        "\n",
        "However, this is also the problem with mode. A data set’s relationship\n",
        "with mode can be quite complicated.\n",
        "\n",
        "-   It might not have a mode if no value repeats…\n",
        "-   be **uni-modal** if one value dominates in frequency…\n",
        "-   **bi-modal** if two values tie in their recurrence…\n",
        "-   or even **multi-modal** if multiple values share the highest\n",
        "    frequency.\n",
        "\n",
        "In R, we can calculate the mode without relying on external packages,\n",
        "since unlike the `mean` or the `median` function, there is no `mode`\n",
        "function. But mode is so simple, we can create one ourself.\n",
        "\n",
        "Consider a function that first creates a frequency table of the data set\n",
        "in question. It then identifies the maximum frequency from this table.\n",
        "Using this frequency, it’s possible to extract the modes, which are the\n",
        "values that appear with this maximum frequency. Here’s how it might\n",
        "look:"
      ],
      "id": "49ab1c96-980f-4aca-9131-c635551fca41"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "calculate_mode <- function(x) {\n",
        "  # Tabulating frequencies of each value in the dataset\n",
        "  freq_table <- table(x)\n",
        "  \n",
        "  # Determining the maximum frequency\n",
        "  max_freq <- max(freq_table)\n",
        "  \n",
        "  # Pinpointing the values (modes) that correspond to the maximum frequency\n",
        "  modes <- as.numeric(names(freq_table[freq_table == max_freq]))\n",
        "  \n",
        "  return(modes)\n",
        "}\n",
        "\n",
        "# Applying the function on the 'Education' column from the 'swiss' dataset\n",
        "modes_education <- calculate_mode(swiss$Education)\n",
        "\n",
        "modes_education"
      ],
      "id": "ca69b0be-0af7-4717-98ab-40afe7f103ca"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Therefore, as our function correctly interprets, the Mode for the\n",
        "Education variable is 7. This means that the among the cantons, a lot of\n",
        "them have 7% draftees who are educated above the primary school level.\n",
        "\n",
        "## Getting All of Central Tendency Together\n",
        "\n",
        "To truly appreciate the nature of a data set, it’s beneficial to look at\n",
        "the mode in tandem with other measures like the mean and median.\n",
        "Together, these metrics provide a fuller, more nuanced picture of the\n",
        "data’s central tendency. By superimposing our histogram with lines\n",
        "symbolizing the mean (blue), median (red), and mode (green), we create a\n",
        "tapestry that visually harmonizes the data’s spread with its central\n",
        "measures."
      ],
      "id": "40733854-7ee2-40f6-9dbe-92311a39174d"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ggplot(swiss, aes(x = Education)) + \n",
        "  geom_histogram(binwidth = 2, fill=\"lightgray\", color=\"black\", alpha=0.7) + \n",
        "  geom_vline(aes(xintercept = mean_education), color=\"blue\", linetype=\"dashed\", size=1) +\n",
        "  geom_vline(aes(xintercept = median_education), color=\"red\", linetype=\"dashed\", size=1) +\n",
        "  geom_vline(aes(xintercept = modes_education), color=\"green\", linetype=\"dashed\", size=1) +\n",
        "  labs(title=\"Median Education Rate Across Swiss Cantons\", x=\"Education Rate\")"
      ],
      "id": "8501a02f-4544-44eb-9828-1a2508c246c8"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "See the relationship? We can also do this in a table using the\n",
        "`summarize` function:"
      ],
      "id": "8939986b-605c-47fb-b635-b6fa954772e2"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "swiss %>%\n",
        "  summarize(\n",
        "    mean = mean(Education),\n",
        "    median = median(Education),\n",
        "    mode = calculate_mode(Education)\n",
        "  )"
      ],
      "id": "f3c11c69-036f-4b37-aa62-e6e1a2d59441"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This is called a **table of descriptive statistics** and is an important\n",
        "tool for any economist.\n",
        "\n",
        "### Try it Yourself!\n",
        "\n",
        "As a final check, why don’t you make a nice table of the same results\n",
        "for Fertility, as well?"
      ],
      "id": "c88274c1-4694-441e-b928-df8f256972b4"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "swiss %>%\n",
        "  summarize(\n",
        "    ...\n",
        "  )"
      ],
      "id": "5bee8ba9-8815-4c58-8f2f-cb8079318135"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "What do you see?"
      ],
      "id": "9008933c-87ec-46f0-84d4-bc0d8617aecd"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}