01 - Introduction to Central Tendency using R

introduction

review

econ 226

econ 227

central tendency

mean

median

mode

skewness

dispersion

This notebook is an hands-on introduction to the concepts of Central Tendency at the beginner level using R. It is meant for undergraduates with no or very little prior exposure to university-level statistics.

Author

COMET Team
Sarthak Kwatra and Jonathan Graves

Published

13 October 2023

Prerequisites

Introduction to Jupyter
Introduction to R

# Importing the packages we'll be using in this module!

# If any of these packages isn't installed run the line - install.packages("ggplot2") -  with the name of the package within the quotation marks
library(ggplot2)
library(tidyverse)
options(jupyter.plot_mimetypes = "image/png")

Work in Progress

We haven’t written self-test for this unit yet! You’ll have to check with your friends if you’ve got them right or not!

Want to submit some?

Introduction to Central Tendency

For a moment, let’s think of data as alphabets. This data, or these alphabets, are available to us, but they are disarrayed and scattered, and they may mean nothing by themselves. However, if we look at data as alphabets, statistics is the language that we use to put alphabets into words to understand and communicate. Statistics is how we make sense of data.

Therefore, understanding different statistical tools is almost like knowing different languages. All these statistical tools use the same alphabets (the data), and yet communicate a variety of things.

Statistics is an economist’s arsenal of techniques and tools that allows them to extract meaningful insights from data. Many of these tools calculate a single representative value that summarizes the data in one way or another. We call these numerical statistics.

Here’s a helpful way of thinking about it: Have you ever tried summarizing a movie to a friend? You’d probably pick the most significant events or themes, presenting a concise yet comprehensive overview. Similarly, statistical concepts aims to “summarize” a data set into a single typical value.

If you don’t have any experience with statistics, don’t fret! This course starts with all of the concepts from the ground up. If you have experience with statistics, this course will allow you to associate each statistical concept with R Code to make you even more efficient. The first among these statistical tools is the idea central tendency.

Central tendency is meant to talk about what is “typical” for a dataset. Specifically, as evident from the term, tools of central tendency are concerned with the centrality of the data, or the middle values of the data. However, the center or the middle can mean multiple different things as far as data is concerned.

Imagine standing in a room full of people and trying to find an average height. Or imagine being in a city and trying to find the most common temperature during the summer. Or imagine trying to understand what is the most commonly purchased car in your city. Both these tasks involve finding a ‘central’ or ‘typical’ value, which is the essence of central tendency.

In order to understand the concepts of central tendency and use them, we’ll need a data set to work with. For this purpose, we will be using the swiss dataset that comes in-built as a part of R. We don’t need to import it, we just need to call the dataset. Additionally, for convenience, we’ll try to have a glimpse of the data set to see if anything important jumps out to us immediately.

glimpse(swiss)

swiss is a data set with records for socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888. Each of these uniquely recognized administrative divisions are called cantons. This data set like a guidebook, giving us insights into each canton’s characteristics. As a budding economist building your statistical arsenal, the swiss data set is the perfect place to start!

Aside from just looking at your data set, one of the more helpful ways to understand your data set is to visualize it. Across the 47 cantons, let’s try to observe the Agriculture variable in our data set, which stands for “% of males involved in agriculture as occupation”, and see how it varies.

For this, we’ll rely on a plot that is known as a histogram. A histogram is a plot that groups data into ranges or “bins” and showcases the frequency of our data points within these ranges. Let’s go ahead and visualize it!

agriculture_plot <- ggplot(swiss, aes(y = Agriculture)) + 
  geom_histogram( 
                 bins = 30, 
                 fill="lightgray", 
                 color="black", 
                 alpha = 0.7) +
  labs(title = "Histogram of Agriculture Rates", 
       x = "Frequency", 
       y = "% of Men Involved in Agriculture as an Occupation") +
  scale_x_continuous(breaks = seq(min(swiss$Agriculture), max(swiss$Agriculture))) +
  scale_y_continuous(n.breaks = 10) +
  theme_minimal()

agriculture_plot

In a very similar manner, let’s also look at another variable, Education, which stands for the percentage of draftees educated beyond primary school for each of the cantons.

education_plot <- ggplot(swiss) + 
  geom_histogram(aes(y = Education), 
                 bins = 10, 
                 fill="lightgray", 
                 color="black", 
                 alpha = 0.7) +
  labs(title = "Histogram of Education in Draftees", 
       x = "Frequency", 
       y = "% Education beyond Primary School for Draftees.") +
  scale_x_continuous(breaks = seq(min(swiss$Education), max(swiss$Education))) +
  scale_y_continuous(n.breaks = 10) +
  theme_minimal()

education_plot

These graphs allow us to observe the distribution of the observations across the different levels in our variables. For instance, we can observe that for Agriculture, observations between 60 - 70 tends to have the highest frequency. Or on the other hand, most of the observations in the Education variable are in the 0 - 10 area.

What does this all mean? How do we interpret all of these? To do this, we’ll take assistance of a few statistical concepts, namely: Mean, Median, and Mode, all of which are different interpretations of the word middle. Mean, Median, and Mode are the three primary concepts in the idea of central tendency.

Test Your Knowledge: Before you move on, where does the “middle” of the data look like for Education? Agriculture? Write down your answers, and see how the relate to the numerical statistics we will compute below.

# My answers are:

Middle_of_education <- ?
Middle_of_agriculture <- ?

The Key Ideas of Central Tendency

Mean

At its core, the mean¹ is a simple concept – it is what you get when you distribute the total equally among every entry in the data set. Or alteratively, when you sum all of the observations in a data set, then divide by the number of observations in that data set. Formulaically,

\[ \overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{\text{Sum of Value of All Data Points}}{\text{Total Number of Data Points}} \]

Here, \(\sum\) stands for summation, and \(\overline{X}\) is what is used to represent the mean. While this may be enough for you to understand the concept, we can nuance this explanation a slight bit and make it more intuitive to interpret!

Check Your Understanding: can you see why the two explanations for mean given above are the same?

Let’s imagine a scenario within the context of our swiss dataset. If we considered all the cantons in Switzerland, what education level would a “typical” canton have? This is the Mean Education Rate. In R, the Mean is calculated quite simply through the mean function:

mean_education <- mean(swiss$Education)
mean_education

We could also check our comparison by computing it manually as well:

total_education <- sum(swiss$Education)
total_cantons <- nrow(swiss) #number of observations in `swiss`

mean_education_manual <- total_education/total_cantons
mean_education_manual

This allows us to notice that the Mean Education Rate across all of the Swiss cantons is 10.98%. You can practice this yourself as well! Try to calculate fraction of Catholics within a typical Swiss canton in the code block below:

# Note: The first blank is supposed to be the function, and the second blank is supposed to be the variable
avg_mean_catholic <- ...(swiss$...)

Having observed the mean numerically, we can make our understanding of the concept even more robust by observing it visually. We can do this by slightly adjusting one of the histograms we’ve come up with earlier.

education_plot + 
  geom_hline(aes(yintercept=mean_education), color="red", linetype="dashed", linewidth=1)

See the red line? This is the mean we calculated before! How does it compare to the guess you made based on the histogram?

However, as with any tool, it is important to understand the appropriate use of the mean as well as its limitations. One of the primary limitations of the mean is that it is severely affected by extreme values.

Let’s say there was an error in recording, and a canton accidentally reported an extremely high fertility rate, much beyond the actual range. We’ll simulate this and see its effect on the mean.

First, let’s store your fertility measure from before as the original mean fertility rate:

original_mean <- mean(swiss$Fertility)
original_mean

Then, let’s introduce an extreme value. For the sake of illustration, we’ll assign an unrealistically high fertility rate (e.g., 1000) to the first canton:

swiss_with_extreme <- swiss
swiss_with_extreme$Fertility[1] <- 1000

Now, let’s compute the original mean with this extreme value:

extreme_mean <- mean(swiss_with_extreme$Fertility)
extreme_mean

This allows us to observe how significant a change a single observation can bring around in the Mean, making it jump from 70 to 89.7. For good measure, let’s also observe this visually:

ggplot() +
  geom_histogram(data = swiss, aes(x=log(Fertility)), color="blue", alpha=0.1, boundary = 0) +
  geom_vline(aes(xintercept=log(original_mean)), color="blue", linetype="dashed", size=1) +
  geom_histogram(data = swiss_with_extreme, aes(x=log(Fertility)), fill="red", alpha=0.3, boundary = 0) +
  geom_vline(data = swiss, aes(xintercept=log(extreme_mean)), color="red", linetype="dashed", size=1) +
  labs(title="Effect of Extreme Value on Mean Fertility Rate",
       x="Fertility Rate (in logs)",
       fill="Dataset") +
  scale_fill_manual(values = c("blue", "red"), labels = c("Original Data")) +
  theme_minimal()

In this plot, the blue histogram represents the original swiss data set, while the red histogram represents the data set with the extreme value. See how they’re pretty similar?

The dashed lines indicate the mean of each data set. It becomes evident how the mean shifts due to just one extreme value, showcasing the sensitivity of the mean to outliers.

In conclusion, the mean is particularly susceptible to extremes in a data set. This sensitivity is a primary reason why, in skewed distributions or when outliers are suspected, one might also consider other metrics of central tendency, like the median, which remains robust in the presence of extreme values. To deal with this potential issue, we naturally move onto other measures of central tendency.

The Median

The median is, quite literally, the middle of an ordered sequence. The idea of centrality with the Median is to essentially order the data set, be it in an ascending or a descending order, and then dividing the data set in half. However, one of the characteristics that makes the Median important is that it allows us to deal with the very problem that we just elaborated on about the Mean. It is resilient to outliers or the extreme values in the data.

It provides a central location of your dataset. For a symmetrical dataset, the mean and median will be the same. However, for a skewed dataset, the median will lie closer to the bulk of the data, making it a more representative metric.

To calculate the Median, you arrange data in ascending (or descending) order. Let \(n\) be the number of data points. If \(n\) is odd, then:

\[ \text{Median} = \frac{n+1}{2}\text{th data point} \]

Otherwise,

\[ \text{Median} = \frac{1}{2} \cdot [\frac{n}{2}\text{th data point} + (\frac{n}{2} + 1)\text{th data point}] \]

Not nice! On the other hand, in R, computing the median is straightforward using the built-in median() function.

Using the Fertility column of the swiss dataset as an example:

# Calculating the median
median_fertility <- median(swiss$Fertility)
median_fertility

To further visualize where the median lies in relation to the data:

# Plotting the data and highlighting the median

fertility_plot <- ggplot(swiss, aes(x=Fertility)) + 
  geom_histogram(binwidth=2, fill="lightgray", color="black", alpha=0.7) + 
  geom_vline(aes(xintercept=median_fertility), color="red", linetype="dashed", size=1) +
  labs(title="Median Fertility Rate Across Swiss Cantons", x="Fertility Rate") +
  annotate("text", x = median_fertility + 10, y=8, label = paste("Median:", round(median_fertility, 2)), color="red")

fertility_plot

Finally, to bring this concept home, let’s repeat this exercise with the Education variable:

# Calculating the median
median_education <- median(swiss$Education)
median_education

To further visualize where the median lies in relation to the data:

# Plotting the data and highlighting the median

education_plot + 
  geom_hline(aes(yintercept = median_education), color="red", linetype="dashed", size=1) +
  labs(title="Median Education Rate Across Swiss Cantons", x="Education Rate") +
  annotate("text", y = median_education + 2, x=8, label = paste("Median:", round(median_education, 2)), color="red")

Outlier Robustness

One important property of the median is that it is robust to outliers, unlike the mean. This make sense, since it only has to do with the rank of observations: it doesn’t matter how high the highest value is, or how low the lowest value is.

We can see this with our swiss education situation before. Try it!

# compute the median for education in the original data

original_median <- ...(swiss$Education)
original_median

median_with_extreme <- ...(...)
median_with_extreme

What do you see? If you want, try changing that extreme value (1000) to other values. Does it make a difference?

Mode

The mode refers to the value(s) that appears most frequently in a data set. This stands in contrast to other measures like the mean, which gives an average, or the median, which provides a midpoint.

The beauty of the mode is its versatility. It’s relevant for both numeric data sets and qualitative data. This means that the mode can be used to gauge whether a value, such as 5, appears with the greatest frequency in a data set, as well as if a category like "Female" or "University Graduate" appears with the greatest frequency.

However, this is also the problem with mode. A data set’s relationship with mode can be quite complicated.

It might not have a mode if no value repeats…
be uni-modal if one value dominates in frequency…
bi-modal if two values tie in their recurrence…
or even multi-modal if multiple values share the highest frequency.

In R, we can calculate the mode without relying on external packages, since unlike the mean or the median function, there is no mode function. But mode is so simple, we can create one ourself.

Consider a function that first creates a frequency table of the data set in question. It then identifies the maximum frequency from this table. Using this frequency, it’s possible to extract the modes, which are the values that appear with this maximum frequency. Here’s how it might look:

calculate_mode <- function(x) {
  # Tabulating frequencies of each value in the dataset
  freq_table <- table(x)
  
  # Determining the maximum frequency
  max_freq <- max(freq_table)
  
  # Pinpointing the values (modes) that correspond to the maximum frequency
  modes <- as.numeric(names(freq_table[freq_table == max_freq]))
  
  return(modes)
}

# Applying the function on the 'Education' column from the 'swiss' dataset
modes_education <- calculate_mode(swiss$Education)

modes_education

Therefore, as our function correctly interprets, the Mode for the Education variable is 7. This means that the among the cantons, a lot of them have 7% draftees who are educated above the primary school level.

Getting All of Central Tendency Together

To truly appreciate the nature of a data set, it’s beneficial to look at the mode in tandem with other measures like the mean and median. Together, these metrics provide a fuller, more nuanced picture of the data’s central tendency. By superimposing our histogram with lines symbolizing the mean (blue), median (red), and mode (green), we create a tapestry that visually harmonizes the data’s spread with its central measures.

ggplot(swiss, aes(x = Education)) + 
  geom_histogram(binwidth = 2, fill="lightgray", color="black", alpha=0.7) + 
  geom_vline(aes(xintercept = mean_education), color="blue", linetype="dashed", size=1) +
  geom_vline(aes(xintercept = median_education), color="red", linetype="dashed", size=1) +
  geom_vline(aes(xintercept = modes_education), color="green", linetype="dashed", size=1) +
  labs(title="Median Education Rate Across Swiss Cantons", x="Education Rate")

See the relationship? We can also do this in a table using the summarize function:

swiss %>%
  summarize(
    mean = mean(Education),
    median = median(Education),
    mode = calculate_mode(Education)
  )

This is called a table of descriptive statistics and is an important tool for any economist.

Try it Yourself!

As a final check, why don’t you make a nice table of the same results for Fertility, as well?

swiss %>%
  summarize(
    ...
  )

What do you see?

Footnotes

Specifically, the arithmetic mean.↩︎