# Importing the packages we'll be using in this module!
# If any of these packages isn't installed run the line - install.packages("ggplot2") - with the name of the package within the quotation marks
library(ggplot2)
library(tidyverse)
1.1 - Beginner - Introduction to Central Tendency
Prerequisites
- Introduction to Jupyter
- Introduction to R
We haven’t written self-tests for this unit yet! You’ll have to check with your friends if you’ve got them right or not!
- Want to submit some?
Introduction to Central Tendency
For a moment, let’s think of data as the letters of an alphabet. These letters are available to us, but they are scattered and disordered, and they may mean nothing by themselves. If data are letters, then statistics is the language we use to assemble them into words that we can understand and communicate. Statistics is how we make sense of data.
Understanding different statistical tools is therefore almost like knowing different languages. All of these tools use the same letters (the data), and yet they communicate a variety of things.
Statistics is an economist’s arsenal of techniques and tools that allows them to extract meaningful insights from data. Many of these tools calculate a single representative value that summarizes the data in one way or another. We call these numerical statistics.
Here’s a helpful way of thinking about it: Have you ever tried summarizing a movie to a friend? You’d probably pick the most significant events or themes, presenting a concise yet comprehensive overview. Similarly, statistical concepts aim to “summarize” a data set in a single typical value.
If you don’t have any experience with statistics, don’t fret! This course builds all of the concepts from the ground up. If you do have experience with statistics, this course will let you associate each statistical concept with R code to make you even more efficient. The first of these statistical tools is the idea of central tendency.
Central tendency is meant to talk about what is “typical” for a dataset. Specifically, as evident from the term, tools of central tendency are concerned with the centrality of the data, or the middle values of the data. However, the center or the middle can mean multiple different things as far as data is concerned.
Imagine standing in a room full of people and trying to find an average height. Or imagine being in a city and trying to find the most common temperature during the summer. Or imagine trying to work out which car is most commonly purchased in your city. All of these tasks involve finding a ‘central’ or ‘typical’ value, which is the essence of central tendency.
In order to understand the concepts of central tendency and use them, we’ll need a data set to work with. For this purpose, we will be using the swiss dataset that comes built into R. We don’t need to import it; we just need to call the dataset. Additionally, for convenience, we’ll take a glimpse at the data set to see if anything important jumps out at us immediately.
glimpse(swiss)
swiss is a data set with records of socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888. Each of these uniquely recognized administrative divisions is called a canton. This dataset is like a guidebook, giving us insights into each canton’s characteristics. As a budding economist building your statistical arsenal, the swiss data set is the perfect place to start!
Aside from just looking at your data set, one of the more helpful ways to understand it is to visualize it. Across the 47 cantons, let’s observe the Agriculture variable in our data set, which stands for “% of males involved in agriculture as occupation”, and see how it varies.
For this, we’ll rely on a plot that is known as a histogram. A histogram is a plot that groups data into ranges or “bins” and showcases the frequency of our data points within these ranges. Let’s go ahead and visualize it!
agriculture_plot <- ggplot(swiss, aes(y = Agriculture)) +
  geom_histogram(
    bins = 30,
    fill = "lightgray",
    color = "black",
    alpha = 0.7) +
  labs(title = "Histogram of Agriculture Rates",
       x = "Frequency",
       y = "% of Men Involved in Agriculture as an Occupation") +
  scale_x_continuous(breaks = seq(min(swiss$Agriculture), max(swiss$Agriculture))) +
  scale_y_continuous(n.breaks = 10) +
  theme_minimal()
agriculture_plot
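To make the idea of “bins” concrete, here is a minimal sketch that counts the cantons in each range by hand, using base R only. The ten-point-wide bins are an assumption chosen for readability; the plot above uses 30 bins, so its bars will not match these counts exactly.

```r
# Assumption: ten-point-wide bins from 0 to 100, chosen for readability
bin_edges <- seq(0, 100, by = 10)

# cut() assigns each canton's Agriculture value to a bin;
# table() then counts how many cantons fall into each bin
agriculture_bins <- cut(swiss$Agriculture, breaks = bin_edges)
bin_counts <- table(agriculture_bins)
bin_counts

# The label of the bin that contains the most cantons
names(bin_counts)[which.max(bin_counts)]
```

Each count is the height of one bar in a histogram drawn with these bin edges.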
In a very similar manner, let’s also look at another variable, Education, which stands for the percentage of draftees educated beyond primary school for each of the cantons.
education_plot <- ggplot(swiss) +
  geom_histogram(aes(y = Education),
    bins = 10,
    fill = "lightgray",
    color = "black",
    alpha = 0.7) +
  labs(title = "Histogram of Education in Draftees",
       x = "Frequency",
       y = "% Education beyond Primary School for Draftees.") +
  scale_x_continuous(breaks = seq(min(swiss$Education), max(swiss$Education))) +
  scale_y_continuous(n.breaks = 10) +
  theme_minimal()
education_plot
These graphs allow us to observe the distribution of the observations across the different levels in our variables. For instance, we can observe that for Agriculture, observations between 60 and 70 tend to have the highest frequency. On the other hand, most of the observations in the Education variable are in the 0 to 10 range.
What does this all mean? How do we interpret all of this? To answer, we’ll draw on a few statistical concepts, namely the mean, median, and mode, each of which is a different interpretation of the word “middle”. These are the three primary measures of central tendency.
Test Your Knowledge: Before you move on, where does the “middle” of the data appear to be for Education? For Agriculture? Write down your answers, and see how they relate to the numerical statistics we will compute below.
# My answers are:
Middle_of_education <- ?
Middle_of_agriculture <- ?
The Key Ideas of Central Tendency
Mean
At its core, the mean¹ is a simple concept: it is what you get when you distribute the total equally among every entry in the data set. Equivalently, it is what you get when you sum all of the observations in a data set and then divide by the number of observations in that data set. As a formula,
\[ \overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{\text{Sum of Value of All Data Points}}{\text{Total Number of Data Points}} \]
Here, \(\sum\) stands for summation, and \(\overline{X}\) represents the mean. While this may be enough for you to understand the concept, we can refine this explanation slightly and make it more intuitive to interpret!
Check Your Understanding: can you see why the two explanations for the mean given above are the same?
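One way to explore this question is with a small made-up example. The sketch below uses hypothetical toy numbers (not taken from the swiss data) and checks that giving every entry the same share leaves the total unchanged, which is exactly what dividing the sum by the number of observations does.

```r
# Hypothetical toy data: amounts held by four people
amounts <- c(2, 4, 6, 8)
n <- length(amounts)

# View 1: sum everything, then divide by the number of observations
mean_by_formula <- sum(amounts) / n

# View 2: give every entry the same share and check the total is preserved
equal_share <- rep(mean_by_formula, n)
sum(equal_share) == sum(amounts)

# Both views agree with R's built-in mean()
mean_by_formula
mean(amounts)
```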
Let’s imagine a scenario within the context of our swiss dataset. If we considered all the cantons in Switzerland, what education level would a “typical” canton have? This is the Mean Education Rate. In R, the mean is calculated quite simply through the mean function:
mean_education <- mean(swiss$Education)
mean_education
We can also check our result by computing it manually:
total_education <- sum(swiss$Education)
total_cantons <- nrow(swiss) # number of observations in `swiss`

mean_education_manual <- total_education / total_cantons
mean_education_manual
This shows that the Mean Education Rate across all of the Swiss cantons is 10.98%. You can practice this yourself as well! Try to calculate the fraction of Catholics within a typical Swiss canton in the code block below:
# Note: The first blank is supposed to be the function, and the second blank is supposed to be the variable
avg_mean_catholic <- ...(swiss$...)
Having observed the mean numerically, we can make our understanding of the concept even more robust by observing it visually. We can do this by slightly adjusting one of the histograms we’ve come up with earlier.
education_plot +
  geom_hline(aes(yintercept = mean_education), color = "red", linetype = "dashed", linewidth = 1)
See the red line? This is the mean we calculated before! How does it compare to the guess you made based on the histogram?
However, as with any tool, it is important to understand the appropriate use of the mean as well as its limitations. One of the primary limitations of the mean is that it is severely affected by extreme values.
Let’s say there was an error in recording, and a canton accidentally reported an extremely high fertility rate, much beyond the actual range. We’ll simulate this and see its effect on the mean.
First, let’s compute the mean of the Fertility variable and store it as the original mean fertility rate:
original_mean <- mean(swiss$Fertility)
original_mean
Then, let’s introduce an extreme value. For the sake of illustration, we’ll assign an unrealistically high fertility rate (e.g., 1000) to the first canton:
swiss_with_extreme <- swiss
swiss_with_extreme$Fertility[1] <- 1000
Now, let’s compute the new mean with this extreme value included:
extreme_mean <- mean(swiss_with_extreme$Fertility)
extreme_mean
This shows how large a change a single observation can cause in the mean, making it jump from about 70.1 to about 89.7. For good measure, let’s also observe this visually:
ggplot() +
  # Map fill inside aes() so that the legend and scale_fill_manual() take effect
  geom_histogram(data = swiss, aes(x = log(Fertility), fill = "Original Data"), alpha = 0.3, boundary = 0) +
  geom_histogram(data = swiss_with_extreme, aes(x = log(Fertility), fill = "With Extreme Value"), alpha = 0.3, boundary = 0) +
  # Dashed lines mark the mean of each data set (on the log scale)
  geom_vline(xintercept = log(original_mean), color = "blue", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = log(extreme_mean), color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Effect of Extreme Value on Mean Fertility Rate",
       x = "Fertility Rate (in logs)",
       fill = "Dataset") +
  scale_fill_manual(values = c("Original Data" = "blue", "With Extreme Value" = "red")) +
  theme_minimal()
In this plot, the blue histogram represents the original swiss data set, while the red histogram represents the data set with the extreme value. See how they’re pretty similar?
The dashed lines indicate the mean of each data set. It becomes evident how the mean shifts due to just one extreme value, showcasing the sensitivity of the mean to outliers.
In conclusion, the mean is particularly susceptible to extremes in a data set. This sensitivity is a primary reason why, in skewed distributions or when outliers are suspected, one might also consider other metrics of central tendency, like the median, which remains robust in the presence of extreme values. To deal with this potential issue, we naturally move onto other measures of central tendency.
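The size of the jump in the fertility example can also be worked out directly from the formula for the mean. As a rough sketch: replacing one value changes the total by the difference between the new and old values, so the mean moves by that difference divided by the number of observations. The variable names below are introduced only for this illustration.

```r
n <- nrow(swiss)
old_value <- swiss$Fertility[1]   # the first canton's recorded value
new_value <- 1000                 # the simulated recording error

# Replacing one value changes the total by (new_value - old_value),
# so the mean moves by that difference divided by n
shift <- (new_value - old_value) / n
predicted_mean <- mean(swiss$Fertility) + shift

# Compare with recomputing the mean directly on the altered data
altered <- swiss$Fertility
altered[1] <- new_value
c(predicted = predicted_mean, recomputed = mean(altered))
```

Because the shift is divided by the number of observations, the same error would move the mean less in a larger data set.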
The Median
The median is, quite literally, the middle of an ordered sequence. The idea is to order the data set, in either ascending or descending order, and then divide it in half; the median is the value at the dividing point. One characteristic that makes the median important is that it handles the very problem we just described for the mean: it is resilient to outliers, or extreme values, in the data.
It provides a central location of your dataset. For a symmetrical dataset, the mean and median will be the same. However, for a skewed dataset, the median will lie closer to the bulk of the data, making it a more representative metric.
To calculate the Median, you arrange data in ascending (or descending) order. Let \(n\) be the number of data points. If \(n\) is odd, then:
\[ \text{Median} = \frac{n+1}{2}\text{th data point} \]
Otherwise,
\[ \text{Median} = \frac{1}{2} \cdot [\frac{n}{2}\text{th data point} + (\frac{n}{2} + 1)\text{th data point}] \]
Not nice! On the other hand, in R, computing the median is straightforward using the built-in median() function.
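To connect the two formulas above with what median() does, here is a minimal sketch of the calculation written out by hand. The helper name manual_median is hypothetical and exists only for this illustration; in practice you would simply call median().

```r
# Hypothetical helper that mirrors the odd and even cases of the formula
manual_median <- function(x) {
  sorted <- sort(x)        # arrange the data in ascending order
  n <- length(sorted)
  if (n %% 2 == 1) {
    # odd n: take the ((n + 1) / 2)th data point
    sorted[(n + 1) / 2]
  } else {
    # even n: average the (n / 2)th and (n / 2 + 1)th data points
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

manual_median(c(7, 1, 5))      # odd n: the middle of 1, 5, 7
manual_median(c(7, 1, 5, 3))   # even n: the average of 3 and 5

# The hand-written version should agree with the built-in function
all.equal(manual_median(swiss$Fertility), median(swiss$Fertility))
```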
Using the Fertility column of the swiss dataset as an example:
# Calculating the median
median_fertility <- median(swiss$Fertility)
median_fertility
To further visualize where the median lies in relation to the data:
# Plotting the data and highlighting the median
fertility_plot <- ggplot(swiss, aes(x = Fertility)) +
  geom_histogram(binwidth = 2, fill = "lightgray", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = median_fertility), color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Median Fertility Rate Across Swiss Cantons", x = "Fertility Rate") +
  annotate("text", x = median_fertility + 10, y = 8, label = paste("Median:", round(median_fertility, 2)), color = "red")
fertility_plot
Finally, to bring this concept home, let’s repeat this exercise with the Education variable:
# Calculating the median
median_education <- median(swiss$Education)
median_education
To further visualize where the median lies in relation to the data:
# Plotting the data and highlighting the median
# (education_plot has Education on the y-axis, so the axis label goes on y)
education_plot +
  geom_hline(aes(yintercept = median_education), color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Median Education Rate Across Swiss Cantons", y = "Education Rate") +
  annotate("text", y = median_education + 2, x = 8, label = paste("Median:", round(median_education, 2)), color = "red")
Outlier Robustness
One important property of the median is that it is robust to outliers, unlike the mean. This makes sense, since it only has to do with the rank of observations: it doesn’t matter how high the highest value is, or how low the lowest value is.
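A small made-up example illustrates the point. The numbers below are hypothetical and not drawn from the swiss data; only the largest value differs between the two vectors.

```r
# Hypothetical toy data: identical except for the largest value
typical <- c(10, 20, 30, 40, 50)
with_outlier <- c(10, 20, 30, 40, 5000)

# The mean is pulled far upward by the single extreme value
mean(typical)        # 30
mean(with_outlier)   # 1020

# The median depends only on the middle-ranked value, so it does not move
median(typical)        # 30
median(with_outlier)   # 30
```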
We can see this by revisiting our swiss example from before, this time with the median. Try it!
# compute the median for education in the original data
original_median <- ...(swiss$Education)
original_median

median_with_extreme <- ...(...)
median_with_extreme
What do you see? If you want, try changing that extreme value (1000) to other values. Does it make a difference?
Mode
The mode refers to the value(s) that appears most frequently in a data set. This stands in contrast to other measures like the mean, which gives an average, or the median, which provides a midpoint.
The beauty of the mode is its versatility. It’s relevant for both numeric data sets and qualitative data. This means that the mode can be used to gauge whether a value, such as 5, appears with the greatest frequency in a data set, as well as whether a category like "Female" or "University Graduate" appears with the greatest frequency.
However, this is also the problem with mode. A data set’s relationship with mode can be quite complicated.
- It might not have a mode if no value repeats…
- be uni-modal if one value dominates in frequency…
- bi-modal if two values tie in their recurrence…
- or even multimodal if multiple values share the highest frequency.
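These cases are easy to see with a frequency table. The sketch below uses small made-up vectors (hypothetical values, not from the swiss data) and base R’s table() function to count how often each value appears.

```r
# Hypothetical toy vectors illustrating some of the cases above
no_repeats <- c(1, 2, 3, 4)        # every value appears once: no mode
one_mode   <- c(1, 2, 2, 3)        # 2 appears most often: uni-modal
two_modes  <- c(1, 1, 2, 2, 3)     # 1 and 2 tie: bi-modal

# table() counts how often each distinct value appears
table(no_repeats)
table(one_mode)
table(two_modes)

# Number of values that share the highest frequency in each case
sum(table(one_mode) == max(table(one_mode)))
sum(table(two_modes) == max(table(two_modes)))
```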
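R has built-in mean and median functions, but no built-in function for the statistical mode (the existing mode function reports how an object is stored, not its most frequent value). Fortunately, the mode is simple enough that we can write a function ourselves, without relying on external packages.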
Consider a function that first creates a frequency table of the data set in question. It then identifies the maximum frequency from this table. Using this frequency, it’s possible to extract the modes, which are the values that appear with this maximum frequency. Here’s how it might look:
calculate_mode <- function(x) {
  # Tabulating frequencies of each value in the dataset
  freq_table <- table(x)

  # Determining the maximum frequency
  max_freq <- max(freq_table)

  # Pinpointing the values (modes) that correspond to the maximum frequency
  modes <- as.numeric(names(freq_table[freq_table == max_freq]))

  return(modes)
}

# Applying the function on the 'Education' column from the 'swiss' dataset
modes_education <- calculate_mode(swiss$Education)
modes_education
Therefore, as our function reports, the mode of the Education variable is 7. This means that 7% is the most common share of draftees educated beyond primary school among the cantons.
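Note that calculate_mode converts its result with as.numeric, so it is suited to numeric data only. For qualitative data, one possible variant keeps the labels as text instead. The function name calculate_mode_labels and the example categories below are hypothetical and serve only as an illustration.

```r
# Hypothetical variant: returns the modes as character labels,
# so it works for categories as well as for numbers
calculate_mode_labels <- function(x) {
  freq_table <- table(x)
  names(freq_table)[freq_table == max(freq_table)]
}

# Made-up categorical data for illustration
education_level <- c("High School", "University Graduate", "High School",
                     "Primary", "High School")
calculate_mode_labels(education_level)

# It also works on numeric data, but the result comes back as text
calculate_mode_labels(swiss$Education)
```

The trade-off is that numeric modes are returned as character strings and need as.numeric() if they are to be used in further calculations.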
Getting All of Central Tendency Together
To truly appreciate the nature of a data set, it’s beneficial to look at the mode in tandem with other measures like the mean and median. Together, these metrics provide a fuller, more nuanced picture of the data’s central tendency. By superimposing our histogram with lines symbolizing the mean (blue), median (red), and mode (green), we create a tapestry that visually harmonizes the data’s spread with its central measures.
ggplot(swiss, aes(x = Education)) +
  geom_histogram(binwidth = 2, fill = "lightgray", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = mean_education), color = "blue", linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median_education), color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = modes_education), color = "green", linetype = "dashed", linewidth = 1) +
  labs(title = "Mean, Median, and Mode of Education Across Swiss Cantons", x = "Education Rate")
See the relationship? We can also do this in a table using the summarize function:
swiss %>%
  summarize(
    mean = mean(Education),
    median = median(Education),
    mode = calculate_mode(Education)
  )
This is called a table of descriptive statistics and is an important tool for any economist.
Try it Yourself!
As a final check, why don’t you make a nice table of the same results for Fertility, as well?
swiss %>%
  summarize(
    ...
  )
What do you see?
Footnotes
1. Specifically, the arithmetic mean.