{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1.1.1 - Beginner - Central Tendency\n",
"\n",
"COMET Team
*Colby Chambers, Oliver (Junye) Xu, Anneke Dresselhuis,\n",
"Jonathan Graves, Sarthak Kwatra* \n",
"2023-06-04\n",
"\n",
"## Outline\n",
"\n",
"### Prerequisites\n",
"\n",
"- Introduction to Jupyter\n",
"- Introduction to R\n",
"- Introduction to Visualization\n",
"\n",
"### Outcomes\n",
"\n",
"After completing this notebook, you will be able to:\n",
"\n",
"- Define the following terms: mean, median, percentiles, and mode.\n",
"- Calculate mean, median, and mode in R.\n",
"- Create boxplots to visualize ranges of data.\n",
"- Interpret and work with these statistics under various applications.\n",
"\n",
"### References\n",
"\n",
"- [Finding the Statistical Mode in\n",
" R](https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode)\n",
"\n",
"# Introduction\n",
"\n",
"In this notebook, we will introduce the idea of **central tendency**. In\n",
"statistics, central tendency refers to the idea of how different\n",
"interpretations of the term “middle” can be used to describe a\n",
"probability distribution or dataset. In this notebook, we’ll think about\n",
"central tendency in terms of numerical values which describe a given\n",
"subset of dataset. This concept is important because we often deal with\n",
"incredibly large datasets that are too big to describe in their\n",
"entirety. In this light, it’s crucial to have summary statistics that we\n",
"can use to describe the general behavior of variables. Before we\n",
"continue, let’s import our familiar Canadian census dataset using the\n",
"tools we’ve learned already."
],
"id": "f312e88e-ccb9-4d18-9e1d-ce8042b73035"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# install.packages(\"tidyverse\") only run these three commands if you have not already installed these packages\n",
"# install.packages(\"haven\")\n",
"# install.packages(\"ggplot2\")\n",
"\n",
"library(tidyverse)\n",
"library(haven)\n",
"library(ggplot2)\n",
"\n",
"source(\"beginner_central_tendency_tests.r\") # self-testing materials\n",
"\n",
"census_data <- read_dta(\"../datasets_beginner/01_census2016.dta\")"
],
"id": "bfaf8009-43ec-425c-a3b3-fff4c5a0db29"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Key Concepts\n",
"\n",
"### Mean\n",
"\n",
"The first, and most commonly referenced measure of central tendency is\n",
"the **sample mean** (also referred to as the **arithmetic mean**). The\n",
"mean of a variable is the average value of that variable, which can be\n",
"found by summing together all values that a variable takes on in a set\n",
"of observations and dividing by the total number of observations used.\n",
"This is an intuitive measure of central tendency that many of us think\n",
"of when we are trying to describe data. The formula for the sample mean\n",
"is below.\n",
"\n",
"$$\n",
"\\overline{x} = \\frac{1}{n}\\sum_{i=0}^{n} x_i = \\frac{Sum~of~All~Data~Points}{Total~Number~of~Data~Points}\n",
"$$\n",
"\n",
"For large datasets, using the formula above to find the mean by hand is\n",
"impossibly inconvenient. Luckily, we can quickly calculate the mean of a\n",
"variable in R as below."
],
"id": "9c47e7ca-b30b-4989-bac6-078187fee382"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# find the mean of market income (mrkinc)\n",
"mean(census_data$mrkinc)"
],
"id": "29066590-0052-4656-af02-046c2e6c3c50"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Interesting! We see that the above code outputs `NA`. Why is this\n",
"happening? Essentially, any time we try to perform operations to find\n",
"statistics of central tendency for a variable that includes `NA` values,\n",
"R will produce `NA` as the output. This is the case even if the data set\n",
"only includes one observation recorded as `NA` for that variable. To\n",
"account for this, we can simply filter our data set to remove these\n",
"missing observations. We must do this when calculating *any* of the\n",
"statistics introduced in this section, not just the mean. We do this\n",
"below."
],
"id": "02e4fa84-b6c1-419f-bd84-6bab8be57be9"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# remove missing values (NA values) in order to find the mean of mrkinc\n",
"census_data <- filter(census_data, !is.na(census_data$mrkinc))\n",
"\n",
"mean(census_data$mrkinc)"
],
"id": "48559c08-58c5-4516-acf0-db5a0e1db009"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the code above, the filter function takes in a dataframe and\n",
"an argument. It then keeps all observations which return a value of\n",
"`TRUE` for that argument. Here, we had to specify that `is.na()` be\n",
"`FALSE` (i.e. `!=TRUE`) to keep observations for which there was no `NA`\n",
"recorded on the `mrkinc` variable. This now gives us an actual answer\n",
"for our mean: the average market income is about 59230.\n",
"\n",
"> **Question**: Notice that the mean only makes sense when we can add\n",
"> and divide the values of a variable. What kind of variable type is\n",
"> this appropriate for?\n",
"\n",
"### Median\n",
"\n",
"Another common measure of central tendency is the **median**. The median\n",
"is the value which exactly splits the observations for a variable in our\n",
"data set in half when ordered in increasing (or decreasing) order.\n",
"\n",
"E.g:\n",
"\n",
"
Observation mrkinc | \n",
"60000, 45000, 72000 | \n", "
Median value mrkinc | \n",
"60000 \n", "There is exactly one observation above (70000) and one observation\n", "below (45000) this value. | \n",
"