{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1.0.1 - Beginner - Introduction to Statistics using R\n",
"\n",
"COMET Team  \n",
"William Co, Anneke Dresselhuis, Jonathan Graves, Emrul\n",
"Hasan, Jonah Heyl, Mridul Manas, Shiming Wu  \n",
"2023-05-08\n",
"\n",
"## Outline\n",
"\n",
"### Prerequisites\n",
"\n",
"- Introduction to Jupyter\n",
"- Introduction to R\n",
"\n",
"### References\n",
"\n",
"- Esteban Ortiz-Ospina and Max Roser (2018) - “Economic\n",
" inequality by gender”. Published online at OurWorldInData.org.\n",
" Retrieved from:\n",
" \\[Online\n",
" Resource\\]\n",
"\n",
"### Outcomes\n",
"\n",
"In this notebook, you will learn how to:\n",
"\n",
"- Import data from the Survey of Financial Security (Statistics\n",
" Canada, 2019)\n",
"- Wrangle, reshape and visualize `SFS_data` as part of an Exploratory\n",
" Data Analysis (EDA)\n",
"- Run statistical tests, such as the $t$-test, to compare mean income\n",
" of male-led vs. female-led households\n",
"- Generate summary statistics tables and other representations of\n",
" the data using `group_by()`\n",
"- Optional: Run a formal two-sample $t$-test to check for heterogeneity\n",
" in how gender affects income and compare the returns to education\n",
"\n",
"## Part 1: Import Data into R\n",
"\n",
"The data we use comes from the 2019 Survey of Financial Security\n",
"released by Statistics Canada [1].\n",
"\n",
"[1] Statistics Canada, Survey of Financial Security, 2019, 2021.\n",
"Reproduced and distributed on an “as is” basis with the permission of\n",
"Statistics Canada. Adapted from Statistics Canada, Survey of Financial\n",
"Security, 2019, 2021. This does not constitute an endorsement by\n",
"Statistics Canada of this product."
],
"id": "63fafe85-6b16-44f7-bf44-ca54796f6638"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# run this cell to load necessary packages for this tutorial\n",
"# install.packages('vtable')\n",
"# install.packages('viridis')\n",
"library(tidyverse)\n",
"library(haven)\n",
"library(dplyr)\n",
"library(vtable)\n",
"library(viridis)\n",
"\n",
"\n",
"source(\"beginner_intro_to_statistics2_tests.r\")\n",
"source(\"beginner_intro_to_statistics2_functions.r\")\n",
"# warning messages are okay"
],
"id": "73ce4034-cbe9-47a7-a54f-19b9a1a9bcb9"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `tidyverse` is a collection of R packages developed by Hadley\n",
"Wickham and his colleagues as a cohesive set of tools for data\n",
"manipulation, visualization, and analysis. In a **tidy** data set, each\n",
"variable forms a column and each observation forms a row. `tidyverse`\n",
"packages such as `tidyr` and `dplyr` are recommended for cleaning\n",
"and transforming your data into tidy formats.\n",
"\n",
"Let’s import the `.dta` file from Statistics Canada using the `read_dta`\n",
"function."
],
"id": "027f2aab-3980-4b82-80a9-931697198707"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# if this is your first time using Jupyter Lab, the shortcut to run a cell is `Shift + Enter`\n",
"SFS_data <- read_dta(\"../datasets_beginner/SFS_2019_Eng.dta\")"
],
"id": "0a81819a-0182-46e6-871c-f0db37163798"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are some of the common file extensions and import functions in R:\n",
"\n",
"- `.dta` and `read_dta()` for Stata files\n",
"- `.csv` and `read_csv()` for data stored as comma-separated values\n",
"- `.Rda` and `load()` for R data files created with `save()`"
],
"id": "2ac1acc9-af4b-4369-9cb8-e2bb6a9065f2"
},
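{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of two of these import functions, the cell below\n",
"writes a small made-up data frame to temporary `.csv` and `.Rda` files\n",
"and reads it back. The data frame `toy` and the temporary file paths are\n",
"invented for illustration and are not part of the SFS data."
],
"id": "added-sketch-import-roundtrip-md"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch: `toy` and the temp files are hypothetical, not part of the SFS data\n",
"library(readr)\n",
"\n",
"toy <- data.frame(id = 1:3, income = c(52000, 61000, 47500))\n",
"\n",
"# round trip through a comma-separated file\n",
"csv_path <- tempfile(fileext = '.csv')\n",
"write_csv(toy, csv_path)\n",
"toy_csv <- read_csv(csv_path)\n",
"\n",
"# round trip through an R data file: save() writes it, load() restores the object by name\n",
"rda_path <- tempfile(fileext = '.Rda')\n",
"save(toy, file = rda_path)\n",
"rm(toy)\n",
"load(rda_path) # `toy` exists again\n",
"\n",
"toy_csv"
],
"id": "added-sketch-import-roundtrip-code"
},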
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"head(SFS_data, 5)"
],
"id": "c59a7b3b-2606-4f74-b1a6-473daa88d399"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note**: `head(df, n)` displays the first `n` rows of the data frame. Other\n",
"> popular methods include `glimpse()` and `print()`."
],
"id": "4dd7508d-aa89-40f7-839f-bdcf9f6489ad"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# you can read the documentation for a given function by adding a question-mark before its name\n",
"?head"
],
"id": "ce4c19ea-6a53-4c54-ad0d-c977c6494b0b"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Exploratory Data Analysis in R\n",
"\n",
"There is a routine of steps you should generally follow as part of your\n",
"*EDA* or *Exploratory Data Analysis.* Normally, you would analyze and\n",
"visualize the variation, correlation, and distribution of your variables\n",
"of interest. We do this to gain an intuitive understanding of the data\n",
"before we undertake any formal hypothesis tests or model-fitting.\n",
"\n",
"Let’s think of our key variables of interest. We’re interested in\n",
"estimating the effect of gender on differences in earnings.\n",
"\n",
"- **Independent variable**: gender of the highest income earner\n",
"\n",
"- **Variable of interest**: income after tax for each individual\n",
"\n",
"- **Variable of interest**: income before tax for each individual\n",
"\n",
"- **Control**: wealth for the household\n",
"\n",
"- **Control**: level of education\n",
"\n",
"### Cleaning and Reshaping `SFS_data`\n",
"\n",
"For now, it’d be convenient to work with a new data frame containing\n",
"only the key variables (columns) listed above. Moreover, the columns\n",
"need to be renamed so they are easier for the reader to remember."
],
"id": "14fa3283-84fc-4b59-9cc4-2f6d7ef44792"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# rename columns\n",
"SFS_data <- SFS_data %>%\n",
" rename(income_before_tax = pefmtinc) %>% \n",
" rename(income_after_tax = pefatinc) %>%\n",
" rename(wealth = pwnetwpg) %>%\n",
" rename(gender = pgdrmie) %>%\n",
" rename(education = peducmie) \n",
"\n",
"# drop rows where income before tax (originally `pefmtinc`) is missing, i.e. NA\n",
"SFS_data <- filter(SFS_data, !is.na(income_before_tax))\n",
"\n",
"keep <- c(\"pefamid\", \"gender\", \"education\", \"wealth\", \"income_before_tax\", \"income_after_tax\")\n",
"\n",
"# new df with chosen columns\n",
"df_gender_on_wealth <- SFS_data[keep]\n",
"\n",
"# preview\n",
"head(df_gender_on_wealth, 5)"
],
"id": "b7bbe3ee-c923-4e81-ad3a-52ea057b7d8d"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note**: This is another **tidy representation** of the original data\n",
"> but with fewer variables. The original data set is still stored as\n",
"> `SFS_data`.\n",
"\n",
"### Ensuring correct data-types\n",
"\n",
"Notice that education is stored as `chr` but we want to keep it as a\n",
"`factor`. The variable `education` comes *encoded* as one of the\n",
"values {1, 2, 3, 4, 9}, each of which represents a level of education\n",
"obtained."
],
"id": "9fb7abf7-6354-4e39-aaed-9e8f72951cc8"
},
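{
"cell_type": "markdown",
"metadata": {},
"source": [
"Factors can also carry readable labels. A minimal sketch on a made-up\n",
"vector of codes is shown below. The labels for codes 1 to 4 follow the\n",
"comments used later in this notebook, while the label for code 9 is an\n",
"assumption made for illustration and should be checked against the SFS\n",
"codebook."
],
"id": "added-sketch-factor-labels-md"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch on made-up codes, not the SFS data\n",
"codes <- c(1, 4, 2, 9, 3, 4)\n",
"\n",
"education_labelled <- factor(codes,\n",
"  levels = c(1, 2, 3, 4, 9),\n",
"  labels = c('Less than high school', 'High school',\n",
"             'Non-university post-secondary', 'University',\n",
"             'Not stated')) # the label for 9 is an assumption\n",
"\n",
"# count how many observations fall in each labelled level\n",
"table(education_labelled)"
],
"id": "added-sketch-factor-labels-code"
},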
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_gender_on_wealth <- df_gender_on_wealth %>%\n",
" mutate(education = as.factor(education), \n",
" gender = as.factor(gender),\n",
" income_before_tax = as.numeric(income_before_tax),\n",
" income_after_tax = as.numeric(income_after_tax))\n",
"\n",
"head(df_gender_on_wealth, 2)"
],
"id": "ae4b3f8a-06a4-41a9-855c-5146a5392cd2"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All good! Let’s use descriptive statistics to understand how each of the\n",
"numbers in the set {1, 2, 3, 4, 9} represents an individual’s educational\n",
"background.\n",
"\n",
"### Computing Descriptive Statistics using `vtable` in R\n",
"\n",
"Let’s calculate the summary statistics of our dataset.\n",
"\n",
"> **Note**: the `sumtable` method from the `vtable` package can be used\n",
"> to display the table in different formats including LaTeX, HTML, and\n",
"> data.frame."
],
"id": "33eeffc9-c77b-4f4e-b961-4a85aff97450"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# out = \"kable\" tells it to return a knitr::kable()\n",
"# replace \"kable\" with \"latex\" and see what happens!\n",
"sumtbl <- sumtable(df_gender_on_wealth, out = \"kable\")\n",
"sumtbl"
],
"id": "9e56e4da-0d7d-4c0a-8820-be7a8be11d5d"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is like having a bird’s-eye view of our data. As researchers, we\n",
"should take note of outliers and other irregularities and ask how those\n",
"issues might affect the *validity* of our models and tests.\n",
"\n",
"> **Note**: see Appendix for a common method to remove outliers using\n",
"> Z-score thresholds.\n",
"\n",
"### Grouping observations\n",
"\n",
"Wouldn’t it be neat to see what mean or median incomes for male-led and\n",
"female-led households look like based on the level of education obtained\n",
"by the main income-earner?"
],
"id": "768bb28b-1e74-4f6d-8a2b-39a13640a759"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"by_gender_education <- df_gender_on_wealth %>%\n",
" group_by(gender, education) %>%\n",
" summarise(mean_income = mean(income_before_tax, na.rm = TRUE),\n",
" median_income = median(income_before_tax, na.rm = TRUE),\n",
" mean_wealth = mean(wealth, na.rm = TRUE),\n",
" median_wealth = median(wealth, na.rm = TRUE))\n",
"\n",
"by_gender_education"
],
"id": "55e2c3e8-20cf-4fcb-9361-a17610ca845e"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note**: this is again a **tidy representation** of `SFS_data`.\n",
"> *Grouping* observations by gender and education makes it a bit easier\n",
"> to make comparisons across groups.\n",
"\n",
"We can take this chain of thought further and generate a heatmap using\n",
"the `ggplot2` package."
],
"id": "9d6969f5-0489-49f2-9f84-3d63692c6f5d"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(ggplot2)\n",
"library(viridis)\n",
"\n",
"# Create the heatmap with an accessible color palette\n",
"heatmap_plot <- ggplot(by_gender_education, aes(x = education, y = gender, fill = mean_income)) +\n",
" geom_tile() +\n",
" scale_fill_viridis_c(option = \"plasma\", na.value = \"grey\", name = \"Mean Income\") +\n",
" labs(x = \"Education\", y = \"Gender\")\n",
"\n",
"# Display the heatmap\n",
"heatmap_plot"
],
"id": "e0c12d77-959e-48f3-a184-30a83251e6ef"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note**: we use `scale_fill_viridis_c()`, a `ggplot2` scale built on\n",
"> the viridis color maps, so that the palette stays readable for viewers\n",
"> with color-vision deficiencies.\n",
"\n",
"Now, what does this tell you about how male-led households (gender = 1)\n",
"compare with female-led households in terms of the mean household\n",
"income? Does this tell us whether education widens the income gap between\n",
"male-led and female-led households with the same level of education?\n",
"\n",
"We can infer from the visualization that the female-led households with\n",
"the same level of education have different mean incomes as compared to\n",
"male-led households. This smells of *heterogeneity* and we can explore\n",
"regression and other empirical methods to formally test this claim.\n",
"\n",
"However, we shouldn’t *yet* draw any conclusive statements about the\n",
"relationships between gender (of the main income earner), income,\n",
"education and other variables such as wealth.\n",
"\n",
"As researchers, we should ask if the differences in the mean or median\n",
"incomes for the two groups are significant at all. We can then go a bit\n",
"further and test if education indeed widens the gap or not.\n",
"\n",
"> **Think Deeper**: how would you specify the null and alternative\n",
"> hypotheses?\n",
"\n",
"### Test your knowledge\n",
"\n",
"Match the function with the appropriate description. Enter your answer\n",
"as a long string with the letter choices in order.\n",
"\n",
"1. Order rows using column values\n",
"2. Keep distinct/unique rows\n",
"3. Keep rows that match a condition\n",
"4. Get a glimpse of your data\n",
"5. Create, modify, and delete columns\n",
"6. Keep or drop columns using their names and types\n",
"7. Count the observations in each group\n",
"8. Group by one or more variables\n",
"9. A general vectorised if-else\n",
"\n",
"\n",
"\n",
"- A: `mutate()`\n",
"- B: `glimpse()`\n",
"- C: `filter()`\n",
"- D: `case_when()`\n",
"- E: `select()`\n",
"- F: `group_by()`\n",
"- G: `distinct()`\n",
"- H: `arrange()`\n",
"- I: `count()`\n",
"\n",
"> **Note**: it’s fine if you don’t know all those functions yet! Match\n",
"> the functions you know and run code to figure out the rest."
],
"id": "aeb7d90a-0076-4d51-a3c2-6ff99b7a180f"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Enter your answer as a long string ex: if you think the matches are 1-B, 2-C, 3-A, enter answer as \"BCA\"\n",
"answer_1 <- \"\"\n",
"\n",
"test_1()"
],
"id": "0f405565-4904-43d3-bc51-cc0b7d942b1a"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3: Running $t$-tests in R\n",
"\n",
"Let’s run a t-test for a comparison of means."
],
"id": "292f865c-20ff-46f6-8927-cd223f54b72f"
},
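{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before applying the test to the SFS data, a minimal sketch on simulated\n",
"data shows what `t.test()` computes for two samples. By default it runs\n",
"Welch’s version of the test, which does not assume equal variances. The\n",
"vectors `group_a` and `group_b` are simulated for illustration only."
],
"id": "added-sketch-welch-by-hand-md"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch on simulated data, not the SFS data\n",
"set.seed(123)\n",
"group_a <- rnorm(200, mean = 60, sd = 15)\n",
"group_b <- rnorm(180, mean = 55, sd = 20)\n",
"\n",
"# Welch t statistic: difference in means divided by its standard error\n",
"se_diff <- sqrt(var(group_a) / length(group_a) + var(group_b) / length(group_b))\n",
"t_manual <- (mean(group_a) - mean(group_b)) / se_diff\n",
"\n",
"# t.test() uses the Welch version by default (var.equal = FALSE)\n",
"t_builtin <- t.test(group_a, group_b)$statistic\n",
"\n",
"# the two values should agree\n",
"c(manual = t_manual, builtin = unname(t_builtin))"
],
"id": "added-sketch-welch-by-hand-code"
},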
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# performs a t-test for means comparison\n",
"t_test_result <- t.test(income_before_tax ~ gender, data = df_gender_on_wealth)\n",
"print(t_test_result)"
],
"id": "3e27651e-0860-4b5a-a0fc-d9c9c1ce5c54"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 95% confidence interval does not include 0, so we can conclude that\n",
"male-led households earn more income before tax on average than\n",
"female-led households, and that the gap is statistically significant.\n",
"\n",
"Let’s now run a test to compare the medians of both groups."
],
"id": "d5d6ed76-2701-4614-b102-b309b0d9d583"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# perform a Mann-Whitney U test for median comparison\n",
"mannwhitneyu_test_result <- wilcox.test(income_before_tax ~ gender, data = df_gender_on_wealth)\n",
"print(mannwhitneyu_test_result)"
],
"id": "cc0a4932-74c0-4f90-bcd2-21d85d5f4523"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This p-value is again highly significant: based on our data, we reject\n",
"the hypothesis that the two groups share the same income distribution,\n",
"which is consistent with their median incomes not being equal.\n",
"\n",
"Our variable of interest is income, and so far, we have provided\n",
"statistical evidence for the case that the gender of the main\n",
"income-earner is correlated with the household’s income.\n",
"\n",
"We are however more interested in the causal mechanisms through which\n",
"education and wealth *determine* how gender affects household income.\n",
"\n",
"> **Think Deeper**: According to Ortiz-Ospina and Roser (2018), women\n",
"> are overrepresented in low-paying jobs and are underrepresented in\n",
"> high-paying ones. What role does the attainment of education play in\n",
"> sorting genders into high vs. low-paying jobs? Can we test this\n",
"> formally with the data?\n",
"\n",
"### Studying how wealth and education might impact the income-gap\n",
"\n",
"There are multiple reasons to study the links between **wealth** and the\n",
"**income** gap. For instance, we might want to answer whether having\n",
"more wealth affects an individual’s income.\n",
"\n",
"We can use some of the methods we have learned in R to analyze and\n",
"visualize relationships between income, gender, education and wealth.\n",
"\n",
"Let’s see if having a university degree widens the gender income gap."
],
"id": "922e497d-6173-4c6f-a89f-1a4c4b50d700"
},
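{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step relies on `case_when()`, which works like a vectorised\n",
"if-else. Here is a minimal sketch on a made-up vector of codes (`codes`\n",
"is invented for illustration and is not the SFS variable)."
],
"id": "added-sketch-case-when-md"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(dplyr)\n",
"\n",
"# made-up education codes for illustration\n",
"codes <- c('1', '4', '2', '4', '9')\n",
"\n",
"# conditions are checked in order; TRUE acts as the catch-all 'else' branch\n",
"university_flag <- case_when(\n",
"  codes == '4' ~ 'Yes',\n",
"  TRUE ~ 'No')\n",
"\n",
"university_flag"
],
"id": "added-sketch-case-when-code"
},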
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"SFS_data <- SFS_data %>% \n",
" mutate(university = case_when( # create a new variable with mutate\n",
" education == \"4\" ~ \"Yes\", # use case_when and the ~ operator to apply `if else` conditions \n",
" TRUE ~ \"No\")) %>% \n",
" mutate(university = as_factor(university)) #remember, it's a factor!\n",
"\n",
"head(SFS_data$university, 10)"
],
"id": "b23f442c-8e5b-4f0c-9c84-418f75618f0d"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s visualize how the mean wealth compares for male-led vs. female-led\n",
"households, conditional on whether the main-income earner went to\n",
"university."
],
"id": "74e4edff-311e-4350-9c66-4009e68a88e9"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results <- SFS_data %>%\n",
" group_by(university,gender) %>%\n",
" summarize(m_wealth = mean(wealth), sd_wealth = sd(wealth))\n",
"\n",
"results \n",
"\n",
"f <- ggplot(data = SFS_data, aes(x = gender, y = wealth)) + xlab(\"Gender\") + ylab(\"Wealth\") # label and define our x and y axis\n",
"f <- f + geom_bar(stat = \"summary\", fun = \"mean\", fill = \"lightblue\") # produce a summary statistic, the mean\n",
"f <- f + facet_grid(. ~ university) # one panel per university status\n",
"\n",
"f"
],
"id": "c10d0fc1-28eb-4f43-847c-b2aec7b87c03"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like the wealth gap between the two types of households widens\n",
"for groups that have obtained a university degree.\n",
"\n",
"Similarly, let’s look at the wealth gap in percentage\n",
"terms. The cell below recomputes the group means summarized in `results`\n",
"(the $4 \\times 4$ table from the previous cell) directly from `SFS_data`\n",
"and expresses the gap as a percentage. We need to load the package\n",
"`scales` to use the function `percent`."
],
"id": "e8f6fa79-95fc-4cce-aa12-eb7b8a97d015"
},
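{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked example of the calculation, the gap is the female-led\n",
"mean divided by the male-led mean, minus one. The two means below are\n",
"made-up numbers chosen for easy arithmetic, not SFS results."
],
"id": "added-sketch-relative-gap-md"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# made-up group means for illustration only (not SFS results)\n",
"mean_wealth_male <- 500000\n",
"mean_wealth_female <- 400000\n",
"\n",
"# relative gap: female-led mean as a share of the male-led mean, minus one\n",
"gap <- mean_wealth_female / mean_wealth_male - 1\n",
"\n",
"gap # a negative value means the female-led mean is lower\n",
"scales::percent(gap) # the same number formatted as a percentage string"
],
"id": "added-sketch-relative-gap-code"
},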
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(scales)\n",
"\n",
"percentage_table <- SFS_data %>%\n",
" group_by(university) %>%\n",
" group_modify(~ data.frame(wealth_gap = mean(filter(., gender == 2)$wealth)/mean(filter(., gender == 1)$wealth) - 1)) %>%\n",
" mutate(wealth_gap = scales::percent(wealth_gap))\n",
"\n",
"percentage_table"
],
"id": "ae336faf-1654-4174-aa71-8074e318d3c7"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the signs are both negative. Hence, on average, female-led\n",
"households have less wealth regardless of whether the main income earner\n",
"has a university degree or not.\n",
"\n",
"More importantly, based on our data, female-led households with\n",
"university degrees on average have 28% less wealth than male-led\n",
"households with university degrees. Comparing the two groups given they\n",
"don’t have university degrees, the gap is noticeably smaller: 18%.\n",
"\n",
"So, in this sample, the gap is about 10 percentage points wider among\n",
"households with a university degree.\n",
"\n",
"Let’s test this further by creating sub-samples of “university degree”\n",
"and “no university degree” respectively and then running a formal\n",
"two-sample t-test."
],
"id": "4952021f-0f65-449f-b03e-9d9a95aff540"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"university_data <- filter(SFS_data, university == \"Yes\") # university only data \n",
"nuniversity_data <- filter(SFS_data, university == \"No\") # non university data\n",
"\n",
"t2 = t.test(\n",
" x = filter(university_data, gender == 1)$wealth,\n",
" y = filter(university_data, gender == 2)$wealth,\n",
" alternative = \"two.sided\",\n",
" mu = 0,\n",
" conf.level = 0.95)\n",
"\n",
"t2 # test for the wealth gap in university data\n",
"\n",
"round(t2$estimate[1] - t2$estimate[2],2) # rounds our estimate\n",
"\n",
"\n",
"t3 = t.test(\n",
" x = filter(nuniversity_data, gender == 1)$wealth,\n",
" y = filter(nuniversity_data, gender == 2)$wealth,\n",
" alternative = \"two.sided\",\n",
" mu = 0,\n",
" conf.level = 0.95)\n",
"\n",
"t3 # test for the wealth gap in non-university data\n",
"\n",
"round(t3$estimate[1] - t3$estimate[2],2) # rounds our estimate"
],
"id": "3e203907-9e60-40e3-87e0-cfd25fb1e09b"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In both tests, the p-values are very small, indicating strong\n",
"statistical evidence to reject the null hypothesis. The confidence\n",
"intervals also provide a range of plausible values for the difference in\n",
"means, further supporting the alternative hypothesis.\n",
"\n",
"Based on these results, there appears to be a significant difference in\n",
"wealth between the two gender groups regardless of university status,\n",
"with male-led households consistently having higher mean wealth than\n",
"female-led households.\n",
"\n",
"### Optional: Returns to HS diploma\n",
"\n",
"Next, we examine whether returns to education differ between genders. For\n",
"our purposes, we will define returns to education as *the difference in\n",
"average income before tax between two subsequent education levels*.\n",
"\n",
"The following t-test finds the returns to education of a high school\n",
"diploma for males (`retHS`) and for females (`retHSF`)."
],
"id": "677c694f-22a1-4100-b569-2adde637165a"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Returns to education: High school diploma\n",
"\n",
"less_than_high_school_data <- filter(SFS_data, education == 1) # Less than high school\n",
"high_school_data <- filter(SFS_data, education == 2) # High school\n",
"post_secondary_data <- filter(SFS_data, education == 3) # Non-university post-secondary\n",
"university_data <- filter(SFS_data, education == 4) # University\n",
"\n",
"\n",
"retHS = t.test(\n",
" x = filter(high_school_data, gender == 1)$income_before_tax,\n",
" y = filter(less_than_high_school_data, gender == 1)$income_before_tax,\n",
" alternative = \"two.sided\",\n",
" mu = 0,\n",
" conf.level = 0.95)\n",
"\n",
"retHSF = t.test(\n",
" x = filter(high_school_data, gender == 2)$income_before_tax,\n",
" y = filter(less_than_high_school_data, gender == 2)$income_before_tax,\n",
" alternative = \"two.sided\",\n",
" mu = 0,\n",
" conf.level = 0.95)\n",
"\n",
"retHS\n",
"retHSF\n",
"retHS_ans=round(retHS$estimate[1] - retHS$estimate[2],2)\n",
"retHSF_ans=round(retHSF$estimate[1] - retHSF$estimate[2],2)"
],
"id": "9812cf3e-b34e-487c-ac70-6a9274ffd5be"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have found statistically significant evidence for the case that\n",
"returns to graduating with a high school diploma are indeed positive for\n",
"individuals living in both male-led and female-led households.\n",
"\n",
"### Test your knowledge\n",
"\n",
"As an exercise, create a copy of the cell above and try to calculate the\n",
"returns to a university degree for males and females."
],
"id": "a68d7b64-df1b-4b20-85e7-e61f1908dbf5"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
],
"id": "90486ffd-3fd8-4c42-885b-140ea172cef4"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let’s work with a simulated dataset of mutual fund performance.\n",
"Interpret the data below as the yearly returns for a sample of 300\n",
"mutual funds from 2010 to 2015."
],
"id": "df34600b-fe5d-4e5d-b0dd-ff7d4c7a83cf"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fund_performance"
],
"id": "f46a17bc-f672-4500-9471-23e116b0d134"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a subset of the data with the returns for 2015. Rename the column\n",
"to `investment_returns`. Store your answer in `fp_15`."
],
"id": "395d0a39-c66a-4ffd-95e0-7223eaa48573"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# feel free to use this cell if you need it"
],
"id": "638e9548-57f7-42a0-bb07-0acf4fd506b9"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fp_15 <- fund_performance %>%\n",
" ...(fund, ...)\n",
"\n",
"answer_2 <- fp_15\n",
"\n",
"test_2()"
],
"id": "8ab1fa84-fa4f-4e6c-900e-eda4c49b6d26"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the mean and median return of funds in 2015. Store your\n",
"answers in `mean_ret` and `median_ret`, respectively."
],
"id": "e1d9ae15-45e6-4f74-a0e7-b42b6bbcee3d"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# feel free to use this cell if you need it"
],
"id": "6dcadfd9-46f5-4275-ba8a-15e472df810f"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_ret <- ...(...)\n",
"\n",
"answer_3 <- mean_ret\n",
"\n",
"test_3()"
],
"id": "319c7a18-ca7d-4381-b401-2648533cba2f"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"median_ret <- ...(...)\n",
"\n",
"answer_4 <- median_ret\n",
"\n",
"test_4()"
],
"id": "831303e7-71f8-4edc-83ef-194903746ef2"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s suppose the market return (average return of investments\n",
"available) was 5%. Run a 95% confidence level t-test on the returns of\n",
"`fp_15` to find whether the funds outperformed the market or not.\n",
"Complete the code below."
],
"id": "d66445f3-7db8-468b-a788-1dedbe7e7991"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t_stat = ...( \n",
" ...,\n",
" mu = ...,\n",
" alternative = \"two.sided\",\n",
" conf.level = ...)\n",
"\n",
"answer_5 <- t_stat$conf.int\n",
"\n",
"test_5()"
],
"id": "55e3bad7-1b83-4ae6-aaa4-8bc6a9ed9ed3"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do we have statistical evidence to believe that the funds outperformed\n",
"the market?\n",
"\n",
"- A: Yes\n",
"- B: No"
],
"id": "3a67cb43-1619-46ab-b4ca-f0ff5a50ac66"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# enter your answer as either \"A\" or \"B\"\n",
"answer_6 <- \"\"\n",
"\n",
"test_6()"
],
"id": "066a630b-e902-4a68-a183-61387d235a29"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But wait! Do you notice anything interesting about this dataset?\n",
"Investigate the dataset with special attention to the NAs. Do you notice\n",
"a pattern?"
],
"id": "daf724e9-0b92-4c85-adec-257ecd1a283e"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fund_performance"
],
"id": "b3afab92-5013-4ce2-92bf-00916fab5cd9"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are no funds with negative performance in the dataset! It’s likely\n",
"that the NAs have replaced the observations with negative returns. How\n",
"might that affect our analysis of fund performance? Think about the\n",
"biases that could have been introduced in our mean and statistical test\n",
"calculations.\n",
"\n",
"### Appendix\n",
"\n",
"Removing outliers is a common practice in data analysis. The code below\n",
"removes outliers based on a custom Z-score threshold.\n",
"\n",
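"> **Note**: here we use a threshold of 1.645, the 95th percentile of the\n",
"> standard normal distribution. Because the filter is applied to the\n",
"> absolute Z-score, it trims both tails. You should first visualize your\n",
"> data with box plots and then choose a suitable threshold for the\n",
"> variables of interest."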
],
"id": "7ccda055-9948-4a27-a3a3-93dea49ff2be"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# function to remove outliers based on z-score\n",
"remove_outliers_zscore <- function(data, variable, threshold) {\n",
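" # scale() returns a one-column matrix, so convert it to a plain vector\n",
" z_scores <- as.numeric(scale(data[[variable]]))\n",
" # keep rows within the threshold; rows with a missing z-score are dropped\n",
" keep_rows <- !is.na(z_scores) & abs(z_scores) <= threshold\n",
" data_without_outliers <- data[keep_rows, ]\n",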
" return(data_without_outliers)\n",
"}\n",
"\n",
"# set the threshold for z-score outlier removal\n",
"zscore_threshold <- 1.645 # Adjust as needed\n",
"\n",
"# remove outliers based on z-score for the desired variable\n",
"df_filtered <- remove_outliers_zscore(df_gender_on_wealth, \"wealth\", zscore_threshold)\n",
"\n",
"df_filtered"
],
"id": "0d8c825b-8498-4389-a026-4e69703330eb"
}
],
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"name": "ir",
"display_name": "R",
"language": "r"
}
}
}