{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 03.2 - Introduction to Data in R - Part 2\n", "\n", "COMET Team
*Colby Grimes, Colby Chamber, Jonathan Graves, Manas\n", "Mridul, Valeria Zolla* \n", "2023-07-09\n", "\n", "## Outline\n", "\n", "### Suggested Prerequisites\n", "\n", "- Introduction to Jupyter\n", "- Introduction to R\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Data in R - Part 1\n", "\n", "### Learning Outcomes\n", "\n", "In this notebook, you will learn about:\n", "\n", "- Filtering, Segmenting, Combining and Cleaning Data Sets for further\n", "  analysis.\n", "- Reshaping and Presenting Data in different kinds of Tidy-formats as\n", "  best suited for your Research Questions (RQs)\n", "\n", "### References\n", "\n", "- [Introduction to Probability and Statistics Using\n", " R](https://mran.microsoft.com/snapshot/2018-09-28/web/packages/IPSUR/vignettes/IPSUR.pdf)\n", "- [DSCI 100 Textbook](https://datasciencebook.ca/index.html)" ], "id": "9a666df5-e4b3-4b3d-8f97-ccb3d07dbb36" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell\n", "source(\"getting_started_intro_to_data_tests.r\")\n", "# loading in our packages\n", "library(tidyverse)\n", "library(haven)" ], "id": "9f27111d-ebbb-4ce6-892f-ab258f019875" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start with loading the WDI data set\n", "\n", "The World Development Indicators (WDI) is the primary World Bank\n", "collection of development indicators, compiled from officially\n", "recognized international sources.\n", "\n", "We have used World Bank’s DataBank applet to select and import some\n", "macro and development-related time series data for the countries Canada,\n", "India, Mexico, South Africa, and Zimbabwe for the years 2014-2022." ], "id": "8dbbf3ff-2ce2-462f-878f-5845b9b3c6a9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Importing required packages\n", "library(readr)\n", "library(dplyr)\n", "library(tidyr)\n", "library(stringr)\n", "\n", "# Loading the CSV file from the local file path\n", "file_path <- \"WB_Data/WDI_raw.csv\"\n", "wdi <- read_csv(file_path) %>%\n", " rename(Series_Code = \"Series Code\", Series_Name = \"Series Name\",\n", " Country = \"Country Name\", Country_Code = \"Country Code\")\n", "\n", "head(wdi, 10)" ], "id": "a1ccba49-82a0-4a91-8035-6c4db8d42e0d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dim(wdi)" ], "id": "ad5449e5-5c4e-4c0c-b110-212fd657fb8b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data frame is in a long format. Each unique\n", "value in the **`Series_Name`** serves as an index (in addition to\n", "Country Name, Country Code), hence creating distinct rows.\n", "\n", "A simpler version of the data frame in a **wide-format** could look like\n", "this:\n", "\n", "1. Canada, 2017, Var A, Var B, Var C\n", "2. Canada, 2018, Var A, Var B, Var C\n", "\n", "Meanwhile, one way of representing the same data in a **long-format**\n", "looks like this:\n", "\n", "1. Canada, Var A, 2017, 2018\n", "2. Canada, Var B, 2017, 2018\n", "3. Canada, Var C, 2017, 2018\n", "\n", "Do you now see the difference in the two ways of representing the same\n", "data?\n", "\n", "> While we don’t cover it in this notebook, switching between long and\n", "> wide formats can be quite useful.\n", ">\n", "> For example, when you want to either aggregate values (e.g., over\n", "> years) or compute a new variable using two or more variables, having\n", "> your data frame in a shape that best suits your *data-wrangling*\n", "> skills becomes essential.\n", "\n", "While `Series Name` contains descriptions for each of the series in the\n", "data frame, `Series Code` offers a handy way to group related series (or\n", "call them variables).\n", "\n", "Our `Series Code`s follow a taxonomy system. For example, any code\n", "starting with `AG` belongs to a *family of series* related to the state\n", "of agriculture in the world countries." ], "id": "c6b73f73-a843-48bd-bbfe-10694e510139" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's see the unique Series Families and the Series Names contained within each\n", "Series_Families <- wdi %>%\n", " mutate(Series_Family = substr(`Series_Code`, 1, 2)) %>% # substring: first two chars of the code\n", " group_by(Series_Family) %>% # group by unique vals\n", " summarize(Series_Names_Under_Family = paste(unique(`Series_Name`), collapse = \", \")) # for each family, paste ALL the Series names into one \"cell\"\n", "\n", "head(Series_Families, 13)" ], "id": "1e0cd898-e56a-44ec-b010-f03b4ff141f5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we had to prepare a *tidy* data frame comparing the state of *Access\n", "to Institutions* within the countries, we would look towards Series with\n", "codes beginning with `SE` (Schooling), `FX` (Financial Institutions) and\n", "`EG` (Electricity).\n", "\n", "Let’s now create a new data frame called `access_wdi`." ], "id": "8ffe01c1-5346-4db4-9850-0f5a2f3919bf" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prefixes <- c(\"SE\", \"EG\", \"FX\")\n", "\n", "access_wdi <- wdi %>% filter(str_detect(`Series_Code`, str_c(\"^\", prefixes, collapse = \"|\")))\n", "\n", "# 'access_wdi' will now contain only the rows where the 'Series_Code' starts with any of the specified prefixes\n", "\n", "access_wdi <- access_wdi %>% rename_with(~str_extract(., \"\\\\d{4}\"), starts_with(\"20\")) # Rename Year cols to follow the 2XYZ format.\n", "\n", "head(access_wdi, 5)" ], "id": "f2d8d21f-d2a8-4a42-9d7a-2bedee2251c3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling Missing Values\n", "\n", "We are interested in seeing if there are any specific years for which the\n", "data was not collected or is missing for *most* of the variables/series.\n", "\n", "Let’s look at each of the year columns and count the number of rows with\n", "`NA`." ], "id": "3d0ce674-536c-4a2c-908d-cfcb4347676e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "year_columns <- c(\"2018\", \"2019\", \"2020\", \"2021\", \"2022\")\n", "\n", "missing_counts <- sapply(year_columns, function(column) sum(is.na(access_wdi[[column]])))\n", "\n", "print(missing_counts / nrow(access_wdi))" ], "id": "231279e7-50d5-49cd-97ca-40ab6129ef50" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like the data is missing for 98% of the Series for the year 2022,\n", "so it makes more sense to drop the 2022 column altogether."
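,
 "\n",
 "\n",
 "As an aside, this kind of decision can also be made programmatically. Here is a\n",
 "minimal sketch of dropping any year column whose share of missing values exceeds\n",
 "a chosen threshold (the 50% cutoff and the `access_wdi_trimmed` name are\n",
 "assumptions for illustration, not part of the analysis that follows):\n",
 "\n",
 "``` r\n",
 "# Hypothetical rule: drop year columns where more than 50% of rows are NA\n",
 "na_share <- sapply(year_columns, function(column) mean(is.na(access_wdi[[column]])))\n",
 "cols_to_drop <- names(na_share[na_share > 0.5])\n",
 "access_wdi_trimmed <- access_wdi %>% select(-all_of(cols_to_drop))\n",
 "```"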
], "id": "c7c1e6e0-a855-4c79-a450-149b1f867169" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "access_wdi <- access_wdi %>% select(-`2022`)\n", "head(access_wdi, 20)" ], "id": "673ab10f-fbdd-427e-83b5-b00a9902056b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we are interested in a comparative analysis of countries over\n", "time, we also need to check for missing values *row-wise*: that is,\n", "whether, for a certain combination of Country and Series values, data is\n", "systematically missing more than in general. We can use thresholds – for example, let’s\n", "see if there are any particular series for which the data (for any\n", "country) is missing for more than 2 years." ], "id": "ba40d503-ff58-4b8f-a3b5-7889fe0dfda2" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a new column that counts the NAs across columns for each row\n", "access_wdi$count_na <- rowSums(is.na(access_wdi))\n", "\n", "# Sort by number of missing vals\n", "access_wdi <- access_wdi %>% arrange(desc(count_na))\n", "\n", "select(access_wdi, c(\"Country_Code\", \"Series_Code\", \"count_na\"))" ], "id": "0e2903c9-37fd-4771-acef-e58930fbe027" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data frame shows that we don’t have *any* data for Series beginning\n", "with the `SE` (Schooling) prefix.\n", "\n", "> *Think Critically*: We can be a lot more systematic about dropping\n", "> missing values! In our case, we have two indexes (Country and Series)\n", "> while the years are the columns of interest. Dropping values hence\n", "> becomes a bit more complex as we have a lot of options to choose from:\n", ">\n", "> - Dropping/Replacing Series altogether\n", ">\n", "> - Dropping/Replacing Countries\n", ">\n", "> - Dropping/Replacing specific rows, i.e. Country-Series observation\n", "> pairs.\n", ">\n", "> Of course, making such decisions by looking at a 3x4 data frame is a\n", "> lot easier than making decisions by looking at a 10x4 data frame with\n", "> multiple indexes. Hence, it helps to narrow your research scope and\n", "> define your RQs before making decisions about replacing/dropping\n", "> observations.\n", "\n", "Continuing with our tutorial, let’s identify the Country-Series pairs\n", "that are missing data for more than 2 of the remaining years.\n", "\n", "> This would make sense if we were interested in analyzing, comparing or\n", "> visualizing YoY change in Series values. For example, how did primary\n", "> school enrollment change *as a result of* COVID-19 lockdowns?" ], "id": "59d08fd7-4c97-47f9-96f5-0020cdccde83" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating an array of Series_Code(s) that need to be dropped from access_wdi\n", "\n", "to_drop <- access_wdi %>%\n", " filter(count_na > 2) %>%\n", " distinct(Series_Code)\n", "\n", "to_drop <- unlist(to_drop, use.names = FALSE)\n", "\n", "to_drop" ], "id": "3faeaeaa-20c6-441f-9477-e5ccd911328e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> *Think Critically:* If we drop all Series included in `to_drop`, our\n", "> research scope will become significantly narrower! The presence of\n", "> missing data thus poses a challenge, but there are potential solutions\n", "> to address this limitation. One approach is to import data series from\n", "> alternative sources and *fill them into* our original data set. This\n", "> is beyond the scope of this tutorial, but caution must be\n", "> practiced (with appropriate citations) when combining data from\n", "> different data sources.\n", "\n", "For simplicity’s sake, let’s now drop all the rows where `Series_Code`\n", "matches any of the codes in `to_drop` and save the resulting data\n", "frame as a new version of `access_wdi`." ], "id": "6fe45567-e36c-4580-97e1-8ec51978caeb" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filtered_access_wdi <- access_wdi %>%\n", " filter(!(Series_Code %in% to_drop))\n", "\n", "filtered_access_wdi" ], "id": "145ea3ea-e99a-4cdc-b3aa-395e1f7ece8a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the only variables really left in this data frame are the `EG`\n", "variables that indicate the levels of access to electricity and other\n", "power sources within the countries.\n", "\n", "If we had to answer a holistic set of questions on the state of access\n", "to public amenities, do you think the data frame above would be\n", "appropriate?\n", "\n", "Narrowing down our research scope a bit, we could still visualize the\n", "growth in access to energy across the countries over the last 5 years\n", "using R.\n", "\n", "### Merging Data Frames in R\n", "\n", "The World Bank (in tandem with the International Monetary Fund) also\n", "updates the Quarterly Public Debt (QPD) database. The WDI data set has a few\n", "key macro-series including *national incomes*, *CABs*, *Bank Capital to\n", "Assets Ratios,* and various kinds of *CPIA* ratings.\n", "\n", "Let’s try to present some macro-relevant information in a neat *tidy*\n", "frame that one would further use to visualize and analyze the fiscal and\n", "monetary landscapes within different world countries." ], "id": "aeb36421-3365-4926-806f-72ab27e34ed9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Here are all the Series_Families I want from WDI\n", "prefixes <- c(\"NY\", \"FD\", \"FB\", \"IQ\", \"BN\")\n", "\n", "# Subset WDI and create macro_wdi\n", "macro_wdi <- wdi %>% filter(str_detect(`Series_Code`, str_c(\"^\", prefixes, collapse = \"|\"))) %>% # Documentation on `?str_detect`\n", " rename_with(~str_extract(., \"\\\\d{4}\"), starts_with(\"20\")) # Cleaning Year Column names\n", " \n", "macro_wdi" ], "id": "046a858d-4534-415b-a9f1-082b44b11640" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, take note of the missing values! Unfortunately, we’re missing the\n", "CPIA rating variables (starting with `IQ`) for all the countries\n", "(Canada, India, Mexico, South Africa and Indonesia).\n", "\n", "Since this won’t add much to our *comparative* analysis of the world\n", "countries, it makes more sense to drop the CPIA rows altogether." ], "id": "56edbfb7-9077-44b5-bea7-4d6ba4abd5e3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "macro_wdi <- macro_wdi %>% filter(!(Series_Code %in% c('IQ.CPA.FINS.XQ', 'IQ.CPA.FISP.XQ', 'IQ.CPA.MACR.XQ', 'IQ.CPA.PROP.XQ', 'IQ.CPA.PROT.XQ', 'IQ.CPA.DEBT.XQ', 'IQ.CPA.TRAD.XQ')))\n", "\n", "c(macro_wdi$Series_Code %>% unique(), macro_wdi$Series_Name %>% unique())" ], "id": "ae154d82-0e5c-4782-95fe-b5f25f2730d6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "So these are the variables we’ll include from the WDI data frame! Now,\n", "let’s load the QPD database."
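,
 "\n",
 "\n",
 "Before we do, one quick aside: the explicit list of CPIA codes above could\n",
 "equivalently be expressed with the prefix-matching pattern used earlier. A\n",
 "minimal sketch (this assumes, as the cell above suggests, that the only\n",
 "`IQ`-prefixed series present are the seven CPIA ratings; the\n",
 "`macro_wdi_no_cpia` name is just for illustration):\n",
 "\n",
 "``` r\n",
 "# Drop every Series whose code starts with the \"IQ\" (CPIA) prefix\n",
 "macro_wdi_no_cpia <- macro_wdi %>% filter(!str_detect(Series_Code, \"^IQ\"))\n",
 "```"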
], "id": "631ef747-2618-4e1d-b995-1e16350a71bd" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loading the CSV file from the local file path\n", "file_path_2 <- \"WB_Data/qpd.csv\"\n", "\n", "qpd <- read_csv(file_path_2) %>%\n", " rename(Series_Code = \"Series Code\", Series_Name = \"Series Name\",\n", " Country = \"Country Name\", Country_Code = \"Country Code\")\n", "\n", "head(qpd, 25)" ], "id": "4453b259-fa5f-44dd-aed5-a9a53a636840" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregating Quarterly Values by Year\n", "\n", "The Series data in QPD is stored on a quarter-by-year basis, and we can\n", "aggregate across the quarter columns to get a yearly amount for each year.\n", "\n", "> You might ask: how do missing values affect data aggregation? By default, R\n", "> returns `NA` when you ask it to sum over rows/columns that\n", "> include NA values. We resolve this by setting\n", "> the parameter `na.rm = TRUE`, telling the aggregation functions to\n", "> ignore NA values.\n", "\n", "As a best practice, I want to check the number of periods for which\n", "data is missing for **each unique combination of Country and Series_Code\n", "values**. The following code is a pair of nested loops\n", "telling R to *manually* go over each unique combination, count the number of NAs\n", "along the period columns, and then store the result in another data frame\n", "called `status`." ], "id": "136312b5-48ae-4950-9465-aced332bc514" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status <- data.frame() # empty data frame where we will store the information\n", "Series_Codes <- qpd$Series_Code %>% unique() # gets all Series_Codes to iterate over\n", "Countries <- qpd$Country_Code %>% unique() # gets all Country_Codes to iterate over \n", "\n", "for (country_code in Countries) {\n", " country_rows <- filter(qpd, Country_Code == country_code) # first filter by country\n", " \n", " for (series_code in Series_Codes) {\n", " select_further <- filter(country_rows, Series_Code == series_code) # then filter by Series_Code\n", " # now, select the period columns. The result will be a single row of period columns\n", " # for each unique Country_Code, Series_Code combination. \n", " cols_to_check <- select(select_further, c(\"2018Q1 [YR2018Q1]\", \"2018Q2 [YR2018Q2]\", \"2018Q3 [YR2018Q3]\", \"2018Q4 [YR2018Q4]\",\n", " \"2019Q1 [YR2019Q1]\", \"2019Q2 [YR2019Q2]\", \"2019Q3 [YR2019Q3]\", \"2019Q4 [YR2019Q4]\",\n", " \"2020Q1 [YR2020Q1]\", \"2020Q2 [YR2020Q2]\", \"2020Q3 [YR2020Q3]\", \"2020Q4 [YR2020Q4]\",\n", " \"2021Q1 [YR2021Q1]\", \"2021Q2 [YR2021Q2]\", \"2021Q3 [YR2021Q3]\", \"2021Q4 [YR2021Q4]\",\n", " \"2022Q1 [YR2022Q1]\", \"2022Q2 [YR2022Q2]\", \"2022Q3 [YR2022Q3]\", \"2022Q4 [YR2022Q4]\"))\n", " \n", " na_count <- sum(is.na(cols_to_check)) # finally, count the NAs\n", " \n", " result <- data.frame(Country_Code = country_code, Series_Code = series_code, na_count = na_count)\n", " \n", " status <- rbind(status, result)\n", " # this appends the result to the status data frame\n", " }\n", "}" ], "id": "fbc39b8d-2be0-411c-8386-163cd59b6283" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, `status` should have three columns: Country_Code,\n", "Series_Code, and the na_count. I can now figure out which Country,\n", "Series combinations are missing data for **less than 20 periods**."
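,
 "\n",
 "\n",
 "For reference, the same counts can be computed without explicit loops. A minimal\n",
 "dplyr sketch (each row of `qpd` is one Country-Series combination, and all the\n",
 "period columns start with `20`; the `status_tidy` name is just for illustration):\n",
 "\n",
 "``` r\n",
 "# Loop-free alternative: count NAs across the period columns for each row,\n",
 "# mirroring the rowSums() approach used earlier for access_wdi\n",
 "status_tidy <- qpd %>%\n",
 "  mutate(na_count = rowSums(is.na(select(., starts_with(\"20\"))))) %>%\n",
 "  select(Country_Code, Series_Code, na_count)\n",
 "```"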
], "id": "94fc0589-17ea-4a0d-9b8b-93e8d2759c08" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status_to_drop <- status %>% filter(na_count < 20 & na_count > 0) # strictly less than 20, but at least 1\n", "status_to_drop" ], "id": "7071d770-0402-4037-9603-3dba5a2b7f8c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "*So*, these are the Country-Series pairs which we must drop from the\n", "data frame. Why specifically `na_count < 20 & na_count > 0`? It’s\n", "because, if the data is missing for **all** of the 20 columns, R’s\n", "aggregate function will take care of that and give me `0` as the yearly\n", "aggregate value. However, if data is missing for strictly less than 20\n", "period columns, the yearly aggregate values will be **under-estimated**.\n", "\n", "> By storing our exploration’s results in `status`, we can re-visit our\n", "> decision to drop values anytime we want. Such proper documentation\n", "> builds *trust* around the validity of the aggregate computations!\n", "\n", "Let’s now use `anti_join()` to drop any rows from `qpd` that match the\n", "Country_Code, Series_Code pairs in `status_to_drop`." ], "id": "70360476-dca3-4804-affb-a1c137c6e16d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "qpd_filtered <- anti_join(qpd, status_to_drop, by = c(\"Country_Code\", \"Series_Code\"))\n", "qpd_filtered" ], "id": "69f5da4f-092d-45bc-bbe4-0d4791225f0d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "> `anti_join()` is a function that removes rows from a data frame that\n", "> have matching values in specified columns with another data frame. In\n", "> this context, it is used to drop rows from qpd that match the\n", "> Country_Code and Series_Code pairs in status_to_drop, resulting in the\n", "> filtered data frame qpd_filtered.\n", "\n", "The code below tells R how to go over each unique combination\n", "of Country, Series_Code values and aggregate quarterly values by year. To\n", "see each line in action, head to the Appendix!" ], "id": "e46c5a98-83e7-4bf1-bacd-b2ac4741f301" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pivot the data from wide to long format, creating separate rows for each quarterly value\n", "qpd_long <- qpd_filtered %>%\n", " tidyr::pivot_longer(starts_with(\"20\"), names_to = \"quarter\", values_to = \"value\")\n", "\n", "# Extract the year from the \"quarter\" column\n", "qpd_extracted <- qpd_long %>%\n", " dplyr::mutate(year = stringr::str_extract(quarter, \"\\\\d+\"))\n", "\n", "# Group the data by country, series, and year for aggregation\n", "qpd_grouped <- qpd_extracted %>%\n", " dplyr::group_by(Country_Code, Country, Series_Code, Series_Name, year)\n", "\n", "# Calculate the sum of values for each combination of country, series, and year\n", "qpd_summarized <- qpd_grouped %>%\n", " dplyr::summarise(total_value = sum(value, na.rm = TRUE))\n", "\n", "# Pivot the data back to wide format, with separate columns for each year\n", "qpd_aggregated <- qpd_summarized %>%\n", " tidyr::pivot_wider(names_from = year, values_from = total_value, names_prefix = \"year_\")\n", "\n", "qpd_aggregated" ], "id": "dd42ed10-6b4c-40cb-843c-d629b2094e1a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, take note of the zeroes in the data frame. These occur due to\n", "certain Country, Series pairs missing data for all of the 20 periods.
We\n", "could filter out these observations if we had to!\n", "\n", "> The best thing about the dataframe is that the aggregate values\n", "> strictly greater than 0 are at least not being under-aggregated since\n", "> we took care of cases where data was missing for less than 20 periods.\n", "\n", "#### Performing the Merge\n", "\n", "I’ll now create a new data frame which includes both the macro variables\n", "from WDI and the QPD data. We’ll use `rbind()` to append observations\n", "from QPD to `macro_wdi`." ], "id": "fe194416-d0e4-46c7-89cf-2d032950c821" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example: Since the yearly column names are different, we can rename them to match\n", "colnames(qpd_aggregated) <- colnames(macro_wdi)\n", "\n", "# Combine the data frames using rbind\n", "df_macro <- rbind(macro_wdi, qpd_aggregated)\n", "\n", "# Print the dimensions of the combined data frame\n", "print(dim(df_macro))\n", "\n", "# View the combined data frame\n", "df_macro" ], "id": "75846e27-a9ba-412f-bb3c-2eaee529b846" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Performing a Horizontal Merge\n", "\n", "Let’s suppose we were to update `df_macro` with 2023 values which we\n", "have saved in a different dataframe." ], "id": "78143a94-7711-4bee-beeb-ff94af32adb5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "values_2023 <- c(-46898267372, -25575220073, 1457067267, 81641557999, -59663613794, 79677936994, 88935053721, 32159558497, 25822808780, -87642745906, -58805085020, -64688649494, 37404569332, -23179256357, 53968284000, -460151583, 43523701653, 98381218966, -23992964113, 55489044264, 86941046221, -57571495743, 30334753217, -74888980808, -46555866254, -22777181491, -97321933368, -23522408586, 73938169144, -31930200662, -3583976906, 19913165085)\n", "\n", "Country <- c(\"India\", \"India\", \"Canada\", \"Canada\", \"Honduras\", \"Honduras\", \"Indonesia\", \"Indonesia\", \"South Africa\", \"South Africa\", \"Mexico\", \"Mexico\", \"United Kingdom\", \"United Kingdom\", \"United States\", \"United States\", \"China\", \"China\", \"Hong Kong SAR, China\", \"Hong Kong SAR, China\", \"Netherlands\", \"Netherlands\", \"Egypt, Arab Rep.\", \"Egypt, Arab Rep.\", \"Georgia\", \"Georgia\", \"Slovak Republic\", \"Slovak Republic\", \"Georgia\", \"Honduras\", \"Indonesia\", \"Slovak Republic\")\n", "\n", "Series_Code <- rep(df_macro$Series_Code, length.out = length(values_2023))\n", "Series_Name <- rep(df_macro$Series_Name, length.out = length(values_2023))\n", "\n", "series_2023 <- data.frame( Country = Country, Series_Code = Series_Code, Series_Name = Series_Name, values_2023 = values_2023)\n", "\n", "head(series_2023, 5)" ], "id": "1a6c38d4-4126-46ec-8a86-0f1cc8520a61" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can use `merge()` to add the 2023 values column into `df_macro`\n", "while ensuring only the observations where values for `Country` are\n", "matched between the two data frames are merged." 
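,
 "\n",
 "\n",
 "For comparison, the same operation can be written with dplyr’s join verbs. A\n",
 "minimal sketch (with no `all =` arguments, base R’s `merge()` behaves like an\n",
 "inner join; the `df_macro_2_alt` name is just for illustration):\n",
 "\n",
 "``` r\n",
 "# dplyr equivalent of the merge() call below: keep only the rows that match\n",
 "# on Country, Series_Code and Series_Name in both data frames\n",
 "df_macro_2_alt <- df_macro %>%\n",
 "  inner_join(series_2023, by = c(\"Country\", \"Series_Code\", \"Series_Name\")) %>%\n",
 "  rename(\"2023\" = \"values_2023\")\n",
 "```"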
], "id": "3025f22c-0dc6-4a91-8749-2e750ac72ff5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_macro_2 <- merge(df_macro, series_2023, by = c(\"Country\", \"Series_Code\", \"Series_Name\"))\n", "\n", "df_macro_2 <- df_macro_2 %>% rename(\"2023\" = \"values_2023\")\n", "\n", "df_macro_2" ], "id": "698aa353-d9a5-4fc0-a877-c40e7f14253f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optional: Review of different Merge Methods in R" ], "id": "c0fe7ff2-1767-4566-b51a-feae693d98bf" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(knitr)\n", "knitr::include_graphics(\"join_visual.png\")" ], "id": "cc13d26a-caea-41e3-837e-6720f2f79077" }, { "cell_type": "markdown", "metadata": {}, "source": [ "As illustrated in the image, `full_join()` and `right_join()` are\n", "great for merging data sources in situations when we are particularly\n", "interested in the issue of missing matches.\n", "\n", "For simpler cases, `inner_join()` is ideal when you only want to include\n", "fully matched observations in the final data set.\n", "\n", "### Conclusion\n", "\n", "Thanks for completing this module. The skills you have learnt in **Intro\n", "to Data Parts 1 and 2** should help with various kinds of data\n", "interviews. Researchers should remember to document their decisions\n", "when excluding/including data and merging different data sources, and to always\n", "think about how such actions might change the validity of later\n", "statistical tests or the conclusions drawn.\n", "\n", "### Appendix\n", "\n", "#### More on the *Wrangling Code* that aggregates period values in QPD\n", "\n", "The following code should produce 5 different data frames that\n", "incrementally show how our *wrangling* code calculated the yearly\n", "aggregates for each Country and Series."
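,
 "\n",
 "\n",
 "If the pivot steps feel abstract, here is a tiny self-contained sketch of the\n",
 "same reshape-aggregate-reshape pattern, using made-up toy values rather than\n",
 "the QPD data:\n",
 "\n",
 "``` r\n",
 "# Toy data: one country, four quarter columns, one missing value\n",
 "toy <- tibble::tibble(Country = \"Canada\",\n",
 "                      `2018Q1` = 1, `2018Q2` = 2, `2019Q1` = 3, `2019Q2` = NA)\n",
 "\n",
 "toy %>%\n",
 "  tidyr::pivot_longer(starts_with(\"20\"), names_to = \"quarter\", values_to = \"value\") %>% # wide -> long\n",
 "  dplyr::mutate(year = stringr::str_extract(quarter, \"\\\\d{4}\")) %>% # pull out the year\n",
 "  dplyr::group_by(Country, year) %>%\n",
 "  dplyr::summarise(total_value = sum(value, na.rm = TRUE), .groups = \"drop\") %>% # yearly totals\n",
 "  tidyr::pivot_wider(names_from = year, values_from = total_value, names_prefix = \"year_\")\n",
 "```"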
], "id": "a90f2666-a7d8-442a-871a-9eb9017659e1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select a specific country and series for demonstration (here \"ALB\" and a public-debt series)\n", "country_code <- \"ALB\"\n", "series_code <- \"DP.DOD.DSCD.CR.PS.CD\"\n", "\n", "# Filter the data for the specific country and series\n", "qpd_example <- qpd_filtered %>%\n", " dplyr::filter(Country_Code == country_code, Series_Code == series_code)\n", "\n", "# Pivot the data from wide to long format, creating separate rows for each quarterly value\n", "qpd_long <- qpd_example %>%\n", " tidyr::pivot_longer(starts_with(\"20\"), names_to = \"quarter\", values_to = \"value\")\n", "\n", "# Show the intermediate result: qpd_long\n", "qpd_long\n", "\n", "# Extract the year from the \"quarter\" column\n", "qpd_extracted <- qpd_long %>%\n", " dplyr::mutate(year = stringr::str_extract(quarter, \"\\\\d+\"))\n", "\n", "# Show the intermediate result: qpd_extracted\n", "qpd_extracted\n", "\n", "# Group the data by country, series, and year for aggregation\n", "qpd_grouped <- qpd_extracted %>%\n", " dplyr::group_by(Country_Code, Country, Series_Code, Series_Name, year)\n", "\n", "# Show the intermediate result: qpd_grouped\n", "qpd_grouped\n", "\n", "# Calculate the sum of values for each combination of country, series, and year\n", "qpd_summarized <- qpd_grouped %>%\n", " dplyr::summarise(total_value = sum(value, na.rm = TRUE))\n", "\n", "# Show the intermediate result: qpd_summarized\n", "qpd_summarized\n", "\n", "# Pivot the data back to wide format, with separate columns for each year\n", "qpd_aggregated <- qpd_summarized %>%\n", " tidyr::pivot_wider(names_from = year, values_from = total_value, names_prefix = \"year_\")\n", "\n", "# Show the final result: qpd_aggregated\n", "qpd_aggregated" ], "id": "eb7154a7-5f23-46e8-a0bb-4263e308c1bc" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }