{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 04 - Opening Datasets\n",
        "\n",
        "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt  \n",
        "2024-05-29\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "1.  Understand the basics of R such as data types and structures.\n",
        "\n",
        "## Learning Outcomes\n",
        "\n",
        "1.  Load a variety of data types into R using various functions.\n",
        "2.  View and reformat variables, specifically by factorizing.\n",
        "3.  Work with missing data.\n",
        "4.  Select subsets of observations and variables for use.\n",
        "\n",
        "## 4.0 Intro\n",
        "\n",
        "In this notebook, we will focus on loading, viewing and cleaning up our\n",
        "data set: these are **fundamental** skills which will be necessary for\n",
        "essentially every data project we will do. This data analysis process\n",
        "usually consists of four steps:\n",
        "\n",
        "1.  We clear the workspace and set up the directory (the folder that R\n",
        "    accesses whenever we run a command that either opens or saves a\n",
        "    file).\n",
        "2.  We load the data into R, meaning we take a file on our computer and\n",
        "    tell R how to interpret it.\n",
        "3.  We inspect the data through a variety of methods to ensure it looks\n",
        "    good and has been properly loaded.\n",
        "4.  We clean up the data by removing missing observations and adjusting\n",
        "    the way variables are interpreted.\n",
        "\n",
        "In this module, we will cover each of these steps in detail.\n",
        "\n",
        "## 4.1 Clearing the Workspace and Changing the Directory\n",
        "\n",
        "Our script files should begin with a command that clears the previous\n",
        "work that has been done in R. This makes sure that:\n",
        "\n",
        "1.  we do not waste computer memory on things other than the current\n",
        "    project;\n",
        "2.  whatever result we obtain in the current session truly belongs to\n",
        "    that session.\n",
        "\n",
        "To clear our workspace, we can use the `rm()` function. If we do not\n",
        "specify any object names as inputs of the `rm()` function, R will remove\n",
        "all objects available in the workspace. Alternatively, we can use the\n",
        "function `ls()` which lists all the objects in the current workspace.\n",
        "Any one of the two commands below will clear our workspace from all\n",
        "existing objects."
      ],
      "id": "593ff1d5-2bde-4d12-af71-2def2aeab803"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "rm()\n",
        "rm(list=ls())"
      ],
      "id": "435caa10-53d4-4590-bc16-e726de2a83cc"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before importing data into R, it is useful to know how to change the\n",
        "folder that R accesses whenever we run a command that either opens or\n",
        "saves a file. Once we instruct R to change the directory to a specific\n",
        "folder, from that point onward it will open files from that folder and\n",
        "save all files to that folder, including data files and script files. R\n",
        "will continue to do this until either the program is closed or we change\n",
        "to another directory.\n",
        "\n",
        "Before changing the directory, it is important to know what the current\n",
        "directory is. In R, we can view the current directory with the command\n",
        "`getwd()`."
      ],
      "id": "13fa7c40-a931-4928-8a48-a506b02cb31f"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(getwd())"
      ],
      "id": "4df063e4-6f8e-4d43-97d1-2b03a870190e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Note:** We write the directory path within quotation marks to make\n",
        "sure R interprets this as a single string of words. If we don’t do this,\n",
        "we may encounter issues with folders that include blank spaces.\n",
        "\n",
        "Now that we know what the current directory is, we can change it to any\n",
        "specific location you like by using the command `setwd()` and a file\n",
        "path in quotes.\n",
        "\n",
        "For example, we can change our working directory to a directory named\n",
        "“some_folder/some_folder” with the command\n",
        "`setwd(\"some_folder/some_subfolder\")`.\n",
        "\n",
        "Instead of changing directory every time, R allows us to create\n",
        "‘projects’. RStudio Projects are built-in features of RStudio that allow\n",
        "us to create a working directory for a project which we can launch\n",
        "whenever we want.\n",
        "\n",
        "To create an RStudio Project, first launch RStudio. Then navigate\n",
        "through **File**, **New Project**, **New Directory**, and then **New\n",
        "Project**. We can then choose the name of our project and select where\n",
        "we would like the project to be stored. To allow for the project to live\n",
        "on OneDrive (which is highly recommended), we select the OneDrive\n",
        "directory in our computer. Finally, we can create the project. If we\n",
        "access your OneDrive folder on our computer, we should then see a\n",
        "subfolder with our project name and a default .RProj object already\n",
        "inside.\n",
        "\n",
        "Whenever we want to return to our project to work on it, we can simply\n",
        "click the .RStudio Project object above. We can also start a fresh\n",
        "session in RStudio and navigate to our project by selecting **File**,\n",
        "**Open Project**, and then following the specified instructions.\n",
        "\n",
        "More details on RStudio Projects can be found in [Module\n",
        "17](https://comet.arts.ubc.ca/docs/Research/econ490-r/17_Wf_Guide.html).\n",
        "\n",
        "## 4.2 Loading Data\n",
        "\n",
        "Before we can load our data, we need to tell R which packages we will be\n",
        "using in our notebook. Without these packages, R will not have access to\n",
        "the appropriate functions needed to interpret our raw data. As explained\n",
        "previously, packages only need to be installed once; however, they need\n",
        "to be imported every time we open a notebook.\n",
        "\n",
        "We have discussed packages previously: for data loading, the two most\n",
        "important ones are `tidyverse` and `haven`.\n",
        "\n",
        "-   `tidyverse` should already be somewhat familiar. It includes a wide\n",
        "    range of useful functions for working with data in R.\n",
        "-   `haven` is a special package containing functions that can be used\n",
        "    to import data.\n",
        "\n",
        "Let’s get started by loading them now."
      ],
      "id": "9a34953e-f237-4f74-9617-8fbeab6faefd"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# loading in our packages\n",
        "library(tidyverse)\n",
        "library(haven)\n",
        "library(IRdisplay)"
      ],
      "id": "a2e41e80-d6a4-4a2d-b0a1-3784952d4d86"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Data can be created by different programs and stored in different\n",
        "styles - these are called **file types**. We can usually tell what kind\n",
        "of file type we are working with by looking at the extension. For\n",
        "example, a text file usually has the extension `.txt`. The data we will\n",
        "be using in this course is commonly stored in Stata, Excel, text, or\n",
        "comma-separated variables (csv) files. These have the following types:\n",
        "\n",
        "-   .dta for a Stata data file;\n",
        "-   .xls or .xlsx for an Excel file;\n",
        "-   .txt for a text file;\n",
        "-   .csv for a comma-separated variables file.\n",
        "\n",
        "To load any data set, we need to use the appropriate function in order\n",
        "to specify to R the format in which the data is stored:\n",
        "\n",
        "-   To load a .csv file, we use the command `read_csv(\"file name\")`.\n",
        "-   To load a STATA data file, we use the command\n",
        "    `read_dta(\"file name\")`.\n",
        "-   To load an Excel file, we use the command `read_excel(\"file name\")`.\n",
        "-   To load a text file, we use the command\n",
        "    `read_table(\"file name\", header = FALSE)`.\n",
        "    -   The header argument specifies whether or not we have specified\n",
        "        column names in our data file.\n",
        "-   There exist many other commands to import different types of data\n",
        "    files. Feel free to research other shortcuts that might help you\n",
        "    with whatever data you are using!\n",
        "\n",
        "**Note:** If we are using an Excel file, we need to load in the readxl\n",
        "package alongside the tidyverse and haven packages above to read the\n",
        "file.\n",
        "\n",
        "In this module, we’ll be working with the data set in the\n",
        "`\"fake_data.dta\"` files. This data set is simulating information of\n",
        "workers in the years 1982-2012 in a fake country where a training\n",
        "program was introduced in 2003 to boost their earnings.\n",
        "\n",
        "Let’s read in our data in .dta format now."
      ],
      "id": "bd40007e-1ff0-4e37-8c44-c2a3bc77feeb"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# reading in the data\n",
        "fake_data <- read_dta(\"../econ490-stata/fake_data.dta\") ## .. just tells R to go back one folder."
      ],
      "id": "e70f5acb-eeea-4290-aa5d-6ab7bb2ede70"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 4.3 Viewing Data\n",
        "\n",
        "Now that we’ve loaded in our data, it’s important to inspect the data.\n",
        "Let’s look at a series of commands which help us to do this.\n",
        "\n",
        "### 4.3.1 `glimpse` and `print`\n",
        "\n",
        "The first command we are going to use describes the basic\n",
        "characteristics of the variables in the loaded data set."
      ],
      "id": "e7fc4c91-e080-4ef6-acfb-998f09c68d4e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "glimpse(fake_data)"
      ],
      "id": "7a574784-242a-42a8-a765-db5c4e53db4e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Alternatively, we can use the `print` command, which displays the same\n",
        "information as the `glimpse` command but in horizontal form."
      ],
      "id": "1aacb099-ee78-4c9f-a9af-c4a7556a2533"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(fake_data)"
      ],
      "id": "3c921dcc-801c-4fc1-8cbb-23d19a8363ce"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With many variables, this can be harder to read than the `glimpse`\n",
        "command. Thus, we typically prefer to use the `glimpse` command.\n",
        "\n",
        "### 4.3.2 `view`, `head`, and `tail`\n",
        "\n",
        "In addition to the `glimpse` command, we can also see the raw data we\n",
        "have imported as if it were an Excel file. To do this, we can use the\n",
        "`view` function. This command will open a clear representation of our\n",
        "data as though it were a spreadsheet. We can also use the command\n",
        "`head`. This prints out a preview of our data set exactly as it would\n",
        "appear in Excel (showing the first ten rows by default). We can then\n",
        "specify a numeric argument to the function to change the number of rows\n",
        "we want to see, as well as the specific rows we want via indicating\n",
        "their positions."
      ],
      "id": "fc615749-b3bf-4a26-b932-d3d4eed89f5c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "head(fake_data)"
      ],
      "id": "a3b05040-603b-4be7-bc16-aa201d7fc8bc"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "There is even the function `tail`, which functions identically to `head`\n",
        "but works from the back of the data set (outputs the final rows)."
      ],
      "id": "73035b27-57f5-48d2-ae08-0eeb6be222ec"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "tail(fake_data)"
      ],
      "id": "c6620270-afd4-4401-9641-0dc8136d7044"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Opening the data editor has many benefits. Most importantly we get to\n",
        "see our data as a whole, allowing us to have a clearer perspective of\n",
        "the information the data set is providing us. For example, here we\n",
        "observe that we have unique worker codes, the year where they are\n",
        "observed, worker characteristics, and whether or not they participated\n",
        "in the training program. This viewing process is particularly useful\n",
        "when we first load a data set, since it lets us know if our data has\n",
        "been loaded in correctly and looks appropriate.\n",
        "\n",
        "### 4.3.3 `summary` and `sapply`\n",
        "\n",
        "We can further analyze any variable by using the `summary` command. This\n",
        "command gives us the minimum, 25th percentile, 50th percentile (median),\n",
        "75th percentile, and max of each of our variables, as well as the mean\n",
        "of each of these variables. It is a good command for getting a quick\n",
        "overview of the general spread of all variables in our data set."
      ],
      "id": "a7e5ece8-c1d1-4148-8c48-b56983b29b77"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "summary(fake_data)"
      ],
      "id": "a8aa220f-74fb-41fc-8b4b-cd0f9af5095f"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From the command above, we can tell that this function will only be\n",
        "meaningful for variables in numeric or integer form.\n",
        "\n",
        "We can also apply `summary` to specific variables."
      ],
      "id": "d4d94f69-36ae-4eed-92da-3b14a8b64000"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "summary(fake_data$earnings)"
      ],
      "id": "f99e80cb-3d32-48c6-acad-8f764d8871c0"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we want to quickly access more specific information about our\n",
        "variables, such as their standard deviations, we can supply this as an\n",
        "argument to the function `sapply`. It will output the standard\n",
        "deviations of each of our numeric variables. However, it will not\n",
        "operate on character variables. Remember, we can check the type of each\n",
        "variable using the `glimpse` function from earlier."
      ],
      "id": "c001639a-c766-4f8b-bc58-8dcfdf3cf052"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "sapply(fake_data, sd)"
      ],
      "id": "b4f1a9ca-44bd-498f-b723-94bc68eda187"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also apply arguments such as mean, min, and median to the\n",
        "function above; however, `sd` is a good one since it is not covered in\n",
        "the `summary` function.\n",
        "\n",
        "### 4.3.4 `count` and `table`\n",
        "\n",
        "We can also learn more about the frequency of the different measures of\n",
        "our variables by using the command `count`. We simply supply a specific\n",
        "variable to the function to see the distribution of values for that\n",
        "variable."
      ],
      "id": "10b0c5f0-32be-4237-b686-e3d61f38b047"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "count(fake_data, region)"
      ],
      "id": "5aecfcd7-a416-4566-a283-f13d1ae4bbca"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here we can see that there are five regions indicated in this data set,\n",
        "that more people surveyed came from region 1 and then fewer people\n",
        "surveyed came from region 3. Similarly, we can use the `table` function\n",
        "and specify our variable to accomplish the same task."
      ],
      "id": "08a1107b-1c3a-45c6-bf79-de88b94ff691"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "table(fake_data$region)"
      ],
      "id": "920d01b4-eda8-4357-9188-ee572e80c09e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 4.4 Cleaning Data\n",
        "\n",
        "Now that we’ve loaded in our data, the next step is to do some\n",
        "rudimentary data cleaning. This most commonly includes factorizing\n",
        "variables and dropping missing observations.\n",
        "\n",
        "### 4.4.1 Factorizing Variables\n",
        "\n",
        "We have already seen that there are different types of variables which\n",
        "can be stored in R. Namely, there are quantitative variables and\n",
        "qualitative variables. Any quantitative variable can be stored in R as a\n",
        "set of strings or letters. These are known as **character** variables.\n",
        "Qualitative variables can also be stored in R as factor variables.\n",
        "Factor variables associate a qualitative response to a categorical\n",
        "value, making analysis much easier. Additionally, data is often encoded,\n",
        "meaning that the levels of a qualitative variable are represented by\n",
        "“codes”, usually in numeric form.\n",
        "\n",
        "Look at the *region* variable in the output from `glimpse` above:\n",
        "\n",
        "    region     <dbl> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4, 2,~\n",
        "\n",
        "The *region* variable in this data set corresponds to a particular\n",
        "region that the worker is living in. We can also see the variable type\n",
        "is `<dbl+lbl>`: this is a labeled double. This is good: it means that R\n",
        "already understands what the levels of this variable mean.\n",
        "\n",
        "There are three similar ways to change variables into factor variables.\n",
        "\n",
        "1.  We can change a specific variable inside a data frame to a factor by\n",
        "    using the `as_factor` command. Let’s do that below, using the\n",
        "    special pipe `%>%` operator. This operator allows us to pipe\n",
        "    existing code into a new function. In this way, it helps us break up\n",
        "    long code across many lines, improving legibility. You can think of\n",
        "    the pipe operator as saying AND THEN when describing your code\n",
        "    aloud."
      ],
      "id": "777741d2-8961-46f6-b297-65b2277a21c7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "fake_data <- fake_data %>%  #we start by saying we want to update the data, AND THEN... (%>%)\n",
        "    mutate(region = as_factor(region)) #mutate (update) region to be a factor variable\n",
        "\n",
        "glimpse(fake_data)"
      ],
      "id": "4f186b6b-9fc3-499b-9cca-271d1e3c3e79"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Do you see the difference in the *region* variable? You can also see\n",
        "that the type has changed to `<fct>`, a **factor variable**.\n",
        "\n",
        "R would already know how to “decode” the factor variables from the\n",
        "imported data if and only if they were of type `<dbl+lbl>`. What about\n",
        "when this isn’t the case? This brings us to the next method:\n",
        "\n",
        "1.  We can **supply a list of factors** using the `factor` command. This\n",
        "    command takes two other values:\n",
        "    -   A list of levels the qualitative variable will take on.\n",
        "    -   A list of labels, one for each level, describing what each level\n",
        "        means.\n",
        "\n",
        "We can create a custom factor variable as follows:"
      ],
      "id": "4cbeab03-d624-4fd3-b959-573ccd723211"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#first, we write down a list of levels\n",
        "region_levels = c(1:5)\n",
        "#then, we write down a list of our labels\n",
        "region_labels = c('Region A', 'Region B', 'Region C', 'Region D', 'Region E')\n",
        "\n",
        "#now, we use the command but with some options - telling factor() how to interpret the levels\n",
        "\n",
        "fake_data <- fake_data %>%  #we start by saying we want to update the data, AND THEN... (%>%)\n",
        "    mutate(region2 = factor(region,   #notice it's factor, not as_factor\n",
        "                          levels = region_levels, \n",
        "                          labels = region_labels)) #mutate (update region) to be a factor of regions\n",
        "glimpse(fake_data)"
      ],
      "id": "2492d7de-7419-4d45-83fd-ea8e167b3215"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Again, do you see the difference between *region* and *region2* here?\n",
        "This is how we can customize factor labels when creating new variables.\n",
        "\n",
        "1.  The final method is very similar to the first. If we have a large\n",
        "    data set, it can be tiresome to decode all of the variables\n",
        "    one-by-one. Instead, we can use `as_factor` on the **entire data\n",
        "    set** and it will convert all of the variables with appropriate\n",
        "    types."
      ],
      "id": "c06a1c3a-39ac-439d-bfd9-b07df671185b"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "fake_data <- as_factor(fake_data)\n",
        "\n",
        "glimpse(fake_data)"
      ],
      "id": "07682b12-151b-484f-a2b0-55337c00be23"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This is our final data set, with all variables factorized.\n",
        "\n",
        "### 4.4.2 Removing Missing Data\n",
        "\n",
        "We often face the challenge of dealing with missing values among\n",
        "observations for some of our variables. To check if any of our variables\n",
        "have missing values, we can use the `is.na` function alongside the `any`\n",
        "function. This code will return a value of TRUE or FALSE depending on\n",
        "whether we do or do not have any missing observations in our data set."
      ],
      "id": "5c194e3f-f253-4402-a56d-dfd17c320aae"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "any(is.na(fake_data))"
      ],
      "id": "c8d50096-a2b7-47fd-8bbd-a58d79c5879f"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here, we can see that our data set already has no missing observations,\n",
        "so we do not need to worry about the process of potentially removing or\n",
        "redefining them. However, this is often not the case.\n",
        "\n",
        "Let’s go through the process of dropping missing observations for the\n",
        "*sex* variable anyway, assuming that missing observations are coded as\n",
        "“not available”. We will do this as a demonstration, even though no\n",
        "observations will actually be dropped. To do this, we will use the\n",
        "`filter()` method. This function conditionally drops rows (observations)\n",
        "by evaluating each row against the supplied condition. Only observations\n",
        "where the condition is true/met are retained (selection by inclusion) in\n",
        "the data frame. To use this to drop hypothetical missing observations\n",
        "for *sex*, we do the following:"
      ],
      "id": "194df38f-5898-438c-8cc7-7cc61bfc2135"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "filter(fake_data, sex != \"not available\")"
      ],
      "id": "94a72723-b955-48d7-9bbf-8dd579870a68"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Recall**: The operator `!=` is a conditional statement for “not equal\n",
        "to”. Therefore, we are telling R to keep the observations that are not\n",
        "equal to “not available”.\n",
        "\n",
        "This process utilized the `filter` function, which retains rows meeting\n",
        "a specific condition. However, we can also supply a series of conditions\n",
        "to filter at once. We could have, for instance, decided that we only\n",
        "wanted to keep observations for females from region 1. In this case, we\n",
        "could run the following code."
      ],
      "id": "a221b7ec-c467-4296-811d-a768f6f75d4e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "head(filter(fake_data, sex == \"F\" & region == 1))"
      ],
      "id": "0f24d47f-a133-4f80-aa6e-01e76272efc4"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Important Note**: Choosing which observations to drop is always an\n",
        "important research decision. There are two key ways to handle missing\n",
        "data: dropping it altogether (done above) or treating “missing” as its\n",
        "own valid category (not done above since no data is missing). This\n",
        "decision has important consequences for your analysis, and should always\n",
        "be carefully thought through - especially if the reasons why data are\n",
        "missing might not be random.\n",
        "\n",
        "### 4.4.3 Removing Variables\n",
        "\n",
        "Beyond filtering observations as was done above, we sometimes want to\n",
        "“filter” our variables. This process of operating on columns instead of\n",
        "rows requires the `select` function instead of the `filter` function.\n",
        "This is a useful function when we have more data at our disposal than we\n",
        "actually need to answer the research question at hand. This is\n",
        "especially pertinent given the propensity for data sets to collect an\n",
        "abundance of information, some of which may not be useful to us and\n",
        "instead slow down our loading and cleaning process.\n",
        "\n",
        "Let’s assume we are interested in seeing the gender wage gap among male\n",
        "and female workers of region 2, and nothing else. To help us with our\n",
        "analysis, we can filter by only observations which belong to region 2,\n",
        "then select for just the variables we are interested in."
      ],
      "id": "95765847-0472-4c3c-9d89-bfdf2138e7de"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "head(fake_data %>% filter(region == 2) %>% select(sex, earnings)) "
      ],
      "id": "c875b99d-3e73-4ec1-92fb-17be605652da"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the code above, we pass as parameters to the `select` function every\n",
        "column we wish to keep.\n",
        "\n",
        "-   `select(variables, I, want, to, keep)`\n",
        "-   `select(-variables, -I, -don't, -want)`\n",
        "\n",
        "This is very useful and is usually done for practical reasons such as\n",
        "memory. Cleaning data sets to remove unessential information also allows\n",
        "us to focus our analysis and makes it easier to answer our desired\n",
        "research question. In our specific case, we want to keep data on just\n",
        "wages and sex. We have used the select function for this. If we were to\n",
        "further our research of the gender wage gap within region 2, we would\n",
        "now be able to refer to “fake_data” more quickly for immediate results.\n",
        "\n",
        "## 4.5 Common Mistakes\n",
        "\n",
        "Common mistakes happen because we do not respect the format of specific\n",
        "variables. Let’s say we want to filter the observations in order to get\n",
        "only women working in region 1. We may forget that variable *sex* is a\n",
        "string variable and type the following:"
      ],
      "id": "c3c7ccdd-c7dd-4a14-bf8b-f8dc6f92e8cb"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "head(filter(fake_data, sex == F & region == 1))"
      ],
      "id": "adc2eae3-765a-4465-a0df-af0b7d0a3dce"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We obtain a tibble with no observations. This mistake occurs when we\n",
        "forget to wrap values of string variables in quotes. The correct command\n",
        "would be the following:"
      ],
      "id": "f4c45e0b-c6fd-4389-97aa-652878057696"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "head(filter(fake_data, sex == \"F\" & region == 1))"
      ],
      "id": "d998680f-91ed-431a-b15e-5bbb9a8a2ae6"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 4.6 Wrap Up\n",
        "\n",
        "In this notebook, we have covered the basic process of working with\n",
        "data. Specifically, we looked at how to load in data, how to view it,\n",
        "and how to clean it by factorizing, dropping and selecting variables and\n",
        "observations. This general scheme is critical to any research project,\n",
        "so it is important to keep in mind as you progress throughout your\n",
        "undergraduate economics coursework and beyond. In the next section, we\n",
        "will cover a larger concept which is also essential to the cleaning of a\n",
        "data set, but merits its own section: creating variables.\n",
        "\n",
        "## 4.7 Wrap-up Table\n",
        "\n",
        "| Command | Function |\n",
        "|----------------------------------|--------------------------------------|\n",
        "| `rm()` | It removes all objects in the workspace. |\n",
        "| `getwd()` | It shows the current working directory. |\n",
        "| `setwd()` | It changes the working directory to a file path of our choice. |\n",
        "| `read_dta()` | It imports a .dta file. |\n",
        "| `read_csv()` | It imports a .csv file. |\n",
        "| `read_table()` | It imports a .txt file. |\n",
        "| `glimpse()` | It shows basic characteristics of the data. |\n",
        "| `print()` | It shows basic characteristics of the data, displaying them on a horizontal format. |\n",
        "| `head()` | It shows the top observations of the data. |\n",
        "| `tail()` | It shows the bottom observations of the data. |\n",
        "| `summary()` | It gives the minimum, 25th percentile, 50th percentile (median), 75th percentile, and max of each variable. |\n",
        "| `sapply()` | It returns a given statistic for each variable of the dataset. |\n",
        "| `count()` | It counts how many different values there are for a given variable. |\n",
        "| `as_factor()` | It transforms a variable into a factor variable. |\n",
        "| `is.na()` | It returns a value of TRUE if there are not-available observations for a given variable; otherwise, it returns FALSE. |\n",
        "| `filter()` | It filters the data according to specific conditions that observations must satisfy. |\n",
        "| `select()` | It keeps only certain variables of our data. |\n",
        "\n",
        "## References\n",
        "\n",
        "-   [Introduction to Probability and Statistics Using\n",
        "    R](https://mran.microsoft.com/snapshot/2018-09-28/web/packages/IPSUR/vignettes/IPSUR.pdf)\n",
        "-   [DSCI 100 Textbook](https://datasciencebook.ca/index.html)"
      ],
      "id": "dadcb3e9-5a01-460a-8b34-cc9e245d4a15"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}