{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 03.1 - Introduction to Data in R - Part 1\n", "\n", "COMET Team
*Manas Mridul, Valeria Zolla, Colby Chamber, Colin\n", "Grimes, Jonathan Graves* \n", "2023-01-12\n", "\n", "## Outline\n", "\n", "### Prerequisites\n", "\n", "- Introduction to Jupyter\n", "- Introduction to R\n", "\n", "### Outcomes\n", "\n", "After completing this notebook, you will be able to:\n", "\n", "- Identify and understand the packages and commands needed to load,\n", " manipulate, and combine data frames in R\n", "- Load data using R in a variety of forms\n", "- Create and reformat data, including transforming data into factor\n", " variables\n", "\n", "### References\n", "\n", "- [Introduction to Probability and Statistics Using\n", " R](https://mran.microsoft.com/snapshot/2018-09-28/web/packages/IPSUR/vignettes/IPSUR.pdf)\n", "- [DSCI 100 Textbook](https://datasciencebook.ca/index.html)" ], "id": "91d71acf-d51f-4188-aff1-729d1feb5674" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell\n", "\n", "source(\"getting_started_intro_to_data_tests.r\")" ], "id": "6f6330c0-efe1-4260-bf23-ddb00519685e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drawing insights from data requires information to be presented in a way\n", "that is both interpretable to R and our audiences. However, before you\n", "can wrangle data sets, you need to ensure that they are clean. A *clean*\n", "data set means:\n", "\n", "1. Observations where data for key variables is missing are removed or\n", " stored in a different data set (eg. `df_raw`). *Missing data* can\n", " create bias in your analysis and there are other reasons why\n", " researchers choose to drop variables with too many missing variables\n", "2. Data set is *tidy*, ie. each row captures only one observation and\n", " each column captures only one variable/characteristic of the\n", " observation. Data scraped and collected manually or using automation\n", " often comes in *untidy* shapes (eg. the column has both the price\n", " and square foot area separated with a hyphen `-`)\n", "\n", "In this notebook, we teach you how to load data sets properly in R and\n", "then clean them using some common methods from the `haven` and\n", "`tidyverse` packages.\n", "\n", "# Loading Data in R\n", "\n", "R needs to be provided with the appropriate packages to have access to\n", "the appropriate functions needed to interpret our raw data.\n", "\n", "> `install.packages('package_name')` is used to install packages for the\n", "> first time while `library('package_name')` is used to import the\n", "> package into our notebook’s session run-time.\n", "\n", "Let’s get started by loading them now." ], "id": "03f5fc15-47fb-4ff2-a06b-0410aa8fe48b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# loading in our packages\n", "library(tidyverse)\n", "library(haven)" ], "id": "c27002ab-1322-4fe6-b1f2-d1db5e673546" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Researchers usually work with data stored in STATA, Excel, or\n", "comma-separated variables (CSV) files. The extension of a file tells us\n", "the file type we’re dealing with. Here are two common file types used in\n", "the data community:\n", "\n", "- `.dta` for a STATA data file\n", "- `.csv` for a comma-separated variables file\n", "- `.txt` or text files can be used to store data separated by\n", " white-space(s).\n", "\n", "Also take note of the following functions used to import data of\n", "different file types:\n", "\n", "- `read_csv(\"file name\")` for CSV\n", "- `read_dta(\"file name\")` for STATA from the `haven` package\n", "- `read_excel(\"file name\")` from the `readxl` package\n", "- `read_table(\"file name\", header = FALSE)` for text files\n", " - The ‘header’ argument indicates whether the first row of data\n", " represent the column names or not.\n", "\n", "### Exercise\n", "\n", "In this notebook, we will be working with data from the Canadian census\n", "which is stored as `01_census2016.dta`. Which function should we use to\n", "load this file? Write the name of the function just before the brackets\n", "(e.g. `read_table`)" ], "id": "9d441ce1-75be-48a4-9b68-8cf73e4820c4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# which function should we use?\n", "\n", "answer0 <- \"...\"\n", "\n", "test_0()" ], "id": "b51e2c4e-f32c-4a90-95b1-9cd1e4472e39" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Did you get it? Okay, now replace the `???` in the code below with that\n", "function to load the data!" ], "id": "66c0b011-4014-4767-b7a8-e36524d64b9d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reading in the data\n", "census_data <- ???(\"../datasets_getting_started/01_census2016.dta\") # change me!\n", "\n", "# inspecting the data\n", "glimpse(census_data)" ], "id": "495ca5bd-d4e1-4f70-a0a2-8b9c3006d14d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning Data\n", "\n", "It’s important to ask what we mean by **cleaning** our data sets? This\n", "might usually look like:\n", "\n", "1. Loading the data into R by importing a local file or from the\n", " internet and telling R how to interpret it.\n", "2. Merging data frames from different sources, horizontally or\n", " vertically, in order to be able to answer certain questions about\n", " the populations.\n", "3. Renaming variables, creating new variables and removing observations\n", " where data for the new variables is missing.\n", "4. Removing outliers and or creating subsets of the data based on\n", " values for different variables using filter, select and other\n", " reshaping methods in R.\n", "\n", "We now begin to clean the census data. We want to redefine and factor\n", "variables, define new ones, and dropping missing observations.\n", "\n", "## Factor Variables\n", "\n", "As discussed previously, two types of variables can be stored in R:\n", "quantitative and qualitative variables. Qualitative variables are\n", "usually stored in R as sequences of characters or letters, ie. as\n", "**character** variables. They can also be stored as **factor** variables\n", "which map qualitative responses to categorical values. In other words,\n", "the qualitative variable gets **encoded** so the *levels* of the\n", "variable are represented by numeric “codes”. This process further\n", "streamlines data interpretation and analysis.\n", "\n", "Look at line `pr` in the output from `glimpse` above:\n", "\n", " pr 35, 35, 11, 24, 35, 35, 35, 10, 35, 35, 59, 59, 46, 24, 59\n", "\n", "The `pr` variable in the Census data stands for province. Do these look\n", "much like Canadian provinces to you? This is an example of **encoding**.\n", "We can also see the variable type is ``: this is a *labeled\n", "double*. This is good: it means that R already understands what the\n", "levels of this variable mean.\n", "\n", "There are three similar ways to change variables into factor variables.\n", "\n", "1. We can change a specific variable inside a dataframe to a factor by\n", " using the `as_factor` command" ], "id": "a3fe02fa-2151-4df8-9b80-9f39a723cf20" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "census_data <- census_data %>% #we start by saying we want to update the data, AND THEN... (%>%)\n", " mutate(pr = as_factor(pr)) #mutate (update pr to be a factor variable)\n", "\n", "glimpse(census_data)" ], "id": "a67019b6-fb7a-4540-9397-1ed68dedb801" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you see the difference in the `pr` variable? You can also see that\n", "the type has changed to `` for **factor variable**.\n", "\n", "R knows how to decode province names out of the \\ type\n", "variable, since the variable specification captures both the numeric\n", "code as `dbl` and the label as `lbl`.\n", "\n", "1. We can also **supply a list of factors** using the `factor` command.\n", " This command takes two other values:\n", " - A list of levels the qualitative variable will take on (eg. 35,\n", " 11, 24… in the case of pr)\n", " - A list of labels, one for each level, which describes what each\n", " level means (eg. ‘ontario’, ‘prince edward island’, ‘quebec’ …)\n", "\n", "Let’s look at the `pkids` (has children) variable as an example. Let’s\n", "suppose we didn’t notice that it is of type `` *or* we decided\n", "we didn’t like the built-in labels." ], "id": "e1ce0e54-312a-4639-aedc-f5bae9a041fa" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# first, we write down a list of levels\n", "kids_levels = c(0,1,9)\n", "\n", "# then, we write a list of our labels\n", "kids_labels = c('none', 'one or more', 'not applicable')\n", "\n", "# finally, we use the command but with some options - telling factor() how to interpret the levels\n", "\n", "census_data <- census_data %>% # we start by saying we want to update the data, AND THEN... (%>%)\n", " mutate(pkids = factor(pkids, # notice the function is \"factor\", not \"as_factor\"\n", " levels = kids_levels, \n", " labels = kids_labels)) # mutate (update pkids) to be a factor of pkids\n", "glimpse(census_data)" ], "id": "3bb02503-bdd2-4752-bd5e-a30ca0c2a052" }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the difference here and notice how we can customize factor labels\n", "when creating new variables.\n", "\n", "1. If we have a large data set, it can be tiresome to decode all of the\n", " variables one-by-one. Instead, we can use `as_factor` on the\n", " **entire dataset** and it will convert all of the variables with\n", " appropriate types.\n", "\n", "> **Note**: `as_factor` will only match the levels (eg. 35, 11, 24) to\n", "> labels (ie. ‘ontario’, ‘prince edward island’, ‘quebec’) ifthe\n", "> variable is of \\ type)." ], "id": "7aa897e1-b6ce-4abd-aed2-6cd14a55565c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "census_data <- as_factor(census_data)\n", "glimpse(census_data)" ], "id": "3535a424-d262-4753-854d-3392c206f243" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is our final dataset, all cleaned up! Notice that some of the\n", "variables (e.g. `ppsort`) were *not* converted into factor variables.\n", "\n", "> **Test Your Knowledge**: Can you tell why?\n", "\n", "## Creating New Variables\n", "\n", "Another important clean-up task is to make new variables. We can use the\n", "`mutate` command again. In fact, when we were making factor variables\n", "earlier, we were in a way making new variables.\n", "\n", "The `mutate` command is an efficient way of manipulating the columns of\n", "our data frame, and we can specify a formula for creating the new\n", "variable:\n", "\n", " census_data <- census_data %>%\n", " mutate(new_variable_name = function(do some stuff...))\n", "\n", "Let’s see it in action with the `log()` function that can be used to\n", "create a new variable for the natural logarithm of wages." ], "id": "4e8cbb2b-80a2-47e7-8557-f5d9bfe4a43e" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "census_data <- census_data %>%\n", " mutate(log_wages = log(wages)) # the log function\n", "\n", "glimpse(census_data)" ], "id": "8b69e6bb-2e32-4d81-b5ef-deb7201807c3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you see our new variable at the bottom? Nice!\n", "\n", "## The `case_when` function\n", "\n", "We won’t cover a lot of complex functions in this notebook, but we will\n", "mention one very important example: the `case_when` function. This\n", "function creates different values for an input based on specified cases.\n", "You can read more about it by running the code block below." ], "id": "798aa5eb-8927-4b30-804d-eef7fe66b13c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?case_when" ], "id": "d5f7fcc5-6540-4bdf-8da1-a97ba2cae76d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The case_when() function in R operates like a series of ‘if-then’\n", "statements. Put simply, for each line:\n", "\n", "- The ‘case’ is the condition that you’re checking for.\n", "\n", "- The ‘value’ is what you assign when that condition is met.\n", "\n", "Suppose we are working with the `pkids` variable and find it has three\n", "levels (`'none', 'one or more', 'not applicable'`). We are interested in\n", "creating a dummy variable which is equal to one if the respondent has\n", "children and zero otherwise. Let’s call it `has_kids`.\n", "\n", "Here’s how you can use `case_when()` to achieve this:" ], "id": "fd85e781-8c90-47a5-ab2d-f60175c378cb" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "census_data <- census_data %>% \n", " mutate(has_kids = case_when( # make a new variable, call it has_kids\n", " pkids == \"none\" ~ 0, # case 1: pkids is \"none\"; output is 0 (no kids)\n", " pkids == \"one or more\" ~ 1, # case 2: \"one or more\"; output is 1 (kids)\n", " pkids == 'not applicable' ~ 0)) # case 2: \"not applicable\"; output is 0 (no kids) \n", " \n", "\n", "glimpse (census_data)" ], "id": "813824f2-50e1-4e24-9b1a-5eca16912470" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `has_kids` is not a factor variable. We must add on the\n", "appropriate line of code to do that.\n", "\n", "## Exercise: Factorize `has_kids`\n", "\n", "Create an object, stored in `answer1`, in which the `census_data` data\n", "frame is identical to the one above but in which the `has_kids` variable\n", "is also in factor form." ], "id": "5699aa3b-c0d1-4d40-985a-f83a8a88529c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer1 <- census_data %>% \n", " ... # fill me in!\n", "\n", "test_1()" ], "id": "6e43c195-e4d4-407a-9920-a9b2a8e0a165" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More Complex Variables\n", "\n", "Dummy variables can be created using “complex” variables. For example,\n", "`agegrp` is a **factor** **variable** with **22 levels**. Imagine we\n", "were interested in an analysis comparing people who are of retirement\n", "age with those who are not. Then we could create a variable called\n", "`retired` that simply tells us whether a person has reached retirement\n", "or not, ie. if they are younger than 65 then `retired` will equal zero\n", "and 1 otherwise.\n", "\n", "As best practice, start with having a look at `agegrp`." ], "id": "9fa9a116-cb5a-4b6f-bdd5-8913bc675709" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "glimpse(census_data$agegrp)\n", "\n", "levels(census_data$agegrp)" ], "id": "c61c40b7-fb4a-4c68-8815-551d6dba83dd" }, { "cell_type": "markdown", "metadata": {}, "source": [ "`glimpse(census_data$agegrp)` told us that `agegrp` is a Factor variable\n", "with 22 levels. We then use `levels()` to see the names for the levels.\n", "\n", "We can now bunch together all levels that represent **ages 65 and\n", "above** and assign such observations a value of 1 (and 0 otherwise)." ], "id": "cd75afa3-0f60-401b-8c05-2e63f7be3b6a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "census_data <- census_data %>% \n", " \n", " mutate(retired = case_when((agegrp == \"65 to 69 years\")|(agegrp == \"70 to 74 years\")|(agegrp == \"75 to 79 years\")|(agegrp == \"80 to 84 years\")|(agegrp == \"85 years and over\") ~ 1, \n", " TRUE ~ 0)) %>% #otherwise\n", " mutate(retired = as_factor(retired)) # factor\n", "\n", "glimpse(census_data)" ], "id": "38ed4b26-cc65-4d48-98f9-d67988eb22e1" }, { "cell_type": "markdown", "metadata": {}, "source": [ "**A Tip:** To assign a default value on all cases that don’t match your\n", "specified conditions, use TRUE as your last ‘case’. This works because\n", "TRUE will always be met if none of the previous conditions are. (Notice\n", "`TRUE ~ 0` in the code)\n", "\n", "Dummy variables are always useful and sometimes a necessity in many\n", "econometric and ML models that need to work with complex types of\n", "qualitative data.\n", "\n", "## Exercise: Adding a Dummy Variable\n", "\n", "Create an object `answer2` in which the `census_data` dataframe now has\n", "a dummy variable called `knows_english` which is equal to 1 if the\n", "respondent knows English and 0 if not. Make sure that this variable is\n", "factorized.\n", "\n", "**Hint:** We’re telling you here that you need the `kol` variable to\n", "create the dummy. As best practice, should have verified that `kol` and\n", "`fol` are the two language-related variables and it is `kol` (“knowledge\n", "of official languages”) that best suits `knows_english`." ], "id": "18f68b70-5836-412b-b031-4c7f83e6621c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Run this first:\n", "glimpse(census_data$kol)\n", "levels(census_data$kol)" ], "id": "9224a11a-3dd5-4aa5-b644-68e6e5cf006d" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answer2 <- ... # fill me in!\n", "\n", "test_2()" ], "id": "e6646046-937e-4fd4-b241-18e7526aeafd" }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "In this notebook, we learned how to load and manipulate data using\n", "various R packages and commands. You also learned how to factor\n", "variables and create dummies to meet the needs of your statistical\n", "research.\n", "\n", "Don’t hesitate to come back to this notebook and apply what you’ve\n", "learned here to new data sets. You may now proceed to Part 2 on Intro to\n", "Data." ], "id": "52fd5e0e-7294-4bfb-89ae-39f5c228e551" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }