{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 05 - Generating Variables\n", "\n", "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt \n", "2024-05-29\n", "\n", "## Prerequisites\n", "\n", "1. Import data sets in .csv and .dta format.\n", "2. Save files.\n", "\n", "## Learning Outcomes\n", "\n", "1. Generate dummy (or indicator) variables using `ifelse` and\n", " `case_when`.\n", "2. Create new variables using `mutate`.\n", "3. Rename variables using `rename`.\n", "\n", "## 5.1 Review of the Data Loading Procedure\n", "\n", "We’ll continue working with the fake data set introduced in the previous\n", "lecture. Recall that this data set is simulating information of workers\n", "in the years 1982-2012 in a fake country where a training program was\n", "introduced in 2003 to boost their earnings.\n", "\n", "In the previous module ([Module\n", "4](https://comet.arts.ubc.ca/docs/Research/econ490-r/04_Opening_Datasets.html)),\n", "we looked at the process of loading our “fake_data” data set into R and\n", "preparing it for analysis. Specifically, we covered the following\n", "important points:\n", "\n", "1. Importing the relevant package (haven) which gives us access to\n", " commands for loading the data. Additionally, importing the tidyverse\n", " package in order to clean our data.\n", "2. Using the `read_csv` or `read_dta` functions to load our data set.\n", "3. Cleaning our data by factorizing all important variables.\n", "\n", "Let’s run through this procedure quickly so that we are all ready to do\n", "our analysis." ], "id": "9626088a-d8c2-4b53-b5d4-3cfc93703645" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(haven)\n", "library(tidyverse)\n", "library(IRdisplay)" ], "id": "97ab23b9-29c7-495f-b44d-0a53c2c5ac9c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data <- read_dta(\"../econ490-stata/fake_data.dta\")\n", "fake_data <- as_factor(fake_data)" ], "id": "3c2239f7-a526-4e0c-a305-c4692249f158" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 Generating Dummy Variables\n", "\n", "Dummy variables are variables that can only take on two values: 0 and 1.\n", "It is useful to think of a dummy variable as the answer to a “yes” or\n", "“no” question. With a dummy variable, the answer yes is coded as “1” and\n", "no is coded as “0”.\n", "\n", "Examples of question that are used to create dummy variables include:\n", "\n", "1. Is the person female? Females are coded “1” and everyone else is\n", " coded “0”.\n", "2. Does the person have a university degree? People with a degree are\n", " coded “1” and everyone else is coded “0”.\n", "3. Is the person married? Married people are coded “1” and everyone\n", " else is coded “0”.\n", "4. Is the person a millennial? People born between 1980 and 1996 are\n", " coded “1” and those born in other years are coded “0”.\n", "\n", "As you have probably already figured out, dummy variables are used\n", "primarily for data that is qualitative and cannot be ranked in any way.\n", "For example, being married is qualitative and “married” is neither\n", "higher nor lower than “single”. But they are sometimes also used for\n", "variables that are qualitative and ranked, such as level of education.\n", "Further, dummy variables are sometimes used for variables that are\n", "quantitative, such as age groupings.\n", "\n", "It is important to remember that dummy variables must always be used\n", "when we want to include categorical (qualitative) variables in our\n", "analysis. These are variables such as sex, gender, race, marital status,\n", "religiosity, immigration status etc. We can’t use these variables\n", "without creating a dummy variable because the results found would in no\n", "way be meaningful, as we are working with variables which have been\n", "numerically scaled in an arbitrary way. This is especially true for\n", "interpreting the coefficients outputted from regression.\n", "\n", "### 5.2.1 Creating dummy variables using `ifelse`\n", "\n", "We can use the `ifelse` function to create a simple dummy variable. This\n", "command generates a completely new variable based on certain conditions.\n", "Let’s do an example where we create a dummy variable that indicates if\n", "the observation identified as female." ], "id": "1db69f22-e623-4530-a174-cc69f140803b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data$female = ifelse(fake_data$sex == \"F\", 1, 0)" ], "id": "c4612944-27e8-49ce-8829-ce81449d7945" }, { "cell_type": "markdown", "metadata": {}, "source": [ "What R interprets here is that IF the condition `sex == \"F\"` holds, our\n", "dummy will take the value of 1; otherwise (ELSE), it will take the value\n", "of 0. Depending on what we’re doing, we may want it to be the case that\n", "when *sex* is missing, our dummy is zero. We can first check if we have\n", "any missing observations for a given variable by using the `is.na`\n", "function nested within the `any` function. If there are any missing\n", "values for the *sex* variable in this data set, the code below will\n", "return TRUE. This helps us see whether any data is in fact missing for\n", "*sex*." ], "id": "0fb6d781-2e30-4146-b414-a7d5373850c0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "any(is.na(fake_data$sex))" ], "id": "31585780-0c21-4176-a784-455bd72eb423" }, { "cell_type": "markdown", "metadata": {}, "source": [ "It appears that there are no missing observations for the *sex*\n", "variable. Nonetheless, if we wanted to account for missing values and\n", "ensure that they were denoted as 0 for the dummy *female*, we can invoke\n", "the `is.na` function as an additional condition in our function as is\n", "done below." ], "id": "e9dff00c-d23d-4567-bf25-10f7307e81c7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data$female = ifelse(fake_data$sex == \"F\" & !is.na(fake_data$sex), 1, 0)" ], "id": "f7730408-dbe1-484e-96ac-d6f6aa8fb04a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above condition within our function says that *female* == 1 only\n", "when *sex* == “F” and *sex* is not marked as NA (since `!is.na` must be\n", "TRUE).\n", "\n", "### 5.2.2 Creating a series of dummy variables using `ifelse`\n", "\n", "We now know how to create singular dummy variables with `ifelse`.\n", "However, we may also want to create dummy variables corresponding to a\n", "whole set of categories for a given variable - for example, one for each\n", "region identified in the data set. To do this, we can just meticulously\n", "craft a dummy for each category, such as *reg1*, *reg2*, *reg3*, and\n", "*reg4*. We must leave out one region to serve as our base group, being\n", "region 5, in order to avoid the dummy variable trap. The reason why we\n", "do this will be explained in greater detail in a future notebook; for\n", "now, just take it as given." ], "id": "b10ab123-4a89-411a-943f-3d4ec7492c20" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data$reg1 = ifelse(fake_data$region == 1 & !is.na(fake_data$region), 1, 0)\n", "fake_data$reg2 = ifelse(fake_data$region == 2 & !is.na(fake_data$region), 1, 0)\n", "fake_data$reg3 = ifelse(fake_data$region == 3 & !is.na(fake_data$region), 1, 0)\n", "fake_data$reg4 = ifelse(fake_data$region == 4 & !is.na(fake_data$region), 1, 0)" ], "id": "caca6a21-a104-406d-95f8-eaedac6b6e57" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This command helped us generate four new dummy variables, one for each\n", "category for each region. This was quite cumbersome though. In general,\n", "there are packages out there which help to expedite this process in R.\n", "Fortunately, if we are running a regression on a qualitative variable\n", "such as *region*, R will generate the necessary dummy variables for us\n", "automatically!\n", "\n", "### 5.2.3 Creating Dummy Variables using `case_when`\n", "\n", "We can also use more complex functions to create dummy variables. An\n", "important one is the `case_when` function. This function creates\n", "different values for an input based on specified cases. Specifically, it\n", "consists of a series of lines, and each line gives a (i) case and (ii)\n", "value for that case. This function is nearly always used to operate on\n", "either strings or variables which do not have numerical significance in\n", "terms of how they are coded. Otherwise, we could use simple operators\n", "such as \\<, \\>, and = to classify values of these variables and then\n", "invoke the `ifelse` function as we did above. Unfortunately, we don’t\n", "have any variables in our “fake_data” data set which call for this and\n", "so we don’t have an example fit for this function. However, to see\n", "documentation for this useful `case_when` function, run the code cell\n", "below!" ], "id": "b268b69e-6453-4311-b6a8-1b3a07923a62" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?case_when" ], "id": "9e0c2968-a811-4d0c-9173-9033e7c5570a" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.3 Generating Variables Based on Expressions\n", "\n", "Sometimes we want to generate variables after some transformations\n", "(e.g. squaring, taking logs, combining different variables). We can do\n", "that by simply writing the expression as an argument to the function\n", "`mutate`. This function manipulates our data frame by supplying to it a\n", "new column based on the function we input. For example, let’s create a\n", "variable called *log_earnings* which is the log of earnings." ], "id": "fe38f922-8779-4977-bcda-f44bde7fb8c9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data <- fake_data %>% mutate(log_earnings = log(earnings))\n", "\n", "summary(fake_data$log_earnings)" ], "id": "718c7d25-e7cb-43a4-b1ad-e821c91bb9d9" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s try a second example. Let’s create a new variable that is the\n", "number of years since the year the individual started working." ], "id": "a96b3405-dc88-4091-9f35-9a52c20f1794" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data <- fake_data %>% mutate(experience_proxy = year - start_year)\n", "\n", "summary(fake_data$experience_proxy)" ], "id": "771ee6c4-ccd4-4d99-9faa-d0e225031c1e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try this out for yourself! Can you create a variable that indicates the\n", "number of years until/since the training program?" ], "id": "2f573409-b464-4c42-addf-a2d0c91f6671" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#try here!" ], "id": "806990bf-3a9c-4c4a-a341-b375c6f28127" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `mutate` function allows us to easily add new variables to our data\n", "frame. If we wanted to instead replace a given variable with a new\n", "feature, say add one default year to all *experience_proxy*\n", "observations, we can simply redefine it directly in our data frame." ], "id": "3f4afa74-6d4f-4a8e-8852-4457cb7f87fa" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data$experience_proxy <- fake_data$experience_proxy + 1" ], "id": "361b5ce7-93bf-4816-99c1-332a13823495" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.4 Following Good Naming Conventions\n", "\n", "Choosing good names for our variables is more important, and harder,\n", "than we might think! Some of the variables in an original data set may\n", "have very unrecognizable names, which can be confusing when conducting\n", "research. In these cases, changing them early on is preferable. We will\n", "also be creating our own variables, such as dummy variables for\n", "qualitative measures, and we will want to be careful about giving them\n", "good names. This will become even more pertinent once we start\n", "generating tables, since we will want all of our variables to have\n", "high-quality names that will easily carry over to a paper for ease of\n", "comprehension on the reader’s part.\n", "\n", "We can rename variables with the `rename` function found inside the\n", "`dplyr` package (which we can access via having loaded in R’s\n", "tidyverse). Let’s try to rename one of those dummy variables we created\n", "above. Maybe we know that if region = 3 then the region is in the west." ], "id": "ec4693e0-10f8-436f-ae55-9238aab451b7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rename(fake_data, west = reg3)" ], "id": "0c0075ab-53d8-4b42-927d-071f38d22b90" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don’t worry about including every piece of information in your variable\n", "names. Instead, just try to be clear and concise. Avoid variable names\n", "that include unnecessary pieces of information and can only be\n", "interpreted by you. At the end of the day, you want others to be able to\n", "understand your work.\n", "\n", "## 5.5 Wrap Up\n", "\n", "When we are doing our own research, we **always** have to spend some\n", "time working with the data before beginning analysis. In this module, we\n", "have learned some important tools for manipulating data to get it ready\n", "for that analysis. Like everything else that we do in R, these\n", "manipulations should be done in a script, so that we always know exactly\n", "what we have done with our data. Losing track of those changes can cause\n", "some very serious mistakes when we start to do our research! In the\n", "[next\n", "module](https://comet.arts.ubc.ca/docs/Research/econ490-r/06_Within_Group.html),\n", "we will look at how to do analysis on the sub-groups of variables in our\n", "data set.\n", "\n", "## 5.6 Wrap-up Table\n", "\n", "| Command | Function |\n", "|--------------------------------|----------------------------------------|\n", "| `ifelse()` | It creates a variable taking two values, based on whether it satisfies one certain condition. |\n", "| `case_when()` | It creates a variable taking multiple values, based on whether it satisfies multiple conditions. |\n", "| `mutate` | It creates a new variable based on an expression. |\n", "| `rename` | It renames a variable. |" ], "id": "6fd046af-0805-45e9-9b47-1f63b2986226" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }