library(haven)
library(tidyverse)
library(IRdisplay)
ECON 490: Generating Variables (5)
Prerequisites
- Import datasets in csv and dta format.
- Save files.
Learning Outcomes
In this module, you will learn to:
- Generate dummy (or indicator) variables using
ifelse
andcase_when
. - Create new variables using
mutate
. - Rename variables using
rename
.
5.1 Review of the Data Opening Procedure
We’ll continue working with the fake data data set introduced in the previous lecture. Recall that this data set is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
Last lecture we looked at the process of loading our “fake_data” data set into R and preparing it for analysis. Specifically, we covered the following important points: 1. Import the relevant package (Haven) which gives us access to commands for loading the data. Additionally, import the tidyverse package in order to clean our data. 2. Use the read_csv
or read_dta
functions to load our data set. 3. Clean our data by factorizing all important variables.
Let’s run through this procedure quickly so that we are all ready to do our analysis.
<- read_csv("../econ490-stata/fake_data.csv")
fake_data <- as_factor(fake_data) fake_data
5.2 Generating Dummy Variables
The most common type of variable we will create during data cleaning and analysis is the dummy variable. Dummy variables are variables that can only take on two values: 0 and 1. It is useful to think of a dummy variable as being the answer to a question that can be answered “yes” or “no”. With a dummy variable the answer yes is coded as “1” and no is coded as “0”.
Examples of question that are used to create dummy variables can include:
- Is the person female? Females are coded “1” and males are coded “0”.
- Does the person have a university degree? People with a university degree are coded “1” and everyone else is coded “0”.
- Is the person married? Married people are coded “1” and everyone else is coded “0”.
- Is the person a millennial? People born between 1980 and 1996 are coded “1” and those born in other years are coded “0”.
As you have probably already figured out, dummy variables are used primarily for data that is qualitative and cannot be ranked in any way. For example, being married is qualitative and “married” is neither higher nor lower than “single”, thus we could create a dummy variable coded as 1 if the person is married and 0 if not. However, dummy variables sometimes also refer to variables that are qualitative and ranked, such as level of education. They can also refer to variables that are quantitative, such as age groupings.
It is important to remember that dummy variables must always be used when we want to include categorical (qualitative) variables in our analysis. These are variables such as sex, gender, race, marital status, religiosity, immigration status, etc. Without creating dummy variables for these demographics, analysis of the results from data analysis, regression, and other research will not be meaningful, as we are working with variables which have been numerically scaled in an arbitrary way. This is especially true for interpreting the coefficients outputted from regression.
5.2.1 Creating dummy variables using ifelse
We can use the ifelse
function to create a simple dummy variable. This command generates a completely new variable based on certain conditions. Let’s do an example where we create a dummy variable that indicates if the observation identified as female.
$female = ifelse(fake_data$sex == "F", 1, 0) fake_data
What R interprets here is that IF the condition sex == "F"
holds, our dummy will take the value of 1; otherwise (ELSE), it will take the value of 0. Depending on what we’re doing, we may want it to be the case that when sex is missing, our dummy is zero. We can first check if we have any missing observations for a given variable by using the is.na
function nested within the any
function. If there are any missing values for the sex variable in this data set, the code below will return TRUE. This helps us see whether any data is in fact missing for sex.
sum(is.na(fake_data$sex))
It appears that there are no missing observations for the sex variable. Nonetheless, if we wanted to account for missing values and ensure that they were denoted as 0 for the dummy female, we can invoke the is.na
function as an additional condition in our function as is done below.
$female = ifelse(fake_data$sex == "F" & !is.na(fake_data$sex), 1, 0) fake_data
The above condition within our function says that female == 1 only when sex == “F” and sex is not marked as NA (since !is.na must be TRUE).
5.2.2 Creating a series of dummy variables using ifelse
We now know how to create singular dummy variables with ifelse
. However, we may also want to create dummy variables corresponding to a whole set of categories for a given variable - for example, one for each region identified in the data set. To do this, we can just meticulously craft a dummy for each category, such as reg1, reg2, reg3, and reg4. We must leave out one region to serve as our base group, being region 5, in order to avoid the dummy variable trap. The reason why we do this will be explained in greater detail in a future notebook; for now, just take it as given.
$reg1 = ifelse(fake_data$region == 1 & !is.na(fake_data$region), 1, 0)
fake_data$reg2 = ifelse(fake_data$region == 2 & !is.na(fake_data$region), 1, 0)
fake_data$reg3 = ifelse(fake_data$region == 3 & !is.na(fake_data$region), 1, 0)
fake_data$reg4 = ifelse(fake_data$region == 4 & !is.na(fake_data$region), 1, 0) fake_data
This command helped us generate five new dummy variables, one for each category for region. This was quite cumbersome. In general, there are packages out there which help to expedite this process in R. Fortunately, if we are running a regression on a qualitative variable such as region, R will generate the necessary dummy variables for us automatically.
5.2.3 Creating dummy variables using case_when
We can also use more complex functions to create dummy variables. An important one is the case_when
function. This function creates different values for an input based on specified cases. Specifically, it consists of a series of lines, and each line gives a (i) case and (ii) value for that case. This function is nearly always used to operate on either strings or variables which do not have numerical significance in terms of how they are coded. Otherwise, we could use simple operators such as <, >, and = to classify values of these variables and then invoke the ifelse
function as we did above. Unfortunately, we don’t have any variables in our “fake_data” data set which call for this and so we don’t have an example fit for this function. However, to see documentation for this useful case_when
function, run the code cell below!
?case_when
5.3 Generating Variables Based on Expressions
Sometimes we want to generate variables after some transformations (e.g. squaring, taking logs, combining different variables). We can do that by simply writing the expression as an argument to the function mutate
. This function manipulates our data frame by supplying to it a new column based on the function we input. For example, let’s create a variable called log_earnings which is the log of earnings.
<- fake_data %>% mutate(log_earnings = log(earnings))
fake_data
summary(fake_data$log_earnings)
Let’s try a second example. Let’s create a new variable that is the number of years since the year the individual started working.
<- fake_data %>% mutate(experience_proxy = year - start_year)
fake_data
summary(fake_data$experience_proxy)
The mutate
function allows us to easily add new variables to our data frame. If we wanted to instead replace a given variable with a new feature, say add one default year to all experience_proxy observations, we can simply redefine it directly in our data frame.
$experience_proxy <- fake_data$experience_proxy + 1 fake_data
5.4 Following Good Naming Conventions
Choosing good names for our variables is more important, and harder, than one might think! Sometimes, the variables in a data set have unrecognizable names which may be confusing when conducting research. In these cases, it is a good idea to change them immediately. In your research, you will also be creating your own variables (like dummy variables) for qualitative measures and will want to be careful about giving them good names. This is especially important for generating tables, since you will want your tables to be easily legible in your paper.
We can rename variables with the rename
function found inside the dplyr
package (which we can access via having loaded in R’s tidyverse). Let’ try to rename one of those dummy variables we created above. Maybe we know that if region = 3 then the region is in the west.
rename(fake_data, west = reg3)
Don’t worry about including every piece of information in your variable names. Instead, just try to be clear and concise. Avoid variable names that include unnecessary pieces of information and can only be interpreted by you. At the end of the day, you want others to be able to understand your work.
5.5 Wrap Up
When we are doing our own research, we always have to spend some time working with the data before beginning analysis. In this module, you have learned some important tools for manipulating data to get it ready for that analysis. Like everything else that you do in R, emphasis should be on readability and reproducibility in your code. This is pivotal for both you and your audience to understand your research. In the next module, we will explore how to create new variables for group level analysis, among other things.
The following table summarizes the main commands we have seen in this module.
Command | Function |
---|---|
ifelse() |
It creates a variable taking two values, based on whether it satisfies one certain condition. |
case_when() |
It creates a variable taking multiple values, based on whether it satisfies multiple conditions. |
mutate |
It creates a new variable based on an expression. |
rename |
It renames a variable. |