{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 06 - Generating Variables\n",
"\n",
"Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt \n",
"2024-05-29\n",
"\n",
"## Prerequisites\n",
"\n",
"1. Be able to effectively use Stata do-files and generate-log files.\n",
"2. Be able to change your directory so that Stata can find your files.\n",
"3. Import data sets in .csv and .dta format.\n",
"4. Save data files.\n",
"\n",
"## Learning Outcomes\n",
"\n",
"1. Explore your data set with commands like `describe`,\n",
" `browse`,`tabulate`, `codebook` and `lookfor`.\n",
"2. Generate dummy (or indicator) variables using the command `generate`\n",
" or `tabulate`.\n",
"3. Create new variables in Stata using `generate` and `replace`.\n",
"4. Rename and label variables.\n",
"\n",
"## 6.0 Intro"
],
"id": "0d23d12b-a3f1-4848-9c33-070cddca8466"
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import stata_setup\n",
"stata_setup.config('C:\\Program Files\\Stata18/','se')"
],
"id": "2f870c3b"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
">>> import sys\n",
">>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module 01, Section 1.3: Setting Up the STATA Path\n",
">>> from pystata import config\n",
">>> config.init('se')"
],
"id": "f1f116c1"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.1 Getting Started\n",
"\n",
"We’ll continue working with the fake data set introduced in the previous\n",
"lecture. Recall that this data set is simulating information of workers\n",
"in the years 1982-2012 in a fake country where a training program was\n",
"introduced in 2003 to boost their earnings.\n",
"\n",
"Last lecture we introduced a three step process to import data into\n",
"Stata:\n",
"\n",
"1. Clear the workspace.\n",
"2. Change the directory to the space where the data files we will use\n",
" are located.\n",
"3. Import the data using commands specific to the file type.\n",
"\n",
"Let’s run these commands now so we are all ready to do our analysis."
],
"id": "14fda225-743d-47d2-9958-8d589b98b500"
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"* Below you will need to include the path on your own computer to where the data is stored between the quotation marks.\n",
"\n",
"clear *\n",
"cd \" \"\n",
"import delimited using \"fake_data.csv\", clear"
],
"id": "c58ea590"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.2 Generating Variables\n",
"\n",
"### 6.2.1 Generating Variables using `generate`\n",
"\n",
"Generating variables is very simple in Stata. The syntax of the\n",
"`generate` command is relatively straightforward: we first tell Stata we\n",
"want to `generate` a variable, we provide Stata with a name for this new\n",
"variable, and we indicate the condition for Stata to follow in\n",
"generating this variable. All in all, our line of come will look like\n",
"this:\n",
"\n",
"``` stata\n",
"generate name_of_variable insert_condition\n",
"```\n",
"\n",
"In a future sub-section, we will look in more detail at how to do this\n",
"for the particular case of dummy variables. First, let’s review what\n",
"dummy variables are!\n",
"\n",
"### 6.2.2 Dummy Variables\n",
"\n",
"Dummy variables are variables that can only take on two values: 0 and 1.\n",
"It is useful to think of a dummy variable as the answer to a “yes” or\n",
"“no” question. With a dummy variable, the answer yes is coded as “1” and\n",
"no is coded as “0”.\n",
"\n",
"Examples of question that are used to create dummy variables include:\n",
"\n",
"1. Is the person female? Females are coded “1” and everyone else is\n",
" coded “0”.\n",
"2. Does the person have a university degree? People with a degree are\n",
" coded “1” and everyone else is coded “0”.\n",
"3. Is the person married? Married people are coded “1” and everyone\n",
" else is coded “0”.\n",
"4. Is the person a millennial? People born between 1980 and 1996 are\n",
" coded “1” and those born in other years are coded “0”.\n",
"\n",
"As you have probably already figured out, dummy variables are used\n",
"primarily for data that is qualitative and cannot be ranked in any way.\n",
"For example, being married is qualitative and “married” is neither\n",
"higher nor lower than “single”. But they are sometimes also used for\n",
"variables that are qualitative and ranked, such as level of education.\n",
"Further, dummy variables are sometimes used for variables that are\n",
"quantitative, such as age groupings.\n",
"\n",
"It is important to remember that dummy variables must always be used\n",
"when we want to include categorical (qualitative) variables in our\n",
"analysis. These are variables such as sex, gender, race, marital status,\n",
"religiosity, immigration status etc. We can’t use these variables\n",
"without creating a dummy variable because the results found would in no\n",
"way be meaningful, as we are working with variables which have been\n",
"numerically scaled in an arbitrary way. This is especially true for\n",
"interpreting the coefficients outputted from regression.\n",
"\n",
"### 6.2.3 Creating Dummy Variables using `generate`\n",
"\n",
"As an example, let’s create a dummy variable which indicates if the\n",
"observation is identified as female. To do this, we are going to use the\n",
"command `generate` which generates a completely new variable."
],
"id": "36b29089-a764-4ec8-873b-02272c799d20"
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"generate female = 1 if sex == \"F\""
],
"id": "a8291a4d"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What Stata does here is that it defines our dummy variable as 1 whenever\n",
"the condition `sex == \"F\"` holds. However, we didn’t tell Stata what to\n",
"do if the condition `sex == \"M\"` does not hold! Let’s do that below."
],
"id": "26ad0471-b8e4-476a-ad94-65ca1b70a2ef"
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"generate female = 0 if sex == \"M\""
],
"id": "b067e338"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whoops! We got an error. This says that our variable is already defined.\n",
"Stata does this because it doesn’t want us to accidentally overwrite an\n",
"existing variable. Whenever we want to replace an existing variable, we\n",
"have to use the command `replace`."
],
"id": "15867de8-5310-4cb3-a56e-fa8858c16bfb"
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"replace female = 0 if sex == \"M\""
],
"id": "1218d218"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is another, simpler way to create a dummy variable, which is shown\n",
"below."
],
"id": "05c8bfe9-dc69-4a0a-91a7-dc28d8c09500"
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"replace female = ( sex == \"F\") "
],
"id": "444932e2"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What Stata does here is that it defines our dummy variable as 1 whenever\n",
"the condition `sex == \"F\"` holds. Otherwise, it directly makes the\n",
"variable take the value of zero. Depending on what we’re doing, we may\n",
"want it to be the case that our dummy takes on the value of 0 when *sex*\n",
"is missing. We could do that as we did above, using the `replace`\n",
"command.\n",
"\n",
"We could have also used the command `capture drop female` before we used\n",
"`generate`. The `capture` command tells Stata to ignore any error in the\n",
"command that immediately follows. In this example, this would do the\n",
"following:\n",
"\n",
"- If the variable that is being dropped (here, *female*) didn’t exist,\n",
" the `drop female` command would automatically create an error. The\n",
" `capture` command tells Stata to ignore that problem.\n",
"- If the variable (*female*) did exist already, the `drop female`\n",
" command would work just fine, so that line will proceed as normal.\n",
"\n",
"### 6.2.4 Creating Multiple Dummy Variables using `tabulate`\n",
"\n",
"We already talked about how to create dummy variables with `generate`\n",
"and `replace`. Let’s see how this can be done for a whole set of dummy\n",
"variables. For our example, we will create one dummy for each region\n",
"identified in the data set."
],
"id": "88040ce5-c860-4da9-91f1-5f6f4402c1f5"
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"tabulate region, generate(reg)"
],
"id": "9fa2db13"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This command generated five new dummy variables, one for each region\n",
"category. We asked Stata to call these variables “reg”, and so these\n",
"five new variables are called *reg1*, *reg2*, *reg3*, *reg4*, and\n",
"*reg5*. We can run the command `describe` alongside each of these\n",
"variables, or we can simply run `describe reg*`, which provides\n",
"information for all variables starting with “reg”. Stata has helpfully\n",
"labeled these variables with data labels from the region variable.\n",
"Sometimes, we might want to change the names for our own project to\n",
"something that is more meaningful to us."
],
"id": "60363c63-b175-4269-83ba-3e3a941f32d2"
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"describe reg*"
],
"id": "adecc7c6"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.3 Generating Variables Based on Expressions\n",
"\n",
"Sometimes we want to generate variables after some transformations\n",
"(e.g. squaring, taking logs, combining different variables). We can do\n",
"that by simply writing the expression for the desired transformation.\n",
"For example, let’s create a new variable that is simply the natural log\n",
"of earnings."
],
"id": "8ab4cdcb-a78c-4194-8003-91e6b68b239d"
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"generate log_earnings = log(earnings)"
],
"id": "3f65a56a"
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"summarize earnings log_earnings"
],
"id": "7c7e179b"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s try a second example. Let’s create a new variable that is the\n",
"number of years since the year the individual started working."
],
"id": "fcdafb12-cead-42a8-82d2-64139147d062"
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"generate experience_proxy = year - start_year"
],
"id": "fc65c29c"
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"summarize experience_proxy"
],
"id": "d433538e"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try this out for yourself! Can you create a variable that indicates the\n",
"number of years until/since the training program?"
],
"id": "4efe740a-e6af-4e4b-9a46-35b82e7b07cf"
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"*try here!"
],
"id": "032ed2f0"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.4 Following Good Naming Conventions\n",
"\n",
"Choosing good names for our variables is more important, and harder,\n",
"than we might think! Some of the variables in an original data set may\n",
"have very unrecognizable names, which can be confusing when conducting\n",
"research. In these cases, changing them early on is preferable. We will\n",
"also be creating our own variables, such as dummy variables for\n",
"qualitative measures, and we will want to be careful about giving them\n",
"good names. This will become even more pertinent once we start\n",
"generating tables, since we will want all of our variables to have\n",
"high-quality names that will easily carry over to a paper for ease of\n",
"comprehension on the reader’s part.\n",
"\n",
"Luckily, we can always rename our variables with the command `rename`.\n",
"Let’s try to rename one of the dummy variables we just created above.\n",
"Maybe we know that if region = 3 then the region is in the west."
],
"id": "8ab9c3bf-dd0e-4971-96ed-9b9854975991"
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"rename reg3 west\n",
"describe west"
],
"id": "9058c88f"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Importantly, we don’t need to include every piece of information in our\n",
"variable name. Most of the important information is included in the\n",
"variable label (more on that in a moment). We should always avoid\n",
"variable names that include unnecessary pieces of information and can\n",
"only be interpreted by the researcher.\n",
"\n",
"**Pro tip:** Stata is case sensitive, so put all of your variables in\n",
"lower case to avoid errors.\n",
"\n",
"## 6.5 Creating Variable Labels\n",
"\n",
"It is important that anyone using our data set knows what each variable\n",
"measures. We can add a new label, or change a variable label, at any\n",
"time by using the label variable command. Continuing the example from\n",
"above, if we create a new dummy variable indicating whether people are\n",
"female, we will want to add a label to this new variable. To do this,\n",
"the appropriate command would be:"
],
"id": "dd8e2a3a-4c54-4505-bcba-97fa2a3e0479"
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"label variable female \"Female Dummy\""
],
"id": "8e464931"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we describe the data, we will see this extra information in the\n",
"variable label column. See for yourself!"
],
"id": "66a8a497-e484-4530-b435-e97cffbe9413"
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"\n",
"describe female"
],
"id": "a2474211"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.6 Encoding and Stringing Variables\n",
"\n",
"Sometimes, we might want to transform the type of variable we are using.\n",
"For example, we might want to transform a string variable into a numeric\n",
"one. We went over variable types in [Module\n",
"3](https://comet.arts.ubc.ca/docs/Research/econ490-stata/03_Stata_Essentials.html).\n",
"\n",
"Stata luckily has commands that can help us do this! Let’s say we have a\n",
"quantitative variable from a data set we found online, but Stata is\n",
"interpreting this variable as a string. This will pose some issues later\n",
"in our analysis, for example if we want to use it in regressions, so it\n",
"is best to encode this variable. There are many ways to do this, but one\n",
"of the simplest will be to generate a numeric variable by making a\n",
"`real` transformation of the string one. The syntax is the following:\n",
"\n",
"``` stata\n",
"generate new_numeric_var = real(old_string_var)\n",
"```\n",
"\n",
"We can do the exact same thing to transform a numeric variable into a\n",
"string by making a `string` transformation. See below:\n",
"\n",
"``` stata\n",
"generate new_string_var = string(old_numeric_var)\n",
"```\n",
"\n",
"Try this out yourself!\n",
"\n",
"## 6.7 Wrap Up\n",
"\n",
"When we are doing our own research, we **always** have to spend some\n",
"time working with the data before beginning our analysis. In this\n",
"module, we have learned some important tools for manipulating data to\n",
"get it ready for that analysis. Like everything else that we do in\n",
"Stata, these manipulations should be done in a do-file, so that we\n",
"always know exactly what we have done with our data. Losing track of\n",
"those changes can cause some very serious mistakes when we start to do\n",
"our research! In the [next\n",
"module](https://comet.arts.ubc.ca/docs/Research/econ490-stata/07_Within_Group.html),\n",
"we will look at how to do analysis on the sub-groups of variables in our\n",
"data set.\n",
"\n",
"## 6.8 Wrap-up Table\n",
"\n",
"| Command | Function |\n",
"|-------------|-----------------------------------------------------------|\n",
"| `tabulate` | It provides a list of the different values of a variable. |\n",
"| `summarize` | It provides the summary statistics of a variable. |\n",
"| `generate` | It generates a new variable. |\n",
"| `replace` | It replaces specific values of a variable. |\n",
"\n",
"## References\n",
"\n",
"[How to create a date variable from a date stored as a\n",
"string](https://www.youtube.com/watch?v=M3XVgPJuFzU)
[How to create\n",
"a categorical variable from a continuous\n",
"variable](https://www.youtube.com/watch?v=XWVaXN2KwmA)
[How to\n",
"create a new variable that is calculated from other (multiple)\n",
"variables](https://www.youtube.com/watch?v=E_wCh0rf4p8)"
],
"id": "8b55b8df-1c12-4a75-9f8b-dfcb04220fb1"
}
],
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"path": "/usr/local/share/jupyter/kernels/python3"
},
"language_info": {
"name": "python",
"codemirror_mode": {
"name": "ipython",
"version": "3"
},
"file_extension": ".py",
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
}
}