03 - Stata Essentials
Prerequisites
- Understand how to effectively use Stata do-files and know how to generate log files.
Learning Outcomes
- View the characteristics of any dataset using the command describe.
- Use help to learn how best to run commands.
- Understand the Stata command syntax using the command summarize.
- Create loops using the commands forvalues, foreach, and while.
3.1 Describing Your Data
Let’s start by opening a dataset that was provided when we installed Stata onto our computers. We will soon move on to importing our own data, but this Stata data set will help get us started. It contains data on automobiles and their characteristics. We can load this dataset by running the command in the cell below:
sysuse auto.dta, clear
We can begin by checking the characteristics of the data set we have just loaded. The command describe allows us to see the number of observations, the number of variables, a list of variable names and descriptions, and the variable types and labels of that data set.
describe
Notice that this data set consists of 12 variables and 74 observations. We can see that the first variable is named make, which indicates the make and model of the vehicle. We can also see that the variable make is a string variable (made up of text). Other variables in this data set are numeric. For example, the variable mpg indicates the vehicle’s mileage (miles per gallon) as an integer. The variable foreign is also numeric, and it only takes the values 0 or 1, indicating whether the car is foreign or domestically made; this is a dummy variable.
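As a small sketch of how describe can be narrowed down, the lines below describe only a few variables and then request a condensed summary. The choice of variables is ours, purely for illustration; the short option is part of describe's documented syntax.

```stata
sysuse auto, clear

* Describe only the three variables discussed above
describe make mpg foreign

* Show just the general information (observations, variables) without the variable list
describe, short
```

Listing variable names after describe is handy in wide data sets where the full output would run for many screens.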
3.2 Introduction to Stata Command Syntax
3.2.1 Using HELP to understand commands
Stata has a help manual installed in the program which provides documentation for all official Stata commands. This information can be reached by typing the command help followed by the name of the command we need extra information about.
Let’s try to see what extra information Stata provides by using the help command with the summarize command. summarize gives us the basic statistics for any variable(s) in the data set, such as the variables we have discussed above, but what else can it do? To see the extra information that is available for summarize, let’s run the command below:
help summarize
We need to run this command directly in the Stata console on our computer in order to be able to see all of the information provided by help. Running this command now will allow us to see that output directly.
When we do, we can see that the first few letters of the command are often underlined. This underlining indicates the shortest permitted abbreviation for a command (or option).
For example, if we type help rename, we can see that rename can be abbreviated as ren, rena, or renam, or it can be spelled out in its entirety.
Other examples are generate (shortest abbreviation g), append (ap), rotate (rot), and run (ru).
If there is no underline, then no abbreviation is allowed. For example, the command replace cannot be abbreviated. The reason for this is that Stata doesn’t want us to accidentally make changes to our data by replacing the information in the variable.
We can write the summarize command with its shortest abbreviation su or a longer abbreviation such as sum.
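To illustrate the abbreviation rule, the three lines below should produce identical output, since each is a permitted spelling of the same command. The variable price is chosen only as an example.

```stata
sysuse auto, clear

* Full command name
summarize price

* A longer abbreviation
sum price

* The shortest permitted abbreviation
su price
```

Abbreviations save typing in the console, but spelling commands out in do-files can make them easier for others to read.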
Also, in the Stata help output we can see that some words are written in blue and are encased within square brackets. We will talk more about these options below, but in Stata we can directly click on those links for more information from help.
Finally, help provides a list of the available options for a command. In the case of summarize, these options allow us to display extra information for a variable. We will learn more about this below in section 3.2.4.
3.2.2 Imposing IF conditions
When the syntax of the command allows for [if], we can run the command on a subset of the data that satisfies any condition we choose. Here is the list of conditional operators available to us:
- Equal: ==
- Greater than and less than: > and <
- Greater than or equal and less than or equal: >= and <=
- Not Equal: !=
We can also compound different conditions using the list of logical operators:
- And: &
- Or: |
- Not: ! or ~
Let’s look at an example which applies this new knowledge: summarizing the variable price when the make of the car is domestic (i.e. not foreign):
su price if foreign == 0
Let’s do this again, but now we will impose the additional condition that the mileage must be less than 25.
su price if foreign == 0 & mpg < 25
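The "or" and "not" operators can be combined in the same way. The sketch below uses thresholds we picked arbitrarily for illustration. Note that & is evaluated before |, so parentheses are a good way to make the intended grouping explicit.

```stata
sysuse auto, clear

* Foreign cars OR cars with mileage above 30
su price if foreign == 1 | mpg > 30

* NOT foreign: the same subset as foreign == 0, because foreign has no missing values here
su price if !(foreign == 1)

* Domestic cars with either very low or very high mileage
su price if foreign == 0 & (mpg < 15 | mpg > 30)
```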
Maybe we want to restrict to a particular list of values. Here we can write out all of the conditions using the “or” operator, or we can simply make use of the function inlist():
su price if mpg == 10 | mpg == 15 | mpg == 25 | mpg == 40
This works exactly the same way as this command:
su price if inlist(mpg,10,15,25,40)
Maybe we want to restrict to values in a particular range. Here we can use the conditional operators, or we can make use of the function inrange():
su price if mpg >= 5 & mpg <= 25
Notice the output returned by the code below is equal to the previous cell:
su price if inrange(mpg,5,25)
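One way to confirm that the two spellings select the same observations is to count them and compare. This is only a sketch: count and assert are standard Stata commands that this module has not introduced, and the local names n_or and n_range are our own.

```stata
sysuse auto, clear

* Count matches for the long "or" condition and for inlist()
count if mpg == 10 | mpg == 15 | mpg == 25 | mpg == 40
local n_or = r(N)
count if inlist(mpg, 10, 15, 25, 40)
assert r(N) == `n_or'

* Count matches for the explicit range and for inrange()
count if mpg >= 5 & mpg <= 25
local n_range = r(N)
count if inrange(mpg, 5, 25)
assert r(N) == `n_range'
```

assert stops with an error if its condition is false, so the block running silently to the end indicates that the counts agree.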
There might be variables for which there is no information recorded for some observations. For example, when we summarize our automobile data we will see that there are 74 observations for most variables, but that the variable rep78 has only 69 observations: for five observations there is no repair record indicated in the data set.
su price rep78
If, for some reason, we only want to consider observations without missing values, we can use the condition !missing(), which combines the function missing() with the logical “not” operator “!”. For example, the command below says to summarize the variable price for all observations for which rep78 is NOT missing.
su price if !missing(rep78)
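This command can also be written using the “not equal” operator, since missing values of numeric variables are indicated by a “.”. Note that Stata also has extended missing values (.a through .z), which this comparison does not exclude, whereas !missing() handles them all. The simpler form is shown below: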
su price if rep78 != .
Notice that in both cases there are only 69 observations.
If we wanted to do this with string variables, we could indicate missing values with "" (an empty string).
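As a brief sketch of the string case, the lines below check the string variable make for missing values in two ways. The function missing() accepts string arguments as well as numeric ones. In this data set every car has a make, so both counts should be zero.

```stata
sysuse auto, clear

* Count observations whose make is the empty string
count if make == ""

* missing() treats the empty string as missing, so this is equivalent
count if missing(make)
```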
3.2.3 Imposing IN conditions
We can also subset the data by using the observation number. The example below summarizes the data in observations 1 through 10.
su price in 1/10
But be careful! This type of condition is generally not recommended because it depends on how the data is ordered.
To see this, let’s sort the observations in ascending order of price by running the command sort:
sort price
su price in 1/10
We can see that the result changes because the observations 1 through 10 in the data are now different.
Always avoid using in whenever you can. Try to use if instead!
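By contrast, a condition written with if selects the same observations no matter how the data are sorted. The sketch below checks this by storing results before and after sorting; the local names and the use of assert are our own additions.

```stata
sysuse auto, clear

* Summarize domestic cars and remember the number of observations and the maximum
su price if foreign == 0
local n_before = r(N)
local max_before = r(max)

* Reorder the data, then run exactly the same command
sort price
su price if foreign == 0

* The subset selected by if has not changed
assert r(N) == `n_before' & r(max) == `max_before'
```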
3.2.4 Command options
When we used the help command, we saw that we can introduce some optional arguments after a comma. In the case of the summarize command, we were shown the following options: detail (shortest abbreviation d), meanonly (mean), format (f), and separator(#) (sep).
If we want additional statistics apart from the mean, standard deviation, min, and max values, we can use the option detail or just d for short.
su price, d
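The other options follow the same pattern. The sketch below also uses stored results such as r(mean) and r(p50), which summarize leaves behind after it runs. They are not covered in this module, so treat that part as a preview rather than required material.

```stata
sysuse auto, clear

* meanonly suppresses the output table but still stores results in r()
su price, meanonly
display "Mean price: " r(mean)

* detail stores extra results, such as the median in r(p50)
su price, detail
display "Median price: " r(p50)

* separator(2) draws a dividing line after every 2 variables in the table
su price mpg weight length, separator(2)
```

Typing return list immediately after summarize shows everything the command has stored.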
3.3 Using Loops
Much like any other programming language, Stata has for-style and while loops that we can use to repeat a set of commands many times. In Stata, the for-style loops are written with forvalues (which iterates across a range of numbers) and foreach (which iterates across a list of names).
It is very common that these loops create a local scope (i.e. the iteration labels only exist within a loop). A local in Stata is a special variable that we create ourselves that temporarily stores information. We’ll discuss locals in the next module, but consider this simple example in which the letter “i” is used as a placeholder for the number 95; it is a local.
For a better understanding of locals and globals, please visit Module 4.
local i = 95
display `i'
We can also create locals that are strings rather than numeric in type. Consider this example:
local course = "ECON 490"
display "`course'"
We can store anything inside a local. When we want to use that information, we include the name of the local encased in a backtick (`) and an apostrophe (').
local course = "ECON 490"
display "I am enrolled in `course' and hope my grade will be `i'%!"
3.3.1 Creating loops Using forvalues
Whenever we want to iterate across a range of values, we can use the syntax forvalues local_var_name = min_value(step)max_value, as in the command below. Here we are iterating from 1 to 10 in increments of 1.
forvalues counter=1(1)10{
*Notice that now counter is a local variable
display `counter'
}
Notice that the open brace { needs to be on the same line as the forvalues command, with no other commands after it on that line. Similarly, the closing brace } needs to be on its own line.
Experiment below with the command above by changing the increments and min or max values. See what your code outputs.
/*
forvalues counter=???(???)???{
display `counter'
}
*/
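As a starting point for experimenting, here are three variations on the range. The particular numbers are arbitrary choices of ours. The a/b shorthand in the last loop is part of the forvalues syntax and means "from a to b in steps of 1".

```stata
* Count from 0 to 10 in steps of 5: displays 0, 5, 10
forvalues counter = 0(5)10 {
    display `counter'
}

* A negative step counts down: displays 3, 2, 1
forvalues counter = 3(-1)1 {
    display `counter'
}

* Shorthand for steps of 1: displays 1, 2, 3
forvalues counter = 1/3 {
    display `counter'
}
```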
3.3.2 Creating loops using foreach
Whenever we want to iterate across a list of names, we can use the foreach command below. This asks Stata to summarize each variable in a list (in this example, mpg and price).
The syntax for foreach is similar to that of forvalues: foreach local_var_name in "list of variables". Here, we are asking Stata to perform the summarize command on two variables (mpg and price):
foreach name in "mpg" "price"{
summarize `name'
}
We can have a list stored in a local variable as well. Here, we are storing a list of two variable names (mpg and price) in a local called namelist. Then, using foreach, we summarize each variable in turn as the local name runs through the items in namelist.
local namelist "mpg price"
foreach name in `namelist'{
summarize `name'
}
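foreach also has more specific forms. The sketch below shows two of them, of varlist and of local, which are part of the documented foreach syntax but are not covered in this module. The variable choices are ours.

```stata
sysuse auto, clear

* "of varlist" tells Stata the list contains variable names, so it checks that they exist
foreach v of varlist price mpg weight {
    summarize `v'
}

* "of local" takes the name of the local itself, written without the backtick and apostrophe
local namelist "mpg price"
foreach name of local namelist {
    summarize `name'
}
```

With of varlist, a misspelled variable name is reported before the loop starts rather than partway through.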
3.3.3 Writing loops with conditions using while
Whenever we want to iterate until a condition is met, we can write the command below. The condition here is simply “while counter is less than 5”.
local counter = 1
while `counter'<5{
display `counter'
local counter = `counter'+1
}
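A while loop can also accumulate a result as it goes. In this sketch, the local total is a name we made up to hold a running sum. The line that increases counter is essential: without it the condition never becomes false and the loop runs forever.

```stata
local counter = 1
local total = 0

* Add up the numbers 1 through 4
while `counter' <= 4 {
    local total = `total' + `counter'
    local counter = `counter' + 1
}

display "Sum of 1 to 4: `total'"
```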
3.4 Errors
A common occurrence while working with Stata is encountering various errors. Whenever an error occurs, the program will stop executing and an error message will appear. The most commonly occurring errors can be attributed to syntax issues, so we should always verify our code before execution. Below we have provided three common errors that we may run into.
summarize hello
We must always verify that the variable we use in a command exists and that we are using its correct spelling. Stata alerts us when we try to execute a command with a variable that does not exist.
su price if 5 =< mpg =< 25
In this example, the error is due to the use of an invalid conditional operator. Stata writes “less than or equal to” as <= (not =<) and “greater than or equal to” as >=. Each comparison should also be written separately and joined with &, as in su price if mpg >= 5 & mpg <= 25.
local word = 95
display "I am enrolled in `course' and hope my grade will be 'word'%!" // this is incorrect
display "I am enrolled in `course' and hope my grade will be `word'%!" // this is correct
The number 95 does not display in the string due to the wrong punctuation marks being used to enclose the local. We make the error of using two apostrophes instead of a backtick (`) and an apostrophe (').
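When an error is expected, one possible approach is to catch it rather than let it stop the do-file. The sketch below uses capture, which suppresses the error and stores its return code in _rc, and confirm variable, which checks that a variable exists. Neither command is covered in this module, so this is a preview under that assumption.

```stata
sysuse auto, clear

* The variable hello does not exist, so _rc is nonzero (111, "variable not found")
capture summarize hello
display _rc

* When the command succeeds, _rc is 0
capture summarize price
display _rc

* Check for a variable before using it
capture confirm variable hello
if _rc != 0 {
    display "Variable hello does not exist"
}
```

capture hides all output from the command, so it is best reserved for cases where we check _rc afterwards.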
3.5 Wrap Up
In this module, we looked at the way Stata commands function and how their syntax works. In general, many Stata commands will follow the following structure:
name_of_command [varlist] [if] [in] [weight] [, options]
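To make the general structure concrete, here is a sketch that fills in several of the optional pieces with summarize. The conditions are arbitrary, and the weight in the second command is there only to show where a weight goes; weighting prices by vehicle weight has no particular meaning.

```stata
sysuse auto, clear

* command   varlist    if               in       , options
summarize price mpg if foreign == 0 in 1/60, detail

* A weight goes in square brackets after the varlist
summarize price [aweight = weight] if foreign == 0
```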
At this point, you should feel more comfortable reading a documentation file for a Stata command. The question that remains is how to find new commands!
You are encouraged to search for commands using the command search. For example, if you are interested in running a regression you can write:
search regress
We can see that a new Stata window pops up on our computer, and we can click on the different options that are shown to look at the documentation for all these commands. Try it yourself in the code cell below!
In the following modules, whenever there is a command which confuses you, feel free to write search command or help command to be redirected to the documentation for reference.
In the next module, we will expand on our knowledge of locals, as well as globals, another type of variable.
3.6 Wrap-up Table
Command | Function |
---|---|
describe | Provides the characteristics of our dataset, including the number of observations and variables, and variable types |
summarize | Calculates and provides a variety of summary statistics of the general dataset or specific variables |
help | Provides information on each command including its definition, syntax, and the options associated with the command |
if-conditions | Used to verify a condition before executing a command. If conditions make use of logical and conditional operators and are preceded by the desired command |
sort | Used to sort the observations of the data set into ascending order |
detail | Provides additional statistics, including skewness, kurtosis, the four smallest and four largest values, and various percentiles |
display | Displays strings and values of scalar expressions |
search | Can be used to find useful commands |
while | A type of loop that iterates until a condition is met |
forvalues | A type of for-loop that iterates across a range of numbers |
foreach | A type of for-loop that iterates across a list of items |