ECON 490: Opening Datasets (5)

Prerequisites

Understand how to effectively use Stata do files and know how to generate log files.
Run basic Stata commands such as help, describe, summarize, for and while.
Know how to use macros in writing Stata commands.

Learning Outcomes

Understand how to use clear at the beginning of do-files.
Know how to change directories so that Stata can find relevant files.
Import datasets in csv and excel formats.
Import datasets in dta format.
Save data files.

import stata_setup
# A raw string (r'...') keeps the backslashes in the Windows path from being read as escape sequences
stata_setup.config(r'C:\Program Files\Stata18/', 'se')

>>> import sys
>>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module - 1, Section 1.5.1: Setting Up PyStata
>>> from pystata import config
>>> config.init('se')

In this repository you will find a folder named “data”, with a sub-folder named “raw”. In that sub-folder you will find two different versions of the same data set: “fake_data.csv” and “fake_data.dta”. The data set simulates information about workers in the years 1982-2012 in a fake country where, in 2003, a policy was enacted that allowed some workers to enter a training program with the purpose of boosting their earnings. We will be using this data set to learn how to explore and manipulate real-world datasets.

5.1 Clearing the Workspace

Do-files should begin with a command that clears any previous work left open in Stata. This makes sure that:

1. We do not waste computer memory on things other than the current project.
2. Whatever result we obtain in the current session truly belongs to that session.

We can clear the workspace of many different things (see help clear if needed). For the purpose of this lecture, the most comprehensive thing to do is to run the following:

%%stata
clear *
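
As a minimal sketch of how the variants of clear differ (based on the descriptions in help clear), the cell below runs the narrower form first and the comprehensive form second. In a do-file, only the second is normally needed.

```stata
%%stata
* clear on its own removes only the data and value labels from memory
clear
* clear all is a synonym for clear *: it also removes matrices, programs, stored results, and more
clear all
```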

5.2 Changing Directories

Before we get started on importing data into Stata, it is useful to know how to change the folder that Stata accesses whenever we run a command that either opens or saves a file. Once we instruct Stata to change the directory to a specific folder, from that point onward it will open files from that folder and save all files to that folder, including data files, do files, and log files. Stata will continue to do this until either the program is closed or we change to another directory. This means that every time we open Stata, we need to change the directory to the one we want to use.

We can begin by using the pwd command to view the current working directory.

%%stata
pwd

C:\Users\irene\econometrics\econ490-pystata

Note: We write the directory path within quotation marks to make sure Stata interprets this as a single string of words. If we don’t do this, we may encounter issues with folders that include blank spaces.

Now change the directory to the specific location where you saved the fake_data file using the command below. You can change your workspace to a directory named “some_folder/some_sub_folder” by writing cd "some_folder/some_sub_folder".

Use the space below to do this on your own computer.

%%stata

cd "\Users\irene\econometrics\econ490-pystata" 
* type your file path to the folder containing the data between the quotation marks in the line above


. 
. cd "\Users\irene\econometrics\econ490-pystata" 
C:\Users\irene\econometrics\econ490-pystata

. * type your file path to the folder containing the data between the quotation
>  marks in the line above
.

Notice that once we change directories, Stata outputs the full name of the directory where we are currently working.
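
Since macros were covered in an earlier module, one possible pattern is to store the project folder in a global macro once and reuse it. The sketch below is illustrative only: the macro name datadir is an assumption, and it is filled with the current working directory (c(pwd)) so that the cell runs on any computer. In practice, you would type your own path between the quotation marks.

```stata
%%stata
* Store a folder path in a global macro (here: the current working directory)
global datadir "`c(pwd)'"
* Change to that folder; the quotation marks protect against blank spaces in the path
cd "$datadir"
* Confirm where Stata is now pointed
pwd
```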

One trick to using cd is that two periods (..) stand for the parent folder: cd .. moves back one folder, and cd ../.. moves back two folders. Try the command below and compare the folder Stata is now directed to with the output of the command above. You can run it again to move back one more folder.

%%stata

cd ..


. 
. cd ..
C:\Users\irene\econometrics

.
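
The sketch below moves back two folders in one step and then returns to the starting folder. The local macro name start is hypothetical, and the cell assumes that there are at least two folders above the current one.

```stata
%%stata
* Remember the current folder so that we can come back to it
local start "`c(pwd)'"
* Move back two folders in one step and show where we are
cd "../.."
pwd
* Return to the folder we started from
cd "`start'"
```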

We can also give cd a path relative to the current working directory, such as the name of a sub-folder. In this method, quotation marks are not necessary as long as the path contains no blank spaces.

%%stata

cd myfolder

*Use myfolder as a placeholder for the folder you want to set as your working directory

SystemError: 
. 
. cd myfolder
unable to change to myfolder
r(170);
r(170);

The error above is expected here: myfolder is only a placeholder, so Stata cannot change to it unless a folder with that name exists inside the current working directory.

In addition, on Mac and Unix we can use the command cd on its own to go back to the home directory. On Windows, cd on its own displays the current working directory, just like pwd.

The process for changing directories in Stata varies depending on the type of computer being used. If one approach does not work, it is possible that the method is not suitable for your computer. Please see the Stata manual for instructions on how to change directories according to the type of computer you are using: https://www.stata.com/manuals/dcd.pdf

5.3 Opening Datasets

5.3.1 Excel and CSV files

When looking for the data for your research, you will realize that many data sets are not formatted for Stata. In some cases, data sets are formatted as Excel or CSV files. Not surprisingly, the command to load in data is called import. It comes in two main forms: import excel and import delimited.

Let’s import the data set called fake_data.csv. We need to use import delimited to import this data into Stata. The syntax for this command is import delimited [using] filename [, import_delimited_options].

We always include the option clear when we use import to make sure we’re clearing any previous data set that was opened before in our Stata session. Recall that to use an option, we include a comma (,) after the command line and write the option name. You are welcome to also read the documentation of these commands by writing help import delimited.

Note that the command below will not import the data unless you have changed your directory (above) to the folder which contains this file.

Ignore the following block of code, which creates a csv file to be used as an example.

%%stata

use fake_data, clear
export delimited using "fake_data.csv", replace


. 
. use fake_data, clear

. export delimited using "fake_data.csv", replace
(file fake_data.csv not found)
file fake_data.csv saved

.

To load a csv dataset, we write:

%%stata

import delimited using "fake_data.csv", clear


. 
. import delimited using "fake_data.csv", clear
(encoding automatically selected: UTF-8)
(11 vars, 138,138 obs)

.

When we run this command, Stata prints a message saying that there are 11 variables and 138,138 observations. When we open datasets that are not in Stata format, it is very important to check whether the first row of the data includes the variable names.

We can use the command list to look at our data. It is better to limit the observations we see since we don’t want to see all 138,138! Thus, we use in to constrain the list to the first 3 observations below.

%%stata

list in 1/3


. 
. list in 1/3 

     +----------------------------------------------------------------------+
  1. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        1 | 1999 |   M |  55 |     1997 |      1 |       0 | 39975.01 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .2607649       |              2        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        1 | 2001 |   M |  57 |     1997 |      1 |       0 | 278378.1 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .0142739       |              2        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  3. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        2 | 2001 |   M |  54 |     2001 |      4 |       0 |  18682.6 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .0321868       |              4        |             16        |
     +----------------------------------------------------------------------+

.

By default, the first row of data is interpreted as the variable names of the data set, which in this case is correct. If that’s not the case, we need to include the import delimited option varnames(#|nonames), where we replace # with the number of the row that contains the names. If the data has no names, the option is varnames(nonames). Don’t forget that you can always check the documentation by writing help import delimited.
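
As an illustration of the varnames() option, the sketch below reads the same csv file while stating explicitly that row 1 holds the variable names. The commented-out import excel line uses a hypothetical file, fake_data.xlsx, which is not part of the repository. It only shows what the equivalent call could look like for an Excel file whose first row contains the variable names.

```stata
%%stata
* Read the csv file, stating explicitly that row 1 holds the variable names
import delimited using "fake_data.csv", varnames(1) clear

* Hypothetical Excel file (not in the repository): firstrow treats the first row as variable names
* import excel using "fake_data.xlsx", firstrow clear
```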

5.3.2 Stata files

To open data sets in Stata format, we use the command use. As we can observe from the example below, we can recognize that a dataset is stored in Stata format because the file’s name ends with .dta.

%%stata

use "fake_data.dta", clear


. 
. use "fake_data.dta", clear

.

%%stata

list in 1/3


. 
. list in 1/3 

     +----------------------------------------------------------------------+
  1. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        1 | 1999 |   M |  55 |     1997 |      1 |       0 | 39975.01 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .2607649       |              2        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        1 | 2001 |   M |  57 |     1997 |      1 |       0 | 278378.1 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .0142739       |              2        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  3. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        2 | 2001 |   M |  54 |     2001 |      4 |       0 |  18682.6 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .0321868       |              4        |             16        |
     +----------------------------------------------------------------------+

.
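
With large files it can help to inspect or load only part of a dataset. The sketch below assumes that fake_data.dta is in the current working directory. The choice of three variables and 100 observations is arbitrary, and the last line reloads the full dataset so that the rest of the module is unaffected.

```stata
%%stata
* Look at the contents of the file without loading it into memory
describe using "fake_data.dta"
* Load only three variables and the first 100 observations
use workerid year earnings in 1/100 using "fake_data.dta", clear
* Reload the full dataset before continuing
use "fake_data.dta", clear
```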

5.3.3 Other files

You can open a number of different data files in Stata with no issues. If you are struggling, one option at UBC is to use the program StatTransfer to convert your file to dta format. This program is available in the library on the UBC Vancouver Campus at one of the Digital Scholarship workstations. Once your data is in dta format, it can be imported with the use command seen above.

Note: UBC has research support available for any student who needs help with data, including anyone who needs help getting data into a format that can be imported into Stata. You can find the contact information for the Economics Librarian on the UBC Library ECON 490 Research Guide.

5.4 Commands to Explore the Dataset

5.4.1 `describe`

The first command we are going to use describes the basic characteristics of the variables in the loaded data set.

%%stata

describe


. 
. describe

Contains data from fake_data.dta
 Observations:       138,138                  
    Variables:            11                  16 Jul 2023 17:25
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
workerid        long    %12.0g                Worker Identifier
year            int     %8.0g                 Calendar Year
sex             str1    %9s                   Sex
age             byte    %9.0g                 Age (years)
start_year      int     %9.0g                 Initial year worker is observed
region          byte    %9.0g                 group(prov)
treated         byte    %8.0g                 Treatment Dummy
earnings        float   %9.0g                 Earnings
sample_weight   float   %9.0g                 
quarter_birth   float   %9.0g                 Quarter of birth
schooling       float   %9.0g                 Years of schooling
-------------------------------------------------------------------------------
Sorted by: workerid

.

5.4.2 `codebook`

We can further analyze any variable by using the codebook command. Let’s do this here to learn more about the variable earnings.

%%stata

codebook earnings


. 
. codebook earnings

-------------------------------------------------------------------------------
earnings                                                               Earnings
-------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [36.193157,63573580]          Units: 1.000e-06
         Unique values: 137,915                   Missing .: 0/138,138

                  Mean: 84136.4
             Std. dev.:  252802

           Percentiles:     10%       25%       50%       75%       90%
                        10220.9   20562.6     43783   92378.2    183237

.

The codebook command gives us important information about this variable such as the type (i.e. string or numeric), how many missing observations it has (very useful to know!) and the number of unique values. If the variable is numeric, it will also provide some summary statistics. If the variable is a string, it will provide examples of some of the entries.

Try changing the variable name in the cell above to see the codebook entries for different variables in the data set.
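
For instance, the sketch below runs codebook on the string variable sex and then uses the compact option to get a one-line summary of every variable in the dataset.

```stata
%%stata
* A string variable: codebook reports its type, unique values, and missing values
codebook sex
* One line per variable for the whole dataset
codebook, compact
```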

5.4.3 `tabulate`

We can also learn more about the frequency of the different values of one variable by using the command tabulate.

%%stata

tabulate region


. 
. tabulate region

group(prov) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     54,364       39.35       39.35
          2 |     34,072       24.67       64.02
          3 |      6,216        4.50       68.52
          4 |     17,572       12.72       81.24
          5 |     25,914       18.76      100.00
------------+-----------------------------------
      Total |    138,138      100.00

.

Here we can see that there are five regions indicated in this data set. Region 1 accounts for the largest share of observations (39.35%).

We can actually include two variables in the tabulate command if we want more information. When we do this below, we see that region 1 contains 11,036 observations for female-identified workers and 43,328 for male-identified workers.

%%stata

tabulate region sex


. 
. tabulate region sex

group(prov |          Sex
         ) |         F          M |     Total
-----------+----------------------+----------
         1 |    11,036     43,328 |    54,364 
         2 |     7,881     26,191 |    34,072 
         3 |     1,247      4,969 |     6,216 
         4 |     3,997     13,575 |    17,572 
         5 |     6,358     19,556 |    25,914 
-----------+----------------------+----------
     Total |    30,519    107,619 |   138,138 

.
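
Raw counts can be hard to compare across regions of different sizes. As a sketch, the row option adds the share of each sex within each region, and the missing option treats missing values (if there are any) as their own category.

```stata
%%stata
* Two-way table with row percentages: share of each sex within each region
tabulate region sex, row
* One-way table that also counts missing values, if any
tabulate region, missing
```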

5.4.4 `lookfor`

What if there’s a gazillion variables and we’re looking for a particular one? Thankfully, Stata provides a nice command called lookfor which helps us search for variables based on keywords. Suppose we want to look for a variable that is related to year.

%%stata

lookfor year


. 
. lookfor year

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
year            int     %8.0g                 Calendar Year
age             byte    %9.0g                 Age (years)
start_year      int     %9.0g                 Initial year worker is observed
schooling       float   %9.0g                 Years of schooling

.

Stata found four variables that include the word year either in the variable name or in the variable label. This is super useful when we are getting to know a data set!

5.5 Saving Datasets

We can save any opened data set in Stata format by writing save "some_directory/dataset_name.dta", replace. The replace option overwrites a previous version of the file to keep our save current.

We can also save files in different formats with the export excel and export delimited commands. Look at the help documentation for more details.
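
A minimal sketch of the three ways of saving is shown below. The file names starting with fake_data_copy are hypothetical and were chosen so that the original files are not overwritten. The files are written to the current working directory.

```stata
%%stata
* Save the data in memory in Stata format
save "fake_data_copy.dta", replace
* Save the same data as a csv file
export delimited using "fake_data_copy.csv", replace
* Save the same data as an Excel file, with variable names in the first row
export excel using "fake_data_copy.xlsx", firstrow(variables) replace
```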

5.6 Wrap Up

Now that you are able to import data into Stata, you can start doing your own analysis! Try finding a data set that interests you and practice some of the commands that you have already learned in the first few modules. In the next module, we will look at commands for working with data in greater depth.

5.6.1 Wrap Up Table

Command	Function
`clear`	used to clear the workspace
`cd`	used to change the working directory
`pwd`	used to view the current working directory
`use`	used to open a Stata dataset
`import delimited`	used to load a csv dataset
`import excel`	used to load an excel dataset
`list`	used to look at the data
`describe`	used to describe the basic characteristics of the variables in the loaded dataset
`browse`	used to open up the data editor and view the observations of the dataset
`codebook`	used to describe data contents
`tabulate`	used to summarize the frequency of the different measures of a variable
`lookfor`	used to search for the variables of a dataset based on keywords
`export excel`	used to save a dataset in excel format
`export delimited`	used to save a dataset in csv format

5.6.2 Errors

The tabulate command may be used in conjunction with conditional statements. When the condition compares a string variable, such as sex, with a value, ensure that you enclose the value in quotation marks; otherwise, Stata looks for a variable named F and returns an error. Uncomment each line of code below to see it in action.

%%stata


*tabulate sex if sex==F          //incorrect
*tabulate sex if sex=="F"        //correct


. 
. 
. *tabulate sex if sex==F          //incorrect
. *tabulate sex if sex=="F"        //correct
.
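
One way to see the error without stopping the rest of a cell is to prefix the command with capture noisily, which shows the error message but lets execution continue. This is only a sketch of that pattern. The return code stored in _rc is nonzero whenever the captured command fails.

```stata
%%stata
* Incorrect: without quotation marks, Stata treats F as a variable name
capture noisily tabulate sex if sex==F
* A nonzero return code means the command above failed
display _rc
* Correct: the string value is enclosed in quotation marks
tabulate sex if sex=="F"
```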
