ECON 490: Conducting Within Group Analysis (7)

Prerequisites

Be able to effectively use Stata do-files and generate log-files.
Be able to change your directory so that Stata can find your files.
Import datasets in csv and dta format.
Save data files.

Learning Outcomes

Create new variables using the command egen.
Know when to use the pre-command by and when to use bysort.
Use the command collapse to create a new data set of summary statistics.
Change a panel data set to a cross-sectional data set using the command reshape.

import stata_setup
# Raw string (r'...') so the backslashes in the Windows path are not read as escape sequences
stata_setup.config(r'C:\Program Files\Stata18/', 'se')

>>> import sys
>>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module - 1, Section 1.5.1: Setting Up PyStata
>>> from pystata import config
>>> config.init('se')

7.1 Introduction to Working Within Groups

There are times when you need to consider workers as a group. Consider some of the following examples:

You would like to know the average wages of workers by educational grouping, in each year of the data.
You would like to know the standard deviation of men’s and women’s earnings, in each geographic region in the data.
You would like to know the top quintile of wealth, by birth cohort.

This module will show you how to calculate these statistics using the fake data set introduced in the previous lecture. Recall that this data set simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

Let’s begin by loading that data set into Stata:

%%stata

clear *

use "fake_data.dta", clear


. 
. clear *

. 
. use "fake_data.dta", clear

.

7.2 Generating Variables using `generate`

When we are working on a particular project, it is important to know how to create variables that are computed for a group rather than an individual or an observation. For instance, we may have a data set that is divided by individual and by year. We might want the variables to show us the statistics of a particular individual throughout the years or the statistics of all individuals each year.

Stata makes it easy to compute such statistics. The key to this analysis is the pre-command by, and its only requirement is that the data be sorted by the variable (or variables) that define the groups.

Let’s take a look at our data by using the list command we learned in Module 5.

%%stata
list in 1/10


     +----------------------------------------------------------------------+
  1. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        1 | 1999 |   M |  55 |     1997 |      1 |       0 | 39975.01 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .2607649       |              2        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        1 | 2001 |   M |  57 |     1997 |      1 |       0 | 278378.1 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .0142739       |              2        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  3. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        2 | 2001 |   M |  54 |     2001 |      4 |       0 |  18682.6 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .0321868       |              4        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  4. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        2 | 2002 |   M |  55 |     2001 |      4 |       0 | 293336.4 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .4712022       |              2        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  5. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        2 | 2003 |   M |  56 |     2001 |      4 |       0 | 111797.3 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |        .704381       |              2        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  6. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        3 | 2005 |   M |  54 |     2005 |      5 |       0 | 88351.67 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .3559006       |              4        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  7. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        3 | 2010 |   M |  59 |     2005 |      5 |       0 | 46229.57 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .8969152       |              2        |             16        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  8. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        4 | 1997 |   M |  45 |     1997 |      5 |       1 | 24911.03 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .3990085       |              2        |             12        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  9. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        4 | 2001 |   M |  49 |     1997 |      5 |       1 | 9908.362 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |       .5519462       |              3        |             12        |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 10. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        5 | 2009 |   M |  55 |     1998 |      2 |       1 | 137207.3 |
     |----------------------------------------------------------------------|
     |       sample~t       |       quarte~h        |       school~g        |
     |        .014439       |              3        |             14        |
     +----------------------------------------------------------------------+

We can tell here that the data is sorted by the variable workerid.
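Reading the listing is one way to judge the sort order, but what matters to the pre-command by is the sort order that Stata itself has recorded. As a minimal sketch, the describe command with the short option prints a "Sorted by:" line in its header. If that line is empty, Stata does not treat the data as sorted, even when the rows look ordered.

%%stata

describe, short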

We use the pre-command by alongside the command generate to create these group-level variables. If we group by a variable other than workerid (the variable by which the data is sorted), Stata will refuse to generate the new variable until the data is sorted by that variable.

When the data is not sorted by year, the command below fails with Stata's "not sorted" error (return code 5).

%%stata

cap drop var_one 
by year: gen var_one = 1


. 
. cap drop var_one 

. by year: gen var_one = 1 

.

If we want to group by year, Stata expects us to sort the data such that all observations corresponding to the same year are next to each other. We can use the sort command as follows.

%%stata

sort year


. 
. sort year 

.

%%stata

list in 1/10 //change the numbers if you would like to see more observations


. 
. list in 1/10 //change the numbers if you would like to see more observations

     +----------------------------------------------------------------------+
  1. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |      179 | 1995 |   M |  42 |     1995 |      2 |       0 | 4943.277 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .8801816     |           2     |          17     |          1     |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    32663 | 1995 |   M |  53 |     1995 |      2 |       0 | 37268.91 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .3344809     |           1     |          14     |          1     |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  3. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    39131 | 1995 |   M |  38 |     1995 |      4 |       0 | 24581.38 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .4255158     |           1     |          12     |          1     |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  4. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    25935 | 1995 |   F |  36 |     1995 |      2 |       1 | 12666.49 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .1697022     |           1     |          13     |          1     |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  5. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    27256 | 1995 |   M |  55 |     1995 |      5 |       1 |  24022.8 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .1655299     |           2     |          14     |          1     |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  6. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    24049 | 1995 |   M |  33 |     1995 |      5 |       0 | 17288.84 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .2100701     |           2     |          15     |          1     |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  7. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |     8354 | 1995 |   M |  39 |     1995 |      2 |       0 | 40420.63 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .0431403     |           3     |          15     |          1     |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  8. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    32867 | 1995 |   M |  45 |     1995 |      4 |       1 | 13114.48 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .9815345     |           2     |          14     |          1     |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  9. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    26250 | 1995 |   M |  39 |     1995 |      5 |       0 | 8696.396 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .8022571     |           1     |          17     |          1     |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 10. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    11657 | 1995 |   F |  52 |     1995 |      1 |       1 | 53814.48 |
     |-----------------+----------------------------------------------------|
     |    sample~t     |    quarte~h     |    school~g     |    var_one     |
     |    .6134558     |           2     |          18     |          1     |
     +----------------------------------------------------------------------+

.

Let’s try the command above again, now with the sorted data.

%%stata

cap drop var_one 
by year: gen var_one = 1


. 
. cap drop var_one 

. by year: gen var_one = 1 

.

Now that the data is sorted by year, the code works!

We could have also used the pre-command bysort instead of by. When we do this we can skip the command to sort the data. Everything is done in one step!

Let’s sort the data back into its original order (by workerid), and then generate our new variable again using bysort.

%%stata

sort workerid year


. 
. sort workerid year 

.

%%stata

cap drop var_one 
bysort year: gen var_one = 1


. 
. cap drop var_one 

. bysort year: gen var_one = 1 

.

The variable we have created is not interesting by any means. It simply takes the value of 1 everywhere. In fact, we haven’t done anything that we couldn’t have done with gen var_one=1. We can see this by using the summarize command (abbreviated su).

%%stata

su var_one


. 
. su var_one

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     var_one |    138,138           1           0          1          1

.
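If we only want to confirm this rather than read it off the summary table, a minimal sketch is to use assert, which stays silent when the condition holds for every observation and stops with an error otherwise. The count command gives the same information as a number.

%%stata

* passes silently if var_one equals 1 in every observation
assert var_one == 1

* number of observations where var_one differs from 1
count if var_one != 1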

You may not be aware, but Stata keeps track of the observation number in a built-in system variable called *_n* and of the total number of observations in another called *_N*. These are not stored in the data set, which is why they do not appear when we list the data.

Let’s take a look at these by creating two new variables: one that is the observation number and one that is the total number of observations.

%%stata

cap drop obs_number 
gen obs_number = _n 

cap drop tot_obs
gen tot_obs = _N


. 
. cap drop obs_number 

. gen obs_number = _n 

. 
. cap drop tot_obs

. gen tot_obs = _N

.

%%stata

list in 1/10


. 
. list in 1/10

     +----------------------------------------------------------------------+
  1. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    20666 | 1995 |   M |  34 |     1995 |      1 |       0 | 4863.026 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .8200595  |        4  |       14  |       1  |        1  |   138138  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    13454 | 1995 |   F |  31 |     1995 |      1 |       1 | 267.0328 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .9174706  |        3  |        9  |       1  |        2  |   138138  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  3. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |     5982 | 1995 |   M |  29 |     1995 |      2 |       0 | 125189.6 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .3408671  |        1  |       16  |       1  |        3  |   138138  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  4. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    27683 | 1995 |   M |  33 |     1995 |      5 |       1 |  33299.9 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .6024967  |        4  |       14  |       1  |        4  |   138138  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  5. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    38766 | 1995 |   M |  49 |     1995 |      3 |       1 | 15291.03 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .0741042  |        4  |       12  |       1  |        5  |   138138  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  6. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    10402 | 1995 |   F |  32 |     1995 |      1 |       0 |  14877.4 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .2130904  |        3  |       16  |       1  |        6  |   138138  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  7. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    37500 | 1995 |   M |  30 |     1995 |      1 |       0 | 25551.36 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .2380847  |        2  |       16  |       1  |        7  |   138138  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  8. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    12312 | 1995 |   F |  29 |     1995 |      1 |       1 | 110410.9 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .4058204  |        3  |       13  |       1  |        8  |   138138  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  9. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |    11463 | 1995 |   M |  50 |     1995 |      1 |       0 | 20900.79 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .8487378  |        4  |       18  |       1  |        9  |   138138  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 10. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |     8158 | 1995 |   F |  31 |     1995 |      2 |       1 | 7653.579 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .8682386  |        3  |       11  |       1  |       10  |   138138  |
     +----------------------------------------------------------------------+

.

As expected, the numbering of observations depends on how the data is sorted! The useful part is that whenever we use the pre-command by, the system variables _n and _N record the observation number and the total number of observations within each group separately.

%%stata

cap drop obs_number 
bysort workerid: gen obs_number = _n 

cap drop tot_obs
bysort workerid: gen tot_obs = _N


. 
. cap drop obs_number 

. bysort workerid: gen obs_number = _n 

. 
. cap drop tot_obs

. bysort workerid: gen tot_obs = _N

.

%%stata

list in 1/10 //change the numbers if you would like to see more observations


. 
. list in 1/10 //change the numbers if you would like to see more observations

     +----------------------------------------------------------------------+
  1. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        1 | 2001 |   M |  57 |     1997 |      1 |       0 | 278378.1 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .0142739  |        2  |       16  |       1  |        1  |        2  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        1 | 1999 |   M |  55 |     1997 |      1 |       0 | 39975.01 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .2607649  |        2  |       16  |       1  |        2  |        2  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  3. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        2 | 2002 |   M |  55 |     2001 |      4 |       0 | 293336.4 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .4712022  |        2  |       16  |       1  |        1  |        3  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  4. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        2 | 2001 |   M |  54 |     2001 |      4 |       0 |  18682.6 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .0321868  |        4  |       16  |       1  |        2  |        3  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  5. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        2 | 2003 |   M |  56 |     2001 |      4 |       0 | 111797.3 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     |  .704381  |        2  |       16  |       1  |        3  |        3  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  6. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        3 | 2005 |   M |  54 |     2005 |      5 |       0 | 88351.67 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .3559006  |        4  |       16  |       1  |        1  |        2  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  7. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        3 | 2010 |   M |  59 |     2005 |      5 |       0 | 46229.57 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .8969152  |        2  |       16  |       1  |        2  |        2  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  8. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        4 | 1997 |   M |  45 |     1997 |      5 |       1 | 24911.03 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .3990085  |        2  |       12  |       1  |        1  |        2  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  9. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        4 | 2001 |   M |  49 |     1997 |      5 |       1 | 9908.362 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .5519462  |        3  |       12  |       1  |        2  |        2  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 10. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |        5 | 2009 |   M |  55 |     1998 |      2 |       1 | 137207.3 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     |  .014439  |        3  |       14  |       1  |        1  |        2  |
     +----------------------------------------------------------------------+

.
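Notice in the listing above that the rows within a worker are not in chronological order (for worker 1, the year 2001 appears before 1999), so obs_number does not count survey years in time order. A minimal sketch of one way to handle this is to put year in parentheses after the grouping variable: bysort then sorts by workerid and year, but forms the groups using workerid only. The variable names obs_in_time and earnings_change are our own illustrative choices, not part of the data set.

%%stata

* number each worker's observations in chronological order
cap drop obs_in_time
bysort workerid (year): gen obs_in_time = _n

* change in earnings since the worker's previous observation
* (missing for each worker's first observation, since there is no earlier one)
cap drop earnings_change
bysort workerid (year): gen earnings_change = earnings - earnings[_n-1]

Inside a by group, a subscript such as [_n-1] refers to the previous observation of the same group, so the calculation never reaches across workers.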

As we can see, some workers are observed only 2 times in the data (they were only surveyed in two years), whereas other workers are observed 8 times (they were surveyed in 8 years). Once the number of times each worker has been observed is recorded in a variable, we can base our analysis on this information. For example, in some cases we might be interested in keeping only workers who are observed across all time periods. In this case, we could use the command:

%%stata

keep if tot_obs==8


. 
. keep if tot_obs==8
(135,274 observations deleted)

.
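Keep in mind that keep permanently removes the other observations from the data in memory, so getting them back means loading the file again. As a sketch of a non-destructive alternative, to be run in place of the keep command above, we could flag the workers of interest and restrict later commands with an if condition. The name full_panel is a hypothetical choice for illustration.

%%stata

* how many observations belong to workers seen 2 times, 3 times, and so on
tabulate tot_obs

* flag workers observed 8 times instead of dropping everyone else
cap drop full_panel
gen byte full_panel = (tot_obs == 8)

* use the flag to restrict a command to that subsample
summarize earnings if full_panel

Note that tabulate counts observations rather than workers here, so a worker who is observed 8 times contributes 8 rows to the table.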

%%stata

list in 1/10 //change the numbers if you would like to see more observations


. 
. list in 1/10 //change the numbers if you would like to see more observations

     +----------------------------------------------------------------------+
  1. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |       41 | 2004 |   M |  45 |     1995 |      2 |       0 | 309854.4 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .5882115  |        1  |       16  |       1  |        1  |        8  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |       41 | 2011 |   M |  52 |     1995 |      2 |       0 | 20448.55 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .0300817  |        3  |       16  |       1  |        2  |        8  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  3. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |       41 | 2009 |   M |  50 |     1995 |      2 |       0 | 66324.93 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .6211095  |        1  |       16  |       1  |        3  |        8  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  4. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |       41 | 2008 |   M |  49 |     1995 |      2 |       0 | 16850.16 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .1999107  |        3  |       16  |       1  |        4  |        8  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  5. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |       41 | 2002 |   M |  43 |     1995 |      2 |       0 | 39701.47 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     |  .429385  |        3  |       16  |       1  |        5  |        8  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  6. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |       41 | 1995 |   M |  36 |     1995 |      2 |       0 | 54630.28 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .4711307  |        4  |       16  |       1  |        6  |        8  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  7. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |       41 | 2003 |   M |  44 |     1995 |      2 |       0 | 41871.21 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .4665109  |        3  |       16  |       1  |        7  |        8  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  8. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |       41 | 1999 |   M |  40 |     1995 |      2 |       0 | 86709.02 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .2023048  |        1  |       16  |       1  |        8  |        8  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  9. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |      345 | 2008 |   F |  49 |     1995 |      2 |       1 | 10066.77 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .0494048  |        2  |       15  |       1  |        1  |        8  |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 10. | workerid | year | sex | age | start_~r | region | treated | earnings |
     |      345 | 2002 |   F |  43 |     1995 |      2 |       1 | 11149.12 |
     |-----------------------+----------------------------------------------|
     | sample~t  | quarte~h  | school~g  | var_one  | obs_nu~r  |  tot_obs  |
     | .6418093  |        2  |       15  |       1  |        2  |        8  |
     +----------------------------------------------------------------------+

.

7.3 Generating Variables Using Extended Generate

The command egen (short for "extended generate") is used whenever we want to create variables that require access to certain functions (e.g. mean, standard deviation, min). The basic syntax is as follows:

 bysort groupvar: egen new_var = function(arguments), options

Let’s see an example where we create a new variable called avg_earnings, which is the mean of earnings for each worker, followed by a variable called total_earnings, which is the sum of each worker’s earnings. We will need to reload our data since we dropped many observations above when we used the keep command.

%%stata

clear *
use "fake_data.dta", clear


. 
. clear *

. use "fake_data.dta", clear

.

%%stata

cap drop avg_earnings
bysort workerid: egen avg_earnings = mean(earnings)


. 
. cap drop avg_earnings

. bysort workerid: egen avg_earnings = mean(earnings)

.

%%stata

cap drop total_earnings
bysort workerid: egen total_earnings = total(earnings)


. 
. cap drop total_earnings

. bysort workerid: egen total_earnings = total(earnings)

.
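As a quick sanity check on the two variables just created, the sketch below counts each worker's non-missing earnings records with the egen function count() and confirms that the mean equals the total divided by that count. The variable name n_obs and the tolerance of 1e-5 are illustrative choices rather than part of the original module; the tolerance allows for rounding, since egen stores its results as floats by default.

%%stata

cap drop n_obs
// number of non-missing earnings records per worker
bysort workerid: egen n_obs = count(earnings)
// the mean should equal the total divided by the count, up to rounding
assert reldif(avg_earnings, total_earnings / n_obs) < 1e-5 if n_obs > 0
list workerid year earnings avg_earnings total_earnings n_obs in 1/5

If the assert line runs silently, the two variables are consistent with each other.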

By definition, these commands will create variables that use information across different observations. You can check the list of available functions by writing help egen in the Stata command window.

In that documentation, you will notice that some functions do not allow by. For example, suppose we want to sum several variables within the same row.

%%stata

cap drop sum_of_vars
egen sum_of_vars = rowtotal(start_year region treated)


. 
. cap drop sum_of_vars

. egen sum_of_vars = rowtotal(start_year region treated)

.

The variable we are creating for the example has no particular meaning, but what we need to notice is that the function rowtotal() treats missing values as zero, so it only sums the non-missing values in our variables. This means that if one of the three variables is missing, the result is the sum of the other two. We could also write this command as gen sum_of_vars = start_year + region + treated; however, if there is a missing value (.) in start_year, region or treated, then the generated value for sum_of_vars will also be a missing value. The difference lies in how missing values are handled: with the + operator, adding any number to a missing value (.) gives a missing value.
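The difference is easiest to see on a tiny made-up data set. The sketch below is only an illustration: the variables a, b and c and the three rows are invented, and the whole cell is wrapped in preserve and restore so that the worker data is back in memory afterwards. It also uses the missing option of rowtotal(), which returns a missing value only when every variable in the row is missing.

%%stata

preserve
clear
input a b c
1 2 3
4 . 6
. . .
end
// missing values are treated as 0
egen sum_rowtotal = rowtotal(a b c)
// missing only if every variable in the row is missing
egen sum_rowtotal_m = rowtotal(a b c), missing
// missing if any variable in the row is missing
gen sum_plus = a + b + c
list
restore

In the second row, rowtotal() returns 10 while the + version is missing; in the third row, rowtotal() returns 0 unless the missing option is used.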

We can also use by with a list of variables. Here we will use year and region in one command.

%%stata

cap drop regionyear_earnings
bysort year region : egen regionyear_earnings = total(earnings)


. 
. cap drop regionyear_earnings

. bysort year region : egen regionyear_earnings = total(earnings)

.

What this command gives us is a new variable that records total earnings in each region for every year.
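Because regionyear_earnings repeats the same value on every row of a year-region group, it can be handy to look at just one row per group. A minimal sketch, assuming the variable created above is still in memory, uses the egen function tag(); the name one_per_group is an illustrative choice.

%%stata

cap drop one_per_group
// tag() flags exactly one observation in each year-region group
egen one_per_group = tag(year region)
list year region regionyear_earnings if one_per_group & year == 1995
drop one_per_group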

7.4 Collapsing Data

We can also compute statistics at some group level with the collapse command. Collapse is extremely useful whenever we want to apply sample weights to our data (we will learn more about this in Module 11). Sample weights cannot be applied using egen but are often extremely important when using micro data. Those weights allow us to adjust our data to better reflect the composition of the population when the authority that collected the data might have oversampled some segments of that population.

The syntax is

collapse (statistic1) new_name = existing_variable (statistic2) new_name2 = existing_variable2 ... [pweight = weight_variable], by(group)

You can obtain a list of possible statistics by running the command help collapse. You can also learn more about using weights by typing help weight.
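To make the syntax concrete, here is a minimal sketch of a weighted collapse that uses the variable sample_weight from our data as a probability weight. It is wrapped in preserve and restore so that the full data set is back in memory afterwards; the name wavg_earnings is an illustrative choice. Only the mean is requested because, as far as we are aware, collapse does not allow (sd) together with pweights; help collapse lists the exact restrictions.

%%stata

preserve
collapse (mean) wavg_earnings = earnings [pweight = sample_weight], by(region year)
list in 1/5
restore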

Let’s suppose we want to create a data set at the region-year level using information in the current data set. First, we decide which statistics we want to keep from the original data set. For the sake of explanation, let’s suppose we want to keep average earnings, the standard deviation of earnings, and total employment. To keep the example simple, the command below does not apply the sample weights, so these statistics are unweighted. We write the following:

%%stata

collapse (mean) avg_earnings = earnings (sd) sd_earnings = earnings (count) tot_emp = earnings, by(region year)


. 
. collapse (mean) avg_earnings = earnings (sd) sd_earnings = earnings (count) t
> ot_emp = earnings, by(region year)

.

%%stata

list in 1/10 //change the numbers if you would like to see more observations


. 
. list in 1/10 //change the numbers if you would like to see more observations

     +-----------------------------------------------+
     | year   region   avg_ea~s   sd_ear~s   tot_emp |
     |-----------------------------------------------|
  1. | 1995        1   73879.54   111401.5      4049 |
  2. | 1996        1   72385.41   140133.1      4129 |
  3. | 1997        1   75709.29   127096.7      4187 |
  4. | 1998        1   75836.01   110712.4      4191 |
  5. | 1999        1    79147.8   125508.6      4066 |
     |-----------------------------------------------|
  6. | 2000        1   79012.02   127400.3      3957 |
  7. | 2001        1   84775.07   173502.6      3789 |
  8. | 2002        1   84860.38   126634.6      3632 |
  9. | 2003        1   87483.05   151080.3      3405 |
 10. | 2004        1   89746.04   243319.7      3140 |
     +-----------------------------------------------+

.

Warning: When you use collapse, Stata produces a new data set with the results and in the process drops the data set that was loaded at the time the command was run. If you need to keep that data, be certain to save the file before you run this command.

7.5 Reshaping

We have collapsed our data and so we need to import the data again to gain access to the full data set.

%%stata

clear *

use "fake_data.dta", clear


. 
. clear *

. 
. use "fake_data.dta", clear

.

Notice that the nature of this particular data set is panel form; individuals have been followed over many years. Sometimes we are interested in working with a cross section (i.e. we have 1 observation per worker which includes all of the years). Is there a simple way to go back and forth between these two? Yes!

The command is called reshape, and it has two main forms: wide and long. The wide form corresponds to a cross-sectional layout, whereas the long form corresponds to the usual panel layout.
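Before reshaping, it can help to confirm two things: that each worker-year pair identifies a single row, and that any variable we plan to leave out of the reshape list does not change within a worker. The sketch below is one way to check this, assuming sex and treated are the variables we intend to leave out; both checks stop with an error if the condition fails and run silently otherwise.

%%stata

// each worker-year pair should identify exactly one row
isid workerid year
// variables left out of the reshape list must not change within a worker
foreach v of varlist sex treated {
    bysort workerid (`v'): assert `v'[1] == `v'[_N]
}

The second check sorts each worker's rows by the variable and compares the first and last values, which are equal only when the variable is constant within that worker.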

Suppose we want to record the earnings of workers while keeping the information across years.

%%stata

reshape wide earnings region age start_year sample_weight quarter_birth, i(workerid) j(year)


. 
. reshape wide earnings region age start_year sample_weight quarter_birth, i(wo
> rkerid) j(year)
(j = 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
>  2010 2011)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations          138,138   ->   39,999      
Number of variables                  11   ->   106         
j variable (17 values)             year   ->   (dropped)
xij variables:
                               earnings   ->   earnings1995 earnings1996 ... ea
> rnings2011
                                 region   ->   region1995 region1996 ... region
> 2011
                                    age   ->   age1995 age1996 ... age2011
                             start_year   ->   start_year1995 start_year1996 ..
> . start_year2011
                          sample_weight   ->   sample_weight1995 sample_weight1
> 996 ... sample_weight2011
                          quarter_birth   ->   quarter_birth1995 quarter_birth1
> 996 ... quarter_birth2011
-----------------------------------------------------------------------------

.

Warning: This command acts on all of the variables in your data set. If you don’t include them in the list, Stata will assume that they do not vary across i (in this case workers). If you don’t check this beforehand, you may get an error message.

%%stata

list in 1/5 //change the numbers if you would like to see more observations


. 
. list in 1/5 //change the numbers if you would like to see more observations

     +----------------------------------------------------------------+
  1. | workerid | age1995 | sta~1995 | reg~1995 | ear~1995 | sam~1995 |
     |        1 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1995 | age1996 | sta~1996 | reg~1996 | ear~1996 | sam~1996 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1996 | age1997 | sta~1997 | reg~1997 | ear~1997 | sam~1997 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1997 | age1998 | sta~1998 | reg~1998 | ear~1998 | sam~1998 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1998 | age1999 | sta~1999 | reg~1999 | ear~1999 | sam~1999 |
     |        . |      55 |     1997 |        1 | 39975.01 | .2607649 |
     |----------+---------+----------+----------+----------+----------|
     | qua~1999 | age2000 | sta~2000 | reg~2000 | ear~2000 | sam~2000 |
     |        2 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2000 | age2001 | sta~2001 | reg~2001 | ear~2001 | sam~2001 |
     |        . |      57 |     1997 |        1 | 278378.1 | .0142739 |
     |----------+---------+----------+----------+----------+----------|
     | qua~2001 | age2002 | sta~2002 | reg~2002 | ear~2002 | sam~2002 |
     |        2 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2002 | age2003 | sta~2003 | reg~2003 | ear~2003 | sam~2003 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2003 | age2004 | sta~2004 | reg~2004 | ear~2004 | sam~2004 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2004 | age2005 | sta~2005 | reg~2005 | ear~2005 | sam~2005 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2005 | age2006 | sta~2006 | reg~2006 | ear~2006 | sam~2006 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2006 | age2007 | sta~2007 | reg~2007 | ear~2007 | sam~2007 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2007 | age2008 | sta~2008 | reg~2008 | ear~2008 | sam~2008 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2008 | age2009 | sta~2009 | reg~2009 | ear~2009 | sam~2009 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2009 | age2010 | sta~2010 | reg~2010 | ear~2010 | sam~2010 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2010 | age2011 | sta~2011 | reg~2011 | ear~2011 | sam~2011 |
     |        . |       . |        . |        . |        . |        . |
     |----------------------------------------------------------------|
     |    qua~2011    |    sex     |    treated     |    school~g     |
     |           .    |      M     |          0     |          16     |
     +----------------------------------------------------------------+

     +----------------------------------------------------------------+
  2. | workerid | age1995 | sta~1995 | reg~1995 | ear~1995 | sam~1995 |
     |        2 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1995 | age1996 | sta~1996 | reg~1996 | ear~1996 | sam~1996 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1996 | age1997 | sta~1997 | reg~1997 | ear~1997 | sam~1997 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1997 | age1998 | sta~1998 | reg~1998 | ear~1998 | sam~1998 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1998 | age1999 | sta~1999 | reg~1999 | ear~1999 | sam~1999 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1999 | age2000 | sta~2000 | reg~2000 | ear~2000 | sam~2000 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2000 | age2001 | sta~2001 | reg~2001 | ear~2001 | sam~2001 |
     |        . |      54 |     2001 |        4 |  18682.6 | .0321868 |
     |----------+---------+----------+----------+----------+----------|
     | qua~2001 | age2002 | sta~2002 | reg~2002 | ear~2002 | sam~2002 |
     |        4 |      55 |     2001 |        4 | 293336.4 | .4712022 |
     |----------+---------+----------+----------+----------+----------|
     | qua~2002 | age2003 | sta~2003 | reg~2003 | ear~2003 | sam~2003 |
     |        2 |      56 |     2001 |        4 | 111797.3 |  .704381 |
     |----------+---------+----------+----------+----------+----------|
     | qua~2003 | age2004 | sta~2004 | reg~2004 | ear~2004 | sam~2004 |
     |        2 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2004 | age2005 | sta~2005 | reg~2005 | ear~2005 | sam~2005 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2005 | age2006 | sta~2006 | reg~2006 | ear~2006 | sam~2006 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2006 | age2007 | sta~2007 | reg~2007 | ear~2007 | sam~2007 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2007 | age2008 | sta~2008 | reg~2008 | ear~2008 | sam~2008 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2008 | age2009 | sta~2009 | reg~2009 | ear~2009 | sam~2009 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2009 | age2010 | sta~2010 | reg~2010 | ear~2010 | sam~2010 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2010 | age2011 | sta~2011 | reg~2011 | ear~2011 | sam~2011 |
     |        . |       . |        . |        . |        . |        . |
     |----------------------------------------------------------------|
     |    qua~2011    |    sex     |    treated     |    school~g     |
     |           .    |      M     |          0     |          16     |
     +----------------------------------------------------------------+

     +----------------------------------------------------------------+
  3. | workerid | age1995 | sta~1995 | reg~1995 | ear~1995 | sam~1995 |
     |        3 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1995 | age1996 | sta~1996 | reg~1996 | ear~1996 | sam~1996 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1996 | age1997 | sta~1997 | reg~1997 | ear~1997 | sam~1997 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1997 | age1998 | sta~1998 | reg~1998 | ear~1998 | sam~1998 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1998 | age1999 | sta~1999 | reg~1999 | ear~1999 | sam~1999 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1999 | age2000 | sta~2000 | reg~2000 | ear~2000 | sam~2000 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2000 | age2001 | sta~2001 | reg~2001 | ear~2001 | sam~2001 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2001 | age2002 | sta~2002 | reg~2002 | ear~2002 | sam~2002 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2002 | age2003 | sta~2003 | reg~2003 | ear~2003 | sam~2003 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2003 | age2004 | sta~2004 | reg~2004 | ear~2004 | sam~2004 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2004 | age2005 | sta~2005 | reg~2005 | ear~2005 | sam~2005 |
     |        . |      54 |     2005 |        5 | 88351.67 | .3559006 |
     |----------+---------+----------+----------+----------+----------|
     | qua~2005 | age2006 | sta~2006 | reg~2006 | ear~2006 | sam~2006 |
     |        4 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2006 | age2007 | sta~2007 | reg~2007 | ear~2007 | sam~2007 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2007 | age2008 | sta~2008 | reg~2008 | ear~2008 | sam~2008 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2008 | age2009 | sta~2009 | reg~2009 | ear~2009 | sam~2009 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2009 | age2010 | sta~2010 | reg~2010 | ear~2010 | sam~2010 |
     |        . |      59 |     2005 |        5 | 46229.57 | .8969152 |
     |----------+---------+----------+----------+----------+----------|
     | qua~2010 | age2011 | sta~2011 | reg~2011 | ear~2011 | sam~2011 |
     |        2 |       . |        . |        . |        . |        . |
     |----------------------------------------------------------------|
     |    qua~2011    |    sex     |    treated     |    school~g     |
     |           .    |      M     |          0     |          16     |
     +----------------------------------------------------------------+

     +----------------------------------------------------------------+
  4. | workerid | age1995 | sta~1995 | reg~1995 | ear~1995 | sam~1995 |
     |        4 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1995 | age1996 | sta~1996 | reg~1996 | ear~1996 | sam~1996 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1996 | age1997 | sta~1997 | reg~1997 | ear~1997 | sam~1997 |
     |        . |      45 |     1997 |        5 | 24911.03 | .3990085 |
     |----------+---------+----------+----------+----------+----------|
     | qua~1997 | age1998 | sta~1998 | reg~1998 | ear~1998 | sam~1998 |
     |        2 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1998 | age1999 | sta~1999 | reg~1999 | ear~1999 | sam~1999 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1999 | age2000 | sta~2000 | reg~2000 | ear~2000 | sam~2000 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2000 | age2001 | sta~2001 | reg~2001 | ear~2001 | sam~2001 |
     |        . |      49 |     1997 |        5 | 9908.362 | .5519462 |
     |----------+---------+----------+----------+----------+----------|
     | qua~2001 | age2002 | sta~2002 | reg~2002 | ear~2002 | sam~2002 |
     |        3 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2002 | age2003 | sta~2003 | reg~2003 | ear~2003 | sam~2003 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2003 | age2004 | sta~2004 | reg~2004 | ear~2004 | sam~2004 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2004 | age2005 | sta~2005 | reg~2005 | ear~2005 | sam~2005 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2005 | age2006 | sta~2006 | reg~2006 | ear~2006 | sam~2006 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2006 | age2007 | sta~2007 | reg~2007 | ear~2007 | sam~2007 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2007 | age2008 | sta~2008 | reg~2008 | ear~2008 | sam~2008 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2008 | age2009 | sta~2009 | reg~2009 | ear~2009 | sam~2009 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2009 | age2010 | sta~2010 | reg~2010 | ear~2010 | sam~2010 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2010 | age2011 | sta~2011 | reg~2011 | ear~2011 | sam~2011 |
     |        . |       . |        . |        . |        . |        . |
     |----------------------------------------------------------------|
     |    qua~2011    |    sex     |    treated     |    school~g     |
     |           .    |      M     |          1     |          12     |
     +----------------------------------------------------------------+

     +----------------------------------------------------------------+
  5. | workerid | age1995 | sta~1995 | reg~1995 | ear~1995 | sam~1995 |
     |        5 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1995 | age1996 | sta~1996 | reg~1996 | ear~1996 | sam~1996 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1996 | age1997 | sta~1997 | reg~1997 | ear~1997 | sam~1997 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1997 | age1998 | sta~1998 | reg~1998 | ear~1998 | sam~1998 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1998 | age1999 | sta~1999 | reg~1999 | ear~1999 | sam~1999 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~1999 | age2000 | sta~2000 | reg~2000 | ear~2000 | sam~2000 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2000 | age2001 | sta~2001 | reg~2001 | ear~2001 | sam~2001 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2001 | age2002 | sta~2002 | reg~2002 | ear~2002 | sam~2002 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2002 | age2003 | sta~2003 | reg~2003 | ear~2003 | sam~2003 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2003 | age2004 | sta~2004 | reg~2004 | ear~2004 | sam~2004 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2004 | age2005 | sta~2005 | reg~2005 | ear~2005 | sam~2005 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2005 | age2006 | sta~2006 | reg~2006 | ear~2006 | sam~2006 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2006 | age2007 | sta~2007 | reg~2007 | ear~2007 | sam~2007 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2007 | age2008 | sta~2008 | reg~2008 | ear~2008 | sam~2008 |
     |        . |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2008 | age2009 | sta~2009 | reg~2009 | ear~2009 | sam~2009 |
     |        . |      55 |     1998 |        2 | 137207.3 |  .014439 |
     |----------+---------+----------+----------+----------+----------|
     | qua~2009 | age2010 | sta~2010 | reg~2010 | ear~2010 | sam~2010 |
     |        3 |       . |        . |        . |        . |        . |
     |----------+---------+----------+----------+----------+----------|
     | qua~2010 | age2011 | sta~2011 | reg~2011 | ear~2011 | sam~2011 |
     |        . |      57 |     1998 |        2 |  5227.69 | .3182252 |
     |----------------------------------------------------------------|
     |    qua~2011    |    sex     |    treated     |    school~g     |
     |           3    |      M     |          1     |          14     |
     +----------------------------------------------------------------+

.

There are so many missing values in the data! Should we worry? Not at all. As a matter of fact, we learned at the beginning of this module that many workers are not observed across all years. That’s what these missing values are representing.

Notice that the variable year which was part of the command line (the j(year) part) has disappeared. We now have one observation per worker, with their information recorded across years in a cross-sectional way.

How do we go from a wide data set back to the regular panel form? We need to list the prefixes of the variable names, which are known as stubs in Stata, and use the reshape long command. When we write j(year), Stata will create a new variable called year.

%%stata

reshape long earnings region age  start_year sample_weight, i(workerid) j(year)


. 
. reshape long earnings region age  start_year sample_weight, i(workerid) j(yea
> r) 
(j = 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
>  2010 2011)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations           39,999   ->   679,983     
Number of variables                 106   ->   27          
j variable (17 values)                    ->   year
xij variables:
earnings1995 earnings1996 ... earnings2011->   earnings
   region1995 region1996 ... region2011   ->   region
            age1995 age1996 ... age2011   ->   age
start_year1995 start_year1996 ... start_year2011->start_year
sample_weight1995 sample_weight1996 ... sample_weight2011->sample_weight
-----------------------------------------------------------------------------

.

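Notice that we now have an observation for every worker in every year, although we know some workers are only observed in a subset of these years. The data set now has the shape of a balanced panel, but the rows for years in which a worker was not observed have missing values for earnings and the other time-varying variables. Notice also that quarter_birth was not included in the list of stubs above, so it remains in wide form (quarter_birth1995 to quarter_birth2011); this is why the output reports 27 variables rather than the original 11.

To retrieve the original observations, we drop the rows with missing values.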

%%stata

keep if !missing(earnings)

%%stata

list in 1/10 //change the numbers if you would like to see more observations
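For a round trip that returns every variable to long form, the stub list given to reshape long can mirror the one given to reshape wide. The sketch below starts again from fake_data.dta inside preserve and restore, so it does not disturb the data in memory; it assumes the file is in the working directory, as in the cells above.

%%stata

preserve
use "fake_data.dta", clear
reshape wide earnings region age start_year sample_weight quarter_birth, i(workerid) j(year)
reshape long earnings region age start_year sample_weight quarter_birth, i(workerid) j(year)
// drop the worker-years that were never observed
keep if !missing(earnings)
// compare the number of observations and variables with the original file
describe, short
restore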

7.6 Wrap Up

In this module, you have developed some very useful skills that will help you explore data sets. Specifically, these skills will help you both prepare your data for empirical analysis (e.g. turning panel data into cross-sectional data and back) and create summary statistics that you can use to illustrate your results. In the next module, we will look at how to work with multiple data sets simultaneously and merge them into one.

7.6.1 Wrap Up Table

Command	Function
`by`	It is a pre-command used to repeat a Stata command on subsets of the data
`generate`	It generates variables
`sort`	It sorts data
`summarize`	It reports summary statistics for variables in a dataset
`_n`	It records the observation number
`_N`	It records the total number of observations (within each group when used with `by`)
`drop`	It drops variables or observations
`keep`	It keeps variables or observations that satisfy a specified condition
`egen`	It creates variables that require access to some functions
`rowtotal()`	It sums non-missing values for each observation of a list of variables
`collapse`	It makes a dataset of summary statistics
`reshape`	It converts data from wide to long and vice versa

7.6.2 Errors

1. Sort

To create variables within groups, ensure that you first sort the observations by the grouping variable. If the observations are not sorted, by stops with the error "not sorted" (r(5)).

%%stata

cap drop var
by sex: gen var = _n

The correct way to generate such variables is shown below:

%%stata

cap drop var
bysort sex: gen var = _n

%%stata

su var
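Equivalently, the sort can be done as a separate step before by; bysort is simply shorthand for the two commands together. The variable name var2 below is an illustrative choice.

%%stata

cap drop var2
sort sex
by sex: gen var2 = _n
su var2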

2. Reshape Error

Reshaping data can be tricky, and doing so incorrectly can cause many variables to be dropped in the process. The command reshape error can be used to identify the issues encountered when reshaping data.

%%stata

clear *
use "fake_data.dta", clear

%%stata

reshape wide earnings sex, i(year) j(workerid)

%%stata

reshape error

7.7 Video tutorial

Click on the image below for a video tutorial on this module.
