ECON 407: Stata Primer

Here I briefly introduce matrix algebra manipulations and maximum likelihood programming in Stata. Other software packages are arguably better suited to these tasks, but in this class we'll focus on Stata as the tool for all of our work. If you prefer to do your work in other mathematical packages (e.g. R, Python, or Matlab), you are free to do so, but I might not be able to support any technical issues you run into.

Stata can load comma-delimited (csv), Excel (xls), and Stata (dta) files out of the box. It can also load data from the web:

use "http://rlhick.people.wm.edu/econ407/data/mroz"
sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         lfp |        753    .5683931    .4956295          0          1
        whrs |        753    740.5764    871.3142          0       4950
         kl6 |        753    .2377158     .523959          0          3
        k618 |        753    1.353254    1.319874          0          8
          wa |        753    42.53785    8.072574         30         60
-------------+---------------------------------------------------------
          we |        753    12.28685    2.280246          5         17
          ww |        753    2.374565    3.241829          0         25
        rpwg |        753    1.849734    2.419887          0       9.98
        hhrs |        753    2267.271    595.5666        175       5010
          ha |        753    45.12085    8.058793         30         60
-------------+---------------------------------------------------------
          he |        753    12.49137    3.020804          3         17
          hw |        753    7.482179    4.230559      .4121     40.509
      faminc |        753    23080.59     12190.2       1500      96000
         mtr |        753    .6788632    .0834955      .4415      .9415
        wmed |        753    9.250996    3.367468          0         17
-------------+---------------------------------------------------------
        wfed |        753    8.808765     3.57229          0         17
          un |        753    8.623506    3.114934          3         14
         cit |        753    .6427623    .4795042          0          1
          ax |        753    10.63081     8.06913          0         45


Loading files from disk is a slight variation on the above command. Supposing that your Stata data file mroz.dta is in the folder /some/place, in Linux or MacOS we would use the command

use "/some/place/mroz.dta"
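
The use command reads only .dta files; comma-delimited files are read with import delimited, and Excel files with import excel. As a minimal sketch, the lines below write a tiny hand-typed dataset to a csv file and read it back. The file name mroz_small.csv and the three rows are assumptions made for illustration only.

```stata
* Build a tiny dataset by hand (illustrative values only)
clear
input lfp whrs kl6
1 1610 1
1 1656 0
1 1980 1
end

* Write it to a csv file in the working directory (hypothetical file name)
export delimited using "mroz_small.csv", replace

* Read the csv back in; the clear option drops the data currently in memory
import delimited using "mroz_small.csv", clear
describe
```

The clear option matters here: without it, Stata refuses to load new data over a dataset that has unsaved changes.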


Viewing Data

If you are using the graphical version of Stata (recommended), viewing data is easy and I can show you how to do that. Listing data at the command line is achieved by the list command, and might be useful in your problem sets for showing a few lines of data. Here we'll view the first 5 rows of data:

list in 1/5

     +--------------------------------------------------------------------+
  1. | lfp | whrs | kl6 | k618 | wa | we |     ww | rpwg | hhrs | ha | he |
     |   1 | 1610 |   1 |    0 | 32 | 12 |  3.354 | 2.65 | 2708 | 34 | 12 |
     |--------------------------------------------------------------------|
     |     hw  | faminc  |   mtr  | wmed  |  wfed  |   un  |  cit  |  ax  |
     | 4.0288  |  16310  | .7215  |   12  |     7  |    5  |    0  |  14  |
     +--------------------------------------------------------------------+

     +--------------------------------------------------------------------+
  2. | lfp | whrs | kl6 | k618 | wa | we |     ww | rpwg | hhrs | ha | he |
     |   1 | 1656 |   0 |    2 | 30 | 12 | 1.3889 | 2.65 | 2310 | 30 |  9 |
     |--------------------------------------------------------------------|
     |     hw  | faminc  |   mtr  | wmed  |  wfed  |   un  |  cit  |  ax  |
     | 8.4416  |  21800  | .6615  |    7  |     7  |   11  |    1  |   5  |
     +--------------------------------------------------------------------+

     +--------------------------------------------------------------------+
  3. | lfp | whrs | kl6 | k618 | wa | we |     ww | rpwg | hhrs | ha | he |
     |   1 | 1980 |   1 |    3 | 35 | 12 | 4.5455 | 4.04 | 3072 | 40 | 12 |
     |--------------------------------------------------------------------|
     |     hw  | faminc  |   mtr  | wmed  |  wfed  |   un  |  cit  |  ax  |
     | 3.5807  |  21040  | .6915  |   12  |     7  |    5  |    0  |  15  |
     +--------------------------------------------------------------------+

     +--------------------------------------------------------------------+
  4. | lfp | whrs | kl6 | k618 | wa | we |     ww | rpwg | hhrs | ha | he |
     |   1 |  456 |   0 |    3 | 34 | 12 | 1.0965 | 3.25 | 1920 | 53 | 10 |
     |--------------------------------------------------------------------|
     |     hw  | faminc  |   mtr  | wmed  |  wfed  |   un  |  cit  |  ax  |
     | 3.5417  |   7300  | .7815  |    7  |     7  |    5  |    0  |   6  |
     +--------------------------------------------------------------------+

     +--------------------------------------------------------------------+
  5. | lfp | whrs | kl6 | k618 | wa | we |     ww | rpwg | hhrs | ha | he |
     |   1 | 1568 |   1 |    2 | 31 | 14 | 4.5918 |  3.6 | 2000 | 32 | 12 |
     |--------------------------------------------------------------------|
     |     hw  | faminc  |   mtr  | wmed  |  wfed  |   un  |  cit  |  ax  |
     |     10  |  27300  | .6215  |   12  |    14  |  9.5  |    1  |   7  |
     +--------------------------------------------------------------------+


You can combine list with logical expressions to show only the rows that meet a condition. Let's look at the rows among the first 3 observations where the respondent has kids less than 6 years old:

list if kl6>0 in 1/3

     +--------------------------------------------------------------------+
  1. | lfp | whrs | kl6 | k618 | wa | we |     ww | rpwg | hhrs | ha | he |
     |   1 | 1610 |   1 |    0 | 32 | 12 |  3.354 | 2.65 | 2708 | 34 | 12 |
     |--------------------------------------------------------------------|
     |     hw  | faminc  |   mtr  |  wmed  |  wfed  |  un  |  cit  |  ax  |
     | 4.0288  |  16310  | .7215  |    12  |     7  |   5  |    0  |  14  |
     +--------------------------------------------------------------------+

     +--------------------------------------------------------------------+
  3. | lfp | whrs | kl6 | k618 | wa | we |     ww | rpwg | hhrs | ha | he |
     |   1 | 1980 |   1 |    3 | 35 | 12 | 4.5455 | 4.04 | 3072 | 40 | 12 |
     |--------------------------------------------------------------------|
     |     hw  | faminc  |   mtr  |  wmed  |  wfed  |  un  |  cit  |  ax  |
     | 3.5807  |  21040  | .6915  |    12  |     7  |   5  |    0  |  15  |
     +--------------------------------------------------------------------+
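
The in qualifier is applied to observation numbers, so in 1/3 limits the search to observations 1 to 3 before the if condition is checked. To list the first three observations that actually satisfy a condition, one option is a running count of matches. This is a minimal sketch on a small hand-typed dataset; the variable name nmatch is an assumption chosen for illustration.

```stata
* Five illustrative rows (kl6 and k618 only)
clear
input kl6 k618
1 0
0 2
1 3
0 3
1 2
end

* Running count of observations with kl6>0 (sum() is a running sum in generate)
gen long nmatch = sum(kl6 > 0)

* First three observations that satisfy the condition (rows 1, 3, and 5 here)
list kl6 k618 if kl6 > 0 & nmatch <= 3
```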


Creating and Modifying Variables

Creating Variables

In Stata, you create a new variable with generate (abbreviated gen).

gen newvar = lfp * ax


Modifying Variables

To modify an existing variable, use replace; Stata will not let generate overwrite a variable that already exists:

replace newvar = newvar/10


Here is an example that creates a new dummy variable.

gen haskids = 0
replace haskids = (kl6>0) | (k618>0)
list haskids kl6 k618 in 1/10

(524 real changes made)

     +----------------------+
     | haskids   kl6   k618 |
     |----------------------|
  1. |       1     1      0 |
  2. |       1     0      2 |
  3. |       1     1      3 |
  4. |       1     0      3 |
  5. |       1     1      2 |
     |----------------------|
  6. |       0     0      0 |
  7. |       1     0      2 |
  8. |       0     0      0 |
  9. |       1     0      2 |
 10. |       1     0      2 |
     +----------------------+
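
Because a logical expression evaluates to 1 when true and 0 when false, the two steps can be collapsed into a single generate. One caveat: Stata treats a missing value as larger than any number, so kl6>0 is true when kl6 is missing. The sketch below guards against that. It uses a hand-typed dataset with one missing value added for illustration, and haskids2 is a hypothetical variable name.

```stata
* Four illustrative rows; the last has a missing kl6
clear
input kl6 k618
1 0
0 2
0 0
. 0
end

* One-step dummy, left missing whenever either input is missing
gen haskids2 = (kl6 > 0) | (k618 > 0) if !missing(kl6, k618)
list
```

The Mroz data summarized earlier has 753 observations for every variable, so the guard changes nothing there, but it is a cheap habit on other datasets.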


Creating dummy variables

While the above example shows how to use logical checks to create dummy variables "manually", a better way (particularly if you need to create many categories) is tab with the gen option. Suppose a variable x takes on the values 1, 2, or 3. To create categorical (dummy) variables for each value, use

tab x, gen(dum_x)
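
As a small illustration of what this produces, the sketch below builds a made-up variable x and runs the command. The new variables are named by appending 1, 2, 3, ... to the stub given in gen(), in the order of the sorted values of x.

```stata
* A made-up categorical variable
clear
input x
1
2
3
2
1
end

* Creates dum_x1, dum_x2, dum_x3, one indicator per distinct value of x
tab x, gen(dum_x)
list
```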


Starting Over

Sometimes, you want to get rid of all the variables for a new analysis, or simply to start over. To do this, use the clear command.

Log Files

A very useful way to save your results is to have Stata automatically put everything in a log file. To initialize a log file and use it, the command

log using "/some/place/my_first.log", replace text


will create (or, if it exists, replace) the file my_first.log in the folder /some/place. If you don't want to replace your existing work, use this command instead

log using "/some/place/my_first.log", append text


and all of your results will be appended to the log file. When you are finished with a Stata session, issue the command log close to close the file and save all changes. You may then open it using the text editor of your choosing.
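
Put together, a logged session might look like the sketch below. The file name is hypothetical and is written to the current working directory; capture log close is a common precaution that closes any log left open earlier without raising an error when none is open.

```stata
* Close a log left open earlier, if any
capture log close

* Start a plain-text log (hypothetical file name, current working directory)
log using "my_first.log", replace text

* Everything between log using and log close is recorded
display "Hello from the log"
display 2 + 2

* Close the log and save it
log close
```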

Do Files

Do files allow you to put all of the relevant Stata commands for a project into one file, so that results can be easily replicated from one Stata session to the next. The use of do files is highly recommended for your own work, and is a required part of your assignments in the course. I will illustrate their use early in the class. Additionally, you are required to write literate "do" files as I will show you on the first day of class.

Getting help in Stata

If you need to find general help in Stata, type help command where command is some Stata command. You can also do keyword searches: search keyword. To see the same set of results in a better help viewer, type view search keyword, for example view search reg.

Linear Algebra in Stata

Stata has a linear algebra environment that can be started using the mata command from the Stata command line. Notice, when you type mata from the Stata command window, the command prompt changes from a . to :. This is really your only way of distinguishing if you are in the Mata or Stata environment. At this point "normal" Stata commands (e.g. summarize, reg, or use) will not work and will lead to error messages. To exit Mata, issue the command end. Commands for Mata may also be nested inside Stata do files (command files) so long as all Mata commands are between the commands mata and end.
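
A do file that mixes the two environments might look like the following minimal sketch. The variable names z and s are made up for illustration.

```stata
* Ordinary Stata commands
clear
set obs 3
gen z = _n
summarize z

* Everything between mata and end is passed to Mata
mata
    // Mata syntax applies here
    s = sum((1, 2, 3))
    s
end

* Back in Stata
display "done"
```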

Getting help in mata is similar to the normal Stata environment. Type help mata command where command is some mata command. You can also do keyword searches: search mata keyword. To see the same set of results in a better help viewer, type view search mata keyword. For example view search mata inverse.

Once you have Stata running, you can invoke mata like this

mata

------------------------------------------------- mata (type end to exit) -----


Creating matrices, vectors, and scalars

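There are two ways to create a matrix. The first is to type the elements directly, with commas separating columns and backslashes separating rows. Consider a two by two matrix,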

A = (1,2 \ 3,4)
A

     1   2
+---------+
1 |  1   2  |
2 |  3   4  |
+---------+


Or, you could create an empty matrix of the desired dimension

B=J(2,3,.)
B

     1   2   3
+-------------+
1 |  .   .   .  |
2 |  .   .   .  |
+-------------+


where B is of dimension rows=2 and columns=3. We can fill $\mathbf{B}$ element by element:

B[1,1]=5
B[1,2]=6
B[1,3]=7
B[2,1]=8
B[2,2]=9
B[2,3]=10
B

      1    2    3
+----------------+
1 |   5    6    7  |
2 |   8    9   10  |
+----------------+
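
Filling a larger matrix one element at a time quickly becomes tedious, and a loop does the same job. This is a minimal sketch, written as it would appear in a do file; the counter k is an illustrative name.

```stata
mata
    // omit the mata and end lines if you are already at the Mata prompt
    B = J(2, 3, .)
    k = 5
    for (i = 1; i <= rows(B); i++) {
        for (j = 1; j <= cols(B); j++) {
            B[i, j] = k      // fill row by row with 5, 6, ..., 10
            k = k + 1
        }
    }
    B
end
```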


Building a matrix from submatrices

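Suppose you have the matrices A to D defined as below. The matrix $\mathbf{E} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{bmatrix}$ is then constructed as: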

A=(1,2 \ 3,4)
B=(5,6,7 \ 8,9,10)
C=(3,4 \ 5,6)
D=(1,2,3 \ 4,5,6)
E=(A,B \ C,D)
E

      1    2    3    4    5
+--------------------------+
1 |   1    2    5    6    7  |
2 |   3    4    8    9   10  |
3 |   3    4    1    2    3  |
4 |   5    6    4    5    6  |
+--------------------------+
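
The comma joins matrices side by side, so they must have the same number of rows, and the backslash stacks them, so they must have the same number of columns. Going the other way, a block can be pulled back out of E with subscripts. The sketch below shows two equivalent ways to recover the upper-right block, written as it would appear in a do file.

```stata
mata
    // omit the mata and end lines if you are already at the Mata prompt
    A = (1,2 \ 3,4)
    B = (5,6,7 \ 8,9,10)
    C = (3,4 \ 5,6)
    D = (1,2,3 \ 4,5,6)
    E = (A,B \ C,D)

    // range subscript: top-left corner (1,3) to bottom-right corner (2,5)
    E[|1,3 \ 2,5|]

    // list subscripts: rows 1 and 2, columns 3, 4, and 5
    E[(1\2), (3,4,5)]
end
```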


Creating Vectors

Row and column vectors can also be created using the same basic syntax:

f = (1, 2, 3)
f

     1   2   3
+-------------+
1 |  1   2   3  |
+-------------+


or, a column vector can be created by

g=(3\ 4 \5)
g

     1
+-----+
1 |  3  |
2 |  4  |
3 |  5  |
+-----+


The command below constructs a column vector of incremented integer values between 1 and 5; replacing 5 with 100 would give 1,2,3,…,99,100. Using .. in place of :: produces a row vector instead.

id_rows=(1::5)
id_rows

     1
+-----+
1 |  1  |
2 |  2  |
3 |  3  |
4 |  4  |
5 |  5  |
+-----+
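
The two range operators can be compared side by side. In this minimal sketch, n, r, and c are illustrative names, and the block is written as it would appear in a do file.

```stata
mata
    // omit the mata and end lines if you are already at the Mata prompt
    n = 5
    r = (1..n)      // row vector 1, 2, ..., n
    c = (1::n)      // column vector with the same elements
    r
    c'              // transposing the column vector gives the row vector
    rows(1::100)    // a column vector running from 1 to 100 has 100 rows
end
```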


Creating Scalars

These are easy. To define a scalar variable called u:

u = 3
u

3


Creating a vector of zeros or ones

Suppose we have 1000 observations and we wish to create a column of ones (this is especially useful for estimating a constant term), use this command

ones=J(1000, 1, 1)
ones[1::5]

     1
+-----+
1 |  1  |
2 |  1  |
3 |  1  |
4 |  1  |
5 |  1  |
+-----+


This command can be combined with what we have previously to create the full matrix of independent variables (with the constant in the first column) using

X=(ones,x)


so long as your matrix of independent variables x exists in mata and has 1000 rows.
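
To avoid hard-coding the number of observations, the column of ones can be sized from x itself. This is a sketch with made-up values standing in for the independent variables, written as it would appear in a do file.

```stata
mata
    // omit the mata and end lines if you are already at the Mata prompt
    // x stands in for a matrix of independent variables (made-up values)
    x = (2,3 \ 4,5 \ 6,7)

    // rows(x) supplies the number of observations
    X = (J(rows(x), 1, 1), x)
    X
end
```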

Creating the Identity Matrix

The command below creates an identity matrix with 5 rows and columns.

identity = I(5)
identity

[symmetric]
1   2   3   4   5
+---------------------+
1 |  1                  |
2 |  0   1              |
3 |  0   0   1          |
4 |  0   0   0   1      |
5 |  0   0   0   0   1  |
+---------------------+


Note, Stata only shows the lower triangular part of any symmetric matrix.

Stata datasets in Mata

Once you have loaded data into Stata as described above, it is easy to access that information from within Mata. To bring the Mroz data (already loaded into Stata) into Mata, there are two ways to proceed: copy the data, or create a view that always refers back to the original Stata dataset. Views are useful if you want to modify the data in Mata and then return to Stata with the original dataset changed by those operations; they also use less memory, because no second copy of the data is made. Copying the data uses more memory but is generally faster to work with. If you need to do all your work in Mata and don't need to change any of the underlying .dta data, I recommend the copy method. The command to copy everything in the Stata workspace into Mata is

X=st_data(.,.)
X[1::5,]

               1             2             3             4             5
+-----------------------------------------------------------------------
1 |            1          1610             1             0            32
2 |            1          1656             0             2            30
3 |            1          1980             1             3            35
4 |            1           456             0             3            34
5 |            1          1568             1             2            31
+-----------------------------------------------------------------------
6             7             8             9            10
-----------------------------------------------------------------------
1             12   3.354000092   2.650000095          2708            34
2             12   1.388900042   2.650000095          2310            30
3             12   4.545499802   4.039999962          3072            40
4             12   1.096500039          3.25          1920            53
5             14   4.591800213   3.599999905          2000            32
-----------------------------------------------------------------------
11            12            13            14            15
-----------------------------------------------------------------------
1             12   4.028800011         16310   .7214999795            12
2              9   8.441599846         21800   .6614999771             7
3             12   3.580699921         21040   .6915000081            12
4             10   3.541699886          7300   .7814999819             7
5             12            10         27300   .6215000153            12
-----------------------------------------------------------------------
16            17            18            19            20
-----------------------------------------------------------------------+
1              7             5             0            14             1  |
2              7            11             1             5             1  |
3              7             5             0            15             1  |
4              7             5             0             6             1  |
5             14           9.5             1             7             1  |
-----------------------------------------------------------------------+


Note, columns aren't labeled and you need to keep track of variable order in Stata to know which columns are important for your work.

Alternatively, you can selectively include columns in the order you define. Here we load three variables and view the first 5 rows:

X=st_data(.,("kl6","k618","faminc"))
X[1::5,]

         1       2       3
+-------------------------+
1 |      1       0   16310  |
2 |      0       2   21800  |
3 |      1       3   21040  |
4 |      0       3    7300  |
5 |      1       2   27300  |
+-------------------------+


Remember that changes made to a matrix created with st_data are not passed back to the Stata dataset when you end the Mata session. The st_view function takes the same row and variable arguments as st_data, but fills a matrix named in its first argument (for example, st_view(V=., ., .)) and allows changes to the data to be preserved once back in Stata. In this course, it is sufficient to use the command st_data to load data into Mata as described above.
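
The difference between a copy and a view can be seen in a small experiment. The sketch below uses three hand-typed rows rather than the full Mroz data, and the names Xc and Xv and the replacement values are made up for illustration.

```stata
* Three illustrative rows
clear
input kl6 k618 faminc
1 0 16310
0 2 21800
1 3 21040
end

mata
    // copy: Xc is a separate matrix held in Mata
    Xc = st_data(., ("kl6", "k618", "faminc"))

    // view: Xv refers directly to the Stata variables
    st_view(Xv = ., ., ("kl6", "k618", "faminc"))

    Xc[1, 3] = 0        // changes the Mata copy only
    Xv[2, 3] = 99999    // written through to the Stata dataset
end

* Observation 1 still shows 16310; observation 2 now shows 99999
list
```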

The mata workspace

The command mata describe will list all the matrices, vectors, and scalars currently defined.

mata describe

      # bytes   type                        name and extent
-------------------------------------------------------------------------------
32   real matrix                 A[2,2]
48   real matrix                 B[2,3]
32   real matrix                 C[2,2]
48   real matrix                 D[2,3]
160   real matrix                 E[4,5]
18,072   real matrix                 X[753,3]
24   real rowvector              f[3]
24   real colvector              g[3]
80   real colvector              id1[10]
40   real colvector              id_cols[5]
40   real colvector              id_rows[5]
200   real matrix                 identity[5,5]
8,000   real colvector              ones[1000]
8   real scalar                 u
-------------------------------------------------------------------------------


To delete all of these, issue mata clear. To delete only a few matrices, vectors, or scalars, issue mata drop followed by their names, for example mata drop X f g.

Mata offers three functions useful for checking conformability conditions. The functions rows(X) and cols(X) return the number of rows and columns of X respectively,

rows(X)
cols(X)

753
3


while length()

length(X)

2259


calculates the total number of elements in matrix X, equal to (# rows) × (# columns).
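
These functions can be used to check conformability before an operation is attempted. This is a minimal sketch using Mata's assert(), which stops with an error when its argument is false; the matrices are small made-up examples, and the block is written as it would appear in a do file.

```stata
mata
    // omit the mata and end lines if you are already at the Mata prompt
    A = (1,2 \ 3,4)
    B = (5,6,7 \ 8,9,10)

    // A*B requires cols(A) == rows(B)
    assert(cols(A) == rows(B))
    D = A*B

    // the product has rows(A) rows and cols(B) columns
    (rows(D), cols(D), length(D))
end
```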

Linear Algebra Operations

Important Commands

Operation                                              Command
-----------------------------------------------------------------------
Transpose of B                                         B'
Inverse of B                                           luinv(B)
Inverse of B (if symmetric)                            invsym(B)
Diagonal elements of matrix B                          diagonal(B)
Put vector B into a diagonal square matrix             diag(B)
Upper triangular elements of B                         uppertriangle(B)
Lower triangular elements of B                         lowertriangle(B)
Sort rows based on values in column i                  sort(B,i)
Sort rows based on values in columns i and j           sort(B,(i,j))
Cross product B'A, omitting rows with missing values   cross(B,A)
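
A few of these commands can be tried on a small example. The sketch below uses a made-up 2 by 2 matrix, with S, Ainv, and Sinv as illustrative names; mreldif() reports the largest relative difference between two matrices, so values near zero indicate that each product is numerically the identity.

```stata
mata
    // omit the mata and end lines if you are already at the Mata prompt
    A = (1,2 \ 3,4)
    S = A'*A                 // symmetric matrix (10,14 \ 14,20)

    Ainv = luinv(A)          // general inverse
    Sinv = invsym(S)         // inverse of a symmetric matrix

    // both products should be numerically close to the identity
    mreldif(Ainv*A, I(2))
    mreldif(Sinv*S, I(2))

    diagonal(S)              // column vector (10 \ 20)
    diag((1, 2))             // 2 x 2 matrix with 1 and 2 on the diagonal
    cross(A, A)              // same values as A'*A
end
```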

For matrices A and C of the same dimensions, matrix addition is given by

D  = A + C
D

      1    2
+-----------+
1 |   4    6  |
2 |   8   10  |
+-----------+


Subtraction follows in a similar way. Multiplication (assuming conformability of A and B) is given by

D  = A * B
D

      1    2    3
+----------------+
1 |  21   24   27  |
2 |  47   54   61  |
+----------------+


Combinations of these operators are also possible. For example, $\mathbf{(x'x)^{-1}x'y}$ is computed by

invsym(x'*x)*x'*y


which is the OLS estimator we discuss in Chapter 1. There are many more functions and tools in the Mata environment that I won't describe here, but they are available to interested students.
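
To see the formula at work, the sketch below computes the estimator in Mata and compares it with Stata's regress. It uses five hand-typed rows (whrs, kl6, and wa from the first five observations listed earlier) purely to keep the example self-contained; estimates from five observations mean nothing substantively, and the choice of regressors is an assumption made for illustration.

```stata
* Five illustrative rows from the listing of the Mroz data
clear
input whrs kl6 wa
1610 1 32
1656 0 30
1980 1 35
 456 0 34
1568 1 31
end

* Benchmark: Stata's built-in OLS
regress whrs kl6 wa

mata
    y = st_data(., "whrs")
    // constant in the first column, then kl6 and wa
    x = (J(rows(y), 1, 1), st_data(., ("kl6", "wa")))
    b = invsym(x'*x)*x'*y
    b       // order: constant, kl6, wa
end
```

The elements of b should match the coefficients reported by regress, keeping in mind that regress lists the constant last while b has it first.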