ECON 407: R Primer

I recommend that you use RStudio for your work if you decide to use R in this class.

Loading data into R

Loading Stata datasets

The foreign library allows us to open a bunch of different types of data files, including Stata, SAS, and SPSS, to name a few (comma-delimited data can be read with base R's read.csv). Good documentation is found here. Below, I show you how to open Stata datasets.

library(foreign)
mroz <- read.dta("http://rlhick.people.wm.edu/econ407/data/mroz.dta")
summary(mroz)
     lfp              whrs             kl6              k618      
Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000  
Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
      wa              we              ww              rpwg           hhrs     
Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175  
1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928  
Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164  
Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267  
3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553  
Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010  
      ha              he              hw              faminc     
Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500  
1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428  
Median :46.00   Median :12.00   Median : 6.9758   Median :20880  
Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081  
3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200  
Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000  
     mtr              wmed             wfed              un        
Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000  
1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500  
Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500  
Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624  
3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000  
Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000  
     cit               ax       
Min.   :0.0000   Min.   : 0.00  
1st Qu.:0.0000   1st Qu.: 4.00  
Median :1.0000   Median : 9.00  
Mean   :0.6428   Mean   :10.63  
3rd Qu.:1.0000   3rd Qu.:15.00  
Max.   :1.0000   Max.   :45.00

Loading files from disk is a slight variation on the above command. Supposing that your Stata data file mroz.dta is in the folder /some/place, in Linux or macOS we would use the R command

mroz <- read.dta("/some/place/mroz.dta")

Opening R datasets

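If your dataset is already in an R format, simply use the load command. Note that load restores the saved objects under the names they were saved with and returns only those names, so its result should not be assigned to mroz:

load("/some/place/mroz.RData")

Web-based data is also accessible using load, by wrapping the address in url():

load(url("http://www.someplace.com/some/place/mroz.RData"))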
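As a minimal sketch of how load behaves, the example below saves a made-up data frame to a temporary file and reads it back. The names toy, tmp, and restored are purely illustrative and are not part of the course data.

```r
# Hypothetical toy data, saved to a temporary file for illustration
toy <- data.frame(id = 1:3, wage = c(3.35, 1.39, 4.55))
tmp <- tempfile(fileext = ".RData")
save(toy, file = tmp)

rm(toy)                  # remove the data frame from the workspace
restored <- load(tmp)    # load() recreates toy and returns the object names

print(restored)          # a character vector holding the name "toy"
print(toy)               # the data frame is back under its original name
```

Because load returns names rather than data, a file containing several objects restores all of them at once.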

Viewing Data in R

If you are using RStudio (recommended), listing data is easy and I can show you how to do that. Viewing R data at the command line is achieved by the head command. Here we'll view the first 5 rows of data:

head(mroz,5)
  lfp whrs kl6 k618 wa we     ww rpwg hhrs ha he      hw faminc    mtr wmed
1   1 1610   1    0 32 12 3.3540 2.65 2708 34 12  4.0288  16310 0.7215   12
2   1 1656   0    2 30 12 1.3889 2.65 2310 30  9  8.4416  21800 0.6615    7
3   1 1980   1    3 35 12 4.5455 4.04 3072 40 12  3.5807  21040 0.6915   12
4   1  456   0    3 34 12 1.0965 3.25 1920 53 10  3.5417   7300 0.7815    7
5   1 1568   1    2 31 14 4.5918 3.60 2000 32 12 10.0000  27300 0.6215   12
  wfed   un cit ax
1    7  5.0   0 14
2    7 11.0   1  5
3    7  5.0   0 15
4    7  5.0   0  6
5   14  9.5   1  7

Or, the last 5 rows of data:

tail(mroz,5)
    lfp whrs kl6 k618 wa we ww rpwg hhrs ha he      hw faminc    mtr wmed wfed
749   0    0   0    2 40 13  0    0 3020 43 16  9.2715  28200 0.6215   10   10
750   0    0   2    3 31 12  0    0 2056 33 12  4.8638  10000 0.7715   12   12
751   0    0   0    0 43 12  0    0 2383 43 12  1.0898   9952 0.7515   10    3
752   0    0   0    0 60 12  0    0 1705 55  8 12.4400  24984 0.6215   12   12
753   0    0   0    3 39  9  0    0 3120 48 12  6.0897  28363 0.6915    7    7
      un cit ax
749  9.5   1  5
750  7.5   0 14
751  7.5   0  4
752 14.0   1 15
753 11.0   1 12

Or specific rows, using what is called "slice" indexing:

mroz[10:15,]
   lfp whrs kl6 k618 wa we     ww rpwg hhrs ha he      hw faminc    mtr wmed
10   1 1600   0    2 39 12 4.6875 4.15 2100 43 12  5.7143  20425 0.6915    7
11   1 1969   0    1 33 12 4.0630 4.30 2450 34 12  9.7959  32300 0.5815   12
12   1 1960   0    1 42 11 4.5918 4.58 2375 47 14  8.0000  28700 0.6215   14
13   1  240   1    2 30 12 2.0833 0.00 2830 33 16  5.3004  15500 0.7215   16
14   1  997   0    2 43 12 2.2668 3.50 3317 46 12  4.3413  16860 0.7215   10
15   1 1848   0    1 43 10 3.6797 3.38 2024 45 17 10.8700  31431 0.5815    7
   wfed  un cit ax
10    7 5.0   0 21
11    3 5.0   0 15
12    7 5.0   0 14
13   16 5.0   0  0
14   10 7.5   1 14
15    7 7.5   1  6

Or rows meeting logical conditions. Let's look at the first 10 rows where the respondent has kids less than 6 years old:

head(mroz[mroz$kl6>0,],10)
   lfp whrs kl6 k618 wa we     ww rpwg hhrs ha he      hw faminc    mtr wmed
1    1 1610   1    0 32 12 3.3540 2.65 2708 34 12  4.0288  16310 0.7215   12
3    1 1980   1    3 35 12 4.5455 4.04 3072 40 12  3.5807  21040 0.6915   12
5    1 1568   1    2 31 14 4.5918 3.60 2000 32 12 10.0000  27300 0.6215   12
13   1  240   1    2 30 12 2.0833 0.00 2830 33 16  5.3004  15500 0.7215   16
25   1 1955   1    1 31 12 2.1545 2.30 2024 31 12  4.0884  12487 0.7515   12
29   1 1516   1    0 31 17 7.2559 6.00 2390 30 17  6.2762  26100 0.6215   12
41   1  112   1    2 30 12 2.6786 0.00 4030 33 16  3.8462  15810 0.7215   12
43   1  583   1    2 31 16 2.5729 9.98 1530 34 16 13.7250  24000 0.6615   14
74   1  608   2    4 34 10 8.2237 3.00 1304 38  9  3.3742  15200 0.7915    0
79   1   90   2    2 32 15 1.0000 0.00 2350 31 14  4.8787  13755 0.7515   10
   wfed  un cit ax
1     7 5.0   0 14
3     7 5.0   0 15
5    14 9.5   1  7
13   16 5.0   0  0
25    7 5.0   1  4
29   12 5.0   0  7
41   12 3.0   0  1
43   16 9.5   1  6
74    0 7.5   1 11
79   12 7.5   1  9

Note, this is achieved using logical addressing, where only rows having the logical value TRUE are included. So for the first five rows of mroz, only rows 1, 3, and 5 have at least one young child and would be displayed above:

head(mroz$kl6>0,5)
[1]  TRUE FALSE  TRUE FALSE  TRUE
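Logical conditions can also be combined. The sketch below uses a made-up five-row data frame that borrows three column names from mroz (the values are invented for illustration) to show & for "and", along with which and sum applied to the resulting logical vector.

```r
# Hypothetical mini data set borrowing column names from mroz
toy <- data.frame(lfp = c(1, 0, 1, 1, 0),
                  kl6 = c(1, 0, 2, 0, 1),
                  wa  = c(32, 45, 34, 51, 30))

# & means "and" (| would mean "or"); each comparison yields a logical vector
young_and_working <- toy$kl6 > 0 & toy$lfp == 1
print(young_and_working)

# Keep the matching rows and only some of the columns
print(toy[young_and_working, c("kl6", "wa")])

# which() converts the TRUE values into row numbers
print(which(young_and_working))

# sum() of a logical vector counts the TRUE values
print(sum(young_and_working))
```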

Creating and Modifying Variables

Creating Variables

In Stata, you need to start a new variable with generate. In R, just assign the variable:

mroz$newvar = mroz$lfp * mroz$ax
print(colnames(mroz))
print(summary(mroz))
 [1] "lfp"    "whrs"   "kl6"    "k618"   "wa"     "we"     "ww"     "rpwg"  
 [9] "hhrs"   "ha"     "he"     "hw"     "faminc" "mtr"    "wmed"   "wfed"  
[17] "un"     "cit"    "ax"     "newvar"
      lfp              whrs             kl6              k618      
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000  
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
       wa              we              ww              rpwg           hhrs     
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175  
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928  
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164  
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267  
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553  
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010  
       ha              he              hw              faminc     
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500  
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428  
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880  
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081  
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200  
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000  
      mtr              wmed             wfed              un        
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000  
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500  
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500  
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624  
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000  
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000  
      cit               ax            newvar     
 Min.   :0.0000   Min.   : 0.00   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 4.00   1st Qu.: 0.00  
 Median :1.0000   Median : 9.00   Median : 4.00  
 Mean   :0.6428   Mean   :10.63   Mean   : 7.41  
 3rd Qu.:1.0000   3rd Qu.:15.00   3rd Qu.:13.00  
 Max.   :1.0000   Max.   :45.00   Max.   :38.00

Note a new column called newvar is now part of the data.

R aficionados would probably criticize the above code, since strictly speaking the assignment

x = y

is sometimes different than the R recommended way of making an assignment:

x <- y

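which is a holdover from the keyboards in use when R's predecessor, S, was written. At the top level of a script the two behave identically, and I have never encountered a case where x=y doesn't work, but inside a function call = names an argument instead of creating a variable.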
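A minimal sketch of one situation where the two forms do differ, using median only because its first argument happens to be called x. The check assumes no variable named x was created earlier in the session.

```r
# Inside a function call, = names an argument; no variable is created
m1 <- median(x = c(1, 5, 9))
x_exists <- exists("x", inherits = FALSE)
print(x_exists)    # FALSE, assuming x was not defined earlier

# Inside a function call, <- assigns first and then passes the value along
m2 <- median(y <- c(1, 5, 9))
print(y)           # y now exists in the workspace

print(c(m1, m2))   # both calls compute the same median
```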

Modifying Variables

Unlike Stata, we simply redefine the variable and don't need to bother with replace:

mroz$newvar = mroz$newvar/10
print(summary(mroz$newvar))
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.000   0.000   0.400   0.741   1.300   3.800
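For readers coming from Stata, a rough analogue of replace combined with an if condition is to index the rows being changed. The sketch below uses a made-up three-row data frame; toy and long_exper are illustrative names, not part of mroz.

```r
# Hypothetical three-row data frame for illustration
toy <- data.frame(lfp = c(1, 0, 1), ax = c(14, 5, 15))
toy$newvar <- toy$lfp * toy$ax

# Change newvar only in the rows where lfp equals 0
toy$newvar[toy$lfp == 0] <- -1

# ifelse() builds a whole column from a condition in one step
toy$long_exper <- ifelse(toy$ax > 10, 1, 0)

print(toy)
```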

Getting help in R

If you know the function you need help with, just use the help function:

help(tail)

Linear Algebra in R

Here I briefly introduce matrix algebra manipulations. Other programs are arguably better for pure linear algebra work (e.g. Matlab or Julia), but R is a very good environment for mixing linear algebra with modeling, including running pre-packaged statistical commands. In my opinion Stata's linear algebra code is a lot more intuitive than R's.

Creating vectors and matrices

Suppose we want to define a row vector \(\mathbf{a}\) as

\begin{equation} \mathbf{a} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \end{equation}

we can enter this in R as

a <- cbind(1,2,3)
print(a)
     [,1] [,2] [,3]
[1,]    1    2    3

If instead, we wanted a column vector

\begin{equation} \mathbf{a} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \end{equation}

we use this

a <- rbind(1,2,3)
print(a)
     [,1]
[1,]    1
[2,]    2
[3,]    3
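It may help to see how these differ from a plain vector built with c(), which has no row or column dimension at all. A short sketch (v, col_v, and row_v are illustrative names):

```r
v <- c(1, 2, 3)                # plain vector: no dimensions
print(dim(v))                  # NULL

col_v <- matrix(v, ncol = 1)   # explicit 3 x 1 column vector
print(dim(col_v))

row_v <- t(col_v)              # transposing gives a 1 x 3 row vector
print(dim(row_v))

# t() applied to a plain vector treats it as a column and returns a row
print(dim(t(v)))
```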

Referencing elements in arrays

We can grab individual elements of this vector using slicing:

print(a[1:2])
print(a[3])
[1] 1 2
[1] 3

Note, since we are working with a column, we don't need to refer to the row dimension, although we could:

print(a[1:2,1])
print(a[3,1])

Creating this matrix $$ B = \begin{bmatrix} 2&4&3\\1&5&7 \end{bmatrix} $$ is done this way:

B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
print(B)
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7

Slicing is also possible:

B[,2:3]
     [,1] [,2]
[1,]    4    3
[2,]    5    7
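One thing to watch for when slicing out a single row or column: R drops the matrix dimensions unless told otherwise. The sketch below rebuilds the same values under the illustrative name M so that B is left untouched.

```r
M <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)  # same values as B

col2 <- M[, 2]                     # a single column comes back as a plain vector
print(dim(col2))                   # NULL

col2_mat <- M[, 2, drop = FALSE]   # drop = FALSE keeps it as a 2 x 1 matrix
print(dim(col2_mat))
```

Keeping the dimensions matters later, when the slice is used in matrix multiplication.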

It is also possible to create a matrix of zeros and fill it in using slicing:

C <- matrix(0, 5, 5)
C[3:4, 5] = -999
print(C)
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0 -999
[4,]    0    0    0    0 -999
[5,]    0    0    0    0    0

The identity matrix

These are created using diag, having the arguments:

  • Value to place on the diagonal
  • Number of rows
  • Number of columns (if omitted, columns=rows)

I = diag(1,5)
print(I)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1

Or, if you supply only one argument, it is used as both the row and column dimension:

I = diag(5)
print(I)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1
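The diag function behaves differently depending on what it is given, which can be confusing. A brief sketch of two other uses (D and M are illustrative names):

```r
# A vector of length two or more: a diagonal matrix with those entries
D <- diag(c(2, 5, 8))
print(D)

# A matrix: diag() extracts the diagonal instead of building one
M <- matrix(1:9, nrow = 3)   # filled column by column
print(diag(M))
```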

Creating a column vector of ones

cbind(rep(1, 5))
     [,1]
[1,]    1
[2,]    1
[3,]    1
[4,]    1
[5,]    1

Getting information about your matrices and vectors

The most important information we need is the dimensions of our matrices. The function dim tells us rows and columns (for 2 dimensional objects):

dim(B)
[1] 2 3

We can extract row and column dimensions like this:

dim(B)[1]
dim(B)[2]
[1] 2
[1] 3
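The functions nrow and ncol return the same information directly. A small sketch, using the illustrative names M and v:

```r
M <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)  # same values as B

print(nrow(M))     # number of rows
print(ncol(M))     # number of columns
print(length(M))   # total number of elements

# For a plain vector nrow() returns NULL; NROW() treats it as a column
v <- c(1, 2, 3)
print(is.null(nrow(v)))
print(NROW(v))
```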

Linear Algebra Operations

Scalar Addition

a = 1
b = 1
print(a + b)
[1] 2

Matrix Addition

a = rbind(2, 5, 8)
b = rbind(6, 4, 3)
print(b + a)
     [,1]
[1,]    8
[2,]    9
[3,]   11

Note, conformability matters:

a = rbind(2, 5)
b = rbind(6, 4, 3)
print(b + a)
Error in b + a : non-conformable arrays

Matrix Multiplication

Matrix multiplication uses the %*% operator:

dim(B)
dim(b)
print(B%*%b)
[1] 2 3
[1] 3 1
     [,1]
[1,]   37
[2,]   47

Again, conformability (and order) matters:

print(b%*%B)
Error in b %*% B : non-conformable arguments
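The rule behind this error is that the number of columns of the left matrix must equal the number of rows of the right matrix. A sketch of that check, where can_multiply is a hypothetical helper written for this example and M and v copy the values of B and b:

```r
M <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)  # 2 x 3, same values as B
v <- rbind(6, 4, 3)                                        # 3 x 1, same values as b

# Hypothetical helper: X %*% Y needs ncol(X) to equal nrow(Y)
can_multiply <- function(X, Y) ncol(X) == nrow(Y)

print(can_multiply(M, v))   # 3 columns meet 3 rows
print(can_multiply(v, M))   # 1 column meets 2 rows
```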

Matrix Transpose

The transpose operator is t():

print(B)
t(B)
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
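Transposes also offer a way to reverse the order of a product, since the transpose of a product equals the product of the transposes in reverse order. A sketch with the same values as B and b, stored under the illustrative names M and v:

```r
M <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)  # 2 x 3
v <- rbind(6, 4, 3)                                        # 3 x 1

left  <- t(v) %*% t(M)   # (1 x 3) times (3 x 2) gives 1 x 2
right <- t(M %*% v)      # transpose of the 2 x 1 product

print(left)
print(all.equal(left, right))
```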

Matrix Inversion

A <- matrix(c(2, 4, 3, 4, 9, 1,  1, 5, 7), nrow = 3, byrow = TRUE)
A_inv = solve(A)
A_inv
           [,1]       [,2]        [,3]
[1,]  1.4146341 -0.3170732 -0.56097561
[2,] -0.6585366  0.2682927  0.24390244
[3,]  0.2682927 -0.1463415  0.04878049

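The matrix A_inv satisfies the defining property of an inverse: A_inv%*%A is the identity matrix, up to floating-point rounding error in the off-diagonal elements: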

A_inv%*%A
              [,1]          [,2]          [,3]
[1,]  1.000000e+00 -2.664535e-15 -1.332268e-15
[2,]  2.498002e-16  1.000000e+00  4.440892e-16
[3,] -4.857226e-17 -2.775558e-17  1.000000e+00
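Since the off-diagonal entries are tiny but not exactly zero, a comparison that tolerates rounding error is safer than an exact one. The sketch below uses all.equal for that, and also shows solve with two arguments, which solves a linear system without forming the inverse. The right-hand side y is made up for illustration.

```r
A <- matrix(c(2, 4, 3, 4, 9, 1, 1, 5, 7), nrow = 3, byrow = TRUE)
A_inv <- solve(A)

# Compare with the identity matrix, allowing for rounding error
print(all.equal(A_inv %*% A, diag(3)))

# Rounding makes the printed result easier to read
print(round(A_inv %*% A, 10))

# Solve A x = y two ways: via the inverse, and directly with solve(A, y)
y  <- rbind(1, 2, 3)   # hypothetical right-hand side
x1 <- A_inv %*% y
x2 <- solve(A, y)
print(all.equal(x1, x2))
```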

Scalar Operations

Scalar Addition and Subtraction

print(a)
print(a - 5)
     [,1]
[1,]    2
[2,]    5
     [,1]
[1,]   -3
[2,]    0

Scalar Multiplication and Division

print(a/5)
print(a*5)
     [,1]
[1,]  0.4
[2,]  1.0
     [,1]
[1,]   10
[2,]   25

Elementwise Operations

Sometimes we want to combine arrays by performing arithmetic on the corresponding elements. For example, supposing that

\begin{equation} \mathbf{a} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \nonumber \end{equation}

and

\begin{equation} \mathbf{b} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \nonumber \end{equation}

we might want to calculate

\begin{equation} a\_divide\_b=\begin{bmatrix} 1/3 \\ 2/4 \end{bmatrix} \end{equation}

By default, R's basic arithmetic operators work element by element, so long as the arrays have the same dimensions. Note in the first example below, \(\mathbf{b}\) and \(\mathbf{a}\) are not conformable.

print(dim(a))
print(dim(b))
print(a/b)
[1] 2 1
[1] 3 1
Error in a/b : non-conformable arrays

However, if we redefine \(\mathbf{b}\) to have two elements, matching \(\mathbf{a}\), element-wise division works:

b=c(6,4)
print(a/b)
          [,1]
[1,] 0.3333333
[2,] 1.2500000
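To tie this back to the equations above, here is a sketch using the values 1, 2 and 3, 4 from the math, stored as u and w (illustrative names chosen so that a and b are not overwritten). It also contrasts element-wise * with the matrix product %*%, and shows how a shorter plain vector is recycled against a matrix.

```r
u <- rbind(1, 2)    # the vector a from the equations above
w <- rbind(3, 4)    # the vector b from the equations above

print(u / w)        # element-wise division: 1/3 and 2/4
print(u * w)        # element-wise multiplication: 3 and 8
print(t(u) %*% w)   # matrix product (inner product): 1*3 + 2*4 = 11

# A plain vector shorter than the matrix is recycled down the columns
Z <- matrix(c(2, 4, 6, 8), nrow = 2)   # columns are (2, 4) and (6, 8)
print(Z / c(2, 4))                     # first row divided by 2, second row by 4
```

Recycling happens silently, so it is worth checking dimensions with dim before relying on element-wise arithmetic.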