ECON 407: R Primer

You will be using R-Studio for your work in this class.

Loading data into R

Loading stata datasets

The foreign library allows us to open a bunch of different types of datafiles including excel, stata, sas, and comma delimited data to name a few. Documentation is found here. Below, I show you how to open stata datasets.

In [1]:
library(foreign)

mroz <- read.dta("http://rlhick.people.wm.edu/econ407/data/mroz.dta")

summary(mroz)
Out[1]:
      lfp              whrs             kl6              k618      
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000  
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
       wa              we              ww              rpwg           hhrs     
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175  
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928  
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164  
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267  
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553  
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010  
       ha              he              hw              faminc     
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500  
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428  
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880  
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081  
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200  
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000  
      mtr              wmed             wfed              un        
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000  
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500  
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500  
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624  
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000  
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000  
      cit               ax       
 Min.   :0.0000   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 4.00  
 Median :1.0000   Median : 9.00  
 Mean   :0.6428   Mean   :10.63  
 3rd Qu.:1.0000   3rd Qu.:15.00  
 Max.   :1.0000   Max.   :45.00  

Loading files from disk is a slight variation the above command. Supposing that your stata data file mroz.dta was in the folder /some/place, in Linux or MacOS we would use the R command

mroz <- read.dta("/some/place/mroz.dta")

Note for windows users: reverse the slashes and reference the drive letter containing your data, so you would be running

mroz <- read.dta("c:\some\place\mroz.dta")

In all future documentation in this course, we assume the unix file name conventions, as in the first example.

Opening R datasets

If your dataset is already in an R format, simply use the load command:

mroz = load("/some/place/mroz.RData")

Web-based data is also accessible using load:

mroz = load("http://www.someplace.com/some/place/mroz.RData")

Viewing Data in R

If you are using R Studio (recommended) listing data is easy and I'll show you how to do that in class. Viewing R data at the command line is achieved by the head command. Here we'll view the first 5 rows of data:

In [2]:
head(mroz,5)
Out[2]:
lfp whrs kl6 k618 wa we ww rpwg hhrs ha he hw faminc mtr wmed wfed un cit ax
1 1 1610 1 0 32 12 3.354 2.65 2708 34 12 4.0288 16310 0.7215 12 7 5 0 14
2 1 1656 0 2 30 12 1.3889 2.65 2310 30 9 8.4416 21800 0.6615 7 7 11 1 5
3 1 1980 1 3 35 12 4.5455 4.04 3072 40 12 3.5807 21040 0.6915 12 7 5 0 15
4 1 456 0 3 34 12 1.0965 3.25 1920 53 10 3.5417 7300 0.7815 7 7 5 0 6
5 1 1568 1 2 31 14 4.5918 3.6 2000 32 12 10 27300 0.6215 12 14 9.5 1 7

Or, the last 10 rows of data:

In [3]:
tail(mroz,5)
Out[3]:
lfp whrs kl6 k618 wa we ww rpwg hhrs ha he hw faminc mtr wmed wfed un cit ax
749 0 0 0 2 40 13 0 0 3020 43 16 9.2715 28200 0.6215 10 10 9.5 1 5
750 0 0 2 3 31 12 0 0 2056 33 12 4.8638 10000 0.7715 12 12 7.5 0 14
751 0 0 0 0 43 12 0 0 2383 43 12 1.0898 9952 0.7515 10 3 7.5 0 4
752 0 0 0 0 60 12 0 0 1705 55 8 12.44 24984 0.6215 12 12 14 1 15
753 0 0 0 3 39 9 0 0 3120 48 12 6.0897 28363 0.6915 7 7 11 1 12

Or specific rows, using what is called "slice" indexing:

In [4]:
mroz[10:15,]
Out[4]:
lfp whrs kl6 k618 wa we ww rpwg hhrs ha he hw faminc mtr wmed wfed un cit ax
10 1 1600 0 2 39 12 4.6875 4.15 2100 43 12 5.7143 20425 0.6915 7 7 5 0 21
11 1 1969 0 1 33 12 4.063 4.3 2450 34 12 9.7959 32300 0.5815 12 3 5 0 15
12 1 1960 0 1 42 11 4.5918 4.58 2375 47 14 8 28700 0.6215 14 7 5 0 14
13 1 240 1 2 30 12 2.0833 0 2830 33 16 5.3004 15500 0.7215 16 16 5 0 0
14 1 997 0 2 43 12 2.2668 3.5 3317 46 12 4.3413 16860 0.7215 10 10 7.5 1 14
15 1 1848 0 1 43 10 3.6797 3.38 2024 45 17 10.87 31431 0.5815 7 7 7.5 1 6

Or, rows meeting logical conditions. Let's look at the first 10 rows where the respondent has kids less than 6 years old:

In [5]:
head(mroz[mroz$kl6>0,],10)
Out[5]:
lfp whrs kl6 k618 wa we ww rpwg hhrs ha he hw faminc mtr wmed wfed un cit ax
1 1 1610 1 0 32 12 3.354 2.65 2708 34 12 4.0288 16310 0.7215 12 7 5 0 14
3 1 1980 1 3 35 12 4.5455 4.04 3072 40 12 3.5807 21040 0.6915 12 7 5 0 15
5 1 1568 1 2 31 14 4.5918 3.6 2000 32 12 10 27300 0.6215 12 14 9.5 1 7
13 1 240 1 2 30 12 2.0833 0 2830 33 16 5.3004 15500 0.7215 16 16 5 0 0
25 1 1955 1 1 31 12 2.1545 2.3 2024 31 12 4.0884 12487 0.7515 12 7 5 1 4
29 1 1516 1 0 31 17 7.2559 6 2390 30 17 6.2762 26100 0.6215 12 12 5 0 7
41 1 112 1 2 30 12 2.6786 0 4030 33 16 3.8462 15810 0.7215 12 12 3 0 1
43 1 583 1 2 31 16 2.5729 9.98 1530 34 16 13.725 24000 0.6615 14 16 9.5 1 6
74 1 608 2 4 34 10 8.2237 3 1304 38 9 3.3742 15200 0.7915 0 0 7.5 1 11
79 1 90 2 2 32 15 1 0 2350 31 14 4.8787 13755 0.7515 10 12 7.5 1 9

Note, this is achieved using logical addressing, where only rows having the logical value TRUE is included. So for the first five rows of mroz, only rows 1, 3, and 5 have more than one young child:

In [6]:
head(mroz$kl6>0,5)
Out[6]:
  1. TRUE
  2. FALSE
  3. TRUE
  4. FALSE
  5. TRUE

Creating and Modifying Variables

Creating Variables

In stata, you need to start a new variable with create. In R, just assign the variable.

In [7]:
print(colnames(mroz))
mroz$newvar = mroz$lfp * mroz$ax
print(colnames(mroz))
print(summary(mroz))
 [1] "lfp"    "whrs"   "kl6"    "k618"   "wa"     "we"     "ww"     "rpwg"  
 [9] "hhrs"   "ha"     "he"     "hw"     "faminc" "mtr"    "wmed"   "wfed"  
[17] "un"     "cit"    "ax"    
 [1] "lfp"    "whrs"   "kl6"    "k618"   "wa"     "we"     "ww"     "rpwg"  
 [9] "hhrs"   "ha"     "he"     "hw"     "faminc" "mtr"    "wmed"   "wfed"  
[17] "un"     "cit"    "ax"     "newvar"
      lfp              whrs             kl6              k618      
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000  
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
       wa              we              ww              rpwg           hhrs     
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175  
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928  
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164  
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267  
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553  
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010  
       ha              he              hw              faminc     
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500  
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428  
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880  
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081  
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200  
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000  
      mtr              wmed             wfed              un        
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000  
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500  
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500  
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624  
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000  
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000  
      cit               ax            newvar     
 Min.   :0.0000   Min.   : 0.00   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 4.00   1st Qu.: 0.00  
 Median :1.0000   Median : 9.00   Median : 4.00  
 Mean   :0.6428   Mean   :10.63   Mean   : 7.41  
 3rd Qu.:1.0000   3rd Qu.:15.00   3rd Qu.:13.00  
 Max.   :1.0000   Max.   :45.00   Max.   :38.00  

R aficionados would probably criticize the above code, since strictly speaking the assignment

x = y

is sometimes different than the R recommended way of making an assignment:

x <- y

which is an artifact from the use of ancient keyboards when R was written.

I have never encountered a case where x=y doesn't work, but apparently it can happen.

Modifying Variables

Unlike stata we simply redefine the variable and don't need to bother with replace:

In [8]:
mroz$newvar = mroz$newvar/10
print(summary(mroz$newvar))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.400   0.741   1.300   3.800 

Getting help in R

If you know the function you need help with, just use the help function:

In [10]:
help(tail)
Out[10]:
head {utils} R Documentation

Return the First or Last Part of an Object

Description

Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.

Usage

head(x, ...)
## Default S3 method:
head(x, n = 6L, ...)
## S3 method for class 'data.frame'
head(x, n = 6L, ...)
## S3 method for class 'matrix'
head(x, n = 6L, ...)
## S3 method for class 'ftable'
head(x, n = 6L, ...)
## S3 method for class 'table'
head(x, n = 6L, ...)
## S3 method for class 'function'
head(x, n = 6L, ...)

tail(x, ...)
## Default S3 method:
tail(x, n = 6L, ...)
## S3 method for class 'data.frame'
tail(x, n = 6L, ...)
## S3 method for class 'matrix'
tail(x, n = 6L, addrownums = TRUE, ...)
## S3 method for class 'ftable'
tail(x, n = 6L, addrownums = FALSE, ...)
## S3 method for class 'table'
tail(x, n = 6L, addrownums = TRUE, ...)
## S3 method for class 'function'
tail(x, n = 6L, ...)

Arguments

x

an object

n

a single integer. If positive, size for the resulting object: number of elements for a vector (including lists), rows for a matrix or data frame or lines for a function. If negative, all but the n last/first number of elements of x.

addrownums

if there are no row names, create them from the row numbers.

...

arguments to be passed to or from other methods.

Details

For matrices, 2-dim tables and data frames, head() (tail()) returns the first (last) n rows when n > 0 or all but the last (first) n rows when n < 0. head.matrix() and tail.matrix() are exported. For functions, the lines of the deparsed function are returned as character strings.

If a matrix has no row names, then tail() will add row names of the form "[n,]" to the result, so that it looks similar to the last lines of x when printed. Setting addrownums = FALSE suppresses this behaviour.

Value

An object (usually) like x but generally smaller. For ftable objects x, a transformed format(x).

Author(s)

Patrick Burns, improved and corrected by R-Core. Negative argument added by Vincent Goulet.

Examples

head(letters)
head(letters, n = -6L)

head(freeny.x, n = 10L)
head(freeny.y)

tail(letters)
tail(letters, n = -6L)

tail(freeny.x)
tail(freeny.y)

tail(library)

head(stats::ftable(Titanic))

[Package utils version 3.2.2 ]

Linear Algebra in R

Here I briefly introduce the use of matrix algebra manipulations. Other programs are arguably better for pure linear algebra work (e.g. Matlab or Julia), but R is a very good environment for mixing modeling including running pre-packaged statistical commands and linear algebra. In my opinion stata linear algebra code is more intuitive than R's.

Creating vectors and matrices

Suppose we want to define a row vector $a$ as $$ a = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} $$

In [11]:
a <- cbind(1,2,3)
print(a)
     [,1] [,2] [,3]
[1,]    1    2    3

If instead, we wanted a column vector $$ a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} $$

In [12]:
a <- rbind(1,2,3)
print(a)
     [,1]
[1,]    1
[2,]    2
[3,]    3

Referencing elements

We can grab individual elements of this vector using slicing:

In [13]:
print(a[1:2])
print(a[3])
[1] 1 2
[1] 3

Note, since we are working with a column, we don't need to refer to the row dimension, although we could:

In [ ]:
```{r}
print(a[1:2,1])
print(a[3,1])
```
In [ ]:
Creating this matrix
$$
B = \begin{bmatrix} 2&4&3\\1&5&7 \end{bmatrix}
$$
is done this way:
In [14]:
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
print(B)
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7

Slicing is also possible

In [15]:
B[2,2]
Out[15]:
5

or

In [16]:
B[,2:3]
Out[16]:
4 3
5 7

It is also to create an empty matrix (all zeros) and fill it in using slicing:

In [17]:
C <- matrix(0, 5, 5)
C[3,2] = -999
print(C)
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0 -999    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0

The identity matrix

These are created using diag, having the arguments:

  1. Value to place on the diagonal
  2. Number of rows
  3. Number of columns (if ommitted, columns=rows)
In [18]:
I = diag(1,5)
print(I)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1

or, if you only supply one argument, it is dimension of identity matrix:

In [19]:
I = diag(5)
print(I)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1

Creating a column vector of ones

In [20]:
cbind(rep(1, 5))
Out[20]:
1
1
1
1
1

Getting information about your matrices and vectors

The most important information we need is the dimensions of our matrices.

The function dim tells us rows and columns (for 2 dimensional objects):

In [21]:
dim(B)
Out[21]:
  1. 2
  2. 3

We can extract row and column dimensions like this:

In [22]:
cat("Rows:", dim(B)[1])
cat("Columns:",dim(B)[2])
Rows: 2Columns: 3

Basic Linear Algebra

Scalar Addition

In [23]:
a = 1
b = 1
print(a + b)
[1] 2

Matrix Addition

In [67]:
a = rbind(2, 5, 8)
b = rbind(6, 4, 3)
print(b + a)
     [,1]
[1,]    8
[2,]    9
[3,]   11

Note, conformability matters:

In [68]:
a = rbind(2, 5)
b = rbind(6, 4, 3)
print(b + a)
Error in b + a: non-conformable arrays

Matrix Multiplication

In [69]:
dim(B)
dim(b)
print(B%*%b)
Out[69]:
  1. 2
  2. 3
Out[69]:
  1. 3
  2. 1
     [,1]
[1,]   37
[2,]   47

Again, comformability (and order) matters:

In [70]:
print(b%*%B)
Error in b %*% B: non-conformable arguments

Matrix Transpose

In [71]:
print(B)
t(B)
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7
Out[71]:
2 1
4 5
3 7

Matrix Inversion

In [72]:
A <- matrix(c(2, 4, 3, 4, 9, 1,  1, 5, 7), nrow = 3, byrow = TRUE)
A_inv = solve(A)
A_inv
Out[72]:
1.4146341 -0.3170732 -0.5609756
-0.6585366 0.2682927 0.2439024
0.26829268 -0.14634146 0.04878049

and the matrix A_inv satisfies the properties of an inverse:

In [73]:
A_inv%*%A
Out[73]:
1.000000e+00 -2.664535e-15 -1.332268e-15
2.498002e-16 1.000000e+00 4.440892e-16
-4.857226e-17 -2.775558e-17 1.000000e+00

Scalar Operations

Scalar Addition and Subtraction
In [74]:
print(a)
print(a - 5)
     [,1]
[1,]    2
[2,]    5
     [,1]
[1,]   -3
[2,]    0
Scalar Multiplication and Addition
In [75]:
print(a/5)
print(a*5)
     [,1]
[1,]  0.4
[2,]  1.0
     [,1]
[1,]   10
[2,]   25
Elementwise Operations

Sometimes we want to combine arrays by performing arithmetic on the corresponding elements. For example, supposing that $$ \mathbf{a} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} $$ and $$ \mathbf{b} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} $$

we might want to calculate $$ a\_divide\_b = \begin{bmatrix} 1/3 \\ 2/4 \end{bmatrix} $$

R by default performs these operations using basic arithmetic operators so long as the arrays are of the same dimensions. Note in the first example below, $\mathbf{b}$ and $\mathbf{a}$ are not conformable.

In [76]:
print(dim(a))
print(dim(b))
print(a/b)

b=c(6,4)

print(a/b)
[1] 2 1
[1] 3 1
Error in a/b: non-conformable arrays
          [,1]
[1,] 0.3333333
[2,] 1.2500000

Dataframes and linear algebra

This shows how to perform a simple manual OLS estimate using the mroz data. Let's estimate the following simple model: $$ \label{eq:ols} kl6_i = \beta_0 + \beta_1 we_i + \beta_2 lfp_i +\epsilon_i $$

Create a vector of ones in our dataset and create the matrices $\mathbf{X}$ and $\mathbf{Y}$:

In [65]:
mroz$ones = 1
X = data.matrix(mroz[c("ones","we","lfp")], rownames.force = NA)
Y = data.matrix(mroz[c("kl6")], rownames.force = NA)

The OLS estimate for our $\beta$ vector is $\mathbf{(X'X)^{-1}X'Y}$

In [33]:
solve(t(X)%*%X)%*%t(X)%*%Y
Out[33]:
kl6
ones -0.05169664
we 0.03542029
lfp -0.2564976

Or, we can clean up this syntax a little bit by using the crossprod and tcrossprod functions, where $$ \begin{align} crossprod(\mathbf{x},\mathbf{y}) &= \mathbf{x' x} \\ tcrossprod(\mathbf{x},\mathbf{y}) &= \mathbf{xx'} \\ \end{align} $$

In [42]:
solve(crossprod(X))%*%crossprod(X,Y)
Out[42]:
kl6
ones -0.05169664
we 0.03542029
lfp -0.2564976

Note, this is exactly what we get with the canned OLS package in R:

In [43]:
model = lm(kl6~we+lfp,data=mroz)
summary(model)
Out[43]:
Call:
lm(formula = kl6 ~ we + lfp, data = mroz)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.55045 -0.30251 -0.11685 -0.04601  2.59123 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.051697   0.101222  -0.511     0.61    
we           0.035420   0.008243   4.297 1.96e-05 ***
lfp         -0.256498   0.037926  -6.763 2.72e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5063 on 750 degrees of freedom
Multiple R-squared:  0.06862,	Adjusted R-squared:  0.06613 
F-statistic: 27.63 on 2 and 750 DF,  p-value: 2.65e-12