# ECON 407: R Primer

You will be using R-Studio for your work in this class.

The foreign library allows us to open a bunch of different types of datafiles including excel, stata, sas, and comma delimited data to name a few. Documentation is found here. Below, I show you how to open stata datasets.

In [1]:
library(foreign)

summary(mroz)

Out[1]:
      lfp              whrs             kl6              k618
Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000
1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000
Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000
Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353
3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000
Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000
wa              we              ww              rpwg           hhrs
Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175
1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928
Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164
Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267
3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553
Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010
ha              he              hw              faminc
Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500
1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428
Median :46.00   Median :12.00   Median : 6.9758   Median :20880
Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081
3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200
Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000
mtr              wmed             wfed              un
Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000
1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500
Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500
Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624
3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000
Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000
cit               ax
Min.   :0.0000   Min.   : 0.00
1st Qu.:0.0000   1st Qu.: 4.00
Median :1.0000   Median : 9.00
Mean   :0.6428   Mean   :10.63
3rd Qu.:1.0000   3rd Qu.:15.00
Max.   :1.0000   Max.   :45.00  

Loading files from disk is a slight variation the above command. Supposing that your stata data file mroz.dta was in the folder /some/place, in Linux or MacOS we would use the R command

mroz <- read.dta("/some/place/mroz.dta")

Note for windows users: reverse the slashes and reference the drive letter containing your data, so you would be running

mroz <- read.dta("c:\some\place\mroz.dta")

In all future documentation in this course, we assume the unix file name conventions, as in the first example.

#### Opening R datasets¶

If your dataset is already in an R format, simply use the load command:

mroz = load("/some/place/mroz.RData")

Web-based data is also accessible using load:

mroz = load("http://www.someplace.com/some/place/mroz.RData")

### Viewing Data in R¶

If you are using R Studio (recommended) listing data is easy and I'll show you how to do that in class. Viewing R data at the command line is achieved by the head command. Here we'll view the first 5 rows of data:

In [2]:
head(mroz,5)

Out[2]:
lfp whrs kl6 k618 wa we ww rpwg hhrs ha he hw faminc mtr wmed wfed un cit ax
1 1 1610 1 0 32 12 3.354 2.65 2708 34 12 4.0288 16310 0.7215 12 7 5 0 14
2 1 1656 0 2 30 12 1.3889 2.65 2310 30 9 8.4416 21800 0.6615 7 7 11 1 5
3 1 1980 1 3 35 12 4.5455 4.04 3072 40 12 3.5807 21040 0.6915 12 7 5 0 15
4 1 456 0 3 34 12 1.0965 3.25 1920 53 10 3.5417 7300 0.7815 7 7 5 0 6
5 1 1568 1 2 31 14 4.5918 3.6 2000 32 12 10 27300 0.6215 12 14 9.5 1 7

Or, the last 10 rows of data:

In [3]:
tail(mroz,5)

Out[3]:
lfp whrs kl6 k618 wa we ww rpwg hhrs ha he hw faminc mtr wmed wfed un cit ax
749 0 0 0 2 40 13 0 0 3020 43 16 9.2715 28200 0.6215 10 10 9.5 1 5
750 0 0 2 3 31 12 0 0 2056 33 12 4.8638 10000 0.7715 12 12 7.5 0 14
751 0 0 0 0 43 12 0 0 2383 43 12 1.0898 9952 0.7515 10 3 7.5 0 4
752 0 0 0 0 60 12 0 0 1705 55 8 12.44 24984 0.6215 12 12 14 1 15
753 0 0 0 3 39 9 0 0 3120 48 12 6.0897 28363 0.6915 7 7 11 1 12

Or specific rows, using what is called "slice" indexing:

In [4]:
mroz[10:15,]

Out[4]:
lfp whrs kl6 k618 wa we ww rpwg hhrs ha he hw faminc mtr wmed wfed un cit ax
10 1 1600 0 2 39 12 4.6875 4.15 2100 43 12 5.7143 20425 0.6915 7 7 5 0 21
11 1 1969 0 1 33 12 4.063 4.3 2450 34 12 9.7959 32300 0.5815 12 3 5 0 15
12 1 1960 0 1 42 11 4.5918 4.58 2375 47 14 8 28700 0.6215 14 7 5 0 14
13 1 240 1 2 30 12 2.0833 0 2830 33 16 5.3004 15500 0.7215 16 16 5 0 0
14 1 997 0 2 43 12 2.2668 3.5 3317 46 12 4.3413 16860 0.7215 10 10 7.5 1 14
15 1 1848 0 1 43 10 3.6797 3.38 2024 45 17 10.87 31431 0.5815 7 7 7.5 1 6

Or, rows meeting logical conditions. Let's look at the first 10 rows where the respondent has kids less than 6 years old:

In [5]:
head(mroz[mroz$kl6>0,],10)  Out[5]: lfp whrs kl6 k618 wa we ww rpwg hhrs ha he hw faminc mtr wmed wfed un cit ax 1 1 1610 1 0 32 12 3.354 2.65 2708 34 12 4.0288 16310 0.7215 12 7 5 0 14 3 1 1980 1 3 35 12 4.5455 4.04 3072 40 12 3.5807 21040 0.6915 12 7 5 0 15 5 1 1568 1 2 31 14 4.5918 3.6 2000 32 12 10 27300 0.6215 12 14 9.5 1 7 13 1 240 1 2 30 12 2.0833 0 2830 33 16 5.3004 15500 0.7215 16 16 5 0 0 25 1 1955 1 1 31 12 2.1545 2.3 2024 31 12 4.0884 12487 0.7515 12 7 5 1 4 29 1 1516 1 0 31 17 7.2559 6 2390 30 17 6.2762 26100 0.6215 12 12 5 0 7 41 1 112 1 2 30 12 2.6786 0 4030 33 16 3.8462 15810 0.7215 12 12 3 0 1 43 1 583 1 2 31 16 2.5729 9.98 1530 34 16 13.725 24000 0.6615 14 16 9.5 1 6 74 1 608 2 4 34 10 8.2237 3 1304 38 9 3.3742 15200 0.7915 0 0 7.5 1 11 79 1 90 2 2 32 15 1 0 2350 31 14 4.8787 13755 0.7515 10 12 7.5 1 9 Note, this is achieved using logical addressing, where only rows having the logical value TRUE is included. So for the first five rows of mroz, only rows 1, 3, and 5 have more than one young child: In [6]: head(mroz$kl6>0,5)

Out[6]:
1. TRUE
2. FALSE
3. TRUE
4. FALSE
5. TRUE

### Creating and Modifying Variables¶

#### Creating Variables¶

In stata, you need to start a new variable with create. In R, just assign the variable.

In [7]:
print(colnames(mroz))
mroz$newvar = mroz$lfp * mroz$ax print(colnames(mroz)) print(summary(mroz))   [1] "lfp" "whrs" "kl6" "k618" "wa" "we" "ww" "rpwg" [9] "hhrs" "ha" "he" "hw" "faminc" "mtr" "wmed" "wfed" [17] "un" "cit" "ax" [1] "lfp" "whrs" "kl6" "k618" "wa" "we" "ww" "rpwg" [9] "hhrs" "ha" "he" "hw" "faminc" "mtr" "wmed" "wfed" [17] "un" "cit" "ax" "newvar" lfp whrs kl6 k618 Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0.000 1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.:0.0000 1st Qu.:0.000 Median :1.0000 Median : 288.0 Median :0.0000 Median :1.000 Mean :0.5684 Mean : 740.6 Mean :0.2377 Mean :1.353 3rd Qu.:1.0000 3rd Qu.:1516.0 3rd Qu.:0.0000 3rd Qu.:2.000 Max. :1.0000 Max. :4950.0 Max. :3.0000 Max. :8.000 wa we ww rpwg hhrs Min. :30.00 Min. : 5.00 Min. : 0.000 Min. :0.00 Min. : 175 1st Qu.:36.00 1st Qu.:12.00 1st Qu.: 0.000 1st Qu.:0.00 1st Qu.:1928 Median :43.00 Median :12.00 Median : 1.625 Median :0.00 Median :2164 Mean :42.54 Mean :12.29 Mean : 2.375 Mean :1.85 Mean :2267 3rd Qu.:49.00 3rd Qu.:13.00 3rd Qu.: 3.788 3rd Qu.:3.58 3rd Qu.:2553 Max. :60.00 Max. :17.00 Max. :25.000 Max. :9.98 Max. :5010 ha he hw faminc Min. :30.00 Min. : 3.00 Min. : 0.4121 Min. : 1500 1st Qu.:38.00 1st Qu.:11.00 1st Qu.: 4.7883 1st Qu.:15428 Median :46.00 Median :12.00 Median : 6.9758 Median :20880 Mean :45.12 Mean :12.49 Mean : 7.4822 Mean :23081 3rd Qu.:52.00 3rd Qu.:15.00 3rd Qu.: 9.1667 3rd Qu.:28200 Max. :60.00 Max. :17.00 Max. :40.5090 Max. :96000 mtr wmed wfed un Min. :0.4415 Min. : 0.000 Min. : 0.000 Min. : 3.000 1st Qu.:0.6215 1st Qu.: 7.000 1st Qu.: 7.000 1st Qu.: 7.500 Median :0.6915 Median :10.000 Median : 7.000 Median : 7.500 Mean :0.6789 Mean : 9.251 Mean : 8.809 Mean : 8.624 3rd Qu.:0.7215 3rd Qu.:12.000 3rd Qu.:12.000 3rd Qu.:11.000 Max. :0.9415 Max. :17.000 Max. :17.000 Max. :14.000 cit ax newvar Min. :0.0000 Min. : 0.00 Min. : 0.00 1st Qu.:0.0000 1st Qu.: 4.00 1st Qu.: 0.00 Median :1.0000 Median : 9.00 Median : 4.00 Mean :0.6428 Mean :10.63 Mean : 7.41 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:13.00 Max. :1.0000 Max. :45.00 Max. :38.00  R aficionados would probably criticize the above code, since strictly speaking the assignment x = y is sometimes different than the R recommended way of making an assignment: x <- y which is an artifact from the use of ancient keyboards when R was written. I have never encountered a case where x=y doesn't work, but apparently it can happen. #### Modifying Variables¶ Unlike stata we simply redefine the variable and don't need to bother with replace: In [8]: mroz$newvar = mroz$newvar/10 print(summary(mroz$newvar))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.000   0.000   0.400   0.741   1.300   3.800


## Getting help in R¶

If you know the function you need help with, just use the help function:

In [10]:
help(tail)

Out[10]:

### Return the First or Last Part of an Object

#### Description

Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.

#### Usage

head(x, ...)
## Default S3 method:
## S3 method for class 'data.frame'
## S3 method for class 'matrix'
## S3 method for class 'ftable'
## S3 method for class 'table'
## S3 method for class 'function'

tail(x, ...)
## Default S3 method:
tail(x, n = 6L, ...)
## S3 method for class 'data.frame'
tail(x, n = 6L, ...)
## S3 method for class 'matrix'
tail(x, n = 6L, addrownums = TRUE, ...)
## S3 method for class 'ftable'
tail(x, n = 6L, addrownums = FALSE, ...)
## S3 method for class 'table'
tail(x, n = 6L, addrownums = TRUE, ...)
## S3 method for class 'function'
tail(x, n = 6L, ...)


#### Arguments

 x an object n a single integer. If positive, size for the resulting object: number of elements for a vector (including lists), rows for a matrix or data frame or lines for a function. If negative, all but the n last/first number of elements of x. addrownums if there are no row names, create them from the row numbers. ... arguments to be passed to or from other methods.

#### Details

For matrices, 2-dim tables and data frames, head() (tail()) returns the first (last) n rows when n > 0 or all but the last (first) n rows when n < 0. head.matrix() and tail.matrix() are exported. For functions, the lines of the deparsed function are returned as character strings.

If a matrix has no row names, then tail() will add row names of the form "[n,]" to the result, so that it looks similar to the last lines of x when printed. Setting addrownums = FALSE suppresses this behaviour.

#### Value

An object (usually) like x but generally smaller. For ftable objects x, a transformed format(x).

#### Author(s)

Patrick Burns, improved and corrected by R-Core. Negative argument added by Vincent Goulet.

#### Examples

head(letters)

tail(letters)
tail(letters, n = -6L)

tail(freeny.x)
tail(freeny.y)

tail(library)



[Package utils version 3.2.2 ]

## Linear Algebra in R¶

Here I briefly introduce the use of matrix algebra manipulations. Other programs are arguably better for pure linear algebra work (e.g. Matlab or Julia), but R is a very good environment for mixing modeling including running pre-packaged statistical commands and linear algebra. In my opinion stata linear algebra code is more intuitive than R's.

### Creating vectors and matrices¶

Suppose we want to define a row vector $a$ as $a = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$

In [11]:
a <- cbind(1,2,3)
print(a)

     [,1] [,2] [,3]
[1,]    1    2    3


If instead, we wanted a column vector $a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$

In [12]:
a <- rbind(1,2,3)
print(a)

     [,1]
[1,]    1
[2,]    2
[3,]    3


### Referencing elements¶

We can grab individual elements of this vector using slicing:

In [13]:
print(a[1:2])
print(a[3])

[1] 1 2
[1] 3


Note, since we are working with a column, we don't need to refer to the row dimension, although we could:

In [ ]:
{r}
print(a[1:2,1])
print(a[3,1])


In [ ]:
Creating this matrix
$B = \begin{bmatrix} 2&4&3\\1&5&7 \end{bmatrix}$
is done this way:

In [14]:
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
print(B)

     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7


Slicing is also possible

In [15]:
B[2,2]

Out[15]:
5

or

In [16]:
B[,2:3]

Out[16]:
 4 3 5 7

It is also to create an empty matrix (all zeros) and fill it in using slicing:

In [17]:
C <- matrix(0, 5, 5)
C[3,2] = -999
print(C)

     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0 -999    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0


### The identity matrix¶

These are created using diag, having the arguments:

1. Value to place on the diagonal
2. Number of rows
3. Number of columns (if ommitted, columns=rows)
In [18]:
I = diag(1,5)
print(I)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1


or, if you only supply one argument, it is dimension of identity matrix:

In [19]:
I = diag(5)
print(I)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1


#### Creating a column vector of ones¶

In [20]:
cbind(rep(1, 5))

Out[20]:
 1 1 1 1 1

The most important information we need is the dimensions of our matrices.

The function dim tells us rows and columns (for 2 dimensional objects):

In [21]:
dim(B)

Out[21]:
1. 2
2. 3

We can extract row and column dimensions like this:

In [22]:
cat("Rows:", dim(B)[1])
cat("Columns:",dim(B)[2])

Rows: 2Columns: 3

### Basic Linear Algebra¶

In [23]:
a = 1
b = 1
print(a + b)

[1] 2


In [67]:
a = rbind(2, 5, 8)
b = rbind(6, 4, 3)
print(b + a)

     [,1]
[1,]    8
[2,]    9
[3,]   11


Note, conformability matters:

In [68]:
a = rbind(2, 5)
b = rbind(6, 4, 3)
print(b + a)

Error in b + a: non-conformable arrays


#### Matrix Multiplication¶

In [69]:
dim(B)
dim(b)
print(B%*%b)

Out[69]:
1. 2
2. 3
Out[69]:
1. 3
2. 1
     [,1]
[1,]   37
[2,]   47


Again, comformability (and order) matters:

In [70]:
print(b%*%B)

Error in b %*% B: non-conformable arguments


#### Matrix Transpose¶

In [71]:
print(B)
t(B)

     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7

Out[71]:
 2 1 4 5 3 7

#### Matrix Inversion¶

In [72]:
A <- matrix(c(2, 4, 3, 4, 9, 1,  1, 5, 7), nrow = 3, byrow = TRUE)
A_inv = solve(A)
A_inv

Out[72]:
 1.41463 -0.317073 -0.560976 -0.658537 0.268293 0.243902 0.268293 -0.146341 0.0487805

and the matrix A_inv satisfies the properties of an inverse:

In [73]:
A_inv%*%A

Out[73]:
 1 -2.66453e-15 -1.33227e-15 2.498e-16 1 4.44089e-16 -4.85723e-17 -2.77556e-17 1

#### Scalar Operations¶

In [74]:
print(a)
print(a - 5)

     [,1]
[1,]    2
[2,]    5
[,1]
[1,]   -3
[2,]    0

In [75]:
print(a/5)
print(a*5)

     [,1]
[1,]  0.4
[2,]  1.0
[,1]
[1,]   10
[2,]   25

##### Elementwise Operations¶

Sometimes we want to combine arrays by performing arithmetic on the corresponding elements. For example, supposing that $\mathbf{a} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ and $\mathbf{b} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$

we might want to calculate $a\_divide\_b = \begin{bmatrix} 1/3 \\ 2/4 \end{bmatrix}$

R by default performs these operations using basic arithmetic operators so long as the arrays are of the same dimensions. Note in the first example below, $\mathbf{b}$ and $\mathbf{a}$ are not conformable.

In [76]:
print(dim(a))
print(dim(b))
print(a/b)

b=c(6,4)

print(a/b)

[1] 2 1
[1] 3 1

Error in a/b: non-conformable arrays

          [,1]
[1,] 0.3333333
[2,] 1.2500000


## Dataframes and linear algebra¶

This shows how to perform a simple manual OLS estimate using the mroz data. Let's estimate the following simple model: $\label{eq:ols} kl6_i = \beta_0 + \beta_1 we_i + \beta_2 lfp_i +\epsilon_i$

Create a vector of ones in our dataset and create the matrices $\mathbf{X}$ and $\mathbf{Y}$:

In [65]:
mroz$ones = 1 X = data.matrix(mroz[c("ones","we","lfp")], rownames.force = NA) Y = data.matrix(mroz[c("kl6")], rownames.force = NA)  The OLS estimate for our$\beta$vector is$\mathbf{(X'X)^{-1}X'Y}\$

In [33]:
solve(t(X)%*%X)%*%t(X)%*%Y

Out[33]:
kl6
ones -0.05169664
we 0.03542029
lfp -0.2564976

Or, we can clean up this syntax a little bit by using the crossprod and tcrossprod functions, where \begin{align} crossprod(\mathbf{x},\mathbf{y}) &= \mathbf{x' x} \\ tcrossprod(\mathbf{x},\mathbf{y}) &= \mathbf{xx'} \\ \end{align}

In [42]:
solve(crossprod(X))%*%crossprod(X,Y)

Out[42]:
kl6
ones -0.05169664
we 0.03542029
lfp -0.2564976

Note, this is exactly what we get with the canned OLS package in R:

In [43]:
model = lm(kl6~we+lfp,data=mroz)
summary(model)

Out[43]:
Call:
lm(formula = kl6 ~ we + lfp, data = mroz)

Residuals:
Min       1Q   Median       3Q      Max
-0.55045 -0.30251 -0.11685 -0.04601  2.59123

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.051697   0.101222  -0.511     0.61
we           0.035420   0.008243   4.297 1.96e-05 ***
lfp         -0.256498   0.037926  -6.763 2.72e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5063 on 750 degrees of freedom
Multiple R-squared:  0.06862,	Adjusted R-squared:  0.06613
F-statistic: 27.63 on 2 and 750 DF,  p-value: 2.65e-12