# ECON 407: R Primer

It is recommended that you use RStudio for your work if you decide to use R in this class.

The foreign library allows us to open many types of data files written by other statistical packages, including Stata, SAS, and SPSS, to name a few (comma-delimited data is handled by base R's read.csv). Good documentation is found here. Below, I show you how to open Stata datasets.

library(foreign)
summary(mroz)

     lfp              whrs             kl6              k618
Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000
1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000
Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000
Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353
3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000
Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000
      wa              we              ww              rpwg           hhrs
Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175
1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928
Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164
Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267
3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553
Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010
      ha              he              hw              faminc
Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500
1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428
Median :46.00   Median :12.00   Median : 6.9758   Median :20880
Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081
3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200
Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000
     mtr              wmed             wfed              un
Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000
1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500
Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500
Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624
3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000
Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000
     cit               ax
Min.   :0.0000   Min.   : 0.00
1st Qu.:0.0000   1st Qu.: 4.00
Median :1.0000   Median : 9.00
Mean   :0.6428   Mean   :10.63
3rd Qu.:1.0000   3rd Qu.:15.00
Max.   :1.0000   Max.   :45.00


Loading files from disk is a slight variation on the above command. Suppose your Stata data file mroz.dta is in the folder /some/place. In Linux or macOS we would use the R command

mroz <- read.dta("/some/place/mroz.dta")
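As a rough check of how read.dta works with a file on disk, the sketch below writes a small made-up data frame to a temporary Stata file and reads it back. The data frame toy, its values, and the temporary path are assumptions for illustration only and are not part of the course data.

```r
library(foreign)

# Hypothetical toy data standing in for a real Stata file
toy <- data.frame(lfp = c(1, 0, 1), ax = c(14, 5, 15))

# file.path builds a path with separators that work on any operating system
path <- file.path(tempdir(), "toy.dta")
write.dta(toy, path)

# Reading from disk: pass the path to read.dta
toy_from_disk <- read.dta(path)
print(toy_from_disk)
```

On Windows the same call should work if the path is written with forward slashes or doubled backslashes.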


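Opening R datasets works a little differently. The load command restores the saved objects under the names they had when they were saved and returns only those names, so we call it without assigning the result (this assumes the file was saved with an object named mroz):

load("/some/place/mroz.RData")


Web-based data is also accessible using load, by wrapping the address in url:

load(url("http://www.someplace.com/some/place/mroz.RData"))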
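To see what load actually returns, here is a minimal round trip using save and load. The object toy and the temporary file are hypothetical stand-ins for a real .RData file.

```r
# Hypothetical toy data saved to a temporary .RData file
toy <- data.frame(lfp = c(1, 0, 1), ax = c(14, 5, 15))
path <- file.path(tempdir(), "toy.RData")
save(toy, file = path)

# Remove the object, then restore it from disk
rm(toy)
loaded_names <- load(path)

print(loaded_names)  # the names of the restored objects, not the data
print(toy)           # the data frame is back under its original name
```

The return value of load is a character vector of object names, which is why assigning it to a variable does not give you the dataset.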


## Viewing Data in R

If you are using RStudio (recommended), listing data is easy and I can show you how to do that. At the command line, the head command displays the first rows of a dataset. Here we'll view the first 5 rows of data:

head(mroz,5)

  lfp whrs kl6 k618 wa we     ww rpwg hhrs ha he      hw faminc    mtr wmed
1   1 1610   1    0 32 12 3.3540 2.65 2708 34 12  4.0288  16310 0.7215   12
2   1 1656   0    2 30 12 1.3889 2.65 2310 30  9  8.4416  21800 0.6615    7
3   1 1980   1    3 35 12 4.5455 4.04 3072 40 12  3.5807  21040 0.6915   12
4   1  456   0    3 34 12 1.0965 3.25 1920 53 10  3.5417   7300 0.7815    7
5   1 1568   1    2 31 14 4.5918 3.60 2000 32 12 10.0000  27300 0.6215   12
  wfed   un cit ax
1    7  5.0   0 14
2    7 11.0   1  5
3    7  5.0   0 15
4    7  5.0   0  6
5   14  9.5   1  7


Or, the last 5 rows of data:

tail(mroz,5)

    lfp whrs kl6 k618 wa we ww rpwg hhrs ha he      hw faminc    mtr wmed wfed
749   0    0   0    2 40 13  0    0 3020 43 16  9.2715  28200 0.6215   10   10
750   0    0   2    3 31 12  0    0 2056 33 12  4.8638  10000 0.7715   12   12
751   0    0   0    0 43 12  0    0 2383 43 12  1.0898   9952 0.7515   10    3
752   0    0   0    0 60 12  0    0 1705 55  8 12.4400  24984 0.6215   12   12
753   0    0   0    3 39  9  0    0 3120 48 12  6.0897  28363 0.6915    7    7
      un cit ax
749  9.5   1  5
750  7.5   0 14
751  7.5   0  4
752 14.0   1 15
753 11.0   1 12


Or specific rows, using what is called "slice" indexing:

mroz[10:15,]

   lfp whrs kl6 k618 wa we     ww rpwg hhrs ha he      hw faminc    mtr wmed
10   1 1600   0    2 39 12 4.6875 4.15 2100 43 12  5.7143  20425 0.6915    7
11   1 1969   0    1 33 12 4.0630 4.30 2450 34 12  9.7959  32300 0.5815   12
12   1 1960   0    1 42 11 4.5918 4.58 2375 47 14  8.0000  28700 0.6215   14
13   1  240   1    2 30 12 2.0833 0.00 2830 33 16  5.3004  15500 0.7215   16
14   1  997   0    2 43 12 2.2668 3.50 3317 46 12  4.3413  16860 0.7215   10
15   1 1848   0    1 43 10 3.6797 3.38 2024 45 17 10.8700  31431 0.5815    7
   wfed  un cit ax
10    7 5.0   0 21
11    3 5.0   0 15
12    7 5.0   0 14
13   16 5.0   0  0
14   10 7.5   1 14
15    7 7.5   1  6
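The same bracket notation can also restrict the columns, which keeps wide output readable. The sketch below uses a small made-up data frame (toy is an assumption, not the mroz data) and selects columns by name and by position.

```r
# Hypothetical toy data with three of the mroz column names
toy <- data.frame(lfp = c(1, 1, 0), whrs = c(1610, 1656, 0), kl6 = c(1, 0, 2))

# Rows 2 to 3, columns chosen by name
by_name <- toy[2:3, c("lfp", "whrs")]
print(by_name)

# Rows 2 to 3, columns chosen by position
by_position <- toy[2:3, 1:2]
print(by_position)
```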


Or rows meeting logical conditions. Let's look at the first 10 rows where the respondent has kids less than 6 years old:

head(mroz[mroz$kl6>0,],10)

   lfp whrs kl6 k618 wa we     ww rpwg hhrs ha he      hw faminc    mtr wmed
1    1 1610   1    0 32 12 3.3540 2.65 2708 34 12  4.0288  16310 0.7215   12
3    1 1980   1    3 35 12 4.5455 4.04 3072 40 12  3.5807  21040 0.6915   12
5    1 1568   1    2 31 14 4.5918 3.60 2000 32 12 10.0000  27300 0.6215   12
13   1  240   1    2 30 12 2.0833 0.00 2830 33 16  5.3004  15500 0.7215   16
25   1 1955   1    1 31 12 2.1545 2.30 2024 31 12  4.0884  12487 0.7515   12
29   1 1516   1    0 31 17 7.2559 6.00 2390 30 17  6.2762  26100 0.6215   12
41   1  112   1    2 30 12 2.6786 0.00 4030 33 16  3.8462  15810 0.7215   12
43   1  583   1    2 31 16 2.5729 9.98 1530 34 16 13.7250  24000 0.6615   14
74   1  608   2    4 34 10 8.2237 3.00 1304 38  9  3.3742  15200 0.7915    0
79   1   90   2    2 32 15 1.0000 0.00 2350 31 14  4.8787  13755 0.7515   10
   wfed  un cit ax
1     7 5.0   0 14
3     7 5.0   0 15
5    14 9.5   1  7
13   16 5.0   0  0
25    7 5.0   1  4
29   12 5.0   0  7
41   12 3.0   0  1
43   16 9.5   1  6
74    0 7.5   1 11
79   12 7.5   1  9


Note that this is achieved using logical addressing, where only rows having the logical value TRUE are included. So for the first five rows of mroz, only rows 1, 3, and 5 have at least one young child and would be displayed above:

head(mroz$kl6>0,5)

[1]  TRUE FALSE  TRUE FALSE  TRUE
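Logical conditions can also be combined and counted. The following is a small sketch on made-up data (the data frame toy and its values are assumptions), showing the logical vector itself, a combined condition using &, and a count of matching rows.

```r
# Hypothetical toy data: young children (kl6) and wife's age (wa)
toy <- data.frame(kl6 = c(1, 0, 1, 0, 2), wa = c(32, 30, 35, 34, 31))

# TRUE wherever the row has at least one young child
young <- toy$kl6 > 0
print(young)

# Combine conditions with & (and); only rows where both are TRUE are kept
young_under_35 <- toy[young & toy$wa < 35, ]
print(young_under_35)

# TRUE counts as 1, so sum() gives the number of matching rows
n_young <- sum(young)
print(n_young)
```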


## Creating and Modifying Variables

### Creating Variables

In Stata, you need to start a new variable with generate. In R, just assign the variable:

mroz$newvar = mroz$lfp * mroz$ax
print(colnames(mroz))
print(summary(mroz))

 [1] "lfp"    "whrs"   "kl6"    "k618"   "wa"     "we"     "ww"     "rpwg"
 [9] "hhrs"   "ha"     "he"     "hw"     "faminc" "mtr"    "wmed"   "wfed"
[17] "un"     "cit"    "ax"     "newvar"
     lfp              whrs             kl6              k618
Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000
1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000
Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000
Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353
3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000
Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000
      wa              we              ww              rpwg           hhrs
Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175
1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928
Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164
Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267
3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553
Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010
      ha              he              hw              faminc
Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500
1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428
Median :46.00   Median :12.00   Median : 6.9758   Median :20880
Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081
3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200
Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000
     mtr              wmed             wfed              un
Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000
1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500
Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500
Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624
3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000
Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000
     cit               ax            newvar
Min.   :0.0000   Min.   : 0.00   Min.   : 0.00
1st Qu.:0.0000   1st Qu.: 4.00   1st Qu.: 0.00
Median :1.0000   Median : 9.00   Median : 4.00
Mean   :0.6428   Mean   :10.63   Mean   : 7.41
3rd Qu.:1.0000   3rd Qu.:15.00   3rd Qu.:13.00
Max.   :1.0000   Max.   :45.00   Max.   :38.00


Note a new column called newvar is now part of the data.

R aficionados would probably criticize the above code, since strictly speaking the assignment

x = y

is sometimes different from the recommended R way of making an assignment:

x <- y

which is an artifact of the keyboards in use when R's predecessor, S, was written. I have never encountered a case where x = y doesn't work, but apparently it can happen.

### Modifying Variables

Unlike Stata, we simply redefine the variable and don't need to bother with replace:

mroz$newvar = mroz$newvar/10
print(summary(mroz$newvar))

 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.000   0.000   0.400   0.741   1.300   3.800
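One place where x = y and x <- y behave differently is inside a function call. The sketch below is an illustration of that difference, not something this course relies on: within the parentheses, = names an argument, while <- creates a variable and then passes it along. The names m1, m2, and the two flags are hypothetical.

```r
# Start clean so the checks below are meaningful
if (exists("x", inherits = FALSE)) rm(x)

# Here x is the name of mean's argument; no variable x is created
m1 <- mean(x = c(1, 2, 3))
x_after_equals <- exists("x", inherits = FALSE)
print(x_after_equals)

# Here x is assigned in the workspace first, then handed to mean
m2 <- mean(x <- c(1, 2, 3))
x_after_arrow <- exists("x", inherits = FALSE)
print(x_after_arrow)
```

Outside of function calls, as in the assignments used throughout this primer, the two forms do the same thing.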


## Getting help in R

If you know the function you need help with, just use the help function:

help(tail)
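If you only remember part of a function's name, base R's apropos function searches for it, and args shows a function's argument list. A brief sketch (the search pattern is just an example):

```r
# ?tail is shorthand for help(tail)

# Find functions whose names start with "read."
hits <- apropos("^read\\.")
print(head(hits))

# Show the arguments a function accepts
print(args(read.csv))
```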


## Linear Algebra in R

Here I briefly introduce matrix algebra manipulations in R. Other programs are arguably better for pure linear algebra work (e.g. Matlab or Julia), but R is a very good environment for mixing modeling, including pre-packaged statistical commands, with linear algebra. In my opinion Stata's linear algebra code is a lot more intuitive than R's.

### Creating vectors and matrices

Suppose we want to define a row vector $$\mathbf{a}$$ as

$$\mathbf{a} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$$

we can enter this in R as

a <- cbind(1,2,3)
print(a)

     [,1] [,2] [,3]
[1,]    1    2    3


If instead we wanted a column vector $$\mathbf{a} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$ we use this

a <- rbind(1,2,3)
print(a)

     [,1]
[1,]    1
[2,]    2
[3,]    3
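It can help to compare these with a plain vector made by c, which has no row or column dimension at all. A small sketch (the names v, row_a, and col_a are chosen here for illustration):

```r
v <- c(1, 2, 3)          # plain vector: no dimensions
row_a <- cbind(1, 2, 3)  # 1 x 3 row vector
col_a <- rbind(1, 2, 3)  # 3 x 1 column vector

print(dim(v))      # NULL
print(dim(row_a))  # 1 3
print(dim(col_a))  # 3 1

# Transposing the row vector gives the column vector
same <- all(t(row_a) == col_a)
print(same)
```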


### Referencing elements in arrays

We can grab individual elements of this vector using slicing:

print(a[1:2])
print(a[3])

[1] 1 2
[1] 3


Note, since a has a single column, we don't need to specify the column index, although we could:

print(a[1:2,1])
print(a[3,1])


Creating this matrix $$B = \begin{bmatrix} 2&4&3\\1&5&7 \end{bmatrix}$$ is done this way:

B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
print(B)

     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7


Slicing is also possible:

B[,2:3]

     [,1] [,2]
[1,]    4    3
[2,]    5    7
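One thing to watch when slicing: if the result has only one row or one column, R drops it to a plain vector unless told otherwise. The sketch below shows the difference using the drop argument of the bracket operator (r1 and r1_matrix are illustrative names):

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)

# A single row comes back as a plain vector with no dimensions
r1 <- B[1, ]
print(dim(r1))         # NULL

# drop = FALSE keeps the result as a 1 x 3 matrix
r1_matrix <- B[1, , drop = FALSE]
print(dim(r1_matrix))  # 1 3
```

Keeping the matrix form matters later when conformability is checked for matrix multiplication.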


It is also possible to create a matrix of all zeros and fill it in using slicing:

C <- matrix(0, 5, 5)
C[3:4, 5] = -999
print(C)

     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0 -999
[4,]    0    0    0    0 -999
[5,]    0    0    0    0    0


### The identity matrix

These are created using diag, having the arguments:

• Value to place on the diagonal
• Number of rows
• Number of columns (if omitted, columns = rows)

I = diag(1,5)
print(I)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1


Or, if you supply only one argument (the row and column dimension):

I = diag(5)
print(I)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1
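The diag function has a few other uses beyond the identity matrix. As a brief sketch: given a vector it builds a diagonal matrix, given a matrix it extracts the diagonal, and with all three arguments it can build a non-square matrix (D, d, and R are illustrative names).

```r
# A vector argument becomes the diagonal of a square matrix
D <- diag(c(1, 2, 3))
print(D)

# A matrix argument returns its diagonal as a vector
d <- diag(D)
print(d)  # 1 2 3

# Value, rows, columns: a 2 x 3 matrix with 2 on the diagonal
R <- diag(2, 2, 3)
print(R)
```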


### Creating a column vector of ones

cbind(rep(1, 5))

     [,1]
[1,]    1
[2,]    1
[3,]    1
[4,]    1
[5,]    1


The most important information we need is the dimensions of our matrices. The function dim tells us rows and columns (for 2 dimensional objects):

dim(B)

[1] 2 3


We can extract row and column dimensions like this:

dim(B)[1]
dim(B)[2]

[1] 2
[1] 3
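Base R also provides nrow and ncol as shorthand for the two elements of dim. A small sketch, which also uses the row count to build a matching column of ones (the name ones is illustrative):

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)

print(nrow(B))  # 2, same as dim(B)[1]
print(ncol(B))  # 3, same as dim(B)[2]

# A column of ones with as many rows as B
ones <- cbind(rep(1, nrow(B)))
print(dim(ones))  # 2 1
```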


### Linear Algebra Operations

a = 1
b = 1
print(a + b)

[1] 2


a = rbind(2, 5, 8)
b = rbind(6, 4, 3)
print(b + a)

     [,1]
[1,]    8
[2,]    9
[3,]   11


Note, conformability matters:

a = rbind(2, 5)
b = rbind(6, 4, 3)
print(b + a)

Error in b + a : non-conformable arrays
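The conformability error above is specific to objects that have dimensions. Plain vectors made with c behave differently: R silently recycles the shorter one when its length divides the longer one's length. The sketch below contrasts the two cases (u, w, and matrix_sum_failed are illustrative names), which is one reason to build vectors with rbind or cbind when doing matrix algebra.

```r
a <- rbind(2, 5)
b <- rbind(6, 4, 3)

# Adding a 3 x 1 and a 2 x 1 matrix fails; try() captures the error
matrix_sum_failed <- inherits(try(b + a, silent = TRUE), "try-error")
print(matrix_sum_failed)  # TRUE

# Plain vectors: u is reused as 1, 2, 1, 2 with no error or warning
u <- c(1, 2)
w <- c(10, 20, 30, 40)
print(u + w)  # 11 22 31 42
```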


#### Matrix Multiplication

Matrix multiplication uses the %*% operator:

dim(B)
dim(b)
print(B%*%b)

[1] 2 3
[1] 3 1
     [,1]
[1,]   37
[2,]   47


Again, conformability (and order) matters:

print(b%*%B)

Error in b %*% B : non-conformable arguments
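A quick way to check conformability before multiplying is to compare the column count of the left matrix with the row count of the right one. The sketch below does that and also shows that reversing the order works once both matrices are transposed, since (Bb)' = b'B' (can_multiply, Bb, and bB_t are illustrative names).

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
b <- rbind(6, 4, 3)

# B is 2 x 3 and b is 3 x 1, so B %*% b is defined
can_multiply <- ncol(B) == nrow(b)
print(can_multiply)

Bb <- B %*% b          # 2 x 1
bB_t <- t(b) %*% t(B)  # 1 x 2, the transpose of Bb
print(Bb)
print(bB_t)
```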


#### Matrix Transpose

The transpose operator is t():

print(B)
t(B)

     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7


#### Matrix Inversion

A <- matrix(c(2, 4, 3, 4, 9, 1,  1, 5, 7), nrow = 3, byrow = TRUE)
A_inv = solve(A)
A_inv

           [,1]       [,2]        [,3]
[1,]  1.4146341 -0.3170732 -0.56097561
[2,] -0.6585366  0.2682927  0.24390244
[3,]  0.2682927 -0.1463415  0.04878049


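The matrix A_inv satisfies the defining property of an inverse: A_inv %*% A reproduces the identity matrix, up to floating-point rounding error in the off-diagonal elements: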

A_inv%*%A

              [,1]          [,2]          [,3]
[1,]  1.000000e+00 -2.664535e-15 -1.332268e-15
[2,]  2.498002e-16  1.000000e+00  4.440892e-16
[3,] -4.857226e-17 -2.775558e-17  1.000000e+00
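Because of rounding error, comparing the product to the identity with == would fail, so a tolerance-based comparison such as all.equal is more appropriate. The sketch below shows that check and, as an aside, uses solve with a second argument to solve the system Ax = y directly. The vector y is made up for illustration.

```r
A <- matrix(c(2, 4, 3, 4, 9, 1, 1, 5, 7), nrow = 3, byrow = TRUE)
A_inv <- solve(A)

# Compare to the identity within numerical tolerance
is_identity <- isTRUE(all.equal(A_inv %*% A, diag(3)))
print(is_identity)

# Solve A x = y two ways; y is a hypothetical right-hand side
y <- rbind(1, 2, 3)
x_via_inverse <- A_inv %*% y
x_via_solve <- solve(A, y)
same_solution <- isTRUE(all.equal(x_via_inverse, x_via_solve))
print(same_solution)
```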


#### Scalar Operations

print(a)
print(a - 5)

     [,1]
[1,]    2
[2,]    5
     [,1]
[1,]   -3
[2,]    0


print(a/5)
print(a*5)

     [,1]
[1,]  0.4
[2,]  1.0
     [,1]
[1,]   10
[2,]   25


#### Elementwise Operations

Sometimes we want to combine arrays by performing arithmetic on the corresponding elements. For example, supposing that

$$\mathbf{a} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \nonumber$$

and

$$\mathbf{b} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \nonumber$$

we might want to calculate

$$a\_divide\_b=\begin{bmatrix} 1/3 \\ 2/4 \end{bmatrix}$$

By default, R performs these operations elementwise using the basic arithmetic operators, so long as the arrays have the same dimensions. Note that in the first example below, $$\mathbf{b}$$ and $$\mathbf{a}$$ are not conformable.

print(dim(a))
print(dim(b))
print(a/b)

[1] 2 1
[1] 3 1
Error in a/b : non-conformable arrays


However, if we redefine $$\mathbf{b}$$ to have conformable dimensions, element-wise division works:

b=c(6,4)
print(a/b)

          [,1]
[1,] 0.3333333
[2,] 1.2500000
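For completeness, here is a sketch using the values of $$\mathbf{a}$$ and $$\mathbf{b}$$ from the equations in this section, with both defined as 2 x 1 columns. It contrasts elementwise division and multiplication (/ and *) with the matrix product (%*%); the variable names are illustrative.

```r
a <- rbind(1, 2)
b <- rbind(3, 4)

# Elementwise: each element of a is paired with the same element of b
a_divide_b <- a / b  # 1/3 and 2/4
a_times_b <- a * b   # 1*3 and 2*4
print(a_divide_b)
print(a_times_b)

# Matrix product of a' (1 x 2) and b (2 x 1): 1*3 + 2*4
inner <- t(a) %*% b
print(inner)
```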