ECON 407: R Primer
This page has been moved to https://econ.pages.code.wm.edu/407/notes/docs/index.html and is no longer being maintained here.
It is recommended that you use RStudio for your work if you decide to use R
in this class.
Loading data into R
Loading Stata datasets
The foreign library allows us to open many types of data files written by other statistical packages, including Stata, SAS, and SPSS, to name a few. Its read.dta function reads files saved in Stata version 5 through 12 format. Good documentation is found here. Below, I show you how to open Stata datasets.
library(foreign)
mroz <- read.dta("https://rlhick.people.wm.edu/econ407/data/mroz.dta")
summary(mroz)
      lfp              whrs             kl6              k618
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000
       wa              we              ww              rpwg           hhrs
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010
       ha              he              hw              faminc
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000
      mtr              wmed             wfed              un
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000
      cit               ax
 Min.   :0.0000   Min.   : 0.00
 1st Qu.:0.0000   1st Qu.: 4.00
 Median :1.0000   Median : 9.00
 Mean   :0.6428   Mean   :10.63
 3rd Qu.:1.0000   3rd Qu.:15.00
 Max.   :1.0000   Max.   :45.00
Loading files from disk is a slight variation on the above command. Supposing that your Stata data file mroz.dta is in the folder /some/place, in Linux or macOS we would use the R command
mroz <- read.dta("/some/place/mroz.dta")
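For a check that needs neither a network connection nor an existing file, the sketch below writes a small made-up data frame to a temporary Stata file with write.dta (also from foreign) and reads it back from disk. The toy data frame and the temporary path are illustrative assumptions, not part of the course data.

```r
library(foreign)

# Hypothetical toy data standing in for a real Stata file
toy <- data.frame(id = 1:3, wage = c(3.35, 1.39, 4.55))

# Write it to a temporary .dta file, then read it back from disk
path <- tempfile(fileext = ".dta")
write.dta(toy, path)
toy_back <- read.dta(path)

print(names(toy_back))
print(nrow(toy_back))
```

On Windows the same call works with a drive letter, written either with forward slashes ("C:/some/place/mroz.dta") or with doubled backslashes.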
Opening R datasets
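If your dataset is already in an R format, simply use the load command. Note that load restores the saved objects under the names they were saved with and returns only those names, so its result should not be assigned to mroz:
load("/some/place/mroz.RData")
Web-based data is also accessible using load, by first opening a connection with url:
load(url("https://www.someplace.com/some/place/mroz.RData"))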
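To see what load actually returns, here is a minimal round trip using a made-up data frame and temporary files (both are assumptions for illustration). It also shows saveRDS and readRDS, which store a single object and therefore do allow assignment to a name of your choosing.

```r
# Hypothetical toy data frame; the name `toy` is what gets stored in the file
toy <- data.frame(id = 1:3, lfp = c(1, 0, 1))

path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)

# load() recreates `toy` and returns the names of the restored objects
restored <- load(path)
print(restored)
print(toy)

# saveRDS()/readRDS() store one object, so the result can be assigned
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
toy_copy <- readRDS(rds_path)
```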
Viewing Data in R
If you are using RStudio (recommended), listing data is easy and I can show you how to do that. Viewing R data at the command line is achieved by the head command. Here we'll view the first 5 rows of data:
head(mroz,5)
  lfp whrs kl6 k618 wa we     ww rpwg hhrs ha he      hw faminc    mtr wmed
1   1 1610   1    0 32 12 3.3540 2.65 2708 34 12  4.0288  16310 0.7215   12
2   1 1656   0    2 30 12 1.3889 2.65 2310 30  9  8.4416  21800 0.6615    7
3   1 1980   1    3 35 12 4.5455 4.04 3072 40 12  3.5807  21040 0.6915   12
4   1  456   0    3 34 12 1.0965 3.25 1920 53 10  3.5417   7300 0.7815    7
5   1 1568   1    2 31 14 4.5918 3.60 2000 32 12 10.0000  27300 0.6215   12
  wfed   un cit ax
1    7  5.0   0 14
2    7 11.0   1  5
3    7  5.0   0 15
4    7  5.0   0  6
5   14  9.5   1  7
Or, the last 5 rows of data:
tail(mroz,5)
    lfp whrs kl6 k618 wa we ww rpwg hhrs ha he      hw faminc    mtr wmed wfed
749   0    0   0    2 40 13  0    0 3020 43 16  9.2715  28200 0.6215   10   10
750   0    0   2    3 31 12  0    0 2056 33 12  4.8638  10000 0.7715   12   12
751   0    0   0    0 43 12  0    0 2383 43 12  1.0898   9952 0.7515   10    3
752   0    0   0    0 60 12  0    0 1705 55  8 12.4400  24984 0.6215   12   12
753   0    0   0    3 39  9  0    0 3120 48 12  6.0897  28363 0.6915    7    7
      un cit ax
749  9.5   1  5
750  7.5   0 14
751  7.5   0  4
752 14.0   1 15
753 11.0   1 12
Or specific rows, using what is called "slice" indexing:
mroz[10:15,]
   lfp whrs kl6 k618 wa we     ww rpwg hhrs ha he      hw faminc    mtr wmed
10   1 1600   0    2 39 12 4.6875 4.15 2100 43 12  5.7143  20425 0.6915    7
11   1 1969   0    1 33 12 4.0630 4.30 2450 34 12  9.7959  32300 0.5815   12
12   1 1960   0    1 42 11 4.5918 4.58 2375 47 14  8.0000  28700 0.6215   14
13   1  240   1    2 30 12 2.0833 0.00 2830 33 16  5.3004  15500 0.7215   16
14   1  997   0    2 43 12 2.2668 3.50 3317 46 12  4.3413  16860 0.7215   10
15   1 1848   0    1 43 10 3.6797 3.38 2024 45 17 10.8700  31431 0.5815    7
   wfed  un cit ax
10    7 5.0   0 21
11    3 5.0   0 15
12    7 5.0   0 14
13   16 5.0   0  0
14   10 7.5   1 14
15    7 7.5   1  6
Or rows meeting logical conditions. Let's look at the first 10 rows where the respondent has kids less than 6 years old:
head(mroz[mroz$kl6>0,],10)
   lfp whrs kl6 k618 wa we     ww rpwg hhrs ha he      hw faminc    mtr wmed
1    1 1610   1    0 32 12 3.3540 2.65 2708 34 12  4.0288  16310 0.7215   12
3    1 1980   1    3 35 12 4.5455 4.04 3072 40 12  3.5807  21040 0.6915   12
5    1 1568   1    2 31 14 4.5918 3.60 2000 32 12 10.0000  27300 0.6215   12
13   1  240   1    2 30 12 2.0833 0.00 2830 33 16  5.3004  15500 0.7215   16
25   1 1955   1    1 31 12 2.1545 2.30 2024 31 12  4.0884  12487 0.7515   12
29   1 1516   1    0 31 17 7.2559 6.00 2390 30 17  6.2762  26100 0.6215   12
41   1  112   1    2 30 12 2.6786 0.00 4030 33 16  3.8462  15810 0.7215   12
43   1  583   1    2 31 16 2.5729 9.98 1530 34 16 13.7250  24000 0.6615   14
74   1  608   2    4 34 10 8.2237 3.00 1304 38  9  3.3742  15200 0.7915    0
79   1   90   2    2 32 15 1.0000 0.00 2350 31 14  4.8787  13755 0.7515   10
   wfed  un cit ax
1     7 5.0   0 14
3     7 5.0   0 15
5    14 9.5   1  7
13   16 5.0   0  0
25    7 5.0   1  4
29   12 5.0   0  7
41   12 3.0   0  1
43   16 9.5   1  6
74    0 7.5   1 11
79   12 7.5   1  9
Note, this is achieved using logical addressing, where only rows having the logical value TRUE are included. So for the first five rows of mroz, only rows 1, 3, and 5 have at least one young child and would be displayed above:
head(mroz$kl6>0,5)
[1] TRUE FALSE TRUE FALSE TRUE
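The same idea can be tried on a tiny made-up data frame, which makes it easy to check each step by eye. The column names below mimic mroz, but the values are invented for illustration.

```r
# Hypothetical toy data; column names mimic mroz but the values are made up
toy <- data.frame(kl6 = c(1, 0, 2, 0, 1),
                  lfp = c(1, 1, 0, 0, 1))

# Logical vector: TRUE where the row has at least one young child
has_young <- toy$kl6 > 0
print(has_young)

# which() converts the TRUE positions to row numbers
print(which(has_young))

# Conditions combine with & (and) and | (or)
young_and_working <- toy[toy$kl6 > 0 & toy$lfp == 1, ]
print(young_and_working)
```

Keeping the trailing comma inside the square brackets matters: it tells R to select rows and keep all columns.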
Creating and Modifying Variables
Creating Variables
In Stata, you need to start a new variable with generate. In R, just assign the variable:
mroz$newvar = mroz$lfp * mroz$ax
print(colnames(mroz))
print(summary(mroz))
 [1] "lfp"    "whrs"   "kl6"    "k618"   "wa"     "we"     "ww"     "rpwg"
 [9] "hhrs"   "ha"     "he"     "hw"     "faminc" "mtr"    "wmed"   "wfed"
[17] "un"     "cit"    "ax"     "newvar"
      lfp              whrs             kl6              k618
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000
       wa              we              ww              rpwg           hhrs
 Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00   Min.   : 175
 1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.:1928
 Median :43.00   Median :12.00   Median : 1.625   Median :0.00   Median :2164
 Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85   Mean   :2267
 3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58   3rd Qu.:2553
 Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98   Max.   :5010
       ha              he              hw              faminc
 Min.   :30.00   Min.   : 3.00   Min.   : 0.4121   Min.   : 1500
 1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883   1st Qu.:15428
 Median :46.00   Median :12.00   Median : 6.9758   Median :20880
 Mean   :45.12   Mean   :12.49   Mean   : 7.4822   Mean   :23081
 3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667   3rd Qu.:28200
 Max.   :60.00   Max.   :17.00   Max.   :40.5090   Max.   :96000
      mtr              wmed             wfed              un
 Min.   :0.4415   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000
 1st Qu.:0.6215   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500
 Median :0.6915   Median :10.000   Median : 7.000   Median : 7.500
 Mean   :0.6789   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624
 3rd Qu.:0.7215   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000
 Max.   :0.9415   Max.   :17.000   Max.   :17.000   Max.   :14.000
      cit               ax            newvar
 Min.   :0.0000   Min.   : 0.00   Min.   : 0.00
 1st Qu.:0.0000   1st Qu.: 4.00   1st Qu.: 0.00
 Median :1.0000   Median : 9.00   Median : 4.00
 Mean   :0.6428   Mean   :10.63   Mean   : 7.41
 3rd Qu.:1.0000   3rd Qu.:15.00   3rd Qu.:13.00
 Max.   :1.0000   Max.   :45.00   Max.   :38.00
Note a new column called newvar is now part of the data.
R aficionados would probably criticize the above code, since strictly speaking the assignment
x = y
is sometimes different than the R recommended way of making an assignment:
x <- y
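which is an artifact of the keyboards in use when R's predecessor language, S, was designed. I have never encountered a case where x=y doesn't work at the top level of a script, but the two forms do behave differently inside a function call, where = names an argument instead of creating a variable.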
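One place where the two forms differ is inside a function call. The sketch below uses a made-up helper, avg, and a made-up argument name, vals, purely for illustration. In a fresh session the first check prints FALSE and the second prints TRUE.

```r
# Hypothetical helper used only for this illustration
avg <- function(vals) mean(vals)

# Inside a call, `=` names an argument; no variable `vals` is created
r1 <- avg(vals = c(1, 2, 3))
print(exists("vals", inherits = FALSE))

# Inside a call, `<-` assigns `vals` in the calling environment
# and then passes the value along to the function
r2 <- avg(vals <- c(1, 2, 3))
print(exists("vals", inherits = FALSE))
```

At the top level of a script, as in the mroz$newvar example, both forms create the variable.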
Modifying Variables
Unlike Stata, we simply redefine the variable and don't need to bother with replace:
mroz$newvar = mroz$newvar/10
print(summary(mroz$newvar))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.400   0.741   1.300   3.800
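Redefinition can also be restricted to some rows by combining assignment with the logical indexing shown earlier, roughly what a conditional replace does in Stata. The toy data frame below is made up for illustration.

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(lfp = c(1, 0, 1, 0),
                  ax  = c(14, 5, 15, 6))
toy$newvar <- toy$lfp * toy$ax

# Rescale newvar only for rows where lfp equals 1
working <- toy$lfp == 1
toy$newvar[working] <- toy$newvar[working] / 10
rescaled <- toy$newvar
print(toy)

# Drop the variable entirely by assigning NULL
toy$newvar <- NULL
print(colnames(toy))
```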
Getting help in R
If you know the function you need help with, just use the help function:
help(tail)
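A few related lookups, sketched with functions that ship with R. The ? and ?? shortcuts are shown as comments because they open help pages rather than return a value.

```r
# ?tail             shorthand for help(tail)
# ??"stata"         searches installed help pages when the function name is unknown

# args() prints a function's argument list without opening its help page
print(args(tail))

# apropos() lists objects on the search path whose names match a pattern
matches <- apropos("tail")
print(matches)
```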
Linear Algebra in R
Here I briefly introduce matrix algebra manipulations in R. Other programs are arguably better for pure linear algebra work (e.g. Matlab or Julia), but R is a very good environment for mixing pre-packaged statistical commands with linear algebra. In my opinion Stata's linear algebra code is a lot more intuitive than R's.
Creating vectors and matrices
Suppose we want to define a row vector \(\mathbf{a}\) as
\begin{equation} \mathbf{a} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \end{equation} we can enter this in R as
a <- cbind(1,2,3)
print(a)
     [,1] [,2] [,3]
[1,]    1    2    3
If instead we wanted a column vector \begin{equation} \mathbf{a} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \end{equation} we use this
a <- rbind(1,2,3)
print(a)
     [,1]
[1,]    1
[2,]    2
[3,]    3
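It can help to compare these with a plain vector built by c(), which has no row or column dimension at all. The sketch below uses illustrative variable names (v, row_vec, col_vec) and shows that matrix() with an explicit shape gives the same objects as cbind and rbind applied to scalars.

```r
# c() builds a plain vector with no dimensions
v <- c(1, 2, 3)
print(dim(v))

# cbind() of scalars gives a 1 x 3 row; rbind() of scalars gives a 3 x 1 column
row_vec <- cbind(1, 2, 3)
col_vec <- rbind(1, 2, 3)
print(dim(row_vec))
print(dim(col_vec))

# matrix() states the shape explicitly
row_alt <- matrix(v, nrow = 1)
col_alt <- matrix(v, ncol = 1)
print(all.equal(row_alt, row_vec))

# Applied to a whole vector, cbind() makes that vector a column
print(dim(cbind(v)))
```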
Referencing elements in arrays
We can grab individual elements of this vector using slicing:
print(a[1:2])
print(a[3])
[1] 1 2
[1] 3
Note, since we are working with a column, we don't need to refer to the row dimension, although we could:
print(a[1:2,1])
print(a[3,1])
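One detail worth knowing when indexing with both dimensions: R simplifies a single-column result to a plain vector unless told otherwise. A short sketch, using the same column vector as above:

```r
a <- rbind(1, 2, 3)

# Row and column indices together; the result is simplified to a plain vector
first_two <- a[1:2, 1]
print(first_two)
print(dim(first_two))

# drop = FALSE keeps the result as a 2 x 1 matrix
first_two_col <- a[1:2, 1, drop = FALSE]
print(dim(first_two_col))
```

Keeping the matrix shape matters later when the result is used in matrix multiplication.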
Creating this matrix \[ B = \begin{bmatrix} 2&4&3\\1&5&7 \end{bmatrix} \] is done this way:
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
print(B)
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7
Slicing is also possible. Here we keep all rows and columns 2 through 3:
B[,2:3]
     [,1] [,2]
[1,]    4    3
[2,]    5    7
It is also possible to create an empty matrix (all zeros) and fill it in using slicing:
C <- matrix(0, 5, 5)
C[3:4, 5] = -999
print(C)
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0 -999
[4,]    0    0    0    0 -999
[5,]    0    0    0    0    0
The identity matrix
These are created using diag, having the arguments:
- Value to place on the diagonal
- Number of rows
- Number of columns (if omitted, columns=rows)
I = diag(1,5)
print(I)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1
or, if you only supply one argument (row, column dimension):
I = diag(5)
print(I)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1
Creating a column vector of ones
cbind(rep(1, 5))
     [,1]
[1,]    1
[2,]    1
[3,]    1
[4,]    1
[5,]    1
Getting information about your matrices and vectors
The most important information we need is the dimensions of our matrices. The function dim tells us rows and columns (for 2 dimensional objects):
dim(B)
[1] 2 3
We can extract row and column dimensions like this:
dim(B)[1]
dim(B)[2]
[1] 2
[1] 3
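Base R also provides nrow and ncol, which return the same two numbers by name. A minimal sketch using the matrix B defined earlier:

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)

# nrow() and ncol() return the same values as dim(B)[1] and dim(B)[2]
print(nrow(B))
print(ncol(B))

# length() counts all elements, here 2 * 3
print(length(B))
```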
Linear Algebra Operations
Scalar Addition
a = 1
b = 1
print(a + b)
[1] 2
Matrix Addition
a = rbind(2, 5, 8)
b = rbind(6, 4, 3)
print(b + a)
     [,1]
[1,]    8
[2,]    9
[3,]   11
Note, conformability matters:
a = rbind(2, 5)
b = rbind(6, 4, 3)
print(b + a)
Error in b + a : non-conformable arrays
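In a longer script it is often convenient to check dimensions before adding, or to capture the error rather than stop. The sketch below is one way to do that; the variable names same_shape and msg are illustrative.

```r
a <- rbind(2, 5)
b <- rbind(6, 4, 3)

# Compare dimensions before adding
same_shape <- identical(dim(a), dim(b))
print(same_shape)

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  {
    b + a
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Note that this protection applies to matrices. Plain vectors built with c() are recycled with a warning rather than an error when their lengths differ.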
Matrix Multiplication
Matrix multiplication uses the %*% operator:
dim(B)
dim(b)
print(B%*%b)
[1] 2 3
[1] 3 1
     [,1]
[1,]   37
[2,]   47
Again, conformability (and order) matters:
print(b%*%B)
Error in b %*% B : non-conformable arguments
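The rule behind both results is that the number of columns of the left matrix must equal the number of rows of the right matrix. A short sketch that checks the rule and shows a reordered product that does conform:

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
b <- rbind(6, 4, 3)

# B %*% b is defined because ncol(B) equals nrow(b)
print(ncol(B) == nrow(b))

# b %*% B fails because ncol(b) is 1 while nrow(B) is 2
print(ncol(b) == nrow(B))

# Transposing both factors gives a conformable product in the other order
row_result <- t(b) %*% t(B)
print(row_result)
```

The last product is the transpose of B %*% b, so it holds the same two numbers laid out as a row.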
Matrix Transpose
The transpose operator is t():
print(B)
t(B)
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
Matrix Inversion
A <- matrix(c(2, 4, 3, 4, 9, 1, 1, 5, 7), nrow = 3, byrow = TRUE)
A_inv = solve(A)
A_inv
           [,1]       [,2]        [,3]
[1,]  1.4146341 -0.3170732 -0.56097561
[2,] -0.6585366  0.2682927  0.24390244
[3,]  0.2682927 -0.1463415  0.04878049
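and the matrix A_inv satisfies the properties of an inverse. The product is the identity matrix, with off-diagonal entries that are zero up to floating-point rounding error: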
A_inv%*%A
              [,1]          [,2]          [,3]
[1,]  1.000000e+00 -2.664535e-15 -1.332268e-15
[2,]  2.498002e-16  1.000000e+00  4.440892e-16
[3,] -4.857226e-17 -2.775558e-17  1.000000e+00
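Because of entries like -2.664535e-15, an exact comparison with the identity matrix would fail. A tolerance-based check is the usual workaround. The sketch below also shows that solve accepts a second argument and then solves a linear system directly; the right-hand side y is a made-up example.

```r
A <- matrix(c(2, 4, 3, 4, 9, 1, 1, 5, 7), nrow = 3, byrow = TRUE)
A_inv <- solve(A)

# all.equal() compares up to a small numerical tolerance
print(isTRUE(all.equal(A_inv %*% A, diag(3))))

# Rounding hides the tiny rounding errors when printing
print(round(A_inv %*% A, 10))

# solve(A, y) solves A x = y directly without forming the inverse
y <- rbind(6, 4, 3)
x <- solve(A, y)
print(isTRUE(all.equal(A %*% x, y)))
```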
Scalar Operations
Scalar Addition and Subtraction
print(a)
print(a - 5)
     [,1]
[1,]    2
[2,]    5
     [,1]
[1,]   -3
[2,]    0
Scalar Multiplication and Division
print(a/5)
print(a*5)
     [,1]
[1,]  0.4
[2,]  1.0
     [,1]
[1,]   10
[2,]   25
Elementwise Operations
Sometimes we want to combine arrays by performing arithmetic on the corresponding elements. For example, supposing that
\begin{equation} \mathbf{a} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \nonumber \end{equation} and
\begin{equation} \mathbf{b} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \nonumber \end{equation} we might want to calculate
\begin{equation} a\_divide\_b=\begin{bmatrix} 1/3 \\ 2/4 \end{bmatrix} \end{equation}
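By default, R performs these operations using the basic arithmetic operators so long as the arrays are of the same dimensions. Note in the first example below, \(\mathbf{b}\) and \(\mathbf{a}\) are not conformable: they still hold the values assigned in the addition example, where \(\mathbf{a}\) has 2 rows and \(\mathbf{b}\) has 3.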
print(dim(a))
print(dim(b))
print(a/b)
[1] 2 1
[1] 3 1
Error in a/b : non-conformable arrays
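However, if we redefine \(\mathbf{b}\) to have a conformable length, element-wise division works. Here \(\mathbf{b}\) is a plain vector of length 2, which R matches element by element to the two rows of \(\mathbf{a}\):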
b=c(6,4)
print(a/b)
          [,1]
[1,] 0.3333333
[2,] 1.2500000
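For reference, here is the same operation using the values from the equations at the start of this section, so the result can be compared with the a_divide_b vector written there. The names a_times_b and inner are illustrative.

```r
a <- rbind(1, 2)
b <- rbind(3, 4)

# Element-wise division: each element of a divided by the matching element of b
a_divide_b <- a / b
print(a_divide_b)

# * is also element-wise, unlike the matrix product %*%
a_times_b <- a * b
print(a_times_b)

# The inner product needs a transpose to make the dimensions conform
inner <- t(a) %*% b
print(inner)
```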