# ECON 407: Companion to IV Regression

This page has been moved to https://econ.pages.code.wm.edu/407/notes/docs/index.html and is no longer being maintained here.

This companion to our chapter on endogeneity briefly explores the endogeneity problem and how to estimate this class of models in R and Stata. Recall that the OLS estimator requires $E(\mathbf{x}'\epsilon) = 0$.
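
To see why this assumption matters, a short derivation may help. The notation here ($\mathbf{X}$, $\mathbf{y}$, $\mathbf{b}_{OLS}$, with $\mathbf{x}$ a row of $\mathbf{X}$) is assumed for illustration and may differ slightly from the chapter notes.

$$
\begin{aligned}
\mathbf{b}_{OLS} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon \\
\operatorname{plim}\, \mathbf{b}_{OLS} &= \beta + \left[E(\mathbf{x}'\mathbf{x})\right]^{-1} E(\mathbf{x}'\epsilon)
\end{aligned}
$$

The second term vanishes only when $E(\mathbf{x}'\epsilon) = 0$; otherwise OLS does not converge to $\beta$ no matter how large the sample.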

The code below shows how to handle estimation problems where this assumption fails but where we can identify an instrument, which lets us use instrumental variables regression (IV regression). We demonstrate both R and Stata. First, let's open the data in each, noting that we are using a cross-sectional version of the Tobias and Koop data that keeps only 1983 (time equal to 4). In Stata, load the data and summarize:

webuse set "https://rlhick.people.wm.edu/econ407/data/"
webuse tobias_koop
keep if time==4
sum

(prefix now "https://rlhick.people.wm.edu/econ407/data")
(16,885 observations deleted)

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
id |      1,034    1090.952    634.8917          4       2177
educ |      1,034    12.27466    1.566838          9         19
ln_wage |      1,034    2.138259    .4662805        .42       3.59
pexp |      1,034     4.81528    2.190298          0         12
time |      1,034           4           0          4          4
-------------+---------------------------------------------------------
ability |      1,034    .0165957    .9209635      -3.14       1.89
meduc |      1,034    11.40329    3.027277          0         20
feduc |      1,034    11.58511    3.735833          0         20
broken_home |      1,034    .1692456    .3751502          0          1
siblings |      1,034    3.200193    2.126575          0         15
-------------+---------------------------------------------------------
pexp2 |      1,034    27.97969    22.59879          0        144


and in R:

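library(foreign)
library(sandwich)
library(lmtest)
library(boot)
library(AER)
library(car)
library(ivpack)
# Read the same Stata file used above (URL assumed from the webuse prefix and file name)
tk.df = read.dta("https://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
# Keep only the 1983 cross-section
tk4.df = subset(tk.df, time == 4)
# Attach so variables can be referenced by name in the model formulas below
attach(tk4.df)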


## OLS

If we ignore any potential endogeneity problem, we can estimate the model by OLS as described in the OLS chapter companion. Here are the results from Stata:

reg ln_wage pexp pexp2 educ broken_home


Source |       SS           df       MS      Number of obs   =     1,034
-------------+----------------------------------   F(4, 1029)      =     51.36
Model |  37.3778146         4  9.34445366   Prob > F        =    0.0000
Residual |   187.21445     1,029  .181938241   R-squared       =    0.1664
Total |  224.592265     1,033  .217417488   Root MSE        =    .42654

------------------------------------------------------------------------------
ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
_cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------


The implied elasticity of wages with respect to education, averaged over the sample, is

margins, dyex(educ) continuous


Average marginal effects                        Number of obs     =      1,034
Model VCE    : OLS

Expression   : Linear prediction, predict()
dy/ex w.r.t. : educ

------------------------------------------------------------------------------
|            Delta-method
|      dy/ex   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ |   1.046691   .1140274     9.18   0.000     .8229385    1.270444
------------------------------------------------------------------------------
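
As a check on the margins output, the elasticity can be reproduced by hand. Because the dependent variable is in logs, the derivative of ln_wage with respect to ln(educ) is the education coefficient times educ, and its sample average is the coefficient times the mean of educ. A minimal sketch in R, with both numbers copied from the Stata output above (the variable names are hypothetical):

```r
# Values copied from the Stata output above
b_educ    <- 0.0852725   # OLS coefficient on educ
mean_educ <- 12.27466    # sample mean of educ (from sum)

# Average of d ln_wage / d ln(educ) = b_educ * educ over the sample
elasticity_ols <- b_educ * mean_educ
print(round(elasticity_ols, 4))
```

The product agrees with the 1.046691 reported by margins up to rounding of the inputs.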


Running the OLS regression in R is done in a similar manner (I am suppressing the output for brevity).

ols.lm = lm(ln_wage ~ pexp + pexp2 + broken_home + educ)


## The endogeneity problem

Suppose we are worried that education is endogenous, that is, correlated with the population regression errors. In that case the OLS estimates of $\beta$ are biased and inconsistent. We hypothesize that the variable feduc (father's education) is a good instrument, having all the properties we describe in detail in the notes document.
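
A small simulation may make the problem concrete. The sketch below uses made-up data rather than the Tobias and Koop sample: every variable name and parameter value is hypothetical, with u playing the role of an unobserved factor such as ability. It compares OLS with two-stage least squares computed by hand in base R.

```r
set.seed(407)
n <- 5000

# Simulated data (all names and parameter values are hypothetical)
u <- rnorm(n)                     # unobserved factor, e.g. ability
z <- rnorm(n)                     # instrument: moves x, unrelated to u
x <- z + u + rnorm(n)             # endogenous regressor (depends on u)
y <- 1 + 0.5 * x + u + rnorm(n)   # true coefficient on x is 0.5

# OLS ignores the correlation between x and the error (u + noise)
b_ols <- coef(lm(y ~ x))["x"]

# 2SLS by hand: first stage, then regress y on the fitted values
x_hat  <- fitted(lm(x ~ z))
b_2sls <- coef(lm(y ~ x_hat))["x_hat"]

print(c(OLS = unname(b_ols), TSLS = unname(b_2sls)))
```

In this design the OLS estimate should settle near 0.5 + 1/3, because the covariance of x and u is 1 and the variance of x is 3, while the two-stage estimate should be close to 0.5. The standard errors from the hand-rolled second stage are not valid, which is one reason to use a dedicated IV routine in practice.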

### Estimation in Stata

In Stata, we use this code:

ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc)


Instrumental variables (2SLS) regression          Number of obs   =      1,034
Wald chi2(4)    =     138.19
Prob > chi2     =     0.0000
R-squared       =     0.1277
Root MSE        =     .43528

------------------------------------------------------------------------------
ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ |   .1495027   .0320009     4.67   0.000     .0867821    .2122233
pexp |    .214752   .0246553     8.71   0.000     .1664285    .2630755
pexp2 |  -.0117453   .0023508    -5.00   0.000    -.0163529   -.0071377
broken_home |   .0244713   .0397189     0.62   0.538    -.0533763     .102319
_cons |  -.4064389   .4356072    -0.93   0.351    -1.260213    .4473354
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   pexp pexp2 broken_home feduc


Note two things:

1. The point estimate of the education coefficient, and therefore the elasticity, has nearly doubled compared to OLS (from about 0.085 to about 0.150).
2. There is no R command that I can find that will exactly replicate the above results, since ivregress does not apply the small-sample correction to the variance-covariance matrix that R's ivreg command does.

The elasticity implied by the IV estimate is

margins, dyex(educ) continuous


Average marginal effects                        Number of obs     =      1,034

Expression   : Linear prediction, predict()
dy/ex w.r.t. : educ

------------------------------------------------------------------------------
|            Delta-method
|      dy/ex   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ |   1.835095   .3928002     4.67   0.000     1.065221     2.60497
------------------------------------------------------------------------------


### Estimation in R

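In R, ivreg from the AER package takes a two-part formula: the regressors of the structural equation go before the |, and the full set of instruments (the exogenous regressors plus feduc) goes after it.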

ivmodel <- ivreg(ln_wage ~ pexp + pexp2 + broken_home + educ |
pexp + pexp2 + broken_home + feduc)
summary(ivmodel)


Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ |
pexp + pexp2 + broken_home + feduc)

Residuals:
Min      1Q  Median      3Q     Max
-1.8472 -0.2326  0.0194  0.2541  1.6113

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406439   0.436664  -0.931    0.352
pexp         0.214752   0.024715   8.689  < 2e-16 ***
pexp2       -0.011745   0.002357  -4.984 7.30e-07 ***
broken_home  0.024471   0.039815   0.615    0.539
educ         0.149503   0.032079   4.661 3.57e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243
Wald test: 34.38 on 4 and 1029 DF,  p-value: < 2.2e-16


Note that R reports standard errors with a small-sample (degrees-of-freedom) adjustment, so they match the output from this Stata command:

ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), small
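
The two sets of standard errors appear to differ only by a degrees-of-freedom factor. A quick check in R, using numbers copied from the outputs above and assuming the adjustment is the usual sqrt(n / (n - k)), with n = 1,034 observations and k = 5 estimated coefficients (the variable names are hypothetical):

```r
n <- 1034   # observations
k <- 5      # estimated coefficients, including the intercept

se_stata_educ <- 0.0320009   # educ std. err. from ivregress without small

# Rescale the large-sample standard error by sqrt(n / (n - k))
se_small_educ <- se_stata_educ * sqrt(n / (n - k))
print(round(se_small_educ, 6))
```

The rescaled value agrees with the 0.032079 that ivreg reports for educ.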


Stata's ivregress output with heteroskedasticity-robust standard errors (suppressed) is obtained from

ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), robust


Here is the robust version of the model in R:

summary(ivmodel,vcov=sandwich)


Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ |
pexp + pexp2 + broken_home + feduc)

Residuals:
Min      1Q  Median      3Q     Max
-1.8472 -0.2326  0.0194  0.2541  1.6113

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406439   0.440450  -0.923    0.356
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465
educ         0.149503   0.032908   4.543 6.20e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16
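
For reference, the heteroskedasticity-robust (sandwich) covariance for 2SLS has the familiar form below. This is a sketch in assumed notation, where $\hat{\mathbf{X}}$ holds the first-stage fitted regressors, $\hat{\mathbf{x}}_i$ is its $i$th row, and $e_i = y_i - \mathbf{x}_i\mathbf{b}_{IV}$ are residuals computed with the original regressors. It is the textbook form without a finite-sample adjustment, not a statement about either package's internals.

$$
\widehat{Var}(\mathbf{b}_{IV}) = (\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1}\left(\sum_{i=1}^{n} e_i^2\, \hat{\mathbf{x}}_i'\hat{\mathbf{x}}_i\right)(\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1}
$$

Under homoskedasticity the middle term collapses to $\sigma^2 \hat{\mathbf{X}}'\hat{\mathbf{X}}$, which gives back the conventional 2SLS covariance.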


## Inference

We have more work to do:

1. Test for relevant and strong instruments
2. Test for endogeneity
3. Test for overidentification (not relevant for this example)

In Stata, we issue these commands:

estat firststage


First-stage regression summary statistics
--------------------------------------------------------------------------
|            Adjusted      Partial
Variable |   R-sq.       R-sq.        R-sq.     F(1,1029)   Prob > F
-------------+------------------------------------------------------------
educ |  0.2416      0.2387       0.0878       98.9915    0.0000
--------------------------------------------------------------------------

Minimum eigenvalue statistic = 98.9915

Critical Values                      # of endogenous regressors:    1
Ho: Instruments are weak             # of excluded instruments:     1
---------------------------------------------------------------------
|    5%     10%     20%     30%
2SLS relative bias                 |         (not available)
-----------------------------------+---------------------------------
|   10%     15%     20%     25%
2SLS Size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
LIML Size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
---------------------------------------------------------------------
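
The first-stage F statistic above tests whether the excluded instrument helps explain the endogenous regressor once the included exogenous variables are controlled for. A minimal sketch of that test in base R, using simulated data rather than the course data (all names and parameter values are hypothetical):

```r
set.seed(407)
n <- 1000

# Simulated data (names and parameter values are hypothetical)
w <- rnorm(n)                       # included exogenous control
z <- rnorm(n)                       # excluded instrument
x <- 0.5 * z + 0.3 * w + rnorm(n)   # endogenous regressor

# First stage with and without the excluded instrument
first_full       <- lm(x ~ w + z)
first_restricted <- lm(x ~ w)

# F test of the exclusion restriction on z
f_stat <- anova(first_restricted, first_full)$F[2]

# With a single instrument, F equals the squared t statistic on z
t_z <- summary(first_full)$coefficients["z", "t value"]
print(c(F = f_stat, t_squared = t_z^2))
```

The resulting F statistic is the number one would compare against the critical values in the table above (16.38 for a 10% maximal size with one instrument and one endogenous regressor).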


Note that since the number of excluded instruments equals the number of endogenous variables, the model is exactly identified and there are no overidentifying restrictions to test, as Stata confirms:

estat overid

no overidentifying restrictions
r(498);


In R, we do it this way:

summary(ivmodel,vcov=sandwich,diagnostics = TRUE)


Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ |
pexp + pexp2 + broken_home + feduc)

Residuals:
Min      1Q  Median      3Q     Max
-1.8472 -0.2326  0.0194  0.2541  1.6113

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406439   0.440450  -0.923    0.356
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465
educ         0.149503   0.032908   4.543 6.20e-06 ***

Diagnostic tests:
df1  df2 statistic p-value
Weak instruments    1 1029    80.649  <2e-16 ***
Wu-Hausman          1 1028     4.376  0.0367 *
Sargan              0   NA        NA      NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16


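These results tell us that we have a relevant and strong instrument and that education is likely endogenous. The first-stage F statistic (98.99 in Stata, 80.65 in the R output) is far above the critical value of 16.38 for a 10% maximal size of the nominal 5% Wald test, and the Wu-Hausman test rejects exogeneity of education at the 5% level (p = 0.037).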
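
The idea behind the endogeneity test can be sketched with a regression-based (control function) version of the Durbin-Wu-Hausman test: add the first-stage residuals to the structural equation and test whether their coefficient is zero. The example below uses simulated data with hypothetical names and parameter values, and it is meant as an illustration of the logic rather than a description of how ivreg computes its diagnostic.

```r
set.seed(407)
n <- 5000

# Simulated data (names and parameter values are hypothetical)
u <- rnorm(n)                     # unobserved factor in the error term
z <- rnorm(n)                     # instrument
x <- z + u + rnorm(n)             # endogenous by construction
y <- 1 + 0.5 * x + u + rnorm(n)   # true coefficient on x is 0.5

# Step 1: first-stage residuals hold the part of x not explained by z
v_hat <- resid(lm(x ~ z))

# Step 2: add the residuals to the structural equation;
# a significant coefficient on v_hat signals that x is endogenous
aug <- summary(lm(y ~ x + v_hat))
p_endog <- aug$coefficients["v_hat", "Pr(>|t|)"]
b_x_aug <- aug$coefficients["x", "Estimate"]
print(c(p_value = p_endog, coef_x = b_x_aug))
```

Because x is endogenous by construction here, the p-value should be very small. The coefficient on x in the augmented regression coincides with the 2SLS estimate, so it should also be close to 0.5.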