ECON 407: Companion to IV Regression

This companion to our chapter on endogeneity briefly reviews the endogeneity problem and shows how to estimate this class of models in R and Stata. Recall that the OLS estimator requires $$ E(\mathbf{x'\epsilon}) = 0 $$

The code below shows how to proceed when this assumption fails but an instrument is available, using instrumental variables regression (IV regression).
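To see why the assumption matters, here is a brief sketch in standard matrix notation. The symbols $\mathbf{X}$, $\mathbf{Z}$, and $\mathbf{y}$ (stacked regressors, instruments, and outcome) are introduced here for illustration and may differ from the chapter's notation. Substituting $\mathbf{y} = \mathbf{X}\beta + \mathbf{\epsilon}$ into the OLS formula gives

$$ \hat{\beta}_{OLS} = (\mathbf{X'X})^{-1}\mathbf{X'y} = \beta + (\mathbf{X'X})^{-1}\mathbf{X'\epsilon} $$

so when $E(\mathbf{x'\epsilon}) \neq 0$ the second term does not vanish, even in large samples. If there are as many instruments as regressors, with $E(\mathbf{z'\epsilon}) = 0$ and $E(\mathbf{z'x})$ of full rank, the same substitution for the IV estimator gives

$$ \hat{\beta}_{IV} = (\mathbf{Z'X})^{-1}\mathbf{Z'y} = \beta + (\mathbf{Z'X})^{-1}\mathbf{Z'\epsilon} $$

and the second term now converges to zero as the sample grows.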

In [2]:
library(foreign)
library(sandwich)
library(lmtest)
library(boot)
library(AER)
library(car)
library(ivpack)

We continue with a cross-sectional version of the Tobias and Koop data that focuses on 1983. Load the data and keep that year (time == 4):

In [3]:
tk.df = read.dta("http://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)
attach(tk4.df)

OLS

If we ignore any potential endogeneity problem, we can estimate the model by OLS as described in the OLS document (ols.Rmd). Here are the results from Stata:

. reg ln_wage pexp pexp2 educ broken_home

      Source |       SS       df       MS              Number of obs =    1034
-------------+------------------------------           F(  4,  1029) =   51.36
       Model |  37.3778146     4  9.34445366           Prob > F      =  0.0000
    Residual |   187.21445  1029  .181938241           R-squared     =  0.1664
-------------+------------------------------           Adj R-squared =  0.1632
       Total |  224.592265  1033  .217417488           Root MSE      =  .42654

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------

The implied elasticity with respect to education is:

. margins, dyex(educ) continuous

Average marginal effects                          Number of obs   =       1034
Model VCE    : OLS

Expression   : Linear prediction, predict()
dy/ex w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/ex   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   1.046691   .1140274     9.18   0.000     .8232017    1.270181
------------------------------------------------------------------------------

In R, the same model is estimated this way (I am suppressing output for the sake of brevity):

In [5]:
ols.lm = lm(ln_wage ~ pexp + pexp2 + broken_home + educ)
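The Stata elasticity shown earlier can be approximated in R by hand. The sketch below is one possible way to do it, not the course's official method. It uses simulated stand-in data, and the names `sim.df`, `sim.lm`, `dyex`, and `dyex.se` are hypothetical. For a linear model, dy/ex is the coefficient times the regressor, averaged over the sample. The delta-method standard error scales the coefficient's standard error by the same mean, treating the regressor as fixed.

```r
# Simulated stand-in data (an assumption; NOT the Tobias and Koop sample)
set.seed(407)
n <- 500
sim.df <- data.frame(educ = round(runif(n, 8, 18)))
sim.df$ln_wage <- 0.5 + 0.08 * sim.df$educ + rnorm(n, sd = 0.4)

sim.lm <- lm(ln_wage ~ educ, data = sim.df)

# Coefficient and standard error on educ
b.educ  <- coef(sim.lm)[["educ"]]
se.educ <- sqrt(diag(vcov(sim.lm)))[["educ"]]

# dy/ex for a linear model: d(ln_wage)/d(ln educ) = beta_educ * educ,
# averaged over the sample
dyex <- b.educ * mean(sim.df$educ)

# Delta-method standard error, treating educ as fixed
dyex.se <- se.educ * mean(sim.df$educ)

c(dyex = dyex, se = dyex.se, z = dyex / dyex.se)
```

The Stata output above is consistent with this calculation: 1.046691 / .0852725 and .1140274 / .0092897 both come to about 12.27, the implied sample mean of educ, and the z statistic equals the t statistic on educ. For the model in this document the analogous expression would be `coef(ols.lm)["educ"] * mean(educ)`.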

The endogeneity problem

We assume that education is endogenous; that is, it is correlated with the population regression errors. This means OLS estimates of $\beta$ are biased and inconsistent. We hypothesize that the variable feduc is a good instrument, having all the properties we describe in detail in the notes document.
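A small simulation can make the problem concrete. This is only an illustrative sketch: the data-generating process, the coefficient values, and the variable names (`z`, `u`, `x`, `y`) are all assumptions and have nothing to do with the Tobias and Koop data. The regressor `x` depends on the structural error `u`, so OLS overstates the true slope of 0.5, while the simple IV estimator built from the instrument `z` does not.

```r
set.seed(407)
n <- 5000
z <- rnorm(n)                       # instrument: shifts x, unrelated to u
u <- rnorm(n)                       # structural error
x <- 0.8 * z + 0.6 * u + rnorm(n)   # x is endogenous because it depends on u
y <- 1 + 0.5 * x + u                # true slope on x is 0.5

# OLS slope: contaminated by cov(x, u) != 0
b.ols <- coef(lm(y ~ x))[["x"]]

# Simple IV estimator with one regressor and one instrument
b.iv <- cov(z, y) / cov(z, x)

c(ols = b.ols, iv = b.iv)
```

Under this design the OLS slope converges to 0.5 + cov(x, u) / var(x) = 0.5 + 0.6 / 2 = 0.8 rather than 0.5, while the IV estimate should land close to 0.5.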

Estimation in Stata

In Stata, we use this code:

. ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc) 

Instrumental variables (2SLS) regression               Number of obs =    1034
                                                       Wald chi2(4)  =  138.19
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.1277
                                                       Root MSE      =  .43528

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1495027   .0320009     4.67   0.000     .0867821    .2122233
        pexp |    .214752   .0246553     8.71   0.000     .1664285    .2630755
       pexp2 |  -.0117453   .0023508    -5.00   0.000    -.0163529   -.0071377
 broken_home |   .0244713   .0397189     0.62   0.538    -.0533763     .102319
       _cons |  -.4064389   .4356072    -0.93   0.351    -1.260213    .4473354
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   pexp pexp2 broken_home feduc

Note two things:

  1. The point estimate of the elasticity with respect to education has nearly doubled compared to OLS (see the margins output below).
  2. There is no R command that I can find that will exactly replicate the above results, since ivregress does not apply the small-sample correction to the variance-covariance matrix that R's ivreg command does.

. margins, dyex(educ) continuous

Average marginal effects                          Number of obs   =       1034
Model VCE    : Unadjusted

Expression   : Linear prediction, predict()
dy/ex w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/ex   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   1.835095   .3928002     4.67   0.000     1.065221     2.60497
------------------------------------------------------------------------------

Estimation in R

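This runs the estimation in R. In the ivreg formula, the regressors go before the | and the full instrument list (the exogenous regressors plus feduc) goes after it.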

In [6]:
ivmodel <- ivreg(ln_wage ~ pexp + pexp2 + broken_home + educ |
   pexp + pexp2 + broken_home + feduc)

These are the unadjusted (non-robust) standard errors with a small-sample degrees-of-freedom correction, which match the output from this Stata command:

ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), small

In [7]:
summary(ivmodel)
Out[7]:
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ | 
    pexp + pexp2 + broken_home + feduc)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8472 -0.2326  0.0194  0.2541  1.6113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.406439   0.436664  -0.931    0.352    
pexp         0.214752   0.024715   8.689  < 2e-16 ***
pexp2       -0.011745   0.002357  -4.984 7.30e-07 ***
broken_home  0.024471   0.039815   0.615    0.539    
educ         0.149503   0.032079   4.661 3.57e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243 
Wald test: 34.38 on 4 and 1029 DF,  p-value: < 2.2e-16 
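The small-sample correction amounts to dividing the residual sum of squares by n - k instead of n. The sketch below shows this with 2SLS computed by matrix algebra in base R. The data are simulated, and every name in it (`X`, `Z`, `Xhat`, `se.large`, `se.small`) is hypothetical. It is meant as an illustration of the mechanics under those assumptions, not a replacement for ivreg.

```r
set.seed(407)
n <- 1000
z <- rnorm(n); w <- rnorm(n); u <- rnorm(n)
x <- 0.8 * z + 0.3 * w + 0.6 * u + rnorm(n)  # endogenous regressor
y <- 1 + 0.5 * x + 0.2 * w + u

X <- cbind(1, w, x)   # regressors (x is endogenous)
Z <- cbind(1, w, z)   # instruments (z stands in for x)
k <- ncol(X)

# 2SLS: project X on Z, then regress y on the projection
Xhat <- Z %*% solve(crossprod(Z), crossprod(Z, X))
b    <- solve(crossprod(Xhat), crossprod(Xhat, y))

# Residuals use the ORIGINAL X, not Xhat
e <- y - X %*% b
XhXh.inv <- solve(crossprod(Xhat))

se.large <- sqrt(diag(XhXh.inv) * sum(e^2) / n)        # no correction
se.small <- sqrt(diag(XhXh.inv) * sum(e^2) / (n - k))  # small-sample correction

cbind(estimate = drop(b), se.large, se.small)
```

The two sets of standard errors differ only by the factor sqrt(n / (n - k)). The output in this document fits that pattern: .0320009 * sqrt(1034 / 1029) is about 0.032079, the standard error R reports for educ. Rescaling `vcov(ivmodel)` by (n - k) / n should therefore recover Stata's default standard errors, though that step is a suggestion that has not been run here.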

Here is the robust version of the model, matching Stata's ivregress output from

ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), robust

In [8]:
summary(ivmodel, vcov = sandwich)
Out[8]:
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ | 
    pexp + pexp2 + broken_home + feduc)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8472 -0.2326  0.0194  0.2541  1.6113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.406439   0.440450  -0.923    0.356    
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465    
educ         0.149503   0.032908   4.543 6.20e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243 
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16 
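For reference, the robust covariance matrix used here can be sketched as the usual heteroskedasticity-robust sandwich applied to the second stage. This assumes that sandwich applies no finite-sample adjustment by default (the HC0 form). The notation is introduced for illustration: $\hat{\mathbf{X}}$ holds the regressors projected on the instruments, with rows $\hat{\mathbf{x}}_i'$, and $\hat{\epsilon}_i = y_i - \mathbf{x}_i'\hat{\beta}$ is computed from the original regressors.

$$ \widehat{V}(\hat{\beta}) = (\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1} \left( \sum_{i=1}^{n} \hat{\epsilon}_i^2 \, \hat{\mathbf{x}}_i \hat{\mathbf{x}}_i' \right) (\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1} $$

Only the standard errors change relative to the previous output; the coefficient estimates are identical.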

Inference

We have more work to do:

  1. Test for relevant and strong instruments
  2. Test for endogeneity
  3. Test for overidentification (not relevant for this example)

In Stata, we issue these commands:

. estat firststage

  First-stage regression summary statistics
  --------------------------------------------------------------------------
               |            Adjusted      Partial
      Variable |   R-sq.       R-sq.        R-sq.     F(1,1029)   Prob > F
  -------------+------------------------------------------------------------
          educ |  0.2416      0.2387       0.0878       98.9915    0.0000
  --------------------------------------------------------------------------


  Minimum eigenvalue statistic = 98.9915     

  Critical Values                      # of endogenous regressors:    1
  Ho: Instruments are weak             # of excluded instruments:     1
  ---------------------------------------------------------------------
                                     |    5%     10%     20%     30%
  2SLS relative bias                 |         (not available)
  -----------------------------------+---------------------------------
                                     |   10%     15%     20%     25%
  2SLS Size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
  LIML Size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
  ---------------------------------------------------------------------

. estat endogenous

  Tests of endogeneity
  Ho: variables are exogenous

  Durbin (score) chi2(1)          =  4.62133  (p = 0.0316)
  Wu-Hausman F(1,1028)            =  4.61514  (p = 0.0319)

Note that since the number of excluded instruments equals the number of endogenous variables, the model is exactly identified and there are no overidentifying restrictions to test:

. estat overid
no overidentifying restrictions
r(498);

In R, we do it this way:

In [9]:
summary(ivmodel, vcov = sandwich, diagnostics = TRUE)
Out[9]:
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ | 
    pexp + pexp2 + broken_home + feduc)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8472 -0.2326  0.0194  0.2541  1.6113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.406439   0.440450  -0.923    0.356    
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465    
educ         0.149503   0.032908   4.543 6.20e-06 ***

Diagnostic tests:
                  df1  df2 statistic p-value    
Weak instruments    1 1029    80.649  <2e-16 ***
Wu-Hausman          1 1028     4.376  0.0367 *  
Sargan              0   NA        NA      NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243 
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16 

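These results tell us that the instrument is relevant and strong: the first-stage F statistic (98.99 in Stata, 80.65 in R) is far above both the 16.38 critical value for a maximal size of 10% in a nominal 5% Wald test and the common rule-of-thumb threshold of 10. The Wu-Hausman test rejects exogeneity of education at the 5% level (p of about 0.03 in both packages), so education is likely endogenous. The R statistics differ from Stata's most likely because the R call uses the sandwich covariance matrix, whereas the Stata tests shown are the non-robust versions.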
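Both diagnostics can also be built by hand from auxiliary regressions, which may help clarify what they measure. The following is a minimal base-R sketch on simulated data. The data-generating process and all names (`fs.full`, `fs.restricted`, `v.hat`, `cf.lm`) are assumptions made for illustration. The first-stage F compares the first stage with and without the excluded instrument. The endogeneity test adds the first-stage residuals to the structural equation and checks whether their coefficient is zero.

```r
set.seed(407)
n <- 2000
z <- rnorm(n); w <- rnorm(n); u <- rnorm(n)
x <- 0.5 * z + 0.3 * w + 0.6 * u + rnorm(n)  # endogenous regressor
y <- 1 + 0.5 * x + 0.2 * w + u

# 1. Instrument relevance: first stage with and without the excluded instrument
fs.full       <- lm(x ~ w + z)
fs.restricted <- lm(x ~ w)
fs.test       <- anova(fs.restricted, fs.full)
first.stage.F <- fs.test$F[2]

# 2. Endogeneity: add the first-stage residuals to the structural equation;
#    a significant coefficient on v.hat is evidence against exogeneity of x
v.hat <- residuals(fs.full)
cf.lm <- lm(y ~ x + w + v.hat)
wu.hausman <- coef(summary(cf.lm))["v.hat", ]

first.stage.F
wu.hausman
```

With a single endogenous regressor, the square of the t statistic on `v.hat` is the non-robust Wu-Hausman F statistic. Its denominator degrees of freedom are one fewer than in the structural model, which matches the F(1,1028) reported by Stata above. For the model in this document, the first stage would be `lm(educ ~ pexp + pexp2 + broken_home + feduc)`.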