ECON 407: Companion to IV Regression

This page has been moved to https://econ.pages.code.wm.edu/407/notes/docs/index.html and is no longer being maintained here.

This companion to our chapter on endogeneity briefly explores the endogeneity problem and shows how to estimate this class of models in R and Stata. Recall that the OLS estimator requires the regressors to be uncorrelated with the errors: \[ E(\mathbf{x'\epsilon}) = 0 \]

The code below shows how to proceed when this assumption fails but we can identify an instrument, which lets us use instrumental variables (IV) regression. We demonstrate IV regression in both R and Stata. First, let's open the data in both programs, noting that we are using a cross-sectional version of the Tobias and Koop data that focuses on 1983. In Stata, load the data and summarize:

webuse set "https://rlhick.people.wm.edu/econ407/data/"
webuse tobias_koop
keep if time==4
sum
(prefix now "https://rlhick.people.wm.edu/econ407/data")
(16,885 observations deleted)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          id |      1,034    1090.952    634.8917          4       2177
        educ |      1,034    12.27466    1.566838          9         19
     ln_wage |      1,034    2.138259    .4662805        .42       3.59
        pexp |      1,034     4.81528    2.190298          0         12
        time |      1,034           4           0          4          4
-------------+---------------------------------------------------------
     ability |      1,034    .0165957    .9209635      -3.14       1.89
       meduc |      1,034    11.40329    3.027277          0         20
       feduc |      1,034    11.58511    3.735833          0         20
 broken_home |      1,034    .1692456    .3751502          0          1
    siblings |      1,034    3.200193    2.126575          0         15
-------------+---------------------------------------------------------
       pexp2 |      1,034    27.97969    22.59879          0        144

and in R:

library(foreign)
library(sandwich)
library(lmtest)
library(boot)
library(AER)
library(car)
library(ivpack)
tk.df = read.dta("https://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)
attach(tk4.df)

OLS

If we ignore any potential endogeneity problem, we can estimate the model by OLS as described in the OLS chapter companion. Here are the results from Stata:

reg ln_wage pexp pexp2 educ broken_home

      Source |       SS           df       MS      Number of obs   =     1,034
-------------+----------------------------------   F(4, 1029)      =     51.36
       Model |  37.3778146         4  9.34445366   Prob > F        =    0.0000
    Residual |   187.21445     1,029  .181938241   R-squared       =    0.1664
-------------+----------------------------------   Adj R-squared   =    0.1632
       Total |  224.592265     1,033  .217417488   Root MSE        =    .42654

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------

The implied elasticity of wages with respect to education is obtained with:

margins, dyex(educ) continuous

Average marginal effects                        Number of obs     =      1,034
Model VCE    : OLS

Expression   : Linear prediction, predict()
dy/ex w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/ex   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   1.046691   .1140274     9.18   0.000     .8229385    1.270444
------------------------------------------------------------------------------
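
As a check on the margins output, here is the arithmetic behind the reported value. This sketch assumes that dyex(educ) evaluates \(\partial y / \partial \ln(educ) = \hat\beta_{educ} \cdot educ\) and averages it over the sample, so that in this linear model it equals the coefficient times the sample mean of educ from the summary table:

\[ \frac{\partial \ln(wage)}{\partial \ln(educ)} = \hat\beta_{educ} \cdot \overline{educ} \approx 0.0852725 \times 12.27466 \approx 1.0467 \]

The same scaling applies to the standard error: \(0.0092897 \times 12.27466 \approx 0.1140\).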

The OLS regression in R is run in a similar manner (I am suppressing the output for brevity).

ols.lm = lm(ln_wage ~ pexp + pexp2 + broken_home + educ)

The endogeneity problem

Suppose we are worried that education is endogenous, that is, correlated with the population regression errors. In that case the OLS estimates of \(\beta\) are biased and inconsistent. We hypothesize that the variable feduc (father's education) is a good instrument, having all the properties we describe in detail in the notes document.
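
To make the problem concrete, the following is a short sketch in standard textbook notation, which is an assumption here and may differ from the notation in the notes document. When \(E(\mathbf{x'\epsilon}) \neq 0\), the OLS estimator \(\mathbf{b}\) does not converge to \(\beta\):

\[ \operatorname{plim} \mathbf{b} = \beta + \left( \operatorname{plim} \tfrac{1}{n} \mathbf{X'X} \right)^{-1} \operatorname{plim} \tfrac{1}{n} \mathbf{X'\epsilon} \neq \beta \]

An instrument \(z\) (here feduc) is useful if it satisfies two conditions: relevance, meaning \(z\) is correlated with educ after controlling for the other regressors, and exogeneity, meaning

\[ E(z \epsilon) = 0 \]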

Estimation in Stata

In Stata, we use this code:

ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc) 

Instrumental variables (2SLS) regression          Number of obs   =      1,034
                                                  Wald chi2(4)    =     138.19
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.1277
                                                  Root MSE        =     .43528

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1495027   .0320009     4.67   0.000     .0867821    .2122233
        pexp |    .214752   .0246553     8.71   0.000     .1664285    .2630755
       pexp2 |  -.0117453   .0023508    -5.00   0.000    -.0163529   -.0071377
 broken_home |   .0244713   .0397189     0.62   0.538    -.0533763     .102319
       _cons |  -.4064389   .4356072    -0.93   0.351    -1.260213    .4473354
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   pexp pexp2 broken_home feduc

Note two things:

  1. The point estimate of the elasticity on education rises from about 1.05 under OLS to about 1.84, as the margins output below shows.
  2. There is no R command that I can find that will exactly replicate the above results, since ivregress does not apply the small-sample correction to the variance-covariance matrix that the R ivreg command does.

The elasticity is again computed with:

margins, dyex(educ) continuous

Average marginal effects                        Number of obs     =      1,034
Model VCE    : Unadjusted

Expression   : Linear prediction, predict()
dy/ex w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/ex   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   1.835095   .3928002     4.67   0.000     1.065221     2.60497
------------------------------------------------------------------------------

Estimation in R

The same model is estimated in R with ivreg from the AER package:

ivmodel <- ivreg(ln_wage ~ pexp + pexp2 + broken_home + educ |
                 pexp + pexp2 + broken_home + feduc)
summary(ivmodel)

Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ | 
    pexp + pexp2 + broken_home + feduc)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8472 -0.2326  0.0194  0.2541  1.6113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.406439   0.436664  -0.931    0.352    
pexp         0.214752   0.024715   8.689  < 2e-16 ***
pexp2       -0.011745   0.002357  -4.984 7.30e-07 ***
broken_home  0.024471   0.039815   0.615    0.539    
educ         0.149503   0.032079   4.661 3.57e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243 
Wald test: 34.38 on 4 and 1029 DF,  p-value: < 2.2e-16
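
To show what the two-stage procedure is doing, here is a minimal sketch in base R. It runs on simulated data, not the Tobias and Koop sample, and every variable name in it (z, w, x, y, u) is hypothetical. It runs the two stages by hand with lm and compares the result with the matrix formula for the exactly identified case, \((\mathbf{Z'X})^{-1}\mathbf{Z'y}\).

```r
# Simulated data (hypothetical names; not the Tobias and Koop sample)
set.seed(407)
n <- 1000
z <- rnorm(n)                               # instrument: shifts x, unrelated to the error
u <- rnorm(n)                               # unobserved factor that makes x endogenous
w <- rnorm(n)                               # exogenous control
x <- 0.8 * z + 0.5 * w + u + rnorm(n)       # endogenous regressor
y <- 1 + 0.5 * x + 0.3 * w + u + rnorm(n)   # true coefficient on x is 0.5

# Stage 1: regress the endogenous variable on the instrument and the control
stage1 <- lm(x ~ z + w)
x_hat <- fitted(stage1)

# Stage 2: replace x with its first-stage fitted values
stage2 <- lm(y ~ x_hat + w)

# Matrix form for the exactly identified case: (Z'X)^{-1} Z'y
X <- cbind(1, x, w)
Z <- cbind(1, z, w)
b_iv <- solve(t(Z) %*% X, t(Z) %*% y)

# OLS for comparison (inconsistent here because x is correlated with u)
b_ols <- coef(lm(y ~ x + w))

print(cbind(two_stage = coef(stage2), matrix_iv = drop(b_iv), ols = b_ols))
```

The two-stage and matrix columns agree up to numerical precision. The standard errors that lm reports for the second stage are not valid, because they are built from residuals that use x_hat rather than x. That is one reason to use ivreg or ivregress in practice.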

Note that R reports conventional (non-robust) standard errors with a small-sample correction; these match the output from this Stata command:

ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), small
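
The two sets of standard errors differ only by a degrees-of-freedom factor. As a sketch, assuming the default ivregress output divides the residual sum of squares by \(n\) while the small option and ivreg divide by \(n - k\) (here \(n = 1034\) and \(k = 5\) estimated coefficients):

\[ se_{small} = se_{default} \sqrt{\frac{n}{n-k}} = se_{default} \sqrt{\frac{1034}{1029}} \approx 1.0024 \, se_{default} \]

For educ this gives \(0.0320009 \times 1.0024 \approx 0.03208\), in line with the 0.032079 reported by R.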

Stata's ivregress output with robust standard errors (suppressed) is obtained from

ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), robust

Here is the version with robust standard errors in R:

summary(ivmodel,vcov=sandwich)

Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ | 
    pexp + pexp2 + broken_home + feduc)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8472 -0.2326  0.0194  0.2541  1.6113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.406439   0.440450  -0.923    0.356    
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465    
educ         0.149503   0.032908   4.543 6.20e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243 
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16

Inference

We have more work to do:

  1. Test for relevant and strong instruments
  2. Test for endogeneity
  3. Test for overidentification (not relevant for this example)

In Stata, we issue these commands:

estat firststage

First-stage regression summary statistics
--------------------------------------------------------------------------
             |            Adjusted      Partial
    Variable |   R-sq.       R-sq.        R-sq.     F(1,1029)   Prob > F
-------------+------------------------------------------------------------
        educ |  0.2416      0.2387       0.0878       98.9915    0.0000
--------------------------------------------------------------------------


Minimum eigenvalue statistic = 98.9915

Critical Values                      # of endogenous regressors:    1
Ho: Instruments are weak             # of excluded instruments:     1
---------------------------------------------------------------------
                                   |    5%     10%     20%     30%
2SLS relative bias                 |         (not available)
-----------------------------------+---------------------------------
                                   |   10%     15%     20%     25%
2SLS Size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
LIML Size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
---------------------------------------------------------------------
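
With a single excluded instrument, the first-stage F statistic can be recovered from the partial R-squared in the table above. A quick check, assuming the usual relationship between the two, with \(q = 1\) excluded instrument and \(n - k = 1029\) residual degrees of freedom in the first stage:

\[ F = \frac{R^2_p}{1 - R^2_p} \cdot \frac{n - k}{q} \approx \frac{0.0878}{0.9122} \times 1029 \approx 99.0 \]

This agrees with the reported 98.9915 up to rounding of the partial R-squared.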

Note that since the number of excluded instruments equals the number of endogenous variables, the model is exactly identified: there are no overidentifying restrictions to test, and Stata returns an error when asked for the test.

estat overid
no overidentifying restrictions
r(498);

In R, we do it this way:

summary(ivmodel,vcov=sandwich,diagnostics = TRUE)

Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ | 
    pexp + pexp2 + broken_home + feduc)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8472 -0.2326  0.0194  0.2541  1.6113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.406439   0.440450  -0.923    0.356    
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465    
educ         0.149503   0.032908   4.543 6.20e-06 ***

Diagnostic tests:
                  df1  df2 statistic p-value    
Weak instruments    1 1029    80.649  <2e-16 ***
Wu-Hausman          1 1028     4.376  0.0367 *  
Sargan              0   NA        NA      NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277,	Adjusted R-squared: 0.1243 
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16

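These results tell us that feduc is a relevant and strong instrument: the first-stage F statistic (98.99 in Stata, 80.65 in the R diagnostics) far exceeds Stata's 16.38 critical value for a 10% maximal size of the nominal 5% Wald test. The Wu-Hausman test (p = 0.037) rejects exogeneity at the 5% level, so education is likely endogenous.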
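
For intuition on the endogeneity test, here is a sketch of the standard regression-based form of the Wu-Hausman test. Whether the R diagnostic is computed in exactly this way is an assumption, although the reported degrees of freedom (1 and 1028) are consistent with it. Let \(\hat v_i\) be the residuals from the first-stage regression of educ on pexp, pexp2, broken_home, and feduc. The test adds these residuals to the wage equation:

\[ \ln(wage_i) = \mathbf{x}_i'\beta + \delta \hat v_i + u_i, \qquad H_0: \delta = 0 \]

Under the null hypothesis educ is exogenous and OLS is consistent. Rejecting \(\delta = 0\) is evidence that educ is correlated with the errors.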