ECON 407: Companion to IV Regression
This page has been moved to https://econ.pages.code.wm.edu/407/notes/docs/index.html and is no longer being maintained here.
This companion document to our chapter on endogeneity quickly explores the problem of endogeneity and how to estimate this class of models in R and Stata. Recall that the OLS estimator requires \[ E(\mathbf{x'\epsilon}) = 0 \]
This code shows how to overcome estimation problems where this assumption fails but where we can identify an instrument for implementing instrumental variables regression (IV regression). We demonstrate the use of R and Stata for IV regression problems. First, let's open up the data in both R and Stata, noting that we are using a cross-sectional version of Tobias and Koop that focuses on 1983. Load data and summarize:
webuse set "https://rlhick.people.wm.edu/econ407/data/"
webuse tobias_koop
keep if time==4
sum
(prefix now "https://rlhick.people.wm.edu/econ407/data")
(16,885 observations deleted)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          id |      1,034    1090.952    634.8917          4       2177
        educ |      1,034    12.27466    1.566838          9         19
     ln_wage |      1,034    2.138259    .4662805        .42       3.59
        pexp |      1,034     4.81528    2.190298          0         12
        time |      1,034           4           0          4          4
-------------+---------------------------------------------------------
     ability |      1,034    .0165957    .9209635      -3.14       1.89
       meduc |      1,034    11.40329    3.027277          0         20
       feduc |      1,034    11.58511    3.735833          0         20
 broken_home |      1,034    .1692456    .3751502          0          1
    siblings |      1,034    3.200193    2.126575          0         15
-------------+---------------------------------------------------------
       pexp2 |      1,034    27.97969    22.59879          0        144
and in R:
library(foreign)
library(sandwich)
library(lmtest)
library(boot)
library(AER)
library(car)
library(ivpack)
tk.df = read.dta("https://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)
attach(tk4.df)
OLS
If we ignore any potential endogeneity problem, we can estimate the model by OLS as described in the OLS chapter companion. Here are the results from Stata:
reg ln_wage pexp pexp2 educ broken_home
      Source |       SS           df       MS      Number of obs   =     1,034
-------------+----------------------------------   F(4, 1029)      =     51.36
       Model |  37.3778146         4  9.34445366   Prob > F        =    0.0000
    Residual |   187.21445     1,029  .181938241   R-squared       =    0.1664
-------------+----------------------------------   Adj R-squared   =    0.1632
       Total |  224.592265     1,033  .217417488   Root MSE        =    .42654

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------
where education has the following elasticity:
margins, dyex(educ) continuous
Average marginal effects                        Number of obs     =      1,034
Model VCE    : OLS

Expression   : Linear prediction, predict()
dy/ex w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/ex   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   1.046691   .1140274     9.18   0.000     .8229385    1.270444
------------------------------------------------------------------------------
Running the OLS regression in R is done in a similar manner (I am suppressing output for the sake of brevity).
ols.lm = lm(ln_wage ~ pexp + pexp2 + broken_home + educ)
The endogeneity problem
Suppose we are worried that education is endogenous. That is, it is correlated with the population regression errors. This means OLS estimates of \(\beta\) are biased and inconsistent. We hypothesize that the variable feduc is a good instrument, having all the properties we describe in detail in the notes document.
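As a reminder of what "a good instrument" means here, the following is a sketch of the standard textbook conditions, not a quotation from the notes document. The symbols \(\mathbf{z}\), \(\mathbf{Z}\), \(\mathbf{X}\), \(\mathbf{y}\) and \(\mathbf{b}_{2SLS}\) are notation assumed for this sketch: \(\mathbf{z}\) collects the exogenous regressors plus feduc, and \(\mathbf{Z}\), \(\mathbf{X}\) and \(\mathbf{y}\) stack the sample observations.
\[ E(\mathbf{z'\epsilon}) = 0 \quad \text{(exogeneity)}, \qquad E(\mathbf{z'x}) \text{ has full column rank} \quad \text{(relevance)} \]
Under these conditions the two-stage least squares estimator replaces \(\mathbf{X}\) with its projection on the instruments:
\[ \hat{\mathbf{X}} = \mathbf{Z}(\mathbf{Z'Z})^{-1}\mathbf{Z'X}, \qquad \mathbf{b}_{2SLS} = (\hat{\mathbf{X}}'\mathbf{X})^{-1}\hat{\mathbf{X}}'\mathbf{y} \]
When there are exactly as many instruments as regressors, as in this example, this simplifies to \(\mathbf{b}_{2SLS} = (\mathbf{Z'X})^{-1}\mathbf{Z'y}\).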
Estimation in Stata
In Stata, we use this code:
ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc)
Instrumental variables (2SLS) regression          Number of obs   =      1,034
                                                  Wald chi2(4)    =     138.19
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.1277
                                                  Root MSE        =     .43528

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1495027   .0320009     4.67   0.000     .0867821    .2122233
        pexp |    .214752   .0246553     8.71   0.000     .1664285    .2630755
       pexp2 |  -.0117453   .0023508    -5.00   0.000    -.0163529   -.0071377
 broken_home |   .0244713   .0397189     0.62   0.538    -.0533763     .102319
       _cons |  -.4064389   .4356072    -0.93   0.351    -1.260213    .4473354
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   pexp pexp2 broken_home feduc
Note two things:
- The point estimate of the coefficient on education, and therefore of the elasticity, has risen by roughly 75% compared to OLS (the coefficient goes from about 0.085 to about 0.150).
- There is no R command that I can find that will exactly replicate the above results: by default, ivregress does not apply a small-sample correction to the variance-covariance matrix, whereas the R ivreg command does.
margins, dyex(educ) continuous
Average marginal effects                        Number of obs     =      1,034
Model VCE    : Unadjusted

Expression   : Linear prediction, predict()
dy/ex w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/ex   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   1.835095   .3928002     4.67   0.000     1.065221     2.60497
------------------------------------------------------------------------------
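The dy/ex figures reported by margins can be checked by hand. For a linear prediction, the average of dy/d ln(educ) is the educ coefficient times the sample mean of educ. The sketch below is only an arithmetic check: the numbers are copied from the Stata output above, and the variable names are made up for illustration.

```r
# Values copied from the Stata output above (rounded as printed there)
mean_educ <- 12.27466    # sample mean of educ from `sum`
b_ols     <- 0.0852725   # educ coefficient, OLS
b_iv      <- 0.1495027   # educ coefficient, 2SLS

# Average dy/ex for a linear prediction: coefficient times mean of educ
elas_ols <- b_ols * mean_educ
elas_iv  <- b_iv * mean_educ

# Ratio of the IV elasticity to the OLS elasticity
elas_ratio <- elas_iv / elas_ols

print(round(c(ols = elas_ols, iv = elas_iv, ratio = elas_ratio), 4))
```

With the fitted models in memory, the same quantity could be obtained in R as coef(ols.lm)["educ"] * mean(tk4.df$educ).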
Estimation in R
This runs the estimation in R.
ivmodel <- ivreg(ln_wage ~ pexp + pexp2 + broken_home + educ |
pexp + pexp2 + broken_home + feduc)
summary(ivmodel)
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ |
    pexp + pexp2 + broken_home + feduc)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8472 -0.2326  0.0194  0.2541  1.6113

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406439   0.436664  -0.931    0.352
pexp         0.214752   0.024715   8.689  < 2e-16 ***
pexp2       -0.011745   0.002357  -4.984 7.30e-07 ***
broken_home  0.024471   0.039815   0.615    0.539
educ         0.149503   0.032079   4.661 3.57e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277, Adjusted R-squared: 0.1243
Wald test: 34.38 on 4 and 1029 DF,  p-value: < 2.2e-16
As an FYI, the summary(ivmodel) output above reports non-robust standard errors with a small-sample degrees-of-freedom adjustment, so it matches the output from this Stata command:
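To see what an IV estimator of this kind computes, here is a minimal sketch on simulated data. It does not use the Tobias and Koop sample. The variables z, u, x and y and the data-generating process are hypothetical, chosen so that x is endogenous and z is a valid instrument. It uses base R only, and shows that running the two stages by hand reproduces the matrix formula for the point estimates.

```r
set.seed(407)
n <- 1000
z <- rnorm(n)                      # instrument (hypothetical)
u <- rnorm(n)                      # unobserved factor driving both x and y
x <- 0.8 * z + u + rnorm(n)        # endogenous regressor
y <- 1 + 0.5 * x - u + rnorm(n)    # true slope on x is 0.5

# OLS ignores the correlation between x and the error term (-u + noise)
b_ols <- coef(lm(y ~ x))

# Two stages by hand: project x on z, then regress y on the fitted values
x_hat   <- fitted(lm(x ~ z))
b_2step <- coef(lm(y ~ x_hat))

# Matrix form for the exactly identified case: (Z'X)^{-1} Z'y
X <- cbind(1, x)
Z <- cbind(1, z)
b_iv <- as.vector(solve(crossprod(Z, X), crossprod(Z, y)))

print(rbind(ols = unname(b_ols), two_step = unname(b_2step), matrix_iv = b_iv))
```

The two-step point estimates agree with the matrix formula, but the standard errors printed by the second-stage lm would be wrong, because they are based on residuals computed with x_hat rather than x. That is one reason to use ivreg or ivregress rather than running the stages manually.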
ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), small
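The size of the small-sample adjustment can be checked with a little arithmetic. The sketch below assumes the only difference between the two sets of standard errors is the divisor used for the error variance, N for the default ivregress output and N - K with the small option and in ivreg. The inputs are copied from the Stata output shown earlier, and the variable names are illustrative.

```r
n <- 1034                # observations
k <- 5                   # estimated coefficients, including the constant

se_educ_default  <- 0.0320009   # educ std. err. from default ivregress 2sls
rmse_default     <- 0.43528     # Root MSE from default ivregress 2sls

# Rescale from a divisor of N to a divisor of N - K
adj <- sqrt(n / (n - k))
se_educ_small <- se_educ_default * adj
rmse_small    <- rmse_default * adj

print(round(c(adjustment = adj, se_educ = se_educ_small, rmse = rmse_small), 6))
```

The rescaled values agree, up to rounding, with the 0.032079 standard error on educ and the 0.4363 residual standard error in the summary(ivmodel) output.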
Stata's ivregress output for robust regression (suppressed) is obtained from
ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), robust
Here is the robust version of the model in R:
summary(ivmodel,vcov=sandwich)
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ |
    pexp + pexp2 + broken_home + feduc)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8472 -0.2326  0.0194  0.2541  1.6113

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406439   0.440450  -0.923    0.356
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465
educ         0.149503   0.032908   4.543 6.20e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277, Adjusted R-squared: 0.1243
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16
Inference
We have more work to do:
- Test for relevant and strong instruments
- Test for endogeneity
- Test for overidentification (not relevant for this example)
In Stata, we issue these commands:
estat firststage
  First-stage regression summary statistics
  --------------------------------------------------------------------------
               |            Adjusted      Partial
      Variable |   R-sq.       R-sq.        R-sq.     F(1,1029)   Prob > F
  -------------+------------------------------------------------------------
          educ |  0.2416      0.2387       0.0878       98.9915    0.0000
  --------------------------------------------------------------------------

  Minimum eigenvalue statistic = 98.9915

  Critical Values                      # of endogenous regressors:    1
  Ho: Instruments are weak             # of excluded instruments:     1
  ---------------------------------------------------------------------
                                     |    5%     10%     20%     30%
  2SLS relative bias                 |         (not available)
  -----------------------------------+---------------------------------
                                     |    10%     15%     20%     25%
  2SLS Size of nominal 5% Wald test  |   16.38    8.96    6.66    5.53
  LIML Size of nominal 5% Wald test  |   16.38    8.96    6.66    5.53
  ---------------------------------------------------------------------
Note that the number of excluded instruments (one, feduc) equals the number of endogenous regressors (one, educ), so the model is exactly identified. There are no overidentifying restrictions to test, and Stata reports as much:
estat overid
no overidentifying restrictions
r(498);
In R, we do it this way:
summary(ivmodel,vcov=sandwich,diagnostics = TRUE)
Call:
ivreg(formula = ln_wage ~ pexp + pexp2 + broken_home + educ |
    pexp + pexp2 + broken_home + feduc)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8472 -0.2326  0.0194  0.2541  1.6113

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406439   0.440450  -0.923    0.356
pexp         0.214752   0.023863   8.999  < 2e-16 ***
pexp2       -0.011745   0.002359  -4.978 7.53e-07 ***
broken_home  0.024471   0.033503   0.730    0.465
educ         0.149503   0.032908   4.543 6.20e-06 ***

Diagnostic tests:
                  df1  df2 statistic p-value
Weak instruments    1 1029    80.649  <2e-16 ***
Wu-Hausman          1 1028     4.376  0.0367 *
Sargan              0   NA        NA      NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4363 on 1029 degrees of freedom
Multiple R-Squared: 0.1277, Adjusted R-squared: 0.1243
Wald test: 37.63 on 4 and 1029 DF,  p-value: < 2.2e-16
These results tell us we have a relevant and strong instrument: the first-stage F statistic (98.99 in Stata, 80.65 in R when computed with the sandwich covariance) is far above the largest critical value of 16.38 reported by Stata. The Wu-Hausman test rejects the null that education is exogenous at the 5% level (p = 0.0367), so education is likely endogenous. The Sargan row is empty because the model is exactly identified.
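For intuition about the first two diagnostics, the sketch below computes one common regression-based form of each on simulated data, using base R only. The variables and data-generating process are hypothetical, and this is not claimed to be exactly what estat firststage or the ivreg diagnostics do internally, particularly when a robust covariance is supplied.

```r
set.seed(407)
n <- 1000
z <- rnorm(n)                      # instrument (hypothetical)
u <- rnorm(n)                      # unobserved factor making x endogenous
x <- 0.8 * z + u + rnorm(n)
y <- 1 + 0.5 * x - u + rnorm(n)

# Instrument strength: F statistic from the first-stage regression.
# With a single instrument and no other regressors, the overall F is the
# F test on the excluded instrument.
first   <- lm(x ~ z)
f_first <- summary(first)$fstatistic[["value"]]

# Endogeneity: add the first-stage residuals to the structural equation
# and test whether their coefficient is zero.
v_hat <- resid(first)
aug   <- lm(y ~ x + v_hat)
p_endog <- summary(aug)$coefficients["v_hat", "Pr(>|t|)"]

print(c(first_stage_F = f_first, endogeneity_p = p_endog))
```

A large first-stage F points to a strong instrument, and a small p-value on the residual term points to endogeneity of x, which is how the Weak instruments and Wu-Hausman rows above are read.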