ECON 407: Companion OLS Chapter

This companion to the OLS Chapter shows how to implement the models in both Stata and R. Unless stated otherwise, the code shown here is for R.

In [2]:
# allows us to open stata datasets
library(foreign)
# for heteroskedasticity corrected variance/covariance matrices
library(sandwich)
# for White Heteroskedasticity Test
library(lmtest)
# for bootstrapping
library(boot)


In [3]:
tk.df = read.dta("http://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)


Focusing on our dependent variable:

In [4]:
hist(tk4.df$ln_wage, main = "Log Wage")

The OLS Model¶

In Stata, we run OLS simply by

. regress ln_wage educ pexp pexp2 broken_home

      Source |       SS       df       MS              Number of obs =    1034
-------------+------------------------------           F(  4,  1029) =   51.36
       Model |  37.3778146     4  9.34445366           Prob > F      =  0.0000
    Residual |   187.21445  1029  .181938241           R-squared     =  0.1664
-------------+------------------------------           Adj R-squared =  0.1632
       Total |  224.592265  1033  .217417488           Root MSE      =  .42654

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------

and in R, we use

In [5]:
attach(tk4.df)
ols.lm = lm(ln_wage~educ + pexp + pexp2 + broken_home)
summary(ols.lm)

Out[5]:
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home)

Residuals:
     Min       1Q   Median       3Q      Max
-1.69490 -0.24818 -0.00079  0.25417  1.51139

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.460333   0.137294   3.353 0.000829 ***
educ         0.085273   0.009290   9.179  < 2e-16 ***
pexp         0.203521   0.023586   8.629  < 2e-16 ***
pexp2       -0.012413   0.002283  -5.438 6.73e-08 ***
broken_home -0.008725   0.035711  -0.244 0.807020
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared:  0.1664,	Adjusted R-squared:  0.1632
F-statistic: 51.36 on 4 and 1029 DF,  p-value: < 2.2e-16

Robust Standard Errors¶

Note that since the $\mathbf{b}$'s are unbiased estimators for $\beta$, those continue to be the ones to report.
However, our variance-covariance matrix is
$$VARCOV_{HC1} = \frac{N}{N-K}\mathbf{(x'x)^{-1}x'Vx(x'x)^{-1}}$$
Defining $\mathbf{e=y-xb}$,
$$\mathbf{V} = diag(\mathbf{ee}')$$

In Stata, the robust option is a very easy way to implement the robust standard error correction. Note: this is not robust regression.

. regress ln_wage educ pexp pexp2 broken_home, robust

Linear regression                                      Number of obs =    1034
                                                       F(  4,  1029) =   64.82
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1664
                                                       Root MSE      =  .42654

------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0099294     8.59   0.000     .0657882    .1047568
        pexp |   .2035214   .0226835     8.97   0.000     .1590101    .2480326
       pexp2 |  -.0124126   .0022447    -5.53   0.000    -.0168173   -.0080079
 broken_home |  -.0087254   .0312381    -0.28   0.780     -.070023    .0525723
       _cons |   .4603326   .1315416     3.50   0.000     .2022121     .718453
------------------------------------------------------------------------------

Implementing this in R, we can use the sandwich package's vcovHC function with the HC1 option (the Stata default):

In [6]:
VARCOV = vcovHC(ols.lm, 'HC1')
se = sqrt(diag(VARCOV))
print(se)

(Intercept)        educ        pexp       pexp2 broken_home
0.131541623 0.009929445 0.022683530 0.002244678 0.031238081

Note that the F-statistic reported by OLS is inappropriate in the presence of heteroskedasticity. We need to use a different form of the F test. Stata does this automatically. In R, we need to use waldtest for the correct model F statistic:

In [7]:
waldtest(ols.lm, vcov = vcovHC(ols.lm,'HC1'))

Out[7]:
  Res.Df Df        F       Pr(>F)
1   1029 NA       NA           NA
2   1033 -4 64.81968 6.411661e-49

We can put this together to make a more useful summary output:

In [8]:
# copy summary object
robust.summ = summary(ols.lm)
# overwrite all coefficients with robust counterparts
robust.summ$coefficients = unclass(coeftest(ols.lm, vcov. = VARCOV))
# overwrite f stat with correct robust value
f_robust = waldtest(ols.lm, vcov = VARCOV)
robust.summ$fstatistic[1] = f_robust$F[2]
print(robust.summ)

Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home)

Residuals:
Min       1Q   Median       3Q      Max
-1.69490 -0.24818 -0.00079  0.25417  1.51139

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.460333   0.131542   3.500 0.000486 ***
educ         0.085273   0.009929   8.588  < 2e-16 ***
pexp         0.203521   0.022684   8.972  < 2e-16 ***
pexp2       -0.012413   0.002245  -5.530 4.06e-08 ***
broken_home -0.008725   0.031238  -0.279 0.780057
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared:  0.1664,	Adjusted R-squared:  0.1632
F-statistic: 64.82 on 4 and 1029 DF,  p-value: < 2.2e-16
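To see what vcovHC is doing under the hood, the HC1 sandwich can be computed by hand from the formula above. A minimal sketch on simulated data (the data and variable names here are hypothetical, not the Tobias-Koop sample):

```r
# Manual HC1 sandwich estimator (sketch on simulated, heteroskedastic data)
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = abs(x) + 0.5)  # error variance depends on x
fit <- lm(y ~ x)

X <- model.matrix(fit)        # design matrix, including the intercept
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))  # (x'x)^{-1}
meat  <- crossprod(X * e)     # x' diag(ee') x
vc_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

sqrt(diag(vc_hc1))            # HC1 robust standard errors
```

These should match `sqrt(diag(vcovHC(fit, 'HC1')))` from the sandwich package for the same fitted model.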



The Bootstrapping Approach¶

Given the advent of very fast computation even on laptop computers, we can use bootstrapping to correct our standard errors for any non-iid error process. In R, we run this:

In [9]:
bs_ols <- function(formula, data, indices) {
  d <- data[indices, ]           # allows boot to select sample
  fit <- lm(formula, data = d)   # runs the model for this sample
  return(coef(fit))              # returns coefficients
}

results <- boot(data = tk4.df, statistic = bs_ols,
                R = 1000, formula = ln_wage ~ educ + pexp + pexp2 + broken_home)

boot.ci(results,conf=0.95,type='perc',index=1)
boot.ci(results,conf=0.95,type='perc',index=2)
boot.ci(results,conf=0.95,type='perc',index=3)
boot.ci(results,conf=0.95,type='perc',index=4)
boot.ci(results,conf=0.95,type='perc',index=5)

colnames(results$t) = c("Intercept","educ","pexp","pexp2","broken_home")
summary(results$t)

Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 1)

Intervals :
Level     Percentile
95%   ( 0.1897,  0.7214 )
Calculations and Intervals on Original Scale
Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 2)

Intervals :
Level     Percentile
95%   ( 0.0657,  0.1056 )
Calculations and Intervals on Original Scale
Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 3)

Intervals :
Level     Percentile
95%   ( 0.1561,  0.2488 )
Calculations and Intervals on Original Scale
Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 4)

Intervals :
Level     Percentile
95%   (-0.0168, -0.0078 )
Calculations and Intervals on Original Scale
Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 5)

Intervals :
Level     Percentile
95%   (-0.0697,  0.0572 )
Calculations and Intervals on Original Scale
Out[9]:
   Intercept            educ              pexp            pexp2
Min.   :0.08783   Min.   :0.05473   Min.   :0.1191   Min.   :-0.018849
1st Qu.:0.37772   1st Qu.:0.07781   1st Qu.:0.1879   1st Qu.:-0.013870
Median :0.46628   Median :0.08508   Median :0.2017   Median :-0.012285
Mean   :0.46646   Mean   :0.08488   Mean   :0.2025   Mean   :-0.012304
3rd Qu.:0.55972   3rd Qu.:0.09148   3rd Qu.:0.2175   3rd Qu.:-0.010902
Max.   :0.84774   Max.   :0.11534   Max.   :0.2669   Max.   :-0.003668
broken_home
Min.   :-0.109900
1st Qu.:-0.029338
Median :-0.008493
Mean   :-0.008627
3rd Qu.: 0.012057
Max.   : 0.103714  

In Stata, we just run

regress ln_wage educ pexp pexp2 broken_home, vce(bootstrap)
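The mechanics behind both commands are easy to write out by hand. A minimal pairs (case) bootstrap sketch, here on simulated data rather than the Tobias-Koop sample, and without the boot package:

```r
# Pairs bootstrap for OLS coefficients, written out by hand (simulated data)
set.seed(1)
n <- 150
sim.df <- data.frame(x = runif(n))
sim.df$y <- 1 + 0.5 * sim.df$x + rnorm(n)

B <- 500
reps <- matrix(NA_real_, nrow = B, ncol = 2)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)                  # resample rows with replacement
  reps[b, ] <- coef(lm(y ~ x, data = sim.df[idx, ]))
}

apply(reps, 2, sd)                                  # bootstrap standard errors
apply(reps, 2, quantile, probs = c(0.025, 0.975))   # percentile 95% CIs
```

Each bootstrap replicate resamples whole observations (rows), so the resampling scheme makes no assumption about the error process, which is why it remains valid under heteroskedasticity.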

Testing for Heteroskedasticity¶

The two dominant heteroskedasticity tests are the White test and the Cook-Weisberg test (both variants of the Breusch-Pagan test). The White test is particularly useful if you believe the error variances have a potentially non-linear relationship with model covariates. In Stata, use

. imtest, white

White's test for Ho: homoskedasticity
against Ha: unrestricted heteroskedasticity

chi2(12)     =     29.20
Prob > chi2  =    0.0037

Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
Source |       chi2     df      p
---------------------+-----------------------------
Heteroskedasticity |      29.20     12    0.0037
Skewness |      14.14      4    0.0068
Kurtosis |      16.19      1    0.0001
---------------------+-----------------------------
Total |      59.53     17    0.0000
---------------------------------------------------

In R, we perform the White test with:

In [10]:
bptest(ols.lm,ln_wage ~ educ*pexp + educ*pexp2 + educ*broken_home + pexp*pexp2 + pexp*broken_home +
pexp2*broken_home + I(educ^2) + I(pexp^2) + I(pexp2^2) + I(broken_home^2))

Out[10]:
	studentized Breusch-Pagan test

data:  ols.lm
BP = 29.196, df = 14, p-value = 0.009831


Notice that in R, we need to specify the form of the heteroskedasticity. To mirror the White test, we include all interaction and second-order terms. Both Stata's imtest and R's bptest allow you to specify more esoteric forms of heteroskedasticity, although this is more apparent in R, since we must write out the form of the heteroskedasticity for the test to run.
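The White statistic itself is just $N R^2$ from an auxiliary regression of the squared residuals on the levels, squares, and cross products of the regressors. A hand-rolled sketch on simulated data (the data and variable names are hypothetical):

```r
# White test by hand: N * R^2 from the auxiliary regression
set.seed(7)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + x1 + x2 + rnorm(n, sd = exp(0.5 * x1))  # heteroskedastic in x1
fit <- lm(y ~ x1 + x2)

e2 <- residuals(fit)^2
aux <- lm(e2 ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))
white_stat <- n * summary(aux)$r.squared
df <- 5                                          # auxiliary regressors, excluding intercept
pval <- pchisq(white_stat, df, lower.tail = FALSE)
c(statistic = white_stat, p.value = pval)
```

With a dummy regressor like broken_home, its square duplicates the level and is dropped, which is one reason the degrees of freedom can differ between implementations.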

A more basic test, one assuming that heteroskedasticity is a linear function of the $\mathbf{x}$'s, is the Cook-Weisberg Breusch-Pagan test.

In Stata, we simply use this command:

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of ln_wage

chi2(1)      =    19.57
Prob > chi2  =   0.0000

R doesn't implement this exact test directly, but it can be performed manually following these steps:

1. Recover the residuals as above; denote them $\mathbf{e}$
2. Scale the squared residuals by the mean squared residual: $$\mathbf{r} = \frac{diagonal(\mathbf{ee}')}{\mathbf{e'e}/N}$$
3. Run the regression $\mathbf{r} = \gamma_0 + \gamma_1 \mathbf{xb} + \mathbf{v}$. Recall that $\mathbf{xb} = \mathbf{\hat{y}}$.
4. Recover ESS defined as $$ESS = (\mathbf{\hat{r}} - \bar{r})'(\hat{\mathbf{r}} - \bar{r})$$
5. ESS/2 is the test statistic distributed $\chi^2_{(1)}$
In [11]:
r = diag(tcrossprod(ols.lm$residuals))/mean(diag(tcrossprod(ols.lm$residuals)))
xb = ols.lm$fitted.values
het_ols_results = lm(r~xb)
mss = crossprod(het_ols_results$fitted.values - mean(r))
cw = mss/2
sprintf("BP / Cook-Weisberg score for heterosk.: %2.3f distributed chi2 with %d df.",cw,1)

Out[11]:
'BP / Cook-Weisberg score for heterosk.: 19.572 distributed chi2 with 1 df.'