ECON 407: Companion OLS Chapter

This companion to the OLS Chapter shows how to implement the models in both stata and R. Unless stated otherwise, the code shown here is for R.

In [2]:
# allows us to open stata datasets
library(foreign)
# for heteroskedasticity corrected variance/covariance matrices
library(sandwich)
# for White Heteroskedasticity Test
library(lmtest)
# for bootstrapping
library(boot)

Load data and summarize:

In [3]:
tk.df = read.dta("http://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)

Focusing on our dependent variable:

In [4]:
hist(tk4.df$ln_wage,main = "Log Wage")

The OLS Model

In stata, we run OLS simply by

. regress ln_wage educ pexp pexp2 broken_home

      Source |       SS       df       MS              Number of obs =    1034
-------------+------------------------------           F(  4,  1029) =   51.36
       Model |  37.3778146     4  9.34445366           Prob > F      =  0.0000
    Residual |   187.21445  1029  .181938241           R-squared     =  0.1664
-------------+------------------------------           Adj R-squared =  0.1632
       Total |  224.592265  1033  .217417488           Root MSE      =  .42654

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------

and in R, we use

In [5]:
attach(tk4.df)
ols.lm = lm(ln_wage~educ + pexp + pexp2 + broken_home)
summary(ols.lm)
Out[5]:
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69490 -0.24818 -0.00079  0.25417  1.51139 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.460333   0.137294   3.353 0.000829 ***
educ         0.085273   0.009290   9.179  < 2e-16 ***
pexp         0.203521   0.023586   8.629  < 2e-16 ***
pexp2       -0.012413   0.002283  -5.438 6.73e-08 ***
broken_home -0.008725   0.035711  -0.244 0.807020    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared:  0.1664,	Adjusted R-squared:  0.1632 
F-statistic: 51.36 on 4 and 1029 DF,  p-value: < 2.2e-16

Robust Standard Errors

Note that since the $\mathbf{b}$'s are unbiased estimators for $\beta$, those continue to be the ones to report. However, our variance covariance matrix is $$ VARCOV_{HC1} = \mathbf{(x'x)^{-1}V(x'x)^{-1}} $$

Defining $\mathbf{e=y-xb}$, $$ \mathbf{V} = diag(\mathbf{ee}') $$

In stata, the robust option is a very easy way to implement the robust standard error correction. Note: this is not robust regression.

. regress ln_wage educ pexp pexp2 broken_home, robust

Linear regression                                      Number of obs =    1034
                                                       F(  4,  1029) =   64.82
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1664
                                                       Root MSE      =  .42654

------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0099294     8.59   0.000     .0657882    .1047568
        pexp |   .2035214   .0226835     8.97   0.000     .1590101    .2480326
       pexp2 |  -.0124126   .0022447    -5.53   0.000    -.0168173   -.0080079
 broken_home |  -.0087254   .0312381    -0.28   0.780     -.070023    .0525723
       _cons |   .4603326   .1315416     3.50   0.000     .2022121     .718453
------------------------------------------------------------------------------

Implementing this in R, we can use the sandwich package's vcovHC function with the HC1 option (stata default):

In [6]:
VARCOV = vcovHC(ols.lm, 'HC1')
se = sqrt(diag(VARCOV))
print(se)
(Intercept)        educ        pexp       pexp2 broken_home 
0.131541623 0.009929445 0.022683530 0.002244678 0.031238081 

Note that the F-statistic reported by OLS is inappropriate in the presence of heteroskedasticity. We need to use a different form of the F test. Stata does this automatically. In R, we need to use waldtest for the correct model F statistic:

In [7]:
waldtest(ols.lm, vcov = vcovHC(ols.lm,'HC1'))
Out[7]:
Res.Df Df F Pr(>F)
1 1029 NA NA NA
2 1033 -4 64.81968 6.411661e-49

We can put this together to make a more useful summary output:

In [8]:
# copy summary object 
robust.summ = summary(ols.lm)
# overwrite all coefficients with robust counterparts
robust.summ$coefficients = unclass(coeftest(ols.lm, vcov. = VARCOV))
# overwrite f stat with correct robust value
f_robust = waldtest(ols.lm, vcov = VARCOV)
robust.summ$fstatistic[1] = f_robust$F[2]
print(robust.summ)
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69490 -0.24818 -0.00079  0.25417  1.51139 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.460333   0.131542   3.500 0.000486 ***
educ         0.085273   0.009929   8.588  < 2e-16 ***
pexp         0.203521   0.022684   8.972  < 2e-16 ***
pexp2       -0.012413   0.002245  -5.530 4.06e-08 ***
broken_home -0.008725   0.031238  -0.279 0.780057    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared:  0.1664,	Adjusted R-squared:  0.1632 
F-statistic: 64.82 on 4 and 1029 DF,  p-value: < 2.2e-16

The Bootstrapping Approach

Given the advent of very fast computation even with laptop computers, we can use boostrapping to correct our standard errrors for any non-iid error process. In R, we run this:

In [9]:
bs_ols <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample 
  fit <- lm(formula, data=d) # runs the model for this sample
  return(coef(fit)) # returns coefficients
} 

results <- boot(data=tk4.df, statistic=bs_ols, 
  	R=1000, formula=ln_wage~educ + pexp + pexp2 + broken_home)

boot.ci(results,conf=0.95,type='perc',index=1)
boot.ci(results,conf=0.95,type='perc',index=2)
boot.ci(results,conf=0.95,type='perc',index=3)
boot.ci(results,conf=0.95,type='perc',index=4)
boot.ci(results,conf=0.95,type='perc',index=5)

colnames(results$t) = c("Intercept","educ","pexp","pexp2","broken_home")
summary(results$t)
Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 1)

Intervals : 
Level     Percentile     
95%   ( 0.1897,  0.7214 )  
Calculations and Intervals on Original Scale
Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 2)

Intervals : 
Level     Percentile     
95%   ( 0.0657,  0.1056 )  
Calculations and Intervals on Original Scale
Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 3)

Intervals : 
Level     Percentile     
95%   ( 0.1561,  0.2488 )  
Calculations and Intervals on Original Scale
Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 4)

Intervals : 
Level     Percentile     
95%   (-0.0168, -0.0078 )  
Calculations and Intervals on Original Scale
Out[9]:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, conf = 0.95, type = "perc", index = 5)

Intervals : 
Level     Percentile     
95%   (-0.0697,  0.0572 )  
Calculations and Intervals on Original Scale
Out[9]:
   Intercept            educ              pexp            pexp2          
 Min.   :0.08783   Min.   :0.05473   Min.   :0.1191   Min.   :-0.018849  
 1st Qu.:0.37772   1st Qu.:0.07781   1st Qu.:0.1879   1st Qu.:-0.013870  
 Median :0.46628   Median :0.08508   Median :0.2017   Median :-0.012285  
 Mean   :0.46646   Mean   :0.08488   Mean   :0.2025   Mean   :-0.012304  
 3rd Qu.:0.55972   3rd Qu.:0.09148   3rd Qu.:0.2175   3rd Qu.:-0.010902  
 Max.   :0.84774   Max.   :0.11534   Max.   :0.2669   Max.   :-0.003668  
  broken_home       
 Min.   :-0.109900  
 1st Qu.:-0.029338  
 Median :-0.008493  
 Mean   :-0.008627  
 3rd Qu.: 0.012057  
 Max.   : 0.103714  

In stata, we just run

regress ln_wage educ pexp pexp2 broken_home, vce(bootstrap)

Testing for Heteroskedasticity

The two dominant heteroskedasticity tests are the White test and the Cook-Weisberg (both Breusch Pagan tests). The White test is particularly useful if you believe the error variances have a potentially non-linear relationship with model covariates. For stata, use

. imtest, white

White's test for Ho: homoskedasticity
         against Ha: unrestricted heteroskedasticity

         chi2(12)     =     29.20
         Prob > chi2  =    0.0037

Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      29.20     12    0.0037
            Skewness |      14.14      4    0.0068
            Kurtosis |      16.19      1    0.0001
---------------------+-----------------------------
               Total |      59.53     17    0.0000
---------------------------------------------------

In R we perform the White test by:

In [10]:
bptest(ols.lm,ln_wage ~ educ*pexp + educ*pexp2 + educ*broken_home + pexp*pexp2 + pexp*broken_home + 
         pexp2*broken_home + I(educ^2) + I(pexp^2) + I(pexp2^2) + I(broken_home^2))
Out[10]:
	studentized Breusch-Pagan test

data:  ols.lm
BP = 29.196, df = 14, p-value = 0.009831

Notice that in R, we need to specify the form of the heteroskedasticity. To mirror the White test, we include interaction and second order terms. Both stata's imtest and R's bptest allow you to specify more esoteric forms of heteroskedasticity, although in R it is more apparent since we have to specify the form of the heteroskedasticity for the test to run.

A more basic test- one assuming that heteroskedastity is a linear function of the $\mathbf{x}$'s- is the Cook-Weisberg Breusch Pagan test.

In stata, we simply use this command:

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity 
         Ho: Constant variance
         Variables: fitted values of ln_wage

         chi2(1)      =    19.57
         Prob > chi2  =   0.0000

R doesn't implement this test directly. But it can be performed manually following these steps:

  1. Recover residuals as above, denote as $\mathbf{e}$
  2. square the residuals by the sqare root of mean squared residuals: $$ \mathbf{r} = \frac{diagonal(\mathbf{ee}')}{\sqrt{\mathbf{e'e}/N}} $$
  3. Run the regression $\mathbf{r} = \gamma_0 + \gamma_1 \mathbf{xb} + \mathbf{v}$. Recall that $\mathbf{xb} = \mathbf{\hat{y}}$.
  4. Recover ESS defined as $$ ESS = (\mathbf{\hat{r}} - \bar{r})'(\hat{\mathbf{r}} - \bar{r}) $$
  5. ESS/2 is the test statistic distributed $\chi^2_{(1)}$
In [11]:
r = diag(tcrossprod(ols.lm$residuals))/mean(diag(tcrossprod(ols.lm$residuals)))
xb = ols.lm$fitted.values
het_ols_results = lm(r~xb)
mss = crossprod(het_ols_results$fitted.values - mean(r))
cw = mss/2
sprintf("BP / Cook-Weisberg score for heterosk.: %2.3f distributed chi2 with %d df.",cw,1)
Out[11]:
'BP / Cook-Weisberg score for heterosk.: 19.572 distributed chi2 with 1 df.'