Companion for OLS Chapter
This companion to the OLS Chapter shows how to implement the models in both Stata and R. Unless stated otherwise, the code shown here is for R. Note that we use a panel dataset (multiple time periods per individual) and arbitrarily restrict our analysis to a cross section by keeping only records where time is 4. We do this for demonstration purposes only; it is usually a horrible idea to throw away data like this, and later in the course we will learn more about panel data techniques.
# allows us to open stata datasets
library(foreign)
# for heteroskedasticity corrected variance/covariance matrices
library(sandwich)
# for White Heteroskedasticity Test
library(lmtest)
# for bootstrapping
library(boot)
# for column percentiles
library(matrixStats)
tk.df = read.dta("https://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)
histwage <- hist(tk4.df$ln_wage,main = "Log Wage",xlab="ln_wage")
The OLS Model
Using ln_wage as our dependent variable, here is the code for running a simple OLS model in Stata and R. First, Stata. Since we haven't loaded the data yet, let's do that and run the model:
webuse set "https://rlhick.people.wm.edu/econ407/data/"
webuse tobias_koop
keep if time==4
regress ln_wage educ pexp pexp2 broken_home
set more off
(prefix now "https://rlhick.people.wm.edu/econ407/data")
(16,885 observations deleted)

      Source |       SS           df       MS      Number of obs   =     1,034
-------------+----------------------------------   F(4, 1029)      =     51.36
       Model |  37.3778146         4  9.34445366   Prob > F        =    0.0000
    Residual |   187.21445     1,029  .181938241   R-squared       =    0.1664
-------------+----------------------------------   Adj R-squared   =    0.1632
       Total |  224.592265     1,033  .217417488   Root MSE        =    .42654

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------
In R, we use:
ols.lm = lm(ln_wage~educ + pexp + pexp2 + broken_home, data=tk4.df)
summary(ols.lm)
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk4.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69490 -0.24818 -0.00079  0.25417  1.51139 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.460333   0.137294   3.353 0.000829 ***
educ         0.085273   0.009290   9.179  < 2e-16 ***
pexp         0.203521   0.023586   8.629  < 2e-16 ***
pexp2       -0.012413   0.002283  -5.438 6.73e-08 ***
broken_home -0.008725   0.035711  -0.244 0.807020    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared:  0.1664,	Adjusted R-squared:  0.1632 
F-statistic: 51.36 on 4 and 1029 DF,  p-value: < 2.2e-16
Robust Standard Errors
Note that since the \(\mathbf{b}\)'s are unbiased estimators for \(\beta\), those continue to be the ones to report. However, our variance covariance matrix becomes the sandwich estimator \[ VARCOV_{HC1} = \frac{N}{N-K}\mathbf{(x'x)^{-1}x'Vx(x'x)^{-1}} \]
where, defining \(\mathbf{e=y-xb}\), \[ \mathbf{V} = diag(\mathbf{ee}') \] and \(N/(N-K)\) is the HC1 degrees-of-freedom correction (\(N\) observations, \(K\) parameters).
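To see what this formula is doing, here is a minimal sketch computing the HC1 matrix by hand (the object names X, e, bread, and meat are illustrative); it should match the vcovHC results shown below:
X = model.matrix(ols.lm)   # regressor matrix, including the intercept
e = residuals(ols.lm)      # OLS residuals
N = nrow(X); K = ncol(X)
bread = solve(crossprod(X))   # (x'x)^{-1}
meat = crossprod(X * e)       # x' diag(ee') x
VARCOV_manual = (N/(N - K)) * bread %*% meat %*% bread
sqrt(diag(VARCOV_manual))     # robust standard errors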
In Stata, the robust option is a very easy way to implement the robust standard error correction. Note: this is not robust regression.
regress ln_wage educ pexp pexp2 broken_home, robust
Linear regression                               Number of obs     =      1,034
                                                F(4, 1029)        =      64.82
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1664
                                                Root MSE          =     .42654

------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0099294     8.59   0.000     .0657882    .1047568
        pexp |   .2035214   .0226835     8.97   0.000     .1590101    .2480326
       pexp2 |  -.0124126   .0022447    -5.53   0.000    -.0168173   -.0080079
 broken_home |  -.0087254   .0312381    -0.28   0.780     -.070023    .0525723
       _cons |   .4603326   .1315416     3.50   0.000     .2022121     .718453
------------------------------------------------------------------------------
Implementing this in R, we can use the sandwich package's `vcovHC` function with the `HC1` option (the Stata default):
VARCOV = vcovHC(ols.lm, 'HC1')
se = sqrt(diag(VARCOV))
print(se)
(Intercept)        educ        pexp       pexp2 broken_home 
0.131541623 0.009929445 0.022683530 0.002244678 0.031238081 
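If all we need is a coefficient table with these standard errors, lmtest's coeftest (which we also use below) accepts the robust variance matrix directly:
# coefficient table using the HC1 variance matrix
coeftest(ols.lm, vcov. = VARCOV)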
Note that the F-statistic reported by the R OLS results above is inappropriate in the presence of heteroskedasticity; we need to use a robust form of the F test. Stata does this automatically. In R, we need to use waldtest for the correct model F-statistic:
waldtest(ols.lm, vcov = vcovHC(ols.lm,'HC1'))
Wald test

Model 1: ln_wage ~ educ + pexp + pexp2 + broken_home
Model 2: ln_wage ~ 1
  Res.Df Df     F    Pr(>F)    
1   1029                       
2   1033 -4 64.82 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So far, we have gotten R to provide the correct robust standard error results, but not as a single regression table with the robust F-statistic and fit measures. To put it together for printing, use:
# copy summary object
robust.summ = summary(ols.lm)
# overwrite all coefficients with robust counterparts
robust.summ$coefficients = unclass(coeftest(ols.lm, vcov. = VARCOV))
# overwrite f stat with correct robust value
f_robust = waldtest(ols.lm, vcov = VARCOV)
robust.summ$fstatistic[1] = f_robust$F[2]
print(robust.summ)
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk4.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69490 -0.24818 -0.00079  0.25417  1.51139 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.460333   0.131542   3.500 0.000486 ***
educ         0.085273   0.009929   8.588  < 2e-16 ***
pexp         0.203521   0.022684   8.972  < 2e-16 ***
pexp2       -0.012413   0.002245  -5.530 4.06e-08 ***
broken_home -0.008725   0.031238  -0.279 0.780057    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4265 on 1029 degrees of freedom
Multiple R-squared:  0.1664,	Adjusted R-squared:  0.1632 
F-statistic: 64.82 on 4 and 1029 DF,  p-value: < 2.2e-16
The bootstrapping approach to robust standard errors
Given the advent of very fast computation even on laptop computers, we can use bootstrapping to correct our standard errors when the classical error assumptions fail (for example, heteroskedasticity of unknown form). In R, we run the following and then take percentiles of the replicates to get non-parametric bootstrap confidence intervals:
bs_ols <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d) # runs the model for this sample
return(coef(fit)) # returns coefficients
}
results <- boot(data=tk4.df, statistic=bs_ols,
R=1000, formula=ln_wage~educ + pexp + pexp2 + broken_home)
From the 1000 replicates, we can extract the 95% confidence intervals:
ci_mat = colQuantiles(results$t,probs=c(.025,.975))
ci = data.frame(ci_mat,row.names=c("Intercept","educ","pexp","pexp2","broken_home"))
colnames(ci) = c("2.5%","97.5%")
print(ci)
                   2.5%        97.5%
Intercept    0.19939776  0.714556678
educ         0.06565962  0.104340132
pexp         0.16128174  0.249996019
pexp2       -0.01696095 -0.008148125
broken_home -0.07249099  0.056744122
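Alternatively, boot's boot.ci function computes percentile intervals one coefficient at a time; for example, for educ (the second element of the coefficient vector):
# percentile confidence interval for the educ coefficient
boot.ci(results, type = "perc", index = 2)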
In Stata, we just run the following (for 1000 replicates). Note that while Stata resamples observations much as we did above, the reported confidence intervals are normal-based (they assume the bootstrap distribution of each coefficient is normal) rather than percentile-based. As you can see, the Stata method is much easier to implement.
regress ln_wage educ pexp pexp2 broken_home, vce(bootstrap,rep(1000))
(running regress on estimation sample)

Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500
..................................................   550
..................................................   600
..................................................   650
..................................................   700
..................................................   750
..................................................   800
..................................................   850
..................................................   900
..................................................   950
..................................................  1000

Linear regression                               Number of obs     =      1,034
                                                Replications      =      1,000
                                                Wald chi2(4)      =     252.66
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.1664
                                                Adj R-squared     =     0.1632
                                                Root MSE          =     0.4265

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0100503     8.48   0.000     .0655743    .1049707
        pexp |   .2035214    .022688     8.97   0.000     .1590538     .247989
       pexp2 |  -.0124126   .0022409    -5.54   0.000    -.0168047   -.0080205
 broken_home |  -.0087254   .0320888    -0.27   0.786    -.0716183    .0541676
       _cons |   .4603326   .1340845     3.43   0.001     .1975318    .7231333
------------------------------------------------------------------------------
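If we wanted to mirror Stata's normal-based intervals in R, a minimal sketch (assuming the normal approximation is reasonable) combines the OLS point estimates with the standard deviations of the bootstrap replicates:
# normal-approximation intervals from the bootstrap replicates
bs_se = apply(results$t, 2, sd)
ci_normal = cbind(coef(ols.lm) - 1.96*bs_se, coef(ols.lm) + 1.96*bs_se)
colnames(ci_normal) = c("2.5%","97.5%")
print(ci_normal)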
Testing for Heteroskedasticity
The two dominant heteroskedasticity tests are the White test and the Cook-Weisberg test (both variants of the Breusch-Pagan test). The White test is particularly useful if you believe the error variances have a potentially non-linear relationship with the model covariates. In Stata, we need to make sure the basic OLS regression (rather than the robust version) was run just prior to these commands, and then we can use:
quietly regress ln_wage educ pexp pexp2 broken_home
imtest, white
White's test for Ho: homoskedasticity
         against Ha: unrestricted heteroskedasticity

         chi2(12)     =     29.20
         Prob > chi2  =    0.0037

Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      29.20     12    0.0037
            Skewness |      14.14      4    0.0068
            Kurtosis |      16.19      1    0.0001
---------------------+-----------------------------
               Total |      59.53     17    0.0000
---------------------------------------------------
In R, we perform the White test with:
bptest(ols.lm,ln_wage ~ educ*pexp + educ*pexp2 + educ*broken_home + pexp*pexp2 + pexp*broken_home +
pexp2*broken_home + I(educ^2) + I(pexp^2) + I(pexp2^2) + I(broken_home^2),data=tk4.df)
	studentized Breusch-Pagan test

data:  ols.lm
BP = 29.196, df = 12, p-value = 0.003685
Notice that in R, we need to specify the form of the heteroskedasticity. To mirror the White test, we include interaction and second-order terms. Both Stata's imtest and R's bptest allow you to specify more esoteric forms of heteroskedasticity, although in R it is more apparent since we have to specify the form of the heteroskedasticity for the test to run.
A more basic test, one assuming that heteroskedasticity is a linear function of the \(\mathbf{x}\)'s, is the Cook-Weisberg version of the Breusch-Pagan test.
In Stata, we use
hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of ln_wage

         chi2(1)      =    19.57
         Prob > chi2  =   0.0000
R doesn't implement this test directly, but it can be performed manually following these steps:
- Recover the residuals from the OLS model above, denoted \(\mathbf{e}\)
- Divide the squared residuals by the mean of the squared residuals:
\[ \mathbf{r} = \frac{diag(\mathbf{ee}')}{\mathbf{e'e}/N} \]
- Run the regression \(\mathbf{r} = \gamma_0 + \gamma_1 \mathbf{xb} + \mathbf{v}\). Recall that \(\mathbf{xb} = \mathbf{\hat{y}}\).
- Recover ESS defined as
\[ ESS = (\mathbf{\hat{r}} - \bar{r})'(\hat{\mathbf{r}} - \bar{r}) \]
- ESS/2 is the test statistic distributed \(\chi^2_{(1)}\)
# squared residuals standardized by their mean
r = ols.lm$residuals^2/mean(ols.lm$residuals^2)
# auxiliary regression of r on the fitted values
xb = ols.lm$fitted.values
het_ols_results = lm(r~xb)
# explained sum of squares from the auxiliary regression
mss = crossprod(het_ols_results$fitted.values - mean(r))
# ESS/2 is the Cook-Weisberg score
cw = mss/2
sprintf("BP / Cook-Weisberg score for heterosk.: %2.3f distributed chi2 with %d df.",cw,1)
[1] "BP / Cook-Weisberg score for heterosk.: 19.572 distributed chi2 with 1 df."