ECON 407: Companion to Panel Data Models
This page has been moved to https://econ.pages.code.wm.edu/407/notes/docs/index.html and is no longer being maintained here.
This document, a companion to the Panel Data series of lecture notes, provides a brief description of how to implement panel data models in both R and Stata. We will load with Tobias and Koop but this time will use the entire dataset since we are now ready to exploit the panel nature of the full dataset. Load data and summarize:
library(foreign)
library(lmtest)
library(plm)
tk.df = read.dta("https://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
attach(tk.df)
We also need to load the data into stata
webuse set "https://rlhick.people.wm.edu/econ407/data/"
webuse tobias_koop
sum
set more off (prefix now "https://rlhick.people.wm.edu/econ407/data") Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- id | 17,919 1081.852 629.8459 1 2178 educ | 17,919 12.67604 1.922433 9 20 ln_wage | 17,919 2.296821 .5282364 .07 4.57 pexp | 17,919 8.362688 4.127502 0 22 time | 17,919 8.196719 3.956042 0 14 -------------+--------------------------------------------------------- ability | 17,919 .052374 .9261294 -4.04 2.01 meduc | 17,919 11.4719 2.988851 0 20 feduc | 17,919 11.70925 3.766923 0 20 broken_home | 17,919 .153859 .3608236 0 1 siblings | 17,919 3.156203 2.120989 0 18 -------------+--------------------------------------------------------- pexp2 | 17,919 86.96986 75.26336 0 484
We will be estimating a very simple returns to education equation and will ignore any endogeneity issues for the purposes of illustration.
Pooled OLS
Here are the commands to run pooled OLS in stata. After we run the model, we store the model estimates for doing hypothesis testing later.
regress ln_wage educ pexp pexp2 broken_home
est store bpool
Source | SS df MS Number of obs = 17,919 -------------+---------------------------------- F(4, 17914) = 928.58 Model | 858.620393 4 214.655098 Prob > F = 0.0000 Residual | 4141.10561 17,914 .231165882 R-squared = 0.1717 -------------+---------------------------------- Adj R-squared = 0.1715 Total | 4999.72601 17,918 .279033709 Root MSE = .4808 ------------------------------------------------------------------------------ ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- educ | .0890684 .001939 45.93 0.000 .0852677 .0928691 pexp | .0907662 .0034375 26.40 0.000 .0840284 .097504 pexp2 | -.003046 .0001895 -16.07 0.000 -.0034175 -.0026745 broken_home | -.0561547 .0100131 -5.61 0.000 -.0757813 -.036528 _cons | .6822885 .0286842 23.79 0.000 .6260648 .7385122 ------------------------------------------------------------------------------
In R, this is simply an OLS model and is easily run using
wage.ols = lm(ln_wage~educ+pexp+pexp2+broken_home)
summary(wage.ols)
Call: lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home) Residuals: Min 1Q Median 3Q Max -2.53137 -0.27441 0.02012 0.31121 2.11371 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.6822885 0.0286842 23.786 < 2e-16 *** educ 0.0890684 0.0019390 45.934 < 2e-16 *** pexp 0.0907662 0.0034375 26.405 < 2e-16 *** pexp2 -0.0030460 0.0001895 -16.071 < 2e-16 *** broken_home -0.0561547 0.0100131 -5.608 2.08e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4808 on 17914 degrees of freedom Multiple R-squared: 0.1717, Adjusted R-squared: 0.1715 F-statistic: 928.6 on 4 and 17914 DF, p-value: < 2.2e-16
or we could do it using the plm package after setting up our panel dataset:
tk.panel.df = pdata.frame(tk.df, index = c("id","time"))
wage.pool = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="pooling")
summary(wage.pool)
Pooling Model Call: plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df, model = "pooling") Unbalanced Panel: n = 2178, T = 1-15, N = 17919 Residuals: Min. 1st Qu. Median 3rd Qu. Max. -2.531374 -0.274411 0.020121 0.311211 2.113707 Coefficients: Estimate Std. Error t-value Pr(>|t|) (Intercept) 0.68228847 0.02868415 23.7863 < 2.2e-16 *** educ 0.08906836 0.00193904 45.9343 < 2.2e-16 *** pexp 0.09076619 0.00343747 26.4050 < 2.2e-16 *** pexp2 -0.00304601 0.00018953 -16.0715 < 2.2e-16 *** broken_home -0.05615469 0.01001310 -5.6081 2.076e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Total Sum of Squares: 4999.7 Residual Sum of Squares: 4141.1 R-Squared: 0.17173 Adj. R-Squared: 0.17155 F-statistic: 928.576 on 4 and 17914 DF, p-value: < 2.22e-16
Random Effects
In stata, we run the following code. Notice, we use xtset to inform stata of the panel data individual (id) and time (time) identifiers. Also, save the results for analysis later:
xtset id time
xtreg ln_wage educ pexp pexp2 broken_home , re
est store bre
panel variable: id (unbalanced) time variable: time, 0 to 14, but with gaps delta: 1 unit Random-effects GLS regression Number of obs = 17,919 Group variable: id Number of groups = 2,178 R-sq: Obs per group: within = 0.2283 min = 1 between = 0.1489 avg = 8.2 overall = 0.1714 max = 15 Wald chi2(4) = 5026.37 corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- educ | .0901376 .0034376 26.22 0.000 .0834001 .0968751 pexp | .1027822 .0026045 39.46 0.000 .0976776 .1078868 pexp2 | -.0036384 .0001426 -25.52 0.000 -.0039179 -.0033589 broken_home | -.0646814 .0228582 -2.83 0.005 -.1094827 -.0198801 _cons | .5807788 .0441847 13.14 0.000 .4941784 .6673792 -------------+---------------------------------------------------------------- sigma_u | .37244396 sigma_e | .32867768 rho | .56218095 (fraction of variance due to u_i) ------------------------------------------------------------------------------
Or, the Swamy-Aurora version of the random effects model (closest to what R uses):
xtreg ln_wage educ pexp pexp2 broken_home , sa th
est store bre_sa
Random-effects GLS regression Number of obs = 17,919 Group variable: id Number of groups = 2,178 R-sq: Obs per group: within = 0.2283 min = 1 between = 0.1492 avg = 8.2 overall = 0.1714 max = 15 Wald chi2(4) = 5004.33 corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000 ------------------- theta -------------------- min 5% median 95% max 0.3201 0.4517 0.6885 0.7595 0.7672 ------------------------------------------------------------------------------ ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- educ | .0904542 .0033583 26.93 0.000 .0838721 .0970363 pexp | .1025766 .0026118 39.27 0.000 .0974575 .1076957 pexp2 | -.0036296 .0001431 -25.37 0.000 -.00391 -.0033492 broken_home | -.0642368 .022005 -2.92 0.004 -.1073659 -.0211077 _cons | .5782614 .0432297 13.38 0.000 .4935328 .66299 -------------+---------------------------------------------------------------- sigma_u | .35445648 sigma_e | .32867768 rho | .53768242 (fraction of variance due to u_i) ------------------------------------------------------------------------------
And in R the we have similar code:
wage.re = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="random")
summary(wage.re)
Oneway (individual) effect Random Effect Model (Swamy-Arora's transformation) Call: plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df, model = "random") Unbalanced Panel: n = 2178, T = 1-15, N = 17919 Effects: var std.dev share idiosyncratic 0.1080 0.3287 0.469 individual 0.1223 0.3498 0.531 theta: Min. 1st Qu. Median Mean 3rd Qu. Max. 0.3152 0.6847 0.7151 0.7007 0.7478 0.7642 Residuals: Min. 1st Qu. Median Mean 3rd Qu. Max. -2.36051 -0.15586 0.03123 0.00648 0.19569 2.21809 Coefficients: Estimate Std. Error t-value Pr(>|t|) (Intercept) 0.57764593 0.04297643 13.4410 < 2e-16 *** educ 0.09053539 0.00333714 27.1296 < 2e-16 *** pexp 0.10252019 0.00261397 39.2202 < 2e-16 *** pexp2 -0.00362720 0.00014322 -25.3263 < 2e-16 *** broken_home -0.06411881 0.02178336 -2.9435 0.00325 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Total Sum of Squares: 2693.3 Residual Sum of Squares: 1960.1 R-Squared: 0.27508 Adj. R-Squared: 0.27492 F-statistic: 1675.26 on 4 and 17914 DF, p-value: < 2.22e-16
Fixed Effects
In stata, the easiest model to run is the "between" estimator. Note, stata automatically drops covariates that do not vary within an individual's observations so that the model runs, but it leaves the variable in the regression results output:
xtreg ln_wage educ pexp pexp2 broken_home , fe
est store bfe
note: broken_home omitted because of collinearity Fixed-effects (within) regression Number of obs = 17,919 Group variable: id Number of groups = 2,178 R-sq: Obs per group: within = 0.2286 min = 1 between = 0.1331 avg = 8.2 overall = 0.1663 max = 15 F(3,15738) = 1554.76 corr(u_i, Xb) = 0.0053 Prob > F = 0.0000 ------------------------------------------------------------------------------ ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- educ | .0761987 .0059414 12.83 0.000 .0645529 .0878444 pexp | .1070331 .0027728 38.60 0.000 .101598 .1124681 pexp2 | -.0038188 .000149 -25.63 0.000 -.0041109 -.0035268 broken_home | 0 (omitted) _cons | .7679618 .0716835 10.71 0.000 .6274539 .9084696 -------------+---------------------------------------------------------------- sigma_u | .40417965 sigma_e | .32867768 rho | .6019421 (fraction of variance due to u_i) ------------------------------------------------------------------------------ F test that all u_i=0: F(2177, 15738) = 10.41 Prob > F = 0.0000
In R we have the same result (although R plm surpresses the model constant in a way more in line with our class notes). R also automatically drops covariates not varying within a cross sectional unit and it drops it from the results (it doesn't tell you explicitly the way stata does).
wage.fe = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="within")
summary(wage.fe)
Oneway (individual) effect Within Model Call: plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df, model = "within") Unbalanced Panel: n = 2178, T = 1-15, N = 17919 Residuals: Min. 1st Qu. Median 3rd Qu. Max. -2.452999 -0.145082 0.010925 0.167903 2.543757 Coefficients: Estimate Std. Error t-value Pr(>|t|) educ 0.07619867 0.00594136 12.825 < 2.2e-16 *** pexp 0.10703306 0.00277281 38.601 < 2.2e-16 *** pexp2 -0.00381882 0.00014899 -25.632 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Total Sum of Squares: 2204 Residual Sum of Squares: 1700.2 R-Squared: 0.22862 Adj. R-Squared: 0.12176 F-statistic: 1554.76 on 3 and 15738 DF, p-value: < 2.22e-16
Hypothesis Testing and Model Comparison
We have various tests to run:
OLS (Pooled) versus Random Effects
In stata, right after running the random effects model, run
quietly: xtreg ln_wage educ pexp pexp2 broken_home , re
xttest0
Breusch and Pagan Lagrangian multiplier test for random effects ln_wage[id,t] = Xb + u[id] + e[id,t] Estimated results: | Var sd = sqrt(Var) ---------+----------------------------- ln_wage | .2790337 .5282364 e | .108029 .3286777 u | .1387145 .372444 Test: Var(u) = 0 chibar2(01) = 19338.65 Prob > chibar2 = 0.0000
In R,
plmtest(wage.pool,type='bp')
Lagrange Multiplier Test - (Breusch-Pagan) for unbalanced panels data: ln_wage ~ educ + pexp + pexp2 + broken_home chisq = 19339, df = 1, p-value < 2.2e-16 alternative hypothesis: significant effects
OLS (Pooled) versus Fixed Effects
In stata, this is tested by default and included in the fe
output (at the end):
------------------------------------------------------------------------------ F test that all u_i=0: F(2177, 15738) = 10.41 Prob > F = 0.0000
In R, use this (note the slight difference in the F statistic (and degrees of freedom) due to stata using a model constant):
pFtest(wage.fe, wage.pool)
F test for individual effects data: ln_wage ~ educ + pexp + pexp2 + broken_home F = 10.384, df1 = 2176, df2 = 15738, p-value < 2.2e-16 alternative hypothesis: significant effects
Random Effects versus Fixed Effects
In stata, install xtoverid
and use this after the fixed effects regression:
xtoverid
Test of overidentifying restrictions: fixed vs random effects Cross-section time-series model: xtreg re Sargan-Hansen statistic 31.892 Chi-sq(3) P-value = 0.0000
or, you can use the Hausman test explictly. Let's run Hausman on the Swamy-Arora version of random effects:
hausman bfe bre_sa
---- Coefficients ---- | (b) (B) (b-B) sqrt(diag(V_b-V_B)) | bfe bre_sa Difference S.E. -------------+---------------------------------------------------------------- educ | .0761987 .0904542 -.0142555 .0049012 pexp | .1070331 .1025766 .0044565 .000931 pexp2 | -.0038188 -.0036296 -.0001892 .0000415 ------------------------------------------------------------------------------ b = consistent under Ho and Ha; obtained from xtreg B = inconsistent under Ha, efficient under Ho; obtained from xtreg Test: Ho: difference in coefficients not systematic chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 43.88 Prob>chi2 = 0.0000
in R, we use the Hausman test:
phtest(wage.fe, wage.re)
Hausman Test data: ln_wage ~ educ + pexp + pexp2 + broken_home chisq = 49.388, df = 3, p-value = 1.079e-10 alternative hypothesis: one model is inconsistent
Note the stata output is more informative.
Endogeneity
IV regression on panel data (where x4
is endogenous and instrumented by z1
) can be performed by
xtivreg y x1 x2 x3 (x4=z1), fe
and in R by
summary(plm(y~x1+x2+x3+x4 | . + x1+x2+x3+z1, data=panel.df,model="within"))
Note: Random Effects with instruments seemingly doesn't work in R (doesn't come close to matching stata's xtivreg command). Fixed effects (shown above), not verified.
Robust Standard Errors
In stata:
xtreg ln_wage educ pexp pexp2, fe robust
Fixed-effects (within) regression Number of obs = 17,919 Group variable: id Number of groups = 2,178 R-sq: Obs per group: within = 0.2286 min = 1 between = 0.1331 avg = 8.2 overall = 0.1663 max = 15 F(3,2177) = 535.99 corr(u_i, Xb) = 0.0053 Prob > F = 0.0000 (Std. Err. adjusted for 2,178 clusters in id) ------------------------------------------------------------------------------ | Robust ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- educ | .0761987 .0099113 7.69 0.000 .0567621 .0956352 pexp | .1070331 .0042128 25.41 0.000 .0987716 .1152945 pexp2 | -.0038188 .0002193 -17.42 0.000 -.0042489 -.0033888 _cons | .7679618 .1208608 6.35 0.000 .5309472 1.004976 -------------+---------------------------------------------------------------- sigma_u | .40417965 sigma_e | .32867768 rho | .6019421 (fraction of variance due to u_i) ------------------------------------------------------------------------------
In R and noting that only the standard errors, t-stats, and p values change, we have very similar results. Note the vcovHC
and coeftest
functions are being drawn from the plm
package and not sandwich
:
coeftest(wage.fe,vcov=vcovHC(wage.fe,type="sss"))
t test of coefficients: Estimate Std. Error t value Pr(>|t|) educ 0.07619867 0.00991101 7.6883 1.579e-14 *** pexp 0.10703306 0.00421265 25.4076 < 2.2e-16 *** pexp2 -0.00381882 0.00021928 -17.4155 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1