ECON 407: Companion to Panel Data Models
This page has been moved to https://econ.pages.code.wm.edu/407/notes/docs/index.html and is no longer being maintained here.
This document, a companion to the Panel Data series of lecture notes, provides a brief description of how to implement panel data models in both R and Stata. We will load with Tobias and Koop but this time will use the entire dataset since we are now ready to exploit the panel nature of the full dataset. Load data and summarize:
library(foreign)
library(lmtest)
library(plm)
tk.df = read.dta("https://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
attach(tk.df)
We also need to load the data into stata
webuse set "https://rlhick.people.wm.edu/econ407/data/"
webuse tobias_koop
sum
set more off
(prefix now "https://rlhick.people.wm.edu/econ407/data")
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
id | 17,919 1081.852 629.8459 1 2178
educ | 17,919 12.67604 1.922433 9 20
ln_wage | 17,919 2.296821 .5282364 .07 4.57
pexp | 17,919 8.362688 4.127502 0 22
time | 17,919 8.196719 3.956042 0 14
-------------+---------------------------------------------------------
ability | 17,919 .052374 .9261294 -4.04 2.01
meduc | 17,919 11.4719 2.988851 0 20
feduc | 17,919 11.70925 3.766923 0 20
broken_home | 17,919 .153859 .3608236 0 1
siblings | 17,919 3.156203 2.120989 0 18
-------------+---------------------------------------------------------
pexp2 | 17,919 86.96986 75.26336 0 484
We will be estimating a very simple returns to education equation and will ignore any endogeneity issues for the purposes of illustration.
Pooled OLS
Here are the commands to run pooled OLS in stata. After we run the model, we store the model estimates for doing hypothesis testing later.
regress ln_wage educ pexp pexp2 broken_home
est store bpool
Source | SS df MS Number of obs = 17,919
-------------+---------------------------------- F(4, 17914) = 928.58
Model | 858.620393 4 214.655098 Prob > F = 0.0000
Residual | 4141.10561 17,914 .231165882 R-squared = 0.1717
-------------+---------------------------------- Adj R-squared = 0.1715
Total | 4999.72601 17,918 .279033709 Root MSE = .4808
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .0890684 .001939 45.93 0.000 .0852677 .0928691
pexp | .0907662 .0034375 26.40 0.000 .0840284 .097504
pexp2 | -.003046 .0001895 -16.07 0.000 -.0034175 -.0026745
broken_home | -.0561547 .0100131 -5.61 0.000 -.0757813 -.036528
_cons | .6822885 .0286842 23.79 0.000 .6260648 .7385122
------------------------------------------------------------------------------
In R, this is simply an OLS model and is easily run using
wage.ols = lm(ln_wage~educ+pexp+pexp2+broken_home)
summary(wage.ols)
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home)
Residuals:
Min 1Q Median 3Q Max
-2.53137 -0.27441 0.02012 0.31121 2.11371
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6822885 0.0286842 23.786 < 2e-16 ***
educ 0.0890684 0.0019390 45.934 < 2e-16 ***
pexp 0.0907662 0.0034375 26.405 < 2e-16 ***
pexp2 -0.0030460 0.0001895 -16.071 < 2e-16 ***
broken_home -0.0561547 0.0100131 -5.608 2.08e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4808 on 17914 degrees of freedom
Multiple R-squared: 0.1717, Adjusted R-squared: 0.1715
F-statistic: 928.6 on 4 and 17914 DF, p-value: < 2.2e-16
or we could do it using the plm package after setting up our panel dataset:
tk.panel.df = pdata.frame(tk.df, index = c("id","time"))
wage.pool = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="pooling")
summary(wage.pool)
Pooling Model
Call:
plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df,
model = "pooling")
Unbalanced Panel: n = 2178, T = 1-15, N = 17919
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-2.531374 -0.274411 0.020121 0.311211 2.113707
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.68228847 0.02868415 23.7863 < 2.2e-16 ***
educ 0.08906836 0.00193904 45.9343 < 2.2e-16 ***
pexp 0.09076619 0.00343747 26.4050 < 2.2e-16 ***
pexp2 -0.00304601 0.00018953 -16.0715 < 2.2e-16 ***
broken_home -0.05615469 0.01001310 -5.6081 2.076e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 4999.7
Residual Sum of Squares: 4141.1
R-Squared: 0.17173
Adj. R-Squared: 0.17155
F-statistic: 928.576 on 4 and 17914 DF, p-value: < 2.22e-16
Random Effects
In stata, we run the following code. Notice, we use xtset to inform stata of the panel data individual (id) and time (time) identifiers. Also, save the results for analysis later:
xtset id time
xtreg ln_wage educ pexp pexp2 broken_home , re
est store bre
panel variable: id (unbalanced)
time variable: time, 0 to 14, but with gaps
delta: 1 unit
Random-effects GLS regression Number of obs = 17,919
Group variable: id Number of groups = 2,178
R-sq: Obs per group:
within = 0.2283 min = 1
between = 0.1489 avg = 8.2
overall = 0.1714 max = 15
Wald chi2(4) = 5026.37
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .0901376 .0034376 26.22 0.000 .0834001 .0968751
pexp | .1027822 .0026045 39.46 0.000 .0976776 .1078868
pexp2 | -.0036384 .0001426 -25.52 0.000 -.0039179 -.0033589
broken_home | -.0646814 .0228582 -2.83 0.005 -.1094827 -.0198801
_cons | .5807788 .0441847 13.14 0.000 .4941784 .6673792
-------------+----------------------------------------------------------------
sigma_u | .37244396
sigma_e | .32867768
rho | .56218095 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Or, the Swamy-Aurora version of the random effects model (closest to what R uses):
xtreg ln_wage educ pexp pexp2 broken_home , sa th
est store bre_sa
Random-effects GLS regression Number of obs = 17,919
Group variable: id Number of groups = 2,178
R-sq: Obs per group:
within = 0.2283 min = 1
between = 0.1492 avg = 8.2
overall = 0.1714 max = 15
Wald chi2(4) = 5004.33
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------- theta --------------------
min 5% median 95% max
0.3201 0.4517 0.6885 0.7595 0.7672
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .0904542 .0033583 26.93 0.000 .0838721 .0970363
pexp | .1025766 .0026118 39.27 0.000 .0974575 .1076957
pexp2 | -.0036296 .0001431 -25.37 0.000 -.00391 -.0033492
broken_home | -.0642368 .022005 -2.92 0.004 -.1073659 -.0211077
_cons | .5782614 .0432297 13.38 0.000 .4935328 .66299
-------------+----------------------------------------------------------------
sigma_u | .35445648
sigma_e | .32867768
rho | .53768242 (fraction of variance due to u_i)
------------------------------------------------------------------------------
And in R the we have similar code:
wage.re = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="random")
summary(wage.re)
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df,
model = "random")
Unbalanced Panel: n = 2178, T = 1-15, N = 17919
Effects:
var std.dev share
idiosyncratic 0.1080 0.3287 0.469
individual 0.1223 0.3498 0.531
theta:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3152 0.6847 0.7151 0.7007 0.7478 0.7642
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.36051 -0.15586 0.03123 0.00648 0.19569 2.21809
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.57764593 0.04297643 13.4410 < 2e-16 ***
educ 0.09053539 0.00333714 27.1296 < 2e-16 ***
pexp 0.10252019 0.00261397 39.2202 < 2e-16 ***
pexp2 -0.00362720 0.00014322 -25.3263 < 2e-16 ***
broken_home -0.06411881 0.02178336 -2.9435 0.00325 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 2693.3
Residual Sum of Squares: 1960.1
R-Squared: 0.27508
Adj. R-Squared: 0.27492
F-statistic: 1675.26 on 4 and 17914 DF, p-value: < 2.22e-16
Fixed Effects
In stata, the easiest model to run is the "between" estimator. Note, stata automatically drops covariates that do not vary within an individual's observations so that the model runs, but it leaves the variable in the regression results output:
xtreg ln_wage educ pexp pexp2 broken_home , fe
est store bfe
note: broken_home omitted because of collinearity
Fixed-effects (within) regression Number of obs = 17,919
Group variable: id Number of groups = 2,178
R-sq: Obs per group:
within = 0.2286 min = 1
between = 0.1331 avg = 8.2
overall = 0.1663 max = 15
F(3,15738) = 1554.76
corr(u_i, Xb) = 0.0053 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .0761987 .0059414 12.83 0.000 .0645529 .0878444
pexp | .1070331 .0027728 38.60 0.000 .101598 .1124681
pexp2 | -.0038188 .000149 -25.63 0.000 -.0041109 -.0035268
broken_home | 0 (omitted)
_cons | .7679618 .0716835 10.71 0.000 .6274539 .9084696
-------------+----------------------------------------------------------------
sigma_u | .40417965
sigma_e | .32867768
rho | .6019421 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(2177, 15738) = 10.41 Prob > F = 0.0000
In R we have the same result (although R plm surpresses the model constant in a way more in line with our class notes). R also automatically drops covariates not varying within a cross sectional unit and it drops it from the results (it doesn't tell you explicitly the way stata does).
wage.fe = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="within")
summary(wage.fe)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df,
model = "within")
Unbalanced Panel: n = 2178, T = 1-15, N = 17919
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-2.452999 -0.145082 0.010925 0.167903 2.543757
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
educ 0.07619867 0.00594136 12.825 < 2.2e-16 ***
pexp 0.10703306 0.00277281 38.601 < 2.2e-16 ***
pexp2 -0.00381882 0.00014899 -25.632 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 2204
Residual Sum of Squares: 1700.2
R-Squared: 0.22862
Adj. R-Squared: 0.12176
F-statistic: 1554.76 on 3 and 15738 DF, p-value: < 2.22e-16
Hypothesis Testing and Model Comparison
We have various tests to run:
OLS (Pooled) versus Random Effects
In stata, right after running the random effects model, run
quietly: xtreg ln_wage educ pexp pexp2 broken_home , re
xttest0
Breusch and Pagan Lagrangian multiplier test for random effects
ln_wage[id,t] = Xb + u[id] + e[id,t]
Estimated results:
| Var sd = sqrt(Var)
---------+-----------------------------
ln_wage | .2790337 .5282364
e | .108029 .3286777
u | .1387145 .372444
Test: Var(u) = 0
chibar2(01) = 19338.65
Prob > chibar2 = 0.0000
In R,
plmtest(wage.pool,type='bp')
Lagrange Multiplier Test - (Breusch-Pagan) for unbalanced panels data: ln_wage ~ educ + pexp + pexp2 + broken_home chisq = 19339, df = 1, p-value < 2.2e-16 alternative hypothesis: significant effects
OLS (Pooled) versus Fixed Effects
In stata, this is tested by default and included in the fe output (at the end):
------------------------------------------------------------------------------ F test that all u_i=0: F(2177, 15738) = 10.41 Prob > F = 0.0000
In R, use this (note the slight difference in the F statistic (and degrees of freedom) due to stata using a model constant):
pFtest(wage.fe, wage.pool)
F test for individual effects data: ln_wage ~ educ + pexp + pexp2 + broken_home F = 10.384, df1 = 2176, df2 = 15738, p-value < 2.2e-16 alternative hypothesis: significant effects
Random Effects versus Fixed Effects
In stata, install xtoverid and use this after the fixed effects regression:
xtoverid
Test of overidentifying restrictions: fixed vs random effects Cross-section time-series model: xtreg re Sargan-Hansen statistic 31.892 Chi-sq(3) P-value = 0.0000
or, you can use the Hausman test explictly. Let's run Hausman on the Swamy-Arora version of random effects:
hausman bfe bre_sa
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| bfe bre_sa Difference S.E.
-------------+----------------------------------------------------------------
educ | .0761987 .0904542 -.0142555 .0049012
pexp | .1070331 .1025766 .0044565 .000931
pexp2 | -.0038188 -.0036296 -.0001892 .0000415
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg
Test: Ho: difference in coefficients not systematic
chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 43.88
Prob>chi2 = 0.0000
in R, we use the Hausman test:
phtest(wage.fe, wage.re)
Hausman Test data: ln_wage ~ educ + pexp + pexp2 + broken_home chisq = 49.388, df = 3, p-value = 1.079e-10 alternative hypothesis: one model is inconsistent
Note the stata output is more informative.
Endogeneity
IV regression on panel data (where x4 is endogenous and instrumented by z1) can be performed by
xtivreg y x1 x2 x3 (x4=z1), fe
and in R by
summary(plm(y~x1+x2+x3+x4 | . + x1+x2+x3+z1, data=panel.df,model="within"))
Note: Random Effects with instruments seemingly doesn't work in R (doesn't come close to matching stata's xtivreg command). Fixed effects (shown above), not verified.
Robust Standard Errors
In stata:
xtreg ln_wage educ pexp pexp2, fe robust
Fixed-effects (within) regression Number of obs = 17,919
Group variable: id Number of groups = 2,178
R-sq: Obs per group:
within = 0.2286 min = 1
between = 0.1331 avg = 8.2
overall = 0.1663 max = 15
F(3,2177) = 535.99
corr(u_i, Xb) = 0.0053 Prob > F = 0.0000
(Std. Err. adjusted for 2,178 clusters in id)
------------------------------------------------------------------------------
| Robust
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .0761987 .0099113 7.69 0.000 .0567621 .0956352
pexp | .1070331 .0042128 25.41 0.000 .0987716 .1152945
pexp2 | -.0038188 .0002193 -17.42 0.000 -.0042489 -.0033888
_cons | .7679618 .1208608 6.35 0.000 .5309472 1.004976
-------------+----------------------------------------------------------------
sigma_u | .40417965
sigma_e | .32867768
rho | .6019421 (fraction of variance due to u_i)
------------------------------------------------------------------------------
In R and noting that only the standard errors, t-stats, and p values change, we have very similar results. Note the vcovHC and coeftest functions are being drawn from the plm package and not sandwich:
coeftest(wage.fe,vcov=vcovHC(wage.fe,type="sss"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
educ 0.07619867 0.00991101 7.6883 1.579e-14 ***
pexp 0.10703306 0.00421265 25.4076 < 2.2e-16 ***
pexp2 -0.00381882 0.00021928 -17.4155 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1