ECON 407: Companion to Panel Data Models

This document, a companion to the Panel Data series of lecture notes, provides a brief description of how to implement panel data models in both R and Stata.

In [2]:
library(foreign)
library(lmtest)
library(plm)

We will persist with Tobias and Koop but this time will use the entire dataset. Load data and summarize:

In [3]:
tk.df = read.dta("http://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
attach(tk.df)

We will be estimating the returns to education equation and will ignore any endogeneity issues for the purposes of illustration.

Pooled OLS

Here are the commands to run pooled OLS in stata. After we run the model, we store the model estimates for doing hypothesis testing later.

. * simple pooled OLS regression
. 
. regress ln_wage educ pexp pexp2 broken_home 

      Source |       SS       df       MS              Number of obs =   17919
-------------+------------------------------           F(  4, 17914) =  928.58
       Model |  858.620393     4  214.655098           Prob > F      =  0.0000
    Residual |  4141.10561 17914  .231165882           R-squared     =  0.1717
-------------+------------------------------           Adj R-squared =  0.1715
       Total |  4999.72601 17918  .279033709           Root MSE      =   .4808

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0890684    .001939    45.93   0.000     .0852677    .0928691
        pexp |   .0907662   .0034375    26.40   0.000     .0840284     .097504
       pexp2 |   -.003046   .0001895   -16.07   0.000    -.0034175   -.0026745
 broken_home |  -.0561547   .0100131    -5.61   0.000    -.0757813    -.036528
       _cons |   .6822885   .0286842    23.79   0.000     .6260648    .7385122
------------------------------------------------------------------------------

. est store bpool

In R, this is simply an OLS model and is easily run using

In [4]:
wage.ols = lm(ln_wage~educ+pexp+pexp2+broken_home)
summary(wage.ols)
Out[4]:
Call:
lm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.53137 -0.27441  0.02012  0.31121  2.11371 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.6822885  0.0286842  23.786  < 2e-16 ***
educ         0.0890684  0.0019390  45.934  < 2e-16 ***
pexp         0.0907662  0.0034375  26.405  < 2e-16 ***
pexp2       -0.0030460  0.0001895 -16.071  < 2e-16 ***
broken_home -0.0561547  0.0100131  -5.608 2.08e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4808 on 17914 degrees of freedom
Multiple R-squared:  0.1717,	Adjusted R-squared:  0.1715 
F-statistic: 928.6 on 4 and 17914 DF,  p-value: < 2.2e-16

or we could do it using the plm package after setting up our panel dataset:

In [6]:
tk.panel.df = plm.data(tk.df, index = c("id","time"))
wage.pool = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="pooling")
summary(wage.pool)
Out[6]:
Oneway (individual) effect Pooling Model

Call:
plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df, 
    model = "pooling")

Unbalanced Panel: n=2178, T=1-15, N=17919

Residuals :
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-2.5300 -0.2740  0.0201  0.3110  2.1100 

Coefficients :
               Estimate  Std. Error  t-value  Pr(>|t|)    
(Intercept)  0.68228847  0.02868415  23.7863 < 2.2e-16 ***
educ         0.08906836  0.00193904  45.9343 < 2.2e-16 ***
pexp         0.09076619  0.00343747  26.4050 < 2.2e-16 ***
pexp2       -0.00304601  0.00018953 -16.0715 < 2.2e-16 ***
broken_home -0.05615469  0.01001310  -5.6081 2.076e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    4999.7
Residual Sum of Squares: 4141.1
R-Squared:      0.17173
Adj. R-Squared: 0.17169
F-statistic: 928.576 on 4 and 17914 DF, p-value: < 2.22e-16

Random Effects

In stata, we run the following code. Notice, we use xtset to inform stata of the panel data individual (id) and time (time) identifiers.

. xtset id time
       panel variable:  id (unbalanced)
        time variable:  time, 0 to 14, but with gaps
                delta:  1 unit

. xtreg ln_wage educ pexp pexp2 broken_home , re

Random-effects GLS regression                   Number of obs      =     17919
Group variable: id                              Number of groups   =      2178

R-sq:  within  = 0.2283                         Obs per group: min =         1
       between = 0.1489                                        avg =       8.2
       overall = 0.1714                                        max =        15

                                                Wald chi2(4)       =   5026.37
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0901376   .0034376    26.22   0.000     .0834001    .0968751
        pexp |   .1027822   .0026045    39.46   0.000     .0976776    .1078868
       pexp2 |  -.0036384   .0001426   -25.52   0.000    -.0039179   -.0033589
 broken_home |  -.0646814   .0228582    -2.83   0.005    -.1094827   -.0198801
       _cons |   .5807788   .0441847    13.14   0.000     .4941784    .6673792
-------------+----------------------------------------------------------------
     sigma_u |  .37244396
     sigma_e |  .32867768
         rho |  .56218095   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. est store bre

Or, the Swamy-Aurora version of the random effects model (closest to what R uses):

. xtreg ln_wage educ pexp pexp2 broken_home , sa th

Random-effects GLS regression                   Number of obs      =     17919
Group variable: id                              Number of groups   =      2178

R-sq:  within  = 0.2283                         Obs per group: min =         1
       between = 0.1492                                        avg =       8.2
       overall = 0.1714                                        max =        15

                                                Wald chi2(4)       =   5004.33
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

------------------- theta --------------------
  min      5%       median        95%      max
0.3201   0.4517     0.6885     0.7595   0.7672

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0904542   .0033583    26.93   0.000     .0838721    .0970363
        pexp |   .1025766   .0026118    39.27   0.000     .0974575    .1076957
       pexp2 |  -.0036296   .0001431   -25.37   0.000      -.00391   -.0033492
 broken_home |  -.0642368    .022005    -2.92   0.004    -.1073659   -.0211077
       _cons |   .5782614   .0432297    13.38   0.000     .4935328      .66299
-------------+----------------------------------------------------------------
     sigma_u |  .35445648
     sigma_e |  .32867768
         rho |  .53768242   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. est store bre

And in R the notation is similar:

In [7]:
wage.re = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="random")
summary(wage.re)
Out[7]:
Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df, 
    model = "random")

Unbalanced Panel: n=2178, T=1-15, N=17919

Effects:
                 var std.dev share
idiosyncratic 0.1080  0.3287 0.469
individual    0.1223  0.3498 0.531
theta  : 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3152  0.6847  0.7151  0.7007  0.7478  0.7642 

Residuals :
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.36000 -0.15600  0.03120  0.00648  0.19600  2.22000 

Coefficients :
               Estimate  Std. Error  t-value  Pr(>|t|)    
(Intercept)  0.57764435  0.04297577  13.4412 < 2.2e-16 ***
educ         0.09053560  0.00333708  27.1301 < 2.2e-16 ***
pexp         0.10252004  0.00261397  39.2200 < 2.2e-16 ***
pexp2       -0.00362719  0.00014322 -25.3262 < 2.2e-16 ***
broken_home -0.06411850  0.02178278  -2.9435  0.003249 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    2693.3
Residual Sum of Squares: 1960.1
R-Squared:      0.27508
Adj. R-Squared: 0.275
F-statistic: 1675.27 on 4 and 17914 DF, p-value: < 2.22e-16

Fixed Effects

In stata, the easiest model to run is the "between" estimator. Note, stata automatically drops covariates that do not vary within an individual's observations:

. xtreg ln_wage educ pexp pexp2 broken_home , fe
note: broken_home omitted because of collinearity

Fixed-effects (within) regression               Number of obs      =     17919
Group variable: id                              Number of groups   =      2178

R-sq:  within  = 0.2286                         Obs per group: min =         1
       between = 0.1331                                        avg =       8.2
       overall = 0.1663                                        max =        15

                                                F(3,15738)         =   1554.76
corr(u_i, Xb)  = 0.0053                         Prob > F           =    0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0761987   .0059414    12.83   0.000     .0645529    .0878444
        pexp |   .1070331   .0027728    38.60   0.000      .101598    .1124681
       pexp2 |  -.0038188    .000149   -25.63   0.000    -.0041109   -.0035268
 broken_home |          0  (omitted)
       _cons |   .7679618   .0716835    10.71   0.000     .6274539    .9084696
-------------+----------------------------------------------------------------
     sigma_u |  .40417965
     sigma_e |  .32867768
         rho |   .6019421   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(2177, 15738) =    10.41         Prob > F = 0.0000

. est store bfe

In R we have the same result (although R plm surpresses the model constant in a way more in line with our class notes). R also drops automatically covariates not varying within a cross sectional unit (although it doesn't tell you explicitly the way stata does).

In [8]:
wage.fe = plm(ln_wage~educ+pexp+pexp2+broken_home, data=tk.panel.df,model="within")
summary(wage.fe)
Out[8]:
Oneway (individual) effect Within Model

Call:
plm(formula = ln_wage ~ educ + pexp + pexp2 + broken_home, data = tk.panel.df, 
    model = "within")

Unbalanced Panel: n=2178, T=1-15, N=17919

Residuals :
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-2.4500 -0.1450  0.0109  0.1680  2.5400 

Coefficients :
         Estimate  Std. Error t-value  Pr(>|t|)    
educ   0.07619867  0.00594136  12.825 < 2.2e-16 ***
pexp   0.10703306  0.00277281  38.601 < 2.2e-16 ***
pexp2 -0.00381882  0.00014899 -25.632 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    2204
Residual Sum of Squares: 1700.2
R-Squared:      0.22862
Adj. R-Squared: 0.20079
F-statistic: 1554.76 on 3 and 15738 DF, p-value: < 2.22e-16

Hypothesis Testing and Model Comparison

We have various tests to run:

OLS (Pooled) versus Random Effects

In stata, right after running the random effects model, run

. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

        ln_wage[id,t] = Xb + u[id] + e[id,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                 ln_wage |   .2790337       .5282364
                       e |    .108029       .3286777
                       u |   .1387145        .372444

        Test:   Var(u) = 0
                             chibar2(01) = 19338.65
                          Prob > chibar2 =   0.0000

In R,

In [9]:
plmtest(wage.pool,type='bp')
Out[9]:
	Lagrange Multiplier Test - (Breusch-Pagan)

data:  ln_wage ~ educ + pexp + pexp2 + broken_home
chisq = 387220, df = 1, p-value < 2.2e-16
alternative hypothesis: significant effects

OLS (Pooled) versus Fixed Effects

In stata, this is tested by default and included in the fe output (at the end):

------------------------------------------------------------------------------
F test that all u_i=0:     F(2177, 15738) =    10.41         Prob > F = 0.0000

In R, use this (note the slight difference in the F statistic (and degrees of freedom) due to stata using a model constant):

In [10]:
pFtest(wage.fe, wage.pool)
Out[10]:
	F test for individual effects

data:  ln_wage ~ educ + pexp + pexp2 + broken_home
F = 10.384, df1 = 2176, df2 = 15738, p-value < 2.2e-16
alternative hypothesis: significant effects
  1. Random Effects versus Fixed Effects

Install xtoverid and use this:

. xtoverid

Test of overidentifying restrictions: fixed vs random effects
Cross-section time-series model: xtreg re   
Sargan-Hansen statistic  31.892  Chi-sq(3)    P-value = 0.0000

or, you can use the Hausman test explictly:

. hausman bfe bre

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |      bfe          bre         Difference          S.E.
-------------+----------------------------------------------------------------
        educ |    .0761987     .0901376       -.0139389        .0048459
        pexp |    .1070331     .1027822        .0042509        .0009515
       pexp2 |   -.0038188    -.0036384       -.0001804        .0000432
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =       31.18
                Prob>chi2 =      0.0000

in R, we use the Hausman test:

In [11]:
phtest(wage.fe, wage.re)
Out[11]:
	Hausman Test

data:  ln_wage ~ educ + pexp + pexp2 + broken_home
chisq = 49.404, df = 3, p-value = 1.07e-10
alternative hypothesis: one model is inconsistent

Note the stata output is more informative.

Endogeneity

IV regression on panel data (where x4 is endogenous and instrumented by z1) can be performed by

xtivreg y x1 x2 x3 (x4=z1), fe

and in R by

summary(plm(y~x1+x2+x3+x4 | . + x1+x2+x3+z1, data=panel.df,model="within"))

Note: Random Effects with instruments seemingly doesn't work in R (doesn't come close to matching stata's xtivreg command). Fixed effects (shown above), not verified.

Robust Standard Errors

In stata:

. xtreg ln_wage educ pexp pexp2, fe robust

Fixed-effects (within) regression               Number of obs      =     17919
Group variable: id                              Number of groups   =      2178

R-sq:  within  = 0.2286                         Obs per group: min =         1
       between = 0.1331                                        avg =       8.2
       overall = 0.1663                                        max =        15

                                                F(3,2177)          =    535.99
corr(u_i, Xb)  = 0.0053                         Prob > F           =    0.0000

                                  (Std. Err. adjusted for 2178 clusters in id)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0761987   .0099113     7.69   0.000     .0567621    .0956352
        pexp |   .1070331   .0042128    25.41   0.000     .0987716    .1152945
       pexp2 |  -.0038188   .0002193   -17.42   0.000    -.0042489   -.0033888
       _cons |   .7679618   .1208608     6.35   0.000     .5309472    1.004976
-------------+----------------------------------------------------------------
     sigma_u |  .40417965
     sigma_e |  .32867768
         rho |   .6019421   (fraction of variance due to u_i)
------------------------------------------------------------------------------

In R and noting that only the standard errors, t-stats, and p values change, we have very similar results. Note the vcovHC and coeftest functions are being drawn from the plm package and not sandwich:

In [12]:
coeftest(wage.fe,vcov=vcovHC(wage.fe,type="sss"))
Out[12]:
t test of coefficients:

         Estimate  Std. Error  t value  Pr(>|t|)    
educ   0.07619867  0.00991101   7.6883 1.579e-14 ***
pexp   0.10703306  0.00421265  25.4076 < 2.2e-16 ***
pexp2 -0.00381882  0.00021928 -17.4155 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1