Comparing Stata and IPython Commands for OLS Models
In this note, I'll explore the Python statsmodels package for estimating linear regression models (OLS). The goal is to map Stata's reg command completely into something implementable in IPython.
# Load Python libraries for this IPython notebook:
%matplotlib inline
import pandas as pd # pandas data manipulation library
import statsmodels.formula.api as smf # for the ols and robust ols model
import statsmodels.api as sm
import numpy as np
Note: the data are from the Tobias and Koop dataset on the returns to education. The full dataset is a panel; for demonstration purposes only, I have turned it into a cross section by keeping observations where time==4.
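For reference, subsetting a full panel down to this cross section is a one-liner in pandas; a sketch, assuming a dataframe full_panel with a time column (full_panel is hypothetical, since the file loaded below is already the subset):
tobias_koop = full_panel[full_panel.time == 4]  # keep only the time==4 cross section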
# load data and create dataframe
tobias_koop=pd.read_csv('http://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')
With the pandas dataframe created, we can easily run some models. First, let's look at the data, run summary statistics, and plot the histogram of our dependent variable, ln_wage.
tobias_koop.head()
tobias_koop.describe()
tobias_koop.ln_wage.plot(kind='hist')
OLS analysis
This is the model from which we'll be estimating the vector $\beta$:
$$ y_i = \beta_0 + \beta_1 educ_i + \beta_2 pexp_i + \beta_3 pexp^2_i + \beta_4 broken\_home_i + \epsilon_i $$
For now, we will ignore any potential bias due to endogeneity, although we'll come back to that in a later post.
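As an aside, pexp2 is already a column in this dataset; if it weren't, you could build the quadratic term yourself, or let patsy construct it inside the formula. A quick sketch of both options, using the dataframe loaded above:
# Option 1: construct the squared term as a new column
# (harmless here, since the dataset already includes pexp2)
tobias_koop['pexp2'] = tobias_koop.pexp ** 2
# Option 2: have patsy build the quadratic term in the formula via I()
formula_alt = "ln_wage ~ educ + pexp + I(pexp**2) + broken_home"
results_alt = smf.ols(formula_alt, tobias_koop).fit()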
Fitting the model in Stata
Here is the code for estimating the model in Stata:
. reg ln_wage educ pexp pexp2 broken_home
      Source |       SS       df       MS              Number of obs =    1034
-------------+------------------------------           F(  4,  1029) =   51.36
       Model |  37.3778146     4  9.34445366           Prob > F      =  0.0000
    Residual |   187.21445  1029  .181938241           R-squared     =  0.1664
-------------+------------------------------           Adj R-squared =  0.1632
       Total |  224.592265  1033  .217417488           Root MSE      =  .42654

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------
Fitting the model in IPython with statsmodels
We will estimate the same model as above using statsmodels.
formula = "ln_wage ~ educ + pexp + pexp2 + broken_home"
results = smf.ols(formula,tobias_koop).fit()
print(results.summary())
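The summary table should match the Stata output above coefficient for coefficient. To pull out specific pieces for a side-by-side check, the results object exposes each column of the coefficient table directly; a small sketch using the fitted results from above:
# Rebuild the Stata-style coefficient table from the results object
comparison = pd.DataFrame({'coef': results.params,
                           'std_err': results.bse,
                           't': results.tvalues,
                           'p_value': results.pvalues})
ci = results.conf_int()              # 95% confidence intervals
comparison['ci_low'] = ci[0]
comparison['ci_high'] = ci[1]
print(comparison)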
OLS Regression with robust standard errors
Recall that for robust standard errors, we first recover our OLS estimates ($\mathbf{b}$) of $\beta$. Using $\mathbf{b}$, construct the residuals $\mathbf{e} = \mathbf{y - xb}$. From there, calculate the robust variance/covariance matrix of the estimated parameters as
$$ Var(\mathbf{b})_{robust} = \mathbf{(x'x)^{-1}x'diag(ee')x(x'x)^{-1}}\times \frac{N}{N-K} $$
and take the square roots of its diagonal elements as the standard errors.
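To make the formula concrete, here is a minimal numpy sketch of that calculation on the tobias_koop dataframe; it should reproduce the robust (HC1) standard errors reported below:
# Design matrix: constant plus the four regressors
x = sm.add_constant(tobias_koop[['educ', 'pexp', 'pexp2', 'broken_home']]).values
y = tobias_koop.ln_wage.values
N, K = x.shape
b = np.linalg.solve(x.T.dot(x), x.T.dot(y))    # OLS estimates
e = y - x.dot(b)                               # residuals
xtx_inv = np.linalg.inv(x.T.dot(x))
meat = x.T.dot(np.diag(e**2)).dot(x)           # x'diag(ee')x
vcov_robust = xtx_inv.dot(meat).dot(xtx_inv) * N / (N - K)
print(np.sqrt(np.diag(vcov_robust)))           # robust standard errors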
Fitting the model in Stata
In Stata, we would run this:
. reg ln_wage educ pexp pexp2 broken_home, robust
Linear regression                                      Number of obs =    1034
                                                       F(  4,  1029) =   64.82
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1664
                                                       Root MSE      =  .42654

------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0099294     8.59   0.000     .0657882    .1047568
        pexp |   .2035214   .0226835     8.97   0.000     .1590101    .2480326
       pexp2 |  -.0124126   .0022447    -5.53   0.000    -.0168173   -.0080079
 broken_home |  -.0087254   .0312381    -0.28   0.780     -.070023    .0525723
       _cons |   .4603326   .1315416     3.50   0.000     .2022121     .718453
------------------------------------------------------------------------------
Fitting the model in IPython
In IPython, we don't need to rerun the model. The fitted OLS results already support a number of different types of robust standard errors (one of which, HC1, uses the method outlined above). All we need to do is create a new results instance with the covariance type we want:
results_robust = results.get_robustcov_results(cov_type='HC1')
print(results_robust.summary())
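Equivalently, the robust covariance type can be requested when fitting, skipping the extra step:
# Same HC1 robust results, requested directly at fit time
results_robust = smf.ols(formula, tobias_koop).fit(cov_type='HC1')
print(results_robust.summary())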
Bootstrapped standard errors
Fitting the model in Stata
If instead we want to bootstrap our standard errors, in Stata we would do this:
. bstrap: reg ln_wage educ pexp pexp2 broken_home
(running regress on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.................................................. 50
Linear regression                               Number of obs      =      1034
                                                Replications       =        50
                                                Wald chi2(4)       =    290.46
                                                Prob > chi2        =    0.0000
                                                R-squared          =    0.1664
                                                Adj R-squared      =    0.1632
                                                Root MSE           =    0.4265

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725    .010622     8.03   0.000     .0644539    .1060912
        pexp |   .2035214   .0217148     9.37   0.000     .1609611    .2460816
       pexp2 |  -.0124126   .0021076    -5.89   0.000    -.0165433   -.0082818
 broken_home |  -.0087254   .0330727    -0.26   0.792    -.0735466    .0560958
       _cons |   .4603326   .1405716     3.27   0.001     .1848173    .7358479
------------------------------------------------------------------------------
It is important to note that the above set of results implicitly assumes normality, since the confidence intervals are symmetric about the point estimates (and since standard errors are reported).
Fitting the model in IPython
This isn't as straightforward in IPython. We need to do a bit of wrangling to get the bootstrap working, by writing our own routine:
R = 50
results_boot = np.zeros((R,results.params.shape[0]))
row_id = range(0,tobias_koop.shape[0])

for r in range(R):
    # Sample row numbers with replacement:
    this_sample = np.random.choice(row_id, size=tobias_koop.shape[0], replace=True)
    # Define data for this replicate:
    tobias_koop_r = tobias_koop.iloc[this_sample]
    # Estimate model and keep the parameter vector:
    results_r = smf.ols(formula,tobias_koop_r).fit().params
    # Store in row r of results_boot:
    results_boot[r,:] = np.asarray(results_r)
The array results_boot now contains all of our replicates. Put it in a pandas dataframe and summarize:
# Convert results to pandas dataframe for easier analysis:
results_boot = pd.DataFrame(results_boot,columns=['b_Intercept','b_educ','b_pexp','b_pexp2','b_broken_home'])
results_boot.describe(percentiles=[.025,.975])
In the above set of results, the standard deviation (std) can be interpreted as the parameter standard error if we are willing to impose normality. Alternatively, we could construct 95% confidence intervals by looking at the 2.5% and 97.5% percentiles.
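For completeness, here is a short sketch of both approaches using the results_boot dataframe from above:
# Normal-based: bootstrap standard errors from the replicate standard deviations
se_boot = results_boot.std()
# Percentile-based: 95% confidence intervals without a normality assumption
ci = results_boot.quantile([0.025, 0.975]).T
ci.columns = ['ci_low', 'ci_high']
print(pd.concat([se_boot.rename('se_boot'), ci], axis=1))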