Heckman Code
The Python code generating the toy data for the figures above is given below. This version examines Case 4.1.
# start a connected stata17 session
from pystata import config
config.init('be')
  ___  ____  ____  ____  ____ ©
 /__    /   ____/   /   ____/      17.0
___/   /   /___/   /   /___/       BE—Basic Edition
Statistics and Data Science Copyright 1985-2021 StataCorp LLC
StataCorp
4905 Lakeway Drive
College Station, Texas 77845 USA
800-STATA-PC https://www.stata.com
979-696-4600 stata@stata.com
Stata license: Single-user perpetual
Serial number: 301706306291
Licensed to: Rob Hicks
College of William and Mary
Notes:
1. Unicode is supported; see help unicode_advice.
import numpy as np
import pandas as pd
# true parameters
rho_t = np.array([0.8])
rho_x_w_t = np.array([0.8])
gamma_t = np.array([.5,1.0])
beta_t = np.array([-2.0,0.5])
sigma_e_t = np.array([1.0])
N = 5000
# generate toy data consistent with heckman:
# generate potentially correlated x,w data
mean_x_w = np.array([0,0])
cov_x_w = np.array([[1,rho_x_w_t[0]],[rho_x_w_t[0], 1]])
w, x = np.random.multivariate_normal(mean_x_w, cov_x_w, N).T
# add constant to first position and convert to DataFrame
w_ = pd.DataFrame(np.c_[np.ones(N),w],columns=['Constant (Selection)','Slope (Selection)'])
x_ = pd.DataFrame(np.c_[np.ones(N),x], columns=['Constant','Slope'])
# generate errors
# (note: the off-diagonal should in general be rho * sigma_e, and the lower-right
# entry sigma_e**2; using rho_t[0] and sigma_e_t[0] directly is valid here only
# because sigma_e_t = 1)
mean_u_eps = np.array([0,0])
cov_u_eps = np.array([[1,rho_t[0]],[rho_t[0],sigma_e_t[0]]])
u, epsilon = np.random.multivariate_normal(mean_u_eps, cov_u_eps, N).T
# generate latent zstar
zstar = w_.dot(gamma_t) + u
# generate observed z (indicator=1 if zstar is positive)
z = zstar > 0
# generate latent ystar
ystar = x_.dot(beta_t) + epsilon
y = ystar.copy()
# generate observed y [if z=0, set y to NaN]
y[~z] = np.nan
stata_data = pd.DataFrame(np.c_[y,z,x,w], columns=['y','z','x','w'])
stata_data.head()
| | y | z | x | w |
|---|---|---|---|---|
| 0 | -1.234971 | 1.0 | 0.705421 | 1.638497 |
| 1 | NaN | 0.0 | -0.705640 | -0.496328 |
| 2 | -2.111768 | 1.0 | 0.630919 | 0.375758 |
| 3 | -2.386395 | 1.0 | -0.841180 | -0.265511 |
| 4 | -2.297429 | 1.0 | 0.637365 | 0.953034 |
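As a sanity check on the selection mechanism: since zstar = 0.5 + 1.0*w + u with w and u independent standard normals, zstar is distributed N(0.5, 2), so the theoretical selection rate can be computed in closed form. A minimal sketch (the parameter names mirror gamma_t from the simulation above):

```python
import numpy as np
from scipy.stats import norm

# zstar = gamma0 + gamma1*w + u, with w ~ N(0,1) and u ~ N(0,1) independent,
# so zstar ~ N(gamma0, gamma1**2 + 1) and P(z = 1) = Phi(gamma0 / sqrt(gamma1**2 + 1))
gamma0, gamma1 = 0.5, 1.0
p_selected = norm.cdf(gamma0 / np.sqrt(gamma1**2 + 1))
print(f"theoretical selection rate: {p_selected:.3f}")   # about 0.638
print(f"expected uncensored obs of 5000: {5000 * p_selected:.0f}")
```

The simulated sample has 3,153 uncensored observations (63.1%), in line with this.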
We can estimate a two-step Heckman model in Python using the Heckman class from an unmerged StatsModels branch (this replicates Stata's two-step results).
import heckman
res = heckman.Heckman(y, x_, w_).fit(method='twostep')
print(res.summary())
Heckman Regression Results
=======================================
Dep. Variable: y
Model: Heckman
Method: Heckman Two-Step
Date: Wed, 12 May 2021
Time: 12:22:11
No. Total Obs.: 5000
No. Censored Obs.: 1847
No. Uncensored Obs.: 3153
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Constant -1.9881 0.037 -53.394 0.000 -2.061 -1.915
Slope 0.4999 0.024 20.649 0.000 0.452 0.547
========================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
Constant (Selection) 0.4736 0.022 21.891 0.000 0.431 0.516
Slope (Selection) 0.9805 0.028 35.587 0.000 0.927 1.035
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
IMR (Lambda) 0.7412 0.062 12.026 0.000 0.620 0.862
=====================================
rho: 0.753
sigma: 0.985
=====================================
The first table reports the estimates for the regression (response) equation.
The second table reports the estimates for the selection equation.
The third table reports the estimated coefficient on the inverse Mills ratio (Heckman's lambda).
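For intuition, the two-step procedure behind these tables can be sketched with just NumPy and SciPy: a probit of z on w gives fitted selection indices, the inverse Mills ratio phi/Phi is evaluated at those indices, and y is regressed on x plus the IMR over the selected sample. This is a minimal sketch under the same data-generating process as above, not the estimator in the StatsModels branch, and its naive OLS standard errors for the second step would be wrong:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# regenerate toy data as in the simulation above
N = 5000
cov = np.array([[1, 0.8], [0.8, 1]])
w, x = rng.multivariate_normal([0, 0], cov, N).T
u, eps = rng.multivariate_normal([0, 0], cov, N).T
z = (0.5 + 1.0 * w + u) > 0
y = -2.0 + 0.5 * x + eps

W = np.c_[np.ones(N), w]

# step 1: probit of z on w by maximum likelihood
def neg_ll(g):
    q = W @ g
    return -(norm.logcdf(q)[z].sum() + norm.logcdf(-q)[~z].sum())

gamma_hat = minimize(neg_ll, np.zeros(2), method="BFGS").x

# step 2: OLS of y on x and the inverse Mills ratio, selected sample only
idx = W[z] @ gamma_hat
imr = norm.pdf(idx) / norm.cdf(idx)
X = np.c_[np.ones(z.sum()), x[z], imr]
beta_hat, *_ = np.linalg.lstsq(X, y[z], rcond=None)
print(beta_hat)  # approximately [-2.0, 0.5, 0.8] = (beta0, beta1, rho*sigma)
```

The coefficient on the IMR estimates rho*sigma = 0.8, matching the "IMR (Lambda)" row in the output above up to sampling error.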
In Stata, we can estimate the full-information maximum likelihood (FIML) model on the toy dataset as
%%stata -d stata_data
heckman y x, select(z=w)
Iteration 0: log likelihood = -6363.4231
Iteration 1: log likelihood = -6360.8593
Iteration 2: log likelihood = -6360.8397
Iteration 3: log likelihood = -6360.8397
Heckman selection model Number of obs = 5,000
(regression model with sample selection) Selected = 3,153
Nonselected = 1,847
Wald chi2(1) = 721.77
Log likelihood = -6360.84 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
y |
x | .5161336 .0192116 26.87 0.000 .4784796 .5537877
_cons | -2.024205 .0244713 -82.72 0.000 -2.072167 -1.976242
-------------+----------------------------------------------------------------
z |
w | .9748141 .0260147 37.47 0.000 .9238261 1.025802
_cons | .4769289 .0213639 22.32 0.000 .4350564 .5188013
-------------+----------------------------------------------------------------
/athrho | 1.116945 .0613091 18.22 0.000 .9967814 1.237109
/lnsigma | .0034814 .016585 0.21 0.834 -.0290246 .0359874
-------------+----------------------------------------------------------------
rho | .8065037 .0214307 .7602391 .844629
sigma | 1.003487 .0166429 .9713925 1.036643
lambda | .8093163 .03187 .7468522 .8717805
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0): chi2(1) = 229.21 Prob > chi2 = 0.0000
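Stata estimates rho and sigma through the unconstrained parameters /athrho and /lnsigma; the displayed values are recovered as rho = tanh(athrho), sigma = exp(lnsigma), and lambda = rho*sigma. A quick check against the output above:

```python
import numpy as np

# auxiliary parameters from the Stata output
athrho, lnsigma = 1.116945, 0.0034814

rho = np.tanh(athrho)    # inverse hyperbolic tangent transform undone: ~0.8065
sigma = np.exp(lnsigma)  # log transform undone: ~1.0035
lam = rho * sigma        # Heckman's lambda: ~0.8093
print(rho, sigma, lam)
```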
Note that we are using Stata 17's Jupyter Notebook integration via the %%stata magic.