Heckman Code
The Python code generating the toy data for the figures above is given below. This version examines Case 4.1.
# start a connected stata17 session
from pystata import config
config.init('be')
  ___  ____  ____  ____  ____ ©
 /__    /   ____/   /   ____/      17.0
___/   /   /___/   /   /___/       BE—Basic Edition
Statistics and Data Science Copyright 1985-2021 StataCorp LLC
StataCorp
4905 Lakeway Drive
College Station, Texas 77845 USA
800-STATA-PC https://www.stata.com
979-696-4600 stata@stata.com
Stata license: Single-user perpetual
Serial number: 301706306291
Licensed to: Rob Hicks
College of William and Mary
Notes:
1. Unicode is supported; see help unicode_advice.
import numpy as np
import pandas as pd
# true parameters
rho_t = np.array([0.8])
rho_x_w_t = np.array([0.8])
gamma_t = np.array([.5,1.0])
beta_t = np.array([-2.0,0.5])
sigma_e_t = np.array([1.0])
N = 5000
# generate toy data consistent with heckman:
# generate potentially correlated x,w data
mean_x_w = np.array([0,0])
cov_x_w = np.array([[1,rho_x_w_t[0]],[rho_x_w_t[0], 1]])
w, x = np.random.multivariate_normal(mean_x_w, cov_x_w, N).T
# add constant to first position and convert to DataFrame
w_ = pd.DataFrame(np.c_[np.ones(N),w],columns=['Constant (Selection)','Slope (Selection)'])
x_ = pd.DataFrame(np.c_[np.ones(N),x], columns=['Constant','Slope'])
# generate errors
# (note: the off-diagonal should in general be rho * sigma_e, and the lower-right
# entry sigma_e**2; using rho_t[0] and sigma_e_t[0] directly is valid here only
# because sigma_e_t = 1)
mean_u_eps = np.array([0,0])
cov_u_eps = np.array([[1,rho_t[0]],[rho_t[0],sigma_e_t[0]]])
u, epsilon = np.random.multivariate_normal(mean_u_eps, cov_u_eps, N).T
# generate latent zstar
zstar = w_.dot(gamma_t) + u
# generate observed z (indicator=1 if zstar is positive)
z = zstar > 0
# generate latent ystar
ystar = x_.dot(beta_t) + epsilon
y = ystar.copy()
# generate observed y [if z=0, set y to NaN]
y[~z] = np.nan
stata_data = pd.DataFrame(np.c_[y,z,x,w], columns=['y','z','x','w'])
stata_data.head()
| | y | z | x | w |
|---|---|---|---|---|
| 0 | -1.234971 | 1.0 | 0.705421 | 1.638497 |
| 1 | NaN | 0.0 | -0.705640 | -0.496328 |
| 2 | -2.111768 | 1.0 | 0.630919 | 0.375758 |
| 3 | -2.386395 | 1.0 | -0.841180 | -0.265511 |
| 4 | -2.297429 | 1.0 | 0.637365 | 0.953034 |
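As a sanity check on the selection mechanism: since zstar = 0.5 + 1.0*w + u with w and u independent standard normals, zstar is distributed N(0.5, 2), so the theoretical selection rate can be computed in closed form. A minimal sketch (the parameter names mirror gamma_t from the simulation above):

```python
import numpy as np
from scipy.stats import norm

# zstar = gamma0 + gamma1*w + u, with w ~ N(0,1) and u ~ N(0,1) independent,
# so zstar ~ N(gamma0, gamma1**2 + 1) and P(z = 1) = Phi(gamma0 / sqrt(gamma1**2 + 1))
gamma0, gamma1 = 0.5, 1.0
p_selected = norm.cdf(gamma0 / np.sqrt(gamma1**2 + 1))
print(f"theoretical selection rate: {p_selected:.3f}")   # about 0.638
print(f"expected uncensored obs of 5000: {5000 * p_selected:.0f}")
```

The simulated sample has 3,153 uncensored observations (63.1%), in line with this.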
We can estimate a two-step Heckman model in Python using the Heckman class from an unmerged StatsModels branch (this replicates Stata's two-step results).
import heckman
res = heckman.Heckman(y, x_, w_).fit(method='twostep')
print(res.summary())
Heckman Regression Results
=======================================
Dep. Variable: y
Model: Heckman
Method: Heckman Two-Step
Date: Wed, 12 May 2021
Time: 12:22:11
No. Total Obs.: 5000
No. Censored Obs.: 1847
No. Uncensored Obs.: 3153
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Constant -1.9881 0.037 -53.394 0.000 -2.061 -1.915
Slope 0.4999 0.024 20.649 0.000 0.452 0.547
========================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
Constant (Selection) 0.4736 0.022 21.891 0.000 0.431 0.516
Slope (Selection) 0.9805 0.028 35.587 0.000 0.927 1.035
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
IMR (Lambda) 0.7412 0.062 12.026 0.000 0.620 0.862
=====================================
rho: 0.753
sigma: 0.985
=====================================
The first table reports the estimates for the regression (response) equation.
The second table reports the estimates for the selection equation.
The third table reports the estimated coefficient on the inverse Mills ratio (Heckman's lambda).
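For intuition, the two-step procedure behind these tables can be sketched with just NumPy and SciPy: a probit of z on w gives fitted selection indices, the inverse Mills ratio phi/Phi is evaluated at those indices, and y is regressed on x plus the IMR over the selected sample. This is a minimal sketch under the same data-generating process as above, not the estimator in the StatsModels branch, and its naive OLS standard errors for the second step would be wrong:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# regenerate toy data as in the simulation above
N = 5000
cov = np.array([[1, 0.8], [0.8, 1]])
w, x = rng.multivariate_normal([0, 0], cov, N).T
u, eps = rng.multivariate_normal([0, 0], cov, N).T
z = (0.5 + 1.0 * w + u) > 0
y = -2.0 + 0.5 * x + eps

W = np.c_[np.ones(N), w]

# step 1: probit of z on w by maximum likelihood
def neg_ll(g):
    q = W @ g
    return -(norm.logcdf(q)[z].sum() + norm.logcdf(-q)[~z].sum())

gamma_hat = minimize(neg_ll, np.zeros(2), method="BFGS").x

# step 2: OLS of y on x and the inverse Mills ratio, selected sample only
idx = W[z] @ gamma_hat
imr = norm.pdf(idx) / norm.cdf(idx)
X = np.c_[np.ones(z.sum()), x[z], imr]
beta_hat, *_ = np.linalg.lstsq(X, y[z], rcond=None)
print(beta_hat)  # approximately [-2.0, 0.5, 0.8] = (beta0, beta1, rho*sigma)
```

The coefficient on the IMR estimates rho*sigma = 0.8, matching the "IMR (Lambda)" row in the output above up to sampling error.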
In Stata, we can estimate the full-information maximum likelihood (FIML) model on the toy dataset as
%%stata -d stata_data
heckman y x, select(z=w)
Iteration 0: log likelihood = -6363.4231
Iteration 1: log likelihood = -6360.8593
Iteration 2: log likelihood = -6360.8397
Iteration 3: log likelihood = -6360.8397
Heckman selection model Number of obs = 5,000
(regression model with sample selection) Selected = 3,153
Nonselected = 1,847
Wald chi2(1) = 721.77
Log likelihood = -6360.84 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
y |
x | .5161336 .0192116 26.87 0.000 .4784796 .5537877
_cons | -2.024205 .0244713 -82.72 0.000 -2.072167 -1.976242
-------------+----------------------------------------------------------------
z |
w | .9748141 .0260147 37.47 0.000 .9238261 1.025802
_cons | .4769289 .0213639 22.32 0.000 .4350564 .5188013
-------------+----------------------------------------------------------------
/athrho | 1.116945 .0613091 18.22 0.000 .9967814 1.237109
/lnsigma | .0034814 .016585 0.21 0.834 -.0290246 .0359874
-------------+----------------------------------------------------------------
rho | .8065037 .0214307 .7602391 .844629
sigma | 1.003487 .0166429 .9713925 1.036643
lambda | .8093163 .03187 .7468522 .8717805
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0): chi2(1) = 229.21 Prob > chi2 = 0.0000
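Stata estimates rho and sigma through the unconstrained parameters /athrho and /lnsigma; the displayed values are recovered as rho = tanh(athrho), sigma = exp(lnsigma), and lambda = rho*sigma. A quick check against the output above:

```python
import numpy as np

# auxiliary parameters from the Stata output
athrho, lnsigma = 1.116945, 0.0034814

rho = np.tanh(athrho)    # inverse hyperbolic tangent transform undone: ~0.8065
sigma = np.exp(lnsigma)  # log transform undone: ~1.0035
lam = rho * sigma        # Heckman's lambda: ~0.8093
print(rho, sigma, lam)
```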
Note that we are using Stata 17's Jupyter Notebook integration via the %%stata magic.