Reproducible Research and Literate Programming for Econometrics

We are interested in estimating the model

\begin{equation} \mathbf{y^*=xb + \epsilon} \end{equation}

but for a subset of our data, the dependent variable is either missing or coded to some arbitrary values (e.g. 0 or -999). Of concern however, is that the pattern of this missingness is non-random in a way that could induce bias in our estimated \(\beta\)'s if we apply OLS to the above equation.

Suppose that the pattern of missingness (I'll refer to this as censored hereafter) is related to the latent (unobserved) process

\begin{equation} \mathbf{z}^* = \mathbf{w}\gamma + \mathbf{u} \end{equation}

From this process, the researcher can observe

\begin{align} z_i =& 1 \text{ if } z^*_i > 0 \\ =&0 \text{ if } z^*_i \le 0 \end{align}

or \(z_i=1\) (\(y_i\) not censored) when

\begin{equation} u_i \ge - \mathbf{w}_i\gamma \end{equation}

The probability of \(y_i\) not censored is

\begin{align} Pr(u_i \ge - \mathbf{w}_i\gamma) =& 1- \Phi(-\mathbf{w}_i\gamma)\\ &=\Phi(\mathbf{w}_i\gamma) \end{align}

if we are willing to assume that \(\mathbf{u}\sim N(\mathbf{0},\mathbf{I})\). Note for identification purposes in the Heckman Model we restrict \(Var(u_i) = 1\). Also note that \(1- \Phi(-\mathbf{w}_i\gamma)=\Phi(\mathbf{w}_i\gamma)\) by symmetry of the standard normal distribution.

Having constructed a model a model for censoring, we can construct "amounts" equation as follows. Denoting \(\mathbf{y}\) as the not censored (observed) dependent variable, the censoring model defines what is in the estimation sample as

\begin{align} y_i = y^*_i = \mathbf{x}_i \beta + \epsilon_i \text{ observed, if \(z_i\) = 1} \end{align}

Finally, the joint distribution of the errors in the selection (\(u_i\)) and amounts equation (\(\epsilon\)) is distributed iid as

\begin{equation} \begin{bmatrix} u_i \\ \epsilon_i \end{bmatrix} \sim Normal\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & \sigma^2_\epsilon \end{bmatrix} \right) \end{equation}

To see how the selection and amounts model are related, consider

\begin{align} E(y_i | y_i \text{ observed}) = & E(y_i | z^* > 0 ) \\ =&E(y_i | u_i > -\mathbf{w}_i \gamma) \\ =&\mathbf{x}_i \beta + E(\epsilon_i | u_i > -\mathbf{w}_i \gamma) \\ =&\mathbf{x}_i \beta + \rho \sigma_\epsilon \frac{\phi(\mathbf{w}_i \gamma)}{\Phi(\mathbf{w}_i \gamma)} \end{align}

What is immediately apparent is that the conditional mean (\(E(y_i | y_i \text{ observed})\)) differs from the unconditional mean (\(\mathbf{x}_i\beta\)) only if \(\rho \neq 0\) since all the other elements in the far right hand term (i.e., the variance of the error in the amounts equation, \(\sigma_\epsilon\), and the Inverse Mills Ratio, \(\frac{\phi(\mathbf{w}_i \gamma)}{\Phi(\mathbf{w}_i \gamma)}\)) in the preceding equation are strictly positive. So if the errors in the amounts and selection equations are uncorrelated (\(\rho=0\)) we can safely apply ordinary least squares to uncover unbiased estimates for \(\beta\) and can ignore endogenous selection effects and the selection equation portion of the model.

However, there are some further nuances to consider. Denoting the correlation amongst the two sets of independent variables in the model \(\mathbf{x},\mathbf{w}\) as \(\rho_{\mathbf{x},\mathbf{w}}\), we examine the following four cases of correlation patterns amongst data and errors. As the final equation above shows, when \(\rho\) is zero (Cases 1 and 2) in the Table below, OLS will always give unbiased \(\beta\) even if \(\rho_{\mathbf{x},\mathbf{w}}\) is non-zero. This is happening because the missingness patterns in \(\mathbf{y}\) are essentially random or select cases in ways that are symmetric around the regression line that preserves slope and intercept coefficients (we have plots of these cases later for providing more intuition about this statement).

Case	\(\rho\)	\(\rho_{\mathbf{x},\mathbf{w}}\)	Intercept	Slope Coefficents
1	\(0\)	\(0\)	Unbiased	Unbiased
2	\(0\)	\(\neq 0\)	Unbiased	Unbiased
3	\(\neq 0\)	\(0\)	Biased	Unbiased
4	\(\neq 0\)	\(\neq 0\)	Biased	Biased

Cases 3 and 4 are more interesting and represent instances where OLS will be inconsistent and we should apply the Heckman Model. For gaining intuition about how cases 3 and 4 differ consider again the conditional mean

\begin{equation} E(y_i | y_i \text{ observed}) = \mathbf{x}_i \beta + \rho \sigma_\epsilon \frac{\phi(\mathbf{w}_i \gamma)}{\Phi(\mathbf{w}_i \gamma)} \end{equation}

We can think of the preceding equation as guiding the regression equation we ought to be estimating if the sample selection process as outlined above is happening. If instead, we ignore the final term (\(\rho \sigma_\epsilon \frac{\phi(\mathbf{w}_i \gamma)}{\Phi(\mathbf{w}_i \gamma)}\)) when in fact \(\rho\neq 0\) then we are putting this information into the error term of the misppecified OLS model regressing \(\mathbf{y}\) on \(\mathbf{x}\) making the estimation equation

\begin{equation} \mathbf{y = x \tilde{\beta} + \psi} \end{equation}

where \(\psi = \rho \sigma_\epsilon \mathbf{IMR} + \epsilon\), where \(\mathbf{IMR}\) is the \(N \times 1\) vector

\begin{equation} \begin{bmatrix} \frac{\phi(\mathbf{w}_1 \gamma)}{\Phi(\mathbf{w}_1 \gamma)}\\ \vdots \\ \frac{\phi(\mathbf{w}_i \gamma)}{\Phi(\mathbf{w}_i \gamma)} \\ \vdots \\ \frac{\phi(\mathbf{w}_N \gamma)}{\Phi(\mathbf{w}_N \gamma)} \end{bmatrix} \end{equation}

For the OLS estimator to be unbiased in this context, we need \(E[\mathbf{b}] = \beta\), so that (without showing steps in proof which is analogous to the OLS proof of unbiasedness):

\begin{align} E[\mathbf{b}] \mathbf{x}'\psi|\mathbf{x}] =& \beta +E[\rho \sigma_\epsilon \mathbf{x}'\mathbf{IMR} + \mathbf{x}'\epsilon] \\ =& \beta + \rho \sigma E[\mathbf{x}'\mathbf{IMR}] \end{align}

since \(E[\mathbf{x}'\epsilon]=0\) and \(\rho\) and \(\sigma_\epsilon\) are unknown constants. If \(\rho \neq 0\), this term won't be zero. Simplifying the last term from above further, we have

\begin{align} \rho \sigma E[\mathbf{x}'\mathbf{IMR}] = \rho \sigma \mathbf{x}'\mathbf{IMR} + Cov(\mathbf{x}',\mathbf{IMR}) \end{align}

which needs to be zero for unbiasedness of the OLS model.

For case 3, the independent variables in the model \(\mathbf{x}\) and \(\mathbf{w}\) are uncorrelated so it is likely that \(Cov(\mathbf{x}',\mathbf{IMR})\) is very close to zero, since all variation in the relationship is being driven by the two sets of independent variables in the model (\(\mathbf{x}\) and \(\mathbf{w}\)). Setting \(Cov(\mathbf{x}',\mathbf{IMR})=0\), shows that there will still be bias since \(\rho \sigma \mathbf{x}'\mathbf{IMR}\) won't be zero. Here we are adding (or subtracting depending on the sign of \(\rho\)) a constant value to the model with mean \(\rho \sigma \mathbf{x}'\mathbf{IMR}\). So this miss-specified model's constant term will need to depart from the true value \(\beta_0\) to account for this positive or negative shift in the mean value of all observation's errors. However, since there is no correlation amongst \(\mathbf{x}\) and \(\mathbf{w}\) in our miss-specified OLS model we won't impart missing variable bias on our slope coefficients.

For case 4, we have the bias imparted on our intercept coefficient as discussed above and bias induced by correlation between our model error \(\psi\) and \(\mathbf{x}\) (since \(\mathbf{x}\) and \(\mathbf{w}\) are correlated). This has the effect of also biasing our slope coefficients.

## Model Log-Likelihood

The log likelihood function for the Heckman Model is

\begin{equation} LogL(\mathbf{y},\mathbf{z} | \beta, \gamma, \rho, \sigma_\epsilon, \mathbf{x}, \mathbf{w}) = \sum_{i = 1}^N z_i log\left ( \phi \left ( \frac{y-\mathbf{x}\beta}{\sigma_\epsilon}\right ) \Phi \left(\frac{\mathbf{w_i \gamma} + \rho \left (\frac{y_i - \mathbf{x}_i \beta}{\sigma_\epsilon}\right)} {(1 - \rho^2)^{-\frac{1}{2}}}\right)\right ) + (1 - z_i)log \left( 1 - \Phi(\mathbf{w}_i\gamma) \right) \end{equation}

### Estimation

There are two estimators one can employ. The first known as the two-step method was the only practical way to estimate the model when the paper was first written. This method follows these steps:

Run Probit on the Selection Model
Recover Estimated Inverse Mills Ratio
Using Odinary Least Squares, run the regression \begin{equation} y_i = \mathbf{x}_i \beta + \rho \sigma_\epsilon \frac{\phi(\mathbf{w}_i \hat{\gamma})}{\Phi(\mathbf{w}_i \hat{\gamma})} \end{equation}
where \(\rho \sigma_\epsilon\) is treated as a single parameter to be estimated.
"Back Out" separate estimates for \(\rho\) and \(\sigma_\epsilon\)
Adjust standard errors to account for the fact that the Inverse Mills Ratio is an estimate (and hence random) covariate in the above model.

The key two steps are to first run a probit and using information from the results from that model estimate a corrected form of the OLS model. This is the only estimation method available in a beta branch of Python Statsmodels as of November 2018.

The second and preferred method is to use Maximum Likelihood over the full parameter set \(\beta, \gamma, \rho\), and \(\sigma\) in the log-likelihood function above. This is the default method in Stata and the Bayesian approach outlined below is most similar to this approach since it samples from the full joint posterior of \(\beta, \gamma, \rho\), and \(\sigma\).