# Reproducible Research and Literate Programming for Econometrics

The exogeneity assumption of the previous chapter is a particularly strong assumption. One might think of a number of cases where this would not be expected to hold. Education and earnings is an example. A wage equation that is a function of education may suffer from a problem regarding this assumption if those factors influencing wage and not accounted for in the regression equation are related to educational attainment. In a pure cross section world, unobserved individual-specific factors, for example, may influence both wage and educational attainment. The problem of endogeneity is a very big deal and one you should worry about every time you estimate any type of regression model. To see why, note that our ability to estimate unbiased parameters hinges on

$$\mathbf{E(x'\epsilon)=0}$$

If there is in fact correlation between the elements of $$\mathbf{x}$$ and $$\mathbf{\epsilon}$$ then our parameter estimates are biased.
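To make the bias concrete, here is a minimal simulation sketch in Python with NumPy. The data-generating process, the variable names, and the parameter values are all hypothetical and chosen only for illustration: an unobserved factor `c` enters both the regressor and the error, so the exogeneity condition fails.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000

# Hypothetical data-generating process: an unobserved factor c (think "ability")
# drives both the regressor and the error, so E(x'eps) != 0.
c = rng.normal(size=N)
x = c + rng.normal(size=N)
eps = c + rng.normal(size=N)
y = 1.0 + 2.0 * x + eps          # true intercept 1, true slope 2

X = np.column_stack([np.ones(N), x])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("sample analogue of E(x'eps):", x @ eps / N)
print("OLS estimates (true values 1 and 2):", b_ols)
```

In this setup the OLS slope settles well above the true value of 2, no matter how large the sample is.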

## Instrumental Variables Regression

### An exactly identified model (1 endogenous variable and 1 instrument)

In this section, we discuss ways of addressing endogeneity in the OLS framework. As before, assume a linear model

$$\label{eq:ivbasic_framework} \mathbf{y}=\mathbf{x}\beta+\epsilon=\beta_0+\beta_2 \mathbf{x_2} + \beta_3 \mathbf{x_3} + \ldots + \beta_K \mathbf{x_K} + \mathbf{\epsilon}$$

where $$\mathbf{x}_j$$ is an $$N \times 1$$ column vector with data from column $$j$$, and where the following hold:

• $$E(\mathbf{\epsilon})=0$$
• $$E(\mathbf{x_{j}\epsilon})=0$$ for $$j=1,2,\ldots,K-1$$

Notice that we are invoking the exogeneity assumption for some of our explanatory variables but not all of them. The explanatory variable $$\mathbf{x_K}$$ is potentially endogenous and a failure to deal with this will potentially lead to biased parameter estimates.

The method of instrumental variables offers a way of handling this problem. Letting the instrumental variable be denoted as $$\mathbf{z}_K$$, we need it to have these properties:

• Assumption 1: $$E(\mathbf{z'\epsilon})=0$$
• Assumption 2: And, for the following linear relationship,

$$\label{eq:reducedformx} \mathbf{x_K}=\delta_0 + \delta_2 \mathbf{x_2}+\ldots+ \delta_{K-1} \mathbf{x_{K-1}}+\theta_K \mathbf{z_K} + \mathbf{r}$$

where $$E(\mathbf{r})=0$$ and $$\mathbf{r}$$ is uncorrelated with the right-hand-side variables. We need $$\theta_K$$ to be non-zero for $$\mathbf{z_K}$$ to be a valid instrument. This amounts to saying that $$\mathbf{z_K}$$ is partially correlated with $$\mathbf{x_K}$$ after netting out the effects of $$\mathbf{x_1,\ldots,x_{K-1}}$$.

When $$\mathbf{z_K}$$ satisfies these conditions, it is called an Instrumental Variable (IV) candidate for $$\mathbf{x_K}$$. The next step is understanding how one uses this structure for estimating the parameters of interest ($$\mathbf{\beta}$$). One option, which turns out not to work, is to drop our endogenous variable ($$\mathbf{x}_K$$), replace it with our IV ($$\mathbf{z}_K$$), and then estimate an OLS model.

#### Using the IV

Define the matrix $$\mathbf{z}$$ as

$$\mathbf{z} = \begin{bmatrix} 1 & x_{12} & \ldots & x_{1,K-1} & z_{1,K} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{i2} & \ldots & x_{i,K-1} & z_{i,K} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{N2} & \ldots & x_{N,K-1} & z_{N,K} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_2 &\ldots & \mathbf{x}_{K-1} & \mathbf{z}_K \end{bmatrix}$$

Notice that in doing this we have made a drop-in replacement of $$\mathbf{x}_K$$ with $$\mathbf{z}_K$$.

So how do we use the IV (found in $$\mathbf{z}$$) to estimate $$\beta$$? One idea is to define $$\mathbf{b}^{iv}$$ as $$\mathbf{(z'z)^{-1}z'y}$$, since we have already established that $$\mathbf{z}_K$$ is exogenous and correlated with $$\mathbf{x}_K$$ even after controlling for all of the other exogenous information at hand. What are the properties of such an estimator? To examine this, substitute the reduced-form ("relevancy") equation \eqref{eq:reducedformx} into our original estimating equation:

\begin{eqnarray} \mathbf{y}&=&\beta_0 + \beta_2 \mathbf{x}_2 + \ldots + \beta_{K-1} \mathbf{x}_{K-1} \nonumber \\ & &+ \beta_K (\delta_0 + \delta_2 \mathbf{x}_2 + \ldots + \delta_{K-1} \mathbf{x}_{K-1} + \theta_K \mathbf{z}_K + \mathbf{r}) + \epsilon \\ &=& (\beta_0 + \beta_K \delta_0) + (\beta_2 + \beta_K \delta_2) \mathbf{x}_2 + \ldots + (\beta_{K-1} + \beta_{K} \delta_{K-1}) \mathbf{x}_{K-1} \nonumber \\ & &+ (\beta_K \theta_K)\mathbf{z}_K + (\beta_K \mathbf{r}+\epsilon)\\ &=&\alpha_0 + \alpha_2 \mathbf{x}_2 + \ldots + \alpha_{K-1} \mathbf{x}_{K-1} + \alpha_K \mathbf{z}_K + \mathbf{v} \end{eqnarray}

If we run this regression, we obtain estimates defined as $$\mathbf{a} = \mathbf{(z'z)^{-1}z'y}$$. From this we can see that substituting our instrumental variable in as a proxy for $$\mathbf{x}_K$$ and recovering the parameters $$\mathbf{a}$$ gives estimates where:

• In general $$\alpha_k \neq \beta_k$$ for every parameter you estimate, not just the endogenous one, $$\beta_K$$. For example, $$\alpha_0 = \beta_0 + \beta_K \delta_0 \neq \beta_0$$
• The errors ($$\mathbf{v}=\beta_K \mathbf{r}+\epsilon$$) are not distributed $$N(0,\sigma^2\mathbf{I})$$

Consequently, for $$\mathbf{a} = \mathbf{(z'z)^{-1}z'y}$$,

• $$E[\mathbf{a}] \neq \beta$$, so this is not a good IV estimator.
• Given an estimate for $$\mathbf{a}$$, we can't solve for the $$K$$ estimates for $$\beta$$ because we have $$K$$ equations in $$2\times K$$ unknowns.
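The problem with the proxy regression can be checked numerically. The sketch below uses a hypothetical simulated design (the parameter values $$\beta=(1,2,3)$$, $$\delta_0=0$$, $$\delta_2=0.5$$, $$\theta_K=0.8$$ and all variable names are assumptions made for illustration) and shows that OLS of $$\mathbf{y}$$ on $$\mathbf{z}$$ recovers the composite parameters $$\alpha$$ rather than $$\beta$$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20000

# Hypothetical parameters: beta = (1, 2, 3), delta_0 = 0, delta_2 = 0.5, theta_K = 0.8
x2 = rng.normal(size=N)
zK = rng.normal(size=N)
c = rng.normal(size=N)                        # unobserved factor causing endogeneity
xK = 0.5 * x2 + 0.8 * zK + c + rng.normal(size=N)
eps = c + rng.normal(size=N)
y = 1.0 + 2.0 * x2 + 3.0 * xK + eps

# OLS of y on z = [1, x2, zK], i.e. x_K replaced by its instrument
Z = np.column_stack([np.ones(N), x2, zK])
a = np.linalg.solve(Z.T @ Z, Z.T @ y)

# alpha_0 = beta_0 + beta_K*delta_0, alpha_2 = beta_2 + beta_K*delta_2, alpha_K = beta_K*theta_K
alpha = np.array([1.0 + 3.0 * 0.0, 2.0 + 3.0 * 0.5, 3.0 * 0.8])
print("a             :", a)
print("implied alpha :", alpha)
print("true beta     :", np.array([1.0, 2.0, 3.0]))
```

The estimates track $$\alpha$$, and without separate knowledge of $$\delta$$ and $$\theta_K$$ there is no way to back out $$\beta$$ from them.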

A better way of defining our instrumental variable estimator follows from our assumption of no correlation between $$\mathbf{z}$$ and $$\epsilon$$:

\begin{eqnarray} 0 &=& E[\mathbf{z'\epsilon}] \\ &=& E[\mathbf{z'(y-x\beta)}] \end{eqnarray}

from our assumptions above. Premultiplying the model by $$\mathbf{z'}$$ gives $$\mathbf{z'y=z'x \beta + z'\epsilon}$$, and taking expectations results in

$$\mathbf{E(z'y)=E(z'x \beta)}$$

This gives us $$K$$ equations in $$K$$ unknowns, which can be solved. Simplifying further, we can write $$\mathbf{\beta}$$ as

$$\mathbf{\beta=[E(z'x)]^{-1}E[z'y]}$$

where both expectation terms can be consistently estimated using a random sample on ($$\mathbf{x}$$, $$\mathbf{y}$$, and $$\mathbf{z_K}$$). A very important point to remember is that this system of equations is solvable only if

$$rank(E[\mathbf{z'x}])=K$$

This rank condition is what delivers identification, since a square matrix that does not have full rank cannot be inverted. In practice, this estimator can be implemented given a random sample from the population as

$$\mathbf{b}^{IV}=\mathbf{(z'x)^{-1} z'y}=\mathbf{(\hat{x}'\hat{x})^{-1} \hat{x}'y}$$

where $$\hat{\mathbf{x}}$$ is equal to $$\mathbf{z}(\mathbf{z'z})^{-1}\mathbf{z'x}$$, or the predicted value of $$\mathbf{x}$$ given our instruments $$\mathbf{z}$$.

It is important to note that $$\mathbf{b}^{IV}$$ is a consistent, but in finite samples generally biased, estimator for $$\beta$$, so we rely on large sample properties.
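The estimator can be sketched in a few lines of NumPy. The simulated design below is hypothetical (true $$\beta=(1,2,3)$$, one endogenous regressor `xK`, one instrument `zK`); it computes $$\mathbf{(z'x)^{-1}z'y}$$ directly, checks the rank condition on the sample analogue $$\mathbf{z'x}$$, and confirms that the projection form $$\mathbf{(\hat{x}'\hat{x})^{-1}\hat{x}'y}$$ gives the same numbers in the exactly identified case.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5000

# Hypothetical design: c makes xK endogenous, zK is a valid and relevant instrument
x2 = rng.normal(size=N)
zK = rng.normal(size=N)
c = rng.normal(size=N)
xK = 0.5 * x2 + 0.8 * zK + c + rng.normal(size=N)
eps = c + rng.normal(size=N)
y = 1.0 + 2.0 * x2 + 3.0 * xK + eps

X = np.column_stack([np.ones(N), x2, xK])
Z = np.column_stack([np.ones(N), x2, zK])

# Rank condition on the sample analogue of E[z'x]
print("rank of z'x:", np.linalg.matrix_rank(Z.T @ X), "with K =", X.shape[1])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)           # biased under endogeneity
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)            # (z'x)^{-1} z'y

X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)       # z (z'z)^{-1} z'x
b_proj = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

print("OLS :", b_ols)
print("IV  :", b_iv)
print("proj:", b_proj)
```

Solving the linear system with `np.linalg.solve` rather than forming an explicit inverse is a numerical convenience only; the estimator is the same.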

#### Testing for Suitable Instruments

In the preceding section, we saw that two assumptions were necessary for having a suitable instrument:

• Assumption 1: Orthogonality of the instrument and model errors
$$E(\mathbf{z'\epsilon})=0$$
• Assumption 2: The partial correlation estimated in Equation \eqref{eq:reducedformx} is non-zero
$$\theta_K \ne 0$$

Ideally, we would like to be able to test our assumptions to ensure that our candidate instrumental variable meets these conditions. If our instrumental variable does not conform with these assumptions, our estimates of $$\mathbf{\beta}$$ will be biased. Assumption 1 can't be formally tested since the true model errors, $$\mathbf{\epsilon}$$, are not observed. However, the second condition can and should be tested by estimating equation \eqref{eq:reducedformx} and conducting a t-test on the $$\theta_K$$ parameter. Lower p-values (equivalently, larger t-statistics in absolute value) accord with stronger instruments, as would be expected.
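The relevance check can be sketched as an ordinary OLS t-test on the instrument's coefficient in the reduced form. The simulated data and the value $$\theta_K=0.8$$ are hypothetical; the standard errors are the conventional homoskedastic OLS ones.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5000

# Hypothetical reduced form with theta_K = 0.8
x2 = rng.normal(size=N)
zK = rng.normal(size=N)
xK = 0.5 * x2 + 0.8 * zK + rng.normal(size=N)

# Regress x_K on [1, x2, zK]
Z = np.column_stack([np.ones(N), x2, zK])
coef = np.linalg.solve(Z.T @ Z, Z.T @ xK)
resid = xK - Z @ coef

# Conventional OLS standard errors
s2 = resid @ resid / (N - Z.shape[1])
se = np.sqrt(np.diag(s2 * np.linalg.inv(Z.T @ Z)))

t_theta = coef[2] / se[2]
print("theta_K estimate:", coef[2], " t-statistic:", t_theta)
```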

#### Testing for endogeneity

The basic test we consider here starts with the observation that if there is endogeneity then $$b^{ols}$$ is biased. If we have tested for and identified a useful instrument(s), and estimated an IV model, then we have the following hypothesis to test

• $$H_0: \mathbf{b_{OLS}-b^{iv}}=0$$
• $$H_1: \mathbf{b_{OLS}-b^{iv}}\ne 0$$

So, if our variable $$\mathbf{x_K}$$ is not endogenous, the difference between the two estimators should be attributable to sampling error only. If, on the other hand, we do have an endogenous regressor and instrument for it, the bias should show up in this difference and we would reject the null hypothesis in favor of the 2SLS technique. The test we will use is called the Hausman test. It can be applied in a wide range of problems well beyond the endogeneity case considered here, and will be used later in the course.

To implement this test in the instrumental variable context, we can follow an approach for the Hausman test outlined by Wu, applicable to the instrumental variable case only (see footnote 1). This test can be performed manually by

• Step 1: Regress the endogenous variable ($$\mathbf{x}_K$$) on all exogenous variables, both $$\mathbf{x}_{-K}$$ and $$\mathbf{z}$$, and recover the estimated residuals $$\hat{\mathbf{u}}$$ from the following regression: $$\mathbf{x}_K=\mathbf{x}_{-K}\delta_{-K}+\mathbf{z}\theta+\mathbf{u}$$
• Step 2: Regress the dependent variable ($$\mathbf{y}$$) on the full set of *original* independent variables, $$\mathbf{x}$$, together with the residuals $$\hat{\mathbf{u}}$$ from Step 1: $$\mathbf{y}=\mathbf{x}\beta+\delta\hat{\mathbf{u}}+\mathbf{\gamma}$$
• Step 3: Based on the preceding step, test the null hypothesis that $$\delta=0$$, that is, that the regressor is exogenous.
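The three steps can be sketched in NumPy as follows. The helper `ols` and the simulated design are hypothetical, and the test uses the conventional OLS t-statistic on the coefficient of the first-stage residuals; in this design the regressor is endogenous by construction, so the null should be rejected.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and conventional standard errors."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b, se

rng = np.random.default_rng(4)
N = 5000
x2 = rng.normal(size=N)
zK = rng.normal(size=N)
c = rng.normal(size=N)                         # source of endogeneity
xK = 0.5 * x2 + 0.8 * zK + c + rng.normal(size=N)
y = 1.0 + 2.0 * x2 + 3.0 * xK + c + rng.normal(size=N)

# Step 1: regress x_K on all exogenous variables and keep the residuals
Z = np.column_stack([np.ones(N), x2, zK])
d, _ = ols(Z, xK)
u_hat = xK - Z @ d

# Step 2: regress y on the original regressors plus the Step 1 residuals
X_aug = np.column_stack([np.ones(N), x2, xK, u_hat])
b_aug, se_aug = ols(X_aug, y)

# Step 3: t-test of delta = 0 (coefficient on u_hat)
t_delta = b_aug[3] / se_aug[3]
print("delta estimate:", b_aug[3], " t-statistic:", t_delta)
```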

#### Standard Errors

In the IV framework, the variance-covariance matrix is (see footnote 2)

$$Var[\mathbf{b}^{IV}]=\sigma^2 \left( \mathbf{(z'x)^{-1} z'z (z'x)^{-1\prime}} \right)$$

The robust version of this can be calculated with a similar definition of $$\hat{V}$$ as used in the OLS robust standard error section (see footnote 3):

$$Var^{robust}[\mathbf{b}^{IV}]=\mathbf{(z'x)^{-1} z'V z (z'x)^{-1\prime}}$$
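Both variance formulas can be sketched directly from the matrices involved. Two choices below are assumptions made for this illustration rather than definitions taken from the text: $$\sigma^2$$ is estimated as $$\mathbf{e'e}/(N-K)$$ from the IV residuals, and $$\mathbf{V}$$ is taken to be the diagonal matrix of squared IV residuals with no small-sample scaling. The simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 5000
x2 = rng.normal(size=N)
zK = rng.normal(size=N)
c = rng.normal(size=N)
xK = 0.5 * x2 + 0.8 * zK + c + rng.normal(size=N)
y = 1.0 + 2.0 * x2 + 3.0 * xK + c + rng.normal(size=N)

X = np.column_stack([np.ones(N), x2, xK])
Z = np.column_stack([np.ones(N), x2, zK])
K = X.shape[1]

b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
e = y - X @ b_iv                       # IV residuals use the original x, not x_hat
A = np.linalg.inv(Z.T @ X)             # (z'x)^{-1}

# Unadjusted: sigma^2 (z'x)^{-1} z'z (z'x)^{-1}'
sigma2 = e @ e / (N - K)               # assumed degrees-of-freedom correction
V_unadj = sigma2 * A @ (Z.T @ Z) @ A.T

# Robust: (z'x)^{-1} z'Vz (z'x)^{-1}', with V = diag(e_i^2) (assumed form)
ZVZ = (Z * (e ** 2)[:, None]).T @ Z
V_robust = A @ ZVZ @ A.T

se_unadj = np.sqrt(np.diag(V_unadj))
se_robust = np.sqrt(np.diag(V_robust))
print("unadjusted s.e.:", se_unadj)
print("robust s.e.    :", se_robust)
```

Because the simulated errors are homoskedastic with respect to the instruments, the two sets of standard errors come out close to one another here.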

### Multiple Instruments (1 endogenous variable and more than 1 instrument)

In the section above, we restricted our attention to the case of one and only one instrumental variable $$\mathbf{z_k}$$ for the correlated variable $$\mathbf{x_k}$$. What if now, we allow there to be $$M$$ instruments for $$\mathbf{x_k}$$, such that

$$z=\begin{bmatrix} \mathbf{ 1} & \mathbf{x_2} & \ldots & \mathbf{x_{K-1}} & \mathbf{z_1} & \mathbf{z_2} & \ldots & \mathbf{z_M} \end{bmatrix}$$

where $$\mathbf{z}$$ is of dimension $$N \times (K+M-1)$$ and each $$\mathbf{z_h}$$ is of dimension $$N \times 1$$. We maintain the assumption that

$$E(\mathbf{z_h'\epsilon})=0 \hspace{.1in} h=1,2,\ldots,M$$

As above, consider the condition $$E(\mathbf{z'\epsilon})=0$$ and the implications for our IV estimator:

$$\label{eq:moment_iv} \mathbf{z'e}^{iv} = 0$$

Notice the dimensionality of this condition: $$\mathbf{z}$$ is of dimension $$N \times (K+M-1)$$, whereas $$\mathbf{e}^{iv}$$ is $$N \times 1$$. The product will be of dimension $$(K+M-1) \times 1$$. But we only have $$K$$ parameters to estimate, so we have more equations than unknowns.

Should we choose one of the z's, all of the z's, or a subset of the z's? If we choose more than one of the z's, can we continue to use IV regression from the previous section? The answer is no. We need to proceed in one of two ways.

#### Estimation Methods

An important point is that the distinctions outlined below for the various estimation methods only exist when the number of instruments exceeds the number of endogenous variables. If your model is exactly identified (as in the preceding section), it is sufficient to focus on 2SLS.

• Two-Stage Least Squares

In a similar way to equation \eqref{eq:reducedformx}, write the linear projection of $$\mathbf{x}_K$$ onto $$\mathbf{z}$$ as

$$\mathbf{x}_K=\delta_0 + \delta_2 \mathbf{x}_2 + \ldots + \delta_{K-1}\mathbf{x}_{K-1} + \theta_1 \mathbf{z}_1 + \ldots + \theta_{M} \mathbf{z}_M + \mathbf{r}_K$$

where $$\mathbf{r_K}$$ has mean zero and is uncorrelated with all of the right-hand-side variables. Since any linear combination of the columns of $$\mathbf{z}$$ is uncorrelated with $$\mathbf{\epsilon}$$ (from the assumption above),

$$\mathbf{x^*_K} \equiv \delta_0 + \delta_2 \mathbf{x_2} + \delta_3 \mathbf{x_3} + \ldots + \delta_{K-1} \mathbf{x_{K-1}} + \theta_1 \mathbf{z_1} + \ldots + \theta_{M} \mathbf{z_M}$$

is also uncorrelated with $$\mathbf{\epsilon}$$. Unfortunately, $$\mathbf{x}^*_K$$ is not observed because the parameters $$\mathbf{\delta}$$ and $$\mathbf{\theta}$$ are unknown. We can use a first stage estimator for $$\mathbf{x}^*_K$$, called $$\mathbf{\hat{x}}^*_K$$, that is written as

$$\mathbf{\hat{x}}^*_K=\hat{\delta_0} + \hat{\delta}_2 \mathbf{x_2} + \hat{\delta}_3 \mathbf{x_3} + \ldots + \hat{\delta}_{K-1} \mathbf{x_{K-1}} + \hat {\theta}_1 \mathbf{z_1} + \ldots + \hat {\theta}_{M} \mathbf{z_M}$$

by running an OLS regression. Denoting $$\hat{\mathbf{x}}=\begin{bmatrix} 1 & \mathbf{x_2} & \ldots & \mathbf{x_{K-1}} &\mathbf{\hat{x}}^*_K \end{bmatrix}$$, the two stage least squares estimator (2SLS) is 

$$\hat{\mathbf{\beta}}^{2SLS}=(\hat{\mathbf{x}}'\hat{\mathbf{x}})^{-1}\hat{\mathbf{x}}'\mathbf{y}$$
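The two stages can be sketched for point estimation as follows, using a hypothetical simulated design with $$M=2$$ instruments (`z1`, `z2`) and true $$\beta=(1,2,3)$$. The sketch also computes the equivalent one-line projection form as a cross-check. It is meant for the coefficients only, not for standard errors.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 5000
x2 = rng.normal(size=N)
z1 = rng.normal(size=N)
z2 = rng.normal(size=N)
c = rng.normal(size=N)                                   # source of endogeneity
xK = 0.5 * x2 + 0.8 * z1 + 0.8 * z2 + c + rng.normal(size=N)
y = 1.0 + 2.0 * x2 + 3.0 * xK + c + rng.normal(size=N)

X = np.column_stack([np.ones(N), x2, xK])
Z = np.column_stack([np.ones(N), x2, z1, z2])            # N x (K + M - 1)

# First stage: OLS of x_K on all of z, keep the fitted values
first_stage = np.linalg.solve(Z.T @ Z, Z.T @ xK)
xK_hat = Z @ first_stage

# Second stage: OLS of y on [1, x2, xK_hat]
X_hat = np.column_stack([np.ones(N), x2, xK_hat])
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Cross-check: project every column of x on z in one step
X_proj = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_check = np.linalg.solve(X_proj.T @ X_proj, X_proj.T @ y)

print("2SLS point estimates:", b_2sls)
```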
• Method of Moments

A better way to proceed (available in Stata through, for example, the gmm estimator of ivregress and the gmm2s option of ivreg2) is to minimize a quadratic form in the sample moment condition outlined above in Equation \eqref{eq:moment_iv}

$$\underset{b^{IV}}{min} \hspace{.05in} \frac{\mathbf{e'z}W\mathbf{z'e}}{N}$$

which is a scalar value. Here $$W$$ is a $$(K+M-1) \times (K+M-1)$$ weighting matrix. If $$W=(\mathbf{z'z})^{-1}$$, then the GMM estimator and the 2SLS estimator yield the same result, so 2SLS is a special case of GMM. Because $$W$$ can instead be chosen to account for heteroskedasticity, GMM is nearly always the preferred estimator. Stata's default method for defining $$W$$ uses a heteroskedasticity-robust approach, constructing it from the residuals for each individual from the 2SLS model. This is much like the $$V$$ matrix we defined for estimating robust standard errors in the OLS chapter.

While it is almost always possible to find a $$\mathbf{b}^{IV}$$ that minimizes this condition, doing so does not force the orthogonality condition to hold for each column of $$\mathbf{z}$$. The possibility that some of our instruments are correlated with our errors, even after minimizing the condition above, opens the door to a problem called overidentification.
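For the linear model the minimizer of $$\mathbf{e'z}W\mathbf{z'e}$$ has a closed form, $$\mathbf{(x'z}W\mathbf{z'x)^{-1}x'z}W\mathbf{z'y}$$, which the sketch below implements in a hypothetical helper `gmm`. It evaluates the estimator with the weighting matrix $$(\mathbf{z'z})^{-1}$$, compares that with 2SLS computed from projected regressors, and then takes a second step with a weighting matrix built from squared first-step residuals. The simulated heteroskedastic design and this particular two-step recipe are assumptions made for illustration, not a description of any package's internals.

```python
import numpy as np

def gmm(X, Z, y, W):
    """Closed-form minimizer of e'Z W Z'e for the linear model y = Xb + e."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

rng = np.random.default_rng(7)
N = 5000
x2 = rng.normal(size=N)
z1 = rng.normal(size=N)
z2 = rng.normal(size=N)
c = rng.normal(size=N)
xK = 0.5 * x2 + 0.8 * z1 + 0.8 * z2 + c + rng.normal(size=N)
eps = c + (0.5 + np.abs(z1)) * rng.normal(size=N)     # heteroskedastic in z1
y = 1.0 + 2.0 * x2 + 3.0 * xK + eps

X = np.column_stack([np.ones(N), x2, xK])
Z = np.column_stack([np.ones(N), x2, z1, z2])

# Step 1: W = (z'z)^{-1} reproduces 2SLS
b_step1 = gmm(X, Z, y, np.linalg.inv(Z.T @ Z))
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Step 2: W = (z' diag(e_i^2) z / N)^{-1}, using first-step residuals
e1 = y - X @ b_step1
S = (Z * (e1 ** 2)[:, None]).T @ Z / N
b_gmm = gmm(X, Z, y, np.linalg.inv(S))

print("2SLS        :", b_2sls)
print("two-step GMM:", b_gmm)
```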

• LIML and 3SLS

There are also two additional techniques one can use for estimating $$\mathbf{b}^{IV}$$. One is a maximum likelihood technique called limited information maximum likelihood (LIML) and another is termed Three-Stage Least Squares (3SLS). We won't be investigating these further in this class, but both are available in Stata (LIML as an estimator in the ivregress command, 3SLS through the reg3 command). One quick point: for small samples, LIML is often the best approach.

#### Testing for Strong and Relevant Instruments

Testing the suitability of the instruments is also important in this context. Here we test the null hypothesis

$$H_0: \theta_1=\theta_2=\ldots=\theta_M=0$$

using an F test with $$(M,N-K-M+1)$$ degrees of freedom, since the first-stage regression has $$K+M-1$$ coefficients.
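The joint test can be sketched by comparing the first-stage regression with and without the excluded instruments. The simulated data and the helper `ssr` are hypothetical, and the sketch takes the residual degrees of freedom to be $$N$$ minus the number of coefficients in the unrestricted first stage, which is an assumption of this illustration.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(8)
N, M = 5000, 2
x2 = rng.normal(size=N)
z1 = rng.normal(size=N)
z2 = rng.normal(size=N)
xK = 0.5 * x2 + 0.8 * z1 + 0.8 * z2 + rng.normal(size=N)

X_restricted = np.column_stack([np.ones(N), x2])              # theta_1 = theta_2 = 0 imposed
X_unrestricted = np.column_stack([np.ones(N), x2, z1, z2])

ssr_r = ssr(X_restricted, xK)
ssr_u = ssr(X_unrestricted, xK)
df_resid = N - X_unrestricted.shape[1]

F = ((ssr_r - ssr_u) / M) / (ssr_u / df_resid)
print("first-stage F statistic:", F, "with", (M, df_resid), "degrees of freedom")
```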

#### Overidentification in IV regression

Recall that in the IV regression model, we might have as many as $$M$$ instrumental variables for $$K_{end}$$ endogenous regressors. In our example, $$K_{end}=1$$, but in a general 2SLS setting we need $$K_{end} \le M$$ in order to identify $$\beta$$. However, consider a situation where $$K_{end} < M$$. By including a myriad of instruments, we might be introducing bias in our estimate of $$\beta$$ because some subset of our IVs in fact does not satisfy the important requirement that $$E(\mathbf{z'\epsilon})=0$$. When there are more instruments than endogenous regressors, we can test whether the instrument set as a whole is consistent with this requirement, and so decide whether to drop those instrumental variables that may themselves be correlated with the model errors. Under i.i.d. errors, this test is called the Sargan test.

Fortunately, the test is easy to implement.

• Step 1: Recover the estimated residuals from the 2SLS regression. I label this vector as $$\mathbf{e}_{2sls}$$.
• Step 2: Regress $$\mathbf{e}_{2sls}$$ on the full set of exogenous instruments, $$\mathbf{x_{-K}}$$ and $$\mathbf{z}$$. Make sure to omit the endogenous variable.

The test statistic, $$N \times R^2$$, where $$R^2$$ is recovered from this regression, is distributed $$\chi^2$$ with degrees of freedom equal to the number of excluded instruments in the 2SLS regression minus the number of endogenous variables ($$M-K_{end}$$). Fortunately for us, the ivreg2 command automatically reports the Sargan statistic for overidentification. The null hypothesis is that all of the instruments are uncorrelated with the errors. If we reject it (a low p-value), then at least one of our instrumental variables appears to be invalid, and our logic for choosing the set of IVs must be reexamined.

The intuition of this test rests with information contained in the error structure from the 2SLS regression. If these errors can be explained well using information contained in our IVs (the ivreg2 command labels these as excluded instruments), then they really aren't good instruments, since we need them to be uncorrelated with the error. Rather than test each excluded IV sequentially, the Sargan approach jointly tests whether overidentification is a problem or not. If it is, consider using a subset of IVs or search for new ones.
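The two steps of the Sargan test can be sketched as below. The helpers `tsls` and `sargan` and the simulated data are hypothetical. To show the statistic at work, the same design is run twice: once with two valid instruments, and once after giving `z2` a direct effect on the outcome, which violates $$E(\mathbf{z'\epsilon})=0$$. With $$M=2$$ and $$K_{end}=1$$ the statistic has one degree of freedom.

```python
import numpy as np

def tsls(X, Z, y):
    """2SLS point estimates using regressors projected on the instruments."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

def sargan(X, Z, y):
    """N * R^2 from regressing the 2SLS residuals on all instruments."""
    e = y - X @ tsls(X, Z, y)                           # Step 1: 2SLS residuals
    fitted = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)      # Step 2: regress e on z
    ssr = (e - fitted) @ (e - fitted)
    sst = (e - e.mean()) @ (e - e.mean())
    return len(y) * (1.0 - ssr / sst)

rng = np.random.default_rng(9)
N = 5000
x2 = rng.normal(size=N)
z1 = rng.normal(size=N)
z2 = rng.normal(size=N)
c = rng.normal(size=N)
xK = 0.5 * x2 + 0.8 * z1 + 0.8 * z2 + c + rng.normal(size=N)
y = 1.0 + 2.0 * x2 + 3.0 * xK + c + rng.normal(size=N)

X = np.column_stack([np.ones(N), x2, xK])
Z = np.column_stack([np.ones(N), x2, z1, z2])

stat_valid = sargan(X, Z, y)                  # both instruments valid
stat_invalid = sargan(X, Z, y + 0.5 * z2)     # z2 now affects y directly

print("Sargan statistic, valid instruments :", stat_valid)
print("Sargan statistic, z2 invalid        :", stat_invalid)
```

The statistic would then be compared with a $$\chi^2$$ critical value for one degree of freedom; the sketch stops at the statistic to avoid depending on a statistics library.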

#### Standard Errors

Manual calculations should be avoided as a correction must be made to the standard errors. The two stages outlined above should generally not be implemented by hand: although the second-stage OLS regression on $$\mathbf{\hat{x}}^*_K$$ reproduces the 2SLS point estimates of $$\mathbf{\beta}$$, the variance/covariance matrix it reports is incorrect. Its residuals are computed from $$\mathbf{\hat{x}}^*_K$$ rather than from $$\mathbf{x}_K$$, so it fails to account for the underlying randomness associated with $$\mathbf{\hat{x}}^*_K$$.
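The size of the problem can be seen in a small sketch that runs the second stage by hand and then recomputes the residuals with the original regressors. The simulated data are hypothetical, and the "corrected" version shown here simply swaps in residuals based on $$\mathbf{x}$$ under homoskedastic errors; it is an illustration of the issue rather than a replacement for a packaged routine.

```python
import numpy as np

rng = np.random.default_rng(10)
N = 5000
x2 = rng.normal(size=N)
zK = rng.normal(size=N)
c = rng.normal(size=N)
xK = 0.5 * x2 + 0.8 * zK + c + rng.normal(size=N)
y = 1.0 + 2.0 * x2 + 3.0 * xK + c + rng.normal(size=N)

X = np.column_stack([np.ones(N), x2, xK])
Z = np.column_stack([np.ones(N), x2, zK])
K = X.shape[1]

# Manual two-stage procedure
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
XhXh_inv = np.linalg.inv(X_hat.T @ X_hat)

# What the second-stage OLS output would report: residuals built from x_hat
e_naive = y - X_hat @ b
se_naive = np.sqrt(np.diag(e_naive @ e_naive / (N - K) * XhXh_inv))

# Residuals built from the original regressors x
e_correct = y - X @ b
se_correct = np.sqrt(np.diag(e_correct @ e_correct / (N - K) * XhXh_inv))

print("naive second-stage s.e.:", se_naive)
print("corrected s.e.         :", se_correct)
```

The coefficients are identical in both cases; only the residual variance, and therefore every standard error, changes.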

### Implementation in R and Stata

The companion to this chapter shows how to implement many of these ideas in R and Stata.

## Footnotes:

1

These steps are not correct for the case of more than 1 instrumental variable. However, they are instructive in understanding the intuition of the Hausman Test in the instrumental variables context. If you have more than 1 instrumental variable, you must use the ivendog or hausman commands in Stata.

2

This equation will exactly replicate the Stata ivregress command (for 2sls) using the options vce(unadjusted) small.

3

This equation will exactly replicate the Stata ivregress command (for 2sls) using the options vce(robust) small, defining $$\mathbf{V}$$ as we did in the OLS chapter.