# Instrumental Variables Regression

The exogeneity assumption of the previous chapter is a particularly strong assumption, and one can think of a number of cases where it would not be expected to hold. Education and earnings is an example. A wage equation that is a function of education violates this assumption if factors that influence wages but are not accounted for in the regression equation are related to educational attainment. In a pure cross-section world, unobserved individual-specific factors, for example, may influence both wage and educational attainment. The problem of endogeneity is a very big deal and one you should worry about every time you estimate any type of regression model. To see why, note that our ability to estimate unbiased parameters hinges on

$$$\mathbf{E(x'\epsilon)=0}$$$

If there is in fact correlation between the elements of $$\mathbf{x}$$ and $$\mathbf{\epsilon}$$ then our parameter estimates are biased.
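To make the consequence concrete, here is a minimal simulation sketch in Python with NumPy. The data-generating process, the variable names, and the parameter values are all assumptions chosen for illustration; nothing here comes from the wage data used later. An unobserved factor `u` drives both the regressor and the error, so $$E(\mathbf{x'\epsilon}) \neq 0$$ and the OLS slope settles away from the true value of 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

u = rng.normal(size=N)               # unobserved factor (hypothetical, e.g. "ability")
x = 0.8 * u + rng.normal(size=N)     # regressor depends on u
eps = u + rng.normal(size=N)         # error also depends on u, so x and eps are correlated
y = 1.0 + 2.0 * x + eps              # true slope is 2.0

X = np.column_stack([np.ones(N), x])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)   # (x'x)^{-1} x'y

print("OLS intercept and slope:", b_ols)
print("true slope: 2.0")
```

Under these assumed parameters the OLS slope converges to $$2 + cov(x,\epsilon)/var(x) = 2 + 0.8/1.64 \approx 2.49$$ rather than 2, and collecting more data does not help.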

## A Single Instrumental Variable

In this section, we discuss ways of addressing endogeneity in the OLS framework. As before, assume a linear model

(8)$$$\mathbf{y}=\mathbf{x}\beta+\epsilon=\beta_0+\beta_2 \mathbf{x_2} + \beta_3 \mathbf{x_3} + \ldots + \beta_K \mathbf{x_K} + \mathbf{\epsilon}$$$

where $$\mathbf{x}_j$$ is an $$N \times 1$$ column vector with data from column $$j$$, and where the following holds:

• $$E(\mathbf{\epsilon})=0$$

• $$E(\mathbf{x_{j}\epsilon})=0$$ for $$j=1,2,\ldots,K-1$$

Notice that we are invoking the exogeneity assumption for some of our explanatory variables but not all of them. The explanatory variable $$\mathbf{x_K}$$ is potentially endogenous and a failure to deal with this will potentially lead to biased parameter estimates.

The method of instrumental variables offers a way of handling this problem. Letting the instrumental variable be denoted as $$\mathbf{z_K}$$, we need it to have these properties:

• Assumption 1: $$E(\mathbf{z'\epsilon})=0$$

• Assumption 2: And, for the following linear relationship,

(9)$$$\mathbf{x_K}=\delta_0 + \delta_2 \mathbf{x_2}+\ldots+ \delta_{K-1} \mathbf{x_{K-1}}+\theta_K \mathbf{z_K} + \mathbf{r}$$$

where $$E(\mathbf{r})=0$$ and $$\mathbf{r}$$ is uncorrelated with the right hand side variables, we need $$\theta_K$$ to be non-zero for $$\mathbf{z_K}$$ to be a valid instrument. This is something like saying $$\mathbf{z_K}$$ is partially correlated with $$\mathbf{x_K}$$ after netting out the effects of $$\mathbf{x_1,\ldots,x_{K-1}}$$.

When $$\mathbf{z_K}$$ satisfies these conditions, it is called an Instrumental Variable (IV) candidate for $$\mathbf{x_K}$$. The next step is understanding how one uses this structure for estimating the parameters of interest ($$\mathbf{\beta}$$). One tempting option is to drop our endogenous variable ($$\mathbf{x}_K$$), replace it with our IV variable ($$\mathbf{z}_K$$), and then estimate an OLS model.

### Using the IV

Define the matrix $$\mathbf{z}$$ as

$\begin{split}\mathbf{z} = \begin{bmatrix} 1 & x_{12} & \ldots & x_{1,K-1} & z_{1,K} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{i2} & \ldots & x_{i,K-1} & z_{i,K} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{N2} & \ldots & x_{N,K-1} & z_{N,K} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_2 &\ldots & \mathbf{x}_{K-1} & \mathbf{z}_K \end{bmatrix}\end{split}$

Notice that in doing this we have done a drop-in replacement of $$\mathbf{x}_K$$ with $$\mathbf{z}_K$$.

So how do we use the IV (found in $$\mathbf{z}$$) to estimate $$\beta$$? One idea is to define $$\mathbf{b}^{iv}$$ as $$\mathbf{(z'z)^{-1}z'y}$$, since we have already established that $$\mathbf{z}_K$$ is exogenous and correlated with $$\mathbf{x}_K$$ even after controlling for all of the other exogenous information at hand. What are the properties of such an estimator? To examine this, substitute the relevance equation (9) into our original estimating equation:

$\begin{split}\begin{eqnarray} \mathbf{y}&=&\beta_0 + \beta_2 \mathbf{x}_2 + \ldots + \beta_{K-1} \mathbf{x}_{K-1} \nonumber \\ & &+ \beta_K (\delta_0 + \delta_2 \mathbf{x}_2 + \ldots + \delta_{K-1} \mathbf{x}_{K-1} + \theta_K \mathbf{z}_K + \mathbf{r}) + \epsilon \\ &=& (\beta_0 + \beta_K \delta_0) + (\beta_2 + \beta_K \delta_2) \mathbf{x}_2 + \ldots + (\beta_{K-1} + \beta_{K} \delta_{K-1}) \mathbf{x}_{K-1} \nonumber \\ & &+ (\beta_K \theta_K)\mathbf{z}_K + (\beta_K \mathbf{r}+\epsilon)\\ &=&\alpha_0 + \alpha_2 \mathbf{x}_2 + \ldots + \alpha_{K-1} \mathbf{x}_{K-1} + \alpha_K \mathbf{z}_K + \mathbf{v} \end{eqnarray}\end{split}$

If we run this regression, we obtain estimates defined as $$\mathbf{a} = \mathbf{(z'z)^{-1}z'y}$$. From this we can see that substituting our instrumental variable in as a proxy for $$\mathbf{x}_K$$ and recovering the parameters $$\mathbf{a}$$ gives estimates where:

• $$\alpha_k \neq \beta_k$$ in general for every parameter you estimate, not just the endogenous one, $$\beta_K$$. For example, $$\alpha_0 = \beta_0 + \beta_K \delta_0 \neq \beta_0$$

• The variance/covariance matrix of the errors ($$\mathbf{v}$$) is not $$\sigma^2\mathbf{I}$$, since $$\mathbf{v}$$ combines $$\mathbf{r}$$ and $$\epsilon$$

Consequently, for $$\mathbf{a} = \mathbf{(z'z)^{-1}z'y}$$,

• $$E[\mathbf{a}] \neq \beta$$, so this is not a good IV estimator.

• Given an estimate for $$\mathbf{a}$$, we can’t solve for the $$K$$ estimates for $$\beta$$ because we have $$K$$ equations in $$2\times K$$ unknowns.
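The proxy-variable result above can be checked numerically. The following is a minimal sketch on simulated data; the data-generating process and every parameter value ($$\beta$$, $$\delta$$, $$\theta_K$$) are assumptions made up for illustration. It regresses $$\mathbf{y}$$ on $$\mathbf{z}$$ and compares the estimates with the $$\alpha$$ coefficients implied by the substitution.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200_000

# Assumed (hypothetical) parameters
beta0, beta2, betaK = 1.0, 0.5, 2.0          # structural equation
delta0, delta2, thetaK = 0.5, 0.3, 0.7       # relevance equation

x2 = rng.normal(size=N)                      # exogenous regressor
zK = rng.normal(size=N)                      # instrument
u = rng.normal(size=N)                       # unobserved factor causing endogeneity
r = 0.8 * u + rng.normal(size=N)             # relevance-equation error
xK = delta0 + delta2 * x2 + thetaK * zK + r  # endogenous regressor
eps = u + rng.normal(size=N)                 # structural error, correlated with r
y = beta0 + beta2 * x2 + betaK * xK + eps

# "Drop-in replacement": regress y on [1, x2, zK] instead of [1, x2, xK]
Z = np.column_stack([np.ones(N), x2, zK])
a = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Coefficients implied by substituting the relevance equation
alpha = np.array([beta0 + betaK * delta0,
                  beta2 + betaK * delta2,
                  betaK * thetaK])

print("estimated a:", a)
print("implied alpha:", alpha)
print("true beta:", np.array([beta0, beta2, betaK]))
```

The estimates line up with $$\alpha$$, not with $$\beta$$: the coefficient on the instrument recovers $$\beta_K\theta_K$$, and the other coefficients are contaminated by $$\beta_K\delta_j$$.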

A better way of defining our instrumental variable estimator follows from our assumption of no correlation between $$\mathbf{z}$$ and $$\epsilon$$:

$\begin{split}\begin{eqnarray} 0 &=& E[\mathbf{z'\epsilon}] \\ &=& E[\mathbf{z'(y-x\beta)}] \end{eqnarray}\end{split}$

from our assumptions above. Equivalently, premultiplying the model by $$\mathbf{z'}$$ gives $$\mathbf{z'y=z'x \beta + z'\epsilon}$$, and taking expectations results in

$$$\mathbf{E(z'y)=E(z'x \beta)}$$$

This gives us $$K$$ equations in $$K$$ unknowns, which can be solved. Simplifying further, we can write $$\mathbf{\beta}$$ as

$$$\mathbf{\beta=[E(z'x)]^{-1}E[z'y]}$$$

where both expectation terms can be consistently estimated using a random sample on ($$\mathbf{x}$$, $$\mathbf{y}$$ and $$\mathbf{z_K}$$). A very important point to remember is that this system of equations is solvable only if

$$$rank(E[\mathbf{z'x}])=K$$$

This rank condition is what achieves identification, since we can’t invert a square matrix that does not have full rank. In practice, this estimator can be implemented given a random sample from the population as

$$$\mathbf{b}^{IV}=\mathbf{(z'x)^{-1} z'y}=\mathbf{(\hat{x}'\hat{x})^{-1} \hat{x}'y}$$$

where $$\hat{\mathbf{x}}$$ is equal to $$\mathbf{z}(\mathbf{z'z})^{-1}\mathbf{z'x}$$, or the predicted value of $$\mathbf{x}$$ given our instruments $$\mathbf{z}$$.

It is important to note that $$\mathbf{b}^{IV}$$ is a consistent, but not in general unbiased, estimator for $$\beta$$, so we rely on large sample properties.
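As a hedged illustration of the estimator just defined, the sketch below computes $$\mathbf{(z'x)^{-1}z'y}$$ and the equivalent predicted-value form on simulated data. The data-generating process and parameter values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

x2 = rng.normal(size=N)                  # exogenous regressor
zK = rng.normal(size=N)                  # instrument
u = rng.normal(size=N)                   # unobserved factor causing endogeneity
xK = 0.5 + 0.3 * x2 + 0.7 * zK + 0.8 * u + rng.normal(size=N)
eps = u + rng.normal(size=N)
y = 1.0 + 0.5 * x2 + 2.0 * xK + eps      # true beta = (1.0, 0.5, 2.0)

ones = np.ones(N)
X = np.column_stack([ones, x2, xK])      # includes the endogenous column
Z = np.column_stack([ones, x2, zK])      # endogenous column replaced by the instrument

b_ols = np.linalg.solve(X.T @ X, X.T @ y)            # biased under endogeneity
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)             # (z'x)^{-1} z'y

X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)        # z (z'z)^{-1} z'x
b_2step = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

print("OLS:        ", b_ols)
print("IV:         ", b_iv)
print("via x-hat:  ", b_2step)
```

In this exactly identified setup the two IV formulas agree to numerical precision, and both sit close to the assumed true $$\beta$$ while OLS does not.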

### Testing for Suitable Instruments

In the preceding section, we saw that two assumptions were necessary for having a suitable instrument:

• Assumption 1: Orthogonality of the instrument and model errors

$$$E(\mathbf{z'\epsilon})=0$$$

• Assumption 2: The partial correlation estimated in Equation (9) is non-zero

$$$\theta_K \ne 0$$$

Ideally, we would like to be able to test our assumptions to ensure that our candidate instrumental variable meets these conditions. If our instrumental variable does not conform with these assumptions, our estimates of $$\mathbf{\beta}$$ will be biased. Assumption 1 can’t be formally tested since the true model errors, $$\mathbf{\epsilon}$$, are not observed. However, the second condition can and should be tested by estimating equation (9) and conducting a t-test on the $$\theta_K$$ parameter. Lower p-values (equivalently, larger t statistics in absolute value) accord with stronger instruments, as would be expected.
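The relevance check can be sketched as follows: estimate the first-stage regression by OLS and form the t statistic on the instrument's coefficient. The data are simulated and all parameter values are assumptions; the p-value uses a normal approximation rather than the exact t distribution.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
N = 5_000

x2 = rng.normal(size=N)                      # exogenous regressor
zK = rng.normal(size=N)                      # candidate instrument
xK = 0.5 + 0.3 * x2 + 0.7 * zK + rng.normal(scale=1.3, size=N)  # assumed theta_K = 0.7

# First-stage regression of xK on [1, x2, zK]
Z = np.column_stack([np.ones(N), x2, zK])
coef = np.linalg.solve(Z.T @ Z, Z.T @ xK)
resid = xK - Z @ coef
sigma2 = resid @ resid / (N - Z.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))

t_theta = coef[2] / se[2]
p_theta = math.erfc(abs(t_theta) / math.sqrt(2))   # two-sided, normal approximation

print("theta_K estimate:", coef[2])
print("t statistic:", t_theta)
print("approximate p-value:", p_theta)
```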

### Testing for endogeneity

The basic test we consider here starts with the observation that if there is endogeneity then $$\mathbf{b}^{OLS}$$ is biased. If we have tested for and identified a useful instrument (or instruments), and estimated an IV model, then we have the following hypothesis to test

• $$H_0: \mathbf{b}^{OLS}-\mathbf{b}^{IV}=0$$

• $$H_1: \mathbf{b}^{OLS}-\mathbf{b}^{IV}\ne 0$$

So, if our variable $$\mathbf{x_K}$$ is not endogenous, the difference between the two estimators should be attributable to sampling error only. If, on the other hand, we do have an endogenous regressor and an instrument for it, the bias should show up in this difference and we would reject the null hypothesis in favor of the IV estimator. The test we will use is called the Hausman test. It can be applied in a wide range of problems well beyond the endogeneity case considered here, and will be used later in the course.

To implement this test in the instrumental variable context, we can follow an approach for the Hausman test outlined by Wu that is applicable to the instrumental variable case only. [1] This test can be performed manually by

• Step 1: Regress the endogenous variable ($$\mathbf{x}_K$$) on all exogenous variables, both $$\mathbf{x}_{-K}$$ and $$\mathbf{z}$$, and recover the estimated residuals $$\hat{\mathbf{u}}$$ from the following regression:

$$$\mathbf{x}_K=\mathbf{x}_{-K}\delta_{-K}+\mathbf{z}\theta+\mathbf{u}$$$

• Step 2: Regress the dependent variable ($$\mathbf{y}$$) on the full set of original independent variables, $$\mathbf{x}$$, and the first-stage residuals, $$\hat{\mathbf{u}}$$:

$$$\mathbf{y}=\mathbf{x}\beta+\delta\hat{\mathbf{u}}+\mathbf{\gamma}$$$

• Step 3: Based on the preceding step, test the null hypothesis that $$\delta=0$$, that is, that the regressor is exogenous.
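The three steps above can be sketched on simulated data as follows. This is an illustrative sketch, not the Stata implementation: the helper `ols`, the data-generating process, and the parameter values are assumptions, and the test uses classical (non-robust) standard errors.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients, classical standard errors, and residuals."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (X.shape[0] - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, se, e

rng = np.random.default_rng(3)
N = 50_000
x2 = rng.normal(size=N)
zK = rng.normal(size=N)
u = rng.normal(size=N)                       # unobserved factor causing endogeneity
xK = 0.5 + 0.3 * x2 + 0.7 * zK + 0.8 * u + rng.normal(size=N)
y = 1.0 + 0.5 * x2 + 2.0 * xK + u + rng.normal(size=N)

ones = np.ones(N)
X = np.column_stack([ones, x2, xK])          # original regressors
Z = np.column_stack([ones, x2, zK])          # all exogenous variables

# Step 1: first-stage residuals
_, _, u_hat = ols(Z, xK)

# Step 2: augmented regression of y on x and u_hat
b_aug, se_aug, _ = ols(np.column_stack([X, u_hat]), y)

# Step 3: t statistic for H0: delta = 0 (regressor is exogenous)
t_delta = b_aug[-1] / se_aug[-1]
print("delta estimate:", b_aug[-1])
print("t statistic on delta:", t_delta)
```

Because endogeneity is built into this simulated example, the t statistic on $$\delta$$ is far from zero and exogeneity is rejected. A useful side effect of the augmented regression is that its coefficients on the original regressors coincide with the IV estimates.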

### Standard Errors

In the IV framework, the variance-covariance matrix is [2]

$$$Var[\mathbf{b}^{IV}]=\sigma^2 \left( \mathbf{(z'x)^{-1} z'z (z'x)^{-1\prime}} \right)$$$

The robust version of this can be calculated with a similar definition of $$\mathbf{V}$$ as used in the OLS robust standard error section [3]:

$$$Var^{robust}[\mathbf{b}^{IV}]=\mathbf{(z'x)^{-1} z'V z (z'x)^{-1\prime}}$$$
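Both variance formulas translate directly into matrix code. The sketch below uses simulated data and makes two assumptions that should be checked against the OLS chapter: $$\sigma^2$$ is estimated as $$\mathbf{e'e}/(N-K)$$ using the IV residuals, and $$\mathbf{V}$$ is taken to be the diagonal matrix of squared IV residuals with no small-sample scaling.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 50_000
x2 = rng.normal(size=N)
zK = rng.normal(size=N)
u = rng.normal(size=N)
xK = 0.5 + 0.3 * x2 + 0.7 * zK + 0.8 * u + rng.normal(size=N)
y = 1.0 + 0.5 * x2 + 2.0 * xK + u + rng.normal(size=N)

ones = np.ones(N)
X = np.column_stack([ones, x2, xK])
Z = np.column_stack([ones, x2, zK])
K = X.shape[1]

b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
e = y - X @ b_iv                          # IV residuals use the actual x, not x-hat

A = np.linalg.inv(Z.T @ X)                # (z'x)^{-1}
sigma2 = e @ e / (N - K)                  # assumed small-sample divisor N - K
V_classic = sigma2 * A @ (Z.T @ Z) @ A.T

# z'Vz with V = diag(e_i^2), computed without forming the N x N matrix
ZVZ = (Z * (e ** 2)[:, None]).T @ Z
V_robust = A @ ZVZ @ A.T

se_classic = np.sqrt(np.diag(V_classic))
se_robust = np.sqrt(np.diag(V_robust))
print("classical SE:", se_classic)
print("robust SE:   ", se_robust)
```

The simulated errors are homoskedastic, so the two sets of standard errors should be close here; with heteroskedastic errors they can differ noticeably.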

## Multiple Instrumental Variables

In the section above, we restricted our attention to the case of one and only one instrumental variable $$\mathbf{z_K}$$ for the endogenous variable $$\mathbf{x_K}$$. What if we now allow there to be $$M$$ instruments for $$\mathbf{x_K}$$, such that

$$$\mathbf{z}=\begin{bmatrix} \mathbf{ 1} & \mathbf{x_2} & \ldots & \mathbf{x_{K-1}} & \mathbf{z_1} & \mathbf{z_2} & \ldots & \mathbf{z_M} \end{bmatrix}$$$

where $$\mathbf{z}$$ and each $$\mathbf{z_h}$$ are of dimension $$N \times (K+M-1)$$ and $$N \times 1$$, respectively. We maintain the assumption that

$$$E(\mathbf{z_h'\epsilon})=0 \hspace{.1in} h=1,2,\ldots,M$$$

As above, consider the condition ($$E(\mathbf{z'\epsilon})=0$$) and the implications for our IV estimator:

(10)$$$\mathbf{z'e^{iv}} = 0$$$

Notice the dimensionality of this condition: $$\mathbf{z}$$ is of dimension $$N \times (K+M-1)$$, whereas $$\mathbf{e}^{iv}$$ is $$N \times 1$$. The product will be of dimension $$(K+M-1) \times 1$$. But we only have $$K$$ parameters to estimate, so whenever $$M>1$$ we have more equations than unknowns.

Should we choose one of the z’s, all of the z’s, or a subset of the z’s? If we choose more than one of the z’s, can we continue to use IV regression from the previous section? The answer is no. We need to proceed in one of two ways.

### Estimation Methods

An important point is that the distinctions outlined below for the various estimation methods only exist when the number of instruments exceeds the number of endogenous variables. If your model is exactly identified (as in the preceding section), it is sufficient to focus on 2SLS.

1. Two Stage Least Squares

In a similar way to equation (9), write the linear projection of $$\mathbf{x}_K$$ onto $$\mathbf{z}$$ as

$$$\mathbf{x}_K=\delta_0 + \delta_2 \mathbf{x}_2 + \ldots + \delta_{K-1}\mathbf{x}_{K-1} + \theta_1 \mathbf{z}_1 + \ldots + \theta_{M} \mathbf{z}_M + \mathbf{r}_K$$$

where $$\mathbf{r_K}$$ has mean zero and is uncorrelated with all of the right hand side variables. Since any linear combination of the columns of $$\mathbf{z}$$ is uncorrelated with $$\mathbf{\epsilon}$$ (from the assumption above),

$$$\mathbf{x^*_K} \equiv \delta_0 + \delta_2 \mathbf{x_2} + \delta_3 \mathbf{x_3} + \ldots + \delta_{K-1} \mathbf{x_{K-1}} + \theta_1 \mathbf{z_1} + \ldots + \theta_{M} \mathbf{z_M}$$$

is also uncorrelated with $$\mathbf{\epsilon}$$. Unfortunately, neither $$\mathbf{x}^*_K$$ nor the parameters $$\mathbf{\delta}$$ and $$\mathbf{\theta}$$ are known. We can use a first stage estimator for $$\mathbf{x}^*_K$$, called $$\mathbf{\hat{x}}^*_K$$, that is written as

$$$\mathbf{\hat{x}}^*_K=\hat{\delta}_0 + \hat{\delta}_2 \mathbf{x_2} + \hat{\delta}_3 \mathbf{x_3} + \ldots + \hat{\delta}_{K-1} \mathbf{x_{K-1}} + \hat {\theta}_1 \mathbf{z_1} + \ldots + \hat {\theta}_{M} \mathbf{z_M}$$$

by running an OLS regression. Denoting $$\hat{\mathbf{x}}=\begin{bmatrix} 1 & \mathbf{x_2} & \ldots & \mathbf{x_{K-1}} &\mathbf{\hat{x}}^*_K \end{bmatrix}$$, the two stage least squares estimator (2SLS) is

$$$\hat{\mathbf{\beta}}^{2SLS}=(\hat{\mathbf{x}}'\hat{\mathbf{x}})^{-1}\hat{\mathbf{x}}'\mathbf{y}$$$

2. Generalized Method of Moments (GMM)

A better way to proceed (available through the gmm estimator of Stata’s ivregress command and the gmm2s option of ivreg2) is to minimize a quadratic form in the moment condition outlined above in Equation (10)

$$$\underset{b^{IV}}{min} \hspace{.05in} \frac{\mathbf{e'z}W\mathbf{z'e}}{N}$$$

which is a scalar value, where $$W$$ is a $$(K+M-1) \times (K+M-1)$$ weighting matrix. If $$W=\mathbf{(z'z)^{-1}}$$, then the GMM estimator and the 2SLS estimator yield the same result, so 2SLS is a special case of GMM. Consequently, GMM is often the preferred estimator when the model is overidentified. Stata’s default method for defining $$W$$ in ivregress gmm uses a heteroskedastic error approach, constructing errors for each individual from the 2SLS model. This is much like the $$V$$ matrix we defined for estimating robust standard errors in the OLS chapter.

While it is almost always possible to find a $$\mathbf{b}^{IV}$$ that minimizes this condition, doing so does not impose the orthogonality condition for each column of $$\mathbf{z}$$. The possibility that some of our instruments are correlated with our errors even after trying to minimize the condition above opens the door to a problem called overidentification.

3. LIML and 3SLS

There are also two additional techniques one can use for estimating $$\mathbf{b}^{IV}$$. One is a maximum likelihood technique called limited information maximum likelihood (LIML) and another is termed Three Stage Least Squares (3SLS). We won’t be investigating these further in this class, but both are available in Stata (LIML through the ivregress command and 3SLS through reg3). One quick point: for small samples, LIML is often the best approach.
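To illustrate the first two estimation methods with more instruments than endogenous variables, the sketch below computes 2SLS and a two-step GMM estimate on simulated data with $$M=2$$. It is an illustration under stated assumptions, not Stata's implementation: the data-generating process is made up, and the weighting matrix is taken to be the inverse of $$\mathbf{z'}diag(e_i^2)\mathbf{z}/N$$ built from the 2SLS residuals, in the spirit of the heteroskedastic approach described above.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100_000
x2 = rng.normal(size=N)
z1 = rng.normal(size=N)                      # first instrument
z2 = rng.normal(size=N)                      # second instrument
u = rng.normal(size=N)                       # unobserved factor causing endogeneity
xK = 0.5 + 0.3 * x2 + 0.7 * z1 + 0.7 * z2 + 0.8 * u + rng.normal(size=N)
y = 1.0 + 0.5 * x2 + 2.0 * xK + u + rng.normal(size=N)   # true beta = (1.0, 0.5, 2.0)

ones = np.ones(N)
X = np.column_stack([ones, x2, xK])          # N x K
Z = np.column_stack([ones, x2, z1, z2])      # N x (K + M - 1)

# 2SLS: replace xK by its first-stage fitted value
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Two-step GMM: weight matrix from squared 2SLS residuals (assumed form)
e = y - X @ b_2sls
S = (Z * (e ** 2)[:, None]).T @ Z / N
W = np.linalg.inv(S)

# Closed-form minimizer of e'z W z'e / N for a linear model
XZ = X.T @ Z
b_gmm = np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

print("2SLS:", b_2sls)
print("GMM: ", b_gmm)
```

With homoskedastic simulated errors the two estimates are nearly identical; the estimators are expected to diverge more when the error variance depends on the instruments.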

### Testing for Strong and Relevant Instruments

Testing for the suitability of the instruments is also important in this context. Using the first-stage regression, we test the null hypothesis

$$$H_0: \theta_1=\theta_2=\ldots=\theta_M=0$$$

using an F test with $$(M,N-K-M+1)$$ degrees of freedom, since the first stage estimates $$K+M-1$$ parameters.
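The joint test can be sketched by comparing restricted and unrestricted first-stage regressions. The data below are simulated with $$M=2$$ instruments and assumed parameter values; the denominator degrees of freedom are taken as the residual degrees of freedom of the unrestricted first stage.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared OLS residuals from regressing y on X."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(5)
N, M = 5_000, 2
x2 = rng.normal(size=N)
z1 = rng.normal(size=N)
z2 = rng.normal(size=N)
xK = 0.5 + 0.3 * x2 + 0.7 * z1 + 0.7 * z2 + rng.normal(scale=1.3, size=N)

restricted = np.column_stack([np.ones(N), x2])        # imposes theta_1 = theta_2 = 0
unrestricted = np.column_stack([restricted, z1, z2])  # full first stage

df2 = N - unrestricted.shape[1]                       # residual degrees of freedom
ssr_r = ssr(restricted, xK)
ssr_u = ssr(unrestricted, xK)
F = ((ssr_r - ssr_u) / M) / (ssr_u / df2)

print(f"F({M}, {df2}) = {F:.2f}")
```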

### Overidentification in IV regression

Recall that in the IV regression model, we might have as many as $$M$$ instrumental variables for $$K_{end}$$ endogenous regressors. In our example, $$K_{end}=1$$, but in a general 2SLS setting we need $$K_{end} \le M$$ in order to identify $$\beta$$. However, consider a situation where $$K_{end} < M$$. By including a myriad of instruments, we might be introducing bias in our estimate of $$\beta$$ because some subset of our IVs do not, in fact, satisfy the important requirement that $$E(\mathbf{z'\epsilon})=0$$. In effect, we can test whether our set of IVs is a valid candidate IV set, that is, whether it avoids instrumental variables that may themselves be correlated with the model errors. Under i.i.d. errors, this test is called the Sargan test.

Fortunately, the test is easy to implement.

• Step 1: Recover the estimated residuals from the 2SLS regression. I label this vector as $$\mathbf{e}_{2sls}$$.

• Step 2: Regress $$\mathbf{e}_{2sls}$$ on the full set of exogenous variables, $$\mathbf{x_{-K}}$$ and $$\mathbf{z}$$. Make sure to omit the endogenous variable.

The test statistic, $$N \times R^2$$, where $$R^2$$ is recovered from this regression, is distributed $$\chi^2$$ with degrees of freedom equal to the number of excluded instruments in the 2SLS regression minus the number of endogenous variables. Fortunately for us, the ivreg2 command automatically reports the Sargan statistic for overidentification. If we reject the null hypothesis, then at least one of our instrumental variables appears to be correlated with the errors, and our logic for choosing the set of IVs must be reexamined. In short, low p-values indicate that we need to re-evaluate our set of IVs.

The intuition of this test rests with information contained in the error structure from the 2SLS. If these errors can be explained well using information contained in our IVs (the ivreg2 command labels these as excluded instruments), then they really aren’t good instruments, since we need them to be uncorrelated with the error. Rather than test each excluded IV sequentially, the Sargan approach jointly tests whether overidentification is a problem or not. If it is, consider using a subset of IVs or search for new ones.
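The two steps of the Sargan test can be sketched as below. The helper names `tsls` and `sargan`, the simulated data, and the parameter values are assumptions for illustration. Two cases are compared: one where both instruments are valid, and one where the second instrument enters the outcome equation directly and is therefore invalid.

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS coefficients for regressors X using the exogenous set Z."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

def sargan(y, X, Z):
    """N * R^2 from regressing the 2SLS residuals on all exogenous variables."""
    e = y - X @ tsls(y, X, Z)                       # Step 1: 2SLS residuals
    fitted = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)  # Step 2: regress e on Z
    r2 = 1.0 - np.sum((e - fitted) ** 2) / np.sum((e - e.mean()) ** 2)
    return len(y) * r2

rng = np.random.default_rng(6)
N = 20_000
x2 = rng.normal(size=N)
z1 = rng.normal(size=N)
z2 = rng.normal(size=N)
u = rng.normal(size=N)
xK = 0.5 + 0.3 * x2 + 0.7 * z1 + 0.7 * z2 + 0.8 * u + rng.normal(size=N)
y_good = 1.0 + 0.5 * x2 + 2.0 * xK + u + rng.normal(size=N)
y_bad = y_good + 0.5 * z2            # z2 now affects y directly: invalid instrument

ones = np.ones(N)
X = np.column_stack([ones, x2, xK])
Z = np.column_stack([ones, x2, z1, z2])

# Degrees of freedom: 2 excluded instruments - 1 endogenous variable = 1
s_good = sargan(y_good, X, Z)
s_bad = sargan(y_bad, X, Z)
print("Sargan statistic, valid instruments:  ", s_good)
print("Sargan statistic, one invalid instrument:", s_bad)
```

With one degree of freedom the 5% critical value of the $$\chi^2$$ distribution is 3.84. In the invalid case the statistic is far beyond it, which is the signal to re-evaluate the instrument set.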

### Standard Errors

Manual calculations should be avoided as a correction must be made to the standard errors. The two-step method outlined in the Estimation Methods section should generally not be implemented by hand: the approach is error-prone and can lead to inconsistent estimates of $$\mathbf{\beta}$$, and the variance/covariance matrix of the parameters is also incorrect since it fails to account for the underlying randomness associated with $$\mathbf{\hat{x}}^*_K$$. [4]

## Implementation in Stata

This section quickly explores the problem of endogeneity and how to estimate this class of models in Stata. Recall that the OLS estimator requires

$E(\mathbf{x'\epsilon}) = 0$

This code shows how to overcome estimation problems where this assumption fails but where we can identify an instrument for implementing instrumental variables regression (IV regression). First, let’s open up the data in Stata, noting that we are using a “cross-sectioned” version of Tobias and Koop that focuses on 1983. Load data and summarize:

# start a connected stata17 session
from pystata import config
config.init('be')
config.set_streaming_output_mode('off')

%%stata
webuse set "https://rlhick.people.wm.edu/econ407/data/"
webuse tobias_koop
keep if time==4
sum

. webuse set "https://rlhick.people.wm.edu/econ407/data/"
(prefix now "https://rlhick.people.wm.edu/econ407/data")

. webuse tobias_koop

. keep if time==4
(16,885 observations deleted)

. sum

Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
id |      1,034    1090.952    634.8917          4       2177
educ |      1,034    12.27466    1.566838          9         19
ln_wage |      1,034    2.138259    .4662805        .42       3.59
pexp |      1,034     4.81528    2.190298          0         12
time |      1,034           4           0          4          4
-------------+---------------------------------------------------------
ability |      1,034    .0165957    .9209635      -3.14       1.89
meduc |      1,034    11.40329    3.027277          0         20
feduc |      1,034    11.58511    3.735833          0         20
broken_home |      1,034    .1692456    .3751502          0          1
siblings |      1,034    3.200193    2.126575          0         15
-------------+---------------------------------------------------------
pexp2 |      1,034    27.97969    22.59879          0        144

.


### First run OLS

If we ignore any potential endogeneity problem we can estimate OLS as described in the OLS chapter companion. Here are the results from Stata:

%%stata
reg ln_wage pexp pexp2 educ broken_home

      Source |       SS           df       MS      Number of obs   =     1,034
-------------+----------------------------------   F(4, 1029)      =     51.36
Model |  37.3778146         4  9.34445366   Prob > F        =    0.0000
Residual |   187.21445     1,029  .181938241   R-squared       =    0.1664
Total |  224.592265     1,033  .217417488   Root MSE        =    .42654

------------------------------------------------------------------------------
ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
_cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------


where education has the elasticity

%%stata
margins, dyex(educ) continuous

Average marginal effects                                 Number of obs = 1,034
Model VCE: OLS

Expression: Linear prediction, predict()
dy/ex wrt:  educ

------------------------------------------------------------------------------
|            Delta-method
|      dy/ex   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
educ |   1.046691   .1140274     9.18   0.000     .8229385    1.270444
------------------------------------------------------------------------------


### Running IV Regression

Suppose we are worried that education is endogenous, that is, that it is correlated with the population regression errors. This means OLS estimates of $$\beta$$ are biased. We hypothesize that the variable feduc is a good instrument having all the properties we describe in detail in the notes document.

In Stata, we use this code:

%%stata
ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc)

Instrumental variables 2SLS regression            Number of obs   =      1,034
Wald chi2(4)    =     138.19
Prob > chi2     =     0.0000
R-squared       =     0.1277
Root MSE        =     .43528

------------------------------------------------------------------------------
ln_wage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
educ |   .1495027   .0320009     4.67   0.000     .0867821    .2122233
pexp |    .214752   .0246553     8.71   0.000     .1664285    .2630755
pexp2 |  -.0117453   .0023508    -5.00   0.000    -.0163529   -.0071377
broken_home |   .0244713   .0397189     0.62   0.538    -.0533763     .102319
_cons |  -.4064389   .4356072    -0.93   0.351    -1.260213    .4473354
------------------------------------------------------------------------------
Instrumented: educ
Instruments: pexp pexp2 broken_home feduc


Note that the mean estimate for the elasticity on education has nearly doubled compared to OLS, rising from about 1.05 to about 1.84:

%%stata
margins, dyex(educ) continuous

Average marginal effects                                 Number of obs = 1,034

Expression: Linear prediction, predict()
dy/ex wrt:  educ

------------------------------------------------------------------------------
|            Delta-method
|      dy/ex   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
educ |   1.835095   .3928002     4.67   0.000     1.065221     2.60497
------------------------------------------------------------------------------
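The two elasticities reported by margins can be reproduced by hand. Because the model is linear in educ and the dependent variable is ln_wage, the average of dy/ex over the sample is the educ coefficient times the sample mean of educ. The short sketch below uses only numbers copied from the Stata output above; the variable names are mine.

```python
# Values copied from the Stata output above
mean_educ = 12.27466        # sample mean of educ from `sum`
b_ols_educ = 0.0852725      # educ coefficient from `reg`
b_iv_educ = 0.1495027       # educ coefficient from `ivregress 2sls`

# Average dy/ex for a linear model: coefficient times the mean of the regressor
elast_ols = b_ols_educ * mean_educ
elast_iv = b_iv_educ * mean_educ

print(round(elast_ols, 4))             # OLS elasticity
print(round(elast_iv, 4))              # IV elasticity
print(round(elast_iv / elast_ols, 2))  # ratio of IV to OLS
```

This gives 1.0467 and 1.8351, matching the margins output up to rounding, for a ratio of about 1.75.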


Stata’s ivregress output with robust standard errors is obtained from

%%stata
ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), robust

Instrumental variables 2SLS regression            Number of obs   =      1,034
Wald chi2(4)    =     150.52
Prob > chi2     =     0.0000
R-squared       =     0.1277
Root MSE        =     .43528

------------------------------------------------------------------------------
|               Robust
ln_wage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
educ |   .1495027   .0329085     4.54   0.000     .0850033    .2140021
pexp |    .214752   .0238629     9.00   0.000     .1679815    .2615225
pexp2 |  -.0117453   .0023595    -4.98   0.000    -.0163698   -.0071208
broken_home |   .0244713   .0335032     0.73   0.465    -.0411937    .0901364
_cons |  -.4064389   .4404503    -0.92   0.356    -1.269706    .4568278
------------------------------------------------------------------------------
Instrumented: educ
Instruments: pexp pexp2 broken_home feduc


### Testing Assumptions

We have more work to do:

1. Test for relevant and strong instruments

2. Test for endogeneity

3. Test for overidentification (not relevant for this example)

In Stata, we issue these commands:

%%stata
estat firststage

  First-stage regression summary statistics
--------------------------------------------------------------------------
|            Adjusted      Partial
Variable |   R-sq.       R-sq.        R-sq.     F(1,1029)   Prob > F
-------------+------------------------------------------------------------
educ |  0.2416      0.2387       0.0878       80.2589    0.0000
--------------------------------------------------------------------------


Note that since the number of instruments is equal to the number of endogenous variables, the model is exactly identified and we don’t have an overidentification problem, so Stata refuses to run the test:

%%stata
estat overid

---------------------------------------------------------------------------
SystemError                               Traceback (most recent call last)
<ipython-input-9-d841baac8b2d> in <module>
----> 1 get_ipython().run_cell_magic('stata', '', 'estat overid\n')

~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2397             with self.builtin_trap:
2398                 args = (magic_arg_s, cell)
-> 2399                 result = fn(*args, **kwargs)
2400             return result
2401

<decorator-gen-117> in stata(self, line, cell, local_ns)

~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
185     # but it's overkill for just that one bit of state.
186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
188
189         if callable(arg):

/usr/local/stata/utilities/pystata/ipython/stpymagic.py in stata(self, line, cell, local_ns)
274             _stata.run(cell, quietly=True, inline=_config.stconfig['grshow'])
275         else:
--> 276             _stata.run(cell, quietly=False, inline=_config.stconfig['grshow'])
277
278         if '-gw' in args or '-gh' in args:

/usr/local/stata/utilities/pystata/stata.py in run(cmd, quietly, echo, inline)
299                 _stata_wrk1("qui " + cmds[0], echo)
300             else:
--> 301                 _stata_wrk1(cmds[0], echo)
302     else:
303         if inline:

/usr/local/stata/utilities/pystata/stata.py in _stata_wrk1(cmd, echo)
76             while len(output)!=0:
77                 if rc1 != 0:
---> 78                     raise SystemError(output)
79
80                 _print_no_streaming_output(output, False)

SystemError: no overidentifying restrictions
r(498);


The Python stack trace is irrelevant here and will terrify my students. All the user needs to see is the Stata part of the error:

SystemError: no overidentifying restrictions
r(498);


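The first-stage results tell us we have a relevant and strong instrument: the F statistic of 80.26 rejects $$\theta_K=0$$ at any conventional level. Whether education is endogenous is a separate question, answered by the endogeneity test (estat endogenous after ivregress), whose output is not shown here.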

Here is another error:

%%stata
gen ln_wage = 5

---------------------------------------------------------------------------
SystemError                               Traceback (most recent call last)
<ipython-input-10-f4b1cbd9edf4> in <module>
----> 1 get_ipython().run_cell_magic('stata', '', 'gen ln_wage = 5\n')

~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2397             with self.builtin_trap:
2398                 args = (magic_arg_s, cell)
-> 2399                 result = fn(*args, **kwargs)
2400             return result
2401

<decorator-gen-117> in stata(self, line, cell, local_ns)

~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
185     # but it's overkill for just that one bit of state.
186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
188
189         if callable(arg):

/usr/local/stata/utilities/pystata/ipython/stpymagic.py in stata(self, line, cell, local_ns)
274             _stata.run(cell, quietly=True, inline=_config.stconfig['grshow'])
275         else:
--> 276             _stata.run(cell, quietly=False, inline=_config.stconfig['grshow'])
277
278         if '-gw' in args or '-gh' in args:

/usr/local/stata/utilities/pystata/stata.py in run(cmd, quietly, echo, inline)
299                 _stata_wrk1("qui " + cmds[0], echo)
300             else:
--> 301                 _stata_wrk1(cmds[0], echo)
302     else:
303         if inline:

/usr/local/stata/utilities/pystata/stata.py in _stata_wrk1(cmd, echo)
76             while len(output)!=0:
77                 if rc1 != 0:
---> 78                     raise SystemError(output)
79
80                 _print_no_streaming_output(output, False)

r(110);


Again, the Python stack trace is irrelevant and identical to the previous one. All the user needs to see is the Stata part of the error:

SystemError: variable ln_wage already defined
r(110);


[1] These steps are not correct for the case of more than 1 instrumental variable. However, they are instructive in understanding the intuition of the Hausman Test in the instrumental variables context. If you have more than 1 instrumental variable, you must use the ivendog or hausman commands in Stata.

[2] This equation will exactly replicate the Stata ivregress command (for 2SLS) using the options vce(unadjusted) small.

[3] This equation will exactly replicate the Stata ivregress command (for 2SLS) using the options vce(robust) small, defining $$\mathbf{V}$$ as we did in the OLS chapter.

[4] This is true if the model has more instruments than endogenous variables.