---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.10.3
kernelspec:
display_name: Python 3
language: python
name: python3
---
# Instrumental Variables Regression
The exogeneity assumption of the previous chapter is a particularly
strong one, and there are many cases where we would not expect it to
hold. Education and earnings is a classic example. A wage equation that
is a function of education violates this assumption if factors that
influence wages, but are not accounted for in the regression equation,
are also related to educational attainment. In a pure cross-section
world, for example, unobserved individual-specific factors may influence
both wage and educational attainment. The problem of endogeneity is a
very big deal and one you should worry about every time you estimate any
type of regression model. To see why, note that our ability to estimate
unbiased parameters hinges on
```{math}
\begin{equation}
\mathbf{E(x'\epsilon)=0}
\end{equation}
```
If there is in fact correlation between the elements of $\mathbf{x}$ and
$\mathbf{\epsilon}$ then our parameter estimates are biased.
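
To make the bias concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the `ability` variable, the coefficient values, the sample size); it is not the wage data used later in the chapter. An unobserved factor drives both the regressor and the error, so $E(\mathbf{x'\epsilon}) \neq 0$ by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical unobserved factor that drives both x and the error term.
ability = rng.normal(size=n)
x = 0.8 * ability + rng.normal(size=n)
eps = 0.5 * ability + rng.normal(size=n)   # correlated with x by construction

beta0, beta1 = 1.0, 0.5                    # assumed true parameters
y = beta0 + beta1 * x + eps

# OLS: b = (X'X)^{-1} X'y
X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("sample mean of x * eps:", np.mean(x * eps))
print("OLS slope:", b_ols[1], "assumed true slope:", beta1)
```

In this setup the OLS slope settles well above the assumed true value, and a larger sample does not help, because the problem is the correlation between `x` and `eps` rather than sampling noise.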
## A Single Instrumental Variable
In this section, we discuss ways of addressing endogeneity in the OLS
framework. As before, assume a linear model
```{math}
:label: end:eq:ivbasic_framework
\begin{equation}
\mathbf{y}=\mathbf{x}\beta+\epsilon=\beta_0+\beta_2 \mathbf{x_2} + \beta_3 \mathbf{x_3} + \ldots + \beta_K \mathbf{x_K} + \mathbf{\epsilon}
\end{equation}
```
where $\mathbf{x}_j$ is an $N \times 1$ column vector with data from
column $j$, and where the following holds:
- $E(\mathbf{\epsilon})=0$
- $E(\mathbf{x_{j}\epsilon})=0$ for $j=1,2,\ldots,K-1$
Notice that we are invoking the exogeneity assumption for some of our
explanatory variables but not all of them. The explanatory variable
$\mathbf{x_K}$ is potentially endogenous and a failure to deal with this
will potentially lead to biased parameter estimates.
The method of instrumental variables offers a way of handling this
problem. Denoting the instrumental variable as $\mathbf{z}_K$, we need
it to have these properties:
- **Assumption 1**: $E(\mathbf{z'\epsilon})=0$
- **Assumption 2**: In the following linear relationship,
```{math}
:label: end:eq:reducedformx
\begin{equation}
\mathbf{x_K}=\delta_0 + \delta_2 \mathbf{x_2}+\ldots+ \delta_{K-1} \mathbf{x_{K-1}}+\theta_K \mathbf{z_K} + \mathbf{r}
\end{equation}
```
where $E(\mathbf{r})=0$ and $\mathbf{r}$ is uncorrelated with the
right-hand-side variables. We need $\theta_K$ to be non-zero for
$\mathbf{z_K}$ to be a valid instrument. This is like saying that
$\mathbf{z_K}$ is partially correlated with $\mathbf{x_K}$ after netting
out the effects of $\mathbf{x_1,\ldots,x_{K-1}}$.
When $\mathbf{z_K}$ satisfies these conditions, it is called an
**Instrumental Variable (IV)** candidate for $\mathbf{x_K}$. The next
step is understanding how one uses this structure for estimating the
parameters of interest ($\mathbf{\beta}$). One tempting option is to drop
our endogenous variable ($\mathbf{x}_K$), replace it with our IV variable
($\mathbf{z}_K$), and then estimate an OLS model.
### Using the IV
Define the matrix $\mathbf{z}$ as
```{math}
\begin{equation}
\mathbf{z} = \begin{bmatrix} 1 & x_{12} & \ldots & x_{1,K-1} & z_{1,K} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_{i2} & \ldots & x_{i,K-1} & z_{i,K} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_{N2} & \ldots & x_{N,K-1} & z_{N,K} \\
\end{bmatrix}
= \begin{bmatrix} \mathbf{1} & \mathbf{x}_2 &\ldots & \mathbf{x}_{K-1} & \mathbf{z}_K \end{bmatrix}
\end{equation}
```
Notice that in doing this we have made a drop-in replacement of
$\mathbf{x}_K$ with $\mathbf{z}_K$.

So how do we use the IV (found in $\mathbf{z}$) to estimate $\beta$? One
idea is to define $\mathbf{b}^{iv}$ as $\mathbf{(z'z)^{-1}z'y}$, since
we have already established that $\mathbf{z}_K$ is exogenous and
correlated with $\mathbf{x}_K$ even after controlling for all of the
other exogenous information at hand. What are the properties of such an
estimator? To examine this, substitute the "relevancy equation"
({eq}`end:eq:reducedformx`) into our original estimating equation:
```{math}
\begin{eqnarray}
\mathbf{y}&=&\beta_0 + \beta_2 \mathbf{x}_2 + \ldots + \beta_{K-1} \mathbf{x}_{K-1} \nonumber \\
& &+ \beta_K (\delta_0 + \delta_2 \mathbf{x}_2 + \ldots + \delta_{K-1} \mathbf{x}_{K-1} + \theta_K \mathbf{z}_K + \mathbf{r}) + \epsilon \\
&=& (\beta_0 + \beta_K \delta_0) + (\beta_2 + \beta_K \delta_2) \mathbf{x}_2 + \ldots + (\beta_{K-1} + \beta_{K} \delta_{K-1}) \mathbf{x}_{K-1} \nonumber \\
& &+ (\beta_K \theta_K)\mathbf{z}_K + (\beta_K \mathbf{r}+\epsilon)\\
&=&\alpha_0 + \alpha_2 \mathbf{x}_2 + \ldots + \alpha_{K-1} \mathbf{x}_{K-1} + \alpha_K \mathbf{z}_K + \mathbf{v}
\end{eqnarray}
```
If we run this regression, we obtain estimates defined as
$\mathbf{a} = \mathbf{(z'z)^{-1}z'y}$. From this we can see that
substituting our instrumental variable in as a proxy for
$\mathbf{x}_K$ and recovering the parameters $\mathbf{a}$ gives
estimates where:

- $\alpha_k \neq \beta_k$ for **every** parameter you estimate, not
just the endogenous one, $\beta_K$. For example,
$\alpha_0 = \beta_0 + \beta_K \delta_0 \neq \beta_0$
- The composite errors $\mathbf{v} = \beta_K \mathbf{r} + \epsilon$ are
not distributed $N(0,\sigma^2\mathbf{I})$, because they mix the errors
from the relevancy equation with the model errors

Consequently, for $\mathbf{a} = \mathbf{(z'z)^{-1}z'y}$,

- $E[\mathbf{a}] \neq \beta$, so this is not a good IV estimator.
- Given an estimate for $\mathbf{a}$, we can't solve for the $K$
estimates of $\beta$ because we have $K$ equations in $2\times K$
unknowns.
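
The proxy regression can be checked on simulated data. The sketch below is illustrative only: the data generating process, the variable names (`x2`, `xK`, `z`, `common`) and every parameter value are assumptions. It regresses $\mathbf{y}$ on the constant, one exogenous regressor and the instrument, then compares the result with the composite parameters $\alpha$ from the substitution above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

x2 = rng.normal(size=n)                  # exogenous regressor
z = rng.normal(size=n)                   # instrument
common = rng.normal(size=n)              # shared shock: source of endogeneity
r = 0.7 * common + rng.normal(size=n)    # error in the relevancy equation
eps = 0.7 * common + rng.normal(size=n)  # model error

delta0, delta2, theta = 0.5, 0.3, 0.6    # assumed relevancy-equation parameters
beta = np.array([1.0, 2.0, 0.5])         # assumed beta_0, beta_2, beta_K

xK = delta0 + delta2 * x2 + theta * z + r
y = beta[0] + beta[1] * x2 + beta[2] * xK + eps

# Proxy regression: replace xK with z and run OLS, a = (z'z)^{-1} z'y
Z = np.column_stack([np.ones(n), x2, z])
a = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Composite parameters implied by the substitution
alpha = np.array([
    beta[0] + beta[2] * delta0,
    beta[1] + beta[2] * delta2,
    beta[2] * theta,
])

print("a     :", a)
print("alpha :", alpha)
print("beta  :", beta)
```

The estimates track `alpha`, not `beta`, for every coefficient, which is the point made in the list above.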
A better way of defining our instrumental variable estimator follows
from our assumption of no correlation between $\mathbf{z}$ and
$\epsilon$:
```{math}
\begin{eqnarray}
0 &=& E[\mathbf{z'\epsilon}] \\
&=& E[\mathbf{z'(y-x\beta)}]
\end{eqnarray}
```
from our assumptions above. Premultiplying the model by $\mathbf{z}'$ gives
$\mathbf{z'y=z'x \beta + z'\epsilon}$, and taking expectations results in
```{math}
\begin{equation}
\mathbf{E(z'y)=E(z'x \beta)}
\end{equation}
```
This gives us $K$ equations in $K$ unknowns, which can be solved.
Simplifying further, we can write $\mathbf{\beta}$ as
```{math}
\begin{equation}
\mathbf{\beta=[E(z'x)]^{-1}E[z'y]}
\end{equation}
```
where both expectation terms can be consistently estimated using a
random sample on ($\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z_K}$). A very
important point to remember is that this system of equations is solvable
only if
```{math}
\begin{equation}
rank(E[\mathbf{z'x}])=K
\end{equation}
```
This rank condition is what achieves identification, since a square
matrix without full rank cannot be inverted. In practice, this estimator
can be implemented given a random sample from the population as
```{math}
\begin{equation}
\mathbf{b}^{IV}=\mathbf{(z'x)^{-1} z'y}=\mathbf{(\hat{x}'\hat{x})^{-1} \hat{x}'y}
\end{equation}
```
where $\hat{\mathbf{x}}$ is equal to
$\mathbf{z}(\mathbf{z'z})^{-1}\mathbf{z'x}$, or the predicted value of
$\mathbf{x}$ given our instruments $\mathbf{z}$.
It is important to note that $\mathbf{b}^{IV}$ is a consistent estimator
for $\beta$ but is generally biased in finite samples, so we rely on
large sample properties.
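
A minimal NumPy sketch of this estimator follows, using simulated data. The data generating process and all parameter values are assumptions for illustration. It checks the rank condition, computes $\mathbf{(z'x)^{-1}z'y}$, and confirms that regressing $\mathbf{y}$ on the predicted values $\hat{\mathbf{x}}$ gives the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

x2 = rng.normal(size=n)
zK = rng.normal(size=n)
common = rng.normal(size=n)                # shared shock makes xK endogenous
xK = 0.5 + 0.3 * x2 + 0.6 * zK + 0.7 * common + rng.normal(size=n)
eps = 0.7 * common + rng.normal(size=n)

beta = np.array([1.0, 2.0, 0.5])           # assumed true parameters
X = np.column_stack([np.ones(n), x2, xK])  # regressors, last column endogenous
Z = np.column_stack([np.ones(n), x2, zK])  # same matrix with zK swapped in
y = X @ beta + eps

# Sample analogue of the rank condition: z'x must have rank K
rank = np.linalg.matrix_rank(Z.T @ X)

# b_IV = (z'x)^{-1} z'y
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

# Equivalent form using x_hat = z (z'z)^{-1} z'x
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_xhat = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# OLS for comparison
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("rank of z'x:", rank)
print("IV        :", b_iv)
print("via x_hat :", b_xhat)
print("OLS       :", b_ols)
```

With this simulated design, the two IV forms agree to numerical precision, while OLS overstates the coefficient on the endogenous regressor.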
### Testing for Suitable Instruments
In the preceding section, we saw that two assumptions were necessary for
having a suitable instrument:
- **Assumption 1**: Orthogonality of the instrument and model errors
```{math}
\begin{equation}
E(\mathbf{z'\epsilon)}=0
\end{equation}
```
- **Assumption 2**: The partial correlation estimated in Equation
{eq}`end:eq:reducedformx` is non-zero
```{math}
\begin{equation}
\theta_K \ne 0
\end{equation}
```
Ideally, we would like to be able to test our assumptions to ensure that
our candidate instrumental variable meets these conditions. If our
instrumental variable does not conform with these assumptions, our
estimates of $\mathbf{\beta}$ will be biased. Assumption 1 can't be
formally tested since the true model errors, $\mathbf{\epsilon}$, are not
observed. However, the second condition can and should be tested by
estimating equation {eq}`end:eq:reducedformx` and conducting a
t-test on the $\theta_K$ parameter. As would be expected, a larger
t-statistic (a lower p-value) indicates a stronger instrument.
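
As a rough sketch of the relevance check, the code below estimates the relevancy equation by OLS on simulated data and forms the t-statistic for the coefficient on the instrument. The data, the variable names and the conventional (homoskedastic) standard error formula are all assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000

x2 = rng.normal(size=n)
zK = rng.normal(size=n)
# Assumed relevancy equation with theta_K = 0.6
xK = 0.5 + 0.3 * x2 + 0.6 * zK + rng.normal(size=n)

# Regress xK on the constant, the exogenous regressor, and the instrument
R = np.column_stack([np.ones(n), x2, zK])
RtR_inv = np.linalg.inv(R.T @ R)
coef = RtR_inv @ R.T @ xK
resid = xK - R @ coef

# Conventional OLS variance estimate: s^2 (R'R)^{-1}
s2 = resid @ resid / (n - R.shape[1])
se = np.sqrt(s2 * np.diag(RtR_inv))

theta_hat = coef[2]
t_theta = theta_hat / se[2]
print("theta_hat:", theta_hat, "t-statistic:", t_theta)
```

A t-statistic far from zero, as in this simulated case, is evidence against $\theta_K = 0$.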
### Testing for endogeneity
The basic test we consider here starts with the observation that if
there is endogeneity then $\mathbf{b}^{OLS}$ is biased. If we have tested
for and identified a useful instrument (or instruments) and estimated an
IV model, then we have the following hypotheses to test
- $H_0: \mathbf{b^{OLS}-b^{IV}}=0$
- $H_1: \mathbf{b^{OLS}-b^{IV}}\ne 0$

So, if our variable $\mathbf{x_K}$ is not endogenous, the difference
between the two estimators should be attributable to sampling error
only. If, on the other hand, we do have an endogenous regressor and an
instrument for it, the bias should show up in this difference and we
would reject the null hypothesis in favor of the 2SLS technique. The
test we will use is called the Hausman test. It can be applied to a wide
range of problems well beyond the endogeneity case considered here, and
will be used later in the course.
To implement this test in the instrumental variable context, we can
follow an approach for the Hausman test outlined by Wu that is
applicable to the instrumental variable case only.[^1] This test can be
performed manually by
- **Step 1**: Regress the endogenous variable ($\mathbf{x}_K$)
on all exogenous variables, both $\mathbf{x}_{-K}$ and
$\mathbf{z}$, and recover the estimated residuals $\hat{\mathbf{u}}$
from the following regression:
```{math}
\begin{equation}
\mathbf{x}_K=\mathbf{x}_{-K}\delta_{-K}+\mathbf{z}\theta+\mathbf{u}
\end{equation}
```
- **Step 2**: Regress the dependent variable in the regression
($\mathbf{y}$) on the full set of *original* independent variables,
$\mathbf{x}$, adding the residuals $\hat{\mathbf{u}}$ from Step 1 as an
additional regressor.
```{math}
\begin{equation}
\mathbf{y=x\beta+\delta\hat{\mathbf{u}}+\mathbf{\gamma}}
\end{equation}
```
- **Step 3**: Based on the preceding step, test the null
hypothesis that $\delta=0$, that is, that the regressor is exogenous.
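
The three steps above can be sketched in NumPy on simulated data. The helper `ols`, the data generating process and all parameter values are assumptions for illustration, and the t-statistic uses conventional OLS standard errors.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients, residuals, and conventional standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (X.shape[0] - X.shape[1])
    return b, e, np.sqrt(s2 * np.diag(XtX_inv))

rng = np.random.default_rng(4)
n = 20_000
x2 = rng.normal(size=n)
zK = rng.normal(size=n)
common = rng.normal(size=n)               # makes xK endogenous by construction
xK = 0.5 + 0.3 * x2 + 0.6 * zK + 0.7 * common + rng.normal(size=n)
eps = 0.7 * common + rng.normal(size=n)
y = 1.0 + 2.0 * x2 + 0.5 * xK + eps
ones = np.ones(n)

# Step 1: regress xK on all exogenous variables and keep the residuals
_, u_hat, _ = ols(np.column_stack([ones, x2, zK]), xK)

# Step 2: regress y on the original regressors plus the Step 1 residuals
b_aug, _, se_aug = ols(np.column_stack([ones, x2, xK, u_hat]), y)

# Step 3: t-test of the null that the coefficient on u_hat is zero
delta_hat = b_aug[3]
t_delta = delta_hat / se_aug[3]
print("delta_hat:", delta_hat, "t-statistic:", t_delta)
```

Because the simulated regressor is endogenous by design, the t-statistic is large and the null of exogeneity is rejected.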
### Standard Errors
In the IV framework, the variance-covariance matrix is [^2]
```{math}
\begin{equation}
Var[\mathbf{b}^{IV}]=\sigma^2 \left( \mathbf{(z'x)^{-1} z'z (z'x)^{-1\prime}} \right)
\end{equation}
```
The robust version of this can be calculated with a definition of
$\mathbf{V}$ similar to the one used in the OLS robust standard error section [^3]:
```{math}
\begin{equation}
Var^{robust}[\mathbf{b}^{IV}]=\mathbf{(z'x)^{-1} z'V z (z'x)^{-1\prime}}
\end{equation}
```
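
Both variance formulas can be written out in a few lines of NumPy. This is a sketch on simulated data with assumed parameter values. Two choices here are assumptions of the example and not statements about Stata's output: $\sigma^2$ is estimated with a degrees-of-freedom divisor of $N-K$, and $\mathbf{V}$ is taken to be a diagonal matrix of squared IV residuals.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
x2 = rng.normal(size=n)
zK = rng.normal(size=n)
common = rng.normal(size=n)
xK = 0.5 + 0.3 * x2 + 0.6 * zK + 0.7 * common + rng.normal(size=n)
eps = 0.7 * common + rng.normal(size=n)    # homoskedastic by construction
X = np.column_stack([np.ones(n), x2, xK])
Z = np.column_stack([np.ones(n), x2, zK])
y = X @ np.array([1.0, 2.0, 0.5]) + eps
K = X.shape[1]

A = np.linalg.inv(Z.T @ X)                 # (z'x)^{-1}
b_iv = A @ Z.T @ y
e = y - X @ b_iv                           # IV residuals use the actual x

# Unadjusted: sigma^2 (z'x)^{-1} z'z (z'x)^{-1}'
sigma2 = e @ e / (n - K)
V_unadj = sigma2 * A @ (Z.T @ Z) @ A.T

# Robust: (z'x)^{-1} z'Vz (z'x)^{-1}' with V = diag(e_i^2) (assumed form)
meat = (Z * (e ** 2)[:, None]).T @ Z
V_robust = A @ meat @ A.T

se_unadj = np.sqrt(np.diag(V_unadj))
se_robust = np.sqrt(np.diag(V_robust))
print("unadjusted s.e.:", se_unadj)
print("robust s.e.    :", se_robust)
```

Since the simulated errors are homoskedastic, the two sets of standard errors come out close to each other.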
## Multiple Instrumental Variables
In the section above, we restricted our attention to the case of one and
only one instrumental variable $\mathbf{z_K}$ for the endogenous
variable $\mathbf{x_K}$. What if we now allow there to be $M$
instruments for $\mathbf{x_K}$, such that
```{math}
\begin{equation}
\mathbf{z}=\begin{bmatrix} \mathbf{ 1} & \mathbf{x_2} & \ldots & \mathbf{x_{K-1}} & \mathbf{z_1} & \mathbf{z_2} & \ldots & \mathbf{z_M} \end{bmatrix}
\end{equation}
```
where $\mathbf{z}$ and each $\mathbf{z_h}$ are of dimension
$N \times (K+M-1)$ and $N \times 1$, respectively. We maintain the
assumption that
```{math}
\begin{equation}
E(\mathbf{z_h'\epsilon})=0 \hspace{.1in} h=1,2,\ldots,M
\end{equation}
```
As above, consider the condition ($E(\mathbf{z'\epsilon})=0$) and the
implications for our IV estimator:
```{math}
:label: end:eq:moment_iv
\begin{equation}
\mathbf{z'e^{iv}} = 0
\end{equation}
```
Notice the dimensionality of this condition: $\mathbf{z}$ is of
dimension $N \times (K+M-1)$, whereas the vector of IV residuals
$\mathbf{e}^{iv}$ is $N \times 1$.
The product will be of dimension $(K+M-1) \times 1$. But we only have
$K$ parameters to estimate, so when $M>1$ we have more equations than
unknowns. Should we choose one of the z's, all of the z's, or a subset
of the z's? If we choose more than one of the z's, can we continue to
use IV regression from the previous section? The answer is no. We need
to proceed using one of the estimation methods below.
(estimation_methods)=
### Estimation Methods
An important point is that the distinctions outlined below for the
various estimation methods only exist when the number of instruments
exceeds the number of endogenous variables. If your model is exactly
identified (as in the preceding section), it is sufficient to focus on
2SLS.
1. Two-Stage Least Squares
In a similar way to equation {eq}`end:eq:reducedformx`,
write the linear projection of $\mathbf{x}_K$ onto $\mathbf{z}$ as
```{math}
\begin{equation}
\mathbf{x}_K=\delta_0 + \delta_2 \mathbf{x}_2 + \ldots + \delta_{K-1}\mathbf{x}_{K-1} + \theta_1 \mathbf{z}_1 + \ldots + \theta_{M} \mathbf{z}_M + \mathbf{r}_K
\end{equation}
```
where $\mathbf{r_K}$ has mean zero and is uncorrelated with all
independent variables. Since any linear combination of $\mathbf{z}$
is uncorrelated with $\mathbf{\epsilon}$ (from the assumption above),
```{math}
\begin{equation}
\mathbf{x^*_K} \equiv \delta_0 + \delta_2 \mathbf{x_2} + \delta_3 \mathbf{x_3} + \ldots + \delta_{K-1} \mathbf{x_{K-1}} + \theta_1 \mathbf{z_1} + \ldots + \theta_{M} \mathbf{z_M}
\end{equation}
```
is also uncorrelated with $\mathbf{\epsilon}$. Unfortunately,
neither $\mathbf{x}^*_K$ nor the parameters $\mathbf{\delta}$ and
$\mathbf{\theta}$ are known. We can use
a first stage estimator for $\mathbf{x}^*_K$, called
$\mathbf{\hat{x}}^*_K$, that is written as
```{math}
\begin{equation}
\mathbf{\hat{x}}^*_K=\hat{\delta}_0 + \hat{\delta}_2 \mathbf{x_2} + \hat{\delta}_3 \mathbf{x_3} + \ldots + \hat{\delta}_{K-1} \mathbf{x_{K-1}} + \hat {\theta}_1 \mathbf{z_1} + \ldots + \hat {\theta}_{M} \mathbf{z_M}
\end{equation}
```
by running an OLS regression. Denoting
$\hat{\mathbf{x}}=\begin{bmatrix} 1 & \mathbf{x_2} & \ldots & \mathbf{x_{K-1}} &\mathbf{\hat{x}}^*_K \end{bmatrix}$,
the two stage least squares estimator (2SLS) is
```{math}
\begin{equation}
\hat{\mathbf{\beta}}^{2SLS}=(\hat{\mathbf{x}}'\hat{\mathbf{x}})^{-1}\hat{\mathbf{x}}'\mathbf{y}
\end{equation}
```
2. Method of Moments
A better way to proceed (and the default method employed by Stata's
`ivreg` and `ivreg2` commands) is to minimize the condition outlined
above in Equation {eq}`end:eq:moment_iv`
```{math}
\begin{equation}
\underset{b^{IV}}{min} \hspace{.05in} \frac{\mathbf{e'z}W\mathbf{z'e}}{N}
\end{equation}
```
which is a scalar value, where $W$ is a $(K+M-1) \times (K+M-1)$
weighting matrix. If $W=\mathbf{(z'z)^{-1}}$, then the
GMM estimator and the 2SLS estimator yield the same result.
Because GMM nests 2SLS in this way while allowing other choices of $W$,
GMM is nearly always the preferred estimator. Stata's
default method for defining $W$ uses a heteroskedastic error approach,
constructing errors for each individual from the 2SLS model. This is
much like the $V$ matrix we defined for estimating robust standard
errors in the OLS chapter.
While it is almost always possible to find a $\mathbf{b}^{IV}$ that
minimizes this condition, it does not impose the orthogonality
condition for each column of $\mathbf{z}$. The possibility that some
of our instruments are correlated with our errors even after trying
to minimize the condition above opens the door to a problem
called **overidentification**.
3. LIML and 3SLS
There are also two additional techniques one can use for estimating
$\mathbf{b}^{IV}$. One is a maximum likelihood technique called
limited information maximum likelihood (LIML) and another is termed
Three-Stage Least Squares (3SLS). We won't be investigating these
further in this class, but they are available in Stata (LIML through the
`ivregress` command and 3SLS through `reg3`). One quick point: for small
samples, LIML is often the best approach.
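
The two-stage procedure in item 1 can be sketched with two instruments for one endogenous regressor. The data generating process, variable names and parameter values below are assumptions for illustration. The sketch also evaluates the sample moments $\mathbf{z'e}/N$ at the 2SLS estimate, which shows the "more equations than unknowns" issue: 2SLS sets $K$ linear combinations of the moments to zero, not every moment individually.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
x2 = rng.normal(size=n)
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
common = rng.normal(size=n)                 # makes xK endogenous
xK = 0.5 + 0.3 * x2 + 0.5 * z1 + 0.4 * z2 + 0.7 * common + rng.normal(size=n)
eps = 0.7 * common + rng.normal(size=n)

beta = np.array([1.0, 2.0, 0.5])            # assumed true parameters
X = np.column_stack([np.ones(n), x2, xK])   # K = 3 columns
Z = np.column_stack([np.ones(n), x2, z1, z2])  # K + M - 1 = 4 columns
y = X @ beta + eps

# First stage: fitted values of every column of x given z
# (exogenous columns are reproduced exactly, xK is replaced by its prediction)
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)

# Second stage: OLS of y on x_hat
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Residuals are computed with the actual x, not x_hat
e = y - X @ b_2sls

moments = Z.T @ e / n        # 4 sample moments, only 3 parameters
combos = X_hat.T @ e / n     # the 3 combinations that 2SLS sets to zero

print("2SLS estimate:", b_2sls)
print("z'e / N      :", moments)
print("x_hat'e / N  :", combos)
```

In this simulated example the moments for `z1` and `z2` are small but not exactly zero, while the three combinations in `combos` vanish up to rounding.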
### Testing for Strong and Relevant Instruments
Testing for the suitability of the instruments is also important in this
context. Here we test the null hypothesis
```{math}
\begin{equation}
H_0: \theta_1=\theta_2=\ldots=\theta_M=0
\end{equation}
```
using an F test with $(M,N-(K+M-1))$ degrees of freedom, since the first
stage regression estimates $K+M-1$ coefficients.
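
A hedged sketch of this joint test compares the first stage with and without the instruments. The simulated data and parameter values are assumptions, and the residual degrees of freedom are taken here as the number of observations less the number of coefficients estimated in the unrestricted first stage, which is an assumption of this example.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return resid @ resid

rng = np.random.default_rng(7)
n, M = 2_000, 2
x2 = rng.normal(size=n)
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
xK = 0.5 + 0.3 * x2 + 0.5 * z1 + 0.4 * z2 + rng.normal(size=n)
ones = np.ones(n)

unrestricted = np.column_stack([ones, x2, z1, z2])  # includes the instruments
restricted = np.column_stack([ones, x2])            # imposes theta_1 = theta_2 = 0

ssr_u = ssr(unrestricted, xK)
ssr_r = ssr(restricted, xK)

df_resid = n - unrestricted.shape[1]
F = ((ssr_r - ssr_u) / M) / (ssr_u / df_resid)
print("F statistic:", F, "with", (M, df_resid), "degrees of freedom")
```

A large F statistic, as in this simulated case, is evidence that the instruments are jointly relevant.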
### Overidentification in IV regression
Recall that in the IV regression model, we might have as many as $M$
instrumental variables for $K_{end}$ endogenous regressors. In our
example, $K_{end}=1$, but in a general 2SLS setting we need
$K_{end} \le M$ in order to identify $\beta$. However, consider a
situation where $K_{end} < M$. By including a myriad of instruments, we
might be introducing bias in our estimate of $\beta$ because some subset
of our IVs in fact does not satisfy the important requirement that
$E(\mathbf{z'\epsilon})=0$. In effect, we can test whether our set of
IVs is a suitable candidate set, so that we can avoid those instrumental
variables that themselves may be correlated with the model errors. Under
i.i.d. errors, this test is called the Sargan test.
Fortunately, the test is easy to implement.
- **Step 1**: Recover the estimated residuals from the 2SLS
regression. I label this vector as $\mathbf{e}_{2sls}$.
- **Step 2**: Regress $\mathbf{e}_{2sls}$ on the full set of exogenous
variables and instruments, $\mathbf{x_{-K}}$ and $\mathbf{z}$. Make sure
to omit the endogenous variable.
The test statistic, $N \times R^2$, where $R^2$ is recovered from the
regression in Step 2, is distributed $\chi^2$ with degrees of freedom
equal to the number of excluded instruments in the 2SLS regression minus
the number of endogenous variables. Fortunately for us, the `ivreg2`
command automatically reports the Sargan statistic for
overidentification. If we reject the null hypothesis, then
overidentification is a problem for our vector of instrumental
variables and our logic for choosing the set of IVs must be reexamined.
In short, low p-values indicate that we need to re-evaluate our set of
IVs.
The intuition of this test rests with information contained in the
residuals from the 2SLS regression. If these residuals can be explained
well using information contained in our IVs (the `ivreg2` command labels
these as excluded instruments), then they really aren't good
instruments, since we need them to be uncorrelated with the error.
Rather than test each excluded IV sequentially, the Sargan approach
jointly tests whether overidentification is a problem or not. If it is,
consider using a subset of IVs or search for new ones.
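
The two steps of the Sargan test can be sketched as follows. Everything here is an assumption made for illustration: the helper names `tsls`, `sargan` and `simulate`, the simulated design, and the parameter `gamma`, which lets the second instrument enter the model error directly so that it is invalid when `gamma` is non-zero. With two instruments and one endogenous regressor the statistic has one degree of freedom, so the p-value can be computed from the standard normal tail using `math.erfc`.

```python
import math
import numpy as np

def tsls(y, X, Z):
    """2SLS coefficients and residuals (residuals use the actual X)."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    b = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
    return b, y - X @ b

def sargan(y, X, Z):
    """N * R^2 from regressing the 2SLS residuals on all columns of Z."""
    _, e = tsls(y, X, Z)                              # Step 1
    fitted = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)    # Step 2
    r2 = 1.0 - np.sum((e - fitted) ** 2) / np.sum((e - e.mean()) ** 2)
    return len(y) * r2

def simulate(rng, n, gamma):
    """Simulated data; gamma != 0 makes instrument z2 invalid."""
    x2, z1, z2 = rng.normal(size=(3, n))
    common = rng.normal(size=n)
    xK = 0.5 + 0.3 * x2 + 0.5 * z1 + 0.4 * z2 + 0.7 * common + rng.normal(size=n)
    eps = 0.7 * common + gamma * z2 + rng.normal(size=n)
    y = 1.0 + 2.0 * x2 + 0.5 * xK + eps
    ones = np.ones(n)
    return y, np.column_stack([ones, x2, xK]), np.column_stack([ones, x2, z1, z2])

def chi2_1_pvalue(stat):
    """Upper-tail p-value of a chi-squared(1) statistic."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

rng = np.random.default_rng(8)
stat_valid = sargan(*simulate(rng, 5_000, gamma=0.0))
stat_invalid = sargan(*simulate(rng, 5_000, gamma=0.5))

print("valid instruments  :", stat_valid, chi2_1_pvalue(stat_valid))
print("one invalid (z2)   :", stat_invalid, chi2_1_pvalue(stat_invalid))
```

In the design with an invalid instrument the statistic is very large and the p-value is essentially zero, which is the signal to re-evaluate the instrument set.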
### Standard Errors
Manual calculations should be avoided as a correction must be made to
the standard errors. The two-step method outlined in the
{ref}`estimation_methods` section should generally not be implemented by hand
since this approach leads to inconsistent estimates of
$\mathbf{\beta}$ and the variance/covariance matrix of the parameters
is also incorrect since it fails to account for the underlying
randomness associated with $\mathbf{\hat{x}}^*_K$.[^4]
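
One piece of the standard error problem can be seen in a short sketch on simulated data (the design and parameter values are assumptions). A by-hand second stage computes its residuals from $\hat{\mathbf{x}}$ instead of the actual $\mathbf{x}$, so the error variance it reports differs from the one based on the actual regressors.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 20_000
x2 = rng.normal(size=n)
zK = rng.normal(size=n)
common = rng.normal(size=n)
xK = 0.5 + 0.3 * x2 + 0.6 * zK + 0.7 * common + rng.normal(size=n)
eps = 0.7 * common + rng.normal(size=n)   # Var(eps) = 0.49 + 1 = 1.49
X = np.column_stack([np.ones(n), x2, xK])
Z = np.column_stack([np.ones(n), x2, zK])
y = X @ np.array([1.0, 2.0, 0.5]) + eps
K = X.shape[1]

# By-hand two-step: fitted regressors, then OLS of y on them
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

e_naive = y - X_hat @ b    # what a manual second-stage OLS would report
e_actual = y - X @ b       # residuals based on the actual regressors

s2_naive = e_naive @ e_naive / (n - K)
s2_actual = e_actual @ e_actual / (n - K)
print("error variance from by-hand second stage:", s2_naive)
print("error variance using actual regressors  :", s2_actual)
```

Here the by-hand second stage overstates the error variance noticeably, so any standard errors built from it would be off as well.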
## Implementation in Stata
This section quickly explores the problem of
endogeneity and how to estimate this class of models in Stata. Recall
that the OLS estimator requires
$$
E(\mathbf{x'\epsilon}) = 0
$$
This code shows how to overcome estimation problems where this
assumption fails but where we can identify an instrument for
implementing instrumental variables regression (IV Regression).
First, let's open up the data in Stata, noting that we are
using a "cross-sectioned" version of Tobias and Koop that focuses on
1983. Load data and summarize:
```{code-cell} ipython3
:tags: ["remove_output"]
# start a connected stata17 session
from pystata import config
config.init('be')
config.set_streaming_output_mode('off')
```
```{code-cell} ipython3
%%stata
webuse set "https://rlhick.people.wm.edu/econ407/data/"
webuse tobias_koop
keep if time==4
sum
```
### First run OLS
If we ignore any potential endogeneity problem we can estimate OLS as
described in the OLS chapter companion. Here are the results from Stata:
```{code-cell} ipython3
%%stata
reg ln_wage pexp pexp2 educ broken_home
```
where education has the elasticity
```{code-cell} ipython3
%%stata
margins, dyex(educ) continuous
```
### Running IV Regression
Suppose we are worried that education is endogenous, that is,
correlated with the population regression errors. This means OLS
estimates of $\beta$ are biased. We hypothesize that the variable
`feduc` is a good instrument, having all the properties described in
detail above.
In Stata, we use this code:
```{code-cell} ipython3
%%stata
ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc)
```
Note that the mean estimate for the elasticity on education has nearly doubled
compared to OLS:
```{code-cell} ipython3
%%stata
margins, dyex(educ) continuous
```
Stata's `ivregress` output with robust standard errors is obtained
from
```{code-cell} ipython3
%%stata
ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), robust
```
### Testing Assumptions
We have more work to do:
1. Test for relevant and strong instruments
2. Test for endogeneity
3. Test for overidentification (not relevant for this example)
In Stata, we issue these commands:
```{code-cell} ipython3
%%stata
estat firststage
```
Note, since the number of instruments is equal to the number of
endogenous variables, we don't have an overidentification problem.
```{code-cell} ipython3
---
tags: [raises-exception]
---
%%stata
estat overid
```
The Python stack trace is irrelevant here and will terrify my students. All the user needs to see is the Stata part of the error:
```
SystemError: no overidentifying restrictions
r(498);
```
These results tell us we have relevant and strong instruments and that
education is likely endogenous.
Here is another error:
```{code-cell} ipython3
---
tags: [raises-exception]
---
%%stata
gen ln_wage = 5
```
Again, the Python stack trace is irrelevant and identical to the previous one. All the user needs to see is the Stata part of the error:
```
SystemError: variable ln_wage already defined
r(110);
```
[^1]: These steps are not correct for the case of more than 1
instrumental variable. However, they are instructive in
understanding the intuition of the Hausman Test in the instrumental
variables context. If you have more than 1 instrumental variable,
you must use the `ivendog` or `hausman` commands in Stata.
[^2]: This equation will exactly replicate the Stata `ivregress` command
(for `2sls`) using the options `vce(unadjusted) small`.
[^3]: This equation will exactly replicate the Stata `ivregress` command
(for `2sls`) using the options `vce(robust) small` defining
$\mathbf{V}$ as we did in the OLS chapter.
[^4]: This is true if the model has more instruments than endogenous variables.