{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "front-proxy",
   "metadata": {},
   "source": [
    "# Instrumental Variables Regression\n",
    "\n",
    "The exogeneity assumption of the previous chapter is a particularly\n",
    "strong assumption. One might think of a number of cases where this would\n",
    "not be expected to hold. Education and earnings is an example. A wage\n",
    "equation that is a function of education may suffer from a problem\n",
    "regarding this assumption if those factors influencing wage and not\n",
    "accounted for in the regression equation are related to educational\n",
    "attainment. In a pure cross section world, unobserved\n",
    "individual-specific factors, for example, may influence both wage and\n",
    "educational attainment. The problem of endogeneity is a very big deal\n",
    "and one you should worry about every time you estimate any type of\n",
    "regression model. To see why, note that our ability to estimate unbiased\n",
    "parameters hinges on\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "    \\mathbf{E(x'\\epsilon)=0}\n",
    "\\end{equation}\n",
    "```\n",
    "If there is in fact correlation between the elements of $\\mathbf{x}$ and\n",
    "$\\mathbf{\\epsilon}$ then our parameter estimates are biased.\n",
    "\n",
    "\n",
    "## A Single Instrumental Variable\n",
    "\n",
    "In this section, we discuss ways of addressing endogeneity in the OLS\n",
    "framework. As before, assume a linear model\n",
    "\n",
    "```{math}\n",
    ":label: end:eq:ivbasic_framework\n",
    "\\begin{equation}\n",
    "    \\mathbf{y}=\\mathbf{x}\\beta+\\epsilon=\\beta_0+\\beta_2 \\mathbf{x_2} + \\beta_3 \\mathbf{x_3} + \\ldots + \\beta_K \\mathbf{x_K} + \\mathbf{\\epsilon}\n",
    "\\end{equation}\n",
    "```\n",
    "\n",
    "where $\\mathbf{x}_j$ is a $N \\times 1$ column vector with data from\n",
    "column $j$. Where the following holds:\n",
    "\n",
    "-   $E(\\mathbf{\\epsilon})=0$\n",
    "-   $E(\\mathbf{x_{j}\\epsilon})=0$ for $j=1,2,\\ldots,K-1$\n",
    "\n",
    "Notice that we are invoking the exogeneity assumption for some of our\n",
    "explanatory variables but not all of them. The explanatory variable\n",
    "$\\mathbf{x_K}$ is potentially endogenous and a failure to deal with this\n",
    "will potentially lead to biased parameter estimates.\n",
    "\n",
    "The method of instrumental variables offers a way of handling this\n",
    "problem. Letting the instrumental variable be denoted as $z_k$, we need\n",
    "for it to have these properties:\n",
    "\n",
    "-   **Assumption 1**: $E(\\mathbf{z'\\epsilon})=0$\n",
    "\n",
    "-   **Assumption 2**: And, for the following linear relationship,\n",
    "\n",
    "    ```{math}\n",
    "\t:label: end:eq:reducedformx\n",
    "    \\begin{equation}\n",
    "          \\mathbf{x_K}=\\delta_0 + \\delta_2 \\mathbf{x_2}+\\ldots+ \\delta_{K-1} \\mathbf{x_{K-1}}+\\theta_K \\mathbf{z_K} + \\mathbf{r}\n",
    "    \\end{equation}\n",
    "    ```\n",
    "\n",
    "where $E(\\mathbf{r_k})=0$ and is uncorrelated with the right hand side\n",
    "variables. We need for $\\theta_1$ to be non-zero, for $\\mathbf{z_K}$ to\n",
    "be valid instrument. This is something like saying $\\mathbf{z_K}$ is\n",
    "partially correlated with $\\mathbf{x_K}$ after netting out the effects\n",
    "of $\\mathbf{x_1,\\ldots,x_{K-1}}$.\n",
    "\n",
    "When $\\mathbf{z_K}$ satisfies these conditions, it is called an\n",
    "**Instrumental Variable (IV)** candidate for $\\mathbf{x_K}$. The next\n",
    "step is understanding how one uses this structure for estimating the\n",
    "parameters of interest ($\\mathbf{\\beta}$). One option is to drop our\n",
    "endogenous variable ($\\mathbf{x}_k$) and replace it with our IV variable\n",
    "($\\mathbf{z}_k$), and then estimate an OLS model.\n",
    "\n",
    "### Using the IV\n",
    "\n",
    "Define the matrix $\\mathbf{z}$ as\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "\\mathbf{z} = \\begin{bmatrix} 1 & x_{12} & \\ldots & x_{1,K-1} & z_{1,K} \\\\\n",
    "                                 \\vdots  & \\vdots & \\vdots &   \\vdots  &  \\vdots         \\\\ \n",
    "                                 1 & x_{i2} & \\ldots & x_{i,K-1} & z_{i,K} \\\\   \n",
    "                                 \\vdots  & \\vdots & \\vdots &   \\vdots  &   \\vdots        \\\\ \n",
    "                                 1 & x_{N2} & \\ldots & x_{N,K-1} & z_{N,K} \\\\            \n",
    "             \\end{bmatrix} \n",
    "            = \\begin{bmatrix} \\mathbf{1} & \\mathbf{x}_2  &\\ldots & \\mathbf{x}_{K-1} & \\mathbf{z}_K \\end{bmatrix}\n",
    "\\end{equation}\n",
    "```\n",
    "Notice, in doing this we have done a drop-in-replacement of\n",
    "$\\mathbf{x}_k$ with $\\mathbf{z}_k$\n",
    "\n",
    "So how do we use the IV (found in $\\mathbf{z}$) to estimate $\\beta$? One\n",
    "idea is to define $\\mathbf{b}^{iv}$ as $\\mathbf{(z'z)^{-1}z'y}$. Since\n",
    "we have already established that $\\mathbf{z}_k$ is exogenous and\n",
    "correlated with $\\mathbf{x}_k$ even after controlling for all of the\n",
    "other exogenous information at hand. What are the properties of such an\n",
    "estimator? To examine this, substitute the \\\"Relevancy equation\\\" into\n",
    "our original estimating equation:\n",
    "\n",
    "```{math}\n",
    "\\begin{eqnarray}\n",
    "\\mathbf{y}&=&\\beta_0 + \\beta_2 \\mathbf{x}_2 + \\ldots + \\beta_{K-1} \\mathbf{x}_{K-1} \\nonumber \\\\ \n",
    "             & &+ \\beta_K (\\delta_1 + \\delta_2 \\mathbf{x}_2 + \\ldots + \\delta_{K-1} \\mathbf{x}_{K-1} + \\theta_K \\mathbf{z}_K + \\mathbf{r}) + \\epsilon           \\\\\n",
    "           &=& (\\beta_0 + \\beta_K \\delta_1) + (\\beta_2 + \\beta_K \\delta_2) \\mathbf{x}_2 + \\ldots +  (\\beta_{K-1} + \\beta_{K} \\delta_{K-1}) \\mathbf{x}_{K-1} \\nonumber  \\\\\n",
    "           & &+ (\\beta_K \\theta_K)\\mathbf{z}_k + (\\beta_K \\mathbf{r}_k+\\epsilon)\\\\\n",
    "           &=&\\alpha_0 + \\alpha_2 \\mathbf{x}_2 + \\ldots + \\alpha_{K-1} \\mathbf{x}_{K-1} + \\alpha_K \\mathbf{z}_k + \\mathbf{v} \n",
    "\\end{eqnarray}\n",
    "```\n",
    "If we run this regression, we obtain estimates defined as\n",
    "$\\mathbf{a} = \\mathbf{(z'z)^{-1}z'y}$. From this we can see that by\n",
    "substituting our Instrumental Variable in as a proxy variable in for\n",
    "$\\mathbf{x}_K$ and recovering parameters $\\mathbf{a}$ will give you\n",
    "estimates where:\n",
    "\n",
    "-   $\\alpha_k \\neq \\beta_k$ for **every** parameter you estimate, not\n",
    "    just the endogenous one, $\\beta_K$. For example,\n",
    "    $\\alpha_0 = \\beta_0 + \\beta_K \\delta_1 \\neq \\beta_0$\n",
    "-   The variance/covariance matrix of the errors ($\\mathbf{v}$) is not\n",
    "    $~N(0,\\sigma^2\\mathbf{I})$\n",
    "\n",
    "Consequently, for $\\mathbf{a} = \\mathbf{(z'z)^{-1}z'y}$,\n",
    "\n",
    "-   Given that $E[\\mathbf{a}] \\neq \\beta$, so this is not a good IV estimator.\n",
    "-   Given an estimate for $\\mathbf{a}$, we can\\'t solve for the $K$\n",
    "    estimates for $\\beta$ because we have $K$ equations in $2\\times K$\n",
    "    unknowns.\n",
    "\n",
    "A better way of defining our instrumental variable estimator follows\n",
    "from our assumption of no correlation between $\\mathbf{z}$ and\n",
    "$\\epsilon$:\n",
    "\n",
    "```{math}\n",
    "\\begin{eqnarray}\n",
    "0 &=& E[\\mathbf{z'\\epsilon}] \\\\ \n",
    "  & & E[\\mathbf{z'(y-x\\beta)}]\n",
    "\\end{eqnarray}\n",
    "```\n",
    "from our assumptions above. Simplifying, gives\n",
    "$\\mathbf{z'y=z'x \\beta + z'\\epsilon}$ and taking expectations results in\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "    \\mathbf{E(z'y)=E(z'x \\beta)}\n",
    "\\end{equation}\n",
    "```\n",
    "This gives us $K$ equations in $K$ unknowns, which can be solved.\n",
    "Simplifying further, we can write $\\mathbf{\\beta}$ as\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "    \\mathbf{\\beta=[E(z'x)]^{-1}E[z'y]}\n",
    "\\end{equation}\n",
    "```\n",
    "where both expectations terms can be consistently estimated using a\n",
    "random sample on ($\\mathbf{x}$,$\\mathbf{y}$ and $\\mathbf{z_K}$). A very\n",
    "important point to remember is that this systems of equation is solvable\n",
    "only if\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "    rank(E[\\mathbf{z'x}])=K\n",
    "\\end{equation}\n",
    "```\n",
    "With this, identification is achieved since we can\\'t invert a square\n",
    "matrix not having full rank. In practice, this estimator can be\n",
    "implemented given a random sample from the population as\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "    \\mathbf{b}^{IV}=\\mathbf{(z'x)^{-1} z'y}=\\mathbf{(\\hat{x}'\\hat{x})^{-1} \\hat{x}'y}\n",
    "\\end{equation}\n",
    "```\n",
    "where $\\hat{\\mathbf{x}}$ is equal to\n",
    "$\\mathbf{z}(\\mathbf{z'z})^{-1}\\mathbf{z'x}$, or the predicted value of\n",
    "$\\mathbf{x}$ given our instruments $\\mathbf{z}$.\n",
    "\n",
    "It is important to note that $\\mathbf{b}^{IV}$ is a consistent estimator\n",
    "for $\\beta$, so we rely on large sample properties.\n",
    "\n",
    "### Testing for Suitable Instruments\n",
    "\n",
    "In the preceding section, we saw that two assumptions were necessary for\n",
    "having a suitable instrument:\n",
    "\n",
    "-   **Assumption 1**: Orthogonality of the instrument and model errors\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "E(\\mathbf{z'\\epsilon)}=0\n",
    "\\end{equation}\n",
    "```\n",
    "-   **Assumption 2**: Partial correlation estimated in Equation\n",
    "    {eq}`end:eq:reducedformx` are non-zero\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "\\theta_K \\ne 0\n",
    "\\end{equation}\n",
    "```\n",
    "Ideally, we would like to be able to test our assumptions to ensure that\n",
    "our candidate instrumental variable meets these conditions. If our\n",
    "instrumental variable does not conform with these assumptions, our\n",
    "$\\mathbf{\\beta}$\\'s will be biased. Assumption 1 can\\'t be formally\n",
    "tested since the true model errors, $\\mathbf{\\epsilon}$ are not\n",
    "observed. However, the second condition can and should be tested by\n",
    "estimating equation {eq}`end:eq:reducedformx` and conducting a\n",
    "t-test over the $\\theta_k$ parameter. Studies have shown that lower\n",
    "p-values accord with better instruments, as would be expected.\n",
    "\n",
    "### Testing for endogeneity\n",
    "\n",
    "The basic test we consider here starts with the observation that if\n",
    "there is endogeneity then $b^{ols}$ is biased. If we have tested for and\n",
    "identified a useful instrument(s), and estimated an IV model, then we\n",
    "have the following hypothesis to test\n",
    "\n",
    "-   $H_0: \\mathbf{b_{OLS}-b^{iv}}=0$\n",
    "-   $H_1: \\mathbf{b_{OLS}-b^{iv}}\\ne 0$\n",
    "\n",
    "So, if our variable $\\mathbf{x_k}$ is not endogenous, the difference\n",
    "between the two estimators should be attributed to sampling error of\n",
    "$\\mathbf{\\beta}$ only. If on the other hand, we do have an endogenous\n",
    "regressor and instrument for it, the bias should show up in this\n",
    "difference and we would reject the null hypothesis in favor of the 2SLS\n",
    "technique. The test we will use is called the Hausman test and can be\n",
    "applied in a wide range of problems- and will be used later in the\n",
    "course- well beyond the endogeneity case considered here.\n",
    "\n",
    "To implement this test in the instrumental variable context, we can\n",
    "follow an approach for the Hausman test outlined by Wu applicable to the\n",
    "instrumental variable case only. [^1] This test can be performed\n",
    "manually by\n",
    "\n",
    "-   **Step 1**: Regress the endogenous variable ($\\mathbf{x}_K$)\n",
    "    variable on all exogenous variables both $\\mathbf{x}_{-K}$ and\n",
    "    $\\mathbf{z}$ and recover the estimated residuals $\\hat{\\mathbf{u}}$\n",
    "    from the following regression:\n",
    "\n",
    "    ```{math}\n",
    "    \\begin{equation}\n",
    "           \\mathbf{\\mathbf{x}_K=\\mathbf{x}_{-K}\\delta_{-K}+\\mathbf{z}\\theta+\\mathbf{u}}\n",
    "    \\end{equation}\n",
    "    ```\n",
    "\n",
    "-   **Step 2**: Regress the dependent variable in the regression\n",
    "    ($\\mathbf{y}$) on the full set of *original* independent variables,\n",
    "    $\\mathbf{x}$.\n",
    "\n",
    "    ```{math}\n",
    "    \\begin{equation}\n",
    "           \\mathbf{y=x\\beta+\\delta\\hat{\\mathbf{u}}+\\mathbf{\\gamma}}\n",
    "    \\end{equation}\n",
    "    ```\n",
    "\n",
    "-   **\\[Step 3:\\]**: Based on the preceding step, test the null\n",
    "    hypothesis that $\\delta=0$ or that the regressor is exogenous.\n",
    "\n",
    "### Standard Errors\n",
    "\n",
    "In the IV framework, the variance-covariance matrix is [^2]\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "     Var[\\mathbf{b}^{IV}]=\\sigma^2 \\left( \\mathbf{(z'x)^{-1} z'z (z'x)^{-1\\prime}} \\right)\n",
    "\\end{equation}\n",
    "```\n",
    "The robust version of this can be calculated with a similar definition\n",
    "of $\\hat{V}$ as used in the OLS robust standard error section [^3]:\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    " Var^{robust}[\\mathbf{b}^{IV}]=\\mathbf{(z'x)^{-1} z'V z (z'x)^{-1\\prime}}\n",
    "\\end{equation}\n",
    "```\n",
    "## Multiple Instrumental Variables\n",
    "\n",
    "In the section above, we restricted our attention to the case of one and\n",
    "only one instrumental variable $\\mathbf{z_k}$ for the correlated\n",
    "variable $\\mathbf{x_k}$. What if now, we allow there to be $M$\n",
    "instruments for $\\mathbf{x_k}$, such that\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "    z=\\begin{bmatrix} \\mathbf{ 1} & \\mathbf{x_2} & \\ldots & \\mathbf{x_{K-1}} & \\mathbf{z_1} & \\mathbf{z_2} & \\ldots & \\mathbf{z_M} \\end{bmatrix}\n",
    "\\end{equation}\n",
    "```\n",
    "where $\\mathbf{z}$ and each $\\mathbf{z_h}$ is of dimension\n",
    "$N \\times K+M$ and $N \\times 1$, respectively. We maintain the\n",
    "assumption that\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "    E(\\mathbf{z_h'\\epsilon})=0 \\hspace{.1in} h=1,2,\\ldots,M\n",
    "\\end{equation}\n",
    "```\n",
    "As above, consider the condition ($E(\\mathbf{z'\\epsilon})=0$) and the\n",
    "implications for our IV estimator:\n",
    "\n",
    "```{math}\n",
    ":label: end:eq:moment_iv\n",
    "\\begin{equation}\n",
    "     \\mathbf{z'e^{iv}} = 0\n",
    "\\end{equation}\n",
    "```\n",
    "\n",
    "Notice the dimensionality of this condition: $\\mathbf{z}$ is of\n",
    "dimension $N \\times (K+M-1)$, whereas $\\mathbf{e}^{iv}$ is $N \\times 1$.\n",
    "The product will be of dimension $(K+M-1) \\times 1$. But we only have\n",
    "$K$ parameters to estimate, so we have more equations than unknowns.\n",
    "\n",
    "Should we choose one of the z\\'s, all of the z\\'s, or a subset of the\n",
    "z\\'s? If we choose more than one of the z\\'s, can we continue to use IV\n",
    "regression from the previous section? The answer is no. We need to\n",
    "proceed in one of two ways.\n",
    "\n",
    "(estimation_methods)=\n",
    "### Estimation Methods\n",
    "\n",
    "An important point is that the distinctions outlined below for the\n",
    "various estimation methods only exist when the number of instruments\n",
    "exceeds the number of endogenous variables. If your model is exactly\n",
    "identified (as in the preceding section), it is sufficient to focus on\n",
    "2SLS.\n",
    "\n",
    "1.  Two Staged Least Squares\n",
    "\n",
    "    In a similar way to equation {eq}`end:eq:reducedformx`,\n",
    "    write the linear function of $\\mathbf{x}_k$ onto $\\mathbf{z}$ as\n",
    "\n",
    "    ```{math}\n",
    "    \\begin{equation}\n",
    "        \\mathbf{x}_k=\\delta_0 + \\delta_1 \\mathbf{x}_1 + \\delta_2 \\mathbf{x}_2 + \\ldots + \\delta_{K-1}\\mathbf{x}_{K-1} + \\theta_1 \\mathbf{z}_1 + \\ldots + \\theta_{M} \\mathbf{z}_M + r_K\n",
    "    \\end{equation}\n",
    "    ```\n",
    "    where $\\mathbf{r_K}$ is of mean zero an uncorrelated with all\n",
    "    independent variables. Since any linear combination of $\\mathbf{z}$\n",
    "    is uncorrelated with u (from the assumption above),\n",
    "\n",
    "    ```{math}\n",
    "    \\begin{equation}\n",
    "        \\mathbf{x^*_K} \\equiv \\delta_0 + \\delta_2 \\mathbf{x_2} + \\delta_3 \\mathbf{x_3} + \\ldots + \\delta_{K-1} \\mathbf{x_{K-1}} + \\theta_1 \\mathbf{z_1} + \\ldots + \\theta_{M} \\mathbf{z_M}\n",
    "    \\end{equation}\n",
    "    ```\n",
    "    is also uncorrelated with $\\mathbf{\\epsilon}$. Unfortunately,\n",
    "    neither $\\mathbf{x}^*_K$ nor $\\mathbf{\\delta}$ is known. We can use\n",
    "    a first stage estimator for $\\mathbf{x}^*_K$, called\n",
    "    $\\mathbf{\\hat{x}}^*_K$ that is written as\n",
    "\n",
    "    ```{math}\n",
    "    \\begin{equation}\n",
    "        \\mathbf{\\hat{x}}^*_K=\\hat{\\delta_0} + \\hat{\\delta}_2 \\mathbf{x_2} + \\hat{\\delta}_3 \\mathbf{x_3} + \\ldots + \\hat{\\delta}_{K-1} \\mathbf{x_{K-1}} + \\hat {\\theta}_1 \\mathbf{z_1} + \\ldots + \\hat {\\theta}_{M} \\mathbf{z_M}\n",
    "    \\end{equation}\n",
    "    ```\n",
    "    by running an OLS regression. Denoting\n",
    "    $\\hat{\\mathbf{x}}=\\begin{bmatrix} 1 & \\mathbf{x_2} & \\ldots & \\mathbf{x_{K-1}} &\\mathbf{\\hat{x}}^*_K \\end{bmatrix}$,\n",
    "    the two stage least squares estimator (2SLS) is\n",
    "\n",
    "    ```{math}\n",
    "    \\begin{equation}\n",
    "        \\hat{\\mathbf{\\beta}}^{2SLS}=(\\hat{\\mathbf{x}}'\\hat{\\mathbf{x}})^{-1}\\hat{\\mathbf{x}}'\\mathbf{y}\n",
    "    \\end{equation}\n",
    "    ```\n",
    "\n",
    "2.  Method of Moments\n",
    "\n",
    "    A better way to proceed (and the default method employed by stata\\'s\n",
    "    `ivreg` and `ivreg2` commands) is to minimize the condition outlined\n",
    "    above in Equation {eq}`end:eq:moment_iv`\n",
    "\n",
    "    ```{math}\n",
    "    \\begin{equation}\n",
    "            \\underset{b^{IV}}{min} \\hspace{.05in} \\frac{\\mathbf{e'z}W\\mathbf{z'e}}{N}\n",
    "    \\end{equation}\n",
    "    ```\n",
    "    which is a scalar value. If $\\mathbf{W=I}_{N \\times N}$, then the\n",
    "    GMM estimator and the 2SLS estimator yield the same result.\n",
    "    Consequently, GMM is nearly always the preferred estimator. Stata\n",
    "    default method for defining W uses a heteroskedastic error approach\n",
    "    constructing errors for each individual from the 2SLS model. This is\n",
    "    much like our $V$ matrix we defined for estimating robust standard\n",
    "    errors in the OLS chapter.\n",
    "\n",
    "    While it is almost always possible to find a $\\mathbf{b}^{IV}$ that\n",
    "    minimizes this condition, it does not impose the orthogonality\n",
    "    condition for each column of $\\mathbf{z}$. The possibility that some\n",
    "    of our instruments are correlated with our errors even after trying\n",
    "    to minimize the condition above opens the door to a problem called\n",
    "    called **overidentification**.\n",
    "\n",
    "3.  LIML and 3SLS\n",
    "\n",
    "    There are also two additional techniques one can use for estimating\n",
    "    $\\mathbf{b}^{IV}$. One is a maximum likelihood technique called\n",
    "    limited information maximum likelihood (LIML) and another is termed\n",
    "    Three Staged Least Squares (3SLS). We won\\'t be investigating these\n",
    "    further in this class, but they are options in stata\\'s `ivregress`\n",
    "    command. One quick point: for small samples, LIML is often the best\n",
    "    approach.\n",
    "\n",
    "### Testing for Strong and Relevant Instruments\n",
    "\n",
    "Testing for the suitability of instrument is also important in this\n",
    "context and test the null hypothesis\n",
    "\n",
    "```{math}\n",
    "\\begin{equation}\n",
    "    H_0=\\theta_1=\\theta_2=\\ldots\\theta_M=0\n",
    "\\end{equation}\n",
    "```\n",
    "using an F test with $(M,N-M-K-1)$ degrees of freedom.\n",
    "\n",
    "### Overidentification in IV regression\n",
    "\n",
    "Recall that in the IV regression model, we might have as many as $M$\n",
    "instrumental variables for $K_{end}$ endogenous regressors. In our\n",
    "example, $K_{end}=1$, but in a general 2SLS setting we need for\n",
    "$K_{end} \\le M$ in order to identify $\\beta$. However, consider a\n",
    "situation where $K_{end} < M$. By including a myriad of instruments, we\n",
    "might be introducing bias in our estimate of $\\beta$ because some subset\n",
    "of our IV\\'s, in fact do not satisfy the important requirement that\n",
    "$E(\\mathbf{z'\\epsilon})=0$. In effect, we can test for whether a subset\n",
    "of our IV\\'s would be a candidate IV set by avoiding those instrumental\n",
    "variables that themselves may be correlated with the model errors. Under\n",
    "i.i.d. errors, this test is called the Sargan test.\n",
    "\n",
    "Fortunately, the test is easy to implement.\n",
    "\n",
    "-   \\[Step 1:\\] Recover the estimated residuals from the 2SLS\n",
    "    regression. I label this vector as $\\mathbf{e}_{2sls}$.\n",
    "-   \\[Step 2:\\] Regress $\\mathbf{e}_{2sls}$ on the full set of exogenous\n",
    "    instruments, $\\mathbf{x_{-K}}$ and $\\mathbf{z}$. Make sure to omit\n",
    "    the endogenous variable.\n",
    "\n",
    "The test statistic, $N \\times R^2$, where $R^2$ is recovered from this\n",
    "regression, is distributed with degrees of freedom equal to the number\n",
    "of instruments in the 2SLS regression minus the number of endogenous\n",
    "variables in Step 1. Fortunately for us, the ivreg2 command\n",
    "automatically reports the Sargan statistic for overidentification. If we\n",
    "reject the null hypothesis, then we have a vector of instrumental\n",
    "variables that is overidentified and our logic for choosing the set of\n",
    "IV\\'s must be reexamined. Low p-values indicate that we need to\n",
    "re-evaluate our set of IV\\'s.\n",
    "\n",
    "The intuition of this test rests with information contained in the error\n",
    "structure from the 2sls. If these errors can be explained well using\n",
    "information contained in our IV\\'s (the ivreg2 command labels these as\n",
    "excluded instruments), then they really aren\\'t good instruments, since\n",
    "we need them to be uncorrelated with the error. Rather than test each\n",
    "excluded IV sequentially, the Sargan approach jointly tests whether\n",
    "overidentification is a problem or not. If it is, consider using a\n",
    "subset of IV\\'s or search for new ones.\n",
    "\n",
    "### Standard Errors\n",
    "\n",
    "Manual calculations should be avoided as a correction must be made to\n",
    "the standard errors. The two steps method outlined in the\n",
    "{ref}`estimation_methods` section should generally not be implemented by hand\n",
    "since this approach leads to inconsistent estimates of\n",
    "$\\mathbf{\\beta}$ and the variance\\/covariance matrix of the parameters\n",
    "is also incorrect since it fails to account for the underlying\n",
    "randomness associated with $\\mathbf{\\hat{x}}^*_K$.[^4]\n",
    "\n",
    "## Implementation in Stata\n",
    "\n",
    "This section on endogeneity quickly explores the problem of\n",
    "endogeneity and how to estimate this class of models in Stata. Recall\n",
    "that the OLS estimator requires \n",
    "\n",
    "$$\n",
    "E(\\mathbf{x'\\epsilon}) = 0\n",
    "$$\n",
    "\n",
    "This code shows how to overcome estimation problems where this\n",
    "assumption fails but where we can identify an instrument for\n",
    "implementing instrumental variables regression (IV Regression). We\n",
    "demonstrate the uses of Stata for IV regression problems.\n",
    "First, let\\'s open up the data in Stata noting that we are\n",
    "using a \\\"Cross-sectioned\\\" version of Tobias and Koop that focuses on\n",
    "1983. Load data and summarize:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "current-maldives",
   "metadata": {
    "tags": [
     "remove_output"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "  ___  ____  ____  ____  ____ ©\n",
      " /__    /   ____/   /   ____/      17.0\n",
      "___/   /   /___/   /   /___/       BE—Basic Edition\n",
      "\n",
      " Statistics and Data Science       Copyright 1985-2021 StataCorp LLC\n",
      "                                   StataCorp\n",
      "                                   4905 Lakeway Drive\n",
      "                                   College Station, Texas 77845 USA\n",
      "                                   800-STATA-PC        https://www.stata.com\n",
      "                                   979-696-4600        stata@stata.com\n",
      "\n",
      "Stata license: Single-user  perpetual\n",
      "Serial number: 301706306291\n",
      "  Licensed to: Rob Hicks\n",
      "               College of William and Mary\n",
      "\n",
      "Notes:\n",
      "      1. Unicode is supported; see help unicode_advice.\n"
     ]
    }
   ],
   "source": [
    "# start a connected stata17 session\n",
    "from pystata import config\n",
    "config.init('be')\n",
    "config.set_streaming_output_mode('off')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "weighted-horse",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      ". webuse set \"https://rlhick.people.wm.edu/econ407/data/\"\n",
      "(prefix now \"https://rlhick.people.wm.edu/econ407/data\")\n",
      "\n",
      ". webuse tobias_koop\n",
      "\n",
      ". keep if time==4\n",
      "(16,885 observations deleted)\n",
      "\n",
      ". sum\n",
      "\n",
      "    Variable |        Obs        Mean    Std. dev.       Min        Max\n",
      "-------------+---------------------------------------------------------\n",
      "          id |      1,034    1090.952    634.8917          4       2177\n",
      "        educ |      1,034    12.27466    1.566838          9         19\n",
      "     ln_wage |      1,034    2.138259    .4662805        .42       3.59\n",
      "        pexp |      1,034     4.81528    2.190298          0         12\n",
      "        time |      1,034           4           0          4          4\n",
      "-------------+---------------------------------------------------------\n",
      "     ability |      1,034    .0165957    .9209635      -3.14       1.89\n",
      "       meduc |      1,034    11.40329    3.027277          0         20\n",
      "       feduc |      1,034    11.58511    3.735833          0         20\n",
      " broken_home |      1,034    .1692456    .3751502          0          1\n",
      "    siblings |      1,034    3.200193    2.126575          0         15\n",
      "-------------+---------------------------------------------------------\n",
      "       pexp2 |      1,034    27.97969    22.59879          0        144\n",
      "\n",
      ". \n"
     ]
    }
   ],
   "source": [
    "%%stata\n",
    "webuse set \"https://rlhick.people.wm.edu/econ407/data/\"\n",
    "webuse tobias_koop\n",
    "keep if time==4\n",
    "sum"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "continued-alfred",
   "metadata": {},
   "source": [
    "### First run OLS\n",
    "\n",
    "If we ignore any potential endogeneity problem we can estimate OLS as\n",
    "described in the OLS chapter companion. Here are the results from stata:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "absent-compact",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "      Source |       SS           df       MS      Number of obs   =     1,034\n",
      "-------------+----------------------------------   F(4, 1029)      =     51.36\n",
      "       Model |  37.3778146         4  9.34445366   Prob > F        =    0.0000\n",
      "    Residual |   187.21445     1,029  .181938241   R-squared       =    0.1664\n",
      "-------------+----------------------------------   Adj R-squared   =    0.1632\n",
      "       Total |  224.592265     1,033  .217417488   Root MSE        =    .42654\n",
      "\n",
      "------------------------------------------------------------------------------\n",
      "     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]\n",
      "-------------+----------------------------------------------------------------\n",
      "        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033\n",
      "       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336\n",
      "        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014\n",
      " broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488\n",
      "       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408\n",
      "------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "%%stata\n",
    "reg ln_wage pexp pexp2 educ broken_home"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "colonial-marketing",
   "metadata": {},
   "source": [
    "where education, has the elasticity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "behavioral-parallel",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Average marginal effects                                 Number of obs = 1,034\n",
      "Model VCE: OLS\n",
      "\n",
      "Expression: Linear prediction, predict()\n",
      "dy/ex wrt:  educ\n",
      "\n",
      "------------------------------------------------------------------------------\n",
      "             |            Delta-method\n",
      "             |      dy/ex   std. err.      t    P>|t|     [95% conf. interval]\n",
      "-------------+----------------------------------------------------------------\n",
      "        educ |   1.046691   .1140274     9.18   0.000     .8229385    1.270444\n",
      "------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "%%stata\n",
    "margins, dyex(educ) continuous"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "developed-offering",
   "metadata": {},
   "source": [
    "### Running IV Regression\n",
    "\n",
    "Suppose we are worried that education is endogenous. That is, it is\n",
    "correlated with the population regression errors. This means OLS\n",
    "estimates of $\\beta$ are biased. We hypothesize that the variable\n",
    "`feduc` is a good instrument having all the properties we describe in\n",
    "detail in the notes document.\n",
    "\n",
    "In stata, we use this code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "acoustic-leisure",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Instrumental variables 2SLS regression            Number of obs   =      1,034\n",
      "                                                  Wald chi2(4)    =     138.19\n",
      "                                                  Prob > chi2     =     0.0000\n",
      "                                                  R-squared       =     0.1277\n",
      "                                                  Root MSE        =     .43528\n",
      "\n",
      "------------------------------------------------------------------------------\n",
      "     ln_wage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]\n",
      "-------------+----------------------------------------------------------------\n",
      "        educ |   .1495027   .0320009     4.67   0.000     .0867821    .2122233\n",
      "        pexp |    .214752   .0246553     8.71   0.000     .1664285    .2630755\n",
      "       pexp2 |  -.0117453   .0023508    -5.00   0.000    -.0163529   -.0071377\n",
      " broken_home |   .0244713   .0397189     0.62   0.538    -.0533763     .102319\n",
      "       _cons |  -.4064389   .4356072    -0.93   0.351    -1.260213    .4473354\n",
      "------------------------------------------------------------------------------\n",
      "Instrumented: educ\n",
      " Instruments: pexp pexp2 broken_home feduc\n"
     ]
    }
   ],
   "source": [
    "%%stata\n",
    "ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc) "
   ]
  },
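  {
   "cell_type": "markdown",
   "id": "manual-two-stage",
   "metadata": {},
   "source": [
    "The point estimates above can also be reproduced by hand, which helps show\n",
    "what `ivregress 2sls` is doing. The following is a minimal sketch, assuming\n",
    "the same data are in memory and that no observations are lost to missing\n",
    "values, so both stages use the same sample. The variable name `educ_hat` is\n",
    "a hypothetical name introduced only for this illustration.\n",
    "\n",
    "```stata\n",
    "* First stage: regress the endogenous variable on the instrument\n",
    "* and all of the exogenous regressors\n",
    "regress educ feduc pexp pexp2 broken_home\n",
    "\n",
    "* Fitted values of education (educ_hat is a hypothetical name)\n",
    "predict educ_hat, xb\n",
    "\n",
    "* Second stage: replace educ with its fitted values.\n",
    "* The coefficients match 2SLS, but the standard errors do not.\n",
    "regress ln_wage educ_hat pexp pexp2 broken_home\n",
    "```\n",
    "\n",
    "The standard errors from the second regression are not valid because they\n",
    "ignore that `educ_hat` is itself estimated, so `ivregress` should be used for\n",
    "inference."
   ]
  },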
  {
   "cell_type": "markdown",
   "id": "homeless-relay",
   "metadata": {},
   "source": [
    "Note that the IV estimate of the elasticity with respect to education is\n",
    "nearly double the OLS estimate (about 1.84 versus 1.05):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "celtic-southwest",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Average marginal effects                                 Number of obs = 1,034\n",
      "Model VCE: Unadjusted\n",
      "\n",
      "Expression: Linear prediction, predict()\n",
      "dy/ex wrt:  educ\n",
      "\n",
      "------------------------------------------------------------------------------\n",
      "             |            Delta-method\n",
      "             |      dy/ex   std. err.      z    P>|z|     [95% conf. interval]\n",
      "-------------+----------------------------------------------------------------\n",
      "        educ |   1.835095   .3928002     4.67   0.000     1.065221     2.60497\n",
      "------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "%%stata\n",
    "margins, dyex(educ) continuous"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "unnecessary-links",
   "metadata": {},
   "source": [
    "Stata's `ivregress` output with heteroskedasticity-robust standard errors is\n",
    "obtained from:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "liked-duration",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Instrumental variables 2SLS regression            Number of obs   =      1,034\n",
      "                                                  Wald chi2(4)    =     150.52\n",
      "                                                  Prob > chi2     =     0.0000\n",
      "                                                  R-squared       =     0.1277\n",
      "                                                  Root MSE        =     .43528\n",
      "\n",
      "------------------------------------------------------------------------------\n",
      "             |               Robust\n",
      "     ln_wage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]\n",
      "-------------+----------------------------------------------------------------\n",
      "        educ |   .1495027   .0329085     4.54   0.000     .0850033    .2140021\n",
      "        pexp |    .214752   .0238629     9.00   0.000     .1679815    .2615225\n",
      "       pexp2 |  -.0117453   .0023595    -4.98   0.000    -.0163698   -.0071208\n",
      " broken_home |   .0244713   .0335032     0.73   0.465    -.0411937    .0901364\n",
      "       _cons |  -.4064389   .4404503    -0.92   0.356    -1.269706    .4568278\n",
      "------------------------------------------------------------------------------\n",
      "Instrumented: educ\n",
      " Instruments: pexp pexp2 broken_home feduc\n"
     ]
    }
   ],
   "source": [
    "%%stata\n",
    "ivregress 2sls ln_wage pexp pexp2 broken_home (educ=feduc), robust"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "acceptable-rally",
   "metadata": {},
   "source": [
    "### Testing Assumptions\n",
    "\n",
    "We have more work to do:\n",
    "\n",
    "1.  Test for relevant and strong instruments\n",
    "2.  Test for endogeneity\n",
    "3.  Test for overidentification (not relevant for this example)\n",
    "\n",
    "In Stata, we issue these commands:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "arranged-engineering",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "  First-stage regression summary statistics\n",
      "  --------------------------------------------------------------------------\n",
      "               |            Adjusted      Partial       Robust\n",
      "      Variable |   R-sq.       R-sq.        R-sq.     F(1,1029)   Prob > F\n",
      "  -------------+------------------------------------------------------------\n",
      "          educ |  0.2416      0.2387       0.0878       80.2589    0.0000\n",
      "  --------------------------------------------------------------------------\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%%stata\n",
    "estat firststage"
   ]
  },
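  {
   "cell_type": "markdown",
   "id": "endogeneity-sketch",
   "metadata": {},
   "source": [
    "The endogeneity test from the list above can be sketched as follows, assuming\n",
    "the `ivregress 2sls` results are still the active estimates. It uses Stata's\n",
    "`estat endogenous` postestimation command. The null hypothesis is that `educ`\n",
    "is exogenous, so a small p-value is evidence of endogeneity.\n",
    "\n",
    "```stata\n",
    "* Test H0: educ is exogenous, using the most recent ivregress results\n",
    "estat endogenous\n",
    "```\n",
    "\n",
    "With an unadjusted VCE this reports the Durbin and Wu-Hausman tests. After a\n",
    "`robust` fit it reports robust score and regression-based tests instead."
   ]
  },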
  {
   "cell_type": "markdown",
   "id": "noble-punch",
   "metadata": {},
   "source": [
    "Note that because the number of excluded instruments equals the number of\n",
    "endogenous variables, the model is exactly identified. There are no\n",
    "overidentifying restrictions to test, so `estat overid` returns an error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "wrong-operation",
   "metadata": {
    "tags": [
     "raises-exception"
    ]
   },
   "outputs": [
    {
     "ename": "SystemError",
     "evalue": "no overidentifying restrictions\nr(498);\n",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mSystemError\u001b[0m                               Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-9-d841baac8b2d>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_cell_magic\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'stata'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m''\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'estat overid\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/interactiveshell.py\u001b[0m in \u001b[0;36mrun_cell_magic\u001b[0;34m(self, magic_name, line, cell)\u001b[0m\n\u001b[1;32m   2397\u001b[0m             \u001b[0;32mwith\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuiltin_trap\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2398\u001b[0m                 \u001b[0margs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mmagic_arg_s\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcell\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2399\u001b[0;31m                 \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2400\u001b[0m             \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2401\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m<decorator-gen-117>\u001b[0m in \u001b[0;36mstata\u001b[0;34m(self, line, cell, local_ns)\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/magic.py\u001b[0m in \u001b[0;36m<lambda>\u001b[0;34m(f, *a, **k)\u001b[0m\n\u001b[1;32m    185\u001b[0m     \u001b[0;31m# but it's overkill for just that one bit of state.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    186\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mmagic_deco\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 187\u001b[0;31m         \u001b[0mcall\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mlambda\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    188\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    189\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mcallable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/stata/utilities/pystata/ipython/stpymagic.py\u001b[0m in \u001b[0;36mstata\u001b[0;34m(self, line, cell, local_ns)\u001b[0m\n\u001b[1;32m    274\u001b[0m             \u001b[0m_stata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcell\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mquietly\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minline\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_config\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstconfig\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'grshow'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    275\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 276\u001b[0;31m             \u001b[0m_stata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcell\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mquietly\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minline\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_config\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstconfig\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'grshow'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    277\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    278\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;34m'-gw'\u001b[0m \u001b[0;32min\u001b[0m \u001b[0margs\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;34m'-gh'\u001b[0m \u001b[0;32min\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/stata/utilities/pystata/stata.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(cmd, quietly, echo, inline)\u001b[0m\n\u001b[1;32m    299\u001b[0m                 \u001b[0m_stata_wrk1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"qui \"\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mcmds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mecho\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    300\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 301\u001b[0;31m                 \u001b[0m_stata_wrk1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcmds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mecho\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    302\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    303\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0minline\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/stata/utilities/pystata/stata.py\u001b[0m in \u001b[0;36m_stata_wrk1\u001b[0;34m(cmd, echo)\u001b[0m\n\u001b[1;32m     76\u001b[0m             \u001b[0;32mwhile\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m!=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     77\u001b[0m                 \u001b[0;32mif\u001b[0m \u001b[0mrc1\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 78\u001b[0;31m                     \u001b[0;32mraise\u001b[0m \u001b[0mSystemError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     79\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     80\u001b[0m                 \u001b[0m_print_no_streaming_output\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mSystemError\u001b[0m: no overidentifying restrictions\nr(498);\n"
     ]
    }
   ],
   "source": [
    "%%stata\n",
    "estat overid"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "macro-forwarding",
   "metadata": {},
   "source": [
    "The Python stack trace is irrelevant here and will terrify my students. All the user needs to see is the Stata part of the error:\n",
    "```\n",
    "SystemError: no overidentifying restrictions\n",
    "r(498);\n",
    "```\n",
    "\n",
    "These results tell us that `feduc` is a relevant and strong instrument (the\n",
    "robust first-stage $F$ statistic is 80.26, well above the common rule-of-thumb\n",
    "value of 10) and that education is likely endogenous.\n",
    "\n",
    "Here is another error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "complex-failing",
   "metadata": {
    "tags": [
     "raises-exception"
    ]
   },
   "outputs": [
    {
     "ename": "SystemError",
     "evalue": "variable ln_wage already defined\nr(110);\n",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mSystemError\u001b[0m                               Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-10-f4b1cbd9edf4>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_cell_magic\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'stata'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m''\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gen ln_wage = 5\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/interactiveshell.py\u001b[0m in \u001b[0;36mrun_cell_magic\u001b[0;34m(self, magic_name, line, cell)\u001b[0m\n\u001b[1;32m   2397\u001b[0m             \u001b[0;32mwith\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuiltin_trap\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2398\u001b[0m                 \u001b[0margs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mmagic_arg_s\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcell\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2399\u001b[0;31m                 \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2400\u001b[0m             \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2401\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m<decorator-gen-117>\u001b[0m in \u001b[0;36mstata\u001b[0;34m(self, line, cell, local_ns)\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/magic.py\u001b[0m in \u001b[0;36m<lambda>\u001b[0;34m(f, *a, **k)\u001b[0m\n\u001b[1;32m    185\u001b[0m     \u001b[0;31m# but it's overkill for just that one bit of state.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    186\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mmagic_deco\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 187\u001b[0;31m         \u001b[0mcall\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mlambda\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    188\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    189\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mcallable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/stata/utilities/pystata/ipython/stpymagic.py\u001b[0m in \u001b[0;36mstata\u001b[0;34m(self, line, cell, local_ns)\u001b[0m\n\u001b[1;32m    274\u001b[0m             \u001b[0m_stata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcell\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mquietly\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minline\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_config\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstconfig\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'grshow'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    275\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 276\u001b[0;31m             \u001b[0m_stata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcell\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mquietly\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minline\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_config\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstconfig\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'grshow'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    277\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    278\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;34m'-gw'\u001b[0m \u001b[0;32min\u001b[0m \u001b[0margs\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;34m'-gh'\u001b[0m \u001b[0;32min\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/stata/utilities/pystata/stata.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(cmd, quietly, echo, inline)\u001b[0m\n\u001b[1;32m    299\u001b[0m                 \u001b[0m_stata_wrk1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"qui \"\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mcmds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mecho\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    300\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 301\u001b[0;31m                 \u001b[0m_stata_wrk1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcmds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mecho\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    302\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    303\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0minline\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/stata/utilities/pystata/stata.py\u001b[0m in \u001b[0;36m_stata_wrk1\u001b[0;34m(cmd, echo)\u001b[0m\n\u001b[1;32m     76\u001b[0m             \u001b[0;32mwhile\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m!=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     77\u001b[0m                 \u001b[0;32mif\u001b[0m \u001b[0mrc1\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 78\u001b[0;31m                     \u001b[0;32mraise\u001b[0m \u001b[0mSystemError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     79\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     80\u001b[0m                 \u001b[0m_print_no_streaming_output\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mSystemError\u001b[0m: variable ln_wage already defined\nr(110);\n"
     ]
    }
   ],
   "source": [
    "%%stata\n",
    "gen ln_wage = 5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "quality-format",
   "metadata": {},
   "source": [
    "Again, the Python stack trace is irrelevant and identical to the previous one. All the user needs to see is the Stata part of the error:\n",
    "```\n",
    "SystemError: variable ln_wage already defined\n",
    "r(110);\n",
    "```\n",
    "\n",
    "\n",
    "[^1]: These steps are not correct for the case of more than one\n",
    "    instrumental variable. However, they are instructive in\n",
    "    understanding the intuition of the Hausman Test in the instrumental\n",
    "    variables context. If you have more than one instrumental variable,\n",
    "    you must use the `ivendog` or `hausman` commands in Stata.\n",
    "\n",
    "[^2]: This equation will exactly replicate the Stata `ivregress` command\n",
    "    (for `2sls`) using the options `vce(unadjusted) small`.\n",
    "\n",
    "[^3]: This equation will exactly replicate the Stata `ivregress` command\n",
    "    (for `2sls`) using the options `vce(robust) small`, defining\n",
    "    $\\mathbf{V}$ as we did in the OLS chapter.\n",
    "\n",
    "[^4]: This is true if the model has more instruments than endogenous variables."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "text_representation": {
    "extension": ".md",
    "format_name": "myst",
    "format_version": 0.13,
    "jupytext_version": "1.10.3"
   }
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "source_map": [
   12,
   496,
   504,
   510,
   517,
   520,
   524,
   527,
   538,
   541,
   545,
   548,
   553,
   556,
   568,
   571,
   576,
   582,
   594,
   600
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 5
}