Workshop on Bootstrapping and Bayesian Econometrics

The goal of this workshop is to:

1. Explore bootstrapping
   * See the implementation steps
   * Show subtleties of reported results
2. Introduce Bayesian econometrics
   * Brief discussion of Bayes Rule
   * Markov chains and the Metropolis-Hastings sampler
   * A simple example estimating a mean and standard deviation
   * Application to Tobias and Koop
   * Hierarchical models
   * Show you three different samplers

In [1]:
#load python libraries for this ipython notebook:
import numpy as np              # numpy linear algebra library
import pandas as pd             # pandas data manipulation library
import scipy.stats as scipy     # access to stats libraries (note the scipy alias used below)
import matplotlib.pyplot as plt # plotting library (used throughout the notebook)
import emcee as emcee           # a bayesian MCMC (ensemble sampler) library
#load pymc3 libraries (another Bayesian Library)
import imp
pm = imp.load_module("pymc3", *imp.find_module("pymc", ["/home/robhicks/Dropbox/notebooks/pymc3/pymc"]))

Estimation methods in Econometrics

In class, we looked at these estimation methods:

1. Linear Model
2. Maximum Likelihood
3. Generalized Method of Moments

That leaves these major estimation paradigms you haven't yet seen:

4. Bayesian Estimation
5. Non-parametric Estimation

In what follows, we will replicate our bootstrapping work in IPython (so we can compare it to Bayesian methods) and then move on to Bayesian estimation.

Note: the appeal of Bayesian modeling is its ability to tackle computationally challenging models. Implementing an OLS model this way is fairly trivial, but it gives us a point of comparison since we know the OLS model very well. At the end of these notes, we show how Bayesian econometrics opens the door to very rich models that are difficult to replicate using frequentist methods.

Load Tobias and Koop data for time==4

In [2]:
tobias_koop=pd.read_csv('http://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')
In [3]:
tobias_koop.head()
Out[3]:
id educ ln_wage pexp time ability meduc feduc broken_home siblings pexp2
0 4 12 2.14 2 4 0.26 12 10 1 4 4
1 6 15 1.91 4 4 0.44 12 16 0 2 16
2 8 13 2.32 8 4 0.51 12 15 1 2 64
3 11 14 1.64 1 4 1.82 16 17 1 2 1
4 12 13 2.16 6 4 -1.30 13 12 0 5 36
In [4]:
tobias_koop.describe()
Out[4]:
id educ ln_wage pexp time ability meduc feduc broken_home siblings pexp2
count 1034.000000 1034.000000 1034.000000 1034.000000 1034 1034.000000 1034.000000 1034.000000 1034.000000 1034.000000 1034.000000
mean 1090.951644 12.274662 2.138259 4.815280 4 0.016596 11.403288 11.585106 0.169246 3.200193 27.979691
std 634.891728 1.566838 0.466280 2.190298 0 0.920963 3.027277 3.735833 0.375150 2.126575 22.598790
min 4.000000 9.000000 0.420000 0.000000 4 -3.140000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 537.250000 12.000000 1.820000 3.000000 4 -0.550000 11.000000 10.000000 0.000000 2.000000 9.000000
50% 1081.500000 12.000000 2.120000 5.000000 4 0.170000 12.000000 12.000000 0.000000 3.000000 25.000000
75% 1666.500000 13.000000 2.450000 6.000000 4 0.720000 12.000000 14.000000 0.000000 4.000000 36.000000
max 2177.000000 19.000000 3.590000 12.000000 4 1.890000 20.000000 20.000000 1.000000 15.000000 144.000000
In [5]:
# explore distribution of ln_wage (dv)
tobias_koop.ln_wage.plot(kind='hist')
Out[5]:
<matplotlib.axes.AxesSubplot at 0x51b7dd0>
In [6]:
# run OLS regression:
results = pd.ols(y=tobias_koop['ln_wage'], x=tobias_koop[['educ','pexp','pexp2','broken_home']])
results
Out[6]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <educ> + <pexp> + <pexp2> + <broken_home> + <intercept>

Number of Observations:         1034
Number of Degrees of Freedom:   5

R-squared:         0.1664
Adj R-squared:     0.1632

Rmse:              0.4265

F-stat (4, 1029):    51.3606, p-value:     0.0000

Degrees of Freedom: model 4, resid 1029

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
          educ     0.0853     0.0093       9.18     0.0000     0.0671     0.1035
          pexp     0.2035     0.0236       8.63     0.0000     0.1573     0.2497
         pexp2    -0.0124     0.0023      -5.44     0.0000    -0.0169    -0.0079
   broken_home    -0.0087     0.0357      -0.24     0.8070    -0.0787     0.0613
     intercept     0.4603     0.1373       3.35     0.0008     0.1912     0.7294
---------------------------------End of Summary---------------------------------

These standard errors aren't robust (they don't match the Stata output), so let's bootstrap. The statsmodels package has robust and bootstrap options (see http://statsmodels.sourceforge.net/stable/rlm.html for the robust OLS model). In general you should use the canned routines; we present the following for educational purposes only, to make explicit what the bootstrap is doing (a statsmodels sketch is shown below for reference).
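
For reference, here is a minimal sketch of the canned route to robust standard errors using statsmodels. This cell is illustrative and was not run as part of the notebook; HC1 is just one of several robust covariance choices.

In [ ]:
# Illustrative sketch (not run above): heteroskedasticity-robust OLS via statsmodels.
import statsmodels.api as sm

X = sm.add_constant(tobias_koop[['educ', 'pexp', 'pexp2', 'broken_home']])
robust_fit = sm.OLS(tobias_koop['ln_wage'], X).fit(cov_type='HC1')  # HC1 robust covariance
robust_fit.summary()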

Steps:

1. Loop \(R\) times (\(R\) = number of bootstrap replicates)
2. For each \(r \in R\), sample \(N\) times with replacement from \(\mathbf{Y}\) and \(\mathbf{X}\), noting that \(N\) is equal to the number of observations in the original dataset \(\mathbf{Y}\) and \(\mathbf{X}\). Denote each of these samples as \(\mathbf{X}_r\) and \(\mathbf{Y}_r\)
3. Estimate an OLS model and recover the estimate for replicate \(r\): \(\hat{\beta}_r = (\mathbf{X}_r'\mathbf{X}_r)^{-1}\mathbf{X}_r'\mathbf{Y}_r\)
4. Store the results
5. Calculate the standard deviation (our estimate of the s.e.) of each parameter estimate, or construct non-parametric confidence intervals (preferred)

In [7]:
R = 400 

results_boot = np.zeros((R,results.beta.shape[0]))

row_id = range(0,tobias_koop.shape[0])

for r in range(R):
     this_sample = np.random.choice(row_id, size=tobias_koop.shape[0], replace=True) # gives sampled row numbers
     # Define data for this replicate:    
     tobias_koop_r = tobias_koop.iloc[this_sample]   
     # Estimate model
     results_r = pd.ols(y=tobias_koop_r['ln_wage'], x=tobias_koop_r[['educ','pexp','pexp2','broken_home']])   
     # Store in row r of results_boot:
     results_boot[r,:] = np.asarray(results_r.beta)   
In [8]:
# Convert results to pandas dataframe for easier analysis:
results_boot = pd.DataFrame(results_boot,columns=['b_educ','b_pexp','pexp2','b_broken_home','b_Intercept'])
In [9]:
results_boot.head(10) # the first 10 replicates
Out[9]:
b_educ b_pexp pexp2 b_broken_home b_Intercept
0 0.086845 0.209767 -0.013187 0.042530 0.424621
1 0.055819 0.151736 -0.007761 -0.058262 0.931521
2 0.075879 0.137397 -0.006732 -0.010441 0.719790
3 0.073911 0.205536 -0.013938 -0.106334 0.662189
4 0.082670 0.209667 -0.012387 -0.071467 0.430717
5 0.079207 0.218031 -0.014434 0.026968 0.548299
6 0.110818 0.183666 -0.009843 0.012264 0.156372
7 0.098079 0.211655 -0.013272 0.005513 0.276762
8 0.060890 0.220493 -0.016071 -0.115970 0.799638
9 0.102324 0.198275 -0.011950 0.064837 0.212105
In [10]:
results_boot.b_educ.plot(kind='hist')
Out[10]:
<matplotlib.axes.AxesSubplot at 0x533ced0>
In [11]:
results_boot.describe(percentiles=[.005,.025,.05,.5,.95,.975,.995])
Out[11]:
b_educ b_pexp pexp2 b_broken_home b_Intercept
count 400.000000 400.000000 400.000000 400.000000 400.000000
mean 0.084648 0.205918 -0.012596 -0.014049 0.462193
std 0.015267 0.036862 0.003708 0.051261 0.209663
min 0.034552 0.045842 -0.023124 -0.156739 -0.213324
0.5% 0.044082 0.099568 -0.021045 -0.134710 -0.071113
2.5% 0.053761 0.129801 -0.019293 -0.117726 0.051967
5% 0.059072 0.145110 -0.018207 -0.096306 0.111766
50% 0.084729 0.207228 -0.012590 -0.014617 0.454001
95% 0.110858 0.262188 -0.006414 0.073874 0.824146
97.5% 0.115970 0.276534 -0.004944 0.094848 0.891660
99.5% 0.123583 0.295535 -0.001668 0.126296 0.962410
max 0.126530 0.308454 0.003527 0.132844 1.098337

Let's plot our bootstrap replicates, by parameter. Note that the values of each of these replicates are completely independent (or should be). Replicate 3 doesn't in any way inform us about replicate 4.

In [12]:
replicate= np.arange(R)

# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 10), dpi=200)
lw = 1

plt.subplot(521)
plt.ylabel("Intercept")
plt.plot(replicate, results_boot.b_Intercept, label="Intercept",lw=lw)
plt.subplot(522)
plt.hist(results_boot.b_Intercept,lw=lw, label="Intercept", orientation='horizontal')

plt.subplot(523)
plt.ylabel("b_educ")
plt.plot(replicate, results_boot.b_educ, label="b_educ",lw=lw)
plt.subplot(524)
plt.hist(results_boot.b_educ,lw=lw, label="b_educ",orientation='horizontal')

plt.subplot(525)
plt.ylabel("b_pexp")
plt.plot(replicate, results_boot.b_pexp, label="b_pexp",lw=lw)
plt.subplot(526)
plt.hist(results_boot.b_pexp,lw=lw, label="b_pexp",orientation='horizontal')

plt.subplot(527)
plt.ylabel("b_pexp2")
plt.plot(replicate, results_boot.pexp2, label="b_pexp2",lw=lw)
plt.subplot(528)
plt.hist(results_boot.pexp2,lw=lw, label="b_pexp2",orientation='horizontal')


plt.subplot(529)
plt.ylabel("b_broken_home")
plt.plot(replicate, results_boot.b_broken_home, label="b_broken_home",lw=lw)
plt.subplot(5,2,10)
plt.hist(results_boot.b_broken_home,lw=lw, label="b_broken_home",orientation='horizontal')
Out[12]:
(array([  5.,  13.,  35.,  67.,  84.,  96.,  53.,  28.,  12.,   7.]),
 array([-0.15673906, -0.12778079, -0.09882252, -0.06986425, -0.04090598,
        -0.01194771,  0.01701055,  0.04596882,  0.07492709,  0.10388536,
         0.13284363]),
 <a list of 10 Patch objects>)

Now for Bayesian Estimation of the same model:

The Bayesian approach is functionally similar yet comes at the problem from a philosophically different viewpoint. For frequentists, there is one true value of the \(\beta\) vector, and uncertainty arises from having to sample from a population (as is demonstrated in the bootstrap above). All the "noise" is a result of estimating our beta over a different sample. As the sample becomes increasingly large, the estimator would collapse to a single value (note: here I am referring to the number of observations in \(\mathbf{X}\), not to increasing \(R\)).

The Bayesian analyst approaches the problem differently. Taking the data as given, uncertainty arises around \(\beta\) for a whole host of reasons; even an infinite sample size may not cause that uncertainty to collapse to zero. In frequentist terms, we would never talk about the uncertainty of \(\beta\), but rather the uncertainty of our estimator; Bayesians are comfortable talking about the uncertainty of \(\beta\) itself. In practice, these nuances can be largely ignored (this statement could start a holy war) and one can approach econometric models from either perspective, depending on which one leads to a better (or estimable) approach. There are examples on the web showing how one approach can outperform the other, but these are all special cases: it just depends on how one defines "outperforms" and on the phenomenon under investigation.

An aside on Bayes Rule and the Relationship to Bayesian Econometrics

Recall that Bayes Rule is \[ Prob(A|B) = \frac{Prob(B|A)Prob(A)}{Prob(B)} \]

How do we use this idea for econometric inference? A very good book on this topic is by John Geweke, "Contemporary Bayesian Econometrics and Statistics". Consider the following definitions for getting us started:

  • Parameters we are trying to estimate (like \(\beta\)): \(\theta\)
  • Hyperparameter(s) governing our prior beliefs about \(\theta\): \(\alpha\)
  • Prior distribution of parameters (what we believe a priori about model parameters): \(Prob(\theta|\alpha)\)
  • Sampling distribution of observed data (\(\mathbf{X}\) and \(\mathbf{Y}\)): \(Prob(\mathbf{X},\mathbf{Y}|\theta)\). This is also called the likelihood function (exactly equivalent to what we did in MLE).
  • Marginal likelihood: \[ Prob(\mathbf{X},\mathbf{Y}|\alpha)=\int_{\theta}Prob(\mathbf{X},\mathbf{Y}|\theta)Prob(\theta|\alpha)\,d\theta \]
  • Posterior Distribution (this is the link to Bayes Rule): \[ Prob(\theta|\mathbf{X},\mathbf{Y},\alpha) = \frac{Prob(\mathbf{X},\mathbf{Y}|\theta)Prob(\theta|\alpha)}{Prob(\mathbf{X},\mathbf{Y}|\alpha)} \propto Prob(\mathbf{X},\mathbf{Y}|\theta)Prob(\theta|\alpha) \] (a small code sketch of this proportionality follows)
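
To make these pieces concrete, here is a minimal sketch of "unnormalized log posterior = log likelihood + log prior" for a hypothetical scalar parameter theta; the toy data, the N(0,10) prior, and the function names are illustrative choices, not part of the workshop model.

In [ ]:
# Illustrative only: the posterior is proportional to likelihood times prior (here in logs).
fake_data = np.random.normal(1.0, 1.0, size=50)               # made-up data for illustration

def log_prior(theta):
    return scipy.norm.logpdf(theta, 0, 10)                    # prior belief: theta ~ N(0, 10)

def log_like(theta):
    return np.sum(scipy.norm.logpdf(fake_data, theta, 1.0))   # data assumed N(theta, 1)

def log_posterior(theta):
    return log_like(theta) + log_prior(theta)                 # marginal likelihood (a constant) dropped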

Implementing the Metropolis-Hastings Algorithm

Consider constructing a series of potential beta vectors that explore the posterior distribution described above. When selecting beta vectors, we would like to explore parameters having higher probability under the posterior (\(Prob(\theta|\mathbf{X},\mathbf{Y},\alpha)\)). In that context, it is useful to think about \(\theta\) and \(\alpha\) as follows: denote \(\theta\) as a new draw from the parameter distribution (\(\theta_{t+1}\)), and let \(\alpha\) be the current draw (\(\theta_t\)).

With that in mind, we can write the posterior as \[ Prob(\theta_{t+1}|\mathbf{X},\mathbf{Y},\theta_t) = \frac{Prob(\mathbf{X},\mathbf{Y}|\theta_{t+1})Prob(\theta_{t+1}|\theta_t)}{Prob(\mathbf{X},\mathbf{Y}|\theta_{t})} \]

This begins us thinking about how new information (a proposed value of \(\theta\)) tells us about the posterior distribution. Essentially we want to construct a series of estimates (called a chain) of the beta vector that is in regions of the parameter space that are high probability for the posterior. Once the chain is constructed, it reflects the underlying distribution of our parameter estimates.

First, we must specify a proposal distribution from which to draw proposed values of \(\theta\). Suppose we choose a symmetric proposal distribution such that \(Prob(\theta_{t+1}|\theta_t) = Prob(\theta_t|\theta_{t+1})\). For example, we might specify \(\theta_{t+1} = \theta_t + Normal(0,\sigma)\); since the difference is normally distributed around zero, \(N(\theta_{t+1} - \theta_t;0,\sigma) = N(\theta_t - \theta_{t+1};0,\sigma)\), so the proposal is symmetric. This is called the random walk Metropolis-Hastings algorithm, the most common one in use.

We can then define an accept/reject criterion, where accept means that the candidate draw (\(\theta_{t+1}\)) provides information about \(\theta\) in high probability regions. The criterion is \[ c(\theta_{t+1},\theta_t) = min \left( 1,\frac{Prob(\mathbf{X},\mathbf{Y}|\theta_{t+1})Prob(\theta_{t+1}|\theta_t)}{Prob(\mathbf{X},\mathbf{Y}|\theta_{t})Prob(\theta_{t}|\theta_{t+1})}\right ) \]

Given the symmetry of the random walk draws, this simplifies to

\[ c(\theta_{t+1},\theta_t) = min \left( 1,\frac{Prob(\mathbf{X},\mathbf{Y}|\theta_{t+1})}{Prob(\mathbf{X},\mathbf{Y}|\theta_{t})}\right ) \] which is simply a ratio of likelihood function values!

Having calculated the accept/reject criterion value \(c\), we need to decide whether to include \(\theta_{t+1}\) in our chain. To do that, draw a random uniform value \(u\in[0,1]\). If \(c(\theta_{t+1},\theta_t)\ge u\), then add \(\theta_{t+1}\) to the chain, which helps us characterize the distribution of parameters most consistent with the posterior probability.
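
In log space (which is how the example code further below works), a single accept/reject step looks roughly like the following self-contained sketch; the standard normal "posterior" and the tuning value are placeholders, not part of the workshop model.

In [ ]:
# Sketch of one random-walk MH accept/reject step in log space (toy target distribution).
log_post = lambda theta: scipy.norm.logpdf(theta, 0, 1)      # stand-in log posterior
theta_t, sigma_prop = 0.5, 0.25                              # current draw, proposal scale
theta_cand = theta_t + sigma_prop * np.random.normal()       # symmetric random-walk proposal
log_c = min(0.0, log_post(theta_cand) - log_post(theta_t))   # log of c(theta_{t+1}, theta_t)
if np.log(np.random.uniform()) <= log_c:                     # compare to log of a U(0,1) draw
    theta_t = theta_cand                                     # accept; otherwise keep theta_t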

It is important to note that the accept/reject method ensures that the chain never "converges" the way maximum likelihood does. Even if the maximum is found, there is always a non-zero probability that the chain will deviate (and store values of \(\beta\) that actually cause the likelihood function value to decline)!

The Metropolis-Hastings algorithm has been named one of the top 10 algorithms of the 20th century and appears on both the Math and Computational Sciences lists.

An example

Suppose we have 100 observations on age = [43, 21, 36, 17, ...] and we want to use the MH algorithm to estimate the mean (and standard deviation) of age.

Steps:

1. Define the likelihood function
2. Generate a candidate value (\(\theta_{t+1}\))
3. Calculate the likelihood value
4. Calculate \(c(\theta_{t+1},\theta_{t})\)
5. Draw a random uniform variate (value)
6. Choose to either accept or reject \(\theta_{t+1}\)
7. If accepted, add \(\theta_{t+1}\) to the chain; if rejected, keep \(\theta_t\). Return to (2)

In [13]:
# data: construct age matrix and ensure correct shape
age = pd.DataFrame((40 + 12*np.random.normal(0, 1,(100,1))), columns=['age'])
age.describe()
Out[13]:
age
count 100.000000
mean 41.784593
std 12.949731
min 4.870258
25% 33.110703
50% 40.404375
75% 48.116807
max 74.027367

Step 1. Define the Likelihood Function: a normal pdf with mean mu, standard deviation sigma, and vector data input x.

In [14]:
def like_age(mu,sigma,x):
    if sigma<=0:
        return -np.inf  # heavily penalize sigma <= 0
    else:
        u = (x - mu)/np.abs(sigma)                   # standardized deviations from mu
        scale = 1/(np.sqrt(2*np.pi)*np.abs(sigma))   # normal pdf normalizing constant
        prob = scale*np.exp(-np.multiply(u,u)/2)     # N(mu, sigma) density at each x
        return np.sum(np.log(prob))                  # log-likelihood over the sample

Steps 2-7.

Define starting values and construct the chain.

In [15]:
age=np.asarray(age)
mu0 = 30                                           #starting value for mean
sigma0 = 10                                        #starting value for sigma
rho = np.asarray(.15)                              #tuning parameter for MH sampler
p0 = like_age(mu0,sigma0,age)                      #starting probability for chain

chain_length = 10000
burn_in = 1000

chain_vals = np.zeros((chain_length,3))

# This is the heart of the MH algorithm, see notes below
# Loop performs steps 2-7 above
for i in range(0,chain_length):
    mu_cand = np.asarray(mu0 + rho*np.random.normal(0, 1))
    sigma_cand = np.asarray(sigma0 + rho*np.random.normal(0, 1))
    # for this candidate calculate p1 (probability of parameters given data)
    p1 = like_age(mu_cand,sigma_cand,age)  
    c = np.minimum(0,p1 - p0)      # subtract since in logs
    if c>=np.log(np.random.uniform()):
        p0 = p1
        mu0 = mu_cand
        sigma0 = sigma_cand 
    chain_vals[i,0] = mu0
    chain_vals[i,1] = sigma0
    chain_vals[i,2] = p0    

The code above works but is far from a polished version of the MH algorithm. One would normally implement an adaptive random-walk proposal that tunes rho so that the acceptance ratio ends up around 20%. This code doesn't do that, so it shouldn't be used in a production setting, but it is fine for instructional purposes. A rough sketch of what adaptive tuning might look like follows.
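
The sketch below shows one simple, purely illustrative adaptation heuristic on a toy standard normal target: track the acceptance ratio during burn-in and nudge the step size toward a 20% acceptance rate. The toy target, the check interval, and the scaling factors are arbitrary choices, not part of the workshop code.

In [ ]:
# Illustrative sketch of adapting the tuning parameter during burn-in (toy target).
log_post = lambda theta: scipy.norm.logpdf(theta, 0, 1)        # toy log posterior
theta, rho_adapt, n_accept = 0.0, 1.0, 0
for i in range(1, burn_in + 1):
    cand = theta + rho_adapt * np.random.normal()              # random-walk proposal
    if np.log(np.random.uniform()) <= min(0.0, log_post(cand) - log_post(theta)):
        theta, n_accept = cand, n_accept + 1                   # accept the candidate
    if i % 100 == 0:                                           # re-tune every 100 draws
        rho_adapt *= 1.1 if n_accept / float(i) > 0.20 else 0.9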

In [16]:
# convert to pandas dataframe
chain_ex = pd.DataFrame(chain_vals,columns=['mean','std','prob'])
chain_ex.plot(kind='hexbin', x='mean', y='std', C='prob', reduce_C_function=np.max,gridsize=50)
# calculate summary statistics post burn-in
chain_ex.loc[burn_in:].describe(percentiles=[.005,.025,.05,.5,.95,.975,.995])
Out[16]:
mean std prob
count 9000.000000 9000.000000 9000.000000
mean 41.784150 12.886972 -39750.865206
std 0.128627 0.089576 1.009329
min 41.354163 12.520079 -39758.391179
0.5% 41.444949 12.653458 -39755.368671
2.5% 41.529483 12.711870 -39753.642675
5% 41.571783 12.742053 -39752.872637
50% 41.784452 12.885293 -39750.527944
95% 41.993022 13.034135 -39749.938687
97.5% 42.031078 13.066558 -39749.913259
99.5% 42.121277 13.122512 -39749.889520
max 42.236743 13.208737 -39749.884224

The summary statistics are for post burn-in values in the chain only and are calculated from the dark green region in the above plot. The plot shows how accepted values of the mean and standard deviation move around the parameter space. Lighter shaded areas are low probability; notice that the chain "converges" to the dark green area after exploring some higher age and standard deviation values. Note: other combinations of parameters were considered but weren't accepted into the chain.

The main point to consider is that the Bayesian estimator is not merely a point estimate of the mean and standard deviation of age, but rather an entire estimated distribution for each parameter, as can be seen from the traceplots:

In [17]:
#plot chains, excluding burn-in
time= np.arange(chain_length-burn_in)

# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 10), dpi=200)
lw = 1

plt.subplot(221)
plt.ylabel("mu")
plt.plot(time, chain_ex['mean'].loc[burn_in:] , lw=lw)
plt.subplot(222)
chain_ex['mean'].loc[burn_in:].plot(kind='hist',bins=50,orientation='horizontal')

plt.subplot(223)
plt.ylabel("sigma")
plt.plot(time, chain_ex['std'].loc[burn_in:], lw=lw)
plt.subplot(224)
chain_ex['std'].loc[burn_in:].plot(kind='hist',bins=50,orientation='horizontal')
Out[17]:
<matplotlib.axes.AxesSubplot at 0x6be0410>

For Bayesian Econometrics, convergence describes whether the chain oscillates randomly around a stable value (which is the mean of the distribution).

Bayesian OLS Regression applied to Tobias and Koop

In [18]:
# to make likelihood calculations easier, add a column of ones for the intercept:
tobias_koop['ones'] = 1

The Bayesian approach centers on probabilities of the posterior. We can recycle what we know from maximum likelihood to write the probability for individual i as

\[ Prob(\mathbf{b}| \mathbf{Y}_i, \mathbf{X}_i) = \frac{1}{\sqrt{2\sigma^2\pi}} e^{-\frac{(Y_i - \mathbf{X_i b})^2}{2\sigma^2}} \] which is just the normal pdf having mean \(\mathbf{X}_i \mathbf{b}\)

With that, we can write the log-likelihood function as \[ LogLike(\mathbf{b}| \mathbf{Y}, \mathbf{X}) = \sum_{i \in N} ln\left(Prob(\mathbf{b}| \mathbf{Y}_i, \mathbf{X}_i)\right) \] which is exactly the same log-likelihood we'd use if we were estimating OLS by maximum likelihood. The following operationalizes this in code. Note that, as in MLE, we have to estimate the \(K\) parameters of \(\mathbf{b}\) and our estimate for \(\sigma\).

In [19]:
# Likelihood Function for OLS (Bayesians call this log of posterior): identical to likelihood function for MLE
def log_likelihood(theta, x, y):
    sigma = np.asmatrix(theta[-1])         #the last element of theta is the std dev of epsilon, and can't be <0
    beta = np.asarray(theta[:-1])          #the first k are beta parameters
    Y=np.asmatrix(tobias_koop[y])
    if sigma <= 0:
        return -np.inf                     #account for a chain trying to assign sigma < 0: penalize heavily
    else:
        mu = np.asmatrix(np.sum(tobias_koop[x]*np.tile(beta,(tobias_koop.shape[0],1)),axis=1)) #this is xb
        mu =  mu.reshape(Y.shape[0],1)
        prob = scipy.norm.pdf(Y,mu,sigma) 
    return np.sum(np.log(prob))

Now that we have defined the likelihood function (good for any OLS-style model), we need an algorithm to sample parameters from the posterior. The MH algorithm has some advantages, but better samplers have been developed recently. For this example, we will use the emcee sampler (http://dan.iel.fm/emcee/current/), which has been used widely in the astronomy/physics literature. It wanders the parameter space using a more informed method than the pure random walk our MH implementation used. This code constructs a chain for our model parameters:

In [20]:
x = ['educ','pexp','pexp2','broken_home','ones']
y = ['ln_wage']

# this calls the emcee mcmc algorithm
ndim, nwalkers = len(x)+1, 100           #number of parameters (betas plus sigma), # of walkers in ensemble
nsteps, nburn = 5000, 1000              #total number of steps, # of burn-in steps

starting_guesses = np.random.rand(nwalkers, ndim)*.1
  
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_likelihood, args=[x, y],threads=5)  

store_chain=sampler.run_mcmc(starting_guesses, nsteps)

The estimator returns a Markov Chain (actually the ensemble sampler returns 100 chains, one per walker). Note: in what follows, it should be evident that the Bayesian approach estimates a distribution of parameter values rather than focusing on point estimates. As an aside, this run evaluates the likelihood function 5,000 x 100 = 500,000 times, which may be a little overkill for an OLS model.

The following graphs one of these chains. Note: this single walker exhibits periodicity, which is problematic, so we need to combine information across all chains to judge convergence. See the discussion here: http://dan.iel.fm/emcee/current/user/line/

In [33]:
#Plot the entire chain:
time= np.arange(nsteps-nburn)  # x values for plot (integer count)
sample = sampler.chain[:, nburn:, :]

# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 10), dpi=200)
lw = 1

# to show the full chain, we'd need to loop through walkers; here we just show a single walker:
plt.subplot(521)
plt.ylabel("Intercept")
plt.plot(time, sample[1,:,4], label="Intercept",lw=lw)
plt.subplot(522)
plt.hist(sample[1,:,4],lw=lw, label="Intercept", orientation='horizontal')

plt.subplot(523)
plt.ylabel("b_educ")
plt.plot(time, sample[1,:,0], label="b_educ",lw=lw)
plt.subplot(524)
plt.hist(sample[1,:,0],lw=lw, label="b_educ",orientation='horizontal')

plt.subplot(525)
plt.ylabel("b_pexp")
plt.plot(time, sample[1,:,1], label="b_pexp",lw=lw)
plt.subplot(526)
plt.hist(sample[1,:,1],lw=lw, label="b_pexp",orientation='horizontal')

plt.subplot(527)
plt.ylabel("b_pexp2")
plt.plot(time, sample[1,:,2], label="b_pexp2",lw=lw)
plt.subplot(528)
plt.hist(sample[1,:,2],lw=lw,label="b_pexp2", orientation='horizontal')

plt.subplot(529)
plt.ylabel("b_broken_home")
plt.plot(time,sample[1,:,3], label="b_broken_home",lw=lw)
plt.subplot(5,2,10)
plt.hist(sample[1,:,3],lw=lw, label="b_broken_home",orientation='horizontal')
Out[33]:
(array([    5.,    55.,   231.,   709.,  1072.,   937.,   649.,   270.,
           49.,    23.]),
 array([-0.13442667, -0.10919383, -0.083961  , -0.05872817, -0.03349534,
        -0.0082625 ,  0.01697033,  0.04220316,  0.06743599,  0.09266882,
         0.11790166]),
 <a list of 10 Patch objects>)

While the trace plots above look fairly similar to what we found in the bootstrapping exercise, we came at this information from a very different perspective: by sampling values from the posterior distribution. This was done by evaluating a number of trial values of our parameters and settling on areas of the parameter space having higher posterior probability.

What is of more interest is the information contained in all of the chains, and what it tells us about the distribution of the parameters. The following trace chart uses all of the information from all 100 chains.

In [37]:
# to show the full chain, need to loop through walkers
walkers=np.arange(nwalkers)

plt.figure(figsize=(14, 14), dpi=200)

lw = .05

for i in walkers:
    plt.subplot(511)
    plt.ylabel("Intercept")
    plt.plot(time, sample[i,:,4], label="Intercept",lw=lw)

    plt.subplot(512)
    plt.ylabel("b_educ")
    plt.plot(time, sample[i,:,0], label="b_educ",lw=lw)

    plt.subplot(513)
    plt.ylabel("b_pexp")
    plt.plot(time, sample[i,:,1], label="b_pexp",lw=lw)
    
    plt.subplot(514)
    plt.ylabel("b_pexp2")
    plt.plot(time, sample[i,:,2], label="b_pexp2",lw=lw)

    plt.subplot(515)
    plt.ylabel("b_broken_home")
    plt.plot(time,sample[i,:,3], label="b_broken_home",lw=lw)

Notice that using the full set of walkers gives us chains that, at least visually, do not have the periodicity problems (although there are formal tests; one is sketched below), so we can proceed.
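
One such formal check is the Gelman-Rubin statistic, which compares within-walker and between-walker variance; values near 1 suggest the walkers agree on the same distribution. The helper below is a rough, illustrative implementation of the standard textbook formula and is not part of emcee.

In [ ]:
# Rough Gelman-Rubin (R-hat) check across walkers; illustrative helper, not an emcee function.
def gelman_rubin(chains):
    # chains: (n_walkers, n_draws) array of post burn-in draws for one parameter
    n = chains.shape[1]
    W = chains.var(axis=1, ddof=1).mean()        # mean within-walker variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-walker variance
    var_hat = (n - 1.0) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)                  # R-hat; close to 1 is good

rhats = {}
for k, name in enumerate(['b_educ', 'b_pexp', 'b_pexp2', 'b_broken_home', 'Intercept', 'sigma']):
    rhats[name] = gelman_rubin(sample[:, :, k])
rhats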

Using the same information, the following plots the likelihood (posterior) surface in two dimensions. Each point in the diagrams is an actual parameter value that was included in our chain.

In [27]:
samples_long = sampler.chain[:, nburn:, :].reshape((-1, ndim))
import triangle
fig = triangle.corner(samples_long, labels=["$educ$", "$pexp$", "$pexp2$","$broken\_home$","$Int$","$\sigma$"],
                      truths=[.0853, .2035, -.0124, -.0087, .4603, .4265])

It is important to note that none of the Bayesian MCMC sampling techniques explored here will ever "converge" to the maximum likelihood estimate, the value of the parameters that maximizes the likelihood function. This is because the MH algorithm (and its more modern derivatives) places some non-zero probability on accepting candidate vectors with lower posterior probabilities. The vectors that are accepted are likely to be nearly as probable as the MLE estimate, which is why the chain tends to wander around it: we are trying to characterize the distribution of our parameters, not just a point estimate. On average, however, you get pretty much the same result as MLE (and, for this example, OLS).

Next, we summarize the distribution of our parameters by taking the mean (most akin to the frequentist idea of a point estimate), the standard deviation (like the parameter standard error), and the percentile breakdowns. Note that reporting standard errors (or deviations) imposes a normality on your distribution that may not be appropriate; reporting confidence intervals the way we have calculated them is much less restrictive, although few in the literature actually do this. Furthermore, we could calculate covariances as well to fully characterize \(cov(\theta)\), but we don't do that here (a one-line sketch appears after the summary table below).

In [31]:
#convert to pandas dataset and describe
results_bayesian = pd.DataFrame(samples_long,columns=['educ','pexp','pexp2','broken_home','intercept','sigma'])
results_bayesian.describe(percentiles=[.005,.025,.05,.5,.95,.975,.995])
Out[31]:
educ pexp pexp2 broken_home intercept sigma
count 400000.000000 400000.000000 400000.000000 400000.000000 400000.000000 400000.000000
mean 0.085315 0.204037 -0.012465 -0.008680 0.458503 0.426964
std 0.009377 0.023304 0.002262 0.035665 0.137912 0.009571
min 0.042974 0.096317 -0.021915 -0.153405 -0.107116 0.389304
0.5% 0.060932 0.143789 -0.018341 -0.100915 0.109621 0.403358
2.5% 0.066856 0.158400 -0.016927 -0.078533 0.190340 0.408892
5% 0.069787 0.165834 -0.016207 -0.067292 0.234387 0.411656
50% 0.085413 0.203957 -0.012448 -0.008594 0.456475 0.426738
95% 0.100624 0.242422 -0.008772 0.050141 0.687206 0.443015
97.5% 0.103675 0.249944 -0.008073 0.061648 0.731061 0.446241
99.5% 0.109362 0.264178 -0.006653 0.084702 0.818035 0.452690
max 0.124412 0.308641 -0.002592 0.155374 1.099135 0.474191
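
As noted above, if we did want the full covariance of the posterior draws, it is just the sample covariance of the chain; this cell is illustrative and was not run as part of the notebook.

In [ ]:
# Illustrative: cov(theta) is just the sample covariance of the posterior draws.
results_bayesian.cov()      # or results_bayesian.corr() for the correlation matrix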
In [38]:
# to see how the MH algorithm searches the parameter space and information on parameter covariance:
#Uncomment the following two lines for another plot, but I prefer the triangle plot above.
#from pandas.tools.plotting import scatter_matrix
#scatter_matrix(results_bayesian, alpha=0.2, figsize=(12, 12), diagonal='hist')

Where Bayesian Models Shine

So far, we have been using Bayesian econometric techniques on relatively trivial problems. But suppose we want to investigate something more interesting and better suited to our Bayesian machinery?

Hierarchical modeling is one such case. The idea is that there may be heterogeneity in model parameters governed by a hierarchical process: for example, the parameter \(\beta_1 \sim N(\mu_1,\sigma_1)\), but individuals have different realizations from this distribution, \(\beta_{i1} = \mu_{1} + \sigma_1 \times Normal(0,1)\). This model allows effects to differ among units (a unit could be a person, a state, etc.), as the toy sketch below illustrates.
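
The toy cell below just simulates that data-generating idea; the group mean, spread, and number of units are made-up values for illustration.

In [ ]:
# Toy illustration of a hierarchical coefficient: each unit i gets its own beta_i1
# drawn around a common group mean. mu_1, sigma_1, and n_units are made-up values.
mu_1, sigma_1, n_units = 0.5, 0.2, 85
beta_i1 = mu_1 + sigma_1*np.random.normal(0, 1, n_units)   # one realization per unit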

Note: this example borrowed from http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/

The data set

Gelman et al.'s (2007) radon dataset is a classic for hierarchical modeling. In this dataset the amount of the radioactive gas radon has been measured among different households in all counties of several states. Radon gas is known to be the leading cause of lung cancer among non-smokers. It is believed to be more strongly present in households containing a basement, and to differ in concentration among types of soil. Here we'll investigate these differences and try to make predictions of radon levels in different counties based on the county itself and the presence of a basement. In this example we'll look at Minnesota, a state that contains 85 counties in which different numbers of measurements are taken, ranging from 2 to 116 per county.

(figure: radon)

In [41]:
#%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd

data = pd.read_csv('./random_parameters/radon_constructed.csv')

county_names = data.county.unique()
county_idx = data['county_code'].values
n_counties = len(data.county.unique())

We will look at this subset of the dataset:

In [42]:
data[['county', 'log_radon', 'floor']].head()
Out[42]:
county log_radon floor
0 AITKIN 0.832909 1
1 AITKIN 0.832909 0
2 AITKIN 1.098612 0
3 AITKIN 0.095310 0
4 ANOKA 1.163151 0

As you can see, we have multiple radon measurements (log-converted) -- one row for each house -- in a county and whether the house has a basement (floor == 0) or not (floor == 1). We are interested in whether having a basement increases the radon measured in the house.
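
A quick, illustrative way to eyeball that question before any modeling (this cell was not part of the original notebook):

In [ ]:
# Illustrative quick look: average log radon by floor (0 = basement, 1 = no basement).
data.groupby('floor').log_radon.mean()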

The Models

Pooling of measurements

Now you might say: "That's easy! I'll just pool all my data and estimate one big regression to assess the influence of a basement across all counties." In math-speak that model would be:

\[radon_{i, c} = \alpha + \beta*\text{floor}_{i, c} + \epsilon\]

Where \(i\) represents the measurement, \(c\) the county, and floor contains 0 if the house has a basement and 1 if it does not. If you need a refresher on linear regressions in PyMC, see the source blog post linked above. Critically, we are only estimating one intercept and one slope for all measurements over all counties pooled together, as illustrated in the graphic below (\(\theta\) represents \((\alpha, \beta)\) in our case and \(y_i\) are the measurements of the \(i\)th county). A pymc3 sketch of this pooled model follows the figure.

(figure: pooled model)
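
A pooled model in this notebook's pymc3 setup might look like the following sketch; the prior choices simply mirror the hierarchical model further below and are illustrative, not taken from the original post.

In [ ]:
# Sketch of the fully pooled model: one intercept and one slope for all counties.
# Priors mirror the hierarchical model below and are illustrative choices.
with pm.Model() as pooled_model:
    alpha = pm.Normal('alpha', mu=0., sd=100**2)
    beta = pm.Normal('beta', mu=0., sd=100**2)
    eps = pm.Uniform('eps', lower=0, upper=100)
    radon_est = alpha + beta*data.floor.values
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)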

Unpooled measurements: separate regressions

But what if we are interested in whether different counties actually have different relationships (slope) and different base-rates of radon (intercept)? Then you might say "OK then, I'll just estimate \(n\) (number of counties) different regressions -- one for each county". In math-speak that model would be:

\[radon_{i, c} = \alpha_{c} + \beta_{c}*\text{floor}_{i, c} + \epsilon_c\]

Note that we added the subindex \(c\) so we are estimating \(n\) different \(\alpha\)s and \(\beta\)s -- one for each county.

(figure: unpooled model)

This is the extreme opposite model; where above we assumed all counties are exactly the same, here we are saying that they share no similarities whatsoever. As we show below, this type of model can be very noisy when we have little data per county, as is the case in this data set. A pymc3 sketch of this unpooled model follows.
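
The corresponding unpooled sketch, again mirroring the syntax of the hierarchical model below (the shape argument gives one coefficient per county); this cell is illustrative and was not run above.

In [ ]:
# Sketch of the unpooled model: a separate intercept and slope for every county,
# with nothing tying the counties together. Priors are illustrative choices.
with pm.Model() as unpooled_model:
    a = pm.Normal('alpha', mu=0., sd=100**2, shape=n_counties)
    b = pm.Normal('beta', mu=0., sd=100**2, shape=n_counties)
    eps = pm.Uniform('eps', lower=0, upper=100)
    radon_est = a[county_idx] + b[county_idx]*data.floor.values
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)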

Partial pooling: Hierarchical Regression, aka the best of both worlds

Fortunately, there is a middle ground to both of these extremes. Specifically, we may assume that while \(\alpha\)s and \(\beta\)s are different for each county as in the unpooled case, the coefficients all share similarity. We can model this by assuming that each individual coefficient comes from a common group distribution:

\[\alpha_{c} \sim \mathcal{N}(\mu_{\alpha}, \sigma_{\alpha}^2)\] \[\beta_{c} \sim \mathcal{N}(\mu_{\beta}, \sigma_{\beta}^2)\]

We thus assume the intercepts \(\alpha\) and slopes \(\beta\) to come from a normal distribution centered around their respective group means \(\mu\) with a certain variance \(\sigma^2\), the values (or rather posteriors) of which we also estimate. That's why this is called multilevel, hierarchical, or partial-pooling modeling.

(figure: hierarchical model)

(Note that the above is not a complete Bayesian model specification, as we haven't defined priors or hyperpriors (i.e. priors for the group distribution, \(\mu\) and \(\sigma\)). These will be used in the model implementation below but would only distract here.)

Hierarchical Model

Instead of creating models separately, the hierarchical model creates group parameters that treat the counties not as completely different, but as having an underlying similarity. These group distributions are then used to influence the distribution of each county's \(\alpha\) (intercept) and \(\beta\) (slope).

In [43]:
# this suppresses some warnings for better viewing:
import warnings
warnings.simplefilter(action = "ignore", category = FutureWarning)
warnings.simplefilter(action = "ignore", category = RuntimeWarning)

with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_alpha', mu=0., sd=100**2)
    sigma_a = pm.Uniform('sigma_alpha', lower=0, upper=100)
    mu_b = pm.Normal('mu_beta', mu=0., sd=100**2)
    sigma_b = pm.Uniform('sigma_beta', lower=0, upper=100)
    
    # Intercept for each county, distributed around group mean mu_a
    # Above we just set mu and sd to a fixed value while here we
    # plug in a common group distribution for all a and b (which are
    # vectors of length n_counties).
    a = pm.Normal('alpha', mu=mu_a, sd=sigma_a, shape=n_counties)
    # Slope for each county, distributed around group mean mu_b
    b = pm.Normal('beta', mu=mu_b, sd=sigma_b, shape=n_counties)
    
    # s.d. of Model error
    eps = pm.Uniform('eps', lower=0, upper=100)
    
    # Model prediction of radon level
    # a[county_idx] translates to a[0, 0, 0, 1, 1, ...],
    # we thus link multiple household measures of a county
    # to its coefficients.
    radon_est = a[county_idx] + b[county_idx] * data.floor.values
    
    # Data likelihood
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)
In [44]:
# RUN the MCMC using the No U-turn Sampler (NUTS)
with hierarchical_model:
    step = pm.NUTS()
    hierarchical_trace = pm.sample(2000, step)
 [-----------------100%-----------------] 2000 of 2000 complete in 24.4 sec
In [45]:
# Plotting the hierarchical model trace -its found values- from 500 iterations onwards (right side plot) 
# and its accumulated marginal values (left side plot) 
pm.traceplot(hierarchical_trace[500:]);

The marginal posteriors in the left column are highly informative. mu_a tells us the group mean (log) radon levels. mu_b tells us that having no basement decreases radon levels significantly (no mass above zero). We can also see, by looking at the marginals for a, that there are quite a few differences in radon levels between counties (each 'rainbow' color corresponds to a single county); the different widths are related to how much confidence we have in each parameter estimate -- the more measurements per county, the higher our confidence will be.