Simulation and Bayes Rule

In [3]:
%matplotlib inline
from __future__ import division
from IPython.display import YouTubeVideo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import norm, gamma, uniform
import pymc3 as pm3
from scipy import optimize

import warnings
warnings.filterwarnings('ignore')

sbn.set_style('white')
sbn.set_context('talk')

np.random.seed(12578)

Note: this workbook is not complete and is only being posted online for the take-home mid-term. This is very much subject to change.

Here, I'll briefly describe some of the step methods for producing a Markov chain Monte Carlo (MCMC) sample. We have already discussed the random walk Metropolis-Hastings sampler in detail. We will look at the following samplers in a bit more detail:

  1. Gibbs Samplers
  2. Metropolis Hastings Samplers
    a. Random Walk Metropolis Hastings
    b. Adaptive Metropolis Hastings (we won't cover)
    c. Independent Metropolis Hastings (we won't cover)
    d. Slice Samplers and Other variants (we won't cover)

The nitty-gritty details of the following two samplers are beyond the scope of the course, but I will try to give you a feel for how they improve on Metropolis-Hastings. They are known to be a major improvement over any of the samplers listed above, even for high-dimensional models (lots of coefficients to estimate):

  • Hamiltonian Monte Carlo Sampler (No U-turn Sampler)
  • Affine Invariant Sampler (EMCEE)

This is a very fast-moving topic: so fast, in fact, that the current (non-beta) version of pymc has neither NUTS nor the Affine Invariant Sampler.

Gibbs Sampler

The Gibbs Sampler is a special case of the Random-Walk Metropolis-Hastings algorithm, and one worth knowing about in a bit of detail: many tutorials and discussions of MH (especially older ones) are intertwined with discussions of Gibbs sampling, which can be confusing for the uninitiated. The idea is to successively sample each $\theta$ from its conditional distribution:

\begin{align} \theta^{t+1}_1 \sim & f_1(\theta^{t+1}_1 \mid \theta^{t}_2,\theta^{t}_3,\ldots, \theta^{t}_k,\mathbf{y},\mathbf{x}) \\ \theta^{t+1}_2 \sim & f_2(\theta^{t+1}_2 \mid \theta^{t+1}_1,\theta^{t}_3,\ldots, \theta^{t}_k,\mathbf{y},\mathbf{x}) \\ \theta^{t+1}_3 \sim & f_3(\theta^{t+1}_3 \mid \theta^{t+1}_1,\theta^{t+1}_2,\ldots, \theta^{t}_k,\mathbf{y},\mathbf{x})\\ &\vdots \\ \theta^{t+1}_k \sim & f_k(\theta^{t+1}_k \mid \theta^{t+1}_1,\theta^{t+1}_2,\theta^{t+1}_3\ldots, \theta^{t+1}_{k-1},\mathbf{y},\mathbf{x}) \end{align}

where each parameter $i$ has a conditional distribution function $f_i$.
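In code, one full Gibbs iteration is just a sweep over the parameters, where each update conditions on the most recent values of all the others. Here is a minimal schematic sketch (an illustration added here, not part of the original notebook; conditional_samplers is a hypothetical list of functions whose form depends on the model):

In [ ]:
# Schematic Gibbs sweep (illustration only).
# conditional_samplers is a hypothetical list of functions, one per parameter,
# each returning a draw from f_i(theta_i | all other thetas, data).
def gibbs_sweep(theta, data, conditional_samplers):
    theta = list(theta)  # work on a copy so the caller's values are untouched
    for i, draw_conditional in enumerate(conditional_samplers):
        # parameter i is updated using the *latest* values of all other parameters
        theta[i] = draw_conditional(theta, data)
    return theta

The two-parameter example below fills in these conditionals for a normal model.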

First consider this simple example, where our parameters to be estimated are $\theta = [\mu,\sigma]$ and where we have flat priors:

In [3]:
# data: construct toy data matrix
N = 1000
Y = norm(40,4).rvs(N)
plt.hist(Y);

When we employ a Gibbs Sampler, all samples are accepted (there are versions of the Gibbs Sampler that use a rejection-based approach, not covered here). This is accomplished by choosing problems where the conditional distributions

\begin{align} f_\mu(\mu | \sigma, \mathbf{Y}) \\ f_\sigma(\sigma | \mu, \mathbf{Y}) \end{align}

are known and have closed form solutions. For example, if we have

\begin{equation} f_\mu(\mu | \sigma, \mathbf{Y}) = \phi \left ( \frac{\mathbf{Y} - \mu}{\sigma} \right ) \end{equation}

that is, $\mathbf{Y}$ is distributed normally with mean $\mu$ and standard deviation $\sigma$, then it can be shown that \begin{equation} f_\sigma \left(\sigma^2 | \mu, \mathbf{Y}\right ) = InverseGamma(k,s) \end{equation}

where $k$ and $s$, the shape and scale parameters for the Inverse Gamma distribution, are \begin{align} k &= \frac{n}{2} \\ s &= \left (\sum_{i=1}^n \frac{(Y_i-\mu)^2}{2} \right )^{-1} \end{align}
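To see where this comes from, here is a brief sketch of the derivation, assuming the standard noninformative prior $p(\sigma^2) \propto 1/\sigma^2$ (equivalently, $p(\tau) \propto 1/\tau$ for the precision $\tau = 1/\sigma^2$), which matches the Gamma draw used in the code below:

\begin{align} f_\tau(\tau | \mu, \mathbf{Y}) \propto & \ \frac{1}{\tau} \prod_{i=1}^n \tau^{1/2} \exp \left ( -\frac{\tau (Y_i-\mu)^2}{2} \right ) \\ \propto & \ \tau^{\frac{n}{2}-1} \exp \left ( -\tau \sum_{i=1}^n \frac{(Y_i-\mu)^2}{2} \right ) \end{align}

which is the kernel of a $\Gamma \left ( \frac{n}{2}, \left ( \sum_{i=1}^n \frac{(Y_i-\mu)^2}{2} \right )^{-1} \right )$ distribution (shape, scale). Inverting the draw gives $\sigma^2$, which is exactly what step 2 of the algorithm below does.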

The Gibbs Sampler then proceeds as follows:

  1. Start the sampler at an arbitrary starting value for $\theta$. In my code below, I start the Gibbs sampler at $\mu_0 = 40$ (since $\sigma$ is drawn first, conditional on $\mu$, it needs no starting value).
  2. Draw the next (always accepted) value of $\sigma_t$ by:
    1. Draw a random variate from $\Gamma(k,s) = \Gamma \left ( \frac{n}{2},\left (\sum_{i=1}^n \frac{(Y_i-\mu_{t-1})^2}{2} \right )^{-1} \right )$ and denote it as $g$
    2. Define $var_t = \frac{1}{g}$
  3. Draw the next (always accepted) value of $\mu_t$ by:
    1. Draw a normal random variate having mean $\bar{Y}$ and standard deviation $\sqrt{\frac{var_t}{n}}$ and denote as $\mu_t$.
  4. Store the values $[\mu_t,\sqrt{var_t}]$ as the next sampled values for $\mu$ and $\sigma$.

Important things to note about this process:

  1. Gibbs sampling depends on our ability to analytically solve for the conditional distributions.
  2. These derivations require some work in mathematical statistics.
  3. Once we have these closed-form solutions, we can accept every sampled draw.
In [4]:
chain_len = 10000
burnin = 1000
In [5]:
# store chains here:
mu_chain_gibbs = np.zeros((chain_len + burnin, 1))
sigma_chain_gibbs = np.zeros(mu_chain_gibbs.shape)

# starting guess for mu (sigma needs no starting value: it is drawn first, conditional on mu)
mu_old = 40

for i in np.arange(chain_len + burnin):
    # draw the precision from its Gamma conditional, then invert to get the variance
    gamma_new = gamma(len(Y)/2, scale=1/np.sum(((Y - mu_old)**2)/2)).rvs()
    var_new = 1/gamma_new
    # draw mu from its Normal conditional, given the newly sampled variance
    mu_new = norm(np.mean(Y), np.sqrt(var_new/len(Y))).rvs(1)
    # store the draws
    mu_chain_gibbs[i] = mu_new
    sigma_chain_gibbs[i] = np.sqrt(var_new)
    # reset old value for the next iteration
    mu_old = mu_new
In [6]:
# convert to pandas dataframe
chain_gibbs = pd.DataFrame(np.append(mu_chain_gibbs,sigma_chain_gibbs,1),columns=['mean','std'])
# calculate summary statistics post burn-in
chain_gibbs.loc[burnin:].describe(percentiles=[.025,.5,.975])
Out[6]:
               mean           std
count  10000.000000  10000.000000
mean      39.805775      4.098301
std        0.129122      0.091833
min       39.305744      3.778357
2.5%      39.557243      3.927042
50%       39.805870      4.096343
97.5%     40.058110      4.286209
max       40.273004      4.507727
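As a quick sanity check (this cell is an addition, not part of the original results), the posterior means above should sit close to the sample mean and standard deviation of the simulated data, since the priors carry essentially no information:

In [ ]:
# compare the posterior summaries above to the sample statistics of Y
print("sample mean: %.3f" % np.mean(Y))         # posterior mean of 'mean' above is ~39.81
print("sample std:  %.3f" % np.std(Y, ddof=1))  # posterior mean of 'std' above is ~4.10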
In [7]:
plot_gibbs = chain_gibbs.loc[burnin:]
plot_gibbs.head()
sbn.jointplot(x='mean', y='std', data=plot_gibbs, kind="kde");
In [8]:
plt.figure(figsize=(12,5))
plt.plot(np.arange(chain_len),plot_gibbs['mean'])
plt.show()

The Gibbs Sampler was the predominant sampling method in the early days of applied Bayesian statistics because it is the empirical analogue of conjugate priors (the focus of Bayesian statistics before the computer age). For problems with solvable conditional distributions it has real advantages over random-walk MH: it accepts every sample and can be more efficient (less computing time to reach convergence and inference).

Despite this, both the Gibbs Sampler and the MH Sampler are less than ideal for:

  1. High-dimensional problems (many parameters)
  2. High correlation amongst parameters (see the sketch below)
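To get a feel for point 2, here is a small illustrative sketch (an addition, not from the original notebook): a random-walk Metropolis sampler targeting a two-dimensional normal with correlation 0.99, using a simple spherical proposal. The chain either rejects most moves or inches along the narrow ridge of the target, so successive draws are highly autocorrelated:

In [ ]:
# illustration only: random-walk Metropolis on a highly correlated 2-D normal target
rho = 0.99
cov = np.array([[1.0, rho], [rho, 1.0]])
cov_inv = np.linalg.inv(cov)

def log_target(theta):
    # log density of N(0, cov), up to an additive constant
    return -0.5 * np.dot(theta, np.dot(cov_inv, theta))

n_draws = 5000
draws = np.zeros((n_draws, 2))
theta_current = np.zeros(2)
accepted = 0

for t in range(n_draws):
    proposal = theta_current + 0.5 * np.random.randn(2)  # spherical random-walk proposal
    if np.log(np.random.rand()) < log_target(proposal) - log_target(theta_current):
        theta_current = proposal
        accepted += 1
    draws[t] = theta_current

print("acceptance rate: %.2f" % (accepted / n_draws))
# lag-1 autocorrelation of the first coordinate: near 1 means very slow mixing
print("lag-1 autocorrelation: %.2f" % np.corrcoef(draws[:-1, 0], draws[1:, 0])[0, 1])

Gibbs sampling struggles in the same situation: each conditional update can only move a short distance along the ridge, so the chain still mixes slowly.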

Bayesian OLS Regression applied to Tobias and Koop

We now turn our attention back to MH Random Walk sampling using a real dataset.

In [9]:
tobias_koop=pd.read_csv('http://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')
tobias_koop['ones'] = 1
tobias_koop.head()
Out[9]:
id educ ln_wage pexp time ability meduc feduc broken_home siblings pexp2 ones
0 4 12 2.14 2 4 0.26 12 10 1 4 4 1
1 6 15 1.91 4 4 0.44 12 16 0 2 16 1
2 8 13 2.32 8 4 0.51 12 15 1 2 64 1
3 11 14 1.64 1 4 1.82 16 17 1 2 1 1
4 12 13 2.16 6 4 -1.30 13 12 0 5 36 1
In [10]:
# sanity check on data:
from statsmodels.regression.linear_model import OLS
ols_res = OLS(tobias_koop.ln_wage,tobias_koop[['educ','pexp','pexp2','broken_home','ones']]).fit()
ols_res.summary()
Out[10]:
OLS Regression Results
Dep. Variable: ln_wage R-squared: 0.166
Model: OLS Adj. R-squared: 0.163
Method: Least Squares F-statistic: 51.36
Date: Wed, 02 Mar 2016 Prob (F-statistic): 1.83e-39
Time: 08:35:28 Log-Likelihood: -583.66
No. Observations: 1034 AIC: 1177.
Df Residuals: 1029 BIC: 1202.
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
educ 0.0853 0.009 9.179 0.000 0.067 0.104
pexp 0.2035 0.024 8.629 0.000 0.157 0.250
pexp2 -0.0124 0.002 -5.438 0.000 -0.017 -0.008
broken_home -0.0087 0.036 -0.244 0.807 -0.079 0.061
ones 0.4603 0.137 3.353 0.001 0.191 0.730
Omnibus: 55.892 Durbin-Watson: 1.761
Prob(Omnibus): 0.000 Jarque-Bera (JB): 112.050
Skew: -0.355 Prob(JB): 4.66e-25
Kurtosis: 4.448 Cond. No. 391.

This is the model statement describing the priors and the likelihood. Here, the five beta variables and sigma are the stochastic variables (we want a chain of sampled values for each of them), and we provide a prior distribution and hyper-parameters for each one. The likelihood function is chosen to be Normal, having mean mu (which is defined as $\mathbf{x}\beta$) and standard deviation $\sigma$ (denoted sigma). Our "dependent variable" is given by observed=tobias_koop.ln_wage, where tobias_koop.ln_wage is the log of wage from the Tobias and Koop dataset loaded above. In more formal terms, the code below sets up ols_model having the following form:

\begin{align} Prob(\beta,\sigma|\mathbf{data}) \propto& Prob(\mathbf{data}|\beta,\sigma) \times Prob(\sigma | \mu^0_\sigma, \sigma^0_\sigma) \times Prob(\text{prior on }\beta) \\ Prob(\beta,\sigma|\mathbf{data}) \propto& \phi \left( \frac{\mathbf{ln\_wage} - \mathbf{x}\beta}{\sigma} \right ) \times HalfNormal \left(\frac{\sigma - \mu^0_\sigma}{\sigma^0_\sigma} \right ) \times C \\ Prob(\beta,\sigma|\mathbf{data}) \propto& \phi \left( \frac{\mathbf{ln\_wage} - \mathbf{x}\beta}{\sigma} \right ) \times HalfNormal \left(\frac{\sigma - 0}{10} \right ) \end{align}

where $\phi(.)$ is the standard normal pdf. The constant C (the joint probability of the prior on our $\beta$'s) is invariant with changes in our chain, and can therefore be dropped.

In [11]:
ols_model = pm3.Model()

with ols_model:

    # Priors for unknown model parameters
    beta_0 = pm3.Flat('Intercept')
    beta_1 = pm3.Flat('Educ')
    beta_2 = pm3.Flat('Pexp')
    beta_3 = pm3.Flat('Pexp2')
    beta_4 = pm3.Flat('Broken Home')
    sigma = pm3.HalfNormal('sigma', sd=10)
    
    mu = beta_0 + beta_1*tobias_koop.educ + beta_2*tobias_koop.pexp + \
         beta_3*tobias_koop.pexp2 + beta_4*tobias_koop.broken_home
    
    # Likelihood (sampling distribution) of observations
    ln_wage = pm3.Normal('ln_wage', mu=mu, sd=sigma, observed=tobias_koop.ln_wage)
Applied log-transform to sigma and added transformed sigma_log to model.

Metropolis-Hastings Random Walk

In [12]:
chain_length = 10000 

with ols_model:
    
    start = pm3.find_MAP()
    print(start)
    
    # instantiate sampler
    step = pm3.Metropolis() 

    # draw posterior samples
    trace_mh = pm3.sample(chain_length, step=step, start=start, njobs=3, progressbar=True) 
{'Pexp': array(0.20352137944507617), 'Broken Home': array(-0.008725378330397111), 'Intercept': array(0.46033253793539525), 'sigma_log': array(-0.853984713051837), 'Pexp2': array(-0.012412592947049154), 'Educ': array(0.0852725235854634)}
 [-----------------100%-----------------] 10000 of 10000 complete in 82.2 sec
In [63]:
pm3.traceplot(trace_mh[2000:]);
In [13]:
pm3.summary(trace_mh[2000:])
Intercept:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.459            0.141            0.009            [0.186, 0.727]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.194          0.362          0.457          0.555          0.738


Educ:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.086            0.010            0.001            [0.067, 0.104]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.067          0.079          0.086          0.093          0.104


Pexp:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.200            0.024            0.001            [0.152, 0.246]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.153          0.184          0.200          0.216          0.247


Pexp2:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  -0.012           0.002            0.000            [-0.017, -0.008]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.017         -0.014         -0.012         -0.011         -0.008


Broken Home:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  -0.008           0.036            0.000            [-0.077, 0.063]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.078         -0.032         -0.008         0.015          0.061


sigma_log:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  -0.851           0.022            0.000            [-0.894, -0.808]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.893         -0.866         -0.851         -0.836         -0.808


sigma:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.427            0.009            0.000            [0.408, 0.445]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.409          0.421          0.427          0.433          0.446

In [14]:
pm3.autocorrplot(trace_mh[2000:]);
In [15]:
pm3.gelman_rubin(trace_mh[2000:])
Out[15]:
{'Broken Home': 1.0001675677611783,
 'Educ': 1.0097027663341329,
 'Intercept': 1.0076065498910578,
 'Pexp': 1.0014020294024144,
 'Pexp2': 1.0018657646558828,
 'sigma': 0.99998250303249925,
 'sigma_log': 0.99998329981936351}
In [16]:
scores = pm3.geweke(trace_mh[2000:])
In [17]:
# create a Geweke plot for each variable:
varnames = list(scores[0].keys())
varnames.remove('sigma_log')  # sigma_log is a nuisance variable, ignore
chainids = list(scores.keys())
colors_for_chains = ['k','g','b']

plt.figure(figsize=(10,15))
plt.subplots_adjust(top=0.85)
plt.suptitle('Geweke Plot',fontsize=25)

plotnum = 1
for i in varnames:
    plt.subplot(len(varnames),1,plotnum)
    plt.title(i)
    for j in chainids:
        # x = segment start index, y = Geweke z-score for chain j, variable i
        plt.scatter(scores[j][i][:,0],scores[j][i][:,1],s=50,c=colors_for_chains[j])

    plt.axhline(-1.98, c='r')
    plt.axhline(1.98, c='r')
    plt.ylim(-2.5,2.5)
    plt.xlim(-50,4050)
    plotnum+=1

plt.tight_layout(rect=[0, 0.03, 1, 0.95])

I like visualizing my distributions using the Seaborn PairGrid function. To do that, convert your chains into a pandas dataframe (caution: the next two code cells take a long time to run):

In [18]:
data = np.column_stack((trace_mh['Intercept'],trace_mh['Educ'],
             trace_mh['Pexp'],trace_mh['Pexp2'],
             trace_mh['Broken Home'],trace_mh['sigma']))
pairplot_dat = pd.DataFrame(data,columns=['Intercept','Educ','Pexp','Pexp2','Broken Home','sigma'])
In [19]:
g = sbn.PairGrid(pairplot_dat.loc[1000:10000]) # chain #1 post burn-in
g.map_diag(sbn.kdeplot)
g.map_offdiag(sbn.kdeplot, cmap="Blues_d", n_levels=6);