Simulation and Bayes Rule
%matplotlib inline
from IPython.display import YouTubeVideo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import norm, gamma, uniform
import pymc3 as pm3
from scipy import optimize
import corner as corner
import emcee as emcee
import warnings
warnings.filterwarnings('ignore')
sbn.set_style('white')
sbn.set_context('talk')
np.random.seed(12578)
Here, I'll briefly describe some of the step methods for producing a Markov chain Monte Carlo (MCMC) sample. We have already discussed the random walk Metropolis-Hastings sampler in detail. We will discuss the following samplers in a bit more detail:
- Gibbs Samplers
- Metropolis Hastings Samplers
a. Random Walk Metropolis Hastings
b. Adaptive Metropolis Hastings (we won't cover)
c. Independent Metropolis Hastings (we won't cover)
d. Slice Samplers and Other variants (we won't cover)
The nitty-gritty details of the following two samplers are beyond the scope of the course, but I will try to give you a feel for how they improve on Metropolis-Hastings. They are known to be a major improvement over any of the samplers listed above, even for high-dimensional models (lots of coefficients to estimate):
- Hamiltonian Monte Carlo Sampler (No U-turn Sampler)
- Affine Invariant Sampler (EMCEE)
This is a very fast-moving topic, so fast, in fact, that the current (non-beta) version of pymc doesn't include either NUTS or the Affine Invariant Sampler.
Gibbs Sampler
The Gibbs Sampler is a special case of the Random-Walk Metropolis-Hastings algorithm, and one worth knowing about in a bit of detail, because many tutorials and discussions about MH (especially older ones) are intertwined with discussions of Gibbs sampling, which can be confusing for the uninitiated. The idea is to successively sample each $\theta$ from its conditional distribution:
\begin{align} \theta^{t+1}_1 \sim & f_1(\theta^{t+1}_1 \mid \theta^{t}_2,\theta^{t}_3,\ldots, \theta^{t}_k,\mathbf{y},\mathbf{x}) \\ \theta^{t+1}_2 \sim & f_2(\theta^{t+1}_2 \mid \theta^{t+1}_1,\theta^{t}_3,\ldots, \theta^{t}_k,\mathbf{y},\mathbf{x}) \\ \theta^{t+1}_3 \sim & f_3(\theta^{t+1}_3 \mid \theta^{t+1}_1,\theta^{t+1}_2,\ldots, \theta^{t}_k,\mathbf{y},\mathbf{x})\\ &\vdots \\ \theta^{t+1}_k \sim & f_k(\theta^{t+1}_k \mid \theta^{t+1}_1,\theta^{t+1}_2,\theta^{t+1}_3\ldots, \theta^{t+1}_{k-1},\mathbf{y},\mathbf{x}) \end{align}where each parameter $i$ has a conditional distribution function $f_i$.
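Before turning to a concrete example, here is a minimal generic sketch of a single Gibbs sweep. This is an illustrative sketch only: the gibbs_sweep function and the conditional_samplers list are hypothetical stand-ins for the problem-specific conditional draws $f_1,\ldots,f_k$.
# illustrative sketch of one Gibbs sweep:
# conditional_samplers[j] is a hypothetical function that draws theta_j from
# f_j(theta_j | all other thetas, y, x), with the data closed over inside it
def gibbs_sweep(theta, conditional_samplers):
    theta = theta.copy()
    for j, draw_conditional in enumerate(conditional_samplers):
        # update theta_j in place so later draws condition on the new value
        theta[j] = draw_conditional(theta)
    return theta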
First consider this simple example, where the parameters to be estimated are $\theta = [\mu,\sigma]$ and where we have flat priors:
# data: construct a toy data vector of N draws from a Normal(40,4)
N = 1000
Y = norm(40,4).rvs(N)
plt.hist(Y);
When we employ a Gibbs Sampler, all samples are accepted (there are versions of the Gibbs Sampler that use a rejection-based approach, not covered here). This is accomplished by choosing problems where the conditional distributions
\begin{align} f_\mu(\mu | \sigma, \mathbf{Y}) \\ f_\sigma(\sigma | \mu, \mathbf{Y}) \end{align}are known and have closed form solutions. For example, if we have
\begin{equation} f_\mu(\mu | \sigma, \mathbf{Y}) = \phi \left ( \frac{\mathbf{Y} - \mu}{\sigma} \right ) \end{equation}that is, Y is distributed normal with mean $\mu$ and standard deviation $\sigma$, then it can be shown that \begin{equation} f_\sigma \left(\sigma^2 | \mu, \mathbf{Y}\right ) = InverseGamma(k,s) \end{equation}
where $k$ and $s$, the shape and scale parameters of the Inverse Gamma distribution, are \begin{align} k &= \frac{n}{2} \\ s &= \left ( \sum_{i=1}^n \frac{(Y_i-\mu)^2}{2} \right )^{-1} \end{align}
The Gibbs Sampler then proceeds as follows:
- Start the sampler at an arbitrary starting value for $\theta$. In my code below, I start the Gibbs sampler at $\mu_0 = 40$ (no starting value for $\sigma$ is needed, since the first step draws $\sigma$ conditional on $\mu$).
- Draw the next (always accepted) value of $\sigma_t$ by:
- Draw a random variate from $\Gamma(k,s) = \Gamma \left ( \frac{n}{2},\left ( \sum_{i=1}^n \frac{(Y_i-\mu_{t-1})^2}{2} \right )^{-1} \right )$ and denote it as $g$
- Define $var_t = \frac{1}{g}$
- Draw the next (always accepted) value of $\mu_t$ by:
- Draw a normal random variate having mean $\bar{Y}$ and standard deviation $\sqrt{\frac{var_t}{n}}$ and denote as $\mu_t$.
- Store the values $[\mu_t,\sqrt{var_t}]$ as the next sampled values for $\mu$ and $\sigma$.
Important things to note about this process:
- Gibbs sampling depends on our ability to analytically solve for conditional distributions.
- This involves derivations in mathematical statistics
- Once we have these closed form solutions, we can accept every sampled draw
chain_len = 10000
burnin = 1000
# store chains here:
mu_chain_gibbs = np.zeros((chain_len+burnin,1))
sigma_chain_gibbs = np.zeros(mu_chain_gibbs.shape)

# starting guess for mu
mu_old = 40

for i in np.arange(chain_len+burnin):
    # draw the precision conditional on mu, then invert to get the variance
    gamma_new = gamma(len(Y)/2, scale=1/np.sum(((Y-mu_old)**2)/2)).rvs()
    var_new = 1/gamma_new
    # draw mu conditional on the new variance
    mu_new = norm(np.mean(Y), np.sqrt(var_new/len(Y))).rvs(1)
    mu_chain_gibbs[i] = mu_new
    sigma_chain_gibbs[i] = np.sqrt(var_new)
    # reset old values
    mu_old, var_old = mu_new, var_new
# convert to pandas dataframe
chain_gibbs = pd.DataFrame(np.append(mu_chain_gibbs,sigma_chain_gibbs,1),columns=['mean','std'])
# calculate summary statistics post burn-in
chain_gibbs.loc[burnin:].describe(percentiles=[.025,.5,.975])
plot_gibbs = chain_gibbs.loc[burnin:]
plot_gibbs.head()
sbn.jointplot(x='mean', y='std', data=plot_gibbs, kind="kde");
plt.figure(figsize=(12,5))
plt.plot(np.arange(chain_len),plot_gibbs['mean'])
plt.show()
The Gibbs Sampler was the predominant sampling method in the early days of applied Bayesian statistics because it is the empirical analogue of conjugate priors (the focus of Bayesian statistics before the computer age). It also has real advantages over random-walk MH for problems with solvable conditional distributions: every sample is accepted, so it can be more efficient (less computing time to reach convergence and inference).
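As a rough, informal check of that efficiency claim, we can look at the lag-1 autocorrelation of the post burn-in Gibbs draws; values near zero indicate that successive draws are nearly independent.
# lag-1 autocorrelation of the post burn-in Gibbs draws for each parameter
print(plot_gibbs['mean'].autocorr(lag=1))
print(plot_gibbs['std'].autocorr(lag=1))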
Despite this, both the Gibbs Sampler and the MH Sampler are less than ideal for
- High dimensional problems (many parameters)
- High correlation amongst parameters
Bayesian OLS Regression applied to Tobias and Koop
We now turn our attention back to MH Random Walk sampling using a real dataset.
tobias_koop=pd.read_csv('https://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')
tobias_koop['ones'] = 1
tobias_koop.head()
# sanity check on data:
from statsmodels.regression.linear_model import OLS
ols_res = OLS(tobias_koop.ln_wage,tobias_koop[['educ','pexp','pexp2','broken_home','ones']]).fit()
ols_res.summary()
This is the model statement describing priors and the likelihood. Here, we have an intercept and 4 slope coefficients (beta_0 through beta_4) plus sigma; each defines a stochastic variable (we want a chain of sampled values for all of these variables), and we provide a prior distribution and hyper-parameters for each. The likelihood function is chosen to be Normal, having mean mu (which is defined as $\mathbf{x}\beta$) and standard deviation $\sigma$ (denoted as sigma). Our "dependent variable" is given by observed=tobias_koop.ln_wage, where tobias_koop.ln_wage is the log of wage from the Tobias and Koop dataset loaded above. In more formal terms, the code below sets up ols_model having the following form:
\begin{equation} Prob(\beta,\sigma \mid \mathbf{y},\mathbf{x}) \propto \prod_{i=1}^N \phi \left ( \frac{y_i - \mathbf{x}_i\beta}{\sigma} \right ) \times Prob(\sigma) \times C \end{equation}
where $\phi(.)$ is the standard normal pdf. The constant $C$ (the joint probability of the prior on our $\beta$'s) is invariant with changes in our chain, and can therefore be dropped.
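To connect this formula to code before handing things over to pymc3, here is an illustrative sketch of how the unnormalized log posterior could be evaluated directly with scipy. The function name log_posterior and its arguments are hypothetical, and the HalfNormal prior on $\sigma$ (with sd=10) is chosen to match the model statement below.
from scipy.stats import halfnorm

# illustrative sketch: unnormalized log posterior for the regression model
def log_posterior(beta, sigma, y, X):
    # log-likelihood: each y_i ~ Normal(x_i beta, sigma)
    loglike = norm(X @ beta, sigma).logpdf(y).sum()
    # log prior on sigma (HalfNormal with sd=10); the flat priors on beta
    # contribute only the constant C, which drops out
    logprior = halfnorm(scale=10).logpdf(sigma)
    return loglike + logprior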
ols_model = pm3.Model()

with ols_model:
    # Priors for unknown model parameters
    beta_0 = pm3.Flat('Intercept')
    beta_1 = pm3.Flat('Educ')
    beta_2 = pm3.Flat('Pexp')
    beta_3 = pm3.Flat('Pexp2')
    beta_4 = pm3.Flat('Broken Home')
    sigma = pm3.HalfNormal('sigma', sd=10)

    mu = beta_0 + beta_1*tobias_koop.educ + beta_2*tobias_koop.pexp + \
         beta_3*tobias_koop.pexp2 + beta_4*tobias_koop.broken_home

    # Likelihood (sampling distribution) of observations
    ln_wage = pm3.Normal('ln_wage', mu=mu, sd=sigma, observed=tobias_koop.ln_wage)
Metropolis-Hastings Random Walk
chain_length = 20000

with ols_model:
    start = pm3.find_MAP()
    print(start)
    # instantiate sampler
    step = pm3.Metropolis()
    # draw posterior samples
    trace_mh = pm3.sample(chain_length, step=step, start=start, njobs=3, progressbar=True)
pm3.traceplot(trace_mh[10000:]);
pm3.summary(trace_mh[10000:])
pm3.autocorrplot(trace_mh[2000:]);
pm3.gelman_rubin(trace_mh[10000:])
scores = pm3.geweke(trace_mh[10000:])
# create plot for each variable:
varnames = list(scores[0])
varnames.remove('sigma_log__') #sigma_log__ is nuisance variable, ignore
chainids = scores.keys()
colors_for_chains = ['k','g','b']
plt.figure(figsize=(10,15))
plt.subplots_adjust(top=0.85)
plt.suptitle('Geweke Plot',fontsize=25)
plotnum = 1
for i in varnames:
    plt.subplot(len(varnames),1,plotnum)
    plt.title(i)
    for j in chainids:
        plt.scatter(scores[j][i][:,0],scores[j][i][:,1],s=50,c=colors_for_chains[j])
    plt.axhline(-1.98, c='r')
    plt.axhline(1.98, c='r')
    plt.ylim(-2.5,2.5)
    plt.xlim(-50,4050)
    plotnum += 1
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
I like visualizing my distributions using the corner plot function. To do that, convert your chains into a pandas dataframe:
data = np.column_stack((trace_mh[10000:]['Intercept'], trace_mh[10000:]['Educ'],
                        trace_mh[10000:]['Pexp'], trace_mh[10000:]['Pexp2'],
                        trace_mh[10000:]['Broken Home'], trace_mh[10000:]['sigma']))
pairplot_dat = pd.DataFrame(data, columns=['Intercept','Educ','Pexp','Pexp2','Broken Home','sigma'])
fig = corner.corner(pairplot_dat,
                    labels=["Int", "$educ$", "$pexp$", "$pexp2$", r"$broken\_home$", "sigma"],
                    truths=[.4603, .0853, .2035, -.0124, -.0087, np.sqrt(ols_res.mse_resid)])