Convergence Diagnostics
%matplotlib inline
import numpy as np
from scipy.stats import norm,uniform
import matplotlib.pyplot as plt
import seaborn as sbn
import pandas as pd
from __future__ import division
np.random.seed(282629734)
sbn.set_style('white')
sbn.set_context('talk')
The Convergence Problem¶
From the previous notebook, we know that using MH leads to a Markov Chain that we can use for inference. This is predicated on our chain converging to our stationary distribution (the posterior). Knowing when a chain has converged is a numerical issue and there are some important diagnostic tools we'll be using for assessing convergence.
For having some data to play with, let's resurrect our simple example yet again:
data = norm(10,3).rvs(10)
sigma = 3 # Note this is the std of x
mu_prior = 8
sigma_prior = 1.5 # Note this is our prior on the std of mu
plt.hist(data,bins=20)
plt.show()
# a fairly basic mh mcmc random walk sampler:
def sampler(data, samples=4, mu_init=9, sigma= 1, proposal_width=3,
mu_prior_mu=10, mu_prior_sd=1):
mu_current = mu_init
posterior = [mu_current]
for i in range(samples):
# suggest new position
mu_proposal = norm(mu_current, proposal_width).rvs()
# Compute likelihood by multiplying probabilities of each data point
likelihood_current = norm(mu_current, sigma).pdf(data).prod()
likelihood_proposal = norm(mu_proposal, sigma).pdf(data).prod()
# Compute prior probability of current and proposed mu
prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
p_current = likelihood_current * prior_current
p_proposal = likelihood_proposal * prior_proposal
# Accept proposal?
c = np.min((1,p_proposal / p_current))
accept = np.random.rand() <= c
if accept:
# Update position
mu_current = mu_proposal
posterior.append(mu_current)
return np.array(posterior)
chain = sampler(data,samples=5000,mu_init=9,sigma=sigma,proposal_width=2,
mu_prior_mu=mu_prior,mu_prior_sd = sigma_prior)
plt.figure(figsize=(15,6))
plt.subplot(121)
plt.plot(np.arange(chain.shape[0]),chain)
plt.title('Trace Plot for $\\mu$')
plt.subplot(122)
plt.hist(chain,orientation='horizontal',bins=30)
plt.title('Histogram for $\\mu$')
plt.tight_layout()
plt.show()
How do we know this chain has converged to the posterior?¶
Standard Error of the Mean¶
This investigates the question how does the mean of $\theta$ deviate in our chain, and is capturing the simulation error of the mean rather than underlying uncertainty of our parameter $\theta$:
$$ SE^{\bar{\theta}} = \frac{\text{Posterior Standard Deviation}}{\sqrt{L}} $$where $L$ is the chain length (the number of iterations in your chain). Note that the package we'll be using calls this MC Error
.
For our problem, this is:
print "Standard Error of the Mean: ", chain.std()/np.sqrt(chain.shape[0])
This is saying that very little of our posterior variation in $\theta$ is due to sampling error (that is good). We can visualize this by examining the moving average of our chain as we move through the 5000 iterations:
# pandas makes this easy:
df_chain = pd.DataFrame(chain,columns=['theta'])
df_chain['ma_theta'] = pd.stats.moments.rolling_mean(df_chain.theta,100)
df_chain['iteration'] = df_chain.index.values
ax = df_chain.plot(x=['iteration'],y=['theta','ma_theta'],figsize=(15,7))
ax.lines[1].set_linewidth(4)
ax.lines[1].set_color('k')
ax.lines[0].set_label("$\\theta$")
ax.lines[1].set_label("Moving Average of $\\theta$")
plt.legend()
plt.show()
This is a good sign that our chain is stable, since both the individual samples of $\theta$ in our chain and the mean of the samples dance around a stable value of $\theta$. The calculation above makes this more concrete. There are time series versions of this calculation that accounts for the fact that the chain is not iid.
Autocorrelation Plots¶
import pymc3 as pm3
lags=np.arange(1,100)
fig, ax = plt.subplots()
ax.plot(lags, [pm3.autocorr(chain, l) for l in lags])
_ = ax.set(xlabel='lag', ylabel='autocorrelation', ylim=(.1, 1))
plt.title('Autocorrelation Plot')
plt.show()
Acceptance Rate for the MH Algorithm¶
Recall that we want the acceptance rate to be in the range .2 to .4. For our problem this paper suggests an acceptance rate of .234 for random walk MH.
Since the number of new members in the chain represent the number of acceptances, count changes in chain values and divide by total chain length to calculate acceptance rate:
changes = np.sum((chain[1:]!=chain[:1]))
print "Acceptance Rate is: ", changes/(chain.shape[0]1)
The acceptance rate is helpful in describing convergence because it indicates a good level of "mixing" over the parameter space. Here the acceptance rate is too low, so we should decrease proposal width and rerun our MH MCMC sampler.
Note: modern software (like pymc) can autotune $\omega$ to achieve a desired acceptance rate.
Geweke Diagnostic¶
We can explicitly think of this test as a test for the Ergodicity (stationarity) of your chain.
Take the first 10 and last 50% of your chain and do a z test comparing means (correcting for autocorrelation). Software packages, take this a step further: The geweke
function in pymc3 by default chooses the first 10% of your chain, and the final 50%; divides the final 50% of the chain into 20 segments and performs a ztest for each segment. You want to fail to reject the null, since the hypothesis is:
for each segment $s$. If our means are the same (we fail to reject the null), then we have strong evidence of chain convergence.
gw_plot = pm3.geweke(chain)
plt.scatter(gw_plot[:,0],gw_plot[:,1])
plt.axhline(1.98, c='r')
plt.axhline(1.98, c='r')
plt.ylim(2.5,2.5)
plt.title('Geweke Plot Comparing first 10% and Slices of the Last 50% of Chain')
plt.xlim(10,2510)
plt.show()
Even without dropping any burnin observations, we have convergence. We only start seeing issues when we restrict ourselves to the first 5 values in the chain. Suggests we should drop the first few dozen observations for burnin.
gw_plot = pm3.geweke(chain,.001,.5,20)
plt.scatter(gw_plot[:,0],gw_plot[:,1])
plt.axhline(1.98, c='r')
plt.axhline(1.98, c='r')
plt.ylim(2.5,2.5)
plt.xlim(10,2510)
plt.title('Geweke Plot Comparing first .1% and Slices of the Last 50% of Chain')
plt.show()
Gelman Rubin Diagnostic¶
If our MH MCMC Chain reaches a stationary distribution, and we repeat the excercise multiple times, then we can examine if the posterior for each chain converges to the same place in the distribution of the parameter space.
Steps:
 Run $M>1$ Chains of length $2 \times N$.
 Discard the first $N$ draws of each chain, leaving $N$ iterations in the chain.

Calculate the within and between chain variance.
 Within chain variance: $$ W = \frac{1}{M}\sum_{j=1}^M s_j^2 $$ where $s_j^2$ is the variance of each chain (after throwing out the first $N$ draws).

Between chain variance: $$ B = \frac{N}{M1} \sum_{j=1}^M (\bar{\theta_j}  \bar{\bar{\theta}})^2 $$
where $\bar{\bar{\theta}}$ is the mean of each of the M means.
 Calculate the estimated variance of $\theta$ as the weighted sum of between and within chain variance. $$ \hat{var}(\theta) = \left ( 1  \frac{1}{N}\right ) W + \frac{1}{N}B $$
 Calculate the potential scale reduction factor. $$ \hat{R} = \sqrt{\frac{\hat{var}(\theta)}{W}} $$
We want this number to be close to 1. Why? This would indicate that the between chain variance is small. This makes sense, if between chain variance is small, that means both chains are mixing around the stationary distribution. Gelmen and Rubin show that when $\hat{R}$ is greater than 1.1 or 1.2, we need longer burnin.
Let's run 2 chains:
chain1 = sampler(data,samples=5000,mu_init=9,sigma=sigma,proposal_width=2,
mu_prior_mu=mu_prior,mu_prior_sd = sigma_prior)
chain2 = sampler(data,samples=5000,mu_init=9,sigma=sigma,proposal_width=2,
mu_prior_mu=mu_prior,mu_prior_sd = sigma_prior)
plt.figure(figsize=(15,5))
plt.plot(np.arange(5001),chain1)
plt.plot(np.arange(5001),chain2,alpha=.7)
# note, I can't use pm3.gelmen_rubin() because my chains are not pymc trace objects :(
burn_in = 10
length = 10
n = chain1[burn_in:burn_in+length].shape[0]
W = (chain1[burn_in:burn_in+length].std()**2 + chain2[burn_in:burn_in+length].std()**2)/2
mean1 = chain1[burn_in:burn_in+length].mean()
mean2 = chain2[burn_in:burn_in+length].mean()
mean = (mean1 + mean2)/2
B = n * ((mean1  mean)**2 + (mean2  mean)**2)
var_theta = (1  1/n) * W + 1/n*B
print "GelmenRubin Diagnostic: ", np.sqrt(var_theta/W)
It should be evident that GelmenRubin has two factors at play: burnin and chain length. Really this criteria can be used expost to justify decisions on burnin and chain length.
Raftery and Lewis Diagnostic¶
This is an older test and one not used too much anymore. However, it is the only one that will tell you apriori what your burnin and chain length needs to be. Note: this is not implemented in pymc3, but is available in pymc2. Suppose we want to estimate a posterior quantile of interest.
If we define an acceptable tolerance ($r$ [e.g. .005]) around a quantile ($q$ [e.g. 2.5% quantile]), and a probability of being within that tolerance ($s$ [e.g. .95]), then the Raftery and Lewis diagnostic will calculate the number of burnin and chain length required to satisfy these conditions based on a trial run of the chain.
To think about what these parameters are doing, note that we want choose burnin, chain length, and thinning so that: $$ prob(\theta < q \pm r) < s $$
import pymc as pm2
pm2.raftery_lewis(chain,.025, .005, s=.95, epsilon=.001, verbose=1)
Some discussion:
The theoretically derived values for chain length:
 3746 iterations required .... : theoretical minimum chain length if samples independent. Calculated as: $$ n_{min} = \left [ \Phi^{1}\left(\frac{s+1}{2}\right) \sqrt{\frac{q(1q)}{r}} \right] $$ where $\Phi^{1}$ is the inverse normal CDF.
 If independent we don't need to thin.
What we learned given the chain passed to the RafteryLewis Diagnostic:
 xx subsequent iterations to be discarded (this is the burnin)
 23244 this is the chain length we need
 Thinning factor: drop all but every xth sample to have independence (no autocorrelation). This allows us to rely on SLLN and not ergodic theorem.
 if we can rely on ergodic theorem (which most people do), then we only need a chain length of approximately 4600.
The two primary criticisms for this approach are
1) Geared toward singleparameter estimation problems
2) The recommendations will vary depending on the quantile of interest.
Univariate Approaches¶
The diagnostics we have discussed are all univariate (they work perfectly when there is only 1 parameter to estimate). Other diagnostics have been derived for the multivariate case, but these are useful only when using Gibbs Samplers or other specialized versions of MetropolisHastings.
So most people examine univariate diagnostics for each variable, examine autocorrelation plots, acceptance rates and try to argue chain convergence based on that unless they are using Gibbs or other specialized samplers.