Convergence Diagnostics

In [2]:
%matplotlib inline
import numpy as np
from scipy.stats import norm,uniform
import matplotlib.pyplot as plt
import seaborn as sbn
import pandas as pd
from __future__ import division



The Convergence Problem

From the previous notebook, we know that using MH leads to a Markov Chain that we can use for inference. This is predicated on our chain converging to our stationary distribution (the posterior). Knowing when a chain has converged is a numerical issue and there are some important diagnostic tools we'll be using for assessing convergence.

For having some data to play with, let's resurrect our simple example yet again:

In [3]:
data = norm(10,3).rvs(10)
sigma = 3          # Note this is the std of x
mu_prior = 8
sigma_prior = 1.5  # Note this is our prior on the std of mu

In [4]:
# a fairly basic mh mcmc random walk sampler:
def sampler(data, samples=4, mu_init=9, sigma= 1, proposal_width=3, 
            mu_prior_mu=10, mu_prior_sd=1):
    mu_current = mu_init
    posterior = [mu_current]
    for i in range(samples):
        # suggest new position
        mu_proposal = norm(mu_current, proposal_width).rvs()

        # Compute likelihood by multiplying probabilities of each data point
        likelihood_current = norm(mu_current, sigma).pdf(data).prod()
        likelihood_proposal = norm(mu_proposal, sigma).pdf(data).prod()
        # Compute prior probability of current and proposed mu        
        prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
        prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
        p_current = likelihood_current * prior_current
        p_proposal = likelihood_proposal * prior_proposal
        # Accept proposal?
        c = np.min((1,p_proposal / p_current))
        accept = np.random.rand() <= c
        if accept:
            # Update position
            mu_current = mu_proposal
    return np.array(posterior)
In [5]:
chain = sampler(data,samples=5000,mu_init=9,sigma=sigma,proposal_width=2,
                mu_prior_mu=mu_prior,mu_prior_sd = sigma_prior)
In [6]:
plt.title('Trace Plot for $\\mu$')

plt.title('Histogram for $\\mu$')


How do we know this chain has converged to the posterior?

Standard Error of the Mean

This investigates the question how does the mean of $\theta$ deviate in our chain, and is capturing the simulation error of the mean rather than underlying uncertainty of our parameter $\theta$:

$$ SE^{\bar{\theta}} = \frac{\text{Posterior Standard Deviation}}{\sqrt{L}} $$

where $L$ is the chain length (the number of iterations in your chain). Note that the package we'll be using calls this MC Error.

For our problem, this is:

In [7]:
print "Standard Error of the Mean: ", chain.std()/np.sqrt(chain.shape[0])
Standard Error of the Mean:  0.0116562412405

This is saying that very little of our posterior variation in $\theta$ is due to sampling error (that is good). We can visualize this by examining the moving average of our chain as we move through the 5000 iterations:

In [8]:
# pandas makes this easy:
df_chain = pd.DataFrame(chain,columns=['theta'])
df_chain['ma_theta'] = pd.stats.moments.rolling_mean(df_chain.theta,100)
df_chain['iteration'] = df_chain.index.values
ax = df_chain.plot(x=['iteration'],y=['theta','ma_theta'],figsize=(15,7))
ax.lines[-1].set_label("Moving Average of $\\theta$")

This is a good sign that our chain is stable, since both the individual samples of $\theta$ in our chain and the mean of the samples dance around a stable value of $\theta$. The calculation above makes this more concrete. There are time series versions of this calculation that accounts for the fact that the chain is not iid.

Autocorrelation Plots

In [9]:
import pymc3 as pm3


fig, ax = plt.subplots()
ax.plot(lags, [pm3.autocorr(chain, l) for l in lags])
_ = ax.set(xlabel='lag', ylabel='autocorrelation', ylim=(-.1, 1))
plt.title('Autocorrelation Plot')

Acceptance Rate for the MH Algorithm

Recall that we want the acceptance rate to be in the range .2 to .4. For our problem this paper suggests an acceptance rate of .234 for random walk MH.

Since the number of new members in the chain represent the number of acceptances, count changes in chain values and divide by total chain length to calculate acceptance rate:

In [10]:
changes = np.sum((chain[1:]!=chain[:-1]))
print "Acceptance Rate is: ", changes/(chain.shape[0]-1)
Acceptance Rate is:  0.4232

The acceptance rate is helpful in describing convergence because it indicates a good level of "mixing" over the parameter space. Here the acceptance rate is too low, so we should decrease proposal width and re-run our MH MCMC sampler.

Note: modern software (like pymc) can auto-tune $\omega$ to achieve a desired acceptance rate.

Geweke Diagnostic

We can explicitly think of this test as a test for the Ergodicity (stationarity) of your chain.

Take the first 10 and last 50% of your chain and do a z test comparing means (correcting for autocorrelation). Software packages, take this a step further: The geweke function in pymc3 by default chooses the first 10% of your chain, and the final 50%; divides the final 50% of the chain into 20 segments and performs a z-test for each segment. You want to fail to reject the null, since the hypothesis is:

$$ H_0: \theta_{10\%} = $\theta^s_{50\%} \\ H_1: \theta_{10\%} \ne $\theta^s_{50\%} $$

for each segment $s$. If our means are the same (we fail to reject the null), then we have strong evidence of chain convergence.

In [11]:
gw_plot = pm3.geweke(chain)
plt.axhline(-1.98, c='r')
plt.axhline(1.98, c='r')
plt.title('Geweke Plot Comparing first 10% and Slices of the Last 50% of Chain')

Even without dropping any burn-in observations, we have convergence. We only start seeing issues when we restrict ourselves to the first 5 values in the chain. Suggests we should drop the first few dozen observations for burn-in.

In [12]:
gw_plot = pm3.geweke(chain,.001,.5,20)
plt.axhline(-1.98, c='r')
plt.axhline(1.98, c='r')
plt.title('Geweke Plot Comparing first .1% and Slices of the Last 50% of Chain')

Gelman Rubin Diagnostic

If our MH MCMC Chain reaches a stationary distribution, and we repeat the excercise multiple times, then we can examine if the posterior for each chain converges to the same place in the distribution of the parameter space.


  1. Run $M>1$ Chains of length $2 \times N$.
  2. Discard the first $N$ draws of each chain, leaving $N$ iterations in the chain.
  3. Calculate the within and between chain variance.

    • Within chain variance: $$ W = \frac{1}{M}\sum_{j=1}^M s_j^2 $$ where $s_j^2$ is the variance of each chain (after throwing out the first $N$ draws).
    • Between chain variance: $$ B = \frac{N}{M-1} \sum_{j=1}^M (\bar{\theta_j} - \bar{\bar{\theta}})^2 $$

      where $\bar{\bar{\theta}}$ is the mean of each of the M means.

  4. Calculate the estimated variance of $\theta$ as the weighted sum of between and within chain variance. $$ \hat{var}(\theta) = \left ( 1 - \frac{1}{N}\right ) W + \frac{1}{N}B $$
  5. Calculate the potential scale reduction factor. $$ \hat{R} = \sqrt{\frac{\hat{var}(\theta)}{W}} $$

We want this number to be close to 1. Why? This would indicate that the between chain variance is small. This makes sense, if between chain variance is small, that means both chains are mixing around the stationary distribution. Gelmen and Rubin show that when $\hat{R}$ is greater than 1.1 or 1.2, we need longer burn-in.

Let's run 2 chains:

In [13]:
chain1 = sampler(data,samples=5000,mu_init=9,sigma=sigma,proposal_width=2,
                mu_prior_mu=mu_prior,mu_prior_sd = sigma_prior)
chain2 = sampler(data,samples=5000,mu_init=9,sigma=sigma,proposal_width=2,
                mu_prior_mu=mu_prior,mu_prior_sd = sigma_prior)
In [16]:
[<matplotlib.lines.Line2D at 0x7f2bd87617d0>]
In [17]:
# note, I can't use pm3.gelmen_rubin() because my chains are not pymc trace objects :-(
burn_in = 10
length = 10

n = chain1[burn_in:burn_in+length].shape[0]

W = (chain1[burn_in:burn_in+length].std()**2 + chain2[burn_in:burn_in+length].std()**2)/2
mean1 = chain1[burn_in:burn_in+length].mean()
mean2 = chain2[burn_in:burn_in+length].mean()
mean = (mean1 + mean2)/2
B = n * ((mean1 - mean)**2 + (mean2 - mean)**2)
var_theta = (1 - 1/n) * W + 1/n*B
print "Gelmen-Rubin Diagnostic: ", np.sqrt(var_theta/W)
Gelmen-Rubin Diagnostic:  1.16163636535

It should be evident that Gelmen-Rubin has two factors at play: burn-in and chain length. Really this criteria can be used ex-post to justify decisions on burn-in and chain length.

Raftery and Lewis Diagnostic

This is an older test and one not used too much anymore. However, it is the only one that will tell you apriori what your burn-in and chain length needs to be. Note: this is not implemented in pymc3, but is available in pymc2. Suppose we want to estimate a posterior quantile of interest.

If we define an acceptable tolerance ($r$ [e.g. .005]) around a quantile ($q$ [e.g. 2.5% quantile]), and a probability of being within that tolerance ($s$ [e.g. .95]), then the Raftery and Lewis diagnostic will calculate the number of burn-in and chain length required to satisfy these conditions based on a trial run of the chain.

To think about what these parameters are doing, note that we want choose burn-in, chain length, and thinning so that: $$ prob(\theta < q \pm r) < s $$

In [18]:
import pymc as pm2
pm2.raftery_lewis(chain,.025, .005, s=.95, epsilon=.001, verbose=1)
Raftery-Lewis Diagnostic

3746 iterations required (assuming independence) to achieve 0.005 accuracy with 95 percent probability.

Thinning factor of 1 required to produce a first-order Markov chain.

14 iterations to be discarded at the beginning of the simulation (burn-in).

15290 subsequent iterations required.

Thinning factor of 5 required to produce an independence chain.
(3746, 1, 14, 15290, 5)

Some discussion:

The theoretically derived values for chain length:

  • 3746 iterations required .... : theoretical minimum chain length if samples independent. Calculated as: $$ n_{min} = \left [ \Phi^{-1}\left(\frac{s+1}{2}\right) \sqrt{\frac{q(1-q)}{r}} \right] $$ where $\Phi^{-1}$ is the inverse normal CDF.
  • If independent we don't need to thin.

What we learned given the chain passed to the Raftery-Lewis Diagnostic:

  • xx subsequent iterations to be discarded (this is the burn-in)
  • 23244 this is the chain length we need
  • Thinning factor: drop all but every xth sample to have independence (no autocorrelation). This allows us to rely on SLLN and not ergodic theorem.
  • if we can rely on ergodic theorem (which most people do), then we only need a chain length of approximately 4600.

The two primary criticisms for this approach are
1) Geared toward single-parameter estimation problems
2) The recommendations will vary depending on the quantile of interest.

Univariate Approaches

The diagnostics we have discussed are all univariate (they work perfectly when there is only 1 parameter to estimate). Other diagnostics have been derived for the multivariate case, but these are useful only when using Gibbs Samplers or other specialized versions of Metropolis-Hastings.

So most people examine univariate diagnostics for each variable, examine autocorrelation plots, acceptance rates and try to argue chain convergence based on that- unless they are using Gibbs or other specialized samplers.