Introduction to PyMC3
%matplotlib inline
import re
import pandas as pd
import numpy as np
import seaborn as sbn
from scipy.stats import norm
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
sbn.set_style('white')
sbn.set_context('talk')
np.random.seed(12345678)
Note: these are my personal opinions and I'm sure that lots of people would disagree with my assessment of available software packages.
We have been working on getting under the hood of Metropolis-Hastings and Markov Chain Monte Carlo. But there are also some very accessible MCMC software packages out there. These are my personal views on what is available:
- Stata (bayesmh): only available starting with Stata 14. To be honest, I have never heard of anyone using Stata for Bayesian sampling methods, although I haven't really looked.
- Matlab: there are numerous user-written and Matlab-supplied routines. For example, there is a version of emcee implemented there (more on this later in the course). Matlab is for people who want to tweak their own sampler code and who need the fastest possible computation.
- SAS: a decent set of packages, not used very much in the social sciences (maybe in the medical sciences?), but well suited for big data.
- R: for a long time R was the de-facto standard and probably still is, with packages like RStan, which interfaces with the Stan C++ library. JAGS, Rbugs, and OpenBUGS have also been around for a long time. I almost elected to teach this course in R.
- Python: the libraries aren't as rich as R's, but you might view Python as sitting between Matlab and R. It is also (in my view) probably easier to code your own samplers in Python than in R, as I really prefer Python syntax. Python is used a lot in many fields, including physics, and is stronger in the big-data arena than any of the other packages mentioned above. The major players in Python are:
  - PyMC and PyMC3 (in beta)
  - PyStan
  - emcee
Today, we are going to focus on PyMC3, which is a very easy-to-use package now that we have a solid understanding of how posteriors are constructed.
We will start with our very simple one-parameter model and then move to slightly more complicated settings:
sigma = 3 # Note this is the std of our data
data = norm(10,sigma).rvs(100)
mu_prior = 8
sigma_prior = 1.5 # Note this is the std of our prior on mu
plt.hist(data,bins=20)
plt.show()
PyMC3¶
Let's use PyMC3 to model this:
import pymc3 as pm3
This is the model statement describing the priors and the likelihood. Here, mu is defined as a stochastic variable (we want a chain of sampled values for this variable) and we provide a prior distribution and hyper-parameters for it. The likelihood function is chosen to be Normal, with one parameter to be estimated (mu), and we use the known $\sigma$ (denoted as sigma). Our "dependent variable" is given by observed=data, where data is generated above and shown in the histogram. In more formal terms, the code below sets up a basic_model having the following form:
$$\text{Prob}(\mu \mid y_1,\ldots,y_{100}) \;\propto\; \left[\prod_{i=1}^{100} \phi\!\left(\frac{y_i-\mu}{\sigma}\right)\right] \times \phi\!\left(\frac{\mu-\mu_{prior}}{\sigma_{prior}}\right)$$
where $\phi(.)$ is the standard normal pdf and the $y_i$ are the observed data (the proportionality absorbs constants that do not involve $\mu$, since $\sigma$ is known).
basic_model = pm3.Model()
with basic_model:
    # Priors for unknown model parameters
    mu = pm3.Normal('Mean of Data', mu_prior, sigma_prior)
    # Likelihood (sampling distribution) of observations
    data_in = pm3.Normal('Y_obs', mu=mu, sd=sigma, observed=data)
This defines how the chain will be constructed. We will set the starting values at the MAP estimate and use a Metropolis step method (which is the Random Walk MH we have studied).
chain_length = 10000
with basic_model:
    # obtain starting values via MAP
    startvals = pm3.find_MAP(model=basic_model)
    print(startvals)
    # instantiate sampler
    step = pm3.Metropolis()
    # draw 10,000 posterior samples
    trace = pm3.sample(chain_length, step=step, start=startvals)
pm3.traceplot(trace,figsize=(20,5));
pm3.summary(trace)
Understanding the PyMC3 Results Object¶
All the results are contained in the trace variable. This is a PyMC3 results object; it contains some information that we might want to extract at times. The varnames attribute tells us all the variable names set up in our model.
trace.varnames
With the variable names, we can extract chain values for each variable:
trace['Mean of Data']
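Since the extracted values are just a numpy array, posterior summaries can be computed directly from the chain. A quick sketch (the numbers will depend on the seed and chain length):
chain = trace['Mean of Data']
print("Posterior mean:", chain.mean())
print("Posterior std:", chain.std())
print("95% credible interval:", np.percentile(chain, [2.5, 97.5]))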
pm3.plots.autocorrplot(trace,figsize=(17,5));
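The autocorrelation plot above summarizes how strongly successive draws are related. As a rough cross-check (not part of PyMC3), the lag-k autocorrelation can be computed directly from the extracted chain with numpy:
chain = trace['Mean of Data']
for lag in [1, 5, 10, 50]:
    # correlation between the chain and a lagged copy of itself
    rho = np.corrcoef(chain[:-lag], chain[lag:])[0, 1]
    print("lag {:3d} autocorrelation: {:.3f}".format(lag, rho))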
Acceptance Rate¶
# Because Metropolis repeats the previous value whenever a proposal is rejected,
# we can estimate the acceptance rate by counting how often consecutive draws differ.
# For Random Walk Metropolis, rates of roughly 0.2 to 0.5 are typically considered healthy.
accept = np.sum(trace['Mean of Data'][1:] != trace['Mean of Data'][:-1])
print("Acceptance Rate: ", accept/trace['Mean of Data'].shape[0])
Geweke Score¶
# Geweke diagnostic: compare means of early slices of the chain with the mean
# of the last 50%; z-scores within roughly +/- 2 suggest convergence.
score=pm3.geweke(trace, first=0.1, last=0.5, intervals=20)
score
score=pm3.geweke(trace, first=0.1, last=0.5, intervals=20)
plt.scatter(score[0]['Mean of Data'][:,0],score[0]['Mean of Data'][:,1], marker = 'o', s=100)
plt.axhline(-1.98, c='r')
plt.axhline(1.98, c='r')
plt.ylim(-2.5,2.5)
plt.xlim(0-10,.5*trace['Mean of Data'].shape[0]/2+10)
plt.title('Geweke Plot Comparing first 10% and Slices of the Last 50% of Chain\nDifference in Mean Z score')
plt.show()
Gelman-Rubin¶
For Gelman-Rubin, we need multiple chains in order to test whether each of the chains collapses to the limiting distribution, which is our posterior. PyMC3 makes it very easy to generate these.
chain_length = 100000
with basic_model:
    # obtain starting values via MAP
    startvals = pm3.find_MAP(model=basic_model)
    print(startvals)
    # instantiate sampler
    step = pm3.Metropolis()
    # draw posterior samples from two chains run in parallel
    trace = pm3.sample(chain_length, step=step, start=startvals, njobs=2)
pm3.traceplot(trace,figsize=(20,5));
pm3.gelman_rubin(trace)
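As a check on what pm3.gelman_rubin reports, the statistic can also be computed by hand from the two chains. A minimal sketch, assuming trace.get_values(..., combine=False) returns one array per chain:
chains = trace.get_values('Mean of Data', combine=False)  # one array per chain
n = min(len(c) for c in chains)                 # common chain length
draws = np.array([c[:n] for c in chains])

W = draws.var(axis=1, ddof=1).mean()            # average within-chain variance
B = n * draws.mean(axis=1).var(ddof=1)          # between-chain variance
V_hat = (n - 1) / n * W + B / n                 # pooled variance estimate
print(np.sqrt(V_hat / W))                       # R-hat: values near 1 indicate convergence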
Based on Gelman-Rubin and Geweke, we can be confident we have a chain that has converged to the limiting distribution (although perhaps for Gelman-Rubin, we'd want to try different starting values rather than MAP).
In-class exercise¶
Team up and modify the code below to estimate sigma as well as the mean:
basic_model = pm3.Model()
with basic_model:
    # Priors for unknown model parameters
    mu = pm3.Normal('Mean of Data', mu_prior, sigma_prior)
    # Likelihood (sampling distribution) of observations
    data_in = pm3.Normal('Y_obs', mu=mu, sd=sigma, observed=data)
chain_length = 10000
with basic_model:
    # obtain starting values via MAP
    startvals = pm3.find_MAP(model=basic_model)
    print(startvals)
    # instantiate sampler
    step = pm3.Metropolis()
    # draw 10,000 posterior samples
    trace = pm3.sample(chain_length, step=step, start=startvals)
score=pm3.geweke(trace, first=0.1, last=0.5, intervals=20)
plt.scatter(score[0]['Mean of Data'][:,0],score[0]['Mean of Data'][:,1], marker = 'o', s=100)
plt.axhline(-1.98, c='r')
plt.axhline(1.98, c='r')
plt.ylim(-2.5,2.5)
plt.xlim(0-10,.5*trace['Mean of Data'].shape[0]/2+10)
plt.title('Geweke Plot Comparing first 10% and Slices of the Last 50% of Chain\nDifference in Mean Z score')
plt.show()
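If your team gets stuck, here is one possible sketch of the first step (the HalfNormal prior and its sd=5 are illustrative choices, not the required answer): give sigma its own prior so it becomes a second stochastic variable, then rerun the sampling and diagnostics with both parameters in the trace.
exercise_model = pm3.Model()
with exercise_model:
    # Priors for unknown model parameters
    mu = pm3.Normal('Mean of Data', mu_prior, sigma_prior)
    sigma_est = pm3.HalfNormal('Std of Data', sd=5)   # hypothetical prior on sigma
    # Likelihood now has two parameters to estimate
    data_in = pm3.Normal('Y_obs', mu=mu, sd=sigma_est, observed=data)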