The goal of this workshop is to 1. Explore bootstrapping * See the implementation steps * Show subtleties of reported results 2. Introduce Bayesian Econometrics * Brief discussion of Bayes Rule * Markov Chains and the Metropolis-Hastings Sampler * Simple example estimating mean and standard deviation * Application to Tobias and Koop * Heirarchical Models * Show you three different samplers
#load python libraries for this ipython notebook:
import numpy as np # numpy linear algebra library
import pandas as pd # pandas data manipulation library
import scipy.stats as scipy # access to stats libraries
import emcee as emcee # a bayesian library
#load pymc3 libraries (another Bayesian Library)
import imp
pm = imp.load_module("pymc3", *imp.find_module("pymc", ["/home/robhicks/Dropbox/notebooks/pymc3/pymc"]))
In class, we looked at these estimation methods: 1. Linear Model 2. Maximum Likelihood 3. Generalized Method of Moments
Leaving these as major estimation paradigms you haven't seen: 4. Bayesian Estimation 5. Non-parametric Estimation
In what follows, we will replicate our Bootstrapping work in Ipython (so we can compare to Bayesian Methods) and then move on to Bayesian Estimation.
Note: the appeal of Bayesian modeling is to tackle computationally challenging models and the implementation of an OLS model is fairly trivial, but gives us a point of comparison since we know the OLS model very well. At the end of these notes, we show how Bayesian Econometrics opens the door to very rich models, that are difficult to replicate using frequentist methods.
tobias_koop=pd.read_csv('http://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')
tobias_koop.head()
tobias_koop.describe()
# explore distribution of ln_wage (dv)
tobias_koop.ln_wage.plot(kind='hist')
# run OLS regression:
results = pd.ols(y=tobias_koop['ln_wage'], x=tobias_koop[['educ','pexp','pexp2','broken_home']])
results
These s.e.'s aren't robust (don't match stata output), so let's bootstrap. The statsmodels package has a robust and bootstrap option (see http://statsmodels.sourceforge.net/stable/rlm.html for the robust OLS model). In general, you should use the canned routines, but we present the following for educational purposes only to make explicit what the bootstrap is doing.
Steps: 1. Loop \(R\) times (\(R\) = # of boostrapped replicates) 2. For each \(r\in R\), sample \(N\) times with replacement from \(\mathbf{Y}\) and \(\mathbf{X}\), noting that \(N\) is equal to the number of observations in the original dataset \(\mathbf{Y}\) and \(\mathbf{X}\). Denote each of these samples as \(\mathbf{X}_r\) and \(\mathbf{Y}_r\) 3. Estimate an OLS model and recover estimate for replicate \(r\): \(\hat{\beta}_r = (\mathbf{X}_r'\mathbf{X}_r)^{-1}\mathbf{X}_r'\mathbf{Y}_r\) 4. Store results 5. Calculate the standard deviation (our estimate for the s.e.) of each parameter estimate, or construct non-parametric confidence intervals (preferred)
R = 400
results_boot = np.zeros((R,results.beta.shape[0]))
row_id = range(0,tobias_koop.shape[0])
for r in range(R):
this_sample = numpy.random.choice(row_id, size=tobias_koop.shape[0], replace=True) # gives sampled row numbers
# Define data for this replicate:
tobias_koop_r = tobias_koop.iloc[this_sample]
# Estimate model
results_r = pd.ols(y=tobias_koop_r['ln_wage'], x=tobias_koop_r[['educ','pexp','pexp2','broken_home']])
# Store in row r of results_boot:
results_boot[r,:] = np.asarray(results_r.beta)
# Convert results to pandas dataframe for easier analysis:
results_boot = pd.DataFrame(results_boot,columns=['b_educ','b_pexp','pexp2','b_broken_home','b_Intercept'])
results_boot.head(10) # the first 10 replicates
results_boot.b_educ.plot(kind='hist')
results_boot.describe(percentiles=[.005,.025,.05,.5,.95,.975,.995])
Let's plot our bootstrap replicates, by parameter. Note that the values of each of these replicates are completely independent (or should be). Replicate 3 doesn't in any way inform us about replicate 4.
replicate= np.arange(R)
# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 10), dpi=200)
lw = 1
plt.subplot(521)
plt.ylabel("Intercept")
plt.plot(replicate, results_boot.b_Intercept, label="Intercept",lw=lw)
plt.subplot(522)
plt.hist(results_boot.b_Intercept,lw=lw, label="Intercept", orientation='horizontal')
plt.subplot(523)
plt.ylabel("b_educ")
plt.plot(replicate, results_boot.b_educ, label="b_educ",lw=lw)
plt.subplot(524)
plt.hist(results_boot.b_educ,lw=lw, label="b_educ",orientation='horizontal')
plt.subplot(525)
plt.ylabel("b_pexp")
plt.plot(replicate, results_boot.b_pexp, label="b_pexp",lw=lw)
plt.subplot(526)
plt.hist(results_boot.b_pexp,lw=lw, label="b_pexp",orientation='horizontal')
plt.subplot(527)
plt.ylabel("b_pexp2")
plt.plot(replicate, results_boot.b_pexp, label="b_pexp2",lw=lw)
plt.subplot(528)
plt.hist(results_boot.b_pexp,lw=lw, label="b_pexp2",orientation='horizontal')
plt.subplot(529)
plt.ylabel("b_broken_home")
plt.plot(replicate, results_boot.b_broken_home, label="b_broken_home",lw=lw)
plt.subplot(520)
plt.hist(results_boot.b_broken_home,lw=lw, label="b_broken_home",orientation='horizontal')
The Bayesian approach is functionally similar yet comes at the problem from a philosphically different viewpoint. For frequentists, there is one true measure of the \(\beta\) vector and uncertainty arises from having to sample (as is demonstrated in the above bootstrap) from a population. All the "noise" is a result of estimating our beta over a different sample. With a large enough sample size the estimator would collapse to a single value as the sample becomes increasingly large (note: here I am referring to the number of observations in \(\mathbf{X}\), not increasing \(R\).
The Bayesian analyst approaches the problem differently. Taking the data as given, uncertainty arises around \(\beta\) for a whole host of reasons. Even an infinite sample size may not cause that uncertainty to collapse to zero. In frequentist terms, we would never talk about uncertainty of \(\beta\), but rather the uncertainty of our estimator. Bayesians are comfortable talking about uncertainty of \(\beta\). In practice, these nuances can be largely ignored (this statement could start a holy war) and one can approach econometric models from either perspective depending on which one leads to a better (or estimable) approach. There are examples on the web showing how one approach can outperform another, but these are all special cases. It just depends on how one defines outperforms and the phenomena under investigation.
Recall that Bayes Rule is \[ Prob(A|B) = \frac{Prob(B|A)Prob(A)}{Prob(B)} \]
How do we use this idea for econometric inference? A very good book on this topic is by John Geweke, "Contemporary Bayesian Econometrics and Statistics". Consider the following definitions for getting us started:
Consider constructing a set or series of potential beta vectors that explore the posterior distribution described above. When selecting beta vectors we would like to explore parameters having higher probability for the posterior (\(Prob(\theta|\mathbf{X},\mathbf{Y},\alpha)\)) . So in that context, it is useful to think about \(\theta\) and \(\alpha\) as follows: denote \(\theta\) as a new draw from the parameter distribution (\(\theta_{t+1}\)), and let \(\alpha\) be the current draw (\(\theta_t\)).
With that in mind, we can write the posterior as \[ Prob(\theta_{t+1}|\mathbf{X},\mathbf{Y},\theta_t) = \frac{Prob(\mathbf{X},\mathbf{Y}|\theta_{t+1})Prob(\theta_{t+1}|\theta_t)}{Prob(\mathbf{X},\mathbf{Y}|\theta_{t})} \]
This begins us thinking about how new information (a proposed value of \(\theta\)) tells us about the posterior distribution. Essentially we want to construct a series of estimates (called a chain) of the beta vector that is in regions of the parameter space that are high probability for the posterior. Once the chain is constructed, it reflects the underlying distribution of our parameter estimates.
First, we must specify a proposal distribution from which to draw proposed values of \(\theta\). Suppose we choose a symmetric proposal distribution such that \(Prob(\theta_{t+1}|\theta_t) = Prob(\theta_t|\theta_{t+1})\). For example, we might specify that \(\theta_{t+1} = \theta_t + Normal(0,\sigma)\). Since the difference is distributed \(N(\theta_{t+1} - \theta_t;0,\sigma) = N(\theta_t - \theta_{t+1};0,\sigma)\). This is called the random walk Metropolis-Hastings algorithm, the most common one in use.
We can then define an accept/reject criteria, where accept means that the candidate draw (\(\theta_{t+1}\)) provides information about \(\theta\) in high probability regions. The criteria is \[ c(\theta_{t+1},\theta_t) = min \left( 1,\frac{Prob(\mathbf{X},\mathbf{Y}|\theta_{t+1})Prob(\theta_{t+1}|\theta_t)}{Prob(\mathbf{X},\mathbf{Y}|\theta_{t})Prob(\theta_{t}|\theta_t+1)}\right ) \]
Given the symmetry of the random walk draws, this simplifies to
\[ c(\theta_{t+1},\theta_t) = min \left( 1,\frac{Prob(\mathbf{X},\mathbf{Y}|\theta_{t+1})}{Prob(\mathbf{X},\mathbf{Y}|\theta_{t})}\right ) \] which is simply a ratio of likelihood function values!
Having calculated the accept/reject criteria value \(c\), we need to decide whether to include \(\theta_{t+1}\) in our chain. To do that, draw the random uniform value \(u\in[0,1]\). If \(c(\theta_{t+1},\theta_t)\ge u\) then add \(\theta_{t+1}\) to our chain that will help us characterize the distribution of parameters that most satisfy the posterior probability.
It is important to note that the accept reject method ensures that the chain never "converges" like maximum likelihood. Even if the maximum is found, there is always a non-zero probability that the chain will deviate (and store values of the estimate of \(\beta\) that actaully cause the likelihood function value to decline!
The Metroplis-Hastings algorithm has been named one of the top 10 algorithms of the 20th century and appears on both the Math and Computational Sciences lists.
Suppose we have 100 observations on age = [43, 21, 36, 17, ...] and we want to use the MH algorithm to find the mean age.
Steps: 1. Define likelihood function 2. Generate candidate value (\(\theta_{t+1}\)) 3. Calculate likelihood value 4. Calculate \(c(\theta_{t+1}|\theta_{t})\) 5. Draw a random uniform variate (value) 6. Choose to either accept or reject \(\theta_{t+1}\) 7. If accept, add to chain. If reject, go back to (2)
# data: construct age matrix and ensure correct shape
age = pd.DataFrame((40 + 12*np.random.normal(0, 1,(100,1))), columns=['age'])
age.describe()
Step 1. Define the Likelihood Function: a normal pdf with mean mu, standard deviation sigma, and vector data input x.
def like_age(mu,sigma,x):
if sigma<=0:
return -np.inf # heavily penalize if < 0
else:
u = (x - mu)/np.tile(np.abs(sigma),(x.shape[0]))
scale = (1/(np.sqrt(2*np.pi)*np.abs(sigma)))
scale = np.tile(scale,(shape(x)[0],1))
prob = scale*np.exp(-np.multiply(u,u)/2)
return np.sum(np.log(prob))
Steps 2-7.
Define starting values and construct the chain.
age=np.asarray(age)
mu0 = 30 #starting value for mean
sigma0 = 10 #starting value for sigma
rho = np.asarray(.15) #tuning parameter for MH sampler
p0 = like_age(mu0,sigma0,age) #starting probability for chain
chain_length = 10000
burn_in = 1000
chain_vals = np.zeros((chain_length,3))
# This is the heart of the MH algorithm, see notes below
# Loop performs steps 2-7 above
for i in range(0,chain_length):
mu_cand = np.asarray(mu0 + rho*np.random.normal(0, 1))
sigma_cand = np.asarray(sigma0 + rho*np.random.normal(0, 1))
# for this candidate calculate p1 (probability of parameters given data)
p1 = like_age(mu_cand,sigma_cand,age)
c = np.minimum(0,p1 - p0) # subtract since in logs
if c>=np.log(np.random.uniform()):
p0 = p1
mu0 = mu_cand
sigma0 = sigma_cand
chain_vals[i,0] = mu0
chain_vals[i,1] = sigma0
chain_vals[i,2] = p0
The code above works but is far from the perfect version of the MH algorithm. One needs to implement an adaptive random walk draw approach that tunes rho to lead to an accept ratio of around 20%. This doesn't do that, so it shouldn't be implemented in a production environment, but is ok for instructional purposes.
# convert to pandas dataframe
chain_ex = pd.DataFrame(chain_vals,columns=['mean','std','prob'])
chain_ex.plot(kind='hexbin', x='mean', y='std', C='prob', reduce_C_function=np.max,gridsize=50)
# calculate summary statistics post burn-in
chain_ex.loc[burn_in:].describe(percentiles=[.005,.025,.05,.5,.95,.975,.995])
The summary statistics are for post burn-in values in the chain only and are calculated off of the dark green region in the above plot. The plot above shows how accepted values of the mean and standard deviation move around the parameter space. Lighter shaded areas are low probability and notice the chain "converges" to the dark green area after exploring some higher age and std values. Note: other combinations of parameters were considered but weren't accepted in the chain.
The main point to consider is that the Bayesian Estimator is not merely the mean and standard deviation of age, but rather the estimated distribution as can be seen from the traceplots:
#plot chains, excluding burn-in
time= np.arange(chain_length-burn_in)
# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 10), dpi=200)
lw = 1
plt.subplot(221)
plt.ylabel("mu")
plt.plot(time, chain_ex['mean'].loc[burn_in:] , lw=lw)
plt.subplot(222)
chain_ex['mean'].loc[burn_in:].plot(kind='hist',bins=50,orientation='horizontal')
plt.subplot(223)
plt.ylabel("sigma")
plt.plot(time, chain_ex['std'].loc[burn_in:], lw=lw)
plt.subplot(224)
chain_ex['std'].loc[burn_in:].plot(kind='hist',bins=50,orientation='horizontal')
For Bayesian Econometrics, convergence describes whether the chain oscillates randomly around a stable value (which is the mean of the distribution).
# to make likelihood calculations easier, add a column of ones for the intercept:
tobias_koop['ones'] = 1
The Bayesian approach centers on probabilities of the posterior. We can recycle what we know from maximum likelihood to write the probability for individual i as
\[ Prob(\mathbf{b}| \mathbf{Y}_i, \mathbf{X}_i) = \frac{1}{\sqrt{2\sigma^2\pi}} e^{-\frac{(Y_i - \mathbf{X_i b})^2}{2\sigma^2}} \] which is just the normal pdf having mean \(\mathbf{X}_i \mathbf{b}\)
With that, we can write the log-likelihood function as \[ LogLike(\mathbf{b}| \mathbf{Y}, \mathbf{X}) = \sum_{i \in N} ln\left(Prob(\mathbf{b}| \mathbf{Y}_i, \mathbf{X}_i)\right) \] which is exactly the same as the log-likelihood, we'd use if we were estimating OLS by maximum likelihood. The following operationalizes this in code. Note that as in MLE, we have to estimate the \(K\) parameters of \(\mathbf{b}\) and our estimate for \(\sigma\).
# Likelihood Function for OLS (Bayesians call this log of posterior): identical to likelihood function for MLE
def log_likelihood(theta, x, y):
sigma = np.asmatrix(theta[-1]) #the last element of theta is the std dev of epsilon, and can't be <0
beta = np.asarray(theta[:-1]) #the first k are beta parameters
Y=np.asmatrix(tobias_koop[y])
if sigma <= 0:
return -np.inf #account for a chain trying to assign sigma < 0: penalize heavily
else:
mu = np.asmatrix(np.sum(tobias_koop[x]*np.tile(beta,(tobias_koop.shape[0],1)),axis=1)) #this is xb
mu = mu.reshape(Y.shape[0],1)
prob = scipy.norm.pdf(Y,mu,sigma)
return sum(np.log(prob))
Now that we have defined the likelihood function (good for any OLS style model), we need to use an algorithm to sample parameters from the posterior. The MH algorithm has some advantages, but there are better ones out there that have been recently developed. For this example, we will use the emcee algorithm (http://dan.iel.fm/emcee/current/) that has been used in the Astronomy/Physics Literature. It wanders the parameter space using a more informed method rather than just a random walk (what the MH algorithm we employed uses). This code constructs a chain for our model parameters:
x = ['educ','pexp','pexp2','broken_home','ones']
y = ['ln_wage']
# this calls the emcee mcmc algorithm
ndim, nwalkers = shape(x)[0]+1, 100 #number of parameters, #of walkers in ensemble
nsteps, nburn = 5000, 1000 #total number of steps, # of burn-in steps
starting_guesses = np.random.rand(nwalkers, ndim)*.1
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_likelihood, args=[x, y],threads=5)
store_chain=sampler.run_mcmc(starting_guesses, nsteps)
The estimator returns a Markov Chain (actually the Ensemble Sampler returns 100 chains (number of walkers)). Note: in what follows below, it should be evident that the Bayesian Approach estimates a distribution of parameter estimates rather than focusing on parameter estimates. As an aside, this calculates the likelihood function 5,000 x 100 = 500k times. This may be a little overkill for an OLS model.....
The following graphs one of these chains. Note: this exhibits periodicity which is problematic, so would need to look at combining across all chains to judge convergence. See discussion here: http://dan.iel.fm/emcee/current/user/line/
#Plot the entire chain:
time= np.arange(nsteps-nburn) # x values for plot (integer count)
sample = sampler.chain[:, nburn:, :]
# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 10), dpi=200)
lw = 1
# to show the full chain, need to loop through walkers, but we'll just show the first walker:
plt.subplot(521)
plt.ylabel("Intercept")
plt.plot(time, sample[1,:,4], label="Intercept",lw=lw)
plt.subplot(522)
plt.hist(sample[1,:,4],lw=lw, label="Intercept", orientation='horizontal')
plt.subplot(523)
plt.ylabel("b_educ")
plt.plot(time, sample[1,:,0], label="b_educ",lw=lw)
plt.subplot(524)
plt.hist(sample[1,:,0],lw=lw, label="b_educ",orientation='horizontal')
plt.subplot(525)
plt.ylabel("b_pexp")
plt.plot(time, sample[1,:,1], label="b_pexp",lw=lw)
plt.subplot(526)
plt.hist(sample[1,:,1],lw=lw, label="b_pexp",orientation='horizontal')
plt.subplot(527)
plt.ylabel("b_pexp2")
plt.plot(time, sample[1,:,2], label="b_pexp2",lw=lw)
plt.subplot(528)
plt.hist(sample[1,:,2],lw=lw,label="b_pexp2", orientation='horizontal')
plt.subplot(529)
plt.ylabel("b_broken_home")
plt.plot(time,sample[1,:,3], label="b_broken_home",lw=lw)
plt.subplot(520)
plt.hist(sample[1,:,3],lw=lw, label="b_broken_home",orientation='horizontal')
While the trace plots above look fairly similar to what we found above in the bootstrapping excercise, we came at this information from a very different perspective, by sampling values from the posterior distribution. This was done by evaluating a number of trial values of our parameters and settling on areas of the parameter space having higher probability values for the posterior.
What is of more interest is the information contained in all of the chains, and what it tells us about the distribution of the parameters. The following trace chart uses all of the information from all 100 chains.
# to show the full chain, need to loop through walkers
walkers=np.arange(nwalkers)
plt.figure(figsize=(14, 14), dpi=200)
lw = .05
for i in walkers:
plt.subplot(511)
plt.ylabel("Intercept")
plt.plot(time, sample[i,:,4], label="Intercept",lw=lw)
plt.subplot(512)
plt.ylabel("b_educ")
plt.plot(time, sample[i,:,0], label="b_educ",lw=lw)
plt.subplot(513)
plt.ylabel("b_pexp")
plt.plot(time, sample[i,:,1], label="b_pexp",lw=lw)
plt.subplot(514)
plt.ylabel("b_pexp2")
plt.plot(time, sample[i,:,2], label="b_pexp2",lw=lw)
plt.subplot(515)
plt.ylabel("b_broken_home")
plt.plot(time,sample[i,:,3], label="b_broken_home",lw=lw)
Notice, using full set of walkers gives us a chain that at least visually does not have the periodicity problems (although there are formal tests), so we can proceed.
Using the same information, the following plots the likelihood (posterior) surface in two dimensions. Each point in the diagrams is an actual parameter value that was included in our chain.
samples_long = sampler.chain[:, nburn:, :].reshape((-1, ndim))
import triangle
fig = triangle.corner(samples_long, labels=["$educ$", "$pexp$", "$pexp2$","$broken\_home$","$Int$","$\sigma$","$\ln\,L$"],
truths=[.0853, .2035, -.0124, -.0087, .4603, .4265, 0])
It is important to note that any of the Bayesian MCMC sampling techniques explored here will never "converge" to the Maximum Likelihood Estimate, which is the value of parameters that maximize the likelihood function. This is because the MH algorithm (and its more modern derivates) places some non-zero probability of accepting candidate vectors with lower posterior probabilities. The vectors that are accepted are likely to be almost as probable as the MLE estimate, that is why the chain tends to wander about the MLE estimate. This is because we are trying to find the distribution of our parameters, not just a point estimate. However, on average, you get pretty much the same result as MLE (and for this example OLS).
Next, we summarize the distribution of our parameters, by taking the mean (most akin to the frequentist idea of a point estimate), the standard deviation (like the parameter standard error), and the percentile breakdowns. Note that reporting standard errors (or deviations) imposes normality on your distribution that may not be appropriate. Reporting confidence intervals (the way we have calculated them) is much less restrictive. Although no one in the literature really does this. Furthermore, we could calculate covariances as well to fully characterize \(cov(\theta)\), but don't do that here.
#convert to pandas dataset and describe
results_bayesian = pd.DataFrame(samples_long,columns=['educ','pexp','pexp2','broken_home','intercept','sigma'])
results_bayesian.describe(percentiles=[.005,.025,.05,.5,.95,.975,.995])
# to see how the MH algorithm searches the parameter space and information on parameter covariance:
#Uncomment the following two lines for another plot, but I prefer the triangle plot above.
#from pandas.tools.plotting import scatter_matrix
#scatter_matrix(results_bayesian, alpha=0.2, figsize=(12, 12), diagonal='hist')
So far, we have been using Bayesian Econometric Techniques on relatively trivial problems. But suppose we want to investigate something more interesting and suited to our Bayesian machinery?
Heirarchical modeling is one such case. The idea is that there may be heterogeneity of model parameters governed by a heirarchical process, for example the parameter \(\beta_1 \sim N(\mu_1,\sigma_1)\), but individuals have different realizations from this draw: \(\beta_{i1} = \mu_{1} + \sigma_1 Normal(0,1)\). This model allows for effects to differ amongst units (a unit could be a person, a state, etc).
Note: this example borrowed from http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/
Gelman et al.'s (2007) radon dataset is a classic for hierarchical modeling. In this dataset the amount of the radioactive gas radon has been measured among different households in all counties of several states. Radon gas is known to be the highest cause of lung cancer in non-smokers. It is believed to be more strongly present in households containing a basement and to differ in amount present among types of soil. Here we'll investigate this differences and try to make predictions of radonlevels in different counties based on the county itself and the presence of a basement. In this example we'll look at Minnesota, a state that contains 85 counties in which different measurements are taken, ranging from 2 to 116 measurements per county.
radon
#%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv('./random_parameters/radon_constructed.csv')
county_names = data.county.unique()
county_idx = data['county_code'].values
n_counties = len(data.county.unique())
We will look at this subset of the dataset:
data[['county', 'log_radon', 'floor']].head()
As you can see, we have multiple radon measurements (log-converted) -- one row for each house -- in a county and whether the house has a basement (floor == 0) or not (floor == 1). We are interested in whether having a basement increases the radon measured in the house.
Now you might say: "That's easy! I'll just pool all my data and estimate one big regression to asses the influence of a basement across all counties". In math-speak that model would be:
\[radon_{i, c} = \alpha + \beta*\text{floor}_{i, c} + \epsilon\]
Where \(i\) represents the measurement, \(c\) the county and floor contains a 0 or 1 if the house has a basement or not, respectively. If you need a refresher on Linear Regressions in PyMC
, check out my previous blog post. Critically, we are only estimating one intercept and one slope for all measurements over all counties pooled together as illustrated in the graphic below (\(\theta\) represents \((\alpha, \beta)\) in our case and \(y_i\) are the measurements of the \(i\)th county).
pooled
But what if we are interested in whether different counties actually have different relationships (slope) and different base-rates of radon (intercept)? Then you might say "OK then, I'll just estimate \(n\) (number of counties) different regressions -- one for each county". In math-speak that model would be:
\[radon_{i, c} = \alpha_{c} + \beta_{c}*\text{floor}_{i, c} + \epsilon_c\]
Note that we added the subindex \(c\) so we are estimating \(n\) different \(\alpha\)s and \(\beta\)s -- one for each county.
unpooled
This is the extreme opposite model; where above we assumed all counties are exactly the same, here we are saying that they share no similarities whatsoever. As we show below, this type of model can be very noisy when we have little data per county, as is the case in this data set.
Fortunately, there is a middle ground to both of these extremes. Specifically, we may assume that while \(\alpha\)s and \(\beta\)s are different for each county as in the unpooled case, the coefficients all share similarity. We can model this by assuming that each individual coefficient comes from a common group distribution:
\[\alpha_{c} \sim \mathcal{N}(\mu_{\alpha}, \sigma_{\alpha}^2)\] \[\beta_{c} \sim \mathcal{N}(\mu_{\beta}, \sigma_{\beta}^2)\]
We thus assume the intercepts \(\alpha\) and slopes \(\beta\) to come from a normal distribution centered around their respective group mean \(\mu\) with a certain standard deviation \(\sigma^2\), the values (or rather posteriors) of which we also estimate. That's why this is called a multilevel, hierarchical or partial-pooling modeling.
hierarchical
(Note that the above is not a complete Bayesian model specification as we haven't defined priors or hyperpriors (i.e. priors for the group distribution, \(\mu\) and \(\sigma\)). These will be used in the model implementation below but only distract here.)
Instead of creating models separatley, the hierarchical model creates group parameters that consider the counties not as completely different but as having an underlying similarity. These distributions are subsequently used to influence the distribution of each county's \(\alpha\) (intercept) and \(\beta\) (slope).
# this suppresses some warnings for better viewing:
import warnings
warnings.simplefilter(action = "ignore", category = FutureWarning)
warnings.simplefilter(action = "ignore", category = RuntimeWarning)
with pm.Model() as hierarchical_model:
# Hyperpriors for group nodes
mu_a = pm.Normal('mu_alpha', mu=0., sd=100**2)
sigma_a = pm.Uniform('sigma_alpha', lower=0, upper=100)
mu_b = pm.Normal('mu_beta', mu=0., sd=100**2)
sigma_b = pm.Uniform('sigma_beta', lower=0, upper=100)
# Intercept for each county, distributed around group mean mu_a
# Above we just set mu and sd to a fixed value while here we
# plug in a common group distribution for all a and b (which are
# vectors of length n_counties).
a = pm.Normal('alpha', mu=mu_a, sd=sigma_a, shape=n_counties)
# Intercept for each county, distributed around group mean mu_a
b = pm.Normal('beta', mu=mu_b, sd=sigma_b, shape=n_counties)
# s.d. of Model error
eps = pm.Uniform('eps', lower=0, upper=100)
# Model prediction of radon level
# a[county_idx] translates to a[0, 0, 0, 1, 1, ...],
# we thus link multiple household measures of a county
# to its coefficients.
radon_est = a[county_idx] + b[county_idx] * data.floor.values
# Data likelihood
radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)
# RUN the MCMC using the No U-turn Sampler (NUTS)
with hierarchical_model:
step = pm.NUTS()
hierarchical_trace = pm.sample(2000, step)
# Plotting the hierarchical model trace -its found values- from 500 iterations onwards (right side plot)
# and its accumulated marginal values (left side plot)
pm.traceplot(hierarchical_trace[500:]);
The marginal posteriors in the left column are highly informative. mu_a
tells us the group mean (log) radon levels. mu_b
tells us that having no basement decreases radon levels significantly (no mass above zero). We can also see by looking at the marginals for a
that there is quite some differences in radon levels between counties (each 'rainbow' color corresponds to a single county); the different widths are related to how much confidence we have in each paramter estimate -- the more measurements per county, the higher our confidence will be.