The goal of this workshop is to 1. Explore bootstrapping * See the implementation steps * Show subtleties of reported results 2. Introduce Bayesian Econometrics * Brief discussion of Bayes Rule * Markov Chains and the Metropolis-Hastings Sampler * Simple example estimating mean and standard deviation * Application to Tobias and Koop * Heirarchical Models * Show you three different samplers
#load python libraries for this ipython notebook:
import numpy as np # numpy linear algebra library
import pandas as pd # pandas data manipulation library
import scipy.stats as scipy # access to stats libraries
import emcee as emcee # a bayesian library
#load pymc3 libraries (another Bayesian Library)
import imp
pm = imp.load_module("pymc3", *imp.find_module("pymc", ["/home/robhicks/Dropbox/notebooks/pymc3/pymc"]))
In class, we looked at these estimation methods: 1. Linear Model 2. Maximum Likelihood 3. Generalized Method of Moments
Leaving these as major estimation paradigms you haven't seen: 4. Bayesian Estimation 5. Non-parametric Estimation
In what follows, we will replicate our Bootstrapping work in Ipython (so we can compare to Bayesian Methods) and then move on to Bayesian Estimation.
Note: the appeal of Bayesian modeling is to tackle computationally challenging models and the implementation of an OLS model is fairly trivial, but gives us a point of comparison since we know the OLS model very well. At the end of these notes, we show how Bayesian Econometrics opens the door to very rich models, that are difficult to replicate using frequentist methods.
tobias_koop=pd.read_csv('http://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')
tobias_koop.head()
tobias_koop.describe()
# explore distribution of ln_wage (dv)
tobias_koop.ln_wage.plot(kind='hist')
# run OLS regression:
results = pd.ols(y=tobias_koop['ln_wage'], x=tobias_koop[['educ','pexp','pexp2','broken_home']])
results
These s.e.'s aren't robust (don't match stata output), so let's bootstrap. The statsmodels package has a robust and bootstrap option (see http://statsmodels.sourceforge.net/stable/rlm.html for the robust OLS model). In general, you should use the canned routines, but we present the following for educational purposes only to make explicit what the bootstrap is doing.
Steps: 1. Loop \(R\) times (\(R\) = # of boostrapped replicates) 2. For each \(r\in R\), sample \(N\) times with replacement from \(\mathbf{Y}\) and \(\mathbf{X}\), noting that \(N\) is equal to the number of observations in the original dataset \(\mathbf{Y}\) and \(\mathbf{X}\). Denote each of these samples as \(\mathbf{X}_r\) and \(\mathbf{Y}_r\) 3. Estimate an OLS model and recover estimate for replicate \(r\): \(\hat{\beta}_r = (\mathbf{X}_r'\mathbf{X}_r)^{-1}\mathbf{X}_r'\mathbf{Y}_r\) 4. Store results 5. Calculate the standard deviation (our estimate for the s.e.) of each parameter estimate, or construct non-parametric confidence intervals (preferred)
R = 400
results_boot = np.zeros((R,results.beta.shape[0]))
row_id = range(0,tobias_koop.shape[0])
for r in range(R):
this_sample = numpy.random.choice(row_id, size=tobias_koop.shape[0], replace=True) # gives sampled row numbers
# Define data for this replicate:
tobias_koop_r = tobias_koop.iloc[this_sample]
# Estimate model
results_r = pd.ols(y=tobias_koop_r['ln_wage'], x=tobias_koop_r[['educ','pexp','pexp2','broken_home']])
# Store in row r of results_boot:
results_boot[r,:] = np.asarray(results_r.beta)
# Convert results to pandas dataframe for easier analysis:
results_boot = pd.DataFrame(results_boot,columns=['b_educ','b_pexp','pexp2','b_broken_home','b_Intercept'])
results_boot.head(10) # the first 10 replicates
results_boot.b_educ.plot(kind='hist')
results_boot.describe(percentiles=[.005,.025,.05,.5,.95,.975,.995])
Let's plot our bootstrap replicates, by parameter. Note that the values of each of these replicates are completely independent (or should be). Replicate 3 doesn't in any way inform us about replicate 4.
replicate= np.arange(R)
# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 10), dpi=200)
lw = 1
plt.subplot(521)
plt.ylabel("Intercept")
plt.plot(replicate, results_boot.b_Intercept, label="Intercept",lw=lw)
plt.subplot(522)
plt.hist(results_boot.b_Intercept,lw=lw, label="Intercept", orientation='horizontal')
plt.subplot(523)
plt.ylabel("b_educ")
plt.plot(replicate, results_boot.b_educ, label="b_educ",lw=lw)
plt.subplot(524)
plt.hist(results_boot.b_educ,lw=lw, label="b_educ",orientation='horizontal')
plt.subplot(525)
plt.ylabel("b_pexp")
plt.plot(replicate, results_boot.b_pexp, label="b_pexp",lw=lw)
plt.subplot(526)
plt.hist(results_boot.b_pexp,lw=lw, label="b_pexp",orientation='horizontal')
plt.subplot(527)
plt.ylabel("b_pexp2")
plt.plot(replicate, results_boot.b_pexp, label="b_pexp2",lw=lw)
plt.subplot(528)
plt.hist(results_boot.b_pexp,lw=lw, label="b_pexp2",orientation='horizontal')
plt.subplot(529)
plt.ylabel("b_broken_home")
plt.plot(replicate, results_boot.b_broken_home, label="b_broken_home",lw=lw)
plt.subplot(520)
plt.hist(results_boot.b_broken_home,lw=lw, label="b_broken_home",orientation='horizontal')
The Bayesian approach is functionally similar yet comes at the problem from a philosphically different viewpoint. For frequentists, there is one true measure of the \(\beta\) vector and uncertainty arises from having to sample (as is demonstrated in the above bootstrap) from a population. All the "noise" is a result of estimating our beta over a different sample. With a large enough sample size the estimator would collapse to a single value as the sample becomes increasingly large (note: here I am referring to the number of observations in \(\mathbf{X}\), not increasing \(R\).
The Bayesian analyst approaches the problem differently. Taking the data as given, uncertainty arises around \(\beta\) for a whole host of reasons. Even an infinite sample size may not cause that uncertainty to collapse to zero. In frequentist terms, we would never talk about uncertainty of \(\beta\), but rather the uncertainty of our estimator. Bayesians are comfortable talking about uncertainty of \(\beta\). In practice, these nuances can be largely ignored (this statement could start a holy war) and one can approach econometric models from either perspective depending on which one leads to a better (or estimable) approach. There are examples on the web showing how one approach can outperform another, but these are all special cases. It just depends on how one defines outperforms and the phenomena under investigation.
Recall that Bayes Rule is \[ Prob(A|B) = \frac{Prob(B|A)Prob(A)}{Prob(B)} \]
How do we use this idea for econometric inference? A very good book on this topic is by John Geweke, "Contemporary Bayesian Econometrics and Statistics". Consider the following definitions for getting us started:
Consider constructing a set or series of potential beta vectors that explore the posterior distribution described above. When selecting beta vectors we would like to explore parameters having higher probability for the posterior (\(Prob(\theta|\mathbf{X},\mathbf{Y},\alpha)\)) . So in that context, it is useful to think about \(\theta\) and \(\alpha\) as follows: denote \(\theta\) as a new draw from the parameter distribution (\(\theta_{t+1}\)), and let \(\alpha\) be the current draw (\(\theta_t\)).
With that in mind, we can write the posterior as \[ Prob(\theta_{t+1}|\mathbf{X},\mathbf{Y},\theta_t) = \frac{Prob(\mathbf{X},\mathbf{Y}|\theta_{t+1})Prob(\theta_{t+1}|\theta_t)}{Prob(\mathbf{X},\mathbf{Y}|\theta_{t})} \]
This begins us thinking about how new information (a proposed value of \(\theta\)) tells us about the posterior distribution. Essentially we want to construct a series of estimates (called a chain) of the beta vector that is in regions of the parameter space that are high probability for the posterior. Once the chain is constructed, it reflects the underlying distribution of our parameter estimates.
First, we must specify a proposal distribution from which to draw proposed values of \(\theta\). Suppose we choose a symmetric proposal distribution such that \(Prob(\theta_{t+1}|\theta_t) = Prob(\theta_t|\theta_{t+1})\). For example, we might specify that \(\theta_{t+1} = \theta_t + Normal(0,\sigma)\). Since the difference is distributed \(N(\theta_{t+1} - \theta_t;0,\sigma) = N(\theta_t - \theta_{t+1};0,\sigma)\). This is called the random walk Metropolis-Hastings algorithm, the most common one in use.
We can then define an accept/reject criteria, where accept means that the candidate draw (\(\theta_{t+1}\)) provides information about \(\theta\) in high probability regions. The criteria is \[ c(\theta_{t+1},\theta_t) = min \left( 1,\frac{Prob(\mathbf{X},\mathbf{Y}|\theta_{t+1})Prob(\theta_{t+1}|\theta_t)}{Prob(\mathbf{X},\mathbf{Y}|\theta_{t})Prob(\theta_{t}|\theta_t+1)}\right ) \]
Given the symmetry of the random walk draws, this simplifies to
\[ c(\theta_{t+1},\theta_t) = min \left( 1,\frac{Prob(\mathbf{X},\mathbf{Y}|\theta_{t+1})}{Prob(\mathbf{X},\mathbf{Y}|\theta_{t})}\right ) \] which is simply a ratio of likelihood function values!
Having calculated the accept/reject criteria value \(c\), we need to decide whether to include \(\theta_{t+1}\) in our chain. To do that, draw the random uniform value \(u\in[0,1]\). If \(c(\theta_{t+1},\theta_t)\ge u\) then add \(\theta_{t+1}\) to our chain that will help us characterize the distribution of parameters that most satisfy the posterior probability.
It is important to note that the accept reject method ensures that the chain never "converges" like maximum likelihood. Even if the maximum is found, there is always a non-zero probability that the chain will deviate (and store values of the estimate of \(\beta\) that actaully cause the likelihood function value to decline!
The Metroplis-Hastings algorithm has been named one of the top 10 algorithms of the 20th century and appears on both the Math and Computational Sciences lists.
Suppose we have 100 observations on age = [43, 21, 36, 17, ...] and we want to use the MH algorithm to find the mean age.
Steps: 1. Define likelihood function 2. Generate candidate value (\(\theta_{t+1}\)) 3. Calculate likelihood value 4. Calculate \(c(\theta_{t+1}|\theta_{t})\) 5. Draw a random uniform variate (value) 6. Choose to either accept or reject \(\theta_{t+1}\) 7. If accept, add to chain. If reject, go back to (2)
# data: construct age matrix and ensure correct shape
age = pd.DataFrame((40 + 12*np.random.normal(0, 1,(100,1))), columns=['age'])
age.describe()
Step 1. Define the Likelihood Function: a normal pdf with mean mu, standard deviation sigma, and vector data input x.
def like_age(mu,sigma,x):
if sigma<=0:
return -np.inf # heavily penalize if < 0
else:
u = (x - mu)/np.tile(np.abs(sigma),(x.shape[0]))
scale = (1/(np.sqrt(2*np.pi)*np.abs(sigma)))
scale = np.tile(scale,(shape(x)[0],1))
prob = scale*np.exp(-np.multiply(u,u)/2)
return np.sum(np.log(prob))
Steps 2-7.
Define starting values and construct the chain.
age=np.asarray(age)
mu0 = 30 #starting value for mean
sigma0 = 10 #starting value for sigma
rho = np.asarray(.15) #tuning parameter for MH sampler
p0 = like_age(mu0,sigma0,age) #starting probability for chain
chain_length = 10000
burn_in = 1000
chain_vals = np.zeros((chain_length,3))
# This is the heart of the MH algorithm, see notes below
# Loop performs steps 2-7 above
for i in range(0,chain_length):
mu_cand = np.asarray(mu0 + rho*np.random.normal(0, 1))
sigma_cand = np.asarray(sigma0 + rho*np.random.normal(0, 1))
# for this candidate calculate p1 (probability of parameters given data)
p1 = like_age(mu_cand,sigma_cand,age)
c = np.minimum(0,p1 - p0) # subtract since in logs
if c>=np.log(np.random.uniform()):
p0 = p1
mu0 = mu_cand
sigma0 = sigma_cand
chain_vals[i,0] = mu0
chain_vals[i,1] = sigma0
chain_vals[i,2] = p0
The code above works but is far from the perfect version of the MH algorithm. One needs to implement an adaptive random walk draw approach that tunes rho to lead to an accept ratio of around 20%. This doesn't do that, so it shouldn't be implemented in a production environment, but is ok for instructional purposes.
# convert to pandas dataframe
chain_ex = pd.DataFrame(chain_vals,columns=['mean','std','prob'])
chain_ex.plot(kind='hexbin', x='mean', y='std', C='prob', reduce_C_function=np.max,gridsize=50)
# calculate summary statistics post burn-in
chain_ex.loc[burn_in:].describe(percentiles=[.005,.025,.05,.5,.95,.975,.995])
The summary statistics are for post burn-in values in the chain only and are calculated off of the dark green region in the above plot. The plot above shows how accepted values of the mean and standard deviation move around the parameter space. Lighter shaded areas are low probability and notice the chain "converges" to the dark green area after exploring some higher age and std values. Note: other combinations of parameters were considered but weren't accepted in the chain.
The main point to consider is that the Bayesian Estimator is not merely the mean and standard deviation of age, but rather the estimated distribution as can be seen from the traceplots:
#plot chains, excluding burn-in
time= np.arange(chain_length-burn_in)
# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 10), dpi=200)
lw = 1
plt.subplot(221)
plt.ylabel("mu")
plt.plot(time, chain_ex['mean'].loc[burn_in:] , lw=lw)
plt.subplot(222)
chain_ex['mean'].loc[burn_in:].plot(kind='hist',bins=50,orientation='horizontal')
plt.subplot(223)
plt.ylabel("sigma")
plt.plot(time, chain_ex['std'].loc[burn_in:], lw=lw)
plt.subplot(224)
chain_ex['std'].loc[burn_in:].plot(kind='hist',bins=50,orientation='horizontal')
For Bayesian Econometrics, convergence describes whether the chain oscillates randomly around a stable value (which is the mean of the distribution).
# to make likelihood calculations easier, add a column of ones for the intercept:
tobias_koop['ones'] = 1
The Bayesian approach centers on probabilities of the posterior. We can recycle what we know from maximum likelihood to write the probability for individual i as
\[ Prob(\mathbf{b}| \mathbf{Y}_i, \mathbf{X}_i) = \frac{1}{\sqrt{2\sigma^2\pi}} e^{-\frac{(Y_i - \mathbf{X_i b})^2}{2\sigma^2}} \] which is just the normal pdf having mean \(\mathbf{X}_i \mathbf{b}\)
With that, we can write the log-likelihood function as \[ LogLike(\mathbf{b}| \mathbf{Y}, \mathbf{X}) = \sum_{i \in N} ln\left(Prob(\mathbf{b}| \mathbf{Y}_i, \mathbf{X}_i)\right) \] which is exactly the same as the log-likelihood, we'd use if we were estimating OLS by maximum likelihood. The following operationalizes this in code. Note that as in MLE, we have to estimate the \(K\) parameters of \(\mathbf{b}\) and our estimate for \(\sigma\).
# Likelihood Function for OLS (Bayesians call this log of posterior): identical to likelihood function for MLE
def log_likelihood(theta, x, y):
sigma = np.asmatrix(theta[-1]) #the last element of theta is the std dev of epsilon, and can't be <0
beta = np.asarray(theta[:-1]) #the first k are beta parameters
Y=np.asmatrix(tobias_koop[y])
if sigma <= 0:
return -np.inf #account for a chain trying to assign sigma < 0: penalize heavily
else:
mu = np.asmatrix(np.sum(tobias_koop[x]*np.tile(beta,(tobias_koop.shape[0],1)),axis=1)) #this is xb
mu = mu.reshape(Y.shape[0],1)
prob = scipy.norm.pdf(Y,mu,sigma)
return sum(np.log(prob))
Now that we have defined the likelihood function (good for any OLS style model), we need to use an algorithm to sample parameters from the posterior. The MH algorithm has some advantages, but there are better ones out there that have been recently developed. For this example, we will use the emcee algorithm (http://dan.iel.fm/emcee/current/) that has been used in the Astronomy/Physics Literature. It wanders the parameter space using a more informed method rather than just a random walk (what the MH algorithm we employed uses). This code constructs a chain for our model parameters:
x = ['educ','pexp','pexp2','broken_home','ones']
y = ['ln_wage']
# this calls the emcee mcmc algorithm
ndim, nwalkers = shape(x)[0]+1, 100 #number of parameters, #of walkers in ensemble
nsteps, nburn = 5000, 1000 #total number of steps, # of burn-in steps
starting_guesses = np.random.rand(nwalkers, ndim)*.1
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_likelihood, args=[x, y],threads=5)
store_chain=sampler.run_mcmc(starting_guesses, nsteps)
The estimator returns a Markov Chain (actually the Ensemble Sampler returns 100 chains (number of walkers)). Note: in what follows below, it should be evident that the Bayesian Approach estimates a distribution of parameter estimates rather than focusing on parameter estimates. As an aside, this calculates the likelihood function 5,000 x 100 = 500k times. This may be a little overkill for an OLS model.....
The following graphs one of these chains. Note: this exhibits periodicity which is problematic, so would need to look at combining across all chains to judge convergence. See discussion here: http://dan.iel.fm/emcee/current/user/line/
#Plot the entire chain:
time= np.arange(nsteps-nburn) # x values for plot (integer count)
sample = sampler.chain[:, nburn:, :]
# plot point estimate and confidence intervals:
plt.figure(figsize=(14, 10), dpi=200)
lw = 1
# to show the full chain, need to loop through walkers, but we'll just show the first walker:
plt.subplot(521)
plt.ylabel("Intercept")
plt.plot(time, sample[1,:,4], label="Intercept",lw=lw)
plt.subplot(522)
plt.hist(sample[1,:,4],lw=lw, label="Intercept", orientation='horizontal')
plt.subplot(523)
plt.ylabel("b_educ")
plt.plot(time, sample[1,:,0], label="b_educ",lw=lw)
plt.subplot(524)
plt.hist(sample[1,:,0],lw=lw, label="b_educ",orientation='horizontal')
plt.subplot(525)
plt.ylabel("b_pexp")
plt.plot(time, sample[1,:,1], label="b_pexp",lw=lw)
plt.subplot(526)
plt.hist(sample[1,:,1],lw=lw, label="b_pexp",orientation='horizontal')
plt.subplot(527)
plt.ylabel("b_pexp2")
plt.plot(time, sample[1,:,2], label="b_pexp2",lw=lw)
plt.subplot(528)
plt.hist(sample[1,:,2],lw=lw,label="b_pexp2", orientation='horizontal')
plt.subplot(529)
plt.ylabel("b_broken_home")
plt.plot(time,sample[1,:,3], label="b_broken_home",lw=lw)
plt.subplot(520)
plt.hist(sample[1,:,3],lw=lw, label="b_broken_home",orientation='horizontal')
While the trace plots above look fairly similar to what we found above in the bootstrapping excercise, we came at this information from a very different perspective, by sampling values from the posterior distribution. This was done by evaluating a number of trial values of our parameters and settling on areas of the parameter space having higher probability values for the posterior.
What is of more interest is the information contained in all of the chains, and what it tells us about the distribution of the parameters. The following trace chart uses all of the information from all 100 chains.
# to show the full chain, need to loop through walkers
walkers=np.arange(nwalkers)
plt.figure(figsize=(14, 14), dpi=200)
lw = .05
for i in walkers:
plt.subplot(511)
plt.ylabel("Intercept")
plt.plot(time, sample[i,:,4], label="Intercept",lw=lw)
plt.subplot(512)
plt.ylabel("b_educ")
plt.plot(time, sample[i,:,0], label="b_educ",lw=lw)
plt.subplot(513)
plt.ylabel("b_pexp")
plt.plot(time, sample[i,:,1], label="b_pexp",lw=lw)
plt.subplot(514)
plt.ylabel("b_pexp2")
plt.plot(time, sample[i,:,2], label="b_pexp2",lw=lw)
plt.subplot(515)
plt.ylabel("b_broken_home")
plt.plot(time,sample[i,:,3], label="b_broken_home",lw=lw)