Tensorflow with Custom Likelihood Functions
This post builds on earlier ones dealing with custom likelihood functions in python and maximum likelihood estimation with auto differentiation. This post is approaching tensorflow from an econometrics perspective and is based on a series of tests and notes I developed for using tensorflow for some of my work. In early explorations of tensorflow nearly all of the examples I encountered were from a machine learning perspective, making it difficult to fit code examples to econometric problems. For the tensorflow uninitiated who want to dive in (like me!), I hope this will prove useful.
The goals of the post:
- Some tensorflow basics I wish I had known before I started this work
- Define a custom log-likelihood function in tensorflow and perform differentiation over model parameters to illustrate how, under the hood, tensorflow's model graph is designed to calculate derivatives "free of charge" (no programming required and very little to no additional compute time).
- Use the tensorflow log-likelihood to estimate a maximum likelihood
model using
tensorflow_probability.optimizer
capabilities. - Illustrate how the
tensorflow.probability.mcmc
libraries can be used with custom log-likelihoods.
So why bother? The earlier posts outlined above show python's
autograd
or theano
has auto differentiation capabilities which we
used successfully to speedup some models. Unfortunately, those
libraries aren't optimal for the current revolution in machine
learning enabled by gpu computing.1 Tensorflow is a library that
opens doors for econometricians enabling much faster processing (and
massive parallelization of models).
Tensorflow Basics
As described in the tensorflow performance tutorial, wrap your
functions in @tensorflow.function
. Without this, your code will run
very slow. All of the results below do this and it makes a huge
difference in runtimes. Here is an example of two functions that look
similar and use the tensorflow matmul
op but are very different in
terms of how tensorflow compiles (or doesn't compile) code in the
backend. Additionally, you can try the experimental_compile
mode
as it can shave over 50% off runtimes when it works:
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
import numpy as np
import pandas as pd
def mult_fun(x, y):
return tf.matmul(x, y)
@tf.function(experimental_compile=True)
def mult_tf_fun(x,y):
return tf.matmul(x, y)
print("The function mult_fun runs slow: ", type(mult_fun))
print("The function mult_tf_fun runs fast: ", type(mult_tf_fun))
# Out [120]: The function mult_fun runs slow: <class 'function'> The function mult_tf_fun runs fast: <class 'tensorflow.python.eager.def_function.Function'>
You want your functions to have the type of the second one.
The data types of your tensorflow function arguments also matter. In general python function arguments will require tensorflow to automatically rebuild compute graphs whereas tensorflow arguments will not cause a rebuild (so will run faster). If you need changes to the graph at runtime (e.g. number of samples, number of chains, etc.) then pass in the variable as a python variable as that will require a rebuild of the graph which is needed to run successfully.
A toy problem
To begin, we will be generating data consistent with ordinary least squares:
\begin{equation} \mathbf{y} = \mathbf{x} \beta + \epsilon \end{equation}where \(\mathbf{y}\) is an \(N \times 1\) vector of dependent variables, \(\mathbf{x}\) is an \(N \times K\) vector of dependent variables, \(\beta\) are the \(K\times 1\) parameters we are trying to estimate, and \(\epsilon\) is an \(N \times 1\) vector of iid mean zero homoskedastic unobservables having variance \(\sigma^2\).
Here we generate data for \(N=500\) and \(K=2\):
# set tensorflow data type
dtype = tf.float32
##
## simple OLS Data Generation Process
##
# True beta
b = np.array([10, -1])
N = 500
# True error std deviation
sigma_e = 1
x = np.c_[np.ones(N), np.random.randn(N)]
y = x.dot(b) + sigma_e * np.random.randn(N)
As a point of comparison to results in tensorflow, run classical OLS on our toy dataset:
# estimate parameter vector, errors, sd of errors, and se of parameters
bols = np.linalg.inv(x.T.dot(x)).dot(x.T.dot(y))
err = y - x.dot(bols)
sigma_ols = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]))
se = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]) * np.diagonal(np.linalg.inv(x.T.dot(x))))
# put results together for easy viewing
ols_parms = np.r_[bols, sigma_ols]
ols_se = np.r_[se, np.nan]
print(pd.DataFrame(np.c_[ols_parms, ols_se],columns=['estimate', 'std err'],
index=['b0', 'b1', 'sigma']))
# Out [123]: estimate std err b0 10.021294 0.044157 b1 -0.953204 0.046017 sigma 0.987337 NaN
Getting data into tensorflow tensors
Using tensorflow vs numpy syntax/functions is paramount for building likelihoods that will work for us. Unfortunately, numpy and matlab-like slicing and indexing does not always work which means that vectorizing loops requires quite alot of thought and the use of indices.2 Apart from that there are fairly minor differences from numpy and with tensorflow 2's "eager execution", code is easy to debug.
First, let's convert our data to tensorflow tensors. While not strictly necessary, this ensures that we have strict "typing" of data when we move to gpu operations:
X = tf.constant(x, dtype=dtype)
Y = tf.constant(y, dtype=dtype)
print(type(X))
print(X.dtype)
print(X.shape)
# extract the first 5 tensors into numpy array
print(X.numpy()[:5])
# Out [124]: <class 'tensorflow.python.framework.ops.EagerTensor'> <dtype: 'float32'> (500, 2) [[ 1. 0.47143516] [ 1. -1.1909757 ] [ 1. 1.432707 ] [ 1. -0.3126519 ] [ 1. -0.72058874]]
By default, Tensorflow 2.0 (what I am using) executes as "eager tensors". This enables easy debugging of tensorflow code but in reality what we are doing with tensorflow is building a model graph with functional linkages between tensors that we can use for our statistical analysis later.
Log-Likelihood in Tensorflow
Next let's define the log-likelihood function using tensorflow
primitives. It goes without saying that we could (and should) use
tensorflow_probability.distributions.normal
for this problem, but I
proceed with this example because it is a well known model with
multiple parameters and data for understanding how to setup problems
with custom log likelihoods. The log-likelihood over model parameters
\(\mathbf{b}\) and \(s\), the estimates for \(\beta\) and \(\sigma\) respectively, is given by
Programming this as a tensorflow function is fairly straightforward:
pi = tf.constant(np.pi, dtype=dtype)
@tf.function
def ols_loglike(beta, sigma):
# xb (mu_i for each observation)
mu = tf.linalg.matvec(X, beta)
# this is normal pdf logged and summed over all observations
ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\
(1./(2.*sigma**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
return ll
We can use this to evaluate the normal log likelihood given our data and parameters. Evaluating at \(\beta = \begin{bmatrix} 1 & 1 \end{bmatrix}\) and \(\sigma = 1\) yields the tensor scalar
ols_loglike(beta = [1., 1.], sigma=1.)
# Out [126]: : <tf.Tensor: shape=(), dtype=float32, numpy=-21840.611>
Note that the numpy value is the result of the tensorflow function evaluation due to eager execution but underlying all of this is a model compute graph.
Auto differentiation
Auto differentiation is possible since tensorflow in the background has defined the model graph given data for finding \(\frac{\partial LogL}{\partial \beta}\) and \(\frac{\partial LogL}{\partial \sigma}\). Evaluating the function and derivative at \(\beta = \begin{bmatrix} 1 & 1 \end{bmatrix}\) and \(\sigma=1\) we have
[funval, grads] = tfp.math.value_and_gradient(ols_loglike, [[1., 1.], 1.])
print("Function Value: ", funval)
print('Gradients on beta: \n', grads[0])
print('Gradient on sigma: \n', grads[1])
# Out [127]: Function Value: tf.Tensor(-21840.611, shape=(), dtype=float32) Gradients on beta: tf.Tensor([4501.1436 -855.3488], shape=(2,), dtype=float32) Gradient on sigma: tf.Tensor(42262.285, shape=(), dtype=float32)
This calculation occurs instantly and as importantly there was no need to programmatically define gradients. Also, the sign and magnitudes of the gradients are what we expect. Focusing on \(\beta\), at the values \(\beta=\begin{bmatrix} 1 & 1\end{bmatrix}\), \(\beta_0\) is too small relative to the "true" value (hence the positive gradient) whereas \(\beta_1\) is too large (hence the negative gradient).
Maximum Likelihood Estimation in Tensorflow
For finding the maximum likelihood estimate, we'll be minimizing the negative log-likelihood since tensorflow has an optimization library based on minimization (rather than maximizing the log-likelihood). Since we will be using a minimizer, we need to construct a negative log-likelihood function in tensorflow. Additionally, we need all of the parameters (the \(\beta\)'s and \(\sigma\)) to enter as a single vector.3
Defining the function and doing a "sanity check" below, shows that we have identical answers as before except that all values are of opposite sign.
# how to calculate gradiants
# objective function for minimization
@tf.function
def neg_ols_loglike(param_vec):
beta_split, sigma_split = tf.split(param_vec, [2,1], axis=0)
# need to take these back down to vectors and scalars:
beta_split = tf.reshape(beta_split,(2,))
sigma_split = tf.reshape(sigma_split,())
# xb (mu_i for each observation)
mu = tf.linalg.matvec(X, beta_split)
# this is normal pdf logged and summed over all observations
ll = -(X.shape[0]/2.)*tf.math.log(2.*pi*sigma_split**2.) -\
(1./(2.*sigma_split**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
return -1*ll
# return function value and gradients
@tf.function
def neg_like_and_gradient(parms):
return tfp.math.value_and_gradient(neg_ols_loglike, parms)
# using the same values as above
testval = tf.constant([1., 1., 1.], shape=(3))
out = neg_like_and_gradient(parms=testval)
print("Function value: ", out[0])
print("Gradients: ", out[1])
# Out [128]: Function value: tf.Tensor(21840.611, shape=(), dtype=float32) Gradients: tf.Tensor([ -4501.1436 855.3488 -42262.285 ], shape=(3,), dtype=float32)
An important note that may possibly save someone alot of time: for
the optimizer code below to work, the function values and gradients
returned by neg_like_and_gradient
must be the correct shape. For my
problem the correct shape is a scalar for the negative log likelihood
value (shape=()) and a 1d tensor (here the shape=(3,)) for the
gradients.
With that, we can conduct maximum likelihood estimation. The
bfgs_minimize
function returns quite a bit of information including
estimated parameters and the inverse hessian which is
needed for constructing standard errors.
# set some naiive starting values
start = [0., 0., 1.]
# optimization
optim_results = tfp.optimizer.bfgs_minimize(
neg_like_and_gradient, start, tolerance=1e-8)
# organize results
est_params = optim_results.position.numpy()
est_serr = np.sqrt(np.diagonal(optim_results.inverse_hessian_estimate.numpy()))
print(pd.DataFrame(np.c_[est_params, est_serr],columns=['estimate', 'std err'],
index=['b0', 'b1', 'sigma']))
# Out [129]: estimate std err b0 10.021295 0.044162 b1 -0.953205 0.046155 sigma 0.985360 0.031172
We can confirm the maximum likelihood solution by inspecting the
returned gradient information in optim_results
, or by calculating
directly:
out = neg_like_and_gradient(est_params)
print("Function value: ", out[0])
print("Gradients: ", out[1])
# Out [130]: Function value: tf.Tensor(702.0952, shape=(), dtype=float32) Gradients: tf.Tensor([ 5.8412552e-05 -1.8119812e-05 0.0000000e+00], shape=(3,), dtype=float32)
So, solution confirmed.4
MCMC in Tensorflow
The following shows the mechanics of Random-Walk Metropolis Hastings and No-Uturn Samplers for constructing Monte-Carlo Markov Chains. These results don't introduce priors on any parameters so all accept/reject decisions are based on likelihood values only.
Random Walk Metropolis-Hastings
Recall that the Random Walk Metropolis-Hastings (RWMH) algorithm formulates a proposal (\(\theta_1\)) from the current parameter position (\(\theta_0\)) as
\begin{equation} \theta_1 = \theta_0 + f(\delta) \end{equation}where \(\delta\) is a scale parameter (or potentially a vector of scale parameters for each parameter we are estimating) from the proposal distribution \(f(^.)\). Below, we implement a Normally distributed random walk proposal making the above condition
\begin{equation} \theta_1 = \theta_0 + \delta N(0,1) \end{equation}where \(N(0,1)\) is a standard normal random variate. The choice of scale (\(\delta\)) will heavily influence the acceptance rate from our MHRW sampler, and unfortunately I don't believe that tuning is built into the random walk sampler in tensorflow. The scale I use below was recovered from hand tuning the sampler where a single scale is chosen over our three parameters. For the sake of brevity I don't show that.
We will use these parameters for our final chain once we have
uncovered the appropriate scale of our proposal. Next, we use the
tensorflow_probability.mcmc
library for sampling. For any mcmc
model, this involves
- Setting up a kernel defining the likelihood and prior log-likelihood probabilities and the proposal (and potentially including the proposal scale)
- Specifying starting values
- Sampling from the kernel
Based on hand tuning, I found a proposal scale of .4 to provide a good acceptance rate. Also, let's start the sampling at our maximum likelihood estimates (there are potential downsides to this that I won't discuss here):
@tf.function
def mcmc_sampler(init_vals):
kernel = tfp.mcmc.RandomWalkMetropolis(
target_log_prob_fn=ols_loglike,
new_state_fn=tfp.mcmc.random_walk_normal_fn(scale=.04),seed=42)
samples_, stats_ = tfp.mcmc.sample_chain(
num_results=mhrw_samples,
current_state= init_vals,
kernel=kernel,
num_burnin_steps=mhrw_burnin,
parallel_iterations=1)
return samples, stats
# initialize at MLE estimates
init = [est_params[:2],est_params[-1]]
samples, stats = mcmc_sampler(init)
print('Final acceptance rate for step size %2.3f: %2.3f '% (.04,stats.is_accepted.numpy().sum()/mhrw_samples))
# Out [131]: Final acceptance rate for step size 0.040: 0.448
After sampling, we see the acceptance rate is near .45. The mean and standard deviation of the posterior samples are:
# summarize results
trace_sigma = samples[1].numpy()
trace_beta = samples[0].numpy()
est_mh = np.r_[trace_beta.mean(axis=0), trace_sigma.mean()]
std_mh = np.r_[trace_beta.std(axis=0), trace_sigma.std()]
# assemble and print
print(pd.DataFrame(np.c_[est_mh, std_mh],columns=['estimate', 'std err'],
index=['b0', 'b1', 'sigma']))
# Out [132]: estimate std err b0 10.020264 0.043223 b1 -0.953512 0.048436 sigma 0.986995 0.031827
Note: tensorflow random walk metropolis hastings examples from the web like the
one here uses the default scale of 1. For this problem, it resulted
in chains that never accepted any proposals after approximately 25
samples. Finding a scale that works for your problem seems like a
necessary step in using
tensorflow_probability.mcmc.RandomWalkMetropolis
unless I have
missed something obvious and auto tuning of the proposal scale is
possible.
No U-Turn Sampler (Nuts)
The Nuts sampler in tensorflow can be setup to auto-tune the proposal
and doesn't require us to worry so much about proposal scale.
Furthermore, it uses the gradients provided by tensorflow for smarter
proposals than a random walk and for nearly all problems would be the
recommended approach. One need only supply an initial step size, and I
found that estimating the model in Nuts for this toy example the
estimates are not particularly sensitive to the parameter
init_step_size
which governs the behavior of proposals generated by
Nuts. As always, your mileage may vary. The Nuts sampler is easy to
setup and implement:
nuts_samples = 1000
nuts_burnin = 500
chains = 4
## Initial step size
init_step_size=.3
##
## NUTS (using inner step size averaging step)
##
@tf.function
def nuts_sampler(init):
nuts_kernel = tfp.mcmc.NoUTurnSampler(
target_log_prob_fn=ols_loglike,
step_size=init_step_size,
)
adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
inner_kernel=nuts_kernel,
num_adaptation_steps=nuts_burnin,
step_size_getter_fn=lambda pkr: pkr.step_size,
log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
)
samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
num_results=nuts_samples,
current_state=init,
kernel=adapt_nuts_kernel,
num_burnin_steps=100,
parallel_iterations=5)
return samples_nuts_, stats_nuts_
samples_nuts, stats_nuts = nuts_sampler(init)
trace_sigman = samples_nuts_[1].numpy()
trace_betan = samples_nuts_[0].numpy()
est_nuts = np.r_[trace_betan.mean(axis=0), trace_sigman.mean()]
std_nuts = np.r_[trace_betan.std(axis=0), trace_sigman.std()]
# assemble and print
print(pd.DataFrame(np.c_[est_nuts, std_nuts],columns=['estimate', 'std err'],
index=['b0', 'b1', 'sigma']))
# Out [134]: estimate std err b0 10.022340 0.044498 b1 -0.955192 0.044316 sigma 0.991023 0.032741
Multiple Chains
All of the discussion thus far samples a single chain. With multiple
processor cores on either CPU or GPU's there are opportunities to
sample in parallel very quickly.5 This provides more information about
the posterior and to test for convergence by comparing multiple
chains. Provided your log-likelihood function is written correctly
(is able to handle vectorized calculations), this is accomplished by
providing starting values with the appropriate shape (many thanks to
the gist from this blog post for making this clear). The pymc4
function tile_init
converted to python is useful for doing this.
#
# Setup starting vals for multiple chains
#
def tile_init(init, num_repeats):
"""
create tiled initial values for multiple chains
idea from pymc4.inference.sample:
https://github.com/pymc-devs/pymc4/blob/master/pymc4/inference/sampling.py
"""
return [np.tile(np.expand_dims(tens, 0), [num_repeats] + [1] * tens.ndim) for tens in init]
init = tile_init(init, chains)
print("Multiple chains requested: ", chains)
init
# Out [135]: Multiple chains requested: 4 : [array([[10.021295 , -0.9532046], : [10.021295 , -0.9532046], : [10.021295 , -0.9532046], : [10.021295 , -0.9532046]], dtype=float32), : array([0.98536015, 0.98536015, 0.98536015, 0.98536015], dtype=float32)]
Note the inital values are copied 4 times (in the row dimension for \(\beta\)) for all of our parameters and the NUTs function is defined with a python variable, which forces a rebuild of the compute graph if the shape of the initial values change. With these starting values, the code for sampling is identical to before except this time we will have four independent chains.
samples_nuts_mult, stats_nuts_mult = nuts_sampler(init)
Summarizing means from each of the chains gives:
print("Shape of samples (for beta): ", samples_nuts_mult[0].shape)
print("Mean for each chain and parameter (beta only): ")
data = samples_nuts_mult[0].numpy().mean(axis=0)
df = pd.DataFrame(data, columns=['bo', 'b1'], index=['chain ' + str(i+1) for i in range(4)])
df
# Out [137]: Shape of samples (for beta): (1000, 4, 2) Mean for each chain and parameter (beta only): : bo b1 : chain 1 10.020152 -0.951442 : chain 2 10.020010 -0.952338 : chain 3 10.019940 -0.954123 : chain 4 10.020774 -0.952213
I ran this same code for 200 chains of 2000 samples each on an Nvidia Tesla P100-16GB GPU in less than 11 seconds:
Number of Chains: 200 Elapsed time (seconds): 10.79 Shape of beta: (2000, 200, 2)
Footnotes:
Theano is designed for and does take advantage of Nvidia graphics cards but for my purposes (and my purposes only) it did not lead to any speed-up for a wide range of models I attempted (I was using PyMC3: see footnote 8 here). Additionally Theano 1.0 from November 2017 is the final release.
The indexing issues described above are not covered in this post.
This may not be an absolute requirement, but I was unable to get the minimization code working for segmented \(\beta\) and \(\sigma\), whereas I was successful treating all parameters to be estimated as a vector.
A word of warning: the minimization routine was sensitive to how I defined the log-likelihood. For some reason, another mathematically equivalent version of the log-likelihood (with identical gradients and function values) failed to converge with the starting values used above.
While not discussed in the post, one could also use multiple starting values from a grid in a maximum likelihood estimation context if the log-likelihood isn't globally concave. All of this could be done in parallel.