Tensorflow with Custom Likelihood Functions

This post builds on earlier ones dealing with custom likelihood functions in python and maximum likelihood estimation with auto differentiation. It approaches tensorflow from an econometrics perspective and is based on a series of tests and notes I developed for using tensorflow in some of my work. In my early explorations of tensorflow, nearly all of the examples I encountered were written from a machine learning perspective, which made it difficult to adapt the code to econometric problems. For the tensorflow uninitiated who want to dive in (like me!), I hope this will prove useful.

The goals of the post:

  1. Some tensorflow basics I wish I had known before I started this work
  2. Define a custom log-likelihood function in tensorflow and perform differentiation over model parameters to illustrate how, under the hood, tensorflow's model graph is designed to calculate derivatives "free of charge" (no programming required and very little to no additional compute time).
  3. Use the tensorflow log-likelihood to estimate a maximum likelihood model using tensorflow_probability.optimizer capabilities.
  4. Illustrate how the tensorflow_probability.mcmc libraries can be used with custom log-likelihoods.

So why bother? The earlier posts outlined above show that python's autograd and theano have auto differentiation capabilities, which we used successfully to speed up some models. Unfortunately, those libraries aren't optimal for the current revolution in machine learning enabled by gpu computing.1 Tensorflow is a library that opens doors for econometricians, enabling much faster processing (and massive parallelization of models).

Tensorflow Basics

As described in the tensorflow performance tutorial, wrap your functions in @tf.function. Without this, your code will run very slowly. All of the results below do this, and it makes a huge difference in runtimes. Here is an example of two functions that look similar and use the tensorflow matmul op but are very different in terms of how tensorflow compiles (or doesn't compile) code in the backend. Additionally, you can try the experimental_compile mode, as it can shave over 50% off runtimes when it works:

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
import numpy as np
import pandas as pd

def mult_fun(x, y):
    return tf.matmul(x, y)

@tf.function(experimental_compile=True)
def mult_tf_fun(x,y):
    return tf.matmul(x, y)

print("The function mult_fun runs slow: ", type(mult_fun))
print("The function mult_tf_fun runs fast: ", type(mult_tf_fun))
# Out [120]: 
The function mult_fun runs slow:  <class 'function'>
The function mult_tf_fun runs fast:  <class 'tensorflow.python.eager.def_function.Function'>

You want your functions to have the type of the second one.

The data types of your tensorflow function arguments also matter. In general, python arguments cause tensorflow to rebuild the compute graph each time a new value is passed, whereas tensor arguments do not trigger a rebuild (and so run faster). If you need the graph to change at runtime (e.g. the number of samples, number of chains, etc.), pass that quantity in as a python variable, since a rebuild of the graph is exactly what is needed for the code to run successfully.
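
To see the difference, here is a minimal sketch of the retracing behavior (the function below is made up for illustration and relies on the imports above). The python-side print only fires when tensorflow traces a new graph:

@tf.function
def scaled_sum(x, n):
    # python print statements run only while tensorflow traces a new graph
    print("tracing for n =", n)
    return n * tf.reduce_sum(x)

v = tf.constant([1., 2., 3.])
scaled_sum(v, 2)                 # traces a new graph
scaled_sum(v, 3)                 # new python value -> traces again
scaled_sum(v, tf.constant(2.))   # traces once for this tensor signature
scaled_sum(v, tf.constant(3.))   # same shape/dtype -> graph is reused, no retrace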

A toy problem

To begin, we will be generating data consistent with ordinary least squares:

\begin{equation} \mathbf{y} = \mathbf{x} \beta + \epsilon \end{equation}

where \(\mathbf{y}\) is an \(N \times 1\) vector of dependent variables, \(\mathbf{x}\) is an \(N \times K\) matrix of independent variables, \(\beta\) is the \(K\times 1\) vector of parameters we are trying to estimate, and \(\epsilon\) is an \(N \times 1\) vector of iid mean zero homoskedastic unobservables having variance \(\sigma^2\).

Here we generate data for \(N=500\) and \(K=2\):

# set tensorflow data type
dtype = tf.float32

##
## simple OLS Data Generation Process
##
# True beta
b = np.array([10, -1])
N = 500
# True error std deviation
sigma_e = 1

x = np.c_[np.ones(N), np.random.randn(N)]
y = x.dot(b) + sigma_e * np.random.randn(N)

As a point of comparison to results in tensorflow, run classical OLS on our toy dataset:

# estimate parameter vector, errors, sd of errors, and se of parameters
bols = np.linalg.inv(x.T.dot(x)).dot(x.T.dot(y))
err = y - x.dot(bols)
sigma_ols = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]))
se = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]) * np.diagonal(np.linalg.inv(x.T.dot(x))))
# put results together for easy viewing
ols_parms = np.r_[bols, sigma_ols]
ols_se = np.r_[se, np.nan]
print(pd.DataFrame(np.c_[ols_parms, ols_se],columns=['estimate', 'std err'],
      index=['b0', 'b1', 'sigma']))
# Out [123]: 
        estimate   std err
b0     10.021294  0.044157
b1     -0.953204  0.046017
sigma   0.987337       NaN

Getting data into tensorflow tensors

Using tensorflow rather than numpy syntax/functions is paramount for building likelihoods that will work for us. Unfortunately, numpy- and matlab-like slicing and indexing does not always work, which means that vectorizing loops requires quite a lot of thought and the use of indices.2 Apart from that, the differences from numpy are fairly minor, and with tensorflow 2's "eager execution", code is easy to debug.

First, let's convert our data to tensorflow tensors. While not strictly necessary, this ensures that we have strict "typing" of data when we move to gpu operations:

X = tf.constant(x, dtype=dtype)
Y = tf.constant(y, dtype=dtype)
print(type(X))
print(X.dtype)
print(X.shape)
# extract the first 5 tensors into numpy array
print(X.numpy()[:5])
# Out [124]: 
<class 'tensorflow.python.framework.ops.EagerTensor'>
<dtype: 'float32'>
(500, 2)
[[ 1.          0.47143516]
 [ 1.         -1.1909757 ]
 [ 1.          1.432707  ]
 [ 1.         -0.3126519 ]
 [ 1.         -0.72058874]]

By default, Tensorflow 2.0 (what I am using) executes eagerly, producing "eager tensors". This makes tensorflow code easy to debug, but in reality what we are doing with tensorflow is building a model graph with functional linkages between tensors that we can use for our statistical analysis later.
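
A quick, purely illustrative way to confirm that eager mode is active:

# eager execution is the default in tensorflow 2.x
print(tf.executing_eagerly())
# True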

Log-Likelihood in Tensorflow

Next let's define the log-likelihood function using tensorflow primitives. It goes without saying that we could (and should) use tensorflow_probability.distributions.Normal for this problem, but I proceed with this example because it is a well-known model with multiple parameters and data, useful for understanding how to set up problems with custom log-likelihoods. The log-likelihood over model parameters \(\mathbf{b}\) and \(s\), the estimates for \(\beta\) and \(\sigma\) respectively, is given by

\begin{align} LogL =& \sum_{i=1}^N log \left( p(y_i | \mathbf{x}_i, \mathbf{b},s) \right) = \sum_{i=1}^N log \left( \frac{1}{\sqrt{2 \pi s^2}} e^{\frac{-(y_i-\mathbf{x}_i \mathbf{b})^2}{2 s^2}} \right) \nonumber \\ =& - \frac{N}{2}log(2 \pi s^2) - \sum_{i=1}^N \frac{1}{2 s^2} (y_i - \mathbf{x}_i \mathbf{b})^2 \end{align}

Programming this as a tensorflow function is fairly straightforward:

pi = tf.constant(np.pi, dtype=dtype)

@tf.function
def ols_loglike(beta, sigma):
    # xb (mu_i for each observation)
    mu = tf.linalg.matvec(X, beta)
    # this is normal pdf logged and summed over all observations
    ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\
            (1./(2.*sigma**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
    return ll
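
For what it's worth, the built-in Normal distribution mentioned above gives the same quantity; a quick cross-check sketch (not used in the rest of the post):

@tf.function
def ols_loglike_tfd(beta, sigma):
    # same log-likelihood via tfd.Normal, summed over observations
    mu = tf.linalg.matvec(X, beta)
    return tf.math.reduce_sum(tfd.Normal(loc=mu, scale=sigma).log_prob(Y), axis=-1)

# should agree with ols_loglike(beta=[1., 1.], sigma=1.) up to float32 rounding
print(ols_loglike_tfd(beta=[1., 1.], sigma=1.))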

We can use this to evaluate the normal log likelihood given our data and parameters. Evaluating at \(\beta = \begin{bmatrix} 1 & 1 \end{bmatrix}\) and \(\sigma = 1\) yields the tensor scalar

ols_loglike(beta = [1., 1.], sigma=1.)
# Out [126]: 
: <tf.Tensor: shape=(), dtype=float32, numpy=-21840.611>

Note that, due to eager execution, the numpy value shown is the result of evaluating the tensorflow function, but underlying all of this is a model compute graph.

Auto differentiation

Auto differentiation is possible because, given our data, tensorflow has already defined the model graph needed in the background for finding \(\frac{\partial LogL}{\partial \beta}\) and \(\frac{\partial LogL}{\partial \sigma}\). Evaluating the function and derivatives at \(\beta = \begin{bmatrix} 1 & 1 \end{bmatrix}\) and \(\sigma=1\), we have

[funval, grads] = tfp.math.value_and_gradient(ols_loglike, [[1., 1.], 1.])
print("Function Value: ", funval)
print('Gradients on beta: \n', grads[0])
print('Gradient on sigma: \n', grads[1])
# Out [127]: 
Function Value:  tf.Tensor(-21840.611, shape=(), dtype=float32)
Gradients on beta: 
 tf.Tensor([4501.1436 -855.3488], shape=(2,), dtype=float32)
Gradient on sigma: 
 tf.Tensor(42262.285, shape=(), dtype=float32)

This calculation occurs almost instantly and, as importantly, there was no need to programmatically define gradients. Also, the signs and magnitudes of the gradients are what we expect. Focusing on \(\beta\), at the values \(\beta=\begin{bmatrix} 1 & 1\end{bmatrix}\), \(\beta_0\) is too small relative to the "true" value (hence the positive gradient), whereas \(\beta_1\) is too large (hence the negative gradient).
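
For reference, tfp.math.value_and_gradient is essentially a convenience wrapper around tf.GradientTape; here is a minimal sketch of the same calculation done by hand (the variable names are mine):

# the same gradients via GradientTape directly
beta_var = tf.Variable([1., 1.], dtype=dtype)
sigma_var = tf.Variable(1., dtype=dtype)
with tf.GradientTape() as tape:
    ll = ols_loglike(beta_var, sigma_var)
grad_beta, grad_sigma = tape.gradient(ll, [beta_var, sigma_var])
print(grad_beta, grad_sigma)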

Maximum Likelihood Estimation in Tensorflow

For finding the maximum likelihood estimate, we'll be minimizing the negative log-likelihood, since tensorflow's optimization library is built around minimization rather than maximization. Since we will be using a minimizer, we need to construct a negative log-likelihood function in tensorflow. Additionally, we need all of the parameters (the \(\beta\)'s and \(\sigma\)) to enter as a single vector.3

Defining the function and doing a "sanity check" below shows that we get the same answers as before, except that all values are of opposite sign.

# objective function for minimization: negative log-likelihood over a single parameter vector
@tf.function
def neg_ols_loglike(param_vec):
    beta_split, sigma_split = tf.split(param_vec, [2, 1], axis=0)
    # need to take these back down to vectors and scalars:
    beta_split = tf.reshape(beta_split, (2,))
    sigma_split = tf.reshape(sigma_split, ())
    # xb (mu_i for each observation)
    mu = tf.linalg.matvec(X, beta_split)
    # this is the normal pdf logged and summed over all observations
    ll = -(X.shape[0]/2.)*tf.math.log(2.*pi*sigma_split**2.) -\
            (1./(2.*sigma_split**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
    return -1*ll

# return function value and gradients
@tf.function
def neg_like_and_gradient(parms):
    return tfp.math.value_and_gradient(neg_ols_loglike, parms)

# using the same values as above
testval = tf.constant([1., 1., 1.], shape=(3))
out = neg_like_and_gradient(parms=testval)
print("Function value: ", out[0])
print("Gradients: ", out[1])
# Out [128]: 
Function value:  tf.Tensor(21840.611, shape=(), dtype=float32)
Gradients:  tf.Tensor([ -4501.1436    855.3488 -42262.285 ], shape=(3,), dtype=float32)

An important note that may save someone a lot of time: for the optimizer code below to work, the function value and gradients returned by neg_like_and_gradient must have the correct shapes. For this problem, the correct shapes are a scalar for the negative log-likelihood value (shape=()) and a 1-d tensor (here shape=(3,)) for the gradients.
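
A quick way to check those shapes before handing the function to the optimizer:

val, grad = neg_like_and_gradient(testval)
print(val.shape)    # () -- a scalar objective value
print(grad.shape)   # (3,) -- one gradient entry per parameter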

With that, we can conduct maximum likelihood estimation. The bfgs_minimize function returns quite a bit of information including estimated parameters and the inverse hessian which is needed for constructing standard errors.

# set some naive starting values
start = [0., 0., 1.]

# optimization
optim_results = tfp.optimizer.bfgs_minimize(
    neg_like_and_gradient, start, tolerance=1e-8)

# organize results
est_params = optim_results.position.numpy()
est_serr = np.sqrt(np.diagonal(optim_results.inverse_hessian_estimate.numpy()))
print(pd.DataFrame(np.c_[est_params, est_serr],columns=['estimate', 'std err'],
      index=['b0', 'b1', 'sigma']))
# Out [129]: 
        estimate   std err
b0     10.021295  0.044162
b1     -0.953205  0.046155
sigma   0.985360  0.031172

We can confirm the maximum likelihood solution by inspecting the returned gradient information in optim_results, or by calculating directly:

out = neg_like_and_gradient(est_params)
print("Function value: ", out[0])
print("Gradients: ", out[1])
# Out [130]: 
Function value:  tf.Tensor(702.0952, shape=(), dtype=float32)
Gradients:  tf.Tensor([ 5.8412552e-05 -1.8119812e-05  0.0000000e+00], shape=(3,), dtype=float32)

So, solution confirmed.4
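
Beyond the gradient check, optim_results also carries convergence diagnostics (the field names below follow tfp's BfgsOptimizerResults):

# convergence information returned by bfgs_minimize
print("Converged: ", optim_results.converged.numpy())
print("Iterations: ", optim_results.num_iterations.numpy())
print("Objective at optimum: ", optim_results.objective_value.numpy())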

MCMC in Tensorflow

The following shows the mechanics of the Random Walk Metropolis-Hastings and No-U-Turn samplers for constructing Markov chain Monte Carlo (MCMC) samples. These results don't place priors on any parameters, so all accept/reject decisions are based on likelihood values only.

Random Walk Metropolis-Hastings

Recall that the Random Walk Metropolis-Hastings (RWMH) algorithm formulates a proposal (\(\theta_1\)) from the current parameter position (\(\theta_0\)) as

\begin{equation} \theta_1 = \theta_0 + f(\delta) \end{equation}

where \(\delta\) is a scale parameter (or potentially a vector of scale parameters, one for each parameter we are estimating) for the proposal distribution \(f(\cdot)\). Below, we implement a normally distributed random walk proposal, making the update above

\begin{equation} \theta_1 = \theta_0 + \delta N(0,1) \end{equation}

where \(N(0,1)\) is a standard normal random variate. The choice of scale (\(\delta\)) heavily influences the acceptance rate of our RWMH sampler, and unfortunately I don't believe that tuning is built into the random walk sampler in tensorflow. The scale I use below was recovered by hand tuning the sampler, with a single scale chosen across our three parameters. For the sake of brevity I don't show that.
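
To make the mechanics concrete, here is a minimal numpy sketch of a single RWMH step using the negative log-likelihood defined earlier (purely illustrative; tfp.mcmc handles all of this internally below):

# one random walk Metropolis-Hastings step (illustration only)
rng = np.random.default_rng(0)

def rwmh_step(theta0, scale, log_prob_fn):
    # symmetric normal random walk proposal
    proposal = theta0 + scale * rng.standard_normal(theta0.shape).astype(theta0.dtype)
    # accept with probability min(1, exp(logp(proposal) - logp(current)))
    log_accept_ratio = log_prob_fn(proposal) - log_prob_fn(theta0)
    if np.log(rng.uniform()) < log_accept_ratio:
        return proposal, True
    return theta0, False

theta, accepted = rwmh_step(
    np.array([10., -1., 1.], dtype=np.float32), 0.04,
    lambda p: -neg_ols_loglike(tf.constant(p, dtype=dtype)).numpy())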

Once we have uncovered an appropriate proposal scale, we will use it for our final chain. Next, we use the tensorflow_probability.mcmc library for sampling. For any mcmc model, this involves:

  1. Setting up a kernel that defines the target log-probability (the log-likelihood plus any prior log-probabilities) and the proposal (potentially including the proposal scale)
  2. Specifying starting values
  3. Sampling from the kernel

Based on hand tuning, I found a proposal scale of .04 to provide a good acceptance rate. Also, let's start the sampling at our maximum likelihood estimates (there are potential downsides to this that I won't discuss here).
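
The sampler settings referenced in the code below (mhrw_samples and mhrw_burnin) need to be defined first; the values here are assumptions, so adjust to taste:

# number of retained samples and burn-in draws (values assumed)
mhrw_samples = 5000
mhrw_burnin = 1000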

@tf.function
def mcmc_sampler(init_vals):
    kernel = tfp.mcmc.RandomWalkMetropolis(
            target_log_prob_fn=ols_loglike,
            new_state_fn=tfp.mcmc.random_walk_normal_fn(scale=.04),seed=42)

    samples_, stats_ = tfp.mcmc.sample_chain(
            num_results=mhrw_samples,
            current_state= init_vals,
            kernel=kernel,
            num_burnin_steps=mhrw_burnin,
            parallel_iterations=1)
    return samples_, stats_

# initialize at MLE estimates
init = [est_params[:2],est_params[-1]]
samples, stats = mcmc_sampler(init) 
print('Final acceptance rate for step size %2.3f: %2.3f '% (.04,stats.is_accepted.numpy().sum()/mhrw_samples))
# Out [131]: 
Final acceptance rate for step size 0.040: 0.448 

After sampling, we see the acceptance rate is near .45. The mean and standard deviation of the posterior samples are:

# summarize results
trace_sigma = samples[1].numpy()
trace_beta = samples[0].numpy()
est_mh = np.r_[trace_beta.mean(axis=0), trace_sigma.mean()]
std_mh = np.r_[trace_beta.std(axis=0), trace_sigma.std()]
# assemble and print
print(pd.DataFrame(np.c_[est_mh, std_mh],columns=['estimate', 'std err'],
      index=['b0', 'b1', 'sigma']))
# Out [132]: 
        estimate   std err
b0     10.020264  0.043223
b1     -0.953512  0.048436
sigma   0.986995  0.031827

Note: tensorflow random walk metropolis hastings examples from the web, like the one here, use the default scale of 1. For this problem, that resulted in chains that never accepted any proposals after approximately 25 samples. Finding a scale that works for your problem seems to be a necessary step in using tensorflow_probability.mcmc.RandomWalkMetropolis, unless I have missed something obvious and auto-tuning of the proposal scale is possible.

No U-Turn Sampler (Nuts)

The Nuts sampler in tensorflow can be set up to auto-tune the proposal, so we don't need to worry as much about the proposal scale. Furthermore, it uses the gradients provided by tensorflow to make smarter proposals than a random walk, and for nearly all problems it would be the recommended approach. One need only supply an initial step size; for this toy example, the estimates were not particularly sensitive to the init_step_size parameter, which governs the behavior of the proposals Nuts generates. As always, your mileage may vary. The Nuts sampler is easy to set up and implement:

nuts_samples = 1000
nuts_burnin = 500
chains = 4
## Initial step size
init_step_size=.3

##
## NUTS (using inner step size averaging step)
##
@tf.function
def nuts_sampler(init):
    nuts_kernel = tfp.mcmc.NoUTurnSampler(
            target_log_prob_fn=ols_loglike, 
            step_size=init_step_size,
            )
    adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        inner_kernel=nuts_kernel,
        num_adaptation_steps=nuts_burnin,
        step_size_getter_fn=lambda pkr: pkr.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
       )

    samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
        num_results=nuts_samples,
        current_state=init,
        kernel=adapt_nuts_kernel,
        num_burnin_steps=100,
        parallel_iterations=5)
    return samples_nuts_, stats_nuts_

samples_nuts, stats_nuts = nuts_sampler(init)
trace_sigman = samples_nuts[1].numpy()
trace_betan = samples_nuts[0].numpy()
est_nuts = np.r_[trace_betan.mean(axis=0), trace_sigman.mean()]
std_nuts = np.r_[trace_betan.std(axis=0), trace_sigman.std()]
# assemble and print
print(pd.DataFrame(np.c_[est_nuts, std_nuts],columns=['estimate', 'std err'],
      index=['b0', 'b1', 'sigma']))
# Out [134]: 
        estimate   std err
b0     10.022340  0.044498
b1     -0.955192  0.044316
sigma   0.991023  0.032741

Multiple Chains

All of the discussion thus far samples a single chain. With multiple processor cores on either CPUs or GPUs, there are opportunities to sample in parallel very quickly.5 This provides more information about the posterior and allows us to test for convergence by comparing multiple chains. Provided your log-likelihood function is written correctly (i.e., it can handle vectorized calculations), this is accomplished by providing starting values with the appropriate shape (many thanks to the gist from this blog post for making this clear). The tile_init function adapted from pymc4 is useful for doing this.

# 
# Setup starting vals for multiple chains
#
def tile_init(init, num_repeats):
    """
    create tiled initial values for multiple chains
    idea from pymc4.inference.sample:
    https://github.com/pymc-devs/pymc4/blob/master/pymc4/inference/sampling.py    
    """
    return [np.tile(np.expand_dims(tens, 0), [num_repeats] + [1] * tens.ndim) for tens in init]

init = tile_init(init, chains)
print("Multiple chains requested: ", chains)
init
# Out [135]: 
Multiple chains requested:  4

: [array([[10.021295 , -0.9532046],
:         [10.021295 , -0.9532046],
:         [10.021295 , -0.9532046],
:         [10.021295 , -0.9532046]], dtype=float32),
:  array([0.98536015, 0.98536015, 0.98536015, 0.98536015], dtype=float32)]

Note the initial values are copied 4 times (along the row dimension for \(\beta\)) for all of our parameters. Because the nuts_sampler function takes its initial values as a python argument, the compute graph is rebuilt when the shape of the initial values changes, which is exactly what we need here. With these starting values, the code for sampling is identical to before, except this time we will have four independent chains.

samples_nuts_mult, stats_nuts_mult = nuts_sampler(init)

Summarizing means from each of the chains gives:

print("Shape of samples (for beta): ", samples_nuts_mult[0].shape)
print("Mean for each chain and parameter (beta only): ")
data = samples_nuts_mult[0].numpy().mean(axis=0)
df = pd.DataFrame(data, columns=['b0', 'b1'], index=['chain ' + str(i+1) for i in range(4)])
df
# Out [137]: 
Shape of samples (for beta):  (1000, 4, 2)
Mean for each chain and parameter (beta only): 

:                 b0        b1
: chain 1  10.020152 -0.951442
: chain 2  10.020010 -0.952338
: chain 3  10.019940 -0.954123
: chain 4  10.020774 -0.952213
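
With multiple chains in hand, tfp's Gelman-Rubin diagnostic provides a quick convergence check; a minimal sketch using the draws above:

# potential scale reduction (R-hat); values near 1 indicate the chains agree
rhat_beta = tfp.mcmc.potential_scale_reduction(samples_nuts_mult[0],
                                               independent_chain_ndims=1)
rhat_sigma = tfp.mcmc.potential_scale_reduction(samples_nuts_mult[1],
                                                independent_chain_ndims=1)
print("R-hat for beta: ", rhat_beta.numpy())
print("R-hat for sigma: ", rhat_sigma.numpy())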

I ran this same code for 200 chains of 2000 samples each on an Nvidia Tesla P100-16GB GPU in less than 11 seconds:

Number of Chains:  200
Elapsed time (seconds):  10.79
Shape of beta:  (2000, 200, 2)

Footnotes:

1

Theano is designed for, and does take advantage of, Nvidia graphics cards, but for my purposes (and my purposes only) it did not lead to any speed-up for a wide range of models I attempted (I was using PyMC3: see footnote 8 here). Additionally, Theano 1.0, released in November 2017, is the final release.

2

The indexing issues described above are not covered in this post.

3

This may not be an absolute requirement, but I was unable to get the minimization code working with \(\beta\) and \(\sigma\) as separate arguments, whereas I was successful when treating all parameters to be estimated as a single vector.

4

A word of warning: the minimization routine was sensitive to how I defined the log-likelihood. For some reason, another mathematically equivalent version of the log-likelihood (with identical gradients and function values) failed to converge with the starting values used above.

5

While not discussed in the post, one could also use multiple starting values from a grid in a maximum likelihood estimation context if the log-likelihood isn't globally concave. All of this could be done in parallel.