#+BEGIN_COMMENT
.. title: Tensorflow with Custom Likelihood Functions
.. slug: custom-likes-tensorflow
.. date: 2020-02-24 9:15:50 UTC+01:00
.. tags: tensorflow, likelihood, mcmc, maximum likelihood
.. status:
.. has_math: yes
.. category:
.. link:
.. description:
.. type: text
#+END_COMMENT
#+BEGIN_SRC emacs-lisp :exports none
(setf (cdr (assoc :results org-babel-default-header-args:ipython)) "output replace")
#+END_SRC
#+RESULTS:
: output replace
This post builds on earlier ones dealing with [[./estimating-custom-mle.html][custom likelihood
functions in python]] and maximum likelihood estimation with [[./mle-autograd.html][auto
differentiation]]. This post approaches tensorflow from an
econometrics perspective and is based on a series of tests and notes I
developed while using tensorflow in my own work. In early
explorations of tensorflow nearly all of the examples I encountered
were from a machine learning perspective, making it difficult to fit
code examples to econometric problems. For the tensorflow uninitiated
who want to dive in (like me!), I hope this will prove useful.
The goals of the post:
1. Some tensorflow basics I wish I had known before I started this work
2. Define a custom log-likelihood function in tensorflow and perform
   differentiation over model parameters to illustrate how, under the
   hood, tensorflow's model graph is designed to calculate derivatives
   "free of charge" (no programming required and very little to no
   additional compute time).
3. Use the tensorflow log-likelihood to estimate a maximum likelihood
   model using =tensorflow_probability.optimizer= capabilities.
4. Illustrate how the =tensorflow_probability.mcmc= libraries can be
   used with custom log-likelihoods.
#+HTML:
So why bother? The earlier posts outlined above show that python's
=autograd= and =theano= have auto differentiation capabilities, which
we used successfully to speed up some models. Unfortunately, those
libraries aren't optimal for the current revolution in machine
learning enabled by gpu computing.[fn:1] Tensorflow is a library that
opens doors for econometricians enabling much faster processing (and
massive parallelization of models).
* Tensorflow Basics
As described in the [[https://www.tensorflow.org/tutorials/customization/performance][tensorflow performance tutorial]], decorate your
functions with =@tf.function=. Without this, your code will run very
slowly. All of the results below do this and it makes a huge
difference in runtimes. Here is an example of two functions that look
similar and use the tensorflow =matmul= op but are very different in
terms of how tensorflow compiles (or doesn't compile) code in the
backend. Additionally, you can try the =experimental_compile= mode
as it can shave over 50% off runtimes when it works:
#+BEGIN_SRC ipython :exports both
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
import numpy as np
import pandas as pd
def mult_fun(x, y):
    return tf.matmul(x, y)

@tf.function(experimental_compile=True)
def mult_tf_fun(x, y):
    return tf.matmul(x, y)
print("The function mult_fun runs slow: ", type(mult_fun))
print("The function mult_tf_fun runs fast: ", type(mult_tf_fun))
#+END_SRC
#+RESULTS:
: # Out [120]:
: The function mult_fun runs slow:  <class 'function'>
: The function mult_tf_fun runs fast:  <class 'tensorflow.python.eager.def_function.Function'>
:
You want your functions to have the type of the second one.
The data types of your tensorflow function arguments also matter. In
general, python arguments force tensorflow to rebuild the compute
graph on each call, whereas tensor arguments do not (and so run
faster). If the graph must change at runtime (e.g., the number of
samples or chains), pass that value in as a python variable, since the
rebuild it triggers is exactly what is needed for the code to run
successfully.
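The retracing behavior is easy to demonstrate by counting how often tensorflow traces a function (a small sketch of my own, not from the tensorflow tutorial):

```python
import tensorflow as tf

trace_log = []

@tf.function
def square_sum(x):
    # this python side effect runs only when tensorflow traces a new graph
    trace_log.append(1)
    return tf.reduce_sum(x * x)

# python scalar arguments: each new value triggers a fresh trace
square_sum(2.0)
square_sum(3.0)
python_traces = len(trace_log)

# tensor arguments: one trace covers all values with the same shape/dtype
trace_log.clear()
square_sum(tf.constant(2.0))
square_sum(tf.constant(3.0))
tensor_traces = len(trace_log)

print(python_traces, tensor_traces)  # → 2 1
```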
#+BEGIN_SRC ipython :exports none
import warnings
warnings.filterwarnings("ignore", category=Warning)
# set seed so results never change
np.random.seed(1234)
#+END_SRC
#+RESULTS:
: # Out [121]:
** A toy problem
To begin, we will be generating data consistent with ordinary least squares:
\begin{equation}
\mathbf{y} = \mathbf{x} \beta + \epsilon
\end{equation}
where $\mathbf{y}$ is an $N \times 1$ vector of dependent variables,
$\mathbf{x}$ is an $N \times K$ matrix of independent variables, $\beta$
are the $K\times 1$ parameters we are trying to estimate, and
$\epsilon$ is an $N \times 1$ vector of iid mean zero homoskedastic
unobservables having variance $\sigma^2$.
Here we generate data for $N=500$ and $K=2$:
#+BEGIN_SRC ipython :exports code
# set tensorflow data type
dtype = tf.float32
##
## simple OLS Data Generation Process
##
# True beta
b = np.array([10, -1])
N = 500
# True error std deviation
sigma_e = 1
x = np.c_[np.ones(N), np.random.randn(N)]
y = x.dot(b) + sigma_e * np.random.randn(N)
#+END_SRC
#+RESULTS:
: # Out [122]:
As a point of comparison to results in tensorflow, run classical OLS on our toy dataset:
#+BEGIN_SRC ipython :exports both
# estimate parameter vector, errors, sd of errors, and se of parameters
bols = np.linalg.inv(x.T.dot(x)).dot(x.T.dot(y))
err = y - x.dot(bols)
sigma_ols = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]))
se = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]) * np.diagonal(np.linalg.inv(x.T.dot(x))))
# put results together for easy viewing
ols_parms = np.r_[bols, sigma_ols]
ols_se = np.r_[se, np.nan]
print(pd.DataFrame(np.c_[ols_parms, ols_se],columns=['estimate', 'std err'],
index=['b0', 'b1', 'sigma']))
#+END_SRC
#+RESULTS:
: # Out [123]:
: estimate std err
: b0 10.021294 0.044157
: b1 -0.953204 0.046017
: sigma 0.987337 NaN
:
** Getting data into tensorflow tensors
Using tensorflow vs numpy syntax/functions is paramount for building
likelihoods that will work for us. Unfortunately, numpy and
matlab-like slicing and indexing does not always work which means that
vectorizing loops requires quite a lot of thought and the use of
indices.[fn:7] Apart from that, the differences from numpy are fairly
minor, and with tensorflow 2's "eager execution", code is easy to
debug.
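For example, numpy-style fancy indexing such as =x[[0, 2], :]= generally fails on tensors and must be rewritten with explicit index ops like =tf.gather= (a minimal sketch):

```python
import tensorflow as tf

x = tf.constant([[10., 11.], [20., 21.], [30., 31.]])
# numpy-style fancy indexing (x[[0, 2], :]) becomes an explicit gather
rows = tf.gather(x, indices=[0, 2], axis=0)
print(rows.numpy().tolist())  # → [[10.0, 11.0], [30.0, 31.0]]
```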
First, let's convert our data to tensorflow tensors. While not
strictly necessary, this ensures that we have strict "typing" of data
when we move to gpu operations:
#+BEGIN_SRC ipython :exports both
X = tf.constant(x, dtype=dtype)
Y = tf.constant(y, dtype=dtype)
print(type(X))
print(X.dtype)
print(X.shape)
# extract the first 5 tensors into numpy array
print(X.numpy()[:5])
#+END_SRC
#+RESULTS:
#+begin_example
# Out [124]:
<class 'tensorflow.python.framework.ops.EagerTensor'>
<dtype: 'float32'>
(500, 2)
[[ 1. 0.47143516]
[ 1. -1.1909757 ]
[ 1. 1.432707 ]
[ 1. -0.3126519 ]
[ 1. -0.72058874]]
#+end_example
By default, Tensorflow 2.0 (what I am using) executes as "eager
tensors". This enables easy debugging of tensorflow code but in
reality what we are doing with tensorflow is building a model graph
with functional linkages between tensors that we can use for our
statistical analysis later.
* Log-Likelihood in Tensorflow
Next let's define the log-likelihood function using tensorflow
primitives. It goes without saying that we could (and should) use
=tensorflow_probability.distributions.normal= for this problem, but I
proceed with this example because it is a well known model with
multiple parameters and data for understanding how to setup problems
with custom log likelihoods. The log-likelihood over model parameters
$\mathbf{b}$ and $s$, the estimates for $\beta$ and $\sigma$ respectively, is given by
\begin{align}
LogL =& \sum_{i=1}^N \log \left( p(y_i | \mathbf{x}_i, \mathbf{b},s) \right) = \sum_{i=1}^N \log \left( \frac{1}{\sqrt{2 \pi s^2}} e^{\frac{-(y_i-\mathbf{x}_i \mathbf{b})^2}{2 s^2}} \right) \nonumber \\
=& - \frac{N}{2}\log(2 \pi s^2) - \sum_{i=1}^N \frac{1}{2 s^2} (y_i - \mathbf{x}_i \mathbf{b})^2
\end{align}
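Before translating this to tensorflow, the two lines of the derivation can be checked against each other numerically (a numpy sanity check on freshly simulated data, not the post's exact draws):

```python
import numpy as np

# simulate a dataset in the same shape as the post's (not the same draws)
rng = np.random.default_rng(1234)
N = 500
x = np.c_[np.ones(N), rng.standard_normal(N)]
y = x @ np.array([10.0, -1.0]) + rng.standard_normal(N)

b, s = np.array([1.0, 1.0]), 1.0   # trial parameter values
resid = y - x @ b

# line 1 of the derivation: sum of logged normal densities
ll_sum = np.sum(np.log(np.exp(-resid**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)))
# line 2: the simplified closed form
ll_closed = -(N / 2) * np.log(2 * np.pi * s**2) - np.sum(resid**2) / (2 * s**2)

print(np.isclose(ll_sum, ll_closed))  # → True
```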
Programming this as a tensorflow function is fairly straightforward:
#+BEGIN_SRC ipython :exports code
pi = tf.constant(np.pi, dtype=dtype)

@tf.function
def ols_loglike(beta, sigma):
    # xb (mu_i for each observation)
    mu = tf.linalg.matvec(X, beta)
    # log of the normal pdf, summed over all observations
    ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\
         (1./(2.*sigma**2.))*tf.math.reduce_sum((Y - mu)**2., axis=-1)
    return ll
#+END_SRC
#+RESULTS:
: # Out [125]:
We can use this to evaluate the normal log likelihood given our data
and parameters. Evaluating at $\beta = \begin{bmatrix} 1 & 1
\end{bmatrix}$ and $\sigma = 1$
yields the tensor scalar
#+BEGIN_SRC ipython :exports both
ols_loglike(beta = [1., 1.], sigma=1.)
#+END_SRC
#+RESULTS:
: # Out [126]:
: <tf.Tensor: shape=(), dtype=float32, numpy=-21840.611>
Note that the numpy value is the result of the tensorflow function
evaluation due to eager execution but underlying all of this is a
model compute graph.
* Auto differentiation
Auto differentiation is possible since tensorflow in the background
has defined the model graph given data for finding $\frac{\partial
LogL}{\partial \beta}$ and $\frac{\partial LogL}{\partial \sigma}$.
Evaluating the function and derivative at $\beta = \begin{bmatrix} 1 & 1
\end{bmatrix}$ and $\sigma=1$ we have
#+BEGIN_SRC ipython :exports both
[funval, grads] = tfp.math.value_and_gradient(ols_loglike, [[1., 1.], 1.])
print("Function Value: ", funval)
print('Gradients on beta: \n', grads[0])
print('Gradient on sigma: \n', grads[1])
#+END_SRC
#+RESULTS:
: # Out [127]:
: Function Value: tf.Tensor(-21840.611, shape=(), dtype=float32)
: Gradients on beta:
: tf.Tensor([4501.1436 -855.3488], shape=(2,), dtype=float32)
: Gradient on sigma:
: tf.Tensor(42262.285, shape=(), dtype=float32)
:
This calculation occurs nearly instantly, and as importantly there
was no need to programmatically define gradients. Also, the signs and
magnitudes of the gradients are what we expect. Focusing on $\beta$, at the values
$\beta=\begin{bmatrix} 1 & 1\end{bmatrix}$, $\beta_0$ is too small
relative to the "true" value (hence the positive gradient) whereas
$\beta_1$ is too large (hence the negative gradient).
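This intuition can be verified from the analytic gradient, $\partial LogL / \partial \beta = \mathbf{x}'(\mathbf{y} - \mathbf{x}\mathbf{b})/s^2$ (a numpy check on freshly simulated data, not the post's exact draws):

```python
import numpy as np

rng = np.random.default_rng(1234)
N = 500
x = np.c_[np.ones(N), rng.standard_normal(N)]
y = x @ np.array([10.0, -1.0]) + rng.standard_normal(N)

b, s = np.array([1.0, 1.0]), 1.0
# analytic gradient of LogL with respect to beta: X'(y - Xb) / s^2
grad_beta = x.T @ (y - x @ b) / s**2
print(np.sign(grad_beta))  # → [ 1. -1.]
```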
* Maximum Likelihood Estimation in Tensorflow
For finding the maximum likelihood estimate, we'll be minimizing the
negative log-likelihood since tensorflow has an optimization library
based on minimization (rather than maximizing the log-likelihood).
Since we will be using a minimizer, we need to construct a negative
log-likelihood function in tensorflow. Additionally, we need all of
the parameters (the $\beta$'s and $\sigma$) to enter as a single
vector.[fn:2]
Defining the function and doing a "sanity check" below shows that we
get identical answers to before, except that all values are of
opposite sign.
#+BEGIN_SRC ipython :exports both
# objective function for minimization (negative log-likelihood)
@tf.function
def neg_ols_loglike(param_vec):
    beta_split, sigma_split = tf.split(param_vec, [2, 1], axis=0)
    # take these back down to a vector and a scalar:
    beta_split = tf.reshape(beta_split, (2,))
    sigma_split = tf.reshape(sigma_split, ())
    # xb (mu_i for each observation)
    mu = tf.linalg.matvec(X, beta_split)
    # log of the normal pdf, summed over all observations
    ll = -(X.shape[0]/2.)*tf.math.log(2.*pi*sigma_split**2.) -\
         (1./(2.*sigma_split**2.))*tf.math.reduce_sum((Y - mu)**2., axis=-1)
    return -1*ll

# return function value and gradients
@tf.function
def neg_like_and_gradient(parms):
    return tfp.math.value_and_gradient(neg_ols_loglike, parms)

# using the same values as above
testval = tf.constant([1., 1., 1.], shape=(3,))
out = neg_like_and_gradient(parms=testval)
print("Function value: ", out[0])
print("Gradients: ", out[1])
#+END_SRC
#+RESULTS:
: # Out [128]:
: Function value: tf.Tensor(21840.611, shape=(), dtype=float32)
: Gradients: tf.Tensor([ -4501.1436 855.3488 -42262.285 ], shape=(3,), dtype=float32)
:
*An important note that may possibly save someone a lot of time*: for
the optimizer code below to work, the function values and gradients
returned by =neg_like_and_gradient= must be the correct shape. For my
problem the correct shape is a scalar for the negative log likelihood
value (shape=()) and a 1d tensor (here the shape=(3,)) for the
gradients.
With that, we can conduct maximum likelihood estimation. The
=bfgs_minimize= function returns quite a bit of information,
including the estimated parameters and the inverse hessian estimate,
which is needed for constructing standard errors.
#+BEGIN_SRC ipython :exports both
# set some naive starting values
start = [0., 0., 1.]
# optimization
optim_results = tfp.optimizer.bfgs_minimize(
neg_like_and_gradient, start, tolerance=1e-8)
# organize results
est_params = optim_results.position.numpy()
est_serr = np.sqrt(np.diagonal(optim_results.inverse_hessian_estimate.numpy()))
print(pd.DataFrame(np.c_[est_params, est_serr],columns=['estimate', 'std err'],
index=['b0', 'b1', 'sigma']))
#+END_SRC
#+RESULTS:
: # Out [129]:
: estimate std err
: b0 10.021295 0.044162
: b1 -0.953205 0.046155
: sigma 0.985360 0.031172
:
We can confirm the maximum likelihood solution by inspecting the
returned gradient information in =optim_results=, or by calculating
directly:
#+BEGIN_SRC ipython :exports both
out = neg_like_and_gradient(est_params)
print("Function value: ", out[0])
print("Gradients: ", out[1])
#+END_SRC
#+RESULTS:
: # Out [130]:
: Function value: tf.Tensor(702.0952, shape=(), dtype=float32)
: Gradients: tf.Tensor([ 5.8412552e-05 -1.8119812e-05 0.0000000e+00], shape=(3,), dtype=float32)
:
So, solution confirmed.[fn:3]
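As an outside check on the tensorflow optimizer, the same negative log-likelihood can be minimized with scipy's BFGS, which likewise returns an inverse-Hessian estimate (=hess_inv=). The data here are simulated fresh, so the numbers only approximately match those above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1234)
N = 500
x = np.c_[np.ones(N), rng.standard_normal(N)]
y = x @ np.array([10.0, -1.0]) + rng.standard_normal(N)

def neg_loglike(p):
    # negative of the OLS log-likelihood over (b0, b1, sigma)
    b, s = p[:2], p[2]
    resid = y - x @ b
    return (N / 2) * np.log(2 * np.pi * s**2) + np.sum(resid**2) / (2 * s**2)

res = minimize(neg_loglike, x0=[0.0, 0.0, 1.0], method="BFGS")
se = np.sqrt(np.diag(res.hess_inv))   # standard errors from the inverse hessian
print(np.round(res.x, 3), np.round(se, 3))
```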
* MCMC in Tensorflow
The following shows the mechanics of the Random Walk
Metropolis-Hastings and No-U-Turn samplers for constructing Markov
chain Monte Carlo (MCMC) chains. These results don't place priors on
any parameters, so all accept/reject decisions are based on
likelihood values only.
** Random Walk Metropolis-Hastings
Recall that the Random Walk Metropolis-Hastings (RWMH) algorithm
formulates a proposal ($\theta_1$) from the current parameter position
($\theta_0$) as
\begin{equation}
\theta_1 = \theta_0 + f(\delta)
\end{equation}
where $\delta$ is a scale parameter (or potentially a vector of scale
parameters, one for each parameter we are estimating) for the proposal
distribution $f(\cdot)$. Below, we implement a normally distributed
random walk proposal, making the above
\begin{equation}
\theta_1 = \theta_0 + \delta N(0,1)
\end{equation}
where $N(0,1)$ is a standard normal random variate. The choice of
scale ($\delta$) heavily influences the acceptance rate of our RWMH
sampler, and unfortunately I don't believe that tuning is built into
the random walk sampler in tensorflow. The scale I use below was
recovered by hand tuning the sampler, where a single scale is chosen
over our three parameters. For the sake of brevity I don't show that.
Once we have uncovered the appropriate scale for our proposal, we
will use it for the final chain. Next, we use the
=tensorflow_probability.mcmc= library for sampling. For any mcmc
model, this involves
1) Setting up a kernel defining the likelihood and prior
   log-probabilities and the proposal (potentially including the
   proposal scale)
2) Specifying starting values
3) Sampling from the kernel
Based on hand tuning, I found a proposal scale of .04 to provide a
good acceptance rate. Also, let's start the sampling at our maximum
likelihood estimates (there are potential downsides to this that I
won't discuss here):
#+BEGIN_SRC ipython :exports both
# samples and burn-in for the RWMH chain (counts not specified in the original run)
mhrw_samples = 1000
mhrw_burnin = 500

@tf.function
def mcmc_sampler(init_vals):
    kernel = tfp.mcmc.RandomWalkMetropolis(
        target_log_prob_fn=ols_loglike,
        new_state_fn=tfp.mcmc.random_walk_normal_fn(scale=.04), seed=42)
    samples_, stats_ = tfp.mcmc.sample_chain(
        num_results=mhrw_samples,
        current_state=init_vals,
        kernel=kernel,
        num_burnin_steps=mhrw_burnin,
        parallel_iterations=1)
    return samples_, stats_

# initialize at MLE estimates
init = [est_params[:2], est_params[-1]]
samples, stats = mcmc_sampler(init)
print('Final acceptance rate for step size %2.3f: %2.3f ' % (.04, stats.is_accepted.numpy().sum()/mhrw_samples))
#+END_SRC
#+RESULTS:
: # Out [131]:
: Final acceptance rate for step size 0.040: 0.448
:
After sampling, we see the acceptance rate is near .45. The mean and standard deviation of the posterior samples are:
#+BEGIN_SRC ipython :exports both
# summarize results
trace_sigma = samples[1].numpy()
trace_beta = samples[0].numpy()
est_mh = np.r_[trace_beta.mean(axis=0), trace_sigma.mean()]
std_mh = np.r_[trace_beta.std(axis=0), trace_sigma.std()]
# assemble and print
print(pd.DataFrame(np.c_[est_mh, std_mh],columns=['estimate', 'std err'],
index=['b0', 'b1', 'sigma']))
#+END_SRC
#+RESULTS:
: # Out [132]:
: estimate std err
: b0 10.020264 0.043223
: b1 -0.953512 0.048436
: sigma 0.986995 0.031827
:
Note: tensorflow random walk metropolis hastings examples from the
web, like the one [[https://www.tensorflow.org/probability/api_docs/python/tfp/mcmc/RandomWalkMetropolis][here]], use the default scale of 1. For this problem, it resulted
in chains that *never* accepted any proposals after approximately 25
samples. Finding a scale that works for your problem seems like a
necessary step in using
=tensorflow_probability.mcmc.RandomWalkMetropolis= unless I have
missed something obvious and auto tuning of the proposal scale is
possible.
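The accept/reject mechanics sketched above are easy to see in a plain numpy RWMH sampler (a toy 1-d example of my own, not the tensorflow implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def log_target(theta):
    # toy target: standard normal log-density (up to a constant)
    return -0.5 * theta**2

def rwmh(n_samples, scale, theta0=0.0):
    theta, lp = theta0, log_target(theta0)
    draws, accepted = [], 0
    for _ in range(n_samples):
        # proposal: theta_1 = theta_0 + delta * N(0, 1)
        proposal = theta + scale * rng.standard_normal()
        lp_prop = log_target(proposal)
        # accept with probability min(1, p(proposal) / p(current))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp, accepted = proposal, lp_prop, accepted + 1
        draws.append(theta)
    return np.array(draws), accepted / n_samples

draws, acc_rate = rwmh(5000, scale=2.4)
print(round(acc_rate, 2), round(float(draws.mean()), 2))
```

As in the tensorflow sampler, a scale that is too large collapses the acceptance rate toward zero, while one that is too small accepts nearly everything but explores slowly.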
** No U-Turn Sampler (NUTS)
The NUTS sampler in tensorflow can be set up to auto-tune the
proposal, so we needn't worry as much about the proposal scale.
Furthermore, it uses the gradients provided by tensorflow for smarter
proposals than a random walk, and for nearly all problems it would be
the recommended approach. One need only supply an initial step size;
for this toy example I found the estimates not particularly sensitive
to =init_step_size=, which governs the behavior of proposals generated
by NUTS. As always, your mileage may vary. The NUTS sampler is easy to
set up and implement:
#+BEGIN_SRC ipython :exports code
nuts_samples = 1000
nuts_burnin = 500
chains = 4
## Initial step size
init_step_size = .3

##
## NUTS (using inner step size averaging step)
##
@tf.function
def nuts_sampler(init):
    nuts_kernel = tfp.mcmc.NoUTurnSampler(
        target_log_prob_fn=ols_loglike,
        step_size=init_step_size,
        )
    adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        inner_kernel=nuts_kernel,
        num_adaptation_steps=nuts_burnin,
        step_size_getter_fn=lambda pkr: pkr.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
        )
    samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
        num_results=nuts_samples,
        current_state=init,
        kernel=adapt_nuts_kernel,
        num_burnin_steps=100,
        parallel_iterations=5)
    return samples_nuts_, stats_nuts_

samples_nuts, stats_nuts = nuts_sampler(init)
#+END_SRC
#+RESULTS:
: # Out [133]:
#+BEGIN_SRC ipython :exports both
trace_sigman = samples_nuts[1].numpy()
trace_betan = samples_nuts[0].numpy()
est_nuts = np.r_[trace_betan.mean(axis=0), trace_sigman.mean()]
std_nuts = np.r_[trace_betan.std(axis=0), trace_sigman.std()]
# assemble and print
print(pd.DataFrame(np.c_[est_nuts, std_nuts],columns=['estimate', 'std err'],
index=['b0', 'b1', 'sigma']))
#+END_SRC
#+RESULTS:
: # Out [134]:
: estimate std err
: b0 10.022340 0.044498
: b1 -0.955192 0.044316
: sigma 0.991023 0.032741
:
** Multiple Chains
All of the discussion thus far samples a single chain. With multiple
processor cores on either CPUs or GPUs, there are opportunities to
sample in parallel very quickly.[fn:9] This provides more information
about the posterior and allows testing for convergence by comparing
multiple chains. Provided your log-likelihood function is written
correctly (i.e., can handle vectorized calculations), this is
accomplished by providing starting values with the appropriate shape
(many thanks to [[https://colindcarroll.com/2019/08/18/very-parallel-mcmc-sampling/][the gist from this blog post]] for making this clear).
A python version of the =pymc4= function =tile_init= is useful for
doing this.
#+BEGIN_SRC ipython :exports both
#
# Setup starting vals for multiple chains
#
def tile_init(init, num_repeats):
    """
    create tiled initial values for multiple chains
    idea from pymc4.inference.sample:
    https://github.com/pymc-devs/pymc4/blob/master/pymc4/inference/sampling.py
    """
    return [np.tile(np.expand_dims(tens, 0), [num_repeats] + [1] * tens.ndim) for tens in init]

init = tile_init(init, chains)
print("Multiple chains requested: ", chains)
init
#+END_SRC
#+RESULTS:
: # Out [135]:
: Multiple chains requested: 4
:
: [array([[10.021295 , -0.9532046],
:         [10.021295 , -0.9532046],
:         [10.021295 , -0.9532046],
:         [10.021295 , -0.9532046]], dtype=float32),
:  array([0.98536015, 0.98536015, 0.98536015, 0.98536015], dtype=float32)]
Note the initial values are copied 4 times (in the row dimension for
$\beta$) for all of our parameters, and the NUTS function is defined
with a *python* variable, which forces a rebuild of the compute graph
if the shape of the initial values change. With these starting
values, the code for sampling is identical to before except this time
we will have four independent chains.
#+BEGIN_SRC ipython
samples_nuts_mult, stats_nuts_mult = nuts_sampler(init)
#+END_SRC
#+RESULTS:
: # Out [136]:
Summarizing means from each of the chains gives:
#+BEGIN_SRC ipython :exports both
print("Shape of samples (for beta): ", samples_nuts_mult[0].shape)
print("Mean for each chain and parameter (beta only): ")
data = samples_nuts_mult[0].numpy().mean(axis=0)
df = pd.DataFrame(data, columns=['b0', 'b1'], index=['chain ' + str(i+1) for i in range(4)])
df
#+END_SRC
#+RESULTS:
#+begin_example
# Out [137]:
Shape of samples (for beta): (1000, 4, 2)
Mean for each chain and parameter (beta only):
               b0        b1
chain 1  10.020152 -0.951442
chain 2  10.020010 -0.952338
chain 3  10.019940 -0.954123
chain 4  10.020774 -0.952213
#+end_example
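With several chains in hand, convergence can be checked by comparing between- and within-chain variation, for example with the Gelman-Rubin statistic. A minimal numpy sketch (run on synthetic, well-mixed draws rather than the chains above):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for draws shaped (n_draws, n_chains)."""
    n, m = chains.shape
    chain_means = chains.mean(axis=0)
    W = chains.var(axis=0, ddof=1).mean()     # average within-chain variance
    B = n * chain_means.var(ddof=1)           # between-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled estimate of posterior variance
    return float(np.sqrt(var_hat / W))

# synthetic, well-mixed "chains": independent draws from the same target
rng = np.random.default_rng(1)
rhat = gelman_rubin(rng.standard_normal((1000, 4)))
print(round(rhat, 2))  # ≈ 1.0 for well-mixed chains
```

Values near 1 indicate the chains are sampling the same distribution; values well above 1 signal a mixing or convergence problem.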
I ran this same code for 200 chains of 2000 samples each on an Nvidia Tesla P100-16GB GPU in less than 11 seconds:
#+begin_example
Number of Chains: 200
Elapsed time (seconds): 10.79
Shape of beta: (2000, 200, 2)
#+end_example
[fn:1] Theano is designed for and does take advantage of Nvidia
graphics cards but **for my purposes (and my purposes only)** it did
not lead to any speed-up for a wide range of models I attempted (I was
using PyMC3: see footnote 8 [[https://arxiv.org/pdf/1811.02091.pdf][here]]). Additionally, Theano 1.0, released in
November 2017, is the final release.
[fn:2] This may not be an absolute requirement, but I was unable to
get the minimization code working for segmented $\beta$ and $\sigma$,
whereas I was successful when treating all parameters to be estimated
as a single vector.
[fn:3] A word of warning: the minimization routine was sensitive to how
I defined the log-likelihood. For some reason, another mathematically
equivalent version of the log-likelihood (with identical gradients and
function values) failed to converge with the starting values used
above.
[fn:7] The indexing issues described above are not covered in this post.
[fn:9] While not discussed in the post, one could also use multiple
starting values from a grid in a maximum likelihood estimation context if the
log-likelihood isn't globally concave. All of this could be done in
parallel.