#+BEGIN_COMMENT .. title: Tensorflow with Custom Likelihood Functions .. slug: custom-likes-tensorflow .. date: 2020-02-24 9:15:50 UTC+01:00 .. tags: tensorflow, likelihood, mcmc, maximum likelihood .. status: .. has_math: yes .. category: .. link: .. description: .. type: text #+END_COMMENT #+BEGIN_SRC emacs-lisp :exports none (setf (cdr (assoc :results org-babel-default-header-args:ipython)) "output replace") #+END_SRC #+RESULTS: : output replace This post builds on earlier ones dealing with [[./estimating-custom-mle.html][custom likelihood functions in python]] and maximum likelihood estimation with [[./mle-autograd.html][auto differentiation]]. This post is approaching tensorflow from an econometrics perspective and is based on a series of tests and notes I developed for using tensorflow for some of my work. In early explorations of tensorflow nearly all of the examples I encountered were from a machine learning perspective, making it difficult to fit code examples to econometric problems. For the tensorflow uninitiated who want to dive in (like me!), I hope this will prove useful. The goals of the post: 1. Some tensorflow basics I wish I had known before I started this work 2. Define a custom log-likelihood function in tensorflow and perform differentiation over model parameters to illustrate how, under the hood, tensorflow's model graph is designed to calculate derivatives "free of charge" (no programming required and very little to no additional compute time). 3. Use the tensorflow log-likelihood to estimate a maximum likelihood model using =tensorflow_probability.optimizer= capabilities. 4. Illustrate how the =tensorflow.probability.mcmc= libraries can be used with custom log-likelihoods. #+HTML: So why bother? The earlier posts outlined above show python's =autograd= or =theano= has auto differentiation capabilities which we used successfully to speedup some models. Unfortunately, those libraries aren't optimal for the current revolution in machine learning enabled by gpu computing.[fn:1] Tensorflow is a library that opens doors for econometricians enabling much faster processing (and massive parallelization of models). * Tensorflow Basics As described in the [[https://www.tensorflow.org/tutorials/customization/performance][tensorflow performance tutorial]], wrap your functions in =@tensorflow.function=. Without this, your code will run very slow. All of the results below do this and it makes a huge difference in runtimes. Here is an example of two functions that look similar and use the tensorflow =matmul= op but are very different in terms of how tensorflow compiles (or doesn't compile) code in the backend. Additionally, you can try the =experimental_compile= mode as it can shave over 50% off runtimes when it works: #+BEGIN_SRC ipython :exports both import tensorflow as tf import tensorflow_probability as tfp from tensorflow_probability import distributions as tfd import numpy as np import pandas as pd def mult_fun(x, y): return tf.matmul(x, y) @tf.function(experimental_compile=True) def mult_tf_fun(x,y): return tf.matmul(x, y) print("The function mult_fun runs slow: ", type(mult_fun)) print("The function mult_tf_fun runs fast: ", type(mult_tf_fun)) #+END_SRC #+RESULTS: : # Out [120]: : The function mult_fun runs slow: : The function mult_tf_fun runs fast: : You want your functions to have the type of the second one. The data types of your tensorflow function arguments also matter. In general python function arguments will require tensorflow to automatically rebuild compute graphs whereas tensorflow arguments will not cause a rebuild (so will run faster). If you need changes to the graph at runtime (e.g. number of samples, number of chains, etc.) then pass in the variable as a python variable as that will require a rebuild of the graph which is needed to run successfully. #+BEGIN_SRC ipython :exports none import warnings warnings.filterwarnings("ignore", category=Warning) # set seed so results never change np.random.seed(1234) #+END_SRC #+RESULTS: : # Out [121]: ** A toy problem To begin, we will be generating data consistent with ordinary least squares: $$\mathbf{y} = \mathbf{x} \beta + \epsilon$$ where $\mathbf{y}$ is an $N \times 1$ vector of dependent variables, $\mathbf{x}$ is an $N \times K$ vector of dependent variables, $\beta$ are the $K\times 1$ parameters we are trying to estimate, and $\epsilon$ is an $N \times 1$ vector of iid mean zero homoskedastic unobservables having variance $\sigma^2$. Here we generate data for $N=500$ and $K=2$: #+BEGIN_SRC ipython :exports code # set tensorflow data type dtype = tf.float32 ## ## simple OLS Data Generation Process ## # True beta b = np.array([10, -1]) N = 500 # True error std deviation sigma_e = 1 x = np.c_[np.ones(N), np.random.randn(N)] y = x.dot(b) + sigma_e * np.random.randn(N) #+END_SRC #+RESULTS: : # Out [122]: As a point of comparison to results in tensorflow, run classical OLS on our toy dataset: #+BEGIN_SRC ipython :exports both # estimate parameter vector, errors, sd of errors, and se of parameters bols = np.linalg.inv(x.T.dot(x)).dot(x.T.dot(y)) err = y - x.dot(bols) sigma_ols = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1])) se = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]) * np.diagonal(np.linalg.inv(x.T.dot(x)))) # put results together for easy viewing ols_parms = np.r_[bols, sigma_ols] ols_se = np.r_[se, np.nan] print(pd.DataFrame(np.c_[ols_parms, ols_se],columns=['estimate', 'std err'], index=['b0', 'b1', 'sigma'])) #+END_SRC #+RESULTS: : # Out [123]: : estimate std err : b0 10.021294 0.044157 : b1 -0.953204 0.046017 : sigma 0.987337 NaN : ** Getting data into tensorflow tensors Using tensorflow vs numpy syntax/functions is paramount for building likelihoods that will work for us. Unfortunately, numpy and matlab-like slicing and indexing does not always work which means that vectorizing loops requires quite alot of thought and the use of indices.[fn:7] Apart from that there are fairly minor differences from numpy and with tensorflow 2's "eager execution", code is easy to debug. First, let's convert our data to tensorflow tensors. While not strictly necessary, this ensures that we have strict "typing" of data when we move to gpu operations: #+BEGIN_SRC ipython :exports both X = tf.constant(x, dtype=dtype) Y = tf.constant(y, dtype=dtype) print(type(X)) print(X.dtype) print(X.shape) # extract the first 5 tensors into numpy array print(X.numpy()[:5]) #+END_SRC #+RESULTS: #+begin_example # Out [124]: (500, 2) [[ 1. 0.47143516] [ 1. -1.1909757 ] [ 1. 1.432707 ] [ 1. -0.3126519 ] [ 1. -0.72058874]] #+end_example By default, Tensorflow 2.0 (what I am using) executes as "eager tensors". This enables easy debugging of tensorflow code but in reality what we are doing with tensorflow is building a model graph with functional linkages between tensors that we can use for our statistical analysis later. * Log-Likelihood in Tensorflow Next let's define the log-likelihood function using tensorflow primitives. It goes without saying that we could (and should) use =tensorflow_probability.distributions.normal= for this problem, but I proceed with this example because it is a well known model with multiple parameters and data for understanding how to setup problems with custom log likelihoods. The log-likelihood over model parameters $\mathbf{b}$ and $s$, the estimates for $\beta$ and $\sigma$ respectively, is given by \begin{align} LogL =& \sum_{i=1}^N log \left( p(y_i | \mathbf{x}_i, \mathbf{b},s) \right) = \sum_{i=1}^N log \left( \frac{1}{\sqrt{2 \pi s^2}} e^{\frac{-(y_i-\mathbf{x}_i \mathbf{b})^2}{2 s^2}} \right) \nonumber \\ =& - \frac{N}{2}log(2 \pi s^2) - \sum_{i=1}^N \frac{1}{2 s^2} (y_i - \mathbf{x}_i \mathbf{b})^2 \end{align} Programming this as a tensorflow function is fairly straightforward: #+BEGIN_SRC ipython :exports code pi = tf.constant(np.pi, dtype=dtype) @tf.function def ols_loglike(beta, sigma): # xb (mu_i for each observation) mu = tf.linalg.matvec(X, beta) # this is normal pdf logged and summed over all observations ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\ (1./(2.*sigma**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1) return ll #+END_SRC #+RESULTS: : # Out [125]: We can use this to evaluate the normal log likelihood given our data and parameters. Evaluating at $\beta = \begin{bmatrix} 1 & 1 \end{bmatrix}$ and $\sigma = 1$ yields the tensor scalar #+BEGIN_SRC ipython :exports both ols_loglike(beta = [1., 1.], sigma=1.) #+END_SRC #+RESULTS: : # Out [126]: : : Note that the numpy value is the result of the tensorflow function evaluation due to eager execution but underlying all of this is a model compute graph. * Auto differentiation Auto differentiation is possible since tensorflow in the background has defined the model graph given data for finding $\frac{\partial LogL}{\partial \beta}$ and $\frac{\partial LogL}{\partial \sigma}$. Evaluating the function and derivative at $\beta = \begin{bmatrix} 1 & 1 \end{bmatrix}$ and $\sigma=1$ we have #+BEGIN_SRC ipython :exports both [funval, grads] = tfp.math.value_and_gradient(ols_loglike, [[1., 1.], 1.]) print("Function Value: ", funval) print('Gradients on beta: \n', grads[0]) print('Gradient on sigma: \n', grads[1]) #+END_SRC #+RESULTS: : # Out [127]: : Function Value: tf.Tensor(-21840.611, shape=(), dtype=float32) : Gradients on beta: : tf.Tensor([4501.1436 -855.3488], shape=(2,), dtype=float32) : Gradient on sigma: : tf.Tensor(42262.285, shape=(), dtype=float32) : This calculation occurs instantly and as importantly there was no need to programmatically define gradients. Also, the sign and magnitudes of the gradients are what we expect. Focusing on $\beta$, at the values $\beta=\begin{bmatrix} 1 & 1\end{bmatrix}$, $\beta_0$ is too small relative to the "true" value (hence the positive gradient) whereas $\beta_1$ is too large (hence the negative gradient). * Maximum Likelihood Estimation in Tensorflow For finding the maximum likelihood estimate, we'll be minimizing the negative log-likelihood since tensorflow has an optimization library based on minimization (rather than maximizing the log-likelihood). Since we will be using a minimizer, we need to construct a negative log-likelihood function in tensorflow. Additionally, we need all of the parameters (the $\beta$'s and $\sigma$) to enter as a single vector.[fn:2] Defining the function and doing a "sanity check" below, shows that we have identical answers as before except that all values are of opposite sign. #+BEGIN_SRC ipython :exports both # how to calculate gradiants # objective function for minimization @tf.function def neg_ols_loglike(param_vec): beta_split, sigma_split = tf.split(param_vec, [2,1], axis=0) # need to take these back down to vectors and scalars: beta_split = tf.reshape(beta_split,(2,)) sigma_split = tf.reshape(sigma_split,()) # xb (mu_i for each observation) mu = tf.linalg.matvec(X, beta_split) # this is normal pdf logged and summed over all observations ll = -(X.shape[0]/2.)*tf.math.log(2.*pi*sigma_split**2.) -\ (1./(2.*sigma_split**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1) return -1*ll # return function value and gradients @tf.function def neg_like_and_gradient(parms): return tfp.math.value_and_gradient(neg_ols_loglike, parms) # using the same values as above testval = tf.constant([1., 1., 1.], shape=(3)) out = neg_like_and_gradient(parms=testval) print("Function value: ", out[0]) print("Gradients: ", out[1]) #+END_SRC #+RESULTS: : # Out [128]: : Function value: tf.Tensor(21840.611, shape=(), dtype=float32) : Gradients: tf.Tensor([ -4501.1436 855.3488 -42262.285 ], shape=(3,), dtype=float32) : *An important note that may possibly save someone alot of time*: for the optimizer code below to work, the function values and gradients returned by =neg_like_and_gradient= must be the correct shape. For my problem the correct shape is a scalar for the negative log likelihood value (shape=()) and a 1d tensor (here the shape=(3,)) for the gradients. With that, we can conduct maximum likelihood estimation. The =bfgs_minimize= function returns quite a bit of information including estimated parameters and the inverse hessian which is needed for constructing standard errors. #+BEGIN_SRC ipython :exports both # set some naiive starting values start = [0., 0., 1.] # optimization optim_results = tfp.optimizer.bfgs_minimize( neg_like_and_gradient, start, tolerance=1e-8) # organize results est_params = optim_results.position.numpy() est_serr = np.sqrt(np.diagonal(optim_results.inverse_hessian_estimate.numpy())) print(pd.DataFrame(np.c_[est_params, est_serr],columns=['estimate', 'std err'], index=['b0', 'b1', 'sigma'])) #+END_SRC #+RESULTS: : # Out [129]: : estimate std err : b0 10.021295 0.044162 : b1 -0.953205 0.046155 : sigma 0.985360 0.031172 : We can confirm the maximum likelihood solution by inspecting the returned gradient information in =optim_results=, or by calculating directly: #+BEGIN_SRC ipython :exports both out = neg_like_and_gradient(est_params) print("Function value: ", out[0]) print("Gradients: ", out[1]) #+END_SRC #+RESULTS: : # Out [130]: : Function value: tf.Tensor(702.0952, shape=(), dtype=float32) : Gradients: tf.Tensor([ 5.8412552e-05 -1.8119812e-05 0.0000000e+00], shape=(3,), dtype=float32) : So, solution confirmed.[fn:3] * MCMC in Tensorflow The following shows the mechanics of Random-Walk Metropolis Hastings and No-Uturn Samplers for constructing Monte-Carlo Markov Chains. These results don't introduce priors on any parameters so all accept/reject decisions are based on likelihood values only. ** Random Walk Metropolis-Hastings Recall that the Random Walk Metropolis-Hastings (RWMH) algorithm formulates a proposal ($\theta_1$) from the current parameter position ($\theta_0$) as $$\theta_1 = \theta_0 + f(\delta)$$ where $\delta$ is a scale parameter (or potentially a vector of scale parameters for each parameter we are estimating) from the proposal distribution $f(^.)$. Below, we implement a Normally distributed random walk proposal making the above condition $$\theta_1 = \theta_0 + \delta N(0,1)$$ where $N(0,1)$ is a standard normal random variate. The choice of scale ($\delta$) will heavily influence the acceptance rate from our MHRW sampler, and unfortunately I don't believe that tuning is built into the random walk sampler in tensorflow. The scale I use below was recovered from hand tuning the sampler where a single scale is chosen over our three parameters. For the sake of brevity I don't show that. We will use these parameters for our final chain once we have uncovered the appropriate scale of our proposal. Next, we use the =tensorflow_probability.mcmc= library for sampling. For any mcmc model, this involves 1) Setting up a kernel defining the likelihood and prior log-likelihood probabilities and the proposal (and potentially including the proposal scale) 2) Specifying starting values 3) Sampling from the kernel Based on hand tuning, I found a proposal scale of .4 to provide a good acceptance rate. Also, let's start the sampling at our maximum likelihood estimates (there are potential downsides to this that I won't discuss here): #+BEGIN_SRC ipython :exports both @tf.function def mcmc_sampler(init_vals): kernel = tfp.mcmc.RandomWalkMetropolis( target_log_prob_fn=ols_loglike, new_state_fn=tfp.mcmc.random_walk_normal_fn(scale=.04),seed=42) samples_, stats_ = tfp.mcmc.sample_chain( num_results=mhrw_samples, current_state= init_vals, kernel=kernel, num_burnin_steps=mhrw_burnin, parallel_iterations=1) return samples, stats # initialize at MLE estimates init = [est_params[:2],est_params[-1]] samples, stats = mcmc_sampler(init) print('Final acceptance rate for step size %2.3f: %2.3f '% (.04,stats.is_accepted.numpy().sum()/mhrw_samples)) #+END_SRC #+RESULTS: : # Out [131]: : Final acceptance rate for step size 0.040: 0.448 : After sampling, we see the acceptance rate is near .45. The mean and standard deviation of the posterior samples are: #+BEGIN_SRC ipython :exports both # summarize results trace_sigma = samples[1].numpy() trace_beta = samples[0].numpy() est_mh = np.r_[trace_beta.mean(axis=0), trace_sigma.mean()] std_mh = np.r_[trace_beta.std(axis=0), trace_sigma.std()] # assemble and print print(pd.DataFrame(np.c_[est_mh, std_mh],columns=['estimate', 'std err'], index=['b0', 'b1', 'sigma'])) #+END_SRC #+RESULTS: : # Out [132]: : estimate std err : b0 10.020264 0.043223 : b1 -0.953512 0.048436 : sigma 0.986995 0.031827 : Note: tensorflow random walk metropolis hastings examples from the web like the one [[https://www.tensorflow.org/probability/api_docs/python/tfp/mcmc/RandomWalkMetropolis][here]] uses the default scale of 1. For this problem, it resulted in chains that *never* accepted any proposals after approximately 25 samples. Finding a scale that works for your problem seems like a necessary step in using =tensorflow_probability.mcmc.RandomWalkMetropolis= unless I have missed something obvious and auto tuning of the proposal scale is possible. ** No U-Turn Sampler (Nuts) The Nuts sampler in tensorflow can be setup to auto-tune the proposal and doesn't require us to worry so much about proposal scale. Furthermore, it uses the gradients provided by tensorflow for smarter proposals than a random walk and for nearly all problems would be the recommended approach. One need only supply an initial step size, and I found that estimating the model in Nuts for this toy example the estimates are not particularly sensitive to the parameter =init_step_size= which governs the behavior of proposals generated by Nuts. As always, your mileage may vary. The Nuts sampler is easy to setup and implement: #+BEGIN_SRC ipython :exports code nuts_samples = 1000 nuts_burnin = 500 chains = 4 ## Initial step size init_step_size=.3 ## ## NUTS (using inner step size averaging step) ## @tf.function def nuts_sampler(init): nuts_kernel = tfp.mcmc.NoUTurnSampler( target_log_prob_fn=ols_loglike, step_size=init_step_size, ) adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation( inner_kernel=nuts_kernel, num_adaptation_steps=nuts_burnin, step_size_getter_fn=lambda pkr: pkr.step_size, log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio, step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size) ) samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain( num_results=nuts_samples, current_state=init, kernel=adapt_nuts_kernel, num_burnin_steps=100, parallel_iterations=5) return samples_nuts_, stats_nuts_ samples_nuts, stats_nuts = nuts_sampler(init) #+END_SRC #+RESULTS: : # Out [133]: #+BEGIN_SRC ipython :exports both trace_sigman = samples_nuts_[1].numpy() trace_betan = samples_nuts_[0].numpy() est_nuts = np.r_[trace_betan.mean(axis=0), trace_sigman.mean()] std_nuts = np.r_[trace_betan.std(axis=0), trace_sigman.std()] # assemble and print print(pd.DataFrame(np.c_[est_nuts, std_nuts],columns=['estimate', 'std err'], index=['b0', 'b1', 'sigma'])) #+END_SRC #+RESULTS: : # Out [134]: : estimate std err : b0 10.022340 0.044498 : b1 -0.955192 0.044316 : sigma 0.991023 0.032741 : ** Multiple Chains All of the discussion thus far samples a single chain. With multiple processor cores on either CPU or GPU's there are opportunities to sample in parallel very quickly.[fn:9] This provides more information about the posterior and to test for convergence by comparing multiple chains. Provided your log-likelihood function is written correctly (is able to handle vectorized calculations), this is accomplished by providing starting values with the appropriate shape (many thanks to [[https://colindcarroll.com/2019/08/18/very-parallel-mcmc-sampling/][the gist from this blog post]] for making this clear). The =pymc4= function =tile_init= converted to python is useful for doing this. #+BEGIN_SRC ipython :exports both # # Setup starting vals for multiple chains # def tile_init(init, num_repeats): """ create tiled initial values for multiple chains idea from pymc4.inference.sample: https://github.com/pymc-devs/pymc4/blob/master/pymc4/inference/sampling.py """ return [np.tile(np.expand_dims(tens, 0), [num_repeats] + [1] * tens.ndim) for tens in init] init = tile_init(init, chains) print("Multiple chains requested: ", chains) init #+END_SRC #+RESULTS: : # Out [135]: : Multiple chains requested: 4 : : : [array([[10.021295 , -0.9532046], : : [10.021295 , -0.9532046], : : [10.021295 , -0.9532046], : : [10.021295 , -0.9532046]], dtype=float32), : : array([0.98536015, 0.98536015, 0.98536015, 0.98536015], dtype=float32)] Note the inital values are copied 4 times (in the row dimension for $\beta$) for all of our parameters and the NUTs function is defined with a *python* variable, which forces a rebuild of the compute graph if the shape of the initial values change. With these starting values, the code for sampling is identical to before except this time we will have four independent chains. #+BEGIN_SRC ipython samples_nuts_mult, stats_nuts_mult = nuts_sampler(init) #+END_SRC #+RESULTS: : # Out [136]: Summarizing means from each of the chains gives: #+BEGIN_SRC ipython :exports both print("Shape of samples (for beta): ", samples_nuts_mult[0].shape) print("Mean for each chain and parameter (beta only): ") data = samples_nuts_mult[0].numpy().mean(axis=0) df = pd.DataFrame(data, columns=['bo', 'b1'], index=['chain ' + str(i+1) for i in range(4)]) df #+END_SRC #+RESULTS: #+begin_example # Out [137]: Shape of samples (for beta): (1000, 4, 2) Mean for each chain and parameter (beta only): : bo b1 : chain 1 10.020152 -0.951442 : chain 2 10.020010 -0.952338 : chain 3 10.019940 -0.954123 : chain 4 10.020774 -0.952213 #+end_example I ran this same code for 200 chains of 2000 samples each on an Nvidia Tesla P100-16GB GPU in less than 11 seconds: #+begin_example Number of Chains: 200 Elapsed time (seconds): 10.79 Shape of beta: (2000, 200, 2) #+end_example [fn:1] Theano is designed for and does take advantage of Nvidia graphics cards but **for my purposes (and my purposes only)** it did not lead to any speed-up for a wide range of models I attempted (I was using PyMC3: see footnote 8 [[https://arxiv.org/pdf/1811.02091.pdf][here]]). Additionally Theano 1.0 from November 2017 is the final release. [fn:2] This may not be an absolute requirement, but I was unable to get the minimization code working for segmented $\beta$ and $\sigma$, whereas I was successful treating all parameters to be estimated as a vector. [fn:3] A word of warning: the minimization routine was sensitive to how I defined the log-likelihood. For some reason, another mathematically equivalent version of the log-likelihood (with identical gradients and function values) failed to converge with the starting values used above. [fn:7] The indexing issues described above are not covered in this post. [fn:9] While not discussed in the post, one could also use multiple starting values from a grid in a maximum likelihood estimation context if the log-likelihood isn't globally concave. All of this could be done in parallel.