Making Tensorflow Faster

Speeding up Tensorflow

In writing my previous posts, I stumbled upon a few things that made code run significantly faster in tensorflow. This post summarizes my notes on these techniques. To a tensorflow aficionado these things are probably well-known, but as someone uninitiated I was surprised at the difference just a little bit of extra code could make. Here we examine these techniques using the example from a previous post. The techniques are:

  • Wrapping a tensorflow_probability.mcmc.sample_chain in a tensorflow function
  • Using XLA for your tensorflow function
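
In a nutshell, both techniques amount to decorating the function you want to speed up. Here is a minimal, hypothetical preview on a toy function (toy_fn and toy_fn_xla are names made up for illustration); the full sampler examples follow below.

import tensorflow as tf

# technique 1: wrap the function with tf.function
@tf.function
def toy_fn(v):
    return tf.reduce_sum(v ** 2)

# technique 2: additionally compile the function with XLA
@tf.function(experimental_compile=True)
def toy_fn_xla(v):
    return tf.reduce_sum(v ** 2)

toy_fn(tf.constant([1., 2., 3.]))
toy_fn_xla(tf.constant([1., 2., 3.]))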

To start, we generate toy data for a simple ordinary least squares (OLS) regression problem (for more detail, see my previous tensorflow post).

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore", category=Warning) 
# set seed so results never change
np.random.seed(1234)

Here we generate data for \(N=500\) and \(K=2\):

# set tensorflow data type
dtype = tf.float32

##
## simple OLS Data Generation Process
##
# True beta
b = np.array([10, -1])
N = 500
# True error std deviation
sigma_e = 1

x = np.c_[np.ones(N), np.random.randn(N)]
y = x.dot(b) + sigma_e * np.random.randn(N)

Next, we convert the data to tensors and set up the log-likelihood for this problem:

X = tf.constant(x, dtype=dtype)
Y = tf.constant(y, dtype=dtype)
pi = tf.constant(np.pi, dtype=dtype)

def ols_loglike(beta, sigma):
    # xb (mu_i for each observation)
    mu = tf.linalg.matvec(X, beta)
    # this is normal pdf logged and summed over all observations
    ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\
         (1./(2.*sigma**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
    return ll
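
As a quick sanity check (a small aside, not part of the timing comparison), we can evaluate the log-likelihood at the true parameter values used to generate the data; it should return a single finite scalar:

ols_loglike(beta=tf.constant([10., -1.], dtype=dtype),
            sigma=tf.constant(1., dtype=dtype))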

Tensor Function Wrapper

As I demonstrated in the earlier post, it is straightforward to set up a NUTS sampler for this problem, but wrapping the sampling step in a tensorflow function leads to BIG speed increases (the wrapped version runs on multiple cores). Below we provide a quick speed comparison (notebook CPU with 8 cores).

Let's run the same model without the wrapping technique. Resetting the kernels:

# a naive initial value for the chain (for beta and sigma):
init = [tf.constant([0., 0.], dtype=dtype), tf.constant(1.,dtype=dtype)]
samples = 2000
burnin = 500
init_step_size=.3

nuts_kernel = tfp.mcmc.NoUTurnSampler(
    target_log_prob_fn=ols_loglike,
    step_size=init_step_size,
)
adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
    inner_kernel=nuts_kernel,
    num_adaptation_steps=burnin,
    step_size_getter_fn=lambda pkr: pkr.step_size,
    log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
    step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size),
)

And running the time-consuming part:

%%timeit -n1  -r1
tfp.mcmc.sample_chain(
    num_results=samples,
    current_state=init,
    kernel=adapt_nuts_kernel,
    num_burnin_steps=100,
    parallel_iterations=5)

If we wrap the sampler in a tensorflow function, we get dramatic speedups:

@tf.function
def sampler(init_vals):

    @tf.function
    def ols_loglike(beta, sigma):
        # xb (mu_i for each observation)
        mu = tf.linalg.matvec(X, beta)
        # this is normal pdf logged and summed over all observations
        ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\
             (1./(2.*sigma**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
        return ll

    nuts_kernel = tfp.mcmc.NoUTurnSampler(
        target_log_prob_fn=ols_loglike,
        step_size=init_step_size,
    )
    adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        inner_kernel=nuts_kernel,
        num_adaptation_steps=burnin,
        step_size_getter_fn=lambda pkr: pkr.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size),
    )
    sample_vals, stats = tfp.mcmc.sample_chain(
        num_results=samples,
        current_state=init_vals,
        kernel=adapt_nuts_kernel,
        num_burnin_steps=100,
        parallel_iterations=5)
    return sample_vals, stats

Check out the sampler; it should now be a tensorflow Function object rather than a plain Python function:

type(sampler)

Let's sample from the function:

%%timeit -n1 -r1
sampler(init)

That is a BIG speedup (~14x faster) just from wrapping your code in a tensorflow function.
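
Beyond the timing, the wrapped sampler also returns the draws themselves. Here is a minimal sketch of pulling posterior means out of its output (note that calling sampler again re-runs the chain):

sample_vals, stats = sampler(init)
beta_draws, sigma_draws = sample_vals
# posterior means for beta (intercept, slope) and sigma
print(tf.reduce_mean(beta_draws, axis=0).numpy())
print(tf.reduce_mean(sigma_draws).numpy())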

XLA Mode

Perhaps we can do even better using the new XLA compiler for our tensorflow function. This is experimental, but let's try it.

@tf.function(experimental_compile=True)
def sampler(init_vals):

    @tf.function(experimental_compile=True)
    def ols_loglike(beta, sigma):
        # xb (mu_i for each observation)
        mu = tf.linalg.matvec(X, beta)
        # this is normal pdf logged and summed over all observations
        ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\
             (1./(2.*sigma**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
        return ll

    nuts_kernel = tfp.mcmc.NoUTurnSampler(
        target_log_prob_fn=ols_loglike,
        step_size=init_step_size,
    )
    adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        inner_kernel=nuts_kernel,
        num_adaptation_steps=burnin,
        step_size_getter_fn=lambda pkr: pkr.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size),
    )
    sample_vals, stats = tfp.mcmc.sample_chain(
        num_results=samples,
        current_state=init_vals,
        kernel=adapt_nuts_kernel,
        num_burnin_steps=100,
        parallel_iterations=5)
    return sample_vals, stats

And timing the XLA-compiled version:

%%timeit -n1 -r1
sampler(init)