# Making Tensorflow Faster

## Speeding up Tensorflow

In writing my previous posts, I stumbled upon a few things that made code run significantly faster in tensorflow. This post summarizes my notes on these techniques. To a tensorflow aficionado they are probably well known, but as one of the uninitiated I was surprised at how much difference just a little extra code could make. Here we examine these techniques using the example from a previous post. The techniques are:

• Wrapping calls to tensorflow_probability.mcmc.sample_chain in a tensorflow function
• Using XLA for your tensorflow function

To start, we generate toy data for a simple ordinary least squares regression problem (for more details, see my previous tensorflow post).

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore", category=Warning)
# set numpy seed so the simulated data are reproducible
np.random.seed(1234)


Here we generate data for $N=500$ and $K=2$:

# set tensorflow data type
dtype = tf.float32

##
## simple OLS Data Generation Process
##
# True beta
b = np.array([10, -1])
N = 500
# True error std deviation
sigma_e = 1

x = np.c_[np.ones(N), np.random.randn(N)]
y = x.dot(b) + sigma_e * np.random.randn(N)
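As a quick sanity check on the simulated data (an addition, not part of the original workflow), ordinary least squares should recover values close to the true b = [10, -1] and sigma_e = 1. A minimal numpy-only sketch that re-creates the same data so it runs on its own; b_hat and sigma_hat are illustrative names:

```python
import numpy as np

# Re-create the simulated data exactly as above
np.random.seed(1234)
b = np.array([10, -1])
N = 500
sigma_e = 1
x = np.c_[np.ones(N), np.random.randn(N)]
y = x.dot(b) + sigma_e * np.random.randn(N)

# Closed-form OLS estimate: minimizes ||y - x @ beta||^2
b_hat, *_ = np.linalg.lstsq(x, y, rcond=None)

# Residual standard deviation estimates sigma_e (N - K degrees of freedom)
resid = y - x.dot(b_hat)
sigma_hat = np.sqrt(resid.dot(resid) / (N - x.shape[1]))

print(b_hat, sigma_hat)
```

Since the target below is just the likelihood, the MCMC draws for beta should center near these OLS estimates, which makes this a handy check on the samplers.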


Next, we convert the data to tensors and set up the log-likelihood for this problem:

X = tf.constant(x, dtype=dtype)
Y = tf.constant(y, dtype=dtype)
pi = tf.constant(np.pi, dtype=dtype)

def ols_loglike(beta, sigma):
    # xb (mu_i for each observation)
    mu = tf.linalg.matvec(X, beta)
    # this is normal pdf logged and summed over all observations
    ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\
        (1./(2.*sigma**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
    return ll
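The closed form in ols_loglike follows from taking the log of the normal density for each observation and summing over the $N$ observations, with $x_i$ denoting the $i$-th row of $X$:

$$
\begin{aligned}
\ell(\beta, \sigma)
  &= \sum_{i=1}^{N} \log\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left( -\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2} \right) \right] \\
  &= -\frac{N}{2}\log\left(2\pi\sigma^2\right)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{N} \left(y_i - x_i^\top \beta\right)^2
\end{aligned}
$$

In the code, X.shape[0] plays the role of $N$, and tf.linalg.matvec(X, beta) computes every $x_i^\top \beta$ at once.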


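### Tensor Function Wrapper

As I demonstrated in the earlier post, it is straightforward to set up a No-U-Turn sampler with step-size adaptation in tensorflow probability. Wrapping the call to tfp.mcmc.sample_chain in a tensorflow function leads to BIG speed increases: the wrapped version runs as a tensorflow graph instead of dispatching each operation from Python in eager mode, and it can spread work across multiple cores. Below we provide a quick speed comparison (notebook CPU with 8 cores).

Let's first run the model without the wrapping technique. Setting up the kernels: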

# a naive initial value for chain (for beta and sigma):
init = [tf.constant([0., 0.], dtype=dtype), tf.constant(1., dtype=dtype)]

samples = 2000
burnin = 500
init_step_size = .3

nuts_kernel = tfp.mcmc.NoUTurnSampler(
    target_log_prob_fn=ols_loglike,
    step_size=init_step_size,
)
# step-size adaptation wrapper (assumed: SimpleStepSizeAdaptation,
# adapting during the 100 burn-in steps used below)
adaptive_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    inner_kernel=nuts_kernel,
    num_adaptation_steps=80,
    step_size_getter_fn=lambda pkr: pkr.step_size,
    log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
    step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
)


Now we run the time-consuming part:

%%timeit -n1 -r1
tfp.mcmc.sample_chain(
    num_results=samples,
    current_state=init,
    kernel=adaptive_kernel,
    num_burnin_steps=100,
    parallel_iterations=5)
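The adaptation kernel above tunes the NUTS step size during burn-in, nudging it toward a target acceptance probability by reading and writing the kernel results through the getter and setter functions. As a rough illustration only, here is a numpy sketch of a simple multiplicative rule in the spirit of tfp.mcmc.SimpleStepSizeAdaptation; the helper names, the 0.75 target, the 0.01 rate, and the toy acceptance curve are assumptions, not TFP's implementation:

```python
import numpy as np

def adapt_step_size(step_size, accept_prob, target=0.75, rate=0.01):
    """Hypothetical helper: grow the step when acceptance is above target,
    shrink it when acceptance is below target (multiplicative update)."""
    if accept_prob > target:
        return step_size * (1.0 + rate)
    return step_size / (1.0 + rate)

def toy_accept_prob(step_size):
    """Hypothetical stand-in for a sampler: acceptance falls as the step grows."""
    return float(np.exp(-step_size))

step = 0.3  # same starting value as init_step_size above
for _ in range(2000):
    step = adapt_step_size(step, toy_accept_prob(step))

print(step)
```

The step size ends up oscillating in a narrow band around the point where the toy acceptance rate crosses the target (about -log(0.75) ≈ 0.288 here). Larger rates move faster but settle less precisely.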


If we wrap the sampler in a tensorflow function, we get dramatic speedups:

@tf.function
def sampler(init_vals):

    # the inner @tf.function is optional inside an outer tf.function
    @tf.function
    def ols_loglike(beta, sigma):
        # xb (mu_i for each observation)
        mu = tf.linalg.matvec(X, beta)
        # this is normal pdf logged and summed over all observations
        ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\
            (1./(2.*sigma**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
        return ll

    nuts_kernel = tfp.mcmc.NoUTurnSampler(
        target_log_prob_fn=ols_loglike,
        step_size=init_step_size,
    )
    # step-size adaptation wrapper (assumed: SimpleStepSizeAdaptation)
    adaptive_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
        inner_kernel=nuts_kernel,
        num_adaptation_steps=80,
        step_size_getter_fn=lambda pkr: pkr.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
    )
    sample_vals, stats = tfp.mcmc.sample_chain(num_results=samples,
                                               current_state=init_vals,
                                               kernel=adaptive_kernel,
                                               num_burnin_steps=100,
                                               parallel_iterations=5)
    return sample_vals, stats


Check out the type of the sampler:

type(sampler)


Let's sample from the function:

%%timeit -n1 -r1
sampler(init)


That is a BIG speedup (it is ~14x faster) just by wrapping your code in a tensorflow function.
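To see where this kind of speedup comes from in isolation, here is a minimal sketch (not from the original post) that runs the same loop of small ops eagerly and inside a tf.function. In eager mode each op is dispatched from Python one at a time, while tf.function traces the Python code once into a graph and reuses that graph on later calls with the same input signature; the Python list below only grows during tracing. The function names are hypothetical and timings will vary by machine.

```python
import time
import tensorflow as tf

trace_count = []  # Python side effects run only while tf.function traces

def many_small_ops(x):
    # 1,000 tiny updates: in eager mode each op is dispatched from Python
    for _ in range(1000):
        x = x * 0.999 + 0.001
    return x

@tf.function
def many_small_ops_graph(x):
    trace_count.append(1)  # executes at trace time, not on every call
    return many_small_ops(x)

x0 = tf.constant(0.0)

start = time.perf_counter()
eager_result = many_small_ops(x0)
eager_time = time.perf_counter() - start

many_small_ops_graph(x0)  # first call pays the one-time tracing cost
start = time.perf_counter()
graph_result = many_small_ops_graph(x0)  # reuses the traced graph
graph_time = time.perf_counter() - start

print(f"eager: {eager_time:.4f}s  graph: {graph_time:.4f}s  traces: {len(trace_count)}")
```

Note that a single first call, as timed with %%timeit -n1 -r1 above, includes this one-time tracing cost.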

### XLA Mode

Perhaps we can do even better using the new XLA compiler for our tensorflow function. This is experimental, but let's try it.
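A minimal sketch of XLA compilation outside of MCMC: the Gaussian log-likelihood wrapped once as a plain tf.function and once with XLA. It assumes a newer TensorFlow release, where the experimental_compile argument is spelled jit_compile (older releases only accept experimental_compile); the function names and toy data are hypothetical. The check only confirms that both versions agree; any speed difference depends on the model and hardware.

```python
import numpy as np
import tensorflow as tf

# Toy OLS data in float32 (hypothetical, mirrors the setup above)
rng = np.random.default_rng(0)
x = np.c_[np.ones(500), rng.standard_normal(500)].astype(np.float32)
y = x @ np.array([10.0, -1.0], dtype=np.float32) + rng.standard_normal(500).astype(np.float32)
X = tf.constant(x)
Y = tf.constant(y)

def loglike(beta, sigma):
    # Gaussian log-likelihood of the OLS model, summed over observations
    mu = tf.linalg.matvec(X, beta)
    n = tf.cast(tf.shape(X)[0], X.dtype)
    return (-(n / 2.0) * tf.math.log(2.0 * np.pi * sigma**2)
            - tf.reduce_sum((Y - mu) ** 2) / (2.0 * sigma**2))

graph_loglike = tf.function(loglike)                  # graph mode only
xla_loglike = tf.function(loglike, jit_compile=True)  # graph + XLA compilation

beta = tf.constant([10.0, -1.0])
sigma = tf.constant(1.0)
ll_graph = graph_loglike(beta, sigma)
ll_xla = xla_loglike(beta, sigma)
print(float(ll_graph), float(ll_xla))
```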

@tf.function(experimental_compile=True)
def sampler(init_vals):

    @tf.function(experimental_compile=True)
    def ols_loglike(beta, sigma):
        # xb (mu_i for each observation)
        mu = tf.linalg.matvec(X, beta)
        # this is normal pdf logged and summed over all observations
        ll = - (X.shape[0]/2.)*tf.math.log(2.*pi*sigma**2) -\
            (1./(2.*sigma**2.))*tf.math.reduce_sum((Y-mu)**2., axis=-1)
        return ll

    nuts_kernel = tfp.mcmc.NoUTurnSampler(
        target_log_prob_fn=ols_loglike,
        step_size=init_step_size,
    )
    # step-size adaptation wrapper (assumed: SimpleStepSizeAdaptation)
    adaptive_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
        inner_kernel=nuts_kernel,
        num_adaptation_steps=80,
        step_size_getter_fn=lambda pkr: pkr.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
    )
    sample_vals, stats = tfp.mcmc.sample_chain(num_results=samples,
                                               current_state=init_vals,
                                               kernel=adaptive_kernel,
                                               num_burnin_steps=100,
                                               parallel_iterations=5)
    return sample_vals, stats

%%timeit -n1 -r1