# Tensorflow ROCM Performance for Probabilistic Programming

This post investigates the Tensorflow ROCM (AMD) library for econometric problems. We compare operations that might be used in econometric applications and find a mixed bag for ROCM relative to Nvidia Cuda in terms of execution times (as of this writing, March 2020). In some cases Tensorflow ROCM performs brilliantly, but in others not so much. Since I have found no good benchmarking for these types of tensorflow operations on the internet, I hope this post will be useful.

Nvidia Cuda has a near monopoly for scientific computing, machine learning, and other compute heavy applications. The importance of GPU processing for scientific workloads can't be overstated as speedups can be dramatic to the point of making methodologies viable. With Cuda you are getting a development stack that is mature, can be applied to a multitude of problems, and that just works (once the drivers are installed). Above all it is very fast. The downside to Cuda is twofold in my opinion:

- All of the software stack from drivers to compute libraries is closed source and proprietary
- The hardware is very expensive particularly if installing an Nvidia card in a data center (where consumer-grade cards aren't allowed due to licensing restrictions)

Practically speaking, the first limitation can lead to a software stack that is unreliable on kernel upgrades (read: the computer won't boot to the GUI, or the Cuda stack breaks), and it can be quite difficult to install the Cuda framework depending on the Linux distribution you choose. The second limitation means that hardware can be exorbitantly expensive. But the upside is a robust and fast development environment for running models.

I was extremely excited to hear about the Radeon Open Compute (ROCM) initiative and AMD drivers in Linux. The primary benefit (as of March 2020) is that you can use the AMDGPU driver, which is an open source in-tree kernel driver for Linux (so it is very easy to install and maintain). Additionally, the scientific library is open source. Finally, in 2019 AMD released the Radeon VII with 16GB of GPU memory (which is nearly identical to AMD's data center card, the MI50), which for what you get is very inexpensive ($699 at the time of this writing). I read a bunch of compute benchmarks on reddit, the ROCM Github Issues page, and other sources. All of these benchmarks show that the Radeon VII on ROCM comes very close to the Nvidia 2080 Ti, Nvidia's flagship consumer card, for Tensorflow machine learning applications like resnet, graphics rendering, and gaming. Unfortunately, my applications don't match any of these benchmarks. My work involves heavy usage of tensorflow operations, high memory usage, and the use of the `tensorflow_probability` library, not image processing.

So I bought a Radeon VII and installed it in my machine along with the ROCM stack. The card installed easily, runs without any driver installation required, and is not loud or hot. Since I am using Tensorflow for some models I am running right now, I immediately wanted to try my working Cuda code (which runs on an Nvidia P100 16GB or V100 16GB GPU at the William and Mary Sciclone cluster) on the ROCM tensorflow stack (I term this ROCM from here on) on the Radeon VII installed on my personal workstation. I was able to run the code, and here is a quick summary of results:

- Once I disabled XLA support in my code, everything just worked. The code ran and gave me the same results as if I was on Cuda or the CPU. But disabling XLA can mean compute times are longer.
- For basic tensorflow operations, the ROCM tensorflow stack is very mature and comparable to Cuda (with a few exceptions). For these tensorflow functions, the Radeon VII on ROCM was usually every bit as fast as (and sometimes faster than) an Nvidia V100 16GB. Since that card costs as much as $8,000, that is impressive.
- For other functions (like `tensorflow_probability.mcmc`), the performance hit from having to disable XLA can be so large that even CPU with XLA is faster than ROCM on GPU. That is a big disappointment. I've opened an issue at the ROCM tensorflow github site. Hopefully it will be resolved soon. I will be updating when the issues are addressed.

**Hardware details and full disclosure**: the times presented in this post are not apples to apples comparisons since the Radeon VII card is not installed on the cluster but rather on my own workstation. My workstation has fewer cores (18 vs 64) but a faster processor (Xeon W-2155 @ 3.30 GHz vs E5-2683 v4 @ 2.10 GHz). Memory is roughly comparable. For these reasons, I recommend IGNORING the CPU comparisons between ROCM and Cuda, except to note how important XLA compilation is for some problems. All ROCM tests are run on Ubuntu 18.04 with the HWE kernel 5.3, whereas Cuda tests are on Redhat 7 with a 3.10 kernel. All Cuda tests are run by decorating functions with `@tf.function(experimental_compile=True)` where possible.

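
As a rough illustration of that decorator, here is a minimal sketch of a log-likelihood and gradient function whose XLA compilation can be switched on or off. The function `make_value_and_grad`, the Gaussian linear-model likelihood, and the tiny inputs are all hypothetical and are not taken from the benchmark code. It also assumes a recent TensorFlow release, where the `experimental_compile` argument has been renamed `jit_compile`.

```python
import tensorflow as tf


def make_value_and_grad(use_xla):
    """Build a log-likelihood/gradient function, optionally compiled with XLA.

    Hypothetical example: a Gaussian linear-model log-likelihood (up to a constant).
    """

    # `jit_compile` is the current name of the `experimental_compile` argument
    # used in this post; older TensorFlow releases only accept the latter.
    @tf.function(jit_compile=use_xla)
    def value_and_grad(beta, x, y):
        with tf.GradientTape() as tape:
            tape.watch(beta)
            resid = y - tf.linalg.matvec(x, beta)
            loglik = -0.5 * tf.reduce_sum(tf.square(resid))
        return loglik, tape.gradient(loglik, beta)

    return value_and_grad


x = tf.constant([[1.0, 0.0], [0.0, 1.0]])
y = tf.constant([1.0, 2.0])
beta = tf.zeros([2])

# use_xla=False mirrors the workaround needed on ROCM at the time of writing.
loglik, grad = make_value_and_grad(use_xla=False)(beta, x, y)
print(float(loglik), grad.numpy())
```

Keeping the XLA switch as a plain argument makes it easy to run the same function with and without compilation on each stack.
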
## Tensorflow Operations

Code for this problem can be found here. I have picked a few operations to compare in this test, some are linear algebra operations, some are for reshaping/copying data, and others are reduce or simple mathematical operations. This list isn't meant to be definitive in comparing Cuda and Rocm, but rather are functions that are or might be important to my workflow. In the table below, all times are in milliseconds.
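
The timing procedure is not spelled out in this post, so the following is only a sketch of one way per-operation timings like these could be collected. The helper name `time_ms`, the warm-up count, and the repeat count are assumptions for illustration. Warm-up calls are discarded because the first call to a `tf.function` includes tracing (and XLA compilation, if enabled).

```python
import statistics
import time


def time_ms(fn, n_warmup=3, n_repeat=20):
    """Return the median wall-clock time of fn() in milliseconds.

    Hypothetical helper: warm-up calls are discarded so that one-off costs
    (tracing, compilation, memory allocation) do not count toward the result.
    """
    for _ in range(n_warmup):
        fn()
    samples = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; with TensorFlow one would pass something like
# `lambda: op(x).numpy()` so the device result is fetched before the clock stops.
elapsed = time_ms(lambda: sum(i * i for i in range(10_000)))
print(f"{elapsed:.3f} ms")
```

The median is used here rather than the mean so that a single slow run does not dominate the reported figure.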

| Operation (ms) | CPU-Cuda | GPU-Cuda | CPU-ROCM | GPU-ROCM |
|---|---|---|---|---|
| `tf.linalg.inv()` | 124.49 | 9.80 | 88.86 | 88.23 |
| `tf.linalg.matmul()` | 2.77 | 2.02 | 3.62 | 1.68 |
| `tf.gather_nd()` | 2.002 | 3.01 | 5.34 | 4.05 |
| `tf.scatter_nd()` | 278.75 | 7.12 | 161.98 | 5.56 |
| `tf.gather_nd()` | 0.54 | 0.73 | 2.89 | 1.19 |
| `tf.math.bincount()` | 5.19 | 2.19 | 4.29 | 2.98 |
| `tf.multiply()` | 9.89 | 0.03 | 26.39 | 0.01 |
| `tf.reduce_sum()` | 4.89 | 0.06 | 5.65 | 0.02 |
| `tf.add()` | 0.27 | 0.64 | 0.43 | 0.83 |

Some things stand out in these results:

- The ROCM GPU stack is faster than base tensorflow (labeled CPU-Cuda) for 6 of the 9 operations in the table; the exceptions are the two `tf.gather_nd()` rows and `tf.add()`.
- Operations using ROCM GPU are faster than Cuda GPU operations in 4 of the 9 tests considered here. When ROCM is faster, differences are usually small (less than 25%).
- In 4 of the 5 occasions where the ROCM GPU operation is slower, differences are moderate (roughly 30% to 65% slower in the table above).
- In one instance (matrix inversion), ROCM appears not to implement the operation on the GPU (the CPU-ROCM and GPU-ROCM times are nearly identical), and it is 9 times slower than its Cuda counterpart.
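
The GPU comparisons in these bullets can be checked with a few lines of arithmetic. The sketch below copies the two GPU columns from the table and computes the ROCM/Cuda time ratio for each row; the dictionary layout and variable names are just for illustration.

```python
# GPU times in milliseconds, copied from the table above: (GPU-Cuda, GPU-ROCM).
# The table lists tf.gather_nd() twice, so the rows are numbered here.
gpu_ms = {
    "tf.linalg.inv": (9.80, 88.23),
    "tf.linalg.matmul": (2.02, 1.68),
    "tf.gather_nd (1)": (3.01, 4.05),
    "tf.scatter_nd": (7.12, 5.56),
    "tf.gather_nd (2)": (0.73, 1.19),
    "tf.math.bincount": (2.19, 2.98),
    "tf.multiply": (0.03, 0.01),
    "tf.reduce_sum": (0.06, 0.02),
    "tf.add": (0.64, 0.83),
}

# Ratio > 1 means ROCM took longer than Cuda for that operation.
ratios = {op: rocm / cuda for op, (cuda, rocm) in gpu_ms.items()}
rocm_faster = [op for op, r in ratios.items() if r < 1.0]

for op, r in ratios.items():
    print(f"{op:18s} ROCM/Cuda = {r:5.2f}")
print(f"ROCM faster in {len(rocm_faster)} of {len(ratios)} operations")
```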

Apart from the first row in the preceding table (which is a big slow-down), we see that ROCM is a viable alternative to CUDA if your work only requires tensorflow operations. This is fantastic and speaks to the advances made by the ROCM community. But caution is warranted. If your function isn't implemented yet on the GPU, you may not get enough speedup from ROCM to justify switching from CUDA.

## A Toy Problem

Code for this problem can be found here. This problem differs from the preceding test as follows:

- Tensorflow functions are created from basic tensorflow operations.
- These functions are timed on their own.
- These functions are used as inputs to `tensorflow_probability` calls like `mcmc` (for the NUTS sampling rows) and `optimizer` (for the Maximum Likelihood Estimation rows).

| Test | CPU-Cuda | GPU-Cuda | CPU-ROCM | GPU-ROCM |
|---|---|---|---|---|
| Function and Gradient (ms) | 36.40 | 37.50 | 34.70 | 27.60 |
| Maximum Likelihood Estimation (s) | 9.63 | 8.04 | 4.26 | 8.93 |
| NUTS Sampling (s) | 108.92 | 1.58 | 99.50 | 47.29 |

The results show that ROCM GPU is actually significantly faster than GPU-Cuda (by around 30%) for the function and gradient calculations (recall these are pure tensorflow operations). That's fantastic! When we estimate by maximum likelihood using `tensorflow_probability`, we see that on GPU, ROCM holds its own compared to CUDA (although the ROCM CPU is the fastest, probably due to the superior processor hardware on my workstation). Comparing CPU-ROCM to GPU-ROCM, we see that the GPU actually slows things down. That's not fantastic. For NUTS Markov chain Monte Carlo sampling, we see that the ROCM stack is **MUCH** slower than CUDA on the GPU (about 30x slower).

## A "Real" Problem

I'm not going to share the code for this one since it is ongoing research. Suffice it to say that the model log-likelihood uses many of the tensorflow operations we explored above (not `tf.linalg.inv`), and this tensorflow function is used with the `tensorflow_probability.mcmc` and `tensorflow_probability.optimizer` methods for large tensorflow matrices. Given our findings above, we would expect 1) log-likelihood evaluations in ROCM to be similar to Cuda (since the function is composed of tensor operations), and 2) `tensorflow_probability` calls to be slower than their Cuda counterparts.

| Test | Model | CPU-Cuda | GPU-Cuda | CPU-ROCM | GPU-ROCM |
|---|---|---|---|---|---|
| Function and Gradient (ms) | 1 | 0.40 | 0.50 | 2.53 | 1.98 |
| Maximum Likelihood Estimation (ms) | 1 | 110.00 | 10.00 | 122.99 | 138.41 |
| NUTS Sampling (s) | 1 | 36.57 | 8.12 | 15.46 | 31.72 |
| Function and Gradient (ms) | 2 | 30.00 | 10.00 | 30.37 | 12.58 |
| NUTS Sampling (s) | 2 | 36.57 | 8.12 | 1546.75 | 637.78 |

The real-world example supports what we would expect: on the GPU, the function and gradient calculations are somewhat slower on ROCM than on CUDA (4x for Model 1 and 1.3x for Model 2), as these use gather operations on very large matrices, which is where CUDA excels relative to ROCM. These aren't bad results for ROCM, particularly for Model 2, which is the most compute intensive of any calculation reported here so far. But when we use either of the `tensorflow_probability` functions, we see severe performance degradation relative to CUDA. Furthermore, for NUTS sampling in Model 2, the CPU on base tensorflow is faster by some margin (around 20x) than even ROCM GPU. The CUDA GPU stack is orders of magnitude faster than ROCM for sampling. That is very bad news for ROCM for this problem.

## Conclusion

ROCM is being actively developed and I expect these issues will be addressed relatively soon. It seems that `tensorflow_probability` has not been optimized for ROCM yet and, furthermore, ROCM isn't currently optimized for XLA compilation in some instances. Taken together, this means that for many "custom" tensorflow applications results will vary, and the benchmarks you commonly see on the internet (that I cite in the introduction) are not a reliable indicator of how ROCM will perform for your problem.