Using PyTorch for Maximum Likelihood Estimation
This post investigates using `pytorch` for econometric research in a maximum likelihood setting. Packages like `torch-choice` (designed to run discrete choice or multinomial logit models) show that huge speedups can be achieved compared to typical platforms, thanks to graphical processing units (GPUs) and the benefits of autograd.
This post explores three issues:
- Installing the necessary software to get PyTorch running on AMD hardware (very short discussion)
- How to use `pytorch` for generic settings/custom log-likelihoods, where existing packages like `torch-choice` are insufficient for your needs
- How to use `pytorch-minimize` to further streamline optimization using `pytorch`
PyTorch on AMD Hardware (Fedora 39)
As of Fedora 39, it is now possible to get a working version of the Radeon Open Compute Platform (ROCm) quite easily. ROCm on Fedora enables GPU hardware made by AMD to be used with common machine learning frameworks like `pytorch`, and allows you to use the open source "upstream" AMD driver bundled with Fedora. This setup should be quite robust to kernel updates and, despite claims to the contrary, does not require `docker`. While Fedora isn't officially supported (Ubuntu and Red Hat are the intended targets), this works quite well, and big kudos to the scientific computing group at Fedora for getting it working.
Most of the steps are outlined in this post:
- Add user to proper video groups:

sudo usermod -a -G video $LOGNAME

- From Fedora `dnf`, install the following:

sudo dnf install rocminfo
sudo dnf install rocm-opencl
sudo dnf install rocm-hip

- Create a pytorch virtual environment. Include the usual suspects like `pandas`, `matplotlib`, and `scipy`. For the installation of ROCm 5.7, I had to specify `python==3.10`.
- Install PyTorch for ROCm 5.7+ (I have a Radeon 7900xtx, and it requires at least ROCm 5.7):

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm5.7

- Add this to your .bashrc if your machine has integrated AMD graphics:

# hopefully fixes rocm bugs with 7900xtx
export HIP_VISIBLE_DEVICES=0
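Before going further, it's worth sanity-checking the install. The one-liner below is just one possible check (my suggestion, not from the original setup steps); on a working ROCm build, `torch.cuda.is_available()` should report True and `torch.version.hip` should report a HIP/ROCm version:

# optional sanity check of the ROCm build (my suggestion)
python -c "import torch; print(torch.cuda.is_available(), torch.version.hip)"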
A simple example for Econometricians
While `pytorch` is a machine learning library like `tensorflow`, it is built with key capabilities that are beneficial for econometric projects:
- Built to access fast linear algebra and vectorization operations on GPUs (like Nvidia or AMD)
- Has `autograd` capabilities, where first, second, etc. derivatives are baked into your econometric model
- Is free and open-source with no licensing restrictions
Lately, `pytorch` seems to be winning the machine learning race, so I thought it might be interesting to try it in a maximum likelihood setting.
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from torchmin import minimize as pyt_minimize
from scipy.optimize import minimize
from datetime import datetime as datetime
dtype = torch.float64
device = "cuda" if torch.cuda.is_available() else "cpu"
name = torch.cuda.get_device_name(0) if device=='cuda' else 'cpu'
print("Running", device, "on a", name)
torch.set_default_device(device)
Running cuda on a AMD Radeon RX 7900 XTX
While `pytorch` reports running `cuda` (Nvidia's proprietary scientific computing platform), we're using ROCm under the hood. This is important because:
- we can proceed as if we have Nvidia hardware and CUDA software installed, and have access to GPU-accelerated scientific computing
- the ROCm software package is free and open source, so we can use cheaper hardware with no licensing restrictions
We'll be estimating a simple OLS model with a fairly high number of parameters (100) and observations (50,000). First, we run the model using base linear algebra commands in `numpy`:
##
## simple OLS Data Generation Process
##
# True beta
N = 50000
K = 100
b = np.random.randn(K)
b[0] = b[0] + 3.
# True error std deviation
sigma_e = 1.
x = np.c_[np.ones(N), np.random.randn(N,K-1)]
y = x.dot(b) + sigma_e * np.random.randn(N)
# estimate parameter vector, errors, sd of errors, and se of parameters
bols = np.linalg.inv(x.T.dot(x)).dot(x.T.dot(y))
err = y - x.dot(bols)
sigma_ols = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]))
se = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]) * np.diagonal(np.linalg.inv(x.T.dot(x))))
# put results together for easy viewing
ols_parms = np.r_[bols, sigma_ols]
ols_se = np.r_[se, np.nan]
indexn = ['b'+str(i) for i in range(K)]
indexn.extend(['sigma'])
Now, using the same data as above, let's run the model on the GPU using `pytorch`. This requires us to:
- "Transfer" the data to the GPU using `torch.tensor`
- Define the objective function (\(-\log(\text{likelihood})\)) using `pytorch` primitives. This step is key because these mathematical primitives (matrix multiplication, sums, inverses, logs, etc.) are "derivative aware" functions, enabling `autograd` capabilities
- Use the machine-learning-centric capabilities of `pytorch` to estimate the model via maximum likelihood and report standard errors. The `autograd` capabilities allow this to happen far faster than computing gradients using finite differences (e.g. `scipy` on CPU)
X = torch.tensor(x, dtype=dtype)
Y = torch.tensor(y, dtype=dtype)
The parameter vector we will be estimating using `pytorch` is \(\omega = \begin{bmatrix} \beta \\ \sigma \end{bmatrix}\). To signal to `pytorch` which parameters we will be optimizing the model over, we add the option `requires_grad=True`. This creates a tensor that has a gradient attached to it.
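To see what this buys us in isolation, here is a minimal illustration (my addition, not part of the original post) of how a `requires_grad` tensor accumulates gradients:

# minimal illustration of requires_grad: after a backward pass on any
# downstream scalar, the gradient is stored in the tensor's .grad field
t = torch.tensor([1., 2.], dtype=dtype, requires_grad=True)
f = (t**2).sum()   # f = t_1^2 + t_2^2
f.backward()       # fills t.grad with df/dt = 2*t
print(t.grad)      # tensor([2., 4.], ...)

Returning to our model, we now create the \(\omega\) tensor: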
# initialize parameter vector:
# betas in first K positions and sigma in last
startvals = np.append(np.random.randn(K), 1.)
omega = torch.tensor(startvals, dtype=dtype, requires_grad=True)
##
## Model Log-Likelihood (times -1)
##
# (note: Y and X are tensors)
def ols_loglike(omega):
    # divide omega into beta, sigma
    beta = omega[:-1]
    sigma = omega[-1]
    # xb (mu_i for each observation)
    mu = torch.mv(X, beta)
    # this is the normal pdf logged and summed over all observations
    ll = - (Y.shape[0]/2.)*torch.log(2.*torch.pi*sigma**2) -\
         (1./(2.*sigma**2.))*torch.sum(((Y-mu)**2.))
    return -1.*ll
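As a quick sanity check (my addition; the original post doesn't include this step), the hand-coded expression should agree with `scipy`'s normal log-pdf summed over observations:

# the hand-rolled -log(likelihood) should match scipy's normal log-pdf,
# evaluated here at the starting values
check = -norm.logpdf(y, loc=x.dot(startvals[:-1]), scale=startvals[-1]).sum()
print(check, ols_loglike(omega).item())  # the two values should agree closely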
The code below shows how to fit the model over the 101 parameters using stochastic gradient descent:
gd = torch.optim.SGD([omega], lr=1e-5)
history_gd = []
time_start = datetime.now()
for i in range(100000):
    gd.zero_grad()
    objective = ols_loglike(omega)
    objective.backward()
    gd.step()
    history_gd.append(objective.item())
    if (i>1) and (np.abs(history_gd[-1] - history_gd[-2]) < .00001):
        print("Convergence achieved in ", i+1, " iterations")
        print("-LogL Value: ", objective.item())
        print("Mean |gradient|: ", torch.abs(omega.grad).mean().item())
        break
time_pytorch = datetime.now() - time_start
Convergence achieved in 11841 iterations
-LogL Value: 71062.8326115869
Mean |gradient|: 0.03338810517902473
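Since we stored the objective value at every iteration in `history_gd` (and `matplotlib` is already imported), one optional way to inspect convergence is to plot the path. This snippet is my addition rather than part of the original post:

# plot the -log(likelihood) at each SGD iteration to inspect convergence
plt.plot(history_gd)
plt.xlabel("iteration")
plt.ylabel("-log(likelihood)")
plt.title("SGD convergence path")
plt.show()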
Note from above that the estimated parameter vector `omega` has gradients attached to it via `omega.grad`. Here are the first 10 at the MLE estimate:
omega.grad[:10]
tensor([-0.1425, -0.0136, -0.0409, 0.0324, 0.0287, 0.0082, 0.0293, -0.0478, 0.0373, 0.0098], device='cuda:0', dtype=torch.float64)
Since `pytorch` is gradient aware, we can quickly compute standard errors using the hessian: at the MLE, the inverse hessian of the negative log-likelihood estimates the asymptotic covariance matrix of \(\hat{\omega}\), so the standard errors are the square roots of its diagonal:
# compute se's using hessian
hessian = torch.autograd.functional.hessian(ols_loglike, omega)
se_torch = torch.sqrt(torch.linalg.diagonal(torch.linalg.inv(hessian)))
Note several things in the code above:
- `pytorch` approaches optimization in terms of a learning rate `lr`. This concept is mostly foreign to what most of us learned about classic maximum likelihood estimation. It is possible to tune the learning rate during iterations of the stochastic gradient descent process (as well as in other optimization methods like BFGS), and doing so could further speed up or improve the quality of the search for the MLE estimator. Smaller learning rates mean smaller jumps in the parameter vector from one iteration to the next. I haven't explored this further, but as a classically trained maximum likelihood fanboy, it seems like the hessian should give us the learning rate (see the sketch after this list).
- The `gd` object isn't set up like a classic optimizer, in that it won't iterate automatically until either the optimum is reached or maximum iterations are exceeded. Rather, you supply a number of iterations (I supplied 100,000 above) and either let it run until all iterations are used or add your own convergence criterion for stopping (I used changes in the function value to stop iterations).
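To make the hessian-as-learning-rate idea concrete, here is a minimal sketch (my own, not from the original post) of a Newton-Raphson loop where the autograd hessian scales the step instead of a fixed `lr`; a production version would need safeguards such as a line search:

# hypothetical Newton-Raphson alternative: the hessian sets the step size
omega_n = torch.tensor(startvals, dtype=dtype, requires_grad=True)
for i in range(25):
    f = ols_loglike(omega_n)
    grad = torch.autograd.grad(f, omega_n)[0]
    if torch.abs(grad).max() < 1e-6:       # stop when the gradient is ~0
        break
    H = torch.autograd.functional.hessian(ols_loglike, omega_n)
    with torch.no_grad():                  # update the leaf tensor in place
        omega_n -= torch.linalg.solve(H, grad)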
Taken together, `pytorch` out of the box is less than ideal for this type of work, although we've shown it is feasible. Here is a comparison of the results for the first 5 parameters estimated:
results = pd.DataFrame(np.c_[ols_parms, ols_se],columns=['estimate', 'std err'], index=indexn)
midx = pd.MultiIndex.from_product([['OLS', 'Pytorch'], ['estimate', 'std err']])
results['torch estimate'] = omega.data.cpu().numpy()
results['torch std err'] = se_torch.data.cpu().numpy()
results.columns = midx
results.head()
|    | OLS estimate | OLS std err | Pytorch estimate | Pytorch std err |
|----|----|----|----|----|
| b0 | 3.454166 | 0.004492 | 3.454164 | 0.004488 |
| b1 | -0.556897 | 0.004488 | -0.556898 | 0.004483 |
| b2 | -0.392518 | 0.004510 | -0.392518 | 0.004505 |
| b3 | -1.999855 | 0.004500 | -1.999855 | 0.004495 |
| b4 | -2.102200 | 0.004475 | -2.102200 | 0.004471 |
PyTorch + PyTorch Minimize
The above example shows that using `pytorch` is not as streamlined as it could be for maximum likelihood optimization. `pytorch-minimize` has already dealt with this problem. This package provides a `scipy`-style interface on top of the autograd capabilities of `pytorch`, giving a more traditional optimization experience. For example, to use a BFGS-type gradient-based method for minimization, redefine the log-likelihood so that the data (tensors) are passed into the function:
##
## Model Log-Likelihood (times -1)
##
# (note: Y and X are tensors)
def ols_loglike_(omega, Y, X):
    # divide omega into beta, sigma
    beta = omega[:-1]
    sigma = omega[-1]
    # xb (mu_i for each observation)
    mu = torch.mv(X, beta)
    # this is the normal pdf logged and summed over all observations
    ll = - (Y.shape[0]/2.)*torch.log(2.*torch.pi*sigma**2) -\
         (1./(2.*sigma**2.))*torch.sum(((Y-mu)**2.))
    return -1*ll
and then use syntax like this (note we are reinitializing the starting values to make this a more interesting optimization problem):
# initialize parameter vector:
# betas in first K positions and sigma in last
startvals = np.append(np.random.randn(K), 1.)
omega = torch.tensor(startvals, dtype=dtype, requires_grad=True)
time_start = datetime.now()
res = pyt_minimize(lambda x: ols_loglike_(x, Y, X),
omega, method='l-bfgs', tol=1e-5, disp=True)
time_pyt_scipy = datetime.now() - time_start
A non-descent direction was encountered.
Current function value: 71062.832610
Iterations: 31
Function evaluations: 42
All the estimated parameters are then stored in `res.x`. Note: we should not use the `hess_inv` that quasi-Newton optimizers (like `scipy.minimize`'s BFGS) report for computing standard errors, as it is only an approximation built up from gradient updates. This isn't a problem, however, since we have `autograd` with `pytorch` and can proceed as before:
hessian_ = torch.autograd.functional.hessian(lambda x: ols_loglike_(x, Y, X), res.x)
se_torch_ = torch.sqrt(torch.linalg.diagonal(torch.linalg.inv(hessian_)))
results['torch/scipy estimate'] = res.x.data.cpu().numpy()
results['torch/scipy std err'] = se_torch_.data.cpu().numpy()
midx = pd.MultiIndex.from_product([['OLS', 'Pytorch', 'Pytorch + Scipy'], ['estimate', 'std err']])
results.columns = midx
results.head()
|    | OLS estimate | OLS std err | Pytorch estimate | Pytorch std err | Pytorch + Scipy estimate | Pytorch + Scipy std err |
|----|----|----|----|----|----|----|
| b0 | 3.454166 | 0.004492 | 3.454164 | 0.004488 | 3.454166 | 0.004488 |
| b1 | -0.556897 | 0.004488 | -0.556898 | 0.004483 | -0.556897 | 0.004483 |
| b2 | -0.392518 | 0.004510 | -0.392518 | 0.004505 | -0.392518 | 0.004505 |
| b3 | -1.999855 | 0.004500 | -1.999855 | 0.004495 | -1.999855 | 0.004495 |
| b4 | -2.102200 | 0.004475 | -2.102200 | 0.004471 | -2.102200 | 0.004471 |
Why does this matter? In addition to providing a more "matlab-like" optimization experience, this approach gives us access to a wide array of optimization algorithms for difficult problems, all while enjoying the benefits of `autograd` and fast linear algebra computations on the GPU.
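For instance, swapping in a different solver is a one-line change. The sketch below assumes the `newton-exact` method name is available in your version of `pytorch-minimize` (check its documentation if not):

# hypothetical example: switch solvers by changing the method argument
res_newton = pyt_minimize(lambda w: ols_loglike_(w, Y, X),
                          omega, method='newton-exact', tol=1e-5)
print(res_newton.x[:5])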
Comparisons
Both methods provide good parameter estimates compared to base OLS, although `pytorch-minimize` has smaller root mean squared error (this seems to always be true if you use `dtype=torch.float64`):
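The RMSEs below are measured against the base OLS estimates; a reconstruction of the calculation (my best guess, since the post doesn't show the code) might look like:

# RMSE of each GPU-based estimate relative to the base OLS benchmark
est_ols = results[('OLS', 'estimate')].values
est_torch = results[('Pytorch', 'estimate')].values
est_torchmin = results[('Pytorch + Scipy', 'estimate')].values
print("Root mean squared error (pytorch):",
      np.sqrt(np.mean((est_torch - est_ols)**2)))
print("Root mean squared error (pytorch + pytorch-minimize):",
      np.sqrt(np.mean((est_torchmin - est_ols)**2)))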
Root mean squared error (pytorch): 4.780654101229938e-07
Root mean squared error (pytorch + pytorch-minimize): 3.5071836544411594e-09
The times for the runs are as follows:

|    | Seconds |
|----|----|
| Scipy (No GPU/Autograd) | 5.470564 |
| Pytorch | 13.901280 |
| Pytorch + Pytorch Minimize | 0.123910 |
We see that the plain `pytorch` SGD method has the longest runtime. This could potentially be lowered by adjusting the learning rate during gradient descent. However, since `pytorch-minimize` has better RMSE and the shortest runtime, why bother?