Using PyTorch for Maximum Likelihood Estimation

This post investigates using pytorch for econometric research in a maximum likelihood setting. Packages like torch-choice (designed to run discrete choice or multinomial logit models) show that huge speedups can be achieved compared to typical platforms, thanks to graphics processing units (GPUs) and the benefits of autograd.

This post explores three issues:

  1. Installing the necessary software to get pytorch running on AMD hardware (a very short discussion)
  2. How to use pytorch for generic settings/custom log-likelihoods, where existing packages like torch-choice are insufficient for your needs
  3. How to use pytorch-minimize to further streamline optimization using pytorch

PyTorch on AMD Hardware (Fedora 39)

As of Fedora 39, it is now possible to get a working version of the Radeon Open Compute Platform (ROCm) quite easily. ROCm on Fedora enables GPU hardware made by AMD to be used with common machine learning frameworks like pytorch, using the open source "upstream" AMD driver bundled with Fedora. This setup should be quite robust to kernel updates and, despite claims to the contrary, does not require docker. While Fedora isn't officially supported (Ubuntu and Red Hat are the intended targets), this works quite well, and big kudos to the scientific computing group at Fedora for getting it working.

Most of the steps are outlined in this post:

  1. Add user to proper video groups:

      sudo usermod -a -G video $LOGNAME
    
  2. Using dnf, install the following:

      sudo dnf install rocminfo
      sudo dnf install rocm-opencl
      sudo dnf install rocm-hip
    
  3. Create a pytorch virtual environment. Include the usual suspects like pandas, matplotlib, and scipy. For the installation of ROCm 5.7, I had to specify python==3.10.
  4. Install PyTorch for ROCm 5.7+ (I have a Radeon 7900xtx, and it requires at least ROCm 5.7)

       pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm5.7
    
  5. Add this to your .bashrc if your machine has integrated AMD graphics:

       # hopefully fixes rocm bugs with 7900xtx
       export HIP_VISIBLE_DEVICES=0
    
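
Once everything is installed, a quick sanity check from inside the virtual environment (a minimal sketch; the same calls appear again in the example below) is:

    import torch
    print(torch.cuda.is_available())   # should print True on a working ROCm setup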

A Simple Example for Econometricians

While pytorch is a machine learning library like tensorflow, it is built on key capabilities that are beneficial for econometric projects:

  1. Built to access fast linear algebra and vectorized operations on GPUs (Nvidia or AMD)
  2. Has autograd capabilities, so first, second, and higher-order derivatives are baked into your econometric model (a tiny sketch follows this list)
  3. Is free and open-source with no licensing restrictions
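
As a tiny illustration of the autograd point (a generic sketch, nothing specific to econometrics):

import torch

u = torch.tensor(3.0, requires_grad=True)  # f(u) = u**2 evaluated at u = 3
f = u ** 2
f.backward()                               # autograd computes df/du = 2u
print(u.grad)                              # tensor(6.)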

Lately, pytorch seems to be winning the machine learning race, so I thought it might be interesting to try it in a maximum likelihood setting.

import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from torchmin import minimize as pyt_minimize
from scipy.optimize import minimize
from datetime import datetime

dtype = torch.float64
device = "cuda" if torch.cuda.is_available() else "cpu"
name = torch.cuda.get_device_name(0) if device=='cuda' else 'cpu'
print("Running", device, "on a", name)
torch.set_default_device(device)
Running cuda on a AMD Radeon RX 7900 XTX

While pytorch reports running cuda (Nvidia's proprietary GPU computing platform), ROCm is doing the work under the hood. This is important because

  • we can proceed as if we had Nvidia hardware and cuda software installed, with full access to GPU-accelerated scientific computing
  • the ROCm software stack is free and open source, so we can use cheaper hardware and avoid licensing restrictions

We'll be estimating a simple OLS model with a fairly large number of parameters (100) and observations (50,000). First we run the model using base linear algebra commands in numpy:

##
## simple OLS Data Generation Process
##
# True beta
N = 50000
K = 100
b = np.random.randn(K)
b[0] = b[0] + 3.
# True error std deviation
sigma_e = 1.

x = np.c_[np.ones(N), np.random.randn(N,K-1)]
y = x.dot(b) + sigma_e * np.random.randn(N)

# estimate parameter vector, errors, sd of errors, and se of parameters
bols = np.linalg.inv(x.T.dot(x)).dot(x.T.dot(y))
err = y - x.dot(bols)
sigma_ols = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]))
se = np.sqrt(err.dot(err)/(x.shape[0] - x.shape[1]) * np.diagonal(np.linalg.inv(x.T.dot(x))))
# put results together for easy viewing
ols_parms = np.r_[bols, sigma_ols]
ols_se = np.r_[se, np.nan]
indexn = ['b'+str(i) for i in range(K)]
indexn.extend(['sigma'])

Now, using the same data as above, let's run the model on the GPU using pytorch. This requires us to

  1. "Transfer" the data to the GPU using torch.tensor
  2. Define the objective function (\(-\log(\text{likelihood})\)) using pytorch primitives. This step is key because these mathematical primitives (matrix multiplication, sums, inverses, logs, etc.) are "derivative aware" functions, enabling autograd
  3. Use the machine-learning-centric capabilities of pytorch to estimate the model via maximum likelihood and report standard errors. Autograd allows this to happen far faster than computing gradients by finite differences (e.g. scipy on the cpu).
X = torch.tensor(x, dtype=dtype)
Y = torch.tensor(y, dtype=dtype)

The parameter vector we will be estimating with pytorch is \(\omega = \begin{bmatrix} \beta \\ \sigma \end{bmatrix}\). To signal to pytorch which parameters the model is optimized over, we add the option requires_grad=True, which creates a tensor with a gradient attached to it.

# initialize parameter vector:
#  betas in first K positions and sigma in last
startvals = np.append(np.random.randn(K), 1.)
omega = torch.tensor(startvals, dtype=dtype, requires_grad=True)
##
## Model Log-Likelihood (times -1)
##
# (note: Y and X are tensors)
def ols_loglike(omega):
    # divide omega into beta, sigma
    beta = omega[:-1]
    sigma = omega[-1]
    # xb (mu_i for each observation)
    mu = torch.mv(X, beta)
    # this is normal pdf logged and summed over all observations
    ll = - (Y.shape[0]/2.)*torch.log(2.*torch.pi*sigma**2) -\
        (1./(2.*sigma**2.))*torch.sum(((Y-mu)**2.))
    return -1.*ll
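
As an optional sanity check (a small sketch, not part of the estimation itself), the value of this function should agree with scipy's normal log-density summed over all observations on the cpu:

# compare the torch negative log-likelihood to scipy's normal log-density
omega_np = omega.detach().cpu().numpy()
ll_scipy = norm.logpdf(y, loc=x.dot(omega_np[:-1]), scale=omega_np[-1]).sum()
print(-ll_scipy, ols_loglike(omega).item())  # the two values should match closely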

The code below shows how to fit the model over the 101 parameters using gradient descent (torch.optim.SGD; since every step uses the full sample, this is full-batch rather than truly stochastic gradient descent):

gd = torch.optim.SGD([omega], lr=1e-5)
history_gd = []
time_start = datetime.now()
for i in range(100000):
    gd.zero_grad()
    objective = ols_loglike(omega)
    objective.backward()
    gd.step()
    history_gd.append(objective.item())
    if (i>1) and (np.abs(history_gd[-1] - history_gd[-2]) < .00001):
        print("Convergence achieved in ", i+1, " iterations")
        print("-LogL Value: ", objective.item())
        print("Mean |gradient|: ", torch.abs(omega.grad).mean().item())
        break

time_pytorch = datetime.now() - time_start
Convergence achieved in  11841  iterations
-LogL Value:  71062.8326115869
Mean |gradient|:  0.03338810517902473

Note from above that the estimated parameter vector omega has a gradient attached to it, accessible as omega.grad. Here are the first 10 elements at the MLE estimate:

omega.grad[:10]
tensor([-0.1425, -0.0136, -0.0409,  0.0324,  0.0287,  0.0082,  0.0293, -0.0478,
         0.0373,  0.0098], device='cuda:0', dtype=torch.float64)

Since pytorch is gradient aware, we can quickly compute standard errors using the hessian of the negative log-likelihood: at the MLE, the inverse hessian estimates the parameter covariance matrix, and the standard errors are the square roots of its diagonal:

# compute se's using hessian
hessian = torch.autograd.functional.hessian(ols_loglike, omega)
se_torch = torch.sqrt(torch.linalg.diagonal(torch.linalg.inv(hessian)))

Note several things in the code above:

  • pytorch approaches optimization in terms of a learning rate lr, a concept mostly foreign to classic maximum likelihood estimation. The learning rate can be tuned during iterations of stochastic gradient descent (and of other optimization methods like BFGS), which could further speed up or improve the quality of the search for the MLE estimator; smaller learning rates mean smaller jumps in the parameter vector from one iteration to the next. I haven't explored this deeply, but as a classically trained maximum likelihood fanboy, it seems like the hessian should give us the learning rate (effectively what Newton-type methods do). A minimal sketch of scheduling the learning rate follows this list.
  • The gd object isn't set up like a classic optimizer: it won't iterate automatically until either the optimum is reached or a maximum iteration count is exceeded. Rather, you supply a number of iterations (I supplied 100,000 above) and either let them all run or add your own convergence criterion for stopping (I stopped on small changes in the function value).
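
Here is that minimal sketch of scheduling the learning rate (StepLR is one of several schedulers available in torch.optim.lr_scheduler; the settings below are illustrative, not tuned):

# illustrative only: shrink the learning rate by 10% every 1000 iterations
gd = torch.optim.SGD([omega], lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(gd, step_size=1000, gamma=0.9)
for i in range(100000):
    gd.zero_grad()
    objective = ols_loglike(omega)
    objective.backward()
    gd.step()
    scheduler.step()  # updates lr according to the StepLR schedule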

Taken together, pytorch is less than ideal for this type of work, although we've shown it is feasible. Here is a comparison of the results for the first 5 parameters:

results = pd.DataFrame(np.c_[ols_parms, ols_se],columns=['estimate', 'std err'], index=indexn)

midx = pd.MultiIndex.from_product([['OLS', 'Pytorch'], ['estimate', 'std err']])
results['torch estimate'] = omega.data.cpu().numpy()
results['torch std err'] = se_torch.data.cpu().numpy()
results.columns = midx

results.head()
              OLS                   Pytorch
         estimate   std err    estimate   std err
b0       3.454166  0.004492    3.454164  0.004488
b1      -0.556897  0.004488   -0.556898  0.004483
b2      -0.392518  0.004510   -0.392518  0.004505
b3      -1.999855  0.004500   -1.999855  0.004495
b4      -2.102200  0.004475   -2.102200  0.004471

PyTorch + PyTorch Minimize

The above example shows that pytorch on its own is not as streamlined as it could be for optimization in a maximum likelihood setting. pytorch-minimize has already dealt with this problem: it offers a scipy.optimize-style interface built on the autograd capabilities of pytorch, providing a more traditional optimization experience. For example, to use the gradient-based "L-BFGS" method for minimization, redefine the log-likelihood so that the data (tensors) are passed into the function:

##
## Model Log-Likelihood (times -1)
##
# (note: Y and X are tensors)
def ols_loglike_(omega, Y, X):
    # divide omega into beta, sigma
    beta = omega[:-1]
    sigma = omega[-1]
    # xb (mu_i for each observation)
    mu = torch.mv(X, beta)
    # this is normal pdf logged and summed over all observations
    ll = - (Y.shape[0]/2.)*torch.log(2.*torch.pi*sigma**2) -\
        (1./(2.*sigma**2.))*torch.sum(((Y-mu)**2.))
    return -1*ll

and then use syntax like this (note that we reinitialize the starting values to make this a more interesting optimization problem):

# initialize parameter vector:
#  betas in first K positions and sigma in last
startvals = np.append(np.random.randn(K), 1.)
omega = torch.tensor(startvals, dtype=dtype, requires_grad=True)
time_start = datetime.now()
res = pyt_minimize(lambda x: ols_loglike_(x, Y, X),
               omega, method='l-bfgs', tol=1e-5, disp=True)
time_pyt_scipy = datetime.now() - time_start
A non-descent direction was encountered.
         Current function value: 71062.832610
         Iterations: 31
         Function evaluations: 42

Then we have all the parameters stored in res.x. Note: we should not compute standard errors from the inverse-hessian approximation that quasi-Newton methods build up along the way (the res.hess_inv returned by scipy.minimize, for example). This isn't a problem, however: since we have autograd with pytorch, we can proceed as before:

hessian_ = torch.autograd.functional.hessian(lambda x: ols_loglike_(x, Y, X), res.x)
se_torch_ = torch.sqrt(torch.linalg.diagonal(torch.linalg.inv(hessian_)))
results['torch/scipy estimate'] = res.x.data.cpu().numpy()
results['torch/scipy std err'] = se_torch_.data.cpu().numpy()

midx = pd.MultiIndex.from_product([['OLS', 'Pytorch', 'Pytorch + Scipy'], ['estimate', 'std err']])
results.columns = midx
results.head()
              OLS                   Pytorch               Pytorch + Scipy
         estimate   std err    estimate   std err    estimate   std err
b0       3.454166  0.004492    3.454164  0.004488    3.454166  0.004488
b1      -0.556897  0.004488   -0.556898  0.004483   -0.556897  0.004483
b2      -0.392518  0.004510   -0.392518  0.004505   -0.392518  0.004505
b3      -1.999855  0.004500   -1.999855  0.004495   -1.999855  0.004495
b4      -2.102200  0.004475   -2.102200  0.004471   -2.102200  0.004471

Why does this matter? In addition to providing a more "MATLAB-like" optimization experience, this approach gives us access to a wide array of optimization algorithms for difficult problems, all while enjoying the benefits of autograd and fast linear algebra on the GPU.
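
For instance, swapping in a different solver is a one-line change. A sketch using another method ('newton-exact' is listed in the pytorch-minimize documentation, though availability may vary by version):

# same problem, different solver: an exact-Newton run via pytorch-minimize
omega0 = torch.tensor(np.append(np.random.randn(K), 1.), dtype=dtype, requires_grad=True)
res_newton = pyt_minimize(lambda w: ols_loglike_(w, Y, X),
                          omega0, method='newton-exact', tol=1e-5)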

Comparisons

Both methods provide good parameter estimates compared to base OLS, although pytorch-minimize has the smaller root mean squared error (in my runs this is consistently the case when using dtype=torch.float64).
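
A sketch of how these numbers can be computed, measuring each estimate vector against the closed-form OLS solution stored in the results table built above:

# RMSE of each estimate vector relative to the closed-form OLS estimates
est_ols = results[('OLS', 'estimate')].values
rmse_t = np.sqrt(np.mean((results[('Pytorch', 'estimate')].values - est_ols)**2))
rmse_tm = np.sqrt(np.mean((results[('Pytorch + Scipy', 'estimate')].values - est_ols)**2))
print('Root mean squared error (pytorch): ', rmse_t)
print('Root mean squared error (pytorch + pytorch-minimize): ', rmse_tm)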

Root mean squared error (pytorch):  4.780654101229938e-07
Root mean squared error (pytorch + pytorch-minimize):  3.5071836544411594e-09
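
For context on the timing table below: the scipy row ran the same problem on the cpu without autograd, so gradients come from finite differences. A minimal sketch of such a baseline (assuming a numpy twin of the negative log-likelihood and the same starting values):

# numpy twin of the negative log-likelihood for the cpu/scipy baseline
def ols_loglike_np(omega):
    beta, sigma = omega[:-1], omega[-1]
    mu = x.dot(beta)
    ll = -(y.shape[0]/2.)*np.log(2.*np.pi*sigma**2) \
         - (1./(2.*sigma**2))*np.sum((y - mu)**2)
    return -ll

time_start = datetime.now()
res_scipy = minimize(ols_loglike_np, startvals, method='L-BFGS-B')
time_scipy = datetime.now() - time_start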

The times for the runs are as follows:

                              Seconds
Scipy (No GPU/Autograd)      5.470564
Pytorch                     13.901280
Pytorch + Pytorch Minimize   0.123910

We see that the pytorch method has the longest runtime. This could potentially be lowered by adjusting the learning rate during gradient descent. However, since pytorch-minimize achieves both the better RMSE and the shortest runtime, why bother?