Using Autograd for Maximum Likelihood Estimation

Thanks to an excellent series of posts on the python package autograd for automatic differentiation by John Kitchin (e.g. More Auto-differentiation Goodness for Science and Engineering), this post revisits some earlier work on maximum likelihood estimation in Python and investigates the use of auto-differentiation. As pointed out in this article, auto-differentiation "can be thought of as performing a non-standard interpretation of a computer program where this interpretation involves augmenting the standard computation with the calculation of various derivatives."

Auto-differentiation is neither symbolic differentiation nor a numerical approximation using finite difference methods. Instead, it augments your code so that derivatives of your functions are provided essentially free of charge. In this post, we will be using the autograd package in Python after defining a function in the usual numpy way. Another auto-differentiation choice in Python is the Theano package, which is used by PyMC3, a Bayesian probabilistic programming package that I use in my research and teaching. There are probably other implementations in Python, as it is becoming a must-have in the machine learning field. Implementations also exist in C/C++, R, Matlab, and probably others.

The three primary reasons for incorporating auto-differentiation capabilities into your research are:

  1. In nearly all cases, your code will run faster. For some problems, much faster.
  2. For difficult problems, your model is likely to converge closer to the true parameter values and may be less sensitive to starting values.
  3. Your model will provide more accurate calculations for things like gradients and Hessians (so your standard errors will be more accurately calculated).

With auto-differentiation, gone are the days of deriving analytical derivatives and programming them into your estimation routine. In this short note, we show a simple example of auto-differentiation, extend it to maximum likelihood estimation, and show that for problems where likelihood calculations are expensive, or where many parameters are being estimated, there can be dramatic speed-ups.

A simple example

import warnings
warnings.filterwarnings('ignore')
import autograd.numpy as np
from autograd import grad, jacobian, hessian
from autograd.scipy.stats import norm
from scipy.optimize import minimize
import statsmodels.api as sm
import pandas as pd
import timeit

Here we define a simple function and show the basics of how autograd works. Note that the package is a "drop-in" replacement for numpy that supports many numpy methods and some scipy distributions. Define a simple Cobb-Douglas production function exhibiting constant returns to scale in two variables \(x\) and \(y\) as:

\begin{equation} f(x,y) = x^{.8} y^{.2} \end{equation}

Define the function we want to auto-differentiate:

def f(x, y):
    return x**(.8) * y**(.2)

For this problem, calculating the analytical derivatives isn't hard:

\begin{align} \frac{\partial f}{\partial x} &= .8 x^{-.2} y^{.2} \\ \frac{\partial f}{\partial y} &= .2 x^{.8} y^{-.8} \end{align}

but we can now use the grad function to take the first derivative without having to program these equations in Python. For functions with more than one variable, we can ask for each first derivative as shown below:

# first derivatives - for f(x,y), x is position 0 (default)  
#                         and y is position 1
dfdx = grad(f)
dfdy = grad(f, 1)

# suppose values for x and y are as follows
x, y = 2.0, 3.0

# evaluate the gradient at x, y
derivs = np.array([dfdx(x,y), dfdy(x,y)])
print("Gradient: ", derivs)
Gradient:  [0.86757742 0.14459624]

As a sanity check, let's program in the analytical derivatives and compare to what autograd gives us:

# analytical derivatives
a_dfdx = lambda x, y: .8*x**(-.2)*y**(.2)
a_dfdy = lambda x, y: .2*x**(.8)*y**(-.8)

print("Analytical Gradiant: ", np.array([ a_dfdx(x,y), a_dfdy(x,y)])) 
Analytical Gradiant:  [0.86757742 0.14459624]

That is fantastic. The package autograd gives us analytical derivatives without the pain and suffering of deriving them and programming them into Python!

We can take derivatives of derivatives (to uncover elements of the matrix of second derivatives):

# second derivatives
d2fdxdx = grad(dfdx)
d2fdydy = grad(dfdy, 1)
d2fdxdy = grad(dfdx, 1)
d2fdydx = grad(dfdy)
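
As a quick sanity check, we can evaluate these at the same point used above and assemble them into the full matrix of second derivatives (a small sketch; the variable name H is introduced here, and the two cross-partials should agree by symmetry):

# evaluate the second derivatives at (x, y) = (2.0, 3.0)
# and stack them into the 2 x 2 matrix of second derivatives
H = np.array([[d2fdxdx(x, y), d2fdxdy(x, y)],
              [d2fdydx(x, y), d2fdydy(x, y)]])
print("Matrix of second derivatives:\n", H)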

Maximum Likelihood Estimation and autograd

Next we turn to a more realistic example based on my previous post on maximum likelihood methods in Python linked above. Consider the population regression equation

\begin{equation} \mathbf{y = x \pmb{\beta} + \epsilon} \end{equation}

We have a sample of \(N=5000\) observations, and \(\pmb{\beta}\), the \(K \times 1\) vector of parameters, includes an intercept.

# number of observations
N = 5000
# number of parameters
K = 10
# true parameter values
beta = 2 * np.random.randn(K)
# true error std deviation
sigma =  2

def datagen(N, beta, sigma):
    """
    Generates data for OLS regression.
    Inputs:
    N: Number of observations
    beta: K x 1 true parameter values
    sigma: std dev of error
    """
    K = beta.shape[0]
    x_ = 10 + 2 * np.random.randn(N,K-1)
    # x is the N x K data matrix with column of ones
    #   in the first position for estimating a constant
    x = np.c_[np.ones(N),x_]
    # y is the N x 1 vector of dependent variables
    y = x.dot(beta) + sigma*np.random.randn(N)
    return y, x

y, x  = datagen(N, beta, sigma)

To use auto-differentiation with autograd, we need to define the negative log-likelihood function (the function we are minimizing). For the linear regression model with normal errors, the negative log-likelihood is
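
\begin{equation} -\ln L(\pmb{\beta}, \sigma) = \frac{N}{2} \ln \left( 2 \pi \sigma^2 \right) + \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - \mathbf{x}_i \pmb{\beta} \right)^2 \end{equation}

which we code directly as: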

def neg_loglike(theta):
    beta = theta[:-1]
    sigma = theta[-1]
    mu = np.dot(x,beta)
    ll = -N/2 * np.log(2*np.pi*sigma**2) - (1/(2*sigma**2)) * np.sum((y - mu)**2)
    return -1 * ll

Note that the function is defined using only fundamental numpy methods, making it autograd-ready.
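
As a quick illustration (a minimal sketch; the name score is introduced here), we can already differentiate this version directly and evaluate it at the true values generated above:

# the score (gradient of the negative log-likelihood) at the true values
score = grad(neg_loglike)
print(score(np.append(beta, sigma)))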

However, autograd also supports some scipy distributions, including the normal distribution. The built-in logpdf function is easier to work with and is optimized for fast auto-differentiation, so we will use this version of our negative log-likelihood rather than the one above:

def neg_loglike(theta):
    beta = theta[:-1]
    # transform theta[-1]
    # so that sigma > 0
    sigma = np.exp(theta[-1])
    mu = np.dot(x,beta)
    ll = norm.logpdf(y,mu,sigma).sum()
    return -1 * ll

We can then use the jacobian and hessian functions from autograd to evaluate the gradient and Hessian at particular values of our parameters. Here we evaluate the gradient at the true parameter values \(\pmb{\theta} = \begin{bmatrix} \pmb{\beta} & \log(\sigma) \end{bmatrix}\)

# derivatives of neg_loglike
jacobian_  = jacobian(neg_loglike)
hessian_ = hessian(neg_loglike)

# evaluate the gradient at true theta
# theta = [beta log(sigma)]
theta = np.append(beta,np.log(sigma))
print(jacobian_(theta))
[ -42.89535311 -476.79418971 -490.35924314 -492.17676195 -424.97595445
 -557.19375403 -471.38236042 -459.67877535 -601.13561765 -429.86108545
  154.86658456]

We will use the minimize function from scipy for finding the maximum likelihood estimates. Depending on our choice of optimization algorithm, the minimize function can accept a jacobian and sometimes a hessian. For example, the "BFGS" algorithm for unconstrained problems accepts a jacobian, and we will use the jacobian_ function defined above using autograd. This code sets up the minimization problem and stores the results in res1:

theta_start = np.append(np.zeros(beta.shape[0]),0.0)
res1 = minimize(neg_loglike, theta_start, method = 'BFGS', \
               options={'disp': False}, jac = jacobian_)
print("Convergence Achieved: ", res1.success)
print("Number of Function Evaluations: ", res1.nfev)
Convergence Achieved:  True
Number of Function Evaluations:  68

To examine the results, we'll calculate standard errors using hessian_, also from autograd, and assemble everything in a dataframe for viewing a little later:

# estimated parameters
theta_autograd = res1.x

# for std errors, calculate the information matrix
# using the autograd hessian
information1 = np.transpose(hessian_(theta_autograd))
se1 = np.sqrt(np.diagonal(np.linalg.inv(information1))) 

# Put Results in a DataFrame
results_a = pd.DataFrame({'Parameter':theta_autograd,'Std Err':se1})
names = ['beta_'+str(i) for i in range(K)]
names.append('log(Sigma)')
results_a['Variable'] = names
results_a['Model'] = "MLE Autograd"

Notice that at the estimated parameter vector, the gradient is very close to zero for our \(K+1\) parameters:

print(jacobian_(theta_autograd))
[2.14390006e-07 1.68197130e-06 1.74329119e-06 1.78678499e-06
 1.96383104e-06 2.10519761e-06 2.08673862e-06 2.70411755e-06
 2.05865251e-06 2.90452056e-06 3.56323801e-07]

Comparison with OLS and Non-Autograd MLE

For comparison purposes, we estimate the model two additional ways:

  1. Using statsmodels OLS:

       res_ols = sm.OLS(y,x).fit()
    
       # Put Results in a DataFrame
       results_o = pd.DataFrame({'Parameter':res_ols.params,
                                 'Std Err':res_ols.HC0_se})
       names = ['beta_'+str(i) for i in range(K)]
       results_o['Variable'] = names
       results_o['Model'] = "OLS"
    
  2. Using scipy minimize but without the jacobian from autograd. The "BFGS" optimization method will use finite differences for calculating the jacobian when no jacobian is given. The code below sets up the minimization problem and omits the autograd jacobian.

       res2 = minimize(neg_loglike, theta_start, method = 'BFGS', \
                      options={'disp': False}) 
       se2 = np.sqrt(np.diag(res2.hess_inv))
       theta2 = res2.x
    

    For calculating standard errors, we will use the inverse hessian (hess_inv) contained in the minimization results object res2, which is approximated from finite-difference gradient information. Collect everything in a dataframe for viewing later:

       # Put Results in a DataFrame
       results_ = pd.DataFrame({'Parameter':theta2,'Std Err':se2})
       names = ['beta_'+str(i) for i in range(K)]
       names.append('log(Sigma)')
       results_['Variable'] = names
       results_['Model'] = "MLE"
    
       print("Convergence Achieved: ", res1.success)
       print("Number of Function Iterations: ", res2.nfev)
       print("Gradiant: ", jacobian_(res2.x))
    
    Convergence Achieved:  True
    Number of Function Evaluations:  1207
    Gradient:  [-4.39318368e-05 -1.05574927e-03 -9.83467330e-04 -1.03627801e-03
     -1.07281524e-03 -8.87223842e-04 -1.07660406e-03 -8.25771763e-04
     -1.08148896e-03 -9.20096397e-04 -3.63061100e-05]
    
    

For comparing results, we assemble everything in a single table and focus our attention first on parameter differences. First note that the OLS estimator does not include \(\sigma\) as an estimated parameter, so it is reported as "NaN" as expected. There are no visible differences between the estimated parameters obtained using MLE Autograd and OLS. There are some very small discrepancies between the MLE results obtained from finite differences and either OLS or MLE with Autograd. All of the methods give us qualitatively equivalent parameter estimates.

# combine results and print
results_a.set_index(['Variable','Model'],inplace=True)
results_o.set_index(['Variable','Model'],inplace=True)
results_.set_index(['Variable','Model'],inplace=True)
df_ = pd.concat([results_o, results_a, results_]).unstack()

print("Parameters")
print(df_['Parameter'].head(K+1))
Parameters
Model            MLE  MLE Autograd       OLS
Variable                                    
beta_0     -1.162173     -1.162183 -1.162183
beta_1     -1.944660     -1.944660 -1.944660
beta_2     -1.252349     -1.252349 -1.252349
beta_3     -0.320416     -0.320416 -0.320416
beta_4      0.316880      0.316880  0.316880
beta_5      1.724172      1.724172  1.724172
beta_6     -2.385803     -2.385803 -2.385803
beta_7      1.253359      1.253359  1.253359
beta_8      1.179826      1.179826  1.179826
beta_9      2.987553      2.987553  2.987553
log(Sigma)  0.676046      0.676046       NaN

However, we do see sizable differences in the standard errors, with the MLE method (not using autograd) showing considerably larger standard errors. Standard errors obtained via autograd are qualitatively the same as OLS. This is consistent with what is known about optimization and the propagation of rounding error (see the warning at the bottom of this page). When possible, we should always use programmed analytical or auto-differentiated derivatives.

print("\nStandard Errors")
print(df_['Std Err'].head(K+1))

Standard Errors
Model            MLE  MLE Autograd       OLS
Variable                                    
beta_0      4.382045      0.416128  0.421948
beta_1      0.068479      0.013951  0.014040
beta_2      0.041857      0.013816  0.013310
beta_3      0.025354      0.013937  0.013947
beta_4      0.069818      0.014011  0.013919
beta_5      0.058549      0.013897  0.013764
beta_6      0.078051      0.013972  0.013951
beta_7      0.058174      0.014054  0.014068
beta_8      0.028574      0.013823  0.013846
beta_9      0.024810      0.013994  0.014277
log(Sigma)  0.010053      0.010000       NaN
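
One way to see that this discrepancy comes from the approximate BFGS inverse hessian rather than from the parameter estimates themselves is to recompute the standard errors at the finite-difference solution res2.x using the autograd hessian. A minimal sketch (information2 and se2_check are names introduced here; the results are not reproduced in this post):

# recompute standard errors at the finite-difference solution
# using the autograd hessian instead of res2.hess_inv
information2 = hessian_(res2.x)
se2_check = np.sqrt(np.diagonal(np.linalg.inv(information2)))
print(se2_check)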

Speed Up

Another big benefit of using autograd is that gradients don't need to be numerically approximated using finite differences. Therefore, the number of function evaluations will be much lower. In the example above with 11 parameters, estimating the model with autograd required 68 function evaluations, whereas using finite differences required 1207. If the likelihood calculation is expensive (ours isn't for this problem), then these additional evaluations can really slow down estimation. To see how this translates into runtimes, we loop the estimation for \(K = \begin{bmatrix} 10&50&100\end{bmatrix}\) and examine execution times both with and without autograd.

for k in [10, 50, 100]:
     beta = np.random.randn(k)
     theta_start = np.append(beta,1.0)
     y, x = datagen(N, beta, 1)
     print("Time for " + str(k) +" Parameters")
     %timeit minimize(neg_loglike, theta_start, method = 'BFGS', \
                options={'disp': False}, jac = jacobian_)
     %timeit minimize(neg_loglike, theta_start, method = 'BFGS', \
                options={'disp': False})    
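
The %timeit calls above rely on IPython magic; outside of a notebook, the timeit module imported earlier can be used instead. A minimal sketch (n_reps is an arbitrary choice, and the timings use whatever K was set last in the loop above):

# average runtime (seconds) over n_reps runs, with and without the
# autograd jacobian
n_reps = 5
t_with = timeit.timeit(
    lambda: minimize(neg_loglike, theta_start, method='BFGS',
                     options={'disp': False}, jac=jacobian_),
    number=n_reps) / n_reps
t_without = timeit.timeit(
    lambda: minimize(neg_loglike, theta_start, method='BFGS',
                     options={'disp': False}),
    number=n_reps) / n_reps
print(t_with, t_without)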

Putting the various run times together, the following table shows average runtimes in seconds:

df = pd.DataFrame(.001*np.c_[[214, 573, 1190], [961, 12000, 46600]],
                  columns=['With Autograd', 'Without Autograd'],
                  index=[10, 50, 100])
df['Speed Up'] = df['Without Autograd']/df['With Autograd']
df.index.name = "K"
df.head()
     With Autograd  Without Autograd   Speed Up
K                                              
10           0.214             0.961   4.490654
50           0.573            12.000  20.942408
100          1.190            46.600  39.159664

As the number of function evaluations increases (more \(K\)), the execution time without autograd really increases. For the most expensive problem considered here, maximum likelihood estimation with autograd was nearly 40 times faster. It should be noted that even if we compare the "BFGS" results using the jacobian from autograd to gradient-free methods like "Nelder-Mead" (results not reported here), we still see an approximate 10x speed-up using autograd that remained roughly constant even as \(K\) increased.
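
For reference, a minimal sketch of how that gradient-free comparison could be set up (res_nm is a name introduced here; those results are not reproduced in this post):

# gradient-free estimation for comparison with the BFGS + autograd runs
res_nm = minimize(neg_loglike, theta_start, method='Nelder-Mead',
                  options={'disp': False})
print("Convergence Achieved: ", res_nm.success)
print("Number of Function Evaluations: ", res_nm.nfev)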

Summary

We should all be using autograd or some other method of auto-differentiation. It is probably going to be faster, definitely more accurate, and only requires a little bit of additional coding.