Using Autograd for Maximum Likelihood Estimation
Thanks to an excellent series of posts on the Python package autograd for automatic differentiation by John Kitchin (e.g. More Auto-differentiation Goodness for Science and Engineering), this post revisits some earlier work on maximum likelihood estimation in Python and investigates the use of auto-differentiation. As pointed out in that article, auto-differentiation "can be thought of as performing a non-standard interpretation of a computer program where this interpretation involves augmenting the standard computation with the calculation of various derivatives."
Auto-differentiation is neither symbolic differentiation nor numerical approximation using finite differences. Instead, it augments your code so that derivatives of your functions come essentially free of charge. In this post, we will be using the autograd package in Python after defining a function in the usual numpy way. Another auto-differentiation choice in Python is the Theano package, which is used by PyMC3, a Bayesian probabilistic programming package that I use in my research and teaching. There are probably other implementations in Python, as auto-differentiation is becoming a must-have in the machine learning field. Implementations also exist in C/C++, R, Matlab, and probably other languages.
The three primary reasons for incorporating auto-differentiation capabilities into your research are:
- In nearly all cases, your code will run faster. For some problems, much faster.
- For difficult problems, your model is likely to converge closer to the true parameter values and may be less sensitive to starting values.
- Your model will provide more accurate calculations for things like gradients and hessians (so your standard errors will be calculated more accurately).
With auto-differentiation, gone are the days of deriving analytical derivatives and programming them into your estimation routine. In this short note, we show a simple example of auto-differentiation, expand on that for maximum likelihood estimation, and show that for problems where likelihood calculations are expensive, or for which many parameters are being estimated, there can be dramatic speed-ups.
A simple example
import warnings
warnings.filterwarnings('ignore')
import autograd.numpy as np
from autograd import grad, jacobian, hessian
from autograd.scipy.stats import norm
from scipy.optimize import minimize
import statsmodels.api as sm
import pandas as pd
import timeit
Here we define a simple function and show the basics of how autograd works. Note that the package is a "drop-in" replacement for numpy that supports many numpy methods and some scipy distributions. Define a simple Cobb-Douglas production function exhibiting constant returns to scale in two variables \(x\) and \(y\) as
\begin{equation} f(x, y) = x^{.8} y^{.2} \end{equation}
This is the function we want to auto-differentiate:
def f(x, y):
    return x**(.8) * y**(.2)
For this problem, calculating the analytical derivatives isn't hard:
\begin{align} \frac{\partial f}{\partial x} &= .8 x^{-.2} y^{.2} \\ \frac{\partial f}{\partial y} &= .2 x^{.8} y^{-.8} \end{align}
but we can now use the grad function to take the first derivative without having to program these equations in Python. For functions of more than one variable, we can ask for each first derivative, as shown below:
# first derivatives - for f(x,y), x is position 0 (default)
# and y is position 1
dfdx = grad(f)
dfdy = grad(f, 1)
# suppose values for x and y are as follows
x, y = 2.0, 3.0
# evaluate the gradient at x, y
derivs = np.array([dfdx(x,y), dfdy(x,y)])
print("Gradient: ", derivs)
Gradient: [0.86757742 0.14459624]
As a sanity check, let's program in the analytical derivatives and compare to what autograd gives us:
# analytical derivatives
a_dfdx = lambda x, y: .8*x**(-.2)*y**(.2)
a_dfdy = lambda x, y: .2*x**(.8)*y**(-.8)
print("Analytical Gradiant: ", np.array([ a_dfdx(x,y), a_dfdy(x,y)]))
Analytical Gradiant: [0.86757742 0.14459624]
That is fantastic. The package autograd gives us analytical derivatives without the pain and suffering of deriving them and programming them into Python!
We can also take derivatives of derivatives to obtain elements of the matrix of second derivatives:
# second derivatives
d2fdxdx = grad(dfdx)
d2fdydy = grad(dfdy, 1)
d2fdxdy = grad(dfdx, 1)
d2fdydx = grad(dfdy)
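As a quick check (not in the original post), we could evaluate a couple of these second derivatives at the same point \((x, y) = (2.0, 3.0)\) and compare them to the analytical expressions \(\partial^2 f / \partial x^2 = -.16 x^{-1.2} y^{.2}\) and \(\partial^2 f / \partial x \partial y = .16 x^{-.2} y^{-.8}\):
# compare autograd's second derivatives with the analytical values
# (this check is not in the original post)
print(d2fdxdx(x, y), -.16 * x**(-1.2) * y**(.2))
print(d2fdxdy(x, y), .16 * x**(-.2) * y**(-.8))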
Maximum Likelihood Estimation and autograd
Next we turn to a more realistic example based on my previous post on maximum likelihood methods in Python linked above. Consider the population regression equation
\begin{equation} \mathbf{y} = \mathbf{x} \pmb{\beta} + \pmb{\epsilon} \end{equation}
We have a sample of \(N=5000\) observations, where \(\pmb{\beta}\), the vector of parameters, is of dimension \(K \times 1\) and includes an intercept.
# number of observations
N = 5000
# number of parameters
K = 10
# true parameter values
beta = 2 * np.random.randn(K)
# true error std deviation
sigma = 2
def datagen(N, beta, sigma):
    """
    Generates data for OLS regression.
    Inputs:
    N: Number of observations
    beta: K x 1 true parameter values
    sigma: std dev of error
    """
    K = beta.shape[0]
    x_ = 10 + 2 * np.random.randn(N, K-1)
    # x is the N x K data matrix with column of ones
    # in the first position for estimating a constant
    x = np.c_[np.ones(N), x_]
    # y is the N x 1 vector of dependent variables
    y = x.dot(beta) + sigma*np.random.randn(N)
    return y, x
y, x = datagen(N, beta, sigma)
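As a quick sanity check (not part of the original post), we can confirm the shapes of the generated data and that the first column of x is the constant:
# y should be length N and x should be N x K with a leading column of ones
print(y.shape, x.shape)
print(np.all(x[:, 0] == 1))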
To use auto-differentiation and autograd, we need to define the negative log-likelihood function (the function we are minimizing). For the linear regression model with normally distributed errors, the log-likelihood is
\begin{equation} \ln L(\pmb{\beta}, \sigma) = -\frac{N}{2} \ln(2 \pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \mathbf{x}_i \pmb{\beta})^2 \end{equation}
and the negative of this is what we code up:
def neg_loglike(theta):
    beta = theta[:-1]
    sigma = theta[-1]
    mu = np.dot(x, beta)
    ll = -N/2 * np.log(2*np.pi*sigma**2) - (1/(2*sigma**2)) * np.sum((y - mu)**2)
    return -1 * ll
Note that the function is defined using only fundamental numpy methods, making it autograd-ready.
However, autograd has support for some scipy distributions, including the normal distribution. The built-in logpdf function is easier to work with and is optimized for fast auto-differentiation, so we will work with this version of our log-likelihood rather than the one above:
def neg_loglike(theta):
    beta = theta[:-1]
    # transform theta[-1]
    # so that sigma > 0
    sigma = np.exp(theta[-1])
    mu = np.dot(x, beta)
    ll = norm.logpdf(y, mu, sigma).sum()
    return -1 * ll
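The two versions return identical values when evaluated at corresponding points, since norm.logpdf expands to exactly the expression used in the first version. Here is a minimal check (not in the original post; neg_loglike_manual is just an illustrative name for the first version, with sigma entering directly):
# first version, kept under a different name for comparison
def neg_loglike_manual(theta):
    beta = theta[:-1]
    sigma = theta[-1]
    mu = np.dot(x, beta)
    ll = -N/2 * np.log(2*np.pi*sigma**2) - (1/(2*sigma**2)) * np.sum((y - mu)**2)
    return -1 * ll

# evaluate both at the true parameters (note the log transform for sigma)
print(np.allclose(neg_loglike(np.append(beta, np.log(sigma))),
                  neg_loglike_manual(np.append(beta, sigma))))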
We can then use the jacobian and hessian functions from autograd to evaluate the gradient and hessian at particular values of our parameters. Here we evaluate the gradient at the true parameter values \(\pmb{\theta} = \begin{bmatrix} \pmb{\beta} & \log\sigma \end{bmatrix}\):
# derivatives of neg_loglike
jacobian_ = jacobian(neg_loglike)
hessian_ = hessian(neg_loglike)
# evaluate the gradient at true theta
# theta = [beta log(sigma)]
theta = np.append(beta,np.log(sigma))
print(jacobian_(theta))
[ -42.89535311 -476.79418971 -490.35924314 -492.17676195 -424.97595445 -557.19375403 -471.38236042 -459.67877535 -601.13561765 -429.86108545 154.86658456]
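The hessian works the same way. As a quick illustration (not in the original post), evaluating hessian_ at the true parameter vector returns the full \((K+1) \times (K+1)\) matrix of second derivatives:
# the hessian evaluated at theta is an 11 x 11 matrix for K = 10
print(hessian_(theta).shape)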
We will use the minimize function from scipy to find the maximum likelihood estimates. Depending on our choice of optimization algorithm, the minimize function can accept a jacobian and sometimes a hessian. For example, the "BFGS" algorithm for unconstrained problems accepts a jacobian, and we will use the jacobian_ defined above using autograd. This code sets up the minimization problem and stores the results in res1:
theta_start = np.append(np.zeros(beta.shape[0]),0.0)
res1 = minimize(neg_loglike, theta_start, method = 'BFGS', \
options={'disp': False}, jac = jacobian_)
print("Convergence Achieved: ", res1.success)
print("Number of Function Evaluations: ", res1.nfev)
Convergence Achieved: True
Number of Function Evaluations: 68
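As noted above, some solvers can also make use of a hessian. Purely as an illustrative sketch (not from the original post; res_ncg is just a placeholder name), a method such as "Newton-CG" accepts both the autograd jacobian and hessian:
# Newton-CG uses both first- and second-derivative information from autograd
res_ncg = minimize(neg_loglike, theta_start, method='Newton-CG',
                   jac=jacobian_, hess=hessian_,
                   options={'disp': False})
print("Convergence Achieved: ", res_ncg.success)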
To examine the results, we'll calculate standard errors using hessian_, also from autograd, and assemble everything in a dataframe for viewing a little later:
# estimated parameters
theta_autograd = res1.x
# for std errors, calculate the information matrix
# using the autograd hessian
information1 = np.transpose(hessian_(theta_autograd))
se1 = np.sqrt(np.diagonal(np.linalg.inv(information1)))
# Put Results in a DataFrame
results_a = pd.DataFrame({'Parameter':theta_autograd,'Std Err':se1})
names = ['beta_'+str(i) for i in range(K)]
names.append('log(Sigma)')
results_a['Variable'] = names
results_a['Model'] = "MLE Autograd"
Notice that at the estimated parameter vector, the gradient is very close to zero for each of our \(K+1\) parameters:
print(jacobian_(theta_autograd))
[2.14390006e-07 1.68197130e-06 1.74329119e-06 1.78678499e-06 1.96383104e-06 2.10519761e-06 2.08673862e-06 2.70411755e-06 2.05865251e-06 2.90452056e-06 3.56323801e-07]
Comparison with OLS and Non-Autograd MLE
For comparison purposes, we estimate the model two additional ways:
- Using statsmodels OLS:

  res_ols = sm.OLS(y,x).fit()
  # Put Results in a DataFrame
  results_o = pd.DataFrame({'Parameter':res_ols.params,
                            'Std Err':res_ols.HC0_se})
  names = ['beta_'+str(i) for i in range(K)]
  results_o['Variable'] = names
  results_o['Model'] = "OLS"
- Using scipy minimize but without the jacobian from autograd. The "BFGS" optimization method will use finite differences to calculate the jacobian when none is given. The code below sets up the minimization problem and omits the autograd jacobian:

  res2 = minimize(neg_loglike, theta_start, method = 'BFGS', \
         options={'disp': False})
  se2 = np.sqrt(np.diag(res2.hess_inv))
  theta2 = res2.x

  For calculating standard errors, we use the approximate inverse hessian (res2.hess_inv) contained in the minimization results object, which is obtained from finite differences. Collect everything in a dataframe for viewing later:

  # Put Results in a DataFrame
  results_ = pd.DataFrame({'Parameter':theta2,'Std Err':se2})
  names = ['beta_'+str(i) for i in range(K)]
  names.append('log(Sigma)')
  results_['Variable'] = names
  results_['Model'] = "MLE"
  print("Convergence Achieved: ", res2.success)
  print("Number of Function Evaluations: ", res2.nfev)
  print("Gradient: ", jacobian_(res2.x))

Convergence Achieved: True
Number of Function Evaluations: 1207
Gradient: [-4.39318368e-05 -1.05574927e-03 -9.83467330e-04 -1.03627801e-03
 -1.07281524e-03 -8.87223842e-04 -1.07660406e-03 -8.25771763e-04
 -1.08148896e-03 -9.20096397e-04 -3.63061100e-05]
To compare results, we assemble everything in a single table and focus our attention first on the parameter estimates. Note that the OLS routine does not report \(\log(\sigma)\) as an estimated parameter, so it appears as "NaN" as expected. There are no visible differences between the estimated parameters obtained from MLE with autograd and from OLS, and only very small discrepancies between the finite-difference MLE results and the other two. All of the methods give us qualitatively equivalent parameter estimates.
# combine results and print
results_a.set_index(['Variable','Model'],inplace=True)
results_o.set_index(['Variable','Model'],inplace=True)
results_.set_index(['Variable','Model'],inplace=True)
df_ = results_o.append(results_a).append(results_).unstack()
print("Parameters")
print(df_['Parameter'].head(K+1))
Parameters
Model            MLE  MLE Autograd       OLS
Variable
beta_0     -1.162173     -1.162183 -1.162183
beta_1     -1.944660     -1.944660 -1.944660
beta_2     -1.252349     -1.252349 -1.252349
beta_3     -0.320416     -0.320416 -0.320416
beta_4      0.316880      0.316880  0.316880
beta_5      1.724172      1.724172  1.724172
beta_6     -2.385803     -2.385803 -2.385803
beta_7      1.253359      1.253359  1.253359
beta_8      1.179826      1.179826  1.179826
beta_9      2.987553      2.987553  2.987553
log(Sigma)  0.676046      0.676046       NaN
However, we do see sizable differences in the standard errors, with the MLE method (not using autograd) showing noticeably larger standard errors. The standard errors obtained via autograd are qualitatively the same as the OLS ones. This is consistent with what is known about optimization and the propagation of rounding error (see the warning at the bottom of this page). When possible, we should always use programmed analytical or auto-differentiated derivatives.
print("\nStandard Errors")
print(df_['Std Err'].head(K+1))
Standard Errors
Model            MLE  MLE Autograd       OLS
Variable
beta_0      4.382045      0.416128  0.421948
beta_1      0.068479      0.013951  0.014040
beta_2      0.041857      0.013816  0.013310
beta_3      0.025354      0.013937  0.013947
beta_4      0.069818      0.014011  0.013919
beta_5      0.058549      0.013897  0.013764
beta_6      0.078051      0.013972  0.013951
beta_7      0.058174      0.014054  0.014068
beta_8      0.028574      0.013823  0.013846
beta_9      0.024810      0.013994  0.014277
log(Sigma)  0.010053      0.010000       NaN
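To see where the difference comes from, a small sketch (not in the original post; se2_exact is just an illustrative name) recomputes the standard errors for the finite-difference estimates theta2 using the autograd hessian instead of the BFGS approximation stored in res2.hess_inv. Since the parameter estimates themselves are nearly identical across methods, we would expect these to line up closely with the MLE Autograd column:
# recompute standard errors for theta2 using the autograd hessian
# rather than the BFGS inverse-hessian approximation
se2_exact = np.sqrt(np.diagonal(np.linalg.inv(hessian_(theta2))))
print(se2_exact)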
Speed Up
Another big benefit of using autograd is that gradients don't need to be numerically approximated using finite differences, so the number of function evaluations will be much lower. In the example above with 11 parameters, estimating the model with autograd required 68 function evaluations, whereas using finite differences required 1207. If the likelihood calculation is expensive (ours isn't for this problem), these additional evaluations can really slow down estimation. To see how this translates into runtimes, we loop the estimation over \(K = \begin{bmatrix} 10 & 50 & 100 \end{bmatrix}\) and examine execution times both with and without autograd.
for k in [10, 50, 100]:
    beta = np.random.randn(k)
    theta_start = np.append(beta, 1.0)
    y, x = datagen(N, beta, 1)
    print("Time for " + str(k) + " Parameters")
    %timeit minimize(neg_loglike, theta_start, method = 'BFGS', \
            options={'disp': False}, jac = jacobian_)
    %timeit minimize(neg_loglike, theta_start, method = 'BFGS', \
            options={'disp': False})
Putting the various run times together, the following table shows average runtimes in seconds:
df = pd.DataFrame(.001*np.c_[[214, 573, 1190],[961, 12000, 46600]],columns=['With Autograd','Without Autograd'],index=[10, 50, 100])
df['Speed Up'] = df['Without Autograd']/df['With Autograd']
df.index.name = "K"
df.head()
     With Autograd  Without Autograd   Speed Up
K
10           0.214             0.961   4.490654
50           0.573            12.000  20.942408
100          1.190            46.600  39.159664
As the number of parameters \(K\) (and hence the number of function evaluations) increases, the execution time without autograd grows rapidly. For the most expensive problem considered here, maximum likelihood estimation with autograd was nearly 40 times faster. It should be noted that even if we compare the "BFGS" results using the jacobian from autograd to gradient-free methods like "Nelder-Mead" (results not reported here, but see the sketch below), we still see an approximate 10x speed-up from autograd that remained roughly constant even as \(K\) increased.
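For completeness, here is a sketch of how that gradient-free comparison might be set up (timings not reported; at this point in the notebook theta_start and the data correspond to the last \(K = 100\) run):
# time a derivative-free solver on the same problem for comparison
%timeit minimize(neg_loglike, theta_start, method='Nelder-Mead', \
        options={'disp': False})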
Summary
We should all be using autograd or some other method of auto-differentiation. It is probably going to be faster, definitely more accurate, and it only requires a little bit of additional coding.