#+BEGIN_COMMENT
.. title: Using Autograd for Maximum Likelihood Estimation
.. slug: mle-autograd
.. date: 2018-03-06 09:30:50 UTC+01:00
.. tags: autograd, ipython, maximum likelihood
.. has_math: yes
.. category:
.. link:
.. description:
.. type: text
#+END_COMMENT

Thanks to an excellent series of posts on the Python package =autograd= for automatic differentiation by John Kitchin (e.g. [[http://kitchingroup.cheme.cmu.edu/blog/2017/11/22/More-auto-differentiation-goodness-for-science-and-engineering/][More Auto-differentiation Goodness for Science and Engineering]]), this post revisits some earlier work on [[http://rlhick.people.wm.edu/posts/estimating-custom-mle.html][maximum likelihood estimation in Python]] and investigates the use of auto-differentiation.

As pointed out in [[https://arxiv.org/pdf/1502.05767.pdf][this article]], auto-differentiation "can be thought of as performing a non-standard interpretation of a computer program where this interpretation involves augmenting the standard computation with the calculation of various derivatives." Auto-differentiation is neither symbolic differentiation nor a numerical approximation using finite difference methods. What auto-differentiation provides is code augmentation: code for the derivatives of your functions is generated for you, essentially free of charge. In this post, we will be using the =autograd= package in Python after defining a function in the usual =numpy= way. In Python, another auto-differentiation choice is the Theano package, which is used by PyMC3, a Bayesian probabilistic programming package that I use in my research and teaching. There are probably other implementations in Python, as auto-differentiation is becoming a must-have in the machine learning field. Implementations also exist in C/C++, R, Matlab, and probably others.

The three primary reasons for incorporating auto-differentiation capabilities into your research are

1. In nearly all cases, your code will run faster. For some problems, much faster.
2. For difficult problems, your model is likely to converge closer to the true parameter values and may be less sensitive to starting values.
3. Your model will provide more accurate calculations for things like gradients and hessians (so your standard errors will be more accurately calculated).

With auto-differentiation, gone are the days of deriving analytical derivatives and programming them into your estimation routine. In this short note, we show a simple example of auto-differentiation, expand on that for maximum likelihood estimation, and show that for problems where likelihood calculations are expensive, or for which there are many parameters being estimated, there can be dramatic speed-ups.

#+HTML:

* A simple example

#+BEGIN_SRC ipython :exports code :eval never-export :restart
import warnings
warnings.filterwarnings('ignore')
import autograd.numpy as np
from autograd import grad, jacobian, hessian
from autograd.scipy.stats import norm
from scipy.optimize import minimize
import statsmodels.api as sm
import pandas as pd
import timeit
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[1]:
:END:

Here we define a simple function and show the basics of how =autograd= works. Note that the package is a "drop-in" replacement for =numpy= that supports many numpy methods and some =scipy= distributions.
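Before turning to that function, here is a quick check of the drop-in behavior. This is only a minimal sketch: =np.tanh= is an arbitrary choice, unrelated to the example that follows, and the comparison value is simply the known analytical derivative $1 - \tanh^2(z)$.

#+BEGIN_SRC ipython :exports code :eval never-export
# illustrative sketch: differentiate a plain numpy expression with
# autograd and compare to the analytical derivative 1 - tanh(z)**2
def g(z):
    return np.tanh(z)

dgdz = grad(g)
print(dgdz(0.5), 1.0 - np.tanh(0.5)**2)
#+END_SRC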
Define a simple Cobb-Douglas production function exhibiting constant returns to scale in two variables $x$ and $y$:

\begin{equation}
f(x,y) = x^{.8} y^{.2}
\end{equation}

In Python, the function we want to auto-differentiate is:

#+BEGIN_SRC ipython :results output :exports both :eval never-export
def f(x, y):
    return x**(.8) * y**(.2)
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[2]:
:END:

For this problem, calculating the analytical derivatives isn't hard:

\begin{align}
\frac{\partial f}{\partial x} &= .8 x^{-.2} y^{.2} \\
\frac{\partial f}{\partial y} &= .2 x^{.8} y^{-.8}
\end{align}

but we can now use the =grad= function to take the first derivative without having to program these equations in Python. For functions of more than one variable, we can ask for each first derivative as shown below:

#+BEGIN_SRC ipython :results output :exports both :eval never-export
# first derivatives - for f(x,y), x is position 0 (default)
#   and y is position 1
dfdx = grad(f)
dfdy = grad(f, 1)

# suppose values for x and y are as follows
x, y = 2.0, 3.0

# evaluate the gradient at x, y
derivs = np.array([dfdx(x,y), dfdy(x,y)])
print("Gradient: ", derivs)
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[4]:
# output
: Gradient: [0.86757742 0.14459624]
:
:END:

As a sanity check, let's program in the analytical derivatives and compare to what =autograd= gives us:

#+BEGIN_SRC ipython :results output :exports both :eval never-export
# analytical derivatives
a_dfdx = lambda x, y: .8*x**(-.2)*y**(.2)
a_dfdy = lambda x, y: .2*x**(.8)*y**(-.8)

print("Analytical Gradient: ", np.array([ a_dfdx(x,y), a_dfdy(x,y)]))
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[5]:
# output
: Analytical Gradient: [0.86757742 0.14459624]
:
:END:

That is fantastic. The package =autograd= gives us analytical derivatives without the pain and suffering of deriving them and programming them into Python! We can take derivatives of derivatives (to recover elements of the matrix of second derivatives):

#+BEGIN_SRC ipython :results output :exports both :eval never-export
# second derivatives
d2fdxdx = grad(dfdx)
d2fdydy = grad(dfdy, 1)
d2fdxdy = grad(dfdx, 1)
d2fdydx = grad(dfdy)
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[6]:
:END:

* Maximum Likelihood Estimation and =autograd=

Next we turn to a more realistic example based on my previous post on maximum likelihood methods in Python linked above. Consider the population regression equation

\begin{equation}
\mathbf{y} = \mathbf{x} \pmb{\beta} + \pmb{\epsilon}
\end{equation}

We have a sample of $N=5000$ observations, and $\pmb{\beta}$ is the $K \times 1$ vector of parameters, which includes an intercept.

#+BEGIN_SRC ipython :exports both :eval never-export
# number of observations
N = 5000
# number of parameters
K = 10
# true parameter values
beta = 2 * np.random.randn(K)
# true error std deviation
sigma = 2

def datagen(N, beta, sigma):
    """
    Generates data for OLS regression.
    Inputs:
    N: Number of observations
    beta: K x 1 true parameter values
    sigma: std dev of error
    """
    K = beta.shape[0]
    x_ = 10 + 2 * np.random.randn(N,K-1)
    # x is the N x K data matrix with column of ones
    #   in the first position for estimating a constant
    x = np.c_[np.ones(N),x_]
    # y is the N x 1 vector of dependent variables
    y = x.dot(beta) + sigma*np.random.randn(N)
    return y, x

y, x = datagen(N, beta, sigma)
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[18]:
:END:

To use auto-differentiation with =autograd=, we need to define the negative log-likelihood function (the function we are minimizing).
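For reference, this is just the familiar log-likelihood for a linear model with normal errors, which the code below implements:

\begin{equation}
\ln L(\pmb{\beta}, \sigma) = -\frac{N}{2} \ln (2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{x}_i \pmb{\beta})^2
\end{equation}

and maximizing it is equivalent to minimizing its negative.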
In Python, the negative log-likelihood function for this model is:

#+BEGIN_SRC ipython :exports both :eval never-export
def neg_loglike(theta):
    beta = theta[:-1]
    sigma = theta[-1]
    mu = np.dot(x,beta)
    ll = -N/2 * np.log(2*np.pi*sigma**2) - (1/(2*sigma**2)) * np.sum((y - mu)**2)
    return -1 * ll
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[246]:
:END:

Note that the function is defined using only fundamental numpy methods, making it =autograd= ready. However, =autograd= has [[https://github.com/HIPS/autograd/tree/master/autograd/scipy/stats][support for some =scipy= distributions]], including the normal distribution. The built-in =logpdf= function is easier to work with and is optimized for fast auto-differentiation, so we will use this version of our log-likelihood rather than the one above:

#+BEGIN_SRC ipython :exports both :eval never-export
def neg_loglike(theta):
    beta = theta[:-1]
    # transform theta[-1]
    # so that sigma > 0
    sigma = np.exp(theta[-1])
    mu = np.dot(x,beta)
    ll = norm.logpdf(y,mu,sigma).sum()
    return -1 * ll
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[3]:
:END:

We can then use the =jacobian= and =hessian= functions from =autograd= to evaluate the gradient and hessian at particular values of our parameters. Here we evaluate the gradient at the true parameter values $\pmb{\theta} = \begin{bmatrix} \pmb{\beta} & \log\sigma \end{bmatrix}$:

#+BEGIN_SRC ipython :exports both :eval never-export
# derivatives of neg_loglike
jacobian_ = jacobian(neg_loglike)
hessian_ = hessian(neg_loglike)

# evaluate the gradient at true theta
# theta = [beta log(sigma)]
theta = np.append(beta,np.log(sigma))
print(jacobian_(theta))
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[19]:
# output
: [ -42.89535311 -476.79418971 -490.35924314 -492.17676195 -424.97595445
:  -557.19375403 -471.38236042 -459.67877535 -601.13561765 -429.86108545
:   154.86658456]
:
:END:

We will use the =minimize= function from =scipy= for finding the maximum likelihood estimates. Depending on our choice of optimization algorithm, the =minimize= function can accept a jacobian and sometimes a hessian. For example, the "BFGS" algorithm for unconstrained problems accepts a jacobian, and we will supply the =jacobian_= defined above using =autograd=.
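As an aside, a Newton-type method such as "Newton-CG" can consume =hessian_= as well. The sketch below only illustrates the call; the names =theta0= and =res_newton= are placeholders, and the zero starting values simply mirror those used in the next block:

#+BEGIN_SRC ipython :exports code :eval never-export
# illustrative sketch: pass both the autograd jacobian and hessian
# of the negative log-likelihood to a Newton-type solver
theta0 = np.append(np.zeros(K), 0.0)
res_newton = minimize(neg_loglike, theta0, method='Newton-CG',
                      jac=jacobian_, hess=hessian_,
                      options={'disp': False})
print("Convergence Achieved: ", res_newton.success)
#+END_SRC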
The code below sets up the BFGS minimization problem and stores the results in =res1=:

#+BEGIN_SRC ipython :results output :exports both :eval never-export :display text/plain
theta_start = np.append(np.zeros(beta.shape[0]),0.0)
res1 = minimize(neg_loglike, theta_start, method = 'BFGS', \
                options={'disp': False}, jac = jacobian_)
print("Convergence Achieved: ", res1.success)
print("Number of Function Evaluations: ", res1.nfev)
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[20]:
# output
: Convergence Achieved: True
: Number of Function Evaluations: 68
:
:END:

To examine the results, we'll calculate standard errors using =hessian_=, also from =autograd=, and assemble everything in a dataframe for viewing a little later:

#+BEGIN_SRC ipython :results output :exports both :eval never-export :display text/plain
# estimated parameters
theta_autograd = res1.x

# for std errors, calculate the information matrix
# using the autograd hessian
information1 = np.transpose(hessian_(theta_autograd))
se1 = np.sqrt(np.diagonal(np.linalg.inv(information1)))

# Put Results in a DataFrame
results_a = pd.DataFrame({'Parameter':theta_autograd,'Std Err':se1})
names = ['beta_'+str(i) for i in range(K)]
names.append('log(Sigma)')
results_a['Variable'] = names
results_a['Model'] = "MLE Autograd"
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[21]:
:END:

Notice that at the estimated parameter vector, the gradient is very close to zero for our $K+1$ parameters:

#+BEGIN_SRC ipython :results output :exports both :eval never-export :display text/plain
print(jacobian_(theta_autograd))
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[22]:
# output
: [2.14390006e-07 1.68197130e-06 1.74329119e-06 1.78678499e-06
:  1.96383104e-06 2.10519761e-06 2.08673862e-06 2.70411755e-06
:  2.05865251e-06 2.90452056e-06 3.56323801e-07]
:
:END:

** Comparison with OLS and Non-Autograd MLE

For comparison purposes, we estimate the model two additional ways:

1. Using =statsmodels= OLS:

#+BEGIN_SRC ipython :display text/plain :eval never-export
res_ols = sm.OLS(y,x).fit()

# Put Results in a DataFrame
results_o = pd.DataFrame({'Parameter':res_ols.params,
                          'Std Err':res_ols.HC0_se})
names = ['beta_'+str(i) for i in range(K)]
results_o['Variable'] = names
results_o['Model'] = "OLS"
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[23]:
:END:

2. Using =scipy= =minimize= but without the jacobian from =autograd=. The "BFGS" optimization method will use finite differences for calculating the jacobian when no jacobian is given. The code below sets up the minimization problem and *omits* the =autograd= jacobian.

#+BEGIN_SRC ipython :results output :exports both :eval never-export
res2 = minimize(neg_loglike, theta_start, method = 'BFGS', \
                options={'disp': False})
se2 = np.sqrt(np.diag(res2.hess_inv))
theta2 = res2.x
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[24]:
:END:

For calculating standard errors, we will use the inverse hessian stored in the results object =res2=, which BFGS builds up from finite-difference gradients.
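If we wanted second derivatives free of finite-difference error at this second solution as well, we could evaluate the =autograd= hessian at =res2.x=, just as we did for =res1=. The short sketch below is illustrative only; the names =information2= and =se2_exact= are placeholders and are not used in what follows:

#+BEGIN_SRC ipython :exports code :eval never-export
# illustrative sketch: exact (autograd) hessian evaluated at the
# finite-difference solution, inverted to obtain standard errors
information2 = hessian_(res2.x)
se2_exact = np.sqrt(np.diagonal(np.linalg.inv(information2)))
#+END_SRC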
Collect everything in a dataframe for viewing later:

#+BEGIN_SRC ipython :results output :exports both :eval never-export
# Put Results in a DataFrame
results_ = pd.DataFrame({'Parameter':theta2,'Std Err':se2})
names = ['beta_'+str(i) for i in range(K)]
names.append('log(Sigma)')
results_['Variable'] = names
results_['Model'] = "MLE"

print("Convergence Achieved: ", res2.success)
print("Number of Function Evaluations: ", res2.nfev)
print("Gradient: ", jacobian_(res2.x))
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[25]:
# output
: Convergence Achieved: True
: Number of Function Evaluations: 1207
: Gradient: [-4.39318368e-05 -1.05574927e-03 -9.83467330e-04 -1.03627801e-03
:  -1.07281524e-03 -8.87223842e-04 -1.07660406e-03 -8.25771763e-04
:  -1.08148896e-03 -9.20096397e-04 -3.63061100e-05]
:
:END:

For comparing results, we assemble everything in a single table and focus our attention first on parameter differences. First note that the OLS estimator does not include $\sigma$ as an estimated parameter, so it is reported as "NaN", as expected. There are no visible differences between the estimated parameters obtained using MLE Autograd and OLS, and only very small discrepancies between the MLE results obtained from finite differences and either OLS or MLE with Autograd. All of the methods give us qualitatively equivalent parameter estimates.

#+BEGIN_SRC ipython :results output :exports both :eval never-export
# combine results and print
results_a.set_index(['Variable','Model'],inplace=True)
results_o.set_index(['Variable','Model'],inplace=True)
results_.set_index(['Variable','Model'],inplace=True)
df_ = results_o.append(results_a).append(results_).unstack()

print("Parameters")
print(df_['Parameter'].head(K+1))
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[26]:
# output
: Parameters
: Model            MLE  MLE Autograd       OLS
: Variable
: beta_0     -1.162173     -1.162183 -1.162183
: beta_1     -1.944660     -1.944660 -1.944660
: beta_2     -1.252349     -1.252349 -1.252349
: beta_3     -0.320416     -0.320416 -0.320416
: beta_4      0.316880      0.316880  0.316880
: beta_5      1.724172      1.724172  1.724172
: beta_6     -2.385803     -2.385803 -2.385803
: beta_7      1.253359      1.253359  1.253359
: beta_8      1.179826      1.179826  1.179826
: beta_9      2.987553      2.987553  2.987553
: log(Sigma)  0.676046      0.676046       NaN
:
:END:

However, we do see sizable differences in standard errors, with the MLE method (not using =autograd=) showing larger standard errors. Standard errors obtained via =autograd= are qualitatively the same as OLS. This is consistent with what is known about optimization and the propagation of rounding error (see the [[http://www.statsmodels.org/0.6.1/examples/notebooks/generated/generic_mle.html][warning at the bottom of this page]]). When possible, we should always use programmed analytical or auto-differentiated derivatives.
#+BEGIN_SRC ipython :results output :exports both :eval never-export
print("\nStandard Errors")
print(df_['Std Err'].head(K+1))
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[27]:
# output
:
: Standard Errors
: Model            MLE  MLE Autograd       OLS
: Variable
: beta_0      4.382045      0.416128  0.421948
: beta_1      0.068479      0.013951  0.014040
: beta_2      0.041857      0.013816  0.013310
: beta_3      0.025354      0.013937  0.013947
: beta_4      0.069818      0.014011  0.013919
: beta_5      0.058549      0.013897  0.013764
: beta_6      0.078051      0.013972  0.013951
: beta_7      0.058174      0.014054  0.014068
: beta_8      0.028574      0.013823  0.013846
: beta_9      0.024810      0.013994  0.014277
: log(Sigma)  0.010053      0.010000       NaN
:
:END:

* Speed Up

Another big benefit of using =autograd= is that gradients don't need to be numerically approximated using finite differences. Therefore, the number of function evaluations will be much lower. In the example above with 11 parameters, estimating the model with =autograd= required 68 function evaluations, whereas using finite differences required 1207. If the likelihood calculation is expensive (ours isn't for this problem), then these additional evaluations can really slow down estimation. To see how this translates into runtimes, we loop the estimation for $K \in \{10, 50, 100\}$ and examine execution times both with and without =autograd=.

#+BEGIN_SRC ipython :display text/plain :eval never-export :results output :exports code :async
for k in [10, 50, 100]:
    beta = np.random.randn(k)
    theta_start = np.append(beta,1.0)
    y, x = datagen(N, beta, 1)
    print("Time for " + str(k) +" Parameters")
    %timeit minimize(neg_loglike, theta_start, method = 'BFGS', \
                     options={'disp': False}, jac = jacobian_)
    %timeit minimize(neg_loglike, theta_start, method = 'BFGS', \
                     options={'disp': False})
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[271]:
# output
: Time for 10 Parameters
: 214 ms ± 4.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
: 961 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
: Time for 50 Parameters
: 573 ms ± 74.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
: 12 s ± 1.25 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
: Time for 100 Parameters
: 1.19 s ± 38.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
: 46.6 s ± 7.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
:
:END:

Putting the various run times together, the following table shows average runtimes in seconds:

#+BEGIN_SRC ipython :display text/plain :eval never-export :exports both
df = pd.DataFrame(.001*np.c_[[214, 573, 1190],[961, 12000, 46600]],
                  columns=['With Autograd','Without Autograd'],
                  index=[10, 50, 100])
df['Speed Up'] = df['Without Autograd']/df['With Autograd']
df.index.name = "K"
df.head()
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[274]:
# text/plain
:      With Autograd  Without Autograd   Speed Up
: K
: 10           0.214             0.961   4.490654
: 50           0.573            12.000  20.942408
: 100          1.190            46.600  39.159664
:END:

As the number of parameters $K$ (and hence the number of function evaluations) increases, the execution time without =autograd= grows rapidly. For the most expensive problem considered here, maximum likelihood estimation with =autograd= was nearly 40 times faster. It should be noted that even if we compare the "BFGS" results using the jacobian from =autograd= to gradient-free methods like Nelder-Mead (results not reported here), we still see an approximate 10x speed-up using =autograd= that remained roughly constant even as $K$ increased.

* Summary

We should all be using =autograd= or some other method of auto-differentiation.
It is probably going to be faster, definitely more accurate, and only requires a little bit of additional coding.

* Scratch Area                                                     :noexport:

#+BEGIN_SRC ipython :eval never-export :display text/plain
# true parameter values
beta = 2 * np.random.randn(K)
y, x = datagen(N, beta, sigma)

theta_start = np.append(np.zeros(beta.shape[0]),0.0)
res1 = minimize(neg_loglike, theta_start, method = 'BFGS', \
                options={'disp': True}, jac = jacobian_)
# estimated parameters
theta_autograd = res1.x

# diagnostics on fit
error = np.append(beta,np.log(2.0)) - res1.x
error_slope = np.mean(error[0:])
print("Avg. Error on slopes is: ", error_slope)

# Put Results in a DataFrame
results_ = pd.DataFrame({'Parameter':res1.x,'True Value':np.append(beta,np.log(2.0))})
names = ['beta_'+str(i) for i in range(K)]
names.append('Sigma')
results_.head(K+1)
#+END_SRC

#+RESULTS:
:RESULTS:
# Out[244]:
# output
: Warning: Desired error not necessarily achieved due to precision loss.
:          Current function value: 10549.703253
:          Iterations: 44
:          Function evaluations: 74
:          Gradient evaluations: 73
: Avg. Error on slopes is:  0.01067290830793357
:
# text/plain
:     Parameter  True Value
: 0    3.234092    3.361542
: 1    2.127797    2.131122
: 2    0.020556    0.010627
: 3    0.831604    0.822153
: 4   -0.687595   -0.669747
: 5    2.183055    2.170120
: 6    4.658187    4.658509
: 7   -2.194179   -2.199327
: 8    1.370306    1.373238
: 9    0.199318    0.200162
: 10   0.691002    0.693147
:END: