Estimating Custom Maximum Likelihood Models in Python (and Matlab)

In this post I show various ways of estimating "generic" maximum likelihood models in python. For each, we'll recover standard errors.

We will implement a simple ordinary least squares model like this

\begin{equation} \mathbf{y = x\beta +\epsilon} \end{equation}

where \(\epsilon\) is assumed distributed i.i.d. normal with mean 0 and variance \(\sigma^2\). In our simple model, there is only a constant and one slope coefficient (\(\beta = \begin{bmatrix} \beta_0 & \beta_1 \end{bmatrix}\)).

For this model, we would probably never bother going to the trouble of manually implementing maximum likelihood estimators as we show in this post. However, for more complicated models for which there is no established package or command, there are benefits to knowing how to build your own likelihood function and use it for estimation. It is also worthwhile noting that most of the methods shown here don't use analytical gradiants or hessians, so are likely (1) to have longer execution times and (2) to be less precise than methods where known analytical gradiants and hessians are built into the estimation method. I might explore those issues in a later post.

tl;dr: There are numerous ways to estimate custom maximum likelihood models in Python, and what I find is:

  1. For the most features, I recommend using the Genericlikelihoodmodel class from Statsmodels even if it the least intuitive way for programmers familiar with Matlab. If you are comfortable with object oriented programming you should definitely go this route.
  2. For fastest run times and computationally expensive problems Matlab will most likely be significantly faster than Python without time investments optimizing code and even then it is doubtful python would be faster.

Python packages we'll use for this post:

import pymc3 as pm3
import numpy as np
import numdifftools as ndt
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.base.model import GenericLikelihoodModel
from scipy.optimize import minimize

First we generate data for this model here and add a column for the model constant:

N = 10000
x = 10 + 2*np.random.randn(N)
y = 5 + x + np.random.randn(N)
df = pd.DataFrame({'y':y, 'x':x})
df['constant'] = 1

df.head()
           x          y  constant
0   8.987883  13.252842         1
1  10.323436  13.591453         1
2   8.932267  14.462129         1
3   8.917005  12.495138         1
4   9.294429  14.359948         1

OLS Estimation

Since this is such a simple and universally used model, there are numerous packages available for estimating it. For this problem, you would undoubtedly want to use one of these existing packages. For example, statsmodels has an OLS method. We use it here for benchmarking purposes for comparing our maximum likelihood estimation of the same model below.

sm.OLS(df.y,df[['constant','x']]).fit().summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.802
Model:                            OLS   Adj. R-squared:                  0.802
Method:                 Least Squares   F-statistic:                 4.048e+04
Date:                Mon, 08 May 2017   Prob (F-statistic):               0.00
Time:                        10:10:00   Log-Likelihood:                -14111.
No. Observations:               10000   AIC:                         2.823e+04
Df Residuals:                    9998   BIC:                         2.824e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
constant       5.0451      0.050    100.129      0.000         4.946     5.144
x              0.9954      0.005    201.200      0.000         0.986     1.005
==============================================================================
Omnibus:                        0.093   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.955   Jarque-Bera (JB):                0.110
Skew:                          -0.005   Prob(JB):                        0.947
Kurtosis:                       2.987   Cond. No.                         52.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Matlab

Matlab is a tool used by many econometricians for estimating generic likelihood models so I include it here for comparison purposes. The file neg_loglike.m defines the log-likelihood function (negative log-likelihood):

function [ll] = neg_loglike(theta,Y,X)
  mu = X*theta(1:2);
  ll = -1*sum(log(normpdf(Y,mu,theta(3))));  
end

When this code is run in Matlab:

data = csvread('/tmp/data.csv');  

X = data(:,[3 1]);
Y = data(:,2);

% Optimization Settings
opt = optimset('GradObj','off',...
	       'Hessian','off',...
	       'LargeScale','off',...
	       'Display','off',...
	       'MaxIter', 100000,...
	       'MaxFunEvals', 1000000, ...
	       'TolFun', 1.000e-006,...
	       'TolX', 1.000e-006,...
	       'FunValCheck','off',...
	       'DerivativeCheck','off',...
	       'Diagnostics','off',...
	       'Algorithm','trust-region-reflective');

% starting values for estimation
theta_start=.5*ones(size(X,2)+1,1);
[theta,fval,flag,out,grad,hess] = fminunc(@(theta) neg_loglike(theta,Y,X), ...
    theta_start, opt);
se = sqrt(diag(inv(1*hess)));

ans = [theta se]

We get the following parameter estimates:

           Parameter  Standard Error
Intercept    5.04550        0.050377
Slope        0.99535        0.004946
Sigma        0.99208        0.007016

Python Handcoded using Scipy

Of all the python methods, this one is most similar to Matlab above. Defining the log-likelihood (in this case as in matlab a negative log-likelihood since there is no maximize function):

def neg_loglike(theta):
    mu = theta[0] + theta[1]*x
    return -1*norm(mu, theta[2]).logpdf(y).sum()

We'll optimize the log-likelihood over our parameters using minimize from python's scipy.optimize package:

theta_start = np.array([1,1,1])
res = minimize(neg_loglike, theta_start, method = 'Nelder-Mead', 
	       options={'disp': True})
Optimization terminated successfully.
         Current function value: 14110.691436
         Iterations: 118
         Function evaluations: 215

The default method, BFGS sometimes fails to converge, so I usually use Nelder-Mead. Unfortunately, Nelder-Mead doesn't return an estimated Hessian, so we need to calculate it ourselves using the numdifftools package:

Hfun = ndt.Hessian(neg_loglike, full_output=True)
hessian_ndt, info = Hfun(res['x'])

Combining the estimates with the standard errors gives us this set of estimates:

           parameters   std err
Intercept    5.045121  0.050381
Slope        0.995394  0.004947
Sigma        0.992161  0.007016

Extending Statsmodels

This method is perhaps the best way to proceed, but unless you are well-versed in object oriented programming is likely to be confusing. In my view it is worth the trouble to setup your custom maximum likelihood problem like this, by leveraging the machinery of statsmodels for working with our custom likelihood function. Since we have \(K+1\) parameters, proceed carefully, since by default statsmodels assumes the number of parameters is equal to the column dimension of your independent variable. First, define the log-likelihood function (note this is not the negative log-likelihood):

def _ll_ols(y, X, beta, sigma):
    mu = X.dot(beta)
    return norm(mu,sigma).logpdf(y).sum()

Then, using the log-likelihood define our custom likelihood class (I'll call it MyOLS). Note that there are two key parts to the code below:

  1. The function nloglikeobs, is only acting as a "traffic cop" and spits the parameters into \(\beta\) and \(\sigma\) coefficients and calls the likelihood function _ll_ols above.
  2. The fit function is where we inform statsmodels that our model has \(K+1\) variables. We'll provide a name for the additional variable (\(\sigma\)) and provide default starting values.
class MyOLS(GenericLikelihoodModel):
    def __init__(self, endog, exog, **kwds):
	super(MyOLS, self).__init__(endog, exog, **kwds)
    def nloglikeobs(self, params):
	sigma = params[-1]
	beta = params[:-1]
	ll = _ll_ols(self.endog, self.exog, beta, sigma)
	return -ll
    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **kwds):
	# we have one additional parameter and we need to add it for summary
	self.exog_names.append('sigma')
	if start_params == None:
	    # Reasonable starting values
	    start_params = np.append(np.zeros(self.exog.shape[1]), .5)
	return super(MyOLS, self).fit(start_params=start_params,
				     maxiter=maxiter, maxfun=maxfun,
				     **kwds)

Once the model structure is setup, we can use fit() for estimating the model. statsmodels handles all of the optimizations and hessian calculations for us.

sm_ols_manual = MyOLS(df.y,df[['constant','x']]).fit()
print(sm_ols_manual.summary())
Optimization terminated successfully.
         Current function value: 1.411069
         Iterations: 233
         Function evaluations: 413
MyOLS Results                                 
==============================================================================
Dep. Variable:                      y   Log-Likelihood:                -14111.
Model:                          MyOLS   AIC:                         2.823e+04
Method:            Maximum Likelihood   BIC:                         2.824e+04
Date:                Mon, 08 May 2017                                         
Time:                        10:10:11                                         
No. Observations:               10000                                         
Df Residuals:                    9998                                         
Df Model:                           1                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
constant       5.0451      0.050    100.138      0.000         4.946     5.144
x              0.9954      0.005    201.220      0.000         0.986     1.005
sigma          0.9922      0.007    141.420      0.000         0.978     1.006
==============================================================================

The great thing about using the statsmodels infrastructure is that you have access to a lot of functionality once your model is setup and estimated. For example, suppose you want 95% confidence intervals, after estimation, we can use this command with the model object:

ci = sm_ols_manual.conf_int(alpha=.05)
resdf = pd.DataFrame(ci,columns=['lower','upper'])
resdf.index = ['constant','x','sigma']
resdf.head()
             lower     upper
constant  4.946321  5.143813
x         0.985705  1.005096
sigma     0.978416  1.005917

The are lots of other methods available to your generic model instance, and it is worth the effort getting it running.

PyMC3

The Bayesian Statistics Package PyMC3 can also find the Hessian and Maximum Posterior values, which for Flat priors should give us something nearly identical to Maximum Likelihood. Note, the Hessian produced by PyMC3 using approx_hessian is what you should use. The command find_hessian doesn't yield a valid hessian matrix for any of the cases I have tried. One potential advantage of using PyMC3 is that the hessian could be calculated off of analytical gradiants and if this is the case would likely yield more accurate standard errors than any of the other methods presented in this post (including Matlab).1

Setting the model up in PyMC3

with pm3.Model() as model:
    b0 = pm3.Flat('constant',testval=5)
    b1 = pm3.Flat('x',testval=1)
    sigma = pm3.Flat('sigma',testval = 1)
    mu = b0 + b1 * df.x
    like = pm3.Normal('like',mu = mu, sd=sigma, observed=df.y)

Noting that with Flat priors the posterior is exactly proportional to the likelihood, we can use the find_MAP function

with model:
    # obtain MLE 
    beta_mle = pm3.find_MAP()
    # obtain hessian at MAP
    hessian = pm3.approx_hessian(beta_mle)
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 14110.691435
         Iterations: 8
         Function evaluations: 131
         Gradient evaluations: 118

The method pm3.approx_hessian uses numdifftools to calculate the standard errors evaluated at the parameter vector you specify. We evaluate it at the MAP estimates which coincides with MLE estimates for this model setup. Note, the only gotcha is that the pm3.approx_hessian changes the order of parameters, so we need to match the standard errors we calculate from the hessian to the estimates carefully. In this case, they are in reverse order. Putting together we have

se_1 = np.sqrt(np.diag(np.linalg.inv(hessian)))    
se = se_1[::-1]
b = np.array([beta_mle['constant'], beta_mle['x'], beta_mle['sigma']])
results_pymc = pd.DataFrame({'parameters':b,'std err':se})
results_pymc.index=['Intercept','Slope','Sigma']   
results_pymc.head()
           parameters   std err
Intercept    5.045094  0.050381
Slope        0.995398  0.004947
Sigma        0.992161  0.007016

This entire discussion of shoe-horning maximum likelihood estimation into PyMC3 really does beg the question of why not just sample directly from from the posterior of our parameter distribution (which for this case, using flat priors is sampling from the maximum likelihood parameter distribution) using Monte-Carlo Markov Chains. I'll leave that to a later post.

Comparison

Unsurprisingly given the simplistic nature of the model estimated here, all of the results are nearly identical. I did an informal time (timed single run) comparison and here are the results. Note that we are using PyMC3 in unintended ways and it wasn't built to optimize execution times for this type of problem.

Table 1: Single-Run Time Comparison
Program Execution Time (Seconds)
Matlab with fminunc .122
Python with minimize .9937
Python statsmodels .9848
Python PyMC3 1.377

Matlab is significantly faster for this problem.

Computing Environment

Package Version
OS Kernel Linux 4.10.13-200.fc25.x8664
gcc 4.8.5
python 3.6.0
cython 0.25.2
numpy 1.11.3
pandas 0.19.2
theano 0.8.2
pymc3 3.0
statsmodels 0.6.1
numdifftools 0.9.20
scipy 0.18.1

Footnotes:

1

Without futher investigation, I can't tell if PyMC3 uses analytical gradiants for calculating the hessian.