ECON 407: Companion to Maximum Likelihood

This companion to the maximum likelihood discussion shows how to implement many of our topics in R and Stata.

The maximum likelihood method seeks model parameters $\pmb{\theta}^{MLE}$ that maximize the joint likelihood (the likelihood function, $L$) of observing $\mathbf{y}$ conditional on some structural model that is a function of independent variables $\mathbf{x}$ and model parameters $\pmb{\theta}$: $$ \pmb{\theta}^{MLE} = \underset{\pmb{\theta}}{\operatorname{argmax}} \, L = \underset{\pmb{\theta}}{\operatorname{argmax}} \prod_{i=1}^N prob(y_i|\mathbf{x}_i,\pmb{\theta}) $$ In a simple ordinary least squares setting with a linear model, the model parameters are $\pmb{\theta}=[\pmb{\beta},\sigma]$, making the likelihood contribution for observation $i$ $$ prob(y_i|\mathbf{x}_i,\pmb{\beta},\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(y_i - \mathbf{x}_i\pmb{\beta})^2}{2\sigma^2}} $$ Typically, for computational purposes we work with the log-likelihood function rather than the likelihood. This is $$ log(L) = \sum_{i=1}^N log(prob(y_i|\mathbf{x}_i,\pmb{\theta})) $$
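Substituting the normal density into the log-likelihood and simplifying yields the closed form that the numerical routines below will be maximizing: $$ log(L) = -\frac{N}{2}log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - \mathbf{x}_i\pmb{\beta})^2 $$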

In the code below we show how to implement a simple regression model using generic maximum likelihood estimation in Stata and R. While this is a toy example with existing estimators in Stata (regress) and R (lm), we develop the example here for demonstration purposes.

Load Data and Run a Simple OLS Regression Model

In [3]:
library(foreign)
library(maxLik)

Loading our data:

In [4]:
tk.df = read.dta("http://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)
tk4.df$ones=1
attach(tk4.df)
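As a quick sanity check (the expected count below is taken from the Stata regression output that follows), the time == 4 cross-section should have 1034 observations:

nrow(tk4.df)  # expect 1034, matching the Stata results below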

These are the standard OLS estimation results in Stata:

. regress ln_wage educ pexp pexp2 broken_home

      Source |       SS       df       MS              Number of obs =    1034
-------------+------------------------------           F(  4,  1029) =   51.36
       Model |  37.3778146     4  9.34445366           Prob > F      =  0.0000
    Residual |   187.21445  1029  .181938241           R-squared     =  0.1664
-------------+------------------------------           Adj R-squared =  0.1632
       Total |  224.592265  1033  .217417488           Root MSE      =  .42654

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------

Implementing Maximum Likelihood Estimation in Stata

Since the normal pdf is undefined for negative variances, we transform the variance term by exp to force any real number attempted by the maximization routine to be evaluated as a positive number. Consequently, to recover the estimate of the model $\sigma^2$, we need to exponentiate what Stata reports post-estimation to find the maximum likelihood estimate for $\sigma^2$. Here is the code defining the likelihood function and estimating the maximum likelihood model:

. program define ols_mle
  1.    * This program defines the contribution of the likelihood function
  2.    * for each observation in a simple OLS model
  3.    args lnf xb log_sigma2
  4.    qui replace `lnf' = ln(1/sqrt(2*_pi*exp(`log_sigma2'))) - (($ML_y1 - `xb')^2)/(2*exp(`log_sigma2'))
  5.  end

. 
. 
. ml clear

. * Tell the ML program we want a linear function (lf) modeling the mean
. * of y as a linear function of the variables on the right hand side of
. * the equals.  log_sigma2: just models the log of the variance as a constant
. ml model lf ols_mle (ln_wage=educ pexp pexp2 broken_home) (log_sigma2:)

. ml max

initial:       log likelihood = -3426.2813
alternative:   log likelihood = -2118.3989
rescale:       log likelihood = -1754.9153
rescale eq:    log likelihood = -765.29913
Iteration 0:   log likelihood = -765.29913  
Iteration 1:   log likelihood = -652.29309  
Iteration 2:   log likelihood = -584.46047  
Iteration 3:   log likelihood = -583.66381  
Iteration 4:   log likelihood = -583.66289  
Iteration 5:   log likelihood = -583.66289  

                                                  Number of obs   =       1034
                                                  Wald chi2(4)    =     206.44
Log likelihood = -583.66289                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
        educ |   .0852725   .0092672     9.20   0.000     .0671092    .1034358
        pexp |   .2035214   .0235288     8.65   0.000     .1574058     .249637
       pexp2 |  -.0124126    .002277    -5.45   0.000    -.0168755   -.0079497
 broken_home |  -.0087254   .0356243    -0.24   0.807    -.0785477     .061097
       _cons |   .4603326   .1369617     3.36   0.001     .1918926    .7287726
-------------+----------------------------------------------------------------
log_sigma2   |
       _cons |  -1.708935   .0439799   -38.86   0.000    -1.795134   -1.622736
------------------------------------------------------------------------------
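In other words, the program maximizes $$ log(L) = \sum_{i=1}^N \left[ log\left(\frac{1}{\sqrt{2\pi e^{\eta}}}\right) - \frac{(y_i - \mathbf{x}_i\pmb{\beta})^2}{2e^{\eta}} \right] $$ over $\pmb{\beta}$ and $\eta = log(\sigma^2)$, leaving the optimizer free to search over any real value of $\eta$.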

Note that in our program, $\sigma^2$ is equal to the exponential of what Stata reports (so the estimated variance can never be negative). The recovered estimate is slightly smaller than the OLS residual mean square above because maximum likelihood divides the sum of squared residuals by $N$ rather than $N-K$:

In [5]:
exp(-1.708935)
Out[5]:
0.181058517293934

Implementing Maximum Likelihood Estimation in R

Here we define the (negative) log-likelihood function. Note that the first step is to check whether the candidate model variance satisfies $\sigma^2 \le 0$. If so, the function returns NA (R's missing value), and the optimization routine moves on to another candidate set of model parameters. Because optim minimizes by default, the function returns $-1\times$ the log-likelihood.

In [6]:
y = ln_wage
X = matrix(c(educ,pexp,pexp2,broken_home,ones),nrow = length(educ))

# Negative log-likelihood: theta[1] is sigma^2, theta[-1] the beta vector
ols.lnlike <- function(theta) {
  if (theta[1] <= 0) return(NA)  # variance must be positive
  -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
}
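As an aside, we could instead borrow the reparameterization trick from the Stata program and let the optimizer search over $log(\sigma^2)$; a minimal sketch (the function name is illustrative):

# Alternative: optimize over log(sigma^2) so any real value of theta[1]
# maps to a positive variance, as in the Stata program above:
ols.lnlike.logvar <- function(theta) {
  sigma2 <- exp(theta[1])
  -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(sigma2), log = TRUE))
}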

Note that we can evaluate the negative log-likelihood, ols.lnlike, for any parameter vector:

In [7]:
ols.lnlike(c(1,1,1,1,1,1))
Out[7]:
1305304.09136282

Manual Optimization (the hard way)

While this method is the hard way, it demonstrates the mechanics of optimization in R. To find the optimal parameters, we use R's optim function, which needs as arguments

  1. starting values
  2. an optimization method (for unconstrained problems, BFGS is a good choice)
  3. the function to be minimized
  4. any data the function needs for evaluation (here, y and X are picked up from the workspace rather than passed explicitly)

Also, we can ask optim to return the hessian, which we'll use for calculating standard errors.
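Because ols.lnlike returns $-1\times$ the log-likelihood, the hessian that optim returns estimates the information matrix, and standard maximum likelihood asymptotics give $$ \widehat{Var}(\hat{\pmb{\theta}}) = \mathbf{H}^{-1}, \qquad se(\hat{\theta}_j) = \sqrt{\left[\mathbf{H}^{-1}\right]_{jj}} $$ which is exactly what we compute with solve and diag below.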

In [8]:
mle_res.optim = optim(c(1,.1,.1,.1,.1,.1), method="BFGS",fn=ols.lnlike,hessian=TRUE)
summary(mle_res.optim)
Out[8]:
            Length Class  Mode   
par          6     -none- numeric
value        1     -none- numeric
counts       2     -none- numeric
convergence  1     -none- numeric
message      0     -none- NULL   
hessian     36     -none- numeric
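Before using the results, it's worth confirming that the optimizer actually succeeded; optim signals success with a convergence code of 0:

# optim reports 0 for successful convergence:
mle_res.optim$convergence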

Our model parameters are

In [9]:
b = round(mle_res.optim$par,5)
b
Out[9]:
  1. 0.18106
  2. 0.08527
  3. 0.20352
  4. -0.01241
  5. -0.00873
  6. 0.46033

and the standard errors are

In [10]:
se = round(sqrt(diag(solve(mle_res.optim$hessian))),5)
se
Out[10]:
  1. 0.00796
  2. 0.00927
  3. 0.02353
  4. 0.00228
  5. 0.03562
  6. 0.13696

Putting it together for comparison to Stata (the standard errors differ slightly from OLS because maximum likelihood makes no degrees-of-freedom correction):

In [12]:
labels = c('Sigma2','educ','pexp','pexp2','broken_home','Constant')
tabledat = data.frame(matrix(c(labels,b,se),nrow=length(b)))
names(tabledat) = c("Variable","Parameter","Standard Error")
print(tabledat)
     Variable Parameter Standard Error
1      Sigma2   0.18106        0.00796
2        educ   0.08527        0.00927
3        pexp   0.20352        0.02353
4       pexp2  -0.01241        0.00228
5 broken_home  -0.00873        0.03562
6    Constant   0.46033        0.13696

Using maxLik (the easy way)

Unlike optim above, the maxLik package maximizes rather than minimizes, so it needs the log-likelihood itself, not -1*log-likelihood. Let's redefine the likelihood:

In [13]:
ols.lnlike <- function(theta) {
  if (theta[1] <= 0) return(NA)
  # note: no leading minus sign, since maxLik maximizes
  sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
}

Using the maxLik package simplifies this process a lot.

In [14]:
# starting values with names (a named vector)
start = c(1,.1,.1,.1,.1,.1)
names(start) = c("sigma2","educ","pexp","pexp2","broken_home","Constant")
mle_res <- maxLik( logLik = ols.lnlike, start = start, method="BFGS" )
summary( mle_res )
Out[14]:
--------------------------------------------
Maximum Likelihood estimation
BFGS maximisation, 117 iterations
Return code 0: successful convergence 
Log-Likelihood: -583.6629 
6  free parameters
Estimates:
             Estimate Std. error t value  Pr(> t)    
sigma2       0.181058   0.007963  22.737  < 2e-16 ***
educ         0.085273   0.009262   9.207  < 2e-16 ***
pexp         0.203521   0.023524   8.652  < 2e-16 ***
pexp2       -0.012413   0.002277  -5.452 4.99e-08 ***
broken_home -0.008725   0.035626  -0.245 0.806525    
Constant     0.460333   0.136859   3.364 0.000769 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
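A usage note: the pieces of the fitted maxLik object can be pulled out programmatically with the package's accessor functions, which is handy for building tables like the one we constructed by hand above:

coef(mle_res)     # parameter estimates
stdEr(mle_res)    # standard errors
vcov(mle_res)     # variance-covariance matrix
logLik(mle_res)   # maximized log-likelihood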

My personal opinion is that if you do a lot of this type of work (writing custom log-likelihoods), you are better off using Matlab, R, or Python than Stata. If you use R, my strong preference is to use maxLik, as I demonstrate above, over mle or mle2, since the latter two require that you define your likelihood function as having separate arguments for each parameter to be optimized over.
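To see the difference, here is a minimal sketch of what stats4's mle would require; the function and argument names are illustrative, not from the example above. Each parameter enters as a separate named argument to the negative log-likelihood rather than as a single vector:

library(stats4)

# Each parameter is its own named argument (contrast with the single
# theta vector passed to optim/maxLik above):
ols.minuslnlike <- function(sigma2, b_educ, b_pexp, b_pexp2, b_bh, b_cons) {
  if (sigma2 <= 0) return(NA)
  beta <- c(b_educ, b_pexp, b_pexp2, b_bh, b_cons)
  -sum(dnorm(y, mean = X %*% beta, sd = sqrt(sigma2), log = TRUE))
}

mle_res.stats4 <- mle(ols.minuslnlike,
                      start = list(sigma2 = 1, b_educ = .1, b_pexp = .1,
                                   b_pexp2 = .1, b_bh = .1, b_cons = .1),
                      method = "BFGS")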

Visualizing the Likelihood Function

Let's graph the likelihood for the educ coefficient, holding

  1. all data at observed values
  2. other parameters constant at optimal values

I am going to call this a "marginal log-likelihood" since we are only allowing one parameter to vary (rather than all 6). Note that because we plot $-1\times$ the log-likelihood, the maximum likelihood estimate shows up as the minimum of the curve. Here is what it looks like:

In [15]:
coeff2 = mle_res.optim$par      # start from the ML estimates
altcoeff <- seq(.06,.11,0.001)  # grid of candidate education coefficients
LL <- NULL

# calculates the LL for various values of the education parameter
# and stores them in LL:
for (m in altcoeff) {
  coeff2[2] <- m # consider a candidate parameter value for education
  LLt <- -1*ols.lnlike(coeff2)
  LL <- rbind(LL,LLt)
}


plot(altcoeff,LL, type="l", xlab="Education Coefficient", ylab="-LogL",
     main="Marginal Log-Likelihood Function for \n Education Coefficient")

abline(v=mle_res.optim$par[2], col="red")
abline(h=mle_res.optim$value, col="red")
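As an illustrative aside (this holds the other parameters fixed rather than re-optimizing them, so it is only approximate), a likelihood-ratio style cutoff can be overlaid on the plot: parameter values where $-log(L)$ rises by less than $\chi^2_{1,0.95}/2 \approx 1.92$ above the minimum would not be rejected by a likelihood ratio test.

# Approximate 95% likelihood-ratio cutoff (illustrative: other
# parameters are held fixed rather than profiled):
abline(h = mle_res.optim$value + qchisq(.95, df = 1)/2, col = "blue", lty = 2)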