ECON 407: Companion to Maximum Likelihood

This page has been moved to https://econ.pages.code.wm.edu/407/notes/docs/index.html and is no longer being maintained here.

This companion to the maximum likelihood discussion shows how to implement many of our topics in R and Stata.

The maximum likelihood method seeks the model parameters \(\pmb{\theta}^{MLE}\) that maximize the joint likelihood (the likelihood function, \(L\)) of observing \(\mathbf{y}\), conditional on some structural model that is a function of independent variables \(\mathbf{x}\) and model parameters \(\pmb{\theta}\): \[ \pmb{\theta}^{MLE} = \underset{\pmb{\theta}}{\operatorname{argmax}}\, L = \underset{\pmb{\theta}}{\operatorname{argmax}} \prod_{i=1}^N prob(y_i|\mathbf{x}_i,\pmb{\theta}) \] In a simple ordinary least squares setting with a linear model, the model parameters are \(\pmb{\theta}=[\pmb{\beta},\sigma]\), making the likelihood contribution for observation \(i\) \[ prob(y_i|\mathbf{x}_i,\pmb{\beta},\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \mathbf{x}_i\pmb{\beta})^2}{2\sigma^2}} \] Typically, for computational purposes, we work with the log-likelihood function rather than the likelihood: \[ \log(L) = \sum_{i=1}^N \log\left(prob(y_i|\mathbf{x}_i,\pmb{\theta})\right) \]
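Substituting the normal density into the log-likelihood and simplifying yields the expression we will code up below: \[ \log(L) = \sum_{i=1}^N \left[ -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{(y_i - \mathbf{x}_i\pmb{\beta})^2}{2\sigma^2} \right] \]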

In the code below, we show how to implement a simple regression model using generic maximum likelihood estimation in Stata and R. While this is a toy example with existing estimators in Stata (regress) and R (lm), we develop it here for demonstration purposes.

We'll be estimating a standard OLS model using maximum likelihood. For benchmarking purposes, the standard OLS estimation results in Stata are given below:

set more off
webuse set "https://rlhick.people.wm.edu/econ407/data"
(prefix now "https://rlhick.people.wm.edu/econ407/data")
webuse tobias_koop
keep if time==4
(16,885 observations deleted)
regress ln_wage educ pexp pexp2 broken_home

      Source |       SS           df       MS      Number of obs   =     1,034
-------------+----------------------------------   F(4, 1029)      =     51.36
       Model |  37.3778146         4  9.34445366   Prob > F        =    0.0000
    Residual |   187.21445     1,029  .181938241   R-squared       =    0.1664
-------------+----------------------------------   Adj R-squared   =    0.1632
       Total |  224.592265     1,033  .217417488   Root MSE        =    .42654

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0852725   .0092897     9.18   0.000     .0670437    .1035014
        pexp |   .2035214   .0235859     8.63   0.000     .1572395    .2498033
       pexp2 |  -.0124126   .0022825    -5.44   0.000    -.0168916   -.0079336
 broken_home |  -.0087254   .0357107    -0.24   0.807    -.0787995    .0613488
       _cons |   .4603326    .137294     3.35   0.001     .1909243    .7297408
------------------------------------------------------------------------------

Maximum Likelihood in Stata

To set up the model, we need to program the normal log-likelihood function. Here is the code defining the likelihood function and estimating the maximum likelihood model:

program define ols_mle
    * This program defines the contribution to the log-likelihood
    * for each observation in a simple OLS model
    args lnf xb sigma
    qui replace `lnf' = ln(1/sqrt(2*_pi*(`sigma')^2)) - (($ML_y1 - `xb')^2)/(2*(`sigma')^2)
end
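The expression inside qui replace is just the log of the normal density. As a quick sanity check with arbitrary values (the variables y0, xb, and s below are illustrative), R's dnorm(..., log = TRUE), which we use in the R implementation later, computes the same quantity:

y0 <- 1.5; xb <- 1.2; s <- 0.4
log(1/sqrt(2*pi*s^2)) - (y0 - xb)^2/(2*s^2)  # hand-coded, as in the Stata program
dnorm(y0, mean = xb, sd = s, log = TRUE)     # built-in; returns the identical value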

Having defined the likelihood, we use Stata's maximum likelihood capabilities. Note that the code below tells the maximum likelihood program

  • we want a linear-form (lf) evaluator
  • ln_wage is modeled as a linear function of the independent variables
  • sigma is modeled as a constant

ml model lf ols_mle (ln_wage=educ pexp pexp2 broken_home) (sigma:)
ml max

initial:       log likelihood =     -<inf>  (could not be evaluated)
feasible:      log likelihood = -6232.9439
rescale:       log likelihood = -1697.4414
rescale eq:    log likelihood = -722.18386
Iteration 0:   log likelihood = -722.18386
Iteration 1:   log likelihood = -606.03669
Iteration 2:   log likelihood = -583.66648
Iteration 3:   log likelihood = -583.66289
Iteration 4:   log likelihood = -583.66289

                                                Number of obs     =      1,034
                                                Wald chi2(4)      =     206.44
Log likelihood = -583.66289                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
        educ |   .0852725   .0092672     9.20   0.000     .0671092    .1034358
        pexp |   .2035214   .0235288     8.65   0.000     .1574058     .249637
       pexp2 |  -.0124126    .002277    -5.45   0.000    -.0168755   -.0079497
 broken_home |  -.0087254   .0356243    -0.24   0.807    -.0785477     .061097
       _cons |   .4603326   .1369617     3.36   0.001     .1918926    .7287726
-------------+----------------------------------------------------------------
sigma        |
       _cons |   .4255097   .0093569    45.48   0.000     .4071704    .4438489
------------------------------------------------------------------------------

Maximum Likelihood in R

Load the data and put it into an R data frame:

library(foreign)
library(maxLik)

tk.df = read.dta("https://rlhick.people.wm.edu/econ407/data/tobias_koop.dta")
tk4.df = subset(tk.df, time == 4)  # keep the time == 4 cross-section
tk4.df$ones = 1                    # column of ones for the constant term
attach(tk4.df)
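As a quick check (assuming the download succeeds), the time == 4 cross-section should contain the same 1,034 observations used in the Stata benchmark above:

nrow(tk4.df)  # should be 1034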

With the data loaded, we set up the response vector, the design matrix, and the (negative) log-likelihood function:

y = ln_wage
X = matrix(c(educ,pexp,pexp2,broken_home,ones),nrow = length(educ))

# negative log-likelihood (optim minimizes by default);
# theta[1] is sigma, theta[-1] are the regression coefficients
ols.lnlike <- function(theta) {
  if (theta[1] <= 0) return(NA)
  -sum(dnorm(y, mean = X %*% theta[-1], sd = theta[1], log = TRUE))
}

Notice that we can evaluate this function at any parameter vector (where the first parameter is \(\sigma\)):

ols.lnlike(c(1,1,1,1,1,1))
1305304.09136282

Manual Optimization (the hard way)

While this method is the hard way, it demonstrates the mechanics of optimization in R. In order to find the optimal parameters, we use R's optim function to find the estimated parameter vector. It needs as arguments

  1. starting values
  2. optimization method (for unconstrained problems `BFGS` is a good choice)
  3. function to be minimized
  4. any data that the function needs for evaluation.

Also, we can ask `optim` to return the Hessian, which we'll use for calculating standard errors. The code below calls optim on our function with starting values of 1 for \(\sigma\) and 0.1 for each coefficient (c(1,.1,.1,.1,.1,.1)), uses the BFGS optimization method, and asks for the Hessian to be returned. This runs the maximum likelihood model and assembles the results in a table:

mle_res.optim = optim(c(1,.1,.1,.1,.1,.1), method="BFGS", fn=ols.lnlike, hessian=TRUE)
# parameter estimates
b = round(mle_res.optim$par,5)
# std errors
se = round(sqrt(diag(solve(mle_res.optim$hessian))),5)
labels = c('sigma','educ','pexp','pexp2','broken_home','Constant')
tabledat = data.frame(matrix(c(labels,b,se),nrow=length(b)))
names(tabledat) = c("Variable","Parameter","Standard Error")
print(tabledat)
     Variable Parameter Standard Error
1       sigma   0.42551        0.00936
2        educ   0.08527        0.00927
3        pexp   0.20352        0.02353
4       pexp2  -0.01241        0.00228
5 broken_home  -0.00873        0.03562
6    Constant   0.46033        0.13696
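Why does inverting the Hessian give standard errors? Because ols.lnlike is the negative log-likelihood, its Hessian at the optimum is the observed information matrix, and the inverse of that matrix estimates the variance-covariance matrix of the parameters: \[ \widehat{\operatorname{Var}}\left(\hat{\pmb{\theta}}^{MLE}\right) = \left[-\frac{\partial^2 \log(L)}{\partial\pmb{\theta}\,\partial\pmb{\theta}'}\bigg|_{\hat{\pmb{\theta}}^{MLE}}\right]^{-1} \] so sqrt(diag(solve(mle_res.optim$hessian))) above yields the standard errors.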

Using maxLik (the easy way)

Unlike optim above, the maxLik package needs the log-likelihood value, not -1*log-likelihood. Note also that this version parameterizes the variance directly: the first element of theta is \(\sigma^2\) (hence sd = sqrt(theta[1]) and the name sigma2 below). So let's redefine the likelihood:

# log-likelihood (not negated); theta[1] is now the variance sigma^2
ols.lnlike <- function(theta) {
  if (theta[1] <= 0) return(NA)
  sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
}
# starting values with names (a named vector)
start = c(1,.1,.1,.1,.1,.1)
names(start) = c("sigma2","educ","pexp","pexp2","broken_home","Constant")
mle_res <- maxLik( logLik = ols.lnlike, start = start, method="BFGS" )
print(summary(mle_res))
--------------------------------------------
Maximum Likelihood estimation
BFGS maximization, 117 iterations
Return code 0: successful convergence 
Log-Likelihood: -583.6629 
6  free parameters
Estimates:
             Estimate Std. error t value  Pr(> t)    
sigma2       0.181058   0.007963  22.737  < 2e-16 ***
educ         0.085273   0.009262   9.207  < 2e-16 ***
pexp         0.203521   0.023524   8.652  < 2e-16 ***
pexp2       -0.012413   0.002277  -5.452 4.99e-08 ***
broken_home -0.008725   0.035626  -0.245 0.806525    
Constant     0.460333   0.136859   3.364 0.000769 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
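Since this version estimates the variance, we can recover \(\hat{\sigma}\) and confirm that it matches the sigma equation in the Stata output above:

sqrt(coef(mle_res)["sigma2"])  # ~0.4255, matching Stata's estimate of sigma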

My personal opinion is that if you do a lot of this type of work (writing custom log-likelihoods), you are better off using Matlab, R, or Python than Stata. If you use R, my strong preference is to use maxLik, as I demonstrate above, over mle or mle2, since the latter two require that you define your likelihood function with a separate argument for each parameter to be optimized over.
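To see why that requirement is awkward, here is a minimal sketch of what stats4::mle expects: a negative log-likelihood with one named argument per parameter (the argument names b_educ, b_pexp, etc. are illustrative, not from the original notes):

library(stats4)

# each parameter must be a separate named argument
nll <- function(sigma, b_educ, b_pexp, b_pexp2, b_broken, b_cons) {
  beta <- c(b_educ, b_pexp, b_pexp2, b_broken, b_cons)
  -sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}

fit <- mle(nll, start = list(sigma = 1, b_educ = .1, b_pexp = .1,
                             b_pexp2 = .1, b_broken = .1, b_cons = .1))

With six parameters this is merely tedious; with dozens it becomes unmanageable, which is why maxLik's vector interface is preferable.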

Visualizing the Likelihood Function

Let's graph the likelihood for the educ coefficient, holding

  1. all data at observed values
  2. other parameters constant at optimal values

I am going to call this a "marginal log-likelihood" since we are only allowing one parameter to vary (rather than all 6). Here is what it looks like:

coeff2 = mle_res.optim$par  # sigma-parameterized estimates from optim
altcoeff <- seq(.06,.11,0.001)
LL <- NULL

# calculates the negative log-likelihood for various values of the education
# parameter and stores them in LL (ols.lnlike was redefined for maxLik above,
# so we compute the sigma-parameterized negative log-likelihood directly)
for (m in altcoeff) {
  coeff2[2] <- m # consider a candidate parameter value for education
  LLt <- -sum(dnorm(y, mean = X %*% coeff2[-1], sd = coeff2[1], log = TRUE))
  LL <- rbind(LL,LLt)
}


plot(altcoeff,LL, type="l", xlab="Education Coefficient", ylab="-LogL",
     main="Marginal Log-Likelihood Function for \n Education Coefficient")

abline(v=mle_res.optim$par[2], col="red")
abline(h=mle_res.optim$value, col="red")
dev.copy(png,'/tmp/marginal_mle.png')
dev.off()

[Figure: marginal log-likelihood function for the education coefficient (marginal_mle.png)]

Notice that as we move away from the estimated value (the red vertical line), the log-likelihood declines (the plotted negative log-likelihood rises), since we lose joint probability over the estimation sample.