Stata and Literate Programming in Emacs Org-Mode

Stata is a widely used statistical package, and Emacs Org-mode is a great platform for organizing, publishing, and blogging your research. In an older post, I outlined the relative benefits of Org-mode compared to other packages for literate programming. At that time, I argued it was the best way to write literate programming documents with Stata (if you are willing to pay the fixed costs of learning Emacs). I still believe that, and I use it heavily for writing course notes, emailing students code and results, and even drafting manuscripts for publication.

Despite how good Emacs Org-mode is for research involving Stata, Stata is still something of a second-class citizen compared to languages such as R or Python. It is functional but a little rough around the edges, and since not many people use Stata with Emacs, finding answers can be tough. This post does three things:

  1. Demonstrates some issues with using Stata in Org-mode.
  2. Introduces an updated version of ob-stata.el. With only minor modifications, this version avoids some issues with the current version of ob-stata found here. My version of ob-stata.el can be downloaded from GitLab.
  3. Provides full setup instructions that enable code highlighting in html and latex export.

Issues with Stata and Org-Mode

Occasional garbled output

For most of the usual commands (regress, sum, probit, etc.), output is fine. But for some commands the output can be truncated. In this code block, we bootstrap the probit command and then ask for model fit diagnostics. Both bstrap and estat classification fail to render properly using the current version of ob-stata.el.

webuse auto
bstrap: probit foreign mpg price headroom 
estat classification
webuse auto
(1978 Automobile Data)
   50

Probit regression                               Number of obs     =         74
                                                Replications      =         50
                                                Wald chi2(3)      =      19.82
                                                Prob > chi2       =     0.0002
Log likelihood = -35.296645                     Pseudo R2         =     0.2162

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1218917     .03376     3.61   0.000     .0557233    .1880601
       price |   .0001563   .0000879     1.78   0.075    -.0000159    .0003286
    headroom |  -.3379361   .1974509    -1.71   0.087    -.7249327    .0490606
       _cons |  -3.232319    1.32055    -2.45   0.014    -5.820551    -.644088
------------------------------------------------------------------------------
    |         8             6  |         14
     -     |        14            46  |         60
-----------+--------------------------+-----------
   Total   |        22            52  |         74

Classified + if predicted Pr(D) >= .5
True D defined as foreign != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   36.36%
Specificity                     Pr( -|~D)   88.46%
Positive predictive value       Pr( D| +)   57.14%
Negative predictive value       Pr(~D| -)   76.67%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   11.54%
False - rate for true D         Pr( -| D)   63.64%
False + rate for classified +   Pr(~D| +)   42.86%
False - rate for classified -   Pr( D| -)   23.33%
--------------------------------------------------
Correctly classified                        72.97%
--------------------------------------------------

My modification of ob-stata.el renders this output correctly:

webuse auto
bstrap: probit foreign mpg price headroom 
estat classification
(1978 Automobile Data)
(running probit on estimation sample)

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Probit regression                               Number of obs     =         74
                                                Replications      =         50
                                                Wald chi2(3)      =      14.01
                                                Prob > chi2       =     0.0029
Log likelihood = -35.296645                     Pseudo R2         =     0.2162

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1218917   .0555982     2.19   0.028     .0129212    .2308622
       price |   .0001563   .0000843     1.85   0.064    -8.94e-06    .0003216
    headroom |  -.3379361   .2451438    -1.38   0.168    -.8184091     .142537
       _cons |  -3.232319   2.100886    -1.54   0.124     -7.34998    .8853416
------------------------------------------------------------------------------

Probit model for foreign

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |         8             6  |         14
     -     |        14            46  |         60
-----------+--------------------------+-----------
   Total   |        22            52  |         74

Classified + if predicted Pr(D) >= .5
True D defined as foreign != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   36.36%
Specificity                     Pr( -|~D)   88.46%
Positive predictive value       Pr( D| +)   57.14%
Negative predictive value       Pr(~D| -)   76.67%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   11.54%
False - rate for true D         Pr( -| D)   63.64%
False + rate for classified +   Pr(~D| +)   42.86%
False - rate for classified -   Pr( D| -)   23.33%
--------------------------------------------------
Correctly classified                        72.97%
--------------------------------------------------

Output Contains Commands and Results

For me, one of the major annoyances with using Stata in Org-mode is that Stata output includes both commands and results. If you want exported output with syntax-highlighted Stata commands, you will necessarily have duplicate commands in your html or pdf document. This example illustrates the problem:

webuse auto 
reg price mpg
webuse auto
(1978 Automobile Data)
reg price mpg

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =     20.26
       Model |   139449474         1   139449474   Prob > F        =    0.0000
    Residual |   495615923        72  6883554.48   R-squared       =    0.2196
-------------+----------------------------------   Adj R-squared   =    0.2087
       Total |   635065396        73  8699525.97   Root MSE        =    2623.7

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -238.8943   53.07669    -4.50   0.000    -344.7008   -133.0879
       _cons |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
------------------------------------------------------------------------------

Notice that in the exported html (what you are viewing), you see duplicate versions of the commands that produced the output. The first has font highlighting (and is what we really want), while the second is interspersed in the plain-text results. I should note that no other language I have used in Org-mode behaves like this. For example, in R or Python, the commands stay in the source code block (and are highlighted) while the results contain only results.

To make Stata behave more like R or Python, I have modified ob-stata.el to purge the results of any and all commands (for :results output and Stata invoked by :session). For the same Stata code, this modification produces:

webuse auto 
reg price mpg
(1978 Automobile Data)

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =     20.26
       Model |   139449474         1   139449474   Prob > F        =    0.0000
    Residual |   495615923        72  6883554.48   R-squared       =    0.2196
-------------+----------------------------------   Adj R-squared   =    0.2087
       Total |   635065396        73  8699525.97   Root MSE        =    2623.7

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -238.8943   53.07669    -4.50   0.000    -344.7008   -133.0879
       _cons |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
------------------------------------------------------------------------------

Note that even with my modified code, some form of your command may still appear in the output when you use line continuation or when a command sits on a line longer than 77 characters. This occurs infrequently enough that I haven't bothered to patch ob-stata.el further.
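
To make the idea of purging commands concrete, here is a minimal sketch in Emacs Lisp. It is an illustration only, not the code in ob-stata.el: the function name my/stata-strip-echoed-commands is hypothetical, and it assumes each echoed command appears on a line by itself, exactly as typed in the source block.

```elisp
;; Hypothetical helper for illustration; NOT the implementation in ob-stata.el.
(require 'subr-x) ; provides `string-trim' on older Emacs versions

(defun my/stata-strip-echoed-commands (body output)
  "Return OUTPUT without lines that exactly repeat a command line of BODY.
BODY is the text of the source block, OUTPUT the raw text from Stata."
  (let ((commands (delete "" (mapcar #'string-trim (split-string body "\n"))))
        (kept '()))
    (dolist (line (split-string output "\n"))
      ;; Keep the line unless, ignoring surrounding whitespace,
      ;; it is identical to one of the commands.
      (unless (member (string-trim line) commands)
        (push line kept)))
    (mapconcat #'identity (nreverse kept) "\n")))

;; (my/stata-strip-echoed-commands
;;  "webuse auto \nreg price mpg"
;;  "webuse auto\n(1978 Automobile Data)\nreg price mpg")
;; => "(1978 Automobile Data)"
```

An exact-match filter like this also suggests why leftovers can occur: once Stata wraps a long command or echoes a continued one, the echoed text no longer equals any single line of the source block.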

No line continuation support

Stata allows long commands to be split across lines using ///. This isn't supported in the current version of ob-stata.el, but my modified version handles line continuation:

reg price mpg ///
	  weight
      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     14.74
       Model |   186321280         2  93160639.9   Prob > F        =    0.0000
    Residual |   448744116        71  6320339.67   R-squared       =    0.2934
-------------+----------------------------------   Adj R-squared   =    0.2735
       Total |   635065396        73  8699525.97   Root MSE        =      2514

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -49.51222   86.15604    -0.57   0.567    -221.3025     122.278
      weight |   1.746559   .6413538     2.72   0.008      .467736    3.025382
       _cons |   1946.069    3597.05     0.54   0.590    -5226.245    9118.382
------------------------------------------------------------------------------
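
One way to think about continuation support is as a preprocessing step that folds continued lines together before the code is sent to Stata. The sketch below shows that idea; it is an assumption about how such a step could look, not the code in ob-stata.el, and the name my/stata-join-continuations is hypothetical. It relies on Stata's rule that /// must be preceded by whitespace and that anything after it on the same line is ignored.

```elisp
;; Hypothetical helper for illustration; NOT the implementation in ob-stata.el.
(defun my/stata-join-continuations (body)
  "Return BODY with every `///' line continuation folded into a single line."
  ;; Match: whitespace, `///', the rest of that line, the newline,
  ;; and any indentation on the following line; replace with one space.
  (replace-regexp-in-string "[ \t]+///.*\n[ \t]*" " " body))

;; (my/stata-join-continuations "reg price mpg ///\n\t  weight")
;; => "reg price mpg weight"
```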

Smaller annoyances

Stata has some more minor limitations in Org-mode that I have learned to live with and haven't bothered to fix, since my research isn't too Stata-centric. I'll list them here:

  • Font highlighting in exported documents does work, but it is somewhat hit and miss for html output while pretty good for latex output. I am not altogether clear why there is a difference, but I suspect that html export uses a Stata syntax dictionary from the Emacs Speaks Statistics package (and it isn't possible to modify highlighting settings for Stata since "Font Lock" is disabled). For latex/pdf output, pygmentize is used and it works very well. For this blog, html output is produced using pygmentize'd output, so what you are seeing isn't representative of what you will get from a straight html export in Org-mode.
  • Highlighting in Emacs Org-mode while editing the document requires you to place a space before the first command in src blocks.
  • There can be limits on other types of output you can get out of your codeblocks, such as values, tables, latex code, etc. For this reason, I only use :results output.
  • Graphics can only be included if you run code, save the result and then manually include it in your org file.

Setup

To get Stata execution blocks working in Org-mode, you need to:

  1. Ensure the command stata is executable and is in your path (note, not xstata). Emacs will execute commands using stata. If you want to use another version of Stata, you will need to soft-link it to a stata command in the path. For example, I would rather use stata-mp, so I soft-linked it to /usr/local/sbin/stata so that Emacs uses the multiprocessor version of Stata.
  2. In Emacs, install ESS: Emacs Speaks Statistics
  3. If you want to try my version, download ob-stata.el from this GitLab repo and save it to ~/.emacs.d/lisp. If you prefer the original version of ob-stata.el, you can find it at this mirror of the Emacs repository.
  4. For your version of Emacs and Org-Mode, you might need to change this line in ob-stata.el (see this thread):
    (let ((vars (mapcar #'cdr (org-babel--get-vars params))))
    

    to/from (depending on what you downloaded above)

    (let ((vars (mapcar #'cdr (org-babel-get-header params :var))))
    
  5. Modify one of your Emacs initialization files to include
    ;; load Emacs Speaks Statistics - for Stata support
    (require 'ess-site)
    ;; Tell emacs location of the directory containing 
    ;; personal elisp (and ob-stata.el)
    (add-to-list 'load-path "~/.emacs.d/lisp/")
    ;; load ob-stata
    (require 'ob-stata)
    
  6. Following the commands above, include Stata as a babel language in your Emacs initialization files. Mine looks like this:
    (org-babel-do-load-languages
     'org-babel-load-languages
     '((python . t)
       (ipython . t)
       (R . t)
       (sh . t)
       (matlab . t)
       (stata . t)
     ))
    
  7. Include Stata as a language to be fontified for latex exports by including the following in your Emacs initialization files:
    (add-to-list 'org-latex-minted-langs '(stata "stata")) 
    

    Make sure to include \usepackage{minted} in the header of your latex export template and that your version of pygmentize is 2.2 or higher.
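
For reference, here is a minimal sketch of init-file settings that commonly accompany steps 1 and 7. It rests on several assumptions: that pdflatex is your LaTeX engine, that your Org version still uses the variable org-latex-listings (newer Org releases call it org-latex-src-block-backend), and that you are happy to load minted through org-latex-packages-alist instead of adding \usepackage{minted} to a template by hand. Adjust or drop any piece that does not match your setup.

```elisp
;; Sketch only; variable names depend on your Org version (see note above).

;; Step 1 sanity check: warn if Emacs cannot find a `stata' executable.
(unless (executable-find "stata")
  (message "stata not found on exec-path; check the soft link from step 1"))

(with-eval-after-load 'ox-latex
  ;; Typeset exported src blocks with minted instead of verbatim.
  (setq org-latex-listings 'minted)
  ;; Alternative to writing \usepackage{minted} in a template by hand.
  (add-to-list 'org-latex-packages-alist '("" "minted"))
  ;; minted calls pygmentize, so LaTeX must be run with -shell-escape.
  (setq org-latex-pdf-process
        '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
          "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f")))
```

Running pdflatex twice is a common convention for resolving references; a single pass may be enough for short documents.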