Reproducible Research and Literate Programming for Econometrics

Jump straight to the discussion on Stata and Emacs Org Mode

This post describes my experience implementing reproducible research and literate programming methods for commonly used econometric software packages. Since literate programming aims to store the accumulated scientific knowledge of the research project in one document, the software package must allow for the reproduction of data cleaning and data analysis steps, store the record of methods used, generate results dynamically and use these for the writeup, and be executable by including the computational environment.

Perhaps most importantly, this dynamic document can be executed to produce the academic paper. The researcher shares this file with other researchers rather than only a pdf of the paper, making the research fully reproducible by executing the dynamic document. It is my view that most scientific journals will expect this over the next few decades.

Besides the huge benefits of full reproducibility and quick sharing of research that literate programming brings, I have found other advantages:

  1. Minimal time is spent getting back up to speed on projects that have been sitting on the back burner, since the document contains the full evolution of a project: the initial notes, ideas, code, results, tables, and figures, up to the submitted manuscript
  2. The document is your computing environment, so there is no need to create figures and tables elsewhere and then insert them later
  3. You can easily convert your document into multiple outputs (blog posts, pdf documents, tables, raw code files, etc.)

Of course there are some downsides such as

  1. You will need to adopt a new workflow and there are substantial fixed costs
  2. Not all collaborators will be on board, so your workflow has to be adaptive

My Requirements

In what follows, I discuss some of the packages I considered for literate programming, and describe the journey to my current workflow. I work primarily in Matlab and Python, so at the outset, I was looking for a unified workflow where I could do the things listed below without jumping between software packages:

  • Could handle \(LaTeX\) math, lists, inserted graphics, sectioning, and text for writing content
  • A single system for coding in Python and Matlab in one document
  • Could publish results to pdf or html

As I began to assess tools, I added even more requirements:

  • Also needed to handle Stata and R code
  • A more than competent text editor with code highlighting and code completion for all the computer languages I use
  • Could run code remotely and asynchronously
  • Could handle references
  • Could publish to pdf in highly customizable ways for getting as near to a "submission ready" state as possible
  • A workable way to collaborate with "non-believers" (those that don't want to bother with literate programming)

Once I settled on a solution, I also discovered even more things that I now consider to be requirements:

  • Easy to use table creation
  • Todo lists, deadlines, and scheduling for projects
  • Versioning

The Contenders

Over the past couple of years, I have considered the following software packages for fulfilling all the requirements listed above:

  • Stata with the Markdoc package. For many econometricians, their go-to package will be Stata. Markdoc is the glue making literate programming possible in Stata. I can't find a demo video showing the use of Stata with Markdoc, but it works in a similar manner to RMarkdown below.
  • R with RMarkdown. This is probably the inspiration for Stata's Markdoc package. Even if you don't use R as a research tool, RMarkdown can be used for producing literate programming documents for other languages, including some support for Stata. This YouTube demo video will give you a flavor of how it works. I recently came across this example that shows how to implement Stata code in RMarkdown. RMarkdown uses Knitr in the background to bring the document together, and one might consider Knitr to be a "Contender" on its own.
  • Jupyter Notebook. Jupyter provides an interactive notebook interface for a number of languages, including all of the ones I need (Stata among them) and more. The strongest feature of the Jupyter Notebook is interactive computing. I haven't found an ideal short video for showcasing Jupyter, but this one, starting halfway through, will provide some perspective (though all examples use Python). This series of short videos shows a lot of the features of Jupyter Notebook.
  • Emacs Org Mode. Using the venerable text editor Emacs, Org Mode adds literate programming capabilities, and like Jupyter Notebook, it supports basically every language (including Stata). This is a great video showcasing Org Mode (around minute 15, the document is used to produce a pdf manuscript). Also, this web page outlines how many of the features discussed in the next section look in Emacs Org Mode.

Features

Below I break down a discussion of features as I found them when I was considering each package. Since all of the software I consider is under rapid development, my discussion may be dated or incomplete. I divide my discussion into several groups. The first group, described in Table 1, covers features for writing \(LaTeX\) (or in some cases markdown) code for adding text narratives to your document. The features I am interested in are \(LaTeX\) math, previewing equations for debugging, sectioning and document organization, referencing and footnotes, etc. One of the most important features here is the quality of the pdf output. It is a subjective consideration, but for my purposes I am referring to how close the output is to a submission-ready manuscript (or blog post).

Table 1: Features for writing text and equations
Feature                      Stata Markdoc       R Markdown  Jupyter          Emacs Org Mode
--------------------------------------------------------------------------------------------
Primary Markup Language      Markdown/LaTeX [1]  Markdown    Markdown         Org Mode/LaTeX
Raw \(LaTeX\)                Yes                 Not Sure    Yes              Yes
\(LaTeX\) Math               Yes                 Yes         Yes              Yes
\(LaTeX\) Math Previews      No                  No          Yes              Yes
Sectioning                   Yes                 Yes         Yes              Yes
Lists                        Yes                 Yes         Yes              Yes
Graphics                     Yes                 Yes         Yes              Yes
Tables                       Yes                 Yes         Yes              Yes [2]
Footnotes                    Not Sure            Not Sure    No [3]           Yes
\(LaTeX\) referencing        Not Sure            Not Sure    No               Yes
\(LaTeX\) Code Highlighting  No                  No          Partial          Yes
\(LaTeX\) Code Completion    No                  No          No               Yes
Quality of Exports           Variable [4]        High [5]    Variable [6, 7]  High [6]
Blogging                     No                  Not Sure    Yes              Yes
--------------------------------------------------------------------------------------------

The next group describes how well each package handles programming languages. To read Table 2, consider the column labeled "Stata Markdoc". The literate programming capabilities of Stata Markdoc only allow for the execution of Stata code. By contrast, Jupyter and Emacs Org Mode run basically everything I am interested in.

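Table 2: Language Support
Runs Code From   Stata Markdoc   R Markdown   Jupyter       Emacs Org Mode
--------------------------------------------------------------------------
Stata            Yes             Yes [8, 9]   Yes [10, 9]   Yes [9]
R                No              Yes          Yes           Yes
Python           No              Yes [11]     Yes           Yes
Matlab/Octave    No              Yes [11]     Yes           Yes
--------------------------------------------------------------------------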

My final set of requirements, found in Table 3, deals with the question of extensibility: whether additional features can be added. Jupyter and Org Mode are by far the most extensible packages of the four considered here, with Stata Markdoc being the least extensible. For example, while one can use a version control system like Git with Stata Markdoc files, support for it isn't built into Stata itself.

Table 3: Extensibility
Feature                      Stata Markdoc   R Markdown   Jupyter    Emacs Org Mode
-----------------------------------------------------------------------------------
Versioning (Git)             No              Not Sure     No [12]    Yes
Exports Raw Code             Yes             Yes          Yes        Yes
Project Management           No              No           No         Superb
Dynamic Document             No              Yes          Superb     Yes
Reference Management         No              Not Sure     Not Sure   Yes [13]
Remote Execution from GUI    No              Not Sure     Yes        Yes
Asynchronous Execution [14]  No              Not Sure     Yes        Yes [15]
-----------------------------------------------------------------------------------

Since this feature comparison is based on my recollection of how each package works, and I use some a lot more than others, the tables above may have errors or lack nuance. If you see anything that needs correcting, please contact me.
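To make the "Exports Raw Code" row concrete for Org Mode: Org's own command for this is org-babel-tangle. The Python sketch below is not that command, only a rough illustration of the idea, pulling the bodies of source blocks for one language out of an Org file's text. The function name and the regular expression are my own assumptions, and it ignores header arguments such as :tangle.

```python
import re

# Matches "#+BEGIN_SRC <lang> ..." through "#+END_SRC".
# Org keywords are case-insensitive, hence re.IGNORECASE.
SRC_BLOCK = re.compile(
    r"^[ \t]*#\+BEGIN_SRC[ \t]+(?P<lang>\S+)[^\n]*\n"
    r"(?P<body>.*?)"
    r"^[ \t]*#\+END_SRC[ \t]*$",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)


def extract_src_blocks(org_text, lang):
    """Return the bodies of all source blocks written in `lang`, in order."""
    return [
        match.group("body")
        for match in SRC_BLOCK.finditer(org_text)
        if match.group("lang").lower() == lang.lower()
    ]
```

Joining the returned strings and writing them to a .do file would give a collaborator plain Stata code without the surrounding narrative.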

What I use

For the past 2 years I have mostly been using Jupyter Notebook and, lately, Emacs Org Mode. These are both superb packages. Between the two, Jupyter Notebook provides the easiest and most intuitive way of testing the reproducible research waters. It is a great tool and, in my opinion, the best at interacting with your data (great for classroom demos). It was designed from the ground up for sharing documents with colleagues. If your co-author doesn't have a Jupyter instance installed, you can always export the code from your notebook for sharing.
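As a rough illustration of that last point, here is a Python sketch that writes the code cells of a notebook to a plain script. Jupyter ships its own exporter (jupyter nbconvert --to script), which is the better route in practice; this version only shows what the export amounts to. It assumes the nbformat 4 layout, where the file is JSON with a top-level "cells" list, and the function name is hypothetical.

```python
import json
from pathlib import Path


def export_code_cells(notebook_path, script_path):
    """Write the source of every code cell in a notebook to a plain script.

    Assumes nbformat 4: a JSON file with a top-level "cells" list whose
    entries carry "cell_type" and "source". Returns the number of cells written.
    """
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The source is stored either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    # Separate cells with a blank line so the script stays readable.
    Path(script_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return len(chunks)
```

Outputs and embedded images are left behind, so the resulting file is also much friendlier to version control than the notebook itself.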

Right now, Emacs Org Mode is my preferred package for 3 reasons:

  1. The quality of pdf output is, for me, the best, and the user has lots of control over the look and feel of how a literate programming document exports into outputs like pdfs or blogs.
  2. Notes and to-do lists can be incorporated into a project's literate programming document in a way that helps me stay organized. Go to YouTube and search for Org Mode. You will see lots of examples of how people use it for organizing their lives.
  3. The quality of the editing environment in Emacs is second to none, with code highlighting, completion, and impressive extensibility.

Emacs/Org Mode has a very high learning cost compared to the other packages listed here since it uses a keyboard-centric paradigm for control. If you opt for the Emacs/Org Mode route, prepare for at least a week or two of frustration.

Literate Programming and Stata

There are no built-in literate programming capabilities in Stata itself, and even add-ons like Markdoc have to deal with the peculiarities of Stata (compared to R or Python, for example) [16]. I hope Stata will remedy this in upcoming versions.

If you do most of your econometrics work in Stata in a Mac or Linux OS environment and want to start writing literate programming documents for your research, in my view you have 2 choices:

  1. Stata Markdoc
  2. Emacs Org Mode

Unless your OS is Windows, don't bother with Jupyter Notebook; it is too limiting. If you are on Windows, Jupyter with the Ipystata extension is a workable choice and should be added to the list above. Also, I don't consider R with RMarkdown to be a strong contender if your work is Stata-centric.
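The limitation comes from batch mode (see footnotes 8 and 10): each code cell is handed to a fresh Stata process. The Python sketch below only illustrates that mechanism and is not how Ipystata or R Markdown are implemented. The helper names are hypothetical, and it assumes a Unix-style stata executable on the PATH that accepts "-b do <file>" and writes "<file>.log" to the working directory.

```python
import subprocess
import tempfile
from pathlib import Path


def batch_command(do_file, stata="stata"):
    """Build the command line for one batch-mode run of a do-file."""
    return [stata, "-b", "do", str(do_file)]


def run_cell_in_batch(code, stata="stata"):
    """Run one cell's code in a fresh Stata process and return its log text.

    Every call starts from an empty Stata session, so data loaded or
    estimates stored by an earlier cell are not available here.
    """
    with tempfile.TemporaryDirectory() as workdir:
        do_file = Path(workdir) / "cell.do"
        do_file.write_text(code + "\n", encoding="utf-8")
        subprocess.run(batch_command(do_file.name, stata), cwd=workdir, check=True)
        # Assumption: batch mode leaves its output in cell.log next to cell.do.
        log_file = Path(workdir) / "cell.log"
        return log_file.read_text(encoding="utf-8", errors="replace")
```

Calling run_cell_in_batch("webuse auto") and then run_cell_in_batch("reg price mpg") would fail on the second call, because the dataset loaded by the first process is gone. A session-based setup, such as the :session header argument used in the Org Mode example later in this post, avoids this by keeping one Stata process alive.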

Stata Markdoc

My students have used Stata Markdoc with great success for problem sets. Setup is easy and literate programming is only minutes away. My students were easily able to write literate programming documents that included code and written responses using Markdoc. They would email the document to me (as a do file) and I would execute it in Stata to produce a pdf of their problem set response. Stata Markdoc is a great tool, and if you choose to use it I highly recommend using the \(LaTeX\) input method rather than the markdown method for better looking documents. It does have limitations (outlined in the feature tables above) despite the awesome work of its maintainer.

Org Mode and Stata

Emacs Org Mode works really well with Stata, and provides a great literate programming experience (once you have overcome the high costs of learning Emacs). The only things not working are (1) code highlighting of Stata code in output documents (like this one) and (2) some subtle differences in output when you export Stata results compared to languages like Python [17]. When editing Stata code in Org Mode, code highlighting works fine. Stata is nearly full featured in Org Mode and is a pleasure to use [18]. Below, I include a quick setup guide, since what I found on the web was very cryptic:

  1. Ensure the command stata is executable and is in your path (note, not xstata). I soft-linked stata-mp in the stata installation directory to /usr/local/sbin/stata, so emacs will use the mp version of stata.
  2. In Emacs install the package ess: Emacs Speaks Statistics
  3. Download ob-stata.el from here, and save to ~/.emacs.d/lisp
  4. Modify .emacs to include
    ;; Load Emacs Speaks Statistics
    (require 'ess-site)
    ;; Tell Emacs the location of your personal elisp lib dir
    (add-to-list 'load-path "~/.emacs.d/lisp/")

    ;; Load ob-stata (ob-ipython is only needed for the ipython entry below)
    (load "ob-stata")
    (load "ob-ipython")

    ;; Add stata to babel languages (you need an entry for each
    ;; language you want to submit code for)
    (org-babel-do-load-languages
     'org-babel-load-languages
     '((python . t)
       (ipython . t)
       (R . t)
       (sh . t)      ; named "shell" in Org 9 and later
       (matlab . t)
       (stata . t)))
    
  5. If you get error messages when executing stata src blocks, see this thread on stackoverflow for some debugging tips.
  6. Beware that future updates to either Emacs, Org Mode, or Emacs Speaks Statistics might break stata functionality inside Emacs Org Mode since ob-stata.el isn't a formal part of the Emacs ecosystem. However, it is likely someone would get it working again without too much trouble.
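Before debugging inside Emacs, it can help to confirm the first and third steps from a terminal. The Python sketch below is a hypothetical helper, not part of any of the packages above. It checks that a command named stata is on the PATH and that ob-stata.el sits in the directory added to load-path.

```python
import shutil
from pathlib import Path


def check_stata_setup(lisp_dir="~/.emacs.d/lisp"):
    """Return a list of human-readable problems; empty if both checks pass."""
    problems = []
    # Step 1: the command must be called "stata" (not xstata) and be on PATH.
    if shutil.which("stata") is None:
        problems.append("no executable named 'stata' found on PATH")
    # Step 3: ob-stata.el must live in the directory added to load-path.
    ob_stata = Path(lisp_dir).expanduser() / "ob-stata.el"
    if not ob_stata.is_file():
        problems.append(f"{ob_stata} does not exist")
    return problems
```

Running print(check_stata_setup()) should print an empty list on a working setup. One caveat: a graphical Emacs does not always inherit the shell's PATH, so a clean result here does not guarantee that Emacs itself can find stata.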

An example of a simple literate programming document

We will run the following two regressions:

\begin{align}
price_i &= \beta_0 + \beta_1 mpg_i + \epsilon_i \\
price_i &= \beta_0 + \beta_1 mpg_i + \beta_2 weight_i + \epsilon_i
\end{align}

This codeblock within our document will be run when we export to pdf or html:

#+BEGIN_SRC stata :session :results output :exports results
webuse auto
eststo reg1: qui reg price mpg 
eststo reg2: qui reg price mpg weight
esttab reg1 reg2, nostar
#+END_SRC

And gives us these results [19]:

set more off
clear
webuse auto
(1978 Automobile Data)
eststo reg1: qui reg price mpg
eststo reg2: qui reg price mpg weight
esttab reg1 reg2, nostar

--------------------------------------
                      (1)          (2)
                    price        price
--------------------------------------
mpg                -238.9       -49.51
                  (-4.50)      (-0.57)

weight                           1.747
                                (2.72)

_cons             11253.1       1946.1
                   (9.61)       (0.54)
--------------------------------------
N                      74           74
--------------------------------------
t statistics in parentheses
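As footnote 17 notes, the output above repeats each command alongside its result. If the echoes are unwanted, one option is to post-process the text. The Python sketch below is a hypothetical helper, not a feature of ob-stata or Org Mode, and it assumes each echoed command appears on a line of its own exactly as it was typed.

```python
def strip_echoed_commands(output, commands):
    """Drop lines of `output` that merely repeat one of `commands`."""
    echoed = {command.strip() for command in commands}
    kept = [line for line in output.splitlines() if line.strip() not in echoed]
    # Trim blank lines left at the start and end after removing the echoes.
    while kept and not kept[0].strip():
        kept.pop(0)
    while kept and not kept[-1].strip():
        kept.pop()
    return "\n".join(kept)
```

For the block above, the commands would be the four lines of the source block plus the "set more off" and "clear" lines that appear at the top of the output.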

See a literate document in action

The literate programming document that created this entire blog post can be downloaded by clicking the Source link at the top of this page.

Footnotes:

[1] It isn't possible to mix markdown and \(LaTeX\) in the same document (except for math environments).

[2] Emacs Org Mode table editing is an amazing feature in its own right and is significantly better than other tabling methods considered here.

[3] Jupyter's markdown implementation rules out footnotes, though it may be possible to use pure \(LaTeX\) cells for linking sections in a document.

[4] Stata Markdoc allows for either \(LaTeX\) or Markdown for writing content. The \(LaTeX\) documents look a lot better in my opinion.

[5] R Markdown combined with the package stargazer can produce article-quality pdf documents.

[6] Pandas and other packages allow for \(LaTeX\) output of tables, leading to high quality tables.

[7] Jupyter Notebook is written as an html-centric tool and sometimes the pdf output isn't as good as R Markdown or Emacs Org Mode.

[8] Without some tinkering, Stata runs in batch mode in R Markdown, which does successfully run code cells, but each cell is run independently (so you can't access work or loaded data from other Stata code cells). See this example for how to link work across code cells, although I don't think it is a very workable solution. Also, there is no syntax highlighting for Stata inside RStudio and the RMarkdown file.

[9] You need the Stata Console and/or Batch Mode capabilities, which may or may not be included with your version of Stata.

[10] Using Ipystata it is possible to run Stata commands in Jupyter Notebooks. However, OSX and Linux users must use the much more restrictive "Batch Mode", which, as with R Markdown running Stata, means that Stata code cells must run independently and can't access work from other code cells.

[11] I haven't personally tried to use these languages in RMarkdown.

[12] The Jupyter Notebook is super easy to share with colleagues since it contains images embedded in it. This makes versioning less efficient.

[13] Org Ref is a great reference tool for Org Mode.

[14] Here I mean that the package isn't frozen during execution of computationally expensive code blocks, so you can proceed with writing.

[15] Only Python has asynchronous support in Org Mode.

[16] A good writeup of the non-standard ways Stata treats output is found here and illustrates the issues the research community has trying to adapt Stata into a literate programming tool.

[17] When you run Stata code as an SRC block in Emacs Org Mode, both the command and the results are reported as output.

[18] Note, since you aren't using the Stata Graphical User Interface, you won't have access to the display window (instead, use list) or the variable window (instead, use describe).

[19] If exporting to pdf, it is possible to export a \(LaTeX\) table using esttab and have it embedded in our document. The stars in esttab output, unfortunately, cause problems when exporting to either pdf or html.