% Stata Markdown and Reproducible Research
% Rob Hicks

**Note: course syllabus is here**: [https://rlhick.people.wm.edu/stories/syllabus_econ407.html](https://rlhick.people.wm.edu/stories/syllabus_econ407.html)

## Introduction

From [Wikipedia](https://en.wikipedia.org/wiki/Reproducibility#Reproducible_research), reproducible research is defined as:

>The term reproducible research refers to the idea that the ultimate product of academic research is the paper **along with** the full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based on the research.

The reproducible research movement (especially for the statistical sciences) takes this a step further by advocating for dynamic documents. The idea is that a researcher should provide a file (the dynamic document) that executes the statistical analysis, generates figures, and contains the accompanying text narrative. This file can be executed to produce the **academic paper**. The researcher shares this file with other researchers rather than only the paper. It is my view that within 20 years nearly every scientific journal in applied statistics will require this approach.

This document shows how to use [MarkStat](http://data.princeton.edu/stata/markdown/) and markdown syntax for reproducible research and dynamic documents in Stata. The idea behind MarkStat is that you share your research by sharing your do file. This do file performs the full suite of statistical analysis and can produce PDF (with extra configuration), MS Word, or HTML documents describing your analysis.

You will use this workflow for producing PDF or Word documents for class assignments. For every problem set, you will turn in:

* The Stata `stmd` file (similar to a do file) containing all commands and written text that produce your problem set responses.
* A hardcopy of the PDF or Word version produced by running your do file (the hardcopy).

The only exception to this rule is for questions involving proofs or other equation-heavy assignments, where handwritten responses can be attached to the hardcopy problem set response.

## Installation Instructions

Follow these steps (the `ssc` and `whereis` commands are issued in `Stata`):

1. `ssc install markstat`
2. `ssc install whereis`
3. Install pandoc from `http://pandoc.org/installing`
4. Tell `markstat` where to find pandoc. The command you need to run in Stata is probably one of:
    * Windows: `whereis pandoc "C:\Users\username\AppData\Local\Pandoc\pandoc.exe"`
    * Mac: `whereis pandoc /usr/local/bin/pandoc`
    * Linux/Unix: `whereis pandoc /usr/bin/pandoc`

Windows users should substitute their own username for "username" in the `whereis` command above.

## Some Features of MarkStat

MarkStat allows for most features of [Markdown](https://daringfireball.net/projects/markdown/syntax), which is a lightweight and readable **text-based** language that allows files to be easily converted to nice-looking PDF, HTML, or even Word documents. Some features you will likely want to use:

* Equations and math notation using LaTeX math
* Headers
* Emphasizing text (bold and italics)
* Numbered and bulleted lists
* Turning Stata output on and off
* Page breaks, which can be inserted using `\newpage` on a separate line

\newpage

## A simple example analysis using MarkStat

Below we'll be modeling the following regression equation for cars back in the day:

$$
price_i = \beta_0 + \beta_1 mpg_i + \beta_2 foreign_i + \epsilon_i
$$

### Load Data and Summarize

    // set the working directory (this path is specific to the author's machine)
    cd ~/Dropbox/Current/Teaching/courses/ECON407/do_files/reproducible_research/markstat
    // load the example automobile data
    webuse auto
    reg price mpg
    // summary statistics for all variables
    sum
    // histogram of price, saved so it can be embedded below
    hist price
    graph export price.png, replace

![Histogram of Price](price.png){width=60%}

### Regression Model

Here are the regression results:

    reg price mpg foreign

#### Discussion

Looks like back in the day, foreign cars sold for more, holding fuel efficiency fixed!
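
To spell out the reasoning behind that reading of the results, here is a short sketch of why the coefficient on `foreign` measures the price gap. It assumes the error term has conditional mean zero, $E[\epsilon_i \mid mpg_i, foreign_i] = 0$; that assumption is added here for illustration and is not stated in the text. Compare a foreign and a domestic car with the same fuel efficiency $m$:

$$
\begin{aligned}
E[price_i \mid mpg_i = m, foreign_i = 1] - E[price_i \mid mpg_i = m, foreign_i = 0]
&= (\beta_0 + \beta_1 m + \beta_2) - (\beta_0 + \beta_1 m) \\
&= \beta_2
\end{aligned}
$$

Under that assumption, a positive estimate of $\beta_2$ is what supports the statement that foreign cars cost more at a given level of mpg.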
## MarkStat and Mata

Mata is the matrix algebra environment in Stata. We can combine Mata code with markdown (including equations) too:

Define $\mathbf{A}_{2 \times 2}$ as

$$
\mathbf{A}=\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}
$$

    mata
    A = (1,2\3,4)
    A
    end

## Compiling your document

You will be creating a file with an `stmd` extension that contains your code and writeup. You can create and edit this file in any text editor, including the Stata do file editor. Suppose your problem set document, called `script.stmd`, contained this text:

```
% Problem Set 1
% Johnny Appleseed
% Sept 1, 2018

Let us read the fuel efficiency data that ships with Stata

    sysuse auto, clear

To study how fuel efficiency depends on weight it is useful to
transform the dependent variable from “miles per gallon” to
“gallons per 100 miles”

    gen gphm = 100/mpg

We then obtain a fairly linear relationship

    twoway scatter gphm weight || lfit gphm weight, ///
        ytitle(Gallons per 100 Miles) legend(off)
    graph export auto.png, width(500) replace

![Fuel Efficiency by Weight](auto.png)

The regression equation estimated by OLS is

$$
gphm = \beta_0 + \beta_1 weight + \epsilon
$$

Estimating in Stata yields:

    regress gphm weight

Thus, a car that weighs 1,000 pounds more than another requires
on average an extra 1.4 gallons to travel 100 miles.
```

You can then generate a Word, PDF, or HTML document containing all code and results with these commands in Stata (assuming your current working directory contains `script.stmd`):

* `markstat using script, mathjax`: produces an HTML file
* `markstat using script, mathjax docx`: produces a Word document
* `markstat using script, mathjax pdf`: produces a PDF document (requires a working LaTeX installation)

Problem set responses produced by `markstat` in small fonts will be returned to the student immediately and considered not turned in until the font size is fixed. Shoot for 11pt fonts.

## Document Details

This document is written entirely in Stata using `markstat`.
To see the source code, [see http://rlhick.people.wm.edu/bin/reproducible_research.stmd (clickable)](http://rlhick.people.wm.edu/econ407/bin/reproducible_research.stmd).