This post describes my experience implementing reproducible research and literate programming methods for commonly used econometric software packages. Since literate programming aims to store the accumulated scientific knowledge of the research project in one document, the software package must allow for the reproduction of data cleaning and data analysis steps, store a record of the methods used, generate results dynamically and use them in the writeup, and remain executable by including the computational environment.

Perhaps most importantly, this dynamic document can be executed to produce the academic paper. The researcher shares this file with other researchers rather than only a PDF of the paper, so anyone can reproduce the research by executing the dynamic document. It is my view that most scientific journals will expect this over the next few decades.

Part II: Comparing the Speed of Matlab versus Python/Numpy

2015-04-09 07:06

In this note, I extend a previous post on comparing run-time speeds of various econometrics packages by

- Adding Stata to the original comparison of Matlab and Python
- Calculating runtime speeds by
  - Comparing full OLS estimation functions for each package
    - Stata: reg
    - Matlab: fitlm
    - Python: regression.linear_model.OLS from the statsmodels module.
  - Comparing the runtimes for calculations using linear algebra code for the OLS model: $ (x'x)^{-1}x'y $

Since Stata and Matlab automatically parallelize some calculations, we parallelize the Python code using the Parallel module.
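
As a rough illustration of the linear algebra comparison, here is a minimal sketch that times the closed-form estimator $ (x'x)^{-1}x'y $ in NumPy against NumPy's general least-squares routine. It is not the benchmark code from the post: the sample size, the coefficients, and the helper names `ols_normal_equations` and `time_call` are assumptions made for the example, and the timings will differ from machine to machine.

```python
import time

import numpy as np


def ols_normal_equations(X, y):
    """Solve (X'X) b = X'y, which gives (X'X)^{-1} X'y without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def ols_lstsq(X, y):
    """Same estimate via NumPy's general least-squares solver."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def time_call(func, *args, repeats=5):
    """Return (best wall-clock seconds over `repeats` runs, result of the last run)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


# Hypothetical toy data: a constant plus two regressors, arbitrary coefficients.
rng = np.random.default_rng(0)
n = 10_000
true_b = np.array([1.0, 2.0, -3.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ true_b + rng.normal(size=n)

t_solve, b_solve = time_call(ols_normal_equations, X, y)
t_lstsq, b_lstsq = time_call(ols_lstsq, X, y)

print(f"normal equations: {t_solve:.6f}s, estimates {b_solve}")
print(f"lstsq:            {t_lstsq:.6f}s, estimates {b_lstsq}")
```

Taking the best of several runs is one simple way to reduce timing noise. Both functions should return the same coefficients up to floating-point error.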

Comparing the Speed of Matlab versus Python/Numpy

2015-03-19 08:07

Update 1: A more complete and updated speed comparison can be found here.

Update 2: Python and Matlab code edited on 4/5/2015.

In this short note, we compare the speed of Matlab and the scientific computing platform of Python for a simple bootstrap of an ordinary least squares model. Bottom line (with caveats): Matlab is faster than Python with this code. One might be able to further optimize the Python code below, but it isn't an obvious or easy process (see, for example, advanced optimization techniques).

As an aside, this note demonstrates that even if one can't optimize Python code enough, it is possible to do computationally expensive calculations in Matlab and return the results to the Ipython notebook.

Data Setup

We will bootstrap the ordinary least squares model (ols) using 1000 replicates. For generating the toy dataset, the true parameter values are $$ \beta=\begin{bmatrix} 10\\-.5\\.5 \end{bmatrix} $$

We perform the experiment for 3 different sample sizes ($n = \begin{bmatrix}1,000 & 10,000 & 100,000 \end{bmatrix}$). For each of the observations in the toy dataset, the independent variables are drawn from

$$ \mu_x = \begin{bmatrix} 10\\10 \end{bmatrix}, \sigma_x = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} $$

The dependent variable is constructed by first drawing a vector of random normal variates from Normal(0,1). Denoting this vector as $\epsilon$, we calculate the dependent variable as $$ \mathbf{Y} = \mathbf{X}\beta + \epsilon $$
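
The setup above can be sketched in NumPy as follows. This is a minimal sketch and not the code timed in the post. It assumes that the regressors are drawn from a multivariate normal distribution, that $\sigma_x$ is a covariance matrix, and that a constant is added to match the three entries of $\beta$. The function names `make_data`, `ols`, and `bootstrap_ols` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# True parameters from the text: constant, then one slope per regressor.
beta = np.array([10.0, -0.5, 0.5])
mu_x = np.array([10.0, 10.0])
sigma_x = np.array([[4.0, 0.0], [0.0, 4.0]])  # assumption: treated as a covariance matrix


def make_data(n):
    """Generate the toy dataset: X with a constant column, and Y = X beta + eps."""
    x = rng.multivariate_normal(mu_x, sigma_x, size=n)  # assumption: multivariate normal
    X = np.column_stack([np.ones(n), x])
    eps = rng.normal(0.0, 1.0, size=n)
    return X, X @ beta + eps


def ols(X, y):
    """OLS estimate from the normal equations (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def bootstrap_ols(X, y, replicates=1000):
    """Resample rows with replacement and re-estimate OLS for each replicate."""
    n = X.shape[0]
    draws = np.empty((replicates, X.shape[1]))
    for r in range(replicates):
        idx = rng.integers(0, n, size=n)
        draws[r] = ols(X[idx], y[idx])
    return draws


X, y = make_data(1000)  # the smallest of the three sample sizes
draws = bootstrap_ols(X, y)

print("bootstrap means:     ", draws.mean(axis=0))
print("bootstrap std errors:", draws.std(axis=0, ddof=1))
```

The bootstrap means should land near the true $\beta$, and the standard deviation of the replicates serves as the bootstrap standard error.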

Tapping MariaDB / MySQL data from Ipython

2015-03-06 06:39

In this short post, I will outline how one can access data stored in a database like MariaDB or MySQL for analysis inside an Ipython Notebook. There are many reasons why you might want to store your data in a proper database. For me the most important are:

- All of my data resides in a password-protected place that is more secure than a multitude of csv, mat, and dta files scattered all over my file system.
- If you access the same data for multiple projects, any changes to the underlying data will be propagated to your analysis, without having to update copies of project data.
- Having data in a central repository makes backup and recovery significantly easier.
- A database allows for two-way interaction. You can read and write tables from/to your database. Rather than use SQL, you can create database tables using pandas/ipython.
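
The two-way interaction in the last point can be sketched with pandas. To keep the example self-contained, it uses an in-memory SQLite database as a stand-in for MariaDB / MySQL. The table name `wages` and its columns are made up for illustration. For a real MariaDB or MySQL server, one would typically pass pandas a SQLAlchemy engine created from a URL such as `mysql+pymysql://user:password@host/dbname`, which assumes that SQLAlchemy and the PyMySQL driver are installed.

```python
import sqlite3

import pandas as pd

# Stand-in connection: in-memory SQLite instead of a MariaDB / MySQL server.
conn = sqlite3.connect(":memory:")

# Write: create a database table directly from a DataFrame, without writing SQL.
df = pd.DataFrame({"id": [1, 2, 3], "wage": [12.5, 18.0, 9.75]})  # hypothetical data
df.to_sql("wages", conn, index=False, if_exists="replace")

# Read: pull a query result back into a DataFrame for analysis.
high = pd.read_sql_query(
    "SELECT id, wage FROM wages WHERE wage > 10 ORDER BY id", conn
)
print(high)

conn.close()
```

Because the data lives in the database rather than in project-level copies, every notebook that runs the same query sees the current version of the table.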

Comparing Stata and Ipython Commands for OLS Models

2015-03-02 07:15

In this note, I'll explore the Python statsmodels package for estimating linear regression models (OLS) from Ipython. The goal is to completely map Stata commands for reg into something implementable in Ipython.
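
As a small taste of that mapping, here is a minimal sketch using the statsmodels formula interface on simulated data. The variable names `y`, `x1`, and `x2` and the data are hypothetical, and the pairing of Stata's `robust` option with statsmodels' `HC1` covariance type is my assumption about the closest equivalent, not something taken from the post.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for a Stata dataset with variables y, x1, x2.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=n)

# Stata: reg y x1 x2
fit = smf.ols("y ~ x1 + x2", data=df).fit()

# Stata: reg y x1 x2, robust   (assumed to correspond to cov_type="HC1")
fit_robust = smf.ols("y ~ x1 + x2", data=df).fit(cov_type="HC1")

print(fit.params)      # coefficients, like the Coef. column in Stata output
print(fit.bse)         # conventional standard errors
print(fit_robust.bse)  # heteroskedasticity-robust standard errors
```

Calling `fit.summary()` prints a regression table that plays roughly the same role as Stata's `reg` output.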

Running R and Matlab Commands in an Ipython Notebook

2015-02-26 11:06

The ipython notebook environment is a superb environment for empirical research. Sometimes, though, you would like to access the capabilities of other software. This post shows how to incorporate R and Matlab into ipython notebooks.