Make matplotlib histograms look like R's
I prefer the look of R's histograms. This short post pulls together some resources for mimicking R histograms in Matplotlib.
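As a rough illustration of the idea, here is a minimal sketch of an R-flavored histogram in Matplotlib. It assumes the look being copied is light gray bars with black outlines, a Sturges-style bin count, and no box around the plot. The sample data and the output filename are made up for the example, and R additionally rounds its break points to "pretty" values, which this sketch does not attempt.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample data, only for illustration
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=500)

fig, ax = plt.subplots(figsize=(6, 4))

# Sturges' rule for the number of bins, gray fill, black bar outlines
counts, edges, _ = ax.hist(
    x, bins="sturges", color="lightgray", edgecolor="black", linewidth=0.8
)

# Drop the top and right spines so only the two axes remain
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

ax.set_title("Histogram of x")
ax.set_xlabel("x")
ax.set_ylabel("Frequency")

fig.savefig("r_style_hist.png", dpi=150)  # hypothetical output filename
```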
This post is a followup to two earlier blog posts on reproducible research found here and here. This post focuses on my usage of Stata for classroom assignments turned in by students. These assignments entail
These are different from my own research requirements. For me, emacs org-mode is the best tool for the reasons I outline in the prior posts linked above. For my students, however, learning Emacs and org-mode is totally impractical. This post quickly surveys the three available options: Markdoc, Markstat, and Jupyter Notebook.
Thanks to an excellent series of posts on the python package autograd for automatic differentiation by John Kitchin (e.g. More Auto-differentiation Goodness for Science and Engineering), this post revisits some earlier work on maximum likelihood estimation in Python and investigates the use of auto differentiation. As pointed out in this article, auto-differentiation "can be thought of as performing a non-standard interpretation of a computer program where this interpretation involves augmenting the standard computation with the calculation of various derivatives."
Auto-differentiation is neither symbolic differentiation nor numerical approximation using finite difference methods. What auto-differentiation provides is code augmentation, where code for the derivatives of your functions is provided free of charge. In this post, we will be using the autograd package in python after defining a function in the usual numpy way. In python, another auto-differentiation choice is the Theano package, which is used by PyMC3, a Bayesian probabilistic programming package that I use in my research and teaching. There are probably other implementations in python, as it is becoming a must-have in the machine learning field. Implementations also exist in C/C++, R, Matlab, and probably others.
The three primary reasons for incorporating auto-differentiation capabilities into your research are
With auto-differentiation, gone are the days of deriving analytical derivatives and programming them into your estimation routine. In this short note, we show a simple example of auto-differentiation, expand on that for maximum likelihood estimation, and show that for problems where likelihood calculations are expensive, or for which there are many parameters being estimated, there can be dramatic speed-ups.
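To make the "augmenting the standard computation" idea concrete, here is a toy sketch of forward-mode auto-differentiation using dual numbers in plain Python. This is not the autograd package and not how the post itself does it. The `Dual` class, the `derivative` helper, and the three data points are all hypothetical names and values introduced for illustration. The function being differentiated is a Normal negative log-likelihood with the variance fixed at 1 and constants dropped.

```python
class Dual:
    """A number that carries its value and its derivative together."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __rsub__(self, other):
        return lift(other) - self

    def __mul__(self, other):
        other = lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def lift(x):
    """Wrap a plain number as a Dual with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


def neg_loglike(mu, data):
    """Normal negative log-likelihood in mu (sigma = 1, constants dropped)."""
    total = 0.0
    for y in data:
        resid = y - mu
        total = total + 0.5 * resid * resid
    return total


def derivative(f, x):
    """Evaluate f at x while propagating d/dx alongside the value."""
    return f(Dual(x, 1.0)).der


data = [1.0, 2.0, 4.0]  # made-up observations
score_at_zero = derivative(lambda m: neg_loglike(m, data), 0.0)
score_at_mean = derivative(lambda m: neg_loglike(m, data), sum(data) / len(data))
print(score_at_zero, score_at_mean)
```

The likelihood function is written in the ordinary way, and the derivative falls out of running it on `Dual` inputs. The derivative is zero at the sample mean, which is the maximum likelihood estimate of the mean in this simple model.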
Stata is a statistical package that lots of people use, and Emacs Org-mode is a great platform for organizing, publishing, and blogging your research. In one of my older posts, I outlined the relative benefits of Org-mode compared to other packages for literate programming. At that time, I argued it was the best way to write literate programming documents with Stata (if you are willing to pay the fixed costs of learning Emacs). I still believe that, and I use it a lot for writing course notes, emailing students with code and results, and even for drafting manuscripts for publishing.
Despite how good Emacs Org-mode is for research involving Stata, Stata is still something of a second-class citizen compared to packages like R or Python. While it is functional, it can be a little rough around the edges, and since not many people use Stata with Emacs, finding answers can be tough. This post does three things:
ob-stata.el. With only minor modifications, this version avoids some issues with the current version of ob-stata found here. My version of ob-stata.el can be downloaded from gitlab.
Jump straight to the discussion on Stata and Emacs Org Mode
This post describes my experience implementing reproducible research and literate programming methods for commonly used econometric software packages. Since literate programming aims to store the accumulated scientific knowledge of the research project in one document, the software package must allow for the reproduction of data cleaning and data analysis steps, store the record of methods used, generate results dynamically and use these for the writeup, and be executable by including the computational environment.
Perhaps most importantly, this dynamic document can be executed to produce the academic paper. The researcher shares this file with other researchers rather than only a PDF of the paper, making the research fully reproducible by executing the dynamic document. It is my view that this will be expected in most scientific journals over the next few decades.
In this notebook, we examine the workings of the Gordon-Schaefer Fisheries Model for a single species.
Denoting $S(t)$ as the stock at time $t$, we can write the population growth function as
$$ \frac{\Delta S}{\Delta t} = \frac{\partial S}{\partial t} = r S(t) \left(1- \frac{S(t)}{K} \right) $$
where
$S(t)$ = stock size at time $t$
$K$ = carrying capacity
$r$ = intrinsic growth rate of the population
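A minimal sketch of this growth function, stepped forward in time with a simple Euler scheme. The parameter values ($r = 0.5$, $K = 100$, initial stock of 10), the step size, and the function names are assumptions chosen only for illustration, and there is no harvest term here.

```python
def logistic_growth(stock, r, K):
    """Right-hand side of the growth equation: r * S * (1 - S / K)."""
    return r * stock * (1.0 - stock / K)


def simulate_stock(s0, r, K, years, dt=0.01):
    """Euler-step the stock forward from s0 and return the whole path."""
    steps = int(round(years / dt))
    stock = s0
    path = [stock]
    for _ in range(steps):
        stock += logistic_growth(stock, r, K) * dt
        path.append(stock)
    return path


# Hypothetical parameter values, for illustration only
path = simulate_stock(s0=10.0, r=0.5, K=100.0, years=50)
print(f"stock after 50 years: {path[-1]:.4f}")
```

With no harvest, the stock rises toward the carrying capacity $K$, and growth is fastest when the stock is at $K/2$.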
In this note, I extend a previous post on comparing run-time speeds of various econometrics packages by
reg
fitlm
regression.linear_model.OLS from the statsmodels module.
Parallel module.
Update 1: A more complete and updated speed comparison can be found here.
Update 2: Python and Matlab code edited on 4/5/2015.
In this short note, we compare the speed of matlab and the scientific computing platform of python for a simple bootstrap of an ordinary least squares model. Bottom line (with caveats): matlab is faster than python with this code. One might be able to further optimize the python code below, but it isn't an obvious or easy process (see for example advanced optimization techniques).
As an aside, this note demonstrates that even if one can't optimize python code significantly enough, it is possible to do computationally expensive calculations in matlab and return results to the ipython notebook.
We will bootstrap the ordinary least squares model (ols) using 1000 replicates. For generating the toy dataset, the true parameter values are $$ \beta=\begin{bmatrix} 10\\-.5\\.5 \end{bmatrix} $$
We perform the experiment for 3 different sample sizes ($n = \begin{bmatrix}1,000 & 10,000 & 100,000 \end{bmatrix}$). For each of the observations in the toy dataset, the independent variables are drawn from
$$ \mu_x = \begin{bmatrix} 10\\10 \end{bmatrix}, \sigma_x = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} $$
The dependent variable is constructed by drawing a vector of random normal variates from Normal(0,1). Denoting this vector as $\epsilon$, calculate the dependent variable as $$ \mathbf{Y=X\beta+\epsilon} $$
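The setup above can be sketched in numpy as follows. This is not the benchmark code from the post. It assumes the independent variables are multivariate normal with $\sigma_x$ read as a covariance matrix, that the bootstrap resamples whole rows with replacement, and it uses only the smallest sample size. The seed and helper name are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_true = np.array([10.0, -0.5, 0.5])
n, replicates = 1000, 1000

# Toy dataset: a constant plus two regressors (sigma_x treated as a covariance)
mu_x = np.array([10.0, 10.0])
sigma_x = np.array([[4.0, 0.0], [0.0, 4.0]])
X = np.column_stack([np.ones(n), rng.multivariate_normal(mu_x, sigma_x, size=n)])
y = X @ beta_true + rng.standard_normal(n)


def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


beta_hat = ols(X, y)

# Bootstrap: resample rows with replacement and re-estimate each time
boot = np.empty((replicates, X.shape[1]))
for b in range(replicates):
    idx = rng.integers(0, n, size=n)
    boot[b] = ols(X[idx], y[idx])

boot_se = boot.std(axis=0, ddof=1)
print("estimates:    ", beta_hat)
print("bootstrap se: ", boot_se)
```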
In this short post, I will outline how one can access data stored in a database like MariaDB or MySQL for analysis inside an IPython Notebook. There are many reasons why you might want to store your data in a proper database. For me the most important are:
All of my data resides in a password protected and more secure place than having a multitude of csv, mat, and dta files scattered all over my file system.
If you access the same data for multiple projects, any changes to the underlying data will be propagated to your analysis, without having to update copies of project data.
Having data in a central repository makes backup and recovery significantly easier.
This allows for two-way interaction with your database. You can read and write tables from/to your database. Rather than use SQL, you can create database tables using pandas/ipython.
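A minimal sketch of that two-way interaction with pandas. To keep the example self-contained it uses an in-memory SQLite database as a stand-in, which is an assumption made for illustration. For MariaDB or MySQL you would pass pandas a connection to your own server instead (for example a SQLAlchemy engine). The table and column names are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for a real MariaDB/MySQL server here
conn = sqlite3.connect(":memory:")

# Write: create a database table straight from a DataFrame, no SQL needed
df = pd.DataFrame({"id": [1, 2, 3], "price": [9.5, 12.0, 7.25]})
df.to_sql("prices", conn, index=False, if_exists="replace")

# Read: pull a query result back into a DataFrame for analysis
result = pd.read_sql("SELECT id, price FROM prices WHERE price > 8", conn)
conn.close()

print(result)
```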
In this note, I'll explore the IPython statsmodels package for estimating linear regression models (OLS). The goal is to completely map stata commands for reg into something implementable in IPython.
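As a small taste of what such a mapping can look like, here is a hedged sketch using the statsmodels formula interface. The data are simulated and the variable names `y`, `x1`, and `x2` are made up. The Stata commands in the comments are the rough equivalents I have in mind, not an exhaustive mapping.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data, for illustration only
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x1"] - 3.0 * df["x2"] + rng.normal(scale=0.1, size=200)

# Stata: reg y x1 x2
fit = smf.ols("y ~ x1 + x2", data=df).fit()

# Stata: reg y x1 x2, robust  (HC1 standard errors)
fit_robust = smf.ols("y ~ x1 + x2", data=df).fit(cov_type="HC1")

print(fit.summary())
print(fit_robust.bse)
```

The point estimates are identical in both fits. Only the standard errors change when the robust covariance is requested.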