Monte Carlos, Random Number Generation, and Parallel Processing in Python

Monte Carlo analysis is a big part of my research: for testing the properties of new estimators or for debugging code. Computers with multiple processors and cores offer a quick way to run dozens if not hundreds of monte-carlos replicates in parallel. This can drop runtimes for the monte-carlo by as much as # processors \(\times\) # of cores (although due to overhead, things don't always scale linearly). This note serves two purposes. It demonstrates how to run parallel jobs in python in a monte carlo setting, and more importantly shows how randomness is treated by default in python parallel libraries like joblib or multiprocess and what this means for proper monte carlo analysis.

The typical workflow is something like this:

  1. Develop estimator
  2. Develop a data generation process for creating a dataset with known statistical properties (the toy dataset). This process produces data from
    • known parameters \(\pmb{theta}\) and \(\epsilon\) (although some of these may be randomly drawn inside the data generation process)
    • randomly generated independent variable data (\(\mathbf{x}\))
    • a known functional relationship between the dependent variable (\(\mathbf{y}\)) and all other model information (\(f(\mathbf{x},\pmb{theta},\epsilon)\))
    • other parameters may be important for the data generation process (e.g. sample size (\(N\))). Usually these will be fixed for all replicates within a particular monte carlo experiment
  3. Given the estimator and a toy dataset, run the model and collect model estimates for \(\pmb{\theta}\)
  4. Repeat \(R\) times and summarize data

It should be apparent that step (2) involves quite a bit of random number generation.