Statistics in Python

In [1]:
%matplotlib inline
from __future__ import division
from IPython.display import YouTubeVideo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import gamma
from scipy.stats import uniform
import statsmodels.formula.api as smf  

np.random.seed(12578)

Frequentist Statistics

The statsmodels package has much of the functionality of Stata. It can estimate most of the models you would see as a student taking econometrics, time series, or cross-section courses here at William and Mary. Here is an example OLS regression:

In [2]:
# load data and create dataframe
tobias_koop=pd.read_csv('http://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')

tobias_koop.head()
Out[2]:
   id  educ  ln_wage  pexp  time  ability  meduc  feduc  broken_home  siblings  pexp2
0   4    12     2.14     2     4     0.26     12     10            1         4      4
1   6    15     1.91     4     4     0.44     12     16            0         2     16
2   8    13     2.32     8     4     0.51     12     15            1         2     64
3  11    14     1.64     1     4     1.82     16     17            1         2      1
4  12    13     2.16     6     4    -1.30     13     12            0         5     36

Note that statsmodels works seamlessly with pandas dataframes (here, tobias_koop).

In [3]:
formula = "ln_wage ~ educ + pexp + pexp2 + broken_home"
results = smf.ols(formula,tobias_koop).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                ln_wage   R-squared:                       0.166
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     51.36
Date:                Tue, 12 Jan 2016   Prob (F-statistic):           1.83e-39
Time:                        16:29:11   Log-Likelihood:                -583.66
No. Observations:                1034   AIC:                             1177.
Df Residuals:                    1029   BIC:                             1202.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       0.4603      0.137      3.353      0.001         0.191     0.730
educ            0.0853      0.009      9.179      0.000         0.067     0.104
pexp            0.2035      0.024      8.629      0.000         0.157     0.250
pexp2          -0.0124      0.002     -5.438      0.000        -0.017    -0.008
broken_home    -0.0087      0.036     -0.244      0.807        -0.079     0.061
==============================================================================
Omnibus:                       55.892   Durbin-Watson:                   1.761
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              112.050
Skew:                          -0.355   Prob(JB):                     4.66e-25
Kurtosis:                       4.448   Cond. No.                         391.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Note that the results object contains other types of information as well, including robust standard errors:

In [4]:
results.HC0_se
Out[4]:
Intercept      0.131223
educ           0.009905
pexp           0.022629
pexp2          0.002239
broken_home    0.031162
dtype: float64
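
For example (not shown in the original notebook), a few other commonly used attributes of the fitted results object:

In [ ]:
print(results.params)      # estimated coefficients
print(results.bse)         # (nonrobust) standard errors
print(results.rsquared)    # R-squared
print(results.conf_int())  # 95% confidence intervals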

If you want to specify which variance-covariance matrix to use at estimation time, pass the cov_type argument:

In [5]:
results = smf.ols(formula,tobias_koop).fit(cov_type='HC0')
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                ln_wage   R-squared:                       0.166
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     65.13
Date:                Tue, 12 Jan 2016   Prob (F-statistic):           3.89e-49
Time:                        16:29:14   Log-Likelihood:                -583.66
No. Observations:                1034   AIC:                             1177.
Df Residuals:                    1029   BIC:                             1202.
Df Model:                           4                                         
Covariance Type:                  HC0                                         
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       0.4603      0.131      3.508      0.000         0.203     0.718
educ            0.0853      0.010      8.609      0.000         0.066     0.105
pexp            0.2035      0.023      8.994      0.000         0.159     0.248
pexp2          -0.0124      0.002     -5.543      0.000        -0.017    -0.008
broken_home    -0.0087      0.031     -0.280      0.779        -0.070     0.052
==============================================================================
Omnibus:                       55.892   Durbin-Watson:                   1.761
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              112.050
Skew:                          -0.355   Prob(JB):                     4.66e-25
Kurtosis:                       4.448   Cond. No.                         391.
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC0)

Statistical Distributions

In this course, we will be evaluating pdf's and cdf's, building likelihood and log-likelihood functions, taking random samples, and performing other statistical operations. We will primarily use the scipy family of distributions. At times, I may "slip up" and use some numpy functions, but in my opinion the scipy functions are more powerful and save coding time. Here are some examples:
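As a quick illustration of the difference (not from the original notebook): the scipy distribution object bundles pdf, cdf, logpdf, and random sampling in one place, whereas numpy only provides the sampler.

In [ ]:
# scipy: one frozen distribution object gives pdf, cdf, logpdf, and rvs
print(norm(10, 2).rvs(5))
print(norm(10, 2).pdf(10.0))

# numpy: random draws only
print(np.random.normal(10, 2, 5))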

Evaluating pdf's and cdf's:

In [6]:
mean = 0
y = 1.5
std = 1
print "PDF value" 
print norm(mean,std).pdf(y)
print "CDF value"
print norm(mean,std).cdf(y)
PDF value
0.129517595666
CDF value
0.933192798731
In [7]:
# use these for some plots
y = np.arange(-3.5,3.5,.05)

# column 0: grid value, column 1: pdf value, column 2: cdf value
pdf_cdf_values = np.zeros((y.shape[0],3))

row_index=0
for i in y:
    pdf_cdf_values[row_index,0] = i
    pdf_cdf_values[row_index,1] = norm(mean,std).pdf(i)
    pdf_cdf_values[row_index,2] = norm(mean,std).cdf(i)
    row_index+=1
    
plt.figure(figsize=(8,8))
plt.plot(pdf_cdf_values[:,0],pdf_cdf_values[:,1],label="Standard Normal PDF")
plt.plot(pdf_cdf_values[:,0],pdf_cdf_values[:,2],label="Standard Normal CDF")
plt.legend(loc='upper left')
plt.show()
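
Since scipy's pdf and cdf methods accept arrays directly, the loop above could also be written in vectorized form; a minimal equivalent sketch:

In [ ]:
# same three columns as above, built without an explicit loop
pdf_cdf_values = np.column_stack((y,
                                  norm(mean, std).pdf(y),
                                  norm(mean, std).cdf(y)))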

Random Variates

Let's draw some random numbers from the normal distribution. This time, instead of a mean of 0 and a standard deviation of 1, let's assume the distribution is centered at 10 with a standard deviation of 2. Draw 100 random variates:

In [8]:
mean = 10
std = 2
N = 100
y = norm(mean,std).rvs(N)
print(y)
[ 10.41686472   6.72611306   9.20348461  11.67079106  14.15377413
   9.78287809   9.83408784   7.83980489   6.21018425  12.23916019
   9.45461662  10.63889929   6.58849774   9.96389407   7.96884601
  10.42344446   9.45020002  12.93908039  10.11263851   7.98512175
  11.87855743   8.42314993   9.09156988  10.90302077   7.3771751
   7.03648814  10.54189927  14.09773549   9.19448907  12.93931697
  10.49027688   9.22235715   5.73973794  11.20407959   9.82559841
   8.78135374   5.69799647  13.62382972   8.26666461  11.31502973
   9.24674708   9.09477411  12.31188227  11.20571449  12.89555143
  10.90910426  12.65193488  10.73500682   9.1105223   12.11264092
   9.56369209  11.47744907   9.81428745   8.91313828  11.91558726
   8.0820239    7.91563515  11.06442387   5.56267745   6.09232808
  10.34458283   6.64327682   7.85285525   7.73673878   9.12125366
   7.71302735   9.56898416   8.09889655   9.40693208  12.77956292
  13.20303638  10.8201574   10.30760335  11.8707502   12.27990302
   9.86902746   6.13998553   9.44624862  10.44409674   9.52436141
   9.05273226  12.713509    10.0567424    6.28136942   8.25894703
  10.41825466  12.67761413  12.86492231   7.79962787  11.20064669
  10.09174688   9.51563211   8.94945377   8.95720001   7.96935112
  11.64306391  10.29524047   6.46765218   8.75384597  10.67989401]

To convince ourselves that we have decent random numbers, let's view a histogram:

In [10]:
print(np.mean(y))
print(np.std(y))
plt.hist(y,bins=10)
plt.show()
9.77746555858
2.03761908345

The jury is still out on that distribution, but with $N=100$ it is a fairly small sample.
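
As an illustrative check (not in the original notebook), drawing a much larger sample should bring the sample mean and standard deviation much closer to the true values of 10 and 2:

In [ ]:
# with N = 100,000 the sample moments should be close to 10 and 2
y_big = norm(10, 2).rvs(100000)
print(np.mean(y_big))
print(np.std(y_big))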

Likelihood functions

Suppose the values y are observed and we want to calculate the likelihood of each of the 100 observations in y given a mean and standard deviation. Even though we know the true mean above is 10, let's calculate the likelihood of y given a mean of 10.3 and a standard deviation of 2, assuming y is normally distributed. Note these are simply pdf values evaluated at each data point:

$$ \mathcal{L}_i(y_i|\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(y_i - \mu)^2}{2\sigma^2}} $$
In [11]:
norm(10.3,2).pdf(y)
Out[11]:
array([ 0.1991309 ,  0.04041036,  0.17163606,  0.15771444,  0.03116273,
        0.19291365,  0.19413141,  0.09360648,  0.02465169,  0.12466458,
        0.18242433,  0.19662787,  0.03565051,  0.19667422,  0.10112776,
        0.19909155,  0.18225368,  0.08351919,  0.19859777,  0.10208818,
        0.14608507,  0.12842548,  0.16618999,  0.1906073 ,  0.06856666,
        0.05268632,  0.19801745,  0.03287854,  0.17121161,  0.08350615,
        0.19857044,  0.17251864,  0.01482286,  0.18009772,  0.19393779,
        0.14951306,  0.01413092,  0.05013348,  0.118969  ,  0.17536752,
        0.17364307,  0.16635073,  0.12026658,  0.18003112,  0.0859322 ,
        0.19043169,  0.09990501,  0.19480824,  0.16713676,  0.13228533,
        0.18640112,  0.16773264,  0.19367471,  0.15684317,  0.14394111,
        0.1078501 ,  0.09800516,  0.18542057,  0.012066  ,  0.02181519,
        0.19942159,  0.03749533,  0.09435883,  0.08774028,  0.16766856,
        0.0864111 ,  0.18658214,  0.10885999,  0.18054378,  0.0924937 ,
        0.0695619 ,  0.19283774,  0.1994697 ,  0.14653474,  0.12220104,
        0.19489333,  0.02293019,  0.18210039,  0.19895409,  0.18502069,
        0.16422049,  0.09630703,  0.19800114,  0.02649605,  0.11850231,
        0.19912277,  0.09839977,  0.0876469 ,  0.0913033 ,  0.18023724,
        0.1983927 ,  0.18470601,  0.15880431,  0.159219  ,  0.10115753,
        0.15920489,  0.19947058,  0.03181088,  0.14794573,  0.19590496])
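
As a quick sanity check (not in the original notebook), evaluating the pdf formula above by hand reproduces scipy's values:

In [ ]:
# normal pdf computed directly from the formula, compared with scipy
mu, sigma = 10.3, 2.0
manual_pdf = np.exp(-(y - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(manual_pdf[:5])
print(norm(mu, sigma).pdf(y)[:5])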

We can also easily calculate the joint likelihood. This is

$$ \mathcal{L}(\mathbf{y}|\mu,\sigma) = \prod_{i \in N}\mathcal{L}_i(y_i|\mu,\sigma) $$
In [12]:
norm(10.3,2).pdf(y).prod()
Out[12]:
9.2566324062648562e-95

For computational reasons, we usually work with log-likelihoods, noting that

$$ \ln(\mathcal{L}(\mathbf{y}|\mu,\sigma)) = \sum_{i \in N}\ln(\mathcal{L}_i(y_i|\mu,\sigma)) $$

Implemented in code for our vector $\mathbf{y}$:

In [13]:
norm(10.3,2).logpdf(y).sum()
Out[13]:
-216.52024352295027
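
To see why this matters, here is an illustrative demo (not in the original notebook): with enough observations the raw likelihood product underflows to zero in double precision, while the log-likelihood remains a finite, usable number.

In [ ]:
# the product of thousands of values below 1 underflows to 0.0,
# but the sum of log-pdfs stays finite
y_many = norm(10, 2).rvs(10000)
print(norm(10.3, 2).pdf(y_many).prod())
print(norm(10.3, 2).logpdf(y_many).sum())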

Let's visualize why the log-likelihood is better to work with than the likelihood:

In [14]:
mu_candidate = np.arange(7,13,.2) # plot likelihood and log-likelihood in this range
In [15]:
# column 0: candidate mean, column 1: likelihood, column 2: log-likelihood
like_loglike_values = np.zeros((mu_candidate.shape[0],3))

row_index=0
for i in mu_candidate:
    like_loglike_values[row_index,0] = i
    like_loglike_values[row_index,1] = norm(i,2).pdf(y).prod()
    like_loglike_values[row_index,2] = norm(i,2).logpdf(y).sum()
    row_index+=1

plt.figure(figsize=(10,5))
plt.subplot(121)
plt.title('Likelihood Function')
plt.plot(like_loglike_values[:,0],like_loglike_values[:,1],label="Likelihood Function")
plt.subplot(122)
plt.plot(like_loglike_values[:,0],like_loglike_values[:,2],label="Log-Likelihood Function")
plt.title('Log-Likelihood Function')
plt.show()
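
Because the log is monotonic, both curves peak at the same candidate mean; a quick illustrative check (not in the original notebook):

In [ ]:
# the candidate mean maximizing the likelihood also maximizes the log-likelihood
print(like_loglike_values[np.argmax(like_loglike_values[:,1]), 0])
print(like_loglike_values[np.argmax(like_loglike_values[:,2]), 0])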