Statistics in Python
%matplotlib inline
from IPython.display import YouTubeVideo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import gamma
from scipy.stats import uniform
import statsmodels.formula.api as smf
np.random.seed(12578)
Frequentist Statistics¶
The statsmodels package has some of the functionality of Stata. It can estimate most of the models you would see as a student taking econometrics, time series, and cross-section courses here at William and Mary. Here is an example OLS regression:
# load data and create dataframe
tobias_koop=pd.read_csv('https://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')
tobias_koop.head()
Note that statsmodels works seamlessly with Pandas dataframes (tobias_koop).
formula = "ln_wage ~ educ + pexp + pexp2 + broken_home"
results = smf.ols(formula,tobias_koop).fit()
print(results.summary())
Note that the results object contains other types of information as well, including robust standard errors:
results.HC0_se
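For example, here is a quick sketch of a few other standard attributes stored on the fitted results object:
# other items stored on the fitted results object
print(results.params)      # point estimates
print(results.bse)         # (non-robust) standard errors
print(results.conf_int())  # 95% confidence intervals
print(results.rsquared)    # R-squared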
If you want to specify which variance-covariance matrix to use, use the cov_type argument:
results = smf.ols(formula,tobias_koop).fit(cov_type='HC0')
print(results.summary())
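Other variance estimators follow the same pattern. Here is a sketch with HC3 and cluster-robust standard errors; the cluster example assumes the data contain an id column identifying individuals, so adjust the group variable to your data:
# HC3 heteroskedasticity-robust standard errors
results_hc3 = smf.ols(formula,tobias_koop).fit(cov_type='HC3')
print(results_hc3.bse)

# cluster-robust standard errors (assumes an 'id' column identifies the cluster)
results_cluster = smf.ols(formula,tobias_koop).fit(cov_type='cluster',
                                                   cov_kwds={'groups': tobias_koop['id']})
print(results_cluster.bse)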
Statistical Distributions¶
In the course, we will be evaluating pdfs and cdfs, building likelihood and log-likelihood functions, taking random samples, and perhaps performing other statistical operations. We will primarily use the scipy family of distributions. At times I may "slip up" and use some numpy functions, but in my opinion the scipy functions are more powerful and save coding time. Here are some examples:
Evaluating pdfs and cdfs¶
mean = 0
y = 1.5
std = 1
print("PDF value")
print(norm(mean,std).pdf(y))
print("CDF value")
print(norm(mean,std).cdf(y))
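The same pattern works for the other scipy distributions imported above; as a quick sketch, here is a gamma distribution (shape parameter 2) and a uniform distribution on $[0,1]$, with evaluation points chosen only for illustration:
# same pattern with the gamma and uniform distributions imported above
print(gamma(2).pdf(1.5))      # gamma with shape parameter a=2, evaluated at 1.5
print(gamma(2).cdf(1.5))
print(uniform(0,1).pdf(0.5))  # uniform on [0,1] (loc=0, scale=1), evaluated at 0.5
print(uniform(0,1).cdf(0.5))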
# use these for some plots
y = np.arange(-3.5,3.5,.05)
pdf_cdf_values = np.zeros((y.shape[0],3))
row_index=0
for i in y:
    pdf_cdf_values[row_index,0] = i
    pdf_cdf_values[row_index,1] = norm(mean,std).pdf(i)
    pdf_cdf_values[row_index,2] = norm(mean,std).cdf(i)
    row_index+=1
plt.figure(figsize=(8,8))
plt.plot(pdf_cdf_values[:,0],pdf_cdf_values[:,1],label="Standard Normal PDF")
plt.plot(pdf_cdf_values[:,0],pdf_cdf_values[:,2],label="Standard Normal CDF")
plt.legend(loc='upper left')
plt.show()
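Because the scipy pdf and cdf methods accept arrays directly, the loop above can also be written in vectorized form; this sketch produces the same plot:
# vectorized alternative: evaluate the whole grid at once
pdf_values = norm(mean,std).pdf(y)
cdf_values = norm(mean,std).cdf(y)

plt.figure(figsize=(8,8))
plt.plot(y,pdf_values,label="Standard Normal PDF")
plt.plot(y,cdf_values,label="Standard Normal CDF")
plt.legend(loc='upper left')
plt.show()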
Random Variates¶
Let's draw some random numbers from the normal distribution. This time, instead of using mean 0 and standard deviation of 1, let's assume the distribution is centered at 10 with a standard deviation of 2. Draw 100 random variates:
mean = 10
std = 2
N = 100
y = norm(mean,std).rvs(N)
print(y)
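We seeded numpy's global random number generator at the top of the notebook; if you want reproducibility for a single call instead, rvs also accepts a random_state argument (the seed value here is arbitrary):
# reproducible draws without relying on the global seed
y_check = norm(mean,std).rvs(N,random_state=12578)
print(y_check[:5])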
To convince ourselves that we have decent random numbers, let's check the sample mean and standard deviation and view a histogram:
print(np.mean(y))
print(np.std(y))
plt.hist(y,bins=10)
plt.show()
The jury is still out on that distribution, but with $N=100$ it is a small sample.
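One way to see that the draws are behaving is to take a much larger sample; the sample mean and standard deviation should then be much closer to 10 and 2. A quick sketch, with the sample size chosen arbitrarily:
# a larger sample should look much more normal
y_big = norm(mean,std).rvs(100000)
print(np.mean(y_big))
print(np.std(y_big))
plt.hist(y_big,bins=50)
plt.show()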
Likelihood Functions¶
Suppose the values y are observed and we want to calculate the likelihood of each of the 100 observations in y given a mean. Even though we know the mean above should be 10, let's calculate the likelihood of y given a mean of 10.3 and a standard deviation of 2, assuming y is normally distributed. Note that these are simply pdf values for each data point:
$$ \mathcal{L}_i(y_i|\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(y_i - \mu)^2}{2\sigma^2}} $$
norm(10.3,2).pdf(y)
We can also easily calculate the joint likelihood. This is
$$ \mathcal{L}(\mathbf{y}|\mu) = \prod_{i \in N}\mathcal{L}_i(y_i|\mu,\sigma) $$
norm(10.3,2).pdf(y).prod()
For computational advantages, we usually work with log-likelihoods, noting that
$$ \ln(\mathcal{L}(\mathbf{y}|\mu)) = \sum_{i \in N}\ln(\mathcal{L}_i(y_i|\mu,\sigma)) $$
Implemented in code for our vector $\mathbf{y}$:
norm(10.3,2).logpdf(y).sum()
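As a quick check, exponentiating the summed log-likelihood recovers the joint likelihood we computed above (up to floating-point error):
# exp(sum of log pdf values) equals the product of pdf values
print(np.exp(norm(10.3,2).logpdf(y).sum()))
print(norm(10.3,2).pdf(y).prod())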
Let's visualize why the log-likelihood is better to work with than the likelihood:
mu_candidate = np.arange(7,13,.2) # plot likelihood and log-likelihood in this range
like_loglike_values = np.zeros((mu_candidate.shape[0],3))
row_index=0
for i in mu_candidate:
    like_loglike_values[row_index,0] = i
    like_loglike_values[row_index,1] = norm(i,2).pdf(y).prod()
    like_loglike_values[row_index,2] = norm(i,2).logpdf(y).sum()
    row_index+=1
plt.figure(figsize=(10,5))
plt.subplot(121)
plt.title('Likelihood Function')
plt.plot(like_loglike_values[:,0],like_loglike_values[:,1],label="Likelihood Function")
plt.subplot(122)
plt.plot(like_loglike_values[:,0],like_loglike_values[:,2],label="Log-Likelihood Function")
plt.title('Log-Likelihood Function')
plt.show()
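The plots suggest the log-likelihood is maximized near the sample mean. Here is a minimal sketch of finding that maximizer numerically with scipy.optimize.minimize_scalar, holding $\sigma$ fixed at 2; the bounds simply match the plotting range above:
from scipy.optimize import minimize_scalar

# minimize the negative log-likelihood over mu, with sigma fixed at 2
neg_loglike = lambda mu: -norm(mu,2).logpdf(y).sum()
opt_result = minimize_scalar(neg_loglike, bounds=(7,13), method='bounded')
print(opt_result.x)   # maximizer of the log-likelihood
print(np.mean(y))     # should be close to the sample mean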