In [1]:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import urllib

Introduction to Pandas¶

Pandas works hand-in-hand with other python libraries (e.g. matplotlib and numpy) to make manipulating data (what the Pandas team calls "Data Munging") easy. With pandas it is easy to

Easily access data using variable names, but have full linear algebra capabilities of numpy
Group data by values (for example, calculate the mean income by state)
Plot and summarize values
Join (combine) different sources of data
Powerful time series and panel data capabilities (beyond scope of course)

Note: If you want to reinforce some of these concepts on your own, I recommend this superb youtube video (https://www.youtube.com/watch?v=5JnMutdy6Fw) and accompanying coursework (https://github.com/brandon-rhodes/pycon-pandas-tutorial) by Brandon Rhodes.

There are numerous ways to get data into Pandas:

Import excel and comma delimited data
Import stata, sas, matlab, and other datasets
Import data from an SQL server
Import data scraped from the web
Manually building (by typing in values)

In this tutorial we will focus on the first two methods for reading data, but just know that there is probably a way to get your data into pandas irrespective of what format it is in.

Loading and Cleaning Data¶

We will be loading 2010 "Trip" data from the NOAA Fisheries Service Recreational Fisheries Statistics Survey (called MRIP). More detail here: https://www.fisheries.noaa.gov/topic/recreational-fishing-data

We will also load the Census Bureau's county and state fips file, obtained here: https://www.census.gov/2010census/xls/fips_codes_website.xls

The MRIPs data needs to be downloaded from my website (you only need to do this once).

In [2]:

# run this once and then comment out with #
# urllib.request.urlretrieve("https://rlhick.people.wm.edu/pubp622/data/mrips_2010.pydata", "mrips_2010.pydata")

In [3]:

trips_2010 = pd.io.pickle.read_pickle('mrips_2010.pydata')
fips = pd.io.excel.read_excel('https://rlhick.people.wm.edu/pubp622/data/fips_codes_website.xls')

In [4]:

trips_2010.head()

Out[4]:

	index	add_hrs	add_ph	age	area	area_x	art_reef	asg_code	...	time	tourn	tsn1	tsn2	turtle	wave	wp_int	year	zip	intercept_date
0	0	NaN	1	NaN	C	5	NaN	1171020101114	...	1330.0	NaN	167680		NaN	6.0	3411.891063	2010.0	06475	2010-11-14
1	1	NaN	2	NaN	C	5	NaN	1171020101114	...	1332.0	NaN	167680		NaN	6.0	3411.891063	2010.0	06475	2010-11-14
2	2	0.0	1	NaN	C	5	NaN	1171020101116	...	1130.0	NaN	168559	167680	NaN	6.0	17548.885633	2010.0	06110	2010-11-16
3	3	NaN	1	NaN	C	5	NaN	1171020101123	...	1200.0	NaN	167680	168559	NaN	6.0	4489.923579	2010.0	06416	2010-11-23
4	4	NaN	1	NaN	C	5	NaN	1171020101123	...	1215.0	NaN	167680	168559	NaN	6.0	4489.923579	2010.0	06067	2010-11-23

5 rows × 80 columns

Trips is a very big dataset with lots of columns we'll never use. Let's trim it down:

In [5]:

trips_2010.columns

Out[5]:

Index(['index', 'add_hrs', 'add_ph', 'age', 'area', 'area_nc', 'area_x',
       'art_reef', 'asg_code', 'boat_hrs', 'catch', 'celltype', 'cntrbtrs',
       'cnty', 'cnty_res', 'coastal', 'compflag', 'county', 'date1', 'dist',
       'distkeys', 'f_by_p', 'ffdays12', 'ffdays2', 'first', 'fshinsp_a',
       'gear', 'gender', 'hrs_dtd', 'hrsf', 'id_code', 'intsite', 'kod',
       'leader', 'license', 'mode2001', 'mode_asg', 'mode_f', 'mode_fx',
       'monitor', 'month', 'muni_res', 'muni_trp', 'new_list', 'num_fish_a',
       'num_typ2', 'num_typ3', 'num_typ4', 'num_typ6', 'num_typ9', 'on_list',
       'party', 'prim1', 'prim1_common', 'prim2', 'prim2_common', 'prt_code',
       'psu_id', 'pvt_res', 'reefcode', 'reg_res', 'region', 'rig', 'sep_fish',
       'st', 'st_res', 'strat_id', 'strat_interval', 'sub_reg', 'telefon',
       'time', 'tourn', 'tsn1', 'tsn2', 'turtle', 'wave', 'wp_int', 'year',
       'zip', 'intercept_date'],
      dtype='object')

In [6]:

trips = trips_2010[['id_code','year','wave','intercept_date','st','prim1',
                    'prim1_common','prim2','prim2_common','cnty','ffdays12',
                    'ffdays2']]

This is summary statistics for numeric data columns:

In [7]:

trips.describe()

Out[7]:

	year	wave	st	cnty	ffdays12	ffdays2
count	104709.0	104709.000000	104709.000000	104519.000000	104519.000000	104519.000000
mean	2010.0	3.794879	23.974615	78.798171	40.771439	6.191936
std	0.0	1.347586	13.160875	130.751209	122.456004	13.037239
min	2010.0	1.000000	1.000000	1.000000	0.000000	0.000000
25%	2010.0	3.000000	12.000000	19.000000	1.000000	0.000000
50%	2010.0	4.000000	22.000000	55.000000	10.000000	2.000000
75%	2010.0	5.000000	37.000000	99.000000	35.000000	7.000000
max	2010.0	6.000000	51.000000	810.000000	999.000000	99.000000

Using Pandas DataFrames¶

Referencing columns in data:¶

In [8]:

# copy data frame into t (note, this is different than assignment [t=trips])
t=trips.copy()

t.ffdays12.head()

Out[8]:

0    25.0
1    10.0
2    12.0
3    90.0
4    75.0
Name: ffdays12, dtype: float64

In [9]:

t['ffdays12'].head()

Out[9]:

0    25.0
1    10.0
2    12.0
3    90.0
4    75.0
Name: ffdays12, dtype: float64

Referencing rows in data:¶

We can use numpy-like slicing:

In [10]:

# rows 50-54
t.loc[51:55]

Out[10]:

	id_code	year	wave	intercept_date	st	prim1	prim1_common	cnty	ffdays12	ffdays2
51	1171020101111005	2010.0	6.0	2010-11-11	9.0	8839010101	TAUTOG	7.0	35.0	10.0
52	1171020101111006	2010.0	6.0	2010-11-11	9.0	8839010101	TAUTOG	7.0	30.0	10.0
53	1171020101113001	2010.0	6.0	2010-11-13	9.0	8839010101	TAUTOG	11.0	10.0	8.0
54	1171020101113002	2010.0	6.0	2010-11-13	9.0	8839010101	TAUTOG	11.0	10.0	8.0
55	1171020101113003	2010.0	6.0	2010-11-13	9.0	8839010101	TAUTOG	11.0	3.0	2.0

In [11]:

# rows 1 and 2 for the first 5 columns
t.iloc[0:2,0:5]

Out[11]:

	id_code	year	wave	intercept_date	st
0	1171020101114001	2010.0	6.0	2010-11-14	9.0
1	1171020101114002	2010.0	6.0	2010-11-14	9.0

We can select rows of data based on column values. Let's select all the rows where st is 9 (Connecticut)

In [12]:

t[t.st == 9].head()

Out[12]:

	id_code	year	wave	intercept_date	st	prim1	prim1_common	prim2	prim2_common	cnty	ffdays12	ffdays2
0	1171020101114001	2010.0	6.0	2010-11-14	9.0	8835020102	STRIPED BASS			7.0	25.0	5.0
1	1171020101114002	2010.0	6.0	2010-11-14	9.0	8835020102	STRIPED BASS			7.0	10.0	2.0
2	1171020101116001	2010.0	6.0	2010-11-16	9.0	8835250101	BLUEFISH	8835020102	STRIPED BASS	11.0	12.0	6.0
3	1171020101123001	2010.0	6.0	2010-11-23	9.0	8835020102	STRIPED BASS	8835250101	BLUEFISH	7.0	90.0	25.0
4	1171020101123002	2010.0	6.0	2010-11-23	9.0	8835020102	STRIPED BASS	8835250101	BLUEFISH	7.0	75.0	20.0

Find all rows where days spent fishing during past 12 months exceeds 10:

In [13]:

t[t.ffdays12>10].head()

Out[13]:

	id_code	year	wave	intercept_date	st	prim1	prim1_common	prim2	prim2_common	cnty	ffdays12	ffdays2
0	1171020101114001	2010.0	6.0	2010-11-14	9.0	8835020102	STRIPED BASS			7.0	25.0	5.0
2	1171020101116001	2010.0	6.0	2010-11-16	9.0	8835250101	BLUEFISH	8835020102	STRIPED BASS	11.0	12.0	6.0
3	1171020101123001	2010.0	6.0	2010-11-23	9.0	8835020102	STRIPED BASS	8835250101	BLUEFISH	7.0	90.0	25.0
4	1171020101123002	2010.0	6.0	2010-11-23	9.0	8835020102	STRIPED BASS	8835250101	BLUEFISH	7.0	75.0	20.0
5	1171020101123003	2010.0	6.0	2010-11-23	9.0	8835020102	STRIPED BASS	8835250101	BLUEFISH	7.0	70.0	10.0

Math on columns:¶

In [14]:

# we can create a new variable and use numpy commands:
t['temp'] = t.ffdays12 + .5*np.random.randn(t.shape[0])
t.head()

Out[14]:

	id_code	year	wave	intercept_date	st	prim1	prim1_common	prim2	prim2_common	cnty	ffdays12	ffdays2	temp
0	1171020101114001	2010.0	6.0	2010-11-14	9.0	8835020102	STRIPED BASS			7.0	25.0	5.0	24.981628
1	1171020101114002	2010.0	6.0	2010-11-14	9.0	8835020102	STRIPED BASS			7.0	10.0	2.0	9.281947
2	1171020101116001	2010.0	6.0	2010-11-16	9.0	8835250101	BLUEFISH	8835020102	STRIPED BASS	11.0	12.0	6.0	11.958769
3	1171020101123001	2010.0	6.0	2010-11-23	9.0	8835020102	STRIPED BASS	8835250101	BLUEFISH	7.0	90.0	25.0	90.170897
4	1171020101123002	2010.0	6.0	2010-11-23	9.0	8835020102	STRIPED BASS	8835250101	BLUEFISH	7.0	75.0	20.0	75.427539

In [15]:

t.temp.mean()

Out[15]:

40.76974409766304

In [16]:

# or we could do this all in one step if we didn't want to create a new column in t:
(t.ffdays12 + .5*np.random.randn(t.shape[0])).mean()

Out[16]:

40.77151062295116

Note: You can combine pandas with numpy. This is the standard deviation of the column ffdays12:

In [17]:

np.std(t.ffdays12)

Out[17]:

122.45541845299698

We can perform matrix/vector operations on pandas data.

Here, we can transpose a slice of the data:

In [18]:

t[['ffdays12','ffdays2']].head(10).T

Out[18]:

	0	1	2	3	4	5	6	7	8	9
ffdays12	25.0	10.0	12.0	90.0	75.0	70.0	10.0	15.0	1.0	250.0
ffdays2	5.0	2.0	6.0	25.0	20.0	10.0	2.0	0.0	0.0	50.0

We can do matrix multiplication:

In [19]:

np.dot(t[['ffdays12','ffdays2']].head(10).T,t[['ffdays12','ffdays2']].head(10))

Out[19]:

array([[82320., 17187.],
       [17187.,  3694.]])

Note: the remainder of this notebook is not particularly relevant for Bayesian Econometrics¶

While we won't use it in our course, you may find the following very useful for data munging.

Summary Statistics¶

Find mean of ffdays12 and ffdays2

In [20]:

t[['ffdays12','ffdays2']].describe()

Out[20]:

	ffdays12	ffdays2
count	104519.000000	104519.000000
mean	40.771439	6.191936
std	122.456004	13.037239
min	0.000000	0.000000
25%	1.000000	0.000000
50%	10.000000	2.000000
75%	35.000000	7.000000
max	999.000000	99.000000

Find mean of ffdays12 and ffdays2 by state:

In [21]:

t.head()

Out[21]:

	id_code	year	wave	intercept_date	st	prim1	prim1_common	prim2	prim2_common	cnty	ffdays12	ffdays2	temp
0	1171020101114001	2010.0	6.0	2010-11-14	9.0	8835020102	STRIPED BASS			7.0	25.0	5.0	24.981628
1	1171020101114002	2010.0	6.0	2010-11-14	9.0	8835020102	STRIPED BASS			7.0	10.0	2.0	9.281947
2	1171020101116001	2010.0	6.0	2010-11-16	9.0	8835250101	BLUEFISH	8835020102	STRIPED BASS	11.0	12.0	6.0	11.958769
3	1171020101123001	2010.0	6.0	2010-11-23	9.0	8835020102	STRIPED BASS	8835250101	BLUEFISH	7.0	90.0	25.0	90.170897
4	1171020101123002	2010.0	6.0	2010-11-23	9.0	8835020102	STRIPED BASS	8835250101	BLUEFISH	7.0	75.0	20.0	75.427539

In [22]:

t[['ffdays12','ffdays2']].groupby('st').describe()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-a7174ec0628b> in <module>()
----> 1 t[['ffdays12','ffdays2']].groupby('st').describe()

~/anaconda/envs/python36/lib/python3.6/site-packages/pandas/core/generic.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, **kwargs)
   5160         return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
   5161                        sort=sort, group_keys=group_keys, squeeze=squeeze,
-> 5162                        **kwargs)
   5163 
   5164     def asfreq(self, freq, method=None, how=None, normalize=False,

~/anaconda/envs/python36/lib/python3.6/site-packages/pandas/core/groupby.py in groupby(obj, by, **kwds)
   1846         raise TypeError('invalid type: %s' % type(obj))
   1847 
-> 1848     return klass(obj, by, **kwds)
   1849 
   1850 

~/anaconda/envs/python36/lib/python3.6/site-packages/pandas/core/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, **kwargs)
    514                                                     level=level,
    515                                                     sort=sort,
--> 516                                                     mutated=self.mutated)
    517 
    518         self.obj = obj

~/anaconda/envs/python36/lib/python3.6/site-packages/pandas/core/groupby.py in _get_grouper(obj, key, axis, level, sort, mutated, validate)
   2932                 in_axis, name, level, gpr = False, None, gpr, None
   2933             else:
-> 2934                 raise KeyError(gpr)
   2935         elif isinstance(gpr, Grouper) and gpr.key is not None:
   2936             # Add key to exclusions

KeyError: 'st'

What happened? We trimmed down the columns of t with the statement t[['ffdays12','ffdays2']] and pandas couldn't find the columns st. Let's try this again:

In [23]:

t.groupby('st')[['ffdays12','ffdays2']].describe()

Out[23]:

	ffdays12								ffdays2
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
st
1.0	2319.0	54.470030	173.295166	0.0	1.00	8.0	30.00	998.0	2319.0	5.535576	11.381926	0.0	0.0	2.0	6.0	98.0
9.0	2068.0	19.050774	45.799724	0.0	0.75	5.0	20.00	998.0	2068.0	4.711315	8.832107	0.0	0.0	2.0	6.0	98.0
10.0	3691.0	33.415335	126.399327	0.0	1.00	6.0	20.00	999.0	3691.0	5.167163	13.028735	0.0	0.0	1.0	5.0	99.0
12.0	37354.0	59.632543	154.875664	0.0	2.00	20.0	50.00	999.0	37354.0	7.921401	15.925836	0.0	0.0	3.0	8.0	99.0
13.0	1725.0	17.682899	34.627497	0.0	0.00	3.0	24.00	360.0	1725.0	3.355362	6.087584	0.0	0.0	1.0	4.0	60.0
22.0	5212.0	23.146201	53.764561	0.0	3.00	10.0	25.00	998.0	5212.0	3.582502	6.558266	0.0	0.0	2.0	4.0	98.0
23.0	1440.0	25.009028	112.346057	0.0	0.00	2.0	13.00	998.0	1440.0	4.971528	12.479304	0.0	0.0	0.0	4.0	98.0
24.0	4163.0	22.875811	91.202939	0.0	1.00	4.0	16.00	999.0	4163.0	4.357915	10.796496	0.0	0.0	1.0	4.0	99.0
25.0	4181.0	29.579048	106.879088	0.0	1.00	6.0	24.00	999.0	4181.0	5.560871	12.220832	0.0	0.0	1.0	6.0	99.0
28.0	1784.0	50.395179	97.010203	0.0	5.00	25.0	50.00	999.0	1784.0	6.431054	11.731276	0.0	0.0	3.0	8.0	99.0
33.0	2375.0	14.131368	67.848061	0.0	0.00	2.0	10.00	999.0	2375.0	3.124632	8.231361	0.0	0.0	0.0	3.0	99.0
34.0	4760.0	35.825000	111.351586	0.0	3.00	10.0	30.00	998.0	4760.0	6.005042	11.223230	0.0	0.0	2.0	7.0	99.0
36.0	3981.0	36.994223	101.657053	0.0	4.00	20.0	40.00	999.0	3981.0	7.949510	12.562636	0.0	1.0	4.0	10.0	99.0
37.0	20289.0	29.622998	86.518452	0.0	1.00	6.0	25.00	999.0	20289.0	5.417418	10.735914	0.0	0.0	2.0	6.0	99.0
44.0	1488.0	20.715726	39.349850	0.0	1.00	5.0	20.25	320.0	1488.0	4.751344	9.322511	0.0	0.0	1.0	5.0	60.0
45.0	2867.0	50.676317	167.737333	0.0	0.00	7.0	30.00	999.0	2867.0	7.040112	17.099136	0.0	0.0	2.0	6.0	99.0
51.0	4822.0	24.962256	81.259619	0.0	1.00	6.0	25.00	999.0	4822.0	4.214641	8.909940	0.0	0.0	1.0	5.0	98.0

Combining DataFrames¶

The state fips codes are hardly intuitive and even worse for the county fips codes. The goal of this section is to

Add text-based labels
Be able to use these labels for fast data selection

To do this, we will attempt to match text based state and county labels to our trips dataframe. Pandas calls this a merge.

In [24]:

fips.head()

Out[24]:

	State Abbreviation	State FIPS Code	County FIPS Code	FIPS Entity Code	ANSI Code	GU Name	Entity Description
0	AL	1	67	124	2403054	Abbeville	city
1	AL	1	73	460	2403063	Adamsville	city
2	AL	1	117	820	2403069	Alabaster	city
3	AL	1	95	988	2403074	Albertville	city
4	AL	1	123	1132	2403077	Alexander City	city

The fips codes (numeric) uniquely identify counties in our trips file. So let's look at the fips file (that we grabbed from Census) above and combine it with our trips data. These column names are messy: let's clean it up.

For me, column names should always:

Be lower case
Have no spaces or funky characters

In [25]:

names = ['state','state_fips','county_fips','fips','ansi_code',
         'county_name','entity_type']
fips.columns=names
fips.head()

Out[25]:

	state	state_fips	county_fips	fips	ansi_code	county_name	entity_type
0	AL	1	67	124	2403054	Abbeville	city
1	AL	1	73	460	2403063	Adamsville	city
2	AL	1	117	820	2403069	Alabaster	city
3	AL	1	95	988	2403074	Albertville	city
4	AL	1	123	1132	2403077	Alexander City	city

The merge statement takes two pandas dataframes and combines them. Notice in the merge command below we have trips,fips. The dataframe trips is known as the "left" dataframe and fips, the "right". I want to combine these columns based on the numeric fips values (state and county). These fields have different names in each of the dataframes, so I need to specify the names (in the same order) for the left (trips) and right (fips) dataframes. The option how='left' tells pandas to always keep rows from trips whether a match is made with fips or not. Note that if a row in the fips table doesn't match our trips data, it is discarded.

In [26]:

# join clean fips codes to data_trip
trips_fips = pd.merge(trips,fips, left_on=['st','cnty'], 
                      right_on=['state_fips','county_fips'],
      how='left', sort=False)

In [27]:

print(trips.shape)
print(trips_fips.shape)

(104709, 12)
(1561000, 19)

The result we get is disconcerting. What we wanted to achieve with the merge was to add text names to the trips table, and our merged dataset has 1.5 million rows rather than 104,709 (what we want). This probably occured because there were duplicate (or redundant information on fips codes in the fips table). To check that use the duplicate method to find rows with duplicate state and county fips codes.

In [28]:

fips[fips.duplicated(['state_fips','county_fips'])]['entity_type'].value_counts().head(20)

Out[28]:

township              12641
city                   8355
town                   8136
village                3974
County                 2555
borough                1136
Reservation             438
ANVSA                   221
CDP                     147
comunidad               124
OTSA                    101
zona urbana              79
charter township         78
Hawaiian Home Land       75
Municipio                58
SDTSA                    53
Rancheria                47
Parish                   38
Indian Reservation       36
Township                 35
Name: entity_type, dtype: int64

Unfortunately, this file is (apparently) for census workers in the field who may encounter lots of different place names, and this file helps their data-entry program map into the correct state and county fips codes. This isn't ideal for our purposes. Since our trips table already has the correct numeric fips codes, we need a fips table that has the correct and unique county/city names.

Grab another fips table from here: https://www.census.gov/geo/reference/codes/cou.html and choose United States.

For convenience, I'll save to my website.

In [29]:

fips = pd.io.parsers.read_csv('https://rlhick.people.wm.edu/pubp622/data/national_county.txt',
                              header=None, names=['state','state_fips','county_fips','county','fips_class_code'])
print(fips.shape)

(3235, 5)

In [30]:

fips.head()

Out[30]:

	state	state_fips	county_fips	county	fips_class_code
0	AL	1	1	Autauga County	H1
1	AL	1	3	Baldwin County	H1
2	AL	1	5	Barbour County	H1
3	AL	1	7	Bibb County	H1
4	AL	1	9	Blount County	H1

Having worked with the data before, I know that counties/entities like Virginia Beach City are labeled "Virginia Beach city". Let's fix that:

In [31]:

fips["county"] = fips.county.str.replace("city","City")

Check for duplicates:

In [32]:

fips.duplicated(['state_fips','county_fips'])
print("Number of duplicate rows:", fips[fips.duplicated(['state_fips','county_fips'])].size)

Number of duplicate rows: 0

In [33]:

trips_fips = pd.merge(trips,fips, left_on=['st','cnty'], 
                      right_on=['state_fips','county_fips'],
                      how='left', sort=False)
print(trips.shape)
print(trips_fips.shape)

(104709, 12)
(104709, 17)

In [34]:

trips_fips.head()

Out[34]:

	id_code	year	wave	intercept_date	st	prim1	prim1_common	prim2	prim2_common	cnty	ffdays12	ffdays2	state	state_fips	county_fips	county	fips_class_code
0	1171020101114001	2010.0	6.0	2010-11-14	9	8835020102	STRIPED BASS			7	25.0	5.0	CT	9.0	7.0	Middlesex County	H4
1	1171020101114002	2010.0	6.0	2010-11-14	9	8835020102	STRIPED BASS			7	10.0	2.0	CT	9.0	7.0	Middlesex County	H4
2	1171020101116001	2010.0	6.0	2010-11-16	9	8835250101	BLUEFISH	8835020102	STRIPED BASS	11	12.0	6.0	CT	9.0	11.0	New London County	H4
3	1171020101123001	2010.0	6.0	2010-11-23	9	8835020102	STRIPED BASS	8835250101	BLUEFISH	7	90.0	25.0	CT	9.0	7.0	Middlesex County	H4
4	1171020101123002	2010.0	6.0	2010-11-23	9	8835020102	STRIPED BASS	8835250101	BLUEFISH	7	75.0	20.0	CT	9.0	7.0	Middlesex County	H4

Let's rename our state and intercept column to make it more intuitive.

In [35]:

trips_fips.rename(columns = {'county': 'county_intercept','state':'state_intercept'},inplace=True)
trips_fips.head()

Out[35]:

	id_code	year	wave	intercept_date	st	prim1	prim1_common	prim2	prim2_common	cnty	ffdays12	ffdays2	state_intercept	state_fips	county_fips	county_intercept	fips_class_code
0	1171020101114001	2010.0	6.0	2010-11-14	9	8835020102	STRIPED BASS			7	25.0	5.0	CT	9.0	7.0	Middlesex County	H4
1	1171020101114002	2010.0	6.0	2010-11-14	9	8835020102	STRIPED BASS			7	10.0	2.0	CT	9.0	7.0	Middlesex County	H4
2	1171020101116001	2010.0	6.0	2010-11-16	9	8835250101	BLUEFISH	8835020102	STRIPED BASS	11	12.0	6.0	CT	9.0	11.0	New London County	H4
3	1171020101123001	2010.0	6.0	2010-11-23	9	8835020102	STRIPED BASS	8835250101	BLUEFISH	7	90.0	25.0	CT	9.0	7.0	Middlesex County	H4
4	1171020101123002	2010.0	6.0	2010-11-23	9	8835020102	STRIPED BASS	8835250101	BLUEFISH	7	75.0	20.0	CT	9.0	7.0	Middlesex County	H4

Working with Indices¶

Let's create an index on these new state and county fields:

In [36]:

trips_fips.set_index(['state_intercept','county_intercept'],inplace=True)
trips_fips.head()

Out[36]:

		id_code	year	wave	intercept_date	st	prim1	prim1_common	prim2	prim2_common	cnty	ffdays12	ffdays2	state_fips	county_fips	fips_class_code
state_intercept	county_intercept
CT	Middlesex County	1171020101114001	2010.0	6.0	2010-11-14	9	8835020102	STRIPED BASS			7	25.0	5.0	9.0	7.0	H4
	Middlesex County	1171020101114002	2010.0	6.0	2010-11-14	9	8835020102	STRIPED BASS			7	10.0	2.0	9.0	7.0	H4
	New London County	1171020101116001	2010.0	6.0	2010-11-16	9	8835250101	BLUEFISH	8835020102	STRIPED BASS	11	12.0	6.0	9.0	11.0	H4
	Middlesex County	1171020101123001	2010.0	6.0	2010-11-23	9	8835020102	STRIPED BASS	8835250101	BLUEFISH	7	90.0	25.0	9.0	7.0	H4
	Middlesex County	1171020101123002	2010.0	6.0	2010-11-23	9	8835020102	STRIPED BASS	8835250101	BLUEFISH	7	75.0	20.0	9.0	7.0	H4

Why use an index for these state and county codes? The biggest reason is that we can easily access values for looking at our data. Here we have 2 month avidity for anglers sampled in James City County:

In [37]:

trips_fips['ffdays2'].loc['VA','James City County']

/home/robhicks/anaconda/envs/python36/lib/python3.6/site-packages/ipykernel_launcher.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  """Entry point for launching an IPython kernel.

Out[37]:

state_intercept  county_intercept 
VA               James City County     2.0
                 James City County     1.0
                 James City County     0.0
                 James City County     0.0
                 James City County     4.0
                 James City County     6.0
                 James City County     6.0
                 James City County     6.0
                 James City County     6.0
                 James City County     3.0
                 James City County     3.0
                 James City County     8.0
                 James City County    20.0
                 James City County     1.0
                 James City County    12.0
                 James City County     2.0
                 James City County     1.0
                 James City County     8.0
                 James City County     0.0
                 James City County     4.0
                 James City County     2.0
                 James City County     2.0
                 James City County     0.0
                 James City County     0.0
                 James City County     0.0
                 James City County     0.0
                 James City County    15.0
                 James City County    15.0
                 James City County     3.0
Name: ffdays2, dtype: float64

In [38]:

# we want to create a richer index that includes the prim1_common species name:
# 1. put state_intercept and county_intercept back in data
trips_fips.reset_index(inplace=True)
trips_fips.head()
# 2. create new multiindex
trips_fips.set_index(['state_intercept','county_intercept','prim1_common'],inplace=True)

In [39]:

trips_fips.head()

Out[39]:

			id_code	year	wave	intercept_date	st	prim1	prim2	prim2_common	cnty	ffdays12	ffdays2	state_fips	county_fips	fips_class_code
state_intercept	county_intercept	prim1_common
CT	Middlesex County	STRIPED BASS	1171020101114001	2010.0	6.0	2010-11-14	9	8835020102			7	25.0	5.0	9.0	7.0	H4
	Middlesex County	STRIPED BASS	1171020101114002	2010.0	6.0	2010-11-14	9	8835020102			7	10.0	2.0	9.0	7.0	H4
	New London County	BLUEFISH	1171020101116001	2010.0	6.0	2010-11-16	9	8835250101	8835020102	STRIPED BASS	11	12.0	6.0	9.0	11.0	H4
	Middlesex County	STRIPED BASS	1171020101123001	2010.0	6.0	2010-11-23	9	8835020102	8835250101	BLUEFISH	7	90.0	25.0	9.0	7.0	H4
	Middlesex County	STRIPED BASS	1171020101123002	2010.0	6.0	2010-11-23	9	8835020102	8835250101	BLUEFISH	7	75.0	20.0	9.0	7.0	H4

In [40]:

trips_fips.loc['VA','Virginia Beach City','RED DRUM'].head()

/home/robhicks/anaconda/envs/python36/lib/python3.6/site-packages/ipykernel_launcher.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  """Entry point for launching an IPython kernel.

Out[40]:

			id_code	year	wave	intercept_date	st	prim1	prim2	prim2_common	cnty	ffdays12	ffdays2	state_fips	county_fips	fips_class_code
state_intercept	county_intercept	prim1_common
VA	Virginia Beach City	RED DRUM	1681420101127002	2010.0	6.0	2010-11-27	51	8835440901	8835440102	SPOTTED SEATROUT	810	40.0	10.0	51.0	810.0	C7
		RED DRUM	1681420101127004	2010.0	6.0	2010-11-27	51	8835440901	8835440102	SPOTTED SEATROUT	810	150.0	24.0	51.0	810.0	C7
		RED DRUM	1681420101127005	2010.0	6.0	2010-11-27	51	8835440901	8835250101	BLUEFISH	810	50.0	20.0	51.0	810.0	C7
		RED DRUM	1681420101126003	2010.0	6.0	2010-11-26	51	8835440901	8835440102	SPOTTED SEATROUT	810	8.0	4.0	51.0	810.0	C7
		RED DRUM	1681420101126004	2010.0	6.0	2010-11-26	51	8835440901			810	150.0	35.0	51.0	810.0	C7

In [41]:

trips_fips.loc['VA',:,'STRIPED BASS'].head()

Out[41]:

	id_code	year	wave	intercept_date	st	prim1	prim2	prim2_common	cnty	ffdays12	ffdays2	state_fips	county_fips	fips_class_code
county_intercept
Virginia Beach City	1662020101110001	2010.0	6.0	2010-11-10	51	8835020102			810	14.0	5.0	51.0	810.0	C7
Virginia Beach City	1662020101110002	2010.0	6.0	2010-11-10	51	8835020102			810	10.0	5.0	51.0	810.0	C7
Hampton City	1680820101122001	2010.0	6.0	2010-11-22	51	8835020102	8835440901	RED DRUM	650	80.0	24.0	51.0	650.0	C7
Virginia Beach City	1681420101121005	2010.0	6.0	2010-11-21	51	8835020102			810	4.0	4.0	51.0	810.0	C7
Virginia Beach City	1681420101121012	2010.0	6.0	2010-11-21	51	8835020102			810	2.0	0.0	51.0	810.0	C7

Ok, indexes are pretty cool, but what if we need the county data for doing summary statistics? Example, suppose we want to get a frequency count of trips by county in VA targeting Striped Bass? Notice, we can't just access index values like you would a column of data:

In [42]:

trips_fips.county_intercept

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-42-fd63cf86b5c1> in <module>()
----> 1 trips_fips.county_intercept

~/anaconda/envs/python36/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615 
   3616     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'county_intercept'

So the strategy will be to copy the index values back into our table so we can use it for other purposes:

(Note: you can also keep these columns in your data by including the option drop=False when you set_index())

In [43]:

print(trips_fips.index.get_level_values(0))
trips_fips['state_intercept'] = trips_fips.index.get_level_values(0)
trips_fips['county_intercept'] = trips_fips.index.get_level_values(1)
trips_fips['species_name'] = trips_fips.index.get_level_values(2)

Index(['CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT',
       ...
       'MS', 'MS', 'MS', 'MS', 'MS', 'MS', 'MS', 'MS', 'MS', 'MS'],
      dtype='object', name='state_intercept', length=104709)

In [44]:

trips_fips.head()

Out[44]:

			id_code	year	wave	intercept_date	st	prim1	prim2	prim2_common	cnty	ffdays12	ffdays2	state_fips	county_fips	fips_class_code	state_intercept	county_intercept	species_name
state_intercept	county_intercept	prim1_common
CT	Middlesex County	STRIPED BASS	1171020101114001	2010.0	6.0	2010-11-14	9	8835020102			7	25.0	5.0	9.0	7.0	H4	CT	Middlesex County	STRIPED BASS
	Middlesex County	STRIPED BASS	1171020101114002	2010.0	6.0	2010-11-14	9	8835020102			7	10.0	2.0	9.0	7.0	H4	CT	Middlesex County	STRIPED BASS
	New London County	BLUEFISH	1171020101116001	2010.0	6.0	2010-11-16	9	8835250101	8835020102	STRIPED BASS	11	12.0	6.0	9.0	11.0	H4	CT	New London County	BLUEFISH
	Middlesex County	STRIPED BASS	1171020101123001	2010.0	6.0	2010-11-23	9	8835020102	8835250101	BLUEFISH	7	90.0	25.0	9.0	7.0	H4	CT	Middlesex County	STRIPED BASS
	Middlesex County	STRIPED BASS	1171020101123002	2010.0	6.0	2010-11-23	9	8835020102	8835250101	BLUEFISH	7	75.0	20.0	9.0	7.0	H4	CT	Middlesex County	STRIPED BASS

This allows us to use index lookups and also the index values for calculating interesting summary data. The following calculates the top 10 targeted species in VA:

In [45]:

trips_fips.loc['VA'].species_name.value_counts().head(10)

Out[45]:

                    1794
SUMMER FLOUNDER      972
STRIPED BASS         710
ATLANTIC CROAKER     448
SPOTTED SEATROUT     154
TAUTOG               153
SPOT                 143
COBIA                117
RED DRUM              66
BLUEFISH              50
Name: species_name, dtype: int64

This calculates the top targeted species in Gloucester County, VA.

In [46]:

trips_fips.loc['VA','Gloucester County',:].species_name.value_counts()

Out[46]:

                    37
STRIPED BASS        23
ATLANTIC CROAKER    20
SPOT                18
SUMMER FLOUNDER     15
COBIA                7
SPANISH MACKEREL     2
BLUEFISH             1
Name: species_name, dtype: int64

What about top 10 counties in VA with respect to numbers of intercepted trips?

In [47]:

print(trips_fips.loc['VA'].county_intercept.value_counts().head(10))

Virginia Beach City    1876
Hampton City            553
Accomack County         505
Norfolk City            440
Newport News City       306
Northampton County      297
Middlesex County        215
Gloucester County       123
Mathews County           67
Portsmouth City          64
Name: county_intercept, dtype: int64

This tells us when summer flounder is being targeted in VA:

In [48]:

trips_fips.loc['VA',:,'SUMMER FLOUNDER'].intercept_date.dt.month.value_counts()

Out[48]:

6     190
7     176
5     167
8     163
4     137
9      95
10     31
3      12
11      1
Name: intercept_date, dtype: int64

Note we used the data/time field intercept_date to extract the month using the dt.month method. Pandas has immensely rich datetime capabilities that we won't demonstrate here.

Groupby¶

We can compute interesting data by grouping over column values and using our index for fast selection. Suppose we want to know total number of trips in VA by county of intercept for each wave (wave 1 = Jan,Feb; 2 = Mar,Apr; etc). Groupby lets us do this:

In [49]:

trips_fips.loc['VA'].groupby(['county_intercept','wave'])['id_code'].count().head(25)

Out[49]:

county_intercept      wave
Accomack County       2.0     101
                      3.0     172
                      4.0     181
                      5.0      50
                      6.0       1
Chesapeake City       2.0      10
                      3.0       8
                      4.0       5
                      5.0       3
                      6.0      33
Essex County          3.0      19
                      4.0       3
                      6.0      15
Gloucester County     3.0      40
                      4.0      32
                      5.0      35
                      6.0      16
Hampton City          2.0      59
                      3.0     108
                      4.0     173
                      5.0     141
                      6.0      72
Isle of Wight County  2.0      17
                      5.0       9
James City County     3.0      13
Name: id_code, dtype: int64

Stack and Unstack¶

We can manipulate the result above further to make it more amenable for comparisons and graphing (e.g. we want all of the wave 2 records on the same row):

In [50]:

t = trips_fips.loc['VA'].groupby(['county_intercept','wave'])['id_code'].count()
t.unstack().head(25)

Out[50]:

wave	2.0	3.0	4.0	5.0	6.0
county_intercept
Accomack County	101.0	172.0	181.0	50.0	1.0
Chesapeake City	10.0	8.0	5.0	3.0	33.0
Essex County	NaN	19.0	3.0	NaN	15.0
Gloucester County	NaN	40.0	32.0	35.0	16.0
Hampton City	59.0	108.0	173.0	141.0	72.0
Isle of Wight County	17.0	NaN	NaN	9.0	NaN
James City County	NaN	13.0	11.0	5.0	NaN
King William County	NaN	NaN	9.0	NaN	NaN
Mathews County	NaN	20.0	30.0	2.0	15.0
Middlesex County	NaN	64.0	60.0	63.0	28.0
Newport News City	25.0	63.0	79.0	74.0	65.0
Norfolk City	18.0	125.0	159.0	98.0	40.0
Northampton County	7.0	73.0	108.0	49.0	60.0
Northumberland County	NaN	11.0	8.0	5.0	4.0
Poquoson City	NaN	23.0	11.0	12.0	14.0
Portsmouth City	NaN	31.0	13.0	NaN	20.0
Richmond County	NaN	21.0	4.0	4.0	NaN
Suffolk City	NaN	11.0	13.0	9.0	NaN
Surry County	NaN	2.0	2.0	2.0	NaN
Virginia Beach City	382.0	427.0	412.0	341.0	314.0
Westmoreland County	5.0	5.0	4.0	NaN	3.0
York County	NaN	19.0	9.0	5.0	10.0

Or we could put waves on the same row:

In [51]:

t.unstack(0).head(25)

Out[51]:

county_intercept	Accomack County	Chesapeake City	Essex County	Gloucester County	Hampton City	Isle of Wight County	James City County	King William County	Mathews County	Middlesex County	...	Northampton County	Northumberland County	Poquoson City	Portsmouth City	Richmond County	Suffolk City	Surry County	Virginia Beach City	Westmoreland County	York County
wave
2.0	101.0	10.0	NaN	NaN	59.0	17.0	NaN	NaN	NaN	NaN	...	7.0	NaN	NaN	NaN	NaN	NaN	NaN	382.0	5.0	NaN
3.0	172.0	8.0	19.0	40.0	108.0	NaN	13.0	NaN	20.0	64.0	...	73.0	11.0	23.0	31.0	21.0	11.0	2.0	427.0	5.0	19.0
4.0	181.0	5.0	3.0	32.0	173.0	NaN	11.0	9.0	30.0	60.0	...	108.0	8.0	11.0	13.0	4.0	13.0	2.0	412.0	4.0	9.0
5.0	50.0	3.0	NaN	35.0	141.0	9.0	5.0	NaN	2.0	63.0	...	49.0	5.0	12.0	NaN	4.0	9.0	2.0	341.0	NaN	5.0
6.0	1.0	33.0	15.0	16.0	72.0	NaN	NaN	NaN	15.0	28.0	...	60.0	4.0	14.0	20.0	NaN	NaN	NaN	314.0	3.0	10.0

5 rows × 22 columns

We can combine these with plot to quickly make cool charts. Let's focus on the peninsula:

In [52]:

t.unstack(0)[['York County','Newport News City','Hampton City','James City County']].plot(kind='bar',figsize=(8, 8))
plt.title("Number of Intercepted Anglers in Dataset")
plt.show()

We can also stack data we have unstacked:

In [53]:

t_unstacked = t.unstack(0)
t_unstacked.stack()

Out[53]:

wave  county_intercept     
2.0   Accomack County          101.0
      Chesapeake City           10.0
      Hampton City              59.0
      Isle of Wight County      17.0
      Newport News City         25.0
      Norfolk City              18.0
      Northampton County         7.0
      Virginia Beach City      382.0
      Westmoreland County        5.0
3.0   Accomack County          172.0
      Chesapeake City            8.0
      Essex County              19.0
      Gloucester County         40.0
      Hampton City             108.0
      James City County         13.0
      Mathews County            20.0
      Middlesex County          64.0
      Newport News City         63.0
      Norfolk City             125.0
      Northampton County        73.0
      Northumberland County     11.0
      Poquoson City             23.0
      Portsmouth City           31.0
      Richmond County           21.0
      Suffolk City              11.0
      Surry County               2.0
      Virginia Beach City      427.0
      Westmoreland County        5.0
      York County               19.0
4.0   Accomack County          181.0
                               ...  
5.0   Isle of Wight County       9.0
      James City County          5.0
      Mathews County             2.0
      Middlesex County          63.0
      Newport News City         74.0
      Norfolk City              98.0
      Northampton County        49.0
      Northumberland County      5.0
      Poquoson City             12.0
      Richmond County            4.0
      Suffolk City               9.0
      Surry County               2.0
      Virginia Beach City      341.0
      York County                5.0
6.0   Accomack County            1.0
      Chesapeake City           33.0
      Essex County              15.0
      Gloucester County         16.0
      Hampton City              72.0
      Mathews County            15.0
      Middlesex County          28.0
      Newport News City         65.0
      Norfolk City              40.0
      Northampton County        60.0
      Northumberland County      4.0
      Poquoson City             14.0
      Portsmouth City           20.0
      Virginia Beach City      314.0
      Westmoreland County        3.0
      York County               10.0
Length: 84, dtype: float64

Pivot Table¶

Borrowing from excel, pandas can create pivot tables of summary statistics. Suppose we want counties as rows and waves as columns, with the average and standard deviation of 2 and 12 month avidity.

In [54]:

pd.pivot_table(trips_fips.loc['VA'],index='county_intercept',columns='wave',
               values=['ffdays2','ffdays12'],aggfunc=[np.mean, np.std],fill_value="-")

Out[54]:

	mean										std
	ffdays12					ffdays2					ffdays12					ffdays2
wave	2.0	3.0	4.0	5.0	6.0	2.0	3.0	4.0	5.0	6.0	2.0	3.0	4.0	5.0	6.0	2.0	3.0	4.0	5.0	6.0
county_intercept
Accomack County	21.5644	9.66279	11.116	16.7	50	2.77228	2.2907	3.52486	4.42	10	35.48	17.5116	19.1492	37.3195	-	7.306	4.69507	6.13059	8.09406	-
Chesapeake City	13.9	26.625	36	23.3333	45.0303	3.2	5.75	8	4.66667	10.0606	16.2921	34.5912	13.4164	5.7735	38.8945	4.1042	7.75979	0	1.1547	9.45033
Essex County	-	25.4211	60.6667	-	12.0667	-	1.57895	12	-	2.73333	-	38.4134	59.0028	-	11.4671	-	2.65237	15.6205	-	3.78845
Gloucester County	-	23.575	44.625	49.4857	39.0625	-	4.05	8	9.74286	6.5	-	32.052	63.0385	166.512	48.0354	-	5.50501	11.3564	17.4007	5.68038
Hampton City	38.0169	22.963	27.289	21.9007	40.4583	2.74576	3.61111	4.90751	3.42553	5.45833	131.304	97.6488	108.988	89.3672	163.965	12.7814	10.1951	12.3798	9.48851	16.6783
Isle of Wight County	24.4706	-	-	26.7778	-	3	-	-	6.11111	-	42.891	-	-	32.8663	-	7.6567	-	-	11.0277	-
James City County	-	23.4615	31.8182	4.2	-	-	3.84615	6.63636	1.4	-	-	34.8966	44.4248	3.42053	-	-	5.44436	5.39023	1.67332	-
King William County	-	-	9.66667	-	-	-	-	5	-	-	-	-	9.52628	-	-	-	-	6.38357	-	-
Mathews County	-	66.45	82.3333	21	21.3333	-	7.1	11.8333	3.5	7.2	-	220.903	249.337	12.7279	26.9064	-	21.6087	24.0862	2.12132	8.32552
Middlesex County	-	10.6094	24.9667	17.7619	24.9643	-	1.98438	5.25	4.46032	4.85714	-	15.1549	37.8807	30.0562	31.1121	-	2.75158	7.06057	5.87182	5.13315
Newport News City	96.88	40.873	12.519	16.473	29.1077	8.44	1.71429	3.11392	4.48649	4.55385	272.163	175.237	20.6664	17.7704	39.9438	26.9754	3.29432	3.26594	3.73889	5.50009
Norfolk City	25.7222	23.472	14.3585	25.2449	26.075	1.77778	2.408	3.04403	6.67347	4.875	30.9949	94.8706	32.0538	32.6039	24.6341	2.51011	9.63034	4.46979	8.92671	4.83145
Northampton County	37.2857	15.9041	14.3981	28.4082	18.4167	3.42857	3.27397	4.91667	6.55102	2.53333	53.3376	24.746	19.5127	37.2111	21.5896	4.50397	4.29231	6.72994	6.96438	3.47127
Northumberland County	-	2.18182	8.125	23.6	749.25	-	0.181818	1.375	4.6	73.5	-	5.92989	10.723	24.4704	498.167	-	0.603023	2.72226	4.219	49
Poquoson City	-	53.5217	34.7273	22	11.7857	-	7.6087	10.1818	7.83333	2.5	-	65.9345	24.7552	16.7549	10.4527	-	6.44374	4.95617	5.74984	2.71038
Portsmouth City	-	12.9677	16.6154	-	47.15	-	3.45161	5.46154	-	10.45	-	14.0866	13.4632	-	54.047	-	5.78411	4.31307	-	13.9302
Richmond County	-	14.8571	2	6	-	-	4	2	1.75	-	-	23.5548	2.3094	3.26599	-	-	4.62601	2.3094	1.70783	-
Suffolk City	-	7.63636	12.6154	9	-	-	2.72727	3.84615	4.22222	-	-	6.32887	15.7244	6.80074	-	-	3.60807	4.68768	3.66667	-
Surry County	-	0	6	87.5	-	-	0	5	22	-	-	0	5.65685	88.3883	-	-	0	4.24264	19.799	-
Virginia Beach City	28.0864	29.4567	18.8908	20.0792	31.7675	2.5445	3.60187	3.76456	4.94428	5.53822	42.5248	91.2589	74.7389	61.2486	100.923	6.58973	6.74107	9.04949	9.29642	8.54943
Westmoreland County	25.4	20.6	1.25	-	0.333333	8	2.8	0.25	-	0	32.416	14.0285	2.5	-	0.57735	10.9545	2.16795	0.5	-	0
York County	-	29.9474	36.2222	11	41.5	-	4.78947	7.77778	3.8	6.6	-	33.1553	37.8741	12.5698	40.6072	-	3.85255	10.1091	5.01996	7.8344

Exporting Data from Pandas¶

You may want to export dataframes and other information to other packages (e.g. latex, excel, or stata). Pandas has a lot of to_ methods on pandas dataframes and series. For example we can export some interesting information to latex table code:

In [55]:

(t.unstack().head(25)).to_latex('excel_example.tex',na_rep='-')

In [57]:

(t.unstack().head(25)).to_excel('excel_example.xls')

Econometrics in Python¶

In [58]:

import statsmodels.formula.api as smf    

# load data and create dataframe
tobias_koop=pd.read_csv('https://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')

tobias_koop.head()

Out[58]:

	id	educ	ln_wage	pexp	time	ability	meduc	feduc	broken_home	siblings	pexp2
0	4	12	2.14	2	4	0.26	12	10	1	4	4
1	6	15	1.91	4	4	0.44	12	16	0	2	16
2	8	13	2.32	8	4	0.51	12	15	1	2	64
3	11	14	1.64	1	4	1.82	16	17	1	2	1
4	12	13	2.16	6	4	-1.30	13	12	0	5	36

In [59]:

formula = "ln_wage ~ educ + pexp + pexp2 + broken_home"
results = smf.ols(formula,tobias_koop).fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                ln_wage   R-squared:                       0.166
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     51.36
Date:                Mon, 12 Mar 2018   Prob (F-statistic):           1.83e-39
Time:                        15:04:02   Log-Likelihood:                -583.66
No. Observations:                1034   AIC:                             1177.
Df Residuals:                    1029   BIC:                             1202.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.4603      0.137      3.353      0.001       0.191       0.730
educ            0.0853      0.009      9.179      0.000       0.067       0.104
pexp            0.2035      0.024      8.629      0.000       0.157       0.250
pexp2          -0.0124      0.002     -5.438      0.000      -0.017      -0.008
broken_home    -0.0087      0.036     -0.244      0.807      -0.079       0.061
==============================================================================
Omnibus:                       55.892   Durbin-Watson:                   1.761
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              112.050
Skew:                          -0.355   Prob(JB):                     4.66e-25
Kurtosis:                       4.448   Cond. No.                         391.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.