Pandas Primer

In [2]:
from __future__ import division
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import urllib

Introduction to Pandas

Pandas works hand-in-hand with other python libraries (e.g. matplotlib and numpy) to make manipulating data (what the Pandas team calls "Data Munging") easy. With pandas it is easy to

  1. Easily access data using variable names, but have full linear algebra capabilities of numpy
  2. Group data by values (for example, calculate the mean income by state)
  3. Plot and summarize values
  4. Join (combine) different sources of data
  5. Powerful time series and panel data capabilities (beyond scope of course)

Note: If you want to reinforce some of these concepts on your own, I recommend this superb youtube video (https://www.youtube.com/watch?v=5JnMutdy6Fw) and accompanying coursework (https://github.com/brandon-rhodes/pycon-pandas-tutorial) by Brandon Rhodes.

There are numerous ways to get data into Pandas:

  • Import excel and comma delimited data
  • Import stata, sas, matlab, and other datasets
  • Import data from an SQL server
  • Import data scraped from the web
  • Manually building (by typing in values)

In this tutorial we will focus on the first two methods for reading data, but just know that there is probably a way to get your data into pandas irrespective of what format it is in.

Loading and Cleaning Data

We will be loading 2010 "Trip" data from the NOAA Fisheries Service Recreational Fisheries Statistics Survey (called MRIP). More detail here: https://www.fisheries.noaa.gov/topic/recreational-fishing-data

We will also load the Census Bureau's county and state fips file, obtained here: https://www.census.gov/2010census/xls/fips_codes_website.xls

The MRIPs data needs to be downloaded from my website (you only need to do this once).

In [3]:
# run this once and then comment out with #
# urllib.urlretrieve ("https://rlhick.people.wm.edu/pubp622/data/mrips_2010.pydata", "mrips_2010.pydata")
Out[3]:
('mrips_2010.pydata', <httplib.HTTPMessage instance at 0x7f1dda4a1128>)
In [5]:
trips_2010 = pd.io.pickle.read_pickle('mrips_2010.pydata')
fips = pd.io.excel.read_excel('https://rlhick.people.wm.edu/pubp622/data/fips_codes_website.xls')
In [10]:
trips_2010.head()
Out[10]:
index add_hrs add_ph age area area_nc area_x art_reef asg_code boat_hrs ... time tourn tsn1 tsn2 turtle wave wp_int year zip intercept_date
0 0 NaN 1 NaN C 5 NaN 1171020101114 0 ... 1330 NaN 167680 NaN 6 3411.891063 2010 06475 2010-11-14
1 1 NaN 2 NaN C 5 NaN 1171020101114 0 ... 1332 NaN 167680 NaN 6 3411.891063 2010 06475 2010-11-14
2 2 0 1 NaN C 5 NaN 1171020101116 0 ... 1130 NaN 168559 167680 NaN 6 17548.885633 2010 06110 2010-11-16
3 3 NaN 1 NaN C 5 NaN 1171020101123 0 ... 1200 NaN 167680 168559 NaN 6 4489.923579 2010 06416 2010-11-23
4 4 NaN 1 NaN C 5 NaN 1171020101123 0 ... 1215 NaN 167680 168559 NaN 6 4489.923579 2010 06067 2010-11-23

5 rows × 80 columns

Trips is a very big dataset with lots of columns we'll never use. Let's trim it down:

In [9]:
trips_2010.columns
Out[9]:
Index([u'index', u'add_hrs', u'add_ph', u'age', u'area', u'area_nc', u'area_x',
       u'art_reef', u'asg_code', u'boat_hrs', u'catch', u'celltype',
       u'cntrbtrs', u'cnty', u'cnty_res', u'coastal', u'compflag', u'county',
       u'date1', u'dist', u'distkeys', u'f_by_p', u'ffdays12', u'ffdays2',
       u'first', u'fshinsp_a', u'gear', u'gender', u'hrs_dtd', u'hrsf',
       u'id_code', u'intsite', u'kod', u'leader', u'license', u'mode2001',
       u'mode_asg', u'mode_f', u'mode_fx', u'monitor', u'month', u'muni_res',
       u'muni_trp', u'new_list', u'num_fish_a', u'num_typ2', u'num_typ3',
       u'num_typ4', u'num_typ6', u'num_typ9', u'on_list', u'party', u'prim1',
       u'prim1_common', u'prim2', u'prim2_common', u'prt_code', u'psu_id',
       u'pvt_res', u'reefcode', u'reg_res', u'region', u'rig', u'sep_fish',
       u'st', u'st_res', u'strat_id', u'strat_interval', u'sub_reg',
       u'telefon', u'time', u'tourn', u'tsn1', u'tsn2', u'turtle', u'wave',
       u'wp_int', u'year', u'zip', u'intercept_date'],
      dtype='object')
In [7]:
trips = trips_2010[['id_code','year','wave','intercept_date','st','prim1',
                    'prim1_common','prim2','prim2_common','cnty','ffdays12',
                    'ffdays2']]

This is summary statistics for numeric data columns:

In [24]:
trips.describe()
Out[24]:
year wave st cnty ffdays12 ffdays2
count 104709 104709.000000 104709.000000 104519.000000 104519.000000 104519.000000
mean 2010 3.794879 23.974615 78.798171 40.771439 6.191936
std 0 1.347586 13.160875 130.751209 122.456004 13.037239
min 2010 1.000000 1.000000 1.000000 0.000000 0.000000
25% 2010 3.000000 12.000000 19.000000 1.000000 0.000000
50% 2010 4.000000 22.000000 55.000000 10.000000 2.000000
75% 2010 5.000000 37.000000 99.000000 35.000000 7.000000
max 2010 6.000000 51.000000 810.000000 999.000000 99.000000

Using Pandas DataFrames

Referencing columns in data:
In [13]:
# copy data frame into t (note, this is different than assignment [t=trips])
t=trips.copy()

t.ffdays12.head()
Out[13]:
0    25
1    10
2    12
3    90
4    75
Name: ffdays12, dtype: float64
In [14]:
t['ffdays12'].head()
Out[14]:
0    25
1    10
2    12
3    90
4    75
Name: ffdays12, dtype: float64
Referencing rows in data:

We can use numpy-like slicing:

In [17]:
# rows 50-54
t.loc[51:55]
Out[17]:
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2
51 1171020101111005 2010 6 2010-11-11 9 8839010101 TAUTOG 7 35 10
52 1171020101111006 2010 6 2010-11-11 9 8839010101 TAUTOG 7 30 10
53 1171020101113001 2010 6 2010-11-13 9 8839010101 TAUTOG 11 10 8
54 1171020101113002 2010 6 2010-11-13 9 8839010101 TAUTOG 11 10 8
55 1171020101113003 2010 6 2010-11-13 9 8839010101 TAUTOG 11 3 2
In [18]:
# rows 1 and 2 for the first 5 columns
t.iloc[0:2,0:5]
Out[18]:
id_code year wave intercept_date st
0 1171020101114001 2010 6 2010-11-14 9
1 1171020101114002 2010 6 2010-11-14 9

We can select rows of data based on column values. Let's select all the rows where st is 9 (Connecticut)

In [19]:
t[t.st == 9].head()
Out[19]:
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2
0 1171020101114001 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 25 5
1 1171020101114002 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 10 2
2 1171020101116001 2010 6 2010-11-16 9 8835250101 BLUEFISH 8835020102 STRIPED BASS 11 12 6
3 1171020101123001 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 90 25
4 1171020101123002 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 75 20

Find all rows where days spent fishing during past 12 months exceeds 10:

In [20]:
t[t.ffdays12>10].head()
Out[20]:
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2
0 1171020101114001 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 25 5
2 1171020101116001 2010 6 2010-11-16 9 8835250101 BLUEFISH 8835020102 STRIPED BASS 11 12 6
3 1171020101123001 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 90 25
4 1171020101123002 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 75 20
5 1171020101123003 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 70 10
Math on columns:
In [21]:
# we can create a new variable and use numpy commands:
t['temp'] = t.ffdays12 + .5*np.random.randn(t.shape[0])
t.head()
Out[21]:
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2 temp
0 1171020101114001 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 25 5 25.321303
1 1171020101114002 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 10 2 10.281614
2 1171020101116001 2010 6 2010-11-16 9 8835250101 BLUEFISH 8835020102 STRIPED BASS 11 12 6 12.161692
3 1171020101123001 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 90 25 89.333737
4 1171020101123002 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 75 20 75.887685
In [22]:
t.temp.mean()
Out[22]:
40.768852202237284
In [23]:
# or we could do this all in one step if we didn't want to create a new column in t:
(t.ffdays12 + .5*np.random.randn(t.shape[0])).mean()
Out[23]:
40.77045249565499

Note: You can combine pandas with numpy. This is the standard deviation of the column ffdays12:

In [15]:
np.std(t.ffdays12)
Out[15]:
122.4554184530213

We can perform matrix/vector operations on pandas data.

Here, we can transpose a slice of the data:

In [16]:
t[['ffdays12','ffdays2']].head(10).T
Out[16]:
0 1 2 3 4 5 6 7 8 9
ffdays12 25 10 12 90 75 70 10 15 1 250
ffdays2 5 2 6 25 20 10 2 0 0 50

We can do matrix multiplication:

In [17]:
np.dot(t[['ffdays12','ffdays2']].head(10).T,t[['ffdays12','ffdays2']].head(10))
Out[17]:
array([[ 82320.,  17187.],
       [ 17187.,   3694.]])
Summary Statistics

Find mean of ffdays12 and ffdays2

In [18]:
t[['ffdays12','ffdays2']].describe()
Out[18]:
ffdays12 ffdays2
count 104519.000000 104519.000000
mean 40.771439 6.191936
std 122.456004 13.037239
min 0.000000 0.000000
25% 1.000000 0.000000
50% 10.000000 2.000000
75% 35.000000 7.000000
max 999.000000 99.000000

Find mean of ffdays12 and ffdays2 by state:

In [21]:
t[['ffdays12','ffdays2']].groupby('st').describe()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-4c15adf3f6f3> in <module>()
----> 1 t[['ffdays12','ffdays2']].groupby('st').describe()

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze)
   3157         axis = self._get_axis_number(axis)
   3158         return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
-> 3159                        sort=sort, group_keys=group_keys, squeeze=squeeze)
   3160 
   3161     def asfreq(self, freq, method=None, how=None, normalize=False):

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in groupby(obj, by, **kwds)
   1197         raise TypeError('invalid type: %s' % type(obj))
   1198 
-> 1199     return klass(obj, by, **kwds)
   1200 
   1201 

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze)
    386         if grouper is None:
    387             grouper, exclusions, obj = _get_grouper(obj, keys, axis=axis,
--> 388                                                     level=level, sort=sort)
    389 
    390         self.obj = obj

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _get_grouper(obj, key, axis, level, sort)
   2146 
   2147         elif is_in_axis(gpr):  # df.groupby('name')
-> 2148             in_axis, name, gpr = True, gpr, obj[gpr]
   2149             exclusions.append(name)
   2150 

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1795             return self._getitem_multilevel(key)
   1796         else:
-> 1797             return self._getitem_column(key)
   1798 
   1799     def _getitem_column(self, key):

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1802         # get column
   1803         if self.columns.is_unique:
-> 1804             return self._get_item_cache(key)
   1805 
   1806         # duplicate columns & possible reduce dimensionaility

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1082         res = cache.get(item)
   1083         if res is None:
-> 1084             values = self._data.get(item)
   1085             res = self._box_item_values(item, values)
   1086             cache[item] = res

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   2849 
   2850             if not isnull(item):
-> 2851                 loc = self.items.get_loc(item)
   2852             else:
   2853                 indexer = np.arange(len(self.items))[isnull(self.items)]

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in get_loc(self, key, method)
   1570         """
   1571         if method is None:
-> 1572             return self._engine.get_loc(_values_from_object(key))
   1573 
   1574         indexer = self.get_indexer([key], method=method)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)()

KeyError: 'st'

What happened? We trimmed down the columns of t with the statement t[['ffdays12','ffdays2']] and pandas couldn't find the columns st. Let's try this again:

In [19]:
t.groupby('st')[['ffdays12','ffdays2']].describe()
Out[19]:
ffdays12 ffdays2
st
1 count 2319.000000 2319.000000
mean 54.470030 5.535576
std 173.295166 11.381926
min 0.000000 0.000000
25% 1.000000 0.000000
50% 8.000000 2.000000
75% 30.000000 6.000000
max 998.000000 98.000000
9 count 2068.000000 2068.000000
mean 19.050774 4.711315
std 45.799724 8.832107
min 0.000000 0.000000
25% 0.750000 0.000000
50% 5.000000 2.000000
75% 20.000000 6.000000
max 998.000000 98.000000
10 count 3691.000000 3691.000000
mean 33.415335 5.167163
std 126.399327 13.028735
min 0.000000 0.000000
25% 1.000000 0.000000
50% 6.000000 1.000000
75% 20.000000 5.000000
max 999.000000 99.000000
12 count 37354.000000 37354.000000
mean 59.632543 7.921401
std 154.875664 15.925836
min 0.000000 0.000000
25% 2.000000 0.000000
50% 20.000000 3.000000
... ... ... ...
37 std 86.518452 10.735914
min 0.000000 0.000000
25% 1.000000 0.000000
50% 6.000000 2.000000
75% 25.000000 6.000000
max 999.000000 99.000000
44 count 1488.000000 1488.000000
mean 20.715726 4.751344
std 39.349850 9.322511
min 0.000000 0.000000
25% 1.000000 0.000000
50% 5.000000 1.000000
75% 20.250000 5.000000
max 320.000000 60.000000
45 count 2867.000000 2867.000000
mean 50.676317 7.040112
std 167.737333 17.099136
min 0.000000 0.000000
25% 0.000000 0.000000
50% 7.000000 2.000000
75% 30.000000 6.000000
max 999.000000 99.000000
51 count 4822.000000 4822.000000
mean 24.962256 4.214641
std 81.259619 8.909940
min 0.000000 0.000000
25% 1.000000 0.000000
50% 6.000000 1.000000
75% 25.000000 5.000000
max 999.000000 98.000000

136 rows × 2 columns

Combining DataFrames

The state fips codes are hardly intuitive and even worse for the county fips codes. The goal of this section is to

  1. Add text-based labels
  2. Be able to use these labels for fast data selection

To do this, we will attempt to match text based state and county labels to our trips dataframe. Pandas calls this a merge.

In [20]:
fips.head()
Out[20]:
State Abbreviation State FIPS Code County FIPS Code FIPS Entity Code ANSI Code GU Name Entity Description
0 AL 1 67 124 2403054 Abbeville city
1 AL 1 73 460 2403063 Adamsville city
2 AL 1 117 820 2403069 Alabaster city
3 AL 1 95 988 2403074 Albertville city
4 AL 1 123 1132 2403077 Alexander City city

The fips codes (numeric) uniquely identify counties in our trips file. So let's look at the fips file (that we grabbed from Census) above and combine it with our trips data. These column names are messy: let's clean it up.

For me, column names should always:

  1. Be lower case
  2. Have no spaces or funky characters
In [21]:
names = ['state','state_fips','county_fips','fips','ansi_code',
         'county_name','entity_type']
fips.columns=names
fips.head()
Out[21]:
state state_fips county_fips fips ansi_code county_name entity_type
0 AL 1 67 124 2403054 Abbeville city
1 AL 1 73 460 2403063 Adamsville city
2 AL 1 117 820 2403069 Alabaster city
3 AL 1 95 988 2403074 Albertville city
4 AL 1 123 1132 2403077 Alexander City city

The merge statement takes two pandas dataframes and combines them. Notice in the merge command below we have trips,fips. The dataframe trips is known as the "left" dataframe and fips, the "right". I want to combine these columns based on the numeric fips values (state and county). These fields have different names in each of the dataframes, so I need to specify the names (in the same order) for the left (trips) and right (fips) dataframes. The option how='left' tells pandas to always keep rows from trips whether a match is made with fips or not. Note that if a row in the fips table doesn't match our trips data, it is discarded.

In [22]:
# join clean fips codes to data_trip
trips_fips = pd.merge(trips,fips, left_on=['st','cnty'], 
                      right_on=['state_fips','county_fips'],
      how='left', sort=False)
In [23]:
print trips.shape
print trips_fips.shape
(104709, 12)
(1561000, 19)

The result we get is disconcerting. What we wanted to achieve with the merge was to add text names to the trips table, and our merged dataset has 1.5 million rows rather than 104,709 (what we want). This probably occured because there were duplicate (or redundant information on fips codes in the fips table). To check that use the duplicate method to find rows with duplicate state and county fips codes.

In [24]:
fips[fips.duplicated(['state_fips','county_fips'])]['entity_type'].value_counts().head(20)
Out[24]:
township              12641
city                   8355
town                   8136
village                3974
County                 2555
borough                1136
Reservation             438
ANVSA                   221
CDP                     147
comunidad               124
OTSA                    101
zona urbana              79
charter township         78
Hawaiian Home Land       75
Municipio                58
SDTSA                    53
Rancheria                47
Parish                   38
Indian Reservation       36
Township                 35
dtype: int64

Unfortunately, this file is (apparently) for census workers in the field who may encounter lots of different place names, and this file helps their data-entry program map into the correct state and county fips codes. This isn't ideal for our purposes. Since our trips table already has the correct numeric fips codes, we need a fips table that has the correct and unique county/city names.

Grab another fips table from here: https://www.census.gov/geo/reference/codes/cou.html and choose United States.

For convenience, I'll save to my website.

In [25]:
fips = pd.io.parsers.read_csv('https://rlhick.people.wm.edu/pubp622/data/national_county.txt',
                              header=False,names=['state','state_fips','county_fips','county','fips_class_code'])
print fips.shape
(3234, 5)
In [26]:
fips.head()
Out[26]:
state state_fips county_fips county fips_class_code
0 AL 1 3 Baldwin County H1
1 AL 1 5 Barbour County H1
2 AL 1 7 Bibb County H1
3 AL 1 9 Blount County H1
4 AL 1 11 Bullock County H1

Having worked with the data before, I know that counties/entities like Virginia Beach City are labeled "Virginia Beach city". Let's fix that:

In [27]:
fips["county"] = fips.county.str.replace("city","City")

Check for duplicates:

In [28]:
fips.duplicated(['state_fips','county_fips'])
print "Number of duplicate rows:", fips[fips.duplicated(['state_fips','county_fips'])].size
Number of duplicate rows: 0
In [29]:
trips_fips = pd.merge(trips,fips, left_on=['st','cnty'], 
                      right_on=['state_fips','county_fips'],
                      how='left', sort=False)
print trips.shape
print trips_fips.shape
(104709, 12)
(104709, 17)
In [30]:
trips_fips.head()
Out[30]:
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2 state state_fips county_fips county fips_class_code
0 1171020101114001 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 25 5 CT 9 7 Middlesex County H4
1 1171020101114002 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 10 2 CT 9 7 Middlesex County H4
2 1171020101116001 2010 6 2010-11-16 9 8835250101 BLUEFISH 8835020102 STRIPED BASS 11 12 6 CT 9 11 New London County H4
3 1171020101123001 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 90 25 CT 9 7 Middlesex County H4
4 1171020101123002 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 75 20 CT 9 7 Middlesex County H4

Let's rename our state and intercept column to make it more intuitive.

In [31]:
trips_fips.rename(columns = {'county': 'county_intercept','state':'state_intercept'},inplace=True)
trips_fips.head()
Out[31]:
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2 state_intercept state_fips county_fips county_intercept fips_class_code
0 1171020101114001 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 25 5 CT 9 7 Middlesex County H4
1 1171020101114002 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 10 2 CT 9 7 Middlesex County H4
2 1171020101116001 2010 6 2010-11-16 9 8835250101 BLUEFISH 8835020102 STRIPED BASS 11 12 6 CT 9 11 New London County H4
3 1171020101123001 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 90 25 CT 9 7 Middlesex County H4
4 1171020101123002 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 75 20 CT 9 7 Middlesex County H4

Working with Indices

Let's create an index on these new state and county fields:

In [32]:
trips_fips.set_index(['state_intercept','county_intercept'],inplace=True)
trips_fips.head()
Out[32]:
id_code year wave intercept_date st prim1 prim1_common prim2 prim2_common cnty ffdays12 ffdays2 state_fips county_fips fips_class_code
state_intercept county_intercept
CT Middlesex County 1171020101114001 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 25 5 9 7 H4
Middlesex County 1171020101114002 2010 6 2010-11-14 9 8835020102 STRIPED BASS 7 10 2 9 7 H4
New London County 1171020101116001 2010 6 2010-11-16 9 8835250101 BLUEFISH 8835020102 STRIPED BASS 11 12 6 9 11 H4
Middlesex County 1171020101123001 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 90 25 9 7 H4
Middlesex County 1171020101123002 2010 6 2010-11-23 9 8835020102 STRIPED BASS 8835250101 BLUEFISH 7 75 20 9 7 H4

Why use an index for these state and county codes? The biggest reason is that we can easily access values for looking at our data. Here we have 2 month avidity for anglers sampled in James City County:

In [33]:
trips_fips['ffdays2'].loc['VA','James City County']
/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/index.py:5091: PerformanceWarning: indexing past lexsort depth may impact performance.
  PerformanceWarning)
Out[33]:
state_intercept  county_intercept 
VA               James City County     2
                 James City County     1
                 James City County     0
                 James City County     0
                 James City County     4
                 James City County     6
                 James City County     6
                 James City County     6
                 James City County     6
                 James City County     3
                 James City County     3
                 James City County     8
                 James City County    20
                 James City County     1
                 James City County    12
                 James City County     2
                 James City County     1
                 James City County     8
                 James City County     0
                 James City County     4
                 James City County     2
                 James City County     2
                 James City County     0
                 James City County     0
                 James City County     0
                 James City County     0
                 James City County    15
                 James City County    15
                 James City County     3
Name: ffdays2, dtype: float64
In [34]:
# we want to create a richer index that includes the prim1_common species name:
# 1. put state_intercept and county_intercept back in data
trips_fips.reset_index(inplace=True)
trips_fips.head()
# 2. create new multiindex
trips_fips.set_index(['state_intercept','county_intercept','prim1_common'],inplace=True)
In [35]:
trips_fips.head()
Out[35]:
id_code year wave intercept_date st prim1 prim2 prim2_common cnty ffdays12 ffdays2 state_fips county_fips fips_class_code
state_intercept county_intercept prim1_common
CT Middlesex County STRIPED BASS 1171020101114001 2010 6 2010-11-14 9 8835020102 7 25 5 9 7 H4
STRIPED BASS 1171020101114002 2010 6 2010-11-14 9 8835020102 7 10 2 9 7 H4
New London County BLUEFISH 1171020101116001 2010 6 2010-11-16 9 8835250101 8835020102 STRIPED BASS 11 12 6 9 11 H4
Middlesex County STRIPED BASS 1171020101123001 2010 6 2010-11-23 9 8835020102 8835250101 BLUEFISH 7 90 25 9 7 H4
STRIPED BASS 1171020101123002 2010 6 2010-11-23 9 8835020102 8835250101 BLUEFISH 7 75 20 9 7 H4
In [36]:
trips_fips.loc['VA','Virginia Beach City','RED DRUM'].head()
Out[36]:
id_code year wave intercept_date st prim1 prim2 prim2_common cnty ffdays12 ffdays2 state_fips county_fips fips_class_code
state_intercept county_intercept prim1_common
VA Virginia Beach City RED DRUM 1681420101127002 2010 6 2010-11-27 51 8835440901 8835440102 SPOTTED SEATROUT 810 40 10 51 810 C7
RED DRUM 1681420101127004 2010 6 2010-11-27 51 8835440901 8835440102 SPOTTED SEATROUT 810 150 24 51 810 C7
RED DRUM 1681420101127005 2010 6 2010-11-27 51 8835440901 8835250101 BLUEFISH 810 50 20 51 810 C7
RED DRUM 1681420101126003 2010 6 2010-11-26 51 8835440901 8835440102 SPOTTED SEATROUT 810 8 4 51 810 C7
RED DRUM 1681420101126004 2010 6 2010-11-26 51 8835440901 810 150 35 51 810 C7
In [37]:
trips_fips.loc['VA',:,'STRIPED BASS'].head()
Out[37]:
id_code year wave intercept_date st prim1 prim2 prim2_common cnty ffdays12 ffdays2 state_fips county_fips fips_class_code
county_intercept
Virginia Beach City 1662020101110001 2010 6 2010-11-10 51 8835020102 810 14 5 51 810 C7
Virginia Beach City 1662020101110002 2010 6 2010-11-10 51 8835020102 810 10 5 51 810 C7
Hampton City 1680820101122001 2010 6 2010-11-22 51 8835020102 8835440901 RED DRUM 650 80 24 51 650 C7
Virginia Beach City 1681420101121005 2010 6 2010-11-21 51 8835020102 810 4 4 51 810 C7
Virginia Beach City 1681420101121012 2010 6 2010-11-21 51 8835020102 810 2 0 51 810 C7

Ok, indexes are pretty cool, but what if we need the county data for doing summary statistics? Example, suppose we want to get a frequency count of trips by county in VA targeting Striped Bass? Notice, we can't just access index values like you would a column of data:

In [38]:
trips_fips.county_intercept
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-38-f0324470b884> in <module>()
----> 1 trips_fips.county_intercept

/home/robhicks/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
   2148                 return self[name]
   2149             raise AttributeError("'%s' object has no attribute '%s'" %
-> 2150                                  (type(self).__name__, name))
   2151 
   2152     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'county_intercept'

So the strategy will be to copy the index values back into our table so we can use it for other purposes:

(Note: you can also keep these columns in your data by including the option drop=False when you set_index())

In [39]:
print trips_fips.index.get_level_values(0)
trips_fips['state_intercept'] = trips_fips.index.get_level_values(0)
trips_fips['county_intercept'] = trips_fips.index.get_level_values(1)
trips_fips['species_name'] = trips_fips.index.get_level_values(2)
Index([u'CT', u'CT', u'CT', u'CT', u'CT', u'CT', u'CT', u'CT', u'CT', u'CT', 
       ...
       u'MS', u'MS', u'MS', u'MS', u'MS', u'MS', u'MS', u'MS', u'MS', u'MS'],
      dtype='object', name=u'state_intercept', length=104709)
In [40]:
trips_fips.head()
Out[40]:
id_code year wave intercept_date st prim1 prim2 prim2_common cnty ffdays12 ffdays2 state_fips county_fips fips_class_code state_intercept county_intercept species_name
state_intercept county_intercept prim1_common
CT Middlesex County STRIPED BASS 1171020101114001 2010 6 2010-11-14 9 8835020102 7 25 5 9 7 H4 CT Middlesex County STRIPED BASS
STRIPED BASS 1171020101114002 2010 6 2010-11-14 9 8835020102 7 10 2 9 7 H4 CT Middlesex County STRIPED BASS
New London County BLUEFISH 1171020101116001 2010 6 2010-11-16 9 8835250101 8835020102 STRIPED BASS 11 12 6 9 11 H4 CT New London County BLUEFISH
Middlesex County STRIPED BASS 1171020101123001 2010 6 2010-11-23 9 8835020102 8835250101 BLUEFISH 7 90 25 9 7 H4 CT Middlesex County STRIPED BASS
STRIPED BASS 1171020101123002 2010 6 2010-11-23 9 8835020102 8835250101 BLUEFISH 7 75 20 9 7 H4 CT Middlesex County STRIPED BASS

This allows us to use index lookups and also the index values for calculating interesting summary data. The following calculates the top 10 targeted species in VA:

In [41]:
trips_fips.loc['VA'].species_name.value_counts().head(10)
Out[41]:
                    1794
SUMMER FLOUNDER      972
STRIPED BASS         710
ATLANTIC CROAKER     448
SPOTTED SEATROUT     154
TAUTOG               153
SPOT                 143
COBIA                117
RED DRUM              66
BLUEFISH              50
dtype: int64

This calculates the top targeted species in Gloucester County, VA.

In [42]:
trips_fips.loc['VA','Gloucester County',:].species_name.value_counts()
Out[42]:
                    37
STRIPED BASS        23
ATLANTIC CROAKER    20
SPOT                18
SUMMER FLOUNDER     15
COBIA                7
SPANISH MACKEREL     2
BLUEFISH             1
dtype: int64

What about top 10 counties in VA with respect to numbers of intercepted trips?

In [43]:
print trips_fips.loc['VA'].county_intercept.value_counts().head(10)
Virginia Beach City    1876
Hampton City            553
Accomack County         505
Norfolk City            440
Newport News City       306
Northampton County      297
Middlesex County        215
Gloucester County       123
Mathews County           67
Portsmouth City          64
dtype: int64

This tells us when summer flounder is being targeted in VA:

In [44]:
trips_fips.loc['VA',:,'SUMMER FLOUNDER'].intercept_date.dt.month.value_counts()
Out[44]:
6     190
7     176
5     167
8     163
4     137
9      95
10     31
3      12
11      1
dtype: int64

Note we used the data/time field intercept_date to extract the month using the dt.month method. Pandas has immensely rich datetime capabilities that we won't demonstrate here.

Groupby

We can compute interesting data by grouping over column values and using our index for fast selection. Suppose we want to know total number of trips in VA by county of intercept for each wave (wave 1 = Jan,Feb; 2 = Mar,Apr; etc). Groupby lets us do this:

In [45]:
trips_fips.loc['VA'].groupby(['county_intercept','wave'])['id_code'].count().head(25)
Out[45]:
county_intercept      wave
Accomack County       2       101
                      3       172
                      4       181
                      5        50
                      6         1
Chesapeake City       2        10
                      3         8
                      4         5
                      5         3
                      6        33
Essex County          3        19
                      4         3
                      6        15
Gloucester County     3        40
                      4        32
                      5        35
                      6        16
Hampton City          2        59
                      3       108
                      4       173
                      5       141
                      6        72
Isle of Wight County  2        17
                      5         9
James City County     3        13
Name: id_code, dtype: int64

Stack and Unstack

We can manipulate the result above further to make it more amenable for comparisons and graphing (e.g. we want all of the wave 2 records on the same row):

In [46]:
t = trips_fips.loc['VA'].groupby(['county_intercept','wave'])['id_code'].count()
t.unstack().head(25)
Out[46]:
wave 2.0 3.0 4.0 5.0 6.0
county_intercept
Accomack County 101 172 181 50 1
Chesapeake City 10 8 5 3 33
Essex County NaN 19 3 NaN 15
Gloucester County NaN 40 32 35 16
Hampton City 59 108 173 141 72
Isle of Wight County 17 NaN NaN 9 NaN
James City County NaN 13 11 5 NaN
King William County NaN NaN 9 NaN NaN
Mathews County NaN 20 30 2 15
Middlesex County NaN 64 60 63 28
Newport News City 25 63 79 74 65
Norfolk City 18 125 159 98 40
Northampton County 7 73 108 49 60
Northumberland County NaN 11 8 5 4
Poquoson City NaN 23 11 12 14
Portsmouth City NaN 31 13 NaN 20
Richmond County NaN 21 4 4 NaN
Suffolk City NaN 11 13 9 NaN
Surry County NaN 2 2 2 NaN
Virginia Beach City 382 427 412 341 314
Westmoreland County 5 5 4 NaN 3
York County NaN 19 9 5 10

Or we could put waves on the same row:

In [47]:
t.unstack(0).head(25)
Out[47]:
county_intercept Accomack County Chesapeake City Essex County Gloucester County Hampton City Isle of Wight County James City County King William County Mathews County Middlesex County ... Northampton County Northumberland County Poquoson City Portsmouth City Richmond County Suffolk City Surry County Virginia Beach City Westmoreland County York County
wave
2 101 10 NaN NaN 59 17 NaN NaN NaN NaN ... 7 NaN NaN NaN NaN NaN NaN 382 5 NaN
3 172 8 19 40 108 NaN 13 NaN 20 64 ... 73 11 23 31 21 11 2 427 5 19
4 181 5 3 32 173 NaN 11 9 30 60 ... 108 8 11 13 4 13 2 412 4 9
5 50 3 NaN 35 141 9 5 NaN 2 63 ... 49 5 12 NaN 4 9 2 341 NaN 5
6 1 33 15 16 72 NaN NaN NaN 15 28 ... 60 4 14 20 NaN NaN NaN 314 3 10

5 rows × 22 columns

We can combine these with plot to quickly make cool charts. Let's focus on the peninsula:

In [48]:
t.unstack(0)[['York County','Newport News City','Hampton City','James City County']].plot(kind='bar',figsize=(8, 8))
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8c381101d0>
No description has been provided for this image

We can also stack data we have unstacked:

In [49]:
t_unstacked = t.unstack(0)
t_unstacked.stack()
Out[49]:
wave  county_intercept     
2     Accomack County          101
      Chesapeake City           10
      Hampton City              59
      Isle of Wight County      17
      Newport News City         25
      Norfolk City              18
      Northampton County         7
      Virginia Beach City      382
      Westmoreland County        5
3     Accomack County          172
      Chesapeake City            8
      Essex County              19
      Gloucester County         40
      Hampton City             108
      James City County         13
      Mathews County            20
      Middlesex County          64
      Newport News City         63
      Norfolk City             125
      Northampton County        73
      Northumberland County     11
      Poquoson City             23
      Portsmouth City           31
      Richmond County           21
      Suffolk City              11
      Surry County               2
      Virginia Beach City      427
      Westmoreland County        5
      York County               19
4     Accomack County          181
                              ... 
5     Isle of Wight County       9
      James City County          5
      Mathews County             2
      Middlesex County          63
      Newport News City         74
      Norfolk City              98
      Northampton County        49
      Northumberland County      5
      Poquoson City             12
      Richmond County            4
      Suffolk City               9
      Surry County               2
      Virginia Beach City      341
      York County                5
6     Accomack County            1
      Chesapeake City           33
      Essex County              15
      Gloucester County         16
      Hampton City              72
      Mathews County            15
      Middlesex County          28
      Newport News City         65
      Norfolk City              40
      Northampton County        60
      Northumberland County      4
      Poquoson City             14
      Portsmouth City           20
      Virginia Beach City      314
      Westmoreland County        3
      York County               10
dtype: float64

Pivot Table

Borrowing from excel, pandas can create pivot tables of summary statistics. Suppose we want counties as rows and waves as columns, with the average and standard deviation of 2 and 12 month avidity.

In [50]:
pd.pivot_table(trips_fips.loc['VA'],index='county_intercept',columns='wave',
               values=['ffdays2','ffdays12'],aggfunc=[np.mean, np.std],fill_value="-")
Out[50]:
mean std
ffdays2 ffdays12 ffdays2 ffdays12
wave 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6
county_intercept
Accomack County 2.772277 2.290698 3.524862 4.42 10 21.56436 9.662791 11.11602 16.7 50 7.305999 4.695067 6.130587 8.094064 - 35.47997 17.51164 19.1492 37.3195 -
Chesapeake City 3.2 5.75 8 4.666667 10.06061 13.9 26.625 36 23.33333 45.0303 4.104198 7.759786 0 1.154701 9.450329 16.29212 34.59124 13.41641 5.773503 38.89448
Essex County - 1.578947 12 - 2.733333 - 25.42105 60.66667 - 12.06667 - 2.652374 15.6205 - 3.788454 - 38.41342 59.00282 - 11.46714
Gloucester County - 4.05 8 9.742857 6.5 - 23.575 44.625 49.48571 39.0625 - 5.505009 11.3564 17.40071 5.680376 - 32.05195 63.03852 166.5124 48.03536
Hampton City 2.745763 3.611111 4.907514 3.425532 5.458333 38.01695 22.96296 27.28902 21.90071 40.45833 12.78144 10.19514 12.37978 9.488515 16.67835 131.3037 97.6488 108.9884 89.36716 163.9646
Isle of Wight County 3 - - 6.111111 - 24.47059 - - 26.77778 - 7.656696 - - 11.02774 - 42.89102 - - 32.86631 -
James City County - 3.846154 6.636364 1.4 - - 23.46154 31.81818 4.2 - - 5.444357 5.390227 1.67332 - - 34.89655 44.42481 3.420526 -
King William County - - 5 - - - - 9.666667 - - - - 6.383573 - - - - 9.526279 - -
Mathews County - 7.1 11.83333 3.5 7.2 - 66.45 82.33333 21 21.33333 - 21.60872 24.08617 2.12132 8.32552 - 220.9033 249.3374 12.72792 26.90636
Middlesex County - 1.984375 5.25 4.460317 4.857143 - 10.60938 24.96667 17.7619 24.96429 - 2.751578 7.060573 5.871817 5.133148 - 15.15488 37.88071 30.05625 31.11208
Newport News City 8.44 1.714286 3.113924 4.486486 4.553846 96.88 40.87302 12.51899 16.47297 29.10769 26.97542 3.294319 3.265937 3.738886 5.500087 272.1627 175.2367 20.66637 17.77039 39.94376
Norfolk City 1.777778 2.408 3.044025 6.673469 4.875 25.72222 23.472 14.35849 25.2449 26.075 2.51011 9.630342 4.469794 8.926707 4.831454 30.99489 94.87059 32.05381 32.60386 24.63413
Northampton County 3.428571 3.273973 4.916667 6.55102 2.533333 37.28571 15.90411 14.39815 28.40816 18.41667 4.503967 4.292307 6.729937 6.964377 3.47127 53.33765 24.74602 19.51273 37.21106 21.58962
Northumberland County - 0.1818182 1.375 4.6 73.5 - 2.181818 8.125 23.6 749.25 - 0.6030227 2.722263 4.219005 49 - 5.929893 10.72297 24.47039 498.1669
Poquoson City - 7.608696 10.18182 7.833333 2.5 - 53.52174 34.72727 22 11.78571 - 6.443736 4.956172 5.749835 2.710379 - 65.93445 24.75516 16.75492 10.45267
Portsmouth City - 3.451613 5.461538 - 10.45 - 12.96774 16.61538 - 47.15 - 5.784109 4.313069 - 13.93018 - 14.0866 13.46315 - 54.04703
Richmond County - 4 2 1.75 - - 14.85714 2 6 - - 4.626013 2.309401 1.707825 - - 23.5548 2.309401 3.265986 -
Suffolk City - 2.727273 3.846154 4.222222 - - 7.636364 12.61538 9 - - 3.608072 4.687682 3.666667 - - 6.328866 15.72439 6.800735 -
Surry County - 0 5 22 - - 0 6 87.5 - - 0 4.242641 19.79899 - - 0 5.656854 88.38835 -
Virginia Beach City 2.544503 3.601874 3.764563 4.944282 5.538217 28.08639 29.45667 18.89078 20.07918 31.76752 6.589729 6.741073 9.049494 9.296416 8.549432 42.52479 91.25888 74.73887 61.24855 100.9227
Westmoreland County 8 2.8 0.25 - 0 25.4 20.6 1.25 - 0.3333333 10.95445 2.167948 0.5 - 0 32.41605 14.02854 2.5 - 0.5773503
York County - 4.789474 7.777778 3.8 6.6 - 29.94737 36.22222 11 41.5 - 3.852545 10.10913 5.01996 7.834397 - 33.15531 37.87406 12.56981 40.6072

Exporting Data from Pandas

You may want to export dataframes and other information to other packages (e.g. latex, excel, or stata). Pandas has a lot of to_ methods on pandas dataframes and series. For example we can export some interesting information to latex table code:

In [51]:
(t.unstack().head(25)).to_latex('excel_example.tex',na_rep='-')
Or, we can save the same information to excel:
In [52]:
(t.unstack().head(25)).to_excel('excel_example.xls')

Econometrics in Python

In [3]:
import statsmodels.formula.api as smf    

# load data and create dataframe
tobias_koop=pd.read_csv('https://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')

tobias_koop.head()
Out[3]:
id educ ln_wage pexp time ability meduc feduc broken_home siblings pexp2
0 4 12 2.14 2 4 0.26 12 10 1 4 4
1 6 15 1.91 4 4 0.44 12 16 0 2 16
2 8 13 2.32 8 4 0.51 12 15 1 2 64
3 11 14 1.64 1 4 1.82 16 17 1 2 1
4 12 13 2.16 6 4 -1.30 13 12 0 5 36
In [4]:
formula = "ln_wage ~ educ + pexp + pexp2 + broken_home"
results = smf.ols(formula,tobias_koop).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                ln_wage   R-squared:                       0.166
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     51.36
Date:                Thu, 17 Sep 2015   Prob (F-statistic):           1.83e-39
Time:                        14:52:15   Log-Likelihood:                -583.66
No. Observations:                1034   AIC:                             1177.
Df Residuals:                    1029   BIC:                             1202.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       0.4603      0.137      3.353      0.001         0.191     0.730
educ            0.0853      0.009      9.179      0.000         0.067     0.104
pexp            0.2035      0.024      8.629      0.000         0.157     0.250
pexp2          -0.0124      0.002     -5.438      0.000        -0.017    -0.008
broken_home    -0.0087      0.036     -0.244      0.807        -0.079     0.061
==============================================================================
Omnibus:                       55.892   Durbin-Watson:                   1.761
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              112.050
Skew:                          -0.355   Prob(JB):                     4.66e-25
Kurtosis:                       4.448   Cond. No.                         391.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.