Pandas Primer
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import urllib.request  # used below to download the course data file
Introduction to Pandas¶
Pandas works hand-in-hand with other Python libraries (e.g. matplotlib and numpy) to make manipulating data (what the Pandas team calls "Data Munging") easy. With pandas you can:
- Access data by variable name while retaining the full linear algebra capabilities of numpy
- Group data by values (for example, calculate mean income by state; see the quick sketch below)
- Plot and summarize values
- Join (combine) different sources of data
- Work with powerful time series and panel data tools (beyond the scope of this course)
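As a quick illustration of the grouping bullet, here is a minimal sketch with made-up numbers (not the course data):
demo = pd.DataFrame({'state': ['VA', 'VA', 'NC', 'NC'],
                     'income': [55000, 65000, 48000, 52000]})
demo.groupby('state')['income'].mean()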
Note: If you want to reinforce some of these concepts on your own, I recommend this superb YouTube video (https://www.youtube.com/watch?v=5JnMutdy6Fw) and accompanying coursework (https://github.com/brandon-rhodes/pycon-pandas-tutorial) by Brandon Rhodes.
There are numerous ways to get data into Pandas:
- Import Excel and comma-delimited files
- Import Stata, SAS, MATLAB, and other datasets
- Import data from an SQL server
- Import data scraped from the web
- Build data manually (by typing in values)
In this tutorial we will focus on the first two methods for reading data, but know that there is almost certainly a way to get your data into pandas whatever format it is in; each route above has a pandas entry point, sketched below.
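For reference, here is roughly what each route looks like (the file names are hypothetical placeholders, so the calls are shown commented out):
# df = pd.read_csv('mydata.csv')                 # comma-delimited text
# df = pd.read_excel('mydata.xlsx')              # Excel
# df = pd.read_stata('mydata.dta')               # Stata
# df = pd.read_sql(query, connection)            # SQL server (needs a DB connection)
# df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})  # built manually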
Loading and Cleaning Data¶
We will be loading 2010 "Trip" data from the NOAA Fisheries Service Recreational Fisheries Statistics Survey (called MRIP). More detail here: https://www.fisheries.noaa.gov/topic/recreational-fishing-data
We will also load the Census Bureau's county and state fips file, obtained here: https://www.census.gov/2010census/xls/fips_codes_website.xls
The MRIP data needs to be downloaded from my website (you only need to do this once):
# run this once and then comment out with #
# urllib.request.urlretrieve("https://rlhick.people.wm.edu/pubp622/data/mrips_2010.pydata", "mrips_2010.pydata")
trips_2010 = pd.read_pickle('mrips_2010.pydata')
fips = pd.read_excel('https://rlhick.people.wm.edu/pubp622/data/fips_codes_website.xls')
trips_2010.head()
Trips is a very big dataset with lots of columns we'll never use. Let's trim it down:
trips_2010.columns
trips = trips_2010[['id_code','year','wave','intercept_date','st','prim1',
'prim1_common','prim2','prim2_common','cnty','ffdays12',
'ffdays2']]
The describe() method reports summary statistics for the numeric columns:
trips.describe()
# copy data frame into t (note, this is different than assignment [t=trips])
t = trips.copy()
# dot notation and bracket notation access the same column:
t.ffdays12.head()
t['ffdays12'].head()
Referencing rows in data:¶
We can use numpy-like slicing:
# rows with index labels 51 through 55 (.loc slices include both endpoints)
t.loc[51:55]
# first two rows and first five columns (.iloc is position-based, endpoint excluded)
t.iloc[0:2,0:5]
We can select rows of data based on column values. Let's select all the rows where st is 9 (Connecticut):
t[t.st == 9].head()
Find all rows where days spent fishing during the past 12 months (ffdays12) exceeds 10:
t[t.ffdays12 > 10].head()
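Conditions can be combined with & (and) and | (or), wrapping each condition in parentheses; a minimal sketch using the same two columns:
t[(t.st == 9) & (t.ffdays12 > 10)].head()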
Math on columns:¶
# we can create a new variable and use numpy commands:
t['temp'] = t.ffdays12 + .5*np.random.randn(t.shape[0])
t.head()
t.temp.mean()
# or we could do this all in one step if we didn't want to create a new column in t:
(t.ffdays12 + .5*np.random.randn(t.shape[0])).mean()
Note: You can combine pandas with numpy. This is the standard deviation of the column ffdays12:
np.std(t.ffdays12)
We can perform matrix/vector operations on pandas data.
Here, we can transpose a slice of the data:
t[['ffdays12','ffdays2']].head(10).T
We can do matrix multiplication:
np.dot(t[['ffdays12','ffdays2']].head(10).T,t[['ffdays12','ffdays2']].head(10))
t[['ffdays12','ffdays2']].describe()
Find mean of ffdays12 and ffdays2 by state:
t.head()
t[['ffdays12','ffdays2']].groupby('st').describe()
What happened? We trimmed down the columns of t with the statement t[['ffdays12','ffdays2']], so pandas couldn't find the column st. Let's try this again:
t.groupby('st')[['ffdays12','ffdays2']].describe()
Combining DataFrames¶
The state fips codes are hardly intuitive, and the county fips codes are even worse. The goal of this section is to
- Add text-based labels
- Be able to use these labels for fast data selection
To do this, we will attempt to match text-based state and county labels to our trips dataframe. Pandas calls this a merge.
fips.head()
The fips codes (numeric) uniquely identify counties in our trips file. So let's look at the fips file (that we grabbed from Census) above and combine it with our trips data. These column names are messy: let's clean them up.
For me, column names should always:
- Be lower case
- Have no spaces or funky characters
names = ['state','state_fips','county_fips','fips','ansi_code',
'county_name','entity_type']
fips.columns=names
fips.head()
The merge statement takes two pandas dataframes and combines them. Notice in the merge command below we have trips, fips. The dataframe trips is known as the "left" dataframe and fips the "right". I want to combine these tables based on the numeric fips values (state and county). These fields have different names in each of the dataframes, so I need to specify the names (in the same order) for the left (trips) and right (fips) dataframes. The option how='left' tells pandas to always keep rows from trips whether a match is made with fips or not. Note that if a row in the fips table doesn't match our trips data, it is discarded.
# join clean fips codes to trips
trips_fips = pd.merge(trips,fips, left_on=['st','cnty'],
right_on=['state_fips','county_fips'],
how='left', sort=False)
print(trips.shape)
print(trips_fips.shape)
The result we get is disconcerting. What we wanted to achieve with the merge was to add text names to the trips table, yet our merged dataset has 1.5 million rows rather than 104,709 (what we want). This probably occurred because the fips table contains duplicate state/county fips combinations. To check that, use the duplicated() method to find rows with duplicate state and county fips codes.
fips[fips.duplicated(['state_fips','county_fips'])]['entity_type'].value_counts().head(20)
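As an aside, newer pandas versions (0.21+) can catch this kind of blowup up front: the validate option demands that the merge keys be unique on the specified side. A minimal sketch; this call intentionally fails on the duplicated fips table:
try:
    pd.merge(trips, fips, left_on=['st','cnty'],
             right_on=['state_fips','county_fips'],
             how='left', validate='m:1')  # m:1 requires unique keys in fips
except pd.errors.MergeError as e:
    print('merge validation failed:', e)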
Unfortunately, this file is (apparently) for census workers in the field who may encounter lots of different place names, and this file helps their data-entry program map into the correct state and county fips codes. This isn't ideal for our purposes. Since our trips table already has the correct numeric fips codes, we need a fips table that has the correct and unique county/city names.
Grab another fips table from here: https://www.census.gov/geo/reference/codes/cou.html and choose United States.
For convenience, I'll save a copy to my website.
fips = pd.read_csv('https://rlhick.people.wm.edu/pubp622/data/national_county.txt',
                   header=None, names=['state','state_fips','county_fips','county','fips_class_code'])
print(fips.shape)
fips.head()
Having worked with the data before, I know that counties/entities like Virginia Beach City are labeled "Virginia Beach city". Let's fix that:
fips["county"] = fips.county.str.replace("city","City")
Check for duplicates:
fips.duplicated(['state_fips','county_fips'])
print("Number of duplicate rows:", fips[fips.duplicated(['state_fips','county_fips'])].size)
trips_fips = pd.merge(trips,fips, left_on=['st','cnty'],
right_on=['state_fips','county_fips'],
how='left', sort=False)
print(trips.shape)
print(trips_fips.shape)
trips_fips.head()
Let's rename our state and county columns to make them more intuitive.
trips_fips.rename(columns = {'county': 'county_intercept','state':'state_intercept'},inplace=True)
trips_fips.head()
Working with Indices¶
Let's create an index on these new state and county fields:
trips_fips.set_index(['state_intercept','county_intercept'],inplace=True)
trips_fips.head()
Why use an index for these state and county codes? The biggest reason is that we can easily access values when looking at our data. Here we have 2-month avidity (ffdays2) for anglers sampled in James City County:
trips_fips['ffdays2'].loc['VA','James City County']
# we want to create a richer index that includes the prim1_common species name:
# 1. put state_intercept and county_intercept back in data
trips_fips.reset_index(inplace=True)
trips_fips.head()
# 2. create new multiindex
trips_fips.set_index(['state_intercept','county_intercept','prim1_common'],inplace=True)
trips_fips.head()
trips_fips.loc['VA','Virginia Beach City','RED DRUM'].head()
trips_fips.loc['VA',:,'STRIPED BASS'].head()
Ok, indexes are pretty cool, but what if we need the county data for doing summary statistics? For example, suppose we want a frequency count of trips by county in VA targeting striped bass. Notice that we can't just access index values the way we would a column of data:
trips_fips.county_intercept
So the strategy will be to copy the index values back into our table so we can use them for other purposes. (Note: you can also keep these columns in your data by including the option drop=False when you set_index(); see the sketch below.)
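For reference, the drop=False route would have looked like this back at the set_index step (shown commented out so it doesn't disturb the index we already set):
# trips_fips.set_index(['state_intercept','county_intercept','prim1_common'],
#                      drop=False, inplace=True)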
print(trips_fips.index.get_level_values(0))
trips_fips['state_intercept'] = trips_fips.index.get_level_values(0)
trips_fips['county_intercept'] = trips_fips.index.get_level_values(1)
trips_fips['species_name'] = trips_fips.index.get_level_values(2)
trips_fips.head()
This allows us to use index lookups and also the index values for calculating interesting summary data. The following calculates the top 10 targeted species in VA:
trips_fips.loc['VA'].species_name.value_counts().head(10)
This calculates the top targeted species in Gloucester County, VA.
trips_fips.loc['VA','Gloucester County',:].species_name.value_counts()
What about top 10 counties in VA with respect to numbers of intercepted trips?
print(trips_fips.loc['VA'].county_intercept.value_counts().head(10))
This tells us when summer flounder is being targeted in VA:
trips_fips.loc['VA',:,'SUMMER FLOUNDER'].intercept_date.dt.month.value_counts()
Note we used the date/time field intercept_date to extract the month using the dt.month accessor. Pandas has immensely rich datetime capabilities that we won't demonstrate here, but a small taste follows.
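To give a flavor, a minimal sketch (dt exposes many more components, e.g. day, dayofweek, year):
trips_fips.intercept_date.dt.year.head()       # calendar year of each trip
trips_fips.intercept_date.dt.dayofweek.head()  # day of week, 0 = Monday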
Groupby¶
We can compute interesting data by grouping over column values and using our index for fast selection. Suppose we want to know the total number of trips in VA by county of intercept for each wave (wave 1 = Jan/Feb, wave 2 = Mar/Apr, etc.). Groupby lets us do this:
trips_fips.loc['VA'].groupby(['county_intercept','wave'])['id_code'].count().head(25)
Stack and Unstack¶
We can manipulate the result above further to make it more amenable for comparisons and graphing (e.g. we want each county on its own row, with one column per wave):
t = trips_fips.loc['VA'].groupby(['county_intercept','wave'])['id_code'].count()
t.unstack().head(25)
Or we could put each wave on its own row instead:
t.unstack(0).head(25)
We can combine these with plot to quickly make cool charts. Let's focus on the Virginia Peninsula:
t.unstack(0)[['York County','Newport News City','Hampton City','James City County']].plot(kind='bar',figsize=(8, 8))
plt.title("Number of Intercepted Anglers in Dataset")
plt.show()
We can also stack data we have unstacked:
t_unstacked = t.unstack(0)
t_unstacked.stack()
Pivot Table¶
Borrowing from Excel, pandas can create pivot tables of summary statistics. Suppose we want counties as rows and waves as columns, with the mean and standard deviation of 2- and 12-month avidity:
pd.pivot_table(trips_fips.loc['VA'],index='county_intercept',columns='wave',
values=['ffdays2','ffdays12'],aggfunc=[np.mean, np.std],fill_value="-")
Exporting Data from Pandas¶
You may want to export dataframes and other information to other packages (e.g. LaTeX, Excel, or Stata). Pandas has a lot of to_* methods on dataframes and series. For example, we can export some interesting information to LaTeX table code:
(t.unstack().head(25)).to_latex('latex_example.tex',na_rep='-')
(t.unstack().head(25)).to_excel('excel_example.xlsx')
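The same pattern covers many other targets (to_csv, to_stata, to_html, ...); for instance, plain comma-delimited text:
(t.unstack().head(25)).to_csv('csv_example.csv', na_rep='-')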
Econometrics in Python¶
import statsmodels.formula.api as smf
# load data and create dataframe
tobias_koop=pd.read_csv('https://rlhick.people.wm.edu/econ407/data/tobias_koop_t_4.csv')
tobias_koop.head()
formula = "ln_wage ~ educ + pexp + pexp2 + broken_home"
results = smf.ols(formula,tobias_koop).fit()
print(results.summary())
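Beyond the printed summary, pieces of the fit are attributes of the results object, and alternative covariance estimators are available through cov_type (a brief sketch; HC1 is one of several options statsmodels supports):
print(results.params)    # coefficient estimates
print(results.rsquared)  # R-squared
# heteroskedasticity-robust (White) standard errors:
results_robust = smf.ols(formula, tobias_koop).fit(cov_type='HC1')
print(results_robust.bse)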