{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In this short post, I will outline how one can access data stored in a database like MariaDB or MySQL for analysis inside an Ipython Notebook. There are many reasons why you might want to store your data in a proper database. For me the most important are:\n", "\n", "1. All of my data resides in a password protected and more secure place than having a multitude of csv, mat, and dta files scattered all over my file system. \n", "\n", "2. If you access the same data for multiple projects, any changes to the underlying data will be propagated to your analysis, without having to update copies of project data.\n", "\n", "3. Having data in a central repository makes backup and recover significantly easier.\n", "\n", "4. This allows for two-way interaction with your database. You can read and write tables from/to your database. Rather than use SQL, you can create database tables using pandas/ipython. \n", "\n", "\n", "\n", "In what follows, I focus on techniques specifically for MariaDB and MySQL, although similar steps would work with other database products out there.\n", "\n", "[Instructions for doing this in Stata](http://www.stata.com/support/faqs/data-management/using-plugin-to-connect-to-database/)\n", "\n", "First, load the python libraries we'll need:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import MySQLdb as mariadb\n", "import getpass # so we don't need to store passwords on disk\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code uses the getpass package to record password information, and sets up the database connection to a mariadb/mysql database on localhost. Once the connection is established, it runs the query \n", "\n", "``\n", "select * from ipython_out\n", "``\n", "\n", "for copying the table `ipython_out` to the pandas dataframe `data`. Note, you need to supply your own username (`myusername`) and database name (`dbname`)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(209594, 27)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# importing via mariadb directly into pandas dataframe:\n", "password = getpass.getpass() # asks for password in the console window so we don't store it here\n", "conn = mariadb.connect('localhost','myusername',password,'dbname');\n", "data = pd.io.sql.read_sql('select * from ipython_out', conn)\n", "conn.close()\n", "data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen above, this table (now a dataframe in ipython) has over 200k rows and 27 columns. Let's grab a list of columns:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['year', 'donor', 'donor_iso2', 'donor_iso3', 'donor_project_id',\n", " 'umbrella', 'recipient', 'recipient_iso2', 'recipient_iso3',\n", " 'donor_type', 'bi_multi', 'environment',\n", " 'environmental_impact_assessment', 'commitment_amount_usd_nominal',\n", " 'commitment_amount_usd_constant', 'title', 'short_description',\n", " 'long_description', 'aiddata_activity_name', 'aiddata_id',\n", " 'score_title', 'score_short_description', 'score_long_description',\n", " 'score_activity', 'score_purpose', 'coalesced_purpose_name',\n", " 'total_score'], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns.values" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | year | \n", "donor | \n", "donor_iso2 | \n", "donor_iso3 | \n", "donor_project_id | \n", "umbrella | \n", "recipient | \n", "recipient_iso2 | \n", "recipient_iso3 | \n", "donor_type | \n", "... | \n", "long_description | \n", "aiddata_activity_name | \n", "aiddata_id | \n", "score_title | \n", "score_short_description | \n", "score_long_description | \n", "score_activity | \n", "score_purpose | \n", "coalesced_purpose_name | \n", "total_score | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "1998 | \n", "World Bank - International Finance Corporation... | \n", "None | \n", "None | \n", "None | \n", "WORLD BANK | \n", "Honduras | \n", "HN | \n", "HND | \n", "Multilateral | \n", "... | \n", "<br> | \n", "Electrical transmission/ distribution, activit... | \n", "3010916 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "Electrical transmission/ distribution | \n", "1 | \n", "
1 | \n", "2001 | \n", "Asian Development Bank (ASDB) | \n", "None | \n", "None | \n", "None | \n", "ASDB | \n", "Pakistan | \n", "PK | \n", "PAK | \n", "Multilateral | \n", "... | \n", "None | \n", "Systems of government|Improving public expendi... | \n", "2403576 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "1 | \n", "Public sector financial management | \n", "2 | \n", "
2 | \n", "2005 | \n", "Inter-American Development Bank (IADB) | \n", "None | \n", "None | \n", "None | \n", "IADB | \n", "America, regional | \n", "\n", " | \n", " | Multilateral | \n", "... | \n", "The aim of this operation is to improve the fi... | \n", "Agricultural services |Agricultural financial ... | \n", "2389476 | \n", "1 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "Agricultural services, purpose | \n", "2 | \n", "
3 | \n", "1988 | \n", "Asian Development Bank (ASDB) | \n", "None | \n", "None | \n", "None | \n", "ASDB | \n", "China | \n", "CN | \n", "CHN | \n", "Multilateral | \n", "... | \n", "None | \n", "Energy conservation | \n", "2445959 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "Energy policy and administrative management | \n", "1 | \n", "
4 | \n", "1994 | \n", "Asian Development Bank (ASDB) | \n", "None | \n", "None | \n", "None | \n", "ASDB | \n", "Laos | \n", "LA | \n", "LAO | \n", "Multilateral | \n", "... | \n", "None | \n", "Financial policy and administrative management... | \n", "2446645 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "1 | \n", "Financial policy and administrative management | \n", "2 | \n", "
5 rows × 27 columns
\n", "