{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reshaping and Merging Data and Working with Strings, Dates, and Times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{contents} Table of Contents\n", ":depth: 4\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "
\n", "\n", "## Introduction: `pandas` or SQL?\n", "\n", "As we saw in module 7, read operations in SQL not only extract data from a database, but they can select and rename columns, filter and sort rows, join tables, and aggregate data. With `pandas`, we can also select and rename columns, filter and sort rows, join tables, and aggregate data. So the question is, should we be using SQL or `pandas` to perform all these data manipulation steps? \n", "\n", "SQL is sometimes referred to as **server-side** data manipulation, because databases are often stored on remote servers and SQL queries are processed on the server instead on on a local machine. Data manipulation that is conducted using `pandas` on a local Python installation is called **client-side** data manipulation.\n", "\n", "The question of SQL vs. `pandas` (or SQL vs. the tidyverse in R) is the subject of a lot of debate among professionals who use data. The question [comes up](https://www.quora.com/In-what-situations-should-you-use-SQL-instead-of-Pandas-as-a-data-scientist) [frequently](https://towardsdatascience.com/sql-and-pandas-268f634a4f5d) on [various](https://www.reddit.com/r/Python/comments/1tqjt4/why_do_you_use_pandas_instead_of_sql/) [coding forums](https://datascience.stackexchange.com/questions/34357/why-do-people-prefer-pandas-to-sql). Some [blog posts](https://blog.thedataincubator.com/2018/05/sqlite-vs-pandas-performance-benchmarks/) have tried to compare SQL and `pandas` in terms of the [speed](https://wesmckinney.com/blog/high-performance-database-joins-with-pandas-dataframe-more-benchmarks/) with which they complete equivalent operations, with differing results. [Tina Wenzel and Kavya Gupta](https://medium.com/carwow-product-engineering/sql-vs-pandas-how-to-balance-tasks-between-server-and-client-side-9e2f6c95677) note that many factors influence the relative speed of SQL and `pandas` operations, including the configuration of a PostgreSQL database and the bandwidth available in a network connection. They take the mixed evidence into account and conclude that\n", "\n", "> SQL is the best tool to use for basic data retrieval and the better you know it, the more it will speed up your workflow. More advanced data manipulation tasks can be performed both on the server and client side and it is up to the analyst to balance the workload optimally.\n", "\n", "In short, the existing evidence does not overwhelmingly support one option or the other as best practice for data manipulation. The reality is that both SQL and `pandas` are extremely useful and widely-used, and both tools will become part of your workflow. It will take some time and experience with end-to-end projects in data science to learn how to balance SQL and `pandas` in a way that is most comfortable for you in your workflow. But at this stage, there are some important points to keep in mind when thinking about these tools. \n", "\n", "First, there are many situations in which the question of SQL vs. `pandas` might be moot. For a given project, our data might not come from a database, but instead from a CSV file, from a JSON document acquired via an API, or from raw HTML that we scraped from a webpage. So in order to use SQL, we would have to take the additional step of creating a database. If we hold ourselves to the principles that E. F. Codd and others laid out about the organization of relational databases, it can take a significant amount of work to create this database. 
If there is no database involved for a specific data project, there is no role for SQL, but because `pandas` works on dataframes it can be a primary tool for data manipulation regardless of the original source for the data.\n", "\n", "Second, while there are differences in speed, these differences only become a significant factor for projects that involve big data, and in those cases, we will almost certainly be working with data stored on remote servers. If the data are organized using a database, then SQL may well be faster than `pandas`, but it very much depends on how the database is [configured](https://severalnines.com/database-blog/guide-postgresql-server-configuration-parameters) and on the myriad factors that determine the speed with which code runs through a remote connection. `pandas` can also be used in scripts or Jupyter notebooks that are run on remote servers. Sometimes it makes sense to pull selections of remotely stored data into a local environment so that we can manipulate the data without having to seek permission from a database manager, and in that case, a mix of SQL to choose a selection and `pandas` to manipulate the local data can be very effective.\n", "\n", "Third, `pandas` simply has more functionality and flexibility than SQL. For example, it is fairly straightforward to reshape (e.g. pivot) data using `pandas`, and it is [much more difficult to reshape data](https://stackoverflow.com/questions/2444708/sqlite-long-to-wide-formats) in SQL. `pandas` has better functionality for working with strings and time and date features than SQL, and `pandas`, being part of a Python environment, works well with any other Python package that works with dataframes or arrays. In contrast, while various DBMSs add specific additions to the base SQL language, SQL extensions tend to be fairly limited because of the inherent tension between expanding the functionality of a query language and staying close enough to base SQL to still be considered SQL. The easiest way to bring SQL query results into a Python environment uses an `sqlalchemy` engine and the `pd.read_sql_query()` function from `pandas`. \n", "\n", "Fourth, both `pandas` and SQL enable us to work as part of larger teams that share tools, but we might choose or be required to use one tool over the other to work better as part of the team. `pandas` is essential for projects in which the whole team works within Python. SQL is readable and usable for people who do not use Python but still work with databases.\n", "\n", "Fifth, and most importantly, the choice of SQL vs. `pandas` should be made based on how comfortable we feel with each tool. Both SQL and `pandas` can perform data manipulation correctly, and it will probably be the case that we can remember the code with one tool better than with the other. Because this kind of work comprises about 80% of our time as data scientists, we should use whichever tool lets us do it more quickly.\n", "\n", "For now, we need to practice both `pandas` and SQL and get comfortable with these tools, and we need to be flexible in the future as different situations will call for SQL or `pandas`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joining Dataframes\n", "Joining dataframes is also called merging dataframes, and I will use the words \"join\" and \"merge\" interchangeably below. Every kind of merge that is possible in SQL is possible in `pandas`: inner, outer, left, right, natural, cross, and anti-joins. 
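As a rough sketch of that correspondence (using two small hypothetical dataframes, `left` and `right`, that share a key column `id`), the SQL join types map onto `pd.merge()` arguments like this:\n", "\n", "```\n", "import pandas as pd\n", "\n", "left = pd.DataFrame({'id': ['A', 'B'], 'x': [1, 2]})\n", "right = pd.DataFrame({'id': ['B', 'C'], 'y': [3, 4]})\n", "\n", "pd.merge(left, right, on='id', how='inner')  # inner join (the default)\n", "pd.merge(left, right, on='id', how='left')   # left join\n", "pd.merge(left, right, on='id', how='right')  # right join\n", "pd.merge(left, right, on='id', how='outer')  # outer (full) join\n", "pd.merge(left, right, how='cross')           # cross join\n", "pd.merge(left, right)                        # natural join: merge on all shared column names\n", "```\n", "\n", "An anti-join has no `how` option of its own, but, as we will see below, it can be built from an outer join plus the `indicator` argument. 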
One important difference between joining in SQL and merging in `pandas` is that dataframes might not be as carefully curated as data tables in a relational database, and we should not assume that the join worked properly even if there is no error. Primary keys might not uniquely identify rows in general, and joins might turn into cross joins. Some values of keys that should match might not match due to differences in coding and spelling. And if we attempt a natural join, the wrong columns might be automatically selected as the keys.\n", "\n", "Too many discussions of merging data work only with clean examples. Here I will show you how merging works when the data contain some common problems that may invalidate the merge. In the following discussion, we will see how to perform merges in `pandas` and how to check that each merge worked the way we intended.\n", "\n", "First, for the following examples, we load these packages:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Merging Data on U.S. State Election Results, Income, Economies, and Area\n", "The \"state_elections.csv\" file contains information about the result of presidential elections by state for every presidential election between 1964 and 2004. In this dataset, the primary keys are `State` and `year`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulation
0Alabama1964210732479085030.5489769.451031919000
1Alabama196819657914692369142518.9944881.005521993000
2Alabama1972256923728701026.0670473.932963540080
3Alabama1976659170504070056.6667343.333273737204
4Alabama1980636730654192049.3236650.676343894025
\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", "\n", " population \n", "0 1919000 \n", "1 1993000 \n", "2 3540080 \n", "3 3737204 \n", "4 3894025 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elect = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/state_elections.csv\")\n", "elect.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"state_income.txt\" contains varous summary statistics (percentiles and means, broken down separately for men, for women, and for the overall population) about personal incomes within each state. In this dataset, the primary keys are `stcode` (the post office's two letter abbreviation for states) and `year`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stcodeyearincomep10incomep20incomep30incomep40incomep50incomep60incomep70income...wom_p90incomewom_meanincomewom_minincomewom_maxincomewom_sdincomewom_ineqincomewom_gini_incomewom_n_incomewom_nwgt_incomemedincome
0AL196489.834883269.855731.077988.8710318.313189.116007.318744.4...24233.011400.058.1395372674.421796.8929.531520.473738175.0418494.176.68081
1AL196892.182384469.957455.2810050.3012792.115580.218392.421310.7...26287.512005.3NaNNaN1968.7827.409790.485090NaNNaN80.72643
2AL197294.875436734.1110314.8013336.3016421.519088.922141.625410.7...30278.014667.9NaNNaN8604.2517.835770.457932NaNNaN82.27975
3AL197667.536717759.8111176.6013823.2016431.818982.821923.525180.8...31119.215362.5NaNNaN12842.2015.777650.454074NaNNaN60.07215
4AL198046.298697602.1510637.0013532.6016279.619111.821802.525190.6...29623.114804.228.38428120196.505944.7114.526400.436961543.0711278.241.72882
\n", "

5 rows × 55 columns

\n", "
" ], "text/plain": [ " stcode year income p10income p20income p30income p40income \\\n", "0 AL 1964 89.83488 3269.85 5731.07 7988.87 10318.3 \n", "1 AL 1968 92.18238 4469.95 7455.28 10050.30 12792.1 \n", "2 AL 1972 94.87543 6734.11 10314.80 13336.30 16421.5 \n", "3 AL 1976 67.53671 7759.81 11176.60 13823.20 16431.8 \n", "4 AL 1980 46.29869 7602.15 10637.00 13532.60 16279.6 \n", "\n", " p50income p60income p70income ... wom_p90income wom_meanincome \\\n", "0 13189.1 16007.3 18744.4 ... 24233.0 11400.0 \n", "1 15580.2 18392.4 21310.7 ... 26287.5 12005.3 \n", "2 19088.9 22141.6 25410.7 ... 30278.0 14667.9 \n", "3 18982.8 21923.5 25180.8 ... 31119.2 15362.5 \n", "4 19111.8 21802.5 25190.6 ... 29623.1 14804.2 \n", "\n", " wom_minincome wom_maxincome wom_sdincome wom_ineqincome \\\n", "0 58.13953 72674.42 1796.89 29.53152 \n", "1 NaN NaN 1968.78 27.40979 \n", "2 NaN NaN 8604.25 17.83577 \n", "3 NaN NaN 12842.20 15.77765 \n", "4 28.38428 120196.50 5944.71 14.52640 \n", "\n", " wom_gini_income wom_n_income wom_nwgt_income medincome \n", "0 0.473738 175.0 418494.1 76.68081 \n", "1 0.485090 NaN NaN 80.72643 \n", "2 0.457932 NaN NaN 82.27975 \n", "3 0.454074 NaN NaN 60.07215 \n", "4 0.436961 543.0 711278.2 41.72882 \n", "\n", "[5 rows x 55 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "income = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/state_income.txt\", sep=\"\\t\", header=2) # tab separated (note the header in this file)\n", "income.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"state_economics.txt\" contains information about the macro-level economy in each state and year, including overall and per capita gross domestic product, and GDP broken down by industry. In this dataset, the primary keys are `fips` (a standard numeric code that identifies countries and provinces within countries) and `year`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fipsyearGDPGDPpcPrivateAgricultureFarmsMiningUtilitiesConstruction...Construction_shareManufacturing_shareFinance_Insurance_shareLegal_shareEducation_shareHealth_shareGovernment_shareGovernment_federal_shareGovernment_military_shareGovernment_statelocal_share
01196482014273.5806900312283115235379...4.62138728.2282711.413240.4023900.2560661.90220715.863926.2919152.4021467.182051
111968109495493.7289095294253133294499...4.55749428.7423510.676770.4292630.3653302.22851416.933055.9457482.6303778.356928
211972153364332.10512696481418223435702...4.57746526.0563411.482790.4694840.4434012.59520117.214405.6924882.6799698.841941
311976242066477.035199887816835847141281...5.29207624.8533411.505410.4792200.3594152.98686317.425435.7878212.2556399.386103
411980360069246.47429567674542135112281588...4.41037624.6042311.634170.6498920.3527193.66050117.883135.7823702.1718609.928901
\n", "

5 rows × 36 columns

\n", "
" ], "text/plain": [ " fips year GDP GDPpc Private Agriculture Farms Mining \\\n", "0 1 1964 8201 4273.580 6900 312 283 115 \n", "1 1 1968 10949 5493.728 9095 294 253 133 \n", "2 1 1972 15336 4332.105 12696 481 418 223 \n", "3 1 1976 24206 6477.035 19988 781 683 584 \n", "4 1 1980 36006 9246.474 29567 674 542 1351 \n", "\n", " Utilities Construction ... Construction_share Manufacturing_share \\\n", "0 235 379 ... 4.621387 28.22827 \n", "1 294 499 ... 4.557494 28.74235 \n", "2 435 702 ... 4.577465 26.05634 \n", "3 714 1281 ... 5.292076 24.85334 \n", "4 1228 1588 ... 4.410376 24.60423 \n", "\n", " Finance_Insurance_share Legal_share Education_share Health_share \\\n", "0 11.41324 0.402390 0.256066 1.902207 \n", "1 10.67677 0.429263 0.365330 2.228514 \n", "2 11.48279 0.469484 0.443401 2.595201 \n", "3 11.50541 0.479220 0.359415 2.986863 \n", "4 11.63417 0.649892 0.352719 3.660501 \n", "\n", " Government_share Government_federal_share Government_military_share \\\n", "0 15.86392 6.291915 2.402146 \n", "1 16.93305 5.945748 2.630377 \n", "2 17.21440 5.692488 2.679969 \n", "3 17.42543 5.787821 2.255639 \n", "4 17.88313 5.782370 2.171860 \n", "\n", " Government_statelocal_share \n", "0 7.182051 \n", "1 8.356928 \n", "2 8.841941 \n", "3 9.386103 \n", "4 9.928901 \n", "\n", "[5 rows x 36 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "econ = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/state_economics.txt\", sep=\";\") # semi-colon separated\n", "econ.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"state_area.csv\" contains various measures of the total, land, and water areas of each state. In this dataset, the primary key is `state`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statearea_sqmiarea_sqkmlandarea_sqmilandarea_sqkmwater_sqmiwater_sqkmpercent_water
0Alaska663,267.261,717,854571,951.261,481,34791,316.00236,50713.77
1Texas268,580.82695,621261,797.12678,0516,783.7017,5702.53
2California163,695.57423,970155,939.52403,8827,736.2320,0374.73
3Montana147,042.40380,838145,552.43376,9791,489.963,8591.01
4New Mexico121,589.48314,915121,355.53314,309233.966060.19
\n", "
" ], "text/plain": [ " state area_sqmi area_sqkm landarea_sqmi landarea_sqkm water_sqmi \\\n", "0 Alaska 663,267.26 1,717,854 571,951.26 1,481,347 91,316.00 \n", "1 Texas 268,580.82 695,621 261,797.12 678,051 6,783.70 \n", "2 California 163,695.57 423,970 155,939.52 403,882 7,736.23 \n", "3 Montana 147,042.40 380,838 145,552.43 376,979 1,489.96 \n", "4 New Mexico 121,589.48 314,915 121,355.53 314,309 233.96 \n", "\n", " water_sqkm percent_water \n", "0 236,507 13.77 \n", "1 17,570 2.53 \n", "2 20,037 4.73 \n", "3 3,859 1.01 \n", "4 606 0.19 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "area = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/state_area.csv\") \n", "area.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the datasets above contain three ways to identify states - by name, with two-letter abbreviations, and with FIPS codes - we also need a dataset that matches each state by name to its abbreviation and FIPS code. Datasets that exist solely for the purpose of matching values of keys across datasets are called **crosswalks**:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fipsStatestcode
01AlabamaAL
12AlaskaAK
24ArizonaAZ
35ArkansasAR
46CaliforniaCA
\n", "
" ], "text/plain": [ " fips State stcode\n", "0 1 Alabama AL\n", "1 2 Alaska AK\n", "2 4 Arizona AZ\n", "3 5 Arkansas AR\n", "4 6 California CA" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crosswalk = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/crosswalk.csv\")\n", "crosswalk.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal is to merge all five of these datasets together." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the `pd.merge()` Function\n", "When using a crosswalk to merge dataframes together, the first step is to merge a dataframe with the crosswalk. Here we start by joining the `elect` and `crosswalk` dataframes together. To join dataframes, we use the `pd.merge()` function. The following code works:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode
0Alabama1964210732479085030.5489769.4510319190001AL
1Alabama196819657914692369142518.9944881.0055219930001AL
2Alabama1972256923728701026.0670473.9329635400801AL
3Alabama1976659170504070056.6667343.3332737372041AL
4Alabama1980636730654192049.3236650.6763438940251AL
.................................
556Wyoming198867113106867038.5751261.4248846508056WY
557Wyoming19926816079347046.2079853.7920246625156WY
558Wyoming199677934105388042.5120857.4879248816756WY
559Wyoming200060481147947029.0176970.9823149378256WY
560Wyoming200470776167629029.6873070.3127150652956WY
\n", "

561 rows × 10 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode \n", "0 1919000 1 AL \n", "1 1993000 1 AL \n", "2 3540080 1 AL \n", "3 3737204 1 AL \n", "4 3894025 1 AL \n", ".. ... ... ... \n", "556 465080 56 WY \n", "557 466251 56 WY \n", "558 488167 56 WY \n", "559 493782 56 WY \n", "560 506529 56 WY \n", "\n", "[561 rows x 10 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_data = pd.merge(elect, crosswalk, on='State')\n", "merged_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `merged_data` dataframe now contains the features from `elect` and from `crosswalk`, which means we've added `stcode` and `fips` to the `elect` data, and that enables us to merge `merged_data` with `income` and `econ`.\n", "\n", "Before moving on to the rest of the example, we should discuss some properties of the `pd.merge()` function. Recall from module 7 that there are many different kinds of joins that differ only in how they treat unmatched observations. `pd.merge()` conducts an inner join by default, leaving only the matched observations in the merged data. To change the type of merge, use the `how` argument:\n", "\n", "`how='outer'` performs an outer or full join. It brings all rows from both dataframes into the merged data, and replaces the features for the unmatched rows with `NaN` for the columns without data.\n", "```\n", "merged_data = pd.merge(elect, crosswalk, on='State', how='outer')\n", "```\n", "\n", "`how='left'` performs a left join. It brings all rows from the first dataframe listed into the merged data, and only those in the second dataframe that are matched to the first.\n", "```\n", "merged_data = pd.merge(elect, crosswalk, on='State', how='left')\n", "```\n", "\n", "`how='right'` performs a right join. It brings all rows from the second dataframe listed into the merged data, and only those in the first dataframe that are matched to the second.\n", "```\n", "merged_data = pd.merge(elect, crosswalk, on='State', how='right')\n", "```\n", "\n", "There are two ways to specify the key or keys on which to match rows. If the key or keys have the same name in both dataframes, use the `on` argument. If there is one key, pass this name as a string to `on` like we did above. If there is more than one key, pass these names as a list to `on`. For example, if we merge on both `State` and `year`, we will type `on = ['State', 'year']`.\n", "\n", "If the keys have different names in each dataframe, we use `left_on` to specify the names of these keys in the first dataframe listed in the `pd.merge()` function, and `right_on` to specify the names of these keys in the second dataframe listed. Suppose, for example, that the keys are named \"State\" and \"year\" in the first dataframe, and \"st\" and \"yr\" in the second dataframe. 
We would type `left_on = ['State', 'year'], right_on = ['st', 'yr']`.\n", "\n", "Depending on the type of join we use, a row can be matched, or can come from just the first or just the second dataframe listed in `pd.merge()`. To see which rows come from which sources, specify `indicator='newname'`, where `'newname'` is a name you choose for a new column that contains this information. For example, the following code repeats the merge we performed above, but uses an outer join instead of an inner join and adds a column `matched` that tells us how each row is matched:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcodematched
0Alabama1964210732479085030.5489769.4510319190001ALboth
1Alabama196819657914692369142518.9944881.0055219930001ALboth
2Alabama1972256923728701026.0670473.9329635400801ALboth
3Alabama1976659170504070056.6667343.3332737372041ALboth
4Alabama1980636730654192049.3236650.6763438940251ALboth
....................................
556Wyoming198867113106867038.5751261.4248846508056WYboth
557Wyoming19926816079347046.2079853.7920246625156WYboth
558Wyoming199677934105388042.5120857.4879248816756WYboth
559Wyoming200060481147947029.0176970.9823149378256WYboth
560Wyoming200470776167629029.6873070.3127150652956WYboth
\n", "

561 rows × 11 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode matched \n", "0 1919000 1 AL both \n", "1 1993000 1 AL both \n", "2 3540080 1 AL both \n", "3 3737204 1 AL both \n", "4 3894025 1 AL both \n", ".. ... ... ... ... \n", "556 465080 56 WY both \n", "557 466251 56 WY both \n", "558 488167 56 WY both \n", "559 493782 56 WY both \n", "560 506529 56 WY both \n", "\n", "[561 rows x 11 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_data = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched')\n", "merged_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking for Problems That Can Occur Without Errors\n", "Whenever we apply `pd.merge()`, there's a great risk that the merge was incorrect even if the function runs without error. There are two ways that a merge can fail.\n", "\n", "**Problem 1:** what we *thought* was a unique ID was not a unique ID. \n", "\n", "If we merge two dataframes on a column (or set of columns) that fail to uniquely identify the rows in either dataframe, then `pd.merge()` returns all pairwise combinations of rows with matching IDs. To see this, consider the following two example dataframes:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddata1
0A150
1A200
2B50
3B25
\n", "
" ], "text/plain": [ " id data1\n", "0 A 150\n", "1 A 200\n", "2 B 50\n", "3 B 25" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict1 = {'id': ['A', 'A', 'B', 'B'],\n", " 'data1': [150, 200, 50, 25]}\n", "df1 = pd.DataFrame.from_dict(dict1)\n", "df1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddata2
0A-20
1A-75
2B-125
3B-250
\n", "
" ], "text/plain": [ " id data2\n", "0 A -20\n", "1 A -75\n", "2 B -125\n", "3 B -250" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict2 = {'id': ['A', 'A', 'B', 'B'],\n", " 'data2': [-20, -75, -125, -250]}\n", "df2 = pd.DataFrame.from_dict(dict2)\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the two dataframes share a column \"id\", but that \"id\" is not a unique identifier in either dataframe. If I try to merge the two based on \"id\", the result is:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddata1data2
0A150-20
1A150-75
2A200-20
3A200-75
4B50-125
5B50-250
6B25-125
7B25-250
\n", "
" ], "text/plain": [ " id data1 data2\n", "0 A 150 -20\n", "1 A 150 -75\n", "2 A 200 -20\n", "3 A 200 -75\n", "4 B 50 -125\n", "5 B 50 -250\n", "6 B 25 -125\n", "7 B 25 -250" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_merge = pd.merge(df1, df2, on='id')\n", "df_merge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we did not have a unique ID in either dataframe, we artificially increased the number of rows from 4 to 8. At first glance, the merged data looks nice, but with double the number of rows we are supposed to have the data is in fact corrupted, and we cannot use it for any subsequent analysis. Also, note that the merge ran without any errors. If we aren't careful, we might move forward with data that cannot be used.\n", "\n", "The above example is an instance of a **many-to-many** merge, in which rows in the left dataframe can match to more than one row in the right dataframe, and rows in the right dataframe can match to more than one row in the left dataframe. In these cases `pandas` performs a cross join within the subsets of the two dataframes that share the same key, inflating the number of rows. But a many-to-many join is almost never what we want to do with data, and if it happens, it is usually a mistake.\n", "\n", "The other types of merges are \n", "* **one-to-one**, in which every row in the left dataframe matches to at most one row in the right dataframe, and every row in the right dataframe matches to at most one row in the left dataframe,\n", "* **many-to-one**, in which rows in the left dataframe can match to more than one row in the right dataframe, and every row in the right dataframe mnatches to at most one row in the left dataframe,\n", "* and **one-to-many**, in which every row in the left dataframe matches to at most one row in the right dataframe, and rows in the right dataframe can match to more than one row in the left dataframe.\n", "\n", "The best way to prevent this mistake is to first think about whether the two dataframes should constitute a one-to-one, many-to-one, or one-to-many merge. It is important to have a clear expectation. Then use the `validate` argument to have `pandas` automatically check to see whether this expectation is met. If not, the code will generate an error, which will bring our attention to problems in the data we might not have caught otherwise. If we write\n", "\n", "* `validate = 'one_to_one'` then `pandas` checks to see whether the merge keys are unique in both dataframes,\n", "* `validate = 'many_to_one'` then `pandas` checks to see whether the merge keys are unique in the right dataframe,\n", "* `validate = 'one_to_many'` then `pandas` checks to see whether the merge keys are unique in the left dataframe.\n", "\n", "There are many reasons why our expectations might be incorrect. In messy, real-world data, mistakes like duplicated keys often make it past review processes and exist in the data without any clue of it in the codebook. The `validate` argument is an important check for these kinds of issues. For example, we expect that merging `elect` and `crosswalk` will be a many-to-one merge, so we confirm that the merge is many-to-one as follows:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcodematched
0Alabama1964210732479085030.5489769.4510319190001ALboth
1Alabama196819657914692369142518.9944881.0055219930001ALboth
2Alabama1972256923728701026.0670473.9329635400801ALboth
3Alabama1976659170504070056.6667343.3332737372041ALboth
4Alabama1980636730654192049.3236650.6763438940251ALboth
....................................
556Wyoming198867113106867038.5751261.4248846508056WYboth
557Wyoming19926816079347046.2079853.7920246625156WYboth
558Wyoming199677934105388042.5120857.4879248816756WYboth
559Wyoming200060481147947029.0176970.9823149378256WYboth
560Wyoming200470776167629029.6873070.3127150652956WYboth
\n", "

561 rows × 11 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode matched \n", "0 1919000 1 AL both \n", "1 1993000 1 AL both \n", "2 3540080 1 AL both \n", "3 3737204 1 AL both \n", "4 3894025 1 AL both \n", ".. ... ... ... ... \n", "556 465080 56 WY both \n", "557 466251 56 WY both \n", "558 488167 56 WY both \n", "559 493782 56 WY both \n", "560 506529 56 WY both \n", "\n", "[561 rows x 11 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_data = pd.merge(elect, crosswalk, on='State', how='outer', \n", " indicator='matched', validate='many_to_one')\n", "merged_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Problem 2**: rows should be matched, but aren't.\n", "\n", "This problem is especially common when matching on strings, such as names. With countries, it's possible for one dataframe to write \"USA\" and the other to write \"United States\". You will have to go back into a dataframe to edit individual ID cells so that they match. But to identify the cells that need to be edited, we have to construct a dataframe of the rows that were unmatched in one or the other dataframe. The easiest way to see this is to use an anti-join: a merge that keeps only the rows in the left dataframe that have no match in the right dataframe. Like SQL, there is no anti-join function built in to `pandas`, but there are ways to program our own anti-join. The easiest way is to perform an outer join with the `indicator='matched'` argument, then to filter the data to those rows for which `matched` is not equal to \"both\". \n", "\n", "For this step, I recommend creating a second, working version of the data `merge_test` to contain the first attempt at the merge. If there are issues, it is easier to fix the problems and recreate the working dataframe than to restart the kernel. If there are no problems, then we can assign the `merged_data` dataframe to be equal to the working version." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcodematched
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [State, year, demvote, repvote, wallacevote, dempercent, reppercent, population, fips, stcode, matched]\n", "Index: []" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched', validate='many_to_one')\n", "merge_test.query(\"matched!='both'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are no rows in `crosswalk` with no match to `elect`, and there are no rows in `elect` without a match in `crosswalk`. But if there were unmatched rows, there are two reasons why some rows may have been unmatched: rows can be unmatched due to differences in coverage, and they might be unmatched due to differences in coding or spelling. `pandas` cannot tell the difference between these two reasons why rows might be unmatched, and an inner join will drop unmatched rows regardless of the reason. Using an outer join with `indicator='matched'` is best practice for seeing the unmatched rows and thinking about whether any rows that should be matched are for some reason unmatched. We can use the category relabeling functionality in `pandas` to fix issues of mispelled and miscoded keys, as we will do in the examples in the next section.\n", "\n", "After performing this check, if we intend to perform another merge, we will either need to choose a new name for the indicator column, or we can drop the `matched` column so that we can reuse this name later. We set `merged_data` to the working dataframe `merge_test` without the `matched` column:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "merged_data = merge_test.drop('matched', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging all of the Dataframes Together While Checking for Problems\n", "The `merged_data` and `income` dataframes share keys named `stcode` and `year`. Joining these two dataframes should be a one-to-one merge because there should be only one occurrence of the same state and year in each dataframe. We use `how='outer'` and `indicator='matched'` (since we deleted the previous `matched` column) to enable us to filter the result to the unmatched rows, and we use `validate='one_to_one'` to catch anomolies that violate our expectation that this is a one-to-one merge:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...wom_meanincomewom_minincomewom_maxincomewom_sdincomewom_ineqincomewom_gini_incomewom_n_incomewom_nwgt_incomemedincomematched
\n", "

0 rows × 64 columns

\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [State, year, demvote, repvote, wallacevote, dempercent, reppercent, population, fips, stcode, income, p10income, p20income, p30income, p40income, p50income, p60income, p70income, p80income, p90income, meanincome, minincome, maxincome, sdincome, ineqincome, gini_income, n_income, nwgt_income, men_p10income, men_p20income, men_p30income, men_p40income, men_p50income, men_p60income, men_p70income, men_p80income, men_p90income, men_meanincome, men_minincome, men_maxincome, men_sdincome, men_ineqincome, men_gini_income, men_n_income, men_nwgt_income, wom_p10income, wom_p20income, wom_p30income, wom_p40income, wom_p50income, wom_p60income, wom_p70income, wom_p80income, wom_p90income, wom_meanincome, wom_minincome, wom_maxincome, wom_sdincome, wom_ineqincome, wom_gini_income, wom_n_income, wom_nwgt_income, medincome, matched]\n", "Index: []\n", "\n", "[0 rows x 64 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(merged_data, income, on=['stcode','year'], how='outer', indicator='matched', validate='one_to_one')\n", "merge_test.query(\"matched!='both'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are no unmatched rows so there are no issues with miscoded or mispelled ID values. The merged data now contain information on both state election results and state incomes:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...wom_meanincomewom_minincomewom_maxincomewom_sdincomewom_ineqincomewom_gini_incomewom_n_incomewom_nwgt_incomemedincomematched
0Alabama1964210732479085030.5489769.4510319190001AL...11400.058.13953072674.421796.8929.531520.473738175.0418494.176.68081both
1Alabama196819657914692369142518.9944881.0055219930001AL...12005.3NaNNaN1968.7827.409790.485090NaNNaN80.72643both
2Alabama1972256923728701026.0670473.9329635400801AL...14667.9NaNNaN8604.2517.835770.457932NaNNaN82.27975both
3Alabama1976659170504070056.6667343.3332737372041AL...15362.5NaNNaN12842.2015.777650.454074NaNNaN60.07215both
4Alabama1980636730654192049.3236650.6763438940251AL...14804.228.384280120196.505944.7114.526400.436961543.0711278.241.72882both
..................................................................
556Wyoming198867113106867038.5751261.4248846508056WY...17188.075.98784077507.6010402.4023.548220.478940318.0110956.136.63389both
557Wyoming19926816079347046.2079853.7920246625156WY...17934.192.30770096391.0312512.8021.143880.469024360.0117688.630.10564both
558Wyoming199677934105388042.5120857.4879248816756WY...18796.429.816510122420.9014517.6022.743680.461693413.0125222.326.93395both
559Wyoming200060481147947029.0176970.9823149378256WY...20463.488.819230333171.4016609.8015.304970.447124440.0128753.926.05026both
560Wyoming200470776167629029.6873070.3127150652956WY...20225.19.587728206166.8018311.9011.331380.432011692.0123164.124.74094both
\n", "

561 rows × 64 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode ... wom_meanincome wom_minincome \\\n", "0 1919000 1 AL ... 11400.0 58.139530 \n", "1 1993000 1 AL ... 12005.3 NaN \n", "2 3540080 1 AL ... 14667.9 NaN \n", "3 3737204 1 AL ... 15362.5 NaN \n", "4 3894025 1 AL ... 14804.2 28.384280 \n", ".. ... ... ... ... ... ... \n", "556 465080 56 WY ... 17188.0 75.987840 \n", "557 466251 56 WY ... 17934.1 92.307700 \n", "558 488167 56 WY ... 18796.4 29.816510 \n", "559 493782 56 WY ... 20463.4 88.819230 \n", "560 506529 56 WY ... 20225.1 9.587728 \n", "\n", " wom_maxincome wom_sdincome wom_ineqincome wom_gini_income \\\n", "0 72674.42 1796.89 29.53152 0.473738 \n", "1 NaN 1968.78 27.40979 0.485090 \n", "2 NaN 8604.25 17.83577 0.457932 \n", "3 NaN 12842.20 15.77765 0.454074 \n", "4 120196.50 5944.71 14.52640 0.436961 \n", ".. ... ... ... ... \n", "556 77507.60 10402.40 23.54822 0.478940 \n", "557 96391.03 12512.80 21.14388 0.469024 \n", "558 122420.90 14517.60 22.74368 0.461693 \n", "559 333171.40 16609.80 15.30497 0.447124 \n", "560 206166.80 18311.90 11.33138 0.432011 \n", "\n", " wom_n_income wom_nwgt_income medincome matched \n", "0 175.0 418494.1 76.68081 both \n", "1 NaN NaN 80.72643 both \n", "2 NaN NaN 82.27975 both \n", "3 NaN NaN 60.07215 both \n", "4 543.0 711278.2 41.72882 both \n", ".. ... ... ... ... \n", "556 318.0 110956.1 36.63389 both \n", "557 360.0 117688.6 30.10564 both \n", "558 413.0 125222.3 26.93395 both \n", "559 440.0 128753.9 26.05026 both \n", "560 692.0 123164.1 24.74094 both \n", "\n", "[561 rows x 64 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there are no problems, we set `merged_data` equal to `merge_test`. Because we intend to merge the data with another dataframe, we drop the `matched` column. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "merged_data = merge_test.drop('matched', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we merge this result with the `econ` dataframe. These two dataframes share `fips` and `year` as keys, and should also be a one-to-one merge. We check to see whether there are unmatched rows:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...Manufacturing_shareFinance_Insurance_shareLegal_shareEducation_shareHealth_shareGovernment_shareGovernment_federal_shareGovernment_military_shareGovernment_statelocal_sharematched
\n", "

0 rows × 98 columns

\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [State, year, demvote, repvote, wallacevote, dempercent, reppercent, population, fips, stcode, income, p10income, p20income, p30income, p40income, p50income, p60income, p70income, p80income, p90income, meanincome, minincome, maxincome, sdincome, ineqincome, gini_income, n_income, nwgt_income, men_p10income, men_p20income, men_p30income, men_p40income, men_p50income, men_p60income, men_p70income, men_p80income, men_p90income, men_meanincome, men_minincome, men_maxincome, men_sdincome, men_ineqincome, men_gini_income, men_n_income, men_nwgt_income, wom_p10income, wom_p20income, wom_p30income, wom_p40income, wom_p50income, wom_p60income, wom_p70income, wom_p80income, wom_p90income, wom_meanincome, wom_minincome, wom_maxincome, wom_sdincome, wom_ineqincome, wom_gini_income, wom_n_income, wom_nwgt_income, medincome, GDP, GDPpc, Private, Agriculture, Farms, Mining, Utilities, Construction, Manufacturing, Finance_Insurance, Legal, Education, Health, Government, Government_federal, Government_military, Government_statelocal, cpideflator, statecpi, Private_share, Agriculture_share, Farms_share, Mining_share, Utilities_share, Construction_share, Manufacturing_share, Finance_Insurance_share, Legal_share, Education_share, Health_share, Government_share, Government_federal_share, Government_military_share, Government_statelocal_share, matched]\n", "Index: []\n", "\n", "[0 rows x 98 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(merged_data, econ, on=['fips','year'], how='outer', indicator='matched', validate='one_to_one')\n", "merge_test.query(\"matched!='both'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, there are no unmatched rows so there are no issues with miscoded or mispelled ID values. The merged data now contain information on state election results, state incomes, and state economies:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...Manufacturing_shareFinance_Insurance_shareLegal_shareEducation_shareHealth_shareGovernment_shareGovernment_federal_shareGovernment_military_shareGovernment_statelocal_sharematched
0Alabama1964210732479085030.5489769.4510319190001AL...28.22827011.4132400.4023900.2560661.90220715.863926.2919152.4021467.182051both
1Alabama196819657914692369142518.9944881.0055219930001AL...28.74235010.6767700.4292630.3653302.22851416.933055.9457482.6303778.356928both
2Alabama1972256923728701026.0670473.9329635400801AL...26.05634011.4827900.4694840.4434012.59520117.214405.6924882.6799698.841941both
3Alabama1976659170504070056.6667343.3332737372041AL...24.85334011.5054100.4792200.3594152.98686317.425435.7878212.2556399.386103both
4Alabama1980636730654192049.3236650.6763438940251AL...24.60423011.6341700.6498920.3527193.66050117.883135.7823702.1718609.928901both
..................................................................
556Wyoming198867113106867038.5751261.4248846508056WY...4.3800787.7887670.6888020.1236312.20770014.517842.6227481.66019110.234900both
557Wyoming19926816079347046.2079853.7920246625156WY...4.3038167.6703910.7048060.1799512.57929114.995882.8117271.55207310.632080both
558Wyoming199677934105388042.5120857.4879248816756WY...6.1467089.2804480.7055680.1525552.86676814.238492.6570051.54462210.030510both
559Wyoming200060481147947029.0176970.9823149378256WY...6.0123483.1100340.6462410.1731004.26980514.955862.7292131.54059210.686050both
560Wyoming200470776167629029.6873070.3127150652956WY...3.7959012.8138340.5892400.2220324.41930014.496162.5192141.64816410.328780both
\n", "

561 rows × 98 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode ... Manufacturing_share \\\n", "0 1919000 1 AL ... 28.228270 \n", "1 1993000 1 AL ... 28.742350 \n", "2 3540080 1 AL ... 26.056340 \n", "3 3737204 1 AL ... 24.853340 \n", "4 3894025 1 AL ... 24.604230 \n", ".. ... ... ... ... ... \n", "556 465080 56 WY ... 4.380078 \n", "557 466251 56 WY ... 4.303816 \n", "558 488167 56 WY ... 6.146708 \n", "559 493782 56 WY ... 6.012348 \n", "560 506529 56 WY ... 3.795901 \n", "\n", " Finance_Insurance_share Legal_share Education_share Health_share \\\n", "0 11.413240 0.402390 0.256066 1.902207 \n", "1 10.676770 0.429263 0.365330 2.228514 \n", "2 11.482790 0.469484 0.443401 2.595201 \n", "3 11.505410 0.479220 0.359415 2.986863 \n", "4 11.634170 0.649892 0.352719 3.660501 \n", ".. ... ... ... ... \n", "556 7.788767 0.688802 0.123631 2.207700 \n", "557 7.670391 0.704806 0.179951 2.579291 \n", "558 9.280448 0.705568 0.152555 2.866768 \n", "559 3.110034 0.646241 0.173100 4.269805 \n", "560 2.813834 0.589240 0.222032 4.419300 \n", "\n", " Government_share Government_federal_share Government_military_share \\\n", "0 15.86392 6.291915 2.402146 \n", "1 16.93305 5.945748 2.630377 \n", "2 17.21440 5.692488 2.679969 \n", "3 17.42543 5.787821 2.255639 \n", "4 17.88313 5.782370 2.171860 \n", ".. ... ... ... \n", "556 14.51784 2.622748 1.660191 \n", "557 14.99588 2.811727 1.552073 \n", "558 14.23849 2.657005 1.544622 \n", "559 14.95586 2.729213 1.540592 \n", "560 14.49616 2.519214 1.648164 \n", "\n", " Government_statelocal_share matched \n", "0 7.182051 both \n", "1 8.356928 both \n", "2 8.841941 both \n", "3 9.386103 both \n", "4 9.928901 both \n", ".. ... ... \n", "556 10.234900 both \n", "557 10.632080 both \n", "558 10.030510 both \n", "559 10.686050 both \n", "560 10.328780 both \n", "\n", "[561 rows x 98 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we set `merged_data` equal to `merge_test`, dropping the `matched` column so that we can reuse this name in the next join:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "merged_data = merge_test.drop('matched', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we merge the data with the `area` dataframe. This merge matches `State` in `merged_data` to `state` in area: this subtle difference is easy to miss and will result in an error if we fail to notice the lowercase s and write `on='State'`. This merge should be a many-to-one merge, as there are many rows for each state in `merged_data` (for many years), but only one row for each state in `area`. As " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Statestateyearmatched
77D. C.NaN1964.0left_only
78D. C.NaN1968.0left_only
79D. C.NaN1972.0left_only
80D. C.NaN1976.0left_only
81D. C.NaN1980.0left_only
82D. C.NaN1984.0left_only
83D. C.NaN1988.0left_only
84D. C.NaN1992.0left_only
85D. C.NaN1996.0left_only
86D. C.NaN2000.0left_only
87D. C.NaN2004.0left_only
561NaNDistrict of ColumbiaNaNright_only
562NaNPuerto RicoNaNright_only
563NaNNorthern Mariana IslandsNaNright_only
564NaNUnited States Virgin IslandsNaNright_only
565NaNAmerican SamoaNaNright_only
566NaNGuamNaNright_only
567NaNUnited States Minor Outlying IslandsNaNright_only
\n", "
" ], "text/plain": [ " State state year matched\n", "77 D. C. NaN 1964.0 left_only\n", "78 D. C. NaN 1968.0 left_only\n", "79 D. C. NaN 1972.0 left_only\n", "80 D. C. NaN 1976.0 left_only\n", "81 D. C. NaN 1980.0 left_only\n", "82 D. C. NaN 1984.0 left_only\n", "83 D. C. NaN 1988.0 left_only\n", "84 D. C. NaN 1992.0 left_only\n", "85 D. C. NaN 1996.0 left_only\n", "86 D. C. NaN 2000.0 left_only\n", "87 D. C. NaN 2004.0 left_only\n", "561 NaN District of Columbia NaN right_only\n", "562 NaN Puerto Rico NaN right_only\n", "563 NaN Northern Mariana Islands NaN right_only\n", "564 NaN United States Virgin Islands NaN right_only\n", "565 NaN American Samoa NaN right_only\n", "566 NaN Guam NaN right_only\n", "567 NaN United States Minor Outlying Islands NaN right_only" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(merged_data, area, how=\"outer\",\n", " left_on='State',right_on='state', \n", " indicator='matched', \n", " validate='many_to_one')\n", "merge_test.query(\"matched!='both'\")[['State', 'state',\n", " 'year','matched']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, one of our validation checks failed: we have unmatched rows. Some of these rows are due to differences in coverage: we will not bring Puerto Rico, the Northern Mariana Islands, the United States Virgin Islands, American Samoa, Guam, or the United States Minor Outlying Islands into the merged data as these territories do not receive votes in the electoral college. We will however bring the District of Columbia into the merged data as D.C. does get 3 votes in the electoral college. D.C. is unmatched due to spelling discrepancies, and we must correct this mistake before we can proceed. The easiest way to fix the error is to replace values of \"District of Columbia\" in the `area` dataframe with \"D. C.\", as it is written in the merged data. We can use the `.replace()` method (because we only want to replace one `state` category and `.replace()` leaves unlisted categories alone, as opposed to `.map()` which would replace unlisted categories with missing values), as we discussed in module 8:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "area.state = area.state.replace({'District of Columbia':'D. C.'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now if we repeat the test, we will only see the rows that are unmatched due to differences in coverage:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Statestateyearmatched
561NaNPuerto RicoNaNright_only
562NaNNorthern Mariana IslandsNaNright_only
563NaNUnited States Virgin IslandsNaNright_only
564NaNAmerican SamoaNaNright_only
565NaNGuamNaNright_only
566NaNUnited States Minor Outlying IslandsNaNright_only
\n", "
" ], "text/plain": [ " State state year matched\n", "561 NaN Puerto Rico NaN right_only\n", "562 NaN Northern Mariana Islands NaN right_only\n", "563 NaN United States Virgin Islands NaN right_only\n", "564 NaN American Samoa NaN right_only\n", "565 NaN Guam NaN right_only\n", "566 NaN United States Minor Outlying Islands NaN right_only" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(merged_data, area, left_on='State', how=\"outer\", right_on='state', indicator='matched', validate='many_to_one')\n", "merge_test.query(\"matched!='both'\")[['State', 'state','year','matched']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We replace the `merged_data` dataframe with `merge_test`, keeping only the rows for which `matched='both'`, then dropping the `matched` column:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "merged_data = merge_test.query(\"matched=='both'\").drop('matched', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now `merged_data` contains all of the data we need, including state election results, income, economic data, and area:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...Government_military_shareGovernment_statelocal_sharestatearea_sqmiarea_sqkmlandarea_sqmilandarea_sqkmwater_sqmiwater_sqkmpercent_water
0Alabama1964.0210732.0479085.00.030.5489769.451031919000.01.0AL...2.4021467.182051Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
1Alabama1968.0196579.0146923.0691425.018.9944881.005521993000.01.0AL...2.6303778.356928Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
2Alabama1972.0256923.0728701.00.026.0670473.932963540080.01.0AL...2.6799698.841941Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
3Alabama1976.0659170.0504070.00.056.6667343.333273737204.01.0AL...2.2556399.386103Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
4Alabama1980.0636730.0654192.00.049.3236650.676343894025.01.0AL...2.1718609.928901Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
..................................................................
556Wyoming1988.067113.0106867.00.038.5751261.42488465080.056.0WY...1.66019110.234900Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
557Wyoming1992.068160.079347.00.046.2079853.79202466251.056.0WY...1.55207310.632080Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
558Wyoming1996.077934.0105388.00.042.5120857.48792488167.056.0WY...1.54462210.030510Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
559Wyoming2000.060481.0147947.00.029.0176970.98231493782.056.0WY...1.54059210.686050Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
560Wyoming2004.070776.0167629.00.029.6873070.31271506529.056.0WY...1.64816410.328780Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
\n", "

561 rows × 105 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964.0 210732.0 479085.0 0.0 30.54897 69.45103 \n", "1 Alabama 1968.0 196579.0 146923.0 691425.0 18.99448 81.00552 \n", "2 Alabama 1972.0 256923.0 728701.0 0.0 26.06704 73.93296 \n", "3 Alabama 1976.0 659170.0 504070.0 0.0 56.66673 43.33327 \n", "4 Alabama 1980.0 636730.0 654192.0 0.0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988.0 67113.0 106867.0 0.0 38.57512 61.42488 \n", "557 Wyoming 1992.0 68160.0 79347.0 0.0 46.20798 53.79202 \n", "558 Wyoming 1996.0 77934.0 105388.0 0.0 42.51208 57.48792 \n", "559 Wyoming 2000.0 60481.0 147947.0 0.0 29.01769 70.98231 \n", "560 Wyoming 2004.0 70776.0 167629.0 0.0 29.68730 70.31271 \n", "\n", " population fips stcode ... Government_military_share \\\n", "0 1919000.0 1.0 AL ... 2.402146 \n", "1 1993000.0 1.0 AL ... 2.630377 \n", "2 3540080.0 1.0 AL ... 2.679969 \n", "3 3737204.0 1.0 AL ... 2.255639 \n", "4 3894025.0 1.0 AL ... 2.171860 \n", ".. ... ... ... ... ... \n", "556 465080.0 56.0 WY ... 1.660191 \n", "557 466251.0 56.0 WY ... 1.552073 \n", "558 488167.0 56.0 WY ... 1.544622 \n", "559 493782.0 56.0 WY ... 1.540592 \n", "560 506529.0 56.0 WY ... 1.648164 \n", "\n", " Government_statelocal_share state area_sqmi area_sqkm \\\n", "0 7.182051 Alabama 52,419.02 135,765 \n", "1 8.356928 Alabama 52,419.02 135,765 \n", "2 8.841941 Alabama 52,419.02 135,765 \n", "3 9.386103 Alabama 52,419.02 135,765 \n", "4 9.928901 Alabama 52,419.02 135,765 \n", ".. ... ... ... ... \n", "556 10.234900 Wyoming 97,813.56 253,336 \n", "557 10.632080 Wyoming 97,813.56 253,336 \n", "558 10.030510 Wyoming 97,813.56 253,336 \n", "559 10.686050 Wyoming 97,813.56 253,336 \n", "560 10.328780 Wyoming 97,813.56 253,336 \n", "\n", " landarea_sqmi landarea_sqkm water_sqmi water_sqkm percent_water \n", "0 50,744.00 131,426 1,675.01 4,338 3.20 \n", "1 50,744.00 131,426 1,675.01 4,338 3.20 \n", "2 50,744.00 131,426 1,675.01 4,338 3.20 \n", "3 50,744.00 131,426 1,675.01 4,338 3.20 \n", "4 50,744.00 131,426 1,675.01 4,338 3.20 \n", ".. ... ... ... ... ... \n", "556 97,100.40 251,489 713.16 1,847 0.73 \n", "557 97,100.40 251,489 713.16 1,847 0.73 \n", "558 97,100.40 251,489 713.16 1,847 0.73 \n", "559 97,100.40 251,489 713.16 1,847 0.73 \n", "560 97,100.40 251,489 713.16 1,847 0.73 \n", "\n", "[561 rows x 105 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reshaping Data\n", "**Reshaping** a dataframe refers to turning the rows of a dataframe into columns (also known as a long-to-wide reshape), or turning columns into rows (also known as a wide-to-long reshape), in a way that cannot be accomplished by simply transposing the data matrix. It is also called **pivoting** for a long-to-wide reshape, **melting** for a wide-to-long reshape. It's a notoriously tricky skill to master, and it helps a lot to understand every part of the function we need, so that we're not blindly trying every combination of parameters hoping one of these combinations works.\n", "\n", "### Example: Gross domestic product by state, 1997-2015\n", "To demonstrate the method for pivoting a dataframe, we use example data from the U.S. Bureau of Economic Affairs (www.bea.gov) on economic output for the American states. We load the data, skipping the last four rows as these rows contain attribution information and notes, and are not part of the data. 
We also use the `.str.strip()` method to remove all the leading and trailing whitespace in the `Description` column: " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] }, { "data": { "text/html": [ "
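Incidentally, the `ParserWarning` above is harmless: `skipfooter` is only supported by the slower Python parsing engine, so `pandas` falls back to it automatically. If we prefer to silence the warning, we can request that engine explicitly, as the message suggests (the same call with one extra argument):

```
import pandas as pd

gsp = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv",
                  skipfooter=4, engine='python')
```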
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GeoFIPSGeoNameRegionComponentIdComponentNameIndustryIdIndustryClassificationDescription19971998...2006200720082009201020112012201320142015
000000United StatesNaN200Gross domestic product (GDP) by state1...All industry total85425309024434...13773226143911491462659814320114148597721540600216041243165768081727754817919651
100000United StatesNaN200Gross domestic product (GDP) by state2...Private industries74593957894015...12045446125723871271617912352979128265071334843913957545144684651511584615698669
200000United StatesNaN200Gross domestic product (GDP) by state311Agriculture, forestry, fishing, and hunting10879699940...128345141999154525137655160217197241185800221821203188175236
300000United StatesNaN200Gross domestic product (GDP) by state4111-112Farms8813679030...99352113533126345109800129725166249151489186960166249(NA)
400000United StatesNaN200Gross domestic product (GDP) by state5113-115Forestry, fishing, and related activities2066020910...289932846628180278553049230992343113486136939(NA)
..................................................................
4328794000Plains41000Per capita real GDP by state1...All industry total3907139970...46013465014679445399462104692147629481264877849153
4328895000Southeast51000Per capita real GDP by state1...All industry total3672237896...43586430284209240346406754047140415404874080241352
4328996000Southwest61000Per capita real GDP by state1...All industry total3888740283...45743466844591444324444914534446881480644932750519
4329097000Rocky Mountain71000Per capita real GDP by state1...All industry total3824340390...46756475874701945170449374506045014455644650847093
4329198000Far West81000Per capita real GDP by state1...All industry total4190143467...53932544695388151328516095185752429529425407755429
\n", "

43292 rows × 27 columns

\n", "
" ], "text/plain": [ " GeoFIPS GeoName Region ComponentId \\\n", "0 00000 United States NaN 200 \n", "1 00000 United States NaN 200 \n", "2 00000 United States NaN 200 \n", "3 00000 United States NaN 200 \n", "4 00000 United States NaN 200 \n", "... ... ... ... ... \n", "43287 94000 Plains 4 1000 \n", "43288 95000 Southeast 5 1000 \n", "43289 96000 Southwest 6 1000 \n", "43290 97000 Rocky Mountain 7 1000 \n", "43291 98000 Far West 8 1000 \n", "\n", " ComponentName IndustryId \\\n", "0 Gross domestic product (GDP) by state 1 \n", "1 Gross domestic product (GDP) by state 2 \n", "2 Gross domestic product (GDP) by state 3 \n", "3 Gross domestic product (GDP) by state 4 \n", "4 Gross domestic product (GDP) by state 5 \n", "... ... ... \n", "43287 Per capita real GDP by state 1 \n", "43288 Per capita real GDP by state 1 \n", "43289 Per capita real GDP by state 1 \n", "43290 Per capita real GDP by state 1 \n", "43291 Per capita real GDP by state 1 \n", "\n", " IndustryClassification Description \\\n", "0 ... All industry total \n", "1 ... Private industries \n", "2 11 Agriculture, forestry, fishing, and hunting \n", "3 111-112 Farms \n", "4 113-115 Forestry, fishing, and related activities \n", "... ... ... \n", "43287 ... All industry total \n", "43288 ... All industry total \n", "43289 ... All industry total \n", "43290 ... All industry total \n", "43291 ... All industry total \n", "\n", " 1997 1998 ... 2006 2007 2008 2009 \\\n", "0 8542530 9024434 ... 13773226 14391149 14626598 14320114 \n", "1 7459395 7894015 ... 12045446 12572387 12716179 12352979 \n", "2 108796 99940 ... 128345 141999 154525 137655 \n", "3 88136 79030 ... 99352 113533 126345 109800 \n", "4 20660 20910 ... 28993 28466 28180 27855 \n", "... ... ... ... ... ... ... ... \n", "43287 39071 39970 ... 46013 46501 46794 45399 \n", "43288 36722 37896 ... 43586 43028 42092 40346 \n", "43289 38887 40283 ... 45743 46684 45914 44324 \n", "43290 38243 40390 ... 46756 47587 47019 45170 \n", "43291 41901 43467 ... 53932 54469 53881 51328 \n", "\n", " 2010 2011 2012 2013 2014 2015 \n", "0 14859772 15406002 16041243 16576808 17277548 17919651 \n", "1 12826507 13348439 13957545 14468465 15115846 15698669 \n", "2 160217 197241 185800 221821 203188 175236 \n", "3 129725 166249 151489 186960 166249 (NA) \n", "4 30492 30992 34311 34861 36939 (NA) \n", "... ... ... ... ... ... ... \n", "43287 46210 46921 47629 48126 48778 49153 \n", "43288 40675 40471 40415 40487 40802 41352 \n", "43289 44491 45344 46881 48064 49327 50519 \n", "43290 44937 45060 45014 45564 46508 47093 \n", "43291 51609 51857 52429 52942 54077 55429 \n", "\n", "[43292 rows x 27 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv\", skipfooter=4)\n", "gsp['Description'] = gsp['Description'].str.strip()\n", "gsp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we only want to save three features besides the primary keys:\n", "\n", "* state GDP,\n", "* state GDP per capita,\n", "* state GDP from private industries\n", "\n", "We only want the 50 states, no regions, territories, or districts (sorry DC!). Notice that the features we want are contained in rows, and the years are contained in columns. We need to switch that: the features must be in columns, and years must be stored in the rows. We will need to perform both a wide-to-long and a long-to-wide reshape. 
In addition, we will give the columns concise but descriptive names, and we will create a new column that contains the percent of GDP from private industry in each state and each year.\n", "\n", "First we need to reduce the data to only the features we need: state GDP, state GDP per capita, and state GDP from private industries. Right now, however, the features are not stored in the columns, but in the rows. Specifically, the data we need are contained in the rows in which `Description` is \"All industry total\" or \"Private industries\" and `ComponentName` is \"Gross domestic product (GDP) by state\" or \"Per capita real GDP by state\". We use `.query()` to isolate these rows:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GeoFIPSGeoNameRegionComponentIdComponentNameIndustryIdIndustryClassificationDescription19971998...2006200720082009201020112012201320142015
000000United StatesNaN200Gross domestic product (GDP) by state1...All industry total85425309024434...13773226143911491462659814320114148597721540600216041243165768081727754817919651
100000United StatesNaN200Gross domestic product (GDP) by state2...Private industries74593957894015...12045446125723871271617912352979128265071334843913957545144684651511584615698669
9001000Alabama5200Gross domestic product (GDP) by state1...All industry total104218109414...164468169923172646168315174710180665185878190095194421199656
9101000Alabama5200Gross domestic product (GDP) by state2...Private industries8701491506...137954141306142965137143142773148181153494157961161364166243
18002000Alaska8200Gross domestic product (GDP) by state1...All industry total2544624030...44679491975546150463541345875960890597625825352747
..................................................................
4328794000Plains41000Per capita real GDP by state1...All industry total3907139970...46013465014679445399462104692147629481264877849153
4328895000Southeast51000Per capita real GDP by state1...All industry total3672237896...43586430284209240346406754047140415404874080241352
4328996000Southwest61000Per capita real GDP by state1...All industry total3888740283...45743466844591444324444914534446881480644932750519
4329097000Rocky Mountain71000Per capita real GDP by state1...All industry total3824340390...46756475874701945170449374506045014455644650847093
4329198000Far West81000Per capita real GDP by state1...All industry total4190143467...53932544695388151328516095185752429529425407755429
\n", "

180 rows × 27 columns

\n", "
" ], "text/plain": [ " GeoFIPS GeoName Region ComponentId \\\n", "0 00000 United States NaN 200 \n", "1 00000 United States NaN 200 \n", "90 01000 Alabama 5 200 \n", "91 01000 Alabama 5 200 \n", "180 02000 Alaska 8 200 \n", "... ... ... ... ... \n", "43287 94000 Plains 4 1000 \n", "43288 95000 Southeast 5 1000 \n", "43289 96000 Southwest 6 1000 \n", "43290 97000 Rocky Mountain 7 1000 \n", "43291 98000 Far West 8 1000 \n", "\n", " ComponentName IndustryId \\\n", "0 Gross domestic product (GDP) by state 1 \n", "1 Gross domestic product (GDP) by state 2 \n", "90 Gross domestic product (GDP) by state 1 \n", "91 Gross domestic product (GDP) by state 2 \n", "180 Gross domestic product (GDP) by state 1 \n", "... ... ... \n", "43287 Per capita real GDP by state 1 \n", "43288 Per capita real GDP by state 1 \n", "43289 Per capita real GDP by state 1 \n", "43290 Per capita real GDP by state 1 \n", "43291 Per capita real GDP by state 1 \n", "\n", " IndustryClassification Description 1997 1998 ... \\\n", "0 ... All industry total 8542530 9024434 ... \n", "1 ... Private industries 7459395 7894015 ... \n", "90 ... All industry total 104218 109414 ... \n", "91 ... Private industries 87014 91506 ... \n", "180 ... All industry total 25446 24030 ... \n", "... ... ... ... ... ... \n", "43287 ... All industry total 39071 39970 ... \n", "43288 ... All industry total 36722 37896 ... \n", "43289 ... All industry total 38887 40283 ... \n", "43290 ... All industry total 38243 40390 ... \n", "43291 ... All industry total 41901 43467 ... \n", "\n", " 2006 2007 2008 2009 2010 2011 2012 \\\n", "0 13773226 14391149 14626598 14320114 14859772 15406002 16041243 \n", "1 12045446 12572387 12716179 12352979 12826507 13348439 13957545 \n", "90 164468 169923 172646 168315 174710 180665 185878 \n", "91 137954 141306 142965 137143 142773 148181 153494 \n", "180 44679 49197 55461 50463 54134 58759 60890 \n", "... ... ... ... ... ... ... ... \n", "43287 46013 46501 46794 45399 46210 46921 47629 \n", "43288 43586 43028 42092 40346 40675 40471 40415 \n", "43289 45743 46684 45914 44324 44491 45344 46881 \n", "43290 46756 47587 47019 45170 44937 45060 45014 \n", "43291 53932 54469 53881 51328 51609 51857 52429 \n", "\n", " 2013 2014 2015 \n", "0 16576808 17277548 17919651 \n", "1 14468465 15115846 15698669 \n", "90 190095 194421 199656 \n", "91 157961 161364 166243 \n", "180 59762 58253 52747 \n", "... ... ... ... \n", "43287 48126 48778 49153 \n", "43288 40487 40802 41352 \n", "43289 48064 49327 50519 \n", "43290 45564 46508 47093 \n", "43291 52942 54077 55429 \n", "\n", "[180 rows x 27 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp.query(\n", " \"Description in ['All industry total', 'Private industries']\"\n", ").query(\n", " \"ComponentName in ['Gross domestic product (GDP) by state','Per capita real GDP by state']\")\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we only want the 50 states, no regions, territories, or districts, so we issue another `.query()` to remove the territories. 
To see these non-states, we call up a list of the unique values of `GeoName`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['United States', 'Alabama', 'Alaska', 'Arizona', 'Arkansas',\n", " 'California', 'Colorado', 'Connecticut', 'Delaware',\n", " 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',\n", " 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',\n", " 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',\n", " 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',\n", " 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',\n", " 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',\n", " 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',\n", " 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',\n", " 'West Virginia', 'Wisconsin', 'Wyoming', 'New England', 'Mideast',\n", " 'Great Lakes', 'Plains', 'Southeast', 'Southwest',\n", " 'Rocky Mountain', 'Far West'], dtype=object)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean['GeoName'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to remove the rows in which `GeoName` is \"United States\", \"District of Columbia\", \"New England\", \"Mideast\", \"Great Lakes\", \"Plains\", \"Southeast\", \"Southwest\", \"Rocky Mountain\", or \"Far West\":" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GeoFIPSGeoNameRegionComponentIdComponentNameIndustryIdIndustryClassificationDescription19971998...2006200720082009201020112012201320142015
9001000Alabama5200Gross domestic product (GDP) by state1...All industry total104218109414...164468169923172646168315174710180665185878190095194421199656
9101000Alabama5200Gross domestic product (GDP) by state2...Private industries8701491506...137954141306142965137143142773148181153494157961161364166243
18002000Alaska8200Gross domestic product (GDP) by state1...All industry total2544624030...44679491975546150463541345875960890597625825352747
18102000Alaska8200Gross domestic product (GDP) by state2...Private industries2028418776...36737409194688941387447424885150592497644787942246
27004000Arizona6200Gross domestic product (GDP) by state1...All industry total132708143768...248459262045256718242509245668254192264693270642280166290903
..................................................................
4327951000Virginia51000Per capita real GDP by state1...All industry total4306944745...52866526575198551389519455174951538511055086151540
4328053000Washington81000Per capita real GDP by state1...All industry total4575347712...52900553485493952264526815249553423539875477355577
4328154000West Virginia51000Per capita real GDP by state1...All industry total3044530978...34009338923467934564353683608535515357783623436817
4328255000Wisconsin31000Per capita real GDP by state1...All industry total3866339695...45515454644462243215441264490545380455824646946893
4328356000Wyoming71000Per capita real GDP by state1...All industry total4658547652...63428654716918266320646026398160744607436163961389
\n", "

150 rows × 27 columns

\n", "
" ], "text/plain": [ " GeoFIPS GeoName Region ComponentId \\\n", "90 01000 Alabama 5 200 \n", "91 01000 Alabama 5 200 \n", "180 02000 Alaska 8 200 \n", "181 02000 Alaska 8 200 \n", "270 04000 Arizona 6 200 \n", "... ... ... ... ... \n", "43279 51000 Virginia 5 1000 \n", "43280 53000 Washington 8 1000 \n", "43281 54000 West Virginia 5 1000 \n", "43282 55000 Wisconsin 3 1000 \n", "43283 56000 Wyoming 7 1000 \n", "\n", " ComponentName IndustryId \\\n", "90 Gross domestic product (GDP) by state 1 \n", "91 Gross domestic product (GDP) by state 2 \n", "180 Gross domestic product (GDP) by state 1 \n", "181 Gross domestic product (GDP) by state 2 \n", "270 Gross domestic product (GDP) by state 1 \n", "... ... ... \n", "43279 Per capita real GDP by state 1 \n", "43280 Per capita real GDP by state 1 \n", "43281 Per capita real GDP by state 1 \n", "43282 Per capita real GDP by state 1 \n", "43283 Per capita real GDP by state 1 \n", "\n", " IndustryClassification Description 1997 1998 ... 2006 \\\n", "90 ... All industry total 104218 109414 ... 164468 \n", "91 ... Private industries 87014 91506 ... 137954 \n", "180 ... All industry total 25446 24030 ... 44679 \n", "181 ... Private industries 20284 18776 ... 36737 \n", "270 ... All industry total 132708 143768 ... 248459 \n", "... ... ... ... ... ... ... \n", "43279 ... All industry total 43069 44745 ... 52866 \n", "43280 ... All industry total 45753 47712 ... 52900 \n", "43281 ... All industry total 30445 30978 ... 34009 \n", "43282 ... All industry total 38663 39695 ... 45515 \n", "43283 ... All industry total 46585 47652 ... 63428 \n", "\n", " 2007 2008 2009 2010 2011 2012 2013 2014 2015 \n", "90 169923 172646 168315 174710 180665 185878 190095 194421 199656 \n", "91 141306 142965 137143 142773 148181 153494 157961 161364 166243 \n", "180 49197 55461 50463 54134 58759 60890 59762 58253 52747 \n", "181 40919 46889 41387 44742 48851 50592 49764 47879 42246 \n", "270 262045 256718 242509 245668 254192 264693 270642 280166 290903 \n", "... ... ... ... ... ... ... ... ... ... \n", "43279 52657 51985 51389 51945 51749 51538 51105 50861 51540 \n", "43280 55348 54939 52264 52681 52495 53423 53987 54773 55577 \n", "43281 33892 34679 34564 35368 36085 35515 35778 36234 36817 \n", "43282 45464 44622 43215 44126 44905 45380 45582 46469 46893 \n", "43283 65471 69182 66320 64602 63981 60744 60743 61639 61389 \n", "\n", "[150 rows x 27 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp_clean.query(\"GeoName not in ['United States', 'District of Columbia', 'New England', 'Mideast', 'Great Lakes', 'Plains', 'Southeast', 'Southwest', 'Rocky Mountain', 'Far West']\")\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Melting: Turning Columns to Rows\n", "Notice that many of the columns refer to particular years. We want to move these columns to the rows, so that each state has one row for each year covered in the data. The best function to use when we need to turn columns into rows is `pd.melt()`:\n", "```\n", "pd.melt(df, id_vars, value_vars)\n", "```\n", "This function takes three parameters:\n", "\n", "1. The name of the dataframe you wish to edit\n", "\n", "2. `id_vars`: A list of the names of the columns that we want to continue to exist as columns after the dataframe is melted. 
In our example, this list will contain the IDs that currently uniquely identify the rows (`GeoName`, `ComponentName`, `Description`), synonyms for the IDs (`GeoFIPS`) that we want to keep, and features that vary across ID but not across the columns (`Region`).\n", "\n", "3. `value_vars`: A list of the names of the columns we want to store in the rows. If the columns have numeric names and can be sequentially ordered, as is the case in our example with years, a shortcut for creating a list of these column names is `[str(i) for i in range(1997,2015)]`, which loops over integers from 1997 to 2014 (the upper bound of `range()` is exclusive), placing each year as a string in the list.\n", "\n", "Any column that is not listed under either `id_vars` or `value_vars` will be dropped from the melted data. \n", "\n", "For example, we melt the `gsp_clean` dataframe with the following code:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
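To see exactly what that shortcut produces, we can run the list comprehension on its own. Note that `range(1997,2015)` stops before 2015, so this particular list covers the 18 year columns from 1997 through 2014:

```
year_columns = [str(i) for i in range(1997,2015)]
print(year_columns[0], '...', year_columns[-1], '|', len(year_columns), 'columns')
# 1997 ... 2014 | 18 columns
```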
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GeoNameGeoFIPSRegionComponentNameDescriptionvariablevalue
0Alabama010005Gross domestic product (GDP) by stateAll industry total1997104218
1Alabama010005Gross domestic product (GDP) by statePrivate industries199787014
2Alaska020008Gross domestic product (GDP) by stateAll industry total199725446
3Alaska020008Gross domestic product (GDP) by statePrivate industries199720284
4Arizona040006Gross domestic product (GDP) by stateAll industry total1997132708
........................
2695Virginia510005Per capita real GDP by stateAll industry total201450861
2696Washington530008Per capita real GDP by stateAll industry total201454773
2697West Virginia540005Per capita real GDP by stateAll industry total201436234
2698Wisconsin550003Per capita real GDP by stateAll industry total201446469
2699Wyoming560007Per capita real GDP by stateAll industry total201461639
\n", "

2700 rows × 7 columns

\n", "
" ], "text/plain": [ " GeoName GeoFIPS Region ComponentName \\\n", "0 Alabama 01000 5 Gross domestic product (GDP) by state \n", "1 Alabama 01000 5 Gross domestic product (GDP) by state \n", "2 Alaska 02000 8 Gross domestic product (GDP) by state \n", "3 Alaska 02000 8 Gross domestic product (GDP) by state \n", "4 Arizona 04000 6 Gross domestic product (GDP) by state \n", "... ... ... ... ... \n", "2695 Virginia 51000 5 Per capita real GDP by state \n", "2696 Washington 53000 8 Per capita real GDP by state \n", "2697 West Virginia 54000 5 Per capita real GDP by state \n", "2698 Wisconsin 55000 3 Per capita real GDP by state \n", "2699 Wyoming 56000 7 Per capita real GDP by state \n", "\n", " Description variable value \n", "0 All industry total 1997 104218 \n", "1 Private industries 1997 87014 \n", "2 All industry total 1997 25446 \n", "3 Private industries 1997 20284 \n", "4 All industry total 1997 132708 \n", "... ... ... ... \n", "2695 All industry total 2014 50861 \n", "2696 All industry total 2014 54773 \n", "2697 All industry total 2014 36234 \n", "2698 All industry total 2014 46469 \n", "2699 All industry total 2014 61639 \n", "\n", "[2700 rows x 7 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = pd.melt(gsp_clean, id_vars = ['GeoName', 'GeoFIPS', 'Region',\n", " 'ComponentName', 'Description'], \n", " value_vars = [str(i) for i in range(1997,2015)])\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the old column names (1997 through 2015 in this case) will always by stored in a column named \"variable\" after melting, and the datapoints that populated those columns will be contained in a column named \"value\". \n", "\n", "Before moving on, let's rename some of the columns to avoid confusion in the next step:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateFIPSRegionComponentNameDescriptionYearvalue
0Alabama010005Gross domestic product (GDP) by stateAll industry total1997104218
1Alabama010005Gross domestic product (GDP) by statePrivate industries199787014
2Alaska020008Gross domestic product (GDP) by stateAll industry total199725446
3Alaska020008Gross domestic product (GDP) by statePrivate industries199720284
4Arizona040006Gross domestic product (GDP) by stateAll industry total1997132708
........................
2695Virginia510005Per capita real GDP by stateAll industry total201450861
2696Washington530008Per capita real GDP by stateAll industry total201454773
2697West Virginia540005Per capita real GDP by stateAll industry total201436234
2698Wisconsin550003Per capita real GDP by stateAll industry total201446469
2699Wyoming560007Per capita real GDP by stateAll industry total201461639
\n", "

2700 rows × 7 columns

\n", "
" ], "text/plain": [ " State FIPS Region ComponentName \\\n", "0 Alabama 01000 5 Gross domestic product (GDP) by state \n", "1 Alabama 01000 5 Gross domestic product (GDP) by state \n", "2 Alaska 02000 8 Gross domestic product (GDP) by state \n", "3 Alaska 02000 8 Gross domestic product (GDP) by state \n", "4 Arizona 04000 6 Gross domestic product (GDP) by state \n", "... ... ... ... ... \n", "2695 Virginia 51000 5 Per capita real GDP by state \n", "2696 Washington 53000 8 Per capita real GDP by state \n", "2697 West Virginia 54000 5 Per capita real GDP by state \n", "2698 Wisconsin 55000 3 Per capita real GDP by state \n", "2699 Wyoming 56000 7 Per capita real GDP by state \n", "\n", " Description Year value \n", "0 All industry total 1997 104218 \n", "1 Private industries 1997 87014 \n", "2 All industry total 1997 25446 \n", "3 Private industries 1997 20284 \n", "4 All industry total 1997 132708 \n", "... ... ... ... \n", "2695 All industry total 2014 50861 \n", "2696 All industry total 2014 54773 \n", "2697 All industry total 2014 36234 \n", "2698 All industry total 2014 46469 \n", "2699 All industry total 2014 61639 \n", "\n", "[2700 rows x 7 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp_clean.rename({'GeoName':'State',\n", " 'GeoFIPS':'FIPS',\n", " 'variable':'Year'}, axis=1)\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pivoting: Turning Rows to Columns\n", "The rows contain features that we want to move to the columns. Specifically, we need separate columns for the different combinations of `ComponentName` and `Description` since the feature names are contained in two columns, so we can combine these two columns by concatenating them together and creating a new column called \"feature\". Then we can drop `ComponentName` and `Description`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateFIPSRegionYearvaluefeature
0Alabama0100051997104218Gross domestic product (GDP) by stateAll indus...
1Alabama010005199787014Gross domestic product (GDP) by statePrivate i...
2Alaska020008199725446Gross domestic product (GDP) by stateAll indus...
3Alaska020008199720284Gross domestic product (GDP) by statePrivate i...
4Arizona0400061997132708Gross domestic product (GDP) by stateAll indus...
.....................
2695Virginia510005201450861Per capita real GDP by stateAll industry total
2696Washington530008201454773Per capita real GDP by stateAll industry total
2697West Virginia540005201436234Per capita real GDP by stateAll industry total
2698Wisconsin550003201446469Per capita real GDP by stateAll industry total
2699Wyoming560007201461639Per capita real GDP by stateAll industry total
\n", "

2700 rows × 6 columns

\n", "
" ], "text/plain": [ " State FIPS Region Year value \\\n", "0 Alabama 01000 5 1997 104218 \n", "1 Alabama 01000 5 1997 87014 \n", "2 Alaska 02000 8 1997 25446 \n", "3 Alaska 02000 8 1997 20284 \n", "4 Arizona 04000 6 1997 132708 \n", "... ... ... ... ... ... \n", "2695 Virginia 51000 5 2014 50861 \n", "2696 Washington 53000 8 2014 54773 \n", "2697 West Virginia 54000 5 2014 36234 \n", "2698 Wisconsin 55000 3 2014 46469 \n", "2699 Wyoming 56000 7 2014 61639 \n", "\n", " feature \n", "0 Gross domestic product (GDP) by stateAll indus... \n", "1 Gross domestic product (GDP) by statePrivate i... \n", "2 Gross domestic product (GDP) by stateAll indus... \n", "3 Gross domestic product (GDP) by statePrivate i... \n", "4 Gross domestic product (GDP) by stateAll indus... \n", "... ... \n", "2695 Per capita real GDP by stateAll industry total \n", "2696 Per capita real GDP by stateAll industry total \n", "2697 Per capita real GDP by stateAll industry total \n", "2698 Per capita real GDP by stateAll industry total \n", "2699 Per capita real GDP by stateAll industry total \n", "\n", "[2700 rows x 6 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp_clean.assign(feature = gsp_clean['ComponentName'] + gsp_clean['Description'])\n", "gsp_clean = gsp_clean.drop(['ComponentName', 'Description'], axis=1)\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To move these features from the rows to columns, apply the `.pivot_table()` method to a dataframe:\n", "```\n", "df.pivot_table(index, columns, values)\n", "```\n", "To understand this method, pay close attention to its parameters:\n", "\n", "1. `index` - a list containing the names of the current columns that we want to remain columns in the reshaped data. These include the primary keys (`State` and `Year` in this example), synonyms for a key (`FIPS`), and features that vary only by ID (`Region`).\n", "\n", "2. `columns` - the name of the column that contains the **names** of the new columns we are trying to create.\n", "\n", "3. `values` - the name of the column that contains the **datapoints** we are trying to move to the new columns.\n", "\n", "There is one important issue to keep in mind when using the `.pivot_table()` method. `.pivot_table()` contains default behavior to handle cases in which the keys in `index` do not uniquely identify the rows of a dataframe. By default, `.pivot_table()` takes the mean within group. If the column specified within `values` is non-numeric, however, the mean is not defined. So first make sure that any columns that are numeric are recognized as a numeric data type. If a column contains text, categories, or other non-numeric data, specify `aggfunc='first'` inside `.pivot_table()`. Writing `aggfunc='first'` tells `.pivot_table()` to use the first cell within a group instead of calculating the mean within a group, and the first cell is defined for non-numeric columns. If the columns in `index` do uniquely identify rows, there is only one cell per group anyway, so it is fine to take the first cell.\n", "\n", "Presently within `gsp_clean` the datapoints are contained within the `value` column. 
Unfortunately, `value` is not currently understood to be numeric:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "State object\n", "FIPS object\n", "Region object\n", "Year object\n", "value object\n", "feature object\n", "dtype: object" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following our code from module 8, we convert value to a numeric class:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "State object\n", "FIPS object\n", "Region object\n", "Year object\n", "value float64\n", "feature object\n", "dtype: object" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean.value = gsp_clean.value.astype('float')\n", "gsp_clean.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Now we can reshape the data with `.pivot_table()`:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
featureGross domestic product (GDP) by stateAll industry totalGross domestic product (GDP) by statePrivate industriesPer capita real GDP by stateAll industry total
StateFIPSRegionYear
Alabama01000519971042188701431398
19981094149150632164
19991150159628433106
20001190209966533284
200112282210297833312
.....................
Wyoming5600072010391033383264602
2011414993616463981
2012402013460460744
2013409793509660743
2014420213594761639
\n", "

900 rows × 3 columns

\n", "
" ], "text/plain": [ "feature Gross domestic product (GDP) by stateAll industry total \\\n", "State FIPS Region Year \n", "Alabama 01000 5 1997 104218 \n", " 1998 109414 \n", " 1999 115015 \n", " 2000 119020 \n", " 2001 122822 \n", "... ... \n", "Wyoming 56000 7 2010 39103 \n", " 2011 41499 \n", " 2012 40201 \n", " 2013 40979 \n", " 2014 42021 \n", "\n", "feature Gross domestic product (GDP) by statePrivate industries \\\n", "State FIPS Region Year \n", "Alabama 01000 5 1997 87014 \n", " 1998 91506 \n", " 1999 96284 \n", " 2000 99665 \n", " 2001 102978 \n", "... ... \n", "Wyoming 56000 7 2010 33832 \n", " 2011 36164 \n", " 2012 34604 \n", " 2013 35096 \n", " 2014 35947 \n", "\n", "feature Per capita real GDP by stateAll industry total \n", "State FIPS Region Year \n", "Alabama 01000 5 1997 31398 \n", " 1998 32164 \n", " 1999 33106 \n", " 2000 33284 \n", " 2001 33312 \n", "... ... \n", "Wyoming 56000 7 2010 64602 \n", " 2011 63981 \n", " 2012 60744 \n", " 2013 60743 \n", " 2014 61639 \n", "\n", "[900 rows x 3 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean['value'] = gsp_clean['value'].astype(int) # first convert this column to int \n", "gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'], \n", " columns='feature', \n", " values='value')\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, if we choose to leave the `value` column as an `object` data type, we can use this code instead to perform the wide-to-long reshape:\n", "```\n", "gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'], \n", " columns='feature', \n", " values='value', \n", " aggfunc='first')\n", "```\n", "Note that the dataframe is not formated in a flat way, because the IDs and related features are stored in the row-indices instead of in columns. To convert the dataframe back to a standard format, use the `.to_records()` method within the `pd.DataFrame()` function:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateFIPSRegionYearGross domestic product (GDP) by stateAll industry totalGross domestic product (GDP) by statePrivate industriesPer capita real GDP by stateAll industry total
0Alabama01000519971042188701431398
1Alabama01000519981094149150632164
2Alabama01000519991150159628433106
3Alabama01000520001190209966533284
4Alabama010005200112282210297833312
........................
895Wyoming5600072010391033383264602
896Wyoming5600072011414993616463981
897Wyoming5600072012402013460460744
898Wyoming5600072013409793509660743
899Wyoming5600072014420213594761639
\n", "

900 rows × 7 columns

\n", "
" ], "text/plain": [ " State FIPS Region Year \\\n", "0 Alabama 01000 5 1997 \n", "1 Alabama 01000 5 1998 \n", "2 Alabama 01000 5 1999 \n", "3 Alabama 01000 5 2000 \n", "4 Alabama 01000 5 2001 \n", ".. ... ... ... ... \n", "895 Wyoming 56000 7 2010 \n", "896 Wyoming 56000 7 2011 \n", "897 Wyoming 56000 7 2012 \n", "898 Wyoming 56000 7 2013 \n", "899 Wyoming 56000 7 2014 \n", "\n", " Gross domestic product (GDP) by stateAll industry total \\\n", "0 104218 \n", "1 109414 \n", "2 115015 \n", "3 119020 \n", "4 122822 \n", ".. ... \n", "895 39103 \n", "896 41499 \n", "897 40201 \n", "898 40979 \n", "899 42021 \n", "\n", " Gross domestic product (GDP) by statePrivate industries \\\n", "0 87014 \n", "1 91506 \n", "2 96284 \n", "3 99665 \n", "4 102978 \n", ".. ... \n", "895 33832 \n", "896 36164 \n", "897 34604 \n", "898 35096 \n", "899 35947 \n", "\n", " Per capita real GDP by stateAll industry total \n", "0 31398 \n", "1 32164 \n", "2 33106 \n", "3 33284 \n", "4 33312 \n", ".. ... \n", "895 64602 \n", "896 63981 \n", "897 60744 \n", "898 60743 \n", "899 61639 \n", "\n", "[900 rows x 7 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = pd.DataFrame(gsp_clean.to_records())\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All that's left to clean the data is to rename the columns and to create a new column containing the percent of the state's GDP that comes from private industry:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateFIPSRegionYearGDPGDPprivateGDPpcpercent_private
0Alabama0100051997104218870143139883.492295
1Alabama0100051998109414915063216483.632808
2Alabama0100051999115015962843310683.714298
3Alabama0100052000119020996653328483.738027
4Alabama01000520011228221029783331283.843285
...........................
895Wyoming560007201039103338326460286.520216
896Wyoming560007201141499361646398187.144269
897Wyoming560007201240201346046074486.077461
898Wyoming560007201340979350966074385.643866
899Wyoming560007201442021359476163985.545323
\n", "

900 rows × 8 columns

\n", "
" ], "text/plain": [ " State FIPS Region Year GDP GDPprivate GDPpc percent_private\n", "0 Alabama 01000 5 1997 104218 87014 31398 83.492295\n", "1 Alabama 01000 5 1998 109414 91506 32164 83.632808\n", "2 Alabama 01000 5 1999 115015 96284 33106 83.714298\n", "3 Alabama 01000 5 2000 119020 99665 33284 83.738027\n", "4 Alabama 01000 5 2001 122822 102978 33312 83.843285\n", ".. ... ... ... ... ... ... ... ...\n", "895 Wyoming 56000 7 2010 39103 33832 64602 86.520216\n", "896 Wyoming 56000 7 2011 41499 36164 63981 87.144269\n", "897 Wyoming 56000 7 2012 40201 34604 60744 86.077461\n", "898 Wyoming 56000 7 2013 40979 35096 60743 85.643866\n", "899 Wyoming 56000 7 2014 42021 35947 61639 85.545323\n", "\n", "[900 rows x 8 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp_clean.rename({'Gross domestic product (GDP) by stateAll industry total':'GDP',\n", " 'Gross domestic product (GDP) by statePrivate industries':'GDPprivate',\n", " 'Per capita real GDP by stateAll industry total':'GDPpc'}, \n", " axis=1)\n", "gsp_clean = gsp_clean.assign(percent_private = 100* gsp_clean.GDPprivate / gsp_clean.GDP)\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with Strings (Example: the 2019 ANES Pilot Study)\n", "In module 8, we worked with the 2019 American National Election Study pilot survey and saved the cleaned data in a separate CSV file:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
caseidliveurbanvote16protestvotemost_important_issueconfeconideologypartyIDuniversal_income...partisanshipftbiden_levelageage2ftbiden_floatftbiden_catftbiden_strprefersbidenworried_econfavor_both
01SuburbSomeone elseFalseJoe BidenHealth CareA little worriedConservativeDemocratFavor a moderate amount...5.0neutral51260152.052.052.0TrueFalseTrue
12SuburbDonald TrumpFalseDonald TrumpWorking togetherA little worriedConservativeRepublicanOppose a moderate amount...0.0neutral78608441.041.041.0FalseFalseFalse
23RuralHillary ClintonFalseJoe Bidenhealth careExtremely worriedModerateDemocratNeither favor nor oppose...88.0like66435688.088.088.0TrueTrueFalse
34CityHillary ClintonFalseDonald TrumpThe economy.A little worriedModerateDemocratNeither favor nor oppose...-100.0dislike4116810.00.00.0FalseFalseFalse
45CityDonald TrumpFalseDonald TrumpChinaNot at all worriedConservativeRepublicanOppose a great deal...-69.0dislike80640025.025.025.0FalseFalseFalse
..................................................................
31603161TownDonald TrumpFalseDonald TrumpThe infiltration of Marxists into the institut...A little worriedConservativeRepublicanOppose a great deal...-74.0dislike7251847.07.07.0FalseFalseFalse
31613162CitySomeone elseFalseSomeone elseLack of basic resources being provided and off...Extremely worriedNaNDemocratFavor a great deal...-10.0dislike2457625.025.025.0FalseTrueTrue
31623163CityDid not voteFalseProbably will not votedonald trumpVery worriedLiberalIndependentOppose a great deal...44.0neutral40160050.050.050.0TrueFalseFalse
31633164SuburbDid not voteFalseJoe BidenDonald TrumpModerately worriedLiberalDemocratFavor a moderate amount...94.0like60360095.095.095.0TrueTrueFalse
31643165TownHillary ClintonFalseJoe BidentrumpExtremely worriedModerateDemocratOppose a great deal...70.0neutral60360070.070.070.0TrueTrueFalse
\n", "

3165 rows × 61 columns

\n", "
" ], "text/plain": [ " caseid liveurban vote16 protest vote \\\n", "0 1 Suburb Someone else False Joe Biden \n", "1 2 Suburb Donald Trump False Donald Trump \n", "2 3 Rural Hillary Clinton False Joe Biden \n", "3 4 City Hillary Clinton False Donald Trump \n", "4 5 City Donald Trump False Donald Trump \n", "... ... ... ... ... ... \n", "3160 3161 Town Donald Trump False Donald Trump \n", "3161 3162 City Someone else False Someone else \n", "3162 3163 City Did not vote False Probably will not vote \n", "3163 3164 Suburb Did not vote False Joe Biden \n", "3164 3165 Town Hillary Clinton False Joe Biden \n", "\n", " most_important_issue confecon \\\n", "0 Health Care A little worried \n", "1 Working together A little worried \n", "2 health care Extremely worried \n", "3 The economy. A little worried \n", "4 China Not at all worried \n", "... ... ... \n", "3160 The infiltration of Marxists into the institut... A little worried \n", "3161 Lack of basic resources being provided and off... Extremely worried \n", "3162 donald trump Very worried \n", "3163 Donald Trump Moderately worried \n", "3164 trump Extremely worried \n", "\n", " ideology partyID universal_income ... partisanship \\\n", "0 Conservative Democrat Favor a moderate amount ... 5.0 \n", "1 Conservative Republican Oppose a moderate amount ... 0.0 \n", "2 Moderate Democrat Neither favor nor oppose ... 88.0 \n", "3 Moderate Democrat Neither favor nor oppose ... -100.0 \n", "4 Conservative Republican Oppose a great deal ... -69.0 \n", "... ... ... ... ... ... \n", "3160 Conservative Republican Oppose a great deal ... -74.0 \n", "3161 NaN Democrat Favor a great deal ... -10.0 \n", "3162 Liberal Independent Oppose a great deal ... 44.0 \n", "3163 Liberal Democrat Favor a moderate amount ... 94.0 \n", "3164 Moderate Democrat Oppose a great deal ... 70.0 \n", "\n", " ftbiden_level age age2 ftbiden_float ftbiden_cat ftbiden_str \\\n", "0 neutral 51 2601 52.0 52.0 52.0 \n", "1 neutral 78 6084 41.0 41.0 41.0 \n", "2 like 66 4356 88.0 88.0 88.0 \n", "3 dislike 41 1681 0.0 0.0 0.0 \n", "4 dislike 80 6400 25.0 25.0 25.0 \n", "... ... .. ... ... ... ... \n", "3160 dislike 72 5184 7.0 7.0 7.0 \n", "3161 dislike 24 576 25.0 25.0 25.0 \n", "3162 neutral 40 1600 50.0 50.0 50.0 \n", "3163 like 60 3600 95.0 95.0 95.0 \n", "3164 neutral 60 3600 70.0 70.0 70.0 \n", "\n", " prefersbiden worried_econ favor_both \n", "0 True False True \n", "1 False False False \n", "2 True True False \n", "3 False False False \n", "4 False False False \n", "... ... ... ... \n", "3160 False False False \n", "3161 False True True \n", "3162 True False False \n", "3163 True True False \n", "3164 True True False \n", "\n", "[3165 rows x 61 columns]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/anes_clean.csv\")\n", "anes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data contain a column `most_important_issue` which records individuals' responses to the question of what they feel is the most important issue facing the United States as of December 2019. These responses are in the respondents' own words. 
For example, one response is" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The influence of big money on our political system both domestic and foreign'" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue[5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These responses are strings, and we can employ the suite of string operations to manipulate the data in this column. String methods are collected under the `.str` accessor in `pandas`, and we will have to call the `.str` attribute in order to use string methods.\n", "\n", "One important technique is to alter the case within the text so that we can identify words without worrying about case sensitivity. To turn all responses to lowercase, use `.str.lower()`:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                             health care\n", "1                                        working together\n", "2                                             health care\n", "3                                            the economy.\n", "4                                                   china\n", "                              ...                        \n", "3160    the infiltration of marxists into the institut...\n", "3161    lack of basic resources being provided and off...\n", "3162                                         donald trump\n", "3163                                         donald trump\n", "3164                                                trump\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And to turn all the text to uppercase, use `.str.upper()`:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                             HEALTH CARE\n", "1                                        WORKING TOGETHER\n", "2                                             HEALTH CARE\n", "3                                            THE ECONOMY.\n", "4                                                   CHINA\n", "                              ...                        \n", "3160    THE INFILTRATION OF MARXISTS INTO THE INSTITUT...\n", "3161    LACK OF BASIC RESOURCES BEING PROVIDED AND OFF...\n", "3162                                         DONALD TRUMP\n", "3163                                         DONALD TRUMP\n", "3164                                                TRUMP\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.upper()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One issue that can inhibit our ability to search through text is the existence of leading and trailing whitespace in the responses. White space can exist, invisibly, for several reasons. It is possible that the data authors included white space in order to standardize the number of spaces that the column takes up in the data file. To remove both leading and trailing white spaces, use `.str.strip()`:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "anes.most_important_issue = anes.most_important_issue.str.strip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.replace()` method finds and replaces specific whole entries with different strings. For example, we can replace every response that reads \"health care\" with \"hospital stuff\":" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                          hospital stuff\n", "1                                        working together\n", "2                                          hospital stuff\n", "3                                            the economy.\n", "4                                                   china\n", "                              ...                        \n", "3160    the infiltration of marxists into the institut...\n", "3161    lack of basic resources being provided and off...\n", "3162                                         donald trump\n", "3163                                         donald trump\n", "3164                                                trump\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower().replace('health care', 'hospital stuff')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to replace pieces of these strings, we can specify the `regex=True` argument. The following line of code converts the strings to lowercase and then replaces \"trump\" with \"twimp\". With `regex=True`, the method replaces the pattern \"trump\" anywhere it appears, as opposed to the default `regex=False`, which only replaces entire entries that are exactly \"trump\":" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                             health care\n", "1                                        working together\n", "2                                             health care\n", "3                                            the economy.\n", "4                                                   china\n", "                              ...                        \n", "3160    the infiltration of marxists into the institut...\n", "3161    lack of basic resources being provided and off...\n", "3162                                         donald twimp\n", "3163                                         donald twimp\n", "3164                                                twimp\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower().replace('trump', 'twimp', regex=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a new column that identifies when particular words are used, use either the `.str.match()` or `.str.contains()` method. `.str.match()` is true if the beginning of the entry matches the provided pattern, " ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0       False\n", "1       False\n", "2       False\n", "3       False\n", "4       False\n", "        ...  \n", "3160    False\n", "3161    False\n", "3162    False\n", "3163    False\n", "3164     True\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower().str.match('trump')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and `.str.contains()` is true if the provided string exists anywhere within the entry. " ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0       False\n", "1       False\n", "2       False\n", "3       False\n", "4       False\n", "        ...  \n", "3160    False\n", "3161    False\n", "3162     True\n", "3163     True\n", "3164     True\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower().str.contains('trump')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These columns of logical values can be placed into the data to filter or for other purposes. For example, we can see the number of Democrats, Republicans, and independents who use the word \"trump\" in their statement of the country's most important problem:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
problem_trumpFalseTrue
partyID
Democrat1030293
Independent47947
Republican112273
\n", "
" ], "text/plain": [ "problem_trump False True \n", "partyID \n", "Democrat 1030 293\n", "Independent 479 47\n", "Republican 1122 73" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes['problem_trump'] = anes.most_important_issue.str.lower().str.contains('trump')\n", "pd.DataFrame(anes.groupby(['partyID', 'problem_trump']).size().unstack())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.str.replace()`, `.str.match()`, and `.str.contains()` methods can also accept regular expressions for identifying additional patterns within the string. Regular expressions are beyond our scope in this document, but here is a [good discussion of regular expressions in Python](https://docs.python.org/2/howto/regex.html).\n", "\n", "To calculate the length of each response in terms of the number of characters, use `.str.len()`. In this case, it is important to make sure the whitespace is removed first with `.str.strip()`, otherwise the spaces will count towards the length: " ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 11.0\n", "1 16.0\n", "2 11.0\n", "3 12.0\n", "4 5.0\n", " ... \n", "3160 254.0\n", "3161 325.0\n", "3162 12.0\n", "3163 12.0\n", "3164 5.0\n", "Name: most_important_issue, Length: 3165, dtype: float64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.len()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the length to display, for example, the single longest response in the data. To display the whole string, turn off the display limit by typing `pd.options.display.max_colwidth = None`. Here I set the limit to 500, because it sort of goes off the rails and I'd rather not display it:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "819 Immigration has been and has become a very big problem. The gangs, drug wars, the crime has climbed in every single town, county, state in this country and I believe it all revolves around too many different races, too many prejudices. Most people I talk to dislike foreigners and most foreigners hate Americans. We ask, why come here then? Because they get financial aid, housing, medical, dental aid and compared to the wars, poverty and the way they lived in their own country, America is heav...\n", "Name: most_important_issue, dtype: object" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes['length'] = anes.most_important_issue.str.len()\n", "pd.options.display.max_colwidth = 500\n", "anes.sort_values(by = 'length', ascending = False).most_important_issue.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After all that, we set the string display limit back to the default of 50 characters:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "pd.options.display.max_colwidth = 50" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In some cases it might be necessary to split entries into different features according to a delimiter or pattern. 
It doesn't make sense to split the `most_important_issue` column in this example, but as an illustration, we can split the responses on periods:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                           [Health Care]\n", "1                                      [Working together]\n", "2                                           [health care]\n", "3                                         [The economy, ]\n", "4                                                 [China]\n", "                              ...                        \n", "3160    [The infiltration of Marxists into the institu...\n", "3161    [Lack of basic resources being provided and of...\n", "3162                                       [donald trump]\n", "3163                                       [Donald Trump]\n", "3164                                              [trump]\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.split('.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `expand=True` argument displays these split strings in different columns in a dataframe:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...19202122232425262728
0Health CareNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
1Working togetherNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
2health careNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3The economyNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
4ChinaNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
..................................................................
3160The infiltration of Marxists into the institut...Money seems to be no object to themTheir over promising will lead to eventual chaosUsually that leads to the over promised event...NoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3161Lack of basic resources being provided and off...This is a first world nation, a nation claimi...yet we don't have housing or healthcare secur...That needs to change and it needs to be chang...A basic standard of living needs to be guaran...NoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3162donald trumpNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3163Donald TrumpNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3164trumpNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
\n", "

3165 rows × 29 columns

\n", "
" ], "text/plain": [ " 0 \\\n", "0 Health Care \n", "1 Working together \n", "2 health care \n", "3 The economy \n", "4 China \n", "... ... \n", "3160 The infiltration of Marxists into the institut... \n", "3161 Lack of basic resources being provided and off... \n", "3162 donald trump \n", "3163 Donald Trump \n", "3164 trump \n", "\n", " 1 \\\n", "0 None \n", "1 None \n", "2 None \n", "3 \n", "4 None \n", "... ... \n", "3160 Money seems to be no object to them \n", "3161 This is a first world nation, a nation claimi... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", " 2 \\\n", "0 None \n", "1 None \n", "2 None \n", "3 None \n", "4 None \n", "... ... \n", "3160 Their over promising will lead to eventual chaos \n", "3161 yet we don't have housing or healthcare secur... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", " 3 \\\n", "0 None \n", "1 None \n", "2 None \n", "3 None \n", "4 None \n", "... ... \n", "3160 Usually that leads to the over promised event... \n", "3161 That needs to change and it needs to be chang... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", " 4 5 6 7 \\\n", "0 None None None None \n", "1 None None None None \n", "2 None None None None \n", "3 None None None None \n", "4 None None None None \n", "... ... ... ... ... \n", "3160 None None None \n", "3161 A basic standard of living needs to be guaran... None None \n", "3162 None None None None \n", "3163 None None None None \n", "3164 None None None None \n", "\n", " 8 9 ... 19 20 21 22 23 24 25 26 27 \\\n", "0 None None ... None None None None None None None None None \n", "1 None None ... None None None None None None None None None \n", "2 None None ... None None None None None None None None None \n", "3 None None ... None None None None None None None None None \n", "4 None None ... None None None None None None None None None \n", "... ... ... ... ... ... ... ... ... ... ... ... ... \n", "3160 None None ... None None None None None None None None None \n", "3161 None None ... None None None None None None None None None \n", "3162 None None ... None None None None None None None None None \n", "3163 None None ... None None None None None None None None None \n", "3164 None None ... None None None None None None None None None \n", "\n", " 28 \n", "0 None \n", "1 None \n", "2 None \n", "3 None \n", "4 None \n", "... ... \n", "3160 None \n", "3161 None \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", "[3165 rows x 29 columns]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.split('.', expand=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing `n=1` only splits on the first occurrence of the delimiter, `n=2` splits on the first and second occurrence, and so on:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0Health CareNoneNone
1Working togetherNoneNone
2health careNoneNone
3The economyNone
4ChinaNoneNone
............
3160The infiltration of Marxists into the institut...Money seems to be no object to themTheir over promising will lead to eventual ch...
3161Lack of basic resources being provided and off...This is a first world nation, a nation claimi...yet we don't have housing or healthcare secur...
3162donald trumpNoneNone
3163Donald TrumpNoneNone
3164trumpNoneNone
\n", "

3165 rows × 3 columns

\n", "
" ], "text/plain": [ " 0 \\\n", "0 Health Care \n", "1 Working together \n", "2 health care \n", "3 The economy \n", "4 China \n", "... ... \n", "3160 The infiltration of Marxists into the institut... \n", "3161 Lack of basic resources being provided and off... \n", "3162 donald trump \n", "3163 Donald Trump \n", "3164 trump \n", "\n", " 1 \\\n", "0 None \n", "1 None \n", "2 None \n", "3 \n", "4 None \n", "... ... \n", "3160 Money seems to be no object to them \n", "3161 This is a first world nation, a nation claimi... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", " 2 \n", "0 None \n", "1 None \n", "2 None \n", "3 None \n", "4 None \n", "... ... \n", "3160 Their over promising will lead to eventual ch... \n", "3161 yet we don't have housing or healthcare secur... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", "[3165 rows x 3 columns]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.split('.', expand=True, n=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An individual string can be indexed according to character number. These indices can be applied to every entry in a string column as well. To pull just the first five characters out of each response in `most_important_issue`, we can type:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Healt\n", "1 Worki\n", "2 healt\n", "3 The e\n", "4 China\n", " ... \n", "3160 The i\n", "3161 Lack \n", "3162 donal\n", "3163 Donal\n", "3164 trump\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To pull characters 6 through 10, we type" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Care\n", "1 g tog\n", "2 care\n", "3 onomy\n", "4 \n", " ... \n", "3160 filtr\n", "3161 f bas\n", "3162 trum\n", "3163 Trum\n", "3164 \n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str[6:11]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To pull the last four characters, use a negative number to begin the range, and leave the end of the range blank:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Care\n", "1 ther\n", "2 care\n", "3 omy.\n", "4 hina\n", " ... \n", "3160 ers.\n", "3161 eed.\n", "3162 rump\n", "3163 rump\n", "3164 rump\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str[-4:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with Dates and Times (Example: Twitter)\n", "In module 4 we discussed Twitter as an example of how to use an API in Python. 
We save the four credentials we need to access the API (the consumer key, the consumer secret, the access token, and the access token secret) in a .env file, and we load those keys into Python without displaying them in our notebooks by using the `dotenv` package:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "import dotenv\n", "import os\n", "os.chdir('/Users/jk8sd/Box Sync/Practice and Applications 1 online/Module 9 - Data Managenent in pandas Part 2')\n", "dotenv.load_dotenv()\n", "ConsumerKey = os.getenv('ConsumerKey')\n", "ConsumerSecret = os.getenv('ConsumerSecret')\n", "AccessToken = os.getenv('AccessToken')\n", "AccessTokenSecret = os.getenv('AccessTokenSecret')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we use the `tweepy` package to work with the Twitter API and we create a Twitter cursor with our credentials:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "import tweepy\n", "auth = tweepy.OAuthHandler(ConsumerKey, ConsumerSecret)\n", "auth.set_access_token(AccessToken, AccessTokenSecret)\n", "api = tweepy.API(auth)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code extracts 1000 tweets that contain the hashtag \"#uva\":" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atuser
0@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing
1Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis
2RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline1
3Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos247
4Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes247
............
995RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari...2020-06-23 15:12:08norahlfy
996#UVA: 54,87. Sube 0,20% diario, 1,22% en junio...2020-06-23 15:08:40AgNPalabras
997Zur Einstimmung #UVA vom letzten #gig bei uns ...2020-06-23 14:52:23WeAppU
998RT @JLuis_Sommelier: #Mosto: jugo obtenido de ...2020-06-23 14:49:22VirialexViri
999Good question... My instinct says Ty Jerome, b...2020-06-23 14:32:44The_Superhoo
\n", "

1000 rows × 3 columns

\n", "
" ], "text/plain": [ " text created_at \\\n", "0 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 \n", "1 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 \n", "2 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... 2020-06-30 01:55:02 \n", "3 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 \n", "4 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 \n", ".. ... ... \n", "995 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... 2020-06-23 15:12:08 \n", "996 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... 2020-06-23 15:08:40 \n", "997 Zur Einstimmung #UVA vom letzten #gig bei uns ... 2020-06-23 14:52:23 \n", "998 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... 2020-06-23 14:49:22 \n", "999 Good question... My instinct says Ty Jerome, b... 2020-06-23 14:32:44 \n", "\n", " user \n", "0 SweetLickKing \n", "1 Daniel_B_Ennis \n", "2 blimeyonline1 \n", "3 Wahoos247 \n", "4 JamieOakes247 \n", ".. ... \n", "995 norahlfy \n", "996 AgNPalabras \n", "997 WeAppU \n", "998 VirialexViri \n", "999 The_Superhoo \n", "\n", "[1000 rows x 3 columns]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "msgs = []\n", "msg =[]\n", "\n", "for tweet in tweepy.Cursor(api.search, q='#uva').items(1000):\n", " msg = [tweet.text, tweet.created_at, tweet.user.screen_name] \n", " msg = tuple(msg) \n", " msgs.append(msg)\n", "\n", "tweets = pd.DataFrame(msgs, columns = ['text', 'created_at', 'user'])\n", "tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting Year, Month, Day, and Time from a Timestamp Column\n", "Note that the data contain timestamps that include the date and time, to the second, of each tweet. The `created_at` column has a `datetime64` data type:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text object\n", "created_at datetime64[ns]\n", "user object\n", "dtype: object" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference between a `datetime64` data type and an `object` or numeric type is that we have the ability to extract individual elements of time from a `datetime64` value. To pull out the year, month, day, hour, minute, and second of the timestamp, we use the following attributes:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[6, 30, 2020, 2, 20, 43]" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[tweets.created_at[0].month, tweets.created_at[0].day, \n", " tweets.created_at[0].year, tweets.created_at[0].hour, \n", " tweets.created_at[0].minute, tweets.created_at[0].second]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest way to create new columns with these elements for all the values in the dataframe is to use comprehension loops:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecond
0@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing630202022043
1Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis630202021817
2RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline163020201552
3Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos24763020201400
4Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes247630202013942
..............................
995RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari...2020-06-23 15:12:08norahlfy623202015128
996#UVA: 54,87. Sube 0,20% diario, 1,22% en junio...2020-06-23 15:08:40AgNPalabras623202015840
997Zur Einstimmung #UVA vom letzten #gig bei uns ...2020-06-23 14:52:23WeAppU6232020145223
998RT @JLuis_Sommelier: #Mosto: jugo obtenido de ...2020-06-23 14:49:22VirialexViri6232020144922
999Good question... My instinct says Ty Jerome, b...2020-06-23 14:32:44The_Superhoo6232020143244
\n", "

1000 rows × 9 columns

\n", "
" ], "text/plain": [ " text created_at \\\n", "0 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 \n", "1 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 \n", "2 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... 2020-06-30 01:55:02 \n", "3 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 \n", "4 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 \n", ".. ... ... \n", "995 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... 2020-06-23 15:12:08 \n", "996 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... 2020-06-23 15:08:40 \n", "997 Zur Einstimmung #UVA vom letzten #gig bei uns ... 2020-06-23 14:52:23 \n", "998 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... 2020-06-23 14:49:22 \n", "999 Good question... My instinct says Ty Jerome, b... 2020-06-23 14:32:44 \n", "\n", " user month day year hour minute second \n", "0 SweetLickKing 6 30 2020 2 20 43 \n", "1 Daniel_B_Ennis 6 30 2020 2 18 17 \n", "2 blimeyonline1 6 30 2020 1 55 2 \n", "3 Wahoos247 6 30 2020 1 40 0 \n", "4 JamieOakes247 6 30 2020 1 39 42 \n", ".. ... ... ... ... ... ... ... \n", "995 norahlfy 6 23 2020 15 12 8 \n", "996 AgNPalabras 6 23 2020 15 8 40 \n", "997 WeAppU 6 23 2020 14 52 23 \n", "998 VirialexViri 6 23 2020 14 49 22 \n", "999 The_Superhoo 6 23 2020 14 32 44 \n", "\n", "[1000 rows x 9 columns]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['month'] = [x.month for x in tweets.created_at]\n", "tweets['day'] = [x.day for x in tweets.created_at]\n", "tweets['year'] = [x.year for x in tweets.created_at]\n", "tweets['hour'] = [x.hour for x in tweets.created_at]\n", "tweets['minute'] = [x.minute for x in tweets.created_at]\n", "tweets['second'] = [x.second for x in tweets.created_at]\n", "tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating Timestamps from Separate Month, Day, and Year Columns\n", "Sometimes raw data will come to us with these time elements already placed in separate columns, so that we have a `month`, `day`, and `year` column but no `date` column. There are important advantages to placing all this information in one column in values that Python understands as timestamps: timestamps are easier to filter, manipulate, index, and plot. To create one timestamp from separate columns that contain the month, day, and year (and optionally the hour, minute, and second), use the `pd.to_datatime` function:\n", "```\n", "pd.to_datetime(df)\n", "```\n", "where `df` is a subset of the dataframe that contains the separate columns for the time elements. In order for this function to work, these columns must be named `year`, `month`, and `day`, and optionally `hour`, `minute`, and `second`. If these columns are named something else, first use the `.rename()` method to change the names to the ones listed above.\n", "\n", "To create a timestamp from the individual time element columns, we can type:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecondtimestamp
0@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing6302020220432020-06-30 02:20:43
1Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis6302020218172020-06-30 02:18:17
2RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline1630202015522020-06-30 01:55:02
3Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos247630202014002020-06-30 01:40:00
4Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes2476302020139422020-06-30 01:39:42
.................................
995RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari...2020-06-23 15:12:08norahlfy6232020151282020-06-23 15:12:08
996#UVA: 54,87. Sube 0,20% diario, 1,22% en junio...2020-06-23 15:08:40AgNPalabras6232020158402020-06-23 15:08:40
997Zur Einstimmung #UVA vom letzten #gig bei uns ...2020-06-23 14:52:23WeAppU62320201452232020-06-23 14:52:23
998RT @JLuis_Sommelier: #Mosto: jugo obtenido de ...2020-06-23 14:49:22VirialexViri62320201449222020-06-23 14:49:22
999Good question... My instinct says Ty Jerome, b...2020-06-23 14:32:44The_Superhoo62320201432442020-06-23 14:32:44
\n", "

1000 rows × 10 columns

\n", "
" ], "text/plain": [ " text created_at \\\n", "0 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 \n", "1 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 \n", "2 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... 2020-06-30 01:55:02 \n", "3 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 \n", "4 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 \n", ".. ... ... \n", "995 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... 2020-06-23 15:12:08 \n", "996 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... 2020-06-23 15:08:40 \n", "997 Zur Einstimmung #UVA vom letzten #gig bei uns ... 2020-06-23 14:52:23 \n", "998 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... 2020-06-23 14:49:22 \n", "999 Good question... My instinct says Ty Jerome, b... 2020-06-23 14:32:44 \n", "\n", " user month day year hour minute second \\\n", "0 SweetLickKing 6 30 2020 2 20 43 \n", "1 Daniel_B_Ennis 6 30 2020 2 18 17 \n", "2 blimeyonline1 6 30 2020 1 55 2 \n", "3 Wahoos247 6 30 2020 1 40 0 \n", "4 JamieOakes247 6 30 2020 1 39 42 \n", ".. ... ... ... ... ... ... ... \n", "995 norahlfy 6 23 2020 15 12 8 \n", "996 AgNPalabras 6 23 2020 15 8 40 \n", "997 WeAppU 6 23 2020 14 52 23 \n", "998 VirialexViri 6 23 2020 14 49 22 \n", "999 The_Superhoo 6 23 2020 14 32 44 \n", "\n", " timestamp \n", "0 2020-06-30 02:20:43 \n", "1 2020-06-30 02:18:17 \n", "2 2020-06-30 01:55:02 \n", "3 2020-06-30 01:40:00 \n", "4 2020-06-30 01:39:42 \n", ".. ... \n", "995 2020-06-23 15:12:08 \n", "996 2020-06-23 15:08:40 \n", "997 2020-06-23 14:52:23 \n", "998 2020-06-23 14:49:22 \n", "999 2020-06-23 14:32:44 \n", "\n", "[1000 rows x 10 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['timestamp'] = pd.to_datetime(tweets[['year', 'month', 'day', 'hour', 'minute', 'second']])\n", "tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like `created_at`, the new `timestamp` column is also of the `datetime64` data type:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text object\n", "created_at datetime64[ns]\n", "user object\n", "month int64\n", "day int64\n", "year int64\n", "hour int64\n", "minute int64\n", "second int64\n", "timestamp datetime64[ns]\n", "dtype: object" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting String Columns to Timestamps\n", "Sometimes information about the date and time of an event in the data will be coded as a string. In that case, to use the time series functionality on the dates in a dataframe, a string column that contains the dates and times will need to be converted to the `datetime64` data type. The best way to do that is to use the `pd.to_datetime()` function once again. This time, we pass the column that contains the dates in string format to the `pd.to_datetime()` function. This function can identify the year, month, day, hour, minute, and second from many different formats. 
\n", "\n", "For example, we can write the date Sunday, June 19, 2016 at 8:00 PM (the date and time of game 7 of the 2016 NBA finals, the greatest moment in sports history) in many different ways such as:\n", "\n", "* \"Sunday, June 19, 2016, 8:00 PM\"\n", "* \"6-19-2016 8PM\"\n", "* \"6/19/16 8 p.m.\"\n", "* \"19/6/16 8pm\"\n", "\n", "The `pd.to_datetime()` function reads all of these formats:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2016-06-19 20:00:00')" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"Sunday, June 19, 2016, 8:00 PM\")" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2016-06-19 20:00:00')" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"6-19-2016 8PM\")" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2016-06-19 20:00:00')" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"6/19/16 8 p.m.\")" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2016-06-19 20:00:00')" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"19/6/16 8pm\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pd.to_datetime()` function can feel like a magic trick because it automatically detects where the different elements of the date and time are located in the string. However, it can go wrong, and it is important to check that the date was correctly understood by `pd.to_datetime()`. Suppose for example that the date was June 19, 1945, coded as \"6/19/45\". `pd.to_datetime()` reads the date as 2045:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2045-06-19 00:00:00')" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"6/19/45\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest way to fix this problem would be to break the timestamp up into separate columns for year, month, and day, subtract 100 from the years that were placed in the wrong century, and recombine these columns into a new timestamp with `pd.to_datetime()`. But the real danger here is assuming the `pd.to_datetime()` function worked without confirmation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pd.to_datetime()` function also works on an entire column of dates coded as strings. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The `pd.to_datetime()` function also works on an entire column of dates coded as strings. For example, let's convert the `created_at` column to string:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text object\n", "created_at object\n", "user object\n", "month int64\n", "day int64\n", "year int64\n", "hour int64\n", "minute int64\n", "second int64\n", "timestamp datetime64[ns]\n", "dtype: object" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['created_at'] = tweets.created_at.astype('str')\n", "tweets.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To convert the `created_at` column back to a timestamp, we can apply `pd.to_datetime()`:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text object\n", "created_at datetime64[ns]\n", "user object\n", "month int64\n", "day int64\n", "year int64\n", "hour int64\n", "minute int64\n", "second int64\n", "timestamp datetime64[ns]\n", "dtype: object" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['created_at'] = pd.to_datetime(tweets.created_at)\n", "tweets.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering Rows Based on Date Ranges\n", "If every row in the data has a unique timestamp, then one convenient way to use the times to filter the data is to set the **index** (the row labels) to be equal to the timestamps:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "tweets.index = tweets['created_at']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because `pd.to_datetime()` reads dates and times written in many formats, we can pass any of these formats as a label to the index of `tweets`. For example, to see all the tweets made on June 30, we can type:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecondtimestamp
created_at
2020-06-30 02:20:43@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing6302020220432020-06-30 02:20:43
2020-06-30 02:18:17Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis6302020218172020-06-30 02:18:17
2020-06-30 01:55:02RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline1630202015522020-06-30 01:55:02
2020-06-30 01:40:00Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos247630202014002020-06-30 01:40:00
2020-06-30 01:39:42Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes2476302020139422020-06-30 01:39:42
2020-06-30 01:30:45RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச...2020-06-30 01:30:45sumanebot6302020130452020-06-30 01:30:45
2020-06-30 01:08:08We are🎉excited to🗣announce opening for in-pers...2020-06-30 01:08:08GrandMarc_UVA63020201882020-06-30 01:08:08
2020-06-30 00:20:15RT @cavalierinsider: On Saturday, The Basketba...2020-06-30 00:20:15annefutch6302020020152020-06-30 00:20:15
2020-06-30 00:19:00Anthony Harris part of NFL's top safety tandem...2020-06-30 00:19:00hoosdaily630202001902020-06-30 00:19:00
\n", "
" ], "text/plain": [ " text \\\n", "created_at \n", "2020-06-30 02:20:43 @KymoraJohnson_ let me call @IamTinaThompson s... \n", "2020-06-30 02:18:17 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 01:55:02 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:30:45 RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... \n", "2020-06-30 01:08:08 We are🎉excited to🗣announce opening for in-pers... \n", "2020-06-30 00:20:15 RT @cavalierinsider: On Saturday, The Basketba... \n", "2020-06-30 00:19:00 Anthony Harris part of NFL's top safety tandem... \n", "\n", " created_at user month day year \\\n", "created_at \n", "2020-06-30 02:20:43 2020-06-30 02:20:43 SweetLickKing 6 30 2020 \n", "2020-06-30 02:18:17 2020-06-30 02:18:17 Daniel_B_Ennis 6 30 2020 \n", "2020-06-30 01:55:02 2020-06-30 01:55:02 blimeyonline1 6 30 2020 \n", "2020-06-30 01:40:00 2020-06-30 01:40:00 Wahoos247 6 30 2020 \n", "2020-06-30 01:39:42 2020-06-30 01:39:42 JamieOakes247 6 30 2020 \n", "2020-06-30 01:30:45 2020-06-30 01:30:45 sumanebot 6 30 2020 \n", "2020-06-30 01:08:08 2020-06-30 01:08:08 GrandMarc_UVA 6 30 2020 \n", "2020-06-30 00:20:15 2020-06-30 00:20:15 annefutch 6 30 2020 \n", "2020-06-30 00:19:00 2020-06-30 00:19:00 hoosdaily 6 30 2020 \n", "\n", " hour minute second timestamp \n", "created_at \n", "2020-06-30 02:20:43 2 20 43 2020-06-30 02:20:43 \n", "2020-06-30 02:18:17 2 18 17 2020-06-30 02:18:17 \n", "2020-06-30 01:55:02 1 55 2 2020-06-30 01:55:02 \n", "2020-06-30 01:40:00 1 40 0 2020-06-30 01:40:00 \n", "2020-06-30 01:39:42 1 39 42 2020-06-30 01:39:42 \n", "2020-06-30 01:30:45 1 30 45 2020-06-30 01:30:45 \n", "2020-06-30 01:08:08 1 8 8 2020-06-30 01:08:08 \n", "2020-06-30 00:20:15 0 20 15 2020-06-30 00:20:15 \n", "2020-06-30 00:19:00 0 19 0 2020-06-30 00:19:00 " ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['June 30, 2020']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If these row labels are in chronological order, then we can extract slices of the data that fall within a time range. We write two dates separated by a colon, and the output extracts all rows from the row that matches the first date through the row that matches the last date. Because the rows are generally listed with the most recent ones first, we write the ending timestamp first and the ending timestamp second. If these exact dates and times do not exist in the data, the syntax still finds the locations where these rows would exist and extracts the rows in between these two locations. Here is code to extract all tweets that were posted between 2pm and 3pm on June 29:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecondtimestamp
created_at
2020-06-29 14:54:31UVa alumni team opts out of 2020 TBT, plans to...2020-06-29 14:54:31hoosdaily62920201454312020-06-29 14:54:31
2020-06-29 14:54:26Ben Wallace Is a Proud Dad of Three Kids — Mee...2020-06-29 14:54:26hoosdaily62920201454262020-06-29 14:54:26
2020-06-29 14:51:16Inicia Sonora exportación de uva a Corea del S...2020-06-29 14:51:16AgroTratos62920201451162020-06-29 14:51:16
2020-06-29 14:50:46RT @Wahoos247: Wake Forest commitment Christia...2020-06-29 14:50:46FatWhite10162920201450462020-06-29 14:50:46
2020-06-29 14:42:28RT @Cavs_Corner: Film Room: In the next instal...2020-06-29 14:42:28oleuva62920201442282020-06-29 14:42:28
2020-06-29 14:40:56RT @cavalierinsider: On Saturday, The Basketba...2020-06-29 14:40:56John_Shifflett62920201440562020-06-29 14:40:56
2020-06-29 14:37:11Tatil planları yapmaya başladıysan sana harika...2020-06-29 14:37:11YasinALTINEL62920201437112020-06-29 14:37:11
2020-06-29 14:35:52On Saturday, The Basketball Tournament will be...2020-06-29 14:35:52cavalierinsider62920201435522020-06-29 14:35:52
2020-06-29 14:29:17Jesse Rutherford of the Nelson County Board of...2020-06-29 14:29:17JerryMillerNow62920201429172020-06-29 14:29:17
2020-06-29 14:20:58RT @thesabre: NewsLink is updated! #UVA sports...2020-06-29 14:20:58Slider_Hoos62920201420582020-06-29 14:20:58
2020-06-29 14:20:44RT @Wahoos247: Wake Forest commitment Christia...2020-06-29 14:20:44agee_brandon62920201420442020-06-29 14:20:44
2020-06-29 14:18:33Ramseyer fondly recalls Cavs' perfect regular ...2020-06-29 14:18:33hoosdaily62920201418332020-06-29 14:18:33
2020-06-29 14:15:32Ben Wallace Is a Proud Dad of Three Kids — Mee...2020-06-29 14:15:32hoosdaily62920201415322020-06-29 14:15:32
2020-06-29 14:11:08NewsLink is updated! #UVA sports links all in ...2020-06-29 14:11:08thesabre6292020141182020-06-29 14:11:08
\n", "
" ], "text/plain": [ " text \\\n", "created_at \n", "2020-06-29 14:54:31 UVa alumni team opts out of 2020 TBT, plans to... \n", "2020-06-29 14:54:26 Ben Wallace Is a Proud Dad of Three Kids — Mee... \n", "2020-06-29 14:51:16 Inicia Sonora exportación de uva a Corea del S... \n", "2020-06-29 14:50:46 RT @Wahoos247: Wake Forest commitment Christia... \n", "2020-06-29 14:42:28 RT @Cavs_Corner: Film Room: In the next instal... \n", "2020-06-29 14:40:56 RT @cavalierinsider: On Saturday, The Basketba... \n", "2020-06-29 14:37:11 Tatil planları yapmaya başladıysan sana harika... \n", "2020-06-29 14:35:52 On Saturday, The Basketball Tournament will be... \n", "2020-06-29 14:29:17 Jesse Rutherford of the Nelson County Board of... \n", "2020-06-29 14:20:58 RT @thesabre: NewsLink is updated! #UVA sports... \n", "2020-06-29 14:20:44 RT @Wahoos247: Wake Forest commitment Christia... \n", "2020-06-29 14:18:33 Ramseyer fondly recalls Cavs' perfect regular ... \n", "2020-06-29 14:15:32 Ben Wallace Is a Proud Dad of Three Kids — Mee... \n", "2020-06-29 14:11:08 NewsLink is updated! #UVA sports links all in ... \n", "\n", " created_at user month day year \\\n", "created_at \n", "2020-06-29 14:54:31 2020-06-29 14:54:31 hoosdaily 6 29 2020 \n", "2020-06-29 14:54:26 2020-06-29 14:54:26 hoosdaily 6 29 2020 \n", "2020-06-29 14:51:16 2020-06-29 14:51:16 AgroTratos 6 29 2020 \n", "2020-06-29 14:50:46 2020-06-29 14:50:46 FatWhite101 6 29 2020 \n", "2020-06-29 14:42:28 2020-06-29 14:42:28 oleuva 6 29 2020 \n", "2020-06-29 14:40:56 2020-06-29 14:40:56 John_Shifflett 6 29 2020 \n", "2020-06-29 14:37:11 2020-06-29 14:37:11 YasinALTINEL 6 29 2020 \n", "2020-06-29 14:35:52 2020-06-29 14:35:52 cavalierinsider 6 29 2020 \n", "2020-06-29 14:29:17 2020-06-29 14:29:17 JerryMillerNow 6 29 2020 \n", "2020-06-29 14:20:58 2020-06-29 14:20:58 Slider_Hoos 6 29 2020 \n", "2020-06-29 14:20:44 2020-06-29 14:20:44 agee_brandon 6 29 2020 \n", "2020-06-29 14:18:33 2020-06-29 14:18:33 hoosdaily 6 29 2020 \n", "2020-06-29 14:15:32 2020-06-29 14:15:32 hoosdaily 6 29 2020 \n", "2020-06-29 14:11:08 2020-06-29 14:11:08 thesabre 6 29 2020 \n", "\n", " hour minute second timestamp \n", "created_at \n", "2020-06-29 14:54:31 14 54 31 2020-06-29 14:54:31 \n", "2020-06-29 14:54:26 14 54 26 2020-06-29 14:54:26 \n", "2020-06-29 14:51:16 14 51 16 2020-06-29 14:51:16 \n", "2020-06-29 14:50:46 14 50 46 2020-06-29 14:50:46 \n", "2020-06-29 14:42:28 14 42 28 2020-06-29 14:42:28 \n", "2020-06-29 14:40:56 14 40 56 2020-06-29 14:40:56 \n", "2020-06-29 14:37:11 14 37 11 2020-06-29 14:37:11 \n", "2020-06-29 14:35:52 14 35 52 2020-06-29 14:35:52 \n", "2020-06-29 14:29:17 14 29 17 2020-06-29 14:29:17 \n", "2020-06-29 14:20:58 14 20 58 2020-06-29 14:20:58 \n", "2020-06-29 14:20:44 14 20 44 2020-06-29 14:20:44 \n", "2020-06-29 14:18:33 14 18 33 2020-06-29 14:18:33 \n", "2020-06-29 14:15:32 14 15 32 2020-06-29 14:15:32 \n", "2020-06-29 14:11:08 14 11 8 2020-06-29 14:11:08 " ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['6/29/2020 15:00':'6/29/2020 14:00']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract all tweets before point in time, write the timestamp before the colon and write nothing after the colon. To extract all tweets after a point in time, write nothing before the colon and write the timestamp after the colon. To extract all tweets posted after June 29 at 3pm:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecondtimestamp
created_at
2020-06-30 02:20:43@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing6302020220432020-06-30 02:20:43
2020-06-30 02:18:17Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis6302020218172020-06-30 02:18:17
2020-06-30 01:55:02RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline1630202015522020-06-30 01:55:02
2020-06-30 01:40:00Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos247630202014002020-06-30 01:40:00
2020-06-30 01:39:42Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes2476302020139422020-06-30 01:39:42
.................................
2020-06-29 15:38:19#UvA Last week a number of practicals started ...2020-06-29 15:38:19JJ_Angelus62920201538192020-06-29 15:38:19
2020-06-29 15:27:34Playing through: UVa golfers hone their games ...2020-06-29 15:27:34hoosdaily62920201527342020-06-29 15:27:34
2020-06-29 15:27:29Playing through: UVa golfers hone their games ...2020-06-29 15:27:29hoosdaily62920201527292020-06-29 15:27:29
2020-06-29 15:27:21Ramseyer fondly recalls Cavs' perfect regular ...2020-06-29 15:27:21hoosdaily62920201527212020-06-29 15:27:21
2020-06-29 15:06:57While many college athletes have been unable t...2020-06-29 15:06:57cavalierinsider6292020156572020-06-29 15:06:57
\n", "

71 rows × 10 columns

\n", "
" ], "text/plain": [ " text \\\n", "created_at \n", "2020-06-30 02:20:43 @KymoraJohnson_ let me call @IamTinaThompson s... \n", "2020-06-30 02:18:17 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 01:55:02 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "... ... \n", "2020-06-29 15:38:19 #UvA Last week a number of practicals started ... \n", "2020-06-29 15:27:34 Playing through: UVa golfers hone their games ... \n", "2020-06-29 15:27:29 Playing through: UVa golfers hone their games ... \n", "2020-06-29 15:27:21 Ramseyer fondly recalls Cavs' perfect regular ... \n", "2020-06-29 15:06:57 While many college athletes have been unable t... \n", "\n", " created_at user month day year \\\n", "created_at \n", "2020-06-30 02:20:43 2020-06-30 02:20:43 SweetLickKing 6 30 2020 \n", "2020-06-30 02:18:17 2020-06-30 02:18:17 Daniel_B_Ennis 6 30 2020 \n", "2020-06-30 01:55:02 2020-06-30 01:55:02 blimeyonline1 6 30 2020 \n", "2020-06-30 01:40:00 2020-06-30 01:40:00 Wahoos247 6 30 2020 \n", "2020-06-30 01:39:42 2020-06-30 01:39:42 JamieOakes247 6 30 2020 \n", "... ... ... ... ... ... \n", "2020-06-29 15:38:19 2020-06-29 15:38:19 JJ_Angelus 6 29 2020 \n", "2020-06-29 15:27:34 2020-06-29 15:27:34 hoosdaily 6 29 2020 \n", "2020-06-29 15:27:29 2020-06-29 15:27:29 hoosdaily 6 29 2020 \n", "2020-06-29 15:27:21 2020-06-29 15:27:21 hoosdaily 6 29 2020 \n", "2020-06-29 15:06:57 2020-06-29 15:06:57 cavalierinsider 6 29 2020 \n", "\n", " hour minute second timestamp \n", "created_at \n", "2020-06-30 02:20:43 2 20 43 2020-06-30 02:20:43 \n", "2020-06-30 02:18:17 2 18 17 2020-06-30 02:18:17 \n", "2020-06-30 01:55:02 1 55 2 2020-06-30 01:55:02 \n", "2020-06-30 01:40:00 1 40 0 2020-06-30 01:40:00 \n", "2020-06-30 01:39:42 1 39 42 2020-06-30 01:39:42 \n", "... ... ... ... ... \n", "2020-06-29 15:38:19 15 38 19 2020-06-29 15:38:19 \n", "2020-06-29 15:27:34 15 27 34 2020-06-29 15:27:34 \n", "2020-06-29 15:27:29 15 27 29 2020-06-29 15:27:29 \n", "2020-06-29 15:27:21 15 27 21 2020-06-29 15:27:21 \n", "2020-06-29 15:06:57 15 6 57 2020-06-29 15:06:57 \n", "\n", "[71 rows x 10 columns]" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets[:'6/29/2020 15:00']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Leads and Lags\n", "In time series data in which the rows represent particular points in time in descending chronological order, a **lag** is a column that contains the values of another column shifted up one cell. That is, if the rows represent days, and a column represents today's high temperature, the lag represents yesterday's high temperature. A **lead** is a existing column with its values shifted one row down: tomorrow's high temperature. \n", "\n", "To create a lead or a lag, apply the `.shift()` method to a dataframe. The argument of `.shift()` is an integer: positive integers shift values down, creating leads, and negative values shift values up, creating lags. Whenever we shift values up or down, we create new missing values at the top or bottom of the dataframe for the lead or lagged column. \n", "\n", "For example, in the `tweets` dataframe we can create two new columns, `previous_tweet` and `next_tweet` to compare with `text` by using `.shift(-1)` and `.shift(1)`:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textprevious_tweetnext_tweet
created_at
2020-06-30 02:20:43@KymoraJohnson_ let me call @IamTinaThompson s...Congrats Eric. It’s was really great to have w...NaN
2020-06-30 02:18:17Congrats Eric. It’s was really great to have w...RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...@KymoraJohnson_ let me call @IamTinaThompson s...
2020-06-30 01:55:02RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...Former #UVA stars Quin Blanding, Micah Kiser n...Congrats Eric. It’s was really great to have w...
2020-06-30 01:40:00Former #UVA stars Quin Blanding, Micah Kiser n...Former #UVA stars Quin Blanding, Micah Kiser n...RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...
2020-06-30 01:39:42Former #UVA stars Quin Blanding, Micah Kiser n...RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச...Former #UVA stars Quin Blanding, Micah Kiser n...
............
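{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is a small, made-up example of `.shift()` on a newest-first index (the dataframe `temps`, its dates, and its temperatures are invented purely for illustration, and the cell is not executed here):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Invented daily highs, newest first, mirroring the ordering of the tweets data\n", "temps = pd.DataFrame({'high': [90, 85, 88, 91]},\n", "                     index=pd.to_datetime(['2020-06-26', '2020-06-25',\n", "                                           '2020-06-24', '2020-06-23']))\n", "\n", "temps['yesterday_high'] = temps['high'].shift(-1)  # lag: pulled up from the row below\n", "temps['tomorrow_high'] = temps['high'].shift(1)    # lead: pushed down from the row above\n", "\n", "# The newest row has no 'tomorrow' and the oldest row has no 'yesterday',\n", "# so those cells are NaN\n", "temps" ] },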
{ "cell_type": "markdown", "metadata": {}, "source": [ "Returning to the `tweets` dataframe, we can create two new columns, `previous_tweet` and `next_tweet`, to compare with `text` by using `.shift(-1)` and `.shift(1)`:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text \\\n", "created_at \n", "2020-06-30 02:20:43 @KymoraJohnson_ let me call @IamTinaThompson s... \n", "2020-06-30 02:18:17 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 01:55:02 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "... ... \n", "2020-06-23 15:12:08 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... \n", "2020-06-23 15:08:40 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... \n", "2020-06-23 14:52:23 Zur Einstimmung #UVA vom letzten #gig bei uns ... \n", "2020-06-23 14:49:22 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... \n", "2020-06-23 14:32:44 Good question... My instinct says Ty Jerome, b... \n", "\n", " previous_tweet \\\n", "created_at \n", "2020-06-30 02:20:43 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 02:18:17 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:55:02 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:39:42 RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... \n", "... ... \n", "2020-06-23 15:12:08 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... \n", "2020-06-23 15:08:40 Zur Einstimmung #UVA vom letzten #gig bei uns ... \n", "2020-06-23 14:52:23 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... \n", "2020-06-23 14:49:22 Good question... My instinct says Ty Jerome, b... \n", "2020-06-23 14:32:44 NaN \n", "\n", " next_tweet \n", "created_at \n", "2020-06-30 02:20:43 NaN \n", "2020-06-30 02:18:17 @KymoraJohnson_ let me call @IamTinaThompson s... \n", "2020-06-30 01:55:02 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 01:40:00 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "... ... \n", "2020-06-23 15:12:08 WATCH: Top plays of UVA's 2019-20 basketball s... \n", "2020-06-23 15:08:40 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... \n", "2020-06-23 14:52:23 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... \n", "2020-06-23 14:49:22 Zur Einstimmung #UVA vom letzten #gig bei uns ... \n", "2020-06-23 14:32:44 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... \n", "\n", "[1000 rows x 3 columns]" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['previous_tweet'] = tweets.shift(-1).text\n", "tweets['next_tweet'] = tweets.shift(1).text\n", "tweets[['text', 'previous_tweet', 'next_tweet']]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }