{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reshaping and Merging Data and Working with Strings, Dates, and Times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{contents} Table of Contents\n", ":depth: 4\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "
\n", "\n", "## Introduction: `pandas` or SQL?\n", "\n", "As we saw in module 7, read operations in SQL not only extract data from a database, but they can select and rename columns, filter and sort rows, join tables, and aggregate data. With `pandas`, we can also select and rename columns, filter and sort rows, join tables, and aggregate data. So the question is, should we be using SQL or `pandas` to perform all these data manipulation steps? \n", "\n", "SQL is sometimes referred to as **server-side** data manipulation, because databases are often stored on remote servers and SQL queries are processed on the server instead on on a local machine. Data manipulation that is conducted using `pandas` on a local Python installation is called **client-side** data manipulation.\n", "\n", "The question of SQL vs. `pandas` (or SQL vs. the tidyverse in R) is the subject of a lot of debate among professionals who use data. The question [comes up](https://www.quora.com/In-what-situations-should-you-use-SQL-instead-of-Pandas-as-a-data-scientist) [frequently](https://towardsdatascience.com/sql-and-pandas-268f634a4f5d) on [various](https://www.reddit.com/r/Python/comments/1tqjt4/why_do_you_use_pandas_instead_of_sql/) [coding forums](https://datascience.stackexchange.com/questions/34357/why-do-people-prefer-pandas-to-sql). Some [blog posts](https://blog.thedataincubator.com/2018/05/sqlite-vs-pandas-performance-benchmarks/) have tried to compare SQL and `pandas` in terms of the [speed](https://wesmckinney.com/blog/high-performance-database-joins-with-pandas-dataframe-more-benchmarks/) with which they complete equivalent operations, with differing results. [Tina Wenzel and Kavya Gupta](https://medium.com/carwow-product-engineering/sql-vs-pandas-how-to-balance-tasks-between-server-and-client-side-9e2f6c95677) note that many factors influence the relative speed of SQL and `pandas` operations, including the configuration of a PostgreSQL database and the bandwidth available in a network connection. They take the mixed evidence into account and conclude that\n", "\n", "> SQL is the best tool to use for basic data retrieval and the better you know it, the more it will speed up your workflow. More advanced data manipulation tasks can be performed both on the server and client side and it is up to the analyst to balance the workload optimally.\n", "\n", "In short, the existing evidence does not overwhelmingly support one option or the other as best practice for data manipulation. The reality is that both SQL and `pandas` are extremely useful and widely-used, and both tools will become part of your workflow. It will take some time and experience with end-to-end projects in data science to learn how to balance SQL and `pandas` in a way that is most comfortable for you in your workflow. But at this stage, there are some important points to keep in mind when thinking about these tools. \n", "\n", "First, there are many situations in which the question of SQL vs. `pandas` might be moot. For a given project, our data might not come from a database, but instead from a CSV file, from a JSON document acquired via an API, or from raw HTML that we scraped from a webpage. So in order to use SQL, we would have to take the additional step of creating a database. If we hold ourselves to the principles that E. F. Codd and others laid out about the organization of relational databases, it can take a significant amount of work to create this database. 
If there is no database involved for a specific data project, there is no role for SQL, but because `pandas` works on dataframes it can be a primary tool for data manipulation regardless of the original source for the data.\n", "\n", "Second, while there are differences in speed, these differences only become a significant factor for projects that involve big data, and in those cases, we will almost certainly be working with data stored on remote servers. If the data are organized using a database, then SQL may well be faster than `pandas`, but it very much depends on how the database is [configured](https://severalnines.com/database-blog/guide-postgresql-server-configuration-parameters) and on the myriad factors that determine the speed with which code runs through a remote connection. `pandas` can also be used in scripts or Jupyter notebooks that are run on remote servers. Sometimes it makes sense to pull selections of remotely stored data into a local environment so that we can manipulate the data without having to seek permission from a database manager, and in that case, a mix of SQL to choose a selection and `pandas` to manipulate the local data can be very effective.\n", "\n", "Third, `pandas` simply has more functionality and flexibility than SQL. For example, it is fairly straightforward to reshape (e.g. pivot) data using `pandas`, and it is [much more difficult to reshape data](https://stackoverflow.com/questions/2444708/sqlite-long-to-wide-formats) in SQL. `pandas` has better functionality for working with strings and time and date features than SQL, and `pandas`, being part of a Python environment, works well with any other Python package that works with dataframes or arrays. In contrast, while various DBMSs add specific additions to the base SQL language, SQL extensions tend to be fairly limited because of the inherent tension between expanding the functionality of a query language and staying close enough to base SQL to still be considered SQL. The easiest way to bring SQL query results into a Python environment uses an `sqlalchemy` engine and the `pd.read_sql_query()` function from `pandas`. \n", "\n", "Fourth, both `pandas` and SQL enable us to work as part of larger teams that share tools, but we might choose or be required to use one tool over the other to work better as part of the team. `pandas` is essential for projects in which the whole team works within Python. SQL is readable and usable for people who do not use Python but still work with databases.\n", "\n", "Fifth, and most importantly, the choice of SQL vs. `pandas` should be made based on how comfortable we feel with each tool. Both SQL and `pandas` can perform data manipulation correctly, and it will probably be the case that we can remember the code with one tool better than with the other. Because this kind of work comprises about 80% of our time as data scientists, we should use whichever tool lets us do it more quickly.\n", "\n", "For now, we need to practice both `pandas` and SQL and get comfortable with these tools, and we need to be flexible in the future as different situations will call for SQL or `pandas`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joining Dataframes\n", "Joining dataframes is also called merging dataframes, and I will use the words \"join\" and \"merge\" interchangeably below. Every kind of merge that is possible in SQL is possible in `pandas`: inner, outer, left, right, natural, cross, and anti-joins. 
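As a rough sketch of that correspondence (using two small hypothetical dataframes, `left` and `right`, that share a key column `id`), the SQL join types map onto `pd.merge()` arguments like this:\n", "\n", "```\n", "import pandas as pd\n", "\n", "left = pd.DataFrame({'id': ['A', 'B'], 'x': [1, 2]})\n", "right = pd.DataFrame({'id': ['B', 'C'], 'y': [3, 4]})\n", "\n", "pd.merge(left, right, on='id', how='inner')  # inner join (the default)\n", "pd.merge(left, right, on='id', how='left')   # left join\n", "pd.merge(left, right, on='id', how='right')  # right join\n", "pd.merge(left, right, on='id', how='outer')  # outer (full) join\n", "pd.merge(left, right, how='cross')           # cross join\n", "pd.merge(left, right)                        # natural join: merge on all shared column names\n", "```\n", "\n", "An anti-join has no `how` option of its own, but, as we will see below, it can be built from an outer join plus the `indicator` argument. 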
One important difference between joining in SQL and merging in `pandas` is that dataframes might not be as carefully curated as data tables in a relational database, and we should not assume that the join worked properly even if there is no error. Primary keys might not uniquely identify rows in general, and joins might turn into cross joins. Some values of keys that should match might not match due to differences in coding and spelling. And if we attempt a natural join, the wrong columns might be automatically selected as the keys.\n", "\n", "Too many discussions of merging data work only with clean examples. Here I will show you how merging works when the data contain some common problems that may invalidate the merge. In the following discussion, we will see how to perform merges in `pandas` and how to check that each merge worked the way we intended.\n", "\n", "First, for the following examples, we load these packages:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Merging Data on U.S. State Election Results, Income, Economies, and Area\n", "The \"state_elections.csv\" file contains information about the result of presidential elections by state for every presidential election between 1964 and 2004. In this dataset, the primary keys are `State` and `year`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulation
0Alabama1964210732479085030.5489769.451031919000
1Alabama196819657914692369142518.9944881.005521993000
2Alabama1972256923728701026.0670473.932963540080
3Alabama1976659170504070056.6667343.333273737204
4Alabama1980636730654192049.3236650.676343894025
\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", "\n", " population \n", "0 1919000 \n", "1 1993000 \n", "2 3540080 \n", "3 3737204 \n", "4 3894025 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elect = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/state_elections.csv\")\n", "elect.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"state_income.txt\" contains varous summary statistics (percentiles and means, broken down separately for men, for women, and for the overall population) about personal incomes within each state. In this dataset, the primary keys are `stcode` (the post office's two letter abbreviation for states) and `year`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stcodeyearincomep10incomep20incomep30incomep40incomep50incomep60incomep70income...wom_p90incomewom_meanincomewom_minincomewom_maxincomewom_sdincomewom_ineqincomewom_gini_incomewom_n_incomewom_nwgt_incomemedincome
0AL196489.834883269.855731.077988.8710318.313189.116007.318744.4...24233.011400.058.1395372674.421796.8929.531520.473738175.0418494.176.68081
1AL196892.182384469.957455.2810050.3012792.115580.218392.421310.7...26287.512005.3NaNNaN1968.7827.409790.485090NaNNaN80.72643
2AL197294.875436734.1110314.8013336.3016421.519088.922141.625410.7...30278.014667.9NaNNaN8604.2517.835770.457932NaNNaN82.27975
3AL197667.536717759.8111176.6013823.2016431.818982.821923.525180.8...31119.215362.5NaNNaN12842.2015.777650.454074NaNNaN60.07215
4AL198046.298697602.1510637.0013532.6016279.619111.821802.525190.6...29623.114804.228.38428120196.505944.7114.526400.436961543.0711278.241.72882
\n", "

5 rows × 55 columns

\n", "
" ], "text/plain": [ " stcode year income p10income p20income p30income p40income \\\n", "0 AL 1964 89.83488 3269.85 5731.07 7988.87 10318.3 \n", "1 AL 1968 92.18238 4469.95 7455.28 10050.30 12792.1 \n", "2 AL 1972 94.87543 6734.11 10314.80 13336.30 16421.5 \n", "3 AL 1976 67.53671 7759.81 11176.60 13823.20 16431.8 \n", "4 AL 1980 46.29869 7602.15 10637.00 13532.60 16279.6 \n", "\n", " p50income p60income p70income ... wom_p90income wom_meanincome \\\n", "0 13189.1 16007.3 18744.4 ... 24233.0 11400.0 \n", "1 15580.2 18392.4 21310.7 ... 26287.5 12005.3 \n", "2 19088.9 22141.6 25410.7 ... 30278.0 14667.9 \n", "3 18982.8 21923.5 25180.8 ... 31119.2 15362.5 \n", "4 19111.8 21802.5 25190.6 ... 29623.1 14804.2 \n", "\n", " wom_minincome wom_maxincome wom_sdincome wom_ineqincome \\\n", "0 58.13953 72674.42 1796.89 29.53152 \n", "1 NaN NaN 1968.78 27.40979 \n", "2 NaN NaN 8604.25 17.83577 \n", "3 NaN NaN 12842.20 15.77765 \n", "4 28.38428 120196.50 5944.71 14.52640 \n", "\n", " wom_gini_income wom_n_income wom_nwgt_income medincome \n", "0 0.473738 175.0 418494.1 76.68081 \n", "1 0.485090 NaN NaN 80.72643 \n", "2 0.457932 NaN NaN 82.27975 \n", "3 0.454074 NaN NaN 60.07215 \n", "4 0.436961 543.0 711278.2 41.72882 \n", "\n", "[5 rows x 55 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "income = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/state_income.txt\", sep=\"\\t\", header=2) # tab separated (note the header in this file)\n", "income.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"state_economics.txt\" contains information about the macro-level economy in each state and year, including overall and per capita gross domestic product, and GDP broken down by industry. In this dataset, the primary keys are `fips` (a standard numeric code that identifies countries and provinces within countries) and `year`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fipsyearGDPGDPpcPrivateAgricultureFarmsMiningUtilitiesConstruction...Construction_shareManufacturing_shareFinance_Insurance_shareLegal_shareEducation_shareHealth_shareGovernment_shareGovernment_federal_shareGovernment_military_shareGovernment_statelocal_share
01196482014273.5806900312283115235379...4.62138728.2282711.413240.4023900.2560661.90220715.863926.2919152.4021467.182051
111968109495493.7289095294253133294499...4.55749428.7423510.676770.4292630.3653302.22851416.933055.9457482.6303778.356928
211972153364332.10512696481418223435702...4.57746526.0563411.482790.4694840.4434012.59520117.214405.6924882.6799698.841941
311976242066477.035199887816835847141281...5.29207624.8533411.505410.4792200.3594152.98686317.425435.7878212.2556399.386103
411980360069246.47429567674542135112281588...4.41037624.6042311.634170.6498920.3527193.66050117.883135.7823702.1718609.928901
\n", "

5 rows × 36 columns

\n", "
" ], "text/plain": [ " fips year GDP GDPpc Private Agriculture Farms Mining \\\n", "0 1 1964 8201 4273.580 6900 312 283 115 \n", "1 1 1968 10949 5493.728 9095 294 253 133 \n", "2 1 1972 15336 4332.105 12696 481 418 223 \n", "3 1 1976 24206 6477.035 19988 781 683 584 \n", "4 1 1980 36006 9246.474 29567 674 542 1351 \n", "\n", " Utilities Construction ... Construction_share Manufacturing_share \\\n", "0 235 379 ... 4.621387 28.22827 \n", "1 294 499 ... 4.557494 28.74235 \n", "2 435 702 ... 4.577465 26.05634 \n", "3 714 1281 ... 5.292076 24.85334 \n", "4 1228 1588 ... 4.410376 24.60423 \n", "\n", " Finance_Insurance_share Legal_share Education_share Health_share \\\n", "0 11.41324 0.402390 0.256066 1.902207 \n", "1 10.67677 0.429263 0.365330 2.228514 \n", "2 11.48279 0.469484 0.443401 2.595201 \n", "3 11.50541 0.479220 0.359415 2.986863 \n", "4 11.63417 0.649892 0.352719 3.660501 \n", "\n", " Government_share Government_federal_share Government_military_share \\\n", "0 15.86392 6.291915 2.402146 \n", "1 16.93305 5.945748 2.630377 \n", "2 17.21440 5.692488 2.679969 \n", "3 17.42543 5.787821 2.255639 \n", "4 17.88313 5.782370 2.171860 \n", "\n", " Government_statelocal_share \n", "0 7.182051 \n", "1 8.356928 \n", "2 8.841941 \n", "3 9.386103 \n", "4 9.928901 \n", "\n", "[5 rows x 36 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "econ = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/state_economics.txt\", sep=\";\") # semi-colon separated\n", "econ.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"state_area.csv\" contains various measures of the total, land, and water areas of each state. In this dataset, the primary key is `state`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statearea_sqmiarea_sqkmlandarea_sqmilandarea_sqkmwater_sqmiwater_sqkmpercent_water
0Alaska663,267.261,717,854571,951.261,481,34791,316.00236,50713.77
1Texas268,580.82695,621261,797.12678,0516,783.7017,5702.53
2California163,695.57423,970155,939.52403,8827,736.2320,0374.73
3Montana147,042.40380,838145,552.43376,9791,489.963,8591.01
4New Mexico121,589.48314,915121,355.53314,309233.966060.19
\n", "
" ], "text/plain": [ " state area_sqmi area_sqkm landarea_sqmi landarea_sqkm water_sqmi \\\n", "0 Alaska 663,267.26 1,717,854 571,951.26 1,481,347 91,316.00 \n", "1 Texas 268,580.82 695,621 261,797.12 678,051 6,783.70 \n", "2 California 163,695.57 423,970 155,939.52 403,882 7,736.23 \n", "3 Montana 147,042.40 380,838 145,552.43 376,979 1,489.96 \n", "4 New Mexico 121,589.48 314,915 121,355.53 314,309 233.96 \n", "\n", " water_sqkm percent_water \n", "0 236,507 13.77 \n", "1 17,570 2.53 \n", "2 20,037 4.73 \n", "3 3,859 1.01 \n", "4 606 0.19 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "area = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/state_area.csv\") \n", "area.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the datasets above contain three ways to identify states - by name, with two-letter abbreviations, and with FIPS codes - we also need a dataset that matches each state by name to its abbreviation and FIPS code. Datasets that exist solely for the purpose of matching values of keys across datasets are called **crosswalks**:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fipsStatestcode
01AlabamaAL
12AlaskaAK
24ArizonaAZ
35ArkansasAR
46CaliforniaCA
\n", "
" ], "text/plain": [ " fips State stcode\n", "0 1 Alabama AL\n", "1 2 Alaska AK\n", "2 4 Arizona AZ\n", "3 5 Arkansas AR\n", "4 6 California CA" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crosswalk = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/crosswalk.csv\")\n", "crosswalk.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal is to merge all five of these datasets together." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the `pd.merge()` Function\n", "When using a crosswalk to merge dataframes together, the first step is to merge a dataframe with the crosswalk. Here we start by joining the `elect` and `crosswalk` dataframes together. To join dataframes, we use the `pd.merge()` function. The following code works:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode
0Alabama1964210732479085030.5489769.4510319190001AL
1Alabama196819657914692369142518.9944881.0055219930001AL
2Alabama1972256923728701026.0670473.9329635400801AL
3Alabama1976659170504070056.6667343.3332737372041AL
4Alabama1980636730654192049.3236650.6763438940251AL
.................................
556Wyoming198867113106867038.5751261.4248846508056WY
557Wyoming19926816079347046.2079853.7920246625156WY
558Wyoming199677934105388042.5120857.4879248816756WY
559Wyoming200060481147947029.0176970.9823149378256WY
560Wyoming200470776167629029.6873070.3127150652956WY
\n", "

561 rows × 10 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode \n", "0 1919000 1 AL \n", "1 1993000 1 AL \n", "2 3540080 1 AL \n", "3 3737204 1 AL \n", "4 3894025 1 AL \n", ".. ... ... ... \n", "556 465080 56 WY \n", "557 466251 56 WY \n", "558 488167 56 WY \n", "559 493782 56 WY \n", "560 506529 56 WY \n", "\n", "[561 rows x 10 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_data = pd.merge(elect, crosswalk, on='State')\n", "merged_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `merged_data` dataframe now contains the features from `elect` and from `crosswalk`, which means we've added `stcode` and `fips` to the `elect` data, and that enables us to merge `merged_data` with `income` and `econ`.\n", "\n", "Before moving on to the rest of the example, we should discuss some properties of the `pd.merge()` function. Recall from module 7 that there are many different kinds of joins that differ only in how they treat unmatched observations. `pd.merge()` conducts an inner join by default, leaving only the matched observations in the merged data. To change the type of merge, use the `how` argument:\n", "\n", "`how='outer'` performs an outer or full join. It brings all rows from both dataframes into the merged data, and replaces the features for the unmatched rows with `NaN` for the columns without data.\n", "```\n", "merged_data = pd.merge(elect, crosswalk, on='State', how='outer')\n", "```\n", "\n", "`how='left'` performs a left join. It brings all rows from the first dataframe listed into the merged data, and only those in the second dataframe that are matched to the first.\n", "```\n", "merged_data = pd.merge(elect, crosswalk, on='State', how='left')\n", "```\n", "\n", "`how='right'` performs a right join. It brings all rows from the second dataframe listed into the merged data, and only those in the first dataframe that are matched to the second.\n", "```\n", "merged_data = pd.merge(elect, crosswalk, on='State', how='right')\n", "```\n", "\n", "There are two ways to specify the key or keys on which to match rows. If the key or keys have the same name in both dataframes, use the `on` argument. If there is one key, pass this name as a string to `on` like we did above. If there is more than one key, pass these names as a list to `on`. For example, if we merge on both `State` and `year`, we will type `on = ['State', 'year']`.\n", "\n", "If the keys have different names in each dataframe, we use `left_on` to specify the names of these keys in the first dataframe listed in the `pd.merge()` function, and `right_on` to specify the names of these keys in the second dataframe listed. Suppose, for example, that the keys are named \"State\" and \"year\" in the first dataframe, and \"st\" and \"yr\" in the second dataframe. 
We would type `left_on = ['State', 'year'], right_on = ['st', 'yr']`.\n", "\n", "Depending on the type of join we use, a row can be matched, or can come from just the first or just the second dataframe listed in `pd.merge()`. To see which rows come from which sources, specify `indicator='newname'`, where `'newname'` is a name you choose for a new column that contains this information. For example, the following code repeats the merge we performed above, but uses an outer join instead of an inner join and adds a column `matched` that tells us how each row is matched:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcodematched
0Alabama1964210732479085030.5489769.4510319190001ALboth
1Alabama196819657914692369142518.9944881.0055219930001ALboth
2Alabama1972256923728701026.0670473.9329635400801ALboth
3Alabama1976659170504070056.6667343.3332737372041ALboth
4Alabama1980636730654192049.3236650.6763438940251ALboth
....................................
556Wyoming198867113106867038.5751261.4248846508056WYboth
557Wyoming19926816079347046.2079853.7920246625156WYboth
558Wyoming199677934105388042.5120857.4879248816756WYboth
559Wyoming200060481147947029.0176970.9823149378256WYboth
560Wyoming200470776167629029.6873070.3127150652956WYboth
\n", "

561 rows × 11 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode matched \n", "0 1919000 1 AL both \n", "1 1993000 1 AL both \n", "2 3540080 1 AL both \n", "3 3737204 1 AL both \n", "4 3894025 1 AL both \n", ".. ... ... ... ... \n", "556 465080 56 WY both \n", "557 466251 56 WY both \n", "558 488167 56 WY both \n", "559 493782 56 WY both \n", "560 506529 56 WY both \n", "\n", "[561 rows x 11 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_data = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched')\n", "merged_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking for Problems That Can Occur Without Errors\n", "Whenever we apply `pd.merge()`, there's a great risk that the merge was incorrect even if the function runs without error. There are two ways that a merge can fail.\n", "\n", "**Problem 1:** what we *thought* was a unique ID was not a unique ID. \n", "\n", "If we merge two dataframes on a column (or set of columns) that fail to uniquely identify the rows in either dataframe, then `pd.merge()` returns all pairwise combinations of rows with matching IDs. To see this, consider the following two example dataframes:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddata1
0A150
1A200
2B50
3B25
\n", "
" ], "text/plain": [ " id data1\n", "0 A 150\n", "1 A 200\n", "2 B 50\n", "3 B 25" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict1 = {'id': ['A', 'A', 'B', 'B'],\n", " 'data1': [150, 200, 50, 25]}\n", "df1 = pd.DataFrame.from_dict(dict1)\n", "df1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddata2
0A-20
1A-75
2B-125
3B-250
\n", "
" ], "text/plain": [ " id data2\n", "0 A -20\n", "1 A -75\n", "2 B -125\n", "3 B -250" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict2 = {'id': ['A', 'A', 'B', 'B'],\n", " 'data2': [-20, -75, -125, -250]}\n", "df2 = pd.DataFrame.from_dict(dict2)\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the two dataframes share a column \"id\", but that \"id\" is not a unique identifier in either dataframe. If I try to merge the two based on \"id\", the result is:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddata1data2
0A150-20
1A150-75
2A200-20
3A200-75
4B50-125
5B50-250
6B25-125
7B25-250
\n", "
" ], "text/plain": [ " id data1 data2\n", "0 A 150 -20\n", "1 A 150 -75\n", "2 A 200 -20\n", "3 A 200 -75\n", "4 B 50 -125\n", "5 B 50 -250\n", "6 B 25 -125\n", "7 B 25 -250" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_merge = pd.merge(df1, df2, on='id')\n", "df_merge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we did not have a unique ID in either dataframe, we artificially increased the number of rows from 4 to 8. At first glance, the merged data looks nice, but with double the number of rows we are supposed to have the data is in fact corrupted, and we cannot use it for any subsequent analysis. Also, note that the merge ran without any errors. If we aren't careful, we might move forward with data that cannot be used.\n", "\n", "The above example is an instance of a **many-to-many** merge, in which rows in the left dataframe can match to more than one row in the right dataframe, and rows in the right dataframe can match to more than one row in the left dataframe. In these cases `pandas` performs a cross join within the subsets of the two dataframes that share the same key, inflating the number of rows. But a many-to-many join is almost never what we want to do with data, and if it happens, it is usually a mistake.\n", "\n", "The other types of merges are \n", "* **one-to-one**, in which every row in the left dataframe matches to at most one row in the right dataframe, and every row in the right dataframe matches to at most one row in the left dataframe,\n", "* **many-to-one**, in which rows in the left dataframe can match to more than one row in the right dataframe, and every row in the right dataframe mnatches to at most one row in the left dataframe,\n", "* and **one-to-many**, in which every row in the left dataframe matches to at most one row in the right dataframe, and rows in the right dataframe can match to more than one row in the left dataframe.\n", "\n", "The best way to prevent this mistake is to first think about whether the two dataframes should constitute a one-to-one, many-to-one, or one-to-many merge. It is important to have a clear expectation. Then use the `validate` argument to have `pandas` automatically check to see whether this expectation is met. If not, the code will generate an error, which will bring our attention to problems in the data we might not have caught otherwise. If we write\n", "\n", "* `validate = 'one_to_one'` then `pandas` checks to see whether the merge keys are unique in both dataframes,\n", "* `validate = 'many_to_one'` then `pandas` checks to see whether the merge keys are unique in the right dataframe,\n", "* `validate = 'one_to_many'` then `pandas` checks to see whether the merge keys are unique in the left dataframe.\n", "\n", "There are many reasons why our expectations might be incorrect. In messy, real-world data, mistakes like duplicated keys often make it past review processes and exist in the data without any clue of it in the codebook. The `validate` argument is an important check for these kinds of issues. For example, we expect that merging `elect` and `crosswalk` will be a many-to-one merge, so we confirm that the merge is many-to-one as follows:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcodematched
0Alabama1964210732479085030.5489769.4510319190001ALboth
1Alabama196819657914692369142518.9944881.0055219930001ALboth
2Alabama1972256923728701026.0670473.9329635400801ALboth
3Alabama1976659170504070056.6667343.3332737372041ALboth
4Alabama1980636730654192049.3236650.6763438940251ALboth
....................................
556Wyoming198867113106867038.5751261.4248846508056WYboth
557Wyoming19926816079347046.2079853.7920246625156WYboth
558Wyoming199677934105388042.5120857.4879248816756WYboth
559Wyoming200060481147947029.0176970.9823149378256WYboth
560Wyoming200470776167629029.6873070.3127150652956WYboth
\n", "

561 rows × 11 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode matched \n", "0 1919000 1 AL both \n", "1 1993000 1 AL both \n", "2 3540080 1 AL both \n", "3 3737204 1 AL both \n", "4 3894025 1 AL both \n", ".. ... ... ... ... \n", "556 465080 56 WY both \n", "557 466251 56 WY both \n", "558 488167 56 WY both \n", "559 493782 56 WY both \n", "560 506529 56 WY both \n", "\n", "[561 rows x 11 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_data = pd.merge(elect, crosswalk, on='State', how='outer', \n", " indicator='matched', validate='many_to_one')\n", "merged_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Problem 2**: rows should be matched, but aren't.\n", "\n", "This problem is especially common when matching on strings, such as names. With countries, it's possible for one dataframe to write \"USA\" and the other to write \"United States\". You will have to go back into a dataframe to edit individual ID cells so that they match. But to identify the cells that need to be edited, we have to construct a dataframe of the rows that were unmatched in one or the other dataframe. The easiest way to see this is to use an anti-join: a merge that keeps only the rows in the left dataframe that have no match in the right dataframe. Like SQL, there is no anti-join function built in to `pandas`, but there are ways to program our own anti-join. The easiest way is to perform an outer join with the `indicator='matched'` argument, then to filter the data to those rows for which `matched` is not equal to \"both\". \n", "\n", "For this step, I recommend creating a second, working version of the data `merge_test` to contain the first attempt at the merge. If there are issues, it is easier to fix the problems and recreate the working dataframe than to restart the kernel. If there are no problems, then we can assign the `merged_data` dataframe to be equal to the working version." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcodematched
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [State, year, demvote, repvote, wallacevote, dempercent, reppercent, population, fips, stcode, matched]\n", "Index: []" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched', validate='many_to_one')\n", "merge_test.query(\"matched!='both'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are no rows in `crosswalk` with no match to `elect`, and there are no rows in `elect` without a match in `crosswalk`. But if there were unmatched rows, there are two reasons why some rows may have been unmatched: rows can be unmatched due to differences in coverage, and they might be unmatched due to differences in coding or spelling. `pandas` cannot tell the difference between these two reasons why rows might be unmatched, and an inner join will drop unmatched rows regardless of the reason. Using an outer join with `indicator='matched'` is best practice for seeing the unmatched rows and thinking about whether any rows that should be matched are for some reason unmatched. We can use the category relabeling functionality in `pandas` to fix issues of mispelled and miscoded keys, as we will do in the examples in the next section.\n", "\n", "After performing this check, if we intend to perform another merge, we will either need to choose a new name for the indicator column, or we can drop the `matched` column so that we can reuse this name later. We set `merged_data` to the working dataframe `merge_test` without the `matched` column:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "merged_data = merge_test.drop('matched', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging all of the Dataframes Together While Checking for Problems\n", "The `merged_data` and `income` dataframes share keys named `stcode` and `year`. Joining these two dataframes should be a one-to-one merge because there should be only one occurrence of the same state and year in each dataframe. We use `how='outer'` and `indicator='matched'` (since we deleted the previous `matched` column) to enable us to filter the result to the unmatched rows, and we use `validate='one_to_one'` to catch anomolies that violate our expectation that this is a one-to-one merge:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...wom_meanincomewom_minincomewom_maxincomewom_sdincomewom_ineqincomewom_gini_incomewom_n_incomewom_nwgt_incomemedincomematched
\n", "

0 rows × 64 columns

\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [State, year, demvote, repvote, wallacevote, dempercent, reppercent, population, fips, stcode, income, p10income, p20income, p30income, p40income, p50income, p60income, p70income, p80income, p90income, meanincome, minincome, maxincome, sdincome, ineqincome, gini_income, n_income, nwgt_income, men_p10income, men_p20income, men_p30income, men_p40income, men_p50income, men_p60income, men_p70income, men_p80income, men_p90income, men_meanincome, men_minincome, men_maxincome, men_sdincome, men_ineqincome, men_gini_income, men_n_income, men_nwgt_income, wom_p10income, wom_p20income, wom_p30income, wom_p40income, wom_p50income, wom_p60income, wom_p70income, wom_p80income, wom_p90income, wom_meanincome, wom_minincome, wom_maxincome, wom_sdincome, wom_ineqincome, wom_gini_income, wom_n_income, wom_nwgt_income, medincome, matched]\n", "Index: []\n", "\n", "[0 rows x 64 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(merged_data, income, on=['stcode','year'], how='outer', indicator='matched', validate='one_to_one')\n", "merge_test.query(\"matched!='both'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are no unmatched rows so there are no issues with miscoded or mispelled ID values. The merged data now contain information on both state election results and state incomes:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...wom_meanincomewom_minincomewom_maxincomewom_sdincomewom_ineqincomewom_gini_incomewom_n_incomewom_nwgt_incomemedincomematched
0Alabama1964210732479085030.5489769.4510319190001AL...11400.058.13953072674.421796.8929.531520.473738175.0418494.176.68081both
1Alabama196819657914692369142518.9944881.0055219930001AL...12005.3NaNNaN1968.7827.409790.485090NaNNaN80.72643both
2Alabama1972256923728701026.0670473.9329635400801AL...14667.9NaNNaN8604.2517.835770.457932NaNNaN82.27975both
3Alabama1976659170504070056.6667343.3332737372041AL...15362.5NaNNaN12842.2015.777650.454074NaNNaN60.07215both
4Alabama1980636730654192049.3236650.6763438940251AL...14804.228.384280120196.505944.7114.526400.436961543.0711278.241.72882both
..................................................................
556Wyoming198867113106867038.5751261.4248846508056WY...17188.075.98784077507.6010402.4023.548220.478940318.0110956.136.63389both
557Wyoming19926816079347046.2079853.7920246625156WY...17934.192.30770096391.0312512.8021.143880.469024360.0117688.630.10564both
558Wyoming199677934105388042.5120857.4879248816756WY...18796.429.816510122420.9014517.6022.743680.461693413.0125222.326.93395both
559Wyoming200060481147947029.0176970.9823149378256WY...20463.488.819230333171.4016609.8015.304970.447124440.0128753.926.05026both
560Wyoming200470776167629029.6873070.3127150652956WY...20225.19.587728206166.8018311.9011.331380.432011692.0123164.124.74094both
\n", "

561 rows × 64 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode ... wom_meanincome wom_minincome \\\n", "0 1919000 1 AL ... 11400.0 58.139530 \n", "1 1993000 1 AL ... 12005.3 NaN \n", "2 3540080 1 AL ... 14667.9 NaN \n", "3 3737204 1 AL ... 15362.5 NaN \n", "4 3894025 1 AL ... 14804.2 28.384280 \n", ".. ... ... ... ... ... ... \n", "556 465080 56 WY ... 17188.0 75.987840 \n", "557 466251 56 WY ... 17934.1 92.307700 \n", "558 488167 56 WY ... 18796.4 29.816510 \n", "559 493782 56 WY ... 20463.4 88.819230 \n", "560 506529 56 WY ... 20225.1 9.587728 \n", "\n", " wom_maxincome wom_sdincome wom_ineqincome wom_gini_income \\\n", "0 72674.42 1796.89 29.53152 0.473738 \n", "1 NaN 1968.78 27.40979 0.485090 \n", "2 NaN 8604.25 17.83577 0.457932 \n", "3 NaN 12842.20 15.77765 0.454074 \n", "4 120196.50 5944.71 14.52640 0.436961 \n", ".. ... ... ... ... \n", "556 77507.60 10402.40 23.54822 0.478940 \n", "557 96391.03 12512.80 21.14388 0.469024 \n", "558 122420.90 14517.60 22.74368 0.461693 \n", "559 333171.40 16609.80 15.30497 0.447124 \n", "560 206166.80 18311.90 11.33138 0.432011 \n", "\n", " wom_n_income wom_nwgt_income medincome matched \n", "0 175.0 418494.1 76.68081 both \n", "1 NaN NaN 80.72643 both \n", "2 NaN NaN 82.27975 both \n", "3 NaN NaN 60.07215 both \n", "4 543.0 711278.2 41.72882 both \n", ".. ... ... ... ... \n", "556 318.0 110956.1 36.63389 both \n", "557 360.0 117688.6 30.10564 both \n", "558 413.0 125222.3 26.93395 both \n", "559 440.0 128753.9 26.05026 both \n", "560 692.0 123164.1 24.74094 both \n", "\n", "[561 rows x 64 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there are no problems, we set `merged_data` equal to `merge_test`. Because we intend to merge the data with another dataframe, we drop the `matched` column. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "merged_data = merge_test.drop('matched', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we merge this result with the `econ` dataframe. These two dataframes share `fips` and `year` as keys, and should also be a one-to-one merge. We check to see whether there are unmatched rows:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...Manufacturing_shareFinance_Insurance_shareLegal_shareEducation_shareHealth_shareGovernment_shareGovernment_federal_shareGovernment_military_shareGovernment_statelocal_sharematched
\n", "

0 rows × 98 columns

\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [State, year, demvote, repvote, wallacevote, dempercent, reppercent, population, fips, stcode, income, p10income, p20income, p30income, p40income, p50income, p60income, p70income, p80income, p90income, meanincome, minincome, maxincome, sdincome, ineqincome, gini_income, n_income, nwgt_income, men_p10income, men_p20income, men_p30income, men_p40income, men_p50income, men_p60income, men_p70income, men_p80income, men_p90income, men_meanincome, men_minincome, men_maxincome, men_sdincome, men_ineqincome, men_gini_income, men_n_income, men_nwgt_income, wom_p10income, wom_p20income, wom_p30income, wom_p40income, wom_p50income, wom_p60income, wom_p70income, wom_p80income, wom_p90income, wom_meanincome, wom_minincome, wom_maxincome, wom_sdincome, wom_ineqincome, wom_gini_income, wom_n_income, wom_nwgt_income, medincome, GDP, GDPpc, Private, Agriculture, Farms, Mining, Utilities, Construction, Manufacturing, Finance_Insurance, Legal, Education, Health, Government, Government_federal, Government_military, Government_statelocal, cpideflator, statecpi, Private_share, Agriculture_share, Farms_share, Mining_share, Utilities_share, Construction_share, Manufacturing_share, Finance_Insurance_share, Legal_share, Education_share, Health_share, Government_share, Government_federal_share, Government_military_share, Government_statelocal_share, matched]\n", "Index: []\n", "\n", "[0 rows x 98 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(merged_data, econ, on=['fips','year'], how='outer', indicator='matched', validate='one_to_one')\n", "merge_test.query(\"matched!='both'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, there are no unmatched rows so there are no issues with miscoded or mispelled ID values. The merged data now contain information on state election results, state incomes, and state economies:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...Manufacturing_shareFinance_Insurance_shareLegal_shareEducation_shareHealth_shareGovernment_shareGovernment_federal_shareGovernment_military_shareGovernment_statelocal_sharematched
0Alabama1964210732479085030.5489769.4510319190001AL...28.22827011.4132400.4023900.2560661.90220715.863926.2919152.4021467.182051both
1Alabama196819657914692369142518.9944881.0055219930001AL...28.74235010.6767700.4292630.3653302.22851416.933055.9457482.6303778.356928both
2Alabama1972256923728701026.0670473.9329635400801AL...26.05634011.4827900.4694840.4434012.59520117.214405.6924882.6799698.841941both
3Alabama1976659170504070056.6667343.3332737372041AL...24.85334011.5054100.4792200.3594152.98686317.425435.7878212.2556399.386103both
4Alabama1980636730654192049.3236650.6763438940251AL...24.60423011.6341700.6498920.3527193.66050117.883135.7823702.1718609.928901both
..................................................................
556Wyoming198867113106867038.5751261.4248846508056WY...4.3800787.7887670.6888020.1236312.20770014.517842.6227481.66019110.234900both
557Wyoming19926816079347046.2079853.7920246625156WY...4.3038167.6703910.7048060.1799512.57929114.995882.8117271.55207310.632080both
558Wyoming199677934105388042.5120857.4879248816756WY...6.1467089.2804480.7055680.1525552.86676814.238492.6570051.54462210.030510both
559Wyoming200060481147947029.0176970.9823149378256WY...6.0123483.1100340.6462410.1731004.26980514.955862.7292131.54059210.686050both
560Wyoming200470776167629029.6873070.3127150652956WY...3.7959012.8138340.5892400.2220324.41930014.496162.5192141.64816410.328780both
\n", "

561 rows × 98 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964 210732 479085 0 30.54897 69.45103 \n", "1 Alabama 1968 196579 146923 691425 18.99448 81.00552 \n", "2 Alabama 1972 256923 728701 0 26.06704 73.93296 \n", "3 Alabama 1976 659170 504070 0 56.66673 43.33327 \n", "4 Alabama 1980 636730 654192 0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988 67113 106867 0 38.57512 61.42488 \n", "557 Wyoming 1992 68160 79347 0 46.20798 53.79202 \n", "558 Wyoming 1996 77934 105388 0 42.51208 57.48792 \n", "559 Wyoming 2000 60481 147947 0 29.01769 70.98231 \n", "560 Wyoming 2004 70776 167629 0 29.68730 70.31271 \n", "\n", " population fips stcode ... Manufacturing_share \\\n", "0 1919000 1 AL ... 28.228270 \n", "1 1993000 1 AL ... 28.742350 \n", "2 3540080 1 AL ... 26.056340 \n", "3 3737204 1 AL ... 24.853340 \n", "4 3894025 1 AL ... 24.604230 \n", ".. ... ... ... ... ... \n", "556 465080 56 WY ... 4.380078 \n", "557 466251 56 WY ... 4.303816 \n", "558 488167 56 WY ... 6.146708 \n", "559 493782 56 WY ... 6.012348 \n", "560 506529 56 WY ... 3.795901 \n", "\n", " Finance_Insurance_share Legal_share Education_share Health_share \\\n", "0 11.413240 0.402390 0.256066 1.902207 \n", "1 10.676770 0.429263 0.365330 2.228514 \n", "2 11.482790 0.469484 0.443401 2.595201 \n", "3 11.505410 0.479220 0.359415 2.986863 \n", "4 11.634170 0.649892 0.352719 3.660501 \n", ".. ... ... ... ... \n", "556 7.788767 0.688802 0.123631 2.207700 \n", "557 7.670391 0.704806 0.179951 2.579291 \n", "558 9.280448 0.705568 0.152555 2.866768 \n", "559 3.110034 0.646241 0.173100 4.269805 \n", "560 2.813834 0.589240 0.222032 4.419300 \n", "\n", " Government_share Government_federal_share Government_military_share \\\n", "0 15.86392 6.291915 2.402146 \n", "1 16.93305 5.945748 2.630377 \n", "2 17.21440 5.692488 2.679969 \n", "3 17.42543 5.787821 2.255639 \n", "4 17.88313 5.782370 2.171860 \n", ".. ... ... ... \n", "556 14.51784 2.622748 1.660191 \n", "557 14.99588 2.811727 1.552073 \n", "558 14.23849 2.657005 1.544622 \n", "559 14.95586 2.729213 1.540592 \n", "560 14.49616 2.519214 1.648164 \n", "\n", " Government_statelocal_share matched \n", "0 7.182051 both \n", "1 8.356928 both \n", "2 8.841941 both \n", "3 9.386103 both \n", "4 9.928901 both \n", ".. ... ... \n", "556 10.234900 both \n", "557 10.632080 both \n", "558 10.030510 both \n", "559 10.686050 both \n", "560 10.328780 both \n", "\n", "[561 rows x 98 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we set `merged_data` equal to `merge_test`, dropping the `matched` column so that we can reuse this name in the next join:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "merged_data = merge_test.drop('matched', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we merge the data with the `area` dataframe. This merge matches `State` in `merged_data` to `state` in area: this subtle difference is easy to miss and will result in an error if we fail to notice the lowercase s and write `on='State'`. This merge should be a many-to-one merge, as there are many rows for each state in `merged_data` (for many years), but only one row for each state in `area`. As " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Statestateyearmatched
77D. C.NaN1964.0left_only
78D. C.NaN1968.0left_only
79D. C.NaN1972.0left_only
80D. C.NaN1976.0left_only
81D. C.NaN1980.0left_only
82D. C.NaN1984.0left_only
83D. C.NaN1988.0left_only
84D. C.NaN1992.0left_only
85D. C.NaN1996.0left_only
86D. C.NaN2000.0left_only
87D. C.NaN2004.0left_only
561NaNDistrict of ColumbiaNaNright_only
562NaNPuerto RicoNaNright_only
563NaNNorthern Mariana IslandsNaNright_only
564NaNUnited States Virgin IslandsNaNright_only
565NaNAmerican SamoaNaNright_only
566NaNGuamNaNright_only
567NaNUnited States Minor Outlying IslandsNaNright_only
\n", "
" ], "text/plain": [ " State state year matched\n", "77 D. C. NaN 1964.0 left_only\n", "78 D. C. NaN 1968.0 left_only\n", "79 D. C. NaN 1972.0 left_only\n", "80 D. C. NaN 1976.0 left_only\n", "81 D. C. NaN 1980.0 left_only\n", "82 D. C. NaN 1984.0 left_only\n", "83 D. C. NaN 1988.0 left_only\n", "84 D. C. NaN 1992.0 left_only\n", "85 D. C. NaN 1996.0 left_only\n", "86 D. C. NaN 2000.0 left_only\n", "87 D. C. NaN 2004.0 left_only\n", "561 NaN District of Columbia NaN right_only\n", "562 NaN Puerto Rico NaN right_only\n", "563 NaN Northern Mariana Islands NaN right_only\n", "564 NaN United States Virgin Islands NaN right_only\n", "565 NaN American Samoa NaN right_only\n", "566 NaN Guam NaN right_only\n", "567 NaN United States Minor Outlying Islands NaN right_only" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(merged_data, area, how=\"outer\",\n", " left_on='State',right_on='state', \n", " indicator='matched', \n", " validate='many_to_one')\n", "merge_test.query(\"matched!='both'\")[['State', 'state',\n", " 'year','matched']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, one of our validation checks failed: we have unmatched rows. Some of these rows are due to differences in coverage: we will not bring Puerto Rico, the Northern Mariana Islands, the United States Virgin Islands, American Samoa, Guam, or the United States Minor Outlying Islands into the merged data as these territories do not receive votes in the electoral college. We will however bring the District of Columbia into the merged data as D.C. does get 3 votes in the electoral college. D.C. is unmatched due to spelling discrepancies, and we must correct this mistake before we can proceed. The easiest way to fix the error is to replace values of \"District of Columbia\" in the `area` dataframe with \"D. C.\", as it is written in the merged data. We can use the `.replace()` method (because we only want to replace one `state` category and `.replace()` leaves unlisted categories alone, as opposed to `.map()` which would replace unlisted categories with missing values), as we discussed in module 8:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "area.state = area.state.replace({'District of Columbia':'D. C.'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now if we repeat the test, we will only see the rows that are unmatched due to differences in coverage:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Statestateyearmatched
561NaNPuerto RicoNaNright_only
562NaNNorthern Mariana IslandsNaNright_only
563NaNUnited States Virgin IslandsNaNright_only
564NaNAmerican SamoaNaNright_only
565NaNGuamNaNright_only
566NaNUnited States Minor Outlying IslandsNaNright_only
\n", "
" ], "text/plain": [ " State state year matched\n", "561 NaN Puerto Rico NaN right_only\n", "562 NaN Northern Mariana Islands NaN right_only\n", "563 NaN United States Virgin Islands NaN right_only\n", "564 NaN American Samoa NaN right_only\n", "565 NaN Guam NaN right_only\n", "566 NaN United States Minor Outlying Islands NaN right_only" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_test = pd.merge(merged_data, area, left_on='State', how=\"outer\", right_on='state', indicator='matched', validate='many_to_one')\n", "merge_test.query(\"matched!='both'\")[['State', 'state','year','matched']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We replace the `merged_data` dataframe with `merge_test`, keeping only the rows for which `matched='both'`, then dropping the `matched` column:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "merged_data = merge_test.query(\"matched=='both'\").drop('matched', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now `merged_data` contains all of the data we need, including state election results, income, economic data, and area:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Stateyeardemvoterepvotewallacevotedempercentreppercentpopulationfipsstcode...Government_military_shareGovernment_statelocal_sharestatearea_sqmiarea_sqkmlandarea_sqmilandarea_sqkmwater_sqmiwater_sqkmpercent_water
0Alabama1964.0210732.0479085.00.030.5489769.451031919000.01.0AL...2.4021467.182051Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
1Alabama1968.0196579.0146923.0691425.018.9944881.005521993000.01.0AL...2.6303778.356928Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
2Alabama1972.0256923.0728701.00.026.0670473.932963540080.01.0AL...2.6799698.841941Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
3Alabama1976.0659170.0504070.00.056.6667343.333273737204.01.0AL...2.2556399.386103Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
4Alabama1980.0636730.0654192.00.049.3236650.676343894025.01.0AL...2.1718609.928901Alabama52,419.02135,76550,744.00131,4261,675.014,3383.20
..................................................................
556Wyoming1988.067113.0106867.00.038.5751261.42488465080.056.0WY...1.66019110.234900Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
557Wyoming1992.068160.079347.00.046.2079853.79202466251.056.0WY...1.55207310.632080Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
558Wyoming1996.077934.0105388.00.042.5120857.48792488167.056.0WY...1.54462210.030510Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
559Wyoming2000.060481.0147947.00.029.0176970.98231493782.056.0WY...1.54059210.686050Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
560Wyoming2004.070776.0167629.00.029.6873070.31271506529.056.0WY...1.64816410.328780Wyoming97,813.56253,33697,100.40251,489713.161,8470.73
\n", "

561 rows × 105 columns

\n", "
" ], "text/plain": [ " State year demvote repvote wallacevote dempercent reppercent \\\n", "0 Alabama 1964.0 210732.0 479085.0 0.0 30.54897 69.45103 \n", "1 Alabama 1968.0 196579.0 146923.0 691425.0 18.99448 81.00552 \n", "2 Alabama 1972.0 256923.0 728701.0 0.0 26.06704 73.93296 \n", "3 Alabama 1976.0 659170.0 504070.0 0.0 56.66673 43.33327 \n", "4 Alabama 1980.0 636730.0 654192.0 0.0 49.32366 50.67634 \n", ".. ... ... ... ... ... ... ... \n", "556 Wyoming 1988.0 67113.0 106867.0 0.0 38.57512 61.42488 \n", "557 Wyoming 1992.0 68160.0 79347.0 0.0 46.20798 53.79202 \n", "558 Wyoming 1996.0 77934.0 105388.0 0.0 42.51208 57.48792 \n", "559 Wyoming 2000.0 60481.0 147947.0 0.0 29.01769 70.98231 \n", "560 Wyoming 2004.0 70776.0 167629.0 0.0 29.68730 70.31271 \n", "\n", " population fips stcode ... Government_military_share \\\n", "0 1919000.0 1.0 AL ... 2.402146 \n", "1 1993000.0 1.0 AL ... 2.630377 \n", "2 3540080.0 1.0 AL ... 2.679969 \n", "3 3737204.0 1.0 AL ... 2.255639 \n", "4 3894025.0 1.0 AL ... 2.171860 \n", ".. ... ... ... ... ... \n", "556 465080.0 56.0 WY ... 1.660191 \n", "557 466251.0 56.0 WY ... 1.552073 \n", "558 488167.0 56.0 WY ... 1.544622 \n", "559 493782.0 56.0 WY ... 1.540592 \n", "560 506529.0 56.0 WY ... 1.648164 \n", "\n", " Government_statelocal_share state area_sqmi area_sqkm \\\n", "0 7.182051 Alabama 52,419.02 135,765 \n", "1 8.356928 Alabama 52,419.02 135,765 \n", "2 8.841941 Alabama 52,419.02 135,765 \n", "3 9.386103 Alabama 52,419.02 135,765 \n", "4 9.928901 Alabama 52,419.02 135,765 \n", ".. ... ... ... ... \n", "556 10.234900 Wyoming 97,813.56 253,336 \n", "557 10.632080 Wyoming 97,813.56 253,336 \n", "558 10.030510 Wyoming 97,813.56 253,336 \n", "559 10.686050 Wyoming 97,813.56 253,336 \n", "560 10.328780 Wyoming 97,813.56 253,336 \n", "\n", " landarea_sqmi landarea_sqkm water_sqmi water_sqkm percent_water \n", "0 50,744.00 131,426 1,675.01 4,338 3.20 \n", "1 50,744.00 131,426 1,675.01 4,338 3.20 \n", "2 50,744.00 131,426 1,675.01 4,338 3.20 \n", "3 50,744.00 131,426 1,675.01 4,338 3.20 \n", "4 50,744.00 131,426 1,675.01 4,338 3.20 \n", ".. ... ... ... ... ... \n", "556 97,100.40 251,489 713.16 1,847 0.73 \n", "557 97,100.40 251,489 713.16 1,847 0.73 \n", "558 97,100.40 251,489 713.16 1,847 0.73 \n", "559 97,100.40 251,489 713.16 1,847 0.73 \n", "560 97,100.40 251,489 713.16 1,847 0.73 \n", "\n", "[561 rows x 105 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reshaping Data\n", "**Reshaping** a dataframe refers to turning the rows of a dataframe into columns (also known as a long-to-wide reshape), or turning columns into rows (also known as a wide-to-long reshape), in a way that cannot be accomplished by simply transposing the data matrix. It is also called **pivoting** for a long-to-wide reshape, **melting** for a wide-to-long reshape. It's a notoriously tricky skill to master, and it helps a lot to understand every part of the function we need, so that we're not blindly trying every combination of parameters hoping one of these combinations works.\n", "\n", "### Example: Gross domestic product by state, 1997-2015\n", "To demonstrate the method for pivoting a dataframe, we use example data from the U.S. Bureau of Economic Affairs (www.bea.gov) on economic output for the American states. We load the data, skipping the last four rows as these rows contain attribution information and notes, and are not part of the data. 
We also use the `.str.strip()` method to remove all the leading and trailing whitespace in the `Description` column: " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] }, { "data": { "text/html": [ "
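Incidentally, the `ParserWarning` above is harmless: `skipfooter` is only supported by the slower Python parsing engine, so `pandas` falls back to it automatically. If we prefer to silence the warning, we can request that engine explicitly, as the message suggests (the same call with one extra argument):

```
import pandas as pd

gsp = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv",
                  skipfooter=4, engine='python')
```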
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GeoFIPSGeoNameRegionComponentIdComponentNameIndustryIdIndustryClassificationDescription19971998...2006200720082009201020112012201320142015
000000United StatesNaN200Gross domestic product (GDP) by state1...All industry total85425309024434...13773226143911491462659814320114148597721540600216041243165768081727754817919651
100000United StatesNaN200Gross domestic product (GDP) by state2...Private industries74593957894015...12045446125723871271617912352979128265071334843913957545144684651511584615698669
200000United StatesNaN200Gross domestic product (GDP) by state311Agriculture, forestry, fishing, and hunting10879699940...128345141999154525137655160217197241185800221821203188175236
300000United StatesNaN200Gross domestic product (GDP) by state4111-112Farms8813679030...99352113533126345109800129725166249151489186960166249(NA)
400000United StatesNaN200Gross domestic product (GDP) by state5113-115Forestry, fishing, and related activities2066020910...289932846628180278553049230992343113486136939(NA)
..................................................................
4328794000Plains41000Per capita real GDP by state1...All industry total3907139970...46013465014679445399462104692147629481264877849153
4328895000Southeast51000Per capita real GDP by state1...All industry total3672237896...43586430284209240346406754047140415404874080241352
4328996000Southwest61000Per capita real GDP by state1...All industry total3888740283...45743466844591444324444914534446881480644932750519
4329097000Rocky Mountain71000Per capita real GDP by state1...All industry total3824340390...46756475874701945170449374506045014455644650847093
4329198000Far West81000Per capita real GDP by state1...All industry total4190143467...53932544695388151328516095185752429529425407755429
\n", "

43292 rows × 27 columns

\n", "
" ], "text/plain": [ " GeoFIPS GeoName Region ComponentId \\\n", "0 00000 United States NaN 200 \n", "1 00000 United States NaN 200 \n", "2 00000 United States NaN 200 \n", "3 00000 United States NaN 200 \n", "4 00000 United States NaN 200 \n", "... ... ... ... ... \n", "43287 94000 Plains 4 1000 \n", "43288 95000 Southeast 5 1000 \n", "43289 96000 Southwest 6 1000 \n", "43290 97000 Rocky Mountain 7 1000 \n", "43291 98000 Far West 8 1000 \n", "\n", " ComponentName IndustryId \\\n", "0 Gross domestic product (GDP) by state 1 \n", "1 Gross domestic product (GDP) by state 2 \n", "2 Gross domestic product (GDP) by state 3 \n", "3 Gross domestic product (GDP) by state 4 \n", "4 Gross domestic product (GDP) by state 5 \n", "... ... ... \n", "43287 Per capita real GDP by state 1 \n", "43288 Per capita real GDP by state 1 \n", "43289 Per capita real GDP by state 1 \n", "43290 Per capita real GDP by state 1 \n", "43291 Per capita real GDP by state 1 \n", "\n", " IndustryClassification Description \\\n", "0 ... All industry total \n", "1 ... Private industries \n", "2 11 Agriculture, forestry, fishing, and hunting \n", "3 111-112 Farms \n", "4 113-115 Forestry, fishing, and related activities \n", "... ... ... \n", "43287 ... All industry total \n", "43288 ... All industry total \n", "43289 ... All industry total \n", "43290 ... All industry total \n", "43291 ... All industry total \n", "\n", " 1997 1998 ... 2006 2007 2008 2009 \\\n", "0 8542530 9024434 ... 13773226 14391149 14626598 14320114 \n", "1 7459395 7894015 ... 12045446 12572387 12716179 12352979 \n", "2 108796 99940 ... 128345 141999 154525 137655 \n", "3 88136 79030 ... 99352 113533 126345 109800 \n", "4 20660 20910 ... 28993 28466 28180 27855 \n", "... ... ... ... ... ... ... ... \n", "43287 39071 39970 ... 46013 46501 46794 45399 \n", "43288 36722 37896 ... 43586 43028 42092 40346 \n", "43289 38887 40283 ... 45743 46684 45914 44324 \n", "43290 38243 40390 ... 46756 47587 47019 45170 \n", "43291 41901 43467 ... 53932 54469 53881 51328 \n", "\n", " 2010 2011 2012 2013 2014 2015 \n", "0 14859772 15406002 16041243 16576808 17277548 17919651 \n", "1 12826507 13348439 13957545 14468465 15115846 15698669 \n", "2 160217 197241 185800 221821 203188 175236 \n", "3 129725 166249 151489 186960 166249 (NA) \n", "4 30492 30992 34311 34861 36939 (NA) \n", "... ... ... ... ... ... ... \n", "43287 46210 46921 47629 48126 48778 49153 \n", "43288 40675 40471 40415 40487 40802 41352 \n", "43289 44491 45344 46881 48064 49327 50519 \n", "43290 44937 45060 45014 45564 46508 47093 \n", "43291 51609 51857 52429 52942 54077 55429 \n", "\n", "[43292 rows x 27 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv\", skipfooter=4)\n", "gsp['Description'] = gsp['Description'].str.strip()\n", "gsp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we only want to save three features besides the primary keys:\n", "\n", "* state GDP,\n", "* state GDP per capita,\n", "* state GDP from private industries\n", "\n", "We only want the 50 states, no regions, territories, or districts (sorry DC!). Notice that the features we want are contained in rows, and the years are contained in columns. We need to switch that: the features must be in columns, and years must be stored in the rows. We will need to perform both a wide-to-long and a long-to-wide reshape. 
In addition, we will give the columns concise but descriptive names, and we will create a new column that contains the percent of GDP from private industry in each state and each year.\n", "\n", "First we need to reduce the data to only the features we need: state GDP, state GDP per capita, and state GDP from private industries. Right now, however, the features are not stored in the columns, but in the rows. Specifically, the data we need are contained in the rows in which `Description` is \"All industry total\" or \"Private industries\" and `ComponentName` is \"Gross domestic product (GDP) by state\" or \"Per capita real GDP by state\". We use `.query()` to isolate these rows:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GeoFIPSGeoNameRegionComponentIdComponentNameIndustryIdIndustryClassificationDescription19971998...2006200720082009201020112012201320142015
000000United StatesNaN200Gross domestic product (GDP) by state1...All industry total85425309024434...13773226143911491462659814320114148597721540600216041243165768081727754817919651
100000United StatesNaN200Gross domestic product (GDP) by state2...Private industries74593957894015...12045446125723871271617912352979128265071334843913957545144684651511584615698669
9001000Alabama5200Gross domestic product (GDP) by state1...All industry total104218109414...164468169923172646168315174710180665185878190095194421199656
9101000Alabama5200Gross domestic product (GDP) by state2...Private industries8701491506...137954141306142965137143142773148181153494157961161364166243
18002000Alaska8200Gross domestic product (GDP) by state1...All industry total2544624030...44679491975546150463541345875960890597625825352747
..................................................................
4328794000Plains41000Per capita real GDP by state1...All industry total3907139970...46013465014679445399462104692147629481264877849153
4328895000Southeast51000Per capita real GDP by state1...All industry total3672237896...43586430284209240346406754047140415404874080241352
4328996000Southwest61000Per capita real GDP by state1...All industry total3888740283...45743466844591444324444914534446881480644932750519
4329097000Rocky Mountain71000Per capita real GDP by state1...All industry total3824340390...46756475874701945170449374506045014455644650847093
4329198000Far West81000Per capita real GDP by state1...All industry total4190143467...53932544695388151328516095185752429529425407755429
\n", "

180 rows × 27 columns

\n", "
" ], "text/plain": [ " GeoFIPS GeoName Region ComponentId \\\n", "0 00000 United States NaN 200 \n", "1 00000 United States NaN 200 \n", "90 01000 Alabama 5 200 \n", "91 01000 Alabama 5 200 \n", "180 02000 Alaska 8 200 \n", "... ... ... ... ... \n", "43287 94000 Plains 4 1000 \n", "43288 95000 Southeast 5 1000 \n", "43289 96000 Southwest 6 1000 \n", "43290 97000 Rocky Mountain 7 1000 \n", "43291 98000 Far West 8 1000 \n", "\n", " ComponentName IndustryId \\\n", "0 Gross domestic product (GDP) by state 1 \n", "1 Gross domestic product (GDP) by state 2 \n", "90 Gross domestic product (GDP) by state 1 \n", "91 Gross domestic product (GDP) by state 2 \n", "180 Gross domestic product (GDP) by state 1 \n", "... ... ... \n", "43287 Per capita real GDP by state 1 \n", "43288 Per capita real GDP by state 1 \n", "43289 Per capita real GDP by state 1 \n", "43290 Per capita real GDP by state 1 \n", "43291 Per capita real GDP by state 1 \n", "\n", " IndustryClassification Description 1997 1998 ... \\\n", "0 ... All industry total 8542530 9024434 ... \n", "1 ... Private industries 7459395 7894015 ... \n", "90 ... All industry total 104218 109414 ... \n", "91 ... Private industries 87014 91506 ... \n", "180 ... All industry total 25446 24030 ... \n", "... ... ... ... ... ... \n", "43287 ... All industry total 39071 39970 ... \n", "43288 ... All industry total 36722 37896 ... \n", "43289 ... All industry total 38887 40283 ... \n", "43290 ... All industry total 38243 40390 ... \n", "43291 ... All industry total 41901 43467 ... \n", "\n", " 2006 2007 2008 2009 2010 2011 2012 \\\n", "0 13773226 14391149 14626598 14320114 14859772 15406002 16041243 \n", "1 12045446 12572387 12716179 12352979 12826507 13348439 13957545 \n", "90 164468 169923 172646 168315 174710 180665 185878 \n", "91 137954 141306 142965 137143 142773 148181 153494 \n", "180 44679 49197 55461 50463 54134 58759 60890 \n", "... ... ... ... ... ... ... ... \n", "43287 46013 46501 46794 45399 46210 46921 47629 \n", "43288 43586 43028 42092 40346 40675 40471 40415 \n", "43289 45743 46684 45914 44324 44491 45344 46881 \n", "43290 46756 47587 47019 45170 44937 45060 45014 \n", "43291 53932 54469 53881 51328 51609 51857 52429 \n", "\n", " 2013 2014 2015 \n", "0 16576808 17277548 17919651 \n", "1 14468465 15115846 15698669 \n", "90 190095 194421 199656 \n", "91 157961 161364 166243 \n", "180 59762 58253 52747 \n", "... ... ... ... \n", "43287 48126 48778 49153 \n", "43288 40487 40802 41352 \n", "43289 48064 49327 50519 \n", "43290 45564 46508 47093 \n", "43291 52942 54077 55429 \n", "\n", "[180 rows x 27 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp.query(\n", " \"Description in ['All industry total', 'Private industries']\"\n", ").query(\n", " \"ComponentName in ['Gross domestic product (GDP) by state','Per capita real GDP by state']\")\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we only want the 50 states, no regions, territories, or districts, so we issue another `.query()` to remove the territories. 
To see these non-states, we call up a list of the unique values of `GeoName`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['United States', 'Alabama', 'Alaska', 'Arizona', 'Arkansas',\n", " 'California', 'Colorado', 'Connecticut', 'Delaware',\n", " 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',\n", " 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',\n", " 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',\n", " 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',\n", " 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',\n", " 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',\n", " 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',\n", " 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',\n", " 'West Virginia', 'Wisconsin', 'Wyoming', 'New England', 'Mideast',\n", " 'Great Lakes', 'Plains', 'Southeast', 'Southwest',\n", " 'Rocky Mountain', 'Far West'], dtype=object)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean['GeoName'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to remove the rows in which `GeoName` is \"United States\", \"District of Columbia\", \"New England\", \"Mideast\", \"Great Lakes\", \"Plains\", \"Southeast\", \"Southwest\", \"Rocky Mountain\", or \"Far West\":" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GeoFIPSGeoNameRegionComponentIdComponentNameIndustryIdIndustryClassificationDescription19971998...2006200720082009201020112012201320142015
9001000Alabama5200Gross domestic product (GDP) by state1...All industry total104218109414...164468169923172646168315174710180665185878190095194421199656
9101000Alabama5200Gross domestic product (GDP) by state2...Private industries8701491506...137954141306142965137143142773148181153494157961161364166243
18002000Alaska8200Gross domestic product (GDP) by state1...All industry total2544624030...44679491975546150463541345875960890597625825352747
18102000Alaska8200Gross domestic product (GDP) by state2...Private industries2028418776...36737409194688941387447424885150592497644787942246
27004000Arizona6200Gross domestic product (GDP) by state1...All industry total132708143768...248459262045256718242509245668254192264693270642280166290903
..................................................................
4327951000Virginia51000Per capita real GDP by state1...All industry total4306944745...52866526575198551389519455174951538511055086151540
4328053000Washington81000Per capita real GDP by state1...All industry total4575347712...52900553485493952264526815249553423539875477355577
4328154000West Virginia51000Per capita real GDP by state1...All industry total3044530978...34009338923467934564353683608535515357783623436817
4328255000Wisconsin31000Per capita real GDP by state1...All industry total3866339695...45515454644462243215441264490545380455824646946893
4328356000Wyoming71000Per capita real GDP by state1...All industry total4658547652...63428654716918266320646026398160744607436163961389
\n", "

150 rows × 27 columns

\n", "
" ], "text/plain": [ " GeoFIPS GeoName Region ComponentId \\\n", "90 01000 Alabama 5 200 \n", "91 01000 Alabama 5 200 \n", "180 02000 Alaska 8 200 \n", "181 02000 Alaska 8 200 \n", "270 04000 Arizona 6 200 \n", "... ... ... ... ... \n", "43279 51000 Virginia 5 1000 \n", "43280 53000 Washington 8 1000 \n", "43281 54000 West Virginia 5 1000 \n", "43282 55000 Wisconsin 3 1000 \n", "43283 56000 Wyoming 7 1000 \n", "\n", " ComponentName IndustryId \\\n", "90 Gross domestic product (GDP) by state 1 \n", "91 Gross domestic product (GDP) by state 2 \n", "180 Gross domestic product (GDP) by state 1 \n", "181 Gross domestic product (GDP) by state 2 \n", "270 Gross domestic product (GDP) by state 1 \n", "... ... ... \n", "43279 Per capita real GDP by state 1 \n", "43280 Per capita real GDP by state 1 \n", "43281 Per capita real GDP by state 1 \n", "43282 Per capita real GDP by state 1 \n", "43283 Per capita real GDP by state 1 \n", "\n", " IndustryClassification Description 1997 1998 ... 2006 \\\n", "90 ... All industry total 104218 109414 ... 164468 \n", "91 ... Private industries 87014 91506 ... 137954 \n", "180 ... All industry total 25446 24030 ... 44679 \n", "181 ... Private industries 20284 18776 ... 36737 \n", "270 ... All industry total 132708 143768 ... 248459 \n", "... ... ... ... ... ... ... \n", "43279 ... All industry total 43069 44745 ... 52866 \n", "43280 ... All industry total 45753 47712 ... 52900 \n", "43281 ... All industry total 30445 30978 ... 34009 \n", "43282 ... All industry total 38663 39695 ... 45515 \n", "43283 ... All industry total 46585 47652 ... 63428 \n", "\n", " 2007 2008 2009 2010 2011 2012 2013 2014 2015 \n", "90 169923 172646 168315 174710 180665 185878 190095 194421 199656 \n", "91 141306 142965 137143 142773 148181 153494 157961 161364 166243 \n", "180 49197 55461 50463 54134 58759 60890 59762 58253 52747 \n", "181 40919 46889 41387 44742 48851 50592 49764 47879 42246 \n", "270 262045 256718 242509 245668 254192 264693 270642 280166 290903 \n", "... ... ... ... ... ... ... ... ... ... \n", "43279 52657 51985 51389 51945 51749 51538 51105 50861 51540 \n", "43280 55348 54939 52264 52681 52495 53423 53987 54773 55577 \n", "43281 33892 34679 34564 35368 36085 35515 35778 36234 36817 \n", "43282 45464 44622 43215 44126 44905 45380 45582 46469 46893 \n", "43283 65471 69182 66320 64602 63981 60744 60743 61639 61389 \n", "\n", "[150 rows x 27 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp_clean.query(\"GeoName not in ['United States', 'District of Columbia', 'New England', 'Mideast', 'Great Lakes', 'Plains', 'Southeast', 'Southwest', 'Rocky Mountain', 'Far West']\")\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Melting: Turning Columns to Rows\n", "Notice that many of the columns refer to particular years. We want to move these columns to the rows, so that each state has one row for each year covered in the data. The best function to use when we need to turn columns into rows is `pd.melt()`:\n", "```\n", "pd.melt(df, id_vars, value_vars)\n", "```\n", "This function takes three parameters:\n", "\n", "1. The name of the dataframe you wish to edit\n", "\n", "2. `id_vars`: A list of the names of the columns that we want to continue to exist as columns after the dataframe is melted. 
In our example, this list will contain the IDs that currently uniquely identify the rows (`GeoName`, `ComponentName`, `Description`), synonyms for the IDs (`GeoFIPS`) that we want to keep, and features that vary across ID but not across the columns (`Region`).\n", "\n", "3. `value_vars`: A list of the names of the columns we want to store in the rows. If the columns have numeric names and can be sequentially ordered, as is the case in our example with years, a shortcut for creating a list of these column names is `[str(i) for i in range(1997,2015)]`, which loops over integers from 1997 to 2014 (the upper bound of `range()` is exclusive), placing each year as a string in the list.\n", "\n", "Any column that is not listed under either `id_vars` or `value_vars` will be dropped from the melted data. \n", "\n", "For example, we melt the `gsp_clean` dataframe with the following code:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
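To see exactly what that shortcut produces, we can run the list comprehension on its own. Note that `range(1997,2015)` stops before 2015, so this particular list covers the 18 year columns from 1997 through 2014:

```
year_columns = [str(i) for i in range(1997,2015)]
print(year_columns[0], '...', year_columns[-1], '|', len(year_columns), 'columns')
# 1997 ... 2014 | 18 columns
```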
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GeoNameGeoFIPSRegionComponentNameDescriptionvariablevalue
0Alabama010005Gross domestic product (GDP) by stateAll industry total1997104218
1Alabama010005Gross domestic product (GDP) by statePrivate industries199787014
2Alaska020008Gross domestic product (GDP) by stateAll industry total199725446
3Alaska020008Gross domestic product (GDP) by statePrivate industries199720284
4Arizona040006Gross domestic product (GDP) by stateAll industry total1997132708
........................
2695Virginia510005Per capita real GDP by stateAll industry total201450861
2696Washington530008Per capita real GDP by stateAll industry total201454773
2697West Virginia540005Per capita real GDP by stateAll industry total201436234
2698Wisconsin550003Per capita real GDP by stateAll industry total201446469
2699Wyoming560007Per capita real GDP by stateAll industry total201461639
\n", "

2700 rows × 7 columns

\n", "
" ], "text/plain": [ " GeoName GeoFIPS Region ComponentName \\\n", "0 Alabama 01000 5 Gross domestic product (GDP) by state \n", "1 Alabama 01000 5 Gross domestic product (GDP) by state \n", "2 Alaska 02000 8 Gross domestic product (GDP) by state \n", "3 Alaska 02000 8 Gross domestic product (GDP) by state \n", "4 Arizona 04000 6 Gross domestic product (GDP) by state \n", "... ... ... ... ... \n", "2695 Virginia 51000 5 Per capita real GDP by state \n", "2696 Washington 53000 8 Per capita real GDP by state \n", "2697 West Virginia 54000 5 Per capita real GDP by state \n", "2698 Wisconsin 55000 3 Per capita real GDP by state \n", "2699 Wyoming 56000 7 Per capita real GDP by state \n", "\n", " Description variable value \n", "0 All industry total 1997 104218 \n", "1 Private industries 1997 87014 \n", "2 All industry total 1997 25446 \n", "3 Private industries 1997 20284 \n", "4 All industry total 1997 132708 \n", "... ... ... ... \n", "2695 All industry total 2014 50861 \n", "2696 All industry total 2014 54773 \n", "2697 All industry total 2014 36234 \n", "2698 All industry total 2014 46469 \n", "2699 All industry total 2014 61639 \n", "\n", "[2700 rows x 7 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = pd.melt(gsp_clean, id_vars = ['GeoName', 'GeoFIPS', 'Region',\n", " 'ComponentName', 'Description'], \n", " value_vars = [str(i) for i in range(1997,2015)])\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the old column names (1997 through 2015 in this case) will always by stored in a column named \"variable\" after melting, and the datapoints that populated those columns will be contained in a column named \"value\". \n", "\n", "Before moving on, let's rename some of the columns to avoid confusion in the next step:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateFIPSRegionComponentNameDescriptionYearvalue
0Alabama010005Gross domestic product (GDP) by stateAll industry total1997104218
1Alabama010005Gross domestic product (GDP) by statePrivate industries199787014
2Alaska020008Gross domestic product (GDP) by stateAll industry total199725446
3Alaska020008Gross domestic product (GDP) by statePrivate industries199720284
4Arizona040006Gross domestic product (GDP) by stateAll industry total1997132708
........................
2695Virginia510005Per capita real GDP by stateAll industry total201450861
2696Washington530008Per capita real GDP by stateAll industry total201454773
2697West Virginia540005Per capita real GDP by stateAll industry total201436234
2698Wisconsin550003Per capita real GDP by stateAll industry total201446469
2699Wyoming560007Per capita real GDP by stateAll industry total201461639
\n", "

2700 rows × 7 columns

\n", "
" ], "text/plain": [ " State FIPS Region ComponentName \\\n", "0 Alabama 01000 5 Gross domestic product (GDP) by state \n", "1 Alabama 01000 5 Gross domestic product (GDP) by state \n", "2 Alaska 02000 8 Gross domestic product (GDP) by state \n", "3 Alaska 02000 8 Gross domestic product (GDP) by state \n", "4 Arizona 04000 6 Gross domestic product (GDP) by state \n", "... ... ... ... ... \n", "2695 Virginia 51000 5 Per capita real GDP by state \n", "2696 Washington 53000 8 Per capita real GDP by state \n", "2697 West Virginia 54000 5 Per capita real GDP by state \n", "2698 Wisconsin 55000 3 Per capita real GDP by state \n", "2699 Wyoming 56000 7 Per capita real GDP by state \n", "\n", " Description Year value \n", "0 All industry total 1997 104218 \n", "1 Private industries 1997 87014 \n", "2 All industry total 1997 25446 \n", "3 Private industries 1997 20284 \n", "4 All industry total 1997 132708 \n", "... ... ... ... \n", "2695 All industry total 2014 50861 \n", "2696 All industry total 2014 54773 \n", "2697 All industry total 2014 36234 \n", "2698 All industry total 2014 46469 \n", "2699 All industry total 2014 61639 \n", "\n", "[2700 rows x 7 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp_clean.rename({'GeoName':'State',\n", " 'GeoFIPS':'FIPS',\n", " 'variable':'Year'}, axis=1)\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pivoting: Turning Rows to Columns\n", "The rows contain features that we want to move to the columns. Specifically, we need separate columns for the different combinations of `ComponentName` and `Description` since the feature names are contained in two columns, so we can combine these two columns by concatenating them together and creating a new column called \"feature\". Then we can drop `ComponentName` and `Description`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateFIPSRegionYearvaluefeature
0Alabama0100051997104218Gross domestic product (GDP) by stateAll indus...
1Alabama010005199787014Gross domestic product (GDP) by statePrivate i...
2Alaska020008199725446Gross domestic product (GDP) by stateAll indus...
3Alaska020008199720284Gross domestic product (GDP) by statePrivate i...
4Arizona0400061997132708Gross domestic product (GDP) by stateAll indus...
.....................
2695Virginia510005201450861Per capita real GDP by stateAll industry total
2696Washington530008201454773Per capita real GDP by stateAll industry total
2697West Virginia540005201436234Per capita real GDP by stateAll industry total
2698Wisconsin550003201446469Per capita real GDP by stateAll industry total
2699Wyoming560007201461639Per capita real GDP by stateAll industry total
\n", "

2700 rows × 6 columns

\n", "
" ], "text/plain": [ " State FIPS Region Year value \\\n", "0 Alabama 01000 5 1997 104218 \n", "1 Alabama 01000 5 1997 87014 \n", "2 Alaska 02000 8 1997 25446 \n", "3 Alaska 02000 8 1997 20284 \n", "4 Arizona 04000 6 1997 132708 \n", "... ... ... ... ... ... \n", "2695 Virginia 51000 5 2014 50861 \n", "2696 Washington 53000 8 2014 54773 \n", "2697 West Virginia 54000 5 2014 36234 \n", "2698 Wisconsin 55000 3 2014 46469 \n", "2699 Wyoming 56000 7 2014 61639 \n", "\n", " feature \n", "0 Gross domestic product (GDP) by stateAll indus... \n", "1 Gross domestic product (GDP) by statePrivate i... \n", "2 Gross domestic product (GDP) by stateAll indus... \n", "3 Gross domestic product (GDP) by statePrivate i... \n", "4 Gross domestic product (GDP) by stateAll indus... \n", "... ... \n", "2695 Per capita real GDP by stateAll industry total \n", "2696 Per capita real GDP by stateAll industry total \n", "2697 Per capita real GDP by stateAll industry total \n", "2698 Per capita real GDP by stateAll industry total \n", "2699 Per capita real GDP by stateAll industry total \n", "\n", "[2700 rows x 6 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp_clean.assign(feature = gsp_clean['ComponentName'] + gsp_clean['Description'])\n", "gsp_clean = gsp_clean.drop(['ComponentName', 'Description'], axis=1)\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To move these features from the rows to columns, apply the `.pivot_table()` method to a dataframe:\n", "```\n", "df.pivot_table(index, columns, values)\n", "```\n", "To understand this method, pay close attention to its parameters:\n", "\n", "1. `index` - a list containing the names of the current columns that we want to remain columns in the reshaped data. These include the primary keys (`State` and `Year` in this example), synonyms for a key (`FIPS`), and features that vary only by ID (`Region`).\n", "\n", "2. `columns` - the name of the column that contains the **names** of the new columns we are trying to create.\n", "\n", "3. `values` - the name of the column that contains the **datapoints** we are trying to move to the new columns.\n", "\n", "There is one important issue to keep in mind when using the `.pivot_table()` method. `.pivot_table()` contains default behavior to handle cases in which the keys in `index` do not uniquely identify the rows of a dataframe. By default, `.pivot_table()` takes the mean within group. If the column specified within `values` is non-numeric, however, the mean is not defined. So first make sure that any columns that are numeric are recognized as a numeric data type. If a column contains text, categories, or other non-numeric data, specify `aggfunc='first'` inside `.pivot_table()`. Writing `aggfunc='first'` tells `.pivot_table()` to use the first cell within a group instead of calculating the mean within a group, and the first cell is defined for non-numeric columns. If the columns in `index` do uniquely identify rows, there is only one cell per group anyway, so it is fine to take the first cell.\n", "\n", "Presently within `gsp_clean` the datapoints are contained within the `value` column. 
Unfortunately, `value` is not currently understood to be numeric:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "State object\n", "FIPS object\n", "Region object\n", "Year object\n", "value object\n", "feature object\n", "dtype: object" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following our code from module 8, we convert value to a numeric class:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "State object\n", "FIPS object\n", "Region object\n", "Year object\n", "value float64\n", "feature object\n", "dtype: object" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean.value = gsp_clean.value.astype('float')\n", "gsp_clean.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Now we can reshape the data with `.pivot_table()`:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
featureGross domestic product (GDP) by stateAll industry totalGross domestic product (GDP) by statePrivate industriesPer capita real GDP by stateAll industry total
StateFIPSRegionYear
Alabama01000519971042188701431398
19981094149150632164
19991150159628433106
20001190209966533284
200112282210297833312
.....................
Wyoming5600072010391033383264602
2011414993616463981
2012402013460460744
2013409793509660743
2014420213594761639
\n", "

900 rows × 3 columns

\n", "
" ], "text/plain": [ "feature Gross domestic product (GDP) by stateAll industry total \\\n", "State FIPS Region Year \n", "Alabama 01000 5 1997 104218 \n", " 1998 109414 \n", " 1999 115015 \n", " 2000 119020 \n", " 2001 122822 \n", "... ... \n", "Wyoming 56000 7 2010 39103 \n", " 2011 41499 \n", " 2012 40201 \n", " 2013 40979 \n", " 2014 42021 \n", "\n", "feature Gross domestic product (GDP) by statePrivate industries \\\n", "State FIPS Region Year \n", "Alabama 01000 5 1997 87014 \n", " 1998 91506 \n", " 1999 96284 \n", " 2000 99665 \n", " 2001 102978 \n", "... ... \n", "Wyoming 56000 7 2010 33832 \n", " 2011 36164 \n", " 2012 34604 \n", " 2013 35096 \n", " 2014 35947 \n", "\n", "feature Per capita real GDP by stateAll industry total \n", "State FIPS Region Year \n", "Alabama 01000 5 1997 31398 \n", " 1998 32164 \n", " 1999 33106 \n", " 2000 33284 \n", " 2001 33312 \n", "... ... \n", "Wyoming 56000 7 2010 64602 \n", " 2011 63981 \n", " 2012 60744 \n", " 2013 60743 \n", " 2014 61639 \n", "\n", "[900 rows x 3 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean['value'] = gsp_clean['value'].astype(int) # first convert this column to int \n", "gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'], \n", " columns='feature', \n", " values='value')\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, if we choose to leave the `value` column as an `object` data type, we can use this code instead to perform the wide-to-long reshape:\n", "```\n", "gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'], \n", " columns='feature', \n", " values='value', \n", " aggfunc='first')\n", "```\n", "Note that the dataframe is not formated in a flat way, because the IDs and related features are stored in the row-indices instead of in columns. To convert the dataframe back to a standard format, use the `.to_records()` method within the `pd.DataFrame()` function:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateFIPSRegionYearGross domestic product (GDP) by stateAll industry totalGross domestic product (GDP) by statePrivate industriesPer capita real GDP by stateAll industry total
0Alabama01000519971042188701431398
1Alabama01000519981094149150632164
2Alabama01000519991150159628433106
3Alabama01000520001190209966533284
4Alabama010005200112282210297833312
........................
895Wyoming5600072010391033383264602
896Wyoming5600072011414993616463981
897Wyoming5600072012402013460460744
898Wyoming5600072013409793509660743
899Wyoming5600072014420213594761639
\n", "

900 rows × 7 columns

\n", "
" ], "text/plain": [ " State FIPS Region Year \\\n", "0 Alabama 01000 5 1997 \n", "1 Alabama 01000 5 1998 \n", "2 Alabama 01000 5 1999 \n", "3 Alabama 01000 5 2000 \n", "4 Alabama 01000 5 2001 \n", ".. ... ... ... ... \n", "895 Wyoming 56000 7 2010 \n", "896 Wyoming 56000 7 2011 \n", "897 Wyoming 56000 7 2012 \n", "898 Wyoming 56000 7 2013 \n", "899 Wyoming 56000 7 2014 \n", "\n", " Gross domestic product (GDP) by stateAll industry total \\\n", "0 104218 \n", "1 109414 \n", "2 115015 \n", "3 119020 \n", "4 122822 \n", ".. ... \n", "895 39103 \n", "896 41499 \n", "897 40201 \n", "898 40979 \n", "899 42021 \n", "\n", " Gross domestic product (GDP) by statePrivate industries \\\n", "0 87014 \n", "1 91506 \n", "2 96284 \n", "3 99665 \n", "4 102978 \n", ".. ... \n", "895 33832 \n", "896 36164 \n", "897 34604 \n", "898 35096 \n", "899 35947 \n", "\n", " Per capita real GDP by stateAll industry total \n", "0 31398 \n", "1 32164 \n", "2 33106 \n", "3 33284 \n", "4 33312 \n", ".. ... \n", "895 64602 \n", "896 63981 \n", "897 60744 \n", "898 60743 \n", "899 61639 \n", "\n", "[900 rows x 7 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = pd.DataFrame(gsp_clean.to_records())\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All that's left to clean the data is to rename the columns and to create a new column containing the percent of the state's GDP that comes from private industry:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateFIPSRegionYearGDPGDPprivateGDPpcpercent_private
0Alabama0100051997104218870143139883.492295
1Alabama0100051998109414915063216483.632808
2Alabama0100051999115015962843310683.714298
3Alabama0100052000119020996653328483.738027
4Alabama01000520011228221029783331283.843285
...........................
895Wyoming560007201039103338326460286.520216
896Wyoming560007201141499361646398187.144269
897Wyoming560007201240201346046074486.077461
898Wyoming560007201340979350966074385.643866
899Wyoming560007201442021359476163985.545323
\n", "

900 rows × 8 columns

\n", "
" ], "text/plain": [ " State FIPS Region Year GDP GDPprivate GDPpc percent_private\n", "0 Alabama 01000 5 1997 104218 87014 31398 83.492295\n", "1 Alabama 01000 5 1998 109414 91506 32164 83.632808\n", "2 Alabama 01000 5 1999 115015 96284 33106 83.714298\n", "3 Alabama 01000 5 2000 119020 99665 33284 83.738027\n", "4 Alabama 01000 5 2001 122822 102978 33312 83.843285\n", ".. ... ... ... ... ... ... ... ...\n", "895 Wyoming 56000 7 2010 39103 33832 64602 86.520216\n", "896 Wyoming 56000 7 2011 41499 36164 63981 87.144269\n", "897 Wyoming 56000 7 2012 40201 34604 60744 86.077461\n", "898 Wyoming 56000 7 2013 40979 35096 60743 85.643866\n", "899 Wyoming 56000 7 2014 42021 35947 61639 85.545323\n", "\n", "[900 rows x 8 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gsp_clean = gsp_clean.rename({'Gross domestic product (GDP) by stateAll industry total':'GDP',\n", " 'Gross domestic product (GDP) by statePrivate industries':'GDPprivate',\n", " 'Per capita real GDP by stateAll industry total':'GDPpc'}, \n", " axis=1)\n", "gsp_clean = gsp_clean.assign(percent_private = 100* gsp_clean.GDPprivate / gsp_clean.GDP)\n", "gsp_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with Strings (Example: the 2019 ANES Pilot Study)\n", "In module 8, we worked with the 2019 American National Election Study pilot survey and saved the cleaned data in a separate CSV file:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
caseidliveurbanvote16protestvotemost_important_issueconfeconideologypartyIDuniversal_income...partisanshipftbiden_levelageage2ftbiden_floatftbiden_catftbiden_strprefersbidenworried_econfavor_both
01SuburbSomeone elseFalseJoe BidenHealth CareA little worriedConservativeDemocratFavor a moderate amount...5.0neutral51260152.052.052.0TrueFalseTrue
12SuburbDonald TrumpFalseDonald TrumpWorking togetherA little worriedConservativeRepublicanOppose a moderate amount...0.0neutral78608441.041.041.0FalseFalseFalse
23RuralHillary ClintonFalseJoe Bidenhealth careExtremely worriedModerateDemocratNeither favor nor oppose...88.0like66435688.088.088.0TrueTrueFalse
34CityHillary ClintonFalseDonald TrumpThe economy.A little worriedModerateDemocratNeither favor nor oppose...-100.0dislike4116810.00.00.0FalseFalseFalse
45CityDonald TrumpFalseDonald TrumpChinaNot at all worriedConservativeRepublicanOppose a great deal...-69.0dislike80640025.025.025.0FalseFalseFalse
..................................................................
31603161TownDonald TrumpFalseDonald TrumpThe infiltration of Marxists into the institut...A little worriedConservativeRepublicanOppose a great deal...-74.0dislike7251847.07.07.0FalseFalseFalse
31613162CitySomeone elseFalseSomeone elseLack of basic resources being provided and off...Extremely worriedNaNDemocratFavor a great deal...-10.0dislike2457625.025.025.0FalseTrueTrue
31623163CityDid not voteFalseProbably will not votedonald trumpVery worriedLiberalIndependentOppose a great deal...44.0neutral40160050.050.050.0TrueFalseFalse
31633164SuburbDid not voteFalseJoe BidenDonald TrumpModerately worriedLiberalDemocratFavor a moderate amount...94.0like60360095.095.095.0TrueTrueFalse
31643165TownHillary ClintonFalseJoe BidentrumpExtremely worriedModerateDemocratOppose a great deal...70.0neutral60360070.070.070.0TrueTrueFalse
\n", "

3165 rows × 61 columns

\n", "
" ], "text/plain": [ " caseid liveurban vote16 protest vote \\\n", "0 1 Suburb Someone else False Joe Biden \n", "1 2 Suburb Donald Trump False Donald Trump \n", "2 3 Rural Hillary Clinton False Joe Biden \n", "3 4 City Hillary Clinton False Donald Trump \n", "4 5 City Donald Trump False Donald Trump \n", "... ... ... ... ... ... \n", "3160 3161 Town Donald Trump False Donald Trump \n", "3161 3162 City Someone else False Someone else \n", "3162 3163 City Did not vote False Probably will not vote \n", "3163 3164 Suburb Did not vote False Joe Biden \n", "3164 3165 Town Hillary Clinton False Joe Biden \n", "\n", " most_important_issue confecon \\\n", "0 Health Care A little worried \n", "1 Working together A little worried \n", "2 health care Extremely worried \n", "3 The economy. A little worried \n", "4 China Not at all worried \n", "... ... ... \n", "3160 The infiltration of Marxists into the institut... A little worried \n", "3161 Lack of basic resources being provided and off... Extremely worried \n", "3162 donald trump Very worried \n", "3163 Donald Trump Moderately worried \n", "3164 trump Extremely worried \n", "\n", " ideology partyID universal_income ... partisanship \\\n", "0 Conservative Democrat Favor a moderate amount ... 5.0 \n", "1 Conservative Republican Oppose a moderate amount ... 0.0 \n", "2 Moderate Democrat Neither favor nor oppose ... 88.0 \n", "3 Moderate Democrat Neither favor nor oppose ... -100.0 \n", "4 Conservative Republican Oppose a great deal ... -69.0 \n", "... ... ... ... ... ... \n", "3160 Conservative Republican Oppose a great deal ... -74.0 \n", "3161 NaN Democrat Favor a great deal ... -10.0 \n", "3162 Liberal Independent Oppose a great deal ... 44.0 \n", "3163 Liberal Democrat Favor a moderate amount ... 94.0 \n", "3164 Moderate Democrat Oppose a great deal ... 70.0 \n", "\n", " ftbiden_level age age2 ftbiden_float ftbiden_cat ftbiden_str \\\n", "0 neutral 51 2601 52.0 52.0 52.0 \n", "1 neutral 78 6084 41.0 41.0 41.0 \n", "2 like 66 4356 88.0 88.0 88.0 \n", "3 dislike 41 1681 0.0 0.0 0.0 \n", "4 dislike 80 6400 25.0 25.0 25.0 \n", "... ... .. ... ... ... ... \n", "3160 dislike 72 5184 7.0 7.0 7.0 \n", "3161 dislike 24 576 25.0 25.0 25.0 \n", "3162 neutral 40 1600 50.0 50.0 50.0 \n", "3163 like 60 3600 95.0 95.0 95.0 \n", "3164 neutral 60 3600 70.0 70.0 70.0 \n", "\n", " prefersbiden worried_econ favor_both \n", "0 True False True \n", "1 False False False \n", "2 True True False \n", "3 False False False \n", "4 False False False \n", "... ... ... ... \n", "3160 False False False \n", "3161 False True True \n", "3162 True False False \n", "3163 True True False \n", "3164 True True False \n", "\n", "[3165 rows x 61 columns]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes = pd.read_csv(\"https://github.com/jkropko/DS-6001/raw/master/localdata/anes_clean.csv\")\n", "anes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data contain a column `most_important_issue` which records individuals' responses to the question of what they feel is the most important issue facing the United States as of December 2019. These responses are in the respondents' own words. 
For example, one response is" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The influence of big money on our political system both domestic and foreign'" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue[5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These responses are strings, and we can employ the suite of string operations to manipulate the data in this column. String methods are collected under the `.str` accessor in `pandas`, and we will have to call the `.str` attribute in order to use string methods.\n", "\n", "One important technique is to alter the case within the text so that we can identify words without worrying about case sensitivity. To turn all responses to lowercase, use `.str.lower()`:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                             health care\n", "1                                        working together\n", "2                                             health care\n", "3                                            the economy.\n", "4                                                   china\n", "                              ...                        \n", "3160    the infiltration of marxists into the institut...\n", "3161    lack of basic resources being provided and off...\n", "3162                                         donald trump\n", "3163                                         donald trump\n", "3164                                                trump\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And to turn all the text to uppercase, use `.str.upper()`:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                             HEALTH CARE\n", "1                                        WORKING TOGETHER\n", "2                                             HEALTH CARE\n", "3                                            THE ECONOMY.\n", "4                                                   CHINA\n", "                              ...                        \n", "3160    THE INFILTRATION OF MARXISTS INTO THE INSTITUT...\n", "3161    LACK OF BASIC RESOURCES BEING PROVIDED AND OFF...\n", "3162                                         DONALD TRUMP\n", "3163                                         DONALD TRUMP\n", "3164                                                TRUMP\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.upper()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One issue that can inhibit our ability to search through text is the existence of leading and trailing whitespace in the responses. White space can exist, invisibly, for several reasons. It is possible that the data authors included white space in order to standardize the number of spaces that the column takes up in the data file. To remove both leading and trailing white spaces, use `.str.strip()`:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "anes.most_important_issue = anes.most_important_issue.str.strip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.replace()` method finds and replaces specific whole entries with different strings. For example, we can replace every response that reads \"health care\" with \"hospital stuff\":" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                          hospital stuff\n", "1                                        working together\n", "2                                          hospital stuff\n", "3                                            the economy.\n", "4                                                   china\n", "                              ...                        \n", "3160    the infiltration of marxists into the institut...\n", "3161    lack of basic resources being provided and off...\n", "3162                                         donald trump\n", "3163                                         donald trump\n", "3164                                                trump\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower().replace('health care', 'hospital stuff')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to replace pieces of these strings, we can specify the `regex=True` argument. The following line of code converts the strings to lowercase and then replaces \"trump\" with \"twimp\". With `regex=True`, the method replaces the pattern \"trump\" anywhere it appears, as opposed to the default `regex=False`, which only replaces entire entries that are exactly \"trump\":" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                             health care\n", "1                                        working together\n", "2                                             health care\n", "3                                            the economy.\n", "4                                                   china\n", "                              ...                        \n", "3160    the infiltration of marxists into the institut...\n", "3161    lack of basic resources being provided and off...\n", "3162                                         donald twimp\n", "3163                                         donald twimp\n", "3164                                                twimp\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower().replace('trump', 'twimp', regex=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a new column that identifies when particular words are used, use either the `.str.match()` or `.str.contains()` method. `.str.match()` is true if the beginning of the entry matches the provided pattern, " ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0       False\n", "1       False\n", "2       False\n", "3       False\n", "4       False\n", "        ...  \n", "3160    False\n", "3161    False\n", "3162    False\n", "3163    False\n", "3164     True\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower().str.match('trump')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and `.str.contains()` is true if the provided string exists anywhere within the entry. " ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0       False\n", "1       False\n", "2       False\n", "3       False\n", "4       False\n", "        ...  \n", "3160    False\n", "3161    False\n", "3162     True\n", "3163     True\n", "3164     True\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.lower().str.contains('trump')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These columns of logical values can be placed into the data to filter or for other purposes. For example, we can see the number of Democrats, Republicans, and independents who use the word \"trump\" in their statement of the country's most important problem:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
problem_trumpFalseTrue
partyID
Democrat1030293
Independent47947
Republican112273
\n", "
" ], "text/plain": [ "problem_trump False True \n", "partyID \n", "Democrat 1030 293\n", "Independent 479 47\n", "Republican 1122 73" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes['problem_trump'] = anes.most_important_issue.str.lower().str.contains('trump')\n", "pd.DataFrame(anes.groupby(['partyID', 'problem_trump']).size().unstack())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.str.replace()`, `.str.match()`, and `.str.contains()` methods can also accept regular expressions for identifying additional patterns within the string. Regular expressions are beyond our scope in this document, but here is a [good discussion of regular expressions in Python](https://docs.python.org/2/howto/regex.html).\n", "\n", "To calculate the length of each response in terms of the number of characters, use `.str.len()`. In this case, it is important to make sure the whitespace is removed first with `.str.strip()`, otherwise the spaces will count towards the length: " ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 11.0\n", "1 16.0\n", "2 11.0\n", "3 12.0\n", "4 5.0\n", " ... \n", "3160 254.0\n", "3161 325.0\n", "3162 12.0\n", "3163 12.0\n", "3164 5.0\n", "Name: most_important_issue, Length: 3165, dtype: float64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.len()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the length to display, for example, the single longest response in the data. To display the whole string, turn off the display limit by typing `pd.options.display.max_colwidth = None`. Here I set the limit to 500, because it sort of goes off the rails and I'd rather not display it:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "819 Immigration has been and has become a very big problem. The gangs, drug wars, the crime has climbed in every single town, county, state in this country and I believe it all revolves around too many different races, too many prejudices. Most people I talk to dislike foreigners and most foreigners hate Americans. We ask, why come here then? Because they get financial aid, housing, medical, dental aid and compared to the wars, poverty and the way they lived in their own country, America is heav...\n", "Name: most_important_issue, dtype: object" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes['length'] = anes.most_important_issue.str.len()\n", "pd.options.display.max_colwidth = 500\n", "anes.sort_values(by = 'length', ascending = False).most_important_issue.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After all that, we set the string display limit back to the default of 50 characters:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "pd.options.display.max_colwidth = 50" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In some cases it might be necessary to split entries into different features according to a delimiter or pattern. 
It doesn't make sense to split the `most_important_issue` column in this example, but as an illustration, we can split the responses on periods:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0                                           [Health Care]\n", "1                                      [Working together]\n", "2                                           [health care]\n", "3                                         [The economy, ]\n", "4                                                 [China]\n", "                              ...                        \n", "3160    [The infiltration of Marxists into the institu...\n", "3161    [Lack of basic resources being provided and of...\n", "3162                                       [donald trump]\n", "3163                                       [Donald Trump]\n", "3164                                              [trump]\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.split('.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `expand=True` argument displays these split strings in different columns in a dataframe:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...19202122232425262728
0Health CareNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
1Working togetherNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
2health careNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3The economyNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
4ChinaNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
..................................................................
3160The infiltration of Marxists into the institut...Money seems to be no object to themTheir over promising will lead to eventual chaosUsually that leads to the over promised event...NoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3161Lack of basic resources being provided and off...This is a first world nation, a nation claimi...yet we don't have housing or healthcare secur...That needs to change and it needs to be chang...A basic standard of living needs to be guaran...NoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3162donald trumpNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3163Donald TrumpNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3164trumpNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
\n", "

3165 rows × 29 columns

\n", "
" ], "text/plain": [ " 0 \\\n", "0 Health Care \n", "1 Working together \n", "2 health care \n", "3 The economy \n", "4 China \n", "... ... \n", "3160 The infiltration of Marxists into the institut... \n", "3161 Lack of basic resources being provided and off... \n", "3162 donald trump \n", "3163 Donald Trump \n", "3164 trump \n", "\n", " 1 \\\n", "0 None \n", "1 None \n", "2 None \n", "3 \n", "4 None \n", "... ... \n", "3160 Money seems to be no object to them \n", "3161 This is a first world nation, a nation claimi... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", " 2 \\\n", "0 None \n", "1 None \n", "2 None \n", "3 None \n", "4 None \n", "... ... \n", "3160 Their over promising will lead to eventual chaos \n", "3161 yet we don't have housing or healthcare secur... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", " 3 \\\n", "0 None \n", "1 None \n", "2 None \n", "3 None \n", "4 None \n", "... ... \n", "3160 Usually that leads to the over promised event... \n", "3161 That needs to change and it needs to be chang... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", " 4 5 6 7 \\\n", "0 None None None None \n", "1 None None None None \n", "2 None None None None \n", "3 None None None None \n", "4 None None None None \n", "... ... ... ... ... \n", "3160 None None None \n", "3161 A basic standard of living needs to be guaran... None None \n", "3162 None None None None \n", "3163 None None None None \n", "3164 None None None None \n", "\n", " 8 9 ... 19 20 21 22 23 24 25 26 27 \\\n", "0 None None ... None None None None None None None None None \n", "1 None None ... None None None None None None None None None \n", "2 None None ... None None None None None None None None None \n", "3 None None ... None None None None None None None None None \n", "4 None None ... None None None None None None None None None \n", "... ... ... ... ... ... ... ... ... ... ... ... ... \n", "3160 None None ... None None None None None None None None None \n", "3161 None None ... None None None None None None None None None \n", "3162 None None ... None None None None None None None None None \n", "3163 None None ... None None None None None None None None None \n", "3164 None None ... None None None None None None None None None \n", "\n", " 28 \n", "0 None \n", "1 None \n", "2 None \n", "3 None \n", "4 None \n", "... ... \n", "3160 None \n", "3161 None \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", "[3165 rows x 29 columns]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.split('.', expand=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing `n=1` only splits on the first occurrence of the delimiter, `n=2` splits on the first and second occurrence, and so on:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0Health CareNoneNone
1Working togetherNoneNone
2health careNoneNone
3The economyNone
4ChinaNoneNone
............
3160The infiltration of Marxists into the institut...Money seems to be no object to themTheir over promising will lead to eventual ch...
3161Lack of basic resources being provided and off...This is a first world nation, a nation claimi...yet we don't have housing or healthcare secur...
3162donald trumpNoneNone
3163Donald TrumpNoneNone
3164trumpNoneNone
\n", "

3165 rows × 3 columns

\n", "
" ], "text/plain": [ " 0 \\\n", "0 Health Care \n", "1 Working together \n", "2 health care \n", "3 The economy \n", "4 China \n", "... ... \n", "3160 The infiltration of Marxists into the institut... \n", "3161 Lack of basic resources being provided and off... \n", "3162 donald trump \n", "3163 Donald Trump \n", "3164 trump \n", "\n", " 1 \\\n", "0 None \n", "1 None \n", "2 None \n", "3 \n", "4 None \n", "... ... \n", "3160 Money seems to be no object to them \n", "3161 This is a first world nation, a nation claimi... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", " 2 \n", "0 None \n", "1 None \n", "2 None \n", "3 None \n", "4 None \n", "... ... \n", "3160 Their over promising will lead to eventual ch... \n", "3161 yet we don't have housing or healthcare secur... \n", "3162 None \n", "3163 None \n", "3164 None \n", "\n", "[3165 rows x 3 columns]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str.split('.', expand=True, n=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An individual string can be indexed according to character number. These indices can be applied to every entry in a string column as well. To pull just the first five characters out of each response in `most_important_issue`, we can type:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Healt\n", "1 Worki\n", "2 healt\n", "3 The e\n", "4 China\n", " ... \n", "3160 The i\n", "3161 Lack \n", "3162 donal\n", "3163 Donal\n", "3164 trump\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To pull characters 6 through 10, we type" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Care\n", "1 g tog\n", "2 care\n", "3 onomy\n", "4 \n", " ... \n", "3160 filtr\n", "3161 f bas\n", "3162 trum\n", "3163 Trum\n", "3164 \n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str[6:11]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To pull the last four characters, use a negative number to begin the range, and leave the end of the range blank:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Care\n", "1 ther\n", "2 care\n", "3 omy.\n", "4 hina\n", " ... \n", "3160 ers.\n", "3161 eed.\n", "3162 rump\n", "3163 rump\n", "3164 rump\n", "Name: most_important_issue, Length: 3165, dtype: object" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anes.most_important_issue.str[-4:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with Dates and Times (Example: Twitter)\n", "In module 4 we discussed Twitter as an example of how to use an API in Python. 
We save the four credentials we need to access the API (the consumer key, the consumer secret, the access token, and the access token secret) in a .env file, and we load those keys into Python without displaying them in our notebooks by using the `dotenv` package:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "import dotenv\n", "import os\n", "os.chdir('/Users/jk8sd/Box Sync/Practice and Applications 1 online/Module 9 - Data Managenent in pandas Part 2')\n", "dotenv.load_dotenv()\n", "ConsumerKey = os.getenv('ConsumerKey')\n", "ConsumerSecret = os.getenv('ConsumerSecret')\n", "AccessToken = os.getenv('AccessToken')\n", "AccessTokenSecret = os.getenv('AccessTokenSecret')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we use the `tweepy` package to work with the Twitter API and we create a Twitter cursor with our credentials:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "import tweepy\n", "auth = tweepy.OAuthHandler(ConsumerKey, ConsumerSecret)\n", "auth.set_access_token(AccessToken, AccessTokenSecret)\n", "api = tweepy.API(auth)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code extracts 1000 tweets that contain the hashtag \"#uva\":" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atuser
0@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing
1Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis
2RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline1
3Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos247
4Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes247
............
995RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari...2020-06-23 15:12:08norahlfy
996#UVA: 54,87. Sube 0,20% diario, 1,22% en junio...2020-06-23 15:08:40AgNPalabras
997Zur Einstimmung #UVA vom letzten #gig bei uns ...2020-06-23 14:52:23WeAppU
998RT @JLuis_Sommelier: #Mosto: jugo obtenido de ...2020-06-23 14:49:22VirialexViri
999Good question... My instinct says Ty Jerome, b...2020-06-23 14:32:44The_Superhoo
\n", "

1000 rows × 3 columns

\n", "
" ], "text/plain": [ " text created_at \\\n", "0 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 \n", "1 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 \n", "2 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... 2020-06-30 01:55:02 \n", "3 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 \n", "4 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 \n", ".. ... ... \n", "995 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... 2020-06-23 15:12:08 \n", "996 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... 2020-06-23 15:08:40 \n", "997 Zur Einstimmung #UVA vom letzten #gig bei uns ... 2020-06-23 14:52:23 \n", "998 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... 2020-06-23 14:49:22 \n", "999 Good question... My instinct says Ty Jerome, b... 2020-06-23 14:32:44 \n", "\n", " user \n", "0 SweetLickKing \n", "1 Daniel_B_Ennis \n", "2 blimeyonline1 \n", "3 Wahoos247 \n", "4 JamieOakes247 \n", ".. ... \n", "995 norahlfy \n", "996 AgNPalabras \n", "997 WeAppU \n", "998 VirialexViri \n", "999 The_Superhoo \n", "\n", "[1000 rows x 3 columns]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "msgs = []\n", "msg =[]\n", "\n", "for tweet in tweepy.Cursor(api.search, q='#uva').items(1000):\n", " msg = [tweet.text, tweet.created_at, tweet.user.screen_name] \n", " msg = tuple(msg) \n", " msgs.append(msg)\n", "\n", "tweets = pd.DataFrame(msgs, columns = ['text', 'created_at', 'user'])\n", "tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting Year, Month, Day, and Time from a Timestamp Column\n", "Note that the data contain timestamps that include the date and time, to the second, of each tweet. The `created_at` column has a `datetime64` data type:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text object\n", "created_at datetime64[ns]\n", "user object\n", "dtype: object" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference between a `datetime64` data type and an `object` or numeric type is that we have the ability to extract individual elements of time from a `datetime64` value. To pull out the year, month, day, hour, minute, and second of the timestamp, we use the following attributes:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[6, 30, 2020, 2, 20, 43]" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[tweets.created_at[0].month, tweets.created_at[0].day, \n", " tweets.created_at[0].year, tweets.created_at[0].hour, \n", " tweets.created_at[0].minute, tweets.created_at[0].second]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest way to create new columns with these elements for all the values in the dataframe is to use comprehension loops:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecond
0@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing630202022043
1Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis630202021817
2RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline163020201552
3Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos24763020201400
4Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes247630202013942
..............................
995RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari...2020-06-23 15:12:08norahlfy623202015128
996#UVA: 54,87. Sube 0,20% diario, 1,22% en junio...2020-06-23 15:08:40AgNPalabras623202015840
997Zur Einstimmung #UVA vom letzten #gig bei uns ...2020-06-23 14:52:23WeAppU6232020145223
998RT @JLuis_Sommelier: #Mosto: jugo obtenido de ...2020-06-23 14:49:22VirialexViri6232020144922
999Good question... My instinct says Ty Jerome, b...2020-06-23 14:32:44The_Superhoo6232020143244
\n", "

1000 rows × 9 columns

\n", "
" ], "text/plain": [ " text created_at \\\n", "0 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 \n", "1 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 \n", "2 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... 2020-06-30 01:55:02 \n", "3 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 \n", "4 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 \n", ".. ... ... \n", "995 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... 2020-06-23 15:12:08 \n", "996 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... 2020-06-23 15:08:40 \n", "997 Zur Einstimmung #UVA vom letzten #gig bei uns ... 2020-06-23 14:52:23 \n", "998 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... 2020-06-23 14:49:22 \n", "999 Good question... My instinct says Ty Jerome, b... 2020-06-23 14:32:44 \n", "\n", " user month day year hour minute second \n", "0 SweetLickKing 6 30 2020 2 20 43 \n", "1 Daniel_B_Ennis 6 30 2020 2 18 17 \n", "2 blimeyonline1 6 30 2020 1 55 2 \n", "3 Wahoos247 6 30 2020 1 40 0 \n", "4 JamieOakes247 6 30 2020 1 39 42 \n", ".. ... ... ... ... ... ... ... \n", "995 norahlfy 6 23 2020 15 12 8 \n", "996 AgNPalabras 6 23 2020 15 8 40 \n", "997 WeAppU 6 23 2020 14 52 23 \n", "998 VirialexViri 6 23 2020 14 49 22 \n", "999 The_Superhoo 6 23 2020 14 32 44 \n", "\n", "[1000 rows x 9 columns]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['month'] = [x.month for x in tweets.created_at]\n", "tweets['day'] = [x.day for x in tweets.created_at]\n", "tweets['year'] = [x.year for x in tweets.created_at]\n", "tweets['hour'] = [x.hour for x in tweets.created_at]\n", "tweets['minute'] = [x.minute for x in tweets.created_at]\n", "tweets['second'] = [x.second for x in tweets.created_at]\n", "tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating Timestamps from Separate Month, Day, and Year Columns\n", "Sometimes raw data will come to us with these time elements already placed in separate columns, so that we have a `month`, `day`, and `year` column but no `date` column. There are important advantages to placing all this information in one column in values that Python understands as timestamps: timestamps are easier to filter, manipulate, index, and plot. To create one timestamp from separate columns that contain the month, day, and year (and optionally the hour, minute, and second), use the `pd.to_datatime` function:\n", "```\n", "pd.to_datetime(df)\n", "```\n", "where `df` is a subset of the dataframe that contains the separate columns for the time elements. In order for this function to work, these columns must be named `year`, `month`, and `day`, and optionally `hour`, `minute`, and `second`. If these columns are named something else, first use the `.rename()` method to change the names to the ones listed above.\n", "\n", "To create a timestamp from the individual time element columns, we can type:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecondtimestamp
0@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing6302020220432020-06-30 02:20:43
1Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis6302020218172020-06-30 02:18:17
2RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline1630202015522020-06-30 01:55:02
3Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos247630202014002020-06-30 01:40:00
4Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes2476302020139422020-06-30 01:39:42
.................................
995RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari...2020-06-23 15:12:08norahlfy6232020151282020-06-23 15:12:08
996#UVA: 54,87. Sube 0,20% diario, 1,22% en junio...2020-06-23 15:08:40AgNPalabras6232020158402020-06-23 15:08:40
997Zur Einstimmung #UVA vom letzten #gig bei uns ...2020-06-23 14:52:23WeAppU62320201452232020-06-23 14:52:23
998RT @JLuis_Sommelier: #Mosto: jugo obtenido de ...2020-06-23 14:49:22VirialexViri62320201449222020-06-23 14:49:22
999Good question... My instinct says Ty Jerome, b...2020-06-23 14:32:44The_Superhoo62320201432442020-06-23 14:32:44
\n", "

1000 rows × 10 columns

\n", "
" ], "text/plain": [ " text created_at \\\n", "0 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 \n", "1 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 \n", "2 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... 2020-06-30 01:55:02 \n", "3 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 \n", "4 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 \n", ".. ... ... \n", "995 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... 2020-06-23 15:12:08 \n", "996 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... 2020-06-23 15:08:40 \n", "997 Zur Einstimmung #UVA vom letzten #gig bei uns ... 2020-06-23 14:52:23 \n", "998 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... 2020-06-23 14:49:22 \n", "999 Good question... My instinct says Ty Jerome, b... 2020-06-23 14:32:44 \n", "\n", " user month day year hour minute second \\\n", "0 SweetLickKing 6 30 2020 2 20 43 \n", "1 Daniel_B_Ennis 6 30 2020 2 18 17 \n", "2 blimeyonline1 6 30 2020 1 55 2 \n", "3 Wahoos247 6 30 2020 1 40 0 \n", "4 JamieOakes247 6 30 2020 1 39 42 \n", ".. ... ... ... ... ... ... ... \n", "995 norahlfy 6 23 2020 15 12 8 \n", "996 AgNPalabras 6 23 2020 15 8 40 \n", "997 WeAppU 6 23 2020 14 52 23 \n", "998 VirialexViri 6 23 2020 14 49 22 \n", "999 The_Superhoo 6 23 2020 14 32 44 \n", "\n", " timestamp \n", "0 2020-06-30 02:20:43 \n", "1 2020-06-30 02:18:17 \n", "2 2020-06-30 01:55:02 \n", "3 2020-06-30 01:40:00 \n", "4 2020-06-30 01:39:42 \n", ".. ... \n", "995 2020-06-23 15:12:08 \n", "996 2020-06-23 15:08:40 \n", "997 2020-06-23 14:52:23 \n", "998 2020-06-23 14:49:22 \n", "999 2020-06-23 14:32:44 \n", "\n", "[1000 rows x 10 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['timestamp'] = pd.to_datetime(tweets[['year', 'month', 'day', 'hour', 'minute', 'second']])\n", "tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like `created_at`, the new `timestamp` column is also of the `datetime64` data type:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text object\n", "created_at datetime64[ns]\n", "user object\n", "month int64\n", "day int64\n", "year int64\n", "hour int64\n", "minute int64\n", "second int64\n", "timestamp datetime64[ns]\n", "dtype: object" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting String Columns to Timestamps\n", "Sometimes information about the date and time of an event in the data will be coded as a string. In that case, to use the time series functionality on the dates in a dataframe, a string column that contains the dates and times will need to be converted to the `datetime64` data type. The best way to do that is to use the `pd.to_datetime()` function once again. This time, we pass the column that contains the dates in string format to the `pd.to_datetime()` function. This function can identify the year, month, day, hour, minute, and second from many different formats. 
\n", "\n", "For example, we can write the date Sunday, June 19, 2016 at 8:00 PM (the date and time of game 7 of the 2016 NBA finals, the greatest moment in sports history) in many different ways such as:\n", "\n", "* \"Sunday, June 19, 2016, 8:00 PM\"\n", "* \"6-19-2016 8PM\"\n", "* \"6/19/16 8 p.m.\"\n", "* \"19/6/16 8pm\"\n", "\n", "The `pd.to_datetime()` function reads all of these formats:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2016-06-19 20:00:00')" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"Sunday, June 19, 2016, 8:00 PM\")" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2016-06-19 20:00:00')" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"6-19-2016 8PM\")" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2016-06-19 20:00:00')" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"6/19/16 8 p.m.\")" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2016-06-19 20:00:00')" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"19/6/16 8pm\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pd.to_datetime()` function can feel like a magic trick because it automatically detects where the different elements of the date and time are located in the string. However, it can go wrong, and it is important to check that the date was correctly understood by `pd.to_datetime()`. Suppose for example that the date was June 19, 1945, coded as \"6/19/45\". `pd.to_datetime()` reads the date as 2045:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2045-06-19 00:00:00')" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"6/19/45\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest way to fix this problem would be to break the timestamp up into separate columns for year, month, and day, subtract 100 from the years that were placed in the wrong century, and recombine these columns into a new timestamp with `pd.to_datetime()`. But the real danger here is assuming the `pd.to_datetime()` function worked without confirmation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pd.to_datetime()` function also works on an entire column of dates coded as strings. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The `pd.to_datetime()` function also works on an entire column of dates coded as strings. For example, let's convert the `created_at` column to string:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text object\n", "created_at object\n", "user object\n", "month int64\n", "day int64\n", "year int64\n", "hour int64\n", "minute int64\n", "second int64\n", "timestamp datetime64[ns]\n", "dtype: object" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['created_at'] = tweets.created_at.astype('str')\n", "tweets.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To convert the `created_at` column back to a timestamp, we can apply `pd.to_datetime()`:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text object\n", "created_at datetime64[ns]\n", "user object\n", "month int64\n", "day int64\n", "year int64\n", "hour int64\n", "minute int64\n", "second int64\n", "timestamp datetime64[ns]\n", "dtype: object" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['created_at'] = pd.to_datetime(tweets.created_at)\n", "tweets.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering Rows Based on Date Ranges\n", "If every row in the data has a unique timestamp, then one convenient way to use the times to filter the data is to set the **index** (the row labels) to be equal to the timestamps:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "tweets.index = tweets['created_at']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because `pd.to_datetime()` reads dates and times written in many formats, we can pass any of these formats as a label to the index of `tweets`. For example, to see all the tweets made on June 30, we can type:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecondtimestamp
created_at
2020-06-30 02:20:43@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing6302020220432020-06-30 02:20:43
2020-06-30 02:18:17Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis6302020218172020-06-30 02:18:17
2020-06-30 01:55:02RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline1630202015522020-06-30 01:55:02
2020-06-30 01:40:00Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos247630202014002020-06-30 01:40:00
2020-06-30 01:39:42Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes2476302020139422020-06-30 01:39:42
2020-06-30 01:30:45RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச...2020-06-30 01:30:45sumanebot6302020130452020-06-30 01:30:45
2020-06-30 01:08:08We are🎉excited to🗣announce opening for in-pers...2020-06-30 01:08:08GrandMarc_UVA63020201882020-06-30 01:08:08
2020-06-30 00:20:15RT @cavalierinsider: On Saturday, The Basketba...2020-06-30 00:20:15annefutch6302020020152020-06-30 00:20:15
2020-06-30 00:19:00Anthony Harris part of NFL's top safety tandem...2020-06-30 00:19:00hoosdaily630202001902020-06-30 00:19:00
\n", "
" ], "text/plain": [ " text \\\n", "created_at \n", "2020-06-30 02:20:43 @KymoraJohnson_ let me call @IamTinaThompson s... \n", "2020-06-30 02:18:17 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 01:55:02 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:30:45 RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... \n", "2020-06-30 01:08:08 We are🎉excited to🗣announce opening for in-pers... \n", "2020-06-30 00:20:15 RT @cavalierinsider: On Saturday, The Basketba... \n", "2020-06-30 00:19:00 Anthony Harris part of NFL's top safety tandem... \n", "\n", " created_at user month day year \\\n", "created_at \n", "2020-06-30 02:20:43 2020-06-30 02:20:43 SweetLickKing 6 30 2020 \n", "2020-06-30 02:18:17 2020-06-30 02:18:17 Daniel_B_Ennis 6 30 2020 \n", "2020-06-30 01:55:02 2020-06-30 01:55:02 blimeyonline1 6 30 2020 \n", "2020-06-30 01:40:00 2020-06-30 01:40:00 Wahoos247 6 30 2020 \n", "2020-06-30 01:39:42 2020-06-30 01:39:42 JamieOakes247 6 30 2020 \n", "2020-06-30 01:30:45 2020-06-30 01:30:45 sumanebot 6 30 2020 \n", "2020-06-30 01:08:08 2020-06-30 01:08:08 GrandMarc_UVA 6 30 2020 \n", "2020-06-30 00:20:15 2020-06-30 00:20:15 annefutch 6 30 2020 \n", "2020-06-30 00:19:00 2020-06-30 00:19:00 hoosdaily 6 30 2020 \n", "\n", " hour minute second timestamp \n", "created_at \n", "2020-06-30 02:20:43 2 20 43 2020-06-30 02:20:43 \n", "2020-06-30 02:18:17 2 18 17 2020-06-30 02:18:17 \n", "2020-06-30 01:55:02 1 55 2 2020-06-30 01:55:02 \n", "2020-06-30 01:40:00 1 40 0 2020-06-30 01:40:00 \n", "2020-06-30 01:39:42 1 39 42 2020-06-30 01:39:42 \n", "2020-06-30 01:30:45 1 30 45 2020-06-30 01:30:45 \n", "2020-06-30 01:08:08 1 8 8 2020-06-30 01:08:08 \n", "2020-06-30 00:20:15 0 20 15 2020-06-30 00:20:15 \n", "2020-06-30 00:19:00 0 19 0 2020-06-30 00:19:00 " ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['June 30, 2020']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If these row labels are in chronological order, then we can extract slices of the data that fall within a time range. We write two dates separated by a colon, and the output extracts all rows from the row that matches the first date through the row that matches the last date. Because the rows are generally listed with the most recent ones first, we write the ending timestamp first and the ending timestamp second. If these exact dates and times do not exist in the data, the syntax still finds the locations where these rows would exist and extracts the rows in between these two locations. Here is code to extract all tweets that were posted between 2pm and 3pm on June 29:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecondtimestamp
created_at
2020-06-29 14:54:31UVa alumni team opts out of 2020 TBT, plans to...2020-06-29 14:54:31hoosdaily62920201454312020-06-29 14:54:31
2020-06-29 14:54:26Ben Wallace Is a Proud Dad of Three Kids — Mee...2020-06-29 14:54:26hoosdaily62920201454262020-06-29 14:54:26
2020-06-29 14:51:16Inicia Sonora exportación de uva a Corea del S...2020-06-29 14:51:16AgroTratos62920201451162020-06-29 14:51:16
2020-06-29 14:50:46RT @Wahoos247: Wake Forest commitment Christia...2020-06-29 14:50:46FatWhite10162920201450462020-06-29 14:50:46
2020-06-29 14:42:28RT @Cavs_Corner: Film Room: In the next instal...2020-06-29 14:42:28oleuva62920201442282020-06-29 14:42:28
2020-06-29 14:40:56RT @cavalierinsider: On Saturday, The Basketba...2020-06-29 14:40:56John_Shifflett62920201440562020-06-29 14:40:56
2020-06-29 14:37:11Tatil planları yapmaya başladıysan sana harika...2020-06-29 14:37:11YasinALTINEL62920201437112020-06-29 14:37:11
2020-06-29 14:35:52On Saturday, The Basketball Tournament will be...2020-06-29 14:35:52cavalierinsider62920201435522020-06-29 14:35:52
2020-06-29 14:29:17Jesse Rutherford of the Nelson County Board of...2020-06-29 14:29:17JerryMillerNow62920201429172020-06-29 14:29:17
2020-06-29 14:20:58RT @thesabre: NewsLink is updated! #UVA sports...2020-06-29 14:20:58Slider_Hoos62920201420582020-06-29 14:20:58
2020-06-29 14:20:44RT @Wahoos247: Wake Forest commitment Christia...2020-06-29 14:20:44agee_brandon62920201420442020-06-29 14:20:44
2020-06-29 14:18:33Ramseyer fondly recalls Cavs' perfect regular ...2020-06-29 14:18:33hoosdaily62920201418332020-06-29 14:18:33
2020-06-29 14:15:32Ben Wallace Is a Proud Dad of Three Kids — Mee...2020-06-29 14:15:32hoosdaily62920201415322020-06-29 14:15:32
2020-06-29 14:11:08NewsLink is updated! #UVA sports links all in ...2020-06-29 14:11:08thesabre6292020141182020-06-29 14:11:08
\n", "
" ], "text/plain": [ " text \\\n", "created_at \n", "2020-06-29 14:54:31 UVa alumni team opts out of 2020 TBT, plans to... \n", "2020-06-29 14:54:26 Ben Wallace Is a Proud Dad of Three Kids — Mee... \n", "2020-06-29 14:51:16 Inicia Sonora exportación de uva a Corea del S... \n", "2020-06-29 14:50:46 RT @Wahoos247: Wake Forest commitment Christia... \n", "2020-06-29 14:42:28 RT @Cavs_Corner: Film Room: In the next instal... \n", "2020-06-29 14:40:56 RT @cavalierinsider: On Saturday, The Basketba... \n", "2020-06-29 14:37:11 Tatil planları yapmaya başladıysan sana harika... \n", "2020-06-29 14:35:52 On Saturday, The Basketball Tournament will be... \n", "2020-06-29 14:29:17 Jesse Rutherford of the Nelson County Board of... \n", "2020-06-29 14:20:58 RT @thesabre: NewsLink is updated! #UVA sports... \n", "2020-06-29 14:20:44 RT @Wahoos247: Wake Forest commitment Christia... \n", "2020-06-29 14:18:33 Ramseyer fondly recalls Cavs' perfect regular ... \n", "2020-06-29 14:15:32 Ben Wallace Is a Proud Dad of Three Kids — Mee... \n", "2020-06-29 14:11:08 NewsLink is updated! #UVA sports links all in ... \n", "\n", " created_at user month day year \\\n", "created_at \n", "2020-06-29 14:54:31 2020-06-29 14:54:31 hoosdaily 6 29 2020 \n", "2020-06-29 14:54:26 2020-06-29 14:54:26 hoosdaily 6 29 2020 \n", "2020-06-29 14:51:16 2020-06-29 14:51:16 AgroTratos 6 29 2020 \n", "2020-06-29 14:50:46 2020-06-29 14:50:46 FatWhite101 6 29 2020 \n", "2020-06-29 14:42:28 2020-06-29 14:42:28 oleuva 6 29 2020 \n", "2020-06-29 14:40:56 2020-06-29 14:40:56 John_Shifflett 6 29 2020 \n", "2020-06-29 14:37:11 2020-06-29 14:37:11 YasinALTINEL 6 29 2020 \n", "2020-06-29 14:35:52 2020-06-29 14:35:52 cavalierinsider 6 29 2020 \n", "2020-06-29 14:29:17 2020-06-29 14:29:17 JerryMillerNow 6 29 2020 \n", "2020-06-29 14:20:58 2020-06-29 14:20:58 Slider_Hoos 6 29 2020 \n", "2020-06-29 14:20:44 2020-06-29 14:20:44 agee_brandon 6 29 2020 \n", "2020-06-29 14:18:33 2020-06-29 14:18:33 hoosdaily 6 29 2020 \n", "2020-06-29 14:15:32 2020-06-29 14:15:32 hoosdaily 6 29 2020 \n", "2020-06-29 14:11:08 2020-06-29 14:11:08 thesabre 6 29 2020 \n", "\n", " hour minute second timestamp \n", "created_at \n", "2020-06-29 14:54:31 14 54 31 2020-06-29 14:54:31 \n", "2020-06-29 14:54:26 14 54 26 2020-06-29 14:54:26 \n", "2020-06-29 14:51:16 14 51 16 2020-06-29 14:51:16 \n", "2020-06-29 14:50:46 14 50 46 2020-06-29 14:50:46 \n", "2020-06-29 14:42:28 14 42 28 2020-06-29 14:42:28 \n", "2020-06-29 14:40:56 14 40 56 2020-06-29 14:40:56 \n", "2020-06-29 14:37:11 14 37 11 2020-06-29 14:37:11 \n", "2020-06-29 14:35:52 14 35 52 2020-06-29 14:35:52 \n", "2020-06-29 14:29:17 14 29 17 2020-06-29 14:29:17 \n", "2020-06-29 14:20:58 14 20 58 2020-06-29 14:20:58 \n", "2020-06-29 14:20:44 14 20 44 2020-06-29 14:20:44 \n", "2020-06-29 14:18:33 14 18 33 2020-06-29 14:18:33 \n", "2020-06-29 14:15:32 14 15 32 2020-06-29 14:15:32 \n", "2020-06-29 14:11:08 14 11 8 2020-06-29 14:11:08 " ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['6/29/2020 15:00':'6/29/2020 14:00']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract all tweets before point in time, write the timestamp before the colon and write nothing after the colon. To extract all tweets after a point in time, write nothing before the colon and write the timestamp after the colon. To extract all tweets posted after June 29 at 3pm:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textcreated_atusermonthdayyearhourminutesecondtimestamp
created_at
2020-06-30 02:20:43@KymoraJohnson_ let me call @IamTinaThompson s...2020-06-30 02:20:43SweetLickKing6302020220432020-06-30 02:20:43
2020-06-30 02:18:17Congrats Eric. It’s was really great to have w...2020-06-30 02:18:17Daniel_B_Ennis6302020218172020-06-30 02:18:17
2020-06-30 01:55:02RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...2020-06-30 01:55:02blimeyonline1630202015522020-06-30 01:55:02
2020-06-30 01:40:00Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:40:00Wahoos247630202014002020-06-30 01:40:00
2020-06-30 01:39:42Former #UVA stars Quin Blanding, Micah Kiser n...2020-06-30 01:39:42JamieOakes2476302020139422020-06-30 01:39:42
.................................
2020-06-29 15:38:19#UvA Last week a number of practicals started ...2020-06-29 15:38:19JJ_Angelus62920201538192020-06-29 15:38:19
2020-06-29 15:27:34Playing through: UVa golfers hone their games ...2020-06-29 15:27:34hoosdaily62920201527342020-06-29 15:27:34
2020-06-29 15:27:29Playing through: UVa golfers hone their games ...2020-06-29 15:27:29hoosdaily62920201527292020-06-29 15:27:29
2020-06-29 15:27:21Ramseyer fondly recalls Cavs' perfect regular ...2020-06-29 15:27:21hoosdaily62920201527212020-06-29 15:27:21
2020-06-29 15:06:57While many college athletes have been unable t...2020-06-29 15:06:57cavalierinsider6292020156572020-06-29 15:06:57
\n", "

71 rows × 10 columns

\n", "
" ], "text/plain": [ " text \\\n", "created_at \n", "2020-06-30 02:20:43 @KymoraJohnson_ let me call @IamTinaThompson s... \n", "2020-06-30 02:18:17 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 01:55:02 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "... ... \n", "2020-06-29 15:38:19 #UvA Last week a number of practicals started ... \n", "2020-06-29 15:27:34 Playing through: UVa golfers hone their games ... \n", "2020-06-29 15:27:29 Playing through: UVa golfers hone their games ... \n", "2020-06-29 15:27:21 Ramseyer fondly recalls Cavs' perfect regular ... \n", "2020-06-29 15:06:57 While many college athletes have been unable t... \n", "\n", " created_at user month day year \\\n", "created_at \n", "2020-06-30 02:20:43 2020-06-30 02:20:43 SweetLickKing 6 30 2020 \n", "2020-06-30 02:18:17 2020-06-30 02:18:17 Daniel_B_Ennis 6 30 2020 \n", "2020-06-30 01:55:02 2020-06-30 01:55:02 blimeyonline1 6 30 2020 \n", "2020-06-30 01:40:00 2020-06-30 01:40:00 Wahoos247 6 30 2020 \n", "2020-06-30 01:39:42 2020-06-30 01:39:42 JamieOakes247 6 30 2020 \n", "... ... ... ... ... ... \n", "2020-06-29 15:38:19 2020-06-29 15:38:19 JJ_Angelus 6 29 2020 \n", "2020-06-29 15:27:34 2020-06-29 15:27:34 hoosdaily 6 29 2020 \n", "2020-06-29 15:27:29 2020-06-29 15:27:29 hoosdaily 6 29 2020 \n", "2020-06-29 15:27:21 2020-06-29 15:27:21 hoosdaily 6 29 2020 \n", "2020-06-29 15:06:57 2020-06-29 15:06:57 cavalierinsider 6 29 2020 \n", "\n", " hour minute second timestamp \n", "created_at \n", "2020-06-30 02:20:43 2 20 43 2020-06-30 02:20:43 \n", "2020-06-30 02:18:17 2 18 17 2020-06-30 02:18:17 \n", "2020-06-30 01:55:02 1 55 2 2020-06-30 01:55:02 \n", "2020-06-30 01:40:00 1 40 0 2020-06-30 01:40:00 \n", "2020-06-30 01:39:42 1 39 42 2020-06-30 01:39:42 \n", "... ... ... ... ... \n", "2020-06-29 15:38:19 15 38 19 2020-06-29 15:38:19 \n", "2020-06-29 15:27:34 15 27 34 2020-06-29 15:27:34 \n", "2020-06-29 15:27:29 15 27 29 2020-06-29 15:27:29 \n", "2020-06-29 15:27:21 15 27 21 2020-06-29 15:27:21 \n", "2020-06-29 15:06:57 15 6 57 2020-06-29 15:06:57 \n", "\n", "[71 rows x 10 columns]" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets[:'6/29/2020 15:00']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Leads and Lags\n", "In time series data in which the rows represent particular points in time in descending chronological order, a **lag** is a column that contains the values of another column shifted up one cell. That is, if the rows represent days, and a column represents today's high temperature, the lag represents yesterday's high temperature. A **lead** is a existing column with its values shifted one row down: tomorrow's high temperature. \n", "\n", "To create a lead or a lag, apply the `.shift()` method to a dataframe. The argument of `.shift()` is an integer: positive integers shift values down, creating leads, and negative values shift values up, creating lags. Whenever we shift values up or down, we create new missing values at the top or bottom of the dataframe for the lead or lagged column. \n", "\n", "For example, in the `tweets` dataframe we can create two new columns, `previous_tweet` and `next_tweet` to compare with `text` by using `.shift(-1)` and `.shift(1)`:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textprevious_tweetnext_tweet
created_at
2020-06-30 02:20:43@KymoraJohnson_ let me call @IamTinaThompson s...Congrats Eric. It’s was really great to have w...NaN
2020-06-30 02:18:17Congrats Eric. It’s was really great to have w...RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...@KymoraJohnson_ let me call @IamTinaThompson s...
2020-06-30 01:55:02RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...Former #UVA stars Quin Blanding, Micah Kiser n...Congrats Eric. It’s was really great to have w...
2020-06-30 01:40:00Former #UVA stars Quin Blanding, Micah Kiser n...Former #UVA stars Quin Blanding, Micah Kiser n...RT @KathrynsScGifts: \"OGX Fade-defying + Orchi...
2020-06-30 01:39:42Former #UVA stars Quin Blanding, Micah Kiser n...RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச...Former #UVA stars Quin Blanding, Micah Kiser n...
............
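{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is a small, made-up example of `.shift()` on a newest-first index (the dataframe `temps`, its dates, and its temperatures are invented purely for illustration, and the cell is not executed here):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Invented daily highs, newest first, mirroring the ordering of the tweets data\n", "temps = pd.DataFrame({'high': [90, 85, 88, 91]},\n", "                     index=pd.to_datetime(['2020-06-26', '2020-06-25',\n", "                                           '2020-06-24', '2020-06-23']))\n", "\n", "temps['yesterday_high'] = temps['high'].shift(-1)  # lag: pulled up from the row below\n", "temps['tomorrow_high'] = temps['high'].shift(1)    # lead: pushed down from the row above\n", "\n", "# The newest row has no 'tomorrow' and the oldest row has no 'yesterday',\n", "# so those cells are NaN\n", "temps" ] },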
{ "cell_type": "markdown", "metadata": {}, "source": [ "Returning to the `tweets` dataframe, we can create two new columns, `previous_tweet` and `next_tweet`, to compare with `text` by using `.shift(-1)` and `.shift(1)`:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text \\\n", "created_at \n", "2020-06-30 02:20:43 @KymoraJohnson_ let me call @IamTinaThompson s... \n", "2020-06-30 02:18:17 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 01:55:02 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "... ... \n", "2020-06-23 15:12:08 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... \n", "2020-06-23 15:08:40 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... \n", "2020-06-23 14:52:23 Zur Einstimmung #UVA vom letzten #gig bei uns ... \n", "2020-06-23 14:49:22 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... \n", "2020-06-23 14:32:44 Good question... My instinct says Ty Jerome, b... \n", "\n", " previous_tweet \\\n", "created_at \n", "2020-06-30 02:20:43 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 02:18:17 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:55:02 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "2020-06-30 01:39:42 RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... \n", "... ... \n", "2020-06-23 15:12:08 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... \n", "2020-06-23 15:08:40 Zur Einstimmung #UVA vom letzten #gig bei uns ... \n", "2020-06-23 14:52:23 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... \n", "2020-06-23 14:49:22 Good question... My instinct says Ty Jerome, b... \n", "2020-06-23 14:32:44 NaN \n", "\n", " next_tweet \n", "created_at \n", "2020-06-30 02:20:43 NaN \n", "2020-06-30 02:18:17 @KymoraJohnson_ let me call @IamTinaThompson s... \n", "2020-06-30 01:55:02 Congrats Eric. It’s was really great to have w... \n", "2020-06-30 01:40:00 RT @KathrynsScGifts: \"OGX Fade-defying + Orchi... \n", "2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... \n", "... ... \n", "2020-06-23 15:12:08 WATCH: Top plays of UVA's 2019-20 basketball s... \n", "2020-06-23 15:08:40 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... \n", "2020-06-23 14:52:23 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... \n", "2020-06-23 14:49:22 Zur Einstimmung #UVA vom letzten #gig bei uns ... \n", "2020-06-23 14:32:44 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... \n", "\n", "[1000 rows x 3 columns]" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets['previous_tweet'] = tweets.shift(-1).text\n", "tweets['next_tweet'] = tweets.shift(1).text\n", "tweets[['text', 'previous_tweet', 'next_tweet']]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }