9. Reshaping and Merging Data and Working with Strings, Dates, and Times#

9.1. Introduction: pandas or SQL?#

As we saw in module 7, read operations in SQL do more than extract data from a database: they can also select and rename columns, filter and sort rows, join tables, and aggregate data. With pandas, we can perform all of these same operations. So the question is: should we be using SQL or pandas to perform these data manipulation steps?

SQL is sometimes referred to as server-side data manipulation, because databases are often stored on remote servers and SQL queries are processed on the server instead of on a local machine. Data manipulation that is conducted using pandas on a local Python installation is called client-side data manipulation.

The question of SQL vs. pandas (or SQL vs. the tidyverse in R) is the subject of a lot of debate among professionals who use data. The question comes up frequently on various coding forums. Some blog posts have tried to compare SQL and pandas in terms of the speed with which they complete equivalent operations, with differing results. Tina Wenzel and Kavya Gupta note that many factors influence the relative speed of SQL and pandas operations, including the configuration of a PostgreSQL database and the bandwidth available in a network connection. They take the mixed evidence into account and conclude that

SQL is the best tool to use for basic data retrieval and the better you know it, the more it will speed up your workflow. More advanced data manipulation tasks can be performed both on the server and client side and it is up to the analyst to balance the workload optimally.

In short, the existing evidence does not overwhelmingly support one option or the other as best practice for data manipulation. The reality is that both SQL and pandas are extremely useful and widely-used, and both tools will become part of your workflow. It will take some time and experience with end-to-end projects in data science to learn how to balance SQL and pandas in a way that is most comfortable for you in your workflow. But at this stage, there are some important points to keep in mind when thinking about these tools.

First, there are many situations in which the question of SQL vs. pandas might be moot. For a given project, our data might not come from a database, but instead from a CSV file, from a JSON document acquired via an API, or from raw HTML that we scraped from a webpage. So in order to use SQL, we would have to take the additional step of creating a database. If we hold ourselves to the principles that E. F. Codd and others laid out about the organization of relational databases, it can take a significant amount of work to create this database. If there is no database involved for a specific data project, there is no role for SQL, but because pandas works on dataframes it can be a primary tool for data manipulation regardless of the original source for the data.

Second, while there are differences in speed, these differences only become a significant factor for projects that involve big data, and in those cases, we will almost certainly be working with data stored on remote servers. If the data are organized using a database, then SQL may well be faster than pandas, but it very much depends on how the database is configured and on the myriad factors that determine the speed with which code runs through a remote connection. pandas can also be used in scripts or Jupyter notebooks that are run on remote servers. Sometimes it makes sense to pull selections of remotely stored data into a local environment so that we can manipulate the data without having to seek permission from a database manager, and in that case, a mix of SQL to choose a selection and pandas to manipulate the local data can be very effective.

Third, pandas simply has more functionality and flexibility than SQL. For example, it is fairly straightforward to reshape (e.g. pivot) data using pandas, and much more difficult to reshape data in SQL. pandas has better functionality for working with strings, dates, and times than SQL, and pandas, being part of a Python environment, works well with any other Python package that operates on dataframes or arrays. In contrast, while various DBMSs add specific extensions to the base SQL language, these extensions tend to be fairly limited because of the inherent tension between expanding the functionality of a query language and staying close enough to base SQL to still be considered SQL. The easiest way to bring SQL query results into a Python environment uses an sqlalchemy engine and the pd.read_sql_query() function from pandas.
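For example, a minimal sketch of that workflow might look like the following, assuming a hypothetical PostgreSQL database; the connection string, table, and column names here are placeholders, not part of the data used later in this module:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string: replace the user, password, host, and database name
engine = create_engine("postgresql://user:password@localhost:5432/elections")

# Let the server do the selection, then hand the result to pandas for further manipulation
query = "SELECT state, year, demvote, repvote FROM results WHERE year >= 1996"
df = pd.read_sql_query(query, engine)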

Fourth, both pandas and SQL enable us to work as part of larger teams that share tools, but we might choose or be required to use one tool over the other to work better as part of the team. pandas is essential for projects in which the whole team works within Python. SQL is readable and usable for people who do not use Python but still work with databases.

Fifth, and most importantly, the choice of SQL vs. pandas should be made based on how comfortable we feel with each tool. Both SQL and pandas can perform data manipulation correctly, and we will probably find it easier to remember the code for one tool than for the other. Since data manipulation comprises something like 80% of our time as data scientists, we should use whichever tool lets us do that work more quickly.

For now, we need to practice both pandas and SQL and get comfortable with these tools, and we need to be flexible in the future as different situations will call for SQL or pandas.

9.2. Joining Dataframes#

Joining dataframes is also called merging dataframes, and I will use the words “join” and “merge” interchangeably below. Every kind of merge that is possible in SQL is possible in pandas: inner, outer, left, right, natural, cross, and anti-joins. One important difference between joining in SQL and merging in pandas is that dataframes might not be as carefully curated as data tables in a relational database, and we should not assume that the join worked properly even if there is no error. Primary keys might not actually uniquely identify rows, and joins might inadvertently turn into cross joins. Some values of keys that should match might not, due to differences in coding or spelling. And if we attempt a natural join, the wrong columns might be automatically selected as the keys.

Too many discussions of merging data provide clean examples. Here I will show you how merging works when the data contain some common problems that may invalidate the merge. In the following discussion, we will discuss how to perform merges in pandas, and also how to check that the merge worked in the way we want to.

First, for the following examples, we load these packages:

import numpy as np
import pandas as pd

9.2.1. Example: Merging Data on U.S. State Election Results, Income, Economies, and Area#

The “state_elections.csv” file contains information about the result of presidential elections by state for every presidential election between 1964 and 2004. In this dataset, the primary keys are State and year:

elect = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_elections.csv")
elect.head()
State year demvote repvote wallacevote dempercent reppercent population
0 Alabama 1964 210732 479085 0 30.54897 69.45103 1919000
1 Alabama 1968 196579 146923 691425 18.99448 81.00552 1993000
2 Alabama 1972 256923 728701 0 26.06704 73.93296 3540080
3 Alabama 1976 659170 504070 0 56.66673 43.33327 3737204
4 Alabama 1980 636730 654192 0 49.32366 50.67634 3894025

“state_income.txt” contains various summary statistics (percentiles and means, broken down separately for men, for women, and for the overall population) about personal incomes within each state. In this dataset, the primary keys are stcode (the post office’s two-letter abbreviation for states) and year:

income = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_income.txt", sep="\t", header=2) # tab separated (note the header in this file)
income.head()
stcode year income p10income p20income p30income p40income p50income p60income p70income ... wom_p90income wom_meanincome wom_minincome wom_maxincome wom_sdincome wom_ineqincome wom_gini_income wom_n_income wom_nwgt_income medincome
0 AL 1964 89.83488 3269.85 5731.07 7988.87 10318.3 13189.1 16007.3 18744.4 ... 24233.0 11400.0 58.13953 72674.42 1796.89 29.53152 0.473738 175.0 418494.1 76.68081
1 AL 1968 92.18238 4469.95 7455.28 10050.30 12792.1 15580.2 18392.4 21310.7 ... 26287.5 12005.3 NaN NaN 1968.78 27.40979 0.485090 NaN NaN 80.72643
2 AL 1972 94.87543 6734.11 10314.80 13336.30 16421.5 19088.9 22141.6 25410.7 ... 30278.0 14667.9 NaN NaN 8604.25 17.83577 0.457932 NaN NaN 82.27975
3 AL 1976 67.53671 7759.81 11176.60 13823.20 16431.8 18982.8 21923.5 25180.8 ... 31119.2 15362.5 NaN NaN 12842.20 15.77765 0.454074 NaN NaN 60.07215
4 AL 1980 46.29869 7602.15 10637.00 13532.60 16279.6 19111.8 21802.5 25190.6 ... 29623.1 14804.2 28.38428 120196.50 5944.71 14.52640 0.436961 543.0 711278.2 41.72882

5 rows × 55 columns

“state_economics.txt” contains information about the macro-level economy in each state and year, including overall and per capita gross domestic product, and GDP broken down by industry. In this dataset, the primary keys are fips (a standard numeric code that identifies countries and provinces within countries) and year:

econ = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_economics.txt", sep=";") # semi-colon separated
econ.head()
fips year GDP GDPpc Private Agriculture Farms Mining Utilities Construction ... Construction_share Manufacturing_share Finance_Insurance_share Legal_share Education_share Health_share Government_share Government_federal_share Government_military_share Government_statelocal_share
0 1 1964 8201 4273.580 6900 312 283 115 235 379 ... 4.621387 28.22827 11.41324 0.402390 0.256066 1.902207 15.86392 6.291915 2.402146 7.182051
1 1 1968 10949 5493.728 9095 294 253 133 294 499 ... 4.557494 28.74235 10.67677 0.429263 0.365330 2.228514 16.93305 5.945748 2.630377 8.356928
2 1 1972 15336 4332.105 12696 481 418 223 435 702 ... 4.577465 26.05634 11.48279 0.469484 0.443401 2.595201 17.21440 5.692488 2.679969 8.841941
3 1 1976 24206 6477.035 19988 781 683 584 714 1281 ... 5.292076 24.85334 11.50541 0.479220 0.359415 2.986863 17.42543 5.787821 2.255639 9.386103
4 1 1980 36006 9246.474 29567 674 542 1351 1228 1588 ... 4.410376 24.60423 11.63417 0.649892 0.352719 3.660501 17.88313 5.782370 2.171860 9.928901

5 rows × 36 columns

“state_area.csv” contains various measures of the total, land, and water areas of each state. In this dataset, the primary key is state:

area = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_area.csv") 
area.head()
state area_sqmi area_sqkm landarea_sqmi landarea_sqkm water_sqmi water_sqkm percent_water
0 Alaska 663,267.26 1,717,854 571,951.26 1,481,347 91,316.00 236,507 13.77
1 Texas 268,580.82 695,621 261,797.12 678,051 6,783.70 17,570 2.53
2 California 163,695.57 423,970 155,939.52 403,882 7,736.23 20,037 4.73
3 Montana 147,042.40 380,838 145,552.43 376,979 1,489.96 3,859 1.01
4 New Mexico 121,589.48 314,915 121,355.53 314,309 233.96 606 0.19

Because the datasets above contain three ways to identify states - by name, with two-letter abbreviations, and with FIPS codes - we also need a dataset that matches each state by name to its abbreviation and FIPS code. Datasets that exist solely for the purpose of matching values of keys across datasets are called crosswalks:

crosswalk = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/crosswalk.csv")
crosswalk.head()
fips State stcode
0 1 Alabama AL
1 2 Alaska AK
2 4 Arizona AZ
3 5 Arkansas AR
4 6 California CA

Our goal is to merge all five of these datasets together.

9.2.2. Using the pd.merge() Function#

When using a crosswalk to merge dataframes together, the first step is to merge a dataframe with the crosswalk. Here we start by joining the elect and crosswalk dataframes together. To join dataframes, we use the pd.merge() function. The following code works:

merged_data = pd.merge(elect, crosswalk, on='State')
merged_data
State year demvote repvote wallacevote dempercent reppercent population fips stcode
0 Alabama 1964 210732 479085 0 30.54897 69.45103 1919000 1 AL
1 Alabama 1968 196579 146923 691425 18.99448 81.00552 1993000 1 AL
2 Alabama 1972 256923 728701 0 26.06704 73.93296 3540080 1 AL
3 Alabama 1976 659170 504070 0 56.66673 43.33327 3737204 1 AL
4 Alabama 1980 636730 654192 0 49.32366 50.67634 3894025 1 AL
... ... ... ... ... ... ... ... ... ... ...
556 Wyoming 1988 67113 106867 0 38.57512 61.42488 465080 56 WY
557 Wyoming 1992 68160 79347 0 46.20798 53.79202 466251 56 WY
558 Wyoming 1996 77934 105388 0 42.51208 57.48792 488167 56 WY
559 Wyoming 2000 60481 147947 0 29.01769 70.98231 493782 56 WY
560 Wyoming 2004 70776 167629 0 29.68730 70.31271 506529 56 WY

561 rows × 10 columns

The merged_data dataframe now contains the features from elect and from crosswalk, which means we’ve added stcode and fips to the elect data, and that enables us to merge merged_data with income and econ.

Before moving on to the rest of the example, we should discuss some properties of the pd.merge() function. Recall from module 7 that there are many different kinds of joins that differ only in how they treat unmatched observations. pd.merge() conducts an inner join by default, leaving only the matched observations in the merged data. To change the type of merge, use the how argument:

how='outer' performs an outer or full join. It brings all rows from both dataframes into the merged data, filling the columns that come from the other dataframe with NaN for unmatched rows.

merged_data = pd.merge(elect, crosswalk, on='State', how='outer')

how='left' performs a left join. It brings all rows from the first dataframe listed into the merged data, and only those in the second dataframe that are matched to the first.

merged_data = pd.merge(elect, crosswalk, on='State', how='left')

how='right' performs a right join. It brings all rows from the second dataframe listed into the merged data, and only those in the first dataframe that are matched to the second.

merged_data = pd.merge(elect, crosswalk, on='State', how='right')

There are two ways to specify the key or keys on which to match rows. If the key or keys have the same name in both dataframes, use the on argument. If there is one key, pass this name as a string to on like we did above. If there is more than one key, pass these names as a list to on. For example, if we merge on both State and year, we will type on = ['State', 'year'].

If the keys have different names in each dataframe, we use left_on to specify the names of these keys in the first dataframe listed in the pd.merge() function, and right_on to specify the names of these keys in the second dataframe listed. Suppose, for example, that the keys are named “State” and “year” in the first dataframe, and “st” and “yr” in the second dataframe. We would type left_on = ['State', 'year'], right_on = ['st', 'yr'].
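As a quick illustration, here is a sketch with two small, made-up dataframes whose keys have different names (df_a, df_b, and their values are hypothetical):

df_a = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                     'year': [2000, 2000],
                     'votes': [100, 200]})
df_b = pd.DataFrame({'st': ['Alabama', 'Alaska'],
                     'yr': [2000, 2000],
                     'gdp': [300, 400]})

# Match State/year in df_a to st/yr in df_b
matched = pd.merge(df_a, df_b, left_on=['State', 'year'], right_on=['st', 'yr'])

# Both sets of key columns are kept in the result, so we can drop the redundant ones
matched = matched.drop(['st', 'yr'], axis=1)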

Depending on the type of join we use, a row can be matched, or can come from just the first or just the second dataframe listed in pd.merge(). To see which rows come from which sources, specify indicator='newname', where 'newname' is a name you choose for a new column that contains this information. For example, the following code repeats the merge we performed above, but uses an outer join instead of an inner join and adds a column matched that tells us how each row is matched:

merged_data = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched')
merged_data
State year demvote repvote wallacevote dempercent reppercent population fips stcode matched
0 Alabama 1964 210732 479085 0 30.54897 69.45103 1919000 1 AL both
1 Alabama 1968 196579 146923 691425 18.99448 81.00552 1993000 1 AL both
2 Alabama 1972 256923 728701 0 26.06704 73.93296 3540080 1 AL both
3 Alabama 1976 659170 504070 0 56.66673 43.33327 3737204 1 AL both
4 Alabama 1980 636730 654192 0 49.32366 50.67634 3894025 1 AL both
... ... ... ... ... ... ... ... ... ... ... ...
556 Wyoming 1988 67113 106867 0 38.57512 61.42488 465080 56 WY both
557 Wyoming 1992 68160 79347 0 46.20798 53.79202 466251 56 WY both
558 Wyoming 1996 77934 105388 0 42.51208 57.48792 488167 56 WY both
559 Wyoming 2000 60481 147947 0 29.01769 70.98231 493782 56 WY both
560 Wyoming 2004 70776 167629 0 29.68730 70.31271 506529 56 WY both

561 rows × 11 columns

9.2.3. Checking for Problems That Can Occur Without Errors#

Whenever we apply pd.merge(), there’s a great risk that the merge was incorrect even if the function runs without error. There are two ways that a merge can fail.

Problem 1: what we thought was a unique ID was not a unique ID.

If we merge two dataframes on a column (or set of columns) that fails to uniquely identify the rows in either dataframe, then pd.merge() returns all pairwise combinations of rows with matching IDs. To see this, consider the following two example dataframes:

dict1 = {'id': ['A', 'A', 'B', 'B'],
         'data1': [150, 200, 50, 25]}
df1 = pd.DataFrame.from_dict(dict1)
df1
id data1
0 A 150
1 A 200
2 B 50
3 B 25
dict2 = {'id': ['A', 'A', 'B', 'B'],
         'data2': [-20, -75, -125, -250]}
df2 = pd.DataFrame.from_dict(dict2)
df2
id data2
0 A -20
1 A -75
2 B -125
3 B -250

Note that the two dataframes share a column “id”, but that “id” is not a unique identifier in either dataframe. If I try to merge the two based on “id”, the result is:

df_merge = pd.merge(df1, df2, on='id')
df_merge
id data1 data2
0 A 150 -20
1 A 150 -75
2 A 200 -20
3 A 200 -75
4 B 50 -125
5 B 50 -250
6 B 25 -125
7 B 25 -250

Because we did not have a unique ID in either dataframe, we artificially increased the number of rows from 4 to 8. At first glance, the merged data looks fine, but with double the number of rows we are supposed to have, the data are in fact corrupted, and we cannot use them for any subsequent analysis. Also, note that the merge ran without any errors. If we aren’t careful, we might move forward with data that cannot be used.

The above example is an instance of a many-to-many merge, in which rows in the left dataframe can match to more than one row in the right dataframe, and rows in the right dataframe can match to more than one row in the left dataframe. In these cases pandas performs a cross join within the subsets of the two dataframes that share the same key, inflating the number of rows. But a many-to-many join is almost never what we want to do with data, and if it happens, it is usually a mistake.
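Because this mistake is so easy to make, it is worth checking whether the merge keys really are unique before merging. One way to do so, using the df1 dataframe defined above:

# False here, because 'A' and 'B' each appear twice in the id column
print(df1['id'].is_unique)

# See how many times each key value appears
print(df1['id'].value_counts())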

The other types of merges are

  • one-to-one, in which every row in the left dataframe matches to at most one row in the right dataframe, and every row in the right dataframe matches to at most one row in the left dataframe,

  • many-to-one, in which rows in the right dataframe can match to more than one row in the left dataframe, but every row in the left dataframe matches to at most one row in the right dataframe (that is, the keys are unique in the right dataframe),

  • and one-to-many, in which rows in the left dataframe can match to more than one row in the right dataframe, but every row in the right dataframe matches to at most one row in the left dataframe (that is, the keys are unique in the left dataframe).

The best way to prevent this mistake is to first think about whether the two dataframes should constitute a one-to-one, many-to-one, or one-to-many merge. It is important to have a clear expectation. Then use the validate argument to have pandas automatically check to see whether this expectation is met. If not, the code will generate an error, which will bring our attention to problems in the data we might not have caught otherwise. If we write

  • validate = 'one_to_one' then pandas checks to see whether the merge keys are unique in both dataframes,

  • validate = 'many_to_one' then pandas checks to see whether the merge keys are unique in the right dataframe,

  • validate = 'one_to_many' then pandas checks to see whether the merge keys are unique in the left dataframe.
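For instance, applying validate='one_to_one' to the df1 and df2 dataframes from the example above raises a MergeError, because the keys are duplicated in both dataframes. A minimal sketch of catching that error:

from pandas.errors import MergeError

try:
    pd.merge(df1, df2, on='id', validate='one_to_one')
except MergeError as error:
    print(error)  # reports that the merge keys are not unique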

There are many reasons why our expectations might be incorrect. In messy, real-world data, mistakes like duplicated keys often make it past review processes and exist in the data without any clue of it in the codebook. The validate argument is an important check for these kinds of issues. For example, we expect that merging elect and crosswalk will be a many-to-one merge, so we confirm that the merge is many-to-one as follows:

merged_data = pd.merge(elect, crosswalk, on='State', how='outer', 
                       indicator='matched', validate='many_to_one')
merged_data
State year demvote repvote wallacevote dempercent reppercent population fips stcode matched
0 Alabama 1964 210732 479085 0 30.54897 69.45103 1919000 1 AL both
1 Alabama 1968 196579 146923 691425 18.99448 81.00552 1993000 1 AL both
2 Alabama 1972 256923 728701 0 26.06704 73.93296 3540080 1 AL both
3 Alabama 1976 659170 504070 0 56.66673 43.33327 3737204 1 AL both
4 Alabama 1980 636730 654192 0 49.32366 50.67634 3894025 1 AL both
... ... ... ... ... ... ... ... ... ... ... ...
556 Wyoming 1988 67113 106867 0 38.57512 61.42488 465080 56 WY both
557 Wyoming 1992 68160 79347 0 46.20798 53.79202 466251 56 WY both
558 Wyoming 1996 77934 105388 0 42.51208 57.48792 488167 56 WY both
559 Wyoming 2000 60481 147947 0 29.01769 70.98231 493782 56 WY both
560 Wyoming 2004 70776 167629 0 29.68730 70.31271 506529 56 WY both

561 rows × 11 columns

Problem 2: rows should be matched, but aren’t.

This problem is especially common when matching on strings, such as names. With countries, it’s possible for one dataframe to write “USA” and the other to write “United States”. You will have to go back into a dataframe to edit individual ID cells so that they match. But to identify the cells that need to be edited, we have to construct a dataframe of the rows that were unmatched in one or the other dataframe. The easiest way to see this is to use an anti-join: a merge that keeps only the rows in the left dataframe that have no match in the right dataframe. As in SQL, there is no anti-join function built into pandas, but there are ways to program our own. The easiest is to perform an outer join with the indicator='matched' argument, then to filter the data to those rows for which matched is not equal to “both”.
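As a sketch of this pattern, we could wrap it in a small, reusable anti-join helper; here left, right, and keys are hypothetical placeholders for two dataframes and a list of key column names:

def anti_join(left, right, keys):
    """Return the rows of left that have no match in right on the given keys."""
    merged = pd.merge(left, right[keys].drop_duplicates(), on=keys,
                      how='left', indicator='matched')
    return merged.query("matched == 'left_only'").drop('matched', axis=1)

For example, anti_join(elect, crosswalk, ['State']) would return the rows of elect whose state names do not appear in crosswalk, which here is an empty dataframe because every state matches.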

For this step, I recommend creating a second, working version of the merged data, merge_test, to contain the first attempt at the merge. If there are issues, it is easier to fix the problems and recreate this working dataframe than to restart the kernel. If there are no problems, then we can assign the merged_data dataframe to be equal to the working version.

merge_test = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched', validate='many_to_one')
merge_test.query("matched!='both'")
State year demvote repvote wallacevote dempercent reppercent population fips stcode matched

There are no rows in crosswalk without a match in elect, and there are no rows in elect without a match in crosswalk. If there were unmatched rows, there would be two possible reasons: differences in coverage, or differences in coding or spelling. pandas cannot tell the difference between these two reasons, and an inner join drops unmatched rows regardless of the reason. Using an outer join with indicator='matched' is best practice for seeing the unmatched rows and thinking about whether any rows that should be matched are for some reason unmatched. We can use the category relabeling functionality in pandas to fix issues of misspelled and miscoded keys, as we will do in the examples in the next section.

After performing this check, if we intend to perform another merge, we will either need to choose a new name for the indicator column, or we can drop the matched column so that we can reuse this name later. We set merged_data to the working dataframe merge_test without the matched column:

merged_data = merge_test.drop('matched', axis=1)

9.2.4. Merging all of the Dataframes Together While Checking for Problems#

The merged_data and income dataframes share keys named stcode and year. Joining these two dataframes should be a one-to-one merge because there should be only one occurrence of the same state and year in each dataframe. We use how='outer' and indicator='matched' (since we deleted the previous matched column) to enable us to filter the result to the unmatched rows, and we use validate='one_to_one' to catch anomalies that violate our expectation that this is a one-to-one merge:

merge_test = pd.merge(merged_data, income, on=['stcode','year'], how='outer', indicator='matched', validate='one_to_one')
merge_test.query("matched!='both'")
State year demvote repvote wallacevote dempercent reppercent population fips stcode ... wom_meanincome wom_minincome wom_maxincome wom_sdincome wom_ineqincome wom_gini_income wom_n_income wom_nwgt_income medincome matched

0 rows × 64 columns

There are no unmatched rows, so there are no issues with miscoded or misspelled ID values. The merged data now contain information on both state election results and state incomes:

merge_test
State year demvote repvote wallacevote dempercent reppercent population fips stcode ... wom_meanincome wom_minincome wom_maxincome wom_sdincome wom_ineqincome wom_gini_income wom_n_income wom_nwgt_income medincome matched
0 Alabama 1964 210732 479085 0 30.54897 69.45103 1919000 1 AL ... 11400.0 58.139530 72674.42 1796.89 29.53152 0.473738 175.0 418494.1 76.68081 both
1 Alabama 1968 196579 146923 691425 18.99448 81.00552 1993000 1 AL ... 12005.3 NaN NaN 1968.78 27.40979 0.485090 NaN NaN 80.72643 both
2 Alabama 1972 256923 728701 0 26.06704 73.93296 3540080 1 AL ... 14667.9 NaN NaN 8604.25 17.83577 0.457932 NaN NaN 82.27975 both
3 Alabama 1976 659170 504070 0 56.66673 43.33327 3737204 1 AL ... 15362.5 NaN NaN 12842.20 15.77765 0.454074 NaN NaN 60.07215 both
4 Alabama 1980 636730 654192 0 49.32366 50.67634 3894025 1 AL ... 14804.2 28.384280 120196.50 5944.71 14.52640 0.436961 543.0 711278.2 41.72882 both
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
556 Wyoming 1988 67113 106867 0 38.57512 61.42488 465080 56 WY ... 17188.0 75.987840 77507.60 10402.40 23.54822 0.478940 318.0 110956.1 36.63389 both
557 Wyoming 1992 68160 79347 0 46.20798 53.79202 466251 56 WY ... 17934.1 92.307700 96391.03 12512.80 21.14388 0.469024 360.0 117688.6 30.10564 both
558 Wyoming 1996 77934 105388 0 42.51208 57.48792 488167 56 WY ... 18796.4 29.816510 122420.90 14517.60 22.74368 0.461693 413.0 125222.3 26.93395 both
559 Wyoming 2000 60481 147947 0 29.01769 70.98231 493782 56 WY ... 20463.4 88.819230 333171.40 16609.80 15.30497 0.447124 440.0 128753.9 26.05026 both
560 Wyoming 2004 70776 167629 0 29.68730 70.31271 506529 56 WY ... 20225.1 9.587728 206166.80 18311.90 11.33138 0.432011 692.0 123164.1 24.74094 both

561 rows × 64 columns

Since there are no problems, we set merged_data equal to merge_test. Because we intend to merge the data with another dataframe, we drop the matched column.

merged_data = merge_test.drop('matched', axis=1)

Next we merge this result with the econ dataframe. These two dataframes share fips and year as keys, and should also be a one-to-one merge. We check to see whether there are unmatched rows:

merge_test = pd.merge(merged_data, econ, on=['fips','year'], how='outer', indicator='matched', validate='one_to_one')
merge_test.query("matched!='both'")
State year demvote repvote wallacevote dempercent reppercent population fips stcode ... Manufacturing_share Finance_Insurance_share Legal_share Education_share Health_share Government_share Government_federal_share Government_military_share Government_statelocal_share matched

0 rows × 98 columns

Again, there are no unmatched rows, so there are no issues with miscoded or misspelled ID values. The merged data now contain information on state election results, state incomes, and state economies:

merge_test
State year demvote repvote wallacevote dempercent reppercent population fips stcode ... Manufacturing_share Finance_Insurance_share Legal_share Education_share Health_share Government_share Government_federal_share Government_military_share Government_statelocal_share matched
0 Alabama 1964 210732 479085 0 30.54897 69.45103 1919000 1 AL ... 28.228270 11.413240 0.402390 0.256066 1.902207 15.86392 6.291915 2.402146 7.182051 both
1 Alabama 1968 196579 146923 691425 18.99448 81.00552 1993000 1 AL ... 28.742350 10.676770 0.429263 0.365330 2.228514 16.93305 5.945748 2.630377 8.356928 both
2 Alabama 1972 256923 728701 0 26.06704 73.93296 3540080 1 AL ... 26.056340 11.482790 0.469484 0.443401 2.595201 17.21440 5.692488 2.679969 8.841941 both
3 Alabama 1976 659170 504070 0 56.66673 43.33327 3737204 1 AL ... 24.853340 11.505410 0.479220 0.359415 2.986863 17.42543 5.787821 2.255639 9.386103 both
4 Alabama 1980 636730 654192 0 49.32366 50.67634 3894025 1 AL ... 24.604230 11.634170 0.649892 0.352719 3.660501 17.88313 5.782370 2.171860 9.928901 both
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
556 Wyoming 1988 67113 106867 0 38.57512 61.42488 465080 56 WY ... 4.380078 7.788767 0.688802 0.123631 2.207700 14.51784 2.622748 1.660191 10.234900 both
557 Wyoming 1992 68160 79347 0 46.20798 53.79202 466251 56 WY ... 4.303816 7.670391 0.704806 0.179951 2.579291 14.99588 2.811727 1.552073 10.632080 both
558 Wyoming 1996 77934 105388 0 42.51208 57.48792 488167 56 WY ... 6.146708 9.280448 0.705568 0.152555 2.866768 14.23849 2.657005 1.544622 10.030510 both
559 Wyoming 2000 60481 147947 0 29.01769 70.98231 493782 56 WY ... 6.012348 3.110034 0.646241 0.173100 4.269805 14.95586 2.729213 1.540592 10.686050 both
560 Wyoming 2004 70776 167629 0 29.68730 70.31271 506529 56 WY ... 3.795901 2.813834 0.589240 0.222032 4.419300 14.49616 2.519214 1.648164 10.328780 both

561 rows × 98 columns

Again, we set merged_data equal to merge_test, dropping the matched column so that we can reuse this name in the next join:

merged_data = merge_test.drop('matched', axis=1)

Finally we merge the data with the area dataframe. This merge matches State in merged_data to state in area: this subtle difference is easy to miss and will result in an error if we fail to notice the lowercase s and write on='State'. This merge should be a many-to-one merge, as there are many rows for each state in merged_data (one for each year), but only one row for each state in area. As before, we perform the merge and look at the unmatched rows:

merge_test = pd.merge(merged_data, area, how="outer",
                      left_on='State',right_on='state', 
                      indicator='matched', 
                      validate='many_to_one')
merge_test.query("matched!='both'")[['State', 'state',
                                     'year','matched']]
State state year matched
77 D. C. NaN 1964.0 left_only
78 D. C. NaN 1968.0 left_only
79 D. C. NaN 1972.0 left_only
80 D. C. NaN 1976.0 left_only
81 D. C. NaN 1980.0 left_only
82 D. C. NaN 1984.0 left_only
83 D. C. NaN 1988.0 left_only
84 D. C. NaN 1992.0 left_only
85 D. C. NaN 1996.0 left_only
86 D. C. NaN 2000.0 left_only
87 D. C. NaN 2004.0 left_only
561 NaN District of Columbia NaN right_only
562 NaN Puerto Rico NaN right_only
563 NaN Northern Mariana Islands NaN right_only
564 NaN United States Virgin Islands NaN right_only
565 NaN American Samoa NaN right_only
566 NaN Guam NaN right_only
567 NaN United States Minor Outlying Islands NaN right_only

In this case, one of our validation checks failed: we have unmatched rows. Some of these rows are due to differences in coverage: we will not bring Puerto Rico, the Northern Mariana Islands, the United States Virgin Islands, American Samoa, Guam, or the United States Minor Outlying Islands into the merged data as these territories do not receive votes in the electoral college. We will however bring the District of Columbia into the merged data as D.C. does get 3 votes in the electoral college. D.C. is unmatched due to spelling discrepancies, and we must correct this mistake before we can proceed. The easiest way to fix the error is to replace values of “District of Columbia” in the area dataframe with “D. C.”, as it is written in the merged data. We can use the .replace() method (because we only want to replace one state category and .replace() leaves unlisted categories alone, as opposed to .map() which would replace unlisted categories with missing values), as we discussed in module 8:

area.state = area.state.replace({'District of Columbia':'D. C.'})

Now if we repeat the test, we will only see the rows that are unmatched due to differences in coverage:

merge_test = pd.merge(merged_data, area, left_on='State', how="outer", right_on='state', indicator='matched', validate='many_to_one')
merge_test.query("matched!='both'")[['State', 'state','year','matched']]
State state year matched
561 NaN Puerto Rico NaN right_only
562 NaN Northern Mariana Islands NaN right_only
563 NaN United States Virgin Islands NaN right_only
564 NaN American Samoa NaN right_only
565 NaN Guam NaN right_only
566 NaN United States Minor Outlying Islands NaN right_only

We replace the merged_data dataframe with merge_test, keeping only the rows for which matched='both', then dropping the matched column:

merged_data = merge_test.query("matched=='both'").drop('matched', axis=1)

Now merged_data contains all of the data we need, including state election results, income, economic data, and area:

merged_data
State year demvote repvote wallacevote dempercent reppercent population fips stcode ... Government_military_share Government_statelocal_share state area_sqmi area_sqkm landarea_sqmi landarea_sqkm water_sqmi water_sqkm percent_water
0 Alabama 1964.0 210732.0 479085.0 0.0 30.54897 69.45103 1919000.0 1.0 AL ... 2.402146 7.182051 Alabama 52,419.02 135,765 50,744.00 131,426 1,675.01 4,338 3.20
1 Alabama 1968.0 196579.0 146923.0 691425.0 18.99448 81.00552 1993000.0 1.0 AL ... 2.630377 8.356928 Alabama 52,419.02 135,765 50,744.00 131,426 1,675.01 4,338 3.20
2 Alabama 1972.0 256923.0 728701.0 0.0 26.06704 73.93296 3540080.0 1.0 AL ... 2.679969 8.841941 Alabama 52,419.02 135,765 50,744.00 131,426 1,675.01 4,338 3.20
3 Alabama 1976.0 659170.0 504070.0 0.0 56.66673 43.33327 3737204.0 1.0 AL ... 2.255639 9.386103 Alabama 52,419.02 135,765 50,744.00 131,426 1,675.01 4,338 3.20
4 Alabama 1980.0 636730.0 654192.0 0.0 49.32366 50.67634 3894025.0 1.0 AL ... 2.171860 9.928901 Alabama 52,419.02 135,765 50,744.00 131,426 1,675.01 4,338 3.20
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
556 Wyoming 1988.0 67113.0 106867.0 0.0 38.57512 61.42488 465080.0 56.0 WY ... 1.660191 10.234900 Wyoming 97,813.56 253,336 97,100.40 251,489 713.16 1,847 0.73
557 Wyoming 1992.0 68160.0 79347.0 0.0 46.20798 53.79202 466251.0 56.0 WY ... 1.552073 10.632080 Wyoming 97,813.56 253,336 97,100.40 251,489 713.16 1,847 0.73
558 Wyoming 1996.0 77934.0 105388.0 0.0 42.51208 57.48792 488167.0 56.0 WY ... 1.544622 10.030510 Wyoming 97,813.56 253,336 97,100.40 251,489 713.16 1,847 0.73
559 Wyoming 2000.0 60481.0 147947.0 0.0 29.01769 70.98231 493782.0 56.0 WY ... 1.540592 10.686050 Wyoming 97,813.56 253,336 97,100.40 251,489 713.16 1,847 0.73
560 Wyoming 2004.0 70776.0 167629.0 0.0 29.68730 70.31271 506529.0 56.0 WY ... 1.648164 10.328780 Wyoming 97,813.56 253,336 97,100.40 251,489 713.16 1,847 0.73

561 rows × 105 columns

9.3. Reshaping Data#

Reshaping a dataframe refers to turning the rows of a dataframe into columns (also known as a long-to-wide reshape), or turning columns into rows (a wide-to-long reshape), in a way that cannot be accomplished by simply transposing the data matrix. A long-to-wide reshape is also called pivoting, and a wide-to-long reshape is also called melting. Reshaping is a notoriously tricky skill to master, and it helps a lot to understand every part of the function we need, so that we are not blindly trying every combination of parameters and hoping one of them works.
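To see both directions on a small scale before diving into a real example, here is a sketch using a tiny, made-up dataframe (the state names are real but the numbers are arbitrary):

# A wide-format dataframe: one row per state, one column per year
wide = pd.DataFrame({'state': ['Alabama', 'Alaska'],
                     '1997': [104, 25],
                     '1998': [109, 24]})

# Wide-to-long (melting): the year columns become rows,
# stored in columns named 'variable' and 'value' by default
long = pd.melt(wide, id_vars='state', value_vars=['1997', '1998'])

# Long-to-wide (pivoting): the years become columns again
wide_again = long.pivot_table(index='state', columns='variable', values='value')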

9.3.1. Example: Gross domestic product by state, 1997-2015#

To demonstrate the method for pivoting a dataframe, we use example data from the U.S. Bureau of Economic Analysis (www.bea.gov) on economic output for the American states. We load the data, skipping the last four rows because they contain attribution information and notes rather than data. We also use the .str.strip() method to remove the leading and trailing whitespace in the Description column:

gsp = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv", skipfooter=4)
gsp['Description'] = gsp['Description'].str.strip()
gsp
/var/folders/4w/k9sqqcbx4dxgpjtwgv_1m29h0000gq/T/ipykernel_83160/793934051.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
  gsp = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv", skipfooter=4)
GeoFIPS GeoName Region ComponentId ComponentName IndustryId IndustryClassification Description 1997 1998 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
0 00000 United States NaN 200 Gross domestic product (GDP) by state 1 ... All industry total 8542530 9024434 ... 13773226 14391149 14626598 14320114 14859772 15406002 16041243 16576808 17277548 17919651
1 00000 United States NaN 200 Gross domestic product (GDP) by state 2 ... Private industries 7459395 7894015 ... 12045446 12572387 12716179 12352979 12826507 13348439 13957545 14468465 15115846 15698669
2 00000 United States NaN 200 Gross domestic product (GDP) by state 3 11 Agriculture, forestry, fishing, and hunting 108796 99940 ... 128345 141999 154525 137655 160217 197241 185800 221821 203188 175236
3 00000 United States NaN 200 Gross domestic product (GDP) by state 4 111-112 Farms 88136 79030 ... 99352 113533 126345 109800 129725 166249 151489 186960 166249 (NA)
4 00000 United States NaN 200 Gross domestic product (GDP) by state 5 113-115 Forestry, fishing, and related activities 20660 20910 ... 28993 28466 28180 27855 30492 30992 34311 34861 36939 (NA)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
43287 94000 Plains 4 1000 Per capita real GDP by state 1 ... All industry total 39071 39970 ... 46013 46501 46794 45399 46210 46921 47629 48126 48778 49153
43288 95000 Southeast 5 1000 Per capita real GDP by state 1 ... All industry total 36722 37896 ... 43586 43028 42092 40346 40675 40471 40415 40487 40802 41352
43289 96000 Southwest 6 1000 Per capita real GDP by state 1 ... All industry total 38887 40283 ... 45743 46684 45914 44324 44491 45344 46881 48064 49327 50519
43290 97000 Rocky Mountain 7 1000 Per capita real GDP by state 1 ... All industry total 38243 40390 ... 46756 47587 47019 45170 44937 45060 45014 45564 46508 47093
43291 98000 Far West 8 1000 Per capita real GDP by state 1 ... All industry total 41901 43467 ... 53932 54469 53881 51328 51609 51857 52429 52942 54077 55429

43292 rows × 27 columns

For this example, we only want to save three features besides the primary keys:

  • state GDP,

  • state GDP per capita,

  • state GDP from private industries

We only want the 50 states, no regions, territories, or districts (sorry DC!). Notice that the features we want are contained in rows, and the years are contained in columns. We need to switch that: the features must be in columns, and years must be stored in the rows. We will need to perform both a wide-to-long and a long-to-wide reshape. In addition, we will give the columns concise but descriptive names, and we will create a new column that contains the percent of GDP from private industry in each state and each year.

First we need to reduce the data to only the features we need: state GDP, state GDP per capita, and state GDP from private industries. Right now, however, the features are not stored in the columns, but in the rows. Specifically, the data we need are contained in the rows in which Description is “All industry total” or “Private industries” and ComponentName is “Gross domestic product (GDP) by state” or “Per capita real GDP by state”. We use .query() to isolate these rows:

gsp_clean = gsp.query(
    "Description in ['All industry total', 'Private industries']"
).query(
    "ComponentName in ['Gross domestic product (GDP) by state','Per capita real GDP by state']")
gsp_clean
GeoFIPS GeoName Region ComponentId ComponentName IndustryId IndustryClassification Description 1997 1998 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
0 00000 United States NaN 200 Gross domestic product (GDP) by state 1 ... All industry total 8542530 9024434 ... 13773226 14391149 14626598 14320114 14859772 15406002 16041243 16576808 17277548 17919651
1 00000 United States NaN 200 Gross domestic product (GDP) by state 2 ... Private industries 7459395 7894015 ... 12045446 12572387 12716179 12352979 12826507 13348439 13957545 14468465 15115846 15698669
90 01000 Alabama 5 200 Gross domestic product (GDP) by state 1 ... All industry total 104218 109414 ... 164468 169923 172646 168315 174710 180665 185878 190095 194421 199656
91 01000 Alabama 5 200 Gross domestic product (GDP) by state 2 ... Private industries 87014 91506 ... 137954 141306 142965 137143 142773 148181 153494 157961 161364 166243
180 02000 Alaska 8 200 Gross domestic product (GDP) by state 1 ... All industry total 25446 24030 ... 44679 49197 55461 50463 54134 58759 60890 59762 58253 52747
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
43287 94000 Plains 4 1000 Per capita real GDP by state 1 ... All industry total 39071 39970 ... 46013 46501 46794 45399 46210 46921 47629 48126 48778 49153
43288 95000 Southeast 5 1000 Per capita real GDP by state 1 ... All industry total 36722 37896 ... 43586 43028 42092 40346 40675 40471 40415 40487 40802 41352
43289 96000 Southwest 6 1000 Per capita real GDP by state 1 ... All industry total 38887 40283 ... 45743 46684 45914 44324 44491 45344 46881 48064 49327 50519
43290 97000 Rocky Mountain 7 1000 Per capita real GDP by state 1 ... All industry total 38243 40390 ... 46756 47587 47019 45170 44937 45060 45014 45564 46508 47093
43291 98000 Far West 8 1000 Per capita real GDP by state 1 ... All industry total 41901 43467 ... 53932 54469 53881 51328 51609 51857 52429 52942 54077 55429

180 rows × 27 columns

Next, we only want the 50 states, no regions, territories, or districts, so we issue another .query() to remove the territories. To see these non-states, we call up a list of the unique values of GeoName:

gsp_clean['GeoName'].unique()
array(['United States', 'Alabama', 'Alaska', 'Arizona', 'Arkansas',
       'California', 'Colorado', 'Connecticut', 'Delaware',
       'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming', 'New England', 'Mideast',
       'Great Lakes', 'Plains', 'Southeast', 'Southwest',
       'Rocky Mountain', 'Far West'], dtype=object)

We need to remove the rows in which GeoName is “United States”, “District of Columbia”, “New England”, “Mideast”, “Great Lakes”, “Plains”, “Southeast”, “Southwest”, “Rocky Mountain”, or “Far West”:

gsp_clean = gsp_clean.query("GeoName not in ['United States', 'District of Columbia', 'New England', 'Mideast', 'Great Lakes', 'Plains', 'Southeast', 'Southwest', 'Rocky Mountain', 'Far West']")
gsp_clean
GeoFIPS GeoName Region ComponentId ComponentName IndustryId IndustryClassification Description 1997 1998 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
90 01000 Alabama 5 200 Gross domestic product (GDP) by state 1 ... All industry total 104218 109414 ... 164468 169923 172646 168315 174710 180665 185878 190095 194421 199656
91 01000 Alabama 5 200 Gross domestic product (GDP) by state 2 ... Private industries 87014 91506 ... 137954 141306 142965 137143 142773 148181 153494 157961 161364 166243
180 02000 Alaska 8 200 Gross domestic product (GDP) by state 1 ... All industry total 25446 24030 ... 44679 49197 55461 50463 54134 58759 60890 59762 58253 52747
181 02000 Alaska 8 200 Gross domestic product (GDP) by state 2 ... Private industries 20284 18776 ... 36737 40919 46889 41387 44742 48851 50592 49764 47879 42246
270 04000 Arizona 6 200 Gross domestic product (GDP) by state 1 ... All industry total 132708 143768 ... 248459 262045 256718 242509 245668 254192 264693 270642 280166 290903
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
43279 51000 Virginia 5 1000 Per capita real GDP by state 1 ... All industry total 43069 44745 ... 52866 52657 51985 51389 51945 51749 51538 51105 50861 51540
43280 53000 Washington 8 1000 Per capita real GDP by state 1 ... All industry total 45753 47712 ... 52900 55348 54939 52264 52681 52495 53423 53987 54773 55577
43281 54000 West Virginia 5 1000 Per capita real GDP by state 1 ... All industry total 30445 30978 ... 34009 33892 34679 34564 35368 36085 35515 35778 36234 36817
43282 55000 Wisconsin 3 1000 Per capita real GDP by state 1 ... All industry total 38663 39695 ... 45515 45464 44622 43215 44126 44905 45380 45582 46469 46893
43283 56000 Wyoming 7 1000 Per capita real GDP by state 1 ... All industry total 46585 47652 ... 63428 65471 69182 66320 64602 63981 60744 60743 61639 61389

150 rows × 27 columns

9.3.2. Melting: Turning Columns to Rows#

Notice that many of the columns refer to particular years. We want to move these columns to the rows, so that each state has one row for each year covered in the data. The best function to use when we need to turn columns into rows is pd.melt():

pd.melt(df, id_vars, value_vars)

This function takes three parameters:

  1. The name of the dataframe you wish to edit

  2. id_vars: A list of the names of the columns that we want to continue to exist as columns after the dataframe is melted. In our example, this list will contain the IDs that currently uniquely identify the rows (GeoName, ComponentName, Description), a synonym for the ID that we want to keep (GeoFIPS), and a feature that varies across IDs but not across the columns (Region).

  3. value_vars: A list of the names of the columns we want to store in the rows. If the columns have numeric names and can be sequentially ordered, as is the case in our example with years, a shortcut for creating this list is [str(i) for i in range(1997,2015)], which loops over the integers from 1997 through 2014 (range() excludes its endpoint) and places each year as a string in the list.

Any column that is not listed under either id_vars or value_vars will be deleted in the melted data.

For example, we melt the gsp_clean dataframe with the following code:

gsp_clean = pd.melt(gsp_clean, id_vars = ['GeoName', 'GeoFIPS', 'Region',
                             'ComponentName', 'Description'], 
        value_vars = [str(i) for i in range(1997,2015)])
gsp_clean
GeoName GeoFIPS Region ComponentName Description variable value
0 Alabama 01000 5 Gross domestic product (GDP) by state All industry total 1997 104218
1 Alabama 01000 5 Gross domestic product (GDP) by state Private industries 1997 87014
2 Alaska 02000 8 Gross domestic product (GDP) by state All industry total 1997 25446
3 Alaska 02000 8 Gross domestic product (GDP) by state Private industries 1997 20284
4 Arizona 04000 6 Gross domestic product (GDP) by state All industry total 1997 132708
... ... ... ... ... ... ... ...
2695 Virginia 51000 5 Per capita real GDP by state All industry total 2014 50861
2696 Washington 53000 8 Per capita real GDP by state All industry total 2014 54773
2697 West Virginia 54000 5 Per capita real GDP by state All industry total 2014 36234
2698 Wisconsin 55000 3 Per capita real GDP by state All industry total 2014 46469
2699 Wyoming 56000 7 Per capita real GDP by state All industry total 2014 61639

2700 rows × 7 columns

Note that the old column names (1997 through 2014 in this case) are always stored in a column named “variable” after melting, and the datapoints that populated those columns are contained in a column named “value”.
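As an aside, pd.melt() also accepts var_name and value_name arguments that choose these two column names during the melt itself. A sketch of how the melt above could have been written with those arguments (this would replace the earlier call rather than run after it):

gsp_clean = pd.melt(gsp_clean, id_vars = ['GeoName', 'GeoFIPS', 'Region',
                             'ComponentName', 'Description'],
        value_vars = [str(i) for i in range(1997,2015)],
        var_name='Year', value_name='value')

Passing var_name='Year' would make the renaming of the variable column below unnecessary, although we would still rename GeoName and GeoFIPS.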

Before moving on, let’s rename some of the columns to avoid confusion in the next step:

gsp_clean = gsp_clean.rename({'GeoName':'State',
                             'GeoFIPS':'FIPS',
                             'variable':'Year'}, axis=1)
gsp_clean
State FIPS Region ComponentName Description Year value
0 Alabama 01000 5 Gross domestic product (GDP) by state All industry total 1997 104218
1 Alabama 01000 5 Gross domestic product (GDP) by state Private industries 1997 87014
2 Alaska 02000 8 Gross domestic product (GDP) by state All industry total 1997 25446
3 Alaska 02000 8 Gross domestic product (GDP) by state Private industries 1997 20284
4 Arizona 04000 6 Gross domestic product (GDP) by state All industry total 1997 132708
... ... ... ... ... ... ... ...
2695 Virginia 51000 5 Per capita real GDP by state All industry total 2014 50861
2696 Washington 53000 8 Per capita real GDP by state All industry total 2014 54773
2697 West Virginia 54000 5 Per capita real GDP by state All industry total 2014 36234
2698 Wisconsin 55000 3 Per capita real GDP by state All industry total 2014 46469
2699 Wyoming 56000 7 Per capita real GDP by state All industry total 2014 61639

2700 rows × 7 columns

9.3.3. Pivoting: Turning Rows to Columns#

The rows contain features that we want to move to the columns. Specifically, we need a separate column for each combination of ComponentName and Description. Because the feature names are spread across these two columns, we first concatenate them into a new column called “feature”, and then drop ComponentName and Description:

gsp_clean = gsp_clean.assign(feature = gsp_clean['ComponentName'] + gsp_clean['Description'])
gsp_clean = gsp_clean.drop(['ComponentName', 'Description'], axis=1)
gsp_clean
State FIPS Region Year value feature
0 Alabama 01000 5 1997 104218 Gross domestic product (GDP) by stateAll indus...
1 Alabama 01000 5 1997 87014 Gross domestic product (GDP) by statePrivate i...
2 Alaska 02000 8 1997 25446 Gross domestic product (GDP) by stateAll indus...
3 Alaska 02000 8 1997 20284 Gross domestic product (GDP) by statePrivate i...
4 Arizona 04000 6 1997 132708 Gross domestic product (GDP) by stateAll indus...
... ... ... ... ... ... ...
2695 Virginia 51000 5 2014 50861 Per capita real GDP by stateAll industry total
2696 Washington 53000 8 2014 54773 Per capita real GDP by stateAll industry total
2697 West Virginia 54000 5 2014 36234 Per capita real GDP by stateAll industry total
2698 Wisconsin 55000 3 2014 46469 Per capita real GDP by stateAll industry total
2699 Wyoming 56000 7 2014 61639 Per capita real GDP by stateAll industry total

2700 rows × 6 columns

To move these features from the rows to columns, apply the .pivot_table() method to a dataframe:

df.pivot_table(index, columns, values)

To understand this method, pay close attention to its parameters:

  1. index - a list containing the names of the current columns that we want to remain columns in the reshaped data. These include the primary keys (State and Year in this example), synonyms for a key (FIPS), and features that vary only by ID (Region).

  2. columns - the name of the column that contains the names of the new columns we are trying to create.

  3. values - the name of the column that contains the datapoints we are trying to move to the new columns.

There is one important issue to keep in mind when using the .pivot_table() method. .pivot_table() contains default behavior to handle cases in which the keys in index do not uniquely identify the rows of a dataframe. By default, .pivot_table() takes the mean within group. If the column specified within values is non-numeric, however, the mean is not defined. So first make sure that any columns that are numeric are recognized as a numeric data type. If a column contains text, categories, or other non-numeric data, specify aggfunc='first' inside .pivot_table(). Writing aggfunc='first' tells .pivot_table() to use the first cell within a group instead of calculating the mean within a group, and the first cell is defined for non-numeric columns. If the columns in index do uniquely identify rows, there is only one cell per group anyway, so it is fine to take the first cell.
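To see this default behavior concretely, here is a small, made-up example in which the index does not uniquely identify the rows (toy and its values are hypothetical):

toy = pd.DataFrame({'id': ['A', 'A', 'B'],
                    'feature': ['x', 'x', 'x'],
                    'value': [10, 20, 7]})

# Default aggfunc is the mean: the two 'A' rows collapse to 15
toy.pivot_table(index='id', columns='feature', values='value')

# aggfunc='first' keeps the first value in each group instead: 10 for 'A'
toy.pivot_table(index='id', columns='feature', values='value', aggfunc='first')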

Presently within gsp_clean the datapoints are contained within the value column. Unfortunately, value is not currently understood to be numeric:

gsp_clean.dtypes
State      object
FIPS       object
Region     object
Year       object
value      object
feature    object
dtype: object

Following our code from module 8, we convert value to a numeric class:

gsp_clean.value = gsp_clean.value.astype('float')
gsp_clean.dtypes
State       object
FIPS        object
Region      object
Year        object
value      float64
feature     object
dtype: object

Now we can reshape the data with .pivot_table():

gsp_clean['value'] = gsp_clean['value'].astype(int) # first convert this column to int 
gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'], 
                                  columns='feature', 
                                  values='value')
gsp_clean
feature Gross domestic product (GDP) by stateAll industry total Gross domestic product (GDP) by statePrivate industries Per capita real GDP by stateAll industry total
State FIPS Region Year
Alabama 01000 5 1997 104218 87014 31398
1998 109414 91506 32164
1999 115015 96284 33106
2000 119020 99665 33284
2001 122822 102978 33312
... ... ... ... ... ... ...
Wyoming 56000 7 2010 39103 33832 64602
2011 41499 36164 63981
2012 40201 34604 60744
2013 40979 35096 60743
2014 42021 35947 61639

900 rows × 3 columns

Alternatively, if we choose to leave the value column as an object data type, we can use this code instead to perform the long-to-wide reshape:

gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'], 
                                  columns='feature', 
                                  values='value', 
                                  aggfunc='first')

Note that the dataframe is not formatted in a flat way, because the IDs and related features are stored in the row index instead of in columns. To convert the dataframe back to a standard format, use the .to_records() method within the pd.DataFrame() function:

gsp_clean = pd.DataFrame(gsp_clean.to_records())
gsp_clean
State FIPS Region Year Gross domestic product (GDP) by stateAll industry total Gross domestic product (GDP) by statePrivate industries Per capita real GDP by stateAll industry total
0 Alabama 01000 5 1997 104218 87014 31398
1 Alabama 01000 5 1998 109414 91506 32164
2 Alabama 01000 5 1999 115015 96284 33106
3 Alabama 01000 5 2000 119020 99665 33284
4 Alabama 01000 5 2001 122822 102978 33312
... ... ... ... ... ... ... ...
895 Wyoming 56000 7 2010 39103 33832 64602
896 Wyoming 56000 7 2011 41499 36164 63981
897 Wyoming 56000 7 2012 40201 34604 60744
898 Wyoming 56000 7 2013 40979 35096 60743
899 Wyoming 56000 7 2014 42021 35947 61639

900 rows × 7 columns

All that’s left to clean the data is to rename the columns and to create a new column containing the percent of the state’s GDP that comes from private industry:

gsp_clean = gsp_clean.rename({'Gross domestic product (GDP) by stateAll industry total':'GDP',
                             'Gross domestic product (GDP) by statePrivate industries':'GDPprivate',
                             'Per capita real GDP by stateAll industry total':'GDPpc'}, 
                             axis=1)
gsp_clean = gsp_clean.assign(percent_private = 100* gsp_clean.GDPprivate / gsp_clean.GDP)
gsp_clean
State FIPS Region Year GDP GDPprivate GDPpc percent_private
0 Alabama 01000 5 1997 104218 87014 31398 83.492295
1 Alabama 01000 5 1998 109414 91506 32164 83.632808
2 Alabama 01000 5 1999 115015 96284 33106 83.714298
3 Alabama 01000 5 2000 119020 99665 33284 83.738027
4 Alabama 01000 5 2001 122822 102978 33312 83.843285
... ... ... ... ... ... ... ... ...
895 Wyoming 56000 7 2010 39103 33832 64602 86.520216
896 Wyoming 56000 7 2011 41499 36164 63981 87.144269
897 Wyoming 56000 7 2012 40201 34604 60744 86.077461
898 Wyoming 56000 7 2013 40979 35096 60743 85.643866
899 Wyoming 56000 7 2014 42021 35947 61639 85.545323

900 rows × 8 columns

9.4. Working with Strings (Example: the 2019 ANES Pilot Study)#

In module 8, we worked with the 2019 American National Election Study pilot survey and saved the cleaned data in a separate CSV file:

anes = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/anes_clean.csv")
anes
caseid liveurban vote16 protest vote most_important_issue confecon ideology partyID universal_income ... partisanship ftbiden_level age age2 ftbiden_float ftbiden_cat ftbiden_str prefersbiden worried_econ favor_both
0 1 Suburb Someone else False Joe Biden Health Care A little worried Conservative Democrat Favor a moderate amount ... 5.0 neutral 51 2601 52.0 52.0 52.0 True False True
1 2 Suburb Donald Trump False Donald Trump Working together A little worried Conservative Republican Oppose a moderate amount ... 0.0 neutral 78 6084 41.0 41.0 41.0 False False False
2 3 Rural Hillary Clinton False Joe Biden health care Extremely worried Moderate Democrat Neither favor nor oppose ... 88.0 like 66 4356 88.0 88.0 88.0 True True False
3 4 City Hillary Clinton False Donald Trump The economy. A little worried Moderate Democrat Neither favor nor oppose ... -100.0 dislike 41 1681 0.0 0.0 0.0 False False False
4 5 City Donald Trump False Donald Trump China Not at all worried Conservative Republican Oppose a great deal ... -69.0 dislike 80 6400 25.0 25.0 25.0 False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3160 3161 Town Donald Trump False Donald Trump The infiltration of Marxists into the institut... A little worried Conservative Republican Oppose a great deal ... -74.0 dislike 72 5184 7.0 7.0 7.0 False False False
3161 3162 City Someone else False Someone else Lack of basic resources being provided and off... Extremely worried NaN Democrat Favor a great deal ... -10.0 dislike 24 576 25.0 25.0 25.0 False True True
3162 3163 City Did not vote False Probably will not vote donald trump Very worried Liberal Independent Oppose a great deal ... 44.0 neutral 40 1600 50.0 50.0 50.0 True False False
3163 3164 Suburb Did not vote False Joe Biden Donald Trump Moderately worried Liberal Democrat Favor a moderate amount ... 94.0 like 60 3600 95.0 95.0 95.0 True True False
3164 3165 Town Hillary Clinton False Joe Biden trump Extremely worried Moderate Democrat Oppose a great deal ... 70.0 neutral 60 3600 70.0 70.0 70.0 True True False

3165 rows × 62 columns

The data contain a column most_important_issue which records individuals’ responses to the question of what they feel is the most important issue facing the United States as of December 2019. These responses are in the respondents’ own words. For example, one response is

anes.most_important_issue[5]
'The influence of big money on our political system both domestic and foreign'

These responses are strings, and we can employ pandas’s suite of string operations to manipulate this column. String methods are collected under the .str accessor, so we call .str on the column before applying a string method.

One important technique is to alter the case within the text so that we can identify words without worrying about case sensitivity. To turn all responses to lowercase, use .str.lower():

anes.most_important_issue.str.lower()
0                                             health care
1                                        working together
2                                             health care
3                                            the economy.
4                                                   china
                              ...                        
3160    the infiltration of marxists into the institut...
3161    lack of basic resources being provided and off...
3162                                         donald trump
3163                                         donald trump
3164                                                trump
Name: most_important_issue, Length: 3165, dtype: object

And to turn all the text to uppercase, use .str.upper():

anes.most_important_issue.str.upper()
0                                             HEALTH CARE
1                                        WORKING TOGETHER
2                                             HEALTH CARE
3                                            THE ECONOMY.
4                                                   CHINA
                              ...                        
3160    THE INFILTRATION OF MARXISTS INTO THE INSTITUT...
3161    LACK OF BASIC RESOURCES BEING PROVIDED AND OFF...
3162                                         DONALD TRUMP
3163                                         DONALD TRUMP
3164                                                TRUMP
Name: most_important_issue, Length: 3165, dtype: object

One issue that can inhibit our ability to search through text is the existence of leading and trailing whitespace in the responses. Whitespace can exist, invisibly, for several reasons; for example, the data authors may have padded the responses with spaces to standardize the width of the column in the data file. To remove both leading and trailing whitespace, use .str.strip():

anes.most_important_issue = anes.most_important_issue.str.strip()
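
To see what this buys us, here is a minimal sketch with a couple of made-up responses (the padding and values are hypothetical, purely for illustration):

responses = pd.Series(['  health care ', 'the economy.  '])
responses.str.len()                 # 14 and 14: the invisible padding counts toward the length
responses.str.strip().str.len()     # 11 and 12 once the whitespace is removed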

The .replace() method finds and replaces entire entries that match a specific string. For example, we can replace every response that reads “health care” (after lowercasing) with “hospital stuff”:

anes.most_important_issue.str.lower().replace('health care', 'hospital stuff')
0                                          hospital stuff
1                                        working together
2                                          hospital stuff
3                                            the economy.
4                                                   china
                              ...                        
3160    the infiltration of marxists into the institut...
3161    lack of basic resources being provided and off...
3162                                         donald trump
3163                                         donald trump
3164                                                trump
Name: most_important_issue, Length: 3165, dtype: object

If we want to replace pieces of these strings instead, we can pass the regex=True argument. The following line of code converts the strings to lowercase and then replaces every occurrence of “trump” with “twimp”. With regex=True, .replace() substitutes the pattern “trump” anywhere it appears, as opposed to the default regex=False, which only replaces entries that are exactly “trump”:

anes.most_important_issue.str.lower().replace('trump', 'twimp', regex=True)
0                                             health care
1                                        working together
2                                             health care
3                                            the economy.
4                                                   china
                              ...                        
3160    the infiltration of marxists into the institut...
3161    lack of basic resources being provided and off...
3162                                         donald twimp
3163                                         donald twimp
3164                                                twimp
Name: most_important_issue, Length: 3165, dtype: object

To create a new column that identifies when particular words are used, use either the .str.match() or .str.contains() method. .str.match() is true if the beginning of the entry matches the provided pattern,

anes.most_important_issue.str.lower().str.match('trump')
0       False
1       False
2       False
3       False
4       False
        ...  
3160    False
3161    False
3162    False
3163    False
3164     True
Name: most_important_issue, Length: 3165, dtype: object

and .str.contains() is true if the provided string exists anywhere within the entry.

anes.most_important_issue.str.lower().str.contains('trump')
0       False
1       False
2       False
3       False
4       False
        ...  
3160    False
3161    False
3162     True
3163     True
3164     True
Name: most_important_issue, Length: 3165, dtype: object

These columns of logical values can be placed into the data to filter rows or for other purposes. For example, we can see the number of Democrats, Republicans, and independents who use the word “trump” in their statement of the country’s most important problem:

anes['problem_trump'] = anes.most_important_issue.str.lower().str.contains('trump')
pd.DataFrame(anes.groupby(['partyID', 'problem_trump']).size().unstack())
problem_trump False True
partyID
Democrat 1029 293
Independent 479 47
Republican 1121 73

The .str.replace(), .str.match(), and .str.contains() methods can also accept regular expressions for identifying additional patterns within the string. Regular expressions are beyond our scope in this document, but here is a good discussion of regular expressions in Python.
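
For instance, a regular expression can match several variants of a phrase at once. Here is a minimal sketch (the pattern is just an illustration, not part of the original analysis):

# True whenever a response mentions "health care" or "healthcare";
# na=False counts missing responses as False instead of NaN
anes.most_important_issue.str.lower().str.contains(r'health ?care', regex=True, na=False)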

To calculate the length of each response in terms of the number of characters, use .str.len(). In this case, it is important to make sure the whitespace is removed first with .str.strip(), otherwise the spaces will count towards the length:

anes.most_important_issue.str.len()
0        11.0
1        16.0
2        11.0
3        12.0
4         5.0
        ...  
3160    254.0
3161    325.0
3162     12.0
3163     12.0
3164      5.0
Name: most_important_issue, Length: 3165, dtype: float64

We can use the length to display, for example, the single longest response in the data. To display the whole string, turn off the display limit by typing pd.options.display.max_colwidth = None. Here I set the limit to 500 instead, because the longest response goes off the rails a bit and I’d rather not display all of it:

anes['length'] = anes.most_important_issue.str.len()
pd.options.display.max_colwidth = 500
anes.sort_values(by = 'length', ascending = False).most_important_issue.head(1)
819    Immigration has been and has become a very big problem. The gangs, drug wars, the crime has climbed in every single town, county, state in this country and I believe it all revolves around too many different races, too many prejudices. Most people I talk to dislike foreigners and most foreigners hate Americans. We ask, why come here then? Because they get financial aid, housing, medical, dental aid and compared to the wars, poverty and the way they lived in their own country, America is heav...
Name: most_important_issue, dtype: object

After all that, we set the string display limit back to the default of 50 characters:

pd.options.display.max_colwidth = 50

In some cases it might be necessary to split entries into different features according to a delimiter or pattern. It doesn’t make sense to split the most_important_issue column in this example, but as an illustration, we can split the responses on periods:

anes.most_important_issue.str.split('.')
0                                           [Health Care]
1                                      [Working together]
2                                           [health care]
3                                         [The economy, ]
4                                                 [China]
                              ...                        
3160    [The infiltration of Marxists into the institu...
3161    [Lack of basic resources being provided and of...
3162                                       [donald trump]
3163                                       [Donald Trump]
3164                                              [trump]
Name: most_important_issue, Length: 3165, dtype: object

The expand=True argument displays these split strings in different columns in a dataframe:

anes.most_important_issue.str.split('.', expand=True)
0 1 2 3 4 5 6 7 8 9 ... 19 20 21 22 23 24 25 26 27 28
0 Health Care None None None None None None None None None ... None None None None None None None None None None
1 Working together None None None None None None None None None ... None None None None None None None None None None
2 health care None None None None None None None None None ... None None None None None None None None None None
3 The economy None None None None None None None None ... None None None None None None None None None None
4 China None None None None None None None None None ... None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3160 The infiltration of Marxists into the institut... Money seems to be no object to them Their over promising will lead to eventual chaos Usually that leads to the over promised event... None None None None None ... None None None None None None None None None None
3161 Lack of basic resources being provided and off... This is a first world nation, a nation claimi... yet we don't have housing or healthcare secur... That needs to change and it needs to be chang... A basic standard of living needs to be guaran... None None None None ... None None None None None None None None None None
3162 donald trump None None None None None None None None None ... None None None None None None None None None None
3163 Donald Trump None None None None None None None None None ... None None None None None None None None None None
3164 trump None None None None None None None None None ... None None None None None None None None None None

3165 rows × 29 columns

Writing n=1 only splits on the first occurrence of the delimiter, n=2 splits on the first and second occurrence, and so on:

anes.most_important_issue.str.split('.', expand=True, n=2)
0 1 2
0 Health Care None None
1 Working together None None
2 health care None None
3 The economy None
4 China None None
... ... ... ...
3160 The infiltration of Marxists into the institut... Money seems to be no object to them Their over promising will lead to eventual ch...
3161 Lack of basic resources being provided and off... This is a first world nation, a nation claimi... yet we don't have housing or healthcare secur...
3162 donald trump None None
3163 Donald Trump None None
3164 trump None None

3165 rows × 3 columns

An individual string can be indexed according to character number. These indices can be applied to every entry in a string column as well. To pull just the first five characters out of each response in most_important_issue, we can type:

anes.most_important_issue.str[0:5]
0       Healt
1       Worki
2       healt
3       The e
4       China
        ...  
3160    The i
3161    Lack 
3162    donal
3163    Donal
3164    trump
Name: most_important_issue, Length: 3165, dtype: object

To pull the characters in positions 6 through 10 (counting from zero, as Python does), we type

anes.most_important_issue.str[6:11]
0        Care
1       g tog
2        care
3       onomy
4            
        ...  
3160    filtr
3161    f bas
3162     trum
3163     Trum
3164         
Name: most_important_issue, Length: 3165, dtype: object

To pull the last four characters, use a negative number to begin the range, and leave the end of the range blank:

anes.most_important_issue.str[-4:]
0       Care
1       ther
2       care
3       omy.
4       hina
        ... 
3160    ers.
3161    eed.
3162    rump
3163    rump
3164    rump
Name: most_important_issue, Length: 3165, dtype: object

9.5. Working with Dates and Times (Example: Twitter)#

In module 4 we discussed Twitter as an example of how to use an API in Python. We save the four credentials we need to access the API (the consumer key, the consumer secret, the access token, and the access token secret) in a .env file, and we load those keys into Python without displaying them in our notebooks by using the dotenv package:

import dotenv
import os
os.chdir('/Users/jk8sd/Box Sync/Practice and Applications 1 online/Module 9 - Data Managenent in pandas Part 2')
dotenv.load_dotenv()
ConsumerKey = os.getenv('ConsumerKey')
ConsumerSecret = os.getenv('ConsumerSecret')
AccessToken = os.getenv('AccessToken')
AccessTokenSecret = os.getenv('AccessTokenSecret')

Then we use the tweepy package to work with the Twitter API: we authenticate with our credentials and create an API object:

import tweepy
auth = tweepy.OAuthHandler(ConsumerKey, ConsumerSecret)
auth.set_access_token(AccessToken, AccessTokenSecret)
api = tweepy.API(auth)

The following code extracts 1000 tweets that contain the hashtag “#uva”:

msgs = []
msg = []

for tweet in tweepy.Cursor(api.search, q='#uva').items(1000):
    msg = [tweet.text, tweet.created_at, tweet.user.screen_name] 
    msg = tuple(msg)                    
    msgs.append(msg)

tweets = pd.DataFrame(msgs, columns = ['text', 'created_at', 'user'])
tweets
text created_at user
0 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 SweetLickKing
1 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 Daniel_B_Ennis
2 RT @KathrynsScGifts: "OGX Fade-defying + Orchi... 2020-06-30 01:55:02 blimeyonline1
3 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 Wahoos247
4 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 JamieOakes247
... ... ... ...
995 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... 2020-06-23 15:12:08 norahlfy
996 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... 2020-06-23 15:08:40 AgNPalabras
997 Zur Einstimmung #UVA vom letzten #gig bei uns ... 2020-06-23 14:52:23 WeAppU
998 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... 2020-06-23 14:49:22 VirialexViri
999 Good question... My instinct says Ty Jerome, b... 2020-06-23 14:32:44 The_Superhoo

1000 rows × 3 columns

9.5.1. Extracting Year, Month, Day, and Time from a Timestamp Column#

Note that the data contain timestamps that include the date and time, to the second, of each tweet. The created_at column has a datetime64 data type:

tweets.dtypes
text                  object
created_at    datetime64[ns]
user                  object
dtype: object

The difference between a datetime64 data type and an object or numeric type is that we have the ability to extract individual elements of time from a datetime64 value. To pull out the year, month, day, hour, minute, and second of the timestamp, we use the following attributes:

[tweets.created_at[0].month, tweets.created_at[0].day, 
 tweets.created_at[0].year, tweets.created_at[0].hour, 
 tweets.created_at[0].minute, tweets.created_at[0].second]
[6, 30, 2020, 2, 20, 43]

One way to create new columns with these elements for all the values in the dataframe is to use comprehension loops:

tweets['month'] = [x.month for x in tweets.created_at]
tweets['day'] = [x.day for x in tweets.created_at]
tweets['year'] = [x.year for x in tweets.created_at]
tweets['hour'] = [x.hour for x in tweets.created_at]
tweets['minute'] = [x.minute for x in tweets.created_at]
tweets['second'] = [x.second for x in tweets.created_at]
tweets
text created_at user month day year hour minute second
0 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 SweetLickKing 6 30 2020 2 20 43
1 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 Daniel_B_Ennis 6 30 2020 2 18 17
2 RT @KathrynsScGifts: "OGX Fade-defying + Orchi... 2020-06-30 01:55:02 blimeyonline1 6 30 2020 1 55 2
3 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 Wahoos247 6 30 2020 1 40 0
4 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 JamieOakes247 6 30 2020 1 39 42
... ... ... ... ... ... ... ... ... ...
995 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... 2020-06-23 15:12:08 norahlfy 6 23 2020 15 12 8
996 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... 2020-06-23 15:08:40 AgNPalabras 6 23 2020 15 8 40
997 Zur Einstimmung #UVA vom letzten #gig bei uns ... 2020-06-23 14:52:23 WeAppU 6 23 2020 14 52 23
998 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... 2020-06-23 14:49:22 VirialexViri 6 23 2020 14 49 22
999 Good question... My instinct says Ty Jerome, b... 2020-06-23 14:32:44 The_Superhoo 6 23 2020 14 32 44

1000 rows × 9 columns
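
As an aside, the same columns can be created without loops by using the .dt accessor, which exposes these time elements for an entire datetime64 column at once. A minimal sketch of that alternative:

# Vectorized alternative to the comprehension loops above
tweets['month'] = tweets.created_at.dt.month
tweets['day'] = tweets.created_at.dt.day
tweets['year'] = tweets.created_at.dt.year
tweets['hour'] = tweets.created_at.dt.hour
tweets['minute'] = tweets.created_at.dt.minute
tweets['second'] = tweets.created_at.dt.second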

9.5.2. Generating Timestamps from Separate Month, Day, and Year Columns#

Sometimes raw data will come to us with these time elements already placed in separate columns, so that we have a month, day, and year column but no date column. There are important advantages to placing all of this information in a single column of values that Python understands as timestamps: timestamps are easier to filter, manipulate, index, and plot. To create one timestamp from separate columns that contain the month, day, and year (and optionally the hour, minute, and second), use the pd.to_datetime() function:

pd.to_datetime(df)

where df is a subset of the dataframe that contains the separate columns for the time elements. In order for this function to work, these columns must be named year, month, and day, and optionally hour, minute, and second. If these columns are named something else, first use the .rename() method to change the names to the ones listed above.
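
For instance, if the raw data arrived with hypothetical column names yr, mo, and dy (names chosen purely for illustration), the renaming step might look like this:

# Hypothetical raw column names; rename them to the names pd.to_datetime() expects
df = df.rename({'yr': 'year', 'mo': 'month', 'dy': 'day'}, axis=1)
df['timestamp'] = pd.to_datetime(df[['year', 'month', 'day']])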

To create a timestamp from the individual time element columns, we can type:

tweets['timestamp'] = pd.to_datetime(tweets[['year', 'month', 'day', 'hour', 'minute', 'second']])
tweets
text created_at user month day year hour minute second timestamp
0 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 SweetLickKing 6 30 2020 2 20 43 2020-06-30 02:20:43
1 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 Daniel_B_Ennis 6 30 2020 2 18 17 2020-06-30 02:18:17
2 RT @KathrynsScGifts: "OGX Fade-defying + Orchi... 2020-06-30 01:55:02 blimeyonline1 6 30 2020 1 55 2 2020-06-30 01:55:02
3 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 Wahoos247 6 30 2020 1 40 0 2020-06-30 01:40:00
4 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 JamieOakes247 6 30 2020 1 39 42 2020-06-30 01:39:42
... ... ... ... ... ... ... ... ... ... ...
995 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... 2020-06-23 15:12:08 norahlfy 6 23 2020 15 12 8 2020-06-23 15:12:08
996 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... 2020-06-23 15:08:40 AgNPalabras 6 23 2020 15 8 40 2020-06-23 15:08:40
997 Zur Einstimmung #UVA vom letzten #gig bei uns ... 2020-06-23 14:52:23 WeAppU 6 23 2020 14 52 23 2020-06-23 14:52:23
998 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... 2020-06-23 14:49:22 VirialexViri 6 23 2020 14 49 22 2020-06-23 14:49:22
999 Good question... My instinct says Ty Jerome, b... 2020-06-23 14:32:44 The_Superhoo 6 23 2020 14 32 44 2020-06-23 14:32:44

1000 rows × 10 columns

Like created_at, the new timestamp column is also of the datetime64 data type:

tweets.dtypes
text                  object
created_at    datetime64[ns]
user                  object
month                  int64
day                    int64
year                   int64
hour                   int64
minute                 int64
second                 int64
timestamp     datetime64[ns]
dtype: object

9.5.3. Converting String Columns to Timestamps#

Sometimes information about the date and time of an event in the data will be coded as a string. In that case, to use the time series functionality on the dates in a dataframe, the string column that contains the dates and times will need to be converted to the datetime64 data type. The best way to do that is to use pd.to_datetime() once again, this time passing it the column that contains the dates as strings. The function can identify the year, month, day, hour, minute, and second from many different formats.

For example, we can write the date Sunday, June 19, 2016 at 8:00 PM (the date and time of game 7 of the 2016 NBA finals, the greatest moment in sports history) in many different ways such as:

  • “Sunday, June 19, 2016, 8:00 PM”

  • “6-19-2016 8PM”

  • “6/19/16 8 p.m.”

  • “19/6/16 8pm”

The pd.to_datetime() function reads all of these formats:

pd.to_datetime("Sunday, June 19, 2016, 8:00 PM")
Timestamp('2016-06-19 20:00:00')
pd.to_datetime("6-19-2016 8PM")
Timestamp('2016-06-19 20:00:00')
pd.to_datetime("6/19/16 8 p.m.")
Timestamp('2016-06-19 20:00:00')
pd.to_datetime("19/6/16 8pm")
Timestamp('2016-06-19 20:00:00')

The pd.to_datetime() function can feel like a magic trick because it automatically detects where the different elements of the date and time are located in the string. However, it can go wrong, and it is important to check that the date was correctly understood by pd.to_datetime(). Suppose for example that the date was June 19, 1945, coded as “6/19/45”. pd.to_datetime() reads the date as 2045:

pd.to_datetime("6/19/45")
Timestamp('2045-06-19 00:00:00')

The easiest way to fix this problem would be to break the timestamp up into separate columns for year, month, and day, subtract 100 from the years that were placed in the wrong century, and recombine these columns into a new timestamp with pd.to_datetime(). But the real danger here is assuming the pd.to_datetime() function worked without confirmation.
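
Here is a minimal sketch of that fix, using a made-up example and assuming every two-digit year really belongs to the 1900s:

# Hypothetical example: dates that should all fall in the 1900s
df = pd.DataFrame({'date': ['6/19/45', '6/19/52']})
df['date'] = pd.to_datetime(df['date'])          # parsed as 2045-06-19 and 2052-06-19

# Break the timestamps into parts, fix the century, and reassemble
parts = pd.DataFrame({'year': df.date.dt.year,
                      'month': df.date.dt.month,
                      'day': df.date.dt.day})
parts.loc[parts.year >= 2000, 'year'] -= 100     # assumes all dates belong to the 1900s
df['date'] = pd.to_datetime(parts)               # now 1945-06-19 and 1952-06-19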

The pd.to_datetime() function also works on an entire column of dates coded as strings. For example, let’s convert the created_at column to string:

tweets['created_at'] = tweets.created_at.astype('str')
tweets.dtypes
text                  object
created_at            object
user                  object
month                  int64
day                    int64
year                   int64
hour                   int64
minute                 int64
second                 int64
timestamp     datetime64[ns]
dtype: object

To convert the created_at column back to a timestamp, we can apply pd.to_datetime():

tweets['created_at'] = pd.to_datetime(tweets.created_at)
tweets.dtypes
text                  object
created_at    datetime64[ns]
user                  object
month                  int64
day                    int64
year                   int64
hour                   int64
minute                 int64
second                 int64
timestamp     datetime64[ns]
dtype: object

9.5.4. Filtering Rows Based on Date Ranges#

If every row in the data has a unique timestamp, then one convenient way to use the times to filter the data is to set the index (the row labels) to be equal to the timestamps:

tweets.index = tweets['created_at']

Because the row labels are now timestamps, we can select rows by passing a date written in almost any format that pd.to_datetime() understands. For example, to see all the tweets made on June 30, we can type:

tweets['June 30, 2020']
text created_at user month day year hour minute second timestamp
created_at
2020-06-30 02:20:43 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 SweetLickKing 6 30 2020 2 20 43 2020-06-30 02:20:43
2020-06-30 02:18:17 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 Daniel_B_Ennis 6 30 2020 2 18 17 2020-06-30 02:18:17
2020-06-30 01:55:02 RT @KathrynsScGifts: "OGX Fade-defying + Orchi... 2020-06-30 01:55:02 blimeyonline1 6 30 2020 1 55 2 2020-06-30 01:55:02
2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 Wahoos247 6 30 2020 1 40 0 2020-06-30 01:40:00
2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 JamieOakes247 6 30 2020 1 39 42 2020-06-30 01:39:42
2020-06-30 01:30:45 RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... 2020-06-30 01:30:45 sumanebot 6 30 2020 1 30 45 2020-06-30 01:30:45
2020-06-30 01:08:08 We are🎉excited to🗣announce opening for in-pers... 2020-06-30 01:08:08 GrandMarc_UVA 6 30 2020 1 8 8 2020-06-30 01:08:08
2020-06-30 00:20:15 RT @cavalierinsider: On Saturday, The Basketba... 2020-06-30 00:20:15 annefutch 6 30 2020 0 20 15 2020-06-30 00:20:15
2020-06-30 00:19:00 Anthony Harris part of NFL's top safety tandem... 2020-06-30 00:19:00 hoosdaily 6 30 2020 0 19 0 2020-06-30 00:19:00

If the row labels are sorted chronologically, then we can extract slices of the data that fall within a time range. We write two dates separated by a colon, and the output contains all rows from the row that matches the first date through the row that matches the second date. Because these tweets are listed with the most recent ones first, we write the later timestamp first and the earlier timestamp second. If these exact dates and times do not exist in the data, the syntax still finds the positions where those rows would exist and extracts the rows that fall between them. Here is code to extract all tweets that were posted between 2pm and 3pm on June 29:

tweets['6/29/2020 15:00':'6/29/2020 14:00']
text created_at user month day year hour minute second timestamp
created_at
2020-06-29 14:54:31 UVa alumni team opts out of 2020 TBT, plans to... 2020-06-29 14:54:31 hoosdaily 6 29 2020 14 54 31 2020-06-29 14:54:31
2020-06-29 14:54:26 Ben Wallace Is a Proud Dad of Three Kids — Mee... 2020-06-29 14:54:26 hoosdaily 6 29 2020 14 54 26 2020-06-29 14:54:26
2020-06-29 14:51:16 Inicia Sonora exportación de uva a Corea del S... 2020-06-29 14:51:16 AgroTratos 6 29 2020 14 51 16 2020-06-29 14:51:16
2020-06-29 14:50:46 RT @Wahoos247: Wake Forest commitment Christia... 2020-06-29 14:50:46 FatWhite101 6 29 2020 14 50 46 2020-06-29 14:50:46
2020-06-29 14:42:28 RT @Cavs_Corner: Film Room: In the next instal... 2020-06-29 14:42:28 oleuva 6 29 2020 14 42 28 2020-06-29 14:42:28
2020-06-29 14:40:56 RT @cavalierinsider: On Saturday, The Basketba... 2020-06-29 14:40:56 John_Shifflett 6 29 2020 14 40 56 2020-06-29 14:40:56
2020-06-29 14:37:11 Tatil planları yapmaya başladıysan sana harika... 2020-06-29 14:37:11 YasinALTINEL 6 29 2020 14 37 11 2020-06-29 14:37:11
2020-06-29 14:35:52 On Saturday, The Basketball Tournament will be... 2020-06-29 14:35:52 cavalierinsider 6 29 2020 14 35 52 2020-06-29 14:35:52
2020-06-29 14:29:17 Jesse Rutherford of the Nelson County Board of... 2020-06-29 14:29:17 JerryMillerNow 6 29 2020 14 29 17 2020-06-29 14:29:17
2020-06-29 14:20:58 RT @thesabre: NewsLink is updated! #UVA sports... 2020-06-29 14:20:58 Slider_Hoos 6 29 2020 14 20 58 2020-06-29 14:20:58
2020-06-29 14:20:44 RT @Wahoos247: Wake Forest commitment Christia... 2020-06-29 14:20:44 agee_brandon 6 29 2020 14 20 44 2020-06-29 14:20:44
2020-06-29 14:18:33 Ramseyer fondly recalls Cavs' perfect regular ... 2020-06-29 14:18:33 hoosdaily 6 29 2020 14 18 33 2020-06-29 14:18:33
2020-06-29 14:15:32 Ben Wallace Is a Proud Dad of Three Kids — Mee... 2020-06-29 14:15:32 hoosdaily 6 29 2020 14 15 32 2020-06-29 14:15:32
2020-06-29 14:11:08 NewsLink is updated! #UVA sports links all in ... 2020-06-29 14:11:08 thesabre 6 29 2020 14 11 8 2020-06-29 14:11:08

To extract all tweets from before a point in time, write the timestamp before the colon and nothing after it; to extract all tweets from after a point in time, write nothing before the colon and the timestamp after it (remember that the rows run from most recent to oldest). To extract all tweets posted after June 29 at 3pm:

tweets[:'6/29/2020 15:00']
text created_at user month day year hour minute second timestamp
created_at
2020-06-30 02:20:43 @KymoraJohnson_ let me call @IamTinaThompson s... 2020-06-30 02:20:43 SweetLickKing 6 30 2020 2 20 43 2020-06-30 02:20:43
2020-06-30 02:18:17 Congrats Eric. It’s was really great to have w... 2020-06-30 02:18:17 Daniel_B_Ennis 6 30 2020 2 18 17 2020-06-30 02:18:17
2020-06-30 01:55:02 RT @KathrynsScGifts: "OGX Fade-defying + Orchi... 2020-06-30 01:55:02 blimeyonline1 6 30 2020 1 55 2 2020-06-30 01:55:02
2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:40:00 Wahoos247 6 30 2020 1 40 0 2020-06-30 01:40:00
2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... 2020-06-30 01:39:42 JamieOakes247 6 30 2020 1 39 42 2020-06-30 01:39:42
... ... ... ... ... ... ... ... ... ... ...
2020-06-29 15:38:19 #UvA Last week a number of practicals started ... 2020-06-29 15:38:19 JJ_Angelus 6 29 2020 15 38 19 2020-06-29 15:38:19
2020-06-29 15:27:34 Playing through: UVa golfers hone their games ... 2020-06-29 15:27:34 hoosdaily 6 29 2020 15 27 34 2020-06-29 15:27:34
2020-06-29 15:27:29 Playing through: UVa golfers hone their games ... 2020-06-29 15:27:29 hoosdaily 6 29 2020 15 27 29 2020-06-29 15:27:29
2020-06-29 15:27:21 Ramseyer fondly recalls Cavs' perfect regular ... 2020-06-29 15:27:21 hoosdaily 6 29 2020 15 27 21 2020-06-29 15:27:21
2020-06-29 15:06:57 While many college athletes have been unable t... 2020-06-29 15:06:57 cavalierinsider 6 29 2020 15 6 57 2020-06-29 15:06:57

71 rows × 10 columns

9.5.5. Leads and Lags#

In time series data in which the rows represent particular points in time in descending chronological order, a lag is a column that contains the values of another column shifted up one cell. That is, if the rows represent days and a column represents today’s high temperature, the lag represents yesterday’s high temperature. A lead is an existing column with its values shifted one row down: tomorrow’s high temperature.

To create a lead or a lag, apply the .shift() method to a dataframe. The argument of .shift() is an integer: positive integers shift values down, creating leads, and negative values shift values up, creating lags. Whenever we shift values up or down, we create new missing values at the top or bottom of the dataframe for the lead or lagged column.
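
To make the direction of the shift concrete, here is a minimal sketch with made-up daily high temperatures, listed in descending chronological order like the tweets (the values are hypothetical):

temps = pd.DataFrame({'high_temp': [90, 85, 88]},
                     index=pd.to_datetime(['2020-06-30', '2020-06-29', '2020-06-28']))
temps['yesterday_high'] = temps.high_temp.shift(-1)   # lag: values shifted up one row
temps['tomorrow_high'] = temps.high_temp.shift(1)     # lead: values shifted down one row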

For example, in the tweets dataframe we can create two new columns, previous_tweet and next_tweet to compare with text by using .shift(-1) and .shift(1):

tweets['previous_tweet'] = tweets.shift(-1).text
tweets['next_tweet'] = tweets.shift(1).text
tweets[['text', 'previous_tweet', 'next_tweet']]
text previous_tweet next_tweet
created_at
2020-06-30 02:20:43 @KymoraJohnson_ let me call @IamTinaThompson s... Congrats Eric. It’s was really great to have w... NaN
2020-06-30 02:18:17 Congrats Eric. It’s was really great to have w... RT @KathrynsScGifts: "OGX Fade-defying + Orchi... @KymoraJohnson_ let me call @IamTinaThompson s...
2020-06-30 01:55:02 RT @KathrynsScGifts: "OGX Fade-defying + Orchi... Former #UVA stars Quin Blanding, Micah Kiser n... Congrats Eric. It’s was really great to have w...
2020-06-30 01:40:00 Former #UVA stars Quin Blanding, Micah Kiser n... Former #UVA stars Quin Blanding, Micah Kiser n... RT @KathrynsScGifts: "OGX Fade-defying + Orchi...
2020-06-30 01:39:42 Former #UVA stars Quin Blanding, Micah Kiser n... RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... Former #UVA stars Quin Blanding, Micah Kiser n...
... ... ... ...
2020-06-23 15:12:08 RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... WATCH: Top plays of UVA's 2019-20 basketball s...
2020-06-23 15:08:40 #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... Zur Einstimmung #UVA vom letzten #gig bei uns ... RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari...
2020-06-23 14:52:23 Zur Einstimmung #UVA vom letzten #gig bei uns ... RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... #UVA: 54,87. Sube 0,20% diario, 1,22% en junio...
2020-06-23 14:49:22 RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... Good question... My instinct says Ty Jerome, b... Zur Einstimmung #UVA vom letzten #gig bei uns ...
2020-06-23 14:32:44 Good question... My instinct says Ty Jerome, b... NaN RT @JLuis_Sommelier: #Mosto: jugo obtenido de ...

1000 rows × 3 columns