9. Reshaping and Merging Data and Working with Strings, Dates, and Times#
9.1. Introduction: pandas or SQL?#
As we saw in module 7, read operations in SQL not only extract data from a database, but they can select and rename columns, filter and sort rows, join tables, and aggregate data. With pandas, we can also select and rename columns, filter and sort rows, join tables, and aggregate data. So the question is, should we be using SQL or pandas to perform all these data manipulation steps?
SQL is sometimes referred to as server-side data manipulation, because databases are often stored on remote servers and SQL queries are processed on the server instead of on a local machine. Data manipulation that is conducted using pandas on a local Python installation is called client-side data manipulation.
The question of SQL vs. pandas (or SQL vs. the tidyverse in R) is the subject of a lot of debate among professionals who use data. The question comes up frequently on various coding forums. Some blog posts have tried to compare SQL and pandas in terms of the speed with which they complete equivalent operations, with differing results. Tina Wenzel and Kavya Gupta note that many factors influence the relative speed of SQL and pandas operations, including the configuration of a PostgreSQL database and the bandwidth available in a network connection. They take the mixed evidence into account and conclude that
SQL is the best tool to use for basic data retrieval and the better you know it, the more it will speed up your workflow. More advanced data manipulation tasks can be performed both on the server and client side and it is up to the analyst to balance the workload optimally.
In short, the existing evidence does not overwhelmingly support one option or the other as best practice for data manipulation. The reality is that both SQL and pandas are extremely useful and widely used, and both tools will become part of your workflow. It will take some time and experience with end-to-end projects in data science to learn how to balance SQL and pandas in a way that is most comfortable for you in your workflow. But at this stage, there are some important points to keep in mind when thinking about these tools.
First, there are many situations in which the question of SQL vs. pandas might be moot. For a given project, our data might not come from a database, but instead from a CSV file, from a JSON document acquired via an API, or from raw HTML that we scraped from a webpage. So in order to use SQL, we would have to take the additional step of creating a database. If we hold ourselves to the principles that E. F. Codd and others laid out about the organization of relational databases, it can take a significant amount of work to create this database. If there is no database involved for a specific data project, there is no role for SQL, but because pandas works on dataframes it can be a primary tool for data manipulation regardless of the original source for the data.
Second, while there are differences in speed, these differences only become a significant factor for projects that involve big data, and in those cases, we will almost certainly be working with data stored on remote servers. If the data are organized using a database, then SQL may well be faster than pandas, but it very much depends on how the database is configured and on the myriad factors that determine the speed with which code runs through a remote connection. pandas can also be used in scripts or Jupyter notebooks that are run on remote servers. Sometimes it makes sense to pull selections of remotely stored data into a local environment so that we can manipulate the data without having to seek permission from a database manager, and in that case, a mix of SQL to choose a selection and pandas to manipulate the local data can be very effective.
Third, pandas simply has more functionality and flexibility than SQL. For example, it is fairly straightforward to reshape (e.g. pivot) data using pandas, and it is much more difficult to reshape data in SQL. pandas has better functionality for working with strings and time and date features than SQL, and pandas, being part of a Python environment, works well with any other Python package that works with dataframes or arrays. In contrast, while various DBMSs add specific additions to the base SQL language, SQL extensions tend to be fairly limited because of the inherent tension between expanding the functionality of a query language and staying close enough to base SQL to still be considered SQL. The easiest way to bring SQL query results into a Python environment uses an sqlalchemy engine and the pd.read_sql_query() function from pandas.
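To make that concrete, here is a minimal sketch of that workflow; the connection string, table name, and column names below are hypothetical placeholders, not part of this module's examples:

import pandas as pd
import sqlalchemy

# Hypothetical connection string; substitute your own database credentials
engine = sqlalchemy.create_engine("postgresql://user:password@localhost:5432/mydb")

# Let SQL do the server-side selection, then hand the result to pandas as a dataframe
query = "SELECT state, year, income FROM state_income WHERE year >= 1980"
income_df = pd.read_sql_query(query, con=engine)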
Fourth, both pandas and SQL enable us to work as part of larger teams that share tools, but we might choose or be required to use one tool over the other to work better as part of the team. pandas is essential for projects in which the whole team works within Python. SQL is readable and usable for people who do not use Python but still work with databases.
Fifth, and most importantly, the choice of SQL vs. pandas should be made based on how comfortable we feel with each tool. Both SQL and pandas can perform data manipulation correctly, and it will probably be the case that we can remember the code for one tool better than for the other. Since data manipulation is the work that comprises 80% of our time as data scientists, we should use whichever tool lets us do that work more quickly.
For now, we need to practice both pandas and SQL and get comfortable with these tools, and we need to be flexible in the future as different situations will call for SQL or pandas.
9.2. Joining Dataframes#
Joining dataframes is also called merging dataframes, and I will use the words “join” and “merge” interchangeably below. Every kind of merge that is possible in SQL is possible in pandas: inner, outer, left, right, natural, cross, and anti-joins. One important difference between joining in SQL and merging in pandas is that dataframes might not be as carefully curated as data tables in a relational database, and we should not assume that the join worked properly even if there is no error. Primary keys might not uniquely identify rows in general, and joins might turn into cross joins. Some values of keys that should match might not match due to differences in coding and spelling. And if we attempt a natural join, the wrong columns might be automatically selected as the keys.
Too many discussions of merging data provide only clean examples. Here I will show you how merging works when the data contain some common problems that may invalidate the merge. In the following discussion, we will cover how to perform merges in pandas, and also how to check that the merge worked in the way we want it to.
First, for the following examples, we load these packages:
import numpy as np
import pandas as pd
9.2.1. Example: Merging Data on U.S. State Election Results, Income, Economies, and Area#
The “state_elections.csv” file contains information about the result of presidential elections by state for every presidential election between 1964 and 2004. In this dataset, the primary keys are State and year:
elect = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_elections.csv")
elect.head()
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | |
---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 |
“state_income.txt” contains various summary statistics (percentiles and means, broken down separately for men, for women, and for the overall population) about personal incomes within each state. In this dataset, the primary keys are stcode (the post office’s two-letter abbreviation for states) and year:
income = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_income.txt", sep="\t", header=2) # tab separated (note the header in this file)
income.head()
stcode | year | income | p10income | p20income | p30income | p40income | p50income | p60income | p70income | ... | wom_p90income | wom_meanincome | wom_minincome | wom_maxincome | wom_sdincome | wom_ineqincome | wom_gini_income | wom_n_income | wom_nwgt_income | medincome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AL | 1964 | 89.83488 | 3269.85 | 5731.07 | 7988.87 | 10318.3 | 13189.1 | 16007.3 | 18744.4 | ... | 24233.0 | 11400.0 | 58.13953 | 72674.42 | 1796.89 | 29.53152 | 0.473738 | 175.0 | 418494.1 | 76.68081 |
1 | AL | 1968 | 92.18238 | 4469.95 | 7455.28 | 10050.30 | 12792.1 | 15580.2 | 18392.4 | 21310.7 | ... | 26287.5 | 12005.3 | NaN | NaN | 1968.78 | 27.40979 | 0.485090 | NaN | NaN | 80.72643 |
2 | AL | 1972 | 94.87543 | 6734.11 | 10314.80 | 13336.30 | 16421.5 | 19088.9 | 22141.6 | 25410.7 | ... | 30278.0 | 14667.9 | NaN | NaN | 8604.25 | 17.83577 | 0.457932 | NaN | NaN | 82.27975 |
3 | AL | 1976 | 67.53671 | 7759.81 | 11176.60 | 13823.20 | 16431.8 | 18982.8 | 21923.5 | 25180.8 | ... | 31119.2 | 15362.5 | NaN | NaN | 12842.20 | 15.77765 | 0.454074 | NaN | NaN | 60.07215 |
4 | AL | 1980 | 46.29869 | 7602.15 | 10637.00 | 13532.60 | 16279.6 | 19111.8 | 21802.5 | 25190.6 | ... | 29623.1 | 14804.2 | 28.38428 | 120196.50 | 5944.71 | 14.52640 | 0.436961 | 543.0 | 711278.2 | 41.72882 |
5 rows × 55 columns
“state_economics.txt” contains information about the macro-level economy in each state and year, including overall and per capita gross domestic product, and GDP broken down by industry. In this dataset, the primary keys are fips (a standard numeric code that identifies countries and provinces within countries) and year:
econ = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_economics.txt", sep=";") # semi-colon separated
econ.head()
fips | year | GDP | GDPpc | Private | Agriculture | Farms | Mining | Utilities | Construction | ... | Construction_share | Manufacturing_share | Finance_Insurance_share | Legal_share | Education_share | Health_share | Government_share | Government_federal_share | Government_military_share | Government_statelocal_share | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1964 | 8201 | 4273.580 | 6900 | 312 | 283 | 115 | 235 | 379 | ... | 4.621387 | 28.22827 | 11.41324 | 0.402390 | 0.256066 | 1.902207 | 15.86392 | 6.291915 | 2.402146 | 7.182051 |
1 | 1 | 1968 | 10949 | 5493.728 | 9095 | 294 | 253 | 133 | 294 | 499 | ... | 4.557494 | 28.74235 | 10.67677 | 0.429263 | 0.365330 | 2.228514 | 16.93305 | 5.945748 | 2.630377 | 8.356928 |
2 | 1 | 1972 | 15336 | 4332.105 | 12696 | 481 | 418 | 223 | 435 | 702 | ... | 4.577465 | 26.05634 | 11.48279 | 0.469484 | 0.443401 | 2.595201 | 17.21440 | 5.692488 | 2.679969 | 8.841941 |
3 | 1 | 1976 | 24206 | 6477.035 | 19988 | 781 | 683 | 584 | 714 | 1281 | ... | 5.292076 | 24.85334 | 11.50541 | 0.479220 | 0.359415 | 2.986863 | 17.42543 | 5.787821 | 2.255639 | 9.386103 |
4 | 1 | 1980 | 36006 | 9246.474 | 29567 | 674 | 542 | 1351 | 1228 | 1588 | ... | 4.410376 | 24.60423 | 11.63417 | 0.649892 | 0.352719 | 3.660501 | 17.88313 | 5.782370 | 2.171860 | 9.928901 |
5 rows × 36 columns
“state_area.csv” contains various measures of the total, land, and water areas of each state. In this dataset, the primary key is state:
area = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_area.csv")
area.head()
state | area_sqmi | area_sqkm | landarea_sqmi | landarea_sqkm | water_sqmi | water_sqkm | percent_water | |
---|---|---|---|---|---|---|---|---|
0 | Alaska | 663,267.26 | 1,717,854 | 571,951.26 | 1,481,347 | 91,316.00 | 236,507 | 13.77 |
1 | Texas | 268,580.82 | 695,621 | 261,797.12 | 678,051 | 6,783.70 | 17,570 | 2.53 |
2 | California | 163,695.57 | 423,970 | 155,939.52 | 403,882 | 7,736.23 | 20,037 | 4.73 |
3 | Montana | 147,042.40 | 380,838 | 145,552.43 | 376,979 | 1,489.96 | 3,859 | 1.01 |
4 | New Mexico | 121,589.48 | 314,915 | 121,355.53 | 314,309 | 233.96 | 606 | 0.19 |
Because the datasets above contain three ways to identify states - by name, with two-letter abbreviations, and with FIPS codes - we also need a dataset that matches each state by name to its abbreviation and FIPS code. Datasets that exist solely for the purpose of matching values of keys across datasets are called crosswalks:
crosswalk = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/crosswalk.csv")
crosswalk.head()
fips | State | stcode | |
---|---|---|---|
0 | 1 | Alabama | AL |
1 | 2 | Alaska | AK |
2 | 4 | Arizona | AZ |
3 | 5 | Arkansas | AR |
4 | 6 | California | CA |
Our goal is to merge all five of these datasets together.
9.2.2. Using the pd.merge() Function#
When using a crosswalk to merge dataframes together, the first step is to merge a dataframe with the crosswalk. Here we start by joining the elect and crosswalk dataframes together. To join dataframes, we use the pd.merge() function. The following code works:
merged_data = pd.merge(elect, crosswalk, on='State')
merged_data
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY |
561 rows × 10 columns
The merged_data dataframe now contains the features from elect and from crosswalk, which means we’ve added stcode and fips to the elect data, and that enables us to merge merged_data with income and econ.
Before moving on to the rest of the example, we should discuss some properties of the pd.merge() function. Recall from module 7 that there are many different kinds of joins that differ only in how they treat unmatched observations. pd.merge() conducts an inner join by default, leaving only the matched observations in the merged data. To change the type of merge, use the how argument:
how='outer' performs an outer or full join. It brings all rows from both dataframes into the merged data, and fills in NaN for the columns on which an unmatched row has no data.
merged_data = pd.merge(elect, crosswalk, on='State', how='outer')
how='left' performs a left join. It brings all rows from the first dataframe listed into the merged data, and only those rows in the second dataframe that are matched to the first.
merged_data = pd.merge(elect, crosswalk, on='State', how='left')
how='right' performs a right join. It brings all rows from the second dataframe listed into the merged data, and only those rows in the first dataframe that are matched to the second.
merged_data = pd.merge(elect, crosswalk, on='State', how='right')
There are two ways to specify the key or keys on which to match rows. If the key or keys have the same name in both dataframes, use the on argument. If there is one key, pass this name as a string to on like we did above. If there is more than one key, pass these names as a list to on. For example, if we merge on both State and year, we will type on = ['State', 'year'].
If the keys have different names in each dataframe, we use left_on to specify the names of these keys in the first dataframe listed in the pd.merge() function, and right_on to specify the names of these keys in the second dataframe listed. Suppose, for example, that the keys are named “State” and “year” in the first dataframe, and “st” and “yr” in the second dataframe. We would type left_on = ['State', 'year'], right_on = ['st', 'yr'].
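As a quick sketch with two toy dataframes (the dataframes and values here are made up for illustration, not part of this module's data):

left = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                     'year': [2000, 2000],
                     'votes': [10, 20]})
right = pd.DataFrame({'st': ['Alabama', 'Alaska'],
                      'yr': [2000, 2000],
                      'gdp': [1, 2]})

# left_on lists the key names in the first dataframe, right_on lists the
# corresponding key names in the second dataframe, in the same order
merged = pd.merge(left, right, left_on=['State', 'year'], right_on=['st', 'yr'])

Note that when we match keys with left_on and right_on, both versions of the key columns (State/year and st/yr here) appear in the merged result, so we usually drop one set afterwards.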
Depending on the type of join we use, a row can be matched, or can come from just the first or just the second dataframe listed in pd.merge(). To see which rows come from which sources, specify indicator='newname', where 'newname' is a name you choose for a new column that contains this information. For example, the following code repeats the merge we performed above, but uses an outer join instead of an inner join and adds a column matched that tells us how each row is matched:
merged_data = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched')
merged_data
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | matched | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL | both |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL | both |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL | both |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL | both |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY | both |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY | both |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY | both |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY | both |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY | both |
561 rows × 11 columns
9.2.3. Checking for Problems That Can Occur Without Errors#
Whenever we apply pd.merge(), there’s a great risk that the merge was incorrect even if the function runs without error. There are two ways that a merge can fail.
Problem 1: what we thought was a unique ID was not a unique ID.
If we merge two dataframes on a column (or set of columns) that fails to uniquely identify the rows in either dataframe, then pd.merge() returns all pairwise combinations of rows with matching IDs. To see this, consider the following two example dataframes:
dict1 = {'id': ['A', 'A', 'B', 'B'],
'data1': [150, 200, 50, 25]}
df1 = pd.DataFrame.from_dict(dict1)
df1
id | data1 | |
---|---|---|
0 | A | 150 |
1 | A | 200 |
2 | B | 50 |
3 | B | 25 |
dict2 = {'id': ['A', 'A', 'B', 'B'],
'data2': [-20, -75, -125, -250]}
df2 = pd.DataFrame.from_dict(dict2)
df2
id | data2 | |
---|---|---|
0 | A | -20 |
1 | A | -75 |
2 | B | -125 |
3 | B | -250 |
Note that the two dataframes share a column “id”, but that “id” is not a unique identifier in either dataframe. If I try to merge the two based on “id”, the result is:
df_merge = pd.merge(df1, df2, on='id')
df_merge
id | data1 | data2 | |
---|---|---|---|
0 | A | 150 | -20 |
1 | A | 150 | -75 |
2 | A | 200 | -20 |
3 | A | 200 | -75 |
4 | B | 50 | -125 |
5 | B | 50 | -250 |
6 | B | 25 | -125 |
7 | B | 25 | -250 |
Because we did not have a unique ID in either dataframe, we artificially increased the number of rows from 4 to 8. At first glance, the merged data looks nice, but because it has double the number of rows we are supposed to have, the data is in fact corrupted, and we cannot use it for any subsequent analysis. Also, note that the merge ran without any errors. If we aren’t careful, we might move forward with data that cannot be used.
The above example is an instance of a many-to-many merge, in which rows in the left dataframe can match to more than one row in the right dataframe, and rows in the right dataframe can match to more than one row in the left dataframe. In these cases pandas performs a cross join within the subsets of the two dataframes that share the same key, inflating the number of rows. But a many-to-many join is almost never what we want to do with data, and if it happens, it is usually a mistake.
The other types of merges are
one-to-one, in which every row in the left dataframe matches to at most one row in the right dataframe, and every row in the right dataframe matches to at most one row in the left dataframe,
many-to-one, in which rows in the right dataframe can match to more than one row in the left dataframe, but every row in the left dataframe matches to at most one row in the right dataframe,
and one-to-many, in which rows in the left dataframe can match to more than one row in the right dataframe, but every row in the right dataframe matches to at most one row in the left dataframe.
The best way to prevent this mistake is to first think about whether the two dataframes should constitute a one-to-one, many-to-one, or one-to-many merge. It is important to have a clear expectation. Then use the validate argument to have pandas automatically check whether this expectation is met (a short sketch of this check appears after the list below). If not, the code will generate an error, which will bring our attention to problems in the data we might not have caught otherwise. If we write
validate = 'one_to_one', then pandas checks to see whether the merge keys are unique in both dataframes,
validate = 'many_to_one', then pandas checks to see whether the merge keys are unique in the right dataframe,
validate = 'one_to_many', then pandas checks to see whether the merge keys are unique in the left dataframe.
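As a minimal sketch of what this check buys us, we can rerun the toy many-to-many merge from above (df1 and df2, in which id is duplicated) while asserting that it should be one-to-one; instead of silently inflating the rows, pandas raises a MergeError:

# df1 and df2 are the toy dataframes defined above, where 'id' is not unique
try:
    pd.merge(df1, df2, on='id', validate='one_to_one')
except pd.errors.MergeError as e:
    print("Merge failed validation:", e)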
There are many reasons why our expectations might be incorrect. In messy, real-world data, mistakes like duplicated keys often make it past review processes and exist in the data without any clue of it in the codebook. The validate argument is an important check for these kinds of issues. For example, we expect that merging elect and crosswalk will be a many-to-one merge, so we confirm that the merge is many-to-one as follows:
merged_data = pd.merge(elect, crosswalk, on='State', how='outer',
indicator='matched', validate='many_to_one')
merged_data
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | matched | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL | both |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL | both |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL | both |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL | both |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY | both |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY | both |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY | both |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY | both |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY | both |
561 rows × 11 columns
Problem 2: rows should be matched, but aren’t.
This problem is especially common when matching on strings, such as names. With countries, it’s possible for one dataframe to write “USA” and the other to write “United States”. You will have to go back into a dataframe to edit individual ID cells so that they match. But to identify the cells that need to be edited, we have to construct a dataframe of the rows that were unmatched in one or the other dataframe. The easiest way to see this is to use an anti-join: a merge that keeps only the rows in the left dataframe that have no match in the right dataframe. As in SQL, there is no anti-join function built into pandas, but there are ways to program our own anti-join. The easiest way is to perform an outer join with the indicator='matched' argument, then to filter the data to those rows for which matched is not equal to “both”.
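As a sketch of this recipe in general form, here is a small anti-join helper built around pd.merge() (the name anti_join is our own, not a pandas function), demonstrated on the toy df1 and df2 dataframes from earlier:

def anti_join(left, right, on):
    # Outer join with an indicator column, keep only the rows that appear
    # in the left dataframe alone, and return just the left columns
    both = pd.merge(left, right, on=on, how='outer', indicator=True)
    return both.query("_merge == 'left_only'")[left.columns]

anti_join(df1, df2, on='id')  # empty here, because every 'id' value matches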
For this step, I recommend creating a second, working version of the data, merge_test, to contain the first attempt at the merge. If there are issues, it is easier to fix the problems and recreate the working dataframe than to restart the kernel. If there are no problems, then we can assign the merged_data dataframe to be equal to the working version.
merge_test = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched', validate='many_to_one')
merge_test.query("matched!='both'")
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | matched |
---|
There are no rows in crosswalk with no match to elect, and there are no rows in elect without a match in crosswalk. But if there were unmatched rows, there are two reasons why some rows may have been unmatched: rows can be unmatched due to differences in coverage, and they might be unmatched due to differences in coding or spelling. pandas cannot tell the difference between these two reasons why rows might be unmatched, and an inner join will drop unmatched rows regardless of the reason. Using an outer join with indicator='matched' is best practice for seeing the unmatched rows and thinking about whether any rows that should be matched are for some reason unmatched. We can use the category relabeling functionality in pandas to fix issues of misspelled and miscoded keys, as we will do in the examples in the next section.
After performing this check, if we intend to perform another merge, we will either need to choose a new name for the indicator column, or we can drop the matched column so that we can reuse this name later. We set merged_data to the working dataframe merge_test without the matched column:
merged_data = merge_test.drop('matched', axis=1)
9.2.4. Merging all of the Dataframes Together While Checking for Problems#
The merged_data and income dataframes share keys named stcode and year. Joining these two dataframes should be a one-to-one merge because there should be only one occurrence of the same state and year in each dataframe. We use how='outer' and indicator='matched' (since we deleted the previous matched column) to enable us to filter the result to the unmatched rows, and we use validate='one_to_one' to catch anomalies that violate our expectation that this is a one-to-one merge:
merge_test = pd.merge(merged_data, income, on=['stcode','year'], how='outer', indicator='matched', validate='one_to_one')
merge_test.query("matched!='both'")
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | wom_meanincome | wom_minincome | wom_maxincome | wom_sdincome | wom_ineqincome | wom_gini_income | wom_n_income | wom_nwgt_income | medincome | matched |
---|
0 rows × 64 columns
There are no unmatched rows, so there are no issues with miscoded or misspelled ID values. The merged data now contain information on both state election results and state incomes:
merge_test
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | wom_meanincome | wom_minincome | wom_maxincome | wom_sdincome | wom_ineqincome | wom_gini_income | wom_n_income | wom_nwgt_income | medincome | matched | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL | ... | 11400.0 | 58.139530 | 72674.42 | 1796.89 | 29.53152 | 0.473738 | 175.0 | 418494.1 | 76.68081 | both |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL | ... | 12005.3 | NaN | NaN | 1968.78 | 27.40979 | 0.485090 | NaN | NaN | 80.72643 | both |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL | ... | 14667.9 | NaN | NaN | 8604.25 | 17.83577 | 0.457932 | NaN | NaN | 82.27975 | both |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL | ... | 15362.5 | NaN | NaN | 12842.20 | 15.77765 | 0.454074 | NaN | NaN | 60.07215 | both |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL | ... | 14804.2 | 28.384280 | 120196.50 | 5944.71 | 14.52640 | 0.436961 | 543.0 | 711278.2 | 41.72882 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY | ... | 17188.0 | 75.987840 | 77507.60 | 10402.40 | 23.54822 | 0.478940 | 318.0 | 110956.1 | 36.63389 | both |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY | ... | 17934.1 | 92.307700 | 96391.03 | 12512.80 | 21.14388 | 0.469024 | 360.0 | 117688.6 | 30.10564 | both |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY | ... | 18796.4 | 29.816510 | 122420.90 | 14517.60 | 22.74368 | 0.461693 | 413.0 | 125222.3 | 26.93395 | both |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY | ... | 20463.4 | 88.819230 | 333171.40 | 16609.80 | 15.30497 | 0.447124 | 440.0 | 128753.9 | 26.05026 | both |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY | ... | 20225.1 | 9.587728 | 206166.80 | 18311.90 | 11.33138 | 0.432011 | 692.0 | 123164.1 | 24.74094 | both |
561 rows × 64 columns
Since there are no problems, we set merged_data equal to merge_test. Because we intend to merge the data with another dataframe, we drop the matched column.
merged_data = merge_test.drop('matched', axis=1)
Next we merge this result with the econ dataframe. These two dataframes share fips and year as keys, and should also be a one-to-one merge. We check to see whether there are unmatched rows:
merge_test = pd.merge(merged_data, econ, on=['fips','year'], how='outer', indicator='matched', validate='one_to_one')
merge_test.query("matched!='both'")
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | Manufacturing_share | Finance_Insurance_share | Legal_share | Education_share | Health_share | Government_share | Government_federal_share | Government_military_share | Government_statelocal_share | matched |
---|
0 rows × 98 columns
Again, there are no unmatched rows, so there are no issues with miscoded or misspelled ID values. The merged data now contain information on state election results, state incomes, and state economies:
merge_test
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | Manufacturing_share | Finance_Insurance_share | Legal_share | Education_share | Health_share | Government_share | Government_federal_share | Government_military_share | Government_statelocal_share | matched | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL | ... | 28.228270 | 11.413240 | 0.402390 | 0.256066 | 1.902207 | 15.86392 | 6.291915 | 2.402146 | 7.182051 | both |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL | ... | 28.742350 | 10.676770 | 0.429263 | 0.365330 | 2.228514 | 16.93305 | 5.945748 | 2.630377 | 8.356928 | both |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL | ... | 26.056340 | 11.482790 | 0.469484 | 0.443401 | 2.595201 | 17.21440 | 5.692488 | 2.679969 | 8.841941 | both |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL | ... | 24.853340 | 11.505410 | 0.479220 | 0.359415 | 2.986863 | 17.42543 | 5.787821 | 2.255639 | 9.386103 | both |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL | ... | 24.604230 | 11.634170 | 0.649892 | 0.352719 | 3.660501 | 17.88313 | 5.782370 | 2.171860 | 9.928901 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY | ... | 4.380078 | 7.788767 | 0.688802 | 0.123631 | 2.207700 | 14.51784 | 2.622748 | 1.660191 | 10.234900 | both |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY | ... | 4.303816 | 7.670391 | 0.704806 | 0.179951 | 2.579291 | 14.99588 | 2.811727 | 1.552073 | 10.632080 | both |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY | ... | 6.146708 | 9.280448 | 0.705568 | 0.152555 | 2.866768 | 14.23849 | 2.657005 | 1.544622 | 10.030510 | both |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY | ... | 6.012348 | 3.110034 | 0.646241 | 0.173100 | 4.269805 | 14.95586 | 2.729213 | 1.540592 | 10.686050 | both |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY | ... | 3.795901 | 2.813834 | 0.589240 | 0.222032 | 4.419300 | 14.49616 | 2.519214 | 1.648164 | 10.328780 | both |
561 rows × 98 columns
Again, we set merged_data equal to merge_test, dropping the matched column so that we can reuse this name in the next join:
merged_data = merge_test.drop('matched', axis=1)
Finally we merge the data with the area dataframe. This merge matches State in merged_data to state in area: this subtle difference is easy to miss and will result in an error if we fail to notice the lowercase s and write on='State'. This merge should be a many-to-one merge, as there are many rows for each state in merged_data (for many years), but only one row for each state in area.
merge_test = pd.merge(merged_data, area, how="outer",
left_on='State',right_on='state',
indicator='matched',
validate='many_to_one')
merge_test.query("matched!='both'")[['State', 'state',
'year','matched']]
State | state | year | matched | |
---|---|---|---|---|
77 | D. C. | NaN | 1964.0 | left_only |
78 | D. C. | NaN | 1968.0 | left_only |
79 | D. C. | NaN | 1972.0 | left_only |
80 | D. C. | NaN | 1976.0 | left_only |
81 | D. C. | NaN | 1980.0 | left_only |
82 | D. C. | NaN | 1984.0 | left_only |
83 | D. C. | NaN | 1988.0 | left_only |
84 | D. C. | NaN | 1992.0 | left_only |
85 | D. C. | NaN | 1996.0 | left_only |
86 | D. C. | NaN | 2000.0 | left_only |
87 | D. C. | NaN | 2004.0 | left_only |
561 | NaN | District of Columbia | NaN | right_only |
562 | NaN | Puerto Rico | NaN | right_only |
563 | NaN | Northern Mariana Islands | NaN | right_only |
564 | NaN | United States Virgin Islands | NaN | right_only |
565 | NaN | American Samoa | NaN | right_only |
566 | NaN | Guam | NaN | right_only |
567 | NaN | United States Minor Outlying Islands | NaN | right_only |
In this case, one of our validation checks failed: we have unmatched rows. Some of these rows are due to differences in coverage: we will not bring Puerto Rico, the Northern Mariana Islands, the United States Virgin Islands, American Samoa, Guam, or the United States Minor Outlying Islands into the merged data as these territories do not receive votes in the electoral college. We will however bring the District of Columbia into the merged data as D.C. does get 3 votes in the electoral college. D.C. is unmatched due to spelling discrepancies, and we must correct this mistake before we can proceed. The easiest way to fix the error is to replace values of “District of Columbia” in the area dataframe with “D. C.”, as it is written in the merged data. We can use the .replace() method (because we only want to replace one state category and .replace() leaves unlisted categories alone, as opposed to .map() which would replace unlisted categories with missing values), as we discussed in module 8:
area.state = area.state.replace({'District of Columbia':'D. C.'})
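To see the difference in behavior that makes .replace() the right choice here, consider a small hypothetical series (not part of the example data):

s = pd.Series(['District of Columbia', 'Virginia', 'Maryland'])

s.replace({'District of Columbia': 'D. C.'})  # -> ['D. C.', 'Virginia', 'Maryland']
s.map({'District of Columbia': 'D. C.'})      # -> ['D. C.', NaN, NaN]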
Now if we repeat the test, we will only see the rows that are unmatched due to differences in coverage:
merge_test = pd.merge(merged_data, area, left_on='State', how="outer", right_on='state', indicator='matched', validate='many_to_one')
merge_test.query("matched!='both'")[['State', 'state','year','matched']]
State | state | year | matched | |
---|---|---|---|---|
561 | NaN | Puerto Rico | NaN | right_only |
562 | NaN | Northern Mariana Islands | NaN | right_only |
563 | NaN | United States Virgin Islands | NaN | right_only |
564 | NaN | American Samoa | NaN | right_only |
565 | NaN | Guam | NaN | right_only |
566 | NaN | United States Minor Outlying Islands | NaN | right_only |
We replace the merged_data dataframe with merge_test, keeping only the rows for which matched is equal to “both”, then dropping the matched column:
merged_data = merge_test.query("matched=='both'").drop('matched', axis=1)
Now merged_data contains all of the data we need, including state election results, income, economic data, and area:
merged_data
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | Government_military_share | Government_statelocal_share | state | area_sqmi | area_sqkm | landarea_sqmi | landarea_sqkm | water_sqmi | water_sqkm | percent_water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964.0 | 210732.0 | 479085.0 | 0.0 | 30.54897 | 69.45103 | 1919000.0 | 1.0 | AL | ... | 2.402146 | 7.182051 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
1 | Alabama | 1968.0 | 196579.0 | 146923.0 | 691425.0 | 18.99448 | 81.00552 | 1993000.0 | 1.0 | AL | ... | 2.630377 | 8.356928 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
2 | Alabama | 1972.0 | 256923.0 | 728701.0 | 0.0 | 26.06704 | 73.93296 | 3540080.0 | 1.0 | AL | ... | 2.679969 | 8.841941 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
3 | Alabama | 1976.0 | 659170.0 | 504070.0 | 0.0 | 56.66673 | 43.33327 | 3737204.0 | 1.0 | AL | ... | 2.255639 | 9.386103 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
4 | Alabama | 1980.0 | 636730.0 | 654192.0 | 0.0 | 49.32366 | 50.67634 | 3894025.0 | 1.0 | AL | ... | 2.171860 | 9.928901 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988.0 | 67113.0 | 106867.0 | 0.0 | 38.57512 | 61.42488 | 465080.0 | 56.0 | WY | ... | 1.660191 | 10.234900 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
557 | Wyoming | 1992.0 | 68160.0 | 79347.0 | 0.0 | 46.20798 | 53.79202 | 466251.0 | 56.0 | WY | ... | 1.552073 | 10.632080 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
558 | Wyoming | 1996.0 | 77934.0 | 105388.0 | 0.0 | 42.51208 | 57.48792 | 488167.0 | 56.0 | WY | ... | 1.544622 | 10.030510 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
559 | Wyoming | 2000.0 | 60481.0 | 147947.0 | 0.0 | 29.01769 | 70.98231 | 493782.0 | 56.0 | WY | ... | 1.540592 | 10.686050 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
560 | Wyoming | 2004.0 | 70776.0 | 167629.0 | 0.0 | 29.68730 | 70.31271 | 506529.0 | 56.0 | WY | ... | 1.648164 | 10.328780 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
561 rows × 105 columns
9.3. Reshaping Data#
Reshaping a dataframe refers to turning the rows of a dataframe into columns (also known as a long-to-wide reshape), or turning columns into rows (also known as a wide-to-long reshape), in a way that cannot be accomplished by simply transposing the data matrix. A long-to-wide reshape is also called pivoting, and a wide-to-long reshape is also called melting. Reshaping is a notoriously tricky skill to master, and it helps a lot to understand every part of the function we need, so that we’re not blindly trying every combination of parameters hoping one of them works.
9.3.1. Example: Gross domestic product by state, 1997-2015#
To demonstrate the method for pivoting a dataframe, we use example data from the U.S. Bureau of Economic Analysis (www.bea.gov) on economic output for the American states. We load the data, skipping the last four rows as these rows contain attribution information and notes, and are not part of the data. We also use the .str.strip() method to remove all the leading and trailing whitespace in the Description column:
gsp = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv", skipfooter=4)
gsp['Description'] = gsp['Description'].str.strip()
gsp
/var/folders/4w/k9sqqcbx4dxgpjtwgv_1m29h0000gq/T/ipykernel_83160/793934051.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
gsp = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv", skipfooter=4)
GeoFIPS | GeoName | Region | ComponentId | ComponentName | IndustryId | IndustryClassification | Description | 1997 | 1998 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 8542530 | 9024434 | ... | 13773226 | 14391149 | 14626598 | 14320114 | 14859772 | 15406002 | 16041243 | 16576808 | 17277548 | 17919651 |
1 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 7459395 | 7894015 | ... | 12045446 | 12572387 | 12716179 | 12352979 | 12826507 | 13348439 | 13957545 | 14468465 | 15115846 | 15698669 |
2 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 3 | 11 | Agriculture, forestry, fishing, and hunting | 108796 | 99940 | ... | 128345 | 141999 | 154525 | 137655 | 160217 | 197241 | 185800 | 221821 | 203188 | 175236 |
3 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 4 | 111-112 | Farms | 88136 | 79030 | ... | 99352 | 113533 | 126345 | 109800 | 129725 | 166249 | 151489 | 186960 | 166249 | (NA) |
4 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 5 | 113-115 | Forestry, fishing, and related activities | 20660 | 20910 | ... | 28993 | 28466 | 28180 | 27855 | 30492 | 30992 | 34311 | 34861 | 36939 | (NA) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43287 | 94000 | Plains | 4 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 39071 | 39970 | ... | 46013 | 46501 | 46794 | 45399 | 46210 | 46921 | 47629 | 48126 | 48778 | 49153 |
43288 | 95000 | Southeast | 5 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 36722 | 37896 | ... | 43586 | 43028 | 42092 | 40346 | 40675 | 40471 | 40415 | 40487 | 40802 | 41352 |
43289 | 96000 | Southwest | 6 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38887 | 40283 | ... | 45743 | 46684 | 45914 | 44324 | 44491 | 45344 | 46881 | 48064 | 49327 | 50519 |
43290 | 97000 | Rocky Mountain | 7 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38243 | 40390 | ... | 46756 | 47587 | 47019 | 45170 | 44937 | 45060 | 45014 | 45564 | 46508 | 47093 |
43291 | 98000 | Far West | 8 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 41901 | 43467 | ... | 53932 | 54469 | 53881 | 51328 | 51609 | 51857 | 52429 | 52942 | 54077 | 55429 |
43292 rows × 27 columns
For this example, we only want to save three features besides the primary keys:
state GDP,
state GDP per capita,
state GDP from private industries
We only want the 50 states, no regions, territories, or districts (sorry DC!). Notice that the features we want are contained in rows, and the years are contained in columns. We need to switch that: the features must be in columns, and years must be stored in the rows. We will need to perform both a wide-to-long and a long-to-wide reshape. In addition, we will give the columns concise but descriptive names, and we will create a new column that contains the percent of GDP from private industry in each state and each year.
First we need to reduce the data to only the features we need: state GDP, state GDP per capita, and state GDP from private industries. Right now, however, the features are not stored in the columns, but in the rows. Specifically, the data we need are contained in the rows in which Description is “All industry total” or “Private industries” and ComponentName is “Gross domestic product (GDP) by state” or “Per capita real GDP by state”. We use .query() to isolate these rows:
gsp_clean = gsp.query(
"Description in ['All industry total', 'Private industries']"
).query(
"ComponentName in ['Gross domestic product (GDP) by state','Per capita real GDP by state']")
gsp_clean
GeoFIPS | GeoName | Region | ComponentId | ComponentName | IndustryId | IndustryClassification | Description | 1997 | 1998 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 8542530 | 9024434 | ... | 13773226 | 14391149 | 14626598 | 14320114 | 14859772 | 15406002 | 16041243 | 16576808 | 17277548 | 17919651 |
1 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 7459395 | 7894015 | ... | 12045446 | 12572387 | 12716179 | 12352979 | 12826507 | 13348439 | 13957545 | 14468465 | 15115846 | 15698669 |
90 | 01000 | Alabama | 5 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 104218 | 109414 | ... | 164468 | 169923 | 172646 | 168315 | 174710 | 180665 | 185878 | 190095 | 194421 | 199656 |
91 | 01000 | Alabama | 5 | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 87014 | 91506 | ... | 137954 | 141306 | 142965 | 137143 | 142773 | 148181 | 153494 | 157961 | 161364 | 166243 |
180 | 02000 | Alaska | 8 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 25446 | 24030 | ... | 44679 | 49197 | 55461 | 50463 | 54134 | 58759 | 60890 | 59762 | 58253 | 52747 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43287 | 94000 | Plains | 4 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 39071 | 39970 | ... | 46013 | 46501 | 46794 | 45399 | 46210 | 46921 | 47629 | 48126 | 48778 | 49153 |
43288 | 95000 | Southeast | 5 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 36722 | 37896 | ... | 43586 | 43028 | 42092 | 40346 | 40675 | 40471 | 40415 | 40487 | 40802 | 41352 |
43289 | 96000 | Southwest | 6 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38887 | 40283 | ... | 45743 | 46684 | 45914 | 44324 | 44491 | 45344 | 46881 | 48064 | 49327 | 50519 |
43290 | 97000 | Rocky Mountain | 7 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38243 | 40390 | ... | 46756 | 47587 | 47019 | 45170 | 44937 | 45060 | 45014 | 45564 | 46508 | 47093 |
43291 | 98000 | Far West | 8 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 41901 | 43467 | ... | 53932 | 54469 | 53881 | 51328 | 51609 | 51857 | 52429 | 52942 | 54077 | 55429 |
180 rows × 27 columns
Next, we only want the 50 states, no regions, territories, or districts, so we issue another .query() to remove the non-state rows. To see these non-states, we call up a list of the unique values of GeoName:
gsp_clean['GeoName'].unique()
array(['United States', 'Alabama', 'Alaska', 'Arizona', 'Arkansas',
'California', 'Colorado', 'Connecticut', 'Delaware',
'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming', 'New England', 'Mideast',
'Great Lakes', 'Plains', 'Southeast', 'Southwest',
'Rocky Mountain', 'Far West'], dtype=object)
We need to remove the rows in which GeoName is “United States”, “District of Columbia”, “New England”, “Mideast”, “Great Lakes”, “Plains”, “Southeast”, “Southwest”, “Rocky Mountain”, or “Far West”:
gsp_clean = gsp_clean.query("GeoName not in ['United States', 'District of Columbia', 'New England', 'Mideast', 'Great Lakes', 'Plains', 'Southeast', 'Southwest', 'Rocky Mountain', 'Far West']")
gsp_clean
GeoFIPS | GeoName | Region | ComponentId | ComponentName | IndustryId | IndustryClassification | Description | 1997 | 1998 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
90 | 01000 | Alabama | 5 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 104218 | 109414 | ... | 164468 | 169923 | 172646 | 168315 | 174710 | 180665 | 185878 | 190095 | 194421 | 199656 |
91 | 01000 | Alabama | 5 | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 87014 | 91506 | ... | 137954 | 141306 | 142965 | 137143 | 142773 | 148181 | 153494 | 157961 | 161364 | 166243 |
180 | 02000 | Alaska | 8 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 25446 | 24030 | ... | 44679 | 49197 | 55461 | 50463 | 54134 | 58759 | 60890 | 59762 | 58253 | 52747 |
181 | 02000 | Alaska | 8 | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 20284 | 18776 | ... | 36737 | 40919 | 46889 | 41387 | 44742 | 48851 | 50592 | 49764 | 47879 | 42246 |
270 | 04000 | Arizona | 6 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 132708 | 143768 | ... | 248459 | 262045 | 256718 | 242509 | 245668 | 254192 | 264693 | 270642 | 280166 | 290903 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43279 | 51000 | Virginia | 5 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 43069 | 44745 | ... | 52866 | 52657 | 51985 | 51389 | 51945 | 51749 | 51538 | 51105 | 50861 | 51540 |
43280 | 53000 | Washington | 8 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 45753 | 47712 | ... | 52900 | 55348 | 54939 | 52264 | 52681 | 52495 | 53423 | 53987 | 54773 | 55577 |
43281 | 54000 | West Virginia | 5 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 30445 | 30978 | ... | 34009 | 33892 | 34679 | 34564 | 35368 | 36085 | 35515 | 35778 | 36234 | 36817 |
43282 | 55000 | Wisconsin | 3 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38663 | 39695 | ... | 45515 | 45464 | 44622 | 43215 | 44126 | 44905 | 45380 | 45582 | 46469 | 46893 |
43283 | 56000 | Wyoming | 7 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 46585 | 47652 | ... | 63428 | 65471 | 69182 | 66320 | 64602 | 63981 | 60744 | 60743 | 61639 | 61389 |
150 rows × 27 columns
9.3.2. Melting: Turning Columns to Rows#
Notice that many of the columns refer to particular years. We want to move these columns to the rows, so that each state has one row for each year covered in the data. The best function to use when we need to turn columns into rows is pd.melt():
pd.melt(df, id_vars, value_vars)
This function takes three parameters:
The name of the dataframe you wish to edit
id_vars: A list of the names of the columns that we want to continue to exist as columns after the dataframe is melted. In our example, this list will contain both IDs that currently uniquely identify the rows (GeoName, ComponentName, Description), synonyms for the IDs (GeoFIPS) that we want to keep, and features that vary across ID but not across the columns (Region).
value_vars: A list of the names of the columns we want to store in the rows. If the columns have numeric names and can be sequentially ordered, as is the case in our example with years, a shortcut for creating a list of these column names is [str(i) for i in range(1997,2015)], which loops over the integers from 1997 through 2014, placing each year as a string in the list.
Any column that is not listed under either id_vars or value_vars will be deleted in the melted data.
For example, we melt the gsp_clean dataframe with the following code:
gsp_clean = pd.melt(gsp_clean, id_vars = ['GeoName', 'GeoFIPS', 'Region',
'ComponentName', 'Description'],
value_vars = [str(i) for i in range(1997,2015)])
gsp_clean
GeoName | GeoFIPS | Region | ComponentName | Description | variable | value | |
---|---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | Gross domestic product (GDP) by state | All industry total | 1997 | 104218 |
1 | Alabama | 01000 | 5 | Gross domestic product (GDP) by state | Private industries | 1997 | 87014 |
2 | Alaska | 02000 | 8 | Gross domestic product (GDP) by state | All industry total | 1997 | 25446 |
3 | Alaska | 02000 | 8 | Gross domestic product (GDP) by state | Private industries | 1997 | 20284 |
4 | Arizona | 04000 | 6 | Gross domestic product (GDP) by state | All industry total | 1997 | 132708 |
... | ... | ... | ... | ... | ... | ... | ... |
2695 | Virginia | 51000 | 5 | Per capita real GDP by state | All industry total | 2014 | 50861 |
2696 | Washington | 53000 | 8 | Per capita real GDP by state | All industry total | 2014 | 54773 |
2697 | West Virginia | 54000 | 5 | Per capita real GDP by state | All industry total | 2014 | 36234 |
2698 | Wisconsin | 55000 | 3 | Per capita real GDP by state | All industry total | 2014 | 46469 |
2699 | Wyoming | 56000 | 7 | Per capita real GDP by state | All industry total | 2014 | 61639 |
2700 rows × 7 columns
Note that the old column names (1997 through 2014 in this case) will always be stored in a column named “variable” after melting, and the datapoints that populated those columns will be contained in a column named “value”.
Before moving on, let’s rename some of the columns to avoid confusion in the next step:
gsp_clean = gsp_clean.rename({'GeoName':'State',
'GeoFIPS':'FIPS',
'variable':'Year'}, axis=1)
gsp_clean
State | FIPS | Region | ComponentName | Description | Year | value | |
---|---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | Gross domestic product (GDP) by state | All industry total | 1997 | 104218 |
1 | Alabama | 01000 | 5 | Gross domestic product (GDP) by state | Private industries | 1997 | 87014 |
2 | Alaska | 02000 | 8 | Gross domestic product (GDP) by state | All industry total | 1997 | 25446 |
3 | Alaska | 02000 | 8 | Gross domestic product (GDP) by state | Private industries | 1997 | 20284 |
4 | Arizona | 04000 | 6 | Gross domestic product (GDP) by state | All industry total | 1997 | 132708 |
... | ... | ... | ... | ... | ... | ... | ... |
2695 | Virginia | 51000 | 5 | Per capita real GDP by state | All industry total | 2014 | 50861 |
2696 | Washington | 53000 | 8 | Per capita real GDP by state | All industry total | 2014 | 54773 |
2697 | West Virginia | 54000 | 5 | Per capita real GDP by state | All industry total | 2014 | 36234 |
2698 | Wisconsin | 55000 | 3 | Per capita real GDP by state | All industry total | 2014 | 46469 |
2699 | Wyoming | 56000 | 7 | Per capita real GDP by state | All industry total | 2014 | 61639 |
2700 rows × 7 columns
9.3.3. Pivoting: Turning Rows to Columns#
The rows contain features that we want to move to the columns. Specifically, we need separate columns for the different combinations of ComponentName and Description. Because the feature names are spread across these two columns, we can combine them by concatenating them together, creating a new column called “feature”. Then we can drop ComponentName and Description:
gsp_clean = gsp_clean.assign(feature = gsp_clean['ComponentName'] + gsp_clean['Description'])
gsp_clean = gsp_clean.drop(['ComponentName', 'Description'], axis=1)
gsp_clean
State | FIPS | Region | Year | value | feature | |
---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | 1997 | 104218 | Gross domestic product (GDP) by stateAll indus... |
1 | Alabama | 01000 | 5 | 1997 | 87014 | Gross domestic product (GDP) by statePrivate i... |
2 | Alaska | 02000 | 8 | 1997 | 25446 | Gross domestic product (GDP) by stateAll indus... |
3 | Alaska | 02000 | 8 | 1997 | 20284 | Gross domestic product (GDP) by statePrivate i... |
4 | Arizona | 04000 | 6 | 1997 | 132708 | Gross domestic product (GDP) by stateAll indus... |
... | ... | ... | ... | ... | ... | ... |
2695 | Virginia | 51000 | 5 | 2014 | 50861 | Per capita real GDP by stateAll industry total |
2696 | Washington | 53000 | 8 | 2014 | 54773 | Per capita real GDP by stateAll industry total |
2697 | West Virginia | 54000 | 5 | 2014 | 36234 | Per capita real GDP by stateAll industry total |
2698 | Wisconsin | 55000 | 3 | 2014 | 46469 | Per capita real GDP by stateAll industry total |
2699 | Wyoming | 56000 | 7 | 2014 | 61639 | Per capita real GDP by stateAll industry total |
2700 rows × 6 columns
To move these features from the rows to columns, apply the .pivot_table()
method to a dataframe:
df.pivot_table(index, columns, values)
To understand this method, pay close attention to its parameters:
- index - a list containing the names of the current columns that we want to remain columns in the reshaped data. These include the primary keys (State and Year in this example), synonyms for a key (FIPS), and features that vary only by ID (Region).
- columns - the name of the column that contains the names of the new columns we are trying to create.
- values - the name of the column that contains the datapoints we are trying to move to the new columns.
There is one important issue to keep in mind when using the .pivot_table()
method. .pivot_table()
contains default behavior to handle cases in which the keys in index
do not uniquely identify the rows of a dataframe. By default, .pivot_table()
takes the mean within group. If the column specified within values
is non-numeric, however, the mean is not defined. So first make sure that any columns that are numeric are recognized as a numeric data type. If a column contains text, categories, or other non-numeric data, specify aggfunc='first'
inside .pivot_table()
. Writing aggfunc='first'
tells .pivot_table()
to use the first cell within a group instead of calculating the mean within a group, and the first cell is defined for non-numeric columns. If the columns in index
do uniquely identify rows, there is only one cell per group anyway, so it is fine to take the first cell.
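To see what the default does, here is a toy illustration with made-up values: when the index columns do not uniquely identify the rows, .pivot_table() averages the duplicates.
toy = pd.DataFrame({'id': ['a', 'a', 'b'],
                    'feature': ['x', 'x', 'x'],
                    'value': [1, 3, 5]})
toy.pivot_table(index='id', columns='feature', values='value')
# id 'a' appears twice, so its values are averaged: (1 + 3) / 2 = 2.0
# id 'b' appears once, so it keeps its single value: 5.0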
Presently within gsp_clean
the datapoints are contained within the value
column. Unfortunately, value
is not currently understood to be numeric:
gsp_clean.dtypes
State object
FIPS object
Region object
Year object
value object
feature object
dtype: object
Following our code from module 8, we convert value to a numeric data type:
gsp_clean.value = gsp_clean.value.astype('float')
gsp_clean.dtypes
State object
FIPS object
Region object
Year object
value float64
feature object
dtype: object
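As an aside, if a column like this contained stray non-numeric entries, .astype('float') would raise an error. One option in that situation (not needed here, since every entry parses cleanly) is pd.to_numeric() with errors='coerce', which converts anything it cannot parse into a missing value:
gsp_clean['value'] = pd.to_numeric(gsp_clean['value'], errors='coerce')   # unparseable entries become NaN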
Now we can reshape the data with .pivot_table()
:
gsp_clean['value'] = gsp_clean['value'].astype(int) # first convert this column to int
gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'],
columns='feature',
values='value')
gsp_clean
feature | Gross domestic product (GDP) by stateAll industry total | Gross domestic product (GDP) by statePrivate industries | Per capita real GDP by stateAll industry total | |||
---|---|---|---|---|---|---|
State | FIPS | Region | Year | |||
Alabama | 01000 | 5 | 1997 | 104218 | 87014 | 31398 |
1998 | 109414 | 91506 | 32164 | |||
1999 | 115015 | 96284 | 33106 | |||
2000 | 119020 | 99665 | 33284 | |||
2001 | 122822 | 102978 | 33312 | |||
... | ... | ... | ... | ... | ... | ... |
Wyoming | 56000 | 7 | 2010 | 39103 | 33832 | 64602 |
2011 | 41499 | 36164 | 63981 | |||
2012 | 40201 | 34604 | 60744 | |||
2013 | 40979 | 35096 | 60743 | |||
2014 | 42021 | 35947 | 61639 |
900 rows × 3 columns
Alternatively, if we choose to leave the value column as an object data type, we can use this code instead to perform the long-to-wide reshape:
gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'],
columns='feature',
values='value',
aggfunc='first')
Note that the dataframe is not formatted in a flat way, because the IDs and related features are stored in the row index instead of in columns. To convert the dataframe back to a standard format, use the .to_records() method within the pd.DataFrame() function:
gsp_clean = pd.DataFrame(gsp_clean.to_records())
gsp_clean
State | FIPS | Region | Year | Gross domestic product (GDP) by stateAll industry total | Gross domestic product (GDP) by statePrivate industries | Per capita real GDP by stateAll industry total | |
---|---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | 1997 | 104218 | 87014 | 31398 |
1 | Alabama | 01000 | 5 | 1998 | 109414 | 91506 | 32164 |
2 | Alabama | 01000 | 5 | 1999 | 115015 | 96284 | 33106 |
3 | Alabama | 01000 | 5 | 2000 | 119020 | 99665 | 33284 |
4 | Alabama | 01000 | 5 | 2001 | 122822 | 102978 | 33312 |
... | ... | ... | ... | ... | ... | ... | ... |
895 | Wyoming | 56000 | 7 | 2010 | 39103 | 33832 | 64602 |
896 | Wyoming | 56000 | 7 | 2011 | 41499 | 36164 | 63981 |
897 | Wyoming | 56000 | 7 | 2012 | 40201 | 34604 | 60744 |
898 | Wyoming | 56000 | 7 | 2013 | 40979 | 35096 | 60743 |
899 | Wyoming | 56000 | 7 | 2014 | 42021 | 35947 | 61639 |
900 rows × 7 columns
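An alternative that accomplishes the same thing is the .reset_index() method, which moves the row-index levels back into ordinary columns:
# applied to the pivoted dataframe (the one with State, FIPS, Region, and Year in the row index),
# this is an alternative to pd.DataFrame(gsp_clean.to_records()):
gsp_clean = gsp_clean.reset_index()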
All that’s left to clean the data is to rename the columns and to create a new column containing the percent of the state’s GDP that comes from private industry:
gsp_clean = gsp_clean.rename({'Gross domestic product (GDP) by stateAll industry total':'GDP',
'Gross domestic product (GDP) by statePrivate industries':'GDPprivate',
'Per capita real GDP by stateAll industry total':'GDPpc'},
axis=1)
gsp_clean = gsp_clean.assign(percent_private = 100* gsp_clean.GDPprivate / gsp_clean.GDP)
gsp_clean
State | FIPS | Region | Year | GDP | GDPprivate | GDPpc | percent_private | |
---|---|---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | 1997 | 104218 | 87014 | 31398 | 83.492295 |
1 | Alabama | 01000 | 5 | 1998 | 109414 | 91506 | 32164 | 83.632808 |
2 | Alabama | 01000 | 5 | 1999 | 115015 | 96284 | 33106 | 83.714298 |
3 | Alabama | 01000 | 5 | 2000 | 119020 | 99665 | 33284 | 83.738027 |
4 | Alabama | 01000 | 5 | 2001 | 122822 | 102978 | 33312 | 83.843285 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
895 | Wyoming | 56000 | 7 | 2010 | 39103 | 33832 | 64602 | 86.520216 |
896 | Wyoming | 56000 | 7 | 2011 | 41499 | 36164 | 63981 | 87.144269 |
897 | Wyoming | 56000 | 7 | 2012 | 40201 | 34604 | 60744 | 86.077461 |
898 | Wyoming | 56000 | 7 | 2013 | 40979 | 35096 | 60743 | 85.643866 |
899 | Wyoming | 56000 | 7 | 2014 | 42021 | 35947 | 61639 | 85.545323 |
900 rows × 8 columns
9.4. Working with Strings (Example: the 2019 ANES Pilot Study)#
In module 8, we worked with the 2019 American National Election Study pilot survey and saved the cleaned data in a separate CSV file:
anes = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/anes_clean.csv")
anes
caseid | liveurban | vote16 | protest | vote | most_important_issue | confecon | ideology | partyID | universal_income | ... | partisanship | ftbiden_level | age | age2 | ftbiden_float | ftbiden_cat | ftbiden_str | prefersbiden | worried_econ | favor_both | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Suburb | Someone else | False | Joe Biden | Health Care | A little worried | Conservative | Democrat | Favor a moderate amount | ... | 5.0 | neutral | 51 | 2601 | 52.0 | 52.0 | 52.0 | True | False | True |
1 | 2 | Suburb | Donald Trump | False | Donald Trump | Working together | A little worried | Conservative | Republican | Oppose a moderate amount | ... | 0.0 | neutral | 78 | 6084 | 41.0 | 41.0 | 41.0 | False | False | False |
2 | 3 | Rural | Hillary Clinton | False | Joe Biden | health care | Extremely worried | Moderate | Democrat | Neither favor nor oppose | ... | 88.0 | like | 66 | 4356 | 88.0 | 88.0 | 88.0 | True | True | False |
3 | 4 | City | Hillary Clinton | False | Donald Trump | The economy. | A little worried | Moderate | Democrat | Neither favor nor oppose | ... | -100.0 | dislike | 41 | 1681 | 0.0 | 0.0 | 0.0 | False | False | False |
4 | 5 | City | Donald Trump | False | Donald Trump | China | Not at all worried | Conservative | Republican | Oppose a great deal | ... | -69.0 | dislike | 80 | 6400 | 25.0 | 25.0 | 25.0 | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3160 | 3161 | Town | Donald Trump | False | Donald Trump | The infiltration of Marxists into the institut... | A little worried | Conservative | Republican | Oppose a great deal | ... | -74.0 | dislike | 72 | 5184 | 7.0 | 7.0 | 7.0 | False | False | False |
3161 | 3162 | City | Someone else | False | Someone else | Lack of basic resources being provided and off... | Extremely worried | NaN | Democrat | Favor a great deal | ... | -10.0 | dislike | 24 | 576 | 25.0 | 25.0 | 25.0 | False | True | True |
3162 | 3163 | City | Did not vote | False | Probably will not vote | donald trump | Very worried | Liberal | Independent | Oppose a great deal | ... | 44.0 | neutral | 40 | 1600 | 50.0 | 50.0 | 50.0 | True | False | False |
3163 | 3164 | Suburb | Did not vote | False | Joe Biden | Donald Trump | Moderately worried | Liberal | Democrat | Favor a moderate amount | ... | 94.0 | like | 60 | 3600 | 95.0 | 95.0 | 95.0 | True | True | False |
3164 | 3165 | Town | Hillary Clinton | False | Joe Biden | trump | Extremely worried | Moderate | Democrat | Oppose a great deal | ... | 70.0 | neutral | 60 | 3600 | 70.0 | 70.0 | 70.0 | True | True | False |
3165 rows × 62 columns
The data contain a column most_important_issue
which records individuals’ responses to the question of what they feel is the most important issue facing the United States as of December 2019. These responses are in the respondents’ own words. For example, one response is
anes.most_important_issue[5]
'The influence of big money on our political system both domestic and foreign'
These responses are strings, and we can employ the suite of string operations to manipulate the data in this column. String methods are available through the .str accessor of a pandas column, so we call .str before the method name in order to use them.
One important technique is to alter the case within the text so that we can identify words without worrying about case sensitivity. To turn all responses to lowercase, use .str.lower()
:
anes.most_important_issue.str.lower()
0 health care
1 working together
2 health care
3 the economy.
4 china
...
3160 the infiltration of marxists into the institut...
3161 lack of basic resources being provided and off...
3162 donald trump
3163 donald trump
3164 trump
Name: most_important_issue, Length: 3165, dtype: object
And to turn all the text to uppercase, use .str.upper()
:
anes.most_important_issue.str.upper()
0 HEALTH CARE
1 WORKING TOGETHER
2 HEALTH CARE
3 THE ECONOMY.
4 CHINA
...
3160 THE INFILTRATION OF MARXISTS INTO THE INSTITUT...
3161 LACK OF BASIC RESOURCES BEING PROVIDED AND OFF...
3162 DONALD TRUMP
3163 DONALD TRUMP
3164 TRUMP
Name: most_important_issue, Length: 3165, dtype: object
One issue that can inhibit our ability to search through text is the existence of leading and trailing whitespace in the responses. Whitespace can exist, invisibly, for several reasons. It is possible that the data authors included whitespace in order to standardize the number of characters that the column takes up in the data file. To remove both leading and trailing whitespace, use .str.strip():
anes.most_important_issue = anes.most_important_issue.str.strip()
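If we only want to remove whitespace on one side, .str.lstrip() strips leading whitespace and .str.rstrip() strips trailing whitespace:
anes.most_important_issue.str.lstrip()   # remove leading whitespace only
anes.most_important_issue.str.rstrip()   # remove trailing whitespace only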
The .replace() method finds entries that exactly match a given string and replaces them with a different string. For example, we can replace every response that reads “health care” with “hospital stuff”:
anes.most_important_issue.str.lower().replace('health care', 'hospital stuff')
0 hospital stuff
1 working together
2 hospital stuff
3 the economy.
4 china
...
3160 the infiltration of marxists into the institut...
3161 lack of basic resources being provided and off...
3162 donald trump
3163 donald trump
3164 trump
Name: most_important_issue, Length: 3165, dtype: object
If we want to replace pieces of these strings, we can specify the regex=True argument. The following line of code replaces every occurrence of “trump” in the lowercased responses with “twimp”. With regex=True, the method replaces the pattern “trump” anywhere it appears, as opposed to the default regex=False, which only replaces entire entries that are exactly “trump”:
anes.most_important_issue.str.lower().replace('trump', 'twimp', regex=True)
0 health care
1 working together
2 health care
3 the economy.
4 china
...
3160 the infiltration of marxists into the institut...
3161 lack of basic resources being provided and off...
3162 donald twimp
3163 donald twimp
3164 twimp
Name: most_important_issue, Length: 3165, dtype: object
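Note that .replace() as used above operates on each entry as a whole. pandas also provides a .str.replace() method, accessed through .str, that replaces occurrences of a pattern inside each entry; the following is another way to perform the substring replacement shown above:
anes.most_important_issue.str.lower().str.replace('trump', 'twimp', regex=True)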
To create a new column that identifies when particular words are used, use either the .str.match() or .str.contains() method. .str.match() is true if the entry matches the provided pattern starting from its first character,
anes.most_important_issue.str.lower().str.match('trump')
0 False
1 False
2 False
3 False
4 False
...
3160 False
3161 False
3162 False
3163 False
3164 True
Name: most_important_issue, Length: 3165, dtype: object
and .str.contains()
is true if the provided string exists anywhere within the entry.
anes.most_important_issue.str.lower().str.contains('trump')
0 False
1 False
2 False
3 False
4 False
...
3160 False
3161 False
3162 True
3163 True
3164 True
Name: most_important_issue, Length: 3165, dtype: object
These columns of logical values can be placed into the data and used to filter rows, among other purposes. For example, we can see the number of Democrats, Republicans, and independents who use the word “trump” in their statement of the country’s most important problem:
anes['problem_trump'] = anes.most_important_issue.str.lower().str.contains('trump')
pd.DataFrame(anes.groupby(['partyID', 'problem_trump']).size().unstack())
problem_trump | False | True |
---|---|---|
partyID | ||
Democrat | 1029 | 293 |
Independent | 479 | 47 |
Republican | 1121 | 73 |
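The same table can also be produced a bit more directly with the pd.crosstab() function, which cross-tabulates two columns:
pd.crosstab(anes.partyID, anes.problem_trump)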
The .str.replace()
, .str.match()
, and .str.contains()
methods can also accept regular expressions for identifying additional patterns within the string. Regular expressions are beyond our scope in this document, but here is a good discussion of regular expressions in Python.
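As a small taste of what regular expressions can do (the pattern here is just for illustration), the pattern 'health ?care' matches both “health care” and “healthcare”, and na=False treats missing responses as False:
anes.most_important_issue.str.lower().str.contains('health ?care', regex=True, na=False)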
To calculate the length of each response in terms of the number of characters, use .str.len()
. In this case, it is important to make sure the whitespace is removed first with .str.strip()
, otherwise the spaces will count towards the length:
anes.most_important_issue.str.len()
0 11.0
1 16.0
2 11.0
3 12.0
4 5.0
...
3160 254.0
3161 325.0
3162 12.0
3163 12.0
3164 5.0
Name: most_important_issue, Length: 3165, dtype: float64
We can use the length to display, for example, the single longest response in the data. To display whole strings, turn off the display limit by typing pd.options.display.max_colwidth = None. Here I set the limit to 500 characters instead, because the longest response goes somewhat off the rails and I’d rather not display all of it:
anes['length'] = anes.most_important_issue.str.len()
pd.options.display.max_colwidth = 500
anes.sort_values(by = 'length', ascending = False).most_important_issue.head(1)
819 Immigration has been and has become a very big problem. The gangs, drug wars, the crime has climbed in every single town, county, state in this country and I believe it all revolves around too many different races, too many prejudices. Most people I talk to dislike foreigners and most foreigners hate Americans. We ask, why come here then? Because they get financial aid, housing, medical, dental aid and compared to the wars, poverty and the way they lived in their own country, America is heav...
Name: most_important_issue, dtype: object
After all that, we set the string display limit back to the default of 50 characters:
pd.options.display.max_colwidth = 50
In some cases it might be necessary to split entries into different features according to a delimiter or pattern. It doesn’t make sense to split the most_important_issue column in this example, but as an illustration, we can split the responses on periods:
anes.most_important_issue.str.split('.')
0 [Health Care]
1 [Working together]
2 [health care]
3 [The economy, ]
4 [China]
...
3160 [The infiltration of Marxists into the institu...
3161 [Lack of basic resources being provided and of...
3162 [donald trump]
3163 [Donald Trump]
3164 [trump]
Name: most_important_issue, Length: 3165, dtype: object
The expand=True
argument displays these split strings in different columns in a dataframe:
anes.most_important_issue.str.split('.', expand=True)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Health Care | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
1 | Working together | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
2 | health care | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
3 | The economy | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None | |
4 | China | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3160 | The infiltration of Marxists into the institut... | Money seems to be no object to them | Their over promising will lead to eventual chaos | Usually that leads to the over promised event... | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None | |
3161 | Lack of basic resources being provided and off... | This is a first world nation, a nation claimi... | yet we don't have housing or healthcare secur... | That needs to change and it needs to be chang... | A basic standard of living needs to be guaran... | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None | |
3162 | donald trump | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
3163 | Donald Trump | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
3164 | trump | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
3165 rows × 29 columns
Writing n=1
only splits on the first occurrence of the delimiter, n=2
splits on the first and second occurrence, and so on:
anes.most_important_issue.str.split('.', expand=True, n=2)
0 | 1 | 2 | |
---|---|---|---|
0 | Health Care | None | None |
1 | Working together | None | None |
2 | health care | None | None |
3 | The economy | None | |
4 | China | None | None |
... | ... | ... | ... |
3160 | The infiltration of Marxists into the institut... | Money seems to be no object to them | Their over promising will lead to eventual ch... |
3161 | Lack of basic resources being provided and off... | This is a first world nation, a nation claimi... | yet we don't have housing or healthcare secur... |
3162 | donald trump | None | None |
3163 | Donald Trump | None | None |
3164 | trump | None | None |
3165 rows × 3 columns
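The pieces produced by .str.split() can themselves be indexed through .str. For example, to keep only the text that comes before the first period in each response:
anes.most_important_issue.str.split('.', n=1).str[0]   # the first piece of each split response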
An individual string can be indexed according to character number. These indices can be applied to every entry in a string column as well. To pull just the first five characters out of each response in most_important_issue
, we can type:
anes.most_important_issue.str[0:5]
0 Healt
1 Worki
2 healt
3 The e
4 China
...
3160 The i
3161 Lack
3162 donal
3163 Donal
3164 trump
Name: most_important_issue, Length: 3165, dtype: object
To pull the characters in positions 6 through 10 (counting from 0), we type:
anes.most_important_issue.str[6:11]
0 Care
1 g tog
2 care
3 onomy
4
...
3160 filtr
3161 f bas
3162 trum
3163 Trum
3164
Name: most_important_issue, Length: 3165, dtype: object
To pull the last four characters, use a negative number to begin the range, and leave the end of the range blank:
anes.most_important_issue.str[-4:]
0 Care
1 ther
2 care
3 omy.
4 hina
...
3160 ers.
3161 eed.
3162 rump
3163 rump
3164 rump
Name: most_important_issue, Length: 3165, dtype: object
9.5. Working with Dates and Times (Example: Twitter)#
In module 4 we discussed Twitter as an example of how to use an API in Python. We save the four credentials we need to access the API (the consumer key, the consumer secret, the access token, and the access token secret) in a .env file, and we load those keys into Python without displaying them in our notebooks by using the dotenv
package:
import dotenv
import os
os.chdir('/Users/jk8sd/Box Sync/Practice and Applications 1 online/Module 9 - Data Managenent in pandas Part 2')
dotenv.load_dotenv()
ConsumerKey = os.getenv('ConsumerKey')
ConsumerSecret = os.getenv('ConsumerSecret')
AccessToken = os.getenv('AccessToken')
AccessTokenSecret = os.getenv('AccessTokenSecret')
Then we use the tweepy
package to work with the Twitter API and we create a Twitter cursor with our credentials:
import tweepy
auth = tweepy.OAuthHandler(ConsumerKey, ConsumerSecret)
auth.set_access_token(AccessToken, AccessTokenSecret)
api = tweepy.API(auth)
The following code extracts 1000 tweets that contain the hashtag “#uva”:
msgs = []
msg =[]
for tweet in tweepy.Cursor(api.search, q='#uva').items(1000):
msg = [tweet.text, tweet.created_at, tweet.user.screen_name]
msg = tuple(msg)
msgs.append(msg)
tweets = pd.DataFrame(msgs, columns = ['text', 'created_at', 'user'])
tweets
text | created_at | user | |
---|---|---|---|
0 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing |
1 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis |
2 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 |
3 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 |
4 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 |
... | ... | ... | ... |
995 | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... | 2020-06-23 15:12:08 | norahlfy |
996 | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | 2020-06-23 15:08:40 | AgNPalabras |
997 | Zur Einstimmung #UVA vom letzten #gig bei uns ... | 2020-06-23 14:52:23 | WeAppU |
998 | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | 2020-06-23 14:49:22 | VirialexViri |
999 | Good question... My instinct says Ty Jerome, b... | 2020-06-23 14:32:44 | The_Superhoo |
1000 rows × 3 columns
9.5.1. Extracting Year, Month, Day, and Time from a Timestamp Column#
Note that the data contain timestamps that include the date and time, to the second, of each tweet. The created_at
column has a datetime64
data type:
tweets.dtypes
text object
created_at datetime64[ns]
user object
dtype: object
The difference between a datetime64
data type and an object
or numeric type is that we have the ability to extract individual elements of time from a datetime64
value. To pull out the year, month, day, hour, minute, and second of the timestamp, we use the following attributes:
[tweets.created_at[0].month, tweets.created_at[0].day,
tweets.created_at[0].year, tweets.created_at[0].hour,
tweets.created_at[0].minute, tweets.created_at[0].second]
[6, 30, 2020, 2, 20, 43]
The easiest way to create new columns with these elements for all the values in the dataframe is to use list comprehensions:
tweets['month'] = [x.month for x in tweets.created_at]
tweets['day'] = [x.day for x in tweets.created_at]
tweets['year'] = [x.year for x in tweets.created_at]
tweets['hour'] = [x.hour for x in tweets.created_at]
tweets['minute'] = [x.minute for x in tweets.created_at]
tweets['second'] = [x.second for x in tweets.created_at]
tweets
text | created_at | user | month | day | year | hour | minute | second | |
---|---|---|---|---|---|---|---|---|---|
0 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing | 6 | 30 | 2020 | 2 | 20 | 43 |
1 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis | 6 | 30 | 2020 | 2 | 18 | 17 |
2 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 | 6 | 30 | 2020 | 1 | 55 | 2 |
3 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 | 6 | 30 | 2020 | 1 | 40 | 0 |
4 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 | 6 | 30 | 2020 | 1 | 39 | 42 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... | 2020-06-23 15:12:08 | norahlfy | 6 | 23 | 2020 | 15 | 12 | 8 |
996 | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | 2020-06-23 15:08:40 | AgNPalabras | 6 | 23 | 2020 | 15 | 8 | 40 |
997 | Zur Einstimmung #UVA vom letzten #gig bei uns ... | 2020-06-23 14:52:23 | WeAppU | 6 | 23 | 2020 | 14 | 52 | 23 |
998 | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | 2020-06-23 14:49:22 | VirialexViri | 6 | 23 | 2020 | 14 | 49 | 22 |
999 | Good question... My instinct says Ty Jerome, b... | 2020-06-23 14:32:44 | The_Superhoo | 6 | 23 | 2020 | 14 | 32 | 44 |
1000 rows × 9 columns
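A vectorized alternative to the list comprehensions above is the .dt accessor, which exposes the same time elements for an entire datetime64 column at once:
tweets['month'] = tweets.created_at.dt.month    # same result as the comprehension above
tweets['hour'] = tweets.created_at.dt.hour      # likewise for day, year, minute, and second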
9.5.2. Generating Timestamps from Separate Month, Day, and Year Columns#
Sometimes raw data will come to us with these time elements already placed in separate columns, so that we have month, day, and year columns but no date column. There are important advantages to placing all this information in one column whose values Python understands as timestamps: timestamps are easier to filter, manipulate, index, and plot. To create one timestamp from separate columns that contain the month, day, and year (and optionally the hour, minute, and second), use the pd.to_datetime() function:
pd.to_datetime(df)
where df
is a subset of the dataframe that contains the separate columns for the time elements. In order for this function to work, these columns must be named year
, month
, and day
, and optionally hour
, minute
, and second
. If these columns are named something else, first use the .rename()
method to change the names to the ones listed above.
To create a timestamp from the individual time element columns, we can type:
tweets['timestamp'] = pd.to_datetime(tweets[['year', 'month', 'day', 'hour', 'minute', 'second']])
tweets
text | created_at | user | month | day | year | hour | minute | second | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|
0 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing | 6 | 30 | 2020 | 2 | 20 | 43 | 2020-06-30 02:20:43 |
1 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis | 6 | 30 | 2020 | 2 | 18 | 17 | 2020-06-30 02:18:17 |
2 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 | 6 | 30 | 2020 | 1 | 55 | 2 | 2020-06-30 01:55:02 |
3 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 | 6 | 30 | 2020 | 1 | 40 | 0 | 2020-06-30 01:40:00 |
4 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 | 6 | 30 | 2020 | 1 | 39 | 42 | 2020-06-30 01:39:42 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... | 2020-06-23 15:12:08 | norahlfy | 6 | 23 | 2020 | 15 | 12 | 8 | 2020-06-23 15:12:08 |
996 | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | 2020-06-23 15:08:40 | AgNPalabras | 6 | 23 | 2020 | 15 | 8 | 40 | 2020-06-23 15:08:40 |
997 | Zur Einstimmung #UVA vom letzten #gig bei uns ... | 2020-06-23 14:52:23 | WeAppU | 6 | 23 | 2020 | 14 | 52 | 23 | 2020-06-23 14:52:23 |
998 | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | 2020-06-23 14:49:22 | VirialexViri | 6 | 23 | 2020 | 14 | 49 | 22 | 2020-06-23 14:49:22 |
999 | Good question... My instinct says Ty Jerome, b... | 2020-06-23 14:32:44 | The_Superhoo | 6 | 23 | 2020 | 14 | 32 | 44 | 2020-06-23 14:32:44 |
1000 rows × 10 columns
Like created_at
, the new timestamp
column is also of the datetime64
data type:
tweets.dtypes
text object
created_at datetime64[ns]
user object
month int64
day int64
year int64
hour int64
minute int64
second int64
timestamp datetime64[ns]
dtype: object
9.5.3. Converting String Columns to Timestamps#
Sometimes information about the date and time of an event in the data will be coded as a string. In that case, to use the time series functionality on the dates in a dataframe, a string column that contains the dates and times will need to be converted to the datetime64
data type. The best way to do that is to use the pd.to_datetime()
function once again. This time, we pass the column that contains the dates in string format to the pd.to_datetime()
function. This function can identify the year, month, day, hour, minute, and second from many different formats.
For example, we can write the date Sunday, June 19, 2016 at 8:00 PM (the date and time of game 7 of the 2016 NBA finals, the greatest moment in sports history) in many different ways such as:
“Sunday, June 19, 2016, 8:00 PM”
“6-19-2016 8PM”
“6/19/16 8 p.m.”
“19/6/16 8pm”
The pd.to_datetime()
function reads all of these formats:
pd.to_datetime("Sunday, June 19, 2016, 8:00 PM")
Timestamp('2016-06-19 20:00:00')
pd.to_datetime("6-19-2016 8PM")
Timestamp('2016-06-19 20:00:00')
pd.to_datetime("6/19/16 8 p.m.")
Timestamp('2016-06-19 20:00:00')
pd.to_datetime("19/6/16 8pm")
Timestamp('2016-06-19 20:00:00')
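If we know exactly how the dates are written, we can remove the guesswork by passing a format argument built from strftime codes (%d, %m, and %y below stand for day, month, and two-digit year):
pd.to_datetime("19/06/16", format="%d/%m/%y")
# Timestamp('2016-06-19 00:00:00')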
The pd.to_datetime()
function can feel like a magic trick because it automatically detects where the different elements of the date and time are located in the string. However, it can go wrong, and it is important to check that the date was correctly understood by pd.to_datetime()
. Suppose for example that the date was June 19, 1945, coded as “6/19/45”. pd.to_datetime()
reads the date as 2045:
pd.to_datetime("6/19/45")
Timestamp('2045-06-19 00:00:00')
The easiest way to fix this problem would be to break the timestamp up into separate columns for year, month, and day, subtract 100 from the years that were placed in the wrong century, and recombine these columns into a new timestamp with pd.to_datetime()
. But the real danger here is assuming the pd.to_datetime()
function worked without confirmation.
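One way to implement that idea without splitting the timestamp into separate columns is sketched below; the cutoff year of 2025 is an arbitrary choice for this example, and the two dates are made up:
dates = pd.to_datetime(pd.Series(["6/19/45", "6/19/99"]))      # parsed as 2045-06-19 and 1999-06-19
too_late = dates.dt.year > 2025                                # years that landed in the wrong century
dates[too_late] = dates[too_late] - pd.DateOffset(years=100)   # move them back by 100 years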
The pd.to_datetime()
function also works on an entire column of dates coded as strings. For example, let’s convert the created_at
column to string:
tweets['created_at'] = tweets.created_at.astype('str')
tweets.dtypes
text object
created_at object
user object
month int64
day int64
year int64
hour int64
minute int64
second int64
timestamp datetime64[ns]
dtype: object
To convert the created_at column back to a timestamp, we can apply pd.to_datetime():
tweets['created_at'] = pd.to_datetime(tweets.created_at)
tweets.dtypes
text object
created_at datetime64[ns]
user object
month int64
day int64
year int64
hour int64
minute int64
second int64
timestamp datetime64[ns]
dtype: object
9.5.4. Filtering Rows Based on Date Ranges#
If every row in the data has a unique timestamp, then one convenient way to use the times to filter the data is to set the index (the row labels) to be equal to the timestamps:
tweets.index = tweets['created_at']
A pandas dataframe with a timestamp index can interpret dates and times written in many formats, and we can pass any of these formats to the index of tweets to select rows. For example, to see all the tweets made on June 30, we can type:
tweets['June 30, 2020']
text | created_at | user | month | day | year | hour | minute | second | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|
created_at | ||||||||||
2020-06-30 02:20:43 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing | 6 | 30 | 2020 | 2 | 20 | 43 | 2020-06-30 02:20:43 |
2020-06-30 02:18:17 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis | 6 | 30 | 2020 | 2 | 18 | 17 | 2020-06-30 02:18:17 |
2020-06-30 01:55:02 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 | 6 | 30 | 2020 | 1 | 55 | 2 | 2020-06-30 01:55:02 |
2020-06-30 01:40:00 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 | 6 | 30 | 2020 | 1 | 40 | 0 | 2020-06-30 01:40:00 |
2020-06-30 01:39:42 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 | 6 | 30 | 2020 | 1 | 39 | 42 | 2020-06-30 01:39:42 |
2020-06-30 01:30:45 | RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... | 2020-06-30 01:30:45 | sumanebot | 6 | 30 | 2020 | 1 | 30 | 45 | 2020-06-30 01:30:45 |
2020-06-30 01:08:08 | We are🎉excited to🗣announce opening for in-pers... | 2020-06-30 01:08:08 | GrandMarc_UVA | 6 | 30 | 2020 | 1 | 8 | 8 | 2020-06-30 01:08:08 |
2020-06-30 00:20:15 | RT @cavalierinsider: On Saturday, The Basketba... | 2020-06-30 00:20:15 | annefutch | 6 | 30 | 2020 | 0 | 20 | 15 | 2020-06-30 00:20:15 |
2020-06-30 00:19:00 | Anthony Harris part of NFL's top safety tandem... | 2020-06-30 00:19:00 | hoosdaily | 6 | 30 | 2020 | 0 | 19 | 0 | 2020-06-30 00:19:00 |
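In more recent versions of pandas, selecting rows this way with a bare string is deprecated in favor of .loc, which supports the same kind of partial-string indexing:
tweets.loc['2020-06-30']   # the same rows, selected through .loc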
If these row labels are sorted in time (here they run from most recent to oldest), then we can extract slices of the data that fall within a time range. We write two dates separated by a colon, and the output contains all rows from the row that matches the first date through the row that matches the last date. Because the rows are listed with the most recent ones first, we write the later timestamp first and the earlier timestamp second. If these exact dates and times do not exist in the data, the syntax still finds the locations where these rows would exist and extracts the rows in between these two locations. Here is code to extract all tweets that were posted between 2pm and 3pm on June 29:
tweets['6/29/2020 15:00':'6/29/2020 14:00']
text | created_at | user | month | day | year | hour | minute | second | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|
created_at | ||||||||||
2020-06-29 14:54:31 | UVa alumni team opts out of 2020 TBT, plans to... | 2020-06-29 14:54:31 | hoosdaily | 6 | 29 | 2020 | 14 | 54 | 31 | 2020-06-29 14:54:31 |
2020-06-29 14:54:26 | Ben Wallace Is a Proud Dad of Three Kids — Mee... | 2020-06-29 14:54:26 | hoosdaily | 6 | 29 | 2020 | 14 | 54 | 26 | 2020-06-29 14:54:26 |
2020-06-29 14:51:16 | Inicia Sonora exportación de uva a Corea del S... | 2020-06-29 14:51:16 | AgroTratos | 6 | 29 | 2020 | 14 | 51 | 16 | 2020-06-29 14:51:16 |
2020-06-29 14:50:46 | RT @Wahoos247: Wake Forest commitment Christia... | 2020-06-29 14:50:46 | FatWhite101 | 6 | 29 | 2020 | 14 | 50 | 46 | 2020-06-29 14:50:46 |
2020-06-29 14:42:28 | RT @Cavs_Corner: Film Room: In the next instal... | 2020-06-29 14:42:28 | oleuva | 6 | 29 | 2020 | 14 | 42 | 28 | 2020-06-29 14:42:28 |
2020-06-29 14:40:56 | RT @cavalierinsider: On Saturday, The Basketba... | 2020-06-29 14:40:56 | John_Shifflett | 6 | 29 | 2020 | 14 | 40 | 56 | 2020-06-29 14:40:56 |
2020-06-29 14:37:11 | Tatil planları yapmaya başladıysan sana harika... | 2020-06-29 14:37:11 | YasinALTINEL | 6 | 29 | 2020 | 14 | 37 | 11 | 2020-06-29 14:37:11 |
2020-06-29 14:35:52 | On Saturday, The Basketball Tournament will be... | 2020-06-29 14:35:52 | cavalierinsider | 6 | 29 | 2020 | 14 | 35 | 52 | 2020-06-29 14:35:52 |
2020-06-29 14:29:17 | Jesse Rutherford of the Nelson County Board of... | 2020-06-29 14:29:17 | JerryMillerNow | 6 | 29 | 2020 | 14 | 29 | 17 | 2020-06-29 14:29:17 |
2020-06-29 14:20:58 | RT @thesabre: NewsLink is updated! #UVA sports... | 2020-06-29 14:20:58 | Slider_Hoos | 6 | 29 | 2020 | 14 | 20 | 58 | 2020-06-29 14:20:58 |
2020-06-29 14:20:44 | RT @Wahoos247: Wake Forest commitment Christia... | 2020-06-29 14:20:44 | agee_brandon | 6 | 29 | 2020 | 14 | 20 | 44 | 2020-06-29 14:20:44 |
2020-06-29 14:18:33 | Ramseyer fondly recalls Cavs' perfect regular ... | 2020-06-29 14:18:33 | hoosdaily | 6 | 29 | 2020 | 14 | 18 | 33 | 2020-06-29 14:18:33 |
2020-06-29 14:15:32 | Ben Wallace Is a Proud Dad of Three Kids — Mee... | 2020-06-29 14:15:32 | hoosdaily | 6 | 29 | 2020 | 14 | 15 | 32 | 2020-06-29 14:15:32 |
2020-06-29 14:11:08 | NewsLink is updated! #UVA sports links all in ... | 2020-06-29 14:11:08 | thesabre | 6 | 29 | 2020 | 14 | 11 | 8 | 2020-06-29 14:11:08 |
To extract all tweets before a point in time, write the timestamp before the colon and write nothing after the colon. To extract all tweets after a point in time, write nothing before the colon and write the timestamp after the colon (remember that the rows run from most recent to oldest). To extract all tweets posted after June 29 at 3pm:
tweets[:'6/29/2020 15:00']
text | created_at | user | month | day | year | hour | minute | second | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|
created_at | ||||||||||
2020-06-30 02:20:43 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing | 6 | 30 | 2020 | 2 | 20 | 43 | 2020-06-30 02:20:43 |
2020-06-30 02:18:17 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis | 6 | 30 | 2020 | 2 | 18 | 17 | 2020-06-30 02:18:17 |
2020-06-30 01:55:02 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 | 6 | 30 | 2020 | 1 | 55 | 2 | 2020-06-30 01:55:02 |
2020-06-30 01:40:00 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 | 6 | 30 | 2020 | 1 | 40 | 0 | 2020-06-30 01:40:00 |
2020-06-30 01:39:42 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 | 6 | 30 | 2020 | 1 | 39 | 42 | 2020-06-30 01:39:42 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-29 15:38:19 | #UvA Last week a number of practicals started ... | 2020-06-29 15:38:19 | JJ_Angelus | 6 | 29 | 2020 | 15 | 38 | 19 | 2020-06-29 15:38:19 |
2020-06-29 15:27:34 | Playing through: UVa golfers hone their games ... | 2020-06-29 15:27:34 | hoosdaily | 6 | 29 | 2020 | 15 | 27 | 34 | 2020-06-29 15:27:34 |
2020-06-29 15:27:29 | Playing through: UVa golfers hone their games ... | 2020-06-29 15:27:29 | hoosdaily | 6 | 29 | 2020 | 15 | 27 | 29 | 2020-06-29 15:27:29 |
2020-06-29 15:27:21 | Ramseyer fondly recalls Cavs' perfect regular ... | 2020-06-29 15:27:21 | hoosdaily | 6 | 29 | 2020 | 15 | 27 | 21 | 2020-06-29 15:27:21 |
2020-06-29 15:06:57 | While many college athletes have been unable t... | 2020-06-29 15:06:57 | cavalierinsider | 6 | 29 | 2020 | 15 | 6 | 57 | 2020-06-29 15:06:57 |
71 rows × 10 columns
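These ranges can also be selected without touching the index at all, by filtering the created_at column with ordinary boolean comparisons (pandas compares datetime64 values directly against date strings). For example, the 2pm to 3pm window from above:
tweets[(tweets['created_at'] >= '2020-06-29 14:00') & (tweets['created_at'] <= '2020-06-29 15:00')]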
9.5.5. Leads and Lags#
In time series data in which the rows represent particular points in time in descending chronological order, a lag is a column that contains the values of another column shifted up one cell. That is, if the rows represent days, and a column represents today’s high temperature, the lag represents yesterday’s high temperature. A lead is an existing column with its values shifted one row down: tomorrow’s high temperature.
To create a lead or a lag, apply the .shift()
method to a dataframe. The argument of .shift()
is an integer: positive integers shift values down, creating leads, and negative values shift values up, creating lags. Whenever we shift values up or down, we create new missing values at the top or bottom of the dataframe for the lead or lagged column.
For example, in the tweets
dataframe we can create two new columns, previous_tweet
and next_tweet
to compare with text
by using .shift(-1)
and .shift(1)
:
tweets['previous_tweet'] = tweets.shift(-1).text
tweets['next_tweet'] = tweets.shift(1).text
tweets[['text', 'previous_tweet', 'next_tweet']]
text | previous_tweet | next_tweet | |
---|---|---|---|
created_at | |||
2020-06-30 02:20:43 | @KymoraJohnson_ let me call @IamTinaThompson s... | Congrats Eric. It’s was really great to have w... | NaN |
2020-06-30 02:18:17 | Congrats Eric. It’s was really great to have w... | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | @KymoraJohnson_ let me call @IamTinaThompson s... |
2020-06-30 01:55:02 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | Former #UVA stars Quin Blanding, Micah Kiser n... | Congrats Eric. It’s was really great to have w... |
2020-06-30 01:40:00 | Former #UVA stars Quin Blanding, Micah Kiser n... | Former #UVA stars Quin Blanding, Micah Kiser n... | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... |
2020-06-30 01:39:42 | Former #UVA stars Quin Blanding, Micah Kiser n... | RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... | Former #UVA stars Quin Blanding, Micah Kiser n... |
... | ... | ... | ... |
2020-06-23 15:12:08 | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | WATCH: Top plays of UVA's 2019-20 basketball s... |
2020-06-23 15:08:40 | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | Zur Einstimmung #UVA vom letzten #gig bei uns ... | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... |
2020-06-23 14:52:23 | Zur Einstimmung #UVA vom letzten #gig bei uns ... | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... |
2020-06-23 14:49:22 | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | Good question... My instinct says Ty Jerome, b... | Zur Einstimmung #UVA vom letzten #gig bei uns ... |
2020-06-23 14:32:44 | Good question... My instinct says Ty Jerome, b... | NaN | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... |
1000 rows × 3 columns
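Shifting works on any column, not just text. For example (the column name gap below is just for this illustration), because the rows run from most recent to oldest, shifting created_at up one cell gives the time of the previous tweet, and subtracting the two yields the time elapsed between consecutive tweets:
tweets['gap'] = tweets['created_at'] - tweets['created_at'].shift(-1)   # time since the previous tweet
tweets[['created_at', 'gap']].head()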