9. Reshaping and Merging Data and Working with Strings, Dates, and Times#
9.1. Introduction: pandas or SQL?#
As we saw in module 7, read operations in SQL not only extract data from a database, but they can select and rename columns, filter and sort rows, join tables, and aggregate data. With pandas, we can also select and rename columns, filter and sort rows, join tables, and aggregate data. So the question is, should we be using SQL or pandas to perform all these data manipulation steps?
SQL is sometimes referred to as server-side data manipulation, because databases are often stored on remote servers and SQL queries are processed on the server instead of on a local machine. Data manipulation that is conducted using pandas on a local Python installation is called client-side data manipulation.
The question of SQL vs. pandas (or SQL vs. the tidyverse in R) is the subject of a lot of debate among professionals who use data. The question comes up frequently on various coding forums. Some blog posts have tried to compare SQL and pandas in terms of the speed with which they complete equivalent operations, with differing results. Tina Wenzel and Kavya Gupta note that many factors influence the relative speed of SQL and pandas operations, including the configuration of a PostgreSQL database and the bandwidth available in a network connection. They take the mixed evidence into account and conclude that
SQL is the best tool to use for basic data retrieval and the better you know it, the more it will speed up your workflow. More advanced data manipulation tasks can be performed both on the server and client side and it is up to the analyst to balance the workload optimally.
In short, the existing evidence does not overwhelmingly support one option or the other as best practice for data manipulation. The reality is that both SQL and pandas are extremely useful and widely used, and both tools will become part of your workflow. It will take some time and experience with end-to-end projects in data science to learn how to balance SQL and pandas in a way that is most comfortable for you in your workflow. But at this stage, there are some important points to keep in mind when thinking about these tools.
First, there are many situations in which the question of SQL vs. pandas might be moot. For a given project, our data might not come from a database, but instead from a CSV file, from a JSON document acquired via an API, or from raw HTML that we scraped from a webpage. So in order to use SQL, we would have to take the additional step of creating a database. If we hold ourselves to the principles that E. F. Codd and others laid out about the organization of relational databases, it can take a significant amount of work to create this database. If there is no database involved for a specific data project, there is no role for SQL, but because pandas works on dataframes it can be a primary tool for data manipulation regardless of the original source for the data.
Second, while there are differences in speed, these differences only become a significant factor for projects that involve big data, and in those cases, we will almost certainly be working with data stored on remote servers. If the data are organized using a database, then SQL may well be faster than pandas, but it very much depends on how the database is configured and on the myriad factors that determine the speed with which code runs through a remote connection. pandas can also be used in scripts or Jupyter notebooks that are run on remote servers. Sometimes it makes sense to pull selections of remotely stored data into a local environment so that we can manipulate the data without having to seek permission from a database manager, and in that case, a mix of SQL to choose a selection and pandas to manipulate the local data can be very effective.
Third, pandas simply has more functionality and flexibility than SQL. For example, it is fairly straightforward to reshape (e.g. pivot) data using pandas, and it is much more difficult to reshape data in SQL. pandas has better functionality for working with strings and time and date features than SQL, and pandas, being part of a Python environment, works well with any other Python package that works with dataframes or arrays. In contrast, while various DBMSs add specific additions to the base SQL language, SQL extensions tend to be fairly limited because of the inherent tension between expanding the functionality of a query language and staying close enough to base SQL to still be considered SQL. The easiest way to bring SQL query results into a Python environment uses an sqlalchemy engine and the pd.read_sql_query() function from pandas.
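To make that concrete, here is a minimal sketch of that workflow; the connection string, table name, and column names below are hypothetical placeholders, not part of this module's examples:

import pandas as pd
import sqlalchemy

# Hypothetical connection string; substitute your own database credentials
engine = sqlalchemy.create_engine("postgresql://user:password@localhost:5432/mydb")

# Let SQL do the server-side selection, then hand the result to pandas as a dataframe
query = "SELECT state, year, income FROM state_income WHERE year >= 1980"
income_df = pd.read_sql_query(query, con=engine)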
Fourth, both pandas and SQL enable us to work as part of larger teams that share tools, but we might choose or be required to use one tool over the other to work better as part of the team. pandas is essential for projects in which the whole team works within Python. SQL is readable and usable for people who do not use Python but still work with databases.
Fifth, and most importantly, the choice of SQL vs. pandas should be made based on how comfortable we feel with each tool. Both SQL and pandas can perform data manipulation correctly, and it will probably be the case that we can remember the code for one tool better than for the other. Since data manipulation is the work that comprises 80% of our time as data scientists, we should use whichever tool lets us do that work more quickly.
For now, we need to practice both pandas and SQL and get comfortable with these tools, and we need to be flexible in the future as different situations will call for SQL or pandas.
9.2. Joining Dataframes#
Joining dataframes is also called merging dataframes, and I will use the words “join” and “merge” interchangeably below. Every kind of merge that is possible in SQL is possible in pandas: inner, outer, left, right, natural, cross, and anti-joins. One important difference between joining in SQL and merging in pandas is that dataframes might not be as carefully curated as data tables in a relational database, and we should not assume that the join worked properly even if there is no error. Primary keys might not uniquely identify rows in general, and joins might turn into cross joins. Some values of keys that should match might not match due to differences in coding and spelling. And if we attempt a natural join, the wrong columns might be automatically selected as the keys.
Too many discussions of merging data provide only clean examples. Here I will show you how merging works when the data contain some common problems that may invalidate the merge. In the following discussion, we will cover how to perform merges in pandas, and also how to check that the merge worked in the way we want it to.
First, for the following examples, we load these packages:
import numpy as np
import pandas as pd
9.2.1. Example: Merging Data on U.S. State Election Results, Income, Economies, and Area#
The “state_elections.csv” file contains information about the result of presidential elections by state for every presidential election between 1964 and 2004. In this dataset, the primary keys are State and year:
elect = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_elections.csv")
elect.head()
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | |
---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 |
“state_income.txt” contains various summary statistics (percentiles and means, broken down separately for men, for women, and for the overall population) about personal incomes within each state. In this dataset, the primary keys are stcode (the post office’s two-letter abbreviation for states) and year:
income = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_income.txt", sep="\t", header=2) # tab separated (note the header in this file)
income.head()
stcode | year | income | p10income | p20income | p30income | p40income | p50income | p60income | p70income | ... | wom_p90income | wom_meanincome | wom_minincome | wom_maxincome | wom_sdincome | wom_ineqincome | wom_gini_income | wom_n_income | wom_nwgt_income | medincome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AL | 1964 | 89.83488 | 3269.85 | 5731.07 | 7988.87 | 10318.3 | 13189.1 | 16007.3 | 18744.4 | ... | 24233.0 | 11400.0 | 58.13953 | 72674.42 | 1796.89 | 29.53152 | 0.473738 | 175.0 | 418494.1 | 76.68081 |
1 | AL | 1968 | 92.18238 | 4469.95 | 7455.28 | 10050.30 | 12792.1 | 15580.2 | 18392.4 | 21310.7 | ... | 26287.5 | 12005.3 | NaN | NaN | 1968.78 | 27.40979 | 0.485090 | NaN | NaN | 80.72643 |
2 | AL | 1972 | 94.87543 | 6734.11 | 10314.80 | 13336.30 | 16421.5 | 19088.9 | 22141.6 | 25410.7 | ... | 30278.0 | 14667.9 | NaN | NaN | 8604.25 | 17.83577 | 0.457932 | NaN | NaN | 82.27975 |
3 | AL | 1976 | 67.53671 | 7759.81 | 11176.60 | 13823.20 | 16431.8 | 18982.8 | 21923.5 | 25180.8 | ... | 31119.2 | 15362.5 | NaN | NaN | 12842.20 | 15.77765 | 0.454074 | NaN | NaN | 60.07215 |
4 | AL | 1980 | 46.29869 | 7602.15 | 10637.00 | 13532.60 | 16279.6 | 19111.8 | 21802.5 | 25190.6 | ... | 29623.1 | 14804.2 | 28.38428 | 120196.50 | 5944.71 | 14.52640 | 0.436961 | 543.0 | 711278.2 | 41.72882 |
5 rows × 55 columns
“state_economics.txt” contains information about the macro-level economy in each state and year, including overall and per capita gross domestic product, and GDP broken down by industry. In this dataset, the primary keys are fips (a standard numeric code that identifies countries and provinces within countries) and year:
econ = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_economics.txt", sep=";") # semi-colon separated
econ.head()
fips | year | GDP | GDPpc | Private | Agriculture | Farms | Mining | Utilities | Construction | ... | Construction_share | Manufacturing_share | Finance_Insurance_share | Legal_share | Education_share | Health_share | Government_share | Government_federal_share | Government_military_share | Government_statelocal_share | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1964 | 8201 | 4273.580 | 6900 | 312 | 283 | 115 | 235 | 379 | ... | 4.621387 | 28.22827 | 11.41324 | 0.402390 | 0.256066 | 1.902207 | 15.86392 | 6.291915 | 2.402146 | 7.182051 |
1 | 1 | 1968 | 10949 | 5493.728 | 9095 | 294 | 253 | 133 | 294 | 499 | ... | 4.557494 | 28.74235 | 10.67677 | 0.429263 | 0.365330 | 2.228514 | 16.93305 | 5.945748 | 2.630377 | 8.356928 |
2 | 1 | 1972 | 15336 | 4332.105 | 12696 | 481 | 418 | 223 | 435 | 702 | ... | 4.577465 | 26.05634 | 11.48279 | 0.469484 | 0.443401 | 2.595201 | 17.21440 | 5.692488 | 2.679969 | 8.841941 |
3 | 1 | 1976 | 24206 | 6477.035 | 19988 | 781 | 683 | 584 | 714 | 1281 | ... | 5.292076 | 24.85334 | 11.50541 | 0.479220 | 0.359415 | 2.986863 | 17.42543 | 5.787821 | 2.255639 | 9.386103 |
4 | 1 | 1980 | 36006 | 9246.474 | 29567 | 674 | 542 | 1351 | 1228 | 1588 | ... | 4.410376 | 24.60423 | 11.63417 | 0.649892 | 0.352719 | 3.660501 | 17.88313 | 5.782370 | 2.171860 | 9.928901 |
5 rows × 36 columns
“state_area.csv” contains various measures of the total, land, and water areas of each state. In this dataset, the primary key is state:
area = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/state_area.csv")
area.head()
state | area_sqmi | area_sqkm | landarea_sqmi | landarea_sqkm | water_sqmi | water_sqkm | percent_water | |
---|---|---|---|---|---|---|---|---|
0 | Alaska | 663,267.26 | 1,717,854 | 571,951.26 | 1,481,347 | 91,316.00 | 236,507 | 13.77 |
1 | Texas | 268,580.82 | 695,621 | 261,797.12 | 678,051 | 6,783.70 | 17,570 | 2.53 |
2 | California | 163,695.57 | 423,970 | 155,939.52 | 403,882 | 7,736.23 | 20,037 | 4.73 |
3 | Montana | 147,042.40 | 380,838 | 145,552.43 | 376,979 | 1,489.96 | 3,859 | 1.01 |
4 | New Mexico | 121,589.48 | 314,915 | 121,355.53 | 314,309 | 233.96 | 606 | 0.19 |
Because the datasets above contain three ways to identify states - by name, with two-letter abbreviations, and with FIPS codes - we also need a dataset that matches each state by name to its abbreviation and FIPS code. Datasets that exist solely for the purpose of matching values of keys across datasets are called crosswalks:
crosswalk = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/crosswalk.csv")
crosswalk.head()
fips | State | stcode | |
---|---|---|---|
0 | 1 | Alabama | AL |
1 | 2 | Alaska | AK |
2 | 4 | Arizona | AZ |
3 | 5 | Arkansas | AR |
4 | 6 | California | CA |
Our goal is to merge all five of these datasets together.
9.2.2. Using the pd.merge() Function#
When using a crosswalk to merge dataframes together, the first step is to merge a dataframe with the crosswalk. Here we start by joining the elect and crosswalk dataframes together. To join dataframes, we use the pd.merge() function. The following code works:
merged_data = pd.merge(elect, crosswalk, on='State')
merged_data
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY |
561 rows × 10 columns
The merged_data dataframe now contains the features from elect and from crosswalk, which means we’ve added stcode and fips to the elect data, and that enables us to merge merged_data with income and econ.
Before moving on to the rest of the example, we should discuss some properties of the pd.merge() function. Recall from module 7 that there are many different kinds of joins that differ only in how they treat unmatched observations. pd.merge() conducts an inner join by default, leaving only the matched observations in the merged data. To change the type of merge, use the how argument:
how='outer' performs an outer or full join. It brings all rows from both dataframes into the merged data, and fills in NaN for the columns on which an unmatched row has no data.
merged_data = pd.merge(elect, crosswalk, on='State', how='outer')
how='left' performs a left join. It brings all rows from the first dataframe listed into the merged data, and only those rows in the second dataframe that are matched to the first.
merged_data = pd.merge(elect, crosswalk, on='State', how='left')
how='right' performs a right join. It brings all rows from the second dataframe listed into the merged data, and only those rows in the first dataframe that are matched to the second.
merged_data = pd.merge(elect, crosswalk, on='State', how='right')
There are two ways to specify the key or keys on which to match rows. If the key or keys have the same name in both dataframes, use the on argument. If there is one key, pass this name as a string to on like we did above. If there is more than one key, pass these names as a list to on. For example, if we merge on both State and year, we will type on = ['State', 'year'].
If the keys have different names in each dataframe, we use left_on to specify the names of these keys in the first dataframe listed in the pd.merge() function, and right_on to specify the names of these keys in the second dataframe listed. Suppose, for example, that the keys are named “State” and “year” in the first dataframe, and “st” and “yr” in the second dataframe. We would type left_on = ['State', 'year'], right_on = ['st', 'yr'].
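As a quick sketch with two toy dataframes (the dataframes and values here are made up for illustration, not part of this module's data):

left = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                     'year': [2000, 2000],
                     'votes': [10, 20]})
right = pd.DataFrame({'st': ['Alabama', 'Alaska'],
                      'yr': [2000, 2000],
                      'gdp': [1, 2]})

# left_on lists the key names in the first dataframe, right_on lists the
# corresponding key names in the second dataframe, in the same order
merged = pd.merge(left, right, left_on=['State', 'year'], right_on=['st', 'yr'])

Note that when we match keys with left_on and right_on, both versions of the key columns (State/year and st/yr here) appear in the merged result, so we usually drop one set afterwards.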
Depending on the type of join we use, a row can be matched, or can come from just the first or just the second dataframe listed in pd.merge(). To see which rows come from which sources, specify indicator='newname', where 'newname' is a name you choose for a new column that contains this information. For example, the following code repeats the merge we performed above, but uses an outer join instead of an inner join and adds a column matched that tells us how each row is matched:
merged_data = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched')
merged_data
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | matched | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL | both |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL | both |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL | both |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL | both |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY | both |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY | both |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY | both |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY | both |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY | both |
561 rows × 11 columns
9.2.3. Checking for Problems That Can Occur Without Errors#
Whenever we apply pd.merge(), there’s a great risk that the merge was incorrect even if the function runs without error. There are two ways that a merge can fail.
Problem 1: what we thought was a unique ID was not a unique ID.
If we merge two dataframes on a column (or set of columns) that fails to uniquely identify the rows in either dataframe, then pd.merge() returns all pairwise combinations of rows with matching IDs. To see this, consider the following two example dataframes:
dict1 = {'id': ['A', 'A', 'B', 'B'],
'data1': [150, 200, 50, 25]}
df1 = pd.DataFrame.from_dict(dict1)
df1
id | data1 | |
---|---|---|
0 | A | 150 |
1 | A | 200 |
2 | B | 50 |
3 | B | 25 |
dict2 = {'id': ['A', 'A', 'B', 'B'],
'data2': [-20, -75, -125, -250]}
df2 = pd.DataFrame.from_dict(dict2)
df2
id | data2 | |
---|---|---|
0 | A | -20 |
1 | A | -75 |
2 | B | -125 |
3 | B | -250 |
Note that the two dataframes share a column “id”, but that “id” is not a unique identifier in either dataframe. If I try to merge the two based on “id”, the result is:
df_merge = pd.merge(df1, df2, on='id')
df_merge
id | data1 | data2 | |
---|---|---|---|
0 | A | 150 | -20 |
1 | A | 150 | -75 |
2 | A | 200 | -20 |
3 | A | 200 | -75 |
4 | B | 50 | -125 |
5 | B | 50 | -250 |
6 | B | 25 | -125 |
7 | B | 25 | -250 |
Because we did not have a unique ID in either dataframe, we artificially increased the number of rows from 4 to 8. At first glance, the merged data looks nice, but because it has double the number of rows we are supposed to have, the data is in fact corrupted, and we cannot use it for any subsequent analysis. Also, note that the merge ran without any errors. If we aren’t careful, we might move forward with data that cannot be used.
The above example is an instance of a many-to-many merge, in which rows in the left dataframe can match to more than one row in the right dataframe, and rows in the right dataframe can match to more than one row in the left dataframe. In these cases pandas performs a cross join within the subsets of the two dataframes that share the same key, inflating the number of rows. But a many-to-many join is almost never what we want to do with data, and if it happens, it is usually a mistake.
The other types of merges are
one-to-one, in which every row in the left dataframe matches to at most one row in the right dataframe, and every row in the right dataframe matches to at most one row in the left dataframe,
many-to-one, in which rows in the right dataframe can match to more than one row in the left dataframe, but every row in the left dataframe matches to at most one row in the right dataframe,
and one-to-many, in which rows in the left dataframe can match to more than one row in the right dataframe, but every row in the right dataframe matches to at most one row in the left dataframe.
The best way to prevent this mistake is to first think about whether the two dataframes should constitute a one-to-one, many-to-one, or one-to-many merge. It is important to have a clear expectation. Then use the validate argument to have pandas automatically check whether this expectation is met (a short sketch of this check appears after the list below). If not, the code will generate an error, which will bring our attention to problems in the data we might not have caught otherwise. If we write
validate = 'one_to_one', then pandas checks to see whether the merge keys are unique in both dataframes,
validate = 'many_to_one', then pandas checks to see whether the merge keys are unique in the right dataframe,
validate = 'one_to_many', then pandas checks to see whether the merge keys are unique in the left dataframe.
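As a minimal sketch of what this check buys us, we can rerun the toy many-to-many merge from above (df1 and df2, in which id is duplicated) while asserting that it should be one-to-one; instead of silently inflating the rows, pandas raises a MergeError:

# df1 and df2 are the toy dataframes defined above, where 'id' is not unique
try:
    pd.merge(df1, df2, on='id', validate='one_to_one')
except pd.errors.MergeError as e:
    print("Merge failed validation:", e)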
There are many reasons why our expectations might be incorrect. In messy, real-world data, mistakes like duplicated keys often make it past review processes and exist in the data without any clue of it in the codebook. The validate argument is an important check for these kinds of issues. For example, we expect that merging elect and crosswalk will be a many-to-one merge, so we confirm that the merge is many-to-one as follows:
merged_data = pd.merge(elect, crosswalk, on='State', how='outer',
indicator='matched', validate='many_to_one')
merged_data
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | matched | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL | both |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL | both |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL | both |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL | both |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY | both |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY | both |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY | both |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY | both |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY | both |
561 rows × 11 columns
Problem 2: rows should be matched, but aren’t.
This problem is especially common when matching on strings, such as names. With countries, it’s possible for one dataframe to write “USA” and the other to write “United States”. You will have to go back into a dataframe to edit individual ID cells so that they match. But to identify the cells that need to be edited, we have to construct a dataframe of the rows that were unmatched in one or the other dataframe. The easiest way to see this is to use an anti-join: a merge that keeps only the rows in the left dataframe that have no match in the right dataframe. As in SQL, there is no anti-join function built into pandas, but there are ways to program our own anti-join. The easiest way is to perform an outer join with the indicator='matched' argument, then to filter the data to those rows for which matched is not equal to “both”.
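As a sketch of this recipe in general form, here is a small anti-join helper built around pd.merge() (the name anti_join is our own, not a pandas function), demonstrated on the toy df1 and df2 dataframes from earlier:

def anti_join(left, right, on):
    # Outer join with an indicator column, keep only the rows that appear
    # in the left dataframe alone, and return just the left columns
    both = pd.merge(left, right, on=on, how='outer', indicator=True)
    return both.query("_merge == 'left_only'")[left.columns]

anti_join(df1, df2, on='id')  # empty here, because every 'id' value matches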
For this step, I recommend creating a second, working version of the data, merge_test, to contain the first attempt at the merge. If there are issues, it is easier to fix the problems and recreate the working dataframe than to restart the kernel. If there are no problems, then we can assign the merged_data dataframe to be equal to the working version.
merge_test = pd.merge(elect, crosswalk, on='State', how='outer', indicator='matched', validate='many_to_one')
merge_test.query("matched!='both'")
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | matched |
---|
There are no rows in crosswalk with no match to elect, and there are no rows in elect without a match in crosswalk. But if there were unmatched rows, there are two reasons why some rows may have been unmatched: rows can be unmatched due to differences in coverage, and they might be unmatched due to differences in coding or spelling. pandas cannot tell the difference between these two reasons why rows might be unmatched, and an inner join will drop unmatched rows regardless of the reason. Using an outer join with indicator='matched' is best practice for seeing the unmatched rows and thinking about whether any rows that should be matched are for some reason unmatched. We can use the category relabeling functionality in pandas to fix issues of misspelled and miscoded keys, as we will do in the examples in the next section.
After performing this check, if we intend to perform another merge, we will either need to choose a new name for the indicator column, or we can drop the matched column so that we can reuse this name later. We set merged_data to the working dataframe merge_test without the matched column:
merged_data = merge_test.drop('matched', axis=1)
9.2.4. Merging all of the Dataframes Together While Checking for Problems#
The merged_data and income dataframes share keys named stcode and year. Joining these two dataframes should be a one-to-one merge because there should be only one occurrence of the same state and year in each dataframe. We use how='outer' and indicator='matched' (since we deleted the previous matched column) to enable us to filter the result to the unmatched rows, and we use validate='one_to_one' to catch anomalies that violate our expectation that this is a one-to-one merge:
merge_test = pd.merge(merged_data, income, on=['stcode','year'], how='outer', indicator='matched', validate='one_to_one')
merge_test.query("matched!='both'")
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | wom_meanincome | wom_minincome | wom_maxincome | wom_sdincome | wom_ineqincome | wom_gini_income | wom_n_income | wom_nwgt_income | medincome | matched |
---|
0 rows × 64 columns
There are no unmatched rows, so there are no issues with miscoded or misspelled ID values. The merged data now contain information on both state election results and state incomes:
merge_test
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | wom_meanincome | wom_minincome | wom_maxincome | wom_sdincome | wom_ineqincome | wom_gini_income | wom_n_income | wom_nwgt_income | medincome | matched | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL | ... | 11400.0 | 58.139530 | 72674.42 | 1796.89 | 29.53152 | 0.473738 | 175.0 | 418494.1 | 76.68081 | both |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL | ... | 12005.3 | NaN | NaN | 1968.78 | 27.40979 | 0.485090 | NaN | NaN | 80.72643 | both |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL | ... | 14667.9 | NaN | NaN | 8604.25 | 17.83577 | 0.457932 | NaN | NaN | 82.27975 | both |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL | ... | 15362.5 | NaN | NaN | 12842.20 | 15.77765 | 0.454074 | NaN | NaN | 60.07215 | both |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL | ... | 14804.2 | 28.384280 | 120196.50 | 5944.71 | 14.52640 | 0.436961 | 543.0 | 711278.2 | 41.72882 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY | ... | 17188.0 | 75.987840 | 77507.60 | 10402.40 | 23.54822 | 0.478940 | 318.0 | 110956.1 | 36.63389 | both |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY | ... | 17934.1 | 92.307700 | 96391.03 | 12512.80 | 21.14388 | 0.469024 | 360.0 | 117688.6 | 30.10564 | both |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY | ... | 18796.4 | 29.816510 | 122420.90 | 14517.60 | 22.74368 | 0.461693 | 413.0 | 125222.3 | 26.93395 | both |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY | ... | 20463.4 | 88.819230 | 333171.40 | 16609.80 | 15.30497 | 0.447124 | 440.0 | 128753.9 | 26.05026 | both |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY | ... | 20225.1 | 9.587728 | 206166.80 | 18311.90 | 11.33138 | 0.432011 | 692.0 | 123164.1 | 24.74094 | both |
561 rows × 64 columns
Since there are no problems, we set merged_data equal to merge_test. Because we intend to merge the data with another dataframe, we drop the matched column.
merged_data = merge_test.drop('matched', axis=1)
Next we merge this result with the econ dataframe. These two dataframes share fips and year as keys, and should also be a one-to-one merge. We check to see whether there are unmatched rows:
merge_test = pd.merge(merged_data, econ, on=['fips','year'], how='outer', indicator='matched', validate='one_to_one')
merge_test.query("matched!='both'")
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | Manufacturing_share | Finance_Insurance_share | Legal_share | Education_share | Health_share | Government_share | Government_federal_share | Government_military_share | Government_statelocal_share | matched |
---|
0 rows × 98 columns
Again, there are no unmatched rows, so there are no issues with miscoded or misspelled ID values. The merged data now contain information on state election results, state incomes, and state economies:
merge_test
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | Manufacturing_share | Finance_Insurance_share | Legal_share | Education_share | Health_share | Government_share | Government_federal_share | Government_military_share | Government_statelocal_share | matched | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964 | 210732 | 479085 | 0 | 30.54897 | 69.45103 | 1919000 | 1 | AL | ... | 28.228270 | 11.413240 | 0.402390 | 0.256066 | 1.902207 | 15.86392 | 6.291915 | 2.402146 | 7.182051 | both |
1 | Alabama | 1968 | 196579 | 146923 | 691425 | 18.99448 | 81.00552 | 1993000 | 1 | AL | ... | 28.742350 | 10.676770 | 0.429263 | 0.365330 | 2.228514 | 16.93305 | 5.945748 | 2.630377 | 8.356928 | both |
2 | Alabama | 1972 | 256923 | 728701 | 0 | 26.06704 | 73.93296 | 3540080 | 1 | AL | ... | 26.056340 | 11.482790 | 0.469484 | 0.443401 | 2.595201 | 17.21440 | 5.692488 | 2.679969 | 8.841941 | both |
3 | Alabama | 1976 | 659170 | 504070 | 0 | 56.66673 | 43.33327 | 3737204 | 1 | AL | ... | 24.853340 | 11.505410 | 0.479220 | 0.359415 | 2.986863 | 17.42543 | 5.787821 | 2.255639 | 9.386103 | both |
4 | Alabama | 1980 | 636730 | 654192 | 0 | 49.32366 | 50.67634 | 3894025 | 1 | AL | ... | 24.604230 | 11.634170 | 0.649892 | 0.352719 | 3.660501 | 17.88313 | 5.782370 | 2.171860 | 9.928901 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988 | 67113 | 106867 | 0 | 38.57512 | 61.42488 | 465080 | 56 | WY | ... | 4.380078 | 7.788767 | 0.688802 | 0.123631 | 2.207700 | 14.51784 | 2.622748 | 1.660191 | 10.234900 | both |
557 | Wyoming | 1992 | 68160 | 79347 | 0 | 46.20798 | 53.79202 | 466251 | 56 | WY | ... | 4.303816 | 7.670391 | 0.704806 | 0.179951 | 2.579291 | 14.99588 | 2.811727 | 1.552073 | 10.632080 | both |
558 | Wyoming | 1996 | 77934 | 105388 | 0 | 42.51208 | 57.48792 | 488167 | 56 | WY | ... | 6.146708 | 9.280448 | 0.705568 | 0.152555 | 2.866768 | 14.23849 | 2.657005 | 1.544622 | 10.030510 | both |
559 | Wyoming | 2000 | 60481 | 147947 | 0 | 29.01769 | 70.98231 | 493782 | 56 | WY | ... | 6.012348 | 3.110034 | 0.646241 | 0.173100 | 4.269805 | 14.95586 | 2.729213 | 1.540592 | 10.686050 | both |
560 | Wyoming | 2004 | 70776 | 167629 | 0 | 29.68730 | 70.31271 | 506529 | 56 | WY | ... | 3.795901 | 2.813834 | 0.589240 | 0.222032 | 4.419300 | 14.49616 | 2.519214 | 1.648164 | 10.328780 | both |
561 rows × 98 columns
Again, we set merged_data equal to merge_test, dropping the matched column so that we can reuse this name in the next join:
merged_data = merge_test.drop('matched', axis=1)
Finally we merge the data with the area dataframe. This merge matches State in merged_data to state in area: this subtle difference is easy to miss and will result in an error if we fail to notice the lowercase s and write on='State'. This merge should be a many-to-one merge, as there are many rows for each state in merged_data (for many years), but only one row for each state in area.
merge_test = pd.merge(merged_data, area, how="outer",
left_on='State',right_on='state',
indicator='matched',
validate='many_to_one')
merge_test.query("matched!='both'")[['State', 'state',
'year','matched']]
State | state | year | matched | |
---|---|---|---|---|
77 | D. C. | NaN | 1964.0 | left_only |
78 | D. C. | NaN | 1968.0 | left_only |
79 | D. C. | NaN | 1972.0 | left_only |
80 | D. C. | NaN | 1976.0 | left_only |
81 | D. C. | NaN | 1980.0 | left_only |
82 | D. C. | NaN | 1984.0 | left_only |
83 | D. C. | NaN | 1988.0 | left_only |
84 | D. C. | NaN | 1992.0 | left_only |
85 | D. C. | NaN | 1996.0 | left_only |
86 | D. C. | NaN | 2000.0 | left_only |
87 | D. C. | NaN | 2004.0 | left_only |
561 | NaN | District of Columbia | NaN | right_only |
562 | NaN | Puerto Rico | NaN | right_only |
563 | NaN | Northern Mariana Islands | NaN | right_only |
564 | NaN | United States Virgin Islands | NaN | right_only |
565 | NaN | American Samoa | NaN | right_only |
566 | NaN | Guam | NaN | right_only |
567 | NaN | United States Minor Outlying Islands | NaN | right_only |
In this case, one of our validation checks failed: we have unmatched rows. Some of these rows are due to differences in coverage: we will not bring Puerto Rico, the Northern Mariana Islands, the United States Virgin Islands, American Samoa, Guam, or the United States Minor Outlying Islands into the merged data as these territories do not receive votes in the electoral college. We will however bring the District of Columbia into the merged data as D.C. does get 3 votes in the electoral college. D.C. is unmatched due to spelling discrepancies, and we must correct this mistake before we can proceed. The easiest way to fix the error is to replace values of “District of Columbia” in the area dataframe with “D. C.”, as it is written in the merged data. We can use the .replace() method (because we only want to replace one state category and .replace() leaves unlisted categories alone, as opposed to .map() which would replace unlisted categories with missing values), as we discussed in module 8:
area.state = area.state.replace({'District of Columbia':'D. C.'})
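To see the difference in behavior that makes .replace() the right choice here, consider a small hypothetical series (not part of the example data):

s = pd.Series(['District of Columbia', 'Virginia', 'Maryland'])

s.replace({'District of Columbia': 'D. C.'})  # -> ['D. C.', 'Virginia', 'Maryland']
s.map({'District of Columbia': 'D. C.'})      # -> ['D. C.', NaN, NaN]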
Now if we repeat the test, we will only see the rows that are unmatched due to differences in coverage:
merge_test = pd.merge(merged_data, area, left_on='State', how="outer", right_on='state', indicator='matched', validate='many_to_one')
merge_test.query("matched!='both'")[['State', 'state','year','matched']]
State | state | year | matched | |
---|---|---|---|---|
561 | NaN | Puerto Rico | NaN | right_only |
562 | NaN | Northern Mariana Islands | NaN | right_only |
563 | NaN | United States Virgin Islands | NaN | right_only |
564 | NaN | American Samoa | NaN | right_only |
565 | NaN | Guam | NaN | right_only |
566 | NaN | United States Minor Outlying Islands | NaN | right_only |
We replace the merged_data dataframe with merge_test, keeping only the rows for which matched is equal to “both”, then dropping the matched column:
merged_data = merge_test.query("matched=='both'").drop('matched', axis=1)
Now merged_data contains all of the data we need, including state election results, income, economic data, and area:
merged_data
State | year | demvote | repvote | wallacevote | dempercent | reppercent | population | fips | stcode | ... | Government_military_share | Government_statelocal_share | state | area_sqmi | area_sqkm | landarea_sqmi | landarea_sqkm | water_sqmi | water_sqkm | percent_water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1964.0 | 210732.0 | 479085.0 | 0.0 | 30.54897 | 69.45103 | 1919000.0 | 1.0 | AL | ... | 2.402146 | 7.182051 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
1 | Alabama | 1968.0 | 196579.0 | 146923.0 | 691425.0 | 18.99448 | 81.00552 | 1993000.0 | 1.0 | AL | ... | 2.630377 | 8.356928 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
2 | Alabama | 1972.0 | 256923.0 | 728701.0 | 0.0 | 26.06704 | 73.93296 | 3540080.0 | 1.0 | AL | ... | 2.679969 | 8.841941 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
3 | Alabama | 1976.0 | 659170.0 | 504070.0 | 0.0 | 56.66673 | 43.33327 | 3737204.0 | 1.0 | AL | ... | 2.255639 | 9.386103 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
4 | Alabama | 1980.0 | 636730.0 | 654192.0 | 0.0 | 49.32366 | 50.67634 | 3894025.0 | 1.0 | AL | ... | 2.171860 | 9.928901 | Alabama | 52,419.02 | 135,765 | 50,744.00 | 131,426 | 1,675.01 | 4,338 | 3.20 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | Wyoming | 1988.0 | 67113.0 | 106867.0 | 0.0 | 38.57512 | 61.42488 | 465080.0 | 56.0 | WY | ... | 1.660191 | 10.234900 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
557 | Wyoming | 1992.0 | 68160.0 | 79347.0 | 0.0 | 46.20798 | 53.79202 | 466251.0 | 56.0 | WY | ... | 1.552073 | 10.632080 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
558 | Wyoming | 1996.0 | 77934.0 | 105388.0 | 0.0 | 42.51208 | 57.48792 | 488167.0 | 56.0 | WY | ... | 1.544622 | 10.030510 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
559 | Wyoming | 2000.0 | 60481.0 | 147947.0 | 0.0 | 29.01769 | 70.98231 | 493782.0 | 56.0 | WY | ... | 1.540592 | 10.686050 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
560 | Wyoming | 2004.0 | 70776.0 | 167629.0 | 0.0 | 29.68730 | 70.31271 | 506529.0 | 56.0 | WY | ... | 1.648164 | 10.328780 | Wyoming | 97,813.56 | 253,336 | 97,100.40 | 251,489 | 713.16 | 1,847 | 0.73 |
561 rows × 105 columns
9.3. Reshaping Data#
Reshaping a dataframe refers to turning the rows of a dataframe into columns (also known as a long-to-wide reshape), or turning columns into rows (also known as a wide-to-long reshape), in a way that cannot be accomplished by simply transposing the data matrix. A long-to-wide reshape is also called pivoting, and a wide-to-long reshape is also called melting. Reshaping is a notoriously tricky skill to master, and it helps a lot to understand every part of the function we need, so that we’re not blindly trying every combination of parameters hoping one of them works.
9.3.1. Example: Gross domestic product by state, 1997-2015#
To demonstrate the method for pivoting a dataframe, we use example data from the U.S. Bureau of Economic Analysis (www.bea.gov) on economic output for the American states. We load the data, skipping the last four rows as these rows contain attribution information and notes, and are not part of the data. We also use the .str.strip() method to remove all the leading and trailing whitespace in the Description column:
gsp = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv", skipfooter=4)
gsp['Description'] = gsp['Description'].str.strip()
gsp
/var/folders/4w/k9sqqcbx4dxgpjtwgv_1m29h0000gq/T/ipykernel_83160/793934051.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
gsp = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gsp_naics_all.csv", skipfooter=4)
GeoFIPS | GeoName | Region | ComponentId | ComponentName | IndustryId | IndustryClassification | Description | 1997 | 1998 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 8542530 | 9024434 | ... | 13773226 | 14391149 | 14626598 | 14320114 | 14859772 | 15406002 | 16041243 | 16576808 | 17277548 | 17919651 |
1 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 7459395 | 7894015 | ... | 12045446 | 12572387 | 12716179 | 12352979 | 12826507 | 13348439 | 13957545 | 14468465 | 15115846 | 15698669 |
2 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 3 | 11 | Agriculture, forestry, fishing, and hunting | 108796 | 99940 | ... | 128345 | 141999 | 154525 | 137655 | 160217 | 197241 | 185800 | 221821 | 203188 | 175236 |
3 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 4 | 111-112 | Farms | 88136 | 79030 | ... | 99352 | 113533 | 126345 | 109800 | 129725 | 166249 | 151489 | 186960 | 166249 | (NA) |
4 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 5 | 113-115 | Forestry, fishing, and related activities | 20660 | 20910 | ... | 28993 | 28466 | 28180 | 27855 | 30492 | 30992 | 34311 | 34861 | 36939 | (NA) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43287 | 94000 | Plains | 4 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 39071 | 39970 | ... | 46013 | 46501 | 46794 | 45399 | 46210 | 46921 | 47629 | 48126 | 48778 | 49153 |
43288 | 95000 | Southeast | 5 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 36722 | 37896 | ... | 43586 | 43028 | 42092 | 40346 | 40675 | 40471 | 40415 | 40487 | 40802 | 41352 |
43289 | 96000 | Southwest | 6 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38887 | 40283 | ... | 45743 | 46684 | 45914 | 44324 | 44491 | 45344 | 46881 | 48064 | 49327 | 50519 |
43290 | 97000 | Rocky Mountain | 7 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38243 | 40390 | ... | 46756 | 47587 | 47019 | 45170 | 44937 | 45060 | 45014 | 45564 | 46508 | 47093 |
43291 | 98000 | Far West | 8 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 41901 | 43467 | ... | 53932 | 54469 | 53881 | 51328 | 51609 | 51857 | 52429 | 52942 | 54077 | 55429 |
43292 rows × 27 columns
For this example, we only want to save three features besides the primary keys:
state GDP,
state GDP per capita,
state GDP from private industries
We only want the 50 states, no regions, territories, or districts (sorry DC!). Notice that the features we want are contained in rows, and the years are contained in columns. We need to switch that: the features must be in columns, and years must be stored in the rows. We will need to perform both a wide-to-long and a long-to-wide reshape. In addition, we will give the columns concise but descriptive names, and we will create a new column that contains the percent of GDP from private industry in each state and each year.
First we need to reduce the data to only the features we need: state GDP, state GDP per capita, and state GDP from private industries. Right now, however, the features are not stored in the columns, but in the rows. Specifically, the data we need are contained in the rows in which Description is “All industry total” or “Private industries” and ComponentName is “Gross domestic product (GDP) by state” or “Per capita real GDP by state”. We use .query() to isolate these rows:
gsp_clean = gsp.query(
"Description in ['All industry total', 'Private industries']"
).query(
"ComponentName in ['Gross domestic product (GDP) by state','Per capita real GDP by state']")
gsp_clean
GeoFIPS | GeoName | Region | ComponentId | ComponentName | IndustryId | IndustryClassification | Description | 1997 | 1998 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 8542530 | 9024434 | ... | 13773226 | 14391149 | 14626598 | 14320114 | 14859772 | 15406002 | 16041243 | 16576808 | 17277548 | 17919651 |
1 | 00000 | United States | NaN | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 7459395 | 7894015 | ... | 12045446 | 12572387 | 12716179 | 12352979 | 12826507 | 13348439 | 13957545 | 14468465 | 15115846 | 15698669 |
90 | 01000 | Alabama | 5 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 104218 | 109414 | ... | 164468 | 169923 | 172646 | 168315 | 174710 | 180665 | 185878 | 190095 | 194421 | 199656 |
91 | 01000 | Alabama | 5 | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 87014 | 91506 | ... | 137954 | 141306 | 142965 | 137143 | 142773 | 148181 | 153494 | 157961 | 161364 | 166243 |
180 | 02000 | Alaska | 8 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 25446 | 24030 | ... | 44679 | 49197 | 55461 | 50463 | 54134 | 58759 | 60890 | 59762 | 58253 | 52747 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43287 | 94000 | Plains | 4 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 39071 | 39970 | ... | 46013 | 46501 | 46794 | 45399 | 46210 | 46921 | 47629 | 48126 | 48778 | 49153 |
43288 | 95000 | Southeast | 5 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 36722 | 37896 | ... | 43586 | 43028 | 42092 | 40346 | 40675 | 40471 | 40415 | 40487 | 40802 | 41352 |
43289 | 96000 | Southwest | 6 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38887 | 40283 | ... | 45743 | 46684 | 45914 | 44324 | 44491 | 45344 | 46881 | 48064 | 49327 | 50519 |
43290 | 97000 | Rocky Mountain | 7 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38243 | 40390 | ... | 46756 | 47587 | 47019 | 45170 | 44937 | 45060 | 45014 | 45564 | 46508 | 47093 |
43291 | 98000 | Far West | 8 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 41901 | 43467 | ... | 53932 | 54469 | 53881 | 51328 | 51609 | 51857 | 52429 | 52942 | 54077 | 55429 |
180 rows × 27 columns
Next, we only want the 50 states, no regions, territories, or districts, so we issue another .query() to remove the non-state rows. To see these non-states, we call up a list of the unique values of GeoName:
gsp_clean['GeoName'].unique()
array(['United States', 'Alabama', 'Alaska', 'Arizona', 'Arkansas',
'California', 'Colorado', 'Connecticut', 'Delaware',
'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming', 'New England', 'Mideast',
'Great Lakes', 'Plains', 'Southeast', 'Southwest',
'Rocky Mountain', 'Far West'], dtype=object)
We need to remove the rows in which GeoName is “United States”, “District of Columbia”, “New England”, “Mideast”, “Great Lakes”, “Plains”, “Southeast”, “Southwest”, “Rocky Mountain”, or “Far West”:
gsp_clean = gsp_clean.query("GeoName not in ['United States', 'District of Columbia', 'New England', 'Mideast', 'Great Lakes', 'Plains', 'Southeast', 'Southwest', 'Rocky Mountain', 'Far West']")
gsp_clean
GeoFIPS | GeoName | Region | ComponentId | ComponentName | IndustryId | IndustryClassification | Description | 1997 | 1998 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
90 | 01000 | Alabama | 5 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 104218 | 109414 | ... | 164468 | 169923 | 172646 | 168315 | 174710 | 180665 | 185878 | 190095 | 194421 | 199656 |
91 | 01000 | Alabama | 5 | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 87014 | 91506 | ... | 137954 | 141306 | 142965 | 137143 | 142773 | 148181 | 153494 | 157961 | 161364 | 166243 |
180 | 02000 | Alaska | 8 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 25446 | 24030 | ... | 44679 | 49197 | 55461 | 50463 | 54134 | 58759 | 60890 | 59762 | 58253 | 52747 |
181 | 02000 | Alaska | 8 | 200 | Gross domestic product (GDP) by state | 2 | ... | Private industries | 20284 | 18776 | ... | 36737 | 40919 | 46889 | 41387 | 44742 | 48851 | 50592 | 49764 | 47879 | 42246 |
270 | 04000 | Arizona | 6 | 200 | Gross domestic product (GDP) by state | 1 | ... | All industry total | 132708 | 143768 | ... | 248459 | 262045 | 256718 | 242509 | 245668 | 254192 | 264693 | 270642 | 280166 | 290903 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43279 | 51000 | Virginia | 5 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 43069 | 44745 | ... | 52866 | 52657 | 51985 | 51389 | 51945 | 51749 | 51538 | 51105 | 50861 | 51540 |
43280 | 53000 | Washington | 8 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 45753 | 47712 | ... | 52900 | 55348 | 54939 | 52264 | 52681 | 52495 | 53423 | 53987 | 54773 | 55577 |
43281 | 54000 | West Virginia | 5 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 30445 | 30978 | ... | 34009 | 33892 | 34679 | 34564 | 35368 | 36085 | 35515 | 35778 | 36234 | 36817 |
43282 | 55000 | Wisconsin | 3 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 38663 | 39695 | ... | 45515 | 45464 | 44622 | 43215 | 44126 | 44905 | 45380 | 45582 | 46469 | 46893 |
43283 | 56000 | Wyoming | 7 | 1000 | Per capita real GDP by state | 1 | ... | All industry total | 46585 | 47652 | ... | 63428 | 65471 | 69182 | 66320 | 64602 | 63981 | 60744 | 60743 | 61639 | 61389 |
150 rows × 27 columns
9.3.2. Melting: Turning Columns to Rows#
Notice that many of the columns refer to particular years. We want to move these columns to the rows, so that each state has one row for each year covered in the data. The best function to use when we need to turn columns into rows is pd.melt():
pd.melt(df, id_vars, value_vars)
This function takes three parameters:
The name of the dataframe you wish to edit
id_vars: A list of the names of the columns that we want to continue to exist as columns after the dataframe is melted. In our example, this list will contain both IDs that currently uniquely identify the rows (GeoName, ComponentName, Description), synonyms for the IDs (GeoFIPS) that we want to keep, and features that vary across ID but not across the columns (Region).
value_vars: A list of the names of the columns we want to store in the rows. If the columns have numeric names and can be sequentially ordered, as is the case in our example with years, a shortcut for creating a list of these column names is [str(i) for i in range(1997,2015)], which loops over the integers from 1997 through 2014, placing each year as a string in the list.
Any column that is not listed under either id_vars or value_vars will be deleted in the melted data.
For example, we melt the gsp_clean dataframe with the following code:
gsp_clean = pd.melt(gsp_clean, id_vars = ['GeoName', 'GeoFIPS', 'Region',
'ComponentName', 'Description'],
value_vars = [str(i) for i in range(1997,2015)])
gsp_clean
GeoName | GeoFIPS | Region | ComponentName | Description | variable | value | |
---|---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | Gross domestic product (GDP) by state | All industry total | 1997 | 104218 |
1 | Alabama | 01000 | 5 | Gross domestic product (GDP) by state | Private industries | 1997 | 87014 |
2 | Alaska | 02000 | 8 | Gross domestic product (GDP) by state | All industry total | 1997 | 25446 |
3 | Alaska | 02000 | 8 | Gross domestic product (GDP) by state | Private industries | 1997 | 20284 |
4 | Arizona | 04000 | 6 | Gross domestic product (GDP) by state | All industry total | 1997 | 132708 |
... | ... | ... | ... | ... | ... | ... | ... |
2695 | Virginia | 51000 | 5 | Per capita real GDP by state | All industry total | 2014 | 50861 |
2696 | Washington | 53000 | 8 | Per capita real GDP by state | All industry total | 2014 | 54773 |
2697 | West Virginia | 54000 | 5 | Per capita real GDP by state | All industry total | 2014 | 36234 |
2698 | Wisconsin | 55000 | 3 | Per capita real GDP by state | All industry total | 2014 | 46469 |
2699 | Wyoming | 56000 | 7 | Per capita real GDP by state | All industry total | 2014 | 61639 |
2700 rows × 7 columns
Note that the old column names (1997 through 2014 in this case) will always be stored in a column named “variable” after melting, and the datapoints that populated those columns will be contained in a column named “value”.
Before moving on, let’s rename some of the columns to avoid confusion in the next step:
gsp_clean = gsp_clean.rename({'GeoName':'State',
'GeoFIPS':'FIPS',
'variable':'Year'}, axis=1)
gsp_clean
State | FIPS | Region | ComponentName | Description | Year | value | |
---|---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | Gross domestic product (GDP) by state | All industry total | 1997 | 104218 |
1 | Alabama | 01000 | 5 | Gross domestic product (GDP) by state | Private industries | 1997 | 87014 |
2 | Alaska | 02000 | 8 | Gross domestic product (GDP) by state | All industry total | 1997 | 25446 |
3 | Alaska | 02000 | 8 | Gross domestic product (GDP) by state | Private industries | 1997 | 20284 |
4 | Arizona | 04000 | 6 | Gross domestic product (GDP) by state | All industry total | 1997 | 132708 |
... | ... | ... | ... | ... | ... | ... | ... |
2695 | Virginia | 51000 | 5 | Per capita real GDP by state | All industry total | 2014 | 50861 |
2696 | Washington | 53000 | 8 | Per capita real GDP by state | All industry total | 2014 | 54773 |
2697 | West Virginia | 54000 | 5 | Per capita real GDP by state | All industry total | 2014 | 36234 |
2698 | Wisconsin | 55000 | 3 | Per capita real GDP by state | All industry total | 2014 | 46469 |
2699 | Wyoming | 56000 | 7 | Per capita real GDP by state | All industry total | 2014 | 61639 |
2700 rows × 7 columns
9.3.3. Pivoting: Turning Rows to Columns#
The rows contain features that we want to move to the columns. Specifically, we need separate columns for the different combinations of ComponentName and Description. Because the feature names are spread across these two columns, we can combine them by concatenating them together, creating a new column called “feature”. Then we can drop ComponentName and Description:
gsp_clean = gsp_clean.assign(feature = gsp_clean['ComponentName'] + gsp_clean['Description'])
gsp_clean = gsp_clean.drop(['ComponentName', 'Description'], axis=1)
gsp_clean
State | FIPS | Region | Year | value | feature | |
---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | 1997 | 104218 | Gross domestic product (GDP) by stateAll indus... |
1 | Alabama | 01000 | 5 | 1997 | 87014 | Gross domestic product (GDP) by statePrivate i... |
2 | Alaska | 02000 | 8 | 1997 | 25446 | Gross domestic product (GDP) by stateAll indus... |
3 | Alaska | 02000 | 8 | 1997 | 20284 | Gross domestic product (GDP) by statePrivate i... |
4 | Arizona | 04000 | 6 | 1997 | 132708 | Gross domestic product (GDP) by stateAll indus... |
... | ... | ... | ... | ... | ... | ... |
2695 | Virginia | 51000 | 5 | 2014 | 50861 | Per capita real GDP by stateAll industry total |
2696 | Washington | 53000 | 8 | 2014 | 54773 | Per capita real GDP by stateAll industry total |
2697 | West Virginia | 54000 | 5 | 2014 | 36234 | Per capita real GDP by stateAll industry total |
2698 | Wisconsin | 55000 | 3 | 2014 | 46469 | Per capita real GDP by stateAll industry total |
2699 | Wyoming | 56000 | 7 | 2014 | 61639 | Per capita real GDP by stateAll industry total |
2700 rows × 6 columns
To move these features from the rows to columns, apply the .pivot_table()
method to a dataframe:
df.pivot_table(index, columns, values)
To understand this method, pay close attention to its parameters:
- index - a list containing the names of the current columns that we want to remain columns in the reshaped data. These include the primary keys (State and Year in this example), synonyms for a key (FIPS), and features that vary only by ID (Region).
- columns - the name of the column that contains the names of the new columns we are trying to create.
- values - the name of the column that contains the datapoints we are trying to move to the new columns.
There is one important issue to keep in mind when using the .pivot_table()
method. .pivot_table()
contains default behavior to handle cases in which the keys in index
do not uniquely identify the rows of a dataframe. By default, .pivot_table()
takes the mean within group. If the column specified within values
is non-numeric, however, the mean is not defined. So first make sure that any columns that are numeric are recognized as a numeric data type. If a column contains text, categories, or other non-numeric data, specify aggfunc='first'
inside .pivot_table()
. Writing aggfunc='first'
tells .pivot_table()
to use the first cell within a group instead of calculating the mean within a group, and the first cell is defined for non-numeric columns. If the columns in index
do uniquely identify rows, there is only one cell per group anyway, so it is fine to take the first cell.
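To see what the default does, here is a toy illustration with made-up values: when the index columns do not uniquely identify the rows, .pivot_table() averages the duplicates.
toy = pd.DataFrame({'id': ['a', 'a', 'b'],
                    'feature': ['x', 'x', 'x'],
                    'value': [1, 3, 5]})
toy.pivot_table(index='id', columns='feature', values='value')
# id 'a' appears twice, so its values are averaged: (1 + 3) / 2 = 2.0
# id 'b' appears once, so it keeps its single value: 5.0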
Presently within gsp_clean
the datapoints are contained within the value
column. Unfortunately, value
is not currently understood to be numeric:
gsp_clean.dtypes
State object
FIPS object
Region object
Year object
value object
feature object
dtype: object
Following our code from module 8, we convert value to a numeric data type:
gsp_clean.value = gsp_clean.value.astype('float')
gsp_clean.dtypes
State object
FIPS object
Region object
Year object
value float64
feature object
dtype: object
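As an aside, if a column like this contained stray non-numeric entries, .astype('float') would raise an error. One option in that situation (not needed here, since every entry parses cleanly) is pd.to_numeric() with errors='coerce', which converts anything it cannot parse into a missing value:
gsp_clean['value'] = pd.to_numeric(gsp_clean['value'], errors='coerce')   # unparseable entries become NaN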
Now we can reshape the data with .pivot_table()
:
gsp_clean['value'] = gsp_clean['value'].astype(int) # first convert this column to int
gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'],
columns='feature',
values='value')
gsp_clean
feature | Gross domestic product (GDP) by stateAll industry total | Gross domestic product (GDP) by statePrivate industries | Per capita real GDP by stateAll industry total | |||
---|---|---|---|---|---|---|
State | FIPS | Region | Year | |||
Alabama | 01000 | 5 | 1997 | 104218 | 87014 | 31398 |
1998 | 109414 | 91506 | 32164 | |||
1999 | 115015 | 96284 | 33106 | |||
2000 | 119020 | 99665 | 33284 | |||
2001 | 122822 | 102978 | 33312 | |||
... | ... | ... | ... | ... | ... | ... |
Wyoming | 56000 | 7 | 2010 | 39103 | 33832 | 64602 |
2011 | 41499 | 36164 | 63981 | |||
2012 | 40201 | 34604 | 60744 | |||
2013 | 40979 | 35096 | 60743 | |||
2014 | 42021 | 35947 | 61639 |
900 rows × 3 columns
Alternatively, if we choose to leave the value column as an object data type, we can use this code instead to perform the long-to-wide reshape:
gsp_clean = gsp_clean.pivot_table(index=['State','FIPS', 'Region','Year'],
columns='feature',
values='value',
aggfunc='first')
Note that the dataframe is not formatted in a flat way, because the IDs and related features are stored in the row index instead of in columns. To convert the dataframe back to a standard format, use the .to_records() method within the pd.DataFrame() function:
gsp_clean = pd.DataFrame(gsp_clean.to_records())
gsp_clean
State | FIPS | Region | Year | Gross domestic product (GDP) by stateAll industry total | Gross domestic product (GDP) by statePrivate industries | Per capita real GDP by stateAll industry total | |
---|---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | 1997 | 104218 | 87014 | 31398 |
1 | Alabama | 01000 | 5 | 1998 | 109414 | 91506 | 32164 |
2 | Alabama | 01000 | 5 | 1999 | 115015 | 96284 | 33106 |
3 | Alabama | 01000 | 5 | 2000 | 119020 | 99665 | 33284 |
4 | Alabama | 01000 | 5 | 2001 | 122822 | 102978 | 33312 |
... | ... | ... | ... | ... | ... | ... | ... |
895 | Wyoming | 56000 | 7 | 2010 | 39103 | 33832 | 64602 |
896 | Wyoming | 56000 | 7 | 2011 | 41499 | 36164 | 63981 |
897 | Wyoming | 56000 | 7 | 2012 | 40201 | 34604 | 60744 |
898 | Wyoming | 56000 | 7 | 2013 | 40979 | 35096 | 60743 |
899 | Wyoming | 56000 | 7 | 2014 | 42021 | 35947 | 61639 |
900 rows × 7 columns
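An alternative that accomplishes the same thing is the .reset_index() method, which moves the row-index levels back into ordinary columns:
# applied to the pivoted dataframe (the one with State, FIPS, Region, and Year in the row index),
# this is an alternative to pd.DataFrame(gsp_clean.to_records()):
gsp_clean = gsp_clean.reset_index()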
All that’s left to clean the data is to rename the columns and to create a new column containing the percent of the state’s GDP that comes from private industry:
gsp_clean = gsp_clean.rename({'Gross domestic product (GDP) by stateAll industry total':'GDP',
'Gross domestic product (GDP) by statePrivate industries':'GDPprivate',
'Per capita real GDP by stateAll industry total':'GDPpc'},
axis=1)
gsp_clean = gsp_clean.assign(percent_private = 100* gsp_clean.GDPprivate / gsp_clean.GDP)
gsp_clean
State | FIPS | Region | Year | GDP | GDPprivate | GDPpc | percent_private | |
---|---|---|---|---|---|---|---|---|
0 | Alabama | 01000 | 5 | 1997 | 104218 | 87014 | 31398 | 83.492295 |
1 | Alabama | 01000 | 5 | 1998 | 109414 | 91506 | 32164 | 83.632808 |
2 | Alabama | 01000 | 5 | 1999 | 115015 | 96284 | 33106 | 83.714298 |
3 | Alabama | 01000 | 5 | 2000 | 119020 | 99665 | 33284 | 83.738027 |
4 | Alabama | 01000 | 5 | 2001 | 122822 | 102978 | 33312 | 83.843285 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
895 | Wyoming | 56000 | 7 | 2010 | 39103 | 33832 | 64602 | 86.520216 |
896 | Wyoming | 56000 | 7 | 2011 | 41499 | 36164 | 63981 | 87.144269 |
897 | Wyoming | 56000 | 7 | 2012 | 40201 | 34604 | 60744 | 86.077461 |
898 | Wyoming | 56000 | 7 | 2013 | 40979 | 35096 | 60743 | 85.643866 |
899 | Wyoming | 56000 | 7 | 2014 | 42021 | 35947 | 61639 | 85.545323 |
900 rows × 8 columns
9.4. Working with Strings (Example: the 2019 ANES Pilot Study)#
In module 8, we worked with the 2019 American National Election Study pilot survey and saved the cleaned data in a separate CSV file:
anes = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/anes_clean.csv")
anes
caseid | liveurban | vote16 | protest | vote | most_important_issue | confecon | ideology | partyID | universal_income | ... | partisanship | ftbiden_level | age | age2 | ftbiden_float | ftbiden_cat | ftbiden_str | prefersbiden | worried_econ | favor_both | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Suburb | Someone else | False | Joe Biden | Health Care | A little worried | Conservative | Democrat | Favor a moderate amount | ... | 5.0 | neutral | 51 | 2601 | 52.0 | 52.0 | 52.0 | True | False | True |
1 | 2 | Suburb | Donald Trump | False | Donald Trump | Working together | A little worried | Conservative | Republican | Oppose a moderate amount | ... | 0.0 | neutral | 78 | 6084 | 41.0 | 41.0 | 41.0 | False | False | False |
2 | 3 | Rural | Hillary Clinton | False | Joe Biden | health care | Extremely worried | Moderate | Democrat | Neither favor nor oppose | ... | 88.0 | like | 66 | 4356 | 88.0 | 88.0 | 88.0 | True | True | False |
3 | 4 | City | Hillary Clinton | False | Donald Trump | The economy. | A little worried | Moderate | Democrat | Neither favor nor oppose | ... | -100.0 | dislike | 41 | 1681 | 0.0 | 0.0 | 0.0 | False | False | False |
4 | 5 | City | Donald Trump | False | Donald Trump | China | Not at all worried | Conservative | Republican | Oppose a great deal | ... | -69.0 | dislike | 80 | 6400 | 25.0 | 25.0 | 25.0 | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3160 | 3161 | Town | Donald Trump | False | Donald Trump | The infiltration of Marxists into the institut... | A little worried | Conservative | Republican | Oppose a great deal | ... | -74.0 | dislike | 72 | 5184 | 7.0 | 7.0 | 7.0 | False | False | False |
3161 | 3162 | City | Someone else | False | Someone else | Lack of basic resources being provided and off... | Extremely worried | NaN | Democrat | Favor a great deal | ... | -10.0 | dislike | 24 | 576 | 25.0 | 25.0 | 25.0 | False | True | True |
3162 | 3163 | City | Did not vote | False | Probably will not vote | donald trump | Very worried | Liberal | Independent | Oppose a great deal | ... | 44.0 | neutral | 40 | 1600 | 50.0 | 50.0 | 50.0 | True | False | False |
3163 | 3164 | Suburb | Did not vote | False | Joe Biden | Donald Trump | Moderately worried | Liberal | Democrat | Favor a moderate amount | ... | 94.0 | like | 60 | 3600 | 95.0 | 95.0 | 95.0 | True | True | False |
3164 | 3165 | Town | Hillary Clinton | False | Joe Biden | trump | Extremely worried | Moderate | Democrat | Oppose a great deal | ... | 70.0 | neutral | 60 | 3600 | 70.0 | 70.0 | 70.0 | True | True | False |
3165 rows × 62 columns
The data contain a column most_important_issue
which records individuals’ responses to the question of what they feel is the most important issue facing the United States as of December 2019. These responses are in the respondents’ own words. For example, one response is
anes.most_important_issue[5]
'The influence of big money on our political system both domestic and foreign'
These responses are strings, and we can employ the suite of string operations to manipulate the data in this column. String methods are available through the .str accessor of a pandas column, so we call .str before the method name in order to use them.
One important technique is to alter the case within the text so that we can identify words without worrying about case sensitivity. To turn all responses to lowercase, use .str.lower()
:
anes.most_important_issue.str.lower()
0 health care
1 working together
2 health care
3 the economy.
4 china
...
3160 the infiltration of marxists into the institut...
3161 lack of basic resources being provided and off...
3162 donald trump
3163 donald trump
3164 trump
Name: most_important_issue, Length: 3165, dtype: object
And to turn all the text to uppercase, use .str.upper()
:
anes.most_important_issue.str.upper()
0 HEALTH CARE
1 WORKING TOGETHER
2 HEALTH CARE
3 THE ECONOMY.
4 CHINA
...
3160 THE INFILTRATION OF MARXISTS INTO THE INSTITUT...
3161 LACK OF BASIC RESOURCES BEING PROVIDED AND OFF...
3162 DONALD TRUMP
3163 DONALD TRUMP
3164 TRUMP
Name: most_important_issue, Length: 3165, dtype: object
One issue that can inhibit our ability to search through text is the existence of leading and trailing whitespace in the responses. Whitespace can exist, invisibly, for several reasons. It is possible that the data authors included whitespace in order to standardize the number of characters that the column takes up in the data file. To remove both leading and trailing whitespace, use .str.strip():
anes.most_important_issue = anes.most_important_issue.str.strip()
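If we only want to remove whitespace on one side, .str.lstrip() strips leading whitespace and .str.rstrip() strips trailing whitespace:
anes.most_important_issue.str.lstrip()   # remove leading whitespace only
anes.most_important_issue.str.rstrip()   # remove trailing whitespace only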
The .replace() method finds entries that exactly match a given string and replaces them with a different string. For example, we can replace every response that reads “health care” with “hospital stuff”:
anes.most_important_issue.str.lower().replace('health care', 'hospital stuff')
0 hospital stuff
1 working together
2 hospital stuff
3 the economy.
4 china
...
3160 the infiltration of marxists into the institut...
3161 lack of basic resources being provided and off...
3162 donald trump
3163 donald trump
3164 trump
Name: most_important_issue, Length: 3165, dtype: object
If we want to replace pieces of these strings, we can specify the regex=True argument. The following line of code replaces every occurrence of “trump” in the lowercased responses with “twimp”. With regex=True, the method replaces the pattern “trump” anywhere it appears, as opposed to the default regex=False, which only replaces entire entries that are exactly “trump”:
anes.most_important_issue.str.lower().replace('trump', 'twimp', regex=True)
0 health care
1 working together
2 health care
3 the economy.
4 china
...
3160 the infiltration of marxists into the institut...
3161 lack of basic resources being provided and off...
3162 donald twimp
3163 donald twimp
3164 twimp
Name: most_important_issue, Length: 3165, dtype: object
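Note that .replace() as used above operates on each entry as a whole. pandas also provides a .str.replace() method, accessed through .str, that replaces occurrences of a pattern inside each entry; the following is another way to perform the substring replacement shown above:
anes.most_important_issue.str.lower().str.replace('trump', 'twimp', regex=True)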
To create a new column that identifies when particular words are used, use either the .str.match() or .str.contains() method. .str.match() is true if the entry matches the provided pattern starting from its first character,
anes.most_important_issue.str.lower().str.match('trump')
0 False
1 False
2 False
3 False
4 False
...
3160 False
3161 False
3162 False
3163 False
3164 True
Name: most_important_issue, Length: 3165, dtype: object
and .str.contains()
is true if the provided string exists anywhere within the entry.
anes.most_important_issue.str.lower().str.contains('trump')
0 False
1 False
2 False
3 False
4 False
...
3160 False
3161 False
3162 True
3163 True
3164 True
Name: most_important_issue, Length: 3165, dtype: object
These columns of logical values can be placed into the data and used to filter rows, among other purposes. For example, we can see the number of Democrats, Republicans, and independents who use the word “trump” in their statement of the country’s most important problem:
anes['problem_trump'] = anes.most_important_issue.str.lower().str.contains('trump')
pd.DataFrame(anes.groupby(['partyID', 'problem_trump']).size().unstack())
problem_trump | False | True |
---|---|---|
partyID | ||
Democrat | 1029 | 293 |
Independent | 479 | 47 |
Republican | 1121 | 73 |
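The same table can also be produced a bit more directly with the pd.crosstab() function, which cross-tabulates two columns:
pd.crosstab(anes.partyID, anes.problem_trump)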
The .str.replace()
, .str.match()
, and .str.contains()
methods can also accept regular expressions for identifying additional patterns within the string. Regular expressions are beyond our scope in this document, but here is a good discussion of regular expressions in Python.
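As a small taste of what regular expressions can do (the pattern here is just for illustration), the pattern 'health ?care' matches both “health care” and “healthcare”, and na=False treats missing responses as False:
anes.most_important_issue.str.lower().str.contains('health ?care', regex=True, na=False)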
To calculate the length of each response in terms of the number of characters, use .str.len()
. In this case, it is important to make sure the whitespace is removed first with .str.strip()
, otherwise the spaces will count towards the length:
anes.most_important_issue.str.len()
0 11.0
1 16.0
2 11.0
3 12.0
4 5.0
...
3160 254.0
3161 325.0
3162 12.0
3163 12.0
3164 5.0
Name: most_important_issue, Length: 3165, dtype: float64
We can use the length to display, for example, the single longest response in the data. To display whole strings, turn off the display limit by typing pd.options.display.max_colwidth = None. Here I set the limit to 500 characters instead, because the longest response goes somewhat off the rails and I’d rather not display all of it:
anes['length'] = anes.most_important_issue.str.len()
pd.options.display.max_colwidth = 500
anes.sort_values(by = 'length', ascending = False).most_important_issue.head(1)
819 Immigration has been and has become a very big problem. The gangs, drug wars, the crime has climbed in every single town, county, state in this country and I believe it all revolves around too many different races, too many prejudices. Most people I talk to dislike foreigners and most foreigners hate Americans. We ask, why come here then? Because they get financial aid, housing, medical, dental aid and compared to the wars, poverty and the way they lived in their own country, America is heav...
Name: most_important_issue, dtype: object
After all that, we set the string display limit back to the default of 50 characters:
pd.options.display.max_colwidth = 50
In some cases it might be necessary to split entries into different features according to a delimiter or pattern. It doesn’t make sense to split the most_important_issue column in this example, but as an illustration, we can split the responses on periods:
anes.most_important_issue.str.split('.')
0 [Health Care]
1 [Working together]
2 [health care]
3 [The economy, ]
4 [China]
...
3160 [The infiltration of Marxists into the institu...
3161 [Lack of basic resources being provided and of...
3162 [donald trump]
3163 [Donald Trump]
3164 [trump]
Name: most_important_issue, Length: 3165, dtype: object
The expand=True
argument displays these split strings in different columns in a dataframe:
anes.most_important_issue.str.split('.', expand=True)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Health Care | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
1 | Working together | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
2 | health care | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
3 | The economy | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None | |
4 | China | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3160 | The infiltration of Marxists into the institut... | Money seems to be no object to them | Their over promising will lead to eventual chaos | Usually that leads to the over promised event... | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None | |
3161 | Lack of basic resources being provided and off... | This is a first world nation, a nation claimi... | yet we don't have housing or healthcare secur... | That needs to change and it needs to be chang... | A basic standard of living needs to be guaran... | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None | |
3162 | donald trump | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
3163 | Donald Trump | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
3164 | trump | None | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
3165 rows × 29 columns
Writing n=1
only splits on the first occurrence of the delimiter, n=2
splits on the first and second occurrence, and so on:
anes.most_important_issue.str.split('.', expand=True, n=2)
0 | 1 | 2 | |
---|---|---|---|
0 | Health Care | None | None |
1 | Working together | None | None |
2 | health care | None | None |
3 | The economy | None | |
4 | China | None | None |
... | ... | ... | ... |
3160 | The infiltration of Marxists into the institut... | Money seems to be no object to them | Their over promising will lead to eventual ch... |
3161 | Lack of basic resources being provided and off... | This is a first world nation, a nation claimi... | yet we don't have housing or healthcare secur... |
3162 | donald trump | None | None |
3163 | Donald Trump | None | None |
3164 | trump | None | None |
3165 rows × 3 columns
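The pieces produced by .str.split() can themselves be indexed through .str. For example, to keep only the text that comes before the first period in each response:
anes.most_important_issue.str.split('.', n=1).str[0]   # the first piece of each split response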
An individual string can be indexed according to character number. These indices can be applied to every entry in a string column as well. To pull just the first five characters out of each response in most_important_issue
, we can type:
anes.most_important_issue.str[0:5]
0 Healt
1 Worki
2 healt
3 The e
4 China
...
3160 The i
3161 Lack
3162 donal
3163 Donal
3164 trump
Name: most_important_issue, Length: 3165, dtype: object
To pull the characters in positions 6 through 10 (counting from 0), we type:
anes.most_important_issue.str[6:11]
0 Care
1 g tog
2 care
3 onomy
4
...
3160 filtr
3161 f bas
3162 trum
3163 Trum
3164
Name: most_important_issue, Length: 3165, dtype: object
To pull the last four characters, use a negative number to begin the range, and leave the end of the range blank:
anes.most_important_issue.str[-4:]
0 Care
1 ther
2 care
3 omy.
4 hina
...
3160 ers.
3161 eed.
3162 rump
3163 rump
3164 rump
Name: most_important_issue, Length: 3165, dtype: object
9.5. Working with Dates and Times (Example: Twitter)#
In module 4 we discussed Twitter as an example of how to use an API in Python. We save the four credentials we need to access the API (the consumer key, the consumer secret, the access token, and the access token secret) in a .env file, and we load those keys into Python without displaying them in our notebooks by using the dotenv
package:
import dotenv
import os
os.chdir('/Users/jk8sd/Box Sync/Practice and Applications 1 online/Module 9 - Data Managenent in pandas Part 2')
dotenv.load_dotenv()
ConsumerKey = os.getenv('ConsumerKey')
ConsumerSecret = os.getenv('ConsumerSecret')
AccessToken = os.getenv('AccessToken')
AccessTokenSecret = os.getenv('AccessTokenSecret')
Then we use the tweepy
package to work with the Twitter API and we create a Twitter cursor with our credentials:
import tweepy
auth = tweepy.OAuthHandler(ConsumerKey, ConsumerSecret)
auth.set_access_token(AccessToken, AccessTokenSecret)
api = tweepy.API(auth)
The following code extracts 1000 tweets that contain the hashtag “#uva”:
msgs = []
msg =[]
for tweet in tweepy.Cursor(api.search, q='#uva').items(1000):
msg = [tweet.text, tweet.created_at, tweet.user.screen_name]
msg = tuple(msg)
msgs.append(msg)
tweets = pd.DataFrame(msgs, columns = ['text', 'created_at', 'user'])
tweets
text | created_at | user | |
---|---|---|---|
0 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing |
1 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis |
2 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 |
3 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 |
4 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 |
... | ... | ... | ... |
995 | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... | 2020-06-23 15:12:08 | norahlfy |
996 | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | 2020-06-23 15:08:40 | AgNPalabras |
997 | Zur Einstimmung #UVA vom letzten #gig bei uns ... | 2020-06-23 14:52:23 | WeAppU |
998 | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | 2020-06-23 14:49:22 | VirialexViri |
999 | Good question... My instinct says Ty Jerome, b... | 2020-06-23 14:32:44 | The_Superhoo |
1000 rows × 3 columns
9.5.1. Extracting Year, Month, Day, and Time from a Timestamp Column#
Note that the data contain timestamps that include the date and time, to the second, of each tweet. The created_at
column has a datetime64
data type:
tweets.dtypes
text object
created_at datetime64[ns]
user object
dtype: object
The difference between a datetime64
data type and an object
or numeric type is that we have the ability to extract individual elements of time from a datetime64
value. To pull out the year, month, day, hour, minute, and second of the timestamp, we use the following attributes:
[tweets.created_at[0].month, tweets.created_at[0].day,
tweets.created_at[0].year, tweets.created_at[0].hour,
tweets.created_at[0].minute, tweets.created_at[0].second]
[6, 30, 2020, 2, 20, 43]
The easiest way to create new columns with these elements for all the values in the dataframe is to use list comprehensions:
tweets['month'] = [x.month for x in tweets.created_at]
tweets['day'] = [x.day for x in tweets.created_at]
tweets['year'] = [x.year for x in tweets.created_at]
tweets['hour'] = [x.hour for x in tweets.created_at]
tweets['minute'] = [x.minute for x in tweets.created_at]
tweets['second'] = [x.second for x in tweets.created_at]
tweets
text | created_at | user | month | day | year | hour | minute | second | |
---|---|---|---|---|---|---|---|---|---|
0 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing | 6 | 30 | 2020 | 2 | 20 | 43 |
1 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis | 6 | 30 | 2020 | 2 | 18 | 17 |
2 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 | 6 | 30 | 2020 | 1 | 55 | 2 |
3 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 | 6 | 30 | 2020 | 1 | 40 | 0 |
4 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 | 6 | 30 | 2020 | 1 | 39 | 42 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... | 2020-06-23 15:12:08 | norahlfy | 6 | 23 | 2020 | 15 | 12 | 8 |
996 | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | 2020-06-23 15:08:40 | AgNPalabras | 6 | 23 | 2020 | 15 | 8 | 40 |
997 | Zur Einstimmung #UVA vom letzten #gig bei uns ... | 2020-06-23 14:52:23 | WeAppU | 6 | 23 | 2020 | 14 | 52 | 23 |
998 | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | 2020-06-23 14:49:22 | VirialexViri | 6 | 23 | 2020 | 14 | 49 | 22 |
999 | Good question... My instinct says Ty Jerome, b... | 2020-06-23 14:32:44 | The_Superhoo | 6 | 23 | 2020 | 14 | 32 | 44 |
1000 rows × 9 columns
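A vectorized alternative to the list comprehensions above is the .dt accessor, which exposes the same time elements for an entire datetime64 column at once:
tweets['month'] = tweets.created_at.dt.month    # same result as the comprehension above
tweets['hour'] = tweets.created_at.dt.hour      # likewise for day, year, minute, and second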
9.5.2. Generating Timestamps from Separate Month, Day, and Year Columns#
Sometimes raw data will come to us with these time elements already placed in separate columns, so that we have month, day, and year columns but no date column. There are important advantages to placing all this information in one column whose values Python understands as timestamps: timestamps are easier to filter, manipulate, index, and plot. To create one timestamp from separate columns that contain the month, day, and year (and optionally the hour, minute, and second), use the pd.to_datetime() function:
pd.to_datetime(df)
where df
is a subset of the dataframe that contains the separate columns for the time elements. In order for this function to work, these columns must be named year
, month
, and day
, and optionally hour
, minute
, and second
. If these columns are named something else, first use the .rename()
method to change the names to the ones listed above.
To create a timestamp from the individual time element columns, we can type:
tweets['timestamp'] = pd.to_datetime(tweets[['year', 'month', 'day', 'hour', 'minute', 'second']])
tweets
text | created_at | user | month | day | year | hour | minute | second | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|
0 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing | 6 | 30 | 2020 | 2 | 20 | 43 | 2020-06-30 02:20:43 |
1 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis | 6 | 30 | 2020 | 2 | 18 | 17 | 2020-06-30 02:18:17 |
2 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 | 6 | 30 | 2020 | 1 | 55 | 2 | 2020-06-30 01:55:02 |
3 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 | 6 | 30 | 2020 | 1 | 40 | 0 | 2020-06-30 01:40:00 |
4 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 | 6 | 30 | 2020 | 1 | 39 | 42 | 2020-06-30 01:39:42 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... | 2020-06-23 15:12:08 | norahlfy | 6 | 23 | 2020 | 15 | 12 | 8 | 2020-06-23 15:12:08 |
996 | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | 2020-06-23 15:08:40 | AgNPalabras | 6 | 23 | 2020 | 15 | 8 | 40 | 2020-06-23 15:08:40 |
997 | Zur Einstimmung #UVA vom letzten #gig bei uns ... | 2020-06-23 14:52:23 | WeAppU | 6 | 23 | 2020 | 14 | 52 | 23 | 2020-06-23 14:52:23 |
998 | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | 2020-06-23 14:49:22 | VirialexViri | 6 | 23 | 2020 | 14 | 49 | 22 | 2020-06-23 14:49:22 |
999 | Good question... My instinct says Ty Jerome, b... | 2020-06-23 14:32:44 | The_Superhoo | 6 | 23 | 2020 | 14 | 32 | 44 | 2020-06-23 14:32:44 |
1000 rows × 10 columns
Like created_at
, the new timestamp
column is also of the datetime64
data type:
tweets.dtypes
text object
created_at datetime64[ns]
user object
month int64
day int64
year int64
hour int64
minute int64
second int64
timestamp datetime64[ns]
dtype: object
9.5.3. Converting String Columns to Timestamps#
Sometimes information about the date and time of an event in the data will be coded as a string. In that case, to use the time series functionality on the dates in a dataframe, a string column that contains the dates and times will need to be converted to the datetime64
data type. The best way to do that is to use the pd.to_datetime()
function once again. This time, we pass the column that contains the dates in string format to the pd.to_datetime()
function. This function can identify the year, month, day, hour, minute, and second from many different formats.
For example, we can write the date Sunday, June 19, 2016 at 8:00 PM (the date and time of game 7 of the 2016 NBA finals, the greatest moment in sports history) in many different ways such as:
“Sunday, June 19, 2016, 8:00 PM”
“6-19-2016 8PM”
“6/19/16 8 p.m.”
“19/6/16 8pm”
The pd.to_datetime()
function reads all of these formats:
pd.to_datetime("Sunday, June 19, 2016, 8:00 PM")
Timestamp('2016-06-19 20:00:00')
pd.to_datetime("6-19-2016 8PM")
Timestamp('2016-06-19 20:00:00')
pd.to_datetime("6/19/16 8 p.m.")
Timestamp('2016-06-19 20:00:00')
pd.to_datetime("19/6/16 8pm")
Timestamp('2016-06-19 20:00:00')
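If we know exactly how the dates are written, we can remove the guesswork by passing a format argument built from strftime codes (%d, %m, and %y below stand for day, month, and two-digit year):
pd.to_datetime("19/06/16", format="%d/%m/%y")
# Timestamp('2016-06-19 00:00:00')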
The pd.to_datetime()
function can feel like a magic trick because it automatically detects where the different elements of the date and time are located in the string. However, it can go wrong, and it is important to check that the date was correctly understood by pd.to_datetime()
. Suppose for example that the date was June 19, 1945, coded as “6/19/45”. pd.to_datetime()
reads the date as 2045:
pd.to_datetime("6/19/45")
Timestamp('2045-06-19 00:00:00')
The easiest way to fix this problem would be to break the timestamp up into separate columns for year, month, and day, subtract 100 from the years that were placed in the wrong century, and recombine these columns into a new timestamp with pd.to_datetime()
. But the real danger here is assuming the pd.to_datetime()
function worked without confirmation.
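One way to implement that idea without splitting the timestamp into separate columns is sketched below; the cutoff year of 2025 is an arbitrary choice for this example, and the two dates are made up:
dates = pd.to_datetime(pd.Series(["6/19/45", "6/19/99"]))      # parsed as 2045-06-19 and 1999-06-19
too_late = dates.dt.year > 2025                                # years that landed in the wrong century
dates[too_late] = dates[too_late] - pd.DateOffset(years=100)   # move them back by 100 years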
The pd.to_datetime()
function also works on an entire column of dates coded as strings. For example, let’s convert the created_at
column to string:
tweets['created_at'] = tweets.created_at.astype('str')
tweets.dtypes
text object
created_at object
user object
month int64
day int64
year int64
hour int64
minute int64
second int64
timestamp datetime64[ns]
dtype: object
To convert the created_at column back to a timestamp, we can apply pd.to_datetime():
tweets['created_at'] = pd.to_datetime(tweets.created_at)
tweets.dtypes
text object
created_at datetime64[ns]
user object
month int64
day int64
year int64
hour int64
minute int64
second int64
timestamp datetime64[ns]
dtype: object
9.5.4. Filtering Rows Based on Date Ranges#
If every row in the data has a unique timestamp, then one convenient way to use the times to filter the data is to set the index (the row labels) to be equal to the timestamps:
tweets.index = tweets['created_at']
A pandas dataframe with a timestamp index can interpret dates and times written in many formats, and we can pass any of these formats to the index of tweets to select rows. For example, to see all the tweets made on June 30, we can type:
tweets['June 30, 2020']
text | created_at | user | month | day | year | hour | minute | second | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|
created_at | ||||||||||
2020-06-30 02:20:43 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing | 6 | 30 | 2020 | 2 | 20 | 43 | 2020-06-30 02:20:43 |
2020-06-30 02:18:17 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis | 6 | 30 | 2020 | 2 | 18 | 17 | 2020-06-30 02:18:17 |
2020-06-30 01:55:02 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 | 6 | 30 | 2020 | 1 | 55 | 2 | 2020-06-30 01:55:02 |
2020-06-30 01:40:00 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 | 6 | 30 | 2020 | 1 | 40 | 0 | 2020-06-30 01:40:00 |
2020-06-30 01:39:42 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 | 6 | 30 | 2020 | 1 | 39 | 42 | 2020-06-30 01:39:42 |
2020-06-30 01:30:45 | RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... | 2020-06-30 01:30:45 | sumanebot | 6 | 30 | 2020 | 1 | 30 | 45 | 2020-06-30 01:30:45 |
2020-06-30 01:08:08 | We are🎉excited to🗣announce opening for in-pers... | 2020-06-30 01:08:08 | GrandMarc_UVA | 6 | 30 | 2020 | 1 | 8 | 8 | 2020-06-30 01:08:08 |
2020-06-30 00:20:15 | RT @cavalierinsider: On Saturday, The Basketba... | 2020-06-30 00:20:15 | annefutch | 6 | 30 | 2020 | 0 | 20 | 15 | 2020-06-30 00:20:15 |
2020-06-30 00:19:00 | Anthony Harris part of NFL's top safety tandem... | 2020-06-30 00:19:00 | hoosdaily | 6 | 30 | 2020 | 0 | 19 | 0 | 2020-06-30 00:19:00 |
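In more recent versions of pandas, selecting rows this way with a bare string is deprecated in favor of .loc, which supports the same kind of partial-string indexing:
tweets.loc['2020-06-30']   # the same rows, selected through .loc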
If these row labels are sorted in time (here they run from most recent to oldest), then we can extract slices of the data that fall within a time range. We write two dates separated by a colon, and the output contains all rows from the row that matches the first date through the row that matches the last date. Because the rows are listed with the most recent ones first, we write the later timestamp first and the earlier timestamp second. If these exact dates and times do not exist in the data, the syntax still finds the locations where these rows would exist and extracts the rows in between these two locations. Here is code to extract all tweets that were posted between 2pm and 3pm on June 29:
tweets['6/29/2020 15:00':'6/29/2020 14:00']
text | created_at | user | month | day | year | hour | minute | second | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|
created_at | ||||||||||
2020-06-29 14:54:31 | UVa alumni team opts out of 2020 TBT, plans to... | 2020-06-29 14:54:31 | hoosdaily | 6 | 29 | 2020 | 14 | 54 | 31 | 2020-06-29 14:54:31 |
2020-06-29 14:54:26 | Ben Wallace Is a Proud Dad of Three Kids — Mee... | 2020-06-29 14:54:26 | hoosdaily | 6 | 29 | 2020 | 14 | 54 | 26 | 2020-06-29 14:54:26 |
2020-06-29 14:51:16 | Inicia Sonora exportación de uva a Corea del S... | 2020-06-29 14:51:16 | AgroTratos | 6 | 29 | 2020 | 14 | 51 | 16 | 2020-06-29 14:51:16 |
2020-06-29 14:50:46 | RT @Wahoos247: Wake Forest commitment Christia... | 2020-06-29 14:50:46 | FatWhite101 | 6 | 29 | 2020 | 14 | 50 | 46 | 2020-06-29 14:50:46 |
2020-06-29 14:42:28 | RT @Cavs_Corner: Film Room: In the next instal... | 2020-06-29 14:42:28 | oleuva | 6 | 29 | 2020 | 14 | 42 | 28 | 2020-06-29 14:42:28 |
2020-06-29 14:40:56 | RT @cavalierinsider: On Saturday, The Basketba... | 2020-06-29 14:40:56 | John_Shifflett | 6 | 29 | 2020 | 14 | 40 | 56 | 2020-06-29 14:40:56 |
2020-06-29 14:37:11 | Tatil planları yapmaya başladıysan sana harika... | 2020-06-29 14:37:11 | YasinALTINEL | 6 | 29 | 2020 | 14 | 37 | 11 | 2020-06-29 14:37:11 |
2020-06-29 14:35:52 | On Saturday, The Basketball Tournament will be... | 2020-06-29 14:35:52 | cavalierinsider | 6 | 29 | 2020 | 14 | 35 | 52 | 2020-06-29 14:35:52 |
2020-06-29 14:29:17 | Jesse Rutherford of the Nelson County Board of... | 2020-06-29 14:29:17 | JerryMillerNow | 6 | 29 | 2020 | 14 | 29 | 17 | 2020-06-29 14:29:17 |
2020-06-29 14:20:58 | RT @thesabre: NewsLink is updated! #UVA sports... | 2020-06-29 14:20:58 | Slider_Hoos | 6 | 29 | 2020 | 14 | 20 | 58 | 2020-06-29 14:20:58 |
2020-06-29 14:20:44 | RT @Wahoos247: Wake Forest commitment Christia... | 2020-06-29 14:20:44 | agee_brandon | 6 | 29 | 2020 | 14 | 20 | 44 | 2020-06-29 14:20:44 |
2020-06-29 14:18:33 | Ramseyer fondly recalls Cavs' perfect regular ... | 2020-06-29 14:18:33 | hoosdaily | 6 | 29 | 2020 | 14 | 18 | 33 | 2020-06-29 14:18:33 |
2020-06-29 14:15:32 | Ben Wallace Is a Proud Dad of Three Kids — Mee... | 2020-06-29 14:15:32 | hoosdaily | 6 | 29 | 2020 | 14 | 15 | 32 | 2020-06-29 14:15:32 |
2020-06-29 14:11:08 | NewsLink is updated! #UVA sports links all in ... | 2020-06-29 14:11:08 | thesabre | 6 | 29 | 2020 | 14 | 11 | 8 | 2020-06-29 14:11:08 |
To extract all tweets before a point in time, write the timestamp before the colon and write nothing after the colon. To extract all tweets after a point in time, write nothing before the colon and write the timestamp after the colon (remember that the rows run from most recent to oldest). To extract all tweets posted after June 29 at 3pm:
tweets[:'6/29/2020 15:00']
text | created_at | user | month | day | year | hour | minute | second | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|
created_at | ||||||||||
2020-06-30 02:20:43 | @KymoraJohnson_ let me call @IamTinaThompson s... | 2020-06-30 02:20:43 | SweetLickKing | 6 | 30 | 2020 | 2 | 20 | 43 | 2020-06-30 02:20:43 |
2020-06-30 02:18:17 | Congrats Eric. It’s was really great to have w... | 2020-06-30 02:18:17 | Daniel_B_Ennis | 6 | 30 | 2020 | 2 | 18 | 17 | 2020-06-30 02:18:17 |
2020-06-30 01:55:02 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | 2020-06-30 01:55:02 | blimeyonline1 | 6 | 30 | 2020 | 1 | 55 | 2 | 2020-06-30 01:55:02 |
2020-06-30 01:40:00 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:40:00 | Wahoos247 | 6 | 30 | 2020 | 1 | 40 | 0 | 2020-06-30 01:40:00 |
2020-06-30 01:39:42 | Former #UVA stars Quin Blanding, Micah Kiser n... | 2020-06-30 01:39:42 | JamieOakes247 | 6 | 30 | 2020 | 1 | 39 | 42 | 2020-06-30 01:39:42 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-29 15:38:19 | #UvA Last week a number of practicals started ... | 2020-06-29 15:38:19 | JJ_Angelus | 6 | 29 | 2020 | 15 | 38 | 19 | 2020-06-29 15:38:19 |
2020-06-29 15:27:34 | Playing through: UVa golfers hone their games ... | 2020-06-29 15:27:34 | hoosdaily | 6 | 29 | 2020 | 15 | 27 | 34 | 2020-06-29 15:27:34 |
2020-06-29 15:27:29 | Playing through: UVa golfers hone their games ... | 2020-06-29 15:27:29 | hoosdaily | 6 | 29 | 2020 | 15 | 27 | 29 | 2020-06-29 15:27:29 |
2020-06-29 15:27:21 | Ramseyer fondly recalls Cavs' perfect regular ... | 2020-06-29 15:27:21 | hoosdaily | 6 | 29 | 2020 | 15 | 27 | 21 | 2020-06-29 15:27:21 |
2020-06-29 15:06:57 | While many college athletes have been unable t... | 2020-06-29 15:06:57 | cavalierinsider | 6 | 29 | 2020 | 15 | 6 | 57 | 2020-06-29 15:06:57 |
71 rows × 10 columns
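These ranges can also be selected without touching the index at all, by filtering the created_at column with ordinary boolean comparisons (pandas compares datetime64 values directly against date strings). For example, the 2pm to 3pm window from above:
tweets[(tweets['created_at'] >= '2020-06-29 14:00') & (tweets['created_at'] <= '2020-06-29 15:00')]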
9.5.5. Leads and Lags#
In time series data in which the rows represent particular points in time in descending chronological order, a lag is a column that contains the values of another column shifted up one cell. That is, if the rows represent days, and a column represents today’s high temperature, the lag represents yesterday’s high temperature. A lead is an existing column with its values shifted one row down: tomorrow’s high temperature.
To create a lead or a lag, apply the .shift()
method to a dataframe. The argument of .shift()
is an integer: positive integers shift values down, creating leads, and negative values shift values up, creating lags. Whenever we shift values up or down, we create new missing values at the top or bottom of the dataframe for the lead or lagged column.
For example, in the tweets
dataframe we can create two new columns, previous_tweet
and next_tweet
to compare with text
by using .shift(-1)
and .shift(1)
:
tweets['previous_tweet'] = tweets.shift(-1).text
tweets['next_tweet'] = tweets.shift(1).text
tweets[['text', 'previous_tweet', 'next_tweet']]
text | previous_tweet | next_tweet | |
---|---|---|---|
created_at | |||
2020-06-30 02:20:43 | @KymoraJohnson_ let me call @IamTinaThompson s... | Congrats Eric. It’s was really great to have w... | NaN |
2020-06-30 02:18:17 | Congrats Eric. It’s was really great to have w... | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | @KymoraJohnson_ let me call @IamTinaThompson s... |
2020-06-30 01:55:02 | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... | Former #UVA stars Quin Blanding, Micah Kiser n... | Congrats Eric. It’s was really great to have w... |
2020-06-30 01:40:00 | Former #UVA stars Quin Blanding, Micah Kiser n... | Former #UVA stars Quin Blanding, Micah Kiser n... | RT @KathrynsScGifts: "OGX Fade-defying + Orchi... |
2020-06-30 01:39:42 | Former #UVA stars Quin Blanding, Micah Kiser n... | RT @ThinakaranLK: சில இடங்களில் மழை பெய்யும் ச... | Former #UVA stars Quin Blanding, Micah Kiser n... |
... | ... | ... | ... |
2020-06-23 15:12:08 | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | WATCH: Top plays of UVA's 2019-20 basketball s... |
2020-06-23 15:08:40 | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... | Zur Einstimmung #UVA vom letzten #gig bei uns ... | RT @AgNPalabras: #UVA: 54,87. Sube 0,20% diari... |
2020-06-23 14:52:23 | Zur Einstimmung #UVA vom letzten #gig bei uns ... | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | #UVA: 54,87. Sube 0,20% diario, 1,22% en junio... |
2020-06-23 14:49:22 | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... | Good question... My instinct says Ty Jerome, b... | Zur Einstimmung #UVA vom letzten #gig bei uns ... |
2020-06-23 14:32:44 | Good question... My instinct says Ty Jerome, b... | NaN | RT @JLuis_Sommelier: #Mosto: jugo obtenido de ... |
1000 rows × 3 columns
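Shifting works on any column, not just text. For example (the column name gap below is just for this illustration), because the rows run from most recent to oldest, shifting created_at up one cell gives the time of the previous tweet, and subtracting the two yields the time elapsed between consecutive tweets:
tweets['gap'] = tweets['created_at'] - tweets['created_at'].shift(-1)   # time since the previous tweet
tweets[['created_at', 'gap']].head()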