2. Loading Data from Electronic Data Files#
2.1. Introduction: Please don’t bend, fold, spindle or mutilate me#
“‘Do Not Fold, Spindle or Mutilate’: A Cultural History of the Punch Card” by Steven Lubar details the earliest method of data storage and transference. Punch cards were first used by the U.S. government to tabulate the 1890 census. Before World War I, punch cards started being used by the military, the railroads, and insurance companies to store data. In the 1930s, punch cards were used by the New Deal agencies, and by the 1940s punch cards were widespread throughout American society.
Punch cards predated computers, which is to say that data storage and transference predate computers. These early uses of punch cards employed data tabulation machines capable of counting the number of times particular holes had been punched in a series of cards, fed into the machine one at a time. The tabulation machine used for counting the 1890 census was developed by a company that later became IBM.
When punch cards became widely used, they were printed with a universal warning: “don’t bend, fold, spindle or mutilate”, which became a cultural meme that long outlived the use of punch cards. In the 1960s, punch cards and the slogan “don’t bend, fold, spindle or mutilate” became symbols of alienation and technocratic oppression, invoked by campus protests and many countercultural movements:
The depersonalization of the punch-card era found its catch phrase in the words on the cards; its ubiquity gave it instant familiarity. One observer of the period wrote that marijuana, the 60s escape from the rigors of the real world, let you see “the strangeness of real unfolded-unspindled-unmutilated life” (Gitlin 202). “Do not fold, spindle, or mutilate” became shorthand for a whole realm of countercultural experience. The ecological movement of the early 1970s, a child of the 1960s counterculture, picked up on it too: a popular poster for Earth Day 1970 showed a picture of the Earth taken from space with the legend “Do not fold, spindle, or mutilate” (Lubar 1992, p. 48).
Starting in the 1950s it became possible to run more advanced statistical analyses, such as linear regression, using mainframe computers with punch card inputs. Many of my professors in grad school had stories about trekking across campus at 3am, the only time they could get access to the computer, to the building with the giant, noisy computer in the basement. They would carry a stack of punch cards, trying hard not to trip or slip on ice, because dropping the punch cards would mean hours of work to get them back into a specific order. Finally, while inputting the cards, inevitably one would get jammed, which meant alerting the surly technician on call.
Two technological advances together made punchcards obsolete. First, in the late 1970s and through the 1980s, it became possible to store data directly on a computer hard drive or a floppy disk, which led to the creation of electronic data files such as CSV. Second, the internet made it possible to transfer data electronically, which led to APIs, HTML, and remote databases.
Today there are several common ways to access electronic data:
Through individual data files shared on websites, by email, or on hard storage. These files are often in ASCII format, but can be stored in other (sometimes proprietary) formats.
Through an application programming interface (API): a specific system that connects clients to web servers and allows the transfer of data, often formatted in JavaScript Object Notation (JSON).
Through raw HTML that can be converted into tabular data through web-scraping.
Through a local or remote relational database - a collection of many individual datasets - managed using SQL.
For modern data science, it’s important to be very comfortable working with all of these methods of sharing data. We will go over individual data files in this module, and the other methods in the following modules.
In this notebook, I demonstrate how to load CSV files with various messy properties that have to be addressed when loading the data into Python. I also demonstrate how to look at a dataframe once it’s loaded to search for problems, and how to load fixed-width, Excel, SAS, Stata, and SPSS data files. All of the data for these examples are available on this GitHub page.
Before we begin, we load the following libraries:
import numpy as np
import pandas as pd
import os
import sys
import requests # used below to display the raw text of data files
sys.tracebacklimit = 0 # turn off the error tracebacks
2.2. Changing the Working Directory#
Before we go over the functions to load and save data files, it is useful to set the working directory at the start of your script or notebook. This sets the default folder where Python opens and saves files. If all of your files are in the same folder, setting the working directory means you don’t have to write out the paths each time you load or save a file.
To set the working directory:

1. Load the os package.
2. Type the folder’s address into os.chdir("folder").
To check the path Python is currently using as a default, type os.getcwd() into the console. If you want to change the working directory back after you’ve run the relevant code, save your old path as an object at the beginning of your notebook, and change the path back to the old path at the end, like this:
import os
oldpath = os.getcwd()
os.chdir("folder")
#(Your code goes here)
os.chdir(oldpath)
Because I want this notebook to work in anyone’s environment, I will only work with files that are accessible through the web. But make sure to change the working directory in your own notebooks when working with local data files.
2.3. Text-Based Data Files#
When data could first be stored on hard drives and disks, space was very limited. To store data as efficiently as possible, a universal standard for which symbols would be considered valid data was adopted. This standard, which is still in use today, is called ASCII:
ASCII - (pronounced “As-Key”) American Standard Code for Information Interchange:
128 characters are considered to be “legal” in data files.
ASCII files are plain text files. They may be messy, and not immediately ready for use in analyses, but we have methods to load and clean ASCII data.
ASCII files are designed to be as small and as universally portable as possible.
Data points are usually delimited by commas, spaces, or tabs, or might require a data dictionary to read.
The most common ASCII format is comma-separated values (CSV), in which individual data points are separated by commas. A CSV file is a plain text file that can be opened with Notepad on Windows or with TextEdit on Mac. Each row in the data appears on a new row in the text file, but the columns do not necessarily align - instead the commas represent new columns, so that the fourth and fifth commas on a line surround the fifth column in the data, for example. If the column names are included in the data, they will be listed on the first line. If you open a CSV file in a plain text editor, it will look like this:
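For instance, the first few rows and columns of the ANES file used below would look something like this (abridged for illustration):

caseid,turnout12,turnout12b,vote12,percent16
1,1,NA,2,100
2,2,NA,NA,50
3,1,NA,1,100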
On many computers with Microsoft Office installed, CSV files are opened by default in Microsoft Excel. Excel might be a convenient tool for aligning the columns and visualizing the data in a spreadsheet, but please note that the .csv filetype is a universal and lightweight format that can be opened in any programming environment. It is not specific to Microsoft.
In a CSV, the comma is called the delimiter: the character that separates datapoints. Another frequently used delimiter is the tab. Here’s the same data in tab-separated format:
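The same abridged rows would look something like this with tabs as the delimiter:

caseid	turnout12	turnout12b	vote12	percent16
1	1	NA	2	100
2	2	NA	NA	50
3	1	NA	1	100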
In some situations the data are stored in a way that minimizes the memory needed to contain it. In that case, we can do away with the delimiter altogether and instead fix particular positions on each line to represent specific columns. We can save further space by also removing the column names from the file. This format is called a fixed-width data file, and in a plain text editor it looks like this:
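As an illustration, suppose caseid occupies positions 1-4, turnout12 position 5, vote12 position 6, and percent16 positions 7-9 (an invented layout for this example). The same three rows, minus the column names, would look like this:

000112100
00022  50
000311100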
Notice that this file contains no information to help us understand what the columns are or where they begin or end. Fixed-width files always come with a data dictionary that contains this information.
I will show examples of how to work with tab-separated, fixed-width, and other files below. But first, to load a standard CSV file with no issues that need to be fixed, use the pd.read_csv() function. The first argument of pd.read_csv() is filepath_or_buffer, and it can be one of three things:
The full file address and file name of the data file.
Just the file name of the data file if you’ve already set the working directory to the folder where the file exists.
The URL of a data file that’s accessible online.
Once a data file is loaded into Python with pd.read_csv(), it is called a DataFrame.
Here I load public opinion data from the American National Election Study:
url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example.csv"
anes = pd.read_csv(url)
This dataset is messy! As we proceed you will see confusing column names, misleading codes for missing values, strange encodings for categorical features, and other flaws. It is important for us to work with messy data because real-world data is almost always messy.
2.4. Looking at the DataFrame to See if it Loaded Correctly#
One of the biggest mistakes someone can make when loading data into Python (or any programming environment) is to assume that the data loaded correctly just because the code does not give an error. There are lots of things that can still go wrong. The column names might not have been read properly, the dimensions might be wrong, the data might have been read in a way that replaced valid values with missing values, and so on.
There are several functions that are useful for seeing the data to understand whether the data loaded properly. The best strategy is to load data, look at the loaded data to understand any problems, and if necessary, reload the data to fix these problems from the outset.
2.4.1. Viewing the Data as a Spreadsheet#
The best way to visualize the data frame is to type the name of the data frame into a cell in a Jupyter notebook. The output of the cell will be rendered as a spreadsheet, and in that format it will be clear what problems, if any, exist with the data frame. One issue: for large data frames (many rows, many columns, or both), displaying the entire spreadsheet unnecessarily takes up a lot of space. Prior to displaying a spreadsheet, we can set the defaults for the number of rows and columns to be displayed as follows:
pd.set_option('display.max_columns', 10) #display a maximum of 10 columns
pd.set_option('display.max_rows', 10) #display a maximum of 10 rows
Now I can see a manageable slice of the data frame. Here’s what the ANES data look like:
anes
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 2 |
1 | 2.0 | 2 | NaN | NaN | 50 | ... | 2 | 4 | 2.0 | __NA__ | 1 |
2 | 3.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
3 | 4.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 1 |
4 | 5.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 2 | 3.0 | __NA__ | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1195 | 1196.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 1 | 1.0 | __NA__ | 2 |
1196 | 1197.0 | 1 | NaN | 1.0 | 80 | ... | 1 | 2 | 1.0 | __NA__ | 1 |
1197 | 1198.0 | 1 | NaN | 2.0 | 95 | ... | 2 | 6 | NaN | __NA__ | 2 |
1198 | 1199.0 | 1 | NaN | 2.0 | 80 | ... | 2 | 4 | 1.0 | __NA__ | 1 |
1199 | 1200.0 | 2 | NaN | NaN | 0 | ... | 2 | 2 | 2.0 | __NA__ | 1 |
1200 rows × 168 columns
Note the ... cells: they are unobtrusive, but in this case they represent 1190 rows and 158 columns. To see more of the data than are contained in this 10 x 10 table, change the defaults in the code listed above.
To see the first several rows, use the .head() method. A method is different from a regular function in that it is applied specifically to an existing object in Python’s memory, and it is typed as a suffix to a call to that object. For example, to see the first 10 rows of the anes data frame, type:
anes.head(10)
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 2 |
1 | 2.0 | 2 | NaN | NaN | 50 | ... | 2 | 4 | 2.0 | __NA__ | 1 |
2 | 3.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
3 | 4.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 1 |
4 | 5.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 2 | 3.0 | __NA__ | 2 |
5 | 6.0 | 1 | NaN | 3.0 | 100 | ... | 2 | 2 | 1.0 | __NA__ | 2 |
6 | 7.0 | 1 | NaN | 3.0 | 100 | ... | 2 | 2 | 1.0 | __NA__ | 1 |
7 | 8.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
8 | NaN | 1 | NaN | 1.0 | 100 | ... | 1 | 2 | 1.0 | __NA__ | 2 |
9 | 10.0 | 1 | NaN | 2.0 | 100 | ... | 1 | 1 | 1.0 | __NA__ | 2 |
10 rows × 168 columns
To see the last several rows, use the .tail() method in exactly the same way:
anes.tail(10)
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
1190 | 1191.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 5 | 12.0 | __NA__ | 1 |
1191 | 1192.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | 2.0 | __NA__ | 2 |
1192 | 1193.0 | 1 | NaN | 3.0 | 100 | ... | 2 | 5 | 12.0 | __NA__ | 1 |
1193 | 1194.0 | 1 | NaN | 2.0 | 53 | ... | 2 | 2 | 2.0 | __NA__ | 2 |
1194 | 1195.0 | 1 | NaN | 3.0 | 50 | ... | 1 | 3 | 1.0 | __NA__ | 2 |
1195 | 1196.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 1 | 1.0 | __NA__ | 2 |
1196 | 1197.0 | 1 | NaN | 1.0 | 80 | ... | 1 | 2 | 1.0 | __NA__ | 1 |
1197 | 1198.0 | 1 | NaN | 2.0 | 95 | ... | 2 | 6 | NaN | __NA__ | 2 |
1198 | 1199.0 | 1 | NaN | 2.0 | 80 | ... | 2 | 4 | 1.0 | __NA__ | 1 |
1199 | 1200.0 | 2 | NaN | NaN | 0 | ... | 2 | 2 | 2.0 | __NA__ | 1 |
10 rows × 168 columns
2.4.2. Displaying the Data Type of Each Column#
All data that is stored in Python’s memory is assigned a type that controls how various functions operate by default on that data. For example, the .describe() method (which I discuss in the next section) provides means and other summary statistics for numeric data, and frequencies for text data. If you intend to perform mathematical operations on a column in a data frame, such as by including the column in a regression model, it is important to first confirm that the column was read with an appropriate data type. The most common data types are:
int64: integer numbers, stored in 64 bits

float64: numbers with decimals, stored in 64 bits

object: text, or numeric data coded as text

bool: only True or False
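For example, to check the type of a single column, use its .dtype attribute; and pd.to_numeric() converts a text column to numbers. A minimal sketch (errors='coerce' turns unparseable values into NaN):

anes['ftobama'].dtype # the type of one column
pd.to_numeric(anes['pid2d'], errors='coerce') # object column coerced to numbers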
We can use the .dtypes attribute to display the data types of every column. Because this information is stored in a table, we need to turn off the default limit on the number of rows displayed to see all of the columns’ data types:
pd.set_option('display.max_rows', None) #to see all the variables
anes.dtypes
caseid float64
turnout12 int64
turnout12b float64
vote12 float64
percent16 int64
meet int64
givefut int64
info int64
march int64
sign int64
give12mo int64
compromise int64
ftobama float64
ftblack float64
ftwhite int64
fthisp float64
ftgay float64
ftjeb float64
fttrump float64
ftcarson float64
fthrc float64
ftrubio float64
ftcruz float64
ftsanders float64
ftfiorina float64
ftpolice float64
ftfem float64
fttrans float64
ftmuslim float64
ftsci float64
reg int64
demcand int64
repcand float64
vote16jb float64
vote16bc float64
vote16tc float64
vote16mr float64
vote16dt float64
presjob int64
lazyb int64
lazyw int64
lazyh int64
lazym int64
violentb int64
violentw int64
violenth int64
violentm int64
econnow int64
econ12mo int64
pid1d float64
pid2d object
pid1r float64
pid2r object
pidstr float64
pidlean float64
lcself int64
lcd int64
lcr int64
lchc int64
lcbo int64
lcdt int64
lcmr int64
lctc int64
srv_spend int64
campfin int64
immig_legal int64
immig_numb int64
equalpay int64
parleave int64
crimespend int64
death int64
terror_worry float64
terror_12mo float64
terror_local int64
relig_bc int64
relig_bcstr float64
relig_srv int64
relig_srvstr float64
incgap20 int64
isis_troops int64
syrians_a float64
syrians_b float64
pc_a float64
pc_b float64
minwage int64
healthspend int64
childcare int64
getahead int64
ladder int64
finwell int64
warm int64
warmbad int64
warmcause int64
warmdo int64
freetrade int64
stopwhite int64
stopblack int64
forcewhite int64
forceblack int64
stop_12mo float64
arrested_12mo float64
charged_12mo float64
jailed_12mo float64
convict_12mo float64
famstop_12mo float64
stop_ever float64
arrested_ever float64
charged_ever float64
jailed_ever float64
convict_ever float64
famstop_ever float64
pk_deficit int64
pk_sen float64
pk_spend int64
birthright_a float64
birthright_b float64
femoff_jobs float64
femoff_ed float64
femoff_spend float64
femoff_issues float64
lpres_pleased int64
lpres_immig float64
lpres_la float64
vaccine int64
autism int64
bo_muslim int64
bo_confid float64
amer_ident int64
race_ident int64
whitework float64
whitejob float64
wguilt1 float64
wguilt2 float64
wguilt3 float64
buycott int64
boycott int64
skintone_mob int64
skintone float64
skin_discrim int64
africanam10_1 float64
white10_1 float64
hispanic10_1 float64
asianam10_1 float64
nativeam10_1 int64
other10_1 float64
other10_open object
birthyr int64
gender int64
race int64
race_other object
educ int64
marstat int64
speakspanish int64
employ float64
employ_t object
faminc float64
faminc2 int64
state float64
votereg int64
pid3 int64
pid7 float64
ideo5 int64
newsint int64
pew_bornagain int64
pew_churatd int64
religpew float64
religpew_t object
ever_vs_12mo_rand int64
dtype: object
Before moving on, we set the default number of rows displayed back to 10:
pd.set_option('display.max_rows', 10)
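To confirm a display setting without changing it, use pd.get_option(); and note that pd.reset_option('display.max_rows') would restore the pandas default (60 rows) rather than the 10 we chose:

pd.get_option('display.max_rows') # returns 10, the value we just set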
2.4.3. Displaying Column Names#
It can be very useful to display the column names in one readable list so that we can work with the variables/features contained in these columns. But as with data frames, we again need to contend with Python’s default display settings. To turn off the display limit for items in a list, type:
pd.set_option('display.max_seq_items', None)
To display the column names, use the .columns attribute. An attribute is like a method in that it is attached to an existing object in Python’s memory, but while a method is a type of function, an attribute is another object. In this case, the column names are contained in a list-type object:
anes.columns
Index(['caseid', 'turnout12', 'turnout12b', 'vote12', 'percent16', 'meet',
'givefut', 'info', 'march', 'sign', 'give12mo', 'compromise', 'ftobama',
'ftblack', 'ftwhite', 'fthisp', 'ftgay', 'ftjeb', 'fttrump', 'ftcarson',
'fthrc', 'ftrubio', 'ftcruz', 'ftsanders', 'ftfiorina', 'ftpolice',
'ftfem', 'fttrans', 'ftmuslim', 'ftsci', 'reg', 'demcand', 'repcand',
'vote16jb', 'vote16bc', 'vote16tc', 'vote16mr', 'vote16dt', 'presjob',
'lazyb', 'lazyw', 'lazyh', 'lazym', 'violentb', 'violentw', 'violenth',
'violentm', 'econnow', 'econ12mo', 'pid1d', 'pid2d', 'pid1r', 'pid2r',
'pidstr', 'pidlean', 'lcself', 'lcd', 'lcr', 'lchc', 'lcbo', 'lcdt',
'lcmr', 'lctc', 'srv_spend', 'campfin', 'immig_legal', 'immig_numb',
'equalpay', 'parleave', 'crimespend', 'death', 'terror_worry',
'terror_12mo', 'terror_local', 'relig_bc', 'relig_bcstr', 'relig_srv',
'relig_srvstr', 'incgap20', 'isis_troops', 'syrians_a', 'syrians_b',
'pc_a', 'pc_b', 'minwage', 'healthspend', 'childcare', 'getahead',
'ladder', 'finwell', 'warm', 'warmbad', 'warmcause', 'warmdo',
'freetrade', 'stopwhite', 'stopblack', 'forcewhite', 'forceblack',
'stop_12mo', 'arrested_12mo', 'charged_12mo', 'jailed_12mo',
'convict_12mo', 'famstop_12mo', 'stop_ever', 'arrested_ever',
'charged_ever', 'jailed_ever', 'convict_ever', 'famstop_ever',
'pk_deficit', 'pk_sen', 'pk_spend', 'birthright_a', 'birthright_b',
'femoff_jobs', 'femoff_ed', 'femoff_spend', 'femoff_issues',
'lpres_pleased', 'lpres_immig', 'lpres_la', 'vaccine', 'autism',
'bo_muslim', 'bo_confid', 'amer_ident', 'race_ident', 'whitework',
'whitejob', 'wguilt1', 'wguilt2', 'wguilt3', 'buycott', 'boycott',
'skintone_mob', 'skintone', 'skin_discrim', 'africanam10_1',
'white10_1', 'hispanic10_1', 'asianam10_1', 'nativeam10_1', 'other10_1',
'other10_open', 'birthyr', 'gender', 'race', 'race_other', 'educ',
'marstat', 'speakspanish', 'employ', 'employ_t', 'faminc', 'faminc2',
'state', 'votereg', 'pid3', 'pid7', 'ideo5', 'newsint', 'pew_bornagain',
'pew_churatd', 'religpew', 'religpew_t', 'ever_vs_12mo_rand'],
dtype='object')
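If a plain Python list is more convenient, for instance to copy the names into your own code, the .tolist() method converts this Index object:

anes.columns.tolist()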
It can be difficult to read a list of column names formatted as a list. To see the column names in a more cleanly formatted table, use the .dtypes attribute described above or the .info() method with verbose=True. The .info() method also displays the dimensions of the data, a count of the data types across columns, and the memory used by the data frame:
anes.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 168 columns):
# Column Dtype
--- ------ -----
0 caseid float64
1 turnout12 int64
2 turnout12b float64
3 vote12 float64
4 percent16 int64
5 meet int64
6 givefut int64
7 info int64
8 march int64
9 sign int64
10 give12mo int64
11 compromise int64
12 ftobama float64
13 ftblack float64
14 ftwhite int64
15 fthisp float64
16 ftgay float64
17 ftjeb float64
18 fttrump float64
19 ftcarson float64
20 fthrc float64
21 ftrubio float64
22 ftcruz float64
23 ftsanders float64
24 ftfiorina float64
25 ftpolice float64
26 ftfem float64
27 fttrans float64
28 ftmuslim float64
29 ftsci float64
30 reg int64
31 demcand int64
32 repcand float64
33 vote16jb float64
34 vote16bc float64
35 vote16tc float64
36 vote16mr float64
37 vote16dt float64
38 presjob int64
39 lazyb int64
40 lazyw int64
41 lazyh int64
42 lazym int64
43 violentb int64
44 violentw int64
45 violenth int64
46 violentm int64
47 econnow int64
48 econ12mo int64
49 pid1d float64
50 pid2d object
51 pid1r float64
52 pid2r object
53 pidstr float64
54 pidlean float64
55 lcself int64
56 lcd int64
57 lcr int64
58 lchc int64
59 lcbo int64
60 lcdt int64
61 lcmr int64
62 lctc int64
63 srv_spend int64
64 campfin int64
65 immig_legal int64
66 immig_numb int64
67 equalpay int64
68 parleave int64
69 crimespend int64
70 death int64
71 terror_worry float64
72 terror_12mo float64
73 terror_local int64
74 relig_bc int64
75 relig_bcstr float64
76 relig_srv int64
77 relig_srvstr float64
78 incgap20 int64
79 isis_troops int64
80 syrians_a float64
81 syrians_b float64
82 pc_a float64
83 pc_b float64
84 minwage int64
85 healthspend int64
86 childcare int64
87 getahead int64
88 ladder int64
89 finwell int64
90 warm int64
91 warmbad int64
92 warmcause int64
93 warmdo int64
94 freetrade int64
95 stopwhite int64
96 stopblack int64
97 forcewhite int64
98 forceblack int64
99 stop_12mo float64
100 arrested_12mo float64
101 charged_12mo float64
102 jailed_12mo float64
103 convict_12mo float64
104 famstop_12mo float64
105 stop_ever float64
106 arrested_ever float64
107 charged_ever float64
108 jailed_ever float64
109 convict_ever float64
110 famstop_ever float64
111 pk_deficit int64
112 pk_sen float64
113 pk_spend int64
114 birthright_a float64
115 birthright_b float64
116 femoff_jobs float64
117 femoff_ed float64
118 femoff_spend float64
119 femoff_issues float64
120 lpres_pleased int64
121 lpres_immig float64
122 lpres_la float64
123 vaccine int64
124 autism int64
125 bo_muslim int64
126 bo_confid float64
127 amer_ident int64
128 race_ident int64
129 whitework float64
130 whitejob float64
131 wguilt1 float64
132 wguilt2 float64
133 wguilt3 float64
134 buycott int64
135 boycott int64
136 skintone_mob int64
137 skintone float64
138 skin_discrim int64
139 africanam10_1 float64
140 white10_1 float64
141 hispanic10_1 float64
142 asianam10_1 float64
143 nativeam10_1 int64
144 other10_1 float64
145 other10_open object
146 birthyr int64
147 gender int64
148 race int64
149 race_other object
150 educ int64
151 marstat int64
152 speakspanish int64
153 employ float64
154 employ_t object
155 faminc float64
156 faminc2 int64
157 state float64
158 votereg int64
159 pid3 int64
160 pid7 float64
161 ideo5 int64
162 newsint int64
163 pew_bornagain int64
164 pew_churatd int64
165 religpew float64
166 religpew_t object
167 ever_vs_12mo_rand int64
dtypes: float64(76), int64(86), object(6)
memory usage: 1.5+ MB
2.4.4. Descriptive Statistics#
One way to catch some big errors is to display statistics derived from the columns of the data frame, such as the mean and the minimum and maximum values. For example, in the ANES data, ftobama records people’s responses to the question: “On a scale from 0 to 100, with 0 being the coldest and 100 being the warmest, how warmly do you feel towards Barack Obama?” I can display various summary statistics for this feature with the following code:
anes[['ftobama']].describe()
ftobama | |
---|---|
count | 1193.000000 |
mean | 50.442582 |
std | 54.350731 |
min | 0.000000 |
25% | 5.000000 |
50% | 54.000000 |
75% | 87.000000 |
max | 998.000000 |
The .describe() method displays these summary statistics, and typing anes[['ftobama']] limits the results to just this one feature (we will discuss subsetting data by columns later in this course). Notice anything strange in the results? The mean value of about 50 seems reasonable, as do the standard deviation and percentiles. But the maximum value is 998 for a feature that by design should be no larger than 100. It turns out that throughout the data 998 is one of a few numeric codes that indicate a nonresponse. We can replace these values with missing values from the outset, and we will do so below in the Data with Numeric Missing Codes section.
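As a preview of that fix, the .replace() method returns a copy of the column with the offending code swapped for missing values (assuming 998 is the only such code in this column):

anes['ftobama'].replace(998, np.nan)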
To change the percentiles that are displayed, use the percentiles parameter and provide a list of the desired percentiles. To show percentiles in increments of 20%, type:
anes[['ftobama']].describe(percentiles = [.2, .4, .6, .8])
ftobama | |
---|---|
count | 1193.000000 |
mean | 50.442582 |
std | 54.350731 |
min | 0.000000 |
20% | 3.000000 |
40% | 31.000000 |
50% | 54.000000 |
60% | 70.000000 |
80% | 90.000000 |
max | 998.000000 |
Note that the median (50%) is always displayed.
By default the .describe() method displays summary statistics for all columns with int and float data types. To see the summary statistics for the object type columns, use the include parameter:
anes.describe(include="object")
pid2d | pid2r | other10_open | race_other | employ_t | religpew_t | |
---|---|---|---|---|---|---|
count | 1200 | 1199 | 94 | 1200 | 1200 | 1200 |
unique | 37 | 31 | 57 | 1 | 1 | 11 |
top | __NA__ | __NA__ | 0 | __NA__ | __NA__ | __NA__ |
freq | 1159 | 1162 | 17 | 1200 | 1200 | 1188 |
object type columns are treated as text, which means that statistics like the mean and median are impossible to calculate. Instead the .describe() method gives us the count of non-missing observations, the number of unique values, the most frequent value of the feature, and the number of times the most frequent value appears.
2.5. Identifying and Solving Problems with Text-Based Data Files#
There are many ways that text-based data files can be arranged that cause problems for loading the data into Python and for using the data for analyses. Please note that not every problem needs to be solved in the loading step: the entire pandas library exists for cleaning data once it is loaded into Python’s memory. But some problems are more efficiently solved as we load the file. Here we will discuss some common problems and strategies for solving them while loading the data.
I’ve uploaded to the GitHub page several additional versions of the ANES data that I messed up on purpose.
2.5.2. Data with Numeric Missing Codes#
I created a version of the data that uses -999 throughout to represent various missing values, such as when a survey respondent refuses to answer a question:
url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_missing.csv"
anes = pd.read_csv(url)
anes
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | -999 | 2 | 100 | ... | 2 | 6 | -999 | __NA__ | 2 |
1 | 2 | 2 | -999 | -999 | 50 | ... | 2 | 4 | 2 | __NA__ | 1 |
2 | 3 | 1 | -999 | 1 | 100 | ... | 2 | 6 | 1 | __NA__ | 1 |
3 | 4 | 1 | -999 | 2 | 100 | ... | 2 | 6 | -999 | __NA__ | 1 |
4 | 5 | 1 | -999 | 1 | 100 | ... | 2 | 2 | 3 | __NA__ | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1195 | 1196 | 1 | -999 | 2 | 100 | ... | 2 | 1 | 1 | __NA__ | 2 |
1196 | 1197 | 1 | -999 | 1 | 80 | ... | 1 | 2 | 1 | __NA__ | 1 |
1197 | 1198 | 1 | -999 | 2 | 95 | ... | 2 | 6 | -999 | __NA__ | 2 |
1198 | 1199 | 1 | -999 | 2 | 80 | ... | 2 | 4 | 1 | __NA__ | 1 |
1199 | 1200 | 2 | -999 | -999 | 0 | ... | 2 | 2 | 2 | __NA__ | 1 |
1200 rows × 168 columns
We can always proceed with the data frame as it is, and use various functions in pandas to replace these -999 values. But in this case, it is easier to read these values as missing as we load the data. We can do that with the na_values parameter. Then all of the -999 values are replaced with NaN, which Python recognizes as a missing datapoint.
anes = pd.read_csv(url, na_values = -999)
anes
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 2 |
1 | 2.0 | 2 | NaN | NaN | 50 | ... | 2 | 4 | 2.0 | __NA__ | 1 |
2 | 3.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
3 | 4.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 1 |
4 | 5.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 2 | 3.0 | __NA__ | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1195 | 1196.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 1 | 1.0 | __NA__ | 2 |
1196 | 1197.0 | 1 | NaN | 1.0 | 80 | ... | 1 | 2 | 1.0 | __NA__ | 1 |
1197 | 1198.0 | 1 | NaN | 2.0 | 95 | ... | 2 | 6 | NaN | __NA__ | 2 |
1198 | 1199.0 | 1 | NaN | 2.0 | 80 | ... | 2 | 4 | 1.0 | __NA__ | 1 |
1199 | 1200.0 | 2 | NaN | NaN | 0 | ... | 2 | 2 | 2.0 | __NA__ | 1 |
1200 rows × 168 columns
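The na_values parameter also accepts a list of codes, or a dictionary mapping column names to their own codes, which is useful when different columns use different missing-value conventions. For example (treating the 998 code we saw in ftobama as missing everywhere is an assumption you should check against the codebook):

anes = pd.read_csv(url, na_values=[-999, 998])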
2.5.3. Comments Inside the Data#
Suppose that you work with a data file in which the data provider left comments throughout the data. That’s heresy. But it can definitely happen. Hopefully the commenter had enough forethought to place a uniform character in front of these comments. If so, we can remove the comments while loading the data. Consider the following data file:
url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_comments.txt"
anes = pd.read_csv(url, header = 4)
anes
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1.0 | NaN | 2.0 | 100.0 | ... | 2.0 | 6.0 | NaN | __NA__ | 2.0 |
1 | 2 | 2.0 | NaN | NaN | 50.0 | ... | 2.0 | 4.0 | 2.0 | __NA__ | 1.0 |
2 | 3 | 1.0 | NaN | 1.0 | 100.0 | ... | 2.0 | 6.0 | 1.0 | __NA__ | 1.0 |
3 | @NEXT COMES A BUNCH MORE OBSERVATIONS! | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN |
4 | 4 | 1.0 | NaN | 2.0 | 100.0 | ... | 2.0 | 6.0 | NaN | __NA__ | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1197 | 1196 | 1.0 | NaN | 2.0 | 100.0 | ... | 2.0 | 1.0 | 1.0 | __NA__ | 2.0 |
1198 | 1197 | 1.0 | NaN | 1.0 | 80.0 | ... | 1.0 | 2.0 | 1.0 | __NA__ | 1.0 |
1199 | 1198 | 1.0 | NaN | 2.0 | 95.0 | ... | 2.0 | 6.0 | NaN | __NA__ | 2.0 |
1200 | 1199 | 1.0 | NaN | 2.0 | 80.0 | ... | 2.0 | 4.0 | 1.0 | __NA__ | 1.0 |
1201 | 1200 | 2.0 | NaN | NaN | 0.0 | ... | 2.0 | 2.0 | 2.0 | __NA__ | 1.0 |
1202 rows × 168 columns
To ignore these comments, use the comment parameter:
anes = pd.read_csv(url, header = 4, comment = '@')
anes
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 2 |
1 | 2.0 | 2 | NaN | NaN | 50 | ... | 2 | 4 | 2.0 | __NA__ | 1 |
2 | 3.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
3 | 4.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 1 |
4 | 5.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 2 | 3.0 | __NA__ | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1195 | 1196.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 1 | 1.0 | __NA__ | 2 |
1196 | 1197.0 | 1 | NaN | 1.0 | 80 | ... | 1 | 2 | 1.0 | __NA__ | 1 |
1197 | 1198.0 | 1 | NaN | 2.0 | 95 | ... | 2 | 6 | NaN | __NA__ | 2 |
1198 | 1199.0 | 1 | NaN | 2.0 | 80 | ... | 2 | 4 | 1.0 | __NA__ | 1 |
1199 | 1200.0 | 2 | NaN | NaN | 0 | ... | 2 | 2 | 2.0 | __NA__ | 1 |
1200 rows × 168 columns
2.5.4. Data Without Column Names#
The pd.read_csv() function assumes that the column names are listed on the first line of the data file, unless the header parameter is used, in which case it assumes the column names are on the first line after the header. But if the data do not contain a row of column names, then Python will use the first line of valid data as the column names. I created a version of the ANES data with the column names deleted. Consider what happens when I load this data file normally:
url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_nocolnames.csv"
anes = pd.read_csv(url)
anes
1 | 1.1 | NA | 2 | 100 | ... | 2.24 | 6.7 | NA.19 | __NA__.4 | 2.25 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.0 | 2 | NaN | NaN | 50 | ... | 2 | 4 | 2.0 | __NA__ | 1 |
1 | 3.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
2 | 4.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 1 |
3 | 5.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 2 | 3.0 | __NA__ | 2 |
4 | 6.0 | 1 | NaN | 3.0 | 100 | ... | 2 | 2 | 1.0 | __NA__ | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1194 | 1196.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 1 | 1.0 | __NA__ | 2 |
1195 | 1197.0 | 1 | NaN | 1.0 | 80 | ... | 1 | 2 | 1.0 | __NA__ | 1 |
1196 | 1198.0 | 1 | NaN | 2.0 | 95 | ... | 2 | 6 | NaN | __NA__ | 2 |
1197 | 1199.0 | 1 | NaN | 2.0 | 80 | ... | 2 | 4 | 1.0 | __NA__ | 1 |
1198 | 1200.0 | 2 | NaN | NaN | 0 | ... | 2 | 2 | 2.0 | __NA__ | 1 |
1199 rows × 168 columns
The column names are not descriptive of what the columns represent. But worse than that, these names are the datapoints from the first row of the data. That means we’ve lost an observation. To deal with this problem, use the header=None parameter.
anes = pd.read_csv(url, header=None)
anes
0 | 1 | 2 | 3 | 4 | ... | 163 | 164 | 165 | 166 | 167 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 2 |
1 | 2.0 | 2 | NaN | NaN | 50 | ... | 2 | 4 | 2.0 | __NA__ | 1 |
2 | 3.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
3 | 4.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 1 |
4 | 5.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 2 | 3.0 | __NA__ | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1195 | 1196.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 1 | 1.0 | __NA__ | 2 |
1196 | 1197.0 | 1 | NaN | 1.0 | 80 | ... | 1 | 2 | 1.0 | __NA__ | 1 |
1197 | 1198.0 | 1 | NaN | 2.0 | 95 | ... | 2 | 6 | NaN | __NA__ | 2 |
1198 | 1199.0 | 1 | NaN | 2.0 | 80 | ... | 2 | 4 | 1.0 | __NA__ | 1 |
1199 | 1200.0 | 2 | NaN | NaN | 0 | ... | 2 | 2 | 2.0 | __NA__ | 1 |
1200 rows × 168 columns
While we still have the issue that the column names are non-descriptive, at least they are labeled in numeric order from left to right, and we haven’t lost the first observation. To add the prefix “X” to each of these numbers, older versions of pandas provided a prefix="X" parameter, but it has since been removed:
anes = pd.read_csv(url, header=None, prefix="X")
anes
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[23], line 1
----> 1 anes = pd.read_csv(url, header=None, prefix="X")
2 anes
TypeError: read_csv() got an unexpected keyword argument 'prefix'
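The prefix parameter was removed from pd.read_csv() in pandas 2.0. With recent versions of pandas, the same result comes from renaming the columns after loading, for example with the .add_prefix() method:

anes = pd.read_csv(url, header=None)
anes = anes.add_prefix("X") # column 0 becomes "X0", column 1 becomes "X1", and so on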
If the data do not contain a header row for variable names, and we want the variables to have the right names, we have to define the names separately and pass this list using the names parameter:
col_names = ['caseid', 'turnout12', 'turnout12b', 'vote12', 'percent16', 'meet',
'givefut', 'info', 'march', 'sign', 'give12mo', 'compromise', 'ftobama',
'ftblack', 'ftwhite', 'fthisp', 'ftgay', 'ftjeb', 'fttrump', 'ftcarson',
'fthrc', 'ftrubio', 'ftcruz', 'ftsanders', 'ftfiorina', 'ftpolice',
'ftfem', 'fttrans', 'ftmuslim', 'ftsci', 'reg', 'demcand', 'repcand',
'vote16jb', 'vote16bc', 'vote16tc', 'vote16mr', 'vote16dt', 'presjob',
'lazyb', 'lazyw', 'lazyh', 'lazym', 'violentb', 'violentw', 'violenth',
'violentm', 'econnow', 'econ12mo', 'pid1d', 'pid2d', 'pid1r', 'pid2r',
'pidstr', 'pidlean', 'lcself', 'lcd', 'lcr', 'lchc', 'lcbo', 'lcdt',
'lcmr', 'lctc', 'srv_spend', 'campfin', 'immig_legal', 'immig_numb',
'equalpay', 'parleave', 'crimespend', 'death', 'terror_worry',
'terror_12mo', 'terror_local', 'relig_bc', 'relig_bcstr', 'relig_srv',
'relig_srvstr', 'incgap20', 'isis_troops', 'syrians_a', 'syrians_b',
'pc_a', 'pc_b', 'minwage', 'healthspend', 'childcare', 'getahead',
'ladder', 'finwell', 'warm', 'warmbad', 'warmcause', 'warmdo',
'freetrade', 'stopwhite', 'stopblack', 'forcewhite', 'forceblack',
'stop_12mo', 'arrested_12mo', 'charged_12mo', 'jailed_12mo',
'convict_12mo', 'famstop_12mo', 'stop_ever', 'arrested_ever',
'charged_ever', 'jailed_ever', 'convict_ever', 'famstop_ever',
'pk_deficit', 'pk_sen', 'pk_spend', 'birthright_a', 'birthright_b',
'femoff_jobs', 'femoff_ed', 'femoff_spend', 'femoff_issues',
'lpres_pleased', 'lpres_immig', 'lpres_la', 'vaccine', 'autism',
'bo_muslim', 'bo_confid', 'amer_ident', 'race_ident', 'whitework',
'whitejob', 'wguilt1', 'wguilt2', 'wguilt3', 'buycott', 'boycott',
'skintone_mob', 'skintone', 'skin_discrim', 'africanam10_1',
'white10_1', 'hispanic10_1', 'asianam10_1', 'nativeam10_1', 'other10_1',
'other10_open', 'birthyr', 'gender', 'race', 'race_other', 'educ',
'marstat', 'speakspanish', 'employ', 'employ_t', 'faminc', 'faminc2',
'state', 'votereg', 'pid3', 'pid7', 'ideo5', 'newsint', 'pew_bornagain',
'pew_churatd', 'religpew', 'religpew_t', 'ever_vs_12mo_rand']
anes = pd.read_csv(url, header=None, names=col_names)
anes
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 2 |
1 | 2.0 | 2 | NaN | NaN | 50 | ... | 2 | 4 | 2.0 | __NA__ | 1 |
2 | 3.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
3 | 4.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 1 |
4 | 5.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 2 | 3.0 | __NA__ | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1195 | 1196.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 1 | 1.0 | __NA__ | 2 |
1196 | 1197.0 | 1 | NaN | 1.0 | 80 | ... | 1 | 2 | 1.0 | __NA__ | 1 |
1197 | 1198.0 | 1 | NaN | 2.0 | 95 | ... | 2 | 6 | NaN | __NA__ | 2 |
1198 | 1199.0 | 1 | NaN | 2.0 | 80 | ... | 2 | 4 | 1.0 | __NA__ | 1 |
1199 | 1200.0 | 2 | NaN | NaN | 0 | ... | 2 | 2 | 2.0 | __NA__ | 1 |
1200 rows × 168 columns
2.5.5. Delimiters Other Than Commas#
Here’s a version of the ANES data that is delimited by tabs:
url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_tab.txt"
file = requests.get(url)
print(file.text[0:2000])
caseid turnout12 turnout12b vote12 percent16 meet givefut info march sign give12mo compromise ftobama ftblack ftwhite fthisp ftgay ftjeb fttrump ftcarson fthrc ftrubio ftcruz ftsanders ftfiorina ftpolice ftfem fttrans ftmuslim ftsci reg demcand repcand vote16jb vote16bc vote16tc vote16mr vote16dt presjob lazyb lazyw lazyh lazym violentb violentw violenth violentm econnow econ12mo pid1d pid2d pid1r pid2r pidstr pidlean lcself lcd lcr lchc lcbo lcdt lcmr lctc srv_spend campfin immig_legal immig_numb equalpay parleave crimespend death terror_worry terror_12mo terror_local relig_bc relig_bcstr relig_srv relig_srvstr incgap20 isis_troops syrians_a syrians_b pc_a pc_b minwage healthspend childcare getahead ladder finwell warm warmbad warmcause warmdo freetrade stopwhite stopblack forcewhite forceblack stop_12mo arrested_12mo charged_12mo jailed_12mo convict_12mo famstop_12mo stop_ever arrested_ever charged_ever jailed_ever convict_ever famstop_ever pk_deficit pk_sen pk_spend birthright_a birthright_b femoff_jobs femoff_ed femoff_spend femoff_issues lpres_pleased lpres_immig lpres_la vaccine autism bo_muslim bo_confid amer_ident race_ident whitework whitejob wguilt1 wguilt2 wguilt3 buycott boycott skintone_mob skintone skin_discrim africanam10_1 white10_1 hispanic10_1 asianam10_1 nativeam10_1 other10_1 other10_open birthyr gender race race_other educ marstat speakspanish employ employ_t faminc faminc2 state votereg pid3 pid7 ideo5 newsint pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
1 1 NA 2 100 1 3 4 1 2 2 1 100 100 100 100 96 36 1 5 76 31 0 84 2 51 62 97 20 100 1 1 6 NA NA NA 1 1 1 5 5 5 5 5 5 5 5 2 2 NA __NA__ 1 __NA__ 1 NA 4 3 4 5 5 4 4 4 7 1 1 2 1 2 5 7 5 4 5 2 1 2 1 1 7 NA 2 NA 3 1 1 1 4 6 6 1 2 1 1 4 2 4 2 4 NA NA NA NA NA NA 1 2 2 2 2 2 1 6 1 NA 2 NA 4 4 NA 5 NA 5 1 6 2 1 4 4 5 5 5 5 5 2 2 99 2 5 0 10 0 0 0 0 NA 1960 1 1 __NA__ 5 1 4 1 __NA__ 4 99 6 1 1 1 6 1 2 6 NA __NA__ 2
2 2 NA NA 50 4 5 4 2 2 2 1 39 6 74 6 75 28 28 6 52 21 10 13 6 65 53 49 2
The pd.read_csv() function also loads text-based data files with delimiters other than commas. To load tab-delimited data, use the sep="\t" parameter:
anes = pd.read_csv(url, sep="\t")
anes
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 2 |
1 | 2.0 | 2 | NaN | NaN | 50 | ... | 2 | 4 | 2.0 | __NA__ | 1 |
2 | 3.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
3 | 4.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 1 |
4 | 5.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 2 | 3.0 | __NA__ | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1195 | 1196.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 1 | 1.0 | __NA__ | 2 |
1196 | 1197.0 | 1 | NaN | 1.0 | 80 | ... | 1 | 2 | 1.0 | __NA__ | 1 |
1197 | 1198.0 | 1 | NaN | 2.0 | 95 | ... | 2 | 6 | NaN | __NA__ | 2 |
1198 | 1199.0 | 1 | NaN | 2.0 | 80 | ... | 2 | 4 | 1.0 | __NA__ | 1 |
1199 | 1200.0 | 2 | NaN | NaN | 0 | ... | 2 | 2 | 2.0 | __NA__ | 1 |
1200 rows × 168 columns
For delimiters other than a tab, simply type that character in quotes with the sep parameter. Here’s another version of the ANES data, delimited by semicolons:
url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_semicolon.txt"
file = requests.get(url)
print(file.text[0:2000])
caseid;turnout12;turnout12b;vote12;percent16;meet;givefut;info;march;sign;give12mo;compromise;ftobama;ftblack;ftwhite;fthisp;ftgay;ftjeb;fttrump;ftcarson;fthrc;ftrubio;ftcruz;ftsanders;ftfiorina;ftpolice;ftfem;fttrans;ftmuslim;ftsci;reg;demcand;repcand;vote16jb;vote16bc;vote16tc;vote16mr;vote16dt;presjob;lazyb;lazyw;lazyh;lazym;violentb;violentw;violenth;violentm;econnow;econ12mo;pid1d;pid2d;pid1r;pid2r;pidstr;pidlean;lcself;lcd;lcr;lchc;lcbo;lcdt;lcmr;lctc;srv_spend;campfin;immig_legal;immig_numb;equalpay;parleave;crimespend;death;terror_worry;terror_12mo;terror_local;relig_bc;relig_bcstr;relig_srv;relig_srvstr;incgap20;isis_troops;syrians_a;syrians_b;pc_a;pc_b;minwage;healthspend;childcare;getahead;ladder;finwell;warm;warmbad;warmcause;warmdo;freetrade;stopwhite;stopblack;forcewhite;forceblack;stop_12mo;arrested_12mo;charged_12mo;jailed_12mo;convict_12mo;famstop_12mo;stop_ever;arrested_ever;charged_ever;jailed_ever;convict_ever;famstop_ever;pk_deficit;pk_sen;pk_spend;birthright_a;birthright_b;femoff_jobs;femoff_ed;femoff_spend;femoff_issues;lpres_pleased;lpres_immig;lpres_la;vaccine;autism;bo_muslim;bo_confid;amer_ident;race_ident;whitework;whitejob;wguilt1;wguilt2;wguilt3;buycott;boycott;skintone_mob;skintone;skin_discrim;africanam10_1;white10_1;hispanic10_1;asianam10_1;nativeam10_1;other10_1;other10_open;birthyr;gender;race;race_other;educ;marstat;speakspanish;employ;employ_t;faminc;faminc2;state;votereg;pid3;pid7;ideo5;newsint;pew_bornagain;pew_churatd;religpew;religpew_t;ever_vs_12mo_rand
1;1;NA;2;100;1;3;4;1;2;2;1;100;100;100;100;96;36;1;5;76;31;0;84;2;51;62;97;20;100;1;1;6;NA;NA;NA;1;1;1;5;5;5;5;5;5;5;5;2;2;NA;__NA__;1;__NA__;1;NA;4;3;4;5;5;4;4;4;7;1;1;2;1;2;5;7;5;4;5;2;1;2;1;1;7;NA;2;NA;3;1;1;1;4;6;6;1;2;1;1;4;2;4;2;4;NA;NA;NA;NA;NA;NA;1;2;2;2;2;2;1;6;1;NA;2;NA;4;4;NA;5;NA;5;1;6;2;1;4;4;5;5;5;5;5;2;2;99;2;5;0;10;0;0;0;0;NA;1960;1;1;__NA__;5;1;4;1;__NA__;4;99;6;1;1;1;6;1;2;6;NA;__NA__;2
2;2;NA;NA;50;4;5;4;2;2;2;1;39;6;74;6;75;28;28;6;52;21;10;13;6;65;53;49;2
To load the data:
anes = pd.read_csv(url, sep=";")
anes
caseid | turnout12 | turnout12b | vote12 | percent16 | ... | pew_bornagain | pew_churatd | religpew | religpew_t | ever_vs_12mo_rand | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 2 |
1 | 2.0 | 2 | NaN | NaN | 50 | ... | 2 | 4 | 2.0 | __NA__ | 1 |
2 | 3.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 6 | 1.0 | __NA__ | 1 |
3 | 4.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 6 | NaN | __NA__ | 1 |
4 | 5.0 | 1 | NaN | 1.0 | 100 | ... | 2 | 2 | 3.0 | __NA__ | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1195 | 1196.0 | 1 | NaN | 2.0 | 100 | ... | 2 | 1 | 1.0 | __NA__ | 2 |
1196 | 1197.0 | 1 | NaN | 1.0 | 80 | ... | 1 | 2 | 1.0 | __NA__ | 1 |
1197 | 1198.0 | 1 | NaN | 2.0 | 95 | ... | 2 | 6 | NaN | __NA__ | 2 |
1198 | 1199.0 | 1 | NaN | 2.0 | 80 | ... | 2 | 4 | 1.0 | __NA__ | 1 |
1199 | 1200.0 | 2 | NaN | NaN | 0 | ... | 2 | 2 | 2.0 | __NA__ | 1 |
1200 rows × 168 columns
2.5.6. Fixed-width Files#
A fixed-width file contains no delimiters. Instead, it aligns all of the data for one variable in the same position on each row. These files generally do not store variable names, and can take up less disk space than a CSV. But that makes the data impossible to parse without an external list of where each variable is stored. The first and most important step is to get this list.
For this example, I pulled data from a public opinion survey conducted by the National Journal a few years ago. The data are available from the Roper Center for Public Opinion Research, and the raw data are in fixed-width format. The codebook is stored on GitHub; take a look. The features are listed on page 3, along with their column positions.
Loading a fixed-width file is a two step process. First, we use the codebook to create lists of the variable names, and either the width of each variable (how many columns each variable takes up), or the starting and ending position. Here we save both lists, although you will only need one of these two lists.
For the starting and ending position of each variable, we create a list of length 2 for each variable, where the first element is the column the previous variable ends on (or 0 for the first variable) and the second element is the column the current variable ends on. For example, if a variable occupies columns 34, 35, and 36, its list of length 2 is [33,36]. Then we store the lists-of-2 in a list-of-lists.
url = "https://raw.githubusercontent.com/NovaVolunteer/Practice_Application_DS/master/Week%205/njcc33850.dat"
datanames = ['psraid', 'sample', 'int_date', 'area',
'state', 'cregion', 'density', 'usr', 'cc1', 'cc1a',
'cc2', 'cc3', 'cc4', 'cc5', 'cc6', 'cc7', 'ql1', 'ql1a',
'qc1', 'hh1', 'employ', 'par', 'sex', 'age', 'educ2',
'hisp', 'race', 'inc', 'income', 'reg', 'party',
'partyln', 'iphoneus', 'hphoneus', 'recage', 'receduc',
'racethn', 'standwt', 'raceos']
datawidths = [6, 1, 6, 3, 2, 1, 1, 3, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 1, 1, 1, 2, 1, 1,
1, 1, 1, 1, 1, 1, 1, 4, 30]
datapos = [[0,6], [6,7], [7,13], [13,16], [16,18],
[18,19], [19,20], [20,23], [23,24], [24,25],
[25,26], [26,27], [27,28], [28,29], [29,30],
[30,31], [31,32], [32,33], [33,34], [34,35],
[35,36], [36,37], [37,38], [38,40], [40,41],
[41,42], [42,43], [43,45], [45,46], [46,47],
[47,48], [48,49], [49,50], [50,51], [51,52],
[52,53], [53,54], [54,58], [58,88]]
To load the fixed-width data into Python, we use the pd.read_fwf() function. We specify the location (or URL) of the data file and the names of the columns. And we can either specify the widths of each column, or the starting and ending positions (but not both). Using the widths:
njcc = pd.read_fwf(url, widths=datawidths, header=None, names=datanames)
njcc
psraid | sample | int_date | area | state | ... | recage | receduc | racethn | standwt | raceos | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001 | 1 | 121513 | 540 | 51 | ... | 2 | 1 | 1 | 2.45 | NaN |
1 | 100006 | 1 | 121213 | 248 | 26 | ... | 5 | 3 | 1 | 0.67 | NaN |
2 | 100011 | 1 | 121213 | 270 | 21 | ... | 6 | 3 | 1 | 0.37 | NaN |
3 | 100020 | 1 | 121213 | 803 | 45 | ... | 6 | 1 | 2 | 0.95 | NaN |
4 | 100023 | 1 | 121213 | 203 | 9 | ... | 6 | 1 | 1 | 0.85 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 203257 | 2 | 121513 | 901 | 47 | ... | 2 | 1 | 2 | 2.34 | NaN |
996 | 203262 | 2 | 121513 | 701 | 31 | ... | 2 | 3 | 1 | 1.29 | NaN |
997 | 203264 | 2 | 121513 | 619 | 36 | ... | 2 | 3 | 1 | 2.41 | NaN |
998 | 203273 | 2 | 121513 | 415 | 6 | ... | 3 | 3 | 2 | 1.53 | NaN |
999 | 203275 | 2 | 121513 | 541 | 41 | ... | 3 | 3 | 4 | 1.14 | NaN |
1000 rows × 39 columns
And using the starting and ending positions:
njcc = pd.read_fwf(url, colspecs=datapos, header=None, names=datanames)
njcc
psraid | sample | int_date | area | state | ... | recage | receduc | racethn | standwt | raceos | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001 | 1 | 121513 | 540 | 51 | ... | 2 | 1 | 1 | 2.45 | NaN |
1 | 100006 | 1 | 121213 | 248 | 26 | ... | 5 | 3 | 1 | 0.67 | NaN |
2 | 100011 | 1 | 121213 | 270 | 21 | ... | 6 | 3 | 1 | 0.37 | NaN |
3 | 100020 | 1 | 121213 | 803 | 45 | ... | 6 | 1 | 2 | 0.95 | NaN |
4 | 100023 | 1 | 121213 | 203 | 9 | ... | 6 | 1 | 1 | 0.85 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 203257 | 2 | 121513 | 901 | 47 | ... | 2 | 1 | 2 | 2.34 | NaN |
996 | 203262 | 2 | 121513 | 701 | 31 | ... | 2 | 3 | 1 | 1.29 | NaN |
997 | 203264 | 2 | 121513 | 619 | 36 | ... | 2 | 3 | 1 | 2.41 | NaN |
998 | 203273 | 2 | 121513 | 415 | 6 | ... | 3 | 3 | 2 | 1.53 | NaN |
999 | 203275 | 2 | 121513 | 541 | 41 | ... | 3 | 3 | 4 | 1.14 | NaN |
1000 rows × 39 columns
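Note that pd.read_fwf() can also try to detect the boundaries itself: its colspecs parameter defaults to 'infer', which guesses the column positions from the whitespace pattern in the first rows of the file. For densely packed files like this one, the guess is unreliable, so check the result before trusting it:

njcc_guess = pd.read_fwf(url, header=None) # colspecs='infer' is the default
njcc_guess.shape # verify the number of detected columns before trusting the parse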
2.6. Loading Other Kinds of Electronic Data Files#
CSV, ASCII, and other kinds of plain text files are very common in fields that primarily use open-source programming environments like Python and R, as these file types are lightweight and universal. But there are many more people out there who work in fields that use specific, proprietary data analysis software. It can be a real challenge to load a specific file format in a context outside of its native software environment. There is specialized software that exists only to transfer proprietary data from one environment to another: one such package is Stat/Transfer, which can easily convert data among Stata, SAS, SPSS, and Excel for $129 a year for academic users.
If you are using Stata, SAS, SPSS, Excel, or another program, or if you are working with someone who uses that software, and you want to transfer data to Python, the easiest approach is to save the working data as a CSV file and then load it into Python with pd.read_csv(). All of those software packages have the functionality to save data as CSV.
That doesn’t help you if you are dealing with an SPSS file and you do not have access to SPSS. Fortunately, Python has functions to load data from pretty much every software environment. Here’s a rundown of these functions.
2.6.1. Loading Excel Files#
There are a few methods for loading Excel files in Python, but the most convenient is the pd.read_excel() function in pandas. One big way that Excel files differ from other data formats is the use of multiple sheets in the same document. The main challenge in loading an Excel file in Python is dealing with the sheets.
I’ve saved an Excel file on GitHub that contains information about NBA teams during the 2018-2019 season. This Excel document contains four sheets, named “NBA-TEAM-SAMPLE”, “METADATA”, “TEAMS”, and “CONVERT DATE FORMAT”. All four sheets have visual formatting. I want to load the “TEAMS” sheet without carrying forward any code that refers to the visual appearance of this sheet.
The first argument of pd.read_excel() is the path, as with pd.read_csv(), and it can take a URL. The second argument is sheet_name. If the Excel file has sheets with names, you can type the name of the sheet here. Or type a number: 0 refers to the first sheet, 1 to the second, etc. This function loads the data directly to a data frame, and ignores other graphical elements of Excel sheets, such as shading particular cells or using fonts.
url = "https://github.com/NovaVolunteer/Practice_Application_DS/blob/master/Week%205/NBA-Team-Sample-BoxScore-Dataset.xlsx?raw=true"
nba = pd.read_excel(url, sheet_name="TEAMS")
nba
INITIALS | LONG NAME | SHORT NAME | CONFERENCE | DIVISION | |
---|---|---|---|---|---|
0 | Atl | Atlanta Hawks | Atlanta | East | Southeast |
1 | Bos | Boston Celtics | Boston | East | Atlantic |
2 | Bro | Brooklyn Nets | Brooklyn | East | Atlantic |
3 | Cha | Charlotte Hornets | Charlotte | East | Southeast |
4 | Chi | Chicago Bulls | Chicago | East | Central |
... | ... | ... | ... | ... | ... |
25 | Sac | Sacramento Kings | Sacramento | West | Pacific |
26 | San | San Antonio Spurs | San Antonio | West | Southwest |
27 | Tor | Toronto Raptors | Toronto | East | Atlantic |
28 | Uta | Utah Jazz | Utah | West | Northwest |
29 | Was | Washington Wizards | Washington | East | Southeast |
30 rows × 5 columns
If you specify more than one sheet with the sheet_name parameter using a list, pd.read_excel() will produce a dictionary of data frames, one for each sheet you specify, keyed by the sheet names or numbers you passed. Typing sheet_name=None produces a dictionary containing all of the sheets. For example, to load the “NBA-TEAM-SAMPLE” and “TEAMS” sheets, and save them as two separate data frames embedded in a dictionary, type:
nba = pd.read_excel(url, sheet_name=[0,2])
nba[0]
DATASET | GAME-ID | DATE | TEAM | VENUE | ... | CLOSING TOTAL | MONEYLINE | HALFTIME | BOX SCORE\nURL | ODDS\nURL | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | NBA 2018-2019 Regular Season | 21800001 | 10/16/2018 | Philadelphia | R | ... | 211.5 | 170 | NaN | NaN | NaN |
1 | NBA 2018-2019 Regular Season | 21800001 | 10/16/2018 | Boston | H | ... | 211.5 | -200 | NaN | BOX SCORE | ODDS |
2 | NBA 2018-2019 Regular Season | 21800002 | 10/16/2018 | Oklahoma City | R | ... | 220.5 | 711 | NaN | NaN | NaN |
3 | NBA 2018-2019 Regular Season | 21800002 | 10/16/2018 | Golden State | H | ... | 220.5 | -1100 | NaN | BOX SCORE | ODDS |
4 | NBA 2018-2019 Regular Season | 21800003 | 10/17/2018 | Milwaukee | R | ... | 222.0 | -165 | +2.5 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
60 | NBA 2018-2019 Regular Season | 21800031 | 10/20/2018 | Detroit | R | ... | 217.0 | -200 | -2.5 | NaN | NaN |
61 | NBA 2018-2019 Regular Season | 21800031 | 10/20/2018 | Chicago | H | ... | 217.0 | 170 | 106.5u15 | BOX SCORE | ODDS |
62 | NBA 2018-2019 Regular Season | 21800032 | 10/20/2018 | Minnesota | R | ... | 220.5 | 110 | 110 | NaN | NaN |
63 | NBA 2018-2019 Regular Season | 21800032 | 10/20/2018 | Dallas | H | ... | 220.5 | -130 | -1 -05 | BOX SCORE | ODDS |
64 | NBA 2018-2019 Regular Season | 21800033 | 10/20/2018 | Phoenix | R | ... | 224.5 | 459 | 110.5 | NaN | NaN |
65 rows × 57 columns
In this case, because we explicitly did not include the sheets indexed 1 and 3, the dictionary only contains the keys 0 and 2. To access the “TEAMS” sheet, look up key 2 of the nba dictionary:
nba[2]
INITIALS | LONG NAME | SHORT NAME | CONFERENCE | DIVISION | |
---|---|---|---|---|---|
0 | Atl | Atlanta Hawks | Atlanta | East | Southeast |
1 | Bos | Boston Celtics | Boston | East | Atlantic |
2 | Bro | Brooklyn Nets | Brooklyn | East | Atlantic |
3 | Cha | Charlotte Hornets | Charlotte | East | Southeast |
4 | Chi | Chicago Bulls | Chicago | East | Central |
... | ... | ... | ... | ... | ... |
25 | Sac | Sacramento Kings | Sacramento | West | Pacific |
26 | San | San Antonio Spurs | San Antonio | West | Southwest |
27 | Tor | Toronto Raptors | Toronto | East | Atlantic |
28 | Uta | Utah Jazz | Utah | West | Northwest |
29 | Was | Washington Wizards | Washington | East | Southeast |
30 rows × 5 columns
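And as noted above, sheet_name=None loads every sheet at once into a dictionary keyed by the sheet names:

nba_all = pd.read_excel(url, sheet_name=None)
nba_all["TEAMS"] # access a single sheet by its name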
2.6.2. Loading SAS, Stata, and SPSS Files#
For SAS files with the extensions .sas7bdat or .xport, or Stata files with the extension .dta, the pd.read_sas() and pd.read_stata() functions work just like other data parsing functions in pandas. We can pass the file path, the file name alone (if we’ve set the working directory), or a URL to these functions, and they load the data directly to a data frame.
On GitHub, I’ve saved a SAS file containing the monthly inflation rate in the United States since 1983, and a Stata file containing a CBS public opinion poll. To load the SAS file, I type:
url = "https://github.com/jkropko/DS-6001/raw/master/localdata/inflation.sas7bdat"
inflation = pd.read_sas(url)
inflation
YEAR | MONTH | WAGE | PCWAGE | CPI | INFLN | |
---|---|---|---|---|---|---|
0 | 1983.0 | 12.0 | 8.320000 | NaN | 101.400002 | NaN |
1 | 1984.0 | 1.0 | 8.370000 | 0.599163 | 102.099998 | 0.687963 |
2 | 1984.0 | 2.0 | 8.360000 | -0.119546 | 102.599998 | 0.488521 |
3 | 1984.0 | 3.0 | 8.400000 | 0.477328 | 102.900002 | 0.291971 |
4 | 1984.0 | 4.0 | 8.440000 | 0.475060 | 103.300003 | 0.387973 |
... | ... | ... | ... | ... | ... | ... |
265 | 2006.0 | 1.0 | 16.400000 | 0.305344 | 199.000000 | 0.655409 |
266 | 2006.0 | 2.0 | 16.469999 | 0.425921 | 199.100006 | 0.050239 |
267 | 2006.0 | 3.0 | 16.510000 | 0.242571 | 199.800003 | 0.350965 |
268 | 2006.0 | 4.0 | 16.610001 | 0.603867 | 201.000000 | 0.598804 |
269 | 2006.0 | 5.0 | 16.620001 | 0.060187 | 201.899994 | 0.446762 |
270 rows × 6 columns
And to load the Stata file:
url = "https://github.com/jkropko/DS-6001/raw/master/localdata/cbspoll.dta"
cbspoll = pd.read_stata(url)
cbspoll
CASEID | FEMM | DAYS | SEX | INTN | ... | NPH0 | NM | COVR | ADLT | AREACD | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16715.0 | 1.0 | Tuesday | Female | 708.0 | ... | NaN | Willing, talkative | 4.0 | 4.0 | 818.0 |
1 | 19510.0 | 1.0 | Monday | Female | 256.0 | ... | 1.0 | Willing, talkative | 3.0 | 2.0 | 512.0 |
2 | 13448.0 | 1.0 | Sunday | Female | 302.0 | ... | NaN | Willing, talkative | 5.0 | 2.0 | 218.0 |
3 | 9825.0 | 1.0 | Tuesday | Female | 150.0 | ... | NaN | Willing, talkative | 6.0 | 1.0 | 660.0 |
4 | 6504.0 | NaN | Thursday | Male | 246.0 | ... | NaN | Willing, talkative | 6.0 | 2.0 | 207.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1177 | 6389.0 | NaN | Thursday | Male | 303.0 | ... | NaN | Not willing | 1.0 | 4.0 | 540.0 |
1178 | 1090.0 | NaN | Wednesday | Female | 230.0 | ... | NaN | Not willing | 5.0 | 3.0 | 254.0 |
1179 | 19795.0 | 1.0 | Tuesday | Female | 384.0 | ... | NaN | Willing, not talkative | 1.0 | 2.0 | 318.0 |
1180 | 20001.0 | 1.0 | Wednesday | Male | 150.0 | ... | NaN | Not willing | 6.0 | 2.0 | 702.0 |
1181 | 14040.0 | 1.0 | Monday | Female | 189.0 | ... | 1.0 | Willing, talkative | 5.0 | 1.0 | 410.0 |
1182 rows × 115 columns
SPSS files have the file extension .sav, and can be loaded with pd.read_spss() in the same way. One issue (at the time this notebook was written) is that the pd.read_spss() function only accepts local files, and not files from URLs. I saved data from a public opinion survey on GitHub. If you want to try loading it into Python, download the file and move it to the folder where you’ve set your working directory (or type out the whole file path in the following code), and type:
survey = pd.read_spss("survey.sav")
survey
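If you would rather not download the file by hand, a workaround is to fetch it with the requests library and save a local copy first. A minimal sketch (note that pd.read_spss() also requires the pyreadstat package to be installed):

url = "https://github.com/jkropko/DS-6001/raw/master/localdata/survey.sav"
response = requests.get(url)
with open("survey.sav", "wb") as f:
    f.write(response.content) # save the raw bytes to the working directory
survey = pd.read_spss("survey.sav")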
Eventually I expect the pd.read_spss() function to be able to accept URLs as well. At that point, the following code should also work:
url = "https://github.com/jkropko/DS-6001/raw/master/localdata/survey.sav"
survey = pd.read_spss(url)
survey
2.7. Saving CSV and ASCII Files to Disk#
Suppose we’ve done all the steps needed to clean and manage the data. We might want to save a clean version of the data in a CSV or other ASCII file on our local disk space. We can do so by applying the .to_csv() method to the anes data frame. The first argument is the filename, with whatever extension we want for the saved file. As with pd.read_csv(), we can also specify the sep parameter to choose a delimiter for the text-based data file we are creating. Let’s save the ANES dataframe as “anes_cleaned.csv” in our working directory:
anes.to_csv("anes_cleaned.csv", sep=",")
For a tab-separated file we can employ the more general “.txt” extension:
anes.to_csv("anes_cleaned.txt", sep="\t")
2.5.1. Comments Prior to the Header#
I created a version of the ANES data with four rows of comments prior to the column names. Displaying the raw text of this file, without trying to load it as a data frame, shows the comment lines sitting above the row of column names.

If I try to load the data file as it currently exists with a plain call to pd.read_csv(), I get an error. This error occurs because the first line tells the pd.read_csv() function that there is one column in the data, as there are no commas on this line. So by the time the column names appear, there are more commas than the parser knows how to handle.

To solve this problem, use the header parameter to tell pd.read_csv() how many rows of the data file are taken up by the unwanted header. The function always supposes that the column names are on the first row of the data after the header.
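A minimal sketch of the fix, reusing the comments file from section 2.5.3 (which also has four lines of comments above its header row):

url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_comments.txt"
anes = pd.read_csv(url, header=4, comment='@')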