2. Loading Data from Electronic Data Files#

2.1. Introduction: Please don’t bend, fold, spindle or mutilate me#

“‘Do Not Fold, Spindle or Mutilate’: A Cultural History of the Punch Card” by Steven Lubar details the earliest method of data storage and transference. Punch cards were first used by the U.S. government to tabulate the 1890 census. Before World War I, punch cards started being used by the military, the railroads, and insurance companies to store data. In the 1930s, punch cards were used by the New Deal agencies, and by the 1940s they were widespread throughout American society.

Punchcards predated computers, which is to say that data storage and transference predate computers. These early uses of punchcards employed data tabulation machines capable of counting the number of times particular holes had been punched in a series of cards, fed into the machine one at a time. The tabulation machine used to count the 1890 census was developed by a company that later became IBM.

When punchcards became widely used, they were printed with a universal warning: “don’t bend, fold, spindle or mutilate”, which became a cultural meme that long outlived the punchcards themselves. In the 1960s, punchcards and the slogan “don’t bend, fold, spindle or mutilate” became symbols of alienation and technocratic oppression in campus protests and many countercultural movements:

The depersonalization of the punch-card era found its catch phrase in the words on the cards; its ubiquity gave it instant familiarity. One observer of the period wrote that marijuana, the 60s escape from the rigors of the real world, let you see “the strangeness of real unfolded-unspindled-unmutilated life” (Gitlin 202). “Do not fold, spindle, or mutilate” became shorthand for a whole realm of countercultural experience. The ecological movement of the early 1970s, a child of the 1960s counterculture, picked up on it too: a popular poster for Earth Day 1970 showed a picture of the Earth taken from space with the legend “Do not fold, spindle, or mutilate” (Lubar 1992, p. 48).

Starting in the 1950s it became possible to run more advanced statistical analyses, such as linear regression, using mainframe computers with punchcard inputs. Many of my professors in grad school had stories about trekking across campus at 3am, the only time they could get access to the computer, to the building with the giant, noisy computer in the basement. They would carry a stack of punchcards, trying hard not to trip or slip on ice because dropping the punchcards would mean hours of work to get them back into a specific order. Finally, while inputting the cards, inevitably one would get jammed, which meant alerting the surly technician on call.

Two technological advances together made punchcards obsolete. First, in the late 1970s and through the 1980s, it became possible to store data directly on a computer hard drive or a floppy disk, which led to the creation of electronic data files such as CSV. Second, the internet made it possible to transfer data electronically, which led to APIs, HTML, and remote databases.

Today there are several common ways to access electronic data.

  1. People can share individual data files through websites, email, or physical storage devices. These files are often in ASCII format, but can be stored in other (sometimes proprietary) formats.

  2. Through an application programming interface (API): a specific system that connects clients to web servers and allows the transfer of data, often formatted in JavaScript Object Notation (JSON).

  3. Through raw HTML that can be converted into tabular data through web-scraping.

  4. Through a local or remote relational database - a collection of many individual datasets - managed using SQL.

For modern data science, it’s important to be very comfortable working with all of these methods of sharing data. We will go over individual data files in this module, and the other methods in the following modules.

In this notebook, I demonstrate how to load CSV files with various messy properties that have to be addressed when loading the data into Python. I also demonstrate how to look at a dataframe once it’s loaded to search for problems, and how to load fixed width, Excel, SAS, Stata, and SPSS data files. All of the data for these examples are available on this GitHub page.

Before we begin, we load the following libraries:

import numpy as np
import pandas as pd
import os
import sys
sys.tracebacklimit = 0 # turn off the error tracebacks

2.2. Changing the Working Directory#

Before we go over the functions to load and save data files, it is useful to set the working directory at the start of your script or notebook. This sets the default folder where Python opens and saves files. If all of your files are in the same folder, setting the working directory means you don’t have to write out the paths each time you load or save a file.

To set the working directory:

  • Load the os package.

  • Type the folder’s address into os.chdir("folder")

To check on the path Python is currently using as a default, type os.getcwd() into the console. If you want to change the working directory back after you’ve run the relevant code, save your old path as an object to begin your notebook, and change the path back to the old path to end your notebook, like this:

import os
oldpath = os.getcwd()
os.chdir("folder")

#(Your code goes here)

os.chdir(oldpath)

Because I want this notebook to work in anyone’s environment, I will only work with files that are accessible through the web. But make sure to change the working directory in your own notebooks when working with local data files.

2.3. Text-Based Data Files#

When data could first be stored on hard drives and disks, space was very limited. In order to store data as efficiently as possible, a universal standard was adopted for which symbols would be considered valid data. This standard, which is still in use today, is called ASCII:

ASCII - (pronounced “As-Key”) American Standard Code for Information Interchange:

  • 128 characters are considered to be “legal” in data files.

  • ASCII files are plain text files. They may be messy, and not immediately ready for use in analyses, but we have methods to load and clean ASCII data.

  • ASCII files are designed to be as small and as universally portable as possible.

  • Data points are usually delimited by commas, spaces, or tabs, or might require a data dictionary to read.

The most common ASCII format is comma-separated values (CSV), in which individual data points are separated by commas. A CSV file is a plain text file that can be opened with Notepad on Windows or with TextEdit on Mac. Each row in the data appears on a new row in the text file, but the columns do not necessarily align - instead the commas represent new columns, so that the fourth and fifth commas on a line surround the fifth column in the data, for example. If the column names are included in the data, they will be listed on the first line. If you open a CSV file in a plain text editor, it will look like this:
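For instance, here is a small hypothetical slice of the ANES data loaded below, reduced to four columns; note that the missing vote12 value in the second row appears as nothing between two commas:

caseid,turnout12,vote12,percent16
1,1,2,100
2,2,,50
3,1,1,100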

On many computers with Microsoft Office installed, CSV files are opened by default in Microsoft Excel. Excel can be a convenient tool for aligning the columns and visualizing the data in a spreadsheet, but please note that the .csv filetype is a universal and lightweight format that can be opened in any programming environment. It is not specific to Microsoft.

In a CSV, the comma is called the delimiter: the character that separates datapoints. Another frequently used delimiter is the tab. Here’s the same data in tab-separated format:
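Using the same hypothetical slice, with tab characters separating the values:

caseid	turnout12	vote12	percent16
1	1	2	100
2	2		50
3	1	1	100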

In some situations the data are stored in a way that minimizes the space needed to contain them. In that case, we can do away with the delimiter altogether and instead fix particular positions on each line to represent specific columns. We can save further space by also removing the column names from the file. This format is called a fixed-width data file, and in a plain text editor it looks like this:
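Using the same hypothetical slice, suppose caseid occupies positions 1-4, turnout12 positions 5-6, vote12 positions 7-8, and percent16 positions 9-12. The blank spot in the second row is the missing vote12 value:

   1 1 2 100
   2 2    50
   3 1 1 100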

Notice that this file contains no information to help us understand what the columns are or where they begin or end. Fixed-width files always come with a data dictionary that contains this information.

I will show examples of how to work with tab-separated, fixed-width, and other files below. But first, to load a standard CSV file with no issues that need to be fixed, use the pd.read_csv() function. The first argument of pd.read_csv() is filepath_or_buffer, and can be one of three things:

  1. The full file address and file name of the data file.

  2. Just the file name of the data file if you’ve already set the working directory to the folder where the file exists.

  3. The URL of a data file that’s accessible online.

Once a data file is loaded into Python with pd.read_csv(), it is called a DataFrame.

Here I load public opinion data from the American National Election Study:

url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example.csv"
anes = pd.read_csv(url)

This dataset is messy! As we proceed you will see confusing column names, misleading codes for missing values, strange encodings for categorical features, and other flaws. It is important for us to work with messy data because real-world data is almost always messy.

2.4. Looking at the DataFrame to See if it Loaded Correctly#

One of the biggest mistakes someone can make when loading data into Python (or any programming environment) is to assume that the data loaded correctly just because the code did not produce an error. There are lots of things that can still go wrong. The column names might not have been read properly, the dimensions might be wrong, the data might have been read in a way that replaced valid values with missing values, and so on.

There are several functions that are useful for inspecting the data to see whether they loaded properly. The best strategy is to load the data, look at the loaded data to identify any problems, and if necessary, reload the data to fix these problems from the outset.

2.4.1. Viewing the Data as a Spreadsheet#

The best way to visualize the data frame is to type the name of the data frame into a cell in a Jupyter notebook. The output of the cell will be rendered as a spreadsheet, and in that format it will be clear what problems, if any, exist with the data frame. One issue: for large data frames (many rows, many columns, or both), displaying the entire spreadsheet unnecessarily takes up a lot of space. Prior to displaying a spreadsheet, we can set the defaults for the number of rows and columns to be displayed as follows:

pd.set_option('display.max_columns', 10) #display a maximum of 10 columns
pd.set_option('display.max_rows', 10) #display a maximum of 10 rows

Now I can see a manageable slice of the data frame. Here’s what the ANES data look like:

anes
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 2
1 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
2 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
3 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
4 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1195 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1196 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1197 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1198 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1199 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

1200 rows × 168 columns

Note the ... cells: they are unobtrusive, but in this case they represent 1190 rows and 158 columns. To see more of the data than are contained in this 10 x 10 table, change the defaults in the code listed above.

To see the first several rows, use the .head() method. A method is different from a regular function in that it is applied specifically to an existing object in Python’s memory, and it is typed as a suffix to a call to that object. For example, to see the first 10 rows of the anes data frame, type:

anes.head(10)
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 2
1 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
2 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
3 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
4 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
5 6.0 1 NaN 3.0 100 ... 2 2 1.0 __NA__ 2
6 7.0 1 NaN 3.0 100 ... 2 2 1.0 __NA__ 1
7 8.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
8 NaN 1 NaN 1.0 100 ... 1 2 1.0 __NA__ 2
9 10.0 1 NaN 2.0 100 ... 1 1 1.0 __NA__ 2

10 rows × 168 columns

To see the last several rows, use the .tail() method in exactly the same way:

anes.tail(10)
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
1190 1191.0 1 NaN 2.0 100 ... 2 5 12.0 __NA__ 1
1191 1192.0 1 NaN 2.0 100 ... 2 6 2.0 __NA__ 2
1192 1193.0 1 NaN 3.0 100 ... 2 5 12.0 __NA__ 1
1193 1194.0 1 NaN 2.0 53 ... 2 2 2.0 __NA__ 2
1194 1195.0 1 NaN 3.0 50 ... 1 3 1.0 __NA__ 2
1195 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1196 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1197 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1198 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1199 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

10 rows × 168 columns

2.4.2. Displaying the Data Type of Each Column#

All data that is stored in Python’s memory is assigned a type that controls how various functions operate by default on that data. For example, the .describe() method (that I discuss below) provides means and other summary statistics for numeric data, and frequencies for text data. If you intend to perform mathematical operations on a column in a data frame, such as by including the column in a regression model, it is important to first confirm that the column is read with an appropriate data type. The most common data types are:

  • int64: integers, stored as 64-bit values

  • float64: numbers with decimals, stored as 64-bit floating point values

  • object: text, or numeric data coded as text

  • bool: only True or False
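To check the data type of a single column, use the .dtype attribute on that column; the .astype() method converts a column to another type. Here is a minimal sketch (the conversion is purely illustrative and is not saved back to the data frame):

anes['turnout12'].dtype #dtype('int64')
anes['turnout12'].astype('float64').dtype #dtype('float64')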

We can use the .dtypes attribute to display the data types of every column. Because this information is stored in a table, we need to turn off the default limit on the number of rows displayed to see all of the columns’ data types:

pd.set_option('display.max_rows', None) #to see all the variables
anes.dtypes
caseid               float64
turnout12              int64
turnout12b           float64
vote12               float64
percent16              int64
meet                   int64
givefut                int64
info                   int64
march                  int64
sign                   int64
give12mo               int64
compromise             int64
ftobama              float64
ftblack              float64
ftwhite                int64
fthisp               float64
ftgay                float64
ftjeb                float64
fttrump              float64
ftcarson             float64
fthrc                float64
ftrubio              float64
ftcruz               float64
ftsanders            float64
ftfiorina            float64
ftpolice             float64
ftfem                float64
fttrans              float64
ftmuslim             float64
ftsci                float64
reg                    int64
demcand                int64
repcand              float64
vote16jb             float64
vote16bc             float64
vote16tc             float64
vote16mr             float64
vote16dt             float64
presjob                int64
lazyb                  int64
lazyw                  int64
lazyh                  int64
lazym                  int64
violentb               int64
violentw               int64
violenth               int64
violentm               int64
econnow                int64
econ12mo               int64
pid1d                float64
pid2d                 object
pid1r                float64
pid2r                 object
pidstr               float64
pidlean              float64
lcself                 int64
lcd                    int64
lcr                    int64
lchc                   int64
lcbo                   int64
lcdt                   int64
lcmr                   int64
lctc                   int64
srv_spend              int64
campfin                int64
immig_legal            int64
immig_numb             int64
equalpay               int64
parleave               int64
crimespend             int64
death                  int64
terror_worry         float64
terror_12mo          float64
terror_local           int64
relig_bc               int64
relig_bcstr          float64
relig_srv              int64
relig_srvstr         float64
incgap20               int64
isis_troops            int64
syrians_a            float64
syrians_b            float64
pc_a                 float64
pc_b                 float64
minwage                int64
healthspend            int64
childcare              int64
getahead               int64
ladder                 int64
finwell                int64
warm                   int64
warmbad                int64
warmcause              int64
warmdo                 int64
freetrade              int64
stopwhite              int64
stopblack              int64
forcewhite             int64
forceblack             int64
stop_12mo            float64
arrested_12mo        float64
charged_12mo         float64
jailed_12mo          float64
convict_12mo         float64
famstop_12mo         float64
stop_ever            float64
arrested_ever        float64
charged_ever         float64
jailed_ever          float64
convict_ever         float64
famstop_ever         float64
pk_deficit             int64
pk_sen               float64
pk_spend               int64
birthright_a         float64
birthright_b         float64
femoff_jobs          float64
femoff_ed            float64
femoff_spend         float64
femoff_issues        float64
lpres_pleased          int64
lpres_immig          float64
lpres_la             float64
vaccine                int64
autism                 int64
bo_muslim              int64
bo_confid            float64
amer_ident             int64
race_ident             int64
whitework            float64
whitejob             float64
wguilt1              float64
wguilt2              float64
wguilt3              float64
buycott                int64
boycott                int64
skintone_mob           int64
skintone             float64
skin_discrim           int64
africanam10_1        float64
white10_1            float64
hispanic10_1         float64
asianam10_1          float64
nativeam10_1           int64
other10_1            float64
other10_open          object
birthyr                int64
gender                 int64
race                   int64
race_other            object
educ                   int64
marstat                int64
speakspanish           int64
employ               float64
employ_t              object
faminc               float64
faminc2                int64
state                float64
votereg                int64
pid3                   int64
pid7                 float64
ideo5                  int64
newsint                int64
pew_bornagain          int64
pew_churatd            int64
religpew             float64
religpew_t            object
ever_vs_12mo_rand      int64
dtype: object

Before moving on, we set the default number of rows displayed back to 10:

pd.set_option('display.max_rows', 10)

2.4.3. Displaying Column Names#

It can be very useful to display the column names in one readable list so that we can work with the variables/features contained in these columns. But as with data frames, we again need to contend with Python’s default display settings. To turn off the display limit for items in a list, type:

pd.set_option('display.max_seq_items', None)

To display the column names, use the .columns attribute. An attribute is like a method in that it is attached to an existing object in Python’s memory, but while a method is a type of function, an attribute is another object. In this case, the column names are contained in a list-type object:

anes.columns
Index(['caseid', 'turnout12', 'turnout12b', 'vote12', 'percent16', 'meet',
       'givefut', 'info', 'march', 'sign', 'give12mo', 'compromise', 'ftobama',
       'ftblack', 'ftwhite', 'fthisp', 'ftgay', 'ftjeb', 'fttrump', 'ftcarson',
       'fthrc', 'ftrubio', 'ftcruz', 'ftsanders', 'ftfiorina', 'ftpolice',
       'ftfem', 'fttrans', 'ftmuslim', 'ftsci', 'reg', 'demcand', 'repcand',
       'vote16jb', 'vote16bc', 'vote16tc', 'vote16mr', 'vote16dt', 'presjob',
       'lazyb', 'lazyw', 'lazyh', 'lazym', 'violentb', 'violentw', 'violenth',
       'violentm', 'econnow', 'econ12mo', 'pid1d', 'pid2d', 'pid1r', 'pid2r',
       'pidstr', 'pidlean', 'lcself', 'lcd', 'lcr', 'lchc', 'lcbo', 'lcdt',
       'lcmr', 'lctc', 'srv_spend', 'campfin', 'immig_legal', 'immig_numb',
       'equalpay', 'parleave', 'crimespend', 'death', 'terror_worry',
       'terror_12mo', 'terror_local', 'relig_bc', 'relig_bcstr', 'relig_srv',
       'relig_srvstr', 'incgap20', 'isis_troops', 'syrians_a', 'syrians_b',
       'pc_a', 'pc_b', 'minwage', 'healthspend', 'childcare', 'getahead',
       'ladder', 'finwell', 'warm', 'warmbad', 'warmcause', 'warmdo',
       'freetrade', 'stopwhite', 'stopblack', 'forcewhite', 'forceblack',
       'stop_12mo', 'arrested_12mo', 'charged_12mo', 'jailed_12mo',
       'convict_12mo', 'famstop_12mo', 'stop_ever', 'arrested_ever',
       'charged_ever', 'jailed_ever', 'convict_ever', 'famstop_ever',
       'pk_deficit', 'pk_sen', 'pk_spend', 'birthright_a', 'birthright_b',
       'femoff_jobs', 'femoff_ed', 'femoff_spend', 'femoff_issues',
       'lpres_pleased', 'lpres_immig', 'lpres_la', 'vaccine', 'autism',
       'bo_muslim', 'bo_confid', 'amer_ident', 'race_ident', 'whitework',
       'whitejob', 'wguilt1', 'wguilt2', 'wguilt3', 'buycott', 'boycott',
       'skintone_mob', 'skintone', 'skin_discrim', 'africanam10_1',
       'white10_1', 'hispanic10_1', 'asianam10_1', 'nativeam10_1', 'other10_1',
       'other10_open', 'birthyr', 'gender', 'race', 'race_other', 'educ',
       'marstat', 'speakspanish', 'employ', 'employ_t', 'faminc', 'faminc2',
       'state', 'votereg', 'pid3', 'pid7', 'ideo5', 'newsint', 'pew_bornagain',
       'pew_churatd', 'religpew', 'religpew_t', 'ever_vs_12mo_rand'],
      dtype='object')

It can be difficult to read the column names when they are displayed as one long list. To see the column names in a more cleanly formatted table, use the .dtypes attribute described above or the .info() method with verbose=True. The .info() method also displays the dimensions of the data, a count of the data types across columns, and the memory used by the data frame:

anes.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 168 columns):
 #    Column             Dtype  
---   ------             -----  
 0    caseid             float64
 1    turnout12          int64  
 2    turnout12b         float64
 3    vote12             float64
 4    percent16          int64  
 5    meet               int64  
 6    givefut            int64  
 7    info               int64  
 8    march              int64  
 9    sign               int64  
 10   give12mo           int64  
 11   compromise         int64  
 12   ftobama            float64
 13   ftblack            float64
 14   ftwhite            int64  
 15   fthisp             float64
 16   ftgay              float64
 17   ftjeb              float64
 18   fttrump            float64
 19   ftcarson           float64
 20   fthrc              float64
 21   ftrubio            float64
 22   ftcruz             float64
 23   ftsanders          float64
 24   ftfiorina          float64
 25   ftpolice           float64
 26   ftfem              float64
 27   fttrans            float64
 28   ftmuslim           float64
 29   ftsci              float64
 30   reg                int64  
 31   demcand            int64  
 32   repcand            float64
 33   vote16jb           float64
 34   vote16bc           float64
 35   vote16tc           float64
 36   vote16mr           float64
 37   vote16dt           float64
 38   presjob            int64  
 39   lazyb              int64  
 40   lazyw              int64  
 41   lazyh              int64  
 42   lazym              int64  
 43   violentb           int64  
 44   violentw           int64  
 45   violenth           int64  
 46   violentm           int64  
 47   econnow            int64  
 48   econ12mo           int64  
 49   pid1d              float64
 50   pid2d              object 
 51   pid1r              float64
 52   pid2r              object 
 53   pidstr             float64
 54   pidlean            float64
 55   lcself             int64  
 56   lcd                int64  
 57   lcr                int64  
 58   lchc               int64  
 59   lcbo               int64  
 60   lcdt               int64  
 61   lcmr               int64  
 62   lctc               int64  
 63   srv_spend          int64  
 64   campfin            int64  
 65   immig_legal        int64  
 66   immig_numb         int64  
 67   equalpay           int64  
 68   parleave           int64  
 69   crimespend         int64  
 70   death              int64  
 71   terror_worry       float64
 72   terror_12mo        float64
 73   terror_local       int64  
 74   relig_bc           int64  
 75   relig_bcstr        float64
 76   relig_srv          int64  
 77   relig_srvstr       float64
 78   incgap20           int64  
 79   isis_troops        int64  
 80   syrians_a          float64
 81   syrians_b          float64
 82   pc_a               float64
 83   pc_b               float64
 84   minwage            int64  
 85   healthspend        int64  
 86   childcare          int64  
 87   getahead           int64  
 88   ladder             int64  
 89   finwell            int64  
 90   warm               int64  
 91   warmbad            int64  
 92   warmcause          int64  
 93   warmdo             int64  
 94   freetrade          int64  
 95   stopwhite          int64  
 96   stopblack          int64  
 97   forcewhite         int64  
 98   forceblack         int64  
 99   stop_12mo          float64
 100  arrested_12mo      float64
 101  charged_12mo       float64
 102  jailed_12mo        float64
 103  convict_12mo       float64
 104  famstop_12mo       float64
 105  stop_ever          float64
 106  arrested_ever      float64
 107  charged_ever       float64
 108  jailed_ever        float64
 109  convict_ever       float64
 110  famstop_ever       float64
 111  pk_deficit         int64  
 112  pk_sen             float64
 113  pk_spend           int64  
 114  birthright_a       float64
 115  birthright_b       float64
 116  femoff_jobs        float64
 117  femoff_ed          float64
 118  femoff_spend       float64
 119  femoff_issues      float64
 120  lpres_pleased      int64  
 121  lpres_immig        float64
 122  lpres_la           float64
 123  vaccine            int64  
 124  autism             int64  
 125  bo_muslim          int64  
 126  bo_confid          float64
 127  amer_ident         int64  
 128  race_ident         int64  
 129  whitework          float64
 130  whitejob           float64
 131  wguilt1            float64
 132  wguilt2            float64
 133  wguilt3            float64
 134  buycott            int64  
 135  boycott            int64  
 136  skintone_mob       int64  
 137  skintone           float64
 138  skin_discrim       int64  
 139  africanam10_1      float64
 140  white10_1          float64
 141  hispanic10_1       float64
 142  asianam10_1        float64
 143  nativeam10_1       int64  
 144  other10_1          float64
 145  other10_open       object 
 146  birthyr            int64  
 147  gender             int64  
 148  race               int64  
 149  race_other         object 
 150  educ               int64  
 151  marstat            int64  
 152  speakspanish       int64  
 153  employ             float64
 154  employ_t           object 
 155  faminc             float64
 156  faminc2            int64  
 157  state              float64
 158  votereg            int64  
 159  pid3               int64  
 160  pid7               float64
 161  ideo5              int64  
 162  newsint            int64  
 163  pew_bornagain      int64  
 164  pew_churatd        int64  
 165  religpew           float64
 166  religpew_t         object 
 167  ever_vs_12mo_rand  int64  
dtypes: float64(76), int64(86), object(6)
memory usage: 1.5+ MB

2.4.4. Descriptive Statistics#

One way to catch some big errors is to display statistics derived from the columns of the data frame, such as the mean and minimum and maximum values. For example, in the ANES data, ftobama records people’s responses to the question: “On a scale from 0 to 100, with 0 being the coldest and 100 being the warmest, how warmly do you feel towards Barack Obama?” I can display various summary statistics for this feature with the following code:

anes[['ftobama']].describe()
ftobama
count 1193.000000
mean 50.442582
std 54.350731
min 0.000000
25% 5.000000
50% 54.000000
75% 87.000000
max 998.000000

The .describe() method displays these summary statistics, and typing anes[['ftobama']] limits the results to just this one feature (we will discuss subsetting data by columns later in this course). Notice anything strange in the results? The mean value of about 50 seems reasonable, as do the standard deviation and percentiles. But the maximum value is 998 for a feature which by design should be no larger than 100. It turns out that throughout the data 998 is one of a few numeric codes that indicates a nonresponse. We can replace these values with missing values from the outset, and we will do so below in the Data with Numeric Missing Codes section.

To change the percentiles that are displayed, use the percentiles parameter and provide a list of the desired percentiles. To show percentiles in increments of 20%, type:

anes[['ftobama']].describe(percentiles = [.2, .4, .6, .8])
ftobama
count 1193.000000
mean 50.442582
std 54.350731
min 0.000000
20% 3.000000
40% 31.000000
50% 54.000000
60% 70.000000
80% 90.000000
max 998.000000

Note that the median (50%) is always displayed.

By default the .describe() method displays summary statistics for all columns with int and float data types. To see the summary statistics for the object type columns, use the include parameter:

anes.describe(include="object")
pid2d pid2r other10_open race_other employ_t religpew_t
count 1200 1199 94 1200 1200 1200
unique 37 31 57 1 1 11
top __NA__ __NA__ 0 __NA__ __NA__ __NA__
freq 1159 1162 17 1200 1200 1188

object type columns are treated as text, which means that statistics like the mean and median are impossible to calculate. Instead the .describe() method gives us the count of non-missing observations, the number of unique observations, the most frequent value of the feature, and the number of times the most frequent value appears.

2.5. Identifying and Solving Problems with Text-Based Data Files#

There are many ways that text-based data files can be arranged that cause problems for loading the data into Python and for using the data for analyses. Please note that not every problem needs to be solved in the loading step: the entire pandas library exists for cleaning data once it is loaded into Python’s memory. But some problems are more efficiently solved as we load the file. Here we will discuss some common problems and strategies for solving them while loading the data.

I’ve uploaded to the GitHub page several additional versions of the ANES data that I messed up on purpose.

2.5.1. Comments Prior to the Header#

I created a version of the ANES data with four rows of comments prior to the column names. The following lines of code display the raw text of this file without trying to load the data frame:

url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_toplines.csv"
import requests
file = requests.get(url)
print(file.text[0:1000])
THIS IS THE ANES 2016 PILOT STUDY
DOWNLOADED AND EDITED BY JON KROPKO
WHO ALSO ADDED THESE LINES AT THE TOP OF THE DATA FILE
TO ANNOY HIS STUDENTS
caseid,turnout12,turnout12b,vote12,percent16,meet,givefut,info,march,sign,give12mo,compromise,ftobama,ftblack,ftwhite,fthisp,ftgay,ftjeb,fttrump,ftcarson,fthrc,ftrubio,ftcruz,ftsanders,ftfiorina,ftpolice,ftfem,fttrans,ftmuslim,ftsci,reg,demcand,repcand,vote16jb,vote16bc,vote16tc,vote16mr,vote16dt,presjob,lazyb,lazyw,lazyh,lazym,violentb,violentw,violenth,violentm,econnow,econ12mo,pid1d,pid2d,pid1r,pid2r,pidstr,pidlean,lcself,lcd,lcr,lchc,lcbo,lcdt,lcmr,lctc,srv_spend,campfin,immig_legal,immig_numb,equalpay,parleave,crimespend,death,terror_worry,terror_12mo,terror_local,relig_bc,relig_bcstr,relig_srv,relig_srvstr,incgap20,isis_troops,syrians_a,syrians_b,pc_a,pc_b,minwage,healthspend,childcare,getahead,ladder,finwell,warm,warmbad,warmcause,warmdo,freetrade,stopwhite,stopblack,forcewhite,forceblack,stop_12mo,arrested_12mo,charged_12mo,jailed_12

If I try to load the data file as it currently exists by typing

anes = pd.read_csv(url)

I get the following error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 168

This error occurs because the first line tells the pd.read_csv() function that there is one column in the data, as there are no commas on this line. So by the time the column names appear there are more commas than the parser knows how to handle.

To solve this problem, use the header parameter to tell pd.read_csv() how many rows of the data file are taken up by the unwanted header: header=4 skips the first four rows and treats the next row as the column names. This function always supposes that the column names are on the first row after the header.

anes = pd.read_csv(url, header = 4)
anes
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 2
1 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
2 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
3 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
4 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1195 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1196 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1197 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1198 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1199 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

1200 rows × 168 columns

2.5.2. Data with Numeric Missing Codes#

I created a version of the data that uses -999 throughout to represent various missing values, such as when a survey respondent refuses to answer a question:

url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_missing.csv"
anes = pd.read_csv(url)
anes
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1 1 -999 2 100 ... 2 6 -999 __NA__ 2
1 2 2 -999 -999 50 ... 2 4 2 __NA__ 1
2 3 1 -999 1 100 ... 2 6 1 __NA__ 1
3 4 1 -999 2 100 ... 2 6 -999 __NA__ 1
4 5 1 -999 1 100 ... 2 2 3 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1195 1196 1 -999 2 100 ... 2 1 1 __NA__ 2
1196 1197 1 -999 1 80 ... 1 2 1 __NA__ 1
1197 1198 1 -999 2 95 ... 2 6 -999 __NA__ 2
1198 1199 1 -999 2 80 ... 2 4 1 __NA__ 1
1199 1200 2 -999 -999 0 ... 2 2 2 __NA__ 1

1200 rows × 168 columns

We can always proceed with the data frame as it is, and use various functions in pandas to replace these -999 values. But in this case, it is easier to read these values as missing while we load the data. We can do that with the na_values parameter. Then all of the -999 values are replaced with NaN, which Python recognizes as a missing datapoint.

anes = pd.read_csv(url, na_values = -999)
anes
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 2
1 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
2 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
3 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
4 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1195 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1196 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1197 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1198 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1199 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

1200 rows × 168 columns
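If a file uses more than one numeric code for missing values, na_values also accepts a list, or a dictionary mapping column names to codes. A hypothetical sketch, supposing that both -999 and 998 mark nonresponses:

anes = pd.read_csv(url, na_values = [-999, 998])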

2.5.3. Comments Inside the Data#

Suppose that you work with a data file in which the data provider left comments throughout the data. That’s heresy. But it can definitely happen. Hopefully the commenter had enough forethought to place a uniform character in front of these comments. If so, we can remove the comments while loading the data. Consider the following data file:

url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_comments.txt"
anes = pd.read_csv(url, header = 4)
anes
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1 1.0 NaN 2.0 100.0 ... 2.0 6.0 NaN __NA__ 2.0
1 2 2.0 NaN NaN 50.0 ... 2.0 4.0 2.0 __NA__ 1.0
2 3 1.0 NaN 1.0 100.0 ... 2.0 6.0 1.0 __NA__ 1.0
3 @NEXT COMES A BUNCH MORE OBSERVATIONS! NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
4 4 1.0 NaN 2.0 100.0 ... 2.0 6.0 NaN __NA__ 1.0
... ... ... ... ... ... ... ... ... ... ... ...
1197 1196 1.0 NaN 2.0 100.0 ... 2.0 1.0 1.0 __NA__ 2.0
1198 1197 1.0 NaN 1.0 80.0 ... 1.0 2.0 1.0 __NA__ 1.0
1199 1198 1.0 NaN 2.0 95.0 ... 2.0 6.0 NaN __NA__ 2.0
1200 1199 1.0 NaN 2.0 80.0 ... 2.0 4.0 1.0 __NA__ 1.0
1201 1200 2.0 NaN NaN 0.0 ... 2.0 2.0 2.0 __NA__ 1.0

1202 rows × 168 columns

To ignore these comments, use the comment parameter:

anes = pd.read_csv(url, header = 4, comment = '@')
anes
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 2
1 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
2 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
3 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
4 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1195 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1196 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1197 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1198 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1199 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

1200 rows × 168 columns

2.5.4. Data Without Column Names#

The pd.read_csv() function assumes that the column names are listed on the first line of the data file, unless the header parameter is used, in which case it assumes the column names are on the first line after the header. But if the data do not contain a row of column names, then Python will use the first line of valid data as the column names. I created a version of the ANES data with the column names deleted. Consider what happens when I load this data file normally:

url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_nocolnames.csv"
anes = pd.read_csv(url)
anes
1 1.1 NA 2 100 ... 2.24 6.7 NA.19 __NA__.4 2.25
0 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
1 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
2 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
3 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
4 6.0 1 NaN 3.0 100 ... 2 2 1.0 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1194 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1195 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1196 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1197 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1198 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

1199 rows × 168 columns

The column names are not descriptive of what the columns represent. But worse than that, these names are the datapoints from the first row of the data. That means we’ve lost an observation. To deal with this problem, use the header=None parameter.

anes = pd.read_csv(url, header=None)
anes
0 1 2 3 4 ... 163 164 165 166 167
0 1.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 2
1 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
2 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
3 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
4 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1195 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1196 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1197 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1198 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1199 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

1200 rows × 168 columns

While we still have the issue that the column names are non-descriptive, at least they are logically labeled in numeric order from left to right, and we haven’t lost the first observation. In older versions of pandas we could attach the prefix “X” to each of these numbers by adding the prefix="X" parameter, but this parameter has since been removed:

anes = pd.read_csv(url, header=None, prefix="X")
anes
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[23], line 1
----> 1 anes = pd.read_csv(url, header=None, prefix="X")
      2 anes

TypeError: read_csv() got an unexpected keyword argument 'prefix'
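A simple alternative in current versions of pandas is to rename the columns after loading with the .add_prefix() method, a minimal sketch:

anes = pd.read_csv(url, header=None)
anes = anes.add_prefix("X") #column names become X0, X1, ..., X167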

If the data do not contain a header row for variable names, and we want the variables to have the right names, we have to define the names separately and pass this list using the names parameter:

col_names = ['caseid', 'turnout12', 'turnout12b', 'vote12', 'percent16', 'meet',
       'givefut', 'info', 'march', 'sign', 'give12mo', 'compromise', 'ftobama',
       'ftblack', 'ftwhite', 'fthisp', 'ftgay', 'ftjeb', 'fttrump', 'ftcarson',
       'fthrc', 'ftrubio', 'ftcruz', 'ftsanders', 'ftfiorina', 'ftpolice',
       'ftfem', 'fttrans', 'ftmuslim', 'ftsci', 'reg', 'demcand', 'repcand',
       'vote16jb', 'vote16bc', 'vote16tc', 'vote16mr', 'vote16dt', 'presjob',
       'lazyb', 'lazyw', 'lazyh', 'lazym', 'violentb', 'violentw', 'violenth',
       'violentm', 'econnow', 'econ12mo', 'pid1d', 'pid2d', 'pid1r', 'pid2r',
       'pidstr', 'pidlean', 'lcself', 'lcd', 'lcr', 'lchc', 'lcbo', 'lcdt',
       'lcmr', 'lctc', 'srv_spend', 'campfin', 'immig_legal', 'immig_numb',
       'equalpay', 'parleave', 'crimespend', 'death', 'terror_worry',
       'terror_12mo', 'terror_local', 'relig_bc', 'relig_bcstr', 'relig_srv',
       'relig_srvstr', 'incgap20', 'isis_troops', 'syrians_a', 'syrians_b',
       'pc_a', 'pc_b', 'minwage', 'healthspend', 'childcare', 'getahead',
       'ladder', 'finwell', 'warm', 'warmbad', 'warmcause', 'warmdo',
       'freetrade', 'stopwhite', 'stopblack', 'forcewhite', 'forceblack',
       'stop_12mo', 'arrested_12mo', 'charged_12mo', 'jailed_12mo',
       'convict_12mo', 'famstop_12mo', 'stop_ever', 'arrested_ever',
       'charged_ever', 'jailed_ever', 'convict_ever', 'famstop_ever',
       'pk_deficit', 'pk_sen', 'pk_spend', 'birthright_a', 'birthright_b',
       'femoff_jobs', 'femoff_ed', 'femoff_spend', 'femoff_issues',
       'lpres_pleased', 'lpres_immig', 'lpres_la', 'vaccine', 'autism',
       'bo_muslim', 'bo_confid', 'amer_ident', 'race_ident', 'whitework',
       'whitejob', 'wguilt1', 'wguilt2', 'wguilt3', 'buycott', 'boycott',
       'skintone_mob', 'skintone', 'skin_discrim', 'africanam10_1',
       'white10_1', 'hispanic10_1', 'asianam10_1', 'nativeam10_1', 'other10_1',
       'other10_open', 'birthyr', 'gender', 'race', 'race_other', 'educ',
       'marstat', 'speakspanish', 'employ', 'employ_t', 'faminc', 'faminc2',
       'state', 'votereg', 'pid3', 'pid7', 'ideo5', 'newsint', 'pew_bornagain',
       'pew_churatd', 'religpew', 'religpew_t', 'ever_vs_12mo_rand']
anes = pd.read_csv(url, header=None, names=col_names)
anes
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 2
1 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
2 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
3 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
4 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1195 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1196 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1197 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1198 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1199 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

1200 rows × 168 columns

2.5.5. Delimiters Other Than Commas#

Here’s a version of the ANES data that is delimited by tabs:

url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_tab.txt"
file = requests.get(url)
print(file.text[0:2000])
caseid	turnout12	turnout12b	vote12	percent16	meet	givefut	info	march	sign	give12mo	compromise	ftobama	ftblack	ftwhite	fthisp	ftgay	ftjeb	fttrump	ftcarson	fthrc	ftrubio	ftcruz	ftsanders	ftfiorina	ftpolice	ftfem	fttrans	ftmuslim	ftsci	reg	demcand	repcand	vote16jb	vote16bc	vote16tc	vote16mr	vote16dt	presjob	lazyb	lazyw	lazyh	lazym	violentb	violentw	violenth	violentm	econnow	econ12mo	pid1d	pid2d	pid1r	pid2r	pidstr	pidlean	lcself	lcd	lcr	lchc	lcbo	lcdt	lcmr	lctc	srv_spend	campfin	immig_legal	immig_numb	equalpay	parleave	crimespend	death	terror_worry	terror_12mo	terror_local	relig_bc	relig_bcstr	relig_srv	relig_srvstr	incgap20	isis_troops	syrians_a	syrians_b	pc_a	pc_b	minwage	healthspend	childcare	getahead	ladder	finwell	warm	warmbad	warmcause	warmdo	freetrade	stopwhite	stopblack	forcewhite	forceblack	stop_12mo	arrested_12mo	charged_12mo	jailed_12mo	convict_12mo	famstop_12mo	stop_ever	arrested_ever	charged_ever	jailed_ever	convict_ever	famstop_ever	pk_deficit	pk_sen	pk_spend	birthright_a	birthright_b	femoff_jobs	femoff_ed	femoff_spend	femoff_issues	lpres_pleased	lpres_immig	lpres_la	vaccine	autism	bo_muslim	bo_confid	amer_ident	race_ident	whitework	whitejob	wguilt1	wguilt2	wguilt3	buycott	boycott	skintone_mob	skintone	skin_discrim	africanam10_1	white10_1	hispanic10_1	asianam10_1	nativeam10_1	other10_1	other10_open	birthyr	gender	race	race_other	educ	marstat	speakspanish	employ	employ_t	faminc	faminc2	state	votereg	pid3	pid7	ideo5	newsint	pew_bornagain	pew_churatd	religpew	religpew_t	ever_vs_12mo_rand
1	1	NA	2	100	1	3	4	1	2	2	1	100	100	100	100	96	36	1	5	76	31	0	84	2	51	62	97	20	100	1	1	6	NA	NA	NA	1	1	1	5	5	5	5	5	5	5	5	2	2	NA	__NA__	1	__NA__	1	NA	4	3	4	5	5	4	4	4	7	1	1	2	1	2	5	7	5	4	5	2	1	2	1	1	7	NA	2	NA	3	1	1	1	4	6	6	1	2	1	1	4	2	4	2	4	NA	NA	NA	NA	NA	NA	1	2	2	2	2	2	1	6	1	NA	2	NA	4	4	NA	5	NA	5	1	6	2	1	4	4	5	5	5	5	5	2	2	99	2	5	0	10	0	0	0	0	NA	1960	1	1	__NA__	5	1	4	1	__NA__	4	99	6	1	1	1	6	1	2	6	NA	__NA__	2
2	2	NA	NA	50	4	5	4	2	2	2	1	39	6	74	6	75	28	28	6	52	21	10	13	6	65	53	49	2

The pd.read_csv() function also loads text-based data files with delimiters other than commas. To load tab-delimited data, use the sep="\t" parameter:

anes = pd.read_csv(url, sep="\t")
anes
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 2
1 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
2 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
3 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
4 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1195 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1196 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1197 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1198 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1199 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

1200 rows × 168 columns

For delimiters other than a tab, simply type that character in quotes with the sep parameter. Here’s another version of the ANES data, delimited by semicolons:

url = "https://raw.githubusercontent.com/jkropko/DS-6001/master/localdata/anes_example_semicolon.txt"
file = requests.get(url)
print(file.text[0:2000])
caseid;turnout12;turnout12b;vote12;percent16;meet;givefut;info;march;sign;give12mo;compromise;ftobama;ftblack;ftwhite;fthisp;ftgay;ftjeb;fttrump;ftcarson;fthrc;ftrubio;ftcruz;ftsanders;ftfiorina;ftpolice;ftfem;fttrans;ftmuslim;ftsci;reg;demcand;repcand;vote16jb;vote16bc;vote16tc;vote16mr;vote16dt;presjob;lazyb;lazyw;lazyh;lazym;violentb;violentw;violenth;violentm;econnow;econ12mo;pid1d;pid2d;pid1r;pid2r;pidstr;pidlean;lcself;lcd;lcr;lchc;lcbo;lcdt;lcmr;lctc;srv_spend;campfin;immig_legal;immig_numb;equalpay;parleave;crimespend;death;terror_worry;terror_12mo;terror_local;relig_bc;relig_bcstr;relig_srv;relig_srvstr;incgap20;isis_troops;syrians_a;syrians_b;pc_a;pc_b;minwage;healthspend;childcare;getahead;ladder;finwell;warm;warmbad;warmcause;warmdo;freetrade;stopwhite;stopblack;forcewhite;forceblack;stop_12mo;arrested_12mo;charged_12mo;jailed_12mo;convict_12mo;famstop_12mo;stop_ever;arrested_ever;charged_ever;jailed_ever;convict_ever;famstop_ever;pk_deficit;pk_sen;pk_spend;birthright_a;birthright_b;femoff_jobs;femoff_ed;femoff_spend;femoff_issues;lpres_pleased;lpres_immig;lpres_la;vaccine;autism;bo_muslim;bo_confid;amer_ident;race_ident;whitework;whitejob;wguilt1;wguilt2;wguilt3;buycott;boycott;skintone_mob;skintone;skin_discrim;africanam10_1;white10_1;hispanic10_1;asianam10_1;nativeam10_1;other10_1;other10_open;birthyr;gender;race;race_other;educ;marstat;speakspanish;employ;employ_t;faminc;faminc2;state;votereg;pid3;pid7;ideo5;newsint;pew_bornagain;pew_churatd;religpew;religpew_t;ever_vs_12mo_rand
1;1;NA;2;100;1;3;4;1;2;2;1;100;100;100;100;96;36;1;5;76;31;0;84;2;51;62;97;20;100;1;1;6;NA;NA;NA;1;1;1;5;5;5;5;5;5;5;5;2;2;NA;__NA__;1;__NA__;1;NA;4;3;4;5;5;4;4;4;7;1;1;2;1;2;5;7;5;4;5;2;1;2;1;1;7;NA;2;NA;3;1;1;1;4;6;6;1;2;1;1;4;2;4;2;4;NA;NA;NA;NA;NA;NA;1;2;2;2;2;2;1;6;1;NA;2;NA;4;4;NA;5;NA;5;1;6;2;1;4;4;5;5;5;5;5;2;2;99;2;5;0;10;0;0;0;0;NA;1960;1;1;__NA__;5;1;4;1;__NA__;4;99;6;1;1;1;6;1;2;6;NA;__NA__;2
2;2;NA;NA;50;4;5;4;2;2;2;1;39;6;74;6;75;28;28;6;52;21;10;13;6;65;53;49;2

To load the data:

anes = pd.read_csv(url, sep=";")
anes
caseid turnout12 turnout12b vote12 percent16 ... pew_bornagain pew_churatd religpew religpew_t ever_vs_12mo_rand
0 1.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 2
1 2.0 2 NaN NaN 50 ... 2 4 2.0 __NA__ 1
2 3.0 1 NaN 1.0 100 ... 2 6 1.0 __NA__ 1
3 4.0 1 NaN 2.0 100 ... 2 6 NaN __NA__ 1
4 5.0 1 NaN 1.0 100 ... 2 2 3.0 __NA__ 2
... ... ... ... ... ... ... ... ... ... ... ...
1195 1196.0 1 NaN 2.0 100 ... 2 1 1.0 __NA__ 2
1196 1197.0 1 NaN 1.0 80 ... 1 2 1.0 __NA__ 1
1197 1198.0 1 NaN 2.0 95 ... 2 6 NaN __NA__ 2
1198 1199.0 1 NaN 2.0 80 ... 2 4 1.0 __NA__ 1
1199 1200.0 2 NaN NaN 0 ... 2 2 2.0 __NA__ 1

1200 rows × 168 columns

2.5.6. Fixed-width Files#

A fixed-width file contains no delimiters. Instead, it aligns all of the data for one variable at the same position on each row. These files generally do not store variable names, and can take up less space than a CSV. But that makes the data impossible to parse without an external list of where each variable is stored. The first and most important step is to get this list.

For this example, I pulled data from a public opinion survey conducted by the National Journal a few years ago. The data are available from the Roper Center for Public Opinion Research, and the raw data are in fixed-width format. The codebook is stored on GitHub; take a look. The features are listed on page 3, along with their column positions.

Loading a fixed-width file is a two-step process. First, we use the codebook to create lists of the variable names, and either the width of each variable (how many columns each variable takes up) or the starting and ending position of each variable. Here we create both lists, although you will only need one of the two.

For the starting and ending position of each variable, we create a list of length 2 for each variable, where the first element is the column the previous variable ends on (or 0 for the first variable) and the second element is the column the current variable ends on. For example, if a variable occupies columns 34, 35, and 36, its list of length 2 is [33,36]. Then we store the lists-of-2 in a list-of-lists.

url = "https://raw.githubusercontent.com/NovaVolunteer/Practice_Application_DS/master/Week%205/njcc33850.dat"

datanames = ['psraid', 'sample', 'int_date', 'area',
'state', 'cregion', 'density', 'usr', 'cc1', 'cc1a',
'cc2', 'cc3', 'cc4', 'cc5', 'cc6', 'cc7', 'ql1', 'ql1a',
'qc1', 'hh1', 'employ', 'par', 'sex', 'age', 'educ2',
'hisp', 'race', 'inc', 'income', 'reg', 'party',
'partyln', 'iphoneus', 'hphoneus', 'recage', 'receduc',
'racethn', 'standwt', 'raceos']

datawidths = [6, 1, 6, 3, 2, 1, 1, 3, 1, 1, 
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
            1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 
            1, 1, 1, 1, 1, 1, 1, 4, 30]

datapos = [[0,6], [6,7], [7,13], [13,16], [16,18],
    [18,19], [19,20], [20,23], [23,24], [24,25],
    [25,26], [26,27], [27,28], [28,29], [29,30],
    [30,31], [31,32], [32,33], [33,34], [34,35],
    [35,36], [36,37], [37,38], [38,40], [40,41],
    [41,42], [42,43], [43,45], [45,46], [46,47],
    [47,48], [48,49], [49,50], [50,51], [51,52],
    [52,53], [53,54], [54,58], [58,88]]

To load the fixed-width data into Python, we use the pd.read_fwf() function. We specify the location (or URL) of the data file and the names of the columns. And we can either specify the widths of each column, or the starting and ending positions (but not both). Using the widths:

njcc = pd.read_fwf(url, widths=datawidths, header=None, names=datanames)
njcc
psraid sample int_date area state ... recage receduc racethn standwt raceos
0 100001 1 121513 540 51 ... 2 1 1 2.45 NaN
1 100006 1 121213 248 26 ... 5 3 1 0.67 NaN
2 100011 1 121213 270 21 ... 6 3 1 0.37 NaN
3 100020 1 121213 803 45 ... 6 1 2 0.95 NaN
4 100023 1 121213 203 9 ... 6 1 1 0.85 NaN
... ... ... ... ... ... ... ... ... ... ... ...
995 203257 2 121513 901 47 ... 2 1 2 2.34 NaN
996 203262 2 121513 701 31 ... 2 3 1 1.29 NaN
997 203264 2 121513 619 36 ... 2 3 1 2.41 NaN
998 203273 2 121513 415 6 ... 3 3 2 1.53 NaN
999 203275 2 121513 541 41 ... 3 3 4 1.14 NaN

1000 rows × 39 columns

And using the starting and ending positions:

njcc = pd.read_fwf(url, colspecs=datapos, header=None, names=datanames)
njcc
psraid sample int_date area state ... recage receduc racethn standwt raceos
0 100001 1 121513 540 51 ... 2 1 1 2.45 NaN
1 100006 1 121213 248 26 ... 5 3 1 0.67 NaN
2 100011 1 121213 270 21 ... 6 3 1 0.37 NaN
3 100020 1 121213 803 45 ... 6 1 2 0.95 NaN
4 100023 1 121213 203 9 ... 6 1 1 0.85 NaN
... ... ... ... ... ... ... ... ... ... ... ...
995 203257 2 121513 901 47 ... 2 1 2 2.34 NaN
996 203262 2 121513 701 31 ... 2 3 1 1.29 NaN
997 203264 2 121513 619 36 ... 2 3 1 2.41 NaN
998 203273 2 121513 415 6 ... 3 3 2 1.53 NaN
999 203275 2 121513 541 41 ... 3 3 4 1.14 NaN

1000 rows × 39 columns

2.6. Loading Other Kinds of Electronic Data Files#

CSV, ASCII, and other kinds of plain text files are very common in fields that primarily use open-source programming environments like Python and R, as these file types are lightweight and universal. But there are many more people out there who work in fields that use specific, proprietary data analysis software. It can be a real challenge to load a specific file format in a context outside of its native software environment. There is specialized software that exists only to transfer proprietary data from one environment to another. One such package is Stat/Transfer, which can easily convert data among Stata, SAS, SPSS, and Excel formats for $129 a year for academic users.

If you are using Stata, SAS, SPSS, Excel, or another program, or if you are working with someone who uses that software, and you want to transfer data to Python, the easiest approach is to save the working data as a CSV file and then load it into Python with pd.read_csv(). All of those software packages have the functionality to save data as CSV.

That doesn’t help you if you are dealing with an SPSS file and you do not have access to SPSS. Fortunately, Python has functions to load data from pretty much every software environment. Here’s a rundown of these functions.

2.6.1. Loading Excel Files#

There are a few ways to load Excel files in Python, but a great option is the pd.read_excel() function in pandas. One big way that Excel files differ from other data formats is that a single document can contain multiple sheets. The main challenge in loading an Excel file in Python is dealing with the sheets.

I’ve saved an Excel file on GitHub that contains information about NBA teams during the 2018-2019 season. This Excel document contains four sheets, named “NBA-TEAM-SAMPLE”, “METADATA”, “TEAMS”, and “CONVERT DATE FORMAT”. All four sheets have visual formatting. I want to load the “TEAMS” sheet without carrying forward any code that refers to the visual appearance of this sheet.

The first argument of pd.read_excel() is the path, as with pd.read_csv(), and it can take a URL. The second argument is sheet_name. If the Excel file has sheets with names, you can type the name of the sheet here. Or type a number: 0 refers to the first sheet, 1 to the second, etc. This function loads the data directly to a data frame, and ignores other graphical elements of Excel sheets, such as shading particular cells or using fonts.

url = "https://github.com/NovaVolunteer/Practice_Application_DS/blob/master/Week%205/NBA-Team-Sample-BoxScore-Dataset.xlsx?raw=true"
nba = pd.read_excel(url, sheet_name="TEAMS")
nba
INITIALS LONG NAME SHORT NAME CONFERENCE DIVISION
0 Atl Atlanta Hawks Atlanta East Southeast
1 Bos Boston Celtics Boston East Atlantic
2 Bro Brooklyn Nets Brooklyn East Atlantic
3 Cha Charlotte Hornets Charlotte East Southeast
4 Chi Chicago Bulls Chicago East Central
... ... ... ... ... ...
25 Sac Sacramento Kings Sacramento West Pacific
26 San San Antonio Spurs San Antonio West Southwest
27 Tor Toronto Raptors Toronto East Atlantic
28 Uta Utah Jazz Utah West Northwest
29 Was Washington Wizards Washington East Southeast

30 rows × 5 columns

If you specify more than one sheet within the sheet_name parameter using a list, pd.read_excel() will produce a dictionary of data frames, one for each sheet you specify. Typing sheet_name = None produces a dictionary with all of the sheets. For example, to load the “NBA-TEAM-SAMPLE” and “TEAMS” sheets, and save them as two separate data frames within a dictionary, type:

nba = pd.read_excel(url, sheet_name=[0,2])
nba[0]
DATASET GAME-ID DATE TEAM VENUE ... CLOSING TOTAL MONEYLINE HALFTIME BOX SCORE\nURL ODDS\nURL
0 NBA 2018-2019 Regular Season 21800001 10/16/2018 Philadelphia R ... 211.5 170 NaN NaN NaN
1 NBA 2018-2019 Regular Season 21800001 10/16/2018 Boston H ... 211.5 -200 NaN BOX SCORE ODDS
2 NBA 2018-2019 Regular Season 21800002 10/16/2018 Oklahoma City R ... 220.5 711 NaN NaN NaN
3 NBA 2018-2019 Regular Season 21800002 10/16/2018 Golden State H ... 220.5 -1100 NaN BOX SCORE ODDS
4 NBA 2018-2019 Regular Season 21800003 10/17/2018 Milwaukee R ... 222.0 -165 +2.5 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
60 NBA 2018-2019 Regular Season 21800031 10/20/2018 Detroit R ... 217.0 -200 -2.5 NaN NaN
61 NBA 2018-2019 Regular Season 21800031 10/20/2018 Chicago H ... 217.0 170 106.5u15 BOX SCORE ODDS
62 NBA 2018-2019 Regular Season 21800032 10/20/2018 Minnesota R ... 220.5 110 110 NaN NaN
63 NBA 2018-2019 Regular Season 21800032 10/20/2018 Dallas H ... 220.5 -130 -1 -05 BOX SCORE ODDS
64 NBA 2018-2019 Regular Season 21800033 10/20/2018 Phoenix R ... 224.5 459 110.5 NaN NaN

65 rows × 57 columns

In this case, because we explicitly requested only the sheets indexed 0 and 2, the dictionary contains only those two keys. To access the “TEAMS” sheet, use the key 2 on the nba dictionary:

nba[2]
INITIALS LONG NAME SHORT NAME CONFERENCE DIVISION
0 Atl Atlanta Hawks Atlanta East Southeast
1 Bos Boston Celtics Boston East Atlantic
2 Bro Brooklyn Nets Brooklyn East Atlantic
3 Cha Charlotte Hornets Charlotte East Southeast
4 Chi Chicago Bulls Chicago East Central
... ... ... ... ... ...
25 Sac Sacramento Kings Sacramento West Pacific
26 San San Antonio Spurs San Antonio West Southwest
27 Tor Toronto Raptors Toronto East Atlantic
28 Uta Utah Jazz Utah West Northwest
29 Was Washington Wizards Washington East Southeast

30 rows × 5 columns

2.6.2. Loading SAS, Stata, and SPSS Files#

For SAS files with extensions .sas7bdat or .xport, or Stata files with extension .dta, the pd.read_sas() and pd.read_stata() functions work just like other data parsing functions in pandas. We can pass the file path, the file name alone (if we’ve set the working directory), or a URL to these functions, and they load the data directly to a data frame.

On GitHub, I’ve saved a SAS file containing the monthly inflation rate in the United States since 1983, and a Stata file containing a CBS public opinion poll. To load the SAS file, I type:

url = "https://github.com/jkropko/DS-6001/raw/master/localdata/inflation.sas7bdat"
inflation = pd.read_sas(url)
inflation
YEAR MONTH WAGE PCWAGE CPI INFLN
0 1983.0 12.0 8.320000 NaN 101.400002 NaN
1 1984.0 1.0 8.370000 0.599163 102.099998 0.687963
2 1984.0 2.0 8.360000 -0.119546 102.599998 0.488521
3 1984.0 3.0 8.400000 0.477328 102.900002 0.291971
4 1984.0 4.0 8.440000 0.475060 103.300003 0.387973
... ... ... ... ... ... ...
265 2006.0 1.0 16.400000 0.305344 199.000000 0.655409
266 2006.0 2.0 16.469999 0.425921 199.100006 0.050239
267 2006.0 3.0 16.510000 0.242571 199.800003 0.350965
268 2006.0 4.0 16.610001 0.603867 201.000000 0.598804
269 2006.0 5.0 16.620001 0.060187 201.899994 0.446762

270 rows × 6 columns

And to load the Stata file:

url = "https://github.com/jkropko/DS-6001/raw/master/localdata/cbspoll.dta"
cbspoll = pd.read_stata(url)
cbspoll
CASEID FEMM DAYS SEX INTN ... NPH0 NM COVR ADLT AREACD
0 16715.0 1.0 Tuesday Female 708.0 ... NaN Willing, talkative 4.0 4.0 818.0
1 19510.0 1.0 Monday Female 256.0 ... 1.0 Willing, talkative 3.0 2.0 512.0
2 13448.0 1.0 Sunday Female 302.0 ... NaN Willing, talkative 5.0 2.0 218.0
3 9825.0 1.0 Tuesday Female 150.0 ... NaN Willing, talkative 6.0 1.0 660.0
4 6504.0 NaN Thursday Male 246.0 ... NaN Willing, talkative 6.0 2.0 207.0
... ... ... ... ... ... ... ... ... ... ... ...
1177 6389.0 NaN Thursday Male 303.0 ... NaN Not willing 1.0 4.0 540.0
1178 1090.0 NaN Wednesday Female 230.0 ... NaN Not willing 5.0 3.0 254.0
1179 19795.0 1.0 Tuesday Female 384.0 ... NaN Willing, not talkative 1.0 2.0 318.0
1180 20001.0 1.0 Wednesday Male 150.0 ... NaN Not willing 6.0 2.0 702.0
1181 14040.0 1.0 Monday Female 189.0 ... 1.0 Willing, talkative 5.0 1.0 410.0

1182 rows × 115 columns

SPSS files have the file extension .sav, and can be loaded with pd.read_spss() in the same way. One issue (at the time this notebook was written) is that the pd.read_spss() function only accepts local files, and not files from URLs. I saved data from a public opinion survey on GitHub. If you want to try loading it into Python, download the file and move it to the folder where you’ve set your working directory (or type out the whole file path in the following code), and type:

survey = pd.read_spss("survey.sav")
survey
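In the meantime, one workaround is to download the file with the requests library (used earlier in this notebook) and then read the local copy. A minimal sketch:

import requests
url = "https://github.com/jkropko/DS-6001/raw/master/localdata/survey.sav"
file = requests.get(url)
with open("survey.sav", "wb") as f:
    f.write(file.content) #save the raw bytes to the working directory
survey = pd.read_spss("survey.sav")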

Eventually I expect this pd.read_spss() function to be able to accept URLs as well. At that point, the following code should also work:

url = "https://github.com/jkropko/DS-6001/raw/master/localdata/survey.sav"
survey = pd.read_spss(url)
survey

2.7. Saving CSV and ASCII Files to Disk#

Suppose we’ve done all the steps needed to clean and manage the data. We might want to save a clean version of the data as a CSV or other ASCII file on our local disk. We can do so by applying the .to_csv() method to the anes data frame. The first argument is the filename, with whatever extension we want for the saved file. As with pd.read_csv(), we can also specify the sep parameter to choose a delimiter for the text-based data file we are creating. Let’s save the ANES dataframe as “anes_cleaned.csv” in our working directory:

anes.to_csv("anes_cleaned.csv", sep=",")
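Note that by default .to_csv() also writes the data frame’s row index as an unnamed first column in the saved file. To leave the index out, set index=False:

anes.to_csv("anes_cleaned.csv", index=False)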

For a tab-separated file we can employ the more general “.txt” extension:

anes.to_csv("anes_cleaned.txt", sep="\t")