Static Data Visualizations

11. Static Data Visualizations

11.1. Introduction: What Makes For a Good Data Visualization?

Data visualizations are graphical representations of distributions and relationships in the data. They are essential tools for exploratory data analysis (EDA), and they are important ways to communicate the results of statistical and machine learning models. In Exploratory Data Analysis, John W. Tukey argues that visualizations not only help us see problems in the data but can also reveal patterns that theory alone never would have led us to examine.

Visualizations are among the best tools we have for communication. Data and statistics can be technical and dependent on terminology and jargon that are inaccessible without a great deal of training. If we need to show our data and our results to a manager, to a client, to the public, or to another audience, we can use visualizations to quickly and clearly tell the stories that the evidence in our data supports. Pictures are also efficient: even if someone can understand a technical discussion, a good picture can convey the information more quickly than a table, and more accurately as well because we can better express the context surrounding a result with a visualization.

But like any mode of communication, there are more effective and less effective ways to visualize data. Some visualizations are confusing. Other visualizations are misleading. Our goal is to create graphics that tell a clear story that a general audience can quickly understand while taking steps to ensure that the data are presented accurately and in a way that does not mislead.

There are degrees of completion when designing a visualization. For the purposes of EDA, we ourselves might be the audience, so we can spend less time on writing descriptive labels and choosing the most aesthetically pleasing colors, shapes, and sizes for graphical elements. We can stick to defaults more often, even if the defaults are ugly or might be misunderstood by someone else. These exploratory graphics can accomplish Tukey’s goals for EDA: finding anomalies and problems in the data and revealing unexpected patterns. If our purpose is to use the graphic to communicate with an audience, however, then we will take steps to improve the labels and aesthetic elements of the visualization.

There are many words for a data visualization. In the following discussion, we will use the words visualization, graphic, graph, plot, and figure interchangeably.

11.1.1. Aesthetics, Data, and Scales

In chapter 2 of Fundamentals of Data Visualization, Claus O. Wilke explains that a data visualization has three parts: aesthetics, data, and scales. Aesthetics are the visual elements of a graphic: the axes, the plot area, the shapes and colors of different sizes that appear in the plot area, and the labels. Scales are functions that map the data we are trying to describe to aesthetics. For example, in a scatterplot, we illustrate the relationship between two continuous-valued features by setting the values of each feature to be the x and y coordinates of a point to be plotted on a Cartesian plane.

Two examples of scatterplots: alcohols by hatred for self and for others, according to Zach Weinersmith (source: https://www.smbc-comics.com/comic/self-destructive-beverages), and the tastiness and difficulty of eating various fruits according to Randall Munroe (Source: https://xkcd.com/388/)

The coordinates of the points are the data, and the rule that sets the data as the x and y coordinates is the scale. In Randall Munroe’s scatterplot, he plots the fruit by scores he gives each fruit for tastiness and for difficulty. In Zach Weinersmith’s scatterplot, he determines the amount of hatred someone must have for themselves and for others in order to drink each of the alcohols. Although both are subjective and funny, there is underlying data represented in each graph, and the scale sets the points at the coordinates defined by the data. Aesthetics include the choice of shape for each point - a small hollow circle, a solid circle, squares, triangles, or another shape such as pictures of fruits - as well as the size, color, and transparency of the points. The aesthetics also include elements beyond the points: the color of the plot area, whether to include vertical and horizontal guiding lines in the background, the range of the axes, how to label the axis values (if at all), the axis labels, whether to include a legend, and the title of the graph.

When constructing a visualization, the part of the graph we select first is the data. We have to acquire the data we need and manipulate the data so that we can perform EDA. Second, we choose the scale. The scales are generally set by the type of graph we choose. Every scatterplot, for example, sets the two features as the coordinates of each point, and every barplot sets the height of the bars equal to the calculated frequency (or a different descriptive statistic) for each category. Once we choose a graph, we won’t be able to change the way data is mapped to aesthetics. Finally, once we’ve set the data and the scales of the graph, we can choose the aesthetics. Most of the guidance that exists regarding how to create an effective visualization focuses on the choice of graph type, which implies the scales, and on the aesthetics.
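As a minimal sketch of how data, scales, and aesthetics fit together in code, the following draws a small scatterplot with matplotlib. The fruit names and scores are invented for illustration and are not taken from either comic; the comments mark which lines correspond to each of the three parts.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Data: invented scores, for illustration only
fruit = pd.DataFrame({
    "name": ["apple", "banana", "grapefruit", "peach"],
    "difficulty": [2, 1, 8, 3],
    "tastiness": [6, 5, 3, 9],
})

fig, ax = plt.subplots(figsize=(6, 4))

# Scale: a scatterplot maps one feature to x and the other to y.
# Aesthetics of the points: marker shape, size, color, and transparency.
points = ax.scatter(fruit["difficulty"], fruit["tastiness"],
                    marker="o", s=80, color="darkgreen", alpha=0.7)

# Aesthetics beyond the points: axis ranges, guiding lines, labels, title
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.grid(True, linestyle=":")
ax.set_xlabel("Difficulty of eating")
ax.set_ylabel("Tastiness")
ax.set_title("Fruit by tastiness and difficulty (invented data)")
```

Changing the marker, color, or labels alters only the aesthetics; the mapping of the two features to the x and y coordinates stays fixed as long as the graph remains a scatterplot.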

11.1.2. Edward Tufte’s “Fundamental Principles of Analytical Design”

Edward Tufte is a statistician, computer scientist, and political scientist who became well known for writing guiding principles for data visualization, beginning with his book The Visual Display of Quantitative Information. Today he travels the country, teaching one-day (expensive) short courses on the topic of data visualization. He is a notorious critic of Microsoft PowerPoint, and he coined the term chartjunk to refer to “all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph, or that distract the viewer from this information.”

In Beautiful Evidence, Tufte presents six “Fundamental Principles of Analytical Design” (p. 122) which guide the choices visualization designers should make when choosing graph types and aesthetics. First, graphics must show “comparisons, contrasts, [or] differences” in the data (p. 127). A scatterplot shows differences between each of the rows in the data on the features that comprise the coordinates. A barplot compares one category of a feature to another on some dimension.

Second, graphs must show “causality, mechanism, explanation, [or] systemic structure” (p. 128). If the goal is to describe the relationship between two features, then an accurate portrayal of that relationship might be considered causal or an explanation if we take additional steps to account for confounding factors and if we are careful to correctly state which feature is the cause and which is the effect. If we cannot rise to the level of making a causal statement, then showing a relationship is still an illustration of the mechanism or systemic structure in the covariance of the features. Regardless of the reason for a correlation, a correlation exists and can be visualized.

Third, figures can and should “show multivariate data; that is, show more than 1 or 2 [features]” (p. 130). Different aesthetics make it possible for us to include more than two variables in one visualization. For example, we can create a scatterplot in which the x-axis represents people’s ages and the y-axis represents people’s personal incomes. We can change the colors of each point to represent each person’s gender, and we can use the shape of each point to represent the person’s political party affiliation. Other writers, including Wilke, caution against including too many features in one graphic as each additional feature makes the figure more confusing. Our goal should be to show as much of the data as possible without increasing the complexity of the figure too much.
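The four-feature scatterplot described above might be sketched as follows with seaborn. The data are simulated purely for illustration (they are not the ANES data used later in this chapter), and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Simulated data, for illustration only
rng = np.random.default_rng(0)
n = 200
people = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "income": rng.normal(50000, 15000, size=n).round(),
    "gender": rng.choice(["Woman", "Man"], size=n),
    "party": rng.choice(["Democrat", "Republican", "Independent"], size=n),
})

plt.figure(figsize=(10, 6))
# x and y carry two features; hue (color) and style (marker shape) carry two more
ax = sns.scatterplot(data=people, x="age", y="income",
                     hue="gender", style="party")
ax.set_xlabel("Age")
ax.set_ylabel("Personal income")
```

With 200 points, two colors, and three marker shapes, the figure is already fairly busy, which illustrates the tension between Tufte's advice and Wilke's caution.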

Fourth, visualizations should “integrate words, numbers, [and] images” (p. 131). We can annotate graphs with words or numbers: for example, by labeling points in a scatterplot, as in both the fruit and alcohol scatterplots shown above, or by labeling bars in a barplot with the height of each bar. We can use images as well, for example by placing images of flags onto bars in a barplot that shows cross-national differences.
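One way to label bars with their heights is matplotlib's `bar_label()` method, which is available in matplotlib 3.4 and later. This is a minimal sketch using invented counts, not a method the principles above prescribe.

```python
from matplotlib import pyplot as plt

# Invented counts, for illustration only
categories = ["A", "B", "C"]
counts = [12, 30, 21]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(categories, counts)

# Write the height of each bar just above it
labels = ax.bar_label(bars)
```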

Fifth, graphics should be presented along with documentation to help the audience understand the meaning of various aesthetic elements. If we use colors to mark points in a scatterplot as representing a woman or a man, then there should be an accompanying legend that makes the colors’ meanings clear. We can also use captions underneath a figure and explanatory text to make sure that all the distinct mappings from data to aesthetics are understandable.

Finally, Tufte holds the following principle above all of the others:

Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content.

(p. 136). In other words, all of the work we’ve done to acquire and manipulate data is necessary for effective visualizations to be possible. If we have not collected high quality data, or if we have not properly cleaned the data, then no amount of attention paid to a figure’s aesthetics will make the figure into a worthwhile tool for communicating distributions and relationships in the data.

11.1.3. Claus O. Wilke’s Guide to Telling a Story and Making a Point

We can use data visualizations to communicate findings within a dataframe to an outside audience. But we want the information contained within a visualization to be understood and remembered, and people in general do not process highly abstract information quickly, and they do not tend to remember it.

In chapter 29 of Fundamentals of Data Visualization, Claus O. Wilke recommends thinking about crafting a narrative when constructing a data visualization. People understand and remember stories. Using a story to motivate and explain a visualization does not mean that we misrepresent the findings, but rather that we use the frame of a story to describe the findings, why they are important, and what the larger take-away points should be.

A narrative arc describes the organization and sequence of a story. The most well-known model of a narrative arc is Gustav Freytag’s pyramid: Exposition, Rising Action, Climax, Falling Action, and Denouement. An academic paper or a data analysis report can follow this model as well. The article begins with a statement of the topic and the research question (the exposition). The literature review reveals the unaddressed areas in which a novel contribution can be made (the rising action). The results provide an answer to the research question (the climax), and a subsequent discussion places the results in the larger context of the literature and describes further implications (falling action). The conclusion summarizes the most important points of the article (the denouement).

There are models for narrative arcs other than Gustav Freytag’s. Kurt Vonnegut states:

There’s no reason why the simple shapes of stories can’t be fed into computers. They are beautiful shapes.

In the following talk, Vonnegut maps out the narrative arc of many different stories:

from IPython.display import IFrame
IFrame(src="https://www.youtube.com/embed/oP3c1h8v2ZQ", width="560", height="315")

Regardless of the specific model used to describe a story, stories should have a beginning, middle, and an end, and should set up a conflict (a research question) and resolve that conflict (with our findings). If we keep this structure in mind, we can prepare an audience to see our findings with other visualizations that help people understand the problem, and we can use visualizations to convey both our findings and their elaborations and implications.

If we think about visualizations as a means to tell a story, Wilke recommends that we take steps to make the visualizations understandable and memorable. Although Tufte encourages us to include more than one or two features in a single figure, each additional feature makes it harder to understand the stories that other features describe. Wilke suggests that we think about the main take-home points we want our audience to understand, and that we reduce the features in a single visualization to only those that speak directly to this story. He writes: “Never assume your audience can rapidly process complex visual displays.” That’s true both of general audiences and of audiences of people with specific technical training: complicated figures cause everyone to tune out.

Suppose that we create ten barplots with exactly the same formatting. Even if each of the ten plots is perfectly understandable and clear, the points that each one makes may be lost in the noise of repeated, similar images. Wilke notes that it is important to make the figures more memorable by making them more distinct. First, we can use color to highlight specific elements of a figure to emphasize points that are more important to the story. If one category among the bars is of special interest to us, we can use one color for that bar and a different color for all the others. Second, if elements of the plot represent units with well-known images, we can use these images on the graph the way Randall Munroe uses pictures of fruits instead of points. These extra visual elements will make each figure stand out and be more memorable.

11.1.4. How (not) to Lie with Statistics

How to Lie with Statistics by Darrell Huff is a short, funny, and classic book that describes many ways that statistics can be used to mislead an audience. The book was written in 1954, and has been reprinted many times since then.

How to Lie with Statistics by Darrell Huff (Source: https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics).

Darrell Huff discusses the dangers of drawing conclusions from biased and undersized samples, using poorly-worded survey questions, confusing the mean, median, and mode, emphasizing the mean while ignoring the distribution, and making statements about causation from evidence that only shows a correlation. He also discusses two ways that visualizations specifically can mislead, which he calls:

The Gee-Whiz Graph: a figure that manipulates the range of the y-axis to make it seem like a feature changes more or less than it actually does.
The One-Dimensional Picture: a graph that maps data to the height of images, but uses non-rectangular shapes to increase the area by a greater proportion than the difference in heights.

The idea of a one-dimensional picture is related to Wilke’s discussion of the principle of proportional ink in chapter 17 of Fundamentals of Data Visualization. The principle of proportional ink states:

The sizes of shaded areas in a visualization need to be proportional to the data values they represent.

With a barplot, the images we use to represent the data are rectangles. If we double the height of a rectangle, we also double the area of a rectangle:

\[ (2 \times \text{height})(\text{width}) = 2\times(\text{height})(\text{width}) = 2\times(\text{area})\]

The principle of proportional ink is violated when doubling the height causes the area to increase by a factor other than 2. The human eye recognizes area first, so if the data are represented by the height, a change in area by a different factor misrepresents the data. Barplots do not violate the principle of proportional ink because of the specific property of rectangles that the height and area change with the same proportion. The same thing is not true of other shapes. For example, Darrell Huff uses the following image to represent a one-dimensional picture:

An illustration of a violation of the principle of proportional ink. (Source: How to Lie with Statistics, p. 69).

The data that are represented in this figure are monetary amounts, represented by the height of each money bag. The money bag on the right is twice the height of the money bag on the left. But because these shapes are not rectangles, the area of the money bag on the right is more than double the area of the money bag on the left. A figure like this conveys the incorrect notion that the amount represented by the money bag on the right is larger than it is.
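A short calculation can make the distortion concrete. The sketch below rests on two loudly stated assumptions that are not in Huff's book: the picture is approximated by an ellipse, and the taller picture is a uniformly enlarged copy of the shorter one, so its width doubles along with its height. The function names are hypothetical helpers.

```python
import math

def rectangle_area(height, width):
    return height * width

def ellipse_area(height, width):
    # Area of an ellipse with the given full height and full width
    return math.pi * (height / 2) * (width / 2)

# A bar keeps its width when its height doubles
rect_ratio = rectangle_area(2, 1) / rectangle_area(1, 1)

# Assumption: the picture is enlarged uniformly, so its width doubles too
ellipse_ratio = ellipse_area(2, 2) / ellipse_area(1, 1)

print(rect_ratio, ellipse_ratio)
```

Under these assumptions the rectangle's area grows by a factor of about 2 while the enlarged picture's area grows by a factor of about 4, so the picture overstates a doubling of the data.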

When creating data visualizations, it is important to represent the data accurately by using a reasonable scale for the y-axis (generally by including 0 in this scale), and by using rectangular area to ensure that the principle of proportional ink is not violated.
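To see the effect of the y-axis range, the following sketch draws the same invented values twice: once with a truncated y-axis, in the style of the Gee-Whiz Graph, and once with a y-axis that includes 0. The values and axis limits are hypothetical choices made for illustration.

```python
from matplotlib import pyplot as plt

# Invented values, for illustration only
years = ["2018", "2019", "2020", "2021"]
values = [100, 101, 102, 104]

fig, (ax_gee, ax_zero) = plt.subplots(1, 2, figsize=(10, 4))

# Truncated y-axis: a change of a few percent looks dramatic
ax_gee.bar(years, values)
ax_gee.set_ylim(99, 105)
ax_gee.set_title("Truncated y-axis")

# y-axis that includes 0: bar heights are proportional to the values
ax_zero.bar(years, values)
ax_zero.set_ylim(0, 110)
ax_zero.set_title("y-axis includes 0")
```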

11.1.5. Why Not Pie Charts?

We will not be discussing the methodology to create pie charts. In general pie charts are considered poor ways to represent data, and many statisticians and data scientists bemoan the prevalence of pie charts in government and business. Walt Hickey, writing for Business Insider in 2013, phrases his opinion of pie charts in no uncertain terms:

The pie chart is easily the worst way to convey information ever developed in the history of data visualization.

Hickey’s argument reflects the argument of many statisticians and data scientists. First, if the goal of a data visualization is to present data in a clear way, pie charts can obscure the data because the human eye can measure distance better than angles. The following data can be represented in pie charts or barplots:

The same data represented in pie charts then in barplots. (Source: Walt Hickey, "The Worst Chart in the World", https://www.businessinsider.com/pie-charts-are-the-worst-2013-6?op=1).

In the pie charts, the five quantities are represented by the angles in the slices of the circle. But it is hard to determine how much larger or smaller each slice is relative to the others, and sometimes it is hard to determine which of the slices is largest and smallest. In contrast, this information is clear in the barplots. Pie charts are often annotated by the numbers that each slice represents, but if the labels are necessary to read a pie chart then the chart itself is not needed as a reader can look at the labels alone.

Pie charts do uphold the principle of proportional ink because the area of a slice of a circle is given by the formula

\[ \text{area} = \frac{\theta}{360}(\pi r^2) \]

where $\theta$ is the angle of the slice in degrees and $r$ is the radius of the circle, which is constant for all slices. If we double the angle $\theta$, the area also doubles because:

\[ \frac{2 \times\theta}{360}(\pi r^2) = 2 \times \frac{\theta}{360}(\pi r^2) = 2\times(\text{area})\]

However, a common variant is a 3D pie chart, as in the following example:

A 3D pie chart (Source: Walt Hickey, "The Worst Chart in the World", https://www.businessinsider.com/pie-charts-are-the-worst-2013-6?op=1).

Because of the added base, 3D pie charts violate the principle of proportional ink. Notice that the light-blue slice on top of the circle has a larger angle than the red slice on the bottom, but more ink is devoted to the red slice than the light-blue one because of the perspective in the drawing. In addition, the tilted perspective turns the circle into an oval, which expands the angles on the top and bottom and compresses the angles on the sides. A 3D pie chart is misleading and confusing and should never be used.

There is one advantage of a pie chart. Pie charts show the proportion of each category to the total across categories, which barplots do not. That’s useful when the story we are trying to tell is that one category is much larger than all of the others and dominates the total when adding up all the slices. The following figure is one of the best examples of a pie chart:

Military spending by country (Source: Wikipedia https://en.wikipedia.org/wiki/List_of_countries_by_military_expenditures).

In this figure, the story is that the United States spends much more on defense than any other country, and that China also spends more than any other country with the exception of the United States. Together the United States and China account for nearly half of the world’s defense spending. This pie chart is effective at making these points, but it is not effective at illustrating specific country-to-country comparisons without the numeric annotations.

11.2. Creating Data Visualizations in Python

There are many distinct styles of data visualization that are well-suited for different kinds of data. Within these styles, there are many adaptations with different aesthetics. Each style involves fixing the scale - the function that maps data to aesthetics - of the visualization, so if we choose a barplot, for example, the height of the bars (or the length of the bars if the bars are oriented horizontally) is determined by the data we are trying to plot.

The five most important types of data visualization are barplots, scatterplots, line plots, distributional plots, and heatmaps. There are many examples of each type later in this chapter.

These visualizations can also be combined. For example, a scatterplot might attach histograms to the edges of the plot area that are opposite the x and y-axes to show the histograms for each of the two features in the scatterplot. There are other kinds of data visualizations as well, such as network plots to show the connections between nodes in a network. However, in this chapter we will only focus on the visualizations described above.

The following discussion illustrates the code to generate barplots, scatterplots, line plots, distributional plots, and heat maps using the matplotlib and seaborn packages in Python, as well as using the .plot() method from pandas. For each of these plots, we can change the aesthetics to change the appearance of the plot and include additional features in the plot. We will alter the aesthetics as needed to produce good-looking, accurate, and understandable visualizations. This discussion is not meant to be a comprehensive list of visualizations, but an introduction to the major coding systems available for plotting in Python. For more examples, see the matplotlib image gallery or the seaborn example gallery.

We can generate visualizations using matplotlib or seaborn, so how do we know which one to use? In what situations is one package better than the other?

Most researchers use both. seaborn is built on top of matplotlib, and it tends to produce prettier graphics with fewer lines of code than matplotlib, but it can be limited in terms of how much an individual visualization can be customized. matplotlib is very flexible, but the default aesthetics do not always look very elegant. In addition, pandas has a .plot() method that can be applied to a dataframe that uses shortcuts to generate figures from that dataframe. pandas does not have its own graphics engine: instead, its .plot() method calls matplotlib by default, although it uses a different syntax to do so.
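A quick way to see the connection between pandas and matplotlib is to inspect what the .plot() method returns. This is a minimal sketch with invented counts, assuming a default pandas installation in which the plotting backend option has not been changed.

```python
import pandas as pd
import matplotlib.axes

# Invented counts, for illustration only
counts = pd.Series([5, 3, 8], index=["a", "b", "c"])

# .plot() is a pandas method, but the object it returns is a matplotlib Axes
ax = counts.plot(kind="bar")
ax.set_ylabel("Count")  # so matplotlib methods can customize the figure

print(pd.get_option("plotting.backend"))
print(isinstance(ax, matplotlib.axes.Axes))
```

Because the returned object is a matplotlib Axes, the labeling and styling methods from matplotlib can be applied to a figure that was started with pandas.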

The trickiest thing about visualization in Python is that matplotlib, seaborn, and the .plot() method in pandas are used interchangeably because some graphs and aesthetics are only possible or are much easier to generate in one package. None of these packages is currently better across the board than the other two, and different packages are better for specific tasks. It can be hard to remember what works best from situation to situation. It is even more confusing because matplotlib, seaborn, and .plot() overlap in their syntax in some ways, and are different in other ways. The best way forward is to practice many data visualizations. The more we generate data visualizations, the better our instincts will get about which package to use when.

Traditionally, we use the pyplot module from matplotlib with the alias plt, and we use seaborn with the alias sns. We begin by loading the following packages:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

As with chapters 8 and 10, we will be working with the 2019 pilot study from the American National Election Study (ANES). To see how we prepared the data for analysis, or to see the tabular methods of EDA we employed on this dataset, see chapters 8 and 10 respectively. We load the cleaned data:

anes = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/anes_pilot2019_clean.csv")
anes.columns

Index(['caseid', 'liveurban', 'vote16', 'protest', 'vote',
       'most_important_issue', 'confecon', 'ideology', 'partyID',
       'universal_income', 'family_separation', 'free_college',
       'forgive_loans', 'race', 'birthyr', 'sex', 'education', 'weight',
       'fttrump', 'ftobama', 'ftbiden', 'ftwarren', 'ftsanders', 'ftbuttigieg',
       'ftharris', 'ftblack', 'ftwhite', 'fthisp', 'ftasian', 'ftmuslim',
       'ftillegal', 'ftjournal', 'ftnato', 'ftun', 'ftice', 'ftnra', 'ftchina',
       'ftnkorea', 'ftmexico', 'ftsaudi', 'ftukraine', 'ftiran', 'ftbritain',
       'ftgermany', 'ftjapan', 'ftisrael', 'ftfrance', 'ftcanada', 'ftturkey',
       'ftrussia', 'ftpales', 'ftimmig', 'partisanship', 'ftbiden_level',
       'age', 'age2', 'ftbiden_float', 'ftbiden_cat', 'ftbiden_str',
       'prefersbiden', 'worried_econ', 'favor_both'],
      dtype='object')

11.2.1. Barplots and How to Use `matplotlib`, `seaborn`, and the Dataframe `.plot()` Method

Barplots are appropriate for describing the frequencies of a categorical feature. Each category gets a bar. The bars can be oriented vertically or horizontally, and the height or length of each bar represents the data. Most often we plot the frequencies of categories this way, but we can also plot other conditional descriptive statistics such as the within-category mean of a continuous feature. We can overlay more than one feature with the same categories by stacking the bars on top of one another or by placing them side by side.
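The two variations mentioned above, plotting a within-category mean and placing bars side by side, might be sketched as follows. The dataframe is invented for illustration, and the column names `group`, `sex`, and `score` are hypothetical.

```python
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Invented data, for illustration only
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "sex": ["Female", "Male"] * 3,
    "score": [40, 50, 60, 70, 20, 30],
})

# Within-category mean of a continuous feature
means = df.groupby("group")["score"].mean()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.bar(means.index, means.values)
ax1.set_title("Mean score by group")

# Side-by-side bars: hue splits each category by a second feature
sns.barplot(data=df, x="group", y="score", hue="sex", ax=ax2)
ax2.set_title("Score by group and sex")
```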

We can create a barplot using matplotlib, seaborn, or the .plot() method for a pandas dataframe. We can use any of these options to create a visualization. The options differ in that some tasks are easier or harder when using a particular package, and specific options might produce graphics that look better or worse.

11.2.1.1. Vertical and Horizontal Barplots, Figure Size, Labels, Titles, and Colors

To generate a barplot using matplotlib, we start by creating a series that contains only the numbers we intend to plot. For example, we can count the values of each category of confecon, how worried a person is about the state of the economy, using the .value_counts() method. The barplot we create with matplotlib will order the bars in the same way they appear with .value_counts(), so we can first rearrange the categories in the order we want, then we can use the .sort_index() method to make sure that the categories appear in this order:

anes['confecon'] = anes['confecon'].astype('category')
anes['confecon'] = anes['confecon'].cat.reorder_categories(['Not at all worried', 
                                                            'A little worried', 
                                                            'Moderately worried', 
                                                            'Very worried', 
                                                            'Extremely worried'])
mybars = anes.confecon.value_counts().sort_index()
mybars

confecon
Not at all worried    764
A little worried      847
Moderately worried    871
Very worried          431
Extremely worried     252
Name: count, dtype: int64

Now that the counts are stored in mybars, the simplest barplot uses the plt.bar() function. For this image and for all the images presented below, the image will appear automatically in a Jupyter notebook. In a Python script we would have to also type plt.show() in order to display the figure. plt.bar() takes two arguments: the categories, stored in the .index attribute of mybars, and the values, stored in .values:

plt.bar(mybars.index, mybars.values)

<BarContainer object of 5 artists>

_images/afe1f097f81c1130ed718560e8e1160ce35d72e36fbd4cb72d4e872163a2d18a.png

This plot leaves a lot to be desired. First, the category labels run into each other. To solve this problem, we can increase the size of the figure. We can change the size by typing

plt.figure(figsize=(a, b))

prior to the call to plt.bar() or to another plotting function. Here a and b are numbers, in inches: a is the width and b is the height. If the window can’t contain the proportions, the figure is scaled down proportionally. The default dimensions for a plot are 6.4 inches wide and 4.8 inches high. To create a figure 15 inches wide and 6 inches high, for example, we can type:

plt.figure(figsize=(15, 6))
plt.bar(mybars.index, mybars.values)

<BarContainer object of 5 artists>

_images/2c8d361275866f682f2ab0c402ece49395b55dfce7d0961347774a64e8553b81.png
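The default and custom figure sizes can be checked directly. This sketch reads matplotlib's built-in defaults from `rcParamsDefault` rather than `rcParams`, on the assumption that a notebook environment or style sheet may have overridden the active settings.

```python
import matplotlib as mpl
from matplotlib import pyplot as plt

# matplotlib's built-in default figure size, as [width, height] in inches
default_size = list(mpl.rcParamsDefault["figure.figsize"])
print(default_size)

# figsize=(width, height) overrides the default for this one figure
fig = plt.figure(figsize=(15, 6))
width, height = fig.get_size_inches()
print(width, height)
```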

Now we can clearly see the individual bar labels. Another way to accommodate longer bar labels is to orient the bars horizontally, leaving plenty of room for the labels on the left. To generate horizontal bars, we use plt.barh() instead of plt.bar():

plt.figure(figsize=(10, 5))
myplot = plt.barh(mybars.index, mybars.values)

_images/4b9b832e55472893653bd11917c1c3d7a891d4be36430e03eaf17f20a9383655.png

There are many ways that we can make this figure more understandable. First, we can add axis labels. To add a label to the x and y axes, we type

plt.ylabel('a label')
plt.xlabel('another label')

after the call to plt.bar(), plt.barh(), or another plotting function. To add an overall title that will appear above the graph, we type

plt.title('a title')

after the plotting function. For example, we can add axis labels and a title to our horizontal barplot:

plt.figure(figsize=(10, 5))
plt.barh(mybars.index, mybars.values)
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/43c24c534bd42a6e28d6868931658ee6a7634ca13144dee5bac133eb6dd30c66.png
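The same figure can also be written with matplotlib's object-oriented interface, which this section does not otherwise use: `plt.subplots()` returns a figure and an axes object, and the labels and title are set with methods on the axes. In this sketch, mybars is rebuilt from the counts printed earlier so that the block runs on its own without downloading the data.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Rebuilt from the counts printed earlier so the block is self-contained
mybars = pd.Series(
    [764, 847, 871, 431, 252],
    index=["Not at all worried", "A little worried", "Moderately worried",
           "Very worried", "Extremely worried"],
)

fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(mybars.index, mybars.values)
ax.set_ylabel("Level of worry")
ax.set_xlabel("Number of people")
ax.set_title("Amount of Worry About the Current State of the Economy")
```

Either style produces the same picture; the object-oriented form can be easier to manage when a figure contains more than one set of axes.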

To change the color of the bars, use the color argument inside the plotting function. We can set the color of the bars to red, or to many other colors:

plt.figure(figsize=(10, 5))
plt.barh(mybars.index, mybars.values, color='red')
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/3a02fe5bc7ee4dbf90d8fac69dd694d165f689f60d8bc95902de23fd5277a443.png

To set different colors for every bar, we can pass a list of colors. That’s useful if we want to emphasize particular bars, such as the ones for “very worried” and “extremely worried”:

plt.figure(figsize=(10, 5))
plt.barh(mybars.index, mybars.values, 
         color=['blue', 'blue', 'blue', 'red', 'red'])
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/4328c5662b8d32e7ab3f5d66a6e621a42ada3424fd13fe96aa7c885109b59c2a.png
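Rather than typing the list of colors by hand, we can build it from the category labels, which keeps the colors correct if the order of the bars changes. This is a sketch; the `highlight` set and `colors` list are names introduced here for illustration, and mybars is rebuilt from the counts printed earlier so that the block runs on its own.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Rebuilt from the counts printed earlier so the block is self-contained
mybars = pd.Series(
    [764, 847, 871, 431, 252],
    index=["Not at all worried", "A little worried", "Moderately worried",
           "Very worried", "Extremely worried"],
)

# Categories to emphasize get red; all other categories get blue
highlight = {"Very worried", "Extremely worried"}
colors = ["red" if category in highlight else "blue"
          for category in mybars.index]

plt.figure(figsize=(10, 5))
bars = plt.barh(mybars.index, mybars.values, color=colors)
plt.ylabel("Level of worry")
plt.xlabel("Number of people")
```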

To create a barplot with seaborn, we use the sns.barplot() function. There are two ways to use this function. We can pass a series directly to the x argument to define the categories and another series to the y argument to define the values:

plt.figure(figsize=(10, 6))
sns.barplot(x=mybars.index, y=mybars.values)

<Axes: xlabel='confecon'>

_images/9254f538df127435fa2a847cc75dc8649a18ad4bec1035152d1d0836d3605e6e.png

Alternatively, if the data exists in a dataframe, we can pass the x and y columns to the x and y arguments, and the dataframe to the data argument:

mybars_df = mybars.reset_index()
mybars_df

	confecon	count
0	Not at all worried	764
1	A little worried	847
2	Moderately worried	871
3	Very worried	431
4	Extremely worried	252

plt.figure(figsize=(10, 6))
sns.barplot(x='confecon', y='count', data=mybars_df)

To orient a seaborn barplot horizontally, simply interchange the x and y features:

plt.figure(figsize=(10, 6))
sns.barplot(x='count', y='confecon', data=mybars_df)

<matplotlib.axes._subplots.AxesSubplot at 0x1a23d39d10>

_images/0e24c3038e100765fe70892c0ed58af21328ade5961a1e26623d5eedb57e905d.png

Notice that in the barplots above the axes are mislabeled. seaborn graphics use the same functions as matplotlib to change the title and axes:

plt.figure(figsize=(10, 6))
sns.barplot(x='count', y='confecon', data=mybars_df)
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/0d4a9892d1a58ca4cef394e1ddf0948e2a7142ce25803822d0370974092dd6ac.png

Older versions of seaborn, unlike matplotlib, use a different color for each bar by default, simply because it is prettier (since version 0.13, seaborn draws all bars in one color unless a hue is assigned). Edward Tufte would call this aesthetic choice “chartjunk” because the colors do not map to any additional feature or have any meaning other than being nicer to look at. To choose the colors of the bars ourselves, we use the palette argument as follows:

plt.figure(figsize=(10, 6))
sns.barplot(x='count', y='confecon', data=mybars_df,
           palette = ['blue', 'blue', 'blue', 'red', 'red'])
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/a69668671e30c482ca4e68193e1c6a55917924d9b3b49c1c888c302e98a2e348.png

To create a barplot using the .plot() method in pandas, first save the values as a dataframe as we do with mybars_df above, then apply .plot() to this dataframe. .plot() has an argument kind that lets us specify the graph type. For a barplot, we write kind='bar' and we specify the x and y features the same way we did with seaborn:

mybars_df.plot(kind='bar', x='confecon', y='count')

<matplotlib.axes._subplots.AxesSubplot at 0x1a24398910>

_images/ad815ed975a08ab9819e0279ab6b9569da91f311241c4d25b942ff58f044fff2.png

Although .plot() uses matplotlib to generate its visualizations, it uses some different defaults. Note, for example, that the category labels are oriented vertically instead of horizontally to avoid overlapping labels.

plt.figure(figsize=(10, 6)) does not work with the .plot() method, but this method includes a figsize argument:

mybars_df.plot(kind='bar', x='confecon', y='count',
              figsize = [10,6])

<matplotlib.axes._subplots.AxesSubplot at 0x1a24571a90>

_images/45fd654fe2cd61a3e8510cdcf88493bb1e820c65729d992b00e36be5686ef1cc.png

To create a horizontal bar plot, use kind='barh' instead of kind='bar':

mybars_df.plot(kind='barh', x='confecon', y='count',
              figsize = [10,6])

<matplotlib.axes._subplots.AxesSubplot at 0x1a24675850>

_images/c9e627d130d65cd948d3004153c2755a00679a3a63704c41edacec4ebbe25ac7.png

The .plot() method also uses the same matplotlib functions for setting the axis labels and title:

mybars_df.plot(kind='barh', x='confecon', y='count',
              figsize = [10,6])
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/2162fa31e534aafbd107a6679469eeb08308de8f93a7e97ceefb066ddc92c68d.png

To change the colors of the bars, use the color argument:

mybars_df.plot(kind='barh', x='confecon', y='count',
              figsize = [10,6],
              color = ['blue', 'blue', 'blue', 'red', 'red'])
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/432c393526192318bcb115cbf921f2edf64406fbe6c843207b6f0f077ff84508.png

Notice that there is a legend included by default. The legend will be important when we use the .plot() method to handle situations in which we stack bars or align them side-by-side for different groups. But in this case the legend is unnecessary. To turn off the legend, type legend=False:

mybars_df.plot(kind='barh', x='confecon', y='count',
              figsize = [10,6],
              color = ['blue', 'blue', 'blue', 'red', 'red'],
              legend = False)
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/3eee6d33adb598a3c7fc0e4c4826dc573dca41f90100228b9518e10f7c344e6d.png

11.2.1.2. Barplots for Statistics Other Than Frequencies #

A barplot does not have to simply illustrate frequencies. We can show any descriptive statistic in a barplot that can be calculated within each category. For example, we can show the average age of the people who responded with each level of economic worry. First we need to create a conditional means table and convert it to a dataframe with the .reset_index() method:

mybars = anes.groupby('confecon').agg({'age':'mean'}).sort_index()
mybars_df = mybars.reset_index()
mybars_df

	confecon	age
0	Not at all worried	57.663613
1	A little worried	52.792208
2	Moderately worried	48.409874
3	Very worried	50.359629
4	Extremely worried	47.801587

Then we can plot these average ages using the same code we used for a barplot in matplotlib, seaborn or with the .plot() method. Here is a matplotlib barplot:

plt.figure(figsize=(10, 5))
plt.barh(mybars_df.confecon, mybars_df.age, color='green')
plt.ylabel('Level of worry')
plt.xlabel('Average Age')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/2768edcd82f6b2a35c2ab7b66753efa2f2b694dda4730bc760b2a32e4a7d64c7.png

Here is a seaborn barplot:

plt.figure(figsize=(10, 6))
sns.barplot(x='age', y='confecon', data=mybars_df,
           color = 'green')
plt.ylabel('Level of worry')
plt.xlabel('Average Age')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/35c69c3b8c243819a5e2997ccd025a0dea4161691fe899212cefd38b71c0e0b7.png

And here is the barplot using .plot():

mybars_df.plot(kind='barh', x='confecon', y='age',
              figsize = [10,6],
              color = 'green',
              legend = False)
plt.ylabel('Level of worry')
plt.xlabel('Average Age')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/a6b7fde444c8996f133fd74023800510b6fae855fe20f0b41b1b4c31f0790c4e.png
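The same pattern extends to statistics other than the mean. As a minimal sketch, the made-up data below stand in for the anes columns, and the output column names mean_age and median_age are chosen only for illustration:

```python
import pandas as pd

# Made-up data standing in for the confecon and age columns
toy = pd.DataFrame({
    'confecon': ['A little worried', 'A little worried', 'Very worried',
                 'Very worried', 'Very worried'],
    'age': [30, 50, 20, 40, 90],
})

# One row per category, one column per statistic
stats_df = (toy.groupby('confecon')
               .agg(mean_age=('age', 'mean'), median_age=('age', 'median'))
               .reset_index())
print(stats_df)
```

Either column can then supply the bar lengths, for example plt.barh(stats_df.confecon, stats_df.median_age).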

11.2.1.3. Annotating Bars with Values #

We can annotate barplots with the heights of the bars, regardless of whether those bars are frequencies, means, or another descriptive statistic. Annotating barplots with values is tricky whether we use matplotlib, seaborn, or .plot(), but fortunately the same method will work for all three plotting tools.

First, we can use plt.text() to place text onto a figure. We pass this function six arguments: the x- and y-coordinates where the text should be placed, the text itself, horizontalalignment, verticalalignment, and fontsize. Setting horizontalalignment='center' and verticalalignment='center' centers the text box on the specified (x,y) point; we can also specify 'left' or 'right', and 'top' or 'bottom'. To place “Hello, world!” onto the barplot of worry counts, we can type:

mybars_df = anes.confecon.value_counts().sort_index().reset_index()

plt.figure(figsize=(10, 5))
plt.barh(mybars_df['confecon'], mybars_df['count'], color='green')
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
plt.text(400, 3, 'Hello, world!', horizontalalignment='center', verticalalignment='center',
        fontsize=28)

Text(400, 3, 'Hello, world!')

_images/bf630b1191625569cfa7a8582396adef7ce4f9628fac7a2cdb97efad4289bf82.png

We are going to use plt.text() to place values onto the bars, but we need to know the exact x- and y-coordinates of the top or end of each bar. It is a mistake to try to find these coordinates through guessing and checking. Instead, we assign the output of plt.bar() (or another plotting function) to a variable. The image will still display automatically in a Jupyter notebook, or after typing plt.show(). The variable has an attribute called .patches:

myplot = plt.bar(mybars_df['confecon'], mybars_df['count'])
myplot.patches

[<matplotlib.patches.Rectangle at 0x1a24021a90>,
 <matplotlib.patches.Rectangle at 0x1a23a3cc10>,
 <matplotlib.patches.Rectangle at 0x1a2452a310>,
 <matplotlib.patches.Rectangle at 0x1a24571ed0>,
 <matplotlib.patches.Rectangle at 0x1a2458cad0>]

_images/e4f30e8500b5b46169853e5e8fb5fe73dd5eb3a73cf56d47b481c34c96d297d7.png

The .patches attribute contains a list in which each item in the list looks like:

<matplotlib.patches.Rectangle at 0x1a391f0c50>

These codes are tags denoting the shapes that appear in the plot. There are five items in the list, one for each rectangle in this barplot. An individual rectangle has a method .get_x() that returns the x-coordinate of the left edge of the rectangle, .get_height() that returns the height of the rectangle (which, because the bars start at 0, is also the y-coordinate of the top edge), and .get_width() that returns the width of the rectangle. These values for the first rectangle are:

[myplot.patches[0].get_x(),
 myplot.patches[0].get_height(),
 myplot.patches[0].get_width()]

[-0.4, 764, 0.8]

We can loop across the rectangles in the .patches attribute, extract the x and y-coordinates for the values as well as the values themselves, and use plt.text() to place them onto the figure, as follows:

for rect in myplot.patches:
    xcoor = rect.get_x() + .5*rect.get_width()
    ycoor = rect.get_height()
    plt.text(xcoor, ycoor, str(ycoor), 
             horizontalalignment='center', 
             verticalalignment='bottom',
             fontsize=14)

This is a for-loop in which rect is the loop variable that represents a single rectangle. xcoor is the x-coordinate for the label: the rectangle’s left edge plus half its width, which is the middle of the rectangle. ycoor is both the y-coordinate and the value to be displayed, which we convert to text with str(). We set horizontalalignment='center' and verticalalignment='bottom' so that the specified point sits at the bottom-middle of the text box, which places each number just above the middle of its rectangle. The loop repeats this process for every rectangle. For the barplot of economic worry, this approach yields the following output:

plt.figure(figsize=(15, 6))
myplot = plt.bar(mybars_df['confecon'], mybars_df['count'])
plt.xlabel('Level of worry')
plt.ylabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    xcoor = rect.get_x() + .5*rect.get_width()
    ycoor = rect.get_height()
    plt.text(xcoor, ycoor, str(ycoor), 
             horizontalalignment='center', 
             verticalalignment='bottom',
             fontsize=14)

_images/f2fc7faf3c0d98983ec0c7786d72391139ebd7e4c4e7967723783102236174bb.png
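Recent versions of matplotlib also offer a built-in shortcut, Axes.bar_label(), which is a different technique from the loop above: it reads the bar values from the container that the bar function returns. A minimal sketch, assuming matplotlib 3.4 or later and with the counts typed in by hand from the frequency table:

```python
import matplotlib.pyplot as plt

labels = ['Not at all worried', 'A little worried', 'Moderately worried',
          'Very worried', 'Extremely worried']
counts = [764, 847, 871, 431, 252]

fig, ax = plt.subplots(figsize=(15, 6))
bars = ax.bar(labels, counts)

# bar_label() writes each bar's value just past the end of the bar
# and returns the text objects it created
texts = ax.bar_label(bars, fontsize=14)
```

The loop over .patches remains the more flexible option, because it gives full control over where each label goes and what it says.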

To add labels to horizontal barplots, we use the same method, but the two coordinates trade roles: the x-coordinate (and the value to display) is now the length of the bar, which we get from .get_width(), and the y-coordinate is the middle of the bar, which we get by adding half of .get_height() to .get_y():

plt.figure(figsize=(10, 6))
myplot = plt.barh(mybars_df['confecon'], mybars_df['count'])
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    ycoor = rect.get_y() + .5*rect.get_height()
    xcoor = rect.get_width()
    plt.text(xcoor, ycoor, str(xcoor), 
             horizontalalignment='left', 
             verticalalignment='center',
             fontsize=14)

_images/19550a419d46069390d6ebc18ae938317085a780c87453921834543bbc8c5a1d.png

This method can also be used to annotate the barplot with other descriptive statistics like the percent. We can calculate the percents inside the loop and use these in the text argument of plt.text():

plt.figure(figsize=(10, 6))
myplot = plt.barh(mybars_df['confecon'], mybars_df['count'])
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    ycoor = rect.get_y() + .5*rect.get_height()
    xcoor = rect.get_width()
    pcent = round(100*xcoor / sum(mybars_df['count']), 2)
    plt.text(xcoor, ycoor, str(pcent) + '%', 
             horizontalalignment='right', 
             verticalalignment='center',
             fontsize=14, color='white')

_images/ebee000b8dd00f0baea6847e882182c10f3eb5bf539a5e765f8df555662910e0.png
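One detail about the percent labels: str(round(...)) drops trailing zeros, so a value of 27.50 would print as 27.5. A format string gives every label the same number of decimals. A small sketch, in which the helper name percent_labels is hypothetical and the counts are typed in from the frequency table:

```python
def percent_labels(counts, decimals=2):
    """Format each count as a percent of the total with a fixed number of decimals."""
    total = sum(counts)
    return [f'{100 * c / total:.{decimals}f}%' for c in counts]

counts = [764, 847, 871, 431, 252]
labels = percent_labels(counts)
print(labels)
```

Inside the annotation loop, the same format string can replace str(pcent) + '%' in the text argument of plt.text().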

The same code works to label a seaborn barplot,

plt.figure(figsize=(10, 6))
myplot = sns.barplot(x='confecon', y='count', data=mybars_df)
plt.xlabel('Level of worry')
plt.ylabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    xcoor = rect.get_x() + .5*rect.get_width()
    ycoor = rect.get_height()
    plt.text(xcoor, ycoor, str(int(ycoor)), 
             horizontalalignment='center', 
             verticalalignment='bottom',
             fontsize=14)

_images/d9a2357512722c3098ef0190ef96db0f9d22606201d823f5f9dde1d4f472282a.png

and a horizontal seaborn barplot:

plt.figure(figsize=(10, 6))
myplot = sns.barplot(x='count', y='confecon', data=mybars_df)
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    ycoor = rect.get_y() + .5*rect.get_height()
    xcoor = rect.get_width()
    plt.text(xcoor, ycoor, str(int(xcoor)), 
             horizontalalignment='left', 
             verticalalignment='center',
             fontsize=14)

_images/c67844c9c476cd78f8e13d32a72946558b2db3d048a2f8a5a97ff49b8caec485.png

The same methods also work for barplots generated with the .plot() method,

myplot = mybars_df.plot(kind='bar', x='confecon', y='count',
              figsize = [10,6],
              legend = False)
plt.xlabel('Level of worry')
plt.ylabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    xcoor = rect.get_x() + .5*rect.get_width()
    ycoor = rect.get_height()
    plt.text(xcoor, ycoor, str(int(ycoor)), 
             horizontalalignment='center', 
             verticalalignment='bottom',
             fontsize=14)

_images/76f6a11221b675e1582304c3c25d7624be8d9ff9f5ea9894f558a1fcb393f045.png

and for horizontal barplots generated with .plot():

myplot = mybars_df.plot(kind='barh', x='confecon', y='count',
              figsize = [10,6],
              legend = False)
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    ycoor = rect.get_y() + .5*rect.get_height()
    xcoor = rect.get_width()
    plt.text(xcoor, ycoor, str(int(xcoor)), 
             horizontalalignment='left', 
             verticalalignment='center',
             fontsize=14)

_images/908bfda58b3e2f89225068fd4dcfbac76f450e7ab6f0a9c51ec27b90c00ceb23.png
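Since the same loop recurs for every plotting tool, it can be wrapped in a function. The sketch below is one possible arrangement, not a library feature: the name label_bars and its arguments are hypothetical, and the two bars are typed in from the frequency table. It expects a matplotlib Axes, which is what sns.barplot() and .plot() return; for plt.bar() or plt.barh(), plt.gca() supplies the current Axes.

```python
import matplotlib.pyplot as plt

def label_bars(ax, horizontal=False, fontsize=14):
    """Write each bar's value at its end and return the text objects created."""
    texts = []
    for rect in ax.patches:
        if horizontal:
            value = rect.get_width()                       # bar length
            xcoor = value
            ycoor = rect.get_y() + .5 * rect.get_height()  # middle of the bar
            ha, va = 'left', 'center'
        else:
            value = rect.get_height()                      # bar height
            xcoor = rect.get_x() + .5 * rect.get_width()   # middle of the bar
            ycoor = value
            ha, va = 'center', 'bottom'
        texts.append(ax.text(xcoor, ycoor, str(int(value)),
                             horizontalalignment=ha, verticalalignment=va,
                             fontsize=fontsize))
    return texts

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(['Very worried', 'Extremely worried'], [431, 252])
texts = label_bars(ax, horizontal=True)
```

The function converts values with int(), which suits counts; for means or percents, the str(int(value)) expression would need a rounding or formatting step instead.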

11.2.1.4. Grouped and Stacked Bars #

We can group data according to more than one feature. For example, we can create the counts within every combination of economic worry-level and party affiliation:

anes_plot = anes.groupby(['partyID', 'confecon']).size()
anes_plot = anes_plot.reset_index()
anes_plot = anes_plot.rename({0:'count'}, axis=1)
anes_plot

	partyID	confecon	count
0	Democrat	Not at all worried	73
1	Democrat	A little worried	320
2	Democrat	Moderately worried	481
3	Democrat	Very worried	284
4	Democrat	Extremely worried	166
5	Independent	Not at all worried	119
6	Independent	A little worried	134
7	Independent	Moderately worried	161
8	Independent	Very worried	75
9	Independent	Extremely worried	40
10	Republican	Not at all worried	559
11	Republican	A little worried	370
12	Republican	Moderately worried	177
13	Republican	Very worried	58
14	Republican	Extremely worried	33

We can create a barplot in which each economic worry level contains three bars, not just one, representing the three party affiliations. The bars can be color-coded according to party, and these colors can appear in a legend. The easiest way to construct such a visualization is with seaborn: all we have to do is add an argument hue set equal to partyID:

plt.figure(figsize=(15, 6))
sns.barplot(x='confecon', y='count', hue='partyID', data=anes_plot)

<matplotlib.axes._subplots.AxesSubplot at 0x1a25c9ff10>

_images/084cf7acd352df38e4a458ee9b4cb96e0c983f1ebb90ca806a3dc027b8432b04.png

This figure can be annotated in all the ways we described above, with a title, axis labels, and numeric labels on each bar:

plt.figure(figsize=(15, 6))
myplot = sns.barplot(x='confecon', y='count', hue='partyID', data=anes_plot)
plt.xlabel('Level of worry')
plt.ylabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    xcoor = rect.get_x() + .5*rect.get_width()
    ycoor = rect.get_height()
    plt.text(xcoor, ycoor, str(int(ycoor)), 
             horizontalalignment='center', 
             verticalalignment='bottom',
             fontsize=14)

_images/8b39c593e555cbeed8aac61d006718e1d2ab5bf6a538f6e12a666b0d97777bc2.png

The same method works in seaborn when the bars are oriented horizontally:

plt.figure(figsize=(10, 6))
myplot = sns.barplot(y='confecon', x='count', hue='partyID', data=anes_plot,
                orient ='h')
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    ycoor = rect.get_y() + .5*rect.get_height()
    xcoor = rect.get_width()
    plt.text(xcoor, ycoor, str(int(xcoor)), 
             horizontalalignment='left', 
             verticalalignment='center',
             fontsize=14)

_images/30bb7a38588b4e800b1aadc4e533db447e29f878dab23e332e1f6e81efee1fd3.png

To generate grouped bars using the .plot() method, we have to arrange the data so that one categorical feature comprises the rows and the other comprises the columns. We can either reshape the data or generate a cross-tab:

xtab = pd.crosstab(anes.confecon, anes.partyID)
xtab

partyID	Democrat	Independent	Republican
confecon
Not at all worried	73	119	559
A little worried	320	134	370
Moderately worried	481	161	177
Very worried	284	75	58
Extremely worried	166	40	33

If we apply the .plot() method to this cross-tab, we do not have to define the x and y features, as they are assumed to be the row and column indices. We can type:

myplot = xtab.plot(kind='bar', figsize = [15,6])
plt.xlabel('Level of worry')
plt.ylabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    xcoor = rect.get_x() + .5*rect.get_width()
    ycoor = rect.get_height()
    plt.text(xcoor, ycoor, str(int(ycoor)), 
             horizontalalignment='center', 
             verticalalignment='bottom',
             fontsize=14)

_images/4ab98665c44f227d3692b17f81af155cf6714432e352c6bf491f148f9692c6cf.png

We can swap the feature represented on the x-axis and the feature represented in the legend by switching the position of the two features in the cross tab:

xtab = pd.crosstab(anes.partyID, anes.confecon)
myplot = xtab.plot(kind='bar', figsize = [15,6])
plt.xlabel('Party affiliation')
plt.ylabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    xcoor = rect.get_x() + .5*rect.get_width()
    ycoor = rect.get_height()
    plt.text(xcoor, ycoor, str(int(ycoor)), 
             horizontalalignment='center', 
             verticalalignment='bottom',
             fontsize=14)

_images/39c977e74567f0d4a57afd79352f68a795ae697cb3d11a017fb2da699b9cf55d.png

The code also works for horizontal barplots:

xtab = pd.crosstab(anes.partyID, anes.confecon)
myplot = xtab.plot(kind='barh', figsize = [15,10])
plt.ylabel('Party affiliation')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    ycoor = rect.get_y() + .5*rect.get_height()
    xcoor = rect.get_width()
    plt.text(xcoor, ycoor, str(int(xcoor)), 
             horizontalalignment='left', 
             verticalalignment='center',
             fontsize=14)

_images/6d1c733a55e0f079ecc0b598b24f51b825e44085bedc42b9ebc44c299797ce5a.png

To stack the bars on top of one another rather than placing them side-by-side, use the stacked=True argument:

xtab = pd.crosstab(anes.confecon, anes.partyID)
myplot = xtab.plot(kind='barh', figsize = [15,10], stacked=True)
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')

Text(0.5, 1.0, 'Amount of Worry About the Current State of the Economy')

_images/68113f7636953cb188bc67094b6b6e7b39bdcb2a0552947276a14da58f34688e.png
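The text above mentions reshaping as an alternative to pd.crosstab(). As a minimal sketch of that route, the .pivot() method turns a long table of counts into the wide layout that .plot() expects. The dataframe below holds only a few rows typed in from the table of counts by party, and the names long_df and wide_df are chosen for illustration:

```python
import pandas as pd

# A few rows typed in from the long-format table of counts
long_df = pd.DataFrame({
    'partyID': ['Democrat', 'Democrat', 'Republican', 'Republican'],
    'confecon': ['Not at all worried', 'Very worried',
                 'Not at all worried', 'Very worried'],
    'count': [73, 284, 559, 58],
})

# Rows become worry levels, columns become parties
wide_df = long_df.pivot(index='confecon', columns='partyID', values='count')
print(wide_df)
```

Calling wide_df.plot(kind='barh', stacked=True) then draws the same kind of chart as the cross-tab version. With plain strings, the rows come out in alphabetical order; an ordered categorical column keeps the worry levels in their natural order.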

11.2.1.5. Placing Graphs Within Grids #

Instead of grouping many bars in one plot, it can be cleaner and more effective to place multiple corresponding graphs in different cells inside a grid, one for each category of another feature. The easiest method for creating a grid of visualizations is to use the sns.FacetGrid() function and the map() method.

The sns.FacetGrid() function takes a dataframe as its first argument, then we can specify the features whose categories comprise the rows and columns of this grid with the row and col arguments. Optionally, we can also change the color of the bars from cell to cell by setting hue equal to the row feature or the column feature. The .map() method operates on the output of sns.FacetGrid() and defines what the graph inside each cell should be.

For example, we can create six barplots, one for each combination of party affiliation and sex. We can set the rows to represent men and women and we can set the columns to represent Democrats, independents, and Republicans with the following code:

anes_plot = anes.groupby(['partyID', 'sex', 'confecon']).size().reset_index()
anes_plot = anes_plot.rename({0:'count'}, axis=1)

g = sns.FacetGrid(anes_plot, row = 'sex', col = 'partyID', hue = 'partyID')
g.map(plt.barh, 'confecon', 'count')

<seaborn.axisgrid.FacetGrid at 0x1a26035790>

_images/71c355df239beac997d36db0e6598a2e2f174aa198bcfbbb0ac97dc9eff074f2.png

To change the size of a grid, use the height argument to specify the height of each graph inside the grid, and use aspect to control how much bigger or smaller the width should be relative to the height. aspect=1 gives us a square within each cell, and height=2, aspect=1.5 sets the widths to be 3:

g = sns.FacetGrid(anes_plot, row = 'sex', col = 'partyID', hue = 'partyID',
                 height=2, aspect=1.5)
g.map(plt.barh, 'confecon', 'count')

<seaborn.axisgrid.FacetGrid at 0x1a26057a10>

_images/801d91aa3686f6f3116bcf67b80d26196d338a75c158271f765f7416244f06a6.png
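The arithmetic behind height and aspect can be sketched in a few lines. The function name grid_size is hypothetical, and the overall figure size is only approximate, since seaborn may adjust it (for example, to make room for a legend):

```python
def grid_size(n_rows, n_cols, height, aspect):
    """Approximate (width, height) in inches of a grid of equally sized cells."""
    cell_width = height * aspect          # width of one cell
    return (n_cols * cell_width, n_rows * height)

# Two rows (sex) by three columns (party), with height=2 and aspect=1.5
cell_width = 2 * 1.5
total = grid_size(n_rows=2, n_cols=3, height=2, aspect=1.5)
print(cell_width, total)
```

Here each cell is 3 inches wide and 2 inches tall, so the whole grid is roughly 9 inches by 4 inches.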

The individual cells are given labels like “sex = Male|partyID = Democrat”. To give these cells better labels, we can write

g.set_titles('{row_name}, {col_name}')

where {row_name} is code that refers to the name of the category that defines the row, and {col_name} is code that refers to the name of the category that defines the column.

We can set the axis labels by typing

g.set_axis_labels('Number of people', 'Level of worry') 

where the first entry is the label of the x-axis in each cell, and the second entry is the label of the y-axis in each cell.

We can give the figure an overall title by typing

g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Amount of Worry About the Current State of the Economy, by Party', fontsize=16)

The g.fig.subplots_adjust(top=0.8) function creates white space at the top of the grid to create space for the title, and the g.fig.suptitle() function sets the title and defines the fontsize of the title.

We can apply all of these functions to the grid as follows:

g = sns.FacetGrid(anes_plot, row = 'sex', col = 'partyID', hue = 'partyID',
                 height=2, aspect=1.5)
g.map(plt.barh, 'confecon', 'count')
g.set_titles('{row_name}, {col_name}')
g.set_axis_labels('Number of people', 'Level of worry') 
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Amount of Worry About the Current State of the Economy, by Party', fontsize=16)

Text(0.5, 0.98, 'Amount of Worry About the Current State of the Economy, by Party')

_images/8af5d5157dd64dd3acfbdf9812ae53dad5bb61b95c7862d91208562d9d0ce956.png

If we have only one grouping column, we can use only the row argument without col if we want all three graphs to be on top of one another in separate rows in the same column, as in the following example:

anes_plot = anes.groupby(['partyID','confecon']).size().reset_index()
anes_plot = anes_plot.rename({0:'count'}, axis=1)

g = sns.FacetGrid(anes_plot, row = 'partyID', hue = 'partyID',
                 height=4, aspect=1.5)
g.map(plt.barh, 'confecon', 'count')
g.set_titles('{row_name}')
g.set_axis_labels('Number of people', 'Level of worry') 
g.fig.subplots_adjust(top=.925)
g.fig.suptitle('Amount of Worry About the Current State of the Economy, by Party', fontsize=16)

Text(0.5, 0.98, 'Amount of Worry About the Current State of the Economy, by Party')

_images/f9145aa6edd70ec79ffcc79d674b972012cac9b55fa5ebbf92c2eb85d429117f.png

Or we can use only the col argument without row if we want all three graphs to be side-by-side in one row:

g = sns.FacetGrid(anes_plot, col = 'partyID', hue = 'partyID',
                 height=4, aspect=1)
g.map(plt.barh, 'confecon', 'count')
g.set_titles('{col_name}')
g.set_axis_labels('Number of people', 'Level of worry') 
g.fig.subplots_adjust(top=.8)
g.fig.suptitle('Amount of Worry About the Current State of the Economy, by Party', fontsize=16)

Text(0.5, 0.98, 'Amount of Worry About the Current State of the Economy, by Party')

_images/ffed08b849658fcfa9d907d0cdfa95fbdac0817cf50670aca46062092af48f77.png

If we have many categories, we can use the col_wrap argument to specify how many plots will be listed in each row before moving on to the next row. For example, we can list the economic anxiety by age group, creating nine age groups, and placing these graphs in a 3x3 grid:

anes_plot = anes
anes_plot['age_cat'] = pd.cut(anes.age, [17,25,35,45,50,55,65,75,85,100])
anes_plot = anes_plot.groupby(['confecon','age_cat']).size().reset_index()
anes_plot = anes_plot.rename({0:'count'}, axis=1)

g = sns.FacetGrid(anes_plot, col = 'age_cat', col_wrap=3, height=4, aspect=1)
g.map(sns.barplot, 'count', 'confecon')
g.set_titles('{col_name}')
g.set_axis_labels('Number of people', 'Level of worry') 
g.fig.subplots_adjust(top=.9)
g.fig.suptitle('Amount of Worry About the Current State of the Economy, by Age Group', fontsize=16)

Text(0.5, 0.98, 'Amount of Worry About the Current State of the Economy, by Age Group')

_images/9fc9f66daaec1e910760e9c36590e84d09e55d7be95df9435b534e052158e936.png
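It helps to see how pd.cut() assigns ages to the nine groups. By default each interval excludes its left edge and includes its right edge, so an age of 25 falls in (17, 25] and an age of 17 falls in no group at all. A minimal sketch with made-up ages:

```python
import pandas as pd

bins = [17, 25, 35, 45, 50, 55, 65, 75, 85, 100]
ages = pd.Series([17, 18, 25, 26, 90])   # made-up ages

# Ten bin edges define nine right-closed intervals
age_cat = pd.cut(ages, bins)
print(age_cat)
```

The missing value for age 17 is harmless when every respondent is at least 18, but it is worth checking for missing categories after cutting.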

11.2.1.6. Saving a Figure as a Local File #

To save a visualization that appears in your notebook as a local file, call the plt.savefig() function immediately after the code that generates the figure, in the same notebook cell. To save a JPG, PNG, or PDF file, pass plt.savefig() a filename that ends with the “.jpg”, “.png”, or “.pdf” extension, respectively. The file is saved in your current working directory, which you can display by importing the os package and calling os.getcwd(), and which you can change with the os.chdir() function.

The plt.savefig() function works with matplotlib, seaborn, and .plot() figures.

For example, we can save the following matplotlib figure to disk as a JPG file:

plt.figure(figsize=(10, 6))
myplot = plt.barh(mybars_df['confecon'], mybars_df['count'])
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
for rect in myplot.patches:
    ycoor = rect.get_y() + .5*rect.get_height()
    xcoor = rect.get_width()
    plt.text(xcoor, ycoor, str(xcoor), 
             horizontalalignment='left', 
             verticalalignment='center',
             fontsize=14)
plt.savefig('myfigure.jpg')

We can save the following seaborn figure as a PNG file:

g = sns.FacetGrid(anes_plot, col = 'age_cat', col_wrap=3, height=4, aspect=1)
g.map(plt.barh, 'confecon', 'count')
g.set_titles('{col_name}')
g.set_axis_labels('Number of people', 'Level of worry') 
g.fig.subplots_adjust(top=.9)
g.fig.suptitle('Amount of Worry About the Current State of the Economy, by Age Group', fontsize=16)
plt.savefig('myfigure2.png')

_images/2921d5cafd0981fa0f6f3a165eabb5229f4edb4b0c5a43d5c9bf61d07ce00cbe.png

And we can save the following .plot() figure as a PDF:

myplot = xtab.plot(kind='barh', figsize = [15,10], stacked=True)
plt.ylabel('Level of worry')
plt.xlabel('Number of people')
plt.title('Amount of Worry About the Current State of the Economy')
plt.savefig('myfigure3.pdf')
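As a quick check that a figure was written where we expect, the following minimal sketch builds a small bar chart from made-up counts, saves it, and confirms that the file exists. The filename example_figure.png and the counts are hypothetical, and the dpi and bbox_inches arguments are optional settings for resolution and for trimming extra whitespace:

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook; omit this line in a notebook
import matplotlib.pyplot as plt

# Made-up counts, for illustration only
fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(["Not at all", "A little", "A lot"], [10, 25, 40])
ax.set_xlabel("Number of people")
ax.set_ylabel("Level of worry")

# Build the full path so we know exactly where the file goes
out_path = os.path.join(os.getcwd(), "example_figure.png")  # hypothetical filename
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)

saved_ok = os.path.exists(out_path)
print(out_path, saved_ok)
```

Calling fig.savefig() on the figure object is equivalent to plt.savefig() when that figure is the current one.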

11.2.2. Scatterplots #

In the preceding section on barplots, we discussed how to generate visualizations using matplotlib, seaborn, and the .plot() method for a pandas dataframe. In this section and the following ones, we will focus on seaborn as this package usually creates visually appealing graphics with simpler code. We will use matplotlib and .plot() only in situations in which a specific kind of visualization is easier to build with one of these packages.

Scatterplots are used to illustrate the correlational relationship between two continuous features. One feature’s values are used as x-coordinates and the other feature’s values are used as y-coordinates, and these coordinates define points that are plotted on a Cartesian coordinate plane. We can annotate a scatterplot in a few ways. First we can use an aesthetic like color-coding or shapes to denote the value of a third categorical feature from point to point. For example, if the points represent individual people, we can use red points for the women and blue points for the men, or squares for people from Virginia and triangles for people from Maryland. We can also add a line of best fit to a scatterplot, which is derived from a simple linear regression model, to show the overall relationship between the features: a positive slope indicates a positive correlation and a relationship in which the features tend to increase or decrease together; a negative slope indicates a negative correlation and a relationship in which one feature tends to increase as the other decreases.

We can also change the background of seaborn graphs. For scatterplots, it can be useful to display gridlines in the background to better match points to values on the x and y axes. To see this grid, type:

sns.set(color_codes=True)
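sns.set() applies seaborn's default theme, which draws the grid on a gray background. If you would rather keep a white background with gridlines, one alternative is the sns.set_style() function with the 'whitegrid' style. A minimal sketch, which checks the matplotlib setting that seaborn changes:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Switch to a white background with gridlines
sns.set_style("whitegrid")

# seaborn styles work by updating matplotlib's rcParams
grid_on = plt.rcParams["axes.grid"]
print(grid_on)
```

The style applies to every figure drawn afterward in the same session.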

11.2.2.1. Using Colors, Shapes, and Transparency to Annotate Scatterplots #

To generate a scatterplot from seaborn, use the sns.scatterplot() function. Like the sns.barplot() function, we specify features for the x and y axes and the dataframe where these features are stored. For example, we can plot people’s thermometer rating of Donald Trump against their thermometer rating of Joe Biden. For the sake of clarity, I only include the first 200 rows of anes for now:

sns.scatterplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200))

<matplotlib.axes._subplots.AxesSubplot at 0x1a27ac3450>

_images/2a006830c08cba9966c2ad8fdee0449ba810be48edc59100f66da95b1b5e8381.png

We can change the size of the scatterplot in the notebook with the plt.figure() function, and we can label the axes and add a title as follows:

plt.figure(figsize = (6,6))
sns.scatterplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200))
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Evaluations of Biden vs. Trump')

Text(0.5, 1.0, 'Evaluations of Biden vs. Trump')

_images/17a38b1641259f57626ae1198dd2f9a4e77e9b67f3350f07f4019afe9549aa27.png

There are many ways to use aesthetics to include additional features in a scatterplot. First we can color-code points to represent categories of a different feature using the hue argument. We can represent party affiliation this way:

plt.figure(figsize = (6,6))
sns.scatterplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200),
               hue = 'partyID')
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Evaluations of Biden vs. Trump')

Text(0.5, 1.0, 'Evaluations of Biden vs. Trump')

_images/54f3923d611dd0ad6c7bedbbaf4f486eb5d969f7b664384c888828b1e6b69c38.png

As expected, Republicans cluster in the top-left corner with high ratings for Trump and low ratings for Biden, and Democrats cluster in the bottom-right corner with high ratings for Biden and low ratings for Trump. Independents are spread out and are more common in the middle. No one rates both Trump and Biden over 80.

Note that when we include an aesthetic, seaborn creates a legend automatically. Legends are crucial as it would be unclear what the colors represent without one. To turn off the legend, however, we can include the legend=False argument in sns.scatterplot().

We can use different symbols to represent these categories with the style argument. The following figure uses different symbols for men and women:

plt.figure(figsize = (6,6))
sns.scatterplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200),
               style = 'sex')
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Evaluations of Biden vs. Trump')

Text(0.5, 1.0, 'Evaluations of Biden vs. Trump')

_images/e1ad642d7f8ab3f153f56e59ef196f1c6687883e7b1d2ba7bcade1a99fa443d9.png

My personal opinion, however, is that shapes do not tend to convey information very well without color-coding points as well. If all of the points are blue, it can be hard to tell the dots from the Xs.

We can use both color and style aesthetics in the same scatterplot. The following graph uses both colors and shapes for partyID:

plt.figure(figsize = (6,6))
sns.scatterplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200),
               hue = 'partyID', style = 'partyID')
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Evaluations of Biden vs. Trump')

Text(0.5, 1.0, 'Evaluations of Biden vs. Trump')

_images/9d694a03a93cf1d24a72652c3b1b4b5a06d2318703550f134708bc85d225a945.png

We can set colors to represent one feature while shapes represent another. In this case, both aesthetic mappings are listed in the legend. Here is a scatterplot that uses color for partyID and shapes for sex:

plt.figure(figsize = (6,6))
sns.scatterplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200),
               hue = 'partyID', style = 'sex')
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Evaluations of Biden vs. Trump')

Text(0.5, 1.0, 'Evaluations of Biden vs. Trump')

_images/3a6e7ba489bc008132ff7b8b6e9b01b90e452718063eaafabfd2c988e7439462.png

We can control the transparency of the points with the alpha argument. alpha ranges between 0 and 1, where 0 means perfectly transparent, 1 means perfectly solid, and values between 0 and 1 represent degrees of transparency. When points are partially transparent, areas with many points become darker, so the main purpose of transparency is to illustrate the areas in the scatterplot with higher densities of data. Below I list two scatterplots, each with the points set to black instead of the default blue, and each with all of the rows in anes. The first scatterplot uses solid dots and the second uses alpha=.1 to make the dots mostly transparent. Notice how the clusters emerge in the second scatterplot:

plt.figure(figsize = (6,6))
sns.scatterplot(x = 'ftbiden', y = 'fttrump', data = anes,
               color=['black'])
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('No Transparency')

plt.figure(figsize = (6,6))
sns.scatterplot(x = 'ftbiden', y = 'fttrump', data = anes,
               alpha = .1, color=['black'])
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Transparency (alpha=.1) Shows Where the Data Cluster')

Text(0.5, 1.0, 'Transparency (alpha=.1) Shows Where the Data Cluster')

_images/da8a3df2517a19008e18a7a7e6a30514b78829b2759b1256747d09f430cf8797.png

_images/0483c893691aee9bde94c315bc6c27d7dc698bf1ef2896a5401d835ee947150c.png

The darkest parts of the graph are the areas with the most common ratings of (Trump, Biden). These are (100, 0), (0, 100), (0, 50), (50, 50), and (0,0). Areas along the axes, rating one candidate at 0, are more common than other areas on the graph.
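To see why transparency reveals clusters, here is a sketch with synthetic data rather than the anes data. The column names rating_a and rating_b are made up, and I deliberately stack 500 identical points on each of two popular answers so that only the transparent version can show where the data pile up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# 500 copies of each of two "popular" answers, plus 500 uniformly random answers
popular = np.tile([[0, 100], [100, 0]], (500, 1))
noise = rng.integers(0, 101, size=(500, 2))
toy = pd.DataFrame(np.vstack([popular, noise]), columns=["rating_a", "rating_b"])

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
# Solid points: a stack of 500 points looks the same as a single point
sns.scatterplot(x="rating_a", y="rating_b", data=toy, color="black", ax=axes[0])
# Transparent points: stacked points accumulate into darker marks
sns.scatterplot(x="rating_a", y="rating_b", data=toy, color="black", alpha=0.1, ax=axes[1])
axes[0].set_title("No transparency")
axes[1].set_title("alpha = 0.1")
```

In the right-hand panel the two corners with stacked points appear solid black while the random points are faint, which is the same effect that reveals the clusters in the anes scatterplot.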

We can also use the size of the points to represent values of a feature. We use the size argument to map a feature to size, and we use sizes to set the range of point sizes, from smallest to largest, that correspond to the values of the feature listed with size. In the following graph, I use both color and size to represent age, where larger dots and darker shades represent older individuals:

plt.figure(figsize = (6,6))
sns.scatterplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200),
               size = 'age', sizes=(40, 360), alpha=.5,
               hue = 'age')
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Evaluations of Biden vs. Trump')

Text(0.5, 1.0, 'Evaluations of Biden vs. Trump')

_images/840b5e34dde8e42f55d4e9ba16eb771b5f1d22f26047e1107cec9cebcc1ad625.png

I don’t recommend using size to represent data, however, as larger points tend to obscure smaller points.

11.2.2.2. Including Lines of Best Fit #

Scatterplots are designed to illustrate the relationship between two features, but it can be hard to see this relationship when there are a lot of data points or when the data have a lot of variance. We can include a line of best fit on a scatterplot to explicitly show the relationship by using the sns.regplot() function instead of sns.scatterplot(). sns.regplot() sometimes expands the range of the x and y axes beyond the data to better show the complete line of best fit.

The Trump vs. Biden scatterplot with a linear fit is as follows:

plt.figure(figsize = (6,6))
sns.regplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200))
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Evaluations of Biden vs. Trump')

Text(0.5, 1.0, 'Evaluations of Biden vs. Trump')

_images/5be3041a268127a4e53d0598076186684b769762461a9c1596b2cbfc1559d0dc.png

This line represents the predicted values of the Trump thermometer given a value of the Biden thermometer, as derived from a simple regression model in which the Trump thermometer is regressed on the Biden thermometer and an intercept is included. There are a few ways to understand the line:

First, consider the slope. Positive slopes show a positive relationship between the features so that they tend to increase or decrease together. In this case the slope is negative, which shows that as the Biden thermometer increases, the Trump thermometer decreases. Flat slopes indicate a weak relationship or no relationship between the features.
Second, the line can be used to generate predictions. First choose a value on the x axis of interest, then find the y-coordinate at which the line passes through this position on the x-axis. This y-coordinate is the predicted value of the feature on the y-axis given the value of the feature on the x-axis. For example, suppose that a person rates Joe Biden 60. We find 60 on the x-axis and follow it up to the line, which crosses this value when y is about 28. So a person who rates Biden 60 is predicted to rate Trump 28.
Third, the shaded area around the line is the 95% confidence interval around the prediction. This interval provides a range of feasible values for the prediction at any value of x. The more datapoints in the graph, and the closer they are on average to the line, the smaller this region will be. For a person who rates Biden 60, the lower bound of this interval is at about 25 on the y-axis and the upper bound of the interval is at about 31. So while our prediction for a person who rates Biden 60 is that they will rate Trump 28, our 95% confidence interval for this prediction is between 25 and 31.
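The slope and the prediction described above can also be computed directly. The following is a minimal sketch that uses synthetic data (not the anes data) and numpy's np.polyfit() function to fit the least-squares line. The coefficients 80 and -0.7 used to simulate the data are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic ratings with a built-in negative relationship
x = rng.uniform(0, 100, size=200)                 # stands in for one thermometer
y = 80 - 0.7 * x + rng.normal(0, 15, size=200)    # stands in for the other thermometer

# Least-squares line: polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Prediction: the height of the line at x = 60
pred_at_60 = intercept + slope * 60

print(round(slope, 2), round(pred_at_60, 1))
```

A negative value for slope corresponds to a downward-sloping line, and pred_at_60 is the value we would read off the y-axis where the line crosses x = 60.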

Notice that the linear prediction overestimates the ratings for Trump for those who rate Biden about 60: almost all of these individuals rate Trump lower than that. This overestimation is due to the constraint of a straight line for the entire scatterplot. There are alternatives to a linear fit that allow for curved, nonlinear relationships between features. First, if we specify order=2, then sns.regplot() uses a parabola to model the relationship instead of a line:

plt.figure(figsize = (6,6))
sns.regplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200),
           order = 2)
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Evaluations of Biden vs. Trump')

Text(0.5, 1.0, 'Evaluations of Biden vs. Trump')

_images/2466ddbad2bbfa541b4f997d1c1020c327345e6b5205db7f652fc329bb2004fe.png

We can also use lowess=True (this option requires the statsmodels package to be installed). LOWESS stands for locally weighted scatterplot smoothing. It fits low-degree polynomials to the data in local regions of the scatterplot and connects these local fits to form one curved line through the data. The result is a visualization of a nonlinear relationship that can bend in ways not possible with a single polynomial. The LOWESS fit for the Trump vs Biden scatterplot is:

plt.figure(figsize = (6,6))
sns.regplot(x = 'ftbiden', y = 'fttrump', data = anes.head(200),
           lowess = True)
plt.xlabel('Joe Biden thermometer rating')
plt.ylabel('Donald Trump thermometer rating')
plt.title('Evaluations of Biden vs. Trump')

Text(0.5, 1.0, 'Evaluations of Biden vs. Trump')

_images/cf489263922b78a8001dc7a8374695f2d63b7db138403367765cf1614b1986ea.png
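To make the idea of local fitting concrete, here is a simplified sketch of locally weighted regression written with numpy alone. It is not the implementation that seaborn calls: the function name local_linear_smooth and the frac parameter are my own, the data are synthetic, and the robustness iterations of the full LOWESS algorithm are left out. For each point, the sketch fits a straight line to the nearest neighbors, giving closer neighbors more weight:

```python
import numpy as np

def local_linear_smooth(x, y, frac=0.3):
    """Fit a weighted straight line around each point and return the fitted values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))            # number of neighbors in each local fit
    design = np.column_stack([np.ones(n), x])     # columns: intercept, x
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        # Bandwidth: distance to the k-th nearest point (guarded against zero)
        h = max(np.sort(dist)[k - 1], np.finfo(float).eps)
        # Tricube weights: close points count more, points beyond h count zero
        w = (1 - np.clip(dist / h, 0, 1) ** 3) ** 3
        sw = np.sqrt(w)
        # Weighted least squares for the local line
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

rng = np.random.default_rng(5)
x = np.linspace(0, 100, 200)
y = (x - 50) ** 2 / 25 + rng.normal(0, 5, size=200)   # synthetic U-shaped relationship

# Compare a single straight line with the local fit
slope, intercept = np.polyfit(x, y, deg=1)
mse_line = np.mean((y - (intercept + slope * x)) ** 2)
mse_smooth = np.mean((y - local_linear_smooth(x, y)) ** 2)
print(round(mse_line, 1), round(mse_smooth, 1))
```

On this U-shaped example the single straight line misses the curve almost entirely, while the local fit follows it, which is why the mean squared error of the local fit is much smaller.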

11.2.2.3. Scatterplot Grids and Matrices #

It can be useful to see related scatterplots side by side or in a grid. We can use the sns.FacetGrid() function to divide the rows into subsets according to a categorical feature, then to generate different versions of the same scatterplot for each subset. sns.FacetGrid() allows us to place these scatterplots side-by-side.

For example, here is the code to create three scatterplots of the Trump and Biden thermometers. It largely follows the same syntax that we used to generate a facet grid of barplots, but I pass the sns.scatterplot() function to the .map() method along with the aesthetic parameters I want to apply to each graph: alpha=.1 and color='black' to help reveal the clustering in each scatterplot:

g = sns.FacetGrid(anes, col = 'partyID', height=4, aspect=1)
g.map(sns.scatterplot, 'ftbiden', 'fttrump', alpha=.1, color='black')
g.set_titles('{col_name}')
g.set_axis_labels('Joe Biden thermometer rating', 'Donald Trump thermometer rating') 
g.fig.subplots_adjust(top=.8)
g.fig.suptitle('Evaluations of Biden vs. Trump, by Party', fontsize=16)

Text(0.5, 0.98, 'Evaluations of Biden vs. Trump, by Party')

_images/588e68cd88433e36fb2a4ab072f952d16f37598e3dfde08cd147399e249d89c6.png

Now that we’ve separated the Democrats, Republicans, and independents, we can more clearly see how each group clusters in terms of their ratings of Trump and Biden. Democrats are mostly united in rating Trump very low, but the spread along the x-axis indicates a wide range of mixed feelings towards Biden. Republicans in contrast are more clustered around higher ratings for Trump.

Another way to see scatterplots side-by-side is with a scatterplot matrix, which we can generate with the sns.pairplot() function. For every feature in the data that we pass to this function, sns.pairplot() adds another row and column in a grid of scatterplots. Along the diagonal of this grid are plots that describe the overall distribution of the feature in question, and these plots can be histograms or densities. The off-diagonal graphs are scatterplots in which the feature represented in the row comprises the y-axis and the feature represented in the column comprises the x-axis. The following code shows the scatterplots between the Trump, Biden, Obama, and Sanders thermometers, uses densities on the diagonals, includes best fit lines, and turns the points partially transparent. I also use a for-loop to set the x and y axes in each of the non-diagonal graphs to range from 0 to 100:

g = sns.pairplot(anes[['fttrump', 'ftbiden', 'ftobama', 'ftsanders']].head(200),
            diag_kind = 'kde',
            kind = 'reg',
            plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.25}})
for i in range(0,4):
    for j in range(0,i):
        g.axes[i,j].set_xlim((0,100))
        g.axes[i,j].set_ylim((0,100))
        g.axes[j,i].set_xlim((0,100))
        g.axes[j,i].set_ylim((0,100))

_images/cbacf0c2788386cf0e771feb50f67752a5010303f1d8c4f171dc04c3e89132d7.png

My personal preference, however, is to avoid scatterplot matrices because these figures can be very complicated, and although they contain a lot of information, they are not very efficient at conveying a cohesive story.

11.2.3. Line Plots #

Line plots are used to illustrate the relationship between a continuous feature and an index. An index is a continuous feature in which the order of the rows has meaning, and the most common index is time, ordered from earlier to later time points. If multiple continuous features have similar scales, they can be plotted with overlaid lines against the same index. In that case, we can use different line colors to distinguish between the lines, or different line types (solid, dashed, dotted, etc.). If the features have very different scales, then it is possible to use two different y-axes in one line plot, with the values for one axis listed on the left and the values for the other axis listed on the right, as in the following example:

Per capita consumption of cheese (US) correlates with the number of people who died by becoming tangled in their bedsheets. (Source: Spurious Correlations by Tyler Vigen https://tylervigen.com/old-version.html).

The use of multiple axes, however, is generally frowned upon by statisticians and data scientists because it can lead to problems such as Darrell Huff’s Gee-Whiz Graphs that overemphasize changes in a feature by manipulating the scale of an axis. Dual axes can also appear to show strong relationships between features that have no relationship beyond a shared upward trend over time, as is the case in the above figure (unless we can find a compelling scientific argument for why increased cheese consumption would cause people to die by becoming entangled in their bedsheets). As such, we will not discuss line plots with multiple axes here.

As with scatterplots, we will focus on using seaborn to create line plots because seaborn produces good-looking visualizations with comparatively simple code.

To plot a single feature over values on an index, create a dataframe with the index and the feature. Here we calculate the average rating of Joe Biden within each age represented in the data, up to age 84:

anes_line = anes.groupby('age').agg({'ftbiden':'mean'})
anes_line = anes_line[0:65].reset_index()
anes_line

	age	ftbiden
0	20	43.604651
1	21	48.709677
2	22	38.827586
3	23	40.785714
4	24	51.640000
...	...	...
60	80	37.444444
61	81	31.263158
62	82	44.000000
63	83	29.615385
64	84	28.500000

65 rows × 2 columns

To generate a line plot, we use the sns.lineplot() function. This function works just like sns.barplot() and sns.scatterplot(). We specify the features to represent on the x and y axes and we can use plt.figure(), plt.xlabel(), plt.ylabel(), and plt.title() to control the size of the figure, label the axes, and add a title. Here we plot the average Biden ratings against the age groups:

plt.figure(figsize=(7,7))
sns.lineplot(x='age', y='ftbiden', data=anes_line)
plt.xlabel('Age of voters')
plt.ylabel('Joe Biden thermometer rating')
plt.title('Feeling Thermometer Ratings By Age Group')

Text(0.5, 1.0, 'Feeling Thermometer Ratings By Age Group')

_images/44791815d5cdbef9de71e29d00e15e1c2c33203ee8c2a426821196b9ee0fafbe.png

Notice that the y-axis, by default, is set to a range slightly larger than the range of the data. That can misrepresent the degree to which the data change. A more honest presentation expands the y-axis to the full range of possible values for a feeling thermometer, from 0 to 100. We can expand the range by typing plt.ylim(0,100):

plt.figure(figsize=(7,7))
sns.lineplot(x='age', y='ftbiden', data=anes_line)
plt.xlabel('Age of voters')
plt.ylabel('Joe Biden thermometer rating')
plt.title('Feeling Thermometer Ratings By Age Group')
plt.ylim(0,100)

(0.0, 100.0)

_images/cfdd16885f39745cd4b0cc0b17ee2fd9cde7e80a7062f6205f4be341b6408ece.png

To plot multiple features against the same index, create a dataframe that is reshaped to long format. In this case, we can use techniques we discussed in chapters 8 and 9 to create a dataframe with three columns: age, candidate (the subject of each feeling thermometer), and the values of the thermometers. We place the thermometers for Biden, Trump, Barack Obama, and Bernie Sanders in this dataframe:

anes_line = anes.groupby('age').agg({'ftbiden':'mean',
                                    'fttrump':'mean',
                                    'ftobama':'mean',
                                    'ftsanders':'mean'})
anes_line = anes_line[0:65].reset_index()
anes_line = pd.melt(anes_line, id_vars='age', 
                    value_vars = ['ftbiden', 'fttrump', 'ftobama', 'ftsanders'])
anes_line = anes_line.rename({'variable':'candidate', 'value':'thermometer'}, axis=1)
anes_line['candidate'] = anes_line['candidate'].map({'ftbiden':'Joe Biden',
                                                     'fttrump':'Donald Trump',
                                                     'ftobama':'Barack Obama',
                                                     'ftsanders':'Bernie Sanders'})
anes_line

	age	candidate	thermometer
0	20	Joe Biden	43.604651
1	21	Joe Biden	48.709677
2	22	Joe Biden	38.827586
3	23	Joe Biden	40.785714
4	24	Joe Biden	51.640000
...	...	...	...
255	80	Bernie Sanders	35.611111
256	81	Bernie Sanders	27.157895
257	82	Bernie Sanders	39.133333
258	83	Bernie Sanders	18.461538
259	84	Bernie Sanders	21.357143

260 rows × 3 columns

To overlay multiple lines using the same axes, we set age as the x-axis, thermometer as the y-axis, and set the two lines to be different colors with hue='candidate'. seaborn generates a legend by default. To show only the lines for Biden and Trump, we filter the data to only those that represent these two candidates:

plt.figure(figsize=(7,7))
sns.lineplot(x='age', y='thermometer', hue='candidate',
             data=anes_line.query("candidate=='Donald Trump' | candidate=='Joe Biden'"))
plt.xlabel('Age of voters')
plt.ylabel('Thermometer rating')
plt.title('Feeling Thermometer Ratings By Age Group')
plt.ylim(0,100)

(0.0, 100.0)

_images/d474f941b6fe2d7c4de5ed1e2eff52bc832056b288e896b9f5b65eec26b768ba.png

This graph indicates that Biden is more popular than Trump among younger voters and Trump is more popular than Biden among older voters.

We can use different line types as well as different colors for the candidates by including style='candidate' in the sns.lineplot() function. The following figure uses a solid, blue line for Biden and a dashed, orange line for Trump:

plt.figure(figsize=(7,7))
sns.lineplot(x='age', y='thermometer', 
             hue='candidate', style='candidate',
             data=anes_line.query("candidate=='Donald Trump' | candidate=='Joe Biden'"))
plt.xlabel('Age of voters')
plt.ylabel('Thermometer rating')
plt.title('Feeling Thermometer Ratings By Age Group')
plt.ylim(0,100)

(0.0, 100.0)

_images/391848fe0a55343bcbc494ba60496a92477fcd1ba23f43f7fe2513515674e71e.png

To include four lines with four different colors and line types, we use the same code without subsetting the data:

plt.figure(figsize=(7,7))
sns.lineplot(x='age', y='thermometer', 
             hue='candidate', style='candidate',
             data=anes_line)
plt.xlabel('Age of voters')
plt.ylabel('Thermometer rating')
plt.title('Feeling Thermometer Ratings By Age Group')
plt.ylim(0,100)

(0.0, 100.0)

_images/315e63547b625bea256c41138be0d996f81bd5b9ddad4cf2b3c366dafc687fa8.png

Four lines on one graph can be confusing, however. We can convey this information more clearly by using sns.FacetGrid() with col = 'candidate' to create different graphs for each politician, and col_wrap=2 to align these graphs in a 2x2 grid. To color-code the lines, write hue='candidate' in the sns.FacetGrid() function:

g = sns.FacetGrid(anes_line, col = 'candidate', 
                  col_wrap=2, height=4, aspect=1,
                 hue='candidate')
g.map(sns.lineplot, 'age', 'thermometer')
g.set_titles('{col_name}')
g.set_axis_labels('Age', 'Thermometer rating') 
g.fig.subplots_adjust(top=.9)
g.fig.suptitle('Feeling Thermometer Ratings By Age', fontsize=16)

Text(0.5, 0.98, 'Feeling Thermometer Ratings By Age')

_images/d20cfac14bfda9b635b7783477b5dd334241e0ac2952ead258cb06d9f1f1a90e.png

11.2.4. Distributional Graphs #

Distributional plots include histograms, density plots, violin plots, and box plots. Each of these kinds of graphs shows the overall distribution of a continuous feature, not just a measure of location like the mean or median. In other words, distributions show the entirety of a feature while descriptive statistics only show a small amount of information about a feature. Suppose we have two candidates that individuals are asked to rate on a scale from 0 to 100. If half the individuals rate the first candidate 0, and half the individuals rate the first candidate 100, then this candidate’s average rating is 50. If all the individuals rate the second candidate 50, then this candidate’s average rating is also 50. Although the means are equal, the distributions are very different, and it would be a mistake to draw conclusions about these two candidates without taking into consideration the whole distribution of each candidate’s ratings.
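The two-candidate example can be verified with a few lines of numpy. This sketch assumes 100 hypothetical raters:

```python
import numpy as np

# First candidate: half of the raters say 0, half say 100
first = np.array([0] * 50 + [100] * 50)
# Second candidate: every rater says 50
second = np.full(100, 50)

# The means are identical...
mean_first, mean_second = first.mean(), second.mean()
# ...but the spread around the mean is completely different
sd_first, sd_second = first.std(), second.std()

print(mean_first, mean_second)
print(sd_first, sd_second)
```

Both means equal 50, but the standard deviation is 50 for the first candidate and 0 for the second. A histogram of each array would make the difference obvious at a glance.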

11.2.4.1. Histograms #

Histograms are similar to barplots, except that the bars are generally drawn without spaces in between. A histogram first uses a function like pd.cut() to place the continuous values into bins, then calculates the frequency of each bin. The number of bins defines the number of bars in a histogram, and the frequencies provide the height of each bar.
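The two steps behind a histogram, binning and counting, can be sketched by hand. The ratings and the bin edges below are made up for illustration:

```python
import pandas as pd

# Ten hypothetical thermometer ratings
ratings = pd.Series([0, 5, 15, 50, 50, 55, 85, 100, 100, 100])

# Step 1: place each value into one of four equally spaced bins
edges = [0, 25, 50, 75, 100]
binned = pd.cut(ratings, bins=edges, include_lowest=True)

# Step 2: count the values in each bin; these counts are the bar heights
counts = binned.value_counts().sort_index()
print(counts)
```

The counts for the four bins are 3, 2, 1, and 4, which sum to the ten observations. Note that each bin includes its right edge, so the two ratings of 50 fall in the second bin.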

Histograms in seaborn are generated with the sns.distplot() function. Unlike other seaborn functions, where we had to specify an x and y feature, here we pass only the feature whose distribution we are trying to visualize. By default, sns.distplot() will draw a curve on top of the histogram showing the estimated density function, but it tends to clutter the figure, so we can turn it off with kde=False. Here is the histogram of the Biden thermometer ratings:

plt.figure(figsize=(7,5))
sns.distplot(anes.ftbiden, kde=False)
plt.ylabel('Frequency')
plt.xlabel('Thermometer rating')
plt.title('Distribution of Joe Biden Thermometer Ratings')

Text(0.5, 1.0, 'Distribution of Joe Biden Thermometer Ratings')

_images/d714f2e27bddc70f50f18884ec65ed9855fba66adc2b727164b3ab19b97581be.png

We can change the color of the bars with the color argument, and we can turn off the default transparency of the color with hist_kws=dict(alpha=1):

plt.figure(figsize=(7,5))
sns.distplot(anes.ftbiden, kde=False,
            color = 'red', hist_kws=dict(alpha=1))
plt.ylabel('Frequency')
plt.xlabel('Thermometer rating')
plt.title('Distribution of Joe Biden Thermometer Ratings')

Text(0.5, 1.0, 'Distribution of Joe Biden Thermometer Ratings')

_images/83a7b178fdbbfe091c567997d885e9ababcd1b28e55536eb792491f028a9e117.png

We can generate a histogram turned 90 degrees with vertical=True. Be sure to switch the labels for the x and y axes:

plt.figure(figsize=(7,5))
sns.distplot(anes.ftbiden, kde=False,
            color = 'red', hist_kws=dict(alpha=1),
            vertical = True)
plt.xlabel('Frequency')
plt.ylabel('Thermometer rating')
plt.title('Distribution of Joe Biden Thermometer Ratings')

Text(0.5, 1.0, 'Distribution of Joe Biden Thermometer Ratings')

_images/8c74e7432a18ff6d38db000fac5c76fd5d1a434c53411a6ec4c161e8d564a248.png

A key decision with histograms is the number of bins: because the feature is continuous, we must first place the values into equally spaced ranges that divide the total range of the feature, then we count the number of values in each range, and we set the heights of the bars equal to these frequencies. By default, sns.distplot() uses the number of bins that is recommended by the numpy.histogram_bin_edges() function. This function calculates the number of bins to use with the Freedman-Diaconis rule and with Sturges’ Formula and takes the higher of these two numbers. We can change this default with the bins argument.
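We can inspect these rules ourselves by calling numpy.histogram_bin_edges() with the bins argument set to 'fd', 'sturges', or 'auto'. This sketch uses synthetic, normally distributed values, so the specific bin counts it prints say nothing about the anes data:

```python
import numpy as np

rng = np.random.default_rng(4)
values = rng.normal(50, 20, size=1000)  # synthetic continuous feature

# The function returns the bin edges, so the number of bins is one less than the number of edges
n_fd = len(np.histogram_bin_edges(values, bins="fd")) - 1            # Freedman-Diaconis rule
n_sturges = len(np.histogram_bin_edges(values, bins="sturges")) - 1  # Sturges' formula
n_auto = len(np.histogram_bin_edges(values, bins="auto")) - 1        # the larger of the two

print(n_fd, n_sturges, n_auto)
```

The 'auto' setting returns whichever of the two rules yields more bins.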

If we set the number of bins too low, we won’t accurately show the shape of the feature’s distribution, as in the following graph:

plt.figure(figsize=(7,5))
sns.distplot(anes.ftbiden, kde=False, bins=3)
plt.ylabel('Frequency')
plt.xlabel('Thermometer rating')
plt.title('Distribution of Joe Biden Thermometer Ratings')

Text(0.5, 1.0, 'Distribution of Joe Biden Thermometer Ratings')

_images/14fff53ef52f71e3e8afae2d860ad04d36b150eabb0ba8ed8feb2bb22b92b8f4.png

But if we set the number of bins too high, we won’t have enough observations to get good estimates of the frequencies, leaving us with a figure like this:

plt.figure(figsize=(7,5))
sns.distplot(anes.ftbiden, kde=False, bins=300, 
             color='black', hist_kws=dict(alpha=1))
plt.ylabel('Frequency')
plt.xlabel('Thermometer rating')
plt.title('Distribution of Joe Biden Thermometer Ratings')

Text(0.5, 1.0, 'Distribution of Joe Biden Thermometer Ratings')

_images/e12ca99e45e2f7cb319cb94b78e480405065ca1fb0a11a3e66659e63596fe0c8.png

To show multiple features’ distributions in one figure, we can overlay two histograms with two calls to sns.distplot(), one for each feature. The two histograms will appear in the same figure but with different colors. The default transparency of the colors allows us to see three shades, one for each of the features and one for where the features overlap. It is important to include a legend to tell which color refers to which feature. For a legend, include the label argument in each call to sns.distplot(): these labels define the text that will appear in the legend. Then use the plt.legend() function along with the other attributes. A figure that plots the Trump and Biden histograms together is:

plt.figure(figsize=(7,5))
sns.distplot(anes.ftbiden, kde=False, bins=12, label='Biden')
sns.distplot(anes.fttrump, kde=False, bins=12, label='Trump')
plt.ylabel('Frequency')
plt.xlabel('Thermometer rating')
plt.title('Distribution of Thermometer Ratings')
plt.legend()

<matplotlib.legend.Legend at 0x1a25fe8d50>

_images/ea14b826b275816b1f3290bd23c62580c85978b889f41fe1ff81bd5a01ea2e97.png

The conclusion we can draw here is that more people have extreme opinions of Trump, with ratings of 0 or 100, and more people have middling opinions of Biden.

To show various histograms in a facet grid, we have to use pd.melt() to reshape the data with the features we want to plot in long format:

anes_hist = pd.melt(anes, id_vars = ['caseid'], 
                    value_vars = ['ftbiden', 'fttrump',
                                 'ftobama', 'ftsanders'])
anes_hist = anes_hist.rename({'variable':'candidate',
                             'value':'thermometer'}, axis=1)
anes_hist['candidate'] = anes_hist['candidate'].map({'ftbiden':'Joe Biden',
                                                     'fttrump':'Donald Trump',
                                                     'ftobama':'Barack Obama',
                                                     'ftsanders':'Bernie Sanders'})
anes_hist

	caseid	candidate	thermometer
0	1	Joe Biden	52.0
1	2	Joe Biden	41.0
2	3	Joe Biden	88.0
3	4	Joe Biden	0.0
4	5	Joe Biden	25.0
...	...	...	...
12655	3161	Bernie Sanders	6.0
12656	3162	Bernie Sanders	92.0
12657	3163	Bernie Sanders	59.0
12658	3164	Bernie Sanders	79.0
12659	3165	Bernie Sanders	100.0

12660 rows × 3 columns
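To see what pd.melt() does on a smaller scale, here is a minimal sketch that uses a made-up data frame with three rows. The name toy and its values are hypothetical and are not drawn from the ANES data. Each wide column becomes a block of rows in the long data frame:

```python
import pandas as pd

# Hypothetical toy data in wide format: one row per respondent,
# one column per thermometer (values invented for illustration).
toy = pd.DataFrame({'caseid': [1, 2, 3],
                    'ftbiden': [52.0, 41.0, 88.0],
                    'fttrump': [10.0, 95.0, 0.0]})

# Reshape to long format: one row per respondent-candidate pair.
toy_long = pd.melt(toy, id_vars=['caseid'],
                   value_vars=['ftbiden', 'fttrump'])
toy_long = toy_long.rename({'variable': 'candidate',
                            'value': 'thermometer'}, axis=1)

# 3 respondents x 2 thermometers = 6 rows, each with 3 columns
print(toy_long.shape)
print(toy_long)
```

The number of rows in the long data frame is the number of rows in the wide data frame times the number of melted columns, which is why anes_hist has 3165 × 4 = 12660 rows.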

We can now use the sns.FacetGrid() function the way we did for line plots:

g = sns.FacetGrid(anes_hist, col = 'candidate', 
                  col_wrap=2, height=4, aspect=1,
                 hue='candidate')
g.map(sns.distplot, 'thermometer', kde=False)
g.set_titles('{col_name}')
g.set_axis_labels('Thermometer rating', 'Frequency') 
g.fig.subplots_adjust(top=.9)
g.fig.suptitle('Feeling Thermometer Distributions', fontsize=16)

Text(0.5, 0.98, 'Feeling Thermometer Distributions')

_images/e431bd7a5d8ed3f87aeff7c588934ccdebfb5281f105c1a8a1a7b8d26e3edc6c.png

11.2.4.2. Density Plots #

Density plots, or kernel density plots, are smoothed relatives of histograms. Given a histogram, it is possible to draw a continuous curve that follows the heights of the bars. This curve is an estimate of the probability density function of a continuous feature: the higher the curve over a range, the more frequently we observe values in that range.
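The idea behind a kernel density estimate can be sketched in a few lines of numpy. This is an illustration under stated assumptions, not a reproduction of what seaborn computes: the bandwidth of 10 is chosen by hand (seaborn selects one automatically), the function name gaussian_kde_sketch is hypothetical, and the ratings are invented. The estimate places a small Gaussian bump on each observation and averages the bumps:

```python
import numpy as np

# Hypothetical sample of ratings (invented for illustration).
ratings = np.array([0, 5, 15, 40, 50, 50, 60, 85, 95, 100], dtype=float)

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average a Gaussian bump centered on each observation."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / bandwidth

# Evaluate on a grid that extends well past the data on both sides.
grid = np.linspace(-100, 200, 3001)
density = gaussian_kde_sketch(ratings, grid, bandwidth=10.0)

# A density curve encloses a total area of approximately 1.
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 3))
```

Because each bump has tails, the estimated curve extends somewhat below 0 and above 100 even though no rating can fall outside that range. The same behavior is visible in density plots of bounded features.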

One way to see the density curve is to use the sns.distplot() function with the kde argument set to True and the hist argument set to False:

plt.figure(figsize=(7,7))
sns.distplot(anes.ftbiden, kde=True, hist=False)
plt.ylabel('Density')
plt.xlabel('Thermometer rating')
plt.title('Distribution of Joe Biden Thermometer Ratings')

Text(0.5, 1.0, 'Distribution of Joe Biden Thermometer Ratings')

_images/198a4f42a4eae6e2bcdb77dd2ce65cda37bdbcbc7fd214a2f2b455f545e60a8b.png

A nicer-looking version of this plot comes from the sns.kdeplot() function, which allows shading underneath the curve with shade=True:

plt.figure(figsize=(7,7))
sns.kdeplot(anes.ftbiden, shade=True)
plt.ylabel('Density')
plt.xlabel('Thermometer rating')
plt.title('Distribution of Joe Biden Thermometer Ratings')

Text(0.5, 1.0, 'Distribution of Joe Biden Thermometer Ratings')

_images/fc275135ebc82750a96c0db9f4bf1c7e49d545a14395faf0dbcd006a5508040a.png

The sns.kdeplot() function includes a legend by default, which makes it much easier to plot multiple densities on top of one another. Here is the density plot for Biden and Trump:

plt.figure(figsize=(7,7))
sns.kdeplot(anes.ftbiden, shade=True, label='Biden')
sns.kdeplot(anes.fttrump, shade=True, label='Trump')
plt.ylabel('Density')
plt.xlabel('Thermometer rating')
plt.title('Distribution of Thermometer Ratings')
plt.legend()

<matplotlib.legend.Legend at 0x1a276c0910>

_images/03ba2c8a2029a1af406b0c1779e70ced34a32170880e3a572b9b1718c1b5bf2f.png

I do not recommend including more than two features on a density plot as it will be difficult to tell the difference between more than two shaded areas. However, a facet grid is an effective alternative:

g = sns.FacetGrid(anes_hist, col = 'candidate', 
                  col_wrap=2, height=4, aspect=1,
                 hue='candidate')
g.map(sns.kdeplot, 'thermometer', shade=True)
g.set_titles('{col_name}')
g.set_axis_labels('Thermometer rating', 'Density') 
g.fig.subplots_adjust(top=.9)
g.fig.suptitle('Feeling Thermometer Distributions', fontsize=16)

Text(0.5, 0.98, 'Feeling Thermometer Distributions')

_images/559032fa4c13563f4c5ad0b46fefe0daff3816825cf504c7b1322467733a5c19.png

Density plots are effective ways to estimate and illustrate a single feature’s distribution. But a pair of features together have a joint distribution. The graph of a joint distribution of two features is a three-dimensional surface, which is trickier to visualize. The higher the surface above a point, the more often that ordered pair of values appears in the data. One way to visualize a joint distribution is to use a level plot (also called a contour plot), which looks at the surface from a vantage point directly above it. The level plot uses shading to represent points that have the same height, and the darkest areas are the peaks of the surface. These areas are the most frequent combinations of the two features.

Another way to visualize a joint distribution is with a hex plot, which divides the region into hexagons and shades the region according to how many combinations of the two values of the features fall into that region.
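The counting that underlies this kind of shading can be sketched with numpy. As a simplifying assumption, the sketch below uses rectangular cells from np.histogram2d() in place of hexagons, and the features x and y are randomly generated stand-ins, not ANES features. The logic is the same: divide the region into cells and count the pairs that land in each one.

```python
import numpy as np

# Randomly generated stand-ins for two features on a 0-100 scale,
# constructed so that y tends to be high when x is low.
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=1000)
y = np.clip(100 - x + rng.normal(0, 15, size=1000), 0, 100)

# Count how many (x, y) pairs fall into each cell of a 5 x 5 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=5,
                                          range=[[0, 100], [0, 100]])

# Every observation lands in exactly one cell.
print(int(counts.sum()))

# The fullest cell would receive the darkest shade.
row, col = np.unravel_index(counts.argmax(), counts.shape)
print(row, col)
```

In counts, the first index runs over the bins of x and the second over the bins of y, so counts[0, 4] is the number of pairs with x in the lowest fifth and y in the highest fifth.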

A joint plot combines a scatterplot, level plot, or hex plot with histograms or density plots to illustrate the two-way relationship between two features as well as the distribution of each individual feature. The distribution of the feature on the x-axis is drawn along the top, opposite the x-axis, and the distribution of the feature on the y-axis is drawn along the right, opposite the y-axis. To generate a joint plot, use the sns.jointplot() function with kind='scatter' to display a scatterplot and histograms:

sns.jointplot(x = 'ftbiden', y = 'fttrump', data=anes, kind='scatter')

<seaborn.axisgrid.JointGrid at 0x1a27211810>

_images/ffe07ffb7397e8be89fa08a7da3aa21e89e9349a95d6bf25096da743190ac848.png

Use kind='kde' to display a level plot and densities:

sns.jointplot(x = 'ftbiden', y = 'fttrump', data=anes, kind='kde')

<seaborn.axisgrid.JointGrid at 0x1a29244850>

_images/55c0752a2418782fafc5c8aa4fbda20c9158cc5a28c9decad5a5733802a56f80.png

And use kind='hex' to display a hex plot and histograms:

sns.jointplot(x = 'ftbiden', y = 'fttrump', data=anes, kind='hex')

<seaborn.axisgrid.JointGrid at 0x1a2a54b3d0>

_images/2d8eff52e1fd2efbe985c49ee655e189f5bd5ab031a62da3e7b5453d897486fe.png

These plots, especially with the level plot and hex plot, show more observations clustered around 100 for Trump and 0 for Biden, and around 0 for Trump and values greater than 50 for Biden.

11.2.4.3. Violin Plots #

A violin plot has two aesthetic components. First, each violin plot has a line with three parts: the longest, thinnest line represents the range of the feature, the thicker line represents the interquartile range (the 25th percentile through the 75th percentile), and a dot in the middle of the thick line represents the median. Second, these plots show the density plot on both sides of the line. Violin plots are symmetric and often end up looking like violins, hence the name.

To generate a violin plot, use the sns.violinplot() function. To illustrate the distribution of one feature, pass that feature to the function:

plt.figure(figsize=(10,7))
sns.violinplot(anes.ftbiden)
plt.ylabel("Density")
plt.xlabel("Feeling Thermometer Ratings")
plt.title("Distributions of Ratings for Joe Biden")

Text(0.5, 1.0, 'Distributions of Ratings for Joe Biden')

_images/f95315b57c034899fdad81fd900a023ccb8433d109f4fd12dea60b3ed51808c6.png

Violin plots are especially useful for comparing the distributions of several features side-by-side. If the data are in a long format, as with anes_hist,

anes_hist

	caseid	candidate	thermometer
0	1	Joe Biden	52.0
1	2	Joe Biden	41.0
2	3	Joe Biden	88.0
3	4	Joe Biden	0.0
4	5	Joe Biden	25.0
...	...	...	...
12655	3161	Bernie Sanders	6.0
12656	3162	Bernie Sanders	92.0
12657	3163	Bernie Sanders	59.0
12658	3164	Bernie Sanders	79.0
12659	3165	Bernie Sanders	100.0

12660 rows × 3 columns

then pass the category label to x and the values to y to display side-by-side violins. For example, here are the violins for the four political figures:

plt.figure(figsize = (10,7))
sns.violinplot(x = 'candidate', y = 'thermometer', data = anes_hist)
plt.xlabel("Political Figure")
plt.ylabel("Feeling Thermometer Ratings")
plt.title("Distributions of Ratings for Political Figures")

Text(0.5, 1.0, 'Distributions of Ratings for Political Figures')

_images/9c2824c88c0ffbb184b7e3b7bc74c22c4a8ae3a858087474ad7ded9a084bc1fe.png

If there is more than one categorical grouping feature, then we can use an aesthetic like hue to represent that feature. In this example, I reshape the anes data frame to long format, keeping sex as an identifying column:

anes_vio = pd.melt(anes, id_vars = ['caseid', 'sex'], 
                    value_vars = ['ftbiden', 'fttrump',
                                 'ftobama', 'ftsanders'])
anes_vio = anes_vio.rename({'variable':'candidate',
                             'value':'thermometer'}, axis=1)
anes_vio['candidate'] = anes_vio['candidate'].map({'ftbiden':'Joe Biden',
                                                    'fttrump':'Donald Trump',
                                                    'ftobama':'Barack Obama',
                                                    'ftsanders':'Bernie Sanders'})
anes_vio

	caseid	sex	candidate	thermometer
0	1	Male	Joe Biden	52.0
1	2	Male	Joe Biden	41.0
2	3	Female	Joe Biden	88.0
3	4	Male	Joe Biden	0.0
4	5	Female	Joe Biden	25.0
...	...	...	...	...
12655	3161	Male	Bernie Sanders	6.0
12656	3162	Male	Bernie Sanders	92.0
12657	3163	Male	Bernie Sanders	59.0
12658	3164	Male	Bernie Sanders	79.0
12659	3165	Male	Bernie Sanders	100.0

12660 rows × 4 columns

I can then specify hue='sex' to generate two violin plots, one for men and one for women, side-by-side within each column devoted to each political figure:

plt.figure(figsize = (10,7))
sns.violinplot(x = 'candidate', y = 'thermometer', data = anes_vio,
              hue = 'sex')
plt.xlabel("Political Figure")
plt.ylabel("Feeling Thermometer Ratings")
plt.title("Distributions of Ratings for Political Figures")

Text(0.5, 1.0, 'Distributions of Ratings for Political Figures')

_images/1bcda87e419aa236deb7f12745b9ee6ab14d73e881f661dffb59221b5197e1f3.png

11.2.4.4. Boxplots #

Instead of illustrating the entire distribution of a feature, box plots only show some descriptive statistics: the median, 25th and 75th percentiles, the minimum and maximum after removing outliers, and dots for the outliers. To generate a box plot with seaborn, use the sns.boxplot() function, passing the feature we want to visualize:

plt.figure(figsize=(10,7))
sns.boxplot(anes.ftbiden)
plt.xlabel("Feeling Thermometer Ratings")
plt.title("Distributions of Ratings for Joe Biden")

Text(0.5, 1.0, 'Distributions of Ratings for Joe Biden')

_images/e2e76d10327dcfb0d4e4322fd717c7970967ef21d8d539b2a1af6e4ca0fb9056.png

Box plots can be confusing because the aesthetic components that map to descriptive statistics are not as intuitive as bars or density curves. There is no meaning to the thickness of the box (its extent perpendicular to the axis that measures the feature), and the lines and edges in a box plot have specific definitions that we must remember in order to interpret a box plot correctly. The middle line represents the median, and the edges of the box are the 25th and 75th percentiles. The whiskers extend from the edges of the box to the minimum and maximum values that remain after setting outliers aside. By default, an outlier is a value that lies more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile, and outliers are illustrated with dots beyond the ends of the whiskers. Here is the same boxplot as the one above with these elements annotated:

plt.figure(figsize=(10,7))
sns.boxplot(anes.ftbiden)
plt.xlabel("Feeling Thermometer Ratings")
plt.title("Distributions of Ratings for Joe Biden")
plt.text(anes.ftbiden.median(), 0, 'median', horizontalalignment='center', verticalalignment='center',
        fontsize = 20, color='white')
plt.text(anes.ftbiden.max(), -.05, 'maximum', horizontalalignment='center', verticalalignment='center',
        fontsize = 20, color='red')
plt.text(anes.ftbiden.min(), -.05, 'minimum', horizontalalignment='center', verticalalignment='center',
        fontsize = 20, color='red')
plt.text(anes.ftbiden.min()-10, 0, 'here there be outliers', horizontalalignment='right', verticalalignment='center',
        fontsize = 20, color='blue')
plt.text(anes.ftbiden.max()+10, 0, 'here there be outliers', horizontalalignment='left', verticalalignment='center',
        fontsize = 20, color='blue')
plt.text(anes.ftbiden.quantile(.25), .25, '25th percentile', horizontalalignment='left', verticalalignment='center',
        fontsize = 20, color='white')
plt.text(anes.ftbiden.quantile(.75), .25, '75th percentile', horizontalalignment='right', verticalalignment='center',
        fontsize = 20, color='white')

Text(70.0, 0.25, '75th percentile')

_images/ca5f64f609f77736a253cb293d504620383575d72a2a47207faad0496ad40870.png
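The quantities that a box plot draws can also be computed directly. The following is a minimal sketch that applies the 1.5 times interquartile range rule to a made-up series. The values in s are hypothetical, and the names lower_fence and upper_fence are labels chosen here for the outlier cutoffs:

```python
import pandas as pd

# Hypothetical ratings, with one unusually large value.
s = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 300], dtype=float)

# The box: 25th percentile, median, and 75th percentile.
q1, median, q3 = s.quantile([.25, .5, .75])
iqr = q3 - q1

# Cutoffs: values beyond these are drawn as outlier dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers extend to the most extreme values inside the cutoffs.
inside = s[(s >= lower_fence) & (s <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = s[(s < lower_fence) | (s > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high)
print(outliers.tolist())
```

Here the value 300 lies above the upper cutoff, so the upper whisker stops at 90 and 300 would be drawn as a single dot.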

In short, to read a box plot accurately, there’s a lot to remember. Many statisticians and data scientists prefer violin plots to box plots because violin plots are more intuitive, which makes them easier to use for communicating with a general audience. On the other hand, box plots make it easier to see the interquartile range of a feature.

To show multiple features in the same box plot, arrange the data in long format, such as:

anes_hist

	caseid	candidate	thermometer
0	1	Joe Biden	52.0
1	2	Joe Biden	41.0
2	3	Joe Biden	88.0
3	4	Joe Biden	0.0
4	5	Joe Biden	25.0
...	...	...	...
12655	3161	Bernie Sanders	6.0
12656	3162	Bernie Sanders	92.0
12657	3163	Bernie Sanders	59.0
12658	3164	Bernie Sanders	79.0
12659	3165	Bernie Sanders	100.0

12660 rows × 3 columns

The syntax to show the side-by-side box plots is as follows:

plt.figure(figsize = (10,7))
sns.boxplot(x = 'candidate', y = 'thermometer', data = anes_hist)
plt.xlabel("Political Figure")
plt.ylabel("Feeling Thermometer Ratings")
plt.title("Distributions of Ratings for Political Figures")

Text(0.5, 1.0, 'Distributions of Ratings for Political Figures')

_images/a2218f525cf02ce0185a9a91d72316dd5b44e34e7ca8c8ec3bb61c9b16ed3b1f.png

11.2.5. Heatmaps #

Heat maps are used to visualize a matrix in which all datapoints within the matrix are on comparable scales, such as a correlation matrix. Heat maps often use colors that range from black to white, moving through shades of blue and red, where black and blue indicate smaller values and red and white indicate larger values. The shading allows us to illustrate a range of values with a direct mapping to a specific shade.

First I generate a correlation matrix between many feeling thermometers:

my_corr = anes.loc[:,'fttrump':'ftnra'].corr()
my_corr

	fttrump	ftobama	ftbiden	ftwarren	ftsanders	ftbuttigieg	ftharris	ftblack	ftwhite	fthisp	ftasian	ftmuslim	ftillegal	ftjournal	ftnato	ftun	ftice	ftnra
fttrump	1.000000	-0.754178	-0.646357	-0.699664	-0.678443	-0.588964	-0.616295	-0.088143	0.240802	-0.128286	-0.031700	-0.458024	-0.610016	-0.643103	-0.381275	-0.576975	0.712355	0.769239
ftobama	-0.754178	1.000000	0.805100	0.783064	0.720092	0.714641	0.743463	0.222713	-0.096177	0.223919	0.123806	0.550868	0.644172	0.703190	0.503913	0.670916	-0.541225	-0.620586
ftbiden	-0.646357	0.805100	1.000000	0.733601	0.664075	0.728557	0.730069	0.171006	-0.028490	0.171773	0.093182	0.463104	0.562951	0.654196	0.521605	0.636527	-0.411359	-0.514269
ftwarren	-0.699664	0.783064	0.733601	1.000000	0.798636	0.706680	0.799202	0.224285	-0.086150	0.241548	0.151839	0.534428	0.675027	0.718289	0.509323	0.684970	-0.535019	-0.578689
ftsanders	-0.678443	0.720092	0.664075	0.798636	1.000000	0.612547	0.695092	0.178376	-0.097548	0.203581	0.120267	0.485082	0.625150	0.644384	0.403909	0.608613	-0.534303	-0.543851
ftbuttigieg	-0.588964	0.714641	0.728557	0.706680	0.612547	1.000000	0.736301	0.161853	-0.022885	0.183849	0.129914	0.494359	0.567295	0.662666	0.546362	0.647596	-0.398386	-0.495196
ftharris	-0.616295	0.743463	0.730069	0.799202	0.695092	0.736301	1.000000	0.215413	-0.074303	0.222411	0.126617	0.510411	0.636316	0.668522	0.505836	0.650612	-0.452134	-0.497697
ftblack	-0.088143	0.222713	0.171006	0.224285	0.178376	0.161853	0.215413	1.000000	0.434286	0.723005	0.655427	0.504421	0.358561	0.283701	0.256728	0.233984	0.030010	-0.013278
ftwhite	0.240802	-0.096177	-0.028490	-0.086150	-0.097548	-0.022885	-0.074303	0.434286	1.000000	0.416440	0.528256	0.144579	-0.005061	0.021498	0.133711	0.036300	0.331804	0.284510
fthisp	-0.128286	0.223919	0.171773	0.241548	0.203581	0.183849	0.222411	0.723005	0.416440	1.000000	0.735956	0.540324	0.458139	0.298506	0.316018	0.268326	-0.024752	-0.040818
ftasian	-0.031700	0.123806	0.093182	0.151839	0.120267	0.129914	0.126617	0.655427	0.528256	0.735956	1.000000	0.479058	0.319166	0.220121	0.300172	0.207668	0.097326	0.029180
ftmuslim	-0.458024	0.550868	0.463104	0.534428	0.485082	0.494359	0.510411	0.504421	0.144579	0.540324	0.479058	1.000000	0.675827	0.585710	0.480992	0.555999	-0.312947	-0.373475
ftillegal	-0.610016	0.644172	0.562951	0.675027	0.625150	0.567295	0.636316	0.358561	-0.005061	0.458139	0.319166	0.675827	1.000000	0.680360	0.462385	0.599799	-0.512262	-0.495687
ftjournal	-0.643103	0.703190	0.654196	0.718289	0.644384	0.662666	0.668522	0.283701	0.021498	0.298506	0.220121	0.585710	0.680360	1.000000	0.599288	0.742631	-0.460893	-0.530127
ftnato	-0.381275	0.503913	0.521605	0.509323	0.403909	0.546362	0.505836	0.256728	0.133711	0.316018	0.300172	0.480992	0.462385	0.599288	1.000000	0.694893	-0.159086	-0.286711
ftun	-0.576975	0.670916	0.636527	0.684970	0.608613	0.647596	0.650612	0.233984	0.036300	0.268326	0.207668	0.555999	0.599799	0.742631	0.694893	1.000000	-0.392435	-0.475780
ftice	0.712355	-0.541225	-0.411359	-0.535019	-0.534303	-0.398386	-0.452134	0.030010	0.331804	-0.024752	0.097326	-0.312947	-0.512262	-0.460893	-0.159086	-0.392435	1.000000	0.727624
ftnra	0.769239	-0.620586	-0.514269	-0.578689	-0.543851	-0.495196	-0.497697	-0.013278	0.284510	-0.040818	0.029180	-0.373475	-0.495687	-0.530127	-0.286711	-0.475780	0.727624	1.000000
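A correlation matrix suits a heat map because every entry lies between -1 and 1, so all of the cells share one scale. As a minimal sketch of these properties, the code below builds a correlation matrix from three randomly generated columns. The data frame toy and its columns a, b, and c are hypothetical stand-ins, not ANES features:

```python
import numpy as np
import pandas as pd

# Randomly generated stand-ins for three thermometer-style features.
rng = np.random.default_rng(2)
a = rng.normal(50, 20, size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(0, 10, size=200),    # tracks a
                    'c': -a + rng.normal(0, 10, size=200)})  # opposes a

toy_corr = toy.corr()
print(toy_corr.round(2))

# The diagonal is 1 (each feature with itself) and the matrix is
# symmetric, so the upper and lower triangles mirror each other.
print(np.allclose(np.diag(toy_corr), 1.0))
print(np.allclose(toy_corr, toy_corr.T))
```

The symmetry explains why a heat map of a correlation matrix looks mirrored across its diagonal.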

To display a heatmap, we pass the correlation matrix to the sns.heatmap() function. The function labels the rows and columns automatically from the row and column indices of the correlation matrix and provides a color scale along the right side to help us interpret the colors in the heatmap. The heatmap for the feeling thermometers is:

plt.figure(figsize = (13,10))
sns.heatmap(my_corr)
plt.title("Correlation Between Feeling Thermometers")

Text(0.5, 1.0, 'Correlation Between Feeling Thermometers')

_images/17280a32cfe0e807e484b45a60362a232b79d86a28f38ef7d7b19994fdf88ae2.png

The color scale, by default, ranges from black to white. But we might want to change the colors to emphasize that zero is explicitly in the middle of the scale. For that we can specify center=0:

plt.figure(figsize = (13,10))
sns.heatmap(my_corr, center=0)
plt.title("Correlation Between Feeling Thermometers")

Text(0.5, 1.0, 'Correlation Between Feeling Thermometers')

_images/c88786e67e83dee99d6c72bacbb000cfe327ee78043e6a6597041ab046c042f4.png

The scale now ranges from light blue to light red, with black in the middle. There are many other options for color scales, listed on the Choosing Color Palettes tutorial in the seaborn user guide. We can specify a color scale that ranges from yellow, to green, to blue by specifying cmap='YlGnBu':

plt.figure(figsize = (13,10))
sns.heatmap(my_corr, cmap='YlGnBu')
plt.title("Correlation Between Feeling Thermometers")

Text(0.5, 1.0, 'Correlation Between Feeling Thermometers')

_images/95b6eef3dd17a9f760867fe0117fe2cbb811a4eb12c13a87f32f64cb494a970e.png

Finally, we can annotate the heatmap with the values that each color in each cell represents by typing annot=True:

plt.figure(figsize = (13,10))
sns.heatmap(my_corr, annot=True)
plt.title("Correlation Between Feeling Thermometers")

Text(0.5, 1.0, 'Correlation Between Feeling Thermometers')

_images/dacba0c09ae9e5afe8b8c486028816e1cd65cc3731f9c93deaab587b552b81dd.png
