7. Database Queries#

7.1. Introduction: The History of the Structured Query Language (SQL)#

The structured query language (SQL) was invented by Donald D. Chamberlin and Raymond F. Boyce in 1974. Chamberlain and Boyce were both young computer scientists working at the IBM T.J. Watson Research Center in Yorktown Heights, New York, and they met E. F. Codd at a research symposium that Codd organized there. Codd, four years prior, had published the seminal article that defined the relational model for databases. Codd’s relational model is defined using relational algebra and relational calculus, two notational standards that Codd himself created to elaborate on set theory as applied specifically to data tables. One important property of set theory is that highly abstract mathematical expressions can be expressed in plain language. For example, consider the set \(A\) of holidays in the United States during which banks are closed:

\[\begin{split} A = \{\text{New Year's Day}, \text{Birthday of Martin Luther King, Jr.}, \\ \text{Washington’s Birthday}, \text{Memorial Day}, \text{Independence Day}, \\ \text{Labor Day}, \text{Columbus Day}, \text{Veterans Day}, \\ \text{Thanksgiving Day}, \text{Christmas}\}\end{split}\]

Also consider the set \(B\) of holidays in the United Kingdom during which banks are closed:

\[\begin{split} B = \{\text{New Year's Day}, \text{Good Friday}, \text{Easter Monday}, \\ \text{Early May bank holiday}, \text{Spring bank holiday}, \\\text{Summer bank holiday}, \text{Christmas}, \text{Boxing Day}\}\end{split}\]

The intersection between sets \(A\) and \(B\) is a set that consists of all elements that exist with both set \(A\) and set \(B\):

\[ A \cap B = \{\text{New Year's Day}, \text{Christmas}\}\]

The notation \(A\cap B\) is a mathematical abstraction of an idea that can be expressed in plain-spoken language: \(\cap\) means “and”, and \(A\cap B\) means \(A\) and \(B\), or all elements that are in both \(A\) and \(B\). Put another way, \(A\cap B\) is the set of all holidays during which banks are closed in both the United States and the United Kingdom. Likewise, every piece of set notation can be expressed semantically.

Although Codd laid out the broad parameters of the relational model in mathematical terms, he did not design software or a physical architecture for a relational database. He explicitly left that work up to future research:

Many questions are raised and left unanswered. For example, only a few of the more important properties of the data sublanguage … are mentioned. Neither the purely linguistic details of such a language nor the implementation problems are discussed. Nevertheless, the material presented should be adequate for experienced systems programmers to visualize several approaches (p. 387).

Chamberlin and Boyce took up the challenge of writing a programming language to implement Codd’s relational model. As Chamberlin explains, their primary goal was to create a version of Codd’s set-theoretical relational model that could be expressed in plain language:

The more difficult barrier was at the semantic level. The basic concepts of Codd’s languages were adapted from set theory and symbolic logic. This was natural given Codd’s background as a mathematician, but Ray and I hoped to design a relational language based on concepts that would be familiar to a wider population of users (p. 78).

In short, the idea behind SQL is to implement Codd’s abstract system of using logical statements with set theoretical notation to narrow down the specific records and features in a dataset that a user wishes to read or edit, but to phrase these operations in accessable, plain language. One of the best things about SQL is that once you are used to the language, it reads just like English. That said, Chamberlin admits that SQL “has not proved to be as accessible to untrained users as Ray and I originally hoped” (p. 81).

Another benefit of SQL is that this language is one of the most universal programming languages in existence. It is designed to work with database management systems on any platform, and it works seamlessly within Python, R, C, Java, Javascript, and so on. While the standards for languages and platforms change, SQL has been in continuous use for relational database management since the 1980s and shows no sign of becoming antiquated or being replaced. The SQL syntax exists outside of any individual DBMS, and is maintained by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO), two non-profit organizations that facilitate the development of voluntary consensus standards for things like programming languages and hardware. Despite the universality of SQL, however, different DBMSs use slightly different versions of SQL, adding some unique functionality in some cases, and failing to implement the entire SQL standard in others. MySQL for example lacks the ability to perform a full join. PostgreSQL distinguishes itself from other RDBMSs by striving to implement as much of the global SQL standard as possible. While there some important differences in the version of SQL used by different DBMSs, the differences generally apply to very specific situations and all implementations of SQL use mostly the same syntax and can do mostly the same work.

7.1.1. Declarative and Procedural Languages#

SQL is considered to be a declarative language, which means that it defines the broad task that a particular computer system must carry out, but it does not define the mechanism through which the system completes the task. For example, SQL can tell a system to access two tables and join them together, but that command must tell a DBMS to access additional code that tells the system how exactly to search and operate on the rows and columns of each data table. A language that provides specific instructions to a system on how to carry out a task - by changing the system state in some way, including how the data exist in the system - is a procedural language. The code that a procedural language uses to make these changes on the system is called imperative code. A DBMS can be thought of as a function that takes declarative SQL code as an input, finds and runs the imperative code that carries out the declarative task, and returns the output. MySQL, for example, uses imperative C and C++ code to carry out SQL queries.

7.1.2. Popularity of SQL#

Common standards and the most popular programming languages and environments change all the time. It’s an eternal struggle for data scientists as well as programmers of all kinds, and a matter of consistent anxiety. Presently, Python is the most widely used tool for data science, but will we all have to drop Python soon and teach ourselves Julia?

In this context, it is stunning that SQL has been so widely used since the 1970s. According to a Stack Overflow survey, SQL is the one of most widely used programming languages among the people who filled out the survey, second only to Javascript. Taking into account the high-tech biases in this specific sample, it is probably the case the SQL is more widely used than any other language mentioned in this survey. What accounts for this popularity?

This blog post argues that SQL achieved this level of longevity because it came to prominence during a time in which many of the baseline standards for the development of computer systems were being invented. As more and more systems were developed in a way that depends on SQL, it became harder to change this standard. But SQL is also simple and highly functional because it is a semantic language that expresses set-theoetical and logical operations. As long as relational databases are used, there’s not much functionality that can be added to a query language beyond these foundational mathematical operations, and whatever additional functionality is needed can be added to a version of SQL by a particular DBMS. There are also many different open source and proprietary DBMSs that all employ SQL, so different users can have a choice over many different DBMSs and platforms without having to learn a query language other than SQL.

That said, there’s much less of a reason to use SQL when the database is not organized according to the relational model. NoSQL databases have much more flexible schema in general, and can store the data in one big table or in as many tables as there are records, or even datapoints, in the database. In fact, without a relational schema, the notion of a data table makes less sense in general. For example, a document store is a collection of individual records encoded using JSON or XML, and not as tables. These records can be sharded: stored in many corresponding servers in a distributed system to address challenges with the size of the database and the speed with which database transactions are conducted. Without tables, NoSQL DBMSs do not usually use SQL. MongoDB, for example, works with queries that are themselves in JSON format.

7.2. Create, Read, Update, and Delete (CRUD) Operations#

Persistent storage refers to a system in which data outlives the process that created it. When you work with software that allows you to save a file, the file is stored in persistent storage because it still exists even after you close the software application. Hard drives are examples of persistent storage, as are local and remote servers that store databases. Any persistent storage mechanism must have methods for creating, reading (or loading), updating (or editing), and deleting the data in that storage device. Create, Read, Update, and Delete are the CRUD Operations.

We’ve previously employed CRUD operations using the requests library to use an API or to do web scraping. Like requests, SQL and other query languages have CRUD operations. The following table, adapted from a similar one that appears on the Wikipedia page for the CRUD operations, shows these operations in the requests package, SQL, and the MongoDB query language:

Operation

requests method

SQL

MongoDB

Create

requests.put() or requests.post()

INSERT

Insert

Read

requests.get()

SELECT

Find

Update

requests.patch()

UPDATE

Update

Delete

requests.delete()

DELETE

Remove

As a data scientist, you will most often use read operations to obtain the data you need for your analysis. However, if you are collecting original data for your project, the create, update, and delete operations become much more important. We will discuss all four operations and their variants in the context of SQL and MongoDB below.

We can work with SQL using pandas if we first create an engine that links to a specific DBMS, server, and database with create_engine from sqlalchemy. Once we do, the pd.read_sql_query() function makes read operations straightforward, and the .execute() method applied to the engine lets us easily issue create, update, and delete commands.

7.3. SQL Style: Capitalization, Quotes, New Lines, Indentation#

There are many ways to write an SQL query, and when you look at someone else’s SQL code you will see a variety of styles. Mostly, with the exception of quotes in some cases, stylistic differences don’t change the behavior of the code, but they can have an impact on how easy the code is for other people to read and understand.

One requirement for SQL code is that the query must end with a semi-colon, and that no semi-colons appear elsewhere in the query. As long as that requirement is met, other stylistic choices are possible.

The least readable way to write an SQL query is to write the entire code on one line, with no capitalization or indentation. The following code is valid SQL code:

select t.id, t.column1, t.column2, t.column3, r.column4 from table1 t inner join table2 r on t.id = r.id where column1>100 order by column2 desc;

We will discuss exactly what this query does. But for now, let’s focus on the presentation of code. SQL uses clauses to represent particular functions for reading and writing data. In the above query, select, from, inner join, on, where, and order by are all clauses, and desc is an option applied to the order by clause.

One stylistic choice many people make is to write SQL clauses in capital letters. That helps readers to quickly see the parts of the code that are clauses as opposed to the rest of the code that contains column names, table names, values, and aliases. If we capitalize the clauses and options in the SQL query, it looks like this:

SELECT t.id, t.column1, t.column2, t.column3, r.column4 FROM table1 t INNER JOIN table2 r ON t.id = r.id WHERE column1>100 ORDER BY column2 DESC;

Another stylistic choice people make to present the code in a more reabable way is to put clauses that are considered distinct enough from other clauses on new lines. The one common exception is FROM, which is considered to be closely related to SELECT and is often written on the same line as SELECT. If we put each clause other than FROM on a new line, the query looks like this:

SELECT t.id, t.column1, t.column2, t.column3, r.column4 FROM table1 t 
INNER JOIN table2 r 
ON t.id = r.id 
WHERE column1>100 
ORDER BY column2 DESC;

Some clauses are considered to be elaborations upon a previous clause. Column names after SELECT are usually written on the same line as SELECT, but if these columns themselves require functions that take up more space, it is useful to put them on new lines. Likewise, ON is considered an elaboration of INNER JOIN. These lines of code are often indented to express the dependence on the previous line. If we include indentation in the code, the query is

SELECT 
    t.id, 
    t.column1, 
    t.column2, 
    t.column3, 
    r.column4 
FROM table1 t 
INNER JOIN table2 r 
    ON t.id = r.id 
WHERE column1>100 
ORDER BY column2 DESC;

I encourage you to develop good habits with how you write the SQL queries, both for other people to read your code, but more importantly, to make it easier for you yourself to read your code. You will be spending a lot of time developing and debugging SQL queries, and anything you do that cuts down the time to understand your own code will save you a lot of time and frustration in the long-run.

Quotes are only used in SQL code when referring to values of a character feature in one of the data tables. When using quotes while working in Python, it is important to use single quotes, not double quotes, to ensure that the quote that is internal to SQL is not read as a termination of the Python variable that contains the SQL code. That is, it is fine to write a clause that looks like WHERE column = 'value', but not WHERE column = "value".

For all of the queries we will write in the following examples, we will store the query as a string variable in Python. We will use the triple-quote syntax, which allows us to write a string that exists on multiple lines. So our SQL query definitions will look like this:

myquery = """
SELECT 
    t.id, 
    t.column1, 
    t.column2, 
    t.column3, 
    r.column4 
FROM table1 t 
INNER JOIN table2 r 
    ON t.id = r.id 
WHERE column1>100 
ORDER BY column2 DESC;
"""

We will then be able to pass the myquery variable to functions like pd.read_sql_query() to be evaluated.

7.4. SQL Joins#

The simplest SQL commands for reading data from a database are SELECT and FROM. In module 6, we issued the following query to read the entire “reviews” entity from the wine reviews database:

SELECT * FROM reviews

But this query does not read data from the other entities in the database. It pulls all of the rows and columns from “reviews” and it does not manipulate the data within “reviews” in any way. That might not be the best way to create a dataframe to conduct analyses.

SQL read operations get data, but they can also clean data at the same time. Cleaning data is an important challenge. Even when the data are stored in a well-organized database, that organization might not be the right format for the data given the analyses we intend to do. “Tidy Data” by Hadley Wickham lays out a philosophy of what it means to clean a dataset. The goal is to put data into a format in which modeling and visualization is as easy as possible. There are two steps: first we create tidy data and then we manipulate the data to fit our specific needs. Tidy data is defined as follows:

A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, [features] and types. In tidy data:

  1. Each [feature] forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table (p. 2).

In other words, the dataset we will use in an analysis must exist in one table, the rows of the table must represent records (also called observations), the columns of the table must represent features, and the rows must represent comparable units. For example, if the data contain records from the 50 U.S. States, there should be a row for each state, and there should not be a row for regions or for the whole country as these units are not comparable to states. Data from a relational database are not generally in tidy format because the relevant data exists in multiple tables. The first step is to combine these tables into one dataset using join functions within SQL read operations. There are many different kinds of joins, and the easiest way to see the difference between these types is to see what they do to real data.

Before we discuss specific examples of how to use SQL, we load the following libraries:

import numpy as np
import pandas as pd
import sys
import os
import psycopg2
from sqlalchemy import create_engine
import dotenv
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 5
      3 import sys
      4 import os
----> 5 import psycopg2
      6 from sqlalchemy import create_engine
      7 import dotenv

ModuleNotFoundError: No module named 'psycopg2'

7.4.1. Example Database: NFL and NBA Teams#

As an example, I create a PostgreSQL database that contains two tables: “nfl” contains the location and team name of all 32 NFL teams:

nfl_dict = {'city':['Buffalo','Miami','Boston','New York','Cleveland','Cincinnati',
                       'Pittsburgh','Baltimore','Kansas City','Las Vegas','Los Angeles','Denver',
                       'Nashville','Jacksonville','Houston','Indianapolis','Philadelphia','Dallas',
                       'Washington','Atlanta','Charlotte','Tampa Bay','New Orleans','San Francisco',
                       'Phoenix', 'Seattle','Chicago','Green Bay','Minneapolis','Detroit'],
           'footballteam':['Buffalo Bills','Miami Dolphins','New England Patriots',
                           ['New York Jets', 'New York Giants'],'Cleveland Browns','Cincinnati Bengals',
                          'Pittsburgh Steelers','Baltimore Ravens','Kansas City Chiefs',
                           'Las Vegas Raiders',['L.A. Chargers','L.A. Rams'],'Denver Broncos',
                          'Tennessee Titans','Jacksonville Jaguars','Houston Texans',
                           'Indianapolis Colts','Philadelphia Eagles','Dallas Cowboys',
                           'Washington Skins','Atlanta Falcons','Carolina Panthers',
                           'Tampa Bay Buccaneers','New Orleans Saints', 'San Francisco 49ers',
                           'Arizona Cardinals','Seattle Seahawks','Chicago Bears',
                           'Green Bay Packers','Minnesota Vikings','Detroit Lions']}
nfl_df = pd.DataFrame(nfl_dict)
nfl_df
city footballteam
0 Buffalo Buffalo Bills
1 Miami Miami Dolphins
2 Boston New England Patriots
3 New York [New York Jets, New York Giants]
4 Cleveland Cleveland Browns
5 Cincinnati Cincinnati Bengals
6 Pittsburgh Pittsburgh Steelers
7 Baltimore Baltimore Ravens
8 Kansas City Kansas City Chiefs
9 Las Vegas Las Vegas Raiders
10 Los Angeles [L.A. Chargers, L.A. Rams]
11 Denver Denver Broncos
12 Nashville Tennessee Titans
13 Jacksonville Jacksonville Jaguars
14 Houston Houston Texans
15 Indianapolis Indianapolis Colts
16 Philadelphia Philadelphia Eagles
17 Dallas Dallas Cowboys
18 Washington Washington Skins
19 Atlanta Atlanta Falcons
20 Charlotte Carolina Panthers
21 Tampa Bay Tampa Bay Buccaneers
22 New Orleans New Orleans Saints
23 San Francisco San Francisco 49ers
24 Phoenix Arizona Cardinals
25 Seattle Seattle Seahawks
26 Chicago Chicago Bears
27 Green Bay Green Bay Packers
28 Minneapolis Minnesota Vikings
29 Detroit Detroit Lions

This table is not in first normal form because the data are non-atomic (two teams from New York and two in Los Angeles), but this form is useful for illustrating what different SQL joins do. The second table contains the same information about NBA teams:

nba_dict = {'city':['Boston','New York','Philadelphia','Brooklyn','Toronto',
                   'Cleveland','Chicago','Detroit','Milwaukee','Indianapolis',
                   'Atlanta', 'Washington','Orlando','Miami','Charlotte',
                   'Los Angeles','San Francisco','Portland','Sacramento',
                   'Phoenix','San Antonio','Dallas','Houston','Oklahoma City',
                   'Minneapolis','Denver','Salt Lake City','Memphis','New Orleans'],
           'basketballteam':['Boston Celtics','New York Knicks','Philadelphia 76ers',
                             'Brooklyn Nets','Toronto Raptors',
                            'Cleveland Cavaliers','Chicago Bulls','Detroit Pistons',
                             'Milwaukee Bucks','Indiana Pacers',
                            'Atlanta Hawks','Washington Wizards','Orlando Magic',
                             'Miami Heat','Charlotte Hornets',
                            ['L.A. Lakers','L.A. Clippers'],'Golden State Warriors',
                             'Portland Trailblazers','Sacramento Kings',
                            'Phoenix Suns','San Antonio Spurs','Dallas Mavericks',
                             'Houston Rockets','Oklahoma City Thunder',
                            'Minnesota Timberwolves','Denver Nuggets',
                             'Utah Jazz','Memphis Grizzlies','New Orleans Pelicans']}
nba_df = pd.DataFrame(nba_dict)
nba_df
city basketballteam
0 Boston Boston Celtics
1 New York New York Knicks
2 Philadelphia Philadelphia 76ers
3 Brooklyn Brooklyn Nets
4 Toronto Toronto Raptors
5 Cleveland Cleveland Cavaliers
6 Chicago Chicago Bulls
7 Detroit Detroit Pistons
8 Milwaukee Milwaukee Bucks
9 Indianapolis Indiana Pacers
10 Atlanta Atlanta Hawks
11 Washington Washington Wizards
12 Orlando Orlando Magic
13 Miami Miami Heat
14 Charlotte Charlotte Hornets
15 Los Angeles [L.A. Lakers, L.A. Clippers]
16 San Francisco Golden State Warriors
17 Portland Portland Trailblazers
18 Sacramento Sacramento Kings
19 Phoenix Phoenix Suns
20 San Antonio San Antonio Spurs
21 Dallas Dallas Mavericks
22 Houston Houston Rockets
23 Oklahoma City Oklahoma City Thunder
24 Minneapolis Minnesota Timberwolves
25 Denver Denver Nuggets
26 Salt Lake City Utah Jazz
27 Memphis Memphis Grizzlies
28 New Orleans New Orleans Pelicans

To create a PostgreSQL database with entities for the NFL and NBA teams, I first connect to the PostgreSQL server running on my local computer (see module 6 for a more detailed discussion of how this works). First I bring my PostgreSQL password into the local environment:

dotenv.load_dotenv()
pgpassword = os.getenv("pgpassword")

Then I access the server and establish a cursor for the server:

dbserver = psycopg2.connect(
    user='jk8sd', 
    password=pgpassword, 
    host="localhost"
)
dbserver.autocommit = True
cursor = dbserver.cursor()

I create an empty “teams” database:

try:
    cursor.execute("CREATE DATABASE teams")
except:
    cursor.execute("DROP DATABASE teams")
    cursor.execute("CREATE DATABASE teams")

And I use the create_engine() function from sqalchemy to allow queries to the “teams” database:

engine = create_engine("postgresql+psycopg2://{user}:{pw}@localhost/{db}"
                       .format(user="jk8sd", pw=pgpassword, db="teams"))

I add the nfl_df and nba_df dataframes to the “teams” database:

nfl_df.to_sql('nfl', con = engine, index=False, chunksize=1000, if_exists = 'replace')
nba_df.to_sql('nba', con = engine, index=False, chunksize=1000, if_exists = 'replace')

I can now issue queries to the database. To read all of the data in the NFL table, for example, I type:

myquery = "SELECT * FROM nfl"
pd.read_sql_query(myquery, con=engine)
city footballteam
0 Buffalo Buffalo Bills
1 Miami Miami Dolphins
2 Boston New England Patriots
3 New York {"New York Jets","New York Giants"}
4 Cleveland Cleveland Browns
5 Cincinnati Cincinnati Bengals
6 Pittsburgh Pittsburgh Steelers
7 Baltimore Baltimore Ravens
8 Kansas City Kansas City Chiefs
9 Las Vegas Las Vegas Raiders
10 Los Angeles {"L.A. Chargers","L.A. Rams"}
11 Denver Denver Broncos
12 Nashville Tennessee Titans
13 Jacksonville Jacksonville Jaguars
14 Houston Houston Texans
15 Indianapolis Indianapolis Colts
16 Philadelphia Philadelphia Eagles
17 Dallas Dallas Cowboys
18 Washington Washington Skins
19 Atlanta Atlanta Falcons
20 Charlotte Carolina Panthers
21 Tampa Bay Tampa Bay Buccaneers
22 New Orleans New Orleans Saints
23 San Francisco San Francisco 49ers
24 Phoenix Arizona Cardinals
25 Seattle Seattle Seahawks
26 Chicago Chicago Bears
27 Green Bay Green Bay Packers
28 Minneapolis Minnesota Vikings
29 Detroit Detroit Lions

7.4.2. Types of Joins#

Joining data tables is the act of adding columns to an existing data table - that is, adding more features to existing records - by matching the rows in one table to the corresponding rows in another table. In a relational database, data tables can include a foreign key which serves as the primary key for another data table. Joins require matching a foreign key in one table to the corresponding primary key in another table. During the join, this foreign key and this primary key are both called indices. To perform a join with an SQL query, we specify the two tables in the database we want to join and the index in each table we will match on.

In the teams database, city is a primary key in both the “nfl” and “nba” tables, which also makes it a foreign key in both tables. Joining the “nfl” and “nba” tables by matching on city creates one data table in which the rows still represent cities and the columns list both the NBA and NFL teams in each city.

Not every city has both an NFL and an NBA team. Green Bay, for example, has a football team but no basketball team, and Sacramento has a basketball team but no football team. In a join, every row in a table either matches with one or more rows in the other table, or is unmatched. In this case, Cleveland in the NFL table is matched to a row in the NBA table because Cleveland has both a football and basketball team, but Oklahoma City in the NBA table is unmatched because there is no row for Oklahoma City in the NFL table.

The main difference between types of joins in SQL is their treatment of unmatched records. The following table summarizes the types of joins:

Type of join

Definition

Inner join

Only keep the records that exist in both tables

Left join

Keep all the records in the first table listed, and keep only the records in the second table listed that have matches in the first table

Right join

Keep all the records in the second table listed, and keep only the records in the first table listed that have matches in the second table

Full join

Keep all of the records in both tables whether they are matched or not

Anti join

Keep only the records in the first table that are not matched in the second table

Natural join

The same as any of the joins listed above, but no need to specify the indices as these are determined automatically by finding columns with the same name. If no columns share the same name, a natural join performs a cross join. If more than one pair of columns share names across the two data tables, natural joins assume that both are part of the index to match on. Use caution.

Cross join

Also called a Cartesian product. If the first dataframe has \(M\) rows and the second dataframe has \(N\) rows, the result has \(M\times N\) rows. Every row is a pairwise combination of values of each index.

7.4.3. Inner Joins#

The syntax for an inner join is

SELECT * FROM table1
INNER JOIN table2
    ON table1.index_name = table2.index_name;

where table1 and table2 are the data tables we are joining, and table1.index_name and table2.index_name are the columns that contain the indices for tables 1 and 2. Alternatively, inner join is the default type of join, so that this syntax

SELECT * FROM table1
JOIN table2
    ON table1.index_name = table2.index_name;

also produces an inner join. I recommend typing INNER JOIN, however, to avoid confusing this type of join with other types.

In the case of the teams database, an inner join of the NFL and NBA tables yields a dataframe with one row for every city that has both a basketball and a football team. The SQL query that generates this data frame is:

myquery = """
SELECT * FROM nfl
INNER JOIN nba
    ON nfl.city = nba.city;
"""
pd.read_sql_query(myquery, con=engine)
city footballteam city basketballteam
0 Miami Miami Dolphins Miami Miami Heat
1 Boston New England Patriots Boston Boston Celtics
2 New York {"New York Jets","New York Giants"} New York New York Knicks
3 Cleveland Cleveland Browns Cleveland Cleveland Cavaliers
4 Los Angeles {"L.A. Chargers","L.A. Rams"} Los Angeles {"L.A. Lakers","L.A. Clippers"}
5 Denver Denver Broncos Denver Denver Nuggets
6 Houston Houston Texans Houston Houston Rockets
7 Indianapolis Indianapolis Colts Indianapolis Indiana Pacers
8 Philadelphia Philadelphia Eagles Philadelphia Philadelphia 76ers
9 Dallas Dallas Cowboys Dallas Dallas Mavericks
10 Washington Washington Skins Washington Washington Wizards
11 Atlanta Atlanta Falcons Atlanta Atlanta Hawks
12 Charlotte Carolina Panthers Charlotte Charlotte Hornets
13 New Orleans New Orleans Saints New Orleans New Orleans Pelicans
14 San Francisco San Francisco 49ers San Francisco Golden State Warriors
15 Phoenix Arizona Cardinals Phoenix Phoenix Suns
16 Chicago Chicago Bears Chicago Chicago Bulls
17 Minneapolis Minnesota Vikings Minneapolis Minnesota Timberwolves
18 Detroit Detroit Lions Detroit Detroit Pistons

The three quotations that come before and after the SQL code is Python syntax that allow for a string to be entered on multiple lines. With just one quote, Python would assume that the next line should be read as Python code, and will produce an error. Three quotes allows us to space out the components of the SQL query on separate lines to make the SQL code easier to read and understand.

SQL queries can be written on multiple lines, but the last line (and only the last line) must conclude with a semicolon.

Another way to write the inner join query is to use aliasing: specifying a smaller name or a single letter next to each data table in the query to simplify the syntax for ON. For example, I can alias the NFL data with f and the NBA data with b:

myquery = """
SELECT * FROM nfl f
INNER JOIN nba b
    ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city footballteam city basketballteam
0 Miami Miami Dolphins Miami Miami Heat
1 Boston New England Patriots Boston Boston Celtics
2 New York {"New York Jets","New York Giants"} New York New York Knicks
3 Cleveland Cleveland Browns Cleveland Cleveland Cavaliers
4 Los Angeles {"L.A. Chargers","L.A. Rams"} Los Angeles {"L.A. Lakers","L.A. Clippers"}
5 Denver Denver Broncos Denver Denver Nuggets
6 Houston Houston Texans Houston Houston Rockets
7 Indianapolis Indianapolis Colts Indianapolis Indiana Pacers
8 Philadelphia Philadelphia Eagles Philadelphia Philadelphia 76ers
9 Dallas Dallas Cowboys Dallas Dallas Mavericks
10 Washington Washington Skins Washington Washington Wizards
11 Atlanta Atlanta Falcons Atlanta Atlanta Hawks
12 Charlotte Carolina Panthers Charlotte Charlotte Hornets
13 New Orleans New Orleans Saints New Orleans New Orleans Pelicans
14 San Francisco San Francisco 49ers San Francisco Golden State Warriors
15 Phoenix Arizona Cardinals Phoenix Phoenix Suns
16 Chicago Chicago Bears Chicago Chicago Bulls
17 Minneapolis Minnesota Vikings Minneapolis Minnesota Timberwolves
18 Detroit Detroit Lions Detroit Detroit Pistons

The two indices we match on do not necessarily have to have the same name. Supposing that the “city” column in each data table was named “location” in the NFL table and “town” in the NBA table, the syntax for the inner join would have been:

SELECT * FROM nfl f
INNER JOIN nba b
    ON f.location = b.town;

7.4.4. Left and Right Joins#

The syntax for a left join is

SELECT * FROM table1
LEFT JOIN table2
    ON table1.index_name = table2.index_name;

and the syntax for a right join is

SELECT * FROM table1
RIGHT JOIN table2
    ON table1.index_name = table2.index_name;

In the case of the teams database, if we list the NFL table next to FROM and the NBA data with the JOIN statement, then left join lists all of the cities with an NFL team, and also displays the NBA team in that city if one exists. Otherwise, the syntax places None in the cell where the NBA team would be. For the teams database, the syntax for a left join is:

myquery = """
SELECT * FROM nfl f
LEFT JOIN nba b
    ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city footballteam city basketballteam
0 Buffalo Buffalo Bills None None
1 Miami Miami Dolphins Miami Miami Heat
2 Boston New England Patriots Boston Boston Celtics
3 New York {"New York Jets","New York Giants"} New York New York Knicks
4 Cleveland Cleveland Browns Cleveland Cleveland Cavaliers
5 Cincinnati Cincinnati Bengals None None
6 Pittsburgh Pittsburgh Steelers None None
7 Baltimore Baltimore Ravens None None
8 Kansas City Kansas City Chiefs None None
9 Las Vegas Las Vegas Raiders None None
10 Los Angeles {"L.A. Chargers","L.A. Rams"} Los Angeles {"L.A. Lakers","L.A. Clippers"}
11 Denver Denver Broncos Denver Denver Nuggets
12 Nashville Tennessee Titans None None
13 Jacksonville Jacksonville Jaguars None None
14 Houston Houston Texans Houston Houston Rockets
15 Indianapolis Indianapolis Colts Indianapolis Indiana Pacers
16 Philadelphia Philadelphia Eagles Philadelphia Philadelphia 76ers
17 Dallas Dallas Cowboys Dallas Dallas Mavericks
18 Washington Washington Skins Washington Washington Wizards
19 Atlanta Atlanta Falcons Atlanta Atlanta Hawks
20 Charlotte Carolina Panthers Charlotte Charlotte Hornets
21 Tampa Bay Tampa Bay Buccaneers None None
22 New Orleans New Orleans Saints New Orleans New Orleans Pelicans
23 San Francisco San Francisco 49ers San Francisco Golden State Warriors
24 Phoenix Arizona Cardinals Phoenix Phoenix Suns
25 Seattle Seattle Seahawks None None
26 Chicago Chicago Bears Chicago Chicago Bulls
27 Green Bay Green Bay Packers None None
28 Minneapolis Minnesota Vikings Minneapolis Minnesota Timberwolves
29 Detroit Detroit Lions Detroit Detroit Pistons

Likewise, the right join displays all the cities with an NBA team, along with the NFL team in that city, if one exists:

myquery = """
SELECT * FROM nfl f
RIGHT JOIN nba b
    ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city footballteam city basketballteam
0 Boston New England Patriots Boston Boston Celtics
1 New York {"New York Jets","New York Giants"} New York New York Knicks
2 Philadelphia Philadelphia Eagles Philadelphia Philadelphia 76ers
3 None None Brooklyn Brooklyn Nets
4 None None Toronto Toronto Raptors
5 Cleveland Cleveland Browns Cleveland Cleveland Cavaliers
6 Chicago Chicago Bears Chicago Chicago Bulls
7 Detroit Detroit Lions Detroit Detroit Pistons
8 None None Milwaukee Milwaukee Bucks
9 Indianapolis Indianapolis Colts Indianapolis Indiana Pacers
10 Atlanta Atlanta Falcons Atlanta Atlanta Hawks
11 Washington Washington Skins Washington Washington Wizards
12 None None Orlando Orlando Magic
13 Miami Miami Dolphins Miami Miami Heat
14 Charlotte Carolina Panthers Charlotte Charlotte Hornets
15 Los Angeles {"L.A. Chargers","L.A. Rams"} Los Angeles {"L.A. Lakers","L.A. Clippers"}
16 San Francisco San Francisco 49ers San Francisco Golden State Warriors
17 None None Portland Portland Trailblazers
18 None None Sacramento Sacramento Kings
19 Phoenix Arizona Cardinals Phoenix Phoenix Suns
20 None None San Antonio San Antonio Spurs
21 Dallas Dallas Cowboys Dallas Dallas Mavericks
22 Houston Houston Texans Houston Houston Rockets
23 None None Oklahoma City Oklahoma City Thunder
24 Minneapolis Minnesota Vikings Minneapolis Minnesota Timberwolves
25 Denver Denver Broncos Denver Denver Nuggets
26 None None Salt Lake City Utah Jazz
27 None None Memphis Memphis Grizzlies
28 New Orleans New Orleans Saints New Orleans New Orleans Pelicans

For the left and right joins, changing which data table appears along with FROM and which data table appears along with JOIN accomplishes the same thing as changing a left join to a right join.

7.4.5. Full (Outer) Join#

A full join, also called an outer join, keeps all of the records that exist in both tables, whether or not they are matched. Full joins will return a data frame with at least as many rows as the larger of the two data tables in the join because it contains all records that appear in either data frame. Most tutorials on SQL offer a warning about full joins that these queries can result in massive amounts of data being returned, and full joins are not implemented for MySQL databases. For systems like PostgreSQL in which full joins are allowed, the syntax for a full join is

SELECT * FROM table1
FULL JOIN table2
    ON table1.index_name = table2.index_name;

For the teams database, a full join produces a data frame with one row for every city with an NFL team or an NBA team or both:

myquery = """
SELECT * FROM nfl f
FULL JOIN nba b
    ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city footballteam city basketballteam
0 Buffalo Buffalo Bills None None
1 Miami Miami Dolphins Miami Miami Heat
2 Boston New England Patriots Boston Boston Celtics
3 New York {"New York Jets","New York Giants"} New York New York Knicks
4 Cleveland Cleveland Browns Cleveland Cleveland Cavaliers
5 Cincinnati Cincinnati Bengals None None
6 Pittsburgh Pittsburgh Steelers None None
7 Baltimore Baltimore Ravens None None
8 Kansas City Kansas City Chiefs None None
9 Las Vegas Las Vegas Raiders None None
10 Los Angeles {"L.A. Chargers","L.A. Rams"} Los Angeles {"L.A. Lakers","L.A. Clippers"}
11 Denver Denver Broncos Denver Denver Nuggets
12 Nashville Tennessee Titans None None
13 Jacksonville Jacksonville Jaguars None None
14 Houston Houston Texans Houston Houston Rockets
15 Indianapolis Indianapolis Colts Indianapolis Indiana Pacers
16 Philadelphia Philadelphia Eagles Philadelphia Philadelphia 76ers
17 Dallas Dallas Cowboys Dallas Dallas Mavericks
18 Washington Washington Skins Washington Washington Wizards
19 Atlanta Atlanta Falcons Atlanta Atlanta Hawks
20 Charlotte Carolina Panthers Charlotte Charlotte Hornets
21 Tampa Bay Tampa Bay Buccaneers None None
22 New Orleans New Orleans Saints New Orleans New Orleans Pelicans
23 San Francisco San Francisco 49ers San Francisco Golden State Warriors
24 Phoenix Arizona Cardinals Phoenix Phoenix Suns
25 Seattle Seattle Seahawks None None
26 Chicago Chicago Bears Chicago Chicago Bulls
27 Green Bay Green Bay Packers None None
28 Minneapolis Minnesota Vikings Minneapolis Minnesota Timberwolves
29 Detroit Detroit Lions Detroit Detroit Pistons
30 None None Milwaukee Milwaukee Bucks
31 None None Oklahoma City Oklahoma City Thunder
32 None None Portland Portland Trailblazers
33 None None Brooklyn Brooklyn Nets
34 None None Sacramento Sacramento Kings
35 None None Memphis Memphis Grizzlies
36 None None San Antonio San Antonio Spurs
37 None None Salt Lake City Utah Jazz
38 None None Orlando Orlando Magic
39 None None Toronto Toronto Raptors

Although there are 30 cities with at least one NFL team and 29 cities with at least one NBA team, there are 41 cities with at least one team from one of these two leagues.

7.4.6. Anti-Joins#

An anti-join leaves us with all of the records in the first data table that do not appear in the second table. There is no “ANTI JOIN” syntax in SQL, but the behavior of an anti-join can be generated by including the WHERE clause along with LEFT JOIN. The syntax for an anti-join is

SELECT * FROM table1
LEFT JOIN table2
    ON table1.index_name = table2.index_name
WHERE table2.index_name is NULL;

The WHERE statement is used to draw a selection of rows from a data table that make a specified logical condition true. After performing a left join we have a data table with all of the rows in the first table along with the data for those rows in the second table if the row had a match in the second table. Typing WHERE table2.index_name is NULL restricts this data table to only the rows that do not have a value of the index in the second table, meaning there was no match. For the teams database, the anti-join of the NFL and NBA tables yields a dataframe of all the cities with an NFL team but no NBA team:

myquery = """
SELECT * FROM nfl f
LEFT JOIN nba b
    ON f.city = b.city
WHERE b.city is NULL;
"""
pd.read_sql_query(myquery, con=engine)
city footballteam city basketballteam
0 Buffalo Buffalo Bills None None
1 Cincinnati Cincinnati Bengals None None
2 Pittsburgh Pittsburgh Steelers None None
3 Baltimore Baltimore Ravens None None
4 Kansas City Kansas City Chiefs None None
5 Las Vegas Las Vegas Raiders None None
6 Nashville Tennessee Titans None None
7 Jacksonville Jacksonville Jaguars None None
8 Tampa Bay Tampa Bay Buccaneers None None
9 Seattle Seattle Seahawks None None
10 Green Bay Green Bay Packers None None

7.4.7. Natural Joins#

One annoying thing about all of the joins shown above is that we end up with two columns that contain the same information. In the case of the team database, we have two city columns that are always either equal, or else one is missing. But when one of the city columns says “None”, the team from that table also says “None”, so the missingness in the city column does not provide additional information.

It might make sense to use a different kind of join that understands that the two city columns contain the same information and includes only one of these columns. A natural join does two things differently from the other joins described here:

  1. A natural join removes duplicated columns from the output data.

  2. A natural join detects the indices automatically by assuming columns that share the same name are part indices.

If done correctly, a natural join saves some work constructing the query as the indices are detected automatically, and provides cleaner output. Any of the joins described above can be done as a natural join by adding NATURAL in front of INNER, LEFT, RIGHT, or FULL. If there are no columns that share the same name, a natural join instead performs a cross join (described below).

The following query performs a natural inner join:

myquery = """
SELECT * from nfl
NATURAL INNER JOIN nba
"""
pd.read_sql_query(myquery, con=engine)
city footballteam basketballteam
0 Miami Miami Dolphins Miami Heat
1 Boston New England Patriots Boston Celtics
2 New York {"New York Jets","New York Giants"} New York Knicks
3 Cleveland Cleveland Browns Cleveland Cavaliers
4 Los Angeles {"L.A. Chargers","L.A. Rams"} {"L.A. Lakers","L.A. Clippers"}
5 Denver Denver Broncos Denver Nuggets
6 Houston Houston Texans Houston Rockets
7 Indianapolis Indianapolis Colts Indiana Pacers
8 Philadelphia Philadelphia Eagles Philadelphia 76ers
9 Dallas Dallas Cowboys Dallas Mavericks
10 Washington Washington Skins Washington Wizards
11 Atlanta Atlanta Falcons Atlanta Hawks
12 Charlotte Carolina Panthers Charlotte Hornets
13 New Orleans New Orleans Saints New Orleans Pelicans
14 San Francisco San Francisco 49ers Golden State Warriors
15 Phoenix Arizona Cardinals Phoenix Suns
16 Chicago Chicago Bears Chicago Bulls
17 Minneapolis Minnesota Vikings Minnesota Timberwolves
18 Detroit Detroit Lions Detroit Pistons

Natural joins are controversial, however, and many data scientists choose not to use them at all. The danger is that if two columns unexpectedly have the same name (it can be hard to keep track of all of the features’ names in big databases) then a natural join will match on the wrong indices. This Stack Overflow post gets into this debate, and one response made a forceful argument against natural joins:

Collapsing columns in the output is the least-important aspect of a natural join. The things you need to know are (A) it automatically joins on fields of the same name and (B) it will f— up your s— when you least expect it. In my world, using a natural join is grounds for dismissal… . Say you have a natural join between Customers and Employees, joining on EmployeeID. Employees also has a ManagerID field. Everything’s fine. Then, some day, someone adds a ManagerID field to the Customers table. Your join will not break (that would be a mercy), instead it will now include a second field, and work incorrectly. Thus, a seemingly harmless change can break something only distantly related. VERY BAD. The only upside of a natural join is saving a little typing, and the downside is substantial.

Personally, I disagree with this statement as I think natural joins can be elegant and convenient, especially when I want to match on multiple indices. But I agree that natural joins do make it easier to mess up a join, and more caution is needed. To demonstrate how a natural join can go wrong, suppose that in both the NFL and NBA tables the columns were named city and team. The following code creates versions of these tables with footballteam and basketballteam each renamed to team and stores these tables in the database as “nfl2” and “nba2”:

nfl2 = pd.read_sql_query("SELECT city, footballteam as team FROM nfl;", con=engine)
nba2 = pd.read_sql_query("SELECT city, basketballteam as team FROM nba;", con=engine)
nfl2.to_sql('nfl2', con = engine, index=False, chunksize=1000, if_exists = 'replace')
nba2.to_sql('nba2', con = engine, index=False, chunksize=1000, if_exists = 'replace')

Now a natural inner join between “nfl2” and “nba2” yields a dataframe with no records:

myquery = """
SELECT * FROM nfl2 
NATURAL INNER JOIN nba2;
"""
pd.read_sql_query(myquery, con=engine)
city team

The reason why there are no records is that the natural join automatically chooses both city and team to be part of the index, and records are only kept in the inner join if they match on both city and team. There are many matches for city, but no matches for both city and team.

In contrast, a regular inner join still works fine:

myquery = """
SELECT * FROM nfl2 f
INNER JOIN nba2 b
    ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city team city team
0 Miami Miami Dolphins Miami Miami Heat
1 Boston New England Patriots Boston Boston Celtics
2 New York {"New York Jets","New York Giants"} New York New York Knicks
3 Cleveland Cleveland Browns Cleveland Cleveland Cavaliers
4 Los Angeles {"L.A. Chargers","L.A. Rams"} Los Angeles {"L.A. Lakers","L.A. Clippers"}
5 Denver Denver Broncos Denver Denver Nuggets
6 Houston Houston Texans Houston Houston Rockets
7 Indianapolis Indianapolis Colts Indianapolis Indiana Pacers
8 Philadelphia Philadelphia Eagles Philadelphia Philadelphia 76ers
9 Dallas Dallas Cowboys Dallas Dallas Mavericks
10 Washington Washington Skins Washington Washington Wizards
11 Atlanta Atlanta Falcons Atlanta Atlanta Hawks
12 Charlotte Carolina Panthers Charlotte Charlotte Hornets
13 New Orleans New Orleans Saints New Orleans New Orleans Pelicans
14 San Francisco San Francisco 49ers San Francisco Golden State Warriors
15 Phoenix Arizona Cardinals Phoenix Phoenix Suns
16 Chicago Chicago Bears Chicago Chicago Bulls
17 Minneapolis Minnesota Vikings Minneapolis Minnesota Timberwolves
18 Detroit Detroit Lions Detroit Detroit Pistons

To safely use natural joins, first make certain that the indices you intend to match on have the same name, and then make sure that no other columns in the two data tables share a name.

7.4.8. Cross Joins#

A round robin is a method of organizing a competitive tournament. In a round robin, every team or participant plays every other team or participant once. A cross join, also called a Cartesian product, is a round robin for matching values of the index in one data table to values of the index in the other data table. Every value of the index in the first data table is matched once to every distinct value of the index in the second data table. Cross joins are memory-intensive: if the first data table has \(M\) rows and the second data table has \(N\) rows, the cross join output is a data table with \(M\times N\) rows. In general cross joins are not good ways to combine data entities, and they fail to match strictly like units. But cross joins are useful for constructing data that contain all possible pairings, if that’s what a situation calls for.

The syntax for generating a cross join is

SELECT * FROM table1
CROSS JOIN table2;

There is no ON statement in this query because it is not needed to match each row in table1 to every row in table2. For the teams database, the cross join generates the following output:

myquery = """
SELECT * FROM nfl
CROSS JOIN nba;
"""
pd.read_sql_query(myquery, con=engine)
city footballteam city basketballteam
0 Buffalo Buffalo Bills Boston Boston Celtics
1 Buffalo Buffalo Bills New York New York Knicks
2 Buffalo Buffalo Bills Philadelphia Philadelphia 76ers
3 Buffalo Buffalo Bills Brooklyn Brooklyn Nets
4 Buffalo Buffalo Bills Toronto Toronto Raptors
... ... ... ... ...
865 Detroit Detroit Lions Minneapolis Minnesota Timberwolves
866 Detroit Detroit Lions Denver Denver Nuggets
867 Detroit Detroit Lions Salt Lake City Utah Jazz
868 Detroit Detroit Lions Memphis Memphis Grizzlies
869 Detroit Detroit Lions New Orleans New Orleans Pelicans

870 rows × 4 columns

7.4.9. Multiple Joins in One Query#

All of the examples above show a single join between two data tables, but many situations will require you to join multiple tables. it is possible to join many tables in one SQL query. The syntax to perform an inner join between two tables, then an inner join between the result and a third table is

SELECT * FROM table1
INNER JOIN table2
    ON table1.index_name = table2.index_name
INNER JOIN table 3
    ON table1.index_name = table3.index_name;

To demonstrate how multiple joins can work, I add a third table to the teams database that contains all of the Major League Baseball teams:

mlb_dict = {'city': ['New York', 'Boston', 'Toronto', 'Baltimore', 'Tampa Bay',
                     'Cleveland', 'Chicago', 'Kansas City', 'Minneapolis', 'Detroit',
                     'Houston', 'Anaheim', 'Dallas', 'Seattle', 'Oakland',
                     'Philadelphia', 'Miami', 'Washington', 'Atlanta', 'Cincinnati',
                     'Milwaukee', 'St. Louis', 'Pittsburgh', 'Los Angeles', 'San Francisco',
                     'San Diego', 'Denver', 'Phoenix'],
           'baseballteam': [['New York Mets', 'New York Yankees'], 'Boston Red Sox', 'Toronto Blue Jays',
                            'Baltimore Orioles', 'Tampa Bay Rays', 'Cleveland Indians', 
                             ['Chicago White Sox', 'Chicago Cubs'], 'Kansas City Royals', 'Minnesota Twins',
                            'Detriot Tigers', 'Houston Astros', 'Anaheim Angels', 'Texas Rangers', 
                            'Seattle Mariners', 'Oakland Athletics', 'Philadelphia Phillies',
                            'Miami Marlins', 'Washington Nationals', 'Atlanta Braves', 'Cincinnati Reds',
                            'Milwaukee Brewers', 'St. Louis Cardinals', 'Pittsburgh Pirates', 'Los Angeles Dodgers',
                            'San Francisco Giants', 'San Diego Padres', 'Colorado Rockies', 'Arizona Diamondbacks']}
mlb_df = pd.DataFrame(mlb_dict)
mlb_df.to_sql('mlb', con = engine, index=False, chunksize=1000, if_exists = 'replace')

We can first inner join the NFL and NBA data tables to keep only the cities with both an NFL and an NBA team, then we can inner join the result with the MLB data to keep only the cities with teams in all three sports:

myquery = """
SELECT * FROM nfl f
INNER JOIN nba b
    ON f.city = b.city
INNER JOIN mlb m
    ON f.city = m.city;
"""
pd.read_sql_query(myquery, con=engine)
city footballteam city basketballteam city baseballteam
0 Atlanta Atlanta Falcons Atlanta Atlanta Hawks Atlanta Atlanta Braves
1 Boston New England Patriots Boston Boston Celtics Boston Boston Red Sox
2 Chicago Chicago Bears Chicago Chicago Bulls Chicago {"Chicago White Sox","Chicago Cubs"}
3 Cleveland Cleveland Browns Cleveland Cleveland Cavaliers Cleveland Cleveland Indians
4 Dallas Dallas Cowboys Dallas Dallas Mavericks Dallas Texas Rangers
5 Denver Denver Broncos Denver Denver Nuggets Denver Colorado Rockies
6 Detroit Detroit Lions Detroit Detroit Pistons Detroit Detriot Tigers
7 Houston Houston Texans Houston Houston Rockets Houston Houston Astros
8 Los Angeles {"L.A. Chargers","L.A. Rams"} Los Angeles {"L.A. Lakers","L.A. Clippers"} Los Angeles Los Angeles Dodgers
9 Miami Miami Dolphins Miami Miami Heat Miami Miami Marlins
10 Minneapolis Minnesota Vikings Minneapolis Minnesota Timberwolves Minneapolis Minnesota Twins
11 New York {"New York Jets","New York Giants"} New York New York Knicks New York {"New York Mets","New York Yankees"}
12 Philadelphia Philadelphia Eagles Philadelphia Philadelphia 76ers Philadelphia Philadelphia Phillies
13 Phoenix Arizona Cardinals Phoenix Phoenix Suns Phoenix Arizona Diamondbacks
14 San Francisco San Francisco 49ers San Francisco Golden State Warriors San Francisco San Francisco Giants
15 Washington Washington Skins Washington Washington Wizards Washington Washington Nationals

Things get more complicated when we consider left, right, and full joins in a multiple table context. The trick is to think about the set of records that is required, to express that set in set theoretical notation, and to find the right combination of joins that matches that set theoretical statement.

For example, to obtain all cities with both an NFL and NBA team, also listing the MLB team if one exists in that city, we first inner join the NFL table to the NBA table, then we left join either the NFL’s or NBA’s city index to the MLB’s city column:

myquery = """
SELECT * FROM nfl f
INNER JOIN nba b
    ON f.city = b.city
LEFT JOIN mlb m
    ON f.city = m.city;
"""
pd.read_sql_query(myquery, con=engine)
city footballteam city basketballteam city baseballteam
0 Atlanta Atlanta Falcons Atlanta Atlanta Hawks Atlanta Atlanta Braves
1 Boston New England Patriots Boston Boston Celtics Boston Boston Red Sox
2 Charlotte Carolina Panthers Charlotte Charlotte Hornets None None
3 Chicago Chicago Bears Chicago Chicago Bulls Chicago {"Chicago White Sox","Chicago Cubs"}
4 Cleveland Cleveland Browns Cleveland Cleveland Cavaliers Cleveland Cleveland Indians
5 Dallas Dallas Cowboys Dallas Dallas Mavericks Dallas Texas Rangers
6 Denver Denver Broncos Denver Denver Nuggets Denver Colorado Rockies
7 Detroit Detroit Lions Detroit Detroit Pistons Detroit Detriot Tigers
8 Houston Houston Texans Houston Houston Rockets Houston Houston Astros
9 Indianapolis Indianapolis Colts Indianapolis Indiana Pacers None None
10 Los Angeles {"L.A. Chargers","L.A. Rams"} Los Angeles {"L.A. Lakers","L.A. Clippers"} Los Angeles Los Angeles Dodgers
11 Miami Miami Dolphins Miami Miami Heat Miami Miami Marlins
12 Minneapolis Minnesota Vikings Minneapolis Minnesota Timberwolves Minneapolis Minnesota Twins
13 New Orleans New Orleans Saints New Orleans New Orleans Pelicans None None
14 New York {"New York Jets","New York Giants"} New York New York Knicks New York {"New York Mets","New York Yankees"}
15 Philadelphia Philadelphia Eagles Philadelphia Philadelphia 76ers Philadelphia Philadelphia Phillies
16 Phoenix Arizona Cardinals Phoenix Phoenix Suns Phoenix Arizona Diamondbacks
17 San Francisco San Francisco 49ers San Francisco Golden State Warriors San Francisco San Francisco Giants
18 Washington Washington Skins Washington Washington Wizards Washington Washington Nationals

To narrow the records to teams with a baseball team and a basketball team, but no football team, first we inner join the MLB and NBA data tables, then perform an anti-join with the NFL data:

myquery = """
SELECT * FROM mlb m
INNER JOIN nba b
    ON m.city=b.city
LEFT JOIN nfl f
    ON m.city = f.city
WHERE f.city is NULL;
"""
pd.read_sql_query(myquery, con=engine)
city baseballteam city basketballteam city footballteam
0 Toronto Toronto Blue Jays Toronto Toronto Raptors None None
1 Milwaukee Milwaukee Brewers Milwaukee Milwaukee Bucks None None

7.4.10. Joins on More Than One Index#

Sometimes more than one column comprises the primary key for a table. The general syntax for joining two tables on more than one index adds the AND clause to the standard SQL join syntax:

SELECT * FROM table1
INNER JOIN table2
    ON table1.index1 = table2.index2
        AND table1.anotherindex1 = table2.anotherindex2;

Suppose for example that the NBA table and MLB table also contained records for minor league teams in the NBA G-League or the MLB AAA system. Some cities have both major and minor league teams in the same sport. Washington, for example, has a major league NBA team, the Wizards, and a minor league basketball team, the Capital City Go-Gos. Suppose that both the NBA and MLB tables have a column leaguetype that marks each team as “major” or “minor”, and that we want to match on both city and league type. The syntax to do so is

SELECT * FROM nba b
INNER JOIN mlb m
    ON b.city = m.city
        AND b.leaguetype = m.leaguetype;

7.5. SQL Create, Update, and Delete Operations#

Once a database exists and is populated with data, most changes to the data will be small and incremental. We might add a few records, edit a couple, or delete one or two. There are straightforward SQL commands for creating, updating, and deleting records. To issue these queries, however, we cannot use the pd.read_sql_query() function as this function is only for read operations. Instead, we can use the .execute() method as applied to either the cursor for the database we are working with, or the sqlalchemy engine. Specific examples are shown below.

7.5.1. Creating New Records#

An existing database has a schema, an overarching organizational blueprint for the database, that describes the different tables in the database, and within each table what the columns are and what kinds of data can be input into the columns. Creating new data generally works within an established schema. That means we enter new datapoints into existing columns, matching the data type that must exist in those columns.

The SQL syntax to create new data is

INSERT INTO table (column1, column2, ...)
    VALUES (value1, value2, ...);

This syntax requires us to specify the key elements of the schema that identify a location in the database: the table and the columns. The values need to be listed in the same order as the columns, and character values need to be enclosed in single quotes.

To add a new observation to the NBA table (bring back the Sonics!) we can type:

myquery = """
INSERT INTO nba (city, basketballteam)
    VALUES ('Seattle', 'Seattle Supersonics');
"""
engine.execute(myquery)
<sqlalchemy.engine.result.ResultProxy at 0x1153b5390>

Here the engine variable is the sqlaclchemy connection we previously established for the teams database. We can use the execute() method to pass SQL queries to the database, just as we can with a cursor. Now, when we look at the data, we see the Seattle Supersonics included along with all the other NBA teams:

pd.read_sql_query("SELECT * FROM nba", con=engine)
city basketballteam
0 Boston Boston Celtics
1 New York New York Knicks
2 Philadelphia Philadelphia 76ers
3 Brooklyn Brooklyn Nets
4 Toronto Toronto Raptors
5 Cleveland Cleveland Cavaliers
6 Chicago Chicago Bulls
7 Detroit Detroit Pistons
8 Milwaukee Milwaukee Bucks
9 Indianapolis Indiana Pacers
10 Atlanta Atlanta Hawks
11 Washington Washington Wizards
12 Orlando Orlando Magic
13 Miami Miami Heat
14 Charlotte Charlotte Hornets
15 Los Angeles {"L.A. Lakers","L.A. Clippers"}
16 San Francisco Golden State Warriors
17 Portland Portland Trailblazers
18 Sacramento Sacramento Kings
19 Phoenix Phoenix Suns
20 San Antonio San Antonio Spurs
21 Dallas Dallas Mavericks
22 Houston Houston Rockets
23 Oklahoma City Oklahoma City Thunder
24 Minneapolis Minnesota Timberwolves
25 Denver Denver Nuggets
26 Salt Lake City Utah Jazz
27 Memphis Memphis Grizzlies
28 New Orleans New Orleans Pelicans
29 Seattle Seattle Supersonics

7.5.2. Editing Existing Records#

Instead of creating a new record, there are situations in which we want to edit an existing record. To revise a record, we use the following SQL syntax:

UPDATE table
    SET column2 = newvalue
    WHERE logicalcondition;

In this case, SET specifies the change we want to make to a particular column. But we don’t want to change all of the values of the column, so we use WHERE to specify a logical condition to identify the rows we want to change. A logical condition is a statement that is true on some rows and false on others, and the data update happens only on the rows for which the condition is true.

Suppose we want to change the name of the Charlotte Hornets back to the Charlotte Bobcats (sorry, Charlotte). We can use the following code:

myquery = """
UPDATE nba
    SET basketballteam = 'Charlotte Bobcats'
    WHERE city = 'Charlotte';
"""
engine.execute(myquery)
<sqlalchemy.engine.result.ResultProxy at 0x115432c88>

Here the query says to update values in the NBA table by changing basketballteam to Charlotte Bobcats, but only when city is Charlotte. This update now appears in the NBA data:

pd.read_sql_query("SELECT * FROM nba", con=engine)
city basketballteam
0 Boston Boston Celtics
1 New York New York Knicks
2 Philadelphia Philadelphia 76ers
3 Brooklyn Brooklyn Nets
4 Toronto Toronto Raptors
5 Cleveland Cleveland Cavaliers
6 Chicago Chicago Bulls
7 Detroit Detroit Pistons
8 Milwaukee Milwaukee Bucks
9 Indianapolis Indiana Pacers
10 Atlanta Atlanta Hawks
11 Washington Washington Wizards
12 Orlando Orlando Magic
13 Miami Miami Heat
14 Los Angeles {"L.A. Lakers","L.A. Clippers"}
15 San Francisco Golden State Warriors
16 Portland Portland Trailblazers
17 Sacramento Sacramento Kings
18 Phoenix Phoenix Suns
19 San Antonio San Antonio Spurs
20 Dallas Dallas Mavericks
21 Houston Houston Rockets
22 Oklahoma City Oklahoma City Thunder
23 Minneapolis Minnesota Timberwolves
24 Denver Denver Nuggets
25 Salt Lake City Utah Jazz
26 Memphis Memphis Grizzlies
27 New Orleans New Orleans Pelicans
28 Seattle Seattle Supersonics
29 Charlotte Charlotte Bobcats

7.5.3. Deleting Records#

Sometimes you might need to delete records from a database. These situations should be rare. If a record is no longer relevant for a particular use, it is always better to leave the record in the database and use another column to denote new information that can be used to filter records later. If there are mistakes in data entry, it’s better to edit existing records than to delete those records outright. If you must delete a record, the syntax to do so is

DELETE FROM table WHERE logicalcondition;

First specify the table, then the logical condition that identifies the rows you intend to delete.

In the teams database, suppose we want to delete the Baltimore Ravens (go Browns!) from the NFL table. The code to do that is:

myquery = """
DELETE FROM nfl WHERE city = 'Baltimore'; 
"""
engine.execute(myquery)
<sqlalchemy.engine.result.ResultProxy at 0x115445a58>

In this case, city = 'Baltimore' identifies the rows we want to delete in the NFL table. The NFL data now no longer contains a row for the Ravens:

pd.read_sql_query("SELECT * FROM nfl", con=engine)
city footballteam
0 Buffalo Buffalo Bills
1 Miami Miami Dolphins
2 Boston New England Patriots
3 New York {"New York Jets","New York Giants"}
4 Cleveland Cleveland Browns
5 Cincinnati Cincinnati Bengals
6 Pittsburgh Pittsburgh Steelers
7 Kansas City Kansas City Chiefs
8 Las Vegas Las Vegas Raiders
9 Los Angeles {"L.A. Chargers","L.A. Rams"}
10 Denver Denver Broncos
11 Nashville Tennessee Titans
12 Jacksonville Jacksonville Jaguars
13 Houston Houston Texans
14 Indianapolis Indianapolis Colts
15 Philadelphia Philadelphia Eagles
16 Dallas Dallas Cowboys
17 Washington Washington Skins
18 Atlanta Atlanta Falcons
19 Charlotte Carolina Panthers
20 Tampa Bay Tampa Bay Buccaneers
21 New Orleans New Orleans Saints
22 San Francisco San Francisco 49ers
23 Phoenix Arizona Cardinals
24 Seattle Seattle Seahawks
25 Chicago Chicago Bears
26 Green Bay Green Bay Packers
27 Minneapolis Minnesota Vikings
28 Detroit Detroit Lions

7.6. Cleaning and Manipulating Data with SQL Read Operations#

After using joins to combine data tables in the database, the data needs to be manipulated to make the data more convenient to use. That might involve narrowing down the data to a specific subset of interest, performing calculations on the data to generate new features, and changing the appearance of the data. In “Tidy Data”, Hadley Wickham defines four essential “verbs” of data manipulation:

  • Filter: subsetting or removing observations based on some condition.

  • Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).

  • Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).

  • Sort: changing the order of observations (p. 13).

In addition it may be necessary to pull only a selection of the columns into the output, or to change the names of the columns to more readable and useful ones. These operations can be performed within SQL read commands by using the WHERE clause for filtering, mathematical operators to transform columns, the GROUP BY syntax for aggregation, the ORDER BY, ASC, or DESC clauses for sorting, and the AS keyword for renaming columns.

7.6.1. Example: Wine Reviews#

To illustrate how to issue queries to read data while manipulating and cleaning the data, we will use the PostgreSQL version of the wine review database that we created in module 6. If you want to follow along with these example, follow the instructions in the “Using PostgreSQL” subsection of module 6 to get a local wine database running on your system.

For read operations, we can use the pd.read_sql_query() function. For that, we first have to use sqlalchemy to set up an engine that connects pandas to the database:

engine = create_engine("postgresql+psycopg2://{user}:{pw}@localhost/{db}"
                       .format(user="jk8sd", pw=pgpassword, db="winedb"))

The logical ER diagram for the wine reviews database is

7.6.2. Selecting Columns#

SELECT and FROM are the primary SQL verbs for reading data. In many of the examples up to this points, we’ve issued queries like

SELECT * FROM table;

that pull all of the rows and all of the columns from a single data table. The * character is called a wildcard character. When typed by itself, the wildcard captures all of the columns in a table. But sometimes we are interested in only a selection of the columns. In that case, we replace the wildcard with the columns we want to include in the output. The following syntax includes three columns from a specified data table:

SELECT col1, col2, col3 FROM table;

Suppose that I want to know the title, variety, price, points, country, and reviewer for all of the wines in the data. Title, variety, price, and points are all in the reviews table, country is in the locations table, and the reviewer (taster_name) is in the tasters table. To produce the data I need to join these three tables while also using SELECT to identify only the rows I am interested in. Inner joins are appropriate because every wine in the data has both a location and a reviewer.

The best way to select columns across multiple tables is to use aliasing, the same way we did for joins. In this case, if we alias the reviews table as r, locations as l, and tasters as t, we can use these same aliases to inform SQL where to find each column in the SELECT syntax.

The code to return this dataframe is:

myquery="""
SELECT r.title, r.variety, r.price, r.points, l.country, t.taster_name FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
INNER JOIN tasters t
    ON r.taster_id = t.taster_id;
"""
pd.read_sql_query(myquery, con=engine)
title variety price points country taster_name
0 Olivier Leflaive 2006 Les Pucelles Premier Cru... Chardonnay NaN 93 France Roger Voss
1 The Foundry 2004 Syrah (Coastal Region) Syrah 35.0 87 South Africa Susan Kostrzewa
2 Guilbaud Frères 2007 Le Soleil Nantais (Musca... Melon 11.0 87 France Roger Voss
3 Domaine du Clos du Fief 2007 Cuvée Tradition ... Gamay NaN 86 France Roger Voss
4 Domaine Philippe Delesvaux 2005 La Montée de l... Cabernet Sauvignon NaN 86 France Roger Voss
... ... ... ... ... ... ...
103722 Sheridan Vineyard 2005 Reserve Cabernet Sauvig... Cabernet Sauvignon 75.0 94 US Paul Gregutt
103723 Woodward Canyon 2006 Old Vines Dedication Seri... Cabernet Sauvignon 84.0 94 US Paul Gregutt
103724 Chanson Père et Fils 2005 Champs Gains Premier... Chardonnay 115.0 93 France Roger Voss
103725 Mark Ryan 2006 Chardonnay (Columbia Valley (WA)) Chardonnay NaN 93 US Paul Gregutt
103726 Joseph Drouhin 2007 Grands-Echezeaux Pinot Noir 285.0 94 France Roger Voss

103727 rows × 6 columns

7.6.3. Logical Statements#

Most programming languages have the capacity to evaluate a statement as being either true or false, or true for some values and false for others. A logical statement uses logical operators that define how values should be compared. In SQL, logical statements are used in conjunction with the WHERE statement to select the rows to include in the output.

For SQL, logical statements either compare a column to another column, or compare a column to one or more reference values. The following logical operators are available:

  • = - is equal to?

  • < - is less than?

  • > - is greater than?

  • <= - is less than or equal to?

  • >= - is greater than or equal to?

  • <> - is not equal to?

  • BETWEEN a AND b - true if a value exists within the range from a to b, including a and b

  • IN ('element1','element2','element3') - true if a value is one of the elements in the given set

  • NOT - true if the rest of the logical statement is false, false if the rest of the logical statement is true

  • AND - links separate logical statements together such that the overall statement is true only when all of the linked statements are true

  • OR - links separate logical statements together such that the overall statement is true when any of the linked statements are true

  • LIKE pattern - true if the string value matches the given pattern:

    • LIKE '%%text' captures all rows in which a given column ends with ‘text’

    • LIKE 'text%%' captures all rows in which a given column begins with ‘text’

    • LIKE '%%text%%' captures all rows in which a given column contains ‘text’ somewhere in its string value

  • () - parts of the logical statement that are contained within parentheses are evaluated first

I will show examples of how to use these logical statements for filtering rows, in the next section.

7.6.4. Filtering Rows#

Suppose we wanted to know the title, the variety, and the price of the French wines that Roger Voss scored as 100. It’s a simple semantic sentence, but it connects to a more complicated set of SQL functions. First consider all of the columns we need to use to process the sentence:

  • title, from the reviews table

  • variety, from the reviews table

  • price, from the reviews table

  • French, a value of country, from the locations table

  • Roger Voss, a value of taster name, from the tasters table,

  • and 100, a value of points, from the reviews table.

Because we need to use data from the reviews, locations, and tasters tables, we need to inner join reviews, locations, and tasters.

But then on top of this join, we need to restrict both the columns and rows. We only want title, variety, and price in the final data, so we use SELECT to keep only these columns.

To restrict the rows, we use WHERE along with a logical condition. This logical condition has a few parts: we want wines in which country='France', taster_name='Roger Voss', and points=100. All three conditions need to be true for us to want to keep the row, so we connect the three statements with AND.

The SQL query that returns the title, the variety, and the price of the French wines that Roger Voss scored as 100 is:

myquery = """
SELECT r.title, r.variety, r.price FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
INNER JOIN tasters t
    ON r.taster_id = t.taster_id
WHERE l.country='France' AND t.taster_name='Roger Voss' AND r.points=100;
"""
pd.read_sql_query(myquery, con=engine)
title variety price
0 Krug 2002 Brut (Champagne) Champagne Blend 259.0
1 Château Léoville Barton 2010 Saint-Julien Bordeaux-style Red Blend 150.0
2 Louis Roederer 2008 Cristal Vintage Brut (Cha... Champagne Blend 250.0
3 Salon 2006 Le Mesnil Blanc de Blancs Brut Char... Chardonnay 617.0
4 Château Lafite Rothschild 2010 Pauillac Bordeaux-style Red Blend 1500.0
5 Château Cheval Blanc 2010 Saint-Émilion Bordeaux-style Red Blend 1500.0
6 Château Léoville Las Cases 2010 Saint-Julien Bordeaux-style Red Blend 359.0
7 Château Haut-Brion 2014 Pessac-Léognan Bordeaux-style White Blend 848.0

I like my wine local or low-cost. So as another example, suppose we want the title, variety, price, points, country, and providence for all of the wines with scores of 90 or more that either cost between 5 and 10 dollars or are from Virginia. In this case, we need to join the reviews and locations data together, and write a logical statement that matches these specific conditions. The logical condition is

points >= 90 AND (price BETWEEN 5 AND 10 OR province = 'Virginia')

The entire SQL query is

myquery = """
SELECT r.title, r.variety, r.price, r.points, l.country, l.province FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
WHERE r.points >= 90 AND (r.price BETWEEN 5 AND 10 OR l.province = 'Virginia');
"""
pd.read_sql_query(myquery, con=engine)
title variety price points country province
0 Château Vircoulon 2016 Bordeaux Blanc Bordeaux-style White Blend 10.0 90 France Bordeaux
1 Mano A Mano 2011 Tempranillo (Vino de la Tierr... Tempranillo 9.0 90 Spain Central Spain
2 Aveleda 2014 Quinta da Aveleda Estate Bottled ... Portuguese White 9.0 90 Portugal Vinho Verde
3 Chateau Ste. Michelle 2011 Riesling (Columbia ... Riesling 9.0 91 US Washington
4 Quinta do Portal 2007 Mural Reserva Red (Douro) Portuguese Red 10.0 91 Portugal Douro
... ... ... ... ... ... ...
76 Casaleiro 2012 Reserva Touriga Nacional-Castel... Portuguese Red 9.0 90 Portugal Tejo
77 Aveleda 2015 Quinta da Aveleda White (Vinho Ve... Portuguese White 9.0 90 Portugal Vinho Verde
78 Cookies & Cream 2010 Merlot (California) Merlot 10.0 90 US California
79 Lovingston 2012 Josie's Knoll Merlot (Monticello) Merlot 20.0 91 US Virginia
80 Aveleda 2016 Quinta da Aveleda White (Vinho Ve... Portuguese White 10.0 90 Portugal Vinho Verde

81 rows × 6 columns

The parentheses in this last query are needed to ensure that the DBMS evaluates the OR statement first. Without the parentheses,

points >= 90 AND price BETWEEN 5 AND 10 OR province = 'Virginia'

the DBMS evaluates the first two conditions first, then considers the third, so the statement is equivalent to

(points >= 90 AND price BETWEEN 5 AND 10) OR province = 'Virginia'

and it returns data with all wines that have scores of at least 90 and prices between 5 and 10 dollars, along with all wines from Virginia whether or not those wines have scores of at least 90:

myquery = """
SELECT r.title, r.variety, r.price, r.points, l.country, l.province FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
WHERE r.points >= 90 AND r.price BETWEEN 5 AND 10 OR l.province = 'Virginia';
"""
pd.read_sql_query(myquery, con=engine)
title variety price points country province
0 Veramar 2016 JB Winemaker Series Cabernet Fran... Cabernet Franc 34.0 86 US Virginia
1 Château Vircoulon 2016 Bordeaux Blanc Bordeaux-style White Blend 10.0 90 France Bordeaux
2 Mano A Mano 2011 Tempranillo (Vino de la Tierr... Tempranillo 9.0 90 Spain Central Spain
3 Trump 2014 Rosé (Monticello) Rosé 14.0 86 US Virginia
4 The Boneyard 2014 Chardonnay (Virginia) Chardonnay 15.0 86 US Virginia
... ... ... ... ... ... ...
444 Annefield Vineyards 2009 Cabernet Franc (Virgi... Cabernet Franc 29.0 88 US Virginia
445 Lovingston 2012 Josie's Knoll Merlot (Monticello) Merlot 20.0 91 US Virginia
446 Aveleda 2016 Quinta da Aveleda White (Vinho Ve... Portuguese White 10.0 90 Portugal Vinho Verde
447 Tarara 2013 Cabernet Franc (Virginia) Cabernet Franc 25.0 85 US Virginia
448 Paradise Springs 2014 Nana's Rosé (Virginia) Rosé 22.0 86 US Virginia

449 rows × 6 columns

Suppose I’m open to many wines, but I have a thing against wines from the U.S., and I don’t like Pinot Noir, Pinot Gris, or Chardonnay. I want to query the wines database to return data on the title, variety, country, and price of all of these wines except for the American ones and the ones I dislike. The SQL query requires joining the reviews and locations tables, and using negation in the logical statement with the <> and NOT operators, like this:

myquery = """
SELECT r.title, r.variety, r.price, l.country FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
WHERE country <> 'US' AND variety NOT IN ('Pinot Noir', 'Pinot Gris', 'Chardonnay');
"""
pd.read_sql_query(myquery, con=engine)
title variety price country
0 The Foundry 2004 Syrah (Coastal Region) Syrah 35.0 South Africa
1 Guilbaud Frères 2007 Le Soleil Nantais (Musca... Melon 11.0 France
2 Domaine du Clos du Fief 2007 Cuvée Tradition ... Gamay NaN France
3 Domaine Philippe Delesvaux 2005 La Montée de l... Cabernet Sauvignon NaN France
4 Georges Duboeuf 2007 Beaujolais-Villages Gamay NaN France
... ... ... ... ...
57429 Indomita NV Rosé Sparkling (Casablanca Valley) Sparkling Blend 18.0 Chile
57430 Intipalka 2013 Valle del Sol Tannat (Ica) Tannat 14.0 Peru
57431 Lobster Reef 2014 Sauvignon Blanc (Marlborough) Sauvignon Blanc 12.0 New Zealand
57432 Millaman 2014 Estate Reserve Sauvignon Blanc (... Sauvignon Blanc 10.0 Chile
57433 Royal Tokaji 1999 Mézes Mály Aszú 6 Puttonyos ... Tokaji 175.0 Hungary

57434 rows × 4 columns

If we want all of the columns from the reviews table for wines whose descriptions contain the words “smoke” and “chocolate” - taking case sensitivity into account by converting the descriptions to all lower case in the WHERE clause so that “chocolate” and “Chocolate” are both matched - the following query returns those wines:

myquery = """
SELECT * FROM reviews 
WHERE LOWER(description) LIKE '%%smoke%%' AND LOWER(description) LIKE '%%chocolate%%';
"""
pd.read_sql_query(myquery, con=engine)
wine_id title variety description points price taster_id winery_id location_id
0 6792 Guardian Peak 2006 Shiraz (Western Cape) Shiraz A gorgeous nose of plums, chocolate and red fr... 89 15.0 16 7363 1141
1 7812 Hightower 2006 Cabernet Sauvignon (Columbia Va... Cabernet Sauvignon Sourced largely from Red Mountain fruit, this ... 87 35.0 3 7629 1474
2 7866 Viña Cobos 2012 Bramare Marchiori Vineyard Mal... Malbec Toasty woodsmoke aromas are matched by wild be... 94 90.0 5 13962 4
3 8110 Mendel 2013 Unus Red (Mendoza) Bordeaux-style Red Blend Rich aromas of raisin, cassis and blackberry a... 93 50.0 5 9746 7
4 9262 Freakshow 2014 Cabernet Sauvignon (Lodi) Cabernet Sauvignon Rich, ripe and oaky, this full-bodied wine has... 91 20.0 10 6893 1304
... ... ... ... ... ... ... ... ... ...
260 4862 Hearst Ranch 2013 Lone Tree Cabernet Franc (Pa... Cabernet Franc Made in a thick, oaky style, this shows burned... 87 35.0 8 7498 1337
261 4690 Pear Valley 2013 Distraction Red (Paso Robles) Bordeaux-style Red Blend The signature bottling from this winery, this ... 91 35.0 8 10762 1337
262 4794 Estate Constantin Gofas 2008 Agiorgitiko (Nemea) Agiorgitiko This wine has the plucky character typical of ... 85 18.0 16 6366 668
263 6574 Fielding Hills 2006 RiverBend Vineyard Syrah (... Syrah Bold and forward, this estate-grown Syrah fair... 94 40.0 3 6619 1483
264 5020 Corliss Estates 2007 Cabernet Sauvignon (Colum... Cabernet Sauvignon Concentrated and wonderfully aromatic, this ag... 94 75.0 3 4677 1474

265 rows × 9 columns

There are situations in which we want to display only some of the records that match a particular query. For that, we can use the LIMIT and OFFSET clauses. LIMIT sets the number of records to extract, and OFFSET set the starting row. For example, adding

LIMIT 10

to a query instructs the DBMS to extract only the first 10 rows of the output data. Adding

LIMIT 10 OFFSET 5

tells the DBMS to extract 10 rows, after first skipping the first 5 rows: so these clauses together return rows 6 through 15. To see the 4th through 7th rows from the previous query to the wine reviews database, we can type:

myquery = """
SELECT * FROM reviews 
WHERE LOWER(description) LIKE '%%smoke%%' AND LOWER(description) LIKE '%%chocolate%%'
LIMIT 4 OFFSET 3;
"""
pd.read_sql_query(myquery, con=engine)
wine_id title variety description points price taster_id winery_id location_id
0 8110 Mendel 2013 Unus Red (Mendoza) Bordeaux-style Red Blend Rich aromas of raisin, cassis and blackberry a... 93 50.0 5 9746 7
1 9262 Freakshow 2014 Cabernet Sauvignon (Lodi) Cabernet Sauvignon Rich, ripe and oaky, this full-bodied wine has... 91 20.0 10 6893 1304
2 9746 Carmel 2013 Admon Vineyard Cabernet Sauvignon ... Cabernet Sauvignon This wine has offers aromas of dark plum and c... 89 35.0 14 1925 694
3 10326 De Martino 2009 Alto de Piedras Single Vineyar... Carmenère A big, earthy type of wine with a ton of ripen... 90 45.0 5 4973 182

7.6.5. Sorting Data#

Sorting data refers to rearranging the rows of a dataframe. Sorting is a cosmetic thing to do to data because the order of the rows should not change the meaning of the data both in terms of storage (rearranging the rows should NOT change the meaning of each row), or for most analytical models (rearranging rows won’t change the parameter estimates from a linear regression, for example). But sorting is a way to visualize important characteristics about the data and to quickly see important records with maximum and minimum values of key features.

To sort the output data, use the ORDER BY syntax within an SQL query. The general syntax for ORDER BY is

ORDER BY column1, column2, column3

Writing more than one column is optional. If more than one column is entered, then the second column is used to break ties between rows that have the same value of the first column. If a third column is entered, it’s used to break ties between rows that have the same value for both the first and second columns. In addition, each column can be sorted in ascending or descending order by typing either ASC (this is the default, so typing ASC is optional, but useful for making the SQL code more readable) or DESC immediately after the column name.

For example, what are the top rated wines from Virginia? And of these top rated wines, which ones are cheapest? To find out, we issue a query that joins the reviews and locations tables, filters the data to just wines from Virginia, narrows down the columns to just title, points, and price, and sorts first by points, then by price. We sort by points in descending order so the best wines appear first, and we sort by price in ascending order so that the cheapest wines appear first. The syntax for this query is:

myquery = """
SELECT r.title, r.points, r.price FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
WHERE province = 'Virginia'
ORDER BY points DESC, price ASC;
"""
pd.read_sql_query(myquery, con=engine)
title points price
0 King Family 2015 Orange Viognier (Monticello) 92 35.0
1 Lovingston 2012 Josie's Knoll Merlot (Monticello) 91 20.0
2 Lovingston 2015 Josie's Knoll Rotunda Red (Mon... 90 20.0
3 Barboursville Vineyards 2015 Reserve Cabernet ... 90 25.0
4 King Family 2012 Meritage (Monticello) 90 31.0
... ... ... ...
374 Narmada 2013 Reserve Cabernet Franc (Virginia) 82 34.0
375 Veramar 2009 Chardonnay (Virginia) 81 18.0
376 Bogati 2013 Black Label Club Fumé Blanc Sauvig... 81 26.0
377 Three Fox 2014 Calabrese Pinot Grigio (Middleb... 81 28.0
378 Winery at La Grange 2012 Cabernet Sauvignon (V... 81 43.0

379 rows × 3 columns

7.6.6. Renaming Columns and Transforming Data Values#

Sometimes it is useful to rename the columns in the output data within a query. To rename a column, use the AS syntax while referencing the columns in SELECT. For example, we can load the title, variety, and points columns from the reviews table, but we can rename these columns name, type, and score respectively:

myquery = """
SELECT title AS name, variety AS type, points AS score FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
name type score
0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc 86
1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo 86
2 Maurice Ecard 2009 Bourgogne Chardonnay 86
3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling 86
4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir 86
... ... ... ...
103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer 86
103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne 86
103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto 86
103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend 86
103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah 86

103727 rows × 3 columns

In other situations, we might want to transform a column arithmetically or in another way. SQL supports the standard arithmetic operators: + for addition, - for subtraction, * for multiplication, and / for division. SQL also supports the modulo operator % to return the remainder after division (16 % 5 equals 1, for example, because 16 divided by 5 yields a remainder of 1). SQL also allows the following arithmetic functions:

  • EXP(a) - raises a the argument to the power of \(e = 2.718...\)

  • POWER(a,b) - raises a to the power of b

  • LOG(a) - takes the natural (base \(e\)) logarithm of a

  • LOG10(a) - takes the common (base 10) logarithm of a

  • SQRT(a) - takes the square root of a

  • ABS(a) - takes the absolute value of a

  • CEILING(a) - rounds values of a up to the next whole number

  • FLOOR(a) - rounds values of a down to a whole number

  • ROUND(a, k) - rounds values of a up or down to the nearest number with k decimals

  • SIGN(a) - returns 1 if a is positive, -1 if a is negative, and 0 if a is 0

In addition, if you need them, there are many trigonometric functions built into standard SQL.

When using a function that operates on a column, it is important to use AS to name the new column, as SQL has no way to choose a logical name automatically for constructed columns and uses ?column? be default.

For example, we can convert the price of each wine from dollars to Euros by multiplying the price by the .91 USD to Euro exchange rate. We keep the original price in the query but rename it price_dollars, and we name the converted price price_euros:

myquery = """
SELECT title, variety, price AS price_dollars, .91*price AS price_euros FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title variety price_dollars price_euros
0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc 12.0 10.92
1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo 24.0 21.84
2 Maurice Ecard 2009 Bourgogne Chardonnay NaN NaN
3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling 18.0 16.38
4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir 22.0 20.02
... ... ... ... ...
103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer 15.0 13.65
103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne 18.0 16.38
103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto 24.0 21.84
103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend 19.0 17.29
103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah 24.0 21.84

103727 rows × 4 columns

For no reason other than to demonstrate the use of the various mathematical functions, we can put many transformations of price in one dataframe:

myquery = """
SELECT price,
    EXP(price/1000) as price_exp,
    LOG(price) as price_natlog,
    LOG10(price) as price_commonlog,
    ROUND(LOG(price)) as price_loground,
    SQRT(price) as price_sqrt,
    POWER(price, 2) as price_squared,
    POWER(price, 3) as price_cubed,
    SIGN(price - 50) as price_morethan50
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
price price_exp price_natlog price_commonlog price_loground price_sqrt price_squared price_cubed price_morethan50
0 12.0 1.012072 1.079181 1.079181 1.0 3.464102 144.0 1728.0 -1.0
1 24.0 1.024290 1.380211 1.380211 1.0 4.898979 576.0 13824.0 -1.0
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 18.0 1.018163 1.255273 1.255273 1.0 4.242641 324.0 5832.0 -1.0
4 22.0 1.022244 1.342423 1.342423 1.0 4.690416 484.0 10648.0 -1.0
... ... ... ... ... ... ... ... ... ...
103722 15.0 1.015113 1.176091 1.176091 1.0 3.872983 225.0 3375.0 -1.0
103723 18.0 1.018163 1.255273 1.255273 1.0 4.242641 324.0 5832.0 -1.0
103724 24.0 1.024290 1.380211 1.380211 1.0 4.898979 576.0 13824.0 -1.0
103725 19.0 1.019182 1.278754 1.278754 1.0 4.358899 361.0 6859.0 -1.0
103726 24.0 1.024290 1.380211 1.380211 1.0 4.898979 576.0 13824.0 -1.0

103727 rows × 9 columns

Another useful operation for transforming columns in a read query is CASE, which maps numeric values to categories. The syntax that uses CASE within SELECT is

SELECT CASE
    WHEN logicalstatement1 THEN value1
    WHEN logicalstatement2 THEN value2
    WHEN logicalstatement3 THEN value3
    ELSE value4
    END AS name

This code evaluates each logical statement, and fills in the datapoint with the specified value if the logical statement is true. If more than one of the logical statements is true, then the first statement/value pair entered in takes precedence. If none of the logical statements are true, then the datapoint is filled in with the value listed with ELSE. As before, it is important to name the new column with AS.

For example, if we want to categorize wines as cheap when the price is under 20 dollars, moderately priced if the price is between 20 and 50 dollars, and expensive if the price is more than 50 dollars, we can type:

myquery="""
SELECT title, variety, price, CASE
    WHEN price < 20 THEN 'cheap'
    WHEN price BETWEEN 20 AND 50 THEN 'moderately priced'
    WHEN price > 50 THEN 'expensive'
    ELSE NULL
    END AS price_level
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title variety price price_level
0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc 12.0 cheap
1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo 24.0 moderately priced
2 Maurice Ecard 2009 Bourgogne Chardonnay NaN None
3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling 18.0 cheap
4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir 22.0 moderately priced
... ... ... ... ...
103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer 15.0 cheap
103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne 18.0 cheap
103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto 24.0 moderately priced
103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend 19.0 cheap
103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah 24.0 moderately priced

103727 rows × 4 columns

There’s an important point of caution when using CASE. Unless you account for missing values explicity in your query, the missing values will be matched to the entered last in CASE. That will corrupt the data. It is best practice to write conditions for the full set of observed values of a column, and to end the call to CASE with ELSE NULL, so that when none of the conditions apply, the new column is also missing.

There are also functions that apply to columns with character values:

  • LOWER(a) - converts all characters in a to lowercase

  • UPPER(a) - converts all characters in a to uppercase

  • INITCAP(a) - converts the first letter of every word in a to uppercase

  • CONCAT(a,b,c) - appends the string b to the end of a, and c (if included) to the end of b

  • LENGTH(a) - reports the number of characters in the string a

  • SUBSTR(a, start, length) - restricts the string a to a substring, beginning at the position denoted by start, and including the next length characters

  • TRIM(a) - removes spaces at the beginning and end of string a

  • REPLACE(a, oldtext, newtext) - searches values of a for occurrences of oldtext and replaces them with newtext

For example, we can replace the descriptions in the reviews table with all capitals, all lower-case letters, or capitals for the first letter of each word:

myquery = """
SELECT title, variety, price, 
    UPPER(description) as description_upper, 
    LOWER(description) as description_lower, 
    INITCAP(description) as description_initcap 
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title variety price description_upper description_lower description_initcap
0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc 12.0 IT'S PRETTY EASY PEGGING THIS FOR CHILEAN SB; ... it's pretty easy pegging this for chilean sb; ... It'S Pretty Easy Pegging This For Chilean Sb; ...
1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo 24.0 OPAQUE IN COLOR, WITH BLACKBERRY AND LICORICE ... opaque in color, with blackberry and licorice ... Opaque In Color, With Blackberry And Licorice ...
2 Maurice Ecard 2009 Bourgogne Chardonnay NaN ATTRACTIVE RIPE FRUITS GO WITH LIME AND TOAST ... attractive ripe fruits go with lime and toast ... Attractive Ripe Fruits Go With Lime And Toast ...
3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling 18.0 SMOKY AND A BIT STARK WITH WHIFFS OF STRUCK FL... smoky and a bit stark with whiffs of struck fl... Smoky And A Bit Stark With Whiffs Of Struck Fl...
4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir 22.0 SOURCED FROM A HIGH-ALTITUDE VINEYARD NEAR THE... sourced from a high-altitude vineyard near the... Sourced From A High-Altitude Vineyard Near The...
... ... ... ... ... ... ...
103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer 15.0 SWEET ON THE NOSE, WITH SCENTS OF PINK GRAPEFR... sweet on the nose, with scents of pink grapefr... Sweet On The Nose, With Scents Of Pink Grapefr...
103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne 18.0 WAXY FRUIT FLAVORS OF PEACH, MELON AND BANANA,... waxy fruit flavors of peach, melon and banana,... Waxy Fruit Flavors Of Peach, Melon And Banana,...
103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto 24.0 THIS ALCOHOLIC WINE (15.9%, AND YOU CAN TASTE ... this alcoholic wine (15.9%, and you can taste ... This Alcoholic Wine (15.9%, And You Can Taste ...
103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend 19.0 LIGHT FRUIT FLAVORS RUN FROM MELON INTO PALE S... light fruit flavors run from melon into pale s... Light Fruit Flavors Run From Melon Into Pale S...
103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah 24.0 FROM HUSBAND-AND-WIFE TEAM IN CALIFORNIA REDWO... from husband-and-wife team in california redwo... From Husband-And-Wife Team In California Redwo...

103727 rows × 6 columns

We can use REPLACE() to make the writing less artful, for example, by replacing the word “aroma” with “good smell” everywhere it appears in the wine descriptions. Note that REPLACE() is case-sensitive, so it is a good idea to convert the values to a consistent case like lowercase first:

myquery = """
SELECT title, variety, price, description,
    REPLACE(LOWER(description), 'aroma', 'good smell') as description_replace 
FROM reviews
WHERE description LIKE '%%aroma%%';
"""
pd.read_sql_query(myquery, con=engine).description[0]
'Opaque in color, with blackberry and licorice aromas but also a distinct streak of brambly herbs and green. Later on, pine needle and tartness enter the fray. This is a commendable modern Rioja but it does have a few issues, namely a green herbal component.'
pd.read_sql_query(myquery, con=engine).description_replace[0]
'opaque in color, with blackberry and licorice good smells but also a distinct streak of brambly herbs and green. later on, pine needle and tartness enter the fray. this is a commendable modern rioja but it does have a few issues, namely a green herbal component.'

The SUBSTR() function can be used to extract parts of a string. The following code reduces the description column to substrings beginning at the 5th character and proceeding 10 characters in length:

myquery = """
SELECT title, variety, price, description,
    SUBSTR(description, 5, 10) as description_substr 
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title variety price description description_substr
0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc 12.0 It's pretty easy pegging this for Chilean SB; ... pretty ea
1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo 24.0 Opaque in color, with blackberry and licorice ... ue in colo
2 Maurice Ecard 2009 Bourgogne Chardonnay NaN Attractive ripe fruits go with lime and toast ... active rip
3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling 18.0 Smoky and a bit stark with whiffs of struck fl... y and a bi
4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir 22.0 Sourced from a high-altitude vineyard near the... ced from a
... ... ... ... ... ...
103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer 15.0 Sweet on the nose, with scents of pink grapefr... t on the n
103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne 18.0 Waxy fruit flavors of peach, melon and banana,... fruit fla
103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto 24.0 This alcoholic wine (15.9%, and you can taste ... alcoholic
103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend 19.0 Light fruit flavors run from melon into pale s... t fruit fl
103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah 24.0 From husband-and-wife team in California redwo... husband-a

103727 rows × 5 columns

In the wine database, province and country are stored in separate columns in the locations table. If we wanted to put these two pieces of information together in one readable column, we can use CONCAT(). In this example, I type CONCAT(l.province, ', ', l.country) which appends three strings - the province from the locations table, a comma and a space, and the country from the locations table - and names the new column place:

myquery = """
SELECT r.title, r.variety, r.price, 
    CONCAT(l.province, ', ', l.country) as place 
FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id;
"""
pd.read_sql_query(myquery, con=engine)
title variety price place
0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc 12.0 Casablanca Valley, Chile
1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo 24.0 Northern Spain, Spain
2 Maurice Ecard 2009 Bourgogne Chardonnay NaN Burgundy, France
3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling 18.0 New York, US
4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir 22.0 Marlborough, New Zealand
... ... ... ... ...
103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer 15.0 New York, US
103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne 18.0 Washington, US
103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto 24.0 Washington, US
103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend 19.0 Washington, US
103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah 24.0 California, US

103727 rows × 4 columns

What are the shortest descriptions in the data? To find out we use the LENGTH() function to count the number of characters in each description, and sort these lengths in ascending order:

myquery = """
SELECT title, points, price, description,
    LENGTH(description) as length
FROM reviews
ORDER BY length ASC;
"""
pd.read_sql_query(myquery, con=engine)
title points price description length
0 Craggy Range 2007 Kidnappers Vineyard Chardonn... 88 24.0 Imported by Kobrand. 20
1 Chasing Venus 2007 Sauvignon Blanc (Marlborough) 88 16.0 Imported by JL Giguiere. 24
2 Philip Shaw 2007 No. 19 Sauvignon Blanc (Orange) 86 20.0 Imported by Lion Nathan USA. 28
3 Peconic Bay Winery 2001 Riesling (North Fork o... 84 13.0 Review not available at this time. 34
4 Mount Baker Vineyards 2006 Barrel Select Sangi... 82 16.0 Very light, could almost be a rosé. 35
... ... ... ... ... ...
103722 René Muré 2015 Clos Saint Landelin Vorbourg Gr... 97 50.0 The heady aromatic scent of fresh tangerine pe... 698
103723 Domaine Marcel Deiss 2009 Altenberg de Berghei... 95 66.0 Lifted notes of dried pear, dried chamomile fl... 699
103724 De Toren 2014 Book 17 XVII Red (Stellenbosch) 95 330.0 Only 95 cases were made of this Bordeaux-style... 723
103725 Domaine Ostertag 2015 Muenchberg Grand Cru Rie... 97 66.0 There is something incredibly fruity and simul... 753
103726 Saggi 2007 Red (Columbia Valley (WA)) 91 45.0 Dark, dusty, strongly scented with barrel toas... 829

103727 rows × 5 columns

7.6.7. Data Aggregation#

If there are columns in the data output that have repeated values, then each distinct value forms a group in the data. Data aggregation is the process of collapsing the data to one row for each group, while summarizing other columns by taking the within-group mean, sum, count, or another statistic.

Aggregating data requires more attention to the ordering of the clauses in an SQL query than is necessary with other tasks. That is, certain clauses must be entered into the query in a particular order. An SQL query that aggregates data should follow this template:

SELECT aggregationfunctions FROM table
(any joins happen here)
(filtering rows with the WHERE clause happens here)
GROUP BY groupingcolumns
HAVING (a logical condition involving aggregation functions)
(sorting with ORDER BY happens here);

Let’s break down this template line by line. First,

  • SELECT aggregationfunctions FROM table

Aggregation functions work like the arithmetic functions described above. The difference is that instead of working with a single value, like SQRT() and POWER() do, aggregation functions work with vectors of data and generate summary statistics. They describe how existing columns should be summarized when the data are collapsed. The aggregation functions are:

  • COUNT(*) - an overall count of the number of rows within each group

  • COUNT(a) - a count of the number of non-missing observations of column a within each group

  • COUNT(DISTINCT a) - a count of the number of distinct observations of column a within each group

  • AVG(a) - the mean of the values of a within each group

  • SUM(a) - the sum of the values of a within each group

  • MAX(a)- the maximum value of a within each group

  • MIN(a)- the minimum value of a within each group

  • VARIANCE(a) and VAR_SAMP(a) - the population and sample variances, respectively, of the values of a within each group

  • STDDEV(a) and STDDEV_SAMP(a) - the population and sample standard deviations, respectively, of the values of a within each group

Additional statistics, like the median, mode, and various quantiles are not included in standard SQL but are available in extensions that are specific to a DBMS, such as the quantile extension for PostgreSQL.

One important point about the aggregation functions is that, with the exception of COUNT(), they ignore NULL values. Suppose that we have a data vector with values (1,3,8,NULL). Because we do not know the value of the fourth value, we cannot calculate the true sum and true mean of the values. However the SUM() function in SQL ignores the NULL value and reports the sum as 12, which makes a strong tacit assumption that the NULL value is equal to 0. The AVG() function calculates the mean from the observed values, and reports (1+3+8)/3 = 4, but this too makes a strong assumption that the NULL value is exactly 4. There are situations in which calculations from the non-NULL values are appropriate, but it is not correct to make broad claims from these summary statistics when there are missing values in the columns being summarized.

The next two lines in the template are placeholders for the syntax we use to join data tables and the syntax we use to filter rows with the WHERE clause. There is a great deal of similarity between WHERE and HAVING, which we will discuss shortly.

The fourth line in the template,

  • GROUP BY groupingcolumns

is the key line for activating the aggregation functionality of SQL. groupingcolumns can include one or more columns. If one column is listed, the unique values of that column define the groups that will comprise the rows of the output data. If there is more than one column listed, the unique combinations of values from the columns define the groups.

The fifth line in the template uses the HAVING clause. HAVING is very similar to WHERE in that both use logical conditions to identify a selection of the rows to include in the output data. The difference between WHERE and HAVING is that WHERE operates on rows in the original data prior to aggregation, and HAVING works on rows after aggregation has occurred. One limitation of HAVING is that it will not recognize new column names defined in SELECT, so the same aggregation functions used in SELECT need to be used again in the logical conditions for HAVING. Finally, if we want to sort, we can include the ORDER BY clause last.

For example, let’s find out which country produces wines with the highest average score. To do that, we need a query that joins reviews and locations together, includes country name, the average score across wines from that country, and for good measure, a count of the number of wines reviewed from that country. To collapse on country, we can use GROUP BY, and to produce the average score and the count of wines we use the AVG() and COUNT() functions. For presentation purposes, I choose to round the average score to one decimal and to sort the rows from the highest to lowest average score, so that we can immediately see which countries produce the highest-rated wines. The query is:

myquery = """
SELECT l.country,
    ROUND(AVG(points),1) as average_points,
    COUNT(*) as numberofwines
FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
GROUP BY l.country
ORDER BY average_points DESC;
"""
pd.read_sql_query(myquery, con=engine)
country average_points numberofwines
0 England 91.6 74
1 India 90.2 9
2 Austria 90.1 3337
3 Germany 89.9 2134
4 Canada 89.4 256
5 Hungary 89.2 145
6 China 89.0 1
7 US 89.0 37730
8 France 88.9 21828
9 Italy 88.8 11042
10 Australia 88.8 2037
11 Luxembourg 88.7 6
12 None 88.6 63
13 Morocco 88.6 28
14 Switzerland 88.6 7
15 Israel 88.5 500
16 New Zealand 88.3 1311
17 South Africa 88.2 1328
18 Portugal 88.2 5686
19 Slovenia 88.1 87
20 Turkey 88.1 90
21 Bulgaria 87.9 141
22 Georgia 87.7 86
23 Lebanon 87.7 35
24 Armenia 87.5 2
25 Serbia 87.5 12
26 Czech Republic 87.3 12
27 Greece 87.3 466
28 Spain 87.3 6581
29 Moldova 87.2 59
30 Croatia 87.2 73
31 Cyprus 87.2 11
32 Slovakia 87.0 1
33 Uruguay 86.8 109
34 Macedonia 86.8 12
35 Argentina 86.7 3797
36 Bosnia and Herzegovina 86.5 2
37 Chile 86.5 4361
38 Romania 86.4 120
39 Mexico 85.3 65
40 Brazil 84.7 52
41 Ukraine 84.1 14
42 Egypt 84.0 1
43 Peru 83.6 16

So the country that produces the best wine is … England? Wait, that can’t be right. Looking at the results, I see that some of the countries only have a small number of wines reviewed. China, for example, only has one wine review in the database, so we definitely should not put as much confidence in China’s mean score as we can for countries with many more reviews like France and the U.S. For a more fair comparison, let’s restrict the output to only those countries with at least 500 wines in the database. To filter rows on this condition, we use HAVING and not WHERE because the condition involves an aggregation function - specifically the count of the number of wines per country. We can rerun the previous query, including the HAVING clause:

myquery = """
SELECT l.country,
    ROUND(AVG(points),1) as average_points,
    COUNT(*) as numberofwines
FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
GROUP BY l.country
    HAVING COUNT(*) >= 500
ORDER BY average_points DESC;
"""
pd.read_sql_query(myquery, con=engine)
country average_points numberofwines
0 Austria 90.1 3337
1 Germany 89.9 2134
2 US 89.0 37730
3 France 88.9 21828
4 Italy 88.8 11042
5 Australia 88.8 2037
6 Israel 88.5 500
7 New Zealand 88.3 1311
8 Portugal 88.2 5686
9 South Africa 88.2 1328
10 Spain 87.3 6581
11 Argentina 86.7 3797
12 Chile 86.5 4361

Suppose we were interested in ranking the countries according to their scores for a particular type of wine. To filter rows from the original data, we use a WHERE clause prior to GROUP BY. If we write WHERE r.variety = 'Riesling' prior to GROUP BY, the DBMS first extracts only the rows from the reviews table that refer to Riesling wines, then proceeds with the rest of the query. The following code ranks the countries based on their average scores for Rieslings, given at least 100 Rieslings from that country in the database:

myquery = """
SELECT l.country,
    ROUND(AVG(points),1) as average_points,
    COUNT(*) as numberofwines
FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
WHERE r.variety = 'Riesling'
GROUP BY l.country
    HAVING COUNT(*) >= 100
ORDER BY average_points DESC;
"""
pd.read_sql_query(myquery, con=engine)
country average_points numberofwines
0 Austria 91.4 581
1 France 90.5 691
2 Germany 90.1 1768
3 Australia 89.4 111
4 US 88.1 1600

Sometimes the groups in the data are formed by more than one column. In that case, simply add the second column name to the GROUP BY clause. For example, if we wanted to know the top rated combination of country and variety (with a minimum of 50 wines for that combination), we can use the following code:

myquery = """
SELECT l.country, r.variety,
    ROUND(AVG(points),1) as average_points,
    COUNT(*) as numberofwines
FROM reviews r
INNER JOIN locations l
    ON r.location_id = l.location_id
GROUP BY l.country, r.variety
    HAVING COUNT(*) >= 50
ORDER BY average_points DESC;
"""
pd.read_sql_query(myquery, con=engine)
country variety average_points numberofwines
0 France Tannat 91.5 59
1 Austria Riesling 91.4 581
2 South Africa Bordeaux-style Red Blend 90.6 84
3 France Riesling 90.5 691
4 Austria Chardonnay 90.4 62
... ... ... ... ...
190 Argentina Chardonnay 84.9 295
191 Spain Rosé 84.9 149
192 Portugal Rosé 84.6 235
193 Spain Rosado 84.6 71
194 Argentina Sauvignon Blanc 84.3 78

195 rows × 4 columns

7.6.8. Subqueries#

There are situations in which it makes sense to use data aggregation techniques to generate new columns filled with group-level summary statistics, then to place these summary statistics into the original data table. To do this work, we can use subqueries. A subquery is a full SQL query that is used inside another SQL query. There are three types of subquery:

  1. Subqueries, like the ones for the mean and standard deviation above, that yield a single datapoint. These subqueries can be used anywhere we might write a value in the query, such as when defining new columns in SELECT or filtering rows with WHERE or HAVING.

  2. Subqueries that yield a list of values that can be used inside logical statements that include the IN operator.

  3. Subqueries that yield a data table that can be joined to existing data tables.

Suppose for example that we wanted to generate a Z-score standardized version of the wine review points. A Z-score subtracts the mean of a column from every value in the column, then divides every value by the standard deviation of the column. When a Z-score equals 1, it means that the original value is one standard deviation above the mean of the column. To calculate the Z-score, we need to calculate the mean of the points column,

myquery = """
SELECT AVG(points) FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
avg
0 88.612107

and the sample standard deviation of the points column,

myquery = """
SELECT STDDEV_SAMP(points) FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
stddev_samp
0 2.955039

We can type these values in manually with a query that looks like:

SELECT title, variety, points,
(points - 88.612107))/(2.955039) as points_z 
FROM reviews;

The problem with typing these values in by hand is that it is easy to make a mistake and accidentially corrupt the points_z column because of a typo. Also these values will have to be changed by hand every time the data inside the database is updated. A better solution is to have SQL do the work of calculating the mean and standard deviation for us by using subqueries. All we need to do is replace the values with the queries (contained in parentheses) that generate those single values. In this case, the query is

myquery = """
SELECT title, variety, points,
(points - (SELECT AVG(points) FROM reviews))/(SELECT STDDEV(points) FROM reviews) as points_z 
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title variety points points_z
0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc 86 -0.88395
1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo 86 -0.88395
2 Maurice Ecard 2009 Bourgogne Chardonnay 86 -0.88395
3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling 86 -0.88395
4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir 86 -0.88395
... ... ... ... ...
103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer 86 -0.88395
103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne 86 -0.88395
103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto 86 -0.88395
103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend 86 -0.88395
103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah 86 -0.88395

103727 rows × 4 columns

Suppose that we wanted to restrict the rows to only the wines from wineries with at least 100 wines in the data. The problem is we don’t know which wineries have at least 100 reviewed wines. But we can find out with a query:

myquery = """
SELECT winery_id FROM reviews
GROUP BY winery_id
    HAVING COUNT(*) >= 100;
"""
pd.read_sql_query(myquery, con=engine)
winery_id
0 9112
1 4562
2 9557
3 13540
4 13381
5 1547
6 4257
7 13118
8 2007
9 7060
10 224
11 9995
12 8022
13 5076
14 4586
15 2375
16 14401
17 12042
18 9111
19 4124

This query gives us a list of winery ID numbers that match the wineries with at least 100 reviewed wines in the data. We can now use this list inside another query that restricts the reviews data to only the wines from this list of wineries:

myquery = """
SELECT winery_id, title, variety, points, price FROM reviews
WHERE winery_id IN (
    SELECT winery_id FROM reviews r
    GROUP BY winery_id
        HAVING COUNT(*) >= 100
    );
"""
pd.read_sql_query(myquery, con=engine)
winery_id title variety points price
0 14401 Wines & Winemakers 2012 Pegos Claros Colheita ... Castelão 87 15.0
1 14401 Wines & Winemakers 2013 Lua Cheia em Vinhas Ve... Portuguese Red 87 18.0
2 14401 Wines & Winemakers 2013 Lua Cheia em Vinhas Ve... Portuguese Red 87 12.0
3 4124 Chateau Ste. Michelle 2008 Syrah (Columbia Val... Syrah 87 13.0
4 4586 Concha y Toro 2010 Gravas del Maipo Syrah (Mai... Syrah 91 200.0
... ... ... ... ... ...
2742 13381 Trapiche 2016 Pure Malbec (Uco Valley) Malbec 88 15.0
2743 9111 Louis Jadot 2005 La Dominode Premier Cru (Sav... Pinot Noir 90 37.0
2744 9995 Montes 2009 Limited Selection Pinot Noir (Casa... Pinot Noir 89 20.0
2745 4124 Chateau Ste. Michelle 2012 Canoe Ridge Estate ... Chardonnay 89 22.0
2746 14401 Wines & Winemakers 2015 Nostalgia Alvarinho (V... Alvarinho 88 23.0

2747 rows × 5 columns

Suppose we wanted a data table with the top rated wine from each winery. The problem is we don’t know which wine is the top rated for each winery. Again, we can find out with a query that groups the reviews data by winery ID and uses the MAX() aggregation function to identify the maximum score achieved for any wine from that winery:

myquery = """
SELECT winery_id, MAX(points) as maxpoints
FROM reviews
GROUP BY winery_id;
"""
pd.read_sql_query(myquery, con=engine)
winery_id maxpoints
0 11233 91
1 4790 95
2 3936 87
3 12502 87
4 5468 92
... ... ...
14566 4035 89
14567 9180 92
14568 4827 94
14569 790 83
14570 10896 91

14571 rows × 2 columns

Suppose that this last table already existed in the database with the name “bestscores”. We would be able to join bestscores and reviews, then filter the rows to only those wines whose scores are equal to the maximum scores achieved by the winery with the following code:

SELECT r.title, r.variety, r.points, r.price FROM reviews r
INNER JOIN bestscores b
    ON r.winery_id = b.winery_id
WHERE r.points = b.maxpoints;

But because we do not have a table named “bestscores”, we can replace the reference to this table with the subquery that generates this table:

myquery = """
SELECT r.title, r.variety, r.points, r.price FROM reviews r
INNER JOIN (
    SELECT winery_id, MAX(points) as maxpoints
    FROM reviews
    GROUP BY winery_id) b
    ON r.winery_id = b.winery_id
WHERE r.points = b.maxpoints;
"""
pd.read_sql_query(myquery, con=engine)
title variety points price
0 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo 86 24.0
1 Alessandro Veglio 2011 Gattera (Barolo) Nebbiolo 87 NaN
2 Pingao 2013 Rioja Tempranillo 87 13.0
3 Chateau Walla Walla 2008 Syrah (Walla Walla Va... Syrah 87 40.0
4 Sweet Valley 2008 Cabernet Sauvignon (Walla Wa... Cabernet Sauvignon 87 35.0
... ... ... ... ...
20643 Kicker Cane 2014 Cabernet Sauvignon (Alexander... Cabernet Sauvignon 88 20.0
20644 Tenuta Grimani 2015 Farinaldo (Soave) Garganega 88 NaN
20645 Vin Vault NV Cabernet Sauvignon (California) Cabernet Sauvignon 88 20.0
20646 Dachshund NV Bubbles Sparkling (Germany) Sparkling Blend 88 17.0
20647 Domaine Guillot-Broux 2009 Beaumont (Mâcon-Cr... Pinot Noir 88 NaN

20648 rows × 4 columns

Finally, once we are finished working with the PostgreSQL server, we close it:

dbserver.close()

7.7. SQL Security#

One criticism of SQL is that it can be manipulated to give unauthorized and hostile users access to perform CRUD operations on the data. This kind of attack is called an SQL injection attack, and the best illustration of this attack comes from an XKCD web comic:

The following discussion borrows heavily from this blog’s explaination of this XKCD comic.

An SQL injection attack starts with an entry field on the user interface of an application that works with a database, like a field where users write their name. The application reads the name and creates an SQL INSERT operation to create a new record in the data with the name. If I were to enter Jonathan into the name field, the app should generate an SQL command that looks like this:

INSERT INTO Students (firstname) VALUES ('Jonathan');

This command specifically places the value “Jonathan” into the firstname attribute of the Students entity. In SQL, different commands are separated by semicolons, so if I wanted to issue two SQL commands I could type:

INSERT INTO Students (firstname) VALUES ('Jonathan'); INSERT INTO Students (lastname) VALUES ('Kropko');

An SQL injection attack works by writing SQL code in a field that is designed to collect data to input into a database. So If I type my name as Jonathan'); DROP TABLE Students; --;, then the SQL create operation becomes

INSERT INTO Students (firstname) VALUES ('Jonathan'); DROP TABLE Students; --;');

This line consists of three commands

  • INSERT INTO Students (firstname) VALUES ('Jonathan'); which inputs “Jonathan” into the database,

  • DROP TABLE Students; which deletes the entire Students table, and

  • --;');: the -- symbol is an SQL comment, and tells the parser to ignore the remainder of the code, which would avoid a parsing error.

So just by inserting specific code into a seemingly innocuous field, like name, I can delete the entire Students entity in the database.

There are two ways to combat SQL injection attacks. First, it is possible to “sanitize” database inputs by using code that automatically places a slash before a single quote. That puts an escape character in front of the quote, which makes it part of the input string and prevents it from being read as the end of the input string. Another approach is to use prepared statements when converting user-entered data into an SQL query. A prepared statement uses placeholders to stand in for the user-supplied data, and treats the data like input into a function: treating the user data this way prevents the entire SQL query from being read as a single string, and prevents SQL injection. For example, instead of inputing the name directly into the query, the database manager can construct the query in Python code (where a database cursor exists and is named curs) like this:

cmd = "INSERT INTO Students (firstname) VALUES (%s)"
curs.execute(cmd, (name,))

In MySQL and PostgreSQL, %s stands in for a parameter to be passed into the query (in SQlite, the stand-in symbol is ? instead of %s). Constructing a query in this way prevents SQL injection attacks. More information about formatting secure SQL code is available at https://bobby-tables.com/, named in honor of this XKCD comic.

As a data scientist mostly issuing read operations, it is unethical for you to attack a database in this way. If you are testing whether a database is secured against SQL injection attacks, don’t try to issue any DROP commands as other commands like SELECT will reveal the insecurity but won’t make changes in the database. If you are building a database that is connected to an interface for users to enter data, please be aware of the SQL injection vulnerability and use prepared statements to guard against it.

7.8. MongoDB Queries#

SQL is a universal language for issuing queries to relational databases, whether the database is managed by SQLite, MySQL, PostgreSQL, or another RDBMS. For NoSQL databases, however, there is no universal query language. Every DBMS has its own query language, and will provide a guide for learning that language. Some of these guides include the ones for key-value stores in Redis and wide column stores in Cassandra. Neo4j has developed a programming language called Cypher that is explicitly for issuing queries to graph databases. Of all of these query protocols, the language used by MongoDB for issuing queries to document stores is one of the most universal because it works entirely with JSONs: queries are written in JSON format and the output is organized in JSON format. All of these query languages include methods for all of the CRUD operations.

The most important difference between relational and NoSQL databases is the rigidity of the schema that organizes the data. The advantage of the strict organization of a relational database, as illustrated in an ER diagram, is that the data that can be extracted from the database using an SQL query will be clean and mostly immediately ready to be analyzed. The disadvantage is that relational databases have schema that are hard to change once they’ve been created and populated with data. SQL also, despite the best intentions of the originators of SQL, can be very difficult for people use for some tasks. For extremely large datasets with many tables, it can be extremely difficult to keep track of what data exists in which table. In contrast, NoSQL databases generally have flexible schema that can be changed easily and can vary even from record to record. There are no rules, like the normalization rules, that require that the data be split into different tables, so there is no need for visual maps like ER diagrams. Also, because all of the data for one record exists in the same JSON dictionary, it is easy to use remote, distributed storage to store all of the records. The disadvantage of NoSQL databases is that the data are rarely ready for analysis after a query. It’s a buy-now-pay-later situation: the price we pay for the convenience of NoSQL storage and organization is that the output requires more work to use.

Some concepts that are crucial to SQL are not relevant to NoSQL. There are no joins in a document store because all the data for a record exist in the same JSON code. As such, we don’t have to worry about accomplishing these tasks within a NoSQL query. NoSQL queries in general focus narrowly on the CRUD operations, although MongoDB provides some advanced functionality for searching for patterns within text and ranking documents based on their relevance to given search terms.

For the following examples, I will use the document store database that we created in module 6, containing the same data on wine reviews that we practiced with above, only in JSON format. First I load the pymongo package and the dumps() and loads() functions from the json_util module of the bson package:

import pymongo
from bson.json_util import dumps, loads

The wine reviews database is stored as a collection winecollection within the winedb database on my local machine. I load it with the following code:

myclient = pymongo.MongoClient("mongodb://localhost/")
winedb = myclient["winedb"]
winecollection = winedb["winecollection"]

We will discuss more advanced read techniques below, but to see one record, we can issue a query using JSON code and we can see the output in JSON format. To see all of the data for the “Nicosia 2013 Vulkà Bianco”, we search for the record based on the title of this wine with the following code:

myquery = { 'title': 'Nicosia 2013 Vulkà Bianco  (Etna)'}
mywine = winecollection.find(myquery) 
for x in mywine:
    print(x)
{'_id': ObjectId('5ed80dbca25fcf746119e3aa'), 'wine_id': 0, 'country': 'Italy', 'description': "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.", 'points': 87, 'price': None, 'province': 'Sicily & Sardinia', 'region': 'Etna', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'title': 'Nicosia 2013 Vulkà Bianco  (Etna)', 'variety': 'White Blend', 'winery': 'Nicosia'}

Notice that all of the data for this wine exists in this JSON dictionary, including data from the reviews, locations, wineries, and tasters tables in the PostgreSQL database. When we created this MongoDB database, the DBMS automatically created a unique ID for each record designated with the key _id.

We can now use the methods in pymongo for creating, reading, updating, and deleting records and we will apply these methods to the winecollection variable that accesses the data.

7.8.1. Creating and Deleting Records#

Did you know that former NBA all-star Dwyane Wade has a winery? It’s called D Wade Cellars and it is based in the Napa Valley in California. Let’s add the 2016 Napa Valley Three By Wade Red Blend into the database. The first step is to express all of the data we want to associate with a new record in JSON format:

dwadewine = {'title': '2016 Napa Valley Three By Wade Red Blend', 
'description': "This wine goes great with dinner just like Dwyane Wade goes great with LeBron James or Shaq.", 
'taster_name': 'Jonathan Kropko', 
'taster_twitter_handle': '@jmk5131', 
'price': '35', 
'variety': 'Red Blend', 
'location':{
    'region_1': 'Napa Valley', 
    'region_2': None, 
    'province': 'California', 
    'country': 'U.S.', 
    'winery': 'D Wade Cellars'}}

In creating this JSON record, I tried to follow the standards that exist elsewhere in the data by using the same feature names. I departed from the format of other records in two ways. First, I omitted the points and designation features. Second, I placed all the information about the location and name of the winery under the “location” key, which induces some nesting structure.

To add this one record to the database, I use the .insert_one() method on the winecollection database, with the following code:

winecollection.insert_one(dwadewine)

By default, the insert_one() method automatically checks to see whether the record already exists in the data, and throws an error if it does, unless we specify the bypass_document_validation=True argument, which allows duplicate records to be input into the database. For the purposes of this notebook, I rerun these cells many times while writing, and I don’t want to place many duplicate records into the database. Instead, I can delete the record if it already exists. The code

winecollection.count_documents({'title': '2016 Napa Valley Three By Wade Red Blend'})

generates a count of the records of wines in the database that have this title. If there are any existing records, I can delete all of these records with the .delete_many() method, in which the argument is a JSON with enough fields specified to exactly match the records we want to delete:

winecollection.delete_many({'title': '2016 Napa Valley Three By Wade Red Blend'})

In constrast, the .delete_one() method will only delete the first record, when sorting by _id, that matches the query. If there are no documents that match the query, he .delete_all() or .delete_one() methods will both still process the query without error, but will not change anything in the database.

We first delete any records of wines with the title “2016 Napa Valley Three By Wade Red Blend” with .delete_all(), then we insert the entire record of this wine with .insert_one():

winecollection.delete_many({'title': '2016 Napa Valley Three By Wade Red Blend'})
winecollection.insert_one(dwadewine)
<pymongo.results.InsertOneResult at 0x12b264a48>

Now that this record exists in the database, we can find this record by any of the fields associated with the record, such as the title of the wine for example:

myquery = {'title': '2016 Napa Valley Three By Wade Red Blend'}
mywine = winecollection.find(myquery) 
for x in mywine:
    print(x)
{'_id': ObjectId('5edd56e5b4e58ce3841e5dea'), 'title': '2016 Napa Valley Three By Wade Red Blend', 'description': 'This wine goes great with dinner just like Dwyane Wade goes great with LeBron James or Shaq.', 'taster_name': 'Jonathan Kropko', 'taster_twitter_handle': '@jmk5131', 'price': '35', 'variety': 'Red Blend', 'location': {'region_1': 'Napa Valley', 'region_2': None, 'province': 'California', 'country': 'U.S.', 'winery': 'D Wade Cellars'}}

Note that MongoDB automatically generates a unique ID value for this document and includes it under the _id field in the JSON output.

Wikipedia lists 93 other celebrities other than Dwyane Wade who own wineries and vineyards, including Antonio Banderas, Drew Barrymore, and Lil Jon. If we want to add more than one record to the wine collection database, we need to create a list of individual JSON dictionaries with code that looks like

newrecords = [{JSON dictionary 1}, {JSON dictionary 2}, {JSON dictionary 3}]

In this case, I can create entries for Antonio Banderas, Drew Barrymore, and Lil Jon’s wines and store them in one list:

newwines = [{'title': 'Anta Banderas A 10 2008', 
             'description': "This wine will make you speak differently. Maybe not with a charming Spanish accent, but you might think you sound that way.", 
             'taster_name': 'Jonathan Kropko', 
             'taster_twitter_handle': '@jmk5131', 
             'price': '40.99', 
             'variety': 'Red Blend', 
             'location':{
                 'region_1': 'Ribera del Duoro', 
                 'region_2': None, 
                 'province': 'Valladolid', 
                 'country': 'Spain', 
                 'winery': 'Anta Banderas'}},
           {'title': 'Barrymore Rose 2013', 
             'description': "Someone drank my entire bottle of wine!", 
             'taster_name': 'Jonathan Kropko', 
             'taster_twitter_handle': '@jmk5131', 
             'price': '14.99', 
             'variety': 'Rose', 
             'location':{
                 'region_1': 'Monterey', 
                 'region_2': None, 
                 'province': 'California', 
                 'country': 'U.S.', 
                 'winery': 'Barrymore Vineyard'}},
           {'title': '2006 Little Jonathan Winery Cabernet Sauvignon', 
             'description': "This upscale crunk juice is OOOKAAAAAAY.", 
             'taster_name': 'Jonathan Kropko', 
             'taster_twitter_handle': '@jmk5131',  
             'variety': 'Cabernet Sauvignon', 
             'location':{
                 'region_1': 'Central Coast', 
                 'region_2': 'Paso Robles', 
                 'province': 'California', 
                 'country': 'U.S.', 
                 'winery': 'Little Jonathan Winery'}}]

To add these three records to the database with one line of code, we use the .insert_many() method. To avoid duplicates, we first delete any records of wines titled “Anta Banderas A 10 2008”, “Barrymore Rose 2013”, or “2006 Little Jonathan Winery Cabernet Sauvignon”:

winecollection.delete_many({'title': 'Anta Banderas A 10 2008'})
winecollection.delete_many({'title': 'Barrymore Rose 2013'})
winecollection.delete_many({'title': '2006 Little Jonathan Winery Cabernet Sauvignon'})
winecollection.insert_many(newwines)
<pymongo.results.InsertManyResult at 0x1268132c8>

7.8.2. Reading Data and Selecting Records#

To read all of the records in a MongoDB collection, use the .find() method and pass an empty JSON dictionary to this method. For the wine reviews collection, we can query the entire collection by typing

myquery = winecollection.find({})

Here, the queried data exist within the variable cursor. The data are not displayed automatically. To see the data in JSON format, we can employ the print() function on elements of the cursor. To see the first element:

print(myquery[0])
{'_id': ObjectId('5ed80dbca25fcf746119e3aa'), 'wine_id': 0, 'country': 'Italy', 'description': "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.", 'points': 87, 'price': None, 'province': 'Sicily & Sardinia', 'region': 'Etna', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'title': 'Nicosia 2013 Vulkà Bianco  (Etna)', 'variety': 'White Blend', 'winery': 'Nicosia'}

And to see more elements, we can use a loop. Here’s code to view the first three wines:

for i in myquery[0:3]:
    print(i)
{'_id': ObjectId('5ed80dbca25fcf746119e3aa'), 'wine_id': 0, 'country': 'Italy', 'description': "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.", 'points': 87, 'price': None, 'province': 'Sicily & Sardinia', 'region': 'Etna', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'title': 'Nicosia 2013 Vulkà Bianco  (Etna)', 'variety': 'White Blend', 'winery': 'Nicosia'}
{'_id': ObjectId('5ed80dcca25fcf746119e3ab'), 'wine_id': 1, 'country': 'Portugal', 'description': "This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's  already drinkable, although it will certainly be better from 2016.", 'points': 87, 'price': 15.0, 'province': 'Douro', 'region': None, 'taster_name': 'Roger Voss', 'taster_twitter_handle': '@vossroger', 'title': 'Quinta dos Avidagos 2011 Avidagos Red (Douro)', 'variety': 'Portuguese Red', 'winery': 'Quinta dos Avidagos'}
{'_id': ObjectId('5ed80dcca25fcf746119e3ac'), 'wine_id': 2, 'country': 'US', 'description': 'Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.', 'points': 87, 'price': 14.0, 'province': 'Oregon', 'region': 'Willamette Valley', 'taster_name': 'Paul Gregutt', 'taster_twitter_handle': '@paulgwine\xa0', 'title': 'Rainstorm 2013 Pinot Gris (Willamette Valley)', 'variety': 'Pinot Gris', 'winery': 'Rainstorm'}

Displaying the query output data as a list of JSON dictionaries, however, is not the most useful way to store the data. We need a way to put these data into a dataframe. For that we can use the dumps() and loads() functions from the bson library. These functions work exactly like the dumps() and loads() functions from the json library, but they remove some of the extra components of these JSON dictionaries associated with the database. To query all of the data and to place all of it into a dataframe, we pass the query output to dumps(), which converts the query output to plain text. Next we pass this text to loads(), which registers the text as a list of JSON dictionaries. Finally we use this list as the argument of pd.DataFrame.from_records() to convert the output to a dataframe. For the wine collection, this code is:

myquery = winecollection.find({})
wine_text = dumps(myquery)
wine_records = loads(wine_text)
wine_df = pd.DataFrame.from_records(wine_records)
wine_df
_id wine_id country description points price province region taster_name taster_twitter_handle title variety winery location
0 5ed80dbca25fcf746119e3aa 0.0 Italy Aromas include tropical fruit, broom, brimston... 87.0 None Sicily & Sardinia Etna Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia NaN
1 5ed80dcca25fcf746119e3ab 1.0 Portugal This is ripe and fruity, a wine that is smooth... 87.0 15 Douro None Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos NaN
2 5ed80dcca25fcf746119e3ac 2.0 US Tart and snappy, the flavors of lime flesh and... 87.0 14 Oregon Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm NaN
3 5ed80dcca25fcf746119e3ad 3.0 US Pineapple rind, lemon pith and orange blossom ... 87.0 13 Michigan Lake Michigan Shore Alexander Peartree None St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian NaN
4 5ed80dcca25fcf746119e3ae 4.0 US Much like the regular bottling from 2012, this... 87.0 65 Oregon Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
103726 5ed80dcfa25fcf74611b78d8 129970.0 France Big, rich and off-dry, this is powered by inte... 90.0 21 Alsace Alsace Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit NaN
103727 5edd56e5b4e58ce3841e5dea NaN NaN This wine goes great with dinner just like Dwy... NaN 35 NaN NaN Jonathan Kropko @jmk5131 2016 Napa Valley Three By Wade Red Blend Red Blend NaN {'region_1': 'Napa Valley', 'region_2': None, ...
103728 5edd56e6b4e58ce3841e5deb NaN NaN This wine will make you speak differently. May... NaN 40.99 NaN NaN Jonathan Kropko @jmk5131 Anta Banderas A 10 2008 Red Blend NaN {'region_1': 'Ribera del Duoro', 'region_2': N...
103729 5edd56e6b4e58ce3841e5dec NaN NaN Someone drank my entire bottle of wine! NaN 14.99 NaN NaN Jonathan Kropko @jmk5131 Barrymore Rose 2013 Rose NaN {'region_1': 'Monterey', 'region_2': None, 'pr...
103730 5edd56e6b4e58ce3841e5ded NaN NaN This upscale crunk juice is OOOKAAAAAAY. NaN NaN NaN NaN Jonathan Kropko @jmk5131 2006 Little Jonathan Winery Cabernet Sauvignon Cabernet Sauvignon NaN {'region_1': 'Central Coast', 'region_2': 'Pas...

103731 rows × 14 columns

Like SQL, read operations in MongoDB can filter records based on logical conditions. Unlike SQL, MongoDB uses different symbols for the common logical operators, and these symbols need to be listed within the JSON formatted query. A couple of the operators are implicit in the JSON syntax. To search on an equality condition, the general syntax is

{'key' : value}

For example, to find all of the wines in the database that are from virginia, we can use the following query:

{'province' : 'Virginia'}

The other implicit operator is “and”, which is expressed simply by including more than one key-value pair within the syntax. To specify that a feature key1 is equal to value1 AND that key2 is equal to value2, type:

{'key1' : value1,
 'key2' : value2}

For example, to filter the data to Pinot Noir wines from Virginia, we can type

{'variety' : 'Pinot Noir',
 'province' : 'Virginia'}

For all other logical operators, MongoDB uses special syntax, described below. To use these operators in a query, the general template is general syntax for using an operator within a MongoDB query is

{'key' : {'$operator' : value } }

The operators are listed in the following table:

Operator

Syntax

Example query

Example code

Equal to

implicit

All wines with scores of 100

{'points': 100}

Greater than

'$gt'

All wines that are more expensive than \$30

{'price': {'$gt': 30}}

Greater than or equal to

'$gte'

All wines with scores of 95 or higher

{'points': {'$gte': 95}}

Less than

'$lt'

All wines that are cheaper than \$20

{'price': {'$lt': 20}}

Less than or equal to

'$lte'

All wines with scores of 85 or lower

{'points': {'$lte': 85}}

Not equal

'$ne'

Wines that are not red blends

{'variety': {'$ne': 'Red Blend'}

And

implicit

All wines with scores of 100 and prices of \$20 or less

{'points': 100, 'price': {'$lte': 20}}

Or

'$or': [{condition1}, {condition2}]

All wines with scores of 100, or prices of \$20 or less

{'$or': [{'points': 100}, {'price: {'$lte': 20}}]}

Exists in a set

'$in': [value1, value2, ...]

All wines from Virginia, Maryland, or North Carolina

{'province': {'$in': ['Virginia', 'Maryland', 'North Carolina']}}

Not in a set

'$nin'

All wines except those from Virginia, Maryland, and North Carolina

{'province': {'$nin': ['Virginia', 'Maryland', 'North Carolina']}}

Use logical conditions that compare two or more keys

{'$expr': <expression>}

All wines whose price is greater than their score

{'$expr': {'$gt': ['$price', '$points']}}

Logical negation (only recommended for use with $text and $regex)

'$not'

All wines whose descriptions do not contain the word “chocolate”, treating capital and lower-case letters the same

{'$not': {'description': {'$text': {'$search': 'chocolate', '$caseSensitive': false}}}}

To quickly see the data that is output by queries that use these operators, I write a function that takes a JSON dictionary as an input, and outputs a pandas dataframe:

def mongo_read_query(col, q):
    qtext = dumps(col.find(q))
    qrec = loads(qtext)
    qdf = pd.DataFrame.from_records(qrec)
    return qdf

To see all wines with a score of 100

myquery = {'points': 100}
mongo_read_query(winecollection, myquery)
_id wine_id country description points price province region taster_name taster_twitter_handle title variety winery
0 5ed80dcca25fcf746119e498 345 Australia This wine contains some material over 100 year... 100 350.0 Victoria Rutherglen Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscat (Ru... Muscat Chambers Rosewood Vineyards
1 5ed80dcea25fcf74611a5511 36528 France This is a fabulous wine from the greatest Cham... 100 259.0 Champagne Champagne Roger Voss @vossroger Krug 2002 Brut (Champagne) Champagne Blend Krug
2 5ed80dcea25fcf74611a66df 42197 Portugal This is the latest release of what has long be... 100 450.0 Douro None Roger Voss @vossroger Casa Ferreirinha 2008 Barca-Velha Red (Douro) Portuguese Red Casa Ferreirinha
3 5ed80dcea25fcf74611a7214 45781 Italy This gorgeous, fragrant wine opens with classi... 100 550.0 Tuscany Brunello di Montalcino Kerin O’Keefe @kerinokeefe Biondi Santi 2010 Riserva (Brunello di Montal... Sangiovese Biondi Santi
4 5ed80dcea25fcf74611a9877 58352 France This is a magnificently solid wine, initially ... 100 150.0 Bordeaux Saint-Julien Roger Voss @vossroger Château Léoville Barton 2010 Saint-Julien Bordeaux-style Red Blend Château Léoville Barton
5 5ed80dcfa25fcf74611afacd 89728 France This latest incarnation of the famous brand is... 100 250.0 Champagne Champagne Roger Voss @vossroger Louis Roederer 2008 Cristal Vintage Brut (Cha... Champagne Blend Louis Roederer
6 5ed80dcfa25fcf74611aface 89729 France This new release from a great vintage for Char... 100 617.0 Champagne Champagne Roger Voss @vossroger Salon 2006 Le Mesnil Blanc de Blancs Brut Char... Chardonnay Salon
7 5ed80dcfa25fcf74611b3fc4 111753 France Almost black in color, this stunning wine is g... 100 1500.0 Bordeaux Pauillac Roger Voss @vossroger Château Lafite Rothschild 2010 Pauillac Bordeaux-style Red Blend Château Lafite Rothschild
8 5ed80dcfa25fcf74611b3fc5 111755 France This is the finest Cheval Blanc for many years... 100 1500.0 Bordeaux Saint-Émilion Roger Voss @vossroger Château Cheval Blanc 2010 Saint-Émilion Bordeaux-style Red Blend Château Cheval Blanc
9 5ed80dcfa25fcf74611b3fc6 111756 France A hugely powerful wine, full of dark, brooding... 100 359.0 Bordeaux Saint-Julien Roger Voss @vossroger Château Léoville Las Cases 2010 Saint-Julien Bordeaux-style Red Blend Château Léoville Las Cases
10 5ed80dcfa25fcf74611b4648 113929 US In 2005 Charles Smith introduced three high-en... 100 80.0 Washington Columbia Valley (WA) Paul Gregutt @paulgwine Charles Smith 2006 Royal City Syrah (Columbia ... Syrah Charles Smith
11 5ed80dcfa25fcf74611b499d 114972 Portugal A powerful and ripe wine, strongly influenced ... 100 650.0 Port None Roger Voss @vossroger Quinta do Noval 2011 Nacional Vintage (Port) Port Quinta do Noval
12 5ed80dcfa25fcf74611b6300 122935 France Full of ripe fruit, opulent and concentrated, ... 100 848.0 Bordeaux Pessac-Léognan Roger Voss @vossroger Château Haut-Brion 2014 Pessac-Léognan Bordeaux-style White Blend Château Haut-Brion
13 5ed80dcfa25fcf74611b64d4 123545 US Initially a rather subdued Frog; as if it has ... 100 80.0 Washington Walla Walla Valley (WA) Paul Gregutt @paulgwine Cayuse 2008 Bionic Frog Syrah (Walla Walla Val... Syrah Cayuse

To see wines with a score of 100 and a cost of less than \\(100, we can use the `\)lt` operator:

myquery = {'points': 100, 'price': {'$lt': 100}}
mongo_read_query(winecollection, myquery)
_id wine_id country description points price province region taster_name taster_twitter_handle title variety winery
0 5ed80dcfa25fcf74611b4648 113929 US In 2005 Charles Smith introduced three high-en... 100 80.0 Washington Columbia Valley (WA) Paul Gregutt @paulgwine Charles Smith 2006 Royal City Syrah (Columbia ... Syrah Charles Smith
1 5ed80dcfa25fcf74611b64d4 123545 US Initially a rather subdued Frog; as if it has ... 100 80.0 Washington Walla Walla Valley (WA) Paul Gregutt @paulgwine Cayuse 2008 Bionic Frog Syrah (Walla Walla Val... Syrah Cayuse

To see wines that are from Ohio or North Carolina, we use the $in operator:

myquery = {'province': {'$in': ['Ohio','North Carolina']}}
mongo_read_query(winecollection, myquery)
_id wine_id country description points price province region taster_name taster_twitter_handle title variety winery
0 5ed80dcea25fcf74611a0d3f 13402 US Clove and pepper spice the dark red cherry aro... 86 17.0 North Carolina Swan Creek Anna Lee C. Iijima None Raffaldini 2007 Montepulciano (Swan Creek) Montepulciano Raffaldini
1 5ed80dcea25fcf74611a167b 16270 US The nose shows an aroma of blackberry that is ... 86 17.0 North Carolina Yadkin Valley Alexander Peartree None Shadow Springs 2011 Cabernet Franc (Yadkin Val... Cabernet Franc Shadow Springs
2 5ed80dcea25fcf74611a20a2 19566 US Fruits, flowers and spice should lead the nose... 80 15.0 Ohio Ohio Susan Kostrzewa @suskostrzewa Hermes 2006 Estate Bottled Nebbiolo (Ohio) Nebbiolo Hermes
3 5ed80dcea25fcf74611a23ee 20592 US Stewed blackberries and muddled cherries perva... 86 23.0 North Carolina Swan Creek Alexander Peartree None Raffaldini 2012 Riserva Sangiovese (Swan Creek) Sangiovese Raffaldini
4 5ed80dcea25fcf74611a5fc4 39935 US Friendly, appealing flavors fo pear, lychee, a... 84 12.0 Ohio Grand River Valley Susan Kostrzewa @suskostrzewa Debonné 2008 Reserve Riesling (Grand River Val... Riesling Debonné
5 5ed80dcea25fcf74611a686c 42692 US Friendly, appealing flavors fo pear, lychee, a... 84 12.0 Ohio Grand River Valley Susan Kostrzewa @suskostrzewa Debonné 2008 Reserve Riesling (Grand River Val... Riesling Debonné
6 5ed80dcea25fcf74611a7033 45209 US Freshly squeezed lemons, lime and pretty white... 84 16.0 North Carolina Swan Creek Anna Lee C. Iijima None Raffaldini 2009 Pinot Grigio (Swan Creek) Pinot Grigio Raffaldini
7 5ed80dcea25fcf74611a74db 46670 US Black fruit aromas show over toasted vanilla a... 85 29.0 North Carolina Swan Creek Alexander Peartree None Raffaldini 2012 Riserva Montepulciano (Swan Cr... Montepulciano Raffaldini
8 5ed80dcea25fcf74611a7c95 49154 US Lean and racy, with limes and tart green apple... 86 10.0 North Carolina North Carolina Joe Czerwinski @JoeCz Shelton Vineyards 2002 Riesling (North Carolina) Riesling Shelton Vineyards
9 5ed80dcea25fcf74611a8518 52078 US Charred oak, green herbs and vanilla spice not... 82 18.0 North Carolina Yadkin Valley Anna Lee C. Iijima None Divine Llama 2007 In a Heart Beat Red (Yadkin ... R. Blend Divine Llama
10 5ed80dcea25fcf74611a8fe5 55687 US Mostly Sangiovese with a small dose of Petit V... 84 17.0 North Carolina Swan Creek Anna Lee C. Iijima None Raffaldini 2007 Riserva Sangiovese (Swan Creek) Sangiovese Raffaldini
11 5ed80dcea25fcf74611a8fe9 55693 US Aromas of toasted oak, green leaves, vanilla a... 84 25.0 North Carolina North Carolina Anna Lee C. Iijima None RayLen 2006 Eagle's Select Red Wine Red (North... Bordeaux-style Red Blend RayLen
12 5ed80dcfa25fcf74611ac51f 72535 US Who knew Ohio made such tasty Chardonnay? Brig... 87 11.0 Ohio Grand River Valley Anna Lee C. Iijima None Debonné 2009 Chardonnay (Grand River Valley) Chardonnay Debonné
13 5ed80dcfa25fcf74611ae168 81619 US Strawberry and raspberry Kool-Aid aromas are s... 85 21.0 North Carolina Swan Creek Alexander Peartree None Laurel Gray 2012 Estate Grown Cabernet Franc (... Cabernet Franc Laurel Gray
14 5ed80dcfa25fcf74611af839 88841 US This is a vibrant, energetic Chardonnay that s... 84 17.0 Ohio Grand River Valley Susan Kostrzewa @suskostrzewa Debonné 2007 Vintner's Selection Chardonnay (G... Chardonnay Debonné
15 5ed80dcfa25fcf74611b08bb 94267 US Although the nose offers darker notes of petro... 83 15.0 Ohio Grand River Valley Anna Lee C. Iijima None Debonné 2008 Lot 807 Reserve Riesling (Grand R... Riesling Debonné
16 5ed80dcfa25fcf74611b08be 94275 US The nose on this bright red blend from Sanders... 83 18.0 North Carolina Yadkin Valley Anna Lee C. Iijima None Sanders Ridge 2008 Big Woods Red (Yadkin Valley) Bordeaux-style Red Blend Sanders Ridge
17 5ed80dcfa25fcf74611b13ff 97836 US Smoke wafts over pressed apple and lemon notes... 83 13.0 North Carolina North Carolina Anna Lee C. Iijima None RayLen 2009 Riesling (North Carolina) Riesling RayLen
18 5ed80dcfa25fcf74611b1c1f 100445 US Fresh minerality and dancing floral notes make... 84 11.0 Ohio Grand River Valley Susan Kostrzewa @suskostrzewa Debonné 2006 Reserve Riesling (Grand River Val... Riesling Debonné
19 5ed80dcfa25fcf74611b1c23 100452 US A slightly floral but lively nose is followed ... 84 15.0 Ohio Grand River Valley Susan Kostrzewa @suskostrzewa Debonné 2006 Lot 707 Reserve Riesling (Grand R... Riesling Debonné
20 5ed80dcfa25fcf74611b29f1 104722 US There are enticing hints of berries and cream ... 86 15.0 North Carolina North Carolina Anna Lee C. Iijima None Biltmore Estate 2010 Reserve Chardonnay (North... Chardonnay Biltmore Estate
21 5ed80dcfa25fcf74611b4fb6 116912 US Black cherry aromas are dwarfed by notes of wi... 83 24.0 North Carolina Swan Creek Alexander Peartree None Raffaldini 2012 Montepulciano (Swan Creek) Montepulciano Raffaldini
22 5ed80dcfa25fcf74611b55ee 118895 US Bright red fruits achieve a decent amount of r... 84 18.0 North Carolina North Carolina Anna Lee C. Iijima None RayLen 2008 Category 5 Red Wine Red (North Car... R. Blend RayLen

To illustrate the “or” operator, we can query all wines that are either from Virginia, or have a score of 100:

myquery = {'$or': [{'points': 100}, {'province': 'Virginia'}]}
mongo_read_query(winecollection, myquery)
_id wine_id country description points price province region taster_name taster_twitter_handle title variety winery
0 5ed80dcca25fcf746119e3bd 19 US Red fruit aromas pervade on the nose, with cig... 87 32.0 Virginia Virginia Alexander Peartree None Quiévremont 2012 Meritage (Virginia) Meritage Quiévremont
1 5ed80dcca25fcf746119e3be 20 US Ripe aromas of dark berries mingle with ample ... 87 23.0 Virginia Virginia Alexander Peartree None Quiévremont 2012 Vin de Maison Red (Virginia) R. Blend Quiévremont
2 5ed80dcca25fcf746119e498 345 Australia This wine contains some material over 100 year... 100 350.0 Victoria Rutherglen Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscat (Ru... Muscat Chambers Rosewood Vineyards
3 5ed80dcea25fcf746119e8d0 1625 US Popping with aromas of lychee, rose, geranium ... 85 16.0 Virginia Virginia Carrie Dykes None The Williamsburg Winery 2015 A Midsummer Night... White Blend The Williamsburg Winery
4 5ed80dcea25fcf746119e8d6 1631 US Powerful aromas of lychee, mango and peach giv... 85 22.0 Virginia Middleburg Carrie Dykes None Blue Valley 2015 Muskat Ottonel (Middleburg) Muskat Ottonel Blue Valley
... ... ... ... ... ... ... ... ... ... ... ... ... ...
388 5ed80dcfa25fcf74611b71b3 127736 US Dense with alluring aromas, this wine is full ... 88 30.0 Virginia Virginia Carrie Dykes None Early Mountain 2015 Elevation Red (Virginia) R. Blend Early Mountain
389 5ed80dcfa25fcf74611b71bd 127746 US A grape known in Uruguay and Madiran has prove... 88 25.0 Virginia Virginia Carrie Dykes None Horton 2014 Tannat (Virginia) Tannat Horton
390 5ed80dcfa25fcf74611b7447 128576 US Peach and steely lemon aromas carry to a citru... 87 28.0 Virginia Monticello Alexander Peartree None Pollak 2012 Reserve Chardonnay (Monticello) Chardonnay Pollak
391 5ed80dcfa25fcf74611b7708 129422 US The nose of this wine is bursting with raspber... 89 32.0 Virginia Monticello Carrie Dykes None Stinson 2014 Meritage (Monticello) Meritage Stinson
392 5ed80dcfa25fcf74611b772a 129459 US Somehow, winemaker Luca Paschina manages to ma... 87 23.0 Virginia Virginia Carrie Dykes None Barboursville Vineyards 2015 Reserve Vermentin... Vermentino Barboursville Vineyards

393 rows × 13 columns

$expr requires an operator that compares two keys, and creates a sentence like “X is greater than or equal to Y”. Next it requires a list that specifies what X in the sentence should be, then what Y should be. To search for all wines in which the price is greater than the score we use the $expr operator as follows:

myquery = {'$expr': {'$gt': ['$price', '$points']}}
mongo_read_query(winecollection, myquery)
_id wine_id country description points price province region taster_name taster_twitter_handle title variety winery location
0 5ed80dcca25fcf746119e3d4 60.0 US Syrupy and dense, this wine is jammy in plum a... 86.0 100 California Napa Valley Virginie Boone @vboone Okapi 2013 Estate Cabernet Sauvignon (Napa Val... Cabernet Sauvignon Okapi NaN
1 5ed80dcca25fcf746119e42c 168.0 US A fairly elegant expression of the variety, th... 91.0 95 California Napa Valley Virginie Boone @vboone Duckhorn 2012 Rector Creek Vineyard Merlot (Na... Merlot Duckhorn NaN
2 5ed80dcca25fcf746119e476 284.0 Argentina This huge Malbec defines jammy and concentrate... 92.0 215 Mendoza Province Perdriel Michael Schachner @wineschach Viña Cobos 2011 Marchiori Vineyard Block C2 Ma... Malbec Viña Cobos NaN
3 5ed80dcca25fcf746119e498 345.0 Australia This wine contains some material over 100 year... 100.0 350 Victoria Rutherglen Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscat (Ru... Muscat Chambers Rosewood Vineyards NaN
4 5ed80dcca25fcf746119e499 346.0 Australia This deep brown wine smells like a damp, mossy... 98.0 350 Victoria Rutherglen Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscadelle... Muscadelle Chambers Rosewood Vineyards NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3697 5ed80dcfa25fcf74611b78a8 129919.0 US This ripe, rich, almost decadently thick wine ... 91.0 105 Washington Walla Walla Valley (WA) Paul Gregutt @paulgwine Nicholas Cole Cellars 2004 Reserve Red (Walla ... R. Blend Nicholas Cole Cellars NaN
3698 5ed80dcfa25fcf74611b78b2 129931.0 France A powerful, chunky wine, packed with solid tan... 91.0 107 Burgundy Grands-Echezeaux Roger Voss @vossroger Henri de Villamont 2005 Grands-Echezeaux Pinot Noir Henri de Villamont NaN
3699 5edd56e5b4e58ce3841e5dea NaN NaN This wine goes great with dinner just like Dwy... NaN 35 NaN NaN Jonathan Kropko @jmk5131 2016 Napa Valley Three By Wade Red Blend Red Blend NaN {'region_1': 'Napa Valley', 'region_2': None, ...
3700 5edd56e6b4e58ce3841e5deb NaN NaN This wine will make you speak differently. May... NaN 40.99 NaN NaN Jonathan Kropko @jmk5131 Anta Banderas A 10 2008 Red Blend NaN {'region_1': 'Ribera del Duoro', 'region_2': N...
3701 5edd56e6b4e58ce3841e5dec NaN NaN Someone drank my entire bottle of wine! NaN 14.99 NaN NaN Jonathan Kropko @jmk5131 Barrymore Rose 2013 Rose NaN {'region_1': 'Monterey', 'region_2': None, 'pr...

3702 rows × 14 columns

In the previous section, we entered a record for a wine released by former NBA all-star Dwyane Wade, and we purposely included a nested structure in this JSON record. The location key has subkeys region_1, region_2, province, country, and winery:

dwadewine
{'title': '2016 Napa Valley Three By Wade Red Blend',
 'description': 'This wine goes great with dinner just like Dwyane Wade goes great with LeBron James or Shaq.',
 'taster_name': 'Jonathan Kropko',
 'taster_twitter_handle': '@jmk5131',
 'price': '35',
 'variety': 'Red Blend',
 'location': {'region_1': 'Napa Valley',
  'region_2': None,
  'province': 'California',
  'country': 'U.S.',
  'winery': 'D Wade Cellars'},
 '_id': ObjectId('5edd56e5b4e58ce3841e5dea')}

To query a subrecord, use dot notation of the form 'key.subkey' to identify the path to the value you need. To query for the winery name “D Wade Cellars”, we can type:

myquery = {'location.winery': 'D Wade Cellars'}
mongo_read_query(winecollection, myquery)
_id title description taster_name taster_twitter_handle price variety location
0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend This wine goes great with dinner just like Dwy... Jonathan Kropko @jmk5131 35 Red Blend {'region_1': 'Napa Valley', 'region_2': None, ...

7.8.3. Selecting Features#

A read query in MongoDB will return the entire JSON dictionary for every record that matches the query. Sometimes, however, the entirety of the data for one record will be more information than we can feasibly work with. In some situations there might be an unmanagable number of features contained within each dictionary, and we only want to use a couple of these features. In other situations a feature might contain values that are so large that we want to avoid dealing with this feature if possible.

To extract only a selection of the features, add a second JSON clause to the .find() method. The general syntax for selecting features is

db.collection.find({query}, {'feature'=1}}

where {query} is code, as described above, for extracting a selection of the records, and {'feature'=1} instructs MongoDB to include only the field named feature in the output. Alternatively, it is possible to list as many keys in this second clause as we want, so {'feature1'=1, 'feature2'=1} extracts feature1 and feature2. In addition, setting the key equal to 0 instead of 1 instructs MongoDB to extract all features except the one specified with 'feature'=0.

In the wine collection, we can extract only the titles of Merlot wines with the following code:

cursor = winecollection.find({'variety': 'Merlot'}, {'title': 1})
qtext = dumps(cursor)
qrec = loads(qtext)
pd.DataFrame.from_records(qrec)
_id title
0 5ed80dcca25fcf746119e3c1 Bianchi 2011 Signature Selection Merlot (Paso ...
1 5ed80dcca25fcf746119e3cd Sundance 2011 Merlot (Maule Valley)
2 5ed80dcca25fcf746119e3ef Passaggio 2014 Blau Vineyards Merlot (Knights ...
3 5ed80dcca25fcf746119e42c Duckhorn 2012 Rector Creek Vineyard Merlot (Na...
4 5ed80dcca25fcf746119e437 Viña Bisquertt 2007 Casa La Joya Reserve Merlo...
... ... ...
2094 5ed80dcfa25fcf74611b77f0 Castillo de Monjardin 2009 Deyo Merlot (Navarra)
2095 5ed80dcfa25fcf74611b7862 Bonair 2006 Chateau Puryear Vineyard Merlot (R...
2096 5ed80dcfa25fcf74611b7865 Hyatt 2005 Merlot (Rattlesnake Hills)
2097 5ed80dcfa25fcf74611b786b Ca' Momi 2013 Reserve Merlot (Carneros)
2098 5ed80dcfa25fcf74611b7896 Psagot 2014 Merlot

2099 rows × 2 columns

By default, the only field that is extracted other than the ones we directly specify is _id, but we can exclude _id as well by typing

cursor = winecollection.find({'variety': 'Merlot'}, {'title': 1, '_id': 0})
qtext = dumps(cursor)
qrec = loads(qtext)
pd.DataFrame.from_records(qrec)
title
0 Bianchi 2011 Signature Selection Merlot (Paso ...
1 Sundance 2011 Merlot (Maule Valley)
2 Passaggio 2014 Blau Vineyards Merlot (Knights ...
3 Duckhorn 2012 Rector Creek Vineyard Merlot (Na...
4 Viña Bisquertt 2007 Casa La Joya Reserve Merlo...
... ...
2094 Castillo de Monjardin 2009 Deyo Merlot (Navarra)
2095 Bonair 2006 Chateau Puryear Vineyard Merlot (R...
2096 Hyatt 2005 Merlot (Rattlesnake Hills)
2097 Ca' Momi 2013 Reserve Merlot (Carneros)
2098 Psagot 2014 Merlot

2099 rows × 1 columns

To keep the title, variety, points, and price, we type

cursor = winecollection.find({'variety': 'Merlot'}, 
                             {'title': 1,
                             'variety': 1,
                             'points': 1,
                             'price': 1,
                             '_id': 0})
qtext = dumps(cursor)
qrec = loads(qtext)
pd.DataFrame.from_records(qrec)
points price title variety
0 87 22.0 Bianchi 2011 Signature Selection Merlot (Paso ... Merlot
1 86 9.0 Sundance 2011 Merlot (Maule Valley) Merlot
2 86 55.0 Passaggio 2014 Blau Vineyards Merlot (Knights ... Merlot
3 91 95.0 Duckhorn 2012 Rector Creek Vineyard Merlot (Na... Merlot
4 88 11.0 Viña Bisquertt 2007 Casa La Joya Reserve Merlo... Merlot
... ... ... ... ...
2094 87 18.0 Castillo de Monjardin 2009 Deyo Merlot (Navarra) Merlot
2095 86 20.0 Bonair 2006 Chateau Puryear Vineyard Merlot (R... Merlot
2096 86 10.0 Hyatt 2005 Merlot (Rattlesnake Hills) Merlot
2097 90 44.0 Ca' Momi 2013 Reserve Merlot (Carneros) Merlot
2098 91 32.0 Psagot 2014 Merlot Merlot

2099 rows × 4 columns

7.8.4. Updating Records#

Updating records in MongoDB is similar to selecting records in that we use the same logical conditions we use for selecting records for identifying the records we want to edit. The .update_one() method, applied to a collection, has two arguments. First we specify a logical condition that identifies the records we want to edit. Then we use the $set operator to choose specific fields within the existing JSON record to change. If we want, we can even write an entire replacement dictionary for this record, and write it along with $set. For example, to identify the record of the wine from Dwyane Wade’s winery, we can query {'location.winery': 'D Wade Cellars'} as we did in the previous section. Suppose that we want to edit this record so that the price increases to \$45. We can do so with the following code:

winecollection.update_one({'location.winery': 'D Wade Cellars'},
                     {'$set' : {'price': 45}})
mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})
_id title description taster_name taster_twitter_handle price variety location
0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend This wine goes great with dinner just like Dwy... Jonathan Kropko @jmk5131 45 Red Blend {'region_1': 'Napa Valley', 'region_2': None, ...

Suppose that we wanted to add a field that does not currently exist in the record, like points. We can use the same syntax to add fields:

winecollection.update_one({'location.winery': 'D Wade Cellars'},
                     {'$set' : {'score': 90}})
mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})
_id title description taster_name taster_twitter_handle price variety location score
0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend This wine goes great with dinner just like Dwy... Jonathan Kropko @jmk5131 45 Red Blend {'region_1': 'Napa Valley', 'region_2': None, ... 90

We can change more than one field at a time within one call to the $set operator. To change both the score and the price of the Dwyane Wade wine, we can type:

winecollection.update_one({'location.winery': 'D Wade Cellars'},
                     {'$set' : {'score': 95,
                               'price': 50}})
mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})
_id title description taster_name taster_twitter_handle price variety location score
0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend This wine goes great with dinner just like Dwy... Jonathan Kropko @jmk5131 50 Red Blend {'region_1': 'Napa Valley', 'region_2': None, ... 95

Suppose that the wine is reviewed by LeBron James, NBA star and noted wine connoisseur, who provided a new score and description. We can update the entire record by first defining a Python variable that contains the record:

dwadewine2 = {'title': '2016 Napa Valley Three By Wade Red Blend', 
'description': "This wine is very good. Not as great as me. But plenty great enough for Miami.", 
'taster_name': 'LeBron James', 
'taster_twitter_handle': '@kingjames', 
'price': 45,
'score': 99,
'variety': 'Red Blend', 
'location':{
    'region_1': 'Napa Valley', 
    'region_2': None, 
    'province': 'California', 
    'country': 'U.S.', 
    'winery': 'D Wade Cellars'}}

We can replace the existing record for this wine by specifying this dictionary as the second argument of the .update_one() method:

winecollection.update_one({'location.winery': 'D Wade Cellars'},
                     {'$set' : dwadewine2})
mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})
_id title description taster_name taster_twitter_handle price variety location score
0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend This wine is very good. Not as great as me. Bu... LeBron James @kingjames 45 Red Blend {'region_1': 'Napa Valley', 'region_2': None, ... 99

A second method for editing records is .update_all() which revises every document that matches a query. I don’t recommend using this method except in very specific cases, because it is easy to destroy large portions of a database with a mistyped query. But for the sake of illustration, suppose we wanted to change the names of the “Red Blend” varieties of wines to “R. Blend”. We can do that with the following code:

winecollection.update_many({'variety': 'Red Blend'},
                          {'$set': {'variety': 'R. Blend'}})
mongo_read_query(winecollection, {'variety': 'R. Blend'})
_id wine_id country description points price province region taster_name taster_twitter_handle title variety winery location score
0 5ed80dcca25fcf746119e3be 20.0 US Ripe aromas of dark berries mingle with ample ... 87.0 23 Virginia Virginia Alexander Peartree None Quiévremont 2012 Vin de Maison Red (Virginia) R. Blend Quiévremont NaN NaN
1 5ed80dcca25fcf746119e3c6 28.0 Italy Aromas suggest mature berry, scorched earth, a... 87.0 17 Sicily & Sardinia Cerasuolo di Vittoria Kerin O’Keefe @kerinokeefe Terre di Giurfo 2011 Mascaria Barricato (Cera... R. Blend Terre di Giurfo NaN NaN
2 5ed80dcca25fcf746119e3dc 68.0 US Very deep in color and spicy-smoky in flavor, ... 86.0 12 California California Jim Gordon @gordone_cellars Cocobon 2014 Red (California) R. Blend Cocobon NaN NaN
3 5ed80dcca25fcf746119e3f2 90.0 US This blend of Sangiovese, Malbec, Cabernet Sau... 88.0 23 California Sonoma County Virginie Boone @vboone Ferrari-Carano 2014 Siena Red (Sonoma County) R. Blend Ferrari-Carano NaN NaN
4 5ed80dcca25fcf746119e400 104.0 Italy Made with 65% Sangiovese, 20% Merlot and 15% C... 87.0 16 Tuscany Toscana Kerin O’Keefe @kerinokeefe Madonna Alta 2014 Nativo Red (Toscana) R. Blend Madonna Alta NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7106 5ed80dcfa25fcf74611b78b3 129932.0 Argentina Andeluna's top wines tend to be ripe and plump... 91.0 55 Mendoza Province Uco Valley Michael Schachner @wineschach Andeluna 2004 Pasionado Red (Uco Valley) R. Blend Andeluna NaN NaN
7107 5ed80dcfa25fcf74611b78bd 129943.0 Italy A blend of Nero d'Avola and Syrah, this convey... 90.0 29 Sicily & Sardinia Sicilia Kerin O’Keefe @kerinokeefe Baglio del Cristo di Campobello 2012 Adènzia R... R. Blend Baglio del Cristo di Campobello NaN NaN
7108 5ed80dcfa25fcf74611b78c1 129947.0 Italy A blend of 65% Cabernet Sauvignon, 30% Merlot ... 90.0 20 Sicily & Sardinia Terre Siciliane Kerin O’Keefe @kerinokeefe Feudo Principi di Butera 2012 Symposio Red (Te... R. Blend Feudo Principi di Butera NaN NaN
7109 5edd56e5b4e58ce3841e5dea NaN NaN This wine is very good. Not as great as me. Bu... NaN 45 NaN NaN LeBron James @kingjames 2016 Napa Valley Three By Wade Red Blend R. Blend NaN {'region_1': 'Napa Valley', 'region_2': None, ... 99.0
7110 5edd56e6b4e58ce3841e5deb NaN NaN This wine will make you speak differently. May... NaN 40.99 NaN NaN Jonathan Kropko @jmk5131 Anta Banderas A 10 2008 R. Blend NaN {'region_1': 'Ribera del Duoro', 'region_2': N... NaN

7111 rows × 15 columns

7.8.5. Performing Text Searches#

One of the great advantages of a document store database is the ability to search through the text within the documents and extract records that match a certain pattern. A text search in MongoDB involves two steps:

  • First, we will create a text index: a particular field in the records that contains the text we want MongoDB to search within.

  • Second, we will use the $text operator within a call to .find() to specify the search terms.

To create a text index, we can use the syntax

collection.create_index[('keytosearch', 'text')]

We will replace 'keytosearch' with the name of the field in the JSON dictionaries on which we want to search, but we will leave ‘text’as is because this code tells MongoDB to search for text. The code to set thedescriptionfield as the text index in thewinecollection` database is:

winecollection.create_index([('description', 'text')])
'description_text'

Now that we’ve set the text index, we can search the text in that field. The general syntax for a query with a text search is

{'$text': {'$search': 'searchterms', '$caseSensitive': False}}

where 'searchterms' contains the terms we want to search for, and '$caseSensitive': False tells MongoDB to ignore cases in the search, so that a search term of “chocolate” also matches to “Chocolate”. Alternatively, '$caseSensitive': True takes case into account when matching records to a query. If a search is not case sensitive, and if it is not diacritic sensitive (taking things like accents into account, which it can do by adding the $diacriticSensitive=True option), then $search matches on the stems of words: the first several letters in the word, allowing a search term of “blueberry” to also match with “blueberries”.

As a simple example, now that description has been set as the text index, we can find all wines with descriptions that contain the word “chocolate”. Here I save the output as a dataframe and display the description for the first wine in the output:

df = mongo_read_query(winecollection, {'$text': {'$search':'chocolate', '$caseSensitive': False}})
df['description'][0]
'Supermodern in style, with mint, coconut, chocolate and huge black fruit aromas. Powerfully structured and thick-boned, with boysenberry, spice and chocolate in spades. Oaky, broad and layered on the finish, with tobacco, coffee and chocolate finishing notes.'

To search on more than one term, include the terms in the same string after '$search', separated by spaces. By default, these terms are combined using the “or” operator, so that the query returns any document with at least one of the terms. The following code finds all wines whose descriptions contain “chocolate” or “leather”:

df = mongo_read_query(winecollection, {'$text': {'$search':'chocolate leather', '$caseSensitive': False}})
df[df['wine_id']==109396]['description'] # a leathery one
5930    A whiff of leather introduces a wine with a st...
Name: description, dtype: object

To search for documents with a phrase that contains a space, enclose the phrase in double quotes, and precede each double-quote with an \ escape character. The following code captures descriptions with the phrase “very good”:

df = mongo_read_query(winecollection, {'$text': {'$search':'\"very good\"', '$caseSensitive': False}})
df['description'][0]
"A good to very good wine with medicinality on the nose that takes over the aromatics. There's also the slightest bit of hard cheese and stem on the bouquet, so overall it is fighting an uphill battle. Along the way it delivers flavors of cough drop, cherry and a good, solid finish."

To search for documents that contain multiple search terms at once (an “and” operator), enclose each search term in double quotes with escape characters. We can search for descriptions that contain both “leather” and “chocolate” with the following code:

df = mongo_read_query(winecollection, {'$text': {'$search':'\"leather\" \"chocolate\"', '$caseSensitive': False}})
df['description'][0]
'A blend of 85% Melnik, 10% Grenache Noir and 5% Petit Verdot, this wine has aromas of saddle leather, cassis and dark chocolate. In the mouth there are flavors of cherry, chocolate and dried blueberry. It has good balance with a soft tannic finish.'

To exclude a term, we add a negative sign in front of the term we want to exclude. To find all wines whose descriptions contain the word “dark” but not “chocolate”, we type

df = mongo_read_query(winecollection, {'$text': {'$search':'dark -chocolate', '$caseSensitive': False}})
df['description'][0]
'Dark, dark, dark is this estate-grown Syrah, closed at the nose except for the profusion of alcohol, grippy with lots of oak and crazy, teeth-staining tannins.'

Text searches can also be used to construct a search engine. We provide the search terms, and MongoDB generates a score for every document that represents the extent to which the search terms are relevant to the document. Once these scores have been generated, it is possible to sort the documents by the score to find the documents that are most highly related to the search terms.

To rank documents by search-relevancy, we add {'score': {'$meta': 'textScore'}} to the query we pass to the .find() method. Here we enter five search terms, “chocolate”, “leather”, “wood”, “dark”, and “smoke”:

cursor = winecollection.find(
            {'$text': {'$search': 'chocolate leather wood dark smoke'}},
            {'score': {'$meta': 'textScore'}})

Next we apply the .sort() method to the output, arranging the documents by relevancy score, with the following code:

cursor.sort([('score', {'$meta': 'textScore'})]) 
<pymongo.cursor.Cursor at 0x1361d6438>

Finally we can convert the output to a dataframe with the following code:

qtext = dumps(cursor)
qrec = loads(qtext)
df = pd.DataFrame.from_records(qrec)

Among all the wine reviews in the data, here is the wine whose description had the highest relevancy score for “chocolate”, “leather”, “wood”, “dark”, and “smoke”:

df['description'][0]
'Big and bold in character, this full-bodied and abundantly tannic wine looks black to the rim and smells like pencil shavings, leather and wood smoke. The flavors are dry but enticing, with dark chocolate, charred beef and black cherry. It needs until at least 2022 to peak.'

To wrap up our work with the database, we apply the .close() method to the MongoDB server:

myclient.close()