7. Database Queries#
7.1. Introduction: The History of the Structured Query Language (SQL)#
The structured query language (SQL) was invented by Donald D. Chamberlin and Raymond F. Boyce in 1974. Chamberlain and Boyce were both young computer scientists working at the IBM T.J. Watson Research Center in Yorktown Heights, New York, and they met E. F. Codd at a research symposium that Codd organized there. Codd, four years prior, had published the seminal article that defined the relational model for databases. Codd’s relational model is defined using relational algebra and relational calculus, two notational standards that Codd himself created to elaborate on set theory as applied specifically to data tables. One important property of set theory is that highly abstract mathematical expressions can be expressed in plain language. For example, consider the set \(A\) of holidays in the United States during which banks are closed:
Also consider the set \(B\) of holidays in the United Kingdom during which banks are closed:
The intersection between sets \(A\) and \(B\) is a set that consists of all elements that exist with both set \(A\) and set \(B\):
The notation \(A\cap B\) is a mathematical abstraction of an idea that can be expressed in plain-spoken language: \(\cap\) means “and”, and \(A\cap B\) means \(A\) and \(B\), or all elements that are in both \(A\) and \(B\). Put another way, \(A\cap B\) is the set of all holidays during which banks are closed in both the United States and the United Kingdom. Likewise, every piece of set notation can be expressed semantically.
Although Codd laid out the broad parameters of the relational model in mathematical terms, he did not design software or a physical architecture for a relational database. He explicitly left that work up to future research:
Many questions are raised and left unanswered. For example, only a few of the more important properties of the data sublanguage … are mentioned. Neither the purely linguistic details of such a language nor the implementation problems are discussed. Nevertheless, the material presented should be adequate for experienced systems programmers to visualize several approaches (p. 387).
Chamberlin and Boyce took up the challenge of writing a programming language to implement Codd’s relational model. As Chamberlin explains, their primary goal was to create a version of Codd’s set-theoretical relational model that could be expressed in plain language:
The more difficult barrier was at the semantic level. The basic concepts of Codd’s languages were adapted from set theory and symbolic logic. This was natural given Codd’s background as a mathematician, but Ray and I hoped to design a relational language based on concepts that would be familiar to a wider population of users (p. 78).
In short, the idea behind SQL is to implement Codd’s abstract system of using logical statements with set theoretical notation to narrow down the specific records and features in a dataset that a user wishes to read or edit, but to phrase these operations in accessable, plain language. One of the best things about SQL is that once you are used to the language, it reads just like English. That said, Chamberlin admits that SQL “has not proved to be as accessible to untrained users as Ray and I originally hoped” (p. 81).
Another benefit of SQL is that this language is one of the most universal programming languages in existence. It is designed to work with database management systems on any platform, and it works seamlessly within Python, R, C, Java, Javascript, and so on. While the standards for languages and platforms change, SQL has been in continuous use for relational database management since the 1980s and shows no sign of becoming antiquated or being replaced. The SQL syntax exists outside of any individual DBMS, and is maintained by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO), two non-profit organizations that facilitate the development of voluntary consensus standards for things like programming languages and hardware. Despite the universality of SQL, however, different DBMSs use slightly different versions of SQL, adding some unique functionality in some cases, and failing to implement the entire SQL standard in others. MySQL for example lacks the ability to perform a full join. PostgreSQL distinguishes itself from other RDBMSs by striving to implement as much of the global SQL standard as possible. While there some important differences in the version of SQL used by different DBMSs, the differences generally apply to very specific situations and all implementations of SQL use mostly the same syntax and can do mostly the same work.
7.1.1. Declarative and Procedural Languages#
SQL is considered to be a declarative language, which means that it defines the broad task that a particular computer system must carry out, but it does not define the mechanism through which the system completes the task. For example, SQL can tell a system to access two tables and join them together, but that command must tell a DBMS to access additional code that tells the system how exactly to search and operate on the rows and columns of each data table. A language that provides specific instructions to a system on how to carry out a task - by changing the system state in some way, including how the data exist in the system - is a procedural language. The code that a procedural language uses to make these changes on the system is called imperative code. A DBMS can be thought of as a function that takes declarative SQL code as an input, finds and runs the imperative code that carries out the declarative task, and returns the output. MySQL, for example, uses imperative C and C++ code to carry out SQL queries.
7.1.2. Popularity of SQL#
Common standards and the most popular programming languages and environments change all the time. It’s an eternal struggle for data scientists as well as programmers of all kinds, and a matter of consistent anxiety. Presently, Python is the most widely used tool for data science, but will we all have to drop Python soon and teach ourselves Julia?
In this context, it is stunning that SQL has been so widely used since the 1970s. According to a Stack Overflow survey, SQL is the one of most widely used programming languages among the people who filled out the survey, second only to Javascript. Taking into account the high-tech biases in this specific sample, it is probably the case the SQL is more widely used than any other language mentioned in this survey. What accounts for this popularity?
This blog post argues that SQL achieved this level of longevity because it came to prominence during a time in which many of the baseline standards for the development of computer systems were being invented. As more and more systems were developed in a way that depends on SQL, it became harder to change this standard. But SQL is also simple and highly functional because it is a semantic language that expresses set-theoetical and logical operations. As long as relational databases are used, there’s not much functionality that can be added to a query language beyond these foundational mathematical operations, and whatever additional functionality is needed can be added to a version of SQL by a particular DBMS. There are also many different open source and proprietary DBMSs that all employ SQL, so different users can have a choice over many different DBMSs and platforms without having to learn a query language other than SQL.
That said, there’s much less of a reason to use SQL when the database is not organized according to the relational model. NoSQL databases have much more flexible schema in general, and can store the data in one big table or in as many tables as there are records, or even datapoints, in the database. In fact, without a relational schema, the notion of a data table makes less sense in general. For example, a document store is a collection of individual records encoded using JSON or XML, and not as tables. These records can be sharded: stored in many corresponding servers in a distributed system to address challenges with the size of the database and the speed with which database transactions are conducted. Without tables, NoSQL DBMSs do not usually use SQL. MongoDB, for example, works with queries that are themselves in JSON format.
7.2. Create, Read, Update, and Delete (CRUD) Operations#
Persistent storage refers to a system in which data outlives the process that created it. When you work with software that allows you to save a file, the file is stored in persistent storage because it still exists even after you close the software application. Hard drives are examples of persistent storage, as are local and remote servers that store databases. Any persistent storage mechanism must have methods for creating, reading (or loading), updating (or editing), and deleting the data in that storage device. Create, Read, Update, and Delete are the CRUD Operations.
We’ve previously employed CRUD operations using the requests
library to use an API or to do web scraping. Like requests
, SQL and other query languages have CRUD operations. The following table, adapted from a similar one that appears on the Wikipedia page for the CRUD operations, shows these operations in the requests
package, SQL, and the MongoDB query language:
Operation |
|
SQL |
MongoDB |
---|---|---|---|
Create |
|
|
|
Read |
|
|
|
Update |
|
|
|
Delete |
|
|
|
As a data scientist, you will most often use read operations to obtain the data you need for your analysis. However, if you are collecting original data for your project, the create, update, and delete operations become much more important. We will discuss all four operations and their variants in the context of SQL and MongoDB below.
We can work with SQL using pandas
if we first create an engine that links to a specific DBMS, server, and database with create_engine
from sqlalchemy
. Once we do, the pd.read_sql_query()
function makes read operations straightforward, and the .execute()
method applied to the engine lets us easily issue create, update, and delete commands.
7.3. SQL Style: Capitalization, Quotes, New Lines, Indentation#
There are many ways to write an SQL query, and when you look at someone else’s SQL code you will see a variety of styles. Mostly, with the exception of quotes in some cases, stylistic differences don’t change the behavior of the code, but they can have an impact on how easy the code is for other people to read and understand.
One requirement for SQL code is that the query must end with a semi-colon, and that no semi-colons appear elsewhere in the query. As long as that requirement is met, other stylistic choices are possible.
The least readable way to write an SQL query is to write the entire code on one line, with no capitalization or indentation. The following code is valid SQL code:
select t.id, t.column1, t.column2, t.column3, r.column4 from table1 t inner join table2 r on t.id = r.id where column1>100 order by column2 desc;
We will discuss exactly what this query does. But for now, let’s focus on the presentation of code. SQL uses clauses to represent particular functions for reading and writing data. In the above query, select
, from
, inner join
, on
, where
, and order by
are all clauses, and desc
is an option applied to the order by
clause.
One stylistic choice many people make is to write SQL clauses in capital letters. That helps readers to quickly see the parts of the code that are clauses as opposed to the rest of the code that contains column names, table names, values, and aliases. If we capitalize the clauses and options in the SQL query, it looks like this:
SELECT t.id, t.column1, t.column2, t.column3, r.column4 FROM table1 t INNER JOIN table2 r ON t.id = r.id WHERE column1>100 ORDER BY column2 DESC;
Another stylistic choice people make to present the code in a more reabable way is to put clauses that are considered distinct enough from other clauses on new lines. The one common exception is FROM
, which is considered to be closely related to SELECT
and is often written on the same line as SELECT
. If we put each clause other than FROM
on a new line, the query looks like this:
SELECT t.id, t.column1, t.column2, t.column3, r.column4 FROM table1 t
INNER JOIN table2 r
ON t.id = r.id
WHERE column1>100
ORDER BY column2 DESC;
Some clauses are considered to be elaborations upon a previous clause. Column names after SELECT
are usually written on the same line as SELECT
, but if these columns themselves require functions that take up more space, it is useful to put them on new lines. Likewise, ON
is considered an elaboration of INNER JOIN
. These lines of code are often indented to express the dependence on the previous line. If we include indentation in the code, the query is
SELECT
t.id,
t.column1,
t.column2,
t.column3,
r.column4
FROM table1 t
INNER JOIN table2 r
ON t.id = r.id
WHERE column1>100
ORDER BY column2 DESC;
I encourage you to develop good habits with how you write the SQL queries, both for other people to read your code, but more importantly, to make it easier for you yourself to read your code. You will be spending a lot of time developing and debugging SQL queries, and anything you do that cuts down the time to understand your own code will save you a lot of time and frustration in the long-run.
Quotes are only used in SQL code when referring to values of a character feature in one of the data tables. When using quotes while working in Python, it is important to use single quotes, not double quotes, to ensure that the quote that is internal to SQL is not read as a termination of the Python variable that contains the SQL code. That is, it is fine to write a clause that looks like WHERE column = 'value'
, but not WHERE column = "value"
.
For all of the queries we will write in the following examples, we will store the query as a string variable in Python. We will use the triple-quote syntax, which allows us to write a string that exists on multiple lines. So our SQL query definitions will look like this:
myquery = """
SELECT
t.id,
t.column1,
t.column2,
t.column3,
r.column4
FROM table1 t
INNER JOIN table2 r
ON t.id = r.id
WHERE column1>100
ORDER BY column2 DESC;
"""
We will then be able to pass the myquery
variable to functions like pd.read_sql_query()
to be evaluated.
7.4. SQL Joins#
The simplest SQL commands for reading data from a database are SELECT
and FROM
. In module 6, we issued the following query to read the entire “reviews” entity from the wine reviews database:
SELECT * FROM reviews
But this query does not read data from the other entities in the database. It pulls all of the rows and columns from “reviews” and it does not manipulate the data within “reviews” in any way. That might not be the best way to create a dataframe to conduct analyses.
SQL read operations get data, but they can also clean data at the same time. Cleaning data is an important challenge. Even when the data are stored in a well-organized database, that organization might not be the right format for the data given the analyses we intend to do. “Tidy Data” by Hadley Wickham lays out a philosophy of what it means to clean a dataset. The goal is to put data into a format in which modeling and visualization is as easy as possible. There are two steps: first we create tidy data and then we manipulate the data to fit our specific needs. Tidy data is defined as follows:
A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, [features] and types. In tidy data:
Each [feature] forms a column.
Each observation forms a row.
Each type of observational unit forms a table (p. 2).
In other words, the dataset we will use in an analysis must exist in one table, the rows of the table must represent records (also called observations), the columns of the table must represent features, and the rows must represent comparable units. For example, if the data contain records from the 50 U.S. States, there should be a row for each state, and there should not be a row for regions or for the whole country as these units are not comparable to states. Data from a relational database are not generally in tidy format because the relevant data exists in multiple tables. The first step is to combine these tables into one dataset using join functions within SQL read operations. There are many different kinds of joins, and the easiest way to see the difference between these types is to see what they do to real data.
Before we discuss specific examples of how to use SQL, we load the following libraries:
import numpy as np
import pandas as pd
import sys
import os
import psycopg2
from sqlalchemy import create_engine
import dotenv
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 5
3 import sys
4 import os
----> 5 import psycopg2
6 from sqlalchemy import create_engine
7 import dotenv
ModuleNotFoundError: No module named 'psycopg2'
7.4.1. Example Database: NFL and NBA Teams#
As an example, I create a PostgreSQL database that contains two tables: “nfl” contains the location and team name of all 32 NFL teams:
nfl_dict = {'city':['Buffalo','Miami','Boston','New York','Cleveland','Cincinnati',
'Pittsburgh','Baltimore','Kansas City','Las Vegas','Los Angeles','Denver',
'Nashville','Jacksonville','Houston','Indianapolis','Philadelphia','Dallas',
'Washington','Atlanta','Charlotte','Tampa Bay','New Orleans','San Francisco',
'Phoenix', 'Seattle','Chicago','Green Bay','Minneapolis','Detroit'],
'footballteam':['Buffalo Bills','Miami Dolphins','New England Patriots',
['New York Jets', 'New York Giants'],'Cleveland Browns','Cincinnati Bengals',
'Pittsburgh Steelers','Baltimore Ravens','Kansas City Chiefs',
'Las Vegas Raiders',['L.A. Chargers','L.A. Rams'],'Denver Broncos',
'Tennessee Titans','Jacksonville Jaguars','Houston Texans',
'Indianapolis Colts','Philadelphia Eagles','Dallas Cowboys',
'Washington Skins','Atlanta Falcons','Carolina Panthers',
'Tampa Bay Buccaneers','New Orleans Saints', 'San Francisco 49ers',
'Arizona Cardinals','Seattle Seahawks','Chicago Bears',
'Green Bay Packers','Minnesota Vikings','Detroit Lions']}
nfl_df = pd.DataFrame(nfl_dict)
nfl_df
city | footballteam | |
---|---|---|
0 | Buffalo | Buffalo Bills |
1 | Miami | Miami Dolphins |
2 | Boston | New England Patriots |
3 | New York | [New York Jets, New York Giants] |
4 | Cleveland | Cleveland Browns |
5 | Cincinnati | Cincinnati Bengals |
6 | Pittsburgh | Pittsburgh Steelers |
7 | Baltimore | Baltimore Ravens |
8 | Kansas City | Kansas City Chiefs |
9 | Las Vegas | Las Vegas Raiders |
10 | Los Angeles | [L.A. Chargers, L.A. Rams] |
11 | Denver | Denver Broncos |
12 | Nashville | Tennessee Titans |
13 | Jacksonville | Jacksonville Jaguars |
14 | Houston | Houston Texans |
15 | Indianapolis | Indianapolis Colts |
16 | Philadelphia | Philadelphia Eagles |
17 | Dallas | Dallas Cowboys |
18 | Washington | Washington Skins |
19 | Atlanta | Atlanta Falcons |
20 | Charlotte | Carolina Panthers |
21 | Tampa Bay | Tampa Bay Buccaneers |
22 | New Orleans | New Orleans Saints |
23 | San Francisco | San Francisco 49ers |
24 | Phoenix | Arizona Cardinals |
25 | Seattle | Seattle Seahawks |
26 | Chicago | Chicago Bears |
27 | Green Bay | Green Bay Packers |
28 | Minneapolis | Minnesota Vikings |
29 | Detroit | Detroit Lions |
This table is not in first normal form because the data are non-atomic (two teams from New York and two in Los Angeles), but this form is useful for illustrating what different SQL joins do. The second table contains the same information about NBA teams:
nba_dict = {'city':['Boston','New York','Philadelphia','Brooklyn','Toronto',
'Cleveland','Chicago','Detroit','Milwaukee','Indianapolis',
'Atlanta', 'Washington','Orlando','Miami','Charlotte',
'Los Angeles','San Francisco','Portland','Sacramento',
'Phoenix','San Antonio','Dallas','Houston','Oklahoma City',
'Minneapolis','Denver','Salt Lake City','Memphis','New Orleans'],
'basketballteam':['Boston Celtics','New York Knicks','Philadelphia 76ers',
'Brooklyn Nets','Toronto Raptors',
'Cleveland Cavaliers','Chicago Bulls','Detroit Pistons',
'Milwaukee Bucks','Indiana Pacers',
'Atlanta Hawks','Washington Wizards','Orlando Magic',
'Miami Heat','Charlotte Hornets',
['L.A. Lakers','L.A. Clippers'],'Golden State Warriors',
'Portland Trailblazers','Sacramento Kings',
'Phoenix Suns','San Antonio Spurs','Dallas Mavericks',
'Houston Rockets','Oklahoma City Thunder',
'Minnesota Timberwolves','Denver Nuggets',
'Utah Jazz','Memphis Grizzlies','New Orleans Pelicans']}
nba_df = pd.DataFrame(nba_dict)
nba_df
city | basketballteam | |
---|---|---|
0 | Boston | Boston Celtics |
1 | New York | New York Knicks |
2 | Philadelphia | Philadelphia 76ers |
3 | Brooklyn | Brooklyn Nets |
4 | Toronto | Toronto Raptors |
5 | Cleveland | Cleveland Cavaliers |
6 | Chicago | Chicago Bulls |
7 | Detroit | Detroit Pistons |
8 | Milwaukee | Milwaukee Bucks |
9 | Indianapolis | Indiana Pacers |
10 | Atlanta | Atlanta Hawks |
11 | Washington | Washington Wizards |
12 | Orlando | Orlando Magic |
13 | Miami | Miami Heat |
14 | Charlotte | Charlotte Hornets |
15 | Los Angeles | [L.A. Lakers, L.A. Clippers] |
16 | San Francisco | Golden State Warriors |
17 | Portland | Portland Trailblazers |
18 | Sacramento | Sacramento Kings |
19 | Phoenix | Phoenix Suns |
20 | San Antonio | San Antonio Spurs |
21 | Dallas | Dallas Mavericks |
22 | Houston | Houston Rockets |
23 | Oklahoma City | Oklahoma City Thunder |
24 | Minneapolis | Minnesota Timberwolves |
25 | Denver | Denver Nuggets |
26 | Salt Lake City | Utah Jazz |
27 | Memphis | Memphis Grizzlies |
28 | New Orleans | New Orleans Pelicans |
To create a PostgreSQL database with entities for the NFL and NBA teams, I first connect to the PostgreSQL server running on my local computer (see module 6 for a more detailed discussion of how this works). First I bring my PostgreSQL password into the local environment:
dotenv.load_dotenv()
pgpassword = os.getenv("pgpassword")
Then I access the server and establish a cursor for the server:
dbserver = psycopg2.connect(
user='jk8sd',
password=pgpassword,
host="localhost"
)
dbserver.autocommit = True
cursor = dbserver.cursor()
I create an empty “teams” database:
try:
cursor.execute("CREATE DATABASE teams")
except:
cursor.execute("DROP DATABASE teams")
cursor.execute("CREATE DATABASE teams")
And I use the create_engine()
function from sqalchemy
to allow queries to the “teams” database:
engine = create_engine("postgresql+psycopg2://{user}:{pw}@localhost/{db}"
.format(user="jk8sd", pw=pgpassword, db="teams"))
I add the nfl_df
and nba_df
dataframes to the “teams” database:
nfl_df.to_sql('nfl', con = engine, index=False, chunksize=1000, if_exists = 'replace')
nba_df.to_sql('nba', con = engine, index=False, chunksize=1000, if_exists = 'replace')
I can now issue queries to the database. To read all of the data in the NFL table, for example, I type:
myquery = "SELECT * FROM nfl"
pd.read_sql_query(myquery, con=engine)
city | footballteam | |
---|---|---|
0 | Buffalo | Buffalo Bills |
1 | Miami | Miami Dolphins |
2 | Boston | New England Patriots |
3 | New York | {"New York Jets","New York Giants"} |
4 | Cleveland | Cleveland Browns |
5 | Cincinnati | Cincinnati Bengals |
6 | Pittsburgh | Pittsburgh Steelers |
7 | Baltimore | Baltimore Ravens |
8 | Kansas City | Kansas City Chiefs |
9 | Las Vegas | Las Vegas Raiders |
10 | Los Angeles | {"L.A. Chargers","L.A. Rams"} |
11 | Denver | Denver Broncos |
12 | Nashville | Tennessee Titans |
13 | Jacksonville | Jacksonville Jaguars |
14 | Houston | Houston Texans |
15 | Indianapolis | Indianapolis Colts |
16 | Philadelphia | Philadelphia Eagles |
17 | Dallas | Dallas Cowboys |
18 | Washington | Washington Skins |
19 | Atlanta | Atlanta Falcons |
20 | Charlotte | Carolina Panthers |
21 | Tampa Bay | Tampa Bay Buccaneers |
22 | New Orleans | New Orleans Saints |
23 | San Francisco | San Francisco 49ers |
24 | Phoenix | Arizona Cardinals |
25 | Seattle | Seattle Seahawks |
26 | Chicago | Chicago Bears |
27 | Green Bay | Green Bay Packers |
28 | Minneapolis | Minnesota Vikings |
29 | Detroit | Detroit Lions |
7.4.2. Types of Joins#
Joining data tables is the act of adding columns to an existing data table - that is, adding more features to existing records - by matching the rows in one table to the corresponding rows in another table. In a relational database, data tables can include a foreign key which serves as the primary key for another data table. Joins require matching a foreign key in one table to the corresponding primary key in another table. During the join, this foreign key and this primary key are both called indices. To perform a join with an SQL query, we specify the two tables in the database we want to join and the index in each table we will match on.
In the teams database, city
is a primary key in both the “nfl” and “nba” tables, which also makes it a foreign key in both tables. Joining the “nfl” and “nba” tables by matching on city
creates one data table in which the rows still represent cities and the columns list both the NBA and NFL teams in each city.
Not every city has both an NFL and an NBA team. Green Bay, for example, has a football team but no basketball team, and Sacramento has a basketball team but no football team. In a join, every row in a table either matches with one or more rows in the other table, or is unmatched. In this case, Cleveland in the NFL table is matched to a row in the NBA table because Cleveland has both a football and basketball team, but Oklahoma City in the NBA table is unmatched because there is no row for Oklahoma City in the NFL table.
The main difference between types of joins in SQL is their treatment of unmatched records. The following table summarizes the types of joins:
Type of join |
Definition |
---|---|
Inner join |
Only keep the records that exist in both tables |
Left join |
Keep all the records in the first table listed, and keep only the records in the second table listed that have matches in the first table |
Right join |
Keep all the records in the second table listed, and keep only the records in the first table listed that have matches in the second table |
Full join |
Keep all of the records in both tables whether they are matched or not |
Anti join |
Keep only the records in the first table that are not matched in the second table |
Natural join |
The same as any of the joins listed above, but no need to specify the indices as these are determined automatically by finding columns with the same name. If no columns share the same name, a natural join performs a cross join. If more than one pair of columns share names across the two data tables, natural joins assume that both are part of the index to match on. Use caution. |
Cross join |
Also called a Cartesian product. If the first dataframe has \(M\) rows and the second dataframe has \(N\) rows, the result has \(M\times N\) rows. Every row is a pairwise combination of values of each index. |
7.4.3. Inner Joins#
The syntax for an inner join is
SELECT * FROM table1
INNER JOIN table2
ON table1.index_name = table2.index_name;
where table1
and table2
are the data tables we are joining, and table1.index_name
and table2.index_name
are the columns that contain the indices for tables 1 and 2. Alternatively, inner join is the default type of join, so that this syntax
SELECT * FROM table1
JOIN table2
ON table1.index_name = table2.index_name;
also produces an inner join. I recommend typing INNER JOIN
, however, to avoid confusing this type of join with other types.
In the case of the teams database, an inner join of the NFL and NBA tables yields a dataframe with one row for every city that has both a basketball and a football team. The SQL query that generates this data frame is:
myquery = """
SELECT * FROM nfl
INNER JOIN nba
ON nfl.city = nba.city;
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | city | basketballteam | |
---|---|---|---|---|
0 | Miami | Miami Dolphins | Miami | Miami Heat |
1 | Boston | New England Patriots | Boston | Boston Celtics |
2 | New York | {"New York Jets","New York Giants"} | New York | New York Knicks |
3 | Cleveland | Cleveland Browns | Cleveland | Cleveland Cavaliers |
4 | Los Angeles | {"L.A. Chargers","L.A. Rams"} | Los Angeles | {"L.A. Lakers","L.A. Clippers"} |
5 | Denver | Denver Broncos | Denver | Denver Nuggets |
6 | Houston | Houston Texans | Houston | Houston Rockets |
7 | Indianapolis | Indianapolis Colts | Indianapolis | Indiana Pacers |
8 | Philadelphia | Philadelphia Eagles | Philadelphia | Philadelphia 76ers |
9 | Dallas | Dallas Cowboys | Dallas | Dallas Mavericks |
10 | Washington | Washington Skins | Washington | Washington Wizards |
11 | Atlanta | Atlanta Falcons | Atlanta | Atlanta Hawks |
12 | Charlotte | Carolina Panthers | Charlotte | Charlotte Hornets |
13 | New Orleans | New Orleans Saints | New Orleans | New Orleans Pelicans |
14 | San Francisco | San Francisco 49ers | San Francisco | Golden State Warriors |
15 | Phoenix | Arizona Cardinals | Phoenix | Phoenix Suns |
16 | Chicago | Chicago Bears | Chicago | Chicago Bulls |
17 | Minneapolis | Minnesota Vikings | Minneapolis | Minnesota Timberwolves |
18 | Detroit | Detroit Lions | Detroit | Detroit Pistons |
The three quotations that come before and after the SQL code is Python syntax that allow for a string to be entered on multiple lines. With just one quote, Python would assume that the next line should be read as Python code, and will produce an error. Three quotes allows us to space out the components of the SQL query on separate lines to make the SQL code easier to read and understand.
SQL queries can be written on multiple lines, but the last line (and only the last line) must conclude with a semicolon.
Another way to write the inner join query is to use aliasing: specifying a smaller name or a single letter next to each data table in the query to simplify the syntax for ON
. For example, I can alias the NFL data with f
and the NBA data with b
:
myquery = """
SELECT * FROM nfl f
INNER JOIN nba b
ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | city | basketballteam | |
---|---|---|---|---|
0 | Miami | Miami Dolphins | Miami | Miami Heat |
1 | Boston | New England Patriots | Boston | Boston Celtics |
2 | New York | {"New York Jets","New York Giants"} | New York | New York Knicks |
3 | Cleveland | Cleveland Browns | Cleveland | Cleveland Cavaliers |
4 | Los Angeles | {"L.A. Chargers","L.A. Rams"} | Los Angeles | {"L.A. Lakers","L.A. Clippers"} |
5 | Denver | Denver Broncos | Denver | Denver Nuggets |
6 | Houston | Houston Texans | Houston | Houston Rockets |
7 | Indianapolis | Indianapolis Colts | Indianapolis | Indiana Pacers |
8 | Philadelphia | Philadelphia Eagles | Philadelphia | Philadelphia 76ers |
9 | Dallas | Dallas Cowboys | Dallas | Dallas Mavericks |
10 | Washington | Washington Skins | Washington | Washington Wizards |
11 | Atlanta | Atlanta Falcons | Atlanta | Atlanta Hawks |
12 | Charlotte | Carolina Panthers | Charlotte | Charlotte Hornets |
13 | New Orleans | New Orleans Saints | New Orleans | New Orleans Pelicans |
14 | San Francisco | San Francisco 49ers | San Francisco | Golden State Warriors |
15 | Phoenix | Arizona Cardinals | Phoenix | Phoenix Suns |
16 | Chicago | Chicago Bears | Chicago | Chicago Bulls |
17 | Minneapolis | Minnesota Vikings | Minneapolis | Minnesota Timberwolves |
18 | Detroit | Detroit Lions | Detroit | Detroit Pistons |
The two indices we match on do not necessarily have to have the same name. Supposing that the “city” column in each data table was named “location” in the NFL table and “town” in the NBA table, the syntax for the inner join would have been:
SELECT * FROM nfl f
INNER JOIN nba b
ON f.location = b.town;
7.4.4. Left and Right Joins#
The syntax for a left join is
SELECT * FROM table1
LEFT JOIN table2
ON table1.index_name = table2.index_name;
and the syntax for a right join is
SELECT * FROM table1
RIGHT JOIN table2
ON table1.index_name = table2.index_name;
In the case of the teams database, if we list the NFL table next to FROM
and the NBA data with the JOIN
statement, then left join lists all of the cities with an NFL team, and also displays the NBA team in that city if one exists. Otherwise, the syntax places None
in the cell where the NBA team would be. For the teams database, the syntax for a left join is:
myquery = """
SELECT * FROM nfl f
LEFT JOIN nba b
ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | city | basketballteam | |
---|---|---|---|---|
0 | Buffalo | Buffalo Bills | None | None |
1 | Miami | Miami Dolphins | Miami | Miami Heat |
2 | Boston | New England Patriots | Boston | Boston Celtics |
3 | New York | {"New York Jets","New York Giants"} | New York | New York Knicks |
4 | Cleveland | Cleveland Browns | Cleveland | Cleveland Cavaliers |
5 | Cincinnati | Cincinnati Bengals | None | None |
6 | Pittsburgh | Pittsburgh Steelers | None | None |
7 | Baltimore | Baltimore Ravens | None | None |
8 | Kansas City | Kansas City Chiefs | None | None |
9 | Las Vegas | Las Vegas Raiders | None | None |
10 | Los Angeles | {"L.A. Chargers","L.A. Rams"} | Los Angeles | {"L.A. Lakers","L.A. Clippers"} |
11 | Denver | Denver Broncos | Denver | Denver Nuggets |
12 | Nashville | Tennessee Titans | None | None |
13 | Jacksonville | Jacksonville Jaguars | None | None |
14 | Houston | Houston Texans | Houston | Houston Rockets |
15 | Indianapolis | Indianapolis Colts | Indianapolis | Indiana Pacers |
16 | Philadelphia | Philadelphia Eagles | Philadelphia | Philadelphia 76ers |
17 | Dallas | Dallas Cowboys | Dallas | Dallas Mavericks |
18 | Washington | Washington Skins | Washington | Washington Wizards |
19 | Atlanta | Atlanta Falcons | Atlanta | Atlanta Hawks |
20 | Charlotte | Carolina Panthers | Charlotte | Charlotte Hornets |
21 | Tampa Bay | Tampa Bay Buccaneers | None | None |
22 | New Orleans | New Orleans Saints | New Orleans | New Orleans Pelicans |
23 | San Francisco | San Francisco 49ers | San Francisco | Golden State Warriors |
24 | Phoenix | Arizona Cardinals | Phoenix | Phoenix Suns |
25 | Seattle | Seattle Seahawks | None | None |
26 | Chicago | Chicago Bears | Chicago | Chicago Bulls |
27 | Green Bay | Green Bay Packers | None | None |
28 | Minneapolis | Minnesota Vikings | Minneapolis | Minnesota Timberwolves |
29 | Detroit | Detroit Lions | Detroit | Detroit Pistons |
Likewise, the right join displays all the cities with an NBA team, along with the NFL team in that city, if one exists:
myquery = """
SELECT * FROM nfl f
RIGHT JOIN nba b
ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | city | basketballteam | |
---|---|---|---|---|
0 | Boston | New England Patriots | Boston | Boston Celtics |
1 | New York | {"New York Jets","New York Giants"} | New York | New York Knicks |
2 | Philadelphia | Philadelphia Eagles | Philadelphia | Philadelphia 76ers |
3 | None | None | Brooklyn | Brooklyn Nets |
4 | None | None | Toronto | Toronto Raptors |
5 | Cleveland | Cleveland Browns | Cleveland | Cleveland Cavaliers |
6 | Chicago | Chicago Bears | Chicago | Chicago Bulls |
7 | Detroit | Detroit Lions | Detroit | Detroit Pistons |
8 | None | None | Milwaukee | Milwaukee Bucks |
9 | Indianapolis | Indianapolis Colts | Indianapolis | Indiana Pacers |
10 | Atlanta | Atlanta Falcons | Atlanta | Atlanta Hawks |
11 | Washington | Washington Skins | Washington | Washington Wizards |
12 | None | None | Orlando | Orlando Magic |
13 | Miami | Miami Dolphins | Miami | Miami Heat |
14 | Charlotte | Carolina Panthers | Charlotte | Charlotte Hornets |
15 | Los Angeles | {"L.A. Chargers","L.A. Rams"} | Los Angeles | {"L.A. Lakers","L.A. Clippers"} |
16 | San Francisco | San Francisco 49ers | San Francisco | Golden State Warriors |
17 | None | None | Portland | Portland Trailblazers |
18 | None | None | Sacramento | Sacramento Kings |
19 | Phoenix | Arizona Cardinals | Phoenix | Phoenix Suns |
20 | None | None | San Antonio | San Antonio Spurs |
21 | Dallas | Dallas Cowboys | Dallas | Dallas Mavericks |
22 | Houston | Houston Texans | Houston | Houston Rockets |
23 | None | None | Oklahoma City | Oklahoma City Thunder |
24 | Minneapolis | Minnesota Vikings | Minneapolis | Minnesota Timberwolves |
25 | Denver | Denver Broncos | Denver | Denver Nuggets |
26 | None | None | Salt Lake City | Utah Jazz |
27 | None | None | Memphis | Memphis Grizzlies |
28 | New Orleans | New Orleans Saints | New Orleans | New Orleans Pelicans |
For the left and right joins, changing which data table appears along with FROM
and which data table appears along with JOIN
accomplishes the same thing as changing a left join to a right join.
7.4.5. Full (Outer) Join#
A full join, also called an outer join, keeps all of the records that exist in both tables, whether or not they are matched. Full joins will return a data frame with at least as many rows as the larger of the two data tables in the join because it contains all records that appear in either data frame. Most tutorials on SQL offer a warning about full joins that these queries can result in massive amounts of data being returned, and full joins are not implemented for MySQL databases. For systems like PostgreSQL in which full joins are allowed, the syntax for a full join is
SELECT * FROM table1
FULL JOIN table2
ON table1.index_name = table2.index_name;
For the teams database, a full join produces a data frame with one row for every city with an NFL team or an NBA team or both:
myquery = """
SELECT * FROM nfl f
FULL JOIN nba b
ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | city | basketballteam | |
---|---|---|---|---|
0 | Buffalo | Buffalo Bills | None | None |
1 | Miami | Miami Dolphins | Miami | Miami Heat |
2 | Boston | New England Patriots | Boston | Boston Celtics |
3 | New York | {"New York Jets","New York Giants"} | New York | New York Knicks |
4 | Cleveland | Cleveland Browns | Cleveland | Cleveland Cavaliers |
5 | Cincinnati | Cincinnati Bengals | None | None |
6 | Pittsburgh | Pittsburgh Steelers | None | None |
7 | Baltimore | Baltimore Ravens | None | None |
8 | Kansas City | Kansas City Chiefs | None | None |
9 | Las Vegas | Las Vegas Raiders | None | None |
10 | Los Angeles | {"L.A. Chargers","L.A. Rams"} | Los Angeles | {"L.A. Lakers","L.A. Clippers"} |
11 | Denver | Denver Broncos | Denver | Denver Nuggets |
12 | Nashville | Tennessee Titans | None | None |
13 | Jacksonville | Jacksonville Jaguars | None | None |
14 | Houston | Houston Texans | Houston | Houston Rockets |
15 | Indianapolis | Indianapolis Colts | Indianapolis | Indiana Pacers |
16 | Philadelphia | Philadelphia Eagles | Philadelphia | Philadelphia 76ers |
17 | Dallas | Dallas Cowboys | Dallas | Dallas Mavericks |
18 | Washington | Washington Skins | Washington | Washington Wizards |
19 | Atlanta | Atlanta Falcons | Atlanta | Atlanta Hawks |
20 | Charlotte | Carolina Panthers | Charlotte | Charlotte Hornets |
21 | Tampa Bay | Tampa Bay Buccaneers | None | None |
22 | New Orleans | New Orleans Saints | New Orleans | New Orleans Pelicans |
23 | San Francisco | San Francisco 49ers | San Francisco | Golden State Warriors |
24 | Phoenix | Arizona Cardinals | Phoenix | Phoenix Suns |
25 | Seattle | Seattle Seahawks | None | None |
26 | Chicago | Chicago Bears | Chicago | Chicago Bulls |
27 | Green Bay | Green Bay Packers | None | None |
28 | Minneapolis | Minnesota Vikings | Minneapolis | Minnesota Timberwolves |
29 | Detroit | Detroit Lions | Detroit | Detroit Pistons |
30 | None | None | Milwaukee | Milwaukee Bucks |
31 | None | None | Oklahoma City | Oklahoma City Thunder |
32 | None | None | Portland | Portland Trailblazers |
33 | None | None | Brooklyn | Brooklyn Nets |
34 | None | None | Sacramento | Sacramento Kings |
35 | None | None | Memphis | Memphis Grizzlies |
36 | None | None | San Antonio | San Antonio Spurs |
37 | None | None | Salt Lake City | Utah Jazz |
38 | None | None | Orlando | Orlando Magic |
39 | None | None | Toronto | Toronto Raptors |
Although there are 30 cities with at least one NFL team and 29 cities with at least one NBA team, there are 41 cities with at least one team from one of these two leagues.
7.4.6. Anti-Joins#
An anti-join leaves us with all of the records in the first data table that do not appear in the second table. There is no “ANTI JOIN” syntax in SQL, but the behavior of an anti-join can be generated by including the WHERE
clause along with LEFT JOIN
. The syntax for an anti-join is
SELECT * FROM table1
LEFT JOIN table2
ON table1.index_name = table2.index_name
WHERE table2.index_name is NULL;
The WHERE
statement is used to draw a selection of rows from a data table that make a specified logical condition true. After performing a left join we have a data table with all of the rows in the first table along with the data for those rows in the second table if the row had a match in the second table. Typing WHERE table2.index_name is NULL
restricts this data table to only the rows that do not have a value of the index in the second table, meaning there was no match. For the teams database, the anti-join of the NFL and NBA tables yields a dataframe of all the cities with an NFL team but no NBA team:
myquery = """
SELECT * FROM nfl f
LEFT JOIN nba b
ON f.city = b.city
WHERE b.city is NULL;
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | city | basketballteam | |
---|---|---|---|---|
0 | Buffalo | Buffalo Bills | None | None |
1 | Cincinnati | Cincinnati Bengals | None | None |
2 | Pittsburgh | Pittsburgh Steelers | None | None |
3 | Baltimore | Baltimore Ravens | None | None |
4 | Kansas City | Kansas City Chiefs | None | None |
5 | Las Vegas | Las Vegas Raiders | None | None |
6 | Nashville | Tennessee Titans | None | None |
7 | Jacksonville | Jacksonville Jaguars | None | None |
8 | Tampa Bay | Tampa Bay Buccaneers | None | None |
9 | Seattle | Seattle Seahawks | None | None |
10 | Green Bay | Green Bay Packers | None | None |
7.4.7. Natural Joins#
One annoying thing about all of the joins shown above is that we end up with two columns that contain the same information. In the case of the team database, we have two city
columns that are always either equal, or else one is missing. But when one of the city
columns says “None”, the team from that table also says “None”, so the missingness in the city
column does not provide additional information.
It might make sense to use a different kind of join that understands that the two city
columns contain the same information and includes only one of these columns. A natural join does two things differently from the other joins described here:
A natural join removes duplicated columns from the output data.
A natural join detects the indices automatically by assuming columns that share the same name are part indices.
If done correctly, a natural join saves some work constructing the query as the indices are detected automatically, and provides cleaner output. Any of the joins described above can be done as a natural join by adding NATURAL
in front of INNER
, LEFT
, RIGHT
, or FULL
. If there are no columns that share the same name, a natural join instead performs a cross join (described below).
The following query performs a natural inner join:
myquery = """
SELECT * from nfl
NATURAL INNER JOIN nba
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | basketballteam | |
---|---|---|---|
0 | Miami | Miami Dolphins | Miami Heat |
1 | Boston | New England Patriots | Boston Celtics |
2 | New York | {"New York Jets","New York Giants"} | New York Knicks |
3 | Cleveland | Cleveland Browns | Cleveland Cavaliers |
4 | Los Angeles | {"L.A. Chargers","L.A. Rams"} | {"L.A. Lakers","L.A. Clippers"} |
5 | Denver | Denver Broncos | Denver Nuggets |
6 | Houston | Houston Texans | Houston Rockets |
7 | Indianapolis | Indianapolis Colts | Indiana Pacers |
8 | Philadelphia | Philadelphia Eagles | Philadelphia 76ers |
9 | Dallas | Dallas Cowboys | Dallas Mavericks |
10 | Washington | Washington Skins | Washington Wizards |
11 | Atlanta | Atlanta Falcons | Atlanta Hawks |
12 | Charlotte | Carolina Panthers | Charlotte Hornets |
13 | New Orleans | New Orleans Saints | New Orleans Pelicans |
14 | San Francisco | San Francisco 49ers | Golden State Warriors |
15 | Phoenix | Arizona Cardinals | Phoenix Suns |
16 | Chicago | Chicago Bears | Chicago Bulls |
17 | Minneapolis | Minnesota Vikings | Minnesota Timberwolves |
18 | Detroit | Detroit Lions | Detroit Pistons |
Natural joins are controversial, however, and many data scientists choose not to use them at all. The danger is that if two columns unexpectedly have the same name (it can be hard to keep track of all of the features’ names in big databases) then a natural join will match on the wrong indices. This Stack Overflow post gets into this debate, and one response made a forceful argument against natural joins:
Collapsing columns in the output is the least-important aspect of a natural join. The things you need to know are (A) it automatically joins on fields of the same name and (B) it will f— up your s— when you least expect it. In my world, using a natural join is grounds for dismissal… . Say you have a natural join between
Customers
andEmployees
, joining onEmployeeID
. Employees also has aManagerID
field. Everything’s fine. Then, some day, someone adds aManagerID
field to theCustomers
table. Your join will not break (that would be a mercy), instead it will now include a second field, and work incorrectly. Thus, a seemingly harmless change can break something only distantly related. VERY BAD. The only upside of a natural join is saving a little typing, and the downside is substantial.
Personally, I disagree with this statement as I think natural joins can be elegant and convenient, especially when I want to match on multiple indices. But I agree that natural joins do make it easier to mess up a join, and more caution is needed. To demonstrate how a natural join can go wrong, suppose that in both the NFL and NBA tables the columns were named city
and team
. The following code creates versions of these tables with footballteam
and basketballteam
each renamed to team
and stores these tables in the database as “nfl2” and “nba2”:
nfl2 = pd.read_sql_query("SELECT city, footballteam as team FROM nfl;", con=engine)
nba2 = pd.read_sql_query("SELECT city, basketballteam as team FROM nba;", con=engine)
nfl2.to_sql('nfl2', con = engine, index=False, chunksize=1000, if_exists = 'replace')
nba2.to_sql('nba2', con = engine, index=False, chunksize=1000, if_exists = 'replace')
Now a natural inner join between “nfl2” and “nba2” yields a dataframe with no records:
myquery = """
SELECT * FROM nfl2
NATURAL INNER JOIN nba2;
"""
pd.read_sql_query(myquery, con=engine)
city | team |
---|
The reason why there are no records is that the natural join automatically chooses both city
and team
to be part of the index, and records are only kept in the inner join if they match on both city and team. There are many matches for city, but no matches for both city and team.
In contrast, a regular inner join still works fine:
myquery = """
SELECT * FROM nfl2 f
INNER JOIN nba2 b
ON f.city = b.city;
"""
pd.read_sql_query(myquery, con=engine)
city | team | city | team | |
---|---|---|---|---|
0 | Miami | Miami Dolphins | Miami | Miami Heat |
1 | Boston | New England Patriots | Boston | Boston Celtics |
2 | New York | {"New York Jets","New York Giants"} | New York | New York Knicks |
3 | Cleveland | Cleveland Browns | Cleveland | Cleveland Cavaliers |
4 | Los Angeles | {"L.A. Chargers","L.A. Rams"} | Los Angeles | {"L.A. Lakers","L.A. Clippers"} |
5 | Denver | Denver Broncos | Denver | Denver Nuggets |
6 | Houston | Houston Texans | Houston | Houston Rockets |
7 | Indianapolis | Indianapolis Colts | Indianapolis | Indiana Pacers |
8 | Philadelphia | Philadelphia Eagles | Philadelphia | Philadelphia 76ers |
9 | Dallas | Dallas Cowboys | Dallas | Dallas Mavericks |
10 | Washington | Washington Skins | Washington | Washington Wizards |
11 | Atlanta | Atlanta Falcons | Atlanta | Atlanta Hawks |
12 | Charlotte | Carolina Panthers | Charlotte | Charlotte Hornets |
13 | New Orleans | New Orleans Saints | New Orleans | New Orleans Pelicans |
14 | San Francisco | San Francisco 49ers | San Francisco | Golden State Warriors |
15 | Phoenix | Arizona Cardinals | Phoenix | Phoenix Suns |
16 | Chicago | Chicago Bears | Chicago | Chicago Bulls |
17 | Minneapolis | Minnesota Vikings | Minneapolis | Minnesota Timberwolves |
18 | Detroit | Detroit Lions | Detroit | Detroit Pistons |
To safely use natural joins, first make certain that the indices you intend to match on have the same name, and then make sure that no other columns in the two data tables share a name.
7.4.8. Cross Joins#
A round robin is a method of organizing a competitive tournament. In a round robin, every team or participant plays every other team or participant once. A cross join, also called a Cartesian product, is a round robin for matching values of the index in one data table to values of the index in the other data table. Every value of the index in the first data table is matched once to every distinct value of the index in the second data table. Cross joins are memory-intensive: if the first data table has \(M\) rows and the second data table has \(N\) rows, the cross join output is a data table with \(M\times N\) rows. In general cross joins are not good ways to combine data entities, and they fail to match strictly like units. But cross joins are useful for constructing data that contain all possible pairings, if that’s what a situation calls for.
The syntax for generating a cross join is
SELECT * FROM table1
CROSS JOIN table2;
There is no ON
statement in this query because it is not needed to match each row in table1
to every row in table2
. For the teams database, the cross join generates the following output:
myquery = """
SELECT * FROM nfl
CROSS JOIN nba;
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | city | basketballteam | |
---|---|---|---|---|
0 | Buffalo | Buffalo Bills | Boston | Boston Celtics |
1 | Buffalo | Buffalo Bills | New York | New York Knicks |
2 | Buffalo | Buffalo Bills | Philadelphia | Philadelphia 76ers |
3 | Buffalo | Buffalo Bills | Brooklyn | Brooklyn Nets |
4 | Buffalo | Buffalo Bills | Toronto | Toronto Raptors |
... | ... | ... | ... | ... |
865 | Detroit | Detroit Lions | Minneapolis | Minnesota Timberwolves |
866 | Detroit | Detroit Lions | Denver | Denver Nuggets |
867 | Detroit | Detroit Lions | Salt Lake City | Utah Jazz |
868 | Detroit | Detroit Lions | Memphis | Memphis Grizzlies |
869 | Detroit | Detroit Lions | New Orleans | New Orleans Pelicans |
870 rows × 4 columns
7.4.9. Multiple Joins in One Query#
All of the examples above show a single join between two data tables, but many situations will require you to join multiple tables. it is possible to join many tables in one SQL query. The syntax to perform an inner join between two tables, then an inner join between the result and a third table is
SELECT * FROM table1
INNER JOIN table2
ON table1.index_name = table2.index_name
INNER JOIN table 3
ON table1.index_name = table3.index_name;
To demonstrate how multiple joins can work, I add a third table to the teams database that contains all of the Major League Baseball teams:
mlb_dict = {'city': ['New York', 'Boston', 'Toronto', 'Baltimore', 'Tampa Bay',
'Cleveland', 'Chicago', 'Kansas City', 'Minneapolis', 'Detroit',
'Houston', 'Anaheim', 'Dallas', 'Seattle', 'Oakland',
'Philadelphia', 'Miami', 'Washington', 'Atlanta', 'Cincinnati',
'Milwaukee', 'St. Louis', 'Pittsburgh', 'Los Angeles', 'San Francisco',
'San Diego', 'Denver', 'Phoenix'],
'baseballteam': [['New York Mets', 'New York Yankees'], 'Boston Red Sox', 'Toronto Blue Jays',
'Baltimore Orioles', 'Tampa Bay Rays', 'Cleveland Indians',
['Chicago White Sox', 'Chicago Cubs'], 'Kansas City Royals', 'Minnesota Twins',
'Detriot Tigers', 'Houston Astros', 'Anaheim Angels', 'Texas Rangers',
'Seattle Mariners', 'Oakland Athletics', 'Philadelphia Phillies',
'Miami Marlins', 'Washington Nationals', 'Atlanta Braves', 'Cincinnati Reds',
'Milwaukee Brewers', 'St. Louis Cardinals', 'Pittsburgh Pirates', 'Los Angeles Dodgers',
'San Francisco Giants', 'San Diego Padres', 'Colorado Rockies', 'Arizona Diamondbacks']}
mlb_df = pd.DataFrame(mlb_dict)
mlb_df.to_sql('mlb', con = engine, index=False, chunksize=1000, if_exists = 'replace')
We can first inner join the NFL and NBA data tables to keep only the cities with both an NFL and an NBA team, then we can inner join the result with the MLB data to keep only the cities with teams in all three sports:
myquery = """
SELECT * FROM nfl f
INNER JOIN nba b
ON f.city = b.city
INNER JOIN mlb m
ON f.city = m.city;
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | city | basketballteam | city | baseballteam | |
---|---|---|---|---|---|---|
0 | Atlanta | Atlanta Falcons | Atlanta | Atlanta Hawks | Atlanta | Atlanta Braves |
1 | Boston | New England Patriots | Boston | Boston Celtics | Boston | Boston Red Sox |
2 | Chicago | Chicago Bears | Chicago | Chicago Bulls | Chicago | {"Chicago White Sox","Chicago Cubs"} |
3 | Cleveland | Cleveland Browns | Cleveland | Cleveland Cavaliers | Cleveland | Cleveland Indians |
4 | Dallas | Dallas Cowboys | Dallas | Dallas Mavericks | Dallas | Texas Rangers |
5 | Denver | Denver Broncos | Denver | Denver Nuggets | Denver | Colorado Rockies |
6 | Detroit | Detroit Lions | Detroit | Detroit Pistons | Detroit | Detriot Tigers |
7 | Houston | Houston Texans | Houston | Houston Rockets | Houston | Houston Astros |
8 | Los Angeles | {"L.A. Chargers","L.A. Rams"} | Los Angeles | {"L.A. Lakers","L.A. Clippers"} | Los Angeles | Los Angeles Dodgers |
9 | Miami | Miami Dolphins | Miami | Miami Heat | Miami | Miami Marlins |
10 | Minneapolis | Minnesota Vikings | Minneapolis | Minnesota Timberwolves | Minneapolis | Minnesota Twins |
11 | New York | {"New York Jets","New York Giants"} | New York | New York Knicks | New York | {"New York Mets","New York Yankees"} |
12 | Philadelphia | Philadelphia Eagles | Philadelphia | Philadelphia 76ers | Philadelphia | Philadelphia Phillies |
13 | Phoenix | Arizona Cardinals | Phoenix | Phoenix Suns | Phoenix | Arizona Diamondbacks |
14 | San Francisco | San Francisco 49ers | San Francisco | Golden State Warriors | San Francisco | San Francisco Giants |
15 | Washington | Washington Skins | Washington | Washington Wizards | Washington | Washington Nationals |
Things get more complicated when we consider left, right, and full joins in a multiple table context. The trick is to think about the set of records that is required, to express that set in set theoretical notation, and to find the right combination of joins that matches that set theoretical statement.
For example, to obtain all cities with both an NFL and NBA team, also listing the MLB team if one exists in that city, we first inner join the NFL table to the NBA table, then we left join either the NFL’s or NBA’s city index to the MLB’s city column:
myquery = """
SELECT * FROM nfl f
INNER JOIN nba b
ON f.city = b.city
LEFT JOIN mlb m
ON f.city = m.city;
"""
pd.read_sql_query(myquery, con=engine)
city | footballteam | city | basketballteam | city | baseballteam | |
---|---|---|---|---|---|---|
0 | Atlanta | Atlanta Falcons | Atlanta | Atlanta Hawks | Atlanta | Atlanta Braves |
1 | Boston | New England Patriots | Boston | Boston Celtics | Boston | Boston Red Sox |
2 | Charlotte | Carolina Panthers | Charlotte | Charlotte Hornets | None | None |
3 | Chicago | Chicago Bears | Chicago | Chicago Bulls | Chicago | {"Chicago White Sox","Chicago Cubs"} |
4 | Cleveland | Cleveland Browns | Cleveland | Cleveland Cavaliers | Cleveland | Cleveland Indians |
5 | Dallas | Dallas Cowboys | Dallas | Dallas Mavericks | Dallas | Texas Rangers |
6 | Denver | Denver Broncos | Denver | Denver Nuggets | Denver | Colorado Rockies |
7 | Detroit | Detroit Lions | Detroit | Detroit Pistons | Detroit | Detriot Tigers |
8 | Houston | Houston Texans | Houston | Houston Rockets | Houston | Houston Astros |
9 | Indianapolis | Indianapolis Colts | Indianapolis | Indiana Pacers | None | None |
10 | Los Angeles | {"L.A. Chargers","L.A. Rams"} | Los Angeles | {"L.A. Lakers","L.A. Clippers"} | Los Angeles | Los Angeles Dodgers |
11 | Miami | Miami Dolphins | Miami | Miami Heat | Miami | Miami Marlins |
12 | Minneapolis | Minnesota Vikings | Minneapolis | Minnesota Timberwolves | Minneapolis | Minnesota Twins |
13 | New Orleans | New Orleans Saints | New Orleans | New Orleans Pelicans | None | None |
14 | New York | {"New York Jets","New York Giants"} | New York | New York Knicks | New York | {"New York Mets","New York Yankees"} |
15 | Philadelphia | Philadelphia Eagles | Philadelphia | Philadelphia 76ers | Philadelphia | Philadelphia Phillies |
16 | Phoenix | Arizona Cardinals | Phoenix | Phoenix Suns | Phoenix | Arizona Diamondbacks |
17 | San Francisco | San Francisco 49ers | San Francisco | Golden State Warriors | San Francisco | San Francisco Giants |
18 | Washington | Washington Skins | Washington | Washington Wizards | Washington | Washington Nationals |
To narrow the records to teams with a baseball team and a basketball team, but no football team, first we inner join the MLB and NBA data tables, then perform an anti-join with the NFL data:
myquery = """
SELECT * FROM mlb m
INNER JOIN nba b
ON m.city=b.city
LEFT JOIN nfl f
ON m.city = f.city
WHERE f.city is NULL;
"""
pd.read_sql_query(myquery, con=engine)
city | baseballteam | city | basketballteam | city | footballteam | |
---|---|---|---|---|---|---|
0 | Toronto | Toronto Blue Jays | Toronto | Toronto Raptors | None | None |
1 | Milwaukee | Milwaukee Brewers | Milwaukee | Milwaukee Bucks | None | None |
7.4.10. Joins on More Than One Index#
Sometimes more than one column comprises the primary key for a table. The general syntax for joining two tables on more than one index adds the AND
clause to the standard SQL join syntax:
SELECT * FROM table1
INNER JOIN table2
ON table1.index1 = table2.index2
AND table1.anotherindex1 = table2.anotherindex2;
Suppose for example that the NBA table and MLB table also contained records for minor league teams in the NBA G-League or the MLB AAA system. Some cities have both major and minor league teams in the same sport. Washington, for example, has a major league NBA team, the Wizards, and a minor league basketball team, the Capital City Go-Gos. Suppose that both the NBA and MLB tables have a column leaguetype
that marks each team as “major” or “minor”, and that we want to match on both city and league type. The syntax to do so is
SELECT * FROM nba b
INNER JOIN mlb m
ON b.city = m.city
AND b.leaguetype = m.leaguetype;
7.5. SQL Create, Update, and Delete Operations#
Once a database exists and is populated with data, most changes to the data will be small and incremental. We might add a few records, edit a couple, or delete one or two. There are straightforward SQL commands for creating, updating, and deleting records. To issue these queries, however, we cannot use the pd.read_sql_query()
function as this function is only for read operations. Instead, we can use the .execute()
method as applied to either the cursor for the database we are working with, or the sqlalchemy
engine. Specific examples are shown below.
7.5.1. Creating New Records#
An existing database has a schema, an overarching organizational blueprint for the database, that describes the different tables in the database, and within each table what the columns are and what kinds of data can be input into the columns. Creating new data generally works within an established schema. That means we enter new datapoints into existing columns, matching the data type that must exist in those columns.
The SQL syntax to create new data is
INSERT INTO table (column1, column2, ...)
VALUES (value1, value2, ...);
This syntax requires us to specify the key elements of the schema that identify a location in the database: the table and the columns. The values need to be listed in the same order as the columns, and character values need to be enclosed in single quotes.
To add a new observation to the NBA table (bring back the Sonics!) we can type:
myquery = """
INSERT INTO nba (city, basketballteam)
VALUES ('Seattle', 'Seattle Supersonics');
"""
engine.execute(myquery)
<sqlalchemy.engine.result.ResultProxy at 0x1153b5390>
Here the engine
variable is the sqlaclchemy
connection we previously established for the teams database. We can use the execute()
method to pass SQL queries to the database, just as we can with a cursor. Now, when we look at the data, we see the Seattle Supersonics included along with all the other NBA teams:
pd.read_sql_query("SELECT * FROM nba", con=engine)
city | basketballteam | |
---|---|---|
0 | Boston | Boston Celtics |
1 | New York | New York Knicks |
2 | Philadelphia | Philadelphia 76ers |
3 | Brooklyn | Brooklyn Nets |
4 | Toronto | Toronto Raptors |
5 | Cleveland | Cleveland Cavaliers |
6 | Chicago | Chicago Bulls |
7 | Detroit | Detroit Pistons |
8 | Milwaukee | Milwaukee Bucks |
9 | Indianapolis | Indiana Pacers |
10 | Atlanta | Atlanta Hawks |
11 | Washington | Washington Wizards |
12 | Orlando | Orlando Magic |
13 | Miami | Miami Heat |
14 | Charlotte | Charlotte Hornets |
15 | Los Angeles | {"L.A. Lakers","L.A. Clippers"} |
16 | San Francisco | Golden State Warriors |
17 | Portland | Portland Trailblazers |
18 | Sacramento | Sacramento Kings |
19 | Phoenix | Phoenix Suns |
20 | San Antonio | San Antonio Spurs |
21 | Dallas | Dallas Mavericks |
22 | Houston | Houston Rockets |
23 | Oklahoma City | Oklahoma City Thunder |
24 | Minneapolis | Minnesota Timberwolves |
25 | Denver | Denver Nuggets |
26 | Salt Lake City | Utah Jazz |
27 | Memphis | Memphis Grizzlies |
28 | New Orleans | New Orleans Pelicans |
29 | Seattle | Seattle Supersonics |
7.5.2. Editing Existing Records#
Instead of creating a new record, there are situations in which we want to edit an existing record. To revise a record, we use the following SQL syntax:
UPDATE table
SET column2 = newvalue
WHERE logicalcondition;
In this case, SET
specifies the change we want to make to a particular column. But we don’t want to change all of the values of the column, so we use WHERE
to specify a logical condition to identify the rows we want to change. A logical condition is a statement that is true on some rows and false on others, and the data update happens only on the rows for which the condition is true.
Suppose we want to change the name of the Charlotte Hornets back to the Charlotte Bobcats (sorry, Charlotte). We can use the following code:
myquery = """
UPDATE nba
SET basketballteam = 'Charlotte Bobcats'
WHERE city = 'Charlotte';
"""
engine.execute(myquery)
<sqlalchemy.engine.result.ResultProxy at 0x115432c88>
Here the query says to update values in the NBA table by changing basketballteam
to Charlotte Bobcats, but only when city
is Charlotte. This update now appears in the NBA data:
pd.read_sql_query("SELECT * FROM nba", con=engine)
city | basketballteam | |
---|---|---|
0 | Boston | Boston Celtics |
1 | New York | New York Knicks |
2 | Philadelphia | Philadelphia 76ers |
3 | Brooklyn | Brooklyn Nets |
4 | Toronto | Toronto Raptors |
5 | Cleveland | Cleveland Cavaliers |
6 | Chicago | Chicago Bulls |
7 | Detroit | Detroit Pistons |
8 | Milwaukee | Milwaukee Bucks |
9 | Indianapolis | Indiana Pacers |
10 | Atlanta | Atlanta Hawks |
11 | Washington | Washington Wizards |
12 | Orlando | Orlando Magic |
13 | Miami | Miami Heat |
14 | Los Angeles | {"L.A. Lakers","L.A. Clippers"} |
15 | San Francisco | Golden State Warriors |
16 | Portland | Portland Trailblazers |
17 | Sacramento | Sacramento Kings |
18 | Phoenix | Phoenix Suns |
19 | San Antonio | San Antonio Spurs |
20 | Dallas | Dallas Mavericks |
21 | Houston | Houston Rockets |
22 | Oklahoma City | Oklahoma City Thunder |
23 | Minneapolis | Minnesota Timberwolves |
24 | Denver | Denver Nuggets |
25 | Salt Lake City | Utah Jazz |
26 | Memphis | Memphis Grizzlies |
27 | New Orleans | New Orleans Pelicans |
28 | Seattle | Seattle Supersonics |
29 | Charlotte | Charlotte Bobcats |
7.5.3. Deleting Records#
Sometimes you might need to delete records from a database. These situations should be rare. If a record is no longer relevant for a particular use, it is always better to leave the record in the database and use another column to denote new information that can be used to filter records later. If there are mistakes in data entry, it’s better to edit existing records than to delete those records outright. If you must delete a record, the syntax to do so is
DELETE FROM table WHERE logicalcondition;
First specify the table, then the logical condition that identifies the rows you intend to delete.
In the teams database, suppose we want to delete the Baltimore Ravens (go Browns!) from the NFL table. The code to do that is:
myquery = """
DELETE FROM nfl WHERE city = 'Baltimore';
"""
engine.execute(myquery)
<sqlalchemy.engine.result.ResultProxy at 0x115445a58>
In this case, city = 'Baltimore'
identifies the rows we want to delete in the NFL table. The NFL data now no longer contains a row for the Ravens:
pd.read_sql_query("SELECT * FROM nfl", con=engine)
city | footballteam | |
---|---|---|
0 | Buffalo | Buffalo Bills |
1 | Miami | Miami Dolphins |
2 | Boston | New England Patriots |
3 | New York | {"New York Jets","New York Giants"} |
4 | Cleveland | Cleveland Browns |
5 | Cincinnati | Cincinnati Bengals |
6 | Pittsburgh | Pittsburgh Steelers |
7 | Kansas City | Kansas City Chiefs |
8 | Las Vegas | Las Vegas Raiders |
9 | Los Angeles | {"L.A. Chargers","L.A. Rams"} |
10 | Denver | Denver Broncos |
11 | Nashville | Tennessee Titans |
12 | Jacksonville | Jacksonville Jaguars |
13 | Houston | Houston Texans |
14 | Indianapolis | Indianapolis Colts |
15 | Philadelphia | Philadelphia Eagles |
16 | Dallas | Dallas Cowboys |
17 | Washington | Washington Skins |
18 | Atlanta | Atlanta Falcons |
19 | Charlotte | Carolina Panthers |
20 | Tampa Bay | Tampa Bay Buccaneers |
21 | New Orleans | New Orleans Saints |
22 | San Francisco | San Francisco 49ers |
23 | Phoenix | Arizona Cardinals |
24 | Seattle | Seattle Seahawks |
25 | Chicago | Chicago Bears |
26 | Green Bay | Green Bay Packers |
27 | Minneapolis | Minnesota Vikings |
28 | Detroit | Detroit Lions |
7.6. Cleaning and Manipulating Data with SQL Read Operations#
After using joins to combine data tables in the database, the data needs to be manipulated to make the data more convenient to use. That might involve narrowing down the data to a specific subset of interest, performing calculations on the data to generate new features, and changing the appearance of the data. In “Tidy Data”, Hadley Wickham defines four essential “verbs” of data manipulation:
Filter: subsetting or removing observations based on some condition.
Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).
Sort: changing the order of observations (p. 13).
In addition it may be necessary to pull only a selection of the columns into the output, or to change the names of the columns to more readable and useful ones. These operations can be performed within SQL read commands by using the WHERE
clause for filtering, mathematical operators to transform columns, the GROUP BY
syntax for aggregation, the ORDER BY
, ASC
, or DESC
clauses for sorting, and the AS
keyword for renaming columns.
7.6.1. Example: Wine Reviews#
To illustrate how to issue queries to read data while manipulating and cleaning the data, we will use the PostgreSQL version of the wine review database that we created in module 6. If you want to follow along with these example, follow the instructions in the “Using PostgreSQL” subsection of module 6 to get a local wine database running on your system.
For read operations, we can use the pd.read_sql_query()
function. For that, we first have to use sqlalchemy
to set up an engine that connects pandas
to the database:
engine = create_engine("postgresql+psycopg2://{user}:{pw}@localhost/{db}"
.format(user="jk8sd", pw=pgpassword, db="winedb"))
The logical ER diagram for the wine reviews database is
7.6.2. Selecting Columns#
SELECT
and FROM
are the primary SQL verbs for reading data. In many of the examples up to this points, we’ve issued queries like
SELECT * FROM table;
that pull all of the rows and all of the columns from a single data table. The *
character is called a wildcard character. When typed by itself, the wildcard captures all of the columns in a table. But sometimes we are interested in only a selection of the columns. In that case, we replace the wildcard with the columns we want to include in the output. The following syntax includes three columns from a specified data table:
SELECT col1, col2, col3 FROM table;
Suppose that I want to know the title, variety, price, points, country, and reviewer for all of the wines in the data. Title, variety, price, and points are all in the reviews table, country is in the locations table, and the reviewer (taster_name
) is in the tasters table. To produce the data I need to join these three tables while also using SELECT
to identify only the rows I am interested in. Inner joins are appropriate because every wine in the data has both a location and a reviewer.
The best way to select columns across multiple tables is to use aliasing, the same way we did for joins. In this case, if we alias the reviews table as r
, locations as l
, and tasters as t
, we can use these same aliases to inform SQL where to find each column in the SELECT
syntax.
The code to return this dataframe is:
myquery="""
SELECT r.title, r.variety, r.price, r.points, l.country, t.taster_name FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
INNER JOIN tasters t
ON r.taster_id = t.taster_id;
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price | points | country | taster_name | |
---|---|---|---|---|---|---|
0 | Olivier Leflaive 2006 Les Pucelles Premier Cru... | Chardonnay | NaN | 93 | France | Roger Voss |
1 | The Foundry 2004 Syrah (Coastal Region) | Syrah | 35.0 | 87 | South Africa | Susan Kostrzewa |
2 | Guilbaud Frères 2007 Le Soleil Nantais (Musca... | Melon | 11.0 | 87 | France | Roger Voss |
3 | Domaine du Clos du Fief 2007 Cuvée Tradition ... | Gamay | NaN | 86 | France | Roger Voss |
4 | Domaine Philippe Delesvaux 2005 La Montée de l... | Cabernet Sauvignon | NaN | 86 | France | Roger Voss |
... | ... | ... | ... | ... | ... | ... |
103722 | Sheridan Vineyard 2005 Reserve Cabernet Sauvig... | Cabernet Sauvignon | 75.0 | 94 | US | Paul Gregutt |
103723 | Woodward Canyon 2006 Old Vines Dedication Seri... | Cabernet Sauvignon | 84.0 | 94 | US | Paul Gregutt |
103724 | Chanson Père et Fils 2005 Champs Gains Premier... | Chardonnay | 115.0 | 93 | France | Roger Voss |
103725 | Mark Ryan 2006 Chardonnay (Columbia Valley (WA)) | Chardonnay | NaN | 93 | US | Paul Gregutt |
103726 | Joseph Drouhin 2007 Grands-Echezeaux | Pinot Noir | 285.0 | 94 | France | Roger Voss |
103727 rows × 6 columns
7.6.3. Logical Statements#
Most programming languages have the capacity to evaluate a statement as being either true or false, or true for some values and false for others. A logical statement uses logical operators that define how values should be compared. In SQL, logical statements are used in conjunction with the WHERE
statement to select the rows to include in the output.
For SQL, logical statements either compare a column to another column, or compare a column to one or more reference values. The following logical operators are available:
=
- is equal to?<
- is less than?>
- is greater than?<=
- is less than or equal to?>=
- is greater than or equal to?<>
- is not equal to?BETWEEN a AND b
- true if a value exists within the range from a to b, including a and bIN ('element1','element2','element3')
- true if a value is one of the elements in the given setNOT
- true if the rest of the logical statement is false, false if the rest of the logical statement is trueAND
- links separate logical statements together such that the overall statement is true only when all of the linked statements are trueOR
- links separate logical statements together such that the overall statement is true when any of the linked statements are trueLIKE pattern
- true if the string value matches the given pattern:LIKE '%%text'
captures all rows in which a given column ends with ‘text’LIKE 'text%%'
captures all rows in which a given column begins with ‘text’LIKE '%%text%%'
captures all rows in which a given column contains ‘text’ somewhere in its string value
()
- parts of the logical statement that are contained within parentheses are evaluated first
I will show examples of how to use these logical statements for filtering rows, in the next section.
7.6.4. Filtering Rows#
Suppose we wanted to know the title, the variety, and the price of the French wines that Roger Voss scored as 100. It’s a simple semantic sentence, but it connects to a more complicated set of SQL functions. First consider all of the columns we need to use to process the sentence:
title, from the reviews table
variety, from the reviews table
price, from the reviews table
French, a value of country, from the locations table
Roger Voss, a value of taster name, from the tasters table,
and 100, a value of points, from the reviews table.
Because we need to use data from the reviews, locations, and tasters tables, we need to inner join reviews, locations, and tasters.
But then on top of this join, we need to restrict both the columns and rows. We only want title, variety, and price in the final data, so we use SELECT
to keep only these columns.
To restrict the rows, we use WHERE
along with a logical condition. This logical condition has a few parts: we want wines in which country='France'
, taster_name='Roger Voss'
, and points=100
. All three conditions need to be true for us to want to keep the row, so we connect the three statements with AND
.
The SQL query that returns the title, the variety, and the price of the French wines that Roger Voss scored as 100 is:
myquery = """
SELECT r.title, r.variety, r.price FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
INNER JOIN tasters t
ON r.taster_id = t.taster_id
WHERE l.country='France' AND t.taster_name='Roger Voss' AND r.points=100;
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price | |
---|---|---|---|
0 | Krug 2002 Brut (Champagne) | Champagne Blend | 259.0 |
1 | Château Léoville Barton 2010 Saint-Julien | Bordeaux-style Red Blend | 150.0 |
2 | Louis Roederer 2008 Cristal Vintage Brut (Cha... | Champagne Blend | 250.0 |
3 | Salon 2006 Le Mesnil Blanc de Blancs Brut Char... | Chardonnay | 617.0 |
4 | Château Lafite Rothschild 2010 Pauillac | Bordeaux-style Red Blend | 1500.0 |
5 | Château Cheval Blanc 2010 Saint-Émilion | Bordeaux-style Red Blend | 1500.0 |
6 | Château Léoville Las Cases 2010 Saint-Julien | Bordeaux-style Red Blend | 359.0 |
7 | Château Haut-Brion 2014 Pessac-Léognan | Bordeaux-style White Blend | 848.0 |
I like my wine local or low-cost. So as another example, suppose we want the title, variety, price, points, country, and providence for all of the wines with scores of 90 or more that either cost between 5 and 10 dollars or are from Virginia. In this case, we need to join the reviews and locations data together, and write a logical statement that matches these specific conditions. The logical condition is
points >= 90 AND (price BETWEEN 5 AND 10 OR province = 'Virginia')
The entire SQL query is
myquery = """
SELECT r.title, r.variety, r.price, r.points, l.country, l.province FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
WHERE r.points >= 90 AND (r.price BETWEEN 5 AND 10 OR l.province = 'Virginia');
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price | points | country | province | |
---|---|---|---|---|---|---|
0 | Château Vircoulon 2016 Bordeaux Blanc | Bordeaux-style White Blend | 10.0 | 90 | France | Bordeaux |
1 | Mano A Mano 2011 Tempranillo (Vino de la Tierr... | Tempranillo | 9.0 | 90 | Spain | Central Spain |
2 | Aveleda 2014 Quinta da Aveleda Estate Bottled ... | Portuguese White | 9.0 | 90 | Portugal | Vinho Verde |
3 | Chateau Ste. Michelle 2011 Riesling (Columbia ... | Riesling | 9.0 | 91 | US | Washington |
4 | Quinta do Portal 2007 Mural Reserva Red (Douro) | Portuguese Red | 10.0 | 91 | Portugal | Douro |
... | ... | ... | ... | ... | ... | ... |
76 | Casaleiro 2012 Reserva Touriga Nacional-Castel... | Portuguese Red | 9.0 | 90 | Portugal | Tejo |
77 | Aveleda 2015 Quinta da Aveleda White (Vinho Ve... | Portuguese White | 9.0 | 90 | Portugal | Vinho Verde |
78 | Cookies & Cream 2010 Merlot (California) | Merlot | 10.0 | 90 | US | California |
79 | Lovingston 2012 Josie's Knoll Merlot (Monticello) | Merlot | 20.0 | 91 | US | Virginia |
80 | Aveleda 2016 Quinta da Aveleda White (Vinho Ve... | Portuguese White | 10.0 | 90 | Portugal | Vinho Verde |
81 rows × 6 columns
The parentheses in this last query are needed to ensure that the DBMS evaluates the OR
statement first. Without the parentheses,
points >= 90 AND price BETWEEN 5 AND 10 OR province = 'Virginia'
the DBMS evaluates the first two conditions first, then considers the third, so the statement is equivalent to
(points >= 90 AND price BETWEEN 5 AND 10) OR province = 'Virginia'
and it returns data with all wines that have scores of at least 90 and prices between 5 and 10 dollars, along with all wines from Virginia whether or not those wines have scores of at least 90:
myquery = """
SELECT r.title, r.variety, r.price, r.points, l.country, l.province FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
WHERE r.points >= 90 AND r.price BETWEEN 5 AND 10 OR l.province = 'Virginia';
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price | points | country | province | |
---|---|---|---|---|---|---|
0 | Veramar 2016 JB Winemaker Series Cabernet Fran... | Cabernet Franc | 34.0 | 86 | US | Virginia |
1 | Château Vircoulon 2016 Bordeaux Blanc | Bordeaux-style White Blend | 10.0 | 90 | France | Bordeaux |
2 | Mano A Mano 2011 Tempranillo (Vino de la Tierr... | Tempranillo | 9.0 | 90 | Spain | Central Spain |
3 | Trump 2014 Rosé (Monticello) | Rosé | 14.0 | 86 | US | Virginia |
4 | The Boneyard 2014 Chardonnay (Virginia) | Chardonnay | 15.0 | 86 | US | Virginia |
... | ... | ... | ... | ... | ... | ... |
444 | Annefield Vineyards 2009 Cabernet Franc (Virgi... | Cabernet Franc | 29.0 | 88 | US | Virginia |
445 | Lovingston 2012 Josie's Knoll Merlot (Monticello) | Merlot | 20.0 | 91 | US | Virginia |
446 | Aveleda 2016 Quinta da Aveleda White (Vinho Ve... | Portuguese White | 10.0 | 90 | Portugal | Vinho Verde |
447 | Tarara 2013 Cabernet Franc (Virginia) | Cabernet Franc | 25.0 | 85 | US | Virginia |
448 | Paradise Springs 2014 Nana's Rosé (Virginia) | Rosé | 22.0 | 86 | US | Virginia |
449 rows × 6 columns
Suppose I’m open to many wines, but I have a thing against wines from the U.S., and I don’t like Pinot Noir, Pinot Gris, or Chardonnay. I want to query the wines database to return data on the title, variety, country, and price of all of these wines except for the American ones and the ones I dislike. The SQL query requires joining the reviews and locations tables, and using negation in the logical statement with the <>
and NOT
operators, like this:
myquery = """
SELECT r.title, r.variety, r.price, l.country FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
WHERE country <> 'US' AND variety NOT IN ('Pinot Noir', 'Pinot Gris', 'Chardonnay');
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price | country | |
---|---|---|---|---|
0 | The Foundry 2004 Syrah (Coastal Region) | Syrah | 35.0 | South Africa |
1 | Guilbaud Frères 2007 Le Soleil Nantais (Musca... | Melon | 11.0 | France |
2 | Domaine du Clos du Fief 2007 Cuvée Tradition ... | Gamay | NaN | France |
3 | Domaine Philippe Delesvaux 2005 La Montée de l... | Cabernet Sauvignon | NaN | France |
4 | Georges Duboeuf 2007 Beaujolais-Villages | Gamay | NaN | France |
... | ... | ... | ... | ... |
57429 | Indomita NV Rosé Sparkling (Casablanca Valley) | Sparkling Blend | 18.0 | Chile |
57430 | Intipalka 2013 Valle del Sol Tannat (Ica) | Tannat | 14.0 | Peru |
57431 | Lobster Reef 2014 Sauvignon Blanc (Marlborough) | Sauvignon Blanc | 12.0 | New Zealand |
57432 | Millaman 2014 Estate Reserve Sauvignon Blanc (... | Sauvignon Blanc | 10.0 | Chile |
57433 | Royal Tokaji 1999 Mézes Mály Aszú 6 Puttonyos ... | Tokaji | 175.0 | Hungary |
57434 rows × 4 columns
If we want all of the columns from the reviews table for wines whose descriptions contain the words “smoke” and “chocolate” - taking case sensitivity into account by converting the descriptions to all lower case in the WHERE
clause so that “chocolate” and “Chocolate” are both matched - the following query returns those wines:
myquery = """
SELECT * FROM reviews
WHERE LOWER(description) LIKE '%%smoke%%' AND LOWER(description) LIKE '%%chocolate%%';
"""
pd.read_sql_query(myquery, con=engine)
wine_id | title | variety | description | points | price | taster_id | winery_id | location_id | |
---|---|---|---|---|---|---|---|---|---|
0 | 6792 | Guardian Peak 2006 Shiraz (Western Cape) | Shiraz | A gorgeous nose of plums, chocolate and red fr... | 89 | 15.0 | 16 | 7363 | 1141 |
1 | 7812 | Hightower 2006 Cabernet Sauvignon (Columbia Va... | Cabernet Sauvignon | Sourced largely from Red Mountain fruit, this ... | 87 | 35.0 | 3 | 7629 | 1474 |
2 | 7866 | Viña Cobos 2012 Bramare Marchiori Vineyard Mal... | Malbec | Toasty woodsmoke aromas are matched by wild be... | 94 | 90.0 | 5 | 13962 | 4 |
3 | 8110 | Mendel 2013 Unus Red (Mendoza) | Bordeaux-style Red Blend | Rich aromas of raisin, cassis and blackberry a... | 93 | 50.0 | 5 | 9746 | 7 |
4 | 9262 | Freakshow 2014 Cabernet Sauvignon (Lodi) | Cabernet Sauvignon | Rich, ripe and oaky, this full-bodied wine has... | 91 | 20.0 | 10 | 6893 | 1304 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
260 | 4862 | Hearst Ranch 2013 Lone Tree Cabernet Franc (Pa... | Cabernet Franc | Made in a thick, oaky style, this shows burned... | 87 | 35.0 | 8 | 7498 | 1337 |
261 | 4690 | Pear Valley 2013 Distraction Red (Paso Robles) | Bordeaux-style Red Blend | The signature bottling from this winery, this ... | 91 | 35.0 | 8 | 10762 | 1337 |
262 | 4794 | Estate Constantin Gofas 2008 Agiorgitiko (Nemea) | Agiorgitiko | This wine has the plucky character typical of ... | 85 | 18.0 | 16 | 6366 | 668 |
263 | 6574 | Fielding Hills 2006 RiverBend Vineyard Syrah (... | Syrah | Bold and forward, this estate-grown Syrah fair... | 94 | 40.0 | 3 | 6619 | 1483 |
264 | 5020 | Corliss Estates 2007 Cabernet Sauvignon (Colum... | Cabernet Sauvignon | Concentrated and wonderfully aromatic, this ag... | 94 | 75.0 | 3 | 4677 | 1474 |
265 rows × 9 columns
There are situations in which we want to display only some of the records that match a particular query. For that, we can use the LIMIT
and OFFSET
clauses. LIMIT
sets the number of records to extract, and OFFSET
set the starting row. For example, adding
LIMIT 10
to a query instructs the DBMS to extract only the first 10 rows of the output data. Adding
LIMIT 10 OFFSET 5
tells the DBMS to extract 10 rows, after first skipping the first 5 rows: so these clauses together return rows 6 through 15. To see the 4th through 7th rows from the previous query to the wine reviews database, we can type:
myquery = """
SELECT * FROM reviews
WHERE LOWER(description) LIKE '%%smoke%%' AND LOWER(description) LIKE '%%chocolate%%'
LIMIT 4 OFFSET 3;
"""
pd.read_sql_query(myquery, con=engine)
wine_id | title | variety | description | points | price | taster_id | winery_id | location_id | |
---|---|---|---|---|---|---|---|---|---|
0 | 8110 | Mendel 2013 Unus Red (Mendoza) | Bordeaux-style Red Blend | Rich aromas of raisin, cassis and blackberry a... | 93 | 50.0 | 5 | 9746 | 7 |
1 | 9262 | Freakshow 2014 Cabernet Sauvignon (Lodi) | Cabernet Sauvignon | Rich, ripe and oaky, this full-bodied wine has... | 91 | 20.0 | 10 | 6893 | 1304 |
2 | 9746 | Carmel 2013 Admon Vineyard Cabernet Sauvignon ... | Cabernet Sauvignon | This wine has offers aromas of dark plum and c... | 89 | 35.0 | 14 | 1925 | 694 |
3 | 10326 | De Martino 2009 Alto de Piedras Single Vineyar... | Carmenère | A big, earthy type of wine with a ton of ripen... | 90 | 45.0 | 5 | 4973 | 182 |
7.6.5. Sorting Data#
Sorting data refers to rearranging the rows of a dataframe. Sorting is a cosmetic thing to do to data because the order of the rows should not change the meaning of the data both in terms of storage (rearranging the rows should NOT change the meaning of each row), or for most analytical models (rearranging rows won’t change the parameter estimates from a linear regression, for example). But sorting is a way to visualize important characteristics about the data and to quickly see important records with maximum and minimum values of key features.
To sort the output data, use the ORDER BY
syntax within an SQL query. The general syntax for ORDER BY
is
ORDER BY column1, column2, column3
Writing more than one column is optional. If more than one column is entered, then the second column is used to break ties between rows that have the same value of the first column. If a third column is entered, it’s used to break ties between rows that have the same value for both the first and second columns. In addition, each column can be sorted in ascending or descending order by typing either ASC
(this is the default, so typing ASC
is optional, but useful for making the SQL code more readable) or DESC
immediately after the column name.
For example, what are the top rated wines from Virginia? And of these top rated wines, which ones are cheapest? To find out, we issue a query that joins the reviews and locations tables, filters the data to just wines from Virginia, narrows down the columns to just title, points, and price, and sorts first by points, then by price. We sort by points in descending order so the best wines appear first, and we sort by price in ascending order so that the cheapest wines appear first. The syntax for this query is:
myquery = """
SELECT r.title, r.points, r.price FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
WHERE province = 'Virginia'
ORDER BY points DESC, price ASC;
"""
pd.read_sql_query(myquery, con=engine)
title | points | price | |
---|---|---|---|
0 | King Family 2015 Orange Viognier (Monticello) | 92 | 35.0 |
1 | Lovingston 2012 Josie's Knoll Merlot (Monticello) | 91 | 20.0 |
2 | Lovingston 2015 Josie's Knoll Rotunda Red (Mon... | 90 | 20.0 |
3 | Barboursville Vineyards 2015 Reserve Cabernet ... | 90 | 25.0 |
4 | King Family 2012 Meritage (Monticello) | 90 | 31.0 |
... | ... | ... | ... |
374 | Narmada 2013 Reserve Cabernet Franc (Virginia) | 82 | 34.0 |
375 | Veramar 2009 Chardonnay (Virginia) | 81 | 18.0 |
376 | Bogati 2013 Black Label Club Fumé Blanc Sauvig... | 81 | 26.0 |
377 | Three Fox 2014 Calabrese Pinot Grigio (Middleb... | 81 | 28.0 |
378 | Winery at La Grange 2012 Cabernet Sauvignon (V... | 81 | 43.0 |
379 rows × 3 columns
7.6.6. Renaming Columns and Transforming Data Values#
Sometimes it is useful to rename the columns in the output data within a query. To rename a column, use the AS
syntax while referencing the columns in SELECT
. For example, we can load the title, variety, and points columns from the reviews table, but we can rename these columns name, type, and score respectively:
myquery = """
SELECT title AS name, variety AS type, points AS score FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
name | type | score | |
---|---|---|---|
0 | Casas del Bosque 2011 Reserva Sauvignon Blanc ... | Sauvignon Blanc | 86 |
1 | Marqués de Terán 2009 Selección Especial (Rioja) | Tempranillo | 86 |
2 | Maurice Ecard 2009 Bourgogne | Chardonnay | 86 |
3 | McGregor 2010 Semi-Dry Riesling (Finger Lakes) | Riesling | 86 |
4 | Jules Taylor 2009 Ballochdale Estate Pinot Noi... | Pinot Noir | 86 |
... | ... | ... | ... |
103722 | Glenora 2010 Gewürztraminer (Finger Lakes) | Gewürztraminer | 86 |
103723 | Hard Row To Hoe 2010 Marsanne (Yakima Valley) | Marsanne | 86 |
103724 | Animale 2009 Dolcetto (Columbia Valley (WA)) | Dolcetto | 86 |
103725 | Beresan 2008 The Buzz Yellow Jacket Vineyard R... | Red Blend | 86 |
103726 | Cabot Vineyards 2007 Syrah (Humboldt County) | Syrah | 86 |
103727 rows × 3 columns
In other situations, we might want to transform a column arithmetically or in another way. SQL supports the standard arithmetic operators: +
for addition, -
for subtraction, *
for multiplication, and /
for division. SQL also supports the modulo operator %
to return the remainder after division (16 % 5
equals 1, for example, because 16 divided by 5 yields a remainder of 1). SQL also allows the following arithmetic functions:
EXP(a)
- raisesa
the argument to the power of \(e = 2.718...\)POWER(a,b)
- raisesa
to the power ofb
LOG(a)
- takes the natural (base \(e\)) logarithm ofa
LOG10(a)
- takes the common (base 10) logarithm ofa
SQRT(a)
- takes the square root ofa
ABS(a)
- takes the absolute value ofa
CEILING(a)
- rounds values ofa
up to the next whole numberFLOOR(a)
- rounds values ofa
down to a whole numberROUND(a, k)
- rounds values ofa
up or down to the nearest number withk
decimalsSIGN(a)
- returns 1 ifa
is positive, -1 ifa
is negative, and 0 ifa
is 0
In addition, if you need them, there are many trigonometric functions built into standard SQL.
When using a function that operates on a column, it is important to use AS
to name the new column, as SQL has no way to choose a logical name automatically for constructed columns and uses ?column?
be default.
For example, we can convert the price of each wine from dollars to Euros by multiplying the price by the .91 USD to Euro exchange rate. We keep the original price in the query but rename it price_dollars
, and we name the converted price price_euros
:
myquery = """
SELECT title, variety, price AS price_dollars, .91*price AS price_euros FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price_dollars | price_euros | |
---|---|---|---|---|
0 | Casas del Bosque 2011 Reserva Sauvignon Blanc ... | Sauvignon Blanc | 12.0 | 10.92 |
1 | Marqués de Terán 2009 Selección Especial (Rioja) | Tempranillo | 24.0 | 21.84 |
2 | Maurice Ecard 2009 Bourgogne | Chardonnay | NaN | NaN |
3 | McGregor 2010 Semi-Dry Riesling (Finger Lakes) | Riesling | 18.0 | 16.38 |
4 | Jules Taylor 2009 Ballochdale Estate Pinot Noi... | Pinot Noir | 22.0 | 20.02 |
... | ... | ... | ... | ... |
103722 | Glenora 2010 Gewürztraminer (Finger Lakes) | Gewürztraminer | 15.0 | 13.65 |
103723 | Hard Row To Hoe 2010 Marsanne (Yakima Valley) | Marsanne | 18.0 | 16.38 |
103724 | Animale 2009 Dolcetto (Columbia Valley (WA)) | Dolcetto | 24.0 | 21.84 |
103725 | Beresan 2008 The Buzz Yellow Jacket Vineyard R... | Red Blend | 19.0 | 17.29 |
103726 | Cabot Vineyards 2007 Syrah (Humboldt County) | Syrah | 24.0 | 21.84 |
103727 rows × 4 columns
For no reason other than to demonstrate the use of the various mathematical functions, we can put many transformations of price in one dataframe:
myquery = """
SELECT price,
EXP(price/1000) as price_exp,
LOG(price) as price_natlog,
LOG10(price) as price_commonlog,
ROUND(LOG(price)) as price_loground,
SQRT(price) as price_sqrt,
POWER(price, 2) as price_squared,
POWER(price, 3) as price_cubed,
SIGN(price - 50) as price_morethan50
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
price | price_exp | price_natlog | price_commonlog | price_loground | price_sqrt | price_squared | price_cubed | price_morethan50 | |
---|---|---|---|---|---|---|---|---|---|
0 | 12.0 | 1.012072 | 1.079181 | 1.079181 | 1.0 | 3.464102 | 144.0 | 1728.0 | -1.0 |
1 | 24.0 | 1.024290 | 1.380211 | 1.380211 | 1.0 | 4.898979 | 576.0 | 13824.0 | -1.0 |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 18.0 | 1.018163 | 1.255273 | 1.255273 | 1.0 | 4.242641 | 324.0 | 5832.0 | -1.0 |
4 | 22.0 | 1.022244 | 1.342423 | 1.342423 | 1.0 | 4.690416 | 484.0 | 10648.0 | -1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103722 | 15.0 | 1.015113 | 1.176091 | 1.176091 | 1.0 | 3.872983 | 225.0 | 3375.0 | -1.0 |
103723 | 18.0 | 1.018163 | 1.255273 | 1.255273 | 1.0 | 4.242641 | 324.0 | 5832.0 | -1.0 |
103724 | 24.0 | 1.024290 | 1.380211 | 1.380211 | 1.0 | 4.898979 | 576.0 | 13824.0 | -1.0 |
103725 | 19.0 | 1.019182 | 1.278754 | 1.278754 | 1.0 | 4.358899 | 361.0 | 6859.0 | -1.0 |
103726 | 24.0 | 1.024290 | 1.380211 | 1.380211 | 1.0 | 4.898979 | 576.0 | 13824.0 | -1.0 |
103727 rows × 9 columns
Another useful operation for transforming columns in a read query is CASE
, which maps numeric values to categories. The syntax that uses CASE
within SELECT
is
SELECT CASE
WHEN logicalstatement1 THEN value1
WHEN logicalstatement2 THEN value2
WHEN logicalstatement3 THEN value3
ELSE value4
END AS name
This code evaluates each logical statement, and fills in the datapoint with the specified value if the logical statement is true. If more than one of the logical statements is true, then the first statement/value pair entered in takes precedence. If none of the logical statements are true, then the datapoint is filled in with the value listed with ELSE
. As before, it is important to name the new column with AS
.
For example, if we want to categorize wines as cheap when the price is under 20 dollars, moderately priced if the price is between 20 and 50 dollars, and expensive if the price is more than 50 dollars, we can type:
myquery="""
SELECT title, variety, price, CASE
WHEN price < 20 THEN 'cheap'
WHEN price BETWEEN 20 AND 50 THEN 'moderately priced'
WHEN price > 50 THEN 'expensive'
ELSE NULL
END AS price_level
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price | price_level | |
---|---|---|---|---|
0 | Casas del Bosque 2011 Reserva Sauvignon Blanc ... | Sauvignon Blanc | 12.0 | cheap |
1 | Marqués de Terán 2009 Selección Especial (Rioja) | Tempranillo | 24.0 | moderately priced |
2 | Maurice Ecard 2009 Bourgogne | Chardonnay | NaN | None |
3 | McGregor 2010 Semi-Dry Riesling (Finger Lakes) | Riesling | 18.0 | cheap |
4 | Jules Taylor 2009 Ballochdale Estate Pinot Noi... | Pinot Noir | 22.0 | moderately priced |
... | ... | ... | ... | ... |
103722 | Glenora 2010 Gewürztraminer (Finger Lakes) | Gewürztraminer | 15.0 | cheap |
103723 | Hard Row To Hoe 2010 Marsanne (Yakima Valley) | Marsanne | 18.0 | cheap |
103724 | Animale 2009 Dolcetto (Columbia Valley (WA)) | Dolcetto | 24.0 | moderately priced |
103725 | Beresan 2008 The Buzz Yellow Jacket Vineyard R... | Red Blend | 19.0 | cheap |
103726 | Cabot Vineyards 2007 Syrah (Humboldt County) | Syrah | 24.0 | moderately priced |
103727 rows × 4 columns
There’s an important point of caution when using CASE
. Unless you account for missing values explicity in your query, the missing values will be matched to the entered last in CASE
. That will corrupt the data. It is best practice to write conditions for the full set of observed values of a column, and to end the call to CASE
with ELSE NULL
, so that when none of the conditions apply, the new column is also missing.
There are also functions that apply to columns with character values:
LOWER(a)
- converts all characters ina
to lowercaseUPPER(a)
- converts all characters ina
to uppercaseINITCAP(a)
- converts the first letter of every word ina
to uppercaseCONCAT(a,b,c)
- appends the stringb
to the end ofa
, andc
(if included) to the end ofb
LENGTH(a)
- reports the number of characters in the stringa
SUBSTR(a, start, length)
- restricts the stringa
to a substring, beginning at the position denoted bystart
, and including the nextlength
charactersTRIM(a)
- removes spaces at the beginning and end of stringa
REPLACE(a, oldtext, newtext)
- searches values ofa
for occurrences ofoldtext
and replaces them withnewtext
For example, we can replace the descriptions in the reviews table with all capitals, all lower-case letters, or capitals for the first letter of each word:
myquery = """
SELECT title, variety, price,
UPPER(description) as description_upper,
LOWER(description) as description_lower,
INITCAP(description) as description_initcap
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price | description_upper | description_lower | description_initcap | |
---|---|---|---|---|---|---|
0 | Casas del Bosque 2011 Reserva Sauvignon Blanc ... | Sauvignon Blanc | 12.0 | IT'S PRETTY EASY PEGGING THIS FOR CHILEAN SB; ... | it's pretty easy pegging this for chilean sb; ... | It'S Pretty Easy Pegging This For Chilean Sb; ... |
1 | Marqués de Terán 2009 Selección Especial (Rioja) | Tempranillo | 24.0 | OPAQUE IN COLOR, WITH BLACKBERRY AND LICORICE ... | opaque in color, with blackberry and licorice ... | Opaque In Color, With Blackberry And Licorice ... |
2 | Maurice Ecard 2009 Bourgogne | Chardonnay | NaN | ATTRACTIVE RIPE FRUITS GO WITH LIME AND TOAST ... | attractive ripe fruits go with lime and toast ... | Attractive Ripe Fruits Go With Lime And Toast ... |
3 | McGregor 2010 Semi-Dry Riesling (Finger Lakes) | Riesling | 18.0 | SMOKY AND A BIT STARK WITH WHIFFS OF STRUCK FL... | smoky and a bit stark with whiffs of struck fl... | Smoky And A Bit Stark With Whiffs Of Struck Fl... |
4 | Jules Taylor 2009 Ballochdale Estate Pinot Noi... | Pinot Noir | 22.0 | SOURCED FROM A HIGH-ALTITUDE VINEYARD NEAR THE... | sourced from a high-altitude vineyard near the... | Sourced From A High-Altitude Vineyard Near The... |
... | ... | ... | ... | ... | ... | ... |
103722 | Glenora 2010 Gewürztraminer (Finger Lakes) | Gewürztraminer | 15.0 | SWEET ON THE NOSE, WITH SCENTS OF PINK GRAPEFR... | sweet on the nose, with scents of pink grapefr... | Sweet On The Nose, With Scents Of Pink Grapefr... |
103723 | Hard Row To Hoe 2010 Marsanne (Yakima Valley) | Marsanne | 18.0 | WAXY FRUIT FLAVORS OF PEACH, MELON AND BANANA,... | waxy fruit flavors of peach, melon and banana,... | Waxy Fruit Flavors Of Peach, Melon And Banana,... |
103724 | Animale 2009 Dolcetto (Columbia Valley (WA)) | Dolcetto | 24.0 | THIS ALCOHOLIC WINE (15.9%, AND YOU CAN TASTE ... | this alcoholic wine (15.9%, and you can taste ... | This Alcoholic Wine (15.9%, And You Can Taste ... |
103725 | Beresan 2008 The Buzz Yellow Jacket Vineyard R... | Red Blend | 19.0 | LIGHT FRUIT FLAVORS RUN FROM MELON INTO PALE S... | light fruit flavors run from melon into pale s... | Light Fruit Flavors Run From Melon Into Pale S... |
103726 | Cabot Vineyards 2007 Syrah (Humboldt County) | Syrah | 24.0 | FROM HUSBAND-AND-WIFE TEAM IN CALIFORNIA REDWO... | from husband-and-wife team in california redwo... | From Husband-And-Wife Team In California Redwo... |
103727 rows × 6 columns
We can use REPLACE()
to make the writing less artful, for example, by replacing the word “aroma” with “good smell” everywhere it appears in the wine descriptions. Note that REPLACE()
is case-sensitive, so it is a good idea to convert the values to a consistent case like lowercase first:
myquery = """
SELECT title, variety, price, description,
REPLACE(LOWER(description), 'aroma', 'good smell') as description_replace
FROM reviews
WHERE description LIKE '%%aroma%%';
"""
pd.read_sql_query(myquery, con=engine).description[0]
'Opaque in color, with blackberry and licorice aromas but also a distinct streak of brambly herbs and green. Later on, pine needle and tartness enter the fray. This is a commendable modern Rioja but it does have a few issues, namely a green herbal component.'
pd.read_sql_query(myquery, con=engine).description_replace[0]
'opaque in color, with blackberry and licorice good smells but also a distinct streak of brambly herbs and green. later on, pine needle and tartness enter the fray. this is a commendable modern rioja but it does have a few issues, namely a green herbal component.'
The SUBSTR()
function can be used to extract parts of a string. The following code reduces the description
column to substrings beginning at the 5th character and proceeding 10 characters in length:
myquery = """
SELECT title, variety, price, description,
SUBSTR(description, 5, 10) as description_substr
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price | description | description_substr | |
---|---|---|---|---|---|
0 | Casas del Bosque 2011 Reserva Sauvignon Blanc ... | Sauvignon Blanc | 12.0 | It's pretty easy pegging this for Chilean SB; ... | pretty ea |
1 | Marqués de Terán 2009 Selección Especial (Rioja) | Tempranillo | 24.0 | Opaque in color, with blackberry and licorice ... | ue in colo |
2 | Maurice Ecard 2009 Bourgogne | Chardonnay | NaN | Attractive ripe fruits go with lime and toast ... | active rip |
3 | McGregor 2010 Semi-Dry Riesling (Finger Lakes) | Riesling | 18.0 | Smoky and a bit stark with whiffs of struck fl... | y and a bi |
4 | Jules Taylor 2009 Ballochdale Estate Pinot Noi... | Pinot Noir | 22.0 | Sourced from a high-altitude vineyard near the... | ced from a |
... | ... | ... | ... | ... | ... |
103722 | Glenora 2010 Gewürztraminer (Finger Lakes) | Gewürztraminer | 15.0 | Sweet on the nose, with scents of pink grapefr... | t on the n |
103723 | Hard Row To Hoe 2010 Marsanne (Yakima Valley) | Marsanne | 18.0 | Waxy fruit flavors of peach, melon and banana,... | fruit fla |
103724 | Animale 2009 Dolcetto (Columbia Valley (WA)) | Dolcetto | 24.0 | This alcoholic wine (15.9%, and you can taste ... | alcoholic |
103725 | Beresan 2008 The Buzz Yellow Jacket Vineyard R... | Red Blend | 19.0 | Light fruit flavors run from melon into pale s... | t fruit fl |
103726 | Cabot Vineyards 2007 Syrah (Humboldt County) | Syrah | 24.0 | From husband-and-wife team in California redwo... | husband-a |
103727 rows × 5 columns
In the wine database, province and country are stored in separate columns in the locations table. If we wanted to put these two pieces of information together in one readable column, we can use CONCAT()
. In this example, I type CONCAT(l.province, ', ', l.country)
which appends three strings - the province from the locations table, a comma and a space, and the country from the locations table - and names the new column place
:
myquery = """
SELECT r.title, r.variety, r.price,
CONCAT(l.province, ', ', l.country) as place
FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id;
"""
pd.read_sql_query(myquery, con=engine)
title | variety | price | place | |
---|---|---|---|---|
0 | Casas del Bosque 2011 Reserva Sauvignon Blanc ... | Sauvignon Blanc | 12.0 | Casablanca Valley, Chile |
1 | Marqués de Terán 2009 Selección Especial (Rioja) | Tempranillo | 24.0 | Northern Spain, Spain |
2 | Maurice Ecard 2009 Bourgogne | Chardonnay | NaN | Burgundy, France |
3 | McGregor 2010 Semi-Dry Riesling (Finger Lakes) | Riesling | 18.0 | New York, US |
4 | Jules Taylor 2009 Ballochdale Estate Pinot Noi... | Pinot Noir | 22.0 | Marlborough, New Zealand |
... | ... | ... | ... | ... |
103722 | Glenora 2010 Gewürztraminer (Finger Lakes) | Gewürztraminer | 15.0 | New York, US |
103723 | Hard Row To Hoe 2010 Marsanne (Yakima Valley) | Marsanne | 18.0 | Washington, US |
103724 | Animale 2009 Dolcetto (Columbia Valley (WA)) | Dolcetto | 24.0 | Washington, US |
103725 | Beresan 2008 The Buzz Yellow Jacket Vineyard R... | Red Blend | 19.0 | Washington, US |
103726 | Cabot Vineyards 2007 Syrah (Humboldt County) | Syrah | 24.0 | California, US |
103727 rows × 4 columns
What are the shortest descriptions in the data? To find out we use the LENGTH()
function to count the number of characters in each description, and sort these lengths in ascending order:
myquery = """
SELECT title, points, price, description,
LENGTH(description) as length
FROM reviews
ORDER BY length ASC;
"""
pd.read_sql_query(myquery, con=engine)
title | points | price | description | length | |
---|---|---|---|---|---|
0 | Craggy Range 2007 Kidnappers Vineyard Chardonn... | 88 | 24.0 | Imported by Kobrand. | 20 |
1 | Chasing Venus 2007 Sauvignon Blanc (Marlborough) | 88 | 16.0 | Imported by JL Giguiere. | 24 |
2 | Philip Shaw 2007 No. 19 Sauvignon Blanc (Orange) | 86 | 20.0 | Imported by Lion Nathan USA. | 28 |
3 | Peconic Bay Winery 2001 Riesling (North Fork o... | 84 | 13.0 | Review not available at this time. | 34 |
4 | Mount Baker Vineyards 2006 Barrel Select Sangi... | 82 | 16.0 | Very light, could almost be a rosé. | 35 |
... | ... | ... | ... | ... | ... |
103722 | René Muré 2015 Clos Saint Landelin Vorbourg Gr... | 97 | 50.0 | The heady aromatic scent of fresh tangerine pe... | 698 |
103723 | Domaine Marcel Deiss 2009 Altenberg de Berghei... | 95 | 66.0 | Lifted notes of dried pear, dried chamomile fl... | 699 |
103724 | De Toren 2014 Book 17 XVII Red (Stellenbosch) | 95 | 330.0 | Only 95 cases were made of this Bordeaux-style... | 723 |
103725 | Domaine Ostertag 2015 Muenchberg Grand Cru Rie... | 97 | 66.0 | There is something incredibly fruity and simul... | 753 |
103726 | Saggi 2007 Red (Columbia Valley (WA)) | 91 | 45.0 | Dark, dusty, strongly scented with barrel toas... | 829 |
103727 rows × 5 columns
7.6.7. Data Aggregation#
If there are columns in the data output that have repeated values, then each distinct value forms a group in the data. Data aggregation is the process of collapsing the data to one row for each group, while summarizing other columns by taking the within-group mean, sum, count, or another statistic.
Aggregating data requires more attention to the ordering of the clauses in an SQL query than is necessary with other tasks. That is, certain clauses must be entered into the query in a particular order. An SQL query that aggregates data should follow this template:
SELECT aggregationfunctions FROM table
(any joins happen here)
(filtering rows with the WHERE clause happens here)
GROUP BY groupingcolumns
HAVING (a logical condition involving aggregation functions)
(sorting with ORDER BY happens here);
Let’s break down this template line by line. First,
SELECT aggregationfunctions FROM table
Aggregation functions work like the arithmetic functions described above. The difference is that instead of working with a single value, like SQRT()
and POWER()
do, aggregation functions work with vectors of data and generate summary statistics. They describe how existing columns should be summarized when the data are collapsed. The aggregation functions are:
COUNT(*)
- an overall count of the number of rows within each groupCOUNT(a)
- a count of the number of non-missing observations of columna
within each groupCOUNT(DISTINCT a)
- a count of the number of distinct observations of columna
within each groupAVG(a)
- the mean of the values ofa
within each groupSUM(a)
- the sum of the values ofa
within each groupMAX(a)
- the maximum value ofa
within each groupMIN(a)
- the minimum value ofa
within each groupVARIANCE(a)
andVAR_SAMP(a)
- the population and sample variances, respectively, of the values ofa
within each groupSTDDEV(a)
andSTDDEV_SAMP(a)
- the population and sample standard deviations, respectively, of the values ofa
within each group
Additional statistics, like the median, mode, and various quantiles are not included in standard SQL but are available in extensions that are specific to a DBMS, such as the quantile extension for PostgreSQL.
One important point about the aggregation functions is that, with the exception of COUNT()
, they ignore NULL values. Suppose that we have a data vector with values (1,3,8,NULL). Because we do not know the value of the fourth value, we cannot calculate the true sum and true mean of the values. However the SUM()
function in SQL ignores the NULL value and reports the sum as 12, which makes a strong tacit assumption that the NULL value is equal to 0. The AVG()
function calculates the mean from the observed values, and reports (1+3+8)/3 = 4, but this too makes a strong assumption that the NULL value is exactly 4. There are situations in which calculations from the non-NULL values are appropriate, but it is not correct to make broad claims from these summary statistics when there are missing values in the columns being summarized.
The next two lines in the template are placeholders for the syntax we use to join data tables and the syntax we use to filter rows with the WHERE
clause. There is a great deal of similarity between WHERE
and HAVING
, which we will discuss shortly.
The fourth line in the template,
GROUP BY groupingcolumns
is the key line for activating the aggregation functionality of SQL. groupingcolumns
can include one or more columns. If one column is listed, the unique values of that column define the groups that will comprise the rows of the output data. If there is more than one column listed, the unique combinations of values from the columns define the groups.
The fifth line in the template uses the HAVING
clause. HAVING
is very similar to WHERE
in that both use logical conditions to identify a selection of the rows to include in the output data. The difference between WHERE
and HAVING
is that WHERE
operates on rows in the original data prior to aggregation, and HAVING
works on rows after aggregation has occurred. One limitation of HAVING
is that it will not recognize new column names defined in SELECT
, so the same aggregation functions used in SELECT
need to be used again in the logical conditions for HAVING
. Finally, if we want to sort, we can include the ORDER BY
clause last.
For example, let’s find out which country produces wines with the highest average score. To do that, we need a query that joins reviews and locations together, includes country name, the average score across wines from that country, and for good measure, a count of the number of wines reviewed from that country. To collapse on country, we can use GROUP BY
, and to produce the average score and the count of wines we use the AVG()
and COUNT()
functions. For presentation purposes, I choose to round the average score to one decimal and to sort the rows from the highest to lowest average score, so that we can immediately see which countries produce the highest-rated wines. The query is:
myquery = """
SELECT l.country,
ROUND(AVG(points),1) as average_points,
COUNT(*) as numberofwines
FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
GROUP BY l.country
ORDER BY average_points DESC;
"""
pd.read_sql_query(myquery, con=engine)
country | average_points | numberofwines | |
---|---|---|---|
0 | England | 91.6 | 74 |
1 | India | 90.2 | 9 |
2 | Austria | 90.1 | 3337 |
3 | Germany | 89.9 | 2134 |
4 | Canada | 89.4 | 256 |
5 | Hungary | 89.2 | 145 |
6 | China | 89.0 | 1 |
7 | US | 89.0 | 37730 |
8 | France | 88.9 | 21828 |
9 | Italy | 88.8 | 11042 |
10 | Australia | 88.8 | 2037 |
11 | Luxembourg | 88.7 | 6 |
12 | None | 88.6 | 63 |
13 | Morocco | 88.6 | 28 |
14 | Switzerland | 88.6 | 7 |
15 | Israel | 88.5 | 500 |
16 | New Zealand | 88.3 | 1311 |
17 | South Africa | 88.2 | 1328 |
18 | Portugal | 88.2 | 5686 |
19 | Slovenia | 88.1 | 87 |
20 | Turkey | 88.1 | 90 |
21 | Bulgaria | 87.9 | 141 |
22 | Georgia | 87.7 | 86 |
23 | Lebanon | 87.7 | 35 |
24 | Armenia | 87.5 | 2 |
25 | Serbia | 87.5 | 12 |
26 | Czech Republic | 87.3 | 12 |
27 | Greece | 87.3 | 466 |
28 | Spain | 87.3 | 6581 |
29 | Moldova | 87.2 | 59 |
30 | Croatia | 87.2 | 73 |
31 | Cyprus | 87.2 | 11 |
32 | Slovakia | 87.0 | 1 |
33 | Uruguay | 86.8 | 109 |
34 | Macedonia | 86.8 | 12 |
35 | Argentina | 86.7 | 3797 |
36 | Bosnia and Herzegovina | 86.5 | 2 |
37 | Chile | 86.5 | 4361 |
38 | Romania | 86.4 | 120 |
39 | Mexico | 85.3 | 65 |
40 | Brazil | 84.7 | 52 |
41 | Ukraine | 84.1 | 14 |
42 | Egypt | 84.0 | 1 |
43 | Peru | 83.6 | 16 |
So the country that produces the best wine is … England? Wait, that can’t be right. Looking at the results, I see that some of the countries only have a small number of wines reviewed. China, for example, only has one wine review in the database, so we definitely should not put as much confidence in China’s mean score as we can for countries with many more reviews like France and the U.S. For a more fair comparison, let’s restrict the output to only those countries with at least 500 wines in the database. To filter rows on this condition, we use HAVING
and not WHERE
because the condition involves an aggregation function - specifically the count of the number of wines per country. We can rerun the previous query, including the HAVING
clause:
myquery = """
SELECT l.country,
ROUND(AVG(points),1) as average_points,
COUNT(*) as numberofwines
FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
GROUP BY l.country
HAVING COUNT(*) >= 500
ORDER BY average_points DESC;
"""
pd.read_sql_query(myquery, con=engine)
country | average_points | numberofwines | |
---|---|---|---|
0 | Austria | 90.1 | 3337 |
1 | Germany | 89.9 | 2134 |
2 | US | 89.0 | 37730 |
3 | France | 88.9 | 21828 |
4 | Italy | 88.8 | 11042 |
5 | Australia | 88.8 | 2037 |
6 | Israel | 88.5 | 500 |
7 | New Zealand | 88.3 | 1311 |
8 | Portugal | 88.2 | 5686 |
9 | South Africa | 88.2 | 1328 |
10 | Spain | 87.3 | 6581 |
11 | Argentina | 86.7 | 3797 |
12 | Chile | 86.5 | 4361 |
Suppose we were interested in ranking the countries according to their scores for a particular type of wine. To filter rows from the original data, we use a WHERE
clause prior to GROUP BY
. If we write WHERE r.variety = 'Riesling'
prior to GROUP BY
, the DBMS first extracts only the rows from the reviews table that refer to Riesling wines, then proceeds with the rest of the query. The following code ranks the countries based on their average scores for Rieslings, given at least 100 Rieslings from that country in the database:
myquery = """
SELECT l.country,
ROUND(AVG(points),1) as average_points,
COUNT(*) as numberofwines
FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
WHERE r.variety = 'Riesling'
GROUP BY l.country
HAVING COUNT(*) >= 100
ORDER BY average_points DESC;
"""
pd.read_sql_query(myquery, con=engine)
country | average_points | numberofwines | |
---|---|---|---|
0 | Austria | 91.4 | 581 |
1 | France | 90.5 | 691 |
2 | Germany | 90.1 | 1768 |
3 | Australia | 89.4 | 111 |
4 | US | 88.1 | 1600 |
Sometimes the groups in the data are formed by more than one column. In that case, simply add the second column name to the GROUP BY
clause. For example, if we wanted to know the top rated combination of country and variety (with a minimum of 50 wines for that combination), we can use the following code:
myquery = """
SELECT l.country, r.variety,
ROUND(AVG(points),1) as average_points,
COUNT(*) as numberofwines
FROM reviews r
INNER JOIN locations l
ON r.location_id = l.location_id
GROUP BY l.country, r.variety
HAVING COUNT(*) >= 50
ORDER BY average_points DESC;
"""
pd.read_sql_query(myquery, con=engine)
country | variety | average_points | numberofwines | |
---|---|---|---|---|
0 | France | Tannat | 91.5 | 59 |
1 | Austria | Riesling | 91.4 | 581 |
2 | South Africa | Bordeaux-style Red Blend | 90.6 | 84 |
3 | France | Riesling | 90.5 | 691 |
4 | Austria | Chardonnay | 90.4 | 62 |
... | ... | ... | ... | ... |
190 | Argentina | Chardonnay | 84.9 | 295 |
191 | Spain | Rosé | 84.9 | 149 |
192 | Portugal | Rosé | 84.6 | 235 |
193 | Spain | Rosado | 84.6 | 71 |
194 | Argentina | Sauvignon Blanc | 84.3 | 78 |
195 rows × 4 columns
7.6.8. Subqueries#
There are situations in which it makes sense to use data aggregation techniques to generate new columns filled with group-level summary statistics, then to place these summary statistics into the original data table. To do this work, we can use subqueries. A subquery is a full SQL query that is used inside another SQL query. There are three types of subquery:
Subqueries, like the ones for the mean and standard deviation above, that yield a single datapoint. These subqueries can be used anywhere we might write a value in the query, such as when defining new columns in
SELECT
or filtering rows withWHERE
orHAVING
.Subqueries that yield a list of values that can be used inside logical statements that include the
IN
operator.Subqueries that yield a data table that can be joined to existing data tables.
Suppose for example that we wanted to generate a Z-score standardized version of the wine review points. A Z-score subtracts the mean of a column from every value in the column, then divides every value by the standard deviation of the column. When a Z-score equals 1, it means that the original value is one standard deviation above the mean of the column. To calculate the Z-score, we need to calculate the mean of the points column,
myquery = """
SELECT AVG(points) FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
avg | |
---|---|
0 | 88.612107 |
and the sample standard deviation of the points column,
myquery = """
SELECT STDDEV_SAMP(points) FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
stddev_samp | |
---|---|
0 | 2.955039 |
We can type these values in manually with a query that looks like:
SELECT title, variety, points,
(points - 88.612107))/(2.955039) as points_z
FROM reviews;
The problem with typing these values in by hand is that it is easy to make a mistake and accidentially corrupt the points_z
column because of a typo. Also these values will have to be changed by hand every time the data inside the database is updated. A better solution is to have SQL do the work of calculating the mean and standard deviation for us by using subqueries. All we need to do is replace the values with the queries (contained in parentheses) that generate those single values. In this case, the query is
myquery = """
SELECT title, variety, points,
(points - (SELECT AVG(points) FROM reviews))/(SELECT STDDEV(points) FROM reviews) as points_z
FROM reviews;
"""
pd.read_sql_query(myquery, con=engine)
title | variety | points | points_z | |
---|---|---|---|---|
0 | Casas del Bosque 2011 Reserva Sauvignon Blanc ... | Sauvignon Blanc | 86 | -0.88395 |
1 | Marqués de Terán 2009 Selección Especial (Rioja) | Tempranillo | 86 | -0.88395 |
2 | Maurice Ecard 2009 Bourgogne | Chardonnay | 86 | -0.88395 |
3 | McGregor 2010 Semi-Dry Riesling (Finger Lakes) | Riesling | 86 | -0.88395 |
4 | Jules Taylor 2009 Ballochdale Estate Pinot Noi... | Pinot Noir | 86 | -0.88395 |
... | ... | ... | ... | ... |
103722 | Glenora 2010 Gewürztraminer (Finger Lakes) | Gewürztraminer | 86 | -0.88395 |
103723 | Hard Row To Hoe 2010 Marsanne (Yakima Valley) | Marsanne | 86 | -0.88395 |
103724 | Animale 2009 Dolcetto (Columbia Valley (WA)) | Dolcetto | 86 | -0.88395 |
103725 | Beresan 2008 The Buzz Yellow Jacket Vineyard R... | Red Blend | 86 | -0.88395 |
103726 | Cabot Vineyards 2007 Syrah (Humboldt County) | Syrah | 86 | -0.88395 |
103727 rows × 4 columns
Suppose that we wanted to restrict the rows to only the wines from wineries with at least 100 wines in the data. The problem is we don’t know which wineries have at least 100 reviewed wines. But we can find out with a query:
myquery = """
SELECT winery_id FROM reviews
GROUP BY winery_id
HAVING COUNT(*) >= 100;
"""
pd.read_sql_query(myquery, con=engine)
winery_id | |
---|---|
0 | 9112 |
1 | 4562 |
2 | 9557 |
3 | 13540 |
4 | 13381 |
5 | 1547 |
6 | 4257 |
7 | 13118 |
8 | 2007 |
9 | 7060 |
10 | 224 |
11 | 9995 |
12 | 8022 |
13 | 5076 |
14 | 4586 |
15 | 2375 |
16 | 14401 |
17 | 12042 |
18 | 9111 |
19 | 4124 |
This query gives us a list of winery ID numbers that match the wineries with at least 100 reviewed wines in the data. We can now use this list inside another query that restricts the reviews data to only the wines from this list of wineries:
myquery = """
SELECT winery_id, title, variety, points, price FROM reviews
WHERE winery_id IN (
SELECT winery_id FROM reviews r
GROUP BY winery_id
HAVING COUNT(*) >= 100
);
"""
pd.read_sql_query(myquery, con=engine)
winery_id | title | variety | points | price | |
---|---|---|---|---|---|
0 | 14401 | Wines & Winemakers 2012 Pegos Claros Colheita ... | Castelão | 87 | 15.0 |
1 | 14401 | Wines & Winemakers 2013 Lua Cheia em Vinhas Ve... | Portuguese Red | 87 | 18.0 |
2 | 14401 | Wines & Winemakers 2013 Lua Cheia em Vinhas Ve... | Portuguese Red | 87 | 12.0 |
3 | 4124 | Chateau Ste. Michelle 2008 Syrah (Columbia Val... | Syrah | 87 | 13.0 |
4 | 4586 | Concha y Toro 2010 Gravas del Maipo Syrah (Mai... | Syrah | 91 | 200.0 |
... | ... | ... | ... | ... | ... |
2742 | 13381 | Trapiche 2016 Pure Malbec (Uco Valley) | Malbec | 88 | 15.0 |
2743 | 9111 | Louis Jadot 2005 La Dominode Premier Cru (Sav... | Pinot Noir | 90 | 37.0 |
2744 | 9995 | Montes 2009 Limited Selection Pinot Noir (Casa... | Pinot Noir | 89 | 20.0 |
2745 | 4124 | Chateau Ste. Michelle 2012 Canoe Ridge Estate ... | Chardonnay | 89 | 22.0 |
2746 | 14401 | Wines & Winemakers 2015 Nostalgia Alvarinho (V... | Alvarinho | 88 | 23.0 |
2747 rows × 5 columns
Suppose we wanted a data table with the top rated wine from each winery. The problem is we don’t know which wine is the top rated for each winery. Again, we can find out with a query that groups the reviews data by winery ID and uses the MAX()
aggregation function to identify the maximum score achieved for any wine from that winery:
myquery = """
SELECT winery_id, MAX(points) as maxpoints
FROM reviews
GROUP BY winery_id;
"""
pd.read_sql_query(myquery, con=engine)
winery_id | maxpoints | |
---|---|---|
0 | 11233 | 91 |
1 | 4790 | 95 |
2 | 3936 | 87 |
3 | 12502 | 87 |
4 | 5468 | 92 |
... | ... | ... |
14566 | 4035 | 89 |
14567 | 9180 | 92 |
14568 | 4827 | 94 |
14569 | 790 | 83 |
14570 | 10896 | 91 |
14571 rows × 2 columns
Suppose that this last table already existed in the database with the name “bestscores”. We would be able to join bestscores and reviews, then filter the rows to only those wines whose scores are equal to the maximum scores achieved by the winery with the following code:
SELECT r.title, r.variety, r.points, r.price FROM reviews r
INNER JOIN bestscores b
ON r.winery_id = b.winery_id
WHERE r.points = b.maxpoints;
But because we do not have a table named “bestscores”, we can replace the reference to this table with the subquery that generates this table:
myquery = """
SELECT r.title, r.variety, r.points, r.price FROM reviews r
INNER JOIN (
SELECT winery_id, MAX(points) as maxpoints
FROM reviews
GROUP BY winery_id) b
ON r.winery_id = b.winery_id
WHERE r.points = b.maxpoints;
"""
pd.read_sql_query(myquery, con=engine)
title | variety | points | price | |
---|---|---|---|---|
0 | Marqués de Terán 2009 Selección Especial (Rioja) | Tempranillo | 86 | 24.0 |
1 | Alessandro Veglio 2011 Gattera (Barolo) | Nebbiolo | 87 | NaN |
2 | Pingao 2013 Rioja | Tempranillo | 87 | 13.0 |
3 | Chateau Walla Walla 2008 Syrah (Walla Walla Va... | Syrah | 87 | 40.0 |
4 | Sweet Valley 2008 Cabernet Sauvignon (Walla Wa... | Cabernet Sauvignon | 87 | 35.0 |
... | ... | ... | ... | ... |
20643 | Kicker Cane 2014 Cabernet Sauvignon (Alexander... | Cabernet Sauvignon | 88 | 20.0 |
20644 | Tenuta Grimani 2015 Farinaldo (Soave) | Garganega | 88 | NaN |
20645 | Vin Vault NV Cabernet Sauvignon (California) | Cabernet Sauvignon | 88 | 20.0 |
20646 | Dachshund NV Bubbles Sparkling (Germany) | Sparkling Blend | 88 | 17.0 |
20647 | Domaine Guillot-Broux 2009 Beaumont (Mâcon-Cr... | Pinot Noir | 88 | NaN |
20648 rows × 4 columns
Finally, once we are finished working with the PostgreSQL server, we close it:
dbserver.close()
7.7. SQL Security#
One criticism of SQL is that it can be manipulated to give unauthorized and hostile users access to perform CRUD operations on the data. This kind of attack is called an SQL injection attack, and the best illustration of this attack comes from an XKCD web comic:
The following discussion borrows heavily from this blog’s explaination of this XKCD comic.
An SQL injection attack starts with an entry field on the user interface of an application that works with a database, like a field where users write their name. The application reads the name and creates an SQL INSERT operation to create a new record in the data with the name. If I were to enter Jonathan into the name field, the app should generate an SQL command that looks like this:
INSERT INTO Students (firstname) VALUES ('Jonathan');
This command specifically places the value “Jonathan” into the firstname
attribute of the Students
entity. In SQL, different commands are separated by semicolons, so if I wanted to issue two SQL commands I could type:
INSERT INTO Students (firstname) VALUES ('Jonathan'); INSERT INTO Students (lastname) VALUES ('Kropko');
An SQL injection attack works by writing SQL code in a field that is designed to collect data to input into a database. So If I type my name as Jonathan'); DROP TABLE Students; --;
, then the SQL create operation becomes
INSERT INTO Students (firstname) VALUES ('Jonathan'); DROP TABLE Students; --;');
This line consists of three commands
INSERT INTO Students (firstname) VALUES ('Jonathan');
which inputs “Jonathan” into the database,DROP TABLE Students;
which deletes the entireStudents
table, and--;');
: the--
symbol is an SQL comment, and tells the parser to ignore the remainder of the code, which would avoid a parsing error.
So just by inserting specific code into a seemingly innocuous field, like name, I can delete the entire Students
entity in the database.
There are two ways to combat SQL injection attacks. First, it is possible to “sanitize” database inputs by using code that automatically places a slash before a single quote. That puts an escape character in front of the quote, which makes it part of the input string and prevents it from being read as the end of the input string. Another approach is to use prepared statements when converting user-entered data into an SQL query. A prepared statement uses placeholders to stand in for the user-supplied data, and treats the data like input into a function: treating the user data this way prevents the entire SQL query from being read as a single string, and prevents SQL injection. For example, instead of inputing the name directly into the query, the database manager can construct the query in Python code (where a database cursor exists and is named curs
) like this:
cmd = "INSERT INTO Students (firstname) VALUES (%s)"
curs.execute(cmd, (name,))
In MySQL and PostgreSQL, %s
stands in for a parameter to be passed into the query (in SQlite, the stand-in symbol is ?
instead of %s
). Constructing a query in this way prevents SQL injection attacks. More information about formatting secure SQL code is available at https://bobby-tables.com/, named in honor of this XKCD comic.
As a data scientist mostly issuing read operations, it is unethical for you to attack a database in this way. If you are testing whether a database is secured against SQL injection attacks, don’t try to issue any DROP
commands as other commands like SELECT
will reveal the insecurity but won’t make changes in the database. If you are building a database that is connected to an interface for users to enter data, please be aware of the SQL injection vulnerability and use prepared statements to guard against it.
7.8. MongoDB Queries#
SQL is a universal language for issuing queries to relational databases, whether the database is managed by SQLite, MySQL, PostgreSQL, or another RDBMS. For NoSQL databases, however, there is no universal query language. Every DBMS has its own query language, and will provide a guide for learning that language. Some of these guides include the ones for key-value stores in Redis and wide column stores in Cassandra. Neo4j has developed a programming language called Cypher that is explicitly for issuing queries to graph databases. Of all of these query protocols, the language used by MongoDB for issuing queries to document stores is one of the most universal because it works entirely with JSONs: queries are written in JSON format and the output is organized in JSON format. All of these query languages include methods for all of the CRUD operations.
The most important difference between relational and NoSQL databases is the rigidity of the schema that organizes the data. The advantage of the strict organization of a relational database, as illustrated in an ER diagram, is that the data that can be extracted from the database using an SQL query will be clean and mostly immediately ready to be analyzed. The disadvantage is that relational databases have schema that are hard to change once they’ve been created and populated with data. SQL also, despite the best intentions of the originators of SQL, can be very difficult for people use for some tasks. For extremely large datasets with many tables, it can be extremely difficult to keep track of what data exists in which table. In contrast, NoSQL databases generally have flexible schema that can be changed easily and can vary even from record to record. There are no rules, like the normalization rules, that require that the data be split into different tables, so there is no need for visual maps like ER diagrams. Also, because all of the data for one record exists in the same JSON dictionary, it is easy to use remote, distributed storage to store all of the records. The disadvantage of NoSQL databases is that the data are rarely ready for analysis after a query. It’s a buy-now-pay-later situation: the price we pay for the convenience of NoSQL storage and organization is that the output requires more work to use.
Some concepts that are crucial to SQL are not relevant to NoSQL. There are no joins in a document store because all the data for a record exist in the same JSON code. As such, we don’t have to worry about accomplishing these tasks within a NoSQL query. NoSQL queries in general focus narrowly on the CRUD operations, although MongoDB provides some advanced functionality for searching for patterns within text and ranking documents based on their relevance to given search terms.
For the following examples, I will use the document store database that we created in module 6, containing the same data on wine reviews that we practiced with above, only in JSON format. First I load the pymongo
package and the dumps()
and loads()
functions from the json_util
module of the bson
package:
import pymongo
from bson.json_util import dumps, loads
The wine reviews database is stored as a collection winecollection
within the winedb
database on my local machine. I load it with the following code:
myclient = pymongo.MongoClient("mongodb://localhost/")
winedb = myclient["winedb"]
winecollection = winedb["winecollection"]
We will discuss more advanced read techniques below, but to see one record, we can issue a query using JSON code and we can see the output in JSON format. To see all of the data for the “Nicosia 2013 Vulkà Bianco”, we search for the record based on the title of this wine with the following code:
myquery = { 'title': 'Nicosia 2013 Vulkà Bianco (Etna)'}
mywine = winecollection.find(myquery)
for x in mywine:
print(x)
{'_id': ObjectId('5ed80dbca25fcf746119e3aa'), 'wine_id': 0, 'country': 'Italy', 'description': "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.", 'points': 87, 'price': None, 'province': 'Sicily & Sardinia', 'region': 'Etna', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'title': 'Nicosia 2013 Vulkà Bianco (Etna)', 'variety': 'White Blend', 'winery': 'Nicosia'}
Notice that all of the data for this wine exists in this JSON dictionary, including data from the reviews, locations, wineries, and tasters tables in the PostgreSQL database. When we created this MongoDB database, the DBMS automatically created a unique ID for each record designated with the key _id
.
We can now use the methods in pymongo
for creating, reading, updating, and deleting records and we will apply these methods to the winecollection
variable that accesses the data.
7.8.1. Creating and Deleting Records#
Did you know that former NBA all-star Dwyane Wade has a winery? It’s called D Wade Cellars and it is based in the Napa Valley in California. Let’s add the 2016 Napa Valley Three By Wade Red Blend into the database. The first step is to express all of the data we want to associate with a new record in JSON format:
dwadewine = {'title': '2016 Napa Valley Three By Wade Red Blend',
'description': "This wine goes great with dinner just like Dwyane Wade goes great with LeBron James or Shaq.",
'taster_name': 'Jonathan Kropko',
'taster_twitter_handle': '@jmk5131',
'price': '35',
'variety': 'Red Blend',
'location':{
'region_1': 'Napa Valley',
'region_2': None,
'province': 'California',
'country': 'U.S.',
'winery': 'D Wade Cellars'}}
In creating this JSON record, I tried to follow the standards that exist elsewhere in the data by using the same feature names. I departed from the format of other records in two ways. First, I omitted the points and designation features. Second, I placed all the information about the location and name of the winery under the “location” key, which induces some nesting structure.
To add this one record to the database, I use the .insert_one()
method on the winecollection
database, with the following code:
winecollection.insert_one(dwadewine)
By default, the insert_one()
method automatically checks to see whether the record already exists in the data, and throws an error if it does, unless we specify the bypass_document_validation=True
argument, which allows duplicate records to be input into the database. For the purposes of this notebook, I rerun these cells many times while writing, and I don’t want to place many duplicate records into the database. Instead, I can delete the record if it already exists. The code
winecollection.count_documents({'title': '2016 Napa Valley Three By Wade Red Blend'})
generates a count of the records of wines in the database that have this title. If there are any existing records, I can delete all of these records with the .delete_many()
method, in which the argument is a JSON with enough fields specified to exactly match the records we want to delete:
winecollection.delete_many({'title': '2016 Napa Valley Three By Wade Red Blend'})
In constrast, the .delete_one()
method will only delete the first record, when sorting by _id
, that matches the query. If there are no documents that match the query, he .delete_all()
or .delete_one()
methods will both still process the query without error, but will not change anything in the database.
We first delete any records of wines with the title “2016 Napa Valley Three By Wade Red Blend” with .delete_all()
, then we insert the entire record of this wine with .insert_one()
:
winecollection.delete_many({'title': '2016 Napa Valley Three By Wade Red Blend'})
winecollection.insert_one(dwadewine)
<pymongo.results.InsertOneResult at 0x12b264a48>
Now that this record exists in the database, we can find this record by any of the fields associated with the record, such as the title of the wine for example:
myquery = {'title': '2016 Napa Valley Three By Wade Red Blend'}
mywine = winecollection.find(myquery)
for x in mywine:
print(x)
{'_id': ObjectId('5edd56e5b4e58ce3841e5dea'), 'title': '2016 Napa Valley Three By Wade Red Blend', 'description': 'This wine goes great with dinner just like Dwyane Wade goes great with LeBron James or Shaq.', 'taster_name': 'Jonathan Kropko', 'taster_twitter_handle': '@jmk5131', 'price': '35', 'variety': 'Red Blend', 'location': {'region_1': 'Napa Valley', 'region_2': None, 'province': 'California', 'country': 'U.S.', 'winery': 'D Wade Cellars'}}
Note that MongoDB automatically generates a unique ID value for this document and includes it under the _id
field in the JSON output.
Wikipedia lists 93 other celebrities other than Dwyane Wade who own wineries and vineyards, including Antonio Banderas, Drew Barrymore, and Lil Jon. If we want to add more than one record to the wine collection database, we need to create a list of individual JSON dictionaries with code that looks like
newrecords = [{JSON dictionary 1}, {JSON dictionary 2}, {JSON dictionary 3}]
In this case, I can create entries for Antonio Banderas, Drew Barrymore, and Lil Jon’s wines and store them in one list:
newwines = [{'title': 'Anta Banderas A 10 2008',
'description': "This wine will make you speak differently. Maybe not with a charming Spanish accent, but you might think you sound that way.",
'taster_name': 'Jonathan Kropko',
'taster_twitter_handle': '@jmk5131',
'price': '40.99',
'variety': 'Red Blend',
'location':{
'region_1': 'Ribera del Duoro',
'region_2': None,
'province': 'Valladolid',
'country': 'Spain',
'winery': 'Anta Banderas'}},
{'title': 'Barrymore Rose 2013',
'description': "Someone drank my entire bottle of wine!",
'taster_name': 'Jonathan Kropko',
'taster_twitter_handle': '@jmk5131',
'price': '14.99',
'variety': 'Rose',
'location':{
'region_1': 'Monterey',
'region_2': None,
'province': 'California',
'country': 'U.S.',
'winery': 'Barrymore Vineyard'}},
{'title': '2006 Little Jonathan Winery Cabernet Sauvignon',
'description': "This upscale crunk juice is OOOKAAAAAAY.",
'taster_name': 'Jonathan Kropko',
'taster_twitter_handle': '@jmk5131',
'variety': 'Cabernet Sauvignon',
'location':{
'region_1': 'Central Coast',
'region_2': 'Paso Robles',
'province': 'California',
'country': 'U.S.',
'winery': 'Little Jonathan Winery'}}]
To add these three records to the database with one line of code, we use the .insert_many()
method. To avoid duplicates, we first delete any records of wines titled “Anta Banderas A 10 2008”, “Barrymore Rose 2013”, or “2006 Little Jonathan Winery Cabernet Sauvignon”:
winecollection.delete_many({'title': 'Anta Banderas A 10 2008'})
winecollection.delete_many({'title': 'Barrymore Rose 2013'})
winecollection.delete_many({'title': '2006 Little Jonathan Winery Cabernet Sauvignon'})
winecollection.insert_many(newwines)
<pymongo.results.InsertManyResult at 0x1268132c8>
7.8.2. Reading Data and Selecting Records#
To read all of the records in a MongoDB collection, use the .find()
method and pass an empty JSON dictionary to this method. For the wine reviews collection, we can query the entire collection by typing
myquery = winecollection.find({})
Here, the queried data exist within the variable cursor
. The data are not displayed automatically. To see the data in JSON format, we can employ the print()
function on elements of the cursor. To see the first element:
print(myquery[0])
{'_id': ObjectId('5ed80dbca25fcf746119e3aa'), 'wine_id': 0, 'country': 'Italy', 'description': "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.", 'points': 87, 'price': None, 'province': 'Sicily & Sardinia', 'region': 'Etna', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'title': 'Nicosia 2013 Vulkà Bianco (Etna)', 'variety': 'White Blend', 'winery': 'Nicosia'}
And to see more elements, we can use a loop. Here’s code to view the first three wines:
for i in myquery[0:3]:
print(i)
{'_id': ObjectId('5ed80dbca25fcf746119e3aa'), 'wine_id': 0, 'country': 'Italy', 'description': "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.", 'points': 87, 'price': None, 'province': 'Sicily & Sardinia', 'region': 'Etna', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'title': 'Nicosia 2013 Vulkà Bianco (Etna)', 'variety': 'White Blend', 'winery': 'Nicosia'}
{'_id': ObjectId('5ed80dcca25fcf746119e3ab'), 'wine_id': 1, 'country': 'Portugal', 'description': "This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.", 'points': 87, 'price': 15.0, 'province': 'Douro', 'region': None, 'taster_name': 'Roger Voss', 'taster_twitter_handle': '@vossroger', 'title': 'Quinta dos Avidagos 2011 Avidagos Red (Douro)', 'variety': 'Portuguese Red', 'winery': 'Quinta dos Avidagos'}
{'_id': ObjectId('5ed80dcca25fcf746119e3ac'), 'wine_id': 2, 'country': 'US', 'description': 'Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.', 'points': 87, 'price': 14.0, 'province': 'Oregon', 'region': 'Willamette Valley', 'taster_name': 'Paul Gregutt', 'taster_twitter_handle': '@paulgwine\xa0', 'title': 'Rainstorm 2013 Pinot Gris (Willamette Valley)', 'variety': 'Pinot Gris', 'winery': 'Rainstorm'}
Displaying the query output data as a list of JSON dictionaries, however, is not the most useful way to store the data. We need a way to put these data into a dataframe. For that we can use the dumps()
and loads()
functions from the bson
library. These functions work exactly like the dumps()
and loads()
functions from the json
library, but they remove some of the extra components of these JSON dictionaries associated with the database. To query all of the data and to place all of it into a dataframe, we pass the query output to dumps()
, which converts the query output to plain text. Next we pass this text to loads()
, which registers the text as a list of JSON dictionaries. Finally we use this list as the argument of pd.DataFrame.from_records()
to convert the output to a dataframe. For the wine collection, this code is:
myquery = winecollection.find({})
wine_text = dumps(myquery)
wine_records = loads(wine_text)
wine_df = pd.DataFrame.from_records(wine_records)
wine_df
_id | wine_id | country | description | points | price | province | region | taster_name | taster_twitter_handle | title | variety | winery | location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5ed80dbca25fcf746119e3aa | 0.0 | Italy | Aromas include tropical fruit, broom, brimston... | 87.0 | None | Sicily & Sardinia | Etna | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia | NaN |
1 | 5ed80dcca25fcf746119e3ab | 1.0 | Portugal | This is ripe and fruity, a wine that is smooth... | 87.0 | 15 | Douro | None | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos | NaN |
2 | 5ed80dcca25fcf746119e3ac | 2.0 | US | Tart and snappy, the flavors of lime flesh and... | 87.0 | 14 | Oregon | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm | NaN |
3 | 5ed80dcca25fcf746119e3ad | 3.0 | US | Pineapple rind, lemon pith and orange blossom ... | 87.0 | 13 | Michigan | Lake Michigan Shore | Alexander Peartree | None | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian | NaN |
4 | 5ed80dcca25fcf746119e3ae | 4.0 | US | Much like the regular bottling from 2012, this... | 87.0 | 65 | Oregon | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103726 | 5ed80dcfa25fcf74611b78d8 | 129970.0 | France | Big, rich and off-dry, this is powered by inte... | 90.0 | 21 | Alsace | Alsace | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit | NaN |
103727 | 5edd56e5b4e58ce3841e5dea | NaN | NaN | This wine goes great with dinner just like Dwy... | NaN | 35 | NaN | NaN | Jonathan Kropko | @jmk5131 | 2016 Napa Valley Three By Wade Red Blend | Red Blend | NaN | {'region_1': 'Napa Valley', 'region_2': None, ... |
103728 | 5edd56e6b4e58ce3841e5deb | NaN | NaN | This wine will make you speak differently. May... | NaN | 40.99 | NaN | NaN | Jonathan Kropko | @jmk5131 | Anta Banderas A 10 2008 | Red Blend | NaN | {'region_1': 'Ribera del Duoro', 'region_2': N... |
103729 | 5edd56e6b4e58ce3841e5dec | NaN | NaN | Someone drank my entire bottle of wine! | NaN | 14.99 | NaN | NaN | Jonathan Kropko | @jmk5131 | Barrymore Rose 2013 | Rose | NaN | {'region_1': 'Monterey', 'region_2': None, 'pr... |
103730 | 5edd56e6b4e58ce3841e5ded | NaN | NaN | This upscale crunk juice is OOOKAAAAAAY. | NaN | NaN | NaN | NaN | Jonathan Kropko | @jmk5131 | 2006 Little Jonathan Winery Cabernet Sauvignon | Cabernet Sauvignon | NaN | {'region_1': 'Central Coast', 'region_2': 'Pas... |
103731 rows × 14 columns
Like SQL, read operations in MongoDB can filter records based on logical conditions. Unlike SQL, MongoDB uses different symbols for the common logical operators, and these symbols need to be listed within the JSON formatted query. A couple of the operators are implicit in the JSON syntax. To search on an equality condition, the general syntax is
{'key' : value}
For example, to find all of the wines in the database that are from virginia, we can use the following query:
{'province' : 'Virginia'}
The other implicit operator is “and”, which is expressed simply by including more than one key-value pair within the syntax. To specify that a feature key1
is equal to value1
AND that key2
is equal to value2
, type:
{'key1' : value1,
'key2' : value2}
For example, to filter the data to Pinot Noir wines from Virginia, we can type
{'variety' : 'Pinot Noir',
'province' : 'Virginia'}
For all other logical operators, MongoDB uses special syntax, described below. To use these operators in a query, the general template is general syntax for using an operator within a MongoDB query is
{'key' : {'$operator' : value } }
The operators are listed in the following table:
Operator |
Syntax |
Example query |
Example code |
---|---|---|---|
Equal to |
implicit |
All wines with scores of 100 |
|
Greater than |
|
All wines that are more expensive than \$30 |
|
Greater than or equal to |
|
All wines with scores of 95 or higher |
|
Less than |
|
All wines that are cheaper than \$20 |
|
Less than or equal to |
|
All wines with scores of 85 or lower |
|
Not equal |
|
Wines that are not red blends |
|
And |
implicit |
All wines with scores of 100 and prices of \$20 or less |
|
Or |
|
All wines with scores of 100, or prices of \$20 or less |
|
Exists in a set |
|
All wines from Virginia, Maryland, or North Carolina |
|
Not in a set |
|
All wines except those from Virginia, Maryland, and North Carolina |
|
Use logical conditions that compare two or more keys |
|
All wines whose price is greater than their score |
|
Logical negation (only recommended for use with |
|
All wines whose descriptions do not contain the word “chocolate”, treating capital and lower-case letters the same |
|
To quickly see the data that is output by queries that use these operators, I write a function that takes a JSON dictionary as an input, and outputs a pandas
dataframe:
def mongo_read_query(col, q):
qtext = dumps(col.find(q))
qrec = loads(qtext)
qdf = pd.DataFrame.from_records(qrec)
return qdf
To see all wines with a score of 100
myquery = {'points': 100}
mongo_read_query(winecollection, myquery)
_id | wine_id | country | description | points | price | province | region | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5ed80dcca25fcf746119e498 | 345 | Australia | This wine contains some material over 100 year... | 100 | 350.0 | Victoria | Rutherglen | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscat (Ru... | Muscat | Chambers Rosewood Vineyards |
1 | 5ed80dcea25fcf74611a5511 | 36528 | France | This is a fabulous wine from the greatest Cham... | 100 | 259.0 | Champagne | Champagne | Roger Voss | @vossroger | Krug 2002 Brut (Champagne) | Champagne Blend | Krug |
2 | 5ed80dcea25fcf74611a66df | 42197 | Portugal | This is the latest release of what has long be... | 100 | 450.0 | Douro | None | Roger Voss | @vossroger | Casa Ferreirinha 2008 Barca-Velha Red (Douro) | Portuguese Red | Casa Ferreirinha |
3 | 5ed80dcea25fcf74611a7214 | 45781 | Italy | This gorgeous, fragrant wine opens with classi... | 100 | 550.0 | Tuscany | Brunello di Montalcino | Kerin O’Keefe | @kerinokeefe | Biondi Santi 2010 Riserva (Brunello di Montal... | Sangiovese | Biondi Santi |
4 | 5ed80dcea25fcf74611a9877 | 58352 | France | This is a magnificently solid wine, initially ... | 100 | 150.0 | Bordeaux | Saint-Julien | Roger Voss | @vossroger | Château Léoville Barton 2010 Saint-Julien | Bordeaux-style Red Blend | Château Léoville Barton |
5 | 5ed80dcfa25fcf74611afacd | 89728 | France | This latest incarnation of the famous brand is... | 100 | 250.0 | Champagne | Champagne | Roger Voss | @vossroger | Louis Roederer 2008 Cristal Vintage Brut (Cha... | Champagne Blend | Louis Roederer |
6 | 5ed80dcfa25fcf74611aface | 89729 | France | This new release from a great vintage for Char... | 100 | 617.0 | Champagne | Champagne | Roger Voss | @vossroger | Salon 2006 Le Mesnil Blanc de Blancs Brut Char... | Chardonnay | Salon |
7 | 5ed80dcfa25fcf74611b3fc4 | 111753 | France | Almost black in color, this stunning wine is g... | 100 | 1500.0 | Bordeaux | Pauillac | Roger Voss | @vossroger | Château Lafite Rothschild 2010 Pauillac | Bordeaux-style Red Blend | Château Lafite Rothschild |
8 | 5ed80dcfa25fcf74611b3fc5 | 111755 | France | This is the finest Cheval Blanc for many years... | 100 | 1500.0 | Bordeaux | Saint-Émilion | Roger Voss | @vossroger | Château Cheval Blanc 2010 Saint-Émilion | Bordeaux-style Red Blend | Château Cheval Blanc |
9 | 5ed80dcfa25fcf74611b3fc6 | 111756 | France | A hugely powerful wine, full of dark, brooding... | 100 | 359.0 | Bordeaux | Saint-Julien | Roger Voss | @vossroger | Château Léoville Las Cases 2010 Saint-Julien | Bordeaux-style Red Blend | Château Léoville Las Cases |
10 | 5ed80dcfa25fcf74611b4648 | 113929 | US | In 2005 Charles Smith introduced three high-en... | 100 | 80.0 | Washington | Columbia Valley (WA) | Paul Gregutt | @paulgwine | Charles Smith 2006 Royal City Syrah (Columbia ... | Syrah | Charles Smith |
11 | 5ed80dcfa25fcf74611b499d | 114972 | Portugal | A powerful and ripe wine, strongly influenced ... | 100 | 650.0 | Port | None | Roger Voss | @vossroger | Quinta do Noval 2011 Nacional Vintage (Port) | Port | Quinta do Noval |
12 | 5ed80dcfa25fcf74611b6300 | 122935 | France | Full of ripe fruit, opulent and concentrated, ... | 100 | 848.0 | Bordeaux | Pessac-Léognan | Roger Voss | @vossroger | Château Haut-Brion 2014 Pessac-Léognan | Bordeaux-style White Blend | Château Haut-Brion |
13 | 5ed80dcfa25fcf74611b64d4 | 123545 | US | Initially a rather subdued Frog; as if it has ... | 100 | 80.0 | Washington | Walla Walla Valley (WA) | Paul Gregutt | @paulgwine | Cayuse 2008 Bionic Frog Syrah (Walla Walla Val... | Syrah | Cayuse |
To see wines with a score of 100 and a cost of less than \\(100, we can use the `\)lt` operator:
myquery = {'points': 100, 'price': {'$lt': 100}}
mongo_read_query(winecollection, myquery)
_id | wine_id | country | description | points | price | province | region | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5ed80dcfa25fcf74611b4648 | 113929 | US | In 2005 Charles Smith introduced three high-en... | 100 | 80.0 | Washington | Columbia Valley (WA) | Paul Gregutt | @paulgwine | Charles Smith 2006 Royal City Syrah (Columbia ... | Syrah | Charles Smith |
1 | 5ed80dcfa25fcf74611b64d4 | 123545 | US | Initially a rather subdued Frog; as if it has ... | 100 | 80.0 | Washington | Walla Walla Valley (WA) | Paul Gregutt | @paulgwine | Cayuse 2008 Bionic Frog Syrah (Walla Walla Val... | Syrah | Cayuse |
To see wines that are from Ohio or North Carolina, we use the $in
operator:
myquery = {'province': {'$in': ['Ohio','North Carolina']}}
mongo_read_query(winecollection, myquery)
_id | wine_id | country | description | points | price | province | region | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5ed80dcea25fcf74611a0d3f | 13402 | US | Clove and pepper spice the dark red cherry aro... | 86 | 17.0 | North Carolina | Swan Creek | Anna Lee C. Iijima | None | Raffaldini 2007 Montepulciano (Swan Creek) | Montepulciano | Raffaldini |
1 | 5ed80dcea25fcf74611a167b | 16270 | US | The nose shows an aroma of blackberry that is ... | 86 | 17.0 | North Carolina | Yadkin Valley | Alexander Peartree | None | Shadow Springs 2011 Cabernet Franc (Yadkin Val... | Cabernet Franc | Shadow Springs |
2 | 5ed80dcea25fcf74611a20a2 | 19566 | US | Fruits, flowers and spice should lead the nose... | 80 | 15.0 | Ohio | Ohio | Susan Kostrzewa | @suskostrzewa | Hermes 2006 Estate Bottled Nebbiolo (Ohio) | Nebbiolo | Hermes |
3 | 5ed80dcea25fcf74611a23ee | 20592 | US | Stewed blackberries and muddled cherries perva... | 86 | 23.0 | North Carolina | Swan Creek | Alexander Peartree | None | Raffaldini 2012 Riserva Sangiovese (Swan Creek) | Sangiovese | Raffaldini |
4 | 5ed80dcea25fcf74611a5fc4 | 39935 | US | Friendly, appealing flavors fo pear, lychee, a... | 84 | 12.0 | Ohio | Grand River Valley | Susan Kostrzewa | @suskostrzewa | Debonné 2008 Reserve Riesling (Grand River Val... | Riesling | Debonné |
5 | 5ed80dcea25fcf74611a686c | 42692 | US | Friendly, appealing flavors fo pear, lychee, a... | 84 | 12.0 | Ohio | Grand River Valley | Susan Kostrzewa | @suskostrzewa | Debonné 2008 Reserve Riesling (Grand River Val... | Riesling | Debonné |
6 | 5ed80dcea25fcf74611a7033 | 45209 | US | Freshly squeezed lemons, lime and pretty white... | 84 | 16.0 | North Carolina | Swan Creek | Anna Lee C. Iijima | None | Raffaldini 2009 Pinot Grigio (Swan Creek) | Pinot Grigio | Raffaldini |
7 | 5ed80dcea25fcf74611a74db | 46670 | US | Black fruit aromas show over toasted vanilla a... | 85 | 29.0 | North Carolina | Swan Creek | Alexander Peartree | None | Raffaldini 2012 Riserva Montepulciano (Swan Cr... | Montepulciano | Raffaldini |
8 | 5ed80dcea25fcf74611a7c95 | 49154 | US | Lean and racy, with limes and tart green apple... | 86 | 10.0 | North Carolina | North Carolina | Joe Czerwinski | @JoeCz | Shelton Vineyards 2002 Riesling (North Carolina) | Riesling | Shelton Vineyards |
9 | 5ed80dcea25fcf74611a8518 | 52078 | US | Charred oak, green herbs and vanilla spice not... | 82 | 18.0 | North Carolina | Yadkin Valley | Anna Lee C. Iijima | None | Divine Llama 2007 In a Heart Beat Red (Yadkin ... | R. Blend | Divine Llama |
10 | 5ed80dcea25fcf74611a8fe5 | 55687 | US | Mostly Sangiovese with a small dose of Petit V... | 84 | 17.0 | North Carolina | Swan Creek | Anna Lee C. Iijima | None | Raffaldini 2007 Riserva Sangiovese (Swan Creek) | Sangiovese | Raffaldini |
11 | 5ed80dcea25fcf74611a8fe9 | 55693 | US | Aromas of toasted oak, green leaves, vanilla a... | 84 | 25.0 | North Carolina | North Carolina | Anna Lee C. Iijima | None | RayLen 2006 Eagle's Select Red Wine Red (North... | Bordeaux-style Red Blend | RayLen |
12 | 5ed80dcfa25fcf74611ac51f | 72535 | US | Who knew Ohio made such tasty Chardonnay? Brig... | 87 | 11.0 | Ohio | Grand River Valley | Anna Lee C. Iijima | None | Debonné 2009 Chardonnay (Grand River Valley) | Chardonnay | Debonné |
13 | 5ed80dcfa25fcf74611ae168 | 81619 | US | Strawberry and raspberry Kool-Aid aromas are s... | 85 | 21.0 | North Carolina | Swan Creek | Alexander Peartree | None | Laurel Gray 2012 Estate Grown Cabernet Franc (... | Cabernet Franc | Laurel Gray |
14 | 5ed80dcfa25fcf74611af839 | 88841 | US | This is a vibrant, energetic Chardonnay that s... | 84 | 17.0 | Ohio | Grand River Valley | Susan Kostrzewa | @suskostrzewa | Debonné 2007 Vintner's Selection Chardonnay (G... | Chardonnay | Debonné |
15 | 5ed80dcfa25fcf74611b08bb | 94267 | US | Although the nose offers darker notes of petro... | 83 | 15.0 | Ohio | Grand River Valley | Anna Lee C. Iijima | None | Debonné 2008 Lot 807 Reserve Riesling (Grand R... | Riesling | Debonné |
16 | 5ed80dcfa25fcf74611b08be | 94275 | US | The nose on this bright red blend from Sanders... | 83 | 18.0 | North Carolina | Yadkin Valley | Anna Lee C. Iijima | None | Sanders Ridge 2008 Big Woods Red (Yadkin Valley) | Bordeaux-style Red Blend | Sanders Ridge |
17 | 5ed80dcfa25fcf74611b13ff | 97836 | US | Smoke wafts over pressed apple and lemon notes... | 83 | 13.0 | North Carolina | North Carolina | Anna Lee C. Iijima | None | RayLen 2009 Riesling (North Carolina) | Riesling | RayLen |
18 | 5ed80dcfa25fcf74611b1c1f | 100445 | US | Fresh minerality and dancing floral notes make... | 84 | 11.0 | Ohio | Grand River Valley | Susan Kostrzewa | @suskostrzewa | Debonné 2006 Reserve Riesling (Grand River Val... | Riesling | Debonné |
19 | 5ed80dcfa25fcf74611b1c23 | 100452 | US | A slightly floral but lively nose is followed ... | 84 | 15.0 | Ohio | Grand River Valley | Susan Kostrzewa | @suskostrzewa | Debonné 2006 Lot 707 Reserve Riesling (Grand R... | Riesling | Debonné |
20 | 5ed80dcfa25fcf74611b29f1 | 104722 | US | There are enticing hints of berries and cream ... | 86 | 15.0 | North Carolina | North Carolina | Anna Lee C. Iijima | None | Biltmore Estate 2010 Reserve Chardonnay (North... | Chardonnay | Biltmore Estate |
21 | 5ed80dcfa25fcf74611b4fb6 | 116912 | US | Black cherry aromas are dwarfed by notes of wi... | 83 | 24.0 | North Carolina | Swan Creek | Alexander Peartree | None | Raffaldini 2012 Montepulciano (Swan Creek) | Montepulciano | Raffaldini |
22 | 5ed80dcfa25fcf74611b55ee | 118895 | US | Bright red fruits achieve a decent amount of r... | 84 | 18.0 | North Carolina | North Carolina | Anna Lee C. Iijima | None | RayLen 2008 Category 5 Red Wine Red (North Car... | R. Blend | RayLen |
To illustrate the “or” operator, we can query all wines that are either from Virginia, or have a score of 100:
myquery = {'$or': [{'points': 100}, {'province': 'Virginia'}]}
mongo_read_query(winecollection, myquery)
_id | wine_id | country | description | points | price | province | region | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5ed80dcca25fcf746119e3bd | 19 | US | Red fruit aromas pervade on the nose, with cig... | 87 | 32.0 | Virginia | Virginia | Alexander Peartree | None | Quiévremont 2012 Meritage (Virginia) | Meritage | Quiévremont |
1 | 5ed80dcca25fcf746119e3be | 20 | US | Ripe aromas of dark berries mingle with ample ... | 87 | 23.0 | Virginia | Virginia | Alexander Peartree | None | Quiévremont 2012 Vin de Maison Red (Virginia) | R. Blend | Quiévremont |
2 | 5ed80dcca25fcf746119e498 | 345 | Australia | This wine contains some material over 100 year... | 100 | 350.0 | Victoria | Rutherglen | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscat (Ru... | Muscat | Chambers Rosewood Vineyards |
3 | 5ed80dcea25fcf746119e8d0 | 1625 | US | Popping with aromas of lychee, rose, geranium ... | 85 | 16.0 | Virginia | Virginia | Carrie Dykes | None | The Williamsburg Winery 2015 A Midsummer Night... | White Blend | The Williamsburg Winery |
4 | 5ed80dcea25fcf746119e8d6 | 1631 | US | Powerful aromas of lychee, mango and peach giv... | 85 | 22.0 | Virginia | Middleburg | Carrie Dykes | None | Blue Valley 2015 Muskat Ottonel (Middleburg) | Muskat Ottonel | Blue Valley |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
388 | 5ed80dcfa25fcf74611b71b3 | 127736 | US | Dense with alluring aromas, this wine is full ... | 88 | 30.0 | Virginia | Virginia | Carrie Dykes | None | Early Mountain 2015 Elevation Red (Virginia) | R. Blend | Early Mountain |
389 | 5ed80dcfa25fcf74611b71bd | 127746 | US | A grape known in Uruguay and Madiran has prove... | 88 | 25.0 | Virginia | Virginia | Carrie Dykes | None | Horton 2014 Tannat (Virginia) | Tannat | Horton |
390 | 5ed80dcfa25fcf74611b7447 | 128576 | US | Peach and steely lemon aromas carry to a citru... | 87 | 28.0 | Virginia | Monticello | Alexander Peartree | None | Pollak 2012 Reserve Chardonnay (Monticello) | Chardonnay | Pollak |
391 | 5ed80dcfa25fcf74611b7708 | 129422 | US | The nose of this wine is bursting with raspber... | 89 | 32.0 | Virginia | Monticello | Carrie Dykes | None | Stinson 2014 Meritage (Monticello) | Meritage | Stinson |
392 | 5ed80dcfa25fcf74611b772a | 129459 | US | Somehow, winemaker Luca Paschina manages to ma... | 87 | 23.0 | Virginia | Virginia | Carrie Dykes | None | Barboursville Vineyards 2015 Reserve Vermentin... | Vermentino | Barboursville Vineyards |
393 rows × 13 columns
$expr
requires an operator that compares two keys, and creates a sentence like “X is greater than or equal to Y”. Next it requires a list that specifies what X in the sentence should be, then what Y should be. To search for all wines in which the price is greater than the score we use the $expr
operator as follows:
myquery = {'$expr': {'$gt': ['$price', '$points']}}
mongo_read_query(winecollection, myquery)
_id | wine_id | country | description | points | price | province | region | taster_name | taster_twitter_handle | title | variety | winery | location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5ed80dcca25fcf746119e3d4 | 60.0 | US | Syrupy and dense, this wine is jammy in plum a... | 86.0 | 100 | California | Napa Valley | Virginie Boone | @vboone | Okapi 2013 Estate Cabernet Sauvignon (Napa Val... | Cabernet Sauvignon | Okapi | NaN |
1 | 5ed80dcca25fcf746119e42c | 168.0 | US | A fairly elegant expression of the variety, th... | 91.0 | 95 | California | Napa Valley | Virginie Boone | @vboone | Duckhorn 2012 Rector Creek Vineyard Merlot (Na... | Merlot | Duckhorn | NaN |
2 | 5ed80dcca25fcf746119e476 | 284.0 | Argentina | This huge Malbec defines jammy and concentrate... | 92.0 | 215 | Mendoza Province | Perdriel | Michael Schachner | @wineschach | Viña Cobos 2011 Marchiori Vineyard Block C2 Ma... | Malbec | Viña Cobos | NaN |
3 | 5ed80dcca25fcf746119e498 | 345.0 | Australia | This wine contains some material over 100 year... | 100.0 | 350 | Victoria | Rutherglen | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscat (Ru... | Muscat | Chambers Rosewood Vineyards | NaN |
4 | 5ed80dcca25fcf746119e499 | 346.0 | Australia | This deep brown wine smells like a damp, mossy... | 98.0 | 350 | Victoria | Rutherglen | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscadelle... | Muscadelle | Chambers Rosewood Vineyards | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3697 | 5ed80dcfa25fcf74611b78a8 | 129919.0 | US | This ripe, rich, almost decadently thick wine ... | 91.0 | 105 | Washington | Walla Walla Valley (WA) | Paul Gregutt | @paulgwine | Nicholas Cole Cellars 2004 Reserve Red (Walla ... | R. Blend | Nicholas Cole Cellars | NaN |
3698 | 5ed80dcfa25fcf74611b78b2 | 129931.0 | France | A powerful, chunky wine, packed with solid tan... | 91.0 | 107 | Burgundy | Grands-Echezeaux | Roger Voss | @vossroger | Henri de Villamont 2005 Grands-Echezeaux | Pinot Noir | Henri de Villamont | NaN |
3699 | 5edd56e5b4e58ce3841e5dea | NaN | NaN | This wine goes great with dinner just like Dwy... | NaN | 35 | NaN | NaN | Jonathan Kropko | @jmk5131 | 2016 Napa Valley Three By Wade Red Blend | Red Blend | NaN | {'region_1': 'Napa Valley', 'region_2': None, ... |
3700 | 5edd56e6b4e58ce3841e5deb | NaN | NaN | This wine will make you speak differently. May... | NaN | 40.99 | NaN | NaN | Jonathan Kropko | @jmk5131 | Anta Banderas A 10 2008 | Red Blend | NaN | {'region_1': 'Ribera del Duoro', 'region_2': N... |
3701 | 5edd56e6b4e58ce3841e5dec | NaN | NaN | Someone drank my entire bottle of wine! | NaN | 14.99 | NaN | NaN | Jonathan Kropko | @jmk5131 | Barrymore Rose 2013 | Rose | NaN | {'region_1': 'Monterey', 'region_2': None, 'pr... |
3702 rows × 14 columns
In the previous section, we entered a record for a wine released by former NBA all-star Dwyane Wade, and we purposely included a nested structure in this JSON record. The location
key has subkeys region_1
, region_2
, province
, country
, and winery
:
dwadewine
{'title': '2016 Napa Valley Three By Wade Red Blend',
'description': 'This wine goes great with dinner just like Dwyane Wade goes great with LeBron James or Shaq.',
'taster_name': 'Jonathan Kropko',
'taster_twitter_handle': '@jmk5131',
'price': '35',
'variety': 'Red Blend',
'location': {'region_1': 'Napa Valley',
'region_2': None,
'province': 'California',
'country': 'U.S.',
'winery': 'D Wade Cellars'},
'_id': ObjectId('5edd56e5b4e58ce3841e5dea')}
To query a subrecord, use dot notation of the form 'key.subkey'
to identify the path to the value you need. To query for the winery name “D Wade Cellars”, we can type:
myquery = {'location.winery': 'D Wade Cellars'}
mongo_read_query(winecollection, myquery)
_id | title | description | taster_name | taster_twitter_handle | price | variety | location | |
---|---|---|---|---|---|---|---|---|
0 | 5edd56e5b4e58ce3841e5dea | 2016 Napa Valley Three By Wade Red Blend | This wine goes great with dinner just like Dwy... | Jonathan Kropko | @jmk5131 | 35 | Red Blend | {'region_1': 'Napa Valley', 'region_2': None, ... |
7.8.3. Selecting Features#
A read query in MongoDB will return the entire JSON dictionary for every record that matches the query. Sometimes, however, the entirety of the data for one record will be more information than we can feasibly work with. In some situations there might be an unmanagable number of features contained within each dictionary, and we only want to use a couple of these features. In other situations a feature might contain values that are so large that we want to avoid dealing with this feature if possible.
To extract only a selection of the features, add a second JSON clause to the .find()
method. The general syntax for selecting features is
db.collection.find({query}, {'feature'=1}}
where {query}
is code, as described above, for extracting a selection of the records, and {'feature'=1}
instructs MongoDB to include only the field named feature
in the output. Alternatively, it is possible to list as many keys in this second clause as we want, so {'feature1'=1, 'feature2'=1}
extracts feature1
and feature2
. In addition, setting the key equal to 0 instead of 1 instructs MongoDB to extract all features except the one specified with 'feature'=0
.
In the wine collection, we can extract only the titles of Merlot wines with the following code:
cursor = winecollection.find({'variety': 'Merlot'}, {'title': 1})
qtext = dumps(cursor)
qrec = loads(qtext)
pd.DataFrame.from_records(qrec)
_id | title | |
---|---|---|
0 | 5ed80dcca25fcf746119e3c1 | Bianchi 2011 Signature Selection Merlot (Paso ... |
1 | 5ed80dcca25fcf746119e3cd | Sundance 2011 Merlot (Maule Valley) |
2 | 5ed80dcca25fcf746119e3ef | Passaggio 2014 Blau Vineyards Merlot (Knights ... |
3 | 5ed80dcca25fcf746119e42c | Duckhorn 2012 Rector Creek Vineyard Merlot (Na... |
4 | 5ed80dcca25fcf746119e437 | Viña Bisquertt 2007 Casa La Joya Reserve Merlo... |
... | ... | ... |
2094 | 5ed80dcfa25fcf74611b77f0 | Castillo de Monjardin 2009 Deyo Merlot (Navarra) |
2095 | 5ed80dcfa25fcf74611b7862 | Bonair 2006 Chateau Puryear Vineyard Merlot (R... |
2096 | 5ed80dcfa25fcf74611b7865 | Hyatt 2005 Merlot (Rattlesnake Hills) |
2097 | 5ed80dcfa25fcf74611b786b | Ca' Momi 2013 Reserve Merlot (Carneros) |
2098 | 5ed80dcfa25fcf74611b7896 | Psagot 2014 Merlot |
2099 rows × 2 columns
By default, the only field that is extracted other than the ones we directly specify is _id
, but we can exclude _id
as well by typing
cursor = winecollection.find({'variety': 'Merlot'}, {'title': 1, '_id': 0})
qtext = dumps(cursor)
qrec = loads(qtext)
pd.DataFrame.from_records(qrec)
title | |
---|---|
0 | Bianchi 2011 Signature Selection Merlot (Paso ... |
1 | Sundance 2011 Merlot (Maule Valley) |
2 | Passaggio 2014 Blau Vineyards Merlot (Knights ... |
3 | Duckhorn 2012 Rector Creek Vineyard Merlot (Na... |
4 | Viña Bisquertt 2007 Casa La Joya Reserve Merlo... |
... | ... |
2094 | Castillo de Monjardin 2009 Deyo Merlot (Navarra) |
2095 | Bonair 2006 Chateau Puryear Vineyard Merlot (R... |
2096 | Hyatt 2005 Merlot (Rattlesnake Hills) |
2097 | Ca' Momi 2013 Reserve Merlot (Carneros) |
2098 | Psagot 2014 Merlot |
2099 rows × 1 columns
To keep the title, variety, points, and price, we type
cursor = winecollection.find({'variety': 'Merlot'},
{'title': 1,
'variety': 1,
'points': 1,
'price': 1,
'_id': 0})
qtext = dumps(cursor)
qrec = loads(qtext)
pd.DataFrame.from_records(qrec)
points | price | title | variety | |
---|---|---|---|---|
0 | 87 | 22.0 | Bianchi 2011 Signature Selection Merlot (Paso ... | Merlot |
1 | 86 | 9.0 | Sundance 2011 Merlot (Maule Valley) | Merlot |
2 | 86 | 55.0 | Passaggio 2014 Blau Vineyards Merlot (Knights ... | Merlot |
3 | 91 | 95.0 | Duckhorn 2012 Rector Creek Vineyard Merlot (Na... | Merlot |
4 | 88 | 11.0 | Viña Bisquertt 2007 Casa La Joya Reserve Merlo... | Merlot |
... | ... | ... | ... | ... |
2094 | 87 | 18.0 | Castillo de Monjardin 2009 Deyo Merlot (Navarra) | Merlot |
2095 | 86 | 20.0 | Bonair 2006 Chateau Puryear Vineyard Merlot (R... | Merlot |
2096 | 86 | 10.0 | Hyatt 2005 Merlot (Rattlesnake Hills) | Merlot |
2097 | 90 | 44.0 | Ca' Momi 2013 Reserve Merlot (Carneros) | Merlot |
2098 | 91 | 32.0 | Psagot 2014 Merlot | Merlot |
2099 rows × 4 columns
7.8.4. Updating Records#
Updating records in MongoDB is similar to selecting records in that we use the same logical conditions we use for selecting records for identifying the records we want to edit. The .update_one()
method, applied to a collection, has two arguments. First we specify a logical condition that identifies the records we want to edit. Then we use the $set
operator to choose specific fields within the existing JSON record to change. If we want, we can even write an entire replacement dictionary for this record, and write it along with $set
. For example, to identify the record of the wine from Dwyane Wade’s winery, we can query {'location.winery': 'D Wade Cellars'}
as we did in the previous section. Suppose that we want to edit this record so that the price increases to \$45. We can do so with the following code:
winecollection.update_one({'location.winery': 'D Wade Cellars'},
{'$set' : {'price': 45}})
mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})
_id | title | description | taster_name | taster_twitter_handle | price | variety | location | |
---|---|---|---|---|---|---|---|---|
0 | 5edd56e5b4e58ce3841e5dea | 2016 Napa Valley Three By Wade Red Blend | This wine goes great with dinner just like Dwy... | Jonathan Kropko | @jmk5131 | 45 | Red Blend | {'region_1': 'Napa Valley', 'region_2': None, ... |
Suppose that we wanted to add a field that does not currently exist in the record, like points
. We can use the same syntax to add fields:
winecollection.update_one({'location.winery': 'D Wade Cellars'},
{'$set' : {'score': 90}})
mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})
_id | title | description | taster_name | taster_twitter_handle | price | variety | location | score | |
---|---|---|---|---|---|---|---|---|---|
0 | 5edd56e5b4e58ce3841e5dea | 2016 Napa Valley Three By Wade Red Blend | This wine goes great with dinner just like Dwy... | Jonathan Kropko | @jmk5131 | 45 | Red Blend | {'region_1': 'Napa Valley', 'region_2': None, ... | 90 |
We can change more than one field at a time within one call to the $set
operator. To change both the score and the price of the Dwyane Wade wine, we can type:
winecollection.update_one({'location.winery': 'D Wade Cellars'},
{'$set' : {'score': 95,
'price': 50}})
mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})
_id | title | description | taster_name | taster_twitter_handle | price | variety | location | score | |
---|---|---|---|---|---|---|---|---|---|
0 | 5edd56e5b4e58ce3841e5dea | 2016 Napa Valley Three By Wade Red Blend | This wine goes great with dinner just like Dwy... | Jonathan Kropko | @jmk5131 | 50 | Red Blend | {'region_1': 'Napa Valley', 'region_2': None, ... | 95 |
Suppose that the wine is reviewed by LeBron James, NBA star and noted wine connoisseur, who provided a new score and description. We can update the entire record by first defining a Python variable that contains the record:
dwadewine2 = {'title': '2016 Napa Valley Three By Wade Red Blend',
'description': "This wine is very good. Not as great as me. But plenty great enough for Miami.",
'taster_name': 'LeBron James',
'taster_twitter_handle': '@kingjames',
'price': 45,
'score': 99,
'variety': 'Red Blend',
'location':{
'region_1': 'Napa Valley',
'region_2': None,
'province': 'California',
'country': 'U.S.',
'winery': 'D Wade Cellars'}}
We can replace the existing record for this wine by specifying this dictionary as the second argument of the .update_one()
method:
winecollection.update_one({'location.winery': 'D Wade Cellars'},
{'$set' : dwadewine2})
mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})
_id | title | description | taster_name | taster_twitter_handle | price | variety | location | score | |
---|---|---|---|---|---|---|---|---|---|
0 | 5edd56e5b4e58ce3841e5dea | 2016 Napa Valley Three By Wade Red Blend | This wine is very good. Not as great as me. Bu... | LeBron James | @kingjames | 45 | Red Blend | {'region_1': 'Napa Valley', 'region_2': None, ... | 99 |
A second method for editing records is .update_all()
which revises every document that matches a query. I don’t recommend using this method except in very specific cases, because it is easy to destroy large portions of a database with a mistyped query. But for the sake of illustration, suppose we wanted to change the names of the “Red Blend” varieties of wines to “R. Blend”. We can do that with the following code:
winecollection.update_many({'variety': 'Red Blend'},
{'$set': {'variety': 'R. Blend'}})
mongo_read_query(winecollection, {'variety': 'R. Blend'})
_id | wine_id | country | description | points | price | province | region | taster_name | taster_twitter_handle | title | variety | winery | location | score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5ed80dcca25fcf746119e3be | 20.0 | US | Ripe aromas of dark berries mingle with ample ... | 87.0 | 23 | Virginia | Virginia | Alexander Peartree | None | Quiévremont 2012 Vin de Maison Red (Virginia) | R. Blend | Quiévremont | NaN | NaN |
1 | 5ed80dcca25fcf746119e3c6 | 28.0 | Italy | Aromas suggest mature berry, scorched earth, a... | 87.0 | 17 | Sicily & Sardinia | Cerasuolo di Vittoria | Kerin O’Keefe | @kerinokeefe | Terre di Giurfo 2011 Mascaria Barricato (Cera... | R. Blend | Terre di Giurfo | NaN | NaN |
2 | 5ed80dcca25fcf746119e3dc | 68.0 | US | Very deep in color and spicy-smoky in flavor, ... | 86.0 | 12 | California | California | Jim Gordon | @gordone_cellars | Cocobon 2014 Red (California) | R. Blend | Cocobon | NaN | NaN |
3 | 5ed80dcca25fcf746119e3f2 | 90.0 | US | This blend of Sangiovese, Malbec, Cabernet Sau... | 88.0 | 23 | California | Sonoma County | Virginie Boone | @vboone | Ferrari-Carano 2014 Siena Red (Sonoma County) | R. Blend | Ferrari-Carano | NaN | NaN |
4 | 5ed80dcca25fcf746119e400 | 104.0 | Italy | Made with 65% Sangiovese, 20% Merlot and 15% C... | 87.0 | 16 | Tuscany | Toscana | Kerin O’Keefe | @kerinokeefe | Madonna Alta 2014 Nativo Red (Toscana) | R. Blend | Madonna Alta | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7106 | 5ed80dcfa25fcf74611b78b3 | 129932.0 | Argentina | Andeluna's top wines tend to be ripe and plump... | 91.0 | 55 | Mendoza Province | Uco Valley | Michael Schachner | @wineschach | Andeluna 2004 Pasionado Red (Uco Valley) | R. Blend | Andeluna | NaN | NaN |
7107 | 5ed80dcfa25fcf74611b78bd | 129943.0 | Italy | A blend of Nero d'Avola and Syrah, this convey... | 90.0 | 29 | Sicily & Sardinia | Sicilia | Kerin O’Keefe | @kerinokeefe | Baglio del Cristo di Campobello 2012 Adènzia R... | R. Blend | Baglio del Cristo di Campobello | NaN | NaN |
7108 | 5ed80dcfa25fcf74611b78c1 | 129947.0 | Italy | A blend of 65% Cabernet Sauvignon, 30% Merlot ... | 90.0 | 20 | Sicily & Sardinia | Terre Siciliane | Kerin O’Keefe | @kerinokeefe | Feudo Principi di Butera 2012 Symposio Red (Te... | R. Blend | Feudo Principi di Butera | NaN | NaN |
7109 | 5edd56e5b4e58ce3841e5dea | NaN | NaN | This wine is very good. Not as great as me. Bu... | NaN | 45 | NaN | NaN | LeBron James | @kingjames | 2016 Napa Valley Three By Wade Red Blend | R. Blend | NaN | {'region_1': 'Napa Valley', 'region_2': None, ... | 99.0 |
7110 | 5edd56e6b4e58ce3841e5deb | NaN | NaN | This wine will make you speak differently. May... | NaN | 40.99 | NaN | NaN | Jonathan Kropko | @jmk5131 | Anta Banderas A 10 2008 | R. Blend | NaN | {'region_1': 'Ribera del Duoro', 'region_2': N... | NaN |
7111 rows × 15 columns
7.8.5. Performing Text Searches#
One of the great advantages of a document store database is the ability to search through the text within the documents and extract records that match a certain pattern. A text search in MongoDB involves two steps:
First, we will create a text index: a particular field in the records that contains the text we want MongoDB to search within.
Second, we will use the
$text
operator within a call to.find()
to specify the search terms.
To create a text index, we can use the syntax
collection.create_index[('keytosearch', 'text')]
We will replace 'keytosearch' with the name of the field in the JSON dictionaries on which we want to search, but we will leave
‘text’as is because this code tells MongoDB to search for text. The code to set the
descriptionfield as the text index in the
winecollection` database is:
winecollection.create_index([('description', 'text')])
'description_text'
Now that we’ve set the text index, we can search the text in that field. The general syntax for a query with a text search is
{'$text': {'$search': 'searchterms', '$caseSensitive': False}}
where 'searchterms'
contains the terms we want to search for, and '$caseSensitive': False
tells MongoDB to ignore cases in the search, so that a search term of “chocolate” also matches to “Chocolate”. Alternatively, '$caseSensitive': True
takes case into account when matching records to a query. If a search is not case sensitive, and if it is not diacritic sensitive (taking things like accents into account, which it can do by adding the $diacriticSensitive=True
option), then $search
matches on the stems of words: the first several letters in the word, allowing a search term of “blueberry” to also match with “blueberries”.
As a simple example, now that description
has been set as the text index, we can find all wines with descriptions that contain the word “chocolate”. Here I save the output as a dataframe and display the description for the first wine in the output:
df = mongo_read_query(winecollection, {'$text': {'$search':'chocolate', '$caseSensitive': False}})
df['description'][0]
'Supermodern in style, with mint, coconut, chocolate and huge black fruit aromas. Powerfully structured and thick-boned, with boysenberry, spice and chocolate in spades. Oaky, broad and layered on the finish, with tobacco, coffee and chocolate finishing notes.'
To search on more than one term, include the terms in the same string after '$search'
, separated by spaces. By default, these terms are combined using the “or” operator, so that the query returns any document with at least one of the terms. The following code finds all wines whose descriptions contain “chocolate” or “leather”:
df = mongo_read_query(winecollection, {'$text': {'$search':'chocolate leather', '$caseSensitive': False}})
df[df['wine_id']==109396]['description'] # a leathery one
5930 A whiff of leather introduces a wine with a st...
Name: description, dtype: object
To search for documents with a phrase that contains a space, enclose the phrase in double quotes, and precede each double-quote with an \ escape character. The following code captures descriptions with the phrase “very good”:
df = mongo_read_query(winecollection, {'$text': {'$search':'\"very good\"', '$caseSensitive': False}})
df['description'][0]
"A good to very good wine with medicinality on the nose that takes over the aromatics. There's also the slightest bit of hard cheese and stem on the bouquet, so overall it is fighting an uphill battle. Along the way it delivers flavors of cough drop, cherry and a good, solid finish."
To search for documents that contain multiple search terms at once (an “and” operator), enclose each search term in double quotes with escape characters. We can search for descriptions that contain both “leather” and “chocolate” with the following code:
df = mongo_read_query(winecollection, {'$text': {'$search':'\"leather\" \"chocolate\"', '$caseSensitive': False}})
df['description'][0]
'A blend of 85% Melnik, 10% Grenache Noir and 5% Petit Verdot, this wine has aromas of saddle leather, cassis and dark chocolate. In the mouth there are flavors of cherry, chocolate and dried blueberry. It has good balance with a soft tannic finish.'
To exclude a term, we add a negative sign in front of the term we want to exclude. To find all wines whose descriptions contain the word “dark” but not “chocolate”, we type
df = mongo_read_query(winecollection, {'$text': {'$search':'dark -chocolate', '$caseSensitive': False}})
df['description'][0]
'Dark, dark, dark is this estate-grown Syrah, closed at the nose except for the profusion of alcohol, grippy with lots of oak and crazy, teeth-staining tannins.'
Text searches can also be used to construct a search engine. We provide the search terms, and MongoDB generates a score for every document that represents the extent to which the search terms are relevant to the document. Once these scores have been generated, it is possible to sort the documents by the score to find the documents that are most highly related to the search terms.
To rank documents by search-relevancy, we add {'score': {'$meta': 'textScore'}}
to the query we pass to the .find()
method. Here we enter five search terms, “chocolate”, “leather”, “wood”, “dark”, and “smoke”:
cursor = winecollection.find(
{'$text': {'$search': 'chocolate leather wood dark smoke'}},
{'score': {'$meta': 'textScore'}})
Next we apply the .sort()
method to the output, arranging the documents by relevancy score, with the following code:
cursor.sort([('score', {'$meta': 'textScore'})])
<pymongo.cursor.Cursor at 0x1361d6438>
Finally we can convert the output to a dataframe with the following code:
qtext = dumps(cursor)
qrec = loads(qtext)
df = pd.DataFrame.from_records(qrec)
Among all the wine reviews in the data, here is the wine whose description had the highest relevancy score for “chocolate”, “leather”, “wood”, “dark”, and “smoke”:
df['description'][0]
'Big and bold in character, this full-bodied and abundantly tannic wine looks black to the rim and smells like pencil shavings, leather and wood smoke. The flavors are dry but enticing, with dark chocolate, charred beef and black cherry. It needs until at least 2022 to peak.'
To wrap up our work with the database, we apply the .close()
method to the MongoDB server:
myclient.close()