{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Database Queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{contents} Table of Contents\n", ":depth: 4\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction: The History of the Structured Query Language (SQL)\n", "\n", "
\n", "\n", "The structured query language (SQL) was invented by Donald D. Chamberlin and Raymond F. Boyce in 1974. Chamberlin and Boyce were both young computer scientists working at the IBM T.J. Watson Research Center in Yorktown Heights, New York, and they met E. F. Codd at a research symposium that Codd organized there. Codd, four years prior, had published the [seminal article that defined the relational model](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf) for databases. Codd's relational model is defined using relational algebra and relational calculus, two notational standards that Codd himself created to elaborate on set theory as applied specifically to data tables. One important property of set theory is that highly abstract mathematical expressions can be expressed in plain language. For example, consider the set $A$ of holidays in the United States during which banks are closed:\n", "\n", "$$ A = \\{\\text{New Year's Day}, \\text{Birthday of Martin Luther King, Jr.}, \\\\ \\text{Washington’s Birthday}, \\text{Memorial Day}, \\text{Independence Day}, \\\\ \\text{Labor Day}, \\text{Columbus Day}, \\text{Veterans Day}, \\\\ \\text{Thanksgiving Day}, \\text{Christmas}\\}$$\n", "\n", "Also consider the set $B$ of holidays in the United Kingdom during which banks are closed:\n", "\n", "$$ B = \\{\\text{New Year's Day}, \\text{Good Friday}, \\text{Easter Monday}, \\\\ \\text{Early May bank holiday}, \\text{Spring bank holiday}, \\\\\\text{Summer bank holiday}, \\text{Christmas}, \\text{Boxing Day}\\}$$\n", "\n", "The intersection of sets $A$ and $B$ is a set that consists of all elements that exist in both set $A$ and set $B$:\n", "\n", "$$ A \\cap B = \\{\\text{New Year's Day}, \\text{Christmas}\\}$$\n", "\n", "The notation $A\\cap B$ is a mathematical abstraction of an idea that can be expressed in plain-spoken language: $\\cap$ means \"and\", and $A\\cap B$ means $A$ *and* $B$, or all elements that are in both $A$ *and* $B$. 
Put another way, $A\\cap B$ is the set of all holidays during which banks are closed in both the United States *and* the United Kingdom. Likewise, every piece of set notation can be expressed semantically.\n", "\n", "Although Codd laid out the broad parameters of the relational model in mathematical terms, he did not design software or a physical architecture for a relational database. He explicitly left that work up to future research:\n", "\n", "> Many questions are raised and left unanswered. For example, only a few of the more important properties of the data sublanguage . . . are mentioned. Neither the purely linguistic details of such a language nor the implementation problems are discussed. Nevertheless, the material presented should be adequate for experienced systems programmers to visualize several approaches (p. 387).\n", "\n", "Chamberlin and Boyce took up the challenge of writing a programming language to implement Codd's relational model. [As Chamberlin explains](https://ieeexplore.ieee.org/document/6359709), their primary goal was to create a version of Codd's set-theoretical relational model that could be expressed in plain language:\n", "\n", "> The more difficult barrier was at the semantic level. The basic concepts of Codd's languages were adapted from set theory and symbolic logic. This was natural given Codd's background as a mathematician, but Ray and I hoped to design a relational language based on concepts that would be familiar to a wider population of users (p. 78).\n", "\n", "In short, the idea behind SQL is to implement Codd's abstract system of using logical statements with set-theoretical notation to narrow down the specific records and features in a dataset that a user wishes to read or edit, but to phrase these operations in accessible, plain language. One of the best things about SQL is that once you are used to the language, it reads much like English. 
That said, Chamberlin admits that SQL \"has not proved to be as accessible to untrained users as Ray and I originally hoped\" ([p. 81](https://ieeexplore.ieee.org/document/6359709)).\n", "\n", "Another benefit of SQL is that this language is one of the most universal programming languages in existence. It is designed to work with database management systems on any platform, and it can be used from within Python, R, C, Java, JavaScript, and so on. While the standards for languages and platforms change, SQL has been in continuous use for relational database management since the 1980s and shows no sign of becoming antiquated or being replaced. The SQL syntax exists outside of any individual DBMS, and is maintained by the [American National Standards Institute (ANSI)](https://en.wikipedia.org/wiki/American_National_Standards_Institute) and the [International Organization for Standardization (ISO)](https://en.wikipedia.org/wiki/International_Organization_for_Standardization), two non-profit organizations that facilitate the development of [voluntary consensus standards](https://en.wikipedia.org/wiki/Standardization) for things like programming languages and hardware. Despite the universality of SQL, however, different DBMSs use slightly different versions of SQL, adding some unique functionality in some cases, and failing to implement the entire SQL standard in others. MySQL, for example, [lacks the ability to perform a full join](https://stackoverflow.com/questions/4796872/how-to-do-a-full-outer-join-in-mysql). PostgreSQL distinguishes itself from other RDBMSs by striving to implement as much of the global SQL standard as possible. While there are some important differences in the version of SQL used by different DBMSs, the differences generally apply to very specific situations, and all implementations of SQL use mostly the same syntax and can do mostly the same work. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Declarative and Procedural Languages\n", "\n", "SQL is considered to be a [declarative language](https://en.wikipedia.org/wiki/Declarative_programming), which means that it defines the broad task that a particular computer system must carry out, but it does not define the mechanism through which the system completes the task. For example, SQL can tell a system to access two tables and join them together, but that command must tell a DBMS to access additional code that tells the system how exactly to search and operate on the rows and columns of each data table. A language that provides specific instructions to a system on how to carry out a task - by changing the system state in some way, including how the data exist in the system - is a [procedural language](https://en.wikipedia.org/wiki/Procedural_programming). The code that a procedural language uses to make these changes on the system is called [imperative code](https://en.wikipedia.org/wiki/Imperative_programming). A DBMS can be thought of as a function that takes declarative SQL code as an input, finds and runs the imperative code that carries out the declarative task, and returns the output. MySQL, for example, [uses imperative C and C++ code](https://dev.mysql.com/doc/refman/8.0/en/features.html) to carry out SQL queries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Popularity of SQL\n", "Common standards and the most popular programming languages and environments change all the time. It's an eternal struggle for data scientists as well as programmers of all kinds, and a matter of consistent anxiety. Presently, Python is the most widely used tool for data science, but will we all have to drop Python soon and [teach ourselves Julia](https://towardsdatascience.com/bye-bye-python-hello-julia-9230bff0df62)? \n", "\n", "In this context, it is stunning that SQL has been so widely used since the 1970s. 
According to a [Stack Overflow survey](https://insights.stackoverflow.com/survey/2017), SQL is one of the most widely used programming languages among the people who filled out the survey, second only to JavaScript. Taking into account the high-tech biases in this specific sample, it is probably the case that SQL is more widely used than any other language mentioned in this survey. What accounts for this popularity?\n", "\n", "This [blog post](https://blog.sqlizer.io/posts/sql-43/) argues that SQL achieved this level of longevity because it came to prominence during a time in which many of the baseline standards for the development of computer systems were being invented. As more and more systems were developed in a way that depends on SQL, it became harder to change this standard. But SQL is also simple and highly functional because it is a semantic language that expresses set-theoretical and logical operations. As long as relational databases are used, there's not much functionality that can be added to a query language beyond these foundational mathematical operations, and whatever additional functionality is needed can be added to a version of SQL by a particular DBMS. There are also many different open source and proprietary DBMSs that all employ SQL, so users can choose among many different DBMSs and platforms without having to learn a query language other than SQL.\n", "\n", "That said, there's much less of a reason to use SQL when the database is not organized according to the relational model. NoSQL databases have much more flexible schemas in general, and can store the data in one big table or in as many tables as there are records, or even datapoints, in the database. In fact, without a relational schema, the notion of a data table makes less sense in general. For example, a document store is a collection of individual records encoded using JSON or XML, and not as tables. 
These records can be sharded: stored in many corresponding servers in a distributed system to address challenges with the size of the database and the speed with which database transactions are conducted. Without tables, NoSQL DBMSs do not usually use SQL. MongoDB, for example, works with queries that are themselves in JSON format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create, Read, Update, and Delete (CRUD) Operations\n", "**Persistent storage** refers to a system in which [data outlives the process that created it](https://en.wikipedia.org/wiki/Persistence_(computer_science)). When you work with software that allows you to save a file, the file is stored in persistent storage because it still exists even after you close the software application. Hard drives are examples of persistent storage, as are local and remote servers that store databases. Any persistent storage mechanism must have methods for creating, reading (or loading), updating (or editing), and deleting the data in that storage device. Create, Read, Update, and Delete are the CRUD Operations.\n", "\n", "We've previously employed CRUD operations using the `requests` library to use an API or to do web scraping. Like `requests`, SQL and other query languages have CRUD operations. The following table, adapted from a similar one that appears on the [Wikipedia page for the CRUD operations](https://en.wikipedia.org/wiki/Create%2C_read%2C_update_and_delete), shows these operations in the `requests` package, SQL, and the MongoDB query language:\n", "\n", "|Operation|`requests` method|SQL|MongoDB|\n", "|:-|:-|:-|:-|\n", "|Create|`requests.put()` or `requests.post()`|`INSERT`|`Insert`|\n", "|Read|`requests.get()`|`SELECT`|`Find`|\n", "|Update|`requests.patch()`|`UPDATE`|`Update`|\n", "|Delete|`requests.delete()`|`DELETE`|`Remove`|\n", "\n", "As a data scientist, you will most often use read operations to obtain the data you need for your analysis. 
However, if you are collecting original data for your project, the create, update, and delete operations become much more important. We will discuss all four operations and their variants in the context of SQL and MongoDB below.\n", "\n", "We can work with SQL using `pandas` if we first create an engine that links to a specific DBMS, server, and database with `create_engine` from `sqlalchemy`. Once we do, the `pd.read_sql_query()` function makes read operations straightforward, and the `.execute()` method applied to the engine lets us easily issue create, update, and delete commands." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Style: Capitalization, Quotes, New Lines, Indentation\n", "There are many ways to write an SQL query, and when you look at someone else's SQL code you will see a variety of styles. Mostly, with the exception of quotes in some cases, stylistic differences don't change the behavior of the code, but they can have an impact on how easy the code is for other people to read and understand.\n", "\n", "A semicolon marks the end of an SQL statement, so it belongs at the end of the query and nowhere else outside of a quoted value. When a single query is passed through a function like `pd.read_sql_query()`, the closing semicolon can generally be omitted, but including it is a good habit. Beyond that, many stylistic choices are possible.\n", "\n", "The least readable way to write an SQL query is to write the entire query on one line, with no capitalization or indentation. The following code is valid SQL code:\n", "```\n", "select t.id, t.column1, t.column2, t.column3, r.column4 from table1 t inner join table2 r on t.id = r.id where column1>100 order by column2 desc;\n", "```\n", "We will discuss exactly what this query does. But for now, let's focus on the presentation of code. SQL uses **clauses** to represent particular functions for reading and writing data. In the above query, `select`, `from`, `inner join`, `on`, `where`, and `order by` are all clauses, and `desc` is an option applied to the `order by` clause. 
\n", "\n", "One stylistic choice many people make is to write SQL clauses in capital letters. That helps readers to quickly see the parts of the code that are clauses as opposed to the rest of the code that contains column names, table names, values, and aliases. If we capitalize the clauses and options in the SQL query, it looks like this:\n", "```\n", "SELECT t.id, t.column1, t.column2, t.column3, r.column4 FROM table1 t INNER JOIN table2 r ON t.id = r.id WHERE column1>100 ORDER BY column2 DESC;\n", "```\n", "Another stylistic choice people make to present the code in a more readable way is to put clauses that are considered distinct enough from other clauses on new lines. The one common exception is `FROM`, which is considered to be closely related to `SELECT` and is often written on the same line as `SELECT`. If we put each clause other than `FROM` on a new line, the query looks like this:\n", "```\n", "SELECT t.id, t.column1, t.column2, t.column3, r.column4 FROM table1 t \n", "INNER JOIN table2 r \n", "ON t.id = r.id \n", "WHERE column1>100 \n", "ORDER BY column2 DESC;\n", "```\n", "Some clauses are considered to be elaborations upon a previous clause. Column names after `SELECT` are usually written on the same line as `SELECT`, but if these columns themselves require functions that take up more space, it is useful to put them on new lines. Likewise, `ON` is considered an elaboration of `INNER JOIN`. These lines of code are often indented to express the dependence on the previous line. If we include indentation in the code, the query is\n", "```\n", "SELECT \n", " t.id, \n", " t.column1, \n", " t.column2, \n", " t.column3, \n", " r.column4 \n", "FROM table1 t \n", "INNER JOIN table2 r \n", " ON t.id = r.id \n", "WHERE column1>100 \n", "ORDER BY column2 DESC;\n", "```\n", "I encourage you to develop good habits with how you write SQL queries, both so that other people can read your code and, more importantly, so that you can read it yourself. 
You will be spending a lot of time developing and debugging SQL queries, and anything you do that cuts down the time to understand your own code will save you a lot of time and frustration in the long run.\n", "\n", "Quotes are used in SQL code mainly when referring to values of a character feature in one of the data tables. These values must be written with single quotes, not double quotes. In standard SQL, and in PostgreSQL, double quotes mark identifiers such as column names rather than values, and when working in Python, single quotes also ensure that a quote that is internal to the SQL code is not read as the end of the Python string that contains the query. That is, it is fine to write a clause that looks like `WHERE column = 'value'`, but not `WHERE column = \"value\"`.\n", "\n", "For all of the queries we will write in the following examples, we will store the query as a string variable in Python. We will use the triple-quote syntax, which allows us to write a string that spans multiple lines. So our SQL query definitions will look like this:\n", "```\n", "myquery = \"\"\"\n", "SELECT \n", " t.id, \n", " t.column1, \n", " t.column2, \n", " t.column3, \n", " r.column4 \n", "FROM table1 t \n", "INNER JOIN table2 r \n", " ON t.id = r.id \n", "WHERE column1>100 \n", "ORDER BY column2 DESC;\n", "\"\"\"\n", "```\n", "We will then be able to pass the `myquery` variable to functions like `pd.read_sql_query()` to be evaluated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Joins\n", "The simplest SQL commands for reading data from a database are `SELECT` and `FROM`. In module 6, we issued the following query to read the entire \"reviews\" entity from the wine reviews database:\n", "```\n", "SELECT * FROM reviews\n", "```\n", "But this query does not read data from the other entities in the database. It pulls all of the rows and columns from \"reviews\" and it does not manipulate the data within \"reviews\" in any way. 
That might not be the best way to create a dataframe to conduct analyses.\n", "\n", "SQL read operations get data, but they can also clean data at the same time. Cleaning data is an important challenge. Even when the data are stored in a well-organized database, that organization might not be the right format for the data given the analyses we intend to do. \"[Tidy Data](https://www.jstatsoft.org/article/view/v059i10)\" by Hadley Wickham lays out a philosophy of what it means to clean a dataset. The goal is to put data into a format in which modeling and visualization are as easy as possible. There are two steps: first we create **tidy data** and then we manipulate the data to fit our specific needs. Tidy data is defined as follows:\n", "\n", "> A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, [features] and types. In tidy data:\n", "> 1. Each [feature] forms a column.\n", "> 2. Each observation forms a row.\n", "> 3. Each type of observational unit forms a table (p. 2).\n", "\n", "In other words, the dataset we will use in an analysis must exist in one table, the rows of the table must represent records (also called observations), the columns of the table must represent features, and the rows must represent comparable units. For example, if the data contain records from the 50 U.S. States, there should be a row for each state, and there should not be a row for regions or for the whole country as these units are not comparable to states. Data from a relational database are not generally in tidy format because the relevant data exist in multiple tables. The first step is to combine these tables into one dataset using **join** functions within SQL read operations. There are many different kinds of joins, and the easiest way to see the difference between these types is to see what they do to real data. 
\n", "\n", "Before we discuss specific examples of how to use SQL, we load the following libraries:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import sys\n", "import os\n", "import psycopg2\n", "from sqlalchemy import create_engine\n", "import dotenv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example Database: NFL and NBA Teams\n", "As an example, I create a PostgreSQL database that contains two tables: \"nfl\" contains the location and team name of all 32 NFL teams:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " city footballteam\n", "0 Buffalo Buffalo Bills\n", "1 Miami Miami Dolphins\n", "2 Boston New England Patriots\n", "3 New York [New York Jets, New York Giants]\n", "4 Cleveland Cleveland Browns\n", "5 Cincinnati Cincinnati Bengals\n", "6 Pittsburgh Pittsburgh Steelers\n", "7 Baltimore Baltimore Ravens\n", "8 Kansas City Kansas City Chiefs\n", "9 Las Vegas Las Vegas Raiders\n", "10 Los Angeles [L.A. Chargers, L.A. Rams]\n", "11 Denver Denver Broncos\n", "12 Nashville Tennessee Titans\n", "13 Jacksonville Jacksonville Jaguars\n", "14 Houston Houston Texans\n", "15 Indianapolis Indianapolis Colts\n", "16 Philadelphia Philadelphia Eagles\n", "17 Dallas Dallas Cowboys\n", "18 Washington Washington Skins\n", "19 Atlanta Atlanta Falcons\n", "20 Charlotte Carolina Panthers\n", "21 Tampa Bay Tampa Bay Buccaneers\n", "22 New Orleans New Orleans Saints\n", "23 San Francisco San Francisco 49ers\n", "24 Phoenix Arizona Cardinals\n", "25 Seattle Seattle Seahawks\n", "26 Chicago Chicago Bears\n", "27 Green Bay Green Bay Packers\n", "28 Minneapolis Minnesota Vikings\n", "29 Detroit Detroit Lions" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nfl_dict = {'city':['Buffalo','Miami','Boston','New York','Cleveland','Cincinnati',\n", " 'Pittsburgh','Baltimore','Kansas City','Las Vegas','Los Angeles','Denver',\n", " 'Nashville','Jacksonville','Houston','Indianapolis','Philadelphia','Dallas',\n", " 'Washington','Atlanta','Charlotte','Tampa Bay','New Orleans','San Francisco',\n", " 'Phoenix', 'Seattle','Chicago','Green Bay','Minneapolis','Detroit'],\n", " 'footballteam':['Buffalo Bills','Miami Dolphins','New England Patriots',\n", " ['New York Jets', 'New York Giants'],'Cleveland Browns','Cincinnati Bengals',\n", " 'Pittsburgh Steelers','Baltimore Ravens','Kansas City Chiefs',\n", " 'Las Vegas Raiders',['L.A. Chargers','L.A. 
Rams'],'Denver Broncos',\n", " 'Tennessee Titans','Jacksonville Jaguars','Houston Texans',\n", " 'Indianapolis Colts','Philadelphia Eagles','Dallas Cowboys',\n", " 'Washington Skins','Atlanta Falcons','Carolina Panthers',\n", " 'Tampa Bay Buccaneers','New Orleans Saints', 'San Francisco 49ers',\n", " 'Arizona Cardinals','Seattle Seahawks','Chicago Bears',\n", " 'Green Bay Packers','Minnesota Vikings','Detroit Lions']}\n", "nfl_df = pd.DataFrame(nfl_dict)\n", "nfl_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This table is not in first normal form because the data are non-atomic (two teams from New York and two in Los Angeles), but this form is useful for illustrating what different SQL joins do. The second table contains the same information about NBA teams:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " city basketballteam\n", "0 Boston Boston Celtics\n", "1 New York New York Knicks\n", "2 Philadelphia Philadelphia 76ers\n", "3 Brooklyn Brooklyn Nets\n", "4 Toronto Toronto Raptors\n", "5 Cleveland Cleveland Cavaliers\n", "6 Chicago Chicago Bulls\n", "7 Detroit Detroit Pistons\n", "8 Milwaukee Milwaukee Bucks\n", "9 Indianapolis Indiana Pacers\n", "10 Atlanta Atlanta Hawks\n", "11 Washington Washington Wizards\n", "12 Orlando Orlando Magic\n", "13 Miami Miami Heat\n", "14 Charlotte Charlotte Hornets\n", "15 Los Angeles [L.A. Lakers, L.A. Clippers]\n", "16 San Francisco Golden State Warriors\n", "17 Portland Portland Trailblazers\n", "18 Sacramento Sacramento Kings\n", "19 Phoenix Phoenix Suns\n", "20 San Antonio San Antonio Spurs\n", "21 Dallas Dallas Mavericks\n", "22 Houston Houston Rockets\n", "23 Oklahoma City Oklahoma City Thunder\n", "24 Minneapolis Minnesota Timberwolves\n", "25 Denver Denver Nuggets\n", "26 Salt Lake City Utah Jazz\n", "27 Memphis Memphis Grizzlies\n", "28 New Orleans New Orleans Pelicans" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nba_dict = {'city':['Boston','New York','Philadelphia','Brooklyn','Toronto',\n", " 'Cleveland','Chicago','Detroit','Milwaukee','Indianapolis',\n", " 'Atlanta', 'Washington','Orlando','Miami','Charlotte',\n", " 'Los Angeles','San Francisco','Portland','Sacramento',\n", " 'Phoenix','San Antonio','Dallas','Houston','Oklahoma City',\n", " 'Minneapolis','Denver','Salt Lake City','Memphis','New Orleans'],\n", " 'basketballteam':['Boston Celtics','New York Knicks','Philadelphia 76ers',\n", " 'Brooklyn Nets','Toronto Raptors',\n", " 'Cleveland Cavaliers','Chicago Bulls','Detroit Pistons',\n", " 'Milwaukee Bucks','Indiana Pacers',\n", " 'Atlanta Hawks','Washington Wizards','Orlando Magic',\n", " 'Miami Heat','Charlotte Hornets',\n", " ['L.A. Lakers','L.A. 
Clippers'],'Golden State Warriors',\n", " 'Portland Trailblazers','Sacramento Kings',\n", " 'Phoenix Suns','San Antonio Spurs','Dallas Mavericks',\n", " 'Houston Rockets','Oklahoma City Thunder',\n", " 'Minnesota Timberwolves','Denver Nuggets',\n", " 'Utah Jazz','Memphis Grizzlies','New Orleans Pelicans']}\n", "nba_df = pd.DataFrame(nba_dict)\n", "nba_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a PostgreSQL database with entities for the NFL and NBA teams, I connect to the PostgreSQL server running on my local computer (see module 6 for a more detailed discussion of how this works). First I bring my PostgreSQL password into the local environment:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "dotenv.load_dotenv()\n", "pgpassword = os.getenv(\"pgpassword\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then I access the server and establish a cursor for the server:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "dbserver = psycopg2.connect(\n", " user='jk8sd', \n", " password=pgpassword, \n", " host=\"localhost\"\n", ")\n", "dbserver.autocommit = True\n", "cursor = dbserver.cursor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I create an empty \"teams\" database:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "try:\n", " cursor.execute(\"CREATE DATABASE teams\")\n", "except psycopg2.Error:\n", " # Most likely the database already exists, so drop it and recreate it\n", " cursor.execute(\"DROP DATABASE teams\")\n", " cursor.execute(\"CREATE DATABASE teams\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And I use the `create_engine()` function from `sqlalchemy` to allow queries to the \"teams\" database:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "engine = create_engine(\"postgresql+psycopg2://{user}:{pw}@localhost/{db}\"\n", " .format(user=\"jk8sd\", pw=pgpassword, db=\"teams\"))" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "I add the `nfl_df` and `nba_df` dataframes to the \"teams\" database:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "nfl_df.to_sql('nfl', con = engine, index=False, chunksize=1000, if_exists = 'replace')\n", "nba_df.to_sql('nba', con = engine, index=False, chunksize=1000, if_exists = 'replace')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can now issue queries to the database. To read all of the data in the NFL table, for example, I type:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " city footballteam\n", "0 Buffalo Buffalo Bills\n", "1 Miami Miami Dolphins\n", "2 Boston New England Patriots\n", "3 New York {\"New York Jets\",\"New York Giants\"}\n", "4 Cleveland Cleveland Browns\n", "5 Cincinnati Cincinnati Bengals\n", "6 Pittsburgh Pittsburgh Steelers\n", "7 Baltimore Baltimore Ravens\n", "8 Kansas City Kansas City Chiefs\n", "9 Las Vegas Las Vegas Raiders\n", "10 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"}\n", "11 Denver Denver Broncos\n", "12 Nashville Tennessee Titans\n", "13 Jacksonville Jacksonville Jaguars\n", "14 Houston Houston Texans\n", "15 Indianapolis Indianapolis Colts\n", "16 Philadelphia Philadelphia Eagles\n", "17 Dallas Dallas Cowboys\n", "18 Washington Washington Skins\n", "19 Atlanta Atlanta Falcons\n", "20 Charlotte Carolina Panthers\n", "21 Tampa Bay Tampa Bay Buccaneers\n", "22 New Orleans New Orleans Saints\n", "23 San Francisco San Francisco 49ers\n", "24 Phoenix Arizona Cardinals\n", "25 Seattle Seattle Seahawks\n", "26 Chicago Chicago Bears\n", "27 Green Bay Green Bay Packers\n", "28 Minneapolis Minnesota Vikings\n", "29 Detroit Detroit Lions" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"SELECT * FROM nfl\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Types of Joins\n", "Joining data tables is the act of *adding columns* to an existing data table - that is, adding more features to existing records - by matching the rows in one table to the corresponding rows in another table. In a relational database, data tables can include a foreign key which serves as the primary key for another data table. Joins require matching a foreign key in one table to the corresponding primary key in another table. During the join, this foreign key and this primary key are both called **indices**. 
To perform a join with an SQL query, we specify the two tables in the database we want to join and the index in each table we will match on.\n", "\n", "In the teams database, `city` uniquely identifies each row in both the \"nfl\" and \"nba\" tables, so it serves as the primary key of each table and can be matched against the `city` column of the other table as if it were a foreign key. Joining the \"nfl\" and \"nba\" tables by matching on `city` creates one data table in which the rows still represent cities and the columns list both the NBA and NFL teams in each city. \n", "\n", "Not every city has both an NFL and an NBA team. Green Bay, for example, has a football team but no basketball team, and Sacramento has a basketball team but no football team. In a join, every row in a table either **matches** with one or more rows in the other table, or is **unmatched**. In this case, Cleveland in the NFL table is matched to a row in the NBA table because Cleveland has both a football team and a basketball team, but Oklahoma City in the NBA table is unmatched because there is no row for Oklahoma City in the NFL table.\n", "\n", "The main difference between types of joins in SQL is their treatment of unmatched records. 
The following table summarizes the types of joins:\n", "\n", "| Type of join | Definition |\n", "|--------------|------------|\n", "| Inner join | Keep only the records that are matched in both tables |\n", "| Left join | Keep all the records in the first table listed, and keep only the records in the second table listed that have matches in the first table |\n", "| Right join | Keep all the records in the second table listed, and keep only the records in the first table listed that have matches in the second table |\n", "| Full join | Keep all of the records in both tables whether they are matched or not |\n", "| Anti join | Keep only the records in the first table that are not matched in the second table |\n", "| Natural join | The same as any of the joins listed above, but with no need to specify the indices, which are determined automatically by finding columns with the same name. If no columns share the same name, a natural join performs a cross join. If more than one pair of columns share names across the two data tables, a natural join matches on all of them together. Use caution. |\n", "| Cross join | Also called a **Cartesian product**. If the first table has $M$ rows and the second table has $N$ rows, the result has $M\\times N$ rows: every row of the first table is paired with every row of the second table. |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inner Joins\n", "The syntax for an inner join is\n", "```\n", "SELECT * FROM table1\n", "INNER JOIN table2\n", " ON table1.index_name = table2.index_name;\n", "```\n", "where `table1` and `table2` are the data tables we are joining, and `table1.index_name` and `table2.index_name` are the columns that contain the indices for tables 1 and 2. 
Because inner join is the default type of join, the shorter syntax\n", "```\n", "SELECT * FROM table1\n", "JOIN table2\n", " ON table1.index_name = table2.index_name;\n", "```\n", "also produces an inner join. I recommend typing `INNER JOIN`, however, to avoid confusing this type of join with other types.\n", "\n", "In the case of the teams database, an inner join of the NFL and NBA tables yields a data frame with one row for every city that has both a basketball and a football team. The SQL query that generates this data frame is:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " city footballteam city \\\n", "0 Miami Miami Dolphins Miami \n", "1 Boston New England Patriots Boston \n", "2 New York {\"New York Jets\",\"New York Giants\"} New York \n", "3 Cleveland Cleveland Browns Cleveland \n", "4 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"} Los Angeles \n", "5 Denver Denver Broncos Denver \n", "6 Houston Houston Texans Houston \n", "7 Indianapolis Indianapolis Colts Indianapolis \n", "8 Philadelphia Philadelphia Eagles Philadelphia \n", "9 Dallas Dallas Cowboys Dallas \n", "10 Washington Washington Skins Washington \n", "11 Atlanta Atlanta Falcons Atlanta \n", "12 Charlotte Carolina Panthers Charlotte \n", "13 New Orleans New Orleans Saints New Orleans \n", "14 San Francisco San Francisco 49ers San Francisco \n", "15 Phoenix Arizona Cardinals Phoenix \n", "16 Chicago Chicago Bears Chicago \n", "17 Minneapolis Minnesota Vikings Minneapolis \n", "18 Detroit Detroit Lions Detroit \n", "\n", " basketballteam \n", "0 Miami Heat \n", "1 Boston Celtics \n", "2 New York Knicks \n", "3 Cleveland Cavaliers \n", "4 {\"L.A. Lakers\",\"L.A. Clippers\"} \n", "5 Denver Nuggets \n", "6 Houston Rockets \n", "7 Indiana Pacers \n", "8 Philadelphia 76ers \n", "9 Dallas Mavericks \n", "10 Washington Wizards \n", "11 Atlanta Hawks \n", "12 Charlotte Hornets \n", "13 New Orleans Pelicans \n", "14 Golden State Warriors \n", "15 Phoenix Suns \n", "16 Chicago Bulls \n", "17 Minnesota Timberwolves \n", "18 Detroit Pistons " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl\n", "INNER JOIN nba\n", " ON nfl.city = nba.city;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The three quotations that come before and after the SQL code is Python syntax that allow for a string to be entered on multiple lines. 
With just one quote, Python expects the string to end on the same line, so a line break inside the query produces a syntax error. Three quotes allow us to space out the components of the SQL query on separate lines to make the SQL code easier to read and understand.\n", "\n", "An SQL query can be written on multiple lines. A semicolon marks the end of the statement, so it belongs only at the end of the last line. When a single query is sent through `pd.read_sql_query()`, as with `SELECT * FROM nfl` above, the semicolon can be omitted.\n", "\n", "Another way to write the inner join query is to use **aliasing**: specifying a shorter name or a single letter next to each data table in the query to simplify the syntax for `ON`. For example, I can alias the NFL data with `f` and the NBA data with `b`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " city footballteam city \\\n", "0 Miami Miami Dolphins Miami \n", "1 Boston New England Patriots Boston \n", "2 New York {\"New York Jets\",\"New York Giants\"} New York \n", "3 Cleveland Cleveland Browns Cleveland \n", "4 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"} Los Angeles \n", "5 Denver Denver Broncos Denver \n", "6 Houston Houston Texans Houston \n", "7 Indianapolis Indianapolis Colts Indianapolis \n", "8 Philadelphia Philadelphia Eagles Philadelphia \n", "9 Dallas Dallas Cowboys Dallas \n", "10 Washington Washington Skins Washington \n", "11 Atlanta Atlanta Falcons Atlanta \n", "12 Charlotte Carolina Panthers Charlotte \n", "13 New Orleans New Orleans Saints New Orleans \n", "14 San Francisco San Francisco 49ers San Francisco \n", "15 Phoenix Arizona Cardinals Phoenix \n", "16 Chicago Chicago Bears Chicago \n", "17 Minneapolis Minnesota Vikings Minneapolis \n", "18 Detroit Detroit Lions Detroit \n", "\n", " basketballteam \n", "0 Miami Heat \n", "1 Boston Celtics \n", "2 New York Knicks \n", "3 Cleveland Cavaliers \n", "4 {\"L.A. Lakers\",\"L.A. Clippers\"} \n", "5 Denver Nuggets \n", "6 Houston Rockets \n", "7 Indiana Pacers \n", "8 Philadelphia 76ers \n", "9 Dallas Mavericks \n", "10 Washington Wizards \n", "11 Atlanta Hawks \n", "12 Charlotte Hornets \n", "13 New Orleans Pelicans \n", "14 Golden State Warriors \n", "15 Phoenix Suns \n", "16 Chicago Bulls \n", "17 Minnesota Timberwolves \n", "18 Detroit Pistons " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl f\n", "INNER JOIN nba b\n", " ON f.city = b.city;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two indices we match on do not necessarily have to have the same name. 
If the `city` column were named `location` in the NFL table and `town` in the NBA table, the syntax for the inner join would be:\n", "```\n", "SELECT * FROM nfl f\n", "INNER JOIN nba b\n", " ON f.location = b.town;\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Left and Right Joins\n", "The syntax for a left join is\n", "```\n", "SELECT * FROM table1\n", "LEFT JOIN table2\n", " ON table1.index_name = table2.index_name;\n", "```\n", "and the syntax for a right join is\n", "```\n", "SELECT * FROM table1\n", "RIGHT JOIN table2\n", " ON table1.index_name = table2.index_name;\n", "```\n", "In the case of the teams database, if we list the NFL table next to `FROM` and the NBA table next to `LEFT JOIN`, then the left join lists all of the cities with an NFL team, and also displays the NBA team in that city if one exists. Otherwise, the query returns a missing value (`NULL` in SQL, displayed as `None` by pandas) in the cells where the NBA city and team would be. For the teams database, the syntax for a left join is:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " city footballteam city \\\n", "0 Buffalo Buffalo Bills None \n", "1 Miami Miami Dolphins Miami \n", "2 Boston New England Patriots Boston \n", "3 New York {\"New York Jets\",\"New York Giants\"} New York \n", "4 Cleveland Cleveland Browns Cleveland \n", "5 Cincinnati Cincinnati Bengals None \n", "6 Pittsburgh Pittsburgh Steelers None \n", "7 Baltimore Baltimore Ravens None \n", "8 Kansas City Kansas City Chiefs None \n", "9 Las Vegas Las Vegas Raiders None \n", "10 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"} Los Angeles \n", "11 Denver Denver Broncos Denver \n", "12 Nashville Tennessee Titans None \n", "13 Jacksonville Jacksonville Jaguars None \n", "14 Houston Houston Texans Houston \n", "15 Indianapolis Indianapolis Colts Indianapolis \n", "16 Philadelphia Philadelphia Eagles Philadelphia \n", "17 Dallas Dallas Cowboys Dallas \n", "18 Washington Washington Skins Washington \n", "19 Atlanta Atlanta Falcons Atlanta \n", "20 Charlotte Carolina Panthers Charlotte \n", "21 Tampa Bay Tampa Bay Buccaneers None \n", "22 New Orleans New Orleans Saints New Orleans \n", "23 San Francisco San Francisco 49ers San Francisco \n", "24 Phoenix Arizona Cardinals Phoenix \n", "25 Seattle Seattle Seahawks None \n", "26 Chicago Chicago Bears Chicago \n", "27 Green Bay Green Bay Packers None \n", "28 Minneapolis Minnesota Vikings Minneapolis \n", "29 Detroit Detroit Lions Detroit \n", "\n", " basketballteam \n", "0 None \n", "1 Miami Heat \n", "2 Boston Celtics \n", "3 New York Knicks \n", "4 Cleveland Cavaliers \n", "5 None \n", "6 None \n", "7 None \n", "8 None \n", "9 None \n", "10 {\"L.A. Lakers\",\"L.A. 
Clippers\"} \n", "11 Denver Nuggets \n", "12 None \n", "13 None \n", "14 Houston Rockets \n", "15 Indiana Pacers \n", "16 Philadelphia 76ers \n", "17 Dallas Mavericks \n", "18 Washington Wizards \n", "19 Atlanta Hawks \n", "20 Charlotte Hornets \n", "21 None \n", "22 New Orleans Pelicans \n", "23 Golden State Warriors \n", "24 Phoenix Suns \n", "25 None \n", "26 Chicago Bulls \n", "27 None \n", "28 Minnesota Timberwolves \n", "29 Detroit Pistons " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl f\n", "LEFT JOIN nba b\n", " ON f.city = b.city;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Likewise, the right join displays all the cities with an NBA team, along with the NFL team in that city, if one exists:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " city footballteam city \\\n", "0 Boston New England Patriots Boston \n", "1 New York {\"New York Jets\",\"New York Giants\"} New York \n", "2 Philadelphia Philadelphia Eagles Philadelphia \n", "3 None None Brooklyn \n", "4 None None Toronto \n", "5 Cleveland Cleveland Browns Cleveland \n", "6 Chicago Chicago Bears Chicago \n", "7 Detroit Detroit Lions Detroit \n", "8 None None Milwaukee \n", "9 Indianapolis Indianapolis Colts Indianapolis \n", "10 Atlanta Atlanta Falcons Atlanta \n", "11 Washington Washington Skins Washington \n", "12 None None Orlando \n", "13 Miami Miami Dolphins Miami \n", "14 Charlotte Carolina Panthers Charlotte \n", "15 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"} Los Angeles \n", "16 San Francisco San Francisco 49ers San Francisco \n", "17 None None Portland \n", "18 None None Sacramento \n", "19 Phoenix Arizona Cardinals Phoenix \n", "20 None None San Antonio \n", "21 Dallas Dallas Cowboys Dallas \n", "22 Houston Houston Texans Houston \n", "23 None None Oklahoma City \n", "24 Minneapolis Minnesota Vikings Minneapolis \n", "25 Denver Denver Broncos Denver \n", "26 None None Salt Lake City \n", "27 None None Memphis \n", "28 New Orleans New Orleans Saints New Orleans \n", "\n", " basketballteam \n", "0 Boston Celtics \n", "1 New York Knicks \n", "2 Philadelphia 76ers \n", "3 Brooklyn Nets \n", "4 Toronto Raptors \n", "5 Cleveland Cavaliers \n", "6 Chicago Bulls \n", "7 Detroit Pistons \n", "8 Milwaukee Bucks \n", "9 Indiana Pacers \n", "10 Atlanta Hawks \n", "11 Washington Wizards \n", "12 Orlando Magic \n", "13 Miami Heat \n", "14 Charlotte Hornets \n", "15 {\"L.A. Lakers\",\"L.A. 
Clippers\"} \n", "16 Golden State Warriors \n", "17 Portland Trailblazers \n", "18 Sacramento Kings \n", "19 Phoenix Suns \n", "20 San Antonio Spurs \n", "21 Dallas Mavericks \n", "22 Houston Rockets \n", "23 Oklahoma City Thunder \n", "24 Minnesota Timberwolves \n", "25 Denver Nuggets \n", "26 Utah Jazz \n", "27 Memphis Grizzlies \n", "28 New Orleans Pelicans " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl f\n", "RIGHT JOIN nba b\n", " ON f.city = b.city;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the left and right joins, swapping which data table appears with `FROM` and which appears with `JOIN` accomplishes the same thing as changing a left join to a right join. \n", "\n", "### Full (Outer) Join\n", "A full join, also called a full outer join, keeps all of the records that exist in either table, whether or not they are matched. A full join returns a data frame with at least as many rows as the larger of the two data tables in the join because it contains every record that appears in either table. Most SQL tutorials warn that full joins can return massive amounts of data, and full joins are not implemented in MySQL. For systems like PostgreSQL that support full joins, the syntax for a full join is\n", "```\n", "SELECT * FROM table1\n", "FULL JOIN table2\n", " ON table1.index_name = table2.index_name;\n", "```\n", "For the teams database, a full join produces a data frame with one row for every city with an NFL team or an NBA team or both:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " city footballteam city \\\n", "0 Buffalo Buffalo Bills None \n", "1 Miami Miami Dolphins Miami \n", "2 Boston New England Patriots Boston \n", "3 New York {\"New York Jets\",\"New York Giants\"} New York \n", "4 Cleveland Cleveland Browns Cleveland \n", "5 Cincinnati Cincinnati Bengals None \n", "6 Pittsburgh Pittsburgh Steelers None \n", "7 Baltimore Baltimore Ravens None \n", "8 Kansas City Kansas City Chiefs None \n", "9 Las Vegas Las Vegas Raiders None \n", "10 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"} Los Angeles \n", "11 Denver Denver Broncos Denver \n", "12 Nashville Tennessee Titans None \n", "13 Jacksonville Jacksonville Jaguars None \n", "14 Houston Houston Texans Houston \n", "15 Indianapolis Indianapolis Colts Indianapolis \n", "16 Philadelphia Philadelphia Eagles Philadelphia \n", "17 Dallas Dallas Cowboys Dallas \n", "18 Washington Washington Skins Washington \n", "19 Atlanta Atlanta Falcons Atlanta \n", "20 Charlotte Carolina Panthers Charlotte \n", "21 Tampa Bay Tampa Bay Buccaneers None \n", "22 New Orleans New Orleans Saints New Orleans \n", "23 San Francisco San Francisco 49ers San Francisco \n", "24 Phoenix Arizona Cardinals Phoenix \n", "25 Seattle Seattle Seahawks None \n", "26 Chicago Chicago Bears Chicago \n", "27 Green Bay Green Bay Packers None \n", "28 Minneapolis Minnesota Vikings Minneapolis \n", "29 Detroit Detroit Lions Detroit \n", "30 None None Milwaukee \n", "31 None None Oklahoma City \n", "32 None None Portland \n", "33 None None Brooklyn \n", "34 None None Sacramento \n", "35 None None Memphis \n", "36 None None San Antonio \n", "37 None None Salt Lake City \n", "38 None None Orlando \n", "39 None None Toronto \n", "\n", " basketballteam \n", "0 None \n", "1 Miami Heat \n", "2 Boston Celtics \n", "3 New York Knicks \n", "4 Cleveland Cavaliers \n", "5 None \n", "6 None \n", "7 None \n", "8 None \n", "9 None \n", "10 {\"L.A. Lakers\",\"L.A. 
Clippers\"} \n", "11 Denver Nuggets \n", "12 None \n", "13 None \n", "14 Houston Rockets \n", "15 Indiana Pacers \n", "16 Philadelphia 76ers \n", "17 Dallas Mavericks \n", "18 Washington Wizards \n", "19 Atlanta Hawks \n", "20 Charlotte Hornets \n", "21 None \n", "22 New Orleans Pelicans \n", "23 Golden State Warriors \n", "24 Phoenix Suns \n", "25 None \n", "26 Chicago Bulls \n", "27 None \n", "28 Minnesota Timberwolves \n", "29 Detroit Pistons \n", "30 Milwaukee Bucks \n", "31 Oklahoma City Thunder \n", "32 Portland Trailblazers \n", "33 Brooklyn Nets \n", "34 Sacramento Kings \n", "35 Memphis Grizzlies \n", "36 San Antonio Spurs \n", "37 Utah Jazz \n", "38 Orlando Magic \n", "39 Toronto Raptors " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl f\n", "FULL JOIN nba b\n", " ON f.city = b.city;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although there are 30 cities with at least one NFL team and 29 cities with at least one NBA team, there are only 40 cities with at least one team from either league, because 19 cities appear in both tables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Anti-Joins\n", "An anti-join leaves us with all of the records in the first data table that do not appear in the second table. There is no \"ANTI JOIN\" syntax in SQL, but the behavior of an anti-join can be generated by including a `WHERE` clause along with `LEFT JOIN`. The syntax for an anti-join is\n", "```\n", "SELECT * FROM table1\n", "LEFT JOIN table2\n", " ON table1.index_name = table2.index_name\n", "WHERE table2.index_name is NULL;\n", "```\n", "The `WHERE` clause selects the rows of a data table for which a specified logical condition is true. 
After performing a left join we have a data table with all of the rows in the first table along with the data for those rows in the second table if the row had a match in the second table. Typing `WHERE table2.index_name is NULL` restricts this data table to only the rows that do not have a value of the index in the second table, meaning there was no match. For the teams database, the anti-join of the NFL and NBA tables yields a dataframe of all the cities with an NFL team but no NBA team:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityfootballteamcitybasketballteam
0BuffaloBuffalo BillsNoneNone
1CincinnatiCincinnati BengalsNoneNone
2PittsburghPittsburgh SteelersNoneNone
3BaltimoreBaltimore RavensNoneNone
4Kansas CityKansas City ChiefsNoneNone
5Las VegasLas Vegas RaidersNoneNone
6NashvilleTennessee TitansNoneNone
7JacksonvilleJacksonville JaguarsNoneNone
8Tampa BayTampa Bay BuccaneersNoneNone
9SeattleSeattle SeahawksNoneNone
10Green BayGreen Bay PackersNoneNone
\n", "
" ], "text/plain": [ " city footballteam city basketballteam\n", "0 Buffalo Buffalo Bills None None\n", "1 Cincinnati Cincinnati Bengals None None\n", "2 Pittsburgh Pittsburgh Steelers None None\n", "3 Baltimore Baltimore Ravens None None\n", "4 Kansas City Kansas City Chiefs None None\n", "5 Las Vegas Las Vegas Raiders None None\n", "6 Nashville Tennessee Titans None None\n", "7 Jacksonville Jacksonville Jaguars None None\n", "8 Tampa Bay Tampa Bay Buccaneers None None\n", "9 Seattle Seattle Seahawks None None\n", "10 Green Bay Green Bay Packers None None" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl f\n", "LEFT JOIN nba b\n", " ON f.city = b.city\n", "WHERE b.city is NULL;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Natural Joins\n", "One annoying thing about all of the joins shown above is that we end up with two columns that contain the same information. In the case of the teams database, we have two `city` columns that are always either equal, or else one is missing. But when one of the `city` columns says \"None\", the team from that table also says \"None\", so the missingness in the `city` column does not provide additional information.\n", "\n", "It might make sense to use a different kind of join that understands that the two `city` columns contain the same information and includes only one of these columns. A natural join does two things differently from the other joins described here:\n", "\n", "1. A natural join removes duplicated columns from the output data.\n", "\n", "2. A natural join detects the indices automatically by assuming that all columns that share the same name are part of the index.\n", "\n", "If done correctly, a natural join saves some work constructing the query as the indices are detected automatically, and provides cleaner output. 
Any of the joins described above can be done as a natural join by adding `NATURAL` in front of `INNER`, `LEFT`, `RIGHT`, or `FULL`. If there are no columns that share the same name, a natural join instead performs a cross join (described below). \n", "\n", "The following query performs a natural inner join:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityfootballteambasketballteam
0MiamiMiami DolphinsMiami Heat
1BostonNew England PatriotsBoston Celtics
2New York{\"New York Jets\",\"New York Giants\"}New York Knicks
3ClevelandCleveland BrownsCleveland Cavaliers
4Los Angeles{\"L.A. Chargers\",\"L.A. Rams\"}{\"L.A. Lakers\",\"L.A. Clippers\"}
5DenverDenver BroncosDenver Nuggets
6HoustonHouston TexansHouston Rockets
7IndianapolisIndianapolis ColtsIndiana Pacers
8PhiladelphiaPhiladelphia EaglesPhiladelphia 76ers
9DallasDallas CowboysDallas Mavericks
10WashingtonWashington SkinsWashington Wizards
11AtlantaAtlanta FalconsAtlanta Hawks
12CharlotteCarolina PanthersCharlotte Hornets
13New OrleansNew Orleans SaintsNew Orleans Pelicans
14San FranciscoSan Francisco 49ersGolden State Warriors
15PhoenixArizona CardinalsPhoenix Suns
16ChicagoChicago BearsChicago Bulls
17MinneapolisMinnesota VikingsMinnesota Timberwolves
18DetroitDetroit LionsDetroit Pistons
\n", "
" ], "text/plain": [ " city footballteam \\\n", "0 Miami Miami Dolphins \n", "1 Boston New England Patriots \n", "2 New York {\"New York Jets\",\"New York Giants\"} \n", "3 Cleveland Cleveland Browns \n", "4 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"} \n", "5 Denver Denver Broncos \n", "6 Houston Houston Texans \n", "7 Indianapolis Indianapolis Colts \n", "8 Philadelphia Philadelphia Eagles \n", "9 Dallas Dallas Cowboys \n", "10 Washington Washington Skins \n", "11 Atlanta Atlanta Falcons \n", "12 Charlotte Carolina Panthers \n", "13 New Orleans New Orleans Saints \n", "14 San Francisco San Francisco 49ers \n", "15 Phoenix Arizona Cardinals \n", "16 Chicago Chicago Bears \n", "17 Minneapolis Minnesota Vikings \n", "18 Detroit Detroit Lions \n", "\n", " basketballteam \n", "0 Miami Heat \n", "1 Boston Celtics \n", "2 New York Knicks \n", "3 Cleveland Cavaliers \n", "4 {\"L.A. Lakers\",\"L.A. Clippers\"} \n", "5 Denver Nuggets \n", "6 Houston Rockets \n", "7 Indiana Pacers \n", "8 Philadelphia 76ers \n", "9 Dallas Mavericks \n", "10 Washington Wizards \n", "11 Atlanta Hawks \n", "12 Charlotte Hornets \n", "13 New Orleans Pelicans \n", "14 Golden State Warriors \n", "15 Phoenix Suns \n", "16 Chicago Bulls \n", "17 Minnesota Timberwolves \n", "18 Detroit Pistons " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * from nfl\n", "NATURAL INNER JOIN nba\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Natural joins are controversial, however, and many data scientists choose not to use them at all. The danger is that if two columns unexpectedly have the same name (it can be hard to keep track of all of the features' names in big databases) then a natural join will match on the wrong indices. 
This [Stack Overflow post](https://stackoverflow.com/questions/8696383/difference-between-natural-join-and-inner-join#8696402) gets into this debate, and one response made a forceful argument against natural joins:\n", "\n", "> Collapsing columns in the output is the least-important aspect of a natural join. The things you need to know are (A) it automatically joins on fields of the same name and (B) it will f--- up your s--- when you least expect it. In my world, using a natural join is grounds for dismissal. . . . Say you have a natural join between `Customers` and `Employees`, joining on `EmployeeID`. Employees also has a `ManagerID` field. Everything's fine. Then, some day, someone adds a `ManagerID` field to the `Customers` table. Your join will not break (that would be a mercy), instead it will now include a second field, and work incorrectly. Thus, a seemingly harmless change can break something only distantly related. VERY BAD. The only upside of a natural join is saving a little typing, and the downside is substantial.\n", "\n", "Personally, I disagree with this statement as I think natural joins can be elegant and convenient, especially when I want to match on multiple indices. But I agree that natural joins do make it easier to mess up a join, and more caution is needed. To demonstrate how a natural join can go wrong, suppose that in both the NFL and NBA tables the columns were named `city` and `team`. 
The following code creates versions of these tables with `footballteam` and `basketballteam` each renamed to `team` and stores these tables in the database as \"nfl2\" and \"nba2\": " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "nfl2 = pd.read_sql_query(\"SELECT city, footballteam as team FROM nfl;\", con=engine)\n", "nba2 = pd.read_sql_query(\"SELECT city, basketballteam as team FROM nba;\", con=engine)\n", "nfl2.to_sql('nfl2', con = engine, index=False, chunksize=1000, if_exists = 'replace')\n", "nba2.to_sql('nba2', con = engine, index=False, chunksize=1000, if_exists = 'replace')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now a natural inner join between \"nfl2\" and \"nba2\" yields a dataframe with no records:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityteam
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [city, team]\n", "Index: []" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl2 \n", "NATURAL INNER JOIN nba2;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason why there are no records is that the natural join automatically chooses both `city` and `team` to be part of the index, and records are only kept in the inner join if they match on both city and team. There are many matches for city, but no matches for both city and team.\n", "\n", "In contrast, a regular inner join still works fine:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityteamcityteam
0MiamiMiami DolphinsMiamiMiami Heat
1BostonNew England PatriotsBostonBoston Celtics
2New York{\"New York Jets\",\"New York Giants\"}New YorkNew York Knicks
3ClevelandCleveland BrownsClevelandCleveland Cavaliers
4Los Angeles{\"L.A. Chargers\",\"L.A. Rams\"}Los Angeles{\"L.A. Lakers\",\"L.A. Clippers\"}
5DenverDenver BroncosDenverDenver Nuggets
6HoustonHouston TexansHoustonHouston Rockets
7IndianapolisIndianapolis ColtsIndianapolisIndiana Pacers
8PhiladelphiaPhiladelphia EaglesPhiladelphiaPhiladelphia 76ers
9DallasDallas CowboysDallasDallas Mavericks
10WashingtonWashington SkinsWashingtonWashington Wizards
11AtlantaAtlanta FalconsAtlantaAtlanta Hawks
12CharlotteCarolina PanthersCharlotteCharlotte Hornets
13New OrleansNew Orleans SaintsNew OrleansNew Orleans Pelicans
14San FranciscoSan Francisco 49ersSan FranciscoGolden State Warriors
15PhoenixArizona CardinalsPhoenixPhoenix Suns
16ChicagoChicago BearsChicagoChicago Bulls
17MinneapolisMinnesota VikingsMinneapolisMinnesota Timberwolves
18DetroitDetroit LionsDetroitDetroit Pistons
\n", "
" ], "text/plain": [ " city team city \\\n", "0 Miami Miami Dolphins Miami \n", "1 Boston New England Patriots Boston \n", "2 New York {\"New York Jets\",\"New York Giants\"} New York \n", "3 Cleveland Cleveland Browns Cleveland \n", "4 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"} Los Angeles \n", "5 Denver Denver Broncos Denver \n", "6 Houston Houston Texans Houston \n", "7 Indianapolis Indianapolis Colts Indianapolis \n", "8 Philadelphia Philadelphia Eagles Philadelphia \n", "9 Dallas Dallas Cowboys Dallas \n", "10 Washington Washington Skins Washington \n", "11 Atlanta Atlanta Falcons Atlanta \n", "12 Charlotte Carolina Panthers Charlotte \n", "13 New Orleans New Orleans Saints New Orleans \n", "14 San Francisco San Francisco 49ers San Francisco \n", "15 Phoenix Arizona Cardinals Phoenix \n", "16 Chicago Chicago Bears Chicago \n", "17 Minneapolis Minnesota Vikings Minneapolis \n", "18 Detroit Detroit Lions Detroit \n", "\n", " team \n", "0 Miami Heat \n", "1 Boston Celtics \n", "2 New York Knicks \n", "3 Cleveland Cavaliers \n", "4 {\"L.A. Lakers\",\"L.A. Clippers\"} \n", "5 Denver Nuggets \n", "6 Houston Rockets \n", "7 Indiana Pacers \n", "8 Philadelphia 76ers \n", "9 Dallas Mavericks \n", "10 Washington Wizards \n", "11 Atlanta Hawks \n", "12 Charlotte Hornets \n", "13 New Orleans Pelicans \n", "14 Golden State Warriors \n", "15 Phoenix Suns \n", "16 Chicago Bulls \n", "17 Minnesota Timberwolves \n", "18 Detroit Pistons " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl2 f\n", "INNER JOIN nba2 b\n", " ON f.city = b.city;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To safely use natural joins, first make certain that the indices you intend to match on have the same name, and then make sure that no other columns in the two data tables share a name." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross Joins\n", "A **round robin** is a method of organizing a competitive tournament. In a round robin, every team or participant plays every other team or participant once. A cross join, also called a Cartesian product, is a round robin for the rows of two data tables: every row in the first data table is matched once to every row in the second data table, regardless of whether the index values agree. Cross joins are memory-intensive: if the first data table has $M$ rows and the second data table has $N$ rows, the cross join output is a data table with $M\\times N$ rows. In general, cross joins are not a good way to combine data entities, because they do not match like units to like units. But cross joins are useful for constructing data that contain all possible pairings, if that's what a situation calls for.\n", "\n", "The syntax for generating a cross join is\n", "```\n", "SELECT * FROM table1\n", "CROSS JOIN table2;\n", "```\n", "There is no `ON` clause in this query because no matching condition is needed: each row in `table1` is paired with every row in `table2`. For the teams database, the cross join pairs each of the 30 rows in the NFL table with each of the 29 rows in the NBA table, generating a data table with $30\\times 29 = 870$ rows:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityfootballteamcitybasketballteam
0BuffaloBuffalo BillsBostonBoston Celtics
1BuffaloBuffalo BillsNew YorkNew York Knicks
2BuffaloBuffalo BillsPhiladelphiaPhiladelphia 76ers
3BuffaloBuffalo BillsBrooklynBrooklyn Nets
4BuffaloBuffalo BillsTorontoToronto Raptors
...............
865DetroitDetroit LionsMinneapolisMinnesota Timberwolves
866DetroitDetroit LionsDenverDenver Nuggets
867DetroitDetroit LionsSalt Lake CityUtah Jazz
868DetroitDetroit LionsMemphisMemphis Grizzlies
869DetroitDetroit LionsNew OrleansNew Orleans Pelicans
\n", "

870 rows × 4 columns

\n", "
" ], "text/plain": [ " city footballteam city basketballteam\n", "0 Buffalo Buffalo Bills Boston Boston Celtics\n", "1 Buffalo Buffalo Bills New York New York Knicks\n", "2 Buffalo Buffalo Bills Philadelphia Philadelphia 76ers\n", "3 Buffalo Buffalo Bills Brooklyn Brooklyn Nets\n", "4 Buffalo Buffalo Bills Toronto Toronto Raptors\n", ".. ... ... ... ...\n", "865 Detroit Detroit Lions Minneapolis Minnesota Timberwolves\n", "866 Detroit Detroit Lions Denver Denver Nuggets\n", "867 Detroit Detroit Lions Salt Lake City Utah Jazz\n", "868 Detroit Detroit Lions Memphis Memphis Grizzlies\n", "869 Detroit Detroit Lions New Orleans New Orleans Pelicans\n", "\n", "[870 rows x 4 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl\n", "CROSS JOIN nba;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple Joins in One Query\n", "All of the examples above show a single join between two data tables, but many situations will require you to join multiple tables. It is possible to join many tables in one SQL query. 
The syntax to perform an inner join between two tables, then an inner join between the result and a third table is\n", "```\n", "SELECT * FROM table1\n", "INNER JOIN table2\n", " ON table1.index_name = table2.index_name\n", "INNER JOIN table3\n", " ON table1.index_name = table3.index_name;\n", "```\n", "To demonstrate how multiple joins can work, I add a third table to the teams database that contains all of the Major League Baseball teams:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "mlb_dict = {'city': ['New York', 'Boston', 'Toronto', 'Baltimore', 'Tampa Bay',\n", " 'Cleveland', 'Chicago', 'Kansas City', 'Minneapolis', 'Detroit',\n", " 'Houston', 'Anaheim', 'Dallas', 'Seattle', 'Oakland',\n", " 'Philadelphia', 'Miami', 'Washington', 'Atlanta', 'Cincinnati',\n", " 'Milwaukee', 'St. Louis', 'Pittsburgh', 'Los Angeles', 'San Francisco',\n", " 'San Diego', 'Denver', 'Phoenix'],\n", " 'baseballteam': [['New York Mets', 'New York Yankees'], 'Boston Red Sox', 'Toronto Blue Jays',\n", " 'Baltimore Orioles', 'Tampa Bay Rays', 'Cleveland Indians', \n", " ['Chicago White Sox', 'Chicago Cubs'], 'Kansas City Royals', 'Minnesota Twins',\n", " 'Detriot Tigers', 'Houston Astros', 'Anaheim Angels', 'Texas Rangers', \n", " 'Seattle Mariners', 'Oakland Athletics', 'Philadelphia Phillies',\n", " 'Miami Marlins', 'Washington Nationals', 'Atlanta Braves', 'Cincinnati Reds',\n", " 'Milwaukee Brewers', 'St. 
Louis Cardinals', 'Pittsburgh Pirates', 'Los Angeles Dodgers',\n", " 'San Francisco Giants', 'San Diego Padres', 'Colorado Rockies', 'Arizona Diamondbacks']}\n", "mlb_df = pd.DataFrame(mlb_dict)\n", "mlb_df.to_sql('mlb', con = engine, index=False, chunksize=1000, if_exists = 'replace')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can first inner join the NFL and NBA data tables to keep only the cities with both an NFL and an NBA team, then we can inner join the result with the MLB data to keep only the cities with teams in all three sports:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityfootballteamcitybasketballteamcitybaseballteam
0AtlantaAtlanta FalconsAtlantaAtlanta HawksAtlantaAtlanta Braves
1BostonNew England PatriotsBostonBoston CelticsBostonBoston Red Sox
2ChicagoChicago BearsChicagoChicago BullsChicago{\"Chicago White Sox\",\"Chicago Cubs\"}
3ClevelandCleveland BrownsClevelandCleveland CavaliersClevelandCleveland Indians
4DallasDallas CowboysDallasDallas MavericksDallasTexas Rangers
5DenverDenver BroncosDenverDenver NuggetsDenverColorado Rockies
6DetroitDetroit LionsDetroitDetroit PistonsDetroitDetriot Tigers
7HoustonHouston TexansHoustonHouston RocketsHoustonHouston Astros
8Los Angeles{\"L.A. Chargers\",\"L.A. Rams\"}Los Angeles{\"L.A. Lakers\",\"L.A. Clippers\"}Los AngelesLos Angeles Dodgers
9MiamiMiami DolphinsMiamiMiami HeatMiamiMiami Marlins
10MinneapolisMinnesota VikingsMinneapolisMinnesota TimberwolvesMinneapolisMinnesota Twins
11New York{\"New York Jets\",\"New York Giants\"}New YorkNew York KnicksNew York{\"New York Mets\",\"New York Yankees\"}
12PhiladelphiaPhiladelphia EaglesPhiladelphiaPhiladelphia 76ersPhiladelphiaPhiladelphia Phillies
13PhoenixArizona CardinalsPhoenixPhoenix SunsPhoenixArizona Diamondbacks
14San FranciscoSan Francisco 49ersSan FranciscoGolden State WarriorsSan FranciscoSan Francisco Giants
15WashingtonWashington SkinsWashingtonWashington WizardsWashingtonWashington Nationals
\n", "
" ], "text/plain": [ " city footballteam city \\\n", "0 Atlanta Atlanta Falcons Atlanta \n", "1 Boston New England Patriots Boston \n", "2 Chicago Chicago Bears Chicago \n", "3 Cleveland Cleveland Browns Cleveland \n", "4 Dallas Dallas Cowboys Dallas \n", "5 Denver Denver Broncos Denver \n", "6 Detroit Detroit Lions Detroit \n", "7 Houston Houston Texans Houston \n", "8 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"} Los Angeles \n", "9 Miami Miami Dolphins Miami \n", "10 Minneapolis Minnesota Vikings Minneapolis \n", "11 New York {\"New York Jets\",\"New York Giants\"} New York \n", "12 Philadelphia Philadelphia Eagles Philadelphia \n", "13 Phoenix Arizona Cardinals Phoenix \n", "14 San Francisco San Francisco 49ers San Francisco \n", "15 Washington Washington Skins Washington \n", "\n", " basketballteam city \\\n", "0 Atlanta Hawks Atlanta \n", "1 Boston Celtics Boston \n", "2 Chicago Bulls Chicago \n", "3 Cleveland Cavaliers Cleveland \n", "4 Dallas Mavericks Dallas \n", "5 Denver Nuggets Denver \n", "6 Detroit Pistons Detroit \n", "7 Houston Rockets Houston \n", "8 {\"L.A. Lakers\",\"L.A. 
Clippers\"} Los Angeles \n", "9 Miami Heat Miami \n", "10 Minnesota Timberwolves Minneapolis \n", "11 New York Knicks New York \n", "12 Philadelphia 76ers Philadelphia \n", "13 Phoenix Suns Phoenix \n", "14 Golden State Warriors San Francisco \n", "15 Washington Wizards Washington \n", "\n", " baseballteam \n", "0 Atlanta Braves \n", "1 Boston Red Sox \n", "2 {\"Chicago White Sox\",\"Chicago Cubs\"} \n", "3 Cleveland Indians \n", "4 Texas Rangers \n", "5 Colorado Rockies \n", "6 Detriot Tigers \n", "7 Houston Astros \n", "8 Los Angeles Dodgers \n", "9 Miami Marlins \n", "10 Minnesota Twins \n", "11 {\"New York Mets\",\"New York Yankees\"} \n", "12 Philadelphia Phillies \n", "13 Arizona Diamondbacks \n", "14 San Francisco Giants \n", "15 Washington Nationals " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl f\n", "INNER JOIN nba b\n", " ON f.city = b.city\n", "INNER JOIN mlb m\n", " ON f.city = m.city;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Things get more complicated when we consider left, right, and full joins in a multiple table context. The trick is to think about the set of records that is required, to express that set in set theoretical notation, and to find the right combination of joins that matches that set theoretical statement. \n", "\n", "For example, to obtain all cities with both an NFL and NBA team, also listing the MLB team if one exists in that city, we first inner join the NFL table to the NBA table, then we left join either the NFL's or NBA's city index to the MLB's city column:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityfootballteamcitybasketballteamcitybaseballteam
0AtlantaAtlanta FalconsAtlantaAtlanta HawksAtlantaAtlanta Braves
1BostonNew England PatriotsBostonBoston CelticsBostonBoston Red Sox
2CharlotteCarolina PanthersCharlotteCharlotte HornetsNoneNone
3ChicagoChicago BearsChicagoChicago BullsChicago{\"Chicago White Sox\",\"Chicago Cubs\"}
4ClevelandCleveland BrownsClevelandCleveland CavaliersClevelandCleveland Indians
5DallasDallas CowboysDallasDallas MavericksDallasTexas Rangers
6DenverDenver BroncosDenverDenver NuggetsDenverColorado Rockies
7DetroitDetroit LionsDetroitDetroit PistonsDetroitDetriot Tigers
8HoustonHouston TexansHoustonHouston RocketsHoustonHouston Astros
9IndianapolisIndianapolis ColtsIndianapolisIndiana PacersNoneNone
10Los Angeles{\"L.A. Chargers\",\"L.A. Rams\"}Los Angeles{\"L.A. Lakers\",\"L.A. Clippers\"}Los AngelesLos Angeles Dodgers
11MiamiMiami DolphinsMiamiMiami HeatMiamiMiami Marlins
12MinneapolisMinnesota VikingsMinneapolisMinnesota TimberwolvesMinneapolisMinnesota Twins
13New OrleansNew Orleans SaintsNew OrleansNew Orleans PelicansNoneNone
14New York{\"New York Jets\",\"New York Giants\"}New YorkNew York KnicksNew York{\"New York Mets\",\"New York Yankees\"}
15PhiladelphiaPhiladelphia EaglesPhiladelphiaPhiladelphia 76ersPhiladelphiaPhiladelphia Phillies
16PhoenixArizona CardinalsPhoenixPhoenix SunsPhoenixArizona Diamondbacks
17San FranciscoSan Francisco 49ersSan FranciscoGolden State WarriorsSan FranciscoSan Francisco Giants
18WashingtonWashington SkinsWashingtonWashington WizardsWashingtonWashington Nationals
\n", "
" ], "text/plain": [ " city footballteam city \\\n", "0 Atlanta Atlanta Falcons Atlanta \n", "1 Boston New England Patriots Boston \n", "2 Charlotte Carolina Panthers Charlotte \n", "3 Chicago Chicago Bears Chicago \n", "4 Cleveland Cleveland Browns Cleveland \n", "5 Dallas Dallas Cowboys Dallas \n", "6 Denver Denver Broncos Denver \n", "7 Detroit Detroit Lions Detroit \n", "8 Houston Houston Texans Houston \n", "9 Indianapolis Indianapolis Colts Indianapolis \n", "10 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"} Los Angeles \n", "11 Miami Miami Dolphins Miami \n", "12 Minneapolis Minnesota Vikings Minneapolis \n", "13 New Orleans New Orleans Saints New Orleans \n", "14 New York {\"New York Jets\",\"New York Giants\"} New York \n", "15 Philadelphia Philadelphia Eagles Philadelphia \n", "16 Phoenix Arizona Cardinals Phoenix \n", "17 San Francisco San Francisco 49ers San Francisco \n", "18 Washington Washington Skins Washington \n", "\n", " basketballteam city \\\n", "0 Atlanta Hawks Atlanta \n", "1 Boston Celtics Boston \n", "2 Charlotte Hornets None \n", "3 Chicago Bulls Chicago \n", "4 Cleveland Cavaliers Cleveland \n", "5 Dallas Mavericks Dallas \n", "6 Denver Nuggets Denver \n", "7 Detroit Pistons Detroit \n", "8 Houston Rockets Houston \n", "9 Indiana Pacers None \n", "10 {\"L.A. Lakers\",\"L.A. 
Clippers\"} Los Angeles \n", "11 Miami Heat Miami \n", "12 Minnesota Timberwolves Minneapolis \n", "13 New Orleans Pelicans None \n", "14 New York Knicks New York \n", "15 Philadelphia 76ers Philadelphia \n", "16 Phoenix Suns Phoenix \n", "17 Golden State Warriors San Francisco \n", "18 Washington Wizards Washington \n", "\n", " baseballteam \n", "0 Atlanta Braves \n", "1 Boston Red Sox \n", "2 None \n", "3 {\"Chicago White Sox\",\"Chicago Cubs\"} \n", "4 Cleveland Indians \n", "5 Texas Rangers \n", "6 Colorado Rockies \n", "7 Detriot Tigers \n", "8 Houston Astros \n", "9 None \n", "10 Los Angeles Dodgers \n", "11 Miami Marlins \n", "12 Minnesota Twins \n", "13 None \n", "14 {\"New York Mets\",\"New York Yankees\"} \n", "15 Philadelphia Phillies \n", "16 Arizona Diamondbacks \n", "17 San Francisco Giants \n", "18 Washington Nationals " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM nfl f\n", "INNER JOIN nba b\n", " ON f.city = b.city\n", "LEFT JOIN mlb m\n", " ON f.city = m.city;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To narrow the records to cities with a baseball team and a basketball team, but no football team, first we inner join the MLB and NBA data tables, then perform an anti-join with the NFL data by left joining the NFL table and keeping only the rows in which the NFL city is `NULL`:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
citybaseballteamcitybasketballteamcityfootballteam
0TorontoToronto Blue JaysTorontoToronto RaptorsNoneNone
1MilwaukeeMilwaukee BrewersMilwaukeeMilwaukee BucksNoneNone
\n", "
" ], "text/plain": [ " city baseballteam city basketballteam city footballteam\n", "0 Toronto Toronto Blue Jays Toronto Toronto Raptors None None\n", "1 Milwaukee Milwaukee Brewers Milwaukee Milwaukee Bucks None None" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM mlb m\n", "INNER JOIN nba b\n", " ON m.city=b.city\n", "LEFT JOIN nfl f\n", " ON m.city = f.city\n", "WHERE f.city is NULL;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joins on More Than One Index\n", "Sometimes more than one column comprises the primary key for a table. The general syntax for joining two tables on more than one index adds the `AND` clause to the standard SQL join syntax:\n", "```\n", "SELECT * FROM table1\n", "INNER JOIN table2\n", " ON table1.index1 = table2.index2\n", " AND table1.anotherindex1 = table2.anotherindex2;\n", "```\n", "Suppose for example that the NBA table and MLB table also contained records for minor league teams in the NBA G-League or the MLB AAA system. Some cities have both major and minor league teams in the same sport. Washington, for example, has a major league NBA team, the Wizards, and a minor league basketball team, the Capital City Go-Gos. Suppose that both the NBA and MLB tables have a column `leaguetype` that marks each team as \"major\" or \"minor\", and that we want to match on both city and league type. The syntax to do so is\n", "```\n", "SELECT * FROM nba b\n", "INNER JOIN mlb m\n", " ON b.city = m.city\n", " AND b.leaguetype = m.leaguetype;\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Create, Update, and Delete Operations\n", "Once a database exists and is populated with data, most changes to the data will be small and incremental. We might add a few records, edit a couple, or delete one or two. 
There are straightforward SQL commands for creating, updating, and deleting records. To issue these queries, however, we cannot use the `pd.read_sql_query()` function as this function is only for read operations. Instead, we can use the `.execute()` method as applied to either the cursor for the database we are working with, or the `sqlalchemy` engine. Specific examples are shown below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating New Records\n", "An existing database has a schema, an overarching organizational blueprint for the database, that describes the different tables in the database, and within each table what the columns are and what kinds of data can be input into the columns. Creating new data generally works within an established schema. That means we enter new datapoints into existing columns, matching the data type that must exist in those columns.\n", "\n", "The SQL syntax to create new data is\n", "```\n", "INSERT INTO table (column1, column2, ...)\n", " VALUES (value1, value2, ...);\n", "```\n", "This syntax requires us to specify the key elements of the schema that identify a location in the database: the table and the columns. The values need to be listed in the same order as the columns, and character values need to be enclosed in single quotes.\n", "\n", "To add a new observation to the NBA table (bring back the Sonics!) we can type:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "INSERT INTO nba (city, basketballteam)\n", " VALUES ('Seattle', 'Seattle Supersonics');\n", "\"\"\"\n", "engine.execute(myquery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the `engine` variable is the `sqlalchemy` connection we previously established for the teams database. 
We can use the `execute()` method to pass SQL queries to the database, just as we can with a cursor. Now, when we look at the data, we see the Seattle Supersonics included along with all the other NBA teams:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
citybasketballteam
0BostonBoston Celtics
1New YorkNew York Knicks
2PhiladelphiaPhiladelphia 76ers
3BrooklynBrooklyn Nets
4TorontoToronto Raptors
5ClevelandCleveland Cavaliers
6ChicagoChicago Bulls
7DetroitDetroit Pistons
8MilwaukeeMilwaukee Bucks
9IndianapolisIndiana Pacers
10AtlantaAtlanta Hawks
11WashingtonWashington Wizards
12OrlandoOrlando Magic
13MiamiMiami Heat
14CharlotteCharlotte Hornets
15Los Angeles{\"L.A. Lakers\",\"L.A. Clippers\"}
16San FranciscoGolden State Warriors
17PortlandPortland Trailblazers
18SacramentoSacramento Kings
19PhoenixPhoenix Suns
20San AntonioSan Antonio Spurs
21DallasDallas Mavericks
22HoustonHouston Rockets
23Oklahoma CityOklahoma City Thunder
24MinneapolisMinnesota Timberwolves
25DenverDenver Nuggets
26Salt Lake CityUtah Jazz
27MemphisMemphis Grizzlies
28New OrleansNew Orleans Pelicans
29SeattleSeattle Supersonics
\n", "
" ], "text/plain": [ " city basketballteam\n", "0 Boston Boston Celtics\n", "1 New York New York Knicks\n", "2 Philadelphia Philadelphia 76ers\n", "3 Brooklyn Brooklyn Nets\n", "4 Toronto Toronto Raptors\n", "5 Cleveland Cleveland Cavaliers\n", "6 Chicago Chicago Bulls\n", "7 Detroit Detroit Pistons\n", "8 Milwaukee Milwaukee Bucks\n", "9 Indianapolis Indiana Pacers\n", "10 Atlanta Atlanta Hawks\n", "11 Washington Washington Wizards\n", "12 Orlando Orlando Magic\n", "13 Miami Miami Heat\n", "14 Charlotte Charlotte Hornets\n", "15 Los Angeles {\"L.A. Lakers\",\"L.A. Clippers\"}\n", "16 San Francisco Golden State Warriors\n", "17 Portland Portland Trailblazers\n", "18 Sacramento Sacramento Kings\n", "19 Phoenix Phoenix Suns\n", "20 San Antonio San Antonio Spurs\n", "21 Dallas Dallas Mavericks\n", "22 Houston Houston Rockets\n", "23 Oklahoma City Oklahoma City Thunder\n", "24 Minneapolis Minnesota Timberwolves\n", "25 Denver Denver Nuggets\n", "26 Salt Lake City Utah Jazz\n", "27 Memphis Memphis Grizzlies\n", "28 New Orleans New Orleans Pelicans\n", "29 Seattle Seattle Supersonics" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql_query(\"SELECT * FROM nba\", con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Editing Existing Records\n", "Instead of creating a new record, there are situations in which we want to edit an existing record. To revise a record, we use the following SQL syntax:\n", "```\n", "UPDATE table\n", " SET column2 = newvalue\n", " WHERE logicalcondition;\n", "```\n", "In this case, `SET` specifies the change we want to make to a particular column. But we don't want to change *all* of the values of the column, so we use `WHERE` to specify a logical condition to identify the rows we want to change. A logical condition is a statement that is true on some rows and false on others, and the data update happens only on the rows for which the condition is true. 
\n", "\n", "Suppose we want to change the name of the Charlotte Hornets back to the Charlotte Bobcats (sorry, Charlotte). We can use the following code:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "UPDATE nba\n", " SET basketballteam = 'Charlotte Bobcats'\n", " WHERE city = 'Charlotte';\n", "\"\"\"\n", "engine.execute(myquery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the query says to update values in the NBA table by changing `basketballteam` to Charlotte Bobcats, but only when `city` is Charlotte. This update now appears in the NBA data:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
citybasketballteam
0BostonBoston Celtics
1New YorkNew York Knicks
2PhiladelphiaPhiladelphia 76ers
3BrooklynBrooklyn Nets
4TorontoToronto Raptors
5ClevelandCleveland Cavaliers
6ChicagoChicago Bulls
7DetroitDetroit Pistons
8MilwaukeeMilwaukee Bucks
9IndianapolisIndiana Pacers
10AtlantaAtlanta Hawks
11WashingtonWashington Wizards
12OrlandoOrlando Magic
13MiamiMiami Heat
14Los Angeles{\"L.A. Lakers\",\"L.A. Clippers\"}
15San FranciscoGolden State Warriors
16PortlandPortland Trailblazers
17SacramentoSacramento Kings
18PhoenixPhoenix Suns
19San AntonioSan Antonio Spurs
20DallasDallas Mavericks
21HoustonHouston Rockets
22Oklahoma CityOklahoma City Thunder
23MinneapolisMinnesota Timberwolves
24DenverDenver Nuggets
25Salt Lake CityUtah Jazz
26MemphisMemphis Grizzlies
27New OrleansNew Orleans Pelicans
28SeattleSeattle Supersonics
29CharlotteCharlotte Bobcats
\n", "
" ], "text/plain": [ " city basketballteam\n", "0 Boston Boston Celtics\n", "1 New York New York Knicks\n", "2 Philadelphia Philadelphia 76ers\n", "3 Brooklyn Brooklyn Nets\n", "4 Toronto Toronto Raptors\n", "5 Cleveland Cleveland Cavaliers\n", "6 Chicago Chicago Bulls\n", "7 Detroit Detroit Pistons\n", "8 Milwaukee Milwaukee Bucks\n", "9 Indianapolis Indiana Pacers\n", "10 Atlanta Atlanta Hawks\n", "11 Washington Washington Wizards\n", "12 Orlando Orlando Magic\n", "13 Miami Miami Heat\n", "14 Los Angeles {\"L.A. Lakers\",\"L.A. Clippers\"}\n", "15 San Francisco Golden State Warriors\n", "16 Portland Portland Trailblazers\n", "17 Sacramento Sacramento Kings\n", "18 Phoenix Phoenix Suns\n", "19 San Antonio San Antonio Spurs\n", "20 Dallas Dallas Mavericks\n", "21 Houston Houston Rockets\n", "22 Oklahoma City Oklahoma City Thunder\n", "23 Minneapolis Minnesota Timberwolves\n", "24 Denver Denver Nuggets\n", "25 Salt Lake City Utah Jazz\n", "26 Memphis Memphis Grizzlies\n", "27 New Orleans New Orleans Pelicans\n", "28 Seattle Seattle Supersonics\n", "29 Charlotte Charlotte Bobcats" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql_query(\"SELECT * FROM nba\", con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting Records\n", "Sometimes you might need to delete records from a database. These situations should be rare. If a record is no longer relevant for a particular use, it is always better to leave the record in the database and use another column to denote new information that can be used to filter records later. If there are mistakes in data entry, it's better to edit existing records than to delete those records outright. If you must delete a record, the syntax to do so is\n", "```\n", "DELETE FROM table WHERE logicalcondition;\n", "```\n", "First specify the table, then the logical condition that identifies the rows you intend to delete. 
\n", "\n", "In the teams database, suppose we want to delete the Baltimore Ravens (go Browns!) from the NFL table. The code to do that is:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "DELETE FROM nfl WHERE city = 'Baltimore'; \n", "\"\"\"\n", "engine.execute(myquery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, `city = 'Baltimore'` identifies the rows we want to delete in the NFL table. The NFL data now no longer contains a row for the Ravens:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityfootballteam
0BuffaloBuffalo Bills
1MiamiMiami Dolphins
2BostonNew England Patriots
3New York{\"New York Jets\",\"New York Giants\"}
4ClevelandCleveland Browns
5CincinnatiCincinnati Bengals
6PittsburghPittsburgh Steelers
7Kansas CityKansas City Chiefs
8Las VegasLas Vegas Raiders
9Los Angeles{\"L.A. Chargers\",\"L.A. Rams\"}
10DenverDenver Broncos
11NashvilleTennessee Titans
12JacksonvilleJacksonville Jaguars
13HoustonHouston Texans
14IndianapolisIndianapolis Colts
15PhiladelphiaPhiladelphia Eagles
16DallasDallas Cowboys
17WashingtonWashington Skins
18AtlantaAtlanta Falcons
19CharlotteCarolina Panthers
20Tampa BayTampa Bay Buccaneers
21New OrleansNew Orleans Saints
22San FranciscoSan Francisco 49ers
23PhoenixArizona Cardinals
24SeattleSeattle Seahawks
25ChicagoChicago Bears
26Green BayGreen Bay Packers
27MinneapolisMinnesota Vikings
28DetroitDetroit Lions
\n", "
" ], "text/plain": [ " city footballteam\n", "0 Buffalo Buffalo Bills\n", "1 Miami Miami Dolphins\n", "2 Boston New England Patriots\n", "3 New York {\"New York Jets\",\"New York Giants\"}\n", "4 Cleveland Cleveland Browns\n", "5 Cincinnati Cincinnati Bengals\n", "6 Pittsburgh Pittsburgh Steelers\n", "7 Kansas City Kansas City Chiefs\n", "8 Las Vegas Las Vegas Raiders\n", "9 Los Angeles {\"L.A. Chargers\",\"L.A. Rams\"}\n", "10 Denver Denver Broncos\n", "11 Nashville Tennessee Titans\n", "12 Jacksonville Jacksonville Jaguars\n", "13 Houston Houston Texans\n", "14 Indianapolis Indianapolis Colts\n", "15 Philadelphia Philadelphia Eagles\n", "16 Dallas Dallas Cowboys\n", "17 Washington Washington Skins\n", "18 Atlanta Atlanta Falcons\n", "19 Charlotte Carolina Panthers\n", "20 Tampa Bay Tampa Bay Buccaneers\n", "21 New Orleans New Orleans Saints\n", "22 San Francisco San Francisco 49ers\n", "23 Phoenix Arizona Cardinals\n", "24 Seattle Seattle Seahawks\n", "25 Chicago Chicago Bears\n", "26 Green Bay Green Bay Packers\n", "27 Minneapolis Minnesota Vikings\n", "28 Detroit Detroit Lions" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql_query(\"SELECT * FROM nfl\", con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning and Manipulating Data with SQL Read Operations\n", "After using joins to combine data tables in the database, the data needs to be manipulated to make the data more convenient to use. That might involve narrowing down the data to a specific subset of interest, performing calculations on the data to generate new features, and changing the appearance of the data. In \"[Tidy Data](https://www.jstatsoft.org/article/view/v059i10)\", Hadley Wickham defines four essential \"verbs\" of data manipulation:\n", "\n", "> * Filter: subsetting or removing observations based on some condition.\n", "> * Transform: adding or modifying variables. 
These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).\n", "> * Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).\n", "> * Sort: changing the order of observations (p. 13).\n", "\n", "In addition, it may be necessary to pull only a selection of the columns into the output, or to change the names of the columns to more readable and useful ones. These operations can be performed within SQL read commands by using the `WHERE` clause for filtering, mathematical operators to transform columns, the `GROUP BY` syntax for aggregation, the `ORDER BY` clause with the `ASC` or `DESC` keywords for sorting, and the `AS` keyword for renaming columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Wine Reviews\n", "To illustrate how to issue queries to read data while manipulating and cleaning the data, we will use the PostgreSQL version of the wine review database that we created in module 6. If you want to follow along with these examples, follow the instructions in the \"Using PostgreSQL\" subsection of module 6 to get a local wine database running on your system.\n", "\n", "For read operations, we can use the `pd.read_sql_query()` function. For that, we first have to use `sqlalchemy` to set up an engine that connects `pandas` to the database: " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "engine = create_engine(\"postgresql+psycopg2://{user}:{pw}@localhost/{db}\"\n", " .format(user=\"jk8sd\", pw=pgpassword, db=\"winedb\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The logical ER diagram for the wine reviews database is\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Columns\n", "`SELECT` and `FROM` are the primary SQL verbs for reading data. 
In many of the examples up to this point, we've issued queries like\n", "```\n", "SELECT * FROM table;\n", "```\n", "that pull all of the rows and all of the columns from a single data table. The `*` character is called a [wildcard character](https://en.wikipedia.org/wiki/Wildcard_character). When typed by itself, the wildcard captures all of the columns in a table. But sometimes we are interested in only a selection of the columns. In that case, we replace the wildcard with the columns we want to include in the output. The following syntax includes three columns from a specified data table:\n", "```\n", "SELECT col1, col2, col3 FROM table;\n", "```\n", "Suppose that I want to know the title, variety, price, points, country, and reviewer for all of the wines in the data. Title, variety, price, and points are all in the reviews table, country is in the locations table, and the reviewer (`taster_name`) is in the tasters table. To produce the data, I need to join these three tables while also using `SELECT` to identify only the columns I am interested in. Inner joins are appropriate because every wine in the data has both a location and a reviewer. \n", "\n", "The best way to select columns across multiple tables is to use aliasing, the same way we did for joins. In this case, if we alias the reviews table as `r`, locations as `l`, and tasters as `t`, we can use these same aliases to inform SQL where to find each column in the `SELECT` syntax. \n", "\n", "The code to return this dataframe is:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlevarietypricepointscountrytaster_name
0Olivier Leflaive 2006 Les Pucelles Premier Cru...ChardonnayNaN93FranceRoger Voss
1The Foundry 2004 Syrah (Coastal Region)Syrah35.087South AfricaSusan Kostrzewa
2Guilbaud Frères 2007 Le Soleil Nantais (Musca...Melon11.087FranceRoger Voss
3Domaine du Clos du Fief 2007 Cuvée Tradition ...GamayNaN86FranceRoger Voss
4Domaine Philippe Delesvaux 2005 La Montée de l...Cabernet SauvignonNaN86FranceRoger Voss
.....................
103722Sheridan Vineyard 2005 Reserve Cabernet Sauvig...Cabernet Sauvignon75.094USPaul Gregutt
103723Woodward Canyon 2006 Old Vines Dedication Seri...Cabernet Sauvignon84.094USPaul Gregutt
103724Chanson Père et Fils 2005 Champs Gains Premier...Chardonnay115.093FranceRoger Voss
103725Mark Ryan 2006 Chardonnay (Columbia Valley (WA))ChardonnayNaN93USPaul Gregutt
103726Joseph Drouhin 2007 Grands-EchezeauxPinot Noir285.094FranceRoger Voss
\n", "

103727 rows × 6 columns

\n", "
" ], "text/plain": [ " title variety \\\n", "0 Olivier Leflaive 2006 Les Pucelles Premier Cru... Chardonnay \n", "1 The Foundry 2004 Syrah (Coastal Region) Syrah \n", "2 Guilbaud Frères 2007 Le Soleil Nantais (Musca... Melon \n", "3 Domaine du Clos du Fief 2007 Cuvée Tradition ... Gamay \n", "4 Domaine Philippe Delesvaux 2005 La Montée de l... Cabernet Sauvignon \n", "... ... ... \n", "103722 Sheridan Vineyard 2005 Reserve Cabernet Sauvig... Cabernet Sauvignon \n", "103723 Woodward Canyon 2006 Old Vines Dedication Seri... Cabernet Sauvignon \n", "103724 Chanson Père et Fils 2005 Champs Gains Premier... Chardonnay \n", "103725 Mark Ryan 2006 Chardonnay (Columbia Valley (WA)) Chardonnay \n", "103726 Joseph Drouhin 2007 Grands-Echezeaux Pinot Noir \n", "\n", " price points country taster_name \n", "0 NaN 93 France Roger Voss \n", "1 35.0 87 South Africa Susan Kostrzewa \n", "2 11.0 87 France Roger Voss \n", "3 NaN 86 France Roger Voss \n", "4 NaN 86 France Roger Voss \n", "... ... ... ... ... \n", "103722 75.0 94 US Paul Gregutt \n", "103723 84.0 94 US Paul Gregutt \n", "103724 115.0 93 France Roger Voss \n", "103725 NaN 93 US Paul Gregutt \n", "103726 285.0 94 France Roger Voss \n", "\n", "[103727 rows x 6 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery=\"\"\"\n", "SELECT r.title, r.variety, r.price, r.points, l.country, t.taster_name FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "INNER JOIN tasters t\n", " ON r.taster_id = t.taster_id;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logical Statements\n", "Most programming languages have the capacity to evaluate a statement as being either true or false, or true for some values and false for others. A logical statement uses **logical operators** that define how values should be compared. 
In SQL, logical statements are used in conjunction with the `WHERE` clause to select the rows to include in the output.\n", "\n", "For SQL, logical statements either compare a column to another column, or compare a column to one or more reference values. The following logical operators are available:\n", "\n", "* `=` - is equal to?\n", "* `<` - is less than?\n", "* `>` - is greater than?\n", "* `<=` - is less than or equal to?\n", "* `>=` - is greater than or equal to?\n", "* `<>` - is not equal to?\n", "* `BETWEEN a AND b` - true if a value exists within the range from a to b, including a and b \n", "* `IN ('element1','element2','element3')` - true if a value is one of the elements in the given set\n", "* `NOT` - true if the rest of the logical statement is false, false if the rest of the logical statement is true\n", "* `AND` - links separate logical statements together such that the overall statement is true only when all of the linked statements are true\n", "* `OR` - links separate logical statements together such that the overall statement is true when any of the linked statements are true\n", "* `LIKE pattern` - true if the string value matches the given pattern:\n", " * `LIKE '%%text'` captures all rows in which a given column ends with 'text'\n", " * `LIKE 'text%%'` captures all rows in which a given column begins with 'text'\n", " * `LIKE '%%text%%'` captures all rows in which a given column contains 'text' somewhere in its string value\n", "* `()` - parts of the logical statement that are contained within parentheses are evaluated first\n", "\n", "The next section shows examples of how to use these logical statements to filter rows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering Rows\n", "Suppose we wanted to know the title, the variety, and the price of the French wines that Roger Voss scored as 100. It's a simple sentence in plain language, but it translates into a more complicated set of SQL operations. 
First consider all of the columns we need to use to process the sentence:\n", "\n", "* title, from the reviews table\n", "* variety, from the reviews table\n", "* price, from the reviews table\n", "* French, a value of country, from the locations table\n", "* Roger Voss, a value of taster name, from the tasters table,\n", "* and 100, a value of points, from the reviews table.\n", "\n", "Because we need to use data from the reviews, locations, and tasters tables, we need to inner join reviews, locations, and tasters. \n", "\n", "But then on top of this join, we need to restrict both the columns and rows. We only want title, variety, and price in the final data, so we use `SELECT` to keep only these columns. \n", "\n", "To restrict the rows, we use `WHERE` along with a logical condition. This logical condition has a few parts: we want wines in which `country='France'`, `taster_name='Roger Voss'`, and `points=100`. All three conditions need to be true for us to want to keep the row, so we connect the three statements with `AND`.\n", "\n", "The SQL query that returns the title, the variety, and the price of the French wines that Roger Voss scored as 100 is:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlevarietyprice
0Krug 2002 Brut (Champagne)Champagne Blend259.0
1Château Léoville Barton 2010 Saint-JulienBordeaux-style Red Blend150.0
2Louis Roederer 2008 Cristal Vintage Brut (Cha...Champagne Blend250.0
3Salon 2006 Le Mesnil Blanc de Blancs Brut Char...Chardonnay617.0
4Château Lafite Rothschild 2010 PauillacBordeaux-style Red Blend1500.0
5Château Cheval Blanc 2010 Saint-ÉmilionBordeaux-style Red Blend1500.0
6Château Léoville Las Cases 2010 Saint-JulienBordeaux-style Red Blend359.0
7Château Haut-Brion 2014 Pessac-LéognanBordeaux-style White Blend848.0
\n", "
" ], "text/plain": [ " title \\\n", "0 Krug 2002 Brut (Champagne) \n", "1 Château Léoville Barton 2010 Saint-Julien \n", "2 Louis Roederer 2008 Cristal Vintage Brut (Cha... \n", "3 Salon 2006 Le Mesnil Blanc de Blancs Brut Char... \n", "4 Château Lafite Rothschild 2010 Pauillac \n", "5 Château Cheval Blanc 2010 Saint-Émilion \n", "6 Château Léoville Las Cases 2010 Saint-Julien \n", "7 Château Haut-Brion 2014 Pessac-Léognan \n", "\n", " variety price \n", "0 Champagne Blend 259.0 \n", "1 Bordeaux-style Red Blend 150.0 \n", "2 Champagne Blend 250.0 \n", "3 Chardonnay 617.0 \n", "4 Bordeaux-style Red Blend 1500.0 \n", "5 Bordeaux-style Red Blend 1500.0 \n", "6 Bordeaux-style Red Blend 359.0 \n", "7 Bordeaux-style White Blend 848.0 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT r.title, r.variety, r.price FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "INNER JOIN tasters t\n", " ON r.taster_id = t.taster_id\n", "WHERE l.country='France' AND t.taster_name='Roger Voss' AND r.points=100;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I like my wine local or low-cost. So as another example, suppose we want the title, variety, price, points, country, and providence for all of the wines with scores of 90 or more that either cost between 5 and 10 dollars or are from Virginia. In this case, we need to join the reviews and locations data together, and write a logical statement that matches these specific conditions. The logical condition is\n", "```\n", "points >= 90 AND (price BETWEEN 5 AND 10 OR province = 'Virginia')\n", "```\n", "The entire SQL query is" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlevarietypricepointscountryprovince
0Château Vircoulon 2016 Bordeaux BlancBordeaux-style White Blend10.090FranceBordeaux
1Mano A Mano 2011 Tempranillo (Vino de la Tierr...Tempranillo9.090SpainCentral Spain
2Aveleda 2014 Quinta da Aveleda Estate Bottled ...Portuguese White9.090PortugalVinho Verde
3Chateau Ste. Michelle 2011 Riesling (Columbia ...Riesling9.091USWashington
4Quinta do Portal 2007 Mural Reserva Red (Douro)Portuguese Red10.091PortugalDouro
.....................
76Casaleiro 2012 Reserva Touriga Nacional-Castel...Portuguese Red9.090PortugalTejo
77Aveleda 2015 Quinta da Aveleda White (Vinho Ve...Portuguese White9.090PortugalVinho Verde
78Cookies & Cream 2010 Merlot (California)Merlot10.090USCalifornia
79Lovingston 2012 Josie's Knoll Merlot (Monticello)Merlot20.091USVirginia
80Aveleda 2016 Quinta da Aveleda White (Vinho Ve...Portuguese White10.090PortugalVinho Verde
\n", "

81 rows × 6 columns

\n", "
" ], "text/plain": [ " title \\\n", "0 Château Vircoulon 2016 Bordeaux Blanc \n", "1 Mano A Mano 2011 Tempranillo (Vino de la Tierr... \n", "2 Aveleda 2014 Quinta da Aveleda Estate Bottled ... \n", "3 Chateau Ste. Michelle 2011 Riesling (Columbia ... \n", "4 Quinta do Portal 2007 Mural Reserva Red (Douro) \n", ".. ... \n", "76 Casaleiro 2012 Reserva Touriga Nacional-Castel... \n", "77 Aveleda 2015 Quinta da Aveleda White (Vinho Ve... \n", "78 Cookies & Cream 2010 Merlot (California) \n", "79 Lovingston 2012 Josie's Knoll Merlot (Monticello) \n", "80 Aveleda 2016 Quinta da Aveleda White (Vinho Ve... \n", "\n", " variety price points country province \n", "0 Bordeaux-style White Blend 10.0 90 France Bordeaux \n", "1 Tempranillo 9.0 90 Spain Central Spain \n", "2 Portuguese White 9.0 90 Portugal Vinho Verde \n", "3 Riesling 9.0 91 US Washington \n", "4 Portuguese Red 10.0 91 Portugal Douro \n", ".. ... ... ... ... ... \n", "76 Portuguese Red 9.0 90 Portugal Tejo \n", "77 Portuguese White 9.0 90 Portugal Vinho Verde \n", "78 Merlot 10.0 90 US California \n", "79 Merlot 20.0 91 US Virginia \n", "80 Portuguese White 10.0 90 Portugal Vinho Verde \n", "\n", "[81 rows x 6 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT r.title, r.variety, r.price, r.points, l.country, l.province FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "WHERE r.points >= 90 AND (r.price BETWEEN 5 AND 10 OR l.province = 'Virginia');\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parentheses in this last query are needed to ensure that the DBMS evaluates the `OR` statement first. 
Without the parentheses,\n", "```\n", "points >= 90 AND price BETWEEN 5 AND 10 OR province = 'Virginia'\n", "```\n", "the DBMS evaluates `AND` before `OR`, because `AND` has higher precedence. The first two conditions are combined first and the third is considered afterward, so the statement is equivalent to\n", "```\n", "(points >= 90 AND price BETWEEN 5 AND 10) OR province = 'Virginia'\n", "```\n", "and it returns data with all wines that have scores of at least 90 and prices between 5 and 10 dollars, along with all wines from Virginia whether or not those wines have scores of at least 90:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlevarietypricepointscountryprovince
0Veramar 2016 JB Winemaker Series Cabernet Fran...Cabernet Franc34.086USVirginia
1Château Vircoulon 2016 Bordeaux BlancBordeaux-style White Blend10.090FranceBordeaux
2Mano A Mano 2011 Tempranillo (Vino de la Tierr...Tempranillo9.090SpainCentral Spain
3Trump 2014 Rosé (Monticello)Rosé14.086USVirginia
4The Boneyard 2014 Chardonnay (Virginia)Chardonnay15.086USVirginia
.....................
444Annefield Vineyards 2009 Cabernet Franc (Virgi...Cabernet Franc29.088USVirginia
445Lovingston 2012 Josie's Knoll Merlot (Monticello)Merlot20.091USVirginia
446Aveleda 2016 Quinta da Aveleda White (Vinho Ve...Portuguese White10.090PortugalVinho Verde
447Tarara 2013 Cabernet Franc (Virginia)Cabernet Franc25.085USVirginia
448Paradise Springs 2014 Nana's Rosé (Virginia)Rosé22.086USVirginia
\n", "

449 rows × 6 columns

\n", "
" ], "text/plain": [ " title \\\n", "0 Veramar 2016 JB Winemaker Series Cabernet Fran... \n", "1 Château Vircoulon 2016 Bordeaux Blanc \n", "2 Mano A Mano 2011 Tempranillo (Vino de la Tierr... \n", "3 Trump 2014 Rosé (Monticello) \n", "4 The Boneyard 2014 Chardonnay (Virginia) \n", ".. ... \n", "444 Annefield Vineyards 2009 Cabernet Franc (Virgi... \n", "445 Lovingston 2012 Josie's Knoll Merlot (Monticello) \n", "446 Aveleda 2016 Quinta da Aveleda White (Vinho Ve... \n", "447 Tarara 2013 Cabernet Franc (Virginia) \n", "448 Paradise Springs 2014 Nana's Rosé (Virginia) \n", "\n", " variety price points country province \n", "0 Cabernet Franc 34.0 86 US Virginia \n", "1 Bordeaux-style White Blend 10.0 90 France Bordeaux \n", "2 Tempranillo 9.0 90 Spain Central Spain \n", "3 Rosé 14.0 86 US Virginia \n", "4 Chardonnay 15.0 86 US Virginia \n", ".. ... ... ... ... ... \n", "444 Cabernet Franc 29.0 88 US Virginia \n", "445 Merlot 20.0 91 US Virginia \n", "446 Portuguese White 10.0 90 Portugal Vinho Verde \n", "447 Cabernet Franc 25.0 85 US Virginia \n", "448 Rosé 22.0 86 US Virginia \n", "\n", "[449 rows x 6 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT r.title, r.variety, r.price, r.points, l.country, l.province FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "WHERE r.points >= 90 AND r.price BETWEEN 5 AND 10 OR l.province = 'Virginia';\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose I'm open to many wines, but I have a thing against wines from the U.S., and I don't like Pinot Noir, Pinot Gris, or Chardonnay. I want to query the wines database to return data on the title, variety, country, and price of all of these wines except for the American ones and the ones I dislike. 
The SQL query requires joining the reviews and locations tables, and using negation in the logical statement with the `<>` and `NOT` operators, like this:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlevarietypricecountry
0The Foundry 2004 Syrah (Coastal Region)Syrah35.0South Africa
1Guilbaud Frères 2007 Le Soleil Nantais (Musca...Melon11.0France
2Domaine du Clos du Fief 2007 Cuvée Tradition ...GamayNaNFrance
3Domaine Philippe Delesvaux 2005 La Montée de l...Cabernet SauvignonNaNFrance
4Georges Duboeuf 2007 Beaujolais-VillagesGamayNaNFrance
...............
57429Indomita NV Rosé Sparkling (Casablanca Valley)Sparkling Blend18.0Chile
57430Intipalka 2013 Valle del Sol Tannat (Ica)Tannat14.0Peru
57431Lobster Reef 2014 Sauvignon Blanc (Marlborough)Sauvignon Blanc12.0New Zealand
57432Millaman 2014 Estate Reserve Sauvignon Blanc (...Sauvignon Blanc10.0Chile
57433Royal Tokaji 1999 Mézes Mály Aszú 6 Puttonyos ...Tokaji175.0Hungary
\n", "

57434 rows × 4 columns

\n", "
" ], "text/plain": [ " title variety \\\n", "0 The Foundry 2004 Syrah (Coastal Region) Syrah \n", "1 Guilbaud Frères 2007 Le Soleil Nantais (Musca... Melon \n", "2 Domaine du Clos du Fief 2007 Cuvée Tradition ... Gamay \n", "3 Domaine Philippe Delesvaux 2005 La Montée de l... Cabernet Sauvignon \n", "4 Georges Duboeuf 2007 Beaujolais-Villages Gamay \n", "... ... ... \n", "57429 Indomita NV Rosé Sparkling (Casablanca Valley) Sparkling Blend \n", "57430 Intipalka 2013 Valle del Sol Tannat (Ica) Tannat \n", "57431 Lobster Reef 2014 Sauvignon Blanc (Marlborough) Sauvignon Blanc \n", "57432 Millaman 2014 Estate Reserve Sauvignon Blanc (... Sauvignon Blanc \n", "57433 Royal Tokaji 1999 Mézes Mály Aszú 6 Puttonyos ... Tokaji \n", "\n", " price country \n", "0 35.0 South Africa \n", "1 11.0 France \n", "2 NaN France \n", "3 NaN France \n", "4 NaN France \n", "... ... ... \n", "57429 18.0 Chile \n", "57430 14.0 Peru \n", "57431 12.0 New Zealand \n", "57432 10.0 Chile \n", "57433 175.0 Hungary \n", "\n", "[57434 rows x 4 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT r.title, r.variety, r.price, l.country FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "WHERE country <> 'US' AND variety NOT IN ('Pinot Noir', 'Pinot Gris', 'Chardonnay');\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want all of the columns from the reviews table for wines whose descriptions contain the words \"smoke\" and \"chocolate\" - taking case sensitivity into account by converting the descriptions to all lower case in the `WHERE` clause so that \"chocolate\" and \"Chocolate\" are both matched - the following query returns those wines:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wine_idtitlevarietydescriptionpointspricetaster_idwinery_idlocation_id
06792Guardian Peak 2006 Shiraz (Western Cape)ShirazA gorgeous nose of plums, chocolate and red fr...8915.01673631141
17812Hightower 2006 Cabernet Sauvignon (Columbia Va...Cabernet SauvignonSourced largely from Red Mountain fruit, this ...8735.0376291474
27866Viña Cobos 2012 Bramare Marchiori Vineyard Mal...MalbecToasty woodsmoke aromas are matched by wild be...9490.05139624
38110Mendel 2013 Unus Red (Mendoza)Bordeaux-style Red BlendRich aromas of raisin, cassis and blackberry a...9350.0597467
49262Freakshow 2014 Cabernet Sauvignon (Lodi)Cabernet SauvignonRich, ripe and oaky, this full-bodied wine has...9120.01068931304
..............................
2604862Hearst Ranch 2013 Lone Tree Cabernet Franc (Pa...Cabernet FrancMade in a thick, oaky style, this shows burned...8735.0874981337
2614690Pear Valley 2013 Distraction Red (Paso Robles)Bordeaux-style Red BlendThe signature bottling from this winery, this ...9135.08107621337
2624794Estate Constantin Gofas 2008 Agiorgitiko (Nemea)AgiorgitikoThis wine has the plucky character typical of ...8518.0166366668
2636574Fielding Hills 2006 RiverBend Vineyard Syrah (...SyrahBold and forward, this estate-grown Syrah fair...9440.0366191483
2645020Corliss Estates 2007 Cabernet Sauvignon (Colum...Cabernet SauvignonConcentrated and wonderfully aromatic, this ag...9475.0346771474
\n", "

265 rows × 9 columns

\n", "
" ], "text/plain": [ " wine_id title \\\n", "0 6792 Guardian Peak 2006 Shiraz (Western Cape) \n", "1 7812 Hightower 2006 Cabernet Sauvignon (Columbia Va... \n", "2 7866 Viña Cobos 2012 Bramare Marchiori Vineyard Mal... \n", "3 8110 Mendel 2013 Unus Red (Mendoza) \n", "4 9262 Freakshow 2014 Cabernet Sauvignon (Lodi) \n", ".. ... ... \n", "260 4862 Hearst Ranch 2013 Lone Tree Cabernet Franc (Pa... \n", "261 4690 Pear Valley 2013 Distraction Red (Paso Robles) \n", "262 4794 Estate Constantin Gofas 2008 Agiorgitiko (Nemea) \n", "263 6574 Fielding Hills 2006 RiverBend Vineyard Syrah (... \n", "264 5020 Corliss Estates 2007 Cabernet Sauvignon (Colum... \n", "\n", " variety \\\n", "0 Shiraz \n", "1 Cabernet Sauvignon \n", "2 Malbec \n", "3 Bordeaux-style Red Blend \n", "4 Cabernet Sauvignon \n", ".. ... \n", "260 Cabernet Franc \n", "261 Bordeaux-style Red Blend \n", "262 Agiorgitiko \n", "263 Syrah \n", "264 Cabernet Sauvignon \n", "\n", " description points price \\\n", "0 A gorgeous nose of plums, chocolate and red fr... 89 15.0 \n", "1 Sourced largely from Red Mountain fruit, this ... 87 35.0 \n", "2 Toasty woodsmoke aromas are matched by wild be... 94 90.0 \n", "3 Rich aromas of raisin, cassis and blackberry a... 93 50.0 \n", "4 Rich, ripe and oaky, this full-bodied wine has... 91 20.0 \n", ".. ... ... ... \n", "260 Made in a thick, oaky style, this shows burned... 87 35.0 \n", "261 The signature bottling from this winery, this ... 91 35.0 \n", "262 This wine has the plucky character typical of ... 85 18.0 \n", "263 Bold and forward, this estate-grown Syrah fair... 94 40.0 \n", "264 Concentrated and wonderfully aromatic, this ag... 94 75.0 \n", "\n", " taster_id winery_id location_id \n", "0 16 7363 1141 \n", "1 3 7629 1474 \n", "2 5 13962 4 \n", "3 5 9746 7 \n", "4 10 6893 1304 \n", ".. ... ... ... 
\n", "260 8 7498 1337 \n", "261 8 10762 1337 \n", "262 16 6366 668 \n", "263 3 6619 1483 \n", "264 3 4677 1474 \n", "\n", "[265 rows x 9 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM reviews \n", "WHERE LOWER(description) LIKE '%%smoke%%' AND LOWER(description) LIKE '%%chocolate%%';\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are situations in which we want to display only some of the records that match a particular query. For that, we can use the `LIMIT` and `OFFSET` clauses. `LIMIT` sets the number of records to extract, and `OFFSET` sets the number of rows to skip before the first extracted row. For example, adding\n", "```\n", "LIMIT 10\n", "```\n", "to a query instructs the DBMS to extract only the first 10 rows of the output data. Adding\n", "```\n", "LIMIT 10 OFFSET 5\n", "```\n", "tells the DBMS to extract 10 rows after skipping the first 5 rows, so these clauses together return rows 6 through 15. To see the 4th through 7th rows from the previous query to the wine reviews database, we can type:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wine_idtitlevarietydescriptionpointspricetaster_idwinery_idlocation_id
08110Mendel 2013 Unus Red (Mendoza)Bordeaux-style Red BlendRich aromas of raisin, cassis and blackberry a...9350.0597467
19262Freakshow 2014 Cabernet Sauvignon (Lodi)Cabernet SauvignonRich, ripe and oaky, this full-bodied wine has...9120.01068931304
29746Carmel 2013 Admon Vineyard Cabernet Sauvignon ...Cabernet SauvignonThis wine has offers aromas of dark plum and c...8935.0141925694
310326De Martino 2009 Alto de Piedras Single Vineyar...CarmenèreA big, earthy type of wine with a ton of ripen...9045.054973182
\n", "
" ], "text/plain": [ " wine_id title \\\n", "0 8110 Mendel 2013 Unus Red (Mendoza) \n", "1 9262 Freakshow 2014 Cabernet Sauvignon (Lodi) \n", "2 9746 Carmel 2013 Admon Vineyard Cabernet Sauvignon ... \n", "3 10326 De Martino 2009 Alto de Piedras Single Vineyar... \n", "\n", " variety \\\n", "0 Bordeaux-style Red Blend \n", "1 Cabernet Sauvignon \n", "2 Cabernet Sauvignon \n", "3 Carmenère \n", "\n", " description points price \\\n", "0 Rich aromas of raisin, cassis and blackberry a... 93 50.0 \n", "1 Rich, ripe and oaky, this full-bodied wine has... 91 20.0 \n", "2 This wine has offers aromas of dark plum and c... 89 35.0 \n", "3 A big, earthy type of wine with a ton of ripen... 90 45.0 \n", "\n", " taster_id winery_id location_id \n", "0 5 9746 7 \n", "1 10 6893 1304 \n", "2 14 1925 694 \n", "3 5 4973 182 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT * FROM reviews \n", "WHERE LOWER(description) LIKE '%%smoke%%' AND LOWER(description) LIKE '%%chocolate%%'\n", "LIMIT 4 OFFSET 3;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting Data\n", "Sorting data refers to rearranging the rows of a dataframe. Sorting is a cosmetic thing to do to data because the order of the rows should not change the meaning of the data both in terms of storage (rearranging the rows should NOT change the meaning of each row), or for most analytical models (rearranging rows won't change the parameter estimates from a linear regression, for example). But sorting is a way to visualize important characteristics about the data and to quickly see important records with maximum and minimum values of key features.\n", "\n", "To sort the output data, use the `ORDER BY` syntax within an SQL query. The general syntax for `ORDER BY` is\n", "```\n", "ORDER BY column1, column2, column3\n", "```\n", "Writing more than one column is optional. 
If more than one column is entered, then the second column is used to *break ties* between rows that have the same value of the first column. If a third column is entered, it's used to break ties between rows that have the same value for both the first and second columns. In addition, each column can be sorted in ascending or descending order by typing either `ASC` (this is the default, so typing `ASC` is optional, but useful for making the SQL code more readable) or `DESC` immediately after the column name.\n", "\n", "For example, what are the top rated wines from Virginia? And of these top rated wines, which ones are cheapest? To find out, we issue a query that joins the reviews and locations tables, filters the data to just wines from Virginia, narrows down the columns to just title, points, and price, and sorts first by points, then by price. We sort by points in descending order so the best wines appear first, and we sort by price in ascending order so that the cheapest wines appear first. The syntax for this query is:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlepointsprice
0King Family 2015 Orange Viognier (Monticello)9235.0
1Lovingston 2012 Josie's Knoll Merlot (Monticello)9120.0
2Lovingston 2015 Josie's Knoll Rotunda Red (Mon...9020.0
3Barboursville Vineyards 2015 Reserve Cabernet ...9025.0
4King Family 2012 Meritage (Monticello)9031.0
............
374Narmada 2013 Reserve Cabernet Franc (Virginia)8234.0
375Veramar 2009 Chardonnay (Virginia)8118.0
376Bogati 2013 Black Label Club Fumé Blanc Sauvig...8126.0
377Three Fox 2014 Calabrese Pinot Grigio (Middleb...8128.0
378Winery at La Grange 2012 Cabernet Sauvignon (V...8143.0
\n", "

379 rows × 3 columns

\n", "
" ], "text/plain": [ " title points price\n", "0 King Family 2015 Orange Viognier (Monticello) 92 35.0\n", "1 Lovingston 2012 Josie's Knoll Merlot (Monticello) 91 20.0\n", "2 Lovingston 2015 Josie's Knoll Rotunda Red (Mon... 90 20.0\n", "3 Barboursville Vineyards 2015 Reserve Cabernet ... 90 25.0\n", "4 King Family 2012 Meritage (Monticello) 90 31.0\n", ".. ... ... ...\n", "374 Narmada 2013 Reserve Cabernet Franc (Virginia) 82 34.0\n", "375 Veramar 2009 Chardonnay (Virginia) 81 18.0\n", "376 Bogati 2013 Black Label Club Fumé Blanc Sauvig... 81 26.0\n", "377 Three Fox 2014 Calabrese Pinot Grigio (Middleb... 81 28.0\n", "378 Winery at La Grange 2012 Cabernet Sauvignon (V... 81 43.0\n", "\n", "[379 rows x 3 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT r.title, r.points, r.price FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "WHERE province = 'Virginia'\n", "ORDER BY points DESC, price ASC;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming Columns and Transforming Data Values\n", "Sometimes it is useful to rename the columns in the output data within a query. To rename a column, use the `AS` syntax while referencing the columns in `SELECT`. For example, we can load the title, variety, and points columns from the reviews table, but we can rename these columns name, type, and score respectively:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametypescore
0Casas del Bosque 2011 Reserva Sauvignon Blanc ...Sauvignon Blanc86
1Marqués de Terán 2009 Selección Especial (Rioja)Tempranillo86
2Maurice Ecard 2009 BourgogneChardonnay86
3McGregor 2010 Semi-Dry Riesling (Finger Lakes)Riesling86
4Jules Taylor 2009 Ballochdale Estate Pinot Noi...Pinot Noir86
............
103722Glenora 2010 Gewürztraminer (Finger Lakes)Gewürztraminer86
103723Hard Row To Hoe 2010 Marsanne (Yakima Valley)Marsanne86
103724Animale 2009 Dolcetto (Columbia Valley (WA))Dolcetto86
103725Beresan 2008 The Buzz Yellow Jacket Vineyard R...Red Blend86
103726Cabot Vineyards 2007 Syrah (Humboldt County)Syrah86
\n", "

103727 rows × 3 columns

\n", "
" ], "text/plain": [ " name type \\\n", "0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc \n", "1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo \n", "2 Maurice Ecard 2009 Bourgogne Chardonnay \n", "3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling \n", "4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir \n", "... ... ... \n", "103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer \n", "103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne \n", "103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto \n", "103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend \n", "103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah \n", "\n", " score \n", "0 86 \n", "1 86 \n", "2 86 \n", "3 86 \n", "4 86 \n", "... ... \n", "103722 86 \n", "103723 86 \n", "103724 86 \n", "103725 86 \n", "103726 86 \n", "\n", "[103727 rows x 3 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT title AS name, variety AS type, points AS score FROM reviews;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In other situations, we might want to transform a column arithmetically or in another way. SQL supports the standard arithmetic operators: `+` for addition, `-` for subtraction, `*` for multiplication, and `/` for division. SQL also supports the modulo operator `%` to return the remainder after division (`16 % 5` equals 1, for example, because 16 divided by 5 yields a remainder of 1). 
SQL also allows the following [arithmetic functions](https://www.w3schools.com/sql/func_sqlserver_pi.asp):\n", "\n", "* `EXP(a)` - raises $e = 2.718...$ to the power of `a`\n", "* `POWER(a,b)` - raises `a` to the power of `b`\n", "* `LOG(a)` - takes the natural (base $e$) logarithm of `a` in SQL Server and MySQL; in PostgreSQL, `LOG(a)` is the base 10 logarithm and `LN(a)` is the natural logarithm\n", "* `LOG10(a)` - takes the common (base 10) logarithm of `a`\n", "* `SQRT(a)` - takes the square root of `a`\n", "* `ABS(a)` - takes the absolute value of `a`\n", "* `CEILING(a)` - rounds values of `a` up to the next whole number\n", "* `FLOOR(a)` - rounds values of `a` down to a whole number\n", "* `ROUND(a, k)` - rounds values of `a` up or down to the nearest number with `k` decimals\n", "* `SIGN(a)` - returns 1 if `a` is positive, -1 if `a` is negative, and 0 if `a` is 0\n", "\n", "In addition, if you need them, there are many trigonometric functions built into standard SQL.\n", "\n", "When using a function that operates on a column, it is important to use `AS` to name the new column, as SQL has no way to choose a logical name automatically for constructed columns and uses `?column?` by default.\n", "\n", "For example, we can convert the price of each wine from dollars to euros by multiplying the price by an exchange rate of .91 euros per dollar. We keep the original price in the query but rename it `price_dollars`, and we name the converted price `price_euros`:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title variety \\\n", "0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc \n", "1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo \n", "2 Maurice Ecard 2009 Bourgogne Chardonnay \n", "3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling \n", "4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir \n", "... ... ... \n", "103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer \n", "103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne \n", "103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto \n", "103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend \n", "103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah \n", "\n", " price_dollars price_euros \n", "0 12.0 10.92 \n", "1 24.0 21.84 \n", "2 NaN NaN \n", "3 18.0 16.38 \n", "4 22.0 20.02 \n", "... ... ... \n", "103722 15.0 13.65 \n", "103723 18.0 16.38 \n", "103724 24.0 21.84 \n", "103725 19.0 17.29 \n", "103726 24.0 21.84 \n", "\n", "[103727 rows x 4 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT title, variety, price AS price_dollars, .91*price AS price_euros FROM reviews;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For no reason other than to demonstrate the use of the various mathematical functions, we can put many transformations of price in one dataframe. Note that `price_natlog` and `price_commonlog` are identical in the output. That is the behavior of PostgreSQL, where `LOG(a)` with one argument returns the base 10 logarithm and `LN(a)` returns the natural logarithm:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " price price_exp price_natlog price_commonlog price_loground \\\n", "0 12.0 1.012072 1.079181 1.079181 1.0 \n", "1 24.0 1.024290 1.380211 1.380211 1.0 \n", "2 NaN NaN NaN NaN NaN \n", "3 18.0 1.018163 1.255273 1.255273 1.0 \n", "4 22.0 1.022244 1.342423 1.342423 1.0 \n", "... ... ... ... ... ... \n", "103722 15.0 1.015113 1.176091 1.176091 1.0 \n", "103723 18.0 1.018163 1.255273 1.255273 1.0 \n", "103724 24.0 1.024290 1.380211 1.380211 1.0 \n", "103725 19.0 1.019182 1.278754 1.278754 1.0 \n", "103726 24.0 1.024290 1.380211 1.380211 1.0 \n", "\n", " price_sqrt price_squared price_cubed price_morethan50 \n", "0 3.464102 144.0 1728.0 -1.0 \n", "1 4.898979 576.0 13824.0 -1.0 \n", "2 NaN NaN NaN NaN \n", "3 4.242641 324.0 5832.0 -1.0 \n", "4 4.690416 484.0 10648.0 -1.0 \n", "... ... ... ... ... \n", "103722 3.872983 225.0 3375.0 -1.0 \n", "103723 4.242641 324.0 5832.0 -1.0 \n", "103724 4.898979 576.0 13824.0 -1.0 \n", "103725 4.358899 361.0 6859.0 -1.0 \n", "103726 4.898979 576.0 13824.0 -1.0 \n", "\n", "[103727 rows x 9 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT price,\n", " EXP(price/1000) as price_exp,\n", " LOG(price) as price_natlog,\n", " LOG10(price) as price_commonlog,\n", " ROUND(LOG(price)) as price_loground,\n", " SQRT(price) as price_sqrt,\n", " POWER(price, 2) as price_squared,\n", " POWER(price, 3) as price_cubed,\n", " SIGN(price - 50) as price_morethan50\n", "FROM reviews;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another useful operation for transforming columns in a read query is `CASE`, which maps numeric values to categories. 
The syntax that uses `CASE` within `SELECT` is\n", "```\n", "SELECT CASE\n", " WHEN logicalstatement1 THEN value1\n", " WHEN logicalstatement2 THEN value2\n", " WHEN logicalstatement3 THEN value3\n", " ELSE value4\n", " END AS name\n", "```\n", "This code evaluates each logical statement, and fills in the datapoint with the specified value if the logical statement is true. If more than one of the logical statements is true, then the first statement/value pair entered in takes precedence. If none of the logical statements are true, then the datapoint is filled in with the value listed with `ELSE`. As before, it is important to name the new column with `AS`.\n", "\n", "For example, if we want to categorize wines as cheap when the price is under 20 dollars, moderately priced if the price is between 20 and 50 dollars, and expensive if the price is more than 50 dollars, we can type:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title variety \\\n", "0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc \n", "1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo \n", "2 Maurice Ecard 2009 Bourgogne Chardonnay \n", "3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling \n", "4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir \n", "... ... ... \n", "103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer \n", "103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne \n", "103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto \n", "103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend \n", "103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah \n", "\n", " price price_level \n", "0 12.0 cheap \n", "1 24.0 moderately priced \n", "2 NaN None \n", "3 18.0 cheap \n", "4 22.0 moderately priced \n", "... ... ... \n", "103722 15.0 cheap \n", "103723 18.0 cheap \n", "103724 24.0 moderately priced \n", "103725 19.0 cheap \n", "103726 24.0 moderately priced \n", "\n", "[103727 rows x 4 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery=\"\"\"\n", "SELECT title, variety, price, CASE\n", " WHEN price < 20 THEN 'cheap'\n", " WHEN price BETWEEN 20 AND 50 THEN 'moderately priced'\n", " WHEN price > 50 THEN 'expensive'\n", " ELSE NULL\n", " END AS price_level\n", "FROM reviews;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's an important **point of caution** when using `CASE`. A comparison that involves a missing value is never true, so unless you account for missing values explicitly in your query, rows with missing values match none of the `WHEN` conditions and receive the value listed with `ELSE`. If `ELSE` carries a real category, that will corrupt the data. It is best practice to write conditions for the full set of observed values of a column, and to end the call to `CASE` with `ELSE NULL`, so that when none of the conditions apply, the new column is also missing." 
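
The caution about `CASE` and missing values can be checked on a toy table. The following is a minimal sketch, not part of the wine database: it assumes an in-memory SQLite database from Python's standard library and a hypothetical three-row `reviews` table, so that it runs without the PostgreSQL `engine` used elsewhere in this notebook. The `CASE` semantics shown here are the same in both systems.

```python
import sqlite3

# Hypothetical three-row stand-in for the reviews table (not the real wine data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reviews (title TEXT, price REAL)")
con.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("Wine A", 12.0), ("Wine B", 75.0), ("Wine C", None)],
)

# Risky: ELSE carries a real category, so the missing price falls through to it.
risky = con.execute("""
SELECT title, CASE
        WHEN price < 20 THEN 'cheap'
        WHEN price BETWEEN 20 AND 50 THEN 'moderately priced'
        ELSE 'expensive'
    END AS price_level
FROM reviews
ORDER BY title;
""").fetchall()

# Safer: every category has its own condition, and ELSE returns NULL.
safe = con.execute("""
SELECT title, CASE
        WHEN price < 20 THEN 'cheap'
        WHEN price BETWEEN 20 AND 50 THEN 'moderately priced'
        WHEN price > 50 THEN 'expensive'
        ELSE NULL
    END AS price_level
FROM reviews
ORDER BY title;
""").fetchall()

print(risky)  # Wine C, whose price is missing, is labeled 'expensive'
print(safe)   # Wine C is labeled None (NULL)
```

In the first query the wine with no recorded price is silently classified as expensive; in the second it stays missing, which is the behavior recommended above.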
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are also functions that apply to columns with [character](https://www.geeksforgeeks.org/sql-character-functions-examples/) values:\n", "\n", "* `LOWER(a)` - converts all characters in `a` to lowercase\n", "* `UPPER(a)` - converts all characters in `a` to uppercase\n", "* `INITCAP(a)` - converts the first letter of every word in `a` to uppercase\n", "* `CONCAT(a,b,c)` - appends the string `b` to the end of `a`, and `c` (if included) to the end of `b`\n", "* `LENGTH(a)` - reports the number of characters in the string `a`\n", "* `SUBSTR(a, start, length)` - restricts the string `a` to a substring, beginning at the position denoted by `start`, and including the next `length` characters \n", "* `TRIM(a)` - removes spaces at the beginning and end of string `a`\n", "* `REPLACE(a, oldtext, newtext)` - searches values of `a` for occurrences of `oldtext` and replaces them with `newtext`\n", "\n", "For example, we can replace the descriptions in the reviews table with all capitals, all lower-case letters, or capitals for the first letter of each word:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title variety \\\n", "0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc \n", "1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo \n", "2 Maurice Ecard 2009 Bourgogne Chardonnay \n", "3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling \n", "4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir \n", "... ... ... \n", "103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer \n", "103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne \n", "103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto \n", "103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend \n", "103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah \n", "\n", " price description_upper \\\n", "0 12.0 IT'S PRETTY EASY PEGGING THIS FOR CHILEAN SB; ... \n", "1 24.0 OPAQUE IN COLOR, WITH BLACKBERRY AND LICORICE ... \n", "2 NaN ATTRACTIVE RIPE FRUITS GO WITH LIME AND TOAST ... \n", "3 18.0 SMOKY AND A BIT STARK WITH WHIFFS OF STRUCK FL... \n", "4 22.0 SOURCED FROM A HIGH-ALTITUDE VINEYARD NEAR THE... \n", "... ... ... \n", "103722 15.0 SWEET ON THE NOSE, WITH SCENTS OF PINK GRAPEFR... \n", "103723 18.0 WAXY FRUIT FLAVORS OF PEACH, MELON AND BANANA,... \n", "103724 24.0 THIS ALCOHOLIC WINE (15.9%, AND YOU CAN TASTE ... \n", "103725 19.0 LIGHT FRUIT FLAVORS RUN FROM MELON INTO PALE S... \n", "103726 24.0 FROM HUSBAND-AND-WIFE TEAM IN CALIFORNIA REDWO... \n", "\n", " description_lower \\\n", "0 it's pretty easy pegging this for chilean sb; ... \n", "1 opaque in color, with blackberry and licorice ... \n", "2 attractive ripe fruits go with lime and toast ... \n", "3 smoky and a bit stark with whiffs of struck fl... \n", "4 sourced from a high-altitude vineyard near the... \n", "... ... \n", "103722 sweet on the nose, with scents of pink grapefr... \n", "103723 waxy fruit flavors of peach, melon and banana,... \n", "103724 this alcoholic wine (15.9%, and you can taste ... 
\n", "103725 light fruit flavors run from melon into pale s... \n", "103726 from husband-and-wife team in california redwo... \n", "\n", " description_initcap \n", "0 It'S Pretty Easy Pegging This For Chilean Sb; ... \n", "1 Opaque In Color, With Blackberry And Licorice ... \n", "2 Attractive Ripe Fruits Go With Lime And Toast ... \n", "3 Smoky And A Bit Stark With Whiffs Of Struck Fl... \n", "4 Sourced From A High-Altitude Vineyard Near The... \n", "... ... \n", "103722 Sweet On The Nose, With Scents Of Pink Grapefr... \n", "103723 Waxy Fruit Flavors Of Peach, Melon And Banana,... \n", "103724 This Alcoholic Wine (15.9%, And You Can Taste ... \n", "103725 Light Fruit Flavors Run From Melon Into Pale S... \n", "103726 From Husband-And-Wife Team In California Redwo... \n", "\n", "[103727 rows x 6 columns]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT title, variety, price, \n", " UPPER(description) as description_upper, \n", " LOWER(description) as description_lower, \n", " INITCAP(description) as description_initcap \n", "FROM reviews;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `REPLACE()` to make the writing less artful, for example, by replacing the word \"aroma\" with \"good smell\" everywhere it appears in the wine descriptions. Note that `REPLACE()` is case-sensitive, so it is a good idea to convert the values to a consistent case like lowercase first:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Opaque in color, with blackberry and licorice aromas but also a distinct streak of brambly herbs and green. Later on, pine needle and tartness enter the fray. 
This is a commendable modern Rioja but it does have a few issues, namely a green herbal component.'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT title, variety, price, description,\n", " REPLACE(LOWER(description), 'aroma', 'good smell') as description_replace \n", "FROM reviews\n", "WHERE description LIKE '%%aroma%%';\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine).description[0]" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'opaque in color, with blackberry and licorice good smells but also a distinct streak of brambly herbs and green. later on, pine needle and tartness enter the fray. this is a commendable modern rioja but it does have a few issues, namely a green herbal component.'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql_query(myquery, con=engine).description_replace[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `SUBSTR()` function can be used to extract parts of a string. The following code reduces the `description` column to substrings beginning at the 5th character and proceeding 10 characters in length:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title variety \\\n", "0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc \n", "1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo \n", "2 Maurice Ecard 2009 Bourgogne Chardonnay \n", "3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling \n", "4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir \n", "... ... ... \n", "103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer \n", "103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne \n", "103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto \n", "103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend \n", "103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah \n", "\n", " price description \\\n", "0 12.0 It's pretty easy pegging this for Chilean SB; ... \n", "1 24.0 Opaque in color, with blackberry and licorice ... \n", "2 NaN Attractive ripe fruits go with lime and toast ... \n", "3 18.0 Smoky and a bit stark with whiffs of struck fl... \n", "4 22.0 Sourced from a high-altitude vineyard near the... \n", "... ... ... \n", "103722 15.0 Sweet on the nose, with scents of pink grapefr... \n", "103723 18.0 Waxy fruit flavors of peach, melon and banana,... \n", "103724 24.0 This alcoholic wine (15.9%, and you can taste ... \n", "103725 19.0 Light fruit flavors run from melon into pale s... \n", "103726 24.0 From husband-and-wife team in California redwo... \n", "\n", " description_substr \n", "0 pretty ea \n", "1 ue in colo \n", "2 active rip \n", "3 y and a bi \n", "4 ced from a \n", "... ... 
\n", "103722 t on the n \n", "103723 fruit fla \n", "103724 alcoholic \n", "103725 t fruit fl \n", "103726 husband-a \n", "\n", "[103727 rows x 5 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT title, variety, price, description,\n", " SUBSTR(description, 5, 10) as description_substr \n", "FROM reviews;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the wine database, province and country are stored in separate columns in the locations table. If we wanted to put these two pieces of information together in one readable column, we can use `CONCAT()`. In this example, I type `CONCAT(l.province, ', ', l.country)` which appends three strings - the province from the locations table, a comma and a space, and the country from the locations table - and names the new column `place`:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title variety \\\n", "0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc \n", "1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo \n", "2 Maurice Ecard 2009 Bourgogne Chardonnay \n", "3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling \n", "4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir \n", "... ... ... \n", "103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer \n", "103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne \n", "103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto \n", "103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend \n", "103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah \n", "\n", " price place \n", "0 12.0 Casablanca Valley, Chile \n", "1 24.0 Northern Spain, Spain \n", "2 NaN Burgundy, France \n", "3 18.0 New York, US \n", "4 22.0 Marlborough, New Zealand \n", "... ... ... \n", "103722 15.0 New York, US \n", "103723 18.0 Washington, US \n", "103724 24.0 Washington, US \n", "103725 19.0 Washington, US \n", "103726 24.0 California, US \n", "\n", "[103727 rows x 4 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT r.title, r.variety, r.price, \n", " CONCAT(l.province, ', ', l.country) as place \n", "FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the shortest descriptions in the data? To find out we use the `LENGTH()` function to count the number of characters in each description, and sort these lengths in ascending order:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title points price \\\n", "0 Craggy Range 2007 Kidnappers Vineyard Chardonn... 88 24.0 \n", "1 Chasing Venus 2007 Sauvignon Blanc (Marlborough) 88 16.0 \n", "2 Philip Shaw 2007 No. 19 Sauvignon Blanc (Orange) 86 20.0 \n", "3 Peconic Bay Winery 2001 Riesling (North Fork o... 84 13.0 \n", "4 Mount Baker Vineyards 2006 Barrel Select Sangi... 82 16.0 \n", "... ... ... ... \n", "103722 René Muré 2015 Clos Saint Landelin Vorbourg Gr... 97 50.0 \n", "103723 Domaine Marcel Deiss 2009 Altenberg de Berghei... 95 66.0 \n", "103724 De Toren 2014 Book 17 XVII Red (Stellenbosch) 95 330.0 \n", "103725 Domaine Ostertag 2015 Muenchberg Grand Cru Rie... 97 66.0 \n", "103726 Saggi 2007 Red (Columbia Valley (WA)) 91 45.0 \n", "\n", " description length \n", "0 Imported by Kobrand. 20 \n", "1 Imported by JL Giguiere. 24 \n", "2 Imported by Lion Nathan USA. 28 \n", "3 Review not available at this time. 34 \n", "4 Very light, could almost be a rosé. 35 \n", "... ... ... \n", "103722 The heady aromatic scent of fresh tangerine pe... 698 \n", "103723 Lifted notes of dried pear, dried chamomile fl... 699 \n", "103724 Only 95 cases were made of this Bordeaux-style... 723 \n", "103725 There is something incredibly fruity and simul... 753 \n", "103726 Dark, dusty, strongly scented with barrel toas... 829 \n", "\n", "[103727 rows x 5 columns]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT title, points, price, description,\n", " LENGTH(description) as length\n", "FROM reviews\n", "ORDER BY length ASC;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Aggregation\n", "If there are columns in the data output that have repeated values, then each distinct value forms a group in the data. 
Data aggregation is the process of collapsing the data to one row for each group, while summarizing other columns by taking the within-group mean, sum, count, or another statistic. \n", "\n", "Aggregating data requires more attention to the ordering of the clauses in an SQL query than is necessary with other tasks. That is, certain clauses must be entered into the query in a particular order. An SQL query that aggregates data should follow this template:\n", "```\n", "SELECT aggregationfunctions FROM table\n", "(any joins happen here)\n", "(filtering rows with the WHERE clause happens here)\n", "GROUP BY groupingcolumns\n", "HAVING (a logical condition involving aggregation functions)\n", "(sorting with ORDER BY happens here);\n", "```\n", "Let's break down this template line by line. First,\n", "\n", "* `SELECT aggregationfunctions FROM table`\n", "\n", "Aggregation functions work like the arithmetic functions described above. The difference is that instead of working with a single value, like `SQRT()` and `POWER()` do, aggregation functions work with vectors of data and generate summary statistics. They describe how existing columns should be summarized when the data are collapsed. 
The aggregation functions are:\n", "\n", "* `COUNT(*)` - an overall count of the number of rows within each group\n", "* `COUNT(a)` - a count of the number of non-missing observations of column `a` within each group\n", "* `COUNT(DISTINCT a)` - a count of the number of distinct observations of column `a` within each group\n", "* `AVG(a)` - the mean of the values of `a` within each group\n", "* `SUM(a)` - the sum of the values of `a` within each group\n", "* `MAX(a)` - the maximum value of `a` within each group\n", "* `MIN(a)` - the minimum value of `a` within each group\n", "* `VAR_POP(a)` and `VAR_SAMP(a)` - the population and sample variances, respectively, of the values of `a` within each group. The meaning of the shorthand `VARIANCE(a)` depends on the DBMS: in PostgreSQL it is an alias for `VAR_SAMP(a)`, while in MySQL it returns the population variance\n", "* `STDDEV_POP(a)` and `STDDEV_SAMP(a)` - the population and sample standard deviations, respectively, of the values of `a` within each group. The meaning of the shorthand `STDDEV(a)` also depends on the DBMS: in PostgreSQL it is an alias for `STDDEV_SAMP(a)`, while in MySQL it returns the population standard deviation\n", "\n", "Additional statistics, like the median, mode, and various quantiles, are not included in standard SQL but are available in extensions that are specific to a DBMS, such as the [quantile extension](https://pgxn.org/dist/quantile/) for PostgreSQL.\n", "\n", "One important point about the aggregation functions is that, with the exception of `COUNT(*)`, they **ignore NULL values**. Suppose that we have a data vector with values (1,3,8,NULL). Because we do not know the fourth value, we cannot calculate the true sum and true mean of the values. However, the `SUM()` function in SQL ignores the NULL value and reports the sum as 12, which makes a strong tacit assumption that the NULL value is equal to 0. The `AVG()` function calculates the mean from the observed values, and reports (1+3+8)/3 = 4, but this too makes a strong assumption that the NULL value is exactly 4. 
There are situations in which calculations from the non-NULL values are appropriate, but it is not correct to make broad claims from these summary statistics when there are missing values in the columns being summarized.\n", "\n", "The next two lines in the template are placeholders for the syntax we use to join data tables and the syntax we use to filter rows with the `WHERE` clause. There is a great deal of similarity between `WHERE` and `HAVING`, which we will discuss shortly.\n", "\n", "The fourth line in the template,\n", "\n", "* `GROUP BY groupingcolumns`\n", "\n", "is the key line for activating the aggregation functionality of SQL. `groupingcolumns` can include one or more columns. If one column is listed, the unique values of that column define the groups that will comprise the rows of the output data. If there is more than one column listed, the unique combinations of values from the columns define the groups.\n", "\n", "The fifth line in the template uses the `HAVING` clause. `HAVING` is very similar to `WHERE` in that both use logical conditions to identify a selection of the rows to include in the output data. The difference between `WHERE` and `HAVING` is that `WHERE` operates on rows in the original data prior to aggregation, and `HAVING` works on rows after aggregation has occurred. One limitation of `HAVING` is that it will not recognize new column names defined in `SELECT`, so the same aggregation functions used in `SELECT` need to be used again in the logical conditions for `HAVING`. Finally, if we want to sort, we can include the `ORDER BY` clause last.\n", "\n", "For example, let's find out which country produces wines with the highest average score. To do that, we need a query that joins reviews and locations together, includes country name, the average score across wines from that country, and for good measure, a count of the number of wines reviewed from that country. 
To collapse on country, we can use `GROUP BY`, and to produce the average score and the count of wines we use the `AVG()` and `COUNT()` functions. For presentation purposes, I choose to round the average score to one decimal and to sort the rows from the highest to lowest average score, so that we can immediately see which countries produce the highest-rated wines. The query is:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryaverage_pointsnumberofwines
0England91.674
1India90.29
2Austria90.13337
3Germany89.92134
4Canada89.4256
5Hungary89.2145
6China89.01
7US89.037730
8France88.921828
9Italy88.811042
10Australia88.82037
11Luxembourg88.76
12None88.663
13Morocco88.628
14Switzerland88.67
15Israel88.5500
16New Zealand88.31311
17South Africa88.21328
18Portugal88.25686
19Slovenia88.187
20Turkey88.190
21Bulgaria87.9141
22Georgia87.786
23Lebanon87.735
24Armenia87.52
25Serbia87.512
26Czech Republic87.312
27Greece87.3466
28Spain87.36581
29Moldova87.259
30Croatia87.273
31Cyprus87.211
32Slovakia87.01
33Uruguay86.8109
34Macedonia86.812
35Argentina86.73797
36Bosnia and Herzegovina86.52
37Chile86.54361
38Romania86.4120
39Mexico85.365
40Brazil84.752
41Ukraine84.114
42Egypt84.01
43Peru83.616
\n", "
" ], "text/plain": [ " country average_points numberofwines\n", "0 England 91.6 74\n", "1 India 90.2 9\n", "2 Austria 90.1 3337\n", "3 Germany 89.9 2134\n", "4 Canada 89.4 256\n", "5 Hungary 89.2 145\n", "6 China 89.0 1\n", "7 US 89.0 37730\n", "8 France 88.9 21828\n", "9 Italy 88.8 11042\n", "10 Australia 88.8 2037\n", "11 Luxembourg 88.7 6\n", "12 None 88.6 63\n", "13 Morocco 88.6 28\n", "14 Switzerland 88.6 7\n", "15 Israel 88.5 500\n", "16 New Zealand 88.3 1311\n", "17 South Africa 88.2 1328\n", "18 Portugal 88.2 5686\n", "19 Slovenia 88.1 87\n", "20 Turkey 88.1 90\n", "21 Bulgaria 87.9 141\n", "22 Georgia 87.7 86\n", "23 Lebanon 87.7 35\n", "24 Armenia 87.5 2\n", "25 Serbia 87.5 12\n", "26 Czech Republic 87.3 12\n", "27 Greece 87.3 466\n", "28 Spain 87.3 6581\n", "29 Moldova 87.2 59\n", "30 Croatia 87.2 73\n", "31 Cyprus 87.2 11\n", "32 Slovakia 87.0 1\n", "33 Uruguay 86.8 109\n", "34 Macedonia 86.8 12\n", "35 Argentina 86.7 3797\n", "36 Bosnia and Herzegovina 86.5 2\n", "37 Chile 86.5 4361\n", "38 Romania 86.4 120\n", "39 Mexico 85.3 65\n", "40 Brazil 84.7 52\n", "41 Ukraine 84.1 14\n", "42 Egypt 84.0 1\n", "43 Peru 83.6 16" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT l.country,\n", " ROUND(AVG(points),1) as average_points,\n", " COUNT(*) as numberofwines\n", "FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "GROUP BY l.country\n", "ORDER BY average_points DESC;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the country that produces the best wine is . . . England? Wait, that can't be right. Looking at the results, I see that some of the countries only have a small number of wines reviewed. 
China, for example, has only one wine review in the database, so we definitely should not put as much confidence in China's mean score as in the mean scores of countries with many more reviews, like France and the U.S. For a fairer comparison, let's restrict the output to only those countries with at least 500 wines in the database. To filter rows on this condition, we use `HAVING` and not `WHERE` because the condition involves an aggregation function - specifically the count of the number of wines per country. We can rerun the previous query, including the `HAVING` clause:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryaverage_pointsnumberofwines
0Austria90.13337
1Germany89.92134
2US89.037730
3France88.921828
4Italy88.811042
5Australia88.82037
6Israel88.5500
7New Zealand88.31311
8Portugal88.25686
9South Africa88.21328
10Spain87.36581
11Argentina86.73797
12Chile86.54361
\n", "
" ], "text/plain": [ " country average_points numberofwines\n", "0 Austria 90.1 3337\n", "1 Germany 89.9 2134\n", "2 US 89.0 37730\n", "3 France 88.9 21828\n", "4 Italy 88.8 11042\n", "5 Australia 88.8 2037\n", "6 Israel 88.5 500\n", "7 New Zealand 88.3 1311\n", "8 Portugal 88.2 5686\n", "9 South Africa 88.2 1328\n", "10 Spain 87.3 6581\n", "11 Argentina 86.7 3797\n", "12 Chile 86.5 4361" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT l.country,\n", " ROUND(AVG(points),1) as average_points,\n", " COUNT(*) as numberofwines\n", "FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "GROUP BY l.country\n", " HAVING COUNT(*) >= 500\n", "ORDER BY average_points DESC;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we were interested in ranking the countries according to their scores for a particular type of wine. To filter rows from the original data, we use a `WHERE` clause prior to `GROUP BY`. If we write `WHERE r.variety = 'Riesling'` prior to `GROUP BY`, the DBMS first extracts only the rows from the reviews table that refer to Riesling wines, then proceeds with the rest of the query. The following code ranks the countries based on their average scores for Rieslings, given at least 100 Rieslings from that country in the database:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryaverage_pointsnumberofwines
0Austria91.4581
1France90.5691
2Germany90.11768
3Australia89.4111
4US88.11600
\n", "
" ], "text/plain": [ " country average_points numberofwines\n", "0 Austria 91.4 581\n", "1 France 90.5 691\n", "2 Germany 90.1 1768\n", "3 Australia 89.4 111\n", "4 US 88.1 1600" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT l.country,\n", " ROUND(AVG(points),1) as average_points,\n", " COUNT(*) as numberofwines\n", "FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "WHERE r.variety = 'Riesling'\n", "GROUP BY l.country\n", " HAVING COUNT(*) >= 100\n", "ORDER BY average_points DESC;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes the groups in the data are formed by more than one column. In that case, simply add the second column name to the `GROUP BY` clause. For example, if we wanted to know the top rated combination of country and variety (with a minimum of 50 wines for that combination), we can use the following code:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryvarietyaverage_pointsnumberofwines
0FranceTannat91.559
1AustriaRiesling91.4581
2South AfricaBordeaux-style Red Blend90.684
3FranceRiesling90.5691
4AustriaChardonnay90.462
...............
190ArgentinaChardonnay84.9295
191SpainRosé84.9149
192PortugalRosé84.6235
193SpainRosado84.671
194ArgentinaSauvignon Blanc84.378
\n", "

195 rows × 4 columns

\n", "
" ], "text/plain": [ " country variety average_points numberofwines\n", "0 France Tannat 91.5 59\n", "1 Austria Riesling 91.4 581\n", "2 South Africa Bordeaux-style Red Blend 90.6 84\n", "3 France Riesling 90.5 691\n", "4 Austria Chardonnay 90.4 62\n", ".. ... ... ... ...\n", "190 Argentina Chardonnay 84.9 295\n", "191 Spain Rosé 84.9 149\n", "192 Portugal Rosé 84.6 235\n", "193 Spain Rosado 84.6 71\n", "194 Argentina Sauvignon Blanc 84.3 78\n", "\n", "[195 rows x 4 columns]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT l.country, r.variety,\n", " ROUND(AVG(points),1) as average_points,\n", " COUNT(*) as numberofwines\n", "FROM reviews r\n", "INNER JOIN locations l\n", " ON r.location_id = l.location_id\n", "GROUP BY l.country, r.variety\n", " HAVING COUNT(*) >= 50\n", "ORDER BY average_points DESC;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subqueries\n", "There are situations in which it makes sense to use data aggregation techniques to generate new columns filled with group-level summary statistics, then to place these summary statistics into the original data table. To do this work, we can use **subqueries**. A subquery is a full SQL query that is used inside another SQL query. There are three types of subquery:\n", "\n", "1. Subqueries, like the ones for the mean and standard deviation above, that yield a single datapoint. These subqueries can be used anywhere we might write a value in the query, such as when defining new columns in `SELECT` or filtering rows with `WHERE` or `HAVING`.\n", "\n", "2. Subqueries that yield a list of values that can be used inside logical statements that include the `IN` operator.\n", "\n", "3. 
Subqueries that yield a data table that can be joined to existing data tables.\n", "\n", "Suppose, for example, that we wanted to generate a Z-score standardized version of the wine review points. A Z-score subtracts the mean of a column from every value in the column, then divides each difference by the standard deviation of the column. When a Z-score equals 1, it means that the original value is one standard deviation above the mean of the column. To calculate the Z-score, we need to calculate the mean of the points column," ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
avg
088.612107
\n", "
" ], "text/plain": [ " avg\n", "0 88.612107" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT AVG(points) FROM reviews;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and the sample standard deviation of the points column," ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stddev_samp
02.955039
\n", "
" ], "text/plain": [ " stddev_samp\n", "0 2.955039" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT STDDEV_SAMP(points) FROM reviews;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can type these values in manually with a query that looks like:\n", "```\n", "SELECT title, variety, points,\n", "(points - 88.612107))/(2.955039) as points_z \n", "FROM reviews;\n", "```\n", "The problem with typing these values in by hand is that it is easy to make a mistake and accidentially corrupt the `points_z` column because of a typo. Also these values will have to be changed by hand every time the data inside the database is updated. A better solution is to have SQL do the work of calculating the mean and standard deviation for us by using subqueries. All we need to do is replace the values with the queries (contained in parentheses) that generate those single values. In this case, the query is " ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlevarietypointspoints_z
0Casas del Bosque 2011 Reserva Sauvignon Blanc ...Sauvignon Blanc86-0.88395
1Marqués de Terán 2009 Selección Especial (Rioja)Tempranillo86-0.88395
2Maurice Ecard 2009 BourgogneChardonnay86-0.88395
3McGregor 2010 Semi-Dry Riesling (Finger Lakes)Riesling86-0.88395
4Jules Taylor 2009 Ballochdale Estate Pinot Noi...Pinot Noir86-0.88395
...............
103722Glenora 2010 Gewürztraminer (Finger Lakes)Gewürztraminer86-0.88395
103723Hard Row To Hoe 2010 Marsanne (Yakima Valley)Marsanne86-0.88395
103724Animale 2009 Dolcetto (Columbia Valley (WA))Dolcetto86-0.88395
103725Beresan 2008 The Buzz Yellow Jacket Vineyard R...Red Blend86-0.88395
103726Cabot Vineyards 2007 Syrah (Humboldt County)Syrah86-0.88395
\n", "

103727 rows × 4 columns

\n", "
" ], "text/plain": [ " title variety \\\n", "0 Casas del Bosque 2011 Reserva Sauvignon Blanc ... Sauvignon Blanc \n", "1 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo \n", "2 Maurice Ecard 2009 Bourgogne Chardonnay \n", "3 McGregor 2010 Semi-Dry Riesling (Finger Lakes) Riesling \n", "4 Jules Taylor 2009 Ballochdale Estate Pinot Noi... Pinot Noir \n", "... ... ... \n", "103722 Glenora 2010 Gewürztraminer (Finger Lakes) Gewürztraminer \n", "103723 Hard Row To Hoe 2010 Marsanne (Yakima Valley) Marsanne \n", "103724 Animale 2009 Dolcetto (Columbia Valley (WA)) Dolcetto \n", "103725 Beresan 2008 The Buzz Yellow Jacket Vineyard R... Red Blend \n", "103726 Cabot Vineyards 2007 Syrah (Humboldt County) Syrah \n", "\n", " points points_z \n", "0 86 -0.88395 \n", "1 86 -0.88395 \n", "2 86 -0.88395 \n", "3 86 -0.88395 \n", "4 86 -0.88395 \n", "... ... ... \n", "103722 86 -0.88395 \n", "103723 86 -0.88395 \n", "103724 86 -0.88395 \n", "103725 86 -0.88395 \n", "103726 86 -0.88395 \n", "\n", "[103727 rows x 4 columns]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT title, variety, points,\n", "(points - (SELECT AVG(points) FROM reviews))/(SELECT STDDEV(points) FROM reviews) as points_z \n", "FROM reviews;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that we wanted to restrict the rows to only the wines from wineries with at least 100 wines in the data. The problem is we don't know which wineries have at least 100 reviewed wines. But we can find out with a query:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
winery_id
09112
14562
29557
313540
413381
51547
64257
713118
82007
97060
10224
119995
128022
135076
144586
152375
1614401
1712042
189111
194124
\n", "
" ], "text/plain": [ " winery_id\n", "0 9112\n", "1 4562\n", "2 9557\n", "3 13540\n", "4 13381\n", "5 1547\n", "6 4257\n", "7 13118\n", "8 2007\n", "9 7060\n", "10 224\n", "11 9995\n", "12 8022\n", "13 5076\n", "14 4586\n", "15 2375\n", "16 14401\n", "17 12042\n", "18 9111\n", "19 4124" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT winery_id FROM reviews\n", "GROUP BY winery_id\n", " HAVING COUNT(*) >= 100;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query gives us a list of winery ID numbers that match the wineries with at least 100 reviewed wines in the data. We can now use this list inside another query that restricts the reviews data to only the wines from this list of wineries:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
winery_idtitlevarietypointsprice
014401Wines & Winemakers 2012 Pegos Claros Colheita ...Castelão8715.0
114401Wines & Winemakers 2013 Lua Cheia em Vinhas Ve...Portuguese Red8718.0
214401Wines & Winemakers 2013 Lua Cheia em Vinhas Ve...Portuguese Red8712.0
34124Chateau Ste. Michelle 2008 Syrah (Columbia Val...Syrah8713.0
44586Concha y Toro 2010 Gravas del Maipo Syrah (Mai...Syrah91200.0
..................
274213381Trapiche 2016 Pure Malbec (Uco Valley)Malbec8815.0
27439111Louis Jadot 2005 La Dominode Premier Cru (Sav...Pinot Noir9037.0
27449995Montes 2009 Limited Selection Pinot Noir (Casa...Pinot Noir8920.0
27454124Chateau Ste. Michelle 2012 Canoe Ridge Estate ...Chardonnay8922.0
274614401Wines & Winemakers 2015 Nostalgia Alvarinho (V...Alvarinho8823.0
\n", "

2747 rows × 5 columns

\n", "
" ], "text/plain": [ " winery_id title \\\n", "0 14401 Wines & Winemakers 2012 Pegos Claros Colheita ... \n", "1 14401 Wines & Winemakers 2013 Lua Cheia em Vinhas Ve... \n", "2 14401 Wines & Winemakers 2013 Lua Cheia em Vinhas Ve... \n", "3 4124 Chateau Ste. Michelle 2008 Syrah (Columbia Val... \n", "4 4586 Concha y Toro 2010 Gravas del Maipo Syrah (Mai... \n", "... ... ... \n", "2742 13381 Trapiche 2016 Pure Malbec (Uco Valley) \n", "2743 9111 Louis Jadot 2005 La Dominode Premier Cru (Sav... \n", "2744 9995 Montes 2009 Limited Selection Pinot Noir (Casa... \n", "2745 4124 Chateau Ste. Michelle 2012 Canoe Ridge Estate ... \n", "2746 14401 Wines & Winemakers 2015 Nostalgia Alvarinho (V... \n", "\n", " variety points price \n", "0 Castelão 87 15.0 \n", "1 Portuguese Red 87 18.0 \n", "2 Portuguese Red 87 12.0 \n", "3 Syrah 87 13.0 \n", "4 Syrah 91 200.0 \n", "... ... ... ... \n", "2742 Malbec 88 15.0 \n", "2743 Pinot Noir 90 37.0 \n", "2744 Pinot Noir 89 20.0 \n", "2745 Chardonnay 89 22.0 \n", "2746 Alvarinho 88 23.0 \n", "\n", "[2747 rows x 5 columns]" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT winery_id, title, variety, points, price FROM reviews\n", "WHERE winery_id IN (\n", " SELECT winery_id FROM reviews r\n", " GROUP BY winery_id\n", " HAVING COUNT(*) >= 100\n", " );\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we wanted a data table with the top rated wine from each winery. The problem is we don't know which wine is the top rated for each winery. Again, we can find out with a query that groups the reviews data by winery ID and uses the `MAX()` aggregation function to identify the maximum score achieved for any wine from that winery: " ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
winery_idmaxpoints
01123391
1479095
2393687
31250287
4546892
.........
14566403589
14567918092
14568482794
1456979083
145701089691
\n", "

14571 rows × 2 columns

\n", "
" ], "text/plain": [ " winery_id maxpoints\n", "0 11233 91\n", "1 4790 95\n", "2 3936 87\n", "3 12502 87\n", "4 5468 92\n", "... ... ...\n", "14566 4035 89\n", "14567 9180 92\n", "14568 4827 94\n", "14569 790 83\n", "14570 10896 91\n", "\n", "[14571 rows x 2 columns]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT winery_id, MAX(points) as maxpoints\n", "FROM reviews\n", "GROUP BY winery_id;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that this last table already existed in the database with the name \"bestscores\". We would be able to join bestscores and reviews, then filter the rows to only those wines whose scores are equal to the maximum scores achieved by the winery with the following code:\n", "```\n", "SELECT r.title, r.variety, r.points, r.price FROM reviews r\n", "INNER JOIN bestscores b\n", " ON r.winery_id = b.winery_id\n", "WHERE r.points = b.maxpoints;\n", "```\n", "But because we do not have a table named \"bestscores\", we can replace the reference to this table with the subquery that generates this table:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlevarietypointsprice
0Marqués de Terán 2009 Selección Especial (Rioja)Tempranillo8624.0
1Alessandro Veglio 2011 Gattera (Barolo)Nebbiolo87NaN
2Pingao 2013 RiojaTempranillo8713.0
3Chateau Walla Walla 2008 Syrah (Walla Walla Va...Syrah8740.0
4Sweet Valley 2008 Cabernet Sauvignon (Walla Wa...Cabernet Sauvignon8735.0
...............
20643Kicker Cane 2014 Cabernet Sauvignon (Alexander...Cabernet Sauvignon8820.0
20644Tenuta Grimani 2015 Farinaldo (Soave)Garganega88NaN
20645Vin Vault NV Cabernet Sauvignon (California)Cabernet Sauvignon8820.0
20646Dachshund NV Bubbles Sparkling (Germany)Sparkling Blend8817.0
20647Domaine Guillot-Broux 2009 Beaumont (Mâcon-Cr...Pinot Noir88NaN
\n", "

20648 rows × 4 columns

\n", "
" ], "text/plain": [ " title variety \\\n", "0 Marqués de Terán 2009 Selección Especial (Rioja) Tempranillo \n", "1 Alessandro Veglio 2011 Gattera (Barolo) Nebbiolo \n", "2 Pingao 2013 Rioja Tempranillo \n", "3 Chateau Walla Walla 2008 Syrah (Walla Walla Va... Syrah \n", "4 Sweet Valley 2008 Cabernet Sauvignon (Walla Wa... Cabernet Sauvignon \n", "... ... ... \n", "20643 Kicker Cane 2014 Cabernet Sauvignon (Alexander... Cabernet Sauvignon \n", "20644 Tenuta Grimani 2015 Farinaldo (Soave) Garganega \n", "20645 Vin Vault NV Cabernet Sauvignon (California) Cabernet Sauvignon \n", "20646 Dachshund NV Bubbles Sparkling (Germany) Sparkling Blend \n", "20647 Domaine Guillot-Broux 2009 Beaumont (Mâcon-Cr... Pinot Noir \n", "\n", " points price \n", "0 86 24.0 \n", "1 87 NaN \n", "2 87 13.0 \n", "3 87 40.0 \n", "4 87 35.0 \n", "... ... ... \n", "20643 88 20.0 \n", "20644 88 NaN \n", "20645 88 20.0 \n", "20646 88 17.0 \n", "20647 88 NaN \n", "\n", "[20648 rows x 4 columns]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = \"\"\"\n", "SELECT r.title, r.variety, r.points, r.price FROM reviews r\n", "INNER JOIN (\n", " SELECT winery_id, MAX(points) as maxpoints\n", " FROM reviews\n", " GROUP BY winery_id) b\n", " ON r.winery_id = b.winery_id\n", "WHERE r.points = b.maxpoints;\n", "\"\"\"\n", "pd.read_sql_query(myquery, con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, once we are finished working with the PostgreSQL server, we close it:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "dbserver.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Security\n", "One criticism of SQL is that it can be manipulated to give unauthorized and hostile users access to perform CRUD operations on the data. 
This kind of [attack](https://en.wikipedia.org/wiki/SQL_injection) is called an **SQL injection attack**, and the best illustration of this attack comes from an [XKCD](https://xkcd.com) web comic:\n", "\n", "\n", "\n", "The following discussion borrows heavily from [this blog's](https://explainxkcd.com/wiki/index.php/327:_Exploits_of_a_Mom) explanation of this XKCD comic. \n", "\n", "An SQL injection attack starts with an entry field on the user interface of an application that works with a database, like a field where users write their name. The application reads the name and creates an SQL INSERT operation to create a new record in the database with the name. If I were to enter Jonathan into the name field, the app should generate an SQL command that looks like this:\n", "```\n", "INSERT INTO Students (firstname) VALUES ('Jonathan');\n", "```\n", "This command specifically places the value \"Jonathan\" into the `firstname` attribute of the `Students` entity. In SQL, different commands are separated by semicolons, so if I wanted to issue two SQL commands I could type:\n", "```\n", "INSERT INTO Students (firstname) VALUES ('Jonathan'); INSERT INTO Students (lastname) VALUES ('Kropko');\n", "```\n", "An SQL injection attack works by writing SQL code in a field that is designed to collect data to input into a database. 
So if I type my name as `Jonathan'); DROP TABLE Students; --;`, then the SQL create operation becomes\n", "```\n", "INSERT INTO Students (firstname) VALUES ('Jonathan'); DROP TABLE Students; --;');\n", "```\n", "This line consists of three parts:\n", "* `INSERT INTO Students (firstname) VALUES ('Jonathan');` which inputs \"Jonathan\" into the database,\n", "* `DROP TABLE Students;` which deletes the entire `Students` table, and\n", "* `--;');`: the `--` symbol begins an SQL comment, and tells the parser to ignore the remainder of the code, which avoids a parsing error.\n", "\n", "So just by inserting specific code into a seemingly innocuous field, like name, I can delete the entire `Students` entity in the database.\n", "\n", "There are two ways to combat SQL injection attacks. First, it is possible to \"sanitize\" database inputs by using code that automatically places an escape character, such as a backslash, before a single quote. The [escape character](https://en.wikipedia.org/wiki/Escape_character) makes the quote part of the input string and prevents it from being read as the end of the input string. Another approach is to use [prepared statements](https://en.wikipedia.org/wiki/Prepared_statement) when converting user-entered data into an SQL query. A prepared statement uses placeholders to stand in for the user-supplied data, and treats the data like input into a function: the query text and the user data are passed separately, so the data are never parsed as SQL code, which prevents SQL injection. For example, instead of inserting the name directly into the query, the database manager can construct the query in Python code (where a database cursor exists and is named `curs`) like this:\n", "```\n", "cmd = \"INSERT INTO Students (firstname) VALUES (%s)\"\n", "curs.execute(cmd, (name,))\n", "```\n", "In the Python drivers for MySQL and PostgreSQL, `%s` stands in for a parameter to be passed into the query (in SQLite, the stand-in symbol is `?` instead of `%s`). 
Constructing a query in this way prevents SQL injection attacks. More information about formatting secure SQL code is available at https://bobby-tables.com/, named in honor of this XKCD comic.\n", "\n", "Data scientists mostly issue read operations, and it is unethical to attack a database in this way. If you are testing whether a database is secured against SQL injection attacks, don't issue any `DROP` commands: other commands like `SELECT` will reveal the vulnerability without making changes to the database. If you are building a database that is connected to an interface for users to enter data, please be aware of the SQL injection vulnerability and use prepared statements to guard against it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MongoDB Queries\n", "SQL is a universal language for issuing queries to relational databases, whether the database is managed by SQLite, MySQL, PostgreSQL, or another RDBMS. For NoSQL databases, however, there is no universal query language. Every DBMS has its own query language, and will provide a guide for learning that language. Some of these guides include the ones for key-value stores in [Redis](https://redis.io/commands) and wide column stores in [Cassandra](https://cassandra.apache.org/doc/latest/cql/index.html). Neo4j has developed a programming language called [Cypher](https://neo4j.com/developer/cypher-query-language/) that is explicitly for issuing queries to graph databases. Of all of these query protocols, the language used by [MongoDB](https://docs.mongodb.com/guides/server/read_queries/) for issuing queries to document stores is one of the most universal because it works entirely with JSON: queries are written in JSON format and the output is organized in JSON format. All of these query languages include methods for all of the CRUD operations.\n", "\n", "The most important difference between relational and NoSQL databases is the rigidity of the schema that organizes the data. 
The advantage of the strict organization of a relational database, as illustrated in an ER diagram, is that the data extracted from the database using an SQL query will be clean and nearly ready to be analyzed. The disadvantage is that relational databases have schemas that are hard to change once they've been created and populated with data. SQL also, despite the best intentions of its originators, can be very difficult for people to use for some tasks. For extremely large datasets with many tables, it can be extremely difficult to keep track of what data exists in which table. In contrast, NoSQL databases generally have flexible schemas that can be changed easily and can vary even from record to record. There are no rules, like the normalization rules, that require that the data be split into different tables, so there is no need for visual maps like ER diagrams. Also, because all of the data for one record exists in the same JSON dictionary, it is easy to use remote, distributed storage to store all of the records. The disadvantage of NoSQL databases is that the data are rarely ready for analysis after a query. It's a buy-now-pay-later situation: the price we pay for the convenience of NoSQL storage and organization is that the output requires more work to use.\n", "\n", "Some concepts that are crucial to SQL are not relevant to NoSQL. There are **no joins** in a document store because all the data for a record exist in the same JSON code. As such, we don't have to worry about writing joins within a NoSQL query. 
NoSQL queries in general focus narrowly on the CRUD operations, although MongoDB provides some advanced functionality for searching for patterns within text and ranking documents based on their relevance to given search terms.\n", "\n", "For the following examples, I will use the document store database that we created in module 6, containing the same data on wine reviews that we practiced with above, only in JSON format. First I load the `pymongo` package and the `dumps()` and `loads()` functions from the `json_util` module of the `bson` package:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "import pymongo\n", "from bson.json_util import dumps, loads" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The wine reviews database is stored as a collection `winecollection` within the `winedb` database on my local machine. I load it with the following code:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "myclient = pymongo.MongoClient(\"mongodb://localhost/\")\n", "winedb = myclient[\"winedb\"]\n", "winecollection = winedb[\"winecollection\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will discuss more advanced read techniques below, but to see one record, we can issue a query using JSON code and we can see the output in JSON format. To see all of the data for the \"Nicosia 2013 Vulkà Bianco\", we search for the record based on the title of this wine with the following code:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'_id': ObjectId('5ed80dbca25fcf746119e3aa'), 'wine_id': 0, 'country': 'Italy', 'description': \"Aromas include tropical fruit, broom, brimstone and dried herb. 
The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.\", 'points': 87, 'price': None, 'province': 'Sicily & Sardinia', 'region': 'Etna', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'title': 'Nicosia 2013 Vulkà Bianco (Etna)', 'variety': 'White Blend', 'winery': 'Nicosia'}\n" ] } ], "source": [ "myquery = { 'title': 'Nicosia 2013 Vulkà Bianco (Etna)'}\n", "mywine = winecollection.find(myquery) \n", "for x in mywine:\n", " print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that all of the data for this wine exists in this JSON dictionary, including data from the reviews, locations, wineries, and tasters tables in the PostgreSQL database. When we created this MongoDB database, the DBMS automatically created a unique ID for each record designated with the key `_id`. \n", "\n", "We can now use the methods in `pymongo` for creating, reading, updating, and deleting records and we will apply these methods to the `winecollection` variable that accesses the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating and Deleting Records\n", "Did you know that former NBA all-star Dwyane Wade has a winery? It's called [D Wade Cellars](https://dwadecellars.com/) and it is based in the Napa Valley in California. Let's add the [2016 Napa Valley Three By Wade Red Blend](https://dwadecellars.vinespring.com/purchase/detail?item=2016-napa-valley-three-by-wade-red-blend) into the database. 
The first step is to express all of the data we want to associate with a new record in JSON format:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "dwadewine = {'title': '2016 Napa Valley Three By Wade Red Blend', \n", "'description': \"This wine goes great with dinner just like Dwyane Wade goes great with LeBron James or Shaq.\", \n", "'taster_name': 'Jonathan Kropko', \n", "'taster_twitter_handle': '@jmk5131', \n", "'price': '35', \n", "'variety': 'Red Blend', \n", "'location':{\n", " 'region_1': 'Napa Valley', \n", " 'region_2': None, \n", " 'province': 'California', \n", " 'country': 'U.S.', \n", " 'winery': 'D Wade Cellars'}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In creating this JSON record, I tried to follow the standards that exist elsewhere in the data by using the same feature names. I departed from the format of other records in two ways. First, I omitted the points and designation features. Second, I placed all the information about the location and name of the winery under the \"location\" key, which induces some nesting structure.\n", "\n", "To add this one record to the database, I use the `.insert_one()` method on the `winecollection` collection, with the following code:\n", "```\n", "winecollection.insert_one(dwadewine)\n", "```\n", "MongoDB only requires that the `_id` field be unique, so the `insert_one()` method does not check whether a record with the same content already exists in the data. (The `bypass_document_validation=True` argument skips any schema validation rules defined on the collection; it is unrelated to duplicates.) The method does add an `_id` key to the `dwadewine` dictionary itself, so inserting the same dictionary object a second time throws a duplicate key error, while re-creating the dictionary and inserting it again places a duplicate record into the database. For the purposes of this notebook, I rerun these cells many times while writing, and I don't want to place many duplicate records into the database. Instead, I can delete the record if it already exists. 
The code\n", "```\n", "winecollection.count_documents({'title': '2016 Napa Valley Three By Wade Red Blend'})\n", "```\n", "generates a count of the records of wines in the database that have this title. If there are any existing records, I can delete all of these records with the `.delete_many()` method, in which the argument is a JSON with enough fields specified to exactly match the records we want to delete:\n", "```\n", "winecollection.delete_many({'title': '2016 Napa Valley Three By Wade Red Blend'})\n", "```\n", "In contrast, the `.delete_one()` method will only delete the first record it finds that matches the query. If there are no documents that match the query, the `.delete_many()` and `.delete_one()` methods will both still process the query without error, but will not change anything in the database.\n", "\n", "We first delete any records of wines with the title \"2016 Napa Valley Three By Wade Red Blend\" with `.delete_many()`, then we insert the entire record of this wine with `.insert_one()`:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "winecollection.delete_many({'title': '2016 Napa Valley Three By Wade Red Blend'})\n", "winecollection.insert_one(dwadewine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that this record exists in the database, we can find this record by any of the fields associated with the record, such as the title of the wine:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'_id': ObjectId('5edd56e5b4e58ce3841e5dea'), 'title': '2016 Napa Valley Three By Wade Red Blend', 'description': 'This wine goes great with dinner just like Dwyane Wade goes great with LeBron James or Shaq.', 'taster_name': 'Jonathan Kropko', 'taster_twitter_handle': 
'@jmk5131', 'price': '35', 'variety': 'Red Blend', 'location': {'region_1': 'Napa Valley', 'region_2': None, 'province': 'California', 'country': 'U.S.', 'winery': 'D Wade Cellars'}}\n" ] } ], "source": [ "myquery = {'title': '2016 Napa Valley Three By Wade Red Blend'}\n", "mywine = winecollection.find(myquery) \n", "for x in mywine:\n", " print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that MongoDB automatically generates a unique ID value for this document and includes it under the `_id` field in the JSON output.\n", "\n", "Wikipedia lists [93 other celebrities](https://en.wikipedia.org/wiki/List_of_celebrities_who_own_wineries_and_vineyards) who own wineries and vineyards, including [Antonio Banderas](https://www.decanter.com/wine-news/antonio-banderas-32255/), [Drew Barrymore](https://thewinesiren.com/drew-barrymore-vintner/), and [Lil Jon](http://www.today.com/id/23945035/ns/today-today_entertainment/t/rapper-lil-jon-starts-his-own-wine-label/#.XtV1BZp7l24). If we want to add more than one record to the wine collection database, we need to create a list of individual JSON dictionaries with code that looks like\n", "```\n", "newrecords = [{JSON dictionary 1}, {JSON dictionary 2}, {JSON dictionary 3}]\n", "```\n", "In this case, I can create entries for Antonio Banderas, Drew Barrymore, and Lil Jon's wines and store them in one list:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "newwines = [{'title': 'Anta Banderas A 10 2008', \n", " 'description': \"This wine will make you speak differently. 
Maybe not with a charming Spanish accent, but you might think you sound that way.\", \n", " 'taster_name': 'Jonathan Kropko', \n", " 'taster_twitter_handle': '@jmk5131', \n", " 'price': '40.99', \n", " 'variety': 'Red Blend', \n", " 'location':{\n", " 'region_1': 'Ribera del Duoro', \n", " 'region_2': None, \n", " 'province': 'Valladolid', \n", " 'country': 'Spain', \n", " 'winery': 'Anta Banderas'}},\n", " {'title': 'Barrymore Rose 2013', \n", " 'description': \"Someone drank my entire bottle of wine!\", \n", " 'taster_name': 'Jonathan Kropko', \n", " 'taster_twitter_handle': '@jmk5131', \n", " 'price': '14.99', \n", " 'variety': 'Rose', \n", " 'location':{\n", " 'region_1': 'Monterey', \n", " 'region_2': None, \n", " 'province': 'California', \n", " 'country': 'U.S.', \n", " 'winery': 'Barrymore Vineyard'}},\n", " {'title': '2006 Little Jonathan Winery Cabernet Sauvignon', \n", " 'description': \"This upscale crunk juice is OOOKAAAAAAY.\", \n", " 'taster_name': 'Jonathan Kropko', \n", " 'taster_twitter_handle': '@jmk5131', \n", " 'variety': 'Cabernet Sauvignon', \n", " 'location':{\n", " 'region_1': 'Central Coast', \n", " 'region_2': 'Paso Robles', \n", " 'province': 'California', \n", " 'country': 'U.S.', \n", " 'winery': 'Little Jonathan Winery'}}]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To add these three records to the database with one line of code, we use the `.insert_many()` method. 
To avoid duplicates, we first delete any records of wines titled \"Anta Banderas A 10 2008\", \"Barrymore Rose 2013\", or \"2006 Little Jonathan Winery Cabernet Sauvignon\":" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "winecollection.delete_many({'title': 'Anta Banderas A 10 2008'})\n", "winecollection.delete_many({'title': 'Barrymore Rose 2013'})\n", "winecollection.delete_many({'title': '2006 Little Jonathan Winery Cabernet Sauvignon'})\n", "winecollection.insert_many(newwines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading Data and Selecting Records\n", "To read all of the records in a MongoDB collection, use the `.find()` method and pass an empty JSON dictionary to this method. For the wine reviews collection, we can query the entire collection by typing" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "myquery = winecollection.find({})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the query output is accessed through a cursor stored in the variable `myquery`. The data are not displayed automatically. To see the data in JSON format, we can employ the `print()` function on elements of the cursor. To see the first element:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'_id': ObjectId('5ed80dbca25fcf746119e3aa'), 'wine_id': 0, 'country': 'Italy', 'description': \"Aromas include tropical fruit, broom, brimstone and dried herb. 
The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.\", 'points': 87, 'price': None, 'province': 'Sicily & Sardinia', 'region': 'Etna', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'title': 'Nicosia 2013 Vulkà Bianco (Etna)', 'variety': 'White Blend', 'winery': 'Nicosia'}\n" ] } ], "source": [ "print(myquery[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And to see more elements, we can use a loop. Here's code to view the first three wines:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'_id': ObjectId('5ed80dbca25fcf746119e3aa'), 'wine_id': 0, 'country': 'Italy', 'description': \"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.\", 'points': 87, 'price': None, 'province': 'Sicily & Sardinia', 'region': 'Etna', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'title': 'Nicosia 2013 Vulkà Bianco (Etna)', 'variety': 'White Blend', 'winery': 'Nicosia'}\n", "{'_id': ObjectId('5ed80dcca25fcf746119e3ab'), 'wine_id': 1, 'country': 'Portugal', 'description': \"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.\", 'points': 87, 'price': 15.0, 'province': 'Douro', 'region': None, 'taster_name': 'Roger Voss', 'taster_twitter_handle': '@vossroger', 'title': 'Quinta dos Avidagos 2011 Avidagos Red (Douro)', 'variety': 'Portuguese Red', 'winery': 'Quinta dos Avidagos'}\n", "{'_id': ObjectId('5ed80dcca25fcf746119e3ac'), 'wine_id': 2, 'country': 'US', 'description': 'Tart and snappy, the flavors of lime flesh and rind dominate. 
Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.', 'points': 87, 'price': 14.0, 'province': 'Oregon', 'region': 'Willamette Valley', 'taster_name': 'Paul Gregutt', 'taster_twitter_handle': '@paulgwine\\xa0', 'title': 'Rainstorm 2013 Pinot Gris (Willamette Valley)', 'variety': 'Pinot Gris', 'winery': 'Rainstorm'}\n" ] } ], "source": [ "for i in myquery[0:3]:\n", " print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Displaying the query output data as a list of JSON dictionaries, however, is not the most useful way to work with the data. We need a way to put these data into a dataframe. For that we can use the `dumps()` and `loads()` functions from the `bson` library. These functions work like the `dumps()` and `loads()` functions from the `json` library, but they can also handle the MongoDB-specific data types in these records, such as the `ObjectId` stored under `_id`. To query all of the data and to place all of it into a dataframe, we pass the query output to `dumps()`, which converts the query output to plain text. Next we pass this text to `loads()`, which registers the text as a list of JSON dictionaries. Finally we use this list as the argument of `pd.DataFrame.from_records()` to convert the output to a dataframe. For the wine collection, this code is:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idwine_idcountrydescriptionpointspriceprovinceregiontaster_nametaster_twitter_handletitlevarietywinerylocation
05ed80dbca25fcf746119e3aa0.0ItalyAromas include tropical fruit, broom, brimston...87.0NoneSicily & SardiniaEtnaKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White BlendNicosiaNaN
15ed80dcca25fcf746119e3ab1.0PortugalThis is ripe and fruity, a wine that is smooth...87.015DouroNoneRoger Voss@vossrogerQuinta dos Avidagos 2011 Avidagos Red (Douro)Portuguese RedQuinta dos AvidagosNaN
25ed80dcca25fcf746119e3ac2.0USTart and snappy, the flavors of lime flesh and...87.014OregonWillamette ValleyPaul Gregutt@paulgwineRainstorm 2013 Pinot Gris (Willamette Valley)Pinot GrisRainstormNaN
35ed80dcca25fcf746119e3ad3.0USPineapple rind, lemon pith and orange blossom ...87.013MichiganLake Michigan ShoreAlexander PeartreeNoneSt. Julian 2013 Reserve Late Harvest Riesling ...RieslingSt. JulianNaN
45ed80dcca25fcf746119e3ae4.0USMuch like the regular bottling from 2012, this...87.065OregonWillamette ValleyPaul Gregutt@paulgwineSweet Cheeks 2012 Vintner's Reserve Wild Child...Pinot NoirSweet CheeksNaN
.............................................
1037265ed80dcfa25fcf74611b78d8129970.0FranceBig, rich and off-dry, this is powered by inte...90.021AlsaceAlsaceRoger Voss@vossrogerDomaine Schoffit 2012 Lieu-dit Harth Cuvée Car...GewürztraminerDomaine SchoffitNaN
1037275edd56e5b4e58ce3841e5deaNaNNaNThis wine goes great with dinner just like Dwy...NaN35NaNNaNJonathan Kropko@jmk51312016 Napa Valley Three By Wade Red BlendRed BlendNaN{'region_1': 'Napa Valley', 'region_2': None, ...
1037285edd56e6b4e58ce3841e5debNaNNaNThis wine will make you speak differently. May...NaN40.99NaNNaNJonathan Kropko@jmk5131Anta Banderas A 10 2008Red BlendNaN{'region_1': 'Ribera del Duoro', 'region_2': N...
1037295edd56e6b4e58ce3841e5decNaNNaNSomeone drank my entire bottle of wine!NaN14.99NaNNaNJonathan Kropko@jmk5131Barrymore Rose 2013RoseNaN{'region_1': 'Monterey', 'region_2': None, 'pr...
1037305edd56e6b4e58ce3841e5dedNaNNaNThis upscale crunk juice is OOOKAAAAAAY.NaNNaNNaNNaNJonathan Kropko@jmk51312006 Little Jonathan Winery Cabernet SauvignonCabernet SauvignonNaN{'region_1': 'Central Coast', 'region_2': 'Pas...
\n", "

103731 rows × 14 columns

\n", "
" ], "text/plain": [ " _id wine_id country \\\n", "0 5ed80dbca25fcf746119e3aa 0.0 Italy \n", "1 5ed80dcca25fcf746119e3ab 1.0 Portugal \n", "2 5ed80dcca25fcf746119e3ac 2.0 US \n", "3 5ed80dcca25fcf746119e3ad 3.0 US \n", "4 5ed80dcca25fcf746119e3ae 4.0 US \n", "... ... ... ... \n", "103726 5ed80dcfa25fcf74611b78d8 129970.0 France \n", "103727 5edd56e5b4e58ce3841e5dea NaN NaN \n", "103728 5edd56e6b4e58ce3841e5deb NaN NaN \n", "103729 5edd56e6b4e58ce3841e5dec NaN NaN \n", "103730 5edd56e6b4e58ce3841e5ded NaN NaN \n", "\n", " description points price \\\n", "0 Aromas include tropical fruit, broom, brimston... 87.0 None \n", "1 This is ripe and fruity, a wine that is smooth... 87.0 15 \n", "2 Tart and snappy, the flavors of lime flesh and... 87.0 14 \n", "3 Pineapple rind, lemon pith and orange blossom ... 87.0 13 \n", "4 Much like the regular bottling from 2012, this... 87.0 65 \n", "... ... ... ... \n", "103726 Big, rich and off-dry, this is powered by inte... 90.0 21 \n", "103727 This wine goes great with dinner just like Dwy... NaN 35 \n", "103728 This wine will make you speak differently. May... NaN 40.99 \n", "103729 Someone drank my entire bottle of wine! NaN 14.99 \n", "103730 This upscale crunk juice is OOOKAAAAAAY. NaN NaN \n", "\n", " province region taster_name \\\n", "0 Sicily & Sardinia Etna Kerin O’Keefe \n", "1 Douro None Roger Voss \n", "2 Oregon Willamette Valley Paul Gregutt \n", "3 Michigan Lake Michigan Shore Alexander Peartree \n", "4 Oregon Willamette Valley Paul Gregutt \n", "... ... ... ... \n", "103726 Alsace Alsace Roger Voss \n", "103727 NaN NaN Jonathan Kropko \n", "103728 NaN NaN Jonathan Kropko \n", "103729 NaN NaN Jonathan Kropko \n", "103730 NaN NaN Jonathan Kropko \n", "\n", " taster_twitter_handle \\\n", "0 @kerinokeefe \n", "1 @vossroger \n", "2 @paulgwine  \n", "3 None \n", "4 @paulgwine  \n", "... ... 
\n", "103726 @vossroger \n", "103727 @jmk5131 \n", "103728 @jmk5131 \n", "103729 @jmk5131 \n", "103730 @jmk5131 \n", "\n", " title variety \\\n", "0 Nicosia 2013 Vulkà Bianco (Etna) White Blend \n", "1 Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red \n", "2 Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris \n", "3 St. Julian 2013 Reserve Late Harvest Riesling ... Riesling \n", "4 Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir \n", "... ... ... \n", "103726 Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer \n", "103727 2016 Napa Valley Three By Wade Red Blend Red Blend \n", "103728 Anta Banderas A 10 2008 Red Blend \n", "103729 Barrymore Rose 2013 Rose \n", "103730 2006 Little Jonathan Winery Cabernet Sauvignon Cabernet Sauvignon \n", "\n", " winery location \n", "0 Nicosia NaN \n", "1 Quinta dos Avidagos NaN \n", "2 Rainstorm NaN \n", "3 St. Julian NaN \n", "4 Sweet Cheeks NaN \n", "... ... ... \n", "103726 Domaine Schoffit NaN \n", "103727 NaN {'region_1': 'Napa Valley', 'region_2': None, ... \n", "103728 NaN {'region_1': 'Ribera del Duoro', 'region_2': N... \n", "103729 NaN {'region_1': 'Monterey', 'region_2': None, 'pr... \n", "103730 NaN {'region_1': 'Central Coast', 'region_2': 'Pas... \n", "\n", "[103731 rows x 14 columns]" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = winecollection.find({})\n", "wine_text = dumps(myquery)\n", "wine_records = loads(wine_text)\n", "wine_df = pd.DataFrame.from_records(wine_records)\n", "wine_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like SQL, read operations in MongoDB can filter records based on logical conditions. Unlike SQL, MongoDB uses different symbols for the common logical operators, and these symbols need to be listed within the JSON formatted query. A couple of the operators are implicit in the JSON syntax. 
To search on an equality condition, the general syntax is\n", "```\n", "{'key' : value}\n", "```\n", "For example, to find all of the wines in the database that are from Virginia, we can use the following query:\n", "```\n", "{'province' : 'Virginia'}\n", "```\n", "The other implicit operator is \"and\", which is expressed simply by including more than one key-value pair within the syntax. To specify that a feature `key1` is equal to `value1` AND that `key2` is equal to `value2`, type:\n", "```\n", "{'key1' : value1,\n", " 'key2' : value2}\n", "```\n", "For example, to filter the data to Pinot Noir wines from Virginia, we can type\n", "```\n", "{'variety' : 'Pinot Noir',\n", " 'province' : 'Virginia'}\n", "```\n", "For all other logical operators, MongoDB uses special syntax, described below. The general syntax for using an operator within a MongoDB query is\n", "```\n", "{'key' : {'$operator' : value } }\n", "```\n", "The operators are listed in the following table:\n", "\n", "| Operator | Syntax | Example query | Example code |\n", "|-----------------------------------------------------------------------|-----------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|\n", "| Equal to | implicit | All wines with scores of 100 | `{'points': 100}` |\n", "| Greater than | `'$gt'` | All wines that are more expensive than \\\\$30 | `{'price': {'$gt': 30}}` |\n", "| Greater than or equal to | `'$gte'` | All wines with scores of 95 or higher | `{'points': {'$gte': 95}}` |\n", "| Less than | `'$lt'` | All wines that are cheaper than \\\\$20 | `{'price': {'$lt': 20}}` |\n", "| Less than or equal to | `'$lte'` | All wines with scores of 85 or lower | `{'points': {'$lte': 85}}` |\n", "| Not equal | `'$ne'` | Wines that are not red 
blends | `{'variety': {'$ne': 'Red Blend'}}` |\n", "| And | implicit | All wines with scores of 100 and prices of \\\\$20 or less | `{'points': 100, 'price': {'$lte': 20}}` |\n", "| Or | `'$or': [{condition1}, {condition2}]` | All wines with scores of 100, or prices of \\\\$20 or less | `{'$or': [{'points': 100}, {'price': {'$lte': 20}}]}` |\n", "| Exists in a set | `'$in': [value1, value2, ...]` | All wines from Virginia, Maryland, or North Carolina | `{'province': {'$in': ['Virginia', 'Maryland', 'North Carolina']}}` |\n", "| Not in a set | `'$nin'` | All wines except those from Virginia, Maryland, and North Carolina | `{'province': {'$nin': ['Virginia', 'Maryland', 'North Carolina']}}` |\n", "| Use logical conditions that compare two or more keys | `'$expr': {expression}` | All wines whose price is greater than their score | `{'$expr': {'$gt': ['$price', '$points']}}` |\n", "| Logical negation (mainly useful with `$regex`) | `'$not'` | All wines whose descriptions do not contain the word \"chocolate\", treating capital and lower-case letters the same | `{'description': {'$not': {'$regex': 'chocolate', '$options': 'i'}}}` |\n", "\n", "To quickly see the data that is output by queries that use these operators, I write a function that takes a collection and a JSON dictionary as inputs, and outputs a `pandas` dataframe:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "def mongo_read_query(col, q):\n", " qtext = dumps(col.find(q))\n", " qrec = loads(qtext)\n", " qdf = pd.DataFrame.from_records(qrec)\n", " return qdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see all wines with a score of 100:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " _id wine_id country \\\n", "0 5ed80dcca25fcf746119e498 345 Australia \n", "1 5ed80dcea25fcf74611a5511 36528 France \n", "2 5ed80dcea25fcf74611a66df 42197 Portugal \n", "3 5ed80dcea25fcf74611a7214 45781 Italy \n", "4 5ed80dcea25fcf74611a9877 58352 France \n", "5 5ed80dcfa25fcf74611afacd 89728 France \n", "6 5ed80dcfa25fcf74611aface 89729 France \n", "7 5ed80dcfa25fcf74611b3fc4 111753 France \n", "8 5ed80dcfa25fcf74611b3fc5 111755 France \n", "9 5ed80dcfa25fcf74611b3fc6 111756 France \n", "10 5ed80dcfa25fcf74611b4648 113929 US \n", "11 5ed80dcfa25fcf74611b499d 114972 Portugal \n", "12 5ed80dcfa25fcf74611b6300 122935 France \n", "13 5ed80dcfa25fcf74611b64d4 123545 US \n", "\n", " description points price \\\n", "0 This wine contains some material over 100 year... 100 350.0 \n", "1 This is a fabulous wine from the greatest Cham... 100 259.0 \n", "2 This is the latest release of what has long be... 100 450.0 \n", "3 This gorgeous, fragrant wine opens with classi... 100 550.0 \n", "4 This is a magnificently solid wine, initially ... 100 150.0 \n", "5 This latest incarnation of the famous brand is... 100 250.0 \n", "6 This new release from a great vintage for Char... 100 617.0 \n", "7 Almost black in color, this stunning wine is g... 100 1500.0 \n", "8 This is the finest Cheval Blanc for many years... 100 1500.0 \n", "9 A hugely powerful wine, full of dark, brooding... 100 359.0 \n", "10 In 2005 Charles Smith introduced three high-en... 100 80.0 \n", "11 A powerful and ripe wine, strongly influenced ... 100 650.0 \n", "12 Full of ripe fruit, opulent and concentrated, ... 100 848.0 \n", "13 Initially a rather subdued Frog; as if it has ... 
100 80.0 \n", "\n", " province region taster_name taster_twitter_handle \\\n", "0 Victoria Rutherglen Joe Czerwinski @JoeCz \n", "1 Champagne Champagne Roger Voss @vossroger \n", "2 Douro None Roger Voss @vossroger \n", "3 Tuscany Brunello di Montalcino Kerin O’Keefe @kerinokeefe \n", "4 Bordeaux Saint-Julien Roger Voss @vossroger \n", "5 Champagne Champagne Roger Voss @vossroger \n", "6 Champagne Champagne Roger Voss @vossroger \n", "7 Bordeaux Pauillac Roger Voss @vossroger \n", "8 Bordeaux Saint-Émilion Roger Voss @vossroger \n", "9 Bordeaux Saint-Julien Roger Voss @vossroger \n", "10 Washington Columbia Valley (WA) Paul Gregutt @paulgwine  \n", "11 Port None Roger Voss @vossroger \n", "12 Bordeaux Pessac-Léognan Roger Voss @vossroger \n", "13 Washington Walla Walla Valley (WA) Paul Gregutt @paulgwine  \n", "\n", " title \\\n", "0 Chambers Rosewood Vineyards NV Rare Muscat (Ru... \n", "1 Krug 2002 Brut (Champagne) \n", "2 Casa Ferreirinha 2008 Barca-Velha Red (Douro) \n", "3 Biondi Santi 2010 Riserva (Brunello di Montal... \n", "4 Château Léoville Barton 2010 Saint-Julien \n", "5 Louis Roederer 2008 Cristal Vintage Brut (Cha... \n", "6 Salon 2006 Le Mesnil Blanc de Blancs Brut Char... \n", "7 Château Lafite Rothschild 2010 Pauillac \n", "8 Château Cheval Blanc 2010 Saint-Émilion \n", "9 Château Léoville Las Cases 2010 Saint-Julien \n", "10 Charles Smith 2006 Royal City Syrah (Columbia ... \n", "11 Quinta do Noval 2011 Nacional Vintage (Port) \n", "12 Château Haut-Brion 2014 Pessac-Léognan \n", "13 Cayuse 2008 Bionic Frog Syrah (Walla Walla Val... 
\n", "\n", " variety winery \n", "0 Muscat Chambers Rosewood Vineyards \n", "1 Champagne Blend Krug \n", "2 Portuguese Red Casa Ferreirinha \n", "3 Sangiovese Biondi Santi \n", "4 Bordeaux-style Red Blend Château Léoville Barton \n", "5 Champagne Blend Louis Roederer \n", "6 Chardonnay Salon \n", "7 Bordeaux-style Red Blend Château Lafite Rothschild \n", "8 Bordeaux-style Red Blend Château Cheval Blanc \n", "9 Bordeaux-style Red Blend Château Léoville Las Cases \n", "10 Syrah Charles Smith \n", "11 Port Quinta do Noval \n", "12 Bordeaux-style White Blend Château Haut-Brion \n", "13 Syrah Cayuse " ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = {'points': 100}\n", "mongo_read_query(winecollection, myquery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see wines with a score of 100 and a cost of less than \\\\$100, we can use the `$lt` operator:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " _id wine_id country \\\n", "0 5ed80dcfa25fcf74611b4648 113929 US \n", "1 5ed80dcfa25fcf74611b64d4 123545 US \n", "\n", " description points price \\\n", "0 In 2005 Charles Smith introduced three high-en... 100 80.0 \n", "1 Initially a rather subdued Frog; as if it has ... 100 80.0 \n", "\n", " province region taster_name taster_twitter_handle \\\n", "0 Washington Columbia Valley (WA) Paul Gregutt @paulgwine  \n", "1 Washington Walla Walla Valley (WA) Paul Gregutt @paulgwine  \n", "\n", " title variety winery \n", "0 Charles Smith 2006 Royal City Syrah (Columbia ... Syrah Charles Smith \n", "1 Cayuse 2008 Bionic Frog Syrah (Walla Walla Val... Syrah Cayuse " ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = {'points': 100, 'price': {'$lt': 100}}\n", "mongo_read_query(winecollection, myquery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see wines that are from Ohio or North Carolina, we use the `$in` operator:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " _id wine_id country \\\n", "0 5ed80dcea25fcf74611a0d3f 13402 US \n", "1 5ed80dcea25fcf74611a167b 16270 US \n", "2 5ed80dcea25fcf74611a20a2 19566 US \n", "3 5ed80dcea25fcf74611a23ee 20592 US \n", "4 5ed80dcea25fcf74611a5fc4 39935 US \n", "5 5ed80dcea25fcf74611a686c 42692 US \n", "6 5ed80dcea25fcf74611a7033 45209 US \n", "7 5ed80dcea25fcf74611a74db 46670 US \n", "8 5ed80dcea25fcf74611a7c95 49154 US \n", "9 5ed80dcea25fcf74611a8518 52078 US \n", "10 5ed80dcea25fcf74611a8fe5 55687 US \n", "11 5ed80dcea25fcf74611a8fe9 55693 US \n", "12 5ed80dcfa25fcf74611ac51f 72535 US \n", "13 5ed80dcfa25fcf74611ae168 81619 US \n", "14 5ed80dcfa25fcf74611af839 88841 US \n", "15 5ed80dcfa25fcf74611b08bb 94267 US \n", "16 5ed80dcfa25fcf74611b08be 94275 US \n", "17 5ed80dcfa25fcf74611b13ff 97836 US \n", "18 5ed80dcfa25fcf74611b1c1f 100445 US \n", "19 5ed80dcfa25fcf74611b1c23 100452 US \n", "20 5ed80dcfa25fcf74611b29f1 104722 US \n", "21 5ed80dcfa25fcf74611b4fb6 116912 US \n", "22 5ed80dcfa25fcf74611b55ee 118895 US \n", "\n", " description points price \\\n", "0 Clove and pepper spice the dark red cherry aro... 86 17.0 \n", "1 The nose shows an aroma of blackberry that is ... 86 17.0 \n", "2 Fruits, flowers and spice should lead the nose... 80 15.0 \n", "3 Stewed blackberries and muddled cherries perva... 86 23.0 \n", "4 Friendly, appealing flavors fo pear, lychee, a... 84 12.0 \n", "5 Friendly, appealing flavors fo pear, lychee, a... 84 12.0 \n", "6 Freshly squeezed lemons, lime and pretty white... 84 16.0 \n", "7 Black fruit aromas show over toasted vanilla a... 85 29.0 \n", "8 Lean and racy, with limes and tart green apple... 86 10.0 \n", "9 Charred oak, green herbs and vanilla spice not... 82 18.0 \n", "10 Mostly Sangiovese with a small dose of Petit V... 84 17.0 \n", "11 Aromas of toasted oak, green leaves, vanilla a... 84 25.0 \n", "12 Who knew Ohio made such tasty Chardonnay? Brig... 87 11.0 \n", "13 Strawberry and raspberry Kool-Aid aromas are s... 
85 21.0 \n", "14 This is a vibrant, energetic Chardonnay that s... 84 17.0 \n", "15 Although the nose offers darker notes of petro... 83 15.0 \n", "16 The nose on this bright red blend from Sanders... 83 18.0 \n", "17 Smoke wafts over pressed apple and lemon notes... 83 13.0 \n", "18 Fresh minerality and dancing floral notes make... 84 11.0 \n", "19 A slightly floral but lively nose is followed ... 84 15.0 \n", "20 There are enticing hints of berries and cream ... 86 15.0 \n", "21 Black cherry aromas are dwarfed by notes of wi... 83 24.0 \n", "22 Bright red fruits achieve a decent amount of r... 84 18.0 \n", "\n", " province region taster_name \\\n", "0 North Carolina Swan Creek Anna Lee C. Iijima \n", "1 North Carolina Yadkin Valley Alexander Peartree \n", "2 Ohio Ohio Susan Kostrzewa \n", "3 North Carolina Swan Creek Alexander Peartree \n", "4 Ohio Grand River Valley Susan Kostrzewa \n", "5 Ohio Grand River Valley Susan Kostrzewa \n", "6 North Carolina Swan Creek Anna Lee C. Iijima \n", "7 North Carolina Swan Creek Alexander Peartree \n", "8 North Carolina North Carolina Joe Czerwinski \n", "9 North Carolina Yadkin Valley Anna Lee C. Iijima \n", "10 North Carolina Swan Creek Anna Lee C. Iijima \n", "11 North Carolina North Carolina Anna Lee C. Iijima \n", "12 Ohio Grand River Valley Anna Lee C. Iijima \n", "13 North Carolina Swan Creek Alexander Peartree \n", "14 Ohio Grand River Valley Susan Kostrzewa \n", "15 Ohio Grand River Valley Anna Lee C. Iijima \n", "16 North Carolina Yadkin Valley Anna Lee C. Iijima \n", "17 North Carolina North Carolina Anna Lee C. Iijima \n", "18 Ohio Grand River Valley Susan Kostrzewa \n", "19 Ohio Grand River Valley Susan Kostrzewa \n", "20 North Carolina North Carolina Anna Lee C. Iijima \n", "21 North Carolina Swan Creek Alexander Peartree \n", "22 North Carolina North Carolina Anna Lee C. 
Iijima \n", "\n", " taster_twitter_handle title \\\n", "0 None Raffaldini 2007 Montepulciano (Swan Creek) \n", "1 None Shadow Springs 2011 Cabernet Franc (Yadkin Val... \n", "2 @suskostrzewa Hermes 2006 Estate Bottled Nebbiolo (Ohio) \n", "3 None Raffaldini 2012 Riserva Sangiovese (Swan Creek) \n", "4 @suskostrzewa Debonné 2008 Reserve Riesling (Grand River Val... \n", "5 @suskostrzewa Debonné 2008 Reserve Riesling (Grand River Val... \n", "6 None Raffaldini 2009 Pinot Grigio (Swan Creek) \n", "7 None Raffaldini 2012 Riserva Montepulciano (Swan Cr... \n", "8 @JoeCz Shelton Vineyards 2002 Riesling (North Carolina) \n", "9 None Divine Llama 2007 In a Heart Beat Red (Yadkin ... \n", "10 None Raffaldini 2007 Riserva Sangiovese (Swan Creek) \n", "11 None RayLen 2006 Eagle's Select Red Wine Red (North... \n", "12 None Debonné 2009 Chardonnay (Grand River Valley) \n", "13 None Laurel Gray 2012 Estate Grown Cabernet Franc (... \n", "14 @suskostrzewa Debonné 2007 Vintner's Selection Chardonnay (G... \n", "15 None Debonné 2008 Lot 807 Reserve Riesling (Grand R... \n", "16 None Sanders Ridge 2008 Big Woods Red (Yadkin Valley) \n", "17 None RayLen 2009 Riesling (North Carolina) \n", "18 @suskostrzewa Debonné 2006 Reserve Riesling (Grand River Val... \n", "19 @suskostrzewa Debonné 2006 Lot 707 Reserve Riesling (Grand R... \n", "20 None Biltmore Estate 2010 Reserve Chardonnay (North... \n", "21 None Raffaldini 2012 Montepulciano (Swan Creek) \n", "22 None RayLen 2008 Category 5 Red Wine Red (North Car... \n", "\n", " variety winery \n", "0 Montepulciano Raffaldini \n", "1 Cabernet Franc Shadow Springs \n", "2 Nebbiolo Hermes \n", "3 Sangiovese Raffaldini \n", "4 Riesling Debonné \n", "5 Riesling Debonné \n", "6 Pinot Grigio Raffaldini \n", "7 Montepulciano Raffaldini \n", "8 Riesling Shelton Vineyards \n", "9 R. 
Blend Divine Llama \n", "10 Sangiovese Raffaldini \n", "11 Bordeaux-style Red Blend RayLen \n", "12 Chardonnay Debonné \n", "13 Cabernet Franc Laurel Gray \n", "14 Chardonnay Debonné \n", "15 Riesling Debonné \n", "16 Bordeaux-style Red Blend Sanders Ridge \n", "17 Riesling RayLen \n", "18 Riesling Debonné \n", "19 Riesling Debonné \n", "20 Chardonnay Biltmore Estate \n", "21 Montepulciano Raffaldini \n", "22 R. Blend RayLen " ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = {'province': {'$in': ['Ohio','North Carolina']}}\n", "mongo_read_query(winecollection, myquery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the \"or\" operator, we can query all wines that are either from Virginia, or have a score of 100:" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " _id wine_id country \\\n", "0 5ed80dcca25fcf746119e3bd 19 US \n", "1 5ed80dcca25fcf746119e3be 20 US \n", "2 5ed80dcca25fcf746119e498 345 Australia \n", "3 5ed80dcea25fcf746119e8d0 1625 US \n", "4 5ed80dcea25fcf746119e8d6 1631 US \n", ".. ... ... ... \n", "388 5ed80dcfa25fcf74611b71b3 127736 US \n", "389 5ed80dcfa25fcf74611b71bd 127746 US \n", "390 5ed80dcfa25fcf74611b7447 128576 US \n", "391 5ed80dcfa25fcf74611b7708 129422 US \n", "392 5ed80dcfa25fcf74611b772a 129459 US \n", "\n", " description points price \\\n", "0 Red fruit aromas pervade on the nose, with cig... 87 32.0 \n", "1 Ripe aromas of dark berries mingle with ample ... 87 23.0 \n", "2 This wine contains some material over 100 year... 100 350.0 \n", "3 Popping with aromas of lychee, rose, geranium ... 85 16.0 \n", "4 Powerful aromas of lychee, mango and peach giv... 85 22.0 \n", ".. ... ... ... \n", "388 Dense with alluring aromas, this wine is full ... 88 30.0 \n", "389 A grape known in Uruguay and Madiran has prove... 88 25.0 \n", "390 Peach and steely lemon aromas carry to a citru... 87 28.0 \n", "391 The nose of this wine is bursting with raspber... 89 32.0 \n", "392 Somehow, winemaker Luca Paschina manages to ma... 87 23.0 \n", "\n", " province region taster_name taster_twitter_handle \\\n", "0 Virginia Virginia Alexander Peartree None \n", "1 Virginia Virginia Alexander Peartree None \n", "2 Victoria Rutherglen Joe Czerwinski @JoeCz \n", "3 Virginia Virginia Carrie Dykes None \n", "4 Virginia Middleburg Carrie Dykes None \n", ".. ... ... ... ... \n", "388 Virginia Virginia Carrie Dykes None \n", "389 Virginia Virginia Carrie Dykes None \n", "390 Virginia Monticello Alexander Peartree None \n", "391 Virginia Monticello Carrie Dykes None \n", "392 Virginia Virginia Carrie Dykes None \n", "\n", " title variety \\\n", "0 Quiévremont 2012 Meritage (Virginia) Meritage \n", "1 Quiévremont 2012 Vin de Maison Red (Virginia) R. 
Blend \n", "2 Chambers Rosewood Vineyards NV Rare Muscat (Ru... Muscat \n", "3 The Williamsburg Winery 2015 A Midsummer Night... White Blend \n", "4 Blue Valley 2015 Muskat Ottonel (Middleburg) Muskat Ottonel \n", ".. ... ... \n", "388 Early Mountain 2015 Elevation Red (Virginia) R. Blend \n", "389 Horton 2014 Tannat (Virginia) Tannat \n", "390 Pollak 2012 Reserve Chardonnay (Monticello) Chardonnay \n", "391 Stinson 2014 Meritage (Monticello) Meritage \n", "392 Barboursville Vineyards 2015 Reserve Vermentin... Vermentino \n", "\n", " winery \n", "0 Quiévremont \n", "1 Quiévremont \n", "2 Chambers Rosewood Vineyards \n", "3 The Williamsburg Winery \n", "4 Blue Valley \n", ".. ... \n", "388 Early Mountain \n", "389 Horton \n", "390 Pollak \n", "391 Stinson \n", "392 Barboursville Vineyards \n", "\n", "[393 rows x 13 columns]" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = {'$or': [{'points': 100}, {'province': 'Virginia'}]}\n", "mongo_read_query(winecollection, myquery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`$expr` takes a comparison operator, such as `$gt` or `$gte`, that turns two keys into a sentence like \"X is greater than or equal to Y\". The operator is followed by a list that specifies first what X in the sentence should be, then what Y should be. Inside `$expr`, the name of a key is prefixed with `$` to refer to the value stored under that key. To search for all wines in which the price is greater than the score, we use the `$expr` operator as follows:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " _id wine_id country \\\n", "0 5ed80dcca25fcf746119e3d4 60.0 US \n", "1 5ed80dcca25fcf746119e42c 168.0 US \n", "2 5ed80dcca25fcf746119e476 284.0 Argentina \n", "3 5ed80dcca25fcf746119e498 345.0 Australia \n", "4 5ed80dcca25fcf746119e499 346.0 Australia \n", "... ... ... ... \n", "3697 5ed80dcfa25fcf74611b78a8 129919.0 US \n", "3698 5ed80dcfa25fcf74611b78b2 129931.0 France \n", "3699 5edd56e5b4e58ce3841e5dea NaN NaN \n", "3700 5edd56e6b4e58ce3841e5deb NaN NaN \n", "3701 5edd56e6b4e58ce3841e5dec NaN NaN \n", "\n", " description points price \\\n", "0 Syrupy and dense, this wine is jammy in plum a... 86.0 100 \n", "1 A fairly elegant expression of the variety, th... 91.0 95 \n", "2 This huge Malbec defines jammy and concentrate... 92.0 215 \n", "3 This wine contains some material over 100 year... 100.0 350 \n", "4 This deep brown wine smells like a damp, mossy... 98.0 350 \n", "... ... ... ... \n", "3697 This ripe, rich, almost decadently thick wine ... 91.0 105 \n", "3698 A powerful, chunky wine, packed with solid tan... 91.0 107 \n", "3699 This wine goes great with dinner just like Dwy... NaN 35 \n", "3700 This wine will make you speak differently. May... NaN 40.99 \n", "3701 Someone drank my entire bottle of wine! NaN 14.99 \n", "\n", " province region taster_name \\\n", "0 California Napa Valley Virginie Boone \n", "1 California Napa Valley Virginie Boone \n", "2 Mendoza Province Perdriel Michael Schachner \n", "3 Victoria Rutherglen Joe Czerwinski \n", "4 Victoria Rutherglen Joe Czerwinski \n", "... ... ... ... \n", "3697 Washington Walla Walla Valley (WA) Paul Gregutt \n", "3698 Burgundy Grands-Echezeaux Roger Voss \n", "3699 NaN NaN Jonathan Kropko \n", "3700 NaN NaN Jonathan Kropko \n", "3701 NaN NaN Jonathan Kropko \n", "\n", " taster_twitter_handle title \\\n", "0 @vboone Okapi 2013 Estate Cabernet Sauvignon (Napa Val... \n", "1 @vboone Duckhorn 2012 Rector Creek Vineyard Merlot (Na... 
\n", "2 @wineschach Viña Cobos 2011 Marchiori Vineyard Block C2 Ma... \n", "3 @JoeCz Chambers Rosewood Vineyards NV Rare Muscat (Ru... \n", "4 @JoeCz Chambers Rosewood Vineyards NV Rare Muscadelle... \n", "... ... ... \n", "3697 @paulgwine  Nicholas Cole Cellars 2004 Reserve Red (Walla ... \n", "3698 @vossroger Henri de Villamont 2005 Grands-Echezeaux \n", "3699 @jmk5131 2016 Napa Valley Three By Wade Red Blend \n", "3700 @jmk5131 Anta Banderas A 10 2008 \n", "3701 @jmk5131 Barrymore Rose 2013 \n", "\n", " variety winery \\\n", "0 Cabernet Sauvignon Okapi \n", "1 Merlot Duckhorn \n", "2 Malbec Viña Cobos \n", "3 Muscat Chambers Rosewood Vineyards \n", "4 Muscadelle Chambers Rosewood Vineyards \n", "... ... ... \n", "3697 R. Blend Nicholas Cole Cellars \n", "3698 Pinot Noir Henri de Villamont \n", "3699 Red Blend NaN \n", "3700 Red Blend NaN \n", "3701 Rose NaN \n", "\n", " location \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "... ... \n", "3697 NaN \n", "3698 NaN \n", "3699 {'region_1': 'Napa Valley', 'region_2': None, ... \n", "3700 {'region_1': 'Ribera del Duoro', 'region_2': N... \n", "3701 {'region_1': 'Monterey', 'region_2': None, 'pr... \n", "\n", "[3702 rows x 14 columns]" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = {'$expr': {'$gt': ['$price', '$points']}}\n", "mongo_read_query(winecollection, myquery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous section, we entered a record for a wine released by former NBA all-star Dwyane Wade, and we purposely included a nested structure in this JSON record. 
The `location` key has subkeys `region_1`, `region_2`, `province`, `country`, and `winery`:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'title': '2016 Napa Valley Three By Wade Red Blend',\n", " 'description': 'This wine goes great with dinner just like Dwyane Wade goes great with LeBron James or Shaq.',\n", " 'taster_name': 'Jonathan Kropko',\n", " 'taster_twitter_handle': '@jmk5131',\n", " 'price': '35',\n", " 'variety': 'Red Blend',\n", " 'location': {'region_1': 'Napa Valley',\n", " 'region_2': None,\n", " 'province': 'California',\n", " 'country': 'U.S.',\n", " 'winery': 'D Wade Cellars'},\n", " '_id': ObjectId('5edd56e5b4e58ce3841e5dea')}" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dwadewine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To query a subrecord, use dot notation of the form `'key.subkey'` to identify the path to the value you need. To query for the winery name \"D Wade Cellars\", we can type:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " _id title \\\n", "0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend \n", "\n", " description taster_name \\\n", "0 This wine goes great with dinner just like Dwy... Jonathan Kropko \n", "\n", " taster_twitter_handle price variety \\\n", "0 @jmk5131 35 Red Blend \n", "\n", " location \n", "0 {'region_1': 'Napa Valley', 'region_2': None, ... " ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myquery = {'location.winery': 'D Wade Cellars'}\n", "mongo_read_query(winecollection, myquery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Features\n", "A read query in MongoDB returns the entire JSON dictionary for every record that matches the query. Sometimes, however, the full record is more information than we can feasibly work with. In some situations each dictionary contains an unmanageable number of features, and we only want to use a couple of them. In other situations a feature might contain values that are so large that we want to avoid dealing with this feature if possible.\n", "\n", "To extract only a selection of the features, add a second JSON clause to the `.find()` method. The general syntax for selecting features is\n", "```\n", "db.collection.find({query}, {'feature': 1})\n", "```\n", "where `{query}` is code, as described above, for extracting a selection of the records, and `{'feature': 1}` instructs MongoDB to include only the field named `feature` in the output. It is possible to list as many keys in this second clause as we want, so `{'feature1': 1, 'feature2': 1}` extracts `feature1` and `feature2`. 
In addition, setting the value to 0 instead of 1 instructs MongoDB to return all features *except* the one specified, as in `{'feature': 0}`.\n", "\n", "In the wine collection, we can extract only the titles of Merlot wines with the following code:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " _id \\\n", "0 5ed80dcca25fcf746119e3c1 \n", "1 5ed80dcca25fcf746119e3cd \n", "2 5ed80dcca25fcf746119e3ef \n", "3 5ed80dcca25fcf746119e42c \n", "4 5ed80dcca25fcf746119e437 \n", "... ... \n", "2094 5ed80dcfa25fcf74611b77f0 \n", "2095 5ed80dcfa25fcf74611b7862 \n", "2096 5ed80dcfa25fcf74611b7865 \n", "2097 5ed80dcfa25fcf74611b786b \n", "2098 5ed80dcfa25fcf74611b7896 \n", "\n", " title \n", "0 Bianchi 2011 Signature Selection Merlot (Paso ... \n", "1 Sundance 2011 Merlot (Maule Valley) \n", "2 Passaggio 2014 Blau Vineyards Merlot (Knights ... \n", "3 Duckhorn 2012 Rector Creek Vineyard Merlot (Na... \n", "4 Viña Bisquertt 2007 Casa La Joya Reserve Merlo... \n", "... ... \n", "2094 Castillo de Monjardin 2009 Deyo Merlot (Navarra) \n", "2095 Bonair 2006 Chateau Puryear Vineyard Merlot (R... \n", "2096 Hyatt 2005 Merlot (Rattlesnake Hills) \n", "2097 Ca' Momi 2013 Reserve Merlot (Carneros) \n", "2098 Psagot 2014 Merlot \n", "\n", "[2099 rows x 2 columns]" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cursor = winecollection.find({'variety': 'Merlot'}, {'title': 1})\n", "qtext = dumps(cursor)\n", "qrec = loads(qtext)\n", "pd.DataFrame.from_records(qrec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the only field that is extracted other than the ones we directly specify is `_id`, but we can exclude `_id` as well by typing" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " title\n", "0 Bianchi 2011 Signature Selection Merlot (Paso ...\n", "1 Sundance 2011 Merlot (Maule Valley)\n", "2 Passaggio 2014 Blau Vineyards Merlot (Knights ...\n", "3 Duckhorn 2012 Rector Creek Vineyard Merlot (Na...\n", "4 Viña Bisquertt 2007 Casa La Joya Reserve Merlo...\n", "... ...\n", "2094 Castillo de Monjardin 2009 Deyo Merlot (Navarra)\n", "2095 Bonair 2006 Chateau Puryear Vineyard Merlot (R...\n", "2096 Hyatt 2005 Merlot (Rattlesnake Hills)\n", "2097 Ca' Momi 2013 Reserve Merlot (Carneros)\n", "2098 Psagot 2014 Merlot\n", "\n", "[2099 rows x 1 columns]" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cursor = winecollection.find({'variety': 'Merlot'}, {'title': 1, '_id': 0})\n", "qtext = dumps(cursor)\n", "qrec = loads(qtext)\n", "pd.DataFrame.from_records(qrec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To keep the title, variety, points, and price, we type" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " points price title variety\n", "0 87 22.0 Bianchi 2011 Signature Selection Merlot (Paso ... Merlot\n", "1 86 9.0 Sundance 2011 Merlot (Maule Valley) Merlot\n", "2 86 55.0 Passaggio 2014 Blau Vineyards Merlot (Knights ... Merlot\n", "3 91 95.0 Duckhorn 2012 Rector Creek Vineyard Merlot (Na... Merlot\n", "4 88 11.0 Viña Bisquertt 2007 Casa La Joya Reserve Merlo... Merlot\n", "... ... ... ... ...\n", "2094 87 18.0 Castillo de Monjardin 2009 Deyo Merlot (Navarra) Merlot\n", "2095 86 20.0 Bonair 2006 Chateau Puryear Vineyard Merlot (R... Merlot\n", "2096 86 10.0 Hyatt 2005 Merlot (Rattlesnake Hills) Merlot\n", "2097 90 44.0 Ca' Momi 2013 Reserve Merlot (Carneros) Merlot\n", "2098 91 32.0 Psagot 2014 Merlot Merlot\n", "\n", "[2099 rows x 4 columns]" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cursor = winecollection.find({'variety': 'Merlot'}, \n", " {'title': 1,\n", " 'variety': 1,\n", " 'points': 1,\n", " 'price': 1,\n", " '_id': 0})\n", "qtext = dumps(cursor)\n", "qrec = loads(qtext)\n", "pd.DataFrame.from_records(qrec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Updating Records\n", "Updating records in MongoDB is similar to selecting records: the same logical conditions that select records also identify the records we want to edit. The `.update_one()` method, applied to a collection, takes two arguments. First we specify a logical condition that identifies the record we want to edit; if more than one record matches, only the first match is updated. Then we use the `$set` operator to choose specific fields within the existing JSON record to change. If we want, we can even pass an entire dictionary of new values to `$set`. For example, to identify the record of the wine from Dwyane Wade's winery, we can query `{'location.winery': 'D Wade Cellars'}` as we did in the previous section. Suppose that we want to edit this record so that the price increases to \\\\$45. 
We can do so with the following code:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " _id title \\\n", "0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend \n", "\n", " description taster_name \\\n", "0 This wine goes great with dinner just like Dwy... Jonathan Kropko \n", "\n", " taster_twitter_handle price variety \\\n", "0 @jmk5131 45 Red Blend \n", "\n", " location \n", "0 {'region_1': 'Napa Valley', 'region_2': None, ... " ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "winecollection.update_one({'location.winery': 'D Wade Cellars'},\n", " {'$set' : {'price': 45}})\n", "mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that we wanted to add a field that does not currently exist in the record, like `score`. We can use the same syntax to add fields:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " _id title \\\n", "0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend \n", "\n", " description taster_name \\\n", "0 This wine goes great with dinner just like Dwy... Jonathan Kropko \n", "\n", " taster_twitter_handle price variety \\\n", "0 @jmk5131 45 Red Blend \n", "\n", " location score \n", "0 {'region_1': 'Napa Valley', 'region_2': None, ... 90 " ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "winecollection.update_one({'location.winery': 'D Wade Cellars'},\n", " {'$set' : {'score': 90}})\n", "mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can change more than one field at a time within one call to the `$set` operator. To change both the score and the price of the Dwyane Wade wine, we can type:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " _id title \\\n", "0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend \n", "\n", " description taster_name \\\n", "0 This wine goes great with dinner just like Dwy... Jonathan Kropko \n", "\n", " taster_twitter_handle price variety \\\n", "0 @jmk5131 50 Red Blend \n", "\n", " location score \n", "0 {'region_1': 'Napa Valley', 'region_2': None, ... 95 " ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "winecollection.update_one({'location.winery': 'D Wade Cellars'},\n", " {'$set' : {'score': 95,\n", " 'price': 50}})\n", "mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that the wine is reviewed by LeBron James, NBA star and noted [wine connoisseur](https://twitter.com/KingJames/status/1239424365621469184?s=20), who provided a new score and description. We can update the entire record by first defining a Python variable that contains the record:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "dwadewine2 = {'title': '2016 Napa Valley Three By Wade Red Blend', \n", "'description': \"This wine is very good. Not as great as me. But plenty great enough for Miami.\", \n", "'taster_name': 'LeBron James', \n", "'taster_twitter_handle': '@kingjames', \n", "'price': 45,\n", "'score': 99,\n", "'variety': 'Red Blend', \n", "'location':{\n", " 'region_1': 'Napa Valley', \n", " 'region_2': None, \n", " 'province': 'California', \n", " 'country': 'U.S.', \n", " 'winery': 'D Wade Cellars'}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can replace the existing record for this wine by specifying this dictionary as the second argument of the `.update_one()` method:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " _id title \\\n", "0 5edd56e5b4e58ce3841e5dea 2016 Napa Valley Three By Wade Red Blend \n", "\n", " description taster_name \\\n", "0 This wine is very good. Not as great as me. Bu... LeBron James \n", "\n", " taster_twitter_handle price variety \\\n", "0 @kingjames 45 Red Blend \n", "\n", " location score \n", "0 {'region_1': 'Napa Valley', 'region_2': None, ... 99 " ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "winecollection.update_one({'location.winery': 'D Wade Cellars'},\n", " {'$set' : dwadewine2})\n", "mongo_read_query(winecollection, {'location.winery': 'D Wade Cellars'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A second method for editing records is `.update_many()`, which revises every document that matches a query. I don't recommend using this method except in very specific cases, because it is easy to destroy large portions of a database with a mistyped query. But for the sake of illustration, suppose we wanted to change the names of the \"Red Blend\" varieties of wines to \"R. Blend\". We can do that with the following code:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " _id wine_id country \\\n", "0 5ed80dcca25fcf746119e3be 20.0 US \n", "1 5ed80dcca25fcf746119e3c6 28.0 Italy \n", "2 5ed80dcca25fcf746119e3dc 68.0 US \n", "3 5ed80dcca25fcf746119e3f2 90.0 US \n", "4 5ed80dcca25fcf746119e400 104.0 Italy \n", "... ... ... ... \n", "7106 5ed80dcfa25fcf74611b78b3 129932.0 Argentina \n", "7107 5ed80dcfa25fcf74611b78bd 129943.0 Italy \n", "7108 5ed80dcfa25fcf74611b78c1 129947.0 Italy \n", "7109 5edd56e5b4e58ce3841e5dea NaN NaN \n", "7110 5edd56e6b4e58ce3841e5deb NaN NaN \n", "\n", " description points price \\\n", "0 Ripe aromas of dark berries mingle with ample ... 87.0 23 \n", "1 Aromas suggest mature berry, scorched earth, a... 87.0 17 \n", "2 Very deep in color and spicy-smoky in flavor, ... 86.0 12 \n", "3 This blend of Sangiovese, Malbec, Cabernet Sau... 88.0 23 \n", "4 Made with 65% Sangiovese, 20% Merlot and 15% C... 87.0 16 \n", "... ... ... ... \n", "7106 Andeluna's top wines tend to be ripe and plump... 91.0 55 \n", "7107 A blend of Nero d'Avola and Syrah, this convey... 90.0 29 \n", "7108 A blend of 65% Cabernet Sauvignon, 30% Merlot ... 90.0 20 \n", "7109 This wine is very good. Not as great as me. Bu... NaN 45 \n", "7110 This wine will make you speak differently. May... NaN 40.99 \n", "\n", " province region taster_name \\\n", "0 Virginia Virginia Alexander Peartree \n", "1 Sicily & Sardinia Cerasuolo di Vittoria Kerin O’Keefe \n", "2 California California Jim Gordon \n", "3 California Sonoma County Virginie Boone \n", "4 Tuscany Toscana Kerin O’Keefe \n", "... ... ... ... \n", "7106 Mendoza Province Uco Valley Michael Schachner \n", "7107 Sicily & Sardinia Sicilia Kerin O’Keefe \n", "7108 Sicily & Sardinia Terre Siciliane Kerin O’Keefe \n", "7109 NaN NaN LeBron James \n", "7110 NaN NaN Jonathan Kropko \n", "\n", " taster_twitter_handle title \\\n", "0 None Quiévremont 2012 Vin de Maison Red (Virginia) \n", "1 @kerinokeefe Terre di Giurfo 2011 Mascaria Barricato (Cera... 
\n", "2 @gordone_cellars Cocobon 2014 Red (California) \n", "3 @vboone Ferrari-Carano 2014 Siena Red (Sonoma County) \n", "4 @kerinokeefe Madonna Alta 2014 Nativo Red (Toscana) \n", "... ... ... \n", "7106 @wineschach Andeluna 2004 Pasionado Red (Uco Valley) \n", "7107 @kerinokeefe Baglio del Cristo di Campobello 2012 Adènzia R... \n", "7108 @kerinokeefe Feudo Principi di Butera 2012 Symposio Red (Te... \n", "7109 @kingjames 2016 Napa Valley Three By Wade Red Blend \n", "7110 @jmk5131 Anta Banderas A 10 2008 \n", "\n", " variety winery \\\n", "0 R. Blend Quiévremont \n", "1 R. Blend Terre di Giurfo \n", "2 R. Blend Cocobon \n", "3 R. Blend Ferrari-Carano \n", "4 R. Blend Madonna Alta \n", "... ... ... \n", "7106 R. Blend Andeluna \n", "7107 R. Blend Baglio del Cristo di Campobello \n", "7108 R. Blend Feudo Principi di Butera \n", "7109 R. Blend NaN \n", "7110 R. Blend NaN \n", "\n", " location score \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "... ... ... \n", "7106 NaN NaN \n", "7107 NaN NaN \n", "7108 NaN NaN \n", "7109 {'region_1': 'Napa Valley', 'region_2': None, ... 99.0 \n", "7110 {'region_1': 'Ribera del Duoro', 'region_2': N... NaN \n", "\n", "[7111 rows x 15 columns]" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "winecollection.update_many({'variety': 'Red Blend'},\n", " {'$set': {'variety': 'R. Blend'}})\n", "mongo_read_query(winecollection, {'variety': 'R. Blend'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Performing Text Searches\n", "One of the great advantages of a document store database is the ability to search through the text within the documents and extract records that match a certain pattern. 
A text search in MongoDB involves two steps:\n", "\n", "* First, we will create a **text index** on the field in the records that contains the text we want MongoDB to search within.\n", "\n", "* Second, we will use the `$text` operator within a call to `.find()` to specify the search terms.\n", "\n", "To create a text index, we can use the syntax\n", "```\n", "collection.create_index([('keytosearch', 'text')])\n", "```\n", "We will replace `'keytosearch'` with the name of the field in the JSON dictionaries on which we want to search, but we will leave `'text'` as is because it tells MongoDB to build a text index on that field. The code to create a text index on the `description` field of the `winecollection` collection is:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'description_text'" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "winecollection.create_index([('description', 'text')])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've set the text index, we can search the text in that field. The general syntax for a query with a text search is\n", "```\n", "{'$text': {'$search': 'searchterms', '$caseSensitive': False}}\n", "```\n", "where `'searchterms'` contains the terms we want to search for, and `'$caseSensitive': False` tells MongoDB to ignore case in the search, so that a search term of \"chocolate\" also matches \"Chocolate\". Alternatively, `'$caseSensitive': True` takes case into account when matching records to a query. 
If a search is neither case sensitive nor diacritic sensitive (a diacritic-sensitive search, which distinguishes accented from unaccented characters, is requested by adding `'$diacriticSensitive': True`), then `$search` matches on the **stems** of words: the [root form of a word that remains after its suffixes are removed](https://en.wikipedia.org/wiki/Stemming), allowing a search term of \"blueberry\" to also match \"blueberries\".\n", "\n", "As a simple example, now that a text index has been set on `description`, we can find all wines with descriptions that contain the word \"chocolate\". Here I save the output as a dataframe and display the description for the first wine in the output:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Supermodern in style, with mint, coconut, chocolate and huge black fruit aromas. Powerfully structured and thick-boned, with boysenberry, spice and chocolate in spades. Oaky, broad and layered on the finish, with tobacco, coffee and chocolate finishing notes.'" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = mongo_read_query(winecollection, {'$text': {'$search':'chocolate', '$caseSensitive': False}})\n", "df['description'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To search on more than one term, include the terms in the same string after `'$search'`, separated by spaces. By default, these terms are combined using the \"or\" operator, so that the query returns any document with at least one of the terms. 
The following code finds all wines whose descriptions contain \"chocolate\" or \"leather\":" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5930 A whiff of leather introduces a wine with a st...\n", "Name: description, dtype: object" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = mongo_read_query(winecollection, {'$text': {'$search':'chocolate leather', '$caseSensitive': False}})\n", "df[df['wine_id']==109396]['description'] # a leathery one" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To search for documents with a phrase that contains a space, enclose the phrase in double quotes, and precede each double-quote with an \\ escape character. The following code captures descriptions with the phrase \"very good\":" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"A good to very good wine with medicinality on the nose that takes over the aromatics. There's also the slightest bit of hard cheese and stem on the bouquet, so overall it is fighting an uphill battle. Along the way it delivers flavors of cough drop, cherry and a good, solid finish.\"" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = mongo_read_query(winecollection, {'$text': {'$search':'\\\"very good\\\"', '$caseSensitive': False}})\n", "df['description'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To search for documents that contain multiple search terms at once (an \"and\" operator), enclose each search term in double quotes with escape characters. 
We can search for descriptions that contain both \"leather\" and \"chocolate\" with the following code: " ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'A blend of 85% Melnik, 10% Grenache Noir and 5% Petit Verdot, this wine has aromas of saddle leather, cassis and dark chocolate. In the mouth there are flavors of cherry, chocolate and dried blueberry. It has good balance with a soft tannic finish.'" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = mongo_read_query(winecollection, {'$text': {'$search':'\\\"leather\\\" \\\"chocolate\\\"', '$caseSensitive': False}})\n", "df['description'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To exclude a term, we place a minus sign (`-`) directly in front of the term we want to exclude. To find all wines whose descriptions contain the word \"dark\" but not \"chocolate\", we type:" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Dark, dark, dark is this estate-grown Syrah, closed at the nose except for the profusion of alcohol, grippy with lots of oak and crazy, teeth-staining tannins.'" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = mongo_read_query(winecollection, {'$text': {'$search':'dark -chocolate', '$caseSensitive': False}})\n", "df['description'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text searches can also be used to construct a search engine. We provide the search terms, and MongoDB generates a score for every matching document that represents the extent to which the search terms are relevant to the document. 
Once these scores have been generated, it is possible to sort the documents by the score to find the documents that are most closely related to the search terms.\n", "\n", "To rank documents by search relevance, we pass `{'score': {'$meta': 'textScore'}}` as the second (projection) argument to the `.find()` method, after the query itself. Here we enter five search terms, \"chocolate\", \"leather\", \"wood\", \"dark\", and \"smoke\": " ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "cursor = winecollection.find(\n", " {'$text': {'$search': 'chocolate leather wood dark smoke'}},\n", " {'score': {'$meta': 'textScore'}})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we apply the `.sort()` method to the output, arranging the documents from the highest relevance score to the lowest, with the following code:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cursor.sort([('score', {'$meta': 'textScore'})]) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we can convert the output to a dataframe with the following code:" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "qtext = dumps(cursor)\n", "qrec = loads(qtext)\n", "df = pd.DataFrame.from_records(qrec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Among all the wine reviews in the data, here is the wine whose description had the highest relevance score for \"chocolate\", \"leather\", \"wood\", \"dark\", and \"smoke\":" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Big and bold in character, this full-bodied and abundantly tannic wine looks black to the rim and smells like pencil shavings, leather and wood smoke. The flavors are dry but enticing, with dark chocolate, charred beef and black cherry. 
It needs until at least 2022 to peak.'" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['description'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To wrap up our work with the database, we apply the `.close()` method to the MongoDB client, which closes our connection to the server:" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "myclient.close()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }