5. Web Scraping Using BeautifulSoup

5.2. How Websites Prevent You From Scraping

This discussion follows the excellent overview by a Stack Overflow and GitHub contributor with the username JonasCz (I wish I knew this user’s real name!) on how to prevent web scraping.

To understand the restrictions and challenges you will encounter when scraping data, put yourself in the position of a website’s owner:

If you own and maintain a website, there are many reasons why you might want to prevent web scraping bots from accessing the data on your website. Maybe the bots will overload the traffic to your site and make it impossible for your website to work as you intend. You might be running a business through this website and sharing the data in mass transfers would undercut your business. For whatever reason, you are now faced with a challenge: how do you prevent automated scraping of the data on your webpage while still allowing individual customers to view your website?

Web scraping will require issuing HTTP requests to a particular web address with a tool like requests, sometimes many times in a short period. Every HTTP request is logged by the server that receives the request, and these logs contain the IP address of the entity making the request. If too many requests are made by the same IP address, the server can block that IP address. The coding logic to automatically identify and block overactive IP addresses is simple, so many websites include these security measures. Some blocks are temporary, placing a rate limit on these requests to slow down the scrapers, and some blocks reroute scrapers through a CAPTCHA (which stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”) to prevent robots like a scraper from accessing the website. JonasCz recommends that these security measures look at other factors as well: the speed of actions on the website, the amount of data requested, and other factors that can identify a user when the IP address is masked.
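To make the rate-limiting idea concrete, here is a minimal sketch of how a server might count recent requests per IP address and refuse the ones that come too quickly. The class name RateLimiter, the limit of 3 requests per 10 seconds, and the example IP address are all assumptions made for illustration, not the logic of any particular website.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Refuse requests from an IP address that exceeds max_requests per window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP address to the timestamps of its recent allowed requests.
        self.history = defaultdict(deque)

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (in seconds) is allowed."""
        timestamps = self.history[ip]
        # Forget requests that are older than the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True


# Hypothetical limit: at most 3 requests in any 10-second window.
limiter = RateLimiter(max_requests=3, window_seconds=10)
results = [limiter.allow("203.0.113.5", t) for t in [0, 1, 2, 3, 12]]
print(results)
```

The fourth request arrives while three earlier requests are still inside the window, so it is refused. By the fifth request the window has cleared and the address is allowed again, which is how a temporary block behaves.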

Stronger gates, such as making users register for a username and password with email confirmation to use your website, are effective against scraping bots. But they also turn away individuals who wouldn’t want to jump through those hoops. Saving all text as images on your server will prevent bots from accessing the text very easily, but it makes the website harder to use and violates regulations that protect people with disabilities.

Instead, JonasCz recommends building your website in a way that never reveals the entirety of the data you own, and never reveals the private API endpoints you use to display the data. Also, web scrapers are fragile: they are built to pull data from the specific HTML structure of a particular website. Changing the HTML code frequently or using different versions of the code based on geographic location will break the scrapers that are built for that code. JonasCz also suggests adding “honeypot” links to the HTML code that will not be displayed to legitimate users but will be followed by scrapers that recursively follow links, and taking action against the agents that follow these links: block their IP addresses, require a CAPTCHA, or deliver fake data.

One important piece of information in a request is the user agent header (which we discuss in more detail below). JonasCz recommends looking at this information and blocking requests when the user agent is blank or matches information from agents that have previously been identified as malicious bots.
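A check like the one JonasCz describes could be sketched as follows. The function name should_block and the entries in the blocklist are hypothetical; a real website would maintain its own list of agents it has identified as malicious.

```python
# Hypothetical fragments of user agent strings that a site has flagged before.
BLOCKED_AGENT_FRAGMENTS = ("badbot", "evilscraper")


def should_block(user_agent):
    """Return True if the user agent is missing, blank, or on the blocklist."""
    if user_agent is None or not user_agent.strip():
        return True
    lowered = user_agent.lower()
    return any(fragment in lowered for fragment in BLOCKED_AGENT_FRAGMENTS)


print(should_block(""))
print(should_block("BadBot/2.0"))
print(should_block("Kropko class example (jkropko@virginia.edu)"))
```

Under these assumptions the blank and flagged agents are blocked, while a request that identifies itself with a name and contact address passes.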

Understanding the steps you would take to protect your data from bots if you owned a website, you should have greater insight into why a web scraping endeavor may fail. Your web scraper might not be malicious, but might still violate the rules that the website owner set up to guard against bots. These rules are typically listed explicitly in a file on the server called robots.txt. Some tips for reading and understanding a robots.txt file are here: https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/

For example, in this document we will be scraping data on the playlist of a radio station from https://spinitron.com/. This website has a robots.txt file here: https://spinitron.com/robots.txt, which reads:

User-agent: *
Crawl-delay: 10
Request-rate: 1/10

The User-agent: * line tells us that the next two lines apply to all user agent strings. Crawl-delay: 10 places a limit on the frequency with which our scraper can make requests to this website. In this case, individual requests must be made at least 10 seconds apart. Request-rate: 1/10 tells us that our scraper is only allowed to access one page every 10 seconds, and that we are not allowed to make requests from more than one page at the same time.
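Python's standard library can read these rules for us. The sketch below parses the three lines shown above with urllib.robotparser; I paste the text in as a string, which is an assumption made so that the example runs without a network connection. The set_url() and read() methods of the same class can download the file instead.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted above, pasted in as a string.
robots_txt = """User-agent: *
Crawl-delay: 10
Request-rate: 1/10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

delay = parser.crawl_delay("*")    # seconds to wait between requests
rate = parser.request_rate("*")    # named tuple with .requests and .seconds
allowed = parser.can_fetch("*", "https://spinitron.com/WNRN/")

print(delay, rate.requests, rate.seconds, allowed)
```

A scraper that loops over several pages could then call time.sleep(delay) between requests to stay within these limits.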

5.3. Using requests with a User Agent Header

As the articles by James Densmore and JonasCz described, requests are much more likely to get blocked by websites if the request does not specify a header that contains a user agent. An HTTP header is a parameter that gets sent along with the HTTP request that contains metadata about the request. A user agent header contains contact and identification information about the person making the request. If there is any issue with your web scraper, you want to give the website owner a chance to contact you directly about that problem. If you do not feel comfortable being contacted by the website’s owner, you should reconsider whether you should be scraping that website.

Fortunately, it is straightforward to include headers in a GET request using requests: just use the headers argument. First, we import the relevant libraries:

import numpy as np
import pandas as pd
import requests

In module 4 we issued GET requests to the Wikipedia API as an example.

r = requests.get("https://en.wikipedia.org/w/api.php")
r
<Response [200]>

To add a user agent string, I use the following code:

headers = {'user-agent': 'Kropko class example (jkropko@virginia.edu)'}
r = requests.get("https://en.wikipedia.org/w/api.php", headers = headers)
r
<Response [200]>

What information needs to go into a user agent header? Different sources give different guidance. According to Amazon Web Services, a user agent should identify your application, its version number, and programming language. So a user agent should look like this:

headers = {'user-agent': 'Kropko class example version 1.0 (jkropko@virginia.edu) (Language=Python 3.8.2; Platform=Mac OSX 10.15.5)'}
r = requests.get("https://en.wikipedia.org/w/api.php", headers = headers)
r
<Response [200]>

Including a user agent is not hard, and it goes a long way towards alleviating the anxieties that website owners have about dealing with your web scraping code. It is a good practice to cultivate into a habit.
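If you would rather not type the language and platform details by hand, one option is to fill them in with the standard library's platform module. This is only a sketch: the helper name build_user_agent is hypothetical, and the layout of the string simply imitates the example above rather than following any required format.

```python
import platform


def build_user_agent(app_name, app_version, contact_email):
    """Assemble a user agent string from the running Python and operating system."""
    language = f"Python {platform.python_version()}"
    system = f"{platform.system()} {platform.release()}"
    return (
        f"{app_name} version {app_version} ({contact_email}) "
        f"(Language={language}; Platform={system})"
    )


headers = {
    "user-agent": build_user_agent(
        "Kropko class example", "1.0", "jkropko@virginia.edu"
    )
}
print(headers["user-agent"])
```

The resulting headers dictionary can be passed to requests.get() through the headers argument, exactly as in the examples above.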

5.4. Using BeautifulSoup() (Example: WNRN, Charlottesville’s Legendary Radio Station)

WNRN is a legendary radio station, and it’s based right here in Charlottesville at 91.9 FM (and streaming online at www.wnrn.org). It’s commercial-free, with only a few interruptions for local nonprofits to tell you about cool things happening in town. They play a mix of new and classic alternative rock and R&B. They emphasize music for bands coming to play at local venues. And they play the Grateful Dead on Saturday mornings. You should be listening to WNRN!

The playlist of the songs that WNRN has played in the last few hours is here: https://spinitron.com/WNRN/. I want to scrape the data off this website. I also want to scrape the data off of the additional playlists that this website links to, to collect as much data as possible. Our goal in this example is to create a dataframe of each song WNRN has played, the artist, the album, and the time each song was played.

The process involves four steps:

  1. Download the raw text of the HTML code for the website we want to scrape using the requests library.

  2. Use the BeautifulSoup() function from the bs4 library to parse the raw text so that Python can understand, search through, and operate on the HTML tags instead of treating the code as one long string.

  3. Use methods associated with BeautifulSoup() to extract the data we need from the HTML code.

  4. Place the data into a pandas data frame.
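Steps 2 and 3 can be previewed on a small scale before we touch the real website. In the sketch below, a short hand-written HTML string stands in for the download in step 1. Its structure is an assumption that imitates the class names (artist, song, release) that appear on the playlist page, and the real page is considerably more complicated.

```python
from bs4 import BeautifulSoup

# Stand-in for step 1: a tiny HTML fragment instead of a downloaded page.
html = """
<span class="artist">Fleet Foxes</span>
<span class="song">Can I Believe You</span>
<div class="info"><span class="release">Shore</span></div>
<span class="artist">Bob Marley &amp; The Wailers</span>
<span class="song">One Love/People Get Ready</span>
<div class="info"><span class="release">Exodus</span></div>
"""

# Step 2: parse the raw text with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Step 3: collect the text inside every <span> of each class.
artists = [tag.get_text() for tag in soup.find_all("span", class_="artist")]
songs = [tag.get_text() for tag in soup.find_all("span", class_="song")]
releases = [tag.get_text() for tag in soup.find_all("span", class_="release")]

# One dictionary per song, ready to be turned into rows of a table.
records = [
    {"artist": a, "song": s, "album": r}
    for a, s, r in zip(artists, songs, releases)
]
print(records)
```

For step 4, passing records to pd.DataFrame() would produce a data frame with one row per song and columns for the artist, song, and album.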

5.4.1. Downloading and Understanding Raw HTML

For this example, I first download the HTML that exists on https://spinitron.com/WNRN using the requests.get() function. To be ethical and to help this website’s owners know that I am not a malicious actor, I also specify a user agent string.

url = "https://spinitron.com/WNRN"
headers = {'user-agent': 'Kropko class example (jkropko@virginia.edu)'}
r = requests.get(url, headers=headers)
r
<Response [200]>

The raw HTML code contains a series of text fragments that look like this,

<tag attribute="value"> Navigable string </tag>

where tag, attribute, "value", and Navigable string are replaced by specific parameters and data that control the content and presentation of the webpage that gets displayed in a web browser. For example, here are the first 1000 characters of the raw text from WNRN’s playlist:

print(r.text[0:1000])
<!doctype html><html lang="en">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1,maximum-scale=1">
    <title>WNRN – Independent Music Radio</title>
    <meta name="description" content="A member supported Independent Music Radio Station serving Central Virginia and the Shenandoah Valley">

    <link rel="apple-touch-icon" href="apple-touch-icon.png">
                <meta name="csrf-param" content="_csrf">
<meta name="csrf-token" content="WY7foc8huxP04oXBu_BchEv1CBMEKuy9PFvWtwnuUlwt_qbNtmPWZ8yAtJTyuTD2MY97Z3UYmu8IIo-CRoscCg==">

    <meta property="og:url" content="/WNRN/">
<meta property="og:title" content="WNRN – Independent Music Radio">
<meta property="og:description" content="A member supported Independent Music Radio Station serving Central Virginia and the Shenandoah Valley">
<meta property="og:image" content="https://spinitron.com/images/Station/13/1326-img_logo.2

Tags specify how the data contained within the page are organized and how the visual elements on this page should look. Tags are designated by opening and closing angle brackets, < and >. In the HTML code displayed above, there are tags named

  • <html>, which tells browsers that the following code is written in HTML,

  • <meta>, which defines metadata in the document that help govern how the output should be displayed in the browser,

  • <title>, which sets the title of the document, and

  • <link>, which pulls data or images from external resources for later use.

To see what other HTML tags do, look at the list on https://www.w3schools.com/TAGs/.

In some cases the tag operates on the text that immediately follows, and a closing tag </tag> frames the text that gets operated on by the tag. The text in between the opening and closing tag is called the navigable string. For example, the tag <title>WNRN – Independent Music Radio</title> specifies that “WNRN – Independent Music Radio”, and only this string, is the title.

Some tags have attributes, which are arguments listed inside an opening tag to modify the behavior of that tag or to attach relevant data to the tag. The first <html> tag listed above contains an attribute lang with a value "en" that specifies that the content of this document is written in English.
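To see the three pieces (tag, attribute, and navigable string) pulled apart, here is a small sketch that uses the HTMLParser class from Python's standard library rather than BeautifulSoup(). The class name TagLogger is hypothetical, and the one-line HTML string is a shortened version of the code shown above.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each opening tag, piece of text, and closing tag as it is read."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
logger.feed('<html lang="en"><title>WNRN – Independent Music Radio</title></html>')
logger.close()

for event in logger.events:
    print(event)
```

The opening <html> tag is reported together with its lang attribute, and the text between <title> and </title> is reported separately from the tags that frame it.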

5.4.2. Parsing Raw HTML Using BeautifulSoup()

The requests.get() function only downloads the raw text of the HTML code, but it does not yet understand the logic and organization of the HTML code. Getting Python to register text as a particular coding standard is called parsing the code. We’ve parsed code into Python before with JSON data. We used requests.get() to download the JSON formatted data, but we needed json.loads() to parse the data in order to be able to navigate the branches of the JSON tree.
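As a reminder of what parsing buys us, compare a raw JSON string with the object that json.loads() returns. The record below is made up for illustration.

```python
import json

# Raw text: Python sees one long string and knows nothing about its structure.
raw = '{"station": "WNRN", "frequency": 91.9, "genres": ["alternative rock", "R&B"]}'
print(type(raw).__name__)

# Parsed: the same content becomes a dictionary that we can navigate by key.
parsed = json.loads(raw)
print(type(parsed).__name__)
print(parsed["genres"][1])
```

BeautifulSoup() plays the same role for HTML that json.loads() plays for JSON.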

Two widely used Python libraries for working with HTML data are bs4, which contains the BeautifulSoup() function, and selenium. BeautifulSoup() works with raw text, but cannot access websites itself (we use requests.get() for that). In order to access the data on a website, the data needs to be visible in the raw HTML that requests.get() returns. If a website takes measures to hide that data, possibly by running JavaScript in the browser to populate data fields after the page loads, or by saving data as image files, then we won’t be able to access the data with an HTML parser. selenium drives a real web browser, so it can reach content that only appears after JavaScript runs, and it can take a screenshot of the webpage so that a separate optical character recognition (OCR) tool can pull data directly from the image. However, selenium requires each request to be loaded in a web browser, so it can be quite a bit slower than BeautifulSoup(). If you are interested in learning how to use selenium, see this guide: https://selenium-python.readthedocs.io/. Here we will be using BeautifulSoup().

First I import the BeautifulSoup() function:

from bs4 import BeautifulSoup

To use it, we pass the raw HTML text from https://spinitron.com/WNRN to BeautifulSoup(). I saved the requests.get() output as r above, so this text is r.text. This function can parse either HTML or XML code, so the second argument should specify HTML:

wnrn = BeautifulSoup(r.text, 'html')

Now that the https://spinitron.com/WNRN source code is registered as HTML code in Python, we can begin executing commands to navigate the organizational structure of the code and extract data.

5.4.3. Searching for HTML Tags and Extracting Data

While HTML is a coding language, it does not force coders to follow very strict templates. There’s a lot of flexibility and creativity possible for HTML programmers, and as such, there is no one universal method for extracting data from HTML. The best approach is to open a browser window, navigate to the webpage you want to scrape, and “view page source”. (Different web browsers have different ways to do that. On Mozilla Firefox, right click somewhere on the page other than an active link, and “view page source” should be an option.) The source will display the raw HTML code that generates the page. You will need to search through this code to find examples of the data points you intend to collect, possibly using control+F to search for specific values. Once you find the data you need, make note of the tags that surround the data and use the tools we will describe next to extract the data.

The parsable HTML BeautifulSoup() output, wnrn, has important methods and attributes that we will use to extract the data we want. First, we can use the name of a tag as an attribute to extract the first occurrence of that tag. Here we extract the first <meta> tag:

metatag = wnrn.meta
metatag
<meta charset="utf-8"/>

This tag stores its attributes in a dictionary, so we can extract the value of an attribute by using the name of that attribute as a key:

metatag['charset']
'utf-8'
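A few related ways to work with attributes are sketched below on a small hand-written HTML string, which is an assumption standing in for the real page. The .attrs attribute returns all of a tag's attributes at once, and the .get() method returns None instead of raising an error when an attribute is missing.

```python
from bs4 import BeautifulSoup

# A small stand-in for the page, parsed with Python's built-in HTML parser.
snippet = '<meta charset="utf-8"><span class="artist">Fleet Foxes</span>'
soup = BeautifulSoup(snippet, "html.parser")

meta = soup.meta
print(meta.attrs)           # every attribute of the tag, as a dictionary
print(meta.get("content"))  # this tag has no content attribute

span = soup.span
print(span["class"])        # class can hold several values, so it is a list
```

Using .get() is convenient when looping over many tags, only some of which carry the attribute you want.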

If a tag has a navigable string, we can extract that with the .string attribute of a particular tag. For example, to extract the title, we start with the <title> tag:

titletag = wnrn.title
titletag
<title>WNRN – Independent Music Radio</title>

Then we extract the title as follows:

titletag.string
'WNRN – Independent Music Radio'
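One caveat: .string only works when a tag contains a single piece of text. If a tag holds text mixed with other tags, .string returns None, and the .get_text() method is the safer choice. The sketch below uses a made-up line of HTML to show the difference.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Now playing: <b>Fleet Foxes</b></p>", "html.parser")
p = soup.p

print(p.string)      # None, because <p> contains both text and a <b> tag
print(p.get_text())  # all of the text inside <p>, including nested tags
print(p.b.string)    # the <b> tag holds one string, so .string works here
```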

Our goal in this example is to extract the artist, song, album, and time played for every song played on WNRN. I look in the raw HTML source code for the first instance of an artist. These data are contained in the <span> tags:

spantag = wnrn.span
spantag
<span class="artist">Fleet Foxes</span>

Calling one tag is not especially useful, because we generally want to extract all of the relevant data on a page. For that, we can use the .find_next() and .find_all() methods, both of which are very literal. Called without arguments, .find_next() returns the next tag of any type that appears after the current one in the HTML code. Here, the tag that follows the artist is the <span> that contains the song:

spantag.find_next()
<span class="song">Can I Believe You</span>

Calling .find_next() a second time returns the <div> tag that wraps the next <span>, which contains the album name (under "release"):

spantag.find_next().find_next()
<div class="info"><span class="release">Shore</span></div>
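To skip over wrappers like this <div>, .find_next() also accepts the name of a tag and then returns the next tag with that name. The sketch below repeats the three tags from the output above in a hand-written string, so the structure is an assumption about only this small piece of the page.

```python
from bs4 import BeautifulSoup

html = (
    '<span class="artist">Fleet Foxes</span>'
    '<span class="song">Can I Believe You</span>'
    '<div class="info"><span class="release">Shore</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

artist = soup.span
song = artist.find_next("span")    # next <span> after the artist
release = song.find_next("span")   # next <span> after the song, inside the <div>

print(song.get_text())
print(release.get_text())
print(song.find_next())            # with no argument, the next tag of any type
```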

To find all occurrences of the <span> tag, organized in a list, use .find_all() and provide the tag as the argument:

spanlist = wnrn.find_all("span")
spanlist
[<span class="artist">Fleet Foxes</span>,
 <span class="song">Can I Believe You</span>,
 <span class="release">Shore</span>,
 <span class="artist">Bob Marley &amp; The Wailers</span>,
 <span class="song">One Love/People Get Ready</span>,
 <span class="release">Exodus</span>,
 <span class="artist">Dori Freeman</span>,
 <span class="song">Why Do I Do This To Myself</span>,
 <span class="release">Do You Recall</span>,
 <span class="artist">LCD Soundsystem</span>,
 <span class="song">Tonite</span>,
 <span class="release">American Dream</span>,
 <span class="artist">Brigitte Calls Me Baby</span>,
 <span class="song">Impressively Average</span>,
 <span class="release">This House Is Made Of Corners</span>,
 <span class="artist">Margaret Glaspy</span>,
 <span class="song">Act Natural</span>,
 <span class="release">Echo The Diamond</span>,
 <span class="artist">Emmylou Harris</span>,
 <span class="song">The Road</span>,
 <span class="release">Hard Bargain</span>,
 <span class="artist">Butcher Brown</span>,
 <span class="song">No Way Around It</span>,
 <span class="release">Solar Music</span>,
 <span class="artist">Lindsay Lou</span>,
 <span class="song">On Your Side (Starman)</span>,
 <span class="release">Queen Of Time</span>,
 <span class="artist">Chet Faker</span>,
 <span class="song">Low</span>,
 <span class="release">Hotel Surrender</span>,
 <span class="artist">The Replacements</span>,
 <span class="song">I Will Dare</span>,
 <span class="release">Let It Be</span>,
 <span class="artist">Real Estate</span>,
 <span class="song">Talking Backwards</span>,
 <span class="release">Atlas</span>,
 <span class="artist">Jaime Wyatt</span>,
 <span class="song">World Worth Keeping</span>,
 <span class="release">Feel Good</span>,
 <span class="artist">The Rolling Stones</span>,
 <span class="song">Angry</span>,
 <span class="release">Hackney Diamonds</span>,
 <span class="artist">Sonic Youth</span>,
 <span class="song">Incinerate</span>,
 <span class="release">Rather Ripped</span>,
 <span class="artist">The Steel Wheels</span>,
 <span class="song">Scrape Me off the Ceiling</span>,
 <span class="release">Wild as We Came Here</span>,
 <span class="artist">Free Union</span>,
 <span class="song">It Gets Better</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Bryan Elijah Smith &amp; The Wild Hearts</span>,
 <span class="song">Roses &amp; Wardens</span>,
 <span class="release">From the Shenandoah Valley</span>,
 <span class="artist">Jenny Lewis</span>,
 <span class="song">Cherry Baby</span>,
 <span class="release">Joy'all</span>,
 <span class="artist">Slowdive</span>,
 <span class="song">Kisses</span>,
 <span class="release">Everything Is Alive</span>,
 <span class="artist">G. Love and Special Sauce</span>,
 <span class="song">Peace, Love, and Happiness</span>,
 <span class="release">Superhero Brother</span>,
 <span class="artist">Vacations</span>,
 <span class="song">Next Exit</span>,
 <span class="release">No Place Like Home</span>,
 <span class="artist">Lucinda Williams</span>,
 <span class="song">Greenville</span>,
 <span class="release">Car Wheels On a Gravel Road</span>,
 <span class="artist">Amos Lee</span>,
 <span class="song">Greenville</span>,
 <span class="release">Honeysuckle Switches (The Songs Of</span>,
 <span class="artist">Kurt Vile</span>,
 <span class="song">Pretty Pimpin'</span>,
 <span class="release">B'lieve I'm Goin' Down</span>,
 <span class="artist">Trousdale</span>,
 <span class="song">Point Your Finger</span>,
 <span class="release">Out Of My Mind</span>,
 <span class="artist">Sarah Jarosz</span>,
 <span class="song">Jealous Moon</span>,
 <span class="release">Polaroid Lovers</span>,
 <span class="artist">Modest Mouse</span>,
 <span class="song">Dashboard</span>,
 <span class="release">We Were Dead Before The Ship Even Sank</span>,
 <span class="artist">Chris Stapleton</span>,
 <span class="song">Tennessee Whiskey</span>,
 <span class="release">Traveler</span>,
 <span class="artist">Danielle Ponder</span>,
 <span class="song">Roll the Credits</span>,
 <span class="release">Some of Us Are Brave (Deluxe)</span>,
 <span class="artist">The Psychedelic Furs</span>,
 <span class="song">Pretty in Pink</span>,
 <span class="release">Midnight To Midnight</span>,
 <span class="artist">Cordovas</span>,
 <span class="song">Fallen Angels of Rock 'n' Roll</span>,
 <span class="release">The Rose of Aces</span>,
 <span class="artist">Ilsey</span>,
 <span class="song">No California</span>,
 <span class="release">From The Valley</span>,
 <span class="artist">Neko Case</span>,
 <span class="song">People Got a Lotta Nerve</span>,
 <span class="release">Middle Cyclone</span>,
 <span class="artist">Jon Batiste</span>,
 <span class="song">Calling Your Name</span>,
 <span class="release">World Music Radio</span>,
 <span class="artist">Gov't Mule f/ Ivan Neville &amp; Ruthie Foster</span>,
 <span class="song">Dreaming Out Loud</span>,
 <span class="release">Peace... Like a River</span>,
 <span class="artist">Haim</span>,
 <span class="song">The Wire</span>,
 <span class="release">Days Are Gone</span>,
 <span class="artist">The Cars</span>,
 <span class="song">Just What I Needed</span>,
 <span class="release">The Cars</span>,
 <span class="artist">Band of Horses</span>,
 <span class="song">Crutch</span>,
 <span class="release">Things Are Great</span>,
 <span class="artist">The Japanese House</span>,
 <span class="song">Sunshine Baby</span>,
 <span class="release">In The End It Always Does</span>,
 <span class="artist">Black Pumas</span>,
 <span class="song">More Than A Love Song</span>,
 <span class="release">Chronicles Of A Diamond</span>,
 <span class="artist">My Morning Jacket</span>,
 <span class="song">Big Decisions</span>,
 <span class="release">The Waterfall</span>,
 <span class="artist">Tracy Chapman</span>,
 <span class="song">You're the One</span>,
 <span class="release">Let It Rain</span>,
 <span class="artist">Brent Cobb</span>,
 <span class="song">Southern Star</span>,
 <span class="release">Southern Star</span>,
 <span class="artist">Vivian Leva &amp; Riley Calcagno</span>,
 <span class="song">Will You</span>,
 <span class="release">Vivian Leva &amp; Riley Calcagno</span>,
 <span class="artist">Abraham Alexander</span>,
 <span class="song">Tears Run Dry</span>,
 <span class="release">Sea/Sons</span>,
 <span class="artist">Leisure</span>,
 <span class="song">Back In Love</span>,
 <span class="release">Leisurevision</span>,
 <span class="artist">U2</span>,
 <span class="song">One</span>,
 <span class="release">Achtung Baby</span>,
 <span class="artist">Half Moon Run</span>,
 <span class="song">Alco</span>,
 <span class="release">Salt</span>,
 <span class="artist">Sir Chloe</span>,
 <span class="song">Know Better</span>,
 <span class="release">I Am the Dog</span>,
 <span class="artist">Punch Brothers</span>,
 <span class="song">Rye Whiskey</span>,
 <span class="release">Antifogmatic</span>,
 <span class="artist">L7</span>,
 <span class="song">Pretend We're Dead</span>,
 <span class="release">Bricks Are Heavy</span>,
 <span class="artist">Watchhouse</span>,
 <span class="song">Belly of the Beast</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Bully</span>,
 <span class="song">Days Move Slow</span>,
 <span class="release">Lucky for You</span>,
 <span class="artist">Tyler Childers</span>,
 <span class="song">In Your Love</span>,
 <span class="release">Rustin' In The Rain</span>,
 <span class="artist">Paul Simon</span>,
 <span class="song">Hearts and Bones</span>,
 <span class="release">Hearts And Bones</span>,
 <span class="artist">The Pretenders</span>,
 <span class="song">I'll Stand by You</span>,
 <span class="release">Last of the Independents</span>,
 <span class="artist">Ghost of Vroom</span>,
 <span class="song">Pay the Man</span>,
 <span class="release">Ghost of Vroom III</span>,
 <span class="artist">Wilco</span>,
 <span class="song">Random Name Generator</span>,
 <span class="release">Star Wars</span>,
 <span class="artist">Margaret Glaspy</span>,
 <span class="song">Get Back</span>,
 <span class="release">Echo The Diamond</span>,
 <span class="artist">Jamila Woods f/ Saba</span>,
 <span class="song">Practice</span>,
 <span class="release">Water Made Us</span>,
 <span class="artist">Phoebe Bridgers f/ Jackson Browne</span>,
 <span class="song">Christmas Song</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Bahamas</span>,
 <span class="song">I'm Still</span>,
 <span class="release">Bootcut</span>,
 <span class="artist">Dogwood Tales</span>,
 <span class="song">Hard to be Anywhere</span>,
 <span class="release">Closest Thing to Heaven</span>,
 <span class="artist">Jason Isbell &amp; The 400 Unit</span>,
 <span class="song">If We Were Vampires</span>,
 <span class="release">The Nashville Sound</span>,
 <span class="artist">Tennis</span>,
 <span class="song">Need Your Love</span>,
 <span class="release">Swimmer</span>,
 <span class="artist">Fleetwood Mac</span>,
 <span class="song">Go Your Own Way</span>,
 <span class="release">Rumours</span>,
 <span class="artist">Big Thief</span>,
 <span class="song">Born For Loving You</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Blur</span>,
 <span class="song">Barbaric</span>,
 <span class="release">The Ballad of Darren</span>,
 <span class="artist">Tame Impala</span>,
 <span class="song">Feels Like We Only Go Backwards</span>,
 <span class="release">Lonerism</span>,
 <span class="artist">Bon Iver</span>,
 <span class="song">Skinny Love</span>,
 <span class="release">For Emma, Forever Ago</span>,
 <span class="artist">Jobi Riccio</span>,
 <span class="song">Sweet</span>,
 <span class="release">Whiplash</span>,
 <span class="artist">Beirut</span>,
 <span class="song">Gibraltar</span>,
 <span class="release">No No No</span>,
 <span class="artist">Yard Act</span>,
 <span class="song">Dream Job</span>,
 <span class="release">Where's My Utopia?</span>,
 <span class="artist">Olivia Dean f/ Leon Bridges</span>,
 <span class="song">The Hardest Part</span>,
 <span class="release">Messy</span>,
 <span class="artist">Coldplay</span>,
 <span class="song">Yellow</span>,
 <span class="release">Parachutes</span>,
 <span class="artist">Nan Macmillan</span>,
 <span class="song">Both Eyes Now</span>,
 <span class="release">From Both Eyes</span>,
 <span class="artist">The National f/ Rosanne Cash</span>,
 <span class="song">Crumble</span>,
 <span class="release">Laugh Track</span>,
 <span class="artist">The Velvet Underground</span>,
 <span class="song">Femme Fatale</span>,
 <span class="release">The Velvet Underground &amp; Nico</span>,
 <span class="artist">Shemekia Copeland</span>,
 <span class="song">Clotilda's on Fire</span>,
 <span class="release">Uncivil War</span>,
 <span class="artist">Yoke Lore</span>,
 <span class="song">Hallucinate</span>,
 <span class="release">Toward a Never Ending New Beginning</span>,
 <span class="artist">Thad Cockrell</span>,
 <span class="song">Warmth &amp; Beauty</span>,
 <span class="release">Warmth &amp; Beauty</span>,
 <span class="artist">Marty Stuart</span>,
 <span class="song">Sitting Alone</span>,
 <span class="release">Altitude</span>,
 <span class="artist">Christone "Kingfish" Ingram</span>,
 <span class="song">Midnight Heat</span>,
 <span class="release">Live in London</span>,
 <span class="artist">Gomez</span>,
 <span class="song">How We Operate</span>,
 <span class="release">How We Operate</span>,
 <span class="artist">Esther Rose</span>,
 <span class="song">Chet Baker</span>,
 <span class="release">Safe to Run</span>,
 <span class="artist">Bruce Springsteen</span>,
 <span class="song">Thunder Road</span>,
 <span class="release">Born to Run</span>,
 <span class="artist">Paul Thorn</span>,
 <span class="song">Here We Go</span>,
 <span class="release">Never Too Late to Call</span>,
 <span class="artist">Bakar</span>,
 <span class="song">All Night</span>,
 <span class="release">Halo</span>,
 <span class="artist">Jesse Roper</span>,
 <span class="song">Throw This Rope</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Flipturn</span>,
 <span class="song">Playground</span>,
 <span class="release">Shadowglow</span>,
 <span class="artist">John R. Miller</span>,
 <span class="song">Conspiracies, Cults &amp; UFOs</span>,
 <span class="release">Heat Comes Down</span>,
 <span class="artist">Jenny Owen Youngs</span>,
 <span class="song">It's Later Than You Think</span>,
 <span class="release">Avalanche</span>,
 <span class="artist">Peter, Bjorn, and John</span>,
 <span class="song">Young Folks</span>,
 <span class="release">Writer's Block</span>,
 <span class="artist">Peter Gabriel</span>,
 <span class="song">Solsbury Hill</span>,
 <span class="release">Peter Gabriel</span>,
 <span class="artist">The Pink Stones f/ Nikki Lane</span>,
 <span class="song">Baby, I'm Still (Right Here With You)</span>,
 <span class="release">You Know Who</span>,
 <span class="artist">Donovan</span>,
 <span class="song">Sunshine Superman</span>,
 <span class="release">Sunshine Superman</span>,
 <span class="artist">Lake Street Dive</span>,
 <span class="song">Hypotheticals</span>,
 <span class="release">Obviously</span>,
 <span class="artist">Hiss Golden Messenger</span>,
 <span class="song">The Wondering</span>,
 <span class="release">Jump for Joy</span>,
 <span class="artist">Jackie Greene</span>,
 <span class="song">Prayer for Spanish Harlem</span>,
 <span class="release">Giving Up The Ghost</span>,
 <span class="artist">Marika Hackman</span>,
 <span class="song">No Caffeine</span>,
 <span class="release">Big Sigh</span>,
 <span class="artist">Eilen Jewell</span>,
 <span class="song">Lethal Love</span>,
 <span class="release">Get Behind the Wheel</span>,
 <span class="artist">Manchester Orchestra</span>,
 <span class="song">Telepath</span>,
 <span class="release">The Million Masks of God</span>,
 <span class="artist">49 Winchester</span>,
 <span class="song">Chemistry</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Brittany Howard</span>,
 <span class="song">What Now</span>,
 <span class="release">What Now?</span>,
 <span class="artist">Big Star</span>,
 <span class="song">September Gurls</span>,
 <span class="release">Radio City</span>,
 <span class="artist">Driftwood</span>,
 <span class="song">Lay Like You Do</span>,
 <span class="release">Tree of Shade</span>,
 <span class="artist">Beck</span>,
 <span class="song">Think I'm in Love</span>,
 <span class="release">The Information</span>,
 <span class="artist">Sharon Jones &amp; the Dap-Kings</span>,
 <span class="song">Stranger to My Happiness</span>,
 <span class="release">Give The People What They Want</span>,
 <span class="artist">Brenda Lee</span>,
 <span class="song">I'm Sorry</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Mitski</span>,
 <span class="song">My Love Mine All Mine</span>,
 <span class="release">The Land is Inhospitable and So Are We</span>,
 <span class="artist">Lucinda Williams</span>,
 <span class="song">Righteously</span>,
 <span class="release">World Without Tears</span>,
 <span class="artist">Hem</span>,
 <span class="song">Half Acre</span>,
 <span class="release">Rabbit Song</span>,
 <span class="artist">The Beatles</span>,
 <span class="song">Now &amp; Then</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Billy Strings</span>,
 <span class="song">Fire Line</span>,
 <span class="release">Renewal</span>,
 <span class="artist">Boy Golden</span>,
 <span class="song">Blue Hills</span>,
 <span class="release">For Jimmy</span>,
 <span class="artist">Hurray for the Riff Raff</span>,
 <span class="song">Alibi</span>,
 <span class="release">The Past Is Still Alive</span>,
 <span class="artist">Josh Ritter</span>,
 <span class="song">Old Black Magic</span>,
 <span class="release">Fever Breaks</span>,
 <span class="artist">Cat Clyde</span>,
 <span class="song">Everywhere I Go</span>,
 <span class="release">Down Rounder</span>,
 <span class="artist">Middle Kids</span>,
 <span class="song">Dramamine</span>,
 <span class="release">Faith Crisis Pt. 1</span>,
 <span class="artist">Sylvan Esso</span>,
 <span class="song">Ferris Wheel</span>,
 <span class="release">Free Love</span>,
 <span class="artist">Jack Klatt</span>,
 <span class="song">Ramblin Kind</span>,
 <span class="release">It Ain't the Same</span>,
 <span class="artist">Tom Petty</span>,
 <span class="song">Wildflowers</span>,
 <span class="release">Wildflowers</span>,
 <span class="artist">Turnpike Troubadours</span>,
 <span class="song">Chipping Mill</span>,
 <span class="release">A Cat in the Rain</span>,
 <span class="artist">Cage the Elephant</span>,
 <span class="song">Social Cues</span>,
 <span class="release">Social Cues</span>,
 <span class="artist">Dave Matthews Band</span>,
 <span class="song">Break Free</span>,
 <span class="release">Walk Around the Moon</span>,
 <span class="artist">J Mascis</span>,
 <span class="song">Can't Believe We're Here</span>,
 <span class="release">What Do We Do Now</span>,
 <span class="artist">Alison Krauss</span>,
 <span class="song">Forget About It</span>,
 <span class="release">Forget About It</span>,
 <span class="artist">Robbie Fulks</span>,
 <span class="song">One Glass of Whiskey</span>,
 <span class="release">Bluegrass Vacation</span>,
 <span class="artist">Black Pumas</span>,
 <span class="song">Christmas Will Really Be Christmas</span>,
 <span class="release">Spotify Singles: Holiday Collectio</span>,
 <span class="artist">Real Estate</span>,
 <span class="song">Water Underground</span>,
 <span class="release">Daniel</span>,
 <span class="artist">Charles Wesley Godwin</span>,
 <span class="song">All Again</span>,
 <span class="release">Family Ties</span>,
 <span class="artist">The Head &amp; the Heart</span>,
 <span class="song">Virginia (Wind in the Night)</span>,
 <span class="release">Every Shade of Blue</span>,
 <span class="artist">Viv &amp; Riley</span>,
 <span class="song">Imaginary People</span>,
 <span class="release">Imaginary People</span>,
 <span class="artist">Margo Price</span>,
 <span class="song">Strays</span>,
 <span class="release">Strays II (Act1: Topanga Canyon)</span>,
 <span class="artist">The Civil Wars</span>,
 <span class="song">Dust To Dust</span>,
 <span class="release">The Civil Wars</span>]

Notice that the HTML source code distinguishes between the three types of data point (artist, song, and release) with different class values. To limit this list to just the artists, we can pass "artist" as the second argument of .find_all(), which matches it against each tag's class attribute:

artistlist = wnrn.find_all("span", "artist")
artistlist
[<span class="artist">Fleet Foxes</span>,
 <span class="artist">Bob Marley &amp; The Wailers</span>,
 <span class="artist">Dori Freeman</span>,
 <span class="artist">LCD Soundsystem</span>,
 <span class="artist">Brigitte Calls Me Baby</span>,
 <span class="artist">Margaret Glaspy</span>,
 <span class="artist">Emmylou Harris</span>,
 <span class="artist">Butcher Brown</span>,
 <span class="artist">Lindsay Lou</span>,
 <span class="artist">Chet Faker</span>,
 <span class="artist">The Replacements</span>,
 <span class="artist">Real Estate</span>,
 <span class="artist">Jaime Wyatt</span>,
 <span class="artist">The Rolling Stones</span>,
 <span class="artist">Sonic Youth</span>,
 <span class="artist">The Steel Wheels</span>,
 <span class="artist">Free Union</span>,
 <span class="artist">Bryan Elijah Smith &amp; The Wild Hearts</span>,
 <span class="artist">Jenny Lewis</span>,
 <span class="artist">Slowdive</span>,
 <span class="artist">G. Love and Special Sauce</span>,
 <span class="artist">Vacations</span>,
 <span class="artist">Lucinda Williams</span>,
 <span class="artist">Amos Lee</span>,
 <span class="artist">Kurt Vile</span>,
 <span class="artist">Trousdale</span>,
 <span class="artist">Sarah Jarosz</span>,
 <span class="artist">Modest Mouse</span>,
 <span class="artist">Chris Stapleton</span>,
 <span class="artist">Danielle Ponder</span>,
 <span class="artist">The Psychedelic Furs</span>,
 <span class="artist">Cordovas</span>,
 <span class="artist">Ilsey</span>,
 <span class="artist">Neko Case</span>,
 <span class="artist">Jon Batiste</span>,
 <span class="artist">Gov't Mule f/ Ivan Neville &amp; Ruthie Foster</span>,
 <span class="artist">Haim</span>,
 <span class="artist">The Cars</span>,
 <span class="artist">Band of Horses</span>,
 <span class="artist">The Japanese House</span>,
 <span class="artist">Black Pumas</span>,
 <span class="artist">My Morning Jacket</span>,
 <span class="artist">Tracy Chapman</span>,
 <span class="artist">Brent Cobb</span>,
 <span class="artist">Vivian Leva &amp; Riley Calcagno</span>,
 <span class="artist">Abraham Alexander</span>,
 <span class="artist">Leisure</span>,
 <span class="artist">U2</span>,
 <span class="artist">Half Moon Run</span>,
 <span class="artist">Sir Chloe</span>,
 <span class="artist">Punch Brothers</span>,
 <span class="artist">L7</span>,
 <span class="artist">Watchhouse</span>,
 <span class="artist">Bully</span>,
 <span class="artist">Tyler Childers</span>,
 <span class="artist">Paul Simon</span>,
 <span class="artist">The Pretenders</span>,
 <span class="artist">Ghost of Vroom</span>,
 <span class="artist">Wilco</span>,
 <span class="artist">Margaret Glaspy</span>,
 <span class="artist">Jamila Woods f/ Saba</span>,
 <span class="artist">Phoebe Bridgers f/ Jackson Browne</span>,
 <span class="artist">Bahamas</span>,
 <span class="artist">Dogwood Tales</span>,
 <span class="artist">Jason Isbell &amp; The 400 Unit</span>,
 <span class="artist">Tennis</span>,
 <span class="artist">Fleetwood Mac</span>,
 <span class="artist">Big Thief</span>,
 <span class="artist">Blur</span>,
 <span class="artist">Tame Impala</span>,
 <span class="artist">Bon Iver</span>,
 <span class="artist">Jobi Riccio</span>,
 <span class="artist">Beirut</span>,
 <span class="artist">Yard Act</span>,
 <span class="artist">Olivia Dean f/ Leon Bridges</span>,
 <span class="artist">Coldplay</span>,
 <span class="artist">Nan Macmillan</span>,
 <span class="artist">The National f/ Rosanne Cash</span>,
 <span class="artist">The Velvet Underground</span>,
 <span class="artist">Shemekia Copeland</span>,
 <span class="artist">Yoke Lore</span>,
 <span class="artist">Thad Cockrell</span>,
 <span class="artist">Marty Stuart</span>,
 <span class="artist">Christone "Kingfish" Ingram</span>,
 <span class="artist">Gomez</span>,
 <span class="artist">Esther Rose</span>,
 <span class="artist">Bruce Springsteen</span>,
 <span class="artist">Paul Thorn</span>,
 <span class="artist">Bakar</span>,
 <span class="artist">Jesse Roper</span>,
 <span class="artist">Flipturn</span>,
 <span class="artist">John R. Miller</span>,
 <span class="artist">Jenny Owen Youngs</span>,
 <span class="artist">Peter, Bjorn, and John</span>,
 <span class="artist">Peter Gabriel</span>,
 <span class="artist">The Pink Stones f/ Nikki Lane</span>,
 <span class="artist">Donovan</span>,
 <span class="artist">Lake Street Dive</span>,
 <span class="artist">Hiss Golden Messenger</span>,
 <span class="artist">Jackie Greene</span>,
 <span class="artist">Marika Hackman</span>,
 <span class="artist">Eilen Jewell</span>,
 <span class="artist">Manchester Orchestra</span>,
 <span class="artist">49 Winchester</span>,
 <span class="artist">Brittany Howard</span>,
 <span class="artist">Big Star</span>,
 <span class="artist">Driftwood</span>,
 <span class="artist">Beck</span>,
 <span class="artist">Sharon Jones &amp; the Dap-Kings</span>,
 <span class="artist">Brenda Lee</span>,
 <span class="artist">Mitski</span>,
 <span class="artist">Lucinda Williams</span>,
 <span class="artist">Hem</span>,
 <span class="artist">The Beatles</span>,
 <span class="artist">Billy Strings</span>,
 <span class="artist">Boy Golden</span>,
 <span class="artist">Hurray for the Riff Raff</span>,
 <span class="artist">Josh Ritter</span>,
 <span class="artist">Cat Clyde</span>,
 <span class="artist">Middle Kids</span>,
 <span class="artist">Sylvan Esso</span>,
 <span class="artist">Jack Klatt</span>,
 <span class="artist">Tom Petty</span>,
 <span class="artist">Turnpike Troubadours</span>,
 <span class="artist">Cage the Elephant</span>,
 <span class="artist">Dave Matthews Band</span>,
 <span class="artist">J Mascis</span>,
 <span class="artist">Alison Krauss</span>,
 <span class="artist">Robbie Fulks</span>,
 <span class="artist">Black Pumas</span>,
 <span class="artist">Real Estate</span>,
 <span class="artist">Charles Wesley Godwin</span>,
 <span class="artist">The Head &amp; the Heart</span>,
 <span class="artist">Viv &amp; Riley</span>,
 <span class="artist">Margo Price</span>,
 <span class="artist">The Civil Wars</span>]
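
As a minimal sketch of what the second argument is doing, the positional form should return the same tags as the more explicit class_ keyword and the attrs dictionary. The HTML string and the names demo, by_position, by_keyword, and by_attrs are made up for illustration and are not part of the WNRN page:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the playlist HTML (hypothetical, not the real page)
html = """
<table>
  <tr><td><span class="artist">Fleet Foxes</span>
          <span class="song">Can I Believe You</span>
          <span class="release">Shore</span></td></tr>
  <tr><td><span class="artist">Dori Freeman</span>
          <span class="song">Why Do I Do This To Myself</span>
          <span class="release">Do You Recall</span></td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

# Positional form: a string in the second position is matched against class
by_position = demo.find_all("span", "artist")

# Keyword form: class_ has a trailing underscore because class is a reserved word in Python
by_keyword = demo.find_all("span", class_="artist")

# Dictionary form: works for any attribute, including class
by_attrs = demo.find_all("span", attrs={"class": "artist"})

# Pull out just the text between the opening and closing tags
artist_names = [tag.get_text() for tag in by_position]
print(artist_names)
```

The keyword form is a little longer, but it makes the intent easier to read when the code is revisited later.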

Likewise, we can create a list of the songs:

songlist = wnrn.find_all("span", "song")
songlist
[<span class="song">Can I Believe You</span>,
 <span class="song">One Love/People Get Ready</span>,
 <span class="song">Why Do I Do This To Myself</span>,
 <span class="song">Tonite</span>,
 <span class="song">Impressively Average</span>,
 <span class="song">Act Natural</span>,
 <span class="song">The Road</span>,
 <span class="song">No Way Around It</span>,
 <span class="song">On Your Side (Starman)</span>,
 <span class="song">Low</span>,
 <span class="song">I Will Dare</span>,
 <span class="song">Talking Backwards</span>,
 <span class="song">World Worth Keeping</span>,
 <span class="song">Angry</span>,
 <span class="song">Incinerate</span>,
 <span class="song">Scrape Me off the Ceiling</span>,
 <span class="song">It Gets Better</span>,
 <span class="song">Roses &amp; Wardens</span>,
 <span class="song">Cherry Baby</span>,
 <span class="song">Kisses</span>,
 <span class="song">Peace, Love, and Happiness</span>,
 <span class="song">Next Exit</span>,
 <span class="song">Greenville</span>,
 <span class="song">Greenville</span>,
 <span class="song">Pretty Pimpin'</span>,
 <span class="song">Point Your Finger</span>,
 <span class="song">Jealous Moon</span>,
 <span class="song">Dashboard</span>,
 <span class="song">Tennessee Whiskey</span>,
 <span class="song">Roll the Credits</span>,
 <span class="song">Pretty in Pink</span>,
 <span class="song">Fallen Angels of Rock 'n' Roll</span>,
 <span class="song">No California</span>,
 <span class="song">People Got a Lotta Nerve</span>,
 <span class="song">Calling Your Name</span>,
 <span class="song">Dreaming Out Loud</span>,
 <span class="song">The Wire</span>,
 <span class="song">Just What I Needed</span>,
 <span class="song">Crutch</span>,
 <span class="song">Sunshine Baby</span>,
 <span class="song">More Than A Love Song</span>,
 <span class="song">Big Decisions</span>,
 <span class="song">You're the One</span>,
 <span class="song">Southern Star</span>,
 <span class="song">Will You</span>,
 <span class="song">Tears Run Dry</span>,
 <span class="song">Back In Love</span>,
 <span class="song">One</span>,
 <span class="song">Alco</span>,
 <span class="song">Know Better</span>,
 <span class="song">Rye Whiskey</span>,
 <span class="song">Pretend We're Dead</span>,
 <span class="song">Belly of the Beast</span>,
 <span class="song">Days Move Slow</span>,
 <span class="song">In Your Love</span>,
 <span class="song">Hearts and Bones</span>,
 <span class="song">I'll Stand by You</span>,
 <span class="song">Pay the Man</span>,
 <span class="song">Random Name Generator</span>,
 <span class="song">Get Back</span>,
 <span class="song">Practice</span>,
 <span class="song">Christmas Song</span>,
 <span class="song">I'm Still</span>,
 <span class="song">Hard to be Anywhere</span>,
 <span class="song">If We Were Vampires</span>,
 <span class="song">Need Your Love</span>,
 <span class="song">Go Your Own Way</span>,
 <span class="song">Born For Loving You</span>,
 <span class="song">Barbaric</span>,
 <span class="song">Feels Like We Only Go Backwards</span>,
 <span class="song">Skinny Love</span>,
 <span class="song">Sweet</span>,
 <span class="song">Gibraltar</span>,
 <span class="song">Dream Job</span>,
 <span class="song">The Hardest Part</span>,
 <span class="song">Yellow</span>,
 <span class="song">Both Eyes Now</span>,
 <span class="song">Crumble</span>,
 <span class="song">Femme Fatale</span>,
 <span class="song">Clotilda's on Fire</span>,
 <span class="song">Hallucinate</span>,
 <span class="song">Warmth &amp; Beauty</span>,
 <span class="song">Sitting Alone</span>,
 <span class="song">Midnight Heat</span>,
 <span class="song">How We Operate</span>,
 <span class="song">Chet Baker</span>,
 <span class="song">Thunder Road</span>,
 <span class="song">Here We Go</span>,
 <span class="song">All Night</span>,
 <span class="song">Throw This Rope</span>,
 <span class="song">Playground</span>,
 <span class="song">Conspiracies, Cults &amp; UFOs</span>,
 <span class="song">It's Later Than You Think</span>,
 <span class="song">Young Folks</span>,
 <span class="song">Solsbury Hill</span>,
 <span class="song">Baby, I'm Still (Right Here With You)</span>,
 <span class="song">Sunshine Superman</span>,
 <span class="song">Hypotheticals</span>,
 <span class="song">The Wondering</span>,
 <span class="song">Prayer for Spanish Harlem</span>,
 <span class="song">No Caffeine</span>,
 <span class="song">Lethal Love</span>,
 <span class="song">Telepath</span>,
 <span class="song">Chemistry</span>,
 <span class="song">What Now</span>,
 <span class="song">September Gurls</span>,
 <span class="song">Lay Like You Do</span>,
 <span class="song">Think I'm in Love</span>,
 <span class="song">Stranger to My Happiness</span>,
 <span class="song">I'm Sorry</span>,
 <span class="song">My Love Mine All Mine</span>,
 <span class="song">Righteously</span>,
 <span class="song">Half Acre</span>,
 <span class="song">Now &amp; Then</span>,
 <span class="song">Fire Line</span>,
 <span class="song">Blue Hills</span>,
 <span class="song">Alibi</span>,
 <span class="song">Old Black Magic</span>,
 <span class="song">Everywhere I Go</span>,
 <span class="song">Dramamine</span>,
 <span class="song">Ferris Wheel</span>,
 <span class="song">Ramblin Kind</span>,
 <span class="song">Wildflowers</span>,
 <span class="song">Chipping Mill</span>,
 <span class="song">Social Cues</span>,
 <span class="song">Break Free</span>,
 <span class="song">Can't Believe We're Here</span>,
 <span class="song">Forget About It</span>,
 <span class="song">One Glass of Whiskey</span>,
 <span class="song">Christmas Will Really Be Christmas</span>,
 <span class="song">Water Underground</span>,
 <span class="song">All Again</span>,
 <span class="song">Virginia (Wind in the Night)</span>,
 <span class="song">Imaginary People</span>,
 <span class="song">Strays</span>,
 <span class="song">Dust To Dust</span>]

And a list for the albums:

albumlist = wnrn.find_all("span", "release")
albumlist
[<span class="release">Shore</span>,
 <span class="release">Exodus</span>,
 <span class="release">Do You Recall</span>,
 <span class="release">American Dream</span>,
 <span class="release">This House Is Made Of Corners</span>,
 <span class="release">Echo The Diamond</span>,
 <span class="release">Hard Bargain</span>,
 <span class="release">Solar Music</span>,
 <span class="release">Queen Of Time</span>,
 <span class="release">Hotel Surrender</span>,
 <span class="release">Let It Be</span>,
 <span class="release">Atlas</span>,
 <span class="release">Feel Good</span>,
 <span class="release">Hackney Diamonds</span>,
 <span class="release">Rather Ripped</span>,
 <span class="release">Wild as We Came Here</span>,
 <span class="release">(Single)</span>,
 <span class="release">From the Shenandoah Valley</span>,
 <span class="release">Joy'all</span>,
 <span class="release">Everything Is Alive</span>,
 <span class="release">Superhero Brother</span>,
 <span class="release">No Place Like Home</span>,
 <span class="release">Car Wheels On a Gravel Road</span>,
 <span class="release">Honeysuckle Switches (The Songs Of</span>,
 <span class="release">B'lieve I'm Goin' Down</span>,
 <span class="release">Out Of My Mind</span>,
 <span class="release">Polaroid Lovers</span>,
 <span class="release">We Were Dead Before The Ship Even Sank</span>,
 <span class="release">Traveler</span>,
 <span class="release">Some of Us Are Brave (Deluxe)</span>,
 <span class="release">Midnight To Midnight</span>,
 <span class="release">The Rose of Aces</span>,
 <span class="release">From The Valley</span>,
 <span class="release">Middle Cyclone</span>,
 <span class="release">World Music Radio</span>,
 <span class="release">Peace... Like a River</span>,
 <span class="release">Days Are Gone</span>,
 <span class="release">The Cars</span>,
 <span class="release">Things Are Great</span>,
 <span class="release">In The End It Always Does</span>,
 <span class="release">Chronicles Of A Diamond</span>,
 <span class="release">The Waterfall</span>,
 <span class="release">Let It Rain</span>,
 <span class="release">Southern Star</span>,
 <span class="release">Vivian Leva &amp; Riley Calcagno</span>,
 <span class="release">Sea/Sons</span>,
 <span class="release">Leisurevision</span>,
 <span class="release">Achtung Baby</span>,
 <span class="release">Salt</span>,
 <span class="release">I Am the Dog</span>,
 <span class="release">Antifogmatic</span>,
 <span class="release">Bricks Are Heavy</span>,
 <span class="release">(Single)</span>,
 <span class="release">Lucky for You</span>,
 <span class="release">Rustin' In The Rain</span>,
 <span class="release">Hearts And Bones</span>,
 <span class="release">Last of the Independents</span>,
 <span class="release">Ghost of Vroom III</span>,
 <span class="release">Star Wars</span>,
 <span class="release">Echo The Diamond</span>,
 <span class="release">Water Made Us</span>,
 <span class="release">(Single)</span>,
 <span class="release">Bootcut</span>,
 <span class="release">Closest Thing to Heaven</span>,
 <span class="release">The Nashville Sound</span>,
 <span class="release">Swimmer</span>,
 <span class="release">Rumours</span>,
 <span class="release">(Single)</span>,
 <span class="release">The Ballad of Darren</span>,
 <span class="release">Lonerism</span>,
 <span class="release">For Emma, Forever Ago</span>,
 <span class="release">Whiplash</span>,
 <span class="release">No No No</span>,
 <span class="release">Where's My Utopia?</span>,
 <span class="release">Messy</span>,
 <span class="release">Parachutes</span>,
 <span class="release">From Both Eyes</span>,
 <span class="release">Laugh Track</span>,
 <span class="release">The Velvet Underground &amp; Nico</span>,
 <span class="release">Uncivil War</span>,
 <span class="release">Toward a Never Ending New Beginning</span>,
 <span class="release">Warmth &amp; Beauty</span>,
 <span class="release">Altitude</span>,
 <span class="release">Live in London</span>,
 <span class="release">How We Operate</span>,
 <span class="release">Safe to Run</span>,
 <span class="release">Born to Run</span>,
 <span class="release">Never Too Late to Call</span>,
 <span class="release">Halo</span>,
 <span class="release">(Single)</span>,
 <span class="release">Shadowglow</span>,
 <span class="release">Heat Comes Down</span>,
 <span class="release">Avalanche</span>,
 <span class="release">Writer's Block</span>,
 <span class="release">Peter Gabriel</span>,
 <span class="release">You Know Who</span>,
 <span class="release">Sunshine Superman</span>,
 <span class="release">Obviously</span>,
 <span class="release">Jump for Joy</span>,
 <span class="release">Giving Up The Ghost</span>,
 <span class="release">Big Sigh</span>,
 <span class="release">Get Behind the Wheel</span>,
 <span class="release">The Million Masks of God</span>,
 <span class="release">(Single)</span>,
 <span class="release">What Now?</span>,
 <span class="release">Radio City</span>,
 <span class="release">Tree of Shade</span>,
 <span class="release">The Information</span>,
 <span class="release">Give The People What They Want</span>,
 <span class="release">(Single)</span>,
 <span class="release">The Land is Inhospitable and So Are We</span>,
 <span class="release">World Without Tears</span>,
 <span class="release">Rabbit Song</span>,
 <span class="release">(Single)</span>,
 <span class="release">Renewal</span>,
 <span class="release">For Jimmy</span>,
 <span class="release">The Past Is Still Alive</span>,
 <span class="release">Fever Breaks</span>,
 <span class="release">Down Rounder</span>,
 <span class="release">Faith Crisis Pt. 1</span>,
 <span class="release">Free Love</span>,
 <span class="release">It Ain't the Same</span>,
 <span class="release">Wildflowers</span>,
 <span class="release">A Cat in the Rain</span>,
 <span class="release">Social Cues</span>,
 <span class="release">Walk Around the Moon</span>,
 <span class="release">What Do We Do Now</span>,
 <span class="release">Forget About It</span>,
 <span class="release">Bluegrass Vacation</span>,
 <span class="release">Spotify Singles: Holiday Collectio</span>,
 <span class="release">Daniel</span>,
 <span class="release">Family Ties</span>,
 <span class="release">Every Shade of Blue</span>,
 <span class="release">Imaginary People</span>,
 <span class="release">Strays II (Act1: Topanga Canyon)</span>,
 <span class="release">The Civil Wars</span>]
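
The three lists above have the same length and the same order, so the items at a given position describe the same spin. One possible way to use that, sketched here on two spins copied from the output above, is to strip the tags with .get_text() and pair the values by position. The names demo, artists, songs, albums, and spins are assumptions for this illustration:

```python
from bs4 import BeautifulSoup

# The first two spins from the output above, used as a small stand-in for the page
html = """
<span class="artist">Fleet Foxes</span>
<span class="song">Can I Believe You</span>
<span class="release">Shore</span>
<span class="artist">Bob Marley &amp; The Wailers</span>
<span class="song">One Love/People Get Ready</span>
<span class="release">Exodus</span>
"""
demo = BeautifulSoup(html, "html.parser")

# get_text() drops the tag and decodes entities such as &amp;
artists = [tag.get_text() for tag in demo.find_all("span", "artist")]
songs = [tag.get_text() for tag in demo.find_all("span", "song")]
albums = [tag.get_text() for tag in demo.find_all("span", "release")]

# Pairing by position is only safe when the lists line up one-to-one
if not (len(artists) == len(songs) == len(albums)):
    raise ValueError("lists differ in length; positions may not match")

spins = list(zip(artists, songs, albums))
for artist, song, album in spins:
    print(f"{artist} | {song} | {album}")
```

Pairing by position assumes every spin on the page has all three spans. If a row were missing, say, its release, the lists would have different lengths and the length check would catch it.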

Finally, we also want to extract the time each song was played. I look at the HTML code and find an example of a play time. These times are stored in <td> tags with class="spin-time", each of which wraps the time in an <a> link. I create a list of these tags:

timelist = wnrn.find_all("td", "spin-time")
timelist
[<td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367926290">3:42 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367926083">3:39 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367925786">3:35 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367925379">3:28 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367925181">3:25 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367924801">3:19 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367924520">3:15 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367924229">3:11 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367923872">3:06 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367923588">3:01 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367923257">2:58 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367923022">2:54 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367922823">2:51 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367922482">2:45 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367922160">2:40 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367921991">2:37 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367921800">2:34 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367921452">2:28 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367921110">2:22 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367920820">2:17 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367920596">2:13 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367920407">2:10 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367920081">2:04 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367919881">2:01 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367919452">1:54 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367919252">1:51 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367918789">1:44 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367918511">1:39 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367918186">1:35 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367917749">1:29 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367917424">1:24 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367917247">1:21 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367916930">1:15 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367916770">1:13 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367916633">1:11 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367916203">1:05 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367915960">1:01 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367915535">12:57 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367915321">12:53 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367915119">12:50 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367914748">12:43 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367914539">12:40 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367914378">12:37 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367914072">12:31 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367913869">12:27 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367913665">12:23 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367913314">12:16 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367913014">12:12 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367912801">12:08 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367912527">12:03 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367912330">12:00 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367911977">11:56 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367911782">11:53 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367911626">11:50 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367911252">11:44 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367910836">11:38 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367910648">11:34 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367910363">11:29 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367910125">11:25 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367909879">11:22 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367909571">11:16 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367909360">11:12 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367909106">11:09 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367908790">11:03 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367908574">11:00 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367908343">10:56 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367908127">10:52 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367907972">10:49 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367907574">10:42 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367907370">10:39 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367907171">10:35 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367906872">10:30 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367906600">10:26 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367906437">10:23 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367906165">10:18 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367905913">10:13 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367905753">10:10 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367905344">10:04 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367905188">10:01 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367904819">9:56 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367904656">9:53 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367904242">9:48 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367904092">9:45 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367903841">9:40 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367903649">9:37 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367903254">9:30 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367902986">9:25 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367902808">9:22 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367902627">9:18 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367902387">9:14 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367902209">9:11 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367902065">9:08 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367901790">9:04 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367901536">8:59 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367901246">8:55 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367900762">8:47 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367900591">8:43 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367900433">8:40 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367900098">8:34 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367899612">8:24 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367899416">8:21 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367899170">8:16 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367899035">8:13 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367898865">8:10 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367898585">8:04 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367898454">8:02 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367898179">7:58 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367897980">7:54 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367897802">7:51 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367897657">7:48 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367897408">7:43 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367897146">7:39 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367896943">7:36 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367896565">7:29 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367896245">7:23 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367896013">7:19 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367895764">7:14 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367895525">7:10 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367895394">7:08 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367895091">7:03 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367894890">7:00 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367894672">6:57 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367894499">6:54 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367894289">6:51 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367893937">6:45 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367893720">6:41 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367893500">6:37 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367893347">6:34 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367893061">6:29 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367892905">6:26 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367892739">6:22 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367892441">6:17 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367892253">6:13 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367892002">6:09 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367891692">6:04 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/18266906/WNRN?sp=367891473">6:01 AM</a></td>]

Sometimes the information we need exists in a particular tag, but only when a specific attribute is present. For example, in the WNRN playlist HTML there are many <a> tags, but only some of those tags include a title attribute. To extract all of the <a> tags with a title attribute, specify title=True in the call to .find_all():

atags_title = wnrn.find_all("a", title=True)
print(atags_title[0:5]) # just show the first 5 elements
[<a class="buy-link" data-vendor="apple" href="#" target="_blank" title='View "Fleet Foxes - Can I Believe You" on Apple'><div alt='View "Fleet Foxes - Can I Believe You" on Apple' class="buy-icon buy-icon-apple"></div></a>, <a class="buy-link" data-vendor="amazon" href="#" target="_blank" title='View "Fleet Foxes - Can I Believe You" on Amazon'><div alt='View "Fleet Foxes - Can I Believe You" on Amazon' class="buy-icon buy-icon-amazon"></div></a>, <a class="buy-link" data-vendor="spotify" href="#" target="_blank" title='View "Fleet Foxes - Can I Believe You" on Spotify'><div alt='View "Fleet Foxes - Can I Believe You" on Spotify' class="buy-icon buy-icon-spotify"></div></a>, <a class="buy-link" data-vendor="apple" href="#" target="_blank" title='View "Bob Marley &amp; The Wailers - One Love/People Get Ready" on Apple'><div alt='View "Bob Marley &amp; The Wailers - One Love/People Get Ready" on Apple' class="buy-icon buy-icon-apple"></div></a>, <a class="buy-link" data-vendor="amazon" href="#" target="_blank" title='View "Bob Marley &amp; The Wailers - One Love/People Get Ready" on Amazon'><div alt='View "Bob Marley &amp; The Wailers - One Love/People Get Ready" on Amazon' class="buy-icon buy-icon-amazon"></div></a>]
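To see the attribute filter in isolation, here is a minimal sketch on a few lines of toy HTML written for this example (it is not the real WNRN page, and the names html, soup, with_title, and titles are placeholders). It assumes the "html.parser" parser that ships with Python:

```python
from bs4 import BeautifulSoup

# Toy HTML written for this example; it is not the real WNRN page
html = """
<a href="/one" title="First link">one</a>
<a href="/two">two</a>
<a href="/three" title="Third link">three</a>
"""
soup = BeautifulSoup(html, "html.parser")

# title=True keeps only the <a> tags that have a title attribute
with_title = soup.find_all("a", title=True)
titles = [a["title"] for a in with_title]
print(titles)  # ['First link', 'Third link']
```

The second <a> tag has no title attribute, so it is left out of the result.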

5.4.4. Constructing a Data Frame from HTML Data#

Next we need to place these data into a clean data frame. For that, we will need to keep the valid data while dropping the HTML tags. We stored the tags with the artists, songs, albums, and times in separate lists. Every name is stored as a navigable string in the HTML tags, so to extract these names we need to loop across the elements of the list. The simplest loop for this task is called a list comprehension, which has the following syntax:

newlist = [ expression for item in oldlist if condition ]

In this syntax, we are creating a new list by iteratively performing operations on the elements of an existing list (oldlist). item is a token that we will use to represent one item of the existing list. expression is the same Python code we would use on a single element of the existing list, except we replace the name of the element with the token defined with item. Finally condition is an optional part of this code which sets a filter by which only certain elements of the old list are transformed and placed into the new list (there’s an example of conditioning in a comprehension loop in the section on spiders).
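As a small illustration of this syntax, independent of the WNRN data, the sketch below builds two new lists from a made-up list of strings. The names oldlist, lengths, and long_words are placeholders chosen for this example:

```python
# A made-up list, used only to illustrate the syntax
oldlist = ["Wilco", "Beck", "Slowdive", "U2"]

# No condition: apply the expression to every item
lengths = [len(item) for item in oldlist]

# With a condition: keep only the items that pass the filter
long_words = [item.upper() for item in oldlist if len(item) > 4]

print(lengths)     # [5, 4, 8, 2]
print(long_words)  # ['WILCO', 'SLOWDIVE']
```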

For example, to extract the navigable string from every element of artistlist, we can set item to a, expression to a.string, and oldlist to artistlist:

artists = [a.string for a in artistlist]
artists
['Fleet Foxes',
 'Bob Marley & The Wailers',
 'Dori Freeman',
 'LCD Soundsystem',
 'Brigitte Calls Me Baby',
 'Margaret Glaspy',
 'Emmylou Harris',
 'Butcher Brown',
 'Lindsay Lou',
 'Chet Faker',
 'The Replacements',
 'Real Estate',
 'Jaime Wyatt',
 'The Rolling Stones',
 'Sonic Youth',
 'The Steel Wheels',
 'Free Union',
 'Bryan Elijah Smith & The Wild Hearts',
 'Jenny Lewis',
 'Slowdive',
 'G. Love and Special Sauce',
 'Vacations',
 'Lucinda Williams',
 'Amos Lee',
 'Kurt Vile',
 'Trousdale',
 'Sarah Jarosz',
 'Modest Mouse',
 'Chris Stapleton',
 'Danielle Ponder',
 'The Psychedelic Furs',
 'Cordovas',
 'Ilsey',
 'Neko Case',
 'Jon Batiste',
 "Gov't Mule f/ Ivan Neville & Ruthie Foster",
 'Haim',
 'The Cars',
 'Band of Horses',
 'The Japanese House',
 'Black Pumas',
 'My Morning Jacket',
 'Tracy Chapman',
 'Brent Cobb',
 'Vivian Leva & Riley Calcagno',
 'Abraham Alexander',
 'Leisure',
 'U2',
 'Half Moon Run',
 'Sir Chloe',
 'Punch Brothers',
 'L7',
 'Watchhouse',
 'Bully',
 'Tyler Childers',
 'Paul Simon',
 'The Pretenders',
 'Ghost of Vroom',
 'Wilco',
 'Margaret Glaspy',
 'Jamila Woods f/ Saba',
 'Phoebe Bridgers f/ Jackson Browne',
 'Bahamas',
 'Dogwood Tales',
 'Jason Isbell & The 400 Unit',
 'Tennis',
 'Fleetwood Mac',
 'Big Thief',
 'Blur',
 'Tame Impala',
 'Bon Iver',
 'Jobi Riccio',
 'Beirut',
 'Yard Act',
 'Olivia Dean f/ Leon Bridges',
 'Coldplay',
 'Nan Macmillan',
 'The National f/ Rosanne Cash',
 'The Velvet Underground',
 'Shemekia Copeland',
 'Yoke Lore',
 'Thad Cockrell',
 'Marty Stuart',
 'Christone "Kingfish" Ingram',
 'Gomez',
 'Esther Rose',
 'Bruce Springsteen',
 'Paul Thorn',
 'Bakar',
 'Jesse Roper',
 'Flipturn',
 'John R. Miller',
 'Jenny Owen Youngs',
 'Peter, Bjorn, and John',
 'Peter Gabriel',
 'The Pink Stones f/ Nikki Lane',
 'Donovan',
 'Lake Street Dive',
 'Hiss Golden Messenger',
 'Jackie Greene',
 'Marika Hackman',
 'Eilen Jewell',
 'Manchester Orchestra',
 '49 Winchester',
 'Brittany Howard',
 'Big Star',
 'Driftwood',
 'Beck',
 'Sharon Jones & the Dap-Kings',
 'Brenda Lee',
 'Mitski',
 'Lucinda Williams',
 'Hem',
 'The Beatles',
 'Billy Strings',
 'Boy Golden',
 'Hurray for the Riff Raff',
 'Josh Ritter',
 'Cat Clyde',
 'Middle Kids',
 'Sylvan Esso',
 'Jack Klatt',
 'Tom Petty',
 'Turnpike Troubadours',
 'Cage the Elephant',
 'Dave Matthews Band',
 'J Mascis',
 'Alison Krauss',
 'Robbie Fulks',
 'Black Pumas',
 'Real Estate',
 'Charles Wesley Godwin',
 'The Head & the Heart',
 'Viv & Riley',
 'Margo Price',
 'The Civil Wars']

Likewise, we extract the navigable strings for the songs, albums, and times:

songs = [a.string for a in songlist]
albums = [a.string for a in albumlist]
times = [a.string for a in timelist]
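One caveat worth keeping in mind: .string returns None when a tag contains more than one child, for example when part of a name is wrapped in another tag. In that situation .get_text() is a common alternative. A minimal sketch with toy HTML (not taken from the WNRN page; the variable names are placeholders):

```python
from bs4 import BeautifulSoup

# Toy HTML: the second span holds a string plus a nested <b> tag
html = '<span class="artist">Fleet Foxes</span><span class="artist">Bob <b>Marley</b></span>'
soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span", "artist")

strings = [s.string for s in spans]    # None when a tag has several children
texts = [s.get_text() for s in spans]  # joins all of the text inside the tag

print(strings)  # ['Fleet Foxes', None]
print(texts)    # ['Fleet Foxes', 'Bob Marley']
```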

Finally, to construct a clean data frame, we create a dictionary that combines these lists and pass this dictionary to the pd.DataFrame() function:

mydict = {'time': times,
          'artist': artists,
          'song': songs,
          'album': albums}
wnrn_df = pd.DataFrame(mydict)
wnrn_df
     time     artist                    song                          album
0    3:42 PM  Fleet Foxes               Can I Believe You             Shore
1    3:39 PM  Bob Marley & The Wailers  One Love/People Get Ready     Exodus
2    3:35 PM  Dori Freeman              Why Do I Do This To Myself    Do You Recall
3    3:28 PM  LCD Soundsystem           Tonite                        American Dream
4    3:25 PM  Brigitte Calls Me Baby    Impressively Average          This House Is Made Of Corners
...  ...      ...                       ...                           ...
131  6:17 AM  Charles Wesley Godwin     All Again                     Family Ties
132  6:13 AM  The Head & the Heart      Virginia (Wind in the Night)  Every Shade of Blue
133  6:09 AM  Viv & Riley               Imaginary People              Imaginary People
134  6:04 AM  Margo Price               Strays                        Strays II (Act1: Topanga Canyon)
135  6:01 AM  The Civil Wars            Dust To Dust                  The Civil Wars

136 rows × 4 columns

5.5. Building a Spider#

At the bottom of the WNRN playlist on https://spinitron.com/WNRN/ there are links to older song playlists. Let’s extend our example by building a spider to capture the data that exists on these links as well. A spider is a web scraper that follows links on a page automatically and scrapes from those links as well.

I look at the page source for these links, and find that they are contained in a <div class="recent-playlists"> tag. I start by finding this tag. As there’s only one occurrence, I can use .find() instead of .find_all():

recent = wnrn.find("div", "recent-playlists")
recent
<div class="recent-playlists">
<h4>Recent</h4>
<div class="grid-view" id="w2"><div class="summary"></div>
<table class="table table-bordered table-narrow"><tbody>
<tr data-key="0"><td class="show-time">5:00 AM</td><td></td><td><strong><a href="/WNRN/pl/18266814/WNRN-12-11-23-5-02-AM">WNRN 12/11/23, 5:02 AM</a></strong> with <a href="/WNRN/dj/104061/WNRN">WNRN</a></td></tr>
<tr data-key="1"><td class="show-time">4:00 AM</td><td></td><td><strong><a href="/WNRN/pl/18266673/WNRN-12-11-23-4-02-AM">WNRN 12/11/23, 4:02 AM</a></strong> with <a href="/WNRN/dj/104061/WNRN">WNRN</a></td></tr>
<tr data-key="2"><td class="show-time">3:00 AM</td><td></td><td><strong><a href="/WNRN/pl/18266538/WNRN-12-11-23-3-04-AM">WNRN 12/11/23, 3:04 AM</a></strong> with <a href="/WNRN/dj/104061/WNRN">WNRN</a></td></tr>
<tr data-key="3"><td class="show-time">12:00 AM</td><td></td><td><strong><a href="/WNRN/pl/18266023/WNRN">WNRN</a></strong> (Music) with <a href="/WNRN/dj/104073/WNRN">WNRN</a></td></tr>
<tr data-key="4"><td class="show-time">9:00 PM</td><td></td><td><strong><a href="/WNRN/pl/18265453/WNRN">WNRN</a></strong> (Music) with <a href="/WNRN/dj/104073/WNRN">WNRN</a></td></tr>
</tbody></table>
</div></div>

Notice that all of the addresses we need are contained in <a> tags. We can extract these <a> tags with .find_all():

recent_atags = recent.find_all("a")
recent_atags
[<a href="/WNRN/pl/18266814/WNRN-12-11-23-5-02-AM">WNRN 12/11/23, 5:02 AM</a>,
 <a href="/WNRN/dj/104061/WNRN">WNRN</a>,
 <a href="/WNRN/pl/18266673/WNRN-12-11-23-4-02-AM">WNRN 12/11/23, 4:02 AM</a>,
 <a href="/WNRN/dj/104061/WNRN">WNRN</a>,
 <a href="/WNRN/pl/18266538/WNRN-12-11-23-3-04-AM">WNRN 12/11/23, 3:04 AM</a>,
 <a href="/WNRN/dj/104061/WNRN">WNRN</a>,
 <a href="/WNRN/pl/18266023/WNRN">WNRN</a>,
 <a href="/WNRN/dj/104073/WNRN">WNRN</a>,
 <a href="/WNRN/pl/18265453/WNRN">WNRN</a>,
 <a href="/WNRN/dj/104073/WNRN">WNRN</a>]

The resulting list contains the web endpoints we need, and also some web endpoints we don’t need: we want the URLs that contain the string /pl/ as these are playlists, and we want to exclude the URLs that contain the string /dj/ as these pages refer to a particular DJ. We need a comprehension loop that loops across these elements, extracts the href attribute of the entries that include /pl/, and ignores the entries that include /dj/. We again use this syntax:

newlist = [ expression for item in oldlist if condition ]

In this case:

  • newlist is a list containing the URLs we want to direct our spider to. I call it wnrn_url.

  • item is one element of recent_atags, which I will call pl.

  • expression is code that extracts the web address from the href attribute of the <a> tag, so here the code would be pl['href'].

  • Finally, condition is a logical statement that should be True if the web address contains /pl/ and False if the web address contains /dj/. Here, the conditional statement should be if "/pl/" in pl['href']. This code will look for the string "/pl/" inside the string called by pl['href'] and return True or False depending on whether this string is found.

Putting all this syntax together gives us our list of playlist URLs:

wnrn_url = [pl['href'] for pl in recent_atags if "/pl/" in pl['href']]
wnrn_url
['/WNRN/pl/18266814/WNRN-12-11-23-5-02-AM',
 '/WNRN/pl/18266673/WNRN-12-11-23-4-02-AM',
 '/WNRN/pl/18266538/WNRN-12-11-23-3-04-AM',
 '/WNRN/pl/18266023/WNRN',
 '/WNRN/pl/18265453/WNRN']

Before the spider can follow these links, we need to collect all of the code we created above to extract the artist, song, album, and play times from the HTML code. We define a function that does all of this work. We specify one argument for this function, the URL, so that all the function needs is the URL and it can output a clean data frame. I name the function wnrn_spider():

def wnrn_spider(url):
    """Perform web scraping for any WNRN playlist given the available link"""
    
    headers = {'user-agent': 'Kropko class example (jkropko@virginia.edu)'}
    r = requests.get(url, headers=headers)
    wnrn = BeautifulSoup(r.text, 'html')
    
    artistlist = wnrn.find_all("span", "artist")
    songlist = wnrn.find_all("span", "song")
    albumlist = wnrn.find_all("span", "release")
    timelist = wnrn.find_all("td", "spin-time")
    
    artists = [a.string for a in artistlist]
    songs = [a.string for a in songlist]
    albums = [a.string for a in albumlist]
    times = [a.string for a in timelist]
    
    mydict = {'time':times, 'artist':artists, 'song':songs, 'album':albums}
    wnrn_df = pd.DataFrame(mydict)
    
    return wnrn_df

We can pass any of the URLs we collected to our function and get the other playlists. We will have to add the domain “https://spinitron.com” to the beginning of each of the URLs we collected:

wnrn2 = wnrn_spider('https://spinitron.com' + wnrn_url[0])
wnrn2
    time     artist                           song                      album
0   5:02 AM  Vieux Farka Toure et Khruangbin  Tongo Barra               Ali
1   5:08 AM  Dogwood Tales                    Stranger                  Rodeo EP
2   5:11 AM  The Pretenders                   Back on the Chain Gang    Learning to Crawl
3   5:15 AM  I'm With Her                     See You Around            See You Around
4   5:18 AM  Charlie Mars                     Country Home              Times Have Changed
5   5:22 AM  Brigitte Calls Me Baby           Impressively Average      This House Is Made Of Corners
6   5:26 AM  Maggie Rogers                    Love You for a Long Time  (Single)
7   5:29 AM  Dave Carter and Tracy Grammar    Crocodile Man             Tanglewood Tree
8   5:32 AM  David Bowie                      Changes                   Hunky Dory
9   5:36 AM  JK Mabry                         Governor's Son            Out of State
10  5:42 AM  The Lowlies                      Drink From the Well       The Lowlies
11  5:45 AM  Fleet Foxes                      Sunblind                  Shore
12  5:49 AM  Van Morrison                     Shakin' All Over          Accentuate The Positive
13  5:52 AM  Darrell Scott String Band        Cumberland Plateau        Old Cane Back Rocker
14  5:55 AM  Arcade Fire                      We Used to Wait           The Suburbs
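As an alternative to gluing the domain and the path together by hand, the standard library's urllib.parse.urljoin() can combine them and takes care of the slash between the two parts. This is only a sketch of one option, not the approach used in the rest of this section; base, path, and full are placeholder names, and the path is copied from the first entry of wnrn_url shown earlier:

```python
from urllib.parse import urljoin

base = "https://spinitron.com"
path = "/WNRN/pl/18266814/WNRN-12-11-23-5-02-AM"  # first entry of wnrn_url

# urljoin handles the slash whether or not base ends with one
full = urljoin(base, path)
print(full)  # https://spinitron.com/WNRN/pl/18266814/WNRN-12-11-23-5-02-AM
```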

Our goal here is to loop across all the URLs we collected, extract the data in a clean data frame, and append these data frames together to construct a longer playlist. To do that, we will use a for loop, which has the following syntax:

for index in list:
    expressions

This syntax is similar to the syntax we used to build a comprehension loop. list is an existing list, and index stands in for one element of this list. For each element of the list, we execute the code contained in expressions, which can use the index.
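To make the comparison with a comprehension concrete, here is a minimal sketch in which a for loop and a comprehension build the same list. The paths are made up for this example, and paths, playlists, and same are placeholder names:

```python
# Made-up relative paths, used only to illustrate the loop
paths = ["/WNRN/pl/1/A", "/WNRN/dj/2/B", "/WNRN/pl/3/C"]

playlists = []
for p in paths:              # p stands in for one element of paths
    if "/pl/" in p:          # same role as the condition in a comprehension
        playlists.append(p)  # this is the list method .append()

# The equivalent comprehension
same = [p for p in paths if "/pl/" in p]

print(playlists)          # ['/WNRN/pl/1/A', '/WNRN/pl/3/C']
print(playlists == same)  # True
```

A for loop is the better fit when each pass needs several statements, as the spider does.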

For our spider, we will use the following steps:

  1. We take the data we already scraped from https://spinitron.com/WNRN (saved as wnrn_df) and clone it as a new variable named wnrn_total_playlist. It is important that we make a copy, and that we do not overwrite wnrn_df. We will be repeatedly saving over wnrn_total_playlist within the loop, and if we do not overwrite wnrn_df, it gives us a stable data frame to return to as a starting point if we need to rerun this loop.

  2. We use a for loop to loop across all the web addresses inside wnrn_url.

  3. In the for loop, we use the wnrn_spider() function to extract the playlist data from each of the URLs inside wnrn_url.

  4. In the for loop, we use the pd.concat() function to attach the new data to the bottom of the existing data, matching corresponding columns. (Older versions of pandas offered an .append() method on data frames for this purpose, but it was removed in pandas 2.0.)

The code is as follows:

wnrn_total_playlist = wnrn_df.copy()
for w in wnrn_url:
    moredata = wnrn_spider('https://spinitron.com' + w)
    wnrn_total_playlist = pd.concat([wnrn_total_playlist, moredata])

We now have a data frame that combines all of the playlists on https://spinitron.com/WNRN and on the playlists linked to under “Recent”:

wnrn_total_playlist
     time     artist                           song                album
0    8:31 PM  Nick Lowe                        Lay It On Me Baby   Lay It on Me
1    8:27 PM  Okkervil River                   Lost Coastlines     The Stand Ins
2    8:24 PM  Perfume Genius                   On the Floor        Set My Heart on Fire Immediately
3    8:20 PM  Stray Fossa                      It's Nothing        (Single)
4    8:13 PM  Lianne La Havas                  Can't Fight         Lianne La Havas
...  ...      ...                              ...                 ...
12   4:44 AM  Chicano Batman                   Color My Life       Invisible People
13   4:48 AM  Laura Marling                    Held Down           Song for Our Daughter
14   4:52 AM  J. Roddy Walston & The Business  Sweat Shock         Essential Tremors
15   4:55 AM  Becca Mancari                    Hunter              The Greatest Part
16   4:58 AM  Cordovas                         This Town’s a Drag  That Santa Fe Channel

209 rows × 4 columns
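Notice that the row labels in this output restart with each playlist: the last rows are labeled 12 through 16 even though the data frame has 209 rows. If unique row labels are wanted, one option is to collect the data frames in a list and concatenate them once with ignore_index=True. The sketch below uses two tiny made-up data frames standing in for scraped playlists (first, second, frames, stacked, and relabeled are placeholder names):

```python
import pandas as pd

# Two tiny made-up data frames standing in for scraped playlists
first = pd.DataFrame({"time": ["6:04 AM", "6:01 AM"],
                      "artist": ["Margo Price", "The Civil Wars"]})
second = pd.DataFrame({"time": ["5:02 AM"],
                       "artist": ["Fleet Foxes"]})

frames = [first, second]
stacked = pd.concat(frames)                       # keeps each frame's own labels
relabeled = pd.concat(frames, ignore_index=True)  # assigns fresh labels

print(list(stacked.index))    # [0, 1, 0]
print(list(relabeled.index))  # [0, 1, 2]
```

Concatenating once at the end also avoids copying the growing data frame on every pass through the loop, which can matter when a spider follows many links.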