5. Web Scraping Using BeautifulSoup

5.2. How Websites Prevent You From Scraping

This discussion follows the excellent overview by a Stack Overflow and GitHub contributor with the username JonasCz (I wish I knew this user’s real name!) on how to prevent web scraping.

To understand the restrictions and challenges you will encounter when scraping data, put yourself in the position of a website’s owner:

If you own and maintain a website, there are many reasons why you might want to prevent web scraping bots from accessing the data on your website. Maybe the bots will overload the traffic to your site and make it impossible for your website to work as you intend. You might be running a business through this website, and sharing the data in mass transfers would undercut your business. For whatever reason, you are now faced with a challenge: how do you prevent automated scraping of the data on your webpage while still allowing individual customers to view your website?

Web scraping will require issuing HTTP requests to a particular web address with a tool like requests, sometimes many times in a short period. Every HTTP request is logged by the server that receives the request, and these logs contain the IP address of the entity making the request. If too many requests are made by the same IP address, the server can block that IP address. The coding logic to automatically identify and block overactive IP addresses is simple, so many websites include these security measures. Some blocks are temporary, placing a rate limit on these requests to slow down the scrapers, and some blocks reroute scrapers through a CAPTCHA (which stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”) to prevent robots like a scraper from accessing the website. JonasCz recommends that these security measures look at other factors as well: the speed of actions on the website, the amount of data requested, and other factors that can identify a user when the IP address is masked.
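To make the blocking logic concrete, here is a minimal sketch of how a server might count recent requests from each IP address. This is a hypothetical illustration, not code from any particular web server: the RateLimiter class, the limit of 5 requests per 10 seconds, and the example IP address are all assumptions made for this example.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Hypothetical sliding-window limiter: at most max_requests per window_seconds per IP."""

    def __init__(self, max_requests=5, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # IP address -> timestamps of recent requests

    def allow(self, ip, now):
        """Return True if the request arriving at time `now` (in seconds) should be served."""
        recent = self.history[ip]
        # Forget requests that have fallen out of the time window
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # too many recent requests: block or challenge this IP
        recent.append(now)
        return True


limiter = RateLimiter(max_requests=5, window_seconds=10.0)
# A scraper at a made-up IP address sends 7 requests, one every half second
decisions = [limiter.allow("203.0.113.7", now=0.5 * i) for i in range(7)]
print(decisions)
```

The first five requests are served and the last two are refused. A scraper that waits between requests never fills the window, which is why slowing down is usually the simplest way to avoid this kind of block.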

Stronger gates, such as making users register for a username and password with email confirmation to use your website, are effective against scraping bots. But they also turn away individuals who wouldn’t want to jump through those hoops. Saving all text as images on your server will prevent bots from accessing the text very easily, but it makes the website harder to use and violates regulations that protect people with disabilities.

Instead, JonasCz recommends building your website in a way that never reveals the entirety of the data you own, and never reveals the private API endpoints you use to display the data. Also, web scrapers are fragile: they are built to pull data from the specific HTML structure of a particular website. Changing the HTML code frequently or using different versions of the code based on geographic location will break the scrapers that are built for that code. JonasCz also suggests adding “honeypot” links to the HTML code that will not be displayed to legitimate users but will be followed by scrapers that recursively follow links, and taking action against the agents that follow these links: block their IP addresses, require a CAPTCHA, or deliver fake data.

One important piece of information in a request is the user agent header (which we discuss in more detail below). JonasCz recommends looking at this information and blocking requests when the user agent is blank or matches information from agents that have previously been identified as malicious bots.

Understanding the steps you would take to protect your data from bots if you owned a website, you should have greater insight into why a web scraping endeavor may fail. Your web scraper might not be malicious, but might still violate the rules that the website owner set up to guard against bots. These rules are usually listed explicitly in a file on the server called robots.txt. Some tips for reading and understanding a robots.txt file are here: https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/

For example, in this document we will be scraping data on the playlist of a radio station from https://spinitron.com/. This website has a robots.txt file here: https://spinitron.com/robots.txt, which reads:

User-agent: *
Crawl-delay: 10
Request-rate: 1/10

The User-agent: * line tells us that the next two lines apply to all user agent strings. Crawl-delay: 10 places a limit on the frequency with which our scraper can make a request from this website. In this case, individual requests must be made at least 10 seconds apart. Request-rate: 1/10 tells us that our scraper is only allowed to access one page every 10 seconds, and that we are not allowed to make requests from more than one page at the same time.
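These rules can also be read programmatically. The following sketch uses the urllib.robotparser module from Python's standard library to parse the three lines quoted above. I pass the text in directly rather than downloading it, so no request is made; the agent name is just the example name used in this chapter.

```python
from urllib.robotparser import RobotFileParser

# The three lines of robots.txt quoted above
robots_lines = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Request-rate: 1/10",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the text directly; no network request is made

agent = "Kropko class example"  # any agent name is covered by the * entry
delay = parser.crawl_delay(agent)
rate = parser.request_rate(agent)
allowed = parser.can_fetch(agent, "https://spinitron.com/WNRN/")

print(delay)                        # seconds to wait between requests
print(rate.requests, rate.seconds)  # requests allowed per number of seconds
print(allowed)                      # there are no Disallow lines, so the page may be fetched
```

In a scraper that requests several pages, the value of delay could be passed to time.sleep() between requests so that the scraper stays within these limits.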

5.3. Using requests with a User Agent Header

As the articles by James Densmore and JonasCz described, requests are much more likely to get blocked by websites if the request does not specify a header that contains a user agent. An HTTP header is a parameter that gets sent along with the HTTP request that contains metadata about the request. A user agent header contains contact and identification information about the person making the request. If there is any issue with your web scraper, you want to give the website owner a chance to contact you directly about that problem. If you do not feel comfortable being contacted by the website’s owner, you should reconsider whether you should be scraping that website.

Fortunately, it is straightforward to include headers in a GET request using requests: just use the headers argument. First, we import the relevant libraries:

import numpy as np
import pandas as pd
import requests

In module 4 we issued GET requests from the Wikipedia API as an example.

r = requests.get("https://en.wikipedia.org/w/api.php")
r
<Response [200]>

To add a user agent string, I use the following code:

headers = {'user-agent': 'Kropko class example (jkropko@virginia.edu)'}
r = requests.get("https://en.wikipedia.org/w/api.php", headers = headers)
r
<Response [200]>
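To see what the headers argument changes, we can build a request without sending it and inspect the user agent that would go out. This is a sketch rather than a required step: it uses a requests.Session() object only to prepare the requests, so no network traffic occurs.

```python
import requests

url = "https://en.wikipedia.org/w/api.php"
headers = {'user-agent': 'Kropko class example (jkropko@virginia.edu)'}

session = requests.Session()

# Build the requests without sending them, so no network traffic occurs
default_request = session.prepare_request(requests.Request("GET", url))
custom_request = session.prepare_request(requests.Request("GET", url, headers=headers))

print(default_request.headers["User-Agent"])  # the library default, python-requests/<version>
print(custom_request.headers["User-Agent"])   # the string we supplied
```

Without the headers argument, the request announces itself only as the requests library, which gives a website owner no way to tell who is running the scraper or how to reach them.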

What information needs to go into a user agent header? Different resources give different advice. According to Amazon Web Services, a user agent should identify your application, its version number, and programming language. Following that advice, a user agent might look like this:

headers = {'user-agent': 'Kropko class example version 1.0 (jkropko@virginia.edu) (Language=Python 3.8.2; Platform=Mac OSX 10.15.5)'}
r = requests.get("https://en.wikipedia.org/w/api.php", headers = headers)
r
<Response [200]>

Including a user agent is not hard, and it goes a long way towards alleviating the anxieties that website owners have about dealing with your web scraping code. It is a good practice to cultivate into a habit.
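Rather than typing the Python version and platform by hand, these details can be filled in automatically. The sketch below uses the platform module from the standard library; build_user_agent() is a hypothetical helper written for this example, and the layout of the string simply imitates the one shown above.

```python
import platform


def build_user_agent(app_name, version, contact_email):
    """Assemble a user agent string in the style shown above (hypothetical helper)."""
    return (
        f"{app_name} version {version} ({contact_email}) "
        f"(Language=Python {platform.python_version()}; "
        f"Platform={platform.system()} {platform.release()})"
    )


user_agent = build_user_agent("Kropko class example", "1.0", "jkropko@virginia.edu")
headers = {"user-agent": user_agent}
print(headers)
```

The resulting headers dictionary can be passed to requests.get() in the same way as before.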

5.4. Using BeautifulSoup() (Example: WNRN, Charlottesville’s Legendary Radio Station)

WNRN is a legendary radio station, and it’s based right here in Charlottesville at 91.9 FM (and streaming online at www.wnrn.org). It’s commercial-free, with only a few interruptions for local nonprofits to tell you about cool things happening in town. They play a mix of new and classic alternative rock and R&B. They emphasize music for bands coming to play at local venues. And they play the Grateful Dead on Saturday mornings. You should be listening to WNRN!

The playlist of the songs that WNRN has played in the last few hours is here: https://spinitron.com/WNRN/. I want to scrape the data off this website. I also want to scrape the data off of the additional playlists that this website links to, to collect as much data as possible. Our goal in this example is to create a dataframe of each song WNRN has played, the artist, the album, and the time each song was played.

The process involves four steps:

  1. Download the raw text of the HTML code for the website we want to scrape using the requests library.

  2. Use the BeautifulSoup() function from the bs4 library to parse the raw text so that Python can understand, search through, and operate on the HTML tags in the string.

  3. Use methods associated with BeautifulSoup() to extract the data we need from the HTML code.

  4. Place the data into a pandas data frame.
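As a preview, the four steps can be sketched on a tiny, made-up HTML fragment. The fragment below is an assumption for illustration: it borrows two songs and the span classes from the WNRN playlist, but the surrounding table markup is invented, the time played is left out, and a string literal stands in for the download in step 1.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Step 1 (stand-in): a made-up HTML fragment; in practice this text would
# come from requests.get(url, headers=headers).text
raw_html = """
<table>
  <tr><td><span class="artist">Old Crow Medicine Show</span>
          <span class="song">Methamphetamine</span>
          <span class="release">Live at the Ryman</span></td></tr>
  <tr><td><span class="artist">The Beths</span>
          <span class="song">A Real Thing</span>
          <span class="release">(Single)</span></td></tr>
</table>
"""

# Step 2: parse the raw text so that the tags can be searched
soup = BeautifulSoup(raw_html, "html.parser")

# Step 3: extract the text inside each kind of <span> tag
artists = [tag.get_text() for tag in soup.find_all("span", class_="artist")]
songs = [tag.get_text() for tag in soup.find_all("span", class_="song")]
albums = [tag.get_text() for tag in soup.find_all("span", class_="release")]

# Step 4: place the extracted values into a pandas data frame
playlist = pd.DataFrame({"artist": artists, "song": songs, "album": albums})
print(playlist)
```

The rest of this section works through each of these steps in detail on the real page.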

5.4.1. Downloading and Understanding Raw HTML

For this example, I first download the HTML that exists on https://spinitron.com/WNRN using the requests.get() function. To be ethical and to help this website’s owners know that I am not a malicious actor, I also specify a user agent string.

url = "https://spinitron.com/WNRN"
headers = {'user-agent': 'Kropko class example (jkropko@virginia.edu)'}
r = requests.get(url, headers=headers)
r
<Response [200]>

The raw HTML code contains a series of text fragments that look like this,

<tag attribute="value"> Navigable string </tag>

where tag, attribute, "value", and Navigable string are replaced by specific parameters and data that control the content and presentation of the webpage that gets displayed in a web browser. For example, here are the first 1000 characters of the raw text from WNRN’s playlist:

print(r.text[0:1000])
<!doctype html><html lang="en">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1,maximum-scale=1">
    <title>WNRN – Independent Music Radio</title>
    <meta name="description" content="A member-supported, independent music radio station broadcasting from the Blue Ridge to the Bay across Virginia—Richmond, Hampton Roads, Roanoke, Charlottesville, Lynchburg, Nelson County, Williamsburg, and The Shenandoah Valley.">

                                    <meta name="csrf-param" content="_csrf">
<meta name="csrf-token" content="mWlJCHwF8YUborIOtauYOpB0C3NivG90a3E9X3GTyWjQEB5pLWykyC7WgHzj0d1Z9EZPMC3dJ0JbNFM-QMX-EQ==">

    <meta property="og:url" content="/WNRN/">
<meta property="og:title" content="WNRN – Independent Music Radio">
<meta property="og:description" content="A member-supported, independent music radio station broadcasting from the Blue Ridge to the Bay across Virgi

Tags specify how the data contained within the page are organized and how the visual elements on this page should look. Tags are designated by opening and closing angle brackets, < and >. In the HTML code for this page (the beginning of which is displayed above), there are tags named

  • <html>, which tells browsers that the following code is written in HTML,

  • <meta>, which defines metadata in the document that help govern how the output should be displayed in the browser,

  • <title>, which sets the title of the document, and

  • <link>, which pulls data or images from external resources for later use.

To see what other HTML tags do, look at the list on https://www.w3schools.com/TAGs/.

In some cases the tag operates on the text that immediately follows, and a closing tag </tag> frames the text that gets operated on by the tag. The text in between the opening and closing tag is called the navigable string. For example, the tag <title>WNRN – Independent Music Radio</title> specifies that “WNRN – Independent Music Radio”, and only this string, is the title.

Some tags have attributes, which are arguments listed inside an opening tag to modify the behavior of that tag or to attach relevant data to the tag. The first <html> tag listed above contains an attribute lang with a value "en" that specifies that the content of this document is written in English.

5.4.2. Parsing Raw HTML Using BeautifulSoup()

The requests.get() function only downloads the raw text of the HTML code. At this point Python does not yet understand the logic and organization of the HTML code. Getting Python to register text as a particular coding standard is called parsing the code. We’ve parsed code into Python before with JSON data. We used requests.get() to download the JSON formatted data, but we needed json.loads() to parse the data in order to be able to navigate the branches of the JSON tree.

There are two widely used Python libraries for working with HTML data: bs4, which contains the BeautifulSoup() function, and selenium. BeautifulSoup() works with raw text, but cannot access websites itself (we use requests.get() for that). In order to access the data on a website, the data needs to be visible in the raw HTML that requests.get() returns. If a website takes measures to hide that data, possibly by using JavaScript that runs in the browser to populate data fields after the page loads, or by saving data as image files, then we won’t be able to access the data with an HTML parser. selenium has more features to extract more complicated data and circumvent anti-scraping measures: it drives a real web browser, so it can reach data that only appear after JavaScript runs, and it can take a screenshot of the webpage, which can then be passed to optical character recognition (OCR) software to pull data directly from the image. However, selenium requires each request to be loaded in a web browser, so it can be quite a bit slower than BeautifulSoup(). If you are interested in learning how to use selenium, see this guide: https://selenium-python.readthedocs.io/. Here we will be using BeautifulSoup().

First I import the BeautifulSoup() function:

from bs4 import BeautifulSoup

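To use it, we pass the raw HTML text from https://spinitron.com/WNRN, which is stored in the .text attribute of the requests.get() output (r.text above), to BeautifulSoup(). This function can parse either HTML or XML code, so the second argument should specify HTML. The value 'html' used here lets bs4 choose among the HTML parsers that are installed; 'html.parser' would name the parser built into Python explicitly: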

wnrn = BeautifulSoup(r.text, 'html')

Now that the https://spinitron.com/WNRN source code is registered as HTML code in Python, we can begin executing commands to navigate the organizational structure of the code and extract data.
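One limitation mentioned earlier in this section is worth seeing in action: BeautifulSoup() only knows about what is in the raw text. The sketch below parses a made-up page (not the WNRN page) whose playlist is filled in by JavaScript in the browser. Because BeautifulSoup() never runs that JavaScript, the playlist appears empty. I name the parser 'html.parser' here so that the example does not depend on which optional parsers are installed.

```python
from bs4 import BeautifulSoup

# A made-up page whose playlist is filled in by JavaScript after loading
raw_html = """
<html>
<head><title>Example playlist</title></head>
<body>
  <div id="playlist"></div>
  <script>
    document.getElementById("playlist").innerText = "Song loaded by JavaScript";
  </script>
</body>
</html>
"""

# "html.parser" names Python's built-in parser explicitly
soup = BeautifulSoup(raw_html, "html.parser")

title_text = soup.title.string
playlist_text = soup.find("div", id="playlist").get_text()

print(title_text)           # the title is in the raw HTML, so it is found
print(repr(playlist_text))  # the <div> is empty because no JavaScript was run
```

If the data you want show up in the browser but not in r.text, this is the most likely reason.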

5.4.3. Searching for HTML Tags and Extracting Data

While HTML is a coding language, it does not force coders to follow very strict templates. There’s a lot of flexibility and creativity possible for HTML programmers, and as such, there is no one universal method for extracting data from HTML. The best approach is to open a browser window, navigate to the webpage you want to scrape, and “view page source”. (Different web browsers have different ways to do that. On Mozilla Firefox, right click somewhere on the page other than an active link, and “view page source” should be an option.) The source will display the raw HTML code that generates the page. You will need to search through this code to find examples of the data points you intend to collect, possibly using control+F to search for specific values. Once you find the data you need, make note of the tags that surround the data and use the tools we will describe next to extract the data.

The parsed BeautifulSoup() output, wnrn, has important methods and attributes that we will use to extract the data we want. First, we can use the name of a tag as an attribute to extract the first occurrence of that tag. Here we extract the first <meta> tag:

metatag = wnrn.meta
metatag
<meta charset="utf-8"/>

This tag stores its attributes like a dictionary, so we can extract the value of an attribute by using the name of that attribute as a key:

metatag['charset']
'utf-8'
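A short sketch of a few other ways to reach a tag's attributes, run on a single <meta> tag copied from the page header rather than on the full page. The .attrs attribute and the .get() method are part of bs4; the variable names are only for this example.

```python
from bs4 import BeautifulSoup

# A small stand-in for the page header, using one <meta> tag from the output above
soup = BeautifulSoup(
    '<meta name="viewport" content="width=device-width, initial-scale=1,maximum-scale=1">',
    "html.parser",
)
metatag = soup.meta

all_attributes = metatag.attrs            # every attribute and value, as a dictionary
viewport_name = metatag["name"]           # square brackets raise KeyError if the attribute is absent
missing_charset = metatag.get("charset")  # .get() returns None instead of raising an error

print(all_attributes)
print(viewport_name)
print(missing_charset)
```

Using .get() is convenient when looping over many tags, because not every tag is guaranteed to carry the attribute we are looking for.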

If a tag has a navigable string, we can extract that with the .string attribute of a particular tag. For example, to extract the title, we start with the <title> tag:

titletag = wnrn.title
titletag
<title>WNRN – Independent Music Radio</title>

Then we extract the title as follows:

titletag.string
'WNRN – Independent Music Radio'
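The .string attribute works when a tag contains exactly one navigable string. When a tag contains a mix of text and other tags, .string returns None, and the .get_text() method is a useful alternative. The sketch below shows the difference on a made-up fragment; the "Album: " text is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the <div> holds text plus a nested <span>, not a single string
soup = BeautifulSoup(
    '<div class="info">Album: <span class="release">Live at the Ryman</span></div>',
    "html.parser",
)

span_string = soup.span.string   # one navigable string, so .string returns it
div_string = soup.div.string     # two children (text and a tag), so .string is None
div_text = soup.div.get_text()   # .get_text() joins all of the text inside the tag

print(span_string)
print(div_string)
print(div_text)
```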

Our goal in this example is to extract the artist, song, album, and time played for every song played on WNRN. I look in the raw HTML source code for the first instance of an artist. These data are contained in the <span> tags:

spantag = wnrn.span
spantag
<span class="artist">Old Crow Medicine Show</span>

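Calling one tag is not especially useful, because we generally want to extract all of the relevant data on a page. For that, we can use the .find_next() and .find_all() methods, both of which are very literal: they return exactly the tags that match, in the order in which they appear in the HTML code. Called with no arguments, .find_next() returns the very next tag of any type. Here, that tag is the <span> that contains the song associated with the artist: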

spantag.find_next()
<span class="song">Methamphetamine</span>

Calling .find_next() again returns the tag after that, which is not a <span> but a <div> that wraps the <span> containing the album name (under "release"):

spantag.find_next().find_next()
<div class="info"><span class="release">Live at the Ryman</span></div>
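To move from one <span> directly to the next <span>, skipping any tags in between, we can pass the tag name to .find_next(). The sketch below uses a made-up fragment shaped like the output above, not the full page, to show the difference.

```python
from bs4 import BeautifulSoup

# A made-up fragment shaped like the output above: the release <span> sits inside a <div>
raw_html = (
    '<span class="artist">Old Crow Medicine Show</span>'
    '<span class="song">Methamphetamine</span>'
    '<div class="info"><span class="release">Live at the Ryman</span></div>'
)
soup = BeautifulSoup(raw_html, "html.parser")
spantag = soup.span

song_tag = spantag.find_next("span")      # next <span> after the artist
release_tag = song_tag.find_next("span")  # skips over the <div> wrapper
unfiltered = song_tag.find_next()         # no argument: next tag of any type

print(song_tag.string)
print(release_tag.string)
print(unfiltered.name)
```

With the tag name supplied, the result is always a <span>, regardless of how the page wraps it.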

To find all occurrences of the <span> tag, organized in a list, use .find_all() and provide the tag as the argument:

spanlist = wnrn.find_all("span")
spanlist
[<span class="artist">Old Crow Medicine Show</span>,
 <span class="song">Methamphetamine</span>,
 <span class="release">Live at the Ryman</span>,
 <span class="artist">The Beths</span>,
 <span class="song">A Real Thing</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Feist</span>,
 <span class="song">I Feel It All</span>,
 <span class="release">The Reminder</span>,
 <span class="artist">Pulp</span>,
 <span class="song">Spike Island</span>,
 <span class="release">More</span>,
 <span class="artist">Laufey</span>,
 <span class="song">Silver Lining</span>,
 <span class="release">(Single)</span>,
 <span class="artist">R.E.M.</span>,
 <span class="song">Daysleeper</span>,
 <span class="release">Up</span>,
 <span class="artist">Leon Bridges</span>,
 <span class="song">Laredo</span>,
 <span class="release">Leon</span>,
 <span class="artist">Nathaniel Rateliff &amp; The Night Sweats</span>,
 <span class="song">Time Makes Fools Of Us All</span>,
 <span class="release">South Of Here</span>,
 <span class="artist">The Smiths</span>,
 <span class="song">How Soon Is Now?</span>,
 <span class="release">Meat Is Murder</span>,
 <span class="artist">Momma</span>,
 <span class="song">I Want You (Fever)</span>,
 <span class="release">Welcome To My Blue Sky</span>,
 <span class="artist">Hannah Cohen</span>,
 <span class="song">Draggin'</span>,
 <span class="release">Earthstar Mountain</span>,
 <span class="artist">Jenny Lewis</span>,
 <span class="song">Psychos</span>,
 <span class="release">Joy'all</span>,
 <span class="artist">Yndling</span>,
 <span class="song">It's Almost Like You're Here</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Monobloc</span>,
 <span class="song">Where Is My Garden</span>,
 <span class="release">Monobloc</span>,
 <span class="artist">David Gray</span>,
 <span class="song">Babylon</span>,
 <span class="release">White Ladder</span>,
 <span class="artist">Squeeze</span>,
 <span class="song">Cool for Cats</span>,
 <span class="release">Cool For Cats</span>,
 <span class="artist">MRCY</span>,
 <span class="song">Wandering Attention</span>,
 <span class="release">Volume 2</span>,
 <span class="artist">Rachel Chinouriri</span>,
 <span class="song">23:42</span>,
 <span class="release">Little House EP</span>,
 <span class="artist">Dinosaur Jr.</span>,
 <span class="song">Feel the Pain</span>,
 <span class="release">Without a Sound</span>,
 <span class="artist">Trombone Shorty</span>,
 <span class="song">Come Back</span>,
 <span class="release">Lifted</span>,
 <span class="artist">Japanese Breakfast</span>,
 <span class="song">Picture Window</span>,
 <span class="release">For Melancholy Brunettes (&amp; sad women)</span>,
 <span class="artist">Matt Berninger</span>,
 <span class="song">Bonnet of Pins</span>,
 <span class="release">Get Sunk</span>,
 <span class="artist">Caroline Spence</span>,
 <span class="song">Who's Gonna Make My Mistakes</span>,
 <span class="release">Mint Condition</span>,
 <span class="artist">Electric Light Orchestra</span>,
 <span class="song">Turn To Stone</span>,
 <span class="release">Out of the Blue</span>,
 <span class="artist">Wishy</span>,
 <span class="song">Fly</span>,
 <span class="release">Planet Popstar EP</span>,
 <span class="artist">Lord Huron</span>,
 <span class="song">Nothing I Need</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Angie McMahon</span>,
 <span class="song">Keeping Time</span>,
 <span class="release">Salt</span>,
 <span class="artist">Inhaler</span>,
 <span class="song">Your House</span>,
 <span class="release">Open Wide</span>,
 <span class="artist">Bon Iver</span>,
 <span class="song">Everything Is Peaceful Love</span>,
 <span class="release">SABLE, fABLE</span>,
 <span class="artist">The Velvet Underground and Nico</span>,
 <span class="song">Sunday Morning</span>,
 <span class="release">The Velvet Underground &amp; Nico</span>,
 <span class="artist">Birdtalker</span>,
 <span class="song">Season Of Charade</span>,
 <span class="release">All Means, No End</span>,
 <span class="artist">Kacey Musgraves</span>,
 <span class="song">Follow Your Arrow</span>,
 <span class="release">Same Trailer Different Park</span>,
 <span class="artist">Sharp Pins</span>,
 <span class="song">Race for the Audience</span>,
 <span class="release">Radio DDR</span>,
 <span class="artist">Billy Strings</span>,
 <span class="song">Be Your Man</span>,
 <span class="release">Highway Prayers</span>,
 <span class="artist">Quilt</span>,
 <span class="song">Roller</span>,
 <span class="release">Plaza</span>,
 <span class="artist">Caamp</span>,
 <span class="song">Let Things Go</span>,
 <span class="release">Somewhere EP</span>,
 <span class="artist">BADBADNOTGOOD &amp; V.C.R</span>,
 <span class="song">Found A Light (Beale Street)</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Peter Bjorn and John</span>,
 <span class="song">Objects of My Affection</span>,
 <span class="release">Writer's Block</span>,
 <span class="artist">Hurray For The Riff Raff</span>,
 <span class="song">Pyramid Scheme</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Arny Margret</span>,
 <span class="song">Day Old Thoughts</span>,
 <span class="release">I Miss You, I Do</span>,
 <span class="artist">Bahamas</span>,
 <span class="song">All the Time</span>,
 <span class="release">Bahamas is Afie</span>,
 <span class="artist">Snacktime f/ Devon Gilfillian</span>,
 <span class="song">Together</span>,
 <span class="release">This Is Dance Music EP</span>,
 <span class="artist">Neil Young</span>,
 <span class="song">Heart of Gold</span>,
 <span class="release">Harvest</span>,
 <span class="artist">Fiona Apple</span>,
 <span class="song">Heart Of Gold</span>,
 <span class="release">Heart of Gold: The Songs of Neil Y</span>,
 <span class="artist">Phosphorescent</span>,
 <span class="song">New Birth in New England</span>,
 <span class="release">C'est La Vie</span>,
 <span class="artist">S.G. Goodman</span>,
 <span class="song">Fire Sign</span>,
 <span class="release">Planting by the Signs</span>,
 <span class="artist">Orla Gartland</span>,
 <span class="song">Now What?</span>,
 <span class="release">Everybody Needs A Hero (Deluxe)</span>,
 <span class="artist">The War on Drugs</span>,
 <span class="song">Red Eyes</span>,
 <span class="release">Lost In The Dream</span>,
 <span class="artist">James Bay f/ Jon Batiste</span>,
 <span class="song">Sunshine In The Room</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Tunde Adebimpe</span>,
 <span class="song">God Knows</span>,
 <span class="release">Thee Black Boltz</span>,
 <span class="artist">Spectator Bird</span>,
 <span class="song">Saint Anthony</span>,
 <span class="release">Fall Down in a Small Town</span>,
 <span class="artist">De La Soul</span>,
 <span class="song">Eye Know</span>,
 <span class="release">3 Feet High and Rising</span>,
 <span class="artist">Billie Marten</span>,
 <span class="song">Feeling</span>,
 <span class="release">Dog Eared</span>,
 <span class="artist">The Head &amp; The Heart</span>,
 <span class="song">After The Setting Sun</span>,
 <span class="release">Aperture</span>,
 <span class="artist">Hop Along</span>,
 <span class="song">How Simple</span>,
 <span class="release">Bark Your Head Off, Dog</span>,
 <span class="artist">Panchiko</span>,
 <span class="song">Ginkgo</span>,
 <span class="release">Ginkgo</span>,
 <span class="artist">Flock Of Dimes</span>,
 <span class="song">Two</span>,
 <span class="release">WNRN Studios</span>,
 <span class="artist">INXS</span>,
 <span class="song">Never Tear Us Apart</span>,
 <span class="release">Kick</span>,
 <span class="artist">Loaded Honey</span>,
 <span class="song">Don't Speak</span>,
 <span class="release">Love Made Trees</span>,
 <span class="artist">Ben Harper</span>,
 <span class="song">Diamonds on the Inside</span>,
 <span class="release">Diamonds on the Inside</span>,
 <span class="artist">I'm With Her</span>,
 <span class="song">Ancient Light</span>,
 <span class="release">Wild and Clear and Blue</span>,
 <span class="artist">Peach Pit</span>,
 <span class="song">Magpie</span>,
 <span class="release">Magpie</span>,
 <span class="artist">49 Winchester</span>,
 <span class="song">Chemistry</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Neiked &amp; Portugal. The Man</span>,
 <span class="song">Glide</span>,
 <span class="release">(Single)</span>,
 <span class="artist">flipturn</span>,
 <span class="song">Burnout Days</span>,
 <span class="release">Burnout Days</span>,
 <span class="artist">Middle Kids</span>,
 <span class="song">Stacking Chairs</span>,
 <span class="release">Today We're the Greatest</span>,
 <span class="artist">King Gizzard &amp; The Lizard Wizard</span>,
 <span class="song">Deadstick</span>,
 <span class="release">Phantom Island</span>,
 <span class="artist">John Butler</span>,
 <span class="song">Trippin On You</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Brandi Carlile</span>,
 <span class="song">The Story</span>,
 <span class="release">The Story</span>,
 <span class="artist">Gold Connections</span>,
 <span class="song">Fool's Gold</span>,
 <span class="release">Fortune</span>,
 <span class="artist">The Breeders</span>,
 <span class="song">Cannonball</span>,
 <span class="release">Last Splash</span>,
 <span class="artist">Watchhouse</span>,
 <span class="song">Rituals</span>,
 <span class="release">Rituals</span>,
 <span class="artist">Man/Woman/Chainsaw</span>,
 <span class="song">Adam &amp; Steve</span>,
 <span class="release">(Single)</span>,
 <span class="artist">The Hold Steady</span>,
 <span class="song">Sequestered in Memphis</span>,
 <span class="release">Stay Positive</span>,
 <span class="artist">G. Love and Special Sauce</span>,
 <span class="song">Peace, Love, and Happiness</span>,
 <span class="release">Superhero Brother</span>,
 <span class="artist">Cardinals</span>,
 <span class="song">Get It</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Alison Krauss &amp; Union Station</span>,
 <span class="song">Richmond on the James</span>,
 <span class="release">Arcadia</span>,
 <span class="artist">The Sherman Holmes Project</span>,
 <span class="song">Don't Do It</span>,
 <span class="release">The Richmond Sessions</span>,
 <span class="artist">Will Worden</span>,
 <span class="song">Lovin' You Forever</span>,
 <span class="release">The Only One &amp; All The Others</span>,
 <span class="artist">Dope Lemon</span>,
 <span class="song">Electric Green Lambo</span>,
 <span class="release">Golden Wolf</span>,
 <span class="artist">Faye Webster</span>,
 <span class="song">RIght Side of My Neck</span>,
 <span class="release">Atlanta Millionaires Club</span>,
 <span class="artist">Johnny Delaware</span>,
 <span class="song">Running</span>,
 <span class="release">Para Llevar</span>,
 <span class="artist">Van Morrison</span>,
 <span class="song">Caravan</span>,
 <span class="release">Moondance</span>,
 <span class="artist">The Vices</span>,
 <span class="song">Before It Might Be Gone</span>,
 <span class="release">Before It Might Be Gone</span>,
 <span class="artist">Lewis OfMan (feat. Empress Of)</span>,
 <span class="song">Highway</span>,
 <span class="release">Cristal Medium Blue</span>,
 <span class="artist">Tune-Yards</span>,
 <span class="song">Heartbreak</span>,
 <span class="release">Better Dreaming</span>,
 <span class="artist">Counting Crows</span>,
 <span class="song">Spaceman In Tulsa</span>,
 <span class="release">Butter Miracle, The Complete Sweets</span>,
 <span class="artist">A Tribe Called Quest</span>,
 <span class="song">Ego</span>,
 <span class="release">We Got It from Here... Thank You for Your Service</span>,
 <span class="artist">Tennis</span>,
 <span class="song">At The Wedding</span>,
 <span class="release">Face Down In The Garden</span>,
 <span class="artist">Deep Sea Diver</span>,
 <span class="song">Shovel</span>,
 <span class="release">Billboard Heart</span>,
 <span class="artist">Beirut</span>,
 <span class="song">Santa Fe</span>,
 <span class="release">The Rip Tide</span>,
 <span class="artist">Goose</span>,
 <span class="song">Lead Up</span>,
 <span class="release">Everything Must Go</span>,
 <span class="artist">Leon Bridges</span>,
 <span class="song">Laredo</span>,
 <span class="release">Leon</span>,
 <span class="artist">Joe Jackson</span>,
 <span class="song">Look Sharp!</span>,
 <span class="release">Look Sharp!</span>,
 <span class="artist">HAIM</span>,
 <span class="song">Relationships</span>,
 <span class="release">I Quit</span>,
 <span class="artist">Lucy Dacus f/ Hozier</span>,
 <span class="song">Bullseye</span>,
 <span class="release">Forever Is A Feeling</span>,
 <span class="artist">Palmyra</span>,
 <span class="song">Arizona</span>,
 <span class="release">Restless</span>,
 <span class="artist">Bruce Springsteen</span>,
 <span class="song">I'm on Fire</span>,
 <span class="release">Born in the USA</span>,
 <span class="artist">Marc Broussard</span>,
 <span class="song">Time Is A Thief</span>,
 <span class="release">Time Is A Thief</span>,
 <span class="artist">Girl and Girl</span>,
 <span class="song">Okay</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Hannah Cohen</span>,
 <span class="song">Draggin'</span>,
 <span class="release">Earthstar Mountain</span>,
 <span class="artist">John Prine</span>,
 <span class="song">In Spite of Ourselves</span>,
 <span class="release">In Spite of Ourselves</span>,
 <span class="artist">Grace Potter</span>,
 <span class="song">Before The Sky Falls</span>,
 <span class="release">Medicine</span>,
 <span class="artist">Matt Andersen</span>,
 <span class="song">In-Studio Session with</span>,
 <span class="release">WNRN Studios</span>,
 <span class="artist">Webb Wilder</span>,
 <span class="song">Hillbilly Speedball</span>,
 <span class="release">Hillbilly Speedball</span>,
 <span class="artist">Pete Yorn</span>,
 <span class="song">Summer Was a Day</span>,
 <span class="release">Arranging Time</span>,
 <span class="artist">Esther Rose</span>,
 <span class="song">New Bad</span>,
 <span class="release">Want</span>,
 <span class="artist">Lucius</span>,
 <span class="song">Old Tape</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Julien Baker &amp; TORRES</span>,
 <span class="song">Sugar In The Tank</span>,
 <span class="release">Send A Prayer My Way</span>,
 <span class="artist">Big Star</span>,
 <span class="song">Thirteen</span>,
 <span class="release">#1 Record</span>,
 <span class="artist">Chris Knight</span>,
 <span class="song">A Pretty Good Guy</span>,
 <span class="release">A Pretty Good Guy</span>,
 <span class="artist">Shinyribs</span>,
 <span class="song">Leaving Time</span>,
 <span class="release">Leaving Time</span>,
 <span class="artist">My Morning Jacket</span>,
 <span class="song">Time Waited</span>,
 <span class="release">is</span>,
 <span class="artist">The Cure</span>,
 <span class="song">All I Ever Am</span>,
 <span class="release">Songs Of A Lost World</span>,
 <span class="artist">Caamp</span>,
 <span class="song">Let Things Go</span>,
 <span class="release">Somewhere EP</span>,
 <span class="artist">Rosanne Cash</span>,
 <span class="song">Not Many Miles to Go</span>,
 <span class="release">She Remembers Everything</span>,
 <span class="artist">Car Seat Headrest</span>,
 <span class="song">Gethsemane</span>,
 <span class="release">The Scholars</span>,
 <span class="artist">Southern Avenue</span>,
 <span class="song">Upside</span>,
 <span class="release">Family</span>,
 <span class="artist">Phoebe Bridgers</span>,
 <span class="song">Motion Sickness</span>,
 <span class="release">Stranger in the Alps</span>,
 <span class="artist">Jade Bird</span>,
 <span class="song">Dreams</span>,
 <span class="release">Who Wants To Talk About Love?</span>,
 <span class="artist">Charles Wesley Godwin</span>,
 <span class="song">It's the Little Things</span>,
 <span class="release">Lonely Mountain Town</span>,
 <span class="artist">Yeah Yeah Yeahs</span>,
 <span class="song">Gold Lion</span>,
 <span class="release">Show Your Bones</span>,
 <span class="artist">Laufey</span>,
 <span class="song">Silver Lining</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Arny Margret</span>,
 <span class="song">Day Old Thoughts</span>,
 <span class="release">I Miss You, I Do</span>,
 <span class="artist">Seth Walker</span>,
 <span class="song">All I Need to Know</span>,
 <span class="release">Are You Open?</span>,
 <span class="artist">Oracle Sisters</span>,
 <span class="song">Blue Left Hand</span>,
 <span class="release">Divinations</span>,
 <span class="artist">Momma</span>,
 <span class="song">I Want You (Fever)</span>,
 <span class="release">Welcome To My Blue Sky</span>,
 <span class="artist">Day Wave</span>,
 <span class="song">Gone</span>,
 <span class="release">Hard to Read EP</span>,
 <span class="artist">Willie Nelson f/ Rodney Crowell</span>,
 <span class="song">Oh What A Beautiful World</span>,
 <span class="release">Oh What A Beautiful World</span>,
 <span class="artist">Tunde Adebimpe</span>,
 <span class="song">God Knows</span>,
 <span class="release">Thee Black Boltz</span>,
 <span class="artist">Gillian Welch</span>,
 <span class="song">Wayside/Back In Time</span>,
 <span class="release">Soul Journey</span>,
 <span class="artist">Death Cab for Cutie</span>,
 <span class="song">Here to Forever</span>,
 <span class="release">Asphalt Meadows</span>,
 <span class="artist">Maggie Rogers &amp; Sylvan Esso</span>,
 <span class="song">Anthems For A Seventeen Year-Old Girl</span>,
 <span class="release">ANTHEMS</span>,
 <span class="artist">Yndling</span>,
 <span class="song">It's Almost Like You're Here</span>,
 <span class="release">(Single)</span>,
 <span class="artist">Josh Ritter</span>,
 <span class="song">Showboat</span>,
 <span class="release">Gathering</span>,
 <span class="artist">Big Thief</span>,
 <span class="song">Shark Smile</span>,
 <span class="release">Capacity</span>,
 <span class="artist">Zach Top f/ Billy Strings</span>,
 <span class="song">Bad Luck</span>,
 <span class="release">Me &amp; Billy</span>,
 <span class="artist">Los Lobos</span>,
 <span class="song">Don't Worry Baby</span>,
 <span class="release">How Will the Wolf Survive?</span>,
 <span class="artist">Noeline Hofmann</span>,
 <span class="song">Lightning in July (Prairie Fire)</span>,
 <span class="release">Purple Gas</span>,
 <span class="artist">MRCY</span>,
 <span class="song">Wandering Attention</span>,
 <span class="release">Volume 2</span>,
 <span class="artist">I'm With Her</span>,
 <span class="song">Ancient Light</span>,
 <span class="release">Wild and Clear and Blue</span>,
 <span class="artist">R.E.M.</span>,
 <span class="song">Nightswimming</span>,
 <span class="release">Automatic for the People</span>,
 <span class="artist">The Bug Club</span>,
 <span class="song">Jealous Boy</span>,
 <span class="release">Very Human Features</span>]

Notice that the HTML source code distinguishes between the three types of data point with different class values. To limit the list to just the artists, we can pass the "artist" class as the second argument of .find_all():

artistlist = wnrn.find_all("span", "artist")
artistlist
[<span class="artist">Old Crow Medicine Show</span>,
 <span class="artist">The Beths</span>,
 <span class="artist">Feist</span>,
 <span class="artist">Pulp</span>,
 <span class="artist">Laufey</span>,
 <span class="artist">R.E.M.</span>,
 <span class="artist">Leon Bridges</span>,
 <span class="artist">Nathaniel Rateliff &amp; The Night Sweats</span>,
 <span class="artist">The Smiths</span>,
 <span class="artist">Momma</span>,
 <span class="artist">Hannah Cohen</span>,
 <span class="artist">Jenny Lewis</span>,
 <span class="artist">Yndling</span>,
 <span class="artist">Monobloc</span>,
 <span class="artist">David Gray</span>,
 <span class="artist">Squeeze</span>,
 <span class="artist">MRCY</span>,
 <span class="artist">Rachel Chinouriri</span>,
 <span class="artist">Dinosaur Jr.</span>,
 <span class="artist">Trombone Shorty</span>,
 <span class="artist">Japanese Breakfast</span>,
 <span class="artist">Matt Berninger</span>,
 <span class="artist">Caroline Spence</span>,
 <span class="artist">Electric Light Orchestra</span>,
 <span class="artist">Wishy</span>,
 <span class="artist">Lord Huron</span>,
 <span class="artist">Angie McMahon</span>,
 <span class="artist">Inhaler</span>,
 <span class="artist">Bon Iver</span>,
 <span class="artist">The Velvet Underground and Nico</span>,
 <span class="artist">Birdtalker</span>,
 <span class="artist">Kacey Musgraves</span>,
 <span class="artist">Sharp Pins</span>,
 <span class="artist">Billy Strings</span>,
 <span class="artist">Quilt</span>,
 <span class="artist">Caamp</span>,
 <span class="artist">BADBADNOTGOOD &amp; V.C.R</span>,
 <span class="artist">Peter Bjorn and John</span>,
 <span class="artist">Hurray For The Riff Raff</span>,
 <span class="artist">Arny Margret</span>,
 <span class="artist">Bahamas</span>,
 <span class="artist">Snacktime f/ Devon Gilfillian</span>,
 <span class="artist">Neil Young</span>,
 <span class="artist">Fiona Apple</span>,
 <span class="artist">Phosphorescent</span>,
 <span class="artist">S.G. Goodman</span>,
 <span class="artist">Orla Gartland</span>,
 <span class="artist">The War on Drugs</span>,
 <span class="artist">James Bay f/ Jon Batiste</span>,
 <span class="artist">Tunde Adebimpe</span>,
 <span class="artist">Spectator Bird</span>,
 <span class="artist">De La Soul</span>,
 <span class="artist">Billie Marten</span>,
 <span class="artist">The Head &amp; The Heart</span>,
 <span class="artist">Hop Along</span>,
 <span class="artist">Panchiko</span>,
 <span class="artist">Flock Of Dimes</span>,
 <span class="artist">INXS</span>,
 <span class="artist">Loaded Honey</span>,
 <span class="artist">Ben Harper</span>,
 <span class="artist">I'm With Her</span>,
 <span class="artist">Peach Pit</span>,
 <span class="artist">49 Winchester</span>,
 <span class="artist">Neiked &amp; Portugal. The Man</span>,
 <span class="artist">flipturn</span>,
 <span class="artist">Middle Kids</span>,
 <span class="artist">King Gizzard &amp; The Lizard Wizard</span>,
 <span class="artist">John Butler</span>,
 <span class="artist">Brandi Carlile</span>,
 <span class="artist">Gold Connections</span>,
 <span class="artist">The Breeders</span>,
 <span class="artist">Watchhouse</span>,
 <span class="artist">Man/Woman/Chainsaw</span>,
 <span class="artist">The Hold Steady</span>,
 <span class="artist">G. Love and Special Sauce</span>,
 <span class="artist">Cardinals</span>,
 <span class="artist">Alison Krauss &amp; Union Station</span>,
 <span class="artist">The Sherman Holmes Project</span>,
 <span class="artist">Will Worden</span>,
 <span class="artist">Dope Lemon</span>,
 <span class="artist">Faye Webster</span>,
 <span class="artist">Johnny Delaware</span>,
 <span class="artist">Van Morrison</span>,
 <span class="artist">The Vices</span>,
 <span class="artist">Lewis OfMan (feat. Empress Of)</span>,
 <span class="artist">Tune-Yards</span>,
 <span class="artist">Counting Crows</span>,
 <span class="artist">A Tribe Called Quest</span>,
 <span class="artist">Tennis</span>,
 <span class="artist">Deep Sea Diver</span>,
 <span class="artist">Beirut</span>,
 <span class="artist">Goose</span>,
 <span class="artist">Leon Bridges</span>,
 <span class="artist">Joe Jackson</span>,
 <span class="artist">HAIM</span>,
 <span class="artist">Lucy Dacus f/ Hozier</span>,
 <span class="artist">Palmyra</span>,
 <span class="artist">Bruce Springsteen</span>,
 <span class="artist">Marc Broussard</span>,
 <span class="artist">Girl and Girl</span>,
 <span class="artist">Hannah Cohen</span>,
 <span class="artist">John Prine</span>,
 <span class="artist">Grace Potter</span>,
 <span class="artist">Matt Andersen</span>,
 <span class="artist">Webb Wilder</span>,
 <span class="artist">Pete Yorn</span>,
 <span class="artist">Esther Rose</span>,
 <span class="artist">Lucius</span>,
 <span class="artist">Julien Baker &amp; TORRES</span>,
 <span class="artist">Big Star</span>,
 <span class="artist">Chris Knight</span>,
 <span class="artist">Shinyribs</span>,
 <span class="artist">My Morning Jacket</span>,
 <span class="artist">The Cure</span>,
 <span class="artist">Caamp</span>,
 <span class="artist">Rosanne Cash</span>,
 <span class="artist">Car Seat Headrest</span>,
 <span class="artist">Southern Avenue</span>,
 <span class="artist">Phoebe Bridgers</span>,
 <span class="artist">Jade Bird</span>,
 <span class="artist">Charles Wesley Godwin</span>,
 <span class="artist">Yeah Yeah Yeahs</span>,
 <span class="artist">Laufey</span>,
 <span class="artist">Arny Margret</span>,
 <span class="artist">Seth Walker</span>,
 <span class="artist">Oracle Sisters</span>,
 <span class="artist">Momma</span>,
 <span class="artist">Day Wave</span>,
 <span class="artist">Willie Nelson f/ Rodney Crowell</span>,
 <span class="artist">Tunde Adebimpe</span>,
 <span class="artist">Gillian Welch</span>,
 <span class="artist">Death Cab for Cutie</span>,
 <span class="artist">Maggie Rogers &amp; Sylvan Esso</span>,
 <span class="artist">Yndling</span>,
 <span class="artist">Josh Ritter</span>,
 <span class="artist">Big Thief</span>,
 <span class="artist">Zach Top f/ Billy Strings</span>,
 <span class="artist">Los Lobos</span>,
 <span class="artist">Noeline Hofmann</span>,
 <span class="artist">MRCY</span>,
 <span class="artist">I'm With Her</span>,
 <span class="artist">R.E.M.</span>,
 <span class="artist">The Bug Club</span>]
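
The second positional argument of .find_all() is a shortcut for filtering on the CSS class. As a minimal sketch, assuming a small hand-written HTML fragment that only mimics the playlist markup (the names `snippet` and `demo` are illustrative and are not part of the WNRN example), the following shows three equivalent ways to write the same filter:

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the playlist markup (an assumption,
# not the real WNRN page)
snippet = """
<table>
  <tr><td><span class="artist">The Beths</span>
          <span class="song">A Real Thing</span>
          <span class="release">(Single)</span></td></tr>
  <tr><td><span class="artist">Feist</span>
          <span class="song">I Feel It All</span>
          <span class="release">The Reminder</span></td></tr>
</table>
"""
demo = BeautifulSoup(snippet, "html.parser")

# A string passed as the second positional argument is treated as a CSS class
by_position = demo.find_all("span", "artist")

# The same filter using the class_ keyword (class is a reserved word in Python)
by_keyword = demo.find_all("span", class_="artist")

# The same filter written as a CSS selector
by_selector = demo.select("span.artist")

# Each approach returns the same two tags; print the text inside each one
for result in (by_position, by_keyword, by_selector):
    print([tag.get_text() for tag in result])
```

The positional form is the most compact, while the class_ keyword makes the intent more explicit to someone reading the code later.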

Likewise, we can create a list of the songs:

songlist = wnrn.find_all("span", "song")
songlist
[<span class="song">Methamphetamine</span>,
 <span class="song">A Real Thing</span>,
 <span class="song">I Feel It All</span>,
 <span class="song">Spike Island</span>,
 <span class="song">Silver Lining</span>,
 <span class="song">Daysleeper</span>,
 <span class="song">Laredo</span>,
 <span class="song">Time Makes Fools Of Us All</span>,
 <span class="song">How Soon Is Now?</span>,
 <span class="song">I Want You (Fever)</span>,
 <span class="song">Draggin'</span>,
 <span class="song">Psychos</span>,
 <span class="song">It's Almost Like You're Here</span>,
 <span class="song">Where Is My Garden</span>,
 <span class="song">Babylon</span>,
 <span class="song">Cool for Cats</span>,
 <span class="song">Wandering Attention</span>,
 <span class="song">23:42</span>,
 <span class="song">Feel the Pain</span>,
 <span class="song">Come Back</span>,
 <span class="song">Picture Window</span>,
 <span class="song">Bonnet of Pins</span>,
 <span class="song">Who's Gonna Make My Mistakes</span>,
 <span class="song">Turn To Stone</span>,
 <span class="song">Fly</span>,
 <span class="song">Nothing I Need</span>,
 <span class="song">Keeping Time</span>,
 <span class="song">Your House</span>,
 <span class="song">Everything Is Peaceful Love</span>,
 <span class="song">Sunday Morning</span>,
 <span class="song">Season Of Charade</span>,
 <span class="song">Follow Your Arrow</span>,
 <span class="song">Race for the Audience</span>,
 <span class="song">Be Your Man</span>,
 <span class="song">Roller</span>,
 <span class="song">Let Things Go</span>,
 <span class="song">Found A Light (Beale Street)</span>,
 <span class="song">Objects of My Affection</span>,
 <span class="song">Pyramid Scheme</span>,
 <span class="song">Day Old Thoughts</span>,
 <span class="song">All the Time</span>,
 <span class="song">Together</span>,
 <span class="song">Heart of Gold</span>,
 <span class="song">Heart Of Gold</span>,
 <span class="song">New Birth in New England</span>,
 <span class="song">Fire Sign</span>,
 <span class="song">Now What?</span>,
 <span class="song">Red Eyes</span>,
 <span class="song">Sunshine In The Room</span>,
 <span class="song">God Knows</span>,
 <span class="song">Saint Anthony</span>,
 <span class="song">Eye Know</span>,
 <span class="song">Feeling</span>,
 <span class="song">After The Setting Sun</span>,
 <span class="song">How Simple</span>,
 <span class="song">Ginkgo</span>,
 <span class="song">Two</span>,
 <span class="song">Never Tear Us Apart</span>,
 <span class="song">Don't Speak</span>,
 <span class="song">Diamonds on the Inside</span>,
 <span class="song">Ancient Light</span>,
 <span class="song">Magpie</span>,
 <span class="song">Chemistry</span>,
 <span class="song">Glide</span>,
 <span class="song">Burnout Days</span>,
 <span class="song">Stacking Chairs</span>,
 <span class="song">Deadstick</span>,
 <span class="song">Trippin On You</span>,
 <span class="song">The Story</span>,
 <span class="song">Fool's Gold</span>,
 <span class="song">Cannonball</span>,
 <span class="song">Rituals</span>,
 <span class="song">Adam &amp; Steve</span>,
 <span class="song">Sequestered in Memphis</span>,
 <span class="song">Peace, Love, and Happiness</span>,
 <span class="song">Get It</span>,
 <span class="song">Richmond on the James</span>,
 <span class="song">Don't Do It</span>,
 <span class="song">Lovin' You Forever</span>,
 <span class="song">Electric Green Lambo</span>,
 <span class="song">RIght Side of My Neck</span>,
 <span class="song">Running</span>,
 <span class="song">Caravan</span>,
 <span class="song">Before It Might Be Gone</span>,
 <span class="song">Highway</span>,
 <span class="song">Heartbreak</span>,
 <span class="song">Spaceman In Tulsa</span>,
 <span class="song">Ego</span>,
 <span class="song">At The Wedding</span>,
 <span class="song">Shovel</span>,
 <span class="song">Santa Fe</span>,
 <span class="song">Lead Up</span>,
 <span class="song">Laredo</span>,
 <span class="song">Look Sharp!</span>,
 <span class="song">Relationships</span>,
 <span class="song">Bullseye</span>,
 <span class="song">Arizona</span>,
 <span class="song">I'm on Fire</span>,
 <span class="song">Time Is A Thief</span>,
 <span class="song">Okay</span>,
 <span class="song">Draggin'</span>,
 <span class="song">In Spite of Ourselves</span>,
 <span class="song">Before The Sky Falls</span>,
 <span class="song">In-Studio Session with</span>,
 <span class="song">Hillbilly Speedball</span>,
 <span class="song">Summer Was a Day</span>,
 <span class="song">New Bad</span>,
 <span class="song">Old Tape</span>,
 <span class="song">Sugar In The Tank</span>,
 <span class="song">Thirteen</span>,
 <span class="song">A Pretty Good Guy</span>,
 <span class="song">Leaving Time</span>,
 <span class="song">Time Waited</span>,
 <span class="song">All I Ever Am</span>,
 <span class="song">Let Things Go</span>,
 <span class="song">Not Many Miles to Go</span>,
 <span class="song">Gethsemane</span>,
 <span class="song">Upside</span>,
 <span class="song">Motion Sickness</span>,
 <span class="song">Dreams</span>,
 <span class="song">It's the Little Things</span>,
 <span class="song">Gold Lion</span>,
 <span class="song">Silver Lining</span>,
 <span class="song">Day Old Thoughts</span>,
 <span class="song">All I Need to Know</span>,
 <span class="song">Blue Left Hand</span>,
 <span class="song">I Want You (Fever)</span>,
 <span class="song">Gone</span>,
 <span class="song">Oh What A Beautiful World</span>,
 <span class="song">God Knows</span>,
 <span class="song">Wayside/Back In Time</span>,
 <span class="song">Here to Forever</span>,
 <span class="song">Anthems For A Seventeen Year-Old Girl</span>,
 <span class="song">It's Almost Like You're Here</span>,
 <span class="song">Showboat</span>,
 <span class="song">Shark Smile</span>,
 <span class="song">Bad Luck</span>,
 <span class="song">Don't Worry Baby</span>,
 <span class="song">Lightning in July (Prairie Fire)</span>,
 <span class="song">Wandering Attention</span>,
 <span class="song">Ancient Light</span>,
 <span class="song">Nightswimming</span>,
 <span class="song">Jealous Boy</span>]

And a list of the albums, which the page labels with the "release" class:

albumlist = wnrn.find_all("span", "release")
albumlist
[<span class="release">Live at the Ryman</span>,
 <span class="release">(Single)</span>,
 <span class="release">The Reminder</span>,
 <span class="release">More</span>,
 <span class="release">(Single)</span>,
 <span class="release">Up</span>,
 <span class="release">Leon</span>,
 <span class="release">South Of Here</span>,
 <span class="release">Meat Is Murder</span>,
 <span class="release">Welcome To My Blue Sky</span>,
 <span class="release">Earthstar Mountain</span>,
 <span class="release">Joy'all</span>,
 <span class="release">(Single)</span>,
 <span class="release">Monobloc</span>,
 <span class="release">White Ladder</span>,
 <span class="release">Cool For Cats</span>,
 <span class="release">Volume 2</span>,
 <span class="release">Little House EP</span>,
 <span class="release">Without a Sound</span>,
 <span class="release">Lifted</span>,
 <span class="release">For Melancholy Brunettes (&amp; sad women)</span>,
 <span class="release">Get Sunk</span>,
 <span class="release">Mint Condition</span>,
 <span class="release">Out of the Blue</span>,
 <span class="release">Planet Popstar EP</span>,
 <span class="release">(Single)</span>,
 <span class="release">Salt</span>,
 <span class="release">Open Wide</span>,
 <span class="release">SABLE, fABLE</span>,
 <span class="release">The Velvet Underground &amp; Nico</span>,
 <span class="release">All Means, No End</span>,
 <span class="release">Same Trailer Different Park</span>,
 <span class="release">Radio DDR</span>,
 <span class="release">Highway Prayers</span>,
 <span class="release">Plaza</span>,
 <span class="release">Somewhere EP</span>,
 <span class="release">(Single)</span>,
 <span class="release">Writer's Block</span>,
 <span class="release">(Single)</span>,
 <span class="release">I Miss You, I Do</span>,
 <span class="release">Bahamas is Afie</span>,
 <span class="release">This Is Dance Music EP</span>,
 <span class="release">Harvest</span>,
 <span class="release">Heart of Gold: The Songs of Neil Y</span>,
 <span class="release">C'est La Vie</span>,
 <span class="release">Planting by the Signs</span>,
 <span class="release">Everybody Needs A Hero (Deluxe)</span>,
 <span class="release">Lost In The Dream</span>,
 <span class="release">(Single)</span>,
 <span class="release">Thee Black Boltz</span>,
 <span class="release">Fall Down in a Small Town</span>,
 <span class="release">3 Feet High and Rising</span>,
 <span class="release">Dog Eared</span>,
 <span class="release">Aperture</span>,
 <span class="release">Bark Your Head Off, Dog</span>,
 <span class="release">Ginkgo</span>,
 <span class="release">WNRN Studios</span>,
 <span class="release">Kick</span>,
 <span class="release">Love Made Trees</span>,
 <span class="release">Diamonds on the Inside</span>,
 <span class="release">Wild and Clear and Blue</span>,
 <span class="release">Magpie</span>,
 <span class="release">(Single)</span>,
 <span class="release">(Single)</span>,
 <span class="release">Burnout Days</span>,
 <span class="release">Today We're the Greatest</span>,
 <span class="release">Phantom Island</span>,
 <span class="release">(Single)</span>,
 <span class="release">The Story</span>,
 <span class="release">Fortune</span>,
 <span class="release">Last Splash</span>,
 <span class="release">Rituals</span>,
 <span class="release">(Single)</span>,
 <span class="release">Stay Positive</span>,
 <span class="release">Superhero Brother</span>,
 <span class="release">(Single)</span>,
 <span class="release">Arcadia</span>,
 <span class="release">The Richmond Sessions</span>,
 <span class="release">The Only One &amp; All The Others</span>,
 <span class="release">Golden Wolf</span>,
 <span class="release">Atlanta Millionaires Club</span>,
 <span class="release">Para Llevar</span>,
 <span class="release">Moondance</span>,
 <span class="release">Before It Might Be Gone</span>,
 <span class="release">Cristal Medium Blue</span>,
 <span class="release">Better Dreaming</span>,
 <span class="release">Butter Miracle, The Complete Sweets</span>,
 <span class="release">We Got It from Here... Thank You for Your Service</span>,
 <span class="release">Face Down In The Garden</span>,
 <span class="release">Billboard Heart</span>,
 <span class="release">The Rip Tide</span>,
 <span class="release">Everything Must Go</span>,
 <span class="release">Leon</span>,
 <span class="release">Look Sharp!</span>,
 <span class="release">I Quit</span>,
 <span class="release">Forever Is A Feeling</span>,
 <span class="release">Restless</span>,
 <span class="release">Born in the USA</span>,
 <span class="release">Time Is A Thief</span>,
 <span class="release">(Single)</span>,
 <span class="release">Earthstar Mountain</span>,
 <span class="release">In Spite of Ourselves</span>,
 <span class="release">Medicine</span>,
 <span class="release">WNRN Studios</span>,
 <span class="release">Hillbilly Speedball</span>,
 <span class="release">Arranging Time</span>,
 <span class="release">Want</span>,
 <span class="release">(Single)</span>,
 <span class="release">Send A Prayer My Way</span>,
 <span class="release">#1 Record</span>,
 <span class="release">A Pretty Good Guy</span>,
 <span class="release">Leaving Time</span>,
 <span class="release">is</span>,
 <span class="release">Songs Of A Lost World</span>,
 <span class="release">Somewhere EP</span>,
 <span class="release">She Remembers Everything</span>,
 <span class="release">The Scholars</span>,
 <span class="release">Family</span>,
 <span class="release">Stranger in the Alps</span>,
 <span class="release">Who Wants To Talk About Love?</span>,
 <span class="release">Lonely Mountain Town</span>,
 <span class="release">Show Your Bones</span>,
 <span class="release">(Single)</span>,
 <span class="release">I Miss You, I Do</span>,
 <span class="release">Are You Open?</span>,
 <span class="release">Divinations</span>,
 <span class="release">Welcome To My Blue Sky</span>,
 <span class="release">Hard to Read EP</span>,
 <span class="release">Oh What A Beautiful World</span>,
 <span class="release">Thee Black Boltz</span>,
 <span class="release">Soul Journey</span>,
 <span class="release">Asphalt Meadows</span>,
 <span class="release">ANTHEMS</span>,
 <span class="release">(Single)</span>,
 <span class="release">Gathering</span>,
 <span class="release">Capacity</span>,
 <span class="release">Me &amp; Billy</span>,
 <span class="release">How Will the Wolf Survive?</span>,
 <span class="release">Purple Gas</span>,
 <span class="release">Volume 2</span>,
 <span class="release">Wild and Clear and Blue</span>,
 <span class="release">Automatic for the People</span>,
 <span class="release">Very Human Features</span>]
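
Because the artist, song, and album lists come from separate calls to .find_all(), they only line up row by row if every play on the page has exactly one tag of each class. The following is a rough sketch of how that could be checked, using a hand-written fragment rather than the real page (the names `snippet`, `demo`, and `rows` are hypothetical and chosen for illustration):

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the playlist markup (an assumption,
# not the real WNRN page)
snippet = """
<table>
  <tr><td><span class="artist">Pulp</span>
          <span class="song">Spike Island</span>
          <span class="release">More</span></td></tr>
  <tr><td><span class="artist">R.E.M.</span>
          <span class="song">Daysleeper</span>
          <span class="release">Up</span></td></tr>
</table>
"""
demo = BeautifulSoup(snippet, "html.parser")

artists = demo.find_all("span", "artist")
songs = demo.find_all("span", "song")
releases = demo.find_all("span", "release")

# If one play were missing a tag, the lists would have different lengths
# and every later row would be shifted out of alignment
same_length = len(artists) == len(songs) == len(releases)
print(same_length)

# Pair the three lists position by position and keep only the text in each tag
rows = [
    (a.get_text(), s.get_text(), r.get_text())
    for a, s, r in zip(artists, songs, releases)
]
for row in rows:
    print(row)
```

Note that zip() stops silently at the shortest list, so comparing the lengths first is what would reveal a missing tag.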

Finally, we also want to extract the time each song was played. Looking through the HTML code for an example of a play time, I find that these times are stored in <td> tags with class="spin-time". I create a list of these times:

timelist = wnrn.find_all("td", "spin-time")
timelist
[<td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413269211">5:02 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413268880">4:58 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413268626">4:54 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413268386">4:51 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413268033">4:45 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413267812">4:42 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413267595">4:38 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413267242">4:33 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413266794">4:26 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413266558">4:22 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413266235">4:17 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413266063">4:14 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413265815">4:10 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413265473">4:05 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413265229">4:01 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413264674">3:54 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413264516">3:52 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413264193">3:47 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413263932">3:43 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413263726">3:39 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413263546">3:37 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413263180">3:30 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413263020">3:28 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413262806">3:24 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413262607">3:21 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413262286">3:16 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413262061">3:12 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413261858">3:09 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413261507">3:03 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413261344">3:00 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413261026">2:56 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413260833">2:53 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413260664">2:50 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413260307">2:44 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413260050">2:40 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413259827">2:36 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413259350">2:29 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413259036">2:24 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413258874">2:21 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413258528">2:15 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413258310">2:11 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413258113">2:08 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413257797">2:03 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413257576">2:00 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413257201">1:54 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413257030">1:51 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413256695">1:45 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413256464">1:41 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413256302">1:38 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413255937">1:32 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413255719">1:28 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413255450">1:24 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413255212">1:21 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413254779">1:14 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413254554">1:10 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413254411">1:08 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413254049">1:02 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413253758">12:58 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413253568">12:54 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413252776">12:41 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413252566">12:38 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413252161">12:31 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413251917">12:28 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413251755">12:25 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413251396">12:19 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413251235">12:16 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413250990">12:12 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413250668">12:07 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413250438">12:03 PM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413250081">11:58 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413249838">11:54 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413249602">11:50 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413249326">11:46 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413249140">11:43 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413248917">11:39 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413248767">11:36 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413248425">11:30 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413248112">11:24 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413247918">11:21 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413247664">11:17 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413247532">11:14 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413247334">11:11 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413246791">11:01 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413246524">10:58 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413246315">10:55 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413246108">10:51 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413245765">10:45 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413245585">10:42 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413245315">10:37 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413244959">10:30 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413244711">10:26 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413244423">10:21 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413244090">10:15 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413243903">10:12 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413243725">10:08 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413243436">10:03 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413242944">9:55 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413242817">9:53 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413242659">9:50 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413242337">9:44 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413241637">9:30 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413241402">9:27 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413241207">9:23 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413240354">9:08 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413240032">9:03 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413239778">8:59 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413239553">8:55 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413239361">8:51 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413239026">8:45 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413238881">8:43 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413238727">8:40 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413238233">8:31 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413238022">8:27 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413237839">8:24 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413237599">8:19 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413237382">8:14 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413237143">8:10 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413236834">8:04 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413236619">8:00 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413235697">7:45 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413235465">7:41 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413235243">7:38 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413235035">7:35 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413234616">7:28 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413234472">7:25 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413234203">7:21 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413233867">7:16 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413233632">7:12 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413233423">7:08 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413233065">7:02 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413232798">6:59 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413232582">6:55 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413232379">6:51 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413232035">6:46 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413231717">6:42 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413231504">6:38 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413231275">6:34 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413230800">6:26 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413230549">6:22 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413230423">6:20 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413230089">6:15 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413229818">6:10 AM</a></td>,
 <td class="spin-time"><a href="/WNRN/pl/20657020/WNRN?sp=413229341">6:03 AM</a></td>]

Sometimes the information we need exists in a particular tag, but only when a specific attribute is present. For example, in the WNRN playlist HTML there are many <a> tags, but only some of those tags include a title attribute. To extract all of the <a> tags with a title attribute, specify title=True in the call to .find_all():

atags_title = wnrn.find_all("a", title=True)
print(atags_title[0:5]) # just show the first 5 elements
[<a class="buy-link" data-vendor="apple" href="#" target="_blank" title='View "Old Crow Medicine Show - Methamphetamine" on Apple'><div alt='View "Old Crow Medicine Show - Methamphetamine" on Apple' class="buy-icon buy-icon-apple"></div></a>, <a class="buy-link" data-vendor="amazon" href="#" target="_blank" title='View "Old Crow Medicine Show - Methamphetamine" on Amazon'><div alt='View "Old Crow Medicine Show - Methamphetamine" on Amazon' class="buy-icon buy-icon-amazon"></div></a>, <a class="buy-link" data-vendor="spotify" href="#" target="_blank" title='View "Old Crow Medicine Show - Methamphetamine" on Spotify'><div alt='View "Old Crow Medicine Show - Methamphetamine" on Spotify' class="buy-icon buy-icon-spotify"></div></a>, <a class="buy-link" data-vendor="apple" href="#" target="_blank" title='View "The Beths - A Real Thing" on Apple'><div alt='View "The Beths - A Real Thing" on Apple' class="buy-icon buy-icon-apple"></div></a>, <a class="buy-link" data-vendor="amazon" href="#" target="_blank" title='View "The Beths - A Real Thing" on Amazon'><div alt='View "The Beths - A Real Thing" on Amazon' class="buy-icon buy-icon-amazon"></div></a>]
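To see the effect of title=True in isolation, here is a minimal sketch on a small, made-up HTML snippet (not the real WNRN page). The variable names html, soup, with_title, and titles are chosen only for this illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; this is not the WNRN playlist.
html = """
<a href="/one" title="First link">one</a>
<a href="/two">two</a>
<a href="/three" title="Third link">three</a>
"""
soup = BeautifulSoup(html, "html.parser")

# title=True keeps only the <a> tags that carry a title attribute
with_title = soup.find_all("a", title=True)

# Pull the value of the title attribute out of each matching tag
titles = [a["title"] for a in with_title]
print(titles)  # ['First link', 'Third link']
```

The second link has no title attribute, so it is left out of the result.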

5.4.4. Constructing a Data Frame from HTML Data#

Next we need to place these data into a clean data frame. For that, we will need to keep the valid data while dropping the HTML tags. We stored the tags with the artists, songs, albums, and times in separate lists. Every name is stored as a navigable string in the HTML tags, so to extract these names we need to loop across the elements of the list. The simplest loop for this task is called a list comprehension, which has the following syntax:

newlist = [ expression for item in oldlist if condition ]

In this syntax, we are creating a new list by iteratively performing operations on the elements of an existing list (oldlist). item is a token that we will use to represent one item of the existing list. expression is the same Python code we would use on a single element of the existing list, except we replace the name of the element with the token defined with item. Finally condition is an optional part of this code which sets a filter by which only certain elements of the old list are transformed and placed into the new list (there’s an example of conditioning in a comprehension loop in the section on spiders).
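To make the syntax concrete with plain Python values, the sketch below writes the same operation twice: once as a list comprehension and once as an explicit loop. The names oldlist, newlist, and looped, and the length-based condition, are illustrative only:

```python
oldlist = ["Feist", "Pulp", "Laufey", "R.E.M."]

# Comprehension: upper-case every name that has at most 5 characters
newlist = [name.upper() for name in oldlist if len(name) <= 5]

# The same work written as an explicit loop
looped = []
for name in oldlist:
    if len(name) <= 5:
        looped.append(name.upper())

print(newlist)            # ['FEIST', 'PULP']
print(newlist == looped)  # True
```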

For example, to extract the navigable string from every element of artistlist, we can set item to a, expression to a.string, and oldlist to artistlist:

artists = [a.string for a in artistlist]
artists
['Old Crow Medicine Show',
 'The Beths',
 'Feist',
 'Pulp',
 'Laufey',
 'R.E.M.',
 'Leon Bridges',
 'Nathaniel Rateliff & The Night Sweats',
 'The Smiths',
 'Momma',
 'Hannah Cohen',
 'Jenny Lewis',
 'Yndling',
 'Monobloc',
 'David Gray',
 'Squeeze',
 'MRCY',
 'Rachel Chinouriri',
 'Dinosaur Jr.',
 'Trombone Shorty',
 'Japanese Breakfast',
 'Matt Berninger',
 'Caroline Spence',
 'Electric Light Orchestra',
 'Wishy',
 'Lord Huron',
 'Angie McMahon',
 'Inhaler',
 'Bon Iver',
 'The Velvet Underground and Nico',
 'Birdtalker',
 'Kacey Musgraves',
 'Sharp Pins',
 'Billy Strings',
 'Quilt',
 'Caamp',
 'BADBADNOTGOOD & V.C.R',
 'Peter Bjorn and John',
 'Hurray For The Riff Raff',
 'Arny Margret',
 'Bahamas',
 'Snacktime f/ Devon Gilfillian',
 'Neil Young',
 'Fiona Apple',
 'Phosphorescent',
 'S.G. Goodman',
 'Orla Gartland',
 'The War on Drugs',
 'James Bay f/ Jon Batiste',
 'Tunde Adebimpe',
 'Spectator Bird',
 'De La Soul',
 'Billie Marten',
 'The Head & The Heart',
 'Hop Along',
 'Panchiko',
 'Flock Of Dimes',
 'INXS',
 'Loaded Honey',
 'Ben Harper',
 "I'm With Her",
 'Peach Pit',
 '49 Winchester',
 'Neiked & Portugal. The Man',
 'flipturn',
 'Middle Kids',
 'King Gizzard & The Lizard Wizard',
 'John Butler',
 'Brandi Carlile',
 'Gold Connections',
 'The Breeders',
 'Watchhouse',
 'Man/Woman/Chainsaw',
 'The Hold Steady',
 'G. Love and Special Sauce',
 'Cardinals',
 'Alison Krauss & Union Station',
 'The Sherman Holmes Project',
 'Will Worden',
 'Dope Lemon',
 'Faye Webster',
 'Johnny Delaware',
 'Van Morrison',
 'The Vices',
 'Lewis OfMan (feat. Empress Of)',
 'Tune-Yards',
 'Counting Crows',
 'A Tribe Called Quest',
 'Tennis',
 'Deep Sea Diver',
 'Beirut',
 'Goose',
 'Leon Bridges',
 'Joe Jackson',
 'HAIM',
 'Lucy Dacus f/ Hozier',
 'Palmyra',
 'Bruce Springsteen',
 'Marc Broussard',
 'Girl and Girl',
 'Hannah Cohen',
 'John Prine',
 'Grace Potter',
 'Matt Andersen',
 'Webb Wilder',
 'Pete Yorn',
 'Esther Rose',
 'Lucius',
 'Julien Baker & TORRES',
 'Big Star',
 'Chris Knight',
 'Shinyribs',
 'My Morning Jacket',
 'The Cure',
 'Caamp',
 'Rosanne Cash',
 'Car Seat Headrest',
 'Southern Avenue',
 'Phoebe Bridgers',
 'Jade Bird',
 'Charles Wesley Godwin',
 'Yeah Yeah Yeahs',
 'Laufey',
 'Arny Margret',
 'Seth Walker',
 'Oracle Sisters',
 'Momma',
 'Day Wave',
 'Willie Nelson f/ Rodney Crowell',
 'Tunde Adebimpe',
 'Gillian Welch',
 'Death Cab for Cutie',
 'Maggie Rogers & Sylvan Esso',
 'Yndling',
 'Josh Ritter',
 'Big Thief',
 'Zach Top f/ Billy Strings',
 'Los Lobos',
 'Noeline Hofmann',
 'MRCY',
 "I'm With Her",
 'R.E.M.',
 'The Bug Club']

Likewise, we extract the navigable strings for the songs, albums, and times:

songs = [a.string for a in songlist]
albums = [a.string for a in albumlist]
times = [a.string for a in timelist]
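One general BeautifulSoup behavior is worth knowing when using .string: it returns None when a tag contains more than one child. The sketch below uses made-up HTML in which one artist name contains a nested tag (an assumption for illustration; nothing here says the Spinitron page is built this way) and compares .string with .get_text():

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second <span> has a nested <em> tag inside it.
html = """
<span class="artist">Feist</span>
<span class="artist">James Bay <em>f/</em> Jon Batiste</span>
"""
soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span", "artist")

# .string is None when the tag has several children
strings = [s.string for s in spans]

# .get_text() joins all of the text found inside the tag
texts = [s.get_text() for s in spans]

print(strings)  # ['Feist', None]
print(texts)    # ['Feist', 'James Bay f/ Jon Batiste']
```

If a column of the final data frame contains unexpected missing values, this difference is one possible cause to check.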

Finally, to construct a clean data frame, we create a dictionary that combines these lists and pass this dictionary to the pd.DataFrame() function:

mydict = {'time':times,
          'artist':artists,
          'song':songs,
          'album':albums}
wnrn_df = pd.DataFrame(mydict)
wnrn_df
time artist song album
0 5:02 PM Old Crow Medicine Show Methamphetamine Live at the Ryman
1 4:58 PM The Beths A Real Thing (Single)
2 4:54 PM Feist I Feel It All The Reminder
3 4:51 PM Pulp Spike Island More
4 4:45 PM Laufey Silver Lining (Single)
... ... ... ... ...
138 6:22 AM Noeline Hofmann Lightning in July (Prairie Fire) Purple Gas
139 6:20 AM MRCY Wandering Attention Volume 2
140 6:15 AM I'm With Her Ancient Light Wild and Clear and Blue
141 6:10 AM R.E.M. Nightswimming Automatic for the People
142 6:03 AM The Bug Club Jealous Boy Very Human Features

143 rows × 4 columns
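pd.DataFrame() raises an error when the lists in the dictionary have different lengths, which can happen if one of the tags is missing for some song. A small check before building the data frame makes that problem easier to diagnose. This is only a sketch: check_equal_lengths() is a hypothetical helper, not part of pandas, and the short dictionaries are sample values:

```python
def check_equal_lengths(columns):
    """Raise ValueError unless every list in the dictionary has the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Two columns of equal length pass the check
good = {"time": ["5:02 PM", "4:58 PM"],
        "artist": ["Old Crow Medicine Show", "The Beths"]}
print(check_equal_lengths(good))  # {'time': 2, 'artist': 2}

# A missing artist makes the columns unequal, so the check fails
bad = {"time": ["5:02 PM", "4:58 PM"],
       "artist": ["Old Crow Medicine Show"]}
try:
    check_equal_lengths(bad)
except ValueError as err:
    print(err)
```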

5.5. Building a Spider#

At the bottom of the WNRN playlist on https://spinitron.com/WNRN/ there are links to older song playlists. Let’s extend our example by building a spider to capture the data that exists on these links as well. A spider is a web scraper that follows links on a page automatically and scrapes from those links as well.

I look at the page source for these links, and find that they are contained in a <div class="recent-playlists"> tag. I start by finding this tag. As there’s only one occurrence, I can use .find() instead of .find_all():

recent = wnrn.find("div", "recent-playlists")
recent
<div class="recent-playlists">
<h4>Recent</h4>
<div class="grid-view" id="w2"><div class="summary"></div>
<table class="table table-bordered table-narrow"><tbody>
<tr data-key="0"><td class="show-time">5:00 AM</td><td></td><td><strong><a href="/WNRN/pl/20656821/WNRN-5-15-25-5-01-AM">WNRN 5/15/25, 5:01 AM</a></strong> with <a href="/WNRN/dj/104061/WNRN">WNRN</a></td></tr>
<tr data-key="1"><td class="show-time">4:00 AM</td><td></td><td><strong><a href="/WNRN/pl/20656630/WNRN-5-15-25-4-00-AM">WNRN 5/15/25, 4:00 AM</a></strong> with <a href="/WNRN/dj/104061/WNRN">WNRN</a></td></tr>
<tr data-key="2"><td class="show-time">8:00 PM</td><td></td><td><strong><a href="/WNRN/pl/20655184/WNRN">WNRN</a></strong> (Music)</td></tr>
<tr data-key="3"><td class="show-time">6:00 PM</td><td></td><td><strong><a href="/WNRN/pl/20654657/World-Caf%C3%A9">World Café</a></strong> (Music) with <a href="/WNRN/dj/179987/Raina-Douris-and-Stephen-Kallao">Raina Douris and Stephen Kallao</a></td></tr>
<tr data-key="4"><td class="show-time">6:00 AM</td><td></td><td><strong><a href="/WNRN/pl/20652238/WNRN">WNRN</a></strong> (Music)</td></tr>
</tbody></table>
</div></div>

Notice that all of the addresses we need are contained in <a> tags. We can extract these <a> tags with .find_all():

recent_atags = recent.find_all("a")
recent_atags
[<a href="/WNRN/pl/20656821/WNRN-5-15-25-5-01-AM">WNRN 5/15/25, 5:01 AM</a>,
 <a href="/WNRN/dj/104061/WNRN">WNRN</a>,
 <a href="/WNRN/pl/20656630/WNRN-5-15-25-4-00-AM">WNRN 5/15/25, 4:00 AM</a>,
 <a href="/WNRN/dj/104061/WNRN">WNRN</a>,
 <a href="/WNRN/pl/20655184/WNRN">WNRN</a>,
 <a href="/WNRN/pl/20654657/World-Caf%C3%A9">World Café</a>,
 <a href="/WNRN/dj/179987/Raina-Douris-and-Stephen-Kallao">Raina Douris and Stephen Kallao</a>,
 <a href="/WNRN/pl/20652238/WNRN">WNRN</a>]

The resulting list contains the web endpoints we need, and also some web endpoints we don’t need: we want the URLs that contain the string /pl/ as these are playlists, and we want to exclude the URLs that contain the string /dj/ as these pages refer to a particular DJ. We need a comprehension loop that loops across these elements, extracts the href attribute of the entries that include /pl/, and ignores the entries that include /dj/. We again use this syntax:

newlist = [ expression for item in oldlist if condition ]

In this case:

  • newlist is a list containing the URLs we want to direct our spider to. I call it wnrn_url.

  • item is one element of recent_atags, which I will call pl.

  • expression is code that extracts the web address from the href attribute of the <a> tag, so here the code would be pl['href'].

  • Finally, condition is a logical statement that should be True if the web address contains /pl/ and False if the web address contains /dj/. Here, the conditional statement should be if "/pl/" in pl['href']. This code will look for the string "/pl/" inside the string called by pl['href'] and return True or False depending on whether this string is found.

Putting all this syntax together gives us our list of playlist URLs:

wnrn_url = [pl['href'] for pl in recent_atags if "/pl/" in pl['href']]
wnrn_url
['/WNRN/pl/20656821/WNRN-5-15-25-5-01-AM',
 '/WNRN/pl/20656630/WNRN-5-15-25-4-00-AM',
 '/WNRN/pl/20655184/WNRN',
 '/WNRN/pl/20654657/World-Caf%C3%A9',
 '/WNRN/pl/20652238/WNRN']

First, we need to collect all of the code we created above to extract the artist, song, album, and play times from the HTML code. We define a function that does all of this work. We specify one argument for this function, the URL, so that all the function needs is the URL and it can output a clean dataframe. I name the function wnrn_spider():

def wnrn_spider(url):
    """Perform web scraping for any WNRN playlist given the available link"""
    
    headers = {'user-agent': 'Kropko class example (jkropko@virginia.edu)'}
    r = requests.get(url, headers=headers)
    wnrn = BeautifulSoup(r.text, 'html')
    
    artistlist = wnrn.find_all("span", "artist")
    songlist = wnrn.find_all("span", "song")
    albumlist = wnrn.find_all("span", "release")
    timelist = wnrn.find_all("td", "spin-time")
    
    artists = [a.string for a in artistlist]
    songs = [a.string for a in songlist]
    albums = [a.string for a in albumlist]
    times = [a.string for a in timelist]
    
    mydict = {'time':times, 'artist':artists, 'song':songs, 'album':albums}
    wnrn_df = pd.DataFrame(mydict)
    
    return wnrn_df

We can pass any of the URLs we collected to our function and get the other playlists. We will have to add the domain “https://spinitron.com” to the beginning of each of the URLs we collected:

wnrn2 = wnrn_spider('https://spinitron.com' + wnrn_url[0])
wnrn2
time artist song album
0 5:01 AM Blue Rodeo Til I am Myself Again Casino
1 5:05 AM Alison Krauss & Union Station Richmond on the James Arcadia
2 5:08 AM Turnpike Troubadours Ruby Ann The Price of Admission
3 5:11 AM Pixies Here Comes Your Man Doolittle
4 5:15 AM Wilder Woods f/ The War & Treaty Be Yourself (Single)
5 5:18 AM Hurray For The Riff Raff Pyramid Scheme (Single)
6 5:21 AM Nilufer Yanya Cold Heart (Single)
7 5:25 AM Old 97's King of All of the World Satellite Rides
8 5:28 AM Father John Misty Nancy from Now On Fear Fun
9 5:35 AM Jungle Keep Me Satisfied (Single)
10 5:39 AM Billy Strings Be Your Man Highway Prayers
11 5:43 AM Jesse Daniel My Time Is Gonna Come Son of the San Lorenzo
12 5:47 AM Luvcat Love & Money (Single)
13 5:51 AM Lord Huron Nothing I Need (Single)
14 5:55 AM Robert Earl Keen Over the Waterfall Picnic
15 5:59 AM David Bromberg Band I Like to Sleep Late in the Morning Midnight On The Water
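An alternative to joining the strings by hand is urljoin() from Python's standard library, which resolves a relative address against a base address and takes care of the slashes. A minimal sketch, where base, relative, and full are names chosen for this illustration:

```python
from urllib.parse import urljoin

base = "https://spinitron.com/WNRN/"
relative = "/WNRN/pl/20656821/WNRN-5-15-25-5-01-AM"

# A relative address that starts with "/" replaces the path of the base address
full = urljoin(base, relative)
print(full)  # https://spinitron.com/WNRN/pl/20656821/WNRN-5-15-25-5-01-AM
```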

Our goal here is to loop across all the URLs we collected, extract the data in a clean data frame, and append these data frames together to construct a longer playlist. To do that, we will use a for loop, which has the following syntax:

for index in list:
    expressions

This syntax is similar to the syntax we used to build a comprehension loop. list is an existing list, and index stands in for one element of this list. For each element of the list, we execute the code contained in expressions, which can use the index.
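Unlike a comprehension, the body of a for loop can contain several statements. As a small sketch, the loop below pulls the numeric playlist ID out of two of the addresses collected earlier; the names paths, ids, and playlist_id are illustrative:

```python
paths = ["/WNRN/pl/20656821/WNRN-5-15-25-5-01-AM",
         "/WNRN/pl/20655184/WNRN"]

ids = []
for p in paths:
    # Splitting on "/" gives ['', 'WNRN', 'pl', '<id>', ...]
    playlist_id = p.split("/")[3]
    ids.append(playlist_id)

print(ids)  # ['20656821', '20655184']
```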

For our spider, we will use the following steps:

  1. We take the data we already scraped from https://spinitron.com/WNRN (saved as wnrn_df) and copy it to a new variable named wnrn_total_playlist. It is important that we make a copy and do not overwrite wnrn_df: we will be repeatedly saving over wnrn_total_playlist within the loop, and keeping wnrn_df unchanged gives us a stable data frame to return to as a starting point if we need to rerun this loop.

  2. We use a for loop to loop across all the web addresses inside wnrn_url.

  3. In the for loop, we use the wnrn_spider() function to extract the playlist data from each of the URLs inside wnrn_url.

  4. In the for loop, we use the pd.concat() function to attach the new data to the bottom of the existing data, matching corresponding columns.

The code is as follows:

wnrn_total_playlist = wnrn_df.copy() # start from a copy so wnrn_df stays unchanged
for w in wnrn_url:
    moredata = wnrn_spider('https://spinitron.com' + w) # w already begins with "/"
    wnrn_total_playlist = pd.concat([wnrn_total_playlist, moredata], ignore_index=True)

We now have a data frame that combines all of the playlists on https://spinitron.com/WNRN and on the playlists linked to under “Recent”:

wnrn_total_playlist
time artist song album
0 5:02 PM Old Crow Medicine Show Methamphetamine Live at the Ryman
1 4:58 PM The Beths A Real Thing (Single)
2 4:54 PM Feist I Feel It All The Reminder
3 4:51 PM Pulp Spike Island More
4 4:45 PM Laufey Silver Lining (Single)
... ... ... ... ...
471 5:31 PM Samia Spine Oil Bloodless
472 5:40 PM Jimi Hendrix Hear My Train a Comin' People, Hell and Angels
473 5:45 PM Hurray For The Riff Raff Pyramid Scheme (Single)
474 5:51 PM Grace Potter Before The Sky Falls Medicine
475 5:54 PM Lowland Hum Olivia Lowland Hum

476 rows × 4 columns
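Because websites may rate-limit or block clients that send many requests in a short period, a spider can pause between requests. The sketch below is one possible variation on the loop above, not the only way to write it: it collects the data frames in a list, waits between requests, and calls pd.concat() once at the end. The function combine_playlists(), its delay argument, and the stand-in fake_scrape() are hypothetical names used so that the example runs without network access; in practice a scraping function such as wnrn_spider() and full URLs would take their place.

```python
import time
import pandas as pd

def combine_playlists(start_df, urls, scrape, delay=1.0):
    """Scrape every URL with scrape() and stack the results under start_df."""
    frames = [start_df]
    for u in urls:
        frames.append(scrape(u))
        time.sleep(delay)  # pause so requests are not sent back to back
    # One concat at the end; ignore_index=True renumbers the rows 0, 1, 2, ...
    return pd.concat(frames, ignore_index=True)

def fake_scrape(url):
    """Stand-in for a real scraper: returns a one-row data frame."""
    return pd.DataFrame({"time": ["1:00 PM"], "artist": ["Artist from " + url]})

start = pd.DataFrame({"time": ["5:02 PM"], "artist": ["Old Crow Medicine Show"]})
combined = combine_playlists(start, ["/a", "/b"], fake_scrape, delay=0)
print(combined)
```

Passing the scraping function as an argument keeps the looping logic separate from the downloading, so the loop can be tried on fake data before it is pointed at a real website.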