CS91R Lab 06: Scraping
Due Wednesday, April 2, before midnight
Goals

The goals for this lab assignment are:

- Use Beautiful Soup to parse HTML
- Use the Internet Archive's Wayback Machine to understand how websites evolve
- Work with JSON (JavaScript Object Notation)
- Use escape sequences to add color to terminal output
- Read and navigate APIs
Cloning your repository
Log into the CS91R-S25 github organization for our class and find your git repository for Lab 06, which will be of the format lab06-user1-user2, where user1 and user2 are you and your partner's usernames.
You can clone your repository using the following steps while connected to the CS lab machines:
# cd into your cs91r/labs sub-directory and clone your lab06 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab06-user1-user2.git
# change into your lab06 repo directory and list its contents
$ cd ~/cs91r/labs/lab06-user1-user2
# ls should list the following contents
$ ls
README.md
Answers to written questions should be included in the README.md file in your repository.
The Wayback Machine
The Wayback Machine provides an archive of nearly 1 trillion webpages, a crucial service that preserves webpages that might otherwise disappear. In this lab, we will use the Wayback Machine to explore how websites change over time. A recent article in the New York Times, These Words Are Disappearing in the New Trump Administration, looked at how language on federal websites has changed under the new administration.
Questions

- Read the NYT article (gift link). What are your takeaways from reading this article? What are some other federal and non-federal websites that you expect will also have been impacted that you can explore later in the lab?

- Open one of the webpages mentioned in the article and use your browser to view the source HTML of the page. (This will be different depending on the browser that you use.)
  - Does the structure of HTML remind you of anything we've done in class already?
  - What pieces of the webpage (i.e., what HTML elements) are the most interesting to track over time? Why?
Scraping APIs
curl
curl can be used for a variety of networking tasks, for example, to grab the weather. (You might want to make your window wider to see the full output.)
$ curl wttr.in/london
$ curl wttr.in/philadelphia?u
$ unset LESS # try this just for the next line and Question 3
$ curl wttr.in/philadelphia?u | less
- (YOU CAN SKIP THIS QUESTION) What's the difference between the last two ways to get Swarthmore's weather? Why? Does the -r option to less help?
We can use it to save a webpage as well:
$ curl -o people.html https://www.swarthmore.edu/computer-science/faculty-staff
$ less people.html
wget
wget is another tool to grab webpages. In addition to grabbing a single page, wget can crawl a webpage and recursively download the pages it links to as well as other content like images and stylesheets. The -c and -r options are particularly useful for this.
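As a purely illustrative example (not the exact command you'll need for the questions below), a recursive wget invocation might look like the following; the URL and recursion depth here are placeholders:

# illustrative only: fetch a page, one level of the pages it links to, and page requisites
$ wget -r -l 1 -p https://example.com/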
- Use the man pages to figure out how to download the Swarthmore weather forecast to a file called swat_weather.txt using wget, then view the file in the terminal. Show the command(s) you used to do that.

- What do the -c and -r options do? Describe other flags that look useful and why you might use them.

- Try scraping some subsample of https://quotes.toscrape.com/. Do not add what you've downloaded to your repository. Show the command you used to download the pages. Use command-line tools to report the number of files downloaded. Explore the files and folders you've downloaded. Discuss the structure of the downloaded pages. How much data (in bytes, or lines, or words) did you download?
Scraping websites using curl and wget can be legally and ethically prickly. For example, Aaron Swartz was prosecuted for automatically downloading academic articles via JSTOR. Remember: with great power comes great responsibility.
Scraping with Python
Python has a variety of libraries for scraping, crawling and parsing the web.
You will want to add four packages to your virtual environment. We will use these packages in various places in the rest of the lab.
$ workon cs91r
(cs91r) $ pip install requests requests-cache bs4 wayback
- Write a Python program called scrape.py that uses the requests library to download a webpage and write it to a local file. Use the requests documentation to figure out how to do this. You will need to import requests at the top of your program. You can hard-code the web page you want to download in your Python program or make it a command-line argument. You can choose the web page you'd like to download. You should print out the status code of your request, the headers, and the downloaded contents of the page. If the downloaded content is not HTML, choose another web page to download. A minimal sketch of this pattern is shown below.

Depending on the website, you may need to add the following parameter to your requests.get call: headers = {'User-Agent':'Mozilla/5.0'}
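Here is a minimal sketch of the requests pattern described above; the URL and output filename are placeholders, and your program should print and save whatever page you choose:

import requests

url = "https://example.com/"                     # placeholder URL; pick your own page
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

print(response.status_code)                      # e.g. 200 on success
print(response.headers)                          # dict-like object of response headers
print(response.text)                             # downloaded contents as a string

with open("page.html", "w") as outfile:          # placeholder output filename
    outfile.write(response.text)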
- The Beautiful Soup library can be used to parse HTML. Add to your scrape.py program. Use this library to parse the webpage previously scraped. Use the Beautiful Soup documentation to figure out how to do this. The getText method is particularly useful! You will need to add from bs4 import BeautifulSoup at the top of your program. (A small sketch follows this question.)
  - Print out the title of the webpage (if it has one).
  - Pull out the text of the body and print it.
  - Find each link in the document and print each URL on a separate line.
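A small sketch of the parsing step, assuming the downloaded page is already in a string called html_text (a placeholder name):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")   # html_text is a placeholder string

if soup.title is not None:
    print(soup.title.getText())                  # the page title, if there is one

for link in soup.find_all("a"):                  # every <a> element in the page
    print(link.get("href"))                      # its URL (may be relative or None)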
- When downloading webpages, it can be useful to cache them so you only download a new version if your cached version is out of date. To do this, we'll use the requests_cache library. This library will save copies of webpages to a local database. If we try to retrieve the same webpage again, we will use the cached version instead of going out to the internet to get a copy of it. Skim the documentation for the requests_cache library. (A small sketch of the "patching" approach follows.)
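The library's "patching" approach works roughly like this sketch; the cache name, expiration value, and URL are purely illustrative:

import requests
import requests_cache

# after install_cache(), ordinary requests calls are cached transparently ("patching")
requests_cache.install_cache("demo_cache", expire_after=60)   # illustrative values

response = requests.get("https://example.com/")               # placeholder URL
print(response.from_cache)       # False on the first request, True on a cached repeat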
- Add to your scrape.py program. Use the "Patching" method to install a cache, which will allow you to use the cache without making any other changes to your code. Add an argument to your installed cache that specifies that the cached version of the page is valid for 3600 seconds.

- If you were designing a web scraper, how would you determine how long pages in your cache should remain valid?
JSON
In an earlier lab we wrote assistants for solving the NYT's wordle and spelling bee puzzles. With our new power of scraping, full-on cheating is now an option (but there goes the fun!).
- Use the requests and requests-cache libraries, along with the json library, to find and nicely print the solutions to the spelling bee puzzle from the website. Notice the use of color in the expected output below: the first line is bold green, red indicates the center letter, blue indicates a pangram, and the words should be sorted by word length. Your solution should be in the file bee.py. (A short sketch of producing colored terminal output appears after the sample output.)
SPELLING BEE (March 22, 2025): m,d,g,i,l,n,o
molding
doom
glom
limn
limo
loom
midi
mild
mill
mind
mini
moil
mold
moll
mono
mood
moon
gloom
idiom
minim
mondo
doming
domino
miming
mining
minion
mooing
omigod
dimming
dooming
limning
looming
milling
million
minding
moiling
monolog
mooning
dominion
glomming
glooming
middling
mingling
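Terminal colors come from ANSI escape sequences. Here is a tiny sketch of the idea; the specific codes and strings below are illustrative, not the required output format:

# "\033[" begins an ANSI escape code, "m" ends it, and code 0 resets formatting
BOLD_GREEN = "\033[1;32m"
RED = "\033[31m"
BLUE = "\033[34m"
RESET = "\033[0m"

print(f"{BOLD_GREEN}SPELLING BEE{RESET}")
print(f"{RED}m{RESET},d,g,i,l,n,o")      # e.g., color only the center letter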
- Modify your bee.py program. Add a -p (plain) flag that, rather than relying on color and escape sequences, uses [] to indicate the center letter and * to indicate a pangram, as in the sample output below. Use argparse to handle the command-line argument. Why is implementing the plaintext mode a good idea? (A small argparse sketch follows the sample output.)
SPELLING BEE (March 22, 2025): [m],d,g,i,l,n,o
molding*
doom
glom
limn
limo
loom
midi
mild
mill
mind
mini
moil
mold
moll
mono
mood
moon
gloom
idiom
minim
mondo
doming
domino
miming
mining
minion
mooing
omigod
dimming
dooming
limning
looming
milling
million
minding
moiling
monolog
mooning
dominion
glomming
glooming
middling
mingling
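A minimal sketch of handling a flag like this with argparse; only the -p/--plain flag comes from the question, and the print statements are placeholders:

import argparse

parser = argparse.ArgumentParser(description="print spelling bee solutions")
parser.add_argument("-p", "--plain", action="store_true",
                    help="use plain-text markers instead of color")
args = parser.parse_args()

if args.plain:
    print("plain output goes here")      # placeholder for the plain-text mode
else:
    print("colored output goes here")    # placeholder for the colored mode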
- OPTIONAL: Do the same thing for letter boxed. Save your solution as box.py.
Scraping the Wayback Machine
In this section, we will scrape web pages from the Wayback Machine, which provides an archive of nearly 1 trillion webpages, including multiple versions of the same page, so it serves as a way to go "way back" in time and see how some websites used to look.
Scraping the Wayback Machine with curl
The Wayback Machine can be queried using curl; read over this tutorial for using curl with the Wayback Machine.
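For example, the Wayback Machine's availability endpoint returns JSON describing the archived snapshot closest to a given timestamp; the URL and timestamp below are just illustrations, not the query you need for the next question:

$ curl "http://archive.org/wayback/available?url=example.com&timestamp=19970101"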
- Download the earliest version of the Swarthmore CS website using curl. What year was that webpage last modified? Show what curl commands you used to do this.
Scraping the Wayback Machine with the wayback API

Next, we will use the wayback Python API to interact with the Wayback Machine.

Save your solutions to the next 3 questions in swat.py.
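A rough sketch of the wayback package's client interface, assuming the WaybackClient, search, and get_memento names described in its documentation (the URL is a placeholder, and attribute names may differ slightly from what the tutorial shows):

from wayback import WaybackClient

client = WaybackClient()

# search() yields one record per archived snapshot of the URL
for record in client.search("https://example.com/"):
    print(record.timestamp, record.view_url)
    break                                 # stop after the first record

# get_memento() retrieves the archived copy itself; its text can go to Beautiful Soup
memento = client.get_memento(record)
print(memento.text[:200])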
- Use this tutorial to find the earliest Swarthmore CS website. Use Beautiful Soup to find all the links in the webpage and print them out.

- What CS Swat people were listed on that first archived website?

- OPTIONAL: Can you find the first version of the website that mentions Rich and/or Keith?
Federal Websites
The NYT article identified three federal websites whose changes were illuminating:
Save your solution to this question in newspeak.py.

- Write a program to explore how these websites have changed over time. For example, how have the words used changed since January? Discuss your approach and any findings. (One possible starting point is sketched below.)
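One possible starting point is to compare word frequencies between an older and a newer snapshot. In this sketch, old_text and new_text are placeholder strings holding the text of two archived versions of a page:

from collections import Counter

def word_counts(text):
    # lowercase and split on whitespace; a real solution may want better tokenization
    return Counter(text.lower().split())

old_counts = word_counts(old_text)    # old_text / new_text are placeholders
new_counts = word_counts(new_text)

# words present in the older snapshot that no longer appear in the newer one
disappeared = set(old_counts) - set(new_counts)
print(sorted(disappeared))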
Other Websites
If you can extend your newspeak.py program to answer this question (while preserving the behavior from the previous question via a command-line argument or some other method), do that. If you would rather create a new program, name it explore.py.
- Use the Wayback Machine and Beautiful Soup to explore how some other websites have changed over time. ESPN reported that the MLB has changed its language. Do you see the same changes here, or on another website you are interested in? Discuss your approach and any findings.
How to turn in your solutions
Edit the README.md file that we provided to answer each of the questions and add any discussion you think we'd like to know about.

Be sure to commit and push all changes to your Python files.

If you think it would be helpful, use asciinema to record a terminal session and include it in your README.md.