CS91R Lab 06: Scraping
Due Wednesday, April 2, before midnight
Goals

The goals for this lab assignment are:

- Use Beautiful Soup to parse HTML
- Use the Internet Archive's Wayback Machine to understand how websites evolve
- Work with JSON (JavaScript Object Notation)
- Use escape sequences to add color to terminal output
- Read and navigate APIs
Cloning your repository
Log into the CS91R-S25 github organization for our class and find your git repository for Lab 06, which will be of the format lab06-user1-user2, where user1 and user2 are you and your partner's usernames.
You can clone your repository using the following steps while connected to the CS lab machines:
# cd into your cs91r/labs sub-directory and clone your lab06 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab06-user1-user2.git
# change into your lab06 repo directory and list its contents
$ cd ~/cs91r/labs/lab06-user1-user2
# ls should list the following contents
$ ls
README.md
Answers to written questions should be included in the README.md file in your repository.
The Wayback Machine
The Wayback Machine provides an archive of nearly 1 trillion webpages, a crucial service that preserves webpages that might otherwise disappear. In this lab, we will use the Wayback Machine to explore how websites change over time. A recent article in the New York Times, These Words Are Disappearing in the New Trump Administration, looked at how language on federal websites has changed under the new administration.
Questions

- Read the NYT article (gift link). What are your takeaways from reading this article? What are some other federal and non-federal websites that you expect will also have been impacted that you can explore later in the lab?

- Open one of the webpages mentioned in the article and use your browser to view the source HTML of the page. (This will be different depending on the browser that you use.)
  - Does the structure of HTML remind you of anything we've done in class already?
  - What pieces of the webpage (i.e., what HTML elements) are the most interesting to track over time? Why?
Scraping APIs
curl
curl can be used for a variety of networking tasks, for example, to grab the weather. (You might want to make your window wider to see the full output.)
$ curl wttr.in/london
$ curl wttr.in/philadelphia?u
$ unset LESS # try this just for the next line and Question 3
$ curl wttr.in/philadelphia?u | less
- (YOU CAN SKIP THIS QUESTION) What's the difference between the last two ways to get Swarthmore's weather? Why? Does the -r option to less help?
We can use it to save a webpage as well:
$ curl -o people.html https://www.swarthmore.edu/computer-science/faculty-staff
$ less people.html
wget
wget is another tool to grab webpages. In addition to grabbing a single page, wget can crawl a webpage and recursively download the pages it links to as well as other content like images and stylesheets. The -c and -r options are particularly useful for this.
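As a purely illustrative example (not the exact command you'll need for the questions below), a recursive wget invocation might look like the following; the URL and recursion depth here are placeholders:

# illustrative only: fetch a page, one level of the pages it links to, and page requisites
$ wget -r -l 1 -p https://example.com/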
- Use the man pages to figure out how to download the Swarthmore weather forecast to a file called swat_weather.txt using wget, then view the file in the terminal. Show the command(s) you used to do that.

- What do the -c and -r options do? Describe other flags that look useful and why you might use them.

- Try scraping some subsample of https://quotes.toscrape.com/. Do not add what you've downloaded to your repository. Show the command you used to download the pages. Use command-line tools to report the number of files downloaded. Explore the files and folders you've downloaded. Discuss the structure of the downloaded pages. How much data (in bytes, or lines, or words) did you download?
Scraping websites using curl and wget can be legally and ethically prickly. For example, Aaron Swartz was prosecuted for automatically downloading academic articles via JSTOR. Remember: with great power comes great responsibility.
Scraping with Python
Python has a variety of libraries for scraping, crawling and parsing the web.
You will want to add four packages to your virtual environment. We will use these packages in various places in the rest of the lab.
$ workon cs91r
(cs91r) $ pip install requests requests-cache bs4 wayback
- Write a Python program called scrape.py that uses the requests library to download a webpage and write it to a local file. Use the requests documentation to figure out how to do this. You will need to import requests at the top of your program. You can hard-code the web page you want to download in your Python program or make it a command-line argument. You can choose the web page you'd like to download. You should print out the status code of your request, the headers, and the downloaded contents of the page. If the downloaded content is not HTML, choose another web page to download. A minimal sketch of this pattern is shown below.

Depending on the website, you may need to add the following parameter to your requests.get call: headers = {'User-Agent':'Mozilla/5.0'}
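Here is a minimal sketch of the requests pattern described above; the URL and output filename are placeholders, and your program should print and save whatever page you choose:

import requests

url = "https://example.com/"                     # placeholder URL; pick your own page
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

print(response.status_code)                      # e.g. 200 on success
print(response.headers)                          # dict-like object of response headers
print(response.text)                             # downloaded contents as a string

with open("page.html", "w") as outfile:          # placeholder output filename
    outfile.write(response.text)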
- The Beautiful Soup library can be used to parse HTML. Add to your scrape.py program. Use this library to parse the webpage previously scraped. Use the Beautiful Soup documentation to figure out how to do this. The getText method is particularly useful! You will need to add from bs4 import BeautifulSoup at the top of your program. (A small sketch follows this question.)
  - Print out the title of the webpage (if it has one).
  - Pull out the text of the body and print it.
  - Find each link in the document and print each URL on a separate line.
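A small sketch of the parsing step, assuming the downloaded page is already in a string called html_text (a placeholder name):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")   # html_text is a placeholder string

if soup.title is not None:
    print(soup.title.getText())                  # the page title, if there is one

for link in soup.find_all("a"):                  # every <a> element in the page
    print(link.get("href"))                      # its URL (may be relative or None)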
- When downloading webpages, it can be useful to cache them so you only download a new version if your cached version is out of date. To do this, we'll use the requests_cache library. This library will save copies of webpages to a local database. If we try to retrieve the same webpage again, we will use the cached version instead of going out to the internet to get a copy of it. Skim the documentation for the requests_cache library. (A small sketch of the "patching" approach follows.)
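The library's "patching" approach works roughly like this sketch; the cache name, expiration value, and URL are purely illustrative:

import requests
import requests_cache

# after install_cache(), ordinary requests calls are cached transparently ("patching")
requests_cache.install_cache("demo_cache", expire_after=60)   # illustrative values

response = requests.get("https://example.com/")               # placeholder URL
print(response.from_cache)       # False on the first request, True on a cached repeat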
- Add to your scrape.py program. Use the "Patching" method to install a cache, which will allow you to use the cache without making any other changes to your code. Add an argument to your installed cache that specifies that the cached version of the page is valid for 3600 seconds.

- If you were designing a web scraper, how would you determine how long pages in your cache should remain valid?
JSON
In an earlier lab we wrote assistants for solving the NYT's wordle and spelling bee puzzles. With our new power of scraping, full-on cheating is now an option (but there goes the fun!).
- Use the requests and requests-cache libraries, along with the json library, to find and nicely print the solutions to the spelling bee puzzle from the website. Notice the use of color in the expected output below: the first line is bold green, red indicates the center letter, blue indicates a pangram, and the words should be sorted by word length. Your solution should be in the file bee.py. (A short sketch of producing colored terminal output appears after the sample output.)
SPELLING BEE (March 22, 2025): m,d,g,i,l,n,o
molding
doom
glom
limn
limo
loom
midi
mild
mill
mind
mini
moil
mold
moll
mono
mood
moon
gloom
idiom
minim
mondo
doming
domino
miming
mining
minion
mooing
omigod
dimming
dooming
limning
looming
milling
million
minding
moiling
monolog
mooning
dominion
glomming
glooming
middling
mingling
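Terminal colors come from ANSI escape sequences. Here is a tiny sketch of the idea; the specific codes and strings below are illustrative, not the required output format:

# "\033[" begins an ANSI escape code, "m" ends it, and code 0 resets formatting
BOLD_GREEN = "\033[1;32m"
RED = "\033[31m"
BLUE = "\033[34m"
RESET = "\033[0m"

print(f"{BOLD_GREEN}SPELLING BEE{RESET}")
print(f"{RED}m{RESET},d,g,i,l,n,o")      # e.g., color only the center letter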
- Modify your bee.py program. Add a -p (plain) flag that, rather than relying on color and escape sequences, uses [] to indicate the center letter and * to indicate a pangram, as in the sample output below. Use argparse to handle the command-line argument. Why is implementing the plaintext mode a good idea? (A small argparse sketch follows the sample output.)
SPELLING BEE (March 22, 2025): [m],d,g,i,l,n,o
molding*
doom
glom
limn
limo
loom
midi
mild
mill
mind
mini
moil
mold
moll
mono
mood
moon
gloom
idiom
minim
mondo
doming
domino
miming
mining
minion
mooing
omigod
dimming
dooming
limning
looming
milling
million
minding
moiling
monolog
mooning
dominion
glomming
glooming
middling
mingling
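A minimal sketch of handling a flag like this with argparse; only the -p/--plain flag comes from the question, and the print statements are placeholders:

import argparse

parser = argparse.ArgumentParser(description="print spelling bee solutions")
parser.add_argument("-p", "--plain", action="store_true",
                    help="use plain-text markers instead of color")
args = parser.parse_args()

if args.plain:
    print("plain output goes here")      # placeholder for the plain-text mode
else:
    print("colored output goes here")    # placeholder for the colored mode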
- OPTIONAL: Do the same thing for letter boxed. Save your solution as box.py.
Scraping the Wayback Machine
In this section, we will scrape web pages from the Wayback Machine, which provides an archive of nearly 1 trillion webpages, including multiple versions of the same page, so it serves as a way to go "way back" in time and see how some websites used to look.
Scraping the Wayback Machine with curl
The Wayback Machine can be queried using curl; read over this tutorial for using curl with the Wayback Machine.
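For example, the Wayback Machine's availability endpoint returns JSON describing the archived snapshot closest to a given timestamp; the URL and timestamp below are just illustrations, not the query you need for the next question:

$ curl "http://archive.org/wayback/available?url=example.com&timestamp=19970101"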
- Download the earliest version of the Swarthmore CS website using curl. What year was that webpage last modified? Show what curl commands you used to do this.
Scraping the Wayback Machine with the wayback API

Next, we will use the wayback Python API to interact with the Wayback Machine.

Save your solutions to the next 3 questions in swat.py.
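A rough sketch of the wayback package's client interface, assuming the WaybackClient, search, and get_memento names described in its documentation (the URL is a placeholder, and attribute names may differ slightly from what the tutorial shows):

from wayback import WaybackClient

client = WaybackClient()

# search() yields one record per archived snapshot of the URL
for record in client.search("https://example.com/"):
    print(record.timestamp, record.view_url)
    break                                 # stop after the first record

# get_memento() retrieves the archived copy itself; its text can go to Beautiful Soup
memento = client.get_memento(record)
print(memento.text[:200])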
- Use this tutorial to find the earliest Swarthmore CS website. Use Beautiful Soup to find all the links in the webpage and print them out.

- What CS Swat people were listed on that first archived website?

- OPTIONAL: Can you find the first version of the website that mentions Rich and/or Keith?
Federal Websites
The NYT article identified three federal websites whose changes were illuminating:
Save your solution to this question in newspeak.py.

- Write a program to explore how these websites have changed over time. For example, how have the words used changed since January? Discuss your approach and any findings. (One possible starting point is sketched below.)
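One possible starting point is to compare word frequencies between an older and a newer snapshot. In this sketch, old_text and new_text are placeholder strings holding the text of two archived versions of a page:

from collections import Counter

def word_counts(text):
    # lowercase and split on whitespace; a real solution may want better tokenization
    return Counter(text.lower().split())

old_counts = word_counts(old_text)    # old_text / new_text are placeholders
new_counts = word_counts(new_text)

# words present in the older snapshot that no longer appear in the newer one
disappeared = set(old_counts) - set(new_counts)
print(sorted(disappeared))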
Other Websites
If you can extend your newspeak.py program to answer this question (while preserving the behavior from the previous question via a command-line argument or some other method), do that. If you would rather create a new program, name it explore.py.
- Use the Wayback Machine and Beautiful Soup to explore how some other websites have changed over time. ESPN reported that the MLB has changed its language. Do you see the same changes here, or on another website you are interested in? Discuss your approach and any findings.
How to turn in your solutions
Edit the README.md file that we provided to answer each of the questions and add any discussion you think we'd like to know about.

Be sure to commit and push all changes to your Python files.

If you think it would be helpful, use asciinema to record a terminal session and include it in your README.md.