CS 43 — Lab 1: A Basic Web Client
Due: Thursday, February 10 @ 11:59 PM
1. Overview
Please familiarize yourself with the course’s partnership expectations before starting the lab.
In this lab we’ll write our first networking application — a barebones web client. A web client communicates with a web server, and they both "speak" HTTP, the Hypertext Transfer Protocol.
HTTP uses a client-server model of communication, in which the client (your lab code) initiates communication and the server, which is always on, passively waits and responds. The web client and web server communicate using HTTP requests and responses. We’ll look into the HTTP protocol format in a lot more detail in class on Tuesday.
1.1. Goals
- Use git to clone a repository full of starter code.
- Apply top-down design to write a web client.
- Practice with C networking basics: sockets, send(), recv(), and DNS (name-to-IP resolution).
- Manipulate HTTP headers with C string functions.
1.2. Handy References
- RFC 1945: HTTP 1.0 Specification. Sections 4, 5, and 6 are probably the most helpful.
2. Requirements
We’ll write a command-line program named lab1 that takes a URL as its only parameter. It should retrieve the file specified in the URL and store it in the local directory (the one lab1 was run from) under the appropriate filename from the URL. If the URL does not end in a filename, your program should automatically name the file index.html.
For example:
# This should create a local file named 'pride_and_prejudice.txt' containing lots of text.
$ ./lab1 http://demo.cs.swarthmore.edu/example/pride_and_prejudice.txt
2.1. Workflow of Your Program
The high-level tasks your lab1 web client needs to perform are:
- Given a URL of the form http://host/path, construct an HTTP request and send it to the web server.
  - Isolate the host and the file portions using string functions.
  - Look up the server’s hostname via DNS to get its IP address so that you can address data to it.
  - Create a socket and connect to the server’s IP address on port 80, the port used for HTTP.
  - Generate an HTTP 1.0 request string for the specified file path.
  - Use the send system call to send the request over the network.
- Receive and interpret the server’s response.
  - Use the recv system call to receive the server’s response, in full.
  - Inspect the HTTP response code received from the web server.
  - If there are no errors, open a file for writing (named according to the name of the file in the URL argument) and save the body of the HTTP response to the file.
  - If there are errors, report them and exit.
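The DNS lookup and connect steps can be sketched roughly as follows. This is only one possible shape, not the required design: the helper name connect_to_host is hypothetical, the sketch restricts itself to IPv4, and the provided getaddrinfo.c remains the reference for how we use DNS in this course.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for getaddrinfo() under strict -std= modes */
#endif

#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the starter code): resolve host and
 * open a TCP connection to it. port is a string such as "80".
 * Returns a connected socket descriptor, or -1 on failure.
 */
int connect_to_host(const char *host, const char *port)
{
    struct addrinfo hints, *results, *ai;
    int fd = -1;

    assert(host != NULL && port != NULL);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* IPv4 only, to keep the sketch simple */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */

    int rc = getaddrinfo(host, port, &hints, &results);
    if (rc != 0) {
        /* getaddrinfo() has its own error codes, so perror() does not apply. */
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return -1;
    }

    /* Try each returned address until one accepts the connection. */
    for (ai = results; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) {
            continue;
        }
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            break;                    /* connected */
        }
        close(fd);
        fd = -1;
    }

    freeaddrinfo(results);
    return fd;                        /* still negative if every attempt failed */
}
```

A caller would use something like connect_to_host(host, "80") and treat a negative return value as a fatal error. The same call works whether host is a name or a dotted IP address, because getaddrinfo() accepts both.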
2.2. Client Behavior Expectations
For full credit:
- Your client should faithfully download and save byte-for-byte identical copies of the files it’s asked to retrieve. It should work for both text (e.g., html) and binary files (e.g., images). See: Examples and Testing below.
- Your client should report any errors or unexpected responses it encounters. If you get any HTTP response code other than 200, simply report the code you received and terminate.
- Your client should name the files it saves according to the name of the file in the URL argument. That is, everything after the final / in the URL should be considered the file name to use when storing the file locally. If there’s nothing after the final /, use the name index.html.
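The naming rule can be sketched with strrchr, which finds the last occurrence of a character. The helper name filename_from_url is hypothetical, and the sketch assumes the URL has the http://host/path form described in this lab.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the starter code): return the part of
 * a URL after its final '/', or "index.html" when nothing follows it.
 * Assumes the URL has the form http://host/path.
 */
const char *filename_from_url(const char *url)
{
    static const char prefix[] = "http://";

    assert(url != NULL);

    /* Skip the scheme so the slashes in "http://" never count as the final '/'. */
    if (strncmp(url, prefix, sizeof prefix - 1) == 0) {
        url += sizeof prefix - 1;
    }

    const char *last_slash = strrchr(url, '/');
    if (last_slash == NULL || last_slash[1] == '\0') {
        return "index.html";   /* no path at all, or the URL ends in '/' */
    }
    return last_slash + 1;     /* points into the caller's string; no copy is made */
}
```

Skipping the scheme first matters for a URL such as http://demo.cs.swarthmore.edu, where the only slashes are the two in http://.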
2.3. Assumptions
- You may assume that the URL will be no more than 100 characters long and that it will be of the form http://host/path, where:
  - The host portion will be an IP address (e.g., 130.58.68.26) or a hostname (e.g., demo.cs.swarthmore.edu). See Other Reference Material and the provided getaddrinfo.c file for more info about DNS.
  - The path may or may not be an empty string, may or may not contain multiple slashes (for subdirectories), and may or may not contain a file name. If no path is given, your client should request / (the root path). The server will send you back an index.html file, if it has one.
- You may assume that the files you’ll be retrieving are no larger than one megabyte. This means you can statically declare storage space for the server’s response, which makes life a bit easier.
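Under these assumptions, splitting the URL can be done with fixed-size buffers. The following is a minimal sketch, not the required approach: split_url and MAX_URL_LEN are hypothetical names, and the function relies on the 100-character limit stated above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_URL_LEN 100  /* the lab lets us assume URLs of at most 100 characters */

/*
 * Hypothetical helper (not part of the starter code): split
 * "http://host/path" into host and path. Both output buffers must hold
 * at least MAX_URL_LEN + 1 bytes.
 * Returns 0 on success, -1 if the URL is too long, does not start with
 * "http://", or has an empty host.
 */
int split_url(const char *url, char *host, char *path)
{
    static const char prefix[] = "http://";
    const size_t prefix_len = sizeof prefix - 1;

    assert(url != NULL && host != NULL && path != NULL);

    if (strlen(url) > MAX_URL_LEN || strncmp(url, prefix, prefix_len) != 0) {
        return -1;
    }

    const char *host_start = url + prefix_len;
    const char *slash = strchr(host_start, '/');   /* first '/' after the host */
    size_t host_len = slash ? (size_t)(slash - host_start) : strlen(host_start);

    if (host_len == 0) {
        return -1;
    }

    memcpy(host, host_start, host_len);
    host[host_len] = '\0';

    /* An empty path means "request the root", i.e. "/". */
    snprintf(path, MAX_URL_LEN + 1, "%s", slash ? slash : "/");
    return 0;
}
```

The one-megabyte assumption allows the same style for the response, for example a single static char array of 1024 * 1024 bytes (plus room for the headers) declared once and filled by your receive loop.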
2.4. Checkpoint
To be on track, by the start of the next lab session you should have finished:
- Extracting the host and path portions of the input URL
- Constructing the HTTP request by filling in the provided template
- Extracting just the file name from the path
A good stretch goal is:
- Start trying to send the request to the server and maybe print the response, but it’s ok if this part isn’t solid yet
It’s fine to defer until next week:
- Making send/recv calls robust
- Parsing response headers
- Saving the output file
3. Examples and Testing
To test your program, you’ll want to ensure that the files it’s saving are identical to the originals.
For a quick check, you can open the file in a browser, and it should appear like the original website. Note that appearance alone does NOT guarantee that the file is byte-for-byte identical.
An easy and more precise way to check that the files are correct is to use wget, which downloads files much like your lab program, to retrieve a correct copy of the file. Run your lab code on a URL first, then run wget on the same URL. It will store a (correct) copy of the file under the same name with .1 appended to the end of it.
For text files, you can use diff -u to see if the files are identical:
diff -u index.html index.html.1
If the files are identical you will see no output. If the files are not identical, diff will show you the lines that differ. Examining the differences may help you to narrow down what’s going wrong while debugging.
For all files (text and binary) you can use something like md5sum to generate a hash of the two files. If the hashes differ, so do the files. You will need to use something like md5sum to make sure binary files (images, PDFs) are identical. You can then compare wget’s file with yours:
$ md5sum index.html index.html.1
937e1d7af5e5cc0ce63694cdd2969233  index.html
937e1d7af5e5cc0ce63694cdd2969233  index.html.1
Here, the hash is 937e1d7af5e5cc0ce63694cdd2969233, and it matches for both files.
4. Tips & FAQ
- Use HTTP version 1.0; version 1.1 can get a lot more complicated. The subset of the HTTP 1.0 protocol you’ll need to implement for this assignment is quite small, but you may find the full protocol specification to be helpful.
- All HTTP headers are ASCII string characters, so you can use the str family of functions to manipulate them safely.

  Do NOT use strlen, or any other string functions, on the body of the response. The response body is not necessarily a string. In some cases (e.g., html responses) it will be, but in other cases (e.g., image files) it won’t be. Remember that the C string functions look for, and typically terminate when they find, the null terminator character. A null terminator is nothing more than a byte whose value is zero (0). Such bytes are LIKELY TO BE PRESENT in binary response data. If you call strlen() on binary data and it finds a 0, it will stop and return the WRONG ANSWER to you.
- "But, if I can’t call strlen() on the response, how will I know how much data I received?" The recv() function’s return value will tell you how many bytes you received every time you call it. Likewise, the send() function will tell you how many bytes you successfully transmitted.

  You should ALWAYS check the return values of these functions because the answer may not be what you expect. That is, even if you tell recv() to get 1000 bytes, the call may return with fewer bytes, and the only way you’ll know is to check the return value. Likewise, you may tell send() to transmit 1000 bytes, but it may only have room to buffer fewer bytes. You can’t just assume that all 1000 bytes were sent! Instead, check the return value of send() to see if (or which) bytes need to be resent.

  For this lab assignment, your life will be easier if you call send() and recv() each in exactly one place (inside a loop). Use send() in a loop to send the entire request and recv() in a loop to read the entire response. If recv() returns 0, it means you’ve reached the end of the data.
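The two loops described above might look something like this. It is a sketch under simple assumptions (blocking sockets, and a buffer large enough for the whole response); the names send_all and recv_all are hypothetical. Note that the byte counts come only from the return values of send() and recv(), never from strlen().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: keep calling send() until all len bytes are sent.
 * Returns 0 on success, -1 if send() reports an error.
 */
int send_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    assert(buf != NULL);

    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n < 0) {
            return -1;           /* the caller can report the reason with perror() */
        }
        sent += (size_t)n;       /* resume right after the bytes that were accepted */
    }
    return 0;
}

/*
 * Hypothetical helper: keep calling recv() until the peer closes the
 * connection (recv() returns 0) or the buffer is full.
 * Returns the total number of bytes stored in buf, or -1 on error.
 */
ssize_t recv_all(int fd, char *buf, size_t cap)
{
    size_t total = 0;

    assert(buf != NULL);

    while (total < cap) {
        ssize_t n = recv(fd, buf + total, cap - total, 0);
        if (n < 0) {
            return -1;
        }
        if (n == 0) {
            break;               /* end of data: the server closed the connection */
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Reading until recv() returns 0 works here because an HTTP 1.0 server normally closes the connection once it has sent the whole response.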
4.1. String Manipulation
- Spend some time thinking about how to do the string manipulation. It does not need to be complex; refer back to your lab 0 code for inspiration.
- Good functions to use for handling filenames and text include: snprintf, sscanf, strstr, and strchr. You can learn more about these and other useful functions (e.g., send and recv) by reading their man pages. For example, try man snprintf on the command line.
- The newlines, which signal the end of a message in many protocols, are represented in HTTP as \r\n, not just \n.
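As an illustration of snprintf and the \r\n line endings, here is one way an HTTP 1.0 request could be assembled. Treat it as a sketch: build_request is a hypothetical name, the Host header is an assumption (HTTP 1.0 does not require it, though many servers expect it), and if the starter code provides a request template, use that instead.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write an HTTP 1.0 GET request for path on host
 * into buf. Returns the request length in bytes, or -1 if the request
 * does not fit in cap bytes.
 */
int build_request(char *buf, size_t cap, const char *host, const char *path)
{
    assert(buf != NULL && host != NULL && path != NULL);

    /* Every line ends in \r\n, and an empty line (\r\n alone) ends the headers. */
    int n = snprintf(buf, cap,
                     "GET %s HTTP/1.0\r\n"
                     "Host: %s\r\n"
                     "\r\n",
                     path, host);

    if (n < 0 || (size_t)n >= cap) {
        return -1;   /* encoding error, or the output was truncated */
    }
    return n;
}
```

Because snprintf returns the number of characters it wrote, the caller already knows how many bytes to hand to send() without calling strlen().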
4.2. Writing Output
- The fopen, fwrite, and fclose functions may be useful for writing output files.
- Make sure you do not save the HTTP headers from the web server’s response as part of the file’s contents.
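One way to keep the headers out of the saved file is to locate the blank line (\r\n\r\n) that ends them and write only the bytes after it. The sketch below uses hypothetical helper names and works on explicit lengths, so zero bytes in a binary body cause no trouble.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find where the body starts in a raw HTTP response
 * of len bytes. Returns the body's offset, or -1 if the blank line that
 * ends the headers is missing.
 */
long find_body_offset(const char *resp, size_t len)
{
    assert(resp != NULL);

    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(resp + i, "\r\n\r\n", 4) == 0) {
            return (long)(i + 4);    /* first byte after the blank line */
        }
    }
    return -1;
}

/*
 * Hypothetical helper: write exactly len bytes of body to the named file.
 * Returns 0 on success, -1 on failure.
 */
int save_body(const char *filename, const char *body, size_t len)
{
    assert(filename != NULL && body != NULL);

    FILE *out = fopen(filename, "wb");   /* "b" makes the binary intent explicit */
    if (out == NULL) {
        perror("fopen");
        return -1;
    }

    size_t written = fwrite(body, 1, len, out);
    if (fclose(out) != 0 || written != len) {
        return -1;
    }
    return 0;
}
```

If recv reported total bytes overall and find_body_offset returned off, the body is the total - off bytes starting at resp + off.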
4.3. General C Programming
- Good systems programming involves:
  - writing a small bit of code
  - testing
  - brief comments
  - repeat
- Test early and often, and don’t write new code until you’ve ironed out any problems with your existing code!
- Run valgrind as you go, rather than waiting until the end. It will help you identify problems sooner!
- If a system call fails, the perror() function will typically tell you why, in a nice, human-readable way. Take advantage of it, and don’t assume why a system call might be failing!
5. Other Reference Material
This lab comes early in the semester, when we haven’t seen much course content yet. I’ve put together some brief reference material to help bootstrap you on some of the tools we’ll be using for this lab. I do NOT expect that you will finish this lab as an expert on these topics, just that you’ll have enough background to finish the lab. We’ll cover the details in class soon enough.
6. Submitting
Please remove any excessive debugging output prior to submitting.
To submit your code, commit your changes locally using git add and git commit. Then run git push while in your lab directory.