Handy References:
Lab Audio
Lab 1 Goals:
- Use git to clone a repository full of starter code.
- Practice with C networking basics: sockets, DNS, send(), and recv().
- Manipulate HTTP headers with C string functions.
- Apply top-down design to a web client.
Overview
For this week's programming exercise, we will create a barebones web client
(think wget). Based on the example TCP client code in your
repository and the HTTP example sessions shown in class, we'll write a
command-line program called lab1 that takes a URL as its only parameter,
retrieves the indicated file, and stores it in the local directory with the
appropriate filename. If the URL does not end in a filename, your program
should automatically name the file 'index.html'.
For example:
# This should create a local file named 'pride_and_prejudice.txt' containing lots of text.
$ ./lab1 http://demo.cs.swarthmore.edu/example/pride_and_prejudice.txt
# This should create a local file named 'index.html' containing the demo server's home page contents.
$ ./lab1 http://demo.cs.swarthmore.edu
# This should create a local image file named 'fiona.jpg' containing a cute cat picture.
$ ./lab1 http://demo.cs.swarthmore.edu/example/fiona.jpg
You may assume that the URL will be no more than 100 characters long and
that it will be of the form http://host/path, where:
- The host portion may be an IP address or a hostname like
"demo.cs.swarthmore.edu". Socket programming requires an IP address for
communication (e.g., 130.58.68.137), so when given hostnames, you'll need to
query the domain name system (DNS) to find their corresponding IP address.
To look up an IP address for a given host name, use getaddrinfo() ("man
getaddrinfo" on the command line will give you the details), or consult the
getaddrinfo.c example. We'll cover DNS in much more detail later in the
course, so for now, you can treat it like a black box that magically converts
hostnames to IP addresses.
- The path may or may not be an empty string, may or may not contain
multiple slashes (for subdirectories), and may or may not contain a file
name. If no path is given, your client should request: "/" (without quotes).
The server will send you back an index.html file, if it has one.
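As a rough illustration of the DNS lookup described in the host bullet above, here is a minimal sketch built on getaddrinfo(). The helper name resolve_ipv4 and its parameters are made up for this example (they are not part of the starter code), and the sketch assumes IPv4 only. The getaddrinfo.c example in your repository remains the reference to follow.

```c
#define _POSIX_C_SOURCE 200112L
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not from the starter code): resolve `host`, which may
 * be a hostname or a dotted-quad IP address, and write the IPv4 address in
 * text form into `out`. Returns 0 on success, -1 on failure. */
int resolve_ipv4(const char *host, char *out, size_t out_len) {
    struct addrinfo hints;
    struct addrinfo *results = NULL;

    assert(host != NULL && out != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* assumption: IPv4 only */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    /* "80" is the HTTP port; getaddrinfo fills it into each result. */
    int rc = getaddrinfo(host, "80", &hints, &results);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo(%s): %s\n", host, gai_strerror(rc));
        return -1;
    }

    /* Use the first result and convert its binary address to text. */
    struct sockaddr_in *addr = (struct sockaddr_in *)results->ai_addr;
    const char *text = inet_ntop(AF_INET, &addr->sin_addr, out,
                                 (socklen_t)out_len);

    freeaddrinfo(results);
    return (text == NULL) ? -1 : 0;
}
```

In a real client you usually don't need the text form at all: the ai_addr and ai_addrlen fields of a result can be passed straight to connect(), as long as you do so before calling freeaddrinfo().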
You may assume that the files you'll be retrieving are no larger than one
megabyte. This means you can statically declare storage space for the
response, which makes life a bit easier.
Requirements
- Your client should faithfully download and save byte-for-byte identical
copies of the files it's asked to retrieve. It should work for both text
files (e.g., html) and binary files (e.g., images). See the "Testing" section
below.
- Your client should report any errors or unexpected responses it encounters.
If you get any HTTP response code other than 200, simply report the code you
received and terminate.
- Your client should name the files it saves according to the name of the
file in the URL argument. That is, everything after the final '/' in the URL
should be considered the file name to use when storing the file locally. If
there's nothing after the final '/', use the name 'index.html'.
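One way the naming rule above could be sketched is shown below. Note a subtle trap: for a URL like http://demo.cs.swarthmore.edu, the text after the final '/' is the host name, so this sketch first splits off the host and then looks for the file name in the path only. The names split_url, local_filename, and MAX_URL are invented for this illustration, and the code assumes the URL has the http://host/path form promised earlier.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_URL 100 /* the handout caps URLs at 100 characters */

/* Hypothetical helper: split "http://host/path" into host and path.
 * Both output buffers must hold at least MAX_URL + 1 bytes.
 * An empty path becomes "/". Returns 0 on success, -1 on a malformed URL. */
int split_url(const char *url, char *host, char *path) {
    const char *prefix = "http://";
    size_t prefix_len = strlen(prefix);

    assert(url != NULL && host != NULL && path != NULL);
    if (strlen(url) > MAX_URL || strncmp(url, prefix, prefix_len) != 0) {
        return -1;
    }

    const char *rest = url + prefix_len;   /* "host" or "host/path" */
    const char *slash = strchr(rest, '/'); /* first '/' ends the host */
    if (slash == NULL) {
        snprintf(host, MAX_URL + 1, "%s", rest);
        snprintf(path, MAX_URL + 1, "/");
    } else {
        snprintf(host, MAX_URL + 1, "%.*s", (int)(slash - rest), rest);
        snprintf(path, MAX_URL + 1, "%s", slash);
    }
    return (host[0] == '\0') ? -1 : 0;
}

/* Hypothetical helper: return the part of `path` after its final '/',
 * or "index.html" if nothing follows that slash. The result points into
 * `path` (or at a string literal), so nothing needs to be freed. */
const char *local_filename(const char *path) {
    assert(path != NULL);
    const char *last = strrchr(path, '/');
    const char *name = (last == NULL) ? path : last + 1;
    return (*name == '\0') ? "index.html" : name;
}
```

With these helpers, http://demo.cs.swarthmore.edu/example/fiona.jpg yields the host demo.cs.swarthmore.edu, the path /example/fiona.jpg, and the local name fiona.jpg, while http://demo.cs.swarthmore.edu yields the path / and the local name index.html.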
Miscellaneous hints and background information
- The general workflow of your program will be:
- Break the URL argument into the host and file portions.
- Look up the hostname via DNS to get its IP address.
- Create a socket and connect to that IP address on port 80.
- Generate an HTTP 1.0 request for the file and send it to the server.
- Read the response and report errors. If no errors, save the response body to a file.
- All HTTP headers are ASCII text, so you can use the
str* family of functions to manipulate them safely. Do NOT use
strlen, or any other string functions, on the body of the response. The
response body is not necessarily a string. In some cases (e.g., html
responses) it will be, but in other cases (e.g., image files) it won't be.
Remember that the C string functions look for, and typically terminate when
they find, the null terminator character. A null terminator is nothing more
than a byte whose value is zero (0). Such bytes are LIKELY TO BE PRESENT in
binary response data. If you call strlen() on binary data and it
finds a 0, it will stop and return the WRONG ANSWER to you.
- "But, if I can't call strlen() on the response, how will I know how much
data I received?" The recv() function's return value will tell you how many
bytes you received every time you call it. Likewise, the send() function
will tell you how many bytes you successfully transmitted. You should ALWAYS
check the return values of these functions because the answer may not be what
you expect. That is, even if you tell recv() to get 1000 bytes, the call may
return with fewer bytes, and the only way you'll know is to check the return
value. Likewise, you may tell send() to transmit 1000 bytes, but it may only
have room to buffer fewer bytes. You can't just assume that all 1000 bytes
were sent! Instead, check the return value of send() to see if (or which)
bytes need to be resent.
For this lab assignment, your life will be easier if you call send() and recv()
each in exactly one place (inside a loop). Use send() in a loop to send the
entire request and recv() in a loop to read the entire response. If recv()
returns 0, it means the server has closed its end of the connection, which
for an HTTP 1.0 response marks the end of the data.
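The two loops described above might look roughly like the following sketch. The helper names send_all and recv_all are invented for this example, and the error handling is deliberately simple (it just reports the error and gives up), so treat it as a starting point rather than the required design.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: call send() repeatedly until all `len` bytes of
 * `buf` have been handed to the socket. Returns 0 on success, -1 on error. */
int send_all(int sock, const char *buf, size_t len) {
    size_t sent = 0;

    assert(buf != NULL);
    while (sent < len) {
        /* send() may accept fewer bytes than requested; resume after them. */
        ssize_t n = send(sock, buf + sent, len - sent, 0);
        if (n < 0) {
            perror("send");
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: call recv() repeatedly until the peer closes the
 * connection (recv() returns 0) or `cap` bytes have been stored in `buf`.
 * Returns the total number of bytes received, or -1 on error. */
ssize_t recv_all(int sock, char *buf, size_t cap) {
    size_t total = 0;

    assert(buf != NULL);
    while (total < cap) {
        ssize_t n = recv(sock, buf + total, cap - total, 0);
        if (n < 0) {
            perror("recv");
            return -1;
        }
        if (n == 0) {
            break; /* end of data: the peer closed its side */
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Because recv_all returns a byte count, the caller never needs strlen() to learn how much data arrived, which is exactly what makes binary responses safe to handle.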
- Use HTTP version 1.0. Version 1.1 can get a lot more complicated. The
subset of the HTTP 1.0 protocol you'll need to implement for this assignment is
quite small, but you may find the full protocol specification to be
helpful.
- Section 2.2 in the book should also be helpful. Your book talks
about the "request line" and "header lines" for an HTTP request. You will only
need to send the request line and a single header line, the Host header.
- Good functions to use for handling filenames and text include:
snprintf, sscanf, strstr, and strchr. You can learn more about these and
other useful functions (e.g., send and recv) by reading their "man pages".
For example, try "man snprintf" on the command line.
- Newlines, which signal the end of a message in many protocols, are
represented in HTTP as "\r\n", not just "\n".
- You will need to remove the HTTP headers from the web server's response
before saving the data to a file.
- Spend some time thinking about how to do the string manipulation. It does
not need to be complex. The complete program, including comments, error
handling, etc., can be written in about 100-150 leisurely lines.
- The fopen, fwrite, and fclose functions may be useful for writing the
output file.
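To tie several of the hints above together, here is one possible sketch of the header handling: building the request with snprintf, reading the status code with sscanf, finding where the headers end, and writing the body with fwrite. All four helper names are invented for this example. It assumes the headers end at the first blank line ("\r\n\r\n") and that the whole response is already in memory, and it takes an explicit length everywhere so that no string function ever walks into the body.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write an HTTP 1.0 GET request into `out`.
 * Returns the request length, or -1 if it does not fit in `cap` bytes. */
int build_request(char *out, size_t cap, const char *host, const char *path) {
    assert(out != NULL && host != NULL && path != NULL);
    int n = snprintf(out, cap, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n",
                     path, host);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Hypothetical helper: extract the status code from a response such as
 * "HTTP/1.0 200 OK\r\n...". Only the first few bytes are copied into a
 * small NUL-terminated buffer, so sscanf never touches binary body data.
 * Returns the code, or -1 if the status line is not recognized. */
int parse_status(const char *resp, size_t len) {
    char line[64];
    size_t n = (len < sizeof(line) - 1) ? len : sizeof(line) - 1;
    int code = -1;

    assert(resp != NULL);
    memcpy(line, resp, n);
    line[n] = '\0';
    if (sscanf(line, "HTTP/%*d.%*d %d", &code) != 1) {
        return -1;
    }
    return code;
}

/* Hypothetical helper: return the offset of the first body byte, i.e. the
 * byte just past the first "\r\n\r\n", or -1 if no blank line is found. */
long find_body(const char *resp, size_t len) {
    assert(resp != NULL);
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(resp + i, "\r\n\r\n", 4) == 0) {
            return (long)(i + 4);
        }
    }
    return -1;
}

/* Hypothetical helper: write exactly `body_len` bytes to `filename`.
 * "wb" requests binary mode. Returns 0 on success, -1 on failure. */
int save_body(const char *filename, const char *body, size_t body_len) {
    assert(filename != NULL && body != NULL);
    FILE *fp = fopen(filename, "wb");
    if (fp == NULL) {
        perror("fopen");
        return -1;
    }
    size_t written = fwrite(body, 1, body_len, fp);
    if (fclose(fp) != 0 || written != body_len) {
        return -1;
    }
    return 0;
}
```

If total is the byte count from your recv() loop and off is the value returned by find_body, then the body is the total - off bytes starting at resp + off, and that length is what you would pass to save_body.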
Testing
To test your program, you'll want to ensure that the files it's saving are
identical to the originals.
For a quick check, you can open the file in a browser, and it should appear
like the original website. Note that appearance alone does NOT guarantee that
the file is byte-for-byte identical.
An easy and more precise way to check that the files are correct is to use
wget, which downloads files much like your lab program, to retrieve a
correct copy of the file. You can then compare wget's file with yours.
For text files, you can use diff to see if the files are identical.
For all files (text and binary) you can use something like md5sum to
generate a hash of the two files. If the hashes differ, so do the files.
Submitting
Please remove any debugging output prior to submitting.
To submit your code, simply commit your changes locally using git
add and git commit. Then run git push while in your lab
directory.