CS 43: Lab 1

Handy References:

Departmental git resources.
RFC 1945: HTTP 1.0 Specification. Sections 4, 5, and 6 are probably the most helpful.
Manual pages for socket, send, and recv.
Some example files for testing.

Lab 1 Goals:

Use git to clone a repository full of starter code.
Practice with C networking basics: sockets, DNS, send(), and recv().
Manipulate HTTP headers with C string functions.
Apply top-down design to a web client.

Overview

For this week's programming exercise, we will create a barebones web client (think wget). Based on the example TCP client code in your repository and the HTTP example sessions shown in class, we'll write a command-line program called lab1 that takes a URL as its only parameter, retrieves the indicated file, and stores it in the local directory with the appropriate filename. If the URL does not end in a filename, your program should automatically name the file 'index.html'.

For example:

# This should create a local file named 'pride_and_prejudice.txt' containing lots of text.
 $ ./lab1 http://demo.cs.swarthmore.edu/example/pride_and_prejudice.txt

# This should create a local file named 'index.html' containing the demo server's home page contents.
 $ ./lab1 http://demo.cs.swarthmore.edu

# This should create a local image file named 'fiona.jpg' containing a cute cat picture.
 $ ./lab1 http://demo.cs.swarthmore.edu/example/fiona.jpg

You may assume that the URL will be no more than 100 characters long and that it will be of the form http://host/path, where:

The host portion may be an IP address or a hostname like "demo.cs.swarthmore.edu". Socket programming requires an IP address for communication (e.g., 130.58.68.137), so when given a hostname, you'll need to query the domain name system (DNS) to find its corresponding IP address. To look up an IP address for a given hostname, use getaddrinfo() ("man getaddrinfo" on the command line will give you the details), or consult the getaddrinfo.c example. We'll cover DNS in much more detail later in the course, so for now, you can treat it like a black box that magically converts hostnames to IP addresses.
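As a rough illustration of the lookup described above, here is a minimal sketch. The function name resolve_host and its signature are made up for this example (they are not part of the starter code), and it assumes IPv4 only. Note that getaddrinfo() also accepts a numeric address like "130.58.68.137", so the same call handles both kinds of host.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve `host` (a name or a numeric address) to a
 * dotted-quad IPv4 string. Returns 0 on success, -1 on failure. */
int resolve_host(const char *host, char *ip_out, size_t ip_len) {
    struct addrinfo hints, *results;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* assumption: IPv4 only */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    if (getaddrinfo(host, "80", &hints, &results) != 0) {
        return -1;
    }
    /* Use the first result; convert its binary address to text. */
    struct sockaddr_in *addr = (struct sockaddr_in *)results->ai_addr;
    const char *ok = inet_ntop(AF_INET, &addr->sin_addr, ip_out, ip_len);
    freeaddrinfo(results);
    return ok ? 0 : -1;
}
```

In a real client you could also skip the text conversion and hand results->ai_addr straight to connect(); the string form is mostly handy for debugging output.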

The path may or may not be an empty string, may or may not contain multiple slashes (for subdirectories), and may or may not contain a file name. If no path is given, your client should request: "/" (without quotes). The server will send you back an index.html file, if it has one.
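One possible way to separate the host from the path is sketched below. The name split_url and its parameters are invented for this example, and it assumes the URL always begins with "http://", as stated above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split "http://host/path" into host and path.
 * Returns 0 on success, -1 if the URL is malformed or does not fit. */
int split_url(const char *url, char *host, size_t host_len,
              char *path, size_t path_len) {
    const char *prefix = "http://";
    size_t prefix_len = strlen(prefix);
    if (strncmp(url, prefix, prefix_len) != 0) {
        return -1;
    }
    /* The host runs from the end of the prefix to the next '/', if any. */
    const char *host_start = url + prefix_len;
    const char *slash = strchr(host_start, '/');
    size_t n = slash ? (size_t)(slash - host_start) : strlen(host_start);
    if (n == 0 || n >= host_len) {
        return -1;
    }
    memcpy(host, host_start, n);
    host[n] = '\0';

    /* An empty path becomes "/", which is what the client should request. */
    int written = snprintf(path, path_len, "%s", slash ? slash : "/");
    return (written < 0 || (size_t)written >= path_len) ? -1 : 0;
}
```

With this sketch, "http://demo.cs.swarthmore.edu/example/fiona.jpg" yields the host "demo.cs.swarthmore.edu" and the path "/example/fiona.jpg", while "http://demo.cs.swarthmore.edu" yields the path "/".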

You may assume that the files you'll be retrieving are no larger than one megabyte. This means you can statically declare storage space for the response, which makes life a bit easier.

Requirements

Your client should faithfully download and save byte-for-byte identical copies of the files it's asked to retrieve. It should work for both text (e.g., html) and binary files (e.g., images). See the "Testing" section below.

Your client should report any errors or unexpected responses it encounters. If you get any HTTP response code other than 200, simply report the code you received and terminate.
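To check the response code, you need the number from the first line of the response (the status line, e.g. "HTTP/1.0 200 OK"). Here is a small sketch of one way to pull it out. The name parse_status_code is made up for this example. It copies the start of the response into a small null-terminated buffer first, so that sscanf never scans into binary body data.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the status code from a response buffer.
 * Returns the code (e.g. 200), or -1 if no valid status line is found. */
int parse_status_code(const char *response, size_t response_len) {
    char line[64];
    size_t n = response_len < sizeof(line) - 1 ? response_len : sizeof(line) - 1;
    memcpy(line, response, n);
    line[n] = '\0'; /* make it a proper C string before using sscanf */

    int major, minor, code;
    if (sscanf(line, "HTTP/%d.%d %d", &major, &minor, &code) != 3) {
        return -1;
    }
    return code;
}
```

A caller could then compare the result against 200 and, for anything else, print the code and exit.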

Your client should name the files it saves according to the name of the file in the URL argument. That is, everything after the final '/' in the URL should be considered the file name to use when storing the file locally. If there's nothing after the final '/', use the name 'index.html'.
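The naming rule can be sketched in a few lines. This example assumes the path portion of the URL (the part starting with '/', or just "/" when the URL has no path) has already been separated from the host. Working on the path rather than the whole URL avoids mistaking the slashes in "http://" for the final '/'. The name filename_from_path is invented here, and strrchr (which finds the last occurrence of a character) is one convenient choice among several.

```c
#include <string.h>

/* Hypothetical helper: pick the local filename for a URL path such as
 * "/example/fiona.jpg". `path` is expected to begin with '/'. */
const char *filename_from_path(const char *path) {
    const char *last_slash = strrchr(path, '/');
    if (last_slash == NULL || last_slash[1] == '\0') {
        return "index.html"; /* nothing after the final '/' */
    }
    return last_slash + 1;   /* points into `path`, no copy is made */
}
```

For "/example/fiona.jpg" this returns "fiona.jpg"; for "/" or "/example/" it returns "index.html".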

Miscellaneous hints and background information

The general workflow of your program will be:
1. Break the URL argument into the host and file portions.
2. Look up the hostname via DNS to get its IP address.
3. Create a socket and connect to that IP address on port 80.
4. Generate an HTTP 1.0 request for the file and send it to the server.
5. Read the response and report errors. If no errors, save the response body to a file.
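Step 3 of the workflow above might look something like the following sketch. The name connect_ipv4 is made up for this example, and it assumes you already have the server's IPv4 address as a dotted-quad string. The example TCP client in your repository may structure this differently.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a TCP connection to an IPv4 address.
 * Returns a connected socket descriptor, or -1 on failure. */
int connect_ipv4(const char *ip, unsigned short port) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port); /* ports travel in network byte order */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        return -1;               /* not a valid IPv4 string */
    }

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        return -1;
    }
    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sock);             /* don't leak the descriptor on failure */
        return -1;
    }
    return sock;
}
```

For this lab, the call would be along the lines of connect_ipv4(ip, 80), followed by close() on the returned descriptor once the response has been read.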

All HTTP headers are ASCII string characters, so you can use the str family of functions to manipulate them safely. Do NOT use strlen, or any other string functions, on the body of the response. The response body is not necessarily a string. In some cases (e.g., html responses) it will be, but in other cases (e.g., image files) it won't be. Remember that the C string functions look for, and typically terminate when they find, the null terminator character. A null terminator is nothing more than a byte whose value is zero (0). Such bytes are LIKELY TO BE PRESENT in binary response data. If you call strlen() on binary data and it finds a 0, it will stop and return the WRONG ANSWER to you.
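To see the problem concretely, consider this tiny sketch. The eight bytes are made-up sample data (not from any real file), and the function names are invented for the demonstration.

```c
#include <string.h>

/* Eight bytes of made-up "binary" data with a zero byte at index 3. */
static const unsigned char sample[8] = {0xFF, 0xD8, 0xFF, 0x00,
                                        0x10, 0x4A, 0x46, 0x49};

/* What strlen() reports: it stops counting at the first zero byte. */
size_t length_by_strlen(void) {
    return strlen((const char *)sample);
}

/* The true size, which has to be tracked separately. */
size_t length_actual(void) {
    return sizeof(sample);
}
```

Here length_by_strlen() returns 3 even though the buffer holds 8 bytes, so a client that trusted strlen() would save a truncated file.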

"But, if I can't call strlen() on the response, how will I know how much data I received?" The recv() function's return value will tell you how many bytes you received every time you call it. Likewise, the send() function will tell you how many bytes you successfully transmitted. You should ALWAYS check the return values of these functions because the answer may not be what you expect. That is, even if you tell recv() to get 1000 bytes, the call may return with fewer bytes, and the only way you'll know is to check the return value. Likewise, you may tell send() to transmit 1000 bytes, but it may only have room to buffer fewer bytes. You can't just assume that all 1000 bytes were sent! Instead, check the return value of send() to see if (or which) bytes need to be resent.

For this lab assignment, your life will be easier if you call send() and recv() each in exactly one place (inside a loop). Use send() in a loop to send the entire request and recv() in a loop to read the entire response. If recv() returns 0, the server has closed the connection, which means you've reached the end of the data.
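The two loops might be sketched as follows. The names send_all and recv_all are invented for this example, and the error handling is deliberately minimal: any negative return value is treated as fatal.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: keep calling send() until all `len` bytes are sent.
 * Returns 0 on success, -1 on error. */
int send_all(int sock, const char *buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        /* Resume from the first byte that has not been sent yet. */
        ssize_t n = send(sock, buf + sent, len - sent, 0);
        if (n < 0) {
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: keep calling recv() until the peer closes the
 * connection or `cap` bytes are stored. Returns the byte count, or -1. */
ssize_t recv_all(int sock, char *buf, size_t cap) {
    size_t total = 0;
    while (total < cap) {
        ssize_t n = recv(sock, buf + total, cap - total, 0);
        if (n < 0) {
            return -1;
        }
        if (n == 0) {
            break; /* peer closed the connection: end of data */
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

The value returned by recv_all is the total response size in bytes (headers plus body), which is exactly the number you cannot get from strlen().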

Use HTTP version 1.0. Version 1.1 can get a lot more complicated. The subset of the HTTP 1.0 protocol you'll need to implement for this assignment is quite small, but you may find the full protocol specification to be helpful.

Section 2.2 in the book should also be helpful. Your book talks about the "request line" and "header lines" for an HTTP request. You will only need to use the request line and the host line of the header.
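A request consisting of just the request line and the Host line could be assembled with snprintf, roughly as sketched here. The name build_request is made up for this example. Note the "\r\n" after each line and the extra blank line that marks the end of the request.

```c
#include <stdio.h>

/* Hypothetical helper: format an HTTP/1.0 GET request into `buf`.
 * Returns the request length in bytes, or -1 if it does not fit. */
int build_request(char *buf, size_t buf_len, const char *host, const char *path) {
    int n = snprintf(buf, buf_len,
                     "GET %s HTTP/1.0\r\n" /* request line */
                     "Host: %s\r\n"        /* host header line */
                     "\r\n",               /* blank line ends the request */
                     path, host);
    if (n < 0 || (size_t)n >= buf_len) {
        return -1; /* formatting failed or the output was truncated */
    }
    return n;
}
```

Since the request is pure ASCII text, the returned length is the same number strlen() would give, and it is the number of bytes to pass to your send loop.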

Good functions to use for handling filenames and text include: snprintf, sscanf, strstr, and strchr. You can learn more about these and other useful functions (e.g., send and recv) by reading their "man pages". For example, try "man snprintf" on the command line.

Newlines, which signal the end of a message in many protocols, are represented in HTTP as "\r\n", not just "\n".

You will need to remove the HTTP headers from the web server's response before saving the data to a file.
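In an HTTP response, the headers end at the first blank line, that is, at the first occurrence of "\r\n\r\n"; everything after it is the body. A bounded search like the sketch below finds that boundary without relying on a null terminator. The name find_body_offset is invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the offset of the first body byte, i.e. the
 * position just past the first "\r\n\r\n", or -1 if no blank line is found. */
long find_body_offset(const char *response, size_t response_len) {
    for (size_t i = 0; i + 4 <= response_len; i++) {
        /* memcmp compares raw bytes, so zero bytes in the body are harmless. */
        if (memcmp(response + i, "\r\n\r\n", 4) == 0) {
            return (long)(i + 4);
        }
    }
    return -1;
}
```

If the response is total bytes long and the offset is off, the body starts at response + off and is total - off bytes long.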

Spend some time thinking about how to do the string manipulation. It does not need to be complex. The complete program, including comments, error handling, etc., can be written in about 100-150 leisurely lines.

The fopen, fwrite, and fclose functions may be useful for writing the output file.
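For instance, saving the body could look roughly like this. The name save_body is made up for this example; the important detail is that the length is passed in explicitly rather than computed with strlen().

```c
#include <stdio.h>

/* Hypothetical helper: write exactly `len` bytes of `body` to `filename`.
 * Returns 0 on success, -1 on failure. */
int save_body(const char *filename, const char *body, size_t len) {
    FILE *out = fopen(filename, "wb"); /* "b" is a no-op on Linux but harmless */
    if (out == NULL) {
        return -1;
    }
    /* fwrite takes an element size and a count; here: len items of 1 byte. */
    size_t written = fwrite(body, 1, len, out);
    if (fclose(out) != 0 || written != len) {
        return -1;
    }
    return 0;
}
```

Because fwrite is told how many bytes to write, it copies zero bytes in binary data like any other byte.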

Testing

To test your program, you'll want to ensure that the files it's saving are identical to the originals.

For a quick check, you can open the file in a browser, and it should appear like the original website. Note that appearance alone does NOT guarantee that the file is byte-for-byte identical.

An easy and more precise way to check that the files are correct is to use wget, which downloads files much like your lab program, to retrieve a correct copy of the file. You can then compare wget's file with yours.

For text files, you can use diff to see if the files are identical. For all files (text and binary) you can use something like md5sum to generate a hash of the two files. If the hashes differ, so do the files.

Submitting

Please remove any debugging output prior to submitting.

To submit your code, simply commit your changes locally using git add and git commit. Then run git push while in your lab directory.
