Checkpoint Due: Tuesday, September 10th, 11.59PM
Lab Due: Tuesday, September 17th, 11:59 PM
1. Overview
In this lab we will write our first networking application — a barebones web client. A web client corresponds with a web server, and they both "speak" HTTP, the Hypertext Transfer Protocol.
HTTP uses a client-server model of communication: in which the client initiates communication, and a server that is always-on, passively waits and responds. The web client and web server correspond using HTTP queries and responses. One way to think of HTTP, is as a document retrieval system over the web. We will look into the HTTP protocol format in a lot more detail in class on Thursday.
1.1. Goals
-
Use
git
to clone a repository full of starter code. -
Apply top-down design to write a web-client.
-
Practice with C networking basics: sockets,
send()
,receive()
and DNS (name-to-IP resolution). -
Manipulate HTTP headers with C string functions.
1.2. Handy References:
-
RFC 1945: HTTP 1.0 Specification. Sections 4, 5, and 6 are probably the most helpful.
2. Lab Requirements
We will write a command-line program called lab1
that takes a URL as its only parameter, retrieves the indicated file, and stores it in the local directory with the appropriate filename. If the URL does not end in a filename, your program should automatically name the file index.html
.
For example:
# This should create a local file named 'pride_and_prejudice.txt' containing lots of text. $ ./lab1 http://demo.cs.swarthmore.edu/example/pride_and_prejudice.txt
2.1. Checkpoint Requirements
For the checkpoint for the lab, due Tuesday, Sep. 10th, you should have the following pieces complete:
-
Complete and submit
worksheet.txt
(Pro-Tip: the extra-credit section is not required, however completing this will make the C coding for the lab much much easier.) -
Extract the host and path portions of the input URL
-
Construct the HTTP request by filling in the provided GET template decribed in the Telnet example and on the class slides.
-
Extract just the file name from the path
A good stretch goal is:
-
Start trying to send the request to the server and print the response, but it’s ok if this part isn’t solid yet.
Week 2 Goals:
-
Making send/recv calls with error handling
-
Parsing response headers
-
Saving the output file
2.2. Workflow of your program
-
Given a URL of the form
http://host/path
, construct a HTTP query to send to the web-server.-
separate the host from the file portions using string parsing
-
look up the hostname via DNS to get its IP address so that yo can send and receive data
-
create a
socket
andconnect
to the server’s IP address, on port 80, the port used for HTTP -
generate an HTTP 1.0 request string for the specified file path
-
use the
send
system-call to send the request over the network
-
-
Receive and interpret the server’s response
-
use the
recv
system call to receive the server’s response, in full -
inspect the HTTP response code received from the web server
-
faithfully download byte-for-byte copies of both text (e.g.,
html
) and binary files (e.g.,images
) using therecv()
system call. -
if there are no errors, open a file for writing (named according to the name of the file in the URL argument) and save the body of HTTP response to the file
-
report any errors or unexpected responses the web-client encounters.
-
2.3. Client Behavior Expectations
For full credit your client should:
-
Download and save byte-for-byte identical copies of the files it’s asked to retrieve. It should work for both text (e.g., html) and binary files (e.g., images). See: Examples and Testing below.
-
Report any errors or unexpected responses it encounters. If you get any HTTP response code other than 200, simply report the code you received and terminate.
-
Name the files it saves according to the name of the file in the URL argument. That is, everything after the final / in the URL should be considered the file name to use when storing the file locally. If there’s nothing after the final /, use the name index.html.
2.4. Try wget
Running ./lab1
should have the same functionality as wget
. Try the following on the command line to get a sense of how wget
works (your code does not need to print extra information that wget
does).
# This should create a local file named 'pride_and_prejudice.txt' containing lots of text. $ wget http://demo.cs.swarthmore.edu/example/pride_and_prejudice.txt # This should create a local file named 'index.html' containing the demo server's home page contents. $ wget http://demo.cs.swarthmore.edu # This should return a 404 not found error $ wget http://demo.cs.swarthmore.edu/example/hi.txt # This should create a local image file named 'fiona.jpg' containing a cute cat picture. $ wget http://demo.cs.swarthmore.edu/example/fiona.jpg
3. Try TELNET!
To see what the HTTP message format looks like, let’s generate a HTTP request from the command-line using telnet
.
Wait, you can just look at the HTTP message structure? Yes! HTTP is an all-text protocol (it’s really old), meaning we can actually read the request and response fields. In later labs, we will see application-layer protocols that are all binary for which we will need other tools to parse the header fields.
Telnet is an application layer protocol that was used for remote login to access network servers (this has been replaced today by SSH). Try the following in your terminal window:
telnet demo.cs.swarthmore.edu 80
GET / HTTP/1.0
Host: demo.cs.swarthmore.edu
(press return twice after the last line)
Request message breakdown:
-
demo.cs.swarthmore.edu
: hostname to connect -
80
: theport number
that serves as the identifier for an HTTP message so that when the client receives this message -
GET
: HTTP "verb" to request data. -
/
: path being requested -
HTTP/1.0
: HTTP version being "spoken" by the client. -
Host:hostname
: Required header field -
\r\n\r\n
: TwoCRLF
s (carriage-returns) to indicate end of the header
Response message:
HTTP/1.0 200 OK
Vary: Accept-Encoding
Content-Type: text/html
Accept-Ranges: bytes
ETag: "316912886"
Last-Modified: Wed, 04 Jan 2017 17:47:31 GMT
Content-Length: 1062
Connection: close
Date: Tue, 03 Sep 2019 14:06:05 GMT
Server: lighttpd/1.4.35
<body of file>
-
HTTP/1.0
: HTTP version spoken by webserver -
200 OK
: File exists, being transferred. -
Optional Header fields
-
\r\n\r\n
: TwoCRLF
s (two carriage-returns) to indicate the end of the header. -
Body of message follows.
Record your responses to your TELNET queries on the week1-lab sheet on github |
4. Socket Programming
As we saw yesterday, a protocol defines both the message + header format and transfer procedure. We saw that like a human protocol (initial Hello), the network protocol must first establish a connection, before we start sending and receiving data. The application layer HTTP message, is encapsulated in a transport layer protocol, TCP. TCP is used as the transport protocol of choice for in-order, reliable delivery. To start a TCP connection, we associate the client with a socket. You can think of a socket like a mailbox to send and receive mail.
5. Socket system-calls
Once we establish a socket on the client side, the following system-calls are used to send, receive data and eventually close the socket.
Work through the system calls given in the starter code and fill in the weekly lab sheet on github |
6. String Parsing and Name-to-IP resolution
As stated in Lab requirements your lab1
program takes in a URL as its only parameter, retrieves the indicated file, and stores it in the local directory with the appropriate filename.
-
If the URL does not end in a filename, your program should automatically name the file
index.html
. -
You may assume that the URL will be no more than 100 characters long and that it will be of the form
http://host/path
. -
The path may or may not be an empty string, and may or may not contain multiple slashes (for subdirectories), and may or may not contain a file name.
-
If no path is given, your client should request:
/
. The server will send you back anindex.html
file, if it has one.
You may assume that the files you’ll be retrieving are no larger than one megabyte. This means you can statically declare storage space for the response, which makes life a bit easier.
6.1. Resolving a hostname to an IP address
-
The host portion may be an IP address or a hostname like
demo.cs.swarthmore.edu
. Socket programming requires an IP address for communication (e.g., 130.58.68.137), so when given hostnames, you’ll need to query the domain name system (DNS) to find their corresponding IP address. -
To look up an IP address for a given host name, use
getaddrinfo()
(man getaddrinfo
on the command line will give you the details), or consult thegetaddrinfo.c
example. We will cover DNS in much more detail later in the course, so for now, you can treat it like a black box that magically converts hostnames to IP addresses.
7. Miscellaneous hints and background information
Good systems programming involves: |
-
writing a small bit of code
-
testing
-
brief comments
-
repeat step 1
Students who follow this have a much higher rate of succeeding in the course!
-
All HTTP headers are ASCII string characters, so you can use the
str
family of functions to manipulate them safely.
Do NOT use strlen, or any other string functions, on the body of the response. The response body is not necessarily a string. In some cases (e.g., html responses) it will be, but in other cases (e.g., image files) it won’t be. Remember that the C string functions look for, and typically terminate when they find, the null terminator character. A null terminator is nothing more than a byte whose value is zero (0). Such bytes are LIKELY TO BE PRESENT in binary response data. If you call strlen() on binary data and it finds a 0 , it will stop and return the WRONG ANSWER to you.
|
-
"But, if I can’t call
strlen()
on the response, how will I know how much data I received?"The
recv()
function’s return value will tell you how many bytes you received every time you call it. Likewise, thesend()
function will tell you how many bytes you successfully transmitted.
You should ALWAYS check the return values of these functions because the answer may not be what you expect. That is, even if you tell recv() to get 1000 bytes , the call may return with fewer bytes, and the only way you’ll know is to check the return value. Likewise, you may tell send() to transmit 1000 bytes , but it may only have room to buffer fewer bytes. You can’t just assume that all 1000 bytes were sent! Instead, check the return value of send() to see if (or which) bytes need to be resent.For this lab assignment, your life will be easier if you call send() and recv() each in exactly one place (inside a loop). Use send() in a loop to send the entire request and recv() in a loop to read the entire response. If recv() returns 0 , it means you’ve reached the end of the data.
|
-
Use HTTP version 1.0 — version 1.1 can get a lot more complicated. The subset of the HTTP 1.0 protocol you’ll need to implement for this assignment is quite small, but you may find the full protocol specification to be helpful.
-
Section 2.2 in the book should also be helpful. Your book talks about the "request line" and "header lines" for an HTTP request. You will only need to use the request line and the host line of the header.
-
Good functions to use for handling filenames and text include:
snprintf
,sscanf
,strstr
, andstrchr
. You can learn more about these and other useful functions (e.g.,send
andrecv
) by reading theirman
pages. For example, tryman snprintf
on the command line. -
Newlines, which signal the end of a message in many protocols, are represented in HTTP as
\r\n
, not just\n
. -
You will need to remove the HTTP headers from the web server’s response before saving the data to a file.
-
Spend some time thinking about how to do the string manipulation. It does not need to be complex. The complete program, including comments, error handling etc. can be written in about 100-150 leisurely lines.
-
The
fopen
,fwrite
, andfclose
functions may be useful for writing the output file.
8. Testing
-
To test your program, you’ll want to ensure that the files it’s saving are identical to the originals.
-
For a quick check, you can open the file in a browser, and it should appear like the original website. Note that appearance alone does NOT guarantee that the file is byte-for-byte identical.
-
An easy and more precise way to check that the files are correct is to use
wget
, which downloads files much like your lab program, to retrieve a correct copy of the file. You will need to use md5sum to make sure binary files (images, pdfs) are identical. You can then comparewget
's file with yours. Example below:$ md5sum index.html index.html.1 937e1d7af5e5cc0ce63694cdd2969233 index.html 937e1d7af5e5cc0ce63694cdd2969233 index.html.1
-
For text files, you can use
diff
to see if the files are identical. For all files (text and binary) you can use something likemd5sum
to generate a hash of the two files. If the hashes differ, so do the files. If the files are identical you will see no output:$ diff index.html index.html.1 $
If the files are not identical,
diff
will show you the lines that differ. For e.g.,diff index.html index.html.1 21a22,23 > </body> > </html> >
-
You should run
valgrind
, as you add to your code base, to make sure your code has no memory leaks.
9. Grading Rubric
Total: 3 points
-
1/2 point for completing weekly-lab questions and lab review.
-
1 point for transferring text files correctly - MD5sums must match.
-
1/2 point for transferring binary image files correctly - MD5sums must match.
-
1/2 point for correctly identifying and reporting error messages.
-
1/4 point for naming files correctly.
-
1/4 point for no valgrind errors.
10. Submitting
Please remove any debugging output prior to submitting.
To submit your code, simply commit your changes locally using git
add and git commit
. Then run git
push while in your lab directory.