CS87: Lab 3

Lab 3 Partners
Luis Ramirez and Nick Felt	Elliot Weiser and Steven Hwang
Jordan Singleton and Phil Koonce	Ames Bielenberg and Niels Verosky
Kyle Erf and Sam White	Choloe Stevens and Katherine Bertaut
See the git howto for information about how you can set up a git repository for your lab 3 project.

Project Introduction

For this assignment you and your partner will implement a web server. This lab is designed to give you some practice writing client-server socket programs, writing a multi-threaded server, using signals, and learning about the HTTP protocol.

This is a larger and more involved programming assignment than the first two labs. I strongly encourage you to get started on it right away.

There is a lot of information about getting started and about helpful resources on this page (including information about where to get starting point code and sample code). Read through this entire page before you get started, and refer back to it as you go...if you have a question about how to do something, there may be an answer or hint here.

Contents:
Project Requirements
Project Details
Getting Started
Useful Functions and Links to more Resources
Submission and Demo

Project Requirements

Your web server should be written in C or C++.
Use port 8888 instead of port 80.
You will implement a multi-threaded web server, one thread per client connection. This will allow your web server to simultaneously handle requests from multiple clients.
Your server should implement parts of the HTTP 1.1 protocol, which maintains an open socket connection to the client after the response is sent (HTTP 1.0 closes the connection as soon as the response is sent to the client). HTTP 1.1 makes subsequent communication with the client faster by not having to repeat the TCP connection protocol. However, your server must prevent too many simultaneous connections: if the total number of simultaneous open connections gets above some max_connections threshold value (pick something small to test like 5), your server will start closing the oldest open connections and their associated server threads will die. The killed server thread should clean up any shared state and close its end of the socket before dying.
Remember that connections can be closed other ways too. For example, the client-side can close the connection. In this case the associated server thread should detect that the socket was closed, clean up any shared state, and exit.
Your server must handle GET and HEAD client requests. It does not need to handle POST nor any other requests.
It should return appropriate status codes, including 200, 400, 403, and 404. If the server returns an error code to a client, it should also return headers and a message body with a simple error page. For example:
"<html><body>Not Found</body></html>"
If you'd like, you can include a link to the HTTP Status Cat jpg corresponding to the status code in your response. Add in something like this to the body of the response above:
<img src="http://httpcats.herokuapp.com/400">
You can view all the status cat images here
It should support the headers Content-Length, Content-Type, and Date.
It does not need to handle any php or javascript parsing. If the client requests a .php file, just send the file contents back just like an .html or .jpg file.
It should handle urls that start with / and urls that start with /~username. The web server's starting pages are in /scratch/cs87/cs/. Urls that start with /, such as /people, map to files in /scratch/cs87/cs/people/; urls that start with /~username/ map to files in /home/username/public_html/.
Your web server should be free of memory access errors (i.e., have no valgrind errors), and it should be well designed and well commented.
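As a rough illustration of the Content-Type requirement above, here is a minimal sketch that picks a header value from a file name's extension. The helper name guess_content_type, the table of extensions, and the choice to report .php files as text/html are all assumptions made for this sketch, not part of the assignment.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp */

/* Hypothetical helper: choose a Content-Type value from the file extension.
 * The extension table below is an assumption; extend it as needed. */
const char *guess_content_type(const char *path) {
    static const struct { const char *ext; const char *type; } types[] = {
        { ".html", "text/html" },
        { ".htm",  "text/html" },
        { ".php",  "text/html" },   /* assumption: sent back unparsed as html */
        { ".txt",  "text/plain" },
        { ".css",  "text/css" },
        { ".jpg",  "image/jpeg" },
        { ".jpeg", "image/jpeg" },
        { ".png",  "image/png" },
        { ".gif",  "image/gif" },
    };
    const char *fallback = "application/octet-stream";
    const char *dot = strrchr(path, '.');

    /* no dot at all, or the last dot belongs to a directory name */
    if (dot == NULL || strchr(dot, '/') != NULL) {
        return fallback;
    }
    for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
        if (strcasecmp(dot, types[i].ext) == 0) {
            return types[i].type;
        }
    }
    return fallback;
}
```

For example, guess_content_type("/home/newhall/public_html/newcluster.jpg") yields "image/jpeg".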

Note: if you start your web server on one of our lab machines, you can only connect to it with clients that are also running on our lab machines.

Project Details

Web server

The basic design of your web server is the following:

create a listen socket on port 8888
enter an infinite loop:
1. accept the next connection
2. if there are already max connections, kill the oldest thread by sending it a SIGALRM signal.
3. create a new thread to handle the new client's connection, passing it the socket returned by accept.
4. the worker thread main function should be an infinite loop that only exits if there is an error condition returned by a system call, or if the thread receives a SIGALRM from the main thread and kills itself. Otherwise, the worker threads continue to handle HTTP requests from the client.
  Before a thread dies, it should close its end of the socket and clean up any other global state necessary for correct functioning of your web server.

The main server thread should be in an infinite loop, waiting to accept the next client connection. It should exit only when it gets appropriate error return values from accept, send, recv, read, write, ...
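The basic design above can be sketched roughly as follows. This is only an outline under stated assumptions: the names make_listen_socket, worker_main, and serve_forever are hypothetical, the worker is a placeholder that closes the socket right away, and the max-connections check is left out entirely.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create, bind, and listen on a TCP socket.
 * Returns the listening descriptor, or -1 on error. */
int make_listen_socket(uint16_t port, int backlog) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return -1; }

    int on = 1;   /* allow rebinding the port soon after a restart */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        perror("setsockopt"); close(fd); return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);          /* network byte order */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        perror("bind/listen"); close(fd); return -1;
    }
    return fd;
}

/* Placeholder worker: a real one would loop handling HTTP requests. */
static void *worker_main(void *arg) {
    int client_fd = *(int *)arg;
    free(arg);
    close(client_fd);
    return NULL;
}

/* Accept loop: one detached thread per connection; returns on accept error. */
void serve_forever(int listen_fd) {
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd < 0) { perror("accept"); return; }

        /* heap-allocate the argument so each thread gets its own copy */
        int *arg = malloc(sizeof(*arg));
        if (arg == NULL) { close(client_fd); continue; }
        *arg = client_fd;

        pthread_t tid;
        if (pthread_create(&tid, NULL, worker_main, arg) != 0) {
            free(arg);
            close(client_fd);
            continue;
        }
        pthread_detach(tid);
    }
}
```

A main function would call make_listen_socket(8888, 5) and pass the result to serve_forever. Passing the client socket through a heap-allocated int avoids handing a worker a pointer into the main thread's stack.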

Signals and Sockets and Threads

Threads share the same address space so they can coordinate using shared memory, and synchronize using locks, barriers, or semaphores. Threads also share the same copy of open files and the signal table associated with the process in which they are contained. This means that if one thread opens a file, all threads can read or write to it using the file descriptor returned by open. Similarly, if one thread closes a file, it is closed for all threads in the process.

In Unix, sockets have a file interface and threads can close sockets just like they would close a file by calling close:

int fd = socket(AF_INET, SOCK_STREAM, 0);
...
close(fd);

Your web server will use signals as a way to notify a worker thread that it should die when there are too many open connections. A signal is a software interrupt that can be synchronous or asynchronous. One process or thread can send (or post) a signal to another one, and when the other one receives the signal it stops doing what it is currently doing and runs a special signal handler function. Processes (and threads) can block some signals, register their own handler functions on some signals, or just use the operating system's signal handler functions (this is the default). For example, when you type CNTL-C in the terminal that is running a program, the running process is sent a SIGINT signal, whose default action is to terminate the process. A process can catch or ignore SIGINT. SIGKILL (sent, for example, by kill -9) is an example of a non-blockable signal, meaning that a process cannot choose to ignore a SIGKILL...it must die.

Your web server's main thread will send a worker thread a SIGALRM signal when it wants the worker thread to exit (and close its connection to the client). To do this, do the following:

The main thread will register a signal handler function on the SIGALRM signal before entering its main loop (this sets up the signal handler on SIGALRM for all threads):

  struct sigaction sa;

  // set all field values in sa to zero using memset:
  memset((void *)(&sa), 0, sizeof(sa));
  sigemptyset(&sa.sa_mask);
  // name of my signal handler function:
  sa.sa_handler = my_sigalrm_handler;  
  sa.sa_flags = 0;

  // register my signal handler with the SIGALRM signal: 
  val = sigaction(SIGALRM, &sa, NULL);

When the main listener thread receives a new connection, it will check to see if there are already a maximum number of connections, and if so, it will send the oldest thread a SIGALRM signal by calling:
```
pthread_kill(workers_pthread_tid, SIGALRM);
```

The signaled worker thread will call the handler function registered on SIGALRM:

void my_sigalrm_handler(int s) {
  // clean up any shared state associated with me
  // close my socket 
  // and call pthread_exit to die
}

You will have to determine how a signaled thread knows which socket is its own to close.

Threads will also need to detect and handle other cases when they should exit, and clean up any global state associated with them, including closing their socket before exiting. One place where this may occur is if the client side disconnects and closes its end of the socket.

HTTP 1.1 and multiple simultaneous connections

You should use the pthread library to spawn a new server thread each time a client connects to your server. The server thread has a dedicated connection to this client and will keep this connection open and continue to handle GET and HEAD requests from the client. Your main server thread should return back to its accept loop after spawning the server thread so that it can handle a connection from another client. This way your server can simultaneously handle requests from different clients. Test that this works by connecting to your server from different clients simultaneously and sending multiple requests from these clients.

Remember to link in the pthreads library to compile a pthreads program. If you are using the Makefile from my client server example code, it is already included here:

LIBS =  $(LIBDIRS) -pthread

If you aren't using my Makefile, include -pthread at the end of the gcc or g++ command line in your makefile.

The main listener thread should repeat its main loop after spawning a new worker thread (and perhaps killing an old one), and call accept on the listener socket to wait for another client connection.

If your solution requires any use of shared state among threads, make sure to use a pthread synchronization primitive (likely a pthread_mutex_t) to synchronize the accesses to this shared state. Also, think about scope very carefully: threads can only share memory associated with global variables or that is on the heap. Technically, a thread can share state on another thread's stack too (if they have a pointer to it) but I strongly suggest not doing this because the state can be overwritten and modified by the other thread's execution.
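One possible shape for mutex-protected shared state is sketched below: a small global table of active worker thread ids, kept oldest first. The table layout and the function names (table_add, table_remove, table_oldest, table_count) are assumptions made for illustration; your design may track different information.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_CONNECTIONS 5   /* small threshold, convenient for testing */

/* Hypothetical shared table of active workers, oldest first. */
static pthread_t active[MAX_CONNECTIONS];
static size_t num_active = 0;
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Record a new worker; returns 0 on success, -1 if the table is full. */
int table_add(pthread_t tid) {
    int ret = -1;
    pthread_mutex_lock(&table_lock);
    if (num_active < MAX_CONNECTIONS) {
        active[num_active++] = tid;
        ret = 0;
    }
    pthread_mutex_unlock(&table_lock);
    return ret;
}

/* Remove a worker (for example when it exits); 0 if found, -1 otherwise. */
int table_remove(pthread_t tid) {
    int ret = -1;
    pthread_mutex_lock(&table_lock);
    for (size_t i = 0; i < num_active; i++) {
        if (pthread_equal(active[i], tid)) {
            /* shift later entries down to keep oldest-first order */
            for (size_t j = i + 1; j < num_active; j++) {
                active[j - 1] = active[j];
            }
            num_active--;
            ret = 0;
            break;
        }
    }
    pthread_mutex_unlock(&table_lock);
    return ret;
}

/* Copy the oldest worker's id into *out; returns -1 if the table is empty. */
int table_oldest(pthread_t *out) {
    int ret = -1;
    pthread_mutex_lock(&table_lock);
    if (num_active > 0) {
        *out = active[0];
        ret = 0;
    }
    pthread_mutex_unlock(&table_lock);
    return ret;
}

/* Number of workers currently recorded. */
size_t table_count(void) {
    pthread_mutex_lock(&table_lock);
    size_t n = num_active;
    pthread_mutex_unlock(&table_lock);
    return n;
}
```

Every access to the table, including reads, goes through the same lock. Note that pthread_mutex_lock is not on the POSIX list of async-signal-safe functions, so think carefully before taking a lock inside a signal handler.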

Web clients

You can use multiple programs to connect to your web server and send it HTTP commands:

telnet server_IP port_num, then type in a GET command (make sure to enter a blank line after the GET command). For example:
```
$ telnet 130.58.68.62 8888

  GET /index.html HTTP/1.0
```
telnet will exit when it detects that your web server has closed its end of the socket (or you can kill it with CNTL-C, or if that doesn't work use kill or pkill: pkill telnet). Use ifconfig to get a machine's IP address (described in the Useful Functions and Resources section).
firefox: Enter the url of the desired page specifying your web server using its IP:port_num (e.g. http://130.58.68.62:8888/index.php)
You can also just use localhost or the host name on our system:
```
localhost:8888/index.php
tomato:8888/~cfk/
```
wget: wget -v 130.58.68.62:8888/index.html
wget copies the html file returned by your web server into a file with a matching name (index.html) in the directory from which you call wget.
modify the example client program to send http requests to your server. I don't think this is necessary (since the other three clients are already written for you), but you could modify the web_client program given with the starting point code to send GET requests to your web server and receive the responses.

HTTP

Start by reading HTTP Made Really Easy by Jim Marshall.

It is very important that you can interpret the format of a client request correctly, and that you send correctly formatted responses to clients. Many parts of a correctly formatted message involve sequences of carriage return and newline characters ("\r\n"). These are used to signify the end of all or part of a "message". Here is the general format of a request or response message:

   initial line
   Header1: value1
   Header2: value2
   Header3: value3

   (optional message body goes here)

For example, a GET response for a very simple page may look like:

   HTTP/1.1 200 OK
   Date: Sun, 10 Jan 2010 18:17:43 GMT
   Content-Type: text/html
   Content-Length: 53

   <html>
   <body>
   <h1>CS 87 Test Page</h1>
   </body></html>

It is very important that each header line ends with a "\r\n" and that there is a blank line (another "\r\n") between the headers and the message body. The message body, however is sent without a trailing "\r\n". Instead the header Content-Length is used to tell the client the size of the message body.
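The framing rules above can be sketched with snprintf. The helper name format_response_headers and its parameter list are hypothetical; it writes only the status line and headers, ending with the blank line, and the caller sends exactly content_length bytes of message body afterwards.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the status line and headers into buf, ending
 * with the blank line that separates headers from the message body.
 * Returns the number of bytes written, or -1 if buf is too small. */
int format_response_headers(char *buf, size_t cap, int code,
                            const char *reason, const char *date,
                            const char *content_type, size_t content_length) {
    int n = snprintf(buf, cap,
                     "HTTP/1.1 %d %s\r\n"
                     "Date: %s\r\n"
                     "Content-Type: %s\r\n"
                     "Content-Length: %zu\r\n"
                     "\r\n",
                     code, reason, date, content_type, content_length);
    if (n < 0 || (size_t)n >= cap) {
        return -1;    /* output was truncated or an error occurred */
    }
    return n;
}
```

For a HEAD request the same headers are sent with no message body after them.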

GET requests and mapping urls to files

There is one format of url that you do not need to handle for this assignment. These are ones where the server would respond with a "301 Moved Permanently" response vs. responding with OK and the file contents. This case is described below in more detail.

Directory names in urls correspond to files named either index.html or index.php in the named directory. Your web server should first look for a file named index.html and if that doesn't exist look for index.php when handling these requests.

Here are some example GET requests that you need to handle, and their corresponding file name(s):

GET  /   HTTP/1.1                           /scratch/cs87/cs/index.html 
                                       or   /scratch/cs87/cs/index.php 

GET /index.html  HTTP/1.1                   /scratch/cs87/cs/index.html 

GET /index.php   HTTP/1.1                   /scratch/cs87/cs/index.php 

GET /search.html HTTP/1.1                   /scratch/cs87/cs/search.html

GET /courses/ HTTP/1.1                      /scratch/cs87/cs/courses/index.html 
                                            /scratch/cs87/cs/courses/index.php 

GET /~newhall/  HTTP/1.1                    /home/newhall/public_html/index.html
                                            /home/newhall/public_html/index.php

GET /~newhall/newcluster.jpg  HTTP/1.1      /home/newhall/public_html/newcluster.jpg
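The mappings in the table above could be computed along these lines. This is a sketch under assumptions: the helper name url_to_path is hypothetical, and rejecting any url that contains ".." is an extra safety check added here, not something the assignment asks for. When the resulting path ends in '/', the caller still has to try index.html and then index.php.

```c
#include <stdio.h>
#include <string.h>

#define DOC_ROOT "/scratch/cs87/cs"

/* Hypothetical helper: map a request url to a file system path.
 * Returns 0 on success, -1 for urls this sketch does not handle. */
int url_to_path(const char *url, char *out, size_t cap) {
    int n;

    if (url == NULL || url[0] != '/') {
        return -1;
    }
    if (strstr(url, "..") != NULL) {
        return -1;    /* assumption: refuse to climb out of the root */
    }

    if (url[1] == '~') {
        const char *user = url + 2;
        const char *rest = strchr(user, '/');
        if (rest == NULL || rest == user) {
            return -1;    /* "/~name" with no trailing '/', or empty name */
        }
        n = snprintf(out, cap, "/home/%.*s/public_html%s",
                     (int)(rest - user), user, rest);
    } else {
        n = snprintf(out, cap, "%s%s", DOC_ROOT, url);
    }
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

For example, "/~newhall/newcluster.jpg" maps to "/home/newhall/public_html/newcluster.jpg".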

You do not need to correctly handle GET requests of the following format (i.e. GET requests with no trailing '/' when the last name corresponds to a directory):

GET /~newhall  HTTP/1.1
GET /courses  HTTP/1.1

The way a web server would handle requests like this is to send a "301 Moved Permanently" response to the client with the real url of the page ("Location: http://IP:portnum/~newhall/"). The client would resend the GET request using the url returned by the server:

GET /~newhall/  HTTP/1.1

When your web server receives a request of this form, you can choose to either have it respond with an error response or with OK. If your web server sends an OK response, then the client may make subsequent GET requests for any files included in the page, and these GET requests will not have the correct url (the client doesn't know that newhall is a directory and instead of requesting /~newhall/foo.jpg will request /foo.jpg, if my homepage includes the foo.jpg file). Just handle these as you would any bad url (there is no file associated with /foo.jpg).

You do, however, need to correctly handle GET requests with the trailing '/' (e.g. /~newhall/).

You are welcome to add support for 301 responses if you'd like, but you are not required to do so for this assignment, so I'd suggest only adding this after the rest of your web server works.

Getting Started

You can grab a copy of my starting point files for client and server TCP/IP socket programs in C. They are in ~newhall/public/cs87/socket_startingpt/. The starting point contains a sample Makefile for building the web_client and web_server executables, and the very beginnings of both implementations (mostly just #includes for the server).

In addition, I have an example program for sending and handling signals in pthread programs. It is available here: ~newhall/public/cs87/pthreads_signals_example/

I strongly encourage you to implement and test incrementally. Also, it is very important to check return values from all functions and to handle error return values correctly. For example, if a call to read on a socket returns before the requested number of bytes have been read, this could mean that the other end of the socket was closed. When this is the case, stop trying to read from this socket; otherwise your code may end up in an infinite loop.
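As one illustration of checking return values on a socket, the sketch below reads until the blank line that ends the request headers, and treats a return value of 0 from read as "the other end closed the connection". The function name read_request and its return convention are assumptions made for this sketch.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read from fd until "\r\n\r\n" has arrived.
 * Returns the number of bytes in buf (> 0) when a full set of headers
 * was read, 0 if the other end closed the connection first, and -1 on
 * a read error or if the headers do not fit in buf. */
ssize_t read_request(int fd, char *buf, size_t cap) {
    size_t used = 0;

    if (cap == 0) {
        return -1;
    }
    buf[0] = '\0';
    while (used < cap - 1) {
        ssize_t n = read(fd, buf + used, cap - 1 - used);
        if (n == 0) {
            return 0;       /* other end closed its side of the socket */
        }
        if (n < 0) {
            return -1;      /* error: errno is left for the caller to inspect */
        }
        used += (size_t)n;
        buf[used] = '\0';   /* keep buf a valid C string for strstr */
        if (strstr(buf, "\r\n\r\n") != NULL) {
            return (ssize_t)used;
        }
    }
    return -1;              /* request headers too large for buf */
}
```

A worker thread could call this in its loop and exit, after cleaning up, whenever the result is 0 or -1.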

Here is one suggestion for proceeding:

Starting with the starting point code, finish a simple client and server program where the client connects to the server and sends it a simple message and waits for a response. The server should receive the message, print it out, and close the socket. The client should exit when it detects the server has closed its end of the socket.
See if you can connect to your web server from wget, firefox and telnet and send it an http request (in the correct format). Your server could just spawn a worker thread whose main function just prints out the message, closes the socket and calls pthread_exit (no infinite worker thread loop, and no response sent to the client).
Next, modify your server to send a fake response to a client GET request (don't really parse the requested page and fetch the corresponding file, but send a 200 response with a very short web page message body). If all goes well, firefox should display your bogus web page after receiving your response. If things don't go well, connect to your server using telnet, as you can more easily see what the client is receiving from your server.
Next, add support for finding the correct web page to return for a GET response. Add support for handling different errors (file not found, etc.).
Next, add in full support for multiple pthread worker threads that keep the connection open until they are killed or detect an error and kill themselves. Add support for the main thread killing the oldest connections when max connections are reached and a new connection comes in.
Make sure your program is free of valgrind errors (it would not hurt to run it under valgrind as you develop different parts too).
Remove (or comment out) any debug output before submitting your solution.

Your program should use good modular design, be well-commented, robust, and correct. See my C Style Guide off my C resource page.

Useful Functions and Resources

HTTP Made Really Easy, by Jim Marshall
HTTP 1.0 Specification
HTTP 1.1 Specification
Socket Programming Links. Beej's Guide is a good starting point and has code examples (sections 5 and 6 are particularly useful). You can use either read and write or send and recv to send and receive messages on sockets. We also have a copy of Stevens' "Unix Network Programming" in the main lab. Chapter 5 is likely the most useful.
C and C++ programming and debugging. Includes some documentation on C string library functions and file I/O.

A few string functions that may be particularly useful for this assignment:

strtok: string tokenizer, multiple calls to it on same string return a pointer
        to the next token in the string:

        #include <string.h>
        
        char *next =0;
        char s[] = "hello   there    how    are   you?";  // writable array: strtok modifies s
        char *delim = " \t\r\n";   // delimiters are space, tab, cr, eoln

        next = strtok(s, delim);     // first call pass string to tokenize
        while (next != 0) {
            printf("%s\n", next);
            next = strtok(0, delim);   // subsequent calls pass 0 to get the
                                       // next token in the string s
        }

sprintf:  like printf, but instead of writing out the resulting string, it
          is copied to the dest string: sprintf(dest, format_string, args ...) 

     ex:
           #include <stdio.h>
           
           char result[1024]; 
           sprintf(result, "%s%d: %4.2f\n", "hello there", 34, 6.55);
           // result string will have value:  "hello there34: 6.55\n"
           printf("%s", result);

ifconfig, dig, nslookup: to get a machine's IP address:

$ ifconfig        #   (/sbin/ifconfig) on machine on which you want the IP

eth0      Link encap:Ethernet  HWaddr 
          inet addr:130.58.68.62  Bcast:130.58.68.255  Mask:255.255.255.0
...                 ^^^^^^^^^^^^
                    IP address

$ dig tomato.cs.swarthmore.edu

$ nslookup tomato.cs.swarthmore.edu

setsockopt: I recommend setting socket options on the main server's listener socket. SO_LINGER controls what close does when there are still data to send in the socket's buffer: when it is on, the close is delayed for some time to wait for all the data to be sent, so set it to off (0). In addition, if you kill your server program, you often cannot restart it for about 1 minute because connections on port 8888 from the previous run of your server are still lingering around (in the TIME_WAIT state), and bind will fail. Setting SO_REUSEADDR on the listener socket before calling bind avoids this. There is an example call to setsockopt in the web_client.c starting point code.

access: check to see if a file is accessible in some way:

access(path_name_of_file, X_OK | F_OK);
access(path_name_of_file, R_OK | F_OK);

stat: get statistics about a file, including its size in bytes, modification time, ...
```
struct stat stat_info;

ret = stat("/home/newhall/foo.txt", &stat_info);
```
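One way access and stat could be combined to choose between the 200, 403, and 404 status codes, and to get the Content-Length value, is sketched below. The helper name file_status is hypothetical, and treating anything that is not a regular file as 403 is an assumption of this sketch.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: decide the status code for a file path.
 * Returns 200, 403, or 404; on 200 the file size is stored in *size. */
int file_status(const char *path, off_t *size) {
    struct stat info;

    if (access(path, F_OK) != 0) {
        return 404;               /* file does not exist (or path is bad) */
    }
    if (access(path, R_OK) != 0) {
        return 403;               /* exists, but we may not read it */
    }
    if (stat(path, &info) != 0) {
        return 404;               /* disappeared between the two calls */
    }
    if (!S_ISREG(info.st_mode)) {
        return 403;               /* assumption: serve regular files only */
    }
    *size = info.st_size;         /* value for the Content-Length header */
    return 200;
}
```

A file can still change between this check and the later open, so the open and read calls need their own error handling too.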
time, gmtime_r, mktime, ctime_r: functions to get the current time (in the GMT time zone) and convert it to a string representation. time returns the Unix time, which is the number of seconds since Jan 1, 1970; it takes a pointer to a time_t variable that it sets to this value. gmtime_r converts a time_t value to a struct tm in GMT. mktime converts the struct tm filled in by gmtime_r back to a time_t value. ctime_r takes a time_t value and creates a string representation of the time. You should use the reentrant (_r) versions of these functions, since your server is multi-threaded.
Here is an example (note: there is missing error detection and handling in this example):
```
  char buff[64];
  time_t mytime, mytime2;
  struct tm my_time_struct;

  time(&mytime);
  gmtime_r(&mytime, &my_time_struct);
  mytime2 = mktime(&my_time_struct);
  ctime_r(&mytime2, buff);
  printf("Date: %s", buff);  // the string from ctime_r already ends with '\n'
```
In the initial stages of implementing your solution, you can just return a bogus time string for the Date header.
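When you are ready to produce a real Date header, one alternative to ctime_r is strftime, which can build the same layout as the example response earlier on this page (Sun, 10 Jan 2010 18:17:43 GMT). The helper name http_date is hypothetical, and the sketch assumes the program runs in the default "C" locale so that day and month names are in English.

```c
#include <string.h>
#include <time.h>

/* Hypothetical helper: format t as an HTTP date string in GMT.
 * Returns 0 on success, -1 if conversion fails or buf is too small. */
int http_date(time_t t, char *buf, size_t cap) {
    struct tm tm_utc;

    if (gmtime_r(&t, &tm_utc) == NULL) {
        return -1;
    }
    /* strftime returns 0 when the result does not fit in buf */
    if (strftime(buf, cap, "%a, %d %b %Y %H:%M:%S GMT", &tm_utc) == 0) {
        return -1;
    }
    return 0;
}
```

Calling http_date(time(NULL), buf, sizeof(buf)) fills buf with the current time, with no trailing newline.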

pthread mutex variables:

// declare and initialize:
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

// use:
pthread_mutex_lock(&mutex);
  // critical section code
pthread_mutex_unlock(&mutex);

htons, htonl, ntohl, ...: functions for converting between host and network byte order.
man and apropos: documentation for system calls (in section 2) and C and pthread library calls (in section 3).
```
$ man 2 read
$ man 2 send
$ man 3 strcpy
$ man 2 stat
$ man pthread_create
```
/proc file system: access to all kinds of system information. To watch TCP sockets being created, run (this can help verify that you are supporting multiple client connections):
```
 watch -n 1 cat /proc/net/tcp
```
netstat -ant lists information about all in-use sockets on a machine. If you want to continually watch this, run: watch -n 2 netstat -ant.

Submission and Demo

Create a tar file containing:

All your web server source files, and makefile to build server (and client if applicable).
A README file with: (1) your and your partner's names; (2) an example of how to run your web server (a command line); and (3) a description of any features you have not fully supported and/or any errors you were unable to fix.

I'd suggest creating a handin directory and copying all these things into it. It is good to check that you have your full solution in the handin directory (type 'make' to check that everything builds, try running it, then type 'make clean' to remove executables and .o's from what you submit). Then tar up your handin directory.

Either you or your partner should submit your tar file by running cs87handin.

Demo

You and your partner will sign up for a 15 minute demo slot to demo your web server. Think about, and practice, different scenarios to demonstrate both correctness and good error handling. You will want to demonstrate concurrent client connections, persistent connections, what happens when the client side closes its end of the socket (maybe via killing the client), and show that older server connections are closed when the max number of connections has been reached.