Lab 8: Deep Q-Learning
Due April 8th by 11:59pm

[Figures: the cart pole problem; reward by episode on cart pole]

Starting point code

Your group should be based on the partner request form, and you should have a GitHub repo ready to go by the time lab starts. Please let the instructor know if you have any issues accessing it.

Introduction

In this lab we will apply deep Q-Learning to two reinforcement learning problems. We will use problems provided by Gymnasium (formerly known as OpenAI Gym).

To start, we will focus on a classic control problem known as Cart Pole. In this problem, a pole is attached to a cart, and the goal is to keep the pole as close to vertical as possible for as long as possible. The state for this problem consists of four continuous variables (the cart position, the cart velocity, the pole angle in radians, and the pole angular velocity). The actions are discrete: push the cart left or push it right. The reward is +1 for every step in which the pole stays within 12 degrees of vertical and the cart stays within 2.4 units of its starting position.
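
If you want to confirm the state and action spaces for yourself, you can print them from gymnasium (the environment name CartPole-v1 is the one gymnasium currently provides; your installed version may differ):

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    print(env.observation_space)  # Box with 4 continuous values: position, velocity, angle, angular velocity
    print(env.action_space)       # Discrete(2): push left (0) or push right (1)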

Once you've successfully been able to learn the Cart Pole problem, you will tackle a harder problem called Lunar Lander.

Testing Gymnasium Environments

Open the file testGym.py and review how to create a gym environment, run multiple episodes, execute steps given a particular action, and receive rewards.

In order to use gymnasium, you'll need to activate the CS63 virtual environment. Then you can execute this file and watch the cart move and see example state information:

    source /usr/swat/bin/CS63env
    python3 testGym.py
  

Try the other environment, the lunar lander, which was initially commented out.
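
Whichever environment you use, the overall shape of a gymnasium episode loop is roughly the sketch below (random actions here; testGym.py's details may differ):

    import gymnasium as gym

    env = gym.make("CartPole-v1", render_mode="human")
    for episode in range(3):
        state, info = env.reset()
        total_reward = 0
        done = False
        while not done:
            action = env.action_space.sample()   # random action; testGym.py may choose actions differently
            state, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            done = terminated or truncated
        print("episode", episode, "reward", total_reward)
    env.close()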

NOTE: this program tries to open a graphical window for output, which means it won't work well over a remote connection without some extra setup. It should still run, but to see anything you'll need X-forwarding: pass the '-X' flag to ssh and have an X server running on your local computer. This happens automatically on Linux, but on Windows or Mac there is some extra work to do to run an X server.
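
For example, from a Linux or Mac terminal (the hostname below is just a placeholder for whichever lab machine you use):

    ssh -X yourusername@lab-machine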

Generally, it's recommended to debug this program while actually sitting at a lab machine so you can see what it's doing.

Make sure you understand how this program works before moving on.

Complete the implementation of Deep Q-Learning

Recall that in deep Q-Learning we represent the Q-table using a neural network. The input to this network represents a state and the output from the network represents the current Q values for every possible action.
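
For example, with Keras (which the starting point code appears to use, given the .weights.h5 files), such a network might look roughly like this; the layer sizes, activations, and learning rate here are illustrative, and deepQCart.py's build_model may differ:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam

    def build_q_network(state_size, action_size, learning_rate=0.001):
        # Input: one state vector; output: one estimated Q value per action.
        model = Sequential()
        model.add(Dense(24, input_dim=state_size, activation="relu"))
        model.add(Dense(24, activation="relu"))
        model.add(Dense(action_size, activation="linear"))
        model.compile(loss="mse", optimizer=Adam(learning_rate=learning_rate))
        return model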

Open the file deepQCart.py. Much of the code has been written for you. This file contains the definition of a class called DeepQAgent that builds the neural network model of the Q-learning table and allows you to train and to use this model to choose actions. You should read carefully through the methods in this class to be sure you understand how it functions.

There are two methods that you need to write: train and test. The main program, which calls both of these functions, is written for you and is in the file cartPole.py.

Here is pseudocode for the train function:

initialize the agent's history list to be empty
loop over episodes
  reset the environment and get the initial state
  state = np.reshape(state, [1, state size])
  every 50 episodes save the agent's weights to a unique filename
    (see program comments for details on this filename)
  initialize total_reward to 0
  loop over steps
     choose an action using the epsilon-greedy policy
     take the action, saving next_state, reward, and done
     update total_reward based on most recent reward
     reshape the next_state (similar to above)
     remember this experience
     reset state to next_state
     if episode is done
        break
  add total_reward to the agent's history list
  print a message that episode has ended with total_reward
  if length of agent memory > batchSize
     replay batchSize experiences for batchIteration times
  if epsilon > epsilon_min
     decay epsilon
save the agent's final weights after training is complete
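
As a rough guide, a Python version of this pseudocode might look like the sketch below. It is written as a standalone function taking the agent as a parameter (if train is instead a method of DeepQAgent, replace agent with self), and the method and attribute names used on the agent (choose_action, remember, replay, save, memory, epsilon, epsilon_min, epsilon_decay, state_size) are assumptions about the interface; match them to what deepQCart.py actually provides, and see the file's comments for the exact weight filenames.

    import numpy as np

    def train(agent, env, episodes, steps, batchSize, batchIterations):
        # Sketch only: adapt names and filenames to the actual DeepQAgent interface.
        agent.history = []
        for episode in range(episodes):
            state, info = env.reset()
            state = np.reshape(state, [1, agent.state_size])
            if episode % 50 == 0:
                agent.save("CartPole_episode_%d.weights.h5" % episode)  # see program comments for the exact filename
            total_reward = 0
            for step in range(steps):
                action = agent.choose_action(state)  # epsilon-greedy choice
                next_state, reward, terminated, truncated, info = env.step(action)
                done = terminated or truncated
                total_reward += reward
                next_state = np.reshape(next_state, [1, agent.state_size])
                agent.remember(state, action, reward, next_state, done)
                state = next_state
                if done:
                    break
            agent.history.append(total_reward)
            print("episode", episode, "ended with total reward", total_reward)
            if len(agent.memory) > batchSize:
                for _ in range(batchIterations):
                    agent.replay(batchSize)
            if agent.epsilon > agent.epsilon_min:
                agent.epsilon *= agent.epsilon_decay
        agent.save("CartPole_final.weights.h5")  # final weights; this filename is a guess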

The test function is similar in structure to the train function, but you should choose greedy actions and render the environment so you can observe the agent's behavior. You should not remember experiences, replay experiences, decay epsilon, or save weights.
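
A corresponding sketch of test, under the same interface assumptions as the train sketch above (choose_greedy_action is a placeholder name; your agent may expose greedy action selection differently, and rendering may instead be controlled by the render_mode passed to gym.make):

    import numpy as np

    def test(agent, env, episodes, steps):
        # Greedy actions only: no remembering, replaying, epsilon decay, or weight saving.
        for episode in range(episodes):
            state, info = env.reset()
            state = np.reshape(state, [1, agent.state_size])
            total_reward = 0
            for step in range(steps):
                env.render()  # or create the environment with render_mode="human"
                action = agent.choose_greedy_action(state)  # placeholder for a greedy action method
                state, reward, terminated, truncated, info = env.step(action)
                state = np.reshape(state, [1, agent.state_size])
                total_reward += reward
                if terminated or truncated:
                    break
            print("test episode", episode, "reward", total_reward)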

Use Deep Q-Learning to find a successful policy

Once you have completed the implementation of the DeepQAgent in the file deepQCart.py, you can test it out on the Cart Pole problem. There are two ways to run this program: training mode or testing mode.

To run this program for training with 200 episodes do:

    python3 cartPole.py train 200
  
Remember that you will first need to activate the CS63env in the terminal window where you run the code. Deep Q-Learning should be able to find a successful policy in 200 episodes using the parameter settings provided in the file.

After training is complete, it is interesting to go back and look at the agent's learning progress over time. We can do this by using the weight files that were saved every 50 episodes.

To run this program in testing mode do:

    python3 cartPole.py test CartPole_episode_0.weights.h5

This would show you how the agent behaved prior to any training at episode 0. The next command would show you the behavior after 50 episodes of training:

    python3 cartPole.py test CartPole_episode_50.weights.h5

The file CartRewardByEpisode.png contains a plot of reward over time, which also gives you a sense of the agent's progress on the task. This plot is always displayed at the end of training. You can also view this file by doing:

    eog CartRewardByEpisode.png
Do not move on to the next section until you have successfully learned the Cart Pole problem.

Applying Deep Q-Learning to Lunar Lander

You will now try applying the same approach to a new problem called Lunar Lander (pictured below left). This is a harder problem and will take many more episodes to learn. For example, the graph of reward by episode (pictured below right) shows that even after 750 episodes, reward is still increasing.

[Figures: the lunar lander problem; reward by episode on lunar lander]

As we have discussed in class, determining the appropriate settings for all of the hyperparameters in a machine learning system is a non-trivial problem. Luckily, Xinli Yu has written a paper entitled Deep Q-learning on Lunar Lander Game, which systematically explores the possible settings for each of the key hyperparameters in this problem.

  1. Start by reading the Abstract and Problem Description of this paper to get a general overview of the Lunar Lander problem including its state information, its action space, and the possible rewards.
  2. Next look at the algorithm given at the top of page 3. Notice that the pseudocode for the loop that completes an episode is slightly different from what we did for the cart pole problem. In particular, the replay step is inside the loop rather than after the loop.
  3. While the paper does things in this 'online' fashion, it is more efficient to do 'batch' updates at the end of each episode, so I recommend keeping the replay after the loop. Note: this is only an efficiency advantage because of the cost of switching between the CPU and GPU, so there are circumstances in which online updates make sense; but because we're using TensorFlow here, training will run faster if we minimize the number of distinct calls that do GPU work.

  4. Open the file deepQLunar.py, which is very similar to deepQCart.py from the previous section. Copy over your implementations of the train and test methods. Update them as necessary, including the extra verbose logging described in the method comments.

  5. Next, read through the Experiments section of the paper and study the graphs of the results. Based on the paper, determine the settings found to work best for the key hyperparameters and start with those values; you can either pass them as command-line arguments or edit them in as the defaults in lunarLander.py. You will still likely want to try some different values later to see what happens. In the method build_model, create the best neural network architecture found in the paper.
  6. The main program for Lunar Lander is in the file lunarLander.py, and, like the cart pole main program, it can be run in training mode or testing mode. Try training for 50 episodes so that you can debug your code and ensure that it seems to be working properly. After training ends, test both the initial weights and the final weights. Some progress should be evident. For example, using the initial weights, the lunar lander will likely crash most episodes, while using the later weights, it may hover and avoid landing (though probably not reliably).
  7. Our goal is to find a policy that can land the lunar module between the flags, which equates to an overall reward of over 200. Based on the graphs in the paper, we should expect this to take at least 1000 episodes, which took the authors 8-10 hours! Fortunately, our hardware is a bit faster than theirs, and some of their code was not fully optimized (in particular, they did online updates rather than batch updates), so if we're lucky we might be able to do it in only a few hours. Read about how to run long jobs using screen (a minimal sketch of the relevant commands appears after this list). Start a long training session for lunar lander using screen, and be sure to log off of the computer before you leave.
  8. Based on my personal experiments, it's possible with the right settings to achieve this level of 'success' in 500 episodes, which took between 2 and 3 hours. However, it took a lot more time than that to do the experiments to discover these settings, so don't expect to get that far on your first try; you'll likely need to play around with things. One thing that we've noticed in the past is that 300 timesteps isn't always sufficient for it to actually land, so you might want to try longer episodes, but remember that's going to slow things down since you need to simulate the extra timesteps. You can also try doing a second round of training using the weights from a previous run as your starting point; note that if you do this you'll likely want to adjust your other settings (epsilon especially).
  9. After training is complete, look at the graph of reward by episode called LunarRewardByEpisode.png (be sure to specify the command-line flag asking for this). Hopefully, your lunar lander Q-learner has managed to consistently achieve rewards over 200. Use recordmydesktop to make a video of the test runs for your most successful lunar lander result. Note that you may also want to use a larger number of timesteps for testing than you used for training, if nothing else to see whether the behavior stays good past that point or falls apart once you go beyond the 300 (or however many) steps used in training.
  10. Be sure to use git to add, commit, and push all of your files, especially your graphs of reward by episode and your movie!
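
As mentioned in step 7, here is a minimal sketch of running a long training job inside screen (the session name is just an example, and the exact lunarLander.py arguments may differ from what is shown, so check its usage message):

    screen -S lunar                    # start a named screen session
    python3 lunarLander.py train 1000  # run the long training job inside the session
    # detach with Ctrl-a d; the job keeps running after you log out
    screen -r lunar                    # reattach later to check on progress
    screen -ls                         # list your screen sessions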

Acknowledgments

The structure of the Q-learning algorithm is based on code provided at Deep Q-Learning with Keras and Gym