Use Teammaker to form your team. You can log in to that site to indicate your partner preference. Once you and your partner have specified each other, a GitHub repository will be created for your team.
In this lab we will apply approximate Q-Learning to two reinforcement learning problems. We will use problems provided by OpenAI Gym. To start, we will focus on a classic control problem known as the Cart Pole. In this problem, a pole is attached to a cart, and the goal is to keep the pole as close to vertical as possible for as long as possible. The state for this problem consists of four continuous variables (the cart position, the cart velocity, the pole angle in radians, and the pole velocity). The actions are discrete: push the cart either left or right. The reward is +1 for every step in which the pole is within 12 degrees of vertical and the cart's position is within 2.4 of the starting position.
Once you have successfully learned the Cart Pole problem, you will explore another problem from OpenAI Gym and write up a summary of your results.
Open the file testGym.py and review how to create an OpenAI Gym environment, run multiple episodes, execute steps given a particular action, and receive rewards.
In order to use OpenAI Gym, you'll need to activate the CS63 virtual environment. Then you can execute this file to watch the cart move and see example state information:
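The reward condition above can be written as a small check on the state. This is only a sketch: the helper name in_bounds is hypothetical, the starting position is assumed to be 0, and the limits are assumed to be inclusive. The point it illustrates is that the 12 degree limit must be converted to radians before it is compared with the pole angle.

```python
import math

# Thresholds taken from the problem description above.
ANGLE_LIMIT = math.radians(12)   # the pole angle in the state is in radians
POSITION_LIMIT = 2.4             # assumes the cart starts at position 0

def in_bounds(state):
    """Return True while the state would still earn the +1 reward (hypothetical helper)."""
    cart_position, cart_velocity, pole_angle, pole_velocity = state
    return abs(cart_position) <= POSITION_LIMIT and abs(pole_angle) <= ANGLE_LIMIT

print(round(ANGLE_LIMIT, 4))               # 0.2094
print(in_bounds([0.0, 0.5, 0.1, -0.2]))    # True
print(in_bounds([0.0, 0.5, 0.25, -0.2]))   # False
```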
source /usr/swat/bin/CS63env
python3 testGym.py
Try the other environments provided in the file.
Make sure you understand how this program works before moving on.
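The interaction pattern in testGym.py (reset, then repeated steps until the episode is done) might look roughly like the sketch below. It assumes the older Gym interface, in which reset() returns the state and step() returns (state, reward, done, info). ToyEnv is a hypothetical stand-in so that the snippet runs without Gym installed, and run_episode is likewise an illustrative name, not part of the lab files.

```python
import random

class ToyEnv:
    """Hypothetical stand-in with the same reset/step shape as an older Gym env."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]           # initial state

    def step(self, action):
        self.t += 1
        state = [float(self.t), 0.0, 0.0, 0.0]
        reward = 1.0
        done = self.t >= self.horizon          # episode ends after `horizon` steps
        return state, reward, done, {}         # (state, reward, done, info)

def run_episode(env, choose_action, max_steps=500):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)
        state, reward, done, _info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def random_policy(state):
    """Ignore the state and pick one of two discrete actions at random."""
    return random.choice([0, 1])

print(run_episode(ToyEnv(horizon=5), random_policy))   # 5.0
```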
Recall that in approximate Q-Learning we represent the Q-table using a neural network. The input to this network represents a state and the output from the network represents the current Q values for every possible action.
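To make the input and output shapes concrete, here is a minimal sketch in which a single random linear layer stands in for the network. The names q_values, weights, and bias are hypothetical, and the real model in deepQAgent.py is built differently. The sketch only shows a state reshaped to one row going in and one Q value per action coming out.

```python
import numpy as np

STATE_SIZE = 4    # Cart Pole state variables
ACTION_SIZE = 2   # push left, push right

rng = np.random.default_rng(0)
# Hypothetical single linear layer standing in for the real network.
weights = rng.normal(size=(STATE_SIZE, ACTION_SIZE))
bias = np.zeros(ACTION_SIZE)

def q_values(states):
    """Map states of shape (n, STATE_SIZE) to Q values of shape (n, ACTION_SIZE)."""
    return states @ weights + bias

# A single state is reshaped into a batch containing one row.
state = np.reshape([0.01, -0.02, 0.03, 0.04], [1, STATE_SIZE])
q = q_values(state)
print(q.shape)                        # (1, 2)

# The greedy action is the index of the largest Q value.
best_action = int(np.argmax(q[0]))
```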
Open the file deepQAgent.py. Much of the code has been written for you. This file contains the definition of a class called DeepQAgent that builds the neural network model of the Q-learning table and allows you to train and to use this model to choose actions. You should read carefully through the methods in this class to be sure you understand how it functions.
There are two methods that you need to write: train and test. The main program, which calls both of these functions, is written for you and is in the file cartPole.py.
Here is pseudocode for the train function:
initialize the agent's history list to be empty
loop over episodes
    state = reset the environment
    state = np.reshape(state, [1, state size])
    every 50 episodes save the agent's weights to a unique filename
        (see program comments for details on this filename)
    initialize total_reward to 0
    loop over steps
        choose an action using the epsilon-greedy policy
        take the action
        observe next_state, reward, and whether the episode is done
        update total_reward based on current reward
        reshape the next_state (similar to above)
        remember this experience
        state = next_state
        if episode is done
            break
    add total_reward to the agent's history list
    if length of agent memory > batchSize
        replay batchSize experiences for batchIteration times
    print a message that episode has ended with total_reward
save the agent's final weights after training is complete
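The replay step trains the network toward Q-learning targets computed from remembered experiences. The sketch below shows one common way to compute those targets. The function name q_targets, the tuple layout, and the discount value gamma are assumptions for illustration, so the replay method already provided in deepQAgent.py may differ in its details.

```python
def q_targets(batch, q_function, gamma=0.95):
    """Compute one target per (state, action, reward, next_state, done) experience."""
    targets = []
    for state, action, reward, next_state, done in batch:
        if done:
            target = reward                  # no future reward after the episode ends
        else:
            # reward now plus the discounted best Q value of the next state
            target = reward + gamma * max(q_function(next_state))
        targets.append((state, action, target))
    return targets

def fixed_q(state):
    """Hypothetical Q function that returns the same two values for any state."""
    return [1.0, 2.0]

batch = [("s0", 0, 1.0, "s1", False), ("s1", 1, 1.0, "s2", True)]
print(q_targets(batch, fixed_q, gamma=0.5))
# [('s0', 0, 2.0), ('s1', 1, 1.0)]
```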
The test function is similar in structure to the train function, but you should choose greedy actions and also render the environment to observe the agent behavior. You should not remember experiences, replay experiences, nor save weights.
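The difference between the two action choices can be sketched in a few lines. The helper choose_action is hypothetical and is not the method provided in deepQAgent.py. With epsilon set to 0 it always returns the greedy action, which is the behavior wanted during testing.

```python
import random

def choose_action(q_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                     # random action
    return max(range(len(q_values)), key=lambda a: q_values[a])    # greedy action

q = [0.2, 0.7]
print(choose_action(q, epsilon=0.0))   # 1
```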
To run this program for training with 200 episodes do:
python3 cartPole.py train 200
Remember that you will first need to activate the CS63env in the terminal window where you run the code.
If you experience a long delay (more than a minute) before the program starts running or you get error messages, TensorFlow may be hung up searching for GPU devices. You can kill the program with CTRL-C and try running it like this instead:
CUDA_VISIBLE_DEVICES="" python3 cartPole.py train 200
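The same environment variable can also be set from inside a Python script. This is an optional sketch rather than something the lab requires. It assumes the assignment is placed before TensorFlow is imported, so that the setting is already in effect when GPU devices are searched for.

```python
import os

# Hide all GPUs from this process. Place this before the TensorFlow import.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import tensorflow   # in a real script the import would follow here
print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))   # ''
```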
Approximate Q-Learning should be able to find a successful policy in 200 episodes using the parameter settings provided in the file.
After training is complete, it is interesting to go back and look at the agent's learning progress over time. We can do this by using the weight files that were saved every 50 episodes.
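For a 200-episode run, the saved checkpoints can be listed with a short sketch like the one below. It assumes the naming pattern seen in this lab's test commands (for example CartPole_episode_0.h5) and that episodes are counted from 0. Both helper names are hypothetical, and the name of the final weight file is not covered.

```python
def checkpoint_episodes(num_episodes, interval=50):
    """Episodes at which weights are saved, assuming counting starts at 0."""
    return list(range(0, num_episodes, interval))

def checkpoint_name(episode, prefix="CartPole"):
    """Build a weight filename following the assumed naming pattern."""
    return f"{prefix}_episode_{episode}.h5"

names = [checkpoint_name(e) for e in checkpoint_episodes(200)]
print(names)
# ['CartPole_episode_0.h5', 'CartPole_episode_50.h5', 'CartPole_episode_100.h5', 'CartPole_episode_150.h5']
```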
To run this program in testing mode do:
python3 cartPole.py test CartPole_episode_0.h5
This would show you how the agent behaved prior to any training at episode 0. The next command would show you the behavior after 50 episodes of training.
python3 cartPole.py test CartPole_episode_50.h5
The file RewardByEpisode.png contains a plot of reward over time, which also gives you a sense of the agent's progress on the task. You can view this file by doing:
eog RewardByEpisode.png
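Per-episode rewards tend to be noisy, so a trailing average can make the trend easier to read. The sketch below is an optional illustration, not part of the provided plotting code. The function name moving_average and the sample rewards are made up.

```python
def moving_average(values, window=10):
    """Average each value with up to window - 1 values that precede it."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

history = [10, 20, 30, 40]                  # made-up rewards for illustration
print(moving_average(history, window=2))    # [10.0, 15.0, 25.0, 35.0]
```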
Once you have successfully learned the Cart Pole problem, you should choose another problem to try. Look through the OpenAI Gym documentation and pick a different environment to play with. Remember that reinforcement learning is difficult! We started with one of the easier classic problems. Don't get discouraged if your attempts at a new problem are not as successful.
You should use the testGym.py file to test out different domains and to figure out how the states, actions, and rewards are represented. But you must choose a task that has a discrete action space.
I suggest trying the Lunar Lander problem pictured above. This problem is more challenging than the Cart Pole, but you should be able to make some progress at learning it. Here, the goal is to learn how to land the vessel between the flags.
You should make copies of the files cartPole.py and deepQAgent.py, and rename them for your new task. Here are some of the aspects of the code that may need to change to be successful on a harder problem:
Write up a summary of what you did for this second problem in the file writeup.tex.
The structure of the Q-learning algorithm is based on code provided at Deep Q-Learning with Keras and Gym