Lab 5: General Game Playing with MCTS
Due March 3 by 11:59pm

General game players are systems able to accept descriptions of arbitrary games at run time and able to use such descriptions to play those games effectively without human intervention. In other words, they do not know the rules until the games start. Unlike specialized game players, general game players cannot rely on algorithms designed in advance for specific games.

Starting point code

As in the previous lab, use Teammaker to form your team. You can log in to that site to indicate your partner preference. Once you and your partner have specified each other and the lab has been released, a GitHub repository will be created for your team.

Introduction

The objective of this lab is to use Monte Carlo Tree Search to implement a general game playing agent.

The primary Python file you will be modifying is MonteCarloTreeSearch.py. You have also been provided several files that should look familiar from last week's lab:

BoardGame.py, containing a base class that implements common functionality
PlayGame.py
Nim.py (Like last week, this very simple game is a great option to use when you are testing and debugging your code.)
Mancala.py
Breakthrough.py
BasicPlayers.py

There are also several new Python programs:

Hex.py, which implements Hex, the game depicted below, where players attempt to form a connected path between opposing sides of the board.
MonteCarloTreeSearch.py, which is where you'll implement your MCTS player agent.

The name 'Hex' comes from the fact that the board is a hexagonal grid; in other words, each location has six other locations that are 'adjacent' (fewer along the edges of the board).

Hex has several properties that make it resemble Go, while still being comparatively small and tractable. In particular, the board starts empty and players can place a piece in any unoccupied spot, so the branching factor is high. However, Hex is played on a smaller board than Go (8x8 instead of 19x19), so it's not quite as bad; as a reminder, you can make the board even smaller with the -game_args flag. Also, players win by forming a continuous 'path' from one side of the board to the opposite side. This means that, as in Go, the 'value' of a piece is defined by the configuration of the other pieces on the board, making it challenging to write a good static evaluation function. Again, Hex is a lot less complicated than Go, but it has enough similarities that it's useful as an example of a game that can't be easily solved using pure minimax.
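
To make the 'six adjacent locations' and 'connected path' ideas concrete, here is a minimal sketch. It is not how Hex.py is implemented. It assumes the board is stored as board[row][col], that each row is drawn half a cell to the right of the row above (as in the pictures in this lab), and that the player being checked is trying to link the left and right sides. The names HEX_OFFSETS, neighbors, and connects_left_to_right are made up for this example.

```python
from collections import deque

# Six neighbor offsets (d_row, d_col) for a rhombus-shaped board where each
# row is shifted half a cell to the right of the row above it.
# This layout is an assumption; Hex.py may index cells differently.
HEX_OFFSETS = [(0, -1), (0, 1), (-1, 0), (1, 0), (-1, 1), (1, -1)]


def neighbors(row, col, size):
    """Yield the in-bounds cells adjacent to (row, col)."""
    for d_row, d_col in HEX_OFFSETS:
        r, c = row + d_row, col + d_col
        if 0 <= r < size and 0 <= c < size:
            yield r, c


def connects_left_to_right(board, player):
    """Breadth-first search from column 0; True if the last column is reached."""
    size = len(board)
    frontier = deque((r, 0) for r in range(size) if board[r][0] == player)
    seen = set(frontier)
    while frontier:
        row, col = frontier.popleft()
        if col == size - 1:
            return True
        for cell in neighbors(row, col, size):
            if cell not in seen and board[cell[0]][cell[1]] == player:
                seen.add(cell)
                frontier.append(cell)
    return False


# On a square grid these three pieces would only touch at corners,
# but with hexagonal adjacency they form a connected path.
anti_diagonal = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
main_diagonal = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
print(connects_left_to_right(anti_diagonal, 1))
print(connects_left_to_right(main_diagonal, 1))
```

Notice that only one of the two diagonals counts as connected. This asymmetry is part of why a single piece's worth depends so heavily on where the other pieces are.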

The starting board for an 8x8 game of Hex.

 - - - - - - - -
\ · · · · · · · · \
 \ · · · · · · · · \
  \ · · · · · · · · \
   \ · · · · · · · · \
    \ · · · · · · · · \
     \ · · · · · · · · \
      \ · · · · · · · · \
       \ · · · · · · · · \
          - - - - - - - -

A point in the game after each player has made four moves.

 - - - - - - - -
\ · · · · · · ● ● \
 \ · · · · · · ● ● \
  \ · · · · · · ● · \
   \ · · · · · ● · · \
    \ · · · · · · · · \
     \ · ● · · · · · · \
      \ · · · ● · · · · \
       \ · · · · · · · · \
          - - - - - - - -

Blue has just won the game by creating a connected path between the two blue sides of the Hex board.

 - - - - - - - -
\ · · · · · · ● ● \
 \ · · · · · · ● ● \
  \ · · · · · · ● ● \
   \ · · · · · ● ● ● \
    \ · · · ● ● ● ● · \
     \ ● ● ● ● ● · ● · \
      \ ● ● ● ● · · · · \
       \ · · · · · · · · \
          - - - - - - - -

Just like last week, all games are played using PlayGame.py. The interface has been updated slightly; you can use the -h option for more information.

To get started, try playing Hex against your partner. This game has a large branching factor, so you'll likely have to scroll up to see the game board between turns. The game is played on a grid where (0,0) is the top left corner and (7,7) is the bottom right corner.

./PlayGame.py hex human human

Implementation Tasks

To begin, open the file MonteCarloTreeSearch.py and complete the implementations of the two classes provided.

Node class:
1. You will need to implement the method UCBWeight(UCB_constant, parent_visits, parent_turn) used in node selection. The UCB weight is calculated according to the formula in the node selection section of mcts.ai.
2. This class will also need to implement the method updateValue(outcome) used for value backpropagation. The outcome will be either +1, -1, or 0 representing a win for the maximizer, a win for the minimizer, or a draw. Recall from class that we will be calculating value according to this formula:
```
  value = 1 + (wins-losses)/visits
```
  The benefits of this formula are that:
  - wins are valued better than draws and draws are valued better than losses
  - value will always be positive, in the range [0,2], and positive values are necessary for the UCBWeight method
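
To make these two methods concrete, here is a minimal sketch. It is not the starter code's actual Node class. It assumes the formula on mcts.ai has the usual UCB1 form, value + C * sqrt(ln(parent_visits) / visits), so check the site before relying on it. The attribute names (wins, losses, visits, value), the starting value of 1, the infinite weight for an unvisited node, and the use of 2 - value when the minimizer is choosing at the parent (the same flip used in the getMove pseudocode later in this lab) are all assumptions.

```python
import math


class Node:
    """Hypothetical stand-in for the lab's Node class."""

    def __init__(self):
        self.wins = 0
        self.losses = 0
        self.visits = 0
        self.value = 1.0  # assumed neutral value before any rollouts

    def updateValue(self, outcome):
        """Record one rollout: +1 maximizer win, -1 minimizer win, 0 draw."""
        self.visits += 1
        if outcome == 1:
            self.wins += 1
        elif outcome == -1:
            self.losses += 1
        self.value = 1 + (self.wins - self.losses) / self.visits

    def UCBWeight(self, UCB_constant, parent_visits, parent_turn):
        """UCB1-style weight as seen by the player choosing at the parent."""
        if self.visits == 0:
            return float("inf")  # assumption: always try an unvisited node first
        # value is stored from the maximizer's point of view, in [0, 2]
        exploit = self.value if parent_turn == 1 else 2 - self.value
        explore = UCB_constant * math.sqrt(
            math.log(max(parent_visits, 1)) / self.visits
        )
        return exploit + explore


node = Node()
for outcome in (1, 1, 1, -1):
    node.updateValue(outcome)
print(node.value)  # 1 + (3 - 1) / 4 = 1.5
print(node.UCBWeight(1.0, 4, 1), node.UCBWeight(1.0, 4, -1))
```

The two weights printed on the last line differ by exactly 1.0 because the same node looks good to the maximizer (1.5 out of 2) and correspondingly poor to the minimizer (0.5 out of 2), while the exploration bonus is identical for both.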
MCTSPlayer class:
1. You will need to implement the method getMove(game_state) which is called by the PlayGame.py program to determine the player's next move. It should:
  - check whether a node already exists for the given game state, and if not create one
  - call MCTS on the node
  - determine the best move from the node, taking into account the current player at the node
  - return the best move
  Here's pseudocode that fleshes out these steps:
```
getMove(game_state)
   # Find or create node for game_state
   key = str(game_state)
   if key in tree
      curr_node = get node from tree using key
   else
      curr_node = create new node with game_state
      add curr_node to tree using key 
   # Perform Monte Carlo Tree Search from that node
   MCTS(curr_node)
   # Determine the best move from that node
   bestValue = -float("inf"); bestMove = None
   for move, child_node in curr_node's children
      if curr_node's player is +1
         value = child_node.value
      else
         value = 2 - child_node.value
      if value > bestValue
         update bestValue and bestMove
   return bestMove
```
2. Debugging MCTS can be challenging due to the randomness inherent in the rollouts. Implement the status(node) method so that you can easily view the contents of a particular node within the tree. For example, let's play a game of Nim starting with 7 pieces, where we do 1000 rollouts per turn:
```
./PlayGame.py nim mcts random -game_args 7 -a1 1000
```
  Here's an example status that might be printed for the root node after the first turn:
```
node wins 988, losses  12, visits 1000, value 1.98
  child wins   0, losses   2, visits    2, value 0.00, move 2
  child wins   7, losses   4, visits   11, value 1.27, move 1
  child wins 981, losses   7, visits  988, value 1.99, move 3
  
```
  Note that there is randomness in this algorithm, so running it again won't get you exactly the same results, but it should give similar results. Also note that the order of the different 'moves' is arbitrary, so a different run may list them in a different order.
  
  Notice that the best move based on the rollouts is to take 3, which puts our opponent at 4 pieces. We saw in class that, with optimal strategy, playing from 4 pieces is a guaranteed loss. MCTS has also discovered this via the rollouts.
  
  Look at the numbers of wins, losses, and visits. You would expect that if we were to sum up these values at the child nodes, they would equal the totals at the root node.
  - For wins, 0 + 7 + 981 = 988 as expected.
  - For losses, 2 + 4 + 7 = 13 and not 12 as we would expect.
  - For visits, 2 + 11 + 988 = 1001 and not 1000 as we would expect, given that we did 1000 rollouts.
  What is going on? Remember that MCTS is storing the tree in a dictionary that maps states to nodes. The states are represented by the number of pieces remaining and whose turn it is. From the starting state of (7, turn 1), we can get to three successor states: (6, turn -1), (5, turn -1), and (4, turn -1). It turns out that there is another way to get to this last successor state (4, turn -1): the max player takes 1, the min player takes 1, and the max player takes 1 again. This series of moves has a low value, so it is rarely tried in rollouts, but because of the UCB formula it typically does get explored at least once over the many rollouts that were done. Each such visit is counted at the (4, turn -1) node but not as a visit to it from the root's move 3, which is why the losses and visits at the children sum to one more than we expected.
3. Lastly, you must complete the MCTS(node) method. This method takes a node from which to start the search and performs a fixed number of rollouts (num_rollout in the pseudocode below).
  Each rollout:
  - navigates explored nodes using the UCB weight to select the best option until it reaches the frontier
  - expands one new node
  - performs a random playout to a terminal state
  - propagates the outcome back to expanded nodes along the path of selection and expansion
  Pseudocode for MCTS is provided below:
```
 MCTS(current_node)
    repeat num_rollout times
       path = selection(current_node)
       selected_node = final node in path
       if selected_node is terminal
          outcome = winner of selected_node's state
       else
          next_node = expansion(selected_node)
          add next_node to end of path
          outcome = simulation(next_node's state)
       backpropagation(path, outcome)
    status(current_node) # use for debugging
```
  You will certainly want to break this task down using several helper methods, at least one for each phase of the algorithm.
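
To see how the four phases fit together, here is a minimal self-contained sketch that runs the loop above on a toy version of Nim (take 1 to 3 pieces; whoever takes the last piece wins). It does not use the lab's BoardGame, Node, or MCTSPlayer classes. Every name in it (Node.update, Node.ucb, moves, play, mcts, best_move) and the UCB_CONST value are assumptions chosen for the example, and the UCB weight assumes the usual UCB1 form.

```python
import math
import random

UCB_CONST = 1.0  # exploration constant; this particular value is an assumption


class Node:
    """Statistics for one toy Nim state, stored as (pieces_left, turn)."""

    def __init__(self, state):
        self.state = state
        self.children = {}  # move -> Node
        self.wins = self.losses = self.visits = 0
        self.value = 1.0

    def update(self, outcome):
        self.visits += 1
        self.wins += outcome == 1
        self.losses += outcome == -1
        self.value = 1 + (self.wins - self.losses) / self.visits

    def ucb(self, parent_visits, parent_turn):
        exploit = self.value if parent_turn == 1 else 2 - self.value
        return exploit + UCB_CONST * math.sqrt(math.log(parent_visits) / self.visits)


def moves(state):
    return [m for m in (1, 2, 3) if m <= state[0]]


def play(state, move):
    return (state[0] - move, -state[1])


def mcts(tree, root, num_rollouts):
    for _ in range(num_rollouts):
        # Selection: follow the best UCB weight while every move has a child.
        path, node = [root], root
        while moves(node.state) and len(node.children) == len(moves(node.state)):
            parent = node
            node = max(parent.children.values(),
                       key=lambda c: c.ucb(parent.visits, parent.state[1]))
            path.append(node)
        # Expansion: add one untried move, reusing the node if the state is known.
        if moves(node.state):
            untried = [m for m in moves(node.state) if m not in node.children]
            move = random.choice(untried)
            state = play(node.state, move)
            child = tree.setdefault(state, Node(state))
            node.children[move] = child
            path.append(child)
            node = child
        # Simulation: random playout to a terminal state.
        state = node.state
        while moves(state):
            state = play(state, random.choice(moves(state)))
        outcome = -state[1]  # the player to move with 0 pieces left has lost
        # Backpropagation: update every node on the path.
        for visited in path:
            visited.update(outcome)


def best_move(root):
    # Maximizer (turn +1) wants the highest value, minimizer the lowest.
    turn = root.state[1]
    return max(root.children, key=lambda m: turn * root.children[m].value)


random.seed(0)
start = (7, 1)
tree = {start: Node(start)}
mcts(tree, tree[start], 1000)
print("best move:", best_move(tree[start]))
for move, child in sorted(tree[start].children.items()):
    print(move, child.wins, child.losses, child.visits, round(child.value, 2))
```

Because the tree is a dictionary keyed by state, a node reached by two different move sequences is shared. That is why the visit counts printed for the root's children can add up to more than the root's own visit count, as discussed for the status output above.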

Testing your MCTS

Once you have implemented MCTS, you should do extensive testing on the simplest game we have provided, which is Nim. Here is how you would play Nim, starting with 7 pieces and with the MCTS doing 100 rollouts:

  ./PlayGame.py nim mcts random -game_args 7 -a1 100

Your output should look similar to the following. The numbers will not be exactly the same due to the randomness of the rollouts, but the trends should be similar. Player 1 (MCTS) should win every time.

Nim: 7 Turn: 1
root win    90 loss    10 visit   100 value 1.80
	child win    84 loss     5 visit    89 value 1.89, move 3
	child win     4 loss     3 visit     7 value 1.14, move 1
	child win     2 loss     2 visit     4 value 1.00, move 2
found expanded node 10 times
Move: 3

Nim: 4 Turn: -1
Move: 3

Nim: 1 Turn: 1
root win   118 loss     0 visit   118 value 2.00
	child win   180 loss     0 visit   180 value 2.00, move 1
Move: 1

Nim: 0 Turn: -1

player 1 (MCTS) wins

Once you are confident that MCTS is working properly you can turn off the status messages and explore how MCTS does with the much harder game Hex. Note that in Hex, Player 1 is blue and Player 2 is red. Try a game vs a random opponent using 1000 rollouts per turn (note there will be a clear pause in play as the MCTS completes these rollouts). Player 1 (MCTS) should win every time.

./PlayGame.py hex mcts random -a1 1000

Try games with two MCTS opponents pitted against one another. Give one version only 100 rollouts and the other version 1000 rollouts. The MCTS with more rollouts should consistently defeat the one with fewer rollouts.

./PlayGame.py hex mcts mcts -a1 1000 -a2 100

You should ensure that the MCTS with more rollouts is successful as either Player 1 or Player 2.

./PlayGame.py hex mcts mcts -a1 100 -a2 1000

Once you are confident that MCTS is working properly try playing Hex against it.

./PlayGame.py hex mcts human -a1 1000

Can you beat it? Does it seem to have good strategies? Does its play improve if you give it more rollouts, say 2000 per turn?

Optional Extensions

When you have completed the above implementation tasks, you are welcome to try some of the following extensions:

Play your agent against itself with both sides having the same number of rollouts. Does Hex seem to have a first-mover advantage? If so, how strong is it?
Try varying the number of rollouts. How big a difference between the players is needed to ensure a victory? Is it a constant ratio, or does it change at different scales?
Try varying the UCB_CONST parameter to trade off exploration vs exploitation. How does the optimal UCB_CONST value depend on the game? How does it depend on the amount of prior knowledge? The number of rollouts?
Try playing the MCTS agent against your minimax-with-pruning agent from last week (on the games you wrote static evaluation functions for). Under what circumstances does one or the other win?
Try saving the tree that MCTS has built for re-use in the next game. Useful libraries for accomplishing this include json and pickle. Try playing the version with saving against one without. How much does this improve the agent's play (e.g. how many extra rollouts does a 'fresh start' agent need to compete with a 'saved state' agent)? Does the rate of improvement stay constant, or does it fall off after a certain number of games?
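
If you try the tree-saving extension, the round trip with pickle can be as small as the sketch below. The helper names save_tree and load_tree and the file name are made up for this example, and the tree here is a stand-in dictionary of plain statistics rather than your real Node objects. pickle can store instances of your own classes as long as the class can be imported when the file is loaded; json would require converting nodes to plain dictionaries first.

```python
import os
import pickle
import tempfile


def save_tree(tree, path):
    """Write the state -> node dictionary to disk."""
    with open(path, "wb") as f:
        pickle.dump(tree, f)


def load_tree(path):
    """Return the saved dictionary, or an empty one on a fresh start."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in tree: keys play the role of str(game_state).
demo_tree = {"Nim: 7 Turn: 1": {"wins": 988, "losses": 12, "visits": 1000}}
demo_path = os.path.join(tempfile.mkdtemp(), "mcts_tree.pkl")

print(load_tree(demo_path))      # nothing saved yet, so an empty dictionary
save_tree(demo_tree, demo_path)
print(load_tree(demo_path) == demo_tree)
```

Only load pickle files that you created yourself, since unpickling a file can run arbitrary code.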

For any extension that you try, describe what you did and the outcome in the file called extensions.md.

Submitting your code

Use git to add, commit, and push your code.