Your program will allow the user to specify the starting configuration (how many objects are in each of the three piles), then use Q-learning to figure out the optimal strategy for both players simultaneously. After the Q-learning finishes, the program should let the user play as many games of Nim against the computer as they want, choosing each time whether to go first or let the computer go first. Certain starting configurations have a guaranteed winning strategy (e.g., with piles of 3-4-5, the first player can force a win with perfect play), and the computer should always win in those cases (assuming the Q-learning simulated enough games to learn the Q-values well enough).
In our environment, we call the first player "A" and the second player "B." There are only two rewards in the whole game: any move that causes Player A to win earns a reward of 1000, and any move that causes Player B to win earns a reward of -1000.
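As a minimal sketch of how this reward scheme might look in code (Python here for illustration; the function name is hypothetical, and it assumes the normal-play convention that whoever removes the last object wins):

```python
def reward_for_move(piles_after_move, mover):
    """Reward for the move just made by `mover` ("A" or "B").

    Under the normal-play convention, taking the last object wins, so a move
    that empties every pile ends the game in the mover's favor.
    """
    if sum(piles_after_move) == 0:
        return 1000 if mover == "A" else -1000
    return 0          # every other move has zero reward
```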
Each action that Player A and Player B take in the simulated games should be chosen randomly. (In the real world, the actions are often chosen according to a policy that balances exploration and exploitation, but for simplicity, we will always explore.) Recall that we need two update equations:
When Player A moves from state $s$: $Q[s, a] \gets Q[s, a] + \alpha \left[ r + \gamma \min_{a'} Q[s', a'] - Q[s, a] \right]$
When Player B moves from state $s$: $Q[s, a] \gets Q[s, a] + \alpha \left[ r + \gamma \max_{a'} Q[s', a'] - Q[s, a] \right]$
Note that we are still maximizing or minimizing over the action $a'$ taken from state $s'$; only the action $a$ from state $s$ is chosen at random. (If we pick $a'$ randomly instead of maximizing or minimizing, we actually get a different reinforcement learning algorithm called SARSA.)
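For concreteness, here is a minimal Python sketch of the two updates; the hyperparameter values, the `(state, action)` tuple keys, and the helper name are illustrative choices rather than requirements (the handout's string encoding is described below):

```python
ALPHA, GAMMA = 0.1, 0.9      # illustrative learning rate and discount factor
Q = {}                       # maps (state, action) pairs to Q-values, defaulting to 0.0

def q_update(state, action, reward, next_state, next_actions, next_player):
    """One update for the two-player game.

    `next_player` is whoever moves from `next_state`, and `next_actions` lists
    the legal actions there (empty if the game just ended). After Player A
    moves it is B's turn, so A's update backs up the minimum over B's replies;
    after Player B moves it backs up the maximum over A's replies.
    """
    if next_actions:
        future = [Q.get((next_state, a2), 0.0) for a2 in next_actions]
        best_next = min(future) if next_player == "B" else max(future)
    else:
        best_next = 0.0      # terminal state: nothing to look ahead to
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
```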
You can use the printed table of Q-values to debug your program. One idea you can use is that after the values have converged to the optimal Q-values, if there is a guaranteed winning strategy for Player A in some state, then the printed table should show at least one state-action combination with a positive Q-value for that state. If there is no such strategy (i.e., Player A always loses if Player B plays perfectly), all of the Q-values will be negative for the state in question.
For Player A: $\pi(s) = \arg \max_{a} Q[s, a]$
For Player B: $\pi(s) = \arg \min_{a} Q[s, a]$
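A small Python sketch of both the policy extraction and the debugging check described above, again using illustrative `(state, action)` keys and hypothetical helper names:

```python
def greedy_action(Q, state, actions, player):
    """pi(s): Player A takes the arg max over legal actions, Player B the arg min."""
    values = {a: Q.get((state, a), 0.0) for a in actions}
    return max(values, key=values.get) if player == "A" else min(values, key=values.get)

def a_can_force_win(Q, state, actions):
    """Debugging check: once the values have converged, Player A has a
    guaranteed win from `state` exactly when some action has a positive Q-value."""
    return any(Q.get((state, a), 0.0) > 0 for a in actions)
```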
You can store the table of Q-values using a map or dictionary or hash table. To do this, you will need to store each state-action pair as a single item that can be mapped to the Q-value for that pair. This can be done in many ways, but one straightforward idea is to use a 6-character string for a state-action pair, as follows.

The first character is either "A" or "B" for the player whose turn it is. The next three characters are the numbers of objects left in each pile (we'll assume there will never be more than nine objects in a pile). The last two characters represent the action taken: the number of the pile (0, 1, or 2) followed by the number of objects to remove from that pile.
For example, if we start with piles of 3, 4, and 5 objects, and the first player (A) chooses to remove 1 object from pile zero (the first pile), that would be the state-action combination "A34501." Our board now has 2, 4, and 5 objects left, and now it's B's turn. If B decides to remove 2 objects from the second pile (pile 1), this would be represented by the string "B24512."
This representation of state-action pairs allows us to store the table of Q-values easily as a mapping from strings to doubles.
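A sketch of this encoding in Python (the helper name is hypothetical; the format follows the 6-character scheme above):

```python
def encode(player, piles, pile_index, count):
    """Build a key such as "A34501": player, the three pile sizes, the pile
    chosen, and the number of objects removed from it."""
    return f"{player}{piles[0]}{piles[1]}{piles[2]}{pile_index}{count}"

Q = {}                                    # maps encoded strings to Q-values (doubles)
Q[encode("A", (3, 4, 5), 0, 1)] = 0.0     # the "A34501" example from the text
Q[encode("B", (2, 4, 5), 1, 2)] = 0.0     # the "B24512" example from the text
```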
An alternate representation is a two-level map (a map of maps), where the first level maps states (like "A345") to a map containing actions and their corresponding Q-values. This makes the bookkeeping a little more difficult, but it makes it easier to find, for example, what actions are available from a given state.
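The two-level version might be sketched like this (again illustrative, not required):

```python
from collections import defaultdict

Q2 = defaultdict(dict)            # outer key: state (e.g. "A345"); inner map: action -> Q-value
Q2["A345"]["01"] = 0.0            # Player A removes 1 object from pile 0
Q2["A345"]["21"] = 0.0            # Player A removes 1 object from pile 2

# The extra level makes it easy to see which actions are recorded for a state:
available = list(Q2["A345"].keys())   # -> ["01", "21"]
```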
The 0-1-2 game (We did this one by hand in class.)
In this game, Player A is guaranteed to win if they play perfectly. Notice that for the state-action pairs for the opening move (A012xx), there is only one positive Q-value. This indicates that Player A must make that specific move to guarantee a win; otherwise, it opens the door for Player B to possibly win.

The 1-2-3 game
Player B is guaranteed to win here (with perfect play), no matter what Player A does on the opening move. You can deduce this because all of the opening moves for Player A (A123xx) have negative Q-values. This is why the computer will always win if allowed to go second with this board configuration.
You should make sure your program still works for larger board sizes as specified under "testing your program" above.