Project 4: Nim with Q-learning

In this project, you will design a program to play the game of Nim optimally. Nim is a 2-player game that starts with three piles of objects. Players alternate turns, each removing one or more objects from a single pile (a player may not remove objects from more than one pile on a single turn). The player who is forced to take the last object loses (in some variants, the player who takes the last object wins, but in our version they lose).

Your program will allow the user to specify the starting configuration (how many objects are in each of the three piles), then use Q-learning to figure out the optimal strategy for both players simultaneously. After the Q-learning finishes, the program should let the user play as many games of Nim against the computer as they want, with the user choosing whether to go first or to let the computer go first. Certain starting configurations have a guaranteed winning strategy (e.g., with piles of 3-4-5 the first player wins if they play perfectly), and the computer should always win in those cases (assuming the Q-learning phase played enough simulated games to learn the strategy).

Specifics

At the beginning of the program, prompt the user for the starting number of objects in each of the three piles and the number of simulated games to play during Q-learning. Use gamma = 0.9, and alpha = 1. Yes, use alpha = 1, even though we talked about how it's usually better for alpha to be small and close to zero; the reason will become clear later.

While doing Q-learning, assume the first player is "A" and the second player is "B." There are only two rewards in the whole game: any move that causes player A to win earns a reward of 1000, and any move that causes player B to win earns a reward of -1000. The program should play the number of simulated games the user requested and learn the state-action value function Q using the parameters the user supplied. After all of the simulated games are done, print out the Q-function for the starting state and all actions available from that start state. If there is a guaranteed winning strategy for the first player, the printout should include at least one state-action combination with a positive Q-value; if there is no such strategy (i.e., the first player always loses when the second player plays perfectly), all of the Q-values will be negative.
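For the final printout, here is a minimal sketch (one possible approach, not required code) of the loop over the start state's actions, assuming the string-keyed Q-table described later in the Hints section; the helper name is my own.

```python
# Hypothetical helper (not part of the assignment) for printing the learned
# Q-values of every action available from the start state, assuming a dict Q
# keyed by strings such as "A34501" (see the Hints section below).
def print_start_state_q(Q, piles):
    state = "".join(str(n) for n in piles)        # e.g., (3, 4, 5) -> "345"
    for pile in range(3):
        for amount in range(1, piles[pile] + 1):  # must remove at least 1 object
            key = "A" + state + str(pile) + str(amount)
            if key in Q:
                print(f"Q[A{state}, {pile}{amount}] = {Q[key]}")

# Example call after training: print_start_state_q(Q, (3, 4, 5))
```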

Hints

This project has a few quirks. The biggest hurdle is changing the Q-learning update to handle the fact that two players are playing against each other. To handle that, whenever the update looks ahead to a state where it is player B's turn, change the "max" in the update formula to a "min." That way, player A acts normally, picking the action with the maximum expected value, while player B acts in the opposite fashion, picking the action with the minimum expected value. This is analogous to the minimax algorithm.
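To make the modified update concrete, here is a minimal sketch (one possible formulation, not required code), assuming the single string-keyed map described in the Hints below; the helper names are my own.

```python
# A minimal sketch (one possible formulation, not required code) of the
# modified Q-learning update, assuming a single dict Q keyed by strings such
# as "A34501" (see the Hints below). Both players' values are stored from A's
# perspective, so the lookahead uses a min when B moves next and a max when A
# moves next.

GAMMA = 0.9   # discount factor specified above
ALPHA = 1.0   # learning rate specified above

def action_values(Q, player, state):
    """Q-values of every legal action for `player` ("A" or "B") in `state`,
    where `state` is a 3-character string of pile sizes such as "345"."""
    return [Q.get(player + state + str(pile) + str(amount), 0.0)
            for pile in range(3)
            for amount in range(1, int(state[pile]) + 1)]

def q_update(Q, key, reward, next_player, next_state):
    """Update the entry for the state-action string `key` (e.g. "A34501")."""
    if next_player == "B":
        future = min(action_values(Q, "B", next_state), default=0.0)  # B minimizes
    else:
        future = max(action_values(Q, "A", next_state), default=0.0)  # A maximizes
    Q[key] = (1 - ALPHA) * Q.get(key, 0.0) + ALPHA * (reward + GAMMA * future)
```

In a terminal state like "000" there are no legal actions, so the lookahead term is zero and the entry becomes just the reward, which matches the ±1000 entries in the sample output below. Note also that with alpha = 1 the old Q-value drops out of the update entirely.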

Here's an idea for how to represent a state-action pair really easily. Use a three-digit number for the state: one digit for the number of objects in each of the three piles (we'll assume each pile always has fewer than 10 objects). To represent the action, append two more digits: the number of the pile (0, 1, or 2) followed by the number of objects to remove from that pile. Use two arrays, QA for A's Q-values and QB for B's Q-values.

For example, if we start with 3-4-5 objects, this would be state "345." If we choose to remove 1 object from pile zero (the first pile), that would be the state-action combination "34501." Therefore, we would perform a Q update on QA[34501]. Our next state is "245", and now it's B's turn. If B decides to remove 2 objects from the second pile (pile 1), this would be state-action "24512," and we would do a Q update on QB[24512].

This formulation of states and state-action pairs lets us store the table of Q-values easily as two arrays, each containing 100,000 doubles (one array for A's values, one for B's), indexed by the five-digit state-action number. However, if you're comfortable with maps, it probably makes more sense to use a single map from strings to doubles, where a state-action pair is a string like "A34501", indicating it's player A's turn, the board state is 3-4-5, and the action is removing 1 object from pile 0.
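If you go the map route, a sketch of the encoding might look like this (the helper names are mine, not part of the assignment):

```python
# A hedged sketch of the string encoding described above; helper names are
# my own, not part of the assignment.

def make_key(player, piles, pile, amount):
    """Encode a state-action pair, e.g. ("A", (3, 4, 5), 0, 1) -> "A34501"."""
    return player + "".join(str(n) for n in piles) + str(pile) + str(amount)

def apply_action(piles, pile, amount):
    """Return the piles left after removing `amount` objects from pile `pile`."""
    new_piles = list(piles)
    new_piles[pile] -= amount
    return tuple(new_piles)

Q = {}                                  # single map from state-action strings to doubles
key = make_key("A", (3, 4, 5), 0, 1)    # A removes 1 object from pile 0
print(key)                              # "A34501"
print(apply_action((3, 4, 5), 0, 1))    # (2, 4, 5) -- now it's B's turn
```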

Under this formulation, entering state "A000" gives a reward of +1000: player B removed the last object, so B loses and player A wins. Entering state "B000" means player A removed the last object, so A loses and B wins, giving a reward of -1000.
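As a sanity check on that convention, here is a small sketch of the reward computation (again, just one possible way to code it):

```python
# A small sketch of the reward convention described above (one possible way
# to code it, not required): the reward attaches to the move that empties
# the last pile.

def reward_for(next_player, next_piles):
    """Reward for the move that produced `next_piles` with `next_player` to move."""
    if any(next_piles):
        return 0                                    # game not over yet
    return 1000 if next_player == "A" else -1000    # "A000" -> +1000, "B000" -> -1000

print(reward_for("A", (0, 0, 0)))   # 1000: B took the last object, so A wins
print(reward_for("B", (0, 0, 0)))   # -1000: A took the last object, so B wins
print(reward_for("B", (2, 4, 5)))   # 0: the game continues
```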

Testing your program

Under 3-4-5, the computer should always win if it goes first. Similarly for 0-1-2, 1-2-4 and 2-3-4. Under 2-4-6, the computer should always win if it goes second. Similarly for 1-2-3.

Sample output

If you want to make sure your Q-learning is doing the right thing, here are the values your Q-tables should converge to in the long run (after playing, say, 100,000 games). Playing 100,000 games really shouldn't take more than a minute, if that; on my 4+-year-old laptop, playing 100,000 games from an initial start state of 3-4-5 takes about 25 seconds.

For the 0-1-2 game (we did this by hand in class on 11/29), the values your program should converge to are:

Q[A001, 21] = -1000.0
Q[A010, 11] = -1000.0
Q[A012, 11] = -810.0
Q[A012, 21] = -810.0
Q[A012, 22] = 900.0
Q[B002, 21] = -900.0
Q[B002, 22] = 1000.0
Q[B010, 11] = 1000.0
Q[B011, 11] = -900.0
Q[B011, 21] = -900.0

For the 1-2-3 game, the values your program should converge to are:
Q[A001, 21] = -1000.0
Q[A002, 21] = 900.0
Q[A002, 22] = -1000.0
Q[A003, 21] = -810.0
Q[A003, 22] = 900.0
Q[A003, 23] = -1000.0
Q[A010, 11] = -1000.0
Q[A011, 11] = 900.0
Q[A011, 21] = 900.0
Q[A013, 11] = -810.0
Q[A013, 21] = -810.0
Q[A013, 22] = -810.0
Q[A013, 23] = 900.0
Q[A020, 11] = 900.0
Q[A020, 12] = -1000.0
Q[A021, 11] = -810.0
Q[A021, 12] = 900.0
Q[A021, 21] = -810.0
Q[A022, 11] = -810.0
Q[A022, 12] = -810.0
Q[A022, 21] = -810.0
Q[A022, 22] = -810.0
Q[A100, 01] = -1000.0
Q[A101, 01] = 900.0
Q[A101, 21] = 900.0
Q[A102, 01] = -810.0
Q[A102, 21] = -810.0
Q[A102, 22] = 900.0
Q[A103, 01] = -810.0
Q[A103, 21] = -810.0
Q[A103, 22] = -810.0
Q[A103, 23] = 900.0
Q[A110, 01] = 900.0
Q[A110, 11] = 900.0
Q[A111, 01] = -810.0
Q[A111, 11] = -810.0
Q[A111, 21] = -810.0
Q[A112, 01] = -810.0
Q[A112, 11] = -810.0
Q[A112, 21] = 729.0
Q[A112, 22] = -810.0
Q[A120, 01] = -810.0
Q[A120, 11] = -810.0
Q[A120, 12] = 900.0
Q[A121, 01] = -810.0
Q[A121, 11] = 729.0
Q[A121, 12] = -810.0
Q[A121, 21] = -810.0
Q[A123, 01] = -656.1
Q[A123, 11] = -656.1
Q[A123, 12] = -810.0
Q[A123, 21] = -656.1
Q[A123, 22] = -656.1
Q[A123, 23] = -810.0
Q[B001, 21] = 1000.0
Q[B002, 21] = -900.0
Q[B002, 22] = 1000.0
Q[B003, 21] = 810.0
Q[B003, 22] = -900.0
Q[B003, 23] = 1000.0
Q[B010, 11] = 1000.0
Q[B011, 11] = -900.0
Q[B011, 21] = -900.0
Q[B012, 11] = 810.0
Q[B012, 21] = 810.0
Q[B012, 22] = -900.0
Q[B020, 11] = -900.0
Q[B020, 12] = 1000.0
Q[B021, 11] = 810.0
Q[B021, 12] = -900.0
Q[B021, 21] = 810.0
Q[B023, 11] = 810.0
Q[B023, 12] = 810.0
Q[B023, 21] = -729.0
Q[B023, 22] = 810.0
Q[B023, 23] = 810.0
Q[B100, 01] = 1000.0
Q[B101, 01] = -900.0
Q[B101, 21] = -900.0
Q[B102, 01] = 810.0
Q[B102, 21] = 810.0
Q[B102, 22] = -900.0
Q[B103, 01] = 810.0
Q[B103, 21] = 810.0
Q[B103, 22] = 810.0
Q[B103, 23] = -900.0
Q[B110, 01] = -900.0
Q[B110, 11] = -900.0
Q[B111, 01] = 810.0
Q[B111, 11] = 810.0
Q[B111, 21] = 810.0
Q[B113, 01] = 810.0
Q[B113, 11] = 810.0
Q[B113, 21] = 656.1
Q[B113, 22] = -729.0
Q[B113, 23] = 810.0
Q[B120, 01] = 810.0
Q[B120, 11] = 810.0
Q[B120, 12] = -900.0
Q[B121, 01] = 810.0
Q[B121, 11] = -729.0
Q[B121, 12] = 810.0
Q[B121, 21] = 810.0
Q[B122, 01] = -729.0
Q[B122, 11] = 656.1
Q[B122, 12] = 810.0
Q[B122, 21] = 656.1
Q[B122, 22] = 810.0

Notice how for the start state, A123, the first player (player A) has no actions with positive Q-values. This means that whoever goes first will lose, assuming the second player plays perfectly. (Hence, the computer should win if it is allowed to go second.)