Your program will allow the user to specify the starting configuration (how many objects are in each of the three piles), then use Q-learning to figure out the optimal strategy for both players simultaneously. After the Q-learning finishes, the program should let the user play as many games of Nim against the computer as they want, choosing each time whether to go first or let the computer go first. Certain starting configurations have a guaranteed winning strategy (e.g., with piles of 3-4-5, the first player can force a win with perfect play), and the computer should always win in those cases (assuming the Q-learning simulated enough games to learn the Q-values well enough).
In our environment, we call the first player "A" and the second player "B." There are only two rewards in the whole game: any move that causes Player A to win earns a reward of 1000, and any move that causes Player B to win earns a reward of -1000.
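As a minimal sketch of how this reward scheme might look in code (Python here for illustration; the function name is hypothetical, and it assumes the normal-play convention that whoever removes the last object wins):

```python
def reward_for_move(piles_after_move, mover):
    """Reward for the move just made by `mover` ("A" or "B").

    Under the normal-play convention, taking the last object wins, so a move
    that empties every pile ends the game in the mover's favor.
    """
    if sum(piles_after_move) == 0:
        return 1000 if mover == "A" else -1000
    return 0          # every other move has zero reward
```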
Each action that Player A and Player B take in the simulated games should be chosen randomly. (In the real world, the actions are often chosen according to a policy that balances exploration and exploitation, but for simplicity, we will always explore.) Recall that we need two update equations:
When Player A moves from state $s$: $Q[s, a] \gets Q[s, a] + \alpha \left[ r + \gamma \min_{a'} Q[s', a'] - Q[s, a] \right]$
When Player B moves from state $s$: $Q[s, a] \gets Q[s, a] + \alpha \left[ r + \gamma \max_{a'} Q[s', a'] - Q[s, a] \right]$
Note that we are still maximizing or minimizing over the action $a'$ taken from state $s'$; only the action $a$ from state $s$ is chosen at random. (If we pick $a'$ randomly instead of maximizing or minimizing, we actually get a different reinforcement learning algorithm called SARSA.)
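For concreteness, here is a minimal Python sketch of the two updates; the hyperparameter values, the `(state, action)` tuple keys, and the helper name are illustrative choices rather than requirements (the handout's string encoding is described below):

```python
ALPHA, GAMMA = 0.1, 0.9      # illustrative learning rate and discount factor
Q = {}                       # maps (state, action) pairs to Q-values, defaulting to 0.0

def q_update(state, action, reward, next_state, next_actions, next_player):
    """One update for the two-player game.

    `next_player` is whoever moves from `next_state`, and `next_actions` lists
    the legal actions there (empty if the game just ended). After Player A
    moves it is B's turn, so A's update backs up the minimum over B's replies;
    after Player B moves it backs up the maximum over A's replies.
    """
    if next_actions:
        future = [Q.get((next_state, a2), 0.0) for a2 in next_actions]
        best_next = min(future) if next_player == "B" else max(future)
    else:
        best_next = 0.0      # terminal state: nothing to look ahead to
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
```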
You can use the printed table of Q-values to debug your program. One idea you can use is that after the values have converged to the optimal Q-values, if there is a guaranteed winning strategy for Player A in some state, then the printed table should show at least one state-action combination with a positive Q-value for that state. If there is no such strategy (i.e., Player A always loses if Player B plays perfectly), all of the Q-values will be negative for the state in question.
For Player A: $\pi(s) = \arg \max_{a} Q[s, a]$
For Player B: $\pi(s) = \arg \min_{a} Q[s, a]$
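A small Python sketch of both the policy extraction and the debugging check described above, again using illustrative `(state, action)` keys and hypothetical helper names:

```python
def greedy_action(Q, state, actions, player):
    """pi(s): Player A takes the arg max over legal actions, Player B the arg min."""
    values = {a: Q.get((state, a), 0.0) for a in actions}
    return max(values, key=values.get) if player == "A" else min(values, key=values.get)

def a_can_force_win(Q, state, actions):
    """Debugging check: once the values have converged, Player A has a
    guaranteed win from `state` exactly when some action has a positive Q-value."""
    return any(Q.get((state, a), 0.0) > 0 for a in actions)
```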
You can store the table of Q-values using a map or dictionary or hash table. To do this, you will need to store each state-action pair as a single item that can be mapped to the Q-value for that pair. This can be done in many ways, but one straightforward idea is to use a 6-character string for a state-action pair, as follows.

The first character is either "A" or "B" for the player whose turn it is. The next three characters are the numbers of objects left in each pile (we'll assume there will never be more than nine objects in a pile). The last two characters represent the action taken: the number of the pile (0, 1, or 2) followed by the number of objects to remove from that pile.
For example, if we start with piles of 3, 4, and 5 objects, and the first player (A) chooses to remove 1 object from pile zero (the first pile), that would be the state-action combination "A34501." Our board now has 2, 4, and 5 objects left, and now it's B's turn. If B decides to remove 2 objects from the second pile (pile 1), this would be represented by the string "B24512."
This representation of state-action pairs allows us to store the table of Q-values easily as a mapping from strings to doubles.
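A sketch of this encoding in Python (the helper name is hypothetical; the format follows the 6-character scheme above):

```python
def encode(player, piles, pile_index, count):
    """Build a key such as "A34501": player, the three pile sizes, the pile
    chosen, and the number of objects removed from it."""
    return f"{player}{piles[0]}{piles[1]}{piles[2]}{pile_index}{count}"

Q = {}                                    # maps encoded strings to Q-values (doubles)
Q[encode("A", (3, 4, 5), 0, 1)] = 0.0     # the "A34501" example from the text
Q[encode("B", (2, 4, 5), 1, 2)] = 0.0     # the "B24512" example from the text
```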
An alternate representation is a two-level map (a map of maps), where the first level maps states (like "A345") to a map containing actions and their corresponding Q-values. This makes the bookkeeping a little more difficult, but it makes it easier to find, for example, what actions are available from a given state.
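The two-level version might be sketched like this (again illustrative, not required):

```python
from collections import defaultdict

Q2 = defaultdict(dict)            # outer key: state (e.g. "A345"); inner map: action -> Q-value
Q2["A345"]["01"] = 0.0            # Player A removes 1 object from pile 0
Q2["A345"]["21"] = 0.0            # Player A removes 1 object from pile 2

# The extra level makes it easy to see which actions are recorded for a state:
available = list(Q2["A345"].keys())   # -> ["01", "21"]
```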
The 0-1-2 game (We did this one by hand in class.)
In this game, Player A is guaranteed to win if they play perfectly. Notice that for the state-action pairs for the opening move (A012xx), there is only one positive Q-value. This indicates that Player A must make that specific move to guarantee a win; otherwise, it opens the door for Player B to possibly win.

The 1-2-3 game
Player B is guaranteed to win here (with perfect play), no matter what Player A does on the opening move. You can deduce this because all of the opening moves for Player A (A123xx) have negative Q-values. This is why the computer will always win if allowed to go second with this board configuration.
You should make sure your program still works for larger board sizes as specified under "testing your program" above.