Your program will allow the user to specify the starting configuration of how many objects are in each of the three piles, then use Q-learning to figure out the optimal strategy for both players simultaneously. After the Q-learning finishes, the program should allow the user to play as many games of Nim against the computer as they want, with the user choosing whether to go first or let the computer go first. Certain starting configurations have a guaranteed winning strategy (e.g., with piles of 3-4-5 the first player is guaranteed to win with perfect play), and the computer should always win in those cases (assuming the Q-learning phase simulated enough games to learn that strategy).
While doing Q-learning, assume the first player is "A" and the second player is "B." There are only two rewards in the whole game: any move that causes player A to win earns a reward of 1000, and any move that causes player B to win earns a reward of -1000. The program should play the desired number of simulated games and learn the state-action function Q using the parameters the user supplied. After all of the simulated games are done, print out the Q-function for the starting state and all actions from that start state. If there is a guaranteed winning strategy for the first player, the printout should include at least one state-action combination with a positive Q-value; if there is no such strategy (i.e., the first player always loses when the second player plays perfectly), all of the Q-values will be negative.
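For reference, here is a minimal sketch of the per-move update, assuming the user-supplied parameters include a learning rate α and a discount factor γ (the converged values listed later in this writeup are consistent with γ = 0.9):

    Q(s, a) ← Q(s, a) + α · [ r + γ · V(s') − Q(s, a) ]

Here r is the reward received on the move (±1000 on a game-ending move, 0 otherwise) and V(s') is the value of the best action available from the successor state s'. Because all rewards are expressed from player A's point of view, the listed values are consistent with taking V(s') to be the largest Q-value at s' when A moves next and the smallest when B moves next.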
Here's an idea for how to represent a state-action pair really easily. Use a three-digit number for the state: one digit for the number of objects in each of the three piles (we'll assume each pile always holds fewer than 10). To represent the action, append two more digits: the number of the pile (0, 1, or 2) followed by the number of objects to remove from that pile. Use two arrays, QA for A's Q-values and QB for B's Q-values.
For example, if we start with 3-4-5 objects, this would be state "345." If we choose to remove 1 object from pile zero (the first pile), that would be the state-action combination "34501." Therefore, we would perform a Q update on QA[34501]. Our next state is "245", and now it's B's turn. If B decides to remove 2 objects from the second pile (pile 1), this would be state-action "24512," and we would do a Q update on QB[24512].
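For instance, a small helper along these lines could compute the index (a sketch only; the method names encodeState and stateActionIndex are just illustrative):

    // Sketch of the numeric encoding described above (helper names are illustrative).
    // piles[0..2] hold the three pile sizes (each assumed to be less than 10).
    static int encodeState(int[] piles) {
        return piles[0] * 100 + piles[1] * 10 + piles[2];
    }

    // Append the two action digits: the pile number (0-2), then the number removed.
    static int stateActionIndex(int[] piles, int pile, int count) {
        return encodeState(piles) * 100 + pile * 10 + count;
    }

    // Example: piles {3, 4, 5} with the action "remove 1 object from pile 0"
    // gives stateActionIndex = 34501; the resulting state is 245.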
This formulation of states and state-action pairs lets us store the table of Q-values easily as two arrays, each containing 100,000 doubles (one for A's values, one for B's). However, if you're comfortable with maps, it probably makes more sense to use a single map from strings to doubles, where a state-action pair is a string like "A34501" indicating it's player A's turn, the board state is 3-4-5, and the action is removing 1 object from pile 0.
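If you go the map route, a minimal sketch in Java, assuming a HashMap<String, Double> and an illustrative key helper, might look like:

    import java.util.HashMap;
    import java.util.Map;

    class QTable {
        // One table for both players, keyed by strings like "A34501".
        Map<String, Double> q = new HashMap<>();

        // Build a key from the player ('A' or 'B'), the pile sizes, and the action.
        static String key(char player, int[] piles, int pile, int count) {
            return "" + player + piles[0] + piles[1] + piles[2] + pile + count;
        }

        // Unseen state-action pairs default to a Q-value of 0.
        double get(String k) {
            return q.getOrDefault(k, 0.0);
        }
    }

    // Example: QTable.key('A', new int[]{3, 4, 5}, 0, 1) returns "A34501".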
Under this formulation, upon entering state "A000" there is a reward of +1000 because player B removed the last object, so they lose (and player A wins). Entering state "B000" means player A removed the last object, so player A loses (and player B wins), giving a reward of -1000.
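Putting these pieces together, here is a minimal sketch of a single move's update using the map representation and the key helper from the sketch above. The names reward, bestValue, and qUpdate are illustrative; alpha and gamma stand for the user-supplied learning rate and discount factor; and the backup shown (A choosing the largest successor value, B the smallest, with everything kept on A's reward scale) is one formulation that is consistent with the converged values listed below.

    // Reward for *entering* a state: +1000 if the piles are empty on A's turn
    // (B removed the last object, so B loses), -1000 if they are empty on B's turn.
    static double reward(char playerToMove, int[] piles) {
        if (piles[0] + piles[1] + piles[2] == 0) {
            return playerToMove == 'A' ? 1000.0 : -1000.0;
        }
        return 0.0;
    }

    // Value of a state for the player about to move: A picks the largest Q-value,
    // B the smallest (all values are on A's scale). A terminal state has value 0.
    static double bestValue(Map<String, Double> q, char player, int[] piles) {
        Double best = null;
        for (int pile = 0; pile < 3; pile++) {
            for (int count = 1; count <= piles[pile]; count++) {
                double v = q.getOrDefault(QTable.key(player, piles, pile, count), 0.0);
                if (best == null) {
                    best = v;
                } else {
                    best = (player == 'A') ? Math.max(best, v) : Math.min(best, v);
                }
            }
        }
        return best == null ? 0.0 : best;
    }

    // One Q-learning backup for the move `player` just made from `piles`,
    // removing `count` objects from pile `pile` and producing `nextPiles`.
    static void qUpdate(Map<String, Double> q, char player, int[] piles,
                        int pile, int count, int[] nextPiles,
                        double alpha, double gamma) {
        char next = (player == 'A') ? 'B' : 'A';              // the other player moves next
        String sa = QTable.key(player, piles, pile, count);   // e.g. "A34501"
        double old = q.getOrDefault(sa, 0.0);
        double target = reward(next, nextPiles) + gamma * bestValue(q, next, nextPiles);
        q.put(sa, old + alpha * (target - old));
    }

During each simulated game you would call qUpdate once per move, for both players' moves, using whatever exploration scheme your user-supplied parameters call for.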
For the 0-1-2 game (we did this by hand in class on 11/29), the values your program should converge to are:
Q[A001, 21] = -1000.0
Q[A010, 11] = -1000.0
Q[A012, 11] = -810.0
Q[A012, 21] = -810.0
Q[A012, 22] = 900.0
Q[B002, 21] = -900.0
Q[B002, 22] = 1000.0
Q[B010, 11] = 1000.0
Q[B011, 11] = -900.0
Q[B011, 21] = -900.0

For the 1-2-3 game, the values your program should converge to are:
Q[A001, 21] = -1000.0
Q[A002, 21] = 900.0
Q[A002, 22] = -1000.0
Q[A003, 21] = -810.0
Q[A003, 22] = 900.0
Q[A003, 23] = -1000.0
Q[A010, 11] = -1000.0
Q[A011, 11] = 900.0
Q[A011, 21] = 900.0
Q[A013, 11] = -810.0
Q[A013, 21] = -810.0
Q[A013, 22] = -810.0
Q[A013, 23] = 900.0
Q[A020, 11] = 900.0
Q[A020, 12] = -1000.0
Q[A021, 11] = -810.0
Q[A021, 12] = 900.0
Q[A021, 21] = -810.0
Q[A022, 11] = -810.0
Q[A022, 12] = -810.0
Q[A022, 21] = -810.0
Q[A022, 22] = -810.0
Q[A100, 01] = -1000.0
Q[A101, 01] = 900.0
Q[A101, 21] = 900.0
Q[A102, 01] = -810.0
Q[A102, 21] = -810.0
Q[A102, 22] = 900.0
Q[A103, 01] = -810.0
Q[A103, 21] = -810.0
Q[A103, 22] = -810.0
Q[A103, 23] = 900.0
Q[A110, 01] = 900.0
Q[A110, 11] = 900.0
Q[A111, 01] = -810.0
Q[A111, 11] = -810.0
Q[A111, 21] = -810.0
Q[A112, 01] = -810.0
Q[A112, 11] = -810.0
Q[A112, 21] = 729.0
Q[A112, 22] = -810.0
Q[A120, 01] = -810.0
Q[A120, 11] = -810.0
Q[A120, 12] = 900.0
Q[A121, 01] = -810.0
Q[A121, 11] = 729.0
Q[A121, 12] = -810.0
Q[A121, 21] = -810.0
Q[A123, 01] = -656.1
Q[A123, 11] = -656.1
Q[A123, 12] = -810.0
Q[A123, 21] = -656.1
Q[A123, 22] = -656.1
Q[A123, 23] = -810.0
Q[B001, 21] = 1000.0
Q[B002, 21] = -900.0
Q[B002, 22] = 1000.0
Q[B003, 21] = 810.0
Q[B003, 22] = -900.0
Q[B003, 23] = 1000.0
Q[B010, 11] = 1000.0
Q[B011, 11] = -900.0
Q[B011, 21] = -900.0
Q[B012, 11] = 810.0
Q[B012, 21] = 810.0
Q[B012, 22] = -900.0
Q[B020, 11] = -900.0
Q[B020, 12] = 1000.0
Q[B021, 11] = 810.0
Q[B021, 12] = -900.0
Q[B021, 21] = 810.0
Q[B023, 11] = 810.0
Q[B023, 12] = 810.0
Q[B023, 21] = -729.0
Q[B023, 22] = 810.0
Q[B023, 23] = 810.0
Q[B100, 01] = 1000.0
Q[B101, 01] = -900.0
Q[B101, 21] = -900.0
Q[B102, 01] = 810.0
Q[B102, 21] = 810.0
Q[B102, 22] = -900.0
Q[B103, 01] = 810.0
Q[B103, 21] = 810.0
Q[B103, 22] = 810.0
Q[B103, 23] = -900.0
Q[B110, 01] = -900.0
Q[B110, 11] = -900.0
Q[B111, 01] = 810.0
Q[B111, 11] = 810.0
Q[B111, 21] = 810.0
Q[B113, 01] = 810.0
Q[B113, 11] = 810.0
Q[B113, 21] = 656.1
Q[B113, 22] = -729.0
Q[B113, 23] = 810.0
Q[B120, 01] = 810.0
Q[B120, 11] = 810.0
Q[B120, 12] = -900.0
Q[B121, 01] = 810.0
Q[B121, 11] = -729.0
Q[B121, 12] = 810.0
Q[B121, 21] = 810.0
Q[B122, 01] = -729.0
Q[B122, 11] = 656.1
Q[B122, 12] = 810.0
Q[B122, 21] = 656.1
Q[B122, 22] = 810.0

Notice how for the start state, A123, the first player (player A) has no Q-values with positive expected rewards. This means that whoever goes first will lose, assuming player B plays perfectly. (Hence, the computer should win if allowed to go second.)