About this project
This is a (not so) simple implementation of the tic-tac-toe game. Try it yourself by playing against an AI trained with variations of the Q-Learning and Q-Iteration algorithms. You can also watch the AI play against itself. If you want more details about the algorithms used, keep reading the sections after the game.
Play the game
Choose your opponent: play against an agent trained with either Q-Iteration or Q-Learning. Click the "Agent move" button to make the AI play whenever you want. You can also play yourself by clicking on the squares. Observe the agent's Q-function as a heatmap directly on the grid: green squares have the highest Q-value (1), red squares the lowest (-1), and yellow squares have a Q-value of 0.
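The exact colour scale isn't described beyond these three anchor values; one simple way to produce such a heatmap is a linear interpolation between red, yellow and green, as in the hypothetical sketch below (not necessarily what the game does).

```python
def q_to_rgb(q):
    """Map a Q-value in [-1, 1] to an RGB colour:
    -1 -> red (255, 0, 0), 0 -> yellow (255, 255, 0), 1 -> green (0, 255, 0).
    This mapping is an assumption, not the game's actual colour scale."""
    q = max(-1.0, min(1.0, q))               # clamp to the valid range
    if q < 0:
        return (255, int(255 * (q + 1)), 0)  # red -> yellow for q in [-1, 0]
    return (int(255 * (1 - q)), 255, 0)      # yellow -> green for q in [0, 1]
```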
Q-Iteration
The agent was trained with a modified version of Q-Iteration (see the section below for details), playing 50 games against a random policy and then 200 games against its own policy.
Implementation: Q-Iteration
Let \(Q^\star\) be the optimal Q-function of a Markov Decision Process. The Bellman equation in a deterministic environment is given by: \[Q^\star(s,a) = r + \gamma \max_{a'} Q^\star(s',a'),\] where \(s'\) is the state reached after taking action \(a\) in state \(s\), \(\gamma\) is the discount factor, and \(r\) is the immediate reward for taking action \(a\) in state \(s\). Here, the environment is stochastic: when the agent plays a move, it doesn't know what the opponent's reply will be. Denote by \(p(s'|s,a)\) the probability of reaching state \(s'\) after taking action \(a\) in state \(s\). The Bellman equation becomes: \[Q^\star(s,a) = \sum_{s'} p(s'|s,a) \left[r + \gamma \max_{a'} Q^\star(s',a')\right].\] To avoid having to compute the transition probabilities, we used the following update rule for all states \(s\) and actions \(a\), with \(\gamma < 1\) (a code sketch follows the list):
- If \(Q(s,a) = -1\), meaning that the agent has previously lost immediately after taking action \(a\) in state \(s\), don't update \(Q(s,a)\). This can happen when training against a random or otherwise suboptimal policy.
- Otherwise, update \(Q(s,a)\) using the following rule: \[Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a').\]
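As a rough illustration, one sweep of this modified update could look like the sketch below. The environment hooks (`states`, `legal_actions`, `step`) are hypothetical placeholders rather than the project's actual API, and the terminal-state handling (target = immediate reward) is also an assumption; only the update rule itself comes from the description above.

```python
def q_iteration_sweep(Q, states, legal_actions, step, gamma=0.9):
    """One sweep of the modified Q-Iteration update over every state-action pair.

    Q             -- dict mapping (state, action) to a value, defaulting to 0
    states        -- iterable of all non-terminal states
    legal_actions -- function: state -> list of legal actions
    step          -- function: (state, action) -> (reward, next_state, done);
                     assumed to also play the opponent's reply, which is what
                     makes the transition look stochastic to the agent
    """
    for s in states:
        for a in legal_actions(s):
            if Q.get((s, a), 0.0) == -1:
                # the agent has lost right after (s, a) before: keep the -1
                continue
            r, s_next, done = step(s, a)
            if done:
                # no further moves: the target is just the immediate reward
                Q[(s, a)] = r
            else:
                Q[(s, a)] = r + gamma * max(
                    Q.get((s_next, a2), 0.0) for a2 in legal_actions(s_next))
```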
We first ran 50 iterations of this algorithm with a random policy as the agent's opponent. We then ran 200 more iterations with the agent playing against its own policy. The results are surprisingly good, given that the update rule doesn't take the stochastic nature of the environment into account.
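The two training phases could then be chained roughly as follows; `Q`, `states`, `legal_actions`, `step_vs`, `random_opponent` and `greedy_opponent` are hypothetical helpers standing in for however the opponent policy is actually plugged into the environment.

```python
Q = {}  # Q-values, implicitly 0 for unseen (state, action) pairs

# Phase 1: 50 sweeps against a random opponent
for _ in range(50):
    q_iteration_sweep(Q, states, legal_actions,
                      step=lambda s, a: step_vs(s, a, random_opponent))

# Phase 2: 200 sweeps of self-play, the opponent replying greedily w.r.t. Q
for _ in range(200):
    q_iteration_sweep(Q, states, legal_actions,
                      step=lambda s, a: step_vs(s, a, greedy_opponent(Q)))
```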
Implementation: Q-Learning
With an \(\varepsilon\)-greedy policy, the update rule for Q-Learning is given by: \[Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a')\right],\] where \(\alpha\) is the learning rate, and \(s'\) is the state reached after taking action \(a\) in state \(s\). An iteration of this algorithm is defined as follows (a code sketch appears after the list):
- Choose a random state \(s\)
- Choose an action \(a\) according to an \(\varepsilon\)-greedy policy:
  - with probability \(\varepsilon\), choose an action at random
  - with probability \(1-\varepsilon\), choose the action that maximizes \(Q(s,a)\)
- Observe the reward \(r\) and the next state \(s'\)
- Update the Q-function with the rule given above
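As a sketch, one such iteration could be implemented as below. The hooks (`random_state`, `legal_actions`, `step`) are again hypothetical placeholders for the actual environment; only the \(\varepsilon\)-greedy choice and the update rule come from the description above.

```python
import random

def q_learning_update(Q, s, legal_actions, step,
                      alpha=0.2, gamma=0.9, epsilon=0.1):
    """Epsilon-greedy action choice in state s followed by the Q-Learning update."""
    actions = legal_actions(s)
    if random.random() < epsilon:
        a = random.choice(actions)                               # explore
    else:
        a = max(actions, key=lambda act: Q.get((s, act), 0.0))   # exploit
    r, s_next, done = step(s, a)
    target = r if done else r + gamma * max(
        Q.get((s_next, a2), 0.0) for a2 in legal_actions(s_next))
    # blend the old estimate with the new target
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target

def q_learning_iteration(Q, random_state, legal_actions, step, **hp):
    """One iteration as described above: pick a random state, then update it."""
    q_learning_update(Q, random_state(), legal_actions, step, **hp)
```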
All Q-Learning agents were trained against a random policy: this explains why they occasionally make mistakes even though the number of iterations is quite high. Q-Learning-1 and Q-Learning-2 were trained for 1,000,000 and 5,000,000 iterations, respectively. For Q-Learning-3 we modified the algorithm to visit every state at each iteration, and ran 250 iterations of this variant. The hyperparameters chosen for the three algorithms are the following: \[ \alpha=0.2,\quad \gamma=0.9,\quad \varepsilon=0.1.\] Finally, to compare Q-Learning with our design of Q-Iteration, we trained an agent 50 times against a random policy (visiting every state at each iteration) and then 200 times against its own policy. The hyperparameters were left unchanged.
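With the sketch above, the three Q-Learning agents could be trained roughly as follows. This is hypothetical driver code: `Q`, `random_state`, `all_states`, `legal_actions` and `step` stand in for the actual environment and are not part of the project's API.

```python
HP = dict(alpha=0.2, gamma=0.9, epsilon=0.1)

# Q-Learning-1 and Q-Learning-2: single random-state iterations vs. a random opponent
for _ in range(1_000_000):      # 5_000_000 for Q-Learning-2
    q_learning_iteration(Q, random_state, legal_actions, step, **HP)

# Q-Learning-3: 250 iterations, each one updating every state once
for _ in range(250):
    for s in all_states:
        q_learning_update(Q, s, legal_actions, step, **HP)
```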