About this project

This is a (not so) simple implementation of the tic-tac-toe game. Try it yourself by playing against an AI trained with variations of the Q-Learning and Q-Iteration algorithms. You can also watch the AI play against itself. If you want more details about the algorithms used, keep reading the sections after the game.

Play the game

Choose your opponent: play against an agent trained with either Q-Iteration or Q-Learning. Click the "Agent move" button to make the AI play whenever you want. You can also play yourself by clicking on the squares. Observe the agent's Q-function as a heatmap directly on the grid: green squares have the highest Q-value (1), red squares the lowest (-1), and yellow squares a Q-value of 0.
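To give an idea of what the heatmap shows, here is a minimal sketch (written in Python rather than the page's JavaScript, and not the project's actual colormap code) of how a Q-value in \([-1, 1]\) could be mapped to that red-yellow-green scale with simple linear interpolation:

    def q_value_to_rgb(q):
        """Map a Q-value in [-1, 1] to an RGB triple on a red-yellow-green scale.

        Illustrative only: -1 maps to red, 0 to yellow, +1 to green, with linear
        interpolation in between; the colormap used on the page may differ.
        """
        q = max(-1.0, min(1.0, q))  # clamp to the valid range
        if q < 0:
            # from red (255, 0, 0) at q = -1 to yellow (255, 255, 0) at q = 0
            return (255, int(255 * (1 + q)), 0)
        # from yellow (255, 255, 0) at q = 0 to green (0, 255, 0) at q = 1
        return (int(255 * (1 - q)), 255, 0)

    print(q_value_to_rgb(-1.0))  # (255, 0, 0): a losing move
    print(q_value_to_rgb(0.0))   # (255, 255, 0): a neutral move
    print(q_value_to_rgb(1.0))   # (0, 255, 0): a winning move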

Implementation: Q-Iteration

Let \(Q^\star\) be an optimal Q-function of a Markov Decision Process. The Bellman equation in a deterministic environment is given by: \[Q^\star(s,a) = r + \gamma \max_{a'} Q^\star(s',a'),\] where \(s'\) is the state reached after taking action \(a\) in state \(s\), \(\gamma\) is the discount factor, and \(r\) is the immediate reward for taking action \(a\) in state \(s\). Here, the environment is stochastic: when the agent plays a move, it does not know what the opponent's reply will be. Denote by \(p(s'|s,a)\) the probability of reaching state \(s'\) after taking action \(a\) in state \(s\). The Bellman equation becomes: \[Q^\star(s,a) = \sum_{s'} p(s'|s,a) \left[r + \gamma \max_{a'} Q^\star(s',a')\right].\] To avoid having to compute the transition probabilities, we used the following update rule for all states \(s\) and actions \(a\) (where \(\gamma < 1\)):

  • If \(Q(s,a) = -1\), meaning that the agent lost immediately after taking action \(a\) in state \(s\) in the past, don't update \(Q(s,a)\). This will happen when training against a random/suboptimal policy.
  • Otherwise, update \(Q(s,a)\) using the following rule: \[Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a').\]
The reward is set to 1 if the agent wins, -1 if it loses, and 0 otherwise. The hyperparameter \(\gamma\) was set to 1.
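As a concrete illustration, here is a minimal sketch of this update for a single state-action pair. It assumes the Q-function is stored in a Python dictionary keyed by (state, action) pairs and relies on two hypothetical helpers, legal_actions(s) and step(s, a), which are not part of the project's code: step(s, a) plays the agent's move, lets the opponent reply, and returns the reward, the resulting state, and a flag telling whether the game is over.

    GAMMA = 1.0  # discount factor used for Q-Iteration in this project

    def q_iteration_update(Q, s, a, legal_actions, step):
        """One application of the update rule above to the pair (s, a).

        Q is a dict mapping (state, action) to a value (0 by default).
        legal_actions and step are assumed helpers, not the project's API.
        """
        if Q.get((s, a), 0.0) == -1:
            return  # the agent already lost right after (s, a): keep Q(s, a) = -1
        r, s_next, done = step(s, a)
        if done:
            Q[(s, a)] = r  # no future moves: the max over a' is taken to be 0
            return
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in legal_actions(s_next))
        Q[(s, a)] = r + GAMMA * best_next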

We first ran 50 iterations of this algorithm with a random policy for the agent's opponent. We then ran it 200 more times using the agent's own policy as the opponent. The results are surprisingly good, given that the update rule doesn't take the stochastic nature of the environment into account.
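Schematically, this training schedule could look like the sketch below, which reuses q_iteration_update from the previous sketch; all_state_actions(), make_step(opponent), random_opponent and greedy_opponent(Q) are again placeholder names, not the project's actual code (make_step is meant to build a step(s, a) function in which the given opponent answers each move).

    def run_sweeps(Q, n_sweeps, step, all_state_actions, legal_actions):
        """Apply the Q-Iteration update to every (state, action) pair, n_sweeps times."""
        for _ in range(n_sweeps):
            for s, a in all_state_actions():
                q_iteration_update(Q, s, a, legal_actions, step)
        return Q

    # Schedule described above (calls shown schematically):
    #   run_sweeps(Q, 50,  make_step(random_opponent), ...)     # vs a random opponent
    #   run_sweeps(Q, 200, make_step(greedy_opponent(Q)), ...)  # vs the agent's own policy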

Implementation: Q-Learning

With an \(\varepsilon\)-greedy policy, the update rule for Q-Learning is given by: \[Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a')\right],\] where \(\alpha\) is the learning rate, and \(s'\) is the state reached after taking action \(a\) in state \(s\). An iteration of this algorithm is defined as follows:

  • Choose a random state \(s\)
  • Choose an action \(a\) according to an \(\varepsilon\)-greedy policy:
    • with probability \(\varepsilon\) choose an action randomly
    • with probability \(1-\varepsilon\) choose the action that maximizes \(Q(s,a)\)
  • Observe the reward \(r\) and the next state \(s'\)
  • Update the Q-function with the rule given above
These steps are then repeated until the agent reaches a terminal state (i.e. until the game is over). The reward is set to 1 if the agent wins, -1 if it loses, and 0 otherwise.
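One way to read this loop is sketched below: the random state serves as the starting point of an episode, and the agent then follows the game until a terminal state, updating Q at each step. The dictionary representation of Q and the helpers random_start_state(), legal_actions(s) and step(s, a) are the same illustrative assumptions as before; the hyperparameter values are the ones reported in the next paragraph.

    import random

    ALPHA, GAMMA, EPSILON = 0.2, 0.9, 0.1  # hyperparameters reported below

    def q_learning_episode(Q, random_start_state, legal_actions, step):
        """Run one epsilon-greedy Q-Learning episode, updating Q in place."""
        s = random_start_state()
        done = False
        while not done:
            actions = legal_actions(s)
            if random.random() < EPSILON:
                a = random.choice(actions)  # explore: random action
            else:
                a = max(actions, key=lambda act: Q.get((s, act), 0.0))  # exploit
            r, s_next, done = step(s, a)
            best_next = 0.0 if done else max(
                Q.get((s_next, act), 0.0) for act in legal_actions(s_next))
            Q[(s, a)] = (1 - ALPHA) * Q.get((s, a), 0.0) + ALPHA * (r + GAMMA * best_next)
            s = s_next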

All Q-Learning agents were trained against a random policy: this explains why they occasionally make mistakes even though the number of iterations is quite high. Q-Learning-1 and Q-Learning-2 were trained for 1,000,000 and 5,000,000 iterations, respectively. For Q-Learning-3 we changed the algorithm so as to visit every state at each iteration, then ran 250 iterations of this variant. The hyperparameters chosen for the three algorithms are the following: \[ \alpha=0.2,\quad \gamma=0.9,\quad \varepsilon=0.1.\] Finally, to compare Q-Learning with our design of Q-Iteration, we trained an agent 50 times against a random policy (visiting every state at each iteration) and then 200 times against its own policy. The hyperparameters were left unchanged.
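For completeness, here is how these training regimes could be driven, reusing q_learning_episode from the sketch above; train_sweeps reflects our reading of the Q-Learning-3 variant, and all helper names remain placeholders rather than the project's actual code.

    def train(Q, n_episodes, random_start_state, legal_actions, step):
        """Q-Learning-1 / Q-Learning-2 style training: random-start episodes."""
        for _ in range(n_episodes):  # 1,000,000 or 5,000,000 in the project
            q_learning_episode(Q, random_start_state, legal_actions, step)
        return Q

    def train_sweeps(Q, n_sweeps, all_states, legal_actions, step):
        """Q-Learning-3 style training: each sweep starts one episode from every state."""
        for _ in range(n_sweeps):  # 250 in the project
            for s0 in all_states():
                q_learning_episode(Q, lambda s0=s0: s0, legal_actions, step)
        return Q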

Source

The code for rendering the board with JavaScript was taken from this article and modified to fit our needs. The colormap was borrowed from this GitHub repository. Feel free to check out the Python code used to train the agents here.