About this project
In this project, we trained an agent to play connect four using a DDQN (Double Deep Q-Network). Deep Q-Learning was made famous by DeepMind's paper in which they trained an agent to play Atari games. For connect four, the state space is too large to train our agent with tabular Q-Learning as we did in our tic-tac-toe project, so we use a neural network to approximate the Q-function. The input of the network is the board, and the output is a vector of size 7, each entry corresponding to the Q-value of a possible action (one per column). The network is trained using the Bellman equation as a loss function. The details of the algorithm are given below.
This project involved a simple technology stack (a sketch of the export step follows the list):
- PyTorch to train the neural networks.
- ONNX Runtime to run inference on the networks in the browser.
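For context, exporting a trained PyTorch model so that it can be served by ONNX Runtime in the browser only takes a few lines. This is a generic sketch, not the project's actual export script; the model and the file name are placeholders.

import torch

# Placeholder model: 42 inputs (the flattened 6 x 7 board), 7 outputs (one Q-value per column).
model = torch.nn.Linear(42, 7)
model.eval()

# torch.onnx.export traces the model on a dummy input to build the ONNX graph.
dummy_board = torch.zeros(1, 42)
torch.onnx.export(
    model,
    dummy_board,
    "model.onnx",                  # placeholder file name
    input_names=["board"],
    output_names=["q_values"],
)

The resulting .onnx file can then be loaded in the browser with ONNX Runtime Web and evaluated on the current board after every move.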
Play the game
The colors on the grid indicate the estimated Q-value of each square: green squares have the highest estimated Q-value (1), red squares the lowest (-1), and yellow squares have an estimated Q-value of 0. You can either play against an AI (choose one in the dropdown menu) or let it play against itself. Click the "Agent move" button to have the AI play a move.
Grogu
Grogu's Q-function is estimated with just one matrix product. That's right: a simple matrix product against the grid of the game is sufficient to beat a toddler (see below for more details).
Implementation: DDQN
The update rule for the Q-Learning algorithm is given by: \[Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a')\right],\] where \(\alpha\) is the learning rate and \(s'\) is the state reached after taking action \(a\) in state \(s\). When dealing with a large state space, it is not possible to store all the Q-values in a table as we did for tic-tac-toe. Instead, we train a deep neural network to estimate the Q-function.

To be more precise, we used a DDQN (Double Deep Q-Network): the idea is to use two networks, one to choose the action (\(Q\)) and the other to evaluate the Q-value of that action (\(Q_{target}\)). The network is updated with the Huber loss for Yoda and the MSE for Grogu (both denoted by \(d\) below): \[L = d\left(Q(s,a),\; r + \gamma\, Q_{target}\!\left(s', \operatorname*{argmax}_{a'} Q(s',a')\right)\right).\]

Note that the parameter \(\alpha\) has disappeared: its role is taken over by the optimizer's learning rate and by a soft update of the target network. At the end of each episode (that is, at the end of each game), we blend the weights of the action network into the target network: \[Q_{target} \leftarrow \tau Q + (1-\tau) Q_{target}.\] Finally, to address the issue of non-independent samples, we use a replay buffer of 1,000 transitions and sample 8 of them at each iteration.
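To make the loss and the soft update concrete, here is a minimal PyTorch sketch of how they can be computed on a mini-batch of transitions. It is only an illustration under the assumptions above; the function names, the choice of nn.SmoothL1Loss (Huber) as the default criterion, and the batch layout are not taken from the project's code.

import torch
import torch.nn as nn

def ddqn_loss(q_net, target_net, batch, gamma=0.99, criterion=nn.SmoothL1Loss()):
    # batch: tensors of shape (B, 42) for states/next_states, (B,) for
    # actions (int64), rewards (float) and the done flag (float 0/1).
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions that were actually played.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # The online network chooses the next action...
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ...and the target network evaluates it (the "double" part of DDQN).
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # For terminal states the target is just the immediate reward.
        targets = rewards + gamma * next_q * (1.0 - dones)

    return criterion(q_values, targets)

def soft_update(q_net, target_net, tau=0.01):
    # Q_target <- tau * Q + (1 - tau) * Q_target
    for p, p_target in zip(q_net.parameters(), target_net.parameters()):
        p_target.data.copy_(tau * p.data + (1.0 - tau) * p_target.data)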
An iteration of this algorithm is defined as follows (a code sketch of the training loop is given after the list):
- Reset the game to state \(s\)
- With probability 1/2, start the game or play as player 2
- Choose an action \(a\) according to an \(\varepsilon\)-greedy policy
  - with probability \(\varepsilon\), choose an action at random
  - with probability \(1-\varepsilon\), choose the action that maximizes \(Q_{target}(s,a)\)
- Observe the reward \(r\) and the next state \(s'\)
- Update the Q-function with the rule given above. If \(s'\) is a terminal state (i.e. the game is over), the target is just the immediate reward \(r\).
- Repeat until episode (game) is over
- Update the target network with the action network
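Putting the steps above together, here is a compact Python sketch of one training episode, reusing the ddqn_loss and soft_update helpers from the earlier sketch. The env and replay_buffer objects and their methods are hypothetical placeholders, not the project's actual code.

import random
import torch

def run_episode(env, q_net, target_net, optimizer, replay_buffer, epsilon,
                gamma=0.99, batch_size=8, tau=0.01):
    # Reset the game; with probability 1/2 the agent plays second.
    state = env.reset(agent_starts=random.random() < 0.5)
    done = False
    while not done:
        # Epsilon-greedy action selection on the target network
        # (masking of full columns is omitted for brevity).
        if random.random() < epsilon:
            action = random.choice(env.legal_actions())
        else:
            with torch.no_grad():
                q_values = target_net(torch.as_tensor(state, dtype=torch.float32))
            action = int(q_values.argmax())

        # Play the move and store the transition in the replay buffer.
        next_state, reward, done = env.step(action)
        replay_buffer.push(state, action, reward, next_state, done)
        state = next_state

        # One gradient step on a mini-batch of past transitions.
        if len(replay_buffer) >= batch_size:
            batch = replay_buffer.sample(batch_size)
            loss = ddqn_loss(q_net, target_net, batch, gamma)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # End of the episode: soft update of the target network.
    soft_update(q_net, target_net, tau)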
As you can see below, there are quite a lot of hyperparameters to consider here.
PARAMS = {
"learning_rate": 0.0001,
"gamma": 0.99,
"buffer_size": 1000,
"batch_size": 8,
"epsilon": 1,
"epsilon_decay": 0.9999,
"min_epsilon": 0.1,
"n_episodes": 10000,
"tau": 0.01,
"update_frequency": 1,
"nn": neural_networks.DDQN5
}
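As a quick sanity check on the exploration schedule (assuming \(\varepsilon\) is multiplied by epsilon_decay once per move), the exploration rate only reaches its floor after about \[\frac{\ln(0.1)}{\ln(0.9999)} \approx 23{,}000\] decay steps before \(\varepsilon\) hits min_epsilon = 0.1.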
My hybrid fanless tablet-PC is not really suited to training a neural network, which is why only two models are available to play with here.
Grogu
Training a DDQN to play connect four has been done before, so here we tried something new. When playing connect four, an obvious strategy is to look at all possible alignments of four cells: if one of them already contains 3 pieces of the same kind, then a win (or a loss) could occur. We therefore built the first layer of our network with that in mind (a code sketch of these alignment features is given after the list):
- The grid of the game is filled with 0s at the beginning; player 1's pieces are encoded as 1 and player 2's as -1.
- The first layer is made of:
  - 42 neurons representing the 6 × 7 grid.
  - 69 neurons that are the sums over all possible alignments of 4 cells (24 horizontal, 21 vertical and 24 diagonal).
- The output layer is made of 7 neurons that estimate the Q-value of each action.
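As an illustration, here is a small PyTorch sketch of such a first layer: a fixed (non-trainable) 42 × 69 matrix computes the alignment sums, and a single trainable linear layer maps the resulting 111 features to the 7 Q-values. The class and function names are made up for this sketch and the details may differ from the project's code.

import torch
import torch.nn as nn

ROWS, COLS = 6, 7

def alignment_matrix():
    """Fixed 42 x 69 matrix: each column sums one of the 69 possible
    4-in-a-row alignments (24 horizontal, 21 vertical, 24 diagonal)."""
    windows = []
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS for rr, cc in cells):
                    windows.append(cells)
    m = torch.zeros(ROWS * COLS, len(windows))  # len(windows) == 69
    for j, cells in enumerate(windows):
        for rr, cc in cells:
            m[rr * COLS + cc, j] = 1.0
    return m

class GroguNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Non-trainable alignment features.
        self.register_buffer("align", alignment_matrix())
        # The single trainable matrix product: 42 + 69 = 111 features -> 7 Q-values.
        self.out = nn.Linear(42 + 69, 7)

    def forward(self, board):  # board: (batch, 42) with values in {-1, 0, 1}
        feats = torch.cat([board, board @ self.align], dim=1)
        return self.out(feats)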
Grogu was trained against a dumb opponent: one that looks for an easy win and plays randomly if it can't find one. After 100,000 episodes, Grogu is able to beat this opponent 75% of the time, which is quite impressive given the simplicity of the model.
If you play against Grogu, you will soon notice that the estimated Q-values drop and are actually very bad estimates: this might come from the fact that games against the dumb opponent involve very few moves. Here is a barplot of the estimated distribution of the length of the games between Grogu and his opponent:
Yoda
Playing with Grogu is fun for a while, but I still wanted to try to make an AI that is a little more capable. I therefore trained a DDQN with the following architecture (a code sketch is given after the list):
- First layer of size 42
- Second layer of size 180
- Third layer of size 180
- Fourth layer of size 180
- Output layer of size 7
Yoda was trained against two opponents:
- The dumb opponent that Grogu faced (looks for an easy win and plays randomly if it can't find one).
- A slightly smarter one that will also block a winning move of Yoda's if there is one.
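For reference, here is a minimal PyTorch sketch of Yoda's architecture. The activation function is not specified above, so the ReLU (and the class name) used here is an assumption.

import torch.nn as nn

class YodaNet(nn.Module):
    # Hypothetical name; layer sizes follow the list above (42 -> 180 -> 180 -> 180 -> 7).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(42, 180), nn.ReLU(),
            nn.Linear(180, 180), nn.ReLU(),
            nn.Linear(180, 180), nn.ReLU(),
            nn.Linear(180, 7),
        )

    def forward(self, board):  # board: (batch, 42) flattened grid with values in {-1, 0, 1}
        return self.net(board)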