REINFORCEMENT LEARNING FOR BEGINNERS

Deep Q-Learning with PyTorch and OpenAI Gym: The Taxi-cab puzzle

Guillaume Androz
12 min read · Mar 16, 2021

Reinforcement learning with a deep-learning model to choose the proper action

In a previous post, I tried to explain how we can play a simple game, Taxi-v3 from OpenAI Gym, automatically by applying reinforcement learning basics with the Q-learning algorithm. This approach paved the way for the deep-Q-learning algorithm.

Deep-Q-learning, or more generally deep reinforcement learning, was popularized by DeepMind and its AlphaGo program. In this article, we’ll go through this new approach step by step, and I’ll try to give you a quick view of the challenges it brings, even for very simple environments such as the Taxi-v3 game.

Q-table reminder

Let’s recall what a Q-table is about.

Use of a Q-table

A Q-table is just a table, learnt by first exploring and then exploiting an environment, that maps (state, action) pairs to Q-values. The Q-values are learnt by playing with the environment and are incrementally updated with the Bellman equation.

Bellman equation. How to update Q-values in the Q-table
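
For reference, the update the figure refers to is (presumably) the standard tabular Q-learning rule, where α is the learning rate and γ the discount factor:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```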

Another way to look at a Q-table is to consider it as a general function

General function

where A is the action space and S the state space.
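
Concretely, whatever its implementation, what we need is a mapping of the form

```latex
Q : S \times A \rightarrow \mathbb{R}, \qquad (s, a) \mapsto Q(s, a)
```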

This function can be anything, which is especially true for more complicated environments where the states or actions are continuous values or composite vectors.

Thankfully for us, such general or complicated functions can be approximated with neural networks (NN) or even deep neural networks (DNN).

Deep-Q-learning

So the principle of deep-Q-learning is simple: just replace the Q-table with a deep neural network and it’s done!… well, not so fast. You are right if you think the principle is the same, but the devil is in the details.

Just like the Q-learning algorithm, DQL uses the epsilon-greedy exploration/exploitation strategy to first explore, build up a kind of memory and then exploit the collected information. A DQL approach in fact has three main components:

  • A training network
  • A target network (same as the training network)
  • A large enough replay memory (a minimal sketch follows this list)
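
Here is a minimal sketch of what such a replay memory could look like in Python (the names `ReplayMemory` and `Experience`, and the extra `done` flag marking the end of an episode, are illustrative choices, not necessarily the exact code of the repo):

```python
import random
from collections import deque, namedtuple

# One logged experience; the `done` flag marks the end of an episode.
Experience = namedtuple("Experience", ("state", "action", "reward", "next_state", "done"))

class ReplayMemory:
    """Fixed-size buffer of past experiences, sampled uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted automatically

    def push(self, *args):
        self.buffer.append(Experience(*args))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```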

Network training

You may be familiar with deep-learning model training. You have some data, well-labelled data, and you have your model. You feed the model with your data, compute the loss between the expected values and the model predictions, then back-propagate the loss through the network to update its weights. You iterate like that several times until “convergence” and you’re done.

The problem with reinforcement learning is that you do not have such data. All you have is an environment you can explore and play with, but you have no idea what the rewards look like; you have to discover them by yourself (or rather, your model has to).

So let’s have a look at the different steps to train our agent network.

First Exploration

Just as in the case of Q-learning, we first need to initialize our networks, generally with random values (the same weights are assigned to both networks!). Then comes the exploration phase, where we let the agent randomly navigate the environment, taking actions and getting rewards accordingly.

All the experiences are logged into a memory, a big experience replay memory. This warmup step is just here to bootstrap the training of our network by gathering some useful data.
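
A warmup loop could then look like the following sketch, written against the classic gym API of the time (recent gym/gymnasium versions return extra values from `reset` and `step`); the number of warmup steps and the memory capacity are assumed values:

```python
import gym

env = gym.make("Taxi-v3")
memory = ReplayMemory(capacity=50_000)   # the replay memory sketched above; capacity is an assumption
WARMUP_STEPS = 1_000                     # assumed number of purely random steps

state = env.reset()
for _ in range(WARMUP_STEPS):
    action = env.action_space.sample()                  # random action: pure exploration
    next_state, reward, done, info = env.step(action)   # classic 4-tuple gym API
    memory.push(state, action, reward, next_state, done)
    state = env.reset() if done else next_state
```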

Exploration vs. exploitation: ε-greedy action selection

Once the memory has been fed with some random exploration, we can begin to exploit our training model. Just like with a Q-table approach, we select an action to take for the current state with ε-greedy action selection. It means that for a given state, we choose the action randomly with probability ε, or, with probability 1−ε, we take the action predicted by the training model. As the training model predicts Q-values for every possible action in a given state, we choose the most rewarding action, i.e. the action with the highest Q-value.
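
As a sketch, ε-greedy action selection for this environment could be written as follows (the function name and signature are illustrative):

```python
import random
import torch

def select_action(model, state, epsilon, n_actions=6):
    """ε-greedy selection: explore with probability ε, otherwise exploit the training model."""
    if random.random() < epsilon:
        return random.randrange(n_actions)        # explore: pick a random action
    with torch.no_grad():                         # exploit: no gradients needed for inference
        q_values = model(torch.tensor([state]))   # shape (1, n_actions)
    return int(q_values.argmax(dim=1).item())     # action with the highest predicted Q-value
```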

The new tuple (state, action, reward, next state) is then pushed into the replay memory to enrich data diversity. Note that the training model is not trained during this phase, just used to compute predicted Q-values, so the predicted action might not be accurate at the beginning of the process.

Training step

After having explored a little and gathered some data, we can start the training process of the network. That’s where the memory comes into action.

First, we randomly sample a batch from the experience replay memory. This batch will be fed into the networks to predict the Q-values. Each experience in the batch contains a state of the environment, the action the agent took in that state, the reward it got, and the resulting state. So an experience is a tuple (state, action, reward, next state).

Forward pass

The states in the batch are sent to the training network to predict the Q-values, but we only retain the Q-values corresponding to the action written in the memory for each experience. These Q-values are our predicted Q-values and will be used to compute the model loss.

To train a deep-learning model the supervised way, we need labelled data, but we’ve already said that we do not have any from our environment. We therefore need to construct labelled data, and that’s the purpose of the target network. The target network takes the next states and predicts the best Q-value over all actions, because that’s what we want our model to predict: the most rewarding action to take for each state. Said another way, the next state is the state we would have liked the training model to aim for. To reach that state, the model had to choose the proper action and, according to the target model, that is the action corresponding to its highest Q-value.

Loss

What we just said is: for a memory sample (state S, action A, reward R, next state S'), we wanted our training model to predict the same highest Q-value Qt(S') as the target network, corresponding to some action A', but it predicted Q(S, A) instead. The observed loss is then

Naive loss

However, if we work with this simple naive loss, we miss a crucial piece of information, the expected reward, and our model will not tend to maximize the final reward. That’s where the Bellman equation comes into action

Bellman equation

Now, the target Q-value is not only given by the maximum Q-value predicted by the target network, it is also modulated by the observed reward R. The γ factor is a discount factor, a hyperparameter to adjust to give more or less weight to the target Q-value.
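
In symbols, with Q the training network, Qt the target network and γ the discount factor, the target value described here reads:

```latex
Q_{\text{target}}(S, A) = R + \gamma \max_{A'} Q_t(S', A')
```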

Back-propagation

So we now have Q-values predicted by the training model, and target (or labelled) Q-values predicted by the target network. The rest is the same as when training any deep-learning model, i.e. loss back-propagation. In reinforcement learning, we generally use the Huber loss:

Huber loss
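
For reference, the Huber loss on the error e = Qtarget − Q(S, A) is defined as (PyTorch’s `SmoothL1Loss` corresponds to δ = 1):

```latex
L_{\delta}(e) =
\begin{cases}
\frac{1}{2}\, e^{2} & \text{if } |e| \le \delta \\
\delta \left( |e| - \frac{1}{2}\, \delta \right) & \text{otherwise}
\end{cases}
```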

Recap

To recapitulate all the steps to train our model (a code sketch of the training step follows the list):

  1. Initialize memory with a few warmup random steps
  2. Reset environment so that we are in a known state
  3. Choose an action with ε-greedy action selection policy
  4. Perform a step with this action and store the results into memory
  5. Sample a batch from memory
  6. Predict Q-values with both training and target networks
  7. Compute Huber loss and back-propagate in the training network
  8. Update target network periodically
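
Here is a minimal sketch of what one training step (steps 5 to 7) could look like in PyTorch; the helper names, the `done` flag and the constants are illustrative assumptions, building on the `ReplayMemory` sketched earlier:

```python
import torch
import torch.nn as nn

GAMMA = 0.99                   # discount factor (the value discussed later in the article)
loss_fn = nn.SmoothL1Loss()    # Huber loss

def optimize(train_net, target_net, memory, optimizer, batch_size):
    """One training step: sample a batch, build Bellman targets with the target network,
    then back-propagate the Huber loss through the training network."""
    if len(memory) < batch_size:
        return
    batch = memory.sample(batch_size)
    states = torch.tensor([e.state for e in batch])                  # (B,) state indices
    actions = torch.tensor([e.action for e in batch]).unsqueeze(1)   # (B, 1)
    rewards = torch.tensor([e.reward for e in batch], dtype=torch.float32)
    next_states = torch.tensor([e.next_state for e in batch])
    dones = torch.tensor([e.done for e in batch], dtype=torch.float32)

    # Q-values predicted by the training network for the actions actually taken
    q_pred = train_net(states).gather(1, actions).squeeze(1)

    # Bellman target computed with the (frozen) target network
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
    q_target = rewards + GAMMA * q_next * (1.0 - dones)   # no bootstrapping past terminal states

    loss = loss_fn(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Step 8 then simply amounts to copying the training weights into the target network every few episodes, e.g. `target_net.load_state_dict(train_net.state_dict())`.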

A look at the target network

You may wonder why we bother with a second network, the target network, and why we don’t just use the same network? I’m glad you asked! And the answer is: well, you can use the same network, but… you may encounter some issues.

I’ll say it again: we do not have any labelled data here to train the model. We compute the labels for each batch, so if we use the training model itself to label the data, sampling the same entry twice from the memory may not yield the same target! The target is a moving target, and if it moves too often, the training model won’t be able to track it. It is as simple as that!

By using a second, target network that is not trained, we ensure that the target Q-values stay stable during a given period (a few episodes). However, we know that these targets are not the best ones, at least at the beginning of training. That’s why the target network has to be updated periodically. How often? Well, it depends; the period is another hyperparameter you have to play with.

I could not say it better than the Google DeepMind scientists who proposed this approach to play Atari games:

Human-level control through deep reinforcement learning, V. Mnih et al., Nature, volume 518, pages 529–533 (2015)

PyTorch Implementation

I don’t want to bother you with tons of lines of code here, so I refer you to my GitHub repo:

Note on the model

The model used for this toy example is very simple. It consists of an embedding layer (a state is a simple integer from 0 to 499), two hidden dense layers with ReLU activation and no regularization of any kind, and a final dense linear layer, i.e. without an activation function, because the Q-values are not bounded.

model architecture

The embedding layer embeds the input integer into a 4-dimensional vector. I use hidden layers with 50 nodes, and the last dense layer maps to a 6-dimensional vector, as the Taxi-v3 game has 6 possible actions.
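
As a sketch, such a model could be written in PyTorch like this; the class name and exact wiring are illustrative, while the layer sizes are the ones quoted above:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a Taxi-v3 state index to one Q-value per action."""

    def __init__(self, n_states=500, embed_dim=4, hidden=50, n_actions=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Embedding(n_states, embed_dim),   # state index -> 4-dimensional vector
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),        # one unbounded Q-value per action, no activation
        )

    def forward(self, state):
        # state: LongTensor of shape (batch,), each entry in [0, 499]
        return self.net(state)
```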

Hyperparameters tuning

Throughout this article, we’ve often spoken of hyperparameters: the discount factor, the ε-greedy policy, the target-network update period, but we can also mention the learning rate and the batch size. Tuning them can be tricky: as I already said, the target in deep-Q-learning is a moving target, so the process may not be very stable. So let’s dig a little into each of these hyperparameters to see their influence on the training process.

First, here is the training result I ended up with.

Training curves for a set of hyperparameters that made the training converging

We see on this figure the rewards curve, the steps curve and the ε curve. The reward plot is the sum of all the rewards in each episode. An episode is a round of play starting with a fresh new environment and ending either with the end of the game (passenger dropped off at the right place) or after a maximum number of steps, because we don’t want the exploration of the environment to take too much time. The steps curve is the number of steps performed in each episode. What we want is to maximize the reward and minimize the number of steps for each episode.

We also find the ε curve on the figure. It is the evolution of the hyperparameter ε of the ε-greedy policy. This parameter has been chosen to vary along the episodes because we want the training process to let our agent explore the environment at the beginning, but we also want it to exploit the learnt model. The ε parameter makes this trade-off: it is near 1 at the beginning, so that the probability of choosing a random action is close to 1, i.e. the agent explores the environment. This probability then decreases exponentially to a low value (0.1 in the present case) to let the agent exploit the model.
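
Such a schedule can be implemented as a simple exponential decay; in the sketch below, the decay constant is an assumed value and 0.1 is the floor mentioned above:

```python
import math

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.1, 500   # start, floor, decay constant (decay constant is assumed)

def epsilon(episode):
    """Exponentially decay ε from EPS_START towards EPS_END as episodes go by."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-episode / EPS_DECAY)
```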

Batch size

A simple hyperparameter is the batch size. With too few samples in the batch, the model won’t see much data at each training step; with too many, it will see too much “wrong” data. Indeed, at least at the beginning of training, the gathered data are almost random.

Large batch size (left) vs. small batch size (right)

As we can see here, using a large batch size speeds the training up to… a random model. The model is unable to converge as the data is of too low quality. On the contrary, with too small a batch size, the model learns slowly and does not converge either (at least not after 10,000 episodes) as it has too little data to exploit.

Learning rate

I won’t say much about the learning rate as I assume you are familiar with it, but here is the comparison between a high and a low LR.

High learning rate (left) vs. low learning rate (right)

Note that to achieve the training curve of the converged model, I did not use a constant learning rate. Instead, I let it decrease exponentially over the episodes to speed up learning at the beginning and help convergence at the end.
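
One way to get such a decaying learning rate in PyTorch is an `ExponentialLR` scheduler; the starting rate and the decay factor below are assumed values, and the tiny linear model is only a stand-in for the real Q-network:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

train_net = nn.Linear(4, 6)                               # stand-in for the Q-network sketched earlier
optimizer = optim.Adam(train_net.parameters(), lr=1e-3)   # assumed starting learning rate
scheduler = ExponentialLR(optimizer, gamma=0.999)         # assumed per-episode decay factor

# at the end of each episode:
scheduler.step()   # the learning rate is multiplied by 0.999
```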

ε-greedy policy

We can already see on the figures that the ε hyperparameter is not constant over the episodes. I already explained why, but we can still vary the ending point and choose whether we want the agent to explore more or less at the end of training.

High steady state ε (left) vs. low ε (right)

If we let the agent explore the environment too much, it can be hard to tell whether the model converges. To be sure, we need to evaluate the model over several episodes and see if it plays the game well. The left curve tells us that the variance decreases over the episodes, so we can suspect the model has converged, but the remaining high variance is probably due to the high probability of exploration: even if the model makes the right decision, the ε-greedy policy forces the agent to choose another way. On the opposite side, if the probability of exploration drops too low too quickly while the model has not yet converged, the data in memory are progressively replaced by inaccurate experiences and the model can start to diverge. That’s what we see in this example.

Discount factor

It was our choice to use a high discount factor, ~0.99. It means that we give almost as much weight to the target Q-value as to the immediate reward. If the discount factor is too low, the model barely takes future rewards into account and it won’t converge, at least not fast.

Low discount factor

Update period

As stated by the authors of the Google DeepMind paper, using a second network and updating it only periodically stabilizes the learning process.

Update the target model very often (left) vs. rarely (right)

We can see that updating the target model too often destabilizes the learning process, as seen from the high variance. The predictions might be quite good since the rewards are high, but as the target changes too often, training is not stable. Conversely, if the update frequency is too low, the model converges rapidly to the target, but to a wrong one. It does eventually converge, but it could converge faster.

Conclusion

In this article, we’ve seen how to use reinforcement learning and deep learning together to learn a policy for playing a simple game. Whereas it is possible to use a table mapping each state to the right action for this simple game, it can be tricky or even impossible for more complex games. That’s why we’ve replaced the Q-table with a deep-learning model, an approach that can generalize to more environments. However, we’ve also seen that training such a model can be harder than training traditional deep-learning models. Not that the model itself is more complicated, but because the target is a moving target, so the process is hardly stable.

I hope you enjoyed the journey as much as I did!
