Reinforcement Learning for Beginners

Reinforcement Learning — Teach a Taxi Cab to drive around with Q-Learning

A gentle introduction to RL with OpenAI-gym and Python

Guillaume Androz
Towards Data Science
7 min read · Mar 14, 2021


I’ve played a lot with machine learning, both for fun and for work. I’ve built deep-learning models with PyTorch and TensorFlow for NLP, vision and healthcare. However, reinforcement learning was still a mystery to me, and reading about DeepMind, AlphaGo and so on was very intriguing. Having a little more time now, I decided to dive into RL and try to understand the basics.

My choice was to start from a simple, Python-friendly example, and OpenAI-gym is a very good framework for that. The toy example I chose is the taxi-cab environment, which is very simple:

There are 4 locations (labeled by different letters) and your job is to pick up the passenger at one location and drop him off in another. You receive +20 points for a successful dropoff, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.

OpenAI taxi-V3 environment. Image by Author

As stated in the documentation, you drive a taxi-cab. There are four locations at the four corners of the grid. A passenger is waiting for the taxi at one of them and you have to drive them to the designated destination. You are rewarded at every move: a small penalty while you are travelling, a large penalty if you pick up or drop off the passenger at the wrong location, and a big reward if you succeed. Simple as hell.

States

In RL, a state is… well, the state your agent (your taxi-cab) is in. In the taxi problem, a state is described by the taxi’s location on the grid (a row and a column number between 0 and 4), the passenger’s location (one of the four corners, or inside the taxi), and the destination (one of the four corners).

If you count well, we then have 5x5x5x4=500 possible states.
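
You can check that count directly against the environment itself. A minimal sketch, assuming the standard OpenAI-gym Taxi-v3 environment used later in this post:

    import gym

    env = gym.make("Taxi-v3")
    print(env.observation_space)  # Discrete(500): 25 taxi positions x 5 passenger locations x 4 destinations
    print(env.action_space)       # Discrete(6): the six actions described below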

Actions

Now, the actions. There are exactly six possible actions. With your taxi, you can try to move in the four cardinal directions: South, North, East or West (trying does not mean that the move is allowed, or even possible). There are also two other actions, pick-up and drop-off, for a grand total of six:

  • 0 = South
  • 1 = North
  • 2 = East
  • 3 = West
  • 4 = Pickup
  • 5 = Dropoff

Note here that you are not penalized more if you try to move to a forbidden location. For example, in the figure above, you cannot move west, but you still get the -1 penalty, no more, no less. The same goes when you try to move beyond the 5x5 grid. However, you are severely penalized if you try to pick up or drop off the passenger at the wrong location.
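
To make the reward structure concrete, here is a small sketch that anticipates the gym code of the next section (the rewards in the comments follow the environment description above; since the start state is random, the pick-up below is most likely illegal):

    import gym

    env = gym.make("Taxi-v3")
    state = env.reset()

    # A movement action always costs -1, whether or not the taxi actually moves
    # (driving into a wall leaves the position unchanged but still costs -1).
    _, reward, _, _ = env.step(3)   # 3 = West
    print(reward)                   # -1

    # An illegal pick-up or drop-off costs -10.
    _, reward, _, _ = env.step(4)   # 4 = Pickup
    print(reward)                   # -10, unless the taxi happens to be at the passenger's location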

Play with OpenAI-gym and Python

Now that we have described the environment, it is time to play with it in Python. OpenAI provides us with a fully coded gym environment, so the task is quite easy.

Rendering

The first step is to render a random environment.
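
With the classic gym API, this boils down to a few calls:

    import gym

    env = gym.make("Taxi-v3")   # build the taxi environment
    state = env.reset()         # start a new episode and get the initial state (an integer)
    env.render()                # print the grid with the taxi, the passenger and the destination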

Random environment sample state

A new episode is also easily started by calling the reset method: state = env.reset().

Moving around

When you play within an environment, you have to tell it what you want to do, i.e. which action you take. In the example above, you may want to go north. The North action has code 1, so to perform a step in that direction we call the step method of the environment.
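
Using the env object created above, that is a one-liner:

    env.step(1)   # 1 = North; returns (new_state, reward, done, info)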

This call returns a tuple (204, -1, False, {'prob': 1.0}) whose components are:

  • state: the new state of the environment, an integer from 0 to 499
  • reward: the reward you got for your action; -1 means you simply moved
  • done: a boolean indicating whether we have successfully dropped off a previously picked-up passenger
  • info: extra information, not relevant for our usage

Note that we could also have chosen to pick an action at random; the gym framework provides a useful method for that: env.action_space.sample().
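
For instance, here is a small sketch of a purely random agent, reusing the env object from above (the episode ends on a successful drop-off or when gym's default time limit kicks in):

    state = env.reset()
    done = False
    steps = 0
    while not done:
        action = env.action_space.sample()            # pick one of the six actions at random
        state, reward, done, info = env.step(action)
        steps += 1
    print(f"Random agent finished after {steps} steps")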

A gentle introduction to Q-learning

OK, so what is Q-learning about? I knew that to train a deep-learning model, I needed lots of data, possibly annotated, and a method to let the model learn a function that minimizes the error between the training data and its predictions. Reinforcement learning, however, does not take the same approach. Here, we need our agent to explore its environment and let it decide which action to take to maximize its expected reward. Right away, we meet a fundamental concept of RL: the trade-off between exploring (the environment) and exploiting (what the agent has learnt).

Now, how do you explore a new environment and exploit what you have learnt from your experiences? Imagine you are in a labyrinth. You know there is an exit and you have to find it. Imagine also that you have a great memory, large enough to remember a huge number of your last moves. With that in mind, you can reconstruct a mental image of the labyrinth, step back and explore a new way to the exit. Of course, you are rewarded with sweets if you manage to find the exit :) What you have just built is a kind of table that tells you exactly what to do, and where to go, from anywhere within the labyrinth in order to find the exit. In RL, this table is known as the Q-table.

The Q-table is a table holding a Q-value for each (state, action) pair. For each state there are several possible actions, hence several Q-values. The goal is to learn to choose, in each state, the action with the highest Q-value, that is, the action with the best chance of getting the maximum reward.

To start learning, the Q-table is generally initialized randomly (or to 0). The agent then explores its environment by choosing actions at random and getting rewards for them. The idea is first to let the agent explore the environment and collect some rewards. But after a while, we need to exploit the information gathered: from a given state, the agent computes the expected reward of each action it can take and chooses the one with the maximum expected reward:

action = argmax_a Q(state, a)

The core of RL is then given by the Bellman equation, which tells us how to update the Q-values of the Q-table:

Q(state, action) ← Q(state, action) + α × [reward + γ × max_a Q(next state, a) − Q(state, action)]

where

  • α is the learning rate
  • γ is a discount factor to give more or less importance to the next reward

What the agent learns is the proper action to take in the current state, by looking at the reward received for an action and at the maximum Q-value available from the next state. Intuitively, a lower discount factor yields a greedier agent that wants immediate rewards without looking ahead.
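
In NumPy, a single update of the Q-table could be sketched as follows (the q_update helper is illustrative; q_table is assumed to be a 2-D array indexed by state and action):

    import numpy as np

    def q_update(q_table, state, action, reward, next_state, alpha, gamma):
        """One Q-learning update of the Q-table, following the Bellman equation above."""
        best_next = np.max(q_table[next_state])   # best expected value from the next state
        td_target = reward + gamma * best_next    # reward now, plus discounted future reward
        q_table[state, action] += alpha * (td_target - q_table[state, action])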

Explore vs. Exploit

We have already mentioned it several times, but we now see more clearly what the trade-off between exploring and exploiting is about. If we let the agent always choose an action at random, it could eventually learn the Q-table, but the process will never be efficient. On the contrary, if we only ever choose the action that maximizes the Q-value, the agent will tend to always take the same route and overfit the current environment setup. Furthermore, it will suffer from high variance, as it will not be able to find the proper route in another environment setup.

To prevent these two scenarios from occurring and to find a trade-off, we add another hyperparameter, epsilon (ϵ): the probability with which we choose a random action instead of the one computed from the Q-table. Playing with this parameter allows us to find an equilibrium.
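
In code, epsilon-greedy action selection is just a coin flip. A sketch (the choose_action helper and the default value of epsilon are illustrative):

    import random
    import numpy as np

    def choose_action(q_table, state, env, epsilon=0.1):
        """Explore with probability epsilon, otherwise exploit the Q-table."""
        if random.uniform(0, 1) < epsilon:
            return env.action_space.sample()   # explore: random action
        return np.argmax(q_table[state])       # exploit: best known action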

Implementation in Python

I know what you are looking for, so first, here is the code.
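
A minimal, self-contained sketch of the training loop (the hyperparameter values below are illustrative, not tuned):

    import random

    import gym
    import numpy as np

    env = gym.make("Taxi-v3")

    alpha = 0.1        # learning rate
    gamma = 0.6        # discount factor
    epsilon = 0.1      # probability of exploring instead of exploiting
    episodes = 10000   # number of training episodes

    # One Q-value per (state, action) pair, initialized to 0
    q_table = np.zeros([env.observation_space.n, env.action_space.n])

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Explore vs. exploit
            if random.uniform(0, 1) < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(q_table[state])

            next_state, reward, done, info = env.step(action)

            # Bellman update of the Q-table
            q_table[state, action] += alpha * (
                reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
            )
            state = next_state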

Simple, right? But as surprising as it looks, this piece of code learns a Q-table by exploring an environment, with no historical, well-labelled data, just as a basic intelligence would do. Awesome!

But to convince you, why not play with our new toy?
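
One way to do that is to run a batch of episodes in which the agent only exploits the learned Q-table, and look at the average reward and episode length (a sketch reusing env and q_table from the training code above):

    eval_episodes = 100
    total_reward, total_steps = 0, 0

    for _ in range(eval_episodes):
        state = env.reset()
        done = False
        while not done:
            action = np.argmax(q_table[state])           # always exploit the learned Q-table
            state, reward, done, info = env.step(action)
            total_reward += reward
            total_steps += 1

    print(f"Average reward per episode: {total_reward / eval_episodes:.1f}")
    print(f"Average steps per episode: {total_steps / eval_episodes:.1f}")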

Conclusion

Oh yeah! We have just completed our first RL toy agent. It takes neither complicated mathematics nor hard algorithms to understand the basics of RL. And with only a few lines of code, we were able to train an agent to play the taxi game.

The journey only starts here. We can now play with the various hyperparameters and see their influence on training and on the performance of the agent. And to evaluate performance, all you need to do is play several episodes with the agent (code a simple for-loop, do not do it by hand!) and see how the hyperparameters influence the mean reward and the mean number of steps needed to complete an episode.

Also, stay tuned: in a coming post, we will bring the worlds of RL and DL together by replacing the Q-table with a DL model!
