Introduction to Reinforcement Learning

In formal terms, reinforcement learning is a method of machine learning in which a software agent learns to perform actions in an environment so as to maximize its reward. It does so through exploration and exploitation of the knowledge it gains over repeated trials of maximizing the reward.
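As a minimal sketch of that exploration/exploitation trade-off, consider a hypothetical two-action task (the reward probabilities below are made up purely for illustration and are hidden from the agent):

```python
import random

# Toy two-action task: the agent does not know these reward probabilities.
reward_prob = {"left": 0.3, "right": 0.7}

estimates = {"left": 0.0, "right": 0.0}   # the agent's learned value estimates
counts = {"left": 0, "right": 0}
epsilon = 0.1                             # fraction of trials spent exploring

for trial in range(10_000):
    if random.random() < epsilon:
        action = random.choice(list(estimates))      # explore: try something else
    else:
        action = max(estimates, key=estimates.get)   # exploit: use what we know
    reward = 1.0 if random.random() < reward_prob[action] else 0.0
    counts[action] += 1
    # Incremental average: the estimate drifts toward the true reward rate.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)   # the estimate for "right" should approach 0.7
```

Over many repeated trials, the agent's estimates converge toward the true reward rates, and the greedy choice settles on the better action.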

To contrast the paradigms: supervised learning predicts continuous values or discrete labels/classes based on training examples with provided labels or values, while unsupervised learning tries to group samples together by their similarity and discover discrete clusters.

Reinforcement learning, on the other hand, is a third paradigm, distinct from both supervised and unsupervised learning, and it performs learning very differently. It takes up the method of "cause and effect".

Intuition to Reinforcement Learning

Let us try to understand the previously stated formal definition by means of an example -

Imagine you are supposed to cross an unknown field in the middle of a pitch-black night without a torch. There can be pits and stones in the field, whose positions are unknown to you. There's a simple rule - if you fall into a pit or hit a stone, you must start again from your initial point.

Basic Concept and Terminology

Insight

In the above example, you are the agent who is trying to walk across the field, which is the environment. Walking is the action the agent performs on the environment. The distance the agent walks acts as the reward. The agent tries to perform actions in such a way that the reward is maximized. This is how Reinforcement Learning works in a nutshell. The following figure puts it into a simple diagram -

And in proper technical terms, generalized to fit more examples, the diagram becomes -
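In code, that agent-environment loop might look like the following sketch; the `Environment` and `Agent` classes here are illustrative stand-ins modeled on the field example, not a real library API:

```python
import random

class Environment:
    """Toy 1-D field: the agent starts at position 0 and must reach position 10."""

    def reset(self):
        self.position = 0
        return self.position                             # initial state

    def step(self, action):
        old = self.position
        self.position = max(0, self.position + action)   # action is +1 or -1
        reward = self.position - old                     # distance covered is the reward
        done = self.position >= 10
        return self.position, reward, done

class Agent:
    """Placeholder agent with a random policy; a learner would improve on this."""

    def act(self, state):
        return random.choice([1, -1])

env, agent = Environment(), Agent()
state = env.reset()
done, total_reward = False, 0
while not done:
    action = agent.act(state)                  # agent acts on the environment
    state, reward, done = env.step(action)     # environment returns new state and reward
    total_reward += reward
print("reached the goal with total reward:", total_reward)
```

A learning agent would replace the random policy with one that uses the observed rewards to prefer better actions.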

Terminology

Some important terms related to reinforcement learning are -

  1. Agent: the learner and decision maker (you, in the example above).
  2. Environment: the world the agent interacts with (the field).
  3. State: the current situation of the agent in the environment (your position in the field).
  4. Action: a move the agent performs that changes the state (taking a step).
  5. Reward: the feedback the environment returns for an action (the distance covered).
  6. Policy: the strategy the agent follows to choose actions from states.
  7. Value: the expected long-term reward of a state under a given policy.

How Reinforcement Learning Works

There are three main approaches to implementing a reinforcement learning algorithm. They are -

  1. Value-based: learn a value function that estimates the expected reward of each state (or state-action pair), and act greedily with respect to it.
  2. Policy-based: learn the policy, a mapping from states to actions, directly, without an explicit value function.
  3. Model-based: learn a model of the environment and use it to plan ahead.

An Example: Tic-Tac-Toe

A sequence of tic-tac-toe moves. The solid black lines represent the moves taken during a game; the dashed lines represent moves that the agent considered but did not make. Our second move was an exploratory move, meaning that it was taken even though another sibling move, the one leading to e*, was ranked higher. Exploratory moves do not result in any learning, but each of our other moves does, causing updates as suggested by the red arrows, in which estimated values are moved up the tree from later nodes to earlier ones.

A Reinforcement Learning Approach to Tic-Tac-Toe:

  1. Set up a table of numbers, one for each possible state of the game.
  2. Each number will be our latest estimate of our probability of winning from that state.
  3. This estimate is the state's value, and the whole table is the learned value function.
  4. Assuming we always play Xs, for all states with three Xs in a row (or column or diagonal), the probability of winning is 1.0.
  5. For all states with three Os in a row (or column or diagonal), the probability of winning is 0.0.
  6. We set the initial values of all other states to 0.5, representing an estimated 50% chance of winning (see the sketch after this list).
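Here is one possible sketch of steps 1-6 in Python, assuming a board is represented as a tuple of nine cells, each 'X', 'O', or ' ' (empty); initializing entries lazily on first lookup stands in for building the full table up front:

```python
# Winning lines of the 3x3 board, by cell index.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

values = {}   # state -> estimated probability of winning (the value function)

def value(board):
    """Look up a state's value, initializing it lazily per the rules above."""
    if board not in values:
        w = winner(board)
        if w == 'X':
            values[board] = 1.0    # we play Xs, so three Xs in a line means we win
        elif w == 'O':
            values[board] = 0.0    # three Os in a line means we have lost
        else:
            values[board] = 0.5    # all other states start at a 50% estimate
    return values[board]
```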

We then play many games against the opponent. To select our moves:

  1. We examine the states that would result from each of our possible moves and look up their current values in the table.
  2. Most of the time we move greedily, selecting the move that leads to the state with the greatest value, i.e., the highest estimated probability of winning.
  3. Occasionally, we select randomly from among the other moves instead; these are exploratory moves (see the sketch below).
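A sketch of this selection rule, reusing the `value` function from the table-setup sketch above; `possible_moves`, `select_move`, and the `epsilon` parameter are names chosen here for illustration, not from the original text:

```python
import random

def possible_moves(board):
    """All boards reachable by placing an X in one of the empty cells."""
    return [board[:i] + ('X',) + board[i + 1:]
            for i, cell in enumerate(board) if cell == ' ']

def select_move(board, epsilon=0.1):
    """Epsilon-greedy selection over the resulting states, per steps 1-3 above."""
    candidates = possible_moves(board)
    if random.random() < epsilon:
        return random.choice(candidates), True   # occasional exploratory move
    return max(candidates, key=value), False     # greedy: highest win estimate
```

The second return value flags exploratory moves, since (as the figure caption notes) they do not trigger learning updates.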

While playing, we also update the values of the states we pass through, so that the estimates become more accurate over time:

  1. After each greedy move, from A to B, we update the value of A to be closer to the value of B.
  2. This is achieved using the following update rule:

     V(S_t) ← V(S_t) + α [ V(S_t+1) - V(S_t) ]

     where V(S_t) is the value of the older state (the state before the greedy move, A), V(S_t+1) is the value of the new state (the state after the greedy move, B), and α (alpha) is the learning rate.

This update rule is an example of a temporal-difference learning method, so called because its changes are based on a difference, V(S_t+1) - V(S_t), between estimates at two successive times.
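Continuing the same sketch, the update rule translates directly into code; `td_update` and `ALPHA` are names chosen here for illustration, and `value`/`values` come from the table-setup sketch above:

```python
ALPHA = 0.1   # learning rate (alpha); a typical small value, chosen arbitrarily here

def td_update(state_a, state_b):
    """Move V(A) a fraction alpha of the way toward V(B):
    V(S_t) <- V(S_t) + alpha * (V(S_t+1) - V(S_t))."""
    values[state_a] = value(state_a) + ALPHA * (value(state_b) - value(state_a))
```

During play we would call `td_update(A, B)` after each greedy move from A to B; exploratory moves skip the update, as noted in the figure caption above.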

For a better understanding of reinforcement learning along with the code, check out the Jupyter notebook here.