Reinforcement Learning

One area where nature-inspired techniques are often used is reinforcement learning. The goal of reinforcement learning is to learn a behavior for an agent that maximizes the total reward the agent receives from the environment. We assume that the agent exists in some environment whose state it can observe and influence. The agent runs in cycles: in each iteration $t$, it observes the state of the environment $s_t$ and, based on this observation, performs some action $a_t$, thereby changing the state of the environment to a new state $s_{t+1}$. For performing the action, it receives a reward $r_t$ from the environment.

A typical example problem for reinforcement learning is Mountain Car. A car sits in a valley and tries to get out, but its engine is too weak to drive straight up, so it has to “swing” back and forth to build momentum. The state is a vector of two numbers giving the car’s position and speed. The car can perform three actions: drive forward, drive backward, or do nothing. Since we want the car to exit the valley as quickly as possible, it receives a reward of $-1$ for each step spent in the valley and $0$ for the step in which it exits (which also ends the simulation).
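The dynamics just described can be sketched in a few lines. The constants below (position range, velocity limit, engine force, gravity term) follow the classic formulation of Mountain Car; they are assumptions for illustration, not values taken from the text:

```python
import math

def step(position, velocity, action):
    """One Mountain Car step; action: 0 = backward, 1 = nothing, 2 = forward.

    Constants are the classic ones: position in [-1.2, 0.6], speed in
    [-0.07, 0.07], engine force 0.001, gravity term 0.0025.
    """
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position = max(-1.2, min(0.6, position + velocity))
    done = position >= 0.5                 # the car escaped the valley
    reward = 0.0 if done else -1.0         # -1 per step spent in the valley
    return position, velocity, reward, done
```

Pushing forward from the bottom of the valley changes the speed only slightly, which is exactly why the car has to swing instead of driving straight out.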

Markov Decision Process

We can formally describe the environment as a Markov decision process (MDP), which is specified by the quadruple $(S, A, P, R)$, where $S$ is a finite set of states of the environment, $A$ is a (finite) set of actions (it can also be replaced by sets $A_s$ of actions applicable in state $s$), $P_a(s, s')$ is a transition function giving the probability that applying action $a$ in state $s$ moves the environment to state $s'$, and $R_a(s, s')$ is a reward function giving the immediate reward the agent receives when it performs action $a$ in state $s$ and thereby moves the environment to state $s'$. It is important that the transition function satisfies the so-called Markov condition, i.e. that it depends only on the current state $s$ and action $a$, and not on the previous states and actions.

We can then describe the agent’s behavior using a policy (strategy) $\pi: S \times A \to [0,1]$, where $\pi(s,a)$ is the probability of executing action $a$ in state $s$. The goal of reinforcement learning is to find a strategy $\pi$ that maximizes the total discounted reward the agent receives, $\sum_{t=0}^\infty \gamma^t R_{a_t}(s_t, s_{t+1})$, where $a_t$ is the action taken by the agent at step $t$ (sampled according to $\pi(s_t, \cdot)$, or $a_t = \pi(s_t)$ for a deterministic strategy) and $\gamma < 1$ is the discount factor, which ensures that the sum is finite.
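For a finite episode, the discounted sum above is easy to compute directly; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += gamma ** t * r
    return total
```

With $\gamma < 1$, later rewards contribute geometrically less, which is what keeps the infinite sum bounded.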

State value and state-action value

The value $V^\pi(s)$ of state $s$ when using strategy $\pi$ can be defined as $V^\pi(s) = \mathrm{E}[R] = \mathrm{E}[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s]$, where $R$ denotes the total discounted reward obtained and $r_t$ denotes the reward received at time $t$. In addition to the value of a state, it is often useful to also consider the value $Q^\pi(s, a)$ of performing action $a$ in state $s$ and then continuing to follow strategy $\pi$. The advantage of working with $Q$ is that the agent does not need a model of the environment describing how actions affect it; it learns these values on the fly.
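The two functions are linked by the standard identity $V^\pi(s) = \sum_a \pi(s,a)\, Q^\pi(s,a)$: the state value is the policy-weighted average of the action values. A small sketch, using dictionaries purely as an illustrative representation:

```python
def state_value(pi_s, q_s):
    """V(s) from Q(s, .) and pi(s, .) for a single state s.

    pi_s: action -> probability, q_s: action -> Q value.
    """
    return sum(prob * q_s[a] for a, prob in pi_s.items())
```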

The goal of the agent is then to find an optimal strategy $\pi^\star$ such that $V^{\pi^\star}(s) = \max_\pi V^\pi(s)$. We will denote the value of the states (actions) for the optimal strategy as $V^\star (s)$ ($Q^\star(s, a)$).

Search strategy

The values of both functions $V$ and $Q$ can be used by the agent to improve its strategy. Note that the agent faces a relatively complex task: it must try to maximize its total reward $R$, for which it must learn a suitable strategy, but that strategy in turn affects the agent’s own estimates of the state values. An agent could follow a strategy that always moves to the state promising the greatest value. The problem is that the agent does not know the values of individual states in advance and has to learn them on the fly. Therefore, if it has a bad (very low) estimate of the value of a state, it may never visit it, even though under a different strategy its value would be much better (for example, a shortcut that the agent has not yet discovered leads from this state to the goal). An agent that always chooses the best action (a greedy agent) explores the environment and its available actions poorly.

Choosing an appropriate strategy for learning is therefore a complex problem: it is necessary to balance searching the space (exploration) against the use of known information (exploitation). One popular action-selection method in reinforcement learning is the so-called $\epsilon$-greedy strategy: it chooses the best action with probability $1-\epsilon$ and a random action with probability $\epsilon$. This combines exploiting the knowledge obtained about the environment with searching for new states.
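The $\epsilon$-greedy rule is a one-liner in spirit; a minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given Q(s, a) values for the current state.

    With probability epsilon explore (uniform random action),
    otherwise exploit (argmax of the Q values).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit
```

In practice $\epsilon$ is often decayed over time, so the agent explores a lot early on and mostly exploits once its estimates are reliable.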

Monte-Carlo methods

We can describe reinforcement learning in two steps: strategy evaluation and strategy improvement. Strategy evaluation can be done using Monte-Carlo methods that estimate the values $Q^\pi(s,a)$: we perform action $a$ in state $s$, then follow strategy $\pi$ until we reach some goal, and record the reward received. Strategy improvement then consists of computing a new (greedy) strategy based on the new values of $Q(s,a)$.
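A first-visit Monte-Carlo evaluation step can be sketched as follows: after a whole episode finishes, compute the discounted return following each state-action pair and fold it into a running average of $Q$. The dictionary representation is an illustrative choice:

```python
def mc_update(q, counts, episode, gamma=0.9):
    """First-visit Monte-Carlo update of Q from one finished episode.

    episode: list of (state, action, reward) triples in time order.
    q, counts: dictionaries keyed by (state, action), updated in place.
    """
    g = 0.0
    returns = []
    for s, a, r in reversed(episode):      # accumulate returns backwards
        g = r + gamma * g
        returns.append((s, a, g))
    seen = set()
    for s, a, g in reversed(returns):      # restore forward order
        if (s, a) in seen:
            continue                       # first visit only
        seen.add((s, a))
        counts[(s, a)] = counts.get((s, a), 0) + 1
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + (g - old) / counts[(s, a)]   # running average
```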

The method described above does not work very well: if the variance of the rewards for actions and states is large, it converges very slowly. Moreover, one evaluation run adjusts the value of only one state-action pair, so the method is unusable for larger Markov decision processes.


Temporal-difference methods

Some of the problems mentioned above can be solved by so-called temporal-difference (TD) methods. These are based on the Bellman equations, which state that $V^\pi(s)=\mathrm{E}_{\pi}[r_0 + \gamma V^\pi(s_1) \mid s_0=s]$. The value function is then updated after each strategy-evaluation step (each agent step) as $V(s) \leftarrow V(s) + \alpha(r + \gamma V(s') - V(s))$, where $r + \gamma V(s')$ is the new target value for $V(s)$. The advantage over Monte-Carlo methods is that the values of several states are adjusted during a single evaluation. TD methods can also be described as propagating reward information from target states (where it is known) back to the preceding states. In addition, these methods address the problem of distributing reward over time: if the agent receives a reward only when the goal is reached, a TD method distributes this reward gradually among the actions that led to the goal.
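The TD(0) update rule translates directly into code; a minimal sketch using a dictionary for $V$:

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)) in place."""
    target = r + gamma * v.get(s_next, 0.0)     # new target for V(s)
    v[s] = v.get(s, 0.0) + alpha * (target - v.get(s, 0.0))
```

Calling this after every agent step is exactly the "update after each strategy-evaluation step" described above.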

A specific algorithm based on TD methods is Q-learning, which learns the function $Q(s,a)$ directly. Traditionally, $Q$ is represented as a matrix, initially filled with zeros. At each step, the agent observes the state $s_t$, performs the action $a_t$, receives the reward $r_t$, and observes the new state $s_{t+1}$. Based on this information, it updates the $Q$ matrix as follows:

\[Q^{new}(s_{t},a_{t}) \leftarrow (1-\alpha) \cdot Q(s_{t},a_{t}) + \alpha \cdot \bigg( r_{t} + \gamma \cdot \max_{a} Q(s_{t+1}, a) \bigg),\]

where $\alpha$ is the learning rate and the action $a_t$ can be chosen by any of the methods mentioned above (e.g. $\epsilon$-greedy).
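The whole algorithm fits in a short loop. The 4-state chain below is an assumed toy MDP (moving right from the last state ends the episode with reward $+1$, everything else gives $0$), used only to make the sketch self-contained:

```python
import random

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy selection on a toy 4-state chain."""
    rng = random.Random(seed)
    n_states, n_actions = 4, 2                 # actions: 0 = left, 1 = right
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while True:
            if rng.random() < epsilon:         # epsilon-greedy selection
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=q[s].__getitem__)
            if s == n_states - 1 and a == 1:   # goal reached, terminal update
                q[s][a] += alpha * (1.0 - q[s][a])
                break
            s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
            # Q-learning update: bootstrap from the best next action.
            q[s][a] += alpha * (0.0 + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q
```

After training, the greedy policy read off the matrix moves right everywhere, which is the optimal behavior on this chain.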


The Q-learning algorithm is a so-called off-policy algorithm: its update does not depend on the specific strategy the agent follows. The SARSA algorithm is similar, but on-policy: $Q$ is updated according to the relation \(Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [r_{t} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t)],\) where both actions $a_t$ and $a_{t+1}$ are chosen by the strategy the agent actually follows.
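The difference between the two algorithms is entirely in the TD target; side by side, with $Q$ as a nested dictionary for illustration:

```python
def q_learning_target(q, s_next, r, gamma=0.9):
    """Off-policy target: bootstrap from the best action in s'."""
    return r + gamma * max(q[s_next].values())

def sarsa_target(q, s_next, a_next, r, gamma=0.9):
    """On-policy target: bootstrap from the action actually taken in s'."""
    return r + gamma * q[s_next][a_next]
```

When the agent explores (picks a non-greedy $a_{t+1}$), the two targets differ, which is why SARSA learns the value of the exploring strategy while Q-learning learns the value of the greedy one.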

Problems in Reinforcement Learning

The traditional implementation of Q-learning using a matrix has the problem that it works only with discrete states and discrete actions. The first limitation can be circumvented by discretizing the states; e.g. in the Mountain Car example, we can divide the position and speed values into intervals and represent the state by the pair of intervals instead of the exact values. The problem with continuous actions can be solved in a similar way. Another problem is that state spaces can be very large, which leads to a very large $Q$ matrix, and the algorithm then learns slowly or not at all.
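The discretization for Mountain Car can be sketched as mapping the continuous (position, speed) pair to a pair of bin indices that can then index a tabular $Q$ matrix. The bin count and the value ranges are illustrative choices (the ranges match the classic formulation of the problem):

```python
def discretize(position, velocity, n_bins=20):
    """Map a continuous Mountain Car state to a pair of interval indices."""
    def bin_index(x, lo, hi):
        x = max(lo, min(hi, x))                      # clamp into the range
        return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)
    return bin_index(position, -1.2, 0.6), bin_index(velocity, -0.07, 0.07)
```

With 20 bins per dimension the $Q$ matrix has $20 \times 20 \times 3$ entries, which also illustrates how quickly the table grows with finer discretization or more state variables.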

Later we will also see methods that learn the strategy directly as a mapping from states to actions; these have no problem with continuous actions and typically also cope with large state spaces.

Multi-agent Reinforcement Learning

Reinforcement learning can also be used in settings with multiple agents. One simple way to generalize the algorithms is to consider $n$-tuples of actions instead of single actions, but this significantly increases the complexity of the whole problem. A frequently used method is so-called self-play, where the agent assumes that the other agents use the same strategy as it does. There are also methods based on explicitly modeling the behavior of the other agents.

One of the main problems in multi-agent reinforcement learning is credit assignment: deciding which portion of the reward belongs to which agent, since the environment still provides only a single reward. One possibility is to determine an agent’s reward as the difference between the reward actually obtained thanks to its action and the reward that would have been obtained had the agent done nothing.
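This difference-reward idea can be sketched as a counterfactual evaluation; the global reward function `team_reward` below is a made-up example, and the "do nothing" null action is an assumption of the sketch:

```python
def difference_reward(actions, agent, reward_fn, null_action=None):
    """Reward share of `agent`: global reward minus the reward the system
    would have obtained if this agent had done nothing."""
    counterfactual = dict(actions)
    counterfactual[agent] = null_action          # the agent "does nothing"
    return reward_fn(actions) - reward_fn(counterfactual)

def team_reward(actions):
    # hypothetical global reward: sum of the agents' individual contributions
    return sum(a for a in actions.values() if a is not None)
```

An agent whose action contributes nothing beyond what the others already achieve receives a difference reward of zero, which is exactly the intended credit assignment.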