Deep reinforcement learning

So far in the lectures, we have talked about reinforcement learning, evolutionary algorithms, and neural networks. Last time we connected evolution and neural networks; today we will look at the use of neural networks in reinforcement learning - deep reinforcement learning.

Let’s first recall the terminology and notation (copy of the text from the chapter on reinforcement learning):

We can formally describe the environment as a Markov decision process (MDP), which is specified by the quadruple $(S, A, P, R)$, where $S$ is a finite set of states of the environment, $A$ is a (finite) set of actions (it can also be replaced by sets $A_s$ of actions applicable in state $s$), $P_a(s, s')$ is a transition function that gives the probability that applying action $a$ in state $s$ moves the environment to state $s'$, and $R_a(s, s')$ is a reward function that gives the immediate reward the agent receives from the environment if, in state $s$, it performs action $a$ and thereby moves the environment to state $s'$. It is important that the transition function satisfies the so-called Markov condition, i.e. that it depends only on the current state $s$ and action $a$, and not on the previous actions and states.

We can then describe the agent's behavior using the policy $\pi: S \times A \to [0,1]$, where $\pi(s,a)$ indicates the probability of executing action $a$ in state $s$. The goal of reinforcement learning is to find a strategy $\pi$ that maximizes the total reward the agent receives, $\sum_{t=0}^\infty \gamma^t R_{a_t}(s_t, s_{t+1})$, where $a_t$ is the action chosen by the agent according to $\pi$ at step $t$ and $\gamma \in [0, 1)$ is the discount factor that ensures that the sum is finite.

The value $V^\pi(s)$ of state $s$ when using strategy $\pi$ can be defined as $V^\pi(s) = \mathrm{E}[R] = \mathrm{E}[\sum_{t=0}^\infty \gamma^t r_t|s_0 = s]$, where $R$ denotes the total obtained discounted reward and $r_t$ denotes the reward received at time $t$. In addition to the value of a state, it is very often useful to also consider the value $Q^\pi(s, a)$ of action $a$ performed in state $s$ if we continue to follow strategy $\pi$ afterwards. The advantage of working with these values is that the agent does not need a model of the environment describing how actions affect it - the agent learns the values on the fly from interaction.

The goal of the agent is then to find an optimal strategy $\pi^\star$ such that $V^{\pi^\star}(s) = \max_\pi V^\pi(s)$. We will denote the value of the states (actions) for the optimal strategy as $V^\star (s)$ ($Q^\star(s, a)$).

Based on these definitions, we can derive the so-called Bellman equations, which are necessary conditions for the optimal strategy $\pi$:

\[Q^\pi (s, a) = \sum_{s'}P_a(s, s')\big[ R_a(s, s') + \gamma \sum_{a'}\pi(s', a')Q^\pi(s', a') \big]\]

\[V^\pi(s) = \sum_{a} \pi(s, a)\sum_{s'}P_a(s, s')\big[ R_a(s, s') + \gamma V^\pi(s') \big]\]
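As a sanity check, the Bellman equation for $V^\pi$ can be iterated to evaluate a policy on a toy MDP; the transition probabilities, rewards, and the uniform policy below are all made up for illustration:

```python
import numpy as np

# A tiny hypothetical MDP: 2 states, 2 actions (all numbers made up).
# P[a, s, s'] = transition probability, R[a, s, s'] = immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # pi[s, a]: uniform random policy
gamma = 0.9

# Iterate the Bellman equation for V^pi until the values stop changing.
V = np.zeros(2)
for _ in range(1000):
    # expected[a, s] = sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V(s'))
    expected = (P * (R + gamma * V)).sum(axis=2)
    V_new = (pi * expected.T).sum(axis=1)   # average over actions by pi(s, a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
```

Since the update is a contraction (the factor is $\gamma$), the iteration converges to the unique fixed point of the Bellman equation for this policy.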

Deep Q-Learning

Deep Q-learning is based on the same idea as Q-learning, i.e. for each state $s$ and each action $a$ we want to estimate the discounted reward we would get if we performed action $a$ in state $s$. In traditional Q-learning, these values are represented as a Q matrix. However, this representation is problematic if there are many states or many different actions. In deep Q-learning, a neural network is used to represent this matrix: given a state, it returns the estimated values of all actions.

The neural network is trained according to the Bellman equation for $Q$: the current reward from the environment $R_a(s, s')$ is combined with the value calculated from $Q$ using the Bellman equation, and the mean squared error between this target and the current estimate is minimized. Thus, our goal is to minimize the difference between $Q(s, a)$ and $R_a(s, s') + \gamma \max_{a'}Q(s', a')$. Note that, as in standard Q-learning, we assume that in the next state we choose the best action according to the current estimate of $Q$. So the error function is

\[\mathbb E_{s' \sim P_a(s, s')}\big[ (R_a(s, s') + \gamma \max_{a'}Q_\theta(s', a') - Q_\theta(s, a))^2 \big]\big|_{\theta = \theta_k}\]

where $\theta$ are the parameters of the neural network $Q_\theta$ that represents the $Q$ matrix.

Thus, to calculate the error function, we need to know the state $s$, the selected action $a$, the obtained reward $R_a(s, s')$, and the following state $s'$. We can easily get all this data by letting the agent perform actions in the environment. During training, however, a problem often arises: by directly changing the function that estimates Q, we change the agent's behavior and the training targets at the same time - in effect, we are trying to learn a function whose inputs and outputs are constantly changing. Two tricks are used to maintain greater training stability - the target network and experience replay.
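A single update on one transition $(s, a, r, s')$ can be sketched with a linear stand-in for the network, $Q_\theta(s, a) = \theta_a \cdot s$, so that the gradient of the squared error can be written by hand; the dimensions, learning rate, and transition below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear stand-in for the Q network: Q_theta(s, a) = theta[a] . s
n_features, n_actions = 4, 2
theta = rng.normal(size=(n_actions, n_features))
gamma, lr = 0.99, 0.01

def q_values(theta, s):
    return theta @ s  # vector of Q(s, a) for all actions a

def td_update(theta, s, a, r, s_next):
    # target: r + gamma * max_a' Q(s', a'), treated as a constant
    target = r + gamma * np.max(q_values(theta, s_next))
    td_error = target - q_values(theta, s)[a]
    # gradient of 0.5 * td_error^2 w.r.t. theta[a] is -td_error * s
    theta[a] += lr * td_error * s
    return td_error

# one hypothetical transition (s, a, r, s')
s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
err = td_update(theta, s, a=0, r=1.0, s_next=s_next)
```

A real implementation would use a deep network and automatic differentiation, but the structure of the update - compute the Bellman target, regress the current estimate toward it - is the same.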

The trick with the target network is based on separating the network that is being trained from the network used to compute the training targets. We fix the parameters of the target network, which supplies the value estimates in the Bellman target, and change only the parameters of the network that learns the Q function. After a given number of iterations, the parameters of both networks are set to the same values and training continues in the same way. The error function in this case looks like \(\mathbb E_{s' \sim P_a(s, s')}\big[ (R_a(s, s') + \gamma \max_{a'}Q_{\theta^-}(s', a') - Q_\theta(s, a))^2 \big]\big|_{\theta = \theta_k},\) where $\theta^-$ are the parameters of the target network, which are updated less frequently than the parameters $\theta$ that estimate the value of Q.
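A minimal sketch of this trick, again with a linear stand-in for the network: only $\theta$ is trained, while $\theta^-$ supplies the targets and is overwritten with a copy of $\theta$ every `sync_every` steps (all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear stand-ins: theta is trained, theta_minus only computes targets.
theta = rng.normal(size=(2, 4))
theta_minus = theta.copy()
gamma, lr, sync_every = 0.99, 0.01, 100

def train_step(step, theta, theta_minus, s, a, r, s_next):
    target = r + gamma * np.max(theta_minus @ s_next)  # frozen parameters
    td_error = target - (theta @ s)[a]
    theta[a] += lr * td_error * s                      # only theta is updated
    if (step + 1) % sync_every == 0:
        theta_minus[...] = theta                       # periodic hard copy
    return td_error

for step in range(200):
    s, s_next = rng.normal(size=4), rng.normal(size=4)
    train_step(step, theta, theta_minus, s, int(rng.integers(2)), 0.0, s_next)
```

Between synchronizations the targets stay fixed, so the regression problem the network solves does not shift under it at every step.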

The essence of experience replay is that we remember quadruples $(s, a, r, s')$ of state, action, reward, and next state, and during training we randomly select transitions from this memory instead of always using the last transition. This removes the correlations between consecutive inputs.
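A minimal replay buffer can be implemented with a fixed-size deque; the capacity and batch size below are arbitrary:

```python
import random
from collections import deque

# Minimal experience-replay buffer (capacity and batch size are arbitrary).
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)   # oldest transitions fall out

    def push(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform random sample breaks correlations between consecutive steps
        return random.sample(list(self.memory), batch_size)

buffer = ReplayBuffer(capacity=100)
for t in range(500):
    buffer.push(t, t % 2, 0.0, t + 1)   # dummy transitions
batch = buffer.sample(32)
```

The bounded `deque` also acts as a crude forgetting mechanism: very old transitions, generated by a long-outdated policy, are discarded automatically.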

Deep Q-learning has become famous as a technique that is able to learn how to play Atari games at a level similar to human players, with only image information as input and rewards calculated as changes in the game’s score.


Deep Q-learning requires calculating the maximum over all actions - but this operation is expensive if the action space is continuous (imagine, for example, driving a car, where the actions are the steering-wheel angle and the pressure on the gas/brake pedal). For this reason, the DDPG (Deep Deterministic Policy Gradient) algorithm replaces this operation with a new network that directly returns the action to be performed. Let us denote this function with parameters $\theta$ as $\mu_\theta$; it is a function from the state space to the action space. At the same time, let us denote the function calculating the value of $Q$ with parameters $\phi$ as $Q_\phi$. DDPG uses both tricks mentioned above, so let us denote by $\theta^-$ and $\phi^-$ the parameters of the target networks for $\theta$ and $\phi$.

An update of the parameters $\phi$ of the network for $Q$ is then computed using the modified error function \(\mathbb E_{s' \sim P_a(s, s')}\big[ (R_a(s, s') + \gamma Q_{\phi^-}(s', \mu_{\theta^-}(s')) - Q_\phi(s, a))^2 \big]\big|_{\phi = \phi_k}.\) The idea of training the $\mu$ network is very simple - we want it to return actions that maximize $Q_\phi(s, a)$ in state $s$. So, using gradient ascent, we maximize the expression \(\mathbb E_s\big[ Q_\phi(s, \mu_\theta(s)) \big].\)
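The target computation - the part that replaces the maximum over actions - can be sketched with linear stand-ins for the target critic $Q_{\phi^-}$ and the target actor $\mu_{\theta^-}$ (all shapes and values below are hypothetical):

```python
import numpy as np

# Linear stand-ins for the target critic Q_phi and target actor mu_theta.
rng = np.random.default_rng(2)
state_dim, action_dim = 3, 1
phi_minus = rng.normal(size=(state_dim + action_dim,))   # target critic weights
theta_minus = rng.normal(size=(action_dim, state_dim))   # target actor weights
gamma = 0.99

def mu(theta, s):
    return np.tanh(theta @ s)          # continuous action in (-1, 1)

def q(phi, s, a):
    return phi @ np.concatenate([s, a])

def ddpg_target(r, s_next):
    # no max over actions: the target actor picks the next action directly
    a_next = mu(theta_minus, s_next)
    return r + gamma * q(phi_minus, s_next, a_next)

s_next = rng.normal(size=state_dim)
y = ddpg_target(r=1.0, s_next=s_next)
```

The `tanh` squashing is one common way to keep the deterministic action inside a bounded continuous range, matching e.g. a steering-wheel angle.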

Policy gradient methods

We can also view the goal of reinforcement learning as finding a policy parameterized by $\theta$ that maximizes the total reward $J(\theta) = \mathbb{E}[\sum_{t=0}^{T-1}r_{t+1}]$, i.e. the sum of the rewards the agent gets when following the policy in the environment. With a bit of derivation, it can be shown that the gradient of this total reward is $\nabla_\theta J(\theta) = \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)G_t$, where $\pi_\theta(a|s)$ is the probability of selecting action $a$ in state $s$ using the policy $\pi$ parameterized by $\theta$ and $G_t$ is the cumulative (discounted) reward from time $t$ to $T-1$.
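This gradient can be computed directly for a simple softmax policy over linear scores $\theta s$, for which $\nabla_\theta \log \pi_\theta(a|s)$ has a closed form; the episode below is fabricated and the dimensions are arbitrary:

```python
import numpy as np

# Softmax policy over linear scores theta @ s (toy dimensions).
rng = np.random.default_rng(3)
n_actions, n_features = 3, 4
theta = rng.normal(size=(n_actions, n_features))
gamma = 0.99

def policy(theta, s):
    z = theta @ s
    z -= z.max()                        # numerical stability
    p = np.exp(z)
    return p / p.sum()

def reinforce_gradient(episode):
    # episode: list of (s_t, a_t, r_{t+1}); G_t = sum_{k>=t} gamma^{k-t} r_{k+1}
    grad = np.zeros_like(theta)
    G = 0.0
    for s, a, r in reversed(episode):   # backward pass accumulates returns
        G = r + gamma * G
        p = policy(theta, s)
        dlog = -np.outer(p, s)          # d log pi(a|s) / d theta, softmax part
        dlog[a] += s                    # the chosen action's score
        grad += dlog * G
    return grad

episode = [(rng.normal(size=n_features), int(rng.integers(n_actions)), 1.0)
           for _ in range(5)]
grad = reinforce_gradient(episode)      # then ascend: theta += lr * grad
```

A useful property to check: for each state, the log-probability gradients summed over all actions cancel, so the columns of `grad` summed over the action axis are zero.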


The disadvantage of this formulation is that for large rewards $G_t$ the gradient estimates can have a large variance. For that reason, $G_t$ is often replaced by other expressions - one possibility is directly the value $Q(s, a)$, which is then learned similarly to the Q values in DQN. The so-called advantage is also popular; it is calculated as the difference between the value of executing action $a_t$ in state $s_t$ (i.e. the value $Q(s_t, a_t)$) and the value of the state $V(s_t)$: $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. So we calculate how much better it is to perform action $a_t$ compared to a generic action. To determine the advantage, we do not need to train networks for both $Q$ and $V$; one network for $V$ is enough, since from the relations between $Q$ and $V$ it can be deduced that the advantage can also be calculated as $A(s_t, a_t) = R_a(s_t, s_{t+1}) + \gamma V(s_{t+1}) - V(s_t)$. The method implemented in this way is called advantage actor-critic. The network for $V$ plays the role of a critic and tells how good the states are; the network for the policy $\pi$ (the actor) then chooses the actions to be performed.
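The advantage estimate from a single critic is then one line of arithmetic; a `done` flag for terminal states (where there is nothing to bootstrap from) is the only extra detail. The values below are made up:

```python
# One-critic advantage estimate: A(s_t, a_t) = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
gamma = 0.99

def advantage(r, v_s, v_s_next, done=False):
    # at a terminal state there is no next value to bootstrap from
    bootstrap = 0.0 if done else gamma * v_s_next
    return r + bootstrap - v_s

a1 = advantage(r=1.0, v_s=2.0, v_s_next=2.5)              # 1.0 + 0.99*2.5 - 2.0
a2 = advantage(r=0.0, v_s=3.0, v_s_next=0.0, done=True)   # 0.0 - 3.0
```

A positive advantage means the sampled action did better than the critic expected, so the policy gradient pushes its probability up; a negative one pushes it down.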

There is also the very popular A3C method - Asynchronous Advantage Actor-Critic. It replaces experience replay by playing multiple games at once; the weight updates are then averaged over the updates computed in all the independent games.