# Multi-agent Reinforcement Learning

The goal of reinforcement learning is to learn a behavior in an environment. We have an agent in an environment that can perform a number of actions. Each action somehow changes the environment (transforms it into a new state) and after performing an action the agent may get a reward. The goal of the agent is to maximize its overall reward. In multi-agent reinforcement learning, there are multiple agents in the environment at the same time. These agents can either cooperate (e.g. if solving a problem together) or compete (e.g. when learning how to play soccer).

Formally, reinforcement learning problems are described as Markov decision processes (MDP). An MDP is given by a tuple $(S, A, P, R)$, where $S$ is a finite set of environment states, $A$ is a (finite) set of actions, $P_a(s, s’)$ is a transition function that gives the probability that the environment will be transformed from state $s$ to $s’$ after the application of action $a$ and $R_a(s, s’)$ is a reward function that returns the immediate reward obtained from the environment by performing action $a$ in state $s$ and transforming the environment to state $s’$. Additionally, the transition function must satisfy the Markov condition, i.e. the probability of transition from one state to another after performing an action depends only on the current state and the action and not on any previous states.

Agent’s behavior can then be described by a policy $\pi: S \times A \to [0,1]$, where $\pi(s,a)$ gives the probability of performing an action $a$ in state $s$. The goal of reinforcement learning is to find a strategy $\pi$ such that it maximizes the total reward obtained by the agent $\sum_{t=0}^\infty \gamma^tR_{a_t}(s_t, s_{t+1})$, where $a_t = \pi(s_t)$ performed by the agent at step $t$ and $\gamma < 1$ is a discount factor (used mainly to ensure that the reward is finite).

## State value and action value

The value of a state $s$ while using a strategy $\pi$ is defined as $V^\pi(s) = \mathrm{E}[R] = \mathrm{E}[\sum_{t=0}^\infty \gamma^t r_t|s_0 = s]$, where $R$ is the total discounted reward and $r_t$ reward obtained at step $t$. Apart from the state value we will also consider the value of an action $a$ in a state $s$ if the strategy $\pi$ is followed afterwards. This will be denoted as $Q^\pi(s, a)$. An advantage of this model is that we do not need to have a model of the environment that describes how actions affect the environment. The agent can also learn this model while learning the policy.

The agent’s goal is thus finding a strategy $\pi^\star$ such as $V^{\pi^\star}(s) = \max_\pi V^\pi(s)$. The state (action) value of the optimal strategy will be denoted as $V^\star (s)$ ($Q^\star(s, a)$).

## Q-Learning

Q-learning is the classical algorithm for reinforcement learning. This algorithm aims to learn the function $Q$ defined above and represents it as a matrix with rows for each state and columns for each action. This matrix is updated anytime the agent performs an action in the environment and obtains a reward. The update is based on so called Bellman equations \(Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{s s'}^{a} \Big[ \mathcal{R}_{s s'}^{a} + \gamma \sum_{a'} \pi (s', a') Q^{\pi}(s', a') \Big].\) Specifically, whenever the agent performs an action $a_t$ in state $s$ and receives the reward $r$, the matrix $Q$ is updated in the following way: \(Q^{new}(s_{t},a_{t}) \leftarrow (1-\alpha) \cdot Q(s_{t},a_{t}) + \alpha \cdot \bigg( r_{t} + \gamma\cdot \max_{a}Q(s_{t+1}, a) \bigg),\)

where $\alpha$ is a learning rate and the action $a_t$ can be selected by a number of methods. The most common of these is the so called $\epsilon$-greedy strategy that selects the best action (according to $Q$) with probability $\epsilon$ and selects a random action with probability $1-\epsilon$. The randomness improves the exploration of the $Q$ learning method.

## Deep Q-Learning

A modern implementation of $Q$-learning is based on deep neural networks. In deep $Q$-learning a neural network is trained to approximate the matrix $Q$. It takes states as inputs and outputs the values of all actions in the given state. During training, the agent performs actions in the environment and remembers the states, actions and rewards received. Based on these, a training set is constructed and the network is trained to minimize the difference between $Q(s, a)$ and $R_a(s, s’) + \gamma max_{a’}Q_k(s’, a’)$. The loss function for this training is thus given by

\[\mathbb E_{s' \sim P_a(s, s')}\big[ (R_a(s, s') + \gamma max_{a'}Q_\theta(s', a') - Q_\theta(s, a))^2 \big]\|{\theta = \theta_k}.\]Remembering the previous action, state, and reward triples helps with the training as the network can use the information multiple times. Also, this stored experience (so called experience buffer) helps to make the inputs to the network uncorrelated, which improves the training. An additional trick is also often employed - there may be a problem with the fact that the same network is trained and used for action selection. Therefore, there are sometimes two networks, one is fixed and used for action selection, while the other one is trained. After a given number of epochs the fixed network is replaced by the trained one and the training continues.

## Evolutionary Algorithms

Apart from using deep $Q$-learning, it is also possible to train the neural network using evolutionary algorithms. In this case, we encode all the weights in the neural network as an individual and use an evolutionary algorithm for continuous optimization (e.g. an evolution strategy or differential evolution). The total reward obtained by the agent is used as the fitness function and is maximized. The main advantage of this approach is that it is easier to parallelize the training (in RL the evaluation of the environment can be the slowest step), and evolutionary algorithms also work well in cases with sparse rewards (i.e. the environment gives a reward only in some steps, in the extreme case only in the last one).

## Multi-agent reinforcement learning

Algorithms for reinforcement learning in multi-agent systems are often based on the algorithms for single agents described above. The simplest modification (especially usable in cooperative cases) is so called joint-action learning (JAL). In JAL, all agents are modelled as a single agent with tuples of actions – one for each of the agents. This means that the Q-learning algorithm can be used almost directly, however, the main disadvantage is that the number of actions grows exponentially with the number of agents.

Another popular technique is self-play, in self-play we simulate a number of agents, each of them makes decisions independently, however, they all share the same decision procedure (e.g. neural network) for action selection. In such a case, we actually have a set of identical agents. While this limits the expressivity of the possible strategies (all agents behave the same), it greatly reduces the complexity of the problem. Self-play is very often used in competitive environments. For example, if we want to train agents to play soccer, we can give one team a strategy (e.g. a randomly initialized neural network) and the other team learns to play against it. After some time, when the latter team is able to beat the former one, we copy the strategy of the better team to the other one and continue in training against this improved strategy. This also means we do not need to have a strategy to train against, we create this strategy (and improve it) during the training of the resulting strategy.

How to tell whether our strategy (for the whole system) is optimal? What does it even mean, for a strategy to be optimal? There are a few common criteria we can use. One of them is that the set of strategies for the agents should be a Nash equilibrium (i.e. no agent can improve their utility by unilaterally changing their strategy), another criterium might be Pareto-optimality (i.e. there is no other set of strategies that would give better utility to all strategies at the same time). Another common metrics of strategies are social welfare and social fairness. Social welfare is the sum of the utilities of the strategies, while social fairness is the product of these utilities. Therefore social fairness is maximized if all the agents have the same (or similar) utilities.

# Assignment

Your last assignment is to implement one of the multi-agent reinforcement learning algorithms for a simplified version of the level-based foraging problem. In this problem, the agents should collect objects from the environment. However, both agents and objects have levels and agents can collect an object only when the sum of levels of all agents adjacent to the object is higher than the level of the object. In our simplification, all agents have level 1, and all objects have integer levels between 1 and the number of agents in the system. We will also not check adjacent positions, but only the number of agents on the same position as the object. The items are automatically collected whenever the number of agents at the location of the object is higher than the level of the object.

The environment for this problem is a square grid and objects and agents are at random locations at the beginning. The agents get a reward equal to the level of the object for any collected object and also get -0.1 for any step in the simulation. Therefore, in order to maximize the reward, you should collect the objects as quickly as possible.

The implementation of this environment and an implementation of a simple example agents (with joint actions) is in the attached source codes. There is also a really simple visualization, where the agents are orange-red squares (darker colors for multiple agents at the same location) and the objects are blue squares (darker colors for objects with higher level).

You need to implement the two methods of the agent – `action`

and `reward`

. The `action`

method gets the state as a tuple of world map and agent locations (in the world map negative numbers represent agents - $-i$ means there are $i$ agents on the given location - and positive numbers are objects and their levels). The agent locations are in the same order in which you should return the list of actions for the agents (i.e. the $i$-th action will be used for the $i$-th agent in the locations list). There are only four possible actions - NORTH, SOUTH, WEST, and EAST - represented by numbers 0-3. The `reward`

function is called after every step in order to inform the agents about the reward they obtained in the previous step.

You can work on the assignment in teams of up to 4 students. However, the easiest solution (implementing a strategy manually for 5 points) and the re-implementation of the framework are available only for individuals (these are really simple). The solution for 8 points should be doable with Q-learning, however some clever representation of the state (that reduces the number of states) may be needed, as well as using more clever actions (like “move closest agents towards a location”).

The deadline for the assignment is set far in the future, so that you have enough time to work on it (and do not have to work on it now). This does not mean it is that hard - it just gives you an opportunity to get some points later, if you need them.

**Deadline:** September 19, 2021

**Points:**

**5 points**- any non-trivial strategy solving the problem (does not have to use RL, available only to individuals)**8 points**- simple RL strategy able to solve smaller problems (5x5 environment, max 2 agents)**10 points**- any RL strategy able to solve even larger instances of the problem with multiple agents**+2-3 bonus points**- implementation of the problem in another popular programming language (I am mainly interested in Java and C#, a better Python implementation is also welcome) so that it can be used for the assignment in following years (only one person can get the bonus for each language - send me an e-mail before you start so I can reserve the selected language for you)

You can work in teams of up to 4 students in order to find a RL based strategy. Simple, scripted strategies (for 5 points) will be accepted only from individuals (not teams).