**3. Improvement of Cooperative Action by Rewards Distribution**

## **3.1 Cooperative agents**

In a multi-agent system, the environment changes from static to dynamic because multiple autonomous agents exist. An agent engaged in cooperative action decides its actions by referring not only to its own information and purpose but also to those of other agents [16]. Cooperative behavior is acquired by sharing sensation, sharing episodes, and sharing learned policies [17, 19–21]. Cooperative actions are important not only in situations where multiple agents have to work together to accomplish a common goal but also in situations where each agent has its own goal [22].

In this chapter, the multi-agent system is composed of agents with different behaviors. One type performs relief (*relief agents*), and the other removes obstacles (*removing agents*). Cooperation is achieved by giving different rewards to these different behaviors.
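This role-dependent reward assignment can be sketched as a simple mapping. The role names mirror the chapter's relief/removing agents, but the task names and numeric reward values below are illustrative assumptions, not values from the chapter:

```python
# Role-dependent reward sketch: each behavior type earns a reward only for
# its own task. Task names and numeric values are illustrative assumptions.
TASK_REWARDS = {
    "relief": {"rescue_injured": 10.0},    # relief agents rescue the injured
    "removing": {"remove_obstacle": 4.0},  # removing agents clear obstacles
}

def reward(role: str, completed_task: str) -> float:
    """Reward an agent of the given role earns for a completed task."""
    return TASK_REWARDS.get(role, {}).get(completed_task, 0.0)
```

Because each role is rewarded only for its own task type, the two behaviors are shaped independently even though the agents share one environment.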

## **3.2 Reward distribution with consideration of condition of the injured**

It is necessary for the multi-agent system to learn efficient rescue with the condition of the injured taken into consideration.

Prior studies used reward distribution in which the reward value differed according to the agent's action, but they gave no consideration to the condition of the injured.

In this chapter, we propose three types of reward distribution as methods for obtaining cooperative action of injured rescue and obstacle removal in accordance with the urgency of the condition of the injured.

#### *3.2.1 Method 1: reward distribution responding to the condition of the injured*

A conferred reward is high in value when an injured person in a condition of high urgency is rescued and decreases in value for those in less urgent conditions. Thus, *Rr* > *Ry* > *Rg* > *Rb*, where *Rr*, *Ry*, *Rg*, and *Rb* are the reward values for the rescue of injured persons in the red, yellow, green, and black condition categories, respectively.
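Method 1 can be sketched as a lookup keyed by triage category. The numeric values below are illustrative assumptions; only the ordering *Rr* > *Ry* > *Rg* > *Rb* comes from the text:

```python
# Method 1 sketch: reward by triage category of the rescued person.
# The specific values are illustrative; only the ordering
# Rr > Ry > Rg > Rb is prescribed by the method.
TRIAGE_REWARD = {
    "red": 8.0,     # Rr: highest urgency, highest reward
    "yellow": 4.0,  # Ry
    "green": 2.0,   # Rg
    "black": 1.0,   # Rb: lowest reward
}

def method1_reward(category: str) -> float:
    """Reward conferred when an injured person of `category` is rescued."""
    return TRIAGE_REWARD[category]
```

Any value assignment respecting the ordering implements the method; the gap sizes between categories are a design choice left open here.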

#### *3.2.2 Method 2: reward distribution based on the contribution degree*

In Method 2, the reward value reflects the time spent by the rescue agent as the contribution degree.

With *R* as the basic reward value when the rescue agent completes the rescue of an injured person, *C* as the contribution degree, and *λ* as a weighting factor, the reward *r* earned by the rescue agent in learning is given by Eq. (2). A large *λ* results in a reward that is greatly augmented relative to the basic reward according to the contribution degree.

Assessed contribution degree *C* increases with decreasing time spent in rescuing the injured, as shown in Eq. (3), in which *Te* is the time of completion of rescue of all the injured by the rescue agents and *Ti* is the time spent by an agent to rescue an injured person.

$$r = (1 + \lambda C) R \tag{2}$$

$$C = T_i / T_e \tag{3}$$
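Eqs. (2) and (3) translate directly into code. The helper names and the sample numbers below are illustrative, not taken from the chapter:

```python
def contribution_degree(t_i: float, t_e: float) -> float:
    """Eq. (3): C = T_i / T_e, where T_i is the time an agent spent rescuing
    an injured person and T_e is the time at which all rescues were complete."""
    return t_i / t_e

def method2_reward(base_r: float, lam: float, c: float) -> float:
    """Eq. (2): r = (1 + lambda * C) * R, the contribution-weighted reward."""
    return (1.0 + lam * c) * base_r

# Example with illustrative numbers: R = 8, lambda = 0.5, T_i = 50, T_e = 100
c = contribution_degree(50.0, 100.0)  # C = 0.5
r = method2_reward(8.0, 0.5, c)       # r = (1 + 0.25) * 8 = 10.0
```

With *λ* = 0 the formula reduces to the basic reward *R*, so *λ* directly controls how strongly the contribution degree modulates the payoff.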

#### *3.2.3 Method 3: reward distribution by the contribution degree responding to the condition of the injured*

In Eq. (2), the basic reward value *R* at the time of completion of each task takes on one of the values *Rr* > *Ry* > *Rg* > *Rb* according to the condition of the injured person.

**4. Experimental results and discussion**

**4.1 Experimental conditions**

In the study presented in this chapter, we experimented on obtaining cooperative action by agents for efficient rescue in accordance with the condition of the injured and obstacle removal, using the three proposed reward distributions. We assigned the injured and obstacle transport destinations to one cell each on the field shown in **Figure 3**, with the numbers of agents, injured persons, and obstacles as listed in **Table 1**. The mean of five simulation trials was taken as the result.

| Setting | Item | Value |
| --- | --- | --- |
| The setting of field | Field size | 10 × 10 |
| | Number of removable obstacles | 10 |
| | Number of non-removable obstacles | 3 |
| The setting of agent | Number of rescue agents | 2 |
| | Number of clearing agents | 2 |
| | Learning rate α | 0.1 |
| | Discount rate γ | 0.9 |
| | Greedy policy ε | 0.1 |
| The number of injured individuals | Red | 3 |
| | Yellow | 3 |
| | Green | 3 |
| | Black | 3 |

**Table 1.**
*Experimental conditions.*

*Improvement of Cooperative Action for Multi-Agent System by Rewards Distribution. DOI: http://dx.doi.org/10.5772/intechopen.85109*

**4.2 Effects of reward distribution timing**

We investigated the effects of three patterns of reward distribution timing on the efficiency of learning injured rescue. In Pattern 1, the reward is given when an injured person or obstacle is discovered. In Pattern 2, the reward is given when an injured person or obstacle is transported to the appropriate location. In Pattern 3, rewards are given twice: at the stage of discovering an injured person or obstacle and at the stage where transportation is completed.

The results of an experiment comparing the three reward distribution timing patterns are shown in **Figure 5**. The horizontal axis represents the episodes, and the vertical axis represents the number of steps for task completion by all agents. These results indicate that Pattern 3 allowed completion of the tasks in a smaller number of steps.
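The learning parameters listed in Table 1 (learning rate α = 0.1, discount rate γ = 0.9, greedy policy ε = 0.1) match a standard tabular Q-learning setup with an ε-greedy policy. Assuming that is the underlying learner, a minimal sketch of the update rule under those settings (the state/action encoding is a placeholder):

```python
import random
from collections import defaultdict

# Values from Table 1: learning rate, discount rate, exploration rate.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

class QLearner:
    """Minimal tabular Q-learning agent with an epsilon-greedy policy."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.q = defaultdict(float)  # maps (state, action) -> estimated value

    def choose(self, state):
        # Epsilon-greedy: explore uniformly with probability epsilon,
        # otherwise pick the action with the highest current Q-value.
        if random.random() < EPSILON:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += ALPHA * (r + GAMMA * best_next - self.q[(s, a)])
```

The rewards produced by Methods 1-3 would enter this update through the `r` argument; everything else about the learner is independent of which reward distribution is used.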
