**3. Improvement of Cooperative Action by Rewards Distribution**

**3.1 Cooperative agents**

In a multi-agent system, the environment changes from static to dynamic because multiple autonomous agents exist. An agent engaged in cooperative action decides its actions by referring not only to its own information and purpose but also to those of other agents [16]. Cooperative agents are acquired by sharing sensation, sharing episodes, and sharing learned policies [17, 19–21]. Cooperative actions are important not only in situations where multiple agents have to work together to accomplish a common goal but also in situations where each agent has its own goal [22].

In this chapter, the multi-agent systems are composed of agents with different behaviors. One is to perform relief (*relief agents*), and the other is to remove obstacles (*removing agents*). Cooperation was achieved by giving different rewards to different behaviors.

It is necessary to have the multi-agent systems learn efficient rescue with the condition of the injured taken into consideration.

**3.2 Reward distribution with consideration of condition of the injured**

Prior studies used reward distribution with the reward value differing in accordance with the agent action but gave no consideration to the condition of the injured.

In this chapter, we propose three types of reward distribution as methods for obtaining cooperative action of injured rescue and obstacle removal in accordance with the urgency of the condition of the injured.

*3.2.1 Method 1: reward distribution responding to the condition of the injured*

A conferred reward is high in value when an injured person in a condition of high urgency is rescued and decreases in value for those in less urgent conditions. Thus, *Rr* > *Ry* > *Rg* > *Rb*, where *Rr*, *Ry*, *Rg*, and *Rb* are the reward values for the rescue of injured persons in the red, yellow, green, and black condition categories, respectively.

*3.2.2 Method 2: reward distribution based on the contribution degree*

In Method 2, the reward value reflects the time spent by the rescue agent as the contribution degree.

With *R* as the basic reward value when the rescue agent completes the rescue of an injured person, *C* as the contribution degree, and *λ* as a weighting factor, the reward *r* earned by the rescue agent in learning is as given by Eq. (2). A large *λ* results in a reward that is greatly augmented relative to the basic reward, according to the contribution degree.

*r* = (1 + *λC*)*R* (2)

The assessed contribution degree *C* increases with decreasing time spent in rescuing the injured, as shown in Eq. (3), in which *Te* is the time of completion of rescue of all the injured by the rescue agents and *Ti* is the time spent by an agent to rescue an injured person.

*C* = *Ti*/*Te* (3)

**4. Experimental results and discussion**

### **4.1 Experimental conditions**
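As a concrete reference for the reward scheme used in the experiments, Eqs. (2) and (3) can be sketched in code. This is a minimal sketch: the numeric reward values and the weighting factor are illustrative assumptions, and only the ordering *Rr* > *Ry* > *Rg* > *Rb* and the two equations come from the text.

```python
# Method 1: the reward depends on the triage condition of the rescued person.
# The ordering Rr > Ry > Rg > Rb is from the chapter; the numbers are assumed.
CONDITION_REWARD = {"red": 100.0, "yellow": 50.0, "green": 20.0, "black": 10.0}

def contribution_degree(t_i: float, t_e: float) -> float:
    """Eq. (3): C = Ti / Te, where Ti is the time this agent spent on a rescue
    and Te is the time at which rescue of all the injured was completed."""
    return t_i / t_e

def method2_reward(base_reward: float, c: float, lam: float) -> float:
    """Eq. (2): r = (1 + lambda * C) * R. A larger lambda augments the reward
    more strongly according to the contribution degree."""
    return (1.0 + lam * c) * base_reward
```

For example, with an assumed base reward of 100 for a red-category rescue, `method2_reward(CONDITION_REWARD["red"], contribution_degree(20.0, 100.0), lam=0.5)` yields 110.0.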

In the study presented in this chapter, we experimented on obtaining cooperative agent action for efficient rescue in accordance with the condition of the injured, together with obstacle removal, using the three proposed reward distributions. We assigned the injured and the obstacle transport destinations to one cell each on the field shown in **Figure 3**, and set the numbers of agents, injured persons, and obstacles as listed in **Table 1**. The mean of five simulation trials was taken as the result.
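The averaging over five trials can be sketched as follows. This is only an illustration of the evaluation protocol: `run_trial` is a hypothetical stand-in for the actual rescue simulation, and the step counts it returns are dummy values, not results from the chapter.

```python
import random

NUM_TRIALS = 5  # the chapter reports the mean of five simulation trials

def run_trial(seed: int) -> int:
    """Placeholder for one simulation trial; returns the number of steps all
    agents needed for task completion. A real trial would run the learning
    episodes on the grid field of Figure 3 with the Table 1 settings."""
    rng = random.Random(seed)
    return rng.randint(400, 600)  # dummy step count, for illustration only

def mean_steps() -> float:
    """Mean steps to completion over the five trials (fixed seeds)."""
    results = [run_trial(seed) for seed in range(NUM_TRIALS)]
    return sum(results) / len(results)
```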

### **4.2 Effects of reward distribution timing**

We investigated the effects of three patterns of reward distribution timing on the efficiency of learning injured rescue. In Pattern 1, the reward is given when an injured person or obstacle is discovered. In Pattern 2, the reward is given when an injured person or obstacle is transported to the appropriate location. In Pattern 3, rewards are given twice: at the stage of discovering an injured person or obstacle and at the stage where transportation is completed.
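The three timing patterns can be sketched as a single dispatch function. This is a hypothetical helper: the event names `"discover"` and `"transport"` are assumptions for illustration, not identifiers from the chapter.

```python
def timed_reward(pattern: int, event: str, reward: float) -> float:
    """Return the reward conferred for `event` under the given timing pattern.

    Pattern 1: reward only when an injured person/obstacle is discovered.
    Pattern 2: reward only when transport to the proper location completes.
    Pattern 3: reward at both stages, so discovery-to-transport can be
               learned as one task.
    """
    if pattern == 1:
        return reward if event == "discover" else 0.0
    if pattern == 2:
        return reward if event == "transport" else 0.0
    if pattern == 3:
        return reward if event in ("discover", "transport") else 0.0
    raise ValueError(f"unknown pattern: {pattern}")
```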

The results of an experiment comparing the three reward distribution timing patterns are shown in **Figure 5**. The horizontal axis represents the episodes, and the vertical axis represents the number of steps needed for task completion by all agents. These results indicate that Pattern 3 allowed completion of the tasks in a smaller number of steps than did Patterns 1 and 2, which in turn indicates that efficient rescue and removal was learned by conferring rewards in two stages, leading the agents to regard the course from discovery to transport as one task. We therefore applied Pattern 3 in the subsequent experiments.

**Table 1.** *Experimental conditions.*

**Figure 5.** *Results of the experiment comparing the three reward distribution timing patterns.*
