A State-of-the-Art Survey on Various Domains of Multi-Agent Systems and Machine Learning

*Aida Huerta Barrientos and Alejandro Nila Luevano*

#### **Abstract**

Multi-agent systems (MASs) are defined as groups of interacting entities, or agents, that share a common environment which changes over time and that have capabilities of perception and action; the mechanisms for their coordination provide a modern perspective on systems that were traditionally regarded as centralized. The main characteristics of agents are learning and adaptation. In the last few years, MASs have received tremendous attention from scholars in different fields. However, there are still challenges faced by MASs and their integration with machine learning (ML) methods. The primary goal of this study is to provide a broad review of current developments in the field of MASs combined with ML methods. First, we present features of MASs from the ML perspective. Second, we provide a classification of applications of MASs combined with ML methods. Third, we present density maps of applications in e-learning, manufacturing, and commerce. We expect this study to serve as a comprehensive resource for researchers and practitioners in the area.

**Keywords:** machine learning, multi-agents, simulation, optimization, neural networks

#### **1. Introduction**

At the beginning of the 1990s, agent-based programming became an important part of simulation [1]. Although there is currently no formal definition of what an agent is, the term is used to describe self-contained programs that can control their own actions based on perceptions of their operating environment [2]. The goal of agent-based programming is to create programs that intelligently interact with their environment. The software used to program agents has its origins in the field of artificial intelligence, especially in the subfield of distributed artificial intelligence [3, 4], whose objective is the study of the properties of agents and the design of interaction networks between them.

In the same decade, as suggested in Ref. [5], computational agents were typically characterized by:

• Autonomy: Agents have direct control over their actions and internal state.

• Social ability: Agents interact with other agents, and possibly humans, via some kind of communication language.

• Reactivity: Agents perceive their environment and respond in a timely fashion to changes that occur in it.

• Proactiveness: Agents do not simply act in response to their environment; they are able to exhibit goal-directed behavior by taking the initiative.


A typical agent-based model contains the following elements [6]: a set of agents with their attributes and behaviors, a set of agent relationships and methods of interaction, and the agents' environment.


In most agent-based models, agents are able to move within their environment and use sensors to perceive their local neighbors. Communication between agents is usually done by exchanging messages: agents must be able to listen to messages coming from their environment and to send messages to it.
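The perceive-communicate-act loop described above can be sketched as a minimal agent class. This is a hypothetical illustration: the class names, the one-dimensional positions, and the sensing radius are assumptions for the example, not taken from any surveyed work.

```python
import random

class Agent:
    """Minimal agent: perceives local neighbors, exchanges messages, then acts."""

    def __init__(self, name, position):
        self.name = name
        self.position = position
        self.inbox = []  # messages received from the environment

    def perceive(self, agents, radius=1.5):
        """Sense local neighbors within a given radius (the 'sensor')."""
        return [a for a in agents
                if a is not self and abs(a.position - self.position) <= radius]

    def send(self, other, content):
        """Send a message to another agent via its inbox."""
        other.inbox.append((self.name, content))

    def step(self, agents):
        """One simulation step: perceive, communicate, then move."""
        for neighbor in self.perceive(agents):
            self.send(neighbor, f"hello from {self.name}")
        self.position += random.choice([-1, 0, 1])

agents = [Agent("a", 0), Agent("b", 1), Agent("c", 10)]
for agent in agents:
    agent.step(agents)
print([(a.name, len(a.inbox)) for a in agents])
```

Agents "a" and "b" start within sensing range of each other, so a message is exchanged, while the distant "c" neither sends nor receives any.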

#### **1.1 Multiagent systems**

According to Garro et al. [7], a system comprising a number of possibly interacting agents is called a *multiagent system* (*MAS*). In this direction, *MASs* are defined as groups of multiple autonomous interacting entities, or agents, that share a common environment which changes over time and that have capabilities of perception and action; the mechanisms for their coordination provide a modern perspective on systems that were traditionally regarded as centralized. The main characteristics of agents are learning and adaptation. The main feature achieved when developing multi-agent systems is flexibility, since a multi-agent system can be added to, modified, and reconstructed without the need for detailed rewriting of the application [7]. Cooperation is another feature that has been studied in *MASs*. In this direction, as suggested by Al-Jumaily and Al-Jaafreh [8], *MASs* are divided into theory and application phases. On the one hand, taxonomy, cooperation structure, cooperation-forming procedures, and related topics belong to the theory phase. On the other hand, mobile agent cooperation, information gathering, sensor information, and communication belong to the application phase.

#### **1.2 Machine learning**

*A State-of-the-Art Survey on Various Domains of Multi-Agent Systems and Machine Learning DOI: http://dx.doi.org/10.5772/intechopen.107109*

Machine learning (ML) is a subset of artificial intelligence (AI) concerned with the development of algorithms that allow a machine to learn via inductive inference from observation data representing incomplete information about statistical phenomena [9]. To carry out the learning process, an algorithm is applied to examples of the task we want to solve (data), letting the computer find patterns and make inferences that optimize decision-making according to a user-defined objective [10]. Based on the training strategy, ML can be divided into three classical categories with different learning approaches: supervised learning, unsupervised learning, and reinforcement learning [10]. The first includes classification and regression tasks; in the second, the most widely used task is clustering; and the third consists of training a model on a series of actions that lead to a particular outcome, where the system receives rewards for performing well and punishments for performing poorly directly from its environment [10].
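The reward-and-punishment loop of the third category can be made concrete with a tabular Q-learning sketch on a toy corridor environment. Everything here is illustrative: the states, rewards, and hyperparameters are assumptions chosen for the example, not drawn from the surveyed literature.

```python
import random

random.seed(42)

N_STATES, ACTIONS = 5, (-1, +1)    # corridor states 0..4; move left or right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

# Q-table: estimated return for each (state, action) pair
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment: +1 reward at the right end, small penalty elsewhere."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else -0.01
    return nxt, reward, nxt == N_STATES - 1

for _ in range(200):                       # training episodes
    state, done = 0, False
    while not done:
        if random.random() < EPS:          # explore a random action
            action = random.choice(ACTIONS)
        else:                              # exploit the best known action
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

# After training, the greedy policy moves right in every non-terminal state
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

The agent is never told the right answer; it discovers the "move right" policy purely from the reward signal, which is exactly what distinguishes reinforcement learning from the supervised setting.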

In recent years, MASs integrated with ML have received tremendous attention from scholars in fields such as Computer Science, Engineering, Mathematics, Materials Science, Neuroscience, Energy, Physics and Astronomy, Social Sciences, Environmental Sciences, and Business, Management, and Accounting. An overview of MASs integrated with ML in these fields is presented in the following sections. MASs have been used in the areas of e-learning, manufacturing, and commerce, combining mathematical methods, optimization methods, Markov processes, learning algorithms, and artificial intelligence techniques. The reinforcement learning method has been applied jointly with MASs in the e-learning, manufacturing (multi-agent reinforcement learning), and commerce (machine learning and learning automata) areas, while deep reinforcement learning and adaptive learning have been applied jointly with MASs in the manufacturing and commerce areas. However, there are still challenges faced by MASs and their integration with ML. The primary goal of this study is to provide a broad review of current developments in the field of MASs combined with ML methods.

This chapter is prepared as follows: the methodology followed to carry out the state-of-the-art survey on various domains of multi-agent systems and machine learning is described in Section 2. Features of MASs considering the ML perspective are presented in Section 3. A density map of applications in e-learning, manufacturing, and commerce is provided in Section 4. Concluding remarks are drawn in Section 5.

#### **2. Methodology**

We followed the methodology proposed by Machi and McEvoy [11]. **Figure 1** shows the steps for conducting the systematic literature review.

#### **2.1 Step 1. Select a topic**

The topic for this study was MASs combined with ML methods.

#### **2.2 Step 2. Search the literature**

The data for this study were extracted from the Scopus database (accessed in July 2022) based on the string *multi-agent* AND *systems* AND *machine* AND *learning*. In this case, 2869 relevant results were found. The publication date of the documents was then limited to 2017–2022, the last five years, and the document type was limited to articles and book chapters. In this direction, the search was based on the following string:

#### **Figure 1.** *The literature review model. Machi and McEvoy [11].*

**Figure 2.**

*Trends in the number of articles and book chapters published annually between 2017 and 2022.*

TITLE-ABS-KEY (*multiagent* AND *systems* AND *machine* AND *learning*) AND (LIMIT-TO (PUBYEAR, *2022*) OR LIMIT-TO (PUBYEAR, *2021*) OR LIMIT-TO (PUBYEAR, *2020*) OR LIMIT-TO (PUBYEAR, *2019*) OR LIMIT-TO (PUBYEAR, *2018*) OR LIMIT-TO (PUBYEAR, *2017*)) AND (LIMIT-TO (DOCTYPE, "*ar"*) OR LIMIT-TO (DOCTYPE, "*ch"*)). In this case, 552 relevant articles and book chapters were found. All the information was exported in RIS format to VOSviewer software [12, 13] to generate the author collaboration and word co-occurrence networks. **Figure 2** shows the trends in the number of articles and book chapters published annually between 2017 and 2022. **Figure 3** depicts the fluctuating trend in the number of articles and book chapters published annually by the top-five sources, which peaked in 2019 with IEEE Access and has gradually declined since.
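As an illustration of this step, the per-year counts plotted in **Figure 2** could be reproduced from the exported RIS data with a short script. This is a sketch assuming a standard Scopus RIS export, in which the `PY` tag carries the publication year; the sample records below are fabricated for the example.

```python
from collections import Counter

def count_by_year(ris_text):
    """Count RIS records per publication year (PY tag)."""
    years = Counter()
    for line in ris_text.splitlines():
        if line.startswith("PY  -"):               # RIS publication-year tag
            year = line.split("-", 1)[1].strip()[:4]
            years[year] += 1
    return years

# Fabricated three-record RIS sample (journal article and book chapter types)
sample = """TY  - JOUR
PY  - 2021
ER  -
TY  - CHAP
PY  - 2021/03//
ER  -
TY  - JOUR
PY  - 2019
ER  -
"""
print(dict(count_by_year(sample)))
```

Feeding the real 552-record export through the same function would yield the annual totals charted in the figure.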

#### **2.3 Step 3. Develop the argument**

In this chapter, a broad review of the current developments in the field of MASs combined with ML methods is presented.


#### **Figure 3.**

*Trends in the number of articles and book chapters published annually by the top-five sources.*

#### **Figure 4.**

*Documents by author published between 2017 and 2022 on multi-agent systems and machine learning applications.*

#### **2.4 Step 4. Survey the literature**

**Figure 4** summarizes the top ten authors in terms of contribution to the number of papers and book chapters published on multi-agent systems and machine learning applications. Yu [14–18] is the main contributor in the area, accounting for five articles as coauthor, followed by Li [19–22], Mohammed [23–26], Wong [27–30], Yap [27–30], and Yaw [27–30], with four articles each. **Figure 5** presents the coauthor collaboration network from articles and book chapters published during the 2017–2022 period on multi-agent systems and machine learning applications, based on full counting. Each circle represents an author. The size of a circle reflects the number of publications of the corresponding author. The distance between two circles indicates the relatedness of the authors [13]. Colors represent clusters of authors with strong coauthorship links. A total of 1,718 different authors appear in the collaboration network.

The top-ten fields in terms of applications of multi-agent systems and machine learning during the 2017–2022 period are shown in **Figure 6**. Although Computer Science is the main contributing field, it accounts for only 37.4%, followed by Engineering (26.9%) and Mathematics (7.8%).

#### **Figure 5.**

*Visualization by VOSviewerTM software of coauthor collaboration network from documents published between 2017 and 2022 on multi-agent systems and machine learning applications.*

#### **Figure 6.**

*Percentage of documents by subject fields that were published between 2017 and 2022.*

The visualization in **Figure 7**, based on full counting, shows the word co-occurrence network, in which distinct groups of keywords can easily be distinguished. Each circle represents a keyword, and the size of a circle reflects the number of occurrences of the corresponding keyword. A total of 5,155 distinct words present in the titles and abstracts of the 552 documents published between 2017 and 2022 were analyzed to build the co-occurrence network, generating clusters associated with the research topic of multi-agent systems and machine learning applications. The colors indicate the time evolution of the clusters: the blue cluster contains research topics published in 2019, the green cluster topics published in 2020, and the yellow cluster topics published in 2021 and later.
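The counting that underlies such a co-occurrence map can be sketched in a few lines. This is a simplified stand-in for VOSviewer's term extraction; the toy titles and the vocabulary are illustrative, not taken from the actual corpus.

```python
from itertools import combinations
from collections import Counter

def cooccurrence(documents, vocabulary):
    """Count term occurrences and how often pairs of terms share a document."""
    occurrences = Counter()
    pairs = Counter()
    for doc in documents:
        text = doc.lower()
        present = sorted(t for t in vocabulary if t in text)
        occurrences.update(present)
        pairs.update(combinations(present, 2))  # each unordered pair once per doc
    return occurrences, pairs

# Toy document titles standing in for the 552 titles and abstracts
docs = [
    "Multi-agent reinforcement learning for traffic control",
    "Deep reinforcement learning in smart grid energy management",
    "A multi-agent system for traffic simulation",
]
vocab = {"multi-agent", "reinforcement learning", "traffic", "smart grid"}
occ, pairs = cooccurrence(docs, vocab)
print(occ["traffic"], pairs[("multi-agent", "traffic")])
```

In the map, `occurrences` drives circle size and `pairs` drives link strength; VOSviewer additionally normalizes link strengths and lays the terms out so that strongly linked terms sit close together.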


#### **Figure 7.**

*Visualization by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications.*

#### **2.5 Step 5. Critique the literature**

The critique of the scientific literature on MASs integrated with ML in the Computer Science, Engineering, Mathematics, Materials Science, Neuroscience, Energy, Physics and Astronomy, and Social Sciences fields is presented in the following sections, highlighting their potentiality and limitations.

#### **3. Review of state-of-the-art**

#### **3.1 Computer science**

Computer Science turned out to be the top contributing field as indicated by the retrieved data, with 440 documents on multi-agent systems and machine learning applications during the 2017–2022 time period. **Figure 8** shows the terms with high co-occurrence frequencies in this field, distinguished by a color palette from blue to yellow based on the publication year. In **Figure 8**, each term is represented by a circle, where the diameter of the circle and the size of its label represent the frequency of the term, and its proximity to another term indicates the degree of relatedness of the two terms.

#### *3.1.1 Potentiality*

The network in **Figure 8** clearly indicates that multi-agent systems and machine learning have received tremendous attention from scholars. In 2019, the research focused on topics such as embedded systems, robots, and forecasting, whereas in 2020 it shifted to intelligent agents, decision-making systems, optimization, IoT, stochastic systems, scheduling, wireless sensor networks, computer games, and the energy efficiency of systems, which were broadly studied.

#### **Figure 8.**

*Visualization by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in the Computer Science field.*

More recently, in 2021, resource allocation, heuristic algorithms, decision trees, real-time systems, distributed optimization, and predictive analysis were applied jointly with multi-agent systems to study street traffic, manufacturing, and cognitive radio.

#### *3.1.2 Limitations*

ML algorithms and multi-agent systems showed a potentially significant combination/integration in the Computer Science field. However, the studies in this field had several limitations. First, they covered a small number of applications. Second, the ML algorithms were largely limited to deep reinforcement learning. Finally, it is unclear which software is used to integrate multi-agent systems and ML algorithms.

#### **3.2 Engineering**

Engineering turned out to be the second-largest contributor as indicated by the retrieved data, with 317 documents on MASs and ML applications between 2017 and 2022. **Figure 9** shows the terms with high co-occurrence frequencies in this field. In **Figure 9**, each term is represented by a circle, where the diameter of the circle and the size of its label represent the frequency of the term, and its proximity to another term indicates the degree of relatedness of the two terms.

#### *3.2.1 Potentiality*

As depicted in **Figure 9**, MAS researchers are very enthusiastic about learning algorithms for studying vehicular networks, traffic control, 5G mobile communication networks, image communication systems, multi-robot systems, wireless sensor networks, manufacturing, digital storage, electric power utilization, cyber-physical systems, embedded systems, electric vehicles, electric power transmission, and e-learning, and have proposed integrating reinforcement learning and deep reinforcement learning into MASs.

#### **Figure 9.**

*Visualization by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in the Engineering field.*

#### *3.2.2 Limitations*

MASs interacting with reinforcement learning and deep reinforcement learning algorithms showed potentially significant applications in the Engineering field. However, there were several limitations, such as a small minority of optimization techniques, reliance on traditional simulation approaches, and little information about the integration of optimization algorithms with simulation software.

#### **3.3 Mathematics**

The Mathematics field turned out to be the third contributor as indicated by the retrieved data, with 92 documents on MASs and ML applications between 2017 and 2022.

#### *3.3.1 Potentiality*

The number of publications has increased noticeably in the past two years. The scientific landscape of the main research areas of MASs and ML applications in the Mathematics field is explored bibliometrically by way of the co-occurrence term map presented in **Figure 10**. In **Figure 10**, each term is represented by a circle, where the diameter of the circle and the size of its label represent the frequency of the term, and its proximity to another term indicates the degree of relatedness of the two terms.

#### **Figure 10.**

*Visualization by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in the Mathematics field.*

The most used theoretical tools were deep learning, neural networks, intelligent agents, Markov processes, scheduling algorithms, Q-learning algorithms, decision trees, and game theory, while the most important application areas were decision making, job scheduling, power control, cloud computing, energy efficiency, fertilizers, e-commerce, speech processing, cooperative behaviors, quality of service, resource allocation, forecasting, and manufacturing.

#### *3.3.2 Limitations*

A considerable amount of literature has been published on multi-agent systems and machine learning applications in the Mathematics field. However, the reinforcement learning method has only recently been used in specific areas such as manufacturing and job scheduling to support decision making. Optimization algorithms based on swarm intelligence were used in the past, as were supervised learning methods. Much of the recent literature in this field has limited applications centered on cloud computing.

#### **3.4 Materials science**

The Materials Science field turned out to be the fourth contributor as indicated by the retrieved data, with 63 documents on MASs and ML applications between 2017 and 2022. **Figure 11** shows the mapping and clustering of terms over time. A co-occurrence network was built using words present in the titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in the Materials Science field. In **Figure 11**, each term is represented by a circle, where the diameter of the circle and the size of its label represent the frequency of the term, and its proximity to another term indicates the degree of relatedness of the two terms. The network can be seen to contain four clusters of co-occurring terms.


#### **Figure 11.**

*Visualization by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in the Materials Science field.*

The royal blue cluster is predominantly associated with documents published in 2019, the turquoise cluster with documents published in 2020, the yellow-green cluster with documents published in 2021, and the yellow cluster with documents recently published in 2022.

#### *3.4.1 Potentiality*

For the period of 2017–2022, multi-agent reinforcement learning, deep learning, deep reinforcement learning, complex networks, multi-agent systems, stochastic systems, and game theory remained among the top major topics. It has been demonstrated that deep reinforcement learning methods have been combined with game theory and deep neural network techniques. Additionally, complex networks have considered computational complexity.

#### *3.4.2 Limitations*

Although there are many studies, research on multi-agent systems and machine learning applications in the Materials Science field remains limited. First, the application areas have been centered on traffic studies, scheduling, the Internet of Things, crowd simulation, and, more recently, energy efficiency and wireless networks. Second, the learning algorithms are mainly based on deep learning. Finally, optimization is mainly based on the particle swarm algorithm.

#### **3.5 Neuroscience**

The Neuroscience field turned out to be the fifth contributor as indicated by the retrieved data, with 33 documents on MASs and ML applications between 2017 and 2022. **Figure 12** depicts the terms with high co-occurrence frequencies in this field.

#### **Figure 12.**

*Visualization by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in the Neuroscience field.*

In **Figure 12**, each term is represented by a circle, where the diameter of the circle and size of its label represent the frequency of the term, and its proximity to another term indicates the degree of relatedness of the two terms.

#### *3.5.1 Potentiality*

The number of publications on multi-agent systems and machine learning applications in the Neuroscience field has increased noticeably in the past two years. The terms on the left of the network shown in **Figure 12** represent the contributions developed in 2018; here, mathematical models, simulation models, and analytic and optimization methods were broadly used. The group of terms at the center of the diagram characterizes the contributions developed between 2019 and 2020; here, artificial neural networks, nonlinear systems, graph theory, and simulation were the theoretical tools mainly used. The right of the diagram contains terms related to contributions in the last year; here, learning and reinforcement algorithms have been broadly applied.

#### *3.5.2 Limitations*

ML algorithms and multi-agent systems showed a potentially significant combination/integration in the Neuroscience field. However, the studies in this field had several limitations. First, the applications are centered on autonomous systems that learn using deep learning and reinforcement learning algorithms. Second, the reward concept is minimally exploited in reinforcement learning. Third, the software used to implement multi-agent systems has received little attention. Finally, computer simulation was relevant in the studies developed in 2018 but is not included in recently developed studies.


#### **Figure 13.**

*Visualization by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in the energy field.*

#### **3.6 Energy**

The Energy field turned out to be the sixth contributor as indicated by the retrieved data, with 32 documents on MASs and ML applications between 2017 and 2022. **Figure 13** shows the terms with high co-occurrence frequencies in this field. In **Figure 13**, each term is represented by a circle, where the diameter of the circle and the size of its label represent the frequency of the term, and its proximity to another term indicates the degree of relatedness of the two terms.

#### *3.6.1 Potentiality*

A number of specific research topics show significant active growth and may be considered emerging topics on multi-agent systems and machine learning applications in the Energy field. Reinforcement learning is the dominant algorithm used, followed by deep learning and deep reinforcement learning. Optimization is based on learning algorithms. The main applications in this field are electric machine control, electric vehicles, housing, fertilizers, smart grids, electric power transmission networks, microgrids, energy management, and energy utilization.

#### *3.6.2 Limitations*

There are two major limitations in the applications of multi-agent systems and machine learning in the Energy field. First, the more recent applications are centered on model predictive control. Second, the electric cost is evaluated in only a few studies.

#### **Figure 14.**

*Visualization by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in the Physics and Astronomy field.*

#### **3.7 Physics and astronomy**

The Physics and Astronomy field turned out to be the seventh contributor as indicated by the retrieved data, with 31 documents on MASs and ML applications between 2017 and 2022. **Figure 14** illustrates the network of terms with high co-occurrence frequencies in this field. In **Figure 14**, each term is represented by a circle, where the diameter of the circle and the size of its label represent the frequency of the term, and the distance between any two terms reflects the relatedness of the terms as closely as possible. In general, the stronger the relationship between two terms, the smaller the distance between them on the map. The network can be seen to contain four clusters of co-occurring terms.

#### *3.7.1 Potentiality*

After careful analysis, three topics can be distinguished in 2019 (the biggest royal blue circles in **Figure 14**): learning algorithms, sensor nodes, and game theory.

In 2020, six topics (the biggest turquoise circles in **Figure 14**) were the most often discussed of the last three years. The observation that reinforcement learning is one of the most frequently occurring terms in the Physics and Astronomy field does not come as a surprise, nor does the prominence of the Internet of Things.

In 2021, six topics (the biggest yellow circles in **Figure 14**) were the most predominant of the last three years. The prominence of neural networks may be explained by the fact that they are linked directly to machine learning approaches, constituent models, and multilayer neural networks.

#### *3.7.2 Limitations*

MASs interacting with reinforcement learning and deep reinforcement learning algorithms showed potentially significant applications in the Physics and Astronomy field. However, there were several limitations, such as a small minority of simulation and optimization models, learning algorithms, and software.

#### **4. Applications of MASs combined with ML**

Density maps assist in the understanding of active growth areas, research trends, emerging topics, and hot topics in MASs combined with ML. **Figures 15**–**17** illustrate the density maps of applications in e-learning, manufacturing, and commerce, respectively.

#### **Figure 15.**

*Density map by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in E-learning.*

#### **Figure 16.**

*Density map by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in manufacturing.*

#### **Figure 17.**

*Density map by VOSviewerTM software of word co-occurrence network built using words present in titles and abstracts of documents published between 2017 and 2022 on multi-agent systems and machine learning applications in commerce.*

#### **5. Conclusions**

The primary goal of this study was to provide a broad review of current developments in the field of MASs combined with ML methods. The trend in MASs and ML is the use of reinforcement learning algorithms integrated with optimization and simulation models. Artificial neural networks, nonlinear systems, and graph theory are the theoretical tools most used. We want to emphasize that it is unclear which software is used to integrate multi-agent systems and ML algorithms. In most studies, optimization algorithms were based on swarm intelligence.

#### **Acknowledgements**

The authors appreciate the partial support by UNAM- PAPIME PE 104022. The authors also gratefully acknowledge all the time and support received from IntechOpen.

#### **Conflict of interest**

The authors declare no conflict of interest.


### **Author details**

Aida Huerta Barrientos\* and Alejandro Nila Luevano Faculty of Engineering, National Autonomous University of Mexico, Mexico City, Mexico

\*Address all correspondence to: aida.huerta@comunidad.unam.mx

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Maes P. Agents that reduce work and information overload. Communications of the ACM. 1994;**37**:31-40

[2] Huhns M, Singh MP. Readings in Agents. 1st ed. San Mateo, CA: Morgan Kaufmann; 1998

[3] Bond AH, Gasser L. Readings in Distributed Artificial Intelligence. Los Altos, CA, USA: Morgan Kaufmann; 1998

[4] Chaib-draa B, Moulin B, Mandiau R, Millot P. Trends in distributed artificial intelligence. Artificial Intelligence Review. 1992;**6**:35-66

[5] Wooldridge M, Jennings NR. Intelligent agents: Theory and practice. The Knowledge Engineering Review. 1995;**10**:115-152

[6] Macal C, North M. Introductory tutorial: Agent-based modeling and simulation. In: Proceedings of the 2011 Winter Simulation Conference. Phoenix, Arizona, USA: IEEE; 2011. pp. 1468-1468

[7] Garro A, Mühlhäuser M, Tundis A, Baldoni M, Baroglio C, Bergenti F, et al. Intelligent agents: Multi-agent systems. Encyclopedia of Bioinformatics and Computational Biology. 2019;**1**:315-320

[8] Al-Jumaily A, Al-Jaafreh M. Multiagent system concepts, Theory and Applications phases. In: Buchli J, editor. Mobile Robots, Moving Intelligence. London: InTech; 2006. pp. 369-392

[9] Cen L, Dong M, Li Zhu H, Yu L, Chan P. Machine learning methods in the application speech emotion recognition. In: Zhang Y, editor. Application of Machine Learning. London: InTech; 2010

[10] Haidine A, Zahra Salmam F, Aqqal A, Dahbi A. Artificial intelligence and machine learning in 5G and beyond: A survey and perspectives. In: Haidine A, editor. Moving Broadband Mobile Communications Forward. Intelligent Technologies for 5G. London: InTech; 2021

[11] Machi LA, Mcevoy BT. The Literature Review. California: Corwin Press; 2009

[12] Van Eck NJ, Waltman L. Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics. 2010;**84**:523-538

[13] Van Eck NJ, Waltman L. Visualizing bibliometric networks. In: Ding Y, Rousseau R, Wolfram D, editors. Measuring Scholarly Impact: Methods and Practice. Berlin: Springer; 2014. pp. 285-320

[14] Li J, Yang B, Yu T. Distributed deep reinforcement learning-based coordination performance optimization method for proton exchange membrane fuel cell system. Sustainable Energy Technologies and Assessments. 2022;**50**:101284

[15] Li J, Yu T. Large-scale multi-agent deep reinforcement learning-based coordination strategy for energy optimization and control of proton exchange membrane fuel cell. Sustainable Energy Technologies and Assessments. 2021;**48**:101752

[16] Chen Y, Zhang X, Guo L, Yu T. Optimal carbon-energy combined flow in power system based on multi-agent transfer reinforcement learning. Gaodianya Jishu/High Voltage Engineering. 2019;**45**:3

[17] Cheng L, Yu T. Exploration and exploitation of new knowledge *A State-of-the-Art Survey on Various Domains of Multi-Agent Systems and Machine Learning DOI: http://dx.doi.org/10.5772/intechopen.107109*

emergence to improve the collective intelligent decision-making level of web-of-cells with cyber-physical-social systems based on complex network modeling. IEEE Access. 2018;**6**

[18] Zhang X, Chen Y, Yu T, Yang B, Qu K, Mao S. Equilibrium-inspired multiagent optimizer with extreme transfer learning for decentralized optimal carbon-energy combined-flow of large-scale power systems. Applied Energy. 2018;**189**:157-176

[19] Liu C, Shi Y, Li H, Du W. Accelerated dual averaging methods for decentralized constrained optimization. IEEE Transactions on Automatic Control. 2022. DOI: 10.1109/TAC.2022.3173062

[20] Sun B, Hu J, Xia D, Li H. A distributed stochastic optimization algorithm with gradient-tracking and distributed heavy-ball acceleration. Frontiers of Information Technology and Electronic Engineering. 2021;**22**:11

[21] Li H, Hu J, Ran L, Wang Z, Lü Q, Du Z, et al. Decentralized dual proximal gradient algorithms for non-smooth constrained composite optimization problems. IEEE Transactions on Parallel and Distributed Systems. 2021;**32**:103

[22] Hu J, Ran L, Du Z, Li H. Decentralized stochastic optimization algorithms using uncoordinated stepsizes over unbalanced directed networks. Signal Processing. 2021;**32**:10

[23] Mohammed MA, Elhoseny M, Abdulkareem KH, Mostafa SA, Maashi MS. A multi-agent feature selection and hybrid classification model for parkinson's disease diagnosis. ACM Transactions on Multimedia Computing, Communications and Applications. 2021;**17**:2

[24] Mohammed MA, Ibrahim DA, Salman AO. Adaptive intelligent learning approach based on visual anti-spam email model for multi-natural language. Journal of Intelligent Systems. 2021;**30**:1

[25] Elhoseny M, Mohammed MA, Mostafa SA, Abdulkareem KH, Maashi MS, Garcia-Zapirain B, et al. A new multiagent feature wrapper machine learning approach for heart disease diagnosis. Computers, Materials and Continua. 2021;**67**:1

[26] Mostafa SA, Mustapha A, Mohammed MA, Hamed RI, Arunkumar N, Abd Ghani MK, et al. Examining multiple feature evaluation and classification methods for improving the diagnosis of Parkinson's disease. Cognitive Systems Research. 2019;**54**:90-99

[27] Yaw CT, Yap KS, Wong SY, Yap HJ, Paw JKS. Enhancement of neural network based multi agent system for classification and regression in energy system. IEEE Access. 2020;**8**:163026-163043

[28] Yaw CT, Wong SY, Yap KS, Tan CH. Extreme learning machine neural networks for multi-agent system in power generation. International Journal of Engineering and Technology (UAE). 2018;**7**:4

[29] Yaw CT, Wong SY, Yap KS, Yap HJ, Amirulddin UAU, Tan SC. An ELM based multi-agent system and its applications to power generation. Intelligent Decision Technologies. 2018;**12**:2

[30] Yaw CT, Wong SY, Yap KS, Yap HJ, Amirulddin UAU, Tan SC. An ELM based multi-agent system and its applications to power generation. Intelligent Decision Technologies. 2017;**11**:3

#### **Chapter 2**

## Deep Multiagent Reinforcement Learning Methods Addressing the Scalability Challenge

*Theocharis Kravaris and George A. Vouros*

#### **Abstract**

Motivated by complex demand-capacity imbalance problems in air traffic management at the pre-tactical stage of operations, involving thousands of agents (flights) daily even in a restricted airspace, in this chapter we review deep multiagent reinforcement learning methods through the prism of their ability to scale toward solving problems with large populations of heterogeneous agents, where agents must unavoidably decide on their joint policy without communication constraints.

**Keywords:** deep reinforcement learning, multiagent systems, scalability

#### **1. Introduction**

Scalability in training large numbers of deep reinforcement learning agents, which must decide on actions jointly, is a major issue that becomes apparent in many real-life problems. This issue is related to numerous aspects of deep multiagent reinforcement learning (DMARL), such as the assignment of credit to learners for their choices, assumptions regarding homogeneity or interchangeability of the agents, the society structure induced by the interaction of agents' decisions, and agents' communication requirements, abilities, and constraints.

In this chapter, we provide a review of deep multiagent reinforcement learning (DMARL) methods, examining their ability to scale up to large agent populations. "Large" here could mean anything from hundreds up to several thousands of agents.

We are motivated to solve complex demand-capacity balancing problems in air traffic management (congestion problems regarding air sectors), where there may be 6000 flights daily, even in a restricted airspace (e.g., the Spanish airspace). Specifically, in the air-traffic management (ATM) domain, *demand and capacity balancing* problems (DCBs) are congestion problems that arise naturally whenever the demand for airspace use exceeds capacity, resulting in *"hotspots."* Hotspots are resolved via capacity management or flow management solutions, including regulations that impose delays and reroutings on flights, causing unforeseen effects for the entire system and increasing uncertainty regarding the scheduling of (ground and airspace) operations. For instance, flight delays cause the introduction or increase of time buffers in operations' schedules and may accumulate demand for resources within specific periods. These are translated into costs and negative effects on airlines' reliability, customers' satisfaction, and environmental footprint. Representing flights by self-interested, heterogeneous agents (each with its own possible regulations, preferences, and constraints), the automation of problem resolution requires agents to jointly decide on their policies regarding their own regulations [1]. Driven by the resolution of problems at the pre-tactical phase of operations (i.e., from some hours to a few days before the actual flights), there are no communication or observation constraints for agents: In any case, agents need to coordinate with any agent with whom they participate in any specific congestion problem, given also that these problems may emerge dynamically in space and time.

In such large-scale, complex multiagent settings, we need to consider the quality of solutions (as regulations incur additional costs to operations) and the training scalability of DMARL methods. Factors that affect training scalability are as follows:


In addition to the above factors, decomposing rewards among agents is a relevant issue, affecting both scalability and the quality of joint policies. This issue of credit assignment concerns designating high reward to the agents with desirable behavior, thus avoiding agents enjoying the rewards without contributing to achieving the intended joint goal.

#### **2. Deep MARL**

Tabular function representations in reinforcement learning (RL) have had many successes [2] in relatively low-dimensional problems, but they have two major drawbacks: (a) the designer of the application has to hand-craft the state representations, and (b) the methods store each state or state-action value (*V*-value or *Q*-value, respectively) independently, resulting in slow learning in large state-action spaces and poor generalization abilities.
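
As a point of reference, a tabular Q-learning update can be sketched in a few lines (a minimal illustration with hypothetical names; note that every state-action pair holds its own independently stored value):

```python
from collections import defaultdict

def make_q_table():
    # Each (state, action) value is stored independently -- the source of
    # slow learning and poor generalization in large state-action spaces.
    return defaultdict(lambda: defaultdict(float))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) += alpha * (TD target - Q(s,a))."""
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```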

#### *Deep Multiagent Reinforcement Learning Methods Addressing the Scalability Challenge DOI: http://dx.doi.org/10.5772/intechopen.105627*

To resolve these problems, the deep *Q*-network [3] method successfully combined RL with neural networks (NNs). It is well established that the combination of RL with nonlinear function approximators such as NNs can be unstable and result in divergence [4]. This instability has several causes: the correlations present in the sequence of observations; the fact that small updates to *Q*-values may significantly change the policy, and therefore the data distribution of the samples produced by the rollouts; and the correlations between the action values and the target values. Deep Q-Network (DQN) [3] uses an NN to approximate *Q*-values, modeling the agent's policy.

Two vital elements of DQN that address these issues are the target network and the experience replay memory. The target network mitigates the effect of constantly moving targets by incorporating a second network from which the targets are sampled; this network is periodically updated with the weights of the online network. The addition of a uniform experience replay memory decorrelates the samples collected during rollouts by randomizing over the data, thereby smoothing over changes in the data distribution. During learning, the method applies Q-learning updates on samples (minibatches) of experience drawn uniformly at random from the pool of stored samples.
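
A minimal sketch of these two mechanisms (hypothetical class and function names; lists of numbers stand in for real network weights):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: random draws decorrelate the samples
    collected during rollouts."""
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.storage, batch_size)

def maybe_sync_target(online_weights, target_weights, step, every=1000):
    """Hard target update: periodically copy the online weights so the
    learning targets stay fixed between synchronizations."""
    if step % every == 0:
        return list(online_weights)  # snapshot of the online network
    return target_weights
```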

Since the original DQN publication, many extensions have emerged [5–9], with double DQN [10] and prioritized experience replay [11] being the most well known. Double Q-learning was originally introduced by van Hasselt [12] and aims to address the problem of overestimating action values, a phenomenon inherent to the Q-learning method. This addition to the original method is considered standard practice and is particularly useful in the multiagent domain, where nonstationarity is a common phenomenon. The original idea behind this method is to utilize two independent tabular representations of the Q function during training, where each Q function is updated with targets produced from the other Q function. Double DQN transfers this approach to the deep RL setting by using the target network as the second approximator.
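
The double DQN target can be sketched as follows (hypothetical function name; the online network selects the next action, while the target network evaluates it):

```python
import numpy as np

def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    a_star = int(np.argmax(q_online_next))   # action chosen by the online net
    bootstrap = 0.0 if done else q_target_next[a_star]
    return reward + gamma * bootstrap
```

Decoupling action selection from action evaluation is what dampens the overestimation bias of the plain max operator.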

In the initial DQN approach, experience transitions are uniformly sampled from a replay memory. This approach ignores the significance of samples, replaying them at the same frequency at which they were originally observed in the environment. Prioritized experience replay [11] enforces a priority over the samples, aiming to replay important transitions more frequently and subsequently improve learning efficiency. In particular, the method proposes to assign higher priority to transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error.
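
The priority scheme can be sketched as follows (a simplified proportional variant; the exponent α and the small ε constant follow the conventions of [11]):

```python
import numpy as np

def per_sampling_probs(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_j p_j^alpha, with p_i = |TD error_i| + eps."""
    priorities = (np.abs(np.asarray(td_errors)) + eps) ** alpha
    return priorities / priorities.sum()
```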

This new paradigm of combining Q-learning with NNs swiftly crossed over to the multiagent systems (MAS) field. Tampuu et al. [13] studied cooperation and competition between two DQN agents. The publication investigates the interaction between two agents in the well-known video game Pong, utilizing two independent DQN architectures, one for each agent. By solely adjusting the rewarding scheme, competitive and collaborative behaviors emerge.

Deep deterministic policy gradient (DDPG) [14] combines DQN with the deterministic policy gradient (DPG) [15] to address continuous actions. As a variant of actor-critic algorithms, it incorporates two separate NNs: the actor, which models the policy, and the critic, which provides feedback on the desirability of an action *a* in a state *s*. This desirability can be expressed by means of learnt *V*-values, *Q*-values, or advantages. MADDPG [16] extends DDPG to MAS settings: Each agent is given a dedicated policy network. This approach renders the method not scalable: It is not viable to maintain and train hundreds, if not thousands, of policy networks. Also detrimental to scalability is the fact that, during training, the method utilizes a centralized critic, which takes as input the observations and actions of all agents. The input size of the critic can explode, depending on the size of the agent population and the state dimensions.

Multiagent deep deterministic policy gradient (MADDPG) employs a technique that has many variants and is vital in understanding deep MARL. This technique is called centralized training and decentralized execution (CTDE). A naive way to provide agents with a joint policy is to train independent policies in parallel, one for each agent, following the independent learners paradigm. This approach can produce results in low-dimensional problems, but in real-world scenarios it is inefficient and unstable. In order to alleviate the problems produced by the nonstationarity of an MAS environment, many algorithms, one of which is MADDPG, employ some form of centralized training, thus exploiting information that is available during training but often unavailable during execution. In particular, MADDPG trains a centralized critic, which takes as input the global state and the agents' joint action. During the execution phase, this information is inaccessible, and agents act in a decentralized manner. Another well-known application of CTDE is QMIX [17]. During training, the learning algorithm has access to all local action-observation histories and the global state, but each agent learns a policy that conditions actions on local agent observations.
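
A toy sketch of the centralized critic's input under CTDE (hypothetical helper; it only illustrates how the input dimension grows linearly with the agent population):

```python
import numpy as np

def centralized_critic_input(observations, actions):
    """Concatenate every agent's observation and action into one vector,
    the kind of input a centralized critic such as MADDPG's receives."""
    return np.concatenate([np.concatenate(observations),
                           np.concatenate(actions)])
```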

Parameter sharing, first introduced by Gupta et al. [18], is the extreme case of CTDE: It learns a single policy shared by many agents. The main idea is that this single policy should be able to adequately describe the behavior of different agents with the same goals. Immediately apparent is the important assumption that the agents are homogeneous, meaning that they decide on the same state-action space. This assumption is relaxed in the same publication [18], which "tags" observations with agent identifiers, allowing the policy to differentiate agents according to their goals and respond accordingly. The resulting policy can be more robust, given that it has been trained with samples that potentially belong to different parts of the state space, explored by different agents. We consider the parameter-sharing approach to be the cornerstone of any scalable algorithm.
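
A minimal sketch of parameter sharing with tagged observations (hypothetical names; a linear softmax policy stands in for a deep network):

```python
import numpy as np

def tag_observation(obs, agent_id, n_agents):
    """Append a one-hot agent identifier, letting a single shared policy
    behave differently per agent despite identical weights."""
    one_hot = np.zeros(n_agents)
    one_hot[agent_id] = 1.0
    return np.concatenate([obs, one_hot])

def shared_policy(weights, tagged_obs):
    """One weight matrix serves every agent (parameter sharing)."""
    logits = weights @ tagged_obs
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```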

Castañeda proposes two multiagent variants of DQN [19]. Repeated update Q-learning extends DQN by updating each action value inversely proportionally to the probability of selecting that action, practically changing the learning rate based on the action probability. It is designed to mitigate the overestimation of state values inherent to Q-learning, much like double Q-learning [10]. Loosely coupled Q-learning defines a diffusion function and utilizes eligibility traces in order to associate negative rewards with states. The combination of the diffusion function with eligibility traces is used to define a function that indicates the necessity for cooperation, expressing the probability of an agent carrying out an action independently. It combines this with a Dec-MDP and two Q-value functions, one for acting independently (when appropriate) and one for coordinating with others.

Credit assignment concerns the difficulty of giving credit and providing higher reward to the agents with desirable behavior, thus accelerating learning in MAS settings. A common phenomenon, the "lazy agent," occurs when an agent does not participate actively in a cooperative solution, enjoying the rewards resulting from the cooperation of others. Various approaches have been proposed to deal with this problem, with difference rewards [20] and reward shaping [21, 22] being the most well known.

Difference rewards [20] introduce a method to visualize the desirable properties of a reward function, thus facilitating the creation of new reward structures based on the specific needs of the domain. Specifically, the authors in [20] focus on two aspects of the reward function. The first is called *factoredness* and expresses how well the reward promotes coordination among agents in different parts of a domain's state space. The second is called *learnability* and expresses how easy it is for an agent to learn to maximize its reward, measured by the reward's sensitivity to the agent's actions. The main idea behind reward shaping is to utilize some prior knowledge while engineering the reward function, in order to reduce the training time by reducing the number of suboptimal actions taken [22]. In its most general form, reward shaping can be as simple as *R*′ = *R* + *F*, where *R* is the original reward and *F* an additional shaping reward designed to encourage the agent to move toward the goal. Here, most of the research focuses on the principles that lead to an effective reward design, as well as on how the optimal policy changes as an effect of reward design.
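
One well-studied special case (a standard result, not specific to [21, 22]) is potential-based shaping, where F is a difference of state potentials; this particular form is known to leave the optimal policy unchanged. A minimal sketch, with a hypothetical potential function Phi evaluated at the current and next state:

```python
def shaped_reward(r, phi_s, phi_s_next, gamma=0.99):
    """R' = R + F with F = gamma * Phi(s') - Phi(s)."""
    return r + gamma * phi_s_next - phi_s
```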

Value decomposition network (VDN) [23] is the first DMARL method that addresses the credit assignment problem. VDN represents the joint action value as a summation of local (individual agents') action values conditioned on agents' local observations. The fact that an agent's local value function depends only on local observations facilitates the agent's understanding of the received rewards. This method is not scalable, because it utilizes distinct NNs for each agent's policy. QMIX [17] provides a more general case of VDN using a mixing NN that combines all individual Q-values into *Qtot*, thus approximating a broader class of functions for joint action values. Specifically, moving beyond VDN's assumption of additivity, QMIX's mixing network can approximate monotonic relations between individual and global Q-values. Both works are based on the individual-global-max (IGM) principle, according to which the joint greedy action should be equivalent to the set of agents' individual greedy actions.
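
The two factorizations can be contrasted in a toy sketch (hypothetical names; the QMIX mixer is reduced to a single linear layer with non-negative weights, which is what enforces monotonicity):

```python
import numpy as np

def vdn_qtot(local_qs):
    """VDN: Q_tot is the plain sum of the agents' local Q-values."""
    return float(np.sum(local_qs))

def qmix_qtot(local_qs, mixer_weights, bias=0.0):
    """Toy QMIX mixer: weights are passed through abs() so that
    dQ_tot/dQ_i >= 0 for every agent i (monotonicity)."""
    w = np.abs(np.asarray(mixer_weights))
    return float(w @ np.asarray(local_qs) + bias)
```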

VDN and QMIX address only a subset of factorizable MARL tasks due to their structural factorization constraints of additivity and monotonicity, respectively. Later works have improved the representation ability of the mixing network. QTRAN [24] achieves more general factorization by transforming the original joint action-value function into a new, easily factorizable one with the same optimal actions as the original. NDQ [25] combines value function factorization learning with communication learning, introducing an information-theoretic regularizer that aims to reduce interagent communication while maintaining performance.

ROMA [26] combines MARL, mixing networks in particular, with the role concept [27–33]. A role is a comprehensive pattern of behavior, often specialized in some tasks. Agents with similar roles show similar behaviors and thus can share their experiences to improve performance. The main drawback of this approach is that it demands exploiting prior domain knowledge in order to decompose tasks and predefine the responsibilities of each role; role-based MASs therefore need dynamic and adaptive abilities added in order to perform effectively in dynamic and unpredictable settings. Moreover, a predefined specification of roles may not be appropriate for every domain. ROMA introduces two regularizers to enable roles to be dynamically identifiable, in order to exploit the benefits of both the role-based and learning paradigms. Following the QMIX [17] framework, it utilizes independent policy networks and a centralized mixing network.

QPLEX [34] focuses on ensuring that the IGM principle (as specified above) holds, while reformalizing it as an advantage-based IGM. QPLEX replaces the original mixing network of QMIX [17] with a duplex dueling network architecture [5], which induces the joint and local (duplex) advantage functions, to factorize the joint action-value function into individual action-value functions. This duplex dueling structure encodes the IGM principle into the neural network architecture. The architecture uses an individual action-value function for each agent in combination with the centralized duplex dueling component. The method manages to achieve higher win rates in numerous Starcraft II scenarios [35] against multiple baselines such as VDN, QMIX, and QTRAN [17, 23, 24], in experiments with up to 27 agents.

Another well-known approach is the counterfactual multiagent policy gradients method (COMA) [36]: It decomposes the global reward to the agents utilizing a counterfactual baseline inspired by difference rewards [20]. Similarly to MADDPG, COMA utilizes a centralized critic while each agent has a distinct policy network, so its drawbacks regarding scalability are the same as MADDPG's.

Concluding the above, the line of research on value decomposition has focal points other than scalability, and thus does not produce methods suited to scaling up to thousands of agents. In addition, as far as the credit assignment problem is concerned, although it is a vital aspect of MARL, we do not consider its resolution a prerequisite for methods' scalability. However, it is a related issue in settings where rewards are not inherently decomposed to individual agents.

#### **3. Scalable deep MARL**

Differentiable interagent learning (DIAL) [37] is the first proposal for learnable communication utilizing DQNs. Agents generate real-valued messages, which are broadcast to the other agents. During centralized training, these messages are practically gradients, allowing end-to-end training across agents. During decentralized execution, messages are discretized and mapped to a predefined set of communication actions. There are two major reasons that this approach cannot scale effectively: With no parameter sharing in place, each agent uses its own independent policy NN. Also, every agent communicates with every other agent throughout training. Beyond the communication cost, this could result in agents receiving misleading or noisy information in the form of gradients.

CommNet [38] aims to learn a communication protocol alongside the policies of cooperating agents. CommNet consists of a centralized feed-forward NN, with the observations of all agents as input and their actions as output. Each layer represents a communication step between the agents and consists of one decision module per agent. While the first layer uses an encoder function, every hidden layer module takes the internal state *h* as well as the broadcasted communication vector *c* as input and outputs the next internal state. These internal states are averaged and broadcast as the next communication vector. CommNet has three major limitations regarding scalability. First, all agents use the same centralized network, which has to be large enough to accommodate this design, due to the size of the input in the form of the global state. This also renders the method unsuitable for heterogeneous agents. Second, CommNet calculates the mean of messages between layers, thereby assuming that all agents are of equal importance, which could be unsuitable in some settings. Finally, the most important disadvantage is the fact that all agents have to communicate with everyone. This could result in significant communication overhead and noise in the communication channel: e.g., an agent could receive a communication vector consisting of the average observations of irrelevant agents. The last drawback could be mitigated by the local connectivity model extension proposed in the original CommNet publication [38], according to which agents can only receive messages from a dynamic group of agents within a certain range. However, no experiments are provided for this extension.
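
The broadcast step can be sketched as follows (a simplification of the communication described in [38]; each agent's incoming vector is the mean of the *other* agents' hidden states):

```python
import numpy as np

def commnet_comm_vectors(hidden_states):
    """c_i = mean of h_j over all agents j != i -- every agent is weighted
    equally, which is exactly the limitation discussed above."""
    h = np.asarray(hidden_states, dtype=float)
    total = h.sum(axis=0)
    n = len(h)
    return np.stack([(total - h[i]) / (n - 1) for i in range(n)])
```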

Toward resolving the mandatory global communication problem, an extension of CommNet, called vertex attention interaction network (VAIN) [39], employs an attention mechanism. This attention mechanism improves the performance of the original method by modeling the locality of interactions and thus determining which agents will share information. The authors claim that this method can make CommNet suitable for as many as 1000 agents. In the experiments provided, VAIN achieved better scores than CommNet, although for a small number of agents.

Another method with scalability potential is BiCNet [40], an extension of the actor-critic formulation in which agents use a bidirectional recurrent NN (RNN) [41]. The recurrent network serves as a bidirectional communication channel but also as a local memory saver: Each agent is able to maintain its own internal state, as well as to share information with its neighbors. Similarly to CommNet, the communication channel serves to broadcast agents' local observations. Contrary to CommNet agents, BiCNet agents utilize additive messages, while the method makes no assumptions about agents' homogeneity or interchangeability. The major drawback is that, similarly to CommNet, all agents communicate with everyone. Later work [42] shows that the ability of BiCNet to learn effective policies is reduced as the number of agents increases. This deterioration of effectiveness is attributed to the lack of a mechanism capable of capturing the importance of information from different agents.

TarMAC [43] proposes an architecture designed specifically to allow each agent to choose which other agents to address messages to, as well as which messages to pay attention to. It introduces a signature-based soft attention mechanism in which a key encoding properties of the intended recipients (as opposed to specific agent identities) is part of the message. The receiver of the message takes this key into consideration and decides whether the message is relevant. The method is enhanced by multiple rounds of communication and collaborative reasoning. TarMAC utilizes the actor-critic framework, with a shared policy network and a centralized critic. It is evaluated on four diverse environments against CommNet [38], as well as against variants of itself without the attention mechanism or without communication at all. The sophistication of the communication scheme could be a disadvantage regarding scalability, as multiple rounds of communication between thousands of agents would result in significant overhead. The main drawback of the method, though, with scalability in mind, is the usage of a centralized critic, which takes as input the predicted actions and hidden states of all agents.

Coder [44] is a hierarchical approach, with three distinct hierarchical levels, used to solve a traffic congestion problem with agents being the traffic intersections. First, a centralized base environment is trained, which is a small and simple subproblem compared with the original one, i.e., a single intersection managing incoming traffic. Here, two training alternatives are considered, a DQN variant and a DDPG variant called the Wolpertinger architecture [45]. The next step is called regional DRL, where the parameters of the base environment are shared with a small number of neighboring agents controlling similar but not identical parts of the environment (e.g., different traffic dynamics). To tackle these differences, additional refinement through training is required. The final level combines all regional policies and adds a global dense layer. In order to choose actions while incorporating some form of coordination between the regional policies, an iterative action search is employed. This search starts with the concatenation of the actions produced by the regional nets and attempts to find a globally better alternative. This approach is shown to work with a maximum of 96 adjacent intersections. A similar approach to Coder, which we here call KT DQN [46], expands DQN. Aiming to speed up training and produce better results, it uses single-agent training before applying the policy to MAS settings. In doing so, the method freezes all the weights of the single-agent policy network model that are transferred to the MAS scenario, with the exception of those between the last hidden layer and the output. Coder and KT DQN utilize a hierarchical approach that initializes learning from a simplified single-agent environment. This can indeed speed up learning, in contrast to learning from scratch. However, this approach is not straightforwardly transferable between applications. For example, using the regional DRL step of Coder calls for a separate design, i.e., decomposing the problem into subproblems, for different applications.

An interesting take on how to utilize an agent population in order to facilitate exploration is presented by Leibo et al. [47]. The method proposed is a variant of IMPALA [48]: In the simulation used, every agent is a member of an animal species. These homogeneous agents share the same policy network. The paper does not explicitly address interagent communication or cooperation. Instead, it utilizes the multiagent framework mainly to encourage more robust exploration. Cooperative behavior emerged when incentives were given to the agents: For example, experiments are presented with agents rewarded for specializing in eating a specific kind of food. The total number of agents present in the reported simulation is 960, and the population is considered to be dynamic.

Mean field MARL [49] proposes two algorithms with the explicit aim to improve MARL scalability. The main idea is to calculate the mean action of an agent's neighborhood, so that each agent interacts with a single virtual mean agent instead of *n* agents. The two proposed alternatives are mean field Q (MF-Q) and mean field actor-critic (MF-AC), of which the MF-Q approach has superior sample efficiency, which becomes more apparent as the number of agents grows. MF-Q has been empirically shown to scale up to 1000 agents. Mean field MARL does not explicitly address credit assignment, but provides a rigorous mathematical proof of convergence to a Nash equilibrium. Although both algorithms are scalable, their main drawback lies in the coordination scheme: Averaging the neighborhood of an agent can result in loss of important information and lackluster cooperation [50].
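
The neighborhood summary at the heart of mean field MARL can be sketched as follows (hypothetical helper; discrete actions are one-hot encoded and averaged into the "virtual mean agent" action):

```python
import numpy as np

def mean_field_action(neighbor_actions, n_actions):
    """Replace the neighbors' joint action with its empirical mean,
    so each agent effectively faces a single virtual mean agent."""
    one_hots = np.eye(n_actions)[np.asarray(neighbor_actions)]
    return one_hots.mean(axis=0)
```

Note that distinct neighborhoods can produce the same mean, which is precisely how the averaging can discard important information.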

Inspired by factorization machines [51, 52], FQL [53] utilizes a composite deep neural network architecture, combining *Q* and *V* nets, to compute a low-rank approximation of the Q-function by reformulating the multiagent joint-action Q-function into a factorized form. Depending on agent roles, agents are divided into *G* groups, and agents within each group share network parameters. While ensuring efficiency, the uniformity assumption between agents might not be suitable for various real-world applications. In addition, since the factorized Q-function for each agent requires knowledge of the other agents' current states and their last actions for both training and execution, the algorithm mainly addresses MARL problems with a central controller that communicates the global information to all agents. In the experimental section, the method produces competitive results in settings with agent populations as large as 500 agents. As the size of the agent population increases from 100 to 500, MF-Q [49], one of the baselines used, seems to be the more suitable method.

CoLight [54] aims to enable agent cooperation in a large-scale road network with hundreds of traffic signals, recognizing that RL-based methods at the time failed to reach optimal interagent communication. To achieve this, the authors introduce graph attention networks. At the core of the method is a *Q*-value prediction deep network. This network incorporates an observation embedding layer, hidden neighborhood cooperation layers, and an output *Q*-value prediction layer. Similarly to mean field MARL [49], the method "averages" neighbors' influence in the broadcasted messages. Although experiments with simulated as well as real-world data involve up to 196 agents, the authors claim scalability to thousands of agents, based on the method's complexity analysis [54]. Contrasted against various baselines, CoLight shows advanced scalability, as well as sample efficiency.

#### *Deep Multiagent Reinforcement Learning Methods Addressing the Scalability Challenge DOI: http://dx.doi.org/10.5772/intechopen.105627*

As already pointed out, many major approaches such as DIAL, CommNet, and BicNet [37, 38, 40] suffer from agents' inability to differentiate useful information driving interagent cooperation from noisy or misleading information. ATOC [42] strives to eliminate this specific issue by proposing an attentional communication model that achieves interagent communication among a large number of agents. The method extends DDPG and uses an attention module as well as a bidirectional LSTM, which serves as a communication channel. The LSTM module integrates neighboring agents' internal states, creating a message from their combined intentions and guiding the agents toward coordinated decision-making. Actor and critic networks share parameters to ensure scalability. Indeed, the method seems to challenge the known approaches (DDPG [14], CommNet [38], and BicNet [40]), while agents show division of work and meaningful cooperation.
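The gating idea can be sketched as below. This is a deliberately crude stand-in (our assumption throughout): a logistic gate on the agent's hidden state decides whether to initiate communication, and the bidirectional-LSTM channel of the paper is approximated here by a plain mean over the group's hidden states.

```python
import math

def attention_gate(hidden, w, b):
    """Logistic gate on the agent's hidden state: True => communicate."""
    z = sum(h * wi for h, wi in zip(hidden, w)) + b
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

def integrate(hiddens):
    """Stand-in for the bidirectional-LSTM channel: average the group's
    hidden states into one coordination message."""
    dim = len(hiddens[0])
    return [sum(h[k] for h in hiddens) / len(hiddens) for k in range(dim)]

# Toy hidden states and gate parameters (assumed values):
agents = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
w, b = [1.0, 1.0], -0.5
talkers = [h for h in agents if attention_gate(h, w, b)]
message = integrate(talkers) if talkers else None
```

The point of the gate is that agents whose local state suggests nothing useful to share simply stay silent, which is what filters out noisy or misleading messages.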

Graph convolutional reinforcement learning (DGN) [50] has the explicit goal of handling highly dynamic environments, where agents constantly change neighborhoods. Toward this goal, the paper proposes an MAS framework in which agents are connected in a graph where each agent is a node and edges connect neighbors. Neighborhoods are determined by agent distance or other measures, e.g., communication range or critical interactions, and can vary over time. Communication is allowed only between neighbors, in order to minimize inefficiency. DGN consists of an observation encoder, convolutional layers with relation kernels, and a *Q* network. The method was compared against well-known algorithms such as CommNet [38] and MF-Q [49] and is shown to achieve very competitive results in environments with up to 140 agents. However, experiments with larger agent populations are needed in order to assess the effectiveness of the method as the environment gets more complex.
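The time-varying neighborhood structure can be illustrated as follows (a sketch under our own assumptions; positions, radius, and the distance criterion are illustrative): edges are recomputed at every step from pairwise distance, so communication stays local even as agents move.

```python
import math

def neighbors(positions, radius):
    """Adjacency lists: j is a neighbor of i if within 'radius'."""
    adj = {i: [] for i in range(len(positions))}
    for i, (xi, yi) in enumerate(positions):
        for j, (xj, yj) in enumerate(positions):
            if i != j and math.hypot(xi - xj, yi - yj) <= radius:
                adj[i].append(j)
    return adj

# Agent 2 starts far away, then moves into communication range:
step0 = [(0, 0), (1, 0), (5, 5)]
adj0 = neighbors(step0, radius=2.0)   # agent 2 is isolated
step1 = [(0, 0), (1, 0), (1, 1)]
adj1 = neighbors(step1, radius=2.0)   # agent 2 now has two neighbors
```

In DGN, only the edges present at the current step are convolved over, which is what keeps the method's cost local rather than global.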

Lin et al. [55] propose two distinct methods which, very close to our aims, address large-scale demand-capacity imbalance problems in the air traffic management domain. The environment simulates a population of approximately 5000 homogeneous agents acquiring identical rewards. Both methods assume that agents have the same action values. In practice, this assumption means that the agents are not simply homogeneous, but interchangeable. It allows the building of a centralized action-value table, which is utilized for coordinating the agents' actions. The first method, called contextual deep Q-learning, utilizes this table of centralized action values to form a collaborative context, which prevents agents from choosing suboptimal actions and thus avoids unwanted or redundant behavior, such as agents exchanging positions with one another. The second proposed method, called contextual actor-critic, is an actor-critic variant of the contextual DQN. It uses a centralized value function shared by all agents and a parameter-sharing policy network.
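The following is a deliberately simplified sketch of this idea (table values, cell layout, and the swap rule are our own assumptions, not the paper's): because agents are interchangeable, one centralized table serves every agent, and a context check forbids action pairs that merely cancel out, such as two agents swapping positions.

```python
# Shared action-value table (assumed toy values); every agent in the
# same cell consults the same entries.
Q = {
    0: {"stay": 0.1, "right": 0.9},
    1: {"stay": 0.2, "left": 0.8},
}

def greedy_with_context(cells):
    """Greedy picks from the shared table, then a context check that
    demotes one side of a mutual swap to 'stay'."""
    chosen = {cell: max(Q[cell], key=Q[cell].get) for cell in cells}
    # A right/left pair between adjacent cells 0 and 1 is a pure swap:
    if chosen.get(0) == "right" and chosen.get(1) == "left":
        loser = 0 if Q[0]["right"] < Q[1]["left"] else 1
        chosen[loser] = "stay"
    return chosen

actions = greedy_with_context([0, 1])
```

Without the context check both agents would move and simply exchange positions; with it, only the higher-valued move survives.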

Closely related works, with scalability in mind, are those presented by Nguyen et al. [56–58]. In particular, AC for CDec-POMDPs [57] is based on the FEM algorithm [56] and presents an actor-critic algorithm for optimizing collective decentralized POMDPs. It achieves extreme scalability and provides experiments with up to 8000 agents. Subsequent work proposes MCAC [58], which focuses on difference rewards, also addressing the credit assignment problem. A fundamental idea underlying these works, and a vital one for achieving scalability, is exploiting the count of agents taking the same action *a* in a state *s*. This count, together with more complex measures derived from it, serves as a statistical basis for training, eliminating the need to collect trajectory samples from every agent. Instead, the resulting policy depends on count-based observations. Therefore, this method does not explicitly assume communication; it rather assumes full access to all the count-based information during training. During execution, agents execute individual policies without accessing centralized functions. Similarly to contextual DQN [55], this approach assumes that the agents are interchangeable. This assumption is necessary to achieve the required scalability, but is not applicable to every problem.

**Table 1.** *Reviewed methods' characteristics.*
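The count-based statistic underlying these collective methods can be sketched as follows (the state and action names are ours, purely for illustration): however many interchangeable agents there are, training only ever sees a small table of counts *n*(*s*, *a*).

```python
from collections import Counter

def count_table(joint_state_actions):
    """n(s, a): how many agents took action a while in state s."""
    return Counter(joint_state_actions)

# 5000 agents reduce to a handful of (state, action) counts:
agents = [("sector_A", "hold")] * 3000 \
       + [("sector_A", "climb")] * 1200 \
       + [("sector_B", "hold")] * 800
counts = count_table(agents)
```

The policy gradient is then estimated from these counts rather than from 5000 individual trajectories, which is precisely why the approach requires interchangeable agents.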

#### **4. Concluding remarks**

In this paper, we provide a review of state-of-the-art DMARL methods with significant scalability potential and interagent coordination capabilities in large-scale MAS settings. **Table 1** lists all the reviewed methods with the characteristics considered essential for their scalability, as mentioned in the introductory section. The abbreviations used are the following: Independent learners (I), Parameter Sharing (PS), Centralized model (C), Knowledge Transfer (KT), Hetero-/Homogeneous agents (Het/Hom), Interchangeable agents (IC), Implicit (Impl), Explicit (Expl).

The first conclusion we can draw is the importance of parameter sharing in large agent populations. As the number of agents grows, parameter sharing shows by far the most potential for scalability: it is impractical to train thousands of independent networks, one per agent, or to utilize a centralized approach whose input size would explode as the number of agents and the size of their observations grow. We can clearly see in **Table 1** that all works providing experiments with large agent populations concur on this approach.
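The memory argument is easy to make concrete. In the minimal sketch below (a linear "network" with assumed weights, standing in for a full policy network), a single set of weights is queried by every agent with its own observation, so the parameter cost is constant in the number of agents:

```python
def policy(obs, weights):
    """One shared linear 'network' used by all agents."""
    return sum(o * w for o, w in zip(obs, weights))

# One weight vector, trained once, serves the whole population;
# only the observations differ per agent (values assumed):
shared_weights = [0.5, -0.25]
observations = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
values = [policy(o, shared_weights) for o in observations]
```

With independent learners, the weight storage and training cost would instead be multiplied by the population size.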

Another important conclusion from this study is that, as scalability potential becomes more prominent, stricter assumptions are made about the agents and their environment. Specifically, the approach that manages to scale to the largest agent population [57] assumes interchangeable agents. In order for these methods to find broad practical applicability, such assumptions have to be relaxed. In numerous practical applications, agents are not homogeneous and, more importantly, interchangeable agents are an even rarer occurrence. Extensions of these methods to incorporate heterogeneous agents would be of great value. An important point here is that parameter sharing assumes homogeneous agents. Further research and experimentation in the direction of techniques such as labeling agents, as proposed in the original parameter sharing publication [18], should be beneficial in this regard.

Finally, we can conclude that as the need for scalability becomes more prevalent, targeted and detailed communication becomes more challenging to achieve. For example, MF-Q [49], a method that shows great scalability, assumes that all agents affect the recipient of their messages equally. Local communication, on the other hand, affects the scalability of methods, while further studies on the use of attention and message-combination mechanisms are necessary to prove the potential to operate in environments with thousands of agents.
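The equal-influence assumption of MF-Q can be illustrated as follows (a sketch in our own notation): each agent's Q-value conditions only on its own action and the one-hot *average* of its neighbors' actions, so every neighbor contributes with the same weight regardless of relevance.

```python
def mean_action(neighbor_actions, n_actions):
    """One-hot average of neighbors' discrete actions: the only summary
    of the neighborhood that a mean-field agent ever sees."""
    avg = [0.0] * n_actions
    for a in neighbor_actions:
        avg[a] += 1.0 / len(neighbor_actions)
    return avg

# Four neighbors taking actions 0, 0, 1, 2 collapse to one distribution:
mu = mean_action([0, 0, 1, 2], n_actions=3)
```

This compression is exactly what makes the method scale, and exactly what prevents an agent from weighting one critical neighbor above the rest.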

#### **Acknowledgements**

This work is supported by the TAPAS project, which has received funding from the SESAR Joint Undertaking (JU) under grant agreement No 892358. In addition, this work was partially supported by the University of Piraeus Research Center (UPRC).

*Multi-Agent Technologies and Machine Learning*

### **Author details**

Theocharis Kravaris\* and George A. Vouros University of Piraeus, Piraeus, Greece

\*Address all correspondence to: hariskrav.unipi@gmail.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Kravaris T, Spatharis C, Bastas A, Vouros GA, Blekas K, Andrienko G, Andrienko N, Garcia JM. Resolving congestions in the air traffic management domain via multiagent reinforcement learning methods. 2019. arXiv preprint arXiv:1912.06860

[2] Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, Massachusetts, USA: MIT Press; 2018

[3] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015; **518**(7540):529-533

[4] Tsitsiklis J, Van Roy B. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems. 1996;**9**:1075-1081

[5] Wang Z, Schaul T, Hessel M, Hasselt H, Lanctot M, Freitas N. Dueling network architectures for deep reinforcement learning. In: International Conference on Machine Learning. New York, NY, USA: JMLR: W&CP; 2016. pp. 1995-2003

[6] Fortunato M, Azar MG, Piot B, Menick J, Osband I, Graves A, et al. Noisy networks for exploration. 2017. arXiv preprint arXiv:1706.10295

[7] Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, et al. Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning. New York, NY, USA: JMLR: W&CP; 2016. pp. 1928-1937

[8] Dabney W, Rowland M, Bellemare M, Munos R. Distributional reinforcement learning with quantile regression. In: Proceedings of the AAAI Conference on Artificial Intelligence. California, USA: AAAI Press, Palo Alto; 2018

[9] Hessel M, Modayil J, Van Hasselt H, Schaul T, Ostrovski G, Dabney W, et al. Rainbow: Combining improvements in deep reinforcement learning. In: Thirty-second AAAI Conference on Artificial Intelligence. California, USA: AAAI Press, Palo Alto; 2018

[10] Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. In: Thirtieth AAAI Conference on Artificial Intelligence. California, USA: AAAI Press, Palo Alto; 2016

[11] Schaul T, Quan J, Antonoglou I, Silver D. Prioritized experience replay. In: ICLR (Poster); 2016

[12] Hasselt H. Double Q-learning. Advances in Neural Information Processing Systems. 2010;**23**:2613-2621

[13] Tampuu A, Matiisen T, Kodelja D, Kuzovkin I, Korjus K, Aru J, et al. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE. 2017;**12**(4):e0172395

[14] Qiu C, Hu Y, Chen Y, Zeng B. Deep deterministic policy gradient (DDPG)-based energy harvesting wireless communications. IEEE Internet of Things Journal. 2019;**6**(5): 8577-8588

[15] Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M. Deterministic policy gradient algorithms. In: International Conference on Machine Learning. New York, NY, USA: JMLR: W&CP; 2014. pp. 387-395

[16] Lowe R, Wu YI, Tamar A, Harb J, Pieter Abbeel O, Mordatch I. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems. 2017;**30**:6379-6390

[17] Rashid T, Samvelyan M, Schroeder C, Farquhar G, Foerster J, Whiteson S. Qmix: Monotonic value function factorisation for deep multiagent reinforcement learning. In: International Conference on Machine Learning. New York, NY, USA: JMLR: W&CP; 2018. pp. 4295-4304

[18] Gupta JK, Egorov M, Kochenderfer M. Cooperative multi-agent control using deep reinforcement learning. In: International Conference on Autonomous Agents and Multiagent Systems. Cham, Switzerland: Springer International Publishing AG; 2017. pp. 66-83

[19] Castaneda AO. Deep Reinforcement Learning Variants of Multi-agent Learning Algorithms. Edinburgh: School of Informatics, University of Edinburgh; 2016

[20] Agogino AK, Tumer K. Analyzing and visualizing multiagent rewards in dynamic and stochastic domains. Autonomous Agents and Multi-Agent Systems. 2008;**17**(2):320-338

[21] Devlin SM, Kudenko D. Dynamic potential-based reward shaping. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems. Cham, Switzerland: Springer International Publishing AG; 2012. pp. 433-440

[22] Ng AY, Harada D, Russell S. Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML. New York, NY, USA: JMLR: W&CP; 1999. pp. 278-287

[23] Sunehag P, Lever G, Gruslys A, Czarnecki WM, Zambaldi VF, Jaderberg M, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In: AAMAS. Cham, Switzerland: Springer International Publishing AG; 2018

[24] Son K, Kim D, Kang WJ, Hostallero DE, Yi Y. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In: International Conference on Machine Learning. New York, NY, USA: JMLR: W&CP; 2019. pp. 5887-5896

[25] Wang T, Wang J, Zheng C, Zhang C. Learning nearly decomposable value functions via communication minimization. In: International Conference on Learning Representations. 2019

[26] Wang T, Dong H, Lesser V, Zhang C. ROMA: Multi-agent reinforcement learning with emergent roles. In: International Conference on Machine Learning. New York, NY, USA: JMLR: W&CP; 2020. pp. 9876-9886

[27] Becht M, Gurzki T, Klarmann J, Muscholl M. ROPE: Role oriented programming environment for multiagent systems. In: Proceedings Fourth IFCIS International Conference on Cooperative Information Systems. CoopIS 99 (Cat. No. PR00384). New York, U.S.: IEEE; 1999. pp. 325-333

[28] Stone P, Veloso M. Task decomposition, dynamic role assignment, and low-bandwidth communication for real-time strategic teamwork. Artificial Intelligence. 1999; **110**(2):241-273

[29] Depke R, Heckel R, Küster JM. Roles in agent-oriented modeling. International Journal of Software Engineering and Knowledge Engineering. 2001;**11**(03):281-302


[30] Ferber J, Gutknecht O, Michel F. From agents to organizations: An organizational view of multi-agent systems. In: International Workshop on Agent-oriented Software Engineering. Cham, Switzerland: Springer International Publishing AG; 2003. pp. 214-230

[31] Odell J, Nodine M, Levy R. A metamodel for agents, roles, and groups. In: International Workshop on Agent-oriented Software Engineering. Cham, Switzerland: Springer International Publishing AG; 2004. pp. 78-92

[32] Cossentino M, Hilaire V, Molesini A, Seidita V, editors. Handbook on Agent-oriented Design Processes. Berlin Heidelberg: Springer; 2014

[33] Lhaksmana KM, Murakami Y, Ishida T. Role-based modeling for designing agent behavior in self-organizing multi-agent systems. International Journal of Software Engineering and Knowledge Engineering. 2018;**28**:79-96

[34] Wang J, Ren Z, Liu T, Yu Y, Zhang C. QPLEX: Duplex dueling multiagent Q-learning. In: International Conference on Learning Representations. 2020

[35] Samvelyan M, Rashid T, de Witt C, Farquhar G, Nardelli N, Rudner TG, et al. The starcraft multi-agent challenge. In: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. Cham, Switzerland: Springer International Publishing AG; 2019. pp. 2186-2188

[36] Foerster J, Farquhar G, Afouras T, Nardelli N, Whiteson S. Counterfactual multi-agent policy gradients. In: Proceedings of the AAAI Conference on Artificial Intelligence. California, USA: AAAI Press, Palo Alto; 2018

[37] Foerster J, Assael IA, De Freitas N, Whiteson S. Learning to communicate with deep multi-agent reinforcement learning. Advances in Neural Information Processing Systems. 2016;**29**:2137-2145

[38] Sukhbaatar S, Fergus R. Learning multiagent communication with backpropagation. Advances in Neural Information Processing Systems. 2016;**29**: 2244-2252

[39] Hoshen Y. Vain: Attentional multiagent predictive modeling. Advances in Neural Information Processing Systems. 2017;**30**:2701-2711

[40] Peng P, Wen Y, Yang Y, Yuan Q, Tang Z, Long H, Wang J. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. 2017. arXiv preprint arXiv:1703.10069

[41] Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing. 1997;**45**(11):2673-2681

[42] Jiang J, Lu Z. Learning attentional communication for multi-agent cooperation. Advances in Neural Information Processing Systems. 2018; **31**:7254-7264

[43] Das A, Gervet T, Romoff J, Batra D, Parikh D, Rabbat M, et al. Tarmac: Targeted multi-agent communication. In: International Conference on Machine Learning. New York, NY, USA: JMLR: W&CP; 2019. pp. 1538-1546

[44] Tan T, Bao F, Deng Y, Jin A, Dai Q, Wang J. Cooperative deep reinforcement learning for large-scale traffic grid signal control. IEEE Transactions on Cybernetics. 2019;**50**(6):2687-2700

[45] Marden JR, Arslan G, Shamma JS. Cooperative control and potential games. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2009; **39**(6):1393-1407

[46] Rădulescu R, Legrand M, Efthymiadis K, Roijers DM, Nowé A. Deep multi-agent reinforcement learning in a homogeneous open population. In: Benelux Conference on Artificial Intelligence. New York, NY, USA: JMLR: W&CP; 2018. pp. 90-105

[47] Leibo JZ, Pérolat J, Hughes E, Wheelwright S, Marblestone AH, Duéñez-Guzmán EA, et al. Malthusian Reinforcement Learning. In: AAMAS. Cham, Switzerland: Springer International Publishing AG; 2019

[48] Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In: ICML. New York, NY, USA: JMLR: W&CP; 2018

[49] Yang Y, Luo R, Li M, Zhou M, Zhang W, Wang J. Mean field multiagent reinforcement learning. In: International Conference on Machine Learning. New York, NY, USA: JMLR: W&CP; 2018. pp. 5571-5580

[50] You J, Liu B, Ying Z, Pande V, Leskovec J. Graph convolutional policy network for goal-directed molecular graph generation. Advances in Neural Information Processing Systems. 2018; **31**:6410-6421

[51] Rendle S, Schmidt-Thieme L. Pairwise interaction tensor factorization for personalized tag recommendation. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining. New York City, New York, USA: ACM; 2010. pp. 81-90

[52] Rendle S. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST). 2012;**3**(3):1-22

[53] Zhou M, Chen Y, Wen Y, Yang Y, Su Y, Zhang W, et al. Factorized Q-learning for large-scale multi-agent systems. In: Proceedings of the First International Conference on Distributed Artificial Intelligence. New York City, New York, USA: ACM; 2019. pp. 1-7

[54] Wei H, Xu N, Zhang H, Zheng G, Zang X, Chen C, et al. CoLight: Learning network-level cooperation for traffic signal control. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York, NY, United States: Association for Computing Machinery; 2019. pp. 1913-1922

[55] Lin K, Zhao R, Xu Z, Zhou J. Efficient large-scale fleet management via multi-agent deep reinforcement learning. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, NY, United States: Association for Computing Machinery; 2018. pp. 1774-1783

[56] Nguyen DT, Kumar A, Lau HC. Collective multiagent sequential decision making under uncertainty. In: Thirty-First AAAI Conference on Artificial Intelligence. California, USA: AAAI Press, Palo Alto; 2017

[57] Nguyen DT, Kumar A, Lau HC. Policy gradient with value function approximation for collective multiagent planning. Advances in Neural Information Processing Systems. 2017; **30**:4319-4329

[58] Nguyen DT, Kumar A, Lau HC. Credit assignment for collective multiagent RL with global rewards. Advances in Neural Information Processing Systems. 2018;**31**:8102-8113
