**5. Message-passing in future Many-core Grids**

In the previous sections, we have considered the demands of message-passing in Grid environments as well as in Many-core systems. However, we have so far treated each of them separately. In this section, we discuss how these two worlds can eventually be combined.

#### **5.1 Bringing two worlds together**

When looking at Section 3 and Section 4, it becomes obvious that the two worlds described there will inevitably merge in the near future: the world of chip-embedded Many-core systems will have to be incorporated into the hierarchies of distributed Grid computing. The question, however, is how this integration step can be conducted in such a way that both worlds benefit from the fusion. The communication interfaces between these two worlds, in particular, will play a key role in their interaction.

#### **5.1.1 The macrocosmic world**

In the world of Grid computing, as well as in the domain of High-Performance Computing (HPC), MPI has become the prevalent standard for message-passing. The range of functions defined by the MPI standard is very large, and an MPI implementation that aims to provide this full range has to implement well over 250 functions.<sup>14</sup> In turn, this implies that such an MPI library tends to become heavyweight in terms of resource consumption, for example in the form of a large memory footprint. In the HPC domain this is certainly tolerable, at least as long as the resources deployed lead to high performance, for example in the form of low latencies, high bandwidth and optimal processor utilization. Furthermore, a Grid-enabled MPI implementation must also be capable of dealing with heterogeneous structures, and it must provide support for the varying networking technologies and protocols of hierarchical topologies. Moreover, the implementation of such a library becomes even more complex and extensive if, besides the MPI API, additional service interfaces for the interaction between the MPI session and the Grid environment come into play. However, when looking at the application level, it can be noticed that many MPI applications make use of fewer than 10 functions from the large set offered. On the one hand, this is because a handful of core functions already suffices to write a vast number of useful and efficient MPI programs (Gropp et al., 1999). On the other hand, knowledge of and experience with *all* MPI functions are not very common even within the community of MPI users. Therefore, many users rely on a subset of more common functions and implement less common functionality on their own at application level.<sup>15</sup>
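To illustrate this point, consider the following minimal sketch of a complete MPI program that gets by with only six of the core functions. It is an illustrative example only and is not taken from the referenced literature.

```c
/* A complete MPI program that uses only six core functions:
 * MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
 * and MPI_Finalize. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            /* process 0 passes a token to process 1 */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received token %d\n", token);
        }
    }

    MPI_Finalize();
    return 0;
}
```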

#### **5.1.2 The microcosmic world**

Currently, there exists no uniform and dominant standard for message-passing in the domain of Many-core and cluster-on-chip architectures. Although MPI can in principle also be employed in such embedded systems, customized communication interfaces, such as RCCE for the SCC, are predominant in this domain. The reason is that MPI is frequently too heavyweight for pure on-chip communication, because major parts of an MPI implementation would be superfluous in such systems, for example the support for unreliable connections, different networking protocols or heterogeneity in general. However, a major drawback of customized libraries with proprietary interfaces is that applications become bound to their specific library and thus less portable to other platforms. Therefore, a unified and widespread *interface standard* for on-chip communication in multicore and Many-core systems, as MCAPI promises to become, would certainly be a step in the right direction. In fact, the MCAPI specification aims to facilitate lightweight implementations that can handle the core-to-core communication in embedded systems with limited resources but less demanding requirements (Brehmer et al., 2011).
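The following sketch illustrates the portability problem: an application that is to run both in an MPI-based environment and on top of a customized on-chip library has to hide the concrete interface behind its own thin wrapper. The wrapper names comm_send/comm_recv are invented for this illustration, and the RCCE signatures shown are assumed to follow RCCE's simple send/receive interface.

```c
/* Hypothetical portability wrapper: the application calls comm_send()
 * and comm_recv() and is compiled either against MPI (off-die case)
 * or against RCCE (on-die case on the SCC). */
#include <stddef.h>

#ifdef USE_MPI
#include <mpi.h>
static void comm_send(void *buf, size_t size, int dest)
{
    MPI_Send(buf, (int)size, MPI_BYTE, dest, 0, MPI_COMM_WORLD);
}
static void comm_recv(void *buf, size_t size, int src)
{
    MPI_Recv(buf, (int)size, MPI_BYTE, src, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}
#else /* USE_RCCE: assumed simple-interface signatures of RCCE */
#include "RCCE.h"
static void comm_send(void *buf, size_t size, int dest)
{
    RCCE_send((char *)buf, size, dest);
}
static void comm_recv(void *buf, size_t size, int src)
{
    RCCE_recv((char *)buf, size, src);
}
#endif
```

A unified interface standard for on-chip communication would render such per-platform wrappers unnecessary.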

<sup>14</sup> That is, an MPI implementation compliant with the MPI-2 level of the standard.

<sup>15</sup> An example of this is the comprehensive set of *collective operations* offered by MPI (see Section 2.1.2).

#### **5.1.3 The best of both worlds**

Several approaches for Many-core architectures follow the asymmetric multiprocessing (AMP) approach, where one core is designated to be the master core whereas the other cores act as accelerator cores (see Section 4.1). Examples of this approach are the Cell Broadband Engine or Intel's Knights Corner (Vajda, 2011). One way to combine the macrocosmic world of Grid computing with the microcosmic world of Many-cores in terms of message-passing is to use MPI for the off-die communication between multiple master cores and customized communication interfaces, such as MCAPI, for the on-die communication between the masters and their respective accelerator cores. In this *Multiple-Master Multiple-Worker* approach, the master cores not only act as dispatchers for the local workload but must also act as routers for messages sent from one accelerator core to another one on a remote die (see Figure 10(a)). This approach can be arranged with the MPMD paradigm (see Section 2.2.1), where the master cores run a different program (based on MPI plus, e.g., MCAPI calls for the communication) than the accelerator cores (which run, e.g., a pure MCAPI-based parallel code). In addition, the master cores may spawn the processes on the local accelerator cores at runtime and in an iterative manner, in order to assign dedicated subtasks to them. However, one drawback of this approach is the need for processes running on the master cores to communicate via two (or even more) application programming interfaces. A further drawback is that the master cores are prone to become bottlenecks in terms of inter-site communication. Another approach would be to base all communication calls of the applications upon the MPI API so that all cores become part of one large MPI session. However, this approach also has some drawbacks that, if not taken into account, threaten to limit the overall performance.

Fig. 10. Many-core Systems According to the AMP Approach: (a) Three Linked AMP Many-core Systems; (b) MPI for On-die Communication
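The following fragment sketches the router role of a master core in such a Multiple-Master Multiple-Worker setup; it is a mere illustration and not a complete program. The envelope layout as well as the functions ondie_send() and ondie_recv() are hypothetical stand-ins for a customized on-die interface such as MCAPI.

```c
/* Sketch of a master core acting as a router between MPI (off-die)
 * and a customized on-die interface. */
#include <mpi.h>

#define MSG_MAX 4096

struct envelope {
    int  dest_core;            /* accelerator core on the target die */
    int  payload_len;          /* number of valid bytes in payload[] */
    char payload[MSG_MAX];
};

/* hypothetical on-die primitives (placeholders for, e.g., MCAPI calls) */
void ondie_send(int core, const void *buf, int len);
int  ondie_recv(int *core, void *buf, int maxlen);

void master_route_loop(void)
{
    struct envelope env;
    MPI_Status status;

    for (;;) {
        /* 1. Accept a message from a remote master via MPI ... */
        MPI_Recv(&env, sizeof(env), MPI_BYTE, MPI_ANY_SOURCE,
                 MPI_ANY_TAG, MPI_COMM_WORLD, &status);

        /* 2. ... and forward it on-die to the addressed accelerator. */
        ondie_send(env.dest_core, env.payload, env.payload_len);

        /* 3. Messages originating from local accelerators would be
         *    collected via ondie_recv() and passed on with MPI_Send()
         *    to the master of the target die (omitted here). */
    }
}
```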

#### **5.1.4 The demand for hierarchy-awareness**

First of all, when running one large MPI session that covers all participating cores in a Many-core Grid, one has to employ a Grid-enabled MPI library that is not only capable of
