**Figure 7.** Server of type SHMD (16). Image from: https://www.youtube.com/watch?v=Vm9mFtSq2sg.

#### **5.3. Multiple host, single device (MHSD)**

Multiple host with single device in each host (MHSD) is an architecture for using a GPU cluster by connecting many SHSD nodes together, **Figure 8**.

We can use the notation MHSD (h) to define the number of hosts in this architecture. For example, the architecture in **Figure 8** is an MHSD (4), where four SHSD nodes are connected in a network.

**Figure 8.** MHSD (4).

#### **5.4. Multiple host, multiple device (MHMD)**

Multiple host with multiple devices in each host (MHMD) is a GPU cluster built from a number of SHMD nodes (where all nodes may have the same number of devices or may have different numbers of devices), **Figure 9**.

### **6. Communications**

There are two types of connections that can be used in GPU computing: the connection between devices inside a single host (PCIe) and the communication between hosts in a cluster (the network). In this section, we discuss each one below.

#### **6.1. Peripheral component interconnect express (PCIe)**

PCIe is a standard point-to-point interconnect used to connect internal devices in a computer (it links two or more PCIe devices). For example, a GPU can be connected to a CPU or to another GPU, and a network card such as an InfiniBand adapter can be connected to a CPU or a GPU, because most of these devices are now PCIe devices.

PCI started in 1992 with 133 MB/s and increased to reach 533 MB/s in 1995. Then PCI-X appeared with a transfer rate of 1066 MB/s (about 1 GB/s). In 2004, PCIe generation 1.x came with 2.5 GT/s per lane, followed by generation 2.x in 2007 with 5 GT/s; generation 3.x appeared in 2011 with 8 GT/s, and the latest generation, 4.x, came in 2016 with a transfer rate reaching 16 GT/s per lane [15–17].

PCIe roughly doubles the data rate with each new generation.
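As a rough illustration of what this doubling means in practice, the short host-side sketch below converts the per-lane rates above into usable bandwidth. It assumes 8b/10b encoding for generations 1.x and 2.x, 128b/130b encoding for generations 3.x and 4.x, and an x16 link as commonly used for GPUs; these encoding details are standard PCIe facts, not taken from this chapter.

```c
/* Theoretical PCIe bandwidth per generation: a rough arithmetic sketch.
 * Per-lane rates are in GT/s; encoding overhead differs between generations. */
#include <stdio.h>

int main(void) {
    const char  *gen[]  = {"1.x", "2.x", "3.x", "4.x"};
    const double gtps[] = {2.5, 5.0, 8.0, 16.0};                      /* GT/s per lane */
    const double eff[]  = {8.0/10.0, 8.0/10.0, 128.0/130.0, 128.0/130.0};

    for (int i = 0; i < 4; i++) {
        /* One transfer carries one bit, so GT/s * efficiency / 8 gives GB/s per lane. */
        double lane_gbs = gtps[i] * eff[i] / 8.0;
        printf("PCIe %s: %.2f GB/s per lane, ~%.1f GB/s for an x16 link\n",
               gen[i], lane_gbs, lane_gbs * 16.0);
    }
    return 0;
}
```

The resulting x16 figures (about 4, 8, 16, and 32 GB/s per direction) are the ones usually quoted for GPU slots.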

#### **6.2. Communication between nodes**

In a GPU cluster (MHSD or MHMD), the main bottleneck is the communication between nodes (network bandwidth), that is, how much data can be transferred from computer to computer per second.

If we do not use direct data transfer between the nodes of a GPU cluster, the data is transferred in the following steps (sketched in code below):

1. Copy the data from GPU memory to host (CPU) memory on the sending node.
2. Transfer the data from the sender's host memory over the network to the receiver's host memory.
3. Copy the data from the receiver's host memory to its GPU memory.
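A minimal sketch of these staged steps, assuming one MPI rank per node with one GPU each (the MPI setup and buffer names are illustrative, not from the chapter):

```c
/* Staged (non-direct) transfer of a GPU buffer between two nodes using MPI:
 * device -> host on the sender, host -> host over the network,
 * host -> device on the receiver. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                                /* number of floats    */
    float *d_buf, *h_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));       /* GPU (device) buffer */
    h_buf = (float *)malloc(n * sizeof(float));           /* host staging buffer */

    if (rank == 0) {
        /* Step 1: copy from GPU memory to host memory on the sending node. */
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        /* Step 2: send the host buffer over the network to the other node. */
        MPI_Send(h_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Step 2 (receiver side): receive into host memory. */
        MPI_Recv(h_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Step 3: copy from host memory to GPU memory on the receiving node. */
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    }

    free(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Every byte crosses the PCIe bus twice and sits in host memory on both sides, which is exactly the overhead that the direct approaches below avoid.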
Companies such as Mellanox and NVIDIA have recently addressed this problem with GPUDirect RDMA, which can transfer data directly from GPU to GPU between computers [18].
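If the cluster's MPI library is CUDA-aware and built with GPUDirect RDMA support (an assumption about the software stack, not something stated in the chapter), the staging copies disappear and the device pointer can be handed to MPI directly; a hedged sketch:

```c
/* Direct-path sketch: requires a CUDA-aware MPI (e.g., Open MPI built with
 * GPUDirect RDMA support); the library and the NIC move GPU memory over the
 * network without a host staging buffer. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));    /* GPU memory on each node */

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer */
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```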

#### **6.3. GPUDirect**

GPUDirect allows multiple GPU devices to transfer data with no CPU intervention (it eliminates internal copies and overhead in the host CPU). This accelerates communication with the network, makes data transfer from the GPU to the communication network efficient, and allows peer-to-peer transfers between GPUs [19]. CUDA supports multi-GPU communication, where data can be transferred between GPU devices without being buffered in CPU memory, which can significantly speed up transfers and simplify programming [20]. **Figures 10** and **11** show how GPUDirect can be used in SHMD and MHMD, respectively.

**Figure 10.** GPUDirect transfer in SHMD, from Ref. [19].

**Figure 11.** GPUDirect transfer in MHMD. Image from: http://www.paulcaheny.com/wp-content/uploads/2012/05/RDMA-GPU-Direct.jpg.
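As a concrete single-host (SHMD) illustration of these peer-to-peer transfers, the sketch below uses the CUDA runtime P2P calls; the device indices and buffer size are arbitrary, and whether peer access is actually available depends on the hardware topology.

```c
/* Peer-to-peer copy between two GPUs in one host (SHMD) using the CUDA runtime.
 * Minimal sketch with error handling reduced to the peer-access check. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t bytes = (size_t)(1 << 20) * sizeof(float);
    float *d0, *d1;

    cudaSetDevice(0);
    cudaMalloc((void **)&d0, bytes);            /* buffer on GPU 0 */
    cudaSetDevice(1);
    cudaMalloc((void **)&d1, bytes);            /* buffer on GPU 1 */

    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);      /* can GPU 0 reach GPU 1 directly? */
    if (can01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);       /* enable direct access to GPU 1 */
    }

    /* Copy GPU 0 -> GPU 1: uses the direct P2P path when peer access is enabled;
       otherwise CUDA falls back to staging through host memory. */
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    printf("peer access 0->1: %s\n", can01 ? "yes" : "no");

    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    return 0;
}
```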
