## **5. GPU taxonomy**

A GPU contains a large number of simple cores that can run thousands of threads to do millions of similar calculations at the same time in parallel, **Figure 1**. These threads need to be independent of each other, without synchronization issues, to run concurrently. Parallelism of threads in a GPU is suitable for executing the same copy of a single program on different data [single program, multiple data (SPMD)], i.e., data parallelism [8].

**Figure 1.** CPU vs. GPU, from Ref. [8].

SPMD is different from single instruction, multiple data (SIMD). In SPMD, the same program code is executed in parallel on different parts of the data, while in SIMD, the same instruction is executed at the same time in all processing units [9]. CPUs are low-latency, low-throughput processors (faster for serial processing), while GPUs are high-latency, high-throughput processors (optimized for maximum throughput and for scalable parallel processing).
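To make the SPMD idea concrete, here is a minimal CUDA sketch of our own (the kernel name and sizes are illustrative, not from the references): one program, the kernel, is launched across roughly a million threads, and each thread executes the same code on a different element selected by its computed index.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SPMD: every thread executes this same program, but its computed
// global index steers it to a different element of the data.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
    if (i < n) y[i] = a * x[i] + y[i];              // same code, different data
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y); // one program, ~10^6 threads
    cudaDeviceSynchronize();
    printf("launched %d threads\n", ((n + 255) / 256) * 256);
    cudaFree(x); cudaFree(y);
    return 0;
}
```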

GPU architecture consists of two main components: global memory and streaming multiprocessors (SMs). Global memory is the main memory of the GPU; it is accessible by both the GPU and the CPU with high bandwidth. The SMs contain many simple cores that execute the parallel computations; the number of SMs in a device and the number of cores per SM differ from one device to another. For example, Fermi has 16 SMs with 32 cores each (for a total of 16 × 32 = 512 cores); see **Table 1** for different GPU devices.
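These counts need not be hard-coded; the CUDA runtime can report them at run time. A minimal sketch (assuming at least one device is present):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of the first device
    // The runtime reports the SM count; cores per SM depend on the
    // architecture (32 on Fermi), so total cores = SMs x cores-per-SM.
    printf("%s: %d SMs, %.1f GB of global memory\n",
           prop.name, prop.multiProcessorCount,
           prop.totalGlobalMem / 1073741824.0);
    return 0;
}
```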

There are different GPU memory hierarchies for different devices. **Figure 2** shows an example of the NVIDIA Fermi memory hierarchy, with the following memories:


• Shared memory and L1 cache (primary cache).

• Texture memory and read-only cache.

• Global (main) memory and local memory.

• L2 cache (secondary cache).


• Registers.

• Constant memory.
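As a rough illustration of how these spaces appear to the programmer, the CUDA sketch below (a hypothetical kernel of ours, not Fermi-specific code from the reference) touches constant memory, shared memory, registers, and global memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float coeff;              // constant memory: cached, read-only in kernels

__global__ void scaleWithTile(const float* in, float* out, int n) {
    __shared__ float tile[256];        // shared memory: on-chip, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index held in a register
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // read from global (main) memory
    __syncthreads();                   // make the tile visible block-wide
    if (i < n) out[i] = coeff * tile[threadIdx.x];  // write back to global memory
}

int main() {
    const int n = 1024;                // assumes blockDim.x == 256 below
    float h = 3.0f;
    cudaMemcpyToSymbol(coeff, &h, sizeof(float));   // fill constant memory from the host

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));
    scaleWithTile<<<n / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    printf("done\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```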

GPU computing can be divided into four different classes according to the different combinations between hosts and devices, **Figure 3**. These classes are as follows:

• Single host, single device (SHSD).

• Single host, multiple devices (SHMD).

• Multiple hosts, single device (MHSD).

• Multiple hosts, multiple devices (MHMD).

In the rest of this section, we will talk about each class separately.


|                   | Single Device | Multiple Device |
|-------------------|---------------|-----------------|
| **Single Host**   | SHSD          | SHMD            |
| **Multiple Host** | MHSD          | MHMD            |

**Figure 3.** GPU computing taxonomy.

#### **5.1. Single host, single device (SHSD)**

The first class (type) in the GPU taxonomy is the single host with a single device (SHSD), as shown in **Figure 4**. It is composed of one host (computer) with one GPU device installed in it. Normally, the host runs code similar to conventional (possibly serial) programs, while the parallel part of the code is executed concurrently on the device's cores (massive parallelism).

The processing flow of SHSD computing includes:

• Transfer input data from the CPU's memory to the GPU's memory.

• Load the GPU program and execute it.

• Transfer the results from the GPU's memory back to the CPU's memory [12].
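These three steps map directly onto the CUDA runtime API. Below is a minimal sketch; the kernel, names, and sizes are our own illustration rather than code from the cited reference:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread doubles one element of the input.
__global__ void doubleElements(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float* h_in = (float*)malloc(bytes);
    float* h_out = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);

    // Step 1: transfer input data from the CPU's memory to the GPU's memory.
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // Step 2: load the GPU program (the kernel) and execute it.
    int threads = 256, blocks = (n + threads - 1) / threads;
    doubleElements<<<blocks, threads>>>(d_in, d_out, n);

    // Step 3: transfer the results from the GPU's memory back to the CPU's memory.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("out[1] = %.1f\n", h_out[1]);  // expect 2.0
    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return 0;
}
```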
An example of SHSD is shown in **Figure 5**; data transferred from the CPU host to the GPU device passes through a communication bus connecting the GPU to the CPU. This bus is a PCI Express link with a data transfer rate of 8 GB/s, which makes it the weakest link in the chain (a newer, faster generation is available; see Section 6.1 for more details). The other links in the figure are the memory bandwidth between the DDR3 main memory and the CPU (42 GB/s) and the memory bandwidth between the GPU and its GDDR5 global memory (288 GB/s).

**Bandwidth limited**

Because transferring data between the CPU and the GPU is expensive, we should always try to minimize it. Moreover, if the processors request data at too high a rate, the memory system cannot keep up, and no amount of latency hiding helps. Overcoming bandwidth limits is a common challenge for GPU-compute application developers [14].
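As a rough sketch of the usual remedy, the hypothetical program below pays the PCI Express cost once in each direction and keeps every intermediate step in the GPU's global memory, rather than copying around each kernel launch:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical pipeline stage; the fixed point of x -> x/2 + 1 is 2.
__global__ void step(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 20, steps = 100;
    size_t bytes = n * sizeof(float);
    float* h_buf = (float*)calloc(n, sizeof(float));

    float* d_buf;
    cudaMalloc((void**)&d_buf, bytes);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // cross PCIe once

    for (int s = 0; s < steps; ++s)                  // intermediate results
        step<<<(n + 255) / 256, 256>>>(d_buf, n);    // never leave the GPU

    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // and once back
    printf("buf[0] after %d steps: %f\n", steps, h_buf[0]);   // ~2.0
    cudaFree(d_buf); free(h_buf);
    return 0;
}
```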

**Figure 4.** Single host, single device (SHSD).

**Figure 5.** Example of SHSD class, from Ref. [13].


#### **5.2. Single host, multiple device (SHMD)**

The second class uses a single host with multiple devices installed in it (SHMD), **Figures 6** and **7**. SHMD can be used to run parallel tasks on the installed GPUs, with each GPU running its sub-tasks on its own cores.

We can use a notation like SHMD (d) to show the number of devices installed in the host. For example, a single host with three GPU devices installed in it is written as SHMD (3).
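A minimal sketch of how an SHMD (d) program might distribute work, assuming a trivial kernel of our own and using the CUDA runtime's device selection:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sub-task: double every element of this device's chunk.
__global__ void subTask(float* chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;
}

int main() {
    int d = 0;
    cudaGetDeviceCount(&d);                // discover d for SHMD (d)
    if (d < 1) return 1;
    const int n = 1 << 20, per = n / d;    // split the task into d sub-tasks

    for (int dev = 0; dev < d; ++dev) {
        cudaSetDevice(dev);                // direct the following calls to GPU dev
        float* buf;
        cudaMalloc((void**)&buf, per * sizeof(float));
        cudaMemset(buf, 0, per * sizeof(float));
        subTask<<<(per + 255) / 256, 256>>>(buf, per);  // each GPU runs its share
        cudaDeviceSynchronize();
        cudaFree(buf);
    }
    printf("ran sub-tasks on %d device(s)\n", d);
    return 0;
}
```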

**Figure 6.** Single host, multiple device [SHMD (n)].

**Figure 7.** Server of type SHMD (16). Image from: https://www.youtube.com/watch?v=Vm9mFtSq2sg.

#### **5.3. Multiple host, single device (MHSD)**

Multiple hosts with a single device in each host (MHSD) is an architecture for a GPU cluster that connects many SHSD nodes together, **Figure 8**.

We can use the notation MHSD (h) to define the number of hosts in this architecture. For example, the architecture in **Figure 8** is an MHSD (4), where four nodes (SHSD) are connected in a network.
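Since MHSD spans machines, some inter-host communication layer is needed; MPI is a common (though not the only) choice. A hedged sketch, assuming one MPI rank per host, one GPU per rank, and a hypothetical kernel:

```cuda
// One MPI rank per host, one GPU per rank; compile with nvcc and link MPI,
// e.g.: nvcc mhsd.cu -lmpi (exact flags depend on the MPI installation).
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void subTask(float* chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, h;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // which host (node) am I?
    MPI_Comm_size(MPI_COMM_WORLD, &h);     // h hosts in total: MHSD (h)

    cudaSetDevice(0);                      // the single device in this host
    const int per = (1 << 20) / h;         // this host's share of the task
    float* buf;
    cudaMalloc((void**)&buf, per * sizeof(float));
    cudaMemset(buf, 0, per * sizeof(float));
    subTask<<<(per + 255) / 256, 256>>>(buf, per);
    cudaDeviceSynchronize();
    cudaFree(buf);

    MPI_Finalize();
    return 0;
}
```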

#### **5.4. Multiple host, multiple device (MHMD)**

Multiple hosts with multiple devices in each host (MHMD) is a GPU cluster composed of a number of SHMD nodes (where all nodes may have the same number of devices or different numbers of devices), **Figure 9**.

**Figure 9.** GPU cluster [MHMD (2,2)], image from: http://timdettmers.com/2014/09/21/how-to-build-and-use-a-multigpu-system-for-deep-learning.

We can use the MHMD (h, d) notation to denote the number of hosts and the number of devices in each host, where h is the number of hosts and d is the number of devices. If the number of devices per node is not equal, we can ignore the second parameter by putting x as a "don't care": MHMD (h, x).
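For example, under this notation a cluster of four hosts with two GPU devices in each is written MHMD (4, 2), while the same four hosts with unequal device counts would be MHMD (4, x). Programmatically, an MHMD code can combine the two previous sketches: MPI ranks across the hosts, with each rank looping over its local devices via cudaSetDevice.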
