**6.2. Communication between nodes**

In a GPU cluster (MHSD or MHMD), the main bottleneck is the communication between nodes (network bandwidth), that is, how much data can be transferred from computer to computer per second.

If we do not use direct data transfer between different nodes in a GPU cluster, then the data is transferred in the following steps:

• GPU in node 1 to CPU in node 1.

• CPU in node 1 to network card in node 1.

• Network card in node 1 to network card in node 2.

• Network card in node 2 to CPU in node 2.

• CPU in node 2 to GPU in node 2.

Some companies such as Mellanox and NVIDIA have recently solved this problem by using GPUDirect RDMA, which can transfer data directly from GPU to GPU between the computers [18].

**6.3. GPUDirect**

GPUDirect allows multiple GPU devices to transfer data with no CPU intervention (eliminating the internal copying and overhead caused by the host CPU). This accelerates communication with the network, makes data transfer from the GPU to the communication network efficient, and allows peer-to-peer transfers between GPUs [19]. CUDA supports multi-GPU communication, where data can be transferred between GPU devices without being buffered in the CPU's memory, which can significantly speed up transfers and simplify programming [20]. **Figures 10** and **11** show how GPUDirect can be used in SHMD and MHMD, respectively.

**Figure 10.** GPUDirect transfer in SHMD, from Ref. [19].
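As a concrete illustration of peer-to-peer transfers on a single host, the minimal sketch below (not taken from the cited references; the device indices and buffer size are illustrative assumptions) uses the CUDA runtime calls cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess, and cudaMemcpyPeer to copy a buffer directly from GPU 0 to GPU 1 without staging it in the CPU's memory.

```
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int canAccess = 0;
    //Check whether GPU 0 can address GPU 1's memory directly (illustrative device indices)
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) {
        printf("Peer-to-peer access between GPU 0 and GPU 1 is not supported\n");
        return 1;
    }

    size_t bytes = 1 << 20;              //1 MiB, illustrative buffer size
    float *buf0 = NULL, *buf1 = NULL;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);    //let device 0 reach device 1 directly

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);
    cudaDeviceEnablePeerAccess(0, 0);    //let device 1 reach device 0 directly

    //Copy directly from device 0's buffer to device 1's buffer (no CPU staging)
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```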

**7. Tools for GPU computing**

Many tools are available for developing GPU computing applications, including development environments, programming languages, and libraries. In this section, we give a brief overview of some of these tools.

### **7.1. Compute unified device architecture (CUDA)**

CUDA is a parallel computing platform and programming model invented by NVIDIA. It increases computing performance by harnessing the power of GPU devices. CUDA splits a problem into serial sections and parallel sections; serial sections are executed on the CPU as host code, and parallel sections are executed on the GPU by launching a kernel, as shown in **Figure 12**.

#### *7.1.1. CUDA in SHSD*

CUDA can run in SHSD using the following sequence of operations [21]:


Kernel code: the instructions actually executed on the GPU.

The following code demonstrates a portion of a program that can be run in SHSD.

```
//Kernel definition --- device code
__global__ void VecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
//////host code invoking kernel
int main() {
…
//Kernel invocation with N threads (invoke device code from host code)
VecAdd<<<1, N>>>(A, B, C);
…
}
```
SHSD example from

**Figure 12.** Using CUDA in SHSD. Image from: http://3dgep.com/wp-content/uploads/2011/11/Cuda-Execution-Model.png.
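For completeness, a self-contained variant of the same SHSD flow is sketched below (the vector length, input values, and printed element are illustrative assumptions, not part of the original example); it shows the host-side sequence of operations: allocate device memory, copy the inputs to the GPU, launch the kernel, and copy the result back to the host.

```
#include <stdio.h>
#include <cuda_runtime.h>

//Kernel definition --- device code (same kernel as above)
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;                  //illustrative vector length (one thread block)
    size_t bytes = N * sizeof(float);
    float h_A[N], h_B[N], h_C[N];
    for (int i = 0; i < N; i++) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    //Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    //Copy inputs from host memory to device memory
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    //Launch the kernel with N threads
    VecAdd<<<1, N>>>(d_A, d_B, d_C);

    //Copy the result back from device memory to host memory
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("h_C[10] = %f\n", h_C[10]);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```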

#### *7.1.2. CUDA in SHMD*

Multiple GPUs are used to gain more speed and to overcome the limited memory of a single GPU device (which has a smaller capacity than the CPU's memory). For example, 32–64 GB is a typical size for the host memory, whereas a single GPU device has between 4 and 12 GB of device memory. When dealing with large-scale scientific applications, the size of the device memory may thus become a limiting factor. One way of overcoming this barrier is to make use of multiple GPUs [22]. Systems with multiple GPUs are now becoming more common. For SHMD, CUDA can manage multiple GPU devices from a single CPU host thread [23].

To use multiple GPU devices in a single host, we need a thread for each device to control it (attach a GPU device to a host thread). The following code shows how one can invoke different kernels on different devices from a single host.

```
//Run independent kernel on each CUDA device
int numDevs = 0;
cudaGetDeviceCount(&numDevs);//number of devices available
…
for (int d = 0; d < numDevs; d++) {
cudaSetDevice(d);//Attach a GPU device to a host thread (select a GPU)
kernel<<<blocks, threads>>>(args);//invoke independent kernel in each device
}
```
SHMD example using CUDA from Ref. [24]
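The loop above drives all the devices from a single host thread. A minimal sketch of the alternative mentioned earlier, one host thread per GPU device, is shown below using OpenMP (the kernel body, buffer size, and launch configuration are illustrative assumptions; OpenMP support must be enabled when compiling).

```
#include <omp.h>
#include <cuda_runtime.h>

//Illustrative kernel; the real kernel and its arguments depend on the application
__global__ void kernel(float *data)
{
    data[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    int numDevs = 0;
    cudaGetDeviceCount(&numDevs);      //number of devices available
    if (numDevs == 0) return 1;

    //One OpenMP host thread per GPU device
    #pragma omp parallel num_threads(numDevs)
    {
        int d = omp_get_thread_num();
        cudaSetDevice(d);              //attach this host thread to device d
        float *d_data;
        cudaMalloc(&d_data, 128 * sizeof(float));
        kernel<<<1, 128>>>(d_data);    //independent kernel on each device
        cudaDeviceSynchronize();
        cudaFree(d_data);
    }
    return 0;
}
```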
#### *7.1.3. Multiple host, single device (MHSD) and multiple host, multiple device (MHMD)*

MPI is a programming model that is used for distributed memory systems. If we have a MHSD or a MHMD system, MPI can be used to distribute tasks to the computers, each of which can use its CPU and GPU devices to process the distributed task.

For example, if we want to do matrix multiplication on MHSD, then we can:

• Split the matrix into sub-matrices.



One way of programming MHSD and MHMD is to use MPI together with CUDA, where MPI handles parallelization over the hosts (nodes) and CUDA handles parallelization on the devices (GPUs). We can use one MPI process per GPU and accelerate the computational kernels with CUDA.
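A minimal sketch of this one-MPI-process-per-GPU pattern is given below (the kernel, the problem size, and the simple rank-to-device mapping are illustrative assumptions; production codes usually map the node-local rank to a device).

```
#include <mpi.h>
#include <cuda_runtime.h>

//Illustrative kernel: double every element of the local array
__global__ void compute(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    //Map this MPI process to one local GPU (assumes one process per GPU on each node)
    int numDevs = 0;
    cudaGetDeviceCount(&numDevs);
    cudaSetDevice(rank % numDevs);

    //Each process accelerates its own share of the work with CUDA
    int n = 1 << 20;                     //illustrative local problem size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    compute<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    MPI_Finalize();
    return 0;
}
```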

For transferring data between different devices in different hosts, we can use the following steps:


For example, if we need to transfer data between node 0 and node N-1 as shown in **Figure 13**, then we can use the MPI\_Send and MPI\_Recv functions as follows:

```
//MPI rank 0
MPI_Send(s_buf_d,size,MPI_CHAR,n-1,tag,MPI_COMM_WORLD);
//MPI rank n-1
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
```

MHSD using MPI and CUDA

Here, s\_buf\_d is a buffer on the device of node 0 and r\_buf\_d is a buffer on the device of node N-1.
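If the MPI library is not CUDA-aware (the direct use of device buffers shown above requires such support, as discussed later in this section), one way to realize the transfer is to stage the data through host memory on both nodes. The sketch below is an illustrative assumption with a hypothetical helper function; the buffer names mirror those used above.

```
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

//Hypothetical helper: move 'count' floats from a device buffer on rank 0
//to a device buffer on rank nprocs-1 by staging through host memory.
void transfer_via_host(float *s_buf_d, float *r_buf_d, int count, int rank, int nprocs)
{
    int tag = 0;
    MPI_Status stat;
    float *h_tmp = (float*)malloc(count * sizeof(float));   //host staging buffer

    if (rank == 0) {
        //GPU in node 0 -> CPU in node 0, then over the network with MPI
        cudaMemcpy(h_tmp, s_buf_d, count * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Send(h_tmp, count, MPI_FLOAT, nprocs - 1, tag, MPI_COMM_WORLD);
    } else if (rank == nprocs - 1) {
        //Receive into CPU memory, then CPU in node N-1 -> GPU in node N-1
        MPI_Recv(h_tmp, count, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &stat);
        cudaMemcpy(r_buf_d, h_tmp, count * sizeof(float), cudaMemcpyHostToDevice);
    }
    free(h_tmp);
}
```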

The following code is a simple MPI with CUDA example that shows how to collect and print the list of devices from all MPI processes in the MHSD/MHMD system.

```
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <cuda.h>
#define MAX_NODES 100
#define BUFF_LEN 256
//Enumeration of CUDA devices accessible for the process.
void enumCudaDevices(char *buff)
{
char tmpBuff[BUFF_LEN];
int i, devCount;
cudaGetDeviceCount(&devCount);//number of devices
sprintf(tmpBuff," %3d", devCount);
strncat(buff, tmpBuff, BUFF_LEN);
for (i = 0; i < devCount; i++)
{
cudaDeviceProp devProp;
cudaGetDeviceProperties(&devProp, i);
sprintf(tmpBuff, " %d:%s", i, devProp.name);
strncat(buff, tmpBuff, BUFF_LEN);
}
}
int main(int argc, char *argv[])
{
int i, myrank, numprocs;
char pName[MPI_MAX_PROCESSOR_NAME], buff[BUFF_LEN];
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Get_processor_name(pName, &i);
sprintf(buff, "%-15s %3d", pName, myrank);
//Find local CUDA devices
enumCudaDevices(buff);
//Collect and print the list of CUDA devices from all MPI processes
if (myrank == 0) {
char devList[MAX_NODES][BUFF_LEN];
MPI_Gather(buff, BUFF_LEN, MPI_CHAR,
devList, BUFF_LEN, MPI_CHAR,
0, MPI_COMM_WORLD);
for (i = 0; i < numprocs; i++)
printf("%s\n", devList + i);
}
else
MPI_Gather(buff, BUFF_LEN, MPI_CHAR,
NULL, 0, MPI_CHAR,
0, MPI_COMM_WORLD);
MPI_Finalize();
return 0;
}
```

The output of the program looks similar to this:

```
g01n07.pdc.kth.se 0 3 0:Tesla M2090 1:Tesla M2090 2:Tesla M2090
g01n06.pdc.kth.se 1 3 0:Tesla M2090 1:Tesla M2090 2:Tesla M2090
```

MHMD simple example from (https://www.pdc.kth.se/resources/software/old-installed-software/mpi-libraries/cuda-and-mpi)

Recent versions of most MPI libraries support sending and receiving directly from CUDA device memory; for example, Cray's implementation of MPICH supports passing GPU memory buffers directly to MPI function calls, without manually copying the GPU data to the host before passing it through MPI. The following code shows how to initialize memory on the GPU and then perform an MPI\_Allgather operation between GPUs using device buffers [28].

```
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <mpi.h>
int main(int argc, char** argv)
{
MPI_Init (&argc, &argv);
int direct;
int rank, size;
int *h_buff = NULL;
int *d_rank = NULL;
int *d_buff = NULL;
size_t bytes;
int i;
//Ensure that RDMA ENABLED CUDA is set correctly
direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv("MPICH_RDMA_ENABLED_CUDA"));
if(direct != 1){
printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n");
exit (EXIT_FAILURE);
}
//Get MPI rank and size
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);
//Allocate host and device buffers and copy rank value to GPU
bytes = size*sizeof(int);
h_buff = (int*)malloc(bytes);
cudaMalloc(&d_buff, bytes);
cudaMalloc(&d_rank, sizeof(int));
cudaMemcpy(d_rank, &rank, sizeof(int), cudaMemcpyHostToDevice);
//Perform Allgather using device buffer
MPI_Allgather(d_rank, 1, MPI_INT, d_buff, 1, MPI_INT, MPI_COMM_WORLD);
//Check that the GPU buffer is correct
cudaMemcpy(h_buff, d_buff, bytes, cudaMemcpyDeviceToHost);
for(i=0; i<size; i++){
if(h_buff[i] != i) {
printf ("Allgather Failed!\n");
exit (EXIT_FAILURE);
}
}
if(rank==0)
printf("Success!\n");
//Clean up
free(h_buff);
cudaFree(d_buff);
cudaFree(d_rank);
MPI_Finalize();
return 0;
}
```
Direct data transfer, code from Ref. [28]

**Figure 13.** MPI with CUDA for MHSD, from Ref. [27].

### **7.2. GPU computing using MATLAB**


MATLAB is a widely used simulation tool for rapid prototyping and algorithm development. Since MATLAB uses a vector/matrix representation of data, which is suitable for parallel processing, it can benefit greatly from CPU and GPU cores.

We can use two tools for parallelization in MATLAB:

