*7.2.1. Parallel computing toolbox*

The parallel computing toolbox can be used to speed up MATLAB code by executing it on a GPU. More than 100 built-in MATLAB functions, including fft, filter and several linear algebra operations, can be executed directly on the GPU by providing an input argument of type gpuArray, a special array type provided by the parallel computing toolbox. In addition, there are further GPU-enabled functions in many toolboxes, such as the image processing toolbox, communications system toolbox, statistics and machine learning toolbox, neural network toolbox, phased array system toolbox and signal processing toolbox. A CUDA kernel can also be integrated into a MATLAB application with only a single line of MATLAB code [29].

Using MATLAB for GPU computing is a good fit for those who have some or a lot of experience with MATLAB coding, but not enough depth in C programming or in the computer architecture needed for parallelization [30].

For example, fft can be used to find the discrete Fourier transform of a vector of pseudorandom numbers on the CPU with ordinary arguments like this:

```
A = rand(2^16,1);
B = fft(A);
```

The same operation can be executed on the GPU simply by using the gpuArray data type:

```
A = gpuArray(rand(2^16,1));
B = fft(A);
```

The result, B, is stored on the GPU but is still visible in the MATLAB workspace. You can return the data to the local MATLAB workspace using the gather command, for example, C = gather(B) [29].

#### *7.2.2. MATLAB-distributed computing server*

MATLAB-distributed computing server is suitable for MHSD and MHMD. The server provides access to multiple workers that receive and execute MATLAB code and Simulink models. Multiple users can run their applications on the server simultaneously.

MHSD and MHMD can use MATLAB workers in the parallel computing toolbox and the MATLAB-distributed computing server.

MATLAB supports CUDA-enabled NVIDIA GPUs with compute capability 2.0 or higher. For release R2014a and earlier, compute capability 1.3 is sufficient. In a future release, support for GPU devices of compute capability 2.x will be removed; at that time, a minimum compute capability of 3.0 will be required.

#### *7.2.3. SHSD code examples*

The following code shows how to perform matrix multiplication in SHSD.

```
Z = X*Y;          % computation on CPU
x = gpuArray(X);  % create a copy of X on the GPU
y = gpuArray(Y);  % create a copy of Y on the GPU
z = x*y;          % computation on GPU
ZZ = gather(z);   % return data from GPU to CPU
```

#### Example 1: SHSD MATLAB code

The following code uses gpuArray to push data to the GPU; any function call on this array is then executed on the GPU. To return the result from GPU device memory to host memory, we use the gather function.

```
A = someArray(1000, 1000);
G = gpuArray(A);  % push to GPU memory
…
F = fft(G);
x = G\b;
…
z = gather(x);    % bring back into MATLAB
```

#### Example 2: SHSD code example, from Ref. [31]

#### *7.2.4. SHMD code examples*

In MATLAB, the parallel computing toolbox (PCT) can easily be used to perform computations on SHMD systems. PCT supports CPU parallelism through the MATLAB pool, which allows you to run a number of workers concurrently. The default number of workers is equal to the number of cores (for a local pool). When you run a parfor loop, for example, the work for that loop is broken up and executed by the MATLAB workers.

To perform computations in an SHMD system, you need to open a MATLAB pool with one worker for each GPU device, since one MATLAB worker is needed to communicate with each GPU. Each MATLAB session can use one GPU at a time.

If you have only one GPU in your computer, that GPU is the default. If you have more than one GPU device, you can use the gpuDevice function to select which device you want to use.

If you have two GPUs, you can assign one local worker to each device, as shown below:

```
matlabpool local 2        % open a pool with two workers
spmd
    gpuDevice(labindex);  % select a device for each worker
    g = gpuArray(…);
    % … operate on g …
end
```
SHMD (2)
62 Recent Progress in Parallel and Distributed Computing

*7.2.5. Multiple host, single device (MHSD)/multiple host, multiple device (MHMD)*

MATLAB® Distributed Computing Server™ lets you run computationally intensive MATLAB programs and Simulink® models on computer clusters, clouds and grids. You develop your program or model on a multicore desktop computer using parallel computing toolbox and then scale up to many computers by running it on MATLAB-distributed computing server. The server supports batch jobs, parallel computations and distributed large data. The server includes a built-in cluster job scheduler and provides support for commonly used third-party schedulers [32].

A parallel pool is a set of workers in a compute cluster (remote) or desktop (local). The default pool size and cluster are specified by your parallel preferences. The workers in a parallel pool can be used interactively and can communicate with each other during the lifetime of the job [33].

In the following multiple-GPU example, we can have more than one worker. If the workers are local on the same host, the system is SHMD. If the workers are remote on a cluster, we are dealing with multiple hosts (MHs); if, in addition, there is more than one GPU device in each host (local workers in each remote host, with one worker for each GPU device), there are multiple devices (MDs) and the system is MHMD; otherwise, it is MHSD.

```
N = 1000;
A = gpuArray(A);
for ix = 1:N
    x = myGPUFunction(ix, A);
    xtotal(ix,:) = gather(x);
end
```
SHSD

```
N = 1000;
spmd
    gpuDevice(labindex);  % one worker for each device
    A = gpuArray(A);
end
parfor ix = 1:N
    x = myGPUFunction(ix, A);
    xtotal(ix,:) = gather(x);
end
```
Multiple GPUs, from Ref. [32]

### **7.3. Open accelerator (OpenACC)**

OpenACC (open accelerators) is an application programming interface developed by CAPS, Cray, NVIDIA and PGI to simplify parallel programming. It provides a set of compiler directives that allow developers to run code in parallel without modifying the underlying code (as in OpenMP). The directives mark small segments of code, called kernels, to be run on the device. OpenACC divides tasks among gangs (blocks); gangs have workers (warps), and workers have vectors (threads) [9, 34–36].

OpenACC is portable across operating systems and various types of host CPUs and devices (accelerators). In OpenACC, some computations are executed on the CPU, while the compute-intensive regions are offloaded to GPU devices to be executed in parallel. The host is responsible for allocating memory on the device, transferring data between the host and the device, and launching the computations on the device.


A small code example using OpenACC is as follows:

```
main()
{
    <serial code>

    #pragma acc kernels
    // automatically runs on the GPU device
    {
        <parallel code>
    }
}
```
#pragma acc kernels tells the compiler to generate accelerator kernels that run in parallel on the GPU device [38].
