Since this method does not need to store patterns, it relieves the memory burden; on the other hand, parameterization is the key factor for Direct Sampling. There are three main input parameters:

**1.** The maximum number of closest neighbors *n*, namely, the size of the data template, as in SNESIM.

**2.** The distance threshold *t*. A value of *t* = 0 leads toward a verbatim copy of the training image, and increasing values introduce more variability between realizations in pattern reproduction.

**3.** The fraction of the scanned training image (TI), *f*. This fraction is defined to stop the scanning of the training image when no distances fall below the threshold: once the percentage of scanned nodes in the training image reaches *f*, the scanning stops and the pattern with the lowest distance is sampled.

Sensitivity analysis of these parameters [38] shows trade-offs between the quality of the realizations and the CPU times.

**3. Many-core architectures**

**3.1. Overview**

A computing component featuring two or more processing units that execute program instructions independently is known as a multicore processor. With the ability to run multiple instructions at the same time, multicore processors increase the overall speed of many general-purpose computations. Currently, adding support for more execution threads is the standard avenue for improving the performance of high-end processors. Many-core architectures are formed by integrating massive numbers of cores on a single component. For general-purpose parallel computing, many-core architectures on both the central processing unit (CPU) and the graphics processing unit (GPU) are available for different tasks.

Compared with many-core CPU architectures such as those found in supercomputers, a general GPU has many more cores, which are comparatively cheap and well suited to intensive computing.

**3.2. GPU and CUDA**

Originally, a GPU is a graphics card equipped with a cluster of streaming processors aimed at graphics-oriented tasks that require extremely fast processing of large-volume data sets. To apply the special-purpose GPU to general-purpose applications, NVIDIA provides a user-friendly development environment, the Compute Unified Device Architecture (CUDA). The CUDA platform enables the creation of parallel code for GPUs by offloading work from the CPU to the GPU. A CUDA program is a unified code that executes on both the host (CPU) and the device (GPU): CUDA kernel functions are launched from the host and executed asynchronously on the device. The massive parallelism within each kernel is carried out by CUDA threads, the basic execution units on the GPU. Communication between the host and the device takes place over the Peripheral Component Interconnect Express (PCI-e) bus, and shared memory supports synchronization among the parallel threads within a block. The general architecture of CUDA and the CUDA memory organization are illustrated in **Figures 3** and **4**. More details can be found in the NVIDIA CUDA programming guides.

**Figure 3.** The general architecture of CUDA.

**Figure 4.** CUDA device memory organization.
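The grid/block/thread hierarchy described in Section 3.2 can be made concrete with a serial stand-in written in plain Python. The `launch` helper and kernel signature below are hypothetical constructs for illustration only; the global-index formula `blockIdx.x * blockDim.x + threadIdx.x` is the standard CUDA convention for a one-dimensional grid.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Serial stand-in for a CUDA kernel launch: iterates over every
    (blockIdx, threadIdx) pair that the GPU would execute in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    """Kernel body: each thread computes one output element from its
    global index i = blockIdx.x * blockDim.x + threadIdx.x."""
    i = block_idx * block_dim + thread_idx
    if i < len(out):        # guard: the grid may contain more threads than data
        out[i] = a[i] + b[i]

a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
out = [0] * 5
launch(vector_add, 2, 4, a, b, out)   # 2 blocks x 4 threads = 8 threads for 5 elements
```

On a real GPU the two loops in `launch` disappear: all eight threads run concurrently, which is why the bounds guard inside the kernel is essential.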

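Returning to the Direct Sampling parameters listed earlier, the interplay of *n*, *t*, and *f* can be sketched for a single simulation node. This is only an illustrative one-dimensional toy, not the published Direct Sampling implementation; the function name and the `(offset, value)` data-event representation are assumptions made for the example.

```python
import random

def direct_sample(ti, data_event, n, t, f, rng=None):
    """Toy 1-D Direct Sampling step: scan random TI locations until a
    pattern within distance t is found or a fraction f has been scanned."""
    rng = rng or random.Random(0)
    data_event = data_event[:n]               # keep at most n closest neighbors
    max_scans = max(1, int(f * len(ti)))      # stop after scanning fraction f of the TI
    best_dist, best_val = float("inf"), None
    for _ in range(max_scans):
        pos = rng.randrange(len(ti))          # random scan path through the TI
        pairs = [(pos + off, val) for off, val in data_event
                 if 0 <= pos + off < len(ti)]
        if not pairs:
            continue
        # distance = fraction of mismatching neighbors at this TI location
        dist = sum(ti[p] != v for p, v in pairs) / len(pairs)
        if dist < best_dist:
            best_dist, best_val = dist, ti[pos]
        if dist <= t:                         # threshold reached: sample immediately
            return ti[pos]
    return best_val                           # fraction f exhausted: lowest-distance pattern wins

value = direct_sample([5] * 10, [(-1, 5)], n=1, t=0.0, f=1.0)
```

The sketch makes the trade-offs visible: *t* = 0 forces exact pattern matches (verbatim-copy tendency), larger *t* accepts looser matches and adds variability, and smaller *f* cuts CPU time by truncating the scan at the cost of pattern quality.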