148 Modeling and Simulation in Engineering Sciences

**4. Parallel strategies**

Generally, there are three strategies for implementing the parallelism of MPS algorithms: realization-level, path-level, and node-level parallelization. For large-scale applications, the bottleneck lies in the millions of nodes in the training image and the simulation grid. In this section, we therefore present two parallel implementations: a node-level implementation of SNESIM and an implementation of Direct Sampling parallelized at both the node level and the path level.

**4.1. GPU-based SNESIM implementation**

In 2013, a node-level parallelization was applied to the SNESIM algorithm and achieved a significant performance improvement [34]. The overall parallel scheme is illustrated in **Figure 5**.

**Figure 5.** Procedure of GPU-based SNESIM implementation.

After the training image and the simulation grid are transferred from the CPU to the GPU, each node is assigned to a CUDA thread. For each unsimulated node along the random simulation path, Kernel 1 compares the values of the given data template d*n* with every central node in the training image simultaneously and returns the number of replicates c*k*(d*n*) for each stage *k*. With these outputs, Kernel 2 calculates the conditional cumulative distribution function (CCDF) and draws a value for the node by Monte Carlo sampling. The two kernels are repeated along the random path until all the nodes in the simulation grid have been simulated, after which the results are transferred from the GPU back to the CPU.

Since the most time-consuming part, retrieving the data event for each node, is parallelized in Kernel 1, this GPU-based implementation achieves a significant speedup. Moreover, whereas the memory demand of the original implementation grows with the template size, the proposed GPU implementation uses a fixed amount of memory.
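The per-node work of Kernels 1 and 2 can be mimicked serially in Python. This is a minimal sketch, not the published implementation: `count_replicates` and `draw_from_ccdf` are hypothetical names, and the double loop over the training image stands in for what the GPU flattens into one CUDA thread per node.

```python
import random

def count_replicates(training_image, data_event):
    """Kernel 1's role: count training-image replicates of a data event.

    data_event: list of (offset, value) pairs, where offset is a (di, dj)
    lag relative to the node being simulated. Returns a map from facies
    value to the number of replicates whose central node carries it.
    """
    rows, cols = len(training_image), len(training_image[0])
    counts = {}
    for i in range(rows):            # on the GPU, this double loop is
        for j in range(cols):        # flattened into one thread per node
            ok = True
            for (di, dj), v in data_event:
                ti, tj = i + di, j + dj
                if not (0 <= ti < rows and 0 <= tj < cols) \
                        or training_image[ti][tj] != v:
                    ok = False
                    break
            if ok:
                counts[training_image[i][j]] = counts.get(training_image[i][j], 0) + 1
    return counts

def draw_from_ccdf(counts, rng=random):
    """Kernel 2's role: build the CCDF from the counts and draw a value
    for the node by Monte Carlo sampling."""
    total = sum(counts.values())
    u = rng.random() * total         # uniform draw scaled to the counts
    acc = 0.0
    for value, c in sorted(counts.items()):
        acc += c
        if u <= acc:
            return value
    return None
```

In the GPU scheme each comparison in `count_replicates` runs concurrently, which is why the training-image scan dominates the serial runtime but not the parallel one.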

## **4.2. GPU-based Direct Sampling implementation**

A similar parallelization can also be applied to the Direct Sampling algorithm. Besides the parallelism over the data template, the search fraction can also be parallelized. This two-stage parallel scheme was implemented in 2013 [35]. The procedure of the GPU-based implementation is illustrated in **Figure 6**.

**Figure 6.** Procedure of GPU-based Direct Sampling implementation.

As described, three important parameters control the Direct Sampling simulation, namely *t*, *n*, and *f*. This scheme implements parallelism over two of the three, *n* and *f*.
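For reference, the roles of the three parameters in a single serial Direct Sampling step can be sketched as follows. This is a hypothetical helper, not the authors' code: the data event built from the previously simulated neighbors is passed in directly, so *n* is implicit in its length, *f* limits the fraction of the training image scanned, and *t* is the acceptance threshold on the pattern distance.

```python
import random

def ds_sample(training_image, data_event, t, f, rng=random):
    """One Direct Sampling step (serial sketch).

    Randomly scans a fraction f of the training-image nodes; at each
    candidate, the mismatch between the data event and the co-located
    pattern is measured, and the first candidate whose distance falls
    below the threshold t is accepted. The best candidate seen so far
    is kept as a fallback if no candidate passes the threshold.
    """
    rows, cols = len(training_image), len(training_image[0])
    nodes = [(i, j) for i in range(rows) for j in range(cols)]
    rng.shuffle(nodes)
    limit = max(1, int(f * len(nodes)))   # f: fraction of the TI scanned
    best_val, best_dist = None, float("inf")
    for i, j in nodes[:limit]:
        mism, valid = 0, 0
        for (di, dj), v in data_event:
            ti, tj = i + di, j + dj
            if 0 <= ti < rows and 0 <= tj < cols:
                valid += 1
                if training_image[ti][tj] != v:
                    mism += 1
        if valid == 0:
            continue                       # pattern falls outside the TI
        dist = mism / valid                # fraction of mismatched nodes
        if dist < best_dist:
            best_val, best_dist = training_image[i][j], dist
        if dist <= t:                      # t: acceptance threshold
            return training_image[i][j]
    return best_val
```

The two parallelized parameters map onto this sketch directly: parallelism over *n* distributes the per-neighbor distance work, while parallelism over *f* distributes the candidate scan of the training image.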

## *4.2.1. Parallelism of n*

The parameter *n* is the number of neighbors: the *n* previously simulated nodes closest to the node being simulated define its data template. Unlike the data template used in SNESIM, whose geometry is predefined, this template consists of the *n* closest simulated nodes and is therefore flexible in both shape and search range. Thanks to this flexibility, the approach can directly capture large-scale structures and easily condition to hard data. On the other hand, searching for the *n* neighbors is also the most time-consuming part of the serial implementation.

The parallelism over *n* is achieved in Kernel 2 and Kernel 3. In Kernel 2, each previously simulated node is allocated to a CUDA thread, and all threads simultaneously compute the Euclidean distance from their node to the node being simulated. These distances are passed to Kernel 3 and sorted with a parallel sorting algorithm.
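A serial stand-in for Kernels 2 and 3 might look as follows. This is a sketch under the assumption that the previously simulated nodes are given as coordinate/value pairs; `n_closest_neighbors` is a hypothetical name, and the serial distance pass and sort produce the same result the two kernels compute in parallel.

```python
import math

def n_closest_neighbors(simulated, target, n):
    """Serial stand-in for Kernels 2 and 3.

    Kernel 2 assigns one CUDA thread per previously simulated node to
    compute its Euclidean distance to the target node; Kernel 3 then
    sorts those distances with a parallel sorting algorithm.

    simulated: list of ((i, j), value) for previously simulated nodes.
    Returns the n closest nodes as (offset, value) pairs, i.e. the
    flexible data template used by Direct Sampling.
    """
    ti, tj = target
    with_dist = []
    for (i, j), v in simulated:
        d = math.hypot(i - ti, j - tj)    # one thread per node on the GPU
        with_dist.append((d, (i - ti, j - tj), v))
    with_dist.sort(key=lambda x: x[0])    # parallel sort on the GPU
    return [(off, v) for _, off, v in with_dist[:n]]
```

Because the template is expressed as offsets from the target node, the returned pairs can be used directly as the data event compared against the training image.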
