As described, there are three important parameters controlling the Direct Sampling simulation, namely *t*, *n*, and *f*. This scheme implements parallelism on two of the three parameters, *n* and *f*.

*4.2.1. Parallelism of n*

The data event is defined by the *n* closest previously simulated nodes. Unlike the data template used in SNESIM, which is controlled by a predefined geometry, the data events here consist of the *n* closest simulated nodes and therefore have a flexible shape and search range. Owing to this flexibility, the approach can directly capture large-scale structures and can easily be conditioned to hard data. On the other hand, the search for the *n* neighbors is also the most time-consuming part of the serial implementation.

The parallelism of *n* is achieved in Kernel 2 and Kernel 3. In Kernel 2, each previously simulated node is allocated to a CUDA thread, and the Euclidean distances from these nodes to the node to be simulated are computed simultaneously. The distances are then transferred to Kernel 3 and sorted with a parallel sorting algorithm.

*4.2.2. Parallelism of f*

In the serial program, if the similarity distance between the data event of the simulation grid and the one sampled at the current training-image node is higher than the threshold, a new central node is sampled from the training image along a random path. The maximum number of training-image nodes that may be visited is *f* × *NTI*; as a result, a large number of data events will be sampled in large-scale 3D simulations.

The parallelization of *f* is implemented in Kernel 4, which allocates *f* × *NTI* threads to the *f* × *NTI* central nodes in the training image, with a unique random-path index denoting each node. The similarity distances are thus calculated simultaneously. There are two possibilities for these similarity distances, and the corresponding data-sampling strategy is as follows:

**1.** If there are values lower than the threshold *t*, choose the one with the smallest path index among the data events whose distance is below *t*.

**2.** If there are no values lower than the threshold *t*, choose the one with the smallest path index among the data events with the lowest distance.

Finally, the central value of the chosen data event is assigned to the simulation grid by Kernel 5.


6. Repeat all these kernels until all the nodes in the simulation grid are simulated.
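As a serial reference for the neighbor search of Kernels 2 and 3, the sketch below computes the Euclidean distance from every previously simulated node to the node being simulated and keeps the *n* closest. In the GPU implementation each distance is computed by its own CUDA thread and the sort is a parallel sorting pass; the `Node` type and function name are illustrative assumptions, not the authors' code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// A simulated node on the grid: position plus its facies value (illustrative type).
struct Node { double x, y; int value; };

// Serial sketch of Kernels 2 and 3: each loop iteration corresponds to one
// CUDA thread computing a Euclidean distance; std::sort stands in for the
// parallel sorting pass that orders the distances on the GPU.
std::vector<Node> closestNeighbors(const std::vector<Node>& simulated,
                                   double cx, double cy, std::size_t n) {
    std::vector<std::pair<double, Node>> dist;
    dist.reserve(simulated.size());
    for (const Node& s : simulated) {                 // one CUDA thread each
        double dx = s.x - cx, dy = s.y - cy;
        dist.push_back({std::sqrt(dx * dx + dy * dy), s});
    }
    // Parallel sort in the GPU version; serial sort in this sketch.
    std::sort(dist.begin(), dist.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
    std::vector<Node> event;                          // the flexible data event
    for (std::size_t i = 0; i < std::min(n, dist.size()); ++i)
        event.push_back(dist[i].second);
    return event;
}
```

Because the data event is simply the *n* nearest simulated nodes, no predefined template geometry is needed, which is what gives the method its adaptive shape and search range.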

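The two-case sampling strategy applied to the similarity distances produced by Kernel 4 can be sketched serially as below; on the GPU the *f* × *NTI* distances are computed concurrently, and this selection is then applied to the collected candidates. The `Candidate` type and function name are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// One candidate central node scanned in the training image: the similarity
// distance of its data event and its index along the random path.
struct Candidate { double distance; std::size_t pathIndex; };

// Serial sketch of the Kernel 4 selection rules from the text:
//   1. if any distance is below the threshold t, keep the candidate with the
//      smallest path index among the sub-threshold ones (mimicking the first
//      acceptable node on the serial random path);
//   2. otherwise keep the candidate with the lowest distance, breaking ties
//      by the smallest path index.
Candidate selectCandidate(const std::vector<Candidate>& cands, double t) {
    const Candidate* best = nullptr;
    bool belowThreshold = false;                 // does `best` satisfy rule 1?
    for (const Candidate& c : cands) {
        bool cBelow = c.distance < t;
        if (!best) { best = &c; belowThreshold = cBelow; continue; }
        if (belowThreshold) {
            // Rule 1: smallest path index among sub-threshold candidates.
            if (cBelow && c.pathIndex < best->pathIndex) best = &c;
        } else if (cBelow) {
            best = &c; belowThreshold = true;    // first sub-threshold hit
        } else if (c.distance < best->distance ||
                   (c.distance == best->distance &&
                    c.pathIndex < best->pathIndex)) {
            best = &c;                           // Rule 2: lowest distance
        }
    }
    return *best;                                // assumes cands is non-empty
}
```

Choosing the smallest path index in rule 1 reproduces the behavior of the serial algorithm, which accepts the first node along the random path whose distance falls below *t*.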

150 Modeling and Simulation in Engineering Sciences

**5. Experiments**

In this section, the performance of the GPU-based implementations is compared with that obtained with the SGeMS software [39]. A computer with 4 GB of main memory, an Intel Core i3 540 3.07 GHz CPU, and an NVIDIA GeForce GTX 680 GPU containing eight streaming multiprocessors with 192 CUDA cores per multiprocessor and 2 GB of device memory is used for the simulations. The programming platform is NVIDIA driver version 301.42 with CUDA version 4.2, implemented in C++ on VS2010.

To test the performance of the GPU-based SNESIM algorithm, a 2D porous slice image obtained by CT scanning is used as the training image. The 200 × 200 pixel training image and the corresponding histogram of the background and pore phases are shown in **Figure 7**.

**Figure 7.** Training image of the porous slice and its histogram.

**Figure 8.** SNESIM realizations using the CPU and the GPU. (a)–(d) CPU-based realizations with template sizes of 50, 120, 200, and 350 nodes, respectively; (e)–(h) GPU-based realizations with template sizes of 50, 120, 200, and 350 nodes, respectively.

Realizations of the same size as the training image, using data templates of 50, 120, 200, and 350 nodes, are generated for each simulation. The realizations generated with the CPU and the GPU are shown in **Figure 8**. The average variograms for each simulation are shown in **Figure 9a**, and the performance is shown in **Figure 9b**. The results show that the proposed GPU-based algorithm generates realizations similar to those of the original algorithm while significantly increasing the performance. The speedup ranges from 6 to 24 times depending on the template size, with a larger template size resulting in a larger speedup.
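The variogram comparison of **Figure 9a** uses the experimental variogram, γ(h) = (1/2N(h)) Σ [z(u) − z(u+h)]², where N(h) is the number of point pairs separated by lag h. A minimal sketch of this statistic for a categorical image along the x axis, assuming a row-major grid (not the authors' code), is:

```cpp
#include <cstddef>
#include <vector>

// Experimental variogram of a 2D categorical (0/1) image along the x axis:
//   gamma(h) = (1 / (2 * N(h))) * sum over pairs h apart of (z(u) - z(u+h))^2
// The grid is stored row-major with dimensions ny rows by nx columns. This is
// a generic sketch of the comparison statistic, not the authors' exact code.
std::vector<double> variogramX(const std::vector<int>& grid,
                               std::size_t nx, std::size_t ny,
                               std::size_t maxLag) {
    std::vector<double> gamma(maxLag + 1, 0.0);
    for (std::size_t h = 1; h <= maxLag; ++h) {
        double sum = 0.0;
        std::size_t pairs = 0;                   // N(h)
        for (std::size_t j = 0; j < ny; ++j)
            for (std::size_t i = 0; i + h < nx; ++i) {
                double d = grid[j * nx + i] - grid[j * nx + i + h];
                sum += d * d;
                ++pairs;
            }
        if (pairs > 0) gamma[h] = sum / (2.0 * pairs);
    }
    return gamma;                                // gamma[0] stays 0 by definition
}
```

Averaging this curve over many realizations and plotting it against the variogram of the training image gives a comparison of the kind shown in **Figure 9a**.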

**Figure 9.** Results and performance comparison between the CPU and GPU implementations. (a) Variogram of the training image and the average variograms of the realizations in **Figure 8**; (b) speedup obtained by using the GPU-based parallel scheme.

The performance of the GPU-based Direct Sampling algorithm is also compared with that of the original algorithm on a 100 × 130 × 20 fluvial reservoir training image, shown in **Figure 10**.

**Figure 10.** A 100 × 130 × 20 fluvial reservoir training image.

Parameter sensitivities are analyzed for *n* = 30, 50, 100, and 200; *f* = 0.005, 0.01, 0.02, and 0.05; and *t* = 0.01, 0.02, 0.05, and 0.1, with which reasonable realizations are generated. The run times are shown in **Figure 11**.

Training Images-Based Stochastic Simulation on Many-Core Architectures http://dx.doi.org/10.5772/64276 153




**Figure 11.** Performance comparison. (a) Performance with fixed *t* = 0.02 and *f* = 0.005 and varying *n* = 30, 50, 100, and 200; (b) performance with fixed *n* = 30 and *t* = 0.02 and varying *f* = 0.005, 0.01, 0.02, and 0.05; (c) performance with fixed *n* = 30 and *f* = 0.005 and varying *t* = 0.01, 0.02, 0.05, and 0.1.

The results show that the GPU-based implementation significantly improves the performance in all the tests. Moreover, the parallel scheme alleviates the sensitivity of the performance to the parameters: for the GPU-based implementation, the run-time difference is around 200 s when varying *n* or *f* and almost negligible when varying *t*, whereas for the CPU-based implementation it can be as large as several orders of magnitude.

In summary, both of the presented GPU-based parallel schemes, for SNESIM and for Direct Sampling, achieve significant speedups, especially for large-scale simulations, thanks to their node-level parallelism strategy. Moreover, the parallel implementations are insensitive to the parameters that are key not only for performance but also for simulation quality, which implies better results in applications. These strategies could be further improved with other parallel optimization methods as well. In fact, besides the node-level parallel schemes for SNESIM and Direct Sampling, various parallelization strategies have been proposed, and new optimizations continue to be introduced for further improvement. By now, almost all kinds of MPS algorithms can be implemented with a parallel scheme. Many-core architectures are currently the mainstream approach to accelerating extremely large computing tasks, and the development of computer hardware and parallel programming interfaces promises ever wider use of high-performance parallelization. These parallel schemes for training-image-based stochastic simulation open the way to high-resolution, large-scale simulations.
