Security Applications of GPUs

*Giorgos Vasiliadis*

### **Abstract**

Despite recent advances in software security hardening techniques, vulnerabilities can always be exploited if the attackers are sufficiently determined. Regardless of the protection mechanisms enabled, successful exploitation can still be achieved, even though, admittedly, it is much harder today than it was in the past. Since securing software remains an open research problem, the community also investigates detection methods in order to protect software. Three of the most promising such methods monitor (i) the network, (ii) the filesystem, and (iii) the host memory for possible exploitation. Whenever a malicious operation is detected, the monitor should be able to terminate it and/or alert the administrator. In this chapter, we explore how to utilize the highly parallel capabilities of modern commodity graphics processing units (GPUs) in order to improve the performance of different security tools operating at the network, storage, and memory level, and how they can offload the CPU whenever possible. Our results show that modern GPUs can be very efficient and highly effective at accelerating the pattern matching operations of network intrusion detection systems and antivirus tools, as well as at monitoring the integrity of the base computing system.

**Keywords:** security, network security, host security, intrusion detection, antivirus, kernel integrity monitoring

### **1. Introduction**

The ever-increasing amount of malicious software (malware) constitutes an enormous challenge to network operators, IT administrators, as well as ordinary home users. To protect against such an evolving threat landscape, it is necessary to provide detection of malicious activities at different levels: (i) by inspecting the data exchanged at central network traffic ingress points, (ii) by scanning for unwanted software at the storage level, and (iii) by ensuring program memory integrity at the host level. Three of the most widely used tools that perform these kinds of operations are intrusion detection systems, antivirus software, and host integrity tools. Unfortunately, the constant increase in link speeds, storage capacity, and number of end-devices, together with the sheer number of malware, poses significant challenges to all these tools, which end up requiring high scanning throughput and low latency.

Typically, the detection of malicious activities spends the majority of its time matching data streams against a large set of known signatures or checksums, using string searching, regular expression matching, and hashing algorithms. Signature matching algorithms analyze the data stream and compare it against a database of fixed strings or regular expressions to detect known malware. The signature patterns can be quite complex, composed of wild-card characters, range constraints, different-size strings, and sometimes recursive forms. To make matters worse, the number of signatures increases every year in proportion to the amount of malware, exposing the scaling problems of anti-malware products.

Modern GPUs have been proven to be highly effective and very efficient at accelerating computational- and memory-intensive workloads. The ever-growing video game industry is a driving factor behind their evolution into ever more powerful and flexible stream processors, specialized for highly parallel operations. Compared with commodity CPUs, a much larger fraction of their transistors is devoted to data processing rather than data caching and flow control, making them ideal for data parallel computations that up till now were handled by the CPU.

In this chapter, we present new models for malware detection tools that operate at the network, storage, and memory level. These models build on commodity, general-purpose GPU programming paradigms and are tailored for high-performance, low-latency analysis. Our systems take advantage of the parallelism offered by GPUs to improve scalability and runtime performance, and are able to offload the CPU whenever possible. Our results show that modern GPUs can be highly effective and very efficient at accelerating a highly diverse set of operations that are core functions of modern security tools, including string searching, regular expression matching, and checksum computations.

### **2. Network intrusion detection and prevention systems**

First, we show how to exploit the parallelism of the graphics processing unit (GPU) to offload specific intensive tasks of a network intrusion detection system (NIDS). In particular, we present the design, implementation, and evaluation of string searching and regular expression matching engines running on GPUs. We have integrated these implementations into the popular Snort intrusion detection system [1] to offload both string and regular expression matching computations, as shown in **Figure 1**.

The data parallel capabilities of modern GPUs allow the concurrent matching of multiple input data streams against a large set of fixed string patterns and regular expressions. The processing pipeline can be separated into several tasks: packet capturing, decoding, preprocessing, the transfer of the network packets to the GPU, the string matching on the GPU, and the transfer of the matching results back to the CPU, where all the remaining conditions of the detection rules are checked. Whenever a packet additionally needs to be scanned against a regular expression, it is subsequently transferred back to the GPU, where the actual matching takes place.

#### **2.1 Architecture**

The overall design of our GPU-assisted network intrusion detection architecture has two key factors for achieving good performance: (i) load balancing between processing units, and (ii) linear performance scalability with the addition of more processing units. In particular, the monitored traffic is distributed at the flow level to different CPU cores, by applying a symmetric hash function on the 5-tuple fields of each packet header (i.e., source IP address, destination IP address, source port, destination port, protocol). As a result, all packets of the same flow (i.e., the same connection) will always be placed in the same ring buffer and will be processed by the same CPU core. This inherently leads to a multicore architecture, in which each core processes an evenly distributed portion of the network traffic, without requiring any intra-node communication for processing operations that are limited in scope to a single flow. Each CPU core is responsible for network flow tracking, protocol parsing, TCP stream reassembly, and content normalization. The reassembled and normalized packets of each network flow are then transferred to the graphics card in large batches, in order to be processed in parallel. This enables an "intra-flow" parallelism, in which network packets from the same flow can be processed in parallel, while also maintaining flow-state dependencies. We note that this buffering scheme, as well as the extra data transfer operations that need to be performed between the memory address spaces of each device, obviously adds some latency to the processing path. Even though the computational gains offered by the GPU tolerate these extra data transfers and pay off in terms of increased throughput, we further mitigate these overheads by implementing pipelining schemes that allow the CPU and GPU execution to overlap, thus offering an additional level of parallelism to the overall execution path. Overall, by parallelizing both packet preprocessing and content inspection across multiple CPUs and GPUs, our proposed architecture can operate in multi-Gigabit networks using solely commodity components.

**Figure 1.**
*Overview of the single-threaded GPU-based network intrusion detection architecture.*

As shown in **Figure 2**, we utilize the different processing units available (i.e., CPUs and GPUs) in order to map the different functionalities that are performed across the incoming network flows, using both *task* and *data* parallelism. More specifically, the network interface distributes the incoming network packets to the CPU-cores, by applying a symmetric hash function on the 5-tuple fields of each packet header (i.e., source IP address, destination IP address, source port, destination port, protocol). This ensures that all packets of the same flow (i.e., same connection) will always be placed in the same ring buffer, and will be processed by the same CPU-core. Each CPU-core reassembles and normalizes the captured traffic before offloading it to the GPU for pattern matching [2]. Any matching results are logged by the corresponding CPU-core using the specified logging mechanism, such as a file or database.
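As an illustration of this flow-directing step, the sketch below shows one possible way to compute a symmetric hash over the 5-tuple; the structure and function names are ours, not part of the actual implementation. Because the source and destination fields are combined with commutative operations, both directions of a connection map to the same ring buffer.

```cuda
#include <cstdint>

struct FiveTuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

// Symmetric hash: XOR is order-independent with respect to (src, dst), so
// packets of both directions of a flow map to the same value.
static inline uint32_t flow_hash(const FiveTuple& t) {
    uint32_t ips   = t.src_ip ^ t.dst_ip;
    uint32_t ports = (uint32_t)t.src_port ^ (uint32_t)t.dst_port;
    uint32_t h = ips ^ (ports << 16) ^ ports ^ t.proto;
    h ^= h >> 16;                 // simple mixing step
    return h;
}

// A packet is then appended to the ring buffer of the core that owns it:
//   ring_buffers[flow_hash(tuple) % num_cores].push(pkt);
```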

This design has many advantages: *First*, no synchronization or locking mechanisms are needed, since different network flows will be processed by different CPU-cores independently. *Second*, each CPU-core maintains smaller data structures (e.g., the flow management table, the TCP reassembly tables, etc.) instead of sharing a few large ones, which reduces both the number of table lookups and the size of the working set in each cache, increasing overall cache efficiency.

**Figure 2.**
*The architecture of the GPU-enabled network intrusion detection system.*


### *2.1.1 Parallel multi-pattern engine*

A major design criterion for scanning large network data flows against many different fixed string patterns is the choice of an efficient matching algorithm. The majority of network intrusion detection systems utilize a flavor of the Aho-Corasick algorithm [3] for string searching. Internally, the algorithm uses a transition function that computes the next state T[state, c] for a given state and a character c. A pattern is matched every time the algorithm transits to a final state. The performance and memory requirements of Aho-Corasick depend on how the transition function is implemented. In the full implementation, hereinafter AC-Full, each state is represented with 256 elements, one for each 8-bit character. Each element contains the next state to move to; as a result, the next state can be found in *O*(1) steps for every input character. This ensures linear complexity over the input data, independently of the number of patterns, which is very efficient in terms of performance.
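For illustration, the AC-Full transition lookup can be sketched as a dense table indexed by state and input byte; the layout below is a simplified assumption of this chapter, not the exact Snort data structure.

```cuda
#include <cstdint>

// AC-Full: a dense table with 256 next-state entries per state. A single
// array read per input byte advances the automaton; a separate per-state
// flag marks states where at least one pattern ends.
struct ACFull {
    const int*     next;     // next[state * 256 + byte]
    const uint8_t* is_final; // is_final[state] != 0 => a pattern ends here
};

static inline int ac_full_step(const ACFull& ac, int state, uint8_t byte) {
    return ac.next[state * 256 + byte];   // O(1) per input character
}
```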

However, a disadvantage of the full state representation is its large memory requirements, even for small signature sets. For Snort, the compiled state table can reach several hundred Megabytes of memory. To make matters worse, CPU processes cannot share memory on the GPU device; as such, a different memory space has to be allocated on the GPU for each process. This can result in significant memory allocations, as shown in **Table 1**. Given that the GeForce GTX780 that we used for our evaluation comes with 2 GB of memory, only two Snort instances can fully utilize the GPU at a time.

To preserve scalability with respect to the number of concurrently running Snort instances, it is important to optimize the memory requirements. As such, instead of using the full state table representation, we use a compacted version, similar to [4], in which the states are represented in a banded-row format. In particular, we store only the elements from the first non-zero value to the last non-zero value of the table; the number of stored elements is known as the bandwidth of the sparse table. Obviously, in the compacted implementation, namely AC-Compact, the next state cannot be accessed directly while matching input bytes. Instead, it has to be computed, as shown in **Figure 3**; this computation adds a small overhead at the scanning phase, which is amortized though by the significantly lower memory consumption.

**Figure 3.**
*State tables of AC-Full vs. AC-Compact.*

| #Rules | #Patterns | #States | AC-Full | AC-Compact |
|--------|-----------|---------|---------|------------|
| 8192 | 193,167 | 1,703,023 | 890.46 MB | 24.18 MB |

**Table 1.**
*Memory requirements of AC-Full and AC-Compact for the default Snort rule set.*

Moreover, it is common that some patterns may share the same final state in the state table or be case-insensitive. Instead of adding every different combination of capital and lower-case letters for every case-insensitive pattern, we simply mark that pattern as case-insensitive and add only one combination (i.e., all characters are converted to capitals). In case the pattern is matched in a packet, an extra case-insensitive search is made at the index where the pattern was found. Similarly, if two patterns share the same final state, they need to be verified in case of a match.

Each GPU thread processes a different reassembled network packet. We use an array to store the network packets; every time the array fills up, it is transferred to the GPU and processed at once. **Figure 4** shows the sustained throughput when matching full-size packets (i.e., of 1500-byte length) on a single GTX770, for a varying number of packets that are processed at once. The traffic is generated from a separate machine over four 10 Gbit/s network cards. As can be seen, AC-Full achieves a peak performance of 21.1 Gbit/s, while AC-Compact reaches about 16.4 Gbit/s. In both cases, all data transfer costs to and from the GPU are included. The corresponding CPU implementation achieves a performance of 0.6 Gbit/s for the AC-Full implementation, and thus a single GPU instance corresponds to 36.2 and 28.1 CPU-cores for the AC-Full and AC-Compact implementations, respectively.

**Figure 4.**
*GPU throughput for AC-Full and AC-Compact.*

As expected, AC-Full outperforms AC-Compact in all cases. The added overhead of the extra computation that AC-Compact performs in every transition decreases its performance by about 30%. The main advantage of AC-Compact is that it has significantly lower memory consumption than AC-Full. The corresponding memory requirements for storing the detection engines of a single Snort instance are shown in **Table 1**. As shown, AC-Compact utilizes up to 36 times less memory, which makes it a better fit for a multi-CPU environment, due to CUDA's limitation of allocating a separate memory context for each host thread. Using AC-Compact, a single GTX770 card can store the detection engines of about 80 Snort instances (80 × 24.18 MB ≈ 1.9 GB). The remaining memory can be used for storing the actual contents of the incoming network packets. If AC-Full is used, only two instances can fit in device memory.
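The banded-row lookup of AC-Compact can be sketched as follows; the field names and exact encoding are illustrative assumptions, but they show the extra comparison and indirection that each transition pays in exchange for the smaller tables.

```cuda
#include <cstdint>

// AC-Compact: each state keeps only the band between its first and last
// non-zero transition; anything outside the band falls back to state 0.
struct ACCompact {
    const int* band_start;   // offset of each state's band inside 'pool'
    const int* band_first;   // first input byte covered by the band
    const int* band_width;   // number of stored transitions (the "bandwidth")
    const int* pool;         // all bands concatenated
};

static inline int ac_compact_step(const ACCompact& ac, int state, uint8_t byte) {
    int first = ac.band_first[state];
    int width = ac.band_width[state];
    if (byte < first || byte >= first + width)
        return 0;                                         // outside the band
    return ac.pool[ac.band_start[state] + (byte - first)];
}
```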

### *2.1.2 Compiling PCRE regular expressions to DFA state tables*

The majority of tools that use regular expressions typically convert them into DFAs [5]. To do so, the most common approach is to first compile them into NFAs, and then convert those into DFAs. We also follow the same approach: using the Thompson algorithm [6], we first convert each regular expression into an NFA. The generated NFA is then converted incrementally to an equivalent DFA, using the Subset Construction algorithm.


The basic concept of subset construction is to define a DFA in which each state is a set of states of the corresponding NFA. Each state in the DFA represents the set of active states in which the corresponding NFA can be after some transition. During the matching phase, the resulting DFA achieves *O*(1) computational cost for each incoming character.

However, a major concern when converting regular expressions into DFAs is the state-space explosion that may occur during compilation [7]. To distinguish among the states, a different DFA state may be required for every possible combination of NFA states. Obviously, this can result in exponential growth of memory utilization, primarily due to the usage of wildcards, e.g., (.\*), and repetition expressions, e.g., a{x,y}. As a result, certain regular expressions may end up consuming large amounts of memory when compiled to DFAs. A theoretical worst-case study shows that a single regular expression of length *n* can be expressed as a DFA of up to *O*(*Σ*<sup>*n*</sup>) states, where *Σ* is the size of the alphabet, i.e., 2<sup>8</sup> symbols for the extended ASCII character set [8].

To prevent the excessive memory consumption that can be caused by some regular expressions, we follow a hybrid approach, in which we convert only the regular expressions that do not exceed a certain threshold of states; the remaining regular expressions will be matched on the CPU using NFAs. The total number of states is traced during the incremental conversion from the NFA to the DFA, and the conversion stops if a certain threshold is reached. As we have experimentally found, more than 97% of the total regular expressions used by Snort can be converted to DFAs, when using an upper bound of 5000 states per expression. The remaining expressions can be processed by the CPU using an NFA implementation, similar to the vanilla Snort.

Each DFA is represented as a two-dimensional array that is mapped linearly on the memory space of the GPU. The dimensions of the array are equal to the number of states and the size of the alphabet (256 in our case), respectively. Each cell contains the next state to move to, as well as an indication of whether the state is a final state or not. The final states are represented as negative numbers, due to the fact that transition numbers can only be positive integers. Whenever the state machine reaches a state that is represented by a negative number, it considers it a final state and reports a match at the current input offset.
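A minimal device-side sketch of this representation is shown below (the buffer layout and names are our own simplifications): each thread scans one reassembled packet, and a negative table entry both signals a final state and, once negated, yields the actual next state.

```cuda
#include <cstdint>

// Each thread scans one reassembled packet. Transition entries are signed:
// a negative value marks a final state and its absolute value is the state id.
__global__ void dfa_match(const int*     dfa,          // [num_states * 256]
                          const uint8_t* payloads,     // packets stored back to back
                          const int*     pkt_offset,   // start of each packet
                          const int*     pkt_len,      // length of each packet
                          int            num_packets,
                          int*           first_match)  // per packet: offset or -1
{
    int pkt = blockIdx.x * blockDim.x + threadIdx.x;
    if (pkt >= num_packets) return;

    const uint8_t* p = payloads + pkt_offset[pkt];
    int state = 0, match = -1;

    for (int i = 0; i < pkt_len[pkt]; i++) {
        int next = dfa[state * 256 + p[i]];
        if (next < 0) {            // final state: remember the first match offset
            if (match < 0) match = i;
            next = -next;          // recover the actual state number
        }
        state = next;
    }
    first_match[pkt] = match;
}
```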

### *2.1.3 Multi-GPU support*

Our system is able to utilize several GPUs simultaneously, by dividing the incoming flows equally and performing the signature matching concurrently across all devices. In the CUDA runtime system, each device is bound to a single process. As such, several host processes must be spawned (at least one process per device) in order to enable multi-GPU support. The load balancing scheme shown in **Figure 2** ensures that each process receives a uniform amount of flows; as a result, flows are equally distributed to the different GPUs. By default, our system utilizes all the GPUs that are available in the system, still this can easily be configured by defining the number of GPUs it should try to use.

### **3. Host-based virus scanning**

Antivirus software is one of the most popular tools for detecting and stopping malicious or unwanted software. ClamAV [9] is a popular open-source virus scanner, which contains more than 60 thousand signatures, formed by both fixed strings and regular expressions. It can be used at the server side for protecting mail and file servers, as well as on client personal computers. The database includes signatures for polymorphic viruses in regular expression format and for non-polymorphic viruses in simple string format. To detect non-polymorphic viruses, the current version of ClamAV uses an optimized version of the Boyer-Moore algorithm [10] for matching simple fixed string signatures. For polymorphic viruses, ClamAV uses a variant of the classical Aho-Corasick algorithm [3].

The main design principle of our GPU-assisted antivirus is to utilize the GPU in order to quickly filter out the data segments that do not contain any viruses. To achieve this, we have modified ClamAV, such that the input data stream is initially scanned by the GPU. The GPU uses a prefix of each virus signature to quickly filter out clean data. The motivation behind this is that the majority of the data does not contain any viruses, as such the GPU filtering is quite efficient, as shown in **Figure 5**.

**Figure 5.**
*Number of matches.*

**Figure 6** presents the overall architecture of our GPU-assisted antivirus tool. The contents of each file are read from disk and stored into a file buffer. The file buffer is used to store the contents of many files, and is transferred to the GPU in a single transaction. This results in a reduction of I/O transactions over the PCI Express bus. Moreover, the file buffer is page-locked (i.e., it does not get swapped out), hence it can be transferred asynchronously, via DMA (Direct Memory Access), to the memory space of the GPU.


**Figure 6.**

*The architecture of the GPU-assisted antivirus. Files are mapped onto pinned memory that can be copied onto the graphics card via DMA. A first-pass filtering is performed on the GPU, which returns any potential true positives for further checking on the CPU.*

Every time the GPU detects a suspicious virus, i.e., there is a prefix match, the file is further investigated by the verification module. Otherwise, no further computation takes place. The data parallel operation of the GPU is ideal for creating multiple search engine instances that will scan for virus signatures on different data in a massively parallel fashion. Overall, the GPU is employed as a high-speed first-pass filter, before completing any further potential signature-matching work on the CPU.

#### **3.1 Basic mechanisms**

Initially, the entire signature set of ClamAV is preprocessed, in order to construct a deterministic finite automaton (DFA). As we have explained before, the DFA state machine provides linear complexity over the input text stream, which is very efficient. Unfortunately, it is not feasible to construct the full DFA, due to the big number and large size of the virus signatures contained in ClamAV.

To overcome this, we use a portion from each virus signature for constructing the DFA, and specifically only the first *n* symbols, similar to [11]. By doing so, the height of the resulting DFA machine is limited, as shown in **Figure 7**. Any patterns that share the same prefix are stored under the same final node, called a leaf. In case the length of the signature pattern is smaller than the prefix length, the entire pattern is added. A prefix may also contain special characters, such as the wild-card characters \* and ?, that are used in ClamAV signatures to describe a known virus.

At the scanning phase, the input file data will first be scanned by the DFA running on the GPU. It is clear that the DFA may not be able to match an exact virus signature inside a data stream, as in most cases the length of the signature is longer than the length of the prefix we used to create the automaton. This is only a first-level filtering though, whose purpose is to use the high parallelism of the GPU to quickly filter out the majority of true negatives, and drastically eliminate a significant portion of the input data that needs to be scanned by the CPU. Obviously, the longer the prefix, the fewer the number of false positives at this initial scanning phase. As shown in **Figure 5**, using a value of 8 for *n* can result in less than 0.0001% false positives in a realistic corpus of binary files.

**Figure 7.**
*A fragment of the DFA structure with n levels. All the patterns that begin with the same prefix are listed under the same leaf (final node).*
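One possible way to build such a prefix automaton is sketched below as a host-side trie over the first *n* bytes of every signature; the structure and constants are illustrative assumptions of this chapter rather than actual ClamAV-derived code. Signatures sharing a prefix end up in the same leaf, which is exactly what the CPU verification step later uses.

```cuda
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Build a trie over the first PREFIX_LEN bytes of each signature. Leaves keep
// the ids of all signatures sharing that prefix, so a GPU hit can be verified
// against the full signatures on the CPU.
constexpr int PREFIX_LEN = 8;   // the value of n used in the evaluation

struct TrieNode {
    int next[256];                 // child per input byte, -1 if absent
    std::vector<int> signatures;   // signature ids terminating at this leaf
    TrieNode() { for (int i = 0; i < 256; i++) next[i] = -1; }
};

void insert_prefix(std::vector<TrieNode>& trie, const std::string& sig, int id) {
    int node = 0;
    int depth = (int)std::min(sig.size(), (size_t)PREFIX_LEN);
    for (int i = 0; i < depth; i++) {
        uint8_t b = (uint8_t)sig[i];
        if (trie[node].next[b] < 0) {           // allocate a new child node
            trie[node].next[b] = (int)trie.size();
            trie.emplace_back();
        }
        node = trie[node].next[b];
    }
    trie[node].signatures.push_back(id);
}

// Usage: std::vector<TrieNode> trie(1);  // root node
//        insert_prefix(trie, signature_bytes, signature_id);
```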

#### **3.2 Parallelizing DFA matching on the GPU**

During scan time, the algorithm iterates over the input data stream one byte at a time and moves the current state appropriately. The pattern matching is performed byte-wise, meaning that we have an input width of 8 bits and an alphabet size of 2<sup>8</sup> *=* 256. Thus, each state contains 256 transitions to other states, as shown in **Figure 7**. If the scanning algorithm reaches a final state, then a potential signature match has been found, and the corresponding offset is marked. All marked offsets will be verified later by the CPU.

To utilize all streaming processors of the GPU, we exploit its data parallel capabilities by creating multiple threads. An important design decision is how the input data is partitioned across threads. The simplest approach would be to use multiple input data streams, one for each thread, in separate memory areas. However, this would result in asymmetrical processing effort for each processor and would not scale well. For example, if the sizes of the input streams vary, the amount of work per thread will not be the same, and threads will have to wait until all have finished searching the data stream that was assigned to them. To overcome this, we use a single input data stream and each thread searches a different portion of it. In particular, our strategy splits the input stream into distinct chunks, and each chunk is processed by a different thread. **Figure 8** shows how each GPU thread scans its assigned chunk, using the underlying DFA state table. Although they access the same automaton, each thread maintains its own state, eliminating any need for communication between them.

The two basic operations of DFA matching are determining the address of the next state in the state table and fetching the next state from device memory. These memory transfers can take up to several hundreds of nanoseconds, depending on the memory congestion. To obtain the highest level of performance and hide memory latencies, we run many threads in parallel. Multiple threads can overlap data transfers with computation, hence making better use of the memory bandwidth.

Moreover, we have explored storing the DFA state table both in the global memory space, as well as in the texture memory space of the GPU. The texture memory can be accessed in a random fashion for reading, in contrast to global memory, where the access patterns must be coalesced. This feature can be very useful for algorithms like DFA matching, which exhibit irregular access patterns across large data sets. As described in Section 2.1.2, the usage of texture memory can boost the computational throughput up to a factor of two.
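As one way to realize this with current CUDA releases, the state table residing in linear device memory can be exposed through a texture object (the original implementation predates this API, so the snippet below is only a sketch of the idea):

```cuda
#include <cuda_runtime.h>

// Expose a DFA state table kept in linear device memory through a texture
// object, so that the irregular per-transition reads go via the texture cache.
cudaTextureObject_t make_state_table_texture(const int* d_table, size_t n_entries) {
    cudaResourceDesc res = {};
    res.resType                = cudaResourceTypeLinear;
    res.res.linear.devPtr      = (void*)d_table;
    res.res.linear.desc        = cudaCreateChannelDesc<int>();
    res.res.linear.sizeInBytes = n_entries * sizeof(int);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}

// Inside a kernel, a transition is then fetched as:
//     int next = tex1Dfetch<int>(tex, state * 256 + byte);
```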

A case that requires special consideration, though, is when patterns span two or more different chunks. The simplest approach for fixed string patterns is to continue the scanning into the next chunk(s), up to *n* bytes, where *n* is the maximum pattern length in the dictionary. However, the patterns used for virus scanning are typically very large, especially compared with other signature-based tools, such as Snort. Besides that, a virus signature may contain wild-card characters (i.e., \*), so the length of the patterns may not be bounded. To overcome this, the following heuristic is used: each thread carries on the search into the consecutive bytes of the following chunk, until a fail state or final state is reached. While matching a pattern that spans chunk boundaries, the state machine will perform regular transitions. However, if the state machine reaches a fail or final state, then it is clear that there is no need to continue the search, since any subsequent patterns will be found by the thread that was assigned to search that chunk. This enables us to avoid any communication between the threads concerning boundaries in the input data buffer. Every time a match is found, it is stored in a bit array. The size of the bit array is equal to the size of the data that is processed at once, and each bit represents whether a match was found at the corresponding offset.
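Putting the chunking and the overlap heuristic together, a simplified scanning kernel could look like the following (parameter names and the bit-array layout are assumptions of this sketch): each thread scans its own chunk and keeps going past the boundary only while the automaton stays in a live, non-final state.

```cuda
#include <cstdint>

// Each thread scans one chunk of the shared input buffer and continues into
// the following chunk(s) only while a pattern that started in its own chunk
// can still be open. Matches are recorded in a shared bit array.
__global__ void scan_chunks(const int*     dfa,         // signed transitions, <0 = final
                            const uint8_t* buf,         // single input buffer
                            int            total_len,
                            int            chunk_size,
                            unsigned int*  match_bits)  // one bit per input offset
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int start = tid * chunk_size;
    if (start >= total_len) return;

    const int ROOT = 0;  // in this sketch the DFA is complete: a transition back
                         // to the root plays the role of the "fail" state.
    int state = ROOT;
    for (int i = start; i < total_len; i++) {
        int  next     = dfa[state * 256 + buf[i]];
        bool is_final = (next < 0);
        if (is_final) {                                  // mark the match offset
            atomicOr(&match_bits[i >> 5], 1u << (i & 31));
            next = -next;
        }
        // Past our own chunk, stop as soon as no pattern that started inside
        // the chunk can still be open, or a match has just been reported.
        if (i >= start + chunk_size && (next == ROOT || is_final))
            break;
        state = next;
    }
}
```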

**Figure 9** shows the throughput achieved for different prefix lengths. As input data stream, we use the files under /usr/bin/ of a typical Linux installation, which contains 1516 binary files of about 132 MB in total. To eliminate disk latencies, all files are cached in memory by the operating system. Even though the files do not contain any viruses, they exercise most code branches of our tool. As we can see, the overall throughput increases rapidly, reaching a maximum of 20 Gbit/s. Moreover, the overall application throughput increases proportionally to the prefix length, due to the fact that the number of potential matches decreases, resulting in lower CPU post-processing. A plateau is reached for a prefix length of around 10 (**Figure 7**).

**Figure 8.**
*Virus matching on the GPU.*

**Figure 9.**
*Performance of GPU-assisted ClamAV and vanilla ClamAV. The performance number for ClamAV running on eight cores is also included. The CPU-only performance is still an order of magnitude less than the GPU-assisted.*

**Figure 10.**
*Execution time breakdown.*


### **4. Kernel integrity monitoring**

In this section, we describe how to leverage modern GPUs as a memory monitoring mechanism. In particular, we show the design of an external, snapshot-based, GPU-based kernel integrity tool that can be deployed in commodity servers and personal computers. Typically, a host memory integrity monitor should meet at least the following set of requirements [12, 13]: (i) Independence: the monitor needs to use a dedicated GPU that is completely isolated from the host. Our integrity monitor should operate continuously and detect any malicious action, notwithstanding the running state of the host machine. The GPU must be used exclusively for memory monitoring. Obviously, other usages can be served by any extra GPUs, if necessary, without affecting the proper operation of the kernel monitor. (ii) Host-memory access: the physical memory of the host must be accessed directly, in order to periodically check its integrity and detect any suspicious or malicious actions. (iii) Sufficient computational and memory resources: the system must be able to perform any requested computational operations and be capable of processing large amounts of data efficiently. In addition, it must have sufficient on-chip memory that can be used for private calculations, ensuring that secret data cannot be leaked to or held by an adversary that has compromised the host system. (iv) Out-of-band reporting: the system must be able to report the state of the host system over a secure channel. This can be achieved by establishing a secure connection that can be used to exchange valid reports, even in the case that the host system has been fully compromised.

In order to meet the above requirements, several characteristics of the GPU's execution model require careful consideration. For instance, the typical GPU model, in which a GPU kernel runs for a while, performs some computations, and then terminates, cannot be considered secure. Instead, the coprocessor needs to execute in isolation, without being influenced by the host it protects. It is clear that leveraging GPUs for designing an independent environment with unrestricted memory access that will monitor the host's memory securely is rather challenging. Many GPU characteristics must be considered carefully and handled in a particular way (**Figure 10**).

**Figure 11** shows the overall architecture. In essence, the GPU continuously inspects the specified kernel memory regions via DMA, over the PCI Express bus, and reports any suspicious or unwanted activity to an externally connected admin station on the local network. From a high-level perspective, our system has two main parts that run in parallel: the device program (GPU code) and the host program (user process). The device program is responsible for continuously checking the integrity of the requested memory regions and reporting any alerts. The host program periodically reads the status area and reports to the admin station, in the form of keep-alive messages. The only trusted component is hosted on the GPU; the user process cannot be trusted, so there is an end-to-end encryption scheme between the GPU and the admin station to protect against attacks.

**Figure 11.**
*The GPU-assisted integrity monitor architecture. The GPU continuously checks the integrity of host memory regions by accessing the corresponding device virtual addresses. To defend against man-in-the-middle and replay attacks on the reporting channel, a user program periodically sends an encrypted sequence number together with a status code to a separate admin station.*
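At a high level, the device-side part of this loop can be sketched as follows; the region descriptors, the checksum function, and the status/heartbeat words are placeholders of our own, not the actual implementation. The GPU repeatedly hashes every monitored region, flags a mismatch in a status word, and bumps a heartbeat counter that the host program relays to the admin station.

```cuda
#include <cstdint>

struct Region {
    const volatile uint8_t* base;  // device-visible address of a monitored region
    size_t                  len;
    uint64_t                good;  // precomputed known-good checksum
};

// Simple FNV-1a hash; it stands in for whatever integrity function is used.
// The volatile pointer forces every iteration to re-read the monitored memory.
__device__ uint64_t fnv1a(const volatile uint8_t* p, size_t n) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

// Long-running kernel: one thread per region keeps re-checking it and updates
// a status word plus a heartbeat counter that the host forwards periodically.
__global__ void monitor(const Region* regions, int nregions,
                        unsigned int* status,      // 0 = OK, 1 = tampered
                        unsigned int* heartbeat)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nregions) return;

    for (;;) {                                     // runs until torn down
        if (fnv1a(regions[r].base, regions[r].len) != regions[r].good)
            atomicExch(status, 1u);
        if (r == 0)
            atomicAdd(heartbeat, 1u);              // keep-alive for the host program
    }
}
```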

#### **Figure 11.**
*The GPU-assisted integrity monitor architecture. The GPU continuously checks the integrity of host memory regions by accessing the corresponding device virtual addresses. To defend against man-in-the-middle and replay attacks on the reporting channel, a user program periodically sends an encrypted sequence number together with a status code to a separate admin station.*

#### **4.1 Autonomous GPU execution**

Given that the host system may be vulnerable and could be compromised, it is important that the integrity monitoring of the operating system is completely isolated from the host, guaranteeing that an adversary cannot tamper with any code or data used by our system. Modern GPU chips follow a non-preemptive, cooperatively scheduled execution style, which means that only a single kernel can run on the device at any point in time. As such, a bootstrapping method is employed that forbids any external interaction with our system. Any attempt to kill, pause, or suspend the GPU component of our system results in a system shutdown, and the system can only resume its operation by repeating the bootstrap process. Even though many recent NVIDIA models conditionally support concurrent kernel execution, these conditions can easily be circumvented by occupying all resources. By doing so, no other, possibly malicious, code can run in parallel.

Even though these procedures ensure that our system can be initialized and run safely, current GPGPU frameworks (i.e., CUDA and OpenCL) do not support isolated and/or independent execution. To make matters worse, some drivers even suspend a context if the GPU kernel appears to be unresponsive. To overcome these issues, we configure the driver to ignore these types of checks. In addition, we create a second stream dedicated to the GPU kernel of our system, since by default only one queue (or stream, in CUDA terminology) is active and the host cannot issue any data transfers before the previous kernel execution has finished. Therefore, all data communications with the host and the admin station are issued over a separate stream.
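As a rough illustration of this setup, the sketch below launches a long-running monitoring kernel on one CUDA stream and keeps a second stream free for all host-side transfers. The kernel body and all names are illustrative only, not the actual code of the tool, and a never-terminating kernel assumes the driver watchdog has been disabled, as discussed above:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sketch only: the monitoring kernel loops forever and keeps a status word up
// to date in device memory; the host fetches it over a second stream while
// the kernel keeps running on the first one.
__global__ void monitor_kernel(volatile unsigned int *status) {
    for (;;) {
        // ... read and checksum the monitored regions here ...
        *status = 0xA11FEu;          // placeholder "alive / no alert" marker
        __threadfence();             // make the update visible outside the SM
    }
}

int main() {
    unsigned int *status_dev, status_host = 0;
    cudaMalloc(&status_dev, sizeof(unsigned int));
    cudaMemset(status_dev, 0, sizeof(unsigned int));

    cudaStream_t exec_stream, copy_stream;
    cudaStreamCreate(&exec_stream);   // runs the never-terminating GPU kernel
    cudaStreamCreate(&copy_stream);   // dedicated to host/admin communication

    monitor_kernel<<<1, 1, 0, exec_stream>>>(status_dev);

    for (int i = 0; i < 5; ++i) {     // simplified host-side polling loop
        cudaMemcpyAsync(&status_host, status_dev, sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, copy_stream);
        cudaStreamSynchronize(copy_stream);
        printf("GPU status word: 0x%x\n", status_host);
    }
    return 0;                         // sketch: no clean shutdown of the kernel
}
```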

#### **4.2 Host memory access**

An important requirement for our GPU-assisted kernel integrity monitor is to implement a mechanism to access the requested memory pages that need to be monitored. Current GPGPU frameworks, such as CUDA and OpenCL, use a virtual address layer that is located within the host process virtual memory. Given that our system needs to access the kernel's memory, the kernel memory regions that are monitored should be mapped to the user process.

Typically, access to memory regions that have not been assigned to a process is prohibited in the majority of modern OSes, including Linux and Windows. Every time a process accesses a page outside of its virtual address space, a segmentation violation is raised. To overcome this restriction and be able to access the memory regions where the OS kernel text and data reside, we first need to map them to the user space of the host process that has spawned the GPU kernel. To overcome the protection added by the OS, a separate loadable kernel module is used, which is able to map a requested memory region to user space. These memory regions can subsequently be registered to the GPU address space using the CUDA API. Afterwards, the GPU is able to access the requested kernel memory regions directly through the physical address space, due to the fact that the GPU is a peripheral PCIe device (i.e., it only uses physical addressing to access the host memory). This feature enables us to un-map the user-space mappings of the kernel memory regions during the bootstrap phase, which would otherwise pose significant security risks.

##### *4.2.1 Mapping kernel memory to GPU*

During bootstrapping, our GPU-assisted integrity monitor acquires the kernel memory regions that need to be monitored. Given that these regions are located in the kernel virtual address space, the first step is to map them to the virtual address space of the user process, which spawns the execution of the kernel integrity monitoring GPU kernel.

In most modern operating systems, a peripheral device is able to bypass the virtual address layer and access physical addresses directly. The device driver creates a device-specific mapping of the device's address space that points to the corresponding host physical pages. In order to map the corresponding kernel physical pages to the GPU, we use a separate loadable kernel module that is responsible for providing the required page table mapping functionality. **Figure 12** shows the steps for mapping OS kernel memory to the GPU. In step 1, the loadable kernel module resolves the physical mapping for a given kernel virtual address. In step 2, the kernel module allocates one page in the user context and saves its physical mapping; then it makes the allocated page point to the same physical page as the kernel virtual address by duplicating the PTE in the user page table. Afterwards, in step 3, the kernel module maps this user page to the GPU (cudaHostRegister() with the cudaHostRegisterMapped flag) and gets its device pointer via cudaHostGetDevicePointer(). Last, the original physical mapping of the allocation is restored and the page is freed by the kernel module (step 4). By doing so, any OS kernel memory page can be effectively mapped to the GPU address space. Furthermore, by un-mapping the user-allocated page right after the successful execution of the bootstrapping process, all intermediate mappings are destroyed. The same procedure is performed for all kernel virtual memory ranges that need to be monitored by our tool. The GPU driver populates a device-resident page table in order to resolve the device's virtual addresses and spawn DMA transactions.
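A minimal user-space sketch of steps 2–4 is given below. The kmap device node, its ioctl number, and the request structure are hypothetical stand-ins for the loadable kernel module described above; only cudaHostRegister() and cudaHostGetDevicePointer() are the calls named in the text:

```cpp
#include <cuda_runtime.h>
#include <sys/ioctl.h>
#include <cstdlib>

// Hypothetical ioctl interface of the helper LKM: it duplicates the PTE of the
// kernel page backing 'kernel_va' into the user page at 'user_va' (step 2).
struct kmap_request { unsigned long kernel_va; unsigned long user_va; };
#define KMAP_ALIAS_PAGE 0x4b01   // illustrative request number only

// Returns a device pointer through which the GPU can read one kernel page.
void *map_kernel_page_to_gpu(int kmap_fd, unsigned long kernel_va) {
    void *user_page = nullptr;
    posix_memalign(&user_page, 4096, 4096);      // page-aligned user allocation

    kmap_request req = { kernel_va, (unsigned long)user_page };
    ioctl(kmap_fd, KMAP_ALIAS_PAGE, &req);       // step 2: alias the kernel page

    cudaHostRegister(user_page, 4096, cudaHostRegisterMapped);   // step 3: pin it
    void *dev_ptr = nullptr;
    cudaHostGetDevicePointer(&dev_ptr, user_page, 0);            // device pointer
    return dev_ptr;   // step 4 (restoring the original mapping) is left to the LKM
}
```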

Finally, we compile Linux with the CONFIG\_KALLSYMS=y and CONFIG\_KALLSYMS\_ALL=y flags, in order to have full access to the /proc/kallsyms interface. Even though this is not an inherent constraint of our design, it makes development and debugging much easier, because it alleviates the need for custom memory scanners to search for the kernel addresses that need to be monitored. In case access to the kernel symbol lookup table is not acceptable (i.e., in certain environments), we could locate the corresponding memory regions either using memory pattern scanners for dynamically loadable parts or via an external symbol table.
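Resolving the kernel virtual address of a symbol from /proc/kallsyms then takes only a few lines of host code, along the lines of the sketch below (the addresses read back as zero on kernels that restrict pointer exposure to unprivileged users):

```cpp
#include <cstdio>
#include <cstring>

// Look up the address of a kernel symbol in /proc/kallsyms, e.g.
// kallsyms_lookup("modules") for the head of the kernel module list.
// Sketch only; error handling kept to a minimum.
unsigned long long kallsyms_lookup(const char *wanted) {
    FILE *f = fopen("/proc/kallsyms", "r");
    if (!f) return 0;
    char line[512], name[256], type;
    unsigned long long addr = 0, found = 0;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%llx %c %255s", &addr, &type, name) == 3 &&
            strcmp(name, wanted) == 0) { found = addr; break; }
    }
    fclose(f);
    return found;   // 0 if not found (or if the kernel hides the addresses)
}
```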

#### **Figure 12.**

*Mapping OS kernel memory to the GPU. First, we have a kernel virtual address pointing to a physical address (step 1). In step 2 this mapping is duplicated to user space using a specialized kernel module capable to manipulate page tables. The user virtual address is then passed to a CUDA API call that pins the page into memory and creates the GPU-side page table entries (step 3). Finally, the intermediate user space mappings are destroyed (step 4), while the GPU continues to access the physical page.*


#### **4.3 Memory integrity monitoring**

The user can specify which host memory regions will be monitored; these regions can include, among others, pages that contain kernel text, kernel modules (LKMs), and arrays or structures that contain function pointers (e.g., jump tables). Even though hashing static kernel text or loaded LKMs is straightforward, many parts of the OS kernel are quite dynamic. For instance, the data structures of the VFS layer change every time a new filesystem is mounted or unmounted. In addition, function pointers can be added dynamically by every loaded LKM.
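One simple way to organize this is a table of region descriptors that the host fills in during bootstrapping and the GPU kernel walks continuously; the layout below is an assumed sketch, not the tool's actual data structure:

```cpp
#include <cstdint>

// One entry per monitored kernel memory range (simplified, assumed layout).
// dev_ptr is the GPU-visible device pointer obtained during bootstrapping,
// length is the size of the region in bytes, expected is the checksum of the
// known-good contents, and flags can mark the kind of region being watched
// (e.g., static kernel text, an LKM, or a function-pointer table).
struct MonitoredRegion {
    const uint8_t *dev_ptr;   // device pointer aliasing the kernel page(s)
    uint32_t       length;    // bytes to hash
    uint32_t       expected;  // known-good checksum
    uint32_t       flags;     // illustrative region type marker
};

// The table itself lives in GPU memory so that the monitoring kernel can walk
// it without any host involvement.
```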

As modern GPUs offer general-purpose programmability, it is feasible to implement multi-step checks on different memory regions. These checks can be quite complex, such as cases where several memory pointers need to be dereferenced in order to acquire the proper kernel memory address. Such checks can be supported by walking the kernel page table and resolving virtual addresses dynamically from the GPU: given that we can already access the parts of the page table needed to follow a specific virtual address down to a leaf page table entry (PTE), we end up with a physical page number.

Finally, we note that dynamic page table resolutions are not currently supported; instead, we need to provide a static list of kernel memory regions. Still, this is not an inherent limitation of a PCIe device such as the GPU. The development of open-source frameworks (e.g., Gdev [14]) and drivers (e.g., Nouveau [15], PSCNV [16]) will make it feasible to map any physical page into the GPU address space.

#### **4.4 Sufficient resources**

Modern GPUs are usually equipped with hundreds of cores and an adequate amount of memory. This gives them the ability to keep a sufficient number of memory snapshots and enough state to detect complex, multi-step attacks. It is clear, though, that these types of checks can become quite sophisticated, principally due to the lack of a generic framework for defining detection modules on top of our architecture. Even though such complicated memory checks are not currently supported by our GPU-assisted integrity tool, we have tested the operation of aggressively reading and hashing memory, and as we demonstrate below, the GPU provides the computational and memory resources to support them.

In particular, we measure the detection rate of a self-hiding LKM that is loaded repeatedly, 100 times, under different snapshot frequencies. The self-hiding LKM acts similarly to a rootkit, although it does not perform any malicious operations; an actual rootkit would be exposed to the monitor for a longer time period once loaded, in order to perform its malicious actions. The detection rate achieved is shown in **Figure 13**. We utilize a GTX770 in each configuration and use CRC-32 for checksums (as defined by ISO 3309), due to its simplicity, speed, and wide adoption. As can be seen, we can reliably detect (i.e., with a 100% detection rate) that a new kernel module has been loaded before it hides itself, with a snapshot frequency of 9 KHz or more.
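A minimal sketch of such a device-side check is shown below: one thread block per monitored region repeatedly recomputes a bitwise CRC-32 (ISO 3309 polynomial, reflected form 0xEDB88320) and raises an alert flag on any mismatch. The names and the single-threaded CRC are illustrative; the actual tool distributes the work across many threads:

```cpp
#include <cstdint>

// Bitwise CRC-32 (ISO 3309, reflected polynomial 0xEDB88320).
__device__ uint32_t crc32_region(const uint8_t *data, uint32_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (uint32_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

// One block per monitored region; thread 0 recomputes the checksum in a loop
// and sets the alert flag if it ever deviates from the expected value.
__global__ void integrity_check(const uint8_t *const *regions,
                                const uint32_t *lengths,
                                const uint32_t *expected,
                                volatile int *alert) {
    int r = blockIdx.x;
    if (threadIdx.x != 0) return;            // sketch: single thread per region
    while (!(*alert)) {
        uint32_t crc = crc32_region(regions[r], lengths[r]);
        if (crc != expected[r]) { *alert = r + 1; __threadfence(); }
    }
}
```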

Next, we study the implications of requiring a snapshot frequency of at least 9 KHz for accurate detection, with respect to the amount of memory we can cover. The snapshot frequency is a function of the number and size of the monitored memory regions. We notice that memory alignment is a major factor that significantly affects the performance of GPU memory reads; the most efficient memory reads are achieved with 16-byte aligned accesses (one uint4 in CUDA's native device data types). Unfortunately, since we cannot control how the kernel data structures are placed in memory, we assume that many of our monitored regions will require one or two extra fetches.
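The alignment effect comes from the fact that a uint4 access compiles to a single 16-byte load; the hypothetical helper below contrasts the fast, aligned path with the byte-wise fallback (it assumes the destination buffer is itself 16-byte aligned and, on the fast path, that the length is a multiple of 16):

```cpp
#include <cstdint>

// Copy a monitored region into a local buffer, using 128-bit loads when the
// source pointer happens to be 16-byte aligned.
__device__ void fetch_region(const uint8_t *p, uint32_t len, uint8_t *out) {
    if ((((uintptr_t)p) & 0xF) == 0) {                  // aligned: fast path
        const uint4 *src = reinterpret_cast<const uint4 *>(p);
        uint4 *dst = reinterpret_cast<uint4 *>(out);
        for (uint32_t i = 0; i < len / 16; ++i)
            dst[i] = src[i];                            // one 16-byte load/store
    } else {
        for (uint32_t i = 0; i < len; ++i)              // unaligned: byte loads
            out[i] = p[i];
    }
}
```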

In our final experiment, we focus on monitoring single memory pointers (8 bytes each). As shown in **Figure 14**, we can monitor at most 8 K pointers simultaneously without any loss in accuracy, due to the fact that we need to stay above the 9 KHz snapshot frequency. Obviously, this limits the amount of memory that we can monitor using our system, albeit 8 K addresses that are spread out in memory could potentially safeguard many important kernel data structures.

#### **Figure 13.**

*Self-hiding LKM loading detection with different snapshot frequencies. For each case, a module is loaded that deletes itself from the kernel modules list 100 times, while monitoring the head of the list. A 100% detection rate with a snapshot frequency of 9 KHz or more is achieved.*

#### **Figure 14.**

*Maximum achieved frequency according to the number of pointers being monitored. As the number of (8-byte) memory regions increases, the snapshot frequency decreases. A threshold of 9 KHz is required to accurately detect a self-hiding LKM loading, which is achieved with monitoring 8 K pointers.*


#### **4.5 Out-of-band execution**


The coprocessor nature of GPUs offers limited defenses on its own. For instance, an attacker that has gained access to the host system can easily reset or disable the GPU device and, as a result, block its operation as an integrity monitor. To address this, an admin station needs to be deployed, completely isolated from the monitored system, that is responsible for keeping track of its proper state. Specifically, the user program that has spawned the GPU monitoring kernel periodically sends keep-alive messages to the admin station via a virtual private connection. It is clear that simply sending messages to raise an alert is unsafe, because it is difficult for the admin station to distinguish normal operation from a network partition or other failure. As such, we use keep-alive messages that encapsulate a GPU-generated status. In order to prevent attackers from sending spoofed messages or replaying old ones, we further encrypt the keep-alive messages together with a sequence number. The secure channel between the admin station and the host is established during the bootstrapping phase. The admin station logs the reports and is responsible for tracking the responsiveness of the monitor; every time an alert is received, it takes the appropriate action. Moreover, the admin station handles any error case, such as missed reports (detected by a time-out) or invalid messages.
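The host-side keep-alive loop can be as small as the sketch below, where encrypt\_report() is a placeholder for the authenticated encryption established during bootstrapping (e.g., AES-GCM keyed on the GPU), and the admin station address is made up for the example:

```cpp
#include <cstdint>
#include <cstring>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

struct Report { uint64_t sequence; uint32_t status; };   // assumed wire format

// Placeholder for the authenticated encryption shared with the admin station;
// the real system would never send the report in the clear like this stub does.
void encrypt_report(const Report &in, uint8_t out[sizeof(Report)]) {
    memcpy(out, &in, sizeof(Report));
}

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in admin{};                       // hypothetical admin station
    admin.sin_family = AF_INET;
    admin.sin_port = htons(7000);
    inet_pton(AF_INET, "192.0.2.10", &admin.sin_addr);

    Report r{0, 0};
    for (;;) {                                 // keep-alive loop
        r.sequence++;                          // monotonically increasing counter
        r.status = 0 /* read from the GPU status area */;
        uint8_t wire[sizeof(Report)];
        encrypt_report(r, wire);
        sendto(sock, wire, sizeof(wire), 0, (sockaddr *)&admin, sizeof(admin));
        sleep(1);                              // report period
    }
}
```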

### **Acknowledgements**

The author would like to thank Sotiris Ioannidis, Michalis Polychronakis, Elias Athanasopoulos, and Lazaros Koromilas for their invaluable support and contributions during the development of several parts of this work. I gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

### **Author details**

Giorgos Vasiliadis
Foundation for Research and Technology—Hellas, Heraklion, Crete, Greece

\*Address all correspondence to: gvasil@ics.forth.gr

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Snort—Network Intrusion Detection & Prevention System. Available from: https://www.snort.org/

[2] Paxson V, Asanović K, Dharmapurikar S, Lockwood J, Pang R, Sommer R, et al. Rethinking hardware support for network analysis and intrusion prevention. In: Proceedings of the 1st USENIX Workshop on Hot Topics in Security (HotSec). 2006

[3] Aho AV, Corasick MJ. Efficient string matching: An aid to bibliographic search. Communications of the ACM. 1975;**18**(6):333-340

[4] Norton M. Optimizing Pattern Matching for Intrusion Detection. Whitepaper. 2004. Available from: https://www.snort.org/documents/optimization-of-pattern-matches-for-ids

[5] PCRE: Perl Compatible Regular Expressions. Available from: http://www.pcre.org

[6] Thompson K. Programming techniques: Regular expression search algorithm. Communications of the ACM. 1968;**11**(6):419-422

[7] Berry G, Sethi R. From regular expressions to deterministic automata. Theoretical Computer Science. 1986;**48**(1):117-126

[8] Hopcroft JE, Ullman JD. Introduction to Automata Theory, Languages, and Computation. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.; 1990

[9] ClamAV Open-source Antivirus. Available from: http://www.clamav.net/

[10] Boyer RS, Moore JS. A fast string searching algorithm. Communications of the Association for Computing Machinery. 1977;**20**(10):762-772

[11] Miretskiy Y, Das A, Wright CP, Zadok E. Avfs: An on-access anti-virus file system. In: Proceedings of the 13th USENIX Security Symposium. 2004


[12] Petroni NL Jr, Fraser T, Molina J, Arbaugh WA. Copilot: A coprocessorbased kernel runtime integrity monitor. In: USENIX Security Symposium. 2004

[13] Lee H, Moon H, Jang D, Kim K, Lee J, Paek Y, et al. KI-mon: A hardwareassisted event-triggered monitoring platform for mutable kernel object. In: USENIX Security Symposium. 2013

[14] shinpei0208/gdev. Available from: https://github.com/shinpei0208/gdev

[15] Nouveau driver for nVidia cards. Available from: http://nouveau.freedesktop.org/

[16] PathScale NVIDIA graphics driver. Available from: https://github.com/pathscale/pscnv

### Chapter 5

## Particle-Based Fused Rendering

Koji Koyamada and Naohisa Sakamoto

### Abstract

In this chapter, we propose a fused rendering technique that can integrally handle multiple irregular volumes. Although there is a strong requirement for understanding large-scale datasets generated from coupled simulation techniques such as computational structure mechanics (CSM) and computational fluid dynamics (CFD), there is no fused rendering technique to the best of our knowledge. For this purpose, we can employ the particle-based volume rendering (PBVR) technique for each irregular volume dataset. Since the current PBVR technique regards an irregular cell as a planar footprint during depth evaluation, the straightforward employment causes some artifacts especially at the cell boundaries. To solve the problem, we calculate the depth value based on the assumption that the opacity describes the cumulative distribution function (CDF) of a probability variable, w, which shows a length from the entry point in the fragment interval in the cell. In our experiments, we applied our method to numerical simulation results in which two different irregular grid cells are defined in the same space and confirmed its effectiveness with respect to the image quality.

Keywords: volume rendering, irregular volume, unstructured grid

### 1. Introduction

Coupled analysis is important for solving complex phenomena. Several computational schemes such as computational fluid dynamics (CFD), computational structure mechanics (CSM), and computational electromagnetics (CEM) are applied to the same geometrical domain, and high-performance computing (HPC) facilities have been used for the computation. Since, in general, the requirement for the computational grid is different for each scheme, the space is composed of multiple irregular volumes. Thus, a volume rendering technique which can handle multiple irregular volumes is desired.

Volume rendering can provide useful information because, with this technique, it is possible to grasp the spatial distribution of the related physical quantities. It is a powerful technique for displaying volume datasets, especially three-dimensional scalar data fields, which are composed of scalar data values defined in three-dimensional space. In a CSM, CFD, or CEM simulation, the three-dimensional space is composed of computational cells, the shapes of which are, for example, tetrahedra, prisms, and hexahedra. In a large-scale simulation model for complex physical phenomena, the number of computational cells may be more than one million.

Current volume rendering techniques require discrete sampling in which the composition is executed in the visibility order. In the volume rendering calculation, accumulating the optical depth is time consuming. The optical depth is accumulated so that we can efficiently calculate the occlusion of the scattered light from various lighting positions. Most volume rendering techniques accumulate the optical depth in order from front to back or from back to front along a viewing ray.



Although this sorted sampling is straightforward in structured grid data, it becomes more complicated if the computation is conducted in a distributed computing environment. In such an environment, the sub-volume dataset is stored in each distributed node. In this case, it is difficult for a sub-volume to be sorted along the viewing ray since the shape of the sub-volume may be concave. On the other hand, the shape of a computational cell is convex. The whole volume is subdivided into multiple sub-volumes so that the total data transmission cost is minimized.

To handle the problem, a particle-based rendering technique, which does not need visibility sorting, has been proposed [1]. In this technique, opaque emissive particles are employed to realize sorting-free rendering. There are two approaches to the implementation, that is, an object-space approach and an image-space approach (see Figure 1). In the former approach, which we call object-space particle-based volume rendering (O-PBVR), a particle density function is estimated from a user-specified transfer function, and a set of opaque particles is generated at each computational cell. The generated particles are projected onto an image plane, and the projection is repeated to improve the image quality. Although O-PBVR shows good scalability for handling large-scale irregular volumes [2], the current drawback of the technique is the generation of low-quality images in which particles are visible on the boundary surface polygons when viewed closely. Moreover, with O-PBVR, it is necessary to generate a large number of particles to obtain a high-resolution image.

In the latter approach, we proposed a sorting-free technique by regarding the brightness equation as the expected value of the luminosity of a sampling point along a viewing ray [3]. We applied the technique to a projected tetrahedral technique with pre-integration. We called this image-space particle-based volume rendering (I-PBVR). We conducted a thorough experimental analysis to construct the performance model [3]. The model suggests that I-PBVR is preferable to O-PBVR when a volume dataset is rendered in a high resolution. In addition, I-PBVR becomes a feasible choice for rendering irregular volume data when particle generation becomes frequent.

Figure 1. Overview of PBVR processes that employ the ensemble average.


To understand the relationships between variables in the irregular volume data, it is necessary to integrate multiple volumes into a single volume rendering. In this chapter, we improve the I-PBVR technique in terms of its extensibility to multiple volumes and propose a new technique for semi-transparent rendering which can integrally handle multiple volumes without visibility sorting. I-PBVR regards a tetrahedral cell as a triangle footprint on the image plane. When a single volume is rendered, the footprints do not intersect with one another, since the volume is composed of cells that do not overlap. When multiple volumes are rendered, the footprints may intersect, and this intersection causes a problem in which the cell boundaries become visible. Thus, in this chapter, our research question is "How can we realize a fused volume rendering technique which is free from artifacts?" The hypothesis for this question is "If we employ an adequate probability process to sample particles along a viewing ray in a grid cell, the artifact is not noticeable." In the early age of volume rendering, Blinn assumed that the number of particles distributed in a volume space follows the Poisson distribution. If we take an interval that is a part of the viewing ray cut by cell faces, the distance between particles is distributed according to the exponential distribution.

In the remaining part, we first describe related work and the basic theory of PBVR and then propose a fused rendering technique using PBVR. Finally, we conduct a comparative study to confirm the effectiveness of the proposed technique. To test the hypothesis, we design the following experiments for the sampling along the interval:

1. The sampling is made at the entry, exit, or middle point along an interval in the grid cell.

2. The sampling is made at a random position along an interval in the grid cell.

3. The sampling is made according to a probability process, which is determined based on an opacity along the interval.


### 2. Related work

In the particle-based modeling of Saturn's ring, Blinn assumed that the number of particles follows the Poisson distribution although he did not describe it in detail [4]. The assumption led to a definition of opacity which was an important keyword for the volume rendering. Then, volume rendering has been the focus of intensive study for nearly three decades [5–7]. The volume rendering of unstructured volume data has received much attention, and several approaches have been proposed. Extensive literature and surveys on volume rendering are available that address unstructured volume data [8, 9]. A concern has often been visibility sorting, which causes a severe bottleneck in the interactive exploration.

To solve the problem recognized by many volume rendering researchers, we returned to a density emitter model and presented the basic concept for this approach. The proposed PBVR technique represents the 3D scalar fields as a set of particles and considers both emission and absorption effects [1, 2]. The particle density is derived from a user-specified transfer function and is used to estimate the number of particles to be generated in a given volume dataset. Because the particles can be considered fully opaque, no visibility sorting processing is required during the rendering process, which is advantageous from a distributed processing perspective.

The development of a fused rendering technique began in the seismic or medical imaging field. Lu Cai et al. developed a fused volume rendering technique for multiple seismic attribute volume data by the way of planar slices or horizon slices and revealed a variety of geological phenomena more effectively and clearly in order to provide a true three-dimensional perspective view. Their method can address only regular volume datasets [10].

### 3. Particle-based volume rendering

### 3.1 Definition of opacity

In image generation by volume rendering, it is assumed that, given a viewing ray (viewpoint and direction), the energy from a particle reaches the viewpoint when there is no other emissive particle between the particle and the viewpoint. Let us consider a certain section on the viewing ray, where τ particles are distributed on average. Furthermore, suppose that this section is divided into M equal parts, so that M small sections are formed, and particles are present in k of these small sections. Here, it is assumed that there is at most one particle in each small section, and k is a non-negative integer. At this time, the probability p that there is a particle in a small section is as follows:

$$p = \frac{\tau}{M} \tag{1}$$


Such a distribution follows a binomial distribution, and its probability is given as follows:

$$P(X=k) = {}\_{M}C\_{k}p^{k}(1-p)^{M-k} \tag{2}$$

Here, the number of particles X is set as a random variable. When M is brought close to infinity while keeping τ = Mp constant, its distribution becomes the Poisson distribution of the average τ. The probability that there are k particles in the section is expressed by the following equation:

$$P(N=k) = \frac{\tau^k e^{-\tau}}{k!} \tag{3}$$

Here, e is Napier's constant (e = 2.71828 …), and k! is the factorial of k. Thus, the probability is a positive real number. The Poisson distribution is often employed in the context of the number of occurrences of events within an interval defined in the time domain, but in volume rendering, it is considered not in the time domain but in the space domain. Eq. 3 represents the probability that k particles exist when the average number of particles is τ. When k = 0, it represents the situation in which no particle exists in the section on the viewing ray.

$$P(N = 0) = e^{-\tau} \tag{4}$$

If there are many particles, the rate at which energy reaches the viewpoint becomes small, and it is easier to define the passing distance of light by the number of particles rather than the actual distance. Therefore, the average particle number τ is also called the optical thickness. Additionally, a negative sign is given to this optical thickness, and an index is taken, that is, Eq. 4 is called transparency (t), and it shows the ease of light transmission. The opacity α is obtained by subtracting this transparency from 1 and represents the probability that one or more particles exist in the section.
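
To make Eqs. 1 to 4 concrete, the following short Python sketch (added here for illustration; the value of τ and the numbers of subdivisions M are arbitrary examples) computes the opacity of a ray section from the average particle count and shows the binomial model approaching the Poisson value as M grows:

```python
import math

def opacity_from_tau(tau: float) -> float:
    """Opacity of a section that contains tau particles on average (Eqs. 3-4):
    transparency t = P(N = 0) = exp(-tau), opacity alpha = 1 - t."""
    return 1.0 - math.exp(-tau)

def opacity_binomial(tau: float, M: int) -> float:
    """Same quantity from the underlying binomial model (Eqs. 1-2):
    the section is divided into M slots, each occupied with probability p = tau / M."""
    p = tau / M
    return 1.0 - (1.0 - p) ** M   # 1 - P(no slot is occupied)

tau = 0.7                              # hypothetical average particle count
print(opacity_from_tau(tau))           # ~0.503 (Poisson limit)
print(opacity_binomial(tau, M=10))     # ~0.516
print(opacity_binomial(tau, M=10000))  # approaches ~0.503 as M grows
```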

### 3.2 Volume rendering

In traditional computer graphics, it is assumed that all light is radiated from the outside, the particles constituting the object are treated as reflectors, and the light scattering and absorption are repeated inside the object. On the other hand, in the volume rendering technique proposed by Sabella in 1988, the internal structure of the volume data can be known by treating the particle as a radiator in addition to a reflector (particle emission model) for the purpose of visualizing the scalar volume.

In Blinn's model and Kajiya's model, the radiant energy is only reflected from the light source and energy emission (luminescence) by the particles themselves has not been considered. However, in the model of Sabella, from the standpoint of visualizing the volume data, we assume that the particles themselves emit light.

To accurately simulate the light scattering phenomenon inside the object, complicated analysis using radiation theory becomes necessary, and it is necessary to solve the scattering equation derived from that theory. In the particle emission model, focusing only on the viewing ray direction, we use a simple equation describing the transmission of optical energy with volume data. This equation can be formulated as follows by considering the difference in radiant intensity (luminance value) B in a cylindrical tube of minute length.

$$\begin{aligned} dB(t)A &= -\text{absorbed} + \text{emitted} \\ &= (-B(t) + c(t)) \times \pi r^2 \rho(t) A\, dt \end{aligned} \tag{5}$$

$$\frac{dB(t)}{dt} = (-B(t) + c(t)) \times \pi r^2 \rho(t)$$

Here, ρ(t) is the particle density (the number of particles per unit volume), r is the particle radius, and c(t) is the light emission amount per unit area.

$$B\_0 = B(t\_0) = \int\_{t\_0}^{t\_n} c(t) \times \pi r^2 \rho(t) \times \exp\left(-\int\_{t\_0}^{t} \pi r^2 \rho(\lambda)\, d\lambda\right) dt \tag{6}$$

Eq. 6 is integrated over the interval in which the parameters t0 and tn represent the nearest and the farthest points, respectively, from the viewpoint among the intersection points of the volume data and the viewing ray.

This is called the brightness equation. Generally, the brightness value B and the light emission amount c (t) are composed of three components of red, green, and blue. Eq. 6 shows that energy emitted from a point on the viewing ray reaches the viewpoint by receiving attenuation represented by an exponential term. Note that the exponent term represents the optical thickness τ, so it is equal to the transparency calculated by assuming a Poisson distribution.

In the particle emission model in volume rendering, from the viewpoint of visualization of scalar data, scalar values interpolated in particle positions are converted into color data (composed of three components of red, green, and blue). This conversion table is called a transfer function together with a conversion table to opacity described below.

In volume rendering, by performing shading processing, it is possible to effectively express shading on the isosurface inherent in the scalar volume. Particularly in the case of three-dimensional medical images, it is necessary to visualize complicated structures such as bones, muscles, blood vessels, etc. as isosurfaces, and shading processing becomes important. In shading processing targeting the scalar volume data, luminance value calculation using the gradient vector obtained by interpolation calculation inside the grid cell is performed.

To numerically calculate the brightness equation represented by Eq. 6, an integration area defined on the viewing ray is divided by a step width in which particle emission can be regarded as constant.

At this time, the k-th light emission amount c(t) is regarded as a constant and is set as ck. In the calculation of the integral, the integral in the exponent is split into the part over the current integration interval and the parts over the preceding intervals. The parts outside the current integration interval are expressed, interval by interval, as a product of factors. Each factor of the product is an exponential with a minus sign attached to the optical thickness of the corresponding interval, that is, it represents the transparency described above.

As a result, it can be seen that a term of the same form as the exponent term of the divided section outside the integral section and the transparency are included. Transparency takes values from 0 to 1 by definition. The value obtained by subtracting this transparency from 1 is called opacity, as mentioned above, and may be a target of transfer function, and opacity is used in volume rendering in many cases. That is, in the k-th integral interval, the opacity α<sup>k</sup> is defined as follows.

$$a\_k = \mathbf{1} - \exp\left(-\int\_{t\_k}^{t\_{k-1}} \pi r^2 \rho(\lambda) d\lambda\right) = \mathbf{1} - \exp\left(-\pi r^2 \rho\_k \Delta t\right) \tag{7}$$

where Δt is the length of the integration interval, and ρ<sup>k</sup> is the average particle density in the integral interval k. By introducing this opacity, Eq. 6 is as follows:

$$B\_k = c\_k a\_k \prod\_{j=0}^{k-2} \left( \mathbf{1} - a\_{j+1} \right) = c\_k a\_k \prod\_{j=1}^{k-1} \left( \mathbf{1} - a\_j \right) \tag{8}$$

By adding all the terms described by Eq. 8, the brightness is as follows:

$$B = \sum\_{k=1}^{n} \left[ c\_k \times a\_k \prod\_{j=1}^{k-1} \left( 1 - a\_j \right) \right] \tag{9}$$

Normally, this opacity is converted from the scalar value S in the transfer function specified by the user. That is, the opacity is a function of the scalar value S.

$$a = a(\mathcal{S}(x, y, z))\tag{10}$$

If the length of the integration interval is a value Δt' different from the predetermined Δt, it is necessary to make the following correction.

$$\begin{aligned} \alpha\_k &= 1 - \exp\left(-\pi r^2 \rho\_k \Delta t\right) \\ \alpha\_k' &= 1 - \exp\left(-\pi r^2 \rho\_k \Delta t'\right) \\ \therefore\ \alpha\_k' &= 1 - (1 - \alpha\_k)^{\frac{\Delta t'}{\Delta t}} \end{aligned} \tag{11}$$
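
The discrete accumulation of Eqs. 7 to 9 and the step-length correction of Eq. 11 can be sketched as follows; this is a minimal Python illustration, and the particle radius, step lengths, segment densities, and emissions are made-up example values rather than data from the chapter:

```python
import math

def segment_opacity(rho_k: float, r: float, dt: float) -> float:
    """Opacity of one integration interval (Eq. 7)."""
    return 1.0 - math.exp(-math.pi * r * r * rho_k * dt)

def correct_opacity(alpha_k: float, dt: float, dt_new: float) -> float:
    """Re-scale an opacity defined for step dt to a different step dt' (Eq. 11)."""
    return 1.0 - (1.0 - alpha_k) ** (dt_new / dt)

def brightness(emissions, opacities):
    """Accumulate Eq. 9: B = sum_k c_k * alpha_k * prod_{j<k} (1 - alpha_j)."""
    b, transmittance = 0.0, 1.0
    for c_k, a_k in zip(emissions, opacities):
        b += c_k * a_k * transmittance   # contribution of segment k (Eq. 8)
        transmittance *= 1.0 - a_k       # transparency accumulated so far
    return b

# Hypothetical values: three integration intervals along one viewing ray.
r, dt = 0.5, 1.0
rhos = [0.2, 0.8, 0.4]
alphas = [segment_opacity(rho, r, dt) for rho in rhos]
print(brightness([1.0, 0.5, 0.2], alphas))
print(correct_opacity(alphas[0], dt, dt_new=0.5))
```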


In volume rendering, the user sets the transfer function so that a large opacity (maximum value 1.0) is assigned to the scalar values to be emphasized and a small opacity (minimum value 0.0) to the less important scalar values. By doing so, the relationship between the scalar value and the opacity can be defined, and according to Eq. 11, the following particle density is defined in the three-dimensional region in which the volume data are defined.

$$\rho\_k = -\frac{\log\left(1 - \alpha\_k\right)}{\pi r^2 \Delta t} \tag{12}$$

Eq. 12 determines the particle density once a transfer function is set in the three-dimensional region and the particle diameter is determined; thus, particles can be generated using an appropriate method. By allocating colors to the particles and projecting them onto the image plane, volume rendering can be realized.

During the exploration phase, the opacity is often modified in order to change an emphasized region. If we keep the particle radius as defined in the first place, we need to re-generate the particles, which requires significant computational time. To avoid this additional particle generation process, we need to change the particle radius on the condition that the particle density stays the same. If we change the opacity from αk to α′k, the new radius r′ becomes as follows:

$$r' = r \sqrt{\frac{\log\left(1 - a\_k\right)}{\log\left(1 - a'\_k\right)}}\tag{13}$$
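
The opacity-to-density conversion of Eq. 12 can be illustrated with a small Python sketch (the radius, step length, and opacity below are hypothetical values); the round trip through Eq. 7 confirms that the computed density reproduces the requested opacity:

```python
import math

def density_from_opacity(alpha_k: float, r: float, dt: float) -> float:
    """Particle density that realizes opacity alpha_k for particle radius r
    and integration step dt (Eq. 12)."""
    return -math.log(1.0 - alpha_k) / (math.pi * r * r * dt)

def opacity_from_density(rho_k: float, r: float, dt: float) -> float:
    """Round trip through Eq. 7: alpha_k = 1 - exp(-pi r^2 rho_k dt)."""
    return 1.0 - math.exp(-math.pi * r * r * rho_k * dt)

r, dt = 0.5, 1.0                         # hypothetical radius and step length
rho = density_from_opacity(0.3, r, dt)
print(rho)                               # particles per unit volume
print(opacity_from_density(rho, r, dt))  # recovers the requested opacity 0.3
```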

### 3.3 O-PBVR

O-PBVR is comprised of three processes: particle generation, particle projection, and the ensemble average [1]. The first process constructs a density field and generates particles consistent with the density function. The density is derived from a user-specified transfer function that converts a scalar to an opacity data value, and it describes the probability that a particle is present at a given point in space. The second process projects particles onto an image plane, and the corresponding particle buffer stores the particles. Each pixel on the image plane contains sub-pixels (i.e., divided pixels), and the number of division is called the sub-pixel level. This sub-pixel processing is equivalent to an ensemble average which repeats the first and second processes in sequence, N times, and calculates the resulting brightness values by averaging the accumulating color values.

### 3.3.1 Particle generation

The particle model considers three particle attributes: shape, size, and density. The particle shape is assumed to be spherical, as in the density emitter model. The size of the sphere is characterized by its diameter, which is the same as a side length of a sub-pixel. The particle density ρ can be estimated from the user-specified transfer function. To generate a rendering image equivalent in quality to the volume ray-casting result, the above relation must estimate the particle density function. The number of particles N in a volume cell is calculated as

$$N = \int\_{\text{Cell}} \rho dv \tag{14}$$

The particles are generated cell-by-cell. In each cell, particle locations are calculated stochastically in the local coordinate system, which may be one of a variety of types (e.g., barycentric coordinate). Because this technique generates particles with uniform sampling in each cell, blocky noise occurs in the rendering result. To solve this problem, Metropolis sampling for O-PBVR is presented [1]. Metropolis sampling, which uses a ratio of density at the current position to that at the candidate position, is widely used as an efficient Monte Carlo technique in chemistry and physics.
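
As an illustration of the generation step, the following Python sketch produces particle positions in a single cell with Metropolis sampling. It is only a sketch: the cell is simplified to a unit cube and the density function is a made-up example, whereas the chapter's implementation works on tetrahedral cells with densities derived from the user-specified transfer function (Eqs. 12 and 14):

```python
import random

def metropolis_positions(density, n_particles, n_burn=100):
    """Sample particle positions inside a unit-cube cell with probability
    proportional to density(p): a candidate position is accepted with
    probability min(1, density(candidate) / density(current))."""
    pos = (random.random(), random.random(), random.random())
    samples = []
    for i in range(n_burn + n_particles):
        cand = tuple(min(1.0, max(0.0, c + random.uniform(-0.2, 0.2))) for c in pos)
        if random.random() < min(1.0, density(cand) / density(pos)):
            pos = cand
        if i >= n_burn:
            samples.append(pos)
    return samples

# Hypothetical density: particles concentrate near the centre of the cell.
def rho(p):
    x, y, z = p
    return 0.01 + 200.0 * (1.0 - abs(x - 0.5)) * (1.0 - abs(y - 0.5)) * (1.0 - abs(z - 0.5))

# Expected particle count N = integral of rho over the cell (Eq. 14); the cell
# here is a unit cube, so a plain Monte Carlo average of rho approximates N.
mean_rho = sum(rho((random.random(), random.random(), random.random()))
               for _ in range(10000)) / 10000
n = int(round(mean_rho))
particles = metropolis_positions(rho, n)
print(n, len(particles))
```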

### 3.3.2 Particle projection

Using the aforementioned particle generation method, we can generate particles in a volume space according to the density function ρð Þ x . By projecting these particles onto the image plane, we calculate the brightness values of the corresponding pixels. We also perform particle occlusion with the Z-buffer algorithm during this projection stage. This incorporates the effects of particle collision, which prevents some particles from reaching the image plane.

In the present method, we assume that the particles are completely opaque. Thus, neither alpha blending nor visibility ordering is required. However, when the number of projected particles is small, for instance one per pixel, it becomes difficult to produce a semi-transparent appearance. This problem can be solved by an ensemble average, that is, by accumulating a pixel value for particles generated multiple times and averaging their brightness values within the pixel.
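
The projection and averaging steps can be sketched in Python as follows; the tiny image plane, the orthographic projection, and the random particle generator are simplifications for illustration, and the actual implementation performs these steps on the GPU:

```python
import random

WIDTH, HEIGHT = 4, 4                   # a tiny image plane, for illustration only

def project_particles(particles):
    """Project fully opaque particles with a Z-buffer: the nearest particle in
    each pixel wins, so no visibility sorting or alpha blending is needed."""
    color = [[0.0] * WIDTH for _ in range(HEIGHT)]        # background brightness
    zbuf = [[float("inf")] * WIDTH for _ in range(HEIGHT)]
    for x, y, z, c in particles:                          # orthographic projection
        px = min(int(x * WIDTH), WIDTH - 1)
        py = min(int(y * HEIGHT), HEIGHT - 1)
        if z < zbuf[py][px]:                              # depth test
            zbuf[py][px] = z
            color[py][px] = c
    return color

def ensemble_average(generate, repetitions):
    """Repeat stochastic generation + projection and average the pixel values;
    the averaging is what produces the semi-transparent appearance."""
    acc = [[0.0] * WIDTH for _ in range(HEIGHT)]
    for _ in range(repetitions):
        img = project_particles(generate())
        for j in range(HEIGHT):
            for i in range(WIDTH):
                acc[j][i] += img[j][i] / repetitions
    return acc

# Hypothetical particle generator: ~20 particles at random positions and depths.
def generate():
    return [(random.random(), random.random(), random.random(), 1.0) for _ in range(20)]

print(ensemble_average(generate, repetitions=100)[0])     # first row of the averaged image
```

Note that no opacity appears in the projection itself: in PBVR the opacity is encoded in how many particles are generated, and averaging over the repetitions recovers the semi-transparent appearance.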

### 3.4 I-PBVR

I-PBVR is comprised of three processes: cell projection, stochastic rasterization, and ensemble average. The first process decomposes a cell into several tetrahedral cells and splits each of the tetrahedral cells into a set of triangles on the projection plane. The second process renders the fragments of the triangles with a probability equal to the opacity value at each ray segment along a viewing ray. It projects particles onto the image plane, and the corresponding particle buffer stores the particles' colors and depths. The third process repeats the first and second processes in sequence, N times, and calculates the resulting brightness values by averaging the accumulating color values.

### 3.4.1 Cell projection

Projected tetrahedra (PT) is a technique for rendering a tetrahedral volume dataset using polygonal approximation, which regards a tetrahedral cell as triangles. In this technique, first the tetrahedral cells are sorted in the order of distance from a viewing point. Second, each tetrahedral cell is projected onto an image plane and subdivided into three or four PT triangles. In the original PT technique, the color and opacity are evaluated at the vertices and rasterizing of the PT triangle generates the fragments on the image plane. Finally, the colors are accumulated to calculate the pixel value using the back-to-front algorithm. In I-PBVR, although the particle radius is not explicitly specified, it actually becomes a pixel scale on the image plane.

To improve the accuracy of the pixel value, a pre-integration technique, proposed by Engel [11], is often employed in the rendering stage. The technique calculates the color and opacity in the ray segment in a more precise way than the conventional technique, which just samples a scalar value at the middle point of the ray segment. If the color or the opacity changes drastically in the ray segment, this sampling may miss the important feature. The pre-integration assumes that the scalar is linearly distributed in the ray segment. Under this assumption, the integrand can be transformed from a function of the distance to a function of the scalar. The pre-integration computes the lookup tables mapping three integration parameters (the scalar value at the front triangle face sf, the one at the back face sb, and the length of the segment l in Eq. 18) to the pre-integrated color C and opacity α. By considering many combinations of scalar and distance values, the pre-integration table is stored as a 3D texture in the GPU.

### 3.4.2 Stochastic rasterization

From Eq. 9, we can regard the brightness calculation model as an expected value calculation in which there are n ray segments along a viewing ray, and the k-th particle occurs with probability αk. Thus, the brightness can be regarded as the expected value of the luminosity from the ray segments:

$$B = \sum\_{k=1}^{n} P\_k c\_k \tag{15}$$

where the probability Pk that the k-th luminosity ck becomes the brightness value can be described as follows by using the opacity value αk:

$$P\_k = a\_k \prod\_{j=1}^{k-1} \left(1 - a\_j\right) \tag{16}$$

This represents an event in which there is no particle from the first to the (k-1)-th ray segment, and there is at least one particle in the k-th ray segment. In this case, the brightness B becomes ck since opaque and emissive particles are used. Please note that the ray segments from the (k+1)-th to the last segment do not contribute to the brightness, since the particle in the k-th segment completely occludes them.
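
A small Python check of this expected-value interpretation (the per-segment opacities and emissions are arbitrary example values): choosing the first segment that produces a particle, with probability αk for segment k, reproduces the brightness of Eq. 9 on average.

```python
import random

alphas = [0.3, 0.6, 0.2]     # hypothetical per-segment opacities along one ray
colors = [1.0, 0.5, 0.2]     # hypothetical per-segment emissions c_k

def analytic_brightness():
    """Eq. 9 / Eq. 15: B = sum_k P_k c_k with P_k = alpha_k * prod_{j<k}(1 - alpha_j)."""
    b, transmittance = 0.0, 1.0
    for a, c in zip(alphas, colors):
        b += a * transmittance * c
        transmittance *= 1.0 - a
    return b

def stochastic_brightness():
    """One trial of the stochastic view: walk the segments front to back; the
    first segment that generates a particle (probability alpha_k) determines the
    pixel value, and everything behind it is occluded."""
    for a, c in zip(alphas, colors):
        if random.random() < a:
            return c
    return 0.0    # the ray passed through without hitting a particle

trials = 200000
mc = sum(stochastic_brightness() for _ in range(trials)) / trials
print(analytic_brightness(), mc)   # the two values should nearly agree
```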

### 3.5 Ensemble average

In both O-PBVR and I-PBVR, a stochastic approach is employed to generate particles that are projected onto an image plane. The generation is repeated to compute the average of the pixel values, which can be viewed as an ensemble average. An ensemble is an imaginary collection of notionally identical experiments. In the ensemble average, the total brightness is calculated by averaging the pixel values over all of the repetitions. We confirmed that the fluctuation of the total brightness follows the law of large numbers and evaluated the minimum repetition number, 65,536, that makes the total variance fall within half of the minimum discretized brightness in the worst case [12]. This result suggests that little improvement can be expected in the brightness value when the repetition number exceeds 65,536 in most cases. When we interact with the volume rendering image through transformations such as translation, rotation, or scaling, we give priority to interactivity by reducing the repetition number at the cost of image quality. When we intend to improve the image quality, we keep the view still without any interaction.

### 4. Particle-based fused rendering

In I-PBVR, we generate a particle in an interval of a tetrahedral cell by regarding the opacity as a cumulative distribution function, as shown in Figure 2.

#### Figure 2.

Cumulative distribution function defined as the opacity between the entry point and the particle in the section where the viewing ray is cut into a tetrahedral cell.

The opacity can be represented as a function of a length from the entry point in the interval. Thus, the depth, which is the length from the entry point, can be regarded as a probability variable which follows a probability density function that is a derivative of the cumulative distribution function (CDF).

When we consider the definition of the opacity, we find that it describes the CDF of a probability variable, w, which shows a length from the entry point in the fragment interval. The probability density function, that is, its derivative, describes an exponential distribution, which matches the theorem that the number of particles follows the Poisson distribution since the exponential distribution describes the distance between particles in a Poisson process.

The opacity in Eq. 7 can be used to express the CDF α (w) of the random variable w as follows:

$$a(w) = 1 - \exp\left(-\int\_{t\_{k-1}-wl}^{t\_{k-1}} \tau(\lambda)\, d\lambda\right) \tag{17}$$

Let l be the width of the section where the viewing ray is cut by the cell. This equation represents the opacity calculated between the entry point and the position where the particle is located in the interval. This interval can be expressed as wl using the random variable w (see Figure 2). Furthermore, the fact that the probability density function is represented by an exponential function is consistent with the theorem that "when the number of particles in a certain section follows the Poisson distribution, the particle spacing follows the exponential distribution."

In the pre-integration method, the opacity in the section is described as follows.

$$a(s\_{\mathcal{f}}, s\_b, l) = 1 - \exp\left(-\frac{l}{s\_b - s\_{\mathcal{f}}} \left(T(s\_b) - T(s\_{\mathcal{f}})\right)\right) \tag{18}$$

Here, sf and sb are scalar data values interpolated and computed at the entry and exit points of the section, respectively, and l represents the width of the section as described above. T(s) represents an integral expression for calculating the number of particles generated in the interval.


$$T(s) = \int\_0^{s} \tau(\lambda)\, d\lambda \tag{19}$$

In the proposed method, it is assumed that the scalar data value is linearly interpolated in the interval, and it is expressed as s (w) at the point expressed using the random variable w. This assumption arises from the linear distribution of scalar data in line segments defined in tetrahedrons when interpolation calculations using scalar data and volume coordinates defined at each vertex are performed in the tetrahedrons.

$$s(w) = (1 - w)s\_f + ws\_b \tag{20}$$

Therefore, in the case of sf ≠ sb, the opacity function α(w) expressed by Eq. 17 is expressed as follows.

$$\begin{aligned} a(w) &= a(s\_f, s(w), wl) \\ &= \mathbf{1} - \exp\left(-\frac{l}{s\_b - s\_f}\left(T(s(w)) - T(s\_f)\right)\right) \end{aligned} \tag{21}$$

In the case of sf = sb, the opacity function α(w) expressed by Eq. 17 is expressed as follows:

$$a(w) = \mathbf{1} - \exp\left(-\tau(s\_f) \cdot wl\right) \tag{22}$$

Here, reference is made to the following derivation process.

$$\begin{aligned} a(s\_f, s\_b, l) &= \lim\_{s\_f \to s\_b} a(s\_f, s\_b, l) \\ &= \lim\_{s\_f \to s\_b} \left( \mathbf{1} - \exp \left( -\frac{l}{s\_b - s\_f} \left( T(s\_b) - T(s\_f) \right) \right) \right) \\ &= \mathbf{1} - \exp \left( -T'(s\_f) \cdot l \right) \\ &= \mathbf{1} - \exp \left( -\tau(s\_f) \cdot l \right) \end{aligned} \tag{23}$$

#### 4.1 Calculation of depth value by inverse function method

In this method, particles are placed at a position wl away from the start point of the section. At this time, assuming that the random variable w follows the probability density function whose cumulative distribution function is the opacity α(w), this variable w can be generated using the inverse function method. In the inverse function method, when the random number R lies in the interval [0, α(sf, sb, l)], the variable w is calculated using Eq. 24 or 25. Eqs. 24 and 25 are derived from Eqs. 21 and 22 when sf ≠ sb and sf = sb, respectively.

$$\begin{aligned} w &=& a^{-1}(\mathsf{R}) \\ &=& \frac{\mathbf{1}}{s\_b - s\_f} \left( T^{-1} \left( -\frac{s\_b - s\_f}{l} \log \left( \mathbf{1} - \mathsf{R} \right) + T \left( s\_f \right) \right) - s\_f \right) \end{aligned} \tag{24}$$

$$\begin{aligned} w &= a^{-1}(R) \\ &= -\frac{\log\left(1 - R\right)}{\tau\left(s\_f\right) \cdot l} \end{aligned} \tag{25}$$

Using Eqs. 24 and 25, we can calculate the depth value of the particle.

$$D(w) = (1 - w) \cdot D\_f + w \cdot D\_b \tag{26}$$

Here, Df and Db represent the depth values at the start and end points of the section. Similarly, scalar data values at particle positions can be interpolated and color values can be calculated using the transfer function.

This method was implemented using OpenGL and a GPU shader written in GLSL. For the implementation of pre-integration, we used two-dimensional pre-integration to perform error correction [13] with perspective transformation, so we implemented the function T described in Eq. 19 as a two-dimensional texture in the GPU [14]. To determine the random variable w described in Eq. 20, an inverse function of T is required; to realize efficient computation, the function values were calculated in advance and also stored in the GPU as a two-dimensional texture.
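
A minimal CPU-side Python sketch of this table-based evaluation is shown below; the extinction coefficient τ(s), the table resolution, and the interval values are hypothetical, and the chapter's implementation stores the corresponding tables as textures on the GPU:

```python
import bisect
import math
import random

def tau(s: float) -> float:
    """Hypothetical extinction coefficient tau(s) derived from a transfer function."""
    return 0.5 + 2.0 * s * s

# Tabulate T(s) = integral_0^s tau(lambda) dlambda (Eq. 19) on a regular grid.
N_TABLE = 1024
S = [i / (N_TABLE - 1) for i in range(N_TABLE)]
T = [0.0]
for i in range(1, N_TABLE):
    T.append(T[-1] + 0.5 * (tau(S[i - 1]) + tau(S[i])) * (S[i] - S[i - 1]))

def T_of(s: float) -> float:
    """Look up T(s) with linear interpolation."""
    x = s * (N_TABLE - 1)
    i = min(int(x), N_TABLE - 2)
    return (1.0 - (x - i)) * T[i] + (x - i) * T[i + 1]

def T_inverse(v: float) -> float:
    """Approximate inverse lookup T^{-1}(v) on the same table."""
    return S[min(bisect.bisect_left(T, v), N_TABLE - 1)]

def interval_opacity(sf: float, sb: float, l: float) -> float:
    """Opacity of the whole interval (Eqs. 18 and 22)."""
    if sf == sb:
        return 1.0 - math.exp(-tau(sf) * l)
    return 1.0 - math.exp(-l / (sb - sf) * (T_of(sb) - T_of(sf)))

def sample_w(sf: float, sb: float, l: float) -> float:
    """Depth parameter w of one particle in the interval (Eqs. 24 and 25)."""
    R = random.uniform(0.0, interval_opacity(sf, sb, l))   # R in [0, alpha(sf, sb, l)]
    if sf == sb:                                           # Eq. 25
        return -math.log(1.0 - R) / (tau(sf) * l)
    return (T_inverse(-(sb - sf) / l * math.log(1.0 - R) + T_of(sf)) - sf) / (sb - sf)

def depth(w: float, Df: float, Db: float) -> float:
    """Interpolate the depth value of the particle (Eq. 26)."""
    return (1.0 - w) * Df + w * Db

w = sample_w(sf=0.2, sb=0.8, l=1.5)
print(w, depth(w, Df=10.0, Db=11.5))
```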

### 5. Result and discussion

Experiments were conducted to evaluate the effectiveness of this method. The experiment used CPU: Intel Core i7 3.3GHz, MEM: 16GB, and GPU: Intel Iris Graphics 550.

To confirm the appropriateness of the depth value in this method, experiments were conducted on two irregular volume data consisting of a single tetrahedral cell. Figure 3 visualizes each irregular volume data using a red/blue monochrome color map that increases the color value according to the data value.

As a comparative experiment, as in the conventional method, the depth value in the tetrahedral cell is fixed at a relative position in the section, namely the entry point, the middle point, or the exit point. It turns out that the proposed method realizes satisfactory visualization at the intersection of the two tetrahedral lattices. In the results of the conventional method, an unnatural change in color value can be visually confirmed.

Figure 3.

Application example of proposed method for intersecting tetrahedral cell (red and blue color maps were used for each tetrahedral cell).


Experiments were conducted using two irregular volume data of appropriate size. These are obtained as a result of computational fluid dynamics calculation and are called "Tank." This calculation relates to a physical phenomenon when a pipelike valve installed in a gas tank filled with a high-pressure state is instantly opened. Therefore, the important variables are pressure data and velocity absolute value data (both are scalar data). Pressure data and velocity absolute data were calculated using 9827 and 516 tetrahedral cells, respectively. Figure 5 shows the volume rendering display of the fused mixed irregular volume data with different cells. In this experiment, red color is assigned to pressure data, and blue color is assigned to velocity absolute value data (Figure 4).

In Figure 5, we compare the proposed method (a); the previous methods (b), (c), and (d), in which the relative position in the section is fixed at the entry point, the middle point, and the exit point, respectively, for the depth values in the tetrahedral cells; and a method in which a random position is located between the entry and exit points (e). The five figures present the overall visualization result (with and without grid lines) and the local visualization result (with and without grid lines).

Figure 4.

Example of application to "Tank" data. (a) Velocity data defined by 516 cells, (b) pressure data defined by 9827 cells, (c) (a)+(b) fused visualization.

Figure 5.

Comparison of application example to "Tank" data by the proposed method and conventional methods.

In addition, the time required for each visualization result is given. Obviously, with the conventional methods, artifacts due to improper setting of the depth value are visible in the visualization results. In particular, this trend is noticeable in the results (b–d), where the depth value is set at a fixed point of the section. Even with the random position (e), the artifacts are more noticeable than with the proposed method (a). Additionally, there is almost no difference in the calculation time required for the visualization.

### 6. Conclusion

In this chapter, we proposed a volume rendering method for multiple irregular volume data. In this method, the tetrahedral grid constituting the volume data is projected onto the image plane, and the opacity is used to control the presence or absence of drawing when the projected cells are expanded into pixels. To efficiently perform volume rendering of multiple irregular volume data, we developed a method for stochastically arranging particles in the section where the viewing ray is cut by the tetrahedral cells. In this arrangement method, the particle position is calculated by the inverse function method, regarding the particle position as a random variable and the cumulative distribution function as the opacity.

In the experiments for confirming the effectiveness of this method, we prepared two types of irregular volume data with different cells and confirmed the effectiveness of the proposed method in terms of the presence/absence of artifacts and calculation time at the intersection of the cells.

Although this chapter has concerned the proposal of the visualization method itself, we would like to use this method to elucidate the causal relationship between variables in important physical phenomena. For example, in order to clarify the influence of coherent vortices on heat transport in thermal fluid phenomena, it is necessary to combine scalar data representing a second invariant representing a vortex region with scalar data representing the heat flux absolute value related to heat transport. By using this method for the visualization of two kinds of time-series irregular volume data, we would like to figure out a visual correlation between the multiple variables.

### Author details

Koji Koyamada<sup>1</sup>\* and Naohisa Sakamoto<sup>2</sup>

1 Kyoto University, Kyoto, Japan

2 Kobe University, Kobe, Japan

\*Address all correspondence to: koyamada.koji.3w@kyoto-u.ac.jp

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## References

absence), the time required for the visualization result is described. Obviously, with the conventional method, it can be understood that artifacts due to improper setting of the depth value are visible in the visualization results. In particular, the trend is noticeable in the visualization results (b–d) where the depth value is set at the fixed point of the section. Even with the random position, it is more noticeable than that of the proposed method (e). Additionally, it can be seen that there is almost no

[1] Sakamoto N, Nonaka J, Koyamada K, Tanaka S. Particle-based volume rendering. In: Proceedings of Asia-Pacific Symposium on Visualization (APVIS 2007); 2007. pp. 129-132

[2] Sakamoto N, Kawamura T, Koyamada K. Improvement of particle-based volume rendering for visualizing irregular volume data sets. Computers & Graphics. 2010;34(1):34-42

[3] Sakamoto N, Kawamura T, Kuwano H, Koyamada K. Sorting-free pre-integrated projected tetrahedra. In: Proceedings of the 2009 Workshop on Ultrascale Visualization; 2009. pp. 11-18

[4] Blinn J. Light reflection function for simulation of clouds and dusty surfaces. Computers and Graphics. 1982;16(3): 21-29

[5] Sabella P. A rendering algorithm for visualizing 3D scalar field. Computers and Graphics. 1988;22(4):51-58

[6] Drebin RA, Carpenter L, Hanrahan P. Volume rendering. Computers and Graphics. 1988;22(4):65-74

[7] Levoy M. Display of surfaces from volume data. IEEE Computer Graphics and Applications. 1988;8(3):29-37

[8] Hansen C, Johnson C. The Visualization Handbook. Elsevier; 2005

[9] Silva C, Comba J, Callahan S, Bernardon F. A survey of GPU-based volume rendering on unstructured grids. Brazilian Journal of Theoretic and Applied Computing. 2005;12(2):9-29

[10] Lu Cai, Yuan Mingkai, Wang Qi, Kang Kun. Application of multiattributes fused volume rendering techniques in 3D seismic interpretation. 2014:1609–1613

[11] Engel K, Kraus M, Ertl T. High-quality pre-integrated volume rendering using hardware-accelerated pixel shading. In: Proc. of Eurographics/SIGGRAPH Workshop on Graphics Hardware; 2001. pp. 9-16

[12] Sakamoto N, Koyamada K. Stochastic approach for integrated rendering of volumes and semi-transparent surfaces. In: Proceedings of the 2012 Workshop on Ultrascale Visualization; 2012

[13] Meredith J, Ma KL. Multiresolution view-dependent splat-based volume rendering of large irregular data. In: Proc. IEEE 2001 Symp. on Parallel and Large-Data Visualization and Graphics; 2001. pp. 93-155

[14] Westover L. Footprint evaluation for volume rendering. Computers and Graphics. 1990;24(4):367-376


### Chapter 6

## Design and Implementation of Particle Systems for Meshfree Methods with High Performance

Giuseppe Bilotta, Vito Zago and Alexis Hérault

### Abstract

Particle systems, commonly associated with computer graphics, animation, and video games, are an essential component in the implementation of numerical methods ranging from the meshfree methods for computational fluid dynamics and related applications (e.g., smoothed particle hydrodynamics, SPH) to minimization methods for arbitrary problems (e.g., particle swarm optimization, PSO). These methods are frequently embarrassingly parallel in nature, making them a natural fit for implementation on massively parallel computational hardware such as modern graphics processing units (GPUs). However, naive implementations fail to fully exploit the capabilities of this hardware. We present practical solutions to the challenges faced in the efficient parallel implementation of these particle systems, with a focus on performance, robustness, and flexibility. The techniques are illustrated through GPUSPH, the first implementation of SPH to run completely on GPU, and currently supporting multi-GPU clusters, uniform precision independent of domain size, and multiple SPH formulations.

Keywords: software design, particle systems, SPH, GPGPU, high-performance computing, numerical stability, best practices

### 1. Introduction

Particle systems were first formally introduced in computer science by Reeves [1], as the technique used by Lucasfilm Ltd. for the realization of some of the special effects present in the film Star Trek II: The Wrath of Khan [2]. Since then, particle systems have been used in computer graphics for the simulation of visually realistic fire, moving water, clouds, dust, lava, and snow. A particle system consists of a collection of distinct elements (particles) that are generated according to specific rules, evolve and move in the simulation space, and die out at the end of their life cycle. The position and characteristics of the particles in the system over simulated time are then used to render larger bodies (flames, rivers, etc.) with appropriate techniques [3].

While originally developed purely for visual effects, and associated with evolution laws focused on the final appearance rather than the physical correctness of the behavior, particle systems also form the digital infrastructure for the implementation of a class of numerical methods (known as meshless, meshfree, or particle methods) that have started emerging since the late 1970s as alternatives to the

traditional grid-based numerical methods (finite differences, finite volumes, finite elements). These methods—smoothed particle hydrodynamics (SPH) [4], reproducing kernel particle method (RKPM) [5], finite pointset method (FPM) [6], discrete element method (DEM) [7], etc.—provide rigorous methods to discretize the physical laws governing the continuum and thus provide physics-based evolution law for the properties of the particles that act both as interpolation nodes in the mathematical sense and as virtual volumes of infinitesimal size carrying the properties of the macroscopic mass they represent.


More recently, the same computer techniques have been used to solve more abstract problems. For example, particle swarm optimization (PSO) [8] is a methodology to find approximate minima for functions whose derivatives cannot be computed (at all or in reasonable times), in spaces of arbitrary dimensions. Outside of the particle systems in the proper sense, simulation methods such as molecular dynamics (MD) [9] and many large-scale agent-based models present significant similarities with particle systems [10, 11] and share much of the infrastructural work with it.

Particle systems are computationally intensive. Realistic visual effects, accurate physical simulations, fast minimization, and large-scale agent-based models all require thousands if not millions (or more) of particles. On the upside, the behavior of most particle systems can be described in an embarrassingly parallel way, where each particle evolves either independently from the rest of the system or with at most local knowledge of the state of the system. This property makes particle systems a perfect fit for implementation on massively parallel computational hardware following the stream processing programming model, and in particular modern graphics processing units (GPUs), that have grown in the last decade from fast, programmable 3D rendering hardware to more general-purpose computing accelerators [12].

While the parallel computational power of GPUs is a natural fit for the parallel nature of particle systems, naive implementations will miss many opportunities to fully exploit GPUs, even when achieving performance orders of magnitude higher than an unoptimized, serial CPU implementation. Our objective is to discuss the optimal implementation of particle systems on GPU, so that anyone setting forth to implement a particle system can draw from our experience to avoid common pitfalls and be aware of the implications of many design choices. Optimality will be viewed in terms of performance (achieving the highest number of iterations per second in the evolution of the system), robustness (numerical stability), and flexibility (allowing the implementation of a wide range of variants for the particle system, e.g., to allow the simulation of different phenomena).

We will show that while these objectives are sometimes in conflict—so that the developer will have to choose to, e.g., sacrifice performance for better numerical stability—there are also cases where they complement each other, e.g., with some numerically more robust approaches also being more computationally efficient or with certain design choices for the host code structure being also more favorable to future extensions to multi-GPU support.

We will make extensive reference to our experience from the implementation of GPUSPH [13–17], the first implementation of SPH to run completely on GPU using CUDA and currently supporting multi-GPU and multi-node distribution of the computation. However, all the themes that we discuss and solutions we present are of interest to all particle systems and related methods, regardless of the specific theoretical background underlying them. To show this, simpler examples to illustrate the benefits of individual topics discussed will also be presented through a reduced implementation of PSO. Some of the most advanced techniques described can be seen in action in GPUSPH itself, which is freely available under the GNU General Public License, version 3 or higher [18].

While our focus will be on GPU implementation, many of the approaches we discuss can bring significant benefits even on CPU implementations, allowing better exploitation of the vector hardware and multiple cores of current hardware. The intention is thus to provide material that is of practical use regardless of the specific application and hardware.

### 2. Terminology and notation


Throughout the paper, we will rely on the terminology used by the cross-platform OpenCL standard [19]. All the concepts we discuss will be equally valid in different programming contexts, such as the proprietary CUDA developed by NVIDIA specifically for their GPU and HPC solutions [20]. This choice stems from the authors' opinion that the OpenCL terminology is more neutral and less susceptible to the kind of confusion that some vendors have leveraged as a marketing tactic in promoting their solutions.

In our examples, we will also frequently refer to "small vector" data types. These are types in the form typeN where type is a primitive type (such as char, int, float, double) and N is one of 1, 2, 3, 4, 8, and 16. For example, a float4 would be a structure that in C or C++ could be defined as struct float4 {float x, y, z, w;}. Following OpenCL, the components of the small vector types will be named x, y, z, w for types with up to 4 components, and s0, … s9, sa, … sf for types with up to 16 components. In some examples we also make use of the OpenCL "swizzle notation," such that, for example, given float2 v=(0.0f, 1.0f);, then v.xxyy is a float4 with components (0.0f, 0.0f, 1.0f, 1.0f).
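As a concrete illustration of this notation, the following OpenCL C fragment builds the vectors mentioned above (the kernel name and the output buffer are purely illustrative):

```c
/* Illustrative OpenCL C kernel exercising small vector types and swizzles. */
kernel void swizzle_demo(global float4 *out)
{
    float2 v = (float2)(0.0f, 1.0f);
    float4 a = v.xxyy;                     /* (0.0f, 0.0f, 1.0f, 1.0f) */
    float4 b = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    out[get_global_id(0)] = a + b.wzyx;    /* component-wise sum with b reversed */
}
```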

We will assume that each small vector type is "naturally aligned," when N is a power of two: a typeN will begin at a memory address which is a multiple of N\*sizeof(type); for N=3, we will assume that type3 begins at a memory address which is a multiple of sizeof(type). This is in contrast to OpenCL, whose cl\_type3 types are assumed to be aligned like the corresponding cl\_type4 types, and special vload3 instructions are needed to load packed 3-vectors. We will also show momentarily that such 3-component types should in general be avoided as they lead to lower performance, since most if not all modern hardware are designed around power-of-two types (which is the reason why the OpenCL type is aligned to four components).

Finally, we will assume that the standard operations (component-by-component addition, subtraction, and multiplication, multiplication by a scalar, dot product) on the small vector types have been defined, in the usual manner. (OpenCL C defines these as part of the language, for CUDA appropriate overloads for the common operators must be defined by the user.)
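For reference, a minimal sketch of such user-defined overloads for CUDA's built-in float3 type might look as follows; only the operators used in the later listings are shown, and more complete sets are available in helper headers shipped with the CUDA samples:

```cpp
// Minimal operator overloads for CUDA's float3 (sketch; CUDA defines the
// type and make_float3, but not the arithmetic operators).
#include <cuda_runtime.h>

__host__ __device__ inline float3 operator+(float3 a, float3 b)
{ return make_float3(a.x + b.x, a.y + b.y, a.z + b.z); }

__host__ __device__ inline float3 operator*(float3 a, float s)
{ return make_float3(a.x * s, a.y * s, a.z * s); }

__host__ __device__ inline float3& operator+=(float3 &a, float3 b)
{ a.x += b.x; a.y += b.y; a.z += b.z; return a; }

__host__ __device__ inline float dot(float3 a, float3 b)
{ return a.x * b.x + a.y * b.y + a.z * b.z; }
```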

### 3. The GPU programming model

### 3.1 Stream processing

Modern GPUs are designed around the stream processing paradigm, a simplified model for shared-memory parallel programming that sacrifices inter-unit communication in favor of higher efficiency and scalability.

At an abstract level, stream processing is defined by a sequence of instructions (a computational kernel) to be executed on each element of an input data stream to produce an output data stream, under the assumption that each input element can be processed independently from the others (and thus in parallel). A computational kernel is similar to a standard function in classic imperative programming languages; at runtime, as many instances of the function will be executed as necessary to cover the whole input data stream. Such instances (work-items) may be dispatched in concurrent batches, running in parallel as far as the hardware allows, and the programmer is generally given very little control, if any, on the dispatch itself, other than being able to specify how many instances are needed in total. This choice allows the same kernel, executed on the same data stream, to adapt naturally to the characteristics of the underlying hardware, and is one of the main characteristics of stream processing.


For example, if the hardware can run 1000 concurrent work-items, but the input stream consists of 2,000,000 total elements, the hardware may batch 1000 work-items for execution at once and then dispatch another 1000 when the first batch completes execution. This continues until the entire input stream has been processed, executing 2000 total batches. For the same workload, more powerful hardware able to run 100,000 concurrent work-items may be able to complete sooner by issuing 20 total batches, in a manner completely transparent to the programmer.

This programming model fits very well the simpler workload needed in many steps of the rendering process for which GPUs are designed: in such a case, the input and output streams may consist of the data and attributes for the vertices in the geometries describing the scene, for example, or for the fragments produced by the rasterization of such geometries. However, the more sophisticated requirements of general-purpose programming have led to the extension of the stream processing paradigm to provide programmers with finer control on the work-item dispatch as well as the possibility for efficient data sharing between work-items under appropriate conditions.

A modern stream processing device (typically a GPU, but may also be a multicore CPU with vector units, a dedicated accelerator like Intel's Xeon Phi, or a special-design FPGA) is composed of one or more compute units (each being a CPU core, a GPU multiprocessor, etc.) equipped with one or more processing elements (a SIMD lane on CPU, a single stream processor on GPU, etc.), which are the hardware components that process the individual work-items during a kernel execution. The programming model of these devices, as presented, e.g., by standards such as OpenCL [19] and by proprietary solutions such as NVIDIA CUDA [20], exposes the underlying hardware structure by allowing the programmer to specify the granularity at which work-items should be dispatched: each workgroup is a collection of work-items that are guaranteed to run on a single compute unit; work-items within the same workgroup can share data efficiently through dedicated (often on-chip) memory and can synchronize with each other, ensuring correct instruction ordering. Tuning workgroup size and the way work-items in the same workgroup access data can have a significant impact on performance.
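A small OpenCL C example of this intra-workgroup cooperation is sketched below; the kernel name, the fixed workgroup size of 128, and the averaging operation are illustrative assumptions:

```c
/* Each work-item stages one value in on-chip local memory; after the
 * barrier, work-items of the same workgroup can read each other's values. */
kernel void neighbor_average(global const float *in, global float *out, uint n)
{
    local float tile[128];                 /* shared within the workgroup */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);          /* synchronize the workgroup */

    size_t prev = (lid > 0) ? lid - 1 : lid;
    if (gid < n)
        out[gid] = 0.5f * (tile[lid] + tile[prev]);
}
```

For the tile indexing to be valid, this kernel must be enqueued with a local work size of 128.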

The GPU multiprocessors are further characterized by an additional level of work-item grouping at the hardware level, as the work-items running on a single multiprocessor are not independent from each other: instead, a single instruction pointer is shared by a fixed-width group of work-items, known as the warp on NVIDIA GPUs, or wavefront on AMD GPUs, corresponding in a very general sense to the vector width of SIMD instructions on modern CPUs. We will use the hardware-independent term subgroup (as introduced, e.g., in OpenCL 2.0) to denote this hardware grouping. The subgroup structure of kernel execution influences performance in a number of ways. The most obvious way is that the size of a workgroup should always be a multiple of the subgroup size: a partial subgroup would be fully dispatched anyway, but masked, leading to lower hardware usage. Additional aspects where the subgroup partitioning can influence performance are branch divergence and coalesced memory access.
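In practice this means rounding the global work size up on the host, for instance along these lines (plain C, illustrative names):

```c
/* Round the number of work-items up to a multiple of the workgroup size,
 * which is itself chosen as a multiple of the subgroup (warp/wavefront) size. */
size_t round_up_global_size(size_t num_items, size_t workgroup_size)
{
    return ((num_items + workgroup_size - 1) / workgroup_size) * workgroup_size;
}

/* Example: 100,000 particles with 128-wide workgroups are dispatched as
 * 100,096 work-items; the excess work-items exit early via an
 * "if (i >= N) return;" guard, as in the listings of Section 4. */
```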


Branch divergence occurs when work-items belonging to the same subgroup need to take different execution paths at a given conditional. Since the subgroup proceeds in lockstep for all intents and purposes, in such a situation the hardware must mask the work-items not satisfying either branch, execute one side of the branch, invert the mask, and execute the other side of the branch: the total runtime cost is then the sum of the runtimes of each branch. If the work-items taking different execution paths belong to separate subgroups, this cost is not incurred, because separate subgroups can execute concurrently on different code paths, leading to an overall runtime cost equal to that of the longer branch.
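The effect can be seen in a sketch such as the following (OpenCL C, illustrative kernels; 64 is assumed as the subgroup width):

```c
/* Divergent: even and odd work-items alternate inside every subgroup, so
 * each subgroup executes both branches one after the other. */
kernel void divergent(global float *data)
{
    size_t i = get_global_id(0);
    if (i % 2 == 0)
        data[i] = data[i] * data[i];
    else
        data[i] = sqrt(data[i]);
}

/* Uniform per subgroup: blocks of 64 consecutive work-items take the same
 * path, so each subgroup pays for only one of the two branches. */
kernel void uniform_blocks(global float *data)
{
    size_t i = get_global_id(0);
    if ((i / 64) % 2 == 0)
        data[i] = data[i] * data[i];
    else
        data[i] = sqrt(data[i]);
}
```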

Coalescence in memory access is achieved when the controller of a GPU can provide data for the entire subgroup with a single memory transaction. Ensuring that this happens is one of the primary aspects of efficient GPU implementations and will be the basis for many of the performance hints discussed later on.

### 3.2 Stream processing and particle systems


Stream processing is a natural fit for the implementation of particle systems, since the vast majority of algorithms that rely on particle systems are embarrassingly parallel in nature, with the behavior of each particle determined independently, thus providing a natural map between particles and work-items for most kernels. This allows naive implementations of particle systems to be developed very quickly, often with massive performance gains over trivial serial implementations running on single-core CPUs.

Such implementations will however generally fail at leveraging the full computational power of GPUs, except in the simplest of cases. Any moderately sophisticated algorithm will frequently require a violation of the natural mapping of particles to stream elements (and thus work-items), either in terms of data structure and access or in terms of implementation logic, to be able to achieve the optimal performance on any given hardware.

### 3.3 Limitations in the use of GPUs

Programmable GPUs have brought forth a revolution in computing, making (certain forms of) large-scale parallel computing accessible to the masses. Many applications have seen significant benefit from a transition to the GPU as supporting hardware, and in response vendors have improved GPU architectures, making it easier to achieve better performance with less implementation effort.

When choosing the GPU as preferential target platform, however, developers must take into consideration the fact that not all users may have high-end professional GPUs, and while the stream computing paradigm is largely sufficient in compensating for the difference in computational power, there are at least two significant aspects that must be explicitly handled.

Memory amount is one of these issues: consumer GPUs typically only have a fraction of the total amount of RAM offered in professional or compute-dedicated devices: while the latter may feature up to 16GB of RAM, low-end devices may have 1/4th or even 1/8th of that. Moreover, even the amount of memory available on high-end devices may be insufficient to handle larger problems. Software should therefore be designed to allow distribution of computation across multiple devices.

The second issue is that, being designed for computer graphics, GPUs typically focus on single-precision (32-bit) floating-point operations, and double precision (64-bit) may be either not supported at all or supported at a much lower execution rate (as low as 1:32) than single precision, which may remove the computational advantage of using GPUs in the first place (this can be true even on high-end GPUs, as was infamously the case for the Maxwell-class Tesla GPUs from NVIDIA). Designing the software around the use of single precision can therefore allow supporting higher performance across a wider class of devices, but it may require particular care in the handling of essential state variables in particle systems. This will be discussed in Section 5.



### 4. Performance

While GPUs provide impressive computational power compared to CPUs, this is offset by a much higher sensitivity to data layout and access patterns: even a very computationally intensive kernel may end up memory bound if the appropriate care is not given to these aspects.

The main GPU memory (global memory) is characterized by having high bandwidth, but also very high latency: access to global memory may consume hundreds of cycles, and work-items waiting for data will not proceed until the data is available to all of them, at least at the subgroup granularity.

Under appropriate conditions (called memory coalescing or fast-path), the GPU can provide data for a whole subgroup with a single memory transaction. Optimal access patterns in this regard are achieved when the work-items in a subgroup request data which is consecutive in memory, properly aligned (i.e., with the lowest-index element starting at an address which is a multiple of the data size times the subgroup size), and with specific size constraints—typically power-of-two sizes, up to 128 bits per work-item: essentially, the equivalent of a float, float2, or float4, but not float3.

When fast-path requirements are not satisfied, the impact on kernel run times can be dramatic, especially on older architectures: designing data structures and algorithms around these requirements is therefore one of the main topics we will address. But even when coalesced access is achieved, each subgroup will have to wait for at least one memory transaction before proceeding to the instruction that makes use of the data. To hide this latency, GPU multiprocessors are designed to keep multiple workgroups alive concurrently and will automatically switch to another active workgroup (or subgroup within the same workgroup), while one is stalled waiting for data; to make efficient use of this capability, it is necessary to overcommit the device, i.e., issue kernels with more workgroups than would theoretically be able to run concurrently on the given hardware.

For example, on a GPU with 16 multiprocessors, each equipped with 128 streaming processors, it will not be sufficient to issue kernels with 2048 work-items to fully exploit the hardware: to fully hide operation latency, the developer should aim for global work sizes which are at least an order of magnitude more than the bare minimum.

A GPU that is fully under load is said to be saturated. On most modern architectures, tens of thousands of work-items are generally necessary to saturate mid- and high-end devices. This condition is usually satisfied for any moderate or large particle system, in which case the data layout and access patterns become the bottleneck for memory bandwidth utilization.

### 4.1 Array of structures versus structure of array

The first step in improving GPU bandwidth usage is to avoid high-level structured data layouts and store information about the particle system in a "transposed" format.


Let us consider, for example, a simple particle system in three dimensions, where each particle is described by its position (3 floats) and velocity (3 floats). In a CPU implementation, data would be stored in a format based on a Particle structure, and the particle system would be an array of Particles. Integrating the particles' position over a time-step dt would be achieved in a simple loop like the one illustrated in Listing 1.

This approach is called array of structures (AoS), and assuming a stream processing perspective, preserving the same layout would lead to a compute kernel in the form presented on the left in Listing 2. However, since each particle is more than 128 bits wide, the GPU would not be able to satisfy each subgroup access to the particle\_system (marked by the comments) in a single transaction, resulting in a potential slowdown of an order of magnitude or more. A better solution on GPU would be to split the particle structure into its primary components and thus have, in this case, an array of positions and an array of velocities, as shown on the right in Listing 2.

Part of the advantage of this approach (structure of arrays, SoA) is the natural higher access granularity, which limits read and write access to what is strictly necessary. With the AoS approach, it is also possible to limit writes to the specific parts, e.g., particle\_system[i].pos+= particle\_system[i].vel\*dt, but we will see that this only partially recovers the performance gap against SoA. Moreover, the access granularity of SoA also reflects in the function signatures, improving developer discipline. The downside is the growing number of buffers, and strategies to manage this will be discussed in Section 6.2.1.

#### Listing 1.
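Simple host code to integrate the position of a particle system.

A minimal version of the loop described in the text, assuming the float3 small-vector type and its operators from Section 2 (the variable names are the ones reused in Listing 2):

```cpp
// Array-of-structures layout: one Particle record per element.
struct Particle {
    float3 pos;
    float3 vel;
};

// Advance every particle position by one time-step dt.
void integrate_pos(Particle *particle_system, size_t N, float dt)
{
    for (size_t i = 0; i < N; ++i) {
        Particle &p = particle_system[i];
        p.pos += p.vel * dt;
    }
}
```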




#### Listing 2.

Particle system with stream processing: array of structure (left) versus structure of array (right).
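In OpenCL C, the two variants can be sketched as follows; the AoS kernel (left in the original listing) reads and rewrites a whole Particle per work-item, while the SoA kernel (right) touches only the position and velocity buffers. The address-space qualifiers, the uint element count, and the distinct kernel names (added so both versions can coexist in one source) are assumptions made here so the sketches compile as stand-alone OpenCL C:

```c
/* Array of structures: each work-item loads and stores a full Particle. */
typedef struct { float3 pos; float3 vel; } Particle;

kernel void integrate_pos_aos(global Particle *particle_system,
                              uint N, float dt)
{
    size_t i = get_global_id(0);
    if (i >= N) return;
    /* read the old particle state */
    Particle p = particle_system[i];
    p.pos += p.vel * dt;
    /* write the new particle state */
    particle_system[i] = p;
}
```

```c
/* Structure of arrays: positions and velocities live in separate buffers,
 * so each subgroup access falls within the coalescing size limits. */
kernel void integrate_pos_soa(global float3 *posArray,
                              global const float3 *velArray,
                              uint N, float dt)
{
    size_t i = get_global_id(0);
    if (i >= N) return;
    float3 pos = posArray[i];
    const float3 vel = velArray[i];
    pos += vel * dt;
    posArray[i] = pos;
}
```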


Further optimizations, particularly important on older architectures, can be achieved by sacrificing some memory to provide the position and velocity with a fourth (unused) component, as illustrated in Listing 3 (left), resulting in better bandwidth usage and thus faster execution; moreover, additional frequently used data may be stored in the fourth component, if needed (e.g., in GPUSPH we store the mass in pos.w and the density in vel.w).


The benefits of the discussed strategies are exemplified in Table 1 for a simple three-dimensional implementation of PSO. The specific values will obviously depend on the specific compute kernel as well as on the specific hardware.

Some additional (usually minor) benefits can be achieved by explicitly telling the compiler that the position and velocity arrays never overlap; this is achieved using the restrict specification for the pointer (Listing 3, right) which, for more complex kernels, may allow the compiler to produce faster code by assuming no dependencies between writes on one array and reads on the other. On some hardware, const \* restrict arrays are also made accessible through a dedicated cache, further improving performance.

### 4.2 Wide arrays

The SoA approach can provide a significant boost in performance on GPU, as long as the individual parts of the structure (position, velocity, etc.) fit within the size requirements for coalesced memory access. When even individual structure members are wider than the optimal 128-bit width, however, alternative approaches are necessary.

#### Listing 3.

Using efficient data types on GPU (left) and leveraging the power of restricted pointers (right).
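A sketch of the two variants the caption refers to (OpenCL C; the kernel names, the uint count, and the use of pos.w for the mass are illustrative, following the GPUSPH convention mentioned in the text):

```c
/* Padded data types: float4 instead of float3. The extra component is not
 * wasted; here pos.w carries the particle mass, as done in GPUSPH. */
kernel void integrate_pos_padded(global float4 *posArray,
                                 global const float4 *velArray,
                                 uint N, float dt)
{
    size_t i = get_global_id(0);
    if (i >= N) return;
    float4 pos = posArray[i];
    pos.xyz = pos.xyz + velArray[i].xyz * dt;   /* pos.w (mass) is preserved */
    posArray[i] = pos;
}
```

```c
/* Restricted pointers: the compiler may assume the two buffers never alias,
 * and const * restrict data may be served through a dedicated cache. */
kernel void integrate_pos_restrict(global float4 * restrict posArray,
                                   global const float4 * restrict velArray,
                                   uint N, float dt)
{
    size_t i = get_global_id(0);
    if (i >= N) return;
    float4 pos = posArray[i];
    pos.xyz = pos.xyz + velArray[i].xyz * dt;
    posArray[i] = pos;
}
```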



#### Table 1.

Runtime comparison for a simple three-dimensional particle swarm optimization implementation, using the discussed paradigms: array of structures, array of structures with selective writing, structure of arrays, structure of arrays with padded members (i.e., using four instead of three components).

### Design and Implementation of Particle Systems for Meshfree Methods with High Performance DOI: http://dx.doi.org/10.5772/intechopen.81755

neighbors; frequently, the number of neighbors for a particle will range in the tens or hundreds, sometimes even more, requiring storage of as many integers per particle. Another example is given by particle systems with high dimensionality (higher than 4), which could arise, for example, for a particle swarm optimization approach to the minimization of the cost function of a deep neural network. The position (and velocities) of particles in such a system might require hundreds, thousands, or even more, components.

The optimal storage solution for the array holding the data in such cases is transposed compared to the natural order: whereas for most CPU code it is natural to first store the data belonging to the first particle, then the data belonging to the second particle, etc., the optimal GPU storage for these wide arrays is to first store the first component for each particle, followed by the second component for each particle, etc. Using the standard C array notation, the i-th component of the p-th particle in the classic format would be found at location data[p\*num\_components+i], whereas the optimal GPU location would use the addressing data [i\*num\_particles+p]. Similarly, neighbors would be stored interleaved: the first neighbor of each particle, followed by the second neighbor for each particle, etc. This ensures that when particles iterate over their neighbors, they fetch the neighbor index in coalescence. The concept is illustrated in Figure 1 (top and middle graphs).

The data transposition can rely on different chunk sizes; the single components approach discussed so far has the benefit of being simpler and the natural choice when each component needs to be treated independently (e.g., neighbors list traversal); if possible, however, wider chunks (e.g., using arrays of float2 or float4 elements instead of float) should be used (Figure 1, bottom), to achieve better utilization of the memory bandwidth.

In general, the balance between transposition and chunk width should be calibrated based on the hardware capability: current GPUs achieve optimal performance with float4s, while on a CPU or a Xeon Phi, the wide vector width offered by AVX and AVX-512 could be better exploited using float8 or float16 chunks, as illustrated in Table 2.

### 4.3 Particle sorting and neighbor search

In many particle systems, the behavior of the individual particles depends on the state of the particles in a neighborhood of the particle itself. The neighborhood may

#### Figure 1.

Further optimizations, particularly important on older architectures, can be achieved with the sacrifice of some memory, to provide the position and velocity with a fourth (unused) component, as illustrated in Listing 3 (left), resulting in better bandwidth usage and thus faster execution; moreover, additional frequently used data may be stored in the fourth component, if needed (e.g., in GPUSPH we

The benefits of the discussed strategies are exemplified in Table 1 for a simple three-dimensional implementation of PSO. The specific values will obviously generally depend on the specific compute kernel as well as on the specific hardware. Some additional (usually minor) benefits can be achieved by explicitly telling the compiler that the position and velocity array never intersect; this is achieved using the restrict specification for the pointer (Listing 3, right) which, for more complex kernels, may allow the compiler to produce faster code by assuming no dependencies between writes on one array and reads on the other. On some hardware, const \* restrict arrays are also made accessible through a dedicated cache,

The SoA approach can provide a significant boost in performance on GPU, as long as the individual parts of the structure (position, velocity, etc.) fit within the size requirements for coalesced memory access. When even individual structure

approaches are necessary. An example of this occurrence is the storage of a list of

kernel void

{

}

Runtime (ms) 98 73 25 13 Speedup (prev) — 1.3 2.9 1.9 Speedup (total) — 1.3 3.9 7.5

Runtime comparison for a simple three-dimensional particle swarm optimization implementation, using the discussed paradigms: array of structures, array of structures with selective writing, structure of arrays, structure

integrate\_pos(float4 \* restrict posArray, const float4 \* restrict velArray,

pos.xyz+= vel.xyz\*dt; /\* OpenCL swizzle \*/

size\_t N, float dt)

posArray[i]=pos;

AoS Selective AoS SoA Padded SoA

size\_t i=get\_global\_id(0); if (i>= N) return; float4 pos=posArray[i]; const float4 vel=velArray[i];

members are wider than the optimal 128-bit width, however, alternative

Using efficient data types on GPU (left) and leveraging the power of restricted pointers (right).

store the mass in pos.w and the density in vel.w).

further improving performance.

High Performance Parallel Computing

integrate\_pos(float4 \*posArray, const float4 \*velArray, size\_t N, float dt)

> size\_t i=get\_global\_id(0); if (i>= N) return; float4 pos=posArray[i]; const float4 vel=velArray[i];

posArray[i]=pos;

pos.xyz+= vel.xyz\*dt; /\* OpenCL swizzle \*/

2^24 particles running on an NVIDIA GeForce GT 750M.

of arrays with padded members (i.e., using four instead of three components).

4.2 Wide arrays

Listing 3.

{

}

Table 1.

78

kernel void

Possible memory layouts for wide arrays. Top, standard (particle-major) layout; middle, transposed (component-major) layout; bottom, transpose-chunked layout. Memory locations are colored by data component access: locations with the same color are accessed concurrently in parallel by all work-items. In the chunked case, more than one location may be accessed concurrently, depending on the chunk size and hardware capability.


The NVIDIA GPU is a GeForce GT 750M; the Intel GPU is an Intel HD Graphics Haswell GT2 Mobile, using the Beignet OpenCL implementation from Intel; the Intel CPU is an Intel Core i7-4712HQ , using the Intel OpenCL SDK 6.4.0.25. Bold italic values show the best performance. The CPU exhibits worse performance for the transposed layout than the naive due to the auto-vectorization introduced by the OpenCL implementation.

#### Table 2.

Median runtimes (in ms) of the position update kernel for a 128-dimensional particle swarm optimization using the described memory layouts, on different hardware.
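As a minimal sketch of how the transposed (component-major) layout is addressed in a kernel (illustrative only: the kernel name, the plain float element type, and the explicit loop over num\_components are assumptions, not code from GPUSPH), a position update for a high-dimensional particle system can loop over the components, with work-items of consecutive global id touching consecutive memory locations at every iteration:

```
/* Hypothetical position update on a component-major ("transposed") array:
 * component i of particle p is stored at pos[i*num_particles + p]. */
kernel void
update_pos_transposed(global float *pos,       /* num_components * num_particles elements */
                      global const float *vel, /* same layout as pos */
                      uint num_particles, uint num_components, float dt)
{
    size_t p = get_global_id(0);      /* one work-item per particle */
    if (p >= num_particles) return;
    for (uint i = 0; i < num_components; ++i) {
        /* consecutive work-items access consecutive addresses:
         * the accesses are coalesced for every component i */
        size_t idx = (size_t)i*num_particles + p;
        pos[idx] += vel[idx]*dt;
    }
}
```

The chunked variants of Figure 1 would be obtained in the same way, replacing float with float2 or float4 and reducing the component count accordingly.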

### 4.3 Particle sorting and neighbor search

In many particle systems, the behavior of the individual particles depends on the state of the particles in a neighborhood of the particle itself. The neighborhood may be defined in terms of some fixed influence radius or may be determined dynamically, either based on a changing influence radius or based on a pure neighbors count (e.g., "the 10 closest neighbors"). Performance of particle systems on GPU can be improved by reordering particle data in memory so that the data for particles that are close to each other in the domain metric (e.g., distance) are also close in device memory, providing more opportunities for coalesced memory access and (when available) better cache utilization [21].


Sorting is generally achieved using key/value pairs, with the particle hash key computed from the particle position in space: the key array is then sorted, and all data arrays are reordered based on the new key array positions. Common ways to compute the particle sort key are based on either n-trees [22] or regular grids [23]. The main advantage of using an n-tree (and thus in particular quadtrees in 2D and octrees in 3D) is the adaptive nature of the structure, which is denser where particles are concentrated and sparser in the domain regions where particles are more spread out. By contrast, regular grids result in cells which are uniformly spaced and thus in unbalanced particle distributions among the cells.

The adaptive nature of n-trees can result in performance gains in a number of use cases, such as nearest-neighbor searches, collision detection, clump finding, and rendering. At the same time, traversing the tree structure itself efficiently on a stream processing architecture is nontrivial and often results in sub-optimal memory bandwidth utilization [24]. Regular grids, on the other hand, have a much simpler and computationally less expensive implementation, they lead to efficient neighbor search with fixed radius (as we will discuss momentarily), and the resulting data structures can also be used to support domain decomposition in the multi-GPU case, as we will discuss in Section 4.5.3, and also to improve the numerical robustness of the particle system, as we will discuss in Section 5.4.
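Whichever structure is used to compute the keys, the reordering step itself can be as simple as the following sketch (hypothetical code, not the chapter's implementation: the array names and the convention that the sorted value array holds the pre-sort particle indices are assumptions); after sorting the key/value pairs, every data array is gathered through the resulting permutation:

```
/* Hypothetical gather step after a key/value sort: sortedIndex[p] is the
 * pre-sort index of the particle that ends up in position p, and each
 * per-particle array is reordered by gathering through it. */
kernel void
reorder_positions(global const uint   *sortedIndex, /* sorted permutation */
                  global const float4 *posOld,
                  global       float4 *posNew,
                  uint numParticles)
{
    size_t p = get_global_id(0);
    if (p >= numParticles) return;
    posNew[p] = posOld[sortedIndex[p]]; /* coalesced write, gathered read */
}
```

The same gather is repeated (or fused into a single kernel) for velocities and any other per-particle array.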

### 4.3.1 Regular grids for neighbor search

Given a neighbor search radius r, we can subdivide the domain with a regular grid where the stepping in each direction is no less than r. This guarantees that the neighbors for any particle in any given cell can only be found at most in the adjacent cells in each of the cardinal and diagonal directions (Moore neighborhood of radius 1), as depicted in Figure 2.

We can then sort particles (i.e., their data) by, e.g., the linear index or the Morton code [25] of the cell they belong to (computed from the particle global position), so that data for all particles belonging to one cell ends up in a consecutive memory region. Furthermore, we can store in a separate array the offset (common to all data arrays) to the beginning of the data for the particles in each cell (Figure 3).
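For reference, the Morton (z-order) key of a cell can be computed with a standard bit-interleaving routine such as the following sketch (the 10-bits-per-axis limit and the helper names are illustrative assumptions):

```
/* Spread the low 10 bits of v so that two zero bits separate each of them. */
inline uint spread_bits(uint v)
{
    v &= 0x3ff;
    v = (v | (v << 16)) & 0x030000ff;
    v = (v | (v <<  8)) & 0x0300f00f;
    v = (v | (v <<  4)) & 0x030c30c3;
    v = (v | (v <<  2)) & 0x09249249;
    return v;
}

/* Morton (z-order) code of a 3D cell coordinate (up to 10 bits per axis). */
inline uint morton3(uint x, uint y, uint z)
{
    return spread_bits(x) | (spread_bits(y) << 1) | (spread_bits(z) << 2);
}
```

Sorting by either this key or by the plain linear cell index produces the per-cell grouping described above; the Morton order additionally tends to keep spatially close cells close to each other in memory.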


#### Figure 2.

Support grid for neighbor search: if the cell side is no less than the influence radius, the neighbors for any particle in any given cell (dark blue square in the picture) can be found in at most the Moore neighborhood of the cell itself (light blue squares in the picture).

#### Figure 3.

Memory layout for the support grid. Data arrays are sorted so that the data belonging to particles in any given cell is consecutive in memory, and a separate array holds the offset to the beginning of the data of the particles in each cell. In the picture, each primary color denotes a cell, and the color gradient refers to different particle data within each cell. The first cell has two particles, the second has four particles, and the third has three.

A single particle can then search for neighbors by only looking at the corresponding subsets of the particle system, starting from the cell start index for each adjacent cell. Since all particles belonging in the same cell will need to traverse the same subset of the particle list, further improvements can be obtained by loading the data about the potential neighbors into a shared-memory array.
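The cell-based search can be sketched as follows (a simplified, hypothetical kernel, not the chapter's implementation: the kernel name, the convention that cellStart has numCells+1 entries with cellStart[c+1] marking the end of cell c, positions measured from the grid origin, and the placeholder interaction are all assumptions); each work-item scans the particles of the (at most) 27 cells of the Moore neighborhood, using the per-cell start offsets:

```
/* Hypothetical fixed-radius neighbor search on the support grid:
 * particles are assumed sorted by cell, with cellStart[c] holding the index
 * of the first particle of cell c (and cellStart[c+1] that of the next cell). */
kernel void
find_neighbors(global const float4 *pos,       /* sorted particle positions */
               global const uint   *cellStart, /* numCells+1 offsets */
               uint3 gridSize, float cellSide, float r,
               uint numParticles)
{
    size_t p = get_global_id(0);
    if (p >= numParticles) return;
    const float4 mypos = pos[p];

    /* cell coordinates of the current particle (positions measured from the grid origin) */
    int3 cell = convert_int3(mypos.xyz / cellSide);

    for (int dz = -1; dz <= 1; ++dz)
    for (int dy = -1; dy <= 1; ++dy)
    for (int dx = -1; dx <= 1; ++dx) {
        int3 nc = cell + (int3)(dx, dy, dz);
        /* skip cells outside the grid */
        if (any(nc < 0) || nc.x >= (int)gridSize.x ||
            nc.y >= (int)gridSize.y || nc.z >= (int)gridSize.z)
            continue;
        uint c = (nc.z*gridSize.y + nc.y)*gridSize.x + nc.x; /* linear cell index */
        for (uint q = cellStart[c]; q < cellStart[c+1]; ++q) {
            if (q == p) continue;
            if (distance(mypos.xyz, pos[q].xyz) < r) {
                /* q is a neighbor of p: store it or compute the interaction here */
            }
        }
    }
}
```

The shared-memory refinement mentioned above would additionally load the candidate data of each cell into local memory once per work-group before the inner loop.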

### 4.3.2 Just-in-time neighbor search versus neighbor list storage

The results of the neighbor search may be used immediately (e.g., by computing the particle-particle interaction as each neighbor is found) or deferred: in the latter case, the neighbor search itself constitutes its own step in the system evolution, and the results of the search are stored in a list which is then used in subsequent kernels when particle-particle interactions must be computed.

The "just-in-time" approach (which can be equivalently seen as searching for neighbors whenever needed) has the advantage of lower memory requirements (since the list of neighbors need not be stored), but the disadvantage that the cost of the search itself must be paid whenever interactions must be computed. Therefore, it is the preferred approach when the results of each search are only needed once. Conversely, when the results of the neighbor search are to be used multiple times, it is better, performance-wise, to store the results, and then reuse them in the following steps, in order to amortize the cost of the search. The downside in this case is much higher memory requirements.



For example, in GPUSPH the cost of the particle sorting and neighbor list construction can take as much as 30% of the runtime of a single step; most of this (25% of the step runtime) is spent in the neighbor search phase of the neighbor list construction; the list itself is then used for all subsequent kernels that require particle-particle interaction (boundary conditions, density smoothing, forces computation, surface detection, etc., most of which are executed twice due to the predictor/corrector integration scheme used).

Working without a neighbor list, the runtime cost of the search would have to be paid for each execution of a kernel with particle-particle interaction, increasing the runtime of a single operation by over 50%. The neighbor list is therefore a better choice for performance.

On the other hand, on a typical simulation in GPUSPH, the neighbor list is also responsible for the highest memory allocation, even including the double buffering required for most data arrays: indeed, all of the particle properties combined take between 100 and 300 bytes per particle (depending on the formulation being used), whereas for the neighbor list, we need to store, in the simplest cases, 128 neighbors, leading to an occupation of 512 bytes per particle. Some formulations may require two or three times as many neighbors per particle.

We remark that even when not all particles have the exact same number of neighbors, the neighbor list should be statically allocated at the beginning of the simulation, with the capacity to hold the maximum (expected) number of neighbors for any particle and with the layout discussed in Section 4.2, to maximize the performance of its traversal. This leads to some potential memory waste for the benefit of performance. However, even a more conservative and less wasteful neighbor list would still occupy significant memory (e.g., a median of 80 neighbors per particle still needs to be tracked on a typical GPUSPH simulation, which only lowers the memory occupation for the neighbor list to 320 bytes per particle). The net result is that the large allocations required by the neighbor list will limit the maximum size of a particle system that can be simulated on a single GPU.

Additionally, even with the stored neighbor list, nearly one-third of a simulation step is still taken by its construction. A solution to this issue is to only update the list every n iterations, with n large enough to reduce the performance impact and small enough to not affect the results in a significant way. With the default choice of n = 10 in GPUSPH, the cost of particle sorting and neighbor search drops to around 5% of the total runtime in the worst cases. To improve the reliability of the simulations when neighbor list updates are less frequent, a good strategy is to increase the search radius: with this approach, given an influence radius r (maximum distance for interaction), the neighbors are actually added to the list of neighbors if their distance from the central particle is less than αr, with α>1; neighbors whose distance from the central particle is larger than r are then skipped in the kernels. The larger search radius takes into account the fact that particles may move before the next neighbor list update, thus bringing them closer. In this sense, the expansion factor for the neighbor search should be computed based on n·vM·Δt, where vM is the maximum expected (relative) particle velocity and Δt the maximum expected time step.
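To make the choice of α concrete (this is a sketch of the implied reasoning, not a prescription from GPUSPH): if the relative speed of any two particles is bounded by vM, their distance can shrink by at most n·vM·Δt between two list rebuilds, so choosing

α ≥ 1 + n·vM·Δt / r

guarantees that every pair that may come within the influence radius r before the next rebuild is already present in the list.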

### 4.4 Heterogeneous particle systems

While simple particle systems are often homogeneous (in that all particles behave the same way), many applications require heterogeneous particle systems, where particles behave differently depending on some intrinsic characteristic. For example, SPH for fluid dynamics typically needs at least two different particle types: fluid particles that track the fluid itself and boundary particles that define solid walls, moving objects, etc.; the way particles interact with each other (or even whether or not they interact at all) in this case depends on both the central and neighboring particle type.

Heterogeneity in the behavior of the particles and their interactions can have a significant impact on the performance of the system, particularly when it is stored together, without any specific attention to the distribution of the particles and their types. Indeed, the natural way to process a particle system is to issue, for most kernels, a work-item per particle. However, when particles with different types or behavior are processed by work-items in the same subgroup, this leads to divergence, slowing down execution. Similarly, when particles iterate over neighbors, they may have neighbors of different types at corresponding indices (e.g., the third neighbor of the first particle may be a fluid neighbor, while the third neighbor of the second particle might be a boundary neighbor); in this case, again, kernel execution will incur divergences, even if the central particles are of the same type. Moreover, since the distinction between particle types and interaction form must be done at kernel runtime, the kernel code itself grows more complex, reducing optimization opportunities for the compiler and leading to sub-optimal usage of private variables, frequently resulting in register spills, where a reserved area of global memory gets used for temporary storage of private work-item variables, with a severe impact on performance.

When the heterogeneous behavior is due to some static property (e.g., a fixed particle type property), the most obvious choice is to split the particle system itself (e.g., have a particle system for fluid particles and a separate particle system for boundary particles). This has several advantages: it is possible to run a kernel on particles of a specific type more efficiently while still making it possible to run a kernel on all particles; it is also possible to do selective allocations, for example, if a given property (e.g., object number) is only needed for a specific type. The downside is that the management code becomes more complex, and kernels where particles of one type need to interact with particles of the other type must be given access to both sets of arrays, which can increase the complexity of the kernel signatures.

A simpler approach that does not completely solve the divergence issue but can greatly reduce the occurrences of divergence is to introduce additional sorting criteria. For example, one can sort particles by cell, and within cell then sort particles by type, so that for any cell one finds first all the fluid particles (in that cell), followed by all the boundary particles (in the same cell). This "specialized sorting" approach reduces the occurrences of subgroups spanning multiple types to those crossing the boundary between types or between cells. In GPUSPH, the introduction of the per-type sorting within cells has improved the performance of the particle-particle interaction by around 2%. A significant advantage of this approach compared to the split system mentioned before is that it can be used also when the criteria for the behavioral difference are dynamic.
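One possible realization of the per-type sorting within cells (a hypothetical sketch; the key encoding actually used by GPUSPH is not described in the text) is to build the sort key from the cell index and the particle type, so that a single key/value sort produces exactly the cell-major, type-minor order described above:

```
/* Hypothetical sort-key construction: cell index in the high bits, particle
 * type in the low bits, so that sorting by key groups particles by cell
 * first and by type within each cell. 8 bits are assumed to suffice for the type. */
kernel void
compute_sort_keys(global const uint  *cellIndex, /* cell of each particle */
                  global const uchar *type,      /* particle type */
                  global ulong *keys,            /* output sort keys */
                  uint numParticles)
{
    size_t p = get_global_id(0);
    if (p >= numParticles) return;
    keys[p] = ((ulong)cellIndex[p] << 8) | type[p];
}
```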

### 4.4.1 Split neighbor list

Divergence during the neighbor list traversals can be avoided by using a split neighbor list that separately stores neighbors of each type. This comes naturally when using separate particle systems and is efficient also with the specialized sorting approach, since potential neighbors of the same type will be enumerated consecutively in each cell. The split neighbor list can be implemented in such a way that it is possible to iterate efficiently on neighbors of only one given type (or otherwise satisfying one given splitting criterion) while still preserving the possibility to iterate over all neighbors when necessary and minimizing allocation.


A naive split neighbor list can be implemented with separate allocations (e.g., a separate neighbor list per type), but this can quickly lead to an explosion of the already significant memory usage due to the existence of the neighbor list itself: without additional information on the neighbor distribution by type, for example, it may be necessary to allocate a full-sized neighbor list for each type. A more compact solution without loss of traversal efficiency can be achieved by storing the split neighbor list in a single allocation but filling the per-type section of the list from different ends.

As an example, consider the case of two particle types (fluid and boundary), and assume that the neighbor list can hold M neighbors per particle (M is the maximum number of neighbors any particle can have). For each particle, we store the fluid neighbors starting from position 0 (using the 0-based indexing common in the C language family), moving forward, and the boundary neighbors starting from position M-1, backward, where the indices refer to the particle-specific section of the full neighbor list, and taking interleaving into account as described in Section 4.2; iterating over all fluid neighbors is then achieved in the usual way, whereas iterating over all boundary neighbors would be achieved by traversing the array in reverse, as illustrated in Listing 4.

To prevent one section of the neighbor list from bleeding into the other, it is now necessary to put a marker (i.e., an index with a special value, such as −1, defined as NEIBS\_END in the example in Listing 4) at the end of each of the sides of the list, which implies that the effective maximum number of neighbors is reduced by 1 (with the end-of-list marker shared between the two sides when the neighbors list is otherwise full).

If there are more than two types of particle, the same strategy can still be applied, by sectioning the neighbor list. For example, with four types A, B, C, and D, we need to set a value M1 (the total number of neighbors of type A and B combined) and M2 (the total number of neighbors of type C and D combined); the neighbor list is allocated to hold M1+M2 neighbors per particle, where neighbors of type A are stored from position 0 onward, neighbors of type B are stored from position M1–1 backward, neighbors of type C are stored from position M1 onward, and neighbors of type D are stored from position M1+M2–1 backward. Type pairs should be chosen, when possible, based on the uniformity of the cumulative number of neighbors (e.g., if it is more likely that the sum of A and C neighbors is constant, it is better to pair A with C than with B).

```
/* (left) iterating over the fluid neighbors, stored from position 0, forward */
int p = get_global_id(0); /* particle index */
for (int i = 0; i < M; ++i) {
    /* index of the next fluid neighbor */
    int neib_index = neibsList[i*N + p];
    if (neib_index == NEIBS_END) break;
    /* do stuff with neib_index */
}
```

```
/* (right) iterating over the boundary neighbors, stored from position M-1, backward */
int p = get_global_id(0); /* particle index */
for (int i = M-1; i >= 0; --i) {
    /* index of the next boundary neighbor */
    int neib_index = neibsList[i*N + p];
    if (neib_index == NEIBS_END) break;
    /* do stuff with neib_index */
}
```

#### Listing 4.

Traversing the split neighbors list: first type (fluid particles in this example) on the left, second type (boundary particles in this example) on the right.

### 4.4.2 Split kernels

Once traversal of individual particle types (for the system itself or even just for the neighbors) has been made efficient, the more complex kernels that need to provide significantly different behavior based on the particle types involved should be split as well. For example, in GPUSPH we have recently refactored the main computational kernel (dedicated to the computation of the forces acting on each particle) into separate versions to compute fluid/fluid, fluid/boundary, boundary/fluid, etc. interactions. While this generally requires some additional memory access (because, e.g., the particle accelerations now need to be stored and retrieved between one incarnation of the kernel and the next), there has been an overall performance benefit; for the most complex formulations, on first-generation Kepler hardware, we have seen a 50% increase in the number of iterations per second. On the more recent and capable Maxwell architecture, the benefit has been less significant (30% more iterations per second), due to the smaller number of register spills: the more modern hardware supports more registers per work-item and is therefore less affected by the computational issues associated with particle system heterogeneity.

### 4.5 Multi-GPU

Very large particle systems can benefit from distribution across multiple GPUs. This may in fact be necessary simply due to the limited resources available on a single GPU: high-end GPUs currently have at most 16GB of RAM, which may limit the particle system size to a few tens of millions, depending on the complexity of the system. Even for smaller systems, however, distribution over multiple GPUs can provide a performance boost, provided each of the devices is saturated (otherwise, the overhead involved in distributing the particle system will be higher than the benefits offered by the higher computational capacity).

Distributing a particle system across multiple GPUs requires a significant change of vision: GPU coding is facilitated by its shared-memory architecture, where all compute units have read/write access to the entire device global memory. Multi-GPU introduces a distributed parallel computing layer, shifting the focus to efficient work distribution and data exchange between the devices.

For particle systems, the preferential way to distribute work across devices is based on domain decomposition rather than task decomposition, since the former allows both to minimize data exchange and to cover the associated latency more easily. Domain decomposition is achieved by distributing separate sections of the particle system to different devices (e.g., half of the particles and the associated data go on a GPU; the second half goes on a second GPU). When the particles can be processed independently (i.e., no neighborhood information is required), the partition is trivial, and the only objective is load balancing (i.e., ensuring that the fraction of particle system assigned to each GPU is proportional to its computational power). In these cases, the only data exchange needed between GPUs is the tracking of some global quantities, such as the particle system optimal position in PSO.

The decomposition becomes more challenging when the particles' behavior depends on a local neighborhood. In this case, each device must track the state not only of the particles that have been assigned to it but also their neighbors, some of which may have been assigned to other devices. Each device therefore has a view of the particle system as divided in four sections:

• (Strictly) inner particles: particles assigned to the device and that no other device is interested in; no information exchange is involved in their processing.

• Inner edge particles: particles assigned to the device that are neighbors of particles assigned to different devices; information about the evolution of these particles must be sent to other devices.

• Outer edge particles: particles assigned to other devices that are neighbors of particles assigned to this device; information about the evolution of these particles must be received from other devices.

• (Strictly) outer particles: particles assigned to other devices and that this device does not care about; no information exchange is involved in their processing.



The inner/outer relation is symmetrical, in the sense that a (strictly) inner particle for a device is (strictly) outer for all other devices and an inner edge particle for a device is an outer edge particle for at least one other device (Figure 4). We will say that two devices are adjacent if they share an inner/outer edge relation (i.e., if they need to exchange data about neighboring particles). Note that, depending on how the particle system is distributed, information about an inner edge particle may need to be sent to multiple adjacent devices.

#### Figure 4.

Inner/outer/edge relationship between devices, with domain decomposition based on a reference grid: inner/outer/edge particles are then defined based on the cell they belong to rather than on a purely geometrical relationship to the particles assigned to other devices.

### 4.5.1 Computing versus data exchange

The key to an efficient multi-GPU implementation is the ability to minimize the impact of data exchange. The most obvious approach in this sense is to minimize the data transfers themselves, for example, by ensuring that the domain is partitioned in such a way that the number of edge particles is minimized.

As a general rule, during the simulation each device will hold all the data relevant to both its inner (and inner edge) particles and the data relevant to its outer edge particles, i.e., the inner edge particles of adjacent devices. However, it will only run computational kernels on its own particles (using, read-only, the information from the outer edge particles) and then receive updates about the outer edge particles from the adjacent devices. On the other hand, there are cases when it may be convenient for each device to compute the updates for the outer edge particles by itself. This can be done under the condition that the update does not require information from the neighbors (since the device does not hold information about all the neighbors of outer edge particles) and becomes convenient when the amount of data to transfer before the update is less than the amount of data to transfer after the update.

As an example, consider a simple particle system where the forces acting on each particle are computed from the interaction with the neighbors (forces kernel), but the new position and velocity are computed only from the previously computed forces (euler kernel), without any (further) interaction with the neighbors. Assume that there are two devices, D1 and D2, and that all the involved arrays (forces, positions, and velocities) are the same size (e.g., a float4 per particle). Then it is more convenient for D1 to get the information about the forces acting on its outer edge particles from D2 (and conversely), and then run euler on the outer edge particles, than it is for D1 to get the information about the positions and velocities after euler has been run on both devices: this is because exchanging forces results in a data transfer of a single float4 per (outer edge) particle, while exchanging positions and velocities would require exchanging twice as much data.

### 4.5.2 Computing during data exchange


Assuming data transfers have been minimized as discussed in the previous paragraph, the next step in an efficient multi-GPU implementation is to cover the data transfer latency by running computational kernels concurrently with the data transfer itself. This can be achieved as long as it is possible to efficiently launch computational kernels on a subset of the particle system (specifically, on the inner edge particles). It is then possible to first compute the new data on the inner edge particles, and then launch the kernel on the remaining (strictly inner) particles, while the inner edge data is sent to adjacent devices (and conversely the outer edge data is received from adjacent devices). This strategy allows optimal latency hiding, especially for the most computationally intensive kernels, provided all involved devices are saturated. In GPUSPH, this is how we achieve nearly linear speedups in the number of devices [26], even when network transfers are involved in multinode simulations [23].
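
A minimal sketch of this split-and-overlap pattern is shown below, using the CUDA runtime API; the kernel, buffer names, and particle ranges (`forces_kernel`, `d_forces`, `first_edge`, etc.) are hypothetical placeholders, not the actual GPUSPH interface.

```
#include <cuda_runtime.h>

/* hypothetical per-particle kernel: processes particles [first, first+count) */
__global__ void forces_kernel(float4 *forces, const float4 *pos, int first, int count)
{
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= first + count) return;
    forces[i] = pos[i]; /* placeholder body: the actual interaction loop goes here */
}

/* one iteration of the overlap pattern on a single device */
void compute_step(float4 *d_forces, const float4 *d_pos,
                  float4 *d_peer_forces,        /* edge buffer on the adjacent device */
                  int this_dev, int peer_dev,
                  int first_edge, int num_edge,
                  int first_inner, int num_inner,
                  cudaStream_t edge_stream, cudaStream_t inner_stream)
{
    const int block = 256;
    /* 1. compute the inner edge particles first */
    forces_kernel<<<(num_edge + block - 1)/block, block, 0, edge_stream>>>
        (d_forces, d_pos, first_edge, num_edge);
    /* 2. ship the freshly computed edge data to the adjacent device on the same
       stream, so the copy starts as soon as the edge kernel is done */
    cudaMemcpyPeerAsync(d_peer_forces, peer_dev, d_forces + first_edge, this_dev,
                        num_edge * sizeof(float4), edge_stream);
    /* 3. meanwhile, process the (much larger) strictly inner set on another
       stream, hiding the transfer latency behind this kernel */
    forces_kernel<<<(num_inner + block - 1)/block, block, 0, inner_stream>>>
        (d_forces, d_pos, first_inner, num_inner);
    /* 4. both the exchange and the inner computation must finish before the
       next step of the integrator */
    cudaStreamSynchronize(edge_stream);
    cudaStreamSynchronize(inner_stream);
}
```

Whether the transfer goes through peer-to-peer copies, as above, or is staged through the host (and possibly the network in the multinode case) does not change the structure of the overlap.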

### 4.5.3 Reference grid for domain decomposition

In our experience, multi-GPU also benefits from the use of the auxiliary grid that can be used for sorting (as described in Section 4.3) and for improved numerical accuracy (as will be described in Section 5.4): indeed, to improve the efficiency of data transfers, it is important that the sections of the arrays that need to be sent to adjacent devices are as consecutive as possible, since multiple small bursts are generally less efficient (both over the PCI Express bus and over the network) than larger data transfers.

With the auxiliary reference grid, this can be obtained by always splitting the domain at the cell level and computing the cell index by taking the inner/edge/outer relation into consideration (Figure 4). In GPUSPH this is achieved by reserving the two most significant bits of the cell index to indicate strictly inner cells (00), inner edge cells (01), outer edge cells (10), and strictly outer cells (11)—with the latter never actually used except in an optional global cell map. With this strategy, all strictly inner particles will be sorted first, followed by all inner edge particles, and finally by outer edge particles. Since outer edge particles will be sorted consecutively in memory, receiving data about them will be more efficient; similarly, sending inner edge data over will be made more efficient by the coalesced layout.
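
The following sketch illustrates one possible way to pack the cell type into the two most significant bits of a 32-bit cell index; the names and helper functions are ours, for illustration only.

```
#include <cstdint>

/* cell types, matching the two-bit codes described above */
enum CellType : uint32_t {
    CELL_INNER      = 0u,  /* 00: strictly inner */
    CELL_INNER_EDGE = 1u,  /* 01: inner edge */
    CELL_OUTER_EDGE = 2u,  /* 10: outer edge */
    CELL_OUTER      = 3u   /* 11: strictly outer (only in the optional global cell map) */
};

constexpr int      TYPE_SHIFT = 30;                    /* two most significant bits */
constexpr uint32_t INDEX_MASK = (1u << TYPE_SHIFT) - 1;

inline uint32_t encode_cell(uint32_t linear_index, CellType type)
{ return (uint32_t(type) << TYPE_SHIFT) | (linear_index & INDEX_MASK); }

inline CellType cell_type (uint32_t cell) { return CellType(cell >> TYPE_SHIFT); }
inline uint32_t cell_index(uint32_t cell) { return cell & INDEX_MASK; }
```

Sorting particles by such a combined key automatically produces the strictly inner/inner edge/outer edge grouping that makes the edge transfers contiguous.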

Note that this cell sorting strategy does not completely eliminate the need for multiple transfers; it does however help to reduce it significantly, especially when combined with linear cell indexing and the appropriate choice of order of dimensions for linearization. For this reason, GPUSPH offers a (compile-time) option to customize this choice, which in our experience can bring performance benefits of up to 30%.


### 5. Numerical robustness

When the intent of GPGPU is to leverage the low-cost, high-performance ratio offered by consumer GPUs, a significant bottleneck in scientific applications is given by the limited (when not completely absent) support for double precision: since consumer GPUs are designed for video games and similar applications, where the highest rendering accuracy is not a requirement, the hardware is optimized for single-precision computation. Applications that need higher accuracy can thus follow one of the following strategies: use double precision anyway, use soft extended precision (double-float, etc.), or rely on alternative algorithms that ensure a better utilization of the available precision.

Due to the high performance cost (ratios as low as 1:32 compared to single precision, nearly completely defeating the benefits of the high performance of GPUs over CPUs), the use of double precision should be avoided whenever possible. If absolutely necessary, it should be restricted to parts of the code where it is essential. In all other cases, faster alternatives should be sought out. A possible solution is offered by double-float arithmetic, in which two single-precision values are used to provide higher accuracy [27, 28]: most operations will require between two and four hardware operations to complete, and the overall accuracy will generally be slightly lower than using actual double precision, but this can still be a good compromise between performance and accuracy when 64-bit floating-point has low or no support in hardware.

In many cases, it will be possible to avoid relying on extended precision by taking some care in the choice of algorithm used. In fact, the methods and algorithms that we will discuss momentarily can help improve the accuracy of an implementation regardless of the precision used: we would recommend their use even with double precision, since they always lead to more accurate results and in some cases (such as Horner's method) even higher performance.

### 5.1 Horner's method

Polynomial evaluation should always be done using Horner's method [29]. Any polynomial $a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0$ can be written in the equivalent form

$$((\dots((a_n x + a_{n-1})x + a_{n-2})x + \dots)x + a_1)x + a_0.$$

When this second form is evaluated from the innermost to the outermost expression, better accuracy and performance can be achieved. Indeed, Horner's method is known to be optimal in that it requires the minimal number of additions and multiplications for the evaluation of the polynomial [30, 31], and given the widespread availability of fused multiply-add operations on modern hardware, every term can be computed with a higher accuracy and in a single cycle, making this evaluation method the fastest and most accurate for general polynomials.
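
As a sketch of what this looks like in practice, here is a single-precision Horner loop where each step maps to one fused multiply-add; the function name and coefficient layout are ours, for illustration.

```
#include <cmath>

/* evaluate a_n x^n + ... + a_1 x + a_0 with Horner's method;
   coeff[0] holds a_n, coeff[n] holds a_0 */
float horner(const float *coeff, int n, float x)
{
    float r = coeff[0];
    for (int i = 1; i <= n; ++i)
        r = std::fma(r, x, coeff[i]);  /* r = r*x + a_(n-i), single rounding per term */
    return r;
}
```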


### 5.2 Compensated and balanced summation


In many particle systems, the behavior of a particle is dictated by the influence of a local neighborhood: the total action on the central particle is then achieved by adding up the contributions from each neighboring particle. The number of contributions is frequently in the order of tens or hundreds and in some applications even more. The naive approach, which could be described algorithmically as

```
total_action = 0;
for (i = 0; i < num_neighbors; ++i)
    total_action += contribution_from_neighbor(i);
```

suffers from low accuracy: since each contribution adds a relative error in the order of the machine epsilon ε, the total relative error is in the order of the total number of contributions, O(εn) in the worst case (in practice, the error is typically O(ε√n) with the default round-to-nearest rounding mode).

This can be significantly improved by using a compensated summation algorithm, which can bring the total relative error down to O(ε), i.e., constant in the number of contributions. The idea behind this class of algorithms is to keep track of the quantity that gets "lost" during a summation due to the finite precision. This is achieved by keeping two accumulators, the sum itself and a correction. The simplest form of compensated summation was popularized by Kahan [32], where the contribution from each new term is computed as shown in Listing 5.

The approach relies on the compiler not trying to do an algebraic simplification of the expressions for the temporary sum t and the correction, which may require disabling "fast-math" and the contraction of floating-point expressions.

The Kahan compensated summation algorithm works best when all terms in the summation are of similar or decreasing orders of magnitude, but fails to take into account that the new term may be (significantly) larger in magnitude than the current running sum. To this end, Neumaier presented a variant [33], usually known as KBN (Kahan-Babuška-Neumaier), that takes into account the relative magnitude of the new term and the running sum when computing the correction (Listing 6). In contrast to Kahan's algorithm, the final sum is obtained by adding sum and correction. More details about balancing and compensated summation algorithms can be found in [34].

The main downside of KBN is the branching condition to compute the correction, which may reduce performance on GPU. The algorithm can be rewritten to be vector-friendly, as illustrated by the vectorized form in Listing 6, which makes use of the select(a, b, c) function and the component-by-component extension to the ternary operator c?b:a defined, e.g., in OpenCL C. This form of KBN should only be chosen when profiling shows the branching to be a performance bottleneck, since the extra operations otherwise introduce higher latency in the summation.

```
y = term - correction;       /* take into account what got lost previously */
t = sum + y;                 /* temporary sum: add the new term y */
correction = (t - sum) - y;  /* estimate what got lost adding the new term */
sum = t;                     /* actual new value of the summation */
```

#### Listing 5.

Kahan summation algorithm.



```
/* standard form */
t = sum + term;
if (abs(sum) >= abs(term)) {
    correction += (sum - t) + term;
} else {
    correction += (term - t) + sum;
}
sum = t;

/* possible vectorization */
t = sum + term;
cond = (abs(sum) >= abs(term));
sum_or_term = select(term, sum, cond);
term_or_sum = select(sum, term, cond);
correction += (sum_or_term - t) + term_or_sum;
sum = t;
```

#### Listing 6.

KBN (Kahan-Babuška-Neumaier) summation algorithm, in standard form and in a possible vectorized form.


The downsides of compensated summation algorithms are higher storage requirements (the additional accumulator) and higher computational cost (Kahan, e.g., requires four times more operations compared to the standard summation). Compared to the use of double precision, the storage requirements are unchanged; the computational cost, however, is two (or more) times higher than the cost of double-precision on hardware that supports it at full frequency; on most consumer GPUs, however, the use of compensated summation can be up to eight times faster than the use of double precision.

Compensated summation algorithms can be used both locally, at the single kernel level, to improve the computation of the contributions for the next time step, and globally, across kernel launches, providing better accuracy for long-running simulations.
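
To illustrate the local use, the naive neighbor loop shown earlier can be rewritten with a per-thread Kahan accumulator; this is a sketch only, with `contribution_from_neighbor` and `num_neighbors` standing in for the actual interaction term and neighbor count.

```
float total_action = 0.0f, correction = 0.0f;
for (int i = 0; i < num_neighbors; ++i) {
    float term = contribution_from_neighbor(i);
    float y = term - correction;          /* re-inject what was lost so far */
    float t = total_action + y;           /* temporary sum */
    correction = (t - total_action) - y;  /* what got lost in this addition */
    total_action = t;
}
```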

### 5.3 Vector norms and the hypot function

Computing the norm of a vector is a very frequent operation. In Euclidean metric, the norm is computed as the square root of the sum of the square of the components and thus requires d multiplications, d-1 additions, and a square root. With the exception of very high dimensions, the final square root is frequently the most expensive operation as well as the least accurate.

The first step to improve both performance and accuracy is therefore to avoid taking the square root if possible. For example, when the vector is a distance and the objective is to compare distances, it is much faster (and accurate) to compare the squared distances (sum of the squares of the differences of the components) rather than the distances themselves. When the distance has to be compared against a reference length, it is cheaper to square the reference length than it is to take the square root in the distance computations.

A typical circumstance where one cannot avoid taking the square root is normalization of a vector, in which each component needs to be divided by the length of the vector; in some applications this is such a frequent (and slow) operation that fast, but less accurate implementations are used, such as the fast inverse square root [35] popularized by its use in id Software Quake III: Arena video game [36].

While games can afford to sacrifice accuracy for performance, this is not the case in scientific applications, for which a significant issue in vector normalization (and similar operations) is numerical stability: when the vector has components which are very close to zero, a naive computation of the norm may lead to underflow, potentially resulting in a final division by zero during normalization; conversely, very large components can lead to overflow of the inner summation before the root extraction.


The solution is to compute the norm using the hypot operator. The idea is to rewrite $\sqrt{a_0^2 + a_1^2 + \dots + a_n^2}$ as $|a_0|\sqrt{1 + q_1^2 + \dots + q_n^2}$, where $q_i = a_i/a_0$. In exact arithmetic the two expressions are equivalent, but with finite precision, the second expression is more accurate, assuming the values are sorted by magnitude (largest to smallest). The higher accuracy of hypot however comes at a significant computational cost, due to the additional division per component: even highly optimized implementations of hypot (such as the two-argument function available in the standard C library, in OpenCL C and in CUDA) can easily be as much as two orders of magnitude slower than the naive approach.
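
The same rescaling idea, written out for a three-component vector as a plain single-precision function (a sketch for illustration; the two-argument hypot of the standard library covers the two-dimensional case directly):

```
#include <math.h>

/* norm of (x, y, z) computed by factoring out the largest magnitude,
   so the intermediate squares can neither overflow nor underflow */
float norm3(float x, float y, float z)
{
    float ax = fabsf(x), ay = fabsf(y), az = fabsf(z);
    float m = fmaxf(ax, fmaxf(ay, az));
    if (m == 0.0f) return 0.0f;    /* avoid 0/0 for the null vector */
    ax /= m; ay /= m; az /= m;     /* all ratios are now at most 1 */
    return m * sqrtf(ax*ax + ay*ay + az*az);
}
```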

### 5.4 Local versus global position


A major difference between numerical methods such as finite differences and meshless methods such as smoothed particle hydrodynamics is that in the former case, there is no need to track the global position of each computational node, as this is defined algorithmically based on the (possibly adaptive) mesh size, and the internode distance is fixed and known in advance. Meshless methods, on the other hand, need to track the global position of each particle; due to the nonuniform distribution of floating-point values, the inter-particle distance (computed as the difference between the global position of the particles) will then have a higher precision near the origin of the domain and a lower precision the further away from the origin the particles are. When the ratio of the inter-particle distance to the domain size gets close to machine epsilon, this nonuniform accuracy may lead to artificial clustering of particles, an effect that is quite noticeable when using single precision to simulate very large domains with a very fine resolution (Figure 5).

While extending the precision for the global position is a possible solution [37], this only delays the problem and, as mentioned previously, may have a nontrivial computational cost. An alternative approach is to use a fixed support grid with regular spacing and only track the position of each particle with respect to the center of the grid cell it belongs to [38]; distances to the other particles are obtained by combining the distance between the grid cell centers and the local position difference (Listing 7). Absolute (global) positions are only reconstructed when necessary (e.g., when writing the results to disk).

#### Figure 5.

Simulation of an 8-pool fish pass with GPUSPH before the introduction of uniform position precision. At high resolutions, spurious explosion due to insufficient precision near the domain edge can be observed.

```
const float3 cell_size;     /* global constant: cell size */
int3 our_cell, neib_cell;   /* (integer) coordinates of the cells */
float3 our_lpos, neib_lpos; /* particle position wrt their cell center */

/* particle distance, computed from the local position difference
   and the cell center distance */
float3 dist_vec =
    (neib_cell - our_cell)*cell_size +
    (neib_lpos - our_lpos);
```

#### Listing 7.

Computing particle distance with uniform precision using cell indices and local positions.



This approach doubles the storage requirement (e.g., in three dimensions, an additional int3 is needed to describe the cell index, in addition to the local position stored in a float3), matching the requirement of the higher-precision type (e.g., storing global positions but using a double3 per particle), with the additional benefit of uniform accuracy throughout the domain, and the possibility to support much larger domains, even larger than would be allowed with higher precision. If the overall number of cells is relatively low, storage requirements can be reduced by encoding the cell coordinates in less memory: for example, if the overall number of cells is less than 2<sup>32</sup>, it is possible to store a linearized cell index in a single unsigned int. The support grid solution is particularly advantageous when the same grid (and linearized cell index) can also be used for neighbor search, as described in Section 4.3, and to improve the performance of multi-GPU simulations, as discussed in Section 4.5.
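
For completeness, a sketch of how an absolute position can be reconstructed on demand from the cell coordinate and the cell-relative offset (one component shown; the variable names are ours, and the cell center is taken as the local origin, as above):

```
/* reconstruct the global coordinate of one component from its cell index and
   its position relative to the cell center (only needed, e.g., when saving) */
double global_pos(int cell, float local, float cell_size, double domain_origin)
{
    return domain_origin + (cell + 0.5) * (double)cell_size + (double)local;
}
```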

### 5.5 Relative and nondimensional quantities

A general rule to improve the numerical stability of most methods is to work in nondimensional form, i.e., to scale all quantities (length, time, mass, etc.) by some problem-specific scale factor related to the problem's characteristic numbers (e.g., the Reynolds number in case of viscous flows), and rewrite the underlying equations in terms of the nondimensional quantities instead of using standard units.

Working with nondimensional quantities can lead to better accuracy by allowing fuller utilization of the range of the floating-point values, but additional steps can be taken to gain further accuracy. An example of this is the local coordinate system proposed in the previous paragraph that alone is sufficient to improve the energy conservation of GPUSPH by as much as four orders of magnitude [38], but the same principle can be extended to most quantities. The benefits are particularly high when the range of variation of the quantity itself is small.

For example, in the weakly compressible SPH formulation, the density ρ of the particles in a fluid is assumed to differ from the reference (at-rest) value ρ<sub>0</sub> by at most a few percent. In nondimensional terms, the recommended approach would be to work with the relative density RD = ρ/ρ<sub>0</sub>, but the numerical accuracy can be further improved by working with the centered relative density ρ̃ = ρ/ρ<sub>0</sub> − 1, which allows us to gain three or more digits of accuracy and to bring the absolute error in the computation of the neighbor contribution to the pressure part of the momentum equation from 10<sup>−7</sup> down to 10<sup>−11</sup> [38].

### 6. Flexibility

When setting forth to implement a new particle engine, important decisions have to be made concerning the general design of the system, particularly in reference to the scope and objectives. This is particularly important for particle systems, due to the possible temptation to produce a "universal particle system engine" that could be extended to support any kind of particle system: a worthy objective, but in direct contrast with the need to have something useful and functional "right now." As a general rule, our recommendation is to start with a well-focused objective (e.g., a very specific formulation of a numerical method) and only extend/expand the objective for specific use cases.

The focus of the initial implementation of a particle system (as for any general application) should always be correctness over performance. This is true also when specifically targeting high-performance computing (e.g., when designing for a GPU implementation). There are however some important design aspects that can be kept in mind, with low implementation cost and high return on investment at later development stages, both in terms of code cleanliness and extensibility. We will present them in this section, showing how the complexity that comes with flexibility can be isolated to provide a cleaner interface without sacrificing (runtime) performance.

### 6.1 Separation of roles



Implementations of particle systems can be characterized by three main roles: management (setup, teardown, data exchange, e.g., saving/visualization), evolution (abstract sequence of steps that describe a single iteration of the lifetime of the system), and execution (concrete functions and kernels for each individual step). Keeping these roles distinct from the beginning can provide a solid base for the growth of the implementation; for example, while at first the developer may focus on a single GPU, a subsequent extension to multi-GPU can be achieved more easily at a later stage if the management and execution roles are delegated to separate classes, running on separate threads: typically, management is handled by the main thread, while the execution role will be delegated to a worker thread (which becomes multiple worker threads in the multi-GPU case).

Two approaches are then possible: assign the evolution role (i.e., the decisions about which steps to run next) to the manager thread or to the workers. The latter choice ("smart workers") has the advantage that the worker threads know what to do for every step, and this can be leveraged to reduce synchronization points; the downside is that unification tasks (such as global reductions and data exchange in the multi-node case) must be taken over by a specific worker, which creates an imbalance (since not all workers are created equal). Conversely, with "dumb workers," most commands become a synchronization point (which can become a performance bottleneck), but the manager thread can more easily subsume the unification tasks. Hybrid solutions are also possible, at the expense of a growing complexity in logic. Factoring out the (abstract description of the) evolution into its own class (in the object-oriented programming sense) can make it easier to transition from one implementation to the other.

### 6.1.1 Workers for hardware abstraction

Refactoring the execution role into a separate Worker class has several advantages. The most important, as mentioned above, is that having a separate Worker class provides a solid ground to expand the code to support multiple GPUs (with each Worker taking control of a separate device).

Additionally, with the correct design, wider hardware support can be implemented by making the Worker class itself an abstract class, implementing only the common code, such as the evolution logic in the case of smart workers, or the command dispatch table in the case of dumb workers. Derived classes would then implement the hardware-specific details such as the actual kernel execution or the host/device memory transfers. For example, one could have a GPUWorker for execution on GPU and a CPUWorker for execution on CPU [26]; the GPUWorker class itself could be further specialized in a CUDAWorker and an OpenCLWorker, when support for both APIs is desired.
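
A skeletal version of such a hierarchy might look as follows; the class and method names are illustrative only, not the actual GPUSPH interface, and the derived class bodies are omitted.

```
#include <cstddef>

/* abstract Worker: owns the per-device logic, delegates hardware specifics */
class Worker {
public:
    virtual ~Worker() {}
    void run_command(int cmd) { dispatch(cmd); }  /* common code lives here */
protected:
    virtual void dispatch(int cmd) = 0;                       /* launch the right kernel */
    virtual void upload(const void *src, std::size_t n) = 0;  /* host-to-device copy */
    virtual void download(void *dst, std::size_t n) = 0;      /* device-to-host copy */
};

class CPUWorker : public Worker { /* plain loops and memcpy */ };
class GPUWorker : public Worker { /* common GPU-side logic */ };
class CUDAWorker   : public GPUWorker { /* cudaMemcpy, kernel launches */ };
class OpenCLWorker : public GPUWorker { /* clEnqueue* calls */ };
```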

### 6.2 The cost of variation

The complexity of the implementation of each individual step of the evolution loop in the particle system is directly related to the number of features offered; for example, some sections of the code may only be relevant when the fluid needs to interact with moving objects; or some particle types or particle data may only be present if specific options (e.g., an SPH formulation) are enabled; or it may be possible to choose between a faster but less accurate approach and a computationally more expensive, more accurate solution.



There are two main costs that come with providing multiple features (or variations thereof): an implementation cost (larger, more complex code to write) and a runtime cost. Both costs can be reduced with appropriate care in the design of the software. An optimal implementation should be designed in such a way that the runtime cost of a disabled feature matches the runtime cost had it not been implemented in the first place. For example, if the user runs an SPH simulation without any moving objects, then the runtime should be the same in an SPH implementation that does not support moving objects, and in an implementation that does support them, but that can detect (or can be told) that support for them is not needed. Likewise, the implementation of any additional feature should be as unobtrusive as possible, factored out into its own sets of functions. Luckily these two goals are not in conflict.
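
One common way to obtain this zero-cost behavior is to resolve the option at compile time, so that the disabled branch is not even generated; a minimal sketch, with the flag and function names ours:

```
/* the feature selection is known at compile time, so the moving-objects
   branch is compiled out entirely when the feature is disabled */
template<bool has_moving_objects>
void forces_step(/* ... particle data ... */)
{
    /* common interaction terms computed here */
    if (has_moving_objects) {
        /* object coupling terms: eliminated by the compiler when the
           template is instantiated with has_moving_objects == false */
    }
}
```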

### 6.2.1 Managing buffers

One of the key aspects in supporting multiple variants for particle systems is correct buffer management. The objective is to support allocating all and only the (copies of the) buffers that are needed, correctly sized, in a unified manner (i.e., with the same interface on host and device). Moreover, we want to be able to pass around sets of buffers (e.g., the collection of all allocated buffers or a specific subset of them) to other parts of the code, in a way that minimizes both the actual copying of data and the detailed specification of which buffers belong to a set.

The approach we use in GPUSPH to solve this issue relies on two aspects: bitmask-based buffer naming and a set of classes that abstract buffer management, isolating the details of the individual buffer while still allowing the retrieval of all the necessary information, such as the data type, the number of elements, the kind of buffer (e.g., host or device), etc.

Bitmask-based buffer naming relies on using an appropriate set of values (defined with either #defines or an enum) to refer to the buffers, on the condition that each (symbolic) buffer name corresponds to an individual bit. These symbolic buffer names are used to refer to buffers in most of the host code, with the exceptions of the classes and functions that require access to the actual corresponding data pointers (i.e., the lowest level of implementation of the GPUWorker). An example of this is shown in Listing 8, which needs to be paired with an appropriate BufferList class such that, given BufferList buffers, we can get the array of positions as buffers.getData<BUFFER\_POS>().

```
/* buffers as plain member variables */
float4 *bufferPos;
int    *bufferNeibs;
float2 *bufferTau[3]; /* stress tensor */

/* buffer symbolic names */
#define BUFFER_POS   (1U << 0)
#define BUFFER_NEIBS (1U << 1)
#define BUFFER_TAU   (1U << 2)
```

#### Listing 8.

Buffers as member variables (top) versus buffer symbolic names (bottom).

With each symbolic name associated to a separate buffer, it is possible to combine them into an expression to refer to multiple buffers at once. For example, BUFFER\_POS | BUFFER\_VEL would indicate a collection of buffers holding both the positions and the velocity. This is particularly useful in conjunction with the "dumb worker" approach, because it allows most commands given by the Manager thread to the Workers to have a syntax such as

doCommand(COMMAND\_NAME, BUFFER\_R1 | BUFFER\_R2 | ..., BUFFER\_W1 | BUFFER\_W2 | …);

where the list of input and output buffers for the command are given as single parameters by doing a binary OR of the symbolic buffer names.

In languages such as C++, the use of symbolic names (with or without the bitmask property) also allows static knowledge about the buffer properties to be encoded into a "traits" structure that can be used to evince information about the buffers at compile time, as exemplified in Listing 9. This allows developers to programmatically determine the element type of a buffer as BufferTraits<symbolic\_name>::element\_type, which can be used in our BufferList class to make sure that when requesting a specific array, we get a pointer of the correct type, and BufferTraits<symbolic\_name>::num\_buffers to determine how many components the buffer has (e.g., in this example we model a symmetric 3×3 tensor, which has six components, as a collection of three float2 buffers).

The management of the actual data is handled with different layers of abstraction. We use an AbstractBuffer class to describe the interface shared by all buffers, regardless of content or type; the interface presents (pure virtual) methods for operations such as allocations and deallocation of data, as well as a way to return a "generic" pointer to the data itself (as a void\*, since no type information is available in AbstractBuffer).

The next layer can be implemented as a GenericBuffer class template that depends on the data type and number of components of the buffer, so that GenericBuffer<float2, 3> would be able to handle storage and typed access to per-particle symmetric tensor data (encoded in three float2 per particle).

```
template<int Buffer> struct BufferTraits;

template<> struct BufferTraits<BUFFER_POS>
{
    typedef float4 element_type;
    enum { num_buffers = 1 };
};

template<> struct BufferTraits<BUFFER_NEIBS>
{
    typedef int element_type;
    enum { num_buffers = 1 };
};

template<> struct BufferTraits<BUFFER_TAU>
{
    typedef float2 element_type;
    enum { num_buffers = 3 };
};
```

#### Listing 9.

Example declaration of a BufferTraits structure and its specializations for some named buffers, declaring their data type and multiplicity.
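
As a sketch of how such traits can be consumed (a simplified illustration, not the actual GPUSPH BufferList), a typed accessor over a map of type-erased buffers could look like this:

```
#include <map>

/* simplified sketch: maps symbolic buffer names to type-erased storage; the
   real implementation maps to AbstractBuffer pointers rather than void* */
struct SimpleBufferList {
    std::map<unsigned int, void *> buffers;

    template<int Key>
    typename BufferTraits<Key>::element_type *getData() {
        return static_cast<typename BufferTraits<Key>::element_type *>(buffers[Key]);
    }
};

/* usage (hypothetical): float4 *pos = list.getData<BUFFER_POS>(); */
```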


Since we can derive element types and number of components from the buffer traits, our Buffer class template can simply derive from the appropriate GenericBuffer:


```
template<flag_t Key>
class Buffer : public GenericBuffer<
     typename BufferTraits<Key>::element_type,
     BufferTraits<Key>::num_buffers>
{ /* other specializations, as necessary */ };
```
Note that the Buffer class template still does not have any actual allocation logic: its only purpose is to provide the correct base class for the named properties of each particle (e.g., position, velocity, etc.). The allocation logic is delegated to derived classes such as HostBuffer (that would use malloc/free or new/delete), CUDABuffer (using cudaMalloc/cudaFree), and CLBuffer (using clCreateBuffer/clReleaseMemObject).
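For instance, a device-side buffer class might simply override the allocation hooks. The following is a sketch on top of the GenericBuffer sketch above, not the actual GPUSPH class (error checking omitted):

```
#include <cuda_runtime.h>

template<typename T, int N>
class CUDABuffer : public GenericBuffer<T, N> {
public:
    // Allocate one device array per component.
    void alloc(size_t num_particles) override {
        for (int c = 0; c < N; ++c)
            cudaMalloc((void **)&this->m_data[c], num_particles * sizeof(T));
    }
    // Release the device arrays.
    void release() override {
        for (int c = 0; c < N; ++c) {
            cudaFree(this->m_data[c]);
            this->m_data[c] = nullptr;
        }
    }
};
```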

Finally, a collection of buffers is managed by a BufferList that maps symbolic names to the concrete class implementing the buffer with the specific symbolic name.

Since different variants of a particle system will instance different subsets of all possible buffers, the mapping will in general be sparse, and it is therefore better to use a dictionary type such as std::map (or language equivalent) to implement it. This is particularly true for the bitmask choice of symbolic names. Moreover, since each symbolic name maps to a buffer with a different data type, the mapping will generally be between symbolic names and AbstractBuffers. Downcasting to the correct Buffer<symbolic\_name> type can be done in specific template methods, where the symbolic name is known at compile time. Our BufferList class, for example, exposes a template getData method that returns a pointer of the correct type for the given symbolic name, again using the buffer traits to deduce it; some examples of its usage are illustrated further on.
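A possible shape for such a dictionary-based container, with the typed accessor described above, is sketched below; the real GPUSPH class is richer, and the add method and the data accessor are assumptions carried over from the earlier sketches:

```
#include <map>

class BufferList {
    std::map<flag_t, AbstractBuffer*> m_map; // sparse: only instanced buffers
public:
    void add(flag_t key, AbstractBuffer *buf) { m_map[key] = buf; }

    // Typed access: the symbolic name is known at compile time,
    // so the traits give us the correct pointer type to downcast to.
    template<flag_t Key>
    typename BufferTraits<Key>::element_type *getData(int component = 0) {
        typedef typename BufferTraits<Key>::element_type T;
        typedef GenericBuffer<T, BufferTraits<Key>::num_buffers> buffer_type;
        auto it = m_map.find(Key);
        if (it == m_map.end())
            return nullptr; // buffer not instanced in this list
        return static_cast<buffer_type*>(it->second)->data(component);
    }
};
```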

In general, a single instance of a running particle system will have multiple BufferLists: it may have one to hold the data on host (e.g., for saving or visualization) and one on device to hold the data for the running simulation; on device, it might have more than one, separating read/write copies of each buffer, or to hold the data for different parts of the particle system (e.g., a BufferList for fluid particle data, a BufferList for boundary particle data); it may also have one (or more) BufferLists to hold all of the available data for the particle system and separate smaller BufferLists with a selection of the data when passing it to individual kernels, depending on the approach used.

### 6.2.2 Handling options combinatorial growth

As the number of options offered to the end user of a particle system grows, there is a consequent explosion in the number of possible valid combinations that need to be supported. This results in competing needs: compile-time implementation of the options is attractive for performance reasons, but if the specific option is to be selectable at runtime, every combination must be built, with the associated resource consumption at compile time.

(At one point, compiling all of the combinations of simulation parameters offered by GPUSPH took several hours, occupied several gigabytes of memory, and produced a binary with around 3000 variants of the main computational kernels overall.)


There are two possible solutions to this issue. The simplest solution, which is currently used in GPUSPH, is to push the compile-time selection down to the user: the user simulation is set up via a source file where all the compile-time parameters must be selected, and when this setup is compiled, only the specific combination of parameters is enabled and compiled for. An alternative solution is to rely on a possibility specific to GPU programming: compiling the device code at runtime. This feature has always been available with OpenCL (in fact, OpenCL requires runtime compilation of the device code), and it has recently been made available in CUDA via the NVRTC library.

The main downside of the compile-time selection is that the software must always be distributed in source form, with several implications in terms of user-friendliness and maintenance. Downsides of runtime compilation include the limited support in older versions of CUDA, when relying on this proprietary solution, and the potential time loss at the beginning of each execution (which may or may not be offset by any caching the runtime compilation engine might do).

On the other hand, runtime compilation of the device code allows an even wider range of aspects to be implemented at compile time (on the device side), which may allow even stronger compiler optimizations. For example, global simulation parameters that are constant throughout a simulation are often stored in the device constant memory at the beginning of the simulation; even though access to constant memory is quite efficient, inlining the constants can be even more efficient, and this can be achieved by exploiting runtime compilation, replacing the upload of the constants to the device with appropriate #define directives in the runtime-compiled device code.
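As an illustration of the constant-inlining idea, the host can prepend #define lines to the kernel source before handing it to NVRTC. This is only a sketch with made-up constant and function names (SMOOTHING\_LENGTH, DELTA\_T, make\_device\_source, compile\_to\_ptx), and error checking is omitted:

```
#include <nvrtc.h>
#include <string>

// Build the device source with simulation constants inlined as macros,
// instead of uploading them to constant memory.
std::string make_device_source(const std::string& kernel_src,
                               float smoothing_length, float delta_t)
{
    std::string src;
    src += "#define SMOOTHING_LENGTH " + std::to_string(smoothing_length) + "f\n";
    src += "#define DELTA_T " + std::to_string(delta_t) + "f\n";
    src += kernel_src;
    return src;
}

// Compile the assembled source to PTX with the NVRTC library.
void compile_to_ptx(const std::string& src, std::string& ptx)
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "particles.cu", 0, NULL, NULL);
    const char *opts[] = { "--gpu-architecture=compute_60" };
    nvrtcCompileProgram(prog, 1, opts);
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    ptx.resize(ptx_size);
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
}
```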

### 6.2.3 C++ SFINAE versus C preprocessors for compile-time specialization

Implementing multiple variations of a kernel (or function used by a kernel) is generally nontrivial, as the functions may have different sets of parameters and different private variables and may operate differently even on data that is present in all or most of them. The objective is therefore to isolate the variant-specific parts from the common parts, avoiding code repetition.

When using runtime compilation for the device code and a C-based language such as OpenCL C, the only way to achieve this is to fence the irrelevant parts of the code with appropriate preprocessor directives, as illustrated in Listing 10 (left). Code fencing can be factored out, and sometimes reduced, by collecting data into conditional structures and refactoring computations into conditional functions; the resulting code is slightly more verbose (Listing 10, right), but the optional features are better isolated, improving the maintainability of the code.

Note that the conditional structure in this example cannot be extended to the kernel parameters and private variables themselves, due to the impossibility for global array addresses to be members of structures shared between host and device, which further limits the possibility to initialize the conditional parts of the private variable structure with appropriate conditional functions. (This is a limitation of OpenCL C; alternative solutions are possible using the Shared Virtual Memory feature introduced in OpenCL 2.0 and supported by some implementations as an extension on older versions of the standard.)

When the device code can be written in C++ and global array pointers are made available in the same form to both the host and the device (e.g., with CUDA), the language itself provides powerful meta-programming techniques that can be leveraged to eliminate the need for a preprocessor, allowing multiple specialized implementations to coexist.



Conditional structures and functions in C++ can be implemented by using templates and the meta-programming feature of the language known as SFINAE (substitution failure is not an error) to select function specializations based on any combination of (compile-time) properties of their parameters. The approach we show requires two building blocks that are available in the standard library since C++11 (and can even be implemented in older C++ versions) and a special empty structure template.


The first building block is a structure template.

template<bool B, typename T, typename F> struct conditional;

such that conditional<somecondition, SomeType, SomeOtherType>::type corresponds to SomeType when somecondition is true and to SomeOtherType when the condition is false. This is part of C++11 and can also be defined in previous versions of the standard [39].

The purpose of the empty structure template is to "absorb" any type, construct from anything, and otherwise be empty. Using C++11 variadic templates for the constructor, it can be implemented as:

```
template<typename T> struct empty {
    template<typename... U> empty(U... args) {}
};
```
In older versions of C++, the single variadic constructor template must be replaced with multiple constructor templates, each taking a different number of arguments.

With these building blocks, we can define our conditional structure support type, relying on the C++11 template alias (using) syntax:

```
template<bool B, typename T>
using cond_struct=typename
     conditional< B, T, empty<T> >::type;
```
When forced to use older C++ versions, something similar but less robust can be implemented with a macro such as:

#define COND\_STRUCT(cond, ...) typename conditional<cond, \_\_VA\_ARGS\_\_, empty<\_\_VA\_ARGS\_\_> >::type

A structure with optional members can then be defined by defining multiple structures grouping each set of optional members and then defining the main structure as derived from all substructures, each wrapped in their own cond\_struct<>, as illustrated in Listing 11. We see how the individual groups of members for the final structure are refactored into simpler structures, carrying their own initialization information. We also see why the empty structure needs to be a template: if this was not the case, and both XSPH and the stress tensor computation were disabled, the kernel parameters structure would have empty as a base class twice, which is not allowed by the standard; with our template approach, the two empty base classes are now formally distinct types: empty<xsph\_kernel\_params> and empty<stress\_kernel\_params>. The sample code also shows the advantage of the BufferList class described previously and its typed buffer access methods. On the device side, we can use the same approach for the private variables of the kernel (Listing 12).

Finally, we need to define the individual process functions. For this, we need separate overloads depending on whether the priv structure has the specific members or not. One way to achieve this is to make all functions depend on the same template parameters as the structure, but when there are many parameters, this becomes quite complex and hard to maintain and extend, since every additional parameter will require a change in all the functions that access the structure, regardless of whether the additional parameter actually has an impact.

### Listing 10.

Code fencing for optional components in the case of runtime device code compilation: inline approach (left) and refactored approach with code isolation (right).
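Since the caption refers to two side-by-side variants, here is a sketch of both, written in OpenCL C; the kernel name some\_kernel, the buffer names, and the feature macros HAS\_XSPH and HAS\_STRESS\_TENSOR are assumptions chosen to match the examples used elsewhere in this section. First the inline approach:

```
/* Inline approach: every optional feature is fenced directly in the kernel. */
kernel void some_kernel(
    global const float4 * restrict posArray,
    global const float4 * restrict velArray,
#ifdef HAS_STRESS_TENSOR
    global float4 * restrict stressTensor4,
    global float4 * restrict stressTensor2,
#endif
#ifdef HAS_XSPH
    global float4 * restrict xsphArray,
#endif
    global float4 * restrict forces)
{
    /* common private variables go here */
#ifdef HAS_STRESS_TENSOR
    /* stress tensor private variables go here */
#endif
#ifdef HAS_XSPH
    /* XSPH private variables go here */
#endif
    /* common computations go here */
#ifdef HAS_STRESS_TENSOR
    /* stress tensor computations go here */
#endif
#ifdef HAS_XSPH
    /* XSPH computations go here */
#endif
}
```

and the refactored approach, with the optional data collected into a conditional structure and the optional computations into conditional functions:

```
/* Refactored approach: optional data and computations are isolated. */
struct private_vars {
    /* common private variables */
#ifdef HAS_STRESS_TENSOR
    /* stress tensor private variables */
#endif
#ifdef HAS_XSPH
    /* XSPH private variables */
#endif
};

void process_common(struct private_vars *priv)
{ /* common computations go here */ }

void process_xsph(struct private_vars *priv)
#ifdef HAS_XSPH
{ /* XSPH computations go here */ }
#else
{} /* intentionally left blank */
#endif

void process_stress_tensor(struct private_vars *priv)
#ifdef HAS_STRESS_TENSOR
{ /* stress tensor computations go here */ }
#else
{} /* intentionally left blank */
#endif

kernel void some_kernel(/* same parameter list as in the inline approach */)
{
    struct private_vars priv;
    /* initialize common part of priv */
#ifdef HAS_STRESS_TENSOR
    /* initialize stress tensor part of priv */
#endif
#ifdef HAS_XSPH
    /* initialize XSPH part of priv */
#endif
    process_common(&priv);
    process_xsph(&priv);
    process_stress_tensor(&priv);
}
```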

### Listing 11.

Conditional structures with C++ applied to kernel parameters: definition of the optional members (left) and definition of the conditional structure template including them (right).

```
struct common_kernel_params {
    const float4 * restrict posArray;
    const float4 * restrict velArray;
    float4 * restrict forcesArray;
    common_kernel_params(BufferList& buffers)
        : posArray(buffers.getData<BUFFER_POS>())
        , velArray(buffers.getData<BUFFER_VEL>())
        , forcesArray(buffers.getData<BUFFER_FORCES>())
    {}
};

struct xsph_kernel_params {
    float4 * restrict xsphArray;
    xsph_kernel_params(BufferList& buffers)
        : xsphArray(buffers.getData<BUFFER_XSPH>())
    {}
};

struct stress_kernel_params {
    float4 * restrict stressTensor4;
    float4 * restrict stressTensor2;
    stress_kernel_params(BufferList& buffers)
        : stressTensor4(buffers.getData<BUFFER_TAU4>())
        , stressTensor2(buffers.getData<BUFFER_TAU2>())
    {}
};

template<
    /* actual template parameters */
    bool needs_xsph, bool needs_stress_tensor,
    /* pseudo-template parameters, used to give simpler
       names to the conditional structure members */
    typename optional_xsph =
        cond_struct<needs_xsph, xsph_kernel_params>,
    typename optional_stress =
        cond_struct<needs_stress_tensor, stress_kernel_params>
>
struct kernel_params
    : common_kernel_params
    , optional_xsph
    , optional_stress
{
    /* These static variables allow compile-time knowledge
       about the parameters used for the specific
       instantiation of the template */
    static const bool has_xsph = needs_xsph;
    static const bool has_stress = needs_stress_tensor;

    kernel_params(BufferList& buffers)
        : common_kernel_params(buffers)
        , optional_xsph(buffers)
        , optional_stress(buffers)
    {}
};
```

A simple way is to make the functions into templates depending on a single parameter (the arbitrary type of the structure passed), and then make overloads based on specific properties of the actual structure that gets passed. This can be achieved by means of enable\_if, a structure template declared as.


template<bool B, typename T=void> struct enable\_if;

which is such that enable\_if<condition, SomeType>::type is SomeType when the condition is true and an error otherwise. Due to the SFINAE principle, when the compiler is looking for the overload of a function to use, it will discard (without errors) the overloads which result in an error and automatically select the one which does not result in an error. Additionally, if SomeType is omitted, void is implied, which can simplify the syntax. Again, this template is provided by the standard library in C++11 and can be implemented in older versions of C++ [40].

To further simplify the syntax, we assume that C++11 is available and we can define:

template<bool B, typename T=void> using enable\_if\_t = typename enable\_if<B, T>::type;

(which is pre-defined in C++14).


### Listing 12.

Conditional structures with C++ applied to private function variables: definitions of the optional members (left) and definition of the conditional structure template including them (right).
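A sketch of what the caption describes, mirroring the kernel parameters of Listing 11 on the private-variable side; the structure names (common\_kernel\_priv, xsph\_kernel\_priv, stress\_kernel\_priv, kernel\_priv) are assumptions, and the member lists are left as placeholder comments:

```
struct common_kernel_priv {
    /* common variables become members of this structure */
    common_kernel_priv(common_kernel_params const& params)
    { /* initialize the variables from the parameters */ }
};

struct xsph_kernel_priv {
    /* XSPH-specific variables become members of this structure */
    xsph_kernel_priv(xsph_kernel_params const& params)
    { /* typically, feature-specific variables will be
         initialized from feature-specific kernel parameters */ }
};

struct stress_kernel_priv {
    /* stress-tensor-specific variables become members of this structure */
    stress_kernel_priv(stress_kernel_params const& params)
    { /* typically, feature-specific variables will be
         initialized from feature-specific kernel parameters */ }
};

template<
    bool needs_xsph, bool needs_stress_tensor,
    typename optional_xsph =
        cond_struct<needs_xsph, xsph_kernel_priv>,
    typename optional_stress =
        cond_struct<needs_stress_tensor, stress_kernel_priv>
>
struct kernel_priv
    : common_kernel_priv
    , optional_xsph
    , optional_stress
{
    /* compile-time knowledge about the features enabled
       for this specific instantiation of the template */
    static const bool has_xsph = needs_xsph;
    static const bool has_stress = needs_stress_tensor;

    kernel_priv(kernel_params<needs_xsph, needs_stress_tensor> const& params)
        : common_kernel_priv(params)
        , optional_xsph(params)
        , optional_stress(params)
    {}
};
```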


The processing functions can then be defined as in Listing 13, with separate overloads made available based on compile-time properties of the argument. The kernel structure becomes very simple (Listing 14): all of the complexity has been delegated to specific (sub)structures and functions, and we have a guarantee that each specialized version of the kernel will only contain the code and variables that are pertinent to its functionality.

#### Listing 13.


Function specialization with overloads based on argument properties with enable-if in CUDA.
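A sketch of the overload set the caption describes, using the enable\_if\_t helper defined above together with the has\_xsph/has\_stress flags of the (assumed) kernel\_priv structure; the process function names follow the sketch given for Listing 10:

```
/* The common code needs no special treatment. */
__device__ void
process_common(common_kernel_priv& priv)
{ /* common computations here */ }

/* process_xsph is overloaded on the static has_xsph member of the priv
   structure: only the overload whose enable_if_t condition holds survives
   substitution, so each kernel instantiation gets exactly one of them. */
template<typename Priv>
__device__ enable_if_t<Priv::has_xsph>
process_xsph(Priv& priv)
{ /* XSPH computations here */ }

template<typename Priv>
__device__ enable_if_t<!Priv::has_xsph>
process_xsph(Priv& priv)
{} /* intentionally left blank */

/* Similarly for the stress tensor, using has_stress. */
template<typename Priv>
__device__ enable_if_t<Priv::has_stress>
process_stress_tensor(Priv& priv)
{ /* stress tensor computations here */ }

template<typename Priv>
__device__ enable_if_t<!Priv::has_stress>
process_stress_tensor(Priv& priv)
{} /* intentionally left blank */
```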


#### Listing 14.

Kernel structure after isolation of the optional components.
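A sketch of the resulting kernel along the lines the caption describes, again assuming the kernel\_params structure of Listing 11 and a kernel\_priv structure as sketched for Listing 12:

```
template<
    /* template parameters for the kernel */
    bool needs_xsph, bool needs_stress_tensor,
    /* shorthand for the kernel parameters struct */
    typename Params = kernel_params<needs_xsph, needs_stress_tensor>,
    /* shorthand for the kernel private variables */
    typename Priv = kernel_priv<needs_xsph, needs_stress_tensor>
>
__global__ void some_kernel(Params params)
{
    /* initialize both common and optional private variables */
    Priv priv(params);
    /* run common and optional parts of the code: the disabled
       parts resolve to the intentionally blank overloads */
    process_common(priv);
    process_xsph(priv);
    process_stress_tensor(priv);
}
```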



#### Listing 15.

Runtime switch to call the appropriate compile-time kernel specialization.

if (opt\_xsph && opt\_stress)
some\_kernel<<<…>>>(kernel\_params<true, true>(buffers));
else if (opt\_xsph && !opt\_stress)
some\_kernel<<<…>>>(kernel\_params<true, false>(buffers));
else if (!opt\_xsph && opt\_stress)
some\_kernel<<<…>>>(kernel\_params<false, true>(buffers));
else if (!opt\_xsph && !opt\_stress)
some\_kernel<<<…>>>(kernel\_params<false, false>(buffers));

Runtime selection of the variant of the kernel to be used can be achieved with simple conditionals (Listing 15). However, when the number of conditionals is large, this can be rather bothersome to write; more compact and efficient solutions to the runtime switch are possible, using the meta-programming techniques presented in [41].

When using C macros, the multiple specialized variants of the kernel and related structures and functions cannot coexist in the same compilation unit (since C does not support overloading or templates), making runtime selection of the compile-time variant impossible: a single specific instance must be selected when the device code is compiled; on the upside, one would generally use C when using OpenCL C, for which the device code is compiled at application runtime, as discussed in the previous section.

In terms of syntax, the only significant downside of the SFINAE approach is that the signature needs to be repeated for every specialization, in contrast to the C macro approach, for which we only need one signature, and each implementation is fenced by #if/#elif/#else/#endif. This could be avoided using the C++17 feature if constexpr, but support for it in device code is still missing.
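For comparison, a hypothetical C++17 version of one of the process functions using if constexpr, which would avoid repeating the signature (as noted above, this was not yet supported in device code at the time of writing):

```
template<typename Priv>
__device__ void process_xsph(Priv& priv)
{
    if constexpr (Priv::has_xsph) {
        /* XSPH computations here */
    }
    /* nothing to do when XSPH is disabled */
}
```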

### 7. Conclusions

Particle systems are a fundamental aspect of many applications and numerical methods. By their own nature, they benefit from the massively parallel stream processing architecture of modern GPUs, but naive implementations can easily encounter pitfalls that can limit the full exploitation of the hardware.


While hardware vendors go to great lengths to support more liberal coding, the software can—and should—be designed to leverage the natural programming model of the hardware, and we have provided several insights on how the particle systems can be designed to fit better with the requirements of optimal GPU usage. We also presented a few simple ideas that, when taken into consideration during an initial implementation, can make future extensions much easier. Many of the suggestions we provide can also be of general interest beyond the implementation of particle systems.

We have stressed the importance of choosing the correct approach in dealing with the potentially severe limitations in the numerical robustness of the implementation, due to the restricted accuracy and precision of the single-precision floating-point format which GPUs are optimized for. While many of the techniques we have presented are not new (some going as far back as the nineteenth century), in our experience they have surprisingly limited adoption; we hope that our discussion of their usefulness in this context will lead to higher awareness of the possibilities they offer. We dislike the adoption of higher-precision data types as a solution to the issue, not only because of the performance implications on consumer hardware, but as a philosophical objection to waste: why use the wrong numerical approach, wasting the additional precision granted by double precision, when the correct approach can make single precision sufficient? We do understand the need for extended precision in many applications, and we are sure that our reminiscence of classical solutions to better accuracy can benefit them as well, particularly since support for even higher-precision data types is nearly nonexistent in hardware (with the possible exception of the IBM POWER9 support for IEEE-754-compliant 128-bit floating-point formats) and especially on GPUs.

### Author details


Giuseppe Bilotta<sup>1</sup> \*, Vito Zago<sup>1</sup> and Alexis Hérault<sup>2</sup>

1 Istituto Nazionale di Geofisica e Vulcanologia, Sezione di Catania, Italy

2 Conservatoire National des Arts et Métiers, Laboratoire Modélisation mathématique et numérique, Paris, France

\*Address all correspondence to: giuseppe.bilotta@ingv.it

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## References

[1] Reeves WT. Particle systems—A technique for modeling a class of fuzzy objects. ACM Transactions on Graphics. 1983;2(2):91-108. DOI: 10.1145/357318.357320

[2] Paramount. Star Trek II: The Wrath of Khan [film]. 1982

[3] Unity Technologies. Particle Systems. In: Unity User Manual v2018.2 [Internet]. 2018. Available from: https://docs.unity3d.com/Manual/index.html [Accessed: July 18, 2018]

[4] Monaghan JJ. Smoothed particle hydrodynamics and its diverse applications. Annual Review of Fluid Mechanics. 2012;44:323-346. DOI: 10.1146/annurev-fluid-120710-101220

[5] Chen JS, Liu WK, Hillman MC, Chi SW, Lian Y, Bessa MA. Reproducing kernel particle method for solving partial differential equations. In: Stein E, Borst R, Hughes TK, editors. Encyclopedia of Computational Mechanics. 2nd ed. Chichester, UK: Wiley; 2017. DOI: 10.1002/9781119176817.ecm2104

[6] Tiwari S, Kuhnert J. Finite pointset method based on the projection method for simulations of the incompressible Navier-Stokes equations. In: Griebel M, Schweitzer MA, editors. Meshfree Methods for Partial Differential Equations. Lecture Notes in Computational Science and Engineering. Vol. 26. Berlin, Heidelberg: Springer; 2003

[7] Bićanić N. Discrete element methods. In: Stein E, Borst R, Hughes TK, editors. Encyclopedia of Computational Mechanics. Chichester, UK: Wiley; 2017. DOI: 10.1002/0470091355.ecm006.pub2

[8] Kennedy J, Eberhart RC. Particle swarm optimization. In: Proceedings of ICNN'95—International Conference on Neural Networks; November 27–December 1, 1995; 2002. pp. 1942-1948. DOI: 10.1109/ICNN.1995.488968

[9] Stone JE, Phillips JC, Freddolino PL, Hardy DJ, Trabuco LG, Schulten K. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry. 2007;28:2618-2640. DOI: 10.1002/jcc.20829


[10] Richmond P. Template driven agent based modelling and simulation with CUDA. In: Hwu WM, editor. GPU Computing Gems Emerald Edition. Boston, USA: Morgan Kaufmann; 2011

[11] Richmond P, Walker D, Coakley S, Romano D. High performance cellular level agent-based simulation with FLAME for the GPU. Briefings in Bioinformatics. 2013;11(3):334-347. DOI: 10.1093/bib/bbp073

[12] Vuduc R, Choi J. A brief history and introduction to GPGPU. In: Shi X, Kindratenko V, Yang C, editors. Modern Accelerator Technologies for Geographic Information Science. Boston, MA: Springer; 2013. DOI: 10.1007/978-1-4614-8745-6\_2

[13] Hérault A, Bilotta G, Dalrymple RA. SPH on GPU with CUDA. Journal of Hydraulic Research. 2010;48(Extra Issue):74-79. DOI: 10.1080/00221686.2010.9641247

[14] Bilotta G. GPU Implementation and Validation of Fully Three-Dimensional Multi-Fluid SPH Models. Rapporto Tecnico. 2014:292 INGV. Available from: http://istituto.ingv.it/images/collane-editoriali/rapporti%20tecnici/rapporti-tecnici-2014/rapporto292.pdf [Accessed: September 05, 2018]

[15] Bilotta G, Hérault A, Cappello A, Ganci G, Del Negro C. GPUSPH: a smoothed particle hydrodynamics model for the thermal and rheological evolution of lava flows. In: Harris AJL, De Groeve T, Garel F, Carn SA, editors. Detecting, Modelling and Responding to Effusive Eruptions. Geological Society, London, Special Publications. 2016;426. DOI: 10.1144/SP426.24


[16] Zago V, Bilotta G, Hérault A, Dalrymple RA, Fortuna L, Cappello A, et al. Semi-implicit 3D SPH on GPU for lava flows. Journal of Computational Physics. 2018;375:854-870

[17] Zago V, Bilotta G, Cappello A, Dalrymple RA, Fortuna L, Ganci G, et al. Preliminary validation of lava benchmark tests on the GPUSPH particle engine. Annals of Geophysics. 2018;61. In Press

[18] GPUSPH v4.1 [Internet]. Available from: http://www.gpusph.org/ [Accessed: September 05, 2018]

[19] Khronos OpenCL Working Group. The OpenCL™ Specification [Internet]. 2018. Available from: https://www.khronos.org/registry/OpenCL/ [Accessed: September 05, 2018]

[20] NVIDIA. CUDA C Programming Guide [Internet]. 2018. Available from: https://docs.nvidia.com/cuda/ [Accessed: September 05, 2018]

[21] Green S. Particle Simulation Using CUDA. NVIDIA Corporation Whitepaper. 2010. Available from: http://developer.download.nvidia.com/assets/cuda/files/particles.pdf [Accessed: September 05, 2018]

[22] Bédorf J, Gaburov E, Portegies Zwart P. A sparse octree gravitational N-body code that runs entirely on the GPU processor. Journal of Computational Physics. 2012;231(7):2825-2839. DOI: 10.1016/j.jcp.2011.12.024

[23] Rustico E, Jankowski JA, Hérault A, Bilotta G, Del Negro C. Multi-GPU, multi-node SPH implementation with arbitrary domain decomposition. In: Proceedings of the 9th SPHERIC Workshop; March 2014; Paris; 2014. pp. 127-133

[24] Burtscher M, Pingali K. Chapter 6: An efficient CUDA implementation of the tree-based Barnes Hut N-body. In: GPU Computing Gems Emerald Edition. Boston, USA: Morgan Kaufmann; 2011. pp. 75-92. DOI: 10.1016/B978-0-12-384988-5.00006-1

[25] Morton GM. A computer oriented geodetic data base; and a new technique in file sequencing. IBM Technical Report; 1966

[26] Rustico E, Bilotta G, Hérault A, Del Negro C, Gallo G. Advances in multi-GPU smoothed particle hydrodynamics simulations. IEEE Transactions on Parallel and Distributed Systems. 2012;25(1):43-52. DOI: 10.1109/TPDS.2012.340

[27] Li XS, Demmel JW, Bailey DH, Henry G, Hida Y, Iskandar J, et al. Design, implementation and testing of extended and mixed precision BLAS. ACM Transactions on Mathematical Software. 2002;28(2):152-205. DOI: 10.1145/567806.567808

[28] Colberg PH, Höfling F. Highly accelerated simulations of glassy dynamics using GPUs: Caveats on limited floating-point precision. Computer Physics Communications. 2011;182:1120-1129. DOI: 10.1016/j.cpc.2011.01.009

[29] Horner WG. A new method of solving numerical equations of all orders, by continuous approximation. Philosophical Transactions of the Royal Society of London. 1819;109:308-335. JSTOR: http://www.jstor.org/stable/107508

[30] Ostrowski AM. On two problems in abstract algebra connected with Horner's rule. In: von Mises R, editor. Studies in Mathematics and Mechanics. New York, USA: Academic Press; 2013. pp. 40-48. DOI: 10.1016/B978-1-4832-3272-0.50010-7

[31] Pan VY. Methods of computing values of polynomials. Russian Mathematical Surveys. 1966;21(1):105-136. DOI: 10.1070/RM1966v021n01ABEH004147

[32] Kahan WM. Further remarks on reducing truncation errors. Communications of the ACM. 1964; 8(1):40. DOI: 10.1145/363707.363723

[33] Neumaier A. Rundungsfehleranalyse einiger Verfahren zur Summation endlicher Summen. Zeitschrift für Angewandte Mathematik und Mechanik. 1974;54:39-51. DOI: 10.1002/zamm.19740540106

[34] Klein A. A generalized Kahan-Babuška-Summation-algorithm. Computing. 2006;76(3–4):279-293. DOI: 10.1007/s00607-005-0139-x

[35] Blinn JF. Floating-point tricks. IEEE Computer Graphics and Applications. 1997;17(4):80-84. DOI: 10.1109/ 38.595279

[36] id Software. Quake III: Arena [Video Game]; 1999

[37] Domínguez JM, Crespo AJC, Barreiro A, Rogers BD, Gómez-Gesteira M. Efficient implementation of double precision in GPU computing to simulate realistic cases with high resolution. In: Proceedings of the 9th SPHERIC Workshop; March 2014; Paris; 2014. pp. 140-145

[38] Hérault A, Bilotta G, Dalrymple RA. Achieving the best accuracy in an SPH implementation. In: Proceedings of the

9th SPHERIC Workshop; March 2014; Paris; 2014. pp. 134-139

[39] cppreference. std::conditional [Internet]. 2018. Available from: https:// en.cppreference.com/w/cpp/types/cond itional [Accessed: September 05, 2018]

[40] cppreference. std::enable\_if [Internet]. 2018. Available from: https:// en.cppreference.com/w/cpp/types/enab le\_if [Accessed: September 05, 2018]

[41] Langr D, Tvrdík P, Dytrych T, Draayer JP. Fake run-time selection of template arguments in C++. In: Furia CA, Nanz S, editors. Objects, Models, Components, Patterns. TOOLS 2012. Lecture Notes in Computer Science. Vol. 7304. Berlin, Heidelberg: Springer; 2012. DOI: 10.1007/978-3-642-30561-0\_11

## High Performance Parallel Computing

*Edited by Satyadhyan Chickerur*

This edited book presents the state of the art in research and development at the convergence of high-performance computing and parallel programming for a range of engineering and scientific applications. It consolidates algorithms, techniques, and methodologies that bridge the gap between the theoretical foundations developed in academia and the implementations needed for research, business, and other real-time applications. The book outlines techniques and tools for emerging areas and domains, including the acceleration of large-scale electronic structure simulations with heterogeneous parallel computing, the characterization of power and energy efficiency of a data-centric high-performance computing runtime and its applications, security applications of GPUs, parallel implementation of multiprocessors on MPI using FDTD, particle-based fused rendering, the design and implementation of high-performance particle systems for mesh-free methods, and evolving topics in heterogeneous computing. As the need to converge HPC, IoT, and cloud-based applications grows, this volume aims to help bridge that gap.

Published in London, UK © 2019 IntechOpen © Vladimir\_Timofeev / iStock
