**2. GPU programming**

Graphics processing units were originally built exclusively to execute graphics operations, mainly pixel-level image processing, such as computing each individual pixel's color, applying filters, and the like. In video or game processing, for instance, the task is to process batches of pixels within a short time frame—an operation known as *frame rendering*—in order to display smooth and fluid images to the viewer or player.

Pixel operations tend to be highly independent of one another; in other words, each individual pixel can be processed at the same time as any other, leading to what is known as *data parallelism*, or SIMD. Although it makes the hardware less general, designing an architecture targeted at one specific type of workload, like data parallelism, may result in a very efficient processor. This is one main reason why GPUs deliver excellent performance with respect to power consumption, price, and density. Another major reason behind such performance is the remarkable growth of the game industry in recent years and the fact that computer games have become more and more complex, pushing forward the development of GPUs while making them ubiquitous.

At some point, GPU development was advancing so quickly, and the architecture progressively gaining the ability to execute a wider range of sophisticated instructions, that the GPU eventually earned the status of a general-purpose processor, although it remains an essentially data-parallel architecture. That point marked the beginning of the exploitation of the graphics processing unit as a parallel accelerator for a much broader range of applications beyond video and game processing.

## **2.1. GPU architecture**

The key design philosophy responsible for the GPU's great efficiency is the maximization of the number of transistors dedicated to actual computing—i.e., arithmetic and logic units (ALU)—which are packed into many small and relatively simple processors [26]. This is rather different from the modern multi-core CPU architecture, which has large and complex cores, reserving a considerable area of the processor die for other functional units, such as control units (out-of-order execution, branch prediction, speculative execution, etc.) and cache memory [21].

This design difference reflects the different purposes of those architectures. While the GPU is optimized to handle data-parallel workloads with regular memory accesses, the CPU is designed to be more generic and thus must handle, with reasonable performance, a larger variety of workloads, including MIMD parallelism, divergent branches, and irregular memory accesses. There is also another important conceptual difference between them. Much of the extra CPU complexity is devoted to reducing the latency of executing a single task, which classifies the architecture as *latency-oriented* [14]. Conversely, instead of executing single tasks as fast as possible, GPUs are *throughput-oriented* architectures: they are designed to optimize throughput, that is, the number of tasks completed per unit of time.

## **2.2. Open Computing Language – OpenCL**

The Open Computing Language, or simply OpenCL, is an open specification for heterogeneous computing released by the Khronos Group<sup>2</sup> in 2008 [25]. It resembles the NVIDIA CUDA<sup>3</sup> platform [31], but can be considered a superset of the latter; they differ mainly in the following points. OpenCL (i) is an open specification managed by a set of distinct representatives from industry, software development, academia, and so forth; (ii) is meant to be implemented by any compute device vendor, whether they produce CPUs, GPUs, hybrid processors, or other accelerators such as digital signal processors (DSP) and field-programmable gate arrays (FPGA); and (iii) is portable across architectures, meaning that a parallel code written in OpenCL is guaranteed to run correctly on every other supported device.<sup>4</sup>

<sup>2</sup> http://www.khronos.org/opencl

<sup>3</sup> CUDA is an acronym for *Compute Unified Device Architecture*, NVIDIA's toolkit for GP-GPU programming.

<sup>4</sup> It is worth noting that OpenCL only guarantees *functional portability*, i.e., there is no guarantee that the same code will perform equally well across different architectures (performance portability), since some low-level optimizations might fit a particular architecture better than others.

#### *2.2.1. Hardware model*

In order to achieve code portability, OpenCL employs an abstract device architecture that standardizes a device's processing units and memory scopes. All supported OpenCL devices must expose this minimum set of capabilities, although they may differ in capacity and internal hardware implementation. Figure 1 illustrates the general OpenCL device abstraction. The terms SPMD, SIMD, and PC are mostly GPU-specific, though; they could be safely ignored for the sake of code portability, but understanding them is important for writing efficient code for this architecture, as will become clear later on.

An OpenCL device has one or more *compute units* (CU), and there is at least one *processing element* (PE) per compute unit, which actually performs the computation. Such layers are meant (i) to encourage better partitioning of the problem towards fine granularity and low communication, hence increasing the scalability to fully leverage a large number of CUs when available; and (ii) to potentially support more restricted compute architectures, by not strictly enforcing parallelism among CUs while still ensuring that the device is capable of synchronization, which can occur among PEs within each CU [15].
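
As a concrete illustration (a minimal sketch, not taken from the chapter), the following C host code queries those two layers on the first available GPU through the standard OpenCL API; error handling is omitted for brevity.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_uint num_cus;
    size_t max_wg_size;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Number of compute units (CUs) exposed by the device. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_cus), &num_cus, NULL);
    /* Upper bound on the number of work-items per work-group. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);

    printf("compute units: %u, max work-group size: %zu\n",
           (unsigned)num_cus, max_wg_size);
    return 0;
}
```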




**Figure 1.** Abstraction of a modern GPU architecture


Figure 1 shows four scopes of memory, namely, *global*, *constant*, *local*, and *private* memory. The global memory is the device's main memory, the biggest but also the slowest of the four in terms of bandwidth and latency, especially for irregular accesses. The constant memory is a small memory slightly optimized for read-only accesses. OpenCL provides two very fast memories: local and private. Both are very small; the main difference between them is that the former is shared among all the PEs within a CU—thus very useful for communication—and the latter is even smaller and reserved for each PE.
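
As an illustration (a minimal sketch, not from the original text), the OpenCL C kernel below marks each of the four scopes with its address-space qualifier; the kernel and its arguments are hypothetical.

```c
/* Program-scope constant memory: small, read-only, slightly optimized. */
__constant float coeff[4] = {0.5f, 0.25f, 0.125f, 0.125f};

__kernel void smooth(__global const float *in,  /* global memory (device RAM) */
                     __global float *out,
                     __local float *tile)       /* local memory, shared per CU */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    float acc;                                  /* private memory (per PE) */

    tile[lid] = in[gid];                        /* stage data in fast local memory */
    barrier(CLK_LOCAL_MEM_FENCE);               /* sync work-items in the group */

    acc = tile[lid] * coeff[0];
    out[gid] = acc;
}
```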

Most modern GPUs are capable of performing not only SIMD parallelism, but also what is referred to as SPMD parallelism (literally *Single Program Multiple Data*), which is the ability to simultaneously execute *different* instructions of the *same* program on many data. This feature is closely related to the architecture's capability of keeping track of multiple instruction streams within a program, which is done by *program counter* (PC) registers. Today's GPUs can usually guarantee that SPMD parallelism exists at least among compute units; in other words, different CUs can execute different instructions in parallel. SPMD parallelism may also exist within CUs, but it occurs among blocks of PEs.<sup>5</sup> For the sake of simplicity, the remainder of this chapter will ignore this possibility and assume that all PEs within a CU can only execute one instruction at a time (SIMD parallelism), sharing a single PC register. A parallelization strategy described in Section 4.4 will show how SPMD parallelism can be exploited in order to produce one of the most efficient parallel algorithms for genetic programming on GPUs.

<sup>5</sup> Those blocks are known as *warps* [32] or *wavefronts* [1].
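
To make the SIMD constraint concrete, consider the hypothetical kernel below (illustrative, not part of the original text): work-items within the same CU that take different sides of the branch must be serialized, whereas distinct CUs, under SPMD, can follow different paths truly in parallel.

```c
__kernel void divergent(__global const int *x, __global int *y) {
    size_t gid = get_global_id(0);
    /* Under a shared PC, the PEs taking one side of this branch sit idle
     * while the others execute, and vice versa (branch divergence). */
    if (x[gid] % 2 == 0)
        y[gid] = x[gid] * 2;
    else
        y[gid] = x[gid] + 1;
}
```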

#### *2.2.2. Software model*

OpenCL specifies two code spaces: the *host* and *kernel* code. The former holds any user-defined code, and is also responsible for initializing the OpenCL platform, managing the device's memory (buffer allocation and data transfer), defining the problem's parallel partitioning, submitting commands, and coordinating executions. The latter, the kernel code, is the actual parallel code that is executed by a compute device.
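
The sketch below (C, OpenCL 1.x; not from the original text) condenses those host-side responsibilities into one function. The kernel name `my_kernel` and its two buffer arguments are hypothetical, and error checking is omitted for brevity.

```c
#include <CL/cl.h>

void run(const char *src, const float *in, float *out, size_t n) {
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    /* Platform initialization: context and command queue. */
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Kernel code is compiled at run time from its source string. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "my_kernel", NULL); /* hypothetical name */

    /* Buffer allocation and host-to-device transfer. */
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  n * sizeof(float), (void *)in, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                  n * sizeof(float), NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);

    /* Command submission and device-to-host read-back. */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n * sizeof(float), out,
                        0, NULL, NULL);
}
```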



An OpenCL kernel is similar to a C function.<sup>6</sup> Due to architectural differences across devices, it has some restrictions, such as prohibiting recursion, but it also adds some extensions, like vector data types and operators. A kernel is intended to be executed in parallel by each processing element, usually with each instance working on a separate subset of the problem. A kernel instance is known as a *work-item*, whereas a group of work-items is called a *work-group*.

<sup>6</sup> The OpenCL kernel's language is derived from the C language.
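
For instance, in the toy kernel sketched below (hypothetical, for illustration only), each work-item processes one element of the arrays, and the `float4` type shows one of the vector extensions mentioned above.

```c
__kernel void saxpy4(const float a,
                     __global const float4 *x,
                     __global float4 *y)
{
    size_t gid = get_global_id(0);    /* this work-item's global index */
    y[gid] = a * x[gid] + y[gid];     /* element-wise on four floats at once */
}
```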

Work-items within a work-group are executed on a single compute unit and therefore, according to the OpenCL specification, they can share information and synchronize. Determining how work-items are divided into work-groups is a critical phase when decomposing a problem; a bad division may lead to inefficient use of the compute device. Hence, an important part of the parallel modeling concerns defining what is known as the *n-dimensional computation domain*. This amounts to defining the *global size*, which is the total number of work-items, and the *local size*, the number of work-items within a work-group, or simply the work-group's size.
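
A minimal host-side sketch of such a definition is shown below (illustrative; `queue` and `kernel` are assumed to be already-created OpenCL objects): a one-dimensional domain of 1,048,576 work-items split into work-groups of 256.

```c
size_t global_size = 1048576;  /* total number of work-items */
size_t local_size  = 256;      /* work-items per work-group; in OpenCL 1.x
                                  it must divide global_size evenly */

clEnqueueNDRangeKernel(queue, kernel,
                       1,              /* work_dim: 1-D computation domain */
                       NULL,           /* no global offset */
                       &global_size,   /* global size */
                       &local_size,    /* local size (work-group size) */
                       0, NULL, NULL); /* no events */
```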

In summary, when parallelizing the GP's evaluation phase, the two most important modeling aspects are the *kernel* code and the *n*-dimensional computation domain. Section 4 will present these definitions for each parallelization strategy.

**3. Genetic programming on GPU: A bit of history**

It is natural to begin the history of GP on GPUs by referring to the first improvements obtained by parallelizing a GA on programmable graphics hardware. The first work along this line seems to be [41], which proposed a genetic algorithm in which crossover, mutation, and fitness evaluation were performed on graphics cards, achieving speedups of up to 17.1 for large population sizes.

Another GA parallelization on GPUs was proposed in [39], which followed the authors' own ideas explored in [40] for an evolutionary programming technique (called FEP). The proposal, called Hybrid GA, or HGA for short, was evaluated using five test functions, and CPU-GPU as well as HGA-FEP comparisons were made. It was observed that their GA on GPU was more effective and efficient than their previous parallel FEP.

Similarly to [41], [24] performed crossover, mutation, and fitness evaluation on the GPU to solve the problem of packing many granular textures into a large one, which helps modelers freely build virtual scenes without caring about efficient usage of texture memory. Although the CPU implementation performed faster in the cases where the number of textures was very small (compact search space), the GPU implementation is almost two times faster when compared to execution on the CPU.

The well-known satisfiability problem, or SAT for short, is solved on graphics hardware in [30], where a cellular genetic algorithm was adopted. The algorithm was developed using NVIDIA's C for Graphics (Cg) programming toolkit and achieved a speedup of approximately 5. However, the author reports some problems in the implementation process, such as the nonexistence of a pseudo-random number generator and limitations in the texture's size.
