**1. Introduction**


In program inference, the evaluation of how well a candidate solution solves a certain task is usually a computationally intensive procedure. Most of the time, the evaluation involves either submitting the program to a simulation process or testing its behavior on many input arguments; both situations may turn out to be very time-consuming. Things get worse when the optimization algorithm needs to evaluate a population of programs over several iterations, as is the case in genetic programming.

Genetic programming (GP) is well known for being a computationally demanding technique, which is a consequence of its ambitious goal: to automatically generate computer programs, in an arbitrary language, using virtually no domain knowledge. For instance, evolving a classifier, a program that takes a set of attributes and predicts the class they belong to, may be significantly costly depending on the size of the training dataset, that is, the amount of data needed to estimate the prediction accuracy of a single candidate classifier.

Fortunately, GP is an inherently parallel paradigm, making it possible to exploit any number of available computational units, whether just a few or many thousands. Also, it usually does not matter whether the underlying hardware architecture can simultaneously process instructions and data (MIMD) or only data (SIMD).<sup>1</sup> Basically, GP exhibits three levels of parallelism: (i) *population-level* parallelism, when many populations evolve simultaneously; (ii) *program-level* parallelism, when programs are evaluated in parallel; and finally (iii) *data-level* parallelism, in which individual training points for a single program are evaluated simultaneously.
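To make these three levels concrete, the sketch below (not taken from the chapter; all names and sizes are illustrative placeholders) marks where each level appears in a plain serial evaluation loop. Parallelizing GP amounts to distributing one or more of these loops across the available computational units.

```c
#include <stddef.h>

/* Illustrative sizes only. */
#define NUM_POPULATIONS     4     /* (i)   population level */
#define POPULATION_SIZE     256   /* (ii)  program level    */
#define NUM_TRAINING_POINTS 1024  /* (iii) data level       */

/* Placeholder fitness primitive: a real GP system would interpret the
 * program tree on the given training point here. */
static double eval_program(size_t pop, size_t prog, size_t point)
{
    (void)pop; (void)prog; (void)point;
    return 0.0;
}

void evaluate_all(double error[NUM_POPULATIONS][POPULATION_SIZE])
{
    for (size_t p = 0; p < NUM_POPULATIONS; ++p)             /* populations may evolve in parallel        */
        for (size_t i = 0; i < POPULATION_SIZE; ++i)         /* programs may be evaluated in parallel     */
            for (size_t t = 0; t < NUM_TRAINING_POINTS; ++t) /* training points may be processed in parallel */
                error[p][i] += eval_program(p, i, t);
}
```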

Until recently, the only way to leverage the parallelism of GP in order to tackle complex problems was to run it on large high-performance computing installations, which are normally a privilege of a select group of researchers. Although the multi-core era has arrived and popularized parallel machines, the architectural change that is probably going to revolutionize the applicability of GP started about a decade ago, when GPUs began to acquire general-purpose programmability. Modern GPUs have astonishing theoretical computational power and, in terms of programmability, are capable of behaving much like a conventional multi-core CPU. However, there are some intrinsic limitations and workload patterns that may severely degrade the resulting performance if not properly addressed. Hence, this chapter aims at presenting and discussing efficient ways of implementing GP's evaluation phase, at the program and data levels, so as to achieve the maximum throughput on a GPU.

<sup>1</sup> MIMD stands for *Multiple Instruction, Multiple Data*, whereas SIMD means *Single Instruction, Multiple Data*.

The remainder of this chapter is organized as follows. Section 2 gives an overview of the GPU architecture, followed by a brief description of the Open Computing Language, the open standard framework for heterogeneous programming across CPUs and GPUs. Section 3 presents the development history of GP in the pursuit of getting the most out of the GPU architecture. Then, in Section 4, three fundamental parallelization strategies at the program and data levels are detailed and their algorithms presented in a pseudo-OpenCL form. Finally, Section 5 concludes the chapter and points out some interesting directions for future work.

**2. Graphics processing units**

The GPU architecture is rather different from the modern multi-core CPU architecture, which has large and complex cores, reserving a considerable area of the processor die for other functional units, such as control units (out-of-order execution, branch prediction, speculative execution, etc.) and cache memory [21].

This design difference reflects the different purposes of those architectures. While the GPU is optimized to handle data-parallel workloads with regular memory accesses, the CPU is designed to be more generic and must therefore handle, with reasonable performance, a much larger variety of workloads, including MIMD parallelism, divergent branches and irregular memory accesses. There is also another important conceptual difference between them. Much of the extra CPU complexity is devoted to reducing the latency of executing a single task, which classifies the architecture as *latency-oriented* [14]. Conversely, instead of executing single tasks as fast as possible, GPUs are *throughput-oriented* architectures, which means they are designed to optimize the throughput, that is, the number of completed tasks per unit of time.

**2.2. Open Computing Language – OpenCL**

The Open Computing Language, or simply OpenCL, is an open specification for heterogeneous computing released by the Khronos Group<sup>2</sup> in 2008 [25]. It resembles the NVIDIA CUDA<sup>3</sup> platform [31], but can be considered a superset of the latter; they basically differ in the following points. OpenCL (i) is an open specification managed by a set of distinct representatives from industry, software development, academia and so forth; (ii) is meant to be implemented by any compute-device vendor, whether they produce CPUs, GPUs, hybrid processors, or other accelerators such as digital signal processors (DSPs) and field-programmable gate arrays (FPGAs); and (iii) is portable across architectures, meaning that parallel code written in OpenCL is guaranteed to run correctly on every other supported device.<sup>4</sup>

<sup>2</sup> http://www.khronos.org/opencl

<sup>3</sup> CUDA is an acronym for *Compute Unified Device Architecture*, NVIDIA's toolkit for GP-GPU programming.

<sup>4</sup> It is worth noting that OpenCL only guarantees *functional portability*, i.e., there is no guarantee that the same code will perform equally well across different architectures (performance portability), since some low-level optimizations might fit a particular architecture better than others.
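As an illustration of points (ii) and (iii) above, the hedged sketch below shows a minimal OpenCL host-side setup; the same code targets a CPU, a GPU, or another accelerator purely through the device selected at run time. Error handling is omitted, the choice of the first platform and device is arbitrary, and this is not code from the chapter.

```c
#include <CL/cl.h>   /* OpenCL host API */

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;

    /* Pick the first available platform and any device it exposes
     * (CL_DEVICE_TYPE_ALL covers CPUs, GPUs and other accelerators). */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    /* From here on, the host code is device-agnostic. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* ... compile kernels, create buffers and enqueue work here ... */

    clReleaseContext(ctx);
    return 0;
}
```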

*2.2.1. Hardware model*

In order to achieve code portability, OpenCL employs an abstracted device architecture that standardizes a device's processing units and memory scopes. All supported OpenCL devices must expose this minimum set of capabilities, although they may have different capacities and internal hardware implementations. Illustrated in Figure 1 is the general OpenCL device abstraction. The terms SPMD, SIMD and PC are mostly GPU-specific, though; they could be safely ignored on behalf of code portability, but understanding them is important to write efficient code for this architecture, as will become clear later on.
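The capacities that vary from device to device under this common abstraction, such as the number of compute units (introduced just below) or the memory sizes, can be queried uniformly at run time. The sketch below is illustrative only (it assumes a `cl_device_id` obtained as in the previous example) and uses the standard `clGetDeviceInfo` call.

```c
#include <CL/cl.h>
#include <stdio.h>

/* Print some of the capacities a device exposes through the common
 * OpenCL abstraction (error handling omitted for brevity). */
void print_device_limits(cl_device_id device)
{
    cl_uint  compute_units;  /* number of compute units (CUs)     */
    size_t   max_wg_size;    /* maximum work-items per work-group */
    cl_ulong global_mem;     /* global memory capacity, in bytes  */

    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof compute_units, &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof max_wg_size, &max_wg_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof global_mem, &global_mem, NULL);

    printf("compute units: %u, max work-group size: %zu, global memory: %llu bytes\n",
           compute_units, max_wg_size, (unsigned long long)global_mem);
}
```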

An OpenCL device has one or more *compute units* (CU), and there is at least one *processing element* (PE) per compute unit, which actually performs the computation. Such layering is meant (i) to encourage better partitioning of the problem towards fine granularity and low communication, hence increasing the scalability needed to fully leverage a large number of CUs when available; and (ii) to potentially support more restricted compute architectures, by
