point to be processed.<sup>14</sup> Because the dataset size may not be evenly divisible by the local size, a range check is performed to guarantee that no out-of-range access occurs. Finally, since the prediction errors for a given program are spread among the local work-items at the end of the execution, an error reduction operation takes place.

<sup>14</sup> The careful reader will note that this index calculation leads to an efficient coalesced memory access pattern [1, 32].


**Algorithm 3:** GPU PDP's OpenCL kernel

*localid* ← get\_local\_id(); *groupid* ← get\_group\_id(); *program* ← NthProgram(*groupid*);
*error* ← 0.0;
**for** *i* ← 0 **to** ⌈*datasetsize*/*localsize*⌉ − 1 **do**
  *n* ← *i* × *localsize* + *localid*;
  **if** *n* < *datasetsize* **then**
    *error* ← *error* + |Interpreter(*program*, *n*) − *Y*[*n*]|;
*E*[*groupid*] ← ErrorReduction(0, . . . , *localsize* − 1);
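To make the kernel concrete, below is a minimal OpenCL C sketch of how Algorithm 3 might be written. It is an illustration under stated assumptions rather than the chapter's actual implementation: the flattened `programs` buffer with a fixed `program_size`, the stub `interpreter`, and the power-of-two local size assumed by the reduction are all hypothetical. Only the one-work-group-per-program mapping, the strided loop with its range check, and the final error reduction follow the pseudocode.

```c
/* Hypothetical stand-in for the GP interpreter: a real system would decode
 * and execute the encoded program here; this stub just evaluates a linear
 * model so that the sketch is self-contained. */
float interpreter(__global const float *program, __global const float *X, int n)
{
    return program[0] * X[n] + program[1];
}

__kernel void pdp_evaluate(__global const float *programs, /* all programs, flattened (assumed layout) */
                           const int program_size,         /* slots per program (assumed layout) */
                           __global const float *X,        /* input data points */
                           __global const float *Y,        /* expected outputs */
                           const int dataset_size,
                           __global float *E,              /* one error per program (work-group) */
                           __local float *partial)         /* scratch: local_size floats */
{
    const int localid   = (int)get_local_id(0);
    const int groupid   = (int)get_group_id(0);
    const int localsize = (int)get_local_size(0);

    /* NthProgram(groupid): one work-group per program. */
    __global const float *program = programs + groupid * program_size;

    float error = 0.0f;

    /* Work-item 'localid' processes points localid, localid + localsize, ...,
     * so neighbouring work-items read neighbouring addresses (coalesced);
     * the range check covers dataset sizes not divisible by the local size. */
    for (int i = 0; i < (dataset_size + localsize - 1) / localsize; ++i) {
        const int n = i * localsize + localid;
        if (n < dataset_size)
            error += fabs(interpreter(program, X, n) - Y[n]);
    }

    /* ErrorReduction(0, ..., localsize - 1): tree sum in local memory
     * (assumes the local size is a power of two). */
    partial[localid] = error;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int s = localsize / 2; s > 0; s >>= 1) {
        if (localid < s)
            partial[localid] += partial[localid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (localid == 0)
        E[groupid] = partial[0];
}
```

On the host side, such a kernel would typically be enqueued with a global size of *populationsize* × *localsize*, so that each of the *populationsize* work-groups evaluates one program; the strided indexing *n* = *i* × *localsize* + *localid* is what yields the coalesced accesses mentioned in the footnote.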

**5. Conclusions**

This chapter has presented different strategies for accelerating the execution of a genetic programming algorithm by parallelizing its costly evaluation phase on the GPU architecture, a high-performance processor that is also energy efficient and affordable.

Of the three studied strategies, two are particularly well suited to the GPU architecture: (i) data-level parallelism (DP), which is very simple and remarkably efficient for large datasets; and (ii) program- and data-level parallelism (PDP), which is not as simple as DP but exhibits the same degree of efficiency for large datasets and has the advantage of being efficient for small datasets as well.

To date, only a few large, real-world problems have been solved by GP with the help of the massive parallelism of GPUs. This suggests that the potential of GP is still under-explored, and that the next big step for GP on GPUs may be its application to such challenging problems. In several domains, such as bioinformatics, the amount of data is growing quickly, making it progressively harder for specialists to infer models manually. Heterogeneous computing, which combines the computational power of different devices with the possibility of programming uniformly for any architecture and vendor, is another promising research direction for boosting the performance of GP; although OpenCL offers both of these advantages, it remains fairly unexplored in the field of evolutionary computation. Finally, although optimization techniques have not been thoroughly discussed in this chapter, they are certainly an important subject; the reader is invited to consult the related material in [4], as well as general GPU optimization techniques from the literature.

**Acknowledgements**

The authors would like to acknowledge the support provided by CNPq (grants 308317/2009-2 and 300192/2012-6) and FAPERJ (grant E-26/102.025/2009).

**Author details**

Douglas A. Augusto and Heder S. Bernardino

*Laboratório Nacional de Computação Científica (LNCC/MCTI), Rio de Janeiro, Brazil*

Helio J. C. Barbosa

*Laboratório Nacional de Computação Científica (LNCC/MCTI), Rio de Janeiro, Brazil*
*Federal University of Juiz de Fora (UFJF), Computer Science Dept., Minas Gerais, Brazil*
