**4. Parallelization strategies**

As mentioned in Section 2.2, there are two distinct code spaces in OpenCL: the host and the kernel. The steps of the host code necessary to create the environment for the parallel evaluation phase are summarized as follows [4]:<sup>7</sup>

1. **OpenCL initialization**. This step concerns identifying which OpenCL implementation (platform) and compute devices are available. There may exist multiple devices on the system; in this case one may opt to use a single device or, alternatively, all of them, in which case a further partitioning of the problem is required. Training data points, programs or even whole populations could be distributed among the devices.

2. **Calculating the** *n***-dimensional computation domain**. How the workload is decomposed for parallel processing is of fundamental importance. Strictly speaking, this phase only determines the *global* and *local* sizes in a one-dimensional space, which is enough to represent the domain of training data points or programs. However, in conjunction with a kernel, which implements a certain strategy of parallelization, the type of parallelism (at data and/or program level) and the workload distribution are precisely defined.

3. **Memory allocation and transfer**. In order to speed up data accesses, some content is allocated/transferred directly to the compute device's memory and kept there, thus avoiding as much as possible the relatively narrow bandwidth between the GPU and the computer's main memory. Three memory buffers need to be allocated in the device's global memory in order to hold the training data points, the population of programs, and the error vector. Usually, the training data points are transferred only once, just before the beginning of the execution, and remain unchanged until the end. The population of programs and the error vector, however, are dynamic entities and so need to be transferred at each generation.

4. **Kernel building**. This phase selects the kernel with respect to a strategy of parallelization and builds it. Since the exact specification of the target device is usually not known in advance, the default OpenCL behavior is to compile the kernel just-in-time. Although this procedure introduces some overhead, the benefit of having more information about the device, and therefore being able to generate a better optimized kernel object, usually outweighs the compilation cost.

5. **GP's evolutionary loop**. Since this chapter focuses on accelerating the evaluation phase of genetic programming by parallelizing it, the iterative evolutionary cycle itself is assumed to be performed sequentially, being therefore defined in the host space rather than as an OpenCL kernel.<sup>8</sup> The main iterative evolutionary steps are:

   (a) **Population transfer**. Changes are introduced to programs by the evolutionary process via genetic operators, e.g. crossover and mutation, creating a new set of derived programs. As a result, a population transfer needs to be performed from host to device at each generation.

   (b) **Kernel execution**. Whenever a new population arrives on the compute device, a kernel is launched in order to evaluate (in parallel) the new programs with respect to the training data points. For any non-trivial problem, this step is the most computationally intensive one.

   (c) **Error retrieval**. Finally, after all programs' errors have been accumulated, this vector is transferred back to the host in order to guide the evolutionary process in selecting the set of parents that will breed the next generation.

<sup>7</sup> This chapter will not detail the host code, since it is not relevant to the understanding of the parallel strategies. Given that, and considering that the algorithms are presented in a pseudo-OpenCL form, the reader is advised to consult the appropriate OpenCL literature in order to learn about its peculiarities and fill the implementation gaps.

<sup>8</sup> Bear in mind, however, that a full parallelization, i.e. both evaluation and evolution, is feasible under OpenCL. It could be implemented, for instance, in such a way that a multi-core CPU device performs the evolution in parallel while one or more GPUs evaluate the programs.
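Putting the five steps together, a host-side skeleton in C might look as follows. This is a minimal sketch, not the chapter's actual implementation: the buffer sizes, the kernel name `evaluate` and the argument layout are illustrative assumptions, and error checking is omitted.

```c
/* Hypothetical host-side skeleton for steps 1-5 (OpenCL 1.x C API).
   Sizes, names and the kernel signature are illustrative only.       */
#include <CL/cl.h>

#define DATASET_SIZE 1024   /* number of training points (assumed)     */
#define DIM          4      /* variables per training point (assumed)  */
#define POP_SIZE     256    /* number of programs (assumed)            */
#define MAX_PROG_LEN 128    /* tokens per linearized program (assumed) */

void gp_run(const char *kernel_src, const float *points,
            unsigned *population, float *errors, int generations)
{
    /* 1. OpenCL initialization: take the first platform and GPU found. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 3. Memory allocation: points are transferred once and kept on the
       device; population and errors are rewritten every generation.    */
    cl_mem d_X   = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  DATASET_SIZE * DIM * sizeof(float),
                                  (void *)points, NULL);
    cl_mem d_pop = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                  POP_SIZE * MAX_PROG_LEN * sizeof(unsigned),
                                  NULL, NULL);
    cl_mem d_err = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                  POP_SIZE * sizeof(float), NULL, NULL);

    /* 4. Kernel building: compiled just-in-time for the actual device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "evaluate", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_X);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_pop);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_err);

    /* 2. Computation domain: here, one work-item per training point.   */
    size_t global_size = DATASET_SIZE;

    /* 5. Evolutionary loop: (a) transfer, (b) execute, (c) retrieve.   */
    for (int g = 0; g < generations; ++g) {
        clEnqueueWriteBuffer(q, d_pop, CL_TRUE, 0,                 /* (a) */
                             POP_SIZE * MAX_PROG_LEN * sizeof(unsigned),
                             population, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global_size,   /* (b) */
                               NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, d_err, CL_TRUE, 0,                  /* (c) */
                            POP_SIZE * sizeof(float),
                            errors, 0, NULL, NULL);
        /* Selection and breeding then proceed sequentially on the host. */
    }
}
```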

Regarding the kernel code, it can be designed to evaluate programs in different parallel ways: (i) training points are processed in parallel but programs sequentially; (ii) the converse, programs are executed in parallel but training points are processed sequentially; or (iii) a mixture of the two, where both programs and training points are processed in parallel. Which of them is best depends essentially on a combination of the characteristics of the problem and some parameters of the GP algorithm. These strategies are described and discussed in Sections 4.2, 4.3 and 4.4.

### **4.1. Program interpreter**

The standard manner of estimating the fitness of a GP candidate program is to execute it, commonly on varying input arguments, and observe how well it solves the task at hand by comparing its behavior with the expected one. To this end, the program can be *compiled* just before the execution, generating an intermediate object code, or be directly *interpreted*, without generating intermediate objects. Both variations have pros and cons. Compilation introduces overhead; however, it may be advantageous when the evaluation of a program is highly demanding. Interpretation, on the other hand, is usually slower, but avoids the compilation cost for each program. Moreover, interpretation is easy to accomplish and, more importantly, much more flexible. Such flexibility makes it possible, for example, to emulate a MIMD execution model on a SIMD or SPMD architecture [23]. This is possible because what a data-parallel device actually executes are many instances of the *same* interpreter. Programs, as has always been the case with training points, become data or, in other words, arguments for the interpreter.

The program interpreter is presented in Algorithm Interpreter below. It is assumed that the program to be executed is represented as a *prefix linear tree* [5], since a linear representation is very efficient to operate on, especially on the GPU architecture. An example of such a program is:

+ sin *x* 3.14

which denotes the infix expression sin(*x*) + 3.14.

The program interpretation operates on a single training data point at a time. The current point is given by the argument *n*, and *X<sub>n</sub>* ∈ ℝ<sup>*d*</sup> is a *d*-dimensional array holding the variables of the *n*-th training point of the problem.

The command INDEX extracts the class of the current operator (*op*), which can be a function, a constant or a variable. The value of a constant is obtained by the VALUE command; for variables, this command returns the variable's index in order to get its corresponding value in *X<sub>n</sub>*.
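The chapter does not prescribe a concrete encoding behind INDEX and VALUE. Purely as an illustration, one possible scheme packs the operator class and its payload into a single 32-bit token; here VALUE yields an index into a small constant pool rather than the constant's bits, a common trick for fixed-width tokens. All names below are hypothetical.

```c
/* Illustrative 32-bit token layout: high byte = operator class (INDEX),
   low bytes = payload (VALUE): a constant-pool or variable index.      */
enum { ADD, SUB, MUL, DIV, SIN, CONSTANT, VARIABLE };

#define INDEX(tok)  ((tok) >> 24)
#define VALUE(tok)  ((tok) & 0x00FFFFFF)

/* The prefix program "+ sin x 3.14" as a linear token array, assuming
   x is variable 0 and 3.14 sits in slot 0 of a constant pool:          */
const float constant_pool[] = { 3.14f };
const unsigned program[] = {
    (ADD      << 24),
    (SIN      << 24),
    (VARIABLE << 24) | 0,   /* x    -> Xn[0]            */
    (CONSTANT << 24) | 0,   /* 3.14 -> constant_pool[0] */
};
```

With tokens in this form, the interpreter below simply dispatches on INDEX(program[op]).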




```
Function Interpreter( program, n )
    // Scan the prefix-linear program from right to left, so that the
    // operands of an operator are already on the stack when it is met.
    for op ← programsize − 1 to 0 do
        switch INDEX( program[op] ) do
            case ADD:
                PUSH( POP + POP );
            case SUB:
                PUSH( POP − POP );
            case MUL:
                PUSH( POP × POP );
            case DIV:
                PUSH( POP ÷ POP );
            case IF-THEN-ELSE:
                if POP then                 // condition
                    result ← POP; POP;      // keep 'then' value, discard 'else'
                else
                    POP; result ← POP;      // discard 'then' value, keep 'else'
                PUSH( result );
            ...
            case CONSTANT:
                PUSH( VALUE( program[op] ) );
            otherwise                       // a variable
                PUSH( Xn[VALUE( program[op] )] );
    return POP;
```
The interpreter is stack-based; whenever an operand shows up, like a constant or variable, its value is pushed onto the stack via the PUSH command. Conversely, an operator obtains its operands' values on the stack by means of the POP command, which removes the most recently stacked values. Then, the value of the resulting operation on its operands is pushed back onto the stack so as to make it available to a parent operator.
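For concreteness, the same stack discipline can be written out in plain C. The sketch below reuses the hypothetical token layout introduced earlier and adds an explicit array-based stack; it evaluates one program on one training point.

```c
#include <math.h>

/* Token layout as in the earlier illustrative sketch. */
enum { ADD, SUB, MUL, DIV, SIN, CONSTANT, VARIABLE };
#define INDEX(tok)  ((tok) >> 24)
#define VALUE(tok)  ((tok) & 0x00FFFFFF)
#define MAX_STACK   64

/* Plain-C sketch of Algorithm Interpreter: Xn holds the current training
   point's variables, constant_pool the program's constants (assumed).   */
float interpret(const unsigned *program, int program_size,
                const float *Xn, const float *constant_pool)
{
    float stack[MAX_STACK];
    int top = -1;                                    /* empty stack        */
    for (int op = program_size - 1; op >= 0; --op) { /* right-to-left scan */
        float a, b;
        switch (INDEX(program[op])) {
        case ADD: a = stack[top--]; b = stack[top--]; stack[++top] = a + b; break;
        case SUB: a = stack[top--]; b = stack[top--]; stack[++top] = a - b; break;
        case MUL: a = stack[top--]; b = stack[top--]; stack[++top] = a * b; break;
        case DIV: a = stack[top--]; b = stack[top--]; stack[++top] = a / b; break;
        case SIN: a = stack[top--];                   stack[++top] = sinf(a); break;
        case CONSTANT: stack[++top] = constant_pool[VALUE(program[op])]; break;
        default:       stack[++top] = Xn[VALUE(program[op])]; break; /* variable */
        }
    }
    return stack[top];  /* e.g. sin(Xn[0]) + 3.14 for the example program */
}
```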

As will be seen in the subsequent sections, whatever the parallel strategy, the interpreter will act as a central component of the kernels, doing the hard work. The kernels will basically set up how the interpreter is distributed among the processing elements and which program and training point it operates on at a given time.

### **4.2. Data-level parallelism (DP)**

Parallelizing over the training data points is a simple and efficient strategy, especially when there is a large number of them, which is very common in complex problems. Moreover, given that this strategy leads to a data-parallel SIMD execution model, it fits well on a wide range of parallel architectures. Figure 2 shows graphically how the training data points are distributed among the PEs.<sup>10</sup>

**Figure 2.** Illustration of the data-level parallelism (DP).

As already mentioned, to precisely define a parallelization strategy in OpenCL, two things must be set up: the *n*-dimensional domain, more specifically the global and local sizes, and the kernel itself. For the data-level parallelism, it is natural to assign the global computation domain to the training data points domain as a one-to-one correspondence; that is, simply

*globalsize* = *datasetsize*, (1)

where *datasetsize* is the number of training data points. OpenCL lets the programmer choose whether or not to explicitly define the local size, i.e. how many work-items are put in a work-group. An explicit local size is only really needed when the corresponding kernel assumes a particular work-group division, which is not the case for DP. Therefore, no local size is defined for DP, leaving the OpenCL runtime free to decide on whatever configuration it considers best.
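In terms of the OpenCL API, Equation (1) together with the unspecified local size boils down to a single launch call. The fragment below is a sketch that assumes the queue and kernel objects from the earlier host-side skeleton:

```c
/* Launching the DP kernel: the global size equals the dataset size
   (Equation 1), and a NULL local size leaves the work-group geometry
   entirely to the OpenCL runtime. q and kernel are assumed to be the
   objects created in the host-side sketch above.                     */
size_t global_size = DATASET_SIZE;
clEnqueueNDRangeKernel(q, kernel,
                       1,              /* one-dimensional domain       */
                       NULL,           /* no global offset             */
                       &global_size,   /* global size = dataset size   */
                       NULL,           /* local size: runtime's choice */
                       0, NULL, NULL);
```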

Algorithm 1 presents the DP kernel in a pseudo-OpenCL language. As with any OpenCL kernel, *globalsize* instances of it will be launched on the compute device.<sup>11</sup> Hence, there is one work-item per domain element, with each one identified by its global or local position through the OpenCL commands get\_global\_id and get\_local\_id, respectively. This enables a work-item to select which portion of the compute domain it will operate on, based on its absolute or relative position.

For the DP kernel, the *globalid* index is used to choose which training data point will be processed; in other words, each work-item will be in charge of a specific point. The *for loop* iterates sequentially over each program of the population (the function NthProgram returns the *p*-th program), that is, every work-item will execute the same program at a given time. Then, the interpreter (Section 4.1) is called to execute the current program, but each work-item will provide a different index, corresponding to the training data point it took charge of.
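As a rough illustration of what such a kernel looks like, a DP kernel could be sketched in OpenCL C as follows. The argument layout, MAX_PROG_LEN and the Interpreter helper (Section 4.1's interpreter, assumed to be translated into OpenCL C in the same program source and here taking the point's variables directly) are assumptions, not the chapter's exact code; the extra arguments would be set on the host with clSetKernelArg just like the buffers.

```c
/* Sketch of a data-parallel (DP) evaluation kernel in OpenCL C: one
   work-item per training point, programs interpreted sequentially.
   MAX_PROG_LEN and Interpreter() are assumptions (see Section 4.1).  */
#define MAX_PROG_LEN 128

float Interpreter(__global const unsigned *program, __global const float *Xn);

__kernel void evaluate(__global const float    *X,        /* training points */
                       __global const unsigned *pop,      /* linear programs */
                       __global const float    *expected, /* desired outputs */
                       __global float          *error,    /* one per program */
                       const int pop_size,
                       const int dim)
{
    int n = get_global_id(0);   /* the point this work-item is in charge of */
    __global const float *Xn = X + n * dim;

    for (int p = 0; p < pop_size; ++p) {   /* all work-items run the same
                                              program at a given time      */
        __global const unsigned *prog = pop + p * MAX_PROG_LEN; /* NthProgram */
        float diff = fabs(Interpreter(prog, Xn) - expected[n]);
        /* Accumulating diff into error[p] requires combining the partial
           results of all work-items, e.g. atomics or a reduction (omitted). */
    }
}
```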

<sup>10</sup> To simplify, in Figures 2, 4 and 5 it is presumed that the number of PEs (or CUs) coincides with the number of training data points (or programs), but in practice this is rarely the case.

<sup>11</sup> It is worth noting that the actual number of work-items executed in parallel by the OpenCL runtime will depend on the device's capabilities, mainly on the number of processing elements.

