The interpreter is stack-based; whenever an operand shows up, like a constant or a variable, its value is pushed onto the stack via the PUSH command. Conversely, an operator obtains its operands' values from the stack by means of the POP command, which removes the most recently stacked values. The value resulting from applying the operator to its operands is then pushed back onto the stack so as to make it available to a parent operator.

```
Function Interpreter( program, n )
    for op ← programsize − 1 to 0 do
        switch INDEX( program[op] ) do
            case ADD:
                PUSH( POP + POP );
            case SUB:
                PUSH( POP − POP );
            case MUL:
                PUSH( POP × POP );
            case DIV:
                PUSH( POP ÷ POP );
            case IF-THEN-ELSE:
                if POP then
                    PUSH( POP );
                else
                    POP; PUSH( POP );
            . . .
            case CONSTANT:
                PUSH( VALUE( program[op] ) );
            otherwise
                PUSH( Xn[VALUE( program[op] )] );
    return POP;
```
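To make the pseudo-code concrete, the interpreter can be sketched in OpenCL C. The sketch below is illustrative only: the token encoding (`token_t` holding an INDEX/VALUE pair), the fixed stack depth, the protected division, and the row-major dataset layout are assumptions of this example, not the actual implementation used in this chapter.

```
/* A minimal OpenCL C sketch of the stack-based interpreter above.
   The token encoding, stack depth, protected division and dataset
   layout are illustrative assumptions. */
#define MAX_STACK 64

enum { OP_ADD, OP_SUB, OP_MUL, OP_DIV, OP_CONSTANT, OP_VARIABLE };

typedef struct { int index; float value; } token_t;

/* Runs 'program' (size tokens, prefix order, read right-to-left)
   on the n-th training point; X holds num_vars inputs per point. */
float interpret(__global const token_t *program, int size,
                __global const float *X, int num_vars, int n)
{
    float stack[MAX_STACK];
    int top = -1;                          /* empty stack */

    for (int op = size - 1; op >= 0; --op) {
        float a, b;
        switch (program[op].index) {
        case OP_ADD:
            a = stack[top--]; b = stack[top--];
            stack[++top] = a + b; break;
        case OP_SUB:
            a = stack[top--]; b = stack[top--];
            stack[++top] = a - b; break;
        case OP_MUL:
            a = stack[top--]; b = stack[top--];
            stack[++top] = a * b; break;
        case OP_DIV:                       /* protected division (assumption) */
            a = stack[top--]; b = stack[top--];
            stack[++top] = (b == 0.0f) ? 1.0f : a / b; break;
        case OP_CONSTANT:
            stack[++top] = program[op].value; break;
        default:                           /* variable: value holds its index */
            stack[++top] = X[n * num_vars + (int)program[op].value]; break;
        }
    }
    return stack[top];                     /* the program's output for point n */
}
```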

As will be seen in the subsequent sections, whatever the parallel strategy, the interpreter will act as a central component of the kernels, doing the hard work. The kernels will basically set up how the interpreter is distributed among the processing elements and which program and training point it will operate on at a given time.

## **4.2. Data-level Parallelism – DP**

The idea behind the data-level parallelism (DP) strategy is to distribute the training data points among the processing elements of a compute device. This is probably the simplest and most natural way of parallelizing GP's evaluation phase when a program must be executed on many independent training points.<sup>9</sup> Despite its obviousness, DP is an efficient strategy, especially when there is a large number of training data points—which is very common in complex problems. Moreover, given that this strategy leads to a data-parallel SIMD execution model, it fits well on a wide range of parallel architectures. Figure 2 shows graphically how the training data points are distributed among the PEs.<sup>10</sup>

<sup>9</sup> However, sometimes it is not possible to trivially decompose the evaluation phase. For instance, an evaluation may involve submitting the program through a simulator. In this case one can try to parallelize the simulator itself or, alternatively, opt to use a program- or population-level kind of parallelism.

**Figure 2.** Illustration of the data-level parallelism (DP).

As already mentioned, to precisely define a parallelization strategy in OpenCL, two things must be set up: the *n*-dimensional domain, more specifically the global and local sizes, and the kernel itself. For the data-level parallelism, it is natural to assign the global computation domain to the training data points domain as a one-to-one correspondence; that is, simply

$$
global\_{size} = dataset\_{size} \tag{1}
$$

where *datasetsize* is the number of training data points. OpenCL lets the programmer choose whether to explicitly define the local size, i.e. how many work-items will be put in a work-group. An exact definition of the local size is only really needed when the corresponding kernel assumes a particular work-group division, which is not the case for DP. Therefore, no local size is explicitly defined for DP, leaving the OpenCL runtime to decide on whatever configuration it deems best.
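On the host side, this choice shows up simply as a NULL local size. A minimal sketch, assuming `queue`, `kernel` and `dataset_size` have been created and set up elsewhere, with error checking omitted for brevity:

```
/* Host-side sketch: DP launch with the global size tied to the
   dataset (Equation 1) and the local size left to the runtime. */
size_t global_size = dataset_size;

clEnqueueNDRangeKernel(queue, kernel,
                       1,             /* one-dimensional domain     */
                       NULL,          /* no global offset           */
                       &global_size,  /* Equation (1)               */
                       NULL,          /* local size: runtime decides */
                       0, NULL, NULL);
```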

Algorithm 1 presents the DP kernel in a pseudo-OpenCL language. As with any OpenCL kernel, *globalsize* instances of it will be launched on the compute device.<sup>11</sup> Hence, there is one work-item per domain element, each identified by its global or local position through the OpenCL commands get\_global\_id and get\_local\_id, respectively. This enables a work-item to select what portion of the compute domain it will operate on, based on its absolute or relative position.

For the DP kernel, the *globalid* index is used to choose which training data point will be processed; in other words, each work-item is in charge of a specific point. The *for* loop iterates sequentially over each program of the population (the function NthProgram returns the *p*-th program), that is, every work-item executes the same program at a given time. Then, the interpreter (Section 4.1) is called to execute the current program, but each work-item provides a different index, corresponding to the training data point it took responsibility for. Once interpreted, the output returned by the program is compared with the expected one for that point, whose value is stored in array *Y*. This yields a prediction error; however, it is the overall error that is meaningful for estimating the fitness of a program.

<sup>10</sup> To simplify, in Figures 2, 4 and 5 it is presumed that the number of PEs (or CUs) coincides with the number of training data points (or programs), but in practice this is rarely the case.

<sup>11</sup> It is worth noting that the actual number of work-items executed in parallel by the OpenCL runtime will depend on the device's capabilities, mainly on the number of processing elements.

#### **Algorithm 1:** GPU DP's OpenCL kernel

```
globalid ← get_global_id();
for p ← 0 to populationsize − 1 do
    program ← NthProgram(p);
    error ← |Interpreter( program, globalid ) − Y[globalid]|;
    E[p] ← ErrorReduction(0, . . . , globalsize − 1);
```
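For illustration, Algorithm 1 could be rendered in OpenCL C roughly as below. The helper `nth_program` and the population layout are hypothetical, and `interpret` is the sketch from Section 4.1; this is an outline of the technique, not the chapter's actual kernel.

```
/* Hedged OpenCL C rendering of Algorithm 1. One work-item per
   training point; all work-items interpret the same program at a
   given time. nth_program() is a hypothetical layout helper. */
__kernel void dp_kernel(__global const token_t *population,
                        __global const int *prog_size,
                        const int population_size,
                        __global const float *X, const int num_vars,
                        __global const float *Y,
                        __global float *partial_error)
{
    int globalid = get_global_id(0);
    for (int p = 0; p < population_size; ++p) {
        __global const token_t *program = nth_program(population, p);
        float out = interpret(program, prog_size[p], X, num_vars, globalid);
        partial_error[globalid] = fabs(out - Y[globalid]);
        /* ErrorReduction: the partial errors of work-items
           0..globalsize-1 are then accumulated into E[p]; portably
           this is done in a separate pass (see Figure 3's sketch). */
    }
}
```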


Note however that the errors are spread among the work-items, because each work-item has processed a single point and has computed its own error independently. This calls for what is known in the parallel computing literature as the *reduction* operation [22]. The naive way of doing that is to sequentially cycle over each element and accumulate their values; in our case it would iterate from the work-item indexed by 0 to *globalsize* − 1 and put the total value in *E*[*p*], the final error relative to the *p*-th program. There is, however, a cleverer, parallel way of doing the reduction, as exemplified in Figure 3, which decreases the complexity of this step from *O*(*N*) to just *O*(log<sub>2</sub> *N*) and still ensures nicely coalesced memory accesses suited to the GPU architecture [1, 32].<sup>12</sup>

**Figure 3.** *O*(log<sub>2</sub> *N*) parallel reduction with sequential addressing.
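To make the pattern concrete, the in-group building block of this reduction can be sketched in OpenCL C. It assumes the work-group size is a power of two and that each work-item has already written its partial error into a `__local` buffer; a full DP reduction across work-groups would need an additional accumulation pass. Again, this is a sketch of the general pattern, not the chapter's implementation.

```
/* O(log2 N) reduction with sequential addressing within one
   work-group. Assumes localsize is a power of two and scratch[]
   holds one partial error per work-item. */
void error_reduction(__local float *scratch, int localid, int localsize)
{
    for (int stride = localsize / 2; stride > 0; stride /= 2) {
        barrier(CLK_LOCAL_MEM_FENCE);  /* make previous writes visible */
        if (localid < stride)
            scratch[localid] += scratch[localid + stride];
    }
    barrier(CLK_LOCAL_MEM_FENCE);      /* scratch[0] now holds the sum */
}
```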

## **4.3. Program-level Parallelism – PP**

One serious drawback of the data-level parallelism strategy is that when there are few training data points the compute device will likely be underutilized. Today's high-end GPUs have thousands of processing elements, and this number has increased with each new hardware generation. In addition, to achieve optimal performance on GPUs, multiple work-items should be launched for each processing element; this helps, for instance, to hide memory access latencies while reading from or writing to the device's global memory [1, 32]. Therefore, to optimally utilize a high-end GPU under the DP strategy, one should prefer problems having tens of thousands of training data points. Unfortunately, for many real-world problems no such amount of data is available.

<sup>12</sup> This chapter aims only at conveying the idea of the parallel reduction, and so it will not get into the algorithmic details of how the reduction is actually implemented. The reader is referred to the given references for details.

Another limitation of the DP strategy is that sometimes there is no easy way to decompose the evaluation of a program into independent entities, like data points. Many program evaluations that need a simulator, for example, fall into this category, since a parallel implementation of the simulator may not be feasible.

An attempt to overcome the DP limitations, particularly the need for a substantially large number of training data points, is shown schematically in Figure 4. This parallelization strategy is referred to here as program-level parallelism (PP), meaning that programs are executed in parallel, one program per PE [4, 35]. Assuming that there are enough programs to be evaluated, even a few training data points should keep the GPU fully occupied.

**Figure 4.** Illustration of the program-level parallelism (PP).


In PP, while programs are interpreted in parallel, the training data points within each PE are processed sequentially. This suggests a computation domain based on the number of programs, in other words, the global size can be defined as:

$$
global\_{size} = population\_{size} \tag{2}
$$

As with DP, PP does not need control over the number of work-items within a work-group; thus, the local size can again be left undefined.

A pseudo-OpenCL code for the PP kernel is given in Algorithm 2. It resembles the DP algorithm, but in PP what is being parallelized are the programs instead of the training data points. Hence, each work-item takes a different program and interprets it iteratively over all the points. A positive side effect of this inverse logic is that, since the whole evaluation of a program is now done in a single work-item, all the partial prediction errors are promptly available locally. Put differently, in PP a final reduction step is not required.

## **4.4. Program- and Data-level Parallelism – PDP**

Unfortunately, PP solves DP's need for large training datasets but introduces two other problems: (i) to avoid underutilization of the GPU, a large population of programs must now be employed; and, more critically, (ii) PP's execution model is not suited to an inherently data-parallel architecture like GPUs.

#### **Algorithm 2:** GPU PP's OpenCL kernel

```
globalid ← get_global_id();
program ← NthProgram(globalid);
error ← 0.0;
for n ← 0 to datasetsize − 1 do
    error ← error + |Interpreter( program, n ) − Y[n]|;
E[globalid] ← error;
```
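A hedged OpenCL C counterpart of Algorithm 2 follows, under the same assumptions as the DP sketch (the hypothetical `nth_program` helper and the `interpret` function from Section 4.1):

```
/* Hedged OpenCL C rendering of Algorithm 2. One work-item per
   program; each one walks the whole dataset sequentially, so the
   final error is locally available and no reduction is needed. */
__kernel void pp_kernel(__global const token_t *population,
                        __global const int *prog_size,
                        __global const float *X, const int num_vars,
                        __global const float *Y, const int dataset_size,
                        __global float *E)
{
    int globalid = get_global_id(0);
    __global const token_t *program = nth_program(population, globalid);

    float error = 0.0f;
    for (int n = 0; n < dataset_size; ++n)
        error += fabs(interpret(program, prog_size[globalid],
                                X, num_vars, n) - Y[n]);
    E[globalid] = error;   /* whole error is local to this work-item */
}
```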

While (i) can be dealt with by simply specifying a large population as a parameter of the genetic programming algorithm, the issue pointed out in (ii) cannot be solved within the PP strategy.

The problem lies in the fact that, as mentioned in Section 2, GPUs are mostly a SIMD architecture, especially among processing elements within a compute unit. Roughly speaking, whenever two (or more) different instructions are issued at the same time, a hardware conflict occurs and these instructions are performed sequentially, one at a time. In the related literature, this phenomenon is often referred to as *divergence*. Since in PP each PE interprets a different program, the degree of divergence is the highest possible: at a given moment each work-item's interpreter is potentially executing a different primitive. Therefore, in practice, the programs within a CU will most of the time be evaluated sequentially, seriously degrading the performance.

However, observing that modern GPUs are capable of simultaneously executing different instructions at the level of compute units, i.e. the SPMD execution model, one could devise a parallelization strategy that takes advantage of this fact. Such a strategy exists, and it is known here as program- and data-level parallelism, or simply PDP [4, 35]. Its general idea is illustrated in Figure 5. In PDP, a single program is evaluated per compute unit—this prevents the just-mentioned problem of divergence—but within each CU all the training data points are processed in parallel. Therefore, there are two levels of parallelism: a program-level parallelism among the compute units, and a data-level parallelism on the processing elements.

**Figure 5.** Illustration of the program- and data-level parallelism (PDP).

Indeed, PDP can be seen as a mixture of the DP and PP strategies. But curiously, PDP avoids all the drawbacks associated with the other two strategies: (i) once there are enough data to saturate just a single CU, smaller datasets can be used at no performance loss; (ii) large populations are not required either, since the number of CUs on current high-end GPUs is in the order of tens; and (iii) there is no divergence with respect to program interpretation.<sup>13</sup>

In order to achieve both levels of parallelism, a fine-tuned control over the computation domain is required; more precisely, both local and global sizes must be properly defined.

Since a work-group should process all training data points for a single program and there is a population of programs to be evaluated, one would imagine that setting *localsize* as *datasetsize* and *globalsize* as *populationsize* × *datasetsize* would suffice. This is conceptually correct, but an important detail makes the implementation not as straightforward as one would expect. The OpenCL specification allows any compute device to declare an upper bound on the number of work-items within a work-group. This limit is not arbitrary: it reflects the device's capabilities, which restrict how many work-items a work-group may hold. An unlimited number of work-items per work-group would not be viable; therefore this vendor-provided limit must be taken into account.

With the aforementioned in mind, the local size can finally be set to


$$
local\_{size} = \begin{cases} dataset\_{size} & \text{if } dataset\_{size} < local\_{max\\_size} \\ local\_{max\\_size} & \text{otherwise,} \end{cases} \tag{3}
$$

which limits the number of work-items per work-group to the maximum supported, given by the variable *localmax*\_*size*, when the number of training data points exceeds it. This implies that when such a limit takes place, a single work-item will be in charge of more than one training data point, that is, the work granularity is increased. As for the global size, it can be easily defined as

$$
global\_{size} = population\_{size} \times local\_{size}, \tag{4}
$$

meaning that the set of work-items defined above should be replicated as many times as the number of programs to be evaluated.
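On the host side, Equations (3) and (4) translate into a few lines of OpenCL API code. A sketch, assuming `device`, `queue`, `kernel`, `dataset_size` and `population_size` (as `size_t`) exist elsewhere; note that a kernel-specific limit could also be queried with clGetKernelWorkGroupInfo.

```
/* Host-side sketch of the PDP launch configuration. */
size_t local_max_size;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(local_max_size), &local_max_size, NULL);

size_t local_size = (dataset_size < local_max_size)
                        ? dataset_size : local_max_size;   /* Eq. (3) */
size_t global_size = population_size * local_size;         /* Eq. (4) */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size,  /* explicit sizes */
                       0, NULL, NULL);
```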

Finally, Algorithm 3 shows the OpenCL kernel for the PDP strategy. Compared to the other two kernels (Algorithms 1 and 2), its greater complexity comes as no surprise, as this kernel is a combination of the other two and still has to cope with the fact that a single instance, i.e. a work-item, can process an arbitrary number of training data points. The command get\_group\_id, which returns the work-group's index of the current work-item, has the purpose of indexing the program that is going to be evaluated by the entire group. The *for* loop is closely related to the local size (Equation 3), and acts as a way of iterating over multiple training data points when the work-item (indexed locally by *localid*) is in charge of many of them; when the dataset size is less than or equal to the local size, only one iteration is performed. Then, an index calculation is done in order to get the index (*n*) of the current training data point to be processed.<sup>14</sup>

<sup>13</sup> Notice, though, that divergence might still occur if two (or more) training data points can cause the interpreter to take different paths for the same program. For instance, if the conditional *if-then-else* primitive is used, a data point could cause an interpreter's instance to take the *then* path while other data could make another instance take the *else* path.

#### **Algorithm 3:** GPU PDP's OpenCL kernel

```
localid ← get_local_id();
groupid ← get_group_id();
program ← NthProgram(groupid);
error ← 0.0;
for i ← 0 to ⌈datasetsize/localsize⌉ − 1 do
    n ← i × localsize + localid;
    if n < datasetsize then
        error ← error + |Interpreter( program, n ) − Y[n]|;
E[groupid] ← ErrorReduction(0, . . . , localsize − 1);
```
Due to the fact that the dataset size may not be evenly divisible by the local size, a range check is performed to guarantee that no out-of-range access will occur. Finally, since the prediction errors for a given program will be spread among the local work-items at the end of the execution, an error reduction operation takes place.
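For completeness, a hedged OpenCL C counterpart of Algorithm 3 is sketched below, reusing the `interpret` and `error_reduction` sketches from the previous sections; the strided loop is equivalent to Algorithm 3's *n* = *i* × *localsize* + *localid* indexing with its range check, and `nth_program` remains a hypothetical helper.

```
/* Hedged OpenCL C rendering of Algorithm 3. One work-group per
   program; work-items stride through the dataset, then a local
   reduction (Figure 3) combines the per-item partial errors. */
__kernel void pdp_kernel(__global const token_t *population,
                         __global const int *prog_size,
                         __global const float *X, const int num_vars,
                         __global const float *Y, const int dataset_size,
                         __global float *E,
                         __local float *scratch)
{
    int localid   = get_local_id(0);
    int groupid   = get_group_id(0);
    int localsize = get_local_size(0);
    __global const token_t *program = nth_program(population, groupid);

    float error = 0.0f;
    for (int n = localid; n < dataset_size; n += localsize)
        error += fabs(interpret(program, prog_size[groupid],
                                X, num_vars, n) - Y[n]);

    scratch[localid] = error;                    /* per-item partial error */
    error_reduction(scratch, localid, localsize);
    if (localid == 0)
        E[groupid] = scratch[0];                 /* total error of program */
}
```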

**Author details**

Douglas A. Augusto and Heder S. Bernardino

*Laboratório Nacional de Computação Científica (LNCC/MCTI), Rio de Janeiro, Brazil*

Helio J. C. Barbosa

*Laboratório Nacional de Computação Científica (LNCC/MCTI), Rio de Janeiro, Brazil; Federal University of Juiz de Fora (UFJF), Computer Science Dept., Minas Gerais, Brazil*

**6. References**

[1] Advanced Micro Devices [2010]. *AMD Accelerated Parallel Processing Programming Guide - OpenCL*.

[2] Ando, J. & Nagao, T. [2007]. Fast evolutionary image processing using multi-GPUs, *Proc. of the International Conference on Systems, Man and Cybernetics*, pp. 2927–2932.

[3] Arenas, M. G., Mora, A. M., Romero, G. & Castillo, P. A. [2011]. GPU computation in bioinspired algorithms: a review, *Proc. of the international conference on Artificial neural networks conference on Advances in computational intelligence*, Springer-Verlag, pp. 433–440.

[4] Augusto, D. A. & Barbosa, H. J. [2012]. Accelerated parallel genetic programming tree evaluation with OpenCL, *Journal of Parallel and Distributed Computing*. URL: *http://www.sciencedirect.com/science/article/pii/S074373151200024X*

[5] Augusto, D. A. & Barbosa, H. J. C. [2000]. Symbolic regression via genetic programming, *Proceedings of the VI Brazilian Symposium on Neural Networks*, IEEE Computer Society, Los Alamitos, CA, USA, pp. 173–178.

[6] Banzhaf, W., Harding, S., Langdon, W. B. & Wilson, G. [2009]. Accelerating genetic programming through graphics processing units, *Genetic Programming Theory and Practice VI*, pp. 1–19. URL: *http://dx.doi.org/10.1007/978-0-387-87623-8_15*

[7] Cano, A., Zafra, A. & Ventura, S. [2010]. Solving classification problems using genetic programming algorithms on GPUs, *Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)* 6077 LNAI(PART 2): 17–26.

[8] Cano, A., Zafra, A. & Ventura, S. [2012]. Speeding up the evaluation phase of GP classification algorithms on GPUs, *Soft Computing* 16(2): 187–202.

[9] Chitty, D. M. [2007]. A data parallel approach to genetic programming using programmable graphics hardware, *in* D. Thierens, H.-G. Beyer, J. Bongard, J. Branke, J. A. Clark, D. Cliff, C. B. Congdon, K. Deb, B. Doerr, T. Kovacs, S. Kumar, J. F. Miller, J. Moore, F. Neumann, M. Pelikan, R. Poli, K. Sastry, K. O. Stanley, T. Stutzle, R. A. Watson & I. Wegener (eds), *GECCO '07: Proceedings of the 9th annual conference on Genetic and evolutionary computation*, Vol. 2, ACM Press, London, pp. 1566–1573. URL: *http://www.cs.bham.ac.uk/~wbl/biblio/gecco2007/docs/p1566.pdf*

[10] Ebner, M. [2009]. A real-time evolutionary object recognition system, *Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)* 5481 LNCS: 268–279. URL: *http://www.scopus.com/inward/record.url?eid=2-s2.0-67650697120&partnerID=40&md5=1a9de902eb5649a01e3e87c222a79ee3*