**3.4. Description of the utility function (Fitness)**

The fitness of a candidate LRM is evaluated based on the quality of the estimates it generates, compared with the data of the problem. The quality of an LRM can be quantified through its goodness of fit and its complexity, measured, in this study, by the *Akaike Information Criterion* (AIC) [8], since it is one of the most widely used criteria [10].

**Figure 4.** Examples of trees generated from (a) complete generation method and (b) generation by growing.

The AIC can be given by the following equation:

$$AIC = 2 \cdot tc - 2 \cdot \ln(L) \tag{15}$$

Genetically Programmed Regression Linear Models for Non-Deterministic Estimates 85


where *tc* is the number of terms of the model and *L* is the likelihood, i.e., the joint density of all the observations. Considering a response variable $y_i$ that is normally distributed with mean $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$, the likelihood of *n* independent observations is given by:

$$L(\beta_0, \beta_1, \sigma^2) = \frac{1}{\sigma^n (\sqrt{2\pi})^n} e^{-\frac{1}{2} \frac{\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2}{\sigma^2}} \tag{16}$$
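As an illustration of how Equation (15) can be evaluated, the sketch below computes the AIC of a one-regressor model. It works with the logarithm of the Gaussian likelihood of *n* independent observations, $\ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i}(y_i - \beta_0 - \beta_1 x_i)^2$, which is numerically safer than evaluating *L* directly. The function names and the data are hypothetical and are not taken from the original work.

```python
import math

def gaussian_log_likelihood(xs, ys, beta0, beta1, sigma2):
    """ln L for independent y_i ~ N(beta0 + beta1 * x_i, sigma2)."""
    n = len(xs)
    # Sum of squared residuals of the candidate model.
    ssr = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - ssr / (2.0 * sigma2)

def aic(tc, log_likelihood):
    """Equation (15): AIC = 2 * tc - 2 * ln(L)."""
    return 2.0 * tc - 2.0 * log_likelihood

# Hypothetical observations, roughly following y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

# Two candidate models with the same number of terms (tc = 2).
aic_good = aic(2, gaussian_log_likelihood(xs, ys, 1.0, 2.0, 0.05))
aic_bad = aic(2, gaussian_log_likelihood(xs, ys, 0.0, 1.0, 0.05))
print(aic_good < aic_bad)
```

With the same number of terms, the candidate whose estimates are closer to the data has the higher likelihood and therefore the lower AIC.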

#### **3.5. Evolution**

In this stage we apply the selection, crossover and mutation operations to the candidate genetic individuals. The first operation is responsible for selecting the individuals that will compose the set of parents. The genetic crossover function acts on this set, so that part of the genetic content of each individual is transferred to another one, generating new solution candidates. The objective is to group the best characteristics in certain individuals, forming better solutions. The mutation function selects some of the individuals to have their genetic content randomly changed, in order to introduce genetic variability in the population and avoid the convergence of the algorithm to a local optimum.

The selection, crossover and mutation operations are described next.

#### *3.5.1. Parents selection*

The method for selecting parents must simulate the natural selection mechanism that acts on biological species: the most qualified parents, those that best fit the problem data, generate a large number of children, while the less qualified can also have descendants, thus avoiding premature genetic convergence. Consequently, we focus on highly fitted individuals, without completely discarding those with a very low degree of fitness.

In order to build the set of parent LRMs, we use the tournament selection method [25]. In this approach, a predetermined number of candidate LRMs are randomly chosen to compete against each other, and the fittest of them wins. With this selection technique, the best LRMs of the population have an advantage over the worst only when they are drawn for a tournament, *i.e.*, they can only win a tournament if they are chosen. Tournament parameters, such as the tournament size and the number of generations, depend on the problem domain. In this work, they are described in the case study section.
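A minimal sketch of one tournament is given below. It assumes that a lower fitness value (for instance, a lower AIC) is better, which is the usual convention for this criterion. The function name, the model labels and the AIC values are hypothetical.

```python
import random

def tournament_select(population, fitness, tournament_size, rng=random):
    """Draw `tournament_size` distinct candidates at random and return
    the one with the lowest fitness value (assumed: lower is better)."""
    contestants = rng.sample(population, tournament_size)
    return min(contestants, key=fitness)

# Hypothetical AIC values for four candidate models.
scores = {"m1": 12.4, "m2": 9.8, "m3": 15.1, "m4": 11.0}
population = list(scores)

rng = random.Random(0)  # seeded only to make the sketch repeatable
parent = tournament_select(population, scores.get, 2, rng)
print(parent)
```

A small tournament size gives weaker candidates a real chance of becoming parents, while a tournament as large as the population always returns the best candidate.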

The proposed GP approach also uses selection by elitism [26]. In this approach, the individual with the best fitness function value is always selected and kept for the next generation. This guarantees that the best result found by the GP approach never gets worse from one generation to the next.
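The effect of elitism can be sketched as follows, again assuming that a lower fitness value is better and that a single elite individual is preserved. How the remaining slots are filled is an assumption made only for this illustration; the names and values are hypothetical.

```python
def next_generation_with_elitism(population, offspring, fitness):
    """Copy the best current individual unchanged and fill the remaining
    slots with offspring, keeping the population size constant."""
    elite = min(population, key=fitness)  # assumed: lower fitness is better
    return [elite] + list(offspring)[: len(population) - 1]

# Hypothetical AIC values for parents (m*) and children (c*).
scores = {"m1": 12.4, "m2": 9.8, "m3": 15.1,
          "c1": 11.0, "c2": 13.7, "c3": 10.5}
population = ["m1", "m2", "m3"]
offspring = ["c1", "c2", "c3"]

new_population = next_generation_with_elitism(population, offspring, scores.get)
best_before = min(scores[m] for m in population)
best_after = min(scores[m] for m in new_population)
print(best_after <= best_before)
```

Because the elite individual is copied unchanged, the best fitness in the new population can never be worse than in the previous one, even when every child is worse than its parents.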

#### *3.5.2. Crossover and mutation*

84 Genetic Programming – New Approaches and Successful Applications


In order to find the LRM that best fits the data obtained with communication graphs, the crossover and mutation operators are applied to the genetic individuals, *i.e.*, the LRM trees, as shown in Figure 5. The crossover and mutation operators in genetic programming are similar to those of conventional genetic algorithms.

**Figure 5.** Expression trees representing LRMs under the operations of (a) crossover and (b) mutation.

In the first operator, represented in Figure 5(a), the candidates are selected for reproduction according to their fitness (the fittest candidates have higher probabilities of being selected) and then exchange randomly chosen parts of their genetic content (sub-trees) with each other. Figure 5(a) illustrates the crossover of the parents $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$ and $y = \beta_0 + \beta_1 X_1 + \beta_2 X_4 + \beta_3 X_5$, generating the children $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_4 + \beta_4 X_5$ and $y = \beta_0 + \beta_1 X_1 + \beta_2 X_3$.
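The exchange of sub-trees in this example can be reproduced with a simplified sketch in which each LRM is represented as a flat list of regressors and the intercept is implicit. This flat representation and the tail swap below are assumptions made for illustration only. They are not the tree structure or the exact crossover operator of the original work, and all names are hypothetical.

```python
import random

def crossover(parent_a, parent_b, cut_a=None, cut_b=None, rng=random):
    """Swap the tails of two regressor lists at the given cut points.
    Cut points are drawn at random when omitted, which requires each
    parent to have at least two regressors. Duplicated regressors in a
    child are not removed in this sketch."""
    if cut_a is None:
        cut_a = rng.randint(1, len(parent_a) - 1)
    if cut_b is None:
        cut_b = rng.randint(1, len(parent_b) - 1)
    child_a = parent_a[:cut_a] + parent_b[cut_b:]
    child_b = parent_b[:cut_b] + parent_a[cut_a:]
    return child_a, child_b

def to_formula(terms):
    """Render a regressor list as a model string with an intercept β0."""
    parts = ["β0"] + [f"β{i}.{x}" for i, x in enumerate(terms, start=1)]
    return "y = " + " + ".join(parts)

# The parents of the example, with the cut points that reproduce it.
parent_a = ["X1", "X2", "X3"]
parent_b = ["X1", "X4", "X5"]
child_a, child_b = crossover(parent_a, parent_b, cut_a=2, cut_b=1)
print(to_formula(child_a))
print(to_formula(child_b))
```

The coefficients are renumbered in each child because they are re-estimated for the new set of regressors; only the structure of the model is inherited from the parents.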


With mutation, represented in Figure 5(b), a mutation factor is randomly generated for each new genetic individual after a crossover operation. If the mutation factor exceeds a predetermined threshold, a sub-tree of the LRM is selected at random and replaced by a new, different sub-tree. Figure 5(b) illustrates the mutation of the model $y = \beta_0 + \beta_1 X_1 + \beta_2 X_3$ into $y = \beta_0 + \beta_1 X_2 + \beta_2 X_3$, in which the genetic content $X_1$ was mutated to $X_2$.

In the approach proposed in this work, we used the two-point crossover operator [27], because it combines the largest number of chromosomal schemes and, consequently, increases the performance of the technique. For mutation, we used the simple operator [27]. With a low mutation rate, mutation prevents the stagnation of the search. If the rate is too high, however, the search becomes excessively random, because the higher the rate, the larger the part of the population that is replaced, which may lead to the loss of highly qualified structures.

So, in this work, the residual analysis is divided into two stages:

1. Residual diagnostic plots, where we build the following diagrams:

• Diagram of the distribution of accumulated errors, to quantify the distance between the estimates given by the LRM and the data of the training set;

• Q-Q plots and histograms, to check the assumptions about the error probability distributions;

• Diagram of the dispersion of the residuals against the fitted values, to check the assumption of homoscedasticity;

• Diagram of the dispersion of the residuals, to check the absence of autocorrelation among the errors.

2. Application of the *Mann-Whitney-Wilcoxon* statistical test [29] to the data of the training set and the respective estimates given by the LRM found. The *Mann-Whitney-Wilcoxon* test is a non-parametric [28] statistical hypothesis test used to check whether the data of two independent sets tend to be equal (null hypothesis) or different (alternative hypothesis). With these same sets, we also compute the global mean errors, as a measure of the central location of the set of residuals, together with their maximums and minimums. These measurements are used to check the precision of the estimates and the possible presence of *outliers*.

**4. Case study**

In order to validate the proposed approach, we have used a case study in which we predict the performance of an embedded system. The case study includes an application of the SPLASH benchmark<sup>1</sup> [33] for a simulation model of an embedded hardware platform. This application, which consists of sorting a set of integers by radix [34], has two processes. The first one allocates, in a shared memory, a data structure (list) comprising a set of randomly chosen integers, some control flags and a *mutex* (to manage the mutually exclusive access). Once the data structure is allocated, both processes sort the list of integers concurrently.

For the execution of the application, we designed a simulation model of a hardware platform, described in SystemC [35], a language for modeling embedded systems. The platform comprises two models of MIPS processors, one for each process of the radix sort application, a shared memory that stores program and application data as well as shared data, and an ARM AMBA AHB [36] shared bus model.

This model allows us to explore the bus configurations in order to optimize the performance of the radix sort application.

The experimental methodology was based on the comparison between the execution times of the application obtained with the simulation model and the estimates given by an LRM obtained with the proposed method. The objective is to show that the obtained models can provide highly precise estimates.

<sup>1</sup> Set of multiprocessed applications used to study the following properties: computational load balance, computation rates and traffic requirements in communications, as well as issues related to spatial locality and how these properties scale with the size of the problems and the number of processors.
