## **1. Introduction**


Symbolic regression is a technique that characterizes response variables as mathematical functions of input variables. Its main features include requiring no (or only a few) assumptions about the mathematical model and covering multidimensional data, which are frequently unbalanced and may come in large or small samples. In order to find plausible Symbolic Regression Models (SRM), we used the genetic programming (GP) technique [1].

Genetic programming (GP) is a specialization of genetic algorithms (GA), an evolutionary methodology inspired by biological evolution, used here to find predictive functions. Each GP individual is evaluated by executing its function in order to determine how well its output fits the desired output [2,3].
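As a minimal, hedged illustration of this evaluation step (the function names and the toy dataset below are our own assumptions, not part of the original study), a candidate individual can be scored by the sum of squared errors against the desired outputs:

```python
import math

# Toy dataset: observed inputs and the desired (target) outputs.
# These values are illustrative assumptions, not data from the study.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]  # roughly y = 2x + 1

def fitness(individual):
    """Sum of squared errors between the individual's output and the target.

    `individual` is any callable mapping an input x to a predicted y;
    in a GP system it would be the compiled expression tree.
    """
    return sum((individual(x) - y) ** 2 for x, y in zip(xs, ys))

# Two candidate individuals (hand-written here; GP would evolve them).
candidate_a = lambda x: 2.0 * x + 1.0
candidate_b = lambda x: math.sin(x)

# The candidate with the smaller error fits the desired output better.
print(fitness(candidate_a), fitness(candidate_b))
```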

However, depending on the problem, the estimates of the SRM found by GP may present errors [4], affecting the precision of the predictive function. To deal with this problem, some studies [5,6] replace the predictive functions, which are deterministic mathematical models, with statistical linear regression models (LRM) as the models encoded by the genetic individuals.

LRM, like traditional mathematical models, can be used to model a problem and make estimates. Their great advantage is the possibility of controlling the estimation errors. Nevertheless, the studies available in the literature [5,6] have considered only information criteria, such as the sum of least squares [7] and AIC [8], as indexes for evaluating the candidate models against the dataset and comparing them. Although the models obtained through this technique achieve good index values, the final models may sometimes not be representative, since the assumptions about the model structure were not verified, which leads to incorrect estimates [9].
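For concreteness, one common least-squares variant of AIC is AIC = n·ln(RSS/n) + 2k; the sketch below computes it, with the sample numbers being assumptions for illustration only:

```python
import math

def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit: n*ln(RSS/n) + 2k.

    rss: residual sum of squares of the fitted model
    n:   number of observations
    k:   number of estimated parameters (including the intercept)
    """
    return n * math.log(rss / n) + 2 * k

# Comparing two hypothetical candidate models fitted to the same data:
# a smaller AIC indicates a better trade-off between fit and complexity.
print(aic_least_squares(rss=12.4, n=50, k=2))
print(aic_least_squares(rss=11.9, n=50, k=5))
```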


So, in this study we propose the use of statistical inference and residual analysis to evaluate the final model obtained through GP, checking the assumptions about the structure of the model. In order to evaluate the proposed approach, we carried out experiments on predicting the performance of applications in embedded systems.


This chapter is organized as follows. In Section 2, we briefly introduce the theoretical basis of regression analysis. In Section 3, we detail the main points of the proposed approach. In Section 4, we introduce the application of the proposed approach through a case study. Section 5 shows the experimental results of the case study. Finally, in Section 6, we present the conclusions of this work.

## **2. Linear regression background**

Like most of the statistical analysis techniques, the objective of the linear regression analysis is to summarize, through a mathematical model called Linear Regression Model (LRM), the relations among variables in a simple and useful way [10]. In some problems, they can also be used to specify how one of the variables, in this case called response variable or dependent variable, varies as a function of the change in the values of the other variables of the relation, called predictive variables, regressive variables or systematic variables.

The predictive variables can be quantitative or qualitative. The quantitative variables are those which can be measured on a quantitative scale (i.e., they have a measurement unit). The qualitative variables, on the other hand, are divided into classes. The individual classes of a classification are called *levels* or *classes* of a factor. When classifying data in terms of factors and levels, the important characteristic to observe is the extent to which a factor can influence the variable of interest [11]. These factors are often represented by dummy variables [12].

Let *D* be a factor with five levels. The jth dummy variable *Uj* for the factor *D*, with *j=1,...,5*, has the ith value *uij*, for *i =1,...,n*, given by

$$u\_{ij} = \begin{cases} 1, & \text{if } D\_i = j^{\text{th}} \text{ category of } D \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

For instance, consider the support to a certain characteristic *x* as a two-level factor *D*. Taking a sample with 5 different configurations, shown in Table 1, we can represent the factor *D* with the dummy variables of Table 2.


| Configuration | Support to characteristic *x* |
|---|---|
| 1 | Yes |
| 2 | No |
| 3 | Yes |
| 4 | No |
| 5 | No |

**Table 1.** Sample with size 5, with several pipeline support configurations.


| Configuration | *u1* | *u2* |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |
| 5 | 0 | 1 |

**Table 2.** Representation of the sample of Table 1, through dummy variables.


We can see in Table 2 that the configurations with support to the characteristic *x* had values *u1=1* and *u2=0*, and that the configurations without support had values *u1=0* and *u2=1*.
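A minimal sketch of this encoding in plain Python (assuming the Yes/No sample of Table 1) follows:

```python
# Two-level factor D ("Yes" = supports characteristic x), as in Table 1.
configurations = ["Yes", "No", "Yes", "No", "No"]

levels = ["Yes", "No"]  # the levels of factor D, in the order u1, u2

# u[i][j] = 1 if configuration i is in the j-th category of D (Eq. 1).
dummies = [[1 if value == level else 0 for level in levels]
           for value in configurations]

for i, (u1, u2) in enumerate(dummies, start=1):
    print(f"configuration {i}: u1={u1}, u2={u2}")
```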

LRMs may also consider the combination of two or more factors. When the LRM has more than one factor, the effect of the combination of two or more factors is called an *interaction effect*. Interactions occur when the effect of a factor varies according to the level of another factor [10]. In contrast, the effect of a single factor, that is, without interaction, is called a *main effect*. The interaction concept is given as follows: if the change in the mean of the response variable between two levels of a factor A is the same for all levels of a factor B, then we can say that there is no interaction; but if the change differs across the levels of B, then we say that there is interaction. Interactions describe effects that the factors exert jointly on the model's response, and which are not reported in the analysis of correlation between the factors.
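For concreteness, a two-factor LRM with an interaction term can be written as below; the symbols *uA*, *uB* (dummy variables for factors A and B) and the coefficient names are our own illustrative notation, not the chapter's:

$$E(Y \mid u\_A, u\_B) = \beta\_0 + \beta\_1 u\_A + \beta\_2 u\_B + \beta\_3 u\_A u\_B$$

Here *β1* and *β2* are the main effects, while *β3* measures the interaction: when *β3 = 0*, the change in the mean response between the levels of A is the same for every level of B.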

So, considering the relations between the dependent variables and the predictive variables, the statistical linear regression model will be comprised of two functions, one for the mean and another for the variance, defined by the following equations, respectively:

$$E(Y \mid X = x) = \beta\_0 + \beta\_1 x \tag{2}$$

$$Var(Y \mid X = x) = \sigma^2 \tag{3}$$

where the parameters in the *mean* function are the intercept *β0*, which is the value of the mean *E(Y|X=x)* when *x* is equal to zero, and the slope *β1*, which is the rate of change in *E(Y|X=x)* for a change in the value of *X*, as we can see in Figure 1. By varying these parameters, it is possible to obtain all possible line equations. In most applications, these parameters are unknown and must be estimated from the problem data. We also assume that the *variance* function is constant, with a positive value *σ2* which is usually unknown.

Differently from mathematical models, which are deterministic, linear regression models consider the errors between the observed values and those estimated by the line equation. Due to the variance *σ2>0*, the values observed for the ith response *yi* are typically different from the expected values *E(Y|X=xi)*. In order to account for the error between the observed and the expected data, we have the concept of statistical error, or *ei*, for case *i*, implicitly defined by the equation:

$$y\_i = E(Y \mid X = x\_i) + e\_i \tag{4}$$

**Figure 1.** Graph of the line equation *E(Y|X=x) = β0 + β1x*.

or explicitly by:

$$e\_i = y\_i - E(Y \mid X = x\_i) \tag{5}$$


The *ei* errors depend on the unknown parameters of the *mean* function and are random variables, corresponding to the vertical distance between the point *yi* and the function of the mean *E(Y|X=xi)*.

We make two important assumptions about the nature of the errors. First, we assume that *E(ei|xi)=0*. The second assumption is that the errors must be independent, which means that the value of the error for one case provides no information about the value of the error for another case. In general, we assume that the errors are normally distributed (Gaussian distribution), with mean zero and variance *σ2*, which is unknown.
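These assumptions can be checked on the fitted residuals; a minimal sketch, assuming SciPy is available and using synthetic stand-in residuals, follows:

```python
import numpy as np
from scipy import stats

# Stand-in residuals; in practice these come from a fitted model (Eq. 12).
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.5, size=50)

# Zero-mean assumption: the sample mean should be close to 0.
print("mean of residuals:", residuals.mean())

# Normality assumption: Shapiro-Wilk test (a small p-value rejects normality).
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)

# Independence: a quick heuristic is the lag-1 autocorrelation,
# which should be near 0 for independent errors.
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print("lag-1 autocorrelation:", lag1)
```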

Assuming *n* pairs of observations *(x1, y1)*, *(x2, y2)*, ..., *(xn, yn)*, the estimates *β̂0* and *β̂1* of *β0* and *β1*, respectively, must result in the line that best fits the points. Many statistical methods have been suggested to obtain estimates of the parameters of a model. Among them, we can highlight the Least Squares and Maximum Likelihood methods. The first stands out as the most widely used estimator [13]. The Least Squares method minimizes the sum of the squares of the residuals *ei*, defined next, and its estimators are given by the equations:

$$\hat{\beta}\_1 = \frac{\sum\_{i=1}^n y\_i x\_i - \frac{(\sum\_{i=1}^n y\_i)(\sum\_{i=1}^n x\_i)}{n}}{\sum\_{i=1}^n x\_i^2 - \frac{(\sum\_{i=1}^n x\_i)^2}{n}} \tag{6}$$

$$\hat{\beta}\_0 = \overline{y} - \hat{\beta}\_1 \overline{x} \tag{7}$$

where the sample means *x̄* and *ȳ* are given by:


$$\overline{x} = \frac{\sum\_{i=1}^{n} x\_i}{n} \tag{8}$$

$$\overline{y} = \frac{\sum\_{i=1}^{n} y\_i}{n} \tag{9}$$

With the estimators, the regression line (or model) is given by:

$$\hat{y} = \hat{\beta}\_0 + \hat{\beta}\_1 x \tag{10}$$

where each pair of observations meets the relation:

$$y\_i = \hat{\beta}\_0 + \hat{\beta}\_1 x\_i + \hat{e}\_i, \qquad \text{for } i = 1, 2, \dots, n \tag{11}$$

From the above equation, we can then define the residual as:

$$r\_i = \hat{e}\_i = y\_i - \hat{y}\_i \tag{12}$$

where *êi* is the error in the fit of the model for the ith observation of *yi*.

The residuals *êi* are used to obtain an estimate of the variance *σ2* through the sum of the squares of *êi*:

$$
\hat{\sigma}^2 = \frac{\sum\_{i=1}^n \hat{e}\_i^2}{n-2} \tag{13}
$$
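To make Eqs. (6)-(13) concrete, the sketch below computes *β̂1*, *β̂0*, the residuals, and *σ̂2* on a small dataset; the data points are illustrative assumptions only:

```python
# Least-squares fit of y = b0 + b1*x, following Eqs. (6)-(9).
# The data points are illustrative assumptions only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)  # Eq. (6)
b0 = sum_y / n - b1 * sum_x / n                                # Eq. (7)

# Residuals (Eq. 12) and the variance estimate (Eq. 13).
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sigma2_hat = sum(e * e for e in residuals) / (n - 2)

print(f"b1={b1:.3f}, b0={b0:.3f}, sigma^2={sigma2_hat:.4f}")
```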

According to [14], the traditional project flow for modeling through LRMs can be divided into three stages: (i) formulation of models; (ii) fitting; and (iii) inference.

LRMs are a very useful tool, since they are very flexible in stage (i), simple to compute in stage (ii), and have reasonable criteria in stage (iii). These stages are performed in this sequence. In the analysis of complex data, after the inference stage, we may go back to stage (i) and choose other models based on more detailed information obtained from (iii).

The first stage, formulation of models, covers the choice of the probability distribution of the response variable (random component), of the predictive variables, and of the function that links these two components. The response variable used in this work is the estimated performance of the communication structure of the platform. The predictive variables are the configuration parameters of the buses contained in the communication design space. For this study, we analyzed several link functions and empirically chose the *identity* function, because it represents the direct mapping between bus configurations and their respective estimated performances.

The fitting stage consists of estimating the linear parameters of the generalized linear models. Several methods can be used to estimate the LRM parameters, such as the Least Squares and Maximum Likelihood methods.

Finally, the inference stage has the main objective of checking the adequateness of the model and performing a detailed study of the nonconformities between the observations and the estimates given by the model. When significant, these nonconformities may imply the choice of another linear model, or the acceptance of aberrant data; in either case, the whole methodology has to be repeated. In this stage, the analyst must check the precision and the interdependence of the performance estimates, build confidence regions and tests for the parameters of interest, statistically analyze the residuals, and make predictions.

## **3. The proposed approach**

When the processing of the GP algorithm ends due to some stop criterion (e.g., the maximum number of generations is reached), the genetic individual that best fits the data is selected to be formally evaluated through statistical inference, with the application of the test of assumptions. Depending on the result of the evaluation, the GP algorithm can either start a new iteration, generating a new starting population, or present the LRM as a final solution.

If no candidate is approved in the formal evaluation at the end of the iterations (limited to a maximum number as the second stop criterion), the best candidate among all the iterations may be reevaluated through residual diagnosis. In this other evaluation method, the assumptions about the model may be checked less formally, making this a more subjective kind of analysis.

Each one of the activities presented in the flow of Figure 2 will be detailed in the next subsections.

**3.1. Representation of solutions as genetic individuals**

GP normally uses trees as data structures [15] because the solutions are commonly mathematical expressions, so it is necessary to preserve their syntactic structure (trees are largely used to represent syntactic structures defined according to some formal grammar [16]).

As seen in the previous section, linear regression models are statistical models comprised of two elements: a response variable and the independent variables. In the proposed approach, these models are therefore also structured as trees, called *expression trees*, where the internal nodes are either linking operators (represented by the arithmetic operator of addition) or interaction operators (represented by the arithmetic operator of multiplication) acting on the predictive variables, which are located in the leaves of the tree, as shown in Figure 3; a minimal code sketch of this representation is given at the end of this subsection.

**Figure 3.** Example of LRM modeled as a genetic individual.

At the top of Figure 3 we can see an LRM and, right below it, the respective model in the form of a tree, which is the structure of a genetic individual. In this individual, we have, in
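As a minimal sketch of such an expression-tree individual (the class names and the example model are our own assumptions, not the chapter's implementation):

```python
from dataclasses import dataclass
from typing import Union

# Leaves hold predictive variables; internal nodes are '+' (linking)
# or '*' (interaction) operators, as in Figure 3.

@dataclass
class Var:
    name: str

@dataclass
class Node:
    op: str            # '+' or '*'
    left: "Tree"
    right: "Tree"

Tree = Union[Var, Node]

def to_formula(tree: Tree) -> str:
    """Render the expression tree as the systematic part of an LRM."""
    if isinstance(tree, Var):
        return tree.name
    return f"({to_formula(tree.left)} {tree.op} {to_formula(tree.right)})"

# Hypothetical individual encoding the model terms x1 + x2*x3:
individual = Node("+", Var("x1"), Node("*", Var("x2"), Var("x3")))
print(to_formula(individual))  # (x1 + (x2 * x3))
```

In a full GP system, crossover and mutation would operate on these nodes, while the coefficients of the resulting linear terms would be estimated by the least-squares procedure of Section 2.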
