**3. Description of the proposed approach**

The GP algorithm used herein follows the same guidelines as traditional GP approaches: representation of solutions as genetic individuals; selection of the training set; generation of the starting population of genetic individuals, which are candidate solutions; evaluation of the fitness of the candidates against the training set; selection of parents; and evolution through the selection, crossover and mutation operators [2]. Besides these activities, this work includes two new stages, which consist of the evaluation of the final model, as shown in the flow of Figure 2.

**Figure 2.** Flow of the proposed GP approach with LRM.

When the processing of the GP algorithm ends due to some stop criterion (e.g., the maximum number of generations is reached), the genetic individual that best fits the data is selected to be formally evaluated through statistical inference, by testing the model assumptions. Depending on the result of this evaluation, the GP algorithm can either start a new iteration, generating a new starting population, or present the LRM as the final solution.

If no candidate is approved in the formal evaluation by the end of the iterations (limited to a maximum number, which acts as the second stop criterion), the best candidate among all iterations may be re-evaluated through residual diagnostics. In this alternative evaluation method, the assumptions about the model are checked less formally, making it a more subjective kind of analysis.
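The iteration logic described above can be sketched as follows. This is a minimal sketch, not the chapter's implementation: the helper routines and the acceptance thresholds are hypothetical stand-ins, and fitness is treated as "lower is better" (e.g., an AIC-like score).

```python
# Sketch of the outer loop of the proposed approach: GP evolution plus
# the two evaluation stages (formal assumption tests first, residual
# diagnostics as a fallback). All helpers are illustrative stand-ins.

def run_gp_iteration(seed):
    """Evolve one GP population and return its fittest LRM candidate."""
    return {"model": f"LRM-{seed}", "fitness": 1.0 / (seed + 1)}

def passes_assumption_tests(candidate):
    """Formal evaluation via statistical tests of the LRM assumptions."""
    return candidate["fitness"] < 0.2   # illustrative acceptance rule

def residual_diagnostics_ok(candidate):
    """Less formal, more subjective residual analysis."""
    return candidate["fitness"] < 0.5   # illustrative, looser rule

def genetically_programmed_lrm(max_iterations=10):
    candidates = []
    for it in range(max_iterations):        # second stop criterion
        best = run_gp_iteration(it)         # first stop criterion applies inside
        if passes_assumption_tests(best):
            return best                     # approved: present as final solution
        candidates.append(best)
    # No candidate formally approved: re-evaluate the best one found
    # across all iterations through residual diagnostics.
    overall_best = min(candidates, key=lambda c: c["fitness"])
    return overall_best if residual_diagnostics_ok(overall_best) else None
```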

The fitness stage consists of the process of estimating the linear parameters of the generalized linear models. Several methods can be used to estimate the LRM parameters, such as the Least Squares and Maximum Likelihood methods.

Finally, the inference stage has the main objective of checking the adequateness of the model and of performing a detailed study of the discrepancies between the observations and the estimates given by the model. These discrepancies, when significant, may imply the choice of another linear model, or the acceptance of aberrant data; in either case, the whole methodology has to be repeated. In this stage, the analyst must check the precision and the interdependence of the performance estimates, build confidence regions, perform tests about the parameters of interest, statistically analyze the residuals and make predictions.

Each one of the activities presented in the flow of Figure 2 is detailed in the next subsections.

### **3.1. Representation of solutions as genetic individuals**

GP normally uses trees as data structures [15] because the solutions are commonly mathematical expressions, and it is therefore necessary to preserve their syntactic structure (trees are widely used to represent syntactic structures defined according to some formal grammar [16]).

As seen in the previous subsection, linear regression models are statistical models comprised of two elements: a response variable and the independent variables. In the proposed approach, these models are therefore also structured as trees, called *expression trees*, where the internal nodes are either linking operators (represented by the arithmetic addition operator) or interaction operators (represented by the arithmetic multiplication operator) acting on the predictive variables, which are located in the leaves of the tree, as shown in Figure 3.

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$

**Figure 3.** Example of an LRM modeled as a genetic individual.

The top of Figure 3 shows an LRM and, right below it, the respective model in the form of a tree, which is the structure of a genetic individual. In this individual, the root of the tree and the root of the left sub-tree hold the linking operator, and the leaves hold the predictive variables X1, X2 and X3.
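As an illustration of this representation (a minimal sketch, not the chapter's implementation), an LRM of the shape shown in Figure 3 can be encoded and evaluated with a simple binary expression tree, where `'+'` plays the role of the linking operator and `'*'` the interaction operator; the coefficient values below are illustrative.

```python
# Minimal expression-tree sketch of an LRM: internal nodes hold the
# linking (+) or interaction (*) operators, leaves hold predictive
# variables or numeric coefficients.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value            # '+', '*', a variable name, or a number
        self.left = left
        self.right = right

    def evaluate(self, env):
        if self.value == '+':                   # linking operator
            return self.left.evaluate(env) + self.right.evaluate(env)
        if self.value == '*':                   # interaction operator
            return self.left.evaluate(env) * self.right.evaluate(env)
        if isinstance(self.value, str):         # predictive variable leaf
            return env[self.value]
        return self.value                       # coefficient leaf

# y = 1.0 + 2.0*X1 + 0.5*X2 + 3.0*X3, mirroring the layout of Figure 3.
tree = Node('+',
            Node('+',
                 Node('+',
                      Node(1.0),                              # beta_0
                      Node('*', Node(2.0), Node('X1'))),      # beta_1 * X1
                 Node('*', Node(0.5), Node('X2'))),           # beta_2 * X2
            Node('*', Node(3.0), Node('X3')))                 # beta_3 * X3
```

For example, `tree.evaluate({'X1': 1, 'X2': 2, 'X3': 1})` yields 7.0. Crossover and mutation then act by swapping or replacing sub-trees of such individuals.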

Formally, an LRM modeled as a genetic individual can be defined as a tree containing a finite set of one or more nodes, where:

1. there is a special node called the *root*, and the remaining nodes are partitioned into two distinct sets, where
2. each one of these sets is also a tree which, in this case, is called a *sub-tree*. The sub-trees may be either left or right.

Once we define the data structure that will be used to represent the LRMs as genetic individuals, the next task, as defined in the flow of Figure 2, is the selection of the points of the project space that will be used to form the training set for the GP algorithm. The following subsection gives more details about the technique chosen to select these points.

### **3.2. Selection of the training set**

The training set is selected with the Audze-Eglais method, which distributes the chosen points by minimizing their potential energy:

$$U = \sum_{p=1}^{P} \sum_{q=p+1}^{P} \frac{1}{L_{pq}^{2}} \tag{14}$$

where *U* is the potential energy, *L<sub>pq</sub>* is the distance between the points *p* and *q*, and *p ≠ q*.

The points of the project space are comprised of the parameters of the system to be modeled, and each point is a combination of the values that these parameters can receive. The Audze-Eglais method can be applied to these project spaces, provided that we consider the intervals (the distances) between the values of each parameter of the system, and that these values are taken together, in order to minimize the objective function.

The minimization of the above equation can be performed through some optimization technique or by verification of every possible combination. The second approach may be unviable, since searching every possible combination in project spaces with many points has a high computational cost. So, in this study, we used the GPRSKit [23] tool, which uses genetic programming techniques to minimize the equation and outputs the points of the project space identified in the optimization.

### **3.3. Generation of the starting population of genetic individuals**

Once the training set is defined, the next task is the generation of a starting population of genetic individuals, which are the candidate LRMs, so that the genetic algorithm can evolve them.

There must be a starting population so that the evolution algorithm can act, through the application of the selection, crossover and mutation operators. For this, aiming at the variability of individuals and the consequent improvement in the precision of results, we adopted the *Ramped Half-and-Half* [24] technique.

This technique initially selects a random value to be the maximum depth of the tree to be generated. Next, the method for the generation of the new tree is selected. *Ramped Half-and-Half* uses two generation methods, each one generating half of the population. They are described below:

- **Growing**: this method creates new trees of several sizes and shapes, respecting the depth limit previously defined. Figure 4(a) shows an example of a tree created with the application of this method. In it, we see that the leaves have different depths.
- **Complete**: a tree created with this method has all its leaves at the same depth, which is also selected at random but respects the depth limit initially chosen. Figure 4(b) shows a tree created with this method. Notice that all leaves have the same depth.

### **3.4. Description of the utility function (Fitness)**

The fitness of a candidate LRM is evaluated based on the quality of the estimates it generates, compared to the observations in the problem data. The quality of an LRM can be quantified through its fitness and its complexity, measured, in this study, by the *Akaike Information Criterion* (AIC) [8], since it is one of the most used criteria [10].

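As a concrete sketch of how such a criterion scores candidates, the following assumes the common least-squares form of the AIC, 2k + n·ln(RSS/n); the chapter does not spell out the exact variant used.

```python
import math

def aic(residuals, num_params):
    """AIC for a least-squares fit: 2k + n*ln(RSS/n).

    Lower is better: the first term penalizes model complexity
    (k parameters), the second rewards goodness of fit (small RSS).
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return 2 * num_params + n * math.log(rss / n)

# Illustrative comparison: a slightly better fit does not justify
# three extra parameters.
simple = aic([0.5, -0.4, 0.3, -0.2], num_params=2)
richer = aic([0.45, -0.38, 0.29, -0.19], num_params=5)
```

Here `simple < richer`, so the smaller model would be preferred as the fitter individual.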