82 Genetic Programming – New Approaches and Successful Applications

the roots of the tree and of the sub-tree on the left, the linking operator, and in the leaves we have the predictive variables X1, X2 and X3.

Formally, an LRM modeled as a genetic individual can be defined as a tree containing a finite set of one or more nodes, where:

i. there is a special node called *root*.

ii. the rest of the nodes form:

1. two distinct sets, where

2. each one of these sets is also a tree which, in this case, is also called a *sub-tree*. The sub-trees may be either left or right.

iii. the root of the tree, and of each adjacent sub-tree, is either a linking or an iteration operator.

iv. the leaves are independent variables.

Once we have defined the data structure used to represent the LRMs as genetic individuals, the next task, as defined in the flow of Figure 2, is the selection of the points of the project space that will form the training set for the GP algorithm. The following subsection gives more details about the technique chosen to select these points.

**3.2. Selection of the training set**

The elements that will compose the training set can be selected in many ways. However, techniques such as random sampling do not guarantee a well-distributed sample, and variance-based sampling does not allow the whole dataset of the sample to be collected, so the selected set may not be sufficient to obtain a linear regression model that yields accurate estimates. Therefore, in this work, we use the Design of Experiments technique [17] to select the points that will compose the training space.

Design of experiments, also known in statistics as controlled experiment, refers to the process of planning, designing and analyzing an experiment so that valid and objective conclusions can be drawn effectively and efficiently. In general, these techniques are used to collect the maximum amount of relevant information with the minimum consumption of time and resources, and to obtain optimal solutions even when it is impossible to have a functional mathematical (deterministic) model [17-20].

The design of experiments technique adopted in this work is known as the *Audze-Eglais Uniform Latin Hypercube* [21,22]. The Audze-Eglais method is based on the following analogy to physics:

*Assume a system composed of points of unit mass which exert repulsive forces on each other, causing the system to have potential energy. When the points are released from a starting state, they move. These points reach equilibrium when the potential energy of the repulsive forces of the masses is minimal. If the magnitude of the repulsive forces is inversely proportional to the square of the distance between the points, then minimizing the equation below will produce a system of points distributed as uniformly as possible.*

Genetically Programmed Regression Linear Models for Non-Deterministic Estimates 83

$$U = \sum_{p=1}^{P} \sum_{q=p+1}^{P} \frac{1}{L_{pq}^2} \tag{14}$$

where *U* is the potential energy, *P* is the number of points, and $L_{pq}$ is the distance between the points *p* and *q*, with *p≠q*.

The points of the project space are defined by the parameters of the system to be modeled: each point is a combination of the values that these parameters can take. The Audze-Eglais method can be applied to these project spaces, provided that we consider the intervals (the distances) between the values of each parameter of the system, and that these values are taken together in order to minimize the objective function.
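As an illustration of how the objective function in Equation (14) can be evaluated for a candidate set of points, the following minimal sketch computes *U* for two small designs. The function name `potential_energy`, the use of the Euclidean distance for the distance between points, and the two example designs are assumptions made for this illustration; they are not taken from the tool used in this work.

```python
from itertools import combinations


def potential_energy(points):
    """Audze-Eglais potential energy U of Equation (14).

    `points` is a sequence of equal-length coordinate tuples, one per
    point of the project space. Using the Euclidean distance for L_pq
    is an assumption of this sketch.
    """
    total = 0.0
    for p, q in combinations(points, 2):  # every pair with p < q
        squared_distance = sum((a - b) ** 2 for a, b in zip(p, q))
        if squared_distance == 0:
            raise ValueError("coincident points make U undefined")
        total += 1.0 / squared_distance
    return total


# Two hypothetical 3-point designs over two parameters with levels 1..3.
diagonal = [(1, 1), (2, 2), (3, 3)]
spread = [(1, 1), (2, 3), (3, 2)]

print(potential_energy(diagonal))  # 1/2 + 1/8 + 1/2
print(potential_energy(spread))    # 1/5 + 1/5 + 1/2
```

The second design has the lower potential energy, which matches the intuition that its points are spread more evenly than those lying on the diagonal.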

The minimization of the above equation can be performed through some optimization technique or by exhaustively evaluating every possible combination. The second approach may be infeasible, since evaluating every possible combination in project spaces with many points has a high computational cost. So, in this study, we used the GPRSKit [23] tool, which uses genetic programming techniques to minimize the equation and outputs the points of the project space identified in the optimization.
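The two approaches can be contrasted with a short sketch. It is not the genetic programming procedure of GPRSKit: the second function uses a simple stochastic swap search as a stand-in optimizer, and all names (`potential_energy`, `exhaustive_design`, `swap_search_design`), the integer levels 1..P and the Euclidean distance are assumptions made only for this illustration.

```python
import random
from itertools import combinations, permutations


def potential_energy(points):
    """U of Equation (14), assuming Euclidean distances between points."""
    return sum(
        1.0 / sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in combinations(points, 2)
    )


def exhaustive_design(n_points):
    """Evaluate every two-parameter Latin hypercube with levels 1..n_points."""
    levels = range(1, n_points + 1)
    candidates = (list(zip(levels, perm)) for perm in permutations(levels))
    return min(candidates, key=potential_energy)


def swap_search_design(n_points, n_params, iterations=2000, seed=0):
    """Stochastic swap search (a stand-in, NOT the genetic approach of the text)."""
    rng = random.Random(seed)
    # One permutation of the levels per parameter keeps the design a Latin
    # hypercube: each level is used exactly once for each parameter.
    columns = [rng.sample(range(1, n_points + 1), n_points) for _ in range(n_params)]
    best = potential_energy(list(zip(*columns)))
    for _ in range(iterations):
        col = rng.randrange(n_params)
        i, j = rng.sample(range(n_points), 2)
        columns[col][i], columns[col][j] = columns[col][j], columns[col][i]
        energy = potential_energy(list(zip(*columns)))
        if energy < best:
            best = energy
        else:  # undo a swap that did not lower U
            columns[col][i], columns[col][j] = columns[col][j], columns[col][i]
    return list(zip(*columns)), best


design, energy = swap_search_design(n_points=5, n_params=2)
print(design, energy)
```

For *P* points and *n* parameters there are $(P!)^{n-1}$ distinct Latin hypercube designs, which is why the exhaustive function is only practical for very small project spaces, while a search-based optimizer evaluates a fixed number of candidates.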

Once the training set has been defined, the next task is the generation of a starting population of genetic individuals, which are candidate LRM solutions, so that the genetic algorithm can evolve them.
