**2. Methods**

242 Genetic Programming – New Approaches and Successful Applications

Reservoirs and the use of water for cooling are the most important sources of water temperature modifications caused by humans. The use of water for cooling, usually by power plants, causes the water to become warmer [4]. This is often called "thermal pollution".

Reservoirs can cause various effects, depending on factors such as the climate, the size of the impoundment, the residence time, the stability of the thermal stratification and the depth of the outlet [5]. Because thermal stratification occurs, the water from deep-release reservoirs is cooler in the summer and warmer in the winter than it would be without the reservoir [6,7]. Water diversions can also alter water temperature regimes because they reduce discharge, which causes the water temperature range to increase throughout the year [8]. Irrigation is also known to decrease discharge and increase water temperature [9].

In order to preserve the ecological balance, it is very important to have a continuous inspection of water quality in that portion of the river. Freshwater organisms are mostly ectotherms and are therefore largely influenced by water temperature. Some of the expected consequences of a water temperature increase are life-cycle changes [4, 10] and shifts in the distribution of species, with the arrival of allochthonous species [11, 12] and the expansion of epidemic diseases [13] as a possible result. Also, aquatic flora and fauna depend on dissolved oxygen to survive, and this water quality parameter is a function of water temperature as well.

Water temperature variation analysis in a river with a cascade dam involves several hydrological and environmental aspects because of the dams' impact on aquatic flora and fauna, as shown by [14,15,16,1,17,18,19].

Because temperature is a water quality parameter that affects aquatic flora and fauna, it is important to have mathematical models which allow one to estimate water temperature behavior. These models are based on climatic data such as solar radiation, net radiation, relative humidity, air temperature, and wind speed. Accurate water temperature modeling may help diminish the environmental impact of increased water temperature on aquatic flora and fauna within the river.

Genetic programming (GP) algorithms have been used to derive equations which estimate the ten-minute average water temperature from known variables such as relative humidity, air temperature, wind speed, solar radiation, and net radiation [20]. Only air temperature and relative humidity were associated with water temperature in some of the resulting equations, even though solar radiation is known to increase water temperature in rivers and ponds.

A correlation analysis could prove the implicit participation of solar radiation as a variable in air temperature, even though an explicit solar radiation term does not appear in the equation. Solar radiation appeared to be independent of water temperature because the lag time between a change in the solar radiation value and the corresponding change in water temperature was neglected; [1] estimated this lag time to be nearly 160 minutes. By inputting data to both the genetic programming algorithm and multiple linear

## **2.1. Genetic programming**

Evolutionary Computation (EC) comprises learning, search and optimization algorithms based on the theories of natural evolution and genetics. The steps of the basic structure of this kind of algorithm are shown in Figure 2. First, an initial population of potential solutions is randomly created (in the case of a Simple Genetic Algorithm (SGA), the initial population is composed of binary individuals). Then, the individuals of this population are evaluated against the problem to be solved (the environment), and a fitness value is assigned to each individual depending on how close it is to the optimum. A new generation is created by selecting the fitter solutions of the previous generation; genetic operators such as crossover and mutation (Alter P(t) in Figure 2) are then applied to the selected individuals in order to create a new population (offspring) whose fitness values are expected to improve on those of the previous generation. This new population is evaluated, and selection, crossover and mutation are applied again. This process continues until a termination criterion is reached (commonly a maximum number of generations).
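The loop described above can be sketched as follows. This is only a minimal illustration and not the algorithm used in this work: the function name `evolve`, the binary tournament selection, the one-point crossover, the bit-flip mutation and all default values are assumptions chosen for brevity.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, max_gen=50,
           p_cross=0.9, p_mut=0.05, seed=0):
    """Generational loop: evaluate, select, alter, repeat (illustrative only)."""
    rng = random.Random(seed)
    # Random initial population of binary individuals, as in a simple GA.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(max_gen):  # termination: maximum number of generations
        scores = [fitness(ind) for ind in pop]

        def pick():
            # Binary tournament (an assumption): the fitter of two random
            # individuals is selected.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick()[:], pick()[:]
            if rng.random() < p_cross:  # one-point crossover
                cut = rng.randrange(1, n_bits)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):  # bit-flip mutation, applied per bit
                for i in range(n_bits):
                    if rng.random() < p_mut:
                        child[i] ^= 1
                nxt.append(child)
        pop = nxt[:pop_size]
        # Keep track of the best individual seen so far.
        best = max(pop + [best], key=fitness)
    return best
```

Calling `evolve(sum)` maximizes the number of ones in the bit string, a toy problem that makes the effect of selection easy to see.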

Genetic Programming (GP) is a class of Evolutionary Algorithm (EA) [21,22,23] where individuals in the population are computer programs, usually expressed as syntax trees or as corresponding expressions in prefix notation (see Figure 3).

Comparison Between Equations Obtained by Means of Multiple

Linear Regression and Genetic Programming to Approach Measured Climatic Data in a River 245


**Figure 2.** Evolution-based algorithm.

**Figure 3.** Genetic programming representation: syntax tree, LISP or prefix notation, mathematical function and MATLAB program

As seen from Figure 3, individuals are created based on a function set and a terminal set defined according to the problem to be solved. The root node is generally a function selected randomly from the function set. Then, functions and terminals are chosen in order to form the syntax tree that represents an individual. It is important to set a maximum depth or a maximum number of nodes so that the size of the individuals can be controlled and bloat avoided. Bloat is the rapid growth of programs produced by genetic programming or variable coding heuristics.
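As a rough sketch of this construction, an individual can be held as a nested tuple and grown recursively under a depth limit. The tuple representation, the helper names and the 0.3 probability of choosing a terminal are assumptions made for illustration; the function and terminal names follow the sets used later in this chapter.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2, "/": 2, "cos": 1}  # name -> arity
TERMINALS = ["rs", "rn", "hr", "Ta", "Vv"]

def random_tree(rng, max_depth, root=True):
    """Grow a random syntax tree whose depth never exceeds max_depth."""
    # At the depth limit only terminals are allowed, which bounds tree size.
    # The root is forced to be a function whenever the limit permits it.
    if max_depth == 0 or (not root and rng.random() < 0.3):
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    args = tuple(random_tree(rng, max_depth - 1, root=False)
                 for _ in range(FUNCTIONS[name]))
    return (name,) + args

def depth(tree):
    """Depth of a tree; a lone terminal has depth 0."""
    return 0 if isinstance(tree, str) else 1 + max(depth(a) for a in tree[1:])

def size(tree):
    """Number of nodes in the tree."""
    return 1 if isinstance(tree, str) else 1 + sum(size(a) for a in tree[1:])

def to_prefix(tree):
    """Render the tree in LISP-like prefix notation."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join([tree[0]] + [to_prefix(a) for a in tree[1:]]) + ")"
```

For instance, `to_prefix(("+", "Ta", ("cos", "hr")))` gives `(+ Ta (cos hr))`, the prefix form of Ta + cos(hr).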


The fitness value of each individual in the population is usually calculated by running it with the problem input data, or testing data, and measuring how close the output of the program (individual) is to some desired (reference) output specified by the user.

In each generation, fitter individuals are evolved by means of crossover and mutation. Crossover is a sexual genetic operator that takes two parent individuals, randomly selects a node in each parent and exchanges the sub-branches rooted at the selected nodes between the parents, producing two new individuals. Because GP uses a variable-length representation of individuals, the nodes selected for crossing over are generally different in each parent. Note that if the parents to be crossed over are identical, the two new offspring are generally different from the parents because the node selected for crossing over is different in each parent. In Genetic Algorithms, by contrast, when two identical parents are crossed over, the offspring are identical to their parents because the crossing point is the same for both parents and they have the same length.

Mutation is an asexual genetic operator that takes an individual, randomly selects a node and replaces the associated branch with a new branch generated from the primitive set (function and terminal sets).
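The two operators described above can be sketched on nested-tuple trees as shown below. This is an illustrative sketch, not the MATLAB implementation used in this work: the helper names, the 0.3 terminal probability and the even split of the node budget in `grow` are assumptions, and the crossover shown here does not enforce a maximum number of nodes on the offspring.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2, "/": 2, "cos": 1}  # name -> arity
TERMINALS = ["rs", "rn", "hr", "Ta", "Vv"]

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node of the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace_at(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]

def size(tree):
    """Number of nodes in the tree."""
    return 1 if isinstance(tree, str) else 1 + sum(size(c) for c in tree[1:])

def crossover(rng, a, b):
    """Swap one randomly chosen subtree of a with one of b."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))  # chosen independently of pa
    return replace_at(a, pa, get(b, pb)), replace_at(b, pb, get(a, pa))

def grow(rng, max_nodes):
    """Grow a random branch containing at most max_nodes nodes."""
    # A function node needs room for itself plus one node per argument.
    usable = sorted(f for f, arity in FUNCTIONS.items() if 1 + arity <= max_nodes)
    if not usable or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(usable)
    arity = FUNCTIONS[name]
    share = (max_nodes - 1) // arity  # node budget left for each argument
    return (name,) + tuple(grow(rng, share) for _ in range(arity))

def mutate(rng, tree, max_nodes_mut=8):
    """Replace a randomly chosen branch with a newly grown one."""
    path = rng.choice(list(paths(tree)))
    return replace_at(tree, path, grow(rng, max_nodes_mut))
```

Because the two crossover points are drawn independently, crossing a tree with an identical copy of itself usually yields offspring that differ from the parents, which is the behavior noted above for GP.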

The application of evolutionary computing algorithms has expanded in the last few years to several engineering fields, particularly hydraulic and hydrological engineering. Examples include studies of hydroinformatics by [24,25] and studies in rainfall-runoff modeling by [26-31]. The unit hydrograph for a typical urban basin was obtained by means of genetic programming in [32].

Chezy's roughness coefficient was studied by [33], who also used evolutionary polynomial regression in [34,35].

A deep percolation model using genetic programming was obtained by [36]. Models related to sediments were obtained with genetic programming by [37].

Evapotranspiration has been predicted by means of genetic programming [38]. The flood routing problem was analyzed by means of genetic programming by [39], as was soil moisture [40].

In this work, a genetic programming algorithm operating in the MATLAB environment [41], developed at the *Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas* (IIMAS), *Universidad Nacional Autónoma de México* (UNAM), was applied and compared with a traditional curve adjustment technique, in an attempt to find another useful application of these optimization procedures. Here, a stochastic universal selection method was used [42] (Baker, 1987), and the crossover operator was applied with a probability of 90% (see Table 1). It is important to mention that two different mutation operators were used. The first one, applied with a probability of 5%, randomly selects a branch and replaces it with a newly generated one. The second mutation operator selects constant values and, with a probability of 5%, mutates them by adding a random value drawn from a defined range.
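A hedged sketch of the selection method and of the second mutation operator is given below. The selection routine follows the usual equally-spaced-pointer formulation attributed to Baker; it assumes non-negative fitness values where larger is better, so an error to be minimized would first have to be transformed. The function names, the `spread` parameter and the representation of constants as floats inside nested tuples are assumptions for illustration.

```python
import random

def stochastic_universal_sampling(rng, fitness, n_select):
    """Select n_select indices using equally spaced pointers on the fitness wheel."""
    step = sum(fitness) / n_select
    start = rng.uniform(0, step)  # a single random offset for all pointers
    chosen, cumulative, i = [], fitness[0], 0
    for k in range(n_select):
        pointer = start + k * step
        # Advance to the individual whose wheel segment contains the pointer;
        # the index guard protects against floating-point rounding at the end.
        while cumulative < pointer and i < len(fitness) - 1:
            i += 1
            cumulative += fitness[i]
        chosen.append(i)
    return chosen

def mutate_constants(rng, tree, p_mut_r=0.05, spread=1.0):
    """With probability p_mut_r, add a random value in [-spread, spread] to each constant."""
    if isinstance(tree, float):
        if rng.random() < p_mut_r:
            return tree + rng.uniform(-spread, spread)
        return tree
    if isinstance(tree, tuple):
        # Keep the function name and recurse into the arguments.
        return (tree[0],) + tuple(
            mutate_constants(rng, c, p_mut_r, spread) for c in tree[1:])
    return tree  # variables are left untouched
```

With fitness values `[1.0, 1.0, 2.0]` and four selections, the pointers are one unit apart, so the third individual, which holds half of the total fitness, receives two of the four slots.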

This climatic data modeling problem is expressed as a symbolic regression, a common application of genetic programming, where the function set consists of arithmetic and trigonometric functions and the terminal set consists of the climatological variables described in the next section.

## **2.2. Input data**

Water temperature (*Tw*), solar radiation (*rs*), net radiation (*rn*), relative humidity (*hr*), air temperature (*Ta*), and wind speed (V*v*) data measured at the Ribarroja Station from January to June of 1998 were utilized in this study. The ten-minute water temperature average was calculated using all of these variables. Later, the averaged air temperature and relative humidity (in decimals) were filtered to take into account a seven-day delay. Data filtering was done with the following equation:

$$V_{f}^{t} = \frac{1}{k+1}\sum_{i=t-k}^{t} V_i \tag{1}$$

Where:

*Vi* is the original independent variable,

*Vf* is the filtered independent variable at time step *t*, and

*k* is the size of the filter window (in this case *k*=6).

Solar radiation recorded at minute *ti* influences water temperature at instant *ti*+160 [1], and this gap needs to be taken into account for all of the data considered. For example, the first data point of the dependent variable, the ten-minute average water temperature at instant *ti*+160, was coupled with the first data point of the independent variable, solar radiation at instant *ti*. For the other independent variables, net radiation (*rn*) and wind speed (V*v*) values at *ti*+160 were used, while air temperature and relative humidity were considered using both the seven-day filtered values and the values corresponding to instant *ti*+160.

## **2.3. Objective function**

The objective function was to minimize the mean square error between the calculated and measured data using the following equation:

$$\min \quad Z = \frac{1}{n}\sum_{i=1}^{n}\left(Tw_i - \hat{T}w_i\right)^2 \tag{2}$$

Where:

*Z* is the function to minimize,

*Tw* is the average of the measured temperature over each ten-minute interval in ºC,

*T̂w* is the temperature calculated with the genetic programming algorithm in ºC, and

*n* is the number of data points.

## **2.4. Parameter setting**

Parameters used in the genetic programming algorithm are shown in Table 1. *MaxNumNodes* corresponds to the maximum number of nodes an individual can have, while *MaxNodesMut* represents the maximum number of nodes that a newly created branch can have for mutation. The terminal set represents the independent variables, and *Tw* corresponds to the dependent variable to be modeled.

| Parameter | Value | Description |
|---|---|---|
| Pcross | 0.9 | Probability of crossover |
| Pmut | 0.05 | Probability of mutation |
| Pmut\_R | 0.05 | Probability of mutating a node containing a constant |
| MaxNumNodes | 30 | Maximum number of nodes for each individual |
| MaxNodesMut | 8 | Maximum number of nodes for mutation |
| Nind | 200 | Number of individuals in the population |
| MaxGen | 5000 | Maximum number of generations (iterations) |
| Function\_Set | +, -, \*, /, cos | Function set |
| Terminal Set | *rs*, *rn*, *hr*, *Ta*, V*v* | Climatological variables |

**Table 1.** Parameter settings
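As a rough illustration of Eqs. (1) and (2) and of the 160-minute pairing described in Section 2.2, the data preparation and the objective can be sketched as follows. The function names are hypothetical, and the default of 16 lag steps assumes that the series are sampled every ten minutes (160 / 10 = 16); the filter simply operates on whatever time step the input series has.

```python
def moving_average(values, k=6):
    """Eq. (1): mean of the current value and the k preceding ones."""
    # The first k positions lack a full window, so the output starts at index k.
    return [sum(values[t - k:t + 1]) / (k + 1) for t in range(k, len(values))]

def pair_with_lag(rs, tw, lag_steps=16):
    """Couple solar radiation at t_i with water temperature lag_steps later."""
    return list(zip(rs[:len(rs) - lag_steps], tw[lag_steps:]))

def mean_square_error(measured, calculated):
    """Eq. (2): Z = (1/n) * sum((Tw - Tw_hat)^2)."""
    n = len(measured)
    return sum((m - c) ** 2 for m, c in zip(measured, calculated)) / n
```

For example, `moving_average([1, 2, 3, 4, 5, 6, 7, 8])` returns `[4.0, 5.0]`: each output is the mean of seven consecutive values, matching the *k*+1 terms in Eq. (1).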

The function *cosine (cos)* was included in the function set because preliminary tests showed a reduction in mean square error when it was present. This is related to one of the two properties that the function and terminal sets in GP must satisfy: *sufficiency*. This property states that the set of terminals and the set of functions should be defined so that they are able to express a solution to the problem under study [23]. The second property, *closure*, specifies that each of the functions in the function set should be able to accept, as its arguments, any value and data type that may possibly be returned by any function and any value or data type that may possibly be assumed by any terminal [23]. In this approach, a protected division was implemented in order to avoid division by zero. If this situation occurs, a high value is returned.
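A minimal sketch of a protected division and of how it keeps the function set closed is shown below. The constant `HIGH_VALUE = 1e6` is an assumption, since the text only states that "a high value" is returned; the nested-tuple trees and the `evaluate` helper are likewise illustrative.

```python
import math

HIGH_VALUE = 1e6  # assumed; the text only says that "a high value" is returned

def protected_div(a, b):
    """Division that returns HIGH_VALUE instead of failing when b is zero."""
    return a / b if b != 0 else HIGH_VALUE

# Every primitive accepts and returns floats, which satisfies closure.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": protected_div,
    "cos": math.cos,
}

def evaluate(tree, env):
    """Evaluate a nested-tuple expression given variable values in env."""
    if isinstance(tree, str):  # terminal: a climatological variable
        return env[tree]
    if isinstance(tree, (int, float)):  # terminal: a numeric constant
        return float(tree)
    return OPS[tree[0]](*(evaluate(arg, env) for arg in tree[1:]))
```

With this guard, an individual such as Ta / (hr - hr) evaluates to the high value rather than raising an error, so it receives a poor fitness and tends to be discarded by selection.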

With the cosine function included, the resulting equation also reproduced the periodic behavior of water temperature over time well.
