**3. Parsimonious selection of input variables**

232 Genetic Programming – New Approaches and Successful Applications

The ANN used in the study was trained in the supervised training mode using a back-propagation algorithm. The objective function for both the GP and ANN training was minimization of the total root mean square error (RMSE) of the prediction. The prediction error was calculated as the difference between the values predicted by the model (GP or ANN) and the actual values from the data set generated by the numerical model.

The input-hidden-output layer architecture of the ANN model was optimized by trial and error. Both the GP and ANN models had 33 input variables and 3 outputs. The number of hidden neurons in the ANN model was determined by adding one hidden neuron in each trial. A sigmoid transfer function and a learning rate of 0.1 were used. In developing the model, the back-propagation algorithm modifies the connection weights between the input, hidden and output neurons by an amount proportional to the prediction error in each iteration, and repeats this procedure until the prediction error is reduced to a pre-specified level. Thus, for any given model architecture (model structure), the neural network optimizes the connection weights to accomplish satisfactory model predictions. The genetic programming modelling approach differs in that it evolves the optimal model architecture and the respective parameters in achieving satisfactory predictions.

The GP models were developed with a population size of 500 and mutation and crossover frequencies of 95 and 50 percent, respectively. The number of generations was not specified a priori; instead, the evolutionary process was stopped when the fitness function fell below a critical value. In order to obtain the simplest models, the set of mathematical operators was initially kept to a minimum and further operators were then added to the functional set. In this manner, addition and subtraction alone were included at first, and later the operators multiplication, arithmetic and data transfer were added to the set.

The predictive performance of the GP and ANN models on an independent data set was found to be satisfactory in terms of the correlation coefficient and the minimized RMSE. Figures 2 and 3 respectively show the ANN and GP predictions of salinity levels at three monitoring locations against the corresponding values from the numerical simulation model. A dissection of the GP and ANN models was performed to evaluate model complexity. The GP modelling framework essentially has a functional set and a terminal set. The functional set comprises mathematical operations such as addition, subtraction, division, multiplication and trigonometric functions. The terminal set comprises the model parameters, which are optimized simultaneously as the model structure is optimized. In our study the developed GP models used a maximum terminal set size of 30; i.e., satisfactory model predictions could be achieved with only 30 parameters for the GP model.

The functional operators essentially develop the structure of the GP models by operating on the input variables. In the GP modelling framework this model structure is not pre-specified, unlike in ANN models. Instead, the model structure is evolved in the course of model development by testing numerous different model structures. This approach provides scope for the development of improved model structures compared with the ANN models.
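The ANN training procedure described above (sigmoid hidden units, a learning rate of 0.1, weight changes proportional to the prediction error, and stopping once the RMSE drops below a preset level) can be sketched as follows. This is a minimal illustration, not the study's code: the network size, the toy data and the stopping threshold are all assumptions.

```python
# Minimal back-propagation sketch: one hidden layer of sigmoid units,
# linear output, per-sample weight updates proportional to the error,
# stopping when RMSE falls below a pre-specified level.
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rmse(model, data):
    return math.sqrt(sum((model(x) - t) ** 2 for x, t in data) / len(data))

def train(data, n_hidden=3, lr=0.1, target_rmse=0.05, max_epochs=20000):
    random.seed(0)  # deterministic toy run
    w1 = [random.uniform(-1, 1) for _ in range(n_hidden)]  # input -> hidden
    b1 = [0.0] * n_hidden
    w2 = [random.uniform(-1, 1) for _ in range(n_hidden)]  # hidden -> output
    b2 = 0.0

    def predict(x):
        h = [sigmoid(w1[j] * x + b1[j]) for j in range(n_hidden)]
        return sum(w2[j] * h[j] for j in range(n_hidden)) + b2

    for _ in range(max_epochs):
        for x, t in data:
            h = [sigmoid(w1[j] * x + b1[j]) for j in range(n_hidden)]
            y = sum(w2[j] * h[j] for j in range(n_hidden)) + b2
            err = y - t  # prediction error for this sample
            for j in range(n_hidden):
                # gradient through the sigmoid hidden unit
                grad_h = err * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * err * h[j]   # change proportional to error
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * err
        if rmse(predict, data) < target_rmse:  # pre-specified error level
            break
    return predict

# toy data: fit a smooth curve with 10 samples
data = [(x / 10.0, 0.5 + 0.3 * math.sin(x / 10.0)) for x in range(10)]
model = train(data)
```

Note that, as the text emphasizes, this procedure only tunes the weights of a fixed architecture; the architecture itself (number of hidden neurons) must be varied externally by trial and error.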

Another key feature of the genetic programming based modelling approach is its ability to identify the relative importance of the independent variables chosen as modelling inputs. Often in hydrological applications it is uncertain which variables are important enough to be included as inputs when modelling a physical phenomenon. Similarly, time series models are quite often used to predict or forecast hydrological variables. For example, the river stages measured on a few consecutive days can be used to forecast the river stage on the following days. In doing so, the number of past days' flows to be included as inputs into the time series model depends on the size and shape of the catchment and many similar parameters. Most often, rigorous statistical tests such as autocorrelation studies are conducted to determine whether an independent variable is significant enough to be included in the model development. Once included, it is usually not possible to eliminate a variable from most modelling frameworks because of the rigidity of the model structure mentioned earlier. For example, in neural networks an insignificant model input should ideally be assigned zero connection weights to the output. However, these connection weights most often do not assume the value zero but converge to very small values near zero. As a result the insignificant variable still influences the predictions by a small amount, which introduces uncertainties into the predictions made.
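The lagged-input construction mentioned above can be made concrete with a short sketch; the function name and the toy river-stage series are illustrative assumptions, not data from the study.

```python
# Build time-series model inputs from the past n_lags days' river stages:
# each input row holds the n_lags previous values, the target is the next value.
def make_lagged_inputs(series, n_lags):
    """Return (inputs, targets) for a lag-based time series model."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])  # stages on the n_lags preceding days
        y.append(series[t])             # stage to forecast
    return X, y

# toy daily river stages (m)
stages = [2.1, 2.3, 2.6, 2.4, 2.2, 2.5]
X, y = make_lagged_inputs(stages, n_lags=3)
# X[0] == [2.1, 2.3, 2.6] predicts y[0] == 2.4
```

Choosing `n_lags` is exactly the input-selection problem the text describes: too few lags lose information, too many introduce insignificant inputs that a rigid model cannot later discard.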

The evolutionary process of determining the optimum model structure helps GP to identify and eliminate insignificant variables from the model development. The authors conducted a study dissecting the neural network and GP models developed in the study described above to evaluate the parsimony in the selection of inputs for model development. GP evolves the best model structure and parameters by testing millions of alternative model structures. The relative importance of each independent variable in the model development was computed from the recurrence of that variable in the best 30 models developed by GP. Thus, if an input appears in all 30 models its impact factor is 1, and if an independent variable appears in none of the best 30 models its impact factor is 0.
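The impact-factor computation just described amounts to a recurrence count over the best models. A minimal sketch, with a hypothetical function name and toy model sets:

```python
# Impact factor of each input = fraction of the best GP models in which it appears.
def impact_factors(best_models, n_inputs):
    """best_models: list of sets, each holding the input indices one model uses."""
    return [sum(1 for m in best_models if i in m) / len(best_models)
            for i in range(n_inputs)]

# toy example with 3 "best" models over 3 candidate inputs:
# input 0 appears in every model, input 2 in none
models = [{0, 1}, {0}, {0, 1}]
factors = impact_factors(models, 3)  # [1.0, 0.666..., 0.0]
```

An impact factor of 0 means the evolutionary search never found the variable useful in any of the top models, which is the basis for eliminating it.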

Genetic Programming: Efficient Modeling Tool in Hydrology and Groundwater Management 235

**Figure 2.** Salinity predictions at three locations by the ANN models

**Figure 3.** Salinity predictions at three locations by the GP models

To determine the significance of the inputs in the neural network model, a connection weights method was used [7]. In this method the significance of each input is computed as a function of the connection weights that connect it to the output through the hidden layer. The formulae used in [7] were used to compute this:

1. The first step in this approach is to compute the product of the absolute values of the input-hidden layer and hidden-output layer weights, as in (1), and then divide this product by its sum over all input neurons. This gives *Qih* in (2).

$$P_{ih} = \left| W_{ih} \right| \times \left| W_{ho} \right| \tag{1}$$

$$Q_{ih} = \frac{P_{ih}}{\sum_{i=1}^{ni} P_{ih}} \tag{2}$$

2. For each input neuron *i*, divide the sum of *Qih* over the hidden neurons by the sum of *Qih* over all hidden and input neurons, as in (3). This gives the relative importance (*RI*) of all output weights attributable to the given input variable. The relative importance is then mapped onto a 0-1 scale, with the most important variables assuming a value of 1. An *RI* value of 0 indicates an insignificant variable.

$$RI_{i} = \frac{\sum_{h=1}^{nh} Q_{ih}}{\sum_{h=1}^{nh} \sum_{i=1}^{ni} Q_{ih}} \tag{3}$$

In this manner, the significance of each independent variable (input) to the model was quantified in a 0-1 range, as the impact factor for GP and the relative importance for ANN. These values for the GP and ANN models are plotted in figures 4, 5 and 6.
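Equations (1)-(3) can be implemented directly. The sketch below is not the authors' code; it assumes the weights are given as plain nested lists (`w_ih[i][h]` for input-hidden, `w_ho[h]` for hidden-output) and that every hidden neuron receives at least one non-zero input weight.

```python
# Connection weights method [7]:
#   eq (1): P_ih = |W_ih| * |W_ho|
#   eq (2): Q_ih = P_ih / sum over inputs of P_ih
#   eq (3): RI_i = sum over hidden of Q_ih / sum over hidden and inputs of Q_ih
def relative_importance(w_ih, w_ho):
    """Relative importance of each input, rescaled so the top input scores 1."""
    ni, nh = len(w_ih), len(w_ho)
    # eq (1): products of absolute weight values
    p = [[abs(w_ih[i][h]) * abs(w_ho[h]) for h in range(nh)] for i in range(ni)]
    # eq (2): normalize each hidden neuron's column over the input neurons
    col = [sum(p[i][h] for i in range(ni)) for h in range(nh)]
    q = [[p[i][h] / col[h] for h in range(nh)] for i in range(ni)]
    # eq (3): each input's share of the grand total
    total = sum(sum(row) for row in q)
    ri = [sum(q[i]) / total for i in range(ni)]
    # map onto a 0-1 scale, most important variable = 1
    top = max(ri)
    return [r / top for r in ri]

# toy network: input 2 has all-zero weights, so its RI must be 0
ri = relative_importance([[1, 1], [1, 1], [0, 0]], [1, 1])
```

Note that, unlike GP's impact factor, a near-zero (but non-zero) weight still yields a small non-zero *RI*, which is exactly the residual influence of insignificant inputs discussed earlier.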

**Figure 4.** Impact factors of input variables in predicting Salinity at location 1.

**Figure 5.** Impact factors of input variables in predicting Salinity at location 2.

**Figure 6.** Impact factors of input variables in predicting Salinity at location 3.

From these figures it can be observed that all the variables considered have a non-zero impact in the developed ANN models, whereas GP is able to assign a zero impact factor to those inputs which are not significant and thus eliminate them from the model. This helps in developing simpler models and reducing the predictive uncertainty. In figure 4 it can be seen that GP identified 13 inputs with zero impact factor. This implies that the pumping values corresponding to these inputs have a negligible effect on the salinity levels at the observation location. Thus 13 of the 33 inputs considered are eliminated from the GP models, resulting in much simpler models than the ANN models, in which all 33 inputs take part in predicting the salinity even though some of them have very little impact on the predictions made. The ability of GP to eliminate insignificant variables stems from the evolutionary nature of its model structure optimization. By performing crossover, mutation and selection of candidate models over a number of generations, GP is able to derive the optimum model structure containing the most important input variables relevant to the model prediction. This in turn helps in developing simpler models with fewer uncertainties in the model prediction.

**4. Multiple predictive model structures using GP**

The advent of GP as a modelling tool has paved the way for research exploring the possibility of multiple optimal models for predicting hydrological processes. Genetic programming, in its evolutionary approach to deriving optimal model structures and parameters, tests millions of model structures which can mimic the physical process under consideration. Researchers have found that multiple models can be identified using GP which differ considerably in model structure but are able to make consistently good predictions. Parasuraman and Elshorbagy [8] developed genetic programming based models for predicting evapotranspiration. In doing so, multiple optimal GP models were trained and tested, and they were applied to quantify the uncertainty in those models. Another study by the authors [9] developed ensemble surrogate models for predicting aquifer responses to pumping in terms of salinity levels at observation locations. An ensemble of surrogate models based on GP was developed, and the ensemble was used to obtain model predictions with improved reliability levels. The variance of the model predictions was used as the measure of uncertainty in the modelling process.

**5. GP as surrogate model for simulation-optimization**

A very important application of data-intensive modelling approaches is the development of surrogate models for computationally complex numerical simulation models. As detailed elsewhere in this article, the authors have utilized GP in developing potential surrogates for a complex density-dependent groundwater flow and transport simulation model. The potential utility of the surrogates is to replace the numerical simulation model in simulation-optimization frameworks. Simulation-optimization models are used to derive optimal management decisions using optimization algorithms in which a numerical simulation model is embedded.
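To illustrate how a surrogate replaces the simulator inside such a framework, here is a deliberately simplified sketch: the surrogate expression, the salinity limit and the grid-search "optimizer" are all hypothetical stand-ins, not the authors' models or solver.

```python
# A cheap surrogate (standing in for an evolved GP expression) maps a pumping
# rate to a predicted salinity; the optimizer queries it instead of running
# the expensive density-dependent flow-and-transport simulation each time.
def surrogate_salinity(pumping):
    """Toy GP-style surrogate: salinity (ppt) as a function of pumping (m3/d)."""
    return 0.2 + 0.004 * pumping + 0.00001 * pumping ** 2

def optimize(max_salinity=1.0, rates=range(0, 301, 10)):
    """Pick the largest pumping rate whose predicted salinity stays feasible.

    A plain grid search stands in for the optimization algorithm; in a real
    simulation-optimization framework this would be a GA or similar solver
    making thousands of surrogate evaluations.
    """
    feasible = [r for r in rates if surrogate_salinity(r) <= max_salinity]
    return max(feasible)

best_rate = optimize()  # 140: salinity(140) ~ 0.956 <= 1.0, salinity(150) > 1.0
```

The gain is purely computational: each surrogate call costs microseconds, so the optimizer can afford the very large number of evaluations that would be prohibitive with the numerical model in the loop.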
