230 Genetic Programming – New Approaches and Successful Applications

showed that WGEP models are effective in forecasting daily precipitation, performing better than WNF models. Selle [6] utilized genetic programming to systematically develop alternative model structures with different levels of complexity for hydrological modelling, with the objective of testing whether GP can be used to identify the dominant processes within a hydrological system. Models were developed for predicting the deep percolation responses under surface-irrigated pastures for different soil types, water table depths and water ponding times during surface irrigation. The dominant processes in the model predictions, as determined from the models generated using genetic programming, were found to be comparable to those determined using conceptual models. It was thus concluded that genetic programming can be used to evaluate the structure of hydrological models. A common finding reported in all these studies is that GP modelling resulted in fairly simple models which could easily be interpreted for the physical significance of the input variables in making a prediction. Jyothiprakash and Magar (2012) [12] performed a comparative study of reservoir inflow models developed using ANN, ANFIS and linear GP for lumped and distributed data. The study reported superior performance of GP models over ANN and ANFIS models.

**2. Simple and interpretable hydrological models using genetic programming**

The major drawback of all data-driven modelling approaches is the black-box nature of the models: the user cannot easily identify what happens inside the model that computes the outputs corresponding to the supplied inputs. One of the key advantages of genetic programming as a modelling tool is its ability to develop simple hydrological models. The simplicity of the models is closely associated with their interpretability: the simpler the models are, the better they can be interpreted. This in turn helps in assessing the contributions of the different members of the predictor (input) set in making a particular prediction. Selle and Muttil (2011) utilized this capability of GP to test the structure of hydrological models for predicting the deep percolation response in surface-irrigated pastures. Data obtained from lysimeter experiments were used to develop simple models using genetic programming. The developed models were simple and interpretable, which helped in identifying the dominant processes involved in deep percolation. Often the developed models could be expressed as simple algebraic equations. The dominant processes identified compared well with those used in conceptual models. The study also investigated the recurrence of the models developed using GP over multiple runs and found that GP consistently produced the same model for a given level of model complexity. However, the study also reported that as the level of complexity increased, the recurrence of the generated models was affected and their physical interpretability decreased; hence the complexity of the system should be carefully understood before a level of complexity is chosen for the GP models.

This, however, illustrates that carefully developed GP models remain mathematically simple and are readily interpretable, to the extent that the dominant processes which influence the modelled system can be identified.

**2.1. Model complexity of GP and neural networks – Comparative study**

The authors conducted a study [7] to evaluate the complexity of predictive models developed using genetic programming in comparison with models developed using neural networks. The GP and neural network models were developed as potential surrogates for a complex numerical groundwater flow and transport model. The saltwater intrusion levels at monitoring locations resulting from pumping at a number of groundwater wells were modelled using GP and neural networks. The pumping rates at these well locations for three different stress periods were the inputs, or independent variables, of the model. The resulting salinity levels at the monitoring locations were the dependent variables, or outputs.

The GP- and ANN-based surrogate models were trained on training and validation data generated using FEMWATER, a three-dimensional coupled flow and transport simulation model. The GP models were developed using the software Discipulus, which implements a linear genetic programming algorithm. The ANN surrogate models were developed using a feed-forward back-propagation algorithm implemented in the software NeuroShell. The input data were the pumping rates at eleven well locations over three different time periods, constituting 33 input variables. Since pumping at each location can take any real value between the prescribed minimum and maximum, these input variables constitute a 33-dimensional continuous space, each dimension representing the pumping rate at a particular location in a particular stress period. Efficient training of the GP and ANN models therefore required carefully chosen input data representative of the entire input space. Latin hypercube sampling was performed to choose uniformly distributed input samples from the 33-dimensional input space. An input sample is a vector of 33 values of the pumping rates at the 11 well locations during the three stress periods. The salinity level at each observation location is the dependent variable, or output. The output values required for training the GP and ANN models were generated by running the FEMWATER model; the numerical simulation model was run numerous times to generate the output corresponding to each input vector. The input–output data set generated in this way was divided into two sets, with three quarters of the data in one set and the rest in the other. The larger set was used for training the GP and ANN models and the smaller one for validating them. The members of the training and validation sets were chosen randomly for both GP and ANN.
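The sampling-and-split procedure described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the number of samples and the pumping-rate bounds (`N_SAMPLES`, `LOW`, `HIGH`) are hypothetical, since the chapter does not report them, and a plain NumPy implementation of Latin hypercube sampling is used (`scipy.stats.qmc.LatinHypercube` would be an equivalent library choice).

```python
import numpy as np

# 11 wells x 3 stress periods = 33 pumping-rate input variables.
N_INPUTS = 11 * 3
N_SAMPLES = 200          # number of FEMWATER runs; hypothetical count

# Hypothetical pumping-rate bounds; the actual limits are not stated in the text.
LOW, HIGH = 0.0, 1500.0

def latin_hypercube(n, d, rng):
    """One sample per stratum in every dimension, independently permuted."""
    jitter = rng.random((n, d))                        # position inside each stratum
    strata = np.column_stack([rng.permutation(n) for _ in range(d)])
    return (strata + jitter) / n                       # values in [0, 1)

rng = np.random.default_rng(7)
pumping = LOW + (HIGH - LOW) * latin_hypercube(N_SAMPLES, N_INPUTS, rng)

# Random 3:1 split into training and validation sets, as in the study.
idx = rng.permutation(N_SAMPLES)
n_train = int(0.75 * N_SAMPLES)
train, valid = pumping[idx[:n_train]], pumping[idx[n_train:]]
```

Each row of `pumping` is one input vector of 33 pumping rates; running the simulation model once per row would produce the corresponding salinity outputs.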

The ANN used in the study was trained in supervised mode using a back-propagation algorithm. The objective function for both the GP and ANN training was minimization of the total root mean square error (RMSE) of the prediction, where the prediction error is the difference between the model-predicted (GP or ANN) values and the actual values in the data set generated by the numerical model.
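The RMSE objective used for both models is the standard one and can be written compactly; this is a generic sketch rather than the study's own implementation.

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean square error between surrogate (GP or ANN) predictions
    and the corresponding numerically simulated values."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

# A perfect surrogate has zero error; any mismatch raises the RMSE.
assert rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
```

Training seeks the model (weights for the ANN, structure plus terminal-set parameters for GP) that minimizes this quantity over the training set.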

Genetic Programming: Efficient Modeling Tool in Hydrology and Groundwater Management 233

The input–hidden–output layer architecture of the ANN model was optimized by trial and error. Both the GP and ANN models had 33 input variables and 3 outputs. The number of hidden neurons in the ANN model was determined by adding one hidden neuron in each trial. A sigmoid transfer function and a learning rate of 0.1 were used. In developing the model, the back-propagation algorithm modifies the connection weights between the input, hidden and output neurons by an amount proportional to the prediction error in each iteration, and repeats this procedure numerous times until the prediction errors are minimized to a pre-specified level. Thus, for any given model architecture (model structure), the neural network optimizes the connection weights to accomplish satisfactory model predictions. The genetic programming approach differs in that it evolves the optimal model architecture and its parameters together in achieving satisfactory predictions.

The GP models were developed using a population size of 500 and mutation and crossover frequencies of 95 and 50 percent, respectively. The number of generations was not specified a priori; instead, the evolutionary process was stopped when the fitness function fell below a critical value. In order to obtain the simplest models, the set of mathematical operators was initially kept to a minimum and further operators were then added to the functional set. In this manner, only addition and subtraction were included at first, and the multiplication, arithmetic and data-transfer operators were added to the set later.

The predictive performance of the GP and ANN models on an independent set of data was found to be satisfactory in terms of the correlation coefficient and the minimized RMSE. Figures 2 and 3 show, respectively, the ANN and GP predictions of salinity levels at three monitoring locations against the corresponding values from the numerical simulation model. A dissection of the GP and ANN models was performed to evaluate their complexity. The GP modelling framework essentially has a functional set and a terminal set. The functional set comprises mathematical operations such as addition, subtraction, division, multiplication, trigonometric functions, etc. The terminal set comprises the model parameters, which are optimized simultaneously as the model structure is optimized. In our study the developed GP models used a maximum terminal set size of 30, i.e., satisfactory model predictions could be achieved with only 30 parameters in the GP model.

The functional operators essentially develop the structure of the GP models by operating on the input variables. In the GP modelling framework this model structure is not pre-specified, unlike in the ANN models; instead, the model structure is evolved in the course of model development by testing numerous different model structures. This approach clearly provides scope for the development of improved model structures compared with the ANN method, in which comparatively few models are tested in a trial-and-error procedure that does not implement an organized search for better model architectures. The only components optimized during the development of an ANN model are the connection weights. The model structure is thus rigid and is retained as determined by the trial-and-error procedure, which gives less flexibility in adapting the model structure to the process being modelled. In our study it was found that while the GP models required only 30 parameters, the number of connection weights in the ANN models was 1224. This is one measure of the simplicity of the GP models relative to the ANN models. From Figures 2 and 3 it is observed that, despite the simplicity of the model and the much smaller number of parameters used, the GP predictions are very similar to the ANN predictions. For each hidden neuron added to the ANN architecture, the number of connection weights increases by the total number of inputs and outputs; the number of connection weights therefore grows steeply with the number of hidden neurons in the ANN architecture.

The comparison of the numbers of parameters in itself testifies to the ability of the genetic programming framework to develop simpler models. The number of parameters also affects the uncertainty of the predictions made using the model: the more parameters there are, the more uncertainty resides in them, and this uncertainty propagates into the predictions.

**3. Parsimonious selection of input variables**

Another key feature of genetic programming based modelling is its ability to identify the relative importance of the independent variables chosen as modelling inputs. Quite often in hydrological applications it is uncertain which variables are important enough to be included as inputs when modelling a physical phenomenon. Similarly, time series models are frequently used to predict or forecast hydrological variables; for example, the river stages measured on a few consecutive days can be used to forecast the river stage on the following days. In doing so, the number of past days' flows to include as inputs in the time series model depends on the size and shape of the catchment and similar parameters. Most often, rigorous statistical tests such as autocorrelation studies are conducted to determine whether an independent variable is significant enough to be included in the model development. Once included, a variable usually cannot be eliminated in most modelling frameworks because of the earlier-mentioned rigidity of the model structure. For example, in neural networks an insignificant model input should ideally be assigned zero connection weights to the output; however, these connection weights usually do not assume the value zero but converge to very small values near zero. As a result, the insignificant variable influences the predictions by a small amount, which introduces uncertainty into the predictions made.

The evolutionary process of determining the optimum model structure helps GP to identify and eliminate insignificant variables during model development. The authors conducted a study dissecting the neural network and GP models developed in the study described earlier.
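The parameter-count comparison from the dissection described above reduces to simple arithmetic, sketched here. The counting convention (fully connected layers, bias terms ignored) and the inferred number of hidden neurons are assumptions; the text reports only the totals of 1224 ANN weights and 30 GP parameters.

```python
def ann_connection_weights(n_inputs, n_hidden, n_outputs):
    # Fully connected input->hidden and hidden->output layers, bias terms ignored.
    return n_hidden * (n_inputs + n_outputs)

# Each added hidden neuron contributes n_inputs + n_outputs new weights:
# 33 + 3 = 36 in the study described above.
per_neuron = 33 + 3

# The 1224 connection weights reported are consistent, under this counting,
# with 34 hidden neurons (an inference; the text does not state the number).
n_hidden = 1224 // per_neuron
assert ann_connection_weights(33, n_hidden, 3) == 1224

# The GP surrogate achieved comparable accuracy with a terminal set of only 30.
gp_parameters = 30
```

The 36-weight increment per added hidden neuron is what makes the ANN's parameter count climb so quickly relative to the 30-parameter GP model.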

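The point about insignificant ANN inputs retaining small but nonzero weights can be illustrated with a toy diagnostic. This is a hypothetical sketch, not part of the study: the function name, the threshold `tol` and the weight matrix are all invented for illustration. GP, by contrast, needs no such test, since an unused variable simply never appears in the evolved expression.

```python
import numpy as np

def weakly_connected_inputs(input_hidden_weights, tol=1e-3):
    """Flag inputs whose total absolute outgoing weight falls below `tol`.

    Back-propagation rarely drives these weights exactly to zero, so an
    insignificant input still nudges every prediction by a small amount."""
    influence = np.abs(input_hidden_weights).sum(axis=1)  # one value per input
    return np.flatnonzero(influence < tol)

# Toy 4-input, 3-hidden-neuron weight matrix; input 2 is nearly disconnected.
W = np.array([
    [0.8, -0.5, 0.3],
    [1.2, 0.4, -0.9],
    [1e-5, -2e-5, 1e-5],   # near-zero, but not zero: still influences outputs
    [0.6, 0.1, -0.4],
])
print(weakly_connected_inputs(W))   # -> [2]
```

Such post-hoc thresholding is exactly the kind of extra step the evolutionary structure search makes unnecessary.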