**4. Some critical aspects**

Tikk et al. (2003) and Mhaskar (1996) have provided detailed surveys of the evolution of approximation theory and of the use of neural networks for approximation purposes. Both papers provide the mathematical background on which other researchers have relied when studying the approximation properties of various neural networks.

Another very important work is that of Wray and Green (1994). They showed that, when neural networks are run on digital computers (which is true in the vast majority of cases), the real limitations of numerical computing must also be considered. Under these conditions, approximation properties such as universal approximation and best approximation become useless.

An overview of the mentioned surveys is given in (Belič, 2006).

The majority of these works have frequently been used as a misleading promise, or even proof, that neural networks can be used as a general approximation tool for any kind of function. The proofs provided are theoretically correct, but they do not take into account the fact that all those neural network systems were run on digital computers.

The nonlinearity of a neural network cell simulated on a digital computer is realized by polynomial approximation; each neural cell therefore produces a polynomial activation function. Polynomial equivalence of the activation function holds even in the theoretical sense: the activation function is an analytic function, and analytic functions have polynomial (power-series) equivalents.
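For illustration (an example added here, not taken from the works cited above), the logistic sigmoid $\varphi(x) = 1/(1+e^{-x})$, a typical activation function, is analytic and has the standard power-series expansion

$$\varphi(x) = \frac{1}{2} + \frac{x}{4} - \frac{x^3}{48} + \frac{x^5}{480} - \dots$$

so on a digital computer any truncation of this series makes the effective activation a polynomial.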

The output values of the neural network can be regarded as a sum of polynomials, which is just another polynomial. The coefficients of this polynomial are changed through the adaptation of the neural network weights. Due to the finite precision of numerical calculations, the contributions of the coefficients with higher powers fall below the precision of the computer, and the training process can no longer affect them. This is true of any kind of training scheme.
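A minimal numerical sketch of this effect (our own example, not from the original text): in IEEE double precision, a high-power term whose contribution falls below the machine epsilon of the running sum simply disappears, so no training scheme can adjust its coefficient.

```python
import numpy as np

# Double precision carries ~16 significant digits (machine epsilon ~2.2e-16).
print(np.finfo(float).eps)

# Low-order part of a polynomial activation evaluated at x = 0.1 ...
partial_sum = 0.5 + 0.1 / 4
# ... and a high-power term with a small coefficient, e.g. c9 * x**9:
tiny_term = 1e-9 * 0.1**9          # = 1e-18, far below eps * partial_sum

# The term is lost entirely in the addition, so changing its coefficient
# (i.e., training the corresponding weight) has no effect on the output.
print(partial_sum + tiny_term == partial_sum)   # True
```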

The work of Wray and Green (1994) has been criticized, but it nevertheless gives a very important perspective to the practitioners who cannot rely just on the proofs given for theoretically ideal circumstances.

There are many interesting papers describing the chaotic behaviour of neural networks (Bertels et al., 1996), (Huang, 2008), (Yuan, Yang, 2009), (Wang et al., 2011). The neural network training process is a typical chaotic process. For practical applications this fact must never be forgotten.

Neural networks can be used as a function approximation (modelling) tool, but they neither perform universal approximation, nor can the best approximation in a theoretical sense be found.

A huge amount of work has been done trying to find proofs that a neural network can provide approximation of continuous functions. It used to be of interest because the approximated function is also continuous and exists in a dense space. Then came the realization that, even if the approximated function is continuous, its approximation is not continuous since the neural network that performs the approximation consists of a finite number of building blocks. The approximation is therefore nowhere dense, and the function is no longer continuous. Although this does not sound promising, the discontinuity is not of such a type that it would make neural network systems unusable. This discontinuity must be understood in a numerical sense.

Another problem is that neural networks usually run on digital computers, which gives rise to another limitation mentioned earlier: as it is not always possible to achieve the prescribed approximation precision, **the fact that neural networks can perform the approximation with any prescribed precision does not hold**.

Furthermore, another complication is that continuity of the input function does not hold. When digital computers are used (or in fact any other measurement equipment), the denseness of a space and the continuity of the functions cannot be fulfilled for both input and output spaces. Moreover, when measurement equipment is used to probe the input and output spaces (real physical processes), denseness of a space in a mathematical sense is never achieved. For example, the proofs provided by Kurkova (1995) imply that the function to be approximated is in fact continuous, but the approximation process produces a discontinuous function. This is incorrect in practice where the original function can never be continuous in the first place.

A three-layer neural network has been proved to be capable of producing a universal approximation. Although this is true (for continuous functions), it is not practical when the number of neural cells must be very high and the approximation speed becomes intolerably low. Introducing more hidden layers can significantly lower the number of neural network cells used, and the approximation speed then becomes much higher.


Numerous authors have reported on their successful work regarding neural networks as approximators of sampled functions. Readers are kindly requested to consult those reports and findings; to name but a few: (Aggelogiannaki et al., 2007), (Wena, Ma, 2008), (Caoa et al., 2008), (Ait Gougam et al., 2008), (Bahi et al., 2009), (Wanga, Xub, 2010).

#### **4.1 Input and output limitations**

Neural networks are used to model various processes. The measurements obtained on the system to be modelled are of quite different magnitudes. This represents a problem for the neural network, where the neural network output (and often input) range is limited to the [0,1] interval. This is the case because the neural network cell activation function (Fig. 3) saturates output values at 0 and 1. Users should be well aware of these processes, as they are often the cause of various unwanted errors.

For some neural network systems this range is expanded to the [-1,+1] interval.

This is the reason why the input and output values of the neural network need to be preconditioned. There are two basic procedures to fit the actual input signal to the limited interval to avoid the saturation effect. The first process is scaling, and the second is offsetting.

First, the whole training data set is scanned, and the maximal $y_{\max}$ and minimal $y_{\min}$ values are found. The difference between them is the range which is to be scaled to the [0,1] interval. From all values $y_i$ the value of $y_{\min}$ is subtracted (this process is called offsetting). The transformation of the values is obtained by the equation

$$y'\_i = \frac{y\_i - y\_{\text{min}}}{y\_{\text{max}} - y\_{\text{min}}} \tag{4}$$

Here $y'_i$ represents the offset and scaled value that fits in the prescribed interval. The reverse transformation is obtained by the equation

$$y\_i = y\_{\min} + y\_i' \left( y\_{\max} - y\_{\min} \right) \tag{5}$$

The process is exactly the same for the neural network input values, only the interval is not necessarily limited to [0,1].
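A minimal sketch of Eqs. (4) and (5) (function and variable names are our own, not from the original text):

```python
import numpy as np

def scale(y, y_min, y_max):
    """Offset and scale values into [0, 1] as in Eq. (4)."""
    return (y - y_min) / (y_max - y_min)

def unscale(y_scaled, y_min, y_max):
    """Reverse transformation as in Eq. (5)."""
    return y_min + y_scaled * (y_max - y_min)

# Scan the training set for its extremes, then condition the values.
y = np.array([100.0, 250.0, 500.0])
y_min, y_max = y.min(), y.max()
y_s = scale(y, y_min, y_max)                  # -> [0.0, 0.375, 1.0]
assert np.allclose(unscale(y_s, y_min, y_max), y)
```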

This is not the only conditioning that occurs on the input and output sides of the neural network. In addition to scaling, there is a parameter that prevents saturation of the system, called the **scaling margin**. The sigmoidal function tends to become saturated for large input values (Fig. 3). If proper action is to be assured for large values as well, the scaling should not be performed on the whole [0,1] interval. The interval should be shrunk by a given amount (say 0.1, for a 10 % shrinking of the interval). The corrected values are now

$$y'_i = \frac{S_m}{2} + \frac{(1 - S_m)(y_i - y_{\min})}{y_{\max} - y_{\min}} \tag{6}$$



where $S_m$ represents the scaling margin.

The reverse operation is obtained by the equation

$$y\_i = y\_{\min} + \frac{\left(y\_i^\prime - \frac{S\_m}{2}\right)(y\_{\max} - y\_{\min})}{\left(1 - S\_m\right)}\tag{7}$$

Selecting the value $S_m = 0.1$ means that the actual values are scaled to the interval [0.05, 0.95] (see Fig. 3).
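A sketch of the margin-corrected transformation of Eqs. (6) and (7), under the same naming assumptions as above:

```python
def scale_margin(y, y_min, y_max, s_m=0.1):
    """Scale into [s_m/2, 1 - s_m/2], Eq. (6), keeping clear of saturation."""
    return s_m / 2 + (1 - s_m) * (y - y_min) / (y_max - y_min)

def unscale_margin(y_s, y_min, y_max, s_m=0.1):
    """Reverse transformation, Eq. (7)."""
    return y_min + (y_s - s_m / 2) * (y_max - y_min) / (1 - s_m)

# With s_m = 0.1 the extremes map to 0.05 and 0.95, as stated above.
print(scale_margin(100.0, 100.0, 500.0))   # 0.05
print(scale_margin(500.0, 100.0, 500.0))   # 0.95
```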

The training process is usually stopped when the prescribed accuracy is achieved on the training data set. However, there is still an important issue to discuss in order to understand the details. The parameter "training tolerance" is user-set, and it should be set according to the special features of each problem studied.

Setting the training tolerance to the desired value (e.g. 0.1) means that the training process will be stopped when, for all training samples, the output value does not differ from the target value by more than 10 % of the training set range. If the target values of the training set lie between 100 and 500, then setting the training tolerance parameter to 0.1 means that the training process will stop when, for all samples, the difference between the target and the approximated value does not exceed 40 (10 % of 500 − 100). Formally this can be written with the inequality:

$$\text{for all } i: \quad \left| y_i^T - y_i \right| < \Delta_T \left( y_{\max} - y_{\min} \right) \tag{8}$$

Here $y_i^T$ represents the target value for the $i$th sample, $y_i$ is the modelled value for the $i$th sample, $\Delta_T$ is the training tolerance, and $y_{\max}$, $y_{\min}$ are the highest and the lowest values of the training set.
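The stopping test of Eq. (8) can be sketched as follows (hypothetical names; the 100-500 example above gives a tolerance band of 40):

```python
import numpy as np

def training_converged(y_target, y_model, tol):
    """Eq. (8): every sample must lie within tol * (range of targets)."""
    band = tol * (y_target.max() - y_target.min())
    return bool(np.all(np.abs(y_target - y_model) < band))

y_target = np.array([100.0, 300.0, 500.0])
y_model  = np.array([120.0, 310.0, 480.0])
print(training_converged(y_target, y_model, tol=0.1))   # True: errors < 40
```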

It is not common practice to define the tolerance with regard to the upper and lower values of the observation space. For the case when neural networks are used to approximate unipolar functions, the absolute error $E_d$ is always smaller than the training tolerance

$$E\_d \le \Delta\_T \tag{9}$$

The absolute error produced by the neural network approximation for the $i$th sample becomes:

$$E\_i^d = \frac{\left| y\_i - y\_i^T \right|}{y\_{\text{max}}} \tag{10}$$

while the relative error is:

$$E\_i^r = \frac{\left| y\_i - y\_i^T \right|}{y\_i^T} \tag{11}$$

and the relation between the absolute and relative error is:

$$E\_i^r = E\_i^d \frac{y\_{\text{max}}}{y\_i^T} \tag{12}$$

And furthermore


$$E\_i^r \le \Delta\_T \frac{y\_{\text{max}}}{y\_i^T} \tag{13}$$

An important conclusion follows from the last inequality. At values where $y_i^T$ is much smaller than $y_{\max}$, the relative error becomes very high, and therefore the results of the modelling become useless for these small values. **This implies that the span between $y_{\max}$ and $y_{\min}$ should not be too broad (depending on the particular problem but, if possible, not more than two decades); otherwise the modelling will result in very poor performance for the small values.**
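A worked illustration with our own numbers: for $\Delta_T = 0.1$, $y_{\max} = 500$ and a small target value $y_i^T = 5$, Eq. (13) allows

$$E_i^r \le 0.1 \cdot \frac{500}{5} = 10,$$

i.e., a relative error of up to 1000 % on that sample, while a sample with $y_i^T = 500$ is guaranteed 10 % or better.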

To avoid this problem other strategies should be used.

#### **4.2 Approximation for wide range functions**

The usual study of neural network systems does not include the use of neural networks in a wide range of input and output conditions.

As shown in the previous section, neural networks do not produce a good approximation when the function to be approximated spans over a wide range (several decades).

The quality of approximation (approximation error) of small values is very poor compared with that of large ones. The problem can be solved by means of:

a. the log/anti-log strategy; the common and usual practice is to take the measured points and apply the logarithm (log10) function. The logarithmic data are then used as input and target values for the neural network (see the sketch after this list); or
b. the segmentation strategy; the general idea is to split the input and output space into several smaller segments and perform the neural network training of each segment separately. This also means that each segment is approximated by a separate neural network (sub-network). This method not only gives much better approximation results, but it also requires a larger training set (at least 5 data points per segment). It is also good practice to overlap segments in order to achieve good results on segment borders.
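A rough sketch of strategy (a), under our own variable names; the training step itself is omitted:

```python
import numpy as np

# Targets spanning four decades -- exactly the situation of Eq. (13).
y = np.array([0.5, 5.0, 50.0, 500.0])
y_log = np.log10(y)                       # compress the range first
lo, hi = y_log.min(), y_log.max()
y_train = (y_log - lo) / (hi - lo)        # then scale as in Eq. (4)

# ... train the network on y_train; after prediction, invert both steps:
y_back = 10 ** (y_train * (hi - lo) + lo)
assert np.allclose(y_back, y)
```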

**5. Training stability analysis**

The training process of a neural network is a process which, if repeated, does not lead to equal results. Each training process starts with different, randomly chosen connection weights, so the training starting point is always different. Several repetitions of the training process lead to different outcomes. Stability of the training process refers to a series of outcomes and their maximal and minimal values, i.e., the band that can be expected for the particular solution. The narrower the obtained band, the more stable the approximations that can be expected (Fig. 4).
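A sketch of such a stability study (our own construction; `train_network` is a hypothetical stand-in for any actual training routine):

```python
import numpy as np

def train_network(x, y, seed):
    """Hypothetical training run: random initial weights, returns predictions."""
    rng = np.random.default_rng(seed)
    # ... initialize weights from rng, train, predict ...
    return y + rng.normal(0.0, 0.05, size=y.shape)   # placeholder outcome

x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x)

# Repeat the training from different random starting points and record
# the band between the maximal and minimal outcomes at each input point.
runs = np.array([train_network(x, y, seed) for seed in range(30)])
band = runs.max(axis=0) - runs.min(axis=0)
print(band.max())   # the narrower the band, the more stable the training (Fig. 4)
```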