**2. The model**

Several methods have been suggested for detecting abnormal events in similar cases. A basic approach to such problems is unsupervised machine learning (USL), as described in detail by Celebi and Aydin [1]. One of the first and most fundamental USL methods is clustering. *Clustering* groups vectors into several groups, such that the members of each group are as similar as possible and the differences between groups are as great as possible. Clustering may be distance-based or density-based. Examples of distance-based clustering, such as the k-means algorithm, were presented by Knorr and Ng [2, 3], who compute an abnormality score by counting the neighbors of each point. More recent work in this field was introduced by Angiulli and Pizzuti [4], who compute the anomaly score of a data instance as the sum of its distances from its k-nearest neighbors. Ramaswamy et al. [5] extend this technique to spatial data. Their methods are also based on the kNN algorithm. Bay and Schwabacher [6] introduced the same algorithm with regard to pruning.

Another clustering philosophy is based on the density of points. Examples are the works of Breunig et al. [7, 8] on relative density. Jin et al. [9] showed how some of the calculations can be skipped. Tang et al. [10] further improved clustering by adding the idea of a connectivity-based outlier factor, which refers to the number of connections between points. Jin et al. [9] also introduced improvements by adding the idea of a symmetric neighborhood relationship. The main density-based algorithm is known as the EM algorithm.

In both methods, distance-based or density-based, the result is a multi-dimensional data structure which contains centroids. A *centroid* is the center of a group. It is, in the broadest sense, the group's center of gravity, i.e., the coordinates of the center of the group in each dimension. Once such a data structure exists, each new incoming record is evaluated based on its distance from the nearest centroid. If the new record is too far from every known centroid, it is declared a suspicious record, one that should be examined. After examination, the new point is classified as either a True or a False event. This classification is added to the model's learning set.

A second method, which has been used for abnormality detection, is based on a prediction methodology. According to this methodology, one of the variables in a multi-dimensional space is considered to be a dependent variable, whose value or class (in the case of discrete values) is related to the other variables. A mathematical model is then constructed which describes the relation between the dependent variable and all other variables. This model can be based, for example, on linear regression, decision trees or neural networks. In this case, each new incoming record is used to generate a prediction. If the predicted value is too far from the actual value of the dependent variable, or has a different class, the new record is again considered abnormal and must be investigated. Examples of such a methodology are given by Stefano et al. [11], Odin and Addison [12], Hawkins et al. [13] and Williams et al. [14].

A third method, which is the focus of this chapter, is based on examining changes in the noise patterns generated by the multi-dimensional data. Several methods have been suggested along this line. A fundamental method was demonstrated by Cheng et al. [15], which used an RBF<sup>1</sup> function to identify abnormal patterns in a moving window.

<sup>1</sup> Radial basis function (see https://en.wikipedia.org/wiki/Radial\_basis\_function).

34 Applications in Water Systems Management and Modeling
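The centroid-based check described above can be illustrated with a minimal sketch. It assumes the centroids have already been produced by some clustering step; the function names, the centroid coordinates and the `max_distance` threshold are hypothetical and are not taken from the cited works.

```python
import math

def nearest_centroid_distance(record, centroids):
    """Return the Euclidean distance from `record` to its closest centroid."""
    return min(math.dist(record, c) for c in centroids)

def is_suspicious(record, centroids, max_distance):
    """Flag a record that lies farther than `max_distance` from every centroid."""
    return nearest_centroid_distance(record, centroids) > max_distance

# Hypothetical centroids in a normalized two-dimensional space.
centroids = [(0.2, 0.2), (0.8, 0.7)]

print(is_suspicious((0.25, 0.2), centroids, max_distance=0.3))  # close to the first centroid
print(is_suspicious((0.5, 1.5), centroids, max_distance=0.3))   # far from both centroids
```

In practice the threshold would have to be chosen from the data; a fixed value is used here only to keep the sketch short.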

The remainder of this section presents an overview of the mathematical model used in this chapter. It starts with Brownian Motion (BM), named after Robert Brown [17], who observed the characteristic movement of pollen grains suspended in water. Einstein [18] used the idea of BM to give a precise account of the movement of atoms. This explanation was later validated by Perrin, who was awarded the Nobel Prize in Physics in 1926.

BM was also used by Louis Bachelier [19] in his 1900 Ph.D. thesis "The Theory of Speculation", in which he presented a stochastic analysis of the stock and option markets. His work went on to inspire the work of Black and Scholes [20], which was later recognized with the Nobel Prize in economics (awarded in 1997 to Scholes and Merton).

Modern literature gives many examples of the usage of BM in various areas; most are related to biology, chemistry, physics and other fields of life sciences. However, it is rare that such a technique or a similar one is used for the analysis of abnormal water events - the main topic of the current chapter.

One of the central results of BM theory is an estimate of the distance traveled by a particle that moves randomly through a multi-dimensional space during a given time interval. According to this theory, if *ρ*(x,m) is the density function of particles at location x (where x is a single dimension, e.g., one axis) at time m, then ρ satisfies the diffusion equation:

$$\frac{\partial \rho}{\partial m} = D \frac{\partial^2 \rho}{\partial x^2} \tag{1}$$

where D is the *mass diffusivity*, a term which measures how fast particles of a given type may move in a specific material, in our case, water. The solution of Eq. (1) gives a density function with a first moment, which is seen to vanish, and a second moment given by:

$$
\overline{x^2} = 2D \ast m \tag{2}
$$

The left side of Eq. (2) expresses the mean squared distance at which a particle can be found from its origin, given the elapsed time m and the diffusivity parameter D. Assuming x is normally distributed, the maximum distance a particle can travel in a given time can be calculated from Eq. (2) with a given confidence interval.
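Eq. (2) can be checked numerically with a small simulation. The sketch below assumes a one-dimensional random walk in discrete unit time steps, where each step is drawn from a normal distribution with variance 2D; this discretization, the function name and the parameter values are illustrative assumptions rather than part of the chapter's model.

```python
import random

def mean_squared_displacement(D, m, particles=20000, seed=0):
    """Estimate the mean of x^2 after m unit-time steps of a 1-D random walk."""
    rng = random.Random(seed)
    step_sd = (2.0 * D) ** 0.5  # assumed: each unit-time step has variance 2D
    total = 0.0
    for _ in range(particles):
        x = 0.0
        for _ in range(m):
            x += rng.gauss(0.0, step_sd)
        total += x * x
    return total / particles

D, m = 0.5, 10
estimate = mean_squared_displacement(D, m)
print(estimate, 2 * D * m)  # simulated value next to the value predicted by Eq. (2)
```

The two printed numbers should be close, with the gap shrinking as the number of simulated particles grows.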

Using Eq. (2), and assuming that the left-hand side of Eq. (2) is normally distributed and that its standard deviation (S) can be estimated empirically, the maximum distance L that a particle is expected to travel from its origin within m units of time, at a given level of confidence, can be calculated by:

$$L = \sqrt{2D \ast m} + S \ast t(\alpha) \tag{3}$$

where *t*(*α*) is the confidence factor for the required level of confidence (drawn from Student's t-distribution). Thus, if a particle is found, within m steps, at a distance greater than L units from its origin, it is considered an abnormal event.
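As a small worked illustration of Eq. (3), the sketch below computes the limit L and applies the abnormality rule. The value of *t*(*α*) is passed in directly (for example, read from a t-table), and all numeric values are hypothetical.

```python
import math

def distance_limit(D, m, S, t_alpha):
    """Eq. (3): L = sqrt(2 * D * m) + S * t(alpha)."""
    return math.sqrt(2.0 * D * m) + S * t_alpha

def is_abnormal(distance, D, m, S, t_alpha):
    """A particle found farther than L from its origin is treated as abnormal."""
    return distance > distance_limit(D, m, S, t_alpha)

# Hypothetical values: D = 0.5, m = 4 steps, S = 0.1, t(alpha) = 2.0.
L = distance_limit(D=0.5, m=4, S=0.1, t_alpha=2.0)
print(L)
print(is_abnormal(2.5, D=0.5, m=4, S=0.1, t_alpha=2.0))
print(is_abnormal(1.0, D=0.5, m=4, S=0.1, t_alpha=2.0))
```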

In the physical or chemical diffusion process, the value of D is determined based on material properties, and the value of m is measured in continuous time. In the current model, these values should be determined using another methodology as explained in the following paragraphs.

Let the vector *Xm* denote the set of water quality measurements at each moment m, and assume that *Xm* has K dimensions. The value of *Xm* can be normalized with the following equation:

$$\hat{v}_{m}^{k} = \frac{X_{m}^{k} - X_{\min}^{k}}{X_{\max}^{k} - X_{\min}^{k}} \tag{4}$$
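The normalization of Eq. (4) can be sketched as follows. The readings are hypothetical, and mapping a constant column to 0 is an assumption made here to avoid division by zero; the chapter does not discuss that case.

```python
def min_max_normalize(records):
    """Apply Eq. (4) column by column to a list of equal-length records."""
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in records:
        normalized.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0  # assumed handling of a constant column
            for x, lo, hi in zip(row, lows, highs)
        ])
    return normalized

# Hypothetical readings, two dimensions per record.
readings = [(0.5, 7.0), (0.7, 7.4), (0.9, 7.2)]
normalized = min_max_normalize(readings)
print(normalized)
```

After this step every dimension lies between 0 and 1, so no single measurement dominates a distance computed across dimensions.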

where *v̂*<sub>m</sub><sup>k</sup> is the normalized k-th dimension of the normalized vector *V̂*<sub>m</sub> and m is the discrete time index. The subscripts max and min refer to the maximum and minimum values of this dimension over the whole data set.

Let us also denote the distance between two vectors by *DN*<sub>m</sub><sup>n</sup> (where DN stands for Dynamic Noise). This measurement is calculated as the normalized Euclidean distance between two values of *V̂*<sub>m</sub> taken n time steps apart, and is given by the equation

$$DN_m^n = \hat{V}_m - \hat{V}_{m-n} \tag{5}$$

where the right-hand side is understood as the Euclidean length of the difference vector.

Note that, unlike the case of BM, where the distance of a particle from its origin increases with time, in the case of DN the particle may return to its origin.

An illustration of this distance in a normalized two-dimensional space is shown in **Figure 1**. In this case, *V̂*<sub>m</sub> is a two-dimensional vector.

In terms of **Figure 1**, given a dataset with M records and two variables in each record (x<sup>1</sup> and x<sup>2</sup>), one may view this dataset as describing the location of a particle at each time stamp. After normalizing the dataset according to Eq. (4), the Euclidean distance between any two points is the distance this particle travels. If the distance is measured over a five-step gap, the result may be a chart such as the one shown in **Figure 1**.

**Figure 1.** BM distance traveling in two dimensions.

**Figure 2** shows a schematic chart of *DN*<sub>m</sub><sup>n</sup> over time. As explained previously, under normal conditions the value of *DN*<sub>m</sub><sup>n</sup> is not expected to go above the value of L, with a confidence level equal to 1 − *α*. The value of L can also be obtained. In **Figure 2**, point 1 refers to values of DN that stay above the level of L for a sustained period, while point 2 marks a significant change in the value of DN.

**Figure 2.** DN over time.

Two indicators are calculated. The first compares the value of *DN*<sub>m</sub><sup>n</sup> with the result of Eq. (3), as calculated from the data accumulated up to time m. The second compares the average normalized value of *DN*<sub>m</sub><sup>n</sup> within a fixed-width moving window with the average *DN*<sub>m</sub><sup>n</sup> known prior to this window.

Identifying Water Network Anomalies Using Multi Parameters Random Walk: Theory and Practice
http://dx.doi.org/10.5772/intechopen.71566

**3. Numerical example**

This section numerically illustrates the calculation procedure described in the previous section. **Table 1** contains an example data set with 20 records. The measured variables are Free Chlorine (CL), Turbidity (TU), pH and Conductivity (CO). These are common water quality indicators.
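The DN measure of Eq. (5) and the two comparisons described for it can be sketched as follows. This is a minimal illustration under stated assumptions: the records are hypothetical and already normalized, the limit L is passed in as a plain number rather than estimated from the data, and the second comparison is reported as a simple difference of means because the text gives no decision threshold for it.

```python
import math
from statistics import mean

def dynamic_noise(normalized, n):
    """DN for each m >= n: Euclidean distance between record m and record m - n."""
    return [math.dist(normalized[m], normalized[m - n])
            for m in range(n, len(normalized))]

def exceeds_limit(dn_value, limit):
    """First comparison: a single DN value against the limit L of Eq. (3)."""
    return dn_value > limit

def window_shift(dn, window):
    """Second comparison: mean DN in the latest window minus mean DN before it."""
    if len(dn) <= window:
        raise ValueError("need at least one DN value before the window")
    return mean(dn[-window:]) - mean(dn[:-window])

# Hypothetical, already normalized two-dimensional records.
records = [(0.10, 0.10), (0.12, 0.11), (0.11, 0.13), (0.13, 0.12),
           (0.12, 0.10), (0.60, 0.70), (0.65, 0.72), (0.62, 0.75)]

dn = dynamic_noise(records, n=1)
flags = [exceeds_limit(v, limit=0.3) for v in dn]  # 0.3 is an assumed value of L
print([round(v, 3) for v in dn])
print(flags)
print(window_shift(dn, window=3))
```

Here only the jump between the fifth and sixth records exceeds the assumed limit, and the positive window shift reflects the same jump in the averaged view.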
