**2. Unsupervised ML algorithms**

This section reviews the concept and application of the six unsupervised ML algorithms for anomaly/pattern detection applied in this research. It is important to note that, among the six algorithms described in this chapter, only the methods in Sections 2.1 and 2.3 are intrinsically able to perform a multivariate analysis; the other algorithms can only perform a univariate analysis. Therefore, except for the algorithms in Sections 2.1 and 2.3, a univariate performance evaluation is computed in each dimension of the data and the results are then averaged.

#### **2.1 C-AMDATS**

The Cluster-based Algorithm for Anomaly Detection in Time Series Using Mahalanobis Distance (C-AMDATS) is an unsupervised clustering ML algorithm. The model has only two hyperparameters that the user can manipulate: (i) the Initial Cluster Size (ICS) and (ii) the Clustering Factor (CF). First, the ICS clusters the observed sequences of the time series data *A*, where each cluster may represent a behavior status. After this initial clustering, the dataset is re-clustered according to the distribution of the data points over the timeline. This ability is due to the use of the Mahalanobis distance in the algorithm. Clustering techniques generally use the Euclidean distance function, which forces the clusters to assume the geometric shape of a circle and therefore does not account for the variance of each dimension (or feature) of the dataset. However, there are situations in which the variance differs between dimensions. By using the covariance matrix, the Mahalanobis distance captures the variance of each dimension. Eq. (1) presents the Mahalanobis distance formula.

$$d\_m\left(x, \mu\right) = \sqrt{\left(x - \mu\right)^T S^{-1}\left(x - \mu\right)}\tag{1}$$

Where: $d\_m(x, \mu)$ is the Mahalanobis distance between a specific point in the time series and its respective centroid; $x = (x\_1, x\_2, \ldots, x\_n)^T$ is a specific point in the time series data, where *n* is the number of variables; $\mu = (\mu\_1, \mu\_2, \ldots, \mu\_n)^T$ is a certain cluster centroid; and $S$ is the covariance matrix relative to that cluster.
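As a minimal sketch of Eq. (1), the snippet below computes the Mahalanobis distance of a point to the centroid of a cluster using NumPy. The names (`mahalanobis_distance`, `cluster_points`) are illustrative and not taken from the reference C-AMDATS implementation.

```python
import numpy as np

def mahalanobis_distance(x, cluster_points):
    """Eq. (1): distance d_m(x, mu) between point x and the centroid of cluster_points."""
    mu = cluster_points.mean(axis=0)            # cluster centroid
    S = np.cov(cluster_points, rowvar=False)    # covariance matrix of the cluster
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

# Example: a 2-D cluster whose two features have very different variances
rng = np.random.default_rng(0)
cluster = rng.normal([0.0, 0.0], [1.0, 10.0], size=(200, 2))
print(mahalanobis_distance(np.array([2.0, 2.0]), cluster))
```

Unlike the Euclidean distance, the score above shrinks along the high-variance dimension, which is exactly the property the algorithm exploits.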

After the new clustering through the Mahalanobis distance, the algorithm calculates the similarity of each cluster in the time series *A* to find the respective hidden patterns *P*. This similarity is calculated using the standard deviation *σ*y of the actual values of the *A* samples, the *y* coordinate of each centroid, and the CF. If the modulus of the difference between the *y* coordinates of the centroids of two clusters is less than or equal to the product of CF and *σ*y, then these clusters can be merged, meaning that they will represent the same pattern *P*. This task is carried out until every cluster has been analyzed, as sketched below.
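A hedged sketch of the merging criterion just described: two clusters belong to the same pattern when the difference between the *y* coordinates of their centroids is at most CF times the standard deviation of the series values. Variable names are illustrative, not the reference code.

```python
import numpy as np

def can_merge(centroid_y_a, centroid_y_b, series_values, clustering_factor):
    """Merging rule: |y_a - y_b| <= CF * sigma_y."""
    sigma_y = np.std(series_values)       # standard deviation of the actual values of A
    return abs(centroid_y_a - centroid_y_b) <= clustering_factor * sigma_y
```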

The last step of C-AMDATS is to calculate the probability of a pattern *P* being an anomaly *R*. The Anomaly Score measures the anomaly *R* for each pattern *P* (found in the previous step). The score is calculated as the ratio of the size of the entire time series to the sum of the sizes of the clusters present in *P*. The anomaly score assesses the degree of relevance of *P* in terms of anomaly detection. The whole set *P* is then ordered by *R* in descending order, and the anomalous patterns will be those with the highest anomaly score values. The higher the anomaly score of a pattern *P*, the greater the probability that it represents anomalous behavior in *A* [2]. Eq. (2) presents the Anomaly Score formula.

$$\text{Anomaly Score}\_{P\_i} = \frac{|T|}{|P\_i|} \tag{2}$$

Where Anomaly Score$\_{P\_i}$ is the anomaly score of the pattern *P*i, |*P*i| is the size of the pattern *P*i, and |*T*| is the size of the time series *T*.
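A short sketch of Eq. (2) under the assumption that each pattern is represented as a list of clusters (index arrays); this is illustrative code, not the original implementation.

```python
def anomaly_score(pattern_clusters, series_length):
    """Eq. (2): |T| divided by the total size of the clusters in the pattern."""
    pattern_size = sum(len(cluster) for cluster in pattern_clusters)
    return series_length / pattern_size

# Patterns that cover only a few points of the series receive high scores
# and are therefore ranked first as anomaly candidates.
```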

#### **2.2 Luminol Bitmap**

Bitmap is an unsupervised learning algorithm available in the Luminol library for anomaly detection and time series correlation. The Bitmap algorithm is based on the idea of time series bitmaps. The logic of the algorithm is to perform a feature extraction of the raw time series data - by converting it into a Symbolic Aggregate Approximation (SAX) representation - and to use the relative frequency of its features to color a bitmap in a principled way. SAX reduces the dimensionality of a raw time series *C* of arbitrary length *n* to a string of arbitrary length *w* (*w* < *n*, typically *w* ≪ *n*), represented by a vector $\overline{C}$. It transforms the data into a Piecewise Aggregate Approximation (PAA) representation and then symbolizes it into a discrete string [16]. Eq. (3) presents the calculation of the *i*th element of $\overline{C}$:

$$\overline{C}\_{i} = \frac{w}{n} \sum\_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i} C\_{j} \tag{3}$$

After transforming the time series into its PAA representation, the algorithm applies a further transformation to obtain a discrete representation with equiprobable symbols [16]. The conversion of the time series into SAX words is done with sliding windows (also called feature windows). The Bitmap algorithm slides two concatenated windows together across the sequence: the latter one, called the lead window, determines how far ahead to look for anomalous patterns, and the former one, called the lag window, determines how much memory of the past is retained.
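The sketch below illustrates the PAA reduction of Eq. (3) followed by the equiprobable SAX symbolization described above, assuming the series is z-normalized and that the breakpoints come from the standard normal distribution (the usual SAX assumption). It is a minimal sketch, not the Luminol implementation, and assumes *n* is divisible by *w*.

```python
import numpy as np
from scipy.stats import norm

def paa(series, w):
    """Eq. (3): reduce a series of length n to w segment means (n divisible by w)."""
    n = len(series)
    return series.reshape(w, n // w).mean(axis=1)

def sax(series, w, alphabet_size):
    """Convert a raw series into a SAX word of length w over `alphabet_size` symbols."""
    z = (series - series.mean()) / series.std()                  # z-normalize
    segments = paa(z, w)
    # Equiprobable breakpoints under the standard normal distribution
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, segments)             # 0 .. alphabet_size-1
    return "".join(chr(ord("a") + s) for s in symbols)

rng = np.random.default_rng(1)
print(sax(rng.normal(size=128), w=16, alphabet_size=4))          # prints a 16-character SAX word
```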


In summary, the algorithm converts both feature windows into their SAX representations, counts the frequencies of SAX subwords at the desired level, and obtains the corresponding bitmaps. The distance between the two bitmaps is measured and reported as an anomaly score at each time instance, and the bitmaps can be drawn to visualize the similarities and differences between the two windows. The user must choose the length *N* of the feature windows and the number *n* of equal-sized sections into which *N* is divided [3].
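A hedged sketch of the comparison step: count the normalized frequencies of SAX subwords (at a chosen level) in the lag and lead windows and report a squared-difference distance between the two frequency maps as the anomaly score. It reuses the `sax()` helper sketched above and is not the Luminol library API.

```python
from collections import Counter

def subword_frequencies(sax_word, level):
    """Normalized frequencies of all subwords of length `level` in a SAX word."""
    counts = Counter(sax_word[i:i + level] for i in range(len(sax_word) - level + 1))
    total = sum(counts.values())
    return {subword: c / total for subword, c in counts.items()}

def bitmap_distance(lag_word, lead_word, level=2):
    """Squared distance between the lag and lead frequency bitmaps."""
    lag = subword_frequencies(lag_word, level)
    lead = subword_frequencies(lead_word, level)
    keys = set(lag) | set(lead)
    return sum((lag.get(k, 0.0) - lead.get(k, 0.0)) ** 2 for k in keys)
```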

#### **2.3 SAX-REPEAT**

The SAX-REPEAT algorithm extends the original SAX implementation to handle multivariate data. The algorithm takes as input a set of *K* multivariate time series *X*i of lengths *T*i and dimensionality *D*, which represent different instances of the raw data to be learned. The user can set the final string length *N* and the alphabet size *M*.

The algorithm applies SAX to each dimension of the data separately, producing *D* strings, and then combines them by assigning each possible combination of symbols to a unique identifier. This leads to a string of length *N*, but over an extended alphabet of size *M<sup>D</sup>*. To maintain the requirement that the final string use an alphabet of *M* symbols (the parameter set by the user), the algorithm clusters the resulting characters into *M* clusters using the k-means method and replaces each character with the centroid of its cluster [17].
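A hedged sketch of this combination step: SAX is applied per dimension, the symbol indices at each string position are stacked into *D*-dimensional vectors, and k-means groups those vectors into *M* clusters so the extended alphabet of size *M<sup>D</sup>* collapses back to *M* symbols. It assumes the `sax()` helper sketched in Section 2.2 and scikit-learn's `KMeans`; it is not the reference SAX-REPEAT implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def sax_repeat(X, w, M):
    """X: array of shape (n, D). Returns a SAX-REPEAT string of length w over M symbols."""
    # One SAX symbol index per dimension at each of the w positions -> shape (w, D)
    per_dim = np.column_stack(
        [[ord(c) - ord("a") for c in sax(X[:, d], w, M)] for d in range(X.shape[1])]
    )
    # Collapse the M^D possible symbol combinations back to M symbols via k-means
    labels = KMeans(n_clusters=M, n_init=10, random_state=0).fit_predict(per_dim)
    return "".join(chr(ord("a") + lab) for lab in labels)
```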

Although SAX-REPEAT can recognize interesting patterns, the original algorithm does not calculate the probability of the patterns being anomalous. Therefore, this work implemented an anomaly score for each found cluster (pattern). As in the C-AMDATS algorithm, the score is computed as the ratio between the size of the entire time series and the sum of the sizes of the clusters in each pattern. The clusters are then sorted by their anomaly scores in descending order, and the anomalous patterns will be those with the highest anomaly score values.

#### **2.4 k-NN**

The k-Nearest Neighbors (k-NN) algorithm is one of the most popular methods for solving both classification and regression problems. In this study, however, it is used only for classification, in an unsupervised setting.

The algorithm assumes that similar data points exist in close proximity, *i.e.*, they are near each other. The algorithm captures this idea of similarity (also known as distance, proximity, or closeness) by calculating the distance between points. The distance is usually computed with the Euclidean distance, but other distance functions can be used. The Euclidean distance between the points *P* = (*p*1, *p*2, …, *p*n) and *Q* = (*q*1, *q*2, …, *q*n) in an n-dimensional Euclidean space is defined in Eq. (4):

$$d\_{\epsilon} \left( p, q \right) = \sqrt{\sum\_{i=1}^{n} \left( p\_i - q\_i \right)^2} \tag{4}$$

Where, *p* = (*p*1, *p*2, …, *p*n) and *q* = (*q*1, *q*2, …, *q*n) are two points in Euclidean n-space.

The k-NN algorithm depends on two parameters: a metric used to compute the distance between two points (in this case, the Euclidean function), and a value *k* for the number of neighbors to consider. When *k* is underestimated, the algorithm can overfit, *i.e.*, it will classify based only on the closest neighbors instead of learning a separating frontier between classes; if *k* is overestimated, the algorithm will underfit, and in the limit, if *k* = *n*, the algorithm will assign every point to the class with the most samples [18, 19].
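As a hedged illustration of using k-NN in an unsupervised way, the sketch below scores each point by its average Euclidean distance to its *k* nearest neighbors, so isolated points receive high scores. It uses scikit-learn's `NearestNeighbors`; this is one common k-NN-based scoring scheme, not necessarily the exact configuration used in this study.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(X, k=5):
    """Average Euclidean distance of each point to its k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(X)
    distances, _ = nn.kneighbors(X)          # first column is the point itself (distance 0)
    return distances[:, 1:].mean(axis=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])   # one injected outlier
print(knn_anomaly_scores(X).argmax())        # index of the most anomalous point (the outlier)
```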

#### **2.5 Bootstrap**

The Bootstrap algorithm uses computational power to estimate almost any summary statistic, such as a confidence interval, mean, or standard error. The method depends on the notion of a bootstrap sample, which is a resampling of size *n* drawn with replacement from the original dataset *Z* = (*Z*1, *Z*2, …, *Z*n). The bootstrap sample is represented as $Z^\* = (Z\_1^\*, Z\_2^\*, \ldots, Z\_n^\*)$. Each $Z\_i^\*$ is one of the original *Z* values selected at random, with equal selection probability for each value, for example: $Z\_1^\* = Z\_7$, $Z\_2^\* = Z\_5$, $Z\_3^\* = Z\_9$, $Z\_4^\* = Z\_7$, etc. Note that the same original value can appear zero, one, or more times; in the example, *Z*7 appears twice, *i.e.*, the selection of *Z* values is not exclusive. The name Bootstrap refers to the use of the original dataset to generate new datasets $Z^\*$. The idea is to generate a large number *B* of bootstrap samples, each of size *n*, using a random number generator, to perform the algorithm training. The number of bootstrap repetitions determines the variance of the estimate: the higher this number, the better (lower) the variance, but the computational cost increases with *B* [20, 21].
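A minimal sketch of the resampling just described, using NumPy: each bootstrap sample draws *n* values from the original data with replacement, so the same value can appear zero, one, or several times. The dataset here is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(loc=10.0, scale=2.0, size=50)          # original dataset of size n = 50
B = 1000                                              # number of bootstrap samples

bootstrap_samples = rng.choice(Z, size=(B, len(Z)), replace=True)   # B resamples of size n
bootstrap_means = bootstrap_samples.mean(axis=1)                    # statistic stored per sample
```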

In this sense, we are interested in calculating a confidence interval using the Bootstrap, which is done by taking the statistics stored during training and selecting the values at the chosen percentiles for the confidence interval. The chosen percentile is denoted as δ (Alpha, or Significance Level). Eq. (5) defines the calculation used to estimate the distribution of δ* for each bootstrap sample.

$$\delta^\* = \overline{x}^\* - \overline{x} \tag{5}$$

Where: $\overline{x}^\*$ is the mean of an empirical bootstrap sample and $\overline{x}$ is the mean of the original data.

Therefore, the confidence interval for a Significance Level of 0.05 is defined by Eq. (6).

$$\text{Confidence interval} = \left[\overline{x} - \delta\_{.05}^{\*},\ \overline{x} - \delta\_{.95}^{\*}\right] \tag{6}$$

Where $\overline{x}$ is the mean of the original data, $\delta\_{.05}^{\*}$ is the value of δ* at the 5th percentile, and $\delta\_{.95}^{\*}$ is the value of δ* at the 95th percentile.

So, in order to obtain an accurate estimate of $\delta\_{.05}^{\*}$ and $\delta\_{.95}^{\*}$, it is important to generate a large number of bootstrap samples.
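The sketch below puts Eqs. (5) and (6) together for a synthetic dataset. One assumption to flag: for the interval bounds to come out ordered from low to high, $\delta\_{.05}^{\*}$ is taken as the value exceeded by 5% of the δ* values (i.e., their 95th percentile), following the usual empirical-bootstrap convention; this is a sketch, not the exact code used in the study.

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(loc=10.0, scale=2.0, size=50)     # original dataset
x_bar = Z.mean()                                 # mean of the original data

B = 10_000                                       # number of bootstrap samples
# Eq. (5): delta* = mean of each bootstrap sample minus the original mean
deltas = rng.choice(Z, size=(B, len(Z)), replace=True).mean(axis=1) - x_bar

# Eq. (6): 90% empirical bootstrap confidence interval for the mean
ci = (x_bar - np.percentile(deltas, 95), x_bar - np.percentile(deltas, 5))
print(ci)
```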

#### **2.6 RRCF**

The Robust Random Cut Forest (RRCF) algorithm is an ensemble technique for detecting outliers. The idea is based on the isolation forest algorithm, which uses an ensemble of trees. In graph theory, trees are collections of vertices and edges in which any two vertices are connected by exactly one path; they provide an ordered way of storing numerical data.


In this view, the algorithm takes a set of random data points, cuts them down to the same number of points, and creates trees. The algorithm starts by constructing a tree of *n* vertices and then creates more trees of the same size, which together form the forest. The user can choose the number of trees and the number of data points each tree holds, which are randomly sampled from the dataset. After the construction of the forest, the algorithm injects a new data point *p* into the trees, follows the cuts, and computes the average depth of the point across the collection of trees. The point is labeled an anomaly if its score exceeds a threshold, which corresponds to the average depth across the trees [22].
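Since the text describes RRCF as building on the isolation-forest idea of scoring points by their depth across an ensemble of random trees, the hedged sketch below uses scikit-learn's `IsolationForest` as a related tree-ensemble stand-in; it is not an actual RRCF implementation (dedicated packages such as `rrcf` exist for the algorithm itself).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(500, 2)), [[6.0, 6.0]]])   # one injected outlier

# 100 random trees, each built from a random subsample of the data
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0).fit(X)

# score_samples is higher for normal points, so negate it: higher = more anomalous
scores = -forest.score_samples(X)
print(scores.argmax())        # index of the most anomalous point (the injected outlier)
```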
