**2. Robust statistical methods using Minimum Covariance Determinant (MCD)**

The Minimum Covariance Determinant (MCD) estimator is a highly robust estimator of multivariate location and scatter. Since estimating the covariance matrix is the cornerstone of many multivariate statistical methods, the MCD has also been used to develop robust and computationally efficient multivariate techniques.

#### **2.1. MCD and its fast computing algorithm**

Given a dataset consisting of *p* variables and *n* observations, i.e. an *n* × *p* data matrix, we can represent this multivariate data as *X* = (*x*1, …, *xn*)*'*, where *xi*, for *i* = 1, …, *n*, is the *i*th observation and a *p*-dimensional vector. A classical distance measure, the Mahalanobis distance (MD), is given in Equation (1); it depends only on the sample mean $\widehat{\boldsymbol{\mu}}\_{MD}$ and the sample covariance matrix $\widehat{\boldsymbol{\Sigma}}\_{MD}$, both of which are computed from the entire dataset.

$$MD(\mathbf{x}) = \sqrt{(\mathbf{x} - \widehat{\boldsymbol{\mu}}\_{MD})' \, \widehat{\boldsymbol{\Sigma}}\_{MD}^{-1} \, (\mathbf{x} - \widehat{\boldsymbol{\mu}}\_{MD})} \tag{1}$$

A point with a larger Mahalanobis distance lies further from the center of the data cloud than a point with a smaller one. A robust distance (RD) measure is obtained by substituting the MCD estimates of the mean $\widehat{\boldsymbol{\mu}}\_{MCD}$ and covariance $\widehat{\boldsymbol{\Sigma}}\_{MCD}$ into Equation (1), which yields Equation (2).

$$RD(\mathbf{x}) = \sqrt{(\mathbf{x} - \widehat{\boldsymbol{\mu}}\_{MCD})' \, \widehat{\boldsymbol{\Sigma}}\_{MCD}^{-1} \, (\mathbf{x} - \widehat{\boldsymbol{\mu}}\_{MCD})} \tag{2}$$

The classical estimates can be sensitive to outliers, while the MCD estimate is robust [8, 12, 13]. The MCD relies on a subset of the observations; choosing this subset makes the estimator robust because it is less sensitive to the influence of outlying points. Figure 3 illustrates the difference between these two estimates with a scatterplot of an example dataset of 70 observations and 2 variables (i.e. *n* = 70, *p* = 2). The two ellipses are outlier thresholds determined by the 0.975 chi-square quantile with 2 degrees of freedom, using the classical and robust estimates, respectively. The dashed blue ellipse marks the 97.5% outlier threshold for the classical Mahalanobis distance, suggesting that the two observations lying beyond it are outliers. The 97.5% outlier threshold for the robust distance is marked by the solid red ellipse, suggesting that ten points are outliers.
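As a rough illustration (not the chapter's own code), the following Python sketch computes both distance measures and the 0.975 chi-square cutoff for a small synthetic two-variable dataset, using scikit-learn's `MinCovDet` as the MCD estimator; the dataset and the contamination pattern are invented for the example.

```python
# Classical Mahalanobis distances (Equation 1) versus MCD-based robust
# distances (Equation 2), with the 0.975 chi-square quantile as the cutoff.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=70)
X[:5] += np.array([4.0, -4.0])           # contaminate a few points for illustration

classical = EmpiricalCovariance().fit(X)
robust = MinCovDet(random_state=0).fit(X)

md = np.sqrt(classical.mahalanobis(X))   # Equation (1); mahalanobis() returns squared distances
rd = np.sqrt(robust.mahalanobis(X))      # Equation (2)

cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))   # 97.5% threshold for p = 2
print("classical outliers:", np.where(md > cutoff)[0])
print("robust outliers:   ", np.where(rd > cutoff)[0])
```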

The MCD has a user-determined parameter, *h*, which specifies the size of the subset of data on which the estimate is based. It is constrained by (*n* + *p* + 1) / 2 ≤ *h* ≤ *n*. The *h* observations are chosen such that the determinant of their sample covariance matrix is minimal (but not minimized in the formal sense, because the search relies on a sampling algorithm rather than a loss function). The MCD is designed for elliptically symmetric unimodal distributions, such as the commonly encountered multivariate normal distribution. The MCD is most robust when *h* = (*n* + *p* + 1) / 2, but this causes low efficiency [14] (at least for normal probability distributions), which can be increased (while retaining high robustness) by applying reweighted estimators [15, 16]. Robust statistical estimators are commonly evaluated on both their breakdown value and their influence function. The MCD is a high-breakdown estimator and its influence function appears bounded, which is desirable.


**Figure 3.** Outlier thresholds, as represented by ellipses, based on a classical and a robust scheme.


An alternative strategy, which employs Delaunay triangulation to identify a robust, outlier-free subsample in an adaptive way, was presented in [17].

Computing the exact MCD is possible but computationally demanding, as it requires evaluating all $\binom{n}{h}$ subsets of size *h*. Even though the MCD is a powerful robust estimator, it only became widely used after the development of the so-called Fast-MCD algorithm [18], which we summarize below. Assume we have a dataset *X* = (*x*1, …, *xn*)*'* and let *H*1 ⊂ {1, …, *n*} be an *h*-subset, with *h* constrained by (*n* + *p* + 1) / 2 ≤ *h* ≤ *n*; this first subset *H*1 is chosen at random from the entire dataset. Compute the mean $\widehat{\boldsymbol{\mu}}\_{MCD,1}$ and covariance matrix $\widehat{\boldsymbol{\Sigma}}\_{MCD,1}$ of *H*1, as well as the determinant $\det(\widehat{\boldsymbol{\Sigma}}\_{MCD,1})$. Then compute the distances of all *n* observations in the entire dataset (not just the *h* comprising the initial subset) using Equation (2), and order these distances from smallest to largest. Retain the same number of observations as in the initial *h*-subset; but instead of being chosen arbitrarily as in the initial subset, these are the observations with the smallest distances according to the order statistics. Call this subset *H*2, and compute $\widehat{\boldsymbol{\mu}}\_{MCD,2}$, $\widehat{\boldsymbol{\Sigma}}\_{MCD,2}$ and $\det(\widehat{\boldsymbol{\Sigma}}\_{MCD,2})$. Then Equation (3) must hold:

$$\det\left(\widehat{\boldsymbol{\Sigma}}\_{MCD,2}\right) \le \det\left(\widehat{\boldsymbol{\Sigma}}\_{MCD,1}\right) \tag{3}$$

Going from *H*1 to *H*2 is called a C-step, for "concentration step", because the algorithm concentrates on the *h* observations with the smallest distances and $\widehat{\boldsymbol{\Sigma}}\_{MCD,2}$ is more concentrated (i.e. has a smaller determinant). This C-step is repeated for many different initial *h*-subsets. The 10 subsets that yield the smallest determinants overall are retained and further concentrated until convergence is met.
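The following sketch illustrates the C-step loop just described, under simplifying assumptions (plain sample covariances, several random starts each iterated to convergence, and no reweighting); the function and variable names are ours, not from [18].

```python
# Simplified Fast-MCD-style C-step loop: concentrate each random h-subset on
# the h observations with the smallest distances, then keep the fits with the
# smallest covariance determinants (Equation 3 guarantees the determinant
# never increases across a C-step).
import numpy as np

def c_steps(X, h, n_starts=50, n_keep=10, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    results = []
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)       # random initial h-subset H1
        for _ in range(max_iter):
            mu = X[idx].mean(axis=0)
            cov = np.cov(X[idx], rowvar=False)
            d2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(cov), X - mu)
            new_idx = np.argsort(d2)[:h]                 # h smallest robust distances
            if set(new_idx) == set(idx):                 # converged: subset unchanged
                break
            idx = new_idx
        results.append((np.linalg.det(cov), mu, cov))
    results.sort(key=lambda r: r[0])
    return results[:n_keep]                              # the n_keep most concentrated fits
```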

#### **2.2. Robust multivariate regression and Multivariate Least-Trimmed Squares (MLTS) estimator**

Section 2.1 introduced the robust MCD estimator and showed how the MCD can be computed efficiently. In this section, we review different frameworks for applying the MCD estimator to multivariate regression. These methods offer robust alternatives to standard multiple regression analysis.

We first look at robust multivariate regression in [19]. Suppose we have a full dataset of predictors and responses containing no outliers; computing the regression parameter estimates from the full dataset using a least squares procedure will yield accurate results. With outliers present in the dataset, the MCD is used to search for a subset of size *h* whose covariance matrix has the smallest determinant, with *h* constrained by (*n* + *p* + *q* + 1) / 2 ≤ *h* ≤ *n*, where *p* and *q* are respectively the numbers of variables in the predictor and response matrices, and *n* is the number of observations. Then, using this subset of *h* observations, the sample mean and covariance estimates are calculated, which allows accurate estimation of the regression coefficients and of the error covariance matrix in the presence of outliers.

Different from the above robust multivariate regression, the multivariate least trimmed squares (MLTS) estimator in [20] first fits a regression model to a subset of the data and then computes the covariance matrix of the residuals. The estimator is defined by minimizing a trimmed sum of squared Mahalanobis distances and can be computed by a fast algorithm. Consider the classical multivariate regression framework. Assume we have a sample *Zn* = {(*xi*, *yi*); *i* = 1, …, *n*}, and let *X* = (*x*1, …, *xn*)*'* denote the design (or predictor) matrix and *Y* = (*y*1, …, *yn*)*'* the response matrix. The regression model is:

$$Y = X\beta + \varepsilon \tag{4}$$

The classical least squares estimator for the regression parameter is given by:

$$
\widehat{\boldsymbol{\beta}}\_{LS} = (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{Y} \tag{5}
$$

and the classical estimator of the scatter matrix is:

$$
\widehat{\boldsymbol{\Sigma}}\_{LS} = \frac{1}{n-p} \left( \mathbf{Y} - \mathbf{X} \widehat{\boldsymbol{\beta}}\_{LS} \right)' \left( \mathbf{Y} - \mathbf{X} \widehat{\boldsymbol{\beta}}\_{LS} \right) \tag{6}
$$

These classical least squares estimators are sensitive to outliers. A robust alternative based on the residuals is obtained as follows. For any $\boldsymbol{\beta} \in \mathbb{R}^{p \times q}$, let $r\_i(\boldsymbol{\beta}) = y\_i - \boldsymbol{\beta}' x\_i$ denote the residual from the fitted regression model. Furthermore, let $\mathcal{H} = \{ H \subset \{1, \dots, n\} \mid \#H = h \}$ be the collection of all subsets of size *h*. For any $H \in \mathcal{H}$, let $\widehat{\boldsymbol{\beta}}\_{LS}(H)$ denote the least-squares fit based solely on the observations $\{(x\_j, y\_j);\ j \in H\}$. In addition, for all $H \in \mathcal{H}$ and $\boldsymbol{\beta} \in \mathbb{R}^{p \times q}$, denote the covariance matrix of the residuals with respect to the fit $\boldsymbol{\beta}$ and the subset *H* as:


$$\text{cov}(H, \boldsymbol{\beta}) = \frac{1}{n} \sum\_{j \in H} \left( r\_j(\boldsymbol{\beta}) - \bar{r}\_H(\boldsymbol{\beta}) \right) \left( r\_j(\boldsymbol{\beta}) - \bar{r}\_H(\boldsymbol{\beta}) \right)' \tag{7}$$

where $\bar{r}\_H(\boldsymbol{\beta}) = \frac{1}{h} \sum\_{j \in H} r\_j(\boldsymbol{\beta})$. If we let $\widehat{\boldsymbol{\Sigma}}\_{LS}(H) = \text{cov}(H, \widehat{\boldsymbol{\beta}}\_{LS}(H))$ for any $H \in \mathcal{H}$, the MLTS estimator is defined as:

$$
\widehat{\boldsymbol{\beta}}\_{MLTS}(Z\_n) = \widehat{\boldsymbol{\beta}}\_{LS}(\widehat{H}) \tag{8}
$$

where $\widehat{H} \in \operatorname{argmin}\_{H \in \mathcal{H}} \det \widehat{\boldsymbol{\Sigma}}\_{LS}(H)$. The covariance matrix of the errors can be estimated by

$$
\widehat{\boldsymbol{\Sigma}}\_{MLTS}(Z\_n) = c\_{\alpha} \widehat{\boldsymbol{\Sigma}}\_{LS}(\widehat{H}) \tag{9}
$$

where $c\_{\alpha}$ is a consistency factor. The observations corresponding to the residual covariance matrix with the smallest determinant can then be used to produce robust estimates of the regression parameters.
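A hedged sketch of the MLTS search in Equations (7)-(9) follows: least-squares fits on random *h*-subsets, concentration steps on the residual distances, and selection of the subset whose residual covariance has the smallest determinant. The consistency factor $c\_{\alpha}$ and the reweighting step are omitted, and the names are illustrative rather than taken from [20].

```python
# Simplified MLTS: keep the h-subset whose residual covariance determinant is
# smallest, refining each random start with concentration steps.
import numpy as np

def mlts(X, Y, h, n_starts=100, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    best = (np.inf, None, None)
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)
        for _ in range(max_iter):
            beta, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)   # LS fit on the subset
            res = Y - X @ beta                                       # residuals for all n
            cov = np.cov(res[idx], rowvar=False)
            d2 = np.einsum('ij,jk,ik->i', res, np.linalg.inv(cov), res)
            new_idx = np.argsort(d2)[:h]             # concentrate on smallest residual distances
            if set(new_idx) == set(idx):
                break
            idx = new_idx
        det = np.linalg.det(cov)
        if det < best[0]:
            best = (det, beta, cov)
    return best[1], best[2]    # beta_MLTS and the (unscaled) error covariance estimate
```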

Using the MLTS as a means to estimate the parameters of the Vector Autoregressive (VAR) model was presented in [21]. The VAR model is popular for modeling multiple time series. Estimation of its parameters based on a typical least squares method is unreliable when outliers are present in the data. Developing robust procedures for multiple time series analysis is even more crucial than for univariate time series analysis due to the correlation structure of the data. Experimental results in [21] show that applying the reweighted MLTS procedure to the VAR model leads to robust multivariate regression estimators with improved performance.

#### **3. Multivariate Voronoi Outlier Detection (MVOD) for time series**

In order to better analyze multivariate time series data, we have recently proposed a general outlier detection method based on the mathematical principles of Voronoi diagrams. It is general because different attributes or features can be extracted from the data for Voronoi diagram construction. These attributes or features can be designed based on the nature of the data and the outliers. This has the potential to increase the accuracy and precision of outlier detection for specific application problems.

#### **3.1. Background on Voronoi diagram**


Our new method requires a Voronoi diagram, which is composed of Voronoi cells [22]. A Voronoi diagram is a way of dividing space into regions. Assume we have a set *S* of *n* points *p*1, …, *pn* in the Euclidean plane. Let *V*(*pi*) denote a Voronoi cell, the subdivision of the plane consisting of the points *q* that are at least as close to *pi* as to any other point in *S*. This is expressed formally in Equation (10):

$$V(p\_i) = \{ q \mid \text{dist}(p\_i, q) \le \text{dist}(p\_j, q), \ \forall \ j \neq i \} \tag{10}$$

where dist is the Euclidean distance function. The set of all Voronoi cells for all *n* points comprises a Voronoi diagram.

Figure 4 shows part of a Voronoi diagram, assuming Euclidean distance between the points. If a different distance metric were used, the Voronoi diagram would be configured differently. The plane is decomposed into *n* convex polygonal regions, one for each *pi*. The vertices (or nodes) are called *Voronoi vertices* and are equidistant to three or more sites. *Voronoi edges* are the segments forming the boundaries between two Voronoi cells; they contain all the points in the plane equidistant to the two nearest sites. The boundary of a Voronoi cell *V*(*pi*) consists of at most *n* - 1 edges. Three important theorems apply to Voronoi diagrams:

*Theorem 1: Every nearest neighbor of pi defines an edge of the Voronoi polygon V(pi).*

*Theorem 2: Every edge of the Voronoi polygon V(pi) defines a nearest neighbor of pi.*

*Theorem 3: For n ≥ 3, a Voronoi diagram on n points has at most 2n - 5 vertices and 3n - 6 edges.*

**Figure 4.** A subset of Voronoi cells from a Voronoi diagram.

#### **3.2. Our proposed MVOD method**

The Voronoi Outlier Index (VOInd) used in our Multivariate Voronoi Outlier Detection (MVOD) method is based upon the Voronoi notion of nearest neighbors. For a point *pi* of set *S*, the nearest neighbors of *pi* defined by the Voronoi polygon *V*(*pi*) are the Voronoi nearest neighbors of *pi*, denoted *V NN*(*pi*). In Figure 4 the nearest Voronoi neighbors of point *p*1 are *p*2, *p*3, *p*4, *p*5 and *p*6. For each point in the dataset, our method uses the nearest neighbors to compute an index (VOInd) of how likely that point is to be an outlier. The method is multivariate because it aggregates information across all individual time series, thus retaining features which might be common to the entire interlocking set of variables.

Our method is based upon the geometric principles of Voronoi diagrams for defining the neighborhood relationship of the data points, and this facilitates the assignment of group or data membership (i.e. outliers and non-outliers). Construction of a two-dimensional Voronoi diagram requires two coordinates for each data point. Based on the nature of the data and the nature of the outliers to be identified, we can embed their attributes into the coordinates by extracting different valid features from the data. Here, we present one such case of the MVOD framework for feature extraction; but many others are also possible, including nonparametric forms. Figure 5 overviews the process, and the rest of this subsection explains the steps in more detail.

**Figure 5.** Flow chart of our MVOD procedure for outlier detection.


**Feature Set 1 (feature value for the x-coordinate):** In order to determine how a single observation (at the same time point, across all time series under consideration) affects the covariance matrix (which is a measure of the inter-relationship among the individual time series), we remove a given point from the data set and then use Multivariate Least Trimmed Squares (MLTS) introduced in Section 2.2, which computes a determinant of the covariance matrix *without that point*, to yield a single feature value. Removing one observation at a time is not part of the original MLTS method; we introduced this modification into the procedure to show the effect of removing that observation from all the time series. This determinant is known as the generalized variance and can be interpreted as a volume. If we have outlying observations, the volume will be larger. But if we remove those outlying observations, the volume will be smaller.
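As one possible realization of this feature (a sketch only, reusing the `mlts()` function from the earlier sketch on a simple VAR(1)-style design), the leave-one-out generalized variance could be computed as follows; the function name and the 75% subset fraction are illustrative choices, not the chapter's exact implementation.

```python
# x-coordinate feature: determinant of the robust residual covariance,
# recomputed with observation i left out of every component series.
import numpy as np

def feature_x(ts, h_frac=0.75):
    """ts: (N, k) array holding k parallel time series of length N."""
    Yfull, Xfull = ts[1:], ts[:-1]               # VAR(1) design: predict y_t from y_{t-1}
    n = Yfull.shape[0]
    feats = np.empty(ts.shape[0])
    for i in range(ts.shape[0]):
        drop = [j for j in (i - 1, i) if 0 <= j < n]   # rows in which observation i appears
        keep = np.delete(np.arange(n), drop)
        X, Y = Xfull[keep], Yfull[keep]
        h = int(h_frac * X.shape[0])
        _, cov = mlts(X, Y, h)                   # mlts() from the earlier sketch
        feats[i] = np.linalg.det(cov)            # smaller when an outlying point is removed
    return feats
```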

**Feature Set 2 (feature value for the y-coordinate):** This is a two-step process. In Step 1, we take the absolute value of all time series points for all variables, which provides some information about the total magnitude each time series contributes. However, magnitude alone is sometimes insufficient for outlier detection, as some data may have less extreme values than the data with the largest magnitudes yet still be outliers [23]. One way to address this issue is to use the residual to calculate the feature value for the y-coordinate. So Step 2 consists of fitting an appropriate model to the multivariate time series data and then computing the residuals. Here, we estimate the parameters of a Multivariate Vector Auto-Regressive (MVAR) model because our simulated data are generated from this type of model. Once these residuals are obtained for each time series, they are squared and then summed across all time series. Finally, the feature value for the y-coordinate is determined by multiplying together the results of Step 1 and Step 2. Specifically, let *yi* denote the *i*th observation of the time series of length *n*, and *ri* denote the residual after fitting the MVAR model for each observation. Then, for all *i* = 1, …, *n* we compute this feature value as in Equation (11):

$$\sum\_{1}^{n} \text{abs}(y\_i) \times \sum\_{1}^{n} r\_i^2 \tag{11}$$

Although a regression model is used here in Step 2 to extract the feature value, in fact, our method does not require this model. With either Step 1 or Step 2 alone, we will have a corresponding nonparametric or parametric basis, both of which could be suitable for different applications or datasets.

**Voronoi Outlier Index (VOInd):** Given the two feature sets (from the above procedures) used as the x- and y-coordinates of the data, we construct the Voronoi diagram as described in Section 3.1 and compute a Voronoi Outlier Index (VOInd) for each time series point. The numerator of the VOInd for point *pi* is the sum of the Euclidean distances (dist) between *pi* and all of its nearest neighbors. This is divided by the number of nearest neighbors, yielding an average density, as expressed in Equation (12):

$$\text{VOInd}(p\_i) = \sum\_{o \in V\_{\text{NN}}(p\_i)} \text{dist}(p\_i, o) \; / \; | V\_{\text{NN}}(p\_i) | \tag{12}$$
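A minimal sketch of Equation (12) is given below, assuming SciPy is available: since two sites are Voronoi neighbors exactly when they share a Delaunay edge, the Voronoi nearest neighbors can be read off the dual Delaunay triangulation. The function name is ours.

```python
# VOInd per point: average Euclidean distance to the Voronoi nearest neighbors,
# obtained from the dual Delaunay triangulation of the 2-D feature coordinates.
import numpy as np
from scipy.spatial import Delaunay

def voronoi_outlier_index(points):
    """points: (n, 2) array of (feature-1, feature-2) coordinates."""
    tri = Delaunay(points)
    indptr, indices = tri.vertex_neighbor_vertices
    voind = np.empty(len(points))
    for i in range(len(points)):
        nbrs = indices[indptr[i]:indptr[i + 1]]             # V_NN(p_i)
        dists = np.linalg.norm(points[nbrs] - points[i], axis=1)
        voind[i] = dists.sum() / len(nbrs)                   # Equation (12)
    return voind

# Points with the largest VOInd are flagged as outliers, e.g. the top k:
# outliers = np.argsort(voronoi_outlier_index(coords))[-k:]
```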

Note that a Voronoi outlier factor was also used as an index in [24]; however, that method was completely univariate in nature, since its x- and y-coordinates were based on a single univariate time series. One of our primary motivations for this study is to create a novel and general MVOD method that can detect outliers in time series data in a multivariate framework with multiple, interlocking sets of variables.

#### **3.3. Experimental evaluation and results**

**Simulation Setup and Data Generation:** For each analysis, 5 multivariate autoregressive time series, each containing 100 observations, were simulated 25 times using published Matlab code [25]. The time series were generated using a Gaussian process with mean 0 and standard deviation 1. The variance/covariance matrix contained all 1's on the diagonal and all 0's on the off-diagonals. A total of 12 unique multivariate time series configurations were constructed, each with a different number of outliers and outlier strength/magnitude. 5, 10 or 15 outliers were introduced into a time series, and the magnitude of those outliers was 1, 2, 3, 4 or 5. All combinations of number of outliers and outlier magnitude were constructed, but they were never mixed. For instance, if we introduced 5 outliers of magnitude 3 into a simulated time series, only 5 outliers of magnitude 3 were used for all 25 simulated time series in that set. The observations at which outliers were introduced into the time series were always selected at random. Once the observations had been selected for outlier introduction, an outlier of the given magnitude was added to (or subtracted from, if the original observation was negative) *each* of the five components of the multivariate time series.
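The sketch below mimics this setup in Python under simplifying assumptions (white-noise series rather than the published MVAR generator of [25], and a sign-following additive outlier injection); it is not the Matlab code used in the experiments.

```python
# Generate 5 Gaussian series of length 100 with identity covariance and inject
# n_out additive outliers of a fixed magnitude at random time points, applied
# to every component of the multivariate series.
import numpy as np

def simulate(n=100, k=5, n_out=5, magnitude=3, seed=0):
    rng = np.random.default_rng(seed)
    ts = rng.standard_normal((n, k))                      # mean 0, standard deviation 1
    out_idx = rng.choice(n, size=n_out, replace=False)    # random outlier positions
    for t in out_idx:
        ts[t] += np.sign(ts[t]) * magnitude               # add, or subtract if negative
    return ts, out_idx
```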

**Validation Criteria and Procedure:** We validated and compared the performance of our new Multivariate Voronoi Outlier Detection (MVOD) method with that of the MLTS, using the True and False Positive Rates (TPR and FPR) defined in Table 1.


**Table 1.** Definition of True (TPR) and False (FPR) Positive Rate.


The alpha parameter in the MLTS method determines both the size of the subset to use and a critical value in a chi-square distribution. If an observation exceeds this threshold in the chi-square distribution, the MLTS method flags the observation as an outlier. However, it is important to note that there is no one-to-one correspondence between the alpha value chosen and the number of outliers flagged. For instance, one could set alpha at 0.10 but have only 2 out of 100 observations flagged as outliers. Partly for this reason, we considered a range of alpha values and then averaged across this range to compare fairly with the MVOD method. For all simulated time series, we considered alpha between 0.01 and 0.20.


**Table 2.** True and False Positive Rates for MVOD and MLTS with 5, 10, or 15 outliers of magnitudes 1, 2, 3, 4, or 5.

In the results presented next, we obtained the TPR and FPR for the two methods in the following way. For a given number of outliers with a specific outlier magnitude, we averaged a total of five cases. The five cases averaged always included the threshold (MVOD) or alpha value (MLTS) corresponding with the number of outliers, but also contained the preceding four cases as well. For instance, in the 10 outlier case, we took the results for threshold=10 (MVOD), as well as thresholds of 9, 8, 7 and 6. In the corresponding MLTS case, we would have taken alpha=0.10, 0.09, 0.08, 0.07 and 0.06. The TPR and FPR for each of these five cases for each method were averaged to obtain the values shown in Table 2.

**Figure 6.** True Positive Rate (TPR, y-axis) for MVOD and MLTS for 5 outliers (top panel), 10 outliers (middle panel), and 15 outliers (bottom panel) with outlier magnitudes of 1, 2, 3, 4 or 5 (x-axis).

**Results:** Table 2 and Figure 6 show that in terms of the TPR, our method outperforms the MLTS when the outlier strength is low (i.e. magnitudes of 1 and 2) and has slightly better performance than the MLTS for medium outlier strength (i.e. magnitude of 3), while the two approaches are comparable when the outlier strength is high (i.e. magnitudes of 4 and 5). As evident from Table 2, for the FPR the two methods behave similarly, with negligible differences on the order of $10^{-2}$. Additionally, the number of outliers (5, 10 or 15) does not have an obvious effect on either method. In summary, the experiments demonstrate that our MVOD method can work effectively and accurately in detecting outliers in multivariate time series data. Compared to MLTS, the MVOD is more sensitive in detecting small-magnitude outliers, which are often difficult for an outlier detection algorithm. Furthermore, both our MVOD and the MLTS work reasonably well for a wide range of contamination levels. That is, both methods are quite robust to the number of outliers in the dataset.

#### **4. Machine learning methods for novelty detection**

Novelty detection can be considered as the task of classifying test data that differ in some respect from the data that are available during training. This may be approached within the framework of "one-class classification" [3], in which a model is built to describe "normal" training data. Novelty detection methods can be categorized into several areas such as probabilistic, distance-based, reconstruction-based, domain-based, and information-theoretic techniques. In this section, we mainly introduce the first category of probabilistic approaches, and briefly summarize the others.

#### **4.1. Probabilistic approaches**


Probabilistic approaches to novelty detection are based on estimation of the generative probability density function of the data, which may then be used to define thresholds for the boundaries of "normality" in the data space and to test whether or not a test sample comes from the same distribution. Statistical hypothesis tests are the simplest statistical techniques for novelty detection [5]. Among the different statistical tests for novelty detection, here we concentrate on more advanced statistical modeling methods involving complex, multivariate data distributions. Techniques for estimating the underlying data density from multivariate training data broadly fall into parametric and nonparametric methods. The former imposes a restrictive model on the data, leading to a large bias when the model does not fit the data; the latter builds a very flexible model with fewer assumptions, but requires a large sample size to reliably fit all free parameters when the model becomes large.

In parametric approaches, the widely used distribution form for continuous variables is the Gaussian. The involved parameters are estimated from the training data via *maximum likelihood estimates* (MLE), for which a closed-form analytical solution is available for a Gaussian distribution. More complex data distributions can be modeled through mixture models (e.g. Gaussian Mixture Models, or GMMs for short), or other mixtures of different types of distributions (e.g. the gamma, the Poisson, the Student's t, and the Weibull distributions) [26, 27]. When the form of the data distribution is not available, a Gaussian distribution is usually assumed due to its convenient analytical properties. The parameters of GMMs can be estimated with maximum likelihood methods (using optimization algorithms including conjugate gradients or expectation-maximization, EM) or with Bayesian methods (e.g. variational Bayes) [26]. Besides requiring large numbers of training examples to estimate the model parameters, another limitation of parametric methods is that the chosen distributional form may not match the model generating the data. Despite these limitations, GMMs have been a popular scheme for novelty detection. Another strategy for novelty detection is to utilize time-series approaches, for example the *Autoregressive Integrated Moving Average* (ARIMA) stochastic process, which can be used to predict the next data point and determine whether or not it is artefactual [28]. State-space models are also commonly used for novelty detection in time-series data; they assume there is some underlying hidden state that generates the observations and that this hidden state evolves through time [29]. The Hidden Markov Model (HMM) and the Kalman filter are two common state-space models for novelty detection.
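As a concrete, hedged example of the GMM route, the sketch below fits a mixture to "normal" training data and flags test points whose log-likelihood falls below a low quantile of the training scores; the number of components and the 1% quantile are arbitrary illustrative choices.

```python
# GMM-based novelty detection: low log-likelihood under the model of "normal"
# data marks a point as novel.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_detector(X_train, n_components=3, quantile=0.01, seed=0):
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X_train)
    threshold = np.quantile(gmm.score_samples(X_train), quantile)
    return gmm, threshold

def is_novel_gmm(gmm, threshold, X_test):
    return gmm.score_samples(X_test) < threshold   # True where the point looks novel
```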

Non-parametric methods do not assume a fixed model structure; the model grows in size as necessary to fit the data and accommodate its complexity. A common non-parametric approach to probabilistic density estimation is the kernel density estimator [26], which estimates the probability density function with a large number of kernels over the data space. The kernel density estimator places a kernel (e.g. a Gaussian) on each data point and then sums the contributions from a localized neighborhood of the kernel. This is the so-called *Parzen windows* estimator [30], which has been used for novelty detection in a number of applications including mammographic image analysis [31]. One-class classification based on Gaussian Processes (GPs) has also been developed and used recently [32]. This technique likewise takes a pointwise approach to novelty detection, dividing the data space into regions with high and low support depending on whether or not those regions are close to the regions occupied by "normal" training data. A related way of detecting novelty is based on the well-established area of *changepoint* detection [33], with the goal of determining whether the generative distribution of a sequence of observations has remained stable or has undergone some abrupt change. Here, in addition to detecting whether a change has occurred, another aim is to estimate the time at which the change occurred. Applied in a batch or online setting to the retrospective problem, changepoint detection amounts to identifying a test statistic suitable for testing the hypothesis that a change has occurred against the hypothesis that no change has occurred. A likelihood ratio statistic, among others [33, 34], is appropriate.
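A corresponding nonparametric sketch, in the spirit of the Parzen-window estimator discussed above, is given below; the Gaussian kernel bandwidth and the 1% threshold quantile are illustrative choices, not values from [30, 31].

```python
# Parzen-window (kernel density) novelty detection: a Gaussian kernel is placed
# on every training point; test points in low-density regions are flagged.
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_parzen_detector(X_train, bandwidth=0.5, quantile=0.01):
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X_train)
    threshold = np.quantile(kde.score_samples(X_train), quantile)
    return kde, threshold

def is_novel_kde(kde, threshold, X_test):
    return kde.score_samples(X_test) < threshold
```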

#### **4.2. Other categories**

Distance-based approaches, such as clustering or nearest-neighbor methods [35-37], are another type of technique that can be used for classification or for estimating the probability density function of data. The underlying assumption is that "normal" data are tightly clustered, while novel data occur far from their nearest neighbors. These methods use well-defined distance metrics to compute the distance (e.g. a similarity measure) between two data points.
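A hedged sketch of a simple distance-based detector follows: the novelty score of a test point is its average distance to its *k* nearest "normal" training points; *k* and any score threshold are illustrative choices.

```python
# k-nearest-neighbor distance as a novelty score: far from "normal" data = novel.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_novelty_scores(X_train, X_test, k=5):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, _ = nn.kneighbors(X_test)      # distances to the k nearest training points
    return dist.mean(axis=1)             # larger score = farther from the "normal" data
```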

Reconstruction-based methods involve training a regression model with the training data [3, 38, 39]. The distance between the test vector and the output of the system (i.e. the reconstruction error) can be related to the novelty score, which will be high when "abnormal" data occur. For instance, neural networks can be used in this way and show many of the same advantages for novelty detection as they do for typical classification applications. Another type of reconstruction-based novelty detection comprises subspace-based techniques, which assume that data can be projected or embedded into a lower-dimensional subspace in which "normal" and "abnormal" data are easier to discriminate.
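A hedged sketch of a reconstruction-based (subspace) detector is given below: project onto a low-dimensional PCA subspace fitted to "normal" data and use the reconstruction error as the novelty score. The number of components and the 99th-percentile threshold are illustrative choices.

```python
# PCA reconstruction error as a novelty score.
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_detector(X_train, n_components=2, percentile=99):
    pca = PCA(n_components=n_components).fit(X_train)
    err = np.linalg.norm(X_train - pca.inverse_transform(pca.transform(X_train)), axis=1)
    return pca, np.percentile(err, percentile)   # model and reconstruction-error threshold

def reconstruction_score(pca, X):
    return np.linalg.norm(X - pca.inverse_transform(pca.transform(X)), axis=1)
```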

Domain-based methods often aim to describe a domain containing "normal" data by placing a boundary around the "normal" class that follows the distribution of the data without explicitly providing a distribution [40, 41]. These techniques are usually insensitive to the specific sampling and density of the class of interest. The location of unknown data relative to the boundary is the criterion for determining its class membership. The novelty detection support vector machines (SVMs) are the "one-class SVMs", which set the location of the novelty boundary based only on the data lying closest to it in the transformed feature space. That is, the novelty boundary is determined without considering the data that are not support vectors.
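A minimal sketch of a one-class SVM boundary with scikit-learn is shown below; the RBF kernel and the value of nu (an upper bound on the fraction of training points left outside the boundary) are illustrative choices.

```python
# One-class SVM novelty boundary: +1 means inside the learned "normal" region,
# -1 means outside it (novel).
from sklearn.svm import OneClassSVM

def fit_one_class_svm(X_train, nu=0.05, gamma="scale"):
    return OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)

# Usage: novel_mask = fit_one_class_svm(X_train).predict(X_test) == -1
```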

Information-theoretic methods calculate the information content of a dataset with measures such as entropy, relative entropy, and Kolmogorov complexity [42, 43]. The key idea is that novel data significantly alter the information content of a dataset. A common procedure is to compute such metrics using the entire dataset and then identify the subset of points whose elimination from the dataset causes the largest change in the metric. The data contained in this subset are then assumed to be the novel data.
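As a hedged, simplified sketch of this idea (leave-one-out rather than a full subset search, and using the Gaussian differential entropy of the sample as the information measure):

```python
# Leave-one-out entropy change: points whose removal changes the (Gaussian)
# differential entropy the most are candidate novelties.
import numpy as np

def gaussian_entropy(X):
    d = X.shape[1]
    _, logdet = np.linalg.slogdet(np.cov(X, rowvar=False))
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def loo_entropy_change(X):
    base = gaussian_entropy(X)
    return np.array([abs(gaussian_entropy(np.delete(X, i, axis=0)) - base)
                     for i in range(len(X))])
```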

