#### **2. Outlier detection**

Outliers are measurements that differ from the other values of the same dataset and can be due to measurement errors or to the variability of the phenomenon under consideration. Hawkins (Hawkins, 1980) defined an outlier as *"an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism"*.

The detection of outliers is an important step of data mining, because it improves the quality of the data and represents a useful pre-processing phase in many applications, such as financial analysis, network intrusion detection and fraud detection (Hodge, 2004).

Classical outlier detection methods can be classified into four main groups: distance-based, density-based, clustering-based and statistical-based approaches. All these approaches have advantages and limitations, and in recent years many contributions have been proposed on this subject. Artificial intelligence techniques have been widely applied to overcome the limits of the traditional methods and to improve the cleanness of data; in particular, some fuzzy logic-based approaches have proved to outperform classical methodologies.

#### **2.1 Distance-based methods**

The distance-based method is based on the concept of the neighborhood of a sample and was introduced by Knorr and Ng (Knorr & Ng, 1999). They gave the following definition: "*An object O in a dataset T is a DB(p,D)-outlier if at least fraction p of the objects in T lie at a distance greater than D from O*". The parameter p represents the minimum fraction of samples that must lie outside an outlier's D-neighborhood. This definition requires the parameters p and D to be fixed and does not provide a degree of outlierness. Ramaswamy et al. (Ramaswamy et al., 2000) modified the definition of outlier: "*Outliers are the top n data points whose distance to the kth nearest neighbor is greatest*". Jimenez-Marquez et al. (Jimenez-Marquez et al., 2002) introduced the Mahalanobis Outlier Analysis (MOA), which uses the Mahalanobis distance (Mahalanobis, 1936) as the outlying degree of each point. Another outlier detection method based on the Mahalanobis distance was proposed by Matsumoto et al. (Matsumoto et al., 2007). The Mahalanobis distance is defined as the distance between each point and the center of mass of the data, so this approach considers as outliers the data points that are far away from their center of mass.
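As an illustration of this idea, the following minimal sketch (not the published MOA algorithm) scores every sample by its Mahalanobis distance from the centre of mass and flags the highest-scoring points; the helper name `mahalanobis_scores`, the synthetic data and the percentile cutoff are assumptions made only for the example.

```python
import numpy as np

def mahalanobis_scores(X):
    """Mahalanobis distance of each row of X from the data's center of mass."""
    mu = X.mean(axis=0)                      # center of mass
    cov = np.cov(X, rowvar=False)            # sample covariance matrix
    cov_inv = np.linalg.pinv(cov)            # pseudo-inverse for robustness
    diff = X - mu
    # d_i = sqrt((x_i - mu)^T * Cov^-1 * (x_i - mu))
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

# Example: flag the points whose score exceeds a chosen cutoff
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:3] += 6.0                                 # inject three artificial outliers
scores = mahalanobis_scores(X)
outliers = np.where(scores > np.percentile(scores, 97.5))[0]
print(outliers)
```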

#### **2.2 Density-based methods**

Density-based methods calculate the density distribution of the data and classify as outliers the points lying in low-density regions. Breunig et al. (Breunig et al., 2000) assign a local outlier factor (LOF) to each point on the basis of the local density of its neighborhood. In order to understand the formula of the Local Outlier Factor it is necessary to introduce several definitions. The **k-distance** of a point *x* is the distance *d(x,y)* to a point *y* of the dataset *D* such that at least *k* points *y'є D-{x}* satisfy *d(x,y')≤d(x,y)*, with *k* an integer value. The **k-distance neighborhood** of a data point *x* includes the points whose distance from *x* is not greater than the *k-distance*. Moreover, the **reachability distance** of a data point *x* with respect to the data point *y* is defined as the maximum between the k-distance of *y* and the distance between the two points. The **local reachability density** of the data point *x* is defined as the inverse of the mean reachability distance of *x* from its *MinPts*-nearest neighbors. Finally, the Local Outlier Factor of *x* is the average, computed over the *MinPts*-nearest neighbors of *x*, of the ratio between the local reachability density of each neighbor and the local reachability density of *x*. It is evident that *MinPts* is an important parameter of the algorithm. Papadimitriou et al. (Papadimitriou et al., 2003) propose LOCI (Local Correlation Integral), which uses statistical values drawn from the data to avoid the problem of choosing a value for *MinPts*.
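As an illustration, a LOF-based detector of this kind can be obtained, for example, with scikit-learn's `LocalOutlierFactor`; the synthetic data, the choice `n_neighbors=20` (playing the role of *MinPts*) and the thresholding convention are assumptions made for this sketch.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
X[:5] += 5.0                                  # a few low-density points

# n_neighbors plays the role of MinPts in the description above
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                   # -1 for outliers, +1 for inliers
scores = -lof.negative_outlier_factor_        # LOF values (larger = more outlying)

print(np.where(labels == -1)[0], scores[:5])
```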

#### **2.3 Clustering-based methods**



Clustering-based methods perform a preliminary clustering operation on the whole dataset and then classify as outliers the data which are not located in any cluster.

Fuzzy C-means algorithm (FCM) is a method of clustering developed by Dunn in 1973 (Dunn, 1973) and improved by Bezdek in 1981 (Bezdek, 1981). This approach is based on the notion of fuzzy c-partition introduced by Ruspini (Ruspini, 1969). Let *X={x1, x2, ..., xn}* be a set of data where each sample *xh* (*h=1, 2, ..., n*) is a vector of dimensionality *p*. Let *Ucn* be the set of real *c×n* matrices, where *c* is an integer value between 2 and *n*. The fuzzy C-partition space for *X* is the following set:

$$M\_{cn} = \{ U \in U\_{cn};\; u\_{ih} \in [0,1];\; \sum\_{i=1}^{c} u\_{ih} = 1 \;\forall h;\; 0 < \sum\_{h=1}^{n} u\_{ih} < n \;\forall i \} \tag{1}$$

where *uih* is the degree of membership of *xh* in cluster *i* (*1≤i≤c*). The objective of the FCM approach is to provide an optimal fuzzy C-partition by minimizing the following function:

$$J\_m(U, V; X) = \sum\_{h=1}^{n} \sum\_{i=1}^{c} (u\_{ih})^m \, \| x\_h - v\_i \|^2 \tag{2}$$

where *V=(v1, v2, ..., vc)* is the matrix of cluster centres, ‖·‖ is the Euclidean norm and *m* is a weighting exponent (*m>1*).
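A minimal sketch of the alternating updates that minimize equation (2) is given below; the function name, the fixed number of iterations used as stopping rule and the synthetic data are assumptions, not part of the original formulation.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Minimal FCM: alternately update memberships U (c x n) and centres V (c x p)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                       # columns sum to 1 (fuzzy partition)
    for _ in range(n_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)           # centre update
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=0)                                   # membership update
    return U, V

U, V = fuzzy_c_means(np.random.default_rng(2).normal(size=(100, 2)), c=2)
print(U.shape, V.shape)   # (2, 100), (2, 2)
```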

Many clustering-based outlier detection approaches have been developed in recent years. For instance, Jang et al. (Jang et al., 2001) proposed an outlier-finding process called OFP based on the k-means algorithm, which considers small clusters as outliers. Yu et al. (Yu et al., 2002) proposed an outlier detection method called FindOut, which identifies outliers by removing clusters from the original data. Moreover, He et al. (He et al., 2003) introduced the notion of cluster-based local outlier and the detection method FindCBLOF, which exploits a cluster-based LOF in order to quantify the outlierness of each sample. Finally, Jang et al. (Jang et al., 2005) proposed a novel method to improve the efficiency of the FindCBLOF approach.

#### **2.4 Statistical-based methods**

Statistical-based methods fit a standard distribution to the initial dataset. Outliers are defined with respect to the probability distribution, assuming that the data distribution is known a priori. The main limitation of this approach lies in the fact that, for many applications, such prior knowledge is not available and the cost of fitting the data with a standard distribution can be considerable. A widely used method belonging to the distribution-based approaches was proposed by Grubbs (Grubbs, 1969). This test is efficient when the data can be approximated by a Gaussian distribution. The Grubbs test calculates the following statistic:

$$G = \frac{\max\_{i} \left| x\_i - \mu \right|}{\sigma} \tag{3}$$



where *µ* is the mean value and *σ* the standard deviation of the data. When *G* exceeds a fixed critical value, the data point attaining the maximum is classified as an outlier. The critical value depends on the required significance level of the test; common choices are 1% and 5%. Other tests that assume normally distributed data are Rosner's test (Gibbons, 1994) and Dixon's test (Dixon, 1993).
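As a hedged illustration, the following sketch computes the statistic of equation (3) for the most extreme point and compares it with the usual two-sided Grubbs critical value derived from the Student-t distribution; the data values are a toy example and `scipy` is assumed to be available.

```python
import numpy as np
from scipy import stats

def grubbs_once(x, alpha=0.05):
    """One step of the two-sided Grubbs test: returns (index, G, G_crit)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu, sigma = x.mean(), x.std(ddof=1)
    idx = np.argmax(np.abs(x - mu))                   # most extreme point
    G = np.abs(x[idx] - mu) / sigma                   # statistic of equation (3)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)       # Student-t quantile
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx, G, G_crit

data = [9.8, 10.1, 10.0, 9.9, 10.2, 14.7]             # last value is suspicious
i, G, Gc = grubbs_once(data)
print(i, G > Gc)                                       # flags index 5 as an outlier
```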

#### **2.5 Fuzzy Inference System based method**

In recent years novel FIS-based outlier detection approaches have been proposed in order to outperform the classical ones.

Yousri et al. (Yousri et al., 2007) propose an approach which combines an outlier detection method with a clustering algorithm. The outlier detection method serves two purposes: it gives a hard membership that decides whether the considered pattern is an outlier or not, and it gives a degree of outlierness. A clustering algorithm is then used to allocate patterns to clusters. Let *P* indicate the dataset and *p* its generic sample, and let *To* be the outlier detection technique and *Tc* the adopted clustering algorithm. The combination of the outlier detection and clustering algorithms is provided by the following formula, which gives the degree of outlierness of a sample *p*:

$$O\_p = \frac{w\_o \, O\_{To}(p) + w\_c \, O\_{Tc}(p)}{w\_o + w\_c} \tag{4}$$

where *OTo(p)* is the degree of outlierness given by *To* for the sample *p*, while *OTc(p)* is the degree of outlierness given by *Tc*, which considers as outliers the patterns allocated to tiny clusters or not assigned to any cluster. Finally *wo* and *wc* are the weights given to the two algorithms in determining outliers. The two weights must be non-negative (*wo, wc ≥ 0*) and their sum must be positive (*wo+wc > 0*). Equation (4) can be rewritten as follows:

$$O\_p = \frac{\frac{w\_o}{w\_c}\, O\_{To}(p) + O\_{Tc}(p)}{\frac{w\_o}{w\_c} + 1} \tag{5}$$

The ratio *wo/wc* should be chosen carefully, balancing the membership degree given to the outlier cluster against those given to the clusters of the initial set of groups.

The main advantage of this approach is that it is general, i.e. it can combine any outlier detection method with any clustering algorithm, and it is effective with both low- and high-dimensional datasets.
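A minimal sketch of the combination rule of equations (4)-(5) is shown below; the two component score vectors are placeholders (for instance a rescaled LOF value and one minus the maximum fuzzy membership), chosen only for illustration.

```python
import numpy as np

def combined_outlierness(o_det, o_clu, w_o=1.0, w_c=1.0):
    """Weighted combination of two outlierness degrees, as in equation (4)."""
    o_det, o_clu = np.asarray(o_det), np.asarray(o_clu)
    return (w_o * o_det + w_c * o_clu) / (w_o + w_c)

# Placeholder component scores in [0, 1]: e.g. a rescaled LOF score and
# (1 - max fuzzy membership) coming from a clustering algorithm.
o_from_detector = np.array([0.1, 0.2, 0.9, 0.15])
o_from_clustering = np.array([0.0, 0.3, 0.8, 0.1])
print(combined_outlierness(o_from_detector, o_from_clustering, w_o=2.0, w_c=1.0))
```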

Another novel fuzzy approach is proposed by Xue et al. (Xue et al., 2010). The approach is called *Fuzzy Rough Semi-Supervised Outlier Detection* (FRSSOD) and combines two methods: the *Semi-Supervised Outlier Detection method* (SSOD) (Gao et al., 2006) and the clustering method called *Fuzzy Rough C-Means clustering* (FRCM) (Hu & Yu, 2005). The aim of this approach is to establish if samples on the boundary are outliers or not, by exploiting the advantages of both approaches. In order to understand FRSSOD a brief description of SSOD and FRCM is necessary.


SSOD is a semi-supervised outlier detection method (Li et al., 2007; Zhang et al., 2005; Gao et al., 2006; Xu & Liu, 2009) that uses both unlabeled and labelled samples in order to improve the accuracy without the need for a large amount of data. Let us suppose that *X* is a dataset with *n* samples forming *K* clusters. The first *l* samples are labelled with binary values *yi*: a null value marks an outlier, while a unitary value marks a normal sample. If we consider that outliers are not included in any of the *K* clusters, an *n×K* matrix must be found whose elements *tih* (*i=1, 2, ..., n*; *h=1, 2, ..., K*) are unitary when *xi* belongs to cluster *Ch*. Outliers are determined as the points that do not belong to any cluster through the minimization of the following objective function:

$$Q = \sum\_{i=1}^{n} \sum\_{h=1}^{K} t\_{ih}\, dist(c\_h, x\_i)^2 + \gamma\_1 \left( n - \sum\_{i=1}^{n} \sum\_{h=1}^{K} t\_{ih} \right) + \gamma\_2 \sum\_{i=1}^{l} \left| y\_i - \sum\_{h=1}^{K} t\_{ih} \right| \tag{6}$$

where *ch* represents the centroid of cluster *Ch*, *dist* is the Euclidean distance and *γ1*, *γ2* are adjusting parameters. The objective function is the sum of three parts: the first part comes from k-means clustering, where outliers are not considered; the second part constrains the number of outliers to stay below a certain threshold; and the third part maintains consistency between the labelling produced by the algorithm and the existing labels.
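The following is a small sketch that merely evaluates the objective (6) for a given hard assignment matrix *t*; the function name, the toy data and the parameter values are assumptions, and the actual SSOD optimization procedure is not reproduced here.

```python
import numpy as np

def ssod_objective(X, t, centroids, y_labels, gamma1=1.0, gamma2=1.0):
    """Evaluate equation (6) for a hard assignment matrix t (n x K)."""
    n, K = t.shape
    l = len(y_labels)
    dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # n x K
    clustering_term = (t * dist2).sum()                                   # k-means part
    outlier_term = gamma1 * (n - t.sum())                                 # limits #outliers
    label_term = gamma2 * np.abs(y_labels - t[:l].sum(axis=1)).sum()      # label consistency
    return clustering_term + outlier_term + label_term

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [20.0, 20.0]])
t = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]])   # last sample assigned to no cluster
c = np.array([[0.05, 0.0], [5.05, 5.0]])
print(ssod_objective(X, t, c, y_labels=np.array([1, 1, 1])))
```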

FRCM is a combination of the fuzzy C-means algorithm and the rough C-means (RCM) approach. The fuzzy C-means method was introduced by Dunn (Dunn, 1974) and is an unsupervised clustering algorithm. It assigns a membership degree of each sample to each cluster by minimizing the following objective function:

$$J\_m(U, V; X) = \sum\_{k=1}^{n} \sum\_{i=1}^{c} u\_{ik}^m \, \| x\_k - v\_i \|^2 \tag{7}$$

where *V = (v1, v2, ..., vc)* is the vector of the cluster centres and *uik* is the degree of membership of the sample *xk* to the cluster *i*. The iteration stops when a stable condition is reached, and each sample is allocated to the cluster for which its membership value is maximum. The initialization of the centres is an important step that affects the final result.

The RCM approach is based on the concept of C-means clustering and on the concept of rough set. Rough sets were introduced by Pawlak (Pawlak, 1982; Pawlak, 1991). In the rough set framework each observation of the universe carries a specified amount of information, and objects which carry the same information are indistinguishable. A rough set, unlike a precise set, is characterized by a lower approximation, an upper approximation and a boundary region. The lower approximation includes all objects that certainly belong to the considered notion, the upper approximation contains the objects which possibly belong to it, and the boundary region is the difference between the two. In the RCM method each cluster is treated as a rough set with these three regions. Unlike in classical clustering algorithms, a sample can be a member of more than one cluster, so overlaps between clusters are possible. Lingras and West (Lingras & West, 2004) proposed a method whose assignments satisfy the following properties: a sample can belong to the lower approximation region of at most one cluster; a sample belonging to the lower approximation region of a cluster also belongs to the upper approximation region of the same cluster; and a sample that does not belong to any lower approximation region belongs to at least two upper approximation regions.
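As a hedged illustration of these properties, the sketch below implements a commonly used rough assignment rule (in the spirit of Lingras & West): a sample whose two nearest centroids are comparably close is placed only in their upper approximation regions, otherwise it goes to the lower (and upper) approximation of the nearest cluster. The ratio threshold and all names are assumptions, not taken from the chapter.

```python
import numpy as np

def rough_assign(X, centroids, threshold=1.2):
    """Assign each sample to lower/upper approximation regions of the clusters.

    If the two closest centroids are comparably near (ratio below `threshold`),
    the sample goes only to both upper approximations (boundary); otherwise it
    goes to the lower (and upper) approximation of the nearest cluster.
    """
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)   # n x c
    lower = [[] for _ in centroids]
    upper = [[] for _ in centroids]
    for i, di in enumerate(d):
        order = np.argsort(di)
        nearest, second = order[0], order[1]
        if di[second] / (di[nearest] + 1e-12) <= threshold:
            upper[nearest].append(i)          # boundary sample: two or more uppers
            upper[second].append(i)
        else:
            lower[nearest].append(i)          # confident sample: lower approximation
            upper[nearest].append(i)
    return lower, upper

X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [1.0, 1.0]])
low, up = rough_assign(X, centroids=np.array([[0.0, 0.0], [2.0, 2.0]]))
print(low, up)
```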




FRCM integrates the concept of rough set with fuzzy set theory, adding a fuzzy membership value of each point to the lower approximation and to the boundary region of a cluster. The approach divides the data of each cluster into two sets, a *lower approximation region* and a *boundary region*; the points belonging to the boundary region are then fuzzified. Let us suppose that *X={x1, x2, ..., xn}* is the available dataset and that *Ch* and *C̄h* are respectively the lower and upper approximations of the cluster *h*; the boundary region is the difference between the two. If *u={uih}* are the cluster memberships, the FRCM problem becomes the minimization of the following function:

$$J\_m(\boldsymbol{u}, \boldsymbol{v}) = \sum\_{i=1}^{n} \sum\_{h=1}^{H} (\boldsymbol{u}\_{ih})^m d\_{ih}^2 \tag{8}$$

FRSSOD combines the two methods described above, i.e. FRCM and SSOD, into a novel approach. Let *X* be the set of data with *n* samples and *Y* its subset formed by the first *l<n* samples. The elements of *Y* are labelled as *yi ∈ {1, 0}*, where a null value indicates that the considered point is an outlier. The normal points, i.e. the points that are not considered outliers, form *C* clusters, and each normal point belongs to each cluster with a membership value, while outliers do not belong to any cluster. The main aim of FRSSOD is to compute an *n×C* matrix *u*, whose generic entry *uik* represents the fuzzy membership degree of the *i*-th sample to the *k*-th cluster. The optimization problem consists in the minimization of the following function:

$$J\_m(u, v) = \sum\_{i=1}^{n} \sum\_{k=1}^{C} (u\_{ik})^m d\_{ik}^2 + \gamma\_1 \left( n - \sum\_{i=1}^{n} \Big( \sum\_{k=1}^{C} u\_{ik} \Big)^m \right) + \gamma\_2 \sum\_{i=1}^{l} \left( y\_i - \sum\_{k=1}^{C} u\_{ik} \right)^2 \tag{9}$$

where *γ1* and *γ2* are positive adjusting parameters that make the three terms compete with each other and *m* is a fuzziness weighting exponent (*m>1*). Following the idea of the SSOD approach, only normal points are divided into clusters, so the points considered as outliers do not appear in the first term of the equation. The second term keeps the number of outliers below a certain limit, and the third term maintains consistency between the user labelling and the existing labels, also penalising mislabelled samples. The method has been applied to a synthetic dataset and to real data, and the results show that FRSSOD can be used in many fields involving fuzzy information granulation. The experimental results also show that the proposed method has several advantages over SSOD, improving outlier detection accuracy and reducing the false alarm rate thanks to the control on the labelled samples. The main disadvantage of the FRSSOD method is that the result depends on the choice of the number of clusters, the initialization of the cluster centres and the adjusting parameters.
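Analogously to the SSOD sketch above, the following only evaluates the FRSSOD objective (9) for a given fuzzy membership matrix; the toy memberships, distances and parameters are assumptions, and the optimization itself is not implemented.

```python
import numpy as np

def frssod_objective(u, d, y, m=2.0, gamma1=1.0, gamma2=1.0):
    """Evaluate equation (9): u is n x C memberships, d is n x C distances,
    y holds the binary labels of the first l samples (0 = outlier)."""
    n, C = u.shape
    l = len(y)
    term1 = ((u ** m) * d**2).sum()                          # fuzzy clustering cost
    term2 = gamma1 * (n - (u.sum(axis=1) ** m).sum())        # limits number of outliers
    term3 = gamma2 * ((y - u[:l].sum(axis=1)) ** 2).sum()    # consistency with labels
    return term1 + term2 + term3

u = np.array([[0.9, 0.1], [0.5, 0.5], [0.0, 0.0]])   # last sample treated as an outlier
d = np.array([[0.2, 1.5], [0.8, 0.7], [3.0, 3.2]])
print(frssod_objective(u, d, y=np.array([1, 1, 0])))
```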

Another fuzzy-based method to detect outliers is proposed by Cateni et al. (Cateni et al., 2009). The proposed approach combines different classical outlier detection techniques in order to overcome their limitations and exploit their advantages. An important advantage of the method lies in the fact that the system is automatic and no a priori assumptions are required. The method consists in calculating four features for each pattern by means of the most popular outlier detection techniques:

1. **Distance-based.** The Mahalanobis distance is calculated and normalized with respect to its maximum value. Patterns whose value is close to 1 are considered outliers.

2. **Clustering-based.** The clustering algorithm used is the fuzzy C-means (FCM) already described. The output of the clustering is the membership degree of each sample to the clusters, which lies in the range [0,1]; samples for which this feature is close to 0 are considered outliers. The FCM approach requires the number of clusters to be known a priori, and if the distribution is unknown it is not easy to find the optimal number of clusters. In this approach a validity measure based on intra-cluster and inter-cluster distance measures (Ray & Turi, 1999) is calculated in order to determine automatically the most suitable number of clusters. This step is fundamental because the result of the clustering strongly depends on this parameter.

3. **Density-based.** For each pattern the Local Outlier Factor (LOF) is evaluated. This feature requires the number of nearest neighbours *K* to be known a priori. Here *K* corresponds to the number of elements of the least populated cluster previously found by the fuzzy C-means algorithm. The LOF feature lies in the range [0,1], where a unitary value means that the considered sample is an outlier.

4. **Distribution-based.** The Grubbs test is performed and the result is a binary value: a unitary value indicates that the selected pattern is an outlier, a null value otherwise.
These four features are fed as inputs to a FIS of the Mamdani type (Mamdani, 1974). The output of the system, called *outlier index*, represents the degree of outlierness of each pattern. Finally a threshold, set to 0.6, is used to point out the outliers.
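The chapter does not report the rule base or the membership functions of this FIS, so the following is only a minimal two-rule Mamdani-style sketch in plain Python (triangular membership functions, min implication, max aggregation, centroid defuzzification). The rules, the membership shapes, the feature ordering and the inversion of the clustering membership so that high values always mean "outlying" are all assumptions made for illustration.

```python
import numpy as np

def trimf(x, a, b, c):
    """Triangular membership function with vertices a <= b <= c."""
    x = np.asarray(x, dtype=float)
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

def outlier_index(features, eps=1e-3):
    """Tiny two-rule Mamdani FIS; `features` are four degrees in [0,1], high = outlying."""
    f = np.asarray(features, dtype=float)
    z = np.linspace(0, 1, 201)                              # discretized output universe
    low_out = trimf(z, -eps, 0.0, 0.5)                      # "outlier index is low"
    high_out = trimf(z, 0.5, 1.0, 1.0 + eps)                # "outlier index is high"
    high_in = trimf(f, 0.4, 1.0, 1.0 + eps)                 # "feature is high"
    low_in = trimf(f, -eps, 0.0, 0.6)                       # "feature is low"
    w_high = high_in.max()                                  # rule 1: ANY feature high
    w_low = low_in.min()                                    # rule 2: ALL features low
    agg = np.maximum(np.minimum(w_high, high_out),          # min implication, max aggregation
                     np.minimum(w_low, low_out))
    return float((agg * z).sum() / (agg.sum() + 1e-12))     # centroid defuzzification

# Feature order assumed: [Mahalanobis, 1 - cluster membership, LOF, Grubbs flag]
print(outlier_index([0.9, 0.8, 0.95, 1.0]) > 0.6)           # True: flagged as outlier
print(outlier_index([0.1, 0.2, 0.05, 0.0]) > 0.6)           # False
```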

The proposed method has been tested, as a pre-processing phase, on a dataset provided by a steelmaking industry. The extracted data represent the chemical analyses and the process variables associated with the fabrication of liquid steel, and the dataset has been analyzed in order to find the factors that affect the final steel quality. In this case outliers can be caused by sensor failures, human errors during the registration of off-line analysis data, or abnormal process conditions; this kind of outlier is the most difficult to discover because it does not differ much from correct data. In the considered problem the variables most affected by outliers are the tapping temperature, the reheating temperature and the addition of aluminium. First of all each variable has been normalised in order to obtain values in the range [0,1]. Table 1 illustrates the results obtained on 1000 measurements through 6 different approaches: a distance-based approach, a clustering-based approach, a density-based approach, a distribution-based approach and two fuzzy-based approaches. The effective outliers are five; they have been pointed out by skilled technical personnel working on the plant and are identified in the table with alphabetic letters, in order to show how many and which outliers are correctly detected by each method. The second row shows the outliers present for each considered variable, and the other columns refer to the outliers detected by each considered method. Finally, Table 2 reports how many samples have been misclassified as outliers (the so-called *false alarms*) for each method.

Table 1. Outlier Detection.

Table 2. Number of false alarms.

Tables 1 and 2 show that the fuzzy-based approaches are able to detect all the outliers present in the dataset without producing false alarms. The obtained results therefore confirm the effectiveness of the fuzzy-based approaches, which outperform the traditional techniques; in particular the second fuzzy approach (Cateni et al., 2009) obtains the best results.

conventional computational intelligence methods tend to classify all instances into the majority class (Hong et al., 2007). Moreover, these methods (such as Multilayer Perceptron, Radial Basis Functions, Linear Discriminant Analysis, ...) cannot properly classify imbalanced datasets, because they learn by maximizing the overall accuracy without taking into account the error cost of the classes (Visa & Ralescu, 2005; Alejo et al., 2006; Xie & Qiu, 2007).

The performance of a machine learning method is typically assessed through a confusion matrix. An example of confusion matrix is illustrated in Table 3, where the columns represent the predicted class and the rows the actual class. Most studies in the imbalanced domain refer to binary classification, as a multi-class problem can be simplified to a two-class problem. Conventionally the class label of the minority class is positive while the class label of the majority class is negative. In the table the True Negative value (*TN*) represents the number of negative samples correctly classified, the True Positive value (*TP*) is the number of positive samples correctly classified, the False Positive value (*FP*) is the number of negative samples classified as positive and finally the False Negative value (*FN*) is the number of positive samples classified as negative. Other common evaluation measures are *Precision* (Prec), which measures the accuracy provided that a specific class has been predicted, and *Recall* (Rec), which measures the ability of a prediction model to select the instances of a certain class from a dataset. Recall is also referred to as the true positive rate, and the true negative rate is also called *Specificity* (Spec).

|              | **Predicted Negative** | **Predicted Positive** |
|--------------|------------------------|------------------------|
| **Negative** | TN                     | FP                     |
| **Positive** | FN                     | TP                     |

Table 3. Confusion Matrix

Through this matrix the following widely adopted evaluation metrics can be calculated:

$$Accuracy = \frac{TP+TN}{TP+FN+FP+TN} \tag{10}$$

$$FP\ rate = 1 - Spec = \frac{FP}{TN+FP} \tag{11}$$

$$TP\ rate = Rec = \frac{TP}{TP+FN} \tag{12}$$

$$Prec = \frac{TP}{TP+FP} \tag{13}$$

$$F\text{-}Measure = \frac{(1+\beta^2)\, Rec \cdot Prec}{\beta^2 \cdot Prec + Rec} \tag{14}$$

where *β* corresponds to the relative importance of *Precision* versus *Recall*. Typically *β=1* when false alarms (false positives) and misses (false negatives) can be considered equally costly.
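A minimal sketch computing the metrics (10)-(14) from the entries of a confusion matrix is given below; the numerical values are a toy imbalanced example chosen for illustration.

```python
def classification_metrics(tp, tn, fp, fn, beta=1.0):
    """Metrics (10)-(14) computed from the entries of a confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    fp_rate = fp / (tn + fp)                      # 1 - specificity
    recall = tp / (tp + fn)                       # true positive rate
    precision = tp / (tp + fp)
    f_measure = (1 + beta**2) * recall * precision / (beta**2 * precision + recall)
    return accuracy, fp_rate, recall, precision, f_measure

# Toy imbalanced example: 950 negatives, 50 positives
print(classification_metrics(tp=35, tn=930, fp=20, fn=15))
```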

#### **3.1 Traditional approaches**


In general, approaches for imbalanced datasets can be divided into two categories: *external* and *internal* approaches. The external methods do not depend on the learning algorithm to be used: they mainly consist in a pre-processing phase aiming at balancing the classes before training the classifiers. Different re-sampling methods fall into this category. In
