**3. Imbalanced datasets**

In a classification task there are often more instances of some classes than of others. Class imbalance is closely connected to the concept of *rarity*. There are two types of rarity: *rare cases*, or outliers (discussed in the previous paragraph), and *rare classes*, where a class of interest includes few samples in comparison with the other classes present in the dataset. Outliers and rare classes are not directly related, but empirical studies show that the minority class tends to contain more outliers than the majority class (Weiss and Provost, 2003). Imbalanced datasets arise in many real-world applications such as detection of cancerous cells (Chan and Stolfo, 1998), fraud detection (Phua et al., 2004), keyword extraction (Turney, 2000), oil spill detection (Kubat et al., 1998), direct marketing (Ling and Li, 1998), and so on. Many approaches have been proposed in order to address the imbalance problem (Visa and Ralescu, 2005); they include resampling the training set, feature selection (Castillo and Serrano, 2004), one-class learners (Raskutti and Kowalczyk, 2004) and cost-sensitive learners, which take the misclassification cost into account (Zadrozny et al., 2003). In this kind of data one class is significantly larger than the others, and the targeted class (known as the *positive* class) is often the smallest one. In these cases classification is difficult because conventional computational intelligence methods tend to assign all instances to the majority class (Hong et al., 2007). Moreover, such methods (e.g. Multilayer Perceptron, Radial Basis Function networks, Linear Discriminant Analysis) perform poorly on imbalanced datasets because they are trained to maximize overall accuracy without taking the misclassification cost of each class into account (Visa & Ralescu, 2005; Alejo et al., 2006; Xie & Qiu, 2007).

The performance of a machine learning model is typically evaluated by means of a confusion matrix. An example of the confusion matrix is shown in Table 3, where the columns represent the predicted class and the rows the actual class. Most studies on imbalanced domains refer to binary classification, since a multi-class problem can be reduced to a two-class problem. Conventionally, the label of the minority class is positive and the label of the majority class is negative. In the table, the True Negative value (*TN*) represents the number of negative samples correctly classified, the True Positive value (*TP*) is the number of positive samples correctly classified, the False Positive value (*FP*) is the number of negative samples classified as positive and, finally, the False Negative value (*FN*) is the number of positive samples classified as negative. Other common evaluation measures are *Precision* (Prec), which measures how accurate the predictions are given that a specific class has been predicted, and *Recall* (Rec), which measures the ability of a prediction model to select the instances of a certain class from the dataset. Recall is also referred to as the true positive rate, while the true negative rate is also called *Specificity* (Spec).


| | Predicted negative | Predicted positive |
|---|---|---|
| **Actual negative** | TN | FP |
| **Actual positive** | FN | TP |

Table 3. Confusion Matrix.





Through this matrix the following widely adopted evaluation metrics can be calculated:

$$Accuracy = (TP + TN)/(TP + FN + FP + TN) \tag{10}$$

$$FP\ rate = 1 - Spec = FP/(TN + FP) \tag{11}$$

$$TP\ rate = Rec = TP/(TP + FN) \tag{12}$$

$$Prec = TP/(TP + FP) \tag{13}$$

$$F\text{-}Measure = \left[(1 + \beta^2)\cdot Rec \cdot Prec\right]/\left[\beta^2 \cdot Prec + Rec\right] \tag{14}$$

where *β* expresses the relative importance of *Recall* with respect to *Precision*. Typically *β = 1* is used when false alarms (false positives) and misses (false negatives) are considered equally costly.
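As a concrete illustration, the following short Python sketch computes the metrics of Eqs. (10)-(14) from the four confusion-matrix counts; the function and variable names are illustrative only.

```python
def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute the evaluation metrics of Eqs. (10)-(14) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)          # Eq. (10)
    tp_rate = tp / (tp + fn)                             # Recall, Eq. (12)
    fp_rate = fp / (tn + fp)                             # 1 - Specificity, Eq. (11)
    precision = tp / (tp + fp)                           # Eq. (13)
    f_measure = ((1 + beta**2) * tp_rate * precision) / (beta**2 * precision + tp_rate)  # Eq. (14)
    return {"accuracy": accuracy, "recall": tp_rate, "fp_rate": fp_rate,
            "precision": precision, "f_measure": f_measure}

# Example: a classifier that labels almost everything as negative reaches high
# accuracy but a poor F-measure on an imbalanced dataset.
print(imbalance_metrics(tp=5, fn=45, fp=2, tn=948))
```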

#### **3.1 Traditional approaches**

In general, approaches for imbalanced datasets can be divided into two categories: *external* and *internal* approaches. External methods do not depend on the learning algorithm to be used: they mainly consist of a pre-processing phase aimed at balancing the classes before the classifier is trained. The different re-sampling methods fall into this category.


In contrast, internal methods develop variations of the learning algorithm in order to solve the imbalance problem.

### **3.1.1 External methods**

Re-sampling strategies have several advantages. First of all, re-sampling methods are competitive with (McCarthy, 2005), and often give results similar to (Maloof, 2003), those obtained with cost-sensitive learning. Moreover, re-sampling methods are simple and do not require modifying the internal workings of the chosen classifier (Elkan, 2001). Re-sampling methods are pre-processing techniques and can be divided into two categories: oversampling and undersampling techniques. Oversampling methods balance the classes by adding new points to the minority class, while undersampling methods remove samples from the majority class.

The simplest re-sampling techniques are random oversampling and random undersampling. Random oversampling balances the class distribution by randomly replicating some instances of the minority class, but it can lead to overfitting. Random undersampling randomly removes negative examples from the majority class, at the risk of discarding important information contained in the dataset. Both random techniques resample the dataset until the classes are approximately balanced. In order to overcome these limitations, improved re-sampling techniques have been proposed.
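A minimal sketch of the two random strategies in Python with NumPy; it assumes the minority class is labelled 1 and is the smaller class, and the names are illustrative only.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, rng=None):
    """Replicate randomly chosen minority samples until both classes have equal size."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label=1, rng=None):
    """Randomly discard majority samples until both classes have equal size."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    return X[keep], y[keep]
```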

#### **3.1.1.1 Oversampling techniques**

The Synthetic Minority Oversampling TEchnique (SMOTE) creates synthetic minority samples to oversample the minority class while avoiding the overfitting problem (Chawla et al., 2002). Instead of replicating existing data points, SMOTE generates new samples as follows: for every minority example, its *n* nearest neighbours belonging to the same class are determined (in SMOTE *n* is set to 5); new synthetic data points are then randomly generated along the segments joining the original data point and its selected neighbours. Figure 1 shows an example of the SMOTE algorithm: *x* represents the selected data point, *ni* are the selected nearest neighbours and *s1, s2* and *s3* are the samples generated by the randomized interpolation.

Fig. 1. Example of SMOTE algorithm.
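The core interpolation step of SMOTE can be sketched as follows; this is a simplified illustration rather than the reference implementation, and it assumes the minority samples are given as a NumPy array.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Generate n_synthetic points by interpolating each chosen minority sample
    towards one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                    # pick a minority sample x
        d = np.linalg.norm(X_min - X_min[i], axis=1)    # distances to all minority samples
        neighbours = np.argsort(d)[1:k + 1]             # its k nearest minority neighbours
        j = rng.choice(neighbours)                      # pick one neighbour at random
        gap = rng.random()                              # random point on the segment x -> neighbour
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```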

As SMOTE may over-generalize the minority class and does not take into account the distribution of the neighbours from the majority class, another approach, called Borderline-SMOTE (Han et al., 2005), has been proposed. This approach is a generalization of SMOTE which focuses the oversampling on the samples located on the borderline between the classes. It is based on the idea that the positive instances can be divided into three regions, *noise*, *borderline* and *safe*, by considering the number of negative examples among their *k* nearest neighbours. If *n* is the number of negative examples among the *k* nearest neighbours, the regions are defined as follows (Han et al., 2005):

- *noise*: all the nearest neighbours are negative, i.e. *n = k*;
- *borderline*: at least half (but not all) of the nearest neighbours are negative, i.e. *k/2 ≤ n < k*;
- *safe*: fewer than half of the nearest neighbours are negative, i.e. *0 ≤ n < k/2*.




Borderline-SMOTE exploits the same oversampling technique as SMOTE but it oversamples only the instances belonging to the borderline region.
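A sketch of the region-labelling step, consistent with the definitions above (the subsequent oversampling then proceeds as in SMOTE but restricted to the borderline instances); array and function names are illustrative.

```python
import numpy as np

def label_regions(X_min, X_maj, k=5):
    """Label each minority sample as 'noise', 'borderline' or 'safe' according to
    the number n of majority samples among its k nearest neighbours."""
    X_all = np.vstack([X_min, X_maj])
    is_majority = np.concatenate([np.zeros(len(X_min), bool), np.ones(len(X_maj), bool)])
    labels = []
    for x in X_min:
        d = np.linalg.norm(X_all - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]        # k nearest neighbours, excluding the sample itself
        n = int(is_majority[neighbours].sum())     # majority neighbours among the k
        if n == k:
            labels.append("noise")
        elif n >= k / 2:
            labels.append("borderline")
        else:
            labels.append("safe")
    return labels
```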

Bunkhumpornpat et al. proposed another approach called Safe-Level-SMOTE (Bunkhumpornpat et al., 2009). This method assigns to each positive data point a *safe level (sl)*, defined as the number of positive data points among its *k* nearest neighbours. If the safe level of an instance is close to *k*, the instance is considered safe. The safe level ratio is defined as the ratio between the safe level of a positive instance and the safe level of one of its nearest neighbours. Each new synthetic point is then generated inside a safe region by taking the safe level ratio of the instances into account. This method is able to outperform both SMOTE and Borderline-SMOTE, since the latter may generate instances in unsuitable positions such as overlapping or noise regions.
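As a small illustration of the quantities the method is built on, the safe level and the safe level ratio described above could be computed as follows; this is only a sketch, and the full generation rules of Safe-Level-SMOTE are not reproduced here.

```python
import numpy as np

def safe_level(i, X, y, k=5, positive=1):
    """Safe level of sample i: number of positive samples among its k nearest neighbours."""
    d = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(d)[1:k + 1]
    return int((y[neighbours] == positive).sum())

def safe_level_ratio(sl_p, sl_n):
    """Safe level ratio between a positive instance and one of its nearest neighbours."""
    return sl_p / sl_n if sl_n > 0 else float("inf")
```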

Wang et al. (Wang et al., 2006) propose a novel approach that improves the SMOTE algorithm by including the Locally Linear Embedding (LLE) algorithm (Roweis & Saul, 2000). The SMOTE approach has an important limitation: it assumes that the local space between any two positive samples is positive, i.e. belongs to the minority class, which may not be true if the training data are not linearly separable. This method maps the training data into a lower-dimensional space through the Locally Linear Embedding technique, then SMOTE is applied in order to create the desired number of synthetic data points, and finally the new data points are mapped back to the initial input space. The so-called LLE-based SMOTE algorithm is evaluated on three datasets with three different classifiers: Naive Bayes, k-NN and Support Vector Machine (SVM). Experimental results show that the LLE-based SMOTE algorithm outperforms the conventional SMOTE.

Liu and Ghosh (Liu & Ghosh, 2007) propose a novel oversampling method, called generative oversampling, which adds information to the training set by creating artificial minority-class points from a probability distribution used to model the minority class. Generative oversampling can be applied whenever the data distribution can be reasonably approximated by an existing model. Firstly, a probability distribution is chosen to model the minority class; then the parameters of the distribution are estimated from the training data; finally, synthetic data points are drawn from the learned distribution until the required number of minority-class points has been reached. The authors show that this approach works well on a range of text classification datasets using an SVM classifier. The method is simple to implement and is suitable for several data types, provided that appropriate generative models are selected.
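A minimal sketch of the idea, where a multivariate Gaussian is chosen as the generative model purely for illustration; Liu and Ghosh select a model suited to the data type at hand.

```python
import numpy as np

def generative_oversample(X_min, n_new, rng=None):
    """Fit a multivariate Gaussian to the minority class and draw n_new synthetic points."""
    rng = np.random.default_rng(rng)
    mean = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False)        # parameters estimated from the training data
    return rng.multivariate_normal(mean, cov, size=n_new)
```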



#### **3.1.1.2 Undersampling techniques**

A popular undersampling method is the Condensed Nearest Neighbour (CNN) rule (Hart, 1968). CNN is used to find a consistent subset of samples: a subset *Ŝ* is said to be consistent with *S* if, using a one-nearest-neighbour rule, *Ŝ* correctly classifies the instances in *S*. Fawcett and Provost (Fawcett & Provost, 1997) propose an algorithm to extract a subset *Ŝ* from *S* and use this approach as an undersampling method. First, one example belonging to the majority class is randomly extracted and placed, together with all the examples of the minority class, in *Ŝ*. Then a 1-NN classifier built on the examples in *Ŝ* is used to classify the examples in *S*; every example of *S* that is misclassified is moved to *Ŝ*. The main aim of this method is to discard the majority-class examples that are distant from the decision border.
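A hedged sketch of this CNN-based undersampling, with the 1-NN rule implemented naively and a single pass over the majority examples as described above; names are illustrative.

```python
import numpy as np

def cnn_undersample(X, y, minority_label=1, rng=None):
    """Build a consistent subset: all minority samples plus one random majority sample,
    then add every majority sample that the current subset misclassifies with 1-NN."""
    rng = np.random.default_rng(rng)
    min_idx = list(np.flatnonzero(y == minority_label))
    maj_idx = list(np.flatnonzero(y != minority_label))
    subset = min_idx + [maj_idx[rng.integers(len(maj_idx))]]
    for i in maj_idx:
        if i in subset:
            continue
        d = np.linalg.norm(X[subset] - X[i], axis=1)
        nearest = subset[int(np.argmin(d))]
        if y[nearest] != y[i]:       # misclassified by 1-NN on the current subset -> keep it
            subset.append(i)
    return X[subset], y[subset]
```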

Another undersampling approach is based on the so-called *Tomek links* (Tomek, 1976). It can be defined as follows: let *xi* and *xj* be two examples belonging to different classes and let *d(xi, xj)* be their distance. The pair *(xi, xj)* is called a Tomek link if there is no example *xl* such that *d(xi, xl) < d(xi, xj)* or *d(xj, xl) < d(xi, xj)*. If two examples form a Tomek link, then either one of them is noise or both are borderline. When Tomek links are used for undersampling, only the samples belonging to the majority class are removed. Kubat and Matwin (Kubat & Matwin, 1997) propose a method, called One-Sided Selection (OSS), which uses both Tomek links and CNN. Tomek links are used as an undersampling technique to remove noisy and borderline samples of the majority class; borderline samples are considered *unsafe*, since a small amount of noise can make them fall on the wrong side of the decision border. CNN is then used to delete the majority-class samples that are distant from the decision border. The remaining samples, i.e. the *safe* samples of the majority class and all the samples of the minority class, are used for learning.
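A sketch of Tomek-link detection followed by removal of the majority-class member of each link; the names are illustrative and the pairwise distances are computed in O(n²) for clarity.

```python
import numpy as np

def remove_tomek_links(X, y, minority_label=1):
    """Find Tomek links (mutual nearest neighbours from opposite classes) and drop
    the majority-class sample of each link."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nearest = d.argmin(axis=1)                    # index of each sample's nearest neighbour
    to_drop = set()
    for i, j in enumerate(nearest):
        if nearest[j] == i and y[i] != y[j]:      # mutual nearest neighbours, opposite classes
            to_drop.add(i if y[i] != minority_label else j)   # remove the majority member
    keep = [i for i in range(len(X)) if i not in to_drop]
    return X[keep], y[keep]
```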

#### **3.1.2 Internal methods**

Internal methods deal with variations of a learning algorithm aimed at making it less sensitive to class imbalance. Two methods commonly used in this area are boosting and cost-sensitive learning.

Boosting is a method used to improve the accuracy of weak classifiers. The most famous boosting algorithm is AdaBoost (Freund & Schapire, 1997). It is based on the fusion of a set of weak learners, i.e. classifiers which perform only slightly better than random guessing on the classification task. During the learning phase, weak learners are trained iteratively and added to the strong learner; the contribution of each added learner is weighted on the basis of its performance, and at the end all weighted learners contribute to classifying unlabelled samples. This approach is suitable for imbalanced datasets because the samples belonging to the minority class are more likely to be misclassified and therefore receive higher weights over the iterations. In the literature, several approaches using boosting techniques for imbalanced datasets have been proposed (Guo & Viktor, 2004; Leskovec & Shawe-Taylor, 2003) and the results confirm the effectiveness of the method.
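The re-weighting mechanism behind this behaviour can be sketched as follows: after each round the weights of misclassified samples are increased, which is what makes the (typically misclassified) minority samples progressively more influential. This is a schematic of a single AdaBoost round, not a full implementation.

```python
import numpy as np

def adaboost_round(w, y_true, y_pred):
    """One AdaBoost re-weighting step: increase the weights of misclassified samples."""
    miss = (y_true != y_pred)
    err = np.sum(w[miss]) / np.sum(w)                   # weighted error of the weak learner
    alpha = 0.5 * np.log((1 - err) / err)               # weight of the weak learner in the ensemble
    w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))   # boost misclassified, shrink correct
    return w / w.sum(), alpha
```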

Another effective approach is cost-sensitive learning. In this approach a cost is associated with misclassifying samples: the cost matrix is a numerical representation of the penalty of classifying samples of one class as another. A correct classification carries no penalty, while the cost of misclassifying minority examples is higher than the cost of misclassifying majority examples. The aim of this approach is to minimize the overall cost on the training dataset. The cost matrix can balance the dataset by assigning each class a misclassification cost inversely proportional to its frequency; alternatively, the cost matrix can be set according to application-driven criteria that take user requirements into account. The cost matrix is a general notion which can be exploited within common classifiers such as decision trees (Pazzani et al., 2004; Chawla, 2003) or neural networks (De Rouin et al., 1991).
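For example, given class-membership probabilities from any classifier and a cost matrix C, where C[i, j] is the cost of predicting class j when the true class is i, the cost-sensitive decision picks the class with minimum expected cost. A minimal sketch follows; the 10:1 costs are purely illustrative.

```python
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i
# (class 0 = majority/negative, class 1 = minority/positive)
cost = np.array([[0.0, 1.0],     # misclassifying a negative costs 1
                 [10.0, 0.0]])   # missing a positive is ten times more expensive

def cost_sensitive_predict(proba, cost):
    """Pick, for each sample, the class with minimum expected misclassification cost."""
    expected_cost = proba @ cost          # shape (n_samples, n_classes)
    return expected_cost.argmin(axis=1)

proba = np.array([[0.85, 0.15],          # with 0/1 loss both samples would be classed 0,
                  [0.60, 0.40]])         # but the high cost of missing positives flips them
print(cost_sensitive_predict(proba, cost))
```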

Soler and Prim (Soler & Prim, 2007) propose a method based on the Rectangular Basis Function (RecBF) network in order to solve the imbalance problem. RecBF networks were introduced by Berthold and Huber (Berthold & Huber, 1995) and are a particular type of Radial Basis Function (RBF) network which exploits neurons with hyper-rectangular activation functions in the hidden layer.

#### **3.2 Fuzzy based approaches**

In classification tasks with imbalanced datasets SVMs are widely used (Boser et al., 1992). The capabilities of SVMs and the effect of imbalance on them have been widely discussed in (Akbani et al., 2004; Japkowicz & Shayu, 2002). The SVM is a widely used machine learning method which has been applied to many real-world problems with satisfactory results. SVMs work effectively on balanced datasets but provide suboptimal classification models on imbalanced ones; several studies demonstrate this conclusion (Veropoulos et al., 1999; Akbani et al., 2004; Wu & Chang, 2003; Wu & Chang, 2005; Raskutti & Kowalczyk, 2004; Imam et al., 2006; Zou et al., 2008; Lin et al., 2009; Kang & Cho, 2006; Liu et al., 2006; Haibo & Garcia, 2009). The SVM is biased toward the majority class and provides poor results on the minority class.

A limitation of the SVM approach is its sensitivity to outliers and noise, since it treats all training samples uniformly. In order to overcome this problem, a Fuzzy SVM (FSVM) has been proposed (Lin & Wang, 2002), which is a variant of the traditional SVM algorithm. FSVM associates a different fuzzy membership value (weight) with each training sample in order to express its degree of importance for its class. These weights are then incorporated into the SVM learning algorithm so as to reduce the effect of outliers and noise when the separating hyperplane is determined.

An extension of this approach is due to Wang et al. (Wang et al., 2005), who introduce two membership values for each training sample, defining its membership degree to the positive and to the negative class. This idea has been taken up again by Hao et al. (Hao et al., 2007) on the basis of the notion of *vague set*.

Spyrou et al. (Spyrou et al., 2005) propose another kind of fuzzy SVM which uses a particular kernel function built from fuzzy basis functions. There are also other works which combine fuzzy theory with SVMs by assigning a membership value to the outputs of the algorithm. For example, Xie et al. (Xie et al., 2005) define a membership degree for the output class through the decision value generated by the SVM, while Inoue and Abe (Inoue & Abe, 2001) use the fuzzy output decision for multiclass classification. Finally, Mill and Inoue (Mill & Inoue, 2003) propose an approach which generates the fuzzy membership values for the output classes through the strengths of the support vectors.
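FSVM itself modifies the SVM training objective, but the underlying idea of down-weighting unreliable samples while re-balancing the classes can be loosely illustrated with per-sample weights in scikit-learn. This is only an approximation of the approaches discussed above, and the specific weighting scheme shown here is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def weighted_svm(X, y, minority_label=1):
    """Train an SVM with per-sample weights: minority samples are up-weighted and
    samples far from their class centroid (possible outliers) are down-weighted."""
    weights = np.ones(len(y))
    weights[y == minority_label] = (y != minority_label).sum() / (y == minority_label).sum()
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        d = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
        weights[idx] *= 1.0 / (1.0 + d / (d.mean() + 1e-12))   # fuzzy-membership-like decay
    clf = SVC(kernel="rbf")
    clf.fit(X, y, sample_weight=weights)
    return clf
```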
