**4. Methodology**

The methodology used in this work, shown in **Figure 1**, involves first preprocessing the KDD dataset, then using the prepared dataset in a fair environment with equal access to resources, and finally comparing classifier performance across all analyzed attacks (DoS, R2L, U2R, and PROBE) and their associated faults. A machine learning model needs a large dataset to avoid the problem of overfitting. The proposed optimal feature subset selection algorithm includes feature normalization, feature scoring, feature subset selection, and feature subset elimination.

Data preprocessing is the most time-consuming task, but it plays a significant role in building a machine learning model: raw data cannot be used directly for training or testing, so preprocessing is required. Encoding is the process of converting categorical data into numerical values. Categorical values are string values stored as components of the input features; features/attributes whose values are strings or categories are termed categorical attributes. Categorical values take two forms, namely Nominal and Ordinal. When there is no ordering between the values of an attribute, it is referred to as a Nominal attribute; when there is an ordering, it is referred to as an Ordinal attribute. The KDD Cup 99 dataset contains 125,973 training records and 22,544 test records, which helps in building and testing an efficient machine learning model. It also contains different types of attacks, such as Neptune, pod, and smurf, which can be further categorized as DoS, Probe, U2R, and R2L attacks [16], as shown in **Table 1**.

The NSL-KDD Cup 99 dataset has 41 input features. Among them, the protocol feature contains values such as tcp, udp, and icmp; the service feature contains http, ftp, telnet, etc.; and the flag feature contains SF, REJ, etc. These three columns contain symbolic data, which cannot be used directly because the classifiers considered here accept only numerical values. Hence, the One-Hot Encoding technique is used: the number of encoded columns is exactly equal to the number of distinct values a particular feature takes, and each encoded row contains the value 1 exactly once, in the column of the category it represents.

Feature scaling is the process of transforming data values into the range 0 to 1. For example, if we consider two weights whose values are 80 and 40 respectively, by

**Figure 1.** *Block diagram of the proposed method.*


**Table 1.** *Types of attacks and faults in the data.*

feature scaling these can be represented as 1 and 0 where 0 is the lowest possible score/weight and 1 is the highest possible score/weight.
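The weights example can be reproduced with a small min-max scaling sketch, assuming the usual formula (x − min)/(max − min):

```python
def min_max_scale(values):
    """Linearly rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([80, 40]))      # → [1.0, 0.0]
print(min_max_scale([40, 60, 80]))  # → [0.0, 0.5, 1.0]
```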

Feature Selection: A machine learning model needs to be trained on a huge amount of data for accurate results, but some features do not contain useful information, and classification can be done without considering them. Feature selection is the process of selecting only the few features that contain useful information, so that the machine learning model is rid of the noisy data. In supervised learning, all the independent variables are known as features of the data. Recursive feature elimination (RFE) is used here for feature selection: it repeatedly eliminates features, searching for a subset by starting with all features in the training dataset and successively removing features until the desired number remains. The KDD Cup 99 dataset has 41 input features; among them, 23 features are selected for the model using the recursive feature elimination technique (**Table 2**).

Classifiers that are used to classify the malicious and the normal data are:

1.Random Forest Algorithm [17]: Decision trees are very sensitive by nature, necessitating the use of the Random Forest algorithm. The decision tree's entire structure may change if the training dataset differs even slightly; because of this, it is extremely sensitive and its outcome is highly variable. Decision trees are binary trees that recursively split the dataset until we are left with pure leaf nodes. Bootstrapping is the process of building a new dataset from an existing one by sampling with replacement. The bootstrapped datasets are used to train the decision trees, and their predictions are aggregated by the Random Forest classifier.

*Efficient Machine Learning Classifier for Fault Detection in Wireless Sensor Networks DOI: http://dx.doi.org/10.5772/intechopen.111462*
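The bootstrap-and-aggregate idea behind the Random Forest can be sketched as follows. This is a toy illustration, not the actual classifier implementation; `bagged_predict` takes already-trained trees as plain callables and combines them by majority vote:

```python
import random

def bootstrap_sample(dataset, rng):
    """Build a new dataset from an existing one by sampling
    with replacement, keeping the original size."""
    return [rng.choice(dataset) for _ in dataset]

def bagged_predict(trees, x):
    """Aggregate the trees' predictions by majority vote."""
    votes = [tree(x) for tree in trees]
    return max(set(votes), key=votes.count)

rng = random.Random(0)
sample = bootstrap_sample([1, 2, 3, 4], rng)   # some rows repeat, some drop out
trees = [lambda x: 'attack', lambda x: 'normal', lambda x: 'attack']
print(bagged_predict(trees, None))             # → 'attack'
```

Because each tree sees a different bootstrapped sample, the aggregated vote is far less sensitive to small changes in the training data than any single tree.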


Performance Evaluation Measures: In this section, we provide a detailed evaluation of the machine learning techniques using various performance measures to detect network faults caused by intrusions.

## **4.1 Confusion matrix**

Confusion matrices are a widely used measurement when attempting to solve classification problems, both binary and multi-class. A confusion matrix contains four kinds of counts: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). True Positives are the positive samples correctly identified by the algorithm, and True Negatives are the negative samples correctly identified by it; False Positives and False Negatives are the samples wrongly identified by the algorithm (negatives predicted as positive and positives predicted as negative, respectively). In machine learning, the confusion matrix is used to obtain the precision, accuracy, and related measures of a classifier from these counts. After the classifiers are trained, the performance of all 4 classifiers is measured in terms of these metrics using the test dataset. Based on the confusion matrix, Accuracy, Precision, Recall, F-measure, Specificity, Sensitivity, and G-mean are calculated as mentioned below:

1.**Accuracy**: Accuracy of a classifier is the ratio of the correctly classified samples (true positives and true negatives) to all the samples in the confusion matrix.

$$\mathbf{Accuracy} = (\mathbf{TP} + \mathbf{TN})/(\mathbf{TP} + \mathbf{TN} + \mathbf{FP} + \mathbf{FN}) \tag{1}$$

2.**Precision**: Precision is determined by dividing the number of true positive predictions by the total number of positive predictions.

$$\text{Precision} = (\text{TP})/(\text{TP} + \text{FP}) \tag{2}$$

3.**Recall**: It is obtained by dividing the number of true positive predictions by the total number of actual positive samples.

$$\text{Recall} = (\text{TP})/(\text{TP} + \text{FN}) \tag{3}$$

4.**F-measure**: The F1 score is defined as the harmonic mean of precision and recall.

$$\text{F-measure} = 2 \times (\text{Precision} \times \text{Recall})/(\text{Precision} + \text{Recall}) \tag{4}$$

5.**G-mean**: The geometric mean is the square root of the product of the true positive rate and the true negative rate.

$$\text{G-mean} = \sqrt{\text{TPR} \times \text{TNR}} \tag{5}$$

6.**Sensitivity**: Sensitivity is the ratio of the number of samples correctly tested as positive to the total number of positive samples in the test.

$$\text{Sensitivity} = (\text{TP})/(\text{TP} + \text{FN})\tag{6}$$

7.**Specificity**: Specificity is the ratio of the number of samples correctly tested as negative to the total number of negative samples in the test.

$$\text{Specificity} = (\text{TN})/(\text{FP} + \text{TN}).\tag{7}$$
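Equations (1)–(7) can be checked numerically with a small helper. The `classification_metrics` function below is an illustrative sketch (the counts are toy values), with each metric computed exactly as defined above:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics of Eqs. (1)-(7) from confusion-matrix counts."""
    precision   = tp / (tp + fp)          # Eq. (2)
    recall      = tp / (tp + fn)          # Eq. (3); also sensitivity/TPR, Eq. (6)
    specificity = tn / (fp + tn)          # Eq. (7); also TNR
    return {
        'accuracy':    (tp + tn) / (tp + tn + fp + fn),            # Eq. (1)
        'precision':   precision,
        'recall':      recall,
        'f_measure':   2 * precision * recall / (precision + recall),  # Eq. (4)
        'g_mean':      math.sqrt(recall * specificity),            # Eq. (5)
        'sensitivity': recall,
        'specificity': specificity,
    }

m = classification_metrics(tp=50, tn=30, fp=10, fn=10)
# accuracy → 0.8, specificity → 0.75
```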
