**5.7 Clustering-based anomaly detection**

Supervised learning methods for the detection of anomalous activities in a network require prior labeling of the traffic types. However, it is very difficult to have prior labeling of audit data in real-world network environments. Signature-based detection suffers from this problem as carrying out a manual classification in a huge volume of network traffic to identify a small number of attack traffic records poses a significant challenge. Unsupervised learning-based anomaly detection methods do suffer from this drawback as these methods can work on unlabeled network traffic data. These methods attempt to detect malicious traffic in a network even without any prior knowledge about the traffic data labels. Unsupervised learningbased anomaly detection methods work under the following premise: in a network, characteristics of traffic are highly imbalanced—normal traffic constitutes a vast majority, while anomalous traffic represents a tiny minority. Moreover, attack traffic and the normal traffic exhibit similar statistical distributions in their respective group, while the distributions of the two groups are different from each other. Learning from an imbalanced data so that the anomalous and normal traffic can be categorized into two different clusters is the prime focus in unsupervised anomaly detection methods. Hence, cluster-based anomaly and outlier detection is the most fundamental approach in an unsupervised intrusion detection method. Portnoy et al. proposed a clustering-based anomaly detection method using DARPA knowledge discovery in databases (KDD) Cup 1999 dataset [32]. DARPA KDD Cup 1999 dataset consisted of a network traffic record of 4,900,000 data points. The dataset contained 25 different types of traffic—24 attack types and 1 normal traffic. Each data point represented a set of extracted feature values from a connection record obtained between different IP addresses of hosts during a period of time in which attacks were simulated in a network. The authors observed that clustering with unlabeled data resulted in a lower detection rate of attacks than attack classification using a supervised learning method. However, unsupervised detection methods on unlabeled data can potentially detect unknown attacks through an automated or semi-automated process that cannot be done using supervised detection methods.

### **5.8 Random forests**

Random forests are powerful machine learning models based on ensemble approach. They build multiple decision trees by randomly choosing a subset of features and then combine those decision tree results to arrive at a much more robust prediction. Due to their higher accuracy of prediction, random forests have been deployed in a variety of applications including multimedia information retrieval,

**169**

**Table 2.**

*Anomaly-based detection schemes.*

*Machine Learning Applications in Misuse and Anomaly Detection*

**Detection mechanism Input data format Detection** 

Statistical methods Frequency of system calls, offline Host [36, 37] Statistical methods TCP/IP packets, online Network [30, 38, 40] Clustering algorithms Frequency of system calls, online Network [25, 32, 33] Information theoretic TCP/IP packets, offline Network [37, 42, 45] Association rules TCP/IP packets, offline Host [2, 15–17] Fuzzy association rules [18] Kalman filter TCP/IP packets, online Network [30] Hidden Markov model Frequency of system calls, offline Host [25–27] Artificial neural network Executables, offline Host [12, 20, 21] Principal component analysis Frequency of system calls, offline Network [42–44] Support vector machine TCP/IP packets, offline Network [22, 23] K-nearest neighbors Frequency of system calls, offline Host [24] Random forests TCP/IP packets, offline Network [33, 34]

network security and intrusion detection systems design. The algorithms used in random forests usually yield higher accuracy, and they work very efficiently on large datasets with high-dimensional feature space. Traffic in a large network is an example of a large volume of high-dimensional data, and such data can be very effectively classified in real-time by random forest-based classification approach. The use of random forest algorithms for detecting outliers in datasets containing network traffic

without attack-free training data has been proposed in the literature [33, 34].

Other machine learning methods have been proposed for learning the probability distribution of data and in applying statistical tests to detect outliers. Eskin proposed a mixture probability model on normal and anomalous data based on expectation maximization (EM) algorithms [35]. Other statistical machine learning methods have been investigated in anomaly detection applications, such as mean

score [38], histogram density [39], Bayesian law [40], cumulative summation (CUMSUM) and statistical test [30]. Ye et al. used a series of probability techniques

tivariate test, and Markov chain in an information system for detecting intrusions [41]. Network-wide anomaly detection using principal component analysis (PCA) has been proved very effective [42–44]. Several studies have also found that a wide range of anomalies in networks can be detected by computing the entropy in the network flow and feature distributions [37, 42, 45]. **Table 2** presents a summary of

Since misuse detection systems work on matching already known attack signatures with the current events in a network, they usually have high detection rates

test and the Chi-square test [36, 37], Hellinger

**level**

**References**

test, Chi-square mul-

**5.9 Other machine learning methods in anomaly detection**

of anomaly detection, including decision tree, Hotelling's T<sup>2</sup>

and variance [4, 30], Hotelling's T<sup>2</sup>

the anomaly-based detection schemes.

**6. Machine learning in hybrid detection**

*DOI: http://dx.doi.org/10.5772/intechopen.92653*

*Machine Learning Applications in Misuse and Anomaly Detection DOI: http://dx.doi.org/10.5772/intechopen.92653*


### **Table 2.**

*Security and Privacy From a Legal, Ethical, and Technical Perspective*

**5.7 Clustering-based anomaly detection**

HMM could detect anomalous traffic efficiently and effectively with a low value of mismatch rate. In general, training of an HMM is a very time-consuming process as it requires multiple epochs (i.e., passes) through the records in a training dataset. Since all the transition probabilities corresponding to long sequences of state transitions are needed to be stored, training an HMM is a memory-intensive operation as well. Soule et al. presented an anomaly detection method in a large-scale data network [30]. The detection scheme analyzed the traffic patterns in a network, and computed the state the network using a Kalman filter. A Kalman filter is a set of mathematical equations that implements a predictor-corrector type estimation that is optimal [31]. The optimality here refers minimization of error covariance. The Kalman filter used in the anomaly detection filtered out the normal traffic state by comparing the predictions made by the current traffic state to an inference of the actual traffic state. The residual process is then analyzed for possible anomalies.

Supervised learning methods for the detection of anomalous activities in a network require prior labeling of the traffic types. However, it is very difficult to have prior labeling of audit data in real-world network environments. Signature-based detection suffers from this problem as carrying out a manual classification in a huge volume of network traffic to identify a small number of attack traffic records poses a significant challenge. Unsupervised learning-based anomaly detection methods do suffer from this drawback as these methods can work on unlabeled network traffic data. These methods attempt to detect malicious traffic in a network even without any prior knowledge about the traffic data labels. Unsupervised learningbased anomaly detection methods work under the following premise: in a network, characteristics of traffic are highly imbalanced—normal traffic constitutes a vast majority, while anomalous traffic represents a tiny minority. Moreover, attack traffic and the normal traffic exhibit similar statistical distributions in their respective group, while the distributions of the two groups are different from each other. Learning from an imbalanced data so that the anomalous and normal traffic can be categorized into two different clusters is the prime focus in unsupervised anomaly detection methods. Hence, cluster-based anomaly and outlier detection is the most fundamental approach in an unsupervised intrusion detection method. Portnoy et al. proposed a clustering-based anomaly detection method using DARPA knowledge discovery in databases (KDD) Cup 1999 dataset [32]. DARPA KDD Cup 1999 dataset consisted of a network traffic record of 4,900,000 data points. The dataset contained 25 different types of traffic—24 attack types and 1 normal traffic. Each data point represented a set of extracted feature values from a connection record obtained between different IP addresses of hosts during a period of time in which attacks were simulated in a network. The authors observed that clustering with unlabeled data resulted in a lower detection rate of attacks than attack classification using a supervised learning method. However, unsupervised detection methods on unlabeled data can potentially detect unknown attacks through an automated or semi-automated process that cannot be done using supervised detection methods.

Random forests are powerful machine learning models based on ensemble approach. They build multiple decision trees by randomly choosing a subset of features and then combine those decision tree results to arrive at a much more robust prediction. Due to their higher accuracy of prediction, random forests have been deployed in a variety of applications including multimedia information retrieval,

**168**

**5.8 Random forests**

*Anomaly-based detection schemes.*

network security and intrusion detection systems design. The algorithms used in random forests usually yield higher accuracy, and they work very efficiently on large datasets with high-dimensional feature space. Traffic in a large network is an example of a large volume of high-dimensional data, and such data can be very effectively classified in real-time by random forest-based classification approach. The use of random forest algorithms for detecting outliers in datasets containing network traffic without attack-free training data has been proposed in the literature [33, 34].

### **5.9 Other machine learning methods in anomaly detection**

Other machine learning methods have been proposed for learning the probability distribution of data and in applying statistical tests to detect outliers. Eskin proposed a mixture probability model on normal and anomalous data based on expectation maximization (EM) algorithms [35]. Other statistical machine learning methods have been investigated in anomaly detection applications, such as mean and variance [4, 30], Hotelling's T<sup>2</sup> test and the Chi-square test [36, 37], Hellinger score [38], histogram density [39], Bayesian law [40], cumulative summation (CUMSUM) and statistical test [30]. Ye et al. used a series of probability techniques of anomaly detection, including decision tree, Hotelling's T<sup>2</sup> test, Chi-square multivariate test, and Markov chain in an information system for detecting intrusions [41]. Network-wide anomaly detection using principal component analysis (PCA) has been proved very effective [42–44]. Several studies have also found that a wide range of anomalies in networks can be detected by computing the entropy in the network flow and feature distributions [37, 42, 45]. **Table 2** presents a summary of the anomaly-based detection schemes.

### **6. Machine learning in hybrid detection**

Since misuse detection systems work on matching already known attack signatures with the current events in a network, they usually have high detection rates

and low false alarm rates. However, these systems cannot detect novel attacks. On the other hand, anomaly detection systems define normal sates in a network and then detect system states that significantly differ from the normal states. Any state that significantly differs from the normal state of the network indicates the possible event of an attack. The anomaly detection system can detect new attacks launched on a network. There is a challenge in the anomaly detection system design. If the normal state patterns do not significantly differ from patterns exhibited by any anomalous state, the attack state will go undetected. This leads to an increase in the false alarm rate. Hence, it is critical to design a normal state in such a way that while the detection rate is maximized, the number of false alarms does not exceed beyond an acceptable limit. If the normal state is too wide, then the detection rate will suffer. On the other hand, too narrow a normal state will lead to a high false alarm rate. The hybrid detection approach combines the adaptability and the powerful detection ability of an anomaly detection system with the higher accuracy and reliability of the misuse detection approach.

Designing an efficient and accurate hybrid detection system involves two critical issues: (i) the most ideal misuse or anomaly detection systems are to be first identified that can be integrated with anomaly detection systems, so that hybrid detection is possible and (ii) the two systems are to be integrated in the most optimal way so that the balance between the detection rate and false alarm rate is achieved while retaining the ability of detecting novel attacks.

The selection of misuse and anomaly detection systems for designing a hybrid detection system is dependent on the application in which the detection system is to be deployed. Following a combinational approach, the integration of an anomaly detection system with a misuse detection counterpart has been classified into four categories [46, 47]. These types are: (i) anomaly-misuse sequence detection, (ii) misuse-anomaly sequence detection, (iii) parallel detection and (iv) complex mixture detection (**Figure 4**). The complex mixture model is highly application-specific.

Barbara et al. presented a hybrid detection system on the principle of the anomaly-misuse sequence [48]. The proposed system, which is known as audit data analysis and mining (ADAM), minimizes false alarm rates by not raising any alarm for those patterns that are not classified attacks by the misuse detection system.

### **Figure 4.**

*Three categories of hybrid detection systems. (a) Anomaly-misuse sequence, (b) misuse-anomaly sequence, (c) parallel detection system (adapted from [43]).*

**171**

**7. Conclusion**

**Table 3.**

*Machine Learning Applications in Misuse and Anomaly Detection*

**Detection mechanism Input data format Detection** 

Random forests TCP/IP packets, online Network [46, 47, 49] Association rules TCP/IP packets, online Network [48] Association rules Frequency of system calls, online Host [2] Cooperating agents TCP/IP packets, online Network [52–56] Correlation TCP/IP packets, online Network [50] Clustering TCP/IP packets, offline Network [51] Statistical analysis and ANN Sequences of system calls, offline Host [55] ANN TCP/IP packets, online Network [12]

**level**

**References**

hybrid detection systems discussed in this section.

and deployment in real-world networks.

Misuse-anomaly sequence detection systems primarily focus on detecting novel attacks that are missed by the misuse detection module. The machine learning algorithms used by these hybrid detection models are mainly based on different variants of random forests [47, 49]. Anderson et al. proposed the design of a parallel intrusion detection system that provided a very accurate and robust detection decision by correlating the outputs of the misuse detection and the anomaly detection modules [50]. Agrawal et al. proposed an illustrative complex intrusion detection system [51]. The system worked on the AdaBoost algorithm of classification and both the misuse and the anomaly detection systems are trained on the training data simultaneously. The detection results on the test data are also presented separately for the misuse detection module and the anomaly detection module. Sen et al. proposed various architecture of complex detection systems based on cooperating agents [52–54]. The audit trails in the basic security module (BSM) of a Solaris system were exploited by Endler in designing a hybrid detection system [55]. An ANN-based hybrid detection system for detecting both signature-based and anomaly-based attacks is proposed by Ghosh and Schwartzbard [12]. Lee et al. presented a data mining-based hybrid intrusion detection system for identifying attack traffic from the audit data in a host [2]. **Table 3** presents a summary of the

In this chapter, we have discussed various approaches to misuse and anomaly detection systems design using machine learning and data mining techniques. Some of the well-known systems in the literature have also been reviewed briefly. We have also discussed the pros and cons of various systems in context to their applications

A fundamental challenge in designing an intrusion detection system is the limited availability of appropriate data for model building and testing. Generating data for intrusion detection is an extremely painstaking and complex task that mandates the generation of normal system data as well as anomalous and attack data. If a real-world network environment, generating normal traffic data is not a problem. However, the data may too privacy-sensitive to be made available for public research. Classification-based methods require training data to be well balanced with normal traffic data and attack traffic data. Although it is desirable to have a good mix of a large variety of attack traffic data (including some novel attacks), it may

*DOI: http://dx.doi.org/10.5772/intechopen.92653*

*Hybrid intrusion schemes based on machine learning.*

*Machine Learning Applications in Misuse and Anomaly Detection DOI: http://dx.doi.org/10.5772/intechopen.92653*


**Table 3.**

*Security and Privacy From a Legal, Ethical, and Technical Perspective*

of the misuse detection approach.

retaining the ability of detecting novel attacks.

and low false alarm rates. However, these systems cannot detect novel attacks. On the other hand, anomaly detection systems define normal sates in a network and then detect system states that significantly differ from the normal states. Any state that significantly differs from the normal state of the network indicates the possible event of an attack. The anomaly detection system can detect new attacks launched on a network. There is a challenge in the anomaly detection system design. If the normal state patterns do not significantly differ from patterns exhibited by any anomalous state, the attack state will go undetected. This leads to an increase in the false alarm rate. Hence, it is critical to design a normal state in such a way that while the detection rate is maximized, the number of false alarms does not exceed beyond an acceptable limit. If the normal state is too wide, then the detection rate will suffer. On the other hand, too narrow a normal state will lead to a high false alarm rate. The hybrid detection approach combines the adaptability and the powerful detection ability of an anomaly detection system with the higher accuracy and reliability

Designing an efficient and accurate hybrid detection system involves two critical issues: (i) the most ideal misuse or anomaly detection systems are to be first identified that can be integrated with anomaly detection systems, so that hybrid detection is possible and (ii) the two systems are to be integrated in the most optimal way so that the balance between the detection rate and false alarm rate is achieved while

The selection of misuse and anomaly detection systems for designing a hybrid detection system is dependent on the application in which the detection system is to be deployed. Following a combinational approach, the integration of an anomaly detection system with a misuse detection counterpart has been classified into four categories [46, 47]. These types are: (i) anomaly-misuse sequence detection, (ii) misuse-anomaly sequence detection, (iii) parallel detection and (iv) complex mixture detection (**Figure 4**). The complex mixture model is highly application-specific. Barbara et al. presented a hybrid detection system on the principle of the anomaly-misuse sequence [48]. The proposed system, which is known as audit data analysis and mining (ADAM), minimizes false alarm rates by not raising any alarm for those patterns that are not classified attacks by the misuse detection system.

**170**

**Figure 4.**

*(c) parallel detection system (adapted from [43]).*

*Three categories of hybrid detection systems. (a) Anomaly-misuse sequence, (b) misuse-anomaly sequence,* 

*Hybrid intrusion schemes based on machine learning.*

Misuse-anomaly sequence detection systems primarily focus on detecting novel attacks that are missed by the misuse detection module. The machine learning algorithms used by these hybrid detection models are mainly based on different variants of random forests [47, 49]. Anderson et al. proposed the design of a parallel intrusion detection system that provided a very accurate and robust detection decision by correlating the outputs of the misuse detection and the anomaly detection modules [50]. Agrawal et al. proposed an illustrative complex intrusion detection system [51]. The system worked on the AdaBoost algorithm of classification and both the misuse and the anomaly detection systems are trained on the training data simultaneously. The detection results on the test data are also presented separately for the misuse detection module and the anomaly detection module. Sen et al. proposed various architecture of complex detection systems based on cooperating agents [52–54]. The audit trails in the basic security module (BSM) of a Solaris system were exploited by Endler in designing a hybrid detection system [55]. An ANN-based hybrid detection system for detecting both signature-based and anomaly-based attacks is proposed by Ghosh and Schwartzbard [12]. Lee et al. presented a data mining-based hybrid intrusion detection system for identifying attack traffic from the audit data in a host [2]. **Table 3** presents a summary of the hybrid detection systems discussed in this section.

## **7. Conclusion**

In this chapter, we have discussed various approaches to misuse and anomaly detection systems design using machine learning and data mining techniques. Some of the well-known systems in the literature have also been reviewed briefly. We have also discussed the pros and cons of various systems in context to their applications and deployment in real-world networks.

A fundamental challenge in designing an intrusion detection system is the limited availability of appropriate data for model building and testing. Generating data for intrusion detection is an extremely painstaking and complex task that mandates the generation of normal system data as well as anomalous and attack data. If a real-world network environment, generating normal traffic data is not a problem. However, the data may too privacy-sensitive to be made available for public research.

Classification-based methods require training data to be well balanced with normal traffic data and attack traffic data. Although it is desirable to have a good mix of a large variety of attack traffic data (including some novel attacks), it may not be feasible in practice. Moreover, the labeling of data is mandatory with attack and normal traffic data clearly distinguished by their respective labels.

Unlike classification-based approaches, which are mostly used in misuse detection, unsupervised anomaly detection-based approaches do not require any prior labeling of the training data. In most of the cases, the attack traffic constitutes the sparse class, and hence, the smaller clusters are most likely to correspond to the attack traffic data. Although unsupervised anomaly detection is a very interesting approach, the results produced by this method are unacceptably low in terms of their detection accuracies.

In a pure anomaly detection approach, the training data are assumed to be consisting of only normal traffic. By training the detection model only on the normal traffic data, the detection accuracy of the system can be significantly improved. Anomalous states are indicated by only a significant state change from the normal sate of the system.

In a real-world network that is connected to the Internet, an assumption of attack-free traffic is utopian. A pure anomaly detection system can still be trained on training data that include attack traffic. In that case, those attack traffic data will be considered as normal traffic and the detection system will not raise an alert when such traffic is encountered in real-world operations. Hence, in order to increase the detection accuracy, attack traffic should be removed from the training data as much as possible. The removal of attack traffic from the training data can be done using updated misuse detection systems or by deploying multiple anomaly detection systems and combining their results by a voting mechanism.

For an intrusion detection system that is deployed in a real-world network, it is mandatory to have a real-time detection capability under a high-speed, highvolume data environment. However, most of the cluster techniques used in unsupervised detection require quadratic time. This renders their deployment infeasible in practical applications. Moreover, the cluster algorithms are not scalable, and they need the entire training data to reside in the memory during the training process. This requirement puts a restriction on the model size. The future direction of research may include studies on the scalability and performance of anomaly detection algorithms in conjunction with the detection rate and false positive rate. Most of the currently existing propositions on intrusion detection have not paid adequate attention to these critical issues.
