**5. Machine learning in anomaly detection**

**Figure 3** depicts the schematic diagram of a typical anomaly detection system. Anomaly detection systems broadly work in five steps: (i) data collection, (ii) data preprocessing, (iii) normal behavior learning, (iv) identification of misbehavior using dissimilarity detection techniques, and (v) security responses. In a large-scale network, the data collection phase gathers a large volume of data from the network. The data preprocessing phase reduces this volume through feature selection, feature extraction, and finally dimensionality reduction.
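
As a rough illustration of how the preprocessing step can shrink the data, the sketch below performs one very simple form of feature selection: it drops columns whose values never vary, since such features cannot help separate normal from anomalous traffic. The record layout, the function names `select_features` and `project`, and the variance threshold are assumptions made for this example only; practical systems typically rely on more elaborate selection and extraction methods.

```python
from statistics import pvariance


def select_features(records, min_variance=1e-9):
    """Return the indices of columns whose population variance exceeds min_variance."""
    columns = list(zip(*records))  # transpose rows into columns
    return [i for i, col in enumerate(columns) if pvariance(col) > min_variance]


def project(records, keep):
    """Keep only the selected columns of every record."""
    return [[row[i] for i in keep] for row in records]


# Hypothetical flow records: [bytes, packets, protocol_flag, duration_s]
records = [
    [500.0, 4.0, 1.0, 0.2],
    [620.0, 5.0, 1.0, 0.3],
    [480.0, 4.0, 1.0, 0.1],
]

keep = select_features(records)   # the constant protocol_flag column is dropped
reduced = project(records, keep)  # each record now has three features instead of four
```

Here the data volume falls by a quarter because one of the four features carries no information; feature extraction and dimensionality reduction would then operate on the remaining columns.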

**Figure 3.** *Sequence of execution of modules in an anomaly detection system.*

Machine learning algorithms can be very effective in building normal profiles and, in turn, in designing intrusion detection systems based on the anomaly detection approach. In this approach, network traffic data belonging to the normal class are usually available for training the model, whereas in most applications labeled data for anomalous traffic are not. As we have already seen, supervised machine learning algorithms need labeled network data for both types of traffic, normal and attack. However, in most real-world situations, such prelabeled training data for both classes are very difficult to obtain. Moreover, network traffic data are highly imbalanced: a large majority of normal traffic records is mixed with a tiny minority of attack traffic records. To make the challenge even bigger, patterns of normal traffic also change substantially as the network environment changes. The resulting difference between the characteristics of the training and test datasets most often leads to high false positive rates (FPRs) for supervised intrusion detection systems (IDSs).

Unsupervised learning methods, as adopted by anomaly detection systems, can potentially mitigate this problem by building a normal profile of network traffic and thereby defining a normal state of the system. Any deviation from the normal state indicates the presence of anomalous activity in the network. Hence, semi-supervised and unsupervised machine learning methods are frequently deployed in real-world security applications [14].
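
The idea of learning a normal profile and flagging deviations from it can be sketched in a few lines. The example below is a minimal illustration, not a production detector: it assumes each record is a list of numeric features, models the normal state as a per-feature mean and standard deviation learned from normal traffic only, and treats a record as anomalous when any feature lies more than a chosen number of standard deviations from its mean. The function names, the sample values, and the threshold of 3.0 are all assumptions introduced for this sketch.

```python
from statistics import mean, pstdev


def fit_normal_profile(normal_records):
    """Learn per-feature mean and standard deviation from normal traffic only."""
    columns = list(zip(*normal_records))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant features so scoring never divides by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def anomaly_score(record, profile):
    """Largest absolute z-score of the record across all features."""
    means, stds = profile
    return max(abs(x - m) / s for x, m, s in zip(record, means, stds))


def is_anomalous(record, profile, threshold=3.0):
    """Flag a record whose deviation from the normal profile exceeds the threshold."""
    return anomaly_score(record, profile) > threshold


def false_positive_rate(normal_records, profile, threshold=3.0):
    """Fraction of known-normal records that the detector wrongly flags."""
    flagged = sum(is_anomalous(r, profile, threshold) for r in normal_records)
    return flagged / len(normal_records)


# Hypothetical normal training traffic: [bytes per flow, packets per flow]
normal_train = [[500.0, 4.0], [520.0, 5.0], [480.0, 4.0], [510.0, 5.0], [490.0, 4.0]]
profile = fit_normal_profile(normal_train)

# Hypothetical held-out normal traffic and one suspicious flow
held_out = [[495.0, 4.0], [515.0, 5.0]]
suspicious = [5000.0, 60.0]

fpr = false_positive_rate(held_out, profile)
suspicious_flagged = is_anomalous(suspicious, profile)
```

No attack labels are needed to fit the profile, which is the practical appeal of this approach. The threshold controls the trade-off between missed attacks and false positives, and the profile would need to be refitted whenever normal traffic patterns change.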
