## *4.1.1 CTU-13 data set*

CTU-13 consists of thirteen scenarios, each running a specific botnet in a real network environment. Each scenario includes a botnet pcap file, a labeled NetFlow file, a README file with the capture timeline, and the malware file used in the run. The NetFlow (network flow) file is based on bidirectional flows that describe the communication between a source (a client) and a destination (a server). This dataset includes three types of traffic with different distributions: normal, botnet (or malware), and background.


The following **Figure 1** presents the distribution of the experimental data in the CTU-13 data set.

## *4.1.2 UNSW-NB15 data set*

UNSW-NB15 is a dataset that was created in an Australian Cyber Range Lab using the IXIA PerfectStorm tool to generate a hybrid of realistic modern normal activities and contemporary synthetic attack behaviors from network traffic. The dataset contains 49 features that are categorized into five groups, which are explained in [17, 18].

The following **Table 1** presents the attack types, which are classified into nine groups.

**Figure 1.** *Attack categories in CTU-13.*


**Table 1.** *UNSW-NB15 attack types.*

## **4.2 Apache Spark**

Apache Spark is a powerful, hybrid, scalable, and fast distributed data processing engine and one of the most active open-source projects in big data. It was developed at UC Berkeley in 2009 and became one of the top Apache projects in 2010 [19]. Spark provides APIs in the Scala, Java, Python, and R languages. To get a good hold on huge data, the engine must be fast enough to process massive data at once; Spark is therefore deployed on clusters of several machines rather than on a single machine. The results of the processing performed by Spark are not written to disk but kept in memory. This all-in-memory capability is a high-performance computing technique for advanced analytics, making Spark up to 100 times faster than Hadoop (**Figure 2**) [20].

Spark also has an ecosystem of libraries that can be used for machine learning and interactive queries, which can have important implications for productivity. The project has been progressively enriched to provide the complete ecosystem shown in **Figure 3**.

#### **4.3 Microsoft Azure**

Microsoft Azure, formerly known as Windows Azure, is a cloud computing platform for building, deploying, and managing services and applications anywhere, with the help of a global network of managed data centers located in 54 regions around the world [21]. Microsoft's HDInsight is a managed Hadoop service in the Azure cloud that uses the Hortonworks Data Platform (HDP). HDInsight clusters can easily be customized by adding extra packages and can scale up in case of high demand by allocating more processing power [22]. With Azure Active Directory, the data is protected and persists even after the cluster is deleted.


**Figure 2.** *Speed comparison chart between Spark and Hadoop.*

**Figure 3.** *Apache Spark ecosystem.*

#### **4.4 Fuzzy C-means clustering (FCM)**

The FCM algorithm is one of the most widely used fuzzy clustering algorithms [23]. It attempts to partition a finite collection of elements into a collection of C fuzzy clusters with respect to some given criterion, and it is based on the minimization of the following objective function:

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| \mathbf{x}_{i} - \mathbf{c}_{j} \right\|^{2}, \quad 1 \le m < \infty \tag{1}$$

where N is the number of data records, C is the number of clusters, m is any real number greater than 1 (the fuzziness exponent), uij is the degree of membership of xi in cluster j, xi is the i-th data record, and cj is the center of cluster j.
**Step 1:** Initialize the membership matrix U = [uij] as U(0).

**Step 2:** At step k, calculate the center vectors C(k) = [cj] with U(k) [24]:

$$c_{j} = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, \mathbf{x}_{i}}{\sum_{i=1}^{N} u_{ij}^{m}} \tag{2}$$

**Figure 4.** *Pseudo code of the FCM algorithm.*

**Step 3:** Update U(k) to U(k + 1):

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left(\frac{\left\| \mathbf{x}_{i} - \mathbf{c}_{j} \right\|}{\left\| \mathbf{x}_{i} - \mathbf{c}_{k} \right\|}\right)^{\frac{2}{m-1}}} \tag{3}$$

**Step 4:** If ||U(k + 1) - U(k)|| < ε, then STOP; otherwise return to Step 2.

**Step 5:** The fuzzy partitioning [25] is carried out through an iterative optimization of the objective function in Eq. (1), updating the cluster centers cj and the memberships uij with Eqs. (2) and (3).

**Step 6:** This iteration will stop when:

$$\max_{ij} \left\| u_{ij}^{(k+1)} - u_{ij}^{(k)} \right\| < \varepsilon \tag{4}$$

where ε is a termination criterion between 0 and 1 and k is the iteration step.
A pseudo code of the FCM algorithm is presented in **Figure 4**.
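As an illustration, a minimal NumPy sketch of these steps might look as follows; the fuzzifier `m`, the tolerance `eps`, and the random initialization are illustrative choices rather than values prescribed by the chapter.

```python
import numpy as np

def fcm(X, C, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal Fuzzy C-means: X is an (N, d) array, C is the number of clusters."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Step 1: initialize the membership matrix U = [uij]; each row sums to 1
    U = rng.random((N, C))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: cluster centers cj (Eq. 2)
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 3: update the memberships uij (Eq. 3)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)                      # avoid division by zero
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=2)
        # Steps 4 and 6: stop when the largest membership change is below eps (Eq. 4)
        converged = np.max(np.abs(U_new - U)) < eps
        U = U_new
        if converged:
            break
    return centers, U
```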

#### **5. Proposed method**

The idea of our distributed architecture comes down to adapting the process to a data fusion approach. This architecture lets us carry out the data analysis with the powerful Spark big data tool (see **Figure 5**).

#### **5.1 Converting incoming files**

We use Jupyter Notebook with Apache Spark and its Python API (PySpark). In this stage, we read the CSV files and convert them to the Apache Parquet format in Microsoft Azure Blob Storage. Apache Spark supports multiple operations on data and offers the ability to convert data to another format in just one line of code. Developed by Twitter and Cloudera, Apache Parquet is an open-source columnar file format optimized for query performance and minimal I/O, offering very efficient compression and encoding schemes [26].


**Figure 5.** *Diagram of proposed approach.*

**Figure 6** shows the efficiency of using the Parquet format, which minimizes storage costs and data processing time.

The following **Table 2** indicates the old and new size of each dataset after conversion to Apache Parquet; we notice that converting from CSV to Parquet minimizes the storage costs.
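A minimal PySpark sketch of this conversion step is shown below; the Azure Blob Storage paths are placeholders, since the actual container and storage account names are not given here.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Placeholder wasbs:// paths; substitute the real container and storage account
csv_path = "wasbs://data@<account>.blob.core.windows.net/ctu13/ctu13.csv"
parquet_path = "wasbs://data@<account>.blob.core.windows.net/ctu13/ctu13.parquet"

# Read the CSV with its header and an inferred schema, then rewrite it as Parquet
ctu13_df = spark.read.csv(csv_path, header=True, inferSchema=True)
ctu13_df.write.mode("overwrite").parquet(parquet_path)
```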

## **5.2 Preparing data**

### *5.2.1 Feature selection*

The feature selection phase selects the relevant attributes required for decision making. A pre-processing phase converts the flow records into a specific format that is acceptable to an anomaly detection algorithm [27].

**Figure 6.** *Apache Parquet advantages.*


**Table 2.** *Average file size before and after converting.*

With the CTU-13 dataset, we did not use a feature selection algorithm; instead, we selected the pertinent columns and deleted the unnecessary features (empty columns). After this removal, we end up with a total of 13 columns.
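One possible way to detect and drop the empty columns in PySpark is sketched below; the chapter selects the 13 retained columns manually, so this is an illustration rather than the exact procedure used.

```python
from pyspark.sql import functions as F

# Count the non-null values of every column in a single pass
non_null = ctu13_df.select(
    [F.count(F.when(F.col(c).isNotNull(), c)).alias(c) for c in ctu13_df.columns]
).first()

# Drop the columns that contain no data at all
empty_cols = [c for c in ctu13_df.columns if non_null[c] == 0]
ctu13_df = ctu13_df.drop(*empty_cols)
```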

With the UNSW-NB15 dataset, we addressed the feature selection problem by applying a combined fusion of the Random Forest algorithm and a Decision Tree classifier. V. Kanimozhi [28] reports that the combined fusion of these two algorithms achieves 98.3% and lists the best four features as sbytes, sttl, sload, and ct\_dst\_src\_ltm; **Figure 7** shows the graphical representation of the feature importances and the top four features.

The goal of eliminating non-useful attributes is to bring about better system performance and better accuracy.
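[28] fuses a Random Forest with a Decision Tree classifier; as a rough illustration, feature importances of this kind can be obtained from Spark MLlib's RandomForestClassifier alone. The column handling and the name of the binary label column below are assumptions.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Assemble the numeric UNSW-NB15 attributes into a single feature vector
# ("label" is assumed to be the 0/1 attack column; string columns are left out here)
numeric_cols = [c for c, t in unsw_df.dtypes
                if t in ("int", "bigint", "float", "double") and c != "label"]
assembler = VectorAssembler(inputCols=numeric_cols, outputCol="features",
                            handleInvalid="skip")
train = assembler.transform(unsw_df)

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train)

# Rank attributes by importance; per [28] the top four are sbytes, sttl, sload, ct_dst_src_ltm
ranked = sorted(zip(numeric_cols, model.featureImportances.toArray()),
                key=lambda t: t[1], reverse=True)
print(ranked[:4])
```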

#### *5.2.2 Eliminate the redundancies*

This redundancy-elimination task involves removing duplicates (all repeated records), which helps with attack detection because it makes the system less biased by the most frequent records. This tactic also makes computation faster, as the system has less data to deal with [29].
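In PySpark this amounts to a single call on each DataFrame, for example (with `df` standing for either dataset):

```python
# Remove all repeated records so that frequent flows do not bias the clustering
deduplicated = df.dropDuplicates()
print(f"Removed {df.count() - deduplicated.count()} duplicate rows")
```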

#### *5.2.3 Join the datasets*

Before merging the datasets, some common columns have different names from one dataset to the other (for example, the column named "Label" in CTU-13 is named "attack\_cat" in UNSW-NB15). We therefore rename these attributes and then merge the datasets; Apache Spark offers the ability to join them in just one line of code.
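A minimal PySpark sketch of this rename-and-merge step is given below, assuming the two DataFrames are named `ctu13_df` and `unsw_df`; the exact query used in the chapter's listing may differ.

```python
# Align the differing column names, then merge the two DataFrames
unsw_aligned = unsw_df.withColumnRenamed("attack_cat", "Label")

# unionByName matches columns by name; allowMissingColumns (Spark 3.1+) fills
# columns that exist in only one dataset with nulls
merged = ctu13_df.unionByName(unsw_aligned, allowMissingColumns=True)
```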

#### *5.2.4 Extracting and scanning string data*

Using the Apache Spark Machine Learning library, we create a Machine Learning pipeline. A pipeline is a sequence of stages where each stage is either an Estimator or a Transformer.


**Figure 7.** *Feature importance of UNSW-NB15 dataset.*

In our final dataset, some attributes are of type string (such as Label, sport, proto, ...). In this step, we convert all string-typed attributes to integer-typed attributes using the "StringIndexer" transformer, which encodes a string column of labels into a column of label indices. These indices are ordered by label frequency, so the most frequent label gets index 0.

StringIndexer thus groups attacks into classes automatically: it assigns the same index to attacks of the same category.
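A minimal sketch of this indexing stage as a Spark ML pipeline, assuming the merged DataFrame from the previous step is named `merged`:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Index every remaining string column (e.g. Label, sport, proto) into integer indices;
# within each column the most frequent value receives index 0
string_cols = [c for c, t in merged.dtypes if t == "string"]
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in string_cols]
indexed = Pipeline(stages=indexers).fit(merged).transform(merged)
```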

#### **5.3 FCM application**

In our experimental work, as stated above, we use Microsoft Azure as the cloud environment to upload and analyze the dataset with the FCM algorithm. We use the training dataset to build and evaluate our model, and the test dataset is then used to make predictions. We choose to train our model with the FCM algorithm. Its first stage is to initialize the input variables: the input vector includes the dataset features, the number of clusters is 2 (**1 = intrusion and 0 = normal**), and the cluster centers are calculated by taking the means of all features in the final dataset. Using the fuzzy C-means clustering algorithm to classify the data generates a number of clusters, each containing part of the data records [30]. The characteristics of normal and intrusion data records are different, so they should fall into different clusters, as shown in **Figure 8**, which presents the clustering of the data records.
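A hedged sketch of this stage reuses the `fcm` routine sketched in Section 4.4 on the indexed DataFrame from Section 5.2.4; the feature-column selection and the collection of the data to the driver are simplifying assumptions (a distributed FCM implementation would be needed for the full dataset).

```python
import numpy as np
from pyspark.ml.feature import VectorAssembler

# Assemble every numeric or indexed attribute except the label into one vector per record
feature_cols = [c for c, t in indexed.dtypes
                if t in ("int", "bigint", "float", "double") and c != "Label_idx"]
assembled = (VectorAssembler(inputCols=feature_cols, outputCol="features",
                             handleInvalid="skip")
             .transform(indexed))

# Collect to the driver for clustering (only practical for a sample of the data)
rows = assembled.select("features", "Label_idx").collect()
X = np.array([r["features"].toArray() for r in rows])
labels = np.array([r["Label_idx"] for r in rows])

# Two fuzzy clusters (1 = intrusion, 0 = normal) via the fcm sketch from Section 4.4;
# FCM numbers its clusters arbitrarily, so the mapping to this label convention
# may need to be flipped after inspecting the cluster contents
centers, U = fcm(X, C=2, m=2.0)
predicted = U.argmax(axis=1)   # hard assignment: the cluster with the highest membership
```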

#### **5.4 Performance metrics**

Apache Spark Machine Learning provides a suite of metrics to evaluate the performance of machine learning models [31]. To measure performance in our work, the metrics used are presented in **Table 3** below (where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives).
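Assuming the cluster assignments `predicted` and a 0/1 ground-truth array `labels` (0 = normal, 1 = intrusion) from the previous sketch, these metrics can be computed directly from the confusion counts:

```python
import numpy as np

# Confusion counts of the FCM assignments against the ground truth
tp = int(np.sum((predicted == 1) & (labels == 1)))
tn = int(np.sum((predicted == 0) & (labels == 0)))
fp = int(np.sum((predicted == 1) & (labels == 0)))
fn = int(np.sum((predicted == 0) & (labels == 1)))

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```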

After applying FCM to our final dataset (after merging our intrusion detection datasets), the result is shown in the following **Table 4**.
