**6.1 Results of PCA in machine learning for mobile malware detection**

The outcomes of standard PCA and its variants, in which the 471 feature dimensions are reduced to two principal components (PCs), are visualized below. **Figure 2** shows the data being imported into the Python Jupyter notebook environment.

**Figure 3** shows the data pre-processing step that checks whether the data contains any null values. Data pre-processing is an important step in the machine learning process: it helps to convert raw data containing irrelevant or undefined entries into a relevant, well-defined form.

**Figure 3** shows that the dataset does not contain any null values and is fit for further processing; that is, the result value '0' indicates that there are no null values in the data.
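The null check described above can be sketched as follows. This is a minimal illustration using pandas, not the code shown in Figure 3; the helper name `count_nulls` and the toy frame are assumptions made for the example, standing in for the real 471-feature dataset.

```python
import numpy as np
import pandas as pd


def count_nulls(df: pd.DataFrame) -> int:
    """Return the total number of missing (NaN/None) cells in the frame."""
    # isnull() flags each missing cell; the double sum counts them
    # first per column, then over all columns.
    return int(df.isnull().sum().sum())


# Hypothetical stand-in data; the real dataset has 471 feature columns.
toy = pd.DataFrame({"f1": [1, 0, 3], "f2": [0.5, 0.1, 0.0]})
total_nulls = count_nulls(toy)

# A frame with one missing cell, to show a non-zero count.
toy_with_gap = pd.DataFrame({"f1": [1.0, np.nan], "f2": [2.0, 3.0]})
nulls_with_gap = count_nulls(toy_with_gap)
```

A total of 0 corresponds to the result reported for Figure 3, where no null values were found.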

**Figure 4** shows the removal of duplicate records to ensure that each record in the dataset is unique. Duplicate data can lead to misinterpretation of the results.

**Figure 2.** *Data import: CICMalDroid\_2020 dataset.*


**Figure 3.** *Check if the data contains any null values or not.*

**Figure 4** shows that, out of 11,598 records, 72 were duplicates and were dropped. After dropping these 72 records, the dataset consists of 11,526 instances with 471 features.
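A minimal sketch of the duplicate-removal step, assuming the data is held in a pandas DataFrame. The helper `drop_duplicate_rows` and the toy frame are hypothetical and only illustrate the counting logic: the number of rows after dropping equals the original count minus the number of duplicates, just as 11,598 − 72 = 11,526 in the text.

```python
import pandas as pd


def drop_duplicate_rows(df: pd.DataFrame):
    """Return (deduplicated frame, number of duplicate rows removed)."""
    # duplicated() marks every repeat of an earlier row, so the first
    # occurrence of each record is kept.
    n_dupes = int(df.duplicated().sum())
    deduped = df.drop_duplicates().reset_index(drop=True)
    return deduped, n_dupes


# Hypothetical stand-in data: the second row repeats the first.
toy = pd.DataFrame({"f1": [1, 1, 2, 3], "f2": [0, 0, 5, 7]})
deduped, n_dupes = drop_duplicate_rows(toy)
```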

**Figure 5** shows the data being split into training and testing sets so that the machine learning model can detect and cluster the mobile malware data points in the dataset. Splitting the data is a significant phase in the machine learning process: the model is adequately trained on one portion, and its efficacy is then measured on data it has not seen.

**Figure 5** shows that the dataset is divided into 70% for training and 30% for testing; that is, out of 11,526 records, 8,068 are used for training and 3,458 for testing the model. The dataset is now suitable for performing PCA with the machine learning technique.
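The 70/30 split can be sketched with scikit-learn's `train_test_split`. The placeholder array below only reproduces the row count of the deduplicated dataset; the `random_state` value is an assumption added for reproducibility and is not taken from the source.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder with the same number of rows as the deduplicated dataset
# (11,526); a single dummy column replaces the 471 real features.
X = np.arange(11526).reshape(-1, 1)

# test_size=0.3 keeps 30% of the rows for testing. random_state is an
# assumed value so that the shuffle is repeatable.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

n_train, n_test = X_train.shape[0], X_test.shape[0]
```

With 11,526 rows, the test set size is rounded up from 3,457.8 to 3,458, and the remaining 8,068 rows form the training set, which matches the counts given above.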

**Figure 6** shows the feature scaling. Before applying PCA, the data must be scaled so that all features lie within a comparable range; PCA is driven by variance, so unscaled features with large numeric ranges would otherwise dominate the principal components. Without feature scaling, the data also takes more time to fit during model development.

**Figure 6** depicts feature scaling using MinMaxScaler and StandardScaler to bring the scattered data points onto a common, specified scale. The data is then ready for PCA.
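The scaling step, followed by a reduction to two principal components, can be sketched as below. This is an illustration under stated assumptions rather than the code in Figure 6: the random arrays stand in for the real training and test features, the two scalers are shown as separate alternatives because the source does not say how they are combined, and fitting the scalers on the training split only is a common convention assumed here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical stand-in data: 10 features on very different scales.
rng = np.random.default_rng(0)
scales = rng.uniform(1, 100, size=10)
X_train = rng.normal(size=(80, 10)) * scales
X_test = rng.normal(size=(20, 10)) * scales

# Option A: min-max scaling maps each training feature to [0, 1].
minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)  # reuse training min/max

# Option B: standardization gives each training feature zero mean
# and unit variance.
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)
X_test_std = standard.transform(X_test)  # reuse training mean/std

# Reduce the standardized features to two principal components.
pca = PCA(n_components=2)
train_pcs = pca.fit_transform(X_train_std)
test_pcs = pca.transform(X_test_std)
```

The resulting `train_pcs` and `test_pcs` arrays have two columns each, which is the form needed for the two-dimensional visualizations of the PCs.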

**Figure 4.** *Drop duplicate values.*
