**2.3 ML-based attribution framework using high-level IOCs**

Most APT attribution processes depend on manual analysis of victim networks and on collecting low-level indicators of compromise (IOCs) through forensic analysis at firewalls, tracebacks, IDSs, and honeypots. However, APT actors change these low-level IOCs from one target organisation to the next, so ML models built on low-level IOCs yield inadequate cyber intelligence systems. Collecting high-level IOCs separately for each organisation, on the other hand, is time-consuming. As a common practice, such high-level IOCs are published across organisations in the form of Cyber Threat Intelligence (CTI) reports.

**Figure 5.** *Cyber threat attribution framework [11].*

*DMAPT: Study of Data Mining and Machine Learning Techniques in Advanced Persistent Threat… DOI: http://dx.doi.org/10.5772/intechopen.99291*

In 2019, Noor et al. proposed using a distributional semantics technique from NLP to build a cyber threat attribution framework that extracts patterns from CTI reports [11]. The framework is divided into three phases, as depicted in **Figure 5**: data collection, CTI analytics, and cyber threat attribution. In the data collection phase, the authors used a customised search engine to collect 327 unstructured CTI documents corresponding to 36 APT actors. Because of varying textual definitions and word choices for communicating a concept, CTI documents often do not contain the exact keywords defined in the standard taxonomy. Rather than using a simple keyword-based search, the authors therefore developed a semantic search method based on latent semantic analysis (LSA), a statistical distributional semantics technique, to retrieve relevant documents. The input CTI records are indexed using LSA, and the statistically derived conceptual indices produced by the LSA indexer are searched for semantically relevant topics using the high-level IOC labels specified in MITRE ATT&CK [11]. In the CTI analytics phase, a correlation matrix between cyber threat actors (CTAs) and tactics, techniques, and procedures (TTPs) is constructed based on cosine similarity. In the cyber threat attribution phase, ML models are built on top of this CTA-TTP correlation matrix. Among the classifiers evaluated, the deep neural network performed best, with 94% attribution accuracy on test data and high precision and recall.
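For readers unfamiliar with LSA, the semantic search step can be written compactly in its standard textbook form. The notation below is introduced here for illustration and is not taken from [11]; the source does not state the term weighting scheme or the number of retained dimensions $k$, so both are left open. Let $X$ be the term-by-document matrix of the CTI corpus, $q$ the term vector of a high-level IOC label (for example, an ATT&CK technique name), and $\hat{d}_j$ the $j$-th row of $V_k$ written as a column vector, which represents document $j$ in the concept space:

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\operatorname{sim}(\hat{q}, \hat{d}_j) = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
$$

The first expression is the rank-$k$ truncated singular value decomposition, the second folds the label into the same $k$-dimensional concept space as the documents, and the third is the cosine similarity used to rank documents against the label. Because matching happens in the concept space rather than on raw terms, a report can score highly for a label even when it never uses the label's exact wording.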

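As a minimal sketch of the CTI analytics phase, the snippet below builds a small CTA-TTP matrix from cosine similarities. It assumes that the reports and the TTP labels have already been projected into a shared LSA concept space. The function names, the two-dimensional toy vectors, the actor names, and the choice to aggregate an actor's reports by taking the maximum similarity are all illustrative assumptions, not details reported in [11].

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cta_ttp_matrix(doc_vectors, doc_actor, ttp_vectors):
    """Return {actor: {ttp: score}}.

    score is the best cosine match between a TTP label and any report
    attributed to that actor (max aggregation is an assumption made here).
    Scores start at 0.0, so negative similarities are clipped to zero.
    """
    matrix = {}
    for doc_id, vec in doc_vectors.items():
        row = matrix.setdefault(doc_actor[doc_id], {t: 0.0 for t in ttp_vectors})
        for ttp, query in ttp_vectors.items():
            row[ttp] = max(row[ttp], cosine(vec, query))
    return matrix


# Hypothetical concept-space vectors; real ones would come from the LSA indexer.
doc_vectors = {
    "report_a1": [0.9, 0.1],
    "report_a2": [0.8, 0.3],
    "report_b1": [0.1, 0.95],
}
doc_actor = {"report_a1": "ActorA", "report_a2": "ActorA", "report_b1": "ActorB"}
ttp_vectors = {
    "Spearphishing Attachment": [1.0, 0.0],
    "OS Credential Dumping": [0.0, 1.0],
}

matrix = cta_ttp_matrix(doc_vectors, doc_actor, ttp_vectors)
for actor, row in sorted(matrix.items()):
    print(actor, {ttp: round(score, 3) for ttp, score in row.items()})
```

Each row of the resulting matrix can then serve as a feature vector for a classifier in the attribution phase. The classifier itself is omitted here because the source does not describe its architecture beyond naming a deep neural network as the best performer.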