**1. Introduction**

Recent advances in the design of sophisticated malware tools pose a significant challenge not only to the global IT industry but also to banking and security organisations. Advanced Persistent Threat (APT) groups are key players in highly targeted, sophisticated, state-sponsored attacks [1]. These groups design and deploy malware tailored to each target. After selecting a target organisation, they devise Tactics, Techniques and Procedures (TTP) to bypass traditional lines of defence such as intrusion detection systems and firewalls. Once they gain access, they remain inside the targeted network for a long time to observe its workflows, using intelligent multi-stage malware deployment techniques to stay under the radar [2]. Finally, the gathered sensitive information is pushed in small chunks to external command and control (C2) servers using carefully designed exfiltration techniques.

The APT life cycle is broadly divided into seven phases, as shown in **Figure 1** [3]. In the Reconnaissance phase, the attackers choose the target network, study its internal structure, and devise the strategy and TTP needed to bypass the initial layer of defence. Reconnaissance is followed by the Initial compromise phase, where the attackers exploit open vulnerabilities to gain an initial foothold in the targeted network. In the Establishing foothold phase, the attackers try to replicate and propagate to other machines and establish backdoors to pull in more sophisticated payloads. Later, in the Lateral movement phase, the attackers escalate privileges to perform more sophisticated tasks and to hide their traces. In this phase, they traverse from one network to another in search of sensitive information. After collecting the necessary data, the attackers centralise it on staging servers. In the data exfiltration phase, the attackers use custom encoding and encryption mechanisms to push the collected data to external command and control servers. Finally, to preserve the anonymity of the operation, the attackers clear their tracks so that no traces remain, and create a backdoor so that they can revisit the organisation in the future.
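The ordering of the phases described above can be captured in a small data structure, which is convenient when labelling observed events by life-cycle stage. The following is only a minimal sketch: the names `AptPhase`, `next_phase` and `furthest_phase` are hypothetical, and the labels `DATA_STAGING` and `COVERING_TRACKS` are assumed names for the two phases that the text describes without naming.

```python
from enum import IntEnum
from typing import Optional, Sequence


class AptPhase(IntEnum):
    """APT life-cycle phases, numbered in the order the text describes them."""
    RECONNAISSANCE = 1
    INITIAL_COMPROMISE = 2
    ESTABLISHING_FOOTHOLD = 3
    LATERAL_MOVEMENT = 4
    DATA_STAGING = 5       # assumed label: the text does not name this phase
    DATA_EXFILTRATION = 6
    COVERING_TRACKS = 7    # assumed label: the text does not name this phase


def next_phase(phase: AptPhase) -> Optional[AptPhase]:
    """Return the phase that follows `phase`, or None after the last phase."""
    if phase is AptPhase.COVERING_TRACKS:
        return None
    return AptPhase(phase + 1)


def furthest_phase(observed: Sequence[AptPhase]) -> Optional[AptPhase]:
    """Return the latest life-cycle phase among observed events, if any."""
    return max(observed) if observed else None
```

Because the phases are ordered integers, an analyst's tooling could, for example, call `furthest_phase` on the phases tagged in a set of alerts to estimate how far an intrusion has progressed.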

**Figure 1.** *APT life cycle phases.*

*DMAPT: Study of Data Mining and Machine Learning Techniques in Advanced Persistent Threat… DOI: http://dx.doi.org/10.5772/intechopen.99291*

APT has grown to become a global tool for cyber warfare between countries. The Carbanak APT campaign infected thousands of victims worldwide and caused nearly \$1 billion in damage across the globe [4]. APT actors carried out a variety of actions in this operation, including opening fraudulent accounts, employing bogus services to obtain funds, and sending money to cybercriminals via the SWIFT (Society for Worldwide Interbank Financial Telecommunication) network. Similarly, in 2018, the Big Bang APT developed more robust and sophisticated multi-stage malware targeting the Palestinian Authority [5]. This malware includes several modules that perform tasks such as obtaining a file list, capturing screenshots, rebooting the machine, retrieving system information, and deleting itself. More recently, the supply chain attack on SolarWinds, attributed to a Russian APT group, has been regarded as one of the most sophisticated attacks. The RefreshInternals() method used in the SolarWinds attack illustrates the maturity of these state-sponsored APT groups in terms of malware design and payload delivery [6].

To deal with these kinds of state-sponsored targeted attacks, security experts consider APT attribution and detection to be two key pillars. Attribution is an analysis process that explains "who" is behind a particular act of cyber espionage and "why" they carried it out [7]. This process gives insights into particular APT threat actors and the areas they target. Based on this preliminary information, the security community tries to detect these attacks by fixing issues at different levels of an organisation. APT attribution and detection have become crucial for many security firms and government agencies, and both processes require massive data pre-processing and analysis. To address this, researchers have proposed various data mining and machine learning techniques for both attribution and detection. In this paper, we discuss data mining and machine learning techniques used in both the detection and the attribution of APT malware. In addition, we compare different detection techniques and highlight the research gaps among them that the security community needs to address to combat sophisticated APT malware.

This paper is organised as follows. Section 1 gives an overview of APTs and the phases of the APT life cycle, followed by the need for data mining and ML techniques in both the attribution and detection of APT malware. Section 2 describes the attribution process and the different techniques proposed to perform APT attribution. Section 3 discusses state-of-the-art data mining and ML techniques proposed by the research community for APT detection. Section 4 presents a research gap analysis, followed by the conclusion and future scope.
