
*Advancing Automation in Digital Forensic Investigations Using Machine Learning Forensics DOI: http://dx.doi.org/10.5772/intechopen.90233*

*Digital Forensic Science*

### c.Deep learning

Deep learning combines a set of techniques used to implement ML methods that recognize patterns of patterns, as in image recognition. The system first identifies the object edges, then the structure of the object, then the object type and finally the object itself (**Figure 2**) (**Table 1**).

**Figure 2.** *Machine learning essentials.*

| Artificial intelligence | Machine learning | Deep learning |
| --- | --- | --- |
| Originated around the 1950s | Originated around the 1960s | Originated around the 1970s |
| Represents simulated intelligence in machines | Getting machines to make decisions without being programmed | Process of using artificial neural networks to solve complex problems |
| Subsets of data science | Subset of AI and data science | Subset of ML, AI and data science |
| Building machines that are capable of thinking like humans | Make machines that can learn through previous experience to solve problems | To build neural networks that automatically discover patterns for feature detection |
| Ability of a machine to imitate intelligent human behaviour | Application of AI that allows a system to automatically learn and improve from experience | Application of ML that uses complex algorithms and deep neural networks to train a model |

**Table 1.** *Difference between artificial intelligence, machine learning and deep learning.*

### **3. Approaches to machine learning forensics**

Usually two main approaches are used to define ML forensics, that is, inductive reasoning and deductive reasoning:

### a.Inductive learning

Inductive reasoning derives general knowledge from specific information. The knowledge obtained is new and not truth preserving; that is, it can be invalidated by new information. There is no well-founded underlying theory. A central goal in this area is to discover general concepts from a limited set of examples, which are called experience. The basis of the approach is to search for similar characteristics among the examples. The methods used here are based on inductive learning.

### b.Deductive learning

Deductive reasoning obtains knowledge by using the well-established methods of logic. The knowledge is not new; it is implicit in the initial knowledge. New knowledge cannot invalidate the knowledge already obtained, because that knowledge is based on mathematical logic.

### **3.1 Supervised, unsupervised and reinforcement ML**

Supervised and unsupervised learning are the most commonly used techniques in ML algorithms.

### a.Supervised and unsupervised

Reinforcement learning, on the other hand, is complex and difficult to implement. Supervised learning is the most common ML paradigm and is easy to understand and implement. The data take the form of examples with labels and are called training data. The learning algorithm is fed these example-label pairs one by one, predicts a label for each example and receives feedback on whether the prediction was right. The model is first trained using a large amount of training data (inputs and targets), a process that is generally fast and accurate. Over time, the algorithm learns to approximate the relationship between examples and their labels. Once trained, a supervised model can take new, never-before-seen data and predict a good label for it. Supervised learning is the most widely used technique in machine learning.
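The example-label idea can be sketched in a few lines. The snippet below is only an illustration, not a method taken from the chapter: it assumes a 1-nearest-neighbour rule, and the feature meanings (failed logins and megabytes uploaded per hour) and the labels `benign` and `suspicious` are hypothetical.

```python
import math

# Hypothetical training data: (feature vector, label) pairs.
# Features are assumed to be (failed logins per hour, MB uploaded per hour).
TRAINING_DATA = [
    ((1.0, 5.0), "benign"),
    ((2.0, 8.0), "benign"),
    ((40.0, 300.0), "suspicious"),
    ((55.0, 250.0), "suspicious"),
]


def predict(example, training_data=TRAINING_DATA):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training_data, key=lambda pair: math.dist(pair[0], example))
    return nearest[1]


def accuracy(test_data, training_data=TRAINING_DATA):
    """Fraction of held-out examples whose predicted label matches the true label."""
    correct = sum(predict(x, training_data) == y for x, y in test_data)
    return correct / len(test_data)
```

Calling `predict((50.0, 280.0))` labels an unseen example, and `accuracy` supplies the feedback step by comparing predictions with known labels on held-out data.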

Unsupervised learning does not have a well-structured format. It is the opposite of supervised learning: there are no labels or targets for the training data, so the system is not told what to look for and must make sense of the given data on its own. The algorithm is fed a large amount of data and is given the tools to understand the properties of that data. Its task is to learn to group, cluster and/or organize the data in a way similar to how a human would organize it. Unsupervised learning is of particular interest because the overwhelming majority of data in the world is unlabelled. Industries hold terabytes of unlabelled data, and organizing this data with minimal or no human effort can be beneficial and potentially profitable (**Table 2**).
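Grouping unlabelled data can be illustrated with k-means clustering, one of the unsupervised examples named in Table 2. This is a minimal sketch under stated assumptions: the function name `k_means`, the fixed iteration count and the choice to seed the centroids with the first `k` points are illustrative simplifications.

```python
import math


def k_means(points, k, iterations=10):
    """Group unlabelled points into k clusters with plain k-means."""
    # Assumption: the first k points seed the centroids (keeps the run deterministic).
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters
```

No labels are supplied at any point; the grouping emerges only from the distances between the points.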



**Figure 3.** *Machine learning types.*

### b.Reinforcement learning

Reinforcement learning is quite different from both supervised and unsupervised ML. Supervised and unsupervised learning are related to each other through the presence or absence of labels, whereas reinforcement learning learns from its mistakes. When a reinforcement learning algorithm is deployed in an environment, it makes many mistakes at the beginning. The algorithm is given signals that associate good behaviour with a positive signal and bad behaviour with a negative signal. These signals reinforce the algorithm to prefer good behaviour over bad behaviour. Over time, the algorithm learns to make fewer mistakes than it did initially (**Figure 3**).
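The positive and negative signals can be sketched with a very small value-learning loop. The environment here is hypothetical (two actions with fixed rewards of +1 and -1), and the epsilon-greedy rule, the parameter names and the learning rate are assumptions chosen for illustration, not something the chapter prescribes.

```python
import random


def train(steps=200, epsilon=0.1, learning_rate=0.1, seed=0):
    """Learn which of two actions to prefer from positive and negative signals."""
    rng = random.Random(seed)
    # Hypothetical environment: one action is rewarded, the other is penalized.
    rewards = {"good_action": 1.0, "bad_action": -1.0}
    value = {action: 0.0 for action in rewards}  # learned preference per action
    mistakes = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(list(value))    # explore: try a random action
        else:
            action = max(value, key=value.get)  # exploit: take the preferred action
        signal = rewards[action]
        if signal < 0:
            mistakes += 1
        # Reinforce: move the estimate for this action towards the received signal.
        value[action] += learning_rate * (signal - value[action])
    return value, mistakes
```

After training, `max(value, key=value.get)` returns the action the agent has learned to prefer, and `mistakes` counts how often the penalized action was taken.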

### **3.2 Machine learning forensics for law enforcement, compliance and intelligence**

Standardization is still a big challenge for DFI. DI experts perform DI on the basis of their company's policies and their previous experience. This is due to the lack of any universal standard for digital evidence collection. Law enforcement is continuously changing in this information technology age. Traditional crimes, such as financial and commercial crimes, are also benefiting from technological advancements and are continuously upgrading with the latest developments in technology.

These days, law enforcement techniques are also changing.

DFI is a very common practice in law enforcement and the commerce industry. The growing use of information technology by government, public and corporate agencies has also increased the number of victims of cyberattacks through the internet.

| | Supervised learning | Unsupervised learning |
| --- | --- | --- |
| Definition | Data set labeled with predefined classes | Data set labeled without predefined classes |
| Method | Data classification | Data clustering |
| Example | Support vector machine, decision tree | K-means clustering, ant clustering algorithms |
| Known attack detection | High | Low |
| Unknown attack detection | Low | High |

*Unsupervised learning is not easy and is not used as widely as supervised.*

**Table 2.** *Supervised vs. unsupervised learning.*

### **4. Literature review**

The work of [6] is one of the earliest efforts to apply expert systems to digital forensics in order to automate the analysis process. The expert system uses a decision tree to detect network anomalies automatically and is used to analyze log files.

The Open Computer Forensic Architecture (OCFA) [7] is a well-organized platform for automating digital forensic tasks. This tool provides scalability, modularity and openness in the digital forensic process. The framework consists of different modules, each of which works independently on a specific file type to extract the content of the file as digital evidence. It creates a searchable index of text and metadata. The modules are pluggable, and evidence is processed recursively: a dispatching entity inspects the information in the evidence and decides which module needs to be invoked. However, OCFA works on pre-extracted data and is not designed to search for and recover files. The examination is done by an IT expert on the extracted data to generate indices for text and metadata.
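The dispatch-and-recurse pattern described above can be sketched as follows. This is not OCFA's actual code or API: the evidence dictionaries, the module functions `text_module` and `archive_module`, and the `MODULES` registry are all hypothetical names introduced for illustration.

```python
def text_module(item, index):
    """Index the text content of a document; produces no derived evidence."""
    index.append({"name": item["name"], "text": item["content"]})
    return []


def archive_module(item, index):
    """Record archive metadata and hand the contained files back for dispatch."""
    index.append({"name": item["name"], "members": len(item["children"])})
    return item["children"]


# Hypothetical registry: evidence type -> module that knows how to process it.
MODULES = {"text": text_module, "archive": archive_module}


def dispatch(item, index):
    """Pick a module from the evidence type and recurse into derived evidence."""
    module = MODULES.get(item["type"])
    if module is None:
        return  # no module registered for this type; left for manual review
    for derived in module(item, index):
        dispatch(derived, index)
```

Each module stays independent of the others, and supporting a new file type only requires registering one more entry in `MODULES`.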

Another effort to automate the disk forensic process is made by [8]. Their tool, "fiwalk", automates the processing of the data in order to assist users in developing programs that automatically process disk images. This tool also integrates the command line tool of [9]. However, it works on file system data only, without any integration of AI techniques. The tool makes the tasks of expert examiners easier.

Hoelz et al. developed the MultiAgent Digital Investigation toolKit (MADIK) [9]. The tool provides a multiagent system that helps experts in computer forensic examinations. The authors apply AI-based methods to digital forensics by assigning tasks to agents, each of which is specialized in a different task, such as hashing, keyword search or Windows registry analysis. However, this tool is not focused on building new knowledge during investigations. It is used to learn from previous investigations for future investigation purposes. Moreover, this work cannot be used by nonexpert users.
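The division of labour between specialized agents can be illustrated with a small sketch. It does not reproduce MADIK's implementation: the classes `SimpleHashAgent` and `SimpleKeywordAgent`, the `examine` method and the `investigate` helper are hypothetical, and the use of SHA-256 is an assumption.

```python
import hashlib


class SimpleHashAgent:
    """Hypothetical agent: flags files whose digest is in a set of known hashes."""

    def __init__(self, known_hashes):
        self.known_hashes = set(known_hashes)

    def examine(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        return {"agent": "hash", "file": name, "flagged": digest in self.known_hashes}


class SimpleKeywordAgent:
    """Hypothetical agent: flags files that contain any of the given keywords."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def examine(self, name, data):
        text = data.decode("utf-8", errors="ignore").lower()
        hits = [k for k in self.keywords if k in text]
        return {"agent": "keyword", "file": name, "flagged": bool(hits), "hits": hits}


def investigate(files, agents):
    """Give every file to every agent and collect the individual findings."""
    return [agent.examine(name, data) for name, data in files.items() for agent in agents]
```

Each agent sees the same evidence but contributes only its own specialty, so further agents can be added without changing the existing ones.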

The chapter [10] presents a machine learning-based digital triage model for selective pre-examination and statistical classification of digital data. The model can be deployed both at the crime scene and in digital forensic labs. The work is able to provide quick, actionable intelligence at the crime scene in time-critical situations, reduce the burden on forensic labs and protect suspect privacy when a huge amount of data needs to be analyzed. As advantages, the framework requires minimal manual work and produces a measurable and reproducible error rate.

Existing methods for digital evidence extraction do not coherently provide process support through a standardized, integrated implementation that offers guidance and technical knowledge to nonexpert investigators.


| Paper title | Problems addressed | Methods used | Proposed solution | Implementation | Open problems |
| --- | --- | --- | --- | --- | --- |
| Automated analysis for digital forensic science: Semantic integrity checking [6] | Automates data collection | Expert system with a decision tree | Predetermined invariant relationships between redundant digital objects to detect semantic incongruities | Collection of C programs and Perl scripts | Abstraction errors can occur when representations of the system are not accurate |
| Android forensics: Automated data collection and reporting from a mobile device [14] | Collects, stores and transfers forensically valuable Android data to a remote Web server without root privileges | Broadcasts receiver, content observer and alarm | DroidWatch is an automated system prototype composed of an Android application and an enterprise server | Forensic collection, local SQLite storage, HTTP transfer and clear local SQLite DB | Architecture models of Android applications are complex and diverse in nature |
| Android forensics: Automated data collection and reporting from a mobile device [14] | Enterprise monitoring system for Android smartphones | Continuously collect many data sets of interest to incident responders, security auditors, proactive security monitors and forensic investigators | First open-source Android enterprise monitoring prototype | Comprehensive guide of data sets available for collection without elevated privileges | Increasing interoperability among Android devices |
| An automated approach for digital forensic analysis of heterogeneous big data [15] | Understanding the relationships between artifacts | Metadata to solve the data volume problem, semantic web ontologies to solve the heterogeneous data sources | Automated identification and correlation | Not given any particular implementation details | Artifacts to reduce the burden placed upon the investigator |
| Data mining methods applied to a digital forensics task for supervised machine learning [16] | Supervised machine learning techniques in the context of multi-class supervised learning | Decision trees, Bayes classifiers, based on rules, artificial neural networks and based on nearest neighbour techniques | Empirical overview of the performance with classifiers from different machine learning approaches | Uses two metrics like accuracy and Cohen's kappa for training and test stages | The algorithms implemented are complex in nature and system needs careful understanding of the extracted data |
| Data mining methods applied to a digital forensics task for supervised machine learning | Multi-class classification | Decision trees, Bayes classifiers, based on rules, artificial neural networks and based on nearest neighbors | | Glass identification | Nondeterministic algorithms |

| Paper title | Problems addressed | Methods used | Proposed solution | Implementation | Open problems |
| --- | --- | --- | --- | --- | --- |
| Building an intelligent assistant for digital forensics [11] | Supports investigations conducted by non-IT expert and expert investigators | Series of experiments comparing it with a human investigator as well as against standard benchmark disk images | Proposed AUDIT, an automated disk investigation toolkit | Systematically examine the disk in its totality based on its physical and logical structures | Seizure of an entire hard disk drive is a complex task |
| A Machine Learning-based Triage methodology for automated categorization of digital media [12] | Defines a list of crime-related features | Populates an input matrix and processes it with different machine learning mining schemes to come up with a device classification | Crime features extracted from available devices and forensic copies | Classified digital media using Bayes networks or support vector machines | Extract data in its raw form without the nature of the information |
| A machine learning-based approach to digital triage [13] | Identifies test objects allegedly used for exchanging child pornography material | Mobile handset classification on the basis of the 5MF technique | Multiclass categorization for classifying objects on the basis of owner's usage profile | Data corpus with binary categorization | Most files of forensic interest are not fragmented |
| Artificial intelligence applied to computer forensics [9] | Assists the computer forensic expert on its examinations | Set of rules and a knowledge base | MultiAgent Digital Investigation toolKit based on the experience of the expert | Six specialized intelligent agents implemented: HashSetAgent, FilePathAgent, FileSignatureAgent, TimelineAgent, WindowsRegistryAgent, KeywordAgent | The method is very heavyweight to be practical |
| Automating disk forensic processing with SleuthKit, XML and Python [8] | Automation to perform disk forensic | XML methods used to describe partitions and files on a hard drive or disk image | Creating special-purpose forensic tools | SleuthKit, XML and the Python programming language | Capturing every aspect of a live system is not feasible |



### **5. The significance of machine learning in digital forensic investigations**

MLF originates from AI and is used to process huge amounts of data, to analyse the data in order to discover criminal actions and risks, and to segment the data in order to find criminal activity and behaviour. A system that has no intelligent component cannot exhibit true learning capabilities. DFI through ML is the latest trend to seize the potential of AI as a leading security solution. ML behavioural analytics is a core part of modelling, profiling and prediction in medicine, manufacturing, advertising and business intelligence and has recently been adopted in law enforcement. In order to discover criminal behaviour, MLF uses wireless or wired networks via web or cloud computing. Thus, MLF aims to provide new knowledge and skills and an organized knowledge structure in order to produce progressive improvements in its own performance.

Originating from AI, ML algorithms can be used to analyze huge amounts of data to identify risk, segment the data and detect criminal behaviour. ML algorithms enable investigators to interrogate the vast, scattered data sets that reside in social and wired networks and in web or cloud computing. In essence, ML algorithms contain pattern recognition software that analyses huge amounts of data in order to predict behaviour. They seek to learn from historical data, which is then used to predict future behaviour. Through ML algorithms, MLF gains the capability to recognize patterns of criminal activity and to learn from historical data when and where a crime will take place. The malicious activities in an extracted data set can range from burglaries to money laundering or intrusion attacks. This task can be achieved by formalizing and analyzing servers, suspects' devices, wireless devices, the Internet and other kinds of data for visualization, link association, segmentation and prediction of criminal activities. Nowadays, the industry is facing more advanced cyber threats that cannot be tracked through traditional security measures. Attackers have designed more sophisticated ways of attacking systems, and these attacks become more complicated over time. A system administrator would not be able to detect these attacks every time. Human expertise and competence also have limits, which leads to slow handling of incidents, longer delays in detecting and preventing cyber threats, and a need for more advanced expertise to remove them. Therefore, developing more advanced machine learning models may help to prevent and protect from these cyber threats. Nowadays, there is much automated software available that can help humans to perform complicated and scientific tasks. In the next step, these automated tools need to be more advanced and should incorporate AI and ML techniques.

### **6. Discussion and future prospects**

From the literature survey, it has been observed that forensic experts can face many challenges when performing an investigation.

First of all, there is an ultra-exponential growth in data due to inexpensive storage devices such as hard drives, CDs, USB sticks and so on. This makes it almost impossible for individuals to perform forensics in a short period of time.

The lack of automated intelligent systems for digital evidence extraction is another big issue. Further, digital evidence is difficult to handle and cannot be easily understood even by experts. Extracting digital evidence from different storage media may require several layers of transformations (**Table 3**).
