**1. Introduction**

The discrimination of encrypted data from other kinds of data is of interest in many areas of application. One instance is network communication, where the appropriate means of handling traffic may depend on whether the data is plaintext/cleartext, compressed, encrypted or encoded in some other way. There may also be security reasons (e.g. the uncontrolled flow of encrypted data, some of which may be transmitted for malicious purposes) which argue for better network supervision tools. To these ends, various methods, mainly from machine learning, have been suggested over the last decades.

#### **1.1 Methods for discrimination in case of streamed data**

As a foundation for the treatment, some kind of preprocessing is needed. Among the methods for this step are the chi-square statistic, Shannon entropy, the Kolmogorov-Smirnov statistic, the Lilliefors test, the Cramér-von Mises statistic, the discrete runs test, the autocorrelation test and the index of coincidence test, to mention a few [1, 2].
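As an illustration of this preprocessing step, the sketch below (a minimal example, not taken from the cited references) computes two of the listed statistics, the Shannon entropy and the chi-square statistic, over the byte histogram of a data block. For encrypted data the bytes should look close to uniformly distributed: entropy near the maximum of 8 bits per byte and a chi-square value near its expectation of about 255.

```python
import math
import os
from collections import Counter

def byte_histogram(block: bytes) -> list:
    """Counts of each byte value 0..255 in the block."""
    counts = Counter(block)
    return [counts.get(b, 0) for b in range(256)]

def shannon_entropy(block: bytes) -> float:
    """Shannon entropy in bits per byte; the maximum 8.0 is
    attained by a uniform byte distribution."""
    n = len(block)
    return -sum((c / n) * math.log2(c / n)
                for c in byte_histogram(block) if c > 0)

def chi_square(block: bytes) -> float:
    """Chi-square statistic against the uniform byte distribution
    (255 degrees of freedom, so values near 255 suggest randomness)."""
    n = len(block)
    expected = n / 256
    return sum((c - expected) ** 2 / expected for c in byte_histogram(block))

# Random bytes as a stand-in for ciphertext, repeated text as plaintext.
cipher_like = os.urandom(4096)
plain_like = b"the quick brown fox jumps over the lazy dog " * 93
```

On the ciphertext-like block the entropy is typically above 7.9 bits per byte and the chi-square statistic is of the order of 255, while the plaintext-like block has a much lower entropy and an enormous chi-square value.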

In [1] the problem of filtering encrypted data when monitoring traffic outbound from a computer or a network, so-called egress filtering, is considered. Many aspects and properties of the problem are brought to attention and an extensive account of various evaluation techniques is given.

Different measures for discriminating a given distribution from the normal distribution are the topic of [2]. For discrimination between encrypted and non-encrypted data, distributions other than the normal are more relevant, but some of the measures considered in that paper can also be used in the more general case.

In [3] machine learning methods are deployed in a fully automated method for distinguishing between encrypted and compressed data, claimed to be the first of its kind. A dataset for further evaluation and benchmarking is also provided.

A method for discriminating between compressed and encrypted data in the case when the data is voice over IP with constant and variable bit rate codecs is proposed in [4]. The method is evaluated using the NIST [5] and ENT [6] test suites.

Reference [7] does not suggest a solution to the discrimination problem but rather a method for fast and distributionally faithful simulation of data reflecting the properties of encrypted and compressed data, respectively.

In Ref. [8] a classification procedure based on Gaussian mixtures and hidden Markov models is detailed. An elaborate comparative evaluation is carried out, taking nine other attempts to solve the same problem into account. The study concludes that the proposed method renders a better classification at a lower computational complexity.

A lucid convolutional neural network method for classifying data as encrypted or non-encrypted is presented in [9]. The proposed method is thoroughly evaluated on various kinds of network traffic and with several performance measures from the field of machine learning.

#### **1.2 Methods for discrimination in case of static data**

A different need for discrimination between encrypted and unencrypted data is the following. In police investigations concerning serious and organized crime, where evidence of criminal activity resides as data on a computer hard disk drive (HDD), the capability to distinguish encrypted files from non-encrypted ones may be of primary importance [10–12]. When the police seize an HDD containing data belonging to a suspected criminal, that data can be material evidence in a subsequent trial. But criminals often try to make sure that the police will *not* be able to use that data. Possible actions to obstruct data access are to encrypt and/or to *quick delete* the data. In a quick delete, the pointers to the files are destroyed but the contents of the files remain. The other action is to encrypt files. Using state-of-the-art tools for file encryption (such as VeraCrypt, Bitlocker, Ciphershed and the ciphers they provide), there is no general procedure for breaking the encryption other than a brute force attack: guessing the cipher algorithm and systematically guessing the encryption key [13]. If some of the files are encrypted and others are not, it is a delicate matter to distinguish between the two once the data has been quick deleted.

#### *Perspective Chapter: Distinguishing Encrypted from Non-Encrypted Data DOI: http://dx.doi.org/10.5772/intechopen.102856*

Nevertheless, the importance of discriminating between encrypted code and plaintext is also stressed by the fact that brute force attacking all clusters of data on the hard drive would be an impossible task, while being able to separate encrypted code from non-encrypted code would substantially improve the chances of successful cryptanalysis. The police authorities usually have software to brute force such encrypted data, but this procedure may be very time consuming if the amount of data is so large that different parts of it have been encrypted with different keys. In juridical cases, time is typically of the essence: since the chances of success in prosecution and proceedings against a criminal depend on deadlines (such as time of arrest, time of trial, etc.), any time savings in the procedure of extracting evidence from the HDDs are essential. Thus time has to be spent on the appropriate tasks: codebreaking only the encrypted data rather than trying to decipher non-encrypted data. Given a single file, the task of determining whether it is encrypted or not is usually easy; but given a whole HDD without knowing where the encrypted files are, the task is trickier.

There are several software solutions for indicating whether a file is encrypted or not, mostly by checking the header of the file and looking for some known pattern, such as the headers of EFS (Encrypting File System on Windows), BestCrypt, or other software. Such alternatives cannot be used when the user has performed a (quick) delete of the HDD, because then the pointers to all files are lost. Nevertheless, the files remain on the HDD: upon deleting a file on an HDD, only the pointers to the beginning and end of the physical space containing the file's information are removed. The physical space itself is not overwritten, because that operation would be slow, i.e. as slow as writing the same file again. The operating system's designers rather leave the file on the HDD intact but indicate that the location is free to host other data. As long as nothing has been overwritten, the file is still stored on the HDD for some time, which allows recovery software to recover it.
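The header-checking approach can be sketched as follows. The magic values below are illustrative examples of publicly documented file signatures, not the actual detection lists used by any particular tool:

```python
# A few common, publicly documented file signatures (magic numbers).
KNOWN_MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x1f\x8b": "gzip-compressed data",
}

def identify_by_header(data: bytes):
    """Return a label if the data starts with a known signature,
    otherwise None (unknown header: possibly encrypted, or simply
    a format not in the list)."""
    for magic, label in KNOWN_MAGIC.items():
        if data.startswith(magic):
            return label
    return None
```

As noted above, such a check is useless after a quick delete: without the file pointers one no longer knows where a file, and hence its header, begins.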

Recovery software can simply recover the files, but it might also overwrite data on the HDD, which could result in loss of evidence in an investigation. In such a case, the police would have to locate the encrypted files without using recovery software or "header-detector" software.

Here, methods to locate quick deleted encrypted data are presented and detailed. First, a description of how encrypted data differs from other data is given. This is followed by an introduction to statistical change-point methods for discrimination between encrypted and non-encrypted data. Finally, the results of these procedures are presented along with some experimental values to evaluate the methods.

However, such a method will only work on mechanical HDDs and not on flash memory devices: in flash memory (such as USB memory sticks or solid state drives, SSDs), data cannot be overwritten in place, so a file is erased from the memory as soon as it is removed. As soon as data is deleted, the operating system therefore deletes both the pointers and the data, to save time later when the user decides to store new data at that location. Erasing in flash memory also takes longer, since the pointers have to be deleted and the file contents removed, which takes about as long as copying new files onto the device.

This application differs from the one mentioned in Subsection 1.1. In the previous case, data is typically transmitted in packets over a network, while here there is static access to the data. In the previous situation, the data sometimes consists of small files, for which statistical methods indicating whether the data is encrypted or not become weaker [14], as opposed to the situation here, where there is a large amount of data, possibly of both kinds, encrypted and non-encrypted. In the previous case, with data that is dynamic in all senses (variable in size, access, time and kind), machine learning methods which can pick up on the current circumstances are preferable, while here, in the case of fixed data, fast and efficient methods of change-point detection become a far more advantageous choice. The machine learning methods need training data, while the change-point methods can start from scratch with pre-defined parameter values optimized for the situation, or with very little run-in data for calibrating the levels. The performance of the machine learning methods is still a matter of research, since these are highly structured procedures, sometimes black-box techniques whose properties are impossible to fully evaluate, while the change-point detection methods were optimized long ago and have been thoroughly evaluated with respect to all kinds of performance aspects over a much longer time.
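To make the contrast concrete, the following sketch shows a one-sided CUSUM detector of the kind alluded to above, run over per-cluster byte entropies of a simulated drive image. It is a minimal illustration, not the chapter's actual procedure, and the baseline, slack and threshold values are hypothetical; in practice they would be pre-set or calibrated on a little run-in data, which is precisely the point of the comparison with machine learning.

```python
import math
import os
from collections import Counter

def entropy(block: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())

def cusum_upward(scores, baseline=6.0, slack=0.5, threshold=3.0):
    """One-sided CUSUM: accumulate excesses of the score over
    baseline + slack and raise an alarm when the cumulative sum
    exceeds the threshold. Returns the index of the first alarm,
    or None if no change is detected."""
    s = 0.0
    for i, x in enumerate(scores):
        s = max(0.0, s + (x - baseline - slack))
        if s > threshold:
            return i
    return None

# Simulated drive: 50 plaintext-like clusters followed by 50 random
# ("encrypted-like") clusters of roughly 4 KiB each.
clusters = [b"lorem ipsum dolor sit amet " * 152 for _ in range(50)]
clusters += [os.urandom(4096) for _ in range(50)]
alarm = cusum_upward([entropy(c) for c in clusters])
```

The detector needs no training data: low-entropy plaintext clusters keep the statistic at zero, and the alarm is raised within a few clusters of the true change at index 50.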
