## **2. Method**

#### **2.1 Description**

If encrypted data are not uniformly distributed, the cryptosystem used to encipher them has a bias and can therefore be attacked. For this reason, the characters of the ciphertext produced by any modern high-quality cryptosystem are uniformly distributed [15, 16], i.e. the byte values of the ciphertext are uniformly distributed over the character range. Indeed, a cryptosystem without this property would be weak, since the bias could be exploited (distinguishing attacks, such as those on the RC4 encryption algorithm). Other types of files do not possess this feature, although the contents of some types of files come close to being uniformly distributed. The files coming closest without being encrypted are compressed/zipped files: in terms of the distribution of their byte values, these are very close to cipher files. Albeit small, there is a difference in distribution that makes it possible to tell compressed files and encrypted files apart.

One technique to quantify this distribution difference is to use the chi-square statistic, more or less performing a chi-square test (see [17]) to tell whether the data are suspiciously uniform or not. Another classical method is to calculate the Kolmogorov-Smirnov distance; see e.g. [1, 2] for a more extensive account of ways to indicate whether data are encrypted. Procedures building on such preprocessors can then be defined.
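To make the two indicators concrete, here is a minimal sketch that computes both a chi-square p-value and a Kolmogorov-Smirnov distance for a block of bytes; it assumes NumPy and SciPy are available, and the function name and the eight-class split are illustrative choices, not taken from [17] or [1, 2].

```python
import numpy as np
from scipy import stats

def uniformity_scores(data: bytes, n_classes: int = 8):
    """Chi-square p-value and KS distance to uniformity for byte data."""
    values = np.frombuffer(data, dtype=np.uint8)
    # Chi-square test: observed class counts vs. equal expected counts.
    counts = np.bincount(values // (256 // n_classes), minlength=n_classes)
    _, chi2_p = stats.chisquare(counts)
    # KS distance between the empirical CDF and Uniform(0, 256).
    ks_stat, _ = stats.kstest(values, stats.uniform(loc=0, scale=256).cdf)
    return chi2_p, ks_stat

# Example: random (encryption-like) bytes should give a moderate p-value
# and a small KS distance; a text file gives a tiny p-value instead.
print(uniformity_scores(np.random.default_rng(0).bytes(4096)))
```

A very small p-value or a large KS distance indicates a clear deviation from uniformity, typical of most non-encrypted data; encrypted and compressed data both tend to pass such tests, which is why the finer modeling below is needed.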

One attempt to exploit this distribution difference using the chi-square statistic is [18], where the discrimination between encrypted and non-encrypted (most critically, compressed) data is automated into a method with impressive performance. Another approach could be to use anomaly detection from the theory of machine learning to develop an adaptive method. The proposed method, however, stems from statistical change-point detection. The anticipated advantage is the mathematically proven optimality of these methods in terms of efficiency and accuracy under the given assumptions. In many applications, the assumption of entirely relying on the in-control and out-of-control distributions of the data is a subject of great controversy. Here, the situation is different, since a hard drive and its data are not a dynamic system whose content changes its distribution owing to external time-dependent circumstances.

Before going into detail about these methods, the preprocessing techniques are introduced.


#### **2.2 Distribution of encrypted data**

The working hypothesis is that the data (i.e. characters) constituting encrypted files are uniformly distributed, while the data of non-encrypted files are not (their distribution depends on the type of file). The goal is to be able to tell an encrypted file from a non-encrypted one.

Let us assume the data consist of characters divided into clusters, $c\_1, c\_2, c\_3, \dots$ with *N* characters in each cluster. The characters range over some alphabet of possible forms. Merging these forms into *K* classes, the counts $O\_{kt}$ of occurrences of class *k* characters in cluster *t* are observed. One method of measuring distribution agreement is by means of a $\chi^2$ test statistic, $Q\_t = \sum\_{k=1}^{K} (O\_{kt} - E\_{kt})^2 / E\_{kt}$, where $E\_{kt}$ is the expected count of occurrences of characters in class *k*, cluster *t*. Under the hypothesis of uniformly distributed characters, the expected count within each class is $E\_{kt} = N/K$, and the statistic simplifies to

$$Q\_t = \frac{K}{N} \sum\_{k=1}^{K} O\_{kt}^2 - N$$

the values of which are henceforth referred to as *Q* scores. Large *Q* scores indicate deviance from the corresponding expected frequencies $E\_{kt}$. The smallest possible *Q* score, 0, is attained when $O\_{kt} = E\_{kt}$ for all *k*. For the statistic to be relevant, the expected value $E\_{kt}$ in each class should not be smaller than 5 (5 is a commonly used value; other values like 2 or 10 are sometimes used depending on how robust the test statistic should be to small deviances in tail probabilities). Therefore at least 5*K* bytes of data should be used in each cluster (40 bytes for *K* = 8) for the test to be relevant. But the larger the number of bytes in each cluster, the worse the precision in locating encrypted data: if clusters were classified as unencrypted when they were actually encrypted, a large amount of encrypted data might go undetected in the procedure.

Here, the alphabet used was the numbers 0, 1, … , 255 representing the possible byte values of the characters in the data. These numbers were divided into *K* = 8 classes (class 1: byte values in {0, 1, 2, … , 31}, up to class 8: byte values in {224, 225, 226, … , 255}) and clusters of size *N* = 64 bytes, making the expected value in each class $E\_{kt} = 8$. Assuming that encrypted data are uniformly distributed, the *Q* scores based on counts of characters in encrypted data are $\chi^2$ distributed, see **Figure 1**, and since 8 classes were chosen, the number of degrees of freedom is $8 - 1 = 7$.
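As an illustration, a minimal sketch of the *Q*-score computation with these parameters (*K* = 8, *N* = 64) might look as follows; NumPy is assumed, and the function name is an illustrative choice.

```python
import numpy as np

K = 8    # number of classes (byte values 0-31, 32-63, ..., 224-255)
N = 64   # cluster size in bytes, so the expected count per class is N / K = 8

def q_scores(data: bytes) -> np.ndarray:
    """Q score of each N-byte cluster: Q_t = (K / N) * sum_k O_kt^2 - N."""
    values = np.frombuffer(data, dtype=np.uint8)
    n_clusters = len(values) // N
    # Map each byte to its class index 0..K-1 and process cluster by cluster.
    clusters = values[: n_clusters * N].reshape(n_clusters, N) // (256 // K)
    scores = np.empty(n_clusters)
    for t, classes in enumerate(clusters):
        counts = np.bincount(classes, minlength=K).astype(float)
        scores[t] = (K / N) * np.sum(counts ** 2) - N
    return scores
```

Under the uniformity hypothesis, each score is approximately $\chi^2(7)$ distributed, matching **Figure 1**.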

#### **2.3 Distribution of non-encrypted data**

For non-encrypted data, the distribution is more complicated: each type of file has its own distribution. Consequently, the standardized squared deviances from the counts expected under the uniform-distribution assumption are larger, and so are the *Q* scores of the $\chi^2$ statistic. However, two problems emerge. Firstly, the size of these increased deviances depends on the type of data (is the data a text file, an image, a compiled program, a compressed file or some other kind of file?), so how should this information be properly taken into account? Secondly, what is the distribution of the *Q* score when the data are not encrypted?

**Figure 1.** *Distribution of the Q scores of encrypted files (obtained by using more than 5000 files) with the distribution function of* $Q \in \chi^2(7)$*.*

To develop a method for distinguishing between encrypted and non-encrypted data, it is sufficient to focus on the non-encrypted data most similar to encrypted data, which turns out to be compressed data. Other types of files, such as images, compiled programs etc., commonly render higher *Q* scores and are therefore indirectly distinguished from encrypted data by a method calibrated for discriminating between encrypted and compressed data. The second question is not readily answered. Rather, we suggest modeling the *Q* score as scaled $\chi^2$ distributed, i.e. the *Q* score is assumed to have the distribution of the random variable $\alpha X$ where $\alpha > 1$ and $X \in \chi^2$. The validity of this approach is sustained by an empirical evaluation based on more than 5000 compressed files. The resulting empirical distribution of their *Q* scores, together with the distribution of $\alpha X$ where *X* is $\chi^2$ distributed and $\alpha = 1.7374$ was estimated by the least squares method, is plotted in **Figure 2**.
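A least-squares estimate of the scale $\alpha$ could be obtained along the following lines; this is a sketch assuming SciPy, and the loss function, search bounds and function name are illustrative choices rather than the authors' exact fitting procedure. It minimizes the squared distance between the empirical CDF of the compressed-file *Q* scores and the CDF of $\alpha X$.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

def fit_alpha(compressed_q_scores: np.ndarray, df: int = 7) -> float:
    """Least-squares fit of alpha in Q ~ alpha * chi2(df) against the ECDF."""
    q = np.sort(compressed_q_scores)
    ecdf = np.arange(1, len(q) + 1) / len(q)

    def loss(alpha: float) -> float:
        # Squared distance between the empirical CDF and P(alpha * X <= q).
        return float(np.sum((ecdf - stats.chi2.cdf(q / alpha, df)) ** 2))

    return minimize_scalar(loss, bounds=(1.0, 5.0), method="bounded").x
```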

#### **2.4 Change-point detection**

Change-point detection [19–23] is a field of mathematical statistics whose object is to quickly and accurately detect a shift in distribution from on-line observation of a random process. This can be done actively (stop collecting data as soon as a shift is detected) or passively (continue collecting data after a shift is detected in order to detect further shifts). Here, passive on-line change-point detection was used to detect whether the data from an HDD shift from non-encrypted to encrypted and vice versa. A change-point detection method is a stopping rule


**Figure 2.** *Distribution of the Q scores of compressed files (obtained by using more than 5000 files) with the distribution function of* $Q = 1.7374 \cdot X$ *where* $X \in \chi^2(7)$*.*

$$\tau = \inf\{t > 0 : a\_t > C\}$$

where $a\_t$ is called the alarm function and *C* the threshold. The design of the alarm function defines the different change-point detection methods, while the value of the threshold reflects the degree of sensitivity of the stopping rule. The alarm function may be based on the likelihood ratio

$$L(s, t) = \frac{f\_{\mathbf{Q}\_t}(q\_t \mid \theta = s \le t)}{f\_{\mathbf{Q}\_t}(q\_t \mid \theta > t)}$$

where $f\_{\mathbf{Q}\_t}(q\_t \mid A)$ is the conditional joint density function of the random variables $(Q\_1, \dots, Q\_t) = \mathbf{Q}\_t$ given *A*, and where $q\_t$ is the vector of the observed values of $\mathbf{Q}\_t$. Assuming independence of the variables $Q\_1, \dots, Q\_t$, the likelihood ratio simplifies to

$$L(s, t) = \prod\_{u=s}^{t} \frac{f\_1(q\_u)}{f\_0(q\_u)}$$

where $f\_0(q\_u)$ is the marginal conditional density function of $Q\_u$ given that the shift has not occurred by time *u*, and $f\_1(q\_u)$ is the marginal conditional density function of $Q\_u$ given that the shift occurred at or before time *u*.

The conditional density function of the *Q* score at time *t* given that the data is encrypted (i.e. uniformly distributed) is


$$f\_E(q\_t) = \frac{q\_t^{k/2 - 1} e^{-q\_t/2}}{2^{k/2} \Gamma(k/2)}$$

where *k* is the number of degrees of freedom, i.e. the number of classes minus one (in this study $k = 8 - 1 = 7$, as explained above).

For non-encrypted files, the conditional density function of the *Q* score is modeled by that of $\alpha X$ where $X \in \chi^2(k)$ and $\alpha > 1$, reflecting the inflated deviances relative to what would be expected had the data been encrypted. Thus

$$f\_{NE}(q\_t) = \frac{\partial}{\partial q\_t} \mathbb{P}(\alpha X < q\_t \: \mid \: X \in \chi^2(k))$$

$$= \frac{\left(\frac{q\_t}{\alpha}\right)^{k/2 - 1} e^{-q\_t/2\alpha}}{\alpha\, 2^{k/2} \Gamma(k/2)}$$

is the density function of the *Q* score of non-encrypted data. This means that two cases of shift in distribution are possible (both densities and the closed-form likelihood ratios are sketched in code after the two cases):

• Shift from non-encrypted to encrypted data in which case

$$L(s, t) = \prod\_{u=s}^{t} \frac{f\_E(q\_u)}{f\_{NE}(q\_u)} = \alpha^{k(t-s+1)/2} \exp\left(-\frac{\alpha-1}{2\alpha} \sum\_{u=s}^{t} q\_u\right).$$

• Shift from encrypted to non-encrypted data in which case

$$L(s,t) = \prod\_{u=s}^{t} \frac{f\_{NE}(q\_u)}{f\_E(q\_u)} = \alpha^{-k(t-s+1)/2} \exp\left(\frac{\alpha-1}{2\alpha} \sum\_{u=s}^{t} q\_u\right).$$
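The densities and the closed-form ratios above translate directly into code. The following sketch assumes NumPy and SciPy, uses $k = 7$ and the estimated $\alpha = 1.7374$, and its names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

K_DF = 7        # degrees of freedom k (8 classes - 1)
ALPHA = 1.7374  # scale estimated for compressed data in Section 2.3

def f_E(q):
    """Density of the Q score for encrypted data: chi2(k)."""
    return chi2.pdf(q, K_DF)

def f_NE(q):
    """Density of the Q score for non-encrypted data, i.e. of alpha * X."""
    return chi2.pdf(q / ALPHA, K_DF) / ALPHA

def lr_ne_to_e(q_window: np.ndarray) -> float:
    """Closed-form L(s, t) over a window q_s, ..., q_t for a shift NE -> E."""
    m = len(q_window)
    return ALPHA ** (K_DF * m / 2) * np.exp(
        -(ALPHA - 1) / (2 * ALPHA) * np.sum(q_window)
    )

# Sanity check: the closed form agrees with the product of density ratios.
q = chi2.rvs(K_DF, size=5, random_state=np.random.default_rng(1))
assert np.isclose(lr_ne_to_e(q), np.prod(f_E(q) / f_NE(q)))
```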

To detect whether a shift in distribution has occurred according to the stopping rule *τ* mentioned above, an alarm function must be specified. Two of the most common choices here are:

• the CUSUM alarm function, $a\_t = \max\_{1 \le s \le t} \ln L(s, t)$;

• the Shiryaev alarm function, here of the form $a\_t = \sum\_{s=1}^{t} L(s, t)$.
Other possible choices are e.g. the Shewhart method, the Exponentially Weighted Moving Average (EWMA) and the full Likelihood Ratio method (LR); see e.g. [20] for a more extensive presentation of different methods.

For the CUSUM alarm function, since $\arg\max\_{1 \le s \le t} L(s, t) = \arg\max\_{1 \le s \le t} \ln L(s, t)$, the alarm function is simplified without any loss of generality by using the log-likelihood values instead.

For both cases, the alarm functions can be expressed recursively, which facilitates collecting and treating the data, as follows.

The alarm function for a shift from non-encrypted to encrypted data for the

• CUSUM method is

$$a\_t = \begin{cases} 0 & \text{for } t = 0\\ \max\left(a\_{t-1} + \frac{k \ln \alpha}{2}, 0\right) + \frac{1 - \alpha}{2\alpha} q\_t & \text{for } t = 1, 2, 3, \dots \end{cases}$$

• Shiryaev method is

$$a\_t = \begin{cases} 0 & \text{for } t = 0\\ \alpha^{k/2} e^{\frac{1-\alpha}{2\alpha}q\_t} (1 + a\_{t-1}) & \text{for } t = 1, 2, 3, \dots \end{cases}$$


The alarm function for a shift from encrypted to non-encrypted data for the

• CUSUM method is

$$a\_t = \begin{cases} 0 & \text{for } t = 0\\ \max\left(a\_{t-1} - \frac{k \ln \alpha}{2}, 0\right) + \frac{\alpha-1}{2\alpha} q\_t & \text{for } t = 1, 2, 3, \dots \end{cases}$$

• Shiryaev method is

$$a\_t = \begin{cases} 0 & \text{for } t = 0\\ \alpha^{-k/2} e^{\frac{\alpha-1}{2\alpha}q\_t} (1 + a\_{t-1}) & \text{for } t = 1, 2, 3, \dots \end{cases}$$
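A minimal sketch of the two recursions for a shift from non-encrypted to encrypted data follows; the opposite direction is obtained by negating the log-likelihood increment and inverting the single-observation ratio, as in the formulas above. Function names and the stopping behavior are illustrative.

```python
import numpy as np

K_DF, ALPHA = 7, 1.7374  # degrees of freedom k and scale alpha

def cusum_ne_to_e(q_scores, threshold):
    """CUSUM recursion for a shift NE -> E; returns the alarm time or None."""
    a = 0.0
    for t, q in enumerate(q_scores, start=1):
        a = max(a + K_DF * np.log(ALPHA) / 2, 0.0) + (1 - ALPHA) / (2 * ALPHA) * q
        if a > threshold:
            return t
    return None

def shiryaev_ne_to_e(q_scores, threshold):
    """Shiryaev recursion a_t = Lambda_t * (1 + a_{t-1}) for the same shift."""
    a = 0.0
    for t, q in enumerate(q_scores, start=1):
        a = ALPHA ** (K_DF / 2) * np.exp((1 - ALPHA) / (2 * ALPHA) * q) * (1 + a)
        if a > threshold:
            return t
    return None
```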

## **3. Results**

#### **3.1 Evaluation**

To quantify the quality of different methods, their performance is compared with respect to relevant properties such as the time until a false alarm, the delay of a motivated alarm, the credibility of an alarm, and so on. The threshold is commonly calibrated against the Average Run Length $ARL\_0$, defined as the expected time until an alarm when no parameter shift has occurred (i.e. a false alarm). It is crucial to have the right threshold values for the methods to perform as specified. Setting the threshold such that $ARL\_0$ is 100, 500, 2500 and 10000 respectively (the most common values are $ARL\_0 = 100$ and 500, but the higher values are also considered since $ARL\_0$ defines the number of clusters/time points treated before a false alarm, and the shift could occur very far into the HDD), properties of the methods regarding the delay and credibility of a motivated alarm can be compared. Of course, a low threshold will lead to more false alarms (detection of a change when there is none), but setting the threshold too high will lead to a drop in the sensitivity of the method (a higher delay between a shift and its detection) and consequently an increased probability of missing a real shift in distribution.

Usually the expected delay, $ED(\nu) = E\_\nu(\tau - \theta \mid \theta < \tau)$ (the expectation of the delay of a motivated alarm; see **Figure 1**), or the Conditional Expected Delay, $CED(t) = E(\tau - \theta \mid \tau > \theta = t)$ (the expectation of the delay when the change point is fixed equal to *t*), are very important because, for example in health care, the goal is to detect problems quickly in order to save lives (**Table 1**).

However, in the case of detecting encrypted code, expected delays are less relevant as a measure of performance since the data can be handled without any time aspect: the goal is to detect accurately where the encrypted data is located. A method with high expected or conditional expected delay merely means a slightly less efficient procedure.

A more relevant performance indicator in this case is, for instance, the predictive value $PV = P(\theta < \tau)$ (the probability that the change-point has occurred when the method signals an alarm; see **Figure 5**), or the percentage of encrypted files detected while running the process and how to improve it (see **Figure 2**).

While running the process, the method will stop at some time *τ* and then estimate the change point *θ* by maximizing the likelihood function over the data between the last previous alarm and the newly detected change-point. This estimated change-point, $\hat{\theta}$, can lie either before or after the true change-point *θ*. One could enlarge the intervals where encrypted data were discovered; this would lead to missing less encrypted data (see **Figure 2**) but also to brute-forcing more non-encrypted data (**Table 2**).

Therefore the difference between the change-points and the alarms of the method is calculated. Since the proportion of encrypted data relative to the total amount of data on the HDD is unknown, the expected proportion of error is suggested. That is, given two change-points, $\theta\_1$ and $\theta\_2$, and two stopping times, $\tau\_1$ and $\tau\_2$, the expected proportion of error is $E\left(\frac{|\tau\_1 - \theta\_1| + |\tau\_2 - \theta\_2|}{\theta\_2 - \theta\_1}\right)$. However, this value is meaningful only when there are no false alarms between $\tau\_1$ and $\tau\_2$.

If there are false alarms between $\tau\_1$ and $\tau\_2$, the proportion of undetected encrypted data is added to the proportion of error to determine the proportion of error relative to the size of the encrypted data. Assuming that there are *n* false alarms $\tau'\_1 < \dots < \tau'\_n$ in $[\tau\_1, \tau\_2]$, the resulting measure, called the *expected inaccuracy*, or *EI* for short, is defined as follows.

$$EI = E\left(\frac{|\tau\_1 - \theta\_1| + |\tau\_2 - \theta\_2|}{\theta\_2 - \theta\_1} + \sum\_{i=1}^{n/2} \frac{\tau\_{2i}' - \tau\_{2i-1}'}{\theta\_2 - \theta\_1}\right) \tag{1}$$
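In a simulation, the quantity inside the expectation of Eq. (1) can be computed per run and averaged over runs to estimate *EI*. A sketch follows, with illustrative names; as in Eq. (1), the false alarms are assumed to come in pairs delimiting wrongly classified segments.

```python
import numpy as np

def inaccuracy(theta1, theta2, tau1, tau2, false_alarms):
    """One realization of the quantity inside Eq. (1); EI is its average
    over simulated runs. `false_alarms` are the times tau'_1 < ... < tau'_n
    in [tau1, tau2]."""
    span = theta2 - theta1
    err = (abs(tau1 - theta1) + abs(tau2 - theta2)) / span
    fa = np.sort(np.asarray(false_alarms))
    # Each consecutive pair (tau'_{2i-1}, tau'_{2i}) bounds a wrongly
    # classified segment whose length is added relative to the span.
    for i in range(0, len(fa) - 1, 2):
        err += (fa[i + 1] - fa[i]) / span
    return err
```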

The *EI* was measured for different values of the parameter *ν* of the geometric distribution of the change-points, for the different methods (see **Table 3** and **Figure 3**).


**Table 1.**

*Values of expected delays ED for the CUSUM and Shiryaev methods for values of* $ARL\_0 = 100, 500, 2500, 10000$ *for a shift from encrypted to compressed data.*



**Table 2.**

*Percentage of encrypted files that are detected when the interval of detected change points* $[\tau\_1, \tau\_2]$ *is arbitrarily replaced by* $[\tau\_1 - i, \tau\_2 + i]$*. Typically, with large i, the values are very close to but not exactly equal to 1. This happens when the change points are very close (less than 10 clusters apart, for example) and the method does not detect any change; then no ciphertext is detected at all.*


**Table 3.**

*Expected inaccuracy, EI [see Eq. (1)], for different values of the parameter ν and for the two methods. For the application of discriminating between encrypted and non-encrypted data on a hard drive, this parameter may be considered to reflect the degree of fragmentation of the disk: the less fragmentation, the farther apart the change-points and thus the smaller the ν value, and vice versa.*

#### **3.2 Complete procedure**

The complete procedure returns a segmentation separating suspiciously encrypted data from most likely non-encrypted data on an HDD; this information is then used to target the brute-force cryptanalysis efficiently. The procedure runs a likelihood-ratio-based change-point detection method and, as soon as it detects a change, calculates the maximum likelihood estimator of the change-point to determine where the change-point is most likely located. It then starts over from the location of this estimated change-point with the same on-line change-point detection method, except that the likelihood ratio is reversed, modifying the alarm function to fit the opposite change-point situation, and so on.

**Figure 3.**

*Expected inaccuracy EI for the CUSUM (blue graph) and Shiryaev (red graph) procedures. One can see that the Shiryaev procedure is a little less accurate as ν increases but slightly more accurate for small ν than the CUSUM procedure.*
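A compact sketch of this alternating procedure with the CUSUM recursion could look as follows. It estimates each change-point as the last time the CUSUM statistic was at zero before the alarm (the maximizer of the log-likelihood ratio); the names, the starting assumption of non-encrypted data, and the restart policy are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

K_DF, ALPHA = 7, 1.7374

def log_lr_increment(q: float, to_encrypted: bool) -> float:
    """ln(f1(q)/f0(q)) for the current shift direction."""
    inc = K_DF * np.log(ALPHA) / 2 + (1 - ALPHA) / (2 * ALPHA) * q
    return inc if to_encrypted else -inc

def segment(q_scores, threshold):
    """Alternate CUSUM passes over the Q scores, estimating each change
    point as the last zero of the statistic before the alarm."""
    changes = []
    to_enc = True                     # assume the drive starts non-encrypted
    t, a, cand = 0, 0.0, 0
    while t < len(q_scores):
        if a <= 0.0:
            a, cand = 0.0, t          # statistic at zero: candidate change point
        a += log_lr_increment(q_scores[t], to_enc)
        if a > threshold:             # alarm: cand estimates the change point
            changes.append((cand, "NE->E" if to_enc else "E->NE"))
            to_enc = not to_enc       # reverse the likelihood ratio
            a, t = 0.0, cand          # restart from the estimated change point
        t += 1
    return changes
```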

#### **3.3 Thresholds and experimental values**

The first step is to determine the thresholds rendering $ARL\_0 = 100$, 500, 2500 and 10000, for both the CUSUM and Shiryaev methods, for a shift from encrypted to non-encrypted data and vice versa. The change-points are commonly modeled as geometrically distributed with parameter *ν*. Here the average time before a change-point is expected to be rather high (several hundred or thousand clusters, perhaps), as the method deals with 64-byte clusters on an HDD of, surely, several hundred gigabytes or terabytes. Thus, since $E(\theta) = 1/\nu$, the focus is on very small values of *ν* for the methods to be reasonably sensitive.

Common values of $ARL\_0$ are 100 or 500, to make other properties relevant for comparisons. In this application, however, large values of $ARL\_0$ such as 2500 and 10000 are also studied, because the first change-point might not occur until far into the HDD. Adjusting the threshold by simulating data can take very long if $ARL\_0$ is large (2500 or 10000, especially for the Shiryaev method); computing the threshold can take several hours or even days. Therefore, a way of predicting the threshold by extrapolation, i.e. an explicit relation between $ARL\_0$ and the threshold *C*, is desirable (a simulation-based calibration is sketched at the end of this section). Intuitively, a larger $ARL\_0$ means that more data is taken into account, and the threshold increases correspondingly. In the CUSUM case, since the alarm function is defined as a log-likelihood ratio, this results in a threshold logarithmic in $ARL\_0$ (**Figures 4** and **5**; **Tables 4** and **5**):


• for a shift from compressed data to encrypted data:

$$C = 0.965524 \cdot \ln\left(0.030655 \cdot ARL\_0 + 0.494603\right)$$

• for a shift from encrypted data to compressed data:

$$C = 0.997767 \cdot \ln\left(0.912316 \cdot ARL\_0 + 7.294950\right)$$

**Figure 4.**

*Expected delays ED for a shift from encrypted to compressed data for the CUSUM procedure (blue) and the Shiryaev procedure (red).*

**Figure 5.**

*Predictive values for a shift from compressed to encrypted data for the CUSUM procedure (left) and the Shiryaev procedure (right).*


**Table 4.**

*Values of the thresholds for the CUSUM and Shiryaev methods for* $ARL\_0 = 100, 500, 2500, 10000$*, specified for detecting a shift from non-encrypted to encrypted data (NE → E for short) and a shift from encrypted to non-encrypted data (E → NE), respectively.*


**Table 5.**

*Predictive value* $PV(\nu) = P\_\nu(\theta < \tau)$*, i.e. the probability that a shift has occurred when an alarm is signaled, for the CUSUM and Shiryaev methods, for values of* $ARL\_0 = 100, 500, 2500, 10000$ *and for different values of the parameter ν of the geometric distribution of the change-points.*

For Shiryaev, the threshold is a linear function of $ARL\_0$:

• for a shift from compressed data to encrypted data:

$$C = 0.647578 \cdot ARL\_0 - 0.726563$$

• for a shift from encrypted data to compressed data:

$$C = 0.444214 \cdot ARL\_0 + 0.294281$$
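When no closed-form relation is available, thresholds rendering a target $ARL\_0$ can be calibrated by simulation, as described above. The following sketch, assuming NumPy and SciPy (sample sizes, search bounds and names are illustrative), bisects on *C* for the NE → E CUSUM under the in-control compressed-data model $Q \sim \alpha \chi^2(k)$.

```python
import numpy as np
from scipy.stats import chi2

K_DF, ALPHA = 7, 1.7374
rng = np.random.default_rng(0)

def run_length(threshold: float, n_max: int = 100_000) -> int:
    """Time until a false NE -> E CUSUM alarm when Q stays alpha * chi2(k)."""
    a = 0.0
    for t in range(1, n_max + 1):
        q = ALPHA * chi2.rvs(K_DF, random_state=rng)
        a = max(a + K_DF * np.log(ALPHA) / 2, 0.0) + (1 - ALPHA) / (2 * ALPHA) * q
        if a > threshold:
            return t
    return n_max

def calibrate_threshold(target_arl0: float, n_runs: int = 200,
                        lo: float = 0.1, hi: float = 20.0) -> float:
    """Bisect on C until the simulated average run length matches ARL_0."""
    for _ in range(20):
        mid = (lo + hi) / 2
        arl = np.mean([run_length(mid) for _ in range(n_runs)])
        # A larger threshold yields a longer average run length.
        lo, hi = (mid, hi) if arl < target_arl0 else (lo, mid)
    return (lo + hi) / 2
```

Since the run-length distribution is heavy-tailed, large `n_runs` values are needed for stable estimates, which is exactly the computational burden the extrapolation formulas above are meant to avoid.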
