**2. Speech scrambling system**

Speech scrambling seeks to perform a completely reversible operation on a portion of speech, so that it is totally unintelligible to an unauthorized listener. The most important criteria used to evaluate speech scramblers are:


Cryptographers face the problem of designing scrambling systems which distort the very redundant speech signal to the extent that useful information cannot be recovered. The encryption process must remain secure when subject to the powerful information processing structures of the human auditory system and to knowledge-based automated cryptanalytic processes.

There are two fundamentally distinct approaches to achieving voice security in speech communication systems: digital ciphering and analog scrambling. In spite of significant progress in digital speech processing technology, analog speech scramblers continue to be important for achieving privacy in many types of voice communication (Gersho & Steele, 1984), due to the desire for secure communication over existing channels with standard telephone bandwidth at acceptable speech quality and reasonable cost.

To make the distinction between analog and digital speech encryption devices, the following definitions can be considered. Analog scramblers produce scrambled speech which is an analog signal occupying the same bandwidth as the original speech. Analog or digital signal processing may be used to generate this signal. Digital speech encryption systems digitize and compress the input speech in order to obtain a digital representation at a bit rate suitable for the communications channel to be used. The resulting bit stream is encrypted using well-known data encryption techniques. The ability of digital encryption schemes to compete with the well-established analog scramblers depends on the quality of the speech compression algorithms used. The speech quality resulting from contemporary compression schemes is rapidly improving (Sakurai, et al., 1984).

Analog speech scrambling experienced a metamorphosis as a result of the development and release of very high speed signal processing hardware. Analog scrambling algorithms which were impractical due to their complex nature are now being implemented in real time using this technology.

One family of analog scramblers that has shown a great deal of promise is the transform domain scrambler. These scramblers operate on speech which has been sampled and digitized. The sampled speech is partitioned into frames of equal length, each containing *N* speech samples. A chosen transformation is then performed on each frame to yield a transform vector with *N* components. Encryption is achieved by permuting these transform components within the vector before the inverse transform is applied to return the components to the time domain. The encrypted time domain frame is transmitted in place of the original speech frame (Pichler, 1983).

Speech Scrambling Based on Wavelet Transform 45

## **2.1 Secure speech communication**

There are many reasons for a user to hide the meaning of transmitted speech. Secure speech communication refers to masked speech communication. Generally, secure speech communication, shown in Fig. 1, deals with three parts:

- The first part is the transmitter (Tx), which has the ability to produce encrypted speech with low residual intelligibility;
- The second part is the receiver (Rx), which recovers from the encrypted speech a signal that is as near as possible to the original speech signal.
- The third part is the eavesdropper, which attacks the communication system according to many available methods (Lee, 1985).

Fig. 1. Block diagram of secure speech communication

The first and second parts use a secure communication channel, while the eavesdropper tries to destroy the security of the communication system. If the system is destroyed, the receiver may lose the ability to get the transmitted signal. In this case, the first and second parts must try to find another algorithm for the communication system, one which is more secure and more difficult for the third part to cryptanalyze.

The worst case is when the first and second parts do not know that the system has been destroyed by the third part. The first part wishes to mask or hide the meaning of the transmitted speech, while the second part can recover it, without allowing the third part to get any meaningful speech. Almost all speech security systems reduce (at least to some extent) the audio quality of a voice transmission. Security will not be enhanced if the link has been so badly degraded that the same message has to be repeated a number of times.

## **2.2 Analog speech scrambling**

In analog speech scrambling, the only real analog operation is signal transmission, since the signal processing is carried out digitally. Incoming speech signals are digitized using an analog to digital converter (ADC), then processed by a special scrambling algorithm, converted back to analog, and transmitted to a receiver, where they are digitized again, inversely processed (descrambled), and reconverted to analog form for reconstruction, as shown in Fig. 2.

Fig. 2. Block diagram of speech scrambling

There are numerous scrambling methods. The two main processes involve dividing the signal into small time frames and manipulating the frequencies (scrambling in the frequency domain). The descrambler block descrambles the input signal, which must be a scalar or a frame-based column vector. The descrambler block is the inverse of the scrambler block (Mascarin, 2000). The main attraction of this method arises from the fact that it can be used with the existing analog telephone, H.F., satellite and mobile communication systems, provided the encrypted signal occupies the same bandwidth as the original speech signal (Goldburg, et al., 1991).

#### **3. Wavelet based speech scrambler**

The analog scrambling process, which employs a transformation of the input speech to facilitate encryption, can best be described using matrix algebra. Let us consider the vector *x* which contains *N* speech time samples obtained from the A/D conversion process, representing a frame of the original speech signal. Let this speech sample vector *x* be subject to an orthogonal transformation matrix *F* such that:

$$
u = F \cdot x \tag{1}
$$

This transformation results in a new vector *u* made up of transform coefficients. A permutation matrix *P* is applied to *u*, such that each transform coefficient is moved to a new position within the vector, given by:

$$
v = P \cdot u \tag{2}
$$

A scrambled speech vector *y* is obtained by returning vector *v* to the time domain using the inverse transformation *F*<sup>-1</sup>, where:

$$y = F^{-1} \cdot v \tag{3}$$

Descrambling, or recovery of the original speech vector *x'*, is achieved by first transforming *y* back to the transform domain. The inverse permutation matrix *P*<sup>-1</sup> is then used to return the transform coefficients to their original positions. Finally, the resulting transform vector is returned to the time domain by multiplying by *F*<sup>-1</sup>:

$$\mathbf{x}' = F^{-1} \cdot P^{-1} \cdot F \cdot y \tag{4}$$
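The chain of Eqs. (1) to (4) can be illustrated with a short numerical sketch. It is only an illustration under stated assumptions: an orthonormal DCT-II matrix stands in for *F* (the chapter itself uses wavelet transforms), the frame length of 8 is chosen for readability, and the helper names `dct_matrix` and `permutation_matrix` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, used here only as a stand-in for F."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    f = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    f[0, :] /= np.sqrt(2.0)
    return f

def permutation_matrix(perm):
    """Matrix P such that (P @ u)[perm[i]] == u[i]."""
    n = len(perm)
    p = np.zeros((n, n))
    p[perm, np.arange(n)] = 1.0
    return p

rng = np.random.default_rng(0)
N = 8                                   # assumed frame length for the demo
x = rng.standard_normal(N)              # one frame of "speech" samples
F = dct_matrix(N)
P = permutation_matrix(rng.permutation(N))

u = F @ x                               # Eq. (1): transform coefficients
v = P @ u                               # Eq. (2): permuted coefficients
y = F.T @ v                             # Eq. (3): F^{-1} = F^T for orthogonal F
x_rec = F.T @ P.T @ F @ y               # Eq. (4): P^{-1} = P^T for a permutation
```

Because both *F* and *P* are orthogonal in this sketch, their inverses are simply their transposes, which is why no explicit matrix inversion appears.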


The transform domain scrambling process outlined above requires the transform matrix *F* to have an inverse. One attempts to ensure that the scrambling transformation *T* = *F*<sup>-1</sup>·*P*·*F* is orthogonal. The inverse transformation *T*<sup>-1</sup> will then also be orthogonal. This property is useful since any noise added to the scrambled signal during transmission will not be enhanced by the descrambling process, as shown in Fig. 3. The scrambled speech sequence is given by (Goldberg, et al., 1993):

$$y = F^{-1} \cdot P \cdot F \cdot x = T \cdot x \tag{5}$$

At most, *N* elements can be permuted in the transform-based scrambling process. It is important to note that, for a given sampling frequency, *N* determines the delay introduced by the scrambling device, so there is a tradeoff between system delay and security. *N* is usually chosen to be equal to 256. In practice, only *M* of the *N* transform coefficients are permuted, giving *M*! possible coefficient arrangements. This restriction stems from the requirement that the scrambled speech should occupy the same bandwidth as the original speech. If the transform components (for example, those of biorthogonal wavelets) have a frequency representation, those lying outside the allowable band are set to zero and the remainder are permuted. Methods for generating all *M*! possible permutations have been addressed (Bopardikar, 1995). The permutations must be carefully screened to ensure that components undergo a significant displacement from their original position in the vector. In addition, components which were adjacent in the original vector should be separated in the scrambled vector.
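The band restriction described above can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the choice of which coefficient indices count as "in band" (`band`), the demo vector length of 16, and the function names are all assumptions. The frame delay is computed for *N* = 256 at the 8 kHz sampling frequency used later in the chapter, and covers the buffering of one frame only.

```python
import numpy as np

def scramble_in_band(u, band, perm):
    """Zero the coefficients outside `band` and permute the M inside it.

    band: integer indices of the M in-band coefficients (assumed known).
    perm: permutation of range(M); in-band coefficient j moves to slot perm[j].
    """
    v = np.zeros_like(u)
    v[band[perm]] = u[band]
    return v

def descramble_in_band(v, band, perm):
    """Undo scramble_in_band; out-of-band coefficients stay at zero."""
    u = np.zeros_like(v)
    u[band] = v[band[perm]]
    return u

rng = np.random.default_rng(2)
band = np.arange(2, 12)                 # hypothetical allowable band, M = 10
perm = rng.permutation(len(band))
u = rng.standard_normal(16)             # demo transform vector
v = scramble_in_band(u, band, perm)
u_rec = descramble_in_band(v, band, perm)

# Buffering delay of one frame: N samples at sampling frequency fs.
N_FRAME, FS = 256, 8000
frame_delay_ms = 1000.0 * N_FRAME / FS
```

With these values, `frame_delay_ms` evaluates to 32 ms, which is the kind of delay that a larger *N* (and hence a larger permutation space) would increase.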

If it is assumed that, during its passage over the communications channel, a noise component *μ* is added to *y*, then we have

$$y' = y + \mu \tag{6}$$

where *y'* is the signal observed by the receiver. The inverse scrambling transformation is then applied to *y'* in order to descramble and recover the original sequence *x*:

$$x' = T^{-1} \cdot y' = x + T^{-1} \cdot \mu \tag{7}$$

Now, since *T*<sup>-1</sup> is orthogonal, it is norm preserving:

$$\|T^{-1} \cdot \mu\| = \|\mu\|$$

This implies that the noise energy is not enhanced as a result of the descrambling process.
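The norm-preservation argument of Eqs. (5) to (7) can be checked numerically. The sketch below assumes a random orthogonal matrix (obtained from a QR decomposition) as a stand-in for *F*, a frame length of 16, and Gaussian channel noise; none of these choices comes from the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16                                           # assumed demo frame length

# Random orthogonal F from a QR decomposition (stand-in for the transform).
F, _ = np.linalg.qr(rng.standard_normal((N, N)))
P = np.eye(N)[rng.permutation(N)]                # permutation matrix
T = F.T @ P @ F                                  # Eq. (5), with F^{-1} = F^T

x = rng.standard_normal(N)                       # original frame
mu = 0.1 * rng.standard_normal(N)                # channel noise (assumed Gaussian)
y_noisy = T @ x + mu                             # Eq. (6)
x_rec = T.T @ y_noisy                            # Eq. (7), with T^{-1} = T^T

noise_after = x_rec - x                          # equals T^{-1} mu
energy_in = np.sum(mu ** 2)
energy_out = np.sum(noise_after ** 2)
```

In this sketch `energy_out` matches `energy_in` up to rounding error, which is the numerical counterpart of the statement that descrambling does not enhance the channel noise.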

#### **3.1 Permutation used in the scrambler**

The number of possible permutations of *N* elements is *N*!. However, not all of these permutations can be used, because some of them do not provide enough security.

Let *P* be a set of permutations, and let *P*<sup>-1</sup> be the set of inverse permutations corresponding to the permutations in *P*. The set *P* has to satisfy the requirement that no permutation in it produces intelligible scrambled speech. It is difficult to evaluate the intelligibility of the scrambled speech signal and of the descrambled speech signal (see Fig. 3) by a quantitative criterion, because intelligibility is substantially a subjective matter.

Fig. 3. Block diagram of speech scrambling

A permutation of *N* elements can be expressed as (Rao & Homer, 2001):

$$P = \begin{bmatrix} 1 & 2 & 3 & \dots & N \\ k_1 & k_2 & k_3 & \dots & k_N \end{bmatrix} \tag{8}$$

Under the permutation *P* on a set of *N* elements *A* = (*e*<sub>1</sub>, *e*<sub>2</sub>, *e*<sub>3</sub>, ..., *e*<sub>*N*</sub>), the element of *A* at the *i*th position is moved to the *k*<sub>*i*</sub>th position, namely,

$$PA = P(e_1, e_2, e_3, \dots, e_N) = (d_1, d_2, d_3, \dots, d_N) \tag{9}$$

where

46 Advances in Wavelet Theory and Their Applications in Engineering, Physics and Technology


$$d_{k_i} = e_i \quad (i = 1, 2, 3, \dots, N) \tag{10}$$

One of the most common and easiest ways of screening permutations is the Hamming distance (HD) between a pair of permutations. The HD is defined as the number of positions in which the digits of the two permutations do not coincide.
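A small sketch may help make Eq. (10) and the Hamming distance concrete. It assumes the 1-based indexing of Eq. (8) for the bottom row *k*, and uses the identity ordering as the reference permutation when measuring displacement; the function names are hypothetical.

```python
def apply_permutation(elements, k):
    """Eq. (10): the element at position i moves to position k[i].

    `k` is the bottom row of Eq. (8), written with 1-based positions.
    """
    d = [None] * len(elements)
    for i, e in enumerate(elements):
        d[k[i] - 1] = e
    return d

def hamming_distance(p, q):
    """Number of positions in which two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

identity = [1, 2, 3, 4]
k_strong = [3, 1, 4, 2]      # every element is displaced
k_weak = [1, 3, 2, 4]        # only two elements are displaced

scrambled = apply_permutation(["a", "b", "c", "d"], k_strong)
hd_strong = hamming_distance(k_strong, identity)
hd_weak = hamming_distance(k_weak, identity)
```

Here `scrambled` is `['b', 'd', 'a', 'c']`, `hd_strong` is 4 and `hd_weak` is 2, so a screening step based on HD alone would prefer `k_strong`. HD does not capture how far each element moves or whether neighbours stay adjacent, so it is only one possible screening criterion.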

#### **3.2 Measures for residual intelligibility and recovered voice quality**

Voice quality of the recovered speech and the residual intelligibility of the encrypted speech are usually judged by subjective quality tests. Unfortunately, these tests take much time and labor, and require a large number of trained listeners. Even though intelligibility is a substantially subjective matter, it is possible to use objective tests which are useful (if not ideal) indicators of intelligibility (Sridharn, et al., 1991).

The objective measures are useful in indicating the residual intelligibility of encrypted speech and the corresponding quality of recovered speech.

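A distance measure is an assignment of a number to an input/output pair of a system. To be useful, a distance measure must possess, to a certain degree, the following properties:

- It must be subjectively meaningful, in the sense that small and large distances must correspond to low and high subjective quality, respectively;
- It must be tractable, in the sense that it is possible to analyze it mathematically and to implement it in some algorithms.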


One use of the distance measures is to evaluate the performance of a speech scrambling system. The signal-to-noise ratio **(SNR)** and the segmental signal-to-noise ratio **(SEGSNR)** are the most common time-domain measures (Gray, et al., 1980) of the difference between original and processed speech signals (scrambled or descrambled speech signals).

### **3.2.1 Signal-to-Noise Ratio**

The signal-to-noise ratio **(SNR)** can be defined as the ratio between the input signal power and the noise power, and is given in decibels **(dB)** as:

$$\text{SNR} = 10 \log_{10}\left\{ \sum_{n=1}^{N} X^2(n) \Big/ \sum_{n=1}^{N} [X(n) - Y(n)]^2 \right\} \text{ (in dB)} \tag{11}$$

where *N* is the number of samples, *X(n)* is the original speech signal and *Y(n)* is the scrambled or descrambled speech signal. The principal benefit of the **SNR** quality measure is its mathematical simplicity. The measure represents an average error over time and frequency for a processed signal. However, **SNR** is a poor estimator for a broad range of speech distortions: it is not particularly well related to any subjective attribute of speech quality, and it weights all time domain errors in the speech waveform equally (Gray, Markel 1976). This can be solved with segmental **SNR**.
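Eq. (11) translates almost directly into code. The following is a minimal sketch; the function name `snr_db` and the convention of returning infinity for identical signals are assumptions made for the example.

```python
import numpy as np

def snr_db(x, y):
    """SNR of Eq. (11) in dB between original x and processed y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    noise_energy = np.sum((x - y) ** 2)
    if noise_energy == 0.0:
        return float("inf")          # identical signals: no error energy
    return float(10.0 * np.log10(np.sum(x ** 2) / noise_energy))

x = np.array([1.0, 2.0, 3.0, 4.0])
snr_close = snr_db(x, 0.9 * x)       # small error: about +20 dB
snr_far = snr_db(x, -x)              # error larger than signal: about -6 dB
```

A processed signal that stays close to the original gives a large positive value, while one whose error energy exceeds the signal energy gives a negative value. This is the sense in which negative values are read as low residual intelligibility for scrambled speech.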

#### **3.2.2 Segmental Signal-to-Noise Ratio**

An improved measure can be obtained if **SNR** is measured over short frames and the results are averaged. The frame-based measure is called the segmental **SNR (SEGSNR)** and is defined as:

$$\text{SEGSNR} = \overline{\text{SNR}(m)} \text{ (in dB)} \tag{12}$$

where the overbar denotes the average of *SNR(m)* over all segments, and *SNR(m)* is the **SNR** for segment *m*. The segmentation of the **SNR** permits the objective measure to assign equal weights to loud and soft portions of the speech (Yuan, 2003).
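A sketch of Eq. (12) follows. The segment length of 256 samples is an assumption borrowed from the scrambler frame size, since the chapter does not state the segment length used for SEGSNR; skipping segments with zero signal or zero error energy is likewise a choice made for this example.

```python
import numpy as np

def segsnr_db(x, y, frame_len=256):
    """Segmental SNR of Eq. (12): mean of per-segment SNR values in dB."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    values = []
    for m in range(len(x) // frame_len):
        seg = slice(m * frame_len, (m + 1) * frame_len)
        signal_energy = np.sum(x[seg] ** 2)
        error_energy = np.sum((x[seg] - y[seg]) ** 2)
        # Segments with no signal or no error have no finite SNR; skip them.
        if signal_energy > 0.0 and error_energy > 0.0:
            values.append(10.0 * np.log10(signal_energy / error_energy))
    if not values:
        raise ValueError("no segment with finite SNR")
    return float(np.mean(values))

rng = np.random.default_rng(3)
x = rng.standard_normal(1024)
segsnr_uniform = segsnr_db(x, 0.9 * x)   # every segment is at about 20 dB
```

Because each segment contributes one value to the average regardless of its energy, a quiet segment counts as much as a loud one, which is the equal weighting described above.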

#### **4. Results and discussion**

In the proposed Wavelet Transform based Speech Scrambling system, (Arabic) messages have been recorded as speech files with a sampling frequency of 8 kHz.

At the transmitter, the sampled speech signal is arranged into frames. Each frame contains 256 samples, and the Wavelet Transformation is performed on each frame. After that, the transform coefficients are permuted before applying the Inverse Wavelet Transform (IWT). The resulting scrambled speech signal is saved in a wave file.
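The frame-by-frame procedure can be sketched with a hand-written transform. This is an illustration under explicit assumptions: it uses a single-level orthonormal Haar transform coded directly in NumPy, whereas the chapter evaluates Haar, db3, sym2 and sym4 at three levels; the permutation is random rather than screened; samples that do not fill a whole frame are dropped; and all function names are hypothetical.

```python
import numpy as np

def haar_dwt(frame):
    """Single-level orthonormal Haar transform: [approximation, detail]."""
    a = (frame[0::2] + frame[1::2]) / np.sqrt(2.0)
    d = (frame[0::2] - frame[1::2]) / np.sqrt(2.0)
    return np.concatenate([a, d])

def haar_idwt(coeffs):
    """Inverse of haar_dwt."""
    half = len(coeffs) // 2
    a, d = coeffs[:half], coeffs[half:]
    frame = np.empty(len(coeffs))
    frame[0::2] = (a + d) / np.sqrt(2.0)
    frame[1::2] = (a - d) / np.sqrt(2.0)
    return frame

def scramble(signal, perm, frame_len=256):
    """Transform each frame, move coefficient i to slot perm[i], invert."""
    n = len(signal) // frame_len * frame_len      # drop the incomplete tail
    out = []
    for frame in np.asarray(signal[:n], dtype=float).reshape(-1, frame_len):
        c = haar_dwt(frame)
        v = np.empty_like(c)
        v[perm] = c
        out.append(haar_idwt(v))
    return np.concatenate(out)

def descramble(scrambled, perm, frame_len=256):
    """Transform each frame, undo the permutation, invert."""
    out = []
    for frame in np.asarray(scrambled, dtype=float).reshape(-1, frame_len):
        v = haar_dwt(frame)
        out.append(haar_idwt(v[perm]))
    return np.concatenate(out)

rng = np.random.default_rng(4)
speech = rng.standard_normal(1000)                # placeholder for 8 kHz speech
perm = rng.permutation(256)
scrambled = scramble(speech, perm)
recovered = descramble(scrambled, perm)
```

In this sketch the recovered signal matches the first 768 samples of the input (three full frames), while the scrambled signal has the same energy as those frames but a different waveform.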


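At the receiver, the frames of 256 samples are descrambled one by one and saved in a wave file. The proposed scrambling system investigates four types of wavelets (Haar, db3, sym2 and sym4), each one with three different levels. Two types of tests have been applied to examine the performance of the simulation, these are:

a. Subjective Test: the scrambled speech files have been played back to a number of listeners to measure the residual intelligibility subjectively. For all cases, the judgment was that the files contain noise only, which means that the residual intelligibility is very low. The recovered analog speech files have been tested in a similar way to measure the quality of the recovered speech; the judgment was that the files were exactly the same as the original copies.

b. Objective Test: as mentioned earlier, the objective test is a valuable measure of the residual intelligibility of the scrambled speech and of the quality of the recovered speech.

The distance measures indicate the perceptual similarity of the speech recovered following decryption and the original speech. They are also used to quantify the difference between scrambled speech and original speech.

The signal to noise ratio (SNR) and the segmental signal to noise ratio (SEGSNR) have been chosen to test the residual intelligibility of the scrambled speech and the quality of the recovered speech for all files. The segmental signal to noise ratio (SEGSNR) is an improved version of the SNR.

Generally, these distance measures are very low (negative values) for all the scrambled speech files, which means that the residual intelligibility is very low, and very high (large positive values) for all the recovered speech files, which means that the quality of the recovered speech is very high.

The relation between the estimated PSD (dB/Hz) and the frequency of the speech signals used is examined in two cases, as follows:

- To compare the original and scrambled speech.
- To compare the original and descrambled speech.

The wavelet based speech scrambling system has been tested under two states of the simulation, these are:

#### **4.1 Noise free channel simulation**

Case Study:

Simulation results of typical experiments with the wavelet based scrambler and descrambler for the Arabic word "evening", spoken by a woman's voice, are shown in Figures (4) to (8) and Tables (1) to (2), using different wavelets and different levels.

The Haar, db3, sym2 and sym4 wavelets will each be considered with three different levels for the **Arabic word** "**evening**".

Figure (4) shows the waveform, spectrum, and spectrogram of a sample original clear speech signal that represents the Arabic word "evening".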
