82 Advances in Wavelet Theory and Their Applications in Engineering, Physics and Technology

**2. Methods**

**2.1 Wavelet transform**

One of the most important techniques applied in spectral analysis is the Short-Time Fourier Transform (STFT), which allows the spectral components of a speech signal to be recognized, making it possible to distinguish pathological voices and process them.

That transform has a resolution problem, which is given by the Heisenberg Uncertainty Principle. The Wavelet Transform (WT) was developed to overcome some resolution-related problems of the STFT. It is possible to analyze any signal by using an alternative approach called multiresolution analysis (MRA).

MRA, as implied by its name, analyzes the signal at different frequencies with different resolutions. MRA is designed to give good time resolution and poor frequency resolution at high frequencies, and good frequency resolution and poor time resolution at low frequencies. The Continuous Wavelet Transform (CWT) is used for many different applications and it is defined as follows:

$$CWT\_x^{\psi}(\tau,s) = \frac{1}{\sqrt{|s|}} \int x(t)\, \psi^{\*}\!\left(\frac{t-\tau}{s}\right) dt \tag{1}$$

As the signals used here are digital, it is more useful to use the Semi-discrete Wavelet Transform (discretized on a dyadic grid, described by $s = 2^{j}$ and $\tau = k\,2^{j}$) or the Discrete Wavelet Transform (DWT). The DWT analyzes the signal at different frequency bands with different resolutions by decomposing the signal into a coarse approximation and detail information [5].

The decomposition of the signal into different frequency bands is simply obtained by successive highpass and lowpass filtering of the time-domain signal. The original signal x[n] is first passed through a halfband highpass filter g[n] and a lowpass filter h[n]. This constitutes one level of decomposition and can mathematically be expressed as follows:

$$y\_{high}[k] = \sum\_{n} x[n]\, g[2k-n] \tag{2}$$

$$y\_{low}[k] = \sum\_{n} x[n]\, h[2k-n] \tag{3}$$

where $y\_{high}[k]$ and $y\_{low}[k]$ are the outputs of the highpass and lowpass filters, respectively, after subsampling by 2. This decomposition halves the time resolution, since only half the number of samples now characterizes the entire signal. However, it doubles the frequency resolution, since the frequency band of the signal now spans only half the previous frequency band, effectively reducing the uncertainty in the frequency by half. The above procedure, which is also known as subband coding, can be repeated for further decomposition.

The wavelet packet method is a generalization of wavelet decomposition that offers a richer signal analysis. Wavelet packet atoms are waveforms indexed by three naturally interpreted parameters: position, scale and frequency.
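One level of the decomposition described by equations (2) and (3) can be sketched as follows. This is a minimal illustration, assuming the orthonormal Haar filter pair for brevity; real analyses typically use longer filters such as Daubechies.

```python
import math

def dwt_level(x, h, g):
    """One analysis level (eqs. 2-3): filter x with lowpass h and highpass g,
    keeping every second output sample (subsampling by 2)."""
    n_out = len(x) // 2
    y_low = [sum(x[n] * h[2 * k - n] for n in range(len(x))
                 if 0 <= 2 * k - n < len(h)) for k in range(n_out)]
    y_high = [sum(x[n] * g[2 * k - n] for n in range(len(x))
                  if 0 <= 2 * k - n < len(g)) for k in range(n_out)]
    return y_low, y_high

s = 1 / math.sqrt(2)
h = [s, s]    # Haar halfband lowpass (scaling) filter, assumed here
g = [s, -s]   # Haar halfband highpass (wavelet) filter

# Coarse approximation and detail of a short toy signal.
approx, detail = dwt_level([4.0, 2.0, 6.0, 8.0], h, g)
```

Applying `dwt_level` again to `approx` yields the next, coarser level, which is exactly the repeated subband coding described above.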

**2.2 Basis of speech analysis**

At the present time, many otolaryngologists (ORLs) use the software tools they have available in order to corroborate the diagnosis of vocal cord pathologies by means of objective parameters. These parameters complete the information gathered by the specialist, which usually comprises the images obtained from a stroboscope and several perceptual tests carried out on the patient.

Special attention needs to be paid to vocal cord cancer, that is to say, to its diagnosis, treatment, rehabilitation and monitoring, as this cancer can cause the death of the patient suffering from it. Once the cancer has been detected, the ORL specialist removes the patient's vocal cords. This means that the patient will no longer be able to produce what is called laryngeal voice, and thus loses his/her speech.

After the operation, during rehabilitation, the patient begins the process of learning how to emit oesophageal voice: the voice produced by modulating air coming from the oesophagus. This enables the patient to communicate, albeit with great difficulty in maintaining fluent conversations, due to the poor quality of oesophageal voice. However, one of the major problems is that this type of voice cannot be evaluated during the rehabilitation process, as there is no application available on the market that can automatically obtain the previously mentioned acoustic parameters. The quality of oesophageal voice is so low that the algorithms that obtain the periodicity of the voice do not work properly, and thus the measurements obtained by such software packs are not reliable.

The accuracy of the measurements made by the software pack presented in this work also makes it applicable to less severe pathologies, such as polyps, nodules, hypomobility of the vocal cords, etc. The deterioration of the voice in this type of pathology is also too high for the measurement of objective parameters to be precise, which means that commercial software packs are not suitable for measuring these parameters in voices suffering from some kind of pathology. Being able to obtain accurate objective parameters is advantageous for the early detection of cancer in cases where the patient's laryngeal voice is of a very poor quality and has high noise levels [1].

The pitch, or fundamental frequency, of speech is one of the properties of sound or of a musical tone perceived through frequency. Due to the natural pseudo-periodicity of voiced speech, there are small variations in the peaks of the voice signal which change its frequency, so that the pitch can be defined as:

$$\text{Pitch(Hz)} = \frac{\sum\_{i=1}^{N} f\_i}{N} \tag{4}$$

N being the number of pitch periods and $f_i$ the frequency of the i-th period.
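Equation (4) can be sketched as follows, assuming the marks delimiting each voice cycle have already been detected (itself the difficult part, as discussed later):

```python
def pitch_hz(cycle_marks):
    """Equation (4): the mean of the per-cycle frequencies f_i = 1 / T_i,
    where T_i is the duration of the i-th voice cycle."""
    freqs = [1.0 / (t1 - t0) for t0, t1 in zip(cycle_marks, cycle_marks[1:])]
    return sum(freqs) / len(freqs)

# Four marks (in seconds) delimit three voice cycles of 10 ms each.
marks = [0.000, 0.010, 0.020, 0.030]
f0 = pitch_hz(marks)  # close to 100 Hz
```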

Oesophageal Speech's Formants Measurement Using Wavelet Transform 85

Oesophageal voices are produced by patients whose vocal cords have been removed due, generally, to larynx cancer. Because of this, their time-spectral characteristics are atypical and include high levels of noise, fundamental frequency asymmetry and unstructured formants. This leads to wrong measurements in commercial applications, and it is therefore impossible to assess the quality of oesophageal voices. The same is applicable to voices with severe pathologies.

In this sense, it is necessary to develop an algorithm for the exact calculation of the marks that correspond to each cycle of the signal of oesophageal or pathological voices, so that the calculation of pitch is exact and, with it, the measures of jitter, shimmer or signal-to-noise ratio. This algorithm has been included in a software interface that allows users to measure the acoustic parameters of the speech signal and to plot the results in a graph. This is suitable for evaluating and comparing the results between the original oesophageal speech signal and the processed one after applying the wavelet transform.

**3. System design**

The system design has been divided into two parts: the algorithm for improving the quality of oesophageal speech using the wavelet transform, and the user interface, which includes the speech signal processing using that algorithm and the acoustic analysis of the speech parameters.

**3.1 Algorithm using wavelets**

As previously mentioned, wavelet packets will be used to detect the formants' locations. The reason for choosing this technique is its ability to separate the speech signal into different subbands, allowing the formant bands to be separated quite exactly.

The method proposed here makes use of a double analysis. Firstly, a general analysis is applied over the whole spectrum; in this step, a band in which the formant is located is approximated. Secondly, the exact formant location is determined more accurately; the accuracy can be adjusted by introducing more analysis levels inside the formant approximation band.

The main advantage of this method is the possibility of achieving a great frequency resolution without consuming excessive computational resources, which is crucial when implementing the algorithms in a real-time device, such as a DSP.

**3.1.1 Step 1: Band approximation of formants location**

The first step consists of a rough analysis of the signal's wavelet packet tree. In order to locate the formant frequencies, the energy of each subband is analyzed. The maxima of this energy signal determine the formant locations. The scheme of the process is shown in Figure 1. Firstly, the wavelet packet tree is calculated up to the desired level; the chosen level is calculated taking into account the sampling frequency and the resolution required.

After the wavelet packets have been obtained, the energy of each last-level node is calculated. This energy is stored in an array and its envelope is estimated. The envelope smoothes the energy signal and thus the maxima can be easily calculated.
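The band-approximation step can be sketched end to end as a toy illustration. This is not the chapter's actual implementation: it assumes Haar filters, omits the envelope smoothing, and keeps the terminal nodes in their natural rather than frequency order (a real implementation would use longer filters and reorder the nodes by frequency).

```python
import math

S = 1 / math.sqrt(2)

def split(x):
    """One Haar analysis step: (approximation, detail), each downsampled by 2."""
    approx = [(x[2 * i] + x[2 * i + 1]) * S for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) * S for i in range(len(x) // 2)]
    return approx, detail

def packet_leaves(x, levels):
    """Full wavelet-packet tree: the 2**levels terminal-node signals."""
    nodes = [x]
    for _ in range(levels):
        nodes = [part for node in nodes for part in split(node)]
    return nodes

def formant_band(x, fs, levels):
    """Index and frequency range (Hz) of the most energetic terminal subband."""
    energies = [sum(v * v for v in node) for node in packet_leaves(x, levels)]
    k = max(range(len(energies)), key=energies.__getitem__)
    width = fs / 2 / len(energies)  # bandwidth covered by each subband
    return k, (k * width, (k + 1) * width)

# A constant (0 Hz) signal: all the energy ends up in the lowest subband,
# so the approximated band is the first one, 0-500 Hz at fs = 8000, 3 levels.
band, freq_range = formant_band([1.0] * 64, fs=8000, levels=3)
```

In practice the `levels` parameter plays the role described above: it is chosen from the sampling frequency and the frequency resolution required, since each extra level halves the width of every subband.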

Estimating fundamental frequency has been a recurring issue in the area of digital signal processing. This is due to the fact that obtaining the time instants that define voice cycles is a very complex task; these cycles are used to obtain the fundamental frequency values. Furthermore, it is vitally important to calculate these instants in the acoustic parameterization, as they are the cornerstone of voice characterizations of this kind.

Jitter [2] is a parameter representing the variation of the fundamental frequency, that is, the variation of pitch in each voice cycle. On the other hand, specialists also usually employ the shimmer parameter [2], which represents the variation in amplitude of voice cycle peaks. The voice produced through larynx modulation is able to maintain an almost constant peak amplitude across voice periods; therefore, an increase in the shimmer value can be a symptom of a voice disorder. Tables 1 and 2 present the various mathematical definitions of the jitter and shimmer objective parameters.
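As a minimal sketch, the commonly used "local" (cycle-to-cycle) forms of these parameters can be computed as below; the exact formulas referenced in Tables 1 and 2 may differ.

```python
def local_variation(values):
    """Mean absolute consecutive difference, relative to the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

def jitter(periods):
    """Relative cycle-to-cycle perturbation of the period durations."""
    return local_variation(periods)

def shimmer(amplitudes):
    """Relative cycle-to-cycle perturbation of the peak amplitudes."""
    return local_variation(amplitudes)

T = [0.0100, 0.0102, 0.0099, 0.0101]  # cycle periods (s), slightly irregular
A = [0.80, 0.78, 0.82, 0.79]          # peak amplitude of each cycle
jit, shim = jitter(T), shimmer(A)     # both on the order of a few percent
```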

As previously mentioned, a number of authors have written several works on the detection of voice cycles [3,4], and many highly detailed techniques can be found in the corresponding literature, such as estimators in the time domain (zero-crossing rate [5]), estimators of fundamental frequency [6,7], autocorrelation methods (YIN estimators [8]), representations of the phase space [9], cepstrum methods [10] and statistical methods [11,12,13]. Some of these directly define voice cycles [3], whereas others use numerical approximations [8] to obtain fundamental frequency values. In that respect, another step must be taken if we are to clearly identify the instants that define voice cycles.
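As a hedged illustration of the autocorrelation family cited above, a bare-bones estimator can be sketched as follows; real estimators such as YIN add normalization, thresholds and interpolation on top of this idea.

```python
import math

def autocorr_pitch(x, fs, fmin=60.0, fmax=400.0):
    """Pick the lag (within the plausible pitch range) that maximizes the
    autocorrelation of x, and return it as a frequency in Hz."""
    lo, hi = int(fs / fmax), int(fs / fmin)

    def r(lag):
        return sum(x[n] * x[n + lag] for n in range(len(x) - lag))

    best = max(range(lo, hi + 1), key=r)
    return fs / best

fs = 8000
x = [math.sin(2 * math.pi * 100.0 * n / fs) for n in range(800)]  # 100 Hz tone
f0 = autocorr_pitch(x, fs)  # close to 100 Hz
```

On clean laryngeal voice this works well; on oesophageal voice, as the text argues, the periodicity is so degraded that such estimators become unreliable.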

However, none of these works was tried out on oesophageal voices and, moreover, these algorithms are clearly not suitable for voices of this kind. The software pack presented here is a tool designed for use by specialists in otolaryngology, and is specifically designed to obtain objective voice parameters with excellent precision. The tool contains a basic algorithm to calculate the acoustic parameters related to speech periodicity, and serves as an aid not only for diagnosis and rehabilitation but also for monitoring the patient.

It can be concluded that the tool is user-friendly and that ORL specialists can use it for measuring such objective parameters as pitch, jitter and shimmer, as well as for keeping patient records on these parameters.
