**4.3. Sliding window procedure**

A sliding window procedure [11] is employed to detect any increase in energy that exceeds a certain threshold, which has been selected to characterize the appearance of an onset. Rectangular windows containing 4096 samples (≈ 92.8 ms) of the signal under analysis are employed. The number of samples is chosen to be a power of two so that a fast Fourier transform (FFT) algorithm can be used to compute all values of the discrete Fourier transforms (DFTs) when performing the frequency analysis of the vowel segment, which substantially reduces the number of arithmetic operations required. Moreover, the quasi-periodic and quasi-stationary character of speech over that interval is an additional justification for the size of these blocks, and will be of great utility in further upgrades of this system.
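The onset detection described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation (the system itself is MATLAB-based). It assumes non-overlapping rectangular frames, takes the "increase in energy" to be the energy difference between consecutive frames, assumes a 44.1 kHz sampling rate (at which 4096 samples last about 92.9 ms), and uses an arbitrary threshold. The names `frame_energies` and `detect_onsets` are hypothetical.

```python
import math

FRAME_LEN = 4096      # samples per rectangular window, as stated in the text
SAMPLE_RATE = 44100   # Hz; an assumption, the text only gives the frame duration


def frame_energies(signal, frame_len=FRAME_LEN):
    """Energy (sum of squared samples) of each consecutive rectangular frame."""
    energies = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


def detect_onsets(signal, threshold, frame_len=FRAME_LEN):
    """Indices of frames whose energy exceeds the previous frame's by more than threshold."""
    energies = frame_energies(signal, frame_len)
    return [i for i in range(1, len(energies))
            if energies[i] - energies[i - 1] > threshold]


# Synthetic test signal: two frames of silence followed by two frames of a 220 Hz tone.
silence = [0.0] * (2 * FRAME_LEN)
tone = [0.5 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE)
        for n in range(2 * FRAME_LEN)]
onset_frames = detect_onsets(silence + tone, threshold=100.0)
print(onset_frames)                                   # frame indices of detected onsets
print([i * FRAME_LEN / SAMPLE_RATE for i in onset_frames])  # onset times in seconds
```

In practice the threshold would have to be tuned to the recording level, and overlapping frames could be used for a finer time resolution; neither choice is specified in the text.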

For each 4096-sample segment, a peak-picking method similar to the one employed in [20] was developed to extract formants. The recording is divided into frames of 92.8 ms so that such formants can be detected easily. Peaks can appear and disappear from one frame to the next due to resonances in the vocal tract and to nasalization, and segmenting the recording into frames of 4096 samples allows formants to be detected successfully despite these nasalization effects.
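As a rough illustration of peak picking on a single frame, the following Python sketch computes the magnitude spectrum of a 4096-sample frame and keeps its strongest local maxima. It is not the method of [20]: it is a plain local-maximum search on a synthetic frame made of three sinusoids, with an assumed 44.1 kHz sampling rate and an arbitrary relative floor. On real speech the harmonics of the fundamental frequency would also produce peaks, so a smoothed spectral envelope would be needed. The names `fft` and `pick_peaks` are hypothetical helpers.

```python
import cmath
import math

SAMPLE_RATE = 44100   # Hz; assumed
N = 4096              # frame length used in the text


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def pick_peaks(magnitudes, max_peaks=4, rel_floor=0.1):
    """Bin indices of local maxima at least rel_floor times the global maximum,
    lowest frequency first, limited to max_peaks entries."""
    floor = rel_floor * max(magnitudes)
    peaks = [k for k in range(1, len(magnitudes) - 1)
             if magnitudes[k] > magnitudes[k - 1]
             and magnitudes[k] >= magnitudes[k + 1]
             and magnitudes[k] >= floor]
    return peaks[:max_peaks]


# Synthetic frame: three sinusoids placed exactly on DFT bins 65, 111 and 232
# (roughly 700 Hz, 1195 Hz and 2498 Hz at the assumed sampling rate).
frame = [math.sin(2 * math.pi * 65 * i / N)
         + 0.6 * math.sin(2 * math.pi * 111 * i / N)
         + 0.3 * math.sin(2 * math.pi * 232 * i / N)
         for i in range(N)]
magnitudes = [abs(c) for c in fft(frame)[:N // 2]]   # keep non-negative frequencies
peak_bins = pick_peaks(magnitudes)
print(peak_bins)
print([round(k * SAMPLE_RATE / N, 1) for k in peak_bins])  # bin index -> Hz
```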

In general, this latter effect presents a special problem because nasalization is simply a resonance of the nasal tract (it can be seen as a pole in the transfer function), whereas the oral tract acts as a closed side branch, which causes zeros (energy minima in the spectrum). Frequently, the second formant, *F*<sub>2</sub>, is greatly reduced in amplitude because of a nearby zero; in fact, often there is no peak corresponding to *F*<sub>2</sub>. The nasalization of a vowel is a problem of a similar nature. In this case, the nasal cavity is an open side branch, causing extra zeros and extra poles. In a nasalized front vowel, there is typically an extra small peak slightly above the first formant in frequency. In a nasalized back vowel, the apparent bandwidth of *F*<sub>1</sub> becomes quite wide because of a nearby zero, and sometimes there is no peak for *F*<sub>1</sub>. We will show this effect in the results included throughout this paper.
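The effect of a zero lying close to a formant pole can be checked numerically with a toy pole-zero model. The Python sketch below is not a vocal tract model and is not taken from the original system: the sampling rate, the pole at 1500 Hz, the zero at 1600 Hz and the common radius of 0.97 are all hypothetical values chosen only to show how a nearby zero lowers the spectral peak at the pole frequency.

```python
import cmath
import math


def conjugate_pair(radius, freq_hz, sample_rate):
    """Complex-conjugate pair of roots at the given radius and frequency."""
    angle = 2 * math.pi * freq_hz / sample_rate
    root = cmath.rect(radius, angle)
    return [root, root.conjugate()]


def magnitude(theta, poles, zeros):
    """|H(e^{j*theta})| for H(z) = prod(1 - q/z) / prod(1 - p/z)."""
    z_inv = cmath.exp(-1j * theta)
    num = 1.0
    for q in zeros:
        num *= abs(1 - q * z_inv)
    den = 1.0
    for p in poles:
        den *= abs(1 - p * z_inv)
    return num / den


SAMPLE_RATE = 10000                                # Hz; hypothetical
poles = conjugate_pair(0.97, 1500, SAMPLE_RATE)    # stand-in for a formant resonance
zeros = conjugate_pair(0.97, 1600, SAMPLE_RATE)    # stand-in for a nasal anti-resonance

theta = 2 * math.pi * 1500 / SAMPLE_RATE           # evaluate at the pole frequency
without_zero = magnitude(theta, poles, [])
with_zero = magnitude(theta, poles, zeros)
print(20 * math.log10(without_zero))               # peak level in dB, poles only
print(20 * math.log10(with_zero))                  # peak level in dB, with the nearby zero
```

With these values the level at the pole frequency drops markedly once the zero is added, which is the mechanism by which a formant peak can shrink or vanish in a nasalized sound.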


http://dx.doi.org/10.5772/57221

For each frame, an *N*-point FFT is employed to compute all values of the DFT. If the number of samples in the last frame is not a power of two, that frame must first be zero-padded before its FFT is computed [13]. As an interesting remark, computing all *N* values of a DFT directly from the definition requires approximately *N*<sup>2</sup> arithmetic operations, whereas an FFT algorithm obtains the same result with a number of operations approximately proportional to *N* log<sub>2</sub> *N* [21].

moved to lower frequencies in subsequent frames. This fact lets us obtain the bandwidth of this fundamental frequency for a more effective harmonic elimination. A final smoothing may be accomplished at each voiced frame in the same way as proposed by McCandless, to yield the formant tracks. The interpolated and smoothed values are valid if they are not too "different" from the original.

Finally, if any formant is not found, for instance because it has merged with another one, an enhancement procedure based on linear prediction analysis [20] is applied. In this case, the linear prediction filter coefficients routine (*lpc.m*) included in MATLAB, based on the autocorrelation method of autoregressive (AR) modeling, as implemented in [12], is used to find the filter coefficients. Once the coefficients, *a<sub>k</sub>*, are available, the approximate spectrum can be obtained directly by evaluating the magnitude of the transfer function, *H*(*f*), of the filter represented by the coefficients *a<sub>k</sub>*, at *N* equally spaced samples along the unit circle [20]:

$$H(f) = a_0 - \sum_{k=1}^{p} a_k \exp\left(-j 2\pi n k / N\right) \qquad (3)$$

For this purpose, the system can employ the function

filter([0, -a\_k(2:end)], 1, xn);

as a previous step, where a\_k contains the coefficients, *a<sub>k</sub>*, of the transfer function, xn is the original audio recording, and *n* = 0, 1, ..., *N* − 1. As indicated in [20], two closely spaced formants frequently merge into one spectral peak and cannot be resolved on the unit circle even with infinite resolution. However, they can often be separated by simply recomputing the spectrum on a circle of radius *r* less than 1. This amounts to reevaluating *H*(*f*) at *z* = *r* exp *j*(2*πn*/*N*), *r* < 1. Because the contour comes closer to the two poles, their peaks are enhanced, and a separation can be effected. Hence, by the estimation properties of the linear prediction coding (LPC) spectrum, in regions where the signal energy is strong, i.e. close to the peaks of the spectrum, the LPC spectrum is close to the signal spectrum. However, in regions where the signal energy is weak, i.e. close to the valleys of the spectrum, the two spectra differ significantly. Thus, checking the peak values of the linear prediction spectrum can confirm the formant.

**5. Results and discussions**

In this section, we show some results produced by the implemented system. As commented above, tests and recordings have been carried out with adult females, following the original system by [1], which was released in 2001. Nevertheless, we must remark that the signal processing problem would be identical in the case of males and children; the automatic formant extraction method would not change. After the process described in Section 3, the system would have selected frequency peaks as candidate formants in each recording. The formant frequencies of vowels produced by males would surely be shifted to lower frequencies relative to those of vowels produced by women (see [4], for instance).
