phoneme in his software programme. This sound comparison yields a degree of similarity, displayed graphically as a percentage, that shows the resemblance between the user's and the programme's sound waves.

Nevertheless, as Pavón himself states, this value is approximate and depends on recording conditions (e.g. room noise and other external variables), so the result is only indicative. Although the idea is conceptually sound, a frequency-domain analysis is required to establish the degree of resemblance between users' waveforms and those included in the system. On the one hand, such software programmes do not distinguish between male and female voice recordings, even though fundamental frequencies and formants differ between the two: women's speech has its peak energy at higher frequencies, and Pavón's software only includes female recordings. On the other hand, time-domain comparisons are not significant: their results are very often meaningless.

For this reason, this paper attempts to improve the aforementioned software by adding a frequency-domain analysis based on the identification of the fundamental frequency and of *F*1 and *F*2. This would allow a more meaningful comparison between users' recordings and the programme's audio database. At the same time, depending on formant position, learners will receive information on mouth opening and tongue position for each vowel sound. To this end, we draw on the authors' previous research on audio signal processing [11], communication channels [12], numerical methods [13], analytical modelling [14] and English applied linguistics [15]. This theoretical framework underpins a useful tool for students of English who want to improve their pronunciation of English vowels autonomously.

Finally, we focus only on vocalic sounds, since not all human sounds have well-defined formants. Vowels do have distinct formants, and their study complements oral language teaching, in this case of the English language.

**3. Organs of speech**

Vowels are the result of the glottal source, the supraglottal tract and their filtering effects. Vowels of the same quality have similar spectral shapes regardless of the source fundamental frequency (a variable that changes considerably with the speaker's age, sex and emotional state). The air coming from the lungs supplies the energy needed to produce sounds. As the vocal folds vibrate, the air flow through the glottis generates a complex periodic wave. The glottal source waveform and spectrum vary with the type of phonation; the differences in the waveform are due to the different amounts of time that the vocal folds are open during a glottal cycle. Figure 1 shows the organs of speech in cross-section.

The fundamental frequency, *F*0, also called the glottal frequency of vocal fold vibration, depends on several factors, such as the mass, length and tension of the folds, which are interrelated in a fairly complicated way. These are typical values for *F*0 (during normal speech production, voicing frequency varies over an octave):

- adult male voice: 125 Hz
- adult female voice: 220 Hz
- child voice: 300 Hz

**Figure 1.** Organs of speech: A. Lips, B. Teeth, C. Teeth ridge, D. Hard palate, E. Soft palate, F. Uvula, G. Pharynx, H. Tongue body, I. Tongue tip, J. Blade, K. Tongue front, L. Back of the tongue, M. Tongue root, N. Jaw, O. Epiglottis, P. Thyroid cartilage, Q. Cricothyroid cartilage, R. Trachea, S. Oral cavity, T. Nasal cavity. Figure taken from [1].

The vocal tract filter selectively passes energy at the harmonics of the source. The size and shape of the vocal tract determine the amount of energy that is used in oral speech. For each vocalic sound, the so-called formants describe its characteristic resonances. In fact, the vocal tract transfer function for a particular vowel is defined by the formant frequencies and bandwidths. We can model the acoustic properties of the vocal tract as a tube closed at one end, the glottis, and open at the other, the mouth. Assuming that this tube is uniform, its resonant frequencies can be calculated with the following formula:

$$F_n = \frac{(2n-1)c}{4L},\tag{1}$$

where *n* is the number of the formant, *c* is the speed of sound, and *L* is the length of the tube. However, we also need to consider acoustic constrictions in the vocal tract. One way of modelling the acoustic properties of vowels is to represent the vocal tract as a concatenation of tubes [16]. An alternative approach, known as perturbation theory, treats vowel acoustics in terms of the relationship between air pressure and speed [17].

### **3.1. Formant frequencies of the vowels**

The first formant frequency (*F*1) is traditionally related to the shape of the vocal tract. *F*1 is inversely related to tongue height: low vowels have a high *F*1 and high vowels have a low *F*1. The second formant frequency (*F*2), on the other hand, corresponds to the length and size of the speaker's oral cavity: front vowels have a high *F*2, whereas back vowels have a low *F*2. The formant frequencies decrease through the cardinal vowels, which can be consulted in [18]. Nevertheless, these relationships are not straightforward, since other factors also influence sound production (e.g. lip rounding and tongue retroflexion, among others).
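As a point of reference for these relationships, Eq. (1) can be evaluated for a neutral, uniform vocal tract. The sketch below is only illustrative: the tube length of 17.5 cm and the speed of sound of 350 m/s are assumed values that do not come from this chapter, and the function names `tube_formants` and `describe_vowel` are hypothetical. The second function labels a measured (*F*1, *F*2) pair relative to the neutral-tube values, following the height and frontness tendencies described above. It is a deliberate simplification, not the authors' method, and it ignores factors such as lip rounding.

```python
def tube_formants(length_m=0.175, speed_of_sound=350.0, count=3):
    """Resonances of a uniform tube closed at the glottis and open at the lips, Eq. (1).

    The default length and speed of sound are assumed values, not taken from the chapter.
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m) for n in range(1, count + 1)]


def describe_vowel(f1_hz, f2_hz, neutral=None):
    """Label a vowel relative to the neutral tube (assumed heuristic, for illustration)."""
    neutral_f1, neutral_f2 = (neutral or tube_formants())[:2]
    # F1 is inversely related to tongue height: a high F1 suggests a low vowel.
    height = "low" if f1_hz > neutral_f1 else "high"
    # Front vowels have a high F2, back vowels a low F2.
    place = "front" if f2_hz > neutral_f2 else "back"
    return height, place


print([round(f) for f in tube_formants()])  # -> [500, 1500, 2500]

# Hypothetical measurements, chosen only to exercise both branches.
print(describe_vowel(300.0, 2200.0))  # -> ('high', 'front')
print(describe_vowel(750.0, 1100.0))  # -> ('low', 'back')
```

With these assumed values, the neutral tract resonates at odd multiples of 500 Hz, which gives a rough baseline against which a learner's *F*1 and *F*2 could be read.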

The articulatory properties of vowels are characterised by *F*1 and *F*2, which are plotted against each other. Because of the inverse relationship between articulatory parameters and formant frequencies, zero frequency is placed at the top right corner. Fig. 2 [1] shows where English vowels are pronounced inside the oral cavity:

**Figure 2.** Vowel trapezium inserted in the oral cavity, indicating tongue movements for the pronunciation of the different vocalic phonemes [1].

the first three formants during all voiced sounds in continuous unrestricted speech. For this reason, the algorithm developed in this paper can be implemented more easily and productively.

### **4.1. Data acquisition and preprocessing**

This stage consists of recording a vowel file. The audio data are stored in a WAV file at a sample rate of 44.1 kHz. The system accepts both monaural and stereophonic files. The digitized signal is then low-pass filtered in order to eliminate high-frequency components.

### **4.2. Onset detection and temporal segmentation: windowing**

As in [11], our system divides the vowel segment into temporal slots and then performs a frequency analysis of each slot. This temporal segmentation is based on onset detection, so the system can detect when a phoneme starts in the recording. This information makes it possible to discard frames whose total spectral energy is below a threshold for silence, which must not be processed by the system.

After that, a Hamming window [13, Eq. 56] is applied to the segmented signal so that the samples at the ends of each segment carry less weight than the central samples. In this paper, we use an *M*-point Hamming window, symmetric about the point *M*/2, of the form

$$w[n] = \begin{cases} 0.54 - 0.46\cos(2\pi n/M), & 0 \le n \le M, \\ 0, & \text{otherwise}; \end{cases}\tag{2}$$

because it is optimized to minimize the maximum (nearest) side lobe.

### **4.3. Sliding window procedure**

A sliding window procedure [11] is employed to detect any increase in energy that exceeds a certain threshold, which has been selected to characterize the appearance of an onset. Rectangular windows containing 4096 samples (≈ 92.8 ms) of the signal under analysis are employed. The number of samples is chosen to be a power of two so that a fast Fourier transform algorithm can be used to compute the discrete Fourier transforms (DFTs) when performing the frequency analysis of the vowel segment, which substantially reduces the number of arithmetic operations required. Moreover, the quasi-periodic and quasi-stationary character of speech over that interval is an additional justification for the size of these blocks, and will be of great utility in further upgrades of this system.

For each 4096-sample segment, a peak-picking method like the one employed in [20] was developed to extract the formants. The recording is divided into frames of 92.8 ms so that such formants can be detected easily. Peaks can appear and disappear from one frame to the next owing to resonances in the vocal tract and to nasalization, and segmenting the recording into frames of 4096 samples makes it possible to detect the formants successfully despite this nasalization.
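The windowing and peak-picking steps of Sections 4.2 and 4.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm or the method of [20]: the function names, the relative threshold `rel_threshold` and the synthetic two-tone test frame are all hypothetical. It uses only the Python standard library, applies the Hamming window of Eq. (2) to one 4096-sample frame, computes its spectrum with a radix-2 FFT, and reports the local maxima of the magnitude spectrum.

```python
import cmath
import math


def hamming(M):
    """Eq. (2): w[n] = 0.54 - 0.46*cos(2*pi*n/M) for 0 <= n <= M (M + 1 samples)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / M) for n in range(M + 1)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def pick_peaks(frame, sample_rate, rel_threshold=0.1):
    """Return the frequencies (Hz) of local maxima of the windowed magnitude spectrum.

    rel_threshold is an assumed tuning parameter: peaks weaker than this
    fraction of the strongest bin are ignored.
    """
    n = len(frame)
    window = hamming(n - 1)  # n samples, symmetric about the frame centre
    spectrum = fft([s * w for s, w in zip(frame, window)])
    mag = [abs(c) for c in spectrum[: n // 2]]  # keep non-negative frequencies only
    floor = rel_threshold * max(mag)
    return [
        k * sample_rate / n
        for k in range(1, len(mag) - 1)
        if mag[k] >= floor and mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]
    ]


# Synthetic frame: two sinusoids standing in for spectral peaks (not real speech).
FS, N = 44100, 4096
frame = [
    math.sin(2 * math.pi * 700 * i / FS) + 0.5 * math.sin(2 * math.pi * 1200 * i / FS)
    for i in range(N)
]
peaks = pick_peaks(frame, FS)
print([round(p) for p in peaks])
```

With 4096 samples at 44.1 kHz, the bin spacing is about 10.8 Hz, so each detected peak lies within one bin of the true tone frequency. On real voiced speech the raw local maxima would tend to follow the harmonics of *F*0, so some smoothing of the spectrum would presumably be needed before the peaks could be read as formants; the chapter does not specify that step here.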
