**4.4. Processing of each frame: formant extraction**

In this case, the same steps as in [20] are followed. For each frame, the fundamental frequency is detected first; it is normally obtained as the peak with maximum energy. In our paper, all the tests were carried out by adult females, so fundamental frequencies were detected between 190 and 240 Hz in all cases, depending on the vowel produced. Tests have been restricted to women because the previous system implemented by Pavón in [1] was released with recordings of women only. We must stress that, from the signal processing point of view, recordings obtained from women and recordings produced by men pose exactly the same problem, and the treatment and the way to solve it would be exactly the same in both cases.
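This F0 detection step can be sketched as follows (a minimal Python illustration, not the paper's MATLAB implementation; the 44.1 kHz sampling rate and the 150-300 Hz search band are our assumptions, the latter chosen to cover the 190-240 Hz values reported above):

```python
import numpy as np

def detect_f0(frame, fs, fmin=150.0, fmax=300.0):
    """Estimate F0 of one frame as the highest-energy spectral peak
    inside a plausible F0 band (band limits are our assumption)."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    in_band = (freqs >= fmin) & (freqs <= fmax)
    # Zero the spectrum outside the band, then take the strongest bin.
    return freqs[np.argmax(np.where(in_band, spectrum, 0.0))]

# Synthetic check: a 210 Hz tone plus its octave, in a 4096-sample frame
fs = 44100
t = np.arange(4096) / fs
frame = np.sin(2 * np.pi * 210 * t) + 0.4 * np.sin(2 * np.pi * 420 * t)
print(detect_f0(frame, fs))  # close to 210 Hz (FFT resolution ~10.8 Hz)
```

With a 4096-sample frame the bin spacing is about 10.8 Hz, which is sufficient to separate the F0 values quoted above.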

Secondly, as in [11], we eliminate the harmonics of the fundamental frequency in each frame, except when we find a peak with higher energy placed at a potential harmonic. The constraint we impose in this step is that the amplitude of the peak placed at the frequency corresponding to the *n*-th harmonic must be lower than the amplitude of the peak positioned at the frequency corresponding to the (*n* − 1)-th harmonic, with *n* ≥ 1, where, in this notation, the 0-th harmonic frequency is the fundamental frequency.
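The harmonic-elimination rule can be sketched like this (a Python illustration under our own assumptions; the frequency tolerance and the handling of the fundamental are ours, since the details of [11] are not reproduced here):

```python
def eliminate_harmonics(peak_freqs, peak_amps, f0, tol=0.05):
    """Discard peaks lying at (approximate) multiples of f0, unless the
    amplitude at one harmonic exceeds that at the previous harmonic,
    which flags a likely formant riding on the harmonic.
    `tol` is a relative frequency tolerance (our assumption)."""
    kept = []
    prev_amp = None
    for f, a in sorted(zip(peak_freqs, peak_amps)):
        k = round(f / f0)
        is_harmonic = k >= 1 and abs(f - k * f0) <= tol * f0
        if not is_harmonic:
            kept.append((f, a))
            continue
        # A "harmonic" whose amplitude grows past the previous one is
        # kept as a formant candidate; the fundamental itself (k = 1)
        # is dropped here, as it was already identified in the F0 step.
        if prev_amp is not None and a > prev_amp:
            kept.append((f, a))
        prev_amp = a
    return kept

# f0 = 200 Hz: 400 Hz is weaker than 200 Hz and is removed; 600 Hz is
# stronger than 400 Hz and survives; 350 Hz is not a harmonic at all.
print(eliminate_harmonics([200.0, 350.0, 400.0, 600.0],
                          [1.0, 0.3, 0.6, 0.9], f0=200.0))
# [(350.0, 0.3), (600.0, 0.9)]
```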

As a third step, our system searches for peaks, finding the frequencies and amplitudes of possible formants in the region from 150 to 3400 Hz. By executing this step on each 4096-sample frame, the system can detect peaks that appear, peak mergers, and peak cancellations due to the pernicious effect of the resonances and nasalizations commented on above. Hence, we can take advantage of a very important feature of voice signals: their lack of continuity, i.e., how formant frequencies can change from frame to frame, and how new peaks can appear in one frame and disappear in the next. A complete analysis of the voice signal without segmentation would, on many occasions, entail an error in the estimation of the formants, because some formants would not have enough energy to be detected.
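A minimal sketch of this per-frame peak search (Python, not the paper's code; the local-maximum criterion is our assumption):

```python
import numpy as np

def formant_candidates(frame, fs, fmin=150.0, fmax=3400.0):
    """Local spectral maxima inside the 150-3400 Hz search region,
    returned strongest first as (frequency, amplitude) pairs."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    cands = [(freqs[i], spec[i])
             for i in range(1, len(spec) - 1)
             if fmin <= freqs[i] <= fmax
             and spec[i - 1] < spec[i] >= spec[i + 1]]
    return sorted(cands, key=lambda c: -c[1])

# Two spectral peaks near 900 and 1300 Hz in a 4096-sample frame
fs = 44100
t = np.arange(4096) / fs
frame = np.sin(2 * np.pi * 900 * t) + 0.8 * np.sin(2 * np.pi * 1300 * t)
top2 = sorted(round(float(f)) for f, _ in formant_candidates(frame, fs)[:2])
print(top2)  # two candidates near 900 and 1300 Hz
```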

After doing that, our system has selected some candidates to be the formants in each frame. One particular feature of the analysis of each frame is that the fundamental frequency shifts towards lower frequencies in subsequent frames. This fact lets us obtain the bandwidth of the fundamental frequency for a more effective harmonic elimination. A final smoothing may be applied to each voiced frame, in the same way as proposed by McCandless, to yield the formant tracks. The interpolated and smoothed values are accepted only if they are not too "different" from the original ones.
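The final smoothing of a formant track can be illustrated with a three-point median filter (a common stand-in we use here for illustration; McCandless's original scheme relies on interpolation and validity checks not reproduced in this sketch):

```python
import numpy as np

def smooth_track(track):
    """Three-point median smoothing of a per-frame formant track:
    isolated per-frame outliers are replaced by a neighbourhood value."""
    track = np.asarray(track, dtype=float)
    out = track.copy()
    for i in range(1, len(track) - 1):
        out[i] = np.median(track[i - 1:i + 2])
    return out

# An F1 track with one spurious 1450 Hz frame among ~930 Hz values
print(smooth_track([930, 925, 1450, 928, 932]))  # the outlier is replaced
```

In the system described above, the smoothed value would additionally be compared against the raw one and rejected if too "different", as stated in the text.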

Finally, if any formant is not found, for instance because it has been merged with another one, an enhancement procedure using linear prediction analysis [20] is included, employing, for this case, the linear prediction filter coefficients routine (*lpc.m*) included in MATLAB, which is based on the autocorrelation method of autoregressive (AR) modeling, as implemented in [12], to find the filter coefficients. Once the coefficients, *ak*, are available, we can obtain the approximated spectrum in a straightforward manner by simply evaluating the magnitude of the transfer function, *H*(*f*), of the filter represented by the coefficients *ak*, at *N* equally spaced samples along the unit circle [20]:

$$H(f) = a_0 - \sum_{k=1}^{p} a_k \exp\left(-j2\pi nk/N\right) \tag{3}$$

For this purpose, the system can employ, as a previous step, the MATLAB routine *filter*, applied to the coefficients, *ak*, of the transfer function and to the original audio recording, xn, with *n* = 0, 1, ..., *N* − 1. As indicated in [20], two closely spaced formants frequently merge into one spectral peak and cannot be resolved on the unit circle, even with infinite resolution. However, they can often be separated by simply recomputing the spectrum on a circle of radius, *r*, less than 1. This amounts to re-evaluating *H*(*f*) at *z* = *r* exp(*j*2*πn*/*N*), with *r* < 1. Because the contour comes closer to the two poles, their peaks are enhanced, and a separation can be effected. Moreover, due to the nature of the linear prediction spectrum, in the regions where the signal energy is strong, i.e., close to the peaks of the spectrum, the linear prediction spectrum closely follows the signal spectrum, whereas in the regions where the energy is weak, i.e., close to the valleys, both spectra differ significantly. Checking the peak values of the linear prediction spectrum therefore confirms the formants.
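The trick of re-evaluating the linear prediction spectrum on a circle of radius *r* < 1 can be sketched as follows (Python rather than MATLAB; the pole positions in the demo are our own, chosen to imitate two merged formants):

```python
import numpy as np

def lpc_spectrum(a, N=512, r=1.0):
    """Magnitude of 1/A(z) on a circle of radius r, where
    A(z) = a[0] + a[1] z^-1 + ... (a[0] = 1, MATLAB lpc convention).
    r < 1 pulls the evaluation contour towards the poles, enhancing
    the peaks of formants that merged on the unit circle."""
    k = np.arange(len(a))
    n = np.arange(N)
    # z^-k evaluated at z = r * exp(j 2*pi*n/N)
    zmk = (r ** -k) * np.exp(-2j * np.pi * np.outer(n, k) / N)
    return 1.0 / np.abs(zmk @ a)

# Two pole pairs only 100 Hz apart (10 kHz rate assumed): evaluating at
# r = 0.95 brings the contour nearer the poles, so the peaks grow.
fs = 10000.0
poles = []
for f in (900.0, 1000.0):
    p = 0.93 * np.exp(2j * np.pi * f / fs)
    poles += [p, p.conjugate()]
a = np.real(np.poly(poles))  # A(z) coefficients, a[0] = 1
print(lpc_spectrum(a, r=0.95).max() > lpc_spectrum(a, r=1.0).max())  # True
```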

In general, nasalization presents a special problem because it is just a resonance of the nasal tract (it can be seen as a pole in the transfer function), whereas the oral tract acts as a closed side branch, which causes zeros (minimum energy in the spectrum). Frequently, the second formant, *F*2, is greatly reduced in amplitude because of a nearby zero; in fact, often there is no peak corresponding to *F*2 at all. The nasalization of a vowel is a problem of a similar nature: in this case, the nasal cavity is an open side branch, causing extra zeros and extra poles. In a nasalized front vowel there is typically an extra small peak slightly above the first formant in frequency. In a nasalized back vowel, the apparent bandwidth of *F*1 becomes quite wide because of a nearby zero, and sometimes there is no peak for *F*1. We will show this effect in the results included throughout this paper. For each frame, an *N*-point FFT is employed to compute all values of the DFTs. If the number of samples of the last frame is not a power of two, that frame must first be zero-padded before computing its FFT [13]. As an interesting remark, computing all *N* values of a DFT from the definition requires approximately *N*² arithmetical operations, whereas an FFT algorithm obtains the same result with an amount of computation approximately proportional to *N* log₂ *N* [21].

**5. Results and discussions**

In this section, we show some results offered by the implemented system. As we have commented above, tests and recordings have been carried out with adult females, following the original system by [1], which was released in 2001. Nevertheless, we must remark that the signal processing problem would be identical in the case of males and children; the automatic formant extraction method would not change. After the process described in Section 3, the system will have selected frequency peaks as candidate formants in each recording. The formant frequencies of vowels produced by males would surely be shifted to lower frequencies with respect to the formant frequencies of vowels produced by women (see [4], for instance).


http://dx.doi.org/10.5772/57221

Spectral Study with Automatic Formant Extraction to Improve Non-native Pronunciation of English Vowels

**Figure 3.** Time-varying spectral representation derived from a wrong-pronounced vowel number 5 = / :/ by a woman of 29 years old.

In addition, the algorithm presented in this paper is based on the one by [20], which is effective in formant extraction during all vowel-like segments of continuous speech. In our particular case, voice recordings are even simpler, since they contain just a vowel sound, following the original system implemented by Pavón [1]. Our system compares users' recordings to those already included in its database, the latter being the students' references for English learning. The algorithm shows users how to position their jaws and tongues for a correct vowel pronunciation by analysing the formant frequency shift in vowels uttered by users with respect to the already-recorded model formants. This association relies on the relationship between *F*1, *F*2 and the articulators. There is a direct connection between the first formant frequency and mouth opening: the higher the *F*1 frequency, the more open the vowel, and vice versa. Moreover, there is also a direct association between backward tongue movement and a lowering of the *F*2 frequency: high *F*2 frequencies imply front vowels, and vice versa. These conclusions can be verified in the results offered by [3, 4], especially in Table V of [4]. These authors confirm the correlation between the first formant frequencies and the vowel type (e.g. open, close, front and back).

As a significant result, we analyse a 29-year-old female trying to pronounce vowel number 5 [18]. Initially, she does not position her mouth and tongue appropriately: her mouth opening is not wide enough, and her tongue position is not as far back as required. Fig. 3 displays the temporal evolution of the spectrum obtained when she tries to pronounce vowel number 5 [18] = / :/.


**Figure 4.** Spectra of different temporal segments after applying the sliding window procedure to a wrong-pronounced vowel number 5 = / :/ by a woman of 29 years old. (a) 0 - 92.8 ms, (b) 0.186 - 0.278 s, (c) 0.278 - 0.371 s, (d) 0.928 - 1.021 s, (e) 1.207 - 1.3 s, (f) 1.76 - 1.86 s.

Now, in Fig. 4, we show some spectra obtained from different temporal segments after applying the sliding window procedure detailed in the previous section. As indicated above, each temporal segment is approximately 92.8 ms long. In particular, we show the following intervals: 0 - 92.8 ms (Fig. 4.a), 0.186 - 0.278 s (Fig. 4.b), 0.278 - 0.371 s (Fig. 4.c), 0.928 - 1.021 s (Fig. 4.d), 1.207 - 1.3 s (Fig. 4.e), and 1.76 - 1.86 s (Fig. 4.f). We can clearly see the evolution of the different peaks in the spectrum. Most of them are harmonics of the fundamental frequency (*F*0 = 201.9 Hz). For instance, in Fig. 4.a, the fundamental frequency is placed at 204 Hz, and the peaks at 409, 611, 815, 1024 and 1228 Hz are considered the first five harmonics of *F*0. Throughout Fig. 4, we can see the evolution of the formants (*F*1 and *F*2) in each of the temporal segments. Even though these formants may have a low energy level (above all the second formant), the system operates successfully, as can be observed in Fig. 5.b. There, the system concludes that, for this recording of a 29-year-old woman, the fundamental frequency is detected at 201.9 Hz, whereas the first two formants are positioned at 929.1 Hz and 1308 Hz, respectively. A peak at 872 Hz was also present, but it was discarded by the system after checking that it is a harmonic of the spurious peak at 175.3 Hz. Finally, the spectrum of the whole recording (2.64 s long, after detecting the onset of the vowel and rejecting the samples before the onset) is included in Fig. 5.a.


**Figure 5.** Spectrum of the whole recording (a), and amplitudes of the fundamental frequency and the frequencies of the first two formants (b).

**Figure 6.** Time-varying spectral representation derived from vowel number 5 = / :/.

At this stage, our system compares the formant positions coming from this female recording to the original recordings in [1]. According to Pavón, the formants are placed at 940 and 1540 Hz, respectively. Therefore, this female subject has not achieved the correct articulatory mode or articulatory point. More specifically, her mouth is more closed than required, which is why *F*1 appears shifted leftwards, from 940 Hz down to 926 Hz. If the *F*1 frequency had been higher, the mouth opening would have been too wide. On the other hand, the articulatory point is not correct either: *F*2 appears at 1306 Hz, a much lower frequency than the 1540 Hz indicated in [1]. In this case, the subject has uttered vowel number 5 with a tongue position that is too far back, while the system suggests a more central one. On the contrary, if her tongue had been more fronted, *F*2 would be detected at frequencies higher than 1540 Hz. The evolution of the first formant frequencies for each English vowel appears in Table V of [4], for American English vowels, and in [3], for British English.
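The articulatory feedback described here can be summarised as a small rule set (a hypothetical sketch; the message strings, and the reuse of the ±5 % acceptance range applied by the system to the corrected recording, are our choices, not the system's actual output):

```python
def articulatory_feedback(f1, f2, ref_f1, ref_f2, tol=0.05):
    """Map F1/F2 deviations from the reference formants to mouth and
    tongue corrections, following the F1-aperture and F2-backness
    relations described in the text. `tol` is a relative tolerance."""
    tips = []
    if f1 < ref_f1 * (1 - tol):
        tips.append("open your mouth more (F1 too low)")
    elif f1 > ref_f1 * (1 + tol):
        tips.append("close your mouth a little (F1 too high)")
    if f2 < ref_f2 * (1 - tol):
        tips.append("move your tongue forward (F2 too low)")
    elif f2 > ref_f2 * (1 + tol):
        tips.append("move your tongue back (F2 too high)")
    return tips or ["pronunciation within tolerance"]

# The recording discussed above, against Pavón's references
print(articulatory_feedback(926, 1306, 940, 1540))
# ['move your tongue forward (F2 too low)']
```

Note that, with a ±5 % band, only the large *F*2 deviation of this recording is flagged; the *F*1 deviation (926 Hz against 940 Hz) falls inside the tolerance.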

Thanks to the corrections suggested by our system, the subject uttered vowel number 5 again, with the result shown in Fig. 6. As in the previous case, we depict in detail some time segments resulting from the sliding window procedure described above. As indicated, each temporal segment is approximately 92.8 ms long. In this case, we show the following intervals: 0 - 92.8 ms (Fig. 7.a), 0.371 - 0.464 s (Fig. 7.b), 0.464 - 0.557 s (Fig. 7.c), 0.835 - 0.928 s (Fig. 7.d), 0.928 - 1.021 s (Fig. 7.e), and 1.02 - 1.115 s (Fig. 7.f).


In this case, the gesture corrections pointed out by our system allow the speaker to approach the target vowel sound. As we can see in Fig. 8.b, *F*1 and *F*2 are now 934 and 1495 Hz, respectively, with *F*1 = 940 Hz and *F*2 = 1540 Hz being the reference frequencies recorded in the system. Consequently, this new recording is closer to the adequate pronunciation range of vowel number 5. If we accept a ±5% error range, the speaker's new pronunciation can be considered correct, since the system's error calculation is the following:

$$\text{Error in F1 (\%)} = \frac{|934 - 940|}{940} = 0.6\%. \tag{4}$$

$$\text{Error in F2 (\%)} = \frac{|1495 - 1540|}{1540} = 2.92\%. \tag{5}$$
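These error figures can be reproduced directly (Eq. (4) evaluates to about 0.64 %, rounded to 0.6 % in the text):

```python
def formant_error(measured, reference):
    """Relative formant error in percent, as in Eqs. (4) and (5)."""
    return abs(measured - reference) / reference * 100.0

print(formant_error(934, 940))    # ~0.64 %, Eq. (4)
print(formant_error(1495, 1540))  # ~2.92 %, Eq. (5)
print(formant_error(934, 940) <= 5 and formant_error(1495, 1540) <= 5)  # True
```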




**Figure 7.** Spectra of different temporal segments after applying the sliding window procedure to a well-pronounced vowel number 5 = / :/ by a woman of 29 years old. (a) 0 - 92.8 ms, (b) 0.371 - 0.464 s, (c) 0.464 - 0.557 s, (d) 0.835 - 0.928 s, (e) 0.928 - 1.021 s, (f) 1.021 - 1.115 s.

**Figure 8.** Spectrum of the whole recording (a), and amplitudes of the fundamental frequency and the frequencies of the first two formants (b).

In Fig. 8.b, the fundamental frequency is now detected at 203.9 Hz; spurious peaks at 120 and 150 Hz, together with their harmonics, are again marked and discarded.

With respect to vowel duration, our system does not pay attention to this feature, because we understand that any user can distinguish a long duration from a short one in any vowel recording included in the system.

Finally, as in [20], the success rate of the automatic formant extraction algorithm is even higher than in McCandless's work, because vowels are given to the system in an isolated manner and not within a sentence. Only when a formant was too strongly cancelled by a nearby zero (in nasals and nasalized vowels), or a peak merger was not resolved, did the system fail to achieve the correct result; these cases represent only 10-15 percent of the total.

**6. Concluding remarks**

In this paper we have improved the tool implemented in [1], which consists of a software system for the teaching of English phonology. Pavón's contribution allows phoneme recordings, which are later compared to similar sounds in the system. However, it offers a comparison based on the time domain, which is certainly not significant when providing help for learning the pronunciation of a second language. Moreover, it includes female voice recordings only, so male users (and children) would not obtain a significant result. Taking into account that Pavón's original idea is very good for those students who lack listening and pronunciation skills, this paper describes a new procedure to be added to the previous system, based on a frequency domain analysis. In this way, by means of a formant detection algorithm based on [20] and [11], the system can offer a more realistic contribution to the teaching of English pronunciation and phonology. F1 and F2 indicate the oral cavity opening and the tongue position, respectively, and so the system specifies whether students have to open or close their mouths and which part of the tongue must be particularly employed in each vowel sound. As [1] makes use of female voice recordings only, our subjects are female adults. However, our formant detection algorithm would work equally with male and children's voices. Male and child native speakers are required for reference recordings.
