2.1. Algorithm description, strengths and weaknesses

MFCCs are cepstral coefficients derived on a warped frequency scale centered on human auditory perception. In the computation of MFCCs, the speech signal is first windowed to split it into frames. Since the high-frequency formants possess reduced amplitude compared to the low-frequency formants, the high frequencies are emphasized to obtain similar amplitude for all the formants. After windowing, the Fast Fourier Transform (FFT) is applied to find the power spectrum of each frame. Subsequently, filter bank processing is carried out on the power spectrum using the mel scale. After the filter bank outputs are translated to the log domain, the discrete cosine transform (DCT) is applied to them to calculate the MFCCs [5]. The formula used to calculate the mels for any frequency is [19, 22]:

$$\text{mel}(f) = 2595 \times \log_{10}(1 + f/700) \tag{1}$$
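As a small illustration of Eq. (1), the sketch below converts between Hz and mels. The function names `hz_to_mel` and `mel_to_hz` are chosen here for illustration and are not taken from the cited sources; the inverse mapping is simply Eq. (1) solved for f.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels using Eq. (1)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Invert Eq. (1): recover the frequency in Hz from a value in mels."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Print a few conversions to show how the scale compresses high frequencies.
for f in (100.0, 1000.0, 4000.0):
    print(f"{f:7.1f} Hz -> {hz_to_mel(f):8.2f} mel")
```

Equal steps in mels correspond to progressively wider steps in Hz, which is why mel-spaced filters are narrow at low frequencies and wide at high frequencies.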

3. Linear prediction coefficients (LPC)

Linear prediction coefficients (LPC) imitate the human vocal tract [16] and give a robust speech feature. LPC evaluates the speech signal by approximating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining residue. The result expresses each sample of the signal as a linear combination of previous samples. The coefficients of the difference equation characterize the formants; thus, LPC needs to approximate these coefficients [25]. LPC is a powerful speech analysis method and has gained fame as a formant estimation method [17].

The frequencies at which the resonant peaks occur are called the formant frequencies. Thus, with this technique, the positions of the formants in a speech signal can be estimated by calculating the linear predictive coefficients over a sliding window and finding the peaks in the spectrum of the resulting linear prediction filter [17]. LPC is helpful in the encoding of high-quality speech at a low bit rate [13, 26, 27].

Other features that can be derived from LPC are linear prediction cepstral coefficients (LPCC), log area ratio (LAR), reflection coefficients (RC), line spectral frequencies (LSF) and arcus sine coefficients (ARCSIN) [13]. LPC is generally used for speech reconstruction. The LPC method is generally applied by music and electronics firms for creating mobile robots, by telephone companies, and in the tonal analysis of violins and other stringed musical instruments [4].

3.1. Algorithm description, strengths and weaknesses

The linear prediction method is applied to obtain the filter coefficients equivalent to the vocal tract by minimizing the mean square error between the input speech and the estimated speech [28]. Linear prediction analysis of a speech signal predicts any given speech sample at a specific time as a linear weighted sum of preceding samples. The linear predictive model of speech production is given as [13, 25]:

$$\hat{s}(n) = \sum_{k=1}^{p} a_{k}\, s(n-k) \tag{3}$$

where $\hat{s}$ is the predicted sample, $s$ is the speech sample, $a_k$ are the predictor coefficients and $p$ is the order of the predictor.

The prediction error is given as [16, 25]:

$$e(n) = s(n) - \hat{s}(n) \tag{4}$$

Subsequently, each frame of the windowed signal is autocorrelated, with the highest autocorrelation lag equal to the order of the linear prediction analysis. This is followed by the LPC analysis, where the autocorrelations of each frame are converted into an LPC parameter set, which consists of the LPC coefficients [26]. A summary of the procedure for obtaining the LPC is as seen in Figure 2. LPC can be derived by [7]:

Some Commonly Used Speech Feature Extraction Algorithms

http://dx.doi.org/10.5772/intechopen.80419
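As a rough illustration of the autocorrelation-based LPC analysis described in Section 3.1, the sketch below autocorrelates one frame, solves for the predictor coefficients, and evaluates Eqs. (3) and (4). The Levinson-Durbin recursion used here is a common way to solve the resulting equations, but it is an assumption of this example rather than a method confirmed by the text. All function names are hypothetical, and the frame is assumed to have been windowed beforehand.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r(0..max_lag) of one (already windowed) frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for a_1..a_p (assumed solver)."""
    if r[0] == 0.0:                       # silent frame: nothing to predict
        return [0.0] * order, 0.0
    a = [0.0] * (order + 1)               # a[k] holds a_k; a[0] is unused
    err = r[0]                            # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                     # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lpc(frame, order):
    """LPC coefficients of a frame; the highest lag used equals the order."""
    return levinson_durbin(autocorrelation(frame, order), order)


def predict(signal, coeffs, n):
    """Eq. (3): weighted sum of the p previous samples (missing ones skipped)."""
    return sum(a_k * signal[n - k]
               for k, a_k in enumerate(coeffs, start=1) if n - k >= 0)


def prediction_error(signal, coeffs):
    """Eq. (4): e(n) = s(n) - s_hat(n) for every sample."""
    return [signal[n] - predict(signal, coeffs, n) for n in range(len(signal))]


# Usage: a decaying exponential is predicted almost exactly by one coefficient.
signal = [0.9 ** n for n in range(200)]
coeffs, residual_energy = lpc(signal, order=2)
errors = prediction_error(signal, coeffs)
print(coeffs, residual_energy, max(abs(e) for e in errors[1:]))
```

The toy signal is chosen so that the expected behaviour is easy to check by hand: each sample is 0.9 times the previous one, so the first coefficient should be close to 0.9 and the second close to zero.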

In Eq. (1), mel(f) is the frequency in mels and f is the frequency in Hz.

The MFCCs are calculated using this equation [9, 19]:

$$\hat{C}_{n} = \sum_{k=1}^{K} \left( \log \hat{S}_{k} \right) \cos \left[ n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right] \tag{2}$$

where $K$ is the number of mel filter bank outputs, $\hat{S}_{k}$ is the output of the $k$-th filter of the filter bank and $\hat{C}_{n}$ is the $n$-th MFCC.
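A minimal sketch of Eq. (2) follows. It reads the sum as running over the filter bank outputs, which is an assumption about the intended indexing. The function name `mfcc_from_filterbank` and its `num_coeffs` parameter are hypothetical, and the filter bank energies are assumed to be strictly positive.

```python
import math


def mfcc_from_filterbank(energies, num_coeffs):
    """Apply Eq. (2): a cosine transform of the log filter bank outputs."""
    K = len(energies)                         # number of filter bank outputs
    log_e = [math.log(e) for e in energies]   # translate to the log domain
    coeffs = []
    for n in range(1, num_coeffs + 1):
        c_n = sum(log_e[k - 1] * math.cos(n * (k - 0.5) * math.pi / K)
                  for k in range(1, K + 1))
        coeffs.append(c_n)
    return coeffs


# Usage: a flat spectrum has no spectral shape, so its coefficients vanish.
flat = mfcc_from_filterbank([2.0] * 8, num_coeffs=4)
tilted = mfcc_from_filterbank([8.0, 4.0, 2.0, 1.0], num_coeffs=2)
print(flat, tilted)
```

A flat set of energies yields coefficients that are numerically zero, while a spectrum that falls with frequency produces a non-zero first coefficient, which shows that the coefficients describe the shape of the log spectrum rather than its overall level.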

The block diagram of the MFCC processor can be seen in Figure 1. It summarizes all the processes and steps taken to obtain the needed coefficients. MFCC represents the low-frequency region better than the high-frequency region; hence, it can capture formants that lie in the low-frequency range and describe the vocal tract resonances. It is generally recognized as a front-end procedure for typical speaker identification applications, as it is less vulnerable to noise disturbance, shows little session variability, and is easy to extract [19]. Also, it is a well-suited representation for sounds when the source characteristics are stable and consistent (music and speech) [23]. Furthermore, it can capture information from sampled signals with frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans [9].

Cepstral coefficients are reported to be accurate in certain pattern recognition problems relating to the human voice. They are used extensively in speaker identification and speech recognition [21]. However, some formants lie above 1 kHz and are not efficiently captured because of the large filter spacing in the high-frequency range [19]. MFCC features are also not very accurate in the presence of background noise [14, 24] and might not be well suited for generalization [23].

Figure 1. Block diagram of MFCC processor.
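The stages of the MFCC computation (emphasis of high frequencies, framing and windowing, power spectrum, mel filter bank, log and DCT) can be sketched end to end as below. This is only an illustrative outline under several assumptions that the text does not specify: a pre-emphasis factor of 0.97 applied before framing, a Hamming window, triangular filters, and toy frame sizes. A naive DFT stands in for the FFT so that the example needs only the Python standard library, and all function names are hypothetical.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)          # Eq. (1)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)          # inverse of Eq. (1)


def pre_emphasis(signal, alpha=0.97):
    # Boost high frequencies: y[n] = x[n] - alpha * x[n-1] (alpha is assumed).
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def windowed_frames(signal, frame_len, hop):
    # Split into overlapping frames and apply a Hamming window to each.
    win = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
           for n in range(frame_len)]
    return [[signal[start + n] * win[n] for n in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def power_spectrum(frame):
    # Naive O(N^2) DFT; a practical system would call an FFT routine instead.
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) ** 2 / N
            for k in range(N // 2 + 1)]


def mel_filterbank(num_filters, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for b in range(lo, mid):                          # rising slope
            filt[b] = (b - lo) / (mid - lo)
        for b in range(mid, hi):                          # falling slope
            filt[b] = (hi - b) / (hi - mid)
        bank.append(filt)
    return bank


def mfcc(signal, sample_rate, frame_len=64, hop=32,
         num_filters=12, num_coeffs=6):
    bank = mel_filterbank(num_filters, frame_len, sample_rate)
    features = []
    for frame in windowed_frames(pre_emphasis(signal), frame_len, hop):
        spec = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(filt, spec)) for filt in bank]
        # Floor the energies so that empty filters do not cause log(0).
        log_e = [math.log(max(e, 1e-10)) for e in energies]
        K = num_filters
        features.append([sum(log_e[k - 1] * math.cos(n * (k - 0.5) * math.pi / K)
                             for k in range(1, K + 1))            # Eq. (2)
                         for n in range(1, num_coeffs + 1)])
    return features


# Usage: one short 440 Hz tone sampled at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(256)]
feats = mfcc(tone, sr)
print(len(feats), "frames x", len(feats[0]), "coefficients")
```

The result is one short coefficient vector per frame, which is the form in which MFCC features are typically passed to a recognizer.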
