5. CNN-based end-to-end approach

The end-to-end approach is built on the following observations:

1. Speech signals are non-stationary in nature; therefore, they are processed in a short-term manner. Traditional feature extraction methods generally use a 20–40 ms sliding window. Short-term processing of the signal is also required in the end-to-end approach, so the size of the short-term window is taken as a hyperparameter that is automatically determined during training.

2. Feature extraction is a filtering operation, because its components, such as the Fourier transform and the discrete cosine transform, are filtering operations. In traditional systems, filtering is applied in both frequency and time, and this factor is also considered when building the convolutional layers of the end-to-end system. Therefore, the number of filter banks and their parameters are taken as hyperparameters that are automatically determined during training.

3. The short-term processing of the speech signal spreads information across time. In traditional systems, this spread information is modeled by calculating temporal derivatives and contextual information. The intermediate representation supplied to the classifier is therefore computed over a long time span of the input speech signal, so win, the size of the input window, is also taken as a hyperparameter that is estimated during training.

A novel acoustic model based on CNN is proposed by Palaz et al. [5] and is shown in Figure 6. In this model, the raw speech signal is segmented into the input sequence s_t^c = {s_{t-c}, …, s_t, …, s_{t+c}}, a context of 2c frames spanning a window of win milliseconds. The first convolutional layer learns useful features from the raw speech signal, and the remaining convolutional layers process these features further into useful information. After processing the speech signal, the CNN estimates the class conditional probability P(i | s_t^c), which is used to calculate the emission scaled likelihood P(s_t^c | i). Several filter stages are present in the network before the classification stage. A filter stage is a combination of a convolutional layer, a pooling layer, and a nonlinearity. The joint training of the feature stage and the classifier stage is performed using the back-propagation algorithm. The end-to-end model thus estimates P(i | s_t^c) by processing the speech signal with minimal assumptions or prior knowledge.

Figure 6. CNN-based raw speech phoneme recognition system.
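To make the pipeline of Figure 6 concrete, the sketch below stacks three filter stages (convolution, max-pooling, nonlinearity) over a raw-waveform window and ends with a small classification stage. It assumes PyTorch; the channel counts, kernel sizes, and the tanh nonlinearity are illustrative choices, not the exact configuration reported by Palaz et al. [5].

```python
# Minimal sketch of a CNN acoustic model operating on raw speech.
# Layer sizes are illustrative, not the configuration of Palaz et al. [5].
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    def __init__(self, n_classes: int = 39):
        super().__init__()
        # Each "filter stage" = convolutional layer + max-pooling + nonlinearity.
        self.filter_stages = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=30, stride=10),  # learns features from raw samples
            nn.MaxPool1d(kernel_size=3, stride=3),
            nn.Tanh(),
            nn.Conv1d(80, 60, kernel_size=7, stride=1),
            nn.MaxPool1d(kernel_size=3, stride=3),
            nn.Tanh(),
            nn.Conv1d(60, 60, kernel_size=7, stride=1),
            nn.MaxPool1d(kernel_size=3, stride=3),
            nn.Tanh(),
        )
        # Classification stage producing scores for P(i | s_t^c).
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(60, 500),
            nn.Tanh(),
            nn.Linear(500, n_classes),
        )

    def forward(self, raw_window: torch.Tensor) -> torch.Tensor:
        # raw_window: (batch, 1, n_samples), one context window s_t^c of the waveform.
        return self.classifier(self.filter_stages(raw_window))


model = RawSpeechCNN()
logits = model(torch.randn(8, 1, 4000))      # eight 250 ms windows at 16 kHz
posteriors = torch.softmax(logits, dim=-1)   # P(i | s_t^c) for each window
```

In a hybrid HMM setup, these posteriors would typically be converted to the scaled likelihoods P(s_t^c | i) by dividing by the class priors before decoding.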

6. Experimental results

In this model, a number of hyperparameters are used to specify the structure of the network. The number of hidden units in each hidden layer is very important and is therefore taken as a hyperparameter. win represents the time span of the input speech signal, kW the kernel (temporal window) width, dW the shift of the temporal window, kWmp the max-pooling kernel width, and dWmp the shift of the max-pooling kernel. The values of all hyperparameters are estimated during training based on frame-level classification accuracy on validation data. The range of hyperparameters after validation is shown in Table 1.
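As a rough illustration of how these hyperparameters shape the network, the hypothetical helper below builds a single filter stage from kW, dW, kWmp, and dWmp. The example values at the bottom are placeholders, not the validated settings summarized in Table 1.

```python
# Hypothetical helper mapping the named hyperparameters onto one filter stage.
import torch.nn as nn

def make_filter_stage(in_ch: int, out_ch: int, kW: int, dW: int,
                      kWmp: int, dWmp: int) -> nn.Sequential:
    """One filter stage: convolution with kernel width kW and shift dW,
    followed by max-pooling with kernel width kWmp and shift dWmp,
    then a nonlinearity."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=kW, stride=dW),
        nn.MaxPool1d(kernel_size=kWmp, stride=dWmp),
        nn.Tanh(),
    )

# Placeholder values; in practice each is tuned on validation data (see Table 1).
stage = make_filter_stage(in_ch=1, out_ch=80, kW=30, dW=10, kWmp=3, dWmp=3)
```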

The experiments are conducted with three convolutional layers. The speech window size (win) is taken as 250 ms, with a temporal window shift (dW) of 10 ms. Table 2 compares existing end-to-end speech recognition models in terms of PER. The results of the experiments conducted on the TIMIT dataset for this model are compared with existing techniques in Table 3. The main advantages of this model are that it uses only a few parameters and offers better performance; it also increases the generalization capability of the classifier.
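For reference, the short sketch below converts win = 250 ms and dW = 10 ms into sample counts and slices a waveform into the corresponding input windows. It assumes 16 kHz audio, as used for TIMIT, and the three-second utterance is synthetic.

```python
# Framing a raw waveform with win = 250 ms and dW = 10 ms at 16 kHz (illustrative).
import torch

sample_rate = 16000
win_ms, dW_ms = 250, 10
win = int(sample_rate * win_ms / 1000)    # 4000 samples per input window s_t^c
dW = int(sample_rate * dW_ms / 1000)      # 160-sample shift between consecutive windows

waveform = torch.randn(sample_rate * 3)   # stand-in for a 3-second utterance
windows = waveform.unfold(0, win, dW)     # one row per input window
print(windows.shape)                      # torch.Size([276, 4000])
```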


Table 1. Range of hyperparameters for the TIMIT dataset during validation.


Table 2. Comparison of existing end-to-end speech models in terms of PER (%).


Table 3. Comparison of existing techniques with the CNN-based direct raw speech model in terms of PER (%).


[3] Hermansky H. Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America. 1990;87(4):1738-1752

[4] Chorowski J, Jaitly N. Towards better decoding and language model integration in sequence to sequence models. 2016. arXiv preprint arXiv:1612.02695

[5] Palaz D, Collobert R, Doss MM. End-to-end phoneme sequence recognition using convolutional neural networks. 2013. arXiv preprint arXiv:1312.2137

[6] Palaz D, Doss MM, Collobert R. Convolutional neural networks-based continuous speech recognition using raw speech signal. In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE; 2015

[7] Palaz D, Collobert R. Analysis of CNN-based speech recognition system using raw speech as input. In: Proceedings of Interspeech 2015 (No. EPFL-Conf-210029); 2015

[8] O'Shaughnessy D. Automatic speech recognition: History, methods and challenges. Pattern Recognition. 2008;41(10):2965-2979

[9] Dahl GE, Yu D, Deng L, Acero A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing. 2012;20(1):30-42

[10] Hinton G, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine. 2012;29(6):82-97

[11] Seide F, Li G, Yu D. Conversational speech transcription using context-dependent deep neural networks. In: Twelfth Annual Conference of the International Speech Communication Association; 2011

[12] Abdel-Hamid O, Mohamed AR, Jiang H, Penn G. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); March 2012. IEEE; 2012

[13] Senior A, Heigold G, Bacchiani M, Liao H. GMM-free DNN acoustic model training. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE; 2014

[14] Bacchiani M, Senior A, Heigold G. Asynchronous, online, GMM-free training of a context dependent acoustic model for speech recognition. In: Fifteenth Annual Conference of the International Speech Communication Association; 2014

[15] Gales M, Young S. The application of hidden Markov models in speech recognition. Foundations and Trends® in Signal Processing. 2008;1(3):195-304

[16] Graves A, Jaitly N. Towards end-to-end speech recognition with recurrent neural networks. In: International Conference on Machine Learning; 2014

