5. CNN-based end-to-end approach

A novel acoustic model based on CNN was proposed by Palaz et al. [5], shown in Figure 6. In this model, the raw speech signal is segmented into windows s_t^c = {s_(t-c), …, s_t, …, s_(t+c)} spanning a context of 2c frames, covering win milliseconds. The first convolutional layer learns useful features from the raw speech signal, and the remaining convolutional layers process these features further into useful information. After processing the speech signal, the CNN estimates the class conditional probability P(i | s_t^c), which is used to calculate the scaled emission likelihood p(s_t^c | i). Several filter stages are present in the network before the classification stage; a filter stage is a combination of a convolutional layer, a pooling layer, and a nonlinearity. The feature stage and the classifier stage are trained jointly using the back-propagation algorithm.
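The pipeline above (filter stage: convolution, max-pooling, nonlinearity; then a classifier producing class posteriors) can be sketched as follows. This is a minimal illustrative sketch with made-up layer sizes and random weights, not the architecture or values used by Palaz et al.:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, dW=1):
    """Valid 1-D convolution of signal x with kernel w, temporal shift dW."""
    kW = len(w)
    n_out = (len(x) - kW) // dW + 1
    return np.array([np.dot(x[i * dW:i * dW + kW], w) for i in range(n_out)])

def max_pool(x, kWmp, dWmp):
    """Max-pooling with kernel width kWmp and shift dWmp."""
    n_out = (len(x) - kWmp) // dWmp + 1
    return np.array([x[i * dWmp:i * dWmp + kWmp].max() for i in range(n_out)])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Raw speech context window s_t^c, flattened into samples (illustrative length).
x = rng.standard_normal(300)

# One filter stage: convolution -> max-pooling -> nonlinearity (tanh here).
h = np.tanh(max_pool(conv1d(x, rng.standard_normal(30), dW=10), kWmp=3, dWmp=3))

# Classifier stage: linear layer + softmax yields P(i | s_t^c) over phone classes.
W = rng.standard_normal((40, len(h)))   # 40 hypothetical phone classes
posteriors = softmax(W @ h)
```

Stacking several filter stages before the classifier, and training the whole stack jointly by back-propagation, gives the end-to-end model described in the text.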

6. Experimental results

In this model, a number of hyperparameters specify the structure of the network. The number of hidden units in each hidden layer is very important, so it is treated as a hyperparameter. win represents the time span of the input speech signal, kW the kernel (temporal window) width, and dW the shift of the temporal window; kWmp represents the max-pooling kernel width and dWmp the shift of the max-pooling kernel. The values of all hyperparameters are estimated during training based on frame-level classification accuracy on validation data. The range of hyperparameters after validation is shown in Table 1.

Convolutional Neural Networks for Raw Speech Recognition
http://dx.doi.org/10.5772/intechopen.80026
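As a rough illustration of how kW, dW, kWmp, and dWmp interact, the width of each layer's output follows the standard valid-convolution formula out = (in - kW) // dW + 1 (and analogously for max-pooling). The values below are assumed for illustration (a 16 kHz sampling rate and settings within the validated ranges), not the authors' exact configuration:

```python
def out_width(n_in, kW, dW):
    # Number of positions a kernel of width kW with shift dW fits into n_in samples.
    return (n_in - kW) // dW + 1

n = 250 * 16                       # 250 ms window at 16 kHz -> 4000 samples
n = out_width(n, kW=30, dW=10)     # first ConvNet layer -> 398 frames
n = out_width(n, kW=4, dW=2)       # max-pooling (kWmp=4, dWmp=2) -> 199... check below
print(n)   # -> 198
```

Chaining this formula through all filter stages is how one verifies that a chosen (win, kW, dW, kWmp, dWmp) combination leaves a nonzero output width for the classifier.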

Table 1. Range of hyperparameters for the TIMIT dataset during validation.

Hyperparameter                                  Units     Range
Input window size (win)                         ms        100–700
Kernel width of the first ConvNet layer (kW1)   Samples   10–90
Kernel width of the nth ConvNet layer (kWn)     Samples   1–11
Number of filters per kernel (dout)             Filters   20–100
Max-pooling kernel width (kWmp)                 Frames    2–6
Number of hidden units in the classifier        Units     200–1500

The experiments are conducted with three convolutional layers. The speech window size (win) is taken as 250 ms, with a temporal window shift (dW) of 10 ms. Table 2 shows a comparison of existing end-to-end speech recognition models in terms of PER. The results of the experiments conducted on the TIMIT dataset for this model are compared with existing techniques, as shown in Table 3. The main advantages of this model are that it uses only a few parameters and offers better performance. It also increases the generalization capability of the classifier.

Table 2. Comparison of existing end-to-end speech models in the context of PER (%).

End-to-end speech recognition model                                                                                PER (%)
CNN-based speech recognition system using raw speech as input [7]                                                  33.2
Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks [36] 32.4
Convolutional neural network-based continuous speech recognition using raw speech signal [6]                       32.3
End-to-end phoneme sequence recognition using convolutional neural networks [5]                                    27.2
CNN-based direct raw speech model                                                                                  21.9
End-to-end continuous speech recognition using attention-based recurrent NN: First results [19]                    18.57
Toward end-to-end speech recognition with deep convolutional neural networks [44]                                  18.2
Attention-based models for speech recognition [20]                                                                 17.6
Segmental recurrent neural networks for end-to-end speech recognition [45]                                         17.3

Bold value and text represent the performance of the CNN-based direct raw speech model.

The end-to-end approach employs the following understanding: the end-to-end model estimates P(i | s_t^c) by processing the raw speech signal with minimal assumptions or prior knowledge.
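In hybrid NN/HMM decoding, the estimated posteriors P(i | s_t^c) are commonly converted to scaled likelihoods by dividing by the class priors P(i), since p(s_t^c | i) ∝ P(i | s_t^c) / P(i) and the remaining factor p(s_t^c) is constant across classes. A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up posteriors P(i | s_t^c) for three phone classes, and priors P(i)
# (in practice, priors are estimated from class frequencies in the training data).
posteriors = np.array([0.7, 0.2, 0.1])
priors = np.array([0.5, 0.3, 0.2])

# Scaled likelihoods p(s_t^c | i) up to a class-independent factor p(s_t^c);
# that factor cancels in HMM decoding, so it can be ignored.
scaled = posteriors / priors
print(scaled)   # ratios 1.4, 0.667, 0.5
```

In practice this division is usually done in the log domain (log-posterior minus log-prior) for numerical stability during Viterbi decoding.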

Figure 6. CNN-based raw speech phoneme recognition system.
