CTC techniques infer the speech-label alignment automatically. CTC [25] was developed for labeling unsegmented sequence data; Hannun et al. [17] first used it for decoding purposes in Baidu's Deep Speech network. CTC uses dynamic programming [16] to compute efficiently over all strictly monotonic alignments. However, graph-based decoding and a language model are still required. CTC approaches use RNNs for feature extraction [28]. Graves et al. [30] used the CTC objective function in a deep bidirectional long short-term memory (LSTM) system. This model considers all possible alignments between the input and output sequences during model training, rather than relying on a prior alignment.
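To make the dynamic-programming computation concrete, the following minimal NumPy sketch sums the probabilities of all CTC alignments of a short label sequence using the standard forward (alpha) recursion over a label sequence extended with blanks. It is an illustrative example, not code from the cited works; the toy posteriors, blank index 0, and the two-label target are invented.

```python
import numpy as np

def ctc_forward_logprob(log_probs, labels, blank=0):
    """Total log-probability of all CTC alignments of `labels`
    given `log_probs`, a (T, V) matrix of per-frame label log-posteriors."""
    # extended label sequence with blanks: ^ a ^ b ^ ...
    ext = [blank]
    for c in labels:
        ext += [c, blank]
    S, T = len(ext), log_probs.shape[0]

    alpha = np.full((T, S), -np.inf)           # forward (log) scores
    alpha[0, 0] = log_probs[0, blank]          # start with a blank ...
    alpha[0, 1] = log_probs[0, ext[1]]         # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            cand = [alpha[t - 1, s]]                      # stay on the same symbol
            if s > 0:
                cand.append(alpha[t - 1, s - 1])          # advance by one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cand.append(alpha[t - 1, s - 2])          # skip the blank between two different labels
            alpha[t, s] = np.logaddexp.reduce(cand) + log_probs[t, ext[s]]
    # valid alignments end on the last label or on the final blank
    return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

# toy example: 5 frames, vocabulary {blank, 'a', 'b'}, target sequence "ab"
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(3), size=5)
print(ctc_forward_logprob(np.log(posteriors), labels=[1, 2]))
```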

Two different versions of beam search are adopted by [16, 31] for decoding CTC models. Figure 4 shows the working architecture of the CTC model. Here, noisy and uninformative frames are discarded through the introduction of the blank label, which results in the optimal output sequence. CTC uses an intermediate label representation that allows blank labels, i.e., frames that emit no output label. CTC-based NN models show a high recognition rate for both phoneme recognition [32] and LVCSR [16, 31]. A CTC-trained neural network combined with a language model offers excellent results [17].
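As a simple illustration of how the blank label is used when an output sequence is read off, the sketch below performs best-path (greedy) CTC decoding: it takes the most likely label per frame, collapses consecutive repeats, and drops the blanks. This is a hypothetical minimal example, not the beam-search decoders of [16, 31].

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Best-path decoding: frame-wise argmax, collapse repeats, drop blanks."""
    best_path = np.argmax(log_probs, axis=1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:   # keep only new, non-blank labels
            decoded.append(int(label))
        prev = label
    return decoded

# toy frame-wise path: blank, a, a, blank, b, b  ->  decoded [1, 2]
toy = np.log(np.eye(3)[[0, 1, 1, 0, 2, 2]] + 1e-9)
print(ctc_greedy_decode(toy))
```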

End-to-end ASR systems perform well and achieve good results, yet they face two major challenges. The first is how to incorporate lexicons and language models into decoding; however, [16, 31, 33] have incorporated lexicons for searching paths. The second is that there is no shared experimental platform for benchmarking. End-to-end systems differ from traditional systems in two aspects: model architecture and decoding method. Some efforts have also been made to model the raw speech signal with little or no preprocessing [34]. Palaz et al. [6] showed in their study that a CNN [35] can compute the class conditional probabilities from the raw speech signal taken as direct input. Therefore, CNNs are a preferred choice for learning features from raw speech. The learned-feature process has two stages: first, features are learned by the filters at the first convolutional layer, and then the learned features are modeled by the second and higher-level convolutional layers. An end-to-end phoneme sequence recognizer can thus be built directly from the raw speech signal.
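The two-stage idea can be sketched with a small, hypothetical PyTorch model (assuming PyTorch is available; the filter sizes, strides, and number of phoneme classes below are invented and are not those of [6]): the first convolutional layer acts on the raw waveform as a learned filterbank, higher convolutional layers model the learned features, and a final layer outputs per-frame class conditional log-probabilities.

```python
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    """Per-frame phoneme class posteriors computed directly from the raw waveform."""
    def __init__(self, num_classes=40):
        super().__init__()
        # stage 1: filters learned directly on the raw signal (a learned filterbank)
        self.feature_learner = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=250, stride=80), nn.ReLU(),
        )
        # stage 2: higher-level convolutional layers model the learned features
        self.feature_modeler = nn.Sequential(
            nn.Conv1d(80, 60, kernel_size=5), nn.ReLU(),
            nn.Conv1d(60, 60, kernel_size=5), nn.ReLU(),
        )
        self.classifier = nn.Conv1d(60, num_classes, kernel_size=1)

    def forward(self, waveform):                 # waveform: (batch, samples)
        x = waveform.unsqueeze(1)                # -> (batch, 1, samples)
        x = self.feature_modeler(self.feature_learner(x))
        return self.classifier(x).log_softmax(dim=1)   # class log-posteriors per frame

log_posteriors = RawSpeechCNN()(torch.randn(2, 16000))   # two 1-second utterances at 16 kHz
print(log_posteriors.shape)                              # (2, 40, number_of_frames)
```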



Figure 4. CTC model for speech recognition.


3.1. GMM/DNN

An ASR system performs a sequence mapping of a T-length sequence of speech features, $X = \{X_t \in \mathbb{R}^D \mid t = 1, \ldots, T\}$, into an N-length word sequence, $W = \{w_n \in \mathcal{V} \mid n = 1, \ldots, N\}$, where $X_t$ represents the D-dimensional speech feature vector at frame $t$ and $w_n$ represents the word at position $n$ in the vocabulary $\mathcal{V}$.

The ASR problem is formulated within the Bayesian framework. In this approach, an utterance is represented by a sequence of acoustic feature vectors $X$, derived from the underlying sequence of words $W$, and the recognition system needs to find the most likely word sequence, as given below [37]:

$$
\hat{W} = \arg\max_{W} p(W|X) \tag{2}
$$

In Eq. (2), the word sequence $W$ that maximizes the posterior probability $p(W|X)$ for the given feature sequence $X$ is selected. Using Bayes' rule, this can be written as

$$
\hat{W} = \arg\max_{W} \frac{p(X|W)\, p(W)}{p(X)} \tag{3}
$$

In Eq. (3), the denominator $p(X)$ is ignored as it is constant with respect to $W$. Therefore,

$$
\hat{W} = \arg\max_{W} p(X|W)\, p(W) \tag{4}
$$

where $p(X|W)$ is the likelihood of the speech feature sequence given the word sequence, evaluated with the help of the acoustic model, and $p(W)$ represents the prior knowledge about the word sequence $W$, determined by the language model. However, current ASR systems are based on a hybrid HMM/DNN [38], which is also derived using Bayes' theorem and introduces the HMM state sequence $S$ to factorize $p(W|X)$ into the following three distributions:

$$\arg\max\_{w \in v^\*} p(W|X) \tag{5}$$

$$
\hat{W} = \arg\max_{W \in \mathcal{V}^*} \sum_{S} p(X|S, W)\, p(S|W)\, p(W) \tag{6}
$$

$$
\approx \arg\max_{W \in \mathcal{V}^*} \sum_{S} p(X|S)\, p(S|W)\, p(W) \tag{7}
$$

where $p(X|S)$, $p(S|W)$, and $p(W)$ represent the acoustic, lexicon, and language models, respectively. Eq. (6) follows from Eq. (5) by Bayes' rule, with the constant denominator $p(X)$ ignored just as in the step from Eq. (3) to Eq. (4); Eq. (7) then assumes that the acoustic features depend only on the state sequence, i.e., $p(X|S, W) \approx p(X|S)$.
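In practice, the arg max in Eqs. (4) and (7) is approximated by searching over a restricted set of hypotheses. The toy sketch below, which rescores an invented n-best list, shows the combination of acoustic and language-model log scores; the language-model weight is a common practical addition and is not part of the equations above.

```python
def pick_best_hypothesis(candidates, lm_weight=1.0):
    """arg max over candidate word sequences of
    log p(X|W) + lm_weight * log p(W)   (cf. Eq. (4))."""
    best = max(candidates,
               key=lambda c: c["acoustic_logprob"] + lm_weight * c["lm_logprob"])
    return best["words"]

# invented n-best list with made-up log scores
nbest = [
    {"words": "recognize speech", "acoustic_logprob": -120.4, "lm_logprob": -8.1},
    {"words": "wreck a nice beach", "acoustic_logprob": -119.9, "lm_logprob": -14.7},
]
print(pick_best_hypothesis(nbest))   # -> "recognize speech"
```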

3.1.3. Language model p(W)

Similarly, $p(W)$ can be factorized using the probabilistic chain rule and an (m−1)th-order Markov assumption, giving an m-gram model, i.e.,

$$
p(W) = \prod_{n=1}^{N} p(w_n | w_1, \ldots, w_{n-1}) \tag{12}
$$

$$
\approx \prod_{n=1}^{N} p(w_n | w_{n-m+1}, \ldots, w_{n-1}) \tag{13}
$$

The limitation of the Markov assumption is addressed by the recurrent neural network language model (RNNLM) [39], but it increases the complexity of the decoding process. Therefore, a combination of an RNNLM and an m-gram language model, applied through a rescoring technique, is generally used.
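A toy instance of Eq. (13) with m = 2 (a bigram model) is sketched below. The two-sentence training corpus is invented, and smoothing, which any practical m-gram model needs, is omitted.

```python
import math
from collections import Counter

corpus = ["<s> the cat sat </s>", "<s> the dog sat </s>"]   # invented toy corpus
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words[:-1])                  # history counts
    bigrams.update(zip(words[:-1], words[1:]))   # (history, word) counts

def bigram_logprob(words):
    """log p(W) under Eq. (13) with m = 2; unsmoothed, for illustration only."""
    return sum(math.log(bigrams[(prev, cur)] / unigrams[prev])
               for prev, cur in zip(words[:-1], words[1:]))

print(bigram_logprob("<s> the cat sat </s>".split()))   # log(1 * 0.5 * 1 * 1)
```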

3.2. Attention mechanism

The attention-based approach does not make any Markov assumption. It directly estimates the posterior $p(C|X)$, where $C = \{c_l \in \mathcal{U} \mid l = 1, \ldots, L\}$ is an L-length letter sequence, on the basis of the probabilistic chain rule:

$$
p(C|X) = \prod_{l=1}^{L} p(c_l | c_1, \ldots, c_{l-1}, X) \triangleq p_{\mathrm{att}}(C|X) \tag{14}
$$

where $p_{\mathrm{att}}(C|X)$ represents the attention-based objective function. Each factor $p(c_l | c_1, \ldots, c_{l-1}, X)$ is obtained by

$$
h_t = \mathrm{Encoder}(X), \tag{15}
$$

$$
a_{lt} = \begin{cases} \mathrm{ContentAttention}(q_{l-1}, h_t) \\ \mathrm{LocationAttention}(\{a_{l-1}\}_{t=1}^{T}, q_{l-1}, h_t) \end{cases} \tag{16}
$$

$$
r_l = \sum_{t=1}^{T} a_{lt} h_t, \tag{17}
$$

$$
p(c_l | c_1, \ldots, c_{l-1}, X) = \mathrm{Decoder}(r_l, q_{l-1}, c_{l-1}) \tag{18}
$$

Eq. (15) represents the encoder network and Eq. (18) represents the decoder network. Here, $a_{lt}$ represents the soft alignment weight of the hidden vector $h_t$ given the previous decoder state $q_{l-1}$, and $r_l$ represents the letter-wise hidden vector obtained as the weighted sum of the hidden vectors. The content-based attention mechanisms without and with convolutional features are denoted by $\mathrm{ContentAttention}(\cdot)$ and $\mathrm{LocationAttention}(\cdot)$, respectively.
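The sketch below implements Eqs. (16) and (17) with a simple additive content-based score, one common choice since the exact form of ContentAttention is not specified above. The weight matrices and dimensions are random stand-ins, and both LocationAttention and the decoder state update are omitted.

```python
import numpy as np

def content_attention(q_prev, H, W_q, W_h, w):
    """Eqs. (16)-(17): soft alignment a_l over frames and the
    letter-wise vector r_l as a weighted sum of hidden vectors."""
    # additive content-based score: e_t = w^T tanh(W_q q_{l-1} + W_h h_t)
    scores = np.tanh(H @ W_h.T + q_prev @ W_q.T) @ w   # shape (T,)
    a_l = np.exp(scores - scores.max())
    a_l /= a_l.sum()                                   # softmax over the T frames
    r_l = a_l @ H                                      # weighted sum, Eq. (17)
    return a_l, r_l

T, d_h, d_q, d_a = 50, 320, 256, 128                   # made-up dimensions
rng = np.random.default_rng(0)
H = rng.standard_normal((T, d_h))                      # encoder outputs h_t
q_prev = rng.standard_normal(d_q)                      # previous decoder state q_{l-1}
a_l, r_l = content_attention(
    q_prev, H,
    W_q=rng.standard_normal((d_a, d_q)),
    W_h=rng.standard_normal((d_a, d_h)),
    w=rng.standard_normal(d_a),
)
print(a_l.shape, r_l.shape)                            # (50,) (320,)
```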

3.2.1. Encoder network

The input feature vector $X$ is converted into the framewise hidden vectors $h_t$ using Eq. (15). The preferred choice for the encoder network is a BLSTM, i.e., $h_t = \mathrm{BLSTM}_t(X)$.
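A minimal PyTorch sketch of such a BLSTM encoder realizing Eq. (15) is given below; the feature dimension, hidden size, and number of layers are assumed values rather than those of any particular system.

```python
import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Eq. (15): converts the input features X into framewise hidden vectors h_t."""
    def __init__(self, feat_dim=80, hidden_dim=320, num_layers=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, X):            # X: (batch, T, feat_dim)
        H, _ = self.blstm(X)         # H: (batch, T, 2 * hidden_dim), one h_t per frame
        return H

H = BLSTMEncoder()(torch.randn(1, 200, 80))   # 200 frames of 80-dimensional features
print(H.shape)                                # torch.Size([1, 200, 640])
```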

