HMM/DNN [38], which is also calculated using Bayes' theorem, introduces the HMM state sequence $S$ to factorize $p(W|X)$ into the following three distributions:

$$\arg\max_{W \in \mathcal{V}^*} p(W|X) \tag{5}$$

$$= \arg\max_{W \in \mathcal{V}^*} \sum_{S} p(X|S, W)\, p(S|W)\, p(W) \tag{6}$$

$$\approx \arg\max_{W \in \mathcal{V}^*} \sum_{S} p(X|S)\, p(S|W)\, p(W) \tag{7}$$

where $p(X|S)$, $p(S|W)$, and $p(W)$ represent the acoustic, lexicon, and language models, respectively. Equation (6) is changed into Eq. (7) in a similar way as Eq. (4) is changed into Eq. (5).

28 From Natural to Artificial Intelligence - Algorithms and Applications

#### 3.1.1. Acoustic model $p(X|S)$

$p(X|S)$ can be further factorized using a probabilistic chain rule and a Markov assumption as follows:

$$p(X|S) = \prod_{t=1}^{T} p(\mathbf{x}_t|\mathbf{x}_1, \dots, \mathbf{x}_{t-1}, S) \tag{8}$$

$$\approx \prod_{t=1}^{T} p(\mathbf{x}_t|s_t) \propto \prod_{t=1}^{T} \frac{p(s_t|\mathbf{x}_t)}{p(s_t)} \tag{9}$$

In Eq. (9), the framewise likelihood function $p(\mathbf{x}_t|s_t)$ is changed into the framewise posterior distribution $p(s_t|\mathbf{x}_t)/p(s_t)$, which is computed using DNN classifiers by the pseudo-likelihood trick [38]. The Markov assumption in Eq. (9) is too strong, as the contexts of the input and hidden states are not considered. This issue can be resolved using either recurrent neural networks (RNNs) or DNNs with long-context features. A framewise state alignment, which is offered by an HMM/GMM system, is required to train the framewise posterior.

#### 3.1.2. Lexicon model $p(S|W)$

$p(S|W)$ can be further factorized using a probabilistic chain rule and a (first-order) Markov assumption as follows:

$$p(S|W) = \prod_{t=1}^{T} p(s_t|s_1, \dots, s_{t-1}, W) \tag{10}$$

$$\approx \prod_{t=1}^{T} p(s_t|s_{t-1}, W) \tag{11}$$

An HMM state transition represents this probability. A pronunciation dictionary performs the conversion from $w$ to HMM states through a phoneme representation.

#### 3.1.3. Language model $p(W)$

Similarly, $p(W)$ can be factorized using a probabilistic chain rule and an (m-1)th-order Markov assumption as an m-gram model, i.e.,

$$p(W) = \prod_{n=1}^{N} p(w_n|w_1, \dots, w_{n-1}) \tag{12}$$

$$\approx \prod_{n=1}^{N} p(w_n|w_{n-m+1}, \dots, w_{n-1}) \tag{13}$$

The issue of the Markov assumption is addressed by a recurrent neural network language model (RNNLM) [39], but this increases the complexity of the decoding process. A combination of an RNNLM and an m-gram language model, based on a rescoring technique, is generally used.
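As a toy illustration of the m-gram factorization in Eq. (13), the sketch below estimates bigram (m = 2) probabilities by relative frequency and scores a word sequence as a product of conditionals. The corpus and the absence of smoothing are illustrative assumptions, not part of the original formulation:

```python
from collections import Counter

# A made-up toy corpus standing in for training text; m = 2 (bigram model).
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w_prev, w):
    # p(w_n | w_{n-1}) by relative frequency; no smoothing, for clarity only.
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def sentence_prob(words):
    # Eq. (13) with m = 2: p(W) ~ prod_n p(w_n | w_{n-1})
    # (the unconditioned p(w_1) term is omitted here for brevity).
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= bigram_prob(prev, w)
    return p

print(bigram_prob("the", "cat"))   # 2 of the 3 "the" tokens precede "cat"
print(sentence_prob(["the", "cat", "sat"]))
```

A real system would add smoothing (e.g., Kneser-Ney) so that unseen m-grams do not receive zero probability.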

#### 3.2. Attention mechanism

The approach based on the attention mechanism does not make any Markov assumption. It directly finds the posterior $p(C|X)$ on the basis of a probabilistic chain rule:

$$p(C|X) = \underbrace{\prod_{l=1}^{L} p(\mathbf{c}_l|\mathbf{c}_1, \dots, \mathbf{c}_{l-1}, X)}_{\triangleq p_{att}(C|X)} \tag{14}$$

where $p_{att}(C|X)$ represents an attention-based objective function, and $p(\mathbf{c}_l|\mathbf{c}_1, \dots, \mathbf{c}_{l-1}, X)$ is obtained by

$$\mathbf{h}_t = \mathrm{Encoder}(X) \tag{15}$$

$$a_{lt} = \begin{cases} \mathrm{ContentAttention}(\mathbf{q}_{l-1}, \mathbf{h}_t) \\ \mathrm{LocationAttention}\left(\{a_{(l-1)t}\}_{t=1}^{T}, \mathbf{q}_{l-1}, \mathbf{h}_t\right) \end{cases} \tag{16}$$

$$\mathbf{r}_l = \sum_{t=1}^{T} a_{lt} \mathbf{h}_t \tag{17}$$

$$p(\mathbf{c}_l|\mathbf{c}_1, \dots, \mathbf{c}_{l-1}, X) = \mathrm{Decoder}(\mathbf{r}_l, \mathbf{q}_{l-1}, \mathbf{c}_{l-1}) \tag{18}$$

Eq. (15) represents the encoder network and Eq. (18) the decoder network. $a_{lt}$ represents the soft alignment weight of the hidden vector $\mathbf{h}_t$. $\mathbf{r}_l$ represents the letter-wise hidden vector, computed by the weighted summation of the hidden vectors. The content-based attention mechanisms without and with convolutional features are denoted by $\mathrm{ContentAttention}(\cdot)$ and $\mathrm{LocationAttention}(\cdot)$, respectively.
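The soft alignment and weighted summation of Eqs. (16)-(17) can be sketched in a few lines. This is a minimal pure-Python illustration: the linear layers are replaced by the identity, and the vectors `h`, `q_prev`, and `g` are arbitrary values standing in for learned parameters, not trained quantities:

```python
import math

# Toy sizes: T = 3 encoder frames, hidden vectors of dimension 2.
h = [[0.1, 0.4], [0.3, 0.2], [0.5, 0.9]]  # framewise hidden vectors h_t
q_prev = [0.2, 0.7]                        # previous decoder state q_{l-1}
g = [0.5, -0.3]                            # learnable vector g of Eq. (20)

def score(q, h_t):
    # e_lt = g^T tanh(q_{l-1} + h_t); the Lin(.) layers are the identity here.
    return sum(gi * math.tanh(qi + hi) for gi, qi, hi in zip(g, q, h_t))

e = [score(q_prev, h_t) for h_t in h]

# a_lt = Softmax({e_lt}): normalized soft alignment over frames, Eq. (16)/(21).
z = [math.exp(v) for v in e]
a = [v / sum(z) for v in z]

# r_l = sum_t a_lt * h_t: letter-wise weighted summation of hidden vectors, Eq. (17).
r = [sum(a_t * h_vec[d] for a_t, h_vec in zip(a, h)) for d in range(2)]

print(a, r)
```

The weights `a` sum to one over the T frames, so `r` is a convex combination of the framewise hidden vectors.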

#### 3.2.1. Encoder network

The input feature vector $X$ is converted into framewise hidden vectors $\mathbf{h}_t$ using Eq. (15). The preferred choice for the encoder network is a BLSTM, i.e.,

$$Encoder(X) \triangleq BLSTM\_t(X) \tag{19}$$
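The bidirectional structure behind Eq. (19) can be sketched with a toy scalar recurrence standing in for the LSTM cell; the weights `w` and `u` are illustrative, not trained values:

```python
import math

# Toy framewise input: T = 4 frames with scalar features for brevity.
X = [0.1, -0.2, 0.4, 0.3]

def rnn_pass(xs, w=0.5, u=0.8):
    # h_t = tanh(w * x_t + u * h_{t-1}): a minimal recurrent pass.
    hs, hprev = [], 0.0
    for x in xs:
        hprev = math.tanh(w * x + u * hprev)
        hs.append(hprev)
    return hs

fwd = rnn_pass(X)              # left-to-right context
bwd = rnn_pass(X[::-1])[::-1]  # right-to-left context, realigned to frame order
# Each h_t concatenates both directions, so every frame sees past and future context.
h = [(f, b) for f, b in zip(fwd, bwd)]
print(h)
```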

It is to be noted that the computational complexity of the encoder network is reduced by subsampling the outputs [20, 21].

#### 3.2.2. Content-based attention mechanism

$\mathrm{ContentAttention}(\cdot)$ is defined as

$$e_{lt} = \mathbf{g}^T \tanh\left(\mathrm{Lin}(\mathbf{q}_{l-1}) + \mathrm{LinB}(\mathbf{h}_t)\right) \tag{20}$$

$$a_{lt} = \mathrm{Softmax}\left(\{e_{lt}\}_{t=1}^{T}\right) \tag{21}$$

$\mathbf{g}$ is a learnable parameter and $\{e_{lt}\}_{t=1}^{T}$ is a T-dimensional vector. $\tanh(\cdot)$ represents the hyperbolic tangent activation function, and $\mathrm{Lin}(\cdot)$ and $\mathrm{LinB}(\cdot)$ represent linear layers with learnable matrix parameters (the latter with a bias term).

#### 3.2.3. Location-aware attention mechanism

This is an extended version of the content-based attention mechanism that makes it location aware. If $a_{l-1} = \{a_{(l-1)t}\}_{t=1}^{T}$ is substituted into Eq. (16), then $\mathrm{LocationAttention}(\cdot)$ is represented as follows:

$$\{\mathbf{f}_t\}_{t=1}^{T} = \mathcal{K} * a_{l-1} \tag{22}$$

$$e_{lt} = \mathbf{g}^T \tanh\left(\mathrm{Lin}(\mathbf{q}_{l-1}) + \mathrm{Lin}(\mathbf{h}_t) + \mathrm{LinB}(\mathbf{f}_t)\right) \tag{23}$$

$$a_{lt} = \mathrm{Softmax}\left(\{e_{lt}\}_{t=1}^{T}\right) \tag{24}$$

Here, $*$ denotes 1-D convolution along the input feature axis $t$, with the convolution parameter $\mathcal{K}$, which produces the set of T features $\{\mathbf{f}_t\}_{t=1}^{T}$.

#### 3.2.4. Decoder network

The decoder network is an RNN conditioned on the previous output $\mathbf{c}_{l-1}$ and the hidden vector $\mathbf{q}_{l-1}$. LSTM is the preferred choice of RNN, represented as follows:

$$\mathrm{Decoder}(\cdot) \triangleq \mathrm{Softmax}(\mathrm{LinB}(\mathrm{LSTM}(\cdot))) \tag{25}$$

$\mathrm{LSTM}(\cdot)$ represents a unidirectional LSTM that generates the hidden vector $\mathbf{q}_l$ as output:

$$\mathbf{q}_l = \mathrm{LSTM}(\mathbf{r}_l, \mathbf{q}_{l-1}, \mathbf{c}_{l-1}) \tag{26}$$

The letter-wise hidden vector $\mathbf{r}_l$ and the previous output $\mathbf{c}_{l-1}$ are concatenated and taken as input.

#### 3.2.5. Objective function

The objective function of the attention model is computed from the sequence posterior

$$p_{att}(C|X) \approx \prod_{l=1}^{L} p(\mathbf{c}_l|\mathbf{c}_1^*, \dots, \mathbf{c}_{l-1}^*, X) \triangleq p_{att}^{*}(C|X) \tag{27}$$

where $\mathbf{c}_l^*$ represents the ground truth of the previous characters. The attention-based approach is thus a combination of letter-wise objectives based on multiclass classification, with the conditional ground-truth history $\mathbf{c}_1^*, \dots, \mathbf{c}_{l-1}^*$ in each output $l$.

Convolutional Neural Networks for Raw Speech Recognition
http://dx.doi.org/10.5772/intechopen.80026

#### 3.3. Connectionist temporal classification (CTC)

The CTC formulation is also based on Bayes' decision theory. It is to be noted that the L-length letter sequence $C$ is augmented with a blank symbol "<b>" into the extended letter sequence

$$C' = \{c'_l \in \mathcal{U} \cup \{\text{<b>}\} \mid l = 1, \dots, 2L+1\} = \{\text{<b>}, c_1, \text{<b>}, c_2, \text{<b>}, \dots, c_L, \text{<b>}\} \tag{28}$$

In $C'$, $c'_l$ is always "<b>" when $l$ is an odd number and a letter when $l$ is an even number. Similar to the DNN/HMM model, a framewise letter sequence with the additional blank symbol is also introduced:

$$Z = \{z_t \in \mathcal{U} \cup \{\text{<b>}\} \mid t = 1, \dots, T\} \tag{29}$$

The posterior distribution $p(C|X)$ can be factorized as

$$p(C|X) = \sum_{Z} p(C|Z, X)\, p(Z|X) \tag{30}$$

$$\approx \sum_{Z} p(C|Z)\, p(Z|X) \tag{31}$$

Same as Eq. (3), CTC also uses a Markov assumption, i.e., $p(C|Z, X) \approx p(C|Z)$, to simplify the dependency between the CTC acoustic model $p(Z|X)$ and the CTC letter model $p(C|Z)$.

#### 3.3.1. CTC acoustic model

Same as the DNN/HMM acoustic model, $p(Z|X)$ can be further factorized using a probabilistic chain rule and a Markov assumption as follows:

$$p(Z|X) = \prod_{t=1}^{T} p(z_t|z_1, \dots, z_{t-1}, X) \tag{32}$$

$$\approx \prod_{t=1}^{T} p(z_t|X) \tag{33}$$

The framewise posterior distribution $p(z_t|X)$ is computed from all inputs $X$, and it is directly modeled using a bidirectional LSTM [30, 40]:
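The CTC marginalization over framewise alignments (Eqs. (28)-(31)) relies on a many-to-one collapsing rule: merge consecutive repeated labels, then delete blanks. The sketch below enumerates all alignments by brute force for a toy alphabet; the framewise posteriors are made up, and real systems compute this sum with the forward-backward recursion rather than enumeration:

```python
import itertools

BLANK = "<b>"

def collapse(z):
    # B(Z): merge consecutive repeats, then delete blanks; the many-to-one
    # map from a framewise alignment Z (Eq. (29)) to a letter sequence C.
    merged = [z[0]] + [cur for prev, cur in zip(z, z[1:]) if cur != prev]
    return [s for s in merged if s != BLANK]

print(collapse(["c", "c", BLANK, "a", "a", "t", BLANK]))  # the letters of "cat"
print(collapse([BLANK, "c", "a", BLANK, "a", "t", "t"]))  # blank keeps the two a's distinct

# Brute-force Eq. (31): p(C|X) ~ sum over alignments Z with collapse(Z) = C
# of prod_t p(z_t|X). Alphabet and posteriors are made up; T = 3 frames.
U = ["a", BLANK]
post = [{"a": 0.6, BLANK: 0.4},
        {"a": 0.3, BLANK: 0.7},
        {"a": 0.5, BLANK: 0.5}]

def p_ctc(C):
    total = 0.0
    for Z in itertools.product(U, repeat=len(post)):
        if collapse(list(Z)) == C:
            p = 1.0
            for t, z in enumerate(Z):
                p *= post[t][z]
            total += p
    return total

print(p_ctc(["a"]))  # ~ 0.65; probabilities over all letter sequences sum to 1
```

Because several alignments map to the same letter sequence, CTC needs no framewise state alignment as a training target, unlike the DNN/HMM acoustic model.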
