3.3. Connectionist temporal classification (CTC)

The CTC formulation is also based on Bayes' decision theory. Given an $L$-length letter sequence $\mathbf{C} = \{c\_l \in \mathcal{U} \mid l = 1, \dots, L\}$, CTC introduces the blank-augmented letter sequence

$$\mathbf{C}' = \{\langle b \rangle, c\_1, \langle b \rangle, c\_2, \langle b \rangle, \dots, c\_L, \langle b \rangle\} = \left\{c'\_l \in \mathcal{U} \cup \{\langle b \rangle\} \mid l = 1, \dots, 2L+1\right\} \tag{28}$$

In $\mathbf{C}'$, $c'\_l$ is always the blank symbol $\langle b \rangle$ when $l$ is an odd number and a letter when $l$ is an even number. As in the DNN/HMM model, a framewise letter sequence with the additional blank symbol,

$$\mathbf{Z} = \{ z\_t \in \mathcal{U} \cup \{\langle b \rangle\} \mid t = 1, \ldots, T \} \tag{29}$$

is also introduced. The posterior distribution, $p(\mathbf{C}|\mathbf{X})$, can be factorized as

$$p(\mathbf{C}|\mathbf{X}) = \sum\_{\mathbf{Z}} p(\mathbf{C}|\mathbf{Z}, \mathbf{X})\, p(\mathbf{Z}|\mathbf{X}) \tag{30}$$

$$\approx \sum\_{\mathbf{Z}} p(\mathbf{C}|\mathbf{Z})\, p(\mathbf{Z}|\mathbf{X}) \tag{31}$$

As in Eq. (3), CTC also uses the Markov assumption, i.e., $p(\mathbf{C}|\mathbf{Z}, \mathbf{X}) \approx p(\mathbf{C}|\mathbf{Z})$, to simplify the dependency into the CTC acoustic model, $p(\mathbf{Z}|\mathbf{X})$, and the CTC letter model, $p(\mathbf{C}|\mathbf{Z})$.
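The blank augmentation of Eq. (28) can be made concrete with a few lines of code. The sketch below is only illustrative: the helper name `augment_with_blanks` and the string `"<b>"` standing for the blank symbol are assumptions, not notation from the chapter.

```python
# Minimal sketch of Eq. (28): build the blank-augmented sequence C'
# from an L-length letter sequence C. "<b>" stands for the blank symbol.
BLANK = "<b>"

def augment_with_blanks(letters):
    """Return C' = (<b>, c_1, <b>, c_2, ..., c_L, <b>) of length 2L + 1."""
    augmented = [BLANK]
    for c in letters:
        augmented += [c, BLANK]
    return augmented

print(augment_with_blanks(list("cat")))
# ['<b>', 'c', '<b>', 'a', '<b>', 't', '<b>']  -> odd positions (1-indexed) are blanks
```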

#### 3.3.1. CTC acoustic model

As in the DNN/HMM acoustic model, $p(\mathbf{Z}|\mathbf{X})$ can be further factorized using the probabilistic chain rule and a Markov assumption as follows:

$$p(\mathbf{Z}|\mathbf{X}) = \prod\_{t=1}^{T} p(z\_t | z\_1, \dots, z\_{t-1}, \mathbf{X}) \tag{32}$$

$$\approx \prod\_{t=1}^{T} p(z\_t \mid \mathbf{X}) \tag{33}$$

The framewise posterior distribution, $p(z\_t|\mathbf{X})$, is computed from all inputs, $\mathbf{X}$, and is directly modeled using a bidirectional LSTM (BLSTM) [30, 40]:

$$p(z\_t|\mathbf{X}) = \text{Softmax}(\text{LinB}(\mathbf{h}\_t)) \tag{34}$$

$$\mathbf{h}\_t = \text{BLSTM}\_t(\mathbf{X}) \tag{35}$$


where $\text{Softmax}(\cdot)$ represents the softmax activation function, $\text{LinB}(\cdot)$ converts the hidden vector, $\mathbf{h}\_t$, to a $(|\mathcal{U}| + 1)$-dimensional vector with learnable matrix and bias vector parameters, and $\text{BLSTM}\_t(\cdot)$ takes the full input sequence as input and produces the hidden vector $\mathbf{h}\_t$ at time $t$.
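As a rough illustration of Eqs. (32)-(35), the following PyTorch-style sketch computes framewise posteriors over the letter set plus the blank symbol. The class name, layer sizes, and vocabulary size are assumptions made for the example, not values given in the chapter.

```python
# Minimal sketch of Eqs. (32)-(35): framewise posteriors p(z_t | X) from a
# bidirectional LSTM (Eq. 35) followed by a linear layer LinB and a softmax
# (Eq. 34). Hyperparameters are illustrative.
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    def __init__(self, feature_dim=80, hidden_dim=256, num_letters=29):
        super().__init__()
        self.blstm = nn.LSTM(feature_dim, hidden_dim, num_layers=3,
                             bidirectional=True, batch_first=True)
        # LinB: projects h_t to |U| + 1 outputs (letters plus the blank <b>)
        self.lin_b = nn.Linear(2 * hidden_dim, num_letters + 1)

    def forward(self, x):                  # x: (batch, T, feature_dim)
        h, _ = self.blstm(x)               # h_t = BLSTM_t(X), Eq. (35)
        logits = self.lin_b(h)             # LinB(h_t)
        return logits.log_softmax(dim=-1)  # log of Eq. (34), convenient for CTC

# Example: 2 utterances, 120 frames of 80-dimensional features each
log_post = CTCAcousticModel()(torch.randn(2, 120, 80))
print(log_post.shape)  # torch.Size([2, 120, 30])
```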

#### 3.3.2. CTC letter model

By applying Bayes' decision theory, the probabilistic chain rule, and a Markov assumption, $p(\mathbf{C}|\mathbf{Z})$ can be written as

$$p(\mathbf{C}|\mathbf{Z}) = \frac{p(\mathbf{Z}|\mathbf{C})\, p(\mathbf{C})}{p(\mathbf{Z})} \tag{36}$$

$$=\prod\_{t=1}^{T} p(z\_t|z\_1,\ldots,z\_{t-1},\mathbf{C}) \frac{p(\mathbf{C})}{p(\mathbf{Z})} \tag{37}$$

$$\approx \prod\_{t=1}^{T} p(z\_t | z\_{t-1}, \mathbf{C}) \frac{p(\mathbf{C})}{p(\mathbf{Z})} \tag{38}$$

where $p(z\_t|z\_{t-1}, \mathbf{C})$ represents the state transition probability, $p(\mathbf{C})$ represents the letter-based language model, and $p(\mathbf{Z})$ represents the state prior probability. The CTC architecture thus incorporates a letter-based language model; it can also incorporate a word-based language model by using a letter-to-word finite state transducer during decoding [18]. CTC has the monotonic alignment property, i.e.,

$$\text{when } z\_{t-1} = c'\_m\text{, then } z\_t = c'\_l\text{, where } l \ge m.$$

The monotonic alignment property is an important constraint for speech recognition, so ASR sequence-to-sequence mapping should follow a monotonic alignment. This property is also satisfied by HMM/DNN.
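One way to picture the monotonic alignment constraint is to enumerate the transitions allowed on the blank-augmented sequence $\mathbf{C}'$ of Eq. (28). The sketch below uses an illustrative helper, `allowed_successors`, that is not code from the chapter; it encodes the usual CTC rule that a skip over a blank is only allowed between two different letters.

```python
# Sketch of the CTC transition rule on the blank-augmented sequence C'
# (0-indexed here): from state l an alignment may stay at l, advance to
# l + 1, or skip the intervening blank to l + 2 when the two letters differ.
BLANK = "<b>"

def allowed_successors(l, augmented):
    """Successor state indices reachable from state l (monotonic: all >= l)."""
    successors = [l]                                  # stay on the same symbol
    if l + 1 < len(augmented):
        successors.append(l + 1)                      # move to the next symbol
    if (l + 2 < len(augmented)
            and augmented[l + 2] != BLANK
            and augmented[l + 2] != augmented[l]):
        successors.append(l + 2)                      # skip a blank between letters
    return successors

aug = [BLANK, "c", BLANK, "a", BLANK, "t", BLANK]     # C' for the letters "cat"
print(allowed_successors(1, aug))                     # [1, 2, 3]: 'c' -> 'c' | <b> | 'a'
```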

#### 3.3.3. Objective function

The posterior, $p(\mathbf{C}|\mathbf{X})$, is represented as

$$p(\mathbf{C}|\mathbf{X}) \approx \underbrace{\sum\_{\mathbf{Z}} \prod\_{t=1}^{T} p(z\_t|z\_{t-1}, \mathbf{C})\, p(z\_t|\mathbf{X})}\_{\triangleq\, p\_{\text{CTC}}(\mathbf{C}|\mathbf{X})} \cdot \frac{p(\mathbf{C})}{p(\mathbf{Z})} \tag{39}$$

The Viterbi method and the forward-backward algorithm are dynamic programming algorithms that efficiently compute the summation over all possible $\mathbf{Z}$. The CTC objective function, $p\_{\text{CTC}}(\mathbf{C}|\mathbf{X})$, is obtained by excluding $p(\mathbf{C})/p(\mathbf{Z})$ from Eq. (39).
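A compact way to see how the summation over all alignments $\mathbf{Z}$ can be computed by dynamic programming is the standard CTC forward (alpha) recursion, sketched below in NumPy under simplifying assumptions: the transition probabilities are folded into the allowed-transition pattern, the blank index is 0, the label sequence is non-empty, and the framewise log posteriors come from a model such as the BLSTM sketch shown earlier. The function name and interface are illustrative, not the chapter's implementation.

```python
# Sketch of the CTC forward (alpha) recursion: sums p(Z|X) over all alignments
# Z compatible with C, i.e., the bracketed term of Eq. (39) without p(C)/p(Z).
# log_post: (T, |U|+1) framewise log posteriors; labels: list of letter ids.
import numpy as np

def ctc_forward_logprob(log_post, labels, blank=0):
    T = log_post.shape[0]
    # Blank-augmented sequence C' = (<b>, c_1, <b>, ..., c_L, <b>), Eq. (28)
    aug = [blank]
    for c in labels:
        aug += [c, blank]
    S = len(aug)                                     # 2L + 1 states

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_post[0, aug[0]]                # start with the first blank
    alpha[0, 1] = log_post[0, aug[1]]                # or with the first letter

    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]                # stay in the same state
            if s - 1 >= 0:
                terms.append(alpha[t - 1, s - 1])    # move to the next state
            # skip over a blank, allowed only between two different letters
            if s - 2 >= 0 and aug[s] != blank and aug[s] != aug[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + log_post[t, aug[s]]

    # valid alignments must end in the final blank or the final letter
    return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```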

The CTC formulation is thus essentially the same as that of HMM/DNN. The minor difference is that Bayes' rule is applied to $p(\mathbf{C}|\mathbf{Z})$ instead of $p(\mathbf{W}|\mathbf{X})$. Like HMM/DNN, it has three distribution components, i.e., the framewise posterior distribution, $p(z\_t|\mathbf{X})$; the transition probability, $p(z\_t|z\_{t-1}, \mathbf{C})$; and the letter model, $p(\mathbf{C})$. It also uses the Markov assumption. It therefore does not fully utilize the benefits of end-to-end ASR, but its character output representation still possesses the end-to-end benefits.

4. Convolutional neural networks

CNNs are popular variants of deep learning that are widely adopted in ASR systems. CNNs offer several attractive properties, i.e., weight sharing, convolutional filters, and pooling, and have therefore achieved impressive performance in ASR. CNNs are composed of multiple convolutional layers. Figure 5 shows the block diagram of a CNN. LeCun and Bengio [41] describe the three stages of a convolutional layer, i.e., convolution, pooling, and nonlinearity. Deep CNNs set a new milestone by achieving approximately human-level performance through advanced architectures and optimized training [42]. CNNs use nonlinear functions to directly process low-level data and are capable of learning high-level features with high complexity and abstraction. Pooling is the heart of CNNs and reduces the dimensionality of a feature map. Maxout is a widely used nonlinearity and has shown its effectiveness in ASR tasks [43, 44].

Figure 5. Block diagram of convolutional neural network.

Pooling is an important concept that transforms the joint feature representation into valuable information by keeping the useful information and eliminating insignificant information. The small frequency shifts that are common in speech signals are handled efficiently by pooling, which also helps in reducing the spectral variance present in the input speech. Pooling maps the input from $p$ adjacent units into an output by applying a special function. After the element-wise nonlinearities, the features are passed through a pooling layer. This layer downsamples the feature maps coming from the previous layer and produces new feature maps with a condensed resolution, drastically reducing the spatial dimension of the input. It serves two main purposes. The first is that the number of parameters or weights is reduced by 65%, thus lessening the computational cost. The second is that it controls overfitting, i.e., the situation in which a model is so tuned to the training examples that it fails to generalize to unseen data.

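To make the convolution, nonlinearity, and pooling stages concrete, here is a small PyTorch-style sketch of one convolutional stage applied to a spectrogram-like input. The filter count, kernel size, and pooling width are illustrative choices, not parameters reported in the chapter.

```python
# Illustrative single CNN stage (convolution -> nonlinearity -> max pooling)
# applied to a batch of spectrogram-like inputs of shape (batch, 1, time, freq).
import torch
import torch.nn as nn

stage = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=(3, 3), padding=1),
    nn.ReLU(),                        # element-wise nonlinearity
    nn.MaxPool2d(kernel_size=(1, 3))  # pool over 3 adjacent frequency units
)

x = torch.randn(4, 1, 100, 40)        # 4 utterances, 100 frames, 40 frequency bins
y = stage(x)
print(y.shape)                        # torch.Size([4, 32, 100, 13]) -- frequency axis condensed
```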