3.3.2. CTC letter model

The framewise posterior distribution $p(z_t \mid X)$ is conditioned on the full input sequence $X$ and is modeled with a bidirectional long short-term memory (BLSTM) network:

$$p(z_t \mid X) = \mathrm{Softmax}\big(\mathrm{LinB}(h_t)\big), \qquad (34)$$

$$h_t = \mathrm{BLSTM}_t(X), \qquad (35)$$

where $\mathrm{Softmax}(\cdot)$ represents the softmax activation function, $\mathrm{LinB}(\cdot)$ converts the hidden vector $h_t$ to a $(|U| + 1)$-dimensional vector (the letters in $U$ plus the CTC blank symbol) using a learnable matrix and bias vector, and $\mathrm{BLSTM}_t(\cdot)$ takes the full input sequence as input and produces the hidden vector $h_t$ at time $t$.
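To make Eqs. (34) and (35) concrete, here is a minimal PyTorch sketch of the letter model; the class name, layer sizes, and the 26-letter set are illustrative assumptions, not taken from the chapter.

```python
import torch
import torch.nn as nn

class CTCLetterModel(nn.Module):
    """Framewise posterior of Eqs. (34)-(35):
    h_t = BLSTM_t(X), p(z_t | X) = Softmax(LinB(h_t))."""

    def __init__(self, feat_dim=40, hidden_dim=320, num_letters=26):
        super().__init__()
        # Eq. (35): BLSTM over the full input sequence X.
        self.blstm = nn.LSTM(feat_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        # Eq. (34): LinB, a learnable matrix plus bias mapping h_t
        # to |U| + 1 classes (letters plus the CTC blank).
        self.linb = nn.Linear(2 * hidden_dim, num_letters + 1)

    def forward(self, x):
        # x: (batch, T, feat_dim) -> h: (batch, T, 2 * hidden_dim)
        h, _ = self.blstm(x)
        # Softmax over the |U| + 1 classes gives p(z_t | X) per frame.
        return torch.softmax(self.linb(h), dim=-1)

# Example: 8 utterances, 100 frames of 40-dimensional features.
probs = CTCLetterModel()(torch.randn(8, 100, 40))
print(probs.shape)  # torch.Size([8, 100, 27])
```

Because the LSTM is bidirectional, each $h_t$, and hence each framewise posterior, depends on the whole input sequence $X$, as Eq. (35) requires.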

3.3.3. Objective function

The posterior, $p(C \mid X)$, is represented in terms of the framewise letter sequence $Z$. By applying Bayes' decision theory, the probabilistic chain rule, and the Markov assumption, it can be written as

$$p(C \mid Z) = \frac{p(Z \mid C)\, p(C)}{p(Z)} \qquad (36)$$

$$= \prod_{t=1}^{T} p(z_t \mid z_1, \ldots, z_{t-1}, C)\, \frac{p(C)}{p(Z)} \qquad (37)$$

$$\approx \prod_{t=1}^{T} p(z_t \mid z_{t-1}, C)\, \frac{p(C)}{p(Z)} \qquad (38)$$

and, inserting the framewise posterior of Eq. (34),

$$p(C \mid X) \approx \underbrace{\sum_{Z} \prod_{t=1}^{T} p(z_t \mid z_{t-1}, C)\, p(z_t \mid X)}_{\triangleq\, p_{\mathrm{ctc}}(C \mid X)} \, \frac{p(C)}{p(Z)}, \qquad (39)$$

where $p(z_t \mid z_{t-1}, C)$ represents the state transition probability, $p(C)$ represents the letter-based language model, and $p(Z)$ represents the state prior probability. The CTC architecture incorporates a letter-based language model; it can also incorporate a word-based language model by using a letter-to-word finite state transducer during decoding [18]. CTC has the monotonic alignment property, i.e., when $z_{t-1} = c'_m$, then $z_t = c'_l$ with $l \geq m$, where $c'$ denotes the blank-augmented letter sequence. Monotonic alignment is an important constraint for speech recognition, so the ASR sequence-to-sequence mapping should follow it; this property is also satisfied by HMM/DNN.

The Viterbi method and the forward-backward algorithm are dynamic programming algorithms used to efficiently compute the summation over all possible $Z$. The CTC objective function $p_{\mathrm{ctc}}(C \mid X)$ is designed by excluding $p(C)/p(Z)$ from Eq. (39).
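The following NumPy sketch shows the forward (alpha) half of that dynamic program. It makes the usual simplifying assumption that the allowed transitions of Eq. (39) are encoded by the recursion itself (stay, advance one state, or skip a blank between two different letters) rather than by an explicit $p(z_t \mid z_{t-1}, C)$ table; the function and variable names are illustrative.

```python
import numpy as np

def ctc_forward(log_probs, labels, blank=0):
    """Log of the sum over all monotonic alignments Z of prod_t p(z_t | X),
    computed in O(T * |C'|) time instead of enumerating every Z.
    log_probs: (T, |U| + 1) framewise log posteriors from Eq. (34);
    labels: letter indices (non-empty, without blanks)."""
    # C': the blank-augmented letter sequence, e.g. [a, b] -> [-, a, -, b, -]
    ext = [blank]
    for c in labels:
        ext += [c, blank]
    T, S = log_probs.shape[0], len(ext)

    alpha = np.full((T, S), -np.inf)    # log alpha[t, s]
    alpha[0, 0] = log_probs[0, ext[0]]  # start with a blank ...
    alpha[0, 1] = log_probs[0, ext[1]]  # ... or with the first letter

    for t in range(1, T):
        for s in range(S):
            # Monotonic alignment: stay in state s or advance by one.
            cands = [alpha[t - 1, s]]
            if s > 0:
                cands.append(alpha[t - 1, s - 1])
            # A blank may be skipped between two different letters.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]

    # Valid alignments end in the last letter or the trailing blank.
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])
```

A matching backward (beta) pass, not shown here, completes the forward-backward algorithm used to train with this objective.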

4. Convolutional neural networks

CNNs are popular variants of deep learning that are widely adopted in ASR systems. CNNs offer several attractive properties, i.e., weight sharing, convolutional filters, and pooling, and have therefore achieved impressive performance in ASR. CNNs are composed of multiple convolutional layers. Figure 5 shows the block diagram of a CNN. LeCun and Bengio [41] describe the three stages of a convolutional layer, i.e., convolution, pooling, and nonlinearity, as sketched below.
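As a concrete sketch of those three stages, the snippet below chains a convolution, an element-wise nonlinearity, and a pooling step in PyTorch; the channel counts, kernel sizes, ReLU choice, and spectrogram-shaped input are illustrative assumptions, not taken from the chapter.

```python
import torch
import torch.nn as nn

# The three stages of a convolutional layer [41]:
# convolution (weight-shared filters), nonlinearity, and pooling.
conv_layer = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),                    # element-wise nonlinearity
    nn.MaxPool2d(kernel_size=2),  # pooling: downsample each axis by 2
)

# A batch of 8 inputs: 1 channel, 40 frequency bands, 100 time frames.
x = torch.randn(8, 1, 40, 100)
print(conv_layer(x).shape)  # torch.Size([8, 32, 20, 50])
```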

Deep CNNs set a new milestone by achieving approximately human-level performance through advanced architectures and optimized training [42]. CNNs use nonlinear functions to process low-level data directly and are capable of learning high-level features of great complexity and abstraction. Pooling is the heart of CNNs and reduces the dimensionality of a feature map. Maxout is a widely used nonlinearity that has shown its effectiveness in ASR tasks [43, 44].

Pooling is an important concept that transforms the joint feature representation into valuable information by keeping what is useful and eliminating what is insignificant. Small frequency shifts, which are common in speech signals, are handled efficiently by pooling, and pooling also helps to reduce the spectral variance present in the input speech. A pooling layer maps the input from $p$ adjacent units into an output by applying a special function. After the element-wise nonlinearities, the features are passed through the pooling layer, which downsamples the feature maps coming from the previous layer and produces new feature maps with a condensed resolution. This layer drastically reduces the spatial dimension of the input and serves two main purposes: first, the number of parameters or weights is reduced by 65%, lessening the computational cost; second, it controls overfitting, i.e., the situation in which a model is tuned so closely to the training examples that it fails to generalize. A toy illustration follows.
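As a toy NumPy illustration of pooling over $p$ adjacent units (all values made up), note that moving each peak by one position within its pooling window leaves the max-pooled output unchanged, which is how pooling absorbs small frequency shifts:

```python
import numpy as np

def max_pool_1d(x, p):
    """Map each group of p adjacent units to one output (their max),
    downsampling the feature map by a factor of p."""
    usable = len(x) - len(x) % p  # drop any ragged tail
    return x[:usable].reshape(-1, p).max(axis=1)

feat    = np.array([0.1, 0.9, 0.3, 0.2, 0.1, 0.8])
shifted = np.array([0.9, 0.1, 0.3, 0.2, 0.8, 0.1])  # peaks moved by one unit

print(max_pool_1d(feat, 2))     # [0.9 0.3 0.8]
print(max_pool_1d(shifted, 2))  # [0.9 0.3 0.8]  -- identical output
```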

Figure 5. Block diagram of convolutional neural network.
