1. Introduction

ASR system has two important tasks—phoneme recognition and whole-word decoding. In ASR, the relationship between the speech signal and phones is established in two different steps [1]. In the first step, useful features are extracted from the speech signal on the basis of

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and eproduction in any medium, provided the original work is properly cited. © 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

prior knowledge. This phase is known as information selection or dimensionality reduction phase. In this, the dimensionality of the speech signal is reduced by selecting the information based on task-specific knowledge. Highly specialized features like MFCC [2] are preferred choice in traditional ASR systems. In the second step, discriminative models estimate the likelihood of each phoneme. In the last, word sequence is recognized using discriminative programming technique. Deep learning system can map the acoustic features into the spoken phonemes directly. A sequence of the phoneme is easily generated from the frames using frame-level classification.

speech recognition model. In Section 6, available experimental results are shown. Finally,

Traditional ASR system leveraged the GMM/HMM paradigm for acoustic modeling. GMM efficiently processes the vectors of input features and estimates emission probabilities for each HMM state. HMM efficiently normalizes the temporal variability present in speech signal. The combination of HMM and language model is used to estimate the most likely sequence of phones. The discriminative objective function is used to improve the recognition rate of the system by the discriminatively fine-tuned methods [8]. However, GMM has a shortcoming as it shows inability to model the data that is present on the boundary line. Artificial neural networks (ANNs) can learn much better models of data lying on the boundary condition. Deep neural networks (DNNs) as acoustic models tremendously improved the performance of ASR systems [9–11]. Generally, discriminative power of DNN is used for phoneme recognition and, for decoding task, HMM is preferred choice. DNNs have many hidden layers with a large number of nonlinear units and produce a very large number of outputs. The benefit of this large output layer is that it accommodates the large number of HMM states. DNN architectures have densely connected layers. Therefore, such architectures are more prone to overfitting. Secondly, features having the local correlations become difficult to learn for such architectures. In [12], speech frames are classified into clustered context-dependent states using DNNs. In [13, 14], GMM-free DNN training process is proposed by the researchers. However, GMM-free process demands iterative procedures like decision trees, generating forced alignments. DNN-based acoustic models are gaining much popularity in large vocabulary speech recognition task [10], but components like HMM and n-gram language model are same as in

GMM or DNN-based ASR systems perform the task in three steps: feature extraction, classification, and decoding. It is shown in Figure 1. Firstly, the short-term signal st is processed at time "t" to extract the features xt. These features are provided as input to GMM or DNN acoustic model which estimates the class conditional probabilities Peð Þ ijxi for each phone class

<sup>¼</sup> P ið Þ <sup>j</sup>xt

DNN is a feed-forward NN containing multiple hidden layers with a large number of hidden units. DNNs are trained using the back-propagation methods then discriminatively fine-tuned for reducing the gap between the desired output and actual output. DNN-/HMM-based hybrid systems are the effective models which use a tri-phone HMM model and an n-gram language model [10, 15]. Traditional DNN/HMM hybrid systems have several independent components that are trained separately like an acoustic model, pronunciation model, and

p ið Þ <sup>∀</sup><sup>i</sup> <sup>∈</sup> i, …, I (1)

Convolutional Neural Networks for Raw Speech Recognition

http://dx.doi.org/10.5772/intechopen.80026

23

Section 7 concludes this chapter with the brief discussion.

2. Related work

their predecessors.

i∈ f g 1;…; I : The emission probabilities are as follows:

pe xt ð Þj<sup>i</sup> <sup>∝</sup> p xt ð Þj<sup>i</sup>

The prior class probability p ið Þ is computed by counting on the training set.

p xð Þ<sup>t</sup>

Another side, end-to-end systems perform acoustic frames to phone mapping in one step only. End-to-end training means all the modules are learned simultaneously. Advanced deep learning methods facilitate to train the system in an end-to-end manner. They also have the ability to train the system directly with raw signals, i.e., without hand-crafted features. Therefore, ASR paradigm is shifting from cepstral features like MFCC [2], PLP [3] to discriminative features learned directly from raw speech. End-to-end model may take raw speech signal as input and generates phoneme class conditional probabilities as output. The three major types of end-to-end architectures for ASR are attention-based method, connectionist temporal classification (CTC), and CNN-based direct raw speech model.

Attention-based models directly transcribe the speech into phonemes. Attention-based encoderdecoder uses the recurrent neural network (RNN) to perform sequence-to-sequence mapping without any predefined alignment. In this model, the input sequence is first transformed into a fixed length vector representation, and then decoder maps this fixed length vector into the output sequence. Attention-based encoder-decoder is much capable of learning the mapping between variable-length input and output sequences. Chorowski and Jaitly proposed speakerindependent sequence-to-sequence model and achieved 10.6% WER without separate language models and 6.7% WER with a trigram language model for Wall Street Journal dataset [4]. In attention-based systems, the alignment between the acoustic frame and recognized symbols is performed by attention mechanism, whereas CTC model uses conditional independence assumptions to efficiently solve sequential problems by dynamic programming. Attention model has shown high performance over CTC approach because it uses the history of the target character without any conditional independence assumptions.

Another side, CNN-based acoustic model is proposed by Palaz et al. [5–7] which processes the raw speech directly as input. This model consists of two stages: feature learning stage, i.e., several convolutional layers, and classifier stage, i.e., fully connected layers. Both the stages are learned jointly by minimizing a cost function based on relative entropy. In this model, the information is extracted by the filters at first convolutional layer and modeled between first and second convolutional layer. In classifier stage, learned features are classified by fully connected layers and softmax layer. This approach claims comparable or better performance than traditional cepstral feature-based system followed by ANN training for phoneme recognition on TIMIT dataset.

This chapter is organized as follows: In Section 2, the work performed in the field of ASR is discussed with the name of related work. Section 3 covers the various architectures of ASR. Section 4 presents the brief introduction about CNN. Section 5 explains CNN-based direct raw speech recognition model. In Section 6, available experimental results are shown. Finally, Section 7 concludes this chapter with the brief discussion.
