Abstract

20 From Natural to Artificial Intelligence - Algorithms and Applications

State-of-the-art automatic speech recognition (ASR) systems map a speech signal to its corresponding text. Traditional ASR systems are based on Gaussian mixture models. The emergence of deep learning drastically improved the recognition rate of ASR systems, and such systems are replacing traditional ones. These systems can also be trained in an end-to-end manner. End-to-end ASR systems are gaining popularity because of their simplified model-building process and their ability to map speech directly to text without any predefined alignments. The three major types of end-to-end architectures for ASR are attention-based models, connectionist temporal classification (CTC), and the convolutional neural network (CNN)-based direct raw speech model. In this chapter, a CNN-based acoustic model for the raw speech signal is discussed. It establishes the relation between the raw speech signal and phones in a data-driven manner; the relevant features and the classifier are learned jointly from the raw speech. The raw speech is processed by the first convolutional layer to learn a feature representation. The output of the first convolutional layer, an intermediate representation, is more discriminative and is further processed by the remaining convolutional layers. This system uses only a few parameters and performs better than traditional cepstral-feature-based systems. Its performance is evaluated on TIMIT and is reported to be similar to that of MFCC-based systems.

Keywords: ASR, attention-based model, connectionist temporal classification, CNN, end-to-end model, raw speech signal
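The pipeline sketched in the abstract (first convolutional layer learning a feature representation from raw samples, remaining layers refining it into per-frame phone scores) can be illustrated with a minimal NumPy sketch. All weights here are random stand-ins for trained parameters, and the layer sizes (40 first-layer filters, 48 phone classes, 25 ms kernel with 10 ms stride at 16 kHz) are illustrative assumptions, not values taken from the chapter.

```python
import numpy as np

def conv1d(x, kernels, stride):
    """Valid 1-D convolution: x is (in_ch, T), kernels is (out_ch, in_ch, K)."""
    out_ch, in_ch, K = kernels.shape
    T_out = (x.shape[1] - K) // stride + 1
    y = np.empty((out_ch, T_out))
    for t in range(T_out):
        seg = x[:, t * stride : t * stride + K]              # (in_ch, K) window
        y[:, t] = np.tensordot(kernels, seg, axes=([1, 2], [0, 1]))
    return y

rng = np.random.default_rng(0)
wave = rng.standard_normal((1, 16000))   # 1 s of 16 kHz raw speech, one channel

# First convolutional layer: learns a filterbank-like representation directly
# from raw samples (400-sample kernel = 25 ms, 160-sample stride = 10 ms).
h = np.maximum(conv1d(wave, 0.01 * rng.standard_normal((40, 1, 400)), stride=160), 0)

# Remaining convolutional layers further process the intermediate representation.
h = np.maximum(conv1d(h, 0.1 * rng.standard_normal((64, 40, 5)), stride=1), 0)

# Per-frame scores over 48 hypothetical phone classes, softmax across classes.
logits = conv1d(h, 0.1 * rng.standard_normal((48, 64, 1)), stride=1)
e = np.exp(logits - logits.max(axis=0))
probs = e / e.sum(axis=0)

print(probs.shape)  # (48, n_frames): one phone distribution per 10 ms frame
```

In a trained system the kernels would be learned jointly with the classifier by backpropagation, so the first layer's filters adapt to the task instead of being fixed, hand-designed cepstral transforms.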
