
## Meet the editors

Prof. George Dekoulis received his Ph.D. in Space Computing and Communications from the Computing and Communications Department, Lancaster University, UK, in 2007. He was awarded a first-class BEng (Hons) with Distinction in Communications Engineering from the Faculty of Computing, Engineering and Media, De Montfort University, UK, in 2001. He has collaborated with all major space centers, including NASA and ESA.

He is currently a member of the Faculty of Letters, University of Cyprus (UCY), Nicosia. He was previously the head of university research and founding dean of the Faculty of Sciences and Technology, American University of Cyprus (AUCY), Larnaca. His research focuses on the design of reconfigurable computing systems.

Dr. Jainath Yadav obtained an MTech and Ph.D. from the Indian Institute of Technology Kharagpur. He is currently an associate professor at the Central University of South Bihar, India. He has published several research papers in refereed journals and presented several papers at international conferences. He is a member of the Institute of Electrical and Electronics Engineers (IEEE) and a Ph.D. supervisor in his active research areas.



## Preface

This edited volume is a collection of reviewed research chapters on recent developments in computational semantics. It is divided into four sections: "Introduction", "Linguistics", "Classical Studies", and "Semantic Analysis in Computing".

After the "Introduction," the second section, "Linguistics", provides an overview of current speech recognition systems. It includes the following chapters: "Speech Recognition Based on Statistical Features", "Methods for Speech Signal Structuring and Extracting Features", and "Generalized Spectral-Temporal Features for Representing Speech Information".

The next section on "Classical Studies" includes one chapter: "Perspective Chapter: Difficulties for Translating Quevedo's Sonnets from Portuguese Translations into English".

The last section of the book, "Semantic Analysis in Computing", includes two contributions: "Toward Lightweight Cryptography: A Survey" and "Perspective Chapter: Computation of Wind Turbine Power Generation, Anomaly Detection and Predictive Maintenance".

We hope that you will enjoy reading this book and be inspired to scientifically contribute to the further success of the global computational semantics community.

> **George Dekoulis** Professor, Department of Classics and Philosophy, Faculty of Letters, University of Cyprus, Nicosia, Cyprus
>
> **Dr. Jainath Yadav** Department of Computer Science, Central University of South Bihar, Gaya, India

Section 1 Introduction

#### **Chapter 1**

## Introductory Chapter: Introduction to Computational Semantics

*George Dekoulis*

*"Some say that this is a sign of the soul, as it is buried in the present moment, and because through this, the soul signifies whatever it signifies, and this sign is rightly called a symbol."*

*Plato, Cratylus, 400 BCE*

*"Sema some say semaphores the psyche's burial into the soma during the present life. Because the psyche in the current soma semaphores polysemy, the semantic semantics are being semaphored"*

*Agaiarch Diocles, 2023 CE*

#### **1. Introduction**

Computational semantics refers to the advanced scientific tools used for processing natural languages and extracting conclusions about the different meanings they encode. The models of the different languages must be well understood and adequately transferred into the simulation and programming context. A language can be analysed into three broad areas: syntax/structure, semantics/meaning and, finally, pragmatics. Each of these, in turn, allows the principle of meaning to be analysed further. The main tool for reaching valid results is the efficient implementation of the logic principles involved in the modelling and computation processes [1].

#### **2. Natural language characteristics**

It took thousands of years for the different languages to evolve. It is the skill of developing symbols and languages and communicating with each other that separates us from the animals. We have reached a great state of mind in which a main Hellenic alphabet and subsequent ones have been created [2]. This allows us to form words and sentences and to compile complete texts on specific subjects. Humans can efficiently express their thoughts and communicate with each other.

A whole set of new scientific fields, such as linguistics, has been created in order to capture the evolution of any language. In this field of science, it is always important to determine the qualities of the subject under investigation. Chomsky suggested various parameters and methods that can be used for correctly classifying a language [3]. A strong limitation in the modelling [4] and programming [5] stages has always been the level of understanding of the people involved in the different phases. Especially in recent years, when our computational capability [6] has reached previously unseen levels of processing power [7], human limitation is still the restricting factor. For the purpose of further discussion, we will assume that a specific language can be defined through a textbook, archives or a series of representative sentences.

Grammar is probably the first thing we should seek in the skills of the various speakers. Grammar is a set of rules that stipulate and span the whole language. The correct usage of the grammatical rules determines the efficiency of the implemented algorithms. Grammar demonstrates the following characteristics: phonology, morphology, syntax and semantics. Phonology distinguishes between elementary sounds and their combination into larger phonetic complexes. Morphology is responsible for the creation of words. Syntax is concerned with how words produce sentences. Semantics corresponds to the meanings of the words that form sentences according to the syntactic rules.

In the current publication, we focus more on the top level of language processing. The different chapters start from a discussion of phonology or morphology and move up to semantics. Phonology is not discussed further in the current chapter. Based on the linguistic rules that each of the authors has implemented, the overall accuracy of the implemented algorithms is evaluated in turn.

#### **3. Morphology**

In general, alphabetic languages can be analysed into three counterparts: syntactic rules, the meanings of things and pragmatics. Syntax is concerned with the art of combining morphemes and words into larger entities, such as phrases and sentences. To understand the purpose of semantics, an excellent grasp of the corresponding language, its grammar and its syntax is needed. In many languages, although we have a great knowledge of their constituents, we know little about their pragmatic usage. Pragmatics refers to the thoughts of the language users and the sentences that are formed and exchanged between them to achieve communication. In this book, we take into consideration both speech and text from the English and Portuguese languages. Thus, we are considering samples of both natural and formal languages.

#### **4. Semantics**

In ethical teaching and philosophy, the aim is to communicate with each other, to describe and determine the truth and to acquire all the necessary virtues for a successful life. However, every language has historically also been used to deceive people, for the private benefit of the few. It is noticeable that, no matter the education level of the participants, they frequently do not converge to a single truth. Logic and reasoning are techniques that every user should be trained to use [8]. This minimises the deviation between natural and formal languages when it comes to correctly defining a meaning. Semantics is what both states of a language share.

The implementation of state-of-the-art digital logic systems for various applications is the expertise of the author [9]. Logic was used to create Hellenic, the first alphabetic language in the world [2]. Logic was used to derive all the Hellenic dialects. Based on Hellenic, the other European languages were logically derived, such as Spanish, Latin, French, English, etc. Logic has been used extensively by wise scholars to minimise the deviation between formal and natural morphemes [10]. Throughout human history, elements of the principles found in the field of discrete mathematics have been present in the language formation phases, down to the last detail of formally defining a language [11]. It is the usage of logical tools, such as predicate or propositional logic, that has permitted this. These are techniques used by the authors of this book. A great analysis of the English language in terms of its natural and formal aspects is presented in [12]. It was well known to the ancient Hellenes that all the logical methods can be used to produce an extraordinary formal language. This is evident in the works of Homer, Hesiod, Socrates, Plato, Aristotle, Proclus, the great Latin authors and many others. All these tools are also being used today to assist the convergence of natural and formal languages. Computational semantics is expedited by these techniques.

#### **5. Computational semantics**

The calculation of semantics in natural and formal languages has been based on [13] for many decades. Advanced programming techniques have been built around logical reasoning. Functional modelling and implementation have been built primarily around Montague reasoning. We call this field of mathematical thinking λ (lambda) calculus. Functional modelling is a great asset for linguists, since it allows them to parameterise all language aspects. Experimenting with the syntax, semantics and pragmatics is a means of evaluating all the classical theories. Through modelling, the linguist gets immediate results and provides feedback to the theories under test.

The algorithms we are working on for computing semantics manipulate two categories of data representations. The first is the realisation of the various semantics. We also post-process the acquired results. The initial models are put through extensive testing and their parameters are adjusted accordingly. It is through extensive feedback that we have managed to build data retrieval software for searching through archives and computer databases and to build impressive internet search software. The combination of advanced computer science and artificial intelligence tools greatly assists in designing the next generation of natural and formal language processing tools. For instance, Haskell has historically been used for functional modelling. Prolog was one of the first programming languages used to implement predicate reasoning and perform engineering modelling [14]. Prolog alone does not meet today's needs for computing semantics. Prolog has been embedded in Haskell, and many programmers have used this combination in computational semantics [15]. All modern high-level programming languages and dedicated hardware, preferably reconfigurable, are recommended for implementing computational semantics [16].
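As a toy illustration of the functional, λ-calculus style of modelling described above, the following Python sketch composes word meanings into a sentence meaning by function application. The tiny model, lexicon and words are invented for illustration and are not taken from the chapter.

```python
# Minimal sketch of lambda-calculus-style compositional semantics.
# Word meanings are functions; sentence meaning is obtained by application,
# in the spirit of Montague-style functional modelling described above.

entities = {"rex", "felix"}      # a tiny model of the world
dogs = {"rex"}
barkers = {"rex"}

# Lexicon: each word denotes a function (or an entity / a predicate).
lexicon = {
    "Rex":   lambda: "rex",                                   # proper name -> entity
    "dog":   lambda x: x in dogs,                             # noun -> predicate
    "barks": lambda x: x in barkers,                          # verb -> predicate
    "every": lambda p: lambda q: all(q(e) for e in entities if p(e)),  # determiner
    "some":  lambda p: lambda q: any(q(e) for e in entities if p(e)),
}

# Compose: [[every dog barks]] = every(dog)(barks)
every_dog_barks = lexicon["every"](lexicon["dog"])(lexicon["barks"])
rex_barks = lexicon["barks"](lexicon["Rex"]())

print(every_dog_barks)  # True in this toy model
print(rex_barks)        # True
```

The point of the sketch is only that, once word meanings are functions, sentence meaning falls out of ordinary function application, which is exactly what functional languages such as Haskell make convenient.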

#### **6. Conclusion**

Natural language processing has always intrigued linguists. However, high-performance programming has only been viable over the past 20 years. In this publication, new state-of-the-art results are presented in the areas of natural and formal language speech processing, linguistics, classical studies and computational semantics. We anticipate that this book will be an asset to researchers and that younger generations will be motivated to pursue studies in the areas of computer science, artificial intelligence, logic, linguistics and classical studies.


### **Author details**

George Dekoulis Department of Classics and Philosophy, Faculty of Letters, University of Cyprus, Nicosia, Cyprus

\*Address all correspondence to: dekoulis.george@ucy.ac.cy

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **References**

[1] Dekoulis G. Field Programmable Gate Array. London, UK: INTECH; 2017. ISBN 978-953-51-3208-0

[2] Babiniotis G. The Hellenic Alphabet. Athens, Hellas: Kentro Lexicologias; 2018. ISBN 978-960-95-8213-1

[3] Chomsky N. Syntactic Structures. The Hague/Paris: Mouton; 1957

[4] Dekoulis G. Field Programmable Gate Array (FPGAs) II. London, UK: INTECH; 2020. ISBN 978-1-83881-057-3

[5] Dekoulis G. Novel space exploration technique for analysing planetary atmospheres. In: Air Pollution Vanda Villanyi. London, UK: IntechOpen; 2010. DOI: 10.5772/10053

[6] Dekoulis G. Novel digital magnetometer for atmospheric and space studies (DIMAGORAS). In: Aeronautics and Astronautics Max Mulder. London, UK: IntechOpen; 2011. DOI: 10.5772/17326

[7] Dekoulis G, Honary F. Novel Low-Power Fluxgate Sensor Using a Macroscale Optimisation Technique for Space Physics Instrumentation. SPIE, Smart Sensors, Actuators, and MEMS III. 2007;**6589**:65890G-1-65890G-8

[8] Dekoulis G, Honary F. Novel sensor design methodology for measurements of the complex solar wind – Magnetospheric, ionospheric system. Journal of Microsystem Technologies. 2008;**14**(4-5):475-482

[9] Dekoulis G. Robotics. London, UK: INTECH; 2018. ISBN 978-953-51-3636-1

[10] Dekoulis G. Drones – applications. London, UK: INTECH; 2018. ISBN 978-953-51-5948-3

[11] Lukasiewicz J. Aristotle's Syllogistic from the Standpoint of Modern Formal Logic. Oxford: Clarendon Press; 1951

[12] Montague R. English as a Formal Language: Formal Philosophy. New Haven and London: Yale University Press; 1974. pp. 188-221

[13] Montague R. The proper treatment of quantification in ordinary English. Approaches to Natural Language. 1973;**1973**:220-243

[14] Allison L. An executable Prolog semantics. Algol Bulletin. 1983;(50):10-18

[15] Spivey JM, Seres S. Embedding Prolog in Haskell. Utrecht: Department of Computer Science, University of Utrecht; 1999

[16] Allison L. A prolog semantics. In: A Practical Introduction to Denotational Semantics: Cambridge Computer Science Text. Cambridge; 1987. pp. 102-116

Section 2 Linguistics

#### **Chapter 2**

### Speech Recognition Based on Statistical Features

*Jabbar Hussein*

#### **Abstract**

The demand for intelligent devices that can recognise a spoken utterance has long been driving speech research. The task is challenging because of the nature of language: there are no clear boundaries between words; the acoustic start and end of a word are influenced by the neighbouring words; speech varies across talkers (female/male, old/young, quiet/loud, read/spontaneous, fast/slow speaking rate); and the speech signal can be affected by ambient noise. Accordingly, speech recognition must overcome all of these challenges. To avoid such problems, data-driven statistical approaches built on large amounts of spoken data have been used. In this chapter, the aim is to explore the innovations that have made these tools possible. Speech recognition and language understanding are two important research areas that have traditionally been approached as problems in linguistics and acoustic phonetics, where acoustic-phonetic knowledge has been brought to bear on the problem with little success. Here, instead, we concentrate on statistical methods for speech and language processing, where the knowledge about a speech signal and the language that it conveys, together with practical use of that knowledge, is derived from real speech data through a well-defined mathematical-statistical formalism.

**Keywords:** speech recognition, statistical features, language model and automatic speech recognition (ASR), language modeling, neural network

#### **1. Introduction**

The objective of getting a machine to understand fluently spoken speech and respond in a natural voice has been driving speech research for over 50 years. We are still not at the point where machines reliably understand fluent speech, spoken by anybody, in any acoustic environment. Despite the remaining technical problems that need to be solved, the fields of automatic speech recognition (ASR) and understanding have made enormous advances, and the technology is now readily available and used on a daily basis in numerous applications and services [1, 2]. This chapter aims to explore the technology that has made these applications possible. Speech recognition and language understanding are two significant research thrusts that have traditionally been approached as problems in linguistics and acoustic phonetics, where a range of acoustic-phonetic knowledge has been brought to bear on the problem with remarkably little success. Here, in contrast, we focus on statistical techniques for speech and language processing, where the knowledge about a speech signal and the language that it communicates, together with practical usage of that knowledge, is derived from genuine speech data through a well-defined mathematical-statistical formalism.

#### **2. Language Modeling (LM)**

In this part, we consider the problem of building a language model from a set of example words and sentences in a language. Language models were initially developed for the problem of speech recognition (SR); they still play a dominant role in modern SR systems. They are also regularly used in other NLP applications. The parameter estimation techniques that were originally developed for language modelling, as described in this section, are important in many other settings, such as the tagging and parsing problems [3].

Our task is as follows. Assume that we have a corpus, which is a collection of sentences in one language. For example, we might have several years of text from the 'Washington Post', or we might have a very large quantity of text collected from the web. Given this corpus, we would like to estimate the parameters of a language model. A language model is defined as follows. First, we define (V) to be the set of all words in the language. For instance, when building a language model for English, we might have:

$$\mathbf{V} = \{ \text{that, cat, funs, maxim, bays, man,} \dots \}\tag{1}$$

In practice, (V) can be very large: it may contain many thousands of words, but we assume it is a finite set. A sentence in the language is a sequence of words:

$$\mathbf{x\_1, x\_2, \dots, x\_n}$$

Here (n) is an integer such that (n ≥ 1), with xi Є V for i Є {1 … (n - 1)}, and we assume that (xn) is a special symbol, HALT (we assume that HALT is not a member of V). We will shortly see why it is convenient to assume that each sentence ends in the HALT symbol.

We will use V⁺ to denote the set of all sentences over the vocabulary V: this is an infinite set, because sentences can be of any length.

We then provide the following definition:

**Definition**: (LM) A language model consists of the finite set V and a function p(x1, … xn) such that [4]:


$$p(\mathbf{x}\_1, \dots, \mathbf{x}\_n) \ge 0 \quad \text{and} \quad \sum\_{(\mathbf{x}\_1\dots\mathbf{x}\_n)\in\mathsf{V}^+} \mathbf{p}(\mathbf{x}\_1, \mathbf{x}\_2, \dots, \mathbf{x}\_n) = 1,$$

That is, p(x1, … xn) is a probability distribution over the sentences in V⁺. As an example of a poor method for learning a language model from a training corpus, define c(x1, … xn) to be the number of times the sentence (x1, … xn) is seen in our training corpus, and (N) to be the total number of sentences in the training corpus. We could then define:

$$\mathbf{p}(\mathbf{x\_1}\dots\mathbf{x\_n}) = \frac{\mathbf{c}(\mathbf{x\_1}\dots\mathbf{x\_n})}{\mathbf{N}}$$
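As a minimal sketch (the toy corpus and words below are invented for illustration), this count-based estimator can be written directly; it also exposes the weakness discussed next, namely that any unseen sentence receives probability zero.

```python
from collections import Counter

# Count-based sentence-level estimator p(x1..xn) = c(x1..xn) / N,
# as in the equation above; HALT marks the end of each sentence.
training_corpus = [
    ("the", "cat", "runs", "HALT"),
    ("the", "man", "runs", "HALT"),
    ("the", "cat", "runs", "HALT"),
]
counts = Counter(training_corpus)
N = len(training_corpus)

def p(sentence):
    return counts[tuple(sentence)] / N

print(p(["the", "cat", "runs", "HALT"]))   # 2/3: seen twice in training
print(p(["the", "man", "walks", "HALT"]))  # 0.0: an unseen sentence gets zero probability
```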

This is, however, a very poor model: in particular, it assigns probability 0 to any sentence that is not seen in the training corpus. Accordingly, it will fail to generalise to sentences that were not observed in the training data. A key practical contribution of this section is to present approaches that do generalise to sentences not seen in our training data. At first glance, the language modelling problem may seem like a somewhat odd task, so why is it considered? There are a couple of reasons [5]:


#### **3. Statistical features**

Every speech signal corresponding to a word is placed in an individual file. Various speech features can be considered; treating the spoken words simply as an acoustic signal, acoustic features can be extracted and broadly categorised, based on their semantic interpretation, into cognitive and physical traits. Furthermore, statistical features include RMS, absolute mean value (AMV), median absolute value (MAV), standard deviation (STD), variance, covariance, maximum and minimum values, and others, as follows [6, 7]:

#### **3.1 AMV**

It is computed from the absolute value of the signal data and is one of the most common components used during feature extraction. It is given by:

$$\overline{\mathbf{P}} = \frac{1}{\mathbf{R}} \sum\_{\mathbf{r}=1}^{\mathbf{R}} |\mathbf{P}\_{\mathbf{r}}| \tag{2}$$

Here (Pr) is the rth element of the data vector and (R) is the vector length.

#### **3.2 STD**

The STD feature can be used to measure the mean deviation of each element of the signal data. It is given by:

$$STD = \sqrt{\frac{1}{R-1} \sum\_{r=1}^{R} \left(P\_r - \overline{P}\right)^2} \tag{3}$$

#### **3.3 Variance**

It is the square of the STD. It is given by:

$$VAR = \frac{1}{R - 1} \sum\_{r=1}^{R} \left( P\_r - \overline{P} \right)^2 \tag{4}$$

#### **3.4 RMS**

Since the mean value of a signal often tends to be zero or nearly zero, the RMS is a better gauge of the signal's magnitude. RMS is defined as the square root of the mean square of the signal; it is related to the STD and is given by:

$$RMS = \sqrt{\frac{1}{R} \sum\_{r=1}^{R} P\_r^{\,^2}} \tag{5}$$

#### **3.5 Maximum & minimum values**

They can be considered significant features of the signal. They are found by computing the largest and smallest values of the data, as defined in the following:

$$P\_{\text{max}} = \max\left(P\_1, P\_2, \dots, P\_R\right) \tag{6}$$

$$P\_{\min} = \min\left(P\_1, P\_2, \dots, P\_R\right) \tag{7}$$

#### **3.6 MAV**

The MAV of a signal is found by taking the middle value of the sorted absolute values. If there are two middle values, the median is the mean of those values.
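The statistical features above map directly onto a few NumPy operations; the following sketch (illustrative only, with a synthetic test frame) computes them for a single signal frame.

```python
import numpy as np

def statistical_features(p):
    """Compute the statistical features of Eqs. (2)-(7) for one signal frame p."""
    p = np.asarray(p, dtype=float)
    amv = np.mean(np.abs(p))              # absolute mean value, Eq. (2)
    std = np.std(p, ddof=1)               # standard deviation, Eq. (3)
    var = np.var(p, ddof=1)               # variance, Eq. (4)
    rms = np.sqrt(np.mean(p ** 2))        # root mean square, Eq. (5)
    p_max, p_min = p.max(), p.min()       # maximum and minimum, Eqs. (6)-(7)
    mav = np.median(np.abs(p))            # median absolute value
    return dict(AMV=amv, STD=std, VAR=var, RMS=rms, MAX=p_max, MIN=p_min, MAV=mav)

# Example on a short synthetic frame
frame = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 100))
print(statistical_features(frame))
```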

#### **4. Acoustic modeling and recognition methods**

A good-quality LM is considered to be a vital part of several language-processing applications, such as SR, machine translation, and so on. The aim of an LM is to characterise likely sequences of pre-defined language units, which are normally words. Syntactic and semantic attributes of the language, encoded through the LM, guide these estimates [8].

#### **4.1 Neural network (NN)**

The aim of a language model is to estimate the probability distribution $p\left(w\_1^T\right)$ for a word sequence $w\_1^T = w\_1, \dots, w\_T$. Through the 'chain rule', this distribution can be expressed as:

$$p\left(w\_1^T\right) = \prod\_{t=1}^T p\left(w\_t|w\_1^{t-1}\right) \tag{8}$$

The following subsections show how recurrent NNs (RNNs) and feedforward NNs (FNNs) have been used to estimate this probability distribution [9].

#### *4.1.1 FNN*

Similarly to an N-gram model, the FNN uses a Markov assumption of order N − 1 to approximate (8), giving:

$$\mathbf{p}(\mathbf{w}\_1^T) \approx \prod\_{\mathbf{t}=1}^T \mathbf{p}\left(\mathbf{w}\_{\mathbf{t}}|\mathbf{w}\_{\mathbf{t}-\mathbf{N}+1}^{\mathbf{t}-1}\right) \tag{9}$$

Consequently, each of the terms involved in this product, $p\left(w\_t|w\_{t-N+1}^{t-1}\right)$, is estimated separately with one forward pass of a network according to:

$$\mathbf{P}\_{\mathbf{t}-\mathbf{j}} = \mathbf{X}\_{\mathbf{t}-\mathbf{j}}.\mathbf{U}, \mathbf{j} = \mathbf{N} - \mathbf{1}, \dots, \mathbf{1} \tag{10}$$

$$\mathbf{H}\_{\mathbf{t}} = \mathbf{f}\left(\sum\_{\mathbf{j}=1}^{\mathbf{N}-1} \mathbf{P}\_{\mathbf{t}-\mathbf{j}} \mathbf{V}\_{\mathbf{j}}\right) \tag{11}$$

$$\mathbf{O}\_{\mathbf{t}} = \mathbf{g}(\mathbf{H}\_{\mathbf{t}}.\mathbf{W}) \tag{12}$$

Where (Xt-j) is the one-hot encoding of the word (wt-j), and the columns of (U) encode continuous word representations (i.e., embeddings). Consequently, (Pt-j) is the continuous representation of the word (wt-j). V = [V1, V2, … , VN-1] and (W) are the network connection weights, which are learned during training together with (U). The function f(.) is an activation function, while g(.) is the softmax function. **Figure 1**a displays a representation of an FNN with a fixed context size (N-1 = 3) and one hidden layer.

**Figure 1** *FNN vs. RNN architecture, a) FFNN and b) RNN.*

#### *4.1.2 RNN*

An RNN attempts to capture the entire history in the context parameter (ht), which represents the state of the network and evolves over time. Thus, it approximates (8) as [10]:

$$\mathbf{p}\left(\mathbf{w}\_1^T\right) = \prod\_{i=1}^T \mathbf{p}(\mathbf{w}\_i|\mathbf{w}\_{i-1}, \mathbf{h}\_{i-1}) = \prod\_{i=1}^T \mathbf{p}(\mathbf{w}\_i|\mathbf{h}\_i) \tag{13}$$

The RNN calculates this probability similarly to the FNN. The major difference is in Eqs. (10) and (11), which are combined into:

$$\mathbf{H}\_{\mathrm{i}} = \mathbf{f}(\mathbf{X}\_{\mathrm{i-1}}.\mathbf{U} + \mathbf{H}\_{\mathrm{i-1}}.\mathbf{V}) \tag{14}$$

**Figure 1**-b shows an illustration of a typical RNN.
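A minimal NumPy sketch of one RNN language-model step, following Eqs. (12)–(14), is given below. The vocabulary size, dimensions and randomly initialised (untrained) weights are illustrative assumptions, with tanh standing in for the activation f.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                   # toy vocabulary size and hidden size
U = rng.normal(size=(V, d))    # input (embedding) weights
Vh = rng.normal(size=(d, d))   # recurrent weights, "V" in Eq. (14)
W = rng.normal(size=(d, V))    # output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(word_id, h_prev):
    """One RNN step: Eq. (14) gives h_t, Eq. (12) gives p(w_t | h_t)."""
    x = np.zeros(V); x[word_id] = 1.0      # one-hot X_{t-1}
    h = np.tanh(x @ U + h_prev @ Vh)       # Eq. (14), f = tanh
    p = softmax(h @ W)                     # Eq. (12), g = softmax
    return h, p

sentence = [3, 7, 2]                       # toy word ids; id 0 acts as a start symbol
h, prev, log_prob = np.zeros(d), 0, 0.0
for w in sentence:
    h, p = rnn_step(prev, h)               # state from previous word and previous state
    log_prob += np.log(p[w])               # accumulate log p(w_1^T), as in Eq. (13)
    prev = w
print(log_prob)
```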

#### **4.2 Hidden Markov model (HMM)**

We now turn to an important question: given a training corpus, how do we learn the function (p)? In this section we define the HMM, a dominant idea from probability theory [11].

#### *4.2.1 Fixed-length sequence Markov models*

Consider a sequence of random variables (X1, X2, …, Xn). Each random variable can take any value in a finite set (V). For now, we will assume that the length of the sequence, (n), is some fixed integer (e.g., n = 250).

Our goal is as follows: we would like to model the probability of any sequence (x1, x2, … xn), where n ≥ 1 and {xj Є V for (j = 1 … n)}; that is to say, the joint probability


P(X1 = x1, …, Xn = xn). Since there are |V|<sup>n</sup> possible sequences of the form x1 … xn, it is clearly not feasible, for reasonable values of (|V| and n), simply to list all |V|<sup>n</sup> probabilities. Next, we show the HMM for the same case, with applications to feature extraction and recognition.

#### *4.2.2 HMM and one-state method*

An HMM is a stochastic model used to predict a future event based on the preceding data. The model contains a group of states, whereby only the outputs of the states can be observed, while the transitions between the states are hidden, as shown in **Figure 2**. HMMs can be clustered into two categories, according to the nature of their outputs: the discrete HMM (DHMM) and the continuous HMM (CHMM).

In the discrete DHMM, the outputs are discrete codes emitted by the states, and the model (λ) is specified by the three parameters (π, A, B).

In the continuous CHMM, "continuous" refers to the output densities of the hidden states. Like a Gaussian distribution, the outputs follow a probability density function (PDF), the symmetric bell-shaped curve. For an observation vector (O), the PDF is found as follows:

$$P(O) = \sum\_{n=1}^{k} \frac{w\_n}{\sqrt{2\pi\sigma\_n^2}} \exp\left[-\frac{(O-\mu\_n)^2}{2\sigma\_n^2}\right] \tag{15}$$

Here (wn, μn and σn) are, respectively, the weight, mean and standard deviation of the nth Gaussian mixture component. It is significant that the covariance vector (P) corresponds to the square of (σn); therefore, the CHMM is characterised by the following set:

$$\lambda = (\mu, \Sigma, \pi, \mathbf{A}) \tag{16}$$

The following points offer a summary of its design:


**Figure 2.** *Status graph for 3-status L-R HMM.*


The distinction between CHMMs and DHMMs, with respect to the HMM parameters, lies in the emission parameter: in the CHMM it is specified through a mean and covariance rather than through discrete codes.
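The continuous emission density of Eq. (15) is a one-dimensional Gaussian mixture; the following sketch (with invented weights, means and standard deviations) evaluates it for a scalar observation.

```python
import numpy as np

def gmm_pdf(o, weights, means, stds):
    """Evaluate the mixture density of Eq. (15) for a scalar observation o."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    return np.sum(weights / np.sqrt(2 * np.pi * stds ** 2)
                  * np.exp(-(o - means) ** 2 / (2 * stds ** 2)))

# Toy 2-component mixture (values are illustrative, not from the chapter)
print(gmm_pdf(0.3, weights=[0.6, 0.4], means=[0.0, 1.0], stds=[0.5, 0.2]))
```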

#### *4.2.3 Features elicited*

In this work, feature extraction and classification were applied. Every speech signal corresponding to a word is placed in a separate file. Various speech features can be considered; furthermore, the statistical features include RMS, AMV, MAV, STD, variance, covariance, maximum and minimum values, and others.

In this work, the statistical features extracted and used were the mean value and the covariance, since statistical traits characterise the essence of the signal and thus reduce the required size and the processing time.

#### *4.2.4 Recognition*

During the recognition phase, the work is done in two parts:


#### *4.2.4.1 Network training (NTr)*

For each articulated word, an array is formed by joining all the sequences obtained from the training (NTr) words. Once the array is formed, it is given to the HMM for training. The framework applied in this work is a simple one that comprises only one state with continuous output densities. No (π) and (A) occur in the one-state framework, so in this circumstance they are equal to one. Thus, a model (λ) is simply established from the (P and μ) of the observed vectors, as displayed in **Figure 3**.

**Figure 3.** *State diagram of the continuous one-state model (COSM).*

To train the word models, the 'Baum-Welch' algorithm with one iteration was applied. A single Gaussian mixture with PDFs as in Eq. (15) was used, where P(O) = [P1, P2, P3, ..., PM]. The COSM state diagram is shown in **Figure 3**.

#### *4.2.4.2 Network testing (NTs)*

For network testing, all the articulated words that were not used during the HMM training path are used, comparably to the above, where each word is handled individually. The (P and μ) of the observation vectors are computed, and the Viterbi algorithm is applied to obtain their likelihoods against all the PDFs obtained during the training procedure. Then, the index of the highest likelihood is used to identify the new word.
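A simplified stand-in for the one-state continuous model described above is sketched below: each word is represented by the mean and covariance of its training feature vectors, and a test vector is assigned to the word with the highest Gaussian log-likelihood. The data are synthetic, and a plain multivariate Gaussian (via SciPy) replaces the Baum-Welch/Viterbi toolchain used in the chapter; this is an illustrative assumption, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_cosm(features_per_word):
    """features_per_word: dict word -> (num_samples, num_features) array."""
    models = {}
    for word, feats in features_per_word.items():
        mu = feats.mean(axis=0)
        cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
        models[word] = (mu, cov)                 # one "state" per word: (mean, covariance)
    return models

def classify(models, x):
    """Assign x to the word whose Gaussian gives the highest log-likelihood."""
    scores = {w: multivariate_normal.logpdf(x, mean=mu, cov=cov)
              for w, (mu, cov) in models.items()}
    return max(scores, key=scores.get)

# Toy data: two "words", each described by 2-D statistical feature vectors
rng = np.random.default_rng(1)
train = {"one": rng.normal([0, 0], 0.1, size=(70, 2)),
         "two": rng.normal([1, 1], 0.1, size=(70, 2))}
models = train_cosm(train)
print(classify(models, np.array([0.95, 1.05])))  # expected: "two"
```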

#### *4.2.4.3 Example study*

Tests were carried out on the work databases: 5 people, 100 examples each, with 70 words used for training (NTr) and 30 words for testing (NTs). Data generation and collection was a difficult step in this work, because spoken Arabic words are very rare on the Internet and, furthermore, works and research on spoken Arabic words are very scarce. Thus, the strategy used was to record the audio of Arabic words from people around us. The recording was carried out using a BOYA BY-M1 microphone, and Matlab (2017) was used as the program in which most of the work was done. The recordings were mono with 16-bit coding, 1 channel and an 8000 Hz sampling frequency. This configuration was chosen because the size of each recorded word is important: the smaller the size, the faster all the subsequent processing and the less memory used.

Through the HMM, the tests show that the strategy of using the (μ and P) is a good one. Accordingly, this strategy was tested using the CHMM, and the following specifications were used:

1- Pre-processing: for the word database, the first stage of the DWT gives a vector of size 2002 × 1.

2- Overlapped Hamming window with 75% overlap and length n = 100.

3- Feature extraction: *C* = [*MV MN*].

4- Network training (NTr).

After the data assembly, we ran our recognition computation with the following assignments:


Here, step (c) is varied by changing the selection of the training set.

#### **5. The results**

The outcomes shown in **Table 1** are laid out for 5 people, each with 100 words, 70 used for training and 30 for testing.



#### **Table 1**

*Recognition ratio for HMM.*

Patterns for each individual are taken, as shown in **Figure 4**, initially with 70 used for training (NTr) and 30 for testing (NTs); then, in steps of 5 words, the training patterns are decreased and the test patterns increased, our aim being to find the impact of the number of patterns on the HMM recognition ratio. As displayed in **Figure 5**, the recognition ratio decreases as the training patterns decrease, which is the expected outcome for such an algorithm.

**Figure 4.**

*One person recognition ratio with variables (NTr & NTs) words.*

**Figure 5.** *One person recognition rate with additive noise.*



#### **Table 2**

*Recognition rate for HMM with AWGN.*


#### **Table 3** *HMM comparison with MLFFNN.*

To simulate the effects of noise or faults on the performance of the recognition framework, additive white Gaussian noise (AWGN) was added to the pattern samples, both training and test, since such noise covers the whole spectrum. The outcomes display good results, as shown in **Table 2**. **Figure 5** displays the AWGN impact on the one-person recognition rate; with lower noise values, the results could be improved.

Comparing with other methods, such as a feed-forward NN (FFNN), as displayed in **Table 3**, one can see that the HMM has better outcomes.

#### **6. Conclusions**

SR means the use of an intelligent machine to recognise spoken words. SR models can be used to recognise a certain word or to verify a spoken word. Speech processing, speech production, feature extraction and, finally, the patterns corresponding to SR were presented. Our work has led us to conclude that the statistical features of the signal outperform the physical features of that signal. The pre-processing step is important for the classification goal.


### **Author details**

Jabbar Hussein College of Engineering, Kerbala University, Kerbala, Iraq

\*Address all correspondence to: jabbar.salman@uofkerbala.edu.iq

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Raut PC, Deoghare SU. Automatic speech recognition and its applications. International Research Journal of Engineering and Technology (IRJET). 2016;**03**(05):2368

[2] Wiqas G, Navdeep S. Literature review on automatic speech recognition. International Journal of Computer Applications. 2012;**41**(8):0975-8887

[3] Rafal J, Oriol V, Mike S, Noam S, Yonghui W. Exploring the limits of language modeling, Google brain. arXiv: 1602.02410v2 [cs.CL]. 11 Feb 2016;**2**

[4] Statistical Speech Recognition: A Tutorial, MC\_He\_Ch02.Indd. Achorn International; 2008

[5] Michael C. Language Modeling, (Course Notes for NLP, Columbia University, Columbia). lm-spring; 2013. Available from: http://www.cs.columbia. edu/~mcollins/lm-spring2013.pdf

[6] Othman OK, Khalid K, Aisha HA, Jamal ID. Statistical modeling for speech recognition. World Applied Sciences Journal 20 (Mathematical Applications in Engineering). IDOSI Publications. 2012;**20**:115-122. DOI: 10.5829/idosi. wasj.2012.20.mae.99935. ISSN: 1818- 4952

[7] Husam A, Hala BAW, Abdul MJ, A. H. A new proposed statistical feature extraction method in speech emotion recognition. Computers & Electrical Engineering. 2021;**93**:107172

[8] Youssef O, Dietrich K. A neural network approach for mixing language models. arXiv:1708.06989v1 [cs.CL]. 23 Aug, 2017;**1**

[9] Sundermeyer M, Oparin I, Gauvain JL, Freiberg B, Schluter R, Ney H. Comparison of Feedforward and Recurrent Neural Network Language Models, 978–1–4799-0356-6/13. IEEE. 8430 ICASSP; 2013

[10] Youssef O, Clayton G, Mittul S, Dietrich K. Sequential recurrent neural networks for language modeling. arXiv: 1703.08068v1 [cs.CL]. 23 Mar, 2017;**1**

[11] Jabbar SH, Abdulkadhim AS, Thmer RS. Arabic speaker recognition using HMM. Indonesian Journal of Electrical Engineering and Computer Science. 2021;**23**(2):1212-1218. ISSN: 2502-4752, DOI: 10.11591/ijeecs.v23.i2.pp 1212-1218

#### **Chapter 3**

## Methods for Speech Signal Structuring and Extracting Features

*Eugene Fedorov, Tetyana Utkina and Tetiana Neskorodieva*

#### **Abstract**

The preliminary stage of biometric identification is speech signal structuring and feature extraction. For calculation of the fundamental tone, the following methods are considered and numerically investigated: the autocorrelation function (ACF) method, the average magnitude difference function (AMDF) method, the simplified inverse filter transformation (SIFT) method, a method based on wavelet analysis, a method based on cepstral analysis and the harmonic product spectrum (HPS) method. For speech signal feature extraction, the following methods are considered and numerically investigated: the digital bandpass filter bank; spectral analysis; homomorphic processing; and linear predictive coding. These methods make it possible to extract linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), perceptual log area ratio (PLAR) coefficients, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC) and reconsidered perceptual log area ratio (RPLAR) coefficients. The largest probability of identification (equal to 0.98) and the smallest number of coefficients (4 coefficients) are provided by coding of a vocalic speech sound from the TIMIT corpus based on PRC.

**Keywords:** speech recognition, speech signal structuring and extracting features, the digital bandpass filters bank, spectral analysis, homomorphic processing, linear predictive coding

#### **1. Introduction**

Most often, the following features are extracted from a speech signal [1–10]: power features (energy of spectral bands); cepstrum; linear prediction parameters; fundamental tone and formants; mel-frequency cepstral coefficients (MFCC); bark-frequency cepstral coefficients (BFCC); parameters of perceptual linear prediction; and parameters of the reconsidered perceptual linear prediction.

For feature extraction from a speech signal, one usually uses [1–10]: a digital bandpass filter bank; spectral analysis (Fourier transform, wavelet transform); homomorphic processing; linear predictive coding; the MFCC method; the BFCC method; perceptual linear prediction; and reconsidered perceptual linear prediction.

#### **2. Calculation methods of the fundamental tone**

For calculation of the fundamental tone, methods are used that are based on the analysis of the following signal representations [3]: amplitude-time; spectral (amplitude-frequency); cepstral (amplitude-quefrency); and wavelet-spectral (amplitude-time-frequency).

#### **2.1 ACF method**

The autocorrelation function (ACF) method searches for the maximum value of the autocorrelation function [3]:

1. For the chosen signal frame of length *ΔN*, the autocorrelation function is calculated:

$$R(k) = \frac{1}{\Delta N} \sum\_{n=0}^{\Delta N - 1 - k} \varkappa(n)\varkappa(n+k), \ k \in \overline{0, \Delta N - 1}. \tag{1}$$

2. The value of *k* at which the autocorrelation function *R*(*k*) is maximal is determined, corresponding to extraction of the period of the speech signal:

$$k^\* = \arg\max\_k R(k), \ k \in \overline{0, \Delta N - 1}. \tag{2}$$

The period of the fundamental tone is then defined as

$$T\_{\rm OT} = \begin{cases} k^\*, & k^\* \in [n\_1, n\_2] \\ 0, & k^\* \notin [n\_1, n\_2] \end{cases} \tag{3}$$

where *n*1 is the minimum length of the fundamental tone period, *n*1 = inf *T*OT, and *n*2 is the maximum length of the fundamental tone period, *n*2 = sup *T*OT.

#### **2.2 AMDF method**

The average magnitude difference function (AMDF) method searches for the minimum value of the average magnitude difference [3], which is faster than searching for the maximum value of the autocorrelation function.

1. For the chosen signal frame of length *ΔN*, the average magnitude difference function is calculated:

$$v(k) = \frac{1}{\Delta N} \sum\_{n=0}^{\Delta N - 1} |\kappa(n) - \kappa(n+k)|, \ k \in \overline{0, \Delta N - 1}. \tag{4}$$

2. The value of *k* at which the average magnitude difference function *v*(*k*) is minimal is determined, corresponding to extraction of the period of the speech signal:

$$k^\* = \arg\min\_k v(k), \ k \in \overline{0, \Delta N - 1}. \tag{5}$$

The period of the fundamental tone is then defined as

$$T\_{\rm OT} = \begin{cases} k^\*, & k^\* \in [n\_1, n\_2] \\ 0, & k^\* \notin [n\_1, n\_2] \end{cases} \tag{6}$$

where *n*1 is the minimum length of the fundamental tone period, *n*1 = inf *T*OT, and *n*2 is the maximum length of the fundamental tone period, *n*2 = sup *T*OT.
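Both pitch estimators reduce to a few lines of NumPy. The sketch below (the test tone and the admissible pitch range are illustrative assumptions) implements Eqs. (1)–(2) and (4)–(5) and converts the found lag to a fundamental frequency.

```python
import numpy as np

def acf_pitch_period(x, fs, f_min=60.0, f_max=400.0):
    """Lag of the ACF maximum, Eqs. (1)-(2), restricted to an admissible range."""
    dN = len(x)
    R = np.array([np.sum(x[:dN - k] * x[k:]) / dN for k in range(dN)])
    n1, n2 = int(fs / f_max), int(fs / f_min)    # [n1, n2]: admissible period range
    return n1 + np.argmax(R[n1:n2])

def amdf_pitch_period(x, fs, f_min=60.0, f_max=400.0):
    """Lag of the AMDF minimum, Eqs. (4)-(5), restricted to an admissible range."""
    dN = len(x)
    v = np.array([np.sum(np.abs(x[:dN - k] - x[k:])) / dN for k in range(dN)])
    n1, n2 = int(fs / f_max), int(fs / f_min)
    return n1 + np.argmin(v[n1:n2])

fs = 22050
t = np.arange(512) / fs
x = np.sin(2 * np.pi * 150 * t)                  # 150 Hz test tone
print(fs / acf_pitch_period(x, fs), fs / amdf_pitch_period(x, fs))  # estimated F0 in Hz
```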

#### **2.3 SIFT method**

The simplified inverse filter transformation (SIFT) method searches for the maximum value of the autocorrelation function of the linear prediction error of the decimated signal [4]:

1. For the chosen signal frame of length *ΔN*, low-frequency filtering is carried out:

	- DFT (discrete Fourier transform)

$$X(k) = \sum\_{n=0}^{\Delta N - 1} x(n) e^{-j(2\pi/\Delta N)nk}, \ k \in \overline{0, \Delta N - 1};\tag{7}$$

• extraction of the lower frequencies

$$X\_{low}(k) = \begin{cases} X(k), & 0 \le k \le k\_{cut} \\ 0, & k\_{cut} < k \le \Delta N - 1 \end{cases}, \quad k\_{cut} = \left[ f\_{cut} \cdot \Delta N / f\_d \right],\tag{8}$$

where *f*<sub>d</sub> is the sampling frequency;

• calculation of the inverse DFT

$$y(n) = \text{Re}\left(\frac{1}{\Delta N} \sum\_{k=0}^{\Delta N - 1} X\_{low}(k) e^{j(2\pi/\Delta N)nk}\right), \quad n \in \overline{0, \Delta N - 1}.\tag{9}$$

2. The sampling frequency is decreased to *f*<sub>1d</sub> = 2000 Hz by decimating the signal, i.e. intermediate samples of the signal are removed:

$$\kappa(n) = \mathbf{y}(n \cdot \Delta n), \ n \in \overline{\mathbf{0}, \Delta \mathbf{N}/\Delta n - 1},\tag{10}$$

where Δ*n* = [*f*<sub>d</sub> / *f*<sub>1d</sub>] is the decimation coefficient and *f*<sub>d</sub> is the sampling frequency.

3. The differences of adjacent samples of the decimated signal are calculated:

$$s\_{\Delta}(n) = \begin{cases} s(n), & n = 0\\ s(n) - s(n-1), & n > 0 \end{cases}, \ n \in \overline{0, \Delta N/\Delta n - 1}. \tag{11}$$

4. The autocorrelation function is calculated:

$$\hat{s}\_{\Delta}(n) = s\_{\Delta}(n)w(n), \quad w(n) = 0.54 + 0.46 \cos \frac{2\pi n}{\Delta N}, \tag{12}$$

$$R(k) = \sum\_{n=0}^{\Delta N/\Delta n - 1 - k} \hat{s}\_{\Delta}(n)\hat{s}\_{\Delta}(n+k), \ k \in \overline{0, p},\tag{13}$$

where *w*(*n*) is Hamming's window, *p* is the order of linear prediction, ceil(*f*<sub>1d</sub>/1000) ≤ *p* ≤ 5 + ceil(*f*<sub>1d</sub>/1000), and ceil(*f*) is the function which rounds *f* up to the next integer.


$$e(n) = \begin{cases} \hat{s}\_{\Delta}(n), & n < p \\ \hat{s}\_{\Delta}(n) - \sum\_{k=1}^{p} a\_k \hat{s}\_{\Delta}(n-k), & n \ge p \end{cases}, \quad n \in \overline{0, \Delta N/\Delta n - 1}, \tag{14}$$

where *e*(*n*) is the prediction error.

7. The autocorrelation function of the linear prediction error is calculated:

$$e\_w(n) = e(n)w(n), \\ w(n) = 0.54 + 0.46 \cos \frac{2\pi n}{\Delta N}, \tag{15}$$

$$r(k) = \sum\_{n=0}^{\Delta N/\Delta n - 1 - k} e\_w(n) e\_w(n+k), \ k \in \overline{0, \Delta N/\Delta n - 1},\tag{16}$$

where *w*(*n*) is Hamming's window.

8. The value of *k* at which the autocorrelation function *r*(*k*) is maximal is determined, corresponding to extraction of the period of the speech signal:

$$k^{\*} = \arg\max\_{k} r(k), \quad r^{\*} = \max\_{k} r(k), \; k\Delta n \in [n\_1, n\_2],\tag{17}$$

where *n*1 is the minimum length of the fundamental tone period, *n*1 = inf *T*OT, and *n*2 is the maximum length of the fundamental tone period, *n*2 = sup *T*OT. Thus, the length of the fundamental tone period is determined as

$$T\_{\rm OT} = \begin{cases} k^\* \Delta n, & r^\* \ge \gamma \\ 0, & r^\* < \gamma \end{cases},\tag{18}$$

where *γ* is the threshold value.


#### *Example 1*

A noisy version of the signal was obtained by adding white Gaussian noise with mean 0 and variance 0.001, with *M* = 1. As the signal, a frame of the sound "A" of length Δ*N* = 512 was chosen, with sampling frequency *f*<sub>d</sub> = 22050 Hz, 8 bits, mono. **Figures 1**–**6** present the initial signal (**Figure 1**), the filtered signal (**Figure 2**), the decimated signal (**Figure 3**), the signal in the form of the weighted difference (**Figure 4**), the prediction error (**Figure 5**) and the autocorrelation function of the prediction error, with the found maximum and admissible boundaries marked (**Figure 6**).

#### **2.4 Method on a basis a wavelet analysis**

This method calculates the distance between adjacent extrema of the wavelet coefficients. At the first stage, the continuous wavelet transform, approximated according to the rectangle rule, is calculated:

**Figure 1.** *Initial signal.*

**Figure 2.** *The filtered signal.*

**Figure 4.**

*A signal in the form of the weighed difference.*

**Figure 5.** *Prediction error.*


**Figure 6.** *Autocorrelated function of prediction error.*

$$d\_{\mu l} = \sum\_{n=0}^{N-1} x(n)\, a\_0^{-\mu/2}\, \overline{\psi\left(a\_0^{-\mu} n - b\_0 l\right)}\, \Delta t, \quad l \in \overline{0, N-1}, \ \Delta t = 1/f\_d,\tag{19}$$

where *μ*—the decomposition level at which the smooth sinusoid is reached, *N*—signal length, Δ*t*—quantization step.

For Morlet's wavelet

$$\psi(\xi) = \left(2\pi\right)^{-1/2} \cos\left(k\_0 \xi\right) e^{-\xi^2/2}, \ \ k\_0 = 5, \ \xi = a\_0^{-\mu} n - b\_0 l. \tag{20}$$

Since the sequence *dμl* represents a smooth sinusoid, there is no need for the autocorrelation function or the average magnitude difference function, which have considerable computational complexity. Instead of calculating these functions, at the second stage two consecutive maxima in the sequence *dμl* are found and the difference between them is calculated:

$$d\_{\mu,j-1} \le d\_{\mu j} \ge d\_{\mu,j+1}, \quad d\_{\mu,m-1} \le d\_{\mu m} \ge d\_{\mu,m+1}, \quad d\_{\mu,k-1} \ge d\_{\mu k} \le d\_{\mu,k+1}, \quad j < k < m, \quad k^\* = m - j. \tag{21}$$

The period of the fundamental tone is defined as

$$T\_{\rm OT} = \begin{cases} k^\*, & k^\* \in [n\_1, n\_2] \\ 0, & k^\* \notin [n\_1, n\_2] \end{cases} \tag{22}$$

where *n*1 is the minimum length of the fundamental tone period, *n*1 = inf *T*OT, and *n*2 is the maximum length of the fundamental tone period, *n*2 = sup *T*OT.

*Example 2*

**Figure 7** shows a sound "A", and **Figure 8** shows the sound "A" at the *μ* = 50 decomposition level.

**Figure 7.** *Sound "A" for wavelet analysis.*

**Figure 8.** *A sound "A" at the 50th level of decomposition (frequency range is 51–250 Hz).*

#### **2.5 Method based on the cepstral analysis**

This method searches for the maximum value of the cepstrum [3].

1. For the chosen signal frame of length Δ*N*, the spectrum is calculated using the DFT:

$$X(k) = \sum\_{n=0}^{\Delta N - 1} x(n) e^{-j(2\pi/\Delta N)nk}, \ \ k \in \overline{0, \Delta N - 1}. \tag{23}$$

2. The cepstrum is calculated using the inverse DFT:

$$s(n) = \frac{1}{\Delta N} \sum\_{k=0}^{\Delta N - 1} \lg |\mathbf{X}(k)|^2 e^{j(2\pi/\Delta N)nk}, \quad n \in \overline{0, \Delta N - 1}. \tag{24}$$

3. The value of *n* at which the cepstrum *s*(*n*) is maximal is determined, corresponding to extraction of the period of the speech signal:

$$n^\* = \arg\max\_n s(n), \boldsymbol{s}^\* = \max\_n s(n), \ n \in [n\_1, n\_2],\tag{25}$$

where *n*1 is the minimum length of the fundamental tone period, *n*1 = inf *T*OT, and *n*2 is the maximum length of the fundamental tone period, *n*2 = sup *T*OT. The period of the fundamental tone is defined as

$$T\_{\rm OT} = \begin{cases} n^\*, & s^\* \ge \gamma \\ 0, & s^\* < \gamma \end{cases} \tag{26}$$

where *γ* is the threshold value.
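A compact NumPy sketch of the cepstral method (Eqs. (23)–(26)) follows; the test signal and the admissible quefrency range are illustrative, and the threshold γ is left as a parameter to be tuned.

```python
import numpy as np

def cepstral_pitch_period(x, fs, f_min=60.0, f_max=400.0, gamma=0.0):
    """Period of the fundamental tone via the cepstrum, Eqs. (23)-(26)."""
    dN = len(x)
    X = np.fft.fft(x)                                           # Eq. (23)
    c = np.real(np.fft.ifft(np.log10(np.abs(X) ** 2 + 1e-12)))  # Eq. (24)
    n1, n2 = int(fs / f_max), int(fs / f_min)                   # admissible quefrency range
    n_star = n1 + np.argmax(c[n1:n2])
    s_star = c[n_star]
    return n_star if s_star >= gamma else 0                     # Eq. (26), threshold gamma

fs = 22050
t = np.arange(512) / fs
x = sum(np.sin(2 * np.pi * 150 * h * t) / h for h in range(1, 6))  # harmonic-rich 150 Hz
T = cepstral_pitch_period(x, fs)
print(T, fs / T if T else None)    # period in samples and estimated F0 in Hz
```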

#### *Example 3*

As the signal, a frame of the sound "A" of length Δ*N* = 512 is chosen, with sampling frequency *f*<sub>d</sub> = 22050 Hz, 8 bits, mono. **Figure 9** shows the initial signal, and **Figure 10** the cepstrum of the signal.

#### **2.6 HPS method**

The harmonic product spectrum (HPS) method searches for the maximum value of the product of harmonics of the decimated power spectrum [3].

1. For the chosen signal frame of length Δ*N*, the spectrum is calculated using the DFT:

**Figure 9.** *Initial signal for cepstrum analysis.*

**Figure 10.** *Cepstrum of a sound "A".*

$$X(k) = \sum\_{n=0}^{\Delta N - 1} x(n) e^{-j(2\pi/\Delta N)nk}, \ \ k \in \overline{0, \Delta N - 1}. \tag{27}$$

2.The power spectrum of a signal is calculated

$$W(k) = |X(k)|^2, \ k \in \overline{0, \Delta N - 1}. \tag{28}$$

3. The power spectrum of the signal is decimated *Z* times, i.e. intermediate frequencies of the power spectrum of the signal are removed:

$$W\_z(k) = |X(zk)|^2, \ \ k \in \overline{0, \left[\Delta N/z\right] - 1}, \ z \in \overline{1, Z},\tag{29}$$

where [⋅] denotes the integer part of a number.

4. The product of harmonics of the decimated power spectrum is calculated:

$$P(k) = \prod\_{z=1}^{Z} W\_z(k), \ k \in \overline{0, \left[\Delta N/Z\right] - 1}. \tag{30}$$

5. The value of *k* at which the product of harmonics of the decimated power spectrum is maximal is determined, corresponding to extraction of the period of the speech signal:

$$k^\* = \arg\max\_k P(k), \ k \in \overline{0, \lceil \Delta N/Z \rceil - 1}. \tag{31}$$

The frequency of the fundamental tone is determined as

$$F\_{\rm OT} = \begin{cases} k^\*, & k^\* \in [k\_1, k\_2] \\ 0, & k^\* \notin [k\_1, k\_2] \end{cases} \tag{32}$$

where *k*1 is the minimum frequency of the fundamental tone, *k*1 = inf *F*OT, and *k*2 is the maximum frequency of the fundamental tone, *k*2 = sup *F*OT.
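The HPS pipeline of Eqs. (27)–(32) can be sketched as follows; the number of decimations Z, the frequency limits and the synthetic test signal are illustrative assumptions.

```python
import numpy as np

def hps_pitch(x, fs, Z=4, f_min=60.0, f_max=400.0):
    """Fundamental frequency via the harmonic product spectrum, Eqs. (27)-(32)."""
    dN = len(x)
    W = np.abs(np.fft.fft(x)) ** 2                 # power spectrum, Eq. (28)
    L = dN // Z
    P = np.ones(L)
    for z in range(1, Z + 1):                      # decimate and multiply, Eqs. (29)-(30)
        P *= W[::z][:L]                            # W_z(k) = |X(zk)|^2
    k1, k2 = int(np.ceil(f_min * dN / fs)), int(f_max * dN / fs)  # admissible bins
    k_star = k1 + np.argmax(P[k1:k2])              # Eq. (31)
    return k_star * fs / dN                        # bin index converted to Hz, Eq. (32)

fs = 22050
t = np.arange(2048) / fs
x = sum(np.sin(2 * np.pi * 150 * h * t) / h for h in range(1, 6))  # harmonic-rich 150 Hz
print(hps_pitch(x, fs))
```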

The SIFT, ACF, AMDF and cepstral-analysis methods depend on the noise level. The HPS and wavelet-analysis methods are resistant to noise.

The SIFT and cepstral-analysis methods require a threshold to be set.

The wavelet-analysis method requires setting the decomposition level. The HPS method requires setting the number of decimations.

#### **3. Calculation method of linear prediction parameters**

The linear predictive coding method uses an amplifier and a digital filter (**Figure 11**).

Thus, the signal can be represented as the output of a linear system with time-varying parameters, excited by quasi-periodic impulses or random noise.

The transfer function of a linear system with variable parameters *H*(*z*) is considered as the ratio of the output signal spectrum *S*(*z*) to the input signal spectrum *U*(*z*):

$$H(\mathbf{z}) = \frac{\mathbf{S}(\mathbf{z})}{U(\mathbf{z})} = \frac{\mathbf{G}}{A(\mathbf{z})}, \ \ A(\mathbf{z}) = \mathbf{1} - \sum\_{k=1}^{p} a\_k \mathbf{z}^{-k}, \tag{33}$$

where *A*(*z*) is the inverse filter for the system *H*(*z*), *G* is the gain coefficient, and *p* is the prediction order (filter order).

The input signal *u*(*n*) is represented by a pulse sequence and noise. The model has the following parameters: the gain coefficient *G* and the coefficients of the digital filter {*a*k}. All these parameters change slowly in time and can be estimated frame by frame.

This method uses as features the linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC) and log area ratio (LAR) coefficients [3].

1. The signal *s*(*m*) is broken into *L* frames of length Δ*N*. For the *n*-th frame, the spectrum, which has a steep descent in the high-frequency region, is balanced by means of a filter:

$$
\overline{\mathfrak{s}\_n}(m) = \mathfrak{s}\_n(m+1) - \alpha \mathfrak{s}\_n(m), \ m \in \overline{0, \Delta N - 1}, \tag{34}
$$

where *α*—filtration parameter, 0< *α*<1.

**Figure 11.** *The block diagram of the simplified model of signal formation.*

2. For the *n*-th frame the autocorrelation function *Rn*(*k*) is calculated

$$
\hat{s}\_n(m) = \bar{s}\_n(m)w(m), \\
w(m) = 0.54 + 0.46 \cos \frac{2\pi m}{\Delta N}, \tag{35}
$$

$$R\_n(k) = \sum\_{m=0}^{\Delta N - 1 - k} \hat{s}\_n(m)\hat{s}\_n(m+k), \ k \in \overline{0, p},\tag{36}$$

where *w*(*m*) is the Hamming window, *p* is the linear prediction order, ceil(*fd*/1000) ≤ *p* ≤ 5 + ceil(*fd*/1000), and ceil(*f*) is the function that rounds *f* up to the nearest integer.


$$\mathbf{G}\_{n} = \sqrt{E\_{n}} = \sqrt{R\_{n}(\mathbf{0}) - \sum\_{k=1}^{p} a\_{nk} R\_{n}(k)}.\tag{37}$$

5. For the *n*-th frame the linear prediction cepstral coefficients (LPCC) are calculated

$$LPCC\_n(m) = \begin{cases} \ln G\_n, & m = 0\\ a\_{nm}, & m = 1\\ a\_{nm} - \sum\_{k=1}^{m-1} (k/m) LPCC\_n(k) a\_{n, m-k}, & 2 \le m \le p \end{cases}, \quad m \in \overline{0, p}. \tag{38}$$

6. For the *n*-th frame the log area ratio (LAR) coefficients are calculated

$$LAR\_{nm} = \ln\left(\frac{1 - k\_{nm}}{1 + k\_{nm}}\right), \ m \in \overline{1, p}.\tag{39}$$
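As an illustration of steps 1-6, the following Python sketch computes these parameters for a single frame. It also fills in the Levinson-Durbin recursion, which yields the LPC and reflection coefficients used in Eqs. (37)-(39) but whose intermediate equations are not reproduced above. The function name, the choice *p* = 5 + ceil(*fd*/1000), and the pre-emphasis value α = 0.95 are assumptions made for the example.

```python
import numpy as np

def lpc_features(frame, fs, alpha=0.95):
    """Sketch of Eqs. (34)-(39): pre-emphasis, windowed autocorrelation,
    Levinson-Durbin recursion, gain, LPCC and LAR for one frame."""
    p = 5 + int(np.ceil(fs / 1000))                 # upper end of the stated range for p
    s = frame[1:] - alpha * frame[:-1]              # pre-emphasis, Eq. (34)
    dN = len(s)
    w = 0.54 + 0.46 * np.cos(2 * np.pi * np.arange(dN) / dN)   # window of Eq. (35)
    s = s * w
    R = np.array([np.dot(s[:dN - k], s[k:]) for k in range(p + 1)])   # Eq. (36)

    # Levinson-Durbin recursion for A(z) = 1 - sum a_k z^{-k} (Eq. (33) convention)
    a = np.zeros(p + 1)          # a[1..p] are the LPC coefficients
    refl = np.zeros(p + 1)       # refl[1..p] are the reflection coefficients (RC)
    E = R[0]
    for m in range(1, p + 1):
        k_m = (R[m] - np.dot(a[1:m], R[m - 1:0:-1])) / E
        refl[m] = k_m
        a_prev = a.copy()
        a[m] = k_m
        for j in range(1, m):
            a[j] = a_prev[j] - k_m * a_prev[m - j]
        E *= 1.0 - k_m ** 2

    G = np.sqrt(R[0] - np.dot(a[1:], R[1:]))        # gain, Eq. (37)

    lpcc = np.zeros(p + 1)                          # LPCC, Eq. (38)
    lpcc[0] = np.log(G)
    lpcc[1] = a[1]
    for m in range(2, p + 1):
        lpcc[m] = a[m] - sum((k / m) * lpcc[k] * a[m - k] for k in range(1, m))

    lar = np.log((1 - refl[1:]) / (1 + refl[1:]))   # LAR, Eq. (39)
    return a[1:], refl[1:], lpcc, lar
```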

#### **4. Formant calculation method**

For the *n*-th frame the logarithmic power spectrum is calculated using the gain coefficient and the linear prediction coefficients (LPC) [3, 4]

$$\begin{split} 10 \lg W\_n(k) &= 10 \lg \left| \frac{G\_n}{A\_n(z)} \right|^2 = \\ &= 10 \lg \frac{G\_n^2}{\left(1 - \sum\_{m=1}^p a\_{nm} \cos\left(\frac{2\pi}{\Delta N}km\right)\right)^2 + \left(\sum\_{m=1}^p a\_{nm} \sin\left(\frac{2\pi}{\Delta N}km\right)\right)^2} \end{split} \tag{40}$$

For speaker identification or speech recognition, the analysis of voiced sounds is limited to the frequency range from 0 to 3 kHz and the first three formants *F*1, *F*2, *F*3 are used. For speech synthesis, the frequency range from 0 to 4–5 kHz is used and the first five formants *F*1, *F*2, *F*3, *F*4, *F*5 are used.
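A small Python sketch of this step is given below: it evaluates the logarithmic LPC power spectrum of Eq. (40) from the gain and LPC coefficients (for example, those returned by the Section 3 sketch) and picks the first spectral maxima as rough formant estimates. The function names and the simple peak-picking rule are illustrative assumptions, not part of the chapter.

```python
import numpy as np

def lpc_log_spectrum(gain, lpc, dN=512):
    """Logarithmic LPC power spectrum of one frame, Eq. (40)."""
    k = np.arange(dN // 2)[:, None]                 # frequency bins 0..dN/2-1
    m = np.arange(1, len(lpc) + 1)[None, :]
    phase = 2 * np.pi * k * m / dN
    re = 1.0 - (lpc * np.cos(phase)).sum(axis=1)
    im = (lpc * np.sin(phase)).sum(axis=1)
    return 10 * np.log10(gain ** 2 / (re ** 2 + im ** 2))

def formant_estimates(log_spec, fs, dN=512, n_formants=3):
    """Very rough formant estimates: the first local maxima of the spectrum."""
    peaks = [k for k in range(1, len(log_spec) - 1)
             if log_spec[k - 1] < log_spec[k] >= log_spec[k + 1]]
    return [k * fs / dN for k in peaks[:n_formants]]
```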

#### *Example 4*

**Figure 12** shows the logarithmic power spectrum of the central frame of the sound "A" at different prediction orders; the frame length is *N* = 512 and the sampling frequency is *fd* = 22050 Hz.

As can be seen from **Figure 12**, extraction of the formants (maxima in the spectrum) is already possible at *p* = 30.

#### *Example 5*

**Figure 13** shows the sound "A" and **Figure 14** its logarithmic LPC power spectrum. **Figure 15** shows the central frame of the sound "Sh" and **Figure 16** its logarithmic LPC power spectrum. In both cases the frame length is *N* = 512, the sampling frequency is *fd* = 22050 Hz (8 bits, mono), and the prediction order is *p* = 30.

**Figure 12.** *The Logarithmic power spectrum of LPC of a sound "A" at different orders of prediction p.*

**Figure 13.** *Sound "A".*

**Figure 14.** *Logarithmic power spectrum of LPC sound "A" at a prediction order p = 30.*

**Figure 15.** *Sound "Sh".*

**Figure 16.** *Logarithmic power spectrum of LPC sound "Sh" at an order of prediction p = 30.*

#### **5. Method for calculating mel-frequency cepstral coefficients**

This method is based on homomorphic processing and uses mel-frequency cepstral coefficients (MFCC) as features [5, 6].

1. The signal *s*(*m*) is split into L frames of length Δ*N*. For the *n*-th frame, the spectrum, which has a steep descent in the high-frequency region, is balanced by a first-order filter (pre-emphasis)

$$
\bar{s}\_n(m) = s\_n(m+1) - \alpha s\_n(m), \ m \in \overline{0, \Delta N - 1}, \tag{41}
$$

where *α* is the filter parameter, 0 < *α* < 1.

2. For the *n*-th frame the spectrum is calculated using the DFT

$$
\hat{s}\_n(m) = \bar{s}\_n(m)w(m), \\
w(m) = 0.54 + 0.46 \cos \frac{2\pi m}{\Delta N}, \tag{42}
$$

$$\hat{S}\_n(k) = \sum\_{m=0}^{\Delta N - 1} \hat{s}\_n(m) \ e^{-j(2\pi/\Delta N)km}, \ k \in \overline{\mathbf{0}, \Delta N - 1},\tag{43}$$

where *w*(*m*) is the Hamming window.

3. For the *n*-th frame the energy of the *m*-th mel-frequency band is calculated, using the frequency transformation and the Bartlett window

$$\hat{E}\_{nm} = \sum\_{k=0}^{\Delta N/2 - 1} \left| \hat{S}\_n(k) \right|^2 w\_m(k), m \in \overline{1, P}, w\_m(k) = \begin{cases} 0, & k < f\_{m-1} \lor k > f\_{m+1} \\ \frac{k - f\_{m-1}}{f\_m - f\_{m-1}}, & f\_{m-1} \le k \le f\_m \\ \frac{f\_{m+1} - k}{f\_{m+1} - f\_m}, & f\_m \le k \le f\_{m+1} \end{cases},\tag{44}$$

$$f\_m = \frac{N}{f\_d} B^{-1} \left( B \left( f^{\text{min}} \right) + m \frac{B(f^{\text{max}}) - B \left( f^{\text{min}} \right)}{P + 1} \right), \ m \in \overline{0, P + 1}, \tag{45}$$

$$B(f) = 1125 \ln\left(1 + f/700\right), \quad B^{-1}(b) = 700(\exp\left(b/1125\right) - 1),\tag{46}$$

where *Ênm* is the energy of the *m*-th mel-frequency band, *wm*(*k*) is the Bartlett window for the *m*-th band, *B*(*f*) is the function that converts frequency in Hz to frequency in mel, *B*−1(*b*) is the function that converts frequency in mel to frequency in Hz, *fm* is the normalized frequency, *f*min and *f*max are the minimum and maximum frequencies in Hz (for example, *f*min = 0, *f*max = *fd*/2), *fd* is the sampling frequency of the speech signal in Hz, and *P* is the number of mel-frequency bands.

4. For the *n*-th frame the mel-frequency cepstral coefficients (MFCC) are calculated using the inverse discrete cosine transform DCT-2

$$\begin{aligned} \text{MFCC}\_n(m) &= \sqrt{\frac{2}{P}} \sum\_{k=0}^{P-1} \ln\left(\hat{E}\_{n,k+1}\right) a(k) \cos\left(\frac{(2m+1)k\pi}{2P}\right), \ m \in \overline{0, \tilde{P}-1}, \\ a(k) &= \begin{cases} \sqrt{\frac{1}{2}}, & k=0 \\ 1, & k>0 \end{cases} \end{aligned} \tag{47}$$

where *P̃* is the number of mel-frequency cepstral coefficients, 1 ≤ *P̃* ≤ *P*.
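For illustration, the Python sketch below follows Eqs. (41)-(47) for one frame. The values P = 20 mel bands (the value used later in Section 9), n_ceps = 13, and α = 0.95 are assumed defaults, and the function name is illustrative.

```python
import numpy as np

def mfcc(frame, fs, P=20, n_ceps=13, alpha=0.95):
    """Sketch of Eqs. (41)-(47) for one frame."""
    B = lambda f: 1125 * np.log(1 + f / 700)            # Hz -> mel, Eq. (46)
    Bi = lambda b: 700 * (np.exp(b / 1125) - 1)         # mel -> Hz, Eq. (46)
    s = frame[1:] - alpha * frame[:-1]                  # pre-emphasis, Eq. (41)
    dN = len(s)
    w = 0.54 + 0.46 * np.cos(2 * np.pi * np.arange(dN) / dN)   # window of Eq. (42)
    S = np.abs(np.fft.fft(s * w)) ** 2                  # power spectrum, Eq. (43)
    # band edges f_0 .. f_{P+1} in DFT bins, Eq. (45), with f_min = 0, f_max = fs/2
    mels = B(0) + np.arange(P + 2) * (B(fs / 2) - B(0)) / (P + 1)
    f = dN / fs * Bi(mels)
    E = np.zeros(P)
    k = np.arange(dN // 2)
    for m in range(1, P + 1):                           # Bartlett (triangular) windows, Eq. (44)
        rising = (k - f[m - 1]) / (f[m] - f[m - 1])
        falling = (f[m + 1] - k) / (f[m + 1] - f[m])
        w_m = np.clip(np.minimum(rising, falling), 0.0, None)
        E[m - 1] = np.sum(S[: dN // 2] * w_m)
    # DCT of the log band energies, Eq. (47); small epsilon avoids log(0)
    a = np.where(np.arange(P) == 0, np.sqrt(0.5), 1.0)
    return np.array([np.sqrt(2 / P) *
                     np.sum(np.log(E + 1e-12) * a *
                            np.cos((2 * m + 1) * np.arange(P) * np.pi / (2 * P)))
                     for m in range(n_ceps)])
```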

#### **6. Method of bark-frequency cepstral coefficients calculation**

This method is based on homomorphic processing and uses bark-frequency cepstral coefficients (BFCC) as features [7, 8].

1. The signal *s*(*m*) is split into L frames of length Δ*N*. For the *n*-th frame the spectrum is calculated using the DFT

$$
\hat{s}\_n(m) = s\_n(m)w(m), \\
w(m) = 0.54 + 0.46 \cos \frac{2\pi m}{\Delta N}, \tag{48}
$$

$$\hat{S}\_n(k) = \sum\_{m=0}^{\Delta N - 1} \hat{s}\_n(m) e^{-j(2\pi/\Delta N)km}, \ k \in \overline{0, \Delta N - 1},\tag{49}$$

where *w*(*m*) is the Hamming window.

2. The number of bark-frequency bands is calculated

$$P = \text{ceil}(B(f\_d/2)) + 1, \quad B(f) = 6\,\text{asinh}(f/600),\tag{50}$$

where ceil(*f*) is the function that rounds *f* up to the nearest integer, *fd* is the sampling frequency of the speech signal in Hz, and *B*(*f*) is the function that converts frequency in Hz to frequency in bark.

3. For the *n*-th frame the energy of the bark-frequency bands is calculated

$$\hat{E}\_{nm} = \sum\_{k=0}^{\Delta N/2 - 1} \left| \hat{\mathbf{S}}\_n(k) \right|^2 w\_m(k), \ m \in \overline{\mathbf{0}, P - 1}, \tag{51}$$

$$b\_m = m \frac{B\left(f\_d/2\right)}{P-1}, \ m \in \overline{0, P-1},\tag{52}$$

$$
\Delta b\_{mk} = B \left( k \frac{f\_d}{\Delta N} \right) - b\_m, \ m \in \overline{0, P-1}, \ k \in \overline{0, \Delta N-1}, \tag{53}
$$

$$w\_m(k) = \begin{cases} \mathbf{10}^{(\Delta b\_{mk} + 0.5)}, & \Delta b\_{mk} \le -0.5 \\ \mathbf{1}, & -0.5 < \Delta b\_{mk} < 0.5 \\ \mathbf{10}^{-2.5(\Delta b\_{mk} - 0.5)}, & \Delta b\_{mk} \ge 0.5 \end{cases} \tag{54}$$

where *Ênm* is the energy of the *m*-th bark-frequency band and *wm*(*k*) is the trapezoidal window for the *m*-th band.

4. For the *n*-th frame the equal-loudness correction is applied to the energy of the bark-frequency bands

$$
\bar{E}\_{nm} = \upsilon\left(B^{-1}(b\_m)\right)\hat{E}\_{nm}, \ m \in \overline{0, P-1}, \tag{55}
$$

$$B^{-1}(b) = \mathbf{600} \sinh\left(b/\mathbf{6}\right),\tag{56}$$


$$v(f) = \begin{cases} \frac{\left(f^2 + 56.8 \cdot 10^6\right)f^4}{\left(f^2 + 6.3 \cdot 10^6\right)^2 \left(f^2 + 0.38 \cdot 10^9\right)}, & f\_d < 5000\\ \frac{\left(f^2 + 56.8 \cdot 10^6\right)f^4}{\left(f^2 + 6.3 \cdot 10^6\right)^2 \left(f^2 + 0.38 \cdot 10^9\right) \left(f^6 + 9.58 \cdot 10^{26}\right)}, & f\_d \ge 5000 \end{cases},\tag{57}$$

where *v*(*f*) is the equal-loudness function (it approximates human auditory perception, since hearing sensitivity differs across frequencies) and *B*−1(*b*) is the function that converts frequency in bark to frequency in Hz.

5. For the *n*-th frame the intensity-loudness power law is applied to the energy of the bark-frequency bands

$$
\tilde{E}\_{nm} = \left(\bar{E}\_{nm}\right)^{0.33}, \ m \in \overline{0, P-1}.\tag{58}
$$

6. For the *n*-th frame the bark-frequency cepstral coefficients (BFCC) are calculated using the inverse discrete cosine transform DCT-2; beforehand the energies *Ẽn*0 and *Ẽn*,*P*−1 are replaced by the energies *Ẽn*1 and *Ẽn*,*P*−2 respectively

$$BFCC\_n(m) = \sqrt{\frac{2}{P}} \sum\_{k=0}^{P-2} \ln\left(\tilde{E}\_{n,k+1}\right) a(k) \cos\left(\frac{(2m+1)k\pi}{2(P-1)}\right), \ m \in \overline{0, \tilde{P}-1},\tag{59}$$

$$
\tilde{E}\_{n0} = \tilde{E}\_{n1}, \tilde{E}\_{n, P-1} = \tilde{E}\_{n, P-2}, \\
a(k) = \begin{cases}
\sqrt{\frac{1}{2}}, & k = 0 \\
1, & k > 0
\end{cases}, \tag{60}
$$

where *P̃* is the number of bark-frequency cepstral coefficients, 1 ≤ *P̃* ≤ *P*.
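A minimal Python sketch of Eqs. (48)-(60) for one frame is given below. The value n_ceps = 13 and the use of only the first branch of the equal-loudness function of Eq. (57) are simplifying assumptions for the example.

```python
import numpy as np

def bfcc(frame, fs, n_ceps=13):
    """Sketch of Eqs. (48)-(60): bark-band energies with trapezoidal windows,
    equal-loudness weighting, the intensity-loudness power law, and a DCT
    of the log band energies."""
    B = lambda f: 6 * np.arcsinh(f / 600)               # Hz -> bark, Eq. (50)
    Bi = lambda b: 600 * np.sinh(b / 6)                 # bark -> Hz, Eq. (56)
    dN = len(frame)
    w = 0.54 + 0.46 * np.cos(2 * np.pi * np.arange(dN) / dN)   # window of Eq. (48)
    S = np.abs(np.fft.fft(frame * w)) ** 2              # power spectrum, Eq. (49)
    P = int(np.ceil(B(fs / 2))) + 1                     # number of bark bands, Eq. (50)
    b = np.arange(P) * B(fs / 2) / (P - 1)              # band centres, Eq. (52)
    k = np.arange(dN // 2)
    db = B(k * fs / dN)[None, :] - b[:, None]           # Eq. (53)
    wm = np.where(db <= -0.5, 10.0 ** (db + 0.5),
                  np.where(db >= 0.5, 10.0 ** (-2.5 * (db - 0.5)), 1.0))  # Eq. (54)
    E = wm @ S[: dN // 2]                               # band energies, Eq. (51)
    f = Bi(b)
    # equal-loudness weighting (first branch of Eq. (57)) and power law, Eqs. (55), (58)
    v = (f ** 2 + 56.8e6) * f ** 4 / ((f ** 2 + 6.3e6) ** 2 * (f ** 2 + 0.38e9))
    E = (v * E) ** 0.33
    E[0], E[-1] = E[1], E[-2]                           # edge-band replacement, Eq. (60)
    a = np.where(np.arange(P - 1) == 0, np.sqrt(0.5), 1.0)
    return np.array([np.sqrt(2 / P) *
                     np.sum(np.log(E[1:]) * a *
                            np.cos((2 * m + 1) * np.arange(P - 1) * np.pi / (2 * (P - 1))))
                     for m in range(n_ceps)])           # BFCC, Eq. (59)
```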

#### **7. Method for calculating perceptual linear prediction parameters**

In this method perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), and perceptual log area ratio (PLAR) coefficients are used as features [9, 10].

1. The signal *s*(*m*) is split into L frames of length Δ*N*. For the *n*-th frame the spectrum is calculated using the DFT

$$
\hat{s}\_n(m) = s\_n(m)w(m), \\
w(m) = 0.54 + 0.46 \cos \frac{2\pi m}{\Delta N}, \tag{61}
$$

$$\hat{S}\_n(k) = \sum\_{m=0}^{\Delta N - 1} \hat{s}\_n(m) e^{-j(2\pi/\Delta N)km}, k \in \overline{0, \Delta N - 1},\tag{62}$$

where *w*(*m*) is the Hamming window.

2. The number of bark-frequency bands is calculated

$$P = \text{ceil}(B(f\_d/2)) + 1, \quad B(f) = 6\,\text{asinh}(f/600),\tag{63}$$

where ceil(*f*) is the function that rounds *f* up to the nearest integer, *fd* is the sampling frequency of the speech signal in Hz, and *B*(*f*) is the function that converts frequency in Hz to frequency in bark.

3. For the *n*-th frame the energy of the bark-frequency bands is calculated

$$\hat{E}\_{nm} = \sum\_{k=0}^{\Delta N/2 - 1} \left| \hat{\mathbf{S}}\_n(k) \right|^2 w\_m(k), \ m \in \overline{\mathbf{0}, P - 1}, \tag{64}$$

$$b\_m = m \frac{B\left(f\_d/2\right)}{P-1}, \ m \in \overline{0, P-1},\tag{65}$$

$$
\Delta b\_{mk} = B\left(k\frac{f\_d}{\Delta N}\right) - b\_m, \ m \in \overline{0, P-1}, \ k \in \overline{0, \Delta N - 1}, \tag{66}
$$

$$w\_m(k) = \begin{cases} 10^{(\Delta b\_{mk} + 0.5)}, & \Delta b\_{mk} \le -0.5 \\ 1, & -0.5 < \Delta b\_{mk} < 0.5 \\ 10^{-2.5(\Delta b\_{mk} - 0.5)}, & \Delta b\_{mk} \ge 0.5 \end{cases} \tag{67}$$

where *Ênm* is the energy of the *m*-th bark-frequency band and *wm*(*k*) is the trapezoidal window for the *m*-th band.

4. For the *n*-th frame the equal-loudness correction is applied to the energy of the bark-frequency bands

$$
\bar{E}\_{nm} = \upsilon\left(B^{-1}(b\_m)\right)\hat{E}\_{nm}, \ m \in \overline{0, P-1}, \tag{68}
$$

$$B^{-1}(b) = \mathbf{600} \sinh\left(b/\mathbf{6}\right),\tag{69}$$

$$v(f) = \begin{cases} \frac{(f^2 + 56.8 \cdot 10^6)f^4}{\left(f^2 + 6.3 \cdot 10^6\right)^2 \left(f^2 + 0.38 \cdot 10^9\right)}, & f\_d < 5000\\ \frac{(f^2 + 56.8 \cdot 10^6)f^4}{\left(f^2 + 6.3 \cdot 10^6\right)^2 \left(f^2 + 0.38 \cdot 10^9\right) \left(f^6 + 9.58 \cdot 10^{26}\right)}, & f\_d \ge 5000 \end{cases},\tag{70}$$

where *v*(*f*) is the equal-loudness function (it approximates human auditory perception, since hearing sensitivity differs across frequencies) and *B*−1(*b*) is the function that converts frequency in bark to frequency in Hz.

5. For the *n*-th frame the intensity-loudness power law is applied to the energy of the bark-frequency bands


$$
\tilde{E}\_{nm} = \left(\bar{E}\_{nm}\right)^{0.33}, \ m \in \overline{0, P-1}.\tag{71}
$$


6. For the *n*-th frame the values of the autocorrelation function are calculated using the inverse DFT

$$R\_n(k) = \operatorname{Re}\left(\frac{1}{2P - 2} \sum\_{m=0}^{2P - 3} \tilde{E}\_{nm} e^{j(2\pi/(2P-2))km}\right), \ k \in \overline{0, p},\tag{72}$$

$$
\tilde{E}\_{n0} = \tilde{E}\_{n1}, \tilde{E}\_{n, P-1} = \tilde{E}\_{n, P-2}, \tilde{E}\_{n, 2P-2-m} = \tilde{E}\_{nm}, \ m \in \overline{1, P-2}, \tag{73}
$$

where *p* is the linear prediction order, ceil(*fd*/1000) ≤ *p* ≤ 5 + ceil(*fd*/1000), and ceil(*f*) is the function that rounds *f* up to the nearest integer.


$$\mathbf{G}\_{\mathfrak{n}} = \sqrt{E\_{\mathfrak{n}}} = \sqrt{R\_{\mathfrak{n}}(\mathbf{0}) - \sum\_{k=1}^{p} a\_{nk} R\_{\mathfrak{n}}(k)}. \tag{74}$$

9. For the *n*-th frame the perceptual linear prediction cepstral coefficients (PLPCC) are calculated

$$PLPCC\_n(m) = \begin{cases} \ln G\_n, & m = 0\\ a\_{nm}, & m = 1\\ a\_{nm} - \sum\_{k=1}^{m-1} (k/m) PLPCC\_n(k) a\_{n, m-k}, & 2 \le m \le p \end{cases}, \quad m \in \overline{0, p}. \tag{75}$$

10. For the *n*-th frame the perceptual log area ratio (PLAR) coefficients are calculated

$$PLAR\_{nm} = \ln\left(\frac{\mathbf{1} - k\_{nm}}{\mathbf{1} + k\_{nm}}\right), \ m \in \overline{\mathbf{1}, p}.\tag{76}$$
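The bark-band processing in steps 1-5 mirrors the Section 6 sketch, and the linear prediction, gain, PLPCC, and PLAR steps mirror the Section 3 sketch; the short Python sketch below only illustrates the PLP-specific step of Eqs. (72)-(73), which turns the power-law band energies into autocorrelation values via an inverse DFT over an even extension. The function name is illustrative.

```python
import numpy as np

def plp_autocorrelation(E_tilde, p):
    """Autocorrelation values from the power-law bark-band energies,
    following Eqs. (72)-(73).  E_tilde is the length-P vector of band
    energies after Eq. (71) (e.g. from the Section 6 sketch)."""
    E = np.asarray(E_tilde, dtype=float).copy()
    E[0], E[-1] = E[1], E[-2]                    # edge-band replacement, Eq. (73)
    E_sym = np.concatenate([E, E[-2:0:-1]])      # even extension of length 2P-2
    R = np.real(np.fft.ifft(E_sym))              # the 1/(2P-2) factor is part of ifft
    return R[: p + 1]                            # R_n(k), k = 0..p
```

The remaining steps (the Levinson-Durbin recursion, the gain of Eq. (74), the PLPCC of Eq. (75), and the PLAR of Eq. (76)) have the same form as in the Section 3 sketch, applied to these autocorrelation values.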

#### **8. Method for calculating reconsidered perceptual linear prediction parameters**

In this method reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients are used as features [7, 8].

1. The signal *s*(*m*) is split into L frames of length Δ*N*. For the *n*-th frame, the spectrum, which has a steep descent in the high-frequency region, is balanced by a first-order filter (pre-emphasis)

$$
\bar{s}\_n(m) = s\_n(m+1) - \alpha s\_n(m), \ m \in \overline{0, \Delta N-1}, \tag{77}
$$

where *α* is the filter parameter, 0 < *α* < 1.

2. For the *n*-th frame the spectrum is calculated using the DFT

$$
\hat{s}\_n(m) = \bar{s}\_n(m)w(m), \\
w(m) = 0.54 + 0.46 \cos \frac{2\pi m}{\Delta N}, \tag{78}
$$

$$\hat{S}\_n(k) = \sum\_{m=0}^{\Delta N - 1} \hat{s}\_n(m) e^{-j(2\pi/\Delta N)km}, \ k \in \overline{0, \Delta N - 1},\tag{79}$$

where *w*(*m*) is the Hamming window.

3. For the *n*-th frame the energy of the *m*-th mel-frequency band is calculated, using the frequency transformation and the Bartlett window

$$\hat{E}\_{nm} = \sum\_{k=0}^{\Delta N/2-1} \left| \hat{\mathbf{S}}\_n(k) \right|^2 w\_m(k), \ m \in \overline{1, P}, \tag{80}$$

$$w\_m(k) = \begin{cases} 0, & k < f\_{m-1} \lor k > f\_{m+1} \\ \frac{k - f\_{m-1}}{f\_m - f\_{m-1}}, & f\_{m-1} \le k \le f\_m \\ \frac{f\_{m+1} - k}{f\_{m+1} - f\_m}, & f\_m \le k \le f\_{m+1} \end{cases},\tag{81}$$

$$f\_m = \frac{N}{f\_d} B^{-1} \left( B \left( f^{\text{min}} \right) + m \frac{B(f^{\text{max}}) - B \left( f^{\text{min}} \right)}{P + 1} \right), \ m \in \overline{0, P + 1}, \tag{82}$$

$$B(f) = 1125 \ln\left(1 + f/700\right), \quad B^{-1}(b) = 700(\exp\left(b/1125\right) - 1),\tag{83}$$

where *Ênm* is the energy of the *m*-th mel-frequency band, *wm*(*k*) is the Bartlett window for the *m*-th band, *B*(*f*) is the function that converts frequency in Hz to frequency in mel, *B*−1(*b*) is the function that converts frequency in mel to frequency in Hz, *fm* is the normalized frequency, *f*min and *f*max are the minimum and maximum frequencies in Hz (for example, *f*min = 0, *f*max = *fd*/2), *fd* is the sampling frequency of the speech signal in Hz, and *P* is the number of mel-frequency bands.

4. For the *n*-th frame the values of the autocorrelation function are calculated using the inverse DFT

$$R\_n(k) = \operatorname{Re}\left(\frac{1}{2P - 2} \sum\_{m=0}^{2P-3} \hat{E}\_{n, m+1} e^{j(2\pi/(2P-2))km}\right), \ k \in \overline{0, p},\tag{84}$$


$$
\hat{E}\_{n,2P-m} = \hat{E}\_{nm}, \ m \in \overline{2, P-1}, \tag{85}
$$

where *p* is the linear prediction order, ceil(*fd*/1000) ≤ *p* ≤ 5 + ceil(*fd*/1000), and ceil(*f*) is the function that rounds *f* up to the nearest integer.


$$\mathbf{G}\_{n} = \sqrt{E\_{n}} = \sqrt{R\_{n}(\mathbf{0}) - \sum\_{k=1}^{p} a\_{nk} R\_{n}(k)}.\tag{86}$$

7. For the *n*-th frame the reconsidered perceptual linear prediction cepstral coefficients (RPLPCC) are calculated

$$RPLPCC\_n(m) = \begin{cases} \ln G\_n, & m = 0\\ a\_{nm}, & m = 1\\ a\_{nm} - \sum\_{k=1}^{m-1} (k/m) RPLPCC\_n(k) a\_{n, m-k}, & 2 \le m \le p \end{cases}, \quad m \in \overline{0, p}. \tag{87}$$

8. For the *n*-th frame the reconsidered perceptual log area ratio (RPLAR) coefficients are calculated

$$RPLAR\_{nm} = \ln\left(\frac{1 - k\_{nm}}{1 + k\_{nm}}\right), \ m \in \overline{1, p}.\tag{88}$$

#### **9. The performance comparison of various features for person identification**

For the speech signals containing voiced sounds, a sampling frequency of 8 kHz and 256 quantization levels were used. The sample length of a voiced speech sound is 256.

Numerical results for the LPC, RC, LPCC, LAR, MFCC, BFCC, PLPC, PRC, PLPCC, PLAR, RPLPC, RPRC, RPLPCC, and RPLAR coefficients obtained by the coding methods and used for biometric identification of speakers from the TIMIT database on voiced sounds by means of Gaussian mixture models (GMM) are presented in **Table 1**.

For the coding methods used to analyze the speech signal, the filter order is 12 for linear prediction, 4 for perceptual linear prediction, and 12 for reconsidered perceptual linear prediction; the number of mel-frequency bands is 20, the number of bark-frequency bands is 17, and the number of subband-based cepstral parameters is 13.

The results presented in **Table 1** show that the highest identification probability with the smallest number of coefficients is obtained by coding voiced speech sounds with PRC features.


**Table 1.** *Numerical research results of the coefficients used for personality biometric identification.*

#### **10. Conclusion**

The preliminary stage of biometric identification is speech signal structuring and feature extraction.

For calculating the fundamental frequency, the following digital signal processing methods were considered and numerically investigated: the ACF (autocorrelation function) method, the AMDF (average magnitude difference function) method, the SIFT (simplified inverse filter transformation) method, the wavelet-analysis-based method, the cepstral-analysis-based method, and the HPS (harmonic product spectrum) method. For extracting speech signal features, the following digital signal processing methods were considered and numerically investigated: the digital bandpass filter bank; spectral analysis (Fourier transform, wavelet transform); homomorphic processing; and linear predictive coding. These methods make it possible to extract linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), perceptual log area ratio (PLAR) coefficients, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients. Results of a numerical study of the speech signal feature extraction methods were obtained for voice signals of speakers from the TIMIT (Texas Instruments and Massachusetts Institute of Technology) database. The PRC features proved to be the most effective.


#### **Author details**

Eugene Fedorov<sup>1</sup> \*, Tetyana Utkina<sup>1</sup> and Tetiana Neskorodieva<sup>2</sup>

1 Cherkasy State Technological University, Cherkasy, Ukraine

2 Vasyl' Stus Donetsk National University, Vinnytsia, Ukraine

\*Address all correspondence to: fedorovee75@ukr.net

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **References**

[1] Oppenheim AV, Schafer RW. Discrete-Time Signal Processing. Upper Saddle River, NJ: Prentice Hall; 2010. p. 1108

[2] Mallat S. A Wavelet Tour of Signal Processing: Sparse Way. Bourlington, MA: Academic Press; 2008. p. 832. DOI: 10.1016/B978-0-12-374370-1.X0001-8

[3] Rabiner LR, Schafer RW. Theory and Applications of Digital Speech Processing. Upper Saddle River, NJ: Pearson Higher Education; 2011. p. 1042

[4] Markel JD, Gray AH. Linear Prediction of Speech. Berlin: Springer Verlag; 1976. p. 382

[5] Davis SB, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustic, Speech and Signal Processing. 1980;**28**(4):357-366

[6] Ganchev T, Fakotakis N, Kokkinakis G. Comparative evaluation of various MFCC implementations on the speaker verification task. In: Proceedings of SPECOM 2005. Vol. 1. Patras, Greece; 2005. pp. 191-194

[7] Josef R, Pollak P. Modified feature extraction methods in robust speech recognition. In: Proceedings of the 17th IEEE International Conference Radioelektronika. Brno, Czech Republic: IEEE; 2007. pp. 1-4

[8] Kumar P, Biswas A, Mishra AN, Chandra M. Spoken language identification using hybrid feature extraction methods. Journal of Telecommunications. 2010;**1**(2):11-15

[9] Huang X, Acero A, Hon H-W. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, NJ: Prentice Hall; 2001. p. 980

[10] Hermansky H. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America. 1990;**87**(4):1738-1752. DOI: 10.1121/1.399423

#### **Chapter 4**

## Generalized Spectral-Temporal Features for Representing Speech Information

*Stephen A. Zahorian, Xiaoyu Liu and Roozbeh Sadeghian*

#### **Abstract**

Based on extensive prior studies of speech science focused on the spectral-temporal properties of human speech perception, as well as a wide range of spectral-temporal speech features already in use, and motivated by the time-frequency resolution properties of human hearing, this chapter proposes and evaluates one general class of spectral-temporal features. These features, intended primarily for use in Automatic Speech Recognition (ASR) front ends, allow different realizations of general time-frequency concepts to be easily implemented and tuned through a set of frequency-warping and time-warping functions. The methods presented are flexible enough to allow evaluation of the relative importance of the spectral and temporal features and to explore the trade-off between time and frequency resolution. Extensive ASR experiments were conducted to evaluate various spectral-temporal properties using this unified framework.

**Keywords:** time-frequency, features, automatic speech recognition, basis vectors, front end

#### **1. Introduction**

As mentioned elsewhere [1], good features for automatic speech recognition include relevance, compactness, completeness, and robustness. That is, speech features should be closely related to speech production and understanding, should be small in number, represent as much speech information as possible, and should be little changed in the presence of noise or varying external conditions.

As these elements suggest, both productive and receptive aspects of speech science form the foundation for signal processing to extract speech features. Although receptive aspects of speech science are most directly relevant to speech features for ASR, speech production models for vocal tract configurations are also a plausible starting point for guiding speech feature extraction. In terms of speech production, ever since the classic Peterson and Barney vowel study [2], by far the most widely used acoustic features for characterizing vocal tract shape are formants. For speech signal processing applications, formant information is generally obtained by first modeling the vocal tract using an all-pole system, such as in the Perceptual Linear Predictive (PLP) front end [3]. The motivating idea is that nearly any transfer function can be approximated by a high-order all-pole model. Due to lack of automatic methods to reliably estimate formants [4], and also because formants cannot discriminate between speech sounds for which the main differences are unrelated to formants (such as fricatives) [5, 6], formants are rarely used as features for ASR. For ASR the all-pole approximation to the vocal tract is more typically replaced with cepstral features [7], which encode the global spectral envelope shape without any emphasis given to spectral peaks.

There are many complex issues raised in the speech science literature about receptive aspects of human speech that could be potentially taken into account for extracting speech features for use in ASR. However, the only effects taken into account for the features presented in this chapter are the primary considerations of frequency and temporal resolution.

Auditory processing research related to the cochlea's frequency selectivity provides the fundamental theory for auditory filterbanks, which are often used as a signal processing step to compute features for ASR. Many canonical studies, such as [8–10], have pointed out that humans discern low frequency components in a complex sound with much higher resolution than is the case for high frequencies. Hence, in speech front ends, to mimic this property, the physical frequency range is mapped to a perceptual scale, typically using bandpass filtering with 25–60 overlapping bands, each corresponding to approximately equal length regions along the cochlear membrane. The bandwidths are designed to match the frequency resolution at each center frequency. Various perceptual scales have been developed, such as the Mel scale [9], Bark scale [10, 11], and Equivalent Rectangular Bandwidth (ERB) scale [12].

Commonly used filterbanks include triangular filters [13] based on the Mel scale, trapezoidal filters [3] based on the Bark scale, and gammatone filters [14, 15] based on the ERB scale. The output power of each filterbank channel is computed as a weighted sum of the magnitude-squared Short Time Fourier Transform (STFT), weighted by the channel frequency response, and then amplitude scaled to approximate perceptual loudness, which is linearly proportional to the neuron firing rate of the auditory nerves [16]. The amplitude-scaled outputs are usually combined with a cosine transform to form cepstral features such as the widely used MFCC features [13]. Another front end for computing speech features is PLP [3]. In PLP an equal-loudness compensation is also modeled to account for the non-equal amplitude sensitivity of human hearing at different frequencies [17]. Motivated by the importance of formants, linear prediction coefficients are computed from the Bark domain spectrum using Durbin's recursive method [18] and then converted to cepstral features.

**Figure 1.** *Comparison of the MFCC (a) and PLP (b) structure.*

**Figure 1** depicts static feature extraction for the MFCC and PLP front ends. Note that the expression static features refers to features computed from a single very short segment of speech (on the order of 20 ms duration), called a frame. These features are computed for each frame, with frames typically spaced approximately 10 ms apart and thus also overlapped by about 10 ms. This gap between adjacent frames is the frame spacing. Static features based on perceptual frequency scales do not explicitly encode spectral trajectories over time. In [19–22] approximations of time "derivatives" of the static features are computed and appended to the static features to reduce ASR error rate considerably (empirically on the order of 20%). These time derivatives are called dynamic features and are also often referred to as delta and acceleration (second order differential) terms. Mathematically, the delta terms are computed as:

$$\Delta\_{\mathbf{t}} = \frac{\sum\_{\theta=1}^{\Theta} \theta(\mathbf{c\_{t+\theta}} - \mathbf{c\_{t-\theta}})}{2\sum\_{\theta=1}^{\Theta} \theta^2} \tag{1}$$

where *Δt* is the differential at time t, estimated from a small group of adjacent static features (cepstra) *ct*−*Θ* to *ct*+*Θ*, with 2*Θ* + 1 being the total number of surrounding frames. In the remainder of this chapter, groups of frames used to compute dynamic features from static features are referred to as blocks. More detailed discussion of frames and blocks, specifically related to the spectral-temporal features presented in this chapter, is given in Section 2.
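As a concrete illustration of Eq. (1), the short Python sketch below computes delta features from a matrix of static features; the edge-replication padding and the default Θ = 2 (a 5-frame window) are common choices assumed here, not values specified in the chapter.

```python
import numpy as np

def delta(static, theta=2):
    """Delta (first-order dynamic) features as in Eq. (1).
    `static` is a (frames x coefficients) array of static features."""
    T, _ = static.shape
    # replicate the first and last frames so every frame has 2*theta neighbours
    padded = np.vstack([static[:1]] * theta + [static] + [static[-1:]] * theta)
    denom = 2 * sum(th ** 2 for th in range(1, theta + 1))
    out = np.zeros_like(static)
    for th in range(1, theta + 1):
        out += th * (padded[theta + th: theta + th + T] - padded[theta - th: theta - th + T])
    return out / denom
```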

Note that although the time derivatives are estimated from a short block of features, they essentially characterize the spectral trajectory at each single time instant, and thus are unable to account for the non-uniform time resolution of the human auditory system observed over a long duration of time. Spectral-temporal modulation features are much more effective than the delta method in addressing the issue of non-uniform time-frequency resolution and efficiently sampling the short-time spectrum. In 1994 Drullman et al. [23] found that the most important spectral trajectory information over time for speech perception is in the range of 1–16 Hz "modulation" frequencies. Guided by this finding, in order to exploit the information in the modulation frequencies, relatively long time blocks of each spectral band are analyzed. Over many years, various modulation features have been investigated.

Athineos et al. [24] used the dual of time-domain linear prediction to model, in the frequency domain, the poles of the temporal envelope in each sub-band. Valente and Hermansky [25] developed an approach combining independent classifier outputs and modulation frequency channels. Gabor-filter-based approaches for extracting localized directional features also show promise [26, 27]. However, the large number of parameters, which allow Gabor filters to be aligned in many different directions, presents the added difficulty of determining these directions in an effective way for use in ASR.

Based on this prior extensive groundwork, this chapter presents a generalized spectral-temporal feature extraction front end for representing speech information. This feature set encompasses a wide range of time-frequency representation options focusing on two important properties of human hearing: frequency and time resolution. Rather than presenting one specific type of front end, a unified framework is presented such that various realizations of the general time-frequency concepts can easily be implemented and tuned. Based on a set of frequency-warping and time-warping functions, this front end is flexible enough to allow straightforward evaluation of the trade-off between frequency and time resolution at the acoustic feature level.

#### **2. Method**

The spectral-temporal features presented in this chapter are weighted sums of short-time spectral magnitudes, using overlapping frame-based processing. **Figure 2** illustrates the division of the short-time spectrum. The horizontal and vertical axes represent physical time (in seconds) and physical frequency (in Hz). A time-frequency representation (TFR) of the speech, denoted by *X(t,f)*, is obtained by computing the magnitude-squared STFT of each frame. In **Figure 2**, the dots in each column represent the power spectrum of a frame, and the gap between adjacent columns denotes the frame spacing. Note that unlike the MFCC or PLP front ends, for which each feature vector is the concatenation of the spectral (static) feature and the spectral trajectory (dynamic feature) components, and the spectral trajectory is characterized by the time derivatives of the static terms at each sample instant on a frame-by-frame basis, in the method presented in this chapter, the front end computes a set of spectral-temporal features for a long block of spectral values centered at each sample instant, and one feature vector is extracted for each block. As will be seen in the derivations, this spectral-temporal feature vector for each block integrates both the spectral and temporal aspects of the speech signal within the block by a weighted sum of *X(t,f)* based on a set of two-dimensional spectral-temporal basis vectors. Thus, in the proposed front end, there are no individual static components in the final features since they are fused in the output features. Also, because long segments of short, highly overlapped frames are used to compute features, non-uniform time resolution can be incorporated in the spectral trajectories.

Two basic concepts are also illustrated in **Figure 2**, which are used and referred to in the remainder of this chapter–block length and block spacing. Block length is defined as the time duration (physical time) of a block of short-time frames. Block length is measured in milliseconds and is equal to the frame spacing multiplied by the number of frames in the block. The spacing between two adjacent blocks is defined as block spacing, which is the product of the frame spacing and the number of frames that separate the two blocks. Since features are extracted on a block basis, the block spacing is also the feature spacing. At the beginning and ending of each speech utterance, zero padding is used to allow the first and last blocks to be centered at the first and last frames respectively. As opposed to MFCC or PLP processing, in which the feature spacing is identical to the frame spacing, in our work the feature spacing is typically considerably larger than the frame spacing. With these high level concepts, a detailed illustration of the feature extraction process is presented in the remainder of this section.

The time-frequency plane obtained by STFT has uniform frequency and time resolution determined by the analysis window shape and width [28]. This representation does not take into account the non-uniform perceptual frequency scale of the peripheral auditory system. For convenience and clarity of explanation, a framework is established with *t′* and *f′* as normalized perceptual time and frequency scales, whose desirable properties are next described in detail. Then a set of features, *Feat(i,j)* for the time block centered at time instant t, can be expressed as:

$$Feat(i,j) = \int\_{t'=-\frac{1}{2}}^{\frac{1}{2}} \int\_{f'=0}^1 a\left(X'(t',f')\right) \cdot BV\_{i,j}(t',f')\,df'dt'.\tag{2}$$

In Eq. (2) the feature computation is performed using perceptual scales, where *X′*(*t′*, *f′*) is the power spectrum of a time-frequency block in this domain, for which the frequency *f′* is mapped to the range {0, 1} by subtracting an offset and dividing by a scaling factor. Similarly, perceptual time *t′* is converted to the range {−1/2, 1/2} with *t′* = 0 the center of the time block. The function *a*(·) nonlinearly maps the power spectrum to a perceptual-loudness scale, most often using a logarithmic scaling or a power-law nonlinearity [29]. Finally, the amplitude-scaled power spectrum is weighted by a set of two-dimensional basis vectors *BVi,j* in the perceptual domain (*t′*, *f′*). The number of features extracted from a time-frequency block depends on the number of basis vectors used.

It should be emphasized that, for clarity of explanation, integrals as well as continuous time and frequency variables are used in Eq. (2) and in all of the following equations. In actual implementations, both time and frequency variables are discrete, as shown in **Figure 2**, and integrations are computed as sums. Also, although the feature extraction is effectively performed in the perceptual time-frequency domain (*t′*, *f′*), the actual computations use the linear time-frequency plane. The mapping between linear and perceptual domains for time and frequency is established by nonlinear time-warping and frequency-warping functions and incorporated by changes in the underlying basis vectors as explained below.

In this work, a set of two-dimensional cosine basis vectors *BVi,j*(*t′*, *f′*) is used to compactly encode the spectral envelope as well as the spectral trajectory. The theoretical work of Rao and Yip [30] gives reasons why the cosine transform is particularly appropriate for data compression and feature de-correlation, based on similarity to the data-driven Karhunen-Loeve Transform. For similar reasons, the MFCC features also use a one-dimensional cosine transform as a processing step. The popular JPEG standard for image compression also uses two-dimensional cosine transforms.

Continuing with the specifics of the method presented in this chapter, the 2-D cosine basis vectors operating in the perceptual space are defined as:

$$BV\_{i,j}(t',f') = \cos\left(\pi i f'\right) \cdot \cos\left(\pi j t'\right),\tag{3}$$

$$0 \le i \le N-1, 0 \le j \le M-1.$$

Eq. (3) shows that each 2-D basis vector is the product of two individual basis vectors, one over frequency *f′* and one over time *t′*. The numbers of basis vectors over frequency and time are specified by *N* and *M* respectively. The total number of features for each block is given by *N* x *M*. As is discussed in detail in Section 3, a larger N or M provides a more detailed representation of the spectral envelope over frequency or the spectral trajectory over time respectively. Empirical data indicates a total of 75 features for each block (*N = 15, M = 5*) results in high ASR accuracy. Eqs. (4) through (9), and associated figures, show that the nonlinear mapping from *f* to *f′* and *t* to *t′*, together with their differentials *df′* and *dt′*, approximate the frequency and time resolution of human hearing. Next is shown how the nonlinear mappings are mathematically incorporated into the feature calculations. Frequency warping specifies the relation between perceptual frequency *f′* and physical frequency *f*:

$$f' = g(f), \ 0 \le f \le 1 \tag{4}$$

The physical frequency range has also been normalized to {0, 1}<sup>1</sup>. Thus, the *df′* term in Eq. (2) is equivalent to:

$$df' = \frac{d\mathbf{g}}{df} df \tag{5}$$

<sup>1</sup> For convenience, the normalized frequency range {0,1} of *f* corresponds to the physical range {*0*, *Fs/2*} where *Fs/2* is the Nyquist frequency. The normalized perceptual frequency *f*' over {0,1} also represents the range of 0 to Fs/2. With minor changes, this normalized range can be reduced to a shorter frequency range of physical frequencies.


As per the discussion in Section 1, one reasonable choice for the form of the frequency warping *g*(*f*) is a Mel-shape warping defined as:

$$g(f) = C \cdot \log\_{10} \left( 1 + \frac{f}{k} \right) \tag{6}$$

where *k* is an adjustable warping factor between 0 and 1 that controls the degree of the warping, and the constant *C* is chosen to ensure that *f* = 1 is mapped to *f′* = 1. If *k = 0.0875* and *C = 0.9137*, for the frequency range of 0 to 8000 Hz, this warping is the normalized version of the most widely used "standard" Mel warping proposed by O'Shaughnessy [31]. Another option, using Smith and Abel's work [32], is to use a bilinear warping with warping factor α (Eq. (7)) to approximate the Bark scale.

The warping factor α ranges from 0 to 1. In **Figure 3**, five bilinear warpings for various α values are shown. Additionally, Mel warping using O'Shaughnessy's equation in [31] and Bark warping as per Wang et al. [33] are plotted in the figure. The figure clearly shows that bilinear warping can be adjusted to closely approximate both Mel and Bark warping. From Eq. (5), the frequency resolution is continuously varied to match auditory properties, rather than using a quantized version with a filterbank, such as in the MFCC, PLP or gammatone front ends [3, 13, 14]. In the filterbank methods, perceptually indistinguishable frequency components are modeled by the filter bandwidths. Thus, a filterbank is effectively a quantizer which separates the perceptual frequency scale into a finite number of equal intervals. In the proposed approach, the perceptual scale is continuous. The frequency selectivity is modeled by the derivative term *dg(f)/df*.
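A tiny Python sketch of the Mel-shape warping of Eq. (6) and its derivative dg/df, which carries the frequency-resolution information of Eq. (5), is given below; the function name is illustrative and the defaults are the k and C values quoted above for a 0-8000 Hz range.

```python
import numpy as np

def mel_shape_warp(f, k=0.0875, C=0.9137):
    """Mel-shape frequency warping g(f) of Eq. (6) and its derivative dg/df,
    for normalized frequency f in [0, 1]."""
    g = C * np.log10(1 + f / k)
    dg = C / (np.log(10) * (k + f))
    return g, dg
```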

Next, the relation between perceptual time *t′* and linear time *t* is modeled with a nonlinear (warping) function *h*, with *t* normalized to the range {−1/2, 1/2}:

$$t' = h(t, f); \ -\frac{1}{2} \le t \le \frac{1}{2}, 0 \le f \le 1. \tag{8}$$

**Figure 3.** *Bilinear warping with different warping factors: Mel and Bark warping shown for comparison.*

Time *t′* can be considered a perceptual time scale that defines a "pseudo" time instant at which an acoustic event occurring at physical time *t* is perceived by the auditory system. Mathematically, perceptual time is given in terms of its derivative with respect to *t*:

$$dt' = \frac{dh(t, f)}{dt}\, dt \tag{9}$$

This time resolution term indicates how far apart two events are perceived when separated by unit time on the physical scale. A large derivative implies that two acoustic events are clearly perceptually distinguishable whereas a small value corresponds to a time boundary between events that is not well resolved. When characterizing the temporal trajectory of acoustic events, it's reasonable to assume that perceptual time resolution should be higher near the center of the event than at far away times. That is, to identify the content of a segment with the help of its left and right segments, it is plausible that close segments are more relevant than far-away segments. Hence, temporal changes of the spectrum envelope should be more clearly resolved at the center of an event than far-away less helpful parts. Therefore, the shape for *dh/dt* was chosen to be approximately Gaussian. More specifically, *dh/dt* is a Kaiser window, with one parameter, *β*, the time-warping factor, that conveniently controls the "sharpness" of time warping.

Note that in Eqs. (8), (9) the sharpness of the time resolution term *dh/dt* could be frequency dependent as well. Specifically, the term *dh/dt* can be made more "peaky" at high frequencies than at low frequencies, controlled by different warping factor values in the Kaiser window<sup>2</sup>, as illustrated in **Figure 4**. This allows an exploration of the trade-off between auditory frequency and time resolution. Psychoacoustic masking experiments [34] show that the very narrow auditory filter bandwidths at low frequencies produce high frequency resolution, but also prolong the "ring" time at the onset and offset transients for short signals, and thus degrade the time resolution of the excitation patterns. This trade-off is also shown in [35] by neurophysiological experiments and in [36] by gap-in-noise detection experiments, which provide evidence that human subjects are able to detect shorter gaps in a narrow band of noise when the noise bands are centered at higher frequencies. Despite this property of human hearing (high time resolution for high frequencies), it is not clear whether this effect can be exploited for improving ASR. Our work provides one way to investigate this effect in features used for ASR.

**Figure 4.** *Time resolution term dh/dt for low and high frequencies using a Kaiser window: The time resolution is nonuniform over both time and frequency.*

<sup>2</sup> Note that although Eqs. (8), (9) (and thereafter) explicitly show the frequency dependency in *h*(*t*,*f*), in our implementation of *h*(*t*,*f*) and its derivative, *f* is treated as a constant, and only *t* is the variable.

Although the principles and forms for frequency and time warping have been presented, the magnitude of the power spectrum on the perceptual scale is the same as for the physical domain. To better represent perceptual magnitudes, the power spectrum should also be nonlinearly scaled. This nonlinear scaling is represented by the function *a*, typically a logarithmic function or a power function with a low exponent such as 1/15. Eq. (2) can be rewritten in terms of *t* and *f* by substituting in Eqs. (3), (4), (5), (8), (9):

$$Feat(i,j) = \int\_{t=-\frac{1}{2}}^{\frac{1}{2}} \int\_{f=0}^{1} a(X(t,f)) \cdot \cos\left(\pi i g(f)\right) \frac{dg(f)}{df} \cdot \cos\left(\pi j h(t,f)\right) \frac{dh(t,f)}{dt}\, df\, dt \tag{10}$$

Eq. (10) can be written using modified basis vectors over frequency *f* as:

$$\varphi\_i(f) = \cos\left(\pi i g(f)\right) \frac{dg(f)}{df},\tag{11}$$

$$0 \le i \le N - 1.$$

and modified frequency-dependent basis vectors over time *t* as:

$$\psi\_j(t, f) = \cos\left(\pi j h(t, f)\right) \frac{dh(t, f)}{dt},\tag{12}$$

$$\mathbf{0} \le j \le M - \mathbf{1}.$$

Using the basis vectors from Eqs. (11), (12), Eq. (10) can be expressed as:

$$Feat(i,j) = \int\_{t=-1/2}^{1/2} \int\_{f=0}^{1} a(X(t,f)) \cdot \phi\_{i,j}(t,f)\, df\, dt. \tag{13}$$

where the two-dimensional basis vectors *ϕi,j*(*t*, *f*) are the product of the basis vectors given in Eqs. (11) and (12).

**Figure 5.** *Two-dimensional basis vector ϕ1,1(t, f) with bilinear frequency warping g(f) and a Kaiser window for dh(t,f)/dt. α is the frequency-warping coefficient as in Eq. (7), and βlow, βhigh are the time-warping factors for low and high frequencies, respectively.*

In **Figure 5**, the two-dimensional basis vector *ϕ*1,1(*t*, *f*) is plotted with bilinear frequency warping *g(f)* and a Kaiser window for the *dh(t,f)/dt* term. Panels (a) and (b) are based on the same time-warping factor *β* = 5 for all frequencies, and only the frequency-warping factor *α* is varied. Compared with the linear frequency scale (*α* = 0) in panel (a), the basis vector becomes more sharply peaked at low frequencies in panel (b) as higher frequency resolution is incorporated through a larger warping factor *α* = 0.45. Panel (c) uses increasing time warping as frequency increases. The Kaiser window *β* value is linearly interpolated between *βlow* and *βhigh*. The higher time resolution for high frequencies makes the basis vectors more concentrated near the center of the block.

Another option for the cosines used as the starting point for the two-dimensional basis vectors is to use a Gabor filterbank. As described in the work of [26, 27, 37], Gabor filtering is performed as a two-dimensional correlation between the Gabor filterbank and the perceptual time-frequency plane (*t′*, *f′*). Each Gabor filter is defined using the product of a two-dimensional Gaussian envelope and a complex exponential function over a localized region in the time-frequency plane. Directionality is the most apparent difference between the Gabor filter approach and the cosine expansion used in this chapter. Gabor filters can be adjusted toward any direction whereas the cosine transform only represents modulation of the spectrum along the vertical and horizontal axes. The deeper reason for this difference is that the Gabor approach and the method presented in this chapter are motivated by different considerations. The power spectrum directionality property of Gabor features stems from the response of neurons to combinations of spectral-temporal modulation frequencies in the spectral-temporal receptive field [38]. In contrast, the proposed framework is intended to model the trade-off between time and frequency resolution of the peripheral auditory system. However, it is possible to modify the proposed front end to incorporate the directionality of spectral-temporal patterns in a way similar to the Gabor filterbank. In prior work [39], this was achieved by rotating the 2-D cosine basis vectors by various angles.

#### **3. Implementation**

The 2-D integral in Eq. (10) can be implemented in a variety of ways, as discussed below. As mentioned previously, integrations are computed using sums and vector inner products between basis vectors and the sampled time-frequency plane.

#### **3.1. DCTC/DCSC method**

The first version of the implementation is based on frequency-independent time warping; i.e. the time warping *h(t,f)* is simplified to *h(t)* for all frequencies. In this case, integrating in any order (first over *f* and then over *t* or the reverse) is equivalent. Conventionally, frequency integration is performed first, which generates a set of intermediate static features<sup>3</sup> called Discrete Cosine Transform Coefficients (DCTCs):

<sup>3</sup> Note that the term "static features" refers only to the outputs of the DCTC step. As mentioned in the beginning of Section II, the final outputs are the spectral-temporal features, which are computed by another integration over the time sequence of these "static" features.


$$\text{DCTC}(i) = \int\_{f=0}^{1} a(X(t,f)) \cdot \varphi\_i(f)\, df,\tag{14}$$

where *φi*(*f*) is the *i*th static basis vector as defined in Eq. (11). Then the trajectories of these DCTCs are encoded by integrations over time, yielding a set of features referred to as Discrete Cosine Series Coefficients (DCSCs):

$$\text{DCSC}(i, j) = \int\_{t=-\frac{1}{2}}^{\frac{1}{2}} \text{DCTC}(i) \cdot \psi\_j(t)\, dt,\tag{15}$$

where *ψj*(*t*) is the *j*th basis vector over time, as defined in Eq. (12), but without dependence on *f*. These DCSC 2-D features, arranged as a 1-D feature vector, are the input to a recognizer. This implementation is depicted in **Figure 6(a)**. **Figure 7** is a plot of the first three DCTC and DCSC basis vectors, using a Mel-shape frequency warping and a Kaiser window with *β* = 5 for the (derivative of) time warping. The zeroth order terms represent the form of the spectral/temporal resolution.
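A minimal Python sketch of this DCTC/DCSC computation for one block is given below, assuming a Mel-shape frequency warping and a Kaiser window for dh/dt. It is only meant to make the discrete implementation of Eqs. (11), (12), (14), and (15) concrete; the parameter values, the logarithmic amplitude scaling, and the simple normalization of the sums are assumptions, not the authors' exact implementation.

```python
import numpy as np

def dctc_dcsc(spectrogram, n_dctc=15, n_dcsc=5, k=0.0875, C=0.9137, beta=5.0):
    """Sketch of the DCTC/DCSC front end (Figure 6(a)) for one block.
    `spectrogram` is a (frames x FFT-bins) magnitude-squared STFT."""
    n_frames, n_bins = spectrogram.shape
    f = np.linspace(0, 1, n_bins)                         # normalized frequency
    g = C * np.log10(1 + f / k)                           # Mel-shape warping, Eq. (6)
    dg = C / (np.log(10) * (k + f))
    # static basis vectors phi_i(f) = cos(pi*i*g(f)) * dg/df, Eq. (11)
    phi = np.array([np.cos(np.pi * i * g) * dg for i in range(n_dctc)])
    dctc = np.log(spectrogram + 1e-12) @ phi.T / n_bins   # DCTCs per frame, Eq. (14)
    dh = np.kaiser(n_frames, beta)                        # dh/dt as a Kaiser window
    h = np.cumsum(dh)                                     # warped time, mapped to [-1/2, 1/2]
    h = (h - h[0]) / (h[-1] - h[0]) - 0.5
    # temporal basis vectors psi_j(t) = cos(pi*j*h(t)) * dh/dt, Eq. (12)
    psi = np.array([np.cos(np.pi * j * h) * dh for j in range(n_dcsc)])
    dcsc = psi @ dctc / n_frames                          # DCSCs, Eq. (15)
    return dcsc.flatten()                                 # (n_dcsc x n_dctc) features
```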

Unlike some other front ends, such as RASTA [40], TRAPS [41], as well as the Gabor method mentioned previously, for which modulation frequencies are a key concept, the proposed DCTC and DCSC method does not explicitly use this concept.

**Figure 6.** *Two implementations of the proposed front end: (a) the DCTC/DCSC implementation in which DCTCs are computed first followed by DCSCs. The time warping in the DCSC basis vectors is uniform for all frequencies. (b) The DCSC/DCTC implementation in which a set of DCSCs are obtained first followed by DCTCs. This implementation enables frequency-dependency in the DCSC basis vectors.*

**Figure 7.** *The first three DCTC (a) and DCSC (b) basis vectors: A Mel-shape and a cumulative Kaiser window are used for frequency and time warping respectively.*

The DCSC basis vectors act as non-causal FIR low pass temporal filters of spectral dynamics. Similarly, the DCTCs can also be viewed as low pass filtering of the power spectrum. Parameters used in the DCTC/DCSC implementation can be varied to examine the trade-offs between spectral and temporal resolution. The trade-off between spectral and temporal resolution considered here is different than the auditory time-frequency resolution built into the warping of the basis vectors as presented previously. Here, based on the filtering point of view, the parameters determine how much detail of the static spectrum and dynamic trajectory is preserved after the low pass filtering. The time-frequency resolution represented by the derivatives of the warping (which also cause a trade-off effect) is an intrinsic property of human hearing. As mentioned, the proposed DCTC/DCSC front end can be tuned to emphasize either side of the overall spectral or temporal resolution. For increased emphasis on the spectral information, a long frame length and a relatively large number of DCTCs should be used, with a relatively small number of DCSCs computed from a long block length. For increased emphasis on time resolution, a short frame length and frame spacing should be used with a large number of DCSCs computed from a short block length.

**Figure 8** graphically illustrates this spectral-temporal trade-off. The top panel depicts the unprocessed spectrogram of a speech segment. Two spectrograms reconstructed from DCTC/DCSC terms are shown in the bottom panels<sup>4</sup>. The left one has high spectral resolution and low temporal resolution. It is rebuilt using 16 DCTCs, computed using 25 ms frames, a 10 ms frame spacing, and 4 DCSCs with a block length of 50 frames (500 ms). The one in the right bottom panel has low spectral resolution but high temporal resolution. It is computed from 8 DCTCs, 5 ms frames spaced by 2 ms, and 6 DCSCs with a block length of 100 frames (200 ms). The low frequency components in both rebuilt spectrograms are represented with higher resolution than are the higher frequency components due to the Mel frequency warping. Comparing the two reconstructed spectrograms, the spectrogram in the left panel preserves more spectral details than does the spectrogram in the right panel. In contrast, the spectral dynamics are shown with more resolution in the right-hand panel than in the left panel.

**Figure 8.** *Spectrogram of a speech segment (upper panel) and two rebuilt spectrograms: The bottom left one has high spectral resolution and low temporal resolution while the bottom right one has low spectral resolution but high temporal resolution.*

<sup>4</sup> Briefly, to rebuild the spectrum, the DCTCs and DCSCs are computed using orthonormal basis vectors, which can be obtained using Gram-Schmidt orthonormalization. Then the DCTCs of the center frame of a block are rebuilt first by multiplying the DCSCs by the transpose of the DCSC basis vector matrix and preserving only the center frame. Then the spectrum of this frame is rebuilt in a similar way by a matrix product using the transpose of the DCTC basis vector matrix.

#### **3.2. DCSC/DCTC method**

In the case of frequency-dependent time warping, the 2-D integration in Eq. (10) can be implemented by integrating over the time axis first followed by another integration over frequency. **Figure 6(b)** depicts the diagram of this configuration. In this case, Eq. (10) can be rearranged as:

$$Feat(i,j) = \int\_{f=0}^{1} \cos\left(\pi i g(f)\right) \frac{dg(f)}{df} \left[ \int\_{t=-\frac{1}{2}}^{\frac{1}{2}} a(X(t,f)) \cdot \cos\left(\pi j h(t,f)\right) \frac{dh(t,f)}{dt}\, dt \right] df \tag{16}$$

The inner integral defines a set of frequency-dependent DCSCs,

$$DCSC(j, f) = \int\_{t=-\frac{1}{2}}^{\frac{1}{2}} a(X(t, f)) \cdot \psi\_j(t, f)\, dt,\tag{16}$$

where *ψj*(*t*, *f*) is the *j*th DCSC basis vector for frequency *f*, as defined in Eq. (12). Then the integral over frequency computes the DCTCs, which yields the final features

$$Feat(i, j) = DCTC(i, j) = \int\_{f=0}^{1} DCSC(j, f) \cdot \varphi\_i(f)\, df,\tag{17}$$

where *φi*(*f*) is the *i*th DCTC basis vector as in Eq. (11).
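The integration order of Eqs. (16) and (17) can be written as two tensor contractions once the integrals are discretized. The sketch below is purely illustrative: the placeholder arrays stand in for *a(X(t, f))* and for the basis vectors *ψj(t, f)* and *φi(f)* defined in Eqs. (12) and (11).

```python
# Discretized sketch of Eqs. (16)-(17): integrate over time first
# (frequency-dependent DCSC basis), then over frequency (DCTC basis).
import numpy as np

n_freq, n_frames = 128, 25
n_dctc, n_dcsc = 13, 3

rng = np.random.default_rng(1)
aX  = rng.standard_normal((n_frames, n_freq))           # a(X(t, f)) for one block
psi = rng.standard_normal((n_dcsc, n_freq, n_frames))   # psi_j(t, f), one vector per (j, f)
phi = rng.standard_normal((n_dctc, n_freq))             # phi_i(f)

# Inner integral (over t): DCSC(j, f) = sum_t a(X(t, f)) * psi_j(t, f)
dcsc = np.einsum('jft,tf->jf', psi, aX)                  # (n_dcsc, n_freq)

# Outer integral (over f): Feat(i, j) = sum_f DCSC(j, f) * phi_i(f)
feat = np.einsum('jf,if->ij', dcsc, phi)                 # (n_dctc, n_dcsc)
```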

#### **3.3. Unified framework**

As mentioned in Section 1, the DCTC/DCSC structure proposed in this chapter can be viewed as a unified framework which incorporates the filterbank implementation of the frequency warping as well as the conventional delta and acceleration dynamic features. To illustrate this unified viewpoint, a comparison of the "standard" MFCC front end and the DCTC/DCSC front end is presented in **Figure 9**.

In the filterbank-based front end, the frequency warping is performed by a group of auditory filters, followed by a "regular" DCT transform, with the term "regular" referring to sampled versions of half-cosine basis vectors (in contrast to the basis vectors proposed in the previous sections). Specifically, the regular DCT transform is given by:

$$c(i) = \sqrt{\frac{2}{Q}} \sum\_{j=1}^{Q} a(P(j)) \cos\left(\frac{\pi i}{Q}(j - 0.5)\right),\tag{18}$$

where *c(i)* is the *i*th DCT coefficient, *Q* is the total number of filter channels, *P(j)* is the output power of the *j*th channel, and *a(.)* is the amplitude scaling function. The terms $\cos\left(\frac{\pi i}{Q}(j - 0.5)\right)$ are the unmodified cosine basis vectors.

In prior work [42], it was experimentally verified that the nonlinear amplitude scaling in the filterbank-based front end can be moved to immediately before the filterbank without degrading ASR performance (i.e. the filterbank block and the amplitude scaling block in **Figure 9(a)** can be swapped). Then the filterbank weights can be combined with the unmodified cosine basis vectors by a simple matrix multiplication. Mathematically, suppose each row of the matrix *W* contains the magnitude response of a filterbank channel (i.e. if 26 channels are used with 128 FFT samples per channel, *W* is a 26 by 128 matrix), and each row of the matrix *BVFreg* contains one unmodified cosine basis vector (i.e. with 12 basis vectors, *BVFreg* is a 12 by 26 matrix). A set of unified static basis vectors *BVFuni*, which incorporate the filterbank, can be formed by a matrix multiplication:


#### **Figure 9.**

*Block diagrams of the filterbank front end (a), the DCTC/DCSC front end (b), and a unified framework (c) of (a) and (b); dashed blocks are optional.*

$$\text{BVF}\_{\text{uni}} = \text{BVF}\_{\text{reg}} \cdot \text{W} \tag{19}$$

In the proposed DCTC/DCSC case, *BVFuni* is simply the matrix of the basis vectors *φi*(*f*) defined in Eq. (11), with each row containing one such basis vector. Thus, with the unified static basis vectors, the static features in the filterbank front end and the DCTC/DCSC front end can be obtained using the same mathematical framework. The only difference lies in how their basis vectors are computed. Specifically, if the matrix *X* represents the power spectrum of a block of frames<sup>5</sup> for which each column is the magnitude-squared STFT of a frame, the static features of this block for both the

<sup>5</sup> For consistency with the block processing in the computation of the dynamic features, the static feature computation also uses block notation here. When implemented, the static features are computed once for the entire utterance, and only the final features are computed block by block. That is, in the static feature step, *X* represents the spectrum of the entire utterance, and in the dynamic feature step in Eq. (21), *X* denotes a block of frames.

filterbank front end and the DCTC/DCSC front end can be computed in a unified way as *BVFuni* · *a*(*X*), where *a*(*X*) represents the amplitude scaling.
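As an illustration of Eq. (19) and the unified static feature computation, the sketch below builds a small triangular Mel filterbank and a regular cosine basis and combines them into *BVFuni*. The filter shapes, sampling rate, and log amplitude scaling are assumptions chosen only to match the example dimensions in the text.

```python
# Sketch of Eq. (19): BVF_uni = BVF_reg . W, then static features = BVF_uni . a(X).
import numpy as np

n_fft_bins, n_chan, n_basis = 128, 26, 12
fs = 16000.0

def mel(f):                       # standard Mel mapping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# W: each row is the magnitude response of one triangular Mel filter (26 x 128)
edges = inv_mel(np.linspace(mel(100.0), mel(7000.0), n_chan + 2))
freqs = np.linspace(0.0, fs / 2.0, n_fft_bins)
W = np.zeros((n_chan, n_fft_bins))
for q in range(n_chan):
    lo, ctr, hi = edges[q], edges[q + 1], edges[q + 2]
    W[q] = np.clip(np.minimum((freqs - lo) / (ctr - lo), (hi - freqs) / (hi - ctr)), 0.0, None)

# BVF_reg: each row is one unmodified cosine basis vector over the 26 channels (Eq. 18)
j = np.arange(1, n_chan + 1)
BVF_reg = np.sqrt(2.0 / n_chan) * np.cos(np.pi * np.arange(n_basis)[:, None] * (j - 0.5) / n_chan)

BVF_uni = BVF_reg @ W             # Eq. (19): 12 x 128 unified static basis vectors

# Unified static features for one frame, with the amplitude scaling a(.) taken
# here to be a log, applied before the basis as described in the text
power_frame = np.abs(np.random.default_rng(2).standard_normal(n_fft_bins)) ** 2
static_feats = BVF_uni @ np.log(power_frame + 1e-10)
```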

Similarly, the delta and higher order dynamics in the standard MFCC front end can also be computed by a summation of the static features over time, weighted by a set of dynamic basis vectors. From Eq. (1), to compute any *n*th order differential term, its basis vector with respect to the previous lower order terms (neglecting the constant denominator) is given by *bvn* = [−*Θn*, −*Θn* + 1, … , 0, 1, … , *Θn*], where *Θn* is the window length in Eq. (1). Considering *bvn* as a discrete signal with each element representing both the index and the amplitude (i.e. [−3, −2, −1, 0, 1, 2, 3] gives a signal whose amplitude is −3 at index −3, −2 at index −2, etc.), the *n*th order delta basis vector *bvTn* can be computed as

$$\mathbf{bvT}\_n = \mathbf{bv}\_1 \circledast \mathbf{bv}\_2 \circledast \dots \circledast \mathbf{bv}\_n, \tag{20}$$

where ⊛ is the convolution operator. Thus, a set of unified dynamic basis vectors *BVTuni* can be defined. In the case of the delta features, each row of *BVTuni* stores one dynamic basis vector of the form in Eq. (20), whereas in the proposed DCTC/DCSC front end, each row of *BVTuni* stores one DCSC basis vector as defined in Eq. (12). Hence, again the final output features *F* for both the MFCC and the DCTC/DCSC methods can be written in a unified way:

$$\mathbf{F} = \mathbf{B}\mathbf{V}\mathbf{T}\_{\text{uni}} \cdot \left[\mathbf{B}\mathbf{V}\mathbf{F}\_{\text{uni}} \cdot \mathbf{a}(\mathbf{X})\right]^{\text{T}} \tag{21}$$

**Figure 9(c)** is a block diagram of this unified framework. This diagram depicts the essence of the proposed speech features as well as of similar features such as MFCCs. They are essentially a series of linear transformations of the spectrum scaled by an auditory nonlinearity, with optional peripheral nonlinearities in between (dashed blocks in the diagram), such as the sigmoid-shaped functions given in [43, 44]. These nonlinearities generally improve the noise robustness of front ends. In this work, the linear transformations are represented by unified basis vectors. Filterbank-based features (such as MFCCs or PLP) shape their basis vectors implicitly through the filterbank, whereas the unified basis vectors presented here make the properties of a front end explicit. Thus the unified basis vectors provide a common yardstick with which to analyze and compare front ends.
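The dynamic side can be sketched the same way: the implicit delta/acceleration basis vectors are built by repeated convolution as in Eq. (20), stacked into *BVTuni*, and applied with Eq. (21). The half-window lengths, the zeroth order "pick the center frame" vector, and the stand-in matrices below are illustrative assumptions.

```python
# Sketch of Eqs. (20)-(21): delta-style dynamic basis vectors and the unified product.
import numpy as np

def delta_basis(order, half_lengths):
    """n-th order dynamic basis vector bvT_n = bv_1 * bv_2 * ... * bv_n (Eq. 20)."""
    bvT = np.array([1.0])
    for theta in half_lengths[:order]:
        bv = np.arange(-theta, theta + 1, dtype=float)   # bv_n = [-Theta_n, ..., Theta_n]
        bvT = np.convolve(bvT, bv)
    return bvT

bvT0 = np.zeros(5); bvT0[2] = 1.0          # zeroth order: select the center frame
bvT1 = delta_basis(1, [2, 2])              # delta basis, Theta_1 = 2
bvT2 = delta_basis(2, [2, 2])              # acceleration basis

# Pad to a common block length and stack as rows of BVT_uni
block_len = len(bvT2)
def center_pad(v, n):
    pad = (n - len(v)) // 2
    return np.pad(v, (pad, n - len(v) - pad))

BVT_uni = np.vstack([center_pad(v, block_len) for v in (bvT0, bvT1, bvT2)])

# Unified final features for one block (Eq. 21); BVF_uni and a(X) are placeholders
rng = np.random.default_rng(3)
BVF_uni = rng.standard_normal((12, 128))
aX = rng.standard_normal((128, block_len))             # a(X): spectra of one block of frames
F = BVT_uni @ (BVF_uni @ aX).T                          # (3 x 12) dynamic-by-static features
```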

A basic comparison can be made between filterbank-based frontends such as the widely-used MFCCs and the proposed DCTC/DCSC front end by comparing their unified basis vectors. Although the MFCC front end and the DCTC/DCSC front end are derived differently, the unified framework shows the two approaches are the same, except that the basis vectors are different.

**Figure 10** is a plot of the first three unified static basis vectors underlying MFCC features (based on 26 Mel filters) and three unified temporal basis vectors used to compute the zeroth order, delta, and acceleration terms. The unified basis vectors over frequency are not as "smooth" as the ones proposed here, which are based on the continuous Mel-shape warping *g(f)*, as shown in **Figure 7(a)**. The "jagged" basis vectors plotted in **Figure 10(a)** result from the quantization effect caused by the coarse sampling of the frequency axis by the filterbank. The unified temporal basis vectors, implicit in most current methods, estimate derivatives very approximately using a small number of samples. A comparison of the temporal basis vectors (see **Figures 7(b)** and **10(b)**) graphically illustrates that the standard delta/acceleration method uses only a few central terms in each block, whereas in the proposed method,


#### **Figure 10.**

*The first three unified static basis vectors resulting from 26 Mel filters (a) and the first three unified dynamic basis vectors of the delta method (b).*

the incorporation of non-uniform time resolution results in long "smooth" basis vectors emphasizing the center of the block but extending to the ends of the block. A comparison of both panels of **Figure 7** with both panels of **Figure 10** clearly illustrates the more continuous nature of the temporal basis vectors for the proposed method versus the implicit basis vectors corresponding to delta and acceleration terms. This suggests that the proposed DCSC basis vectors may represent spectral dynamics with more accuracy and resolution than is the case for the delta/acceleration method.

#### **4. Experimental evaluation**

#### **4.1 Experimental configuration**

A comprehensive suite of ASR tests over various conditions and parameter settings was performed to evaluate the effectiveness of the spectral-temporal DCTC/DCSC features and to investigate trade-offs in time and frequency resolution as they affect ASR performance. All experiments reported in this chapter are for phone recognition, with monophone models using the HTK 3.4 HMM/GMM recognizer [45]. Except for one set of evaluation experiments described below, all experiments use the TIMIT database [46]. As is typically done with this database, 3696 utterances (462 speakers, eight sentences/speaker, approximately 236 minutes) with SA sentences removed were used for training. The TIMIT database documentation [46] suggests using 1344 utterances (168 speakers, eight sentences/speaker, approximately 86 minutes) for testing. However, since various parameters in the proposed front end needed to be tuned, both for performance optimization and for exploring the effects on the time-frequency properties, a development set (DEV set) was needed. Thus, 672 utterances from the original test set were randomly chosen for this purpose, and the remaining 672 utterances were used as the evaluation set (EVAL set). Also, as recommended in [47], the original set of 61 labeled phones was collapsed to 48 phones to create 48 phone models, with a further merging of similar phones into 39 categories for scoring. For convenient reference, the reduction from 61 to 48 phones and further from 48 to 39 phones (shaded) is presented in **Table 1**, and a frequency count of the 39 phones for the training and the original test sets is shown in **Figure 11**. All HMM acoustic models had three emitting


#### **Table 1.** *61 TIMIT phones, reduced to 48 for training, and 39 categories (shaded) for testing.*


hidden states. A bigram language model was used based on phone bigram frequencies in the training set.
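For readers unfamiliar with phone bigram language models, the toy sketch below estimates bigram probabilities from training transcriptions with simple add-alpha smoothing; the phone symbols, data, and smoothing scheme are illustrative and not the HTK defaults.

```python
# Minimal sketch of a phone bigram language model estimated from training transcriptions.
from collections import Counter

train_transcripts = [["sil", "b", "iy", "sil"], ["sil", "k", "ae", "t", "sil"]]  # toy data

unigrams, bigrams = Counter(), Counter()
for phones in train_transcripts:
    unigrams.update(phones[:-1])                      # counts of left-context phones
    bigrams.update(zip(phones[:-1], phones[1:]))      # counts of adjacent phone pairs

def bigram_prob(prev, cur, vocab_size=39, alpha=1.0):
    """P(cur | prev) with add-alpha smoothing."""
    return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)

print(bigram_prob("sil", "b"))
```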

Invariability is also crucial for "good" features: the optimal front end parameters tuned on a DEV set should work approximately equally well on different independent evaluation sets without re-tuning. Therefore, an independent phone recognition task was also conducted using the Chinese Mandarin 863 Annotated 4 Regional Accent Speech Corpus (RASC863) [48]. The phonetically transcribed portion of this database was used for this work, which includes 20 speakers, each uttering 110 phonetically balanced sentences. Due to the much smaller number of speakers than for TIMIT, approximately 70% of the total set of 2200 utterances from all 20 speakers were used for training (1540 sentences, approximately 77 sentences/speaker, 224 minutes), and the remaining 30% were used for evaluation (660 sentences, 33 sentences/speaker, 96 minutes). Fifty-nine Chinese base phones (without considering tone information) were trained and evaluated on the evaluation set against the baseline, directly using the optimal parameters obtained from the TIMIT experiments.

Processing begins with a complex pole pair IIR pre-emphasis filter:

$$y[n] = x[n] - 0.95\,x[n-1] + 0.494\,y[n-1] - 0.64\,y[n-2] \tag{22}$$

This second-order filter has a peak near 3200 Hz and is a reasonably good match to the inverse of the equal-loudness contour for human hearing. In our previous work [49], it was found that this filter results in slightly higher ASR accuracy than is obtained with the more typically used first-order, single-zero pre-emphasis. All speech passages were then divided into overlapping windowed frames (Kaiser window with *β* of 6, similar to a Hamming window). A 512-point FFT of each frame was computed, and log magnitudes were computed for a frequency range of 100 Hz to 7000 Hz. Log magnitudes were "floor" clipped at 40 dB below the largest spectral magnitude in each frame. In previous work [50], this simple floor was found to improve ASR accuracy by a small amount, especially for noisy speech. In summary, each sentence was converted to a matrix of spectral values, which were then further processed by the DCTC/DCSC methods proposed in this chapter.
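A minimal sketch of this pre-processing chain is given below, assuming a 16 kHz sampling rate and 10 ms/5 ms framing; the band limiting to 100–7000 Hz is omitted for brevity.

```python
# Sketch of the described pre-processing: second-order IIR pre-emphasis (Eq. 22),
# Kaiser-windowed frames, 512-point FFT log magnitudes, and a 40 dB per-frame floor.
import numpy as np
from scipy.signal import lfilter, get_window

fs = 16000
frame_len, frame_step, n_fft = int(0.010 * fs), int(0.005 * fs), 512

x = np.random.default_rng(4).standard_normal(fs)       # stand-in 1 s utterance

# y[n] = x[n] - 0.95 x[n-1] + 0.494 y[n-1] - 0.64 y[n-2]
y = lfilter([1.0, -0.95], [1.0, -0.494, 0.64], x)

win = get_window(("kaiser", 6.0), frame_len)
frames = []
for start in range(0, len(y) - frame_len + 1, frame_step):
    spec = np.fft.rfft(y[start:start + frame_len] * win, n_fft)
    logmag = 20.0 * np.log10(np.abs(spec) + 1e-12)
    logmag = np.maximum(logmag, logmag.max() - 40.0)    # floor 40 dB below frame peak
    frames.append(logmag)

log_spectrogram = np.array(frames).T                    # (n_fft // 2 + 1) x n_frames
```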

#### **4.2 TIMIT DEV set parameter optimization**

#### *4.2.1 Experiment set 1: DCTC features only (static features)*

For the experiments reported in this section, only DCTC features were computed. The goal was to experimentally evaluate how frame length, frame spacing, number of DCTCs, and type and degree of frequency warping affect ASR accuracy. Not all combinations of parameter values are presented in the results due to the very large number of combinations. Rather, most of the parameter values were fixed at what appeared to be the best values based on pilot experiments, and then a subset of parameter values was varied and performance evaluated.

*Experiment A1—Spectral resolution issues for DCTCs*: The goal was to examine spectral resolution effects on ASR performance as determined by frame length and number of DCTCs. The frame spacing was fixed at 8 ms. Mel frequency warping (bilinear warping with a coefficient of 0.45) and 16-mixture GMM/HMMs were used. The spectrum of each frame was represented with 9 to 26 DCTCs. Frame length ranged from 5 ms to 40 ms. ASR accuracy ranged from approximately 49% to 57% in these tests. **Figure 12** depicts ASR accuracy using 21, 23, and 25 DCTCs as a function of frame length. It also contains the static MFCC baseline results using 26 filters and 21 coefficients, again with the frame spacing fixed at 8 ms. The absolute best accuracy (57.3%) was obtained with 20 ms frames and 25 DCTCs. However, the increase in performance for more than 19 DCTCs is minimal, typically less than 0.5%. Frame lengths of 15 ms to 30 ms result in fairly similar ASR accuracies.

**Figure 12.** *Phone recognition accuracy as a function of frame length using 21, 23, and 25 DCTCs.*

**Figure 13.** *Effect of frame length and frame spacing on phone recognition accuracy for 21 DCTCs.*


*Experiment A2—Time resolution effects for DCTCs:* To investigate the role of time resolution in frame-based speech features, the feature "sampling rate" was varied by changing the frame spacing from 2 ms to 20 ms. Since time resolution is also affected by frame length, four frame lengths (5, 10, 20, and 30 ms) were evaluated. 21 DCTCs were used for all tests. Other parameters were the same as for Experiment A1. The baseline in this experiment is the static MFCC case with frame length fixed at 5 ms. Results are shown in **Figure 13**.

Results vary from 34.9% (5 ms frames, 20 ms apart) to 59.7% (10 ms frames, 5 ms apart). Phonetic recognition accuracy degrades when the frame spacing is too large, especially for shorter frame lengths. The best performance for each frame length varies from 57.6% to 59.7%. As might be expected, the highest accuracy is obtained with short frame spacings and short frame lengths, that is, high time resolution. However, unexpectedly, accuracy also degrades when the frame spacing is too short. We hypothesize that oversampling of features is problematic for the HMM recognizer, due to the high correlation of features when frames are very closely spaced.

*Experiment A3—Effect of frequency warping on DCTC features:* To evaluate frequency warping, bilinear frequency warping was used as in Eq. (7) with a single parameter α controlling the shape of the nonlinearity for the frequency warping. Bilinear warping with a coefficient of 0.45 closely approximates Mel warping, whereas a coefficient in the range of 0.5 to 0.57 approximates Bark warping [32].
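For reference, the sketch below implements the standard first-order allpass (bilinear) warping curve controlled by a single coefficient; we assume this is the form of Eq. (7), with frequency normalized so that f = 1 corresponds to the top of the analysis band.

```python
# Sketch of bilinear frequency warping with a single coefficient alpha.
import numpy as np

def bilinear_warp(f, alpha):
    """Warped frequency g(f) for normalized f in [0, 1]."""
    w = np.pi * f
    return f + (2.0 / np.pi) * np.arctan(alpha * np.sin(w) / (1.0 - alpha * np.cos(w)))

f = np.linspace(0.0, 1.0, 257)
g_mel_like  = bilinear_warp(f, 0.45)   # approximates Mel warping
g_bark_like = bilinear_warp(f, 0.55)   # approximates Bark warping
g_none      = bilinear_warp(f, 0.0)    # alpha = 0 gives no warping, g(f) = f
```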

#### **Figure 14.**

*Phone recognition accuracy as a function of frequency warping for two cases: the standard Mel warping, i.e. g(f) = 2595 log10(1 + f/700), was used as a baseline, and results were within 0.1% of bilinear warping with a coefficient of 0.45 in both cases.*

Since pilot experiments showed that the effects of frequency warping depend on the number of DCTC features and the number of HMM mixtures in the recognizer, these experiments were performed for two cases: 13 DCTCs with 8-mixture HMMs and 21 DCTCs with 16-mixture HMMs. 10 ms frames spaced 5 ms apart were used in all cases. Results are plotted in **Figure 14** as the warping coefficient varies from 0 (no warping) to 0.8 (over-warped).

The effect of warping is more apparent for the 8-mixture case than for the 16-mixture case. The overall best warping values found were 0.4 and 0.45 (most similar to Mel warping). For the 8-mixture/13-DCTC case, the best warping of 0.45 resulted in a 3% accuracy improvement over the no-warping case. For the 16-mixture/21-DCTC case, the best warping of 0.4 yielded a 1.5% accuracy improvement over the no-warping case. The "standard" Mel warping, as proposed by O'Shaughnessy [31], was also evaluated as a baseline, and the result was within 0.1% of that obtained using a bilinear warping coefficient of 0.45 for both 13 DCTCs/8 mixtures and 21 DCTCs/16 mixtures.

#### *4.2.2 Experiment set 2: Dynamic features (DCTCs and DCSCs)*

In these experiments, a myriad of parameters believed to be significant for DCTC/DCSC features, which represent spectral-temporal characteristics in a block of frames centered on each frame, were varied. These parameters include the number of DCTCs/DCSCs, frame length/spacing, frequency/time-warping coefficients, and block length/spacing. Not all combinations of parameters were tested, due both to the very large number of cases and to the assumption that many of the variations would not have much effect on ASR accuracy. Based on pilot experiments and the results reported previously for Experiments A1, A2, and A3, many of these parameters were either fixed to a

#### **Figure 15.**

*Phone recognition accuracy for 39 DCTC/DCSC features as a function of block spacing with block length fixed at 251 ms: The 39 MFCC features produce a baseline of 70.5% (block spacing fixed at 8 ms).*


**Figure 16.**

*Phone recognition accuracy as a function of time-warping factor for different block lengths with a fixed block spacing of 8 ms: The baseline 39 MFCC case is also depicted.*

single value or varied over a small range. The other parameters were varied and performance evaluated. 32 HMM mixtures were used due to the large dimensionality of the feature space.

*Experiment B1—39 feature (13 DCTCs/3 DCSCs) experiments:* Since 39 MFCC features are often used in ASR systems, the first experiments were performed with 39 features: 13 DCTCs and 3 DCSCs. The 39 MFCC feature case was the baseline. In these experiments, the block length was fixed at 251 ms, and the block spacing, which also determines the feature "sampling" rate, was varied from 4 ms to 12 ms, with results depicted in **Figure 15**. The frame length/spacing was fixed at 10 ms/1 ms, and bilinear frequency warping (coefficient of 0.45) was used. The time-warping coefficient was 50, using a Kaiser window. ASR accuracy varies by approximately 2% from the lowest case (12 ms block spacing) to the highest case (8 ms).

*Experiment B2—39 features (13 DCTCs, 3 DCSCs), block length and time-warping effects (auditory time resolution):* The objective of this experiment set was to examine the role of block length and time warping in representing the feature trajectories. These two parameters are closely related to the auditory time resolution of feature trajectories: A longer block length gives the ability to represent lower temporal modulation frequencies. A higher time-warping factor corresponds to higher time resolution in the central portion of a segment. To study these effects, five block lengths were used (51, 151, 251, 501, 1001 ms) with block spacing fixed at 8 ms. The time-warping factor of a Kaiser window was varied from 5 to 60 for the 51, 151, and 251 ms cases in steps of 5, and it was varied from 45 to 305 for the 501 and 1001 ms cases with steps of 20. The parameters for static features were identical to those in Experiment B1. Results are depicted in **Figure 16** again with a baseline of 39 MFCCs.

The highest accuracy of 71.9% was obtained with a time-warping coefficient of 50 and a block length of 251 ms. Results suggest that block length and time warping are closely related to each other. As the block length increases, a larger time warping is required to achieve better performance, and a moderately long block length, such as 251 ms, which incorporates informative contextual information around each sample instant, provides the best result. However, very long contexts, such as 501 and 1001 ms, do not improve performance and require very large time-warping values. This shows that spectral context too far from the current "observation point" does not provide much useful information, but it can be suppressed by a large

**Figure 17.** *Phone recognition accuracy as a function of combinations of DCTCs and DCSCs.*

time-warping factor, which emphasizes the useful information within a much shorter range surrounding the block center. Also, a long block length greatly increases the computation for each block (the number of multiplications in the vector inner product that implements the integration). Based on these considerations, 251 ms is considered the best value for the block length.
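As a purely illustrative aid (not the chapter's definition of h(t)), one plausible way to obtain a center-emphasizing time-warping function from a Kaiser window is to take the normalized cumulative sum of the window; a larger "time-warping factor" then concentrates temporal resolution near the block center, which is the qualitative behavior described above. This construction is an assumption used only for visualization.

```python
# Illustrative sketch: a monotonic, center-emphasizing time warp from a Kaiser window.
import numpy as np
from scipy.signal import get_window

def kaiser_time_warp(block_frames, beta):
    w = get_window(("kaiser", float(beta)), block_frames)
    h = np.cumsum(w)
    h = (h - h[0]) / (h[-1] - h[0])        # normalized, monotonic, in [0, 1]
    return h - 0.5                         # shift so the block spans [-1/2, 1/2]

h_gentle = kaiser_time_warp(251, 5)        # nearly linear: roughly uniform time resolution
h_sharp  = kaiser_time_warp(251, 50)       # steep at the center: emphasizes center frames
```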

*Experiment B3—Overall spectral-temporal effect:* Phonetic recognition accuracy was evaluated with a variety of DCTC numbers (9 to 23 in steps of 2) and DCSC numbers (3, 4, 5, and 6). These combinations were used to examine the trade-off between spectral and temporal resolution. Other parameters were selected to match the best settings from earlier experiments (frame length/spacing of 10 ms/1 ms, bilinear frequency warping with a coefficient of 0.4 for cases with 15 or more DCTCs and 0.45 for cases with fewer than 15 DCTCs, 251 ms/8 ms block length/spacing, and time warping of 50). Results are shown in **Figure 17**.

First, as the number of DCTCs increases beyond about 15, performance begins to decrease. The number of DCSCs has a similar effect. Also, when a relatively small number of DCTCs is used, i.e. less overall spectral resolution, performance increases relatively quickly as more DCSCs are used, i.e. more overall time resolution, as can be seen in the 9 and 11 DCTC cases (2% improvement from 3 DCSCs to 5 DCSCs using 9 DCTCs). However, performance improves more slowly with more DCSCs when a relatively large number of DCTCs is used (less than 1% improvement using 23 DCTCs). This observation shows the trade-off between overall spectral and temporal resolution. The optimal "balance point" was obtained using 15 DCTCs and 5 DCSCs, which produced 72.9% accuracy.

#### **4.3 Independent EVAL set results and invariability**

Based on the results from the TIMIT DEV set, a subset of parameters was further optimized. Two optimal parameter sets, one for a small feature set (27 features) and one for a large feature set (75 features), were obtained. Also, the number of


GMM mixtures for each feature set was optimized using the TIMIT DEV set. After these "final" optimizations were performed, two EVAL phone recognition tasks were conducted with different data to verify the generality of the tuned front end parameters, as mentioned previously. The EVAL sets were the TIMIT EVAL set and the RASC863 Chinese Mandarin EVAL set. For the Chinese phone recognition task, the number of GMM mixtures was reduced due to the smaller amount of available training data and the greater number of phone models to be trained. The best parameter values and the EVAL results are reported in **Tables 2** to **5**. In these tables, "BIG\_REC" refers to the results using the optimal number of GMM mixtures, indicating the best accuracy achieved by a high-order HMM recognizer. In addition, the accuracy for the training set in each case is also reported, which gives an idealized upper bound on recognizer performance if the training data completely represented the test data.

It can be seen from these results that the proposed DCTC/DCSC method achieves generally better performance than the baseline MFCCs on the independent EVAL sets. In addition, to further examine the feature invariability of the DCTC/DCSC front end, the parameter values based on the TIMIT DEV set were varied and re-evaluated for the Chinese phone recognition task. These tests showed (results not given here) that the parameter values for best performance did not change, which means that the parameter values determined from the TIMIT DEV set were applicable to an entirely different database in a vastly different language.

*Experiment C1—DCTC/DCSC small feature set evaluation performance:* The optimum settings for a small feature set are summarized in **Table 2**. Accuracies on the EVAL sets are reported in **Table 3**.

*Experiment C2—DCTC/DCSC large feature set evaluation performance:* The optimum settings for a large feature set are summarized in **Table 4**, and accuracies on the EVAL sets are reported in **Table 5**.

#### **4.4 Unified framework explanation and statistical significance tests**

As mentioned in Section 3 and in previous work [42], since the amplitude scaling step can be moved to immediately before the filterbank, the filterbank weights can be merged with the unwarped regular DCT basis vectors by a simple matrix product. Similarly, the delta and higher order acceleration dynamic terms can also be computed in basis vector form. Thus, the proposed DCTC/DCSC front end and more


#### **Table 2.**

*Optimum parameter settings for small feature set.*


#### **Table 3.**

*27 feature TIMIT and RASC863 EVAL accuracies.*


#### **Table 4.**

*Optimum parameter settings for large feature set.*

typically used filterbank front ends can be viewed within a unified framework. The reported experimental results can be explained using the unified time-frequency basis vectors as a common yardstick. First, Experiments A1 and A2 show that for static features, the proposed continuous Mel-shape warping results in slightly better performance than that obtained using Mel filterbank-derived basis vectors. By comparing their unified static basis vectors in **Figure 7(a)** and **Figure 10(a)**, our conjecture is that the quantization effect of the filterbank caused this difference. However, since the continuous Mel-shape warping and the filterbank are essentially two ways of implementing a Mel warping, the difference should be small, as verified by the experimental results. It should be pointed out that it was experimentally verified that the standard way of implementing the MFCC front end and MFCCs computed using unified basis vectors result in identical feature values, provided the amplitude nonlinearity immediately follows the spectral magnitude step. Similarly, by comparing the unified dynamic basis vectors in **Figure 7(b)** and **Figure 10(b)**, it is clear that



#### **Table 5.**

*75 feature TIMIT and RASC863 EVAL accuracies.*

the non-uniform time resolution for a long segment of speech is a better representation of the spectral trajectory than the discrete time derivatives (most obvious in the zeroth order unified basis vectors). The more significant improvements over the baseline MFCC for various numbers of features in Experiments B and C support this observation.

Another set of experiments was conducted to address the issue of statistical significance. The goal was to show that the differences between the reported best cases for the DCTC/DCSC front end and the best baseline results in each previous experiment were statistically significant rather than due to noise or other random factors. These significance tests were conducted using the TIMIT database. To do this, the best results of the proposed method and the baseline were viewed as two random variables whose mean values are denoted *μT* and *μB*. Then the 672 utterances of the TIMIT DEV and TIMIT EVAL sets were each divided into 12 groups, and test results were obtained for each group as samples. Since it is reasonable to assume the same (but unknown) variance for the proposed front end and the baseline


#### **Table 6.**

*Results of statistical significance tests for reported TIMIT experiments.*

(because the database was identical in all cases), a *t*-test with 22 degrees of freedom was performed to test the significance of the difference term, i.e. *μT* − *μB*. The results of these tests are summarized in **Table 6**.
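A minimal sketch of such a test is shown below, assuming per-group accuracies are available for the proposed front end and the baseline; the accuracy values are placeholders, not the reported results.

```python
# Two-sample t-test with equal (unknown) variances: 12 + 12 - 2 = 22 degrees of freedom.
import numpy as np
from scipy.stats import ttest_ind

acc_proposed = np.array([73.1, 72.4, 72.9, 73.5, 72.2, 73.0,
                         72.7, 73.3, 72.5, 72.8, 73.2, 72.6])   # placeholder group accuracies
acc_baseline = np.array([70.8, 70.2, 70.6, 71.1, 70.0, 70.7,
                         70.4, 71.0, 70.3, 70.5, 70.9, 70.1])

t_stat, p_value = ttest_ind(acc_proposed, acc_baseline, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4f} (df = {len(acc_proposed) + len(acc_baseline) - 2})")
```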

#### **4.5 Frequency-dependent time warping in DCSC/DCTC scheme**

In addition to the DCTC/DCSC implementation, in which the time warping is independent of frequency, a set of experiments for the DCSC/DCTC variation, which incorporates frequency-dependent time warping, was also conducted. The goal was to test whether the auditory time-frequency trade-off caused by nonlinear frequency selectivity improves ASR performance. Specifically, the best warping factors obtained in the DCTC/DCSC experiments (i.e. 50 in the 27 feature case and 40 in the 75 feature case) were used as a baseline; smaller time warping for lower frequencies and larger time warping for higher frequencies were used, with the averages fixed at the baseline values (the block length was identical for all frequencies). An equivalent alternative was also implemented: a longer block length for low frequencies than for higher frequencies, with the warping factor fixed. The results of these experiments showed no advantage over the baseline, which uses uniform time warping over all frequencies. This seems to imply that, despite results from human auditory research showing that humans have frequency-dependent temporal sensitivity [34–36], this property may not play a crucial role, at least for the phone recognition ASR task evaluated in this chapter. Similar findings have been reported by others. In one detailed study using wavelet signal processing to extract features for phonetic class recognition [51], the best performance obtained with wavelet features was only comparable to that obtained with MFCC features. In another study [52], a set of spectral-temporal features, which also accounts for a similar time-frequency trade-off, resulted in improved performance but only for restricted tasks (an isolated phone classification task rather than a continuous recognition application). The method introduced in [52] has not been adopted by the ASR community for general use.

#### **5. Conclusion and future work**

This chapter presents a generalized spectral-temporal feature extraction front end for representing speech information. The feature set is motivated by the attempt to mimic two primary properties of human hearing: frequency and time resolution. Based on a set of frequency- and time-warping functions built into a set of modified 2-D cosine basis vectors, the trade-off between spectral and temporal resolution can be explored. A wide range of ASR experiments was conducted using the DCTC/DCSC method to comprehensively evaluate spectral-temporal resolution effects. This was done by adjusting the DCTC and DCSC parameters to emphasize either spectral or temporal resolution and attempting to find the best overall "balance" point. The best parameter combination, found using phonetic recognition experiments in English, also worked well for Mandarin.

Empowered by the front end unification approach, a higher level systematic unification can be envisioned. Conceptually, a recognizer front end should only need to supply static features, with temporal patterns modeled by the recognizer. The human auditory system primarily performs spectral analysis, whereas higher levels of


processing in the human brain appear to extract the longer-term spectral-temporal information. Apparently, the HMM framework is not able to adequately capture the temporal patterns contained in sequences of static speech features alone. Thus, it is possible that modeling of the "hidden" spectral-temporal patterns could be left to data-driven training of a state-of-the-art recognizer, such as a deep neural network (DNN), which has the power of performing "deep learning."

### **Author details**

Stephen A. Zahorian<sup>1</sup> \*, Xiaoyu Liu<sup>2</sup> and Roozbeh Sadeghian<sup>3</sup>

1 Binghamton University, Binghamton, USA

2 Dolby Laboratories Inc., San Francisco, USA

3 Harrisburg University of Science and Technology, Harrisburg, USA

\*Address all correspondence to: stephen.zahorian16@gmail.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **References**

[1] Zahorian SA. Detailed Phonetic Labeling of Multi-Language Database for Spoken Language Processing Applications. Rome, NY, USA: Air Force Research Laboratory Information Directorate; 2015. Available from: http://www.oracle.com/us/corporate/citizenship/corporate-citizenship-report-2563684.pdf. DOI: 10.21236/ada614725

[2] Peterson GE, Barney HL. Control methods used in a study of the vowels. The Journal of the Acoustical Society of America. 1952;**24**(2):175-184. DOI: 10.1121/1.1906875

[3] Hermansky H. Perceptual linear prediction analysis of speech. The Journal of the Acoustical Society of America. 1990;**87**(4):1738-1752. DOI: 10.1121/1.399423

[4] Weber K, Wet F, Cranen B, Bodes L, Bengio S, Bourlard H. Evaluation of formant-like features for ASR. Int. Conf. on Spoken Language (ICSLP). 2002. DOI: 10.1121/1.1781620

[5] Garner P, Holmes W. On the robust incorporation of formant features into hidden Markov models for automatic speech recognition. Proceedings of ICASSP. 1998:1-4. DOI: 10.1109/ICASSP.1998.674352

[6] Holmes J, Holmes W, Garner P. Using formant frequencies in speech recognition. Proceedings of EUROSPEECH'97. 1997;**4**:2083-2086

[7] Bogert BP, Healy MJR, Tukey JW. The quefrency analysis of time series for echoes: Cepstrum, pseudo autocovariance, cross-cepstrum and Saphe cracking. In: Rosenblatt M, editor. Chapter 15. Proceedings of the Symposium on Time Series Analysis. New York: Wiley; 1963. pp. 209-243

[8] Zwicker E, Fastl H. Chapter 3. In: Psychoacoustics, Facts and Models. Springer-Verlag; 1990. pp. 25-28

[9] Stevens SS, Volkmann J, Newman EB. A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America. 1937;**8**(3):185-190. DOI: 10.1121/1.1915893

[10] Fletcher H. Auditory patterns. Reviews of Modern Physics. 1940:12

[11] Zwicker E. Subdivision of the audible frequency range into critical bands. The Journal of the Acoustical Society of America. 1961;**33**(2):248-248. DOI: 10.1121/1.1908630

[12] Glasberg BR, Moore BCJ. Derivation of auditory filter shapes from notched-noise data. Hearing Research. 1990;**47**(1–2):103-138. DOI: 10.1016/0378-5955(90)90170-T

[13] Bridle JS, Brown MD. An Experimental Automatic Word-Recognition System. JSRU Report. Vol. 1003. Ruislip, England: Joint Speech Research Unit; 1974

[14] Patterson PD, Robinson K, Holdsworth J, McKeown D, Zhang C, Allerhand MH. Complex sounds and auditory images. In: Cazals Y, Demany L, Horner K, editors. Auditory and Perception. Oxford, UK: Pergamon Press; 1992. pp. 429-446

[15] Slaney M. An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank. Apple Technical Report. Cupertino, CA: Advanced Technology Group, Apple Computer, Inc.; 1993. p. 35


[16] Zhang X, Heinz MG, Bruce IC, Carney LH. A phenomenological model for the response of auditory-nerve fibers: I. nonlinear tuning with compression and suppression. The Journal of the Acoustical Society of America. 2001; **109**(2):648-670. DOI: 10.1121/1.1336503

[17] Robinson DW, Dadson RS. A re-determination of the equal-loudness relations for pure tones. British Journal of Applied Physics. 1956;**7**:166-181

[18] Makhoul J. Linear prediction: A tutorial review. Proceedings of the IEEE. 1975;**63**:561-580. DOI: 10.1109/ PROC.1975.9792

[19] Memon S, Lech M, Maddage N. Speaker verification based on different vector quantization techniques with Gaussian mixture models. In: Third Int. Conf. on Network and System Security. 2009. pp. 403-408. DOI: 10.1109/NSS.2009.19

[20] Jayanna HS, Prasanna SRM. Fuzzy vector quantization for speaker recognition under limited data conditions. TENCON 2008-IEEE Region 10 Conference. 2008:1-4. DOI: 10.1109/TENCON.2008.4766453

[21] Chen J, Paliwal KK, Mizumachi M, Nakamur S. Robust MFCCs Derived from Different Power Spectrum. Scandinavia: Eurospeech; 2001

[22] Wang C, Miao Z, Meng X. Differential MFCC and vector quantization used for real-time speaker recognition system. IEEE Congress on Image and Signal Processing. 2008: 319-323. DOI: 10.1109/CISP.2008.492

[23] Drullman R, Festen JM, Plomp R. Effect of reducing slow temporal modulations on speech reception. The Journal of the Acoustical Society of America. 1994;**95**(5):2670-2680. DOI: 10.1121/1.409836

[24] Athineos M, Hermansky H, Ellis DPW. LPTRAPS: Linear predictive temporal patterns. In: Proc. of Interspeech. Jeju Island, Korea; 2004. pp. 1154-1157

[25] Valente F, Hermansky H. Hierarchical and parallel processing of modulation spectrum for ASR applications. ICASSP. 2008:4165-4168. DOI: 10.1109/ICASSP.2008.4518572

[26] Kleinschmidt M. Methods for capturing spectro-temporal modulations in automatic speech recognition. Acustica united with acta Acustica. 2002;**88**:416-422

[27] Kleinschmidt M. Localized Spectro-Temporal Features for Automatic Speech Recognition. Switzerland: Eurospeech; 2003

[28] Allen J. Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans. Acoust., Speech, and Signal Processing. 1977;**ASSP-25**(3):235-238. DOI: 10.1109/TASSP.1977.1163007

[29] Kim C, Stern RM. Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction. INTERSPEECH. 2009:28-31. DOI: 10.21437/Interspeech.2009-5

[30] Rao KR, Yip P. Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press; 1990

[31] O'Shaughnessy D. Speech Communication: Human and Machine. Addison-Wesley; 1987. p. 150

[32] Smith JO, Abel JS. The bark bilinear transform. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New York; 1995. DOI: 10.1109/ASPAA.1995.482991

[33] Wang S, Sekey A, Gersho A. An objective measure for predicting subjective quality of speech coders. IEEE Journal on Selected Areas in Communications. 1992;**10**(5):819-829. DOI: 10.1109/49.138987

[34] Duifhuis H. Consequences of peripheral filter selectivity for nonsimultaneous masking. The Journal of the Acoustical Society of America. 1973;**54**(6):1471-1488

[35] Bidelman GM, Khaja AS. Spectrotemporal resolution tradeoff in auditory processing as revealed by human auditory brainstem responses and psychophysical indices. Neuroscience Letters. 2014;**572**:53-57

[36] Shailer MJ, Moore BCJ. Gap detection as a function of frequency, bandwidth, and level. The Journal of the Acoustical Society of America. 1983; **74**(2):467-473. DOI: 10.1121/1.389812

[37] Meyer B, Ravuri SV, Schadler MR, Morgan N. Comparing different flavors of spectro-temporal features for ASR. INTERSPEECH. 2011:1269-1272. DOI: 10.21437/Interspeech.2011-103

[38] Depireux DA, Simon JZ, Klein DJ, Shamma SA. Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex. Journal of Neurophysiology. 2001;**85**:1220-1234. DOI: 10.1152/jn.2001.85.3.1220

[39] Ge W. Two Modified Methods of Feature Extraction for Automatic Speech Recognition (Thesis). Binghamton: Department of Electrical and Computer Engineering, Binghamton University; 2013

[40] Hermansky H, Morgan N. RASTA processing of speech. IEEE Trans. Speech and Audio Processing. 1994;**2**(4): 578-589. DOI: 10.1109/89.326616

[41] Hermansky H, Sharma S. TRAPS classifiers of temporal patterns. ICSLP. 1998;**3**:1003-1006

[42] Liu X, Zahorian SA. A Unified Framework for Filterbank and Time-Frequency Basis Vectors in ASR Front Ends. Australia: ICASSP; 2015. DOI: 10.1109/ICASSP.2015.7178854

[43] Chiu BY, Bhiksha R, Stern RM. Towards fusion of feature extraction and acoustic model training: A top down process for speech recognition. INTERSPEECH. 2009:32-35. DOI: 10.21437/Interspeech.2009-6

[44] Chiu BY, Stern RM. Analysis of physiologically-motivated signal processing for robust speech recognition. INTERSPEECH. 2008:1000-1003. DOI: 10.21437/Interspeech.2008-291

[45] Young S, et al. The HTK Book (for HTK Version 3.4). Cambridge University; 2009. Available from: http://htk.eng.cam.ac.uk/

[46] Zue V, Seneff S, Glass J. Speech database development at MIT: TIMIT and beyond. Speech Communication. 1990;**9**:351-356. DOI: 10.1016/0167-6393(90)90010-7

[47] Lee K, Hon H. Speaker-independent phone recognition using Hidden Markov Models. IEEE Trans. on Acoust., Speech, and Signal Processing. 1989;**37**(11): 1642-1648. DOI: 10.1109/29.46546

[48] Li A, Yin Z, Wang T, Fang Q, Hu F. RASC863-a Chinese speech corpus with four regional accents. Report of Chinese Academy of Sciences. 2004


[49] Nossair ZB, Silsbee PL, Zahorian SA. Signal Modeling Enhancement for Automatic Speech Recognition. Vol. 1. Proceedings of ICASSP; 1995. pp. 824-827. DOI: 10.1109/ICASSP.1995.479821

[50] Zahorian SA, Wong B. Spectral amplitude nonlinearities for improved noise robustness of spectral features for use in automatic speech recognition. The Journal of the Acoustical Society of America. 2011;**130**(4):2524. DOI: 10.1121/1.3655077

[51] Van Pham T. Wavelet analysis for robust speech processing and applications (thesis). Graz University of Technology. 2007

[52] Droppo JG III. Time-frequency features for speech recognition (thesis). University of Washington. 2000
