**Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition**

Jozef Juhár and Peter Viszlay

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/48715

## **1. Introduction**


The most common acoustic front-ends in automatic speech recognition (ASR) systems are based on the state-of-the-art Mel-Frequency Cepstral Coefficients (MFCCs). Practice shows that this general technique is a good choice for obtaining a satisfactory speech representation. In the past few decades, researchers have made great efforts to develop and apply techniques that may improve the recognition performance of the conventional MFCCs. In general, these methods were taken from mathematics and applied in many research areas such as face and speech recognition, high-dimensional data and signal processing, video and image coding, and many others. One group of such methods is represented by linear transformations.

Linear feature transformations (also referred to as subspace learning or dimensionality reduction methods) are used to convert the original data set into an alternative, more compact set while retaining as much information as possible. They are also used to increase the robustness and the performance of the system. In speech recognition, the basic acoustic front-end based on MFCCs can be supplemented by some kind of linear feature transformation. The linear transformation is applied in the feature extraction step, so the whole feature extraction process is performed in two steps: parameter extraction and feature transformation. The linear transformation is applied to a sequence of acoustic vectors obtained by some kind of preprocessing method. Usually, the spectral, log-spectral, Mel-filtered spectral or cepstral features are projected into a more relevant and more decorrelated subspace, which is directly used in acoustic modeling. A dimension reduction step is often performed during the transformation. This is achieved by retaining only the relevant dimensions after the transformation according to some optimization criterion. The dimension reduction step helps to mitigate the problem called the curse of dimensionality.
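The two-step front-end can be illustrated with a minimal NumPy sketch; the MFCC matrix and the transformation matrix `W` below are random placeholders (in practice `W` would come from PCA or LDA training), not values from this chapter's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: parameter extraction -- assume 13-dimensional MFCC vectors
# have already been computed for 100 speech frames (random stand-ins here).
mfcc = rng.standard_normal((13, 100))   # N x n: one column per frame

# Step 2: feature transformation -- project with a linear transform
# W (N x p, p < N), e.g. estimated beforehand by PCA or LDA.
N, p = 13, 8
W = rng.standard_normal((N, p))         # placeholder transform

features = W.T @ mfcc                   # y = W^T x applied to every frame
print(features.shape)                   # (8, 100): reduced dimension
```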

In practice, supervised and unsupervised subspace learning methods are used. The most popular data-driven unsupervised transformation used in ASR is Principal Component Analysis (PCA). Supervised methods, in contrast, need information about the structure of the data, which is partitioned into classes; therefore, it is necessary to use appropriate class labels. A widely used supervised method is known as Linear Discriminant Analysis (LDA).

©2012 Viszlay and Juhár, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In numerous research works and publications it has been shown that the above-mentioned linear transformations were successfully applied in ASR to multiple languages with different speech characteristics. The Slovak speech recognition research group tends to follow this trend. In this work, we present a practical methodology, with the corresponding theoretical principles, for the application of linear feature transformations in Slovak phoneme-based large vocabulary continuous speech recognition (LVCSR).

The main subject of this chapter is the application of LDA in Slovak ASR, but the core of most experiments is based on Two-dimensional LDA (2DLDA), which is an extension of LDA. Several context lengths of the basic vectors are used in the discriminant analysis and different final dimensions of the transformation matrix are utilized. The classical procedures are supplemented by several of our modifications. The second part of the chapter is oriented to PCA and to our proposed method for PCA training from a limited amount of training data. The third part investigates the interaction of the above-mentioned PCA and 2DLDA applied in one recognition task. The closing part compares and evaluates all experiments and concludes the chapter by presenting the best achieved results.

This chapter is divided into a few basic units. Sections 2 and 3 describe LDA and 2DLDA used in speech recognition. Section 4 surveys PCA and also presents the proposed partial-data trained PCA method. Section 5 presents the setup of the system for continuous phoneme-based speech recognition. Section 6 presents extensive experiments and evaluations of the used methods in different configurations. Finally, Section 7 concludes the chapter and Section 8 gives the future intentions of our research.

## **2. Conventional Linear Discriminant Analysis (LDA)**

Linear discriminant analysis is a well-known dimensionality reduction and transformation method that maps the *N*-dimensional input data to a *p*-dimensional (*p* < *N*) subspace while retaining maximum discrimination information. A general mathematical model of a linear transformation can be written in the following manner:

$$\mathbf{y} = \mathbf{W}^T \mathbf{x},\tag{1}$$


where **y** is the output transformed feature set, *W* is the transformation matrix and **x** is the input feature set. The aim of LDA is to find this transformation matrix *W* with respect to some optimization criterion (information loss, class discrimination, ...). It can be obtained by applying an eigendecomposition to the covariance matrices. The *p* best eigenvectors resulting from the decomposition are used to transform the feature vectors to a reduced representation.

## **2.1. Mathematical background**

According to [1, 7, 11, 14, 19], LDA can be defined as follows. Suppose a training data matrix $X \in \mathbb{R}^{N \times n}$ with $n$ column vectors $\mathbf{x}_i$, where $1 \leq i \leq n$. LDA finds a linear transformation represented by a transformation matrix $W \in \mathbb{R}^{N \times p}$ that maps each column $\mathbf{x}_i$ of $X$ to a vector $\mathbf{y}_i$ in the $p$-dimensional space as:

$$\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i; \quad p < N. \tag{2}$$

Consider that the original data is partitioned into $k$ classes as $X = \{\Pi_1, \dots, \Pi_k\}$, where the class $\Pi_i$ contains $n_i$ elements (feature vectors) from the $i$th class. Notice that $n = \sum_{i=1}^{k} n_i$. The classes can be represented by *class mean vectors*

$$
\mu\_i = \frac{1}{n\_i} \sum\_{\mathbf{x} \in \Pi\_i} \mathbf{x} \tag{3}
$$

and their *class covariance matrices*

$$\Sigma_i = \sum_{\mathbf{x} \in \Pi_i} (\mathbf{x} - \mu_i)(\mathbf{x} - \mu_i)^T \tag{4}$$

which are defined to quantify the quality of the cluster. Since LDA is mostly used in ASR in a class-independent manner, we define the *within-class covariance matrix* as the sum of all class covariance matrices

$$\Sigma_W = \frac{1}{n} \sum_{i=1}^{k} \Sigma_i = \frac{1}{n} \sum_{i=1}^{k} \sum_{\mathbf{x} \in \Pi_i} (\mathbf{x} - \mu_i)(\mathbf{x} - \mu_i)^T. \tag{5}$$

To quantify the covariance between classes, the *between-class covariance matrix* is used. It is defined as:

$$\Sigma_B = \frac{1}{n} \sum_{i=1}^{k} \sum_{\mathbf{x} \in \Pi_i} (\mu_i - \mu)(\mu_i - \mu)^T, \tag{6}$$

where


$$\mu = \frac{1}{n} \sum\_{i=1}^{k} \sum\_{\mathbf{x} \in \Pi\_{i}} \mathbf{x} \tag{7}$$

is the *global mean vector* (computed disregarding the class label information). Note that the variable $\mathbf{x}$ in speech recognition represents a *supervector* created by concatenating acoustic vectors computed on successive speech frames. To build a supervector of $J$ acoustic vectors ($J$ is typically 3, 5, 7, 9 or 11 frames), the vector $\mathbf{x}_j$ at the current position $j$ is spliced together with $\frac{J-1}{2}$ vectors on the left and right as

$$\mathbf{x} = \begin{bmatrix} \mathbf{x}[j - \frac{J-1}{2}] & \dots & \mathbf{x}[j] & \dots & \mathbf{x}[j + \frac{J-1}{2}] \end{bmatrix}. \tag{8}$$
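Equation 8 amounts to stacking each frame with its neighbours; a small NumPy sketch of the splicing, assuming 13-dimensional acoustic vectors and J = 5 (illustrative values; edge frames are simply skipped here):

```python
import numpy as np

def splice(frames: np.ndarray, J: int) -> np.ndarray:
    """Build supervectors by stacking each frame with its (J-1)/2
    left and right neighbours (Equation 8); edge frames are skipped."""
    half = (J - 1) // 2
    n = frames.shape[1]
    # Column-major flattening keeps the frame order x[j-half], ..., x[j+half].
    cols = [frames[:, j - half:j + half + 1].reshape(-1, order="F")
            for j in range(half, n - half)]
    return np.stack(cols, axis=1)

frames = np.arange(13 * 10, dtype=float).reshape(13, 10)  # 10 dummy frames
X = splice(frames, J=5)
print(X.shape)   # (65, 6): 13*J-dimensional supervectors
```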

It should be noted that when the length of the supervector was greater than the number of classes (13 × *J* > *k*, where *J* ≥ 5, *k* = 45), the between-class covariance matrix became singular or close to singular. This resulted in an eigendecomposition with a complex-valued transformation matrix, which was undesirable.


Therefore, for these cases we used a modified computation of $\Sigma_B$ according to [7] as follows:

$$\tilde{\boldsymbol{\Sigma}}\_B = \frac{1}{n} \sum\_{i=1}^n (\mathbf{x}\_i - \boldsymbol{\mu})(\mathbf{x}\_i - \boldsymbol{\mu})^T. \tag{9}$$



This way of computation can be interpreted as a finer estimation of Σ*<sup>B</sup>* because each training supervector contributes to a final estimation of Σ*<sup>B</sup>* (more data points are used) in comparison with the estimation represented by Equation 6.

The given covariance matrices are used to formulate the optimization criterion for LDA, which tries to maximize the between-class scatter (covariance) over the within-class scatter (covariance). It can be shown that the covariance matrices resulting from the linear transformation $W$ (in the $p$-dimensional space) become $\Sigma_B^p = W^T \Sigma_B W$ and $\Sigma_W^p = W^T \Sigma_W W$. The objective function can be defined as

$$f(W) = \frac{|\Sigma_B^p|}{|\Sigma_W^p|} = \frac{|W^T \Sigma_B W|}{|W^T \Sigma_W W|}. \tag{10}$$

This optimization problem is equivalent to the generalized eigenvalue problem

$$
\Sigma\_B \mathbf{v} = \lambda \Sigma\_W \mathbf{v}, \text{ for } \lambda \neq 0,\tag{11}
$$

where **v** is a square matrix of eigenvectors and *λ* represents the eigenvalues. The solution can be obtained by applying an eigendecomposition to the matrix

$$
\Sigma\_W^{-1} \Sigma\_B. \tag{12}
$$

The reduced representation *Wp* of *W* is made by choosing *p* eigenvectors corresponding to *p* largest eigenvalues.
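Putting Equations 3, 5, 6 and 12 together, the estimation of the transform can be sketched in NumPy as below; the two-class synthetic data, the dimensions and the helper name `lda_transform` are illustrative assumptions, not the chapter's Slovak ASR setup:

```python
import numpy as np

def lda_transform(X: np.ndarray, labels: np.ndarray, p: int) -> np.ndarray:
    """Estimate the N x p LDA matrix W from data X (N x n) and class labels."""
    n = X.shape[1]
    mu = X.mean(axis=1, keepdims=True)                    # global mean (Eq. 7)
    Sw = np.zeros((X.shape[0], X.shape[0]))
    Sb = np.zeros_like(Sw)
    for c in np.unique(labels):
        Xi = X[:, labels == c]
        mui = Xi.mean(axis=1, keepdims=True)              # class mean (Eq. 3)
        d = Xi - mui
        Sw += d @ d.T / n                                 # within-class (Eq. 5)
        Sb += Xi.shape[1] * (mui - mu) @ (mui - mu).T / n # between-class (Eq. 6)
    # Eigendecomposition of Sw^{-1} Sb (Eq. 12); keep the p leading eigenvectors.
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[:p]].real

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0, 1, (4, 50)), rng.normal(3, 1, (4, 50))])
labels = np.repeat([0, 1], 50)
W = lda_transform(X, labels, p=2)
Y = W.T @ X                                               # projection (Eq. 2)
print(W.shape, Y.shape)   # (4, 2) (2, 100)
```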

#### **2.2. Class definition in LDA**

Since LDA is a supervised method, it needs additional information about the class structure of the training data. In the past few years, several choices for LDA class definition in ASR were proposed and experimentally investigated. For small vocabulary phoneme-based ASR systems LDA yielded an improvement with phone level conventional class definition [4, 8]. In these cases the Viterbi-trained context independent phonemes are used as classes. For HMM-based recognizers the time-aligned HMM states can define the classes [14]. Another reasonable method is to use the subphone levels as LDA classes [15]. We showed in our work [17] that an alternative phonetic class definition based on phonetic segmentation can lead to improvement.

For large vocabulary phoneme-based ASR systems there exist several ways to define the classes. One might argue that the conventional phone-level definition is the appropriate one. For triphone-based recognizers the context-dependent or context-independent triphones can be used [13] or the tied states in context dependent acoustic models [6].

In this work we used the conventional phone-level classes for LDA and 2DLDA. The phonetic segmentation was obtained from embedded training and automatic phone alignment (see Section 5.3). Thus, the number of classes in LDA-based experiments was identical with the number of phonemes and also with the number of trained monophone models. The disadvantage of the phone segmentation obtained from embedded training can be potentially the inaccuracy of the determined phone boundaries compared to the actual boundaries.

## **3. Two-Dimensional Linear Discriminant Analysis**


Linear Discriminant Analysis used as a feature extraction or dimension reduction method in applications with high-dimensional data may not always perform optimally. Especially when the dimension of the data exceeds the number of data points, the scatter matrices can become singular. This is known as the singularity or undersampled problem in LDA, which is its intrinsic limitation.

Two-Dimensional Linear Discriminant Analysis (hereinafter 2DLDA) [19] was primarily designed to overcome the singularity problem in classical LDA, which it does implicitly. The key difference between LDA and 2DLDA is in the data representation model. While conventional LDA works with a vectorized representation of the data, the 2DLDA algorithm works with data in matrix representation. Therefore, the data are collected as a set of matrices instead of a single large data matrix. This concept has been used, for example, in [18] for PCA.

It is known that the optimal transformation matrix in LDA can be obtained by applying an eigendecomposition to the scatter matrices. Generally, these matrices can be singular because they are estimated from high-dimensional data. In recent years, several approaches have been developed to solve such problems related to high-dimensional computing [10]. One of these approaches, called PCA+LDA, is a widely used two-stage algorithm, especially in face recognition [3]. All of the mentioned methods require the eigendecomposition of large matrices, which can degrade efficiency.

2DLDA alleviates the difficult computation of the eigendecomposition in methods discussed above. Since it works with matrices instead of high-dimensional supervectors (as in classical LDA), the eigendecomposition in 2DLDA is computed on matrices with much smaller sizes than in LDA. This reduces the processing time and memory costs of 2DLDA compared to LDA.

## **3.1. Mathematical description**

Let $A_i \in \mathbb{R}^{r \times c}$, $i \in \langle 1; n \rangle$, be the $n$ training speech signals in the corpus. Suppose there are $k$ classes $\Pi_1, \dots, \Pi_k$, where $\Pi_i$ has $n_i$ feature vectors. Let

$$M_i = \frac{1}{n_i} \sum_{X \in \Pi_i} X, \quad i \in \langle 1; k \rangle \tag{13}$$

be the mean of the *i*-th class and

$$M = \frac{1}{n} \sum\_{i=1}^{k} \sum\_{X \in \Pi\_i} X \tag{14}$$

be the global mean. In [19], for face recognition, $X$ originally represents a training image. For speech recognition, $X$ represents the concatenated acoustic vectors computed on successive speech frames [12]. In fact, $X$ is a matrix composed of acoustic vectors computed on successive speech frames; by analogy with the supervector, we can call this matrix a supermatrix.
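The difference between the two representations can be shown in a few lines of NumPy; the frame dimension 13 and J = 5 are only illustrative assumptions:

```python
import numpy as np

frames = np.random.default_rng(2).standard_normal((13, 9))  # dummy acoustic vectors
J = 5
j = 4                                   # current frame position
half = (J - 1) // 2

# LDA supervector: concatenation into one long column (13*J entries).
supervector = frames[:, j - half:j + half + 1].reshape(-1, order="F")

# 2DLDA supermatrix: the same J frames kept side by side as a 13 x J matrix.
supermatrix = frames[:, j - half:j + half + 1]

print(supervector.shape, supermatrix.shape)   # (65,) (13, 5)
```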

2DLDA considers an $(l_1 \times l_2)$-dimensional space $\mathcal{L} \otimes \mathcal{R}$, which is a tensor product of the space $\mathcal{L}$ spanned by the vectors $\{u_i\}_{i=1}^{l_1}$ and the space $\mathcal{R}$ spanned by the vectors $\{v_i\}_{i=1}^{l_2}$. Since in 2DLDA the speech is considered as a two-dimensional element, two transformation matrices are defined: $L = [u_1, \dots, u_{l_1}]$, $L \in \mathbb{R}^{r \times l_1}$, and $R = [v_1, \dots, v_{l_2}]$, $R \in \mathbb{R}^{c \times l_2}$. These matrices map each $A_i \in \mathbb{R}^{r \times c}$ to a matrix $B_i \in \mathbb{R}^{l_1 \times l_2}$ as:

$$B_i = L^T A_i R, \quad i \in \langle 1; n \rangle. \tag{15}$$

For a fixed $L$, the optimal $R$ maximizes the criterion

$$\max_R \, \mathrm{trace}\left( \left( R^T S_w^L R \right)^{-1} \left( R^T S_b^L R \right) \right). \tag{23}$$

This problem can be solved as an eigenvalue problem:

$$S_b^L \mathbf{x} = \lambda S_w^L \mathbf{x}. \tag{24}$$

The optimal $R$ can then be obtained by applying an eigendecomposition to the matrix resulting from:

$$\left( S_w^L \right)^{-1} S_b^L. \tag{25}$$

It should be noted that the sizes of the scatter matrices in 2DLDA are much smaller than those in LDA. Specifically, the size of $S_w^R$ and $S_b^R$ is $r \times r$ and the size of $S_w^L$ and $S_b^L$ is $c \times c$.

## **3.2. Pseudocode of 2DLDA algorithm**

1. Compute the mean $M_i$ of the $i$th class for each $i$ as $M_i = \frac{1}{n_i} \sum_{X \in \Pi_i} X$;
2. Compute the global mean as $M = \frac{1}{n} \sum_{i=1}^{k} \sum_{X \in \Pi_i} X$;
3. $R_0 \leftarrow$ identity matrix;
4. For $j$ from 1 to $I$:
5. Compute

$$S_w^R \leftarrow \sum_{i=1}^{k} \sum_{X \in \Pi_i} (X - M_i) R_{j-1} R_{j-1}^T (X - M_i)^T, \tag{26}$$

$$S_b^R \leftarrow \sum_{i=1}^{k} n_i (M_i - M) R_{j-1} R_{j-1}^T (M_i - M)^T; \tag{27}$$

6. Compute the first $l_1$ eigenvectors $\{\phi_l^L\}_{l=1}^{l_1}$ of $(S_w^R)^{-1} S_b^R$;
7. $L_j \leftarrow [\phi_1^L, \dots, \phi_{l_1}^L]$;
8. Compute

$$S_w^L \leftarrow \sum_{i=1}^{k} \sum_{X \in \Pi_i} (X - M_i)^T L_j L_j^T (X - M_i), \tag{28}$$

$$S_b^L \leftarrow \sum_{i=1}^{k} n_i (M_i - M)^T L_j L_j^T (M_i - M); \tag{29}$$

9. Compute the first $l_2$ eigenvectors $\{\phi_l^R\}_{l=1}^{l_2}$ of $(S_w^L)^{-1} S_b^L$ and set $R_j \leftarrow [\phi_1^R, \dots, \phi_{l_2}^R]$;
10. End for;
11. $L \leftarrow L_I$, $R \leftarrow R_I$;
12. $B_l \leftarrow L^T A_l R$, for $l = 1, \dots, n$;
13. return($L$, $R$, $B_1, \dots, B_n$).

The most time-consuming steps in 2DLDA are lines 5, 8 and 12. The algorithm depends on the initial choice of $R_0$; in [19] it was shown and recommended to choose an identity matrix as $R_0$.

Since it is difficult to compute the optimal *L* and *R* simultaneously, [19] derived an iterative algorithm which, for a fixed *R*, computes the optimal *L*; with the computed *L*, *R* can then be updated. This procedure is repeated several times. As in classical LDA, the scatter matrices are computed similarly, but in a two-dimensional formulation. Note that 2DLDA defines two within-class scatter matrices, *S*<sup>*R*</sup><sub>*w*</sub> and *S*<sup>*L*</sup><sub>*w*</sub>, and two between-class scatter matrices, *S*<sup>*R*</sup><sub>*b*</sub> and *S*<sup>*L*</sup><sub>*b*</sub>, concurrently. The scatter matrices coupled with *R* are defined as follows:

$$S\_w^R = \sum\_{i=1}^k \sum\_{X \in \Pi\_i} (X - M\_i) R R^T (X - M\_i)^T, \tag{16}$$

$$S\_b^R = \sum\_{i=1}^k n\_i(M\_i - M)RR^T(M\_i - M)^T. \tag{17}$$

For fixed *R*, *L* can be then computed by solving an optimization problem:

$$\max\_{L} \text{trace} \Big( \left( L^T S\_w^R L \right)^{-1} \left( L^T S\_b^R L \right) \Big). \tag{18}$$

This problem can be solved as an eigenvalue problem:

$$S\_w^R \mathbf{x} = \lambda S\_b^R \mathbf{x}. \tag{19}$$

*L* can then be obtained, in a similar way as in LDA, by applying an eigendecomposition to the matrix resulting from:

$$\left(\mathcal{S}\_w^R\right)^{-1}\mathcal{S}\_b^R.\tag{20}$$

Scatter matrices coupled with *L* are defined as follows:

$$S\_w^L = \sum\_{i=1}^k \sum\_{X \in \Pi\_i} (X - M\_i)^T L L^T (X - M\_i), \tag{21}$$

$$S\_b^L = \sum\_{i=1}^k n\_i (M\_i - M)^T L L^T (M\_i - M). \tag{22}$$

In this way, with the obtained *L*, the optimal *R* can be computed by solving an optimization problem:


$$\max\_{R} \text{trace}\left(\left(R^T S\_w^L R\right)^{-1}\left(R^T S\_b^L R\right)\right). \tag{23}$$

This problem can be solved as an eigenvalue problem:

$$S\_w^L \mathbf{x} = \lambda S\_b^L \mathbf{x}.\tag{24}$$

The optimal *R* can then be obtained by applying an eigendecomposition to the matrix resulting from:

$$\left(\mathcal{S}\_w^L\right)^{-1}\mathcal{S}\_b^L.\tag{25}$$

It should be noted that the sizes of the scatter matrices in 2DLDA are much smaller than those in LDA. Specifically, the size of *S*<sup>*R*</sup><sub>*w*</sub> and *S*<sup>*R*</sup><sub>*b*</sub> is *r* × *r* and the size of *S*<sup>*L*</sup><sub>*w*</sub> and *S*<sup>*L*</sup><sub>*b*</sub> is *c* × *c*.

#### **3.2. Pseudocode of 2DLDA algorithm**

1. Compute the mean $M\_i$ of the $i$th class for each $i$ as $M\_i = \frac{1}{n\_i} \sum\_{X \in \Pi\_i} X$;
2. Compute the global mean as $M = \frac{1}{n} \sum\_{i=1}^{k} \sum\_{X \in \Pi\_i} X$;
3. $R\_0 \leftarrow$ identity matrix;
4. For $j$ from 1 to $I$
5. Update the scatter matrices coupled with $R$:

$$S\_w^R \leftarrow \sum\_{i=1}^k \sum\_{X \in \Pi\_i} (X - M\_i) R\_{j-1} R\_{j-1}^T (X - M\_i)^T, \tag{26}$$

$$S\_b^R \leftarrow \sum\_{i=1}^{k} n\_i (M\_i - M) R\_{j-1} R\_{j-1}^T (M\_i - M)^T; \tag{27}$$

6. Compute the first $l\_1$ eigenvectors $\{\phi\_l^L\}\_{l=1}^{l\_1}$ of $(S\_w^R)^{-1} S\_b^R$;
7. $L\_j \leftarrow [\phi\_1^L, \ldots, \phi\_{l\_1}^L]$;
8. Update the scatter matrices coupled with $L$:

$$S\_w^L \leftarrow \sum\_{i=1}^k \sum\_{X \in \Pi\_i} (X - M\_i)^T L\_j L\_j^T (X - M\_i), \tag{28}$$

$$S\_b^L \leftarrow \sum\_{i=1}^{k} n\_i (M\_i - M)^T L\_j L\_j^T (M\_i - M); \tag{29}$$

9. Compute the first $l\_2$ eigenvectors $\{\phi\_l^R\}\_{l=1}^{l\_2}$ of $(S\_w^L)^{-1} S\_b^L$ and set $R\_j \leftarrow [\phi\_1^R, \ldots, \phi\_{l\_2}^R]$;
10. End for
11. $L \leftarrow L\_I$, $R \leftarrow R\_I$;
12. $B\_l \leftarrow L^T A\_l R$, for $l = 1, \ldots, n$;
13. return($L$, $R$, $B\_1, \ldots, B\_n$).

The most time consuming steps of the 2DLDA computation are lines 5, 8 and 13. The algorithm depends on the initial choice of $R\_0$. In [19] it was shown that an identity matrix is a good choice for $R\_0$, and this choice is recommended.
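The iterative procedure above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `two_dlda` and its argument layout (a list of supermatrices plus a class label per sample) are invented for this sketch, and the within-class scatter matrices are assumed non-singular.

```python
import numpy as np

def two_dlda(samples, labels, l1, l2, iters=1):
    """Sketch of the iterative 2DLDA algorithm from [19].

    samples: list of r x c supermatrices A_i, labels: class label per sample,
    l1, l2: target dimensions of the projected matrices B_i.
    """
    classes = sorted(set(labels))
    r, c = samples[0].shape
    # Steps 1-2: class means M_i and the global mean M.
    class_means = {k: np.mean([A for A, y in zip(samples, labels) if y == k], axis=0)
                   for k in classes}
    M = np.mean(samples, axis=0)
    n_i = {k: sum(1 for y in labels if y == k) for k in classes}

    def top_eigvecs(Sw, Sb, dim):
        # Leading eigenvectors of Sw^{-1} Sb (cf. Equations 20 and 25);
        # assumes Sw is non-singular.
        vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
        order = np.argsort(-vals.real)
        return vecs[:, order[:dim]].real

    R = np.eye(c)              # Step 3: R_0 is the identity matrix.
    for _ in range(iters):     # Step 4.
        # Step 5: scatter matrices coupled with R (Equations 26-27).
        Sw_R = sum((A - class_means[y]) @ R @ R.T @ (A - class_means[y]).T
                   for A, y in zip(samples, labels))
        Sb_R = sum(n_i[k] * (class_means[k] - M) @ R @ R.T @ (class_means[k] - M).T
                   for k in classes)
        L = top_eigvecs(Sw_R, Sb_R, l1)   # Steps 6-7.
        # Step 8: scatter matrices coupled with L (Equations 28-29).
        Sw_L = sum((A - class_means[y]).T @ L @ L.T @ (A - class_means[y])
                   for A, y in zip(samples, labels))
        Sb_L = sum(n_i[k] * (class_means[k] - M).T @ L @ L.T @ (class_means[k] - M)
                   for k in classes)
        R = top_eigvecs(Sw_L, Sb_L, l2)   # Step 9.
    # Step 12: project every supermatrix, B_i = L^T A_i R (Equation 15).
    return L, R, [L.T @ A @ R for A in samples]
```

Note how each returned *B<sub>i</sub>* has the reduced dimension *l*<sub>1</sub> × *l*<sub>2</sub>, while the scatter matrices handled inside the loop are only *r* × *r* and *c* × *c*, which is the size advantage over classical LDA discussed above.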

#### **4. Principal component analysis**

Principal component analysis (PCA) [9] is a linear feature transformation and dimensionality reduction method which maps *n*-dimensional, possibly correlated input data to *K*-dimensional (*K* < *n*) linearly uncorrelated variables (mutually independent principal components) with respect to the variability. PCA converts the data by a linear orthogonal transformation using the first few principal components, which usually represent about 80% of the overall variance. The principal component basis minimizes the mean square error of approximating the data. This linear basis can be obtained by applying an eigendecomposition to the global covariance matrix estimated from the original data.

#### **4.1. Mathematical description**

The characteristic mathematical stages of PCA can be briefly described as follows [2, 9]. First, suppose that the training data are represented by *M* *n*-dimensional feature vectors **x**1, **x**2, ..., **x***M*. One of the integral parts of PCA is the centering of all vectors (subtracting the mean) as:

$$\Phi\_i = \mathbf{x}\_i - \bar{\mathbf{x}}, \quad i \in \langle 1; M \rangle, \tag{30}$$

where

$$\bar{\mathbf{x}} = \frac{1}{M} \sum\_{i=1}^{M} \mathbf{x}\_i \tag{31}$$


is the training mean vector. From the centered vectors **Φ***<sup>i</sup>* the centered data matrix with dimension *n* × *M* is created as:

$$A = [\Phi\_1 \Phi\_2 \dots \Phi\_M]. \tag{32}$$

To represent the variance of the data across different dimensions, the global covariance matrix is computed as:

$$C = \frac{1}{M-1} \sum\_{i=1}^{M} \Phi\_i \Phi\_i^T = \frac{1}{M-1} \sum\_{i=1}^{M} (\mathbf{x}\_i - \bar{\mathbf{x}})(\mathbf{x}\_i - \bar{\mathbf{x}})^T = \frac{1}{M-1} A A^T. \tag{33}$$

An eigendecomposition is applied to the covariance matrix in order to obtain its eigenvectors **u**1, **u**2, ..., **u***n* and corresponding eigenvalues *λ*1, *λ*2, ..., *λn*, which satisfy the linear equation:

$$C \mathbf{u}\_i = \lambda\_i \mathbf{u}\_i, \quad i \in \langle 1; n \rangle. \tag{34}$$

The principal components are determined by *K* leading eigenvectors resulting from the decomposition. The dimensionality reduction step is performed by keeping only the eigenvectors corresponding to the *K* largest eigenvalues (*K* < *n*). These eigenvectors form the transformation matrix *UK* with dimension *n* × *K*:

$$U\_K = [\mathbf{u}\_1 \mathbf{u}\_2 \dots \mathbf{u}\_K], \tag{35}$$

where *λ*<sup>1</sup> > *λ*<sup>2</sup> > ... > *λn*. Finally, the linear transformation **R***<sup>n</sup>* → **R***<sup>K</sup>* is computed according to Equation (1) as:

$$\mathbf{y}\_i = U\_K^T \Phi\_i = U\_K^T (\mathbf{x}\_i - \bar{\mathbf{x}}), \quad i \in \langle 1; M \rangle, \tag{36}$$

where **y***<sup>i</sup>* represents the transformed feature vector. The value of *K* can be chosen as needed or according to the following comparative criterion:

$$\frac{\sum\_{i=1}^{K} \lambda\_i}{\sum\_{i=1}^{n} \lambda\_i} > T, \tag{37}$$

where the threshold *T* ∈ ⟨0.9; 0.95⟩. Since


$$\sum\_{i=1}^{n} \lambda\_i = \text{trace}(C), \tag{38}$$

the comparative criterion can be rewritten as:

$$\frac{\sum\_{i=1}^{K} \lambda\_i}{\text{trace}(C)} > T. \tag{39}$$
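The stages above (Equations 30–39) can be sketched compactly in NumPy. The function names `pca_fit` and `pca_transform` are illustrative, not part of the chapter; the sketch assumes the data matrix stores the *M* feature vectors as columns.

```python
import numpy as np

def pca_fit(X, T=0.95):
    """Sketch of PCA training (Equations 30-39).

    X: n x M data matrix (M n-dimensional feature vectors as columns),
    T: variance threshold for choosing K (Equation 39).
    Returns the training mean, the n x K transformation matrix U_K and K.
    """
    n, M = X.shape
    mean = X.mean(axis=1, keepdims=True)      # Equation 31: training mean vector.
    A = X - mean                               # Equations 30 and 32: centering.
    C = (A @ A.T) / (M - 1)                    # Equation 33: covariance matrix.
    vals, vecs = np.linalg.eigh(C)             # Equation 34 (C is symmetric).
    order = np.argsort(vals)[::-1]             # Sort eigenpairs descending.
    vals, vecs = vals[order], vecs[:, order]
    # Equation 39: smallest K whose leading eigenvalues explain more than T.
    K = int(np.searchsorted(np.cumsum(vals) / vals.sum(), T) + 1)
    return mean, vecs[:, :K], K

def pca_transform(X, mean, U_K):
    """Equation 36: project centered vectors into the PCA feature space."""
    return U_K.T @ (X - mean)
```

In the experiments described next, *K* was in fact chosen independently of this criterion; the threshold-based choice is shown here only to make Equation 39 concrete.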

#### **4.2. Classical PCA in ASR**

In this section we describe PCA trained on the whole amount of training data (see Section 5.1). Two kinds of input data for PCA were used: the first was represented by 26-dimensional LMFE features and the second by 13-dimensional MFCCs. Each parametrized speech signal in the corpus is represented by an LMFE or MFCC matrix *X*(*i*), *i* ∈ ⟨1; *N*⟩, with dimension 26 × *ni* (or 13 × *ni*, see Section 5.2), where *ni* represents the number of frames in the *i*-th recording and *N* represents the number of training speech signals (*N* = 36917).

At the first stage, the initial data preparation is performed, which requires the mathematical computations described by Equations 30–32. The global covariance matrix is computed according to Equation 33 and then decomposed into a set of eigenvector-eigenvalue pairs. The eigenvectors corresponding to the *K* largest eigenvalues were chosen; they formed the transformation matrix *UK* (see Equation 35), which was used to transform the train and test corpora into the PCA feature space.

Note that the final dimension (*K*) of the feature vectors after the PCA transformation was chosen independently of the criterion formula (Equation 37). Detailed reasons are given in Sections 5.2 and 6.3. For reference, the optimal dimensions determined for different PCA configurations by Equation 37 are listed in Section 6.3.

#### **4.3. Partial-data trained PCA**

In the case of a relatively small training corpus, there is no problem computing the covariance matrix. But in the case of large corpora (thousands of recordings) and high-dimensional data, problems related to processing time (≈ several hours) and memory requirements (≈ 20 GB) may occur. We found that it is not necessary to use the whole training data for PCA learning; a part of it may be sufficient [16]. In other words, PCA can be trained on a limited (reduced) amount of training data while the performance is maintained, or even improved. We call this procedure *Partial-data trained PCA*.

Partial-data PCA training can be viewed as a kind of feature selection process. The main idea is to select the statistically significant data (feature vectors) from the whole amount of training data. There are two major processing stages. The first stage is the data selection based on PCA separately applied to all training feature vectors. Suitable vectors are concatenated into one train matrix, which is treated as the input for the main PCA. The second stage is the main PCA (see Section 4.1).

Suppose now that the same conditions as in Section 4.1 apply. Then the selection process based on PCA (without the projection phase) can be described as follows. Each 26-dimensional LMFE (or 13-dimensional MFCC) feature vector **x***i*, *i* ∈ ⟨1; *M*⟩ (see Section 5.2), is reshaped to its matrix version *Xi*, *i* ∈ ⟨1; *M*⟩, with dimension 2 × 13 (in the case of MFCC vectors, the 13-dimensional vector was extended with a zero coefficient in order to reshape it to a matrix with dimension 2 × 7). After mean subtraction, the covariance matrix is computed as:

$$\mathbf{C}\_{i} = \frac{1}{k-1} \mathbf{X}\_{i} \mathbf{X}\_{i}^{T}, \quad i \in \langle 1; M \rangle; \, k = 13 \text{ (for MFCC, } k = 7\text{)}. \tag{40}$$

In the next step, an eigendecomposition is performed on each covariance matrix *Ci*, which results, for each *i*, in a pair of eigenvectors **w***i*1, **w***i*<sup>2</sup> and eigenvalues *αi*1, *αi*2:

$$C\_i \mathbf{w}\_{ij} = \alpha\_{ij} \mathbf{w}\_{ij}, \quad i \in \langle 1; M \rangle, \ j \in \langle 1; 2 \rangle, \tag{41}$$

where

$$\mathcal{W}\_{i} = [\mathbf{w}\_{i1}\mathbf{w}\_{i2}].\tag{42}$$


Note that the parameters **w***i*1, **w***i*<sup>2</sup> and *αi*1, *αi*<sup>2</sup> at each iteration *i* are updated with new parameters resulting from a new eigendecomposition. For PCA-based selection the eigenvectors **w***i*1, **w***i*<sup>2</sup> are not used. On the other hand, the eigenvalues *αi*1, *αi*<sup>2</sup> are the key elements because the selective criterion is based exactly on them. Using these eigenvalues, the percentage proportion *Pi* is computed as:

$$P\_i = \frac{\alpha\_{i1}}{\sum\_{j=1}^{2} \alpha\_{ij}} = \frac{\alpha\_{i1}}{\alpha\_{i1} + \alpha\_{i2}} = \frac{\alpha\_{i1}}{\text{trace}(C\_i)}, \tag{43}$$

which determines the percentage of the variance explained by the first eigenvalue in the eigenspectrum. Further, it is necessary to choose a threshold *T*. It can be chosen from two different intervals: the first is defined as *T*<sup>1</sup> ∈ (50; ≈65⟩ and the second as *T*<sup>2</sup> ∈ ⟨≈85; 99.9⟩. Then the selective criterion can be based on the following logical expressions:

$$P\_l \le T\_1 \tag{44}$$

for the first interval, or

$$P\_l \ge T\_2 \tag{45}$$

for the second interval. If the evaluation of the expression yields logical true, then the current feature vector is classified as statistically significant for PCA training. This vector is stored and the selection continues with the next vector. In this way, the whole training corpus is processed. From the selected vectors, a training matrix is composed, which is treated as the input for the main PCA described in Section 4.1. As was mentioned in Section 4.1, there are *M* training vectors in the corpus. If the selected subset contains *M*′ vectors (*M*′ ≪ *M*), then Equation 32 can be modified as:

$$A' = [\phi\_1 \phi\_2 \dots \phi\_{M'}], \tag{46}$$

where *φi* is the mean-subtracted feature vector in the new train matrix. The next mathematical computations are identical to Equations 33–36. The partial-data training procedure for LMFE feature vectors is illustrated in Figure 1. Note that for MFCC-based partial-data PCA the figure would be analogous.

**Figure 1.** Block diagram of the partial-data PCA training procedure
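The selection stage described above (Equations 40–45) can be sketched as follows. The function name `select_partial_data` and its arguments are our illustration, not the chapter's code; the proportion *P<sub>i</sub>* is expressed in percent here so that it is comparable with the thresholds *T*<sub>1</sub> and *T*<sub>2</sub>.

```python
import numpy as np

def select_partial_data(vectors, T, interval="upper"):
    """Sketch of the PCA-based data selection stage (Equations 40-45).

    vectors: iterable of 26-dimensional LMFE feature vectors,
    T: threshold in percent (T1 for "lower", T2 for "upper"),
    interval: "lower" applies P_i <= T1, "upper" applies P_i >= T2.
    """
    selected = []
    for x in vectors:
        X = x.reshape(2, 13)                      # Vector reshaping to 2 x 13.
        X = X - X.mean(axis=1, keepdims=True)     # Mean subtraction.
        C = (X @ X.T) / (13 - 1)                  # Equation 40 (k = 13 for LMFE).
        alphas = np.linalg.eigvalsh(C)[::-1]      # Eigenvalues, descending (Eq. 41).
        P = 100.0 * alphas[0] / alphas.sum()      # Equation 43, as a percentage.
        keep = P <= T if interval == "lower" else P >= T   # Equations 44-45.
        if keep:
            selected.append(x)
    # The selected vectors are stacked into the train matrix A' (Equation 46).
    return np.column_stack(selected) if selected else np.empty((26, 0))
```

The returned matrix is then centered and processed by the main PCA of Section 4.1 in place of the full data matrix *A*.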

The new train matrix can be viewed as a radically reduced, more relevant representation of the training corpus. It has a nearly homoscedastic variance structure because it contains only those feature vectors which have almost the same variance distribution. Feature vectors selected from the interval represented by threshold *T*<sup>1</sup> can be characterized as data clusters with a very small variance distribution, explained by the first eigenvalue, along the direction of the corresponding first eigenvector. On the other hand, the feature vectors from the interval represented by threshold *T*<sup>2</sup> are clusters with a large variance distribution along the first eigenvector. In both cases, the magnitude of the variance is determined by the first eigenvalue. The size of the selected partial data set depends on the value of *T*<sup>1</sup> or *T*2. The size of the partial set can be expressed as a percentage as:

$$\text{subset\_size} = \frac{M'}{M} \times 100. \tag{47}$$

We found that the ratio is of practical importance when

$$\frac{M'}{M} \in \langle 0.001; 0.15 \rangle,\tag{48}$$

so the selected subset contains at most 15% of the whole training data. For example, there are approximately 19 million training vectors in our corpus. According to Equation 48, it is sufficient to extract ≈ 19000 vectors for partial-data training. But, as will be shown in Section 6.3.2, this argument does not apply to all cases. The time consumption and memory costs of the covariance matrix computation for the reduced data set are much smaller than the costs of the covariance matrix computation in the case of the whole corpus. In the case of partial-data training, it is necessary to allocate memory only for the one investigated feature vector and for the other data elements of the mathematical computations. These memory requirements are of the order of units of megabytes. In other words, the advantage of partial-data training is that it does not require loading the whole data matrix into main memory.


## **5. Speech corpus and experimental conditions**

## **5.1. Speech corpus**

All experiments were evaluated using the Slovak speech corpus *ParDat1* [5], which contains approximately 100 hours of spontaneous parliamentary speech recorded from 120 speakers (90% men). Exactly 36917 training utterances were used for acoustic modeling, and 884 utterances were used for testing.

## **5.2. Speech preprocessing**

The speech signal was preemphasized and windowed using a Hamming window. The window size was set to 25 ms and the step size to 10 ms. A fast Fourier transform was applied to the windowed segments. Mel-filterbank analysis with 26 channels was followed by applying the logarithm to the linear filter outputs. This processing resulted in 26-dimensional LMFE features, which were used for the PCA-based processing.
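A minimal NumPy sketch of this LMFE front-end; the pre-emphasis coefficient 0.97 and the 512-point FFT are assumed values not stated above:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular mel filters (sampling rate assumed to be 16 kHz)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def lmfe(signal, sr=16000, win=0.025, step=0.010, n_fft=512, n_filters=26):
    """25 ms Hamming windows, 10 ms step, pre-emphasis, power spectrum,
    26-channel mel filterbank, log -> 26-dimensional LMFE vectors."""
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # assumed coeff.
    wlen, hop = int(sr * win), int(sr * step)
    frames = [sig[i:i + wlen] * np.hamming(wlen)
              for i in range(0, len(sig) - wlen + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    fb = mel_filterbank(n_filters, n_fft, sr)
    return np.log(spec @ fb.T + 1e-10)   # floor avoids log(0)
```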

In the case of the MFCC baseline feature extraction, the LMFE vectors were further decorrelated by the discrete cosine transform (DCT). The first 12 MFCCs were retained and augmented with the 0-th coefficient. During acoustic modeling the first- and second-order derivatives were computed and added to the basic vectors. Thus, the final MFCC vectors were 39-dimensional.
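A sketch of this step, continuing from the LMFE vectors (HTK-style DCT; the ±2-frame delta regression window is an assumed detail):

```python
import numpy as np

def mfcc_from_lmfe(lmfe_frames, n_ceps=13):
    """HTK-style DCT-II decorrelation of LMFE rows; c0 plus the first
    12 cepstral coefficients give 13-dimensional MFCC vectors."""
    n = lmfe_frames.shape[1]
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n)[None, :]
    dct = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return lmfe_frames @ dct.T

def deltas(feats, width=2):
    """Regression deltas over a +/-`width` frame window (width assumed)."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(t * (padded[width + t: len(feats) + width + t]
                   - padded[width - t: len(feats) + width - t])
              for t in range(1, width + 1))
    return num / (2 * sum(t * t for t in range(1, width + 1)))

def mfcc39(lmfe_frames):
    """13 MFCC + delta + delta-delta -> 39-dimensional vectors."""
    c = mfcc_from_lmfe(lmfe_frames)
    d = deltas(c)
    return np.hstack([c, d, deltas(d)])
```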

For the LDA and 2DLDA-based processing, the 13-dimensional MFCC vectors were used as the input to these methods. To allow a fair comparison of recognition accuracy levels in the evaluation process, all LDA and 2DLDA models were trained using 39-dimensional LDA (2DLDA) vectors. In the evaluation, the 39-dimensional MFCC models were treated as the reference models, so the dimensions were identical. The number of classes *k* used in LDA and 2DLDA was identical to the number of phonetic classes in acoustic modeling (*k* = 45).

## **5.3. Acoustic modeling**

Our recognition system used context-independent monophones modeled by three-state left-to-right HMMs. The number of Gaussian mixtures per state was a power of 2, ranging from 1 to 256. The phone segmentation into 45 phones was obtained from embedded training and automatic phone alignment. The number of trained monophone models corresponded to the number of phonemes and to the basic classes for LDA and 2DLDA. For testing purposes a word lattice was created from a bigram language model. The language model was built from the test set and the vocabulary size was 125k. Feature extraction, HMM training and testing were performed using the HTK (Hidden Markov Model Toolkit) [20].
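The three-state left-to-right topology can be illustrated by an HTK-style transition matrix with non-emitting entry and exit states; the probability values below are illustrative, not the trained ones:

```python
import numpy as np

# 5 x 5 HTK-style transition matrix: row/column 0 is the non-emitting
# entry state, rows 1-3 are the three emitting states, row/column 4 is
# the non-emitting exit state. Values are illustrative placeholders.
trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0, 0.0],   # self-loop or advance (left-to-right)
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.4],   # last emitting state -> exit
    [0.0, 0.0, 0.0, 0.0, 0.0],   # exit state: no outgoing transitions
])
assert np.allclose(trans[:4].sum(axis=1), 1.0)  # rows are stochastic
```

Because the matrix is upper triangular, a state can never be revisited once left, which enforces the left-to-right constraint.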

## **5.4. Evaluation**


In order to evaluate the experiments we chose accuracy as the evaluation parameter. Accuracies were computed as the ratio of the number of word matches (output by the recognizer) to the number of reference words [20]. In all experiments the accuracy is given in percent.
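The word-match count is obtained from a minimum-edit-distance alignment of the recognized and reference word sequences, as HTK's HResults does. A minimal sketch (the tie-breaking rule among equal-cost alignments is an assumption):

```python
def accuracy(ref, hyp):
    """Word accuracy (%) = 100 * H / N, where H is the number of word
    matches on a minimum-edit-distance alignment and N = len(ref)."""
    n, m = len(ref), len(hyp)
    # cost[i][j]: edit distance of ref[:i] vs hyp[:j]; hits[i][j]: matches
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    hits = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            best = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
            cost[i][j] = best
            if best == sub:                       # match or substitution
                hits[i][j] = hits[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1])
            elif best == cost[i - 1][j] + 1:      # deletion
                hits[i][j] = hits[i - 1][j]
            else:                                 # insertion
                hits[i][j] = hits[i][j - 1]
    return 100.0 * hits[n][m] / n
```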

## **6. Experiments and results**

This section is a major part of the whole chapter. It provides a detailed and extensive experimental evaluation of the performance of the mentioned linear transformation methods and their combinations, presenting the recognition accuracy levels obtained with different experimental configurations.

## **6.1. Conventional LDA-based processing**

In this section, conventional LDA is investigated. The LDA-based statistical computation was performed according to the mathematical description in Equations 3–12 in Section 2.1. Note that each supervector composed according to Equation 8 was assigned the class label of the current basic vector **x**[*j*] at the center position *j*. In our experiments we tried 5 supervector lengths *J*: *J* = 3, 5, 7, 9 and 11. This means that the dimensions of the covariance matrices in the statistical estimation were 39 × 39, 65 × 65, 91 × 91, 117 × 117 and 143 × 143. As mentioned in Section 2.1, when the length of the supervector was greater than the number of classes, the between-class scatter (covariance) matrices were close to singular. For this reason, in these cases Σ*<sup>B</sup>* was computed according to Equation 9.
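The pipeline of this section can be sketched in a few lines of NumPy. This is a minimal sketch only (simple-concatenation supervectors, no regularization, and without the Equation 9 variant for near-singular cases):

```python
import numpy as np

def supervectors(frames, labels, J=3):
    """Stack J consecutive basic vectors into supervectors; each
    supervector inherits the class label of its centre frame (Eq. 8)."""
    h = J // 2
    sv = [frames[i - h:i + h + 1].ravel() for i in range(h, len(frames) - h)]
    return np.array(sv), labels[h:len(frames) - h]

def lda(X, y, p):
    """Leading p rows of the LDA transform from the within-class and
    between-class scatter of labelled supervectors (Eqs. 3-12, sketched)."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)    # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:p]].T                  # p x d transform
```

For 13-dimensional MFCC input and *J* = 3 this yields 39-dimensional supervectors and, for example, a 13 × 39 transformation matrix when `p=13`.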

#### *6.1.1. Supervector compositions and the scatter matrices*

It is known that covariance (scatter) matrices are in general symmetric, square, positive-definite regular matrices. These properties also apply to the matrices in LDA. Since in LDA the covariance matrices are computed from supervectors, a problem with the symmetry of these matrices may occur. We found that the symmetry depends on the way in which the supervectors are constructed. Figure 2 illustrates two types of supervector construction using an example vector length of 4. Subfigure (a) illustrates the classical construction of a supervector by simple concatenation. Subfigure (b) illustrates a construction in which the structure of the basic vectors is preserved in the final supervector. Thus, if the first few coefficients of the basic vector carry higher energy than the higher-order coefficients, the new supervector follows this tendency.

It should be noted that the arrangement of the coefficients in the supervector impacts the symmetry of the matrices, which can affect other properties. This is demonstrated in Figure 3, which shows the influence of the supervector construction on the symmetry of the scatter matrices. Figures 3 (a) and (c) represent the scatter matrices computed when the supervectors are constructed according to Figure 2 (a); it can be seen that these matrices are multisymmetric. On the other hand, the matrices in Figures 3 (b) and (d) are purely symmetric; they were computed from supervectors constructed according to Figure 2 (b).
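The two compositions of Figure 2 can be sketched for the 4-coefficient example as:

```python
import numpy as np

def compose_concat(vecs):
    """(a) simple concatenation: [x1; x2; ...; xJ]."""
    return np.concatenate(vecs)

def compose_interleaved(vecs):
    """(b) structure-preserving composition: the first coefficients of
    all basic vectors come first, then the second coefficients, and so
    on, so the high-energy low-order coefficients stay at the front."""
    return np.stack(vecs, axis=1).ravel()
```

For two basic vectors `[1, 2, 3, 4]` and `[5, 6, 7, 8]`, composition (a) gives `[1, 2, 3, 4, 5, 6, 7, 8]` while composition (b) gives `[1, 5, 2, 6, 3, 7, 4, 8]`.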

**Figure 2.** Different types of supervector composition; (a) composition by simple concatenation, (b) composition retaining the structure of the basic vectors

## *6.1.2. Between-class scatter matrix and the singularity*

As was mentioned in Section 2.1, the between-class scatter matrices for context lengths greater than *J* = 3 were computed according to Equation 9 instead of the classical Equation 6. Figure 4 (a) demonstrates that the between-class scatter matrix computed for context length *J* = 5 according to Equation 6 is not symmetric; it is computed from supervectors constructed according to Figure 2 (a). Figure 4 (b) illustrates a similar case, with the matrix computed from supervectors constructed according to Figure 2 (b). It can be seen that this matrix is only close to symmetric, which in the statistical estimation can result in a singular between-class matrix and complex-valued numbers in the LDA transformation matrix. Note that the symmetric between-class scatter matrices in Figure 3 were computed according to Equation 9.

#### *6.1.3. Results*

The experiments based on LDA can be divided into three categories according to the dimension of the LDA transformation matrix. The first category is represented by an LDA matrix of dimension 13 × 39; only the first 13 eigenvectors corresponding to the 13 leading eigenvalues were retained for the transformation, and the final feature dimension was expanded to 39 with Δ and ΔΔ coefficients. The second category is represented by an LDA matrix of dimension 19 × 39, so more LDA coefficients were used in the transformations; note that the final feature dimension was 38 (19 + Δ). The third category is represented by an LDA matrix of dimension 39 × 39, in which case the Δ and ΔΔ coefficients were not used. The difference between these three categories is that various numbers of dimensions and data-dependent versus data-independent Δ and ΔΔ coefficients were used for acoustic modeling: the lower-order LDA coefficients (14–39) can be viewed as Δ and ΔΔ coefficients estimated in a data-dependent manner. The experimental results for LDA are given in Table 1 and are analyzed separately for the mentioned categories.

1. The highest accuracies were achieved for 13 LDA coefficients expanded with Δ and ΔΔ coefficients and for *J* = 3. The maximum improvement compared to the MFCC model is +2.05% for 4 mixtures. Only for 1 mixture was no improvement achieved.

2. In the case of the LDA matrix of dimension 19 × 39 the improvement is lower than in the previous case. The performance was improved only for 2, 4, 8 and 256 mixtures. It can also be seen that for 256 mixtures the improvement was achieved at a higher context length. Note that the acoustic models in this experiment have a smaller dimension than the reference model (38 < 39).

3. The results in the last case, when the dimension of the LDA matrix was 39 × 39, are not satisfactory: in all cases, the performance decreased. We can conclude, however, that longer context lengths are suitable for higher dimensions of the transformation matrix (without Δ and ΔΔ).


(a) Within-class scatter matrix computed from supervectors obtained by simple concatenation

(b) Within-class scatter matrix computed from supervectors obtained by preserving the structure of the basic vectors

(c) Between-class scatter matrix computed from supervectors obtained by simple concatenation

(d) Between-class scatter matrix computed from supervectors obtained by preserving the structure of the basic vectors

**Figure 3.** Within-class and between-class scatter matrices computed from supervectors of length 65 composed in different ways


(a) Between-class scatter matrix computed from supervectors constructed according to Figure 2 (a)

(b) Between-class scatter matrix computed from supervectors constructed according to Figure 2 (b)

**Figure 4.** Close-to-symmetric between-class scatter matrices computed according to Equation 6 for context length *J* = 5

| Number of mixtures | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|
| MFCC model (39-dim.) | 82.32 | 83.26 | 85.06 | 87.77 | 89.53 | 90.83 | 91.48 | 92.37 | 92.50 |
| 13 LDA + Δ + ΔΔ (39-dim.) | 81.37 | 83.60 | 87.11 | 88.47 | 90.03 | 90.88 | 91.80 | 92.48 | 92.90 |
| Abs. difference | −0.95 | +0.34 | +2.05 | +0.70 | +0.50 | +0.05 | +0.32 | +0.11 | +0.40 |
| Context length *J* | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Supervector length | 39 | 39 | 39 | 39 | 39 | 39 | 39 | 39 | 39 |
| 19 LDA + Δ (38-dim.) | 82.02 | 83.46 | 85.97 | 88.27 | 89.32 | 90.45 | 91.37 | 92.18 | 92.65 |
| Abs. difference | −0.30 | +0.20 | +0.91 | +0.50 | −0.21 | −0.38 | −0.11 | −0.19 | +0.15 |
| Context length *J* | 3 | 3 | 3 | 3 | 5 | 7 | 5 | 5 | 5 |
| Supervector length | 39 | 39 | 39 | 39 | 65 | 91 | 65 | 65 | 65 |
| 39 LDA (39-dim.) | 79.82 | 81.41 | 83.31 | 85.13 | 86.83 | 88.19 | 89.10 | 89.98 | 90.69 |
| Abs. difference | −2.50 | −1.85 | −1.75 | −2.64 | −2.70 | −2.64 | −2.38 | −2.39 | −1.81 |
| Context length *J* | 3 | 3 | 5 | 7 | 7 | 5 | 5 | 7 | 7 |
| Supervector length | 39 | 39 | 65 | 91 | 91 | 65 | 65 | 91 | 91 |
| Max. accuracy of LDA | 82.02 | **83.60** | **87.11** | **88.47** | **90.03** | **90.88** | **91.80** | **92.48** | **92.90** |
| Max. abs. difference | −0.30 | **+0.34** | **+2.05** | **+0.70** | **+0.50** | **+0.05** | **+0.32** | **+0.11** | **+0.40** |

**Table 1.** Accuracy levels (%) for conventional LDA with different number of retained dimensions (13, 19 and 39) compared to baseline MFCC model

#### **6.2. 2DLDA-based processing**

In this section we extensively evaluate the performance of 2DLDA in different configurations and compare it with the reference MFCC model and also with the performance of conventional LDA reported in Section 6.1.3. The whole mathematical 2DLDA computation was performed according to Equations 13–25. The statistical estimations are similar to those in conventional LDA. The main difference is that two eigendecompositions must be computed and there are two transformation matrices, *L* and *R*. 2DLDA does not deal with supervectors as LDA does but with supermatrices, which are the basic data elements in 2DLDA (instead of vectors). These supermatrices were created by coupling the basic cepstral vectors together. As in LDA, we used 5 different sizes of supermatrices according to the number of contextual vectors (context size *J*): the sizes of the supermatrices were 13 × 3, 13 × 5, 13 × 7, 13 × 9 and 13 × 11. Consequently, the class means, global mean, within-class scatter matrices and between-class scatter matrices have corresponding sizes according to the current context length. For example, when the context size *J* was set to 7, 7 cepstral vectors were coupled together in the statistical estimation to form a 13 × 7 supermatrix, and the statistical estimators have the following dimensions:

• class means *M<sub>i</sub>* : 13 × 7,
• global mean *M* : 13 × 7,
• left within-class scatter matrix *S<sup>L</sup><sub>w</sub>* : 13 × 13,
• left between-class scatter matrix *S<sup>L</sup><sub>b</sub>* : 13 × 13,
• right within-class scatter matrix *S<sup>R</sup><sub>w</sub>* : 7 × 7,
• right between-class scatter matrix *S<sup>R</sup><sub>b</sub>* : 7 × 7,
• left transformation matrix *L* : 13 × 13,
• right transformation matrix *R* : 7 × 7.

The mathematical computations resulted in the transformations *L* and *R*. These matrices were then used to transform the whole speech corpus: each supermatrix created from the coupled vectors in a recording was transformed to its reduced version. The dimension reduction step was done by choosing the required sizes of *L* and *R*. In the next step, each transformed supermatrix was re-transformed to a vector according to the matrix-to-vector alignment. The specific dimensions used in the transformations are listed in Table 2. Since the mathematical part of 2DLDA is an iterative algorithm, it was necessary to set the number of iterations *I*. In [19] it is recommended to run the iteration loop only once (*I* = 1), which significantly reduces the total running time of the algorithm. In our 2DLDA experiments we ran the iteration loop three times (*I* = 3).

The results of 2DLDA can be divided into three categories, similarly to the case of LDA, and are given in Table 2.

1. The first category is represented by a vector of dimension 13 resulting from the transformation. The final dimension was 39 (13 2DLDA + Δ + ΔΔ coefficients). As can be seen from Table 2, this case resulted in the highest accuracies for 2DLDA, with context length *J* = 3.

2. The second category is represented by a vector of dimension 19. The final dimension was 38 (19 2DLDA + Δ coefficients). Note that, for example, in the case of a transformed supermatrix of dimension 10 × 2, the last coefficient in the matrix-to-vector alignment was ignored to obtain a vector of dimension 19. From Table 2 it can be seen that 2DLDA at this dimension does not perform successfully: the performance of the base MFCC model was not improved.

3. Similar conclusions as in the previous case apply to the third category. In these experiments the feature vector dimension was 39 (without Δ and ΔΔ coefficients).

The maximum improvement achieved by 2DLDA was +2.01% for context length *J* = 3 and one iteration (*I* = 1).
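The alternating estimation of *L* and *R* described above can be sketched as follows. This is a compact NumPy sketch after the iterative scheme of [19]; the identity initialization of *R* and the eigen-solver details are assumptions:

```python
import numpy as np

def _top_eigvecs(Sw, Sb, k):
    """Leading k eigenvectors of Sw^-1 Sb."""
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:k]]

def two_dlda(mats, y, p, q, iters=1):
    """Iterative 2DLDA on 13 x J supermatrices: alternately re-estimate
    the left transform L (13 x p) and the right transform R (J x q);
    a reduced supermatrix is L.T @ A @ R."""
    classes = np.unique(y)
    M = mats.mean(axis=0)                                  # global mean
    Mi = {c: mats[y == c].mean(axis=0) for c in classes}   # class means
    r, cdim = M.shape
    R = np.eye(cdim)[:, :q]                 # assumed initial right transform
    for _ in range(iters):
        # left scatters (13 x 13), using the current R
        SwL = sum((A - Mi[c]) @ R @ R.T @ (A - Mi[c]).T
                  for A, c in zip(mats, y))
        SbL = sum((y == c).sum() * (Mi[c] - M) @ R @ R.T @ (Mi[c] - M).T
                  for c in classes)
        L = _top_eigvecs(SwL, SbL, p)
        # right scatters (J x J), using the new L
        SwR = sum((A - Mi[c]).T @ L @ L.T @ (A - Mi[c])
                  for A, c in zip(mats, y))
        SbR = sum((y == c).sum() * (Mi[c] - M).T @ L @ L.T @ (Mi[c] - M)
                  for c in classes)
        R = _top_eigvecs(SwR, SbR, q)
    return L, R
```

The matrix-to-vector alignment of a transformed supermatrix is then simply `(L.T @ A @ R).ravel()`; choosing, for example, a 13 × 1 retained matrix for *J* = 3 reproduces the 13-dimensional first-category features.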


The full-data trained PCA was performed on a Linux machine with 32GB memory. The training data were loaded in the memory sequentially by data blocks and then concatenated to one data matrix (see Equation 32). From this matrix the covariance matrix according to Equation 33 was computed. Then the integral parts of PCA according to Equations 34-36 were performed. In the next step, the acoustic modeling based on the PCA transformed features was done. The evaluation results of the full-data trained PCA for LMFE features are listed in

Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition 149

The selective process for the feature vectors according to Fig. 1 was performed and *M*-times repeated. Overall, 10 partial-data trained models with LMFE features were learned. 5 models were learned for selection based on threshold *T*<sup>1</sup> and 5 ones for *T*2. For MFCC features apply an identical scheme. The parameters for these models are listed in the Table 3 and Table 4. According to Equation 48, 5 subset models (0.1%, 1%, 5%, 10% and 15%) were composed. Approx. DB size 0.1% 1% 5% 10% 15% Num. of vectors *M* 22229 187248 947804 1936764 2842838 Threshold *T*<sup>1</sup> 51.40 54.05 59.10 63.00 65.75 Opt. dimension *d d* = 8 *d* = 8 *d* = 8 *d* = 8 *d* = 9 Approx. DB size 0.1% 1% 5% 10% 15% Num. of vectors *M* 23547 206624 962434 1899584 2849321 Threshold *T*<sup>2</sup> 98.00 96.10 93.20 91.00 89.20 Opt. dimension *d d* = 5 *d* = 6 *d* = 7 *d* = 8 *d* = 8

> Approx. DB size 0.1% 1% 5% 10% 15% Num. of vectors *M* 21021 195034 952664 1900915 2857423 Threshold *T*<sup>1</sup> 51.10 53.35 57.45 60.60 63.10 Opt. dimension *d d* = 12 *d* = 12 *d* = 12 *d* = 12 *d* = 12 Approx. DB size 0.1% 1% 5% 10% 15% Num. of vectors *M* 20697 194742 965972 1941011 2860557 Threshold *T*<sup>2</sup> 98.60 96.40 92.62 89.70 87.50 Opt. dimension *d d* = 11 *d* = 12 *d* = 12 *d* = 12 *d* = 12

One of the output parameters of the partial-data PCA is the optimal dimension *d* determined by Equation (37). It represents the number of principal components, which could be used to transform the input data with retaining 95% of global variance. Note that the threshold values *T*<sup>1</sup> and *T*<sup>2</sup> were determined on experimental basis. The results of the partial-data PCA models are listed in the Table 5 and Table 6 for LMFE and MFCC features, respectively. Note that the

From the Table 5 we can conclude that for LMFE features the selected subsets of size 0.1% and 5% are not suitable to partial-data PCA training. In addition, an improvement in comparison with full-data trained PCA was achieved only for 32–256 mixtures. The maximum absolute

the Table 5 and for MFCC features in the Table 6.

**Table 3.** Parameters used for partial-data PCA models trained from LMFE

**Table 4.** Parameters used for partial-data PCA models trained from MFCC

table contains only the highest accuracies chosen from all models.

improvement +0.43% for 64 mixtures was achieved.

*6.3.2. Partial-data trained PCA*

18 Will-be-set-by-IN-TECH 148 Modern Speech Recognition Approaches with Case Studies Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition <sup>19</sup>

**Table 2.** Accuracy levels (%) for 2DLDA with different number of retained dimensions compared to baseline MFCC model and conventional LDA

## **6.3. PCA-based processing**

In this section, we experimentally evaluate the performance of the full-data trained PCA method by using the whole amount of training data for LMFE and MFCC features. In the next part of this section we present the results of partial-data trained PCA with various parameters. Note that all of the PCA-based models were transformed with PCA matrix with dimension 13 × 13 and the features were then expanded with Δ and ΔΔ coefficients. This resulted in final dimension 39.

#### *6.3.1. Full-data trained PCA*

As it was mentioned, PCA requires allocation of the whole data matrix in the memory. In addition, the covariance matrix is computed from this data matrix, which may be a computationally very difficult operation. In order to compare the partial-data trained models with the full-data trained model it was necessary to do the above mentioned computation. The full-data trained PCA was performed on a Linux machine with 32GB memory. The training data were loaded in the memory sequentially by data blocks and then concatenated to one data matrix (see Equation 32). From this matrix the covariance matrix according to Equation 33 was computed. Then the integral parts of PCA according to Equations 34-36 were performed. In the next step, the acoustic modeling based on the PCA transformed features was done. The evaluation results of the full-data trained PCA for LMFE features are listed in the Table 5 and for MFCC features in the Table 6.

#### *6.3.2. Partial-data trained PCA*

18 Will-be-set-by-IN-TECH

In this section, we experimentally evaluate the performance of the full-data trained PCA method by using the whole amount of training data for LMFE and MFCC features. In the next part of this section we present the results of partial-data trained PCA with various parameters. Note that all of the PCA-based models were transformed with PCA matrix with dimension 13 × 13 and the features were then expanded with Δ and ΔΔ coefficients. This resulted in final


| Number of mixtures | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|
| MFCC model (39-dim.) | 82.32 | 83.26 | 85.06 | 87.77 | 89.53 | 90.83 | 91.48 | 92.37 | 92.50 |
| 13 2DLDA+Δ+ΔΔ (39-dim.) | 82.67 | 84.60 | 87.07 | 88.87 | 90.28 | 91.16 | 91.70 | 92.46 | 92.82 |
| Abs. difference of 2DLDA | +0.35 | +1.34 | +2.01 | +1.10 | +0.75 | +0.33 | +0.22 | +0.09 | +0.32 |
| Abs. difference of LDA | −0.95 | +0.34 | +2.05 | +0.70 | +0.50 | +0.05 | +0.32 | +0.11 | +0.40 |
| Context length *J* of 2DLDA | *J*=3 | *J*=3 | *J*=3 | *J*=3 | *J*=3 | *J*=3 | *J*=3 | *J*=3 | *J*=3 |
| Supermatrix full size | 13×3 | 13×3 | 13×3 | 13×3 | 13×3 | 13×3 | 13×3 | 13×3 | 13×3 |
| Retained matrix (*L* × *R*) | 13×1 | 13×1 | 13×1 | 13×1 | 13×1 | 13×1 | 13×1 | 13×1 | 13×1 |
| Num. of iterations *I* | *I*=3 | *I*=1 | *I*=1 | *I*=1 | *I*=1 | *I*=1 | *I*=1 | *I*=1 | *I*=1 |
| 19 2DLDA+Δ (38-dim.) | 79.35 | 81.90 | 84.03 | 86.42 | 88.31 | 89.63 | 90.77 | 91.41 | 92.19 |
| Abs. difference of 2DLDA | −2.97 | −1.36 | −1.03 | −1.35 | −1.22 | −1.20 | −0.71 | −0.96 | −0.31 |
| Abs. difference of LDA | −0.30 | +0.20 | +0.91 | +0.50 | −0.21 | −0.38 | −0.11 | −0.19 | +0.15 |
| Context length *J* of 2DLDA | *J*=5 | *J*=5 | *J*=5 | *J*=5 | *J*=5 | *J*=5 | *J*=5 | *J*=5 | *J*=5 |
| Supermatrix full size | 13×5 | 13×5 | 13×5 | 13×5 | 13×5 | 13×5 | 13×5 | 13×5 | 13×5 |
| Retained matrix (*L* × *R*) | 10×2 | 10×2 | 10×2 | 7×3 | 7×3 | 10×2 | 7×3 | 7×3 | 10×2 |
| Num. of iterations *I* | *I*=3 | *I*=1 | *I*=1 | *I*=3 | *I*=1 | *I*=1 | *I*=2 | *I*=1 | *I*=1 |
| 39 2DLDA (39-dim.) | 80.13 | 81.51 | 83.62 | 85.78 | 87.62 | 88.91 | 90.15 | 91.00 | 91.66 |
| Abs. difference of 2DLDA | −2.19 | −1.75 | −1.44 | −1.99 | −1.91 | −1.92 | −1.33 | −1.37 | −0.84 |
| Abs. difference of LDA | −2.50 | −1.85 | −1.75 | −2.64 | −2.70 | −2.64 | −2.38 | −2.39 | −1.81 |
| Context length *J* of 2DLDA | *J*=3 | *J*=5 | *J*=3 | *J*=5 | *J*=5 | *J*=5 | *J*=5 | *J*=5 | *J*=7 |
| Supermatrix full size | 13×3 | 13×5 | 13×3 | 13×5 | 13×5 | 13×5 | 13×5 | 13×5 | 13×7 |
| Retained matrix (*L* × *R*) | 13×3 | 13×3 | 13×3 | 13×3 | 13×3 | 13×3 | 13×3 | 13×3 | 10×4 |
| Num. of iterations *I* | *I*=1 | *I*=3 | *I*=1 | *I*=3 | *I*=3 | *I*=1 | *I*=1 | *I*=1 | *I*=1 |
| Max. accuracy of LDA | 82.02 | 83.60 | 87.11 | 88.47 | 90.03 | 90.88 | 91.80 | 92.48 | 92.90 |
| Max. abs. difference | −0.30 | +0.34 | +2.05 | +0.70 | +0.50 | +0.05 | +0.32 | +0.11 | +0.40 |
| Max. accuracy of 2DLDA | **82.67** | **84.60** | 87.07 | **88.87** | **90.28** | **91.16** | 91.70 | 92.46 | 92.82 |
| Max. abs. difference | **+0.35** | **+1.34** | +2.01 | **+1.10** | **+0.75** | **+0.33** | +0.22 | +0.09 | +0.32 |

**Table 2.** Accuracy levels (%) for 2DLDA with different numbers of retained dimensions compared to the baseline MFCC model and conventional LDA
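The iterative 2DLDA used in these experiments operates on matrix-valued features (13 static coefficients × *J* context frames), alternately solving for a left transform *L* and a right transform *R*. A minimal sketch follows; the regularization term, iteration scheme, and random toy data are my assumptions for illustration, not the chapter's exact algorithm.

```python
import numpy as np

def twod_lda(X, labels, l=5, r=1, iters=1):
    """Sketch of iterative 2DLDA on matrix features X of shape (N, D, J)
    (D static coefficients x J context frames). Alternately fixes R to
    solve for the left transform L (D x l) via the generalized
    eigenproblem of the projected between-/within-class scatters, then
    fixes L to solve for R (J x r)."""
    N, D, J = X.shape
    classes = np.unique(labels)
    M = X.mean(axis=0)
    means = {c: X[labels == c].mean(axis=0) for c in classes}
    L, R = np.eye(D)[:, :l], np.eye(J)[:, :r]
    for _ in range(iters):
        for side in ("L", "R"):
            dim = D if side == "L" else J
            Sb, Sw = np.zeros((dim, dim)), np.zeros((dim, dim))
            for c in classes:
                diff = means[c] - M
                P = diff @ R if side == "L" else diff.T @ L
                Sb += np.sum(labels == c) * (P @ P.T)
            for i in range(N):
                diff = X[i] - means[labels[i]]
                P = diff @ R if side == "L" else diff.T @ L
                Sw += P @ P.T
            # small ridge term (assumption) keeps Sw invertible
            vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(dim), Sb))
            order = np.argsort(vals.real)[::-1]
            if side == "L":
                L = vecs[:, order[:l]].real
            else:
                R = vecs[:, order[:r]].real
    return L, R

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 13, 3))          # toy supermatrices, J = 3
y = np.repeat([0, 1, 2], 20)              # toy class labels
L, R = twod_lda(X, y, l=5, r=1)
Y0 = L.T @ X[0] @ R                        # projected feature matrix
```

The retained matrix size *L* × *R* in Table 2 corresponds to the shapes of the two transforms; the projected matrix is flattened into the final feature vector.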


The selective process for the feature vectors according to Fig. 1 was performed and repeated *M* times. Overall, 10 partial-data trained models with LMFE features were learned: 5 models for selection based on threshold *T*<sub>1</sub> and 5 for *T*<sub>2</sub>. An identical scheme applies to MFCC features. The parameters for these models are listed in Table 3 and Table 4. According to Equation 48, 5 subset models (0.1%, 1%, 5%, 10% and 15%) were composed.
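The selection-and-composition step can be sketched roughly as below. This is only an illustrative placeholder: the frame score (log-energy here), the threshold values, and the random subset composition are my assumptions; the chapter's actual selection criterion is defined by Fig. 1 and Equation 48.

```python
import numpy as np

def select_partial_data(frames, threshold, subset_fraction, rng):
    """Keep frames whose (placeholder) score exceeds a threshold, then
    compose a subset of the requested fraction of the corpus. The
    scoring function and thresholds are assumptions for illustration."""
    scores = np.log(np.sum(frames ** 2, axis=1) + 1e-10)   # placeholder score
    candidates = np.flatnonzero(scores > threshold)
    k = max(1, int(subset_fraction * len(frames)))
    chosen = rng.choice(candidates, size=min(k, len(candidates)), replace=False)
    return frames[chosen]

rng = np.random.default_rng(1)
data = rng.normal(size=(10_000, 13))        # stand-in for training frames
# compose a 1% subset, as one of the 0.1%-15% subset models
subset = select_partial_data(data, threshold=1.0, subset_fraction=0.01, rng=rng)
```

Repeating this selection *M* times with different thresholds yields the family of partial-data PCA models compared below.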


**Table 3.** Parameters used for partial-data PCA models trained from LMFE


**Table 4.** Parameters used for partial-data PCA models trained from MFCC

One of the output parameters of partial-data PCA is the optimal dimension *d* determined by Equation (37). It represents the number of principal components that can be used to transform the input data while retaining 95% of the global variance. Note that the threshold values *T*<sub>1</sub> and *T*<sub>2</sub> were determined empirically. The results of the partial-data PCA models are listed in Table 5 and Table 6 for LMFE and MFCC features, respectively. Note that the tables contain only the highest accuracies chosen from all models.
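The optimal dimension *d* is the smallest number of leading principal components whose eigenvalues cover the variance threshold. A small sketch of this criterion (cf. Equation 37), with toy eigenvalues for illustration:

```python
import numpy as np

def optimal_dim(eigvals, var_threshold=0.95):
    """Smallest d such that the d largest eigenvalues retain at least
    var_threshold of the global variance."""
    ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cum = np.cumsum(ev) / ev.sum()
    return int(np.searchsorted(cum, var_threshold) + 1)

# toy spectrum: the first four components cover 95% of the variance
d = optimal_dim(np.array([5.0, 3.0, 1.0, 0.5, 0.5]))
```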

From Table 5 we can conclude that for LMFE features the selected subsets of size 0.1% and 5% are not suitable for partial-data PCA training. In addition, an improvement over full-data trained PCA was achieved only for 32–256 mixtures. The maximum absolute improvement, +0.43%, was achieved for 64 mixtures.

| Mixtures | Acc. of full-data PCA | Acc. of partial-data PCA | Difference | Threshold | Part of DB |
|---|---|---|---|---|---|
| 1 | 82.80% | 82.06% | −0.74% | *T*<sub>2</sub> = 89.2 | 15% |
| 2 | 84.10% | 83.88% | −0.22% | *T*<sub>2</sub> = 89.2 | 15% |
| 4 | 86.01% | 85.93% | −0.08% | *T*<sub>1</sub> = 63.0 | 10% |
| 8 | 88.88% | 88.21% | −0.67% | *T*<sub>2</sub> = 89.2 | 15% |
| 16 | 89.84% | 89.82% | −0.02% | *T*<sub>2</sub> = 91.0 | 10% |
| 32 | 90.31% | 90.72% | **+0.41%** | *T*<sub>2</sub> = 91.0 | 10% |
| 64 | 91.00% | 91.43% | **+0.43%** | *T*<sub>2</sub> = 91.0 | 10% |
| 128 | 91.72% | 91.91% | **+0.19%** | *T*<sub>2</sub> = 96.1 | 1% |
| 256 | 92.30% | 92.60% | **+0.30%** | *T*<sub>2</sub> = 89.2 | 15% |

**Table 5.** Accuracy levels for LMFE-based full-data and partial-data trained PCA

When MFCC features are used as the input for partial-data PCA training, the results are more satisfactory. From Table 6 it can be seen that an improvement was achieved for all mixtures. The maximum absolute improvement is +1.25% for 1 mixture. It should also be mentioned that for MFCC features the proposed method used smaller selected subsets (≈ 1%) than for LMFE features.

| Mixtures | Acc. of full-data PCA | Acc. of partial-data PCA | Difference | Threshold | Part of DB |
|---|---|---|---|---|---|
| 1 | 82.35% | 83.60% | **+1.25%** | *T*<sub>1</sub> = 51.10 | 0.1% |
| 2 | 84.24% | 84.79% | **+0.55%** | *T*<sub>1</sub> = 53.35 | 1% |
| 4 | 85.94% | 86.33% | **+0.39%** | *T*<sub>1</sub> = 53.35 | 1% |
| 8 | 87.83% | 88.08% | **+0.25%** | *T*<sub>2</sub> = 92.62 | 5% |
| 16 | 89.14% | 89.36% | **+0.22%** | *T*<sub>2</sub> = 92.62 | 5% |
| 32 | 90.19% | 90.32% | **+0.13%** | *T*<sub>2</sub> = 89.70 | 10% |
| 64 | 90.90% | 91.27% | **+0.37%** | *T*<sub>1</sub> = 53.35 | 1% |
| 128 | 91.20% | 91.78% | **+0.58%** | *T*<sub>1</sub> = 57.45 | 5% |
| 256 | 91.76% | 92.19% | **+0.43%** | *T*<sub>2</sub> = 96.40 | 1% |

**Table 6.** Accuracy levels for MFCC-based full-data and partial-data trained PCA

## **6.4. PCA-based 2DLDA**

As mentioned in Section 1, one of the topics of this chapter is the interaction of two types of linear transformations in one experiment. More specifically, the aim of this section is to evaluate the interaction of PCA and 2DLDA. In other words, in this experiment we used the PCA-based feature vectors instead of MFCC vectors as the input for 2DLDA. We wanted to demonstrate that PCA features have properties comparable to MFCC features and that 2DLDA trained from PCA features can achieve performance comparable to 2DLDA trained from MFCC features. The PCA training was done in two ways: the classical full-data training and the proposed partial-data training (see Table 7).

From the results of the experiment given in Table 7 we can draw the following conclusions. In 4 of 9 cases the performance of 2DLDA was improved by using PCA features as its input, but in 3 of these 4 cases the improvement was achieved with full-data training.

| Number of mixtures | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|
| 13 2DLDA+Δ+ΔΔ (39-dim.) | 82.67 | 84.60 | 87.07 | 88.87 | 90.28 | 91.16 | 91.70 | 92.46 | 92.82 |
| 13 PCA+2DLDA+Δ+ΔΔ (39-dim.) | 82.26 | 84.50 | 87.17 | 89.03 | 90.44 | 91.10 | 91.95 | 92.43 | 92.69 |
| Part of DB | 5% | 5% | 100% | 100% | 100% | 100% | 10% | 10% | 1% |
| Type of threshold | *T*<sub>1</sub> | *T*<sub>2</sub> | – | – | – | – | *T*<sub>1</sub> | *T*<sub>1</sub> | *T*<sub>1</sub> |
| Abs. difference | −0.41 | −0.10 | **+0.10** | **+0.16** | **+0.16** | −0.06 | **+0.25** | −0.03 | −0.13 |

**Table 7.** Accuracy levels (%) of PCA-based 2DLDA

## **6.5. Global experimental evaluation of all methods**

In this last section we summarize the experimental results presented in the whole chapter. Overall, we presented seven types of experiments evaluating the performance of linear feature transformations applied in feature extraction for Slovak phoneme-based continuous speech recognition. The result of each partial experiment is summarized and compared with the other results in Table 8. A graphical comparison is given in Figure 5.

| Number of mixtures | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|
| Conventional LDA | 82.02 | 83.60 | 87.11 | 88.47 | 90.03 | 90.88 | 91.80 | **92.48** | **92.90** |
| 2DLDA | 82.67 | 84.60 | 87.07 | 88.87 | 90.28 | **91.16** | 91.70 | 92.46 | 92.82 |
| Full-data PCA (LMFE) | 82.80 | 84.10 | 86.01 | 88.88 | 89.84 | 90.31 | 91.00 | 91.72 | 92.30 |
| Full-data PCA (MFCC) | 82.35 | 84.24 | 85.94 | 87.83 | 89.14 | 90.19 | 90.90 | 91.20 | 91.76 |
| Partial-data PCA (LMFE) | 82.06 | 83.88 | 85.93 | 88.21 | 89.82 | 90.72 | 91.43 | 91.91 | 92.60 |
| Partial-data PCA (MFCC) | **83.60** | **84.79** | 86.33 | 88.08 | 89.36 | 90.32 | 91.27 | 91.78 | 92.19 |
| PCA+2DLDA | 82.26 | 84.50 | **87.17** | **89.03** | **90.44** | 91.10 | **91.95** | 92.43 | 92.69 |
| MFCC (reference) | 82.32 | 83.26 | 85.06 | 87.77 | 89.53 | 90.83 | 91.48 | 92.37 | 92.50 |
| Max. of transformed model | **83.60** | **84.79** | **87.17** | **89.03** | **90.44** | **91.16** | **91.95** | **92.48** | **92.90** |
| Abs. improvement | **+1.28** | **+1.53** | **+2.11** | **+1.26** | **+0.91** | **+0.33** | **+0.47** | **+0.11** | **+0.40** |

**Table 8.** Global comparison of partial experiments for all types of linear transformations

**Figure 5.** Graphical global evaluation of all experiments compared to the reference MFCC model: (a) comparison of transformed and MFCC models (accuracy level, %); (b) absolute improvement of transformed models (%)

## **7. Conclusions and discussions**

The global conclusion of the experimental part of this chapter can be divided into the following deductions.

• Principal Component Analysis can improve the performance of the MFCC-based acoustic model. Either LMFE or MFCC features can be used as the input for PCA.

• The proposed partial-data trained PCA achieves better results than full-data trained PCA. Higher improvements can be achieved when MFCC features are used as the input for partial-data PCA.

• Conventional Linear Discriminant Analysis leads to improvements for almost all mixtures, but a problem related to the singularity of the between-class scatter matrix may occur for larger context lengths *J*.

• 2DLDA achieves improvements comparable to LDA (slightly smaller). On the other hand, it is much more stable than LDA and there is no singularity problem, because 2DLDA overcomes it implicitly (the scatter matrices have much smaller dimensions).

• In the last step, we clearly demonstrated that the combination of PCA and 2DLDA (subspace learning) leads to further refinement and improvement compared to the performance of 2DLDA.

## **8. Future research intentions**

Based on the presented knowledge and our research intentions, in the near future we would like to develop an algorithm that eliminates the use of class label information (the class definition) in the LDA-based experiments. In other words, we want to train LDA and similar supervised modifications in an unsupervised way, without using the labeling of the speech corpus.

## **Acknowledgments**

The research presented in this chapter was supported by the Ministry of Education under the research projects VEGA 1/0386/12 and MŠ SR 3928/2010–11 and by the Slovak Research and Development Agency under the research project APVV–0369–07.

## **Author details**

Jozef Juhár and Peter Viszlay

*Technical University of Košice, Slovakia*

## **9. References**

[1] Abbasian, H., Nasersharif, B., Akbari, A., Rahmani, M. & Moin, M. S. [2008]. Optimized linear discriminant analysis for extracting robust speech features, *Proc. of the 3rd Intl. Symposium on Communications, Control and Signal Processing*, St. Julians, pp. 819–824.

[2] Bebis, G. [2003]. *Principal Components Analysis*, Department of Computer Science, University of Nevada, Reno.

[3] Belhumeur, P., Hespanha, J. & Kriegman, D. [1997]. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, *IEEE Pattern Analysis and Machine Intelligence* 19: 711–720.

[4] Beulen, K., Welling, L. & Ney, H. [1995]. Experiments with linear feature extraction in speech recognition, *Proc. of European Conf. on Speech Communication and Technology*, pp. 1415–1418.

[5] Darjaa, S., Cerňak, M., Beňuš, Š., Rusko, M., Sabo, R. & Trnka, M. [2011]. *Rule-based triphone mapping for acoustic modeling in automatic speech recognition*, Vol. 6836 LNAI of *Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)*.

[6] Duchateau, J., Demuynck, K., Compernolle, D. V. & Wambacq, P. [2001]. Class definition in discriminant feature analysis, *Proc. of European Conf. on Speech Communication and Technology, EUROSPEECH'01*, Aalborg, Denmark, pp. 1621–1624.

[7] Geirhofer, S. [2004]. Feature reduction with linear discriminant analysis and its performance on phoneme recognition, *Technical Report ECE272*, Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign.

[8] Haeb-Umbach, R. & Ney, H. [1992]. Linear discriminant analysis for improved large vocabulary continuous speech recognition, *Proc. of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, ICASSP'92*, San Francisco, CA, pp. 13–16.

[9] Jolliffe, I. T. [1986]. *Principal Component Analysis*, Springer-Verlag, New York, USA.

[10] Krzanowski, W. J., Jonathan, P., Mccarthy, W. V. & Thomas, M. R. [1995]. Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data, *Applied Statistics* 44: 101–115.

[11] Kumar, N. [1997]. *Investigation of Silicon Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition*, PhD thesis, Johns Hopkins University, Baltimore, Maryland.

[12] Li, X. B. & O'Shaughnessy, D. [2007]. Clustering-based Two-Dimensional Linear Discriminant Analysis for Speech Recognition, *Proc. of the 8th Annual Conference of the International Speech Communication Association*, pp. 1126–1129.

[13] Pylkkönen, J. [2006]. LDA based feature estimation methods for LVCSR, *Proc. of the 9th Intl. Conf. on Spoken Language Processing, INTERSPEECH'06*, Pittsburgh, PA, USA, pp. 389–392.

[14] Schafföner, M., Katz, M., Krüger, S. E. & Wendemuth, A. [2003]. Improved robustness of automatic speech recognition using a new class definition in linear discriminant analysis, *Proc. of the 8th European Conf. on Speech Communication and Technology, EUROSPEECH'03*, Geneva, Switzerland, pp. 2841–2844.

[15] Song, H. J. & Kim, H. S. [2002]. Improving phone-level discrimination in LDA with subphone-level classes, *Proc. of the 7th Intl. Conf. on Spoken Language Processing, ICSLP'02*, Denver, Colorado, USA, pp. 2625–2628.

[16] Viszlay, P. & Juhár, J. [2011]. Feature selection for partial training of transformation matrix in PCA, *Proc. of the 13th Intl. Conf. on Research in Telecommunication Technologies, RTT'11*, Techov, Brno, Czech Republic, pp. 233–236.

[17] Viszlay, P., Juhár, J. & Pleva, M. [2012]. Alternative phonetic class definition in linear discriminant analysis of speech, *Proc. of the 19th International Conference on Systems, Signals and Image Processing, IWSSIP'12*, Vienna, Austria. Accepted, to be published.

**Chapter 7**

**Dereverberation Based on Spectral Subtraction by Multi-Channel LMS Algorithm for Hands-Free Speech Recognition**

Longbiao Wang, Kyohei Odani, Atsuhiko Kai, Norihide Kitaoka and Seiichi Nakagawa

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/48430

> ©2012 Wang et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **1. Introduction**

In a distant-talking environment, channel distortion drastically degrades speech recognition performance because of a mismatch between the training and testing environments. Current approaches to making automatic speech recognition (ASR) robust to reverberation and noise can be classified as speech signal processing [1, 4, 5, 14], robust feature extraction [10, 20], and model adaptation [3, 25].

In this chapter, we focus on speech signal processing in the distant-talking environment. Because both the speech signal and the reverberation are nonstationary, dereverberation, i.e., obtaining clean speech from the convolution of nonstationary speech signals and impulse responses, is very difficult. Several studies have focused on mitigating this problem [8, 9, 11, 12]. [1] explored a speech dereverberation technique whose principle was the recovery of the envelope modulations of the original (anechoic) speech, applying a technique they originally developed to treat background noise [11] to the dereverberation problem. [7] proposed a novel approach for multimicrophone speech dereverberation based on the construction of the null subspace of the data matrix in the presence of colored noise, employing generalized singular-value decomposition or generalized eigenvalue decomposition of the respective correlation matrices. A reverberation compensation method for speaker recognition using spectral subtraction, in which the late reverberation is treated as additive noise, was proposed by [16, 17]. However, the drawback of this approach is that the optimum parameters for spectral subtraction are estimated empirically from a development dataset, and the late reverberation cannot be subtracted well because it is not modeled precisely. [18] proposed a novel dereverberation method utilizing multi-step forward linear prediction: they estimated the linear prediction coefficients in the time domain and suppressed the amplitude of late reflections through spectral subtraction in the spectral domain.
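The spectral-subtraction view described above, where late reverberation is treated like additive noise, can be sketched generically. This is a textbook-style sketch under my own assumptions: the over-subtraction factor `alpha`, the spectral floor `beta`, and the toy spectra are placeholders, not the parameters used in this chapter.

```python
import numpy as np

def spectral_subtraction(power_spec, late_reverb_power, alpha=1.0, beta=0.01):
    """Power-domain spectral subtraction: remove an estimate of the
    late-reverberation power as if it were additive noise, with a
    spectral floor to avoid negative power. alpha and beta are
    illustrative placeholders."""
    cleaned = power_spec - alpha * late_reverb_power
    return np.maximum(cleaned, beta * power_spec)

X = np.array([1.0, 0.5, 0.2])     # observed power spectrum (toy values)
R = np.array([0.3, 0.6, 0.05])    # estimated late-reverberation power
S = spectral_subtraction(X, R)    # enhanced power spectrum
```

As the chapter notes, the quality of the result hinges on how well the late-reverberation power is estimated; poor estimates leave residual reverberation that the floor cannot mask.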