#### 2.1. Sparse representation (SR) and K-SVD algorithm

Let $\mathbf{X}$ be a matrix of $M$ training signals, $\mathbf{X} = \{\mathbf{x}_m\}_{m=1}^{M}$ with $\mathbf{x}_m \in \mathbb{R}^N$. The SR dictionary learning framework consists in finding a dictionary $\mathbf{D}$ of $K$ unit-norm atoms, $\mathbf{D} = [\mathbf{d}_1, \dots, \mathbf{d}_K] \in \mathbb{R}^{N \times K}$, and sparse coefficients $\mathbf{C} = \{\mathbf{c}_m\}_{m=1}^{M}$ with $\mathbf{c}_m \in \mathbb{R}^K$ such that the approximation error between $\mathbf{X}$ and $\mathbf{DC}$ is sufficiently small. For example, if the exact sparsity level $T_0$ is known, the problem can be formalized as minimizing the error cost function $f_{SR}(\mathbf{D}, \mathbf{C})$ defined as:

$$f_{SR}(\mathbf{D}, \mathbf{C}) = \|\mathbf{X} - \mathbf{D}\mathbf{C}\|_F^2 \quad \text{s.t.} \quad \forall i,\ \|\mathbf{c}_i\|_0 \le T_0 \tag{2}$$

where $\|\cdot\|_F$ and $\|\cdot\|_0$ denote the Frobenius norm and the $\ell_0$ norm, respectively.

Eq. (2) shows that a signal $\mathbf{x}$ can be expressed as a linear combination of only a few column vectors in $\mathbf{D}$. The matrix factorization problem (2) is difficult, since the joint optimization of $\mathbf{D}$ and $\mathbf{C}$ is nonconvex. Many dictionary learning algorithms therefore follow an iterative scheme that alternates between updates of the dictionary $\mathbf{D}$ and the sparse coding $\mathbf{C}$ to minimize the cost function (2). K-SVD, one such method, belongs to the category of sparse representation (SR), which grew out of the theory of sparse and redundant representation of signals. It was first introduced by Aharon et al. [34]. The K-SVD algorithm starts from an initial overcomplete dictionary matrix $\mathbf{D}^0 \in \mathbb{R}^{N \times K}$ and alternates between two steps, optimizing the coding and the dictionary, as follows:

The sparse coding step derives the columns $\mathbf{c}_m$, $m = 1, \dots, M$, by using the orthogonal matching pursuit (OMP) algorithm with given $\mathbf{X}$ and $\mathbf{D}$ to solve the following problem:

$$\arg\min_{\mathbf{c}_m} \|\mathbf{c}_m\|_0 \quad \text{s.t.} \quad \|\mathbf{x}_m - \mathbf{D}\mathbf{c}_m\|_2 \le \sigma \tag{3}$$

The dictionary update step minimizes the approximation error (2) with the current coding $\mathbf{C}$. The dictionary is updated atom by atom in an iterative process.

$$\text{Because } \|\mathbf{X} - \mathbf{D}\mathbf{C}\|_F^2 = \Bigg\|\mathbf{X} - \sum_{i=1}^{K} \mathbf{d}_i\mathbf{c}^{[i]}\Bigg\|_F^2 = \Bigg\|\Bigg(\mathbf{X} - \sum_{i \neq j} \mathbf{d}_i\mathbf{c}^{[i]}\Bigg) - \mathbf{d}_j\mathbf{c}^{[j]}\Bigg\|_F^2 = \Big\|\mathbf{R}^{(j)} - \mathbf{d}_j\mathbf{c}^{[j]}\Big\|_F^2 \tag{4}$$

where $\mathbf{c}^{[i]}$ is the $i$th row of $\mathbf{C}$. The residual norm is minimized by seeking a rank-one approximation [35]. The approximation is computed via the singular value decomposition (SVD) [23].
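To make the two alternating steps concrete, the following is a minimal NumPy sketch of the K-SVD loop described above. The function names, the plain greedy OMP variant, and all parameter defaults are illustrative assumptions, not taken from the chapter:

```python
import numpy as np

def omp(D, x, T0):
    """Greedy orthogonal matching pursuit: pick at most T0 atoms of D for x (Eq. 3)."""
    residual, support, coef = x.copy(), [], np.zeros(0)
    for _ in range(T0):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    c = np.zeros(D.shape[1])
    c[support] = coef
    return c

def ksvd(X, K, T0, n_iter=20, seed=0):
    """Alternate OMP sparse coding with rank-one atom updates (Eq. 4)."""
    N, M = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((N, K))
    D /= np.linalg.norm(D, axis=0)                # unit-norm atoms
    for _ in range(n_iter):
        C = np.column_stack([omp(D, X[:, m], T0) for m in range(M)])
        for j in range(K):                        # atom-by-atom update
            omega = np.nonzero(C[j])[0]           # signals that use atom j
            if omega.size == 0:
                continue
            # residual R^(j), restricted to the signals that use atom j
            R = X[:, omega] - D @ C[:, omega] + np.outer(D[:, j], C[j, omega])
            U, s, Vt = np.linalg.svd(R, full_matrices=False)
            D[:, j] = U[:, 0]                     # best rank-one fit via SVD
            C[j, omega] = s[0] * Vt[0]
    return D, C
```

Restricting the SVD to the signals that actually use atom $j$ keeps the updated row of $\mathbf{C}$ sparse, which is the detail that distinguishes K-SVD from a naive alternating least-squares update.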

#### 2.2. Nonnegative matrix factorization (NMF) theory

Nonnegative matrix factorization (NMF) can be viewed as an approach to dictionary learning. NMF, first introduced by Paatero and Tapper [36] and later popularized by Lee and Seung [23, 27–37], is known as a parts-based representation model. Unlike other matrix factorization approaches, NMF takes into account the fact that most types of real-world data, particularly sound and video, are nonnegative, and it maintains such nonnegativity constraints in the factorization. Moreover, the nonnegativity constraints in NMF are compatible with the intuitive notion of combining parts to form a whole; that is, they provide a parts-based local representation of the data. A parts-based model not only provides an efficient representation of the data but can potentially aid in discovering the causal structure within it and in learning relationships between the parts.

Given a nonnegative matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_M] \in \mathbb{R}_+^{N \times M}$ and a positive integer $K \ll \min\{N, M\}$, NMF projects $\mathbf{X}$ onto a space spanned by linear combinations of a set of nonnegative basis vectors $\mathbf{D} = \{d_{nk}\}$; that is, $\mathbf{X} \approx \mathbf{DC}$, where $\mathbf{C} = \{c_{km}\}$ with $c_{km} \ge 0$. In order to find an approximate factorization for the matrix $\mathbf{X}$, a cost function that quantifies the quality of the decomposition needs to be defined. Operationally, NMF can be described by the following objective function:

$$\min\_{\mathbf{D}, \mathbf{C} \ge 0} f(\mathbf{X} \| \mathbf{D} \mathbf{C}) \tag{5}$$

where $f$ denotes a distance metric.

Different similarity measures between $\mathbf{X}$ and the product $\mathbf{DC}$ lead to different variants of NMF. Common choices include the Euclidean distance [38], the generalized Kullback-Leibler divergence [39], and the Itakura-Saito divergence [40]. For instance, NMF based on the Kullback-Leibler (KL) divergence is formulated as follows:

$$f_{KL}(\mathbf{X}, \mathbf{DC}) = \sum_{i,j} \left( x_{ij} \log \frac{x_{ij}}{(\mathbf{DC})_{ij}} - x_{ij} + (\mathbf{DC})_{ij} \right) \tag{6}$$

There exist different optimization methods for the approximate factorization (5) [36, 39, 40]. The most popular solution is the alternating multiplicative update rules (MURs) [36], which do not require user-specified optimization parameters. For the KL cost function (6), the iterative update rules are given by:

$$c_{a\mu} \leftarrow c_{a\mu} \frac{\sum_i d_{ia} x_{i\mu} / (\mathbf{DC})_{i\mu}}{\sum_t d_{ta}} \tag{7}$$

$$d_{ia} \leftarrow d_{ia} \frac{\sum_\mu c_{a\mu} x_{i\mu} / (\mathbf{DC})_{i\mu}}{\sum_s c_{as}} \tag{8}$$
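Updates (7) and (8) vectorize directly. Below is a minimal sketch of KL-divergence NMF with these multiplicative rules; the function name, the random initialization, and the small constant guarding against division by zero are assumptions made for illustration:

```python
import numpy as np

def kl_nmf(X, K, n_iter=200, eps=1e-10, seed=0):
    """KL-divergence NMF trained with the multiplicative updates (7)-(8)."""
    N, M = X.shape
    rng = np.random.default_rng(seed)
    D = rng.random((N, K)) + eps
    C = rng.random((K, M)) + eps
    for _ in range(n_iter):
        C *= (D.T @ (X / (D @ C + eps))) / D.sum(axis=0)[:, None]   # Eq. (7)
        D *= ((X / (D @ C + eps)) @ C.T) / C.sum(axis=1)[None, :]   # Eq. (8)
    return D, C
```

Each update multiplies the current factor by a nonnegative ratio, so nonnegativity is preserved automatically; this is precisely why MUR needs no step-size parameter.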

However, it has been found that the monotonicity guaranteed by the proof of the multiplicative updates may not imply the full Karush-Kuhn-Tucker conditions [39, 40]. MUR is relatively simple and easy to implement, but it converges slowly in comparison with gradient approaches [41]. More efficient algorithms with stronger theoretical convergence properties have been introduced. One popular method is to apply gradient descent algorithms with additive update rules, represented by the projected gradient descent (PGD) method [42]. In the PGD framework, the learning step size is selected by a line search with the Armijo rule [42], and the new estimate is obtained by first computing the unconstrained steepest-descent update and then zeroing its negative elements. In addition, exploiting the separate convexity, the two-variable optimization problem is converted into nonnegative least squares (NLS) subproblems, which alternate the minimization over either $\mathbf{D}$ or $\mathbf{C}$ with the other matrix fixed.

Because of the initial condition $K \ll \min\{N, M\}$, the obtained basis vectors are incomplete over the original vector space. In other words, this NMF approach tries to represent a high-dimensional stochastic pattern with far fewer bases, so a good approximation can be achieved only if the intrinsic features are identified in $\mathbf{D}$.

NMF does not have a unique solution under the nonnegativity constraint alone. Hence, to remedy this ill-posedness, it is necessary to introduce additional auxiliary constraints on $\mathbf{D}$ and/or $\mathbf{C}$ as regularization terms, which also incorporate prior knowledge and reflect the characteristics of the problem more comprehensively. The constrained NMF models can be unified under the following extended objective function:

$$\min_{\mathbf{D}, \mathbf{C} \ge 0} f_{\text{constrained NMF}}(\mathbf{X} \| \mathbf{DC}) = \min_{\mathbf{D}, \mathbf{C} \ge 0} \left[ f(\mathbf{X} \| \mathbf{DC}) + \alpha\, g(\mathbf{D}) + \chi\, h(\mathbf{C}) \right] \tag{9}$$

where the regularization parameters $\alpha$ and $\chi$ are used to balance the trade-off between the goodness of fit and the constraints $g(\mathbf{D})$ and $h(\mathbf{C})$.

The performance of NMF can be improved by imposing extra constraints and regularizations. For sparseness learning, the sparsity term $h(\mathbf{C})$ is expected to constrain the number of nonzero elements in each column of the projection matrix. The $L_0$ norm could be selected to count the nonzero elements in $\mathbf{C}$ [43]. One limitation of using the $L_0$ norm is that the solution is not unique because of the many local minima of the cost function. In this situation, the $L_1$ norm of the projection matrix is usually used instead, as a relaxation of the $L_0$ penalty [44, 45]:

$$\|\mathbf{C}\|_1 = \sum_{j=1}^{M} \|\mathbf{c}_j\|_1 = \sum_{j=1}^{M} \left( \sum_{i=1}^{K} |c_{ij}| \right) \tag{10}$$
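Folding the $L_1$ penalty of Eq. (10) into the multiplicative framework is straightforward; one common heuristic simply adds the penalty weight to the denominator of update (7). The sketch below illustrates this for the KL cost; the function name and the default weight are assumptions:

```python
import numpy as np

def sparse_kl_nmf_update_C(X, D, C, lam=0.1, eps=1e-10):
    """One multiplicative update of C for the L1-regularized objective (9)-(10):
    the sparsity weight lam enters the denominator, shrinking small activations."""
    C *= (D.T @ (X / (D @ C + eps))) / (D.sum(axis=0)[:, None] + lam)
    return C
```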

#### 3. Dictionary learning-based speech enhancement

A major outcome of speech enhancement techniques is improved quality and reduced listening effort in the presence of an interfering noise signal. The decomposition of time-frequency representations, such as the power or magnitude spectrogram, in terms of elementary atoms has become a popular tool in speech enhancement, owing to its success in finding high-"quality" dictionary atoms that best describe latent features of the processed data. Dictionary-based techniques utilize specific types of a priori information about speech or noise [21, 23, 46–50]. A priori information can be typical patterns or statistics obtained from a speech or noise database. Dictionary-based speech enhancement consists of two separate stages: a training stage, in which the model parameters are learned, and a denoising stage, in which the noise reduction task is carried out. In the first step, the dictionary $\mathbf{D}$ is learned while fixing the coefficient matrix $\mathbf{C}$, and in the second step, $\mathbf{C}$ is computed with the dictionary matrix $\mathbf{D}$ fixed. This process of alternate minimization is repeated iteratively until a stopping criterion is reached. In order to learn dictionary atoms capable of revealing the hidden structure in speech, a long temporal context of the speech signal must be considered. The two major classes of dictionary-based speech enhancement techniques are offline learning and online learning. Offline algorithms for dictionary learning are second-order iterative batch procedures, accessing the whole training set at each iteration in order to minimize a cost function under some constraints [21–23]. In speech enhancement, learning spectrotemporal atoms spanning several consecutive frames requires training on large volumes of data, which places unrealistic demands on computing power and memory. In large-scale tasks, online dictionary learning tends to attain lower empirical cost than conventional batch learning [46–50].

Speech enhancement herein is implemented in the short-time Fourier transform (STFT) magnitude domain, assuming that the phase of the interferer can be approximated with the phase of the mixture. The number of frequency bins per frame is determined by the length of the time-domain analysis window, where a Hamming window was chosen for the STFT. The temporal smoothness frames are determined by the time-domain analysis window overlap, where a minimum amount of overlap is necessary to avoid aliasing.
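As a concrete illustration of this front end, the sketch below computes the magnitude and phase of an STFT with a Hamming window; the 512-sample window and 75% overlap are assumed values chosen only for the example, not settings given in the chapter:

```python
import numpy as np
from scipy.signal import stft

def magnitude_spectrogram(wave, fs, win_len=512, overlap=0.75):
    """STFT front end: Hamming window, fixed overlap; the magnitude feeds
    the dictionary model, the phase is kept for resynthesis."""
    f, t, Z = stft(wave, fs=fs, window="hamming",
                   nperseg=win_len, noverlap=int(win_len * overlap))
    return np.abs(Z), np.angle(Z)
```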

#### 3.1. Offline dictionary

Sparse representation has been described as an overcomplete model wherein the number of bases is greater than the dimensionality of the spectral representation. In sparse representation, sparse signals can be expressed as linear combinations of only a few atoms in an overcomplete dictionary. Since speech signals are generally sparse in the time-frequency domain while many types of noise are nonsparse, the target speech signal reconstructed from the noisy speech can be regarded as clean speech. A possibly overcomplete dictionary of atoms is trained for both the speech and interferer magnitudes, and the two dictionaries are then concatenated into a composite dictionary. The training process of the updated dictionary is drawn in Figure 1.

When applying the sparse coding technique to speech enhancement, it is desirable for the offline-trained clean speech dictionary $\mathbf{D}_{\text{speech}}$ to be coherent with the speech signal and incoherent with the background noise signal, and likewise for the noise dictionary $\mathbf{D}_{\text{noise}}$ to be coherent with the noise. In the enhancement step, the noisy speech is sparsely coded in the composite dictionary $[\mathbf{D}_{\text{speech}}, \mathbf{D}_{\text{noise}}]$. As a result, the mixture of speech and interferer $\mathbf{x}$ is explained by the sum of a linear combination of atoms from the speech dictionary $\mathbf{D}_{\text{speech}}$ and a linear combination of atoms from the interferer dictionary $\mathbf{D}_{\text{noise}}$. The noisy $\mathbf{x}$ is coded by solving the LASSO problem via least angle regression [51] with a preset threshold $\theta$ as follows:

$$\arg\min_{\mathbf{c}_{\text{speech}},\, \mathbf{c}_{\text{noise}}} \left\| \mathbf{x} - \left[ \mathbf{D}_{\text{speech}}\ \mathbf{D}_{\text{noise}} \right] \begin{bmatrix} \mathbf{c}_{\text{speech}} \\ \mathbf{c}_{\text{noise}} \end{bmatrix} \right\|_2 \quad \text{s.t.} \quad \frac{\|\mathbf{c}\|_1}{\|\mathbf{x}\|_2} \le \theta \tag{11}$$

The clean speech magnitude is estimated by disregarding the contribution from the interferer dictionary, preserving only the linear combination of speech dictionary atoms (analogously for the interferer):

$$\hat{\mathbf{s}} = \mathbf{D}_{\text{speech}} \mathbf{c}_{\text{speech}} \tag{12}$$

Figure 1. The training process of updated dictionary.
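A compact way to prototype this decode-and-discard step is with scikit-learn's Lasso, which solves the penalized rather than the constrained form of (11); the penalty weight alpha below is an assumed stand-in for the threshold $\theta$, and the nonnegativity flag is an extra assumption that is natural for magnitude spectra:

```python
import numpy as np
from sklearn.linear_model import Lasso

def enhance_frame(x, D_speech, D_noise, alpha=0.1):
    """Sparse-code one noisy magnitude frame in the composite dictionary
    [D_speech, D_noise] (Eq. 11) and keep only the speech part (Eq. 12)."""
    D = np.hstack([D_speech, D_noise])
    coder = Lasso(alpha=alpha, positive=True, max_iter=5000)
    c = coder.fit(D, x).coef_
    c_speech = c[:D_speech.shape[1]]
    return D_speech @ c_speech    # estimated clean magnitude, Eq. (12)
```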

It is known that NMF represents data as a linear combination of a set of basis vectors, in which both the combination coefficients and the basis vectors are nonnegative. Although the basis learned by NMF is sparse, NMF differs from sparse coding [26], because NMF learns a low-rank representation of the data, while sparse coding usually learns a full-rank representation. Treating speech enhancement as a source separation problem (speech and noise), NMF-based techniques can be used to factorize spectrograms into nonnegative speech and noise dictionaries and their nonnegative activations. Denote a clean speech spectrogram by $\mathbf{X}_{\text{speech}}$ and a clean noise spectrogram by $\mathbf{X}_{\text{noise}}$. Consider a supervised denoising approach where the clean speech basis matrix $\mathbf{D}_{\text{speech}}$ and the clean noise basis matrix $\mathbf{D}_{\text{noise}}$ are learned separately by performing NMF on the speech and the noise. During the training process, $f(\mathbf{X}_{\text{speech}} \| \mathbf{D}_{\text{speech}}\mathbf{C}_{\text{speech}})$ and $f(\mathbf{X}_{\text{noise}} \| \mathbf{D}_{\text{noise}}\mathbf{C}_{\text{noise}})$ are minimized.

To reduce the noise in the noisy speech, the concatenated dictionary D = [Dspeech, Dnoise] is fixed and utilized in decomposing the noisy speech Xnoisy by

$$\min_{\mathbf{C}_{\text{noisy}} \ge 0} f\left(\mathbf{X}_{\text{noisy}} \,\middle\|\, \mathbf{D}\mathbf{C}_{\text{noisy}}\right) \tag{13}$$

where the time-varying activation matrix is formulated as $\mathbf{C}_{\text{noisy}} = \begin{bmatrix} \mathbf{C}'_{\text{speech}} \\ \mathbf{C}'_{\text{noise}} \end{bmatrix}$.

Discarding the noise coding matrix, the target speech is estimated from the product of speech dictionaries and their activations as

$$\hat{\mathbf{X}}_{\text{speech}} = \mathbf{D}_{\text{speech}} \mathbf{C}'_{\text{speech}} \tag{14}$$
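Under the KL cost, steps (13) and (14) reduce to running update (7) with the composite dictionary held fixed. A minimal sketch, assuming the dictionaries were trained beforehand (e.g., with the kl_nmf function sketched earlier):

```python
import numpy as np

def nmf_denoise(X_noisy, D_speech, D_noise, n_iter=100, eps=1e-10, seed=0):
    """Decompose a noisy magnitude spectrogram on the fixed composite
    dictionary (Eq. 13), then keep only the speech part (Eq. 14)."""
    D = np.hstack([D_speech, D_noise])
    rng = np.random.default_rng(seed)
    C = rng.random((D.shape[1], X_noisy.shape[1])) + eps
    for _ in range(n_iter):
        # multiplicative KL update (7) for C, with D held fixed
        C *= (D.T @ (X_noisy / (D @ C + eps))) / D.sum(axis=0)[:, None]
    C_speech = C[:D_speech.shape[1]]
    return D_speech @ C_speech    # estimated clean magnitude spectrogram
```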

The clean speech waveform is estimated using the noisy phase and the inverse DFT, and the general framework of NMF-based speech enhancement is drawn in Figure 2.

Figure 2. Block diagram of NMF-based speech enhancement.
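The resynthesis step is the mirror image of the STFT front end sketched earlier; the window parameters are the same assumed values:

```python
import numpy as np
from scipy.signal import istft

def reconstruct(mag_est, noisy_phase, fs, win_len=512, overlap=0.75):
    """Combine the estimated clean magnitude with the noisy phase and
    invert the STFT to obtain the time-domain speech estimate."""
    Z = mag_est * np.exp(1j * noisy_phase)
    _, wave = istft(Z, fs=fs, window="hamming",
                    nperseg=win_len, noverlap=int(win_len * overlap))
    return wave
```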

#### 3.2. Online dictionary learning

The aforementioned dictionary learning approaches access the whole training set to determine the bases, which is referred to as an offline training process. These methods were reported to perform well on modeling nonstationary noise types that had been seen during training. For the time-frequency analysis of audio signals, however, the obtained basis may not be adequate to capture the temporal dependency of repeating patterns within the signal, and the success of these methods relies strongly on prior knowledge of the noise, the speech, or both, which limits the applicability of the models. Recently, online dictionary learning methods have been proposed along two lines: the implementation scheme [46–50] and circumventing the mismatch problem between the training and testing stages [24, 52].

One drawback of the multiplicative update procedure in offline dictionary learning is the requirement that all the training signals be read into memory and processed in each iteration.

This high demand on both computing resources and memory is prohibitive in large-scale tasks. To address this problem, online optimization algorithms were developed in an incremental fashion: they process one sample of the training set at a time, based on stochastic approximations, or only a part of the training data at a time, and update the patterns gradually until the whole training corpus has been processed [46–48, 51]. More specifically, given $M$ samples $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_M\} \in \mathbb{R}_+^N$ distributed in the probabilistic space $\wp \in \mathbb{R}_+^N$, the conventional NMF learns a subspace $Q \subset \wp$ spanned by a basis $\{\mathbf{d}_1, \mathbf{d}_2, \dots, \mathbf{d}_K\} \in \mathbb{R}_+^N$ that satisfies the expected cost:

$$\min_{\mathbf{D} \in \mathbb{R}_+^{N \times K}} \sum_{i=1}^{M} f(\mathbf{x}_i \| \mathbf{D}\mathbf{c}_i) \ \text{ with fixed } \mathbf{c}_i \tag{15}$$

$$\text{or} \quad \min_{\mathbf{D} \in \mathbb{R}_+^{N \times K}} E_{\mathbf{x}_i \in \wp}\big(f(\mathbf{x}_i \| \mathbf{D}\mathbf{c}_i)\big) \tag{16}$$

where $E_{\mathbf{x}_i \in \wp}$ denotes the expectation over $\wp$.

The coefficient matrix is computed by

$$\min_{\mathbf{C} \in \mathbb{R}_+^{K \times M}} f(\mathbf{X} \| \mathbf{DC}) \tag{17}$$

For the online NMF framework, at step $t$, on the arrival of sample $\mathbf{x}^{(t)}$, the corresponding coefficient $\mathbf{c}^{(t)}$ is computed by

$$\min_{\mathbf{c}^{(t)} \in \mathbb{R}_+^{K}} f\left(\mathbf{x}^{(t)} \,\middle\|\, \mathbf{D}^{(t-1)} \mathbf{c}^{(t)}\right) \tag{18}$$

where $\mathbf{D}^{(t-1)}$ is the previous basis matrix. The matrix $\mathbf{D}^{(t)}$ is updated by

$$\mathbf{D}^{(t)} = \arg\min_{\mathbf{D} \in \mathbb{R}_+^{N \times K}} E_{\mathbf{x} \in \wp^{(t)}}\big(f(\mathbf{x} \| \mathbf{D}\mathbf{c})\big) \tag{19}$$

where $\wp^{(t)} \subset \wp$ is the probabilistic subspace spanned by the arrived samples $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(t)} \in \mathbb{R}_+^N$, and the corresponding $\mathbf{c}^{(1)}, \mathbf{c}^{(2)}, \dots, \mathbf{c}^{(t)} \in \mathbb{R}_+^K$ are available from the previous $t$ steps.
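A common way to make Eq. (19) tractable is to keep running sufficient statistics of the past frames instead of the frames themselves. The sketch below is a simplified Euclidean-cost variant in the spirit of the online methods of [46–48]; the statistics matrices A and B, the inner multiplicative solver, and all defaults are assumptions for illustration:

```python
import numpy as np

def online_nmf_step(x_t, D, A, B, eps=1e-10, n_inner=50):
    """One online step: solve Eq. (18) for the new frame x_t, then refresh
    D (Eq. 19) from running statistics A = sum(c c^T), B = sum(x c^T)."""
    K = D.shape[1]
    c = np.full(K, 1.0 / K)
    for _ in range(n_inner):                      # multiplicative NNLS for c^(t)
        c *= (D.T @ x_t) / (D.T @ (D @ c) + eps)
    A += np.outer(c, c)                           # accumulate past-frame statistics
    B += np.outer(x_t, c)
    for j in range(K):                            # block-coordinate basis refresh
        d_j = D[:, j] + (B[:, j] - D @ A[:, j]) / (A[j, j] + eps)
        D[:, j] = np.maximum(d_j, eps)
        D[:, j] /= max(np.linalg.norm(D[:, j]), eps)
    return D, c, A, B
```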

In [50], an online noise basis learning scheme is proposed that uses the temporal dependencies of the speech and noise signals to construct an informative prior distribution. In this model, the noise basis matrix is learned from the noisy observation. To update the noise basis, the past noisy DFT magnitude frames are stored in a buffer, and the buffer is then updated, with the speech basis fixed, when a new noisy frame arrives.

Kwon et al. [52] present a speech enhancement technique combining statistical models and NMF with online updates of the speech and noise bases. A cascaded structure is proposed, combining a statistical model-based enhancement (SE) stage (the first stage) [53] with an NMF stage (the second stage) that simultaneously updates the speech and noise bases. In this model, the clean speech output at the current frame is fed back to update the speech and noise bases for the following frame. In other words, at each frame the clean speech estimate is obtained, and the speech and noise bases for the NMF analysis of the following frame are updated. This online basis update makes it possible to deal with speech and noise variations that are not covered by the training noise database and is considered a promising way to cope with the nonstationary nature of the signal. The noisy data $\mathbf{X}'(t)$ used for the online basis update is constructed by concatenating the preenhanced output $\mathbf{X}_{SE}(t)$ of the statistical model-based enhancement with the current frame input $\mathbf{X}(t)$. The dictionary is updated by adding a regularization term to the original objective function as follows:

$$f_{\text{online SE+NMF}}\left(\mathbf{X}'(t) \,\middle\|\, \mathbf{D}'(t)\mathbf{C}'(t)\right) = f\left(\mathbf{X}'(t) \,\middle\|\, \mathbf{D}'(t)\mathbf{C}'(t)\right) + \alpha \left\| \mathbf{D}(t) - \mathbf{D}'(t) \right\|^2 \tag{20}$$

where $\mathbf{D}'(t) = [\mathbf{D}'_{\text{speech}}(t)\ \mathbf{D}'_{\text{noise}}(t)]$ denotes the basis matrix in the NMF decomposition of the concatenated noisy data $\mathbf{X}'(t)$, and $\mathbf{D}(t) = [\mathbf{D}_{\text{speech}}(t)\ \mathbf{D}_{\text{noise}}(t)]$ is the basis matrix used to analyze the $t$-th frame $\mathbf{X}(t)$ in the second stage.
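For a Euclidean choice of $f$, the penalty in Eq. (20) folds neatly into a multiplicative basis update. The following heuristic sketch (the function name and update form are assumptions, not the authors' exact algorithm) shows the idea: the second-stage basis acts as an anchor that the new basis is pulled toward:

```python
import numpy as np

def regularized_basis_update(X_cat, D, C, D_anchor, alpha=0.5, eps=1e-10):
    """One multiplicative basis update for a Euclidean form of Eq. (20):
    fit the concatenated data X' while pulling the basis toward D_anchor
    through the penalty alpha * ||D_anchor - D||^2."""
    numer = X_cat @ C.T + alpha * D_anchor
    denom = D @ (C @ C.T) + alpha * D + eps
    return D * (numer / denom)
```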

#### 4. Summary and discussion

In the experimental simulations, speech and noise materials were selected from the TIMIT corpus [53] (192 sentences), the NOISEX-92 DB (15 types of noise: birds, casino, cicadas, computer keyboard, eating chips, f16, factory1, factory2, frogs, jungle, machineguns, motorcycles, ocean, pink, and volvo) [54], the GRID audiovisual corpus (34 speakers of both genders) [55], and the NOIZEUS speech corpus (30 utterances with clean samples) [1]. The noisy speech examples were synthesized by adding clean speech to different types of noises at various input SNRs.

Speech enhancement algorithms aim to improve both speech quality and speech intelligibility. A high-quality speech signal is perceived as natural, pleasant to listen to, and free of distracting artifacts. An effective technique should suppress noise without introducing too much distortion into the enhanced speech. Measuring speech quality is challenging, as it is subjective; measures can be classified into subjective and objective ones. Speech enhancement performance is commonly evaluated in terms of three criteria: the signal-to-noise ratio (SNR) of the enhanced speech [56], the segmental SNR (segSNR) [56], and the perceptual evaluation of speech quality (PESQ) score [57–59]. Given the true and estimated speech magnitude spectra, the SNR is defined as:

$$\text{SNR} = 10 \times \log \left( \frac{\sum_t \left( \mathbf{X}_{\text{noisy}}(t) - \mathbf{X}_{\text{speech}}(t) \right)^2}{\sum_t \left( \hat{\mathbf{X}}_{\text{speech}}(t) - \mathbf{X}_{\text{speech}}(t) \right)^2} \right) \tag{21}$$

segSNR is a conceptually simple objective measure, computed on individual signal frames, and the per-frame scores are averaged over time.

$$\text{segSNR} = \frac{1}{N} \sum_{b=1}^{N} 10 \times \log \left( \frac{\sum_t \mathbf{X}_{b,\text{speech}}^2(t)}{\sum_t \left( \mathbf{X}_{b,\text{speech}}(t) - \hat{\mathbf{X}}_{b,\text{speech}}(t) \right)^2} \right) \tag{22}$$

where $\mathbf{X}_{b,\text{speech}}(t)$ is the frequency-domain representation of the clean speech signal at frequency band $b$ and time frame $t$, and $\hat{\mathbf{X}}_{b,\text{speech}}(t)$ is the frequency-domain representation of the estimated speech signal. PESQ indicates the quality difference between the enhanced and clean speech signals. PESQ is analogous to the mean opinion score, a subjective evaluation index. The PESQ score ranges from -0.5 to 4.5, and a high score indicates that the enhanced utterance is close to the clean utterance.
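Both measures of Eqs. (21) and (22) are a few lines of NumPy; the sketch below assumes base-10 logarithms (the usual dB convention, not stated explicitly in the chapter) and spectrogram arrays with frequency bands as rows and frames as columns:

```python
import numpy as np

def snr_db(X_noisy, X_speech, X_est, eps=1e-10):
    """SNR of Eq. (21): noise energy before enhancement vs. residual error after."""
    num = np.sum((X_noisy - X_speech) ** 2)
    den = np.sum((X_est - X_speech) ** 2) + eps
    return 10 * np.log10(num / den)

def seg_snr_db(X_speech, X_est, eps=1e-10):
    """Frequency-weighted segmental SNR of Eq. (22), averaged over the N bands."""
    num = np.sum(X_speech ** 2, axis=1)                    # per-band speech energy
    den = np.sum((X_speech - X_est) ** 2, axis=1) + eps    # per-band error energy
    return np.mean(10 * np.log10(num / den + eps))
```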

Contrary to spectral subtraction, the dictionary approach does not assume a stationary interferer; it optimizes the trade-off between source distortion and source confusion and thus shows superiority in objective quality measures such as cepstral distance, in both the speaker-dependent and speaker-independent cases, in real-world environments and under low-SNR conditions. One possible reason is the amount of data available to estimate a noise dictionary. At low SNR levels, the total volume of noise is much higher than at high SNR levels, which offers a better chance to obtain a good noise dictionary or noise model. Under high SNR conditions, however, much of the noise spectrum is buried in the speech spectrum, which can make learning a noise dictionary difficult. The pretrained speech dictionary models outperform state-of-the-art methods such as multiband spectral subtraction and approaches based on vector quantization [21–23]. Offline speech dictionary learning operates in a joint decomposition framework of the noisy speech spectrogram and a primary estimate of the clean speech spectrogram. The online learning approach processes input signals piece by piece, breaking the training data into small pieces and updating the learned patterns gradually using accumulated statistics. With this approach, only a limited segment of the input signal is processed at a time. The online-estimated dictionary is sufficiently rich in its basis subspace to avoid speech distortion. The online approaches tend to give better performance than batch learning [53].

The computing demand of both offline and online learning consists of updating the coefficient matrix $\mathbf{C}$ and the pattern matrix $\mathbf{D}$. The learning task is defined as an optimization problem that aims to minimize an objective cost function $f(\mathbf{D})$ with respect to the pattern matrix $\mathbf{D}$. It is observed that the reconstruction error of both the online and offline methods converges to a similar value after several iterations and is not monotonically decreasing at the beginning. With unlimited data and unlimited computing resources, both batch and online learning converge to a stationary point of the expected cost function $f(\mathbf{D})$; this situation is only valid in theory. For small-scale tasks where data are limited but computing resources are unlimited, batch learning converges to a stationary point of the empirical cost function $f_t(\mathbf{D})$, while online learning fails to converge, resulting in suboptimal patterns. For large-scale tasks, the more common situation is that training data are abundant but computing resources are limited. In this situation, due to its early learning property, online learning tends to obtain a lower empirical cost than batch learning [49]. For sparse coding where the pattern matrix is overcomplete, for example, K > M, online learning is slower than batch learning. Otherwise, online learning is significantly faster than batch alternating learning, by a factor of the large number of spectrograms reconstructed at each iteration [60].

In short, dictionary learning plays an important role in machine learning, where data vectors are modeled as sparse linear combinations of basis factors (i.e., a dictionary). However, how to conduct dictionary learning in noisy environments has not been well studied. In this chapter, we have reviewed speech enhancement techniques based on dictionary learning. Dictionary learning-based algorithms have gained much attention due to their success in finding high-"quality" dictionary atoms (basis vectors) that best describe latent features of the processed data. Two relatively novel paradigms for multivariate data analysis, dimensionality reduction, and sparse representation, NMF and SR, have been in the ascendant since their inception. They enhance learning and data representation through the parts-based and sparse representations arising from the nonnegativity or purely additive constraint. NMF and SR produce high-quality enhancement results when the dictionaries for the different sources are sufficiently distinct. This survey chapter has focused mainly on the theoretical research into dictionary learning-based speech enhancement, systematically summarizing the principles, basic models, properties, and algorithms of SR and NMF.

Author details

Viet-Hang Duong¹, Manh-Quan Bui² and Jia-Ching Wang²,³*

*Address all correspondence to: jiacwang@gmail.com

1 Faculty of Information Technology, BacLieu University, Vietnam

2 Department of Computer Science and Information Engineering, National Central University, Taiwan

3 Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan

References

[1] Loizou PC. Speech Enhancement: Theory and Practice. 1st ed. Boca Raton, FL: CRC Press; 2007

[2] Rabiner LR, Schafer RW. Theory and Application of Digital Speech Processing; 2001

[3] Gold B, Morgan N, Ellis D. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Berkeley, California, USA: Wiley; 2011

[4] Loizou PC. Speech Enhancement: Theory and Practice. Taylor and Francis; 2007

[5] Boll SF. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1979;ASSP-27(2):113-120

[6] Lu Y, Loizou PC. A geometric approach to spectral subtraction. Speech Communication. 2008;50:453-466

[7] Lim JS, Oppenheim AV. Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE. 1979;67(12):1586-1604

[8] Grancharov V, Samuelsson J, Kleijn B. On causal algorithms for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing. 2006;14(3):764-773

[9] Ephraim Y, Malah D. Speech enhancement using a minimum mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1984;32(6):1109-1121

[10] Martin R. Speech enhancement based on minimum mean-square error estimation and super-Gaussian priors. IEEE Transactions on Audio, Speech, and Language Processing. 2005;13(5):845-856

[11] Cohen I. Speech spectral modeling and enhancement based on autoregressive conditional heteroscedasticity models. Signal Processing. 2006;86(4):698-709
