1. Introduction

Speech is the most important tool of expression and a crucial information carrier in language communication. Speech signals in real-world scenarios are corrupted by disturbances such as background noise, reverberation, and babble noise. The purpose of speech enhancement (SE) is to extract the clean speech signal from the mixture of interfering components as much as possible, so as to improve the clarity and intelligibility of the speech signal. Research on speech enhancement technology is therefore both important and difficult. Speech denoising is an important problem with a growing range of applications such as hearing aids, speech/speaker recognition, mobile and telephone communications, and the Internet [1]. The difficulties arise from the nature of real-world noise, which is often unknown, nonstationary, potentially speechlike, and overlapping with the target speech [1–3].


Assume that the noisy speech x is a linear additive mixture of the clean speech s and the interferer n, as defined in the following equation:

$$\mathbf{x}(t) = \mathbf{s}(t) + \mathbf{n}(t) \tag{1}$$


where x(t) is the time-domain mixture signal at sample t, and s(t) and n(t) are the time-domain speech and interferer signals, respectively. A speech enhancement algorithm attempts to suppress the noise without distorting the speech, obtaining an estimate $\hat{\mathbf{s}}$ of the clean speech components from the noisy signal and reconstructing the original clean speech. In other words, speech enhancement algorithms try to reduce the impact of background noise on the speech signal. Most traditional speech enhancers operate in the short-time Fourier transform (STFT) domain with $\mathbf{X} = |\mathrm{STFT}\{x(t)\}|^{\gamma}$, where $\gamma = 1$ gives the magnitude spectrum and $\gamma = 2$ the power spectrum. The inverse STFT is then used to convert the estimated speech back to the time domain, assuming that the phase of the clean speech can be approximated by the phase of the mixture [4].
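
To make this STFT-domain pipeline concrete, the following minimal Python sketch (assuming SciPy's `stft`/`istft`; the `enhance_magnitude` function is a hypothetical placeholder standing in for any magnitude-domain estimator) processes the magnitude spectrogram and resynthesizes with the noisy phase:

```python
# Minimal STFT-domain enhancement skeleton (illustrative sketch, not from the chapter).
import numpy as np
from scipy.signal import stft, istft

def enhance_magnitude(mag):
    # Identity placeholder: a real system would apply spectral subtraction,
    # a Wiener-type gain, or a dictionary-based decomposition here.
    return mag

def enhance(x, fs, nperseg=512, gamma=1.0):
    _, _, X = stft(x, fs=fs, nperseg=nperseg)                 # complex STFT of the mixture
    mag, phase = np.abs(X) ** gamma, np.angle(X)              # |X|^gamma and mixture phase
    mag_hat = enhance_magnitude(mag)                          # magnitude-domain estimator
    S_hat = (mag_hat ** (1.0 / gamma)) * np.exp(1j * phase)   # reuse the noisy phase
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)           # back to the time domain
    return s_hat
```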

Speech enhancement techniques mainly focus on the removal of noise from the speech signal; various types of noise and techniques for removing them are presented in [5–13]. The well-known spectral subtraction technique [5] extracts the clean speech spectrum based on the principle that the noise contamination process is additive. The major advantage of the spectral subtraction method is its simplicity: an estimate of the interferer spectrum is subtracted from the observed mixture spectrum [5, 6]. The main problem with magnitude spectral subtraction is that it may not attenuate the noise sufficiently, while estimation errors in the subtraction can produce negative magnitude values that must be clipped.
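
As a rough illustration (not the exact formulation of [5, 6]), the sketch below subtracts an estimated noise magnitude spectrum with an over-subtraction factor and a spectral floor that clips the negative values mentioned above; the noise estimate `noise_mag_est` is assumed to be available, e.g. averaged over noise-only frames:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag_est, alpha=1.0, floor=0.01):
    """Basic magnitude spectral subtraction (illustrative sketch).

    noisy_mag     : |X|, magnitude spectrogram of the mixture (freq x frames)
    noise_mag_est : estimated noise magnitude spectrum per frequency bin
                    (assumed available, e.g. from noise-only frames)
    alpha         : over-subtraction factor
    floor         : spectral floor used to clip negative results
    """
    diff = noisy_mag - alpha * noise_mag_est[:, None]    # subtraction can go negative
    return np.maximum(diff, floor * noisy_mag)           # flooring avoids negative magnitudes
```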

Filtering techniques [7, 8], short-time spectral amplitude (STSA) estimators [9], and estimators based on super-Gaussian prior distributions for the speech DFT coefficients [10–13] assume statistical models for the speech and noise signals and estimate the clean speech from the noisy observation without any prior information on the noise type or speaker identity. However, when the background noise is nonstationary, these methods have great difficulty estimating the noise power spectral density (PSD) [14–16].
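
A minimal sketch of a Wiener-type gain computed from an estimated noise PSD is given below; the estimate `noise_psd_est` is assumed given, and it is precisely this quantity that becomes unreliable under nonstationary noise:

```python
import numpy as np

def wiener_gain(noisy_power, noise_psd_est, eps=1e-12):
    """Wiener-type spectral gain (illustrative sketch).

    noisy_power   : |X|^2, power spectrogram of the mixture (freq x frames)
    noise_psd_est : estimated noise PSD per frequency bin (assumed given)
    """
    snr_post = noisy_power / (noise_psd_est[:, None] + eps)   # a posteriori SNR
    snr_prio = np.maximum(snr_post - 1.0, 0.0)                # crude a priori SNR estimate
    return snr_prio / (snr_prio + 1.0)                        # Wiener gain in [0, 1)
```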

Recently, dictionary learning (DL) techniques, which build a dictionary consisting of atoms and represent a class of signals in terms of those atoms, have been shown to be effective in machine learning, neuroscience, and audio processing [17–20]. In speech enhancement, dictionary models exploit specific types of a priori information about both the speech and noise signals [21–25]. This class of methods assumes that a target spectrogram can be generated from a set of basis target spectra (a dictionary) through weighted linear combinations. Generally, this approach decomposes the time-frequency representation (the power or magnitude spectrogram) of noisy speech in terms of the elementary atoms of a dictionary. One of the key issues in dictionary-based speech enhancement is how to learn the dictionary precisely. Dictionary learning methods are commonly based on an alternating optimization strategy: the signal representation is fixed and the dictionary elements are learned; then the sparse signal representation is found while the dictionary is fixed. Two popular methods have emerged to determine a dictionary within a matrix decomposition: sparse coding [26] and nonnegative matrix factorization (NMF) [27].

The observation that speech and other structured signals can be well approximated by a few atoms of a suitably trained dictionary [28] lies at the core of sparse representation (SR). In SR, sparse signals can be reconstructed from a few atoms of an overcomplete dictionary. Recently developed SR methods have been shown to be effective in data representation, factorizing a given matrix with regularization methods or a regularization term that constrains the sparsity of the desired representation. Since speech signals are generally sparse in the time-frequency domain and many types of noise are not, the target speech signal can be decomposed and reconstructed from a sparse dictionary driven by the noisy speech [21–23].
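
The following sketch illustrates this idea under simplifying assumptions: a noisy magnitude frame is sparsely coded, here with a small hand-written orthogonal matching pursuit, over a composite dictionary of speech and noise atoms, and only the speech part is kept; the dictionaries `D_speech` and `D_noise` are assumed to have been trained beforehand:

```python
import numpy as np

def omp(D, x, t0):
    """Greedy orthogonal matching pursuit with at most t0 nonzero coefficients."""
    residual, support = x.copy(), []
    c = np.zeros(D.shape[1])
    for _ in range(t0):
        idx = int(np.argmax(np.abs(D.T @ residual)))   # atom most correlated with residual
        if idx not in support:
            support.append(idx)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    c[support] = coef
    return c

def sr_enhance_frame(x_frame, D_speech, D_noise, t0=5):
    """Decompose a noisy magnitude frame over a composite dictionary and keep
    only the speech contribution (illustrative sketch; dictionaries assumed
    to be learned in advance from speech and noise training data)."""
    D = np.concatenate([D_speech, D_noise], axis=1)    # composite dictionary
    c = omp(D, x_frame, t0)
    k = D_speech.shape[1]
    return D_speech @ c[:k]                            # speech reconstruction only
```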

In many real-world applications, such as multispectral data analysis [29, 30], image representation [31, 32], and other important problems [33, 34], the signals and the dictionary are required to be nonnegative, so nonnegative dictionary learning becomes necessary. Nonnegative matrix factorization (NMF) is a popular dictionary learning method that projects a given nonnegative matrix onto the subspace spanned by nonnegative dictionary vectors. Treating speech enhancement as a source separation problem between speech and noise, NMF-based techniques can be used to factorize spectrograms into nonnegative speech and noise dictionaries and their nonnegative activations. A clean speech signal can then be estimated from the product of the speech dictionary and its activations.
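
A minimal supervised-NMF sketch is shown below, assuming pre-trained nonnegative dictionaries `W_speech` and `W_noise`, using the standard Euclidean multiplicative update for the activations, and applying a Wiener-like mask built from the speech and noise reconstructions:

```python
import numpy as np

def nmf_enhance(V, W_speech, W_noise, n_iter=100, eps=1e-12):
    """Supervised NMF enhancement sketch (Euclidean multiplicative updates).

    V        : nonnegative magnitude spectrogram of the noisy speech (freq x frames)
    W_speech : pre-trained nonnegative speech dictionary (freq x K_s)
    W_noise  : pre-trained nonnegative noise dictionary (freq x K_n)
    Returns an estimate of the clean speech magnitude spectrogram.
    """
    W = np.concatenate([W_speech, W_noise], axis=1)    # fixed composite dictionary
    H = np.random.rand(W.shape[1], V.shape[1])         # nonnegative activations
    for _ in range(n_iter):
        # multiplicative update keeps H nonnegative and decreases ||V - WH||_F^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    k = W_speech.shape[1]
    S = W_speech @ H[:k]                               # speech part of the model
    N = W_noise @ H[k:]                                # noise part of the model
    return V * S / (S + N + eps)                       # Wiener-like masking of the mixture
```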

In this chapter, we review dictionary learning approaches for speech enhancement. After a brief introduction to the problem and its characterization as a sound source separation task, we present a survey of both the theoretical and practical aspects of dictionary-based techniques, the main subject of this chapter. We finally provide an overview of evaluation methods and suggest some future lines of work.

2. Background

Dictionary learning performs an approximate matrix factorization of a data matrix into the product of a dictionary matrix and a coding matrix, under sparsity constraints on the coding matrix. Dictionary learning is a generalization of gain-shape codebook learning: signal vectors are represented as linear combinations of multiple dictionary atoms, allowing for lower approximation error while keeping the dictionary size equal. Two rather different methods for forming the dictionary from the given data are described: sparse representation (SR) and nonnegative matrix factorization (NMF).

2.1. Sparse representation (SR) and K-SVD algorithm

Let $\mathbf{X}$ be a matrix of $M$ training signals $\mathbf{X} = \{\mathbf{x}_m\}_{m=1}^{M}$, $\mathbf{x}_m \in \mathbb{R}^{N}$. The SR dictionary learning framework consists in finding a dictionary $\mathbf{D}$ of $K$ unit-norm atoms $\mathbf{D} = [\mathbf{d}^{(1)} \cdots \mathbf{d}^{(K)}] \in \mathbb{R}^{N \times K}$ and sparse coefficients $\mathbf{C} = \{\mathbf{c}_m\}_{m=1}^{M}$, $\mathbf{c}_m \in \mathbb{R}^{K}$, such that the approximation error between $\mathbf{X}$ and $\mathbf{DC}$ is sufficiently small. For example, if the exact sparsity level $T_0$ is known, the problem can be formalized as minimizing the error cost function $O_{\mathrm{SR}}(\mathbf{D}, \mathbf{C})$ defined as:

$$O_{\mathrm{SR}}(\mathbf{D}, \mathbf{C}) = \|\mathbf{X} - \mathbf{D}\mathbf{C}\|_F^2 \quad \text{subject to} \quad \|\mathbf{c}_m\|_0 \le T_0, \; \forall m \tag{2}$$
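
As a rough illustration of the alternating optimization behind this objective (using a simpler MOD-style least-squares dictionary update rather than the atom-wise K-SVD update, and reusing a sparse coder such as the `omp()` helper sketched earlier), one could write:

```python
import numpy as np

def learn_dictionary(X, K, T0, sparse_code, n_iter=30, eps=1e-12):
    """Alternating minimization of the SR objective (illustrative MOD-style sketch,
    not the full K-SVD atom update): sparse-code with the dictionary fixed, then
    update the dictionary with the codes fixed.

    X           : training matrix (N x M), one signal per column
    K, T0       : number of atoms and exact sparsity level per signal
    sparse_code : function (D, x, T0) -> coefficient vector, e.g. the omp()
                  helper from the earlier sketch
    """
    N, M = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((N, K))
    D /= np.linalg.norm(D, axis=0) + eps               # unit-norm atoms
    C = np.zeros((K, M))
    for _ in range(n_iter):
        for m in range(M):                             # sparse coding step (D fixed)
            C[:, m] = sparse_code(D, X[:, m], T0)
        D = X @ np.linalg.pinv(C)                      # dictionary update step (C fixed)
        D /= np.linalg.norm(D, axis=0) + eps           # renormalize the atoms
    return D, C
```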
