2. Background

Assume that the noisy speech x is a linear additive mixture of the clean speech s and the interferer n, as defined in the following equation:

x(t) = s(t) + n(t)    (1)

where x(t) is the time-domain mixture signal at sample t, and s(t) and n(t) are the time-domain speech and interferer signals, respectively. A speech enhancement algorithm attempts to suppress the noise without distorting the speech, obtaining the enhanced speech components ŝ from the noisy signal and reconstructing the original clean speech. In other words, speech enhancement algorithms try to reduce the impact of background noise on the speech signal. Most traditional speech enhancers are implemented in the short-time Fourier transform (STFT) domain on X = |STFT{x(t)}|^γ, where γ = 1 gives the magnitude spectrum and γ = 2 the power spectrum. The inverse Fourier transform is then used to convert the estimated speech back to the time domain, assuming that the phase of the interferer can be approximated by the phase of the mixture [4].

Speech enhancement techniques mainly focus on removing noise from the speech signal; various types of noise and techniques for their removal are presented in [5–13]. The well-known spectral subtraction technique [5] extracts the clean speech spectrum based on the principle that the noise contamination process is additive. The major advantage of spectral subtraction methods is their simplicity: an estimate of the interferer spectrum is subtracted from the observed mixture spectrum [5, 6]. The main problem with magnitude spectral subtraction is that it does not attenuate noise sufficiently, and errors in the subtraction can yield negative magnitude estimates.

Filtering techniques [7, 8], short-time spectral amplitude (STSA) estimators [9], and estimators based on super-Gaussian prior distributions for the speech DFT coefficients [10–13] assume statistical models for the speech and noise signals and estimate the clean speech from the noisy observation without any prior information on the noise type or speaker identity. However, in the case of nonstationary background noise, these methods face much difficulty in estimating the noise power spectral density (PSD) [14–16].

Recently, dictionary learning (DL) techniques, which build a dictionary consisting of atoms and represent a class of signals in terms of those atoms, have been shown to be effective in machine learning, neuroscience, and audio processing [17–20]. In speech enhancement, dictionary models exploit specific types of a priori information about both the speech and noise signals [21–25]. This class of methods assumes that a target spectrogram can be generated from a set of basis target spectra (a dictionary) through weighted linear combinations. Generally, this approach decomposes the time-frequency representation (the power or magnitude spectrogram) of the noisy speech in terms of the elementary atoms of a dictionary. One of the key issues in dictionary-based speech enhancement is how to learn the dictionary precisely. Dictionary learning methods are commonly based on an alternating optimization strategy: the signal representation is fixed while the dictionary elements are learned; then the sparse signal representation is found while the dictionary is fixed. Two popular methods have appeared to determine a dictionary within a matrix decomposition, namely sparse coding [26] and nonnegative matrix factorization (NMF) [27].

72 Active Learning - Beyond the Future
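The spectral subtraction pipeline described above — STFT analysis, subtraction of a noise-magnitude estimate, flooring of the negative values that the subtraction errors produce, and resynthesis with the noisy phase — can be sketched as follows. This is a minimal NumPy sketch, not the implementation of any cited method; the frame length, hop size, and flooring constant are illustrative assumptions.

```python
import numpy as np

def stft(x, frame=256, hop=128):
    """Frame the signal, apply a Hann window, and take the FFT of each frame."""
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame//2 + 1)

def istft(X, frame=256, hop=128):
    """Overlap-add resynthesis (window-sum normalization omitted for brevity)."""
    frames = np.fft.irfft(X, n=frame, axis=1)
    out = np.zeros((X.shape[0] - 1) * hop + frame)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame] += f * np.hanning(frame)
    return out

def spectral_subtract(x, noise_mag, gamma=1.0, floor=0.01):
    """Subtract a (known or pre-estimated) noise magnitude from |X|^gamma,
    floor the negative results, and reuse the noisy phase for resynthesis."""
    X = stft(x)
    mag = np.abs(X) ** gamma                 # gamma=1: magnitude, gamma=2: power
    sub = np.maximum(mag - noise_mag ** gamma, floor * mag)  # floor negatives
    s_hat = sub ** (1.0 / gamma) * np.exp(1j * np.angle(X))  # noisy phase
    return istft(s_hat)
```

The flooring step is exactly the patch for the negative-magnitude problem noted above: without it, subtraction errors would drive some bins below zero.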

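The statistical methods above can be illustrated with the simplest member of the family: a Wiener-type gain computed per time-frequency bin from an estimated noise PSD. The sketch below is a generic illustration, not any cited estimator; the noise tracker (averaging over leading frames assumed speech-free) is a deliberate simplification, and it is precisely this noise-PSD estimate that degrades under nonstationary noise.

```python
import numpy as np

def wiener_gain(noisy_power, noise_psd, eps=1e-10):
    """Wiener gain H = SNR_prio / (1 + SNR_prio), with the a priori SNR crudely
    approximated by max(posterior SNR - 1, 0) (maximum-likelihood rule)."""
    snr_post = noisy_power / (noise_psd + eps)
    snr_prio = np.maximum(snr_post - 1.0, 0.0)
    return snr_prio / (1.0 + snr_prio)

def enhance(noisy_power, n_noise_frames=10):
    """Estimate the noise PSD from the first frames (assumed speech-free) and
    apply the Wiener gain to every frame of the power spectrogram."""
    noise_psd = noisy_power[:n_noise_frames].mean(axis=0)
    gain = wiener_gain(noisy_power, noise_psd)
    return gain * noisy_power
```

If the noise power changes after the leading frames, `noise_psd` is immediately stale — a compact way to see why nonstationary noise defeats this class of methods.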
Dictionary learning performs an approximate factorization of a data matrix into the product of a dictionary matrix and a coding matrix, under sparsity constraints on the coding matrix. Dictionary learning generalizes gain-shape codebook learning: signal vectors are represented as linear combinations of multiple dictionary atoms, which allows a lower approximation error for the same dictionary size. Two relatively different methods are described for forming the dictionary from the given data: sparse representation (SR) and nonnegative matrix factorization (NMF).
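The NMF route above can be sketched concretely: factor a nonnegative magnitude spectrogram V into a dictionary W of spectral atoms and a coding (activation) matrix H. The sketch uses the classical Lee–Seung multiplicative updates for the Euclidean objective; the number of atoms, iteration count, and initialization are illustrative assumptions.

```python
import numpy as np

def nmf(V, n_atoms=8, n_iter=200, eps=1e-9, seed=0):
    """Factor V (freq x time, nonnegative) as W @ H with W, H >= 0, using
    multiplicative updates for ||V - W H||_F^2. Columns of W are the learned
    spectral atoms; rows of H are their per-frame activations."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_atoms)) + eps   # dictionary matrix
    H = rng.random((n_atoms, n_time)) + eps   # coding matrix
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update codes, dictionary fixed
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)  # update dictionary, codes fixed
    return W, H
```

The two interleaved update lines are the alternating strategy described above in its NMF form: each multiplicative step keeps one factor fixed while refining the other, and nonnegativity is preserved automatically.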

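The sparse-representation route can likewise be sketched with the coding half of the alternating scheme: given a fixed dictionary D, find a k-sparse combination of its atoms that approximates a signal vector. The orthogonal matching pursuit below is a standard greedy solver used for illustration only; it is not claimed to be the algorithm of [26].

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal matching pursuit: greedily select k atoms of D (columns,
    assumed unit-norm) and re-fit by least squares on the chosen support."""
    residual = x.copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if j not in support:
            support.append(j)
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coef[support] = sol
    return coef
```

In a full dictionary-learning loop, this sparse-coding step would alternate with a dictionary-update step, exactly as described for the alternating optimization strategy above.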