*Classification and Separation of Audio and Music Signals*

*DOI: http://dx.doi.org/10.5772/intechopen.94940*

*Multimedia Information Retrieval*

**Figure 18.** *(a) The ES of a music signal, (b) the ES of an audio signal [81].*

**4. Separation of audio and music signals**

Since the separation of audio and music signals is more complicated than classification, this section introduces only two approaches [7–13, 22, 76, 77, 86, 135]. The first is independent component analysis (ICA) with an ANN. The second is the pitch cancelation approach. A block diagram of a classifier integrated with a separator is depicted in **Figure 19**.

**Figure 19.** *A block diagram of a classifier integrated with a separator.*

#### **4.1 ICA with ANN separation approach**

In [13, 20, 21, 127, 136], Wang and Brown proposed a model for audio segregation. Their model consists of preprocessing using cochlear filtering, gammatone filtering, correlogram forming via the autocorrelation function, and feature extraction. The impulse response of the gammatone filters is represented as

$$h_i(t) = t^{n-1} e^{-2\pi b_i t} \cos\left(2\pi f_i t + \varphi_i\right) U(t)\, g(i), \quad 1 \le i \le N \tag{34}$$

where $n$ is the filter order, $N$ is the number of channels, and $U$ is the unit step function. Therefore, the gammatone system can be considered a causal, time-invariant system with an impulse response of infinite duration. For the $i$th channel, $f_i$ is the center frequency of the channel, $\varphi_i$ is the phase of the channel, $b_i$ is the rate of decay of the impulse response, and $g(i)$ is an equalizing gain adjustment for each filter. **Figure 20** depicts the impulse response of the gammatone system, and **Figure 21** depicts the block diagram of the Wang and Brown model.

**Figure 20.** *4th order impulse response of the gammatone system: (a) in the time domain when* i = *1,* fi = *80 Hz; (b) in the time domain when* i = *5,* fi = *244 Hz; (c) in the frequency domain for the first five filters (i.e.,* i = *1 to* i = *5) with gain* g*(*i*) set to unity.*

**Figure 21.** *A block diagram of the Wang and Brown model.*
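As an illustration, Eq. (34) can be evaluated numerically. The sketch below reads Eq. (34) in the usual gammatone form, a decaying envelope $t^{n-1} e^{-2\pi b t}$ multiplying a cosine carrier, and uses the 4th order, 80 Hz channel mentioned in the caption of Figure 20. The decay rate `B = 34.0` Hz, the sampling rate, and all function names are assumptions made for this example only; the chapter does not specify them.

```python
import math


def gammatone_impulse_response(t, f_c, b, n=4, phase=0.0, gain=1.0):
    """Evaluate t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*f_c*t + phase) * U(t) * gain."""
    if t < 0.0:
        # U(t) = 0 for t < 0, which makes the filter causal.
        return 0.0
    envelope = t ** (n - 1) * math.exp(-2.0 * math.pi * b * t)
    carrier = math.cos(2.0 * math.pi * f_c * t + phase)
    return gain * envelope * carrier


def envelope_peak_time(b, n=4):
    """Time at which the envelope t^(n-1) * exp(-2*pi*b*t) is largest."""
    return (n - 1) / (2.0 * math.pi * b)


# Channel i = 1 of Figure 20: 4th order, center frequency 80 Hz.
# B is an assumed, illustrative decay rate (not given in the text).
F_C, B, ORDER = 80.0, 34.0, 4
SAMPLE_RATE = 8000  # Hz, also an assumption

# Sample the first 250 ms of the impulse response.
response = [gammatone_impulse_response(k / SAMPLE_RATE, F_C, B, ORDER)
            for k in range(SAMPLE_RATE // 4)]

peak_index = max(range(len(response)), key=lambda k: abs(response[k]))
print(f"envelope peaks at {envelope_peak_time(B, ORDER) * 1e3:.1f} ms")
print(f"largest |h| sample at {peak_index / SAMPLE_RATE * 1e3:.1f} ms")
```

The response is zero for negative time and decays but never reaches exactly zero, which matches the description of a causal system with an impulse response of infinite duration. A bank of $N$ channels would repeat the same call with a different center frequency and decay rate per channel.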

The Wang and Brown model has some drawbacks. The first is its complexity: the model needs high-specification hardware to perform the calculations. In [20], Andre reported that the Wang and Brown model needs to be improved. The ICA method can be used for separation if two mixtures of the sources are available, assuming that the signals from the two different sources are statistically independent [66, 74, 75, 121, 137]. In [19], Takigawa tried to improve the performance of the Wang and Brown model by using the short-time Fourier transform (STFT) in the input stage and spectrogram values instead of the correlogram; however, the amount of improvement was not reported. Similar work on separating the voiced audio of two talkers speaking simultaneously at similar intensities in a single channel, using pitch peak canceling in the cepstrum domain, was done by Stubbs [8].
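The ICA condition mentioned above (two mixtures, statistically independent sources) can be illustrated with a small two-channel example. This is only a sketch under stated assumptions, not the ICA-with-ANN method of the cited works: it whitens the two mixtures and then searches for the rotation whose outputs are most non-Gaussian, using squared excess kurtosis as the contrast. The sources, the mixing weights, and all function names are hypothetical.

```python
import math
import random


def center(x):
    m = sum(x) / len(x)
    return [v - m for v in x]


def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian signal)."""
    x = center(x)
    var = sum(v * v for v in x) / len(x)
    fourth = sum(v ** 4 for v in x) / len(x)
    return fourth / (var * var) - 3.0


def rotate(x1, x2, angle):
    c, s = math.cos(angle), math.sin(angle)
    y1 = [c * a + s * b for a, b in zip(x1, x2)]
    y2 = [-s * a + c * b for a, b in zip(x1, x2)]
    return y1, y2


def whiten(x1, x2):
    """Decorrelate two signals and scale each to unit variance."""
    x1, x2 = center(x1), center(x2)
    n = len(x1)
    var1 = sum(a * a for a in x1) / n
    var2 = sum(b * b for b in x2) / n
    cov = sum(a * b for a, b in zip(x1, x2)) / n
    # Rotating by this angle diagonalizes the 2x2 covariance matrix.
    y1, y2 = rotate(x1, x2, 0.5 * math.atan2(2.0 * cov, var1 - var2))
    s1 = math.sqrt(sum(a * a for a in y1) / n)
    s2 = math.sqrt(sum(b * b for b in y2) / n)
    return [a / s1 for a in y1], [b / s2 for b in y2]


def separate_two_mixtures(x1, x2, steps=90):
    """Whiten, then keep the rotation with the most non-Gaussian outputs."""
    z1, z2 = whiten(x1, x2)
    best = None
    for k in range(steps):
        u1, u2 = rotate(z1, z2, k * (math.pi / 2.0) / steps)
        score = excess_kurtosis(u1) ** 2 + excess_kurtosis(u2) ** 2
        if best is None or score > best[0]:
            best = (score, u1, u2)
    return best[1], best[2]


def abs_correlation(a, b):
    a, b = center(a), center(b)
    num = sum(p * q for p, q in zip(a, b))
    return abs(num) / math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))


# Hypothetical sources: a 5 Hz tone and heavy-tailed (Laplacian) noise.
rng = random.Random(0)
N = 2000
tone = [math.sin(2.0 * math.pi * 5.0 * k / N) for k in range(N)]
noise = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(N)]

# Hypothetical mixing: each of the two observations contains both sources.
mix1 = [1.0 * a + 0.6 * b for a, b in zip(tone, noise)]
mix2 = [0.4 * a + 1.0 * b for a, b in zip(tone, noise)]

est1, est2 = separate_two_mixtures(mix1, mix2)
for name, src in (("tone", tone), ("noise", noise)):
    print(name, round(abs_correlation(src, est1), 3), round(abs_correlation(src, est2), 3))
```

As with any ICA method, the sources are recovered only up to ordering, sign, and scale, which is why the check above uses the absolute correlation against both outputs. With a single mixture, as in the single-channel case treated by Stubbs [8], this approach does not apply.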

#### **4.2 The pitch cancelation**

The pitch cancelation method is widely used in noise reduction. A good attempt to separate two talkers speaking simultaneously at similar intensities in a single channel, in other words, separation of two talkers without any restriction, was introduced by Stubbs [8]. For a given speaker, the letters A and R contain a lot of consonant content. In the frequency domain these consonants have low amplitudes; however, they appear as a long pitch peak in the cepstrum domain. If these consonants are deleted by replacing the five cepstral samples centered at the pitch peak by zeros, the audio segment may be attenuated or distorted completely. A typical example of the cepstrum of an audio signal and of a music signal is depicted in **Figure 22** for 5-second signals. The logarithm increases low amplitudes and reduces high ones, and values near zero become very large in magnitude after the logarithm.

**Figure 22.** *(a) A typical 5-second audio signal in the cepstrum domain; the pitch peak appears near zero. (b) A typical 5-second music signal in the cepstrum domain.*

one are more precise. The time-frequency approaches have not been discussed thoroughly in the literature and still need more research and elaboration. Lastly, we may conclude that many classification algorithms have been proposed in the literature; however, few have been proposed for separation. The algorithms introduced in this chapter are summarized in **Table 9**.

**Author details**

Abdullah I. Al-Shoshan

Computer Engineering, Qassim University, Saudi Arabia

\*Address all correspondence to: ashoshan@qu.edu.sa

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
