#### *3.1.4 Artificial neural network (ANN) approach*

The ANN approach is a multipurpose technique that has been used to implement many algorithms [14, 36, 63, 79, 86–105, 110, 125], especially for classification problems [16, 49, 107–111, 119, 120, 131, 132]. Multi-layer ANNs are used in many classification tools since they can represent nonlinear decision boundaries.
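As a sketch of this idea (not any specific model from the cited works), the following trains a tiny two-layer network on the XOR problem, a minimal task that no single-layer (linear) classifier can solve. The architecture, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Illustrative two-layer ANN trained with plain gradient descent on XOR,
# showing that a hidden layer lets the network form a nonlinear decision
# boundary. All hyperparameters here are assumptions for the example.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # hidden layer
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, sigmoid(h @ W2 + b2)

_, p0 = forward(X)
loss_initial = float(np.mean((p0 - y) ** 2))

lr = 0.5
for _ in range(2000):
    h, p = forward(X)
    # Backpropagation of the squared-error loss
    # (constant factors absorbed into the learning rate).
    d_out = (p - y) * p * (1 - p)
    d_hid = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(0)

_, p1 = forward(X)
loss_final = float(np.mean((p1 - y) ** 2))
```

A single-layer network run on the same data cannot drive the error to zero, which is why multi-layer ANNs appear in so many of the cited classifiers.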

#### **3.2 Algorithms in the frequency domain**

#### *3.2.1 The spectrum approaches*

#### *3.2.1.1 Spectral flux mean and variance*

This feature characterizes the change in the shape of the spectrum: it measures the frame-to-frame spectral difference. Audio signals undergo fewer frame-to-frame changes than music, so the spectral flux values of an audio signal are lower than those of music.

The spectral flux, sometimes called the *delta spectrum magnitude*, is defined as the *second norm* of the difference vector of the spectral amplitudes, as in Eq. (19).

$$\mathbf{SF} = \parallel \left| X(k) \right| - \left| X(k+1) \right| \parallel \tag{19}$$


**Table 5.**
*Percentage of misclassified segments [133].*

| Classifier | Training | Testing | Cross-validation |
|------------|----------|---------|------------------|
| GMM        | 8.0%     | 8.1%    | 8.2%             |
| kNN        | X        | 6.0%    | 8.9%             |
| ANN        | 6.7%     | 6.9%    | 11.6%            |


*Classification and Separation of Audio and Music Signals*

*DOI: http://dx.doi.org/10.5772/intechopen.94940*


where *X*(*k*) is the signal power and *k* is the corresponding frequency. Another definition of the SF is given as follows.

$$\text{SF} = \frac{1}{(N-1)(M-1)} \sum\_{n=1}^{N-1} \sum\_{k=1}^{M-1} \left[ \log \left( A(n,k) + \delta \right) - \log \left( A(n-1,k) + \delta \right) \right]^2 \tag{20}$$

where *A*(*n*, *k*) in Eq. (20) is the discrete Fourier transform (DFT) of the *n*th frame of the input signal and can be described as in Eq. (21).

$$A(n,k) = \left|\sum\_{m=-\infty}^{\infty} x(m)w(nL-m)e^{j\frac{2\pi}{L}km}\right|\tag{21}$$

and *x*(*m*) is the original audio data, *L* is the window length, *M* is the order of the DFT, *N* is the total number of frames, *δ* is an arbitrary constant, and *w*(*m*) is the window function.


*Multimedia Information Retrieval*


**Figure 15.**
*The average ZCR of speech, mixture, and music, after pre-processing with the +ve derivative [78].*


**Figure 16.**
*3D histogram normalized features (the mean and the variance of spectral flux) of: (a) music signal, (b) audio signal [133].*



Scheirer and Slaney [65] found that the *SF* feature is very useful in discriminating audio from music. **Figure 16** shows that the variances are lower for music than for audio, and the means are lower for audio than for music. Rossignol and others [133] computed the means and variances of one-second segments using frames of length 18 milliseconds.
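For illustration, Eq. (20) can be sketched as follows; the frame length, hop size, Hann window, and test signals are assumptions for the example and are not taken from [65, 133].

```python
import numpy as np

def spectral_flux(x, frame_len=512, hop=256, delta=1e-6):
    """Spectral flux per Eq. (20): averaged squared frame-to-frame
    difference of log-magnitude DFT spectra (np.mean performs the
    1/((N-1)(M-1)) normalization)."""
    w = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i*hop : i*hop + frame_len] * w
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))          # A(n, k)
    d = np.diff(np.log(mag + delta), axis=0)           # log-spectrum change
    return float(np.mean(d ** 2))

fs = 8000
t = np.arange(fs) / fs
steady = np.sin(2 * np.pi * 440 * t)     # spectrum barely changes per frame
rng = np.random.default_rng(1)
noisy = rng.normal(size=fs)              # spectrum changes a lot per frame

sf_steady = spectral_flux(steady)
sf_noisy = spectral_flux(noisy)
```

The steady tone yields a much smaller flux than the rapidly changing signal, mirroring the audio-versus-music tendency described above.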

Rossignol and others [133] tested three classification approaches to classify the segments: the k-nearest-neighbors (kNN) classifier with *k* = 7, the Gaussian mixture model (GMM), and an ANN. Their results, using the mean and the variance of the SF, are shown in **Table 5**.
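A minimal sketch of the kNN classifier with *k* = 7 follows, run on hypothetical two-dimensional feature vectors (SF mean, SF variance); the cluster positions are synthetic assumptions chosen only to mimic the tendency visible in Figure 16, not data from [133].

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=7):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance), as in the kNN classifier of [133]."""
    d = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(train_y[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(2)
# Hypothetical (SF mean, SF variance) clusters: audio with lower means
# and higher variances, music the opposite.
audio = rng.normal([0.2, 0.8], 0.1, (50, 2))
music = rng.normal([0.8, 0.2], 0.1, (50, 2))
X = np.vstack([audio, music])
y = np.array(["audio"] * 50 + ["music"] * 50)

pred = knn_predict(X, y, np.array([0.25, 0.75]), k=7)
```

A query point near the audio cluster is labeled "audio"; kNN needs no training phase, which is why no training error is reported for it in Table 5.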

#### *3.2.1.2 The mean and variance of the spectral centroid*

In the frequency domain, the mean and variance of the spectral centroid feature describe the center frequency at which most of the power in the signal is found. In audio signals, the pitches are concentrated in a narrow range of low frequencies. In contrast, music signals contain higher frequencies, which result in higher spectral means, i.e., higher spectral centroids. For a frame at time *t*, the spectral centroid can be evaluated as follows.

$$\text{SC} = \frac{\sum\_{k} kX(k)}{\sum\_{k} X(k)} \tag{22}$$

where *X*(*k*) is the power of the signal at the corresponding frequency band *k*. When the mean and the variance of the SF are combined with the mean and the variance of the SC in Eq. (22) and the mean and the variance of the ZCR, the results of **Table 6** are obtained.
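Eq. (22) translates directly into code; the sampling rate and the two test tones below are illustrative assumptions.

```python
import numpy as np

def spectral_centroid(frame):
    """Spectral centroid per Eq. (22): power-weighted mean frequency bin."""
    X = np.abs(np.fft.rfft(frame)) ** 2      # power X(k) at each bin k
    k = np.arange(len(X))
    return float(np.sum(k * X) / np.sum(X))

fs = 8000
t = np.arange(2048) / fs
low = np.sin(2 * np.pi * 200 * t)    # speech-like: energy at low frequency
high = np.sin(2 * np.pi * 2000 * t)  # music-like: energy at high frequency

sc_low = spectral_centroid(low)
sc_high = spectral_centroid(high)
```

The high-frequency tone produces a much larger centroid, matching the audio-versus-music tendency described in the text.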


#### **Table 6.**
*Percentage of misclassified segments [133].*

| Classifier | Training | Testing | Cross-validation |
|------------|----------|---------|------------------|
| GMM        | 7.9%     | 7.3%    | 22.9%            |
| kNN        | X        | 2.2%    | 5.8%             |
| ANN        | 4.7%     | 4.6%    | 9.1%             |

#### *3.2.1.3 Energy at 4 Hz modulation*

Audio signals have an energy peak centered on the 4 Hz syllabic rate. Therefore, a 2nd-order band-pass filter with a center frequency of 4 Hz is used. Although audio signals have higher energy at 4 Hz, some bass music instruments were found to have modulation energy around this frequency [65, 133].
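A rough sketch of the 4 Hz modulation-energy measure follows. Instead of the analog 2nd-order band-pass filter, the frame-energy envelope is computed and its DFT power in a band around 4 Hz is taken; the frame sizes, band edges, and test signals are assumptions for the example, not values from [65, 133].

```python
import numpy as np

def mod_energy_4hz(x, fs, frame_len=256, hop=80, band=(3.0, 5.0)):
    """Fraction of envelope power near the 4 Hz syllabic rate.
    Frame energies form an envelope sampled at fs/hop Hz; the DFT power
    of that envelope inside `band` stands in for the band-pass filter."""
    n = 1 + (len(x) - frame_len) // hop
    env = np.array([np.sum(x[i*hop : i*hop + frame_len] ** 2)
                    for i in range(n)])
    env = env - env.mean()                       # remove the DC component
    E = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(len(env), d=hop / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return float(E[in_band].sum() / E.sum())

fs = 8000
t = np.arange(4 * fs) / fs
# Speech-like: carrier whose energy is modulated at ~4 Hz.
speechy = (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 310 * t)
# Music-like: steady tone with no 4 Hz envelope modulation.
steady = np.sin(2 * np.pi * 310 * t)

m_speech = mod_energy_4hz(speechy, fs)
m_steady = mod_energy_4hz(steady, fs)
```

The syllabically modulated signal concentrates most of its envelope power near 4 Hz, while the steady tone does not.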

#### *3.2.1.4 Roll-off point*

In the frequency domain, the roll-off point feature is the value of the frequency below which 95% of the power of the signal lies. The roll-off point can be found as follows [65, 133].

$$\sum\_{k<V} X(k) = 0.95 \sum\_{k} X(k) \tag{23}$$

where the left-hand side of Eq. (23) is the sum of the power up to the frequency value *V*, the right-hand side of Eq. (23) is 95% of the total power of the signal of the frame, and *X*(*k*) is the DFT of *x*(*t*).
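Eq. (23) can be sketched as a cumulative-sum search; the 95% fraction follows the text, while the sampling rate and test signals are illustrative assumptions.

```python
import numpy as np

def rolloff_bin(frame, fraction=0.95):
    """Roll-off per Eq. (23): the smallest bin V such that the cumulative
    power sum_{k<V} X(k) reaches `fraction` of the total power."""
    X = np.abs(np.fft.rfft(frame)) ** 2
    cum = np.cumsum(X)
    return int(np.searchsorted(cum, fraction * cum[-1]))

fs = 8000
t = np.arange(2048) / fs
low = np.sin(2 * np.pi * 200 * t)                              # dark signal
bright = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)

r_low = rolloff_bin(low)
r_bright = rolloff_bin(bright)
```

Adding the high-frequency component pushes the roll-off bin upward, which is why music (with more high-frequency content) tends to show higher roll-off points than speech.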

#### *3.2.2 Cepstrum*

The cepstrum of a signal is defined as the inverse DFT of the logarithm of the spectrum of the signal. Music signals have higher cepstrum values than speech signals. The complex cepstrum is defined in the following equation [122–124].

$$\hat{X}(e^{j\omega}) = \log\left[X(e^{j\omega})\right] = \log\left|X(e^{j\omega})\right| + j\arg\left[X(e^{j\omega})\right] \tag{24}$$

and then

$$
\hat{x}(n) = \frac{1}{2\pi} \int\_{-\pi}^{\pi} \hat{X}(e^{j\omega})\, e^{j\omega n}\, d\omega \tag{25}
$$


where *X*(*e <sup>j</sup>ω*) is the DFT of the sequence *x*(*n*).
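As a sketch, the real cepstrum, which drops the phase term of Eq. (24) and thereby avoids phase unwrapping, can be computed with two FFTs. The pulse-train test signal below is an illustrative stand-in for a harmonic (voiced) sound.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum.
    (The complex cepstrum of Eqs. (24)-(25) additionally carries the
    unwrapped-phase term; the magnitude-only version is shown here.)"""
    X = np.fft.fft(x)
    return np.real(np.fft.ifft(np.log(np.abs(X) + 1e-12)))

# A periodic pulse train (period 64 samples) stands in for a harmonic
# signal: its cepstrum shows a peak at the 64-sample period (quefrency).
x = np.zeros(1024)
x[::64] = 1.0
c = real_cepstrum(x)
period = int(np.argmax(c[32:96]) + 32)
```

The cepstral peak at the signal period is the property that makes cepstrum-based features useful for separating harmonic speech from other material.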

#### *3.2.3 Summary*

**Table 7** summarizes the percentage error of a simulation performed for each feature. Latency refers to the amount of past input data required to calculate the feature.

Scheirer and Slaney [65] evaluated their models using 20-minute-long data sets of music and audio. Their data set consists of 80 samples, each 15 seconds long. They collected their samples using a 16-bit monophonic FM tuner with a sampling rate of 22.05 kHz, from a variety of stations, with different content styles and different noise levels, over a period of three days in the San Francisco Bay Area. They also claimed to have audio from both male and female speakers.



#### **Table 7.**
*Latency and univariate discrimination performance for each feature [65]. The features include the 4 Hz modulation energy, the low-energy measure, the roll-off point and its variance, the spectral centroid and its variance, the spectral flux and its variance, the zero-crossing rate and its variance, the cepstrum residual and its variance, and the pulse metric; latencies range from one frame to five seconds.*






They also recorded samples of many types of music, such as pop, jazz, salsa, country, classical, reggae, various sorts of rock, and various non-Western styles [29, 65]. They also used several features in a spatial partitioning classifier. **Table 8** summarizes their results.

The features used in Best 8 include the 4 Hz modulation, the variance features, the pulse metric, and the low-energy frame [80, 134]. In Best 3, they used the pulse metric, the 4 Hz energy, and the variance of the spectral flux. In Fast 5, they used the 5 basic features. From the results shown in **Table 8**, we conclude that it is not necessary to use all features in order to obtain good classification, so a real-time system may achieve good performance using only a few features. A more detailed discussion can be found in [29, 65, 80, 134].

#### **3.3 Algorithms in the time-frequency domain**

#### *3.3.1 Spectrogram (or sonogram)*

The spectrogram is an example of a time-frequency distribution, and this method was found to be a good classical tool for analyzing audio signals [13, 19, 86, 127]. The spectrogram (or sonogram) of a signal *x*(*n*) can be defined as follows.

$$X(n,\omega) = \sum\_{m=-N}^{N} W(n+m)x(m)e^{-j\omega m} \tag{26}$$


where *N* is the length of the sequence *x*(*n*), and *W*(*n*) is a specific window.

The spectrogram method can be used to discriminate audio from music signals; however, it may have a high percentage error because it depends on the strength of the frequencies in the tested samples. **Figure 17** depicts two examples of spectrograms of audio and music signals.
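Eq. (26) is a short-time Fourier transform; a minimal sketch follows, with an assumed Hann window, frame length, and a two-tone test signal.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Magnitude spectrogram: |STFT| of x per Eq. (26), using a Hann
    window W and the FFT to evaluate the sum at discrete frequencies."""
    w = np.hanning(frame_len)
    n = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i*hop : i*hop + frame_len] * w for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq bins, time frames)

fs = 8000
t = np.arange(fs) / fs
# Test signal: 500 Hz in the first half-second, 2000 Hz afterwards.
x = np.where(t < 0.5, np.sin(2*np.pi*500*t), np.sin(2*np.pi*2000*t))
S = spectrogram(x)

early = int(np.argmax(S[:, 2]))    # dominant bin in an early frame
late = int(np.argmax(S[:, -3]))    # dominant bin in a late frame
```

The dominant frequency bin jumps upward between the early and late frames, showing how the spectrogram exposes the time-varying spectral content that the classifiers above rely on.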

#### *3.3.2 Evolutionary spectrum (ES)*

The spectral representation of a stationary signal may be viewed as an infinite sum of sinusoids with random amplitudes and phases as described in Eq. (27).

$$e(n) = \int\_{-\pi}^{\pi} e^{j\omega n}\, dZ(\omega) \tag{27}$$

where *Z*(*ω*) is a process with orthogonal increments, i.e.,


$$E\{dZ^{*}(\omega)\, dZ(\Omega)\} = S(\omega)\, \frac{d\omega}{2\pi}\, \delta(\omega - \Omega) \tag{28}$$

**Table 8.**
*Performance for various subsets of features.*

| Subset | All features | Best 8 | Best 3 | VS Flux only | Fast 5 |
|--------|--------------|--------|--------|--------------|--------|
| Audio % Error | 5.8 ± 2.1 | 6.2 ± 2.2 | 6.7 ± 1.9 | 12 ± 2.2 | 33 ± 4.7 |
| Music % Error | 7.8 ± 6.4 | 7.3 ± 6.1 | 4.9 ± 3.7 | 15 ± 6.4 | 21 ± 6.6 |
| Total % Error | 6.8 ± 3.5 | 6.7 ± 3.3 | 5.8 ± 2.1 | 13 ± 3.5 | 27 ± 4.6 |

**Figure 17.** *(a) Audio spectrogram, (b) music spectrogram.*




and *S*(*ω*) in Eq. (28) is the spectrum of *e*(*n*) [81]. Since the audio signal is, in general, nonstationary, we use the Wold-Cramer (WC) representation of a nonstationary signal. WC considers the discrete-time nonstationary process {*x*(*n*)} as the output of a causal, linear, time-variant (LTV) system driven by a white noise input *e*(*n*) with zero mean and unit variance, i.e.,

$$x(n) = \sum\_{m=-\infty}^{n} h(n, m)\, e(n - m) \tag{29}$$

where *h*(*n,m*) is defined as the unit impulse response of an LTV system. Substituting *e*(*n*) into *x*(*n*) of Eq. (29) (assuming *S*(*ω*) = 1 for white noise) we get.

$$x(n) = \int\_{-\pi}^{\pi} H(n, \omega)\, e^{j\omega n}\, dZ(\omega) \tag{30}$$

where *H*(*n*,*ω*) in Eq. (30) is the time-frequency transfer function of the LTV system defined as

$$H(n, \omega) = \sum\_{m = -\infty}^{n} h(n, m) e^{-j\omega \cdot m} \tag{31}$$

and the instantaneous power of *x*(*n*) is given by

$$E\left\{\left|\mathbf{x}(n)\right|^2\right\} = \frac{1}{2\pi} \int\_{-\pi}^{\pi} \left|H(n,\omega)\right|^2 d\omega\tag{32}$$

and then, the Wold-Cramer ES is defined as

$$S(n, \omega) = \frac{1}{2\pi} \left| H(n, \omega) \right|^2 \tag{33}$$

The ES *S*(*n*,*ω*) in Eq. (33) was found to be a good feature for distinguishing audio from music signals [81, 129]. Because of the extensive mathematical computation required for the time-frequency spectrum, it may be most useful in off-line classification and analysis. The ESs of music and audio signals are shown in **Figure 18(a)** and **(b)**, respectively. The suppression of the amplitude for audio might be due to Gaussianity.

**Figure 18.** *(a) The ES of a music signal, (b) the ES of an audio signal [81].*
