**The Ratio of High ZCR** (RHZCR)

It was found that the variation of the ZCR is more discriminative than the exact ZCR value, so the RHZCR can be used as a feature [78]. The RHZCR is defined as the ratio of the number of frames whose ZCR is above 1.5 times the average ZCR in a one-second window, and can be defined as follows.

$$\text{RHZCR} = \frac{1}{2N} \sum\_{n=0}^{N-1} \left[ \text{sgn} \left( \text{ZCR}(n) - 1.5\, \text{ZCR}\_{\text{av}} \right) + 1 \right] \tag{7}$$

$$\text{ZCR}\_{\text{av}} = \frac{1}{N} \sum\_{n=0}^{N-1} \text{ZCR}(n) \tag{8}$$

where *N* is the number of frames in one window, *n* is the frame index, sgn[.] is the sign function, and ZCR(*n*) is the zero-crossing rate of the *n*th frame. In general, audio signals consist of alternating voiced and unvoiced sounds at the syllable rate, while music does not have this kind of alternation. Therefore, from Eq. (7) and Eq. (8), we may observe that the variation of the ZCR (and hence the RHZCR) of an audio signal is greater than that of music, as shown in **Figure 13**.
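The RHZCR computation above can be sketched in code. This is a minimal sketch, not the implementation from [78]: `zcr_per_frame` and `rhzcr` are hypothetical helper names, and the 1.5 factor on the average ZCR is exposed as a parameter.

```python
import numpy as np

def zcr_per_frame(signal, frame_len):
    """Zero-crossing rate of each non-overlapping frame
    (fraction of adjacent sample pairs whose sign changes)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    sign_changes = np.abs(np.diff(np.sign(frames), axis=1)) > 0
    return sign_changes.mean(axis=1)

def rhzcr(zcr, threshold=1.5):
    """Eq. (7)-(8): fraction of frames whose ZCR exceeds
    `threshold` times the average ZCR in the window."""
    zcr = np.asarray(zcr, dtype=float)
    zcr_av = zcr.mean()                    # Eq. (8)
    # sgn(x) is +/-1, so (sgn(x) + 1)/2 acts as a 0/1 indicator
    return 0.5 * np.mean(np.sign(zcr - threshold * zcr_av) + 1)  # Eq. (7)
```

For example, with frame-wise ZCR values `[0.1, 0.1, 0.1, 0.9]`, only the last frame exceeds 1.5 times the average, so `rhzcr` returns 0.25.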

### *3.1.2 The STE algorithm*

The amplitude of the audio signal varies appreciably with time. In particular, the amplitude of unvoiced segments is generally much lower than the amplitude of voiced segments. The STE of the audio signal provides a convenient representation

**Figure 13.** *Music and audio sharing some values [65].*

that reflects these amplitude variations. Since the music signal, unlike the audio signal, does not contain unvoiced segments, the STE of the music signal is usually higher than that of audio [60]. The STE of a discrete-time signal *s*(*n*) can be defined as follows.

$$\text{STE}\_{\text{S}} = \sum\_{n = -\infty}^{\infty} \left| s(n) \right|^{2} \tag{9}$$


*Classification and Separation of Audio and Music Signals*

*DOI: http://dx.doi.org/10.5772/intechopen.94940*


where STE*S* in Eq. (9) is the total energy of the signal. The average power of *s*(*n*) is defined as follows.

$$P\_s = \lim\_{N \to \infty} \frac{1}{2N + 1} \sum\_{n=-N}^{N} |s(n)|^2 \tag{10}$$

Signals can be classified into three types, in general: an energy signal, which has a non-zero and finite energy; a power signal, which has a non-zero and finite average power; and a third type that is neither an energy nor a power signal, see **Table 4**. Now, let us define another sequence {*fs*(*n*,*m*)} as follows.

$$f\_s(n,m) = s(n)w(m-n) \tag{11}$$

where *w*(*n*) is a window of length *N* whose value is zero outside [0, *N*-1]. Therefore, *fs*(*n*,*m*) will be zero outside [*m*-*N* + 1, *m*].
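Eq. (11) with a rectangular window can be sketched as follows; `frame_at` is a hypothetical helper name used here for illustration.

```python
import numpy as np

def frame_at(s, m, N):
    """Eq. (11): f_s(n, m) = s(n) w(m - n), with a rectangular window
    of length N. The result is nonzero only for n in [m - N + 1, m],
    i.e. the N samples ending at index m."""
    idx = np.arange(len(s))
    w = np.zeros(len(s))
    w[(m - idx >= 0) & (m - idx <= N - 1)] = 1.0   # w(m - n) = 1 on [0, N-1]
    return s * w

s = np.arange(10, dtype=float)
f = frame_at(s, m=6, N=3)
# only samples at n = 4, 5, 6 survive the windowing
```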

#### **Deriving short term features**

The silence and unvoiced periods in audio can be considered stochastic background noise. Now, let us define *F<sub>s</sub>* as a feature of {*s*(*n*)}, mapping values from the Hilbert space, *H*, to the set of complex numbers, *C*, such that.

$$F\_s: H \to \mathbb{C} \tag{12}$$

The long-term feature of {*s*(*n*)} may be defined as follows.

$$L\{s(n)\} = \lim\_{N \to \infty} \frac{1}{2N + 1} \sum\_{n=-N}^{N} s(n) \tag{13}$$

The long-term average is zero when applied to energy signals; however, it is appropriate for power signals. Eq. (13) can be re-written as follows.

$$L\{s(n)\} = \frac{1}{2N} \sum\_{n=-\infty}^{\infty} s(n) \tag{14}$$
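A quick numerical check of the claim above, truncating the sum in Eq. (13) to a large finite support (a sketch under that assumption; the signal choices are illustrative):

```python
import numpy as np

# Long-term average (Eq. (13)) truncated to |n| <= N.
N = 100_000
n = np.arange(-N, N + 1)

def long_term_average(s):
    return np.sum(s) / (2 * N + 1)

# Energy signal: transient s(n) = 0.5^n u(n); its total sum is 2.
energy_sig = np.zeros_like(n, dtype=float)
energy_sig[n >= 0] = 0.5 ** n[n >= 0]

# Power signal: constant s(n) = 3.
power_sig = np.full(n.shape, 3.0)

print(long_term_average(energy_sig))  # ~1e-5, tends to 0 as N grows
print(long_term_average(power_sig))   # 3.0, independent of N
```

The finite sum of the energy signal is swamped by the 1/(2*N* + 1) factor, while the power signal's average stays fixed, matching the remark that the long-term average is only informative for power signals.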


| Type | Condition | Example | Definition |
|---|---|---|---|
| Energy signal | 0 < *E<sub>s</sub>* < ∞ | Finite sequence | *s*(*n*) = *e<sup>βt</sup>*[*u*(*n*) − *u*(*n* − 255)], \|*β*\| < ∞ |
| | | Transient | *s*(*n*) = *α<sup>n</sup>u*(*n*), \|*α*\| < 1 |
| Power signal | 0 < *P<sub>s</sub>* < ∞ | Constant | *s*(*n*) = *α*, −∞ < *α* < ∞ |
| | | Periodic | *s*(*n*) = *α* sin(*nω<sub>o</sub>* + *φ*), −∞ < *α* < ∞ |
| | | Stochastic | *s*(*n*) = rand(seed) |
| Neither energy nor power | — | Zero | *s*(*n*) = 0 |
| | | Blow up | *s*(*n*) = *α<sup>n</sup>u*(*n*), \|*α*\| > 1 |

**Table 4.** *Types of signals.*
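The energy/power distinction of Table 4 can be checked numerically by truncating Eq. (9) and Eq. (10) to a large finite support. This is a sketch, not from the source; the signal choices mirror the "transient" and "periodic" rows.

```python
import numpy as np

# Approximate Eq. (9) and Eq. (10) on the finite support [-N, N].
N = 10_000
n = np.arange(-N, N + 1)

def energy(s):              # Eq. (9), truncated
    return np.sum(np.abs(s) ** 2)

def avg_power(s):           # Eq. (10), truncated
    return energy(s) / (2 * N + 1)

# Transient (energy signal): s(n) = 0.5^n u(n), |alpha| < 1
transient = np.zeros_like(n, dtype=float)
transient[n >= 0] = 0.5 ** n[n >= 0]

# Periodic (power signal): s(n) = sin(0.1 n)
periodic = np.sin(0.1 * n)

print(energy(transient))    # finite: converges to 1/(1 - 0.25) = 4/3
print(avg_power(periodic))  # ≈ 0.5; its total energy grows without bound
```

The transient's energy converges while its average power goes to zero; the sinusoid's average power settles near 0.5 while its energy grows with *N*, which is why each belongs to a different row of the table.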

This results in a family of mappings. If each member of the family is indexed by a parameter *λ*, then we can use the notation *Fs*(*λ*). The discrete-time Fourier transform is an example of a parametric long-term feature. The long-term feature can be of the form.

$$L\{M(\lambda)\{s(n)\}\} \tag{15}$$

where *M* in Eq. (15) is the mapping sequence. It maps {*s*(*n*)} to another sequence. The long-term feature *Fs*(*λ*) is defined as *L* ∘ *M*, the composition of the functions *L* and *M*. If *Fs*(*λ*) is the long-term feature of Eq. (15), then the short-term feature *Fs*(*λ*,*m*) of time period *m* can be constructed as follows:

• Define a frame as in Eq. (11).

• Apply the long-term feature transformation to the frame sequence, as in Eq. (16).


$$\begin{split} F\_s(\lambda, m) &= L\{M(\lambda)\} \{ \, \, f\_s(n, m) \} \\ &= L\{M(\lambda)\} \{ s(n)w(m - n) \} \\ &= \frac{1}{N} \sum\_{n = -\infty}^{\infty} M(\lambda) \{ s(n)w(m - n) \} \end{split} \tag{16}$$
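The construction of Eq. (16) can be sketched as follows, assuming a rectangular window; `short_term_feature` is a hypothetical name. Choosing *M*(*λ*){*x*} = |*x*|² recovers a normalized short-time energy, tying this recipe back to Eq. (9).

```python
import numpy as np

def short_term_feature(s, m, N, M):
    """Eq. (16): frame the signal as in Eq. (11) (rectangular window of
    length N ending at index m), apply the mapping M, then take the
    normalized sum that plays the role of L."""
    lo = max(m - N + 1, 0)
    frame = s[lo:m + 1]            # nonzero support of f_s(n, m)
    return np.sum(M(frame)) / N

# Example: M{x} = |x|^2 gives a normalized short-time energy.
s = np.ones(100)
ste = short_term_feature(s, m=49, N=10, M=lambda x: np.abs(x) ** 2)
```

Other choices of *M* (e.g. a complex exponential for a short-time Fourier coefficient) fit the same two-step pattern.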

#### **Low Short Time Energy Ratio** (LSTER)

As was done for the ZCR, the variation is selected [33]. Here, the LSTER is used to represent the variation of the STE. The LSTER is defined as the ratio of the number of frames whose STE is less than 0.5 times the average STE in a one-second window, as in Eq. (17).

$$\text{LSTER} = \frac{1}{2N} \sum\_{n=0}^{N-1} \left[ \text{sgn}\left( 0.5\, \text{STE}\_{av} - \text{STE}(n) \right) + 1 \right] \tag{17}$$

where


*Multimedia Information Retrieval*


$$\text{STE}\_{av} = \frac{1}{N} \sum\_{n=0}^{N-1} \text{STE}(n) \tag{18}$$

*N* is the total number of frames, STE(*n*) is the STE at the *n*th frame, and STE*av* in Eq. (18) is the average STE in a one-second window.
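Eqs. (17) and (18) translate directly into code. A minimal sketch with a hypothetical name `lster`, operating on a window's frame-wise STE values; the 0.5 factor is exposed as a parameter.

```python
import numpy as np

def lster(ste, factor=0.5):
    """Eq. (17)-(18): fraction of frames whose STE falls below
    `factor` times the window-average STE."""
    ste = np.asarray(ste, dtype=float)
    ste_av = ste.mean()                                       # Eq. (18)
    # (sgn(x) + 1)/2 turns the sign into a 0/1 indicator
    return 0.5 * np.mean(np.sign(factor * ste_av - ste) + 1)  # Eq. (17)
```

For example, with frame energies `[1, 1, 1, 0.1]`, only the quiet frame falls below half the average, so `lster` returns 0.25; speech, with its low-energy unvoiced frames, yields a higher LSTER than music.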

### *3.1.3 The effect of the positive derivative*

**Figure 14** shows the preprocessing flow on *Z*(*i*) using the positive-derivative concept (+ve), which provided some improvement in the discrimination process [78]. This preprocessing increased the ZCR of music and reduced the ZCR of the audio, at the expense of some delay. The averages of the ZCR in speech, mixture, and music after applying the +ve derivative of order 50 are shown in **Figure 15**.
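The exact operator used in [78] is not spelled out in this section, so the following is only one plausible reading of a "+ve derivative of order *k*": take the *k*-sample backward difference and keep its positive part. Both the function name and this interpretation are assumptions for illustration.

```python
import numpy as np

def positive_derivative(s, k=50):
    """Hypothetical '+ve derivative of order k' preprocessing:
    k-sample backward difference, clipped to its positive part.
    The k-sample difference accounts for the delay mentioned above."""
    d = s[k:] - s[:-k]           # k-sample difference (output is k samples shorter)
    return np.maximum(d, 0.0)    # keep only positive excursions
```

Under this reading, the half-wave rectification suppresses the slow sign alternations typical of voiced speech while preserving the denser oscillations of music, which is consistent with the reported effect on the ZCR.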

**Figure 14.**

*The preprocessing using the +ve derivative before evaluating the ZCR.*

**Figure 15.** *The average ZCR of speech, mixture, and music, after pre-processing with the +ve derivative [78].*
