**Abstract**

This chapter addresses the classification and separation of audio and music signals, an important and challenging research area. Classifying a stream of sounds is important for building two separate libraries: a speech library and a music library. Separation, in turn, is sometimes needed in a cocktail-party problem to separate speech from music and remove the undesired one. In this chapter, some existing algorithms for classification and separation are presented and discussed thoroughly. The classification algorithms are divided into three categories. The first category includes most of the time-domain approaches, the second includes most of the frequency-domain approaches, and the third introduces some of the time-frequency distribution approaches. The time-domain approaches discussed in this chapter are the short-time energy (STE), the zero-crossing rate (ZCR), a modified version of the ZCR and the STE with positive derivative, neural networks, and the roll-off variance. The frequency-domain approaches are the spectral roll-off, the spectral centroid and its variance, the spectral flux and its variance, the cepstral residual, and the delta pitch. The time-frequency domain approaches have not yet been tested thoroughly for the classification and separation of audio and music signals; therefore, the spectrogram and the evolutionary spectrum are introduced and discussed. In addition, some algorithms for the separation and segregation of music and audio signals, such as independent component analysis, pitch cancellation, and artificial neural networks, are introduced.

**Keywords:** audio signal, music signal, classification, separation, time domain, frequency domain, time-frequency domain

## **1. Introduction**

Audio signal processing is an important subfield of signal processing concerned with the electronic manipulation of audio signals [1–6]. The problem of discriminating music from audio has become increasingly important as automatic audio signal recognition (ASR) systems are increasingly applied in real-world multimedia domains [7]. The human ear can easily distinguish audio without being influenced by the mixed music [8–23]. Owing to new methods for the analysis and synthesis of audio signals, the processing of musical signals has gained particular weight [16, 24], and therefore classical sound analysis methods may be used in the processing of musical signals [25–28]. Many types of musical signals exist, such as rock, pop, classical, country, Latin, Arabic, disco, jazz, and electronic music [29]. The hierarchy of sound signal types is shown in **Figure 1** [30].

*Classification and Separation of Audio and Music Signals*

*DOI: http://dx.doi.org/10.5772/intechopen.94940*

## **2. Analysis of audio and music signals**

**2.1 Properties of audio signal**

*2.1.1 Representation of audio signal*

Audio signals change randomly and continuously over time. For example, music and audio signals have strong energy content at low frequencies and weaker energy content at high frequencies [31, 32]. **Figure 2** depicts generalized time and frequency spectra of audio signals [33]. The maximum frequency *fmax* varies with the type of audio signal: it is 4 kHz in telephone transmission, 5 kHz in mono-loudspeaker recording, 6 kHz in multi-loudspeaker (stereo) recording, 11 kHz in FM broadcasting, and 22 kHz in CD recording.

**Figure 2.**

Acoustically speaking, audio signals can be classified into the following classes:

7. Non-music and non-audio signals, such as fan, motor, car, and jet sounds.
8. Audio signals that are a mixture of more than one speaker talking simultaneously [8].
9. Abnormal music, which can be a single-word cadence, a human whistle sound, or opposite reverberation [4, 34–38].

The letter symbols used in writing are not adequate, because their pronunciation varies; for example, the letter "o" in English is pronounced differently in the words "pot", "most", and "one". It is almost impossible to tackle the audio classification problem without first establishing some way of representing spoken utterances by a group of symbols that represent the sounds produced [39–43]. The phonemes in **Table 1** are divided into groups based on the way they are produced [44], forming a set of *allophones* [45]. In some tonal languages, such as Vietnamese and Mandarin, the intonation determines the meaning of each word [46–48].

| Vowels | Diphthongs | Fricatives | Plosives | Semivowels | Nasals | Affricates |
|---|---|---|---|---|---|---|
| h**ee**d | b**ay** | **s**ail | **b**at | **w**as | a**m** | **j**aw |
| h**i**d | b**y** | **sh**ip | **d**isc | **r**an | a**n** | **ch**ore |
| h**ea**d | b**ow** | **f**unnel | **g**oat | **l**ot | sa**ng** | |
| h**a**d | b**ough** | **th**ick | **p**ool | **y**acht | | |
| h**ar**d | b**eer** | **h**ull | **t**ap | | | |
| h**o**d | d**oer** | **z**oo | **k**ite | | | |
| h**oa**rd | b**oar** | a**z**ure | | | | |
| h**oo**d | b**oy** | **th**at | | | | |
| wh**o**'d | b**ear** | **v**alve | | | | |
| h**u**t | | | | | | |
| h**ea**rd | | | | | | |
| th**e** | | | | | | |

**Table 1.**

*Phoneme categories of British English and examples of words in which they are used [44].*

*2.1.2 Production of audio signal*

Since the range of sounds that can be produced by any system is limited [39–44], the pressure in the lungs is increased by the reverse process, which pushes the air up the *trachea*; the larynx is situated at the top of the trachea. Changing the shape of the vocal tract produces different sounds, so the fundamental frequency changes with time. The spectrogram (or sonogram) for the sentence "What can I have for dinner tonight?" is shown in **Figure 3**.



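To illustrate the spectrogram (sonogram) representation mentioned in Section 2.1.2, the following is a minimal sketch of a magnitude spectrogram computed with a Hann-windowed short-time Fourier transform. It is not the chapter's implementation: the function name `spectrogram`, the frame length, the hop size, and the synthetic two-tone test signal are all assumptions chosen for illustration. The sampling rate of 8 kHz is assumed from the telephone-band value *fmax* = 4 kHz given in Section 2.1.1, using the Nyquist criterion (sampling rate of at least 2 *fmax*).

```python
import numpy as np

def spectrogram(x, fs, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform.

    Hypothetical helper for illustration; frame_len and hop are assumed values.
    Returns frame-centre times (s), bin frequencies (Hz), and magnitudes
    with shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs
    return times, freqs, mag

# Synthetic stand-in for a signal whose dominant frequency changes with time:
# 250 Hz for the first half second, then 500 Hz (assumed test values).
fs = 8000                      # assumed: 2 * fmax for telephone-band audio
t = np.arange(fs) / fs         # one second of samples
x = np.where(t < 0.5,
             np.sin(2 * np.pi * 250 * t),
             np.sin(2 * np.pi * 500 * t))

times, freqs, mag = spectrogram(x, fs)

# Strongest bin per frame: a crude dominant-frequency track,
# not a proper pitch (fundamental frequency) estimator.
peak_hz = freqs[np.argmax(mag, axis=1)]
```

Plotting `mag` against `times` and `freqs` gives the familiar time-frequency picture, and `peak_hz` shows the dominant frequency stepping from the first tone to the second. The frame length trades time resolution against frequency resolution: with these assumed values each bin spans 31.25 Hz and each frame covers 32 ms.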